Project Overview
This project develops the data architecture and processing pipelines for BADS Bike Shop, a fictional business that sells and rents bicycles. The dataset combines transactional data from Kaggle, customer data from Mockaroo, and simulated GPS and battery data for rental bikes. Two pipelines form the core of the solution: a batch pipeline for analytical insights and a stream pipeline for real-time operational monitoring, both implemented on Google Cloud using BigQuery and Dataproc. Their outputs feed two dashboards: the BI & KYC Dashboard for customer demographics and the Operations Dashboard for real-time bike tracking.
```
├───batch
│   ├───cleaned
│   ├───data
│   └───integration
├───pipelines
└───stream
    ├───data
    ├───kafka
    ├───notebooks
    └───producer
```
- Spark Notebooks: A collection of Jupyter notebooks containing the Spark programs for data processing (cleaning and integration).
- Data: The datasets used by the batch pipeline, included for completeness.
- Pipelines: A CI/CD pipeline that automates the conversion of Jupyter notebooks (.ipynb) into Python scripts (.py) and then uploads them to a cloud repository.
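The conversion step can be reproduced locally; a minimal sketch, assuming the standard `jupyter nbconvert` CLI is installed (the output folder name and source directory here are illustrative, and the upload step is left to the CI job):

```python
import subprocess
from pathlib import Path

def build_convert_command(notebook: Path) -> list[str]:
    """Build the nbconvert invocation that turns a .ipynb into a .py script."""
    return [
        "jupyter", "nbconvert",
        "--to", "script",          # emit a plain Python script
        "--output-dir", "build",   # hypothetical output folder
        str(notebook),
    ]

def convert_all(notebook_dir: Path) -> None:
    """Convert every notebook under notebook_dir; upload is handled by the CI job."""
    for nb in sorted(notebook_dir.glob("*.ipynb")):
        subprocess.run(build_convert_command(nb), check=True)

if __name__ == "__main__":
    convert_all(Path("batch"))  # hypothetical source directory
```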
- Data: Datasets that simulate streaming input, included for completeness and testing.
- Kafka: A docker-compose file to set up the Kafka consumer environment.
- Notebooks: A Spark program that processes data arriving from the Kafka stream.
- Producer: A Python program that simulates GPS stream data, acting as a stream producer that can run from a laptop.
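The producer side can be sketched with the standard library alone. The field names (`bike_id`, `lat`, `lon`, `battery_pct`) and the base coordinates below are illustrative assumptions, not taken from the project data, and the actual publish call to Kafka (e.g. via a Kafka client library) is reduced to a comment:

```python
import json
import random
import time

# Illustrative base coordinates; not from the project data.
BASE_LAT, BASE_LON = 51.4416, 5.4697

def make_reading(bike_id: int) -> dict:
    """Simulate one GPS/battery reading for a rental bike."""
    return {
        "bike_id": bike_id,
        "lat": BASE_LAT + random.uniform(-0.05, 0.05),
        "lon": BASE_LON + random.uniform(-0.05, 0.05),
        "battery_pct": round(random.uniform(5, 100), 1),
        "ts": time.time(),
    }

def stream_readings(n_bikes: int, n_rounds: int):
    """Yield JSON-encoded readings; a real producer would publish each message."""
    for _ in range(n_rounds):
        for bike_id in range(n_bikes):
            yield json.dumps(make_reading(bike_id))
            # producer.send("bike-telemetry", value=...)  # with a real Kafka client

if __name__ == "__main__":
    for msg in stream_readings(n_bikes=3, n_rounds=1):
        print(msg)
```

Each yielded message is a self-contained JSON document, which keeps the downstream Spark job free to parse records independently of producer batching.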
- Andy Huang
- Huub van de Voort
- Oumaima Lemhour
- Roman Nekrasov
- Tom Teurlings