- Source: CSV datasets from Kaggle loaded into PostgreSQL
- Ingestion: NiFi reads data from PostgreSQL and writes it to S3
- Storage & Modeling: Snowflake reads the S3 data as the batch layer, while dbt manages transformations, modeling, and lineage tracking
- Visualization: Power BI dashboards served through the Snowflake connection
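The NiFi ingestion step above essentially reads rows out of PostgreSQL and lands them in S3 under partitioned object keys. A minimal pure-Python sketch of that partitioning logic (the key layout, the `payment_date` column, and the bucket prefix are illustrative assumptions, not the actual NiFi flow configuration):

```python
from collections import defaultdict

def partition_rows_by_date(rows, date_field):
    """Group rows into S3-style object keys by the date portion of a
    timestamp column — mimicking how an ingestion flow might lay out
    objects as olist/order_payments/dt=YYYY-MM-DD/part-0000.csv.
    The key template and date_field name are hypothetical."""
    buckets = defaultdict(list)
    for row in rows:
        day = row[date_field][:10]  # "YYYY-MM-DD" prefix of the timestamp
        key = f"olist/order_payments/dt={day}/part-0000.csv"
        buckets[key].append(row)
    return buckets
```

In a real flow, each bucket would be serialized and uploaded as one S3 object; Snowflake can then mount the prefix as an external stage and read the partitions directly.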
Source: PostgreSQL CDC events captured by Debezium; transactional activity is simulated by replaying the Olist data into PostgreSQL
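Debezium wraps each row change in an envelope with `before`/`after` images and an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read). A minimal sketch of how a downstream consumer could fold those events into an in-memory table (keying on `order_id` alone is a simplification — the real `order_payments` table also has a `payment_sequential` column):

```python
import json

def apply_cdc_event(table_state, event_json):
    """Apply one Debezium change event to an in-memory dict keyed by
    primary key. Only the payload/op/before/after envelope fields are
    assumed; keying on order_id is a simplification."""
    payload = json.loads(event_json)["payload"]
    op = payload["op"]
    if op in ("c", "u", "r"):          # create / update / snapshot read
        row = payload["after"]
        table_state[row["order_id"]] = row
    elif op == "d":                    # delete: tombstone by old key
        row = payload["before"]
        table_state.pop(row["order_id"], None)
    return table_state
```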
Pipeline:
- Debezium publishes CDC events to Kafka raw topic (olist.order_payments)
- Raw Kafka events optionally persisted to S3 for auditing
- Flink processes raw Kafka topics, performs transformations, and outputs to Kafka transformed topics (olist_payments_aggregated_windowed) and (olist_payments_installments_windowed)
- Transformed data stored in Cassandra for real-time queries
- Grafana dashboards provide real-time monitoring and metrics visualization
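The Flink windowing step in the pipeline above can be illustrated with a pure-Python sketch of a tumbling-window aggregation over payment events. The 60-second window size, the event fields, and the `(count, total_value)` output shape are assumptions for illustration, not the actual job definition:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size

def window_start(ts):
    """Align an epoch-seconds timestamp to the start of its window."""
    return ts - (ts % WINDOW_SECONDS)

def aggregate_payments(events):
    """events: iterable of (epoch_seconds, payment_type, payment_value).
    Returns {(window_start, payment_type): (count, total_value)} —
    roughly the shape a windowed aggregate topic like
    olist_payments_aggregated_windowed might carry."""
    acc = defaultdict(lambda: [0, 0.0])
    for ts, ptype, value in events:
        key = (window_start(ts), ptype)
        acc[key][0] += 1
        acc[key][1] += value
    return {k: tuple(v) for k, v in acc.items()}
```

A real Flink job would do the same grouping with event-time windows and watermarks, emitting one record per (window, payment_type) pair into the transformed Kafka topic.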
dbt manages:
- Bronze (raw) → Silver (staging) → Gold (dims, facts, and marts)
- Lineage tracking
- Fact and dimension models
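The fact/dimension split dbt produces in the Gold layer can be sketched in miniature: raw payment rows become a payment-type dimension with surrogate keys plus a fact table referencing it. Column names here mirror the Olist payments data, but the function and key scheme are illustrative, not the actual dbt models:

```python
def build_dim_and_fact(raw_payments):
    """Toy star-schema build: assign each distinct payment_type a
    surrogate key, then emit fact rows that reference the dimension.
    In dbt this would be two SQL models; this sketch just shows the
    shape of the output tables."""
    dim_keys, fact = {}, []
    for row in raw_payments:
        ptype = row["payment_type"]
        if ptype not in dim_keys:
            dim_keys[ptype] = len(dim_keys) + 1  # surrogate key
        fact.append({
            "order_id": row["order_id"],
            "payment_type_key": dim_keys[ptype],
            "payment_value": row["payment_value"],
        })
    dim = [{"payment_type_key": k, "payment_type": t}
           for t, k in dim_keys.items()]
    return dim, fact
```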
| Layer | Tool/Technology |
|---|---|
| Data Ingestion | NiFi, Debezium |
| Messaging & Streaming | Kafka, Flink |
| Storage | PostgreSQL, S3, Snowflake |
| Data Modeling | dbt |
| Real-time Storage | Cassandra |
| Visualization | Power BI (BI dashboards), Grafana (real-time metrics) |
| Containerization | Docker |