
This project analyzes the Brazilian E-Commerce dataset from Olist, covering 100,000+ orders from 2016-2018. It includes batch and streaming data pipelines with PostgreSQL, NiFi, Snowflake, dbt, Debezium, Kafka, Flink, Cassandra, Power BI, and Grafana to deliver end-to-end analytics on sales, payments, and customer behavior.


databricks-community/databricks-brazilian-ecommerce-olist

 
 



⚡ Architecture Overview

1. Batch Layer

  • Source: CSV datasets from Kaggle, loaded into PostgreSQL
  • Ingestion: NiFi reads the data from PostgreSQL and writes it to S3
  • Storage & Modeling: Snowflake points to the batch-layer data in S3, and dbt manages transformations, modeling, and lineage tracking
  • Visualization: Power BI dashboards over a Snowflake connection

2. Streaming Layer

  • Source: PostgreSQL change data capture (CDC) with Debezium; transactional activity is simulated by writing records into PostgreSQL

  • Pipeline:

    1. Debezium publishes CDC events to the raw Kafka topic (olist.order_payments)
    2. Raw Kafka events are optionally persisted to S3 for auditing
    3. Flink consumes the raw Kafka topic, applies transformations, and writes the results to the transformed Kafka topics olist_payments_aggregated_windowed and olist_payments_installments_windowed
    4. Transformed data is stored in Cassandra for real-time queries
    5. Grafana dashboards provide real-time monitoring and metrics visualization
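
To make step 3 more concrete, here is a minimal sketch in plain Python of the kind of windowed aggregation the Flink job might perform on olist.order_payments. It is not the project's Flink code. It assumes the standard Debezium change-event envelope (`op`, `ts_ms`, `before`, `after`), a one-minute tumbling window, and that `payment_value` arrives as a number or string (which depends on the connector's decimal handling). The helper names `parse_payment_event`, `aggregate_by_window`, and `make_event` are hypothetical.

```python
import json
from collections import defaultdict

WINDOW_MS = 60_000  # assumption: 1-minute tumbling windows


def parse_payment_event(raw):
    """Return (ts_ms, row) for inserts, updates and snapshot reads, else None."""
    envelope = json.loads(raw)
    if envelope is None:  # Kafka tombstone record that follows a delete
        return None
    # With schemas enabled the event is wrapped in "payload"; otherwise it is bare.
    payload = envelope.get("payload", envelope)
    if payload.get("op") not in ("c", "u", "r"):
        return None  # deletes carry no "after" image
    return payload["ts_ms"], payload["after"]


def aggregate_by_window(raw_events):
    """Count payments and sum their value per (window start, payment type)."""
    totals = defaultdict(lambda: {"count": 0, "total_value": 0.0})
    for raw in raw_events:
        parsed = parse_payment_event(raw)
        if parsed is None:
            continue
        ts_ms, row = parsed
        window_start = ts_ms - ts_ms % WINDOW_MS
        bucket = totals[(window_start, row["payment_type"])]
        bucket["count"] += 1
        bucket["total_value"] += float(row["payment_value"])
    return dict(totals)


def make_event(op, ts_ms, payment_type, payment_value):
    """Build a hypothetical Debezium-style event for the demo below."""
    row = {"order_id": "o1", "payment_type": payment_type,
           "payment_value": payment_value}
    payload = {"op": op, "ts_ms": ts_ms,
               "before": row if op == "d" else None,
               "after": None if op == "d" else row}
    return json.dumps({"payload": payload})


sample_events = [
    make_event("c", 1_000, "credit_card", 10.0),
    make_event("c", 30_000, "credit_card", 5.5),
    make_event("c", 61_000, "boleto", 20.0),
    make_event("d", 62_000, "boleto", 20.0),  # skipped by the parser
    "null",                                   # tombstone, also skipped
]
result = aggregate_by_window(sample_events)
```

Each key in `result` is a `(window_start_ms, payment_type)` pair. This sketch counts an update as another payment; a real job would more likely key on the order and payment sequence to avoid double counting.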

🏗️ Project Architecture

Project


📊 Data Modeling & Lineage

  • dbt manages:

    • Bronze (raw) → Silver (staging) → Gold (dimensions, facts, and marts)
    • Lineage tracking
    • Fact and dimension models
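
The layering above can be illustrated outside dbt with a small Python sketch. It only mirrors the idea (raw rows, then typed staging rows, then a dimension plus a fact table); the project's actual models are dbt SQL and may differ. The column names follow the Olist payments CSV, while the sample rows, the surrogate-key scheme, and the function names `to_silver` and `to_gold` are assumptions made for illustration.

```python
from decimal import Decimal

# Bronze: rows exactly as loaded from the CSV, every value still a string.
bronze_payments = [
    {"order_id": "o1", "payment_sequential": "1", "payment_type": " credit_card",
     "payment_installments": "3", "payment_value": "99.90"},
    {"order_id": "o2", "payment_sequential": "1", "payment_type": "boleto",
     "payment_installments": "1", "payment_value": "45.00"},
    {"order_id": "o3", "payment_sequential": "1", "payment_type": "Credit_Card",
     "payment_installments": "1", "payment_value": "20.10"},
]


def to_silver(rows):
    """Staging: trim and normalise text, cast numeric columns."""
    return [
        {
            "order_id": r["order_id"].strip(),
            "payment_sequential": int(r["payment_sequential"]),
            "payment_type": r["payment_type"].strip().lower(),
            "payment_installments": int(r["payment_installments"]),
            "payment_value": Decimal(r["payment_value"]),
        }
        for r in rows
    ]


def to_gold(silver):
    """Gold: a payment-type dimension and a fact table that references it."""
    payment_types = sorted({r["payment_type"] for r in silver})
    # Hypothetical surrogate keys: 1..n in alphabetical order.
    dim_payment_type = {name: key for key, name in enumerate(payment_types, start=1)}
    fact_payments = [
        {
            "order_id": r["order_id"],
            "payment_type_key": dim_payment_type[r["payment_type"]],
            "payment_installments": r["payment_installments"],
            "payment_value": r["payment_value"],
        }
        for r in silver
    ]
    return dim_payment_type, fact_payments


silver_payments = to_silver(bronze_payments)
dim_payment_type, fact_payments = to_gold(silver_payments)
total_value = sum(row["payment_value"] for row in fact_payments)
```

In dbt each of these steps would be a separate model, so the dependency between them is what shows up in the lineage graph.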

dbt lineage

Model Diagram

Last Model Diagram


🔄 Data Flows

NiFi Flow (Batch ingestion pipeline)

NiFi Flow

S3 Storage Structure

  • Batch: S3 Batch

  • Stream: S3 Stream


📈 Dashboards

Power BI

BI Dashboard 1

BI Dashboard 2

Grafana

Grafana Dashboard


⚙ Technologies Used

| Layer | Tool/Technology |
| --- | --- |
| Data Ingestion | NiFi, Debezium |
| Messaging & Streaming | Kafka, Flink |
| Storage | PostgreSQL, S3, Snowflake |
| Data Modeling | dbt |
| Real-time Storage | Cassandra |
| Visualization | Power BI (BI dashboards), Grafana (real-time metrics) |
| Containerization | Docker |


Languages

  • HTML 55.4%
  • Python 38.7%
  • Dockerfile 5.9%