A massive shoutout to my bro vuthanhhai2302 whose brilliant Python ETL pipeline inspired this Golang version. You can check out his work right here
The code runs from a main script, and all the processing functions are wrapped in the processor folder.
This ETL pipeline extracts commit data from the GitHub API, saves the raw data to file storage partitioned by month, converts the raw data into a list of commit models, and then loads the validated data into a PostgreSQL database. The process also includes post-load validations to ensure data integrity.
- Go 1.22+
- Environment variables set (`GITHUB_TOKEN` for API authentication)
- Docker Compose
- Install dependencies: `docker compose up -d`

The pipeline is built from three components:

- Extractor: Fetches commit data asynchronously from the GitHub API, aggregates it by month, and saves each month's aggregated commit data to a corresponding file
- Transformer: Loads the file storage data, converts it into validated commit model instances, and pushes the validated commits to a channel
- Loader: Listens to the Transformer's data channel and performs batch inserts into the destination table (a minimal wiring sketch of these three stages follows below)
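Below is a minimal, self-contained sketch of how these three stages could be wired together through a channel in `main`. The `Commit` fields and the `extract`/`transform`/`load` helpers are illustrative placeholders, not the repository's actual API:

```go
package main

import (
	"context"
	"fmt"
)

// Commit is a simplified stand-in for the pipeline's commit model.
type Commit struct {
	SHA     string
	Author  string
	Message string
	Date    string
}

func main() {
	ctx := context.Background()

	// Extract: the real pipeline calls the GitHub API and writes monthly files here.
	files := extract(ctx)

	// Transform: stream validated commits into a channel for the loader.
	commits := make(chan Commit, 100)
	go func() {
		defer close(commits)
		transform(ctx, files, commits)
	}()

	// Load: drain the channel and batch-insert into PostgreSQL (stubbed with a print).
	load(ctx, commits)
}

func extract(ctx context.Context) []string {
	return []string{"storage/2024/05.json"} // placeholder path
}

func transform(ctx context.Context, files []string, out chan<- Commit) {
	for range files {
		out <- Commit{SHA: "abc123", Author: "octocat", Message: "init", Date: "2024-05-01"}
	}
}

func load(ctx context.Context, in <-chan Commit) {
	for c := range in {
		fmt.Println("would insert commit", c.SHA)
	}
}
```

Streaming through a channel lets the Loader start inserting as soon as the first validated commit arrives, instead of waiting for the whole transformation to finish.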
Note: the code first ingests and writes the raw data to local storage, and only then loads it into the destination database. The main reason is that if loading to the destination database fails, we can re-run just the failed task (if we are using an orchestrator).
- Configuration & Environment Setup: The pipeline run date is determined from the current date. Config is loaded from `config.yaml`, and environment variables (e.g., `GITHUB_TOKEN`) are read for API authentication.
- Data Extraction: The `Extractor` collects commit data from GitHub using asynchronous API calls, aggregating commits by month for the past six months.
- Saving to File Storage: The `Extractor` also writes the aggregated data into files (organized by year and month), returning a list of file paths.
- Data Transformation: The `Transformer` loads the commit data from the files and converts it into a channel of commit model instances (`Commit`) for downstream processing.
- Data Loading: The `Loader` uses the established PostgreSQL connection to create a `CommitStore`. Existing records for the current pipeline run date are deleted, and the new commit data is batch-inserted into the target table.
- Post-Load Validation: The pipeline verifies that the number of rows loaded into PostgreSQL matches the expected count from the file storage. If there is a mismatch, an error is logged and raised.
- Cleanup: The PostgreSQL connection is closed and a success log message is produced if all validations pass.
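The following are hedged Go sketches of how the individual steps might look; all struct fields, table names, and function names are assumptions for illustration rather than the repository's actual code. First, the configuration step: loading `config.yaml` (shown here with `gopkg.in/yaml.v3`, though the repo may use a different config library) and reading `GITHUB_TOKEN` from the environment:

```go
package processor

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Config mirrors a hypothetical config.yaml; the real keys may differ.
type Config struct {
	Owner    string `yaml:"owner"`
	Repo     string `yaml:"repo"`
	Database struct {
		DSN string `yaml:"dsn"`
	} `yaml:"database"`
}

// LoadConfig reads config.yaml and the GITHUB_TOKEN environment variable.
func LoadConfig(path string) (Config, string, error) {
	var cfg Config
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, "", fmt.Errorf("read config: %w", err)
	}
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return cfg, "", fmt.Errorf("parse config: %w", err)
	}
	// The GitHub token comes from the environment, not from the YAML file.
	token := os.Getenv("GITHUB_TOKEN")
	if token == "" {
		return cfg, "", fmt.Errorf("GITHUB_TOKEN is not set")
	}
	return cfg, token, nil
}
```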
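For the extraction and file-storage steps, a sketch that fetches each of the past six months concurrently from the GitHub commits API and writes the raw JSON partitioned by year and month (pagination and retries are omitted to keep it short):

```go
package processor

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"sync"
	"time"

	"golang.org/x/sync/errgroup"
)

// fetchMonth pulls one page of commits between start and end from the GitHub commits API.
func fetchMonth(ctx context.Context, owner, repo, token string, start, end time.Time) ([]byte, error) {
	url := fmt.Sprintf(
		"https://api.github.com/repos/%s/%s/commits?since=%s&until=%s&per_page=100",
		owner, repo, start.Format(time.RFC3339), end.Format(time.RFC3339))
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("github api returned %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// ExtractLastSixMonths fetches each month concurrently and writes the raw JSON
// to <baseDir>/<year>/<month>.json, returning the written file paths.
func ExtractLastSixMonths(ctx context.Context, owner, repo, token, baseDir string) ([]string, error) {
	g, ctx := errgroup.WithContext(ctx)
	var mu sync.Mutex
	var paths []string

	now := time.Now().UTC()
	firstOfMonth := time.Date(now.Year(), now.Month(), 1, 0, 0, 0, 0, time.UTC)

	for i := 0; i < 6; i++ {
		start := firstOfMonth.AddDate(0, -i, 0)
		end := start.AddDate(0, 1, 0)
		g.Go(func() error {
			raw, err := fetchMonth(ctx, owner, repo, token, start, end)
			if err != nil {
				return err
			}
			// Partition the raw data by year and month, e.g. storage/2024/05.json.
			path := filepath.Join(baseDir,
				fmt.Sprintf("%04d", start.Year()), fmt.Sprintf("%02d.json", start.Month()))
			if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
				return err
			}
			if err := os.WriteFile(path, raw, 0o644); err != nil {
				return err
			}
			mu.Lock()
			paths = append(paths, path)
			mu.Unlock()
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return paths, nil
}
```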
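For the transformation step, a sketch that reads each monthly file, decodes the GitHub payload, applies a minimal validation, and streams `Commit` models into a channel. The `Commit` fields shown are assumptions:

```go
package processor

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// Commit is a simplified commit model; the real struct may carry more fields.
type Commit struct {
	SHA     string
	Author  string
	Message string
	Date    time.Time
}

// githubCommit matches only the parts of the GitHub commits payload this sketch needs.
type githubCommit struct {
	SHA    string `json:"sha"`
	Commit struct {
		Message string `json:"message"`
		Author  struct {
			Name string    `json:"name"`
			Date time.Time `json:"date"`
		} `json:"author"`
	} `json:"commit"`
}

// Transform reads each monthly file, validates every record, and streams the
// resulting Commit models into out for the loader to consume.
func Transform(files []string, out chan<- Commit) error {
	defer close(out)
	for _, path := range files {
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		var raw []githubCommit
		if err := json.Unmarshal(data, &raw); err != nil {
			return fmt.Errorf("decode %s: %w", path, err)
		}
		for _, r := range raw {
			// Minimal validation: skip records missing a SHA or an author date.
			if r.SHA == "" || r.Commit.Author.Date.IsZero() {
				continue
			}
			out <- Commit{
				SHA:     r.SHA,
				Author:  r.Commit.Author.Name,
				Message: r.Commit.Message,
				Date:    r.Commit.Author.Date,
			}
		}
	}
	return nil
}
```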
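For the loading step, a sketch that deletes any rows already present for the pipeline run date and inserts the commits from the channel inside a single transaction (reusing the `Commit` model from the previous sketch). The repo performs batch inserts; here a prepared statement in one transaction stands in for brevity, and the `commits` table and its columns are assumed rather than taken from the repository's schema:

```go
package processor

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // PostgreSQL driver; the repo may use a different one
)

// Load replaces any rows from the current pipeline run date and then inserts
// the commits read from the channel, returning how many rows were written.
func Load(ctx context.Context, dsn string, runDate time.Time, in <-chan Commit) (int, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return 0, err
	}
	defer db.Close()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return 0, err
	}
	defer tx.Rollback() // no-op if the transaction was committed

	// Idempotency: remove any rows already loaded for this run date.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM commits WHERE pipeline_run_date = $1`, runDate); err != nil {
		return 0, err
	}

	stmt, err := tx.PrepareContext(ctx,
		`INSERT INTO commits (sha, author, message, committed_at, pipeline_run_date)
		 VALUES ($1, $2, $3, $4, $5)`)
	if err != nil {
		return 0, err
	}
	defer stmt.Close()

	count := 0
	for c := range in {
		if _, err := stmt.ExecContext(ctx, c.SHA, c.Author, c.Message, c.Date, runDate); err != nil {
			return count, fmt.Errorf("insert %s: %w", c.SHA, err)
		}
		count++
	}
	return count, tx.Commit()
}
```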
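And for post-load validation, a sketch that compares the loaded row count against the count expected from file storage (table and column names again assumed):

```go
package processor

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// ValidateRowCount compares the number of rows loaded for the run date against
// the count expected from the file storage, and fails loudly on a mismatch.
func ValidateRowCount(ctx context.Context, db *sql.DB, runDate time.Time, expected int) error {
	var loaded int
	err := db.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM commits WHERE pipeline_run_date = $1`, runDate).Scan(&loaded)
	if err != nil {
		return err
	}
	if loaded != expected {
		return fmt.Errorf("post-load validation failed: expected %d rows, found %d", expected, loaded)
	}
	return nil
}
```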
You can find the SQL queries in the `sql` folder.