A massive shoutout to my bro vuthanhhai2302 whose brilliant Python ETL pipeline inspired this Golang version. You can check out his work right here
The code runs from a main script, and all the processing functions are wrapped in the processor folder.
This ETL pipeline extracts commit data from the GitHub API, saves the raw data to file storage partitioned by month, converts the raw data into a list of commit models, and then loads the validated data into a PostgreSQL database. The process also includes post-load validations to ensure data integrity.
- Go 1.22+
- Environment variables set (`GITHUB_TOKEN` for API authentication)
- Docker Compose
- Install dependencies: `docker compose up -d`

The pipeline is built from three components:

- Extractor: Fetches commit data asynchronously from the GitHub API, aggregates it by month, and saves each month's aggregated commit data to a corresponding file
- Transformer: Loads the file storage data, converts it into validated commit model instances, and pushes the validated commits to a channel
- Loader: Listens to the Transformer's data channel and performs batch inserts into the destination table (a minimal wiring sketch of these three stages follows below)
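Below is a minimal, self-contained sketch of how these three stages could be wired together through a channel in `main`. The `Commit` fields and the `extract`/`transform`/`load` helpers are illustrative placeholders, not the repository's actual API:

```go
package main

import (
	"context"
	"fmt"
)

// Commit is a simplified stand-in for the pipeline's commit model.
type Commit struct {
	SHA     string
	Author  string
	Message string
	Date    string
}

func main() {
	ctx := context.Background()

	// Extract: the real pipeline calls the GitHub API and writes monthly files here.
	files := extract(ctx)

	// Transform: stream validated commits into a channel for the loader.
	commits := make(chan Commit, 100)
	go func() {
		defer close(commits)
		transform(ctx, files, commits)
	}()

	// Load: drain the channel and batch-insert into PostgreSQL (stubbed with a print).
	load(ctx, commits)
}

func extract(ctx context.Context) []string {
	return []string{"storage/2024/05.json"} // placeholder path
}

func transform(ctx context.Context, files []string, out chan<- Commit) {
	for range files {
		out <- Commit{SHA: "abc123", Author: "octocat", Message: "init", Date: "2024-05-01"}
	}
}

func load(ctx context.Context, in <-chan Commit) {
	for c := range in {
		fmt.Println("would insert commit", c.SHA)
	}
}
```

Streaming through a channel lets the Loader start inserting as soon as the first validated commit arrives, instead of waiting for the whole transformation to finish.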
Note: the code first ingests and writes the raw data to local storage, and only then loads it into the destination database. The main reason is that if loading to the destination database fails, we can re-run just the failed task (if we are using an orchestrator).
- Configuration & Environment Setup: The pipeline run date is determined from the current date. Config is loaded from `config.yaml`, and environment variables (e.g., `GITHUB_TOKEN`) are read for API authentication.
- Data Extraction: The `Extractor` collects commit data from GitHub using asynchronous API calls, aggregating commits by month for the past six months.
- Saving to File Storage: The `Extractor` also writes the aggregated data into files (organized by year and month), returning a list of file paths.
- Data Transformation: The `Transformer` loads the commit data from the files and converts it into a channel of commit model instances (`Commit`) for downstream processing.
- Data Loading: The `Loader` uses the established PostgreSQL connection to create a `CommitStore`. Existing records for the current pipeline run date are deleted, and the new commit data is batch-inserted into the target table.
- Post-Load Validation: The pipeline verifies that the number of rows loaded into PostgreSQL matches the expected count from the file storage. If there is a mismatch, an error is logged and raised.
- Cleanup: The PostgreSQL connection is closed and a success log message is produced if all validations pass.
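The following are hedged Go sketches of how the individual steps might look; all struct fields, table names, and function names are assumptions for illustration rather than the repository's actual code. First, the configuration step: loading `config.yaml` (shown here with `gopkg.in/yaml.v3`, though the repo may use a different config library) and reading `GITHUB_TOKEN` from the environment:

```go
package processor

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Config mirrors a hypothetical config.yaml; the real keys may differ.
type Config struct {
	Owner    string `yaml:"owner"`
	Repo     string `yaml:"repo"`
	Database struct {
		DSN string `yaml:"dsn"`
	} `yaml:"database"`
}

// LoadConfig reads config.yaml and the GITHUB_TOKEN environment variable.
func LoadConfig(path string) (Config, string, error) {
	var cfg Config
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, "", fmt.Errorf("read config: %w", err)
	}
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return cfg, "", fmt.Errorf("parse config: %w", err)
	}
	// The GitHub token comes from the environment, not from the YAML file.
	token := os.Getenv("GITHUB_TOKEN")
	if token == "" {
		return cfg, "", fmt.Errorf("GITHUB_TOKEN is not set")
	}
	return cfg, token, nil
}
```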
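For the extraction and file-storage steps, a sketch that fetches each of the past six months concurrently from the GitHub commits API and writes the raw JSON partitioned by year and month (pagination and retries are omitted to keep it short):

```go
package processor

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"sync"
	"time"

	"golang.org/x/sync/errgroup"
)

// fetchMonth pulls one page of commits between start and end from the GitHub commits API.
func fetchMonth(ctx context.Context, owner, repo, token string, start, end time.Time) ([]byte, error) {
	url := fmt.Sprintf(
		"https://api.github.com/repos/%s/%s/commits?since=%s&until=%s&per_page=100",
		owner, repo, start.Format(time.RFC3339), end.Format(time.RFC3339))
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("github api returned %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// ExtractLastSixMonths fetches each month concurrently and writes the raw JSON
// to <baseDir>/<year>/<month>.json, returning the written file paths.
func ExtractLastSixMonths(ctx context.Context, owner, repo, token, baseDir string) ([]string, error) {
	g, ctx := errgroup.WithContext(ctx)
	var mu sync.Mutex
	var paths []string

	now := time.Now().UTC()
	firstOfMonth := time.Date(now.Year(), now.Month(), 1, 0, 0, 0, 0, time.UTC)

	for i := 0; i < 6; i++ {
		start := firstOfMonth.AddDate(0, -i, 0)
		end := start.AddDate(0, 1, 0)
		g.Go(func() error {
			raw, err := fetchMonth(ctx, owner, repo, token, start, end)
			if err != nil {
				return err
			}
			// Partition the raw data by year and month, e.g. storage/2024/05.json.
			path := filepath.Join(baseDir,
				fmt.Sprintf("%04d", start.Year()), fmt.Sprintf("%02d.json", start.Month()))
			if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
				return err
			}
			if err := os.WriteFile(path, raw, 0o644); err != nil {
				return err
			}
			mu.Lock()
			paths = append(paths, path)
			mu.Unlock()
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return paths, nil
}
```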
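For the transformation step, a sketch that reads each monthly file, decodes the GitHub payload, applies a minimal validation, and streams `Commit` models into a channel. The `Commit` fields shown are assumptions:

```go
package processor

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// Commit is a simplified commit model; the real struct may carry more fields.
type Commit struct {
	SHA     string
	Author  string
	Message string
	Date    time.Time
}

// githubCommit matches only the parts of the GitHub commits payload this sketch needs.
type githubCommit struct {
	SHA    string `json:"sha"`
	Commit struct {
		Message string `json:"message"`
		Author  struct {
			Name string    `json:"name"`
			Date time.Time `json:"date"`
		} `json:"author"`
	} `json:"commit"`
}

// Transform reads each monthly file, validates every record, and streams the
// resulting Commit models into out for the loader to consume.
func Transform(files []string, out chan<- Commit) error {
	defer close(out)
	for _, path := range files {
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		var raw []githubCommit
		if err := json.Unmarshal(data, &raw); err != nil {
			return fmt.Errorf("decode %s: %w", path, err)
		}
		for _, r := range raw {
			// Minimal validation: skip records missing a SHA or an author date.
			if r.SHA == "" || r.Commit.Author.Date.IsZero() {
				continue
			}
			out <- Commit{
				SHA:     r.SHA,
				Author:  r.Commit.Author.Name,
				Message: r.Commit.Message,
				Date:    r.Commit.Author.Date,
			}
		}
	}
	return nil
}
```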
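For the loading step, a sketch that deletes any rows already present for the pipeline run date and inserts the commits from the channel inside a single transaction (reusing the `Commit` model from the previous sketch). The repo performs batch inserts; here a prepared statement in one transaction stands in for brevity, and the `commits` table and its columns are assumed rather than taken from the repository's schema:

```go
package processor

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // PostgreSQL driver; the repo may use a different one
)

// Load replaces any rows from the current pipeline run date and then inserts
// the commits read from the channel, returning how many rows were written.
func Load(ctx context.Context, dsn string, runDate time.Time, in <-chan Commit) (int, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return 0, err
	}
	defer db.Close()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return 0, err
	}
	defer tx.Rollback() // no-op if the transaction was committed

	// Idempotency: remove any rows already loaded for this run date.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM commits WHERE pipeline_run_date = $1`, runDate); err != nil {
		return 0, err
	}

	stmt, err := tx.PrepareContext(ctx,
		`INSERT INTO commits (sha, author, message, committed_at, pipeline_run_date)
		 VALUES ($1, $2, $3, $4, $5)`)
	if err != nil {
		return 0, err
	}
	defer stmt.Close()

	count := 0
	for c := range in {
		if _, err := stmt.ExecContext(ctx, c.SHA, c.Author, c.Message, c.Date, runDate); err != nil {
			return count, fmt.Errorf("insert %s: %w", c.SHA, err)
		}
		count++
	}
	return count, tx.Commit()
}
```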
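And for post-load validation, a sketch that compares the loaded row count against the count expected from file storage (table and column names again assumed):

```go
package processor

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// ValidateRowCount compares the number of rows loaded for the run date against
// the count expected from the file storage, and fails loudly on a mismatch.
func ValidateRowCount(ctx context.Context, db *sql.DB, runDate time.Time, expected int) error {
	var loaded int
	err := db.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM commits WHERE pipeline_run_date = $1`, runDate).Scan(&loaded)
	if err != nil {
		return err
	}
	if loaded != expected {
		return fmt.Errorf("post-load validation failed: expected %d rows, found %d", expected, loaded)
	}
	return nil
}
```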
You can find the SQL queries in the `sql` folder.