An end-to-end tool for curating GitHub repositories into structured code datasets.
- Fast parallel processing - downloads and extracts with a configurable number of workers
- Smart filtering - processes only programming files, as classified by GitHub Linguist
- GPT-2 tokenization - provides ready-to-use token counts for ML workflows
- Efficient caching - uses ETags to avoid re-downloading unchanged repositories
Perfect for curating training data, running code analysis, or creating repository archives.
Install with Cargo:

```bash
cargo install --path .
```

Create an input file with one GitHub repository per line:
"microsoft/vscode"
"vercel/next.js"
"tensorflow/tensorflow"
"bitcoin/bitcoin"
"rust-lang/rust"
"kubernetes/kubernetes"
"facebook/react"
"docker/compose"
"ansible/ansible"
"elastic/elasticsearch"Download repositories:
```bash
codecurator download ./configs/repos.jsonl
```

This creates ZIP files in the /zip/ directory. Each repository is downloaded from its main branch first, falling back to master if needed.
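The download step is also where the ETag caching from the feature list applies. Below is a minimal sketch of the branch fallback plus conditional download, assuming the `reqwest` blocking client and GitHub's public codeload archive endpoint; the cache layout and error handling are illustrative, not codecurator's actual internals.

```rust
// Sketch: fetch a repo ZIP, trying `main` then `master`, and skip the
// download when a previously stored ETag still matches (HTTP 304).
use reqwest::blocking::Client;
use reqwest::StatusCode;
use std::fs;
use std::path::Path;

fn download_zip(repo: &str, cached_etag: Option<&str>) -> reqwest::Result<()> {
    let client = Client::new();
    for branch in ["main", "master"] {
        let url = format!("https://codeload.github.com/{repo}/zip/refs/heads/{branch}");
        let mut req = client.get(&url);
        if let Some(etag) = cached_etag {
            // Ask the server to send the body only if the archive changed.
            req = req.header("If-None-Match", etag);
        }
        let resp = req.send()?;
        match resp.status() {
            StatusCode::NOT_MODIFIED => {
                println!("{repo}: unchanged, keeping cached ZIP");
                return Ok(());
            }
            StatusCode::OK => {
                let etag = resp
                    .headers()
                    .get("ETag")
                    .and_then(|v| v.to_str().ok())
                    .map(str::to_owned);
                let bytes = resp.bytes()?;
                let name = repo.replace('/', "_");
                fs::create_dir_all("zip").ok();
                fs::write(Path::new("zip").join(format!("{name}.zip")), &bytes).ok();
                if let Some(etag) = etag {
                    // Persist the ETag next to the ZIP for the next run.
                    fs::write(Path::new("zip").join(format!("{name}.etag")), etag).ok();
                }
                return Ok(());
            }
            // e.g. 404 when the branch does not exist: try the next one.
            _ => continue,
        }
    }
    eprintln!("{repo}: neither main nor master could be downloaded");
    Ok(())
}
```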
Extract and process:
```bash
codecurator extract ./configs/repos.jsonl --languages Python Rust Verilog
```

This processes all programming files, tokenizes their content, and outputs structured records to the /jsonl/ directory.
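To illustrate where the token counts come from, here is a minimal GPT-2 tokenization sketch assuming the `tiktoken-rs` crate, whose `r50k_base` encoding corresponds to the GPT-2 vocabulary; codecurator's actual tokenizer wiring may differ.

```rust
// Sketch: count GPT-2 tokens for one source file.
use tiktoken_rs::r50k_base;

fn main() -> anyhow::Result<()> {
    let bpe = r50k_base()?; // GPT-2-era BPE encoding
    let source = std::fs::read_to_string("src/main.rs")?;
    let tokens = bpe.encode_with_special_tokens(&source);
    println!("{} tokens", tokens.len());
    Ok(())
}
```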
Deduplication:
```bash
codecurator dedupe ./configs/repos.jsonl
```

This hashes the contents of every file and drops duplicates, storing the deduplicated data in /dedup/ by default.
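A minimal sketch of content-hash deduplication, assuming SHA-256 via the `sha2` crate and keeping the first occurrence of each digest; the hash function and record handling in codecurator may differ.

```rust
// Sketch: drop files whose byte content has been seen before.
use sha2::{Digest, Sha256};
use std::collections::HashSet;

fn dedupe(files: Vec<(String, Vec<u8>)>) -> Vec<(String, Vec<u8>)> {
    let mut seen = HashSet::new();
    files
        .into_iter()
        // `insert` returns false for digests already in the set,
        // so later duplicates are filtered out.
        .filter(|(_, bytes)| seen.insert(Sha256::digest(bytes)))
        .collect()
}
```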
Statistics:
```bash
$ bash stats/count_records.sh ./jsonl/
Total records: 110645

$ bash stats/count_tokens.sh ./dedup/
Total tokens: 346574283
```
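Assuming one JSON record per line (the usual JSONL convention), counting records amounts to summing line counts over the .jsonl files; a Rust sketch of the same idea, not the actual script:

```rust
// Sketch: total JSONL records = total lines across *.jsonl files.
use std::fs;

fn main() -> std::io::Result<()> {
    let mut total = 0usize;
    for entry in fs::read_dir("jsonl")? {
        let path = entry?.path();
        if path.extension().is_some_and(|e| e == "jsonl") {
            total += fs::read_to_string(&path)?.lines().count();
        }
    }
    println!("Total records: {total}");
    Ok(())
}
```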