An end-to-end tool for curating GitHub repositories into structured code datasets.
- Fast parallel processing - downloads and extracts with a configurable number of workers
- Smart filtering - processes only programming files, as classified by GitHub Linguist
- GPT-2 tokenization - provides ready-to-use token counts for ML workflows
- Efficient caching - uses ETags to avoid re-downloading unchanged repositories
Perfect for curating training data, running code analysis, or creating repository archives.
Install with Cargo:

```bash
cargo install --path .
```

Create an input file with one GitHub repository per line:
"microsoft/vscode"
"vercel/next.js"
"tensorflow/tensorflow"
"bitcoin/bitcoin"
"rust-lang/rust"
"kubernetes/kubernetes"
"facebook/react"
"docker/compose"
"ansible/ansible"
"elastic/elasticsearch"Download repositories:
```bash
codecurator download ./configs/repos.jsonl
```

This creates ZIP files in the /zip/ directory. Each repository is downloaded from its main branch first, falling back to master if needed.
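The download step is also where the ETag caching from the feature list applies. Below is a minimal sketch of the branch fallback plus conditional download, assuming the `reqwest` blocking client and GitHub's public codeload archive endpoint; the cache layout and error handling are illustrative, not codecurator's actual internals.

```rust
// Sketch: fetch a repo ZIP, trying `main` then `master`, and skip the
// download when a previously stored ETag still matches (HTTP 304).
use reqwest::blocking::Client;
use reqwest::StatusCode;
use std::fs;
use std::path::Path;

fn download_zip(repo: &str, cached_etag: Option<&str>) -> reqwest::Result<()> {
    let client = Client::new();
    for branch in ["main", "master"] {
        let url = format!("https://codeload.github.com/{repo}/zip/refs/heads/{branch}");
        let mut req = client.get(&url);
        if let Some(etag) = cached_etag {
            // Ask the server to send the body only if the archive changed.
            req = req.header("If-None-Match", etag);
        }
        let resp = req.send()?;
        match resp.status() {
            StatusCode::NOT_MODIFIED => {
                println!("{repo}: unchanged, keeping cached ZIP");
                return Ok(());
            }
            StatusCode::OK => {
                let etag = resp
                    .headers()
                    .get("ETag")
                    .and_then(|v| v.to_str().ok())
                    .map(str::to_owned);
                let bytes = resp.bytes()?;
                let name = repo.replace('/', "_");
                fs::create_dir_all("zip").ok();
                fs::write(Path::new("zip").join(format!("{name}.zip")), &bytes).ok();
                if let Some(etag) = etag {
                    // Persist the ETag next to the ZIP for the next run.
                    fs::write(Path::new("zip").join(format!("{name}.etag")), etag).ok();
                }
                return Ok(());
            }
            // e.g. 404 when the branch does not exist: try the next one.
            _ => continue,
        }
    }
    eprintln!("{repo}: neither main nor master could be downloaded");
    Ok(())
}
```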
Extract and process:
```bash
codecurator extract ./configs/repos.jsonl --languages Python Rust Verilog
```

This processes all programming files, tokenizes their content, and outputs structured records to the /jsonl/ directory.
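To illustrate where the token counts come from, here is a minimal GPT-2 tokenization sketch assuming the `tiktoken-rs` crate, whose `r50k_base` encoding corresponds to the GPT-2 vocabulary; codecurator's actual tokenizer wiring may differ.

```rust
// Sketch: count GPT-2 tokens for one source file.
use tiktoken_rs::r50k_base;

fn main() -> anyhow::Result<()> {
    let bpe = r50k_base()?; // GPT-2-era BPE encoding
    let source = std::fs::read_to_string("src/main.rs")?;
    let tokens = bpe.encode_with_special_tokens(&source);
    println!("{} tokens", tokens.len());
    Ok(())
}
```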
Deduplication:
```bash
codecurator dedupe ./configs/repos.jsonl
```

This hashes the contents of every file and drops duplicates, storing the deduplicated data in /dedup/ by default.
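A minimal sketch of content-hash deduplication, assuming SHA-256 via the `sha2` crate and keeping the first occurrence of each digest; the hash function and record handling in codecurator may differ.

```rust
// Sketch: drop files whose byte content has been seen before.
use sha2::{Digest, Sha256};
use std::collections::HashSet;

fn dedupe(files: Vec<(String, Vec<u8>)>) -> Vec<(String, Vec<u8>)> {
    let mut seen = HashSet::new();
    files
        .into_iter()
        // `insert` returns false for digests already in the set,
        // so later duplicates are filtered out.
        .filter(|(_, bytes)| seen.insert(Sha256::digest(bytes)))
        .collect()
}
```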
Statistics:
```bash
$ bash stats/count_records.sh ./jsonl/
Total records: 110645

$ bash stats/count_tokens.sh ./dedup/
Total tokens: 346574283
```
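Assuming one JSON record per line (the usual JSONL convention), counting records amounts to summing line counts over the .jsonl files; a Rust sketch of the same idea, not the actual script:

```rust
// Sketch: total JSONL records = total lines across *.jsonl files.
use std::fs;

fn main() -> std::io::Result<()> {
    let mut total = 0usize;
    for entry in fs::read_dir("jsonl")? {
        let path = entry?.path();
        if path.extension().is_some_and(|e| e == "jsonl") {
            total += fs::read_to_string(&path)?.lines().count();
        }
    }
    println!("Total records: {total}");
    Ok(())
}
```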