A Rust port of Karpathy's llm.c - an LLM implementation focused on pretraining, and in particular reproducing the GPT-2 and GPT-3 miniseries.
Rust is focused on performance and reliability, so I found it particularly interesting to see how these principles apply to an LLM training loop.
llm.rs is an educational project, in the spirit of llm.c but in Rust, where I tried to write code that is as safe and idiomatic as possible. The CPU and GPU paths remain close to the C/CUDA originals to ease learning and benchmarking. Training results are bit-exact with the reference implementation.
Rust enforces memory safety, thread safety, and data race prevention mostly at compile time, preserving runtime performance.
The CPU implementation, `train_gpt2.rs`, is 100% safe Rust (no `unsafe` blocks).
- A notable difference from C is the use of sized slices instead of raw pointers, preventing out-of-bounds access.
- Shared buffers (parameters, activations, gradients) are split with `split_at_mut()` to satisfy the borrow checker and prevent aliasing (multiple mutable references to the same memory location); a minimal sketch follows this list.
- Importantly, these checks have minimal performance cost; see the performance section.
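For illustration, here is a minimal sketch of the `split_at_mut` pattern. The tensor names and sizes (`wte_len`, `wpe_len`) are hypothetical and do not reflect the project's actual buffer layout; the point is that the compiler guarantees the resulting slices never alias.

```rust
/// Hypothetical example: carve one flat parameter buffer into disjoint,
/// bounds-checked slices without any unsafe code.
fn split_params(
    params: &mut [f32],
    wte_len: usize, // illustrative size of the token-embedding table
    wpe_len: usize, // illustrative size of the position-embedding table
) -> (&mut [f32], &mut [f32], &mut [f32]) {
    let (wte, rest) = params.split_at_mut(wte_len);
    let (wpe, rest) = rest.split_at_mut(wpe_len);
    (wte, wpe, rest) // three non-overlapping &mut slices
}
```

Each slice can then be handed to a different part of the training step without the borrow checker complaining about aliasing.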
The CUDA implementation, `train_gpt2_cuda.rs`, uses cust, a light wrapper around the CUDA Driver API. This keeps the CUDA code from llm.c almost unchanged.
- The kernel launchers are exposed in `cuda_launchers.rs`, where most `unsafe` blocks live (a minimal launch sketch follows this list).
- The CUDA kernels (in `llmrs/cuda`) are still compiled with `nvcc`.
- Rust manages GPU memory allocation with RAII cleanup.
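To give a rough idea of what a launcher looks like with cust, here is a self-contained sketch. The PTX path, the kernel name `gelu_forward`, and its argument list are assumptions made for illustration; they do not match the project's files exactly.

```rust
use cust::prelude::*;

fn main() -> cust::error::CudaResult<()> {
    // Initialize the Driver API and create a context on the first device.
    let _ctx = cust::quick_init()?;

    // Hypothetical PTX produced by nvcc from one of the CUDA kernels.
    let module = Module::from_ptx(include_str!("../kernels/gelu.ptx"), &[])?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    // Device buffers are freed automatically when they go out of scope (RAII).
    let x = DeviceBuffer::from_slice(&vec![1.0f32; 1024])?;
    let y = DeviceBuffer::from_slice(&vec![0.0f32; 1024])?;

    // Launching a raw kernel is unsafe: the caller must guarantee that the
    // grid/block configuration and the argument types match the kernel.
    let gelu = module.get_function("gelu_forward")?;
    unsafe {
        launch!(gelu<<<4, 256, 0, stream>>>(
            y.as_device_ptr(),
            x.as_device_ptr(),
            1024i32
        ))?;
    }
    stream.synchronize()?;

    // Copy the result back to the host.
    let mut out = vec![0.0f32; 1024];
    y.copy_to(&mut out)?;
    Ok(())
}
```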
- llm.c uses `-Ofast`, which enables fast-math (trading precision for speed).
- Rust does not yet provide a direct fast-math equivalent.
- For a fair comparison, results are shown with both C and Rust compiled at `-O3`.
- An additional C run with fast-math is included for reference.
- llm.c uses OpenMP for multithreading (minimal code changes).
- In Rust, the closest equivalent is Rayon.
- Rayon is lightweight but requires more code changes, so a dedicated loop was added in `train_gpt2_rayon.rs` (see the sketch below).
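To give a feel for the change, here is a minimal sketch of the pattern, assuming a matmul-style forward pass in which every output row is independent. The function, its signature, and the shapes are illustrative rather than the project's actual code; the `#pragma omp parallel for` over the row loop in llm.c becomes a Rayon parallel iterator.

```rust
use rayon::prelude::*;

/// Illustrative matmul forward pass: `out` holds rows of length `oc`,
/// `inp` holds rows of length `c`, and `weight` is an `oc x c` row-major matrix.
fn matmul_forward(out: &mut [f32], inp: &[f32], weight: &[f32], c: usize, oc: usize) {
    // Each output row depends only on its own input row, so rows can be
    // processed in parallel without any shared mutable state.
    out.par_chunks_mut(oc)
        .zip(inp.par_chunks(c))
        .for_each(|(out_row, inp_row)| {
            for (o, out_val) in out_row.iter_mut().enumerate() {
                let w_row = &weight[o * c..(o + 1) * c];
                *out_val = inp_row.iter().zip(w_row).map(|(&x, &w)| x * w).sum();
            }
        });
}
```

Rayon accepts this without `unsafe` because `par_chunks_mut` hands each task a disjoint `&mut` slice, the same non-aliasing guarantee that `split_at_mut` provides on the single-threaded CPU path.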
The GPU kernels are the same in both implementations, so we expect minimal difference.
This chart shows the kernel run times; however, I found the overall training loop to be slower in Rust, which still needs to be investigated.
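As a starting point for that investigation, one option is to time the whole step from the host side, bracketing it with stream synchronization. The sketch below is hypothetical: `train_step` stands in for whatever closure runs one forward/backward/update and is not an actual function of this project.

```rust
use std::time::Instant;
use cust::error::CudaResult;
use cust::stream::Stream;

/// Hypothetical helper: measure one training step in milliseconds, making
/// sure all queued GPU work has finished before and after the measurement.
fn time_step(stream: &Stream, mut train_step: impl FnMut() -> CudaResult<()>) -> CudaResult<f64> {
    stream.synchronize()?; // drain previously queued work
    let start = Instant::now();
    train_step()?; // enqueue and run forward + backward + update
    stream.synchronize()?; // wait for this step to complete
    Ok(start.elapsed().as_secs_f64() * 1e3) // wall-clock milliseconds
}
```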
Install Rust:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

With an NVIDIA GPU and CUDA installed, you can build directly with CUDA + BF16 support:

```bash
cargo build --release --features cuda,bf16
```

Download the training datasets:

```bash
sh ./dev/download_starter_pack.sh
```

Run CUDA training:
```bash
cargo run --release --features cuda,bf16 --bin train_gpt2_cuda
```

Fortunately, it is also possible to develop without an NVIDIA GPU on your local machine. This is what I do, using a devcontainer and modal.com.
With a devcontainer, part of your IDE runs inside a container. This project ships with a devcontainer.json that pulls a CUDA-enabled image with Rust, so IDE linting works as expected and you can build the project locally.
To use it in Cursor for instance:
Cmd + Shift + P → Dev Container: Rebuild and Reopen in Container.
This opens a new Cursor window on the project, in which you can develop and build as usual.
To actually run CUDA training or tests, you will need a GPU. One option is modal.com:
- Create a Modal account.
- Install the Python package: `pip install modal`
- Authenticate: `modal setup` (or `python -m modal setup`)
Run CUDA training on Modal:
```bash
modal run run_on_modal.py --command "cargo run --features cuda,bf16 --release --bin train_gpt2_cuda"
```

The `run_on_modal.py` script defaults to an NVIDIA L4 instance.
- Measure the complete GPU training loop timings. Rust should not be slower - FFI call overhead is too low for that.
- Automate the bit-exact comparison with the reference implementation in CI
- Improve Rust idiomatic style — I’m sure this can still be refined.
MIT