Have you ever wanted to run inference on a baby Llama 2 model in pure Mojo? No? Well, now you can!
Supported Mojo version: 0.25.7.0
With the release of Mojo, I was inspired to take my Python port
of llama2.py and transition it to Mojo. The result? A version that leverages
Mojo's SIMD & vectorization primitives, boosting performance by nearly 250x over the Python version.
Impressively, after a few native optimizations the Mojo version outperforms the original llama2.c by 30% in multi-threaded inference, and it also beats llama.cpp by 20% on baby-llama CPU inference.
This showcases the potential of hardware-level optimization through Mojo's advanced features.
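The speedup comes largely from Mojo's `vectorize` helper combined with `simdwidthof`, which lets the matmul inner loops that dominate transformer inference issue full-width SIMD instructions. The snippet below is a minimal, illustrative sketch of that pattern, not the repo's actual matmul; it assumes a recent Mojo standard library, and the exact pointer/`load` APIs may differ between Mojo releases.

```mojo
from algorithm import vectorize
from sys import simdwidthof
from memory import UnsafePointer

# Number of float32 lanes the target CPU can process per SIMD instruction.
alias nelts = simdwidthof[DType.float32]()

# Illustrative helper (not from llama2.mojo itself): SIMD dot product of two
# float32 buffers, the inner loop of the matmuls used during inference.
fn simd_dot(a: UnsafePointer[Float32], b: UnsafePointer[Float32], size: Int) -> Float32:
    var acc: Float32 = 0.0

    @parameter
    fn step[width: Int](j: Int):
        # `vectorize` runs this body with width == nelts for the main loop
        # and a smaller width for any remainder elements.
        acc += (a.load[width=width](j) * b.load[width=width](j)).reduce_add()

    vectorize[step, nelts](size)
    return acc

def main():
    var n = 1024
    var a = UnsafePointer[Float32].alloc(n)
    var b = UnsafePointer[Float32].alloc(n)
    for i in range(n):
        a[i] = 1.0
        b[i] = 2.0
    print(simd_dot(a, b, n))  # expected: 2048.0
    a.free()
    b.free()
```

The real matmul in llama2.mojo combines this vectorized inner loop with `parallelize` across output rows, which is where the multi-threaded numbers below come from.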
At the moment, the following models have been successfully executed via llama2.mojo:
| Models |
|---|
| stories 260K, 15M, 42M, 110M |
| TinyLlama-1.1B-Chat-v0.2 |
Mojo vs. 6 programming languages
Mac M1 Max (6 threads)
| Model | llama2.c (OMP/parallelized) | llama2.mojo (parallelized) | llama.cpp (CPU, 6 threads) |
|---|---|---|---|