Have you ever wanted to run inference on a baby Llama 2 model in pure Mojo? No? Well, now you can!
Supported Mojo version: 0.25.7.0
With the release of Mojo, I was inspired to take my Python port
of llama2.py and transition it to Mojo. The result? A version that leverages
Mojo's SIMD & vectorization primitives, boosting performance by nearly 250x over the Python version.
Impressively, after a few native optimizations the Mojo version outperforms the original llama2.c by 30% in multi-threaded inference, and it also beats llama.cpp by 20% on baby-llama CPU inference.
This showcases the potential of hardware-level optimization through Mojo's advanced features.
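The speedup comes largely from Mojo's `vectorize` helper combined with `simdwidthof`, which lets the matmul inner loops that dominate transformer inference issue full-width SIMD instructions. The snippet below is a minimal, illustrative sketch of that pattern, not the repo's actual matmul; it assumes a recent Mojo standard library, and the exact pointer/`load` APIs may differ between Mojo releases.

```mojo
from algorithm import vectorize
from sys import simdwidthof
from memory import UnsafePointer

# Number of float32 lanes the target CPU can process per SIMD instruction.
alias nelts = simdwidthof[DType.float32]()

# Illustrative helper (not from llama2.mojo itself): SIMD dot product of two
# float32 buffers, the inner loop of the matmuls used during inference.
fn simd_dot(a: UnsafePointer[Float32], b: UnsafePointer[Float32], size: Int) -> Float32:
    var acc: Float32 = 0.0

    @parameter
    fn step[width: Int](j: Int):
        # `vectorize` runs this body with width == nelts for the main loop
        # and a smaller width for any remainder elements.
        acc += (a.load[width=width](j) * b.load[width=width](j)).reduce_add()

    vectorize[step, nelts](size)
    return acc

def main():
    var n = 1024
    var a = UnsafePointer[Float32].alloc(n)
    var b = UnsafePointer[Float32].alloc(n)
    for i in range(n):
        a[i] = 1.0
        b[i] = 2.0
    print(simd_dot(a, b, n))  # expected: 2048.0
    a.free()
    b.free()
```

The real matmul in llama2.mojo combines this vectorized inner loop with `parallelize` across output rows, which is where the multi-threaded numbers below come from.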
At the moment, the following models have been successfully executed via llama2.mojo:
| Models |
|---|
| stories 260K, 15M, 42M, 110M |
| TinyLlama-1.1B-Chat-v0.2 |
Mojo vs. 6 programming languages
Mac M1 Max (6 threads)
| Model | llama2.c (OMP/parallelized) | llama2.mojo (parallelized) | llama.cpp (CPU, 6 threads) |
|---|---|---|---|