tairov/llama2.mojo

llama2.🔥

llama2.mojo benchmark

Have you ever wanted to inference a baby Llama 2 model in pure Mojo? No? Well, now you can!

supported version: Mojo 0.25.7.0

With the release of Mojo, I was inspired to take my Python port of llama2.py and transition it to Mojo. The result? A version that leverages Mojo's SIMD and vectorization primitives, boosting performance by nearly 250x over the Python version. Impressively, after a few native improvements, the Mojo version outperforms the original llama2.c by 30% in multi-threaded inference, and it also outperforms llama.cpp by 20% on baby-llama inference on CPU. This showcases the potential of hardware-level optimizations through Mojo's advanced features.

supported models

At the moment, the following models have been successfully executed via llama2.mojo:

Models
stories 260K, 15M, 42M, 110M
Tinyllama-1.1B-Chat-v0.2

extensive benchmark on Apple M1 Max

mojo vs 6 programming languages

benchmark (updated)

Mac M1 Max (6 threads)

Model llama2.c (OMP/parallelized) llama2.mojo (parallelized) llama.cpp (CPU, 6 threads)