Link to Article: https://towardsdatascience.com/how-to-benchmark-deepseek-r1-distilled-models-on-gpqa-using-ollama-and-openais-simple-evals/
- The recent launch of the DeepSeek-R1 model sent ripples across the global AI community. It delivered reasoning performance on par with leading models such as OpenAI's o1, reportedly developed in far less time and at a significantly lower training cost.
- Beyond the headlines and online buzz, how can we assess the model's reasoning abilities using recognized benchmarks?
- DeepSeek's user interface makes it easy to explore its capabilities, but using it programmatically offers deeper insights and more seamless integration into real-world applications.
- Understanding how to run such models locally also provides enhanced control and offline access.
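Running a distilled model locally with Ollama comes down to pulling it from the model library and querying it offline. A minimal sketch, assuming Ollama is installed and that the `deepseek-r1:8b` tag is available (the exact tag is an assumption; check Ollama's model library for the distilled sizes actually published):

```shell
# The tag below is an assumption; Ollama's library lists the
# distilled sizes actually available (e.g. 1.5b, 7b, 8b, 14b).
MODEL="deepseek-r1:8b"

# Download the distilled model, then query it entirely offline.
if command -v ollama >/dev/null 2>&1; then
  ollama pull "$MODEL"
  ollama run "$MODEL" "Why is the sky blue?"
fi
```

Once pulled, the model also serves an OpenAI-style REST API on `localhost:11434`, which is what makes programmatic evaluation possible.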
- In this project, we will explore how to use Ollama and OpenAI's simple-evals to evaluate the reasoning capabilities of DeepSeek-R1's distilled models on GPQA-Diamond, a challenging benchmark of graduate-level science questions.
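To connect the two tools, simple-evals needs a sampler: a callable that takes an OpenAI-style message list and returns the model's reply text. A minimal sketch of such an adapter against Ollama's local chat endpoint is below; the class and method names here are illustrative, not taken from simple-evals, and the endpoint is Ollama's default (`/api/chat` on port 11434):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_payload(model: str, message_list: list) -> dict:
    # Non-streaming chat request in the shape Ollama's REST API expects
    return {"model": model, "messages": message_list, "stream": False}


class OllamaSampler:
    """Adapter so a local Ollama model can stand in for an OpenAI model
    in simple-evals. The model tag is an assumption; use whichever
    distilled variant you pulled."""

    def __init__(self, model: str = "deepseek-r1:8b"):
        self.model = model

    def __call__(self, message_list: list) -> str:
        data = json.dumps(build_payload(self.model, message_list)).encode()
        req = urllib.request.Request(
            OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["message"]["content"]
```

With the adapter in place, the evaluation itself would be run through simple-evals' GPQA eval class (e.g. something like `GPQAEval(...)(OllamaSampler())`; consult the simple-evals repository for the exact class name and constructor arguments).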