
Evaluating DeepSeek-R1 Distilled Models on GPQA Using Ollama and OpenAI's simple-evals

Link to Article: https://towardsdatascience.com/how-to-benchmark-deepseek-r1-distilled-models-on-gpqa-using-ollama-and-openais-simple-evals/

Context

  • The recent launch of the DeepSeek-R1 model sent ripples across the global AI community. It delivered reasoning performance on par with OpenAI's o1-class reasoning models, while being developed in a fraction of the time and at a significantly lower cost.
  • Beyond the headlines and online buzz, how can we assess the model's reasoning abilities using recognized benchmarks? 
  • DeepSeek's chat interface makes it easy to explore the model's capabilities, but working with the model programmatically offers deeper insight and smoother integration into real-world applications.
  • Understanding how to run such models locally also provides enhanced control and offline access.
  • In this project, we explore how to use Ollama and OpenAI's simple-evals library to evaluate the reasoning capabilities of DeepSeek-R1's distilled models on GPQA-Diamond, the most challenging subset of the Graduate-Level Google-Proof Q&A (GPQA) benchmark. Two minimal sketches of the workflow follow below.
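
Because Ollama exposes an OpenAI-compatible API on localhost (port 11434 by default), a distilled model pulled beforehand with `ollama pull deepseek-r1:8b` can be queried programmatically using the standard `openai` Python client. A minimal sketch; the `deepseek-r1:8b` tag is one of several distilled variants available in the Ollama model library:

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible endpoint at localhost:11434 by default;
# the api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1:8b",  # pulled beforehand via `ollama pull deepseek-r1:8b`
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

# R1-style models emit their chain of thought inside <think> tags
# before the final answer, so expect a long response.
print(response.choices[0].message.content)
```

This is also what makes local serving useful beyond offline access: the same client code works against any OpenAI-compatible endpoint, so swapping model sizes is a one-line change.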
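To score the model on GPQA-Diamond, simple-evals' `GPQAEval` class needs a sampler object that turns an OpenAI-style message list into a reply. The sketch below is an assumption-laden adapter, not the repo's official method: it assumes the simple-evals checkout is on the Python path, that eval classes build prompts via a `_pack_message` helper and call the sampler to get a plain string back (the interface in the repo at the time of writing), and that `GPQAEval` accepts `n_repeats`/`num_examples` arguments; check your checkout before running.

```python
# A hedged sketch: routing GPQA-Diamond prompts from simple-evals
# to a locally served Ollama model.
from openai import OpenAI
from gpqa_eval import GPQAEval  # from the openai/simple-evals repo


class OllamaSampler:
    """Minimal sampler adapter for a local Ollama-served model."""

    def __init__(self, model: str = "deepseek-r1:8b"):
        self.client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        self.model = model

    def _pack_message(self, content: str, role: str) -> dict:
        # simple-evals' eval classes call this helper when building prompts
        return {"role": role, "content": content}

    def __call__(self, message_list: list[dict]) -> str:
        response = self.client.chat.completions.create(
            model=self.model, messages=message_list
        )
        return response.choices[0].message.content


# num_examples trims the 198-question Diamond set for a quick smoke test;
# drop it (and raise n_repeats) for a proper benchmark run.
gpqa = GPQAEval(n_repeats=1, num_examples=5)
result = gpqa(OllamaSampler())
print(result.score)  # fraction of questions answered correctly
```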