Link to Article: https://towardsdatascience.com/how-to-benchmark-deepseek-r1-distilled-models-on-gpqa-using-ollama-and-openais-simple-evals/
- The recent launch of the DeepSeek-R1 model sent ripples across the global AI community. It delivered reasoning performance on par with leading models such as OpenAI's o1, reportedly developed in far less time and at a significantly lower training cost.
- Beyond the headlines and online buzz, how can we assess the model's reasoning abilities using recognized benchmarks?
- DeepSeek's user interface makes it easy to explore its capabilities, but using it programmatically offers deeper insights and more seamless integration into real-world applications.
- Understanding how to run such models locally also provides enhanced control and offline access.
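Running a distilled model locally with Ollama comes down to pulling it from the model library and querying it offline. A minimal sketch, assuming Ollama is installed and that the `deepseek-r1:8b` tag is available (the exact tag is an assumption; check Ollama's model library for the distilled sizes actually published):

```shell
# The tag below is an assumption; Ollama's library lists the
# distilled sizes actually available (e.g. 1.5b, 7b, 8b, 14b).
MODEL="deepseek-r1:8b"

# Download the distilled model, then query it entirely offline.
if command -v ollama >/dev/null 2>&1; then
  ollama pull "$MODEL"
  ollama run "$MODEL" "Why is the sky blue?"
fi
```

Once pulled, the model also serves an OpenAI-style REST API on `localhost:11434`, which is what makes programmatic evaluation possible.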
- In this project, we will explore how to use Ollama and OpenAI's simple-evals to evaluate the reasoning capabilities of DeepSeek-R1's distilled models on GPQA-Diamond, a challenging benchmark of graduate-level science questions.
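To connect the two tools, simple-evals needs a sampler: a callable that takes an OpenAI-style message list and returns the model's reply text. A minimal sketch of such an adapter against Ollama's local chat endpoint is below; the class and method names here are illustrative, not taken from simple-evals, and the endpoint is Ollama's default (`/api/chat` on port 11434):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_payload(model: str, message_list: list) -> dict:
    # Non-streaming chat request in the shape Ollama's REST API expects
    return {"model": model, "messages": message_list, "stream": False}


class OllamaSampler:
    """Adapter so a local Ollama model can stand in for an OpenAI model
    in simple-evals. The model tag is an assumption; use whichever
    distilled variant you pulled."""

    def __init__(self, model: str = "deepseek-r1:8b"):
        self.model = model

    def __call__(self, message_list: list) -> str:
        data = json.dumps(build_payload(self.model, message_list)).encode()
        req = urllib.request.Request(
            OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["message"]["content"]
```

With the adapter in place, the evaluation itself would be run through simple-evals' GPQA eval class (e.g. something like `GPQAEval(...)(OllamaSampler())`; consult the simple-evals repository for the exact class name and constructor arguments).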