This repository offers a lightweight and flexible solution for evaluating models on the GSM8K benchmark. The results are generally consistent with those obtained using lm-evaluation-harness.
The 8-shot prompt is from the lm-evaluation-harness gsm8k-cot
python eval_gsm8k_few_shot.py --model <model_name>
| Model | Accuracy | Harness Accuracy | 
|---|---|---|
| Mistral-7B-v0.1 | 41.02 | 42.99 (1.36) | 
| Llama-2-7b-hf | 13.72 | 14.33 (0.97) | 
python eval_gsm8k_zero_shot.py --model <model_name> --use_majority_vote --temp 0.2 --n_votes 8
| Model | Accuracy | Harness Accuracy | 
|---|---|---|
| Mistral-7B-v0.1 | 47.84 | 44.96 (1.37) | 
python eval_gsm8k_zero_shot.py --model <model_name> --use_majority_vote --temp 0.4 --n_votes 8
| Model | Accuracy | 
|---|---|
| Mistral-7B-v0.1 | 50.57 | 
use the Chain of Thought prompt "Let's think step by step." before answering the question.
python eval_gsmk_zero_shot.py --model <model_name> --use_cot_prompt
| Model | Accuracy | Harness Accuracy | 
|---|---|---|
| Mistral-7B-v0.1 | 22.06 | 15.85 (1.01) | 
python eval_gsmk_zero_shot.py --model <model_name>
| Model | Accuracy | 
|---|---|
| Mistral-7B-v0.1 | 10.31 |