Could you release the model outputs from already evaluated models? It would be a huge contribution to the community, because:
This benchmark is large (2.5k questions), so benchmarking SoTA models such as o1 is unaffordable for some people. In addition, such closed models are updated from time to time, so we might not be able to reproduce the reported results;
Some researchers only care about a specific domain, such as math or text-only tasks. Releasing the model outputs would give us more flexibility to compare against SoTA models.