
Conversation

@damian1996
Collaborator

No description provided.


@borgr left a comment


What if the data is not from HF? Do we need a default there besides the string? An optional URL + description? Something else?
Should we document or suggest somewhere that the model name should ideally be the HF name, or model + ID for closed APIs?
What is "levels", a hierarchy? And what do you do with, for example, ROUGE? Split it into 3 separate metrics (ROUGE-1, -2, -L)? And Kendall (correlation, p-value)? (See the sketch after these points.)
I understand why we removed all the enums. But some information is found in the runners; should we record it? The downside is the amount of data we store and holes in the data; the upside is that if we only need to do it once per eval platform, and certain traits are already recorded there, why not? (For per-benchmark fields there is a good reason not to: it would cause people not to contribute.) Specifically, inference parameters, num_demonstrations and some prompt details are not commonly reported anyway.
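
For example, splitting ROUGE into flat metrics could look roughly like this (a minimal sketch; the EvaluationResult fields shown here are illustrative, not the schema's actual definitions):

```python
from dataclasses import dataclass


@dataclass
class EvaluationResult:  # simplified stand-in for the schema's result type
    metric_name: str
    value: float


# One flat entry per ROUGE variant instead of a nested/hierarchical metric
rouge_results = [
    EvaluationResult(metric_name="rouge1", value=0.412),
    EvaluationResult(metric_name="rouge2", value=0.198),
    EvaluationResult(metric_name="rougeL", value=0.377),
]

# Kendall could be handled the same way: two flat entries
kendall_results = [
    EvaluationResult(metric_name="kendall_tau", value=0.63),
    EvaluationResult(metric_name="kendall_pvalue", value=0.01),
]
```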

@damian1996
Collaborator Author

Do you know of any other popular dataset sources, similar to HF? At the moment there are already two options (URL or HF data source). Maybe we should add a good description for it and that will be enough.

Yeah, a good description for model_name is required. Do you think we need any way of verifying the model name, for example checking whether the model even exists?

Yeah, for ROUGE I would rather split it into 3 separate metrics.

About the info found in the runners: do we want to add this additional info from the runners alongside adding the next platforms (like lm-eval, helm, ...)? As optional fields? So for now we keep only the fields important for inspect-ai and extend the schema later to support the next platforms.

But if we want to have one common schema for eval platforms and for leaderboards, then I don't know if it is a good option to complicate it too much. Avijit and Anastassia would rather have a fairly compact and simple schema.

Inference parameters were intended to go into generation_config: Optional[Dict[str, Any]] = None; probably the naming isn't optimal? inference_config or so? About prompt details: in inspect-ai I've seen a prompt template. It can probably be found in most eval platforms.
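
To make that concrete, here is a minimal sketch of how inference parameters and a prompt template could sit together in the metadata (pydantic is assumed; the container name EvaluationMetadata and the exact field shapes are hypothetical, only generation_config comes from the current schema):

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel


class EvaluationMetadata(BaseModel):  # hypothetical container name
    model_name: str
    # free-form inference/generation parameters (temperature, top_p, max_tokens, ...)
    generation_config: Optional[Dict[str, Any]] = None
    # prompt details, e.g. the prompt template reported by inspect-ai
    prompt_template: Optional[str] = None
    num_demonstrations: Optional[int] = None


meta = EvaluationMetadata(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    generation_config={"temperature": 0.0, "max_tokens": 1024},
    prompt_template="Question: {question}\nAnswer:",
    num_demonstrations=0,
)
```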

@borgr

borgr commented Oct 8, 2025

Different sources? Yes, an API; it might have a name and an ID. I just want to make sure we support it (maybe we do and I missed it?).
We can emit a warning if the model is not on HF (you can check this without actually downloading it), but it is a nice-to-have (a sketch of such a check follows these points).
Re adding as we go: great, that sounds good; don't solve what you might never care about.
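
Something like the following could cover the warning (a sketch assuming the huggingface_hub client; the helper itself is hypothetical):

```python
import warnings

from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError


def warn_if_not_on_hub(model_name: str) -> None:
    """Warn when model_name does not resolve to a Hub repository."""
    try:
        HfApi().model_info(model_name)  # metadata-only request, nothing is downloaded
    except RepositoryNotFoundError:
        warnings.warn(
            f"Model '{model_name}' was not found on the Hugging Face Hub; "
            "for closed-API models, record the provider name and model id instead."
        )


warn_if_not_on_hub("gpt2")                             # public repo, no warning
warn_if_not_on_hub("some-provider/nonexistent-model")  # would warn
```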

We want theirs to be the basis: what is required there is required here, and you have more fields that you can keep but that are not necessary. They match, but they do have very different users who might upload, and there is no reason to throw away important eval data.

generation_config looks good. I didn't see where it was defined (I've now looked and see that you use it), so I only saw that you deleted something with a similar name (generation args?). So this is solved.

@damian1996
Collaborator Author

damian1996 commented Oct 14, 2025

I've adopted Anastassia's newest changes in the leaderboard schema. Added an option for the user to provide source metadata (besides the Inspect eval log). Added a README for Inspect as well.
For your requested changes:

  1. Added better descriptions for the dataset source (an HF dataset or a URL to another dataset source is supported).
  2. Added a better description for model_name and added a warning if the model is not available on Hugging Face.

And I have one problem which I'm not sure how to solve:

  1. In our case we log all metric results (like accuracy, F1) as List[EvaluationResult], and as detailed_evaluation_results we want to log results for all samples. Here we want to log detailed_evaluation_results only once (and all the metrics above are based on these detailed results).
  2. In the leaderboard schema, detailed_evaluation_results should probably be logged multiple times (once per EvaluationResult). An example is SWE-bench, where there are a few different leaderboard subparts (Bash Only, Verified, Lite, Full, Multimodal).

@borgr Do you have any thoughts about this? Maybe we should enable both ways of keeping detailed_evaluation_results (eval-level detailed results and metric-level detailed results)?
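
To illustrate the two options, here is a rough sketch (keys and values are illustrative, not the schema's actual field names or real results):

```python
# Option 1: eval-level detailed results, logged once and shared by all metrics
eval_level = {
    "results": [
        {"metric_name": "accuracy", "value": 0.81},
        {"metric_name": "f1", "value": 0.78},
    ],
    # one entry per sample; every metric above is computed from these
    "detailed_evaluation_results": [
        {"sample_id": "0", "target": "A", "prediction": "A"},
        {"sample_id": "1", "target": "B", "prediction": "C"},
    ],
}

# Option 2: metric-level detailed results (leaderboard style), repeated for each
# EvaluationResult, e.g. one block per SWE-bench subpart
metric_level = {
    "results": [
        {
            "metric_name": "resolved_rate",
            "subset": "Verified",
            "value": 0.43,
            "detailed_evaluation_results": [{"sample_id": "task-001", "resolved": True}],
        },
        {
            "metric_name": "resolved_rate",
            "subset": "Lite",
            "value": 0.51,
            "detailed_evaluation_results": [{"sample_id": "task-002", "resolved": False}],
        },
    ],
}
```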

@damian1996 requested a review from borgr on October 15, 2025