
Conversation

@damian1996
Collaborator

No description provided.


@borgr left a comment


What if the data is not from HF? Do we need a default there besides the string? An optional URL + description? Something else?
Should we document or suggest somewhere that the model name should ideally be the HF name, or model + ID for closed APIs?
What is "levels", a hierarchy? And what do you do with, for example, ROUGE? Split it into 3 separate metrics (ROUGE-1, -2, -L)? And Kendall (correlation, p-value)? (See the sketch after these points.)
I understand why we removed all the enums. But some information is found in the runners; should we record it? The downside is the amount of data we store and holes in the data; the upside is that if we only need to do it once per eval platform, and certain traits are already recorded there, why not? (For per-benchmark fields there is a good reason not to: it would cause people not to contribute.) Specifically, inference parameters, num_demonstrations and some prompt details are not commonly reported anyway.
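
For example, splitting ROUGE into flat metrics could look roughly like this (a minimal sketch; the EvaluationResult fields shown here are illustrative, not the schema's actual definitions):

```python
from dataclasses import dataclass


@dataclass
class EvaluationResult:  # simplified stand-in for the schema's result type
    metric_name: str
    value: float


# One flat entry per ROUGE variant instead of a nested/hierarchical metric
rouge_results = [
    EvaluationResult(metric_name="rouge1", value=0.412),
    EvaluationResult(metric_name="rouge2", value=0.198),
    EvaluationResult(metric_name="rougeL", value=0.377),
]

# Kendall could be handled the same way: two flat entries
kendall_results = [
    EvaluationResult(metric_name="kendall_tau", value=0.63),
    EvaluationResult(metric_name="kendall_pvalue", value=0.01),
]
```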

@damian1996
Collaborator Author

Do you know of any other popular dataset sources, similar to HF? At the moment there are already two options (URL or HF data source). Maybe we should add a good description for it and that will be enough.

Yeah, a good description for model_name is required. Do you think we need any way of verifying the model name, for example checking whether the model even exists?

Yeah, for ROUGE I would rather split it into 3 separate metrics.

About the info found in the runners: do we want to add this additional info from the runners alongside adding the next platforms (like lm-eval, helm, ...)? As optional fields? So for now we keep only the fields important for inspect-ai and extend the schema later to support the next platforms.

But if we want to have one common schema for eval platforms and for leaderboards, then I don't know if it is a good option to complicate it too much. Avijit and Anastassia would rather have a fairly compact and simple schema.

Inference parameters were intended to go into generation_config: Optional[Dict[str, Any]] = None; probably the naming isn't optimal? inference_config or so? About prompt details: in inspect-ai I've seen a prompt template. It can probably be found in most eval platforms.
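
To make that concrete, here is a minimal sketch of how inference parameters and a prompt template could sit together in the metadata (pydantic is assumed; the container name EvaluationMetadata and the exact field shapes are hypothetical, only generation_config comes from the current schema):

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel


class EvaluationMetadata(BaseModel):  # hypothetical container name
    model_name: str
    # free-form inference/generation parameters (temperature, top_p, max_tokens, ...)
    generation_config: Optional[Dict[str, Any]] = None
    # prompt details, e.g. the prompt template reported by inspect-ai
    prompt_template: Optional[str] = None
    num_demonstrations: Optional[int] = None


meta = EvaluationMetadata(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    generation_config={"temperature": 0.0, "max_tokens": 1024},
    prompt_template="Question: {question}\nAnswer:",
    num_demonstrations=0,
)
```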

@borgr

borgr commented Oct 8, 2025

Different sources? Yes, an API; it might have a name and an ID. I just want to make sure we support it (maybe we do and I missed it?).
We can emit a warning if the model is not on HF (you can check this without actually downloading it), but it is a nice-to-have (a sketch of such a check follows these points).
Re adding as we go: great, that sounds good; don't solve what you might never care about.
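
Something like the following could cover the warning (a sketch assuming the huggingface_hub client; the helper itself is hypothetical):

```python
import warnings

from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError


def warn_if_not_on_hub(model_name: str) -> None:
    """Warn when model_name does not resolve to a Hub repository."""
    try:
        HfApi().model_info(model_name)  # metadata-only request, nothing is downloaded
    except RepositoryNotFoundError:
        warnings.warn(
            f"Model '{model_name}' was not found on the Hugging Face Hub; "
            "for closed-API models, record the provider name and model id instead."
        )


warn_if_not_on_hub("gpt2")                             # public repo, no warning
warn_if_not_on_hub("some-provider/nonexistent-model")  # would warn
```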

We want theirs to be the basis: what is required there is required here, and you have more fields that you can keep but that are not necessary. They match, but they do have very different users who might upload, and there is no reason to throw away important eval data.

generation_config looks good. I didn't see where it was defined (I've now looked and see that you use it), so I only saw that you deleted something with a similar name (generation args?). So this is solved.

@damian1996
Collaborator Author

damian1996 commented Oct 14, 2025

I've adopted Anastassia's newest changes in the leaderboard schema. Added an option for the user to provide source metadata (besides the Inspect eval log). Added a README for Inspect as well.
For your requested changes:

  1. Added better descriptions for the dataset source (an HF dataset or a URL to another dataset source is supported).
  2. Added a better description for model_name and added a warning if the model is not available on Hugging Face.

And I have one problem which I'm not sure how to solve:

  1. In our case we log all metric results (like accuracy, F1) as List[EvaluationResult], and as detailed_evaluation_results we want to log results for all samples. Here we want to log detailed_evaluation_results only once (and all the metrics above are based on these detailed results).
  2. In the leaderboard schema, detailed_evaluation_results should probably be logged multiple times (once per EvaluationResult). An example is SWE-bench, where there are a few different leaderboard subparts (Bash Only, Verified, Lite, Full, Multimodal).

@borgr Do you have any thoughts about this? Maybe we should enable both ways of keeping detailed_evaluation_results (eval-level detailed results and metric-level detailed results)?
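
To illustrate the two options, here is a rough sketch (keys and values are illustrative, not the schema's actual field names or real results):

```python
# Option 1: eval-level detailed results, logged once and shared by all metrics
eval_level = {
    "results": [
        {"metric_name": "accuracy", "value": 0.81},
        {"metric_name": "f1", "value": 0.78},
    ],
    # one entry per sample; every metric above is computed from these
    "detailed_evaluation_results": [
        {"sample_id": "0", "target": "A", "prediction": "A"},
        {"sample_id": "1", "target": "B", "prediction": "C"},
    ],
}

# Option 2: metric-level detailed results (leaderboard style), repeated for each
# EvaluationResult, e.g. one block per SWE-bench subpart
metric_level = {
    "results": [
        {
            "metric_name": "resolved_rate",
            "subset": "Verified",
            "value": 0.43,
            "detailed_evaluation_results": [{"sample_id": "task-001", "resolved": True}],
        },
        {
            "metric_name": "resolved_rate",
            "subset": "Lite",
            "value": 0.51,
            "detailed_evaluation_results": [{"sample_id": "task-002", "resolved": False}],
        },
    ],
}
```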

@damian1996 requested a review from borgr on October 15, 2025