Live Benchmark on Hugging Face Spaces
OpenGVL: Benchmarking Visual Temporal Progress for Data Curation
OpenGVL provides a benchmark and toolkit to evaluate how well vision-language models (VLMs) understand temporal progress in robotic tasks. It enables automatic annotation and curation of large-scale robotics datasets by predicting task completion from video frames, making it practical for data quality assessment and filtering.
OpenGVL exposes a simple, unified interface across VLMs and data sources, making it a solid foundation for research and practical deployments of Generative Value Learning in robotics and related domains.
- Evaluate temporal understanding of VLMs with a principled metric (Value-Order Correlation, VOC).
- Curate datasets at scale by estimating per-frame task completion.
- Standardize prompts, images, and outputs across multiple models and datasets.
- Load evaluation episodes (shuffled frames) and optional context episodes (ordered, with known progress).
- Build a prompt with few-shot examples.
- Query a chosen VLM with images + prompt.
- Parse the VLM’s textual outputs into per-frame completion percentages.
- Compute VOC/metrics against ground truth and save results.
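For intuition, here is a minimal, standalone sketch of a VOC-style score, assuming VOC is the rank correlation between predicted completion values and the frames' true chronological order (OpenGVL's own implementation may differ in the details):

```python
# Illustrative only: a VOC-style score as the Spearman rank correlation
# between predicted per-frame completion and true chronological order.
from scipy.stats import spearmanr


def value_order_correlation(predicted: list[float], true_order: list[int]) -> float:
    """Rank-correlate predicted completion values with chronological frame positions."""
    corr, _ = spearmanr(true_order, predicted)
    return float(corr)


# Frames are shown shuffled; true_order holds each shown frame's chronological position.
predicted = [50, 100, 25]   # model's per-frame completion estimates (%)
true_order = [1, 2, 0]      # chronological positions of the shuffled frames
print(value_order_correlation(predicted, true_order))  # 1.0 -> perfectly order-consistent
```

A score near 1 means the predicted values increase with actual task progress; a score near 0 indicates the model extracted little temporal signal.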
After setup (see Getting Started), run a prediction using the experiment config at configs/experiments/predict.yaml:
HYDRA_FULL_ERROR=1 PYTHONPATH=. uv run python3 -m opengvl.scripts.predict \
--config-dir configs/experiments \
--config-name predict
Results are saved under outputs/DATE_TIME/ with predictions, raw outputs, and metrics.
Tip: you can override any config at the CLI, e.g. model.temperature=0.5.
- Python 3.11+
- uv (recommended) or pip for package management
- Clone the repository:
git clone https://github.com/budzianowski/opengvl.git
cd opengvl
- Set up the environment and verify the installation (on first use, uv run creates the virtual environment and installs dependencies):
PYTHONPATH=. uv run python3 -c "print('all packages installed')"  # in root of repository
Create a .env file in the project root:
cp .env.example .env
Then edit .env with your credentials:
OPENAI_API_KEY="your-openai-api-key"
GOOGLE_API_KEY="your-google-api-key"
HUGGING_FACE_HUB_TOKEN="your-hugging-face-token"
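The scripts expect these credentials to be available as environment variables (loaded from .env or exported in your shell). A quick standalone sanity check, not part of OpenGVL, might look like this:

```python
# check_env.py -- illustrative helper to confirm the required credentials are visible.
import os

REQUIRED = ["OPENAI_API_KEY", "GOOGLE_API_KEY", "HUGGING_FACE_HUB_TOKEN"]

missing = [key for key in REQUIRED if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All credentials found.")
```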
Each prediction uses:
- A prompt constructed from a template (e.g., configs/prompts/concise.yaml) plus dataset-specific instructions. Example instruction: “Task: Pick up the blue block and place it in the red bowl. Estimate task completion (0–100%) per frame. Frames can be shuffled.”
- A set of images:
- Evaluation episode: shuffled frames to estimate completion.
- Optional context episodes: complete, ordered episodes with known percentages for few-shot guidance.
The VLM returns a text response with per-frame percentages. extract_percentages() in opengvl/utils/inference.py parses the string into a list of integers, e.g., “Frame 1: 50%, Frame 2: 100%, Frame 3: 25%” → [50, 100, 25].
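As a rough illustration (not the actual implementation), parsing of this kind can be done with a single regular expression:

```python
# Illustrative sketch of the parsing step; the real extract_percentages()
# in opengvl/utils/inference.py may handle more response formats.
import re


def parse_completion_percentages(response: str) -> list[int]:
    """Pull per-frame completion percentages out of a VLM text response."""
    # Matches e.g. "Frame 3: Task Completion: 75%" or "Frame 3: 75%".
    return [int(p) for p in re.findall(r"Frame\s*\d+[^%\d]*?(\d{1,3})\s*%", response)]


print(parse_completion_percentages("Frame 1: 50%, Frame 2: 100%, Frame 3: 25%"))
# -> [50, 100, 25]
```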
For reproducible, portable runs:
Prerequisites:
- Apptainer
- NVIDIA drivers on the host for GPU support (for locally inferred models)
Build the image:
apptainer build opengvl.sif apptainer/opengvl.def
Run the quick start prediction with GPU:
apptainer run --nv opengvl.sif python -m opengvl.scripts.predict \
--config-dir configs/experiments \
--config-name predict
Tip: pass environment variables via --env flags or export them in your shell before apptainer run.
Configuration lives in configs/:
- configs/model/: model configs (e.g., gemini.yaml, gemma.yaml, openai.yaml)
- configs/dataset/: dataset configs
- configs/data_loader/: data loader configs (e.g., huggingface.yaml, local.yaml)
- configs/prompts/: prompt styles
- configs/experiments/: complete experiment presets (e.g., predict.yaml)
Override parameters from the command line. Examples:
# Run with explicit experiment config
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict --config-dir configs/experiments --config-name predict
# Override individual fields
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict --config-dir configs/experiments --config-name predict \
model=gemini dataset=berkeleymvp data_loader=huggingface model.temperature=0.5
OpenGVL clients inherit from opengvl.clients.base.BaseModelClient. You only need to implement _generate_from_events(self, events: list[Event]) -> str, which receives a provider-agnostic sequence of text/image events already assembled by the framework. See opengvl/clients/gemini.py for a complete reference implementation.
- Implement a client in opengvl/clients/my_model.py:
# opengvl/clients/my_model.py (concise example)
import os
from typing import cast, List

from loguru import logger

from opengvl.clients.base import BaseModelClient
from opengvl.utils.aliases import Event, ImageEvent, ImageT, TextEvent
from opengvl.utils.images import encode_image


class MyModelClient(BaseModelClient):
    def __init__(self, *, rpm: float = 0.0, model_name: str):
        super().__init__(rpm=rpm)
        if not os.getenv("MY_MODEL_API_KEY"):
            raise OSError("Missing MY_MODEL_API_KEY")
        self.model_name = model_name
        logger.info(f"Using MyModel '{self.model_name}'")

    def _generate_from_events(self, events: List[Event]) -> str:
        parts: List[bytes | str] = []
        for ev in events:
            if isinstance(ev, TextEvent):
                parts.append(ev.text)
            elif isinstance(ev, ImageEvent):
                parts.append(encode_image(cast(ImageT, ev.image)))
        # Call your provider with `parts` and return the provider's text response.
        # Placeholder response for docs/tests:
        return "Frame 1: Task Completion: 50%\nFrame 2: Task Completion: 100%"

- Add a Hydra config at configs/model/my_model.yaml:
_target_: opengvl.clients.my_model.MyModelClient
model_name: my-model-name
rpm: 15  # requests per minute (rate limiter)
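The _target_ key tells Hydra which class to build; OpenGVL handles instantiation for you when you run the predict script. Purely for illustration, the equivalent manual step with Hydra's API looks roughly like this (assuming MY_MODEL_API_KEY is set):

```python
# Illustrative only: how a _target_-style config is turned into a client instance.
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "_target_": "opengvl.clients.my_model.MyModelClient",
        "model_name": "my-model-name",
        "rpm": 15,
    }
)
client = instantiate(cfg)  # calls MyModelClient(model_name="my-model-name", rpm=15)
```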
- Use your model via CLI or experiment config:
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict \
--config-dir configs/experiments \
--config-name predict \
model=my_model
Create a dataset config that matches the keys used by our Hugging Face loader (configs/data_loader/huggingface.yaml). Example:
# configs/dataset/my_dataset.yaml
name: my_dataset
dataset_name: "org-or-user/my_dataset_on_hub"
camera_index: 0
max_episodes: 100
num_frames: 15
num_context_episodes: 2
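For intuition about these fields, here is a rough standalone sketch, not OpenGVL's loader code, of how num_frames frames might be subsampled from an episode, with ordered context frames labeled by ground-truth completion (assumed here, for illustration, to be the fraction of the episode elapsed) and the evaluation frames shuffled:

```python
# Illustrative only: one plausible reading of num_frames and context episodes.
import random


def subsample_indices(episode_len: int, num_frames: int) -> list[int]:
    """Pick num_frames roughly evenly spaced frame indices from an episode."""
    step = max(1, episode_len // num_frames)
    return list(range(0, episode_len, step))[:num_frames]


episode_len = 90
idx = subsample_indices(episode_len, num_frames=15)

# Context episode: ordered frames with known completion percentages
# (here assumed proportional to elapsed frames -- an illustrative choice).
context = [(i, round(100 * i / (episode_len - 1))) for i in idx]

# Evaluation episode: same sampling, but frames are shuffled before prompting.
eval_indices = subsample_indices(episode_len, num_frames=15)
random.shuffle(eval_indices)
print(context[:3], eval_indices[:5])
```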
Then choose a loader (e.g., Hugging Face) in your experiment or via CLI:
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict \
--config-dir configs/experiments \
--config-name predict \
dataset=my_dataset data_loader=huggingface
Prompts are split into two layers:
- A high-level prompt template (under configs/prompts/) with free-form text.
- Structured prompt_phrases (under configs/prompt_phrases/) with required keys validated by the code.
- Create a new prompt template file:
# configs/prompts/my_prompt.yaml
template: |
You are an expert roboticist. Predict task completion percentages for the task: {instruction}.
Percentages are in [0, 100], where 100 is full completion. Frames may be shuffled.
For each frame that does NOT already have a completion percentage provided,
output strictly: "Frame {{i}}: Task Completion: {{p}}%".
Be precise and consistent; do not include extra text.
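A note on the doubled braces: assuming the template is rendered with Python's str.format (an assumption about the internals), single braces such as {instruction} are substituted at prompt-build time, while doubled braces such as {{i}} survive as literal {i} for the model to fill in:

```python
# Why the template doubles some braces (illustrative).
template = (
    "Predict task completion for the task: {instruction}.\n"
    'Output strictly: "Frame {{i}}: Task Completion: {{p}}%".'
)
print(template.format(instruction="pick up the blue block"))
# Predict task completion for the task: pick up the blue block.
# Output strictly: "Frame {i}: Task Completion: {p}%".
```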
- (Optional) Create a custom phrase set. The keys must match PromptPhraseKey in opengvl/utils/constants.py:
# configs/prompt_phrases/my_style.yaml
initial_scene_label: "Initial robot scene:"
initial_scene_completion: "In the initial robot scene, the task completion percentage is 0%."
context_frame_label_template: "Frame {i}:"
context_frame_completion_template: "Task Completion: {p}%"
eval_frame_label_template: "Frame {i}:"
eval_task_completion_instruction:
- "Now, for the task of {instruction}, output the task completion percentage for the following frames that are presented in random order. For each frame, format your response as follows: Frame {{i}}: Task Completion: {{p}}%"
- "Be rigorous and precise; percentage reflects task completion."
- "Remember: frames are in random order."- Use your custom prompt in an experiment or from CLI:
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict \
--config-dir configs/experiments \
--config-name predict \
prompts=my_prompt prompt_phrases=my_style
Notes:
- The framework automatically numbers frames across context and evaluation. Your instructions should make it explicit that only frames without provided percentages should be predicted (see our rigorous prompt for a safe pattern).
- The phrase keys are required; missing/empty keys will raise a clear ValueError before calling the model.
- macOS library path: export DYLD_FALLBACK_LIBRARY_PATH=/opt/homebrew/lib
- GPU OOM (CUDA): reduce batch_size or image resolution in the model config (e.g., configs/model/gemini.yaml).
- Hugging Face authentication: ensure HUGGING_FACE_HUB_TOKEN is set in .env for gated models/private datasets.
- API rate limits: consider lowering concurrency or increasing TQDM_MININTERVAL when applicable.
If you use OpenGVL in your research, please cite:
@misc{budzianowski2025opengvlbenchmarkingvisual,
title={OpenGVL - Benchmarking Visual Temporal Progress for Data Curation},
author={Paweł Budzianowski and Emilia Wiśnios and Gracjan Góral and Michał Tyrolski and Igor Kulakov and Viktor Petrenko and Krzysztof Walas},
year={2025},
eprint={2509.17321},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2509.17321},
}
We thank the broader open-source community and prior work that inspired OpenGVL:
- Foundational research on Generative Value Learning (OpenGVL paper)
- LeRobot for dataset infrastructure
- Hydra for configuration management
- Hugging Face for dataset hosting/model access
This project is licensed under the MIT License. See LICENSE for details.