
OpenGVL: Benchmarking Visual Temporal Progress for Data Curation


Live Benchmark on Hugging Face Spaces



About

OpenGVL provides a benchmark and toolkit to evaluate how well vision-language models (VLMs) understand temporal progress in robotic tasks. It enables automatic annotation and curation of large-scale robotics datasets by predicting task completion from video frames, making it practical for data quality assessment and filtering.

OpenGVL exposes a simple, unified interface across VLMs and data sources, making it a solid foundation for research and practical deployments of Generative Value Learning in robotics and related domains.

Why OpenGVL?

  • Evaluate temporal understanding of VLMs with a principled metric (Value-Order Correlation, VOC).
  • Curate datasets at scale by estimating per-frame task completion.
  • Standardize prompts, images, and outputs across multiple models and datasets.

How it works (high level)

  1. Load evaluation episodes (shuffled frames) and optional context episodes (ordered, with known progress).
  2. Build a prompt with few-shot examples.
  3. Query a chosen VLM with images + prompt.
  4. Parse the VLM’s textual outputs into per-frame completion percentages.
  5. Compute VOC/metrics against ground truth and save results.
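
The VOC in step 5 is, at its core, a rank correlation: if a model understands temporal progress, its predicted completion percentages should increase with the frames' true chronological order. The sketch below illustrates that idea with a Spearman rank correlation; it is an illustration under that assumption, not the repository's exact metric code (see the paper and opengvl's source for the authoritative definition).

# Illustrative sketch only: VOC viewed as a rank correlation between
# predicted per-frame completion and the frames' true chronological order.
# The repository's exact metric implementation may differ.
from scipy.stats import spearmanr

def voc_sketch(predicted_completion: list[int], true_order: list[int]) -> float:
    corr, _ = spearmanr(predicted_completion, true_order)
    return float(corr)

predicted = [50, 100, 25]   # model output per shuffled frame
true_positions = [1, 2, 0]  # chronological position of each shuffled frame
print(voc_sketch(predicted, true_positions))  # 1.0: ordering fully recovered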

Quick Start

After setup (see Getting Started), run a prediction using the experiment config at configs/experiments/predict.yaml:

HYDRA_FULL_ERROR=1 PYTHONPATH=. uv run python3 -m opengvl.scripts.predict \
  --config-dir configs/experiments \
  --config-name predict

Results are saved under outputs/DATE_TIME/ with predictions, raw outputs, and metrics.

Tip: you can override any config at the CLI, e.g. model.temperature=0.5.


Getting Started

Prerequisites

  • Python 3.11+
  • uv (recommended) or pip for package management

Installation

  1. Clone the repository:

    git clone https://github.com/budzianowski/opengvl.git
    cd opengvl
  2. Set up the environment and verify the installation (uv creates the virtual environment and installs dependencies on the first run):

    PYTHONPATH=. uv run python3 -c "print('all packages installed')" # in root of repository

Environment Variables

Create a .env file in the project root:

cp .env.example .env

Then edit .env with your credentials:

OPENAI_API_KEY="your-openai-api-key"
GOOGLE_API_KEY="your-google-api-key"
HUGGING_FACE_HUB_TOKEN="your-hugging-face-token"
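
To confirm the keys are actually visible to Python, a quick check such as the one below works; it assumes python-dotenv is installed in your environment (adapt it if your setup loads .env differently).

# Quick check that credentials from .env are visible to Python.
# Assumes python-dotenv is installed; adapt if your setup loads .env differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "HUGGING_FACE_HUB_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")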

VLM Input, Output, and Few-Shot Learning

Each prediction uses:

  1. A prompt constructed from a template (e.g., configs/prompts/concise.yaml) plus dataset-specific instructions. Example instruction: “Task: Pick up the blue block and place it in the red bowl. Estimate task completion (0–100%) per frame. Frames can be shuffled.”
  2. A set of images:
    • Evaluation episode: shuffled frames to estimate completion.
    • Optional context episodes: complete, ordered episodes with known percentages for few-shot guidance.

The VLM returns a text response with per-frame percentages. extract_percentages() in opengvl/utils/inference.py parses the string into a list of integers, e.g., “Frame 1: 50%, Frame 2: 100%, Frame 3: 25%” → [50, 100, 25].
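
For intuition, the parsing step boils down to pulling integer percentages out of the response text. The snippet below is a simplified sketch of that idea, not the actual extract_percentages implementation, which handles more output variations.

# Simplified sketch of percentage extraction from a VLM response.
# The real extract_percentages() in opengvl/utils/inference.py is more robust.
import re

def extract_percentages_sketch(response: str) -> list[int]:
    return [int(p) for p in re.findall(r"(\d{1,3})\s*%", response)]

print(extract_percentages_sketch("Frame 1: 50%, Frame 2: 100%, Frame 3: 25%"))
# -> [50, 100, 25]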


Running with Apptainer/Singularity

For reproducible, portable runs:

Prerequisites:

  • Apptainer
  • NVIDIA drivers on the host for GPU support (needed only for locally run models)

Build the image:

apptainer build opengvl.sif apptainer/opengvl.def

Run the quick start prediction with GPU:

apptainer run --nv opengvl.sif python -m opengvl.scripts.predict \
  --config-dir configs/experiments \
  --config-name predict

Tip: pass environment variables with --env flags, or export them in your shell before running apptainer run.


Configuration (Hydra)

Configuration lives in configs/:

  • configs/model/: model configs (e.g., gemini.yaml, gemma.yaml, openai.yaml)
  • configs/dataset/: dataset configs
  • configs/data_loader/: data loader configs (e.g., huggingface.yaml, local.yaml)
  • configs/prompts/: prompt styles
  • configs/experiments/: complete experiment presets (e.g., predict.yaml)

Override parameters from the command line. Examples:

# Run with explicit experiment config
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict --config-dir configs/experiments --config-name predict

# Override individual fields
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict --config-dir configs/experiments --config-name predict \
  model=gemini dataset=berkeleymvp data_loader=huggingface model.temperature=0.5
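
The CLI is the supported entry point, but when debugging config composition it can help to build the same configuration programmatically with Hydra's compose API. The sketch below assumes predict.yaml's defaults resolve from configs/experiments relative to the calling file, which may not match the repository's search-path setup; treat it as a rough illustration only.

# Rough sketch: inspect the composed config via Hydra's compose API.
# Assumes the defaults in predict.yaml resolve from configs/experiments
# relative to this file; the CLI above remains the supported entry point.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="configs/experiments"):
    cfg = compose(config_name="predict", overrides=["model.temperature=0.5"])
    print(OmegaConf.to_yaml(cfg))  # fully resolved configuration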

Extending OpenGVL

Adding a New Model

OpenGVL clients inherit from opengvl.clients.base.BaseModelClient. You only need to implement _generate_from_events(self, events: list[Event]) -> str, which receives a provider-agnostic sequence of text/image events already assembled by the framework. See opengvl/clients/gemini.py for a complete reference implementation.

  1. Implement a client in opengvl/clients/my_model.py:
# opengvl/clients/my_model.py (concise example)
import os
from typing import cast, List

from loguru import logger

from opengvl.clients.base import BaseModelClient
from opengvl.utils.aliases import Event, ImageEvent, ImageT, TextEvent
from opengvl.utils.images import encode_image


class MyModelClient(BaseModelClient):
  def __init__(self, *, rpm: float = 0.0, model_name: str):
    super().__init__(rpm=rpm)
    if not os.getenv("MY_MODEL_API_KEY"):
      raise OSError("Missing MY_MODEL_API_KEY")
    self.model_name = model_name
    logger.info(f"Using MyModel '{self.model_name}'")

  def _generate_from_events(self, events: List[Event]) -> str:
    parts: List[bytes | str] = []
    for ev in events:
      if isinstance(ev, TextEvent):
        parts.append(ev.text)
      elif isinstance(ev, ImageEvent):
        parts.append(encode_image(cast(ImageT, ev.image)))

    # Call your provider with `parts` and return the provider's text response.
    # Placeholder response for docs/tests:
    return "Frame 1: Task Completion: 50%\nFrame 2: Task Completion: 100%"
  2. Add a Hydra config at configs/model/my_model.yaml:
_target_: opengvl.clients.my_model.MyModelClient
model_name: my-model-name
rpm: 15  # requests per minute (rate limiter)
  3. Use your model via CLI or experiment config:
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict \
  --config-dir configs/experiments \
  --config-name predict \
  model=my_model

Adding a New Dataset

Create a dataset config that matches the keys used by our HuggingFace loader (configs/data_loader/huggingface.yaml). Example:

# configs/dataset/my_dataset.yaml
name: my_dataset
dataset_name: "org-or-user/my_dataset_on_hub"
camera_index: 0
max_episodes: 100
num_frames: 15
num_context_episodes: 2

Then choose a loader (e.g., Hugging Face) in your experiment or via CLI:

PYTHONPATH=. uv run python3 -m opengvl.scripts.predict \
  --config-dir configs/experiments \
  --config-name predict \
  dataset=my_dataset data_loader=huggingface
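
Before wiring the new dataset into an experiment, it can be worth sanity-checking the config file itself. A minimal check using only OmegaConf (the library Hydra builds on) might look like this; the key names mirror the example above and are not an exhaustive schema.

# Minimal sanity check for a new dataset config, using only OmegaConf.
# Key names mirror the example above; the loader may expect additional fields.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/dataset/my_dataset.yaml")
for key in ("name", "dataset_name", "camera_index", "max_episodes",
            "num_frames", "num_context_episodes"):
    assert key in cfg, f"missing key: {key}"
print(OmegaConf.to_yaml(cfg))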

Modifying Prompts (Templates and Phrases)

Prompts are split into two layers:

  • A high-level prompt template (under configs/prompts/) with free-form text.
  • Structured prompt_phrases (under configs/prompt_phrases/) with required keys validated by the code.
  1. Create a new prompt template file:
# configs/prompts/my_prompt.yaml
template: |
  You are an expert roboticist. Predict task completion percentages for the task: {instruction}.
  Percentages are in [0, 100], where 100 is full completion. Frames may be shuffled.
  For each frame that does NOT already have a completion percentage provided,
  output strictly: "Frame {{i}}: Task Completion: {{p}}%".
  Be precise and consistent; do not include extra text.
  2. (Optional) Create a custom phrase set. The keys must match PromptPhraseKey in opengvl/utils/constants.py:
# configs/prompt_phrases/my_style.yaml
initial_scene_label: "Initial robot scene:"
initial_scene_completion: "In the initial robot scene, the task completion percentage is 0%."
context_frame_label_template: "Frame {i}:"
context_frame_completion_template: "Task Completion: {p}%"
eval_frame_label_template: "Frame {i}:"
eval_task_completion_instruction:
  - "Now, for the task of {instruction}, output the task completion percentage for the following frames that are presented in random order. For each frame, format your response as follows: Frame {{i}}: Task Completion: {{p}}%"
  - "Be rigorous and precise; percentage reflects task completion."
  - "Remember: frames are in random order."
  3. Use your custom prompt in an experiment or from CLI:
PYTHONPATH=. uv run python3 -m opengvl.scripts.predict \
  --config-dir configs/experiments \
  --config-name predict \
  prompts=my_prompt prompt_phrases=my_style

Notes:

  • The framework automatically numbers frames across context and evaluation. Your instructions should make it explicit that only frames without provided percentages should be predicted (see our rigorous prompt for a safe pattern).
  • The phrase keys are required; missing/empty keys will raise a clear ValueError before calling the model.
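
The double braces in the templates above ({{i}}, {{p}}) are deliberate: with Python str.format-style substitution they come out as literal {i} and {p}, so the model sees the required output format verbatim while {instruction} is filled in. A tiny illustration, under the assumption that templates are filled with str.format:

# Illustration of brace escaping in prompt templates, assuming
# str.format-style substitution is used to fill {instruction}.
template = (
    "Predict task completion for the task: {instruction}.\n"
    'Output strictly: "Frame {{i}}: Task Completion: {{p}}%".'
)
print(template.format(instruction="pick up the blue block"))
# Escaped braces survive as literals:
#   Output strictly: "Frame {i}: Task Completion: {p}%".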

Known Issues & Troubleshooting

  • macOS library path:
    export DYLD_FALLBACK_LIBRARY_PATH=/opt/homebrew/lib
  • GPU OOM (CUDA): reduce batch_size or image resolution in the model config (e.g., configs/model/gemini.yaml).
  • Hugging Face authentication: ensure HUGGING_FACE_HUB_TOKEN is set in .env for gated models/private datasets.
  • API rate limits: lower the model's rpm (requests per minute) setting in its config, or reduce request concurrency.

Citation

If you use OpenGVL in your research, please cite:

@misc{budzianowski2025opengvlbenchmarkingvisual,
      title={OpenGVL - Benchmarking Visual Temporal Progress for Data Curation}, 
      author={Paweł Budzianowski and Emilia Wiśnios and Gracjan Góral and Michał Tyrolski and Igor Kulakov and Viktor Petrenko and Krzysztof Walas},
      year={2025},
      eprint={2509.17321},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.17321}, 
}

Acknowledgments

We thank the broader open-source community and the prior work that inspired OpenGVL.


License

This project is licensed under the MIT License. See LICENSE for details.
