We are building a general framework for RLVR in VLMs. We believe in the power of trenches and longtermism.
Our Interest: General Vision-Language Intelligence & Visual/GUI Agent
Our Goal: 🔄 Algorithm Enhancement ⚡ Efficiency Optimization 🎯 Task Diversity 🌲 Impactful Open Source Research.
Ideas and contributions are welcome. Stay tuned!
- We are the first to show that Reinforcement Learning with Verifiable Rewards (RLVR) outperforms chain-of-thought supervised fine-tuning (CoT-SFT) in both effectiveness and out-of-distribution (OOD) robustness for vision-language models.
- In our experiments, we incentivize VLMs to learn generalizable visual counting abilities rather than overfit to the training set.
- The 2B model outperforms the 72B model in OOD tests within just 100 training steps.
- The training was conducted on 8 A100 GPUs for 30 minutes, costing $2.62.
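As a quick sanity check on the cost figure, the numbers above imply a per-GPU hourly rate. The sketch below only rearranges the stated figures; the variable names are illustrative, and the resulting rate is derived, not quoted from any provider.

```python
# Figures taken from the summary above.
num_gpus = 8
hours = 30 / 60            # 30 minutes of training
total_cost_usd = 2.62

gpu_hours = num_gpus * hours                 # total A100 time consumed
implied_rate = total_cost_usd / gpu_hours    # USD per A100-hour implied by the total

print(f"{gpu_hours:.1f} GPU-hours at about ${implied_rate:.3f} per GPU-hour")
```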
Blogs:
🎯 RLVR in Vision Language Models: Findings, Questions and Directions
Resources:
🤗 R1V Training Dataset: CLEVR-70k-Counting
🤗 R1V Training Dataset: CLEVR-70k-Complex
🤗 R1V Training Dataset: GEOQA-8k
🤗 R1-Distilled Visual Reasoning Dataset
R1-V Team:
Liang Chen · Lei Li · Haozhe Zhao · Yifan Song · Vinci · Zihao Yue
Contributors:
- 2025-02-21: We wrote a blog post summarizing the main findings and questions from our visual RLVR experiments. Check it out!
- 2025-02-12: We fixed the batched decoding error. The original RL training script is now 3x faster.
- 2025-02-12: R1-V now supports vLLM to accelerate training (run pip install vllm==0.7.2 before use) and SFT.
- 2025-02-11: R1-V now supports Qwen2.5-VL and GEOQA task.
- 2025-02-06: We upload the evaluation script and polish the README. We are writing a blog post summarizing the statistics, findings and underexplored questions.
- 2025-02-03: We upload the training codebase.
- 2025-02-03: We curate and upload some verified DeepSeek-R1 visual reasoning traces with some special tricks (see R1-V/src/distill_r1/). The current training code does not rely on them, so feel free to explore.
- 2025-02-03: We release the R1-V repo.
- Our top development priority is addressing the issues marked with "help wanted" labels, and we welcome ideas and PRs from the community to help solve them.
Note: In a later experiment, we found that letting the 2B base model output the result directly, instead of following the <think></think><answer></answer> template, leads to a much higher score (86%) on SuperCLEVR. This suggests that enforcing chain-of-thought reasoning may be not only unnecessary but potentially detrimental to the 2B model's performance.
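To make the two output styles in the note concrete, here is a minimal sketch of a check that tells a templated completion from a direct answer. The regex and the name `follows_template` are illustrative assumptions and are not taken from the R1-V reward code.

```python
import re

# Hypothetical pattern: one reasoning block followed by one answer block.
TEMPLATE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)

def follows_template(completion: str) -> bool:
    """Return True if the whole completion matches the think/answer template."""
    return TEMPLATE.fullmatch(completion) is not None

templated = "<think>Two cubes and one sphere.</think> <answer>3</answer>"
direct = "3"

print(follows_template(templated), follows_template(direct))
```

A check of this kind can serve as a format signal during RL; the note above suggests that for the 2B model, dropping the template requirement may be the better choice.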
conda create -n r1-v python=3.11 
conda activate r1-v
bash setup.sh
- Qwen2-VL
- Qwen2.5-VL
- 🤗 R1V Training Dataset: CLEVR-70k-Counting: Item Counting Problems
- 🤗 R1V Training Dataset: CLEVR-70k-Complex: Number Related Reasoning
- 🤗 R1V Training Dataset: GEOQA-8k: Geometry Reasoning
- SuperClevr-200: Item Counting Problems
- GeoQA-Test-Direct-Answer-735: Geometry Reasoning
cd src/r1-v
export DEBUG_MODE="true" # Enable Debug if you want to see the rollout of model during RL
export LOG_PATH="./debug_log_2b.txt"
torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir <OUTPUT_DIR> \
    --model_name_or_path <PATH-TO-Qwen2-VL-2B-Instruct> \
    --dataset_name leonardPKU/clevr_cogen_a_train \
    --deepspeed local_scripts/zero3.json \
    --max_prompt_length 512 \
    --max_completion_length 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8   # number of outputs G in GRPO; reducing it speeds up training and lowers memory cost but increases variance
Note
- To reproduce the result, keep per_device_train_batch_size at 1 for now, as there is a known bug in batched training. See the reproduction report here. We realize batching is important for efficiency and are working on a fix with the community.
- If you hit an OOM error, try reducing --num_generations.
- To use vLLM for faster training, please refer to this script. It currently supports only the Qwen2VL model series.
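The --num_generations trade-off mentioned above follows from how GRPO scores each completion relative to the other completions sampled for the same prompt. Below is a minimal sketch, assuming the standard GRPO formulation in which each reward is normalized by the mean and standard deviation of its group. The name `group_advantages`, the `eps` value, and the choice of the population standard deviation are illustrative assumptions, not the R1-V trainer's code.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize the rewards of one prompt's G sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: G = 8 completions with binary verifiable rewards (1 = correct count).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])
```

With a smaller group, the mean and standard deviation are estimated from fewer samples, so the advantages become noisier. If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.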
We also provide SFT code. Please follow the script and edit the config to customize the SFT task.
accelerate launch --config_file src/r1-v/configs/zero2.yaml src/r1-v/src/open_r1/sft.py --config src/r1-v/configs/qwen2vl_sft_config.yaml
We provide an example script to evaluate OOD counting performance on a subset of SuperCLEVR within 1 minute. You can also modify the script and dataset to test on your own dataset.
cd ./src/eval
wget https://www.cs.jhu.edu/~zhuowan/zhuowan/SuperCLEVR/to_be_released/images.zip
unzip images.zip
# change the model path in the script
python test_qwen2vl_counting_superclevr.py 
# tested scores: 
# Qwen2VL-2B-Instruct: 48.0%
# Qwen2VL-2B-Instruct-GRPO-100step: 82.5%
We provide an example script to evaluate on the test set (direct-answer form) of GEOQA.
# prepare images for testing
cd ./src/eval
git lfs install
git clone https://huggingface.co/datasets/Luckyjhg/Geo170K
cd Geo170K
unzip images.zip
# Evaluation script (assumed to live in src/eval; run `cd ..` first if you are still inside Geo170K)
python test_qwen2vl_geoqa.py
# tested scores: 
# Qwen2VL-7B-Instruct: 30.63%
# Qwen2VL-7B-Instruct-GRPO-2epochs: 38.72%
# Qwen2.5VL-3B-Instruct: 35.41%
# Qwen2.5VL-3B-Instruct-GRPO-1epochs: 47.48%
To enable faster inference with multiple GPUs, you can also use the script in R1-V/src/scripts/test_grpo_geoqa_multigpu.sh
bash src/scripts/test_grpo_geoqa_multigpu.sh
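If you adapt the evaluation to your own dataset, the scores reported above are plain accuracies over model outputs. The following is a minimal sketch of such scoring for numeric answers, assuming outputs either wrap the answer in <answer> tags or state it directly. The helper names `extract_number` and `accuracy` are hypothetical, and the parsing rules in the provided evaluation scripts may differ.

```python
import re

def extract_number(text: str):
    """Return the last number in the answer span (or the full text), else None."""
    tagged = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    scope = tagged.group(1) if tagged else text
    numbers = re.findall(r"-?\d+(?:\.\d+)?", scope)
    return float(numbers[-1]) if numbers else None

def accuracy(outputs: list[str], ground_truths: list[float]) -> float:
    """Percentage of outputs whose extracted number equals the ground truth."""
    correct = 0
    for output, truth in zip(outputs, ground_truths):
        prediction = extract_number(output)
        if prediction is not None and prediction == float(truth):
            correct += 1
    return 100.0 * correct / len(ground_truths)

# Illustrative outputs, not real model generations.
outputs = [
    "<think>Two cubes, one sphere.</think><answer>3</answer>",
    "There are 5 cars.",
    "I cannot tell.",
]
print(f"{accuracy(outputs, [3, 4, 2]):.1f}%")
```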
We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal (our initial codebase), CLEVR, SuperCLEVR, and G-LLAVA for providing the open-source resources used to build this project. Special thanks to Kimi and bAInance Labs for supporting computation resources, and to Yuxin Wu, Xinyu Zhou, and Baobao Chang for their valuable advice.
@misc{chen2025r1v,
  author       = {Chen, Liang and Li, Lei and Zhao, Haozhe and Song, Yifan and Vinci},
  title        = {R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than \$3},
  howpublished = {\url{https://github.com/Deep-Agent/R1-V}},
  note         = {Accessed: 2025-02-02},
  year         = {2025}
}