Used by Amazon Web Services
This project is a clean fork of the original veRL project, extended to support vision language models. We thank all the authors of veRL for providing such a high-performance RL training framework.
EasyR1 is efficient and scalable thanks to the design of HybridEngine and the SPMD mode available in recent vLLM releases.
- Supported models
  - Llama3/Qwen2/Qwen2.5/Qwen3 language models
  - Qwen2-VL/Qwen2.5-VL/Qwen3-VL vision language models
  - DeepSeek-R1 distill models
- Supported algorithms
- Supported datasets
  - Any text or vision-text dataset in a specific format (see the example datasets below)
- Supported tricks
  - Padding-free training
  - Resuming from the latest/best checkpoint
  - Wandb & SwanLab & MLflow & TensorBoard tracking

Software requirements:

- Python 3.9+
- transformers>=4.54.0
- flash-attn>=2.4.3
- vllm>=0.8.3
 
We provide a Dockerfile to easily build environments.
We recommend using the pre-built Docker image for EasyR1:

docker pull hiyouga/verl:ngc-th2.8.0-cu12.9-vllm0.11.0
docker run -it --ipc=host --gpus=all hiyouga/verl:ngc-th2.8.0-cu12.9-vllm0.11.0

If your environment does not support Docker, you can use Apptainer instead:

apptainer pull easyr1.sif docker://hiyouga/verl:ngc-th2.8.0-cu12.9-vllm0.11.0
apptainer shell --nv --cleanenv --bind /mnt/your_dir:/mnt/your_dir easyr1.sif

Set USE_MODELSCOPE_HUB=1 to download models from the ModelScope hub.
Hardware requirements (estimated), given as number of GPUs × VRAM per GPU:

| Method | Precision | 1.5B | 3B | 7B | 32B | 72B |
|---|---|---|---|---|---|---|
| GRPO Full Fine-Tuning | AMP | 2×24GB | 4×40GB | 8×40GB | 16×80GB | 32×80GB |
| GRPO Full Fine-Tuning | BF16 | 1×24GB | 1×40GB | 4×40GB | 8×80GB | 16×80GB |
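The sketch below gives a rough sense of why the BF16 row needs fewer GPUs than the AMP row. It counts only the actor's weights, gradients and AdamW moments. The per-parameter byte counts are assumptions, not EasyR1 measurements. For AMP we assume fp32 master weights, fp32 gradients and two fp32 Adam moments. For pure BF16 we assume all four are kept in bf16. Activations, the reference model and the vLLM rollout engine are ignored, so the sketch does not replace the table.

```python
# Rough per-parameter memory for the actor's training state.
# ASSUMPTION: typical byte counts, not values measured from EasyR1.
BYTES_PER_PARAM = {
    # fp32 master weights (4) + fp32 grads (4) + fp32 Adam m and v (4 + 4)
    "amp": 4 + 4 + 4 + 4,
    # bf16 weights (2) + bf16 grads (2) + bf16 Adam m and v (2 + 2)
    "bf16": 2 + 2 + 2 + 2,
}


def training_state_gb(num_params: float, mode: str) -> float:
    """Approximate actor training-state size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[mode] / 1e9


if __name__ == "__main__":
    for size in (1.5e9, 3e9, 7e9):
        amp = training_state_gb(size, "amp")
        bf16 = training_state_gb(size, "bf16")
        print(f"{size / 1e9:.1f}B params: AMP ~{amp:.0f} GB, BF16 ~{bf16:.0f} GB")
```

Under these assumptions a 7B actor needs about 112 GB of state with AMP and about 56 GB in pure BF16. That matches the 2:1 ratio in the table's 7B column. The absolute figures in the table are larger because they cover the whole RL pipeline.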
Note
Use worker.actor.fsdp.torch_dtype=bf16 and worker.actor.optim.strategy=adamw_bf16 to enable bf16 training.
We are working to reduce VRAM usage in RL training; LoRA support will be integrated in upcoming updates.
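The two bf16 options in the note above are dotted keys into a nested configuration. The sketch below shows how such key=value strings can map onto a nested dict. The helper names set_dotted and apply_overrides are hypothetical, not EasyR1 functions. The starting config in the example is also made up for illustration.

```python
from typing import Any, Dict, List


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"]["c"] = value for dotted_key "a.b.c" (hypothetical helper)."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(key, {})
    node[leaf] = value


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply "key=value" strings to a nested config dict in place and return it."""
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        set_dotted(config, key, value)
    return config


if __name__ == "__main__":
    # Hypothetical starting config; the real defaults live in EasyR1's config files.
    cfg = {"worker": {"actor": {"fsdp": {"torch_dtype": None}, "optim": {"strategy": "adamw"}}}}
    apply_overrides(cfg, [
        "worker.actor.fsdp.torch_dtype=bf16",
        "worker.actor.optim.strategy=adamw_bf16",
    ])
    print(cfg["worker"]["actor"])
```

This sketch stores every value as a string and does no validation. A real configuration system would also type-check the keys against a schema.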
Tutorial: Run Qwen2.5-VL GRPO on Geometry3K Dataset in Just 3 Steps
Step 1: Installation

git clone https://github.com/hiyouga/EasyR1.git
cd EasyR1
pip install -e .

Step 2: GRPO training

bash examples/qwen2_5_vl_7b_geo3k_grpo.sh

Step 3: Merge the checkpoint into Hugging Face format

python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor

Tip
If you have trouble connecting to Hugging Face, try export HF_ENDPOINT=https://hf-mirror.com.
If you want to use SwanLab logger, consider using bash examples/qwen2_5_vl_7b_geo3k_swanlab.sh.
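The merge command in the tutorial hard-codes global_step_1. The sketch below picks the highest step to pass to --local_dir instead. It assumes checkpoints follow the layout checkpoints/<project>/<experiment>/global_step_<N>/actor, as in that command. The helper latest_actor_dir is hypothetical and not part of EasyR1.

```python
import re
from pathlib import Path
from typing import Optional

STEP_DIR = re.compile(r"^global_step_(\d+)$")


def latest_actor_dir(exp_dir: str) -> Optional[Path]:
    """Return <exp_dir>/global_step_<max N>/actor, or None if none exists."""
    best_step, best_path = -1, None
    for child in Path(exp_dir).iterdir():
        match = STEP_DIR.match(child.name)
        actor = child / "actor"
        if match and child.is_dir() and actor.is_dir():
            # Compare steps numerically so global_step_10 beats global_step_9.
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, actor
    return best_path


if __name__ == "__main__":
    exp = "checkpoints/easy_r1/exp_name"
    if Path(exp).is_dir():
        path = latest_actor_dir(exp)
        if path is not None:
            print(f"python3 scripts/model_merger.py --local_dir {path}")
```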
Please refer to the example datasets to prepare your own dataset.
- Text dataset: https://huggingface.co/datasets/hiyouga/math12k
 - Image-text dataset: https://huggingface.co/datasets/hiyouga/geometry3k
 - Multi-image-text dataset: https://huggingface.co/datasets/hiyouga/journeybench-multi-image-vqa
 - Text-image mixed dataset: https://huggingface.co/datasets/hiyouga/rl-mixed-dataset
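The sketch below is a sanity check for records in your own dataset. It assumes the column names problem, answer and images, and one <image> placeholder in the prompt per attached image. These names are assumptions, so check the example datasets above for the exact schema. The function check_record is hypothetical.

```python
from typing import Any, Dict, List

IMAGE_PLACEHOLDER = "<image>"  # ASSUMPTION: placeholder marking each image in the prompt


def check_record(record: Dict[str, Any],
                 prompt_key: str = "problem",
                 answer_key: str = "answer",
                 image_key: str = "images") -> List[str]:
    """Return the problems found in one dataset record (empty list if it looks valid)."""
    errors = []
    prompt = record.get(prompt_key)
    if not isinstance(prompt, str) or not prompt:
        errors.append(f"missing or empty '{prompt_key}'")
        prompt = ""
    if record.get(answer_key) in (None, ""):
        errors.append(f"missing '{answer_key}'")
    # Text-only records simply have no image column.
    images = record.get(image_key) or []
    n_placeholders = prompt.count(IMAGE_PLACEHOLDER)
    if n_placeholders != len(images):
        errors.append(f"{n_placeholders} placeholder(s) but {len(images)} image(s)")
    return errors
```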
 
- To learn about the GRPO algorithm, you can refer to Hugging Face's blog.
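GRPO replaces a learned value baseline with group statistics. Several responses are sampled for each prompt, and each response's reward is normalized by the mean and standard deviation of its group. The sketch below computes that outcome-reward advantage. It is not EasyR1's implementation. It uses the population standard deviation, while other implementations may use the sample estimate or handle the epsilon differently.

```python
import statistics
from typing import Dict, List


def grpo_advantages(rewards_by_prompt: Dict[str, List[float]],
                    eps: float = 1e-6) -> Dict[str, List[float]]:
    """Group-relative advantage: (r - mean(group)) / (std(group) + eps) for each prompt."""
    advantages = {}
    for prompt_id, rewards in rewards_by_prompt.items():
        mean = statistics.fmean(rewards)
        # Population std over the group; identical rewards give zero advantage.
        std = statistics.pstdev(rewards)
        advantages[prompt_id] = [(r - mean) / (std + eps) for r in rewards]
    return advantages
```

For example, four rollouts of one prompt with rewards [1, 0, 1, 0] get advantages close to [1, -1, 1, -1]. A group where every rollout earns the same reward contributes no learning signal.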
 
To run training on multiple nodes:

- Start the Ray head node.

ray start --head --port=6379 --dashboard-host=0.0.0.0

- Start a Ray worker node and connect it to the head node.

ray start --address=<head_node_ip>:6379

- Check the Ray resource pool.

ray status

- Run the training script on the Ray head node only.

bash examples/qwen2_5_vl_7b_geo3k_grpo.sh

See veRL's official documentation for more details about multi-node training and the Ray debugger.
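Before launching, it helps to confirm that the Ray pool has as many GPUs as the run expects. The sketch below does that check. It assumes you pass in the dict returned by ray.cluster_resources() along with the node and per-node GPU counts your run is configured for. The helper check_gpu_pool is hypothetical.

```python
from typing import Dict


def check_gpu_pool(resources: Dict[str, float], nnodes: int, gpus_per_node: int) -> None:
    """Raise if the Ray cluster reports fewer GPUs than nnodes * gpus_per_node."""
    expected = nnodes * gpus_per_node
    available = int(resources.get("GPU", 0))
    if available < expected:
        raise RuntimeError(
            f"Ray reports {available} GPU(s), but the run expects {expected} "
            f"({nnodes} node(s) x {gpus_per_node} GPU(s)); did every worker run 'ray start'?"
        )


if __name__ == "__main__":
    # Hand-written resource dict for illustration; on a live cluster, call
    # ray.init(address="auto") and pass ray.cluster_resources() instead.
    check_gpu_pool({"CPU": 128.0, "GPU": 16.0}, nnodes=2, gpus_per_node=8)
    print("GPU pool OK")
```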
We also reproduced the following two baselines of the R1-V project.
- CLEVR-70k-Counting: Train the Qwen2.5-VL-3B-Instruct model on counting problems.
- GeoQA-8k: Train the Qwen2.5-VL-3B-Instruct model on GeoQA problems.
 
See baselines.md.
Projects built with EasyR1:

- MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources.
 - Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. 
 - Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement. 
 - MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse. 
 - Temporal-R1: Envolving Temporal Reasoning Capability into LMMs via Temporal Consistent Reward. 
 - NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation. 
 - GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents. 
 - R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning. 
 - VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning. 
 - MM-UPT: Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO. 
 - RL-with-Cold-Start: Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start. 
 - ViGoRL: Grounded Reinforcement Learning for Visual Reasoning. 
 - Revisual-R1: Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning. 
 - SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward. 
 - Vision-Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning. 
 - VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use. 
 - Long-RL: Scaling RL to Long Sequences. 
 - EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation. 
 - ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping. 
 - VPPO: Spotlight on Token Perception for Multimodal Reinforcement Learning. 
 
Roadmap:

- Support LoRA (high priority).
- Support Ulysses parallelism for VLMs (medium priority).
- Support more VLM architectures.
 
Note
We will not provide scripts for supervised fine-tuning and inference in this project. If you have such requirements, we recommend using LLaMA-Factory.
The following features are temporarily disabled; we plan to fix them one by one in future updates.
- Vision language models are not yet compatible with Ulysses parallelism.
 
👋 Join our WeChat group.
ValueError: Image features and image tokens do not match: tokens: 8192, features 9800
Increase the data.max_prompt_length or reduce the data.max_pixels.
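The rough estimator below shows how image size drives the image-token count that has to fit within data.max_prompt_length. It assumes the Qwen2-VL/Qwen2.5-VL resizing rule as we understand it. That rule uses 14-pixel patches merged 2×2, so each token covers a 28×28 block. Both sides are rounded to multiples of 28 and rescaled to stay between min_pixels and max_pixels. The function name and default limits are illustrative, not EasyR1 settings. Other architectures, such as Qwen3-VL, may use a different factor.

```python
import math


def qwen_vl_image_tokens(height: int, width: int,
                         min_pixels: int = 4 * 28 * 28,
                         max_pixels: int = 16384 * 28 * 28,
                         factor: int = 28) -> int:
    """Approximate visual token count for one image (hypothetical helper).

    ASSUMPTION: Qwen2-VL-style resizing, one token per factor x factor pixel block.
    """
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Shrink both sides by the same ratio so the area fits under max_pixels.
        beta = math.sqrt(height * width / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Enlarge both sides so the area reaches min_pixels.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return (h_bar // factor) * (w_bar // factor)


if __name__ == "__main__":
    # Hypothetical prompt with two 1024x1024 images.
    print(sum(qwen_vl_image_tokens(1024, 1024) for _ in range(2)))
```

data.max_prompt_length needs to cover the per-image counts for every image in a prompt plus the text tokens. Lowering data.max_pixels reduces each image's count.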
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
Reduce the worker.rollout.gpu_memory_utilization and enable worker.actor.offload.offload_params.
RuntimeError: 0 active drivers ([]). There should only be one.
Uninstall deepspeed from the current python environment.
Core contributors: Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang and Yuwen Xiong
We also thank Guangming Sheng and Chi Zhang for helpful discussions.
@misc{zheng2025easyr1,
  title        = {EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework},
  author       = {Yaowei Zheng and Junting Lu and Shenzhi Wang and Zhangchi Feng and Dongdong Kuang and Yuwen Xiong},
  howpublished = {\url{https://github.com/hiyouga/EasyR1}},
  year         = {2025}
}

We also recommend citing the original work:
@article{sheng2024hybridflow,
  title   = {HybridFlow: A Flexible and Efficient RLHF Framework},
  author  = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},
  year    = {2024},
  journal = {arXiv preprint arXiv:2409.19256}
}