MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
- [2025.10.13] 🔥🔥 Release Sandbox Task Generation Code; refer to sandbox!
- [2025.10.09] 🚀🚀 Release Arxiv Paper!
- [2025.10.09] 🚀🚀 Release MM-HELIX-100K Dataset!
- [2025.10.09] 🚀🚀 Release MM-HELIX Benchmark!
- [2025.10.09] 🚀🚀 Release Evaluation Code in VLMEvalKit!
- [2025.10.09] 🚀🚀 Release MM-HELIX-7B-Thinking Model Checkpoint!
- [⏳] AHPO Training Code & RL Environment [Coming Soon]
- [⏳] Step-Elicited Response Generation (SERG) Pipeline [Coming Soon]
While Multimodal Large Language Models (MLLMs) have shown proficiency in tasks like mathematics and logic, their capacity for long-chain reflective reasoning—a key element for solving complex, real-world problems—remains underdeveloped. This type of reasoning requires iterative thinking and backtracking, abilities that current models often lack.
MM-HELIX is a comprehensive platform designed to evaluate and enhance this crucial capability in MLLMs. It consists of:
- A Challenging Benchmark: A new benchmark, MM-HELIX, featuring 1,260 instances across 42 difficult tasks that demand reflective reasoning. Our findings show that existing MLLMs struggle significantly on this benchmark.
- A High-Quality Dataset: To address the performance gap, we created MM-HELIX-100K, a dataset with 100,000 high-quality, reflective reasoning instruction-tuning samples, generated through our innovative Step-Elicited Response Generation (SERG) pipeline.
- An Advanced Training Method: We introduce Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that combines offline supervision with online optimization. This method effectively teaches the model to learn from expert data and explore solutions independently, overcoming issues like sparse rewards and catastrophic forgetting that are common in standard Reinforcement Learning.
Our model, based on Qwen2.5-VL-7B, shows a +18.6% improvement in accuracy on the MM-HELIX benchmark and a +5.7% average gain on general math and logic tasks, demonstrating that reflective reasoning can be effectively learned and generalized.
Standard training methods often fall short in complex reasoning tasks. Supervised Fine-Tuning (SFT) can lead to catastrophic forgetting of general capabilities, while on-policy Reinforcement Learning (RL) is inefficient with sparse rewards.
To solve these issues, we developed Adaptive Hybrid Policy Optimization (AHPO), a novel training algorithm that unifies off-policy supervision and on-policy exploration.
AHPO's adaptive mechanism dynamically adjusts the influence of expert data based on the model's performance. When the model struggles (sparse rewards), it relies more on expert guidance. As it improves, it is encouraged to explore and find new solutions on its own. This approach fosters robust and generalizable reasoning skills.
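As a rough illustration of this adaptive mechanism, the minimal sketch below combines a GRPO-style on-policy term with an expert-likelihood term whose weight shrinks as the group success rate rises. It is not the released implementation: the function name `ahpo_loss`, the threshold `tau`, and the linear gating rule are assumptions made for clarity.

```python
# Minimal sketch of AHPO-style adaptive weighting (illustrative, not the
# official training code). Assumes a group of on-policy rollouts scored by
# the verifier (binary rewards) plus one expert trajectory per prompt.
import torch

def ahpo_loss(policy_logps,       # (G,) summed log-probs of on-policy rollouts
              rewards,            # (G,) verifier rewards for those rollouts
              expert_logp,        # scalar: summed log-prob of the expert trace
              tau: float = 0.25): # hypothetical success-rate threshold
    # GRPO-style group-normalized advantages for the on-policy term.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    on_policy_loss = -(adv * policy_logps).mean()

    # Adaptive gate: sparse rewards (low success rate) pull in more expert
    # supervision; once the policy starts succeeding, the gate closes and
    # the model is left to explore on its own.
    success_rate = rewards.mean()
    expert_weight = torch.clamp((tau - success_rate) / tau, min=0.0)
    off_policy_loss = -expert_logp   # maximize likelihood of the expert trace

    return on_policy_loss + expert_weight * off_policy_loss
```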
The 42 tasks in the MM-HELIX benchmark.
The MM-HELIX benchmark is designed to test the limits of multimodal long-chain reflective reasoning in MLLMs.
- Diverse and Challenging Tasks: The benchmark includes 1,260 high-quality samples from 42 unique tasks divided into four categories: algorithms, graphs, puzzles, and games.
- Controlled Difficulty: Tasks are generated procedurally at five difficulty levels, from Level 1 (very easy) to Level 5 (very hard), allowing detailed analysis of model performance at different complexities.
- Automated and Objective Evaluation: Our framework includes an Instance Generator, a deterministic Solver, and an automated Verifier. The Verifier validates the correctness of model-generated solutions, enabling objective and scalable evaluation, and also serves as a reward oracle in a reinforcement learning setup.
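To make the Generator/Solver/Verifier contract concrete, here is a hedged sketch built around a toy grid-path task; the names (`Instance`, `MazePathTask`) and the task itself are illustrative stand-ins, not the repository's actual sandbox code. The key point is that `verify` is deterministic and returns a scalar that can double as an RL reward.

```python
# Hedged sketch of the Generator / Verifier contract used for automated
# evaluation and as an RL reward oracle. Class names and the toy task are
# hypothetical, chosen only to show the shape of the interface.
import random
from dataclasses import dataclass

@dataclass
class Instance:
    prompt: str          # textual (or rendered-image) task description
    data: dict           # machine-readable task state for the verifier

class MazePathTask:
    def generate(self, difficulty: int, seed: int = 0) -> Instance:
        """Procedurally create an instance; difficulty scales the grid size."""
        rng = random.Random(seed)
        n = 3 + 2 * difficulty
        walls = {(rng.randrange(n), rng.randrange(n)) for _ in range(n)}
        walls -= {(0, 0), (n - 1, n - 1)}   # keep start and goal free
        return Instance(prompt=f"Find a path across a {n}x{n} grid ...",
                        data={"n": n, "walls": walls})

    def verify(self, inst: Instance, answer: list[tuple[int, int]]) -> float:
        """Deterministically check a candidate path; 1.0 if valid, else 0.0."""
        n, walls = inst.data["n"], inst.data["walls"]
        if not answer or answer[0] != (0, 0) or answer[-1] != (n - 1, n - 1):
            return 0.0
        for (x0, y0), (x1, y1) in zip(answer, answer[1:]):
            if abs(x0 - x1) + abs(y0 - y1) != 1:          # one step at a time
                return 0.0
            if not (0 <= x1 < n and 0 <= y1 < n) or (x1, y1) in walls:
                return 0.0
        return 1.0
```

A real generator would additionally invoke the deterministic Solver to guarantee that each instance is solvable and to record a ground-truth solution.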
To train models for complex reasoning, a large-scale, high-quality dataset is essential. We introduce MM-HELIX-100K, a dataset of 100,000 instruction-tuning instances with detailed, reflective reasoning paths.
This dataset was created using our Step-Elicited Response Generation (SERG) pipeline, which efficiently generates high-quality Chain-of-Thought (CoT) trajectories.
The SERG pipeline works as follows:
- A rule-based CoT constructor first generates a skeletal reasoning path.
- This path is then refined by a powerful language model (Qwen3-235B) to create a more natural, human-like reasoning process that includes reflective steps.
- Finally, each generated trajectory is validated by our automated verifier to ensure its correctness and quality.
The Step-Elicited Response Generation (SERG) pipeline.
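Putting the three stages together, the sketch below shows how a SERG-style loop could be wired around the toy task interface from the earlier sketch. `build_skeleton_cot`, `rewrite_with_llm`, and `extract_answer` are hypothetical helpers standing in for the rule-based constructor, the Qwen3-235B rewriting step, and answer parsing; the actual pipeline is released separately.

```python
# Hedged sketch of the three-stage SERG loop: rule-based skeleton -> LLM
# rewrite -> verifier filter. The helper callables are hypothetical.
def serg_generate(task, difficulty: int, seed: int,
                  build_skeleton_cot, rewrite_with_llm, extract_answer):
    # 1. Procedurally create an instance, then expand its known solution
    #    into a skeletal step-by-step reasoning path (rule-based, correct).
    inst = task.generate(difficulty, seed)
    skeleton = build_skeleton_cot(inst)

    # 2. Ask a strong LLM (Qwen3-235B in the paper) to rewrite the skeleton
    #    into natural, reflective reasoning without changing the final answer.
    trajectory = rewrite_with_llm(skeleton)

    # 3. Re-verify the rewritten trajectory; keep it only if the verifier
    #    still accepts the extracted final answer.
    if task.verify(inst, extract_answer(trajectory)) == 1.0:
        return {"prompt": inst.prompt, "response": trajectory}
    return None   # discarded; another sample can be drawn with a new seed
```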
Our comprehensive evaluation of 23 leading MLLMs on the MM-HELIX benchmark reveals significant limitations in their reflective reasoning abilities. Even top proprietary models struggle to surpass a 50% accuracy threshold, and a notable performance gap exists between multimodal and text-only inputs.
| Model | Thinking | Algorithms (Txt) | Algorithms (Img) | Graphs (Txt) | Graphs (Img) | Puzzles (Txt) | Puzzles (Img) | Games (Txt) | Games (Img) | Overall (Txt) | Overall (Img) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | |
| GPT-5 | ✅ | 83.0 | 88.5 | 98.3 | 50.4 | 80.9 | 52.6 | 80.0 | 40.0 | 84.5 | 58.1 |
| Seed-1.5-VL | ✅ | 89.3 | 78.9 | 86.7 | 40.4 | 51.6 | 41.9 | 55.6 | 33.3 | 66.9 | 48.3 |
| o4-mini | ✅ | 76.3 | 50.7 | 95.0 | 42.1 | 69.1 | 45.8 | 66.7 | 35.6 | 75.2 | 44.7 |
| Gemini-2.5-Flash | ✅ | 92.6 | 66.7 | 88.3 | 40.8 | 52.1 | 36.7 | 49.4 | 28.3 | 67.3 | 42.7 |
| GPT-4.1 | ❌ | 61.9 | 44.4 | 73.8 | 35.0 | 30.9 | 16.8 | 13.9 | 8.9 | 43.3 | 25.1 |
| GPT-4o | ❌ | 33.7 | 18.9 | 44.6 | 25.4 | 10.2 | 4.2 | 10.6 | 6.7 | 21.8 | 11.7 |
| **Open-Source Models** | | | | | | | | | | | |
| Intern-S1-241B-A28B | ✅ | 75.2 | 69.3 | 76.7 | 30.0 | 35.3 | 23.5 | 26.1 | 15.0 | 50.4 | 33.3 |
| GLM-4.5V-106B-A12B-Thinking | ✅ | 49.6 | 29.3 | 40.4 | 11.3 | 15.3 | 20.2 | 12.2 | 13.9 | 27.0 | 19.5 |
| Kimi-VL-16B-A3B-Thinking-2506 | ✅ | 45.9 | 36.3 | 49.6 | 23.3 | 9.6 | 10.4 | 10.6 | 7.2 | 28.9 | 19.3 |
| GLM-4.1V-9B-Thinking | ✅ | 38.1 | 30.7 | 50.4 | 29.2 | 11.6 | 7.4 | 5.0 | 6.1 | 23.7 | 16.3 |
| Qwen-2.5-VL-72B | ❌ | 24.4 | 18.5 | 42.1 | 25.8 | 8.2 | 3.9 | 5.6 | 7.2 | 20.1 | 13.9 |
| Qwen-2.5-VL-32B | ❌ | 22.2 | 15.2 | 46.3 | 22.5 | 8.1 | 4.7 | 5.6 | 6.7 | 20.6 | 12.3 |
| QVQ-72B-Preview | ✅ | 22.6 | 21.1 | 36.7 | 16.7 | 4.9 | 3.3 | 6.7 | 3.3 | 17.7 | 11.1 |
| MiniCPM-V-4.5-8B | ✅ | 20.0 | 20.0 | 32.1 | 20.8 | 5.8 | 3.7 | 0.0 | 3.3 | 13.0 | 10.4 |
| InternVL3-78B | ❌ | 20.0 | 14.4 | 43.3 | 25.4 | 10.2 | 4.0 | 10.0 | 1.1 | 18.6 | 9.9 |
| InternVL3-38B | ❌ | 19.3 | 14.1 | 40.8 | 22.5 | 8.2 | 3.5 | 7.8 | 5.6 | 16.7 | 9.7 |
| Llama-4-Scout-109B-A17B-16E | ❌ | 24.1 | 16.3 | 40.8 | 21.3 | 4.4 | 4.2 | 2.2 | 1.7 | 15.2 | 9.7 |
| Ovis2-34B | ❌ | 14.4 | 10.4 | 33.8 | 22.1 | 3.9 | 1.2 | 5.0 | 1.7 | 12.0 | 7.2 |
| Gemma-3-27B-IT | ❌ | 20.7 | 10.4 | 44.2 | 22.1 | 6.5 | 0.5 | 5.6 | 1.7 | 16.6 | 6.9 |
| Qwen-2.5-VL-7B | ❌ | 5.6 | 5.9 | 25.4 | 17.9 | 0.4 | 0.4 | 0.6 | 1.1 | 8.0 | 6.3 |
| InternVL3-8B | ❌ | 8.1 | 5.9 | 28.8 | 16.7 | 1.6 | 0.7 | 1.1 | 1.1 | 8.1 | 4.9 |
| Ovis2-8B | ❌ | 7.8 | 3.3 | 24.2 | 15.4 | 0.5 | 0.2 | 1.1 | 0.6 | 6.7 | 3.8 |
| **Ours** | | | | | | | | | | | |
| MM-HELIX-7B-Thinking | ✅ | 32.2 | 34.8 | 27.5 | 19.2 | 16.3 | 25.3 | 16.1 | 16.7 | 21.8 | 24.9 |
Table 1: Evaluation results on MM-HELIX across multimodal and text-only settings.
When applying AHPO to the Qwen2.5-VL-7B model, we observed remarkable improvements. Our final model, MM-HELIX-7B-Thinking, not only achieves a +18.6% absolute improvement on the MM-HELIX benchmark but also demonstrates strong generalization with a +5.7% average gain on general math and logic benchmarks.
| Method | Type | MM-HELIX (In-Domain) | MathVision | MathVerse-V | LogicVista | WeMath | Average (General) |
|---|---|---|---|---|---|---|---|
| Qwen2.5VL-7B | Baseline | 6.3 | 25.2 | 40.5 | 45.6 | 34.5 | 36.5 |
| +GRPO | On-policy | 9.0 (+2.7) | 25.8 | 41.0 | 43.6 | 36.4 | 36.7 (+0.2) |
| +SFT | Off-policy | 23.8 (+17.5) | 21.7 | 33.0 | 38.7 | 26.2 | 29.9 (-6.6) |
| +SFT&GRPO | Sequential | 23.3 (+17.0) | 25.9 | 39.1 | 45.9 | 35.7 | 36.7 (+0.2) |
| +LUFFY | Hybrid | 9.1 (+2.8) | 26.0 | 37.9 | 42.7 | 34.8 | 35.4 (-1.1) |
| +AHPO (Ours) | Hybrid | 24.9 (+18.6) | 26.6 | 47.5 | 53.5 | 41.1 | 42.2 (+5.7) |
Table 2: Comparison of AHPO and other training strategies.
For detailed results and rankings, please refer to our interactive leaderboard.
If you find our work useful, please consider citing our paper:
```bibtex
@article{zhao2025mmhelix,
  title={MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization},
  author={Zhao, Xiangyu and Lin, Junming and Liang, Tianhao and Zhou, Yifan and Chai, Wenhao and Gu, Yuzhe and Wang, Weiyun and Chen, Kai and Luo, Gen and Zhang, Wenwei and Yan, Junchi and Yang, Hua and Duan, Haodong and Yang, Xue},
  journal={arXiv preprint arXiv:2510.08540},
  year={2025}
}
```