MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
- [2025.10.13] 🔥🔥 Release Sandbox Task Generation Code; refer to sandbox!
- [2025.10.09] 🚀🚀 Release Arxiv Paper!
- [2025.10.09] 🚀🚀 Release MM-HELIX-100K Dataset!
- [2025.10.09] 🚀🚀 Release MM-HELIX Benchmark!
- [2025.10.09] 🚀🚀 Release Evaluation Code in VLMEvalKit!
- [2025.10.09] 🚀🚀 Release MM-HELIX-7B-Thinking Model Checkpoint!
- [⏳] AHPO Training Code & RL Environment [Coming Soon]
- [⏳] Step-Elicited Response Generation (SERG) Pipeline [Coming Soon]
While Multimodal Large Language Models (MLLMs) have shown proficiency in tasks like mathematics and logic, their capacity for long-chain reflective reasoning—a key element for solving complex, real-world problems—remains underdeveloped. This type of reasoning requires iterative thinking and backtracking, abilities that current models often lack.
MM-HELIX is a comprehensive platform designed to evaluate and enhance this crucial capability in MLLMs. It consists of:
- A Challenging Benchmark: A new benchmark, MM-HELIX, featuring 1,260 instances across 42 difficult tasks that demand reflective reasoning. Our findings show that existing MLLMs struggle significantly on this benchmark.
- A High-Quality Dataset: To address the performance gap, we created MM-HELIX-100K, a dataset with 100,000 high-quality, reflective reasoning instruction-tuning samples, generated through our innovative Step-Elicited Response Generation (SERG) pipeline.
- An Advanced Training Method: We introduce Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that combines offline supervision with online optimization. This method effectively teaches the model to learn from expert data and explore solutions independently, overcoming issues like sparse rewards and catastrophic forgetting that are common in standard Reinforcement Learning.
Our model, based on Qwen2.5-VL-7B, shows a +18.6% improvement in accuracy on the MM-HELIX benchmark and a +5.7% average gain on general math and logic tasks, demonstrating that reflective reasoning can be effectively learned and generalized.
Standard training methods often fall short in complex reasoning tasks. Supervised Fine-Tuning (SFT) can lead to catastrophic forgetting of general capabilities, while on-policy Reinforcement Learning (RL) is inefficient with sparse rewards.
To solve these issues, we developed Adaptive Hybrid Policy Optimization (AHPO), a novel training algorithm that unifies off-policy supervision and on-policy exploration.
AHPO's adaptive mechanism dynamically adjusts the influence of expert data based on the model's performance. When the model struggles (sparse rewards), it relies more on expert guidance. As it improves, it is encouraged to explore and find new solutions on its own. This approach fosters robust and generalizable reasoning skills.
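As a rough illustration of this adaptive mechanism, the minimal sketch below combines a GRPO-style on-policy term with an expert-likelihood term whose weight shrinks as the group success rate rises. It is not the released implementation: the function name `ahpo_loss`, the threshold `tau`, and the linear gating rule are assumptions made for clarity.

```python
# Minimal sketch of AHPO-style adaptive weighting (illustrative, not the
# official training code). Assumes a group of on-policy rollouts scored by
# the verifier (binary rewards) plus one expert trajectory per prompt.
import torch

def ahpo_loss(policy_logps,       # (G,) summed log-probs of on-policy rollouts
              rewards,            # (G,) verifier rewards for those rollouts
              expert_logp,        # scalar: summed log-prob of the expert trace
              tau: float = 0.25): # hypothetical success-rate threshold
    # GRPO-style group-normalized advantages for the on-policy term.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    on_policy_loss = -(adv * policy_logps).mean()

    # Adaptive gate: sparse rewards (low success rate) pull in more expert
    # supervision; once the policy starts succeeding, the gate closes and
    # the model is left to explore on its own.
    success_rate = rewards.mean()
    expert_weight = torch.clamp((tau - success_rate) / tau, min=0.0)
    off_policy_loss = -expert_logp   # maximize likelihood of the expert trace

    return on_policy_loss + expert_weight * off_policy_loss
```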
The 42 tasks in the MM-HELIX benchmark.
The MM-HELIX benchmark is designed to test the limits of multimodal long-chain reflective reasoning in MLLMs.
- Diverse and Challenging Tasks: The benchmark includes 1,260 high-quality samples from 42 unique tasks divided into four categories: algorithms, graphs, puzzles, and games.
- Controlled Difficulty: Tasks are generated procedurally at five difficulty levels, from Level 1 (very easy) to Level 5 (very hard), allowing detailed analysis of model performance at different complexities.
- Automated and Objective Evaluation: Our framework includes an Instance Generator, a deterministic Solver, and an automated Verifier. The Verifier validates the correctness of model-generated solutions, enabling objective and scalable evaluation, and also serves as a reward oracle in a reinforcement learning setup.
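To make the Generator/Solver/Verifier contract concrete, here is a hedged sketch built around a toy grid-path task; the names (`Instance`, `MazePathTask`) and the task itself are illustrative stand-ins, not the repository's actual sandbox code. The key point is that `verify` is deterministic and returns a scalar that can double as an RL reward.

```python
# Hedged sketch of the Generator / Verifier contract used for automated
# evaluation and as an RL reward oracle. Class names and the toy task are
# hypothetical, chosen only to show the shape of the interface.
import random
from dataclasses import dataclass

@dataclass
class Instance:
    prompt: str          # textual (or rendered-image) task description
    data: dict           # machine-readable task state for the verifier

class MazePathTask:
    def generate(self, difficulty: int, seed: int = 0) -> Instance:
        """Procedurally create an instance; difficulty scales the grid size."""
        rng = random.Random(seed)
        n = 3 + 2 * difficulty
        walls = {(rng.randrange(n), rng.randrange(n)) for _ in range(n)}
        walls -= {(0, 0), (n - 1, n - 1)}   # keep start and goal free
        return Instance(prompt=f"Find a path across a {n}x{n} grid ...",
                        data={"n": n, "walls": walls})

    def verify(self, inst: Instance, answer: list[tuple[int, int]]) -> float:
        """Deterministically check a candidate path; 1.0 if valid, else 0.0."""
        n, walls = inst.data["n"], inst.data["walls"]
        if not answer or answer[0] != (0, 0) or answer[-1] != (n - 1, n - 1):
            return 0.0
        for (x0, y0), (x1, y1) in zip(answer, answer[1:]):
            if abs(x0 - x1) + abs(y0 - y1) != 1:          # one step at a time
                return 0.0
            if not (0 <= x1 < n and 0 <= y1 < n) or (x1, y1) in walls:
                return 0.0
        return 1.0
```

A real generator would additionally invoke the deterministic Solver to guarantee that each instance is solvable and to record a ground-truth solution.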
To train models for complex reasoning, a large-scale, high-quality dataset is essential. We introduce MM-HELIX-100K, a dataset of 100,000 instruction-tuning instances with detailed, reflective reasoning paths.
This dataset was created using our Step-Elicited Response Generation (SERG) pipeline, which efficiently generates high-quality Chain-of-Thought (CoT) trajectories.
The SERG pipeline works as follows:
- A rule-based CoT constructor first generates a skeletal reasoning path.
- This path is then refined by a powerful language model (Qwen3-235B) to create a more natural, human-like reasoning process that includes reflective steps.
- Finally, each generated trajectory is validated by our automated verifier to ensure its correctness and quality.
The Step-Elicited Response Generation (SERG) pipeline.
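Putting the three stages together, the sketch below shows how a SERG-style loop could be wired around the toy task interface from the earlier sketch. `build_skeleton_cot`, `rewrite_with_llm`, and `extract_answer` are hypothetical helpers standing in for the rule-based constructor, the Qwen3-235B rewriting step, and answer parsing; the actual pipeline is released separately.

```python
# Hedged sketch of the three-stage SERG loop: rule-based skeleton -> LLM
# rewrite -> verifier filter. The helper callables are hypothetical.
def serg_generate(task, difficulty: int, seed: int,
                  build_skeleton_cot, rewrite_with_llm, extract_answer):
    # 1. Procedurally create an instance, then expand its known solution
    #    into a skeletal step-by-step reasoning path (rule-based, correct).
    inst = task.generate(difficulty, seed)
    skeleton = build_skeleton_cot(inst)

    # 2. Ask a strong LLM (Qwen3-235B in the paper) to rewrite the skeleton
    #    into natural, reflective reasoning without changing the final answer.
    trajectory = rewrite_with_llm(skeleton)

    # 3. Re-verify the rewritten trajectory; keep it only if the verifier
    #    still accepts the extracted final answer.
    if task.verify(inst, extract_answer(trajectory)) == 1.0:
        return {"prompt": inst.prompt, "response": trajectory}
    return None   # discarded; another sample can be drawn with a new seed
```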
Our comprehensive evaluation of 23 leading MLLMs on the MM-HELIX benchmark reveals significant limitations in their reflective reasoning abilities. Even top proprietary models struggle to surpass a 50% accuracy threshold, and a notable performance gap exists between multimodal and text-only inputs.
| Model | Thinking | Algorithms (Txt) | Algorithms (Img) | Graphs (Txt) | Graphs (Img) | Puzzles (Txt) | Puzzles (Img) | Games (Txt) | Games (Img) | Overall (Txt) | Overall (Img) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | |
| GPT-5 | ✅ | 83.0 | 88.5 | 98.3 | 50.4 | 80.9 | 52.6 | 80.0 | 40.0 | 84.5 | 58.1 |
| Seed-1.5-VL | ✅ | 89.3 | 78.9 | 86.7 | 40.4 | 51.6 | 41.9 | 55.6 | 33.3 | 66.9 | 48.3 |
| o4-mini | ✅ | 76.3 | 50.7 | 95.0 | 42.1 | 69.1 | 45.8 | 66.7 | 35.6 | 75.2 | 44.7 |
| Gemini-2.5-Flash | ✅ | 92.6 | 66.7 | 88.3 | 40.8 | 52.1 | 36.7 | 49.4 | 28.3 | 67.3 | 42.7 |
| GPT-4.1 | ❌ | 61.9 | 44.4 | 73.8 | 35.0 | 30.9 | 16.8 | 13.9 | 8.9 | 43.3 | 25.1 |
| GPT-4o | ❌ | 33.7 | 18.9 | 44.6 | 25.4 | 10.2 | 4.2 | 10.6 | 6.7 | 21.8 | 11.7 |
| **Open-Source Models** | | | | | | | | | | | |
| Intern-S1-241B-A28B | ✅ | 75.2 | 69.3 | 76.7 | 30.0 | 35.3 | 23.5 | 26.1 | 15.0 | 50.4 | 33.3 |
| GLM-4.5V-106B-A12B-Thinking | ✅ | 49.6 | 29.3 | 40.4 | 11.3 | 15.3 | 20.2 | 12.2 | 13.9 | 27.0 | 19.5 |
| Kimi-VL-16B-A3B-Thinking-2506 | ✅ | 45.9 | 36.3 | 49.6 | 23.3 | 9.6 | 10.4 | 10.6 | 7.2 | 28.9 | 19.3 |
| GLM-4.1V-9B-Thinking | ✅ | 38.1 | 30.7 | 50.4 | 29.2 | 11.6 | 7.4 | 5.0 | 6.1 | 23.7 | 16.3 |
| Qwen-2.5-VL-72B | ❌ | 24.4 | 18.5 | 42.1 | 25.8 | 8.2 | 3.9 | 5.6 | 7.2 | 20.1 | 13.9 |
| Qwen-2.5-VL-32B | ❌ | 22.2 | 15.2 | 46.3 | 22.5 | 8.1 | 4.7 | 5.6 | 6.7 | 20.6 | 12.3 |
| QVQ-72B-Preview | ✅ | 22.6 | 21.1 | 36.7 | 16.7 | 4.9 | 3.3 | 6.7 | 3.3 | 17.7 | 11.1 |
| MiniCPM-V-4.5-8B | ✅ | 20.0 | 20.0 | 32.1 | 20.8 | 5.8 | 3.7 | 0.0 | 3.3 | 13.0 | 10.4 |
| InternVL3-78B | ❌ | 20.0 | 14.4 | 43.3 | 25.4 | 10.2 | 4.0 | 10.0 | 1.1 | 18.6 | 9.9 |
| InternVL3-38B | ❌ | 19.3 | 14.1 | 40.8 | 22.5 | 8.2 | 3.5 | 7.8 | 5.6 | 16.7 | 9.7 |
| Llama-4-Scout-109B-A17B-16E | ❌ | 24.1 | 16.3 | 40.8 | 21.3 | 4.4 | 4.2 | 2.2 | 1.7 | 15.2 | 9.7 |
| Ovis2-34B | ❌ | 14.4 | 10.4 | 33.8 | 22.1 | 3.9 | 1.2 | 5.0 | 1.7 | 12.0 | 7.2 |
| Gemma-3-27B-IT | ❌ | 20.7 | 10.4 | 44.2 | 22.1 | 6.5 | 0.5 | 5.6 | 1.7 | 16.6 | 6.9 |
| Qwen-2.5-VL-7B | ❌ | 5.6 | 5.9 | 25.4 | 17.9 | 0.4 | 0.4 | 0.6 | 1.1 | 8.0 | 6.3 |
| InternVL3-8B | ❌ | 8.1 | 5.9 | 28.8 | 16.7 | 1.6 | 0.7 | 1.1 | 1.1 | 8.1 | 4.9 |
| Ovis2-8B | ❌ | 7.8 | 3.3 | 24.2 | 15.4 | 0.5 | 0.2 | 1.1 | 0.6 | 6.7 | 3.8 |
| **Ours** | | | | | | | | | | | |
| MM-HELIX-7B-Thinking | ✅ | 32.2 | 34.8 | 27.5 | 19.2 | 16.3 | 25.3 | 16.1 | 16.7 | 21.8 | 24.9 |
Table 1: Evaluation results on MM-HELIX across multimodal and text-only settings.
When applying AHPO to the Qwen2.5-VL-7B model, we observed remarkable improvements. Our final model, MM-HELIX-7B-Thinking, not only achieves a +18.6% absolute improvement on the MM-HELIX benchmark but also demonstrates strong generalization with a +5.7% average gain on general math and logic benchmarks.
| Method | Type | MM-HELIX (In-Domain) | MathVision | MathVerse-V | LogicVista | WeMath | Average (General) |
|---|---|---|---|---|---|---|---|
| Qwen2.5VL-7B | Baseline | 6.3 | 25.2 | 40.5 | 45.6 | 34.5 | 36.5 |
| +GRPO | On-policy | 9.0 (+2.7) | 25.8 | 41.0 | 43.6 | 36.4 | 36.7 (+0.2) |
| +SFT | Off-policy | 23.8 (+17.5) | 21.7 | 33.0 | 38.7 | 26.2 | 29.9 (-6.6) |
| +SFT&GRPO | Sequential | 23.3 (+17.0) | 25.9 | 39.1 | 45.9 | 35.7 | 36.7 (+0.2) |
| +LUFFY | Hybrid | 9.1 (+2.8) | 26.0 | 37.9 | 42.7 | 34.8 | 35.4 (-1.1) |
| +AHPO (Ours) | Hybrid | 24.9 (+18.6) | 26.6 | 47.5 | 53.5 | 41.1 | 42.2 (+5.7) |
Table 2: Comparison of AHPO and other training strategies.
For detailed results and rankings, please refer to our interactive leaderboard.
If you find our work useful, please consider citing our paper:
```bibtex
@article{zhao2025mmhelix,
  title={MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization},
  author={Zhao, Xiangyu and Lin, Junming and Liang, Tianhao and Zhou, Yifan and Chai, Wenhao and Gu, Yuzhe and Wang, Weiyun and Chen, Kai and Luo, Gen and Zhang, Wenwei and Yan, Junchi and Yang, Hua and Duan, Haodong and Yang, Xue},
  journal={arXiv preprint arXiv:2510.08540},
  year={2025}
}
```