If you find our work useful, please consider giving us a star 🌟
Our website is automatically generated by AutoPage, our multi-agent system, which we highly recommend for effortless academic page creation.
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
RAPO++ is a three-stage framework that enhances text-to-video generation without modifying model architectures. It unifies data-aligned prompt refinement (RAPO), test-time iterative optimization (SSPO), and LLM fine-tuning, enabling more coherent, compositional, and physically realistic video synthesis. Tested on five state-of-the-art models and benchmarks, RAPO++ consistently improves semantic alignment, temporal stability, and visual fidelity, setting a new standard for prompt optimization in T2V generation.
The core contribution of RAPO++ lies in SSPO, a model-agnostic, closed-loop mechanism that iteratively refines prompts through feedback from generated videos. When using RAPO++, users can replace RAPO with their model’s built-in prompt refiner as initialization. The feedback data collected during SSPO can then be used to fine-tune the refiner itself, further enhancing model-specific prompt optimization.
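Conceptually, the SSPO loop reduces to a generate-score-rewrite cycle. The sketch below assumes the three stage components are supplied as callables; `generate_video`, `score_alignment`, and `rewrite_prompt` are hypothetical placeholders for the base T2V model, the feedback scorer, and the LLM rewriter, not functions from this repository.

```python
# Minimal sketch of the SSPO closed loop. The three callables are placeholders:
#   generate_video  -> base T2V model (e.g. Wan2.1)
#   score_alignment -> feedback scorer (e.g. Qwen2.5-VL alignment scoring)
#   rewrite_prompt  -> LLM rewriter (e.g. the RAPO-initialized refiner)
def sspo_loop(prompt, generate_video, score_alignment, rewrite_prompt,
              rounds=4, threshold=0.9):
    history = []  # (prompt, score) pairs; reusable later as refiner fine-tuning data
    best_prompt, best_score = prompt, float("-inf")
    for _ in range(rounds):
        video = generate_video(prompt)
        score = score_alignment(video, prompt)
        history.append((prompt, score))
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= threshold:  # aligned well enough, stop early
            break
        # rewrite conditioned on the score and on earlier attempts (backtracking)
        prompt = rewrite_prompt(prompt, score, history)
    return best_prompt, history
```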
- Clone the Repository
git clone https://github.com/Vchitect/RAPO.git
cd RAPO
- Set up Environment
conda create -n rapo_plus python=3.10
conda activate rapo_plus
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
Download the required model weights (RAPO, the relation graph, and a pretrained LLM such as Mistral-7B-Instruct-v0.3) and place them in the ckpt/ and relation_graph/ directories.
ckpt/
├── all-MiniLM-L6-v2/
├── llama3_1_instruct_lora_rewrite/
└── Mistral-7B-Instruct-v0.3/
relation_graph/
└── graph_data/
We take Wan2.1-T2V as the base model to illustrate the process of SSPO. Download the required model weights (Wan2.1-T2V, Qwen2.5-7B-Instruct, and Qwen2.5-VL-7B-Instruct) and place them in the ckpt/ directory.
ckpt/
├── Wan2.1-T2V-1.3B-Diffusers/
├── Qwen2.5-7B-Instruct/
└── Qwen2.5-VL-7B-Instruct/
cd ./examples/Stage1_RAPO/
- We provide code to compose the graph data, along with two example input files (./dataset/graph_test1.csv and ./dataset/graph_test2.csv); a rough sketch of this step follows the commands below. You can build a relation graph from scratch based on the constructed data:
python construct_graph.py
or you can add data based on the already constructed relation graph:
python add_to_graph.py
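For intuition, composing the graph data amounts to turning word-modifier co-occurrences into a weighted graph. The sketch below is a rough illustration under assumed CSV column names (`word`, `modifier`) and an assumed output path; see the example CSVs and construct_graph.py / add_to_graph.py for the actual schema and logic.

```python
# Rough illustration of composing a word-modifier relation graph from CSV rows.
# The column names "word" and "modifier" and the output path are assumptions.
import csv
import pickle
import networkx as nx

def build_relation_graph(csv_path, graph=None):
    G = graph if graph is not None else nx.Graph()  # pass an existing graph to extend it
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            u, v = row["word"], row["modifier"]
            # repeated pairs strengthen the edge instead of duplicating it
            if G.has_edge(u, v):
                G[u][v]["weight"] += 1
            else:
                G.add_edge(u, v, weight=1)
    return G

G = build_relation_graph("./dataset/graph_test1.csv")           # from scratch
G = build_relation_graph("./dataset/graph_test2.csv", graph=G)  # add to an existing graph
with open("./relation_graph/graph_data/relation_graph.pkl", "wb") as f:
    pickle.dump(G, f)
```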
- Retrieve related modifiers from the relation graph. You can adjust the hyperparameters in retrieve_modifiers.py to change the number of retrieved modifiers; a sketch of the retrieval idea follows the command.
sh retrieve_modifiers.sh
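Since the checkpoint list includes all-MiniLM-L6-v2, a plausible reading is that retrieval ranks candidate modifiers from the graph by embedding similarity to the input prompt. The sketch below shows that idea with sentence-transformers; the `top_k` parameter mirrors the hyperparameters mentioned above, and the exact scoring in retrieve_modifiers.py may differ.

```python
# Hedged sketch: rank candidate modifiers by cosine similarity to the prompt
# using the all-MiniLM-L6-v2 checkpoint from ckpt/. This approximates, not
# reproduces, what retrieve_modifiers.py actually does.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ckpt/all-MiniLM-L6-v2")

def retrieve_modifiers(prompt, candidates, top_k=10):
    prompt_emb = model.encode(prompt, convert_to_tensor=True)
    cand_embs = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(prompt_emb, cand_embs)[0]   # shape: (len(candidates),)
    ranked = scores.argsort(descending=True)[:top_k]
    return [candidates[i] for i in ranked]

print(retrieve_modifiers("a dog running on the beach",
                         ["golden retriever", "slow motion", "sunset lighting"]))
```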
- Word augmentation and sentence refactoring.
sh word_augment.sh
sh refactoring.sh
- Rewrite via instruction; a sketch of the underlying LLM call follows the command.
sh rewrite_via_instruction.sh
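Under the hood, this step asks an instruction-tuned LLM to rewrite the refactored prompt. Below is a minimal sketch of such a call with the Mistral-7B-Instruct-v0.3 checkpoint listed above; the instruction wording is an illustrative assumption, while rewrite_via_instruction.sh uses the repo's own template (and the llama3_1_instruct_lora_rewrite weights).

```python
# Hedged sketch: instruction-based prompt rewriting with a local
# Mistral-7B-Instruct-v0.3 checkpoint. The instruction text is illustrative;
# the repo's script uses its own template and fine-tuned rewriter weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "ckpt/Mistral-7B-Instruct-v0.3"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content":
             "Rewrite this text-to-video prompt so it is detailed and coherent "
             "while keeping the original intent: a dog running on the beach"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```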
cd ./examples/Stage2_SSPO/
We take physics-aware video generation based on Wan2.1 as an example. We provide an examples.csv file in this directory, which contains test prompts together with the physical rules that T2V generation must comply with. For a quick start, the script below generates and refines videos iteratively by combining Wan2.1 T2V generation, Qwen2.5-VL alignment scoring, and physics-based prompt rewriting to improve realism and consistency. You can modify the script to change the base model, plug in custom reward functions, or enable historical-prompt backtracking for task-specific adaptation; a sketch of the alignment-scoring step follows the command.
python phyaware_wan2.1.py
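The alignment-scoring step inside that loop might look like the following sketch, which asks Qwen2.5-VL to rate how well the generated video matches the prompt and the physical rule. The 0-10 rating instruction and the numeric parsing are assumptions for illustration; phyaware_wan2.1.py defines the actual reward.

```python
# Hedged sketch of Qwen2.5-VL alignment scoring for the SSPO loop.
# Requires `pip install qwen-vl-utils`; the rating prompt is illustrative.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

path = "ckpt/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    path, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(path)

def score_alignment(video_path, prompt, physical_rule):
    messages = [{"role": "user", "content": [
        {"type": "video", "video": video_path},
        {"type": "text", "text":
         f"Rate from 0 to 10 how well this video matches the prompt "
         f"'{prompt}' and obeys the physical rule '{physical_rule}'. "
         f"Answer with a single number."}]}]
    text = processor.apply_chat_template(messages, tokenize=False,
                                         add_generation_prompt=True)
    images, videos = process_vision_info(messages)  # samples frames from the video
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8)
    reply = processor.batch_decode(out[:, inputs.input_ids.shape[-1]:],
                                   skip_special_tokens=True)[0]
    return float(reply.strip().split()[0])  # assumes the model answers numerically
```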
For LLM fine-tuning, the process depends on the selected T2V base model and further refines the naively optimized prompts from Stage 2. Example base models include Open-Sora-Plan, Wan2.1, and HunyuanVideo, among others; the sketch below shows one way the collected prompt pairs could be formatted for fine-tuning.
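As a rough sketch of that last step, the (initial prompt, best refined prompt) pairs collected during Stage 2 can be exported as a supervised fine-tuning set for the refiner. The JSONL schema below is an assumption, since the actual format depends on the chosen fine-tuning framework and T2V base model.

```python
# Hedged sketch: export SSPO feedback as an instruction-tuning dataset for the
# prompt refiner. The JSONL field names are an assumed, framework-agnostic format.
import json

def export_sft_pairs(pairs, out_path="sspo_sft.jsonl"):
    """pairs: list of (initial_prompt, best_refined_prompt) from Stage 2."""
    with open(out_path, "w", encoding="utf-8") as f:
        for source, refined in pairs:
            f.write(json.dumps({
                "instruction": "Refine this text-to-video prompt:",
                "input": source,
                "output": refined,
            }, ensure_ascii=False) + "\n")
```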
If you find our work helpful for your research, please consider citing it 📝
@article{gao2025rapopp,
title = {RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling},
author = {Gao, Bingjie and Ma, Qianli and Wu, Xiaoxue and Yang, Shuai and Lan, Guanzhou and Zhao, Haonan and Chen, Jiaxuan and Liu, Qingyang and Qiao, Yu and Chen, Xinyuan and Wang, Yaohui and Niu, Li},
journal = {arXiv preprint arXiv:2510.20206},
year = {2025}
}
@InProceedings{Gao_2025_CVPR,
author = {Gao, Bingjie and Gao, Xinyu and Wu, Xiaoxue and Zhou, Yujie and Qiao, Yu and Niu, Li and Chen, Xinyuan and Wang, Yaohui},
title = {The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
pages = {3173--3183}
}