VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
NeurIPS 2025
✨ A step towards more reliable world modeling by enhancing physics plausibility in video generation.
| Method (VideoPhy) | SA | PC |
|---|---|---|
| CogVideoX-5B | 70.0 | 32.3 |
| +REPA Loss+DINOv2 | 62.5 | 33.7 |
| +REPA Loss+VideoMAEv2 | 59.3 | 35.5 |
| +TRD Loss+VideoMAEv2 (ours) | 72.1 | 40.1 |
🎉 Accepted to NeurIPS 2025!
- Release introduction & visual results
- Release training & inference code
- Upload checkpoints and provide reproduction tips.
- Release evaluation code.
- Release generated videos of VideoREPA. (coming soon!)
If you find VideoREPA useful, please consider giving us a star ⭐ to stay updated.
Figure 1. Evaluation of physics understanding on the Physion benchmark. Chance performance is 50%.
🔍 Physics Understanding Gap: We identify an essential gap in physics understanding between self-supervised VFMs and T2V models, and propose the first method to bridge video understanding models and T2V models. VideoREPA demonstrates that “understanding helps generation” in the video generation field.
Figure 2. Overview of VideoREPA.
VideoREPA enhances physics plausibility in T2V models through Token Relation Distillation (TRD) — a loss that aligns pairwise token relations between self-supervised video encoders and diffusion transformer features.
Each token's relations are aligned along two axes (see the illustrative sketch after this list):
- Spatial relations within a frame
- Temporal relations across frames
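Below is a minimal PyTorch sketch of a TRD-style objective. It is illustrative only, not the released implementation: the (B, T, N, C) token layout, the cosine-similarity relation matrices, and the smooth-L1 alignment are assumptions made for clarity, and the actual training code may differ (e.g., projection heads, normalization, loss weighting).

```python
import torch
import torch.nn.functional as F

def trd_loss_sketch(dit_tokens: torch.Tensor, vfm_tokens: torch.Tensor) -> torch.Tensor:
    """Illustrative Token Relation Distillation-style loss (not the official code).

    Both inputs are assumed to be token features of shape (B, T, N, C):
    batch, frames, tokens per frame, channels, already projected to a shared
    dimension and aligned in space and time.
    """
    B, T, N, C = dit_tokens.shape
    # Flatten spatio-temporal tokens so pairwise relations cover both
    # spatial (within-frame) and temporal (cross-frame) token pairs.
    student = F.normalize(dit_tokens.reshape(B, T * N, C), dim=-1)
    teacher = F.normalize(vfm_tokens.reshape(B, T * N, C), dim=-1)
    # Pairwise cosine-similarity (relation) matrices for each video.
    rel_student = student @ student.transpose(1, 2)  # (B, T*N, T*N)
    rel_teacher = teacher @ teacher.transpose(1, 2)  # (B, T*N, T*N)
    # Align the diffusion transformer's token relations with the VFM's.
    return F.smooth_l1_loss(rel_student, rel_teacher)
```

In training, an alignment term of this kind is added on top of the standard diffusion loss with a weighting coefficient (the proj_coeff parameter mentioned in the reproduction notes below likely plays this role).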
🌟 Novelty: VideoREPA is the first successful adaptation of REPA into video generation — overcoming key challenges in finetuning large pretrained video diffusion transformers and maintaining temporal consistency.
| CogVideoX | CogVideoX+REPA loss | VideoREPA | Prompt |
|---|---|---|---|
| ts_cogvideox_1.mp4 | ts_repa_1.mp4 | ts_videorepa_1.mp4 | Leather glove catching a hard baseball. |
| ts_cogvideox_2.mp4 | ts_repa_2.mp4 | ts_videorepa_2.mp4 | Maple syrup drizzling from a bottle onto pancakes. |
| ts_cogvideox_3.mp4 | ts_repa_3.mp4 | ts_videorepa_3.mp4 | Glass shatters on the floor. |
| ts_cogvideox_4.mp4 | ts_repa_4.mp4 | ts_videorepa_4.mp4 | A child runs and catches a brightly colored frisbee... |
git clone https://github.com/aHapBean/VideoREPA.git
conda create --name videorepa python=3.10
conda activate videorepa
cd VideoREPA
pip install -r requirements.txt
# Install diffusers locally (recommended)
cd ./finetune/diffusers
pip install -e .

Download the OpenVid dataset used in VideoREPA. We use parts 30–49 and select subsets containing 32K and 64K videos, respectively. The corresponding CSV files are located in ./finetune/openvid/.
pip install -U huggingface_hub
# Download parts 30–49
huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part3[0-9].zip"
huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part4[0-9].zip"Then unzip into ./finetune/openvid/videos/.
# Download pretrained CogVideoX checkpoints
huggingface-cli download --repo-type model zai-org/CogVideoX-2b --local-dir ./ckpt/cogvideox-2b
huggingface-cli download --repo-type model zai-org/CogVideoX-5b --local-dir ./ckpt/cogvideox-5b
# Download pretrained vision encoders such as VideoMAEv2 and VJEPA, and put them into ./ckpt/, e.g., ./ckpt/VideoMAEv2/vit_b_k710_dl_from_giant.pth
# Precompute video cache (shared for 2B/5B)
cd finetune/
bash scripts/dataset_precomputing.sh
# Training (adjust GPU count in scripts)
bash scripts/multigpu_VideoREPA_2B_sft.sh
bash scripts/multigpu_VideoREPA_5B_lora.sh

Inference with VideoREPA
# Transform checkpoint to diffuser format (only for sft)
# Put the scripts/merge.sh into the saved checkpoint-xxx/ and run:
bash merge.sh
# Then copy cogvideox-2b/ from ckpt/ to cogvideox-2b-infer/
# Delete the original transformer dir in cogvideox-2b-infer/
# Move the transformed transformer dir into it
# Modify model_index.config in cogvideox-2b-infer/
# "transformer": [
# "models.cogvideox_align",
# "CogVideoXTransformer3DModelAlign"
# ],
# Inference
cd inference/
bash scripts/infer_videorepa_2b_sft.sh
# bash scripts/infer_videorepa_5b_lora.sh

Or run inference directly with our released checkpoints. Please download the weights from Huggingface and:
- For VideoREPA-5B, place pytorch_lora_weights.safetensors in ./inference/
- For VideoREPA-2B, place the transformer directory inside ./ckpt/cogvideox-2b-infer/
huggingface-cli download --repo-type model aHapBean/VideoREPA --local-dir ./

We provide guidance for conveniently reproducing our results.
All experiments use seed = 42 by default in our paper. However, note that randomness exists in both video generation and VideoPhy evaluation, so identical results across different devices (e.g., GPUs) may not be perfectly reproducible even with the same seed.
To reproduce demo videos, simply download the released VideoREPA checkpoints and run inference — similar videos can be generated using VideoREPA-5B (or 2B).
To approximately reproduce the VideoPhy scores, you may either:
- Use the released evaluation videos, or
- Run inference with the released checkpoints.
After the code release, we reproduced VideoREPA-5B on a different device and observed differences in results due to randomness in the benchmark and generation process. Adjusting certain parameters such as proj_coeff (from 0.5 → 0.45) helped restore the reported results, since the original settings were tuned in a different environment (device).
| Model | SA | PC |
|---|---|---|
| VideoREPA-5B (reported) | 72.1 | 40.1 |
| VideoREPA-5B (reproduced) | 74.1 | 40.4 |
Changing the seed slightly may also help. It is expected that you can reproduce the performance trends without further parameter tuning.
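For reference, a minimal seeding snippet of the kind commonly used in PyTorch pipelines is shown below. It is illustrative only (the provided inference scripts handle seeding themselves), and even with fixed seeds, GPU kernels and library versions can introduce small differences across devices.

```python
# Illustrative seeding sketch; the released scripts set seeds internally.
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```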
If you have any questions related to the code or the paper, feel free to email Xiangdong ([email protected]).
This project is built upon and extends several distinguished open-source projects:
- CogVideo: A large-scale video generation framework developed by Tsinghua University, which provides the core architectural foundation for this work.
- finetrainers: A high-efficiency training framework that helped enhance our fine-tuning pipeline.
- diffusers: A go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules.
@article{zhang2025videorepa,
title={VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models},
author={Zhang, Xiangdong and Liao, Jiaqi and Zhang, Shaofeng and Meng, Fanqing and Wan, Xiangpeng and Yan, Junchi and Cheng, Yu},
journal={arXiv preprint arXiv:2505.23656},
year={2025}
}
| | | |
|---|---|---|
| videorepa_01.mp4 | videorepa_02.mp4 | videorepa_03.mp4 |
| videorepa_04.mp4 | videorepa_05.mp4 | videorepa_06.mp4 |
| videorepa_07.mp4 | videorepa_08.mp4 | videorepa_09.mp4 |
| videorepa_10.mp4 | videorepa_11.mp4 | videorepa_12.mp4 |