VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
NeurIPS 2025
✨ A step towards more reliable world modeling by enhancing physics plausibility in video generation.
| Method (VideoPhy) | SA | PC |
|---|---|---|
| CogVideoX-5B | 70.0 | 32.3 |
| +REPA Loss+DINOv2 | 62.5 | 33.7 |
| +REPA Loss+VideoMAEv2 | 59.3 | 35.5 |
| +TRD Loss+VideoMAEv2 (ours) | 72.1 | 40.1 |
🎉 Accepted to NeurIPS 2025!
- Release introduction & visual results
- Release training & inference code
- Upload checkpoints and provide reproduction tips.
- Release evaluation code.
- Release generated videos of VideoREPA. (coming soon!)
If you find VideoREPA useful, please consider giving us a star ⭐ to stay updated.
Figure 1. Evaluation of physics understanding on the Physion benchmark. Chance performance is 50%.
🔍 Physics Understanding Gap: We identify an essential gap in physics understanding between self-supervised VFMs and T2V models, and propose the first method to bridge video understanding models and T2V models. VideoREPA demonstrates that “understanding helps generation” in the video generation field.
Figure 2. Overview of VideoREPA.
VideoREPA enhances physics plausibility in T2V models through Token Relation Distillation (TRD) — a loss that aligns pairwise token relations between self-supervised video encoders and diffusion transformer features.
Each token's relations are aligned along two axes (see the illustrative sketch after this list):
- Spatial relations within a frame
- Temporal relations across frames
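Below is a minimal PyTorch sketch of a TRD-style objective. It is illustrative only, not the released implementation: the (B, T, N, C) token layout, the cosine-similarity relation matrices, and the smooth-L1 alignment are assumptions made for clarity, and the actual training code may differ (e.g., projection heads, normalization, loss weighting).

```python
import torch
import torch.nn.functional as F

def trd_loss_sketch(dit_tokens: torch.Tensor, vfm_tokens: torch.Tensor) -> torch.Tensor:
    """Illustrative Token Relation Distillation-style loss (not the official code).

    Both inputs are assumed to be token features of shape (B, T, N, C):
    batch, frames, tokens per frame, channels, already projected to a shared
    dimension and aligned in space and time.
    """
    B, T, N, C = dit_tokens.shape
    # Flatten spatio-temporal tokens so pairwise relations cover both
    # spatial (within-frame) and temporal (cross-frame) token pairs.
    student = F.normalize(dit_tokens.reshape(B, T * N, C), dim=-1)
    teacher = F.normalize(vfm_tokens.reshape(B, T * N, C), dim=-1)
    # Pairwise cosine-similarity (relation) matrices for each video.
    rel_student = student @ student.transpose(1, 2)  # (B, T*N, T*N)
    rel_teacher = teacher @ teacher.transpose(1, 2)  # (B, T*N, T*N)
    # Align the diffusion transformer's token relations with the VFM's.
    return F.smooth_l1_loss(rel_student, rel_teacher)
```

In training, an alignment term of this kind is added on top of the standard diffusion loss with a weighting coefficient (the proj_coeff parameter mentioned in the reproduction notes below likely plays this role).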
🌟 Novelty: VideoREPA is the first successful adaptation of REPA into video generation — overcoming key challenges in finetuning large pretrained video diffusion transformers and maintaining temporal consistency.
| CogVideoX | CogVideoX+REPA loss | VideoREPA | Prompt |
|---|---|---|---|
| ts_cogvideox_1.mp4 | ts_repa_1.mp4 | ts_videorepa_1.mp4 | Leather glove catching a hard baseball. |
| ts_cogvideox_2.mp4 | ts_repa_2.mp4 | ts_videorepa_2.mp4 | Maple syrup drizzling from a bottle onto pancakes. |
| ts_cogvideox_3.mp4 | ts_repa_3.mp4 | ts_videorepa_3.mp4 | Glass shatters on the floor. |
| ts_cogvideox_4.mp4 | ts_repa_4.mp4 | ts_videorepa_4.mp4 | A child runs and catches a brightly colored frisbee... |
git clone https://github.com/aHapBean/VideoREPA.git
conda create --name videorepa python=3.10
conda activate videorepa
cd VideoREPA
pip install -r requirements.txt
# Install diffusers locally (recommended)
cd ./finetune/diffusers
pip install -e .

Download the OpenVid dataset used in VideoREPA. We use parts 30–49 and select subsets containing 32K and 64K videos, respectively. The corresponding CSV files are located in ./finetune/openvid/.
pip install -U huggingface_hub
# Download parts 30–49
huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part3[0-9].zip"
huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part4[0-9].zip"Then unzip into ./finetune/openvid/videos/.
# Download pretrained CogVideoX checkpoints
huggingface-cli download --repo-type model zai-org/CogVideoX-2b --local-dir ./ckpt/cogvideox-2b
huggingface-cli download --repo-type model zai-org/CogVideoX-5b --local-dir ./ckpt/cogvideox-5b
# Download pretrained vision encoders such as VideoMAEv2 and VJEPA, and put them into ./ckpt/, e.g., ./ckpt/VideoMAEv2/vit_b_k710_dl_from_giant.pth
# Precompute video cache (shared for 2B/5B)
cd finetune/
bash scripts/dataset_precomputing.sh
# Training (adjust GPU count in scripts)
bash scripts/multigpu_VideoREPA_2B_sft.sh
bash scripts/multigpu_VideoREPA_5B_lora.sh

Inference with VideoREPA
# Transform checkpoint to diffuser format (only for sft)
# Put the scripts/merge.sh into the saved checkpoint-xxx/ and run:
bash merge.sh
# Then copy cogvideox-2b/ from ckpt/ to cogvideox-2b-infer/
# Delete the original transformer dir in cogvideox-2b-infer/
# Move the transformed transformer dir into it
# Modify model_index.config in cogvideox-2b-infer/
# "transformer": [
# "models.cogvideox_align",
# "CogVideoXTransformer3DModelAlign"
# ],
# Inference
cd inference/
bash scripts/infer_videorepa_2b_sft.sh
# bash scripts/infer_videorepa_5b_lora.sh

Or run inference directly with our released checkpoints. Please download the weights from Huggingface and:
- For VideoREPA-5B, place pytorch_lora_weights.safetensors in ./inference/
- For VideoREPA-2B, place the transformer directory inside ./ckpt/cogvideox-2b-infer/
huggingface-cli download --repo-type model aHapBean/VideoREPA --local-dir ./

We provide guidance for conveniently reproducing our results.
All experiments use seed = 42 by default in our paper. However, note that randomness exists in both video generation and VideoPhy evaluation, so identical results across different devices (e.g., GPUs) may not be perfectly reproducible even with the same seed.
To reproduce demo videos, simply download the released VideoREPA checkpoints and run inference — similar videos can be generated using VideoREPA-5B (or 2B).
To approximately reproduce the VideoPhy scores, you may either:
- Use the released evaluation videos, or
- Run inference with the released checkpoints.
After the code release, we reproduced VideoREPA-5B on a different device and observed differences in results due to randomness in the benchmark and generation process. Adjusting certain parameters such as proj_coeff (from 0.5 → 0.45) helped restore the reported results, since the original settings were tuned in a different environment (device).
| Model | SA | PC |
|---|---|---|
| VideoREPA-5B (reported) | 72.1 | 40.1 |
| VideoREPA-5B (reproduced) | 74.1 | 40.4 |
Changing the seed slightly may also help. It is expected that you can reproduce the performance trends without further parameter tuning.
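For reference, a minimal seeding snippet of the kind commonly used in PyTorch pipelines is shown below. It is illustrative only (the provided inference scripts handle seeding themselves), and even with fixed seeds, GPU kernels and library versions can introduce small differences across devices.

```python
# Illustrative seeding sketch; the released scripts set seeds internally.
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```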
If you have any questions related to the code or the paper, feel free to email Xiangdong ([email protected]).
This project is built upon and extends several distinguished open-source projects:
- CogVideo: A large-scale video generation framework developed by Tsinghua University, which provides the core architectural foundation for this work.
- finetrainers: A high-efficiency training framework that helped enhance our fine-tuning pipeline.
- diffusers: A go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules.
@article{zhang2025videorepa,
title={VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models},
author={Zhang, Xiangdong and Liao, Jiaqi and Zhang, Shaofeng and Meng, Fanqing and Wan, Xiangpeng and Yan, Junchi and Cheng, Yu},
journal={arXiv preprint arXiv:2505.23656},
year={2025}
}
| | | |
|---|---|---|
| videorepa_01.mp4 | videorepa_02.mp4 | videorepa_03.mp4 |
| videorepa_04.mp4 | videorepa_05.mp4 | videorepa_06.mp4 |
| videorepa_07.mp4 | videorepa_08.mp4 | videorepa_09.mp4 |
| videorepa_10.mp4 | videorepa_11.mp4 | videorepa_12.mp4 |