VideoREPA (NeurIPS 2025)

VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
NeurIPS 2025

A step towards more reliable world modeling by enhancing physics plausibility in video generation.

Results on the VideoPhy benchmark (SA: semantic adherence, PC: physical commonsense):

| Method | SA | PC |
| --- | --- | --- |
| CogVideoX-5B | 70.0 | 32.3 |
| + REPA loss + DINOv2 | 62.5 ⚠️ | 33.7 |
| + REPA loss + VideoMAEv2 | 59.3 ⚠️ | 35.5 |
| + TRD loss + VideoMAEv2 (ours) | 72.1 | 40.1 |

✅ Project Status

🎉 Accepted to NeurIPS 2025!

  • Release introduction & visual results
  • Release training & inference code
  • Upload checkpoints and provide reproducing tips
  • Release evaluation code
  • Release generated videos of VideoREPA (coming soon!)

If you find VideoREPA useful, please consider giving us a star ⭐ to stay updated.

Introduction


Figure 1. Evaluation of physics understanding on the Physion benchmark. Chance performance is 50%.

🔍 Physics Understanding Gap: We identify a substantial gap in physics understanding between self-supervised VFMs and T2V models, and propose the first method to bridge video understanding models and T2V models. VideoREPA demonstrates that "understanding helps generation" in the video generation setting.

Overview


Figure 2. Overview of VideoREPA.

VideoREPA enhances physics plausibility in T2V models through Token Relation Distillation (TRD) — a loss that aligns pairwise token relations between self-supervised video encoders and diffusion transformer features.

Token relations are aligned along two axes:

  • Spatial relations within a frame
  • Temporal relations across frames

🌟 Novelty: VideoREPA is the first successful adaptation of REPA into video generation — overcoming key challenges in finetuning large pretrained video diffusion transformers and maintaining temporal consistency.
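
For intuition, here is a minimal, self-contained PyTorch sketch of a TRD-style loss. It is illustrative only: the tensor shapes, the function name, and the choice of cosine similarity with a smooth-L1 penalty are assumptions, not the repository's actual implementation.

# Minimal sketch of a Token Relation Distillation (TRD)-style loss.
# Assumes both feature maps are already projected to a common shape
# [B, T, N, C] (batch, frames, tokens per frame, channels).
import torch
import torch.nn.functional as F

def trd_loss(diffusion_feats: torch.Tensor, encoder_feats: torch.Tensor) -> torch.Tensor:
    """Align pairwise token relations between diffusion transformer
    features and features from a frozen self-supervised video encoder."""
    B, T, N, C = diffusion_feats.shape

    # Flatten frames and tokens so every pair of spatio-temporal tokens
    # contributes a relation: [B, T*N, C].
    x = F.normalize(diffusion_feats.reshape(B, T * N, -1), dim=-1)
    y = F.normalize(encoder_feats.reshape(B, T * N, -1), dim=-1)

    # Pairwise cosine-similarity matrices: [B, T*N, T*N]. Entries within a
    # frame capture spatial relations; entries across frames capture
    # temporal relations.
    rel_x = torch.bmm(x, x.transpose(1, 2))
    rel_y = torch.bmm(y, y.transpose(1, 2))

    # Distill the encoder's relation structure into the diffusion features.
    return F.smooth_l1_loss(rel_x, rel_y)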

Qualitative Results

| CogVideoX | CogVideoX + REPA loss | VideoREPA | Prompt |
| --- | --- | --- | --- |
| ts_cogvideox_1.mp4 | ts_repa_1.mp4 | ts_videorepa_1.mp4 | Leather glove catching a hard baseball. |
| ts_cogvideox_2.mp4 | ts_repa_2.mp4 | ts_videorepa_2.mp4 | Maple syrup drizzling from a bottle onto pancakes. |
| ts_cogvideox_3.mp4 | ts_repa_3.mp4 | ts_videorepa_3.mp4 | Glass shatters on the floor. |
| ts_cogvideox_4.mp4 | ts_repa_4.mp4 | ts_videorepa_4.mp4 | A child runs and catches a brightly colored frisbee... |

⚙️ Quick start

Environment setup

git clone https://github.com/aHapBean/VideoREPA.git

conda create --name videorepa python=3.10
conda activate videorepa

cd VideoREPA
pip install -r requirements.txt

# Install diffusers locally (recommended)
cd ./finetune/diffusers
pip install -e .

Dataset download

Download the OpenVid dataset used in VideoREPA. We use parts 30–49 and select two subsets containing 32K and 64K videos; the corresponding CSV files are located in ./finetune/openvid/.

pip install -U huggingface_hub

# Download parts 30–49
huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part3[0-9].zip"

huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part4[0-9].zip"

Then unzip the downloaded parts into ./finetune/openvid/videos/.
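
If you prefer to script the extraction, a minimal Python sketch is shown below. It assumes each downloaded part is a standalone .zip archive and uses the paths from the steps above.

# Extract all downloaded OpenVid parts into ./finetune/openvid/videos/.
import glob
import zipfile
from pathlib import Path

out_dir = Path("./finetune/openvid/videos")
out_dir.mkdir(parents=True, exist_ok=True)

for part in sorted(glob.glob("./finetune/openvid/OpenVid_part*.zip")):
    print(f"Extracting {part} ...")
    with zipfile.ZipFile(part) as zf:
        zf.extractall(out_dir)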

Training

# Download pretrained CogVideoX checkpoints
huggingface-cli download --repo-type model zai-org/CogVideoX-2b --local-dir ./ckpt/cogvideox-2b
huggingface-cli download --repo-type model zai-org/CogVideoX-5b --local-dir ./ckpt/cogvideox-5b

# Download a pretrained vision encoder such as VideoMAEv2 or VJEPA and put it into ./ckpt/, e.g. ./ckpt/VideoMAEv2/vit_b_k710_dl_from_giant.pth

# Precompute video cache (shared for 2B/5B)
cd finetune/
bash scripts/dataset_precomputing.sh

# Training (adjust GPU count in scripts)
bash scripts/multigpu_VideoREPA_2B_sft.sh
bash scripts/multigpu_VideoREPA_5B_lora.sh

Inference

Inference with a VideoREPA checkpoint you trained yourself:

# Convert the checkpoint to diffusers format (SFT only)
# Copy scripts/merge.sh into the saved checkpoint-xxx/ directory and run:
bash merge.sh

# Then copy cogvideox-2b/ from ckpt/ to cogvideox-2b-infer/
# Delete the original transformer directory in cogvideox-2b-infer/
# Move the converted transformer directory into its place

# Modify model_index.json in cogvideox-2b-infer/ as follows:
# "transformer": [
#   "models.cogvideox_align",
#   "CogVideoXTransformer3DModelAlign"
# ],

# Inference
cd inference/
bash scripts/infer_videorepa_2b_sft.sh
# bash scripts/infer_videorepa_5b_lora.sh
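
If you want to script the model_index edit described in the comments above, a minimal Python sketch (assuming the checkpoint layout shown there) is:

# Point the pipeline's transformer entry at the aligned transformer class.
import json
from pathlib import Path

index_path = Path("./ckpt/cogvideox-2b-infer/model_index.json")
index = json.loads(index_path.read_text())

# Class name and module path taken from the comments above.
index["transformer"] = ["models.cogvideox_align", "CogVideoXTransformer3DModelAlign"]

index_path.write_text(json.dumps(index, indent=2))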

Or run inference directly with our released checkpoints. Please download the weights from Huggingface and place them as follows:

  • For VideoREPA-5B, place pytorch_lora_weights.safetensors in ./inference/

  • For VideoREPA-2B, place the transformer directory inside ./ckpt/cogvideox-2b-infer/

huggingface-cli download --repo-type model aHapBean/VideoREPA --local-dir ./

Reproducing tips

We provide some guidance to make reproducing our results easier.

All experiments in our paper use seed = 42 by default. However, randomness exists in both video generation and VideoPhy evaluation, so results may not be perfectly identical across different devices (e.g., GPUs) even with the same seed.

To reproduce demo videos, simply download the released VideoREPA checkpoints and run inference — similar videos can be generated using VideoREPA-5B (or 2B).

To approximately reproduce the VideoPhy scores, you may either:

  • Use the released evaluation videos, or
  • Run inference with the released checkpoints.

After the code release, we reproduced VideoREPA-5B on a different device and observed differences in results due to randomness in the benchmark and the generation process. Adjusting parameters such as proj_coeff (from 0.5 to 0.45) helped recover the reported results, since the original settings were tuned in a different environment (device).

| Model | SA | PC |
| --- | --- | --- |
| VideoREPA-5B (reported) | 72.1 | 40.1 |
| VideoREPA-5B (reproduced) | 74.1 | 40.4 |

Slightly changing the seed may also help. You should be able to reproduce the performance trends without further parameter tuning.

Contact

If you have any questions related to the code or the paper, feel free to email Xiangdong ([email protected]).

Acknowledgement

This project is built upon and extends several distinguished open-source projects:

  • CogVideo: A large-scale video generation framework developed by Tsinghua University, which provides the core architectural foundation for this work.

  • finetrainers: A high-efficiency training framework that helped enhance our fine-tuning pipeline.

  • diffusers: A go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules.

Citation

@article{zhang2025videorepa,
  title={VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models},
  author={Zhang, Xiangdong and Liao, Jiaqi and Zhang, Shaofeng and Meng, Fanqing and Wan, Xiangpeng and Yan, Junchi and Cheng, Yu},
  journal={arXiv preprint arXiv:2505.23656},
  year={2025}
}

More Generated Videos

videorepa_01.mp4
videorepa_02.mp4
videorepa_03.mp4
videorepa_04.mp4
videorepa_05.mp4
videorepa_06.mp4
videorepa_07.mp4
videorepa_08.mp4
videorepa_09.mp4
videorepa_10.mp4
videorepa_11.mp4
videorepa_12.mp4
