This is a long-term open-source project that aims to implement functionality similar to Sora2, and it will include the implementations of multiple works.
Open-source list:
- Joint audio-video generation (UniAVGen)
Guozhen Zhang · Zixiang Zhou · Teng Hu · Ziqiao Peng · Youliang Zhang · Yi Chen · Yuan Zhou · Qinglin Lu · Limin Wang

MCG-NJU | Tencent Hunyuan
This repository is the official implementation of the paper "UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions". UniAVGen is a unified framework for high-fidelity joint audio-video generation, addressing key limitations of existing methods such as poor lip synchronization, insufficient semantic consistency, and limited task generalization.
At its core, UniAVGen adopts a symmetric dual-branch architecture (parallel Diffusion Transformers for audio and video) and introduces three critical innovations: (1) Asymmetric Cross-Modal Interaction for bidirectional temporal alignment, (2) Face-Aware Modulation to prioritize salient facial regions during interaction, and (3) Modality-Aware Classifier-Free Guidance (MA-CFG) to amplify cross-modal correlations during inference.
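The exact MA-CFG formulation is given in the paper; as a rough intuition, classifier-free guidance applied per modality with separate scales might look like the sketch below. The function name, the model interface, and the combination rule are illustrative assumptions, not the official implementation:

```python
import torch

def per_modality_cfg(model, x_audio: torch.Tensor, x_video: torch.Tensor,
                     t, cond, uncond,
                     audio_scale: float = 2.0, video_scale: float = 3.0):
    """Sketch only: standard CFG extrapolation applied separately to each
    branch, so audio and video get their own guidance scales
    (cf. audio_guidance_scale / video_guidance_scale in inference.yaml)."""
    # Conditional and unconditional noise predictions from the dual-branch model.
    eps_a_c, eps_v_c = model(x_audio, x_video, t, cond)
    eps_a_u, eps_v_u = model(x_audio, x_video, t, uncond)
    # Classic CFG combination, with a separate scale per modality.
    eps_a = eps_a_u + audio_scale * (eps_a_c - eps_a_u)
    eps_v = eps_v_u + video_scale * (eps_v_c - eps_v_u)
    return eps_a, eps_v
```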
- 2025-12-14: Released the inference code and weights of UniAVGen.
- 2025-11-05: Our paper is publicly available on arXiv.
Clone the repository and set up the environment:

```bash
git clone https://github.com/MCG-NJU/Sora2-mini.git
cd Sora2-mini
conda create -n uniavgen python=3.10 -y
conda activate uniavgen

# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl && python -c "import flash_attn"
pip install "xfuser[diffusers,flash-attn]"
pip install -r requirements.txt
```
Then download the pretrained weights from Hugging Face:

```bash
huggingface-cli download MCG-NJU/UniAVGen --local-dir ./UniAVGen
```
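The same download can also be done from Python with the huggingface_hub package (assuming it is installed in the environment):

```python
from huggingface_hub import snapshot_download

# Fetch all UniAVGen weights into ./UniAVGen, matching the CLI command above.
snapshot_download(repo_id="MCG-NJU/UniAVGen", local_dir="./UniAVGen")
```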
We support four tasks: joint audio-video generation (AVG, id: 0), joint generation with reference audio (RAVG, id: 1), audio-driven video generation (A2V, id: 2), and video-driven audio generation (V2A, id: 3).
To stay compatible across multiple tasks, input data is organized in CSV format, which supports both single-GPU and multi-GPU parallel testing. Each column is defined below (a sketch for building such a CSV follows the column definitions):
```
data_id            # [required] name of the output sample
ref_image_path     # [required] reference image; required for all tasks except V2A
speech_content     # [required] audio speech content
prompt             # [required] video caption
lang               # [required] language, en or zh (zh performance is still being improved)
ref_audio_path     # [optional] reference audio; required only for RAVG
ref_speech_content # [optional] speech content corresponding to ref_audio
video_path         # [optional] condition video for V2A
audio_path         # [optional] condition audio for A2V
```

We provide demo CSVs for each task in examples/csvs.
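As a minimal example, an AVG-task (id 0) CSV can be written with Python's standard csv module; the paths and text below are placeholders, not shipped assets:

```python
import csv

# One row for the AVG task (task id 0); all paths and text are placeholder values.
row = {
    "data_id": "demo_avg_000",
    "ref_image_path": "examples/images/ref.png",  # required for all tasks except V2A
    "speech_content": "Hello, welcome to UniAVGen.",
    "prompt": "A person speaking to the camera in a bright room.",
    "lang": "en",
}

with open("examples/csvs/my_task_AVG.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row.keys()))
    writer.writeheader()
    writer.writerow(row)
```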
You can modify configs/inference.yaml to control the parameters of the sampling process; the details are as follows:
```yaml
model_path: UniAVGen
audio_guidance_scale: 2.0
video_guidance_scale: 3.0
output_dir: ./outputs/demo
num_steps: 50
shift: 5.0
seed: 2025
video_negative_prompt: ""
test_csv: examples/csvs/test_task_AVG.csv # path of the test CSV
slg_layer: 11 # skip-layer guidance, default = 11
macfg_prop: 0.5 # proportion of timesteps using MA-CFG, default = 0.5
```
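To tweak these values programmatically rather than by hand, the file can be round-tripped with PyYAML (a sketch; it assumes PyYAML is available, and note that dumping discards the inline comments):

```python
import yaml

# Load the sampling config, adjust a field, and write it back (comments are lost).
with open("configs/inference.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["num_steps"] = 30  # e.g. trade some quality for speed

with open("configs/inference.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```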
```bash
# Single GPU
torchrun --nnodes 1 --nproc_per_node 1 inference.py --task 0 # specify the task id
# Multi GPU
torchrun --nnodes 1 --nproc_per_node 8 inference.py --task 0
```

If you find this project helpful for your research or applications, please feel free to leave a star ⭐️ and cite our paper:
```bibtex
@misc{zhang2025uniavgenunifiedaudiovideo,
  title={UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions},
  author={Guozhen Zhang and Zixiang Zhou and Teng Hu and Ziqiao Peng and Youliang Zhang and Yi Chen and Yuan Zhou and Qinglin Lu and Limin Wang},
  year={2025},
  eprint={2511.03334},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.03334},
}
```

This project is released under the Apache 2.0 license. The code is based on Wan2.2, F5TTS, and OVI; please also follow their licenses. Thanks for their awesome work.