
Sora2-mini

This is a long-term open-source project that aims to implement full functionality similar to Sora2; it will include implementations of multiple works.

Currently open-sourced:

  • Joint audio-video generation (UniAVGen)

UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Guozhen Zhang · Zixiang Zhou · Teng Hu · Ziqiao Peng · Youliang Zhang
Yi Chen · Yuan Zhou · Qinglin Lu · Limin Wang
MCG-NJU   |   Tencent Hunyuan

Paper PDF | Project Page

This repository is the official implementation of the paper "UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions". UniAVGen is a unified framework for high-fidelity joint audio-video generation that addresses key limitations of existing methods, such as poor lip synchronization, insufficient semantic consistency, and limited task generalization.

At its core, UniAVGen adopts a symmetric dual-branch architecture (parallel Diffusion Transformers for audio and video) and introduces three critical innovations: (1) Asymmetric Cross-Modal Interaction for bidirectional temporal alignment, (2) Face-Aware Modulation to prioritize salient facial regions during interaction, and (3) Modality-Aware Classifier-Free Guidance (MA-CFG) to amplify cross-modal correlations during inference.
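For orientation, the two guidance scales exposed in the inference config (audio_guidance_scale and video_guidance_scale) each steer one branch. The snippet below is a minimal sketch of the standard per-branch classifier-free guidance that MA-CFG builds on; it is an illustration with assumed tensor names, not the paper's actual MA-CFG rule.

import torch

def cfg(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, scale: float) -> torch.Tensor:
    # Standard classifier-free guidance: move the prediction away from the
    # unconditional branch in proportion to `scale`.
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Hypothetical per-branch use with the defaults from configs/inference.yaml:
#   audio_pred = cfg(audio_cond, audio_uncond, scale=2.0)  # audio_guidance_scale
#   video_pred = cfg(video_cond, video_uncond, scale=3.0)  # video_guidance_scale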

[Teaser figure]

💥 News

  • 2025-12-14: Released the inference code and weights of UniAVGen.
  • 2025-11-05: Our paper is now public on arXiv.

💕 Installation

git clone https://github.com/MCG-NJU/Sora2-mini.git
cd Sora2-mini

conda create -n uniavgen python=3.10 -y
conda activate uniavgen

# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# Prebuilt flash-attn wheel; the torch2.6 build matches the torch 2.6.0 installed above
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl && python -c "import flash_attn"
pip install "xfuser[diffusers,flash-attn]"
pip install -r requirements.txt
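
After installation, an optional sanity check confirms that the expected PyTorch build sees the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"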

Download Checkpoints

huggingface-cli download MCG-NJU/UniAVGen --local-dir ./UniAVGen

😎 Inference

We support joint audio-visual generation (AVG, id: 0), joint generation with reference audio (RAVG, id: 1), audio-driven video generation (A2V, id: 2), and video-driven audio generation (V2A, id: 3).

Inference Data Construction

Since the data format must remain compatible across multiple tasks, we organize the inference data in CSV format, which supports both single-GPU and multi-GPU parallel testing. The definition of each column in the CSV is given below:

data_id            # [required] name of the output sample
ref_image_path     # [required] reference image; needed for all tasks except V2A
speech_content     # [required] audio speech content
prompt             # [required] video caption
lang               # [required] language, en or zh (zh performance is still being improved)
ref_audio_path     # [optional] reference audio; required only for RAVG
ref_speech_content # [optional] speech content corresponding to ref_audio
video_path         # [optional] condition video for V2A
audio_path         # [optional] condition audio for A2V

We provide demo CSVs for each task in examples/csvs.
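
As a concrete example, a test CSV for the AVG task (task id 0) can be built programmatically as below. The column names follow the definitions above; the paths and text are placeholders, not files shipped with the repository.

import csv

# Hypothetical row for the AVG task; all paths and contents are placeholders.
row = {
    "data_id": "demo_000",
    "ref_image_path": "examples/images/speaker.png",
    "speech_content": "Hello, welcome to UniAVGen.",
    "prompt": "A person speaking to the camera in a bright room.",
    "lang": "en",
    "ref_audio_path": "",      # RAVG only
    "ref_speech_content": "",  # RAVG only
    "video_path": "",          # V2A only
    "audio_path": "",          # A2V only
}

with open("my_test_AVG.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)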

Inference Config

You can modify configs/inference.yaml to control the parameters of the sampling process; the details are as follows:

model_path: UniAVGen
audio_guidance_scale: 2.0   
video_guidance_scale: 3.0   
output_dir: ./outputs/demo
num_steps: 50
shift: 5.0
seed: 2025
video_negative_prompt: ""  
test_csv: examples/csvs/test_task_AVG.csv # path of test csv
slg_layer: 11        # skip layer guidance, default = 11
macfg_prop: 0.5       # proportion of timesteps using MA-CFG, default = 0.5
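
If you sweep these parameters programmatically, a small helper along the following lines can load and rewrite the file. This is a sketch assuming the config is plain YAML readable with PyYAML; the repository may parse it differently.

import yaml

# Load the inference config, override a few fields, and save a variant.
with open("configs/inference.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["seed"] = 42        # keys match the file shown above
cfg["num_steps"] = 30

with open("configs/inference_sweep.yaml", "w") as f:
    yaml.safe_dump(cfg, f)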

Sample

# Single GPU
torchrun --nnodes 1 --nproc_per_node 1 inference.py --task 0 # specify the task id
# Multi GPU
torchrun --nnodes 1 --nproc_per_node 8 inference.py --task 0
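
With multiple GPUs, each torchrun worker typically processes a disjoint slice of the test CSV. The repository handles this internally; the sketch below only illustrates the usual rank-striping pattern, with hypothetical names, using the RANK and WORLD_SIZE variables that torchrun sets.

import csv
import os

def rows_for_this_rank(csv_path):
    # Stripe rows across workers: rank r handles rows r, r + world_size, ...
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    with open(csv_path) as f:
        rows = list(csv.DictReader(f))
    return rows[rank::world_size]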

💪 Citation

If you find this project helpful in your research or applications, please feel free to leave a star ⭐️ and cite our paper:

@misc{zhang2025uniavgenunifiedaudiovideo,
      title={UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions}, 
      author={Guozhen Zhang and Zixiang Zhou and Teng Hu and Ziqiao Peng and Youliang Zhang and Yi Chen and Yuan Zhou and Qinglin Lu and Limin Wang},
      year={2025},
      eprint={2511.03334},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.03334}, 
}

💗 License and Acknowledgement

This project is released under the Apache 2.0 license. The code builds on Wan2.2, F5TTS, and OVI; please also follow their licenses. Thanks for their awesome work.
