This is a long-term open-source project that aims to implement functionality similar to Sora2, and it will include the implementations of multiple works.
Open-source list:
- Joint audio-video generation (UniAVGen)
Guozhen Zhang · Zixiang Zhou · Teng Hu · Ziqiao Peng · Youliang Zhang · Yi Chen · Yuan Zhou · Qinglin Lu · Limin Wang

MCG-NJU | Tencent Hunyuan
This repository is the official implementation of the paper "UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions". UniAVGen is a unified framework for high-fidelity joint audio-video generation, addressing key limitations of existing methods such as poor lip synchronization, insufficient semantic consistency, and limited task generalization.
At its core, UniAVGen adopts a symmetric dual-branch architecture (parallel Diffusion Transformers for audio and video) and introduces three critical innovations: (1) Asymmetric Cross-Modal Interaction for bidirectional temporal alignment, (2) Face-Aware Modulation to prioritize salient facial regions during interaction, and (3) Modality-Aware Classifier-Free Guidance (MA-CFG) to amplify cross-modal correlations during inference.
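The exact MA-CFG formulation is given in the paper; as a rough intuition, classifier-free guidance applied per modality with separate scales might look like the sketch below. The function name, the model interface, and the combination rule are illustrative assumptions, not the official implementation:

```python
import torch

def per_modality_cfg(model, x_audio: torch.Tensor, x_video: torch.Tensor,
                     t, cond, uncond,
                     audio_scale: float = 2.0, video_scale: float = 3.0):
    """Sketch only: standard CFG extrapolation applied separately to each
    branch, so audio and video get their own guidance scales
    (cf. audio_guidance_scale / video_guidance_scale in inference.yaml)."""
    # Conditional and unconditional noise predictions from the dual-branch model.
    eps_a_c, eps_v_c = model(x_audio, x_video, t, cond)
    eps_a_u, eps_v_u = model(x_audio, x_video, t, uncond)
    # Classic CFG combination, with a separate scale per modality.
    eps_a = eps_a_u + audio_scale * (eps_a_c - eps_a_u)
    eps_v = eps_v_u + video_scale * (eps_v_c - eps_v_u)
    return eps_a, eps_v
```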
- 2025-12-14: Released the inference code and weights of UniAVGen.
- 2025-11-05: Our paper is publicly available on arXiv.
Clone the repository and set up the environment:

```bash
git clone https://github.com/MCG-NJU/Sora2-mini.git
cd Sora2-mini
conda create -n uniavgen python=3.10 -y
conda activate uniavgen

# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl && python -c "import flash_attn"
pip install "xfuser[diffusers,flash-attn]"
pip install -r requirements.txt
```
Then download the pretrained weights from Hugging Face:

```bash
huggingface-cli download MCG-NJU/UniAVGen --local-dir ./UniAVGen
```
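The same download can also be done from Python with the huggingface_hub package (assuming it is installed in the environment):

```python
from huggingface_hub import snapshot_download

# Fetch all UniAVGen weights into ./UniAVGen, matching the CLI command above.
snapshot_download(repo_id="MCG-NJU/UniAVGen", local_dir="./UniAVGen")
```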
We support four tasks: joint audio-video generation (AVG, id: 0), joint generation with reference audio (RAVG, id: 1), audio-driven video generation (A2V, id: 2), and video-driven audio generation (V2A, id: 3).
To stay compatible across multiple tasks, input data is organized in CSV format, which supports both single-GPU and multi-GPU parallel testing. Each column is defined below (a sketch for building such a CSV follows the column definitions):
```
data_id            # [required] name of the output sample
ref_image_path     # [required] reference image; required for all tasks except V2A
speech_content     # [required] audio speech content
prompt             # [required] video caption
lang               # [required] language, en or zh (zh performance is still being improved)
ref_audio_path     # [optional] reference audio; required only for RAVG
ref_speech_content # [optional] speech content corresponding to ref_audio
video_path         # [optional] condition video for V2A
audio_path         # [optional] condition audio for A2V
```

We provide demo CSVs for each task in examples/csvs.
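As a minimal example, an AVG-task (id 0) CSV can be written with Python's standard csv module; the paths and text below are placeholders, not shipped assets:

```python
import csv

# One row for the AVG task (task id 0); all paths and text are placeholder values.
row = {
    "data_id": "demo_avg_000",
    "ref_image_path": "examples/images/ref.png",  # required for all tasks except V2A
    "speech_content": "Hello, welcome to UniAVGen.",
    "prompt": "A person speaking to the camera in a bright room.",
    "lang": "en",
}

with open("examples/csvs/my_task_AVG.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row.keys()))
    writer.writeheader()
    writer.writerow(row)
```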
You can modify configs/inference.yaml to control the parameters of the sampling process; the details are as follows:
```yaml
model_path: UniAVGen
audio_guidance_scale: 2.0
video_guidance_scale: 3.0
output_dir: ./outputs/demo
num_steps: 50
shift: 5.0
seed: 2025
video_negative_prompt: ""
test_csv: examples/csvs/test_task_AVG.csv # path of the test CSV
slg_layer: 11 # skip-layer guidance, default = 11
macfg_prop: 0.5 # proportion of timesteps using MA-CFG, default = 0.5
```
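To tweak these values programmatically rather than by hand, the file can be round-tripped with PyYAML (a sketch; it assumes PyYAML is available, and note that dumping discards the inline comments):

```python
import yaml

# Load the sampling config, adjust a field, and write it back (comments are lost).
with open("configs/inference.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["num_steps"] = 30  # e.g. trade some quality for speed

with open("configs/inference.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```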
```bash
# Single GPU
torchrun --nnodes 1 --nproc_per_node 1 inference.py --task 0 # specify the task id
# Multi GPU
torchrun --nnodes 1 --nproc_per_node 8 inference.py --task 0
```

If you find this project helpful for your research or applications, please feel free to leave a star ⭐️ and cite our paper:
```bibtex
@misc{zhang2025uniavgenunifiedaudiovideo,
  title={UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions},
  author={Guozhen Zhang and Zixiang Zhou and Teng Hu and Ziqiao Peng and Youliang Zhang and Yi Chen and Yuan Zhou and Qinglin Lu and Limin Wang},
  year={2025},
  eprint={2511.03334},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.03334},
}
```

This project is released under the Apache 2.0 license. The code is based on Wan2.2, F5TTS, and OVI; please also follow their licenses. Thanks for their awesome work.