This is the official inference code for UniVerse-1.
- Sep 28, 2025: 👋 We release the Verse-Bench metric tools.
- Sep 09, 2025: 👋 We release the technical report of UniVerse-1.
- Sep 08, 2025: 👋 We release the Verse-Bench datasets.
- Sep 08, 2025: 👋 We release model weights of UniVerse-1.
- Sep 08, 2025: 👋 We release inference code of UniVerse-1.
- Sep 03, 2025: 👋 We release the project page of UniVerse-1.
UniVerse-1 is a unified, Veo-3-like model that simultaneously generates synchronized audio and video from a reference image and a text prompt.
- Unified audio-video synthesis: Generates audio and video in tandem, interpreting the input prompt to produce synchronized audio-visual output.
- Speech audio generation: Generates fluent speech directly from a text prompt, demonstrating a built-in text-to-speech (TTS) ability. It also tailors the voice timbre to match the specific character being generated.
- Musical instrument sound generation: Generates the sounds of musical instruments being played. It also offers some capability for "singing while playing," generating vocal and instrumental tracks concurrently.
- Ambient sound generation: Generates background audio that matches the visual environment of the video.
- First open-source DiT-based joint audio-video method: We are the first to open-source a DiT-based, Veo-3-like model for joint audio-visual generation.
| Models | 🤗 Hugging Face |
|---|---|
| UniVerse-1 Base | UniVerse-1 |
Download our pretrained model into ./checkpoints/UniVerse-1-base/
- Python >= 3.10
- PyTorch >= 2.5.0 (CUDA 12.1 build)
- CUDA Toolkit
- Dependent models:
  - Wan-AI/Wan2.1-T2V-1.3B-Diffusers, download into ./huggingfaces/Wan-AI/Wan2.1-T2V-1.3B-Diffusers/
  - ACE-Step/ACE-Step-v1-3.5B, download into ./huggingfaces/ACE-Step/ACE-Step-v1-3.5B/
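
The inference script expects three directories to be populated: the UniVerse-1 checkpoint and the two dependent models listed above. As a minimal sketch, the helper below checks that those directories exist under a given repository root before you launch inference. `check_layout` is a hypothetical name used for illustration and is not part of the UniVerse-1 codebase; it only tests that the directories exist, not that the weights inside them are complete.

```shell
# Hypothetical helper (not part of the UniVerse-1 repository):
# reports which of the expected model directories exist under a root.
# Returns the number of missing directories (0 means all are present).
check_layout() {
    local root="${1:-.}"
    local missing=0
    local dir
    for dir in \
        "checkpoints/UniVerse-1-base" \
        "huggingfaces/Wan-AI/Wan2.1-T2V-1.3B-Diffusers" \
        "huggingfaces/ACE-Step/ACE-Step-v1-3.5B"
    do
        if [ -d "$root/$dir" ]; then
            echo "ok       $dir"
        else
            echo "missing  $dir"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

Calling `check_layout .` from the repository root prints one status line per directory. If a dependent model is missing, one way to fetch it (assuming the `huggingface_hub` command-line tool is installed) is `huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B-Diffusers --local-dir ./huggingfaces/Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, and likewise for ACE-Step/ACE-Step-v1-3.5B.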
conda create -n universe python=3.10
conda activate universe
pip install torch==2.5.0 torchaudio==2.5.0 torchvision --index-url https://download.pytorch.org/whl/cu121
pip install packaging ninja && pip install flash-attn==2.7.0.post2 --no-build-isolation
pip install -r requirements-lint.txt
pip install -e .
git clone https://github.com/Dorniwang/UniVerse-1-code/
cd UniVerse-1-code
bash scripts/inference/inference_universe.sh
Part of the code for this project comes from:
Thank you to all the open-source projects for their contributions to this project!
The code in this repository is licensed under the Apache 2.0 License.
@article{wang2025universe,
title={UniVerse-1: Unified Audio-Video Generation via Stitching of Experts},
author={Wang, Duomin and Zuo, Wei and Li, Aojie and Chen, Ling-Hao and Liao, Xinyao and Zhou, Deyu and Yin, Zixin and Dai, Xili and Jiang, Daxin and Yu, Gang},
journal={arXiv preprint arXiv:2509.06155},
year={2025}
}