
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts.

This is the official inference code for UniVerse-1.

🔥🔥🔥 News!!

  • Sep 28, 2025: 👋 We release the Verse-Bench metric tools (Verse-Bench tools).
  • Sep 09, 2025: 👋 We release the technical report of UniVerse-1.
  • Sep 08, 2025: 👋 We release the Verse-Bench datasets (Verse-Bench Dataset).
  • Sep 08, 2025: 👋 We release model weights of UniVerse-1.
  • Sep 08, 2025: 👋 We release inference code of UniVerse-1.
  • Sep 03, 2025: 👋 We release the project page of UniVerse-1.

Introduction

UniVerse-1 is a unified, Veo-3-like model that simultaneously generates synchronized audio and video from a reference image and a text prompt.

  • Unified audio-video synthesis: The model generates audio and video in tandem, interpreting the input prompt to produce synchronized audio-visual output.

  • Speech audio generation: The model can generate fluent speech directly from a text prompt, demonstrating a built-in text-to-speech (TTS) ability. Crucially, it tailors the voice timbre to match the specific character being generated.

  • Musical instrument playing sound generation: The model is also highly proficient at creating sounds of musical instruments. Additionally, it offers some capability for "singing while playing," generating both vocal and instrumental tracks concurrently.

  • Ambient sound generation: The model can generate ambient sounds, producing background audio that matches the visual environment of the video.

  • The first open-source DiT-based joint audio-video method: We are the first to open-source a DiT-based, Veo-3-like model for joint audio-visual generation.

Model Download

Models | 🤗 Hugging Face
UniVerse-1 Base | UniVerse-1

Download our pretrained model into ./checkpoints/UniVerse-1-base/.
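One possible way to script the download is sketched below, assuming the weights are fetched with the `huggingface-cli download` command from the `huggingface_hub` package. The repository id is not stated in this text, so `HF_REPO_ID` is a placeholder to fill in from the Hugging Face model page. The `HF_CLI` variable and the `download_weights` function are illustrative names, not part of the project.

```shell
#!/usr/bin/env bash
# Sketch: fetch the pretrained weights into the directory this README expects.
# HF_REPO_ID is a placeholder; set it to the id shown on the Hugging Face page.
set -eu

CKPT_DIR="./checkpoints/UniVerse-1-base"
# Hypothetical override hook: set HF_CLI=echo to print the command instead of running it.
HF_CLI="${HF_CLI:-huggingface-cli}"

# download_weights <repo_id> [target_dir]
download_weights() {
    mkdir -p "${2:-$CKPT_DIR}"
    "$HF_CLI" download "$1" --local-dir "${2:-$CKPT_DIR}"
}

# Nothing is downloaded unless a repository id is supplied.
if [ -n "${HF_REPO_ID:-}" ]; then
    download_weights "$HF_REPO_ID"
fi
```

Saved as a script (the file name is up to you), it would be invoked with `HF_REPO_ID` set in the environment. Prefixing the call with `HF_CLI=echo` prints the command that would be issued without touching the network.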

Model Usage

🔧 Dependencies and Installation

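# Clone the repository first; the last two pip commands must run from its root.
git clone https://github.com/Dorniwang/UniVerse-1-code/
cd UniVerse-1-code

# Create and activate a dedicated environment.
conda create -n universe python=3.10
conda activate universe

# Install PyTorch from the CUDA 12.1 wheel index, then FlashAttention.
pip install torch==2.5.0 torchaudio==2.5.0 torchvision --index-url https://download.pytorch.org/whl/cu121
pip install packaging ninja && pip install flash-attn==2.7.0.post2 --no-build-isolation

# Install the remaining requirements and the package itself in editable mode.
pip install -r requirements-lint.txt
pip install -e .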

🚀 Inference Scripts

bash scripts/inference/inference_universe.sh
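Because the inference script relies on the weights in ./checkpoints/UniVerse-1-base/, a small pre-flight check can catch a missing download early. The following is a minimal sketch, not part of the project: `checkpoints_ready` and `run_inference` are hypothetical helper names, and the check only confirms that the directory is non-empty, not that the files are complete.

```shell
#!/usr/bin/env bash
# Sketch: confirm the checkpoint directory is populated before launching inference.
set -eu

CKPT_DIR="${CKPT_DIR:-./checkpoints/UniVerse-1-base}"

# Succeeds when the given directory exists and contains at least one entry.
checkpoints_ready() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Runs the project's inference script only if weights appear to be present.
run_inference() {
    if checkpoints_ready "$CKPT_DIR"; then
        bash scripts/inference/inference_universe.sh
    else
        echo "No weights found in $CKPT_DIR; download the model first." >&2
        return 1
    fi
}
```

Calling `run_inference` from the repository root launches the script above, or prints a hint and returns a non-zero status when the directory is missing or empty.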

Acknowledgements

Part of the code for this project comes from:

Thank you to all the open-source projects for their contributions to this project!

License

The code in this repository is licensed under the Apache 2.0 License.

Citation

@article{wang2025universe,
  title={UniVerse-1: Unified Audio-Video Generation via Stitching of Experts},
  author={Wang, Duomin and Zuo, Wei and Li, Aojie and Chen, Ling-Hao and Liao, Xinyao and Zhou, Deyu and Yin, Zixin and Dai, Xili and Jiang, Daxin and Yu, Gang},
  journal={arXiv preprint arXiv:2509.06155},
  year={2025}
}
