Haoge Deng1,4*, Ting Pan2,4*, Fan Zhang4*, Yang Liu3,4*, Zhuoyan Luo4, Yufeng Cui4, Wenxuan Wang4
Chunhua Shen3, Shiguang Shan2, Zhaoxiang Zhang1†, Xinlong Wang4†
CASIA1, CASICT2, ZJU3, BAAI4
* Equal Contribution, † Corresponding Author
We present URSA (Uniform discRete diffuSion with metric pAth), a simple yet powerful framework that bridges the gap between discrete and continuous approaches to video generation. URSA formulates video generation as an iterative global refinement of discrete spatiotemporal tokens, and it scales efficiently to long videos while requiring fewer inference steps. With an asynchronous timestep scheduling strategy, URSA supports multi-task video generation in one unified model.
- [Oct 2025] URSA is part of Emu3.5 as DiDA (Discrete Diffusion Adaptation)!
- [Oct 2025] Released the TI2V 🤗 Demo.
- [Oct 2025] Released the Paper, Project Page, and Evaluation Guide.
- 🔥 Novel Approach: Uniform Discrete Diffusion with Metric Path.
- 🔥 SOTA Performance: High efficiency with state-of-the-art T2I/T2V/I2V results.
- 🔥 Unified Modeling: Multi-task capabilities in a single unified model (see the sketch below).
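To make the asynchronous timestep scheduling idea concrete, here is a minimal, hypothetical sketch (the function name, arguments, and exact schedule are illustrative assumptions, not URSA's actual implementation): frames given as conditions are held at a near-clean timestep, while the frames to be generated share a higher timestep and are refined together, which is what allows one model to cover text-to-video, image-to-video, and video extension.

```python
import torch

def asynchronous_timesteps(num_frames: int, num_cond_frames: int = 0,
                           t: float = 1.0, cond_noise_scale: float = 0.0) -> torch.Tensor:
    """Per-frame timesteps: generated frames share the current global timestep t,
    while conditioning frames stay at a (near-)clean timestep."""
    timesteps = torch.full((num_frames,), t)
    timesteps[:num_cond_frames] = cond_noise_scale  # condition frames are (almost) noise-free
    return timesteps

# Text-to-video: all 49 frames start fully noised.
print(asynchronous_timesteps(49))
# Image-to-video: frame 0 is the clean conditioning image.
print(asynchronous_timesteps(49, num_cond_frames=1))
# Video extension: 13 frames copied from the previous clip, lightly re-noised.
print(asynchronous_timesteps(49, num_cond_frames=13, cond_noise_scale=0.1))
```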
| Model | Resolution | Data | Weight | GenEval | DPG-Bench |
|---|---|---|---|---|---|
| URSA-0.6B-IBQ1024 | 1024x1024 | 30M | 🤗 HF / 🤖 ModelScope | 0.79 | 85.6 |
| URSA-1.7B-IBQ1024 | 1024x1024 | 30M | 🤗 HF / 🤖 ModelScope | 0.80 | 86.0 |
| Model | Resolution | Data | Weight | VBench-T2V | VBench-I2V |
|---|---|---|---|---|---|
| URSA-0.6B-FSQ320 | 49x512x320 | 24M | 🤗 HF / 🤖 ModelScope | 81.4 | 86.0 |
| URSA-1.7B-FSQ320 | 49x512x320 | 24M | 🤗 HF / 🤖 ModelScope | 82.4 | 86.2 |
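The quick-start code below pulls these checkpoints from the Hugging Face Hub on first use. If you prefer to pre-download a checkpoint, a minimal sketch using the standard `huggingface_hub` API (an assumption, not part of this repo's tooling):

```python
from huggingface_hub import snapshot_download

# Pre-fetch a checkpoint from the tables above; from_pretrained should also
# accept the returned local directory instead of the repo id.
local_dir = snapshot_download("BAAI/URSA-1.7B-FSQ320")
print(local_dir)
```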
Clone this repository and install it from source:

```bash
pip install diffusers "transformers>=4.57.1" accelerate imageio imageio-ffmpeg omegaconf wandb
git clone https://github.com/baaivision/URSA.git
cd URSA && pip install .
```

Text-to-image generation with the 1024x1024 image model:

```python
import torch
from diffnext.pipelines import URSAPipeline
model_id, height, width = "BAAI/URSA-1.7B-IBQ1024", 1024, 1024
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = URSAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to(torch.device("cuda"))
prompt = "The bear, calm and still, gazes upward as if lost in contemplation of the cosmos."
negative_prompt = "worst quality, low quality, inconsistent motion, static, still, blurry, jittery, distorted, ugly"
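# Note: pipe(**locals()) below passes every local variable (prompt, negative_prompt,
# height, width, ...) as keyword arguments; the custom pipeline is expected to pick
# out the arguments it needs and ignore the rest.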
image = pipe(**locals()).frames[0]
image.save("ursa.jpg")
```

Multi-task generation (text-to-image, image-to-video, text-to-video, and video-to-video) with the 512x320 video model:

```python
import os, torch, numpy
from diffnext.pipelines import URSAPipeline
from diffnext.utils import export_to_video
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
model_id, height, width = "BAAI/URSA-1.7B-FSQ320", 320, 512
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = URSAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to(torch.device("cuda"))
text_prompt = "a lone grizzly bear walks through a misty forest at dawn, sunlight catching its fur."
negative_prompt = "worst quality, low quality, inconsistent motion, static, still, blurry, jittery, distorted, ugly"
# Text-to-Image
prompt = text_prompt
num_frames, num_inference_steps = 1, 25
image = pipe(**locals()).frames[0]
image.save("ursa.jpg")
# Image-to-Video
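# The "motion=<value>" prompt prefix is not documented here; judging from these
# examples, it appears to control motion strength (higher values, stronger motion).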
prompt = f"motion=9.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
video = pipe(**locals()).frames[0]
export_to_video(video, "ursa_1+48f.mp4", fps=12)
# Text-to-Video
image, video = None, None
prompt = f"motion=9.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
video = pipe(**locals()).frames[0]
export_to_video(video, "ursa_49f.mp4", fps=12)
# Video-to-Video
prompt = f"motion=5.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
num_cond_frames, cond_noise_scale = 13, 0.1
# Iteratively extend the clip: condition each new 49-frame segment on the last
# `num_cond_frames` frames of the video generated so far.
for i in range(12):
    video, start_video = video[-num_cond_frames:], video  # condition frames, full clip so far
    video = pipe(**locals()).frames[0]
    video = numpy.concatenate([start_video, video[num_cond_frames:]])  # append only the new frames
export_to_video(video, "ursa_{}f.mp4".format(video.shape[0]), fps=12)
```
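Each loop iteration above reuses the last 13 frames of the clip generated so far as conditioning (with `cond_noise_scale=0.1` presumably re-noising them slightly), generates a fresh 49-frame segment, and appends its 36 newly generated frames; after 12 iterations the initial 49-frame video grows to 49 + 12 × 36 = 481 frames.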
Launch the interactive demo apps:

```bash
# Text-to-Image (T2I)
python scripts/app_ursa_t2i.py --model "BAAI/URSA-1.7B-IBQ1024" --device 0
# Text-to-Image-to-Video (TI2V)
python scripts/app_ursa_ti2v.py --model "BAAI/URSA-1.7B-FSQ320" --device 0
```

If you find this repository useful, please consider giving it a star ⭐ and citing our work:
```bibtex
@article{deng2025ursa,
title={Uniform Discrete Diffusion with Metric Path for Video Generation},
author={Deng, Haoge and Pan, Ting and Zhang, Fan and Liu, Yang and Luo, Zhuoyan and Cui, Yufeng and Shen, Chunhua and Shan, Shiguang and Zhang, Zhaoxiang and Wang, Xinlong},
journal={arXiv preprint arXiv:2510.24717},
year={2025}
}
@article{deng2024nova,
title={Autoregressive Video Generation without Vector Quantization},
author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
journal={arXiv preprint arXiv:2412.14169},
year={2024}
}
```
We thank the following repositories:
- NOVA. ✨ NOVA is the predecessor of 🐻 URSA.
- FlowMatching. This codebase provides systematic CFM and DFM implementations.
- FUDOKI. This codebase provides a naive multimodal DFM implementation.
- CodeWithGPU. The CodeWithGPU library is the core of our data loading pipeline.
Code and models are licensed under Apache License 2.0.