📄 Paper | 🤗 Model | 📊 UniBench
Official code of UAE and the UniBench benchmark for our paper "Can Understanding and Generation Truly Benefit Together — or Just Coexist?".
UAE is a unified multimodal framework for image generation and understanding.
🌟 Key contributions of our work:
✅ UAE: an Auto-Encoder–based unification that treats understanding as the encoder (I2T) and generation as the decoder (T2I), using reconstruction similarity as an explicit objective to quantify cross-modal coherence and operationalize unification.
✅ Unified-GRPO: to our knowledge, the first RL scheme that jointly improves both modules via two complementary steps—Generation for Understanding (train the encoder to caption for higher reconstruction quality) and Understanding for Generation (refine the decoder to reconstruct from those captions)—forming a positive feedback loop toward unification.
✅ Aha Moment in Multimodal: We report an emergent "aha moment" in multimodal learning. As RL progresses, the encoder autonomously emits longer, more descriptive captions while the decoder simultaneously achieves strikingly faithful reconstructions. This co-evolution offers compelling empirical evidence for unified multimodal intelligence.
✅ Unified-Bench: to the best of our knowledge, the first benchmark explicitly designed to measure the degree of unification in UMMs, rather than individually evaluating the generation or understanding capabilities.
- [ ] Release Unified-GRPO training code (RL).
- [ ] Release the long-context-700K training data for SFT and the training data for Unified-GRPO.
- [ ] Release training code for SFT (text-to-image generation).
- [☑️] Release all model checkpoints.
- [☑️] Release inference code for both image understanding and generation.
conda create -n UAE python==3.12
conda activate UAE
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
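Before downloading checkpoints, you can optionally confirm the environment is usable. The snippet below is a minimal sanity check (not part of the repo) that assumes a CUDA-capable GPU:

```python
# check_env.py -- optional sanity check after installation (illustrative, not shipped with the repo)
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

try:
    import flash_attn
    print(f"flash-attn {flash_attn.__version__} imported successfully")
except ImportError as err:
    print(f"flash-attn is not importable: {err}")
```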
Download the required model checkpoints:
- Stable Diffusion 3.5 Large model
- UAE fine-tuned weights
- Vision-language model checkpoints
Update the model paths in demo.py:
model_cfg = {
"SD3": "/path/to/stable-diffusion-3.5-large",
"dit": "/path/to/dit/checkpoint",
"dit_lora": "/path/to/dit/lora",
"llm": "/path/to/llm/model",
"llm_lora": "/path/to/llm/lora",
"llm_processor": "/path/to/llm/processor"
}
Here, the items are defined as follows:
- "SD3": Path to the official weights of Stable Diffusion 3-Large.
- "dit": Our pre-trained weights of DiT.
- "dit_lora": Our pre-trained LoRA for DiT, obtained in Stage-3 of unified-GRPO.
- "llm": Our pre-trained weights of Qwen-2.5-VL-3B.
- "llm_lora": Our pre-trained LoRA for Qwen-2.5-VL-3B, obtained in Stage-2 of unified-GRPO.
- "llm_processor": The official configuration of Qwen-2.5-VL-3B, located at
./Checkpoints/llm_processor
.
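For orientation, the sketch below shows one way these paths could be wired together with diffusers, transformers, and peft. It is an assumption-laden illustration, not the code in demo.py: the class names (StableDiffusion3Pipeline, SD3Transformer2DModel, Qwen2_5_VLForConditionalGeneration) require recent library versions, and demo.py remains the authoritative loader.

```python
# Illustrative loading sketch -- demo.py is the authoritative implementation.
import torch
from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_cfg = {
    "SD3": "/path/to/stable-diffusion-3.5-large",
    "dit": "/path/to/dit/checkpoint",
    "dit_lora": "/path/to/dit/lora",
    "llm": "/path/to/llm/model",
    "llm_lora": "/path/to/llm/lora",
    "llm_processor": "/path/to/llm/processor",
}

# Decoder (T2I): SD3.5 pipeline with the DiT weights and the Stage-3 LoRA.
transformer = SD3Transformer2DModel.from_pretrained(model_cfg["dit"], torch_dtype=torch.bfloat16)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_cfg["SD3"], transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.load_lora_weights(model_cfg["dit_lora"])  # DiT LoRA from Unified-GRPO Stage-3
pipe.to("cuda")

# Encoder (I2T): Qwen-2.5-VL-3B with the Stage-2 LoRA and its official processor.
llm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_cfg["llm"], torch_dtype=torch.bfloat16, device_map="cuda"
)
llm = PeftModel.from_pretrained(llm, model_cfg["llm_lora"])  # LLM LoRA from Unified-GRPO Stage-2
processor = AutoProcessor.from_pretrained(model_cfg["llm_processor"])
```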
The demo.py script is the core of our inference pipeline and supports two main modes:
Generate images directly from text descriptions:
python demo.py \
--input_text "A serene mountain landscape with snow-capped peaks reflecting in a crystal clear lake, surrounded by pine forests under a golden sunset sky" \
--output_path ./output/generated_image.png
Generate detailed descriptions of images:
python demo.py \
--input_img /path/to/input/image.jpg \
--prompt_only
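To run many prompts in one go, a small driver around demo.py is enough. The following is a sketch; the prompts.txt file (one prompt per line) and the numeric output names are assumptions:

```python
# batch_generate.py -- illustrative wrapper that calls demo.py once per prompt.
import subprocess
from pathlib import Path

prompts = Path("prompts.txt").read_text().splitlines()  # assumed: one prompt per line
out_dir = Path("./output")
out_dir.mkdir(parents=True, exist_ok=True)

for i, prompt in enumerate(prompts):
    subprocess.run(
        ["python", "demo.py", "--input_text", prompt, "--output_path", str(out_dir / f"{i}.png")],
        check=True,  # stop on the first failing generation
    )
```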
Our comprehensive evaluation suite in the Unified-Bench/ directory provides multiple similarity metrics for image-to-image generation assessment.
- CLIP: Semantic similarity using CLIP vision encoder
- DINO v2: Self-supervised visual representation similarity
- DINO v3: Enhanced DINO model for improved feature matching
- LongCLIP: Extended context CLIP for better long-range dependencies
cd eval
python CLIP.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python DINO_v2.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python DINO_v3.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python LongCLIP.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
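For intuition, each metric boils down to a cosine similarity between the embeddings of a generated/reference image pair. Below is a minimal re-implementation of the CLIP variant using the Hugging Face CLIP API; the backbone ID and preprocessing are assumptions, and eval/CLIP.py remains the reference implementation:

```python
# Illustrative CLIP image-to-image similarity; see eval/CLIP.py for the official version.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()  # assumed backbone
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_i2i_score(gen_path: str, ref_path: str) -> float:
    """Cosine similarity between the CLIP image embeddings of two images."""
    images = [Image.open(gen_path).convert("RGB"), Image.open(ref_path).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize each embedding
    return float(feats[0] @ feats[1])                 # cosine similarity in [-1, 1]
```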
Use the unified evaluation script for complete assessment:
cd eval
python Score_i2i.py \
--image_path ./Unified-Bench/UniBench/example_image \
--ref_path ./Unified-Bench/UniBench/Image \
--output_file ./Unified-Bench/results/example.json \
--models clip dinov2 dinov3 longclip
The Unified-Bench/UniBench/ directory contains our evaluation benchmark:
UniBench/
├── Image/ # Reference images (100 samples)
│ ├── 0.jpg
│ └── ...
└── example_image/ # Example generated images
├── 0.jpg
└── ...
The data for the Image/ folder can be downloaded from the link.
The evaluation produces a JSON report with per-image scores and aggregate statistics:
{
"clip": {
"0.jpg": 0.8542,
"1.jpg": 0.7893,
"average": 0.8234,
"min": 0.7123,
"max": 0.9456
},
"dinov2": { ... },
"dinov3": { ... },
"longclip": { ... }
}
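A few lines of Python are enough to summarize such a report; this sketch only assumes the JSON layout shown above and the example output path:

```python
# summarize_results.py -- print per-metric aggregates from a Score_i2i.py output file.
import json

with open("./Unified-Bench/results/example.json") as f:
    results = json.load(f)

for metric, scores in results.items():
    print(f"{metric:>10}: avg={scores['average']:.4f}  "
          f"min={scores['min']:.4f}  max={scores['max']:.4f}")
```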
To evaluate your own generated images:
- Organize your images following the UniBench structure
- Ensure each generated image shares its filename with the corresponding reference image (see the check below)
- Run the evaluation script with your paths
- Results will include per-image scores and aggregate statistics
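As a quick check for the second step, the sketch below verifies that every generated image has a same-named reference before you launch Score_i2i.py (the paths and the .jpg extension are assumptions; adjust them to your data):

```python
# check_pairs.py -- verify generated/reference filenames match before running Score_i2i.py.
from pathlib import Path

gen_dir = Path("/path/to/generated/images")       # your generated images
ref_dir = Path("./Unified-Bench/UniBench/Image")  # UniBench reference images

gen_names = {p.name for p in gen_dir.glob("*.jpg")}
ref_names = {p.name for p in ref_dir.glob("*.jpg")}

print(f"{len(gen_names & ref_names)} matched pairs")
if gen_names - ref_names:
    print("Generated images without a reference:", sorted(gen_names - ref_names))
if ref_names - gen_names:
    print("References without a generated image:", sorted(ref_names - gen_names))
```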
For questions or feedback, please reach out:
- Email: [[email protected]]
⭐️ If this repository helped your research, please star 🌟 this repo 👍!