📄 Paper | 🤗 Model | 📊 UniBench
Official code of UAE and the UniBench benchmark for our paper "Can Understanding and Generation Truly Benefit Together — or Just Coexist?".
UAE is a unified multimodal framework for image generation and understanding.
🌟 Key contributions of our work:
✅ UAE: an Auto-Encoder–based unification that treats understanding as the encoder (I2T) and generation as the decoder (T2I), using reconstruction similarity as an explicit objective to quantify cross-modal coherence and operationalize unification.
✅ Unified-GRPO: to our knowledge, the first RL scheme that jointly improves both modules via two complementary steps—Generation for Understanding (train the encoder to caption for higher reconstruction quality) and Understanding for Generation (refine the decoder to reconstruct from those captions)—forming a positive feedback loop toward unification.
✅ Aha Moment in Multimodal: We report an emergent "aha moment" in multimodal learning. As RL progresses, the encoder autonomously emits longer, more descriptive captions while the decoder simultaneously achieves strikingly faithful reconstructions. This co-evolution offers compelling empirical evidence for unified multimodal intelligence.
✅ Unified-Bench: to the best of our knowledge, the first benchmark explicitly designed to measure the degree of unification in UMMs, rather than individually evaluating the generation or understanding capabilities.
- [ ] Release Unified-GRPO training code (RL).
- [ ] Release the long-context-700K training data for SFT and the training data for Unified-GRPO.
- [ ] Release training code for SFT (text-to-image generation).
- [☑️] Release all model checkpoints.
- [☑️] Release inference code for both image understanding and generation.
conda create -n UAE python==3.12
conda activate UAE
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
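Before downloading checkpoints, you can optionally confirm the environment is usable. The snippet below is a minimal sanity check (not part of the repo) that assumes a CUDA-capable GPU:

```python
# check_env.py -- optional sanity check after installation (illustrative, not shipped with the repo)
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

try:
    import flash_attn
    print(f"flash-attn {flash_attn.__version__} imported successfully")
except ImportError as err:
    print(f"flash-attn is not importable: {err}")
```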
Download the required model checkpoints:
- Stable Diffusion 3.5 Large model
- UAE fine-tuned weights
- Vision-language model checkpoints
Update the model paths in demo.py:
model_cfg = {
"SD3": "/path/to/stable-diffusion-3.5-large",
"dit": "/path/to/dit/checkpoint",
"dit_lora": "/path/to/dit/lora",
"llm": "/path/to/llm/model",
"llm_lora": "/path/to/llm/lora",
"llm_processor": "/path/to/llm/processor"
}
Here, the items are defined as follows:
- "SD3": Path to the official weights of Stable Diffusion 3-Large.
- "dit": Our pre-trained weights of DiT.
- "dit_lora": Our pre-trained LoRA for DiT, obtained in Stage-3 of unified-GRPO.
- "llm": Our pre-trained weights of Qwen-2.5-VL-3B.
- "llm_lora": Our pre-trained LoRA for Qwen-2.5-VL-3B, obtained in Stage-2 of unified-GRPO.
- "llm_processor": The official configuration of Qwen-2.5-VL-3B, located at
./Checkpoints/llm_processor
.
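For orientation, the sketch below shows one way these paths could be wired together with diffusers, transformers, and peft. It is an assumption-laden illustration, not the code in demo.py: the class names (StableDiffusion3Pipeline, SD3Transformer2DModel, Qwen2_5_VLForConditionalGeneration) require recent library versions, and demo.py remains the authoritative loader.

```python
# Illustrative loading sketch -- demo.py is the authoritative implementation.
import torch
from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_cfg = {
    "SD3": "/path/to/stable-diffusion-3.5-large",
    "dit": "/path/to/dit/checkpoint",
    "dit_lora": "/path/to/dit/lora",
    "llm": "/path/to/llm/model",
    "llm_lora": "/path/to/llm/lora",
    "llm_processor": "/path/to/llm/processor",
}

# Decoder (T2I): SD3.5 pipeline with the DiT weights and the Stage-3 LoRA.
transformer = SD3Transformer2DModel.from_pretrained(model_cfg["dit"], torch_dtype=torch.bfloat16)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_cfg["SD3"], transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.load_lora_weights(model_cfg["dit_lora"])  # DiT LoRA from Unified-GRPO Stage-3
pipe.to("cuda")

# Encoder (I2T): Qwen-2.5-VL-3B with the Stage-2 LoRA and its official processor.
llm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_cfg["llm"], torch_dtype=torch.bfloat16, device_map="cuda"
)
llm = PeftModel.from_pretrained(llm, model_cfg["llm_lora"])  # LLM LoRA from Unified-GRPO Stage-2
processor = AutoProcessor.from_pretrained(model_cfg["llm_processor"])
```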
The demo.py script is the core of our inference pipeline and supports two main modes:
Generate images directly from text descriptions:
python demo.py \
--input_text "A serene mountain landscape with snow-capped peaks reflecting in a crystal clear lake, surrounded by pine forests under a golden sunset sky" \
--output_path ./output/generated_image.png
Generate detailed descriptions of images:
python demo.py \
--input_img /path/to/input/image.jpg \
--prompt_only
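To run many prompts in one go, a small driver around demo.py is enough. The following is a sketch; the prompts.txt file (one prompt per line) and the numeric output names are assumptions:

```python
# batch_generate.py -- illustrative wrapper that calls demo.py once per prompt.
import subprocess
from pathlib import Path

prompts = Path("prompts.txt").read_text().splitlines()  # assumed: one prompt per line
out_dir = Path("./output")
out_dir.mkdir(parents=True, exist_ok=True)

for i, prompt in enumerate(prompts):
    subprocess.run(
        ["python", "demo.py", "--input_text", prompt, "--output_path", str(out_dir / f"{i}.png")],
        check=True,  # stop on the first failing generation
    )
```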
Our comprehensive evaluation suite in the Unified-Bench/ directory provides multiple similarity metrics for image-to-image generation assessment.
- CLIP: Semantic similarity using CLIP vision encoder
- DINO v2: Self-supervised visual representation similarity
- DINO v3: Enhanced DINO model for improved feature matching
- LongCLIP: Extended context CLIP for better long-range dependencies
cd eval
python CLIP.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python DINO_v2.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python DINO_v3.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
python LongCLIP.py --image_path /path/to/generated/images --ref_path /path/to/reference/images
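For intuition, each metric boils down to a cosine similarity between the embeddings of a generated/reference image pair. Below is a minimal re-implementation of the CLIP variant using the Hugging Face CLIP API; the backbone ID and preprocessing are assumptions, and eval/CLIP.py remains the reference implementation:

```python
# Illustrative CLIP image-to-image similarity; see eval/CLIP.py for the official version.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()  # assumed backbone
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_i2i_score(gen_path: str, ref_path: str) -> float:
    """Cosine similarity between the CLIP image embeddings of two images."""
    images = [Image.open(gen_path).convert("RGB"), Image.open(ref_path).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize each embedding
    return float(feats[0] @ feats[1])                 # cosine similarity in [-1, 1]
```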
Use the unified evaluation script for complete assessment:
cd eval
python Score_i2i.py \
--image_path ./Unified-Bench/UniBench/example_image \
--ref_path ./Unified-Bench/UniBench/Image \
--output_file ./Unified-Bench/results/example.json \
--models clip dinov2 dinov3 longclip
The Unified-Bench/UniBench/ directory contains our evaluation benchmark:
UniBench/
├── Image/ # Reference images (100 samples)
│ ├── 0.jpg
│ └── ...
└── example_image/ # Example generated images
├── 0.jpg
└── ...
The data for the Image/ folder can be downloaded from the link.
The evaluation produces a JSON report with per-image scores and aggregate statistics:
{
"clip": {
"0.jpg": 0.8542,
"1.jpg": 0.7893,
"average": 0.8234,
"min": 0.7123,
"max": 0.9456
},
"dinov2": { ... },
"dinov3": { ... },
"longclip": { ... }
}
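A few lines of Python are enough to summarize such a report; this sketch only assumes the JSON layout shown above and the example output path:

```python
# summarize_results.py -- print per-metric aggregates from a Score_i2i.py output file.
import json

with open("./Unified-Bench/results/example.json") as f:
    results = json.load(f)

for metric, scores in results.items():
    print(f"{metric:>10}: avg={scores['average']:.4f}  "
          f"min={scores['min']:.4f}  max={scores['max']:.4f}")
```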
To evaluate your own generated images:
- Organize your images following the UniBench structure
- Ensure each generated image shares its filename with the corresponding reference image (see the check below)
- Run the evaluation script with your paths
- Results will include per-image scores and aggregate statistics
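As a quick check for the second step, the sketch below verifies that every generated image has a same-named reference before you launch Score_i2i.py (the paths and the .jpg extension are assumptions; adjust them to your data):

```python
# check_pairs.py -- verify generated/reference filenames match before running Score_i2i.py.
from pathlib import Path

gen_dir = Path("/path/to/generated/images")       # your generated images
ref_dir = Path("./Unified-Bench/UniBench/Image")  # UniBench reference images

gen_names = {p.name for p in gen_dir.glob("*.jpg")}
ref_names = {p.name for p in ref_dir.glob("*.jpg")}

print(f"{len(gen_names & ref_names)} matched pairs")
if gen_names - ref_names:
    print("Generated images without a reference:", sorted(gen_names - ref_names))
if ref_names - gen_names:
    print("References without a generated image:", sorted(ref_names - gen_names))
```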
For questions or feedback, please reach out:
- Email: [[email protected]]
⭐️ If this repository helped your research, please star 🌟 this repo 👍!