
U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs

Universal multimodal retrieval (UMR) addresses complex retrieval tasks involving diverse modalities for both queries and candidates. Despite the success of state-of-the-art methods based on multimodal large language models (MLLMs) using contrastive learning principles, the mechanisms underlying their retrieval capabilities remain largely unexplored. This gap potentially leads to suboptimal performance and limited generalization ability.

In this study, we systematically analyze the key factors driving effective embedding learning for UMR using MLLMs. We implement a general MLLM-based embedding learning pipeline and investigate contributors to high-performing universal retrieval systems. Our analysis covers various aspects of embedding generation and training strategies, including progressive transition, hard negative mining, and re-ranker distillation. Our findings reveal that often-overlooked factors can significantly impact model performance.

Building on these insights, we introduce U-MARVEL (Universal Multimodal Retrieval via Embedding Learning), a unified framework that outperforms state-of-the-art competitors on the M-BEIR benchmark in supervised settings and demonstrates strong zero-shot performance on tasks such as composed image retrieval and text-to-video retrieval. These results highlight the generalization potential of our framework across various embedding-based retrieval tasks, providing valuable insights for future research.
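
As a concrete illustration of the contrastive embedding learning and hard negative mining mentioned above, the sketch below shows an InfoNCE-style loss that combines in-batch negatives with mined hard negatives. This is a minimal PyTorch sketch under assumed tensor shapes, not the exact U-MARVEL training objective; the temperature value and the hard_neg_emb input are placeholders.

import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(query_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """Contrastive loss with in-batch negatives plus mined hard negatives.

    query_emb:    (B, D) query embeddings
    pos_emb:      (B, D) embedding of the positive candidate for each query
    hard_neg_emb: (B, K, D) embeddings of K mined hard negatives per query
    """
    # Cosine similarity via L2 normalization
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    # In-batch similarities: each query vs. every positive in the batch -> (B, B)
    inbatch_logits = q @ p.t()

    # Hard-negative similarities: each query vs. its own K hard negatives -> (B, K)
    hard_logits = torch.einsum("bd,bkd->bk", q, n)

    # Column i of the in-batch block is the positive for query i
    logits = torch.cat([inbatch_logits, hard_logits], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

Reusing the other positives in the batch as negatives keeps the effective negative pool large at no extra encoding cost, while the explicitly mined hard negatives sharpen the decision boundary near the positives.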


Model Checkpoints

├── checkpoints
│   ├── hf_models
│   │   └── Qwen2-VL-7B-Instruct
│   └── U-MARVEL-Qwen2VL-7B-Instruct

Requirements

To install requirements:

pip install -r requirements.txt

Data Preparation

Download Qwen2-VL-7B-Instruct and place it in ./checkpoints/hf_models/Qwen2-VL-7B-Instruct
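
If you use the Hugging Face Hub client, the snippet below downloads the checkpoint into the expected directory; the repo ID Qwen/Qwen2-VL-7B-Instruct is assumed, and any other download method works just as well.

from huggingface_hub import snapshot_download

# Download the base MLLM into the expected checkpoint directory
snapshot_download(
    repo_id="Qwen/Qwen2-VL-7B-Instruct",
    local_dir="./checkpoints/hf_models/Qwen2-VL-7B-Instruct",
)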

For the NLI dataset, please refer to the link

For the multimodal instruction tuning dataset, please refer to M-BEIR

After downloading everything, organize the data under ./data as follows:

├── data    
│    ├── M-BEIR
│    ├── nli_for_simcse.csv
│    ├── rerank_data_for_training
│    ├── flickr
│    ├── coco
│    ├── sharegpt4v
│    ├── Urban1K
│    ├── circo
│    ├── genecis
│    ├── vist
│    ├── visdial
│    ├── ccneg
│    ├── sugar-crepe
│    ├── MSVD
│    └── msrvtt
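
Optionally, a quick existence check (a convenience sketch, not part of the released scripts) can confirm the layout before running training or evaluation:

from pathlib import Path

# Entries expected under ./data (see the tree above)
EXPECTED = [
    "M-BEIR", "nli_for_simcse.csv", "rerank_data_for_training",
    "flickr", "coco", "sharegpt4v", "Urban1K", "circo", "genecis",
    "vist", "visdial", "ccneg", "sugar-crepe", "MSVD", "msrvtt",
]

missing = [name for name in EXPECTED if not (Path("data") / name).exists()]
if missing:
    print("Missing entries under ./data:", ", ".join(missing))
else:
    print("Data layout looks complete.")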

Evaluation

To evaluate our model on M-BEIR and the zero-shot benchmarks, run:

python scripts/vtools_eval_mbeir_model.py  # Evaluate locally  
sh scripts/eval_mbeir_global.sh            # Evaluate globally  
sh scripts/eval_zeroshot.sh                # Evaluate zero-shot

Model Performance

The proposed U-MARVEL framework establishes new state-of-the-art performance among both single-model and recall-then-rerank approaches on the M-BEIR benchmark.

[Result tables: M-BEIR-Local, M-BEIR-Global, and zero-shot retrieval]

Acknowledgements

Many thanks to the codebase of LamRA.

Citation

If you use this code for your research or project, please cite:

@article{li2025umarvel,
  title={U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs},
  author={Li, Xiaojie and Li, Chu and Chen, Shi-Zhe and Chen, Xi},
  journal={arXiv preprint arXiv:2507.14902},
  year={2025}
}
