
U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs

Universal multimodal retrieval (UMR) addresses complex retrieval tasks involving diverse modalities for both queries and candidates. Despite the success of state-of-the-art methods based on multimodal large language models (MLLMs) using contrastive learning principles, the mechanisms underlying their retrieval capabilities remain largely unexplored. This gap potentially leads to suboptimal performance and limited generalization ability.

In this study, we systematically analyze the key factors driving effective embedding learning for UMR using MLLMs. We implement a general MLLM-based embedding learning pipeline and investigate contributors to high-performing universal retrieval systems. Our analysis covers various aspects of embedding generation and training strategies, including progressive transition, hard negative mining, and re-ranker distillation. Our findings reveal that often-overlooked factors can significantly impact model performance.

Building on these insights, we introduce U-MARVEL (Universal Multimodal Retrieval via Embedding Learning), a unified framework that outperforms state-of-the-art competitors on the M-BEIR benchmark in supervised settings and demonstrates strong zero-shot performance on tasks such as composed image retrieval and text-to-video retrieval. These results highlight the generalization potential of our framework across various embedding-based retrieval tasks, providing valuable insights for future research.
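
As a concrete illustration of the contrastive embedding learning and hard negative mining mentioned above, the sketch below shows an InfoNCE-style loss that combines in-batch negatives with mined hard negatives. This is a minimal PyTorch sketch under assumed tensor shapes, not the exact U-MARVEL training objective; the temperature value and the hard_neg_emb input are placeholders.

import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(query_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """Contrastive loss with in-batch negatives plus mined hard negatives.

    query_emb:    (B, D) query embeddings
    pos_emb:      (B, D) embedding of the positive candidate for each query
    hard_neg_emb: (B, K, D) embeddings of K mined hard negatives per query
    """
    # Cosine similarity via L2 normalization
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    # In-batch similarities: each query vs. every positive in the batch -> (B, B)
    inbatch_logits = q @ p.t()

    # Hard-negative similarities: each query vs. its own K hard negatives -> (B, K)
    hard_logits = torch.einsum("bd,bkd->bk", q, n)

    # Column i of the in-batch block is the positive for query i
    logits = torch.cat([inbatch_logits, hard_logits], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

Reusing the other positives in the batch as negatives keeps the effective negative pool large at no extra encoding cost, while the explicitly mined hard negatives sharpen the decision boundary near the positives.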


Model Checkpoints

├── checkpoints
│   ├── hf_models
│   │   └── Qwen2-VL-7B-Instruct
│   └── U-MARVEL-Qwen2VL-7B-Instruct

Requirements

To install requirements:

pip install -r requirements.txt

Data Preparation

Download Qwen2-VL-7B-Instruct and place it in ./checkpoints/hf_models/Qwen2-VL-7B-Instruct
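
If you use the Hugging Face Hub client, the snippet below downloads the checkpoint into the expected directory; the repo ID Qwen/Qwen2-VL-7B-Instruct is assumed, and any other download method works just as well.

from huggingface_hub import snapshot_download

# Download the base MLLM into the expected checkpoint directory
snapshot_download(
    repo_id="Qwen/Qwen2-VL-7B-Instruct",
    local_dir="./checkpoints/hf_models/Qwen2-VL-7B-Instruct",
)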

For the NLI dataset, please refer to the link

For the multimodal instruction tuning dataset, please refer to M-BEIR

After downloading everything, organize the data under ./data as follows:

├── data    
│    ├── M-BEIR
│    ├── nli_for_simcse.csv
│    ├── rerank_data_for_training
│    ├── flickr
│    ├── coco
│    ├── sharegpt4v
│    ├── Urban1K
│    ├── circo
│    ├── genecis
│    ├── vist
│    ├── visdial
│    ├── ccneg
│    ├── sugar-crepe
│    ├── MSVD
│    └── msrvtt
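
Optionally, a quick existence check (a convenience sketch, not part of the released scripts) can confirm the layout before running training or evaluation:

from pathlib import Path

# Entries expected under ./data (see the tree above)
EXPECTED = [
    "M-BEIR", "nli_for_simcse.csv", "rerank_data_for_training",
    "flickr", "coco", "sharegpt4v", "Urban1K", "circo", "genecis",
    "vist", "visdial", "ccneg", "sugar-crepe", "MSVD", "msrvtt",
]

missing = [name for name in EXPECTED if not (Path("data") / name).exists()]
if missing:
    print("Missing entries under ./data:", ", ".join(missing))
else:
    print("Data layout looks complete.")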

Evaluation

To evaluate our model on M-BEIR and the zero-shot benchmarks, run:

python scripts/vtools_eval_mbeir_model.py  # Evaluate locally  
sh scripts/eval_mbeir_global.sh            # Evaluate globally  
sh scripts/eval_zeroshot.sh                # Evaluate zero-shot

Model Performance

The proposed U-MARVEL framework establishes new state-of-the-art performance among both single-model and recall-then-rerank approaches on the M-BEIR benchmark.

[Result tables: M-BEIR-Local, M-BEIR-Global, and zero-shot retrieval]

Acknowledgements

Many thanks to the codebase of LamRA.

Citation

If you use this code for your research or project, please cite:

@article{li2025umarvel,
  title={U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs},
  author={Li, Xiaojie and Li, Chu and Chen, Shi-Zhe and Chen, Xi},
  journal={arXiv preprint arXiv:2507.14902},
  year={2025}
}
