This repository provides the source code, models, and datasets for our paper mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data. In our work, we explore integrating high-quality synthetic data to boost the performance of multimodal multilingual embeddings across diverse tasks.
- 2025-02: We release the paper, code, datasets and models of mmE5.
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the scarcity of labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment ensures that different modalities are semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details, enhancing its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages; (2) are generated via a deep thinking process within a single pass of a multimodal large language model; and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train mmE5, a multimodal multilingual E5 model. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB benchmark and superior multilingual performance on the XTD benchmark.
Our experiments leverage a comprehensive dataset that combines real-world examples with synthetic data, covering a wide range of tasks and languages. We also provide the labeled training set of the MMEB benchmark, which includes mined hard negatives.
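If you want to inspect the data programmatically, a minimal sketch with the Hugging Face `datasets` library might look like the one below; the dataset ids (and any required subset names) are assumptions for illustration, so check the dataset cards for the exact ones.

```python
from datasets import load_dataset

# Dataset ids are assumptions for illustration; some datasets may also
# require a subset/config name as a second argument.
synthetic = load_dataset("intfloat/mmE5-synthetic", split="train")
hardneg = load_dataset("intfloat/mmE5-MMEB-hardneg", split="train")

print(synthetic[0])  # one example: query/target texts plus image references
```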
mmE5 achieves state-of-the-art (SOTA) performance on the MMEB benchmark.
```bash
pip install -r requirements.txt
```
- Preparation
```bash
bash scripts/prepare_images.sh
```

This script downloads the images for the Synthetic Dataset, MMEB with Hard Negatives, MMEB-eval, and XTD.

Caution: this can take a while, as the images are large. Make sure you have enough disk space (at least 1 TB).
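Before kicking off the download, you can quickly confirm how much space is free on the target filesystem:

```bash
df -h .  # check available disk space in the current directory
```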
We have provided example scripts in the `scripts/` directory to help you get started with training and evaluation.
- Train
```bash
bash scripts/train/train.sh
```
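The script wraps the full training pipeline. At its core, embedding training of this kind optimizes an InfoNCE-style contrastive loss between query and target embeddings, with in-batch (and mined hard) negatives. Below is a minimal PyTorch sketch of that loss, not the repository's actual implementation (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def infonce_loss(q, t, temperature=0.02):
    """InfoNCE with in-batch negatives (illustrative sketch).

    q: (B, D) query embeddings; t: (B, D) target embeddings.
    t[i] is the positive for q[i]; all other rows act as negatives.
    """
    q = F.normalize(q, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = q @ t.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)  # diagonal entries are positives
```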
- Test MMEB
```bash
bash scripts/eval/eval_full.sh
```
- Test XTD
```bash
bash scripts/eval/eval_full_multi.sh
```
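For retrieval-style tasks, evaluation boils down to nearest-neighbor search over the embedding space. The following sketch of Recall@1 is illustrative, not the benchmarks' exact protocol:

```python
import torch

def recall_at_1(query_emb, cand_emb, gold_idx):
    # query_emb: (N, D), cand_emb: (M, D), gold_idx: (N,) index of the
    # correct candidate per query; embeddings assumed L2-normalized.
    sims = query_emb @ cand_emb.T      # cosine similarity via dot product
    pred = sims.argmax(dim=-1)         # top-1 candidate for each query
    return (pred == gold_idx).float().mean().item()
```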
You can also use `demo.py` to embed your own text and images.

```bash
python demo.py
```
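A minimal sketch of what such a demo does is shown below. The model id, prompt format, and last-token pooling are assumptions here; `demo.py` is the authoritative version.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Model id is an assumption for illustration; check the repository for the
# released checkpoint name.
MODEL = "intfloat/mmE5-mllama-11b-instruct"

processor = AutoProcessor.from_pretrained(MODEL)
model = MllamaForConditionalGeneration.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def embed(text, image=None):
    """Encode a (text, optional image) pair into one embedding via
    last-token pooling over the final hidden states (pooling strategy
    assumed; demo.py defines the actual one)."""
    prompt = f"<|image|>{text}" if image is not None else text
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    last_hidden = out.hidden_states[-1][0]              # (L, D)
    last_token = inputs["attention_mask"][0].sum() - 1  # last non-pad position
    return F.normalize(last_hidden[last_token], dim=-1)

query = embed("Represent the given image with a caption.", Image.open("example.jpg"))
doc = embed("A photo of a cat sleeping on a sofa.")
print((query @ doc).item())  # cosine similarity of the two embeddings
```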
- We have adapted code from VLM2Vec, a comprehensive implementation for transforming MLLMs into embedding models.
```bibtex
@article{chen2025mmE5,
  title={mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data},
  author={Chen, Haonan and Wang, Liang and Yang, Nan and Zhu, Yutao and Zhao, Ziliang and Wei, Furu and Dou, Zhicheng},
  journal={arXiv preprint arXiv:2502.08468},
  year={2025}
}
```