Jitesh Jain*, Zhengyuan Yang, Humphrey Shi†, Jianfeng Gao†, Jianwei Yang†
*Work done during an internship at Microsoft Research, Redmond; †Equal Advising
[Project Page] | [arXiv] | [Model Checkpoints] | [Video] | [BibTeX]
This repo contains the code for our paper Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation.
We propose distilling target visual information from a set of target encoders into the intermediate representations of the LLM. During training, we adopt a predictive embedding optimization approach at selected LLM layers, minimizing the embedding losses alongside the next-token prediction (NTP) objective, which yields a vision-centric approach to training Multimodal Large Language Models.
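To make the objective concrete, below is a minimal PyTorch-style sketch of how the NTP loss and the auxiliary embedding losses could be combined. The probe heads, token pooling, cosine-based embedding loss, and `alpha` weight are illustrative assumptions, not the exact implementation in this repo.

```python
# Minimal sketch (not the repo's exact code): hidden states at selected LLM layers
# are projected by small probe heads and regressed onto frozen target-encoder
# embeddings, alongside the usual next-token prediction (NTP) loss.
import torch.nn.functional as F

def combined_loss(llm_outputs, labels, probe_heads, target_feats, selected_layers, alpha=0.5):
    # Standard NTP loss from the language-model head.
    logits = llm_outputs.logits
    ntp_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # Auxiliary embedding losses at the selected intermediate layers.
    embed_loss = 0.0
    for layer_idx in selected_layers:
        hidden = llm_outputs.hidden_states[layer_idx]             # (B, T, D_llm)
        pred = probe_heads[str(layer_idx)](hidden).mean(dim=1)    # pool to (B, D_target)
        target = target_feats[layer_idx]                          # (B, D_target), from a frozen target encoder
        embed_loss = embed_loss + (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

    return ntp_loss + alpha * embed_loss
```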
- [September 21, 2025]: VisPer-LM is accepted to NeurIPS 2025! 🥂
- [December 14, 2024]: Our demo is now available on HuggingFace Spaces. Thanks to the HF team for their support with the ZeroGPU grant! 🤗
- [December 12, 2024]: 🚀 Project Page, ArXiv Preprint and GitHub Repo are public! We also open-source the model checkpoints and probes on the Hugging Face Hub! 🎁
Note: We trained all our models on AMD MI300X GPUs. However, this repo provides instructions for NVIDIA GPUs, given their wider usage.
- Clone this repository.

  ```sh
  git lfs install
  git clone https://github.com/SHI-Labs/VisPer-LM
  cd VisPer-LM
  ```
- Set up the conda environment with the base dependencies.

  ```sh
  conda create -n visper_lm -y
  conda activate visper_lm
  pip install -e .
  pip install flash-attn --no-build-isolation
  pip install scikit-learn icecream datasets pytorch-fid lpips opencv-python-headless
  pip install setuptools==61.0.0
  pip install -e lmms-eval/
  pip install huggingface_hub==0.24.7
  pip install transformers==4.41.1
  ```
You can use the Gradio interface to interact with VisPer-LM locally. The demo also supports visualizing the representations from the selected intermediate LLM layers (embedding-loss positions).
```sh
# install demo-specific libraries
pip install -e .["demo"]

# start the demo
CUDA_VISIBLE_DEVICES=0 python demo.py --model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b --PT-model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b
```
Note: We provide a guide to integrating the embedding losses from VisPer-LM into any custom MLLM in Custom_MLLM.md; a minimal sketch of the idea follows below.
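As a rough illustration of the pattern described in Custom_MLLM.md, the hypothetical wrapper below shows one way to attach per-layer probe heads to an existing MLLM and add their embedding losses to the model's NTP loss. `base_mllm`, the linear probe design, the smooth-L1 embedding loss, and the layer indices are placeholder assumptions rather than this repo's actual classes or hyperparameters.

```python
# Hypothetical integration sketch; see Custom_MLLM.md for the actual recipe.
import torch.nn as nn
import torch.nn.functional as F

class MLLMWithEmbedLoss(nn.Module):
    def __init__(self, base_mllm, hidden_dim, target_dim, selected_layers=(8, 16, 24), alpha=0.5):
        super().__init__()
        self.base_mllm = base_mllm            # any causal MLLM that returns hidden states and an NTP loss
        self.selected_layers = selected_layers
        self.alpha = alpha
        # One lightweight probe head per supervised layer (placeholder design).
        self.probes = nn.ModuleDict({str(i): nn.Linear(hidden_dim, target_dim) for i in selected_layers})

    def forward(self, input_ids, pixel_values, labels, target_feats):
        out = self.base_mllm(
            input_ids=input_ids,
            pixel_values=pixel_values,
            labels=labels,
            output_hidden_states=True,
        )
        # target_feats[i]: (B, D_target) features from a frozen target encoder (e.g., depth or segmentation).
        embed_loss = 0.0
        for i in self.selected_layers:
            pred = self.probes[str(i)](out.hidden_states[i]).mean(dim=1)  # pool tokens to (B, D_target)
            embed_loss = embed_loss + F.smooth_l1_loss(pred, target_feats[i])
        out.loss = out.loss + self.alpha * embed_loss                     # combine with the NTP loss
        return out
```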
- Please see Training.md for training commands and dataset preparation.
- We train all our models on 16 AMD MI300X GPUs (192 GB each).
Please see Evaluation.md for evaluation commands.
Please see Probing.md for probing commands.
Method | Training Stages | LLM | Base Encoder | CV-Bench | MMStar | RWQA | OK-VQA | Checkpoint |
---|---|---|---|---|---|---|---|---|
VisPer-LM | PT + IFT | Phi3-4k-mini | CLIP-ViT-L | 62.5 | 36.0 | 58.0 | 56.4 | ckpt |
VisPer-LM | PT + IFT | Phi3-4k-mini | CLIP-ConvNeXT-XXL | 63.9 | 38.4 | 58.4 | 56.5 | ckpt |
VisPer-LM | PT + IFT | Llama3-8b | CLIP-ViT-L | 61.4 | 39.5 | 57.9 | 56.6 | ckpt |
VisPer-LM | PT + IFT | Llama3-8b | CLIP-ConvNeXT-XXL | 61.5 | 38.5 | 55.0 | 59.0 | ckpt |
VisPer-LM | PT + VPT + IFT | Llama3-8b | CLIP-ConvNeXT-XXL | 64.6 | 40.6 | 62.9 | 61.1 | ckpt |
If you found VisPer-LM useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us in your research!
```bibtex
@inproceedings{jain2025visper_lm,
  title={{Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation}},
  author={Jitesh Jain and Zhengyuan Yang and Humphrey Shi and Jianfeng Gao and Jianwei Yang},
  booktitle={NeurIPS},
  year={2025}
}
```
We thank the authors of LLaVA-1.5, OneFormer, Depth-Anything v2, and unCLIP-SD for open-sourcing their codebases and checkpoints. We are grateful to the authors of Cambrian and MMStar for releasing their code for the CV-Bench and MMStar evaluations, respectively.