Jitesh Jain*, Zhengyuan Yang, Humphrey Shi†, Jianfeng Gao†, Jianwei Yang†
*Work done during an internship at Microsoft Research, Redmond; †Equal Advising
[Project Page] | [arXiv] | [Model Checkpoints] | [Video] | [BibTeX]
This repo contains the code for our paper Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation.
We propose distilling target visual information from a set of target encoders into the intermediate representations of the LLM. During training, we adopt a predictive embedding optimization approach at selected LLM layers, minimizing the embedding losses alongside the next-token prediction (NTP) objective, which yields a vision-centric approach to training Multimodal Large Language Models.
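To make the objective concrete, below is a minimal PyTorch-style sketch of how the NTP loss and the auxiliary embedding losses could be combined. The probe heads, token pooling, cosine-based embedding loss, and `alpha` weight are illustrative assumptions, not the exact implementation in this repo.

```python
# Minimal sketch (not the repo's exact code): hidden states at selected LLM layers
# are projected by small probe heads and regressed onto frozen target-encoder
# embeddings, alongside the usual next-token prediction (NTP) loss.
import torch.nn.functional as F

def combined_loss(llm_outputs, labels, probe_heads, target_feats, selected_layers, alpha=0.5):
    # Standard NTP loss from the language-model head.
    logits = llm_outputs.logits
    ntp_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # Auxiliary embedding losses at the selected intermediate layers.
    embed_loss = 0.0
    for layer_idx in selected_layers:
        hidden = llm_outputs.hidden_states[layer_idx]             # (B, T, D_llm)
        pred = probe_heads[str(layer_idx)](hidden).mean(dim=1)    # pool to (B, D_target)
        target = target_feats[layer_idx]                          # (B, D_target), from a frozen target encoder
        embed_loss = embed_loss + (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

    return ntp_loss + alpha * embed_loss
```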
- [September 21, 2025]: VisPer-LM is accepted to NeurIPS 2025! 🥂
- [December 14, 2024]: Our demo is now available on HuggingFace Spaces. Thanks to the HF team for their support with the ZeroGPU grant! 🤗
- [December 12, 2024]: 🚀 Project Page, ArXiv Preprint and GitHub Repo are public! We also open-source the model checkpoints and probes on the Hugging Face Hub! 🎁
Note: We trained all our models on AMD MI300X GPUs. However, this repo provides instructions for NVIDIA GPUs, given their wider usage.
- Clone this repository.

  ```sh
  git lfs install
  git clone https://github.com/SHI-Labs/VisPer-LM
  cd VisPer-LM
  ```
- Set up the conda environment with the base dependencies.

  ```sh
  conda create -n visper_lm -y
  conda activate visper_lm
  pip install -e .
  pip install flash-attn --no-build-isolation
  pip install scikit-learn icecream datasets pytorch-fid lpips opencv-python-headless
  pip install setuptools==61.0.0
  pip install -e lmms-eval/
  pip install huggingface_hub==0.24.7
  pip install transformers==4.41.1
  ```
You can use the Gradio interface to interact with VisPer-LM locally. The demo also supports visualizing the representations from the selected intermediate LLM layers (embedding-loss positions).
```sh
# install demo-specific libraries
pip install -e .["demo"]

# start the demo
CUDA_VISIBLE_DEVICES=0 python demo.py --model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b --PT-model-path shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b
```
Note: We provide a guide to integrating the embedding losses from VisPer-LM into any custom MLLM in Custom_MLLM.md; a minimal sketch of the idea follows below.
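As a rough illustration of the pattern described in Custom_MLLM.md, the hypothetical wrapper below shows one way to attach per-layer probe heads to an existing MLLM and add their embedding losses to the model's NTP loss. `base_mllm`, the linear probe design, the smooth-L1 embedding loss, and the layer indices are placeholder assumptions rather than this repo's actual classes or hyperparameters.

```python
# Hypothetical integration sketch; see Custom_MLLM.md for the actual recipe.
import torch.nn as nn
import torch.nn.functional as F

class MLLMWithEmbedLoss(nn.Module):
    def __init__(self, base_mllm, hidden_dim, target_dim, selected_layers=(8, 16, 24), alpha=0.5):
        super().__init__()
        self.base_mllm = base_mllm            # any causal MLLM that returns hidden states and an NTP loss
        self.selected_layers = selected_layers
        self.alpha = alpha
        # One lightweight probe head per supervised layer (placeholder design).
        self.probes = nn.ModuleDict({str(i): nn.Linear(hidden_dim, target_dim) for i in selected_layers})

    def forward(self, input_ids, pixel_values, labels, target_feats):
        out = self.base_mllm(
            input_ids=input_ids,
            pixel_values=pixel_values,
            labels=labels,
            output_hidden_states=True,
        )
        # target_feats[i]: (B, D_target) features from a frozen target encoder (e.g., depth or segmentation).
        embed_loss = 0.0
        for i in self.selected_layers:
            pred = self.probes[str(i)](out.hidden_states[i]).mean(dim=1)  # pool tokens to (B, D_target)
            embed_loss = embed_loss + F.smooth_l1_loss(pred, target_feats[i])
        out.loss = out.loss + self.alpha * embed_loss                     # combine with the NTP loss
        return out
```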
- Please see Training.md for training commands and dataset preparation.
- We train all our models on 16 AMD MI300X GPUs (192 GB each).
Please see Evaluation.md for evaluation commands.
Please see Probing.md for probing commands.
Method | Training Stages | LLM | Base Encoder | CV-Bench | MMStar | RWQA | OK-VQA | Checkpoint |
---|---|---|---|---|---|---|---|---|
VisPer-LM | PT + IFT | Phi3-4k-mini | CLIP-ViT-L | 62.5 | 36.0 | 58.0 | 56.4 | ckpt |
VisPer-LM | PT + IFT | Phi3-4k-mini | CLIP-ConvNeXT-XXL | 63.9 | 38.4 | 58.4 | 56.5 | ckpt |
VisPer-LM | PT + IFT | Llama3-8b | CLIP-ViT-L | 61.4 | 39.5 | 57.9 | 56.6 | ckpt |
VisPer-LM | PT + IFT | Llama3-8b | CLIP-ConvNeXT-XXL | 61.5 | 38.5 | 55.0 | 59.0 | ckpt |
VisPer-LM | PT + VPT + IFT | Llama3-8b | CLIP-ConvNeXT-XXL | 64.6 | 40.6 | 62.9 | 61.1 | ckpt |
If you found VisPer-LM useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us in your research!
```bibtex
@inproceedings{jain2025visper_lm,
  title={{Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation}},
  author={Jitesh Jain and Zhengyuan Yang and Humphrey Shi and Jianfeng Gao and Jianwei Yang},
  booktitle={NeurIPS},
  year={2025}
}
```
We thank the authors of LLaVA-1.5, OneFormer, Depth-Anything v2, and unCLIP-SD for open-sourcing their codebases and checkpoints. We are grateful to the authors of Cambrian and MMStar for releasing their code for the CV-Bench and MMStar evaluations, respectively.