The official PyTorch implementation of "LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs" [Paper]
The LLaVA-SP implementation changes are in llava_arch.py, clip_encoder.py, llava_trainer.py, and train.py.
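As a rough, hypothetical illustration of the general idea (not the authors' code; the actual LLaVA-SP logic lives in the files above), extra learned tokens could be appended to the CLIP patch features before they reach the projector and LLM:

```python
# Hypothetical sketch only: append learned "spatial" tokens to CLIP patch
# features. See clip_encoder.py / llava_arch.py for the real LLaVA-SP method.
import torch
import torch.nn as nn

class SpatialTokenAdapter(nn.Module):
    def __init__(self, hidden_dim: int, num_spatial_tokens: int = 6):
        super().__init__()
        # Learned embeddings standing in for the visual spatial tokens.
        self.spatial_tokens = nn.Parameter(
            torch.randn(num_spatial_tokens, hidden_dim) * 0.02
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, hidden_dim) from the CLIP encoder.
        batch = patch_features.size(0)
        tokens = self.spatial_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Concatenate so the projector/LLM sees the extra tokens alongside
        # the ordinary patch features.
        return torch.cat([patch_features, tokens], dim=1)

features = torch.randn(2, 576, 1024)   # e.g. CLIP ViT-L/14 at 336px: 24x24 patches
adapter = SpatialTokenAdapter(hidden_dim=1024)
print(adapter(features).shape)         # torch.Size([2, 582, 1024])
```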
Please see https://github.com/haotian-liu/LLaVA/ for setup and usage instructions.
Please check out https://huggingface.co/Levideus/models for all public LLaVA-SP checkpoints.
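A minimal loading sketch, assuming this fork keeps upstream LLaVA's model-builder interface (both paths below are placeholders):

```python
# Sketch: load a LLaVA-SP LoRA checkpoint on top of its Vicuna base model.
# Assumes upstream LLaVA's builder interface is unchanged in this fork.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "/path/llava-sp-cropping-lora"   # LLaVA-SP LoRA weights
model_base = "/path/vicuna-1.5-7b"            # base LLM the LoRA was trained on

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=model_base,
    model_name=get_model_name_from_path(model_path),
)
```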
```bash
python llava/eval/run_llava.py \
    --model_path /path/llava-sp-cropping-lora \
    --model_base /path/vicuna-1.5-7b
```
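Note that upstream run_llava.py also expects an image and a text query. The same evaluation can be driven from Python; a sketch assuming this fork keeps upstream LLaVA's eval_model interface, with the prompt and image path as placeholders:

```python
# Sketch: call the evaluation entry point programmatically, following the
# upstream LLaVA pattern. Prompt and image path below are placeholders.
from llava.eval.run_llava import eval_model
from llava.mm_utils import get_model_name_from_path

model_path = "/path/llava-sp-cropping-lora"

args = type("Args", (), {
    "model_path": model_path,
    "model_base": "/path/vicuna-1.5-7b",
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe the spatial layout of this image.",
    "conv_mode": None,
    "image_file": "/path/example.jpg",
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```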
If you find LLaVA-SP useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{lou2025llavasp,
      title={LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs},
      author={Lou, Haoran and Fan, Chunxiao and Liu, Ziyan and Wu, Yuexin and Wang, Xinliang},
      eprint={2507.00505},
      archivePrefix={arXiv},
      year={2025}
}
```