My research lies at the intersection of multimodal learning, vision-language pretraining, and vision foundation models.
I’m particularly interested in developing vision-centric models that serve as general-purpose backbones for multimodal systems.
My goal is to design vision encoders that not only deeply understand images but also interface naturally with language, enabling richer reasoning and more seamless interaction across modalities. My current research interests include:
- Representation learning for visual and multimodal understanding
- Leveraging synthetic and weakly labeled data for improved scalability
- Building open, reproducible, and extensible vision backbones for multimodal research
I also contribute to open-source projects such as OpenVision, and I'm committed to making powerful vision models accessible to the broader research community.
📄 For more about my work, check out my personal homepage!