Stars
Vision Manus: Your versatile Visual AI assistant
MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining
Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
A fork to add multimodal model training to open-r1
[ICCV 2025] Official Implementation for "Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition"
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curatio…
Codebase for Aria - an Open Multimodal Native MoE
A Next-Generation Training Engine Built for Ultra-Large MoE Models
[ICCV 2021 Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-…
[arXiv:2309.16669] Code release for "Training a Large Video Model on a Single Machine in a Day"
Theia: Distilling Diverse Vision Foundation Models for Robot Learning
The code repository for "Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation"
ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse
FlagScale is a large model toolkit based on open-sourced projects.
Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation (NeurIPS 2023)
AAAI2024 - Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective (ACL 2024)
📖 A curated list of resources dedicated to hallucination of multimodal large language models (MLLM).
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
Official implementation of our paper at ACL 2023: Pre-training Multi-party Dialogue Models with Latent Discourse Inference
⏰ Collaboratively track worldwide conference deadlines (Website, Python CLI, WeChat Applet) / If you find it useful, please star this project, thanks~
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Refine high-quality datasets and visual AI models
Official PyTorch implementation for "Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels"