Stars
Official repo for 'Large Multimodal Models Evaluation: A Survey'
DepictQA: Depicted Image Quality Assessment with Vision Language Models
A curated list of recent diffusion models for video generation, editing, and various other applications.
🚀 [Large models] Train a 26M-parameter vision multimodal VLM from scratch in just 1 hour! 🌏
Chinese NLP solutions (large models, data, models, training, inference)
A Framework of Small-scale Large Multimodal Models
A Simple Framework of Small-scale LMMs for Video Understanding
A minimal PyTorch re-implementation of Qwen3 VL with a fancy CLI
The simplest, fastest repository for training/finetuning small-sized VLMs.
Recipes for shrinking, optimizing, and customizing cutting-edge vision models. 💜
A unified inference and post-training framework for accelerated video generation.
A collection of super-resolution papers, data, and repositories
(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.
🔥[Information Fusion 2024, Official Code] for the paper "Prompt-guided image color aesthetics assessment: Models, datasets and benchmarks". Official weights and demos provided. The first multi-factor color aesthetics assessment dataset, algorithm, and benchm…
Teaching LMMs for Image Quality Scoring and Interpreting
An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
A PyTorch implementation of EfficientNet
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
[NeurIPS 2025] Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
The released data for the paper "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models".
The official code for the ACL 2025 paper "Modeling Uncertainty in Composed Image Retrieval via Probabilistic Embeddings"
[ACL 2025] Towards Text-Image Interleaved Retrieval
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.