Starred repositories
[NeurIPS 2025] SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models.
MotionStream: Real-Time Video Generation with Interactive Motion Controls
Examples of my Claude Code infrastructure with skill auto-activation, hooks, and agents
[ECCV 2024] Official GitHub repository for the paper "LingoQA: Visual Question Answering for Autonomous Driving"
FIBO is a SOTA, first open-source, JSON-native text-to-image model built for controllable, predictable, and legally safe image generation.
Your AI mate who chats on Tinder and schedules dates for you.
Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images"
A collection of papers on World Models for Autonomous Driving (and Robotics).
The official code of "VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning" [NeurIPS25]
[ICCV 2025 Oral] SceneSplat - Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models".
"Dive into LLMs" (动手学大模型): a series of hands-on programming tutorials
ACL 2025: Synthetic data generation pipelines for text-rich images.
AlignCLIP: Improving Cross-Modal Alignment in CLIP (ICLR 2025)
ChronoDepth: Learning Temporally Consistent Video Depth from Video Diffusion Priors
Official implementation of DepthLM
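A free, open-source alternative to Wispr Flow | A next-generation Chinese desktop voice workflow integrating local FunASR models and configurable large language models
Generate high-definition short videos with one click using large AI models.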
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
[ICLR 2025] Official PyTorch Implementation of MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
A curated list of publications on image and video segmentation leveraging Multimodal Large Language Models (MLLMs), highlighting state-of-the-art methods, innovative applications, and key advanceme…
[CVPR 2024 Highlight] Putting the Object Back Into Video Object Segmentation
Enjoy the magic of Diffusion models!
Code for 3D-LLM: Injecting the 3D World into Large Language Models