Starred repositories
Open-source Autonomous 3D Characters on the Web
Official implementation of "PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning" (ICCV 2025)
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
[CVPR 2025 Highlight] SkillMimic: Learning Basketball Interaction Skills from Demonstrations
Training, validation, and inference code for various self-supervised learning (SSL) approaches and architectures.
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
(NeurIPS 2025) OpenOmni: official implementation of "Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis"
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer (a hedged sampling sketch appears after this list)
A Self-adaptation Framework🐙 that adapts LLMs to unseen tasks in real time!
《开源大模型食用指南》("A Guide to 'Eating' Open-Source LLMs"): tutorials tailored for Chinese beginners on quickly fine-tuning (full-parameter/LoRA) and deploying open-source LLMs and multimodal LLMs (MLLMs), both Chinese and international, in a Linux environment
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning (a minimal LoRA usage sketch appears after this list).
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
✨✨Latest Advances on Multimodal Large Language Models
Qwen2.5-Omni is an end-to-end multimodal model from the Qwen team at Alibaba Cloud that understands text, audio, vision, and video and performs real-time speech generation.
Foundational Models for State-of-the-Art Speech and Text Translation
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Implementation of Autoregressive Diffusion in PyTorch
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations and practical examples for both text and image modalities (a generic training sketch appears after this list).
Code to reproduce the results for our SIGGRAPH 2023 paper "Listen, Denoise, Action!"
Official dataset toolbox for the papers "[CVPR 2023] NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions" and "[CVPR 2024] HOI-M3: Capture Multiple Humans and Objects Interact…
Virtual Community: An Open World for Humans, Robots, and Society
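
A few hedged sketches for the starred libraries above. First, SANA: a minimal text-to-image sampling sketch through its 🤗 diffusers integration (`SanaPipeline` landed in diffusers around v0.32). The checkpoint id below is an assumption; check the model card for the current name.

```python
# Text-to-image sampling with SANA via diffusers; the checkpoint id is assumed.
import torch
from diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(prompt="a watercolor fox in a snowy forest",
             height=1024, width=1024).images[0]
image.save("sana_sample.png")
```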
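Next, PEFT: a minimal LoRA setup, as referenced in the entry above. The base model and hyperparameters are illustrative assumptions, not recommendations.

```python
# Wrap a causal LM with LoRA adapters using 🤗 PEFT; only the low-rank
# adapter weights remain trainable.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # assumed small base model

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # low-rank dimension
    lora_alpha=16,    # LoRA scaling factor
    lora_dropout=0.05,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # prints the trainable-parameter fraction
```

The wrapped model then trains with any standard Trainer or custom loop; gradients flow only into the adapter weights.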
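Finally, flow matching: a generic conditional flow-matching training loop in plain PyTorch that illustrates the technique (linear interpolation path, constant target velocity). This is a sketch of the idea, not the facebookresearch/flow_matching API; the network and data are toy placeholders.

```python
# Generic conditional flow matching on 2-D toy data: regress a velocity
# field v_theta(x_t, t) onto the constant target velocity (x1 - x0)
# along the linear path x_t = (1 - t) * x0 + t * x1.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field v_theta(x_t, t) for 2-D data."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # Concatenate the time scalar onto each sample.
        return self.net(torch.cat([x, t[:, None]], dim=-1))

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(256, 2) * 0.5 + 2.0          # stand-in for real data samples
    x0 = torch.randn_like(x1)                      # noise source distribution
    t = torch.rand(x1.size(0))                     # uniform time in [0, 1]
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1   # point on the linear path
    target = x1 - x0                               # target velocity along the path
    loss = ((model(xt, t) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Sampling then integrates the learned velocity field from t=0 to t=1 with any ODE solver (e.g. a simple Euler loop).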