Stars
A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training
Controllable video and image generation: SVD, Animate Anyone, ControlNet, ControlNeXt, LoRA
Character Animation (AnimateAnyone, Face Reenactment)
Unofficial Implementation of Animate Anyone
This is the official implementation of our paper: "MiniMax-Remover: Taming Bad Noise Helps Video Object Removal"
HY-World 1.5: A Systematic Framework for Interactive World Modeling with Real-Time Latency and Geometric Consistency
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
Multi-lingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
[Preprint 2025] Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
A VAE modified from the Descript Audio Codec, with the RVQ replaced by a VAE
🎬IMAGEdit🎬: Let any subject transform. A training-free, plug-and-play framework that aligns prompts and retargets masks to enable any-subject video editing.
[CVPR 2025 (Oral)] Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
Kandinsky 5.0: A family of diffusion models for Video & Image generation
Official repository for the paper “Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models”
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
LongLive: Real-time Interactive Long Video Generation
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
This project is based on the diffusers implementation of the [LTX-Video](https://github.com/Lightricks/LTX-Video) algorithm, optimized and accelerated for multi-GPU inference using the [xDiT](https://github.c…
[ICML 2025 Spotlight] MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding