- Tencent, YouTu lab
- Shanghai
- linchuming.github.io
Stars
🔊 Repository for our NAACL-HLT 2019 paper: AudioCaps
Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
Calculate FVD, PSNR, SSIM, and LPIPS to evaluate the quality of generated or predicted videos.
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
[ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
[NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding
SEED-Voken: A Series of Powerful Visual Tokenizers
A suite of image and video neural tokenizers
Automatic Video Generation from Scientific Papers
The official code repository for SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
The official code repository for LeVo: High-Quality Song Generation with Multi-Preference Alignment
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud. It understands text, audio, images, and video, and generates speech in real time.
A fundamental toolkit designed for music, song, and audio generation
[ICML 2025] SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics rec…
Official implementation of "JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization"
ACE-Step: A Step Towards Music Generation Foundation Model
AudioLDM training, finetuning, evaluation and inference.
Official PyTorch implementation of ReWaS (AAAI'25) "Read, Watch and Scream! Sound Generation from Text and Video"
The official implementation of OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
[ACL 2025 Main] UniCodec: a unified audio codec with a single codebook to support multi-domain audio data, including speech, music, and sound
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling
Text-audio foundation model from Boson AI
HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation