Wuhan University
Wuhan, China
https://www.whu.edu.cn/
Stars
Towards Scalable Pre-training of Visual Tokenizers for Generation
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
Official code of Motus: A Unified Latent Action World Model
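Interview questions (with answers) for large-model algorithm positions: common questions and concepts explained. "Large-model interview questions", "algorithm position interviews", "common interview questions", "large-model algorithm interviews", "large-model application fundamentals"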
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
[TPAMI 2025] Official code for "SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation"
[AAAI 2026] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
[NeurIPS 2025] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
HunyuanVideo-1.5: A leading lightweight video generation model
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
Tongyi Deep Research, the Leading Open-source Deep Research Agent
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
[ICCV 2025] VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE
Official project page of MTVCrafter, a new paradigm for animating arbitrary characters with 4D motion tokens.
[ECCV 2024] MotionLCM: This repo is the official implementation of "MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model"
[ICCV 2025] Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
The official SpeakerVid-5M data curation code.
A feature-rich command-line audio/video downloader
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set (CVPRW 2019). A PyTorch implementation.
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
[ICLR 2025] Autoregressive Video Generation without Vector Quantization
MMaDA - Open-Sourced Multimodal Large Diffusion Language Models