Stars
Official implementation of IROS 2025 paper Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Thinking in 360°: Humanoid Visual Search in the Wild
GigaWorld-0: World Models as Data Engine to Empower Embodied AI
WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Official PyTorch Implementation of "Flow Map Distillation Without Data"
[NeurIPS 2025] PyTorch implementation of ThinkSound, a unified framework for generating audio from any modality, guided by Chain-of-Thought (CoT) reasoning.
Official repository for “DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation”
HunyuanVideo-1.5: A leading lightweight video generation model
Action-Guided Knowledge Distillation for VLA Models
Muskie: Multi-view Masked Image Modeling for 3D Vision Pre-training
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Lumina-DiMOO - An Open-Sourced Multi-Modal Large Diffusion Language Model
ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation
MuM's a pretty good feature extractor for 3D tasks, probably the best.
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Official Repository of POMA-3D: The Point Map Way to 3D Scene Understanding.
NaTex: Seamless Texture Generation as Latent Color Diffusion
Official PyTorch Implementation for "Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising"
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
The official repo of Qwen2-Audio, a chat and pretrained large audio-language model proposed by Alibaba Cloud.