Huazhong University of Science and Technology
Stars
[NeurIPS'25] Official repository of Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
"DeepCode: Open Agentic Coding (Paper2Code & Text2Web & Text2Backend)"
[NeurIPS 2025] NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
[CVPR 2025 Highlight] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Official code repository of Shuffle-R1
[NeurIPS 2025] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
[ICCV 2025] ACE-G is an architecture and pre-training scheme to improve generalization for scene coordinate regression-based visual relocalization.
Official implementation of Spatial-Forcing: Implicit Spatial Representation Alignment for Vision-Language-Action Model
[NeurIPS 2025] Pixel-Perfect Depth
UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation
[ICCV'25] Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
[NeurIPS 2025] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Official implementation of Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Official implementation for "JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation"
[ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
[NeurIPS 2024] Official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs".
Official Code for "Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search"
Official repo of the paper "Reconstruction Alignment Improves Unified Multimodal Models", unlocking massive zero-shot potential in unified multimodal models through self-supervised learning.
Official implementation of "VIRAL: Visual Representation Alignment for MLLMs".
Benchmarking Knowledge Transfer in Lifelong Robot Learning
[ICLR'25] Official code for the paper 'MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs'
[CVPR'25 Oral] MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision