The repository provides code for the EgoMAN model and dataset-creation scripts.
Simulation of manipulation tasks using Galaxea robots
F1: A Vision Language Action Model Bridging Understanding and Generation to Actions
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Building General-Purpose Robots Based on Embodied Foundation Model
DelinQu / SimplerEnv-OpenVLA
Forked from simpler-env/SimplerEnv. Evaluating and reproducing real-world robot manipulation policies (e.g., RT-1, RT-1-X, Octo, and OpenVLA) in simulation under common setups (e.g., Google Robot, WidowX+Bridge)
[NeurIPS 2025] Official implementation of "RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics"
ViPE: Video Pose Engine for Geometric 3D Perception
Wan: Open and Advanced Large-Scale Video Generative Models
Qwen-Image is a powerful image generation foundation model capable of complex text rendering and precise image editing.
RoboBrain 2.0: Advanced version of RoboBrain. See Better. Think Harder. Do Smarter. 🎉🎉🎉
Practicalli customisations to the Doom Emacs configuration
[ICML 2025] OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Official implementation of the paper: Task Reconstruction and Extrapolation for $\pi_0$ using Text Latent (https://arxiv.org/pdf/2505.03500)
Code for RSS 2025 paper "Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies"
A Modular Toolkit for Robot Kinematic Optimization
The official implementation of the paper "Human Motion Diffusion as a Generative Prior"
[CoRL 2025] UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
[RSS 2025] Learning to Act Anywhere with Task-centric Latent Actions
NVIDIA Isaac GR00T N1.6 - A Foundation Model for Generalist Robots.
[CVPR 2025] The official implementation of "Universal Actions for Enhanced Embodied Foundation Models"
Awesome-LLM-3D: a curated list of resources on multi-modal large language models in the 3D world
《开源大模型食用指南》: a tutorial tailored for Chinese beginners, covering rapid fine-tuning (full-parameter/LoRA) and deployment of domestic and international open-source large language models (LLMs) and multimodal large models (MLLMs) in a Linux environment
CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
RoboDual: Dual-System for Robotic Manipulation
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.