# Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

by Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang
Abstract. The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
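For intuition only, below is a minimal PyTorch sketch of how the two reconstruction objectives might be combined with the usual instruction-tuning loss. Every name here (`ross3d_loss`, `mask_views`, `reconstruct_views`, `reconstruct_bev`) is hypothetical and does not mirror the actual implementation; see the paper and the training scripts below for the real recipe.

```python
import torch
import torch.nn.functional as F

def ross3d_loss(model, views, bev_image, text_tokens, mask_ratio=0.5):
    """Illustrative combination of Ross3D's objectives (all names hypothetical).

    views:       (B, V, C, H, W) multi-view images of a scene
    bev_image:   (B, C, H, W) Bird's-Eye-View rendering of the scene
    text_tokens: tokenized instruction-following targets
    """
    # Standard next-token prediction on the instruction-tuning data.
    lm_loss = model.language_modeling_loss(views, text_tokens)

    # Cross-view reconstruction: mask some views, then recover them by
    # aggregating overlapping information from the remaining views.
    masked_views, mask = model.mask_views(views, mask_ratio)   # mask: (B, V) bool
    recon_views = model.reconstruct_views(masked_views)
    cross_view_loss = F.mse_loss(recon_views[mask], views[mask])

    # Global-view reconstruction: aggregate all views to recover the BEV image.
    recon_bev = model.reconstruct_bev(views)
    global_view_loss = F.mse_loss(recon_bev, bev_image)

    return lm_loss + cross_view_loss + global_view_loss
```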
- [2025/06/26] 🔥🔥🔥 Ross3D has been accepted by ICCV 2025! See you in Hawaii. 🔥🔥🔥
- [2025/06/09] 🔥 All codes and checkpoints of Ross3D have been released.
- [2025/04/02] 🔥 Ross3D has been released. Check out the paper for details.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses of the base language models for checkpoints trained using the dataset (e.g., the Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
If you are not using Linux, do NOT proceed.
- Clone this repository and navigate to the ross3d folder

```bash
git clone https://github.com/Haochen-Wang409/ross3d.git
cd ross3d
```

- Install Package

```bash
conda create -n ross3d python=3.10 -y
conda activate ross3d
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn --no-build-isolation  # install flash attention
```

Please follow this instruction for details.
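To verify the environment, a quick sanity check (assuming the steps above completed without errors):

```python
# Quick environment check: both imports should succeed after installation.
import torch
import flash_attn  # raises ImportError if flash-attn failed to build

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```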
Processed BEV files can be found here.
| Method | LLM | Checkpoint |
|---|---|---|
| Ross3D-7B | LLaVA-Video-Qwen2-7B | HF |
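To fetch the checkpoint programmatically, here is a minimal sketch using `huggingface_hub`; the local directory is an assumption that follows the `./checkpoints` convention used later in this README.

```python
# Minimal sketch: download the Ross3D-7B checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="HaochenWang/llava-video-qwen2-7b-ross3d",
    local_dir="./checkpoints/llava-video-qwen2-7b-ross3d",
)
```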
```bash
bash scripts/3d/eval/eval_all.sh HaochenWang/llava-video-qwen2-7b-ross3d <num_frames>
```

We set `<num_frames>=32` by default.
Ross3D was trained on 8 A100 GPUs with 80GB memory.
To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly.
Always keep the global batch size the same: `per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`. A worked example is sketched below.
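For instance, a hypothetical sanity check (the numbers are illustrative, not the repo defaults):

```python
# Keep the global batch size constant when changing the GPU count.
# Illustrative values: if 8 GPUs used gradient_accumulation_steps=4,
# then 4 GPUs need gradient_accumulation_steps=8 to compensate.
num_gpus = 4
per_device_train_batch_size = 1
gradient_accumulation_steps = 8

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
assert global_batch_size == 32, "global batch size changed -- adjust the flags"
print(f"global batch size: {global_batch_size}")
```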
Our base model takes the VAE from FLUX.1-dev as the fine-grained tokenizer.
Download the checkpoint from this URL and put it into `./checkpoints`.
Download the `mm_inv_projector` pre-trained on 2D data from this URL and put `mm_inv_projector.bin` into `./checkpoints`.
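As a sanity check that the fine-grained tokenizer is available, here is a minimal sketch loading the FLUX.1-dev VAE via `diffusers`. Loading directly from the Hub is shown for illustration; the training scripts instead expect the weights under `./checkpoints` as described above.

```python
# Minimal sketch: load the FLUX.1-dev VAE used as the fine-grained tokenizer
# and encode an image batch into latents (the reconstruction targets).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="vae",
    torch_dtype=torch.bfloat16,
)

images = torch.randn(1, 3, 256, 256, dtype=torch.bfloat16)  # dummy input
latents = vae.encode(images).latent_dist.sample()
print(latents.shape)
```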
The training script with DeepSpeed ZeRO-3 can be found in `scripts/3d/train_ross3d.sh`.
If you find Ross3D useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{wang2025ross3d,
  title={Ross3D: Reconstructive visual instruction tuning with 3D-awareness},
  author={Wang, Haochen and Zhao, Yucheng and Wang, Tiancai and Fan, Haoqiang and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2504.01901},
  year={2025}
}
```

- Video-3D-LLM: the codebase we built upon and the dataset we utilized.
- ScanNet, ScanRefer, Multi3DRefer, SQA3D, ScanQA: the datasets we use.