# Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

by Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang
Abstract. The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
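For intuition only, below is a minimal PyTorch sketch of how the two reconstruction objectives might be combined with the usual instruction-tuning loss. Every name here (`ross3d_loss`, `mask_views`, `reconstruct_views`, `reconstruct_bev`) is hypothetical and does not mirror the actual implementation; see the paper and the training scripts below for the real recipe.

```python
import torch
import torch.nn.functional as F

def ross3d_loss(model, views, bev_image, text_tokens, mask_ratio=0.5):
    """Illustrative combination of Ross3D's objectives (all names hypothetical).

    views:       (B, V, C, H, W) multi-view images of a scene
    bev_image:   (B, C, H, W) Bird's-Eye-View rendering of the scene
    text_tokens: tokenized instruction-following targets
    """
    # Standard next-token prediction on the instruction-tuning data.
    lm_loss = model.language_modeling_loss(views, text_tokens)

    # Cross-view reconstruction: mask some views, then recover them by
    # aggregating overlapping information from the remaining views.
    masked_views, mask = model.mask_views(views, mask_ratio)   # mask: (B, V) bool
    recon_views = model.reconstruct_views(masked_views)
    cross_view_loss = F.mse_loss(recon_views[mask], views[mask])

    # Global-view reconstruction: aggregate all views to recover the BEV image.
    recon_bev = model.reconstruct_bev(views)
    global_view_loss = F.mse_loss(recon_bev, bev_image)

    return lm_loss + cross_view_loss + global_view_loss
```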
- [2025/06/26] 🔥🔥🔥 Ross3D has been accepted by ICCV 2025! See you in Hawaii. 🔥🔥🔥
- [2025/06/09] 🔥 All codes and checkpoints of Ross3D have been released.
- [2025/04/02] 🔥 Ross3D has been released. Check out the paper for details.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses of the base language models for checkpoints trained using the dataset (e.g., the Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
If you are not using Linux, do NOT proceed.
- Clone this repository and navigate to the ross3d folder

```bash
git clone https://github.com/Haochen-Wang409/ross3d.git
cd ross3d
```

- Install Package

```bash
conda create -n ross3d python=3.10 -y
conda activate ross3d
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn --no-build-isolation  # install flash attention
```

Please follow this instruction for details.
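To verify the environment, a quick sanity check (assuming the steps above completed without errors):

```python
# Quick environment check: both imports should succeed after installation.
import torch
import flash_attn  # raises ImportError if flash-attn failed to build

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```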
Processed BEV files can be found here.
| Method | LLM | Checkpoint |
|---|---|---|
| Ross3D-7B | LLaVA-Video-Qwen2-7B | HF |
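To fetch the checkpoint programmatically, here is a minimal sketch using `huggingface_hub`; the local directory is an assumption that follows the `./checkpoints` convention used later in this README.

```python
# Minimal sketch: download the Ross3D-7B checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="HaochenWang/llava-video-qwen2-7b-ross3d",
    local_dir="./checkpoints/llava-video-qwen2-7b-ross3d",
)
```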
```bash
bash scripts/3d/eval/eval_all.sh HaochenWang/llava-video-qwen2-7b-ross3d <num_frames>
```

We set `<num_frames>=32` by default.
Ross3D was trained on 8 A100 GPUs with 80GB memory.
To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly.
Always keep the global batch size the same: `per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`. A worked example is sketched below.
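For instance, a hypothetical sanity check (the numbers are illustrative, not the repo defaults):

```python
# Keep the global batch size constant when changing the GPU count.
# Illustrative values: if 8 GPUs used gradient_accumulation_steps=4,
# then 4 GPUs need gradient_accumulation_steps=8 to compensate.
num_gpus = 4
per_device_train_batch_size = 1
gradient_accumulation_steps = 8

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
assert global_batch_size == 32, "global batch size changed -- adjust the flags"
print(f"global batch size: {global_batch_size}")
```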
Our base model takes the VAE from FLUX.1-dev as the fine-grained tokenizer.
Download the checkpoint from this URL and put it into `./checkpoints`.
Download the `mm_inv_projector` pre-trained on 2D data from this URL and put `mm_inv_projector.bin` into `./checkpoints`.
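As a sanity check that the fine-grained tokenizer is available, here is a minimal sketch loading the FLUX.1-dev VAE via `diffusers`. Loading directly from the Hub is shown for illustration; the training scripts instead expect the weights under `./checkpoints` as described above.

```python
# Minimal sketch: load the FLUX.1-dev VAE used as the fine-grained tokenizer
# and encode an image batch into latents (the reconstruction targets).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="vae",
    torch_dtype=torch.bfloat16,
)

images = torch.randn(1, 3, 256, 256, dtype=torch.bfloat16)  # dummy input
latents = vae.encode(images).latent_dist.sample()
print(latents.shape)
```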
The training script with DeepSpeed ZeRO-3 can be found in `scripts/3d/train_ross3d.sh`.
If you find Ross3D useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{wang2025ross3d,
  title={Ross3D: Reconstructive visual instruction tuning with 3D-awareness},
  author={Wang, Haochen and Zhao, Yucheng and Wang, Tiancai and Fan, Haoqiang and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2504.01901},
  year={2025}
}
```

- Video-3D-LLM: the codebase we built upon and the dataset we utilized.
- ScanNet, ScanRefer, Multi3DRefer, SQA3D, ScanQA: the datasets we use.