whitesay/ross3d

 
 

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

[Project Page] [Model Zoo]

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness by Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang.

Abstract. The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird’s-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.

Release

  • [2025/06/26] 🔥🔥🔥 Ross3D has been accepted by ICCV 2025! See you in Hawaii. 🔥🔥🔥
  • [2025/06/09] 🔥 All code and checkpoints of Ross3D have been released.
  • [2025/04/02] 🔥 Ross3D has been released. Check out the paper for details.

Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Install

If you are not using Linux, do NOT proceed.

  1. Clone this repository and navigate to the ross3d folder
git clone https://github.com/Haochen-Wang409/ross3d.git
cd ross3d
  2. Install the package
conda create -n ross3d python=3.10 -y
conda activate ross3d
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn --no-build-isolation     # install flash attention
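
Before running the install steps, it can help to confirm that the machine matches what they assume. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it only checks the two requirements stated above (Linux, and the Python 3.10 environment created by the conda command).

```python
import platform
import sys


def check_environment(system: str, python_version: tuple) -> list:
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: it only encodes the two requirements named in
    the install instructions (Linux, Python 3.10).
    """
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (Linux is required)")
    if tuple(python_version[:2]) != (3, 10):
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found; "
            "the install steps create a Python 3.10 environment"
        )
    return problems


if __name__ == "__main__":
    issues = check_environment(platform.system(), sys.version_info)
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks consistent with the install steps.")
```

Run it inside the activated ross3d environment. A warning about the Python version usually means the conda environment was not activated.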

Data Preparation

Please follow these instructions for details.

Processed BEV files can be found here.

Model Zoo

Method | LLM | Checkpoint
Ross3D-7B | LLaVA-Video-Qwen2-7B | HF

Evaluation

bash scripts/3d/eval/eval_all.sh HaochenWang/llava-video-qwen2-7b-ross3d <num_frames>

We set <num_frames>=32 by default.
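
To evaluate with several frame counts, the command above can be assembled programmatically. This is a sketch under assumptions: `build_eval_command` is a hypothetical helper, and it relies only on the script path, checkpoint name, and default of 32 frames given above. It builds the argument list and leaves launching it to the caller.

```python
import shlex

# Values taken from the evaluation command above.
EVAL_SCRIPT = "scripts/3d/eval/eval_all.sh"
CHECKPOINT = "HaochenWang/llava-video-qwen2-7b-ross3d"


def build_eval_command(checkpoint: str = CHECKPOINT, num_frames: int = 32) -> list:
    """Build the argv list for the evaluation script (hypothetical helper)."""
    if num_frames <= 0:
        raise ValueError("num_frames must be a positive integer")
    return ["bash", EVAL_SCRIPT, checkpoint, str(num_frames)]


if __name__ == "__main__":
    # Print the command for the default setting; to launch it, pass the
    # list to subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(build_eval_command()))
```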

Training

Ross3D was trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size unchanged, where global batch size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
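
The batch-size rule can be checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers (a per-device batch size of 4 with 2 accumulation steps on 8 GPUs) are hypothetical and are not the values used in the training script, and both function names are made up for this example.

```python
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Global batch size = per-device batch size x accumulation steps x GPUs."""
    return per_device * grad_accum * num_gpus


def accumulation_steps_for(target_global: int, per_device: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach target_global on a smaller setup."""
    per_step = per_device * num_gpus
    if target_global % per_step != 0:
        raise ValueError(
            f"global batch size {target_global} is not divisible by "
            f"per_device * num_gpus = {per_step}"
        )
    return target_global // per_step


# Hypothetical reference setup: 8 GPUs, per-device batch 4, 2 accumulation steps.
reference = global_batch_size(per_device=4, grad_accum=2, num_gpus=8)

# Moving to 4 GPUs with per-device batch 2 requires more accumulation steps.
steps = accumulation_steps_for(reference, per_device=2, num_gpus=4)
print(reference, steps)  # 64 8
```

Substitute the actual values from scripts/3d/train_ross3d.sh for the reference setup before relying on the result.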

Download VAE checkpoints

Our base model takes the VAE from FLUX.1-dev as the fine-grained tokenizer. Download the checkpoint from this URL and put it into ./checkpoints.

Download mm_inv_projector

Download the mm_inv_projector pre-trained on 2D data from this URL and put the mm_inv_projector.bin into ./checkpoints.
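
A quick check that the downloads landed where training expects them can save a failed launch. This is a minimal sketch: `missing_checkpoint_files` is a hypothetical helper, and it only looks for mm_inv_projector.bin because that is the only file name given above. The VAE file names are not specified here, so add them to `required` once they are known.

```python
from pathlib import Path


def missing_checkpoint_files(checkpoint_dir, required=("mm_inv_projector.bin",)):
    """Return the names in `required` that are not files in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_checkpoint_files("./checkpoints")
    if missing:
        print("Missing from ./checkpoints:", ", ".join(missing))
    else:
        print("All listed checkpoint files are present.")
```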

Instruction Tuning

Training script with DeepSpeed ZeRO-3 can be found in scripts/3d/train_ross3d.sh.

Citation

If you find Ross3D useful for your research and applications, please cite using this BibTeX:

@article{wang2025ross3d,
  title={Ross3D: Reconstructive visual instruction tuning with 3D-awareness},
  author={Wang, Haochen and Zhao, Yucheng and Wang, Tiancai and Fan, Haoqiang and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2504.01901},
  year={2025}
}

Acknowledgement

About

[ICCV'25] Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
