We use CUDA 12.1.1. Run
```
conda create -n ovmono3d python=3.8.20
conda activate ovmono3d
pip install torch==2.4.1 torchvision==0.19.1 --index-url https://download.pytorch.org/whl/cu121
```

to create the environment and install PyTorch.
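To confirm the environment picked up the intended CUDA 12.1 build, a quick check along these lines (run inside the ovmono3d environment) can help:

```python
# Sanity check: versions and GPU visibility for the freshly created env.
import torch
import torchvision

print(torch.__version__, torchvision.__version__)  # expect 2.4.1 / 0.19.1 (cu121 build)
print(torch.version.cuda)                          # expect "12.1"
print(torch.cuda.is_available())                   # True if a CUDA GPU is visible
```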
Run
```
sh setup.sh
```

to install additional dependencies and download the model checkpoints of OVMono3D-LIFT and other foundation models.
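To verify the download, you can check for the checkpoint at the path used elsewhere in this README (a minimal sketch; setup.sh may download additional files not checked here):

```python
# Check that the OVMono3D-LIFT checkpoint downloaded by setup.sh is in place.
from pathlib import Path

ckpt = Path("checkpoints/ovmono3d_lift.pth")
print(ckpt.exists(), ckpt.stat().st_size if ckpt.exists() else "missing")
```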
Run
```
python demo/demo.py --config-file configs/OVMono3D_dinov2_SFP.yaml \
    --input-folder datasets/coco_examples \
    --labels-file datasets/coco_examples/labels.json \
    --threshold 0.45 \
    MODEL.ROI_HEADS.NAME ROIHeads3DGDINO \
    MODEL.WEIGHTS checkpoints/ovmono3d_lift.pth \
    INPUT.USE_DEPTH True \
    OUTPUT_DIR output/coco_examples
```

to get the results for the example COCO images. The depth maps and estimated camera intrinsics for these example images are pre-generated as .npz files alongside each .jpg; demo.py loads them automatically when present.
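If you want to peek at what one of these pre-generated files contains, a minimal inspection sketch (it assumes only that the .npz archives sit next to the example images; the key names inside are whatever demo.py expects):

```python
# List the arrays stored in one of the pre-generated depth/intrinsics .npz files.
from pathlib import Path
import numpy as np

npz_files = sorted(Path("datasets/coco_examples").glob("*.npz"))
arc = np.load(npz_files[0])
for key in arc.files:
    print(key, arc[key].shape, arc[key].dtype)
```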
You can also try your own images and prompted category labels. See labels.json for the expected label-file format. If you know the camera intrinsics, you can pass them as arguments via --focal-length <float> and --principal-point <float> <float>. Check demo.py for more details.
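For reference, the two flags describe a standard pinhole intrinsics matrix; a small sketch of the correspondence (assuming a single focal length shared by both axes, which matches the flag interface):

```python
# Pinhole intrinsics implied by --focal-length f and --principal-point px py.
import numpy as np

def intrinsics_matrix(f: float, px: float, py: float) -> np.ndarray:
    return np.array([[f,   0.0, px],
                     [0.0, f,   py],
                     [0.0, 0.0, 1.0]])

print(intrinsics_matrix(721.5, 320.0, 240.0))
```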
Note: results on in-the-wild images may be lower quality due to inaccurate estimated intrinsics and depth.
Please follow the instructions in Omni3D to set up the datasets.
Run
```
sh ./download_data.sh
```

to download our pre-processed OVMono3D 2D predictions.
The training and evaluation scripts expect per-image depth maps generated by an off-the-shelf depth predictor.
UniDepth (used for SUNRGBD / Hypersim / ARKitScenes / Objectron / KITTI / nuScenes): set up a separate environment per UniDepth's official instructions, then run

```
python tools/unidepth_script.py --dataset SUNRGBD --split train
# repeat for each (dataset, split) you need
```

Metric3D (used for Cityscapes3D): set up a separate environment per Metric3D's official instructions, then run

```
python tools/metric3d_script.py --dataset Cityscapes3D --split test
```

The depth-augmented JSONs and .npy files are written under datasets/Omni3D_unidepth/, datasets/Omni3D_metric3d/, datasets/pseudo_unidepth/, and datasets/pseudo_metric3d/. Both unidepth_script.py and metric3d_script.py use separate Python environments from the main OVMono3D environment to avoid dependency conflicts.
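After the scripts finish, a quick sanity check over a few generated depth maps can catch path or dtype issues early (a sketch; it only assumes .npy files appear somewhere under the output roots listed above):

```python
# Print shape, dtype, and value range for a handful of generated depth maps.
from pathlib import Path
import numpy as np

root = Path("datasets/Omni3D_unidepth")
for npy_path in sorted(root.rglob("*.npy"))[:5]:
    depth = np.load(npy_path)
    print(npy_path, depth.shape, depth.dtype, float(depth.min()), float(depth.max()))
```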
The OVMono3D-LIFT pre-trained checkpoint is downloaded by setup.sh to checkpoints/ovmono3d_lift.pth.
Evaluate base categories (model's own 2D head):
```
bash scripts/eval_base.sh
```

Evaluate novel categories (Grounding DINO oracle 2D, target-aware metric):

```
bash scripts/eval_novel.sh
```

Evaluate novel categories with the previous-metric oracle:

```
bash scripts/eval_novel_prev.sh
```

Evaluate Cityscapes3D (requires Metric3D depth pre-generation, see Data above):

```
bash scripts/eval_cityscapes.sh
```

TEST.CAT_MODE controls the category set: novel, base, or all. DATASETS.ORACLE2D_PROMPT selects the 2D prompt source: gdino (target-aware) or gdino_previous_metric.
To run inference and evaluation of OVMono3D-GEO, use the following commands:
```
python tools/ovmono3d_geo.py
python tools/eval_ovmono3d_geo.py
```

To train OVMono3D-LIFT from scratch:

```
bash scripts/dino_bs64_unidepth_omni3d.sh
```

The script trains with INPUT.USE_DEPTH=True and UniDepth-augmented annotations (under datasets/Omni3D_unidepth/, generated by tools/unidepth_script.py; see Data above). It requires 8 GPUs by default; adjust --num-gpus and SOLVER.IMS_PER_BATCH for fewer.
These are the training hyperparameters used in our experiments. They can be customized to suit your requirements, but note that performance may vary across configurations.
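As a rough illustration of the batch-size adjustment mentioned above (the 64-image total comes from the script name; scaling linearly with GPU count is a common rule of thumb, not a repo requirement):

```python
# Scale the total batch size with GPU count, keeping per-GPU load constant.
def scaled_batch_size(total_batch: int, ref_gpus: int, gpus: int) -> int:
    per_gpu = total_batch // ref_gpus      # e.g. 64 images over 8 GPUs -> 8 per GPU
    return per_gpu * gpus

print(scaled_batch_size(64, 8, 4))  # -> 32 for a 4-GPU run
```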
If you find this work useful for your research, please kindly cite:
```
@article{yao2024open,
  title={Open Vocabulary Monocular 3D Object Detection},
  author={Yao, Jin and Gu, Hao and Chen, Xuweiyi and Wang, Jiayun and Cheng, Zezhou},
  journal={arXiv preprint arXiv:2411.16833},
  year={2024}
}
```

Please also consider citing the awesome work of Omni3D and the datasets used in Omni3D.
BibTeX
```
@inproceedings{brazil2023omni3d,
  author = {Garrick Brazil and Abhinav Kumar and Julian Straub and Nikhila Ravi and Justin Johnson and Georgia Gkioxari},
  title = {{Omni3D}: A Large Benchmark and Model for {3D} Object Detection in the Wild},
  booktitle = {CVPR},
  address = {Vancouver, Canada},
  month = {June},
  year = {2023},
  organization = {IEEE},
}

@inproceedings{Geiger2012CVPR,
  author = {Andreas Geiger and Philip Lenz and Raquel Urtasun},
  title = {Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite},
  booktitle = {CVPR},
  year = {2012}
}

@inproceedings{caesar2020nuscenes,
  title={nuScenes: A multimodal dataset for autonomous driving},
  author={Caesar, Holger and Bankiti, Varun and Lang, Alex H and Vora, Sourabh and Liong, Venice Erin and Xu, Qiang and Krishnan, Anush and Pan, Yu and Baldan, Giancarlo and Beijbom, Oscar},
  booktitle={CVPR},
  year={2020}
}

@inproceedings{song2015sun,
  title={{SUN RGB-D}: A {RGB-D} scene understanding benchmark suite},
  author={Song, Shuran and Lichtenberg, Samuel P and Xiao, Jianxiong},
  booktitle={CVPR},
  year={2015}
}

@inproceedings{dehghan2021arkitscenes,
  title={{ARK}itScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile {RGB}-D Data},
  author={Gilad Baruch and Zhuoyuan Chen and Afshin Dehghan and Tal Dimry and Yuri Feigin and Peter Fu and Thomas Gebauer and Brandon Joffe and Daniel Kurz and Arik Schwartz and Elad Shulman},
  booktitle={NeurIPS Datasets and Benchmarks Track (Round 1)},
  year={2021},
}

@inproceedings{hypersim,
  author = {Mike Roberts and Jason Ramapuram and Anurag Ranjan and Atulit Kumar and Miguel Angel Bautista and Nathan Paczan and Russ Webb and Joshua M. Susskind},
  title = {{Hypersim}: {A} Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding},
  booktitle = {ICCV},
  year = {2021},
}

@article{objectron2021,
  title={Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations},
  author={Ahmadyan, Adel and Zhang, Liangkai and Ablavatski, Artsiom and Wei, Jianing and Grundmann, Matthias},
  journal={CVPR},
  year={2021},
}
```