We use CUDA 12.1.1. Run
```
conda create -n ovmono3d python=3.8.20
conda activate ovmono3d
pip install torch==2.4.1 torchvision==0.19.1 --index-url https://download.pytorch.org/whl/cu121
```

to create the environment and install PyTorch.
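To confirm the environment picked up the intended CUDA 12.1 build, a quick check along these lines (run inside the ovmono3d environment) can help:

```python
# Sanity check: versions and GPU visibility for the freshly created env.
import torch
import torchvision

print(torch.__version__, torchvision.__version__)  # expect 2.4.1 / 0.19.1 (cu121 build)
print(torch.version.cuda)                          # expect "12.1"
print(torch.cuda.is_available())                   # True if a CUDA GPU is visible
```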
Run
```
sh setup.sh
```

to install additional dependencies and download the model checkpoints of OVMono3D-LIFT and other foundation models.
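To verify the download, you can check for the checkpoint at the path used elsewhere in this README (a minimal sketch; setup.sh may download additional files not checked here):

```python
# Check that the OVMono3D-LIFT checkpoint downloaded by setup.sh is in place.
from pathlib import Path

ckpt = Path("checkpoints/ovmono3d_lift.pth")
print(ckpt.exists(), ckpt.stat().st_size if ckpt.exists() else "missing")
```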
Run
```
python demo/demo.py --config-file configs/OVMono3D_dinov2_SFP.yaml \
    --input-folder datasets/coco_examples \
    --labels-file datasets/coco_examples/labels.json \
    --threshold 0.45 \
    MODEL.ROI_HEADS.NAME ROIHeads3DGDINO \
    MODEL.WEIGHTS checkpoints/ovmono3d_lift.pth \
    INPUT.USE_DEPTH True \
    OUTPUT_DIR output/coco_examples
```

to get the results for the example COCO images. The depth maps and estimated camera intrinsics for these example images are pre-generated as .npz files alongside each .jpg; demo.py loads them automatically when present.
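If you want to peek at what one of these pre-generated files contains, a minimal inspection sketch (it assumes only that the .npz archives sit next to the example images; the key names inside are whatever demo.py expects):

```python
# List the arrays stored in one of the pre-generated depth/intrinsics .npz files.
from pathlib import Path
import numpy as np

npz_files = sorted(Path("datasets/coco_examples").glob("*.npz"))
arc = np.load(npz_files[0])
for key in arc.files:
    print(key, arc[key].shape, arc[key].dtype)
```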
You can also try your own images and prompted category labels. See labels.json for the expected label-file format. If you know the camera intrinsics, you can pass them as arguments via --focal-length <float> and --principal-point <float> <float>. Check demo.py for more details.
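For reference, the two flags describe a standard pinhole intrinsics matrix; a small sketch of the correspondence (assuming a single focal length shared by both axes, which matches the flag interface):

```python
# Pinhole intrinsics implied by --focal-length f and --principal-point px py.
import numpy as np

def intrinsics_matrix(f: float, px: float, py: float) -> np.ndarray:
    return np.array([[f,   0.0, px],
                     [0.0, f,   py],
                     [0.0, 0.0, 1.0]])

print(intrinsics_matrix(721.5, 320.0, 240.0))
```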
Note: results on in-the-wild images may be lower quality due to inaccurate estimated intrinsics and depth.
Please follow the instructions in Omni3D to set up the datasets.
Run
```
sh ./download_data.sh
```

to download our pre-processed OVMono3D 2D predictions.
The training and evaluation scripts expect per-image depth maps generated by an off-the-shelf depth predictor.
UniDepth (used for SUNRGBD / Hypersim / ARKitScenes / Objectron / KITTI / nuScenes): set up a separate environment per UniDepth's official instructions, then run

```
python tools/unidepth_script.py --dataset SUNRGBD --split train
# repeat for each (dataset, split) you need
```

Metric3D (used for Cityscapes3D): set up a separate environment per Metric3D's official instructions, then run

```
python tools/metric3d_script.py --dataset Cityscapes3D --split test
```

The depth-augmented JSONs and .npy files are written under datasets/Omni3D_unidepth/, datasets/Omni3D_metric3d/, datasets/pseudo_unidepth/, and datasets/pseudo_metric3d/. Both unidepth_script.py and metric3d_script.py use separate Python environments from the main OVMono3D environment to avoid dependency conflicts.
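After the scripts finish, a quick sanity check over a few generated depth maps can catch path or dtype issues early (a sketch; it only assumes .npy files appear somewhere under the output roots listed above):

```python
# Print shape, dtype, and value range for a handful of generated depth maps.
from pathlib import Path
import numpy as np

root = Path("datasets/Omni3D_unidepth")
for npy_path in sorted(root.rglob("*.npy"))[:5]:
    depth = np.load(npy_path)
    print(npy_path, depth.shape, depth.dtype, float(depth.min()), float(depth.max()))
```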
The OVMono3D-LIFT pre-trained checkpoint is downloaded by setup.sh to checkpoints/ovmono3d_lift.pth.
Evaluate base categories (model's own 2D head):
```
bash scripts/eval_base.sh
```

Evaluate novel categories (Grounding DINO oracle 2D, target-aware metric):

```
bash scripts/eval_novel.sh
```

Evaluate novel categories with the previous-metric oracle:

```
bash scripts/eval_novel_prev.sh
```

Evaluate Cityscapes3D (requires Metric3D depth pre-generation, see Data above):

```
bash scripts/eval_cityscapes.sh
```

TEST.CAT_MODE controls the category set: novel, base, or all. DATASETS.ORACLE2D_PROMPT selects the 2D prompt source: gdino (target-aware) or gdino_previous_metric.
To run inference and evaluation of OVMono3D-GEO, use the following commands:
```
python tools/ovmono3d_geo.py
python tools/eval_ovmono3d_geo.py
```

To train OVMono3D-LIFT from scratch:

```
bash scripts/dino_bs64_unidepth_omni3d.sh
```

The script trains with INPUT.USE_DEPTH=True and UniDepth-augmented annotations (under datasets/Omni3D_unidepth/, generated by tools/unidepth_script.py; see Data above). It requires 8 GPUs by default; adjust --num-gpus and SOLVER.IMS_PER_BATCH for fewer.
These are the training hyperparameters used in our experiments. They can be customized to suit your requirements, but note that performance may vary across configurations.
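As a rough illustration of the batch-size adjustment mentioned above (the 64-image total comes from the script name; scaling linearly with GPU count is a common rule of thumb, not a repo requirement):

```python
# Scale the total batch size with GPU count, keeping per-GPU load constant.
def scaled_batch_size(total_batch: int, ref_gpus: int, gpus: int) -> int:
    per_gpu = total_batch // ref_gpus      # e.g. 64 images over 8 GPUs -> 8 per GPU
    return per_gpu * gpus

print(scaled_batch_size(64, 8, 4))  # -> 32 for a 4-GPU run
```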
If you find this work useful for your research, please kindly cite:
```
@article{yao2024open,
  title={Open Vocabulary Monocular 3D Object Detection},
  author={Yao, Jin and Gu, Hao and Chen, Xuweiyi and Wang, Jiayun and Cheng, Zezhou},
  journal={arXiv preprint arXiv:2411.16833},
  year={2024}
}
```

Please also consider citing the awesome work of Omni3D and the datasets used in Omni3D.
BibTeX
```
@inproceedings{brazil2023omni3d,
  author = {Garrick Brazil and Abhinav Kumar and Julian Straub and Nikhila Ravi and Justin Johnson and Georgia Gkioxari},
  title = {{Omni3D}: A Large Benchmark and Model for {3D} Object Detection in the Wild},
  booktitle = {CVPR},
  address = {Vancouver, Canada},
  month = {June},
  year = {2023},
  organization = {IEEE},
}

@inproceedings{Geiger2012CVPR,
  author = {Andreas Geiger and Philip Lenz and Raquel Urtasun},
  title = {Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite},
  booktitle = {CVPR},
  year = {2012}
}

@inproceedings{caesar2020nuscenes,
  title={nuScenes: A multimodal dataset for autonomous driving},
  author={Caesar, Holger and Bankiti, Varun and Lang, Alex H and Vora, Sourabh and Liong, Venice Erin and Xu, Qiang and Krishnan, Anush and Pan, Yu and Baldan, Giancarlo and Beijbom, Oscar},
  booktitle={CVPR},
  year={2020}
}

@inproceedings{song2015sun,
  title={{SUN RGB-D}: A {RGB-D} scene understanding benchmark suite},
  author={Song, Shuran and Lichtenberg, Samuel P and Xiao, Jianxiong},
  booktitle={CVPR},
  year={2015}
}

@inproceedings{dehghan2021arkitscenes,
  title={{ARK}itScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile {RGB}-D Data},
  author={Gilad Baruch and Zhuoyuan Chen and Afshin Dehghan and Tal Dimry and Yuri Feigin and Peter Fu and Thomas Gebauer and Brandon Joffe and Daniel Kurz and Arik Schwartz and Elad Shulman},
  booktitle={NeurIPS Datasets and Benchmarks Track (Round 1)},
  year={2021},
}

@inproceedings{hypersim,
  author = {Mike Roberts and Jason Ramapuram and Anurag Ranjan and Atulit Kumar and Miguel Angel Bautista and Nathan Paczan and Russ Webb and Joshua M. Susskind},
  title = {{Hypersim}: {A} Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding},
  booktitle = {ICCV},
  year = {2021},
}

@article{objectron2021,
  title={Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations},
  author={Ahmadyan, Adel and Zhang, Liangkai and Ablavatski, Artsiom and Wei, Jianing and Grundmann, Matthias},
  journal={CVPR},
  year={2021},
}
```