CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for
Open-vocabulary 3D Object Detection
NeurIPS 2023

Framework

Figure 1: Overview of the proposed CoDA. We adopt 3DETR as the base 3D object detection framework, represented by the 'Encoder' and 'Decoder' networks. The object queries and the encoded point cloud features are fed into the decoder, and the updated object query features from the decoder are passed to the 3D object classification and localization heads. We first propose a 3D Novel Object Discovery (3D-NOD) strategy, which uses both 3D geometry priors from the predicted 3D boxes and 2D semantic priors from the CLIP model to discover novel objects during training. The discovered novel object boxes are kept in a novel object box label pool, which is then used by the proposed discovery-driven cross-modal alignment (DCMA). DCMA consists of a class-agnostic distillation and a class-specific contrastive alignment, both based on the discovered novel boxes. 3D-NOD and DCMA are learned collaboratively and benefit each other, achieving simultaneous novel object localization and classification in an end-to-end manner.
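The discovery step in Figure 1 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` type, the `discover_novel` function, the threshold values, and the axis-aligned IoU duplicate check are all hypothetical, and the CLIP probabilities are assumed to be precomputed for each predicted box.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    box: tuple         # (cx, cy, cz, dx, dy, dz); axis-aligned for simplicity (assumption)
    objectness: float  # 3D geometry prior: class-agnostic score of the predicted box
    clip_probs: dict   # 2D semantic prior: category name -> probability (assumed precomputed)


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[i + 3] / 2, b[i] - b[i + 3] / 2)
        hi = min(a[i] + a[i + 3] / 2, b[i] + b[i + 3] / 2)
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter
    return inter / union


def discover_novel(candidates, base_classes, known_boxes,
                   obj_thr=0.75, sem_thr=0.3, iou_thr=0.25):
    """Return (box, label) pairs to add to the novel box label pool.

    All three thresholds are illustrative values, not taken from the paper.
    """
    known = list(known_boxes)  # base annotations plus boxes already in the pool
    discovered = []
    for cand in candidates:
        label, prob = max(cand.clip_probs.items(), key=lambda kv: kv[1])
        if cand.objectness < obj_thr or prob < sem_thr:
            continue  # fails the geometry prior or the semantic prior
        if label in base_classes:
            continue  # base objects already have annotations
        if any(iou_3d(cand.box, k) > iou_thr for k in known):
            continue  # overlaps a box that is already labeled
        discovered.append((cand.box, label))
        known.append(cand.box)
    return discovered


# Usage: one base-class box, one confident novel box, one low-objectness box.
pool = []
base_boxes = [(0, 0, 0, 1, 1, 1)]
preds = [
    Candidate((0, 0, 0, 1, 1, 1), 0.9, {"chair": 0.8, "lamp": 0.1}),
    Candidate((5, 5, 0, 1, 1, 1), 0.9, {"lamp": 0.7, "chair": 0.2}),
    Candidate((9, 9, 0, 1, 1, 1), 0.2, {"lamp": 0.9, "chair": 0.1}),
]
pool.extend(discover_novel(preds, {"chair"}, base_boxes))
```

Because the pool is passed back in as part of `known_boxes` on later iterations, a box that has already been discovered is not added twice.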

Samples

Figure 2: Qualitative comparison with 3D-CLIP. Our method discovers more novel objects, which are indicated by blue boxes in the color images. It also detects more base objects, which suggests that the proposed collaborative 3D-NOD and cross-modal alignment give it stronger open-world detection capability. For clearer visualization, only objects with classification scores above 0.5 are displayed.

Abstract

Open-Vocabulary 3D Object Detection (OV-3DDet) aims to detect objects from an arbitrary list of categories within a 3D scene, which remains seldom explored in the literature. There are primarily two fundamental problems in OV-3DDet, i.e., localizing and classifying novel objects. This paper addresses the two problems simultaneously via a unified framework, under the condition of limited base categories. To localize novel 3D objects, we propose an effective 3D Novel Object Discovery strategy, which utilizes both the 3D box geometry priors and 2D semantic open-vocabulary priors to generate pseudo box labels of the novel objects. To classify novel object boxes, we further develop a cross-modal alignment module based on discovered novel boxes, to align feature spaces between 3D point cloud and image/text modalities. Specifically, the alignment process contains a class-agnostic and a class-discriminative alignment, incorporating not only the base objects with annotations but also the increasingly discovered novel objects, resulting in an iteratively enhanced alignment. The novel box discovery and cross-modal alignment are jointly learned to collaboratively benefit each other. The novel object discovery can directly impact the cross-modal alignment, while a better feature alignment can in turn boost the localization capability, leading to a unified OV-3DDet framework, named CoDA, for simultaneous novel object localization and classification. Extensive experiments on two challenging datasets (i.e., SUN-RGBD and ScanNet) demonstrate the effectiveness of our method, which improves mAP over the best-performing alternative method by 80%.
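The two alignment terms mentioned in the abstract can be illustrated with a small sketch. The loss forms here are assumptions chosen for illustration (a mean absolute difference for the class-agnostic term and a softmax cross-entropy over cosine similarities to text embeddings for the class-discriminative term); the function names and the `temperature` value are hypothetical and not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def distill_loss(feat_3d, feat_img):
    """Class-agnostic term: pull a 3D object feature toward the image
    feature of the same box, without using any category label."""
    return sum(abs(a - b) for a, b in zip(feat_3d, feat_img)) / len(feat_3d)


def contrastive_loss(feat_3d, text_feats, target, temperature=0.1):
    """Class-discriminative term: cross-entropy over cosine similarities
    between a 3D object feature and one text embedding per category."""
    f = normalize(feat_3d)
    logits = [
        sum(a * b for a, b in zip(f, normalize(t))) / temperature
        for t in text_feats
    ]
    m = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Usage: a toy 2-D feature space with two category text embeddings.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
obj_feat = [0.9, 0.1]
total = distill_loss(obj_feat, [1.0, 0.0]) + contrastive_loss(obj_feat, text_feats, target=0)
```

In this sketch, `target` would come from the ground-truth label for base objects and from the discovered pseudo label for novel boxes, so the alignment covers more categories as more novel boxes are discovered.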

BibTeX

If you find CoDA helpful, please cite:

@inproceedings{cao2023coda,
  title={CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection},
  author={Cao, Yang and Zeng, Yihan and Xu, Hang and Xu, Dan},
  booktitle={NeurIPS},
  year={2023}
}

Acknowledgements

Our code builds on several excellent repositories, including CLIP and 3DETR. We thank their authors for releasing their code.