2D-supervised BA2-Det (Ours)
With the rapid development of large models, the need for data has become increasingly pressing. In 3D object detection in particular, costly manual annotation has hindered further progress. To reduce the annotation burden, we study the problem of achieving 3D object detection based solely on 2D annotations. Thanks to advanced 3D reconstruction techniques, it is now feasible to reconstruct the overall static 3D scene. However, extracting precise object-level annotations from the entire scene and generalizing these limited annotations to the entire scene remain challenging. In this paper, we introduce a novel paradigm called BA2-Det, encompassing pseudo label generation and multi-stage generalization. We devise the DoubleClustering algorithm to obtain object clusters from reconstructed scene-level points, and further enhance the model's detection capabilities by developing three stages of generalization: from complete to partial, from static to dynamic, and from close to distant. Experiments on the large-scale Waymo Open Dataset show that the performance of BA2-Det is on par with fully supervised methods trained with 10% of the annotations. Additionally, when large-scale raw videos are used for pretraining, BA2-Det achieves a 20% relative improvement on the KITTI dataset. The method also has great potential for detecting open-set 3D objects in complex scenes.
Pipeline of BA2-Det. Top: the reconstruction-based pseudo label generation process. We cluster the object point clouds from the reconstructed scene and fit a tight bounding box to each cluster as the pseudo label. Bottom: the three stages of network generalization. The neural networks inside the red rounded rectangles are also used at inference.
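The box-fitting step in the top row can be illustrated with a minimal sketch. It is not the procedure used in BA2-Det, whose details are not given here; it assumes one already-clustered object and uses a common heuristic, a brute-force search over yaw angles for the minimum-area rectangle in bird's-eye view. The function name `fit_tight_box` and the parameter `num_angles` are hypothetical, and pure Python is used only to keep the example dependency-free.

```python
import math


def fit_tight_box(points, num_angles=90):
    """Fit a yaw-oriented 3D box to one object cluster (illustrative heuristic).

    points: iterable of (x, y, z) tuples belonging to a single object.
    Returns a dict with center (x, y, z), size (length, width, height), yaw.
    """
    pts = list(points)
    if not pts:
        raise ValueError("cluster is empty")

    # Height comes directly from the vertical extent of the cluster.
    zs = [p[2] for p in pts]
    z_min, z_max = min(zs), max(zs)

    best = None
    for i in range(num_angles):
        # A rectangle repeats every 90 degrees, so [0, pi/2) is enough.
        yaw = (math.pi / 2) * i / num_angles
        c, s = math.cos(yaw), math.sin(yaw)
        # Express the points in the candidate box frame.
        us = [c * x + s * y for x, y, _ in pts]
        vs = [-s * x + c * y for x, y, _ in pts]
        du, dv = max(us) - min(us), max(vs) - min(vs)
        area = du * dv
        if best is None or area < best[0]:
            cu = (max(us) + min(us)) / 2
            cv = (max(vs) + min(vs)) / 2
            best = (area, yaw, du, dv, cu, cv)

    _, yaw, du, dv, cu, cv = best
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate the box center back into the world frame.
    cx, cy = c * cu - s * cv, s * cu + c * cv

    # Report the longer horizontal side as the length.
    if dv > du:
        du, dv = dv, du
        yaw += math.pi / 2

    return {
        "center": (cx, cy, (z_min + z_max) / 2),
        "size": (du, dv, z_max - z_min),
        "yaw": yaw,
    }
```

Calling `fit_tight_box(cluster_points)` on the points of one cluster returns the box parameters that would serve as that object's pseudo label. A tight fit of this kind only bounds the points that were actually reconstructed, so partially observed objects yield boxes that are too small.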
We show qualitative results for BA-Det (trained with 10% labeled videos), the baseline method, and the proposed BA2-Det. Our method achieves performance comparable to the fully supervised BA-Det, and is even better for some nearby objects. Compared with the baseline, the most obvious difference is that our recall is much higher, mainly due to the iterative self-retraining design. The illustrations also show a typical failure case of BA2-Det: at a distance of about 75 m, there are some false positives. This is because the 3D pseudo labels span 0-200 m, which somewhat affects the training process. If the annotations included some farther objects, this problem might be alleviated.
We train a single-frame 3D object detector, so an object's 3D position is predicted independently of the ego vehicle's motion. In addition, because BA-Det is an object-centric temporal aggregation method, it is also independent of ego-motion.
The pseudo 3D labels generated by BA2-Det supervise a 2D object tracker, called BA2-Track, allowing it to learn 3D representations of objects and to better distinguish and match them in 3D space.
Here, we show that BA2-Det can detect open-set 3D objects in complex scenes using SAM in place of 2D ground truth. We click on the objects to generate their 2D masks with SAM.