2D-supervised BA2-Det (Ours)
TL;DR: We propose a novel 2D-supervised monocular 3D object detection paradigm, leveraging the idea of global (scene-level) to local (instance-level) 3D reconstruction.
With the advent of the big-model era, the demand for data has grown ever more pressing. In monocular 3D object detection especially, expensive manual annotations potentially limit further development. Existing works have investigated weakly supervised algorithms that rely on the LiDAR modality to generate 3D pseudo labels, and therefore cannot be applied to ordinary videos. In this paper, we propose a novel paradigm, termed BA2-Det, that leverages global-to-local 3D reconstruction for 2D-supervised monocular 3D object detection. Specifically, we recover 3D structure from monocular videos via scene-level global reconstruction with global bundle adjustment (BA) and obtain object clusters with the DoubleClustering algorithm. Learning from objects that are completely reconstructed in global BA, the GBA-Learner predicts pseudo labels for occluded objects. Finally, we train an LBA-Learner with object-centric local BA to generalize the generated 3D pseudo labels to moving objects. Experiments on the large-scale Waymo Open Dataset show that BA2-Det performs on par with the fully supervised BA-Det trained on 10% of the videos, and even outperforms some pioneering fully supervised methods. We also show the great potential of BA2-Det for detecting open-set 3D objects in complex scenes. The code will be made available.
Pipeline of BA2-Det. The video sequence is taken as input. The Global BA stage generates 3D pseudo labels from scene-level global reconstruction, using the DoubleClustering algorithm and the GBA-Learner. The labels are then passed to the Local BA stage, which learns a monocular 3D object detector in an iterative manner.
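The exact formulation of the DoubleClustering algorithm is not given on this page, but the underlying idea of grouping scene-level reconstructed points into instance clusters can be sketched with a simple single-linkage, distance-threshold clustering. This is a hypothetical stand-in for illustration only, not the paper's algorithm; `eps` and `min_pts` are assumed parameters:

```python
import numpy as np

def cluster_points(points, eps=1.0, min_pts=3):
    """Single-linkage clustering of reconstructed 3D points by a distance
    threshold (illustrative stand-in, not the paper's DoubleClustering).
    Returns one integer label per point; -2 marks noise."""
    labels = -np.ones(len(points), dtype=int)
    cur = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        stack, members = [i], []
        labels[i] = cur
        while stack:
            j = stack.pop()
            members.append(j)
            # absorb all yet-unlabeled points within eps of point j
            near = np.linalg.norm(points - points[j], axis=1) < eps
            for k in np.where(near & (labels == -1))[0]:
                labels[k] = cur
                stack.append(int(k))
        if len(members) < min_pts:
            labels[np.array(members)] = -2  # too few points: treat as noise
        else:
            cur += 1
    return labels
```

Each resulting cluster would correspond to one object candidate whose points came from the global reconstruction.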
We show qualitative results of BA-Det (trained with 10% of the labeled videos), the baseline method, and the proposed BA2-Det. Our method achieves performance comparable to the fully supervised BA-Det, and is even better in some near-range cases. Compared with the baseline, our recall is markedly higher, mainly due to the iterative self-retraining design. The illustrations also show a typical failure case of BA2-Det: at a distance of about 75 m, there are some false positives. This is because the 3D pseudo labels can span 0-200 m, which somewhat affects the training process. If the annotations included some farther objects, this problem might be alleviated.
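The iterative self-retraining mentioned above follows a standard pattern: train on the current pseudo labels, re-run the detector, and keep only confident predictions as the next round's labels. The following is a minimal generic sketch; `train_fn`, `predict_fn`, and the score threshold are hypothetical placeholders, not BA2-Det's actual training API:

```python
def self_retrain(train_fn, predict_fn, videos, labels, rounds=3, conf_thr=0.7):
    """Generic iterative self-retraining loop (hypothetical helper names).
    Each round: fit on pseudo labels, predict, and filter by confidence."""
    model = None
    for _ in range(rounds):
        model = train_fn(videos, labels)      # fit detector on current pseudo labels
        preds = predict_fn(model, videos)     # per-frame predictions with scores
        # keep only confident predictions as the next round's pseudo labels
        labels = [[p for p in frame if p["score"] >= conf_thr] for frame in preds]
    return model, labels
```

Re-training on its own confident predictions is what lets the detector recover objects the initial pseudo labels missed, improving recall over the non-iterative baseline.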
We train a single-frame 3D object detector, so the object's 3D position can be predicted independently of the ego vehicle's motion. Besides, since BA-Det is an object-centric temporal aggregation method, it is also independent of ego-motion.
The 3D pseudo labels generated by BA2-Det provide supervision for a 2D object tracker, called BA2-Track, allowing the tracker to learn 3D representations of objects and thus better distinguish and match them in 3D space.
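Matching objects in 3D space typically means associating detections across frames by the distance between their 3D centers. The following greedy nearest-center association is a simple stand-in for illustration; the actual matching used by BA2-Track is not specified here, and `max_dist` is an assumed gating threshold:

```python
import numpy as np

def match_3d(prev_centers, cur_centers, max_dist=2.0):
    """Greedily associate previous and current 3D object centers by
    ascending pairwise distance, gated at max_dist (illustrative only)."""
    cost = np.linalg.norm(prev_centers[:, None] - cur_centers[None], axis=-1)
    pairs, used_i, used_j = [], set(), set()
    # visit candidate pairs from smallest to largest center distance
    for idx in np.argsort(cost, axis=None):
        i, j = np.unravel_index(idx, cost.shape)
        i, j = int(i), int(j)
        if i in used_i or j in used_j or cost[i, j] > max_dist:
            continue
        pairs.append((i, j))
        used_i.add(i)
        used_j.add(j)
    return pairs
```

A 3D cost like this is more robust than 2D box overlap when objects occlude each other in the image but are well separated in depth.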
Here, we show the ability to detect open-set 3D objects in complex scenes using SAM instead of 2D ground truth: we click on objects to generate their 2D masks with SAM.
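To feed a click-prompted SAM mask into a 2D-supervised pipeline, the binary mask must be reduced to an axis-aligned 2D box. A minimal sketch of that conversion (the box convention `(x_min, y_min, x_max, y_max)` is an assumption):

```python
import numpy as np

def mask_to_box(mask):
    """Convert a binary instance mask (e.g., produced by SAM from a click
    prompt) to an axis-aligned 2D box (x_min, y_min, x_max, y_max)."""
    ys, xs = np.nonzero(mask)  # row/column indices of foreground pixels
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

These boxes can then replace the 2D ground-truth boxes used elsewhere in the pipeline, enabling open-set objects that have no manual annotation.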