[Paopao One Minute] Learning 2D-to-3D Lifting for 3D Object Detection in Autonomous Driving

InfoRich 2021-07-29


One minute a day to read through papers from the top robotics conferences

Title: Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

Authors: Siddharth Srivastava, Frederic Jurie and Gaurav Sharma

Source: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Compiled by: Wang Jingqi (王靖淇)

Reviewed by: Wang Jingqi (王靖淇), Chai Yi (柴毅)

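This is the 770th article published by Paopao One Minute. Individuals are welcome to share it on WeChat Moments; other organizations or media outlets that wish to repost it should leave a message in the account backend to request authorization.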

Summary

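The paper addresses 3D object detection from 2D monocular images in autonomous driving scenarios. It proposes lifting the 2D images to 3D representations with learned neural networks, then reusing existing networks that operate directly on 3D data to perform 3D object detection and localization. With a carefully designed training mechanism and automatically selected, minimally noisy data, the approach proves not only feasible but also gives better results than many methods that take actual 3D input from physical sensors. On the challenging KITTI benchmark, the authors show that the 2D-to-3D lifting method outperforms many recent competitive 3D networks and significantly outperforms the previous state of the art for 3D detection from monocular images. They also show that late fusion of the output of the network trained on generated 3D images with that of the network trained on real 3D images improves performance. The authors find the results very interesting and argue that the method could serve as a highly reliable backup when expensive 3D sensors malfunction, if not potentially make those sensors redundant, at least in autonomous navigation scenarios with a low risk of human injury.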
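The late fusion step mentioned above can be illustrated with a minimal sketch. The summary does not say which fusion rule the authors use, so the following is only one common option and an assumption on our part: pool the detections from both networks and run greedy non-maximum suppression. The names `iou` and `late_fuse`, the axis-aligned BEV box format and the 0.5 threshold are all hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def late_fuse(dets_generated, dets_real, iou_thresh=0.5):
    """Pool (box, score) detections from both networks, then apply greedy NMS.

    Assumption: this pooling-plus-NMS rule is illustrative only; the paper's
    actual fusion rule is not described in this summary.
    """
    pooled = sorted(dets_generated + dets_real, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in pooled:
        # Keep a detection only if it does not overlap a higher-scoring one.
        if all(iou(box, kept_box) < iou_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Detections from the network trained on generated BEV images (made-up values).
dets_generated = [((0.0, 0.0, 2.0, 2.0), 0.6), ((10.0, 10.0, 12.0, 12.0), 0.7)]
# Detections from the network trained on real LiDAR BEV images (made-up values).
dets_real = [((0.1, 0.0, 2.1, 2.0), 0.9)]

fused = late_fuse(dets_generated, dets_real)
```

In this toy input the two overlapping boxes collapse into the higher-scoring one, while the box found by only one network is retained, which is the intuition behind fusing the two outputs.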

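Figure 1: The goal is 3D object detection from 2D monocular images, by generating (i) 3D representations using state-of-the-art generative adversarial networks (GANs) and (ii) 3D data for ground plane estimation using recent 3D networks. The authors show that competitive 3D object detection is achievable without actual 3D data at test time.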

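Figure 2: Pipeline with BirdNet. A GAN-based generator translates the 2D RGB image into a bird's-eye view (BEV) image compatible with the BirdNet architecture. An RGB-to-3D network then provides the 3D information used for ground plane estimation. Finally, the BEV detections are converted to 3D detections using the ground plane estimate.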
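The last step of the BirdNet pipeline, turning a BEV detection into a 3D box with the help of a ground plane, can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes a LiDAR-style frame with z pointing up, a plane given as coefficients (a, b, c, d) of a*x + b*y + c*z + d = 0, and a fixed object height. `Box3D`, `ground_height`, `lift_bev_box` and `assumed_height` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    x: float       # box center, forward axis
    y: float       # box center, lateral axis
    z: float       # box center, vertical axis
    length: float
    width: float
    height: float
    yaw: float     # rotation around the vertical axis, in radians


def ground_height(plane, x, y):
    """Height z of the plane a*x + b*y + c*z + d = 0 at position (x, y)."""
    a, b, c, d = plane
    if abs(c) < 1e-9:
        raise ValueError("plane is vertical; ground height is undefined")
    return -(a * x + b * y + d) / c


def lift_bev_box(bev_box, plane, assumed_height=1.5):
    """Place a BEV box (x, y, length, width, yaw) on the estimated ground.

    Assumption: the object height is a fixed prior here; a real system may
    predict it or use a per-class value.
    """
    x, y, length, width, yaw = bev_box
    z_ground = ground_height(plane, x, y)
    return Box3D(x, y, z_ground + assumed_height / 2.0,
                 length, width, assumed_height, yaw)


# Flat ground 1.7 units below the sensor origin (illustrative values only).
plane = (0.0, 0.0, 1.0, 1.7)
box = lift_bev_box((10.0, 2.0, 4.0, 1.8, 0.0), plane)
```

The BEV image fixes the box footprint and orientation, so the ground plane is only needed to recover the missing vertical position.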

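Figure 3: Pipeline with MV3D. Here, multiple GAN-based generators independently translate the 2D RGB image into the different channels of the MV3D-compatible BEV image. An auxiliary network additionally generates the front view (FV) image from the RGB image. All three images (RGB, FV and BEV) are then fed to the MV3D architecture for 3D prediction.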

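Figure 4: The RGB image shows only the front view, whereas the point cloud from the top-mounted LiDAR also contains data from the back and sides of the car. The LiDAR point cloud is pruned so that only the information present in both modalities remains. Faraway BEV points are pruned as well because they are highly occluded in the RGB image, which may lose some objects, such as the one highlighted by the red arrow.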
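The pruning described in Figure 4 can be approximated with a simple geometric filter. The sketch below assumes a LiDAR frame with x pointing forward and y to the left, and uses a 90 degree horizontal field of view and a 40 m range limit purely as placeholder values; the summary does not give the thresholds the authors use. `prune_points` and its parameters are hypothetical.

```python
import math


def prune_points(points, horizontal_fov_deg=90.0, max_range=40.0):
    """Keep LiDAR points roughly visible to a forward-facing camera.

    points: iterable of (x, y, z) with x forward and y to the left.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    half_fov = math.radians(horizontal_fov_deg) / 2.0
    kept = []
    for x, y, z in points:
        if x <= 0.0:
            continue  # behind the camera
        if x > max_range:
            continue  # too far ahead: heavily occluded in the RGB image
        if abs(math.atan2(y, x)) > half_fov:
            continue  # outside the camera's horizontal field of view
        kept.append((x, y, z))
    return kept


cloud = [
    (10.0, 0.0, 0.0),   # straight ahead: kept
    (-5.0, 0.0, 0.0),   # behind the car: dropped
    (10.0, 20.0, 0.0),  # far to the side: dropped
    (60.0, 0.0, 0.0),   # beyond the range limit: dropped
    (10.0, 5.0, 0.0),   # slightly to the left: kept
]
visible = prune_points(cloud)
```

A filter based on the actual camera projection matrix would be more precise; the angular test here is only a stand-in for that projection check.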

Figure 5: Qualitative results of the proposed method compared with other methods.

Abstract

We address the problem of 3D object detection from 2D monocular images in autonomous driving scenarios. We propose to lift the 2D images to 3D representations using learned neural networks and leverage existing networks working directly on 3D to perform 3D object detection and localization. We show that, with carefully designed training mechanism and automatically selected minimally noisy data, such a method is not only feasible, but gives higher results than many methods working on actual 3D inputs acquired from physical sensors. On the challenging KITTI benchmark, we show that our 2D to 3D lifted method outperforms many recent competitive 3D networks while significantly outperforming previous state of the art for 3D detection from monocular images. We also show that a late fusion of the output of the network trained on generated 3D images, with that trained on real 3D images, improves performance. We find the results very interesting and argue that such a method could serve as a highly reliable backup in case of malfunction of expensive 3D sensors, if not potentially making them redundant, at least in the case of low human injury risk autonomous navigation scenarios like warehouse automation.