【原】利用基于NVIDIA CUDA的点云库（PCL）加速激光雷达点云技术

点云PCL 2023-10-23 发布于上海

展开全文

前言

在这篇文章将介绍如何使用CUDA-PCL处理点云来获得最佳性能，由于PCL无法充分利用Jetson上的CUDA，NVIDIA开发了一些具有与PCL相同功能的基于CUDA的库。代码地址：https://github.com/NVIDIA-AI-IOT/cuPCL.git（只有动态库和头文件，作者说源码将在未来开源）。

cuPCL包含一些用于使用CUDA处理点云的库，以及用于它们的使用示例。项目中有几个子文件夹，每个子文件夹都包含：由CUDA实现的库以及库用法并通过将其输出与PCL的输出进行比较来检查性能和准确性的示例代码，该库支持Xavier、Orin和Linux x86。

作者介绍

Lei Fan，NVIDIA的高级CUDA软件工程师。他目前与NVIDIA的中国技术支持工程师团队合作，开发通过CUDA优化软件性能的解决方案。

Lily Li，正在NVIDIA的机器人团队担任开发者关系工作。她目前致力于开发Jetson生态系统中的机器人解决方案，以帮助创建最佳实践。

主要内容

许多Jetson用户选择激光雷达用于定位和感知的主要传感器，激光雷达将车辆周围的空间环境描述为一组三维点，称为点云，点云对周围对象的表面进行采样，具有远距离和高精度的特点，非常适合用于高级障碍物感知、地图制作、定位和规划算法。

在这篇文章中介绍了CUDA-PCL 1.0，其这里主要介绍三个CUDA加速的PCL库：

1.CUDA-ICP

2.CUDA-Segmentation

3.CUDA-Filter

CUDA-ICP

在迭代最近点（ICP）中，一个点云目标或参考点云被固定，而源点被变换以最佳匹配参考点。该算法迭代地计算所需的变换矩阵，以最小化误差度量，这通常是来自源点到参考点云的距离，例如匹配对坐标之间的平方差之和。ICP是在给定刚性变换所需的初始猜测位姿的情况下对齐三维模型的广泛使用的算法之一。ICP的优点包括高精度匹配结果，对不同初始化具有强鲁棒性等。然而它消耗大量计算资源。为了改进Jetson上的ICP性能，NVIDIA发布了基于CUDA的ICP，它可以替代点云库（PCL）中的原始ICP版本。以下代码示例是CUDA-ICP示例。可以实例化该类，然后直接执行cudaICP.icp()。

cudaICP icpTest(nPCountM, nQCountM, stream);    icpTest.icp(cloud_source, nPCount,            float *cloud_target, int nQCount,            int Maxiterate, double threshold,            Eigen::Matrix4f &transformation_matrix, stream);

ICP计算两个点云之间的变换矩阵：

source(P)* transformation =target(Q)

因为激光雷达提供了具有固定数量的点云，我们可以得到最大点数。nPCountM和nQCountM都用于为ICP分配缓存。

class cudaICP{public:    /*       nPCountM and nQCountM are the maximum of count for input clouds.       They are used to pre-allocate memory.    */    cudaICP(int nPCountM, int nQCountM, cudaStream_t stream = 0);    ~cudaICP(void);    /*    cloud_target = transformation_matrix *cloud_source    When the Epsilon of the transformation_matrix is less than threshold,    the function returns transformation_matrix.    Input:        cloud_source, cloud_target: Data pointer for the point cloud.        nPCount: Point number of the cloud_source.        nQCount: Point number of the cloud_target.        Maxiterate: Threshold for iterations.        threshold: When the Epsilon of the transformation_matrix is less than            threshold, the function returns transformation_matrix.    Output:        transformation_matrix    */    void icp(float *cloud_source, int nPCount,            float *cloud_target, int nQCount,            int Maxiterate, double threshold,            Eigen::Matrix4f &transformation_matrix,            cudaStream_t stream = 0);    void *m_handle = NULL;};

表2展示了CUDA-ICP与PCL-ICP的性能对比结果

在ICP之前两帧点云的状态

在ICP之后两帧点云的状态

CUDA-Segmentation

点云地图包含许多地面点，这不仅使整个地图看起来凌乱，还给后续障碍点云的分类、识别和跟踪带来了麻烦，因此需要首先将其删除。通过点云分割可以实现去除地面。该库使用随机抽样一致性（Ransac）拟合和非线性优化来实现这一目标。以下是CUDA-Segmentation的示例代码。首先实例化该类，初始化参数，然后直接调用 cudaSeg.segment 来执行地面去除。

//Now Just support: SAC_RANSAC + SACMODEL_PLANE  std::vectorindexV;  cudaSegmentation cudaSeg(SACMODEL_PLANE, SAC_RANSAC, stream);  segParam_t setP;  setP.distanceThreshold = 0.01;  setP.maxIterations = 50;  setP.probability = 0.99;  setP.optimizeCoefficients = true;  cudaSeg.set(setP);  cudaSeg.segment(input, nCount, index, modelCoefficients);  for(int i = 0; i < nCount; i++)  {    if(index[i] == 1)    indexV.push_back(i);  }

CUDA-Segmentation对具有nCount点数的输入进行分割，使用一些参数，index是输入的索引，代表目标平面，而modelCoefficients是平面的系数组。

typedef struct {  double distanceThreshold;  int maxIterations;  double probability;  bool optimizeCoefficients;} segParam_t;class cudaSegmentation{public:    //Now Just support: SAC_RANSAC + SACMODEL_PLANE    cudaSegmentation(int ModelType, int MethodType, cudaStream_t stream = 0);    ~cudaSegmentation(void);    /*    Input:        cloud_in: Data pointer for point cloud        nCount: Count of points in cloud_in    Output:        Index: Data pointer that has the index of points in a plane from input      modelCoefficients: Data pointer that has the group of coefficients of the plane    */    int set(segParam_t param);    void segment(float *cloud_in, int nCount,            int *index, float *modelCoefficients);private:    void *m_handle = NULL;};

表3.展示了CUDA-Segmentation与PCL-Segmentation的性能对比。

图3和图4显示了原始点云数据，然后是仅保留障碍相关点云的处理版本。这个示例在点云处理中很典型，包括去除地面，删除一些点云和提取特征，以及对一些点云进行聚类。

图3. CUDA-Segmentation的原始点云

图4. 由CUDA-Segmentation处理的点云

CUDA-Filter

在点云进行分割、检测、识别等处理之前，滤波是最重要的预处理操作之一。通过滤波可以实现对点云的坐标约束，直接过滤点云的X、Y和Z轴，点云过滤可以仅对Z轴或三个坐标轴X、Y和Z进行约束。CUDA-Filter目前仅支持PassThrough，但以后将支持更多的方法。以下是CUDA-Filter示例的代码示例，创建该类的实例，初始化参数，然后直接调用cudaFilter.filter函数。

cudaFilter filterTest(stream);  FilterParam_t setP;  FilterType_t type = PASSTHROUGH;  setP.type = type;  setP.dim = 2;  setP.upFilterLimits = 1.0;  setP.downFilterLimits = 0.0;  setP.limitsNegative = false;  filterTest.set(setP);  filterTest.filter(output, &countLeft, input, nCount);

CUDA-Filter使用参数筛选输入的nCount个点，然后通过CUDA过滤后，输出具有countLeft个点的结果。

typedef struct {    FilterType_t type;    //0=x,1=y,2=z    int dim;    float upFilterLimits;    float downFilterLimits;    bool limitsNegative;} FilterParam_t;class cudaFilter{public:    cudaFilter(cudaStream_t stream = 0);    ~cudaFilter(void);    int set(FilterParam_t param);    /*    Input:        source: data pointer for point cloud        nCount: count of points in cloud_in    Output:        output: data pointer which has points filtered by CUDA        countLeft: count of points in output    */    int filter(void *output, unsigned int *countLeft, void *source, unsigned int nCount);    void *m_handle = NULL;};

表4.展示了CUDA-Filter与PCL-Filter性能比较。

图5和图6显示了通过在X轴上进行约束的PassThrough滤波器示例。

图6. 原始点云。

图6. 通过在X轴上进行约束滤波的点云。

其他模块对比

VoxelGrid

cuOctree

cuCluster

cuNDT