CN113378854A - Point cloud target detection method integrating original point cloud and voxel division - Google Patents
Point cloud target detection method integrating original point cloud and voxel division
- Publication number: CN113378854A
- Application number: CN202110651776.XA
- Authority: CN (China)
- Prior art keywords: point cloud, point, voxel, layer, feature
- Prior art date: 2021-06-11
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/253: Pattern recognition; fusion techniques of extracted features
- G06N3/045: Neural networks; combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a point cloud target detection method that fuses the original point cloud with voxel division. First, local detail features and semantic features of the point cloud are extracted with the lossless feature extraction network PointNet++; a loss function is then constructed to further improve the perception of local neighborhood information by the lossless feature extraction network PointNet++. The local detail features and semantic features, extracted without information loss, are embedded into a voxel-division-based point cloud target detection network by trilinear interpolation at the voxel feature initialization stage and the sparse convolution perception stage, and finally the preset detection anchor boxes are classified and regressed by a two-dimensional RPN to obtain the final detection targets. By embedding the lossless multi-scale encoding of the point cloud into the voxel method, the invention gives the detection network multi-scale, multi-level information fusion perception capability; it integrates the two kinds of point cloud target detection methods, based on the original point cloud and on voxel division, and thus possesses both efficient point cloud perception capability and lossless feature encoding capability.
Description
Technical Field
The invention belongs to the technical field of 3D point cloud target detection, and particularly relates to a point cloud target detection method fusing original point cloud and voxel division.
Background
With the continuous upgrading of vehicle-mounted lidar technology, a vehicle-mounted lidar can quickly and conveniently acquire point cloud data of the current scene, and targets in the scene can be extracted from the geometric structure information of the scene point cloud; this technology has spread into industries such as smart city construction, automatic driving and unmanned delivery. Because laser point clouds are random and disordered and vary greatly in density and sparsity, traditional target detection algorithms that apply uniform hand-crafted feature extraction to massive point cloud data cannot adapt to the shape changes of targets in complex autonomous-driving road scenes. Therefore, point cloud target detection algorithms based on deep learning have developed rapidly and are being applied in autonomous driving scenes.
Current mainstream deep-learning-based point cloud target detection methods fall into two categories: target detection based on the original point cloud and point cloud target detection based on voxel division.
3D target detection algorithms based on the original point cloud apply no preprocessing to the scene point cloud. The coordinates of the original points and the corresponding reflectivity values are fed directly into a neural network built from multilayer perceptrons (MLPs); the point cloud scene is sampled layer by layer, from shallow to deep, with Farthest Point Sampling (FPS); local detail features and semantic features are extracted by a local point set feature extraction module (Set Abstraction); and finally the detail and semantic features are assigned to all points of the original scene through a feature propagation layer (Feature Propagation) using trilinear interpolation. This approach loses no information, but the perception capability of multilayer perceptrons for disordered point clouds is lower than that of structures built from convolutional neural networks in voxel-division-based methods.
Point cloud target detection based on voxel division partitions the scene point cloud into uniform voxel grids according to the point cloud density scanned by lidars with different numbers of beams, extracts features for each voxel with a voxel feature extraction scheme adapted to different voxel sizes, extracts semantic information from the initialized voxel scene with 3D convolution or 3D sparse convolution while gradually compressing the height dimension to one dimension, and then builds a region proposal network (RPN) with two-dimensional convolutions to classify and regress the anchor boxes preset for each grid point under the top view of the scene. In autonomous driving point cloud scenes, this kind of method can quickly and efficiently classify objects that deform little and have high point density; however, voxel division geometrically deforms the original point cloud structure, and for small objects such as pedestrians and bicycles in particular, the deformation caused by voxel division loses local detail information, so the detected classification and regression results deviate from the real targets.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a point cloud target detection method that fuses the original point cloud with voxel division. First, local detail features and semantic features of the point cloud are extracted with the lossless feature extraction network PointNet++; a loss function is then constructed to further improve the perception of local neighborhood information by the lossless feature extraction network PointNet++; the local detail features and semantic features, extracted without information loss, are then embedded by trilinear interpolation into a voxel-division-based point cloud target detection network at the voxel feature initialization stage and the sparse convolution perception stage; finally, each preset detection anchor box is classified and regressed by a two-dimensional RPN to obtain the final detection targets.
In order to achieve the aim, the technical scheme provided by the invention is a point cloud target detection method fusing original point cloud and voxel division, which comprises the following steps:
step 1, extracting local detail features and semantic features of the point cloud by using the lossless feature extraction network PointNet++;
step 1.1, constructing a multilayer encoder;
step 1.2, extracting local detail features and semantic features of each layer of the point cloud through a Set Abstraction (SA) module without information loss;
step 1.3, endowing the detail features and the semantic features extracted in the step 1.2 to all points in an original scene through a feature transfer layer by adopting trilinear interpolation;
step 2, constructing a loss function to supervise the feature extraction in step 1 and improve the perception of feature information by the lossless feature extraction network PointNet++;
step 3, embedding the local detail features and semantic features without information loss into a point cloud target detection network based on voxel division;
step 3.1, initializing voxel characteristics by using the local detail characteristics extracted in the step 1;
step 3.2, performing feature extraction on the voxel scene semantic information initialized in the step 3.1 by using 3D sparse convolution;
step 3.3, converting the semantic features obtained in the step 1 into voxel features by adopting trilinear interpolation;
step 3.4, fusing the semantic features subjected to sparse convolution sensing in the step 3.2 with the voxel features obtained by conversion in the step 3.3 by adopting an attention mechanism mode to obtain semantic information fusing two sensing modes;
step 4, projecting the semantic features fused in step 3 onto a two-dimensional top view, building a region proposal network (RPN) with two-dimensional convolutions, and classifying and regressing the detection anchor boxes preset for each pixel under the top view of the scene to obtain the final detection targets;
step 4.1, setting an RPN network structure and a predefined detection anchor frame;
and 4.2, designing a point cloud target detection loss function.
Moreover, in step 1.1, the multilayer encoder is constructed by first sampling N points from the original point cloud with a farthest point sampling (FPS) strategy as the input point cloud, and then using FPS to sample a progressively smaller number of points layer by layer from the input point cloud data, forming a 4-layer encoder in which the point cloud input to each layer is the point set output by the previous layer.
Moreover, the input of each SA module layer in step 1.2 is the fixed-size point set obtained by FPS sampling in the previous layer. Let p_i be the i-th point obtained by FPS sampling in the current layer and N(p_i) the set of points of the previous layer lying inside a sphere neighborhood of radius r centered on p_i. The output feature of point p_i is computed in the following steps:

Step 1.2.1, randomly sample a fixed number of points from the sphere neighborhood N(p_i);

Step 1.2.2, perform feature fusion extraction on the points sampled in step 1.2.1 through a multilayer perceptron, with the calculation formula

f(p_i) = max_{p_j ∈ N(p_i)} MLP(p_j)

where MLP denotes the high-dimensional mapping of the point features by the multilayer perceptron, max() denotes taking the maximum value over the feature dimension of the point set, and f(p_i) is the output feature of point p_i;
and 1.2.3, repeatedly carrying out FPS sampling on each layer of input point cloud to obtain a corresponding number of point clouds, and aggregating neighborhood characteristics on the sampled points through the step 1.2.2, thereby completing the characteristic extraction without information loss. The first layer extracts local detail features, and the last three layers extract semantic features.
Moreover, the feature transfer in step 1.3 is the reverse process of feature extraction: it starts from the last layer of the encoder and proceeds layer by layer back to the shallower layers until features have been transferred to all N points of the input point cloud. Taking the transfer from a deeper layer to the next shallower layer as an example, assume point p_i is a point of the shallower layer that needs to receive a feature, φ(p_i) denotes the combination of the k points of the deeper layer closest to p_i in Euclidean space, and p_j denotes a point of φ(p_i). The trilinear interpolation feature transfer is computed as

f(p_i) = Σ_{j=1}^{k} w_ij f(p_j)

where f(p_i) is the feature to be transferred, f(p_j) denotes the feature of the j-th point p_j in the neighborhood of p_i, and w_ij denotes the feature weighting of the j-th point p_j in the neighborhood of point p_i.

The feature of each transferred point is thus obtained as a Euclidean-distance-weighted summation of the features of its k neighbors in the deeper layer, and the features can be propagated forward layer by layer to every point of the scene, so that the point cloud carries lossless information features.
In step 2, the point cloud coordinates of the original scene are used as supervision information and the Smooth-L1 loss is used as the loss function, computed as

L_point = (1 / |φ(p)|) Σ_{p ∈ φ(p)} SmoothL1(r′ − r)

where r′ and r denote, respectively, the point cloud spatial coordinates predicted by the lossless feature extraction network and the spatial coordinates of the original point cloud, and φ(p) denotes the point cloud set of the whole original scene. Under the supervision of this loss function, the perception of local neighborhood information by the lossless feature extraction network PointNet++ is further improved.
In step 3.1, the initialization uniformly divides the point cloud space into a voxel grid, retains the voxels containing points, discards the voxels containing no points, and initializes the retained voxels with the local detail features obtained in step 1. Assume the output of the first encoder layer in step 1 is the set of pairs (P_i, F_i^P), where P_i denotes a point of the original point cloud space whose feature needs to be transferred and F_i^P is the feature of point P_i. Let V_j denote a voxel center and F_j^V the feature that needs to be assigned to it; M voxel centers need to be assigned in total. The voxel center features are assigned through a trilinear interpolation function: let ψ(V_j) denote the combination of the k points closest to V_j in Euclidean space and P_t a point of ψ(V_j); then F_j^V is computed as

F_j^V = Σ_{t=1}^{k} w_tj F_t^P

where F_t^P denotes the feature of the t-th point P_t in the neighborhood of voxel center V_j and w_tj denotes the feature weighting of the t-th point P_t in the neighborhood of voxel center V_j.
Furthermore, step 3.2 stacks 4 sparse convolution modules built with the Spconv library, where each sparse convolution module contains two submanifold convolution layers and one point cloud sparse convolution layer with a downsampling stride of 2. Assuming the input voxel feature tensor is represented as L × W × H × C, where L, W, H and C denote the length, width and height of the voxel scene and the feature dimension of each voxel respectively, the output of the 4 sparse convolution layers can be represented as L′ × W′ × H′ × C′, where the spatial dimensions are reduced by the successive downsampling and C′ denotes the feature dimension after feature extraction.
Furthermore, in step 3.3, assume the three layers of semantic features extracted in step 1 are represented as F^{4×}, F^{8×} and F^{16×}, where 4× denotes four-times downsampling (and likewise for 8× and 16×). Let V_j′ denote a voxel center after sparse convolution and F_j^{V′} the feature that needs to be assigned to it. The semantic features of the points are converted to the voxel center representation by trilinear interpolation: let ψ(V_j′) denote the combination of the k points closest to V_j′ in Euclidean space at a given scale, with P_{t,4×}, P_{t,8×} and P_{t,16×} the corresponding points at the three scales; then for the four-times-downsampled layer

F_j^{V′,4×} = Σ_{t=1}^{k} w_{tj,4×} F_{t,4×}^P

and analogously for the 8× and 16× layers, the interpolated features of the three scales together forming F_j^{V′}. Here V_j′ denotes the voxel center after 3D sparse convolution, P_{t,4×}, P_{t,8×} and P_{t,16×} denote the spatial points used for feature weighting, F_{t,4×}^P denotes the feature from the four-times-downsampled layer of the t-th point in the neighborhood of voxel center V_j′, and w_{tj,4×} denotes its feature weighting.
In step 3.4, the two kinds of semantic information are concatenated along the feature dimension. Assume the dimension of the voxel features obtained by conversion in step 3.3 is M1 and the dimension of the voxel features obtained by sparse convolution perception is M2; then the feature dimension of the superposed voxel is M1 + M2, and a single-layer multilayer perceptron is then used to map the M1 + M2-dimensional features back to M1 dimensions.
In step 4.1, the RPN is built from a four-layer two-dimensional convolutional neural network with layer-by-layer outputs in a U-Net style structure, and each layer uses 3 × 3 convolutions to reduce the number of learnable parameters. This encoding-decoding network structure further abstracts the fused features; a corresponding detection anchor box is preset for each pixel of the final feature map, and the preset detection anchor boxes are classified and regressed to obtain the objects detected by the RPN. A three-dimensional detection anchor box can be expressed as {x, y, z, l, w, h, r}, where (x, y, z) denotes the center position of the anchor box, l, w and h correspond to its length, width and height, and r is the rotation angle in the x-y plane. The voxel features after 3D sparse convolution and semantic information fusion are compressed along the height dimension into the feature dimension to obtain a two-dimensional top-view feature map, and a predefined set of detection anchor boxes is laid out over every pixel of this feature map.
Furthermore, the classification loss function L_cls in step 4.2 uses the cross entropy loss function, namely

L_cls = −(1/n) Σ_{i=1}^{n} [ Q(a_i) log P(a_i) + (1 − Q(a_i)) log(1 − P(a_i)) ]

where n denotes the number of preset detection anchor boxes, P(a_i) denotes the predicted score of the i-th detection anchor box, and Q(a_i) denotes the true label of that anchor box.

The regression loss function L_reg uses the Smooth-L1 loss function, namely

L_reg = (1/n) Σ_{i=1}^{n} SmoothL1(v_i′ − v_i)

where n denotes the number of preset detection anchor boxes, v denotes the true values of the detection anchor boxes, and v′ denotes the values of the detection anchor boxes predicted by the RPN.
Through the combined supervision of the classification loss function and the regression loss function, the network can finally learn the capability of detecting the point cloud target.
Compared with the prior art, the invention has the following advantages: (1) it combines the advantages of current point cloud target detection methods based on voxel division and on the original point cloud, possessing both efficient point cloud perception capability and lossless feature encoding capability; (2) by embedding the lossless encoding of the point cloud into the voxel method in a multi-scale, multi-level manner, the detection network gains multi-scale, multi-level information fusion perception capability.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of an example of detection according to an embodiment of the present invention, in which fig. 2(a) is an input point cloud, and fig. 2(b) is a point cloud detection anchor box.
Detailed Description
The invention provides a point cloud target detection method that fuses the original point cloud with voxel division. First, local detail features and semantic features of the point cloud are extracted with the lossless feature extraction network PointNet++; a loss function is then constructed to further improve the perception of local neighborhood information by the lossless feature extraction network PointNet++; the local detail features and semantic features, extracted without information loss, are then embedded by trilinear interpolation into a voxel-division-based point cloud target detection network at the voxel feature initialization stage and the sparse convolution perception stage; finally, each preset detection anchor box is classified and regressed by a two-dimensional RPN to obtain the final detection targets.
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
Step 1, extract local detail features and semantic features of the point cloud with the lossless feature extraction network PointNet++.
First, a fixed number N of input points is collected; then a local point set feature extraction module (Set Abstraction, SA) is built and applied with layer-by-layer sampling to extract features of local scenes; trilinear interpolation is then used to assign the local detail features and semantic features to all points of the original scene through the feature propagation layer (Feature Propagation). This step comprises the following substeps:
and 1.1, constructing a multilayer encoder.
First, N points are sampled from the original point cloud with a farthest point sampling (FPS) strategy as the input point cloud; FPS is then used to sample a progressively smaller number of points layer by layer from the input point cloud data, forming a 4-layer encoder in which the point cloud input to each layer is the point set output by the previous layer.
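As an illustration of the FPS sampling used to build the encoder, the following NumPy sketch is not taken from the patent text; the function name, the choice of starting index 0 and the per-layer sizes are assumptions.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Greedily pick n_samples indices so that each newly chosen point is the
    one farthest from the points selected so far (illustrative sketch)."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(n, np.inf)          # distance of every point to the selected set
    selected[0] = 0                    # assumption: start from the first point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(dist))
    return selected

# Build the 4-layer encoder inputs by repeated FPS; the layer sizes below are
# placeholders, since the patent text does not fix them at this point.
scene = np.random.rand(20000, 3).astype(np.float32)
current = scene
for m in (4096, 1024, 256, 64):
    current = current[farthest_point_sampling(current, m)]
```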
And 1.2, extracting the local detail features and semantic features of each layer of point cloud through an SA module without information loss.
The input of each SA module layer is the fixed-size point set obtained by FPS sampling in the previous layer. Let p_i be the i-th point obtained by FPS sampling in the current layer and N(p_i) the set of points of the previous layer lying inside a sphere neighborhood of radius r centered on p_i. The output feature of p_i is computed in the following steps:

Step 1.2.1, randomly sample a fixed number of points from the sphere neighborhood N(p_i).

Step 1.2.2, perform feature fusion extraction on the points sampled in step 1.2.1 through a multilayer perceptron to obtain the output feature of point p_i.

First, a multilayer perceptron is applied to the point set randomly sampled in step 1.2.1 to extract local detail features, giving a high-dimensional mapping of each point; max pooling over the feature dimension then yields the maximal information representation, and this pooled high-dimensional feature is the output feature of point p_i. The calculation formula is

f(p_i) = max_{p_j ∈ N(p_i)} MLP(p_j)

where MLP denotes the high-dimensional mapping of the point features by the multilayer perceptron, max() denotes taking the maximum value over the feature dimension of the point set, and f(p_i) is the output feature of point p_i.
And 1.2.3, repeatedly carrying out FPS sampling on each layer of input point cloud to obtain a corresponding number of point clouds, and aggregating neighborhood characteristics on the sampled points through the step 1.2.2, thereby completing the characteristic extraction without information loss. The first layer extracts local detail features, and the last three layers extract semantic features.
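The SA step described above can be sketched in PyTorch as follows; here the k nearest neighbours stand in for the radius-r ball query and random sampling, and the layer widths and k are assumptions, so this is only an illustration of the shared-MLP-plus-max-pooling idea rather than the patented module itself.

```python
import torch
import torch.nn as nn

class SetAbstractionSketch(nn.Module):
    def __init__(self, in_dim=3, hidden=64, out_dim=128, k=32):
        super().__init__()
        self.k = k
        # shared MLP applied to every neighbour independently
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim), nn.ReLU())

    def forward(self, points, centers):
        # points: (N, 3) previous-layer points; centers: (M, 3) FPS-sampled points
        idx = torch.cdist(centers, points).topk(self.k, largest=False).indices  # (M, k)
        neigh = points[idx] - centers.unsqueeze(1)   # centre-relative coordinates
        feat = self.mlp(neigh)                       # (M, k, out_dim)
        return feat.max(dim=1).values                # max over the neighbourhood -> (M, out_dim)

# usage: feats = SetAbstractionSketch()(prev_layer_xyz, sampled_xyz)
```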
And step 1.3, endowing the detail features and the semantic features extracted in the step 1.2 to all points in the original scene through a feature transfer layer by adopting tri-linear interpolation.
The feature transfer is the reverse process of feature extraction: it starts from the last layer of the encoder and proceeds layer by layer back to the shallower layers until features have been transferred to all N points of the input point cloud. Taking the transfer from a deeper layer to the next shallower layer as an example, assume point p_i is a point of the shallower layer that needs to receive a feature, φ(p_i) denotes the combination of the k points of the deeper layer closest to p_i in Euclidean space, and p_j denotes a point of φ(p_i). The trilinear interpolation feature transfer is computed as

f(p_i) = Σ_{j=1}^{k} w_ij f(p_j)

where f(p_i) is the feature to be transferred, f(p_j) denotes the feature of the j-th point p_j in the neighborhood of p_i, and w_ij denotes the feature weighting of the j-th point p_j in the neighborhood of point p_i.

The feature of each transferred point is thus obtained as a Euclidean-distance-weighted summation of the features of its k neighbors in the deeper layer, and the features can be propagated forward layer by layer to every point of the scene, so that the point cloud carries lossless information features.
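A sketch of this transfer step: each point of the denser layer receives a distance-weighted average of the features of its k nearest points in the sparser layer. The inverse-distance weighting and k = 3 follow common PointNet++ practice and are assumptions here, not a quotation of the patent's exact weights.

```python
import torch

def propagate_features(dst_xyz, src_xyz, src_feat, k=3, eps=1e-8):
    """dst_xyz: (N, 3) points receiving features; src_xyz: (M, 3); src_feat: (M, C)."""
    dist, idx = torch.cdist(dst_xyz, src_xyz).topk(k, largest=False)  # k nearest sources
    w = 1.0 / (dist + eps)
    w = w / w.sum(dim=1, keepdim=True)               # normalised weights w_ij
    return (w.unsqueeze(-1) * src_feat[idx]).sum(1)  # f(p_i) = sum_j w_ij f(p_j)
```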
And 2, constructing a loss function, supervising the execution of the feature extraction in the step 1, and promoting the lossless feature extraction of the network Pointnet + + perception feature information.
The point cloud coordinates of the original scene are used as supervision information and the Smooth-L1 loss is used as the loss function, computed as

L_point = (1 / |φ(p)|) Σ_{p ∈ φ(p)} SmoothL1(r′ − r)

where r′ and r denote, respectively, the point cloud spatial coordinates predicted by the lossless feature extraction network and the spatial coordinates of the original point cloud, and φ(p) denotes the point cloud set of the whole original scene. Under the supervision of this loss function, the perception of local neighborhood information by the lossless feature extraction network PointNet++ is further improved.
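A minimal sketch of this supervision, assuming the network outputs a reconstructed coordinate for every point of the scene and the loss is averaged over the whole point set:

```python
import torch
import torch.nn.functional as F

def coordinate_reconstruction_loss(pred_xyz: torch.Tensor, gt_xyz: torch.Tensor) -> torch.Tensor:
    # pred_xyz, gt_xyz: (N, 3); Smooth-L1 between predicted and original coordinates
    return F.smooth_l1_loss(pred_xyz, gt_xyz, reduction='mean')
```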
And 3, embedding the local detail features and semantic features without information loss into a point cloud target detection network based on voxel division.
Firstly, dividing original point cloud into voxels, initializing the voxel characteristics by using the local detail characteristics extracted in the step 1, then perceiving a point cloud space structure through sparse 3D convolution, and then fusing the semantic characteristics extracted in the step 1 at a semantic level, wherein the method comprises the following substeps:
and 3.1, initializing the voxel characteristics by using the local detail characteristics extracted in the step 1.
First, the point cloud space is uniformly divided into a voxel grid, voxels containing points are retained and voxels containing no points are discarded, and the retained voxels are then initialized with the local detail features obtained in step 1. Assume the output of the first encoder layer in step 1 is the set of pairs (P_i, F_i^P), where P_i denotes a point of the original point cloud space whose feature needs to be transferred and F_i^P is the feature of point P_i. Let V_j denote a voxel center and F_j^V the feature that needs to be assigned to it; M voxel centers need to be assigned in total. The voxel center features are assigned through a trilinear interpolation function: let ψ(V_j) denote the combination of the k points closest to V_j in Euclidean space and P_t a point of ψ(V_j); then F_j^V is computed as

F_j^V = Σ_{t=1}^{k} w_tj F_t^P

where F_t^P denotes the feature of the t-th point P_t in the neighborhood of voxel center V_j and w_tj denotes the feature weighting of the t-th point P_t in the neighborhood of voxel center V_j.
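A sketch of this initialization, under the assumptions that the voxel size is a free parameter and that the propagate_features helper sketched in step 1.3 is reused for the interpolation onto voxel centres:

```python
import torch

def init_voxel_features(xyz, point_feat, voxel_size=0.1, k=3):
    """xyz: (N, 3) points; point_feat: (N, C) local detail features."""
    coords = torch.floor(xyz / voxel_size).long()
    occupied = torch.unique(coords, dim=0)                 # keep non-empty voxels only
    centers = (occupied.float() + 0.5) * voxel_size        # voxel centre coordinates V_j
    center_feat = propagate_features(centers, xyz, point_feat, k=k)   # F_j^V
    return occupied, centers, center_feat
```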
And 3.2, performing feature extraction on the voxel scene semantic information initialized in the step 3.1 by using 3D sparse convolution.
Four sparse convolution modules built with the Spconv library are stacked, where each sparse convolution module contains two submanifold convolution layers and one point cloud sparse convolution layer with a downsampling stride of 2. Assuming the input voxel feature tensor is represented as L × W × H × C, where L, W, H and C denote the length, width and height of the voxel scene and the feature dimension of each voxel respectively, the output of the 4 sparse convolution layers can be represented as L′ × W′ × H′ × C′, where the spatial dimensions are reduced by the successive downsampling and C′ denotes the feature dimension after feature extraction.
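A sketch of such a 4-module sparse backbone, assuming the spconv v2 PyTorch interface; the channel widths are illustrative and not taken from the patent.

```python
import torch.nn as nn
import spconv.pytorch as spconv   # assumption: spconv v2 API

def sparse_block(cin, cout, key):
    # two submanifold convolutions followed by one stride-2 sparse convolution
    return spconv.SparseSequential(
        spconv.SubMConv3d(cin, cout, 3, padding=1, indice_key=key), nn.ReLU(),
        spconv.SubMConv3d(cout, cout, 3, padding=1, indice_key=key), nn.ReLU(),
        spconv.SparseConv3d(cout, cout, 3, stride=2, padding=1), nn.ReLU(),
    )

# 4 stacked modules, each halving the spatial resolution of the voxel grid
backbone = spconv.SparseSequential(
    sparse_block(16, 32, 'sub1'),
    sparse_block(32, 64, 'sub2'),
    sparse_block(64, 64, 'sub3'),
    sparse_block(64, 128, 'sub4'),
)
```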
And 3.3, converting the semantic features obtained in the step 1 into voxel features by adopting trilinear interpolation.
Assume the three layers of semantic features obtained in step 1 are represented as F^{4×}, F^{8×} and F^{16×}, where 4× denotes four-times downsampling (and likewise for 8× and 16×), and let V_j′ denote a voxel center after sparse convolution and F_j^{V′} the feature that needs to be assigned to it. The semantic features of the points are converted to the voxel center representation by trilinear interpolation: let ψ(V_j′) denote the combination of the k points closest to V_j′ in Euclidean space at a given scale, with P_{t,4×}, P_{t,8×} and P_{t,16×} the corresponding points at the three scales; then for the four-times-downsampled layer

F_j^{V′,4×} = Σ_{t=1}^{k} w_{tj,4×} F_{t,4×}^P

and analogously for the 8× and 16× layers, the interpolated features of the three scales together forming F_j^{V′}. Here V_j′ denotes the voxel center after 3D sparse convolution, P_{t,4×}, P_{t,8×} and P_{t,16×} denote the spatial points used for feature weighting, F_{t,4×}^P denotes the feature from the four-times-downsampled layer of the t-th point in the neighborhood of voxel center V_j′, and w_{tj,4×} denotes its feature weighting.
And 3.4, fusing the semantic features subjected to sparse convolution sensing in the step 3.2 with the voxel features obtained by conversion in the step 3.3 by adopting an attention mechanism mode to obtain semantic information fusing two sensing modes.
First, the two kinds of semantic information are concatenated along the feature dimension. Assume the dimension of the voxel features obtained by conversion in step 3.3 is M1 and the dimension of the voxel features obtained by sparse convolution perception is M2; then the feature dimension of the superposed voxel is M1 + M2, and a single-layer multilayer perceptron is then used to map the M1 + M2-dimensional features back to M1 dimensions.
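A sketch of this fusion: the two per-voxel feature sets are concatenated along the channel dimension and mapped back to M1 channels with a single-layer perceptron; the sigmoid gate used here is just one simple way to realise the attention-style weighting mentioned above and is an assumption.

```python
import torch
import torch.nn as nn

class VoxelFeatureFusion(nn.Module):
    def __init__(self, m1: int, m2: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(m1 + m2, m1 + m2), nn.Sigmoid())
        self.proj = nn.Linear(m1 + m2, m1)

    def forward(self, interp_feat, conv_feat):
        # interp_feat: (V, M1) interpolated semantic features,
        # conv_feat:   (V, M2) sparse-convolution features
        x = torch.cat([interp_feat, conv_feat], dim=-1)   # (V, M1 + M2)
        x = x * self.gate(x)                              # attention-style reweighting
        return self.proj(x)                               # map back to M1 channels
```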
Step 4, project the semantic features fused in step 3 onto a two-dimensional top view, build a region proposal network (RPN) with two-dimensional convolutions, and classify and regress the detection anchor boxes preset for each pixel under the top view of the scene to obtain the final detection targets. This step comprises the following substeps:
step 4.1, RPN network structure and predefined box setting.
The RPN is built from a four-layer two-dimensional convolutional neural network with layer-by-layer outputs in a U-Net style structure, and each layer uses 3 × 3 convolutions to reduce the number of learnable parameters. This encoding-decoding network structure further abstracts the fused features; a corresponding detection anchor box is preset for each pixel of the final feature map, and the preset detection anchor boxes are classified and regressed to obtain the objects detected by the RPN. A three-dimensional detection anchor box can be expressed as {x, y, z, l, w, h, r}, where (x, y, z) denotes the center position of the anchor box, l, w and h correspond to its length, width and height, and r is the rotation angle in the x-y plane. The voxel features after 3D sparse convolution and semantic information fusion are compressed along the height dimension into the feature dimension to obtain a two-dimensional top-view feature map, and a predefined set of detection anchor boxes is laid out over every pixel of this feature map.
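A sketch of the top-view projection and anchor layout described above; the two anchor rotations (0 and π/2), the anchor size and the placement of one anchor pair per BEV pixel are illustrative assumptions, not values given in the patent.

```python
import torch

def to_bev(voxel_features: torch.Tensor) -> torch.Tensor:
    # voxel_features: (C, H, L, W) dense tensor after sparse convolution;
    # fold the height axis into the channel axis -> (C * H, L, W)
    c, h, l, w = voxel_features.shape
    return voxel_features.reshape(c * h, l, w)

def make_anchors(l: int, w: int, size=(3.9, 1.6, 1.56), z=-1.0):
    # {x, y, z, l, w, h, r} anchors: one pair (rotation 0 and pi/2) per BEV cell
    ys, xs = torch.meshgrid(torch.arange(l, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing='ij')
    centers = torch.stack([xs, ys, torch.full_like(xs, z)], dim=-1)   # (l, w, 3)
    sizes = torch.tensor(size).expand(l, w, 3)
    anchors = [torch.cat([centers, sizes, torch.full((l, w, 1), rot)], dim=-1)
               for rot in (0.0, 1.5708)]
    return torch.stack(anchors, dim=2).reshape(-1, 7)                 # (l * w * 2, 7)
```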
And 4.2, designing a point cloud target detection loss function.
And classifying and regressing the preset detection anchor frame by utilizing a classification loss function and a regression loss function so as to obtain the object detected by the RPN.
The classification loss function L_cls uses the cross entropy loss function, namely

L_cls = −(1/n) Σ_{i=1}^{n} [ Q(a_i) log P(a_i) + (1 − Q(a_i)) log(1 − P(a_i)) ]

where n denotes the number of preset detection anchor boxes, P(a_i) denotes the predicted score of the i-th detection anchor box, and Q(a_i) denotes the true label of that anchor box.

The regression loss function L_reg uses the Smooth-L1 loss function, namely

L_reg = (1/n) Σ_{i=1}^{n} SmoothL1(v_i′ − v_i)

where n denotes the number of preset detection anchor boxes, v denotes the true values of the detection anchor boxes, and v′ denotes the values of the detection anchor boxes predicted by the RPN.
Through the combined supervision of the classification loss function and the regression loss function, the network can finally learn the capability of detecting the point cloud target.
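A sketch of this joint supervision, combining a binary cross-entropy term over anchor scores with Smooth-L1 regression on the 7 box parameters of the anchors matched to ground truth; the equal weighting of the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_labels, box_pred, box_target, pos_mask):
    """cls_logits: (n,) anchor scores; cls_labels: (n,) 0/1 ground-truth labels;
    box_pred, box_target: (n, 7) anchor parameters; pos_mask: (n,) bool positives."""
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_labels.float())
    if pos_mask.any():
        l_reg = F.smooth_l1_loss(box_pred[pos_mask], box_target[pos_mask])
    else:
        l_reg = box_pred.sum() * 0.0   # no positive anchors in this batch
    return l_cls + l_reg
```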
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Claims (10)
1. A point cloud target detection method fusing an original point cloud and voxel division is characterized by comprising the following steps:
step 1, extracting local detail features and semantic features of the point cloud by using the lossless feature extraction network PointNet++;
step 2, constructing a loss function to supervise the feature extraction in step 1 and improve the perception of feature information by the lossless feature extraction network PointNet++;
step 3, embedding the local detail features and semantic features without information loss into a point cloud target detection network based on voxel division;
step 3.1, initializing voxel characteristics by using the local detail characteristics extracted in the step 1;
step 3.2, performing feature extraction on the voxel scene semantic information initialized in the step 3.1 by using 3D sparse convolution;
step 3.3, converting the semantic features obtained in the step 1 into voxel features by adopting trilinear interpolation;
step 3.4, fusing the semantic features subjected to sparse convolution sensing in the step 3.2 with the voxel features obtained by conversion in the step 3.3 by adopting an attention mechanism mode to obtain semantic information fusing two sensing modes;
and step 4, projecting the semantic features fused in step 3 onto a two-dimensional top view, building a region proposal network (RPN) with two-dimensional convolutions, and classifying and regressing the detection anchor boxes preset for each pixel under the top view of the scene to obtain the final detection targets.
2. The method for point cloud target detection by fusing original point cloud and voxel division according to claim 1, wherein the method comprises the following steps: the step 1 comprises the following substeps:
step 1.1, constructing a multilayer encoder;
collecting N points from the original point cloud with a farthest point sampling strategy as the input point cloud, and then using the farthest point sampling strategy to sample a progressively smaller number of points layer by layer from the input point cloud data, forming a 4-layer encoder in which the point cloud input to each layer is the point set output by the previous layer;
step 1.2, extracting the local detail features and semantic features of each layer of point cloud through a local point set feature extraction module without information loss;
and step 1.3, endowing the detail features and the semantic features extracted in the step 1.2 to all points in the original scene through a feature transfer layer by adopting tri-linear interpolation.
3. The point cloud target detection method fusing original point cloud and voxel division according to claim 2, wherein: the input of each layer of the local point set feature extraction module in step 1.2 is the fixed-size point set obtained in the previous layer by the farthest point sampling strategy; let p_i be the i-th point sampled by the farthest point sampling strategy in the current layer and N(p_i) the set of points of the previous layer lying inside a sphere neighborhood of radius r centered on p_i; the output feature of point p_i is computed in the following steps:

step 1.2.1, randomly sampling a fixed number of points from the sphere neighborhood N(p_i);

step 1.2.2, performing feature fusion extraction on the points sampled in step 1.2.1 through a multilayer perceptron, with the calculation formula

f(p_i) = max_{p_j ∈ N(p_i)} MLP(p_j)

where MLP denotes the high-dimensional mapping of the point features by the multilayer perceptron, max() denotes taking the maximum value over the feature dimension of the point set, and f(p_i) is the output feature of point p_i;
step 1.2.3, a farthest point sampling strategy is repeated for each layer of input point cloud to sample point cloud with a corresponding number, and neighborhood features are aggregated for the sampled points through step 1.2.2, so that feature extraction without information loss is completed, wherein local detail features are extracted from the first layer, and semantic features are extracted from the last three layers.
4. The point cloud target detection method fusing original point cloud and voxel division according to claim 2, wherein: the feature transfer in step 1.3 is the reverse process of feature extraction, starting from the last layer of the encoder and proceeding layer by layer back to the shallower layers until features have been transferred to all N points of the input point cloud; taking the transfer from a deeper layer to the next shallower layer as an example, assume point p_i is a point of the shallower layer that needs to receive a feature, φ(p_i) denotes the combination of the k points of the deeper layer closest to p_i in Euclidean space, and p_j denotes a point of φ(p_i); the trilinear interpolation feature transfer is computed as

f(p_i) = Σ_{j=1}^{k} w_ij f(p_j)

where f(p_i) is the feature to be transferred, f(p_j) denotes the feature of the j-th point p_j in the neighborhood of p_i, and w_ij denotes the feature weighting of the j-th point p_j in the neighborhood of point p_i.
5. The point cloud target detection method fusing original point cloud and voxel division according to claim 1, wherein: in step 2, the point cloud coordinates of the original scene are used as supervision information and the Smooth-L1 loss is used as the loss function, computed as

L_point = (1 / |φ(p)|) Σ_{p ∈ φ(p)} SmoothL1(r′ − r)

where r′ and r denote, respectively, the point cloud spatial coordinates predicted by the lossless feature extraction network and the spatial coordinates of the original point cloud, and φ(p) denotes the point cloud set of the whole original scene; under the supervision of this loss function, the perception of local neighborhood information by the lossless feature extraction network PointNet++ is further improved.
6. The point cloud target detection method fusing original point cloud and voxel division according to claim 1, wherein: the initialization in step 3.1 uniformly divides the point cloud space into a voxel grid, retains the voxels containing points, discards the voxels containing no points, and then initializes the retained voxels with the local detail features obtained in step 1; let ψ(V_j) denote the combination of the k points closest to voxel center V_j in Euclidean space and P_t a point of ψ(V_j); the voxel center feature F_j^V is then computed as

F_j^V = Σ_{t=1}^{k} w_tj F_t^P

where F_t^P denotes the feature of the t-th point P_t in the neighborhood of voxel center V_j and w_tj denotes the feature weighting of the t-th point P_t in the neighborhood of voxel center V_j.
7. The point cloud target detection method fusing original point cloud and voxel division according to claim 1, wherein: step 3.2 stacks 4 sparse convolution modules built with the Spconv library, where each sparse convolution module contains two submanifold convolution layers and one point cloud sparse convolution layer with a downsampling stride of 2; assuming the input voxel feature tensor is represented as L × W × H × C, where L, W, H and C denote the length, width and height of the voxel scene and the feature dimension of each voxel respectively, the output of the 4 sparse convolution layers can be represented as L′ × W′ × H′ × C′, where the spatial dimensions are reduced by the successive downsampling and C′ denotes the feature dimension after feature extraction.
8. The point cloud target detection method fusing original point cloud and voxel division according to claim 1, wherein: in step 3.3, assume the three layers of semantic features extracted in step 1 are represented as F^{4×}, F^{8×} and F^{16×}, where 4× denotes four-times downsampling (and likewise for 8× and 16×); let ψ(V_j′) denote the combination of the k points closest to the sparse-convolved voxel center V_j′ in Euclidean space at a given scale, with P_{t,4×}, P_{t,8×} and P_{t,16×} the corresponding points at the three scales; then for the four-times-downsampled layer the sparse-convolved voxel center feature is

F_j^{V′,4×} = Σ_{t=1}^{k} w_{tj,4×} F_{t,4×}^P

and analogously for the 8× and 16× layers, the interpolated features of the three scales together forming F_j^{V′}, where V_j′ denotes the voxel center after 3D sparse convolution, P_{t,4×}, P_{t,8×} and P_{t,16×} denote the spatial points used for feature weighting, F_{t,4×}^P denotes the feature from the four-times-downsampled layer of the t-th point in the neighborhood of voxel center V_j′, and w_{tj,4×} denotes its feature weighting.
9. The method for point cloud target detection by fusing original point cloud and voxel division according to claim 1, wherein the method comprises the following steps: the step 4 comprises the following two substeps:
step 4.1, setting an RPN network structure and a predefined detection anchor frame;
the RPN is built from a four-layer two-dimensional convolutional neural network with layer-by-layer outputs in a U-Net style structure, each layer using 3 × 3 convolutions to reduce the number of learnable parameters; this encoding-decoding network structure further abstracts the fused features; a corresponding detection anchor box is preset for each pixel of the final feature map, and the objects detected by the RPN are obtained by classifying and regressing the preset detection anchor boxes; a three-dimensional detection anchor box can be expressed as {x, y, z, l, w, h, r}, where (x, y, z) denotes the center position of the anchor box, l, w and h correspond to its length, width and height, and r is the rotation angle in the x-y plane; the voxel features after 3D sparse convolution and semantic information fusion are compressed along the height dimension into the feature dimension to obtain a two-dimensional top-view feature map, and a predefined set of detection anchor boxes is laid out over every pixel of this feature map;
and 4.2, designing a point cloud target detection loss function.
10. The point cloud target detection method fusing original point cloud and voxel division according to claim 9, wherein: the point cloud target detection loss function in step 4.2 comprises a classification loss function and a regression loss function; the classification loss function L_cls uses the cross entropy loss function, namely

L_cls = −(1/n) Σ_{i=1}^{n} [ Q(a_i) log P(a_i) + (1 − Q(a_i)) log(1 − P(a_i)) ]

where n denotes the number of preset detection anchor boxes, P(a_i) denotes the predicted score of the i-th detection anchor box, and Q(a_i) denotes the true label of that anchor box;

the regression loss function L_reg uses the Smooth-L1 loss function, namely

L_reg = (1/n) Σ_{i=1}^{n} SmoothL1(v_i′ − v_i)

where n denotes the number of preset detection anchor boxes, v denotes the true values of the detection anchor boxes, and v′ denotes the values of the detection anchor boxes predicted by the RPN.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110651776.XA CN113378854A (en) | 2021-06-11 | 2021-06-11 | Point cloud target detection method integrating original point cloud and voxel division |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110651776.XA CN113378854A (en) | 2021-06-11 | 2021-06-11 | Point cloud target detection method integrating original point cloud and voxel division |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113378854A true CN113378854A (en) | 2021-09-10 |
Family
ID=77573977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110651776.XA Pending CN113378854A (en) | 2021-06-11 | 2021-06-11 | Point cloud target detection method integrating original point cloud and voxel division |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378854A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109118564A (en) * | 2018-08-01 | 2019-01-01 | 湖南拓视觉信息技术有限公司 | A kind of three-dimensional point cloud labeling method and device based on fusion voxel |
CN111160214A (en) * | 2019-12-25 | 2020-05-15 | 电子科技大学 | 3D target detection method based on data fusion |
CN112052860A (en) * | 2020-09-11 | 2020-12-08 | 中国人民解放军国防科技大学 | Three-dimensional target detection method and system |
CN112418084A (en) * | 2020-11-23 | 2021-02-26 | 同济大学 | Three-dimensional target detection method based on point cloud time sequence information fusion |
Non-Patent Citations (2)
Title |
---|
CHARLES R. QI ET AL.: "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space", 《ADVANCES 31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2017)》 * |
TIANYUAN JIANG,ET AL.: "VIC-Net: Voxelization Information Compensation Network for Point Cloud 3D Object Detection", 《2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021)》 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113900119B (en) * | 2021-09-29 | 2024-01-30 | 苏州浪潮智能科技有限公司 | Method, system, storage medium and equipment for laser radar vehicle detection |
CN113900119A (en) * | 2021-09-29 | 2022-01-07 | 苏州浪潮智能科技有限公司 | Laser radar vehicle detection method, system, storage medium and equipment |
CN114155524A (en) * | 2021-10-29 | 2022-03-08 | 中国科学院信息工程研究所 | Single-stage 3D point cloud target detection method and device, computer equipment and medium |
CN114120115A (en) * | 2021-11-19 | 2022-03-01 | 东南大学 | Point cloud target detection method for fusing point features and grid features |
CN114120115B (en) * | 2021-11-19 | 2024-08-23 | 东南大学 | Point cloud target detection method integrating point features and grid features |
CN114463736A (en) * | 2021-12-28 | 2022-05-10 | 天津大学 | Multi-target detection method and device based on multi-mode information fusion |
CN114494183A (en) * | 2022-01-25 | 2022-05-13 | 哈尔滨医科大学附属第一医院 | Artificial intelligence-based automatic acetabular radius measurement method and system |
CN114494183B (en) * | 2022-01-25 | 2024-04-02 | 哈尔滨医科大学附属第一医院 | Automatic acetabular radius measurement method and system based on artificial intelligence |
CN114638953A (en) * | 2022-02-22 | 2022-06-17 | 深圳元戎启行科技有限公司 | Point cloud data segmentation method and device and computer readable storage medium |
CN114638953B (en) * | 2022-02-22 | 2023-12-22 | 深圳元戎启行科技有限公司 | Point cloud data segmentation method and device and computer readable storage medium |
CN114821033A (en) * | 2022-03-23 | 2022-07-29 | 西安电子科技大学 | Three-dimensional information enhanced detection and identification method and device based on laser point cloud |
CN114882495A (en) * | 2022-04-02 | 2022-08-09 | 华南理工大学 | 3D target detection method based on context-aware feature aggregation |
CN114882495B (en) * | 2022-04-02 | 2024-04-12 | 华南理工大学 | 3D target detection method based on context-aware feature aggregation |
CN115222988A (en) * | 2022-07-17 | 2022-10-21 | 桂林理工大学 | Laser radar point cloud data urban ground feature PointEFF fine classification method |
CN115375731A (en) * | 2022-07-29 | 2022-11-22 | 大连宗益科技发展有限公司 | 3D point cloud single-target tracking method of associated points and voxels and related device |
CN115471513A (en) * | 2022-11-01 | 2022-12-13 | 小米汽车科技有限公司 | Point cloud segmentation method and device |
CN116664874B (en) * | 2023-08-02 | 2023-10-20 | 安徽大学 | Single-stage fine-granularity light-weight point cloud 3D target detection system and method |
CN116664874A (en) * | 2023-08-02 | 2023-08-29 | 安徽大学 | Single-stage fine-granularity light-weight point cloud 3D target detection system and method |
CN117058402B (en) * | 2023-08-15 | 2024-03-12 | 北京学图灵教育科技有限公司 | Real-time point cloud segmentation method and device based on 3D sparse convolution |
CN117058402A (en) * | 2023-08-15 | 2023-11-14 | 北京学图灵教育科技有限公司 | Real-time point cloud segmentation method and device based on 3D sparse convolution |
CN117475410B (en) * | 2023-12-27 | 2024-03-15 | 山东海润数聚科技有限公司 | Three-dimensional target detection method, system, equipment and medium based on foreground point screening |
CN117475410A (en) * | 2023-12-27 | 2024-01-30 | 山东海润数聚科技有限公司 | Three-dimensional target detection method, system, equipment and medium based on foreground point screening |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113378854A (en) | Point cloud target detection method integrating original point cloud and voxel division | |
Zamanakos et al. | A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving | |
CN112529015B (en) | Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping | |
CN109410307B (en) | Scene point cloud semantic segmentation method | |
CN114937151B (en) | Lightweight target detection method based on multiple receptive fields and attention feature pyramid | |
Ye et al. | 3d recurrent neural networks with context fusion for point cloud semantic segmentation | |
CN111242041B (en) | Laser radar three-dimensional target rapid detection method based on pseudo-image technology | |
CN112488210A (en) | Three-dimensional point cloud automatic classification method based on graph convolution neural network | |
CN113850270B (en) | Semantic scene completion method and system based on point cloud-voxel aggregation network model | |
CN114255238A (en) | Three-dimensional point cloud scene segmentation method and system fusing image features | |
CN113345082A (en) | Characteristic pyramid multi-view three-dimensional reconstruction method and system | |
CN112347987A (en) | Multimode data fusion three-dimensional target detection method | |
Cheng et al. | S3Net: 3D LiDAR sparse semantic segmentation network | |
CN112560865B (en) | Semantic segmentation method for point cloud under outdoor large scene | |
CN113870160B (en) | Point cloud data processing method based on transformer neural network | |
CN114373104A (en) | Three-dimensional point cloud semantic segmentation method and system based on dynamic aggregation | |
Ahmad et al. | 3D capsule networks for object classification from 3D model data | |
CN115147601A (en) | Urban street point cloud semantic segmentation method based on self-attention global feature enhancement | |
CN112488117B (en) | Point cloud analysis method based on direction-induced convolution | |
Hazer et al. | Deep learning based point cloud processing techniques | |
CN117765258A (en) | Large-scale point cloud semantic segmentation method based on density self-adaption and attention mechanism | |
CN112132207A (en) | Target detection neural network construction method based on multi-branch feature mapping | |
CN111860668A (en) | Point cloud identification method of deep convolution network for original 3D point cloud processing | |
CN116894940A (en) | Point cloud semantic segmentation method based on feature fusion and attention mechanism | |
CN115424225A (en) | Three-dimensional real-time target detection method for automatic driving system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210910 |