CN111681212A

CN111681212A - A three-dimensional target detection method based on lidar point cloud data

Info

Publication number: CN111681212A
Application number: CN202010433849.3A
Authority: CN
Inventors: 郭裕兰; 张永聪; 陈铭林; 傲晟
Original assignee: Sun Yat Sen University
Current assignee: National University of Defense Technology; Sun Yat Sen University
Priority date: 2020-05-21
Filing date: 2020-05-21
Publication date: 2020-09-18
Anticipated expiration: 2040-05-21
Also published as: CN111681212B

Abstract

The invention discloses a three-dimensional target detection method based on laser radar point cloud data. The method adopts a dense data expression form according to the data characteristics of the laser radar point cloud, so as to obtain the dense features and convert the three-dimensional features into Two-dimensional features can effectively improve computing efficiency and improve computing accuracy.

Description

A three-dimensional target detection method based on lidar point cloud data

技术领域technical field

本发明涉及自动驾驶中的三维目标检测技术领域，具体涉及基于激光雷达点云数据的三维目标检测方法。The invention relates to the technical field of three-dimensional target detection in automatic driving, in particular to a three-dimensional target detection method based on laser radar point cloud data.

背景技术Background technique

激光雷达能获取三维空间的物体信息，其通过物体表面的反射时间计算出物体在空间中的位置信息。Lidar can obtain object information in three-dimensional space, and calculate the position information of the object in space through the reflection time of the surface of the object.

在车辆行使过程中，对位于车辆四周的三维目标检测是自动驾驶中的基本组成部分。当前的自动驾驶车辆一般是通过融合图像RGB信息与激光雷达点云信息进行目标检测。本专利只使用激光雷达点云数据作为输入进行感兴趣物体的检测。尽管在二维图像里面的二维目标检测已经获得了重大进展，已经达到了极高的检测精度，但是在在无人驾驶等场景下的三维激光雷达点云检测效果仍然不佳，而这主要由激光雷达点云的稀疏性导致的。In the process of vehicle driving, the detection of three-dimensional objects located around the vehicle is a basic part of automatic driving. Current autonomous vehicles generally detect objects by fusing image RGB information with lidar point cloud information. This patent only uses lidar point cloud data as input for the detection of objects of interest. Although significant progress has been made in 2D target detection in 2D images and a very high detection accuracy has been achieved, the detection effect of 3D LiDAR point cloud in scenarios such as unmanned driving is still poor, and this is mainly Caused by the sparsity of the lidar point cloud.

进一步的说，苹果公司在2018年提出了VoxelNet来对激光雷达点云输入数据进行目标检测。其将空间中的激光雷达点云数据进行体素化分割，将空间分割为一个个独立体素并将体素内的点云数据使用类似PointNet的网络VFELayer(特征学习网络)进行特征提取。最后使用三维卷并在俯视图方向对特征进行拼接后进行物体检测。Further, Apple proposed VoxelNet in 2018 to perform object detection on lidar point cloud input data. It voxelizes the lidar point cloud data in the space, divides the space into independent voxels, and uses the PointNet-like network VFELayer (feature learning network) to extract features from the point cloud data in the voxels. Finally, object detection is performed using the 3D volume and stitching the features in the top view direction.

但是，VoxelNet将三维空间进行三维分割，对每个分割出来的体素内的点云进行特征提取，形成了一张四维的特征图(三维空间加一维特征)，从而需要使用三维卷积操作处理。其运算速度相较于二维卷积操作慢了一个量级。同时，由于点云的稀疏性，绝大部分的体素都是空的，因此三维卷积操作有极大的运算操作都是无用的卷积操作，但是仍需要浪费运算资源。However, VoxelNet divides the three-dimensional space into three-dimensional space, and extracts features from the point cloud in each segmented voxel to form a four-dimensional feature map (three-dimensional space plus one-dimensional features), which requires the use of three-dimensional convolution operations. deal with. Its operation speed is an order of magnitude slower than the two-dimensional convolution operation. At the same time, due to the sparseness of the point cloud, most of the voxels are empty, so the three-dimensional convolution operation has a huge operation and is a useless convolution operation, but it still needs to waste computing resources.

更进一步的说，PointPillars也是一种基于空间体素分割的网络，与VoxelNet不同的是其在俯视图方向上将空间切割成一条条长方体柱形，将柱形内的点云进行特征提取。这种方式处理出来的特征图就是一张三维特征图(二维空间加一维特征)，处理这种特征图只需要使用二维卷积操作，与一般的RGB图像物体检测的特征图一样，因此可以直接采用二维RGB图像检测的后续框架进行处理。速度和精度性能相较于VoxelNet高许多。Furthermore, PointPillars is also a network based on spatial voxel segmentation. Different from VoxelNet, it cuts the space into strips of cuboid columns in the direction of the top view, and extracts features from the point cloud in the column. The feature map processed in this way is a three-dimensional feature map (two-dimensional space plus one-dimensional features). To process this feature map, only two-dimensional convolution operations are required, which is the same as the feature map of general RGB image object detection. Therefore, the subsequent framework of 2D RGB image detection can be directly used for processing. The speed and accuracy performance is much higher than that of VoxelNet.

PointPillars在柱形内特征提取的过程中，将同一竖直方向上的点云直接融合成了一个特征，这种融合方式较为粗糙，竖直方向上特征分布不明显，同时鸟瞰特征图中的特征也十分稀疏，导致二维卷积操作中有大量的算力被浪费。In the process of feature extraction in the column, PointPillars directly fuses the point clouds in the same vertical direction into one feature. This fusion method is relatively rough, and the feature distribution in the vertical direction is not obvious. At the same time, the features in the bird's-eye feature map It is also very sparse, resulting in a lot of wasted computing power in the two-dimensional convolution operation.

发明内容SUMMARY OF THE INVENTION

鉴于现有技术的不足，本发明探讨一种基于激光雷达点云数据的三维目标检测方法，旨在于解决VoxelNet与PointPillars的特征稀疏问题，构建一种稠密的特征表达形式。相比于VoxelNet，其拥有二维卷积的运算效率，相比于Pointpillars,其竖直方向上的点云特征并非强行压缩在一起，保留了更多物体竖直方向上的特征，从而更好表达竖直方向上特征比较重要的目标。In view of the deficiencies of the prior art, the present invention discusses a three-dimensional target detection method based on lidar point cloud data, aiming at solving the feature sparse problem of VoxelNet and PointPillars, and constructing a dense feature expression form. Compared with VoxelNet, it has the operational efficiency of two-dimensional convolution. Compared with Pointpillars, the point cloud features in the vertical direction are not forcibly compressed together, and more features in the vertical direction of the object are retained, which is better. Represents a target whose features are more important in the vertical direction.

更进一步的，本发明根据激光雷达点云的数据特征，采取一种稠密的数据表达形式，以得到稠密的特征并将三维特征转换为二维特征，有效提高运算效率并提升运算精度。Furthermore, the present invention adopts a dense data expression form according to the data characteristics of the lidar point cloud, so as to obtain dense features and convert three-dimensional features into two-dimensional features, thereby effectively improving computing efficiency and improving computing accuracy.

为了实现上述目的，本发明采用如下技术方案：In order to achieve the above object, the present invention adopts the following technical solutions:

一种基于激光雷达点云数据的三维目标检测方法，所述方法包括以下步骤：A three-dimensional target detection method based on lidar point cloud data, the method comprises the following steps:

将点云表示成密集形的表面图，图中行数为K，其中K是激光雷达的通道数；给定一个激光雷达点p＝(x,y,z,r,)l，其中(x,y,z)，r和l∈{0,...,K-1}分别表示位置、反射率和点所在层数；点p位于表面图S^h×w的网格(h,w)中，其中h＝l，

The point cloud is represented as a dense surface map, and the number of rows in the figure is K, where K is the number of channels of the lidar; given a lidar point p=(x,y,z,r,)l, where (x, y, z), r and l∈{0,...,K-1} represent the position, reflectivity and layer number of the point respectively; point p is located in the grid (h,w) of the surface map S ^h×w , where h=l,

表面图根据场景的表面，将三维点投影到二维网格中，对于表面图的每个网格(h,w)，通过平均网格内的所有点，获得质心点

(h,w)网格内的深度的计算如下：The surface map projects 3D points into a 2D grid according to the surface of the scene, and for each grid (h,w) of the surface map, obtains the centroid point by averaging all points in the grid

The depth within the (h,w) grid is calculated as follows:

然后可以获得表面深度图D_map＝{d}∈R^H×W；表面深度图将深度信息存储在每个网格中；The surface depth map D _map ={d}∈R ^H×W can then be obtained; the surface depth map stores depth information in each grid;

基于体素特征编码层的网格特征编码器，体素特征编码层处理表面图的每个网格以生成该网格的特征，从而生成规则的2D表曲面特征图

若网格没有任何点，则使用零填充；且网格特征编码器不执行体素特征编码层中的随机采样；A grid feature encoder based on a voxel feature encoding layer that processes each grid of the surface map to generate features for that grid, resulting in a regular 2D surface surface feature map

If the grid does not have any points, zero padding is used; and the grid feature encoder does not perform random sampling in the voxel feature encoding layer;

具有N种不同分辨率的表面图，即S^H×W,

由网格特征编码器对其分别进行独立处理，生成N个表面特征图，即S^H×W,

然后，由特征串接得到一个多尺度表面特征F∈R^3C×H×W：Surface maps with N different resolutions, namely S ^H×W ,

They are processed independently by the grid feature encoder to generate N surface feature maps, namely S ^H×W ,

Then, a multi-scale surface feature F∈R ^3C×H×W is obtained by feature concatenation:

此多尺度表面特征用作后续模块的初始输入；This multi-scale surface feature is used as the initial input for subsequent modules;

具有表面特征卷积模块，并通过增加一个低分辨率输出的网络反卷积层来获得全分辨率输出；表面特征卷积模块生成的

中的前视图特征具有与其输入表面特征F相同的分辨率，但特征的维度不同；has a surface feature convolution module, and obtains a full-resolution output by adding a network deconvolution layer with a low-resolution output; the surface feature convolution module generates

The front view feature in F has the same resolution as its input surface feature F, but the dimension of the feature is different;

具有基于深度表面图的前视图特征从前视图到鸟瞰图的视图转换模块，不同对象的深度不同，但是从2d前视伪图像中获取的绝对深度是不相等的；从俯视图特征中得到物体的深度，并在视图变换后对高度进行回归。Front view features based on depth surface map View conversion module from front view to bird's-eye view, the depth of different objects is different, but the absolute depth obtained from the 2d front-view pseudo-image is not equal; the depth of the object is obtained from the top-view feature , and regresses the height after the view transform.

从热图H^O派生的点表示俯视图中检测对象中心的位置，即，x,z，而参数图P^O包含对象的参数，检测网络由一个公共特征提取器和两个分支，即热图分支和参数分支组成。The points derived from the heatmap ^HO represent the position of the center of the detected object in the top view, i.e., x, z, while the parameter map ^PO contains the parameters of the object, the detection network consists of a common feature extractor and two branches, the heatmap branch and parameter branches.

需要说明的是，视图转换模块有两个步骤：扩展和压缩；It should be noted that the view conversion module has two steps: expansion and compression;

在扩展步骤中，FV特征中每个(h,w)位置的特征f将会根据深度信息D映射到扩增特征图E的相应位置(d,h,w)，其中In the expansion step, the feature f at each (h, w) position in the FV feature will be mapped to the corresponding position (d, h, w) of the augmented feature map E according to the depth information D, where

这里R是最大深度范围，若D_map(h,w)>R，则设置d＝D；Here R is the maximum depth range, if D _map (h, w)>R, set d=D;

在压缩步骤中，通过随机选择在其H轴上挤压扩展的特征图，得到2D特征图，其大小为D×W，维度c'；最后，使用M个连续的2D卷积层处理输出，从而得到最终的俯视图特征图。In the compression step, a 2D feature map is obtained by extruding the expanded feature map on its H axis by randomly selecting, its size is D × W, dimension c'; finally, the output is processed using M consecutive 2D convolutional layers, Thus, the final top-view feature map is obtained.

本发明有益效果在于，构建了一种稠密的特征表达形式。不但拥有二维卷积的运算效率，而且竖直方向上的点云特征并非强行压缩在一起，保留了更多物体竖直方向上的特征，从而更好表达竖直方向上特征比较重要的目标。The beneficial effect of the present invention is that a dense feature expression form is constructed. Not only has the operational efficiency of two-dimensional convolution, but also the point cloud features in the vertical direction are not forcibly compressed together, retaining more features in the vertical direction of the object, so as to better express the more important objects in the vertical direction. .

附图说明Description of drawings

图1为本发明的三维目标检测总框架示意图；1 is a schematic diagram of the overall framework of three-dimensional target detection of the present invention;

图2为本发明的长方体体素与表面图；Fig. 2 is the rectangular parallelepiped voxel and surface map of the present invention;

图3为本发明的评测结果对比图示意图。FIG. 3 is a schematic diagram of a comparison diagram of the evaluation results of the present invention.

具体实施方式Detailed ways

以下将结合附图对本发明作进一步的描述，需要说明的是，以下实施例以本技术方案为前提，给出了详细的实施方式和具体的操作过程，但本发明的保护范围并不限于本实施例。The present invention will be further described below in conjunction with the accompanying drawings. It should be noted that the following examples are based on the technical solution, and provide detailed implementations and specific operation processes, but the protection scope of the present invention is not limited to the present invention. Example.

本发明为一种基于激光雷达点云数据的三维目标检测方法，所述方法包括以下步骤：The present invention is a three-dimensional target detection method based on lidar point cloud data, and the method comprises the following steps:

The depth within the (h,w) grid is calculated as follows:

具有N种不同分辨率的表面图，即S^H×W,

实施例Example

表面图(Surface map)Surface map

激光雷达是自动驾驶中常用的传感器。例如，Velodyne HDL-64E激光雷达按激光束的顺序记录64行点，而相邻扫描线之间的点分布是均匀的。基于对激光雷达扫描机理的观察，本发明将点云表示成密集的形式，即表面图(Surface map)。表面图是一个二维伪图像，行数为K，其中K是激光雷达的通道数。沿扫描方向的点(通常是水平的)被放置在表面图的一行中，而沿表面图的一列的点对应于在单次扫描期间在不同通道(即激光束)获得的点。(具有相同的水平但不同的垂直角)。Lidar is a commonly used sensor in autonomous driving. For example, the Velodyne HDL-64E lidar records 64 lines of points in the order of the laser beam, while the point distribution between adjacent scan lines is uniform. Based on the observation of the scanning mechanism of the lidar, the present invention expresses the point cloud in a dense form, that is, a surface map. The surface map is a 2D pseudo-image with K rows, where K is the number of lidar channels. Points along the scan direction (usually horizontal) are placed in a row of the surface map, while points along a column of the surface map correspond to points obtained at different channels (ie laser beams) during a single scan. (with the same horizontal but different vertical angles).

给定一个激光雷达点p＝(x,y,z,r,l)，其中(x,y,z)，r和l∈{0,...,K-1}分别表示位置、反射率和点所在层数。点p位于表面图S^h×w的网格(h,w)中，其中h＝l，

表面图根据场景的表面，将三维点投影到二维网格中。Given a lidar point p = (x, y, z, r, l), where (x, y, z), r and l∈{0,...,K-1} represent the position, reflectivity, respectively and the number of layers where the point is located. The point p is located in the grid (h,w) of the surface map ^Sh×w , where h=l,

Surface maps project 3D points onto a 2D grid based on the surface of the scene.

对于表面图的每个网格(h,w)，通过平均网格内的所有点，获得质心点

(h,w)网格内的深度的计算如下：For each grid (h,w) of the surface map, obtain the centroid point by averaging all points within the grid

The depth within the (h,w) grid is calculated as follows:

然后可以获得表面深度图D_map＝{d}∈R^H×W。表面深度图将深度信息存储在每个网格中，并将在后文视图转换模块中使用。The surface depth _{map Dmap} ={d}∈R ^H×W can then be obtained. The surface depth map stores depth information in each grid and will be used later in the view conversion module.

表面网络(SurfaceNet)SurfaceNet

SurfaceNet提出了一种利用表面图(Surface map)表示来预测物体的精确检测框方法。它由四个模块组成(如图1所示)：1)网格特征编码器，可以处理每个网格内任意数量的点；2)表面特征卷积模块，采用二维主干网络提取高层次特征；3)视图转换模块，将特征从前视图Front View(FV)转换为鸟瞰图Bird’s Eye View(BEV)；以及4)3D检测框参数的预测和无锚点3D中心热图预测。SurfaceNet proposes an accurate detection box method for predicting objects using surface map representations. It consists of four modules (shown in Figure 1): 1) a grid feature encoder, which can process any number of points within each grid; 2) a surface feature convolution module, which uses a 2D backbone network to extract high-level features; 3) a view conversion module, which converts features from Front View (FV) to Bird's Eye View (BEV); and 4) 3D detection box parameter prediction and anchor-free 3D center heatmap prediction.

网格特征编码器Grid Feature Encoder

由于点云的不规则性，网格内的点个数是任意的。网格特征编码器设计用于将任意数量的点编码为具有固定维度C的稠密特征，如图1(a)所示。Due to the irregularity of the point cloud, the number of points in the grid is arbitrary. The trellis feature encoder is designed to encode an arbitrary number of points into dense features with a fixed dimension C, as shown in Fig. 1(a).

本发明的编码器基于体素特征编码(Voxel Feature Encoder)层。VFE层处理Surface map的每个网格以生成该网格的特征，从而生成规则的2D表曲面特征图

如果网格没有任何点，则使用零填充。此外，本发明的网格特征编码器不执行VFE中的随机采样。这里有两个原因：1)每个网格点的数量很少；2)每个网格之间点的分布是近似均匀的，并且不需要减少点的不均衡性。The encoder of the present invention is based on a Voxel Feature Encoder layer. The VFE layer processes each mesh of the Surface map to generate features for that mesh, resulting in a regular 2D surface feature map

If the grid does not have any points, zero padding is used. Furthermore, the trellis feature encoder of the present invention does not perform random sampling in the VFE. There are two reasons for this: 1) the number of points per grid is small; 2) the distribution of points between each grid is approximately uniform and there is no need to reduce the unevenness of points.

为了便于多尺度三维目标检测，本发明采用了N种不同分辨率的表面图，(即S^H×W,

)。由网格特征编码器对其分别进行独立处理，生成三个表面特征图(即S^H×W,

)。然后，由特征串接得到一个多尺度表面特征F∈R^3C×H×W：In order to facilitate multi-scale three-dimensional target detection, the present invention adopts N surface maps with different resolutions, (ie S ^H×W ,

). They are processed independently by the grid feature encoder to generate three surface feature maps (i.e. S ^H×W ,

). Then, a multi-scale surface feature F∈R ^3C×H×W is obtained by feature concatenation:

此多尺度表面特征用作后续模块的初始输入。为了更加清晰表示，本发明省略了“多尺度(multi-scale)”，并使用表面特征(surface feature)来表示多尺度表面特征(multi-scale surface feature)。This multiscale surface feature is used as the initial input for subsequent modules. For clearer representation, the present invention omits "multi-scale" and uses surface features to represent multi-scale surface features.

表面特征卷积模块Surface Feature Convolutional Module(SFCM)Surface Feature Convolutional Module (SFCM)

由于表面特征的感受野非常有限(即，仅在其底层网格内)，因此本发明使用2D卷积神经网络(参见图1(b))来更有效地逐步扩大感受野。Since the receptive field of a surface feature is very limited (ie, only within its underlying grid), the present invention uses a 2D convolutional neural network (see Fig. 1(b)) to more efficiently progressively enlarge the receptive field.

一般而言，由2D卷积神经网络生成的特征图由于算力原因，具有比其输入图像更低的分辨率。为了避免低分辨率特征引起的性能下降(特别是在小目标检测中)，本发明设计了一个表面特征卷积模块SFCM，并通过增加一个低分辨率输出的网络反卷积层来获得全分辨率输出。因此，该模块生成的

中的前视图特征具有与其输入表面特征F相同的分辨率，但特征的维度不同。In general, feature maps generated by 2D convolutional neural networks have lower resolution than their input images due to computational power. In order to avoid performance degradation caused by low-resolution features (especially in small target detection), the present invention designs a surface feature convolution module SFCM, and obtains full resolution by adding a low-resolution output network deconvolution layer rate output. Therefore, the module generates

The front view features in F have the same resolution as their input surface features F, but the dimensions of the features are different.

视图转换模块view conversion module

前视图特征将单个网格的局部表面信息及其邻接关系嵌入到前视图中。然而，从前视图特征中直接预测绝对深度信息是很困难的。但是，根据前视图特征的位置，本发明可以判别出高度和宽度信息。因此，本发明提出了基于深度表面图的前视图特征从前视图到鸟瞰图的视图转换模块，如图1(c)所示。The Front View feature embeds the local surface information of a single mesh and its adjacencies into the front view. However, it is difficult to directly predict absolute depth information from front view features. However, based on the location of the front view feature, the present invention can discriminate height and width information. Therefore, the present invention proposes a view conversion module from the front view to the bird's-eye view based on the front view features of the depth surface map, as shown in Fig. 1(c).

使用视图转换模块的原因在于：1)不同对象的深度不同，但是从2d前视伪图像中获取的绝对深度是不相等的；2)而不同物体的高度是相似的，因为它们总是站在地面上。因此，本发明可以很容易地从俯视图(BEV)特征中得到物体的深度，并在视图变换后对高度进行回归。The reasons for using the view transformation module are: 1) the depths of different objects are different, but the absolute depths obtained from the 2d front-view pseudo-image are not equal; 2) the heights of different objects are similar because they always stand on the ground. Therefore, the present invention can easily obtain the depth of an object from top view (BEV) features and regress the height after view transformation.

具体来说，视图转换模块有两个步骤：扩展和压缩。在扩展步骤中，FV特征中每个(h,w)位置的特征f将会根据深度信息D映射到扩增特征图E的相应位置(d,h,w)，其中Specifically, the view transformation module has two steps: expansion and compression. In the expansion step, the feature f at each (h, w) position in the FV feature will be mapped to the corresponding position (d, h, w) of the augmented feature map E according to the depth information D, where

这里R是最大深度范围。如果D_map(h,w)>R，则设置d＝D。Here R is the maximum depth range. If _Dmap (h,w)>R, then set d=D.

在压缩步骤中，本发明通过随机选择在其H轴上挤压扩展的特征图，得到2D特征图，其大小为D×W，维度c'。In the compression step, the present invention obtains a 2D feature map by randomly selecting and extruding the expanded feature map on its H-axis, the size of which is D×W and the dimension c′.

最后，使用M个连续的2D卷积层来处理上述输出，从而得到最终的俯视图特征图。Finally, M consecutive 2D convolutional layers are used to process the above output, resulting in the final top-view feature map.

无锚点框的三维目标检测3D object detection without anchor boxes

图2所示，本发明将3D对象视为一些具有属性的点。从热图H^O派生的点表示俯视图中检测对象中心的位置(即，x,z)，而参数图P^O包含对象的参数，例如高度y、大小(h，w，l)和旋转角度θ。本发明的检测网络由一个公共特征提取器和两个分支(即热图分支和参数分支)组成。本发明使用与VoxelNet中的RPN相似的公共特征提取模块。不同的是，本发明使用二维卷积层直接处理视图转换模块输出的二维特征，而非三维卷积层。热图和参数图分支具有相同的拓扑结构，由M个连续的2D卷积层组成。As shown in Figure 2, the present invention treats 3D objects as some points with attributes. The points derived from the heatmap ^HO represent the location (i.e., x, z) of the center of the detected object in the top view, while the parameter map PO contains the parameters of the object such as height y, size (h, w, ^l ) and rotation angle θ . The detection network of the present invention consists of a common feature extractor and two branches (ie, the heatmap branch and the parameter branch). The present invention uses a common feature extraction module similar to RPN in VoxelNet. The difference is that the present invention uses a two-dimensional convolutional layer to directly process the two-dimensional features output by the view conversion module instead of a three-dimensional convolutional layer. The heatmap and parameter map branches have the same topology and consist of M consecutive 2D convolutional layers.

如图3所示，本发明在KITTI 3D目标检测数据集上评估SurfaceNet，该数据集包含7481个训练和7518个测试点云。评价采用三个难度等级：容易、中等和困难。由于访问KITTI测试服务器的次数有限，本发明通过将官方训练集分割为3712个点云进行训练和3769个点云进行验证来评估本发明的方法。3D边界框相交超过并集(IoU)阈值设置为0.25％，用于检测行人。此外，使用PointRCNN的离线评估代码来获得本发明方法的度量。进一步的，由图3可知，SurfaceNet为66.17％，优于最先进的方法(如AVOD-FP和PointPillars)超过7％。并且，本发明的方法仅使用激光雷达点云，而AVOD-FPN使用点云和RGB图像。As shown in Figure 3, the present invention evaluates SurfaceNet on the KITTI 3D object detection dataset, which contains 7481 training and 7518 testing point clouds. The assessment uses three difficulty levels: easy, medium and hard. Due to the limited number of visits to the KITTI test server, the present invention evaluates the method of the present invention by dividing the official training set into 3712 point clouds for training and 3769 point clouds for validation. The 3D bounding box intersection over union (IoU) threshold is set to 0.25% for pedestrian detection. Furthermore, the offline evaluation code of PointRCNN is used to obtain the metrics of the method of the present invention. Further, it can be seen from Figure 3 that SurfaceNet is 66.17%, outperforming the state-of-the-art methods (such as AVOD-FP and PointPillars) by more than 7%. Also, the method of the present invention uses only Lidar point cloud, while AVOD-FPN uses point cloud and RGB image.

损失函数loss function

本发明的SurfaceNet预测3D预测框的中心热图H^O∈R^D×W和参数图P^O∈R^5×D×W。H^O用于确定(x，z)平面中对象的中心，而P^O用于回归高度(y)、大小(w，h，l)和旋转角度θ。The SurfaceNet of the present invention predicts the center heatmap H ^O ∈ R ^D×W and the parameter map P ^O ∈ R ^5×D×W of the 3D prediction frame. HO is used to determine the center of the object in the (x, z) plane, while PO is used to ^regress height (y), size (w, h, ^l ) and rotation angle θ.

对于每个中心热图的回归，本发明使用均方误差损失：For the regression of each center heatmap, the present invention uses a mean squared error loss:

L_hm＝MSE(H^gt-H^o)L _hm =MSE(H ^gt -H ^o )

H^gt是通过物体真值中心的位置(x，z)经过高斯热图生成的。H ^gt is generated through a Gaussian heatmap through the position (x, z) of the object's ground-truth center.

对于每一个参数，本发明使用每一个参数的平滑L1损失的总和作为总的损失函数。For each parameter, the present invention uses the sum of the smoothed L1 losses for each parameter as the total loss function.

其中Δy,Δw,Δh,Δl和Δθ为相应属性的损失where Δy, Δw, Δh, Δl and Δθ are the losses of the corresponding properties

高度损失Δy使用真值和预测值之间的误差定义：The height loss Δy is defined using the error between the true and predicted values:

Δy＝y^gt-y^o Δy=y ^gt -y ^o

预测框大小的损失{Δw,Δh,Δl}使用对数损失：The loss of predicted box size {Δw,Δh,Δl} uses a logarithmic loss:

旋转损失定义：Spin loss definition:

Δθ＝sin(θ^gt-θ^o)Δθ=sin(θ ^gt -θ ^o )

在训练过程中，y^o,w^o,h^o,l^o和θ^o由参数预测图P^o根据真值3D边界框的中心位置得出得参数，而y^gt,w^gt,h^gt,l^gt和θ^gt则是对应物体真值的参数。During training, y ^o , w ^o , h ^o , l ^o and θ ^o are obtained from the parameter prediction map P ^o according to the center position of the ground-truth 3D bounding box, while y ^gt , w ^gt , h ^gt , l ^gt and θ ^gt are the parameters corresponding to the true value of the object.

最后，总的损失函数定义如下：Finally, the overall loss function is defined as follows:

L＝L_hm+βL_loc L=L _hm +βL _loc

其中β是用于调整两个损失项之间的平衡参数。当然，本发明所描述的总的损失函数只是其中的一种，作为以此为基础所作出的函数变形，应当属于本发明的保护范围之内。where β is the balance parameter used to adjust the balance between the two loss terms. Of course, the total loss function described in the present invention is only one of them, and the functional deformation based on this should fall within the protection scope of the present invention.

对于本领域的技术人员来说，可以根据以上的技术方案和构思，给出各种相应的改变和变形，而所有的这些改变和变形，都应该包括在本发明权利要求的保护范围之内。For those skilled in the art, various corresponding changes and deformations can be given according to the above technical solutions and concepts, and all these changes and deformations should be included within the protection scope of the claims of the present invention.

Claims

1. A three-dimensional target detection method based on laser radar point cloud data is characterized by comprising the following steps:

the point cloud is represented as a dense surface map with a number of lines K, where K is the number of channels of the lidar, a lidar point p is given (x, y, z, r,) l, where (x, y, z), r and l ∈ { 0.., K-1} represent the position, reflectivity, and number of layers at which the point p is located on the surface map S^h×wIn the grid (h, w) of (a), wherein h ═ l,

surface map three-dimensional points are projected into a two-dimensional grid according to the surface of the scene, and for each grid (h, w) of the surface map centroid points are obtained by averaging all points within the grid

The depth within the (h, w) grid is calculated as follows:

a surface depth map D may then be obtained_map＝{d}∈R^H×W(ii) a The surface depth map stores depth information in each mesh;

grid feature encoder based on voxel feature encoding layer, voxel feature encoding layer processes each grid of surface map to generate features of the grid, thereby generating regular 2D table surface feature map

If the grid has no points, zero padding is used; the grid characteristic encoder does not execute random sampling in the voxel characteristic encoding layer;

surface maps with N different resolutions, i.e. S^H×W,

The three surface feature maps are generated by independently processing the three surface feature maps by a grid feature encoder respectively, namely S^H×W,

Then, a multi-scale surface feature F ∈ R is obtained by feature concatenation^3C×H×W：

This multi-scale surface feature is used as an initial input for subsequent modules;

the system comprises a surface feature convolution module and a network deconvolution layer with low resolution output, wherein the network deconvolution layer is added to obtain full resolution output; generated by a surface feature convolution module

The front view features in (a) have the same resolution as their input surface features F, but the dimensions of the features are different;

a view transformation module having front view features based on a depth surface map from a front view to a bird's eye view, the depths of different objects being different but the absolute depths obtained from the 2d front view pseudo-image being unequal; the depth of the object is obtained from the top view features and the height is regressed after the view transformation.

From heatmap H^ODerived points represent the position of the center of the detected object in top view, i.e., x, z, while the parameter map P^OContaining the parameters of the objects, the detection network consists of one common feature extractor and two branches, namely a hot map branch and a parameter branch.

2. The lidar point cloud data-based three-dimensional target detection method of claim 1, wherein the view conversion module comprises two steps: expanding and compressing;

in the expansion step, the feature f of each (h, w) position in the FV feature is mapped to the corresponding position (D, h, w) of the augmented feature map E according to the depth information D, wherein

Where R is the maximum depth range, if D_map(h,w)>R, setting D as D;

in the compression step, a 2D characteristic diagram is obtained by randomly selecting a characteristic diagram extruded and expanded on an H axis, wherein the size of the characteristic diagram is DxW and the dimension c'; finally, the output is processed using M consecutive 2D convolutional layers, resulting in the final top view feature map.