CN110879994A - 3D visual detection method, system and device based on shape attention mechanism - Google Patents

3D visual detection method, system and device based on shape attention mechanism

Info

Publication number
CN110879994A
CN110879994A
Authority
CN
China
Prior art keywords
feature map
attention
target
module
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911213392.9A
Other languages
Chinese (zh)
Inventor
张兆翔
张驰
叶阳阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911213392.9A priority Critical patent/CN110879994A/en
Publication of CN110879994A publication Critical patent/CN110879994A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/513Sparse representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer deep reinforcement learning and pattern recognition, and particularly relates to a three-dimensional visual detection method, system and device based on a shape attention mechanism, aiming to solve the problems that the precision of a single-stage detector is lower than that of a two-stage detector, while the two-stage detector is time-consuming and unsuitable for real-time systems. The invention comprises the following steps: representing the point cloud data by three-dimensional grid voxels; extracting features and encoding a spatially sparse feature map; extracting features at different scales after projection to a top view; merging the features with deconvolution layers; extracting a shape attention feature map through attention weights and a convolutional encoding layer; and acquiring the target category, position, size and orientation through a target classification network and a regression positioning network. The invention uses a sampling strategy based on distance constraints and an attention mechanism based on shape priors, alleviates the instability caused by uneven data distribution, remedies the lack of shape priors in single-stage detectors, and achieves high precision, short detection time, strong real-time performance and good robustness.

Description

3D visual detection method, system and device based on shape attention mechanism

Technical Field

The invention belongs to the fields of deep reinforcement learning, computer vision, pattern recognition and machine learning, and particularly relates to a three-dimensional visual detection method, system and device based on a shape attention mechanism.

Background Art

A 3D object detector needs to output reliable spatial and semantic information, namely the 3D position, orientation, occupied volume and category of each object. Compared with 2D object detection, 3D detection provides more detailed information, but modeling it is more difficult. 3D object detection generally relies on range sensors, such as lidar, TOF cameras and stereo cameras, to predict more meaningful and accurate results, and it has become a key technology for self-driving cars, UAVs and robotics. Most accurate 3D object detection algorithms for traffic scenes are based on lidar, which has become the basic sensor for outdoor scene perception, and perceiving the targets in a traffic scene is the key technology by which an unmanned vehicle localizes the objects around it.

Lidar-based 3D object detection involves two important problems. The first is how to generate descriptive low-level features for the sparse, non-uniform point clouds sampled by a lidar sensor. Lidar points are dense close to the sensor and sparse far away from it, and this uneven distribution degrades detection performance and makes the results unstable. Many methods rely on hand-crafted feature extraction; however, because hand-crafted features do not properly account for the unbalanced distribution of laser points, the resulting detectors are not stable. Object detection and segmentation both play an extremely important role in the understanding and perception of visual data. The second problem is how to effectively encode 3D shape information to obtain better discriminative embeddings. There are two main frameworks for 3D object detection: single-stage detectors and two-stage detectors. Single-stage detectors are more efficient, while two-stage detectors are more accurate. Two-stage detectors are inefficient because the region proposal network outputs regions of interest (ROIs) that must be cropped; these cropped ROIs, however, provide a shape prior for each detected object, so the subsequent refinement network can reach higher accuracy. Lacking shape priors and a refinement network, single-stage detectors perform below two-stage detectors, yet for a real-time system the two-stage detector is too time-consuming. Moreover, a 3D shape prior is better suited to the detection of 3D objects.

SUMMARY OF THE INVENTION

In order to solve the above problem in the prior art, namely that the accuracy of single-stage 3D object detectors is lower than that of two-stage detectors while two-stage detectors are time-consuming and unsuitable for real-time systems, the present invention provides a three-dimensional visual detection method based on a shape attention mechanism, the method comprising:

Step S10, acquiring laser point cloud data containing a target object as data to be detected, and representing the data to be detected by voxels based on a three-dimensional grid;

Step S20, obtaining the feature expression of the voxels through a feature extractor and performing sparse convolutional encoding to obtain a spatially sparse feature map corresponding to the data to be detected;

Step S30, projecting the spatially sparse feature map onto a two-dimensional top-view plane, extracting features at different scales through a feature pyramid convolutional network, and merging the features of the different scales through deconvolution layers to obtain a top-view feature map;

Step S40, obtaining an attention weight feature map of the top-view feature map through an attention weight layer, and obtaining an encoded feature map of the top-view feature map through a convolutional encoding layer;

Step S50, multiplying the attention weight feature map onto the corresponding regions of the encoded feature map, and performing feature concatenation to obtain an attention feature map;

Step S60, based on the attention feature map, obtaining the target category in the data to be detected through a trained target classification network, and obtaining the target position, size and orientation in the data to be detected through a trained target regression positioning network.

In some preferred embodiments, in step S10, "representing the data to be detected by voxels based on a three-dimensional grid" is carried out as:

D = { [xi, yi, zi, Ri], i = 1, …, N }

where D represents the voxel representation of the laser point cloud data, xi, yi, zi represent the three-dimensional position of the i-th point in the laser point cloud data relative to the lidar, and Ri represents the reflectivity of the i-th point.

In some preferred embodiments, in step S20, "obtaining the feature expression of the voxels through a feature extractor and performing sparse convolutional encoding to obtain a spatially sparse feature map corresponding to the data to be detected" is carried out as:

fs(x, y, z) = F(D)

where F() represents obtaining the feature expression of the voxels through the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) are the spatial coordinates of the spatially sparse feature map.

In some preferred embodiments, in step S40, "obtaining the attention weight feature map of the top-view feature map through the attention weight layer" is carried out as:

Fatt(u,v) = Convatt(FFPN(u,v))

where Fatt(u,v) represents the attention weight feature map corresponding to the top-view feature map, FFPN(u,v) represents the top-view feature map, and Convatt() represents the convolution operation of the attention weight layer.

In some preferred embodiments, in step S40, "obtaining the encoded feature map of the top-view feature map through the convolutional encoding layer" is carried out as:

Fen(u,v) = Conven(FFPN(u,v))

where Fen(u,v) represents the encoded feature map corresponding to the top-view feature map, FFPN(u,v) represents the top-view feature map, and Conven() represents the convolution operation of the convolutional encoding layer.

In some preferred embodiments, in step S50, "multiplying the attention weight feature map onto the corresponding regions of the encoded feature map and performing feature concatenation to obtain the attention feature map" is carried out as:

Fop(u,v) = Fen(u,v) · Repeat(Reshape(Fatt(u,v)))

where Reshape() represents the reshaping operation and Repeat() represents the copy operation;

Fhybrid(u,v) = [Fen(u,v), Fop(u,v)]

where [ ] represents the feature concatenation operation.

In some preferred embodiments, the target classification network is trained with a cross-entropy loss function, which is:

L = −(1/N) Σi [yi log(xi) + (1 − yi) log(1 − xi)]

where N represents the number of samples over which the loss is calculated; yi represents the positive and negative labels, with 0 denoting a negative sample and 1 a positive sample; and xi represents the network output value for the sample.

In some preferred embodiments, the target regression positioning network is trained with a Smooth L1 loss function, which is:

SmoothL1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise

where x represents the residual to be regressed.

In another aspect of the present invention, a three-dimensional visual detection system based on a shape attention mechanism is provided. The system comprises an input module, a sparse convolution encoding module, a feature pyramid module, an attention weight convolution module, an encoding convolution module, a feature fusion module, a target classification module, a target positioning module, and an output module;

the input module is configured to acquire laser point cloud data containing a target object as data to be detected, and to represent the data to be detected by voxels based on a three-dimensional grid;

the sparse convolution encoding module is configured to obtain the feature expression of the voxels through a feature extractor and to perform sparse convolutional encoding, obtaining a spatially sparse feature map corresponding to the data to be detected;

the feature pyramid module is configured to project the spatially sparse feature map onto a two-dimensional top-view plane, extract features at different scales through a feature pyramid convolutional network, and merge the features of the different scales through deconvolution layers to obtain a top-view feature map;

the attention weight convolution module is configured to obtain an attention weight feature map of the top-view feature map through an attention weight layer;

the encoding convolution module is configured to obtain an encoded feature map of the top-view feature map through a convolutional encoding layer;

the feature fusion module is configured to multiply the attention weight feature map onto the corresponding regions of the encoded feature map and to perform feature concatenation, obtaining an attention feature map;

the target classification module is configured to obtain, based on the attention feature map, the target category in the data to be detected through a trained target classification network;

the target positioning module is configured to obtain, based on the attention feature map, the target position, size and orientation in the data to be detected through a trained target regression positioning network;

the output module is configured to output the acquired target category and the target position, size and orientation.

In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above three-dimensional visual detection method based on a shape attention mechanism.

In a fourth aspect of the present invention, a processing device is provided, comprising a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the above three-dimensional visual detection method based on a shape attention mechanism.

Beneficial effects of the present invention:

The three-dimensional visual detection method based on a shape attention mechanism of the present invention uses a sampling strategy based on distance constraints, which effectively alleviates the unstable results caused by the uneven distribution of lidar point cloud data, and addresses the lack of shape priors in single-stage detectors through an attention mechanism based on shape priors. The method improves the detection performance of current single-stage 3D object detectors, especially for targets with distinctive shapes, offering high detection accuracy, short detection time, suitability for real-time systems, and good model robustness.

Description of the Drawings

Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:

FIG. 1 is a schematic flowchart of the three-dimensional visual detection method based on a shape attention mechanism of the present invention;

FIG. 2 is a schematic diagram of the algorithm structure of an embodiment of the three-dimensional visual detection method based on a shape attention mechanism of the present invention;

FIG. 3 is an example of the data set and detection results of an embodiment of the three-dimensional visual detection method based on a shape attention mechanism of the present invention;

FIG. 4 is a comparison of the detection results of the method of the present invention and other methods for an embodiment of the three-dimensional visual detection method based on a shape attention mechanism.

Detailed Description

The present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the related invention, not to limit it. It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings.

It should be noted that the embodiments of the present application and the features therein may be combined with each other where no conflict arises. The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

A three-dimensional visual detection method based on a shape attention mechanism of the present invention comprises:

Step S10, acquiring laser point cloud data containing a target object as data to be detected, and representing the data to be detected by voxels based on a three-dimensional grid;

Step S20, obtaining the feature expression of the voxels through a feature extractor and performing sparse convolutional encoding to obtain a spatially sparse feature map corresponding to the data to be detected;

Step S30, projecting the spatially sparse feature map onto a two-dimensional top-view plane, extracting features at different scales through a feature pyramid convolutional network, and merging the features of the different scales through deconvolution layers to obtain a top-view feature map;

Step S40, obtaining an attention weight feature map of the top-view feature map through an attention weight layer, and obtaining an encoded feature map of the top-view feature map through a convolutional encoding layer;

Step S50, multiplying the attention weight feature map onto the corresponding regions of the encoded feature map, and performing feature concatenation to obtain an attention feature map;

Step S60, based on the attention feature map, obtaining the target category in the data to be detected through a trained target classification network, and obtaining the target position, size and orientation in the data to be detected through a trained target regression positioning network.

In order to describe the three-dimensional visual detection method based on a shape attention mechanism of the present invention more clearly, each step of the method embodiment is detailed below with reference to FIG. 1.

The three-dimensional visual detection method based on a shape attention mechanism according to an embodiment of the present invention comprises steps S10 to S60, each described in detail as follows:

Step S10, acquiring laser point cloud data containing a target object as data to be detected, and representing the data to be detected by voxels based on a three-dimensional grid, as shown in Equation (1):

D = { [xi, yi, zi, Ri], i = 1, …, N }    Equation (1)

where D represents the voxel representation of the laser point cloud data, xi, yi, zi represent the three-dimensional position of the i-th point in the lidar point cloud, and Ri represents the reflectivity of the i-th point.

Suppose the lidar point cloud occupies a three-dimensional space of extent H, W, D, denoting the height in the vertical direction and the position and distance in the horizontal directions, and that each voxel has size ΔH×ΔW×ΔD with ΔH = 0.4 m, ΔW = 0.2 m, ΔD = 0.2 m. The dimensions of the voxel grid over the whole space are then H/ΔH, W/ΔW, D/ΔD. Each voxel is then given a feature expression by a voxel feature encoding (VFE) layer. In one embodiment of the present invention, the feature extractor describes each sample point in a voxel with a 7-dimensional vector (the three-dimensional coordinates, the reflectivity, and the relative three-dimensional coordinates within the voxel), and the coordinates (Px, Py) of the current pillar center are appended to each sample, so that the description vector of each sample point becomes 9-dimensional. In one embodiment of the present invention, the feature encoding (VFE) layer consists of a linear layer, a batch normalization (BN) layer and a rectified linear unit (ReLU) layer that extract the vector features of the points.
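Under the stated layer composition, a minimal PyTorch sketch of such a VFE layer might look as follows; the class name, the 64-channel output width and the max-pooling aggregation over a voxel's points are illustrative assumptions rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """Sketch of the voxel feature encoding described above: a linear
    layer, batch normalization and ReLU applied to the 9-dim descriptors
    of the points inside each voxel."""
    def __init__(self, in_dim=9, out_dim=64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, points):
        # points: (num_voxels, max_points_per_voxel, 9)
        n_vox, n_pts, _ = points.shape
        x = self.linear(points.reshape(n_vox * n_pts, -1))
        x = torch.relu(self.bn(x)).reshape(n_vox, n_pts, -1)
        # pool the per-point features into one feature per voxel
        return x.max(dim=1).values  # (num_voxels, out_dim)
```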

Step S20, obtaining the feature expression of the voxels through the feature extractor and performing sparse convolutional encoding to obtain the spatially sparse feature map corresponding to the data to be detected, as shown in Equation (2):

fs(x, y, z) = F(D)    Equation (2)

where F() represents obtaining the feature expression of the voxels through the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) are the spatial coordinates of the spatially sparse feature map.
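One way to picture the spatially sparse feature map of Equation (2) is to scatter the per-voxel features back to their grid coordinates. The sketch below uses a dense tensor for clarity; the sparse convolution encoding itself would in practice be performed with a sparse-convolution library, which is an assumption not fixed by the text.

```python
import torch

def scatter_to_dense(voxel_feats, coords, grid_shape):
    """voxel_feats: (num_voxels, C) features from the extractor;
    coords: (num_voxels, 3) integer (long) grid indices as (z, y, x);
    grid_shape: (D, H, W) size of the voxel grid.
    Returns a (C, D, H, W) volume with zeros at empty voxels."""
    C = voxel_feats.shape[1]
    D, H, W = grid_shape
    dense = torch.zeros(C, D, H, W, dtype=voxel_feats.dtype)
    dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = voxel_feats.t()
    return dense
```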

Step S30, projecting the spatially sparse feature map onto a two-dimensional top-view plane, extracting features at different scales through a feature pyramid convolutional network, and merging the features of the different scales through deconvolution layers to obtain a top-view feature map.

Projecting the spatially sparse feature map fs(x, y, z) onto the top view (i.e., the bird's-eye view) means compressing its vertical dimension to obtain the top-view feature map f2D(u, v). Concretely, if the original features have shape (C, D, H, W), the height dimension is folded into the feature channels to give (C×D, H, W), and 2D convolutional features are computed on this top-view map. Features at different scales of f2D(u, v) are extracted by the feature pyramid convolutional network and merged by deconvolution layers to give the feature map fFPN(u, v). In one embodiment of the present invention, the feature pyramid convolutional network consists of three convolution groups with (3, 5, 5) convolutional layers respectively, each convolutional layer being followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) layer.
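The sketch below illustrates the projection and the pyramid merge under simplifying assumptions: each of the three groups is reduced to one strided convolution, the channel width is fixed to 128, and H and W are assumed divisible by 8; only the reshape that folds the height into the channels follows directly from the text.

```python
import torch
import torch.nn as nn

class TopViewFPN(nn.Module):
    """Sketch of the top-view feature pyramid: the 3D feature volume is
    flattened to a 2D map, passed through three strided convolution
    groups, and the multi-scale outputs are upsampled by deconvolution
    and merged by concatenation."""
    def __init__(self, in_ch, ch=128):
        super().__init__()
        def group(cin):
            return nn.Sequential(nn.Conv2d(cin, ch, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(ch), nn.ReLU())
        self.g1, self.g2, self.g3 = group(in_ch), group(ch), group(ch)
        self.up1 = nn.ConvTranspose2d(ch, ch, 1, stride=1)
        self.up2 = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.up3 = nn.ConvTranspose2d(ch, ch, 4, stride=4)

    def forward(self, volume):
        # volume: (N, C, D, H, W) -> top view (N, C*D, H, W)
        n, c, d, h, w = volume.shape
        x = volume.reshape(n, c * d, h, w)
        f1 = self.g1(x)   # stride 2
        f2 = self.g2(f1)  # stride 4
        f3 = self.g3(f2)  # stride 8
        # bring all scales back to stride 2 and concatenate
        return torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)
```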

Step S40, obtaining the attention weight feature map of the top-view feature map through the attention weight layer, and obtaining the encoded feature map of the top-view feature map through the convolutional encoding layer.

The attention weight feature map of the top-view feature map is obtained through the attention weight layer, as shown in Equation (3):

Fatt(u,v) = Convatt(FFPN(u,v))    Equation (3)

where Fatt(u,v) represents the attention weight feature map corresponding to the top-view feature map, FFPN(u,v) represents the top-view feature map, and Convatt() represents the convolution operation of the attention weight layer.

The encoded feature map of the top-view feature map is obtained through the convolutional encoding layer, as shown in Equation (4):

Fen(u,v) = Conven(FFPN(u,v))    Equation (4)

where Fen(u,v) represents the encoded feature map corresponding to the top-view feature map, FFPN(u,v) represents the top-view feature map, and Conven() represents the convolution operation of the convolutional encoding layer.

Step S50, multiplying the attention weight feature map onto the corresponding regions of the encoded feature map, and performing feature concatenation to obtain the attention feature map, as shown in Equations (5) and (6):

Fop(u,v) = Fen(u,v) · Repeat(Reshape(Fatt(u,v)))    Equation (5)

where Reshape() represents the reshaping operation and Repeat() represents the copy operation;

Fhybrid(u,v) = [Fen(u,v), Fop(u,v)]    Equation (6)

where [ ] represents the feature concatenation operation.
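A sketch of Equations (3)-(6) in the same style follows; the 3×3 kernels and the sigmoid that keeps the attention weights in [0, 1] are assumptions, since the text fixes only the operations themselves.

```python
import torch
import torch.nn as nn

class ShapeAttentionHead(nn.Module):
    """Sketch of the shape attention fusion: Conv_att produces a
    one-channel weight map (Eq. 3), Conv_en an encoded map (Eq. 4);
    the weights are broadcast over the channels and multiplied onto
    the encoding (Eq. 5), and the two maps are concatenated (Eq. 6)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_att = nn.Conv2d(in_ch, 1, 3, padding=1)
        self.conv_en = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, f_fpn):                        # (N, C, H, W)
        f_att = torch.sigmoid(self.conv_att(f_fpn))  # (N, 1, H, W)
        f_en = self.conv_en(f_fpn)                   # (N, out_ch, H, W)
        f_op = f_en * f_att.expand_as(f_en)          # Repeat + multiply
        return torch.cat([f_en, f_op], dim=1)        # F_hybrid
```

For example, with an fFPN of shape (N, 256, H, W), ShapeAttentionHead(256, 128) would return a hybrid map with 256 channels that feeds the two detection heads.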

Step S60, based on the attention feature map, obtaining the target category in the data to be detected through the trained target classification network, and obtaining the target position, size and orientation in the data to be detected through the trained target regression positioning network.

As shown in FIG. 2, the algorithm structure of an embodiment of the three-dimensional visual detection method based on a shape attention mechanism of the present invention is divided into three parts: the first part is a distance-based voxel generator, which converts the input lidar point cloud into voxels; the second part is the feature extraction layers, which encode the voxel features and the three-dimensional spatial features; the third part is the attention region proposal network (Attention RPN), which injects the attention mechanism and outputs the detection results.

The target classification network is trained with the cross-entropy loss function shown in Equation (7):

L = −(1/N) Σi [yi log(xi) + (1 − yi) log(1 − xi)]    Equation (7)

where N represents the number of samples over which the loss is calculated; yi represents the positive and negative labels, with 0 denoting a negative sample and 1 a positive sample; and xi represents the network output value for the sample.
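Read literally, Equation (7) is the binary cross-entropy averaged over the N samples; a direct sketch, assuming the network outputs xi are already probabilities:

```python
import torch

def classification_loss(outputs, labels):
    """Equation (7): labels are 1 for positive and 0 for negative
    samples; outputs are clamped away from 0 and 1 for stability."""
    eps = 1e-7
    p = outputs.clamp(eps, 1 - eps)
    return -(labels * torch.log(p) + (1 - labels) * torch.log(1 - p)).mean()
```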

The target regression positioning network is trained with the Smooth L1 loss function shown in Equation (8):

SmoothL1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise    Equation (8)

where x represents the regression residual.
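Equation (8) translates directly:

```python
import torch

def smooth_l1(residual):
    """Equation (8): quadratic near zero, linear for |x| >= 1."""
    abs_x = residual.abs()
    return torch.where(abs_x < 1, 0.5 * residual ** 2, abs_x - 0.5)
```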

The attention feature map Fhybrid(u, v) is fed to the target classification network and the target regression positioning network: the target classification network judges whether a detected object is a target, and the target regression positioning network obtains the position, size and orientation of the detected object.

In one embodiment of the present invention, for the car category in the target classification task, an anchor whose intersection-over-union (IOU) with a target exceeds 0.6 is taken as a positive sample and one whose IOU is below 0.45 as a negative sample; for the pedestrian and cyclist categories, an anchor with IOU above 0.5 is a positive sample and one with IOU below 0.35 a negative sample. For the regression positioning task, the predefined anchor for the car target is set to width × length × height of (1.6 × 3.9 × 1.5) meters; for the pedestrian target, (0.6 × 0.8 × 1.73) meters; and for the cyclist target, (0.6 × 1.76 × 1.73) meters. A three-dimensional ground-truth bounding box is defined as xg, yg, zg, lg, wg, hg, θg, where x, y, z is the center of the box, l, w, h are the length, width and height of the three-dimensional target, and θ is the rotation of the target about the Z axis; *g denotes the ground-truth value, *a the positive-sample anchor, and Δ* the corresponding residual. Through network learning, the position, size and orientation of the real three-dimensional target are predicted. The residuals of the box center position (Δx, Δy, Δz), of the length, width and height of the three-dimensional target (Δl, Δw, Δh), and of the rotation about the Z axis (Δθ) are shown in Equations (9), (10) and (11), respectively:

Δx = (xg − xa)/da,  Δy = (yg − ya)/da,  Δz = (zg − za)/ha,  where da = ((la)² + (wa)²)^(1/2)    Equation (9)

Δl = log(lg/la),  Δw = log(wg/wa),  Δh = log(hg/ha)    Equation (10)

Δθ = sin(θg − θa)    Equation (11)
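The target encoding of Equations (9)-(11) can be sketched as below; the diagonal normalizer da = sqrt(la² + wa²) follows the common convention for this box parameterization and, like the reconstruction above, is an assumption where the text leaves it implicit.

```python
import math

def encode_residuals(gt, anchor):
    """Regression targets of Equations (9)-(11) for boxes given as
    (x, y, z, l, w, h, theta) tuples."""
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = math.sqrt(la ** 2 + wa ** 2)  # anchor base diagonal (assumed)
    dx, dy = (xg - xa) / da, (yg - ya) / da
    dz = (zg - za) / ha                # Equation (9)
    dl = math.log(lg / la)
    dw = math.log(wg / wa)
    dh = math.log(hg / ha)             # Equation (10)
    dt = math.sin(tg - ta)             # Equation (11)
    return dx, dy, dz, dl, dw, dh, dt
```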

To illustrate the effectiveness of the present invention in detail, the proposed method is applied to the public autonomous-driving data set KITTI, which contains three evaluation categories. FIG. 3 shows examples of the data set and the detection results of an embodiment of the three-dimensional visual detection method based on a shape attention mechanism: the first column, Car, shows the vehicle detection results; the second column, Pedestrian, shows the pedestrian detection results; and the third column, Cyclist, shows the cyclist detection results. Each column has three sets of experimental results, each consisting of an RGB image and the top view of the lidar, with the detection results projected onto them.

In one embodiment of the present invention, for the KITTI data set, the train split is used for training and the test split for testing. FIG. 4 compares the detection results of the method of the present invention with those of other methods. The data set divides each class of test targets into three levels, easy, moderate and hard, according to the height of each target in the camera image, its occlusion level and its degree of truncation. Easy samples have a bounding box height of at least 40 pixels, a maximum truncation of 15%, and are fully visible; moderate samples have a bounding box height of at least 25 pixels, a maximum truncation of 30%, and are partly occluded; hard samples have a bounding box height of at least 25 pixels, a maximum truncation of 50%, and are hard to see. BEV denotes the top-view detection results and 3D the three-dimensional bounding box detection results. 3D object detection performance is evaluated with the PASCAL criterion (AP, average precision). In the comparison, ARPNET denotes the method of the present invention; MV3D, the multi-view 3D object detection method; ContFuse, the deep continuous fusion multi-sensor 3D object detection method; AVOD, the method aggregating multi-view data for real-time 3D object detection in driving scenes; F-PointNet, the frustum point cloud network for 3D object detection from RGB-D data; SECOND, the sparsely embedded convolutional detection method; and VoxelNet, the end-to-end learned 3D object detection method on point cloud data.

A three-dimensional visual detection system based on a shape attention mechanism according to a second embodiment of the present invention comprises an input module, a sparse convolution encoding module, a feature pyramid module, an attention weight convolution module, an encoding convolution module, a feature fusion module, a target classification module, a target positioning module, and an output module;

the input module is configured to acquire laser point cloud data containing a target object as data to be detected, and to represent the data to be detected by voxels based on a three-dimensional grid;

the sparse convolution encoding module is configured to obtain the feature expression of the voxels through a feature extractor and to perform sparse convolutional encoding, obtaining a spatially sparse feature map corresponding to the data to be detected;

the feature pyramid module is configured to project the spatially sparse feature map onto a two-dimensional top-view plane, extract features at different scales through a feature pyramid convolutional network, and merge the features of the different scales through deconvolution layers to obtain a top-view feature map;

the attention weight convolution module is configured to obtain an attention weight feature map of the top-view feature map through an attention weight layer;

the encoding convolution module is configured to obtain an encoded feature map of the top-view feature map through a convolutional encoding layer;

the feature fusion module is configured to multiply the attention weight feature map onto the corresponding regions of the encoded feature map and to perform feature concatenation, obtaining an attention feature map;

the target classification module is configured to obtain, based on the attention feature map, the target category in the data to be detected through a trained target classification network;

the target positioning module is configured to obtain, based on the attention feature map, the target position, size and orientation in the data to be detected through a trained target regression positioning network;

the output module is configured to output the acquired target category and the target position, size and orientation.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process of the system described above and the related explanations may refer to the corresponding process in the foregoing method embodiment, and are not repeated here.

It should be noted that the three-dimensional visual detection system based on a shape attention mechanism provided by the above embodiment is illustrated only by the division of the above functional modules. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the modules or steps in the embodiment of the present invention may be decomposed or combined. For example, the modules of the above embodiment may be merged into one module, or further split into multiple sub-modules, so as to accomplish all or part of the functions described above. The names of the modules and steps involved in the embodiment of the present invention are only for distinguishing the individual modules or steps and are not to be regarded as improper limitations of the present invention.

A storage device according to a third embodiment of the present invention stores a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above three-dimensional visual detection method based on a shape attention mechanism.

A processing device according to a fourth embodiment of the present invention comprises a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the above three-dimensional visual detection method based on a shape attention mechanism.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the storage device and processing device described above and the related explanations may refer to the corresponding process in the foregoing method embodiment, and are not repeated here.

Those skilled in the art should be aware that the modules and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two, and that the programs corresponding to software modules and method steps may be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the art. In order to clearly illustrate the interchangeability of electronic hardware and software, the composition and steps of each example have been described generally in terms of functionality in the above description. Whether these functions are performed in electronic hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The terms "first", "second", etc. are used to distinguish similar objects, not to describe or indicate a particular order or sequence.

The term "comprising" or any other similar term is intended to cover a non-exclusive inclusion, so that a process, method, article or device/apparatus comprising a list of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article or device/apparatus.

The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the accompanying drawings; however, those skilled in the art will readily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principle of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will fall within the protection scope of the present invention.

Claims (11)

1. A three-dimensional visual inspection method based on a shape attention mechanism is characterized by comprising the following steps:
step S10, laser point cloud data containing a target object are obtained to serve as data to be detected, and the data to be detected are represented through voxels based on a three-dimensional grid;
step S20, acquiring the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a space sparse feature map corresponding to the data to be processed;
step S30, projecting the space sparse feature map to a two-dimensional top view plane, acquiring features of different scales through a feature pyramid convolution network, and then merging the features of different scales through deconvolution layers to obtain a top view feature map;
step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;
step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;
step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
2. The three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S10, "the data to be detected is represented by voxels based on a three-dimensional grid" is performed by:
D = { [xi, yi, zi, Ri], i = 1, …, N }
wherein D represents the voxel representation of the laser point cloud data, xi, yi, zi represent the three-dimensional position information of the i-th point in the laser point cloud data relative to the lidar, and Ri represents the reflectivity of the i-th point.
3. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S20, "obtaining the feature expression of the voxel by a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be processed" includes:
fs(x, y, z) = F(D)
wherein, F () represents the feature representation of the voxel obtained by the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.
4. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S40, "obtaining the attention weight feature map of the top view feature map through the attention weight layer" includes:
Fatt(u,v)=Convatt(FFPN(u,v))
wherein Fatt(u, v) represents the attention weight feature map corresponding to the top view feature map, FFPN(u, v) represents the top view feature map, and Convatt() represents the convolution operation of the attention weight layer.
5. A three-dimensional visual inspection method based on shape attention mechanism according to claim 1, wherein in step S40, "obtaining the encoding feature map of the top view feature map by convolution encoding layer" comprises:
Fen(u,v)=Conven(FFPN(u,v))
wherein Fen(u, v) represents the coding feature map corresponding to the top view feature map, FFPN(u, v) represents the top view feature map, and Conven() represents the convolution operation of the convolutional encoding layer.
6. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S50, the method comprises the steps of multiplying the attention weight feature map to the corresponding region of the coding feature map and performing feature concatenation to obtain the attention feature map, and comprises the steps of:
Fop(u,v)=Fen(u,v)Repeat(Reshape(Fatt(u,v)))
wherein Reshape() represents the deformation operation, and Repeat() represents the copy operation;
Fhybrid(u,v) = [Fen(u,v), Fop(u,v)]
wherein [ ] represents a feature concatenation operation.
7. The three-dimensional visual inspection method based on shape attention mechanism according to any one of claims 1-6, characterized in that the object classification network is trained by cross entropy loss function; the cross entropy loss function is:
L = −(1/N) Σi [yi log(xi) + (1 − yi) log(1 − xi)]
wherein N represents the number of samples for which the loss is calculated; yi represents the positive and negative samples, with 0 representing a negative sample and 1 representing a positive sample; and xi represents the network output value of the sample.
8. The three-dimensional visual inspection method based on shape attention mechanism according to any one of claims 1-6, characterized in that the target regression positioning network is trained by Smooth L1 loss function; the Smooth L1 loss function is:
SmoothL1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise
where x represents the residual of the regression.
9. A three-dimensional visual inspection detection system based on a shape attention mechanism, characterized by comprising an input module, a sparse convolution coding module, a feature pyramid module, an attention weight convolution module, a coding convolution module, a feature fusion module, a target classification module, a target positioning module and an output module;
the input module is configured to acquire laser point cloud data containing a target object as data to be detected, the data to be detected being represented by voxels based on a three-dimensional grid;
the sparse convolution coding module is configured to obtain the feature expression of the voxel through a feature extractor and carry out sparse convolution coding to obtain a space sparse feature map corresponding to the data to be processed;
the feature pyramid module is configured to project the space sparse feature map to a two-dimensional top view plane, obtain features of different scales through a feature pyramid convolution network, and then merge the features of different scales through deconvolution layers to obtain a top view feature map;
the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;
the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;
the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;
the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;
the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;
the output module is configured to output the acquired object type, and the object position, size and direction.
10. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method for three-dimensional visual inspection based on the shape attention mechanism of any one of claims 1 to 8.
11. A processing apparatus, comprising:
A processor adapted to execute various programs; and
a storage device adapted to store a plurality of programs;
wherein the program is adapted to be loaded and executed by a processor to perform:
the three-dimensional visual inspection method based on the shape attention mechanism as set forth in any one of claims 1 to 8.
CN201911213392.9A 2019-12-02 2019-12-02 3D visual detection method, system and device based on shape attention mechanism Pending CN110879994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911213392.9A CN110879994A (en) 2019-12-02 2019-12-02 3D visual detection method, system and device based on shape attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911213392.9A CN110879994A (en) 2019-12-02 2019-12-02 3D visual detection method, system and device based on shape attention mechanism

Publications (1)

Publication Number Publication Date
CN110879994A true CN110879994A (en) 2020-03-13

Family

ID=69729811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911213392.9A Pending CN110879994A (en) 2019-12-02 2019-12-02 3D visual detection method, system and device based on shape attention mechanism

Country Status (1)

Country Link
CN (1) CN110879994A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102896630A (en) * 2011-07-25 2013-01-30 索尼公司 Robot device, method of controlling the same, computer program, and robot system
US20160063754A1 (en) * 2014-08-26 2016-03-03 The Boeing Company System and Method for Detecting a Structural Opening in a Three Dimensional Point Cloud
US20180210896A1 (en) * 2015-07-22 2018-07-26 Hangzhou Hikvision Digital Technology Co., Ltd. Method and device for searching a target in an image
CN106778856A (en) * 2016-12-08 2017-05-31 深圳大学 A kind of object identification method and device
US20180165547A1 (en) * 2016-12-08 2018-06-14 Shenzhen University Object Recognition Method and Device
US20190147245A1 (en) * 2017-11-14 2019-05-16 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN108133191A (en) * 2017-12-25 2018-06-08 燕山大学 A kind of real-time object identification method suitable for indoor environment
CN110070025A (en) * 2019-04-17 2019-07-30 上海交通大学 Objective detection system and method based on monocular image
CN110458112A (en) * 2019-08-14 2019-11-15 上海眼控科技股份有限公司 Vehicle checking method, device, computer equipment and readable storage medium storing program for executing

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YANGYANG YE ET AL: "ARPNET: attention region proposal network for 3D object detection", Science China Information Sciences *
YANGYANG YE ET AL: "SARPNET: shape attention regional proposal network for LiDAR-based 3D object detection", Neurocomputing *
YIN ZHOU ET AL: "VoxelNet: end-to-end learning for point cloud based 3D object detection", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
ZHAO HUAQING: "Prior direction angle estimation in three-dimensional object detection", Transducer and Microsystem Technologies (in Chinese) *
CHEN MIN: "Introduction to Cognitive Computing" (in Chinese), 30 April 2017 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723719A (en) * 2020-06-12 2020-09-29 中国科学院自动化研究所 Video target detection method, system and device based on category external memory
CN111862101A (en) * 2020-07-15 2020-10-30 西安交通大学 A Semantic Segmentation Method of 3D Point Clouds from the Coding Perspective of Bird's Eye View
CN111985378A (en) * 2020-08-13 2020-11-24 中国第一汽车股份有限公司 Road target detection method, device and equipment and vehicle
CN112257605A (en) * 2020-10-23 2021-01-22 中国科学院自动化研究所 3D target detection method, system and device based on self-labeled training samples
CN112418421B (en) * 2020-11-06 2024-01-23 常州大学 Road side end pedestrian track prediction algorithm based on graph attention self-coding model
CN112418421A (en) * 2020-11-06 2021-02-26 常州大学 Roadside end pedestrian trajectory prediction algorithm based on graph attention self-coding model
CN112347987A (en) * 2020-11-30 2021-02-09 江南大学 A 3D Object Detection Method Based on Multimodal Data Fusion
CN112347987B (en) * 2020-11-30 2025-01-14 江南大学 A 3D object detection method based on multi-modal data fusion
CN112464905A (en) * 2020-12-17 2021-03-09 湖南大学 3D target detection method and device
CN112464905B (en) * 2020-12-17 2022-07-26 湖南大学 3D target detection method and device
CN112668469A (en) * 2020-12-28 2021-04-16 西安电子科技大学 Multi-target detection and identification method based on deep learning
CN112884723A (en) * 2021-02-02 2021-06-01 贵州电网有限责任公司 Insulator string detection method in three-dimensional laser point cloud data
CN112884723B (en) * 2021-02-02 2022-08-12 贵州电网有限责任公司 Insulator string detection method in three-dimensional laser point cloud data
CN113095172A (en) * 2021-03-29 2021-07-09 天津大学 Point cloud three-dimensional object detection method based on deep learning
CN113269147A (en) * 2021-06-24 2021-08-17 浙江海康智联科技有限公司 Three-dimensional detection method and system based on space and shape, and storage and processing device
CN113807184A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Obstacle detection method and device, electronic equipment and automatic driving vehicle
CN114663879A (en) * 2022-02-09 2022-06-24 中国科学院自动化研究所 Target detection method and device, electronic equipment and storage medium
CN114663879B (en) * 2022-02-09 2023-02-21 中国科学院自动化研究所 Target detection method, device, electronic equipment and storage medium
CN115082902B (en) * 2022-07-22 2022-11-11 松立控股集团股份有限公司 Vehicle target detection method based on laser radar point cloud
CN115082902A (en) * 2022-07-22 2022-09-20 松立控股集团股份有限公司 Vehicle target detection method based on laser radar point cloud
CN115183782B (en) * 2022-09-13 2022-12-09 毫末智行科技有限公司 Multi-modal sensor fusion method and device based on joint space loss
CN115183782A (en) * 2022-09-13 2022-10-14 毫末智行科技有限公司 Method and device for multimodal sensor fusion based on joint space loss
CN116704464A (en) * 2023-06-14 2023-09-05 苏州科技大学 Three-dimensional object detection method, system and storage medium based on auxiliary task learning network
CN116704464B (en) * 2023-06-14 2025-05-06 苏州科技大学 Three-dimensional target detection method, system and storage medium based on auxiliary task learning network

Similar Documents

Publication Publication Date Title
CN110879994A (en) 3D visual detection method, system and device based on shape attention mechanism
CN113052109B (en) A 3D object detection system and a 3D object detection method thereof
CN113269147B (en) Three-dimensional detection method and system based on space and shape, and storage and processing device
CN111681212B (en) Three-dimensional target detection method based on laser radar point cloud data
Huang et al. A coarse-to-fine algorithm for registration in 3D street-view cross-source point clouds
CN113267761B (en) Laser radar target detection and identification method, system and computer readable storage medium
WO2023016082A1 (en) Three-dimensional reconstruction method and apparatus, and electronic device and storage medium
Lee et al. 3D reconstruction using a sparse laser scanner and a single camera for outdoor autonomous vehicle
Nguyen et al. Toward real-time vehicle detection using stereo vision and an evolutionary algorithm
CN114693862A (en) Three-dimensional point cloud data model reconstruction method, target re-identification method and device
Hu et al. R-CNN based 3D object detection for autonomous driving
CN118397616B (en) 3D target detection method based on density perception completion and sparse fusion
CN111198563A (en) Terrain recognition method and system for dynamic motion of foot type robot
CN112906519B (en) Vehicle type identification method and device
Li et al. 3D object detection based on point cloud in automatic driving scene
Saleem et al. Effects of ground manifold modeling on the accuracy of stixel calculations
US20240029392A1 (en) Prediction method for target object, computer device, and storage medium
Cai et al. Deep representation and stereo vision based vehicle detection
CN116740665A (en) A point cloud target detection method and device based on three-dimensional intersection and union ratio
CN115965549A (en) A laser point cloud completion method and related device
CN114693863A (en) A method and device for vehicle re-identification based on lidar camera
Corneliu et al. Real-time pedestrian classification exploiting 2D and 3D information
CN116052122B (en) Method and device for detecting drivable space, electronic equipment and storage medium
CN117475410B (en) Three-dimensional target detection method, system, equipment and medium based on foreground point screening
Chu et al. Convergent application for trace elimination of dynamic objects from accumulated lidar point clouds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200313)