CN110879994A - 3D visual detection method, system and device based on shape attention mechanism - Google Patents

3D visual detection method, system and device based on shape attention mechanism

Info

Publication number
CN110879994A
CN110879994A
Authority
CN
China
Prior art keywords
feature map
attention
target
module
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911213392.9A
Other languages
Chinese (zh)
Inventor
张兆翔
张驰
叶阳阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911213392.9A priority Critical patent/CN110879994A/en
Publication of CN110879994A publication Critical patent/CN110879994A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/513Sparse representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer deep reinforcement learning and pattern recognition, and particularly relates to a three-dimensional visual detection method, system and device based on a shape attention mechanism, aiming to solve the problems that the precision of a single-stage detector is lower than that of a two-stage detector, while the two-stage detector is time-consuming and unsuitable for real-time systems. The invention comprises the following steps: representing the point cloud data by three-dimensional grid voxels; extracting features and encoding a spatially sparse feature map; extracting features at different scales after projection to a top view; merging the features with deconvolution layers; extracting a shape attention feature map through attention weights and a convolutional encoding layer; and acquiring the target category, position, size and orientation through a target classification network and a regression positioning network. The invention uses a sampling strategy based on distance constraints and an attention mechanism based on shape priors, alleviates the instability caused by uneven data distribution, remedies the lack of shape priors in single-stage detectors, and achieves high precision, short detection time, strong real-time performance and good robustness.

Description

3D visual detection method, system and device based on shape attention mechanism

Technical Field

The invention belongs to the fields of deep reinforcement learning, computer vision, pattern recognition and machine learning, and particularly relates to a three-dimensional visual detection method, system and device based on a shape attention mechanism.

Background Art

A 3D object detector needs to output reliable spatial and semantic information, namely the 3D position, orientation, occupied volume and category of each object. Compared with 2D object detection, 3D detection provides more detailed information, but modeling it is more difficult. 3D object detection generally relies on range sensors, such as lidar, TOF cameras and stereo cameras, to predict more meaningful and accurate results, and it has become a key technology for self-driving cars, UAVs and robotics. Most accurate 3D object detection algorithms for traffic scenes are based on lidar, which has become the basic sensor for outdoor scene perception, and perceiving the targets in a traffic scene is the key technology by which an unmanned vehicle localizes the objects around it.

Lidar-based 3D object detection involves two important problems. The first is how to generate descriptive low-level features for the sparse, non-uniform point clouds sampled by a lidar sensor. Lidar points are dense close to the sensor and sparse far away from it, and this uneven distribution degrades detection performance and makes the results unstable. Many methods rely on hand-crafted feature extraction; however, because hand-crafted features do not properly account for the unbalanced distribution of laser points, the resulting detectors are not stable. Object detection and segmentation both play an extremely important role in the understanding and perception of visual data. The second problem is how to effectively encode 3D shape information to obtain better discriminative embeddings. There are two main frameworks for 3D object detection: single-stage detectors and two-stage detectors. Single-stage detectors are more efficient, while two-stage detectors are more accurate. Two-stage detectors are inefficient because the region proposal network outputs regions of interest (ROIs) that must be cropped; these cropped ROIs, however, provide a shape prior for each detected object, so the subsequent refinement network can reach higher accuracy. Lacking shape priors and a refinement network, single-stage detectors perform below two-stage detectors, yet for a real-time system the two-stage detector is too time-consuming. Moreover, a 3D shape prior is better suited to the detection of 3D objects.

SUMMARY OF THE INVENTION

In order to solve the above problem in the prior art, namely that the accuracy of single-stage 3D object detectors is lower than that of two-stage detectors while two-stage detectors are time-consuming and unsuitable for real-time systems, the present invention provides a three-dimensional visual detection method based on a shape attention mechanism, the method comprising:

Step S10, acquiring laser point cloud data containing a target object as data to be detected, and representing the data to be detected by voxels based on a three-dimensional grid;

Step S20, obtaining the feature expression of the voxels through a feature extractor and performing sparse convolutional encoding to obtain a spatially sparse feature map corresponding to the data to be detected;

Step S30, projecting the spatially sparse feature map onto a two-dimensional top-view plane, extracting features at different scales through a feature pyramid convolutional network, and merging the features of the different scales through deconvolution layers to obtain a top-view feature map;

Step S40, obtaining an attention weight feature map of the top-view feature map through an attention weight layer, and obtaining an encoded feature map of the top-view feature map through a convolutional encoding layer;

Step S50, multiplying the attention weight feature map onto the corresponding regions of the encoded feature map, and performing feature concatenation to obtain an attention feature map;

Step S60, based on the attention feature map, obtaining the target category in the data to be detected through a trained target classification network, and obtaining the target position, size and orientation in the data to be detected through a trained target regression positioning network.

In some preferred embodiments, in step S10, "representing the data to be detected by voxels based on a three-dimensional grid" is carried out as:

D = { [xi, yi, zi, Ri], i = 1, …, N }

where D represents the voxel representation of the laser point cloud data, xi, yi, zi represent the three-dimensional position of the i-th point in the laser point cloud data relative to the lidar, and Ri represents the reflectivity of the i-th point.

In some preferred embodiments, in step S20, "obtaining the feature expression of the voxels through a feature extractor and performing sparse convolutional encoding to obtain a spatially sparse feature map corresponding to the data to be detected" is carried out as:

fs(x, y, z) = F(D)

where F() represents obtaining the feature expression of the voxels through the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) are the spatial coordinates of the spatially sparse feature map.

In some preferred embodiments, in step S40, "obtaining the attention weight feature map of the top-view feature map through the attention weight layer" is carried out as:

Fatt(u,v) = Convatt(FFPN(u,v))

where Fatt(u,v) represents the attention weight feature map corresponding to the top-view feature map, FFPN(u,v) represents the top-view feature map, and Convatt() represents the convolution operation of the attention weight layer.

In some preferred embodiments, in step S40, "obtaining the encoded feature map of the top-view feature map through the convolutional encoding layer" is carried out as:

Fen(u,v) = Conven(FFPN(u,v))

where Fen(u,v) represents the encoded feature map corresponding to the top-view feature map, FFPN(u,v) represents the top-view feature map, and Conven() represents the convolution operation of the convolutional encoding layer.

In some preferred embodiments, in step S50, "multiplying the attention weight feature map onto the corresponding regions of the encoded feature map and performing feature concatenation to obtain the attention feature map" is carried out as:

Fop(u,v) = Fen(u,v) · Repeat(Reshape(Fatt(u,v)))

where Reshape() represents the reshaping operation and Repeat() represents the copy operation;

Fhybrid(u,v) = [Fen(u,v), Fop(u,v)]

where [ ] represents the feature concatenation operation.

In some preferred embodiments, the target classification network is trained with a cross-entropy loss function, which is:

L = −(1/N) Σi [yi log(xi) + (1 − yi) log(1 − xi)]

where N represents the number of samples over which the loss is calculated; yi represents the positive and negative labels, with 0 denoting a negative sample and 1 a positive sample; and xi represents the network output value for the sample.

In some preferred embodiments, the target regression positioning network is trained with a Smooth L1 loss function, which is:

SmoothL1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise

where x represents the residual to be regressed.

In another aspect of the present invention, a three-dimensional visual detection system based on a shape attention mechanism is provided. The system comprises an input module, a sparse convolution encoding module, a feature pyramid module, an attention weight convolution module, an encoding convolution module, a feature fusion module, a target classification module, a target positioning module, and an output module;

the input module is configured to acquire laser point cloud data containing a target object as data to be detected, and to represent the data to be detected by voxels based on a three-dimensional grid;

the sparse convolution encoding module is configured to obtain the feature expression of the voxels through a feature extractor and to perform sparse convolutional encoding, obtaining a spatially sparse feature map corresponding to the data to be detected;

the feature pyramid module is configured to project the spatially sparse feature map onto a two-dimensional top-view plane, extract features at different scales through a feature pyramid convolutional network, and merge the features of the different scales through deconvolution layers to obtain a top-view feature map;

the attention weight convolution module is configured to obtain an attention weight feature map of the top-view feature map through an attention weight layer;

the encoding convolution module is configured to obtain an encoded feature map of the top-view feature map through a convolutional encoding layer;

the feature fusion module is configured to multiply the attention weight feature map onto the corresponding regions of the encoded feature map and to perform feature concatenation, obtaining an attention feature map;

the target classification module is configured to obtain, based on the attention feature map, the target category in the data to be detected through a trained target classification network;

the target positioning module is configured to obtain, based on the attention feature map, the target position, size and orientation in the data to be detected through a trained target regression positioning network;

the output module is configured to output the acquired target category and the target position, size and orientation.

In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above three-dimensional visual detection method based on a shape attention mechanism.

In a fourth aspect of the present invention, a processing device is provided, comprising a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the above three-dimensional visual detection method based on a shape attention mechanism.

Beneficial effects of the present invention:

The three-dimensional visual detection method based on a shape attention mechanism of the present invention uses a sampling strategy based on distance constraints, which effectively alleviates the unstable results caused by the uneven distribution of lidar point cloud data, and addresses the lack of shape priors in single-stage detectors through an attention mechanism based on shape priors. The method improves the detection performance of current single-stage 3D object detectors, especially for targets with distinctive shapes, offering high detection accuracy, short detection time, suitability for real-time systems, and good model robustness.

Description of the Drawings

Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:

FIG. 1 is a schematic flowchart of the three-dimensional visual detection method based on a shape attention mechanism of the present invention;

FIG. 2 is a schematic diagram of the algorithm structure of an embodiment of the three-dimensional visual detection method based on a shape attention mechanism of the present invention;

FIG. 3 is an example of the data set and detection results of an embodiment of the three-dimensional visual detection method based on a shape attention mechanism of the present invention;

FIG. 4 is a comparison of the detection results of the method of the present invention and other methods for an embodiment of the three-dimensional visual detection method based on a shape attention mechanism.

Detailed Description

The present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the related invention, not to limit it. It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings.

It should be noted that the embodiments of the present application and the features therein may be combined with each other where no conflict arises. The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

A three-dimensional visual detection method based on a shape attention mechanism of the present invention comprises:

Step S10, acquiring laser point cloud data containing a target object as data to be detected, and representing the data to be detected by voxels based on a three-dimensional grid;

Step S20, obtaining the feature expression of the voxels through a feature extractor and performing sparse convolutional encoding to obtain a spatially sparse feature map corresponding to the data to be detected;

Step S30, projecting the spatially sparse feature map onto a two-dimensional top-view plane, extracting features at different scales through a feature pyramid convolutional network, and merging the features of the different scales through deconvolution layers to obtain a top-view feature map;

Step S40, obtaining an attention weight feature map of the top-view feature map through an attention weight layer, and obtaining an encoded feature map of the top-view feature map through a convolutional encoding layer;

Step S50, multiplying the attention weight feature map onto the corresponding regions of the encoded feature map, and performing feature concatenation to obtain an attention feature map;

Step S60, based on the attention feature map, obtaining the target category in the data to be detected through a trained target classification network, and obtaining the target position, size and orientation in the data to be detected through a trained target regression positioning network.

In order to describe the three-dimensional visual detection method based on a shape attention mechanism of the present invention more clearly, each step of the method embodiment is detailed below with reference to FIG. 1.

The three-dimensional visual detection method based on a shape attention mechanism according to an embodiment of the present invention comprises steps S10 to S60, each described in detail as follows:

Step S10, acquiring laser point cloud data containing a target object as data to be detected, and representing the data to be detected by voxels based on a three-dimensional grid, as shown in Equation (1):

D = { [xi, yi, zi, Ri], i = 1, …, N }    Equation (1)

where D represents the voxel representation of the laser point cloud data, xi, yi, zi represent the three-dimensional position of the i-th point in the lidar point cloud, and Ri represents the reflectivity of the i-th point.

Suppose the lidar point cloud occupies a three-dimensional space of extent H, W, D, denoting the height in the vertical direction and the position and distance in the horizontal directions, and that each voxel has size ΔH×ΔW×ΔD with ΔH = 0.4 m, ΔW = 0.2 m, ΔD = 0.2 m. The dimensions of the voxel grid over the whole space are then H/ΔH, W/ΔW, D/ΔD. Each voxel is then given a feature expression by a voxel feature encoding (VFE) layer. In one embodiment of the present invention, the feature extractor describes each sample point in a voxel with a 7-dimensional vector (the three-dimensional coordinates, the reflectivity, and the relative three-dimensional coordinates within the voxel), and the coordinates (Px, Py) of the current pillar center are appended to each sample, so that the description vector of each sample point becomes 9-dimensional. In one embodiment of the present invention, the feature encoding (VFE) layer consists of a linear layer, a batch normalization (BN) layer and a rectified linear unit (ReLU) layer that extract the vector features of the points.
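Under the stated layer composition, a minimal PyTorch sketch of such a VFE layer might look as follows; the class name, the 64-channel output width and the max-pooling aggregation over a voxel's points are illustrative assumptions rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """Sketch of the voxel feature encoding described above: a linear
    layer, batch normalization and ReLU applied to the 9-dim descriptors
    of the points inside each voxel."""
    def __init__(self, in_dim=9, out_dim=64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, points):
        # points: (num_voxels, max_points_per_voxel, 9)
        n_vox, n_pts, _ = points.shape
        x = self.linear(points.reshape(n_vox * n_pts, -1))
        x = torch.relu(self.bn(x)).reshape(n_vox, n_pts, -1)
        # pool the per-point features into one feature per voxel
        return x.max(dim=1).values  # (num_voxels, out_dim)
```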

Step S20, obtaining the feature expression of the voxels through the feature extractor and performing sparse convolutional encoding to obtain the spatially sparse feature map corresponding to the data to be detected, as shown in Equation (2):

fs(x, y, z) = F(D)    Equation (2)

where F() represents obtaining the feature expression of the voxels through the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) are the spatial coordinates of the spatially sparse feature map.
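One way to picture the spatially sparse feature map of Equation (2) is to scatter the per-voxel features back to their grid coordinates. The sketch below uses a dense tensor for clarity; the sparse convolution encoding itself would in practice be performed with a sparse-convolution library, which is an assumption not fixed by the text.

```python
import torch

def scatter_to_dense(voxel_feats, coords, grid_shape):
    """voxel_feats: (num_voxels, C) features from the extractor;
    coords: (num_voxels, 3) integer (long) grid indices as (z, y, x);
    grid_shape: (D, H, W) size of the voxel grid.
    Returns a (C, D, H, W) volume with zeros at empty voxels."""
    C = voxel_feats.shape[1]
    D, H, W = grid_shape
    dense = torch.zeros(C, D, H, W, dtype=voxel_feats.dtype)
    dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = voxel_feats.t()
    return dense
```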

Step S30, projecting the spatially sparse feature map onto a two-dimensional top-view plane, extracting features at different scales through a feature pyramid convolutional network, and merging the features of the different scales through deconvolution layers to obtain a top-view feature map.

Projecting the spatially sparse feature map fs(x, y, z) onto the top view (i.e., the bird's-eye view) means compressing its vertical dimension to obtain the top-view feature map f2D(u, v). Concretely, if the original features have shape (C, D, H, W), the height dimension is folded into the feature channels to give (C×D, H, W), and 2D convolutional features are computed on this top-view map. Features at different scales of f2D(u, v) are extracted by the feature pyramid convolutional network and merged by deconvolution layers to give the feature map fFPN(u, v). In one embodiment of the present invention, the feature pyramid convolutional network consists of three convolution groups with (3, 5, 5) convolutional layers respectively, each convolutional layer being followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) layer.
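The sketch below illustrates the projection and the pyramid merge under simplifying assumptions: each of the three groups is reduced to one strided convolution, the channel width is fixed to 128, and H and W are assumed divisible by 8; only the reshape that folds the height into the channels follows directly from the text.

```python
import torch
import torch.nn as nn

class TopViewFPN(nn.Module):
    """Sketch of the top-view feature pyramid: the 3D feature volume is
    flattened to a 2D map, passed through three strided convolution
    groups, and the multi-scale outputs are upsampled by deconvolution
    and merged by concatenation."""
    def __init__(self, in_ch, ch=128):
        super().__init__()
        def group(cin):
            return nn.Sequential(nn.Conv2d(cin, ch, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(ch), nn.ReLU())
        self.g1, self.g2, self.g3 = group(in_ch), group(ch), group(ch)
        self.up1 = nn.ConvTranspose2d(ch, ch, 1, stride=1)
        self.up2 = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.up3 = nn.ConvTranspose2d(ch, ch, 4, stride=4)

    def forward(self, volume):
        # volume: (N, C, D, H, W) -> top view (N, C*D, H, W)
        n, c, d, h, w = volume.shape
        x = volume.reshape(n, c * d, h, w)
        f1 = self.g1(x)   # stride 2
        f2 = self.g2(f1)  # stride 4
        f3 = self.g3(f2)  # stride 8
        # bring all scales back to stride 2 and concatenate
        return torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)
```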

Step S40, obtaining the attention weight feature map of the top-view feature map through the attention weight layer, and obtaining the encoded feature map of the top-view feature map through the convolutional encoding layer.

The attention weight feature map of the top-view feature map is obtained through the attention weight layer, as shown in Equation (3):

Fatt(u,v) = Convatt(FFPN(u,v))    Equation (3)

where Fatt(u,v) represents the attention weight feature map corresponding to the top-view feature map, FFPN(u,v) represents the top-view feature map, and Convatt() represents the convolution operation of the attention weight layer.

The encoded feature map of the top-view feature map is obtained through the convolutional encoding layer, as shown in Equation (4):

Fen(u,v) = Conven(FFPN(u,v))    Equation (4)

where Fen(u,v) represents the encoded feature map corresponding to the top-view feature map, FFPN(u,v) represents the top-view feature map, and Conven() represents the convolution operation of the convolutional encoding layer.

Step S50, multiplying the attention weight feature map onto the corresponding regions of the encoded feature map, and performing feature concatenation to obtain the attention feature map, as shown in Equations (5) and (6):

Fop(u,v) = Fen(u,v) · Repeat(Reshape(Fatt(u,v)))    Equation (5)

where Reshape() represents the reshaping operation and Repeat() represents the copy operation;

Fhybrid(u,v) = [Fen(u,v), Fop(u,v)]    Equation (6)

where [ ] represents the feature concatenation operation.
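A sketch of Equations (3)-(6) in the same style follows; the 3×3 kernels and the sigmoid that keeps the attention weights in [0, 1] are assumptions, since the text fixes only the operations themselves.

```python
import torch
import torch.nn as nn

class ShapeAttentionHead(nn.Module):
    """Sketch of the shape attention fusion: Conv_att produces a
    one-channel weight map (Eq. 3), Conv_en an encoded map (Eq. 4);
    the weights are broadcast over the channels and multiplied onto
    the encoding (Eq. 5), and the two maps are concatenated (Eq. 6)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_att = nn.Conv2d(in_ch, 1, 3, padding=1)
        self.conv_en = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, f_fpn):                        # (N, C, H, W)
        f_att = torch.sigmoid(self.conv_att(f_fpn))  # (N, 1, H, W)
        f_en = self.conv_en(f_fpn)                   # (N, out_ch, H, W)
        f_op = f_en * f_att.expand_as(f_en)          # Repeat + multiply
        return torch.cat([f_en, f_op], dim=1)        # F_hybrid
```

For example, with an fFPN of shape (N, 256, H, W), ShapeAttentionHead(256, 128) would return a hybrid map with 256 channels that feeds the two detection heads.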

Step S60, based on the attention feature map, obtaining the target category in the data to be detected through the trained target classification network, and obtaining the target position, size and orientation in the data to be detected through the trained target regression positioning network.

As shown in FIG. 2, the algorithm structure of an embodiment of the three-dimensional visual detection method based on a shape attention mechanism of the present invention is divided into three parts: the first part is a distance-based voxel generator, which converts the input lidar point cloud into voxels; the second part is the feature extraction layers, which encode the voxel features and the three-dimensional spatial features; the third part is the attention region proposal network (Attention RPN), which injects the attention mechanism and outputs the detection results.

The target classification network is trained with the cross-entropy loss function shown in Equation (7):

L = −(1/N) Σi [yi log(xi) + (1 − yi) log(1 − xi)]    Equation (7)

where N represents the number of samples over which the loss is calculated; yi represents the positive and negative labels, with 0 denoting a negative sample and 1 a positive sample; and xi represents the network output value for the sample.
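Read literally, Equation (7) is the binary cross-entropy averaged over the N samples; a direct sketch, assuming the network outputs xi are already probabilities:

```python
import torch

def classification_loss(outputs, labels):
    """Equation (7): labels are 1 for positive and 0 for negative
    samples; outputs are clamped away from 0 and 1 for stability."""
    eps = 1e-7
    p = outputs.clamp(eps, 1 - eps)
    return -(labels * torch.log(p) + (1 - labels) * torch.log(1 - p)).mean()
```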

The target regression positioning network is trained with the Smooth L1 loss function shown in Equation (8):

SmoothL1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise    Equation (8)

where x represents the regression residual.
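Equation (8) translates directly:

```python
import torch

def smooth_l1(residual):
    """Equation (8): quadratic near zero, linear for |x| >= 1."""
    abs_x = residual.abs()
    return torch.where(abs_x < 1, 0.5 * residual ** 2, abs_x - 0.5)
```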

The attention feature map Fhybrid(u, v) is fed to the target classification network and the target regression positioning network: the target classification network judges whether a detected object is a target, and the target regression positioning network obtains the position, size and orientation of the detected object.

In one embodiment of the present invention, for the car category in the target classification task, an anchor whose intersection-over-union (IOU) with a target exceeds 0.6 is taken as a positive sample and one whose IOU is below 0.45 as a negative sample; for the pedestrian and cyclist categories, an anchor with IOU above 0.5 is a positive sample and one with IOU below 0.35 a negative sample. For the regression positioning task, the predefined anchor for the car target is set to width × length × height of (1.6 × 3.9 × 1.5) meters; for the pedestrian target, (0.6 × 0.8 × 1.73) meters; and for the cyclist target, (0.6 × 1.76 × 1.73) meters. A three-dimensional ground-truth bounding box is defined as xg, yg, zg, lg, wg, hg, θg, where x, y, z is the center of the box, l, w, h are the length, width and height of the three-dimensional target, and θ is the rotation of the target about the Z axis; *g denotes the ground-truth value, *a the positive-sample anchor, and Δ* the corresponding residual. Through network learning, the position, size and orientation of the real three-dimensional target are predicted. The residuals of the box center position (Δx, Δy, Δz), of the length, width and height of the three-dimensional target (Δl, Δw, Δh), and of the rotation about the Z axis (Δθ) are shown in Equations (9), (10) and (11), respectively:

Δx = (xg − xa)/da,  Δy = (yg − ya)/da,  Δz = (zg − za)/ha,  where da = ((la)² + (wa)²)^(1/2)    Equation (9)

Δl = log(lg/la),  Δw = log(wg/wa),  Δh = log(hg/ha)    Equation (10)

Δθ = sin(θg − θa)    Equation (11)
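The target encoding of Equations (9)-(11) can be sketched as below; the diagonal normalizer da = sqrt(la² + wa²) follows the common convention for this box parameterization and, like the reconstruction above, is an assumption where the text leaves it implicit.

```python
import math

def encode_residuals(gt, anchor):
    """Regression targets of Equations (9)-(11) for boxes given as
    (x, y, z, l, w, h, theta) tuples."""
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = math.sqrt(la ** 2 + wa ** 2)  # anchor base diagonal (assumed)
    dx, dy = (xg - xa) / da, (yg - ya) / da
    dz = (zg - za) / ha                # Equation (9)
    dl = math.log(lg / la)
    dw = math.log(wg / wa)
    dh = math.log(hg / ha)             # Equation (10)
    dt = math.sin(tg - ta)             # Equation (11)
    return dx, dy, dz, dl, dw, dh, dt
```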

To illustrate the effectiveness of the present invention in detail, the proposed method is applied to the public autonomous-driving data set KITTI, which contains three evaluation categories. FIG. 3 shows examples of the data set and the detection results of an embodiment of the three-dimensional visual detection method based on a shape attention mechanism: the first column, Car, shows the vehicle detection results; the second column, Pedestrian, shows the pedestrian detection results; and the third column, Cyclist, shows the cyclist detection results. Each column has three sets of experimental results, each consisting of an RGB image and the top view of the lidar, with the detection results projected onto them.

In one embodiment of the present invention, for the KITTI data set, the train split is used for training and the test split for testing. FIG. 4 compares the detection results of the method of the present invention with those of other methods. The data set divides each class of test targets into three levels, easy, moderate and hard, according to the height of each target in the camera image, its occlusion level and its degree of truncation. Easy samples have a bounding box height of at least 40 pixels, a maximum truncation of 15%, and are fully visible; moderate samples have a bounding box height of at least 25 pixels, a maximum truncation of 30%, and are partly occluded; hard samples have a bounding box height of at least 25 pixels, a maximum truncation of 50%, and are hard to see. BEV denotes the top-view detection results and 3D the three-dimensional bounding box detection results. 3D object detection performance is evaluated with the PASCAL criterion (AP, average precision). In the comparison, ARPNET denotes the method of the present invention; MV3D, the multi-view 3D object detection method; ContFuse, the deep continuous fusion multi-sensor 3D object detection method; AVOD, the method aggregating multi-view data for real-time 3D object detection in driving scenes; F-PointNet, the frustum point cloud network for 3D object detection from RGB-D data; SECOND, the sparsely embedded convolutional detection method; and VoxelNet, the end-to-end learned 3D object detection method on point cloud data.

A three-dimensional visual detection system based on a shape attention mechanism according to a second embodiment of the present invention comprises an input module, a sparse convolution encoding module, a feature pyramid module, an attention weight convolution module, an encoding convolution module, a feature fusion module, a target classification module, a target positioning module, and an output module;

the input module is configured to acquire laser point cloud data containing a target object as data to be detected, and to represent the data to be detected by voxels based on a three-dimensional grid;

the sparse convolution encoding module is configured to obtain the feature expression of the voxels through a feature extractor and to perform sparse convolutional encoding, obtaining a spatially sparse feature map corresponding to the data to be detected;

the feature pyramid module is configured to project the spatially sparse feature map onto a two-dimensional top-view plane, extract features at different scales through a feature pyramid convolutional network, and merge the features of the different scales through deconvolution layers to obtain a top-view feature map;

the attention weight convolution module is configured to obtain an attention weight feature map of the top-view feature map through an attention weight layer;

the encoding convolution module is configured to obtain an encoded feature map of the top-view feature map through a convolutional encoding layer;

the feature fusion module is configured to multiply the attention weight feature map onto the corresponding regions of the encoded feature map and to perform feature concatenation, obtaining an attention feature map;

the target classification module is configured to obtain, based on the attention feature map, the target category in the data to be detected through a trained target classification network;

the target positioning module is configured to obtain, based on the attention feature map, the target position, size and orientation in the data to be detected through a trained target regression positioning network;

the output module is configured to output the acquired target category and the target position, size and orientation.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process of the system described above and the related explanations may refer to the corresponding process in the foregoing method embodiment, and are not repeated here.

It should be noted that the three-dimensional visual detection system based on a shape attention mechanism provided by the above embodiment is illustrated only by the division of the above functional modules. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the modules or steps in the embodiment of the present invention may be decomposed or combined. For example, the modules of the above embodiment may be merged into one module, or further split into multiple sub-modules, so as to accomplish all or part of the functions described above. The names of the modules and steps involved in the embodiment of the present invention are only for distinguishing the individual modules or steps and are not to be regarded as improper limitations of the present invention.

A storage device according to a third embodiment of the present invention stores a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above three-dimensional visual detection method based on a shape attention mechanism.

A processing device according to a fourth embodiment of the present invention comprises a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the above three-dimensional visual detection method based on a shape attention mechanism.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the storage device and processing device described above and the related explanations may refer to the corresponding process in the foregoing method embodiment, and are not repeated here.

Those skilled in the art should be aware that the modules and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two, and that the programs corresponding to software modules and method steps may be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the art. In order to clearly illustrate the interchangeability of electronic hardware and software, the composition and steps of each example have been described generally in terms of functionality in the above description. Whether these functions are performed in electronic hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The terms "first", "second", etc. are used to distinguish similar objects, not to describe or indicate a particular order or sequence.

The term "comprising" or any other similar term is intended to cover a non-exclusive inclusion, so that a process, method, article or device/apparatus comprising a list of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article or device/apparatus.

The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the accompanying drawings; however, those skilled in the art will readily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principle of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will fall within the protection scope of the present invention.

Claims (11)

1. A three-dimensional visual inspection method based on a shape attention mechanism is characterized by comprising the following steps:
step S10, laser point cloud data containing a target object are obtained to serve as data to be detected, and the data to be detected are represented through voxels based on a three-dimensional grid;
step S20, acquiring the feature expression of the voxel through a feature extractor and performing sparse convolution coding to obtain a space sparse feature map corresponding to the data to be processed;
step S30, projecting the space sparse feature map to a two-dimensional top view plane, acquiring features of different scales through a feature pyramid convolution network, and then merging the features of different scales through deconvolution layers to obtain a top view feature map;
step S40, acquiring an attention weight feature map of the top view feature map through an attention weight layer; acquiring a coding feature map of the top view feature map through a convolution coding layer;
step S50, multiplying the attention weight feature map to the corresponding area of the coding feature map, and performing feature splicing to obtain an attention feature map;
step S60, acquiring target categories in the data to be detected through a trained target classification network based on the attention feature map; and acquiring the position, the size and the direction of the target in the data to be detected through the trained target regression positioning network based on the attention feature map.
2. The three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S10, "the data to be detected is represented by voxels based on a three-dimensional grid" is performed by:
D = { [xi, yi, zi, Ri], i = 1, …, N }
wherein D represents the voxel representation of the laser point cloud data, xi, yi, zi represent the three-dimensional position information of the i-th point in the laser point cloud data relative to the lidar, and Ri represents the reflectivity of the i-th point.
3. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S20, "obtaining the feature expression of the voxel by a feature extractor and performing sparse convolution coding to obtain a spatial sparse feature map corresponding to the data to be processed" includes:
fs(x, y, z) = F(D)
wherein, F () represents the feature representation of the voxel obtained by the feature extractor, D represents the voxel representation of the laser point cloud data, and (x, y, z) represents the spatial coordinates of the spatial sparse feature map.
4. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S40, "obtaining the attention weight feature map of the top view feature map through the attention weight layer" includes:
Fatt(u,v)=Convatt(FFPN(u,v))
wherein Fatt(u, v) represents the attention weight feature map corresponding to the top view feature map, FFPN(u, v) represents the top view feature map, and Convatt() represents the convolution operation of the attention weight layer.
5. A three-dimensional visual inspection method based on shape attention mechanism according to claim 1, wherein in step S40, "obtaining the encoding feature map of the top view feature map by convolution encoding layer" comprises:
Fen(u,v)=Conven(FFPN(u,v))
wherein Fen(u, v) represents the coding feature map corresponding to the top view feature map, FFPN(u, v) represents the top view feature map, and Conven() represents the convolution operation of the convolutional encoding layer.
6. A three-dimensional visual inspection method based on a shape attention mechanism according to claim 1, wherein in step S50, the method comprises the steps of multiplying the attention weight feature map to the corresponding region of the coding feature map and performing feature concatenation to obtain the attention feature map, and comprises the steps of:
Fop(u,v)=Fen(u,v)Repeat(Reshape(Fatt(u,v)))
wherein Reshape() represents the deformation operation, and Repeat() represents the copy operation;
Fhybrid(u,v) = [Fen(u,v), Fop(u,v)]
wherein [ ] represents a feature concatenation operation.
7. The three-dimensional visual inspection method based on shape attention mechanism according to any one of claims 1-6, characterized in that the object classification network is trained by cross entropy loss function; the cross entropy loss function is:
L = −(1/N) Σi [yi log(xi) + (1 − yi) log(1 − xi)]
wherein N represents the number of samples for which the loss is calculated; yi represents the positive and negative samples, with 0 representing a negative sample and 1 representing a positive sample; and xi represents the network output value of the sample.
8. The three-dimensional visual inspection method based on shape attention mechanism according to any one of claims 1-6, characterized in that the target regression positioning network is trained by Smooth L1 loss function; the Smooth L1 loss function is:
SmoothL1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise
where x represents the residual of the regression.
9. A three-dimensional visual inspection detection system based on a shape attention mechanism, characterized by comprising an input module, a sparse convolution coding module, a feature pyramid module, an attention weight convolution module, a coding convolution module, a feature fusion module, a target classification module, a target positioning module and an output module;
the input module is configured to acquire laser point cloud data containing a target object as data to be detected, the data to be detected being represented by voxels based on a three-dimensional grid;
the sparse convolution coding module is configured to obtain the feature expression of the voxel through a feature extractor and carry out sparse convolution coding to obtain a space sparse feature map corresponding to the data to be processed;
the feature pyramid module is configured to project the space sparse feature map to a two-dimensional top view plane, obtain features of different scales through a feature pyramid convolution network, and then merge the features of different scales through deconvolution layers to obtain a top view feature map;
the attention weight convolution module is configured to acquire an attention weight feature map of the top view feature map through an attention weight layer;
the coding convolution module is configured to acquire a coding feature map of the top view feature map through a convolution coding layer;
the feature fusion module is configured to multiply the attention weight feature map to a corresponding region of the coding feature map, and perform feature splicing to obtain an attention feature map;
the target classification module is configured to obtain a target class in the data to be detected through a trained target classification network based on the attention feature map;
the target positioning module is configured to obtain the position, the size and the direction of a target in the data to be detected through a trained target regression positioning network based on the attention feature map;
the output module is configured to output the acquired object type, and the object position, size and direction.
10. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method for three-dimensional visual inspection based on the shape attention mechanism of any one of claims 1 to 8.
11. A processing apparatus, comprising:
A processor adapted to execute various programs; and
a storage device adapted to store a plurality of programs;
wherein the program is adapted to be loaded and executed by a processor to perform:
the three-dimensional visual inspection method based on the shape attention mechanism as set forth in any one of claims 1 to 8.
CN201911213392.9A 2019-12-02 2019-12-02 3D visual detection method, system and device based on shape attention mechanism Pending CN110879994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911213392.9A CN110879994A (en) 2019-12-02 2019-12-02 3D visual detection method, system and device based on shape attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911213392.9A CN110879994A (en) 2019-12-02 2019-12-02 3D visual detection method, system and device based on shape attention mechanism

Publications (1)

Publication Number Publication Date
CN110879994A true CN110879994A (en) 2020-03-13

Family

ID=69729811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911213392.9A Pending CN110879994A (en) 2019-12-02 2019-12-02 3D visual detection method, system and device based on shape attention mechanism

Country Status (1)

Country Link
CN (1) CN110879994A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102896630A (en) * 2011-07-25 2013-01-30 索尼公司 Robot device, method of controlling the same, computer program, and robot system
US20160063754A1 (en) * 2014-08-26 2016-03-03 The Boeing Company System and Method for Detecting a Structural Opening in a Three Dimensional Point Cloud
US20180210896A1 (en) * 2015-07-22 2018-07-26 Hangzhou Hikvision Digital Technology Co., Ltd. Method and device for searching a target in an image
CN106778856A (en) * 2016-12-08 2017-05-31 深圳大学 A kind of object identification method and device
US20180165547A1 (en) * 2016-12-08 2018-06-14 Shenzhen University Object Recognition Method and Device
US20190147245A1 (en) * 2017-11-14 2019-05-16 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN108133191A (en) * 2017-12-25 2018-06-08 燕山大学 A kind of real-time object identification method suitable for indoor environment
CN110070025A (en) * 2019-04-17 2019-07-30 上海交通大学 Objective detection system and method based on monocular image
CN110458112A (en) * 2019-08-14 2019-11-15 上海眼控科技股份有限公司 Vehicle checking method, device, computer equipment and readable storage medium storing program for executing

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YANGYANG YE ET AL: "ARPNET: attention region proposal network for 3D object detection", Science China Information Sciences *
YANGYANG YE ET AL: "SARPNET: shape attention regional proposal network for LiDAR-based 3D object detection", Neurocomputing *
YIN ZHOU ET AL: "VoxelNet: end-to-end learning for point cloud based 3D object detection", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
ZHAO HUAQING: "Prior direction angle estimation in three-dimensional object detection", Transducer and Microsystem Technologies (in Chinese) *
CHEN MIN: "Introduction to Cognitive Computing" (in Chinese), 30 April 2017 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723719A (en) * 2020-06-12 2020-09-29 中国科学院自动化研究所 Video target detection method, system and device based on category external memory
CN111862101A (en) * 2020-07-15 2020-10-30 西安交通大学 A Semantic Segmentation Method of 3D Point Clouds from the Coding Perspective of Bird's Eye View
CN111985378A (en) * 2020-08-13 2020-11-24 中国第一汽车股份有限公司 Road target detection method, device and equipment and vehicle
CN112257605A (en) * 2020-10-23 2021-01-22 中国科学院自动化研究所 3D target detection method, system and device based on self-labeled training samples
CN112418421B (en) * 2020-11-06 2024-01-23 常州大学 Road side end pedestrian track prediction algorithm based on graph attention self-coding model
CN112418421A (en) * 2020-11-06 2021-02-26 常州大学 Roadside end pedestrian trajectory prediction algorithm based on graph attention self-coding model
CN112347987A (en) * 2020-11-30 2021-02-09 江南大学 A 3D Object Detection Method Based on Multimodal Data Fusion
CN112347987B (en) * 2020-11-30 2025-01-14 江南大学 A 3D object detection method based on multi-modal data fusion
CN112464905A (en) * 2020-12-17 2021-03-09 湖南大学 3D target detection method and device
CN112464905B (en) * 2020-12-17 2022-07-26 湖南大学 3D target detection method and device
CN112668469A (en) * 2020-12-28 2021-04-16 西安电子科技大学 Multi-target detection and identification method based on deep learning
CN112884723A (en) * 2021-02-02 2021-06-01 贵州电网有限责任公司 Insulator string detection method in three-dimensional laser point cloud data
CN112884723B (en) * 2021-02-02 2022-08-12 贵州电网有限责任公司 Insulator string detection method in three-dimensional laser point cloud data
CN113095172A (en) * 2021-03-29 2021-07-09 天津大学 Point cloud three-dimensional object detection method based on deep learning
CN113269147A (en) * 2021-06-24 2021-08-17 浙江海康智联科技有限公司 Three-dimensional detection method and system based on space and shape, and storage and processing device
CN113807184A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Obstacle detection method and device, electronic equipment and automatic driving vehicle
CN114663879A (en) * 2022-02-09 2022-06-24 中国科学院自动化研究所 Target detection method and device, electronic equipment and storage medium
CN114663879B (en) * 2022-02-09 2023-02-21 中国科学院自动化研究所 Target detection method, device, electronic equipment and storage medium
CN115082902B (en) * 2022-07-22 2022-11-11 松立控股集团股份有限公司 Vehicle target detection method based on laser radar point cloud
CN115082902A (en) * 2022-07-22 2022-09-20 松立控股集团股份有限公司 Vehicle target detection method based on laser radar point cloud
CN115183782B (en) * 2022-09-13 2022-12-09 毫末智行科技有限公司 Multi-modal sensor fusion method and device based on joint space loss
CN115183782A (en) * 2022-09-13 2022-10-14 毫末智行科技有限公司 Method and device for multimodal sensor fusion based on joint space loss
CN116704464A (en) * 2023-06-14 2023-09-05 苏州科技大学 Three-dimensional object detection method, system and storage medium based on auxiliary task learning network
CN116704464B (en) * 2023-06-14 2025-05-06 苏州科技大学 Three-dimensional target detection method, system and storage medium based on auxiliary task learning network

Similar Documents

Publication Publication Date Title
CN110879994A (en) 3D visual detection method, system and device based on shape attention mechanism
CN113052109B (en) A 3D object detection system and a 3D object detection method thereof
CN113269147B (en) Three-dimensional detection method and system based on space and shape, and storage and processing device
CN111681212B (en) Three-dimensional target detection method based on laser radar point cloud data
Huang et al. A coarse-to-fine algorithm for registration in 3D street-view cross-source point clouds
CN113267761B (en) Laser radar target detection and identification method, system and computer readable storage medium
WO2023016082A1 (en) Three-dimensional reconstruction method and apparatus, and electronic device and storage medium
Lee et al. 3D reconstruction using a sparse laser scanner and a single camera for outdoor autonomous vehicle
Nguyen et al. Toward real-time vehicle detection using stereo vision and an evolutionary algorithm
CN114693862A (en) Three-dimensional point cloud data model reconstruction method, target re-identification method and device
Hu et al. R-CNN based 3D object detection for autonomous driving
CN118397616B (en) 3D target detection method based on density perception completion and sparse fusion
CN111198563A (en) Terrain recognition method and system for dynamic motion of foot type robot
CN112906519B (en) Vehicle type identification method and device
Li et al. 3D object detection based on point cloud in automatic driving scene
Saleem et al. Effects of ground manifold modeling on the accuracy of stixel calculations
US20240029392A1 (en) Prediction method for target object, computer device, and storage medium
Cai et al. Deep representation and stereo vision based vehicle detection
CN116740665A (en) A point cloud target detection method and device based on three-dimensional intersection and union ratio
CN115965549A (en) A laser point cloud completion method and related device
CN114693863A (en) A method and device for vehicle re-identification based on lidar camera
Corneliu et al. Real-time pedestrian classification exploiting 2D and 3D information
CN116052122B (en) Method and device for detecting drivable space, electronic equipment and storage medium
CN117475410B (en) Three-dimensional target detection method, system, equipment and medium based on foreground point screening
Chu et al. Convergent application for trace elimination of dynamic objects from accumulated lidar point clouds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200313)