CN110322453B - 3D point cloud semantic segmentation method based on position attention and auxiliary network
- Publication number
- CN110322453B (application CN201910604264.0A)
- Authority
- CN
- China
- Prior art keywords
- network
- feature
- point cloud
- training
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Abstract
Description
Technical Field
The present invention belongs to the field of data processing technology and in particular relates to a 3D point cloud semantic segmentation method that can be used in autonomous driving, robotics, 3D scene reconstruction, quality inspection, 3D mapping and smart city construction.
Background Art
In recent years, with the widespread application of 3D sensors such as LiDAR and RGB-D cameras in robotics and autonomous driving, applying deep learning to 3D point cloud data has become a research hotspot. 3D point cloud data is a set of vectors in a three-dimensional coordinate system, usually expressed as x, y, z coordinates and generally used to represent the outer surface shape of an object. Besides the geometric information carried by (x, y, z), a point may also carry RGB colour, intensity, grayscale, depth or number-of-returns information. Point cloud data is usually acquired by 3D scanning devices such as LiDAR or RGB-D cameras: these sensors automatically measure a large number of points on an object's surface and then export the point cloud in some data file format. Point cloud data is unordered and unstructured, and its density may vary across 3D space, which makes applying deep learning to 3D point clouds a considerable challenge.
3D point cloud semantic segmentation assigns a category to every point of the input point cloud. In early work, 3D point cloud data was usually converted into hand-crafted voxel grid features or multi-view image features before being fed into a deep network for feature extraction. Such feature conversion not only produces a large amount of data but is also computationally complex, and if the resolution is reduced the segmentation accuracy drops. It is therefore particularly important to process point cloud data directly with deep learning methods.
In 2017, Charles R. Qi et al. published the CVPR paper "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", which disclosed a deep learning framework that processes 3D point cloud data directly. The method uses the symmetric max-pooling function to handle the unordered nature of point clouds and thereby extracts a global feature for each point, but it only considers global features and ignores the local features of each point. Shortly after PointNet, Charles R. Qi's team published the NIPS paper "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space". PointNet++ is a hierarchical version of PointNet in which every layer has three stages: sampling, grouping and feature extraction. It first selects a number of important points as the centre points of local regions, then picks k nearest neighbours around each centre point according to Euclidean distance, treats these k neighbours as a local point cloud whose features are extracted with a PointNet network, and finally propagates the deep features back to obtain the semantic segmentation of the 3D point cloud. Its accuracy is improved over PointNet.
Compared with traditional methods, the two methods above process 3D point cloud data directly, are computationally simple, effectively handle the unordered nature of point clouds and improve segmentation accuracy. However, PointNet++ does not consider the relationships between the features of the individual centre points, i.e. the context information, so its feature representation is relatively weak; moreover, PointNet++ follows the generic encoder-decoder framework and does not exploit more of the low-level information. Its segmentation accuracy is therefore not high and there is still room for improvement.
Summary of the Invention
The purpose of the present invention is to address the deficiencies of the prior art described above and to propose a 3D point cloud semantic segmentation method based on position attention and an auxiliary network, which improves segmentation accuracy through a position attention module that relates contextual features and an auxiliary network that reconstructs low-level information.
To achieve the above object, the technical solution of the present invention includes the following steps:
(1) Download the training and test files of 3D point cloud data from the ScanNet website, perform per-category statistics and block cropping on them, and obtain the training set T and the test set V;
(2) Construct a 3D point cloud semantic segmentation network consisting of a feature downsampling network, a position attention module, a feature upsampling network and an auxiliary network cascaded in sequence;
(3) Use the multi-class cross-entropy loss as the loss function of the 3D point cloud semantic segmentation network;
(4) Using the training set T, perform P rounds of supervised training of the 3D point cloud semantic segmentation network, P ≥ 500;
(4a) In each round of training, adjust the network parameters according to the loss function of the semantic segmentation network to obtain a network model;
(4b) Every P1 rounds, evaluate the segmentation accuracy of the current network model on the test samples; if the segmentation accuracy of the current model is higher than that of the previously saved model, save it, P1 ≥ 2;
(4c) After the P rounds of training are completed, take the network model with the highest segmentation accuracy as the trained network model;
(5) Input the test set V into the trained network model for semantic segmentation and obtain the segmentation result of every point.
Compared with the prior art, the present invention has the following advantages:
Because the present invention constructs a 3D point cloud semantic segmentation network whose position attention module computes the correlations between the features represented by the individual centroids of its input data, contextual information is added to the local centroid features of the network; at the same time, because the auxiliary network propagates the low-level features of the network back and reconstructs the low-level information, the segmentation accuracy of 3D point cloud semantic segmentation is effectively improved.
Brief Description of the Drawings
Fig. 1 is a flow chart of the implementation of the present invention;
Fig. 2 is an overall structural diagram of the 3D point cloud semantic segmentation network constructed in the present invention;
Fig. 3 is a structural diagram of the position attention module in the present invention.
Detailed Description
The present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments.
Referring to Fig. 1, the implementation of this example includes the following steps.
Step 1: Obtain the training set T and the test set V.
1.1) Download the training file and the test file of 3D point cloud data from the ScanNet website; the training file contains f0 point cloud scenes and the test file contains f1 point cloud scenes. In this embodiment f0 = 1201 and f1 = 312;
1.2) Use a histogram to count the number of points of each category across all f0 training scenes and compute the weight wk of each category from these counts, where Gk denotes the number of points of class k, M denotes the total number of points, and L denotes the number of segmentation classes, L ≥ 2; in this embodiment L = 21;
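As an illustration of step 1.2, the sketch below counts the per-class totals Gk and M with NumPy. The patent's exact weighting formula is not reproduced in the extracted text, so the inverse-frequency weighting in the last line is only a placeholder assumption, and the function name and arguments are likewise illustrative:

```python
import numpy as np

def class_weights(all_labels, n_classes=21):
    # G_k: number of points of class k over all f0 training scenes; M: total point count.
    g = np.bincount(all_labels, minlength=n_classes).astype(np.float64)
    m = g.sum()
    # Placeholder weighting (plain inverse frequency); the patent's exact formula
    # for w_k is not recoverable from the extracted text.
    return m / (n_classes * np.maximum(g, 1.0))
```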
1.3) For each scene in the training file, randomly select a point with coordinates (x, y, z) as the centre point and take the points in the range (x-0.75, x+0.75), (y-0.75, y+0.75), (z-0.75, z+0.75) around it to form a data block;
1.4) Set the number of sampling points N0 and compare the number of points in the data block obtained in (1.3) with N0 to judge whether the block is valid:
If the number of points in the data block is greater than N0, the block is judged valid and N0 points are randomly sampled from it to form one sample; otherwise the block is discarded. The training set T is obtained in this way. In this embodiment N0 = 8192;
1.5) For each of the f1 scenes in the test file, cut the scene into blocks with a sliding cubic window of size 1.5×1.5×3; from every block, randomly sample N0 points to form one sample, giving the test set V.
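A minimal sketch of the block cropping and fixed-size sampling of steps 1.3 and 1.4 is given below, assuming scene_xyz is an (M, 3) NumPy array of point coordinates and scene_labels the matching label vector; the helper name is illustrative only:

```python
import numpy as np

def sample_training_block(scene_xyz, scene_labels, n0=8192, half=0.75):
    # Pick a random centre point and keep all points within +/-0.75 of it in x, y and z.
    centre = scene_xyz[np.random.randint(len(scene_xyz))]
    mask = np.all(np.abs(scene_xyz - centre) <= half, axis=1)
    block_xyz, block_labels = scene_xyz[mask], scene_labels[mask]
    # A block is valid only if it holds more than n0 points; otherwise it is discarded.
    if len(block_xyz) <= n0:
        return None
    idx = np.random.choice(len(block_xyz), n0, replace=False)
    return block_xyz[idx], block_labels[idx]
```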
Step 2: Construct the 3D point cloud semantic segmentation network.
Referring to Fig. 2, the 3D point cloud semantic segmentation network constructed in this step consists of a feature downsampling network, a position attention module, a feature upsampling network and an auxiliary network cascaded in sequence.
2.1) Set up the feature downsampling network:
The feature downsampling network consists of n cascaded PointSA modules; each PointSA module consists of a point cloud centroid sampling and grouping layer followed by a point cloud feature extraction layer, where n ≥ 2. In this embodiment n = 4;
In the centroid sampling and grouping layer of the m-th PointSA module, m = 1, 2, ..., n, iterative farthest point sampling is first used to sample a preset number of points from the input point set as centroids; then, taking each sampled centroid as the centre, a spherical (ball query) search within a module-specific radius rm gathers a fixed number of neighbouring points to form a group. In this embodiment the radius is set to r1 = 0.1 for the first PointSA module, r2 = 0.2 for the second, r3 = 0.4 for the third and r4 = 0.8 for the fourth;
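The sketch below illustrates this centroid sampling and grouping layer: iterative farthest point sampling followed by a ball query of radius rm. The number of centroids and the group size k are left as parameters because their values are not given in the extracted text:

```python
import numpy as np

def farthest_point_sampling(xyz, n_centroids):
    # Iteratively pick the point farthest from all previously selected centroids.
    dist = np.full(len(xyz), np.inf)
    selected = [np.random.randint(len(xyz))]
    for _ in range(n_centroids - 1):
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[selected[-1]], axis=1))
        selected.append(int(np.argmax(dist)))
    return np.array(selected)

def ball_query(xyz, centroid_xyz, radius, k):
    # For every centroid, collect k point indices lying within the given radius.
    groups = []
    for c in centroid_xyz:
        d = np.linalg.norm(xyz - c, axis=1)
        idx = np.where(d <= radius)[0]
        if len(idx) == 0:                  # fall back to the nearest point
            idx = np.array([int(np.argmin(d))])
        groups.append(np.resize(idx, k))   # pad by repetition if fewer than k found
    return np.stack(groups)                # shape (n_centroids, k)
```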
The point cloud feature extraction layer of the m-th PointSA module consists of three cascaded 2D convolutional layers that extract features from the output of the centroid sampling and grouping layer, and a max-pooling step that pools the extracted regional features. In this embodiment, the three 2D convolutional layers of every PointSA module use 1×1 kernels with stride 1; their output channel numbers are 32, 32, 64 for the first module, 64, 64, 128 for the second module, 128, 128, 256 for the third module and 256, 256, 512 for the fourth module;
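Because a 2D convolution with a 1×1 kernel is simply the same linear map applied to every point of every group, the feature extraction layer can be sketched as a shared MLP followed by max pooling over each group. The ReLU nonlinearity below is an assumption, since the activation is not named in the extracted text:

```python
import numpy as np

def pointsa_feature_extraction(grouped_feats, weights, biases):
    # grouped_feats: (n_centroids, k, c_in); each (w, b) pair acts as a 1x1 convolution,
    # i.e. one linear map shared by all points, followed here by a ReLU.
    x = grouped_feats
    for w, b in zip(weights, biases):
        x = np.maximum(x @ w + b, 0.0)
    # Max pooling over the k points of every group yields one feature per centroid.
    return x.max(axis=1)                   # (n_centroids, c_out)
```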
2.2) Set up the position attention module, which computes the correlations between the features represented by the individual centroids of its input data F and produces the position-attention-enhanced feature E:
Referring to Fig. 3, the module works as follows:
2.2.1) The input data F passes through the first 1D convolutional layer Q to obtain the feature Qi of the i-th centroid, i = 1, 2, ..., N, where N is the number of centroids of F; through the second 1D convolutional layer U to obtain the feature Uj of the j-th centroid, j = 1, 2, ..., N; and through the third 1D convolutional layer V to obtain the feature Vj of the j-th centroid. The three 1D convolutional layers Q, U and V all use kernel size 1 and stride 1; the numbers of output feature channels of the first layer Q and the second layer U are both set in proportion to the number of feature channels of the input data F, while the number of output feature channels of the third layer V equals the number of feature channels of the input data F;
2.2.2) Compute the attention influence value tij between the features represented by each pair of centroids from Qi and Uj, and use the values tij to construct the N×N attention matrix A;
2.2.3) Compute the position attention feature Ji of each centroid from the attention matrix A and the features Vj;
2.2.4) Output the attention-enhanced feature E = [E1; E2; ...; Ei; ...; EN], where Ei = αJi + Fi is the feature of the i-th centroid of E, α is the weight of the position attention feature Ji, and Fi is the input feature of the i-th centroid;
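The computation of steps 2.2.1-2.2.4 can be sketched as below, with the three kernel-size-1 convolutions written as per-centroid linear maps. The softmax form of tij and the reduced channel width C // 8 for Q and U are assumptions made here for illustration, since the corresponding formulas are not reproduced in the extracted text:

```python
import numpy as np

def position_attention(f, wq, wu, wv, alpha):
    # f: (N, C) centroid features; wq, wu: (C, C // 8); wv: (C, C); alpha: scalar weight.
    q, u, v = f @ wq, f @ wu, f @ wv
    logits = q @ u.T                                   # pairwise influence t_ij (unnormalised)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    a = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # attention matrix A
    j = a @ v                                          # position attention features J_i
    return alpha * j + f                               # E_i = alpha * J_i + F_i
```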
2.3) Set up the feature upsampling network:
The feature upsampling network consists of a cascaded PointFP modules, a 1D convolutional layer, a Dropout layer and a 1D convolutional layer used for classification; each PointFP module consists of a feature interpolation layer followed by a feature extraction layer, where a ≥ 2. In this embodiment a = 4;
The feature interpolation layers and feature extraction layers of the a PointFP modules differ as follows:
In the first PointFP module, the feature interpolation layer interpolates the output features of the position attention module and concatenates the interpolated features with the output features of the third PointSA module to give the output of the interpolation layer; the feature extraction layer consists of two cascaded 2D convolutional layers that further process this output, both with 1×1 kernels, stride 1 and output channel numbers 256 and 256;
In the second PointFP module, the feature interpolation layer interpolates the output features of the first PointFP module and concatenates the interpolated features with the output features of the second PointSA module; the feature extraction layer consists of two cascaded 2D convolutional layers, both with 1×1 kernels, stride 1 and output channel numbers 256 and 256;
In the third PointFP module, the feature interpolation layer interpolates the output features of the second PointFP module and concatenates the interpolated features with the output features of the first PointSA module; the feature extraction layer consists of two cascaded 2D convolutional layers, both with 1×1 kernels, stride 1 and output channel numbers 256 and 128;
In the fourth PointFP module, the feature interpolation layer interpolates the output features of the third PointFP module and uses the interpolated features directly as its output; the feature extraction layer consists of three cascaded 2D convolutional layers, all with 1×1 kernels, stride 1 and output channel numbers 128, 128 and 128.
The 1D convolutional layer has kernel size 1, stride 1 and 128 output feature channels;
the keep probability of the Dropout layer is set to 0.5;
the 1D convolutional layer used for classification has kernel size 1, stride 1, and its number of output feature channels is set to the number of segmentation classes L.
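The feature interpolation layer of each PointFP module propagates features from a sparse centroid set back to a denser point set. The sketch below assumes the PointNet++ convention of inverse-distance-weighted interpolation over the three nearest centroids, since the text above only states that the features are interpolated; the interpolated features are then concatenated with the skip features of the corresponding PointSA module before the 2D convolutions listed above.

```python
import numpy as np

def interpolate_features(sparse_xyz, sparse_feats, dense_xyz, k=3, eps=1e-8):
    # For every dense point, blend the features of its k nearest sparse centroids
    # with inverse-distance weights (PointNet++-style feature propagation).
    out = np.zeros((len(dense_xyz), sparse_feats.shape[1]))
    for i, p in enumerate(dense_xyz):
        d = np.linalg.norm(sparse_xyz - p, axis=1)
        nn = np.argsort(d)[:k]
        w = 1.0 / (d[nn] + eps)
        out[i] = (w[:, None] * sparse_feats[nn]).sum(axis=0) / w.sum()
    return out
```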
2.4) Set up the auxiliary network:
The auxiliary network consists of b cascaded PointAux modules followed by a 1D convolutional layer used for classification; each PointAux module consists of a 1D convolutional layer and a feature interpolation layer, where b ≥ 1. In this embodiment b = 2;
In the first PointAux module, the 1D convolutional layer extracts features from the output of the second PointFP module, with kernel size 1, stride 1 and the number of segmentation classes L as output feature channels; its feature interpolation layer interpolates the features extracted by the 1D convolutional layer;
In the second PointAux module, the 1D convolutional layer extracts features from the output of the first PointAux module, with kernel size 1, stride 1 and L output feature channels; its feature interpolation layer interpolates the features extracted by the 1D convolutional layer;
The 1D convolutional layer used for classification classifies the output features of the second PointAux module, with kernel size 1, stride 1 and the number of output feature channels set to the number of segmentation classes L.
Step 3: Set the loss function of the 3D point cloud semantic segmentation network.
In this example the multi-class cross-entropy loss is used as the loss function of the 3D point cloud semantic segmentation network,
where C is the number of training sample points, L is the total number of classes, wk is the weight of class k, and wa ∈ [0,1] is the weight of the auxiliary network loss; in this embodiment wa = 0.5;
pi,k denotes the true probability that the i-th sample point belongs to class k: it equals 1 if the i-th sample point belongs to class k and 0 otherwise;
the probabilities that the i-th sample point belongs to class k, as predicted by the feature upsampling network and by the auxiliary network respectively, are computed from the k-th channel output values of the two branches for that sample point,
namely from the k-th components of f1(xi; θ1) and f2(xi; θ2), where xi is the input feature of the i-th sample point, f1 denotes the feature upsampling network with parameters θ1, and f2 denotes the auxiliary network with parameters θ2.
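As an illustrative sketch of this loss, the code below implements what the description states: a per-class-weighted multi-class cross entropy applied to the outputs of both branches, with the auxiliary term scaled by wa. The softmax normalisation of the channel outputs and the function name are assumptions made here, since the formula images are not reproduced in the extracted text:

```python
import numpy as np

def segmentation_loss(main_out, aux_out, labels, class_w, w_aux=0.5):
    # main_out, aux_out: (C, L) channel outputs of the upsampling and auxiliary branches;
    # labels: (C,) true class indices; class_w: (L,) per-class weights w_k.
    def weighted_ce(out):
        z = out - out.max(axis=1, keepdims=True)
        log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))      # log softmax
        picked = log_p[np.arange(len(labels)), labels]
        return -(class_w[labels] * picked).mean()
    return weighted_ce(main_out) + w_aux * weighted_ce(aux_out)
```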
Step 4: Use the training set T to perform P rounds of supervised training of the 3D point cloud semantic segmentation network, P ≥ 500.
In this embodiment P = 1000 and the training proceeds as follows:
4.1) In the q-th round of training, let lq be the learning rate and θq the parameters of the network model in round q. According to the loss function set in Step 3, the parameters are updated by gradient descent, θq+1 = θq − lq · ∂Loss/∂θq, giving the network model parameters θq+1 for round q+1 and thereby the network model after the q-th round;
4.2) Every P1 rounds, input the test set into the current network model to obtain the predicted category of every point in the test set, P1 ≥ 2; in this embodiment P1 = 5;
4.3) Count the number of points in the test set whose predicted category equals their true category and compute the segmentation accuracy acc = R / H, where R is the number of test points whose predicted category equals their true category and H is the total number of points in the test set;
4.4) Compare the segmentation accuracy acc of the current network model with that of the previously saved network model; if the accuracy of the current model is higher, the current model is the better one and is saved, otherwise it is not saved.
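Steps 4.2-4.4 amount to a periodic keep-the-best evaluation loop; a minimal sketch follows, in which model_predict and save_model are placeholder names for whatever inference and checkpointing routines the implementation provides:

```python
def segmentation_accuracy(pred_labels, true_labels):
    # acc = R / H: fraction of test points whose predicted class equals the true class.
    r = sum(int(p == t) for p, t in zip(pred_labels, true_labels))
    return r / len(true_labels)

# Inside the training loop, every P1 rounds (placeholder calls):
# acc = segmentation_accuracy(model_predict(test_set), test_labels)
# if acc > best_acc:
#     best_acc = acc
#     save_model(current_model)          # keep only the most accurate checkpoint
```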
4.5) After the P rounds of training are completed, take the network model with the highest segmentation accuracy as the trained network model.
Step 5: Input the test set V into the trained network model obtained in step 4.5) for semantic segmentation and obtain the segmentation result of every point.
The technical effect of the present invention is illustrated below with a simulation experiment.
1. Simulation conditions
The simulation experiment of the present invention was carried out in the following environment.
Hardware platform: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20 GHz, 64 GB RAM, Ubuntu 16.04 operating system, GeForce GTX TITAN X;
Software platform: TensorFlow deep learning framework, Python 3.5. The dataset used in the experiment is the ScanNet point cloud dataset.
ScanNet is a point cloud dataset of indoor scenes scanned and reconstructed with RGB-D cameras. It contains 1513 scenes in total, of which 1201 scenes are used as the training set and 312 scenes as the test set, with 21 categories.
2. Simulation experiment
The training set and test set are obtained according to the present invention, the 3D point cloud semantic segmentation network is constructed and trained in a supervised manner with the training set, the trained network model is then used to predict the points of the test set, and the segmentation accuracy of the network on the test set V is computed with the method of step 4.3.
The semantic segmentation accuracy of the present invention on point cloud data is compared with that of the existing PointNet++ method, using segmentation accuracy as the evaluation index; the results are shown in Table 1:
Table 1. Comparison of segmentation accuracy on the ScanNet dataset
As can be seen from Table 1, the segmentation accuracy of the present invention on the ScanNet dataset exceeds that of the prior-art PointNet++ by 1.6%, indicating that the semantic segmentation of 3D point clouds by the present invention is stronger than that of PointNet++.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910604264.0A CN110322453B (en) | 2019-07-05 | 2019-07-05 | 3D point cloud semantic segmentation method based on position attention and auxiliary network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910604264.0A CN110322453B (en) | 2019-07-05 | 2019-07-05 | 3D point cloud semantic segmentation method based on position attention and auxiliary network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110322453A CN110322453A (en) | 2019-10-11 |
CN110322453B true CN110322453B (en) | 2023-04-18 |
Family
ID=68122807
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910604264.0A Active CN110322453B (en) | 2019-07-05 | 2019-07-05 | 3D point cloud semantic segmentation method based on position attention and auxiliary network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110322453B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110827398B (en) * | 2019-11-04 | 2023-12-26 | 北京建筑大学 | Automatic semantic segmentation method for indoor three-dimensional point cloud based on deep neural network |
CN111127493A (en) * | 2019-11-12 | 2020-05-08 | 中国矿业大学 | Remote sensing image semantic segmentation method based on attention multi-scale feature fusion |
CN111223120B (en) * | 2019-12-10 | 2023-08-04 | 南京理工大学 | Point cloud semantic segmentation method |
CN111192270A (en) * | 2020-01-03 | 2020-05-22 | 中山大学 | Point cloud semantic segmentation method based on point global context reasoning |
CN111428619B (en) * | 2020-03-20 | 2022-08-05 | 电子科技大学 | Three-dimensional point cloud head attitude estimation system and method based on ordered regression and soft labels |
CN111583263B (en) * | 2020-04-30 | 2022-09-23 | 北京工业大学 | A point cloud segmentation method based on joint dynamic graph convolution |
CN112633330B (en) * | 2020-12-06 | 2024-02-02 | 西安电子科技大学 | Point cloud segmentation method, system, medium, computer equipment, terminal and application |
CN112560865B (en) * | 2020-12-23 | 2022-08-12 | 清华大学 | A Semantic Segmentation Method for Point Clouds in Large Outdoor Scenes |
CN112927248B (en) * | 2021-03-23 | 2022-05-10 | 重庆邮电大学 | Point cloud segmentation method based on local feature enhancement and conditional random field |
CN113205509B (en) * | 2021-05-24 | 2021-11-09 | 山东省人工智能研究院 | Blood vessel plaque CT image segmentation method based on position convolution attention network |
CN113554653B (en) * | 2021-06-07 | 2024-10-29 | 之江实验室 | Semantic segmentation method based on mutual information calibration point cloud data long tail distribution |
CN113470048B (en) * | 2021-07-06 | 2023-04-25 | 北京深睿博联科技有限责任公司 | Scene segmentation method, device, equipment and computer readable storage medium |
CN114140841A (en) * | 2021-10-30 | 2022-03-04 | 华为技术有限公司 | Processing method of point cloud data, training method of neural network and related equipment |
CN115619963B (en) * | 2022-11-14 | 2023-06-02 | 吉奥时空信息技术股份有限公司 | Urban building entity modeling method based on content perception |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102034267A (en) * | 2010-11-30 | 2011-04-27 | 中国科学院自动化研究所 | Three-dimensional reconstruction method of target based on attention |
CN102036073B (en) * | 2010-12-21 | 2012-11-28 | 西安交通大学 | Method for encoding and decoding JPEG2000 image based on vision potential attention target area |
US11094137B2 (en) * | 2012-02-24 | 2021-08-17 | Matterport, Inc. | Employing three-dimensional (3D) data predicted from two-dimensional (2D) images using neural networks for 3D modeling applications and other applications |
CN103871050B (en) * | 2014-02-19 | 2017-12-29 | 小米科技有限责任公司 | icon dividing method, device and terminal |
US11004202B2 (en) * | 2017-10-09 | 2021-05-11 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for semantic segmentation of 3D point clouds |
US10824862B2 (en) * | 2017-11-14 | 2020-11-03 | Nuro, Inc. | Three-dimensional object detection for autonomous robotic systems using image proposals |
CN109871532B (en) * | 2019-01-04 | 2022-07-08 | 平安科技(深圳)有限公司 | Text theme extraction method and device and storage medium |
- 2019-07-05 CN CN201910604264.0A patent/CN110322453B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN110322453A (en) | 2019-10-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110322453B (en) | 3D point cloud semantic segmentation method based on position attention and auxiliary network | |
CN109410321B (en) | Three-dimensional reconstruction method based on convolutional neural network | |
CN111299815B (en) | A method for visual inspection and laser cutting trajectory planning for low-gray rubber pads | |
CN110245709B (en) | 3D point cloud data semantic segmentation method based on deep learning and self-attention | |
CN109029363A (en) | A kind of target ranging method based on deep learning | |
CN111860587B (en) | Detection method for small targets of pictures | |
CN114973002A (en) | Improved YOLOv 5-based ear detection method | |
CN109377555B (en) | Target feature extraction and recognition method for 3D reconstruction of autonomous underwater robot's foreground field of view | |
CN108629288A (en) | A kind of gesture identification model training method, gesture identification method and system | |
CN110070574B (en) | Binocular vision stereo matching method based on improved PSMAT net | |
CN103310481A (en) | Point cloud reduction method based on fuzzy entropy iteration | |
CN107977660A (en) | Region of interest area detecting method based on background priori and foreground node | |
CN113610905B (en) | Deep learning remote sensing image registration method based on sub-image matching and application | |
CN110222767A (en) | Three-dimensional point cloud classification method based on nested neural and grating map | |
CN116703932A (en) | CBAM-HRNet model wheat spike grain segmentation and counting method based on convolution attention mechanism | |
CN116645595A (en) | Method, device, equipment and medium for recognizing building roof contours from remote sensing images | |
CN117496384A (en) | A method for object detection in drone images | |
CN111339924A (en) | Polarized SAR image classification method based on superpixel and full convolution network | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN113989296A (en) | Unmanned aerial vehicle wheat field remote sensing image segmentation method based on improved U-net network | |
CN112819832A (en) | Urban scene semantic segmentation fine-grained boundary extraction method based on laser point cloud | |
CN114399728B (en) | Foggy scene crowd counting method | |
CN115272278A (en) | Method for constructing change detection model for remote sensing image change detection | |
CN112967296B (en) | Point cloud dynamic region graph convolution method, classification method and segmentation method | |
CN112990336B (en) | Deep three-dimensional point cloud classification network construction method based on competitive attention fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||