CN116310350B - Urban scene semantic segmentation method based on graph convolution and semi-supervised learning network - Google Patents

Urban scene semantic segmentation method based on graph convolution and semi-supervised learning network

Info

Publication number
CN116310350B
CN116310350B
Authority
CN
China
Prior art keywords
point
points
network
category
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310596881.7A
Other languages
Chinese (zh)
Other versions
CN116310350A (en)
Inventor
Wang Cheng (王程)
Chen Jun (陈钧)
Chen Yiping (陈一平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202310596881.7A priority Critical patent/CN116310350B/en
Publication of CN116310350A publication Critical patent/CN116310350A/en
Application granted granted Critical
Publication of CN116310350B publication Critical patent/CN116310350B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network, which comprises the following steps: S1, pre-training a graph convolution network to obtain initialization parameters; S2, inputting an original point set P at one time and outputting the feature vector F_o; S3, for the original point set P, computing the feature vector F_c from the neighborhood of each point; S4, computing the distance between the feature vectors F_o and F_c as a loss function to adjust the parameters of the graph convolution network; S5, using the labeled data to assign pseudo labels to the unlabeled data; S6, using the network with the assigned pseudo labels to perform semantic segmentation and predict the category of each point. The method can realize semantic segmentation of urban road scenes with only a small amount of labeled data.

Description

Urban scene semantic segmentation method based on graph convolution and semi-supervised learning network
Technical Field
The invention relates to the field of computer graphics, and in particular to an urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network.
Background
As unstructured three-dimensional data, point clouds characterize objects more accurately and flexibly than data formats such as voxels and meshes, and have wide application in the field of smart cities. For example, in urban construction planning, digital traffic maps generated from point clouds assist the planning of traffic lines and urban construction, improving planning efficiency and precision; in environment monitoring and analysis, point cloud data can be used to build three-dimensional models of actual scenes for analyzing landforms, hydrogeology, building damage and other conditions, facilitating city management and maintenance.
In practical smart city applications, the point cloud pipeline from acquisition to application can generally be divided into the following five steps: (1) point cloud acquisition; (2) point cloud preprocessing; (3) point cloud feature extraction; (4) point cloud semantic segmentation; (5) downstream model deployment and application.
One difficulty with the above steps is that feature extraction and semantic segmentation require a large amount of annotation data for model training. For feature extraction and semantic segmentation, traditional methods use hand-crafted feature descriptors, while deep learning methods use neural networks to extract features automatically, each then realizing semantic segmentation. However, the training process of these methods is usually supervised learning, i.e. a large amount of labeled data is required for model training. The point clouds of urban scenes are huge in scale, and manually labeling all points is cumbersome and expensive.
Disclosure of Invention
The invention aims to overcome the need for a large amount of annotation data in urban scene semantic segmentation algorithms, and provides an urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network.
The urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network comprises the following steps:
S1, pre-training a graph convolution network with a public, labeled urban road dataset to obtain the initialization parameters of each layer in the graph convolution network;
S2, inputting an original point set P into the initialized graph convolution network at one time, where the points in P contain only coordinate xyz and color rgb information, and outputting the feature vector F_o;
S3, for each point of the original point set P in step S2, using k-NN to find its k nearest points to form a neighborhood, and computing the feature vector F_c from the neighborhood of each point;
S4, computing the distance between the feature vectors F_o and F_c as a loss function for adjusting the parameters of the graph convolution network in step S2;
S5, taking the original point set P as the target semantic segmentation dataset D, which contains labeled data and unlabeled data, where the labeled data accounts for 1%-10% of the points of the original point set P, and then using the labeled data to assign pseudo labels to the unlabeled data in the semi-supervised learning network;
S6, using the network with the pseudo labels assigned in step S5 for inference, performing semantic segmentation and predicting the category of each point.
Further, the step S2 specifically comprises:
S21, encoding the original point set P with the encoder of the graph convolution network to obtain the encoding feature F_E;
S22, decoding the encoding feature F_E with the decoder of the graph convolution network to obtain the decoding feature F_D;
S23, mapping the decoding feature F_D through an MLP to the feature vector F_o, where the dimension of each point in F_o is expressed as (x̄, ȳ, z̄, r₁¹, r₂¹, r₃¹, r₁², r₂², r₃²), with (x̄, ȳ, z̄) and the r terms representing the encoded coordinate and color features respectively; the subscript of r denotes the feature channel, a superscript 1 denotes the mean, and a superscript 2 denotes the variance.
Further, the step S3 specifically comprises:
For each point of the original point set P in step S2, using k-NN to find its k nearest points to form a neighborhood, and computing the feature vector F_c from the neighborhood of each point, where the dimension of each point is expressed as (c̄₁, c̄₂, c̄₃, r̄₁, r̄₂, r̄₃, σ₁², σ₂², σ₃²);
The calculation process is as follows:

$$\bar{c}_i = \frac{1}{k}\sum_{n=1}^{k} c_i^n, \qquad \bar{r}_i = \frac{1}{k}\sum_{n=1}^{k} r_i^n, \qquad \sigma_i^2 = \frac{1}{k}\sum_{n=1}^{k}\left(r_i^n - \bar{r}_i\right)^2$$

where c̄_i denotes the mean of the neighborhood coordinate channels of each point, r̄_i denotes the mean of the neighborhood color channels of each point, σ_i² denotes the variance of the neighborhood color channels of each point, i takes 1, 2, 3 so that the self-learning process pairs each feature channel of F_o and F_c with a computed feature distance, and n is the index over the k neighboring points of each point.
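For concreteness, the following sketch computes F_c as just described. It is an illustrative NumPy implementation under stated assumptions (per-point layout [x, y, z, r, g, b], brute-force k-NN); the helper name is ours, not the patent's.

```python
import numpy as np

def neighborhood_features(points, k=16):
    """Compute F_c: per-point neighborhood coordinate means plus color
    means and variances, as described in step S3.

    points: (N, 6) array with columns x, y, z, r, g, b.
    Returns: (N, 9) array (3 coord means | 3 color means | 3 color variances).
    """
    xyz, rgb = points[:, :3], points[:, 3:]
    # Brute-force k-NN on coordinates; memory is O(N^2), so a KD-tree or
    # GPU neighbor search would replace this at the 65536-point scale.
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)   # (N, N)
    idx = np.argsort(d2, axis=1)[:, :k]                       # (N, k) neighbor indices
    c_mean = xyz[idx].mean(axis=1)   # mean of neighborhood coordinate channels
    r_mean = rgb[idx].mean(axis=1)   # mean of neighborhood color channels
    r_var = rgb[idx].var(axis=1)     # variance of neighborhood color channels
    return np.concatenate([c_mean, r_mean, r_var], axis=1)    # (N, 9)
```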
Further, the step S4 specifically comprises:
Assume that the original point set P input into the graph convolution network in step S2 contains N points; the coordinate distance is computed as the Euclidean distance and the color distance as the Manhattan distance;
The loss function of the coordinate distance L_xyz is:

$$L_{xyz} = \frac{1}{N}\sum_{\alpha=1}^{N}\left\| c_{\alpha}^{o} - c_{\alpha}^{c} \right\|_{2}$$

The loss function of the color distance L_rgb is:

$$L_{rgb} = \frac{1}{N}\sum_{\alpha=1}^{N}\left\| r_{\alpha}^{o} - r_{\alpha}^{c} \right\|_{1}$$

Finally, the loss function is:

$$L = \lambda_{1} L_{xyz} + \lambda_{2} L_{rgb}$$

where α is the index of each point in the original point set P, c_α^o and r_α^o are the coordinate and color parts of F_o, c_α^c and r_α^c are the coordinate and color parts of F_c, and λ₁ and λ₂ are two hyperparameters of the graph convolution network, set to 1/3 and 2/3 respectively; the loss function is used to train the graph convolution network in step S2 to further adjust the parameters of its encoder and decoder.
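A minimal sketch of this loss, assuming F_o and F_c are laid out as [3 coordinate channels | 6 color channels] per point; the function name and tensor layout are illustrative, not from the patent.

```python
import torch

def self_learning_loss(f_o, f_c, lam1=1/3, lam2=2/3):
    """Distance between network output F_o and neighborhood feature F_c.

    f_o, f_c: (N, 9) tensors. The coordinate part uses the Euclidean
    distance, the color part the Manhattan distance, combined with the
    weights lam1 = 1/3 and lam2 = 2/3 as in step S4.
    """
    l_xyz = torch.linalg.norm(f_o[:, :3] - f_c[:, :3], dim=1).mean()  # Euclidean
    l_rgb = torch.abs(f_o[:, 3:] - f_c[:, 3:]).sum(dim=1).mean()      # Manhattan
    return lam1 * l_xyz + lam2 * l_rgb
```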
Further, the step S5 specifically comprises:
S51, taking the original point set P as the target semantic segmentation dataset D; D is a set containing N points; in the original point set P, the point set with labeled data is denoted P_L with N_L points, and the point set of unlabeled data is denoted P_U with N_U points, so that P_L ∪ P_U = D and N_L + N_U = N;
S52, using the encoder and decoder trained and adjusted in step S4, replacing the MLP outputting 9 dimensions in step S4 with an MLP outputting d dimensions, and denoting the output d-dimensional vector of each point as f;
S53, willThe feature corresponding to the point containing the tag is expressed as +.>The feature corresponding to the unlabeled point is expressed asThe method comprises the steps of carrying out a first treatment on the surface of the Then->
wherein , and />All are->Vectors of dimensions and using indices for distinguishing between different points, and +.>Comprises the following componentsCategory 0 </o>≤/>,/>Is->The actual category number needed to be semantically divided;
s54, selecting the data belonging to the category from the known tagged dataAnd calculating the feature average of these points to obtain the class average feature vector +.>
wherein ,the expression category is +.>The number of points, +.>Representation->The category of the corresponding point is->Then, input +.>Calculating an average eigenvector +.>, wherein />The method comprises the steps of carrying out a first treatment on the surface of the For the followingIn the remaining non-existent categories,/->Marking as zero vector;
S55, computing the similarity matrix simi between the feature vectors f_s^U of the unlabeled points and f̄_j:

$$simi_s^j = e^{-\left\| \bar{f}_j - f_s^{U} \right\|_2}$$

where ‖f̄_j − f_s^U‖₂ is the Euclidean distance between the average feature vector of category j and the vector corresponding to the unlabeled point s; the superscript of simi denotes the category, the subscript of simi denotes a certain point, 0 < simi_s^j ≤ 1, e is the base of the natural logarithm with the bracketed term as its exponent, and the dimension of simi is c × N_U;
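The class-average feature vectors and the similarity matrix of steps S54-S55 can be sketched as follows; tensor shapes follow the text (f_label: (N_L, d), f_unlabel: (N_U, d)), and the helper names are ours.

```python
import torch

def class_prototypes(f_label, labels, num_classes):
    """Average feature vector per category; zero vector for absent classes."""
    protos = torch.zeros(num_classes, f_label.shape[1])
    for j in range(num_classes):
        mask = labels == j
        if mask.any():
            protos[j] = f_label[mask].mean(dim=0)   # f-bar_j over points of class j
    return protos                                   # (c, d)

def similarity_matrix(protos, f_unlabel):
    """simi[j, s] = exp(-||protos[j] - f_unlabel[s]||_2), shape (c, N_u)."""
    dist = torch.cdist(protos, f_unlabel)           # pairwise Euclidean distances
    return torch.exp(-dist)                         # values in (0, 1]
```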
S56, the feature vector in the step S53Mapping to vector +.>As a result of the prediction,is>
For the followingLabeled points, class prediction is directly realized by using a Softmax classifier and a cross entropy loss function, and the loss function calculated by the points is +.>
For the followingThe unlabeled dots are first generated into pseudo-labels and then used for sum +.>Comparison, specific: first selecting similarity matrix category by category>Highest confidence +.>Dots, assume common selection->Dots (/ -)>≤/>≤/>) Then selecting the category with highest confidence level for the selected points point by point, and updating the category>Maximum confidence of pseudo tags of the points and corresponding tag values;
will beThe predictive loss function of each unlabeled dot is designed to:
wherein the subscriptRepresentation->Any one of the points, s represents +.>Index of individual unlabeled dots, +.>Is->The number of categories contained in the list, m represents the index of the number of categories,/->Probability value representing final predictive label, +.>Representation ofA pseudo tag class, wherein, when the pseudo tag class and the predicted class are the same,mtaking 1, otherwise taking 0,>indicating when the point is +.>Taking 1 when in, otherwise taking 0;
S57, the loss function L_total of the whole graph convolution network is:

$$L_{total} = L_{label} + \gamma L_{unlabel}$$

where the weight γ increases with the training round: with epoch denoting the current training round and max-epoch the maximum training round, a smaller weight γ is used at the beginning of training.
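A sketch of the combined loss of step S57; since the text states only that a smaller weight is used early in training, the linear ramp gamma = epoch / max_epoch is an assumption, as are the function and argument names.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_l, y_l, logits_u, pseudo_y, pseudo_mask, epoch, max_epoch):
    """L_total = L_label + gamma * L_unlabel with a ramp-up weight.

    logits_l/y_l: labeled-point scores and labels; logits_u/pseudo_y: unlabeled
    scores and pseudo labels; pseudo_mask: the N_pse selected points.
    """
    l_label = F.cross_entropy(logits_l, y_l)
    if pseudo_mask.any():
        # cross entropy only over the selected pseudo-labeled points
        l_unlabel = F.cross_entropy(logits_u[pseudo_mask], pseudo_y[pseudo_mask])
    else:
        l_unlabel = logits_u.sum() * 0.0   # keep the graph differentiable
    gamma = epoch / max_epoch              # assumed ramp-up schedule
    return l_label + gamma * l_unlabel
```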
Further, the step S6 specifically comprises:
Iterating the pseudo-label assignment and training process of steps S51-S57 until the network converges on the target dataset D; in the final prediction, the trained network is used with the computation of the similarity matrix simi removed, a Softmax classifier is applied to all points of the current input P, and the remaining points of the dataset are read in and iterated over, realizing semantic segmentation and class prediction for all points of the target dataset D.
After the above technical scheme is adopted, the invention has the following advantages over the background art:
1. The invention adopts the idea of transfer learning and makes full use of the similar characteristics of different urban scenes: the disclosed labeled dataset is used to obtain the initialization parameters of the graph convolution network, which helps improve the stability of the neural network's representations across different datasets;
2. The invention adopts a self-learning pre-training task that makes full use of the local and color characteristics of objects in urban scenes; without using any labeled data, it can learn the prior distribution of the data to fine-tune the network parameters;
3. The invention uses semi-supervised learning to reduce the dependence on labeled data, so that high-quality pseudo labels can be generated from a small amount of labeled data, improving the effect of semi-supervised learning, realizing semantic segmentation of the target dataset, and greatly reducing the dependence on manually labeled data.
Drawings
FIG. 1 is a flow chart of pre-training the graph convolution network and fine-tuning its parameters according to the present invention;
fig. 2 is a flowchart of a training process for generating a pseudo tag by the semi-supervised learning network of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Examples
The urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network comprises the following steps:
(I) Obtaining initialization parameters by pre-training the graph convolution network (realized by the following step S1)
S1, pre-training a graph convolution network by using a public and labeled data set to obtain initialization parameters of each layer in the graph convolution network;
the encoder of the graph rolling network consists of a multi-layer perceptron layer and four graph rolling modules, wherein the four graph rolling modules are numbered as 1,2,3 and 4 in sequence; the input features of the graph convolution module are expressed asThe output characteristic is->, wherein />I.e. the output of the previous convolution module is the input of the next convolution module and the current module input point is +.>Input feature dimension is->The characteristic dimension after convolution is +.>The number of points is reduced to +.>
The embodiment adopts public and marked urban road data as a pre-training data set for pre-training. The Toronto3D of the public data set acquires a 1km high-quality city market scenery spot cloud by using a vehicle-mounted laser radar, and the data set manually marks more than eight millions of spots, so that 8 common city scene categories are covered: road, zebra crossing, building, power line, power tower and automobileAnd a fence, and all contain coordinate and color information. Selecting a point cloud in Toronto3DAs input to the pre-trained graph convolution network, in this embodiment +.>If the number of points (1) is 65536 +.>Is [65536, 6 ]]。
First, an encoder is used to obtain encoding featuresWherein MLP (i.e. multi-layer perceptron) is used to determine [65536, 6]Mapping to [65536,16 ]]Then inputting the extracted features into four graph convolution modules; then the decoding characteristics are obtained through a decoder>Its dimension is [65536,16 ]]The method comprises the steps of carrying out a first treatment on the surface of the Then, the pair ++is implemented using fully connected network and Softmax classifier>The classification forecast of each point in the fully connected network is set as (16, 64, 64, 8) for the characteristic dimension change of each point; then, cross entropy is used as a loss function, random gradient descent is optimized, the network is pre-trained, and parameters of each layer of the network are updated; finally, the above process is repeated until the network converges. The convergence condition is set to be 100 rounds of fixed training, and if the prediction accuracy of 20 continuous rounds is not improved, the training can be stopped. Instead of random initialization, the network of the present embodiment employs pre-trained parameter initialization.
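The pre-training loop can be sketched as follows. The per-point MLP backbone below is only a shape-compatible stand-in for the graph convolution encoder-decoder (which the patent describes but does not list), and the random tensors stand in for a labeled Toronto3D block.

```python
import torch
import torch.nn as nn

# Stand-in for the graph-convolution encoder-decoder: a per-point MLP mapping
# 6-dim inputs to 16-dim decoder features, mirroring tensor shapes only.
backbone = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                     nn.Linear(64, 64), nn.ReLU(),
                     nn.Linear(64, 8))          # per-point (16, 64, 64, 8) classifier

opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=0.01)
points = torch.rand(65536, 6)                   # placeholder for a Toronto3D block
labels = torch.randint(0, 8, (65536,))          # placeholder labels

for epoch in range(100):                        # fixed 100 rounds in the embodiment
    logits = head(backbone(points))             # (65536, 8) per-point class scores
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```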
(II) Performing the self-learning training task to fine-tune parameters (realized by the following steps S2-S4)
S2, inputting an original point set into the initialized graph rolling network at one time,/>The points in (a) contain only the coordinate xyz and the color rgb information, and the feature vector +.>
S3, the original point set in the step S2Using k-NN to find k adjacent points to form neighborhood, calculating feature vector according to neighborhood of each point>
S4, calculating a feature vector and />As a loss function for adjusting the parameters of the graph roll-up network in step S2;
the step S2 specifically includes:
since step S1 has initialized the layer parameters of the graph rolling network, step S2 uses only the encoder and decoder portions thereof, modifies the fully connected network and Softmax classifier in step S1 to the MLP layer, and then performs the following steps.
S21, using the encoder of the graph rolling network to perform the initial point setCoding to obtain coding feature->
In particular, for an original set of points that are input into the graph rolling network at one timeWherein the dots contain only 6-dimensional features of the coordinates xyz and the colors rgb, denoted as feature +.>Will->Mapping to 16 dimension by a multi-layer perceptron, the output is characterized by +>And is used as the input of the first graph rolling module, the output of the former graph rolling module is the input of the next graph rolling module, and the characteristic is output after passing through the four graph rolling modules ∈ ->I.e. the final coding feature +.>
S22, reusing the decoder pair coding features of the graph rolling networkDecoding to obtain decoding characteristic->
The coding features obtained in S21Feature ∈The same-dimensional mapping using MLP>And as decoder input, decoding characteristics +.>Decoding after the up-sampling of the adjacent points and the jump connection of the MLP down-sum encoder to obtain the output characteristic +.>. Wherein the decoding feature uses the subscript +.>And superscript->In distinction to the features of the encoder,and sequentially taking 4,3,2 and 1. The encoder is jump-connected to the decoder, i.e. has coding features of the same dimension +.>Decoding characteristics->The added features are used as input features for subsequent layers. Decoding characteristics->Namely decoding characteristic->
S23, decoding the characteristics through MLPThe mapping output is the feature vector +.>The method comprises the steps of carrying out a first treatment on the surface of the Wherein the feature vector->The dimension of each point in (a) is expressed as +.>,/> and />Representing the encoded coordinates and color features, r, respectivelyThe subscript indicates the characteristic channel, 1 in the superscript of r indicates the mean, and 2 indicates the variance.
The step S3 specifically comprises:
First, for each point of the original point set P in step S2, k-NN is used to find the k nearest points, which form its neighborhood;
Specifically, for each point p_i of the original point set P input to the network, a set of nearest neighbor points is found using k-NN, and the coordinate information is then embedded:

f_n = LBR(p_i, p_i^n, p_i − p_i^n, ‖p_i − p_i^n‖)

where the coordinate feature f_n is obtained from the spatial position relation between the point p_i and its neighboring point p_i^n; specifically, the absolute coordinates p_i and p_i^n of the two points, their offset p_i − p_i^n and their spatial distance ‖p_i − p_i^n‖ are concatenated. The symbol LBR denotes that the concatenated feature vector passes through a Linear layer, a BatchNorm layer and a ReLU layer in sequence; within each graph convolution module, f_n is mapped to the same dimension as the point-set features input to that module.
Then, the relation between the point p_i and the neighboring point p_i^n is expressed as the edge relation e:

e = R(g(F_in, f_n))

where the point-set feature F_in input to the graph convolution module and its coordinate feature f_n are concatenated and then weighted with a learnable weight g, which can be realized with an MLP, a 1D-CNN, etc.; R denotes the ReLU layer. Finally, the edge features of each point are aggregated channel by channel using Max-Pooling, and random sampling is used to reduce the number of points, giving the output feature F_out.
Feature vectors F_c are then calculated from the neighborhood of each point, where the dimension of each point is expressed as (c̄₁, c̄₂, c̄₃, r̄₁, r̄₂, r̄₃, σ₁², σ₂², σ₃²).
The calculation process is as follows:

$$\bar{c}_i = \frac{1}{k}\sum_{n=1}^{k} c_i^n, \qquad \bar{r}_i = \frac{1}{k}\sum_{n=1}^{k} r_i^n, \qquad \sigma_i^2 = \frac{1}{k}\sum_{n=1}^{k}\left(r_i^n - \bar{r}_i\right)^2$$

where c̄_i denotes the mean of the neighborhood coordinate channels of each point, r̄_i denotes the mean of the neighborhood color channels of each point, σ_i² denotes the variance of the neighborhood color channels of each point, i takes 1, 2, 3 so that the self-learning process pairs each feature channel with a computed feature distance, and n is the index over the k neighboring points of each point. Because urban street scene point clouds are sparsely and unevenly distributed, the neighborhood coordinate variance is large, so the network uses only the coordinate mean. Objects such as vegetation (green) and pavement (gray) have distinct color characteristics, and local color variation is generally smooth, so both the mean and the variance are used for color.
The step S4 specifically comprises:
Assume that the original point set P input into the graph convolution network in step S2 contains N points; the coordinate distance is computed as the Euclidean distance and the color distance as the Manhattan distance;
The loss function of the coordinate distance L_xyz is:

$$L_{xyz} = \frac{1}{N}\sum_{\alpha=1}^{N}\left\| c_{\alpha}^{o} - c_{\alpha}^{c} \right\|_{2}$$

The loss function of the color distance L_rgb is:

$$L_{rgb} = \frac{1}{N}\sum_{\alpha=1}^{N}\left\| r_{\alpha}^{o} - r_{\alpha}^{c} \right\|_{1}$$

Finally, the loss function is:

$$L = \lambda_{1} L_{xyz} + \lambda_{2} L_{rgb}$$

where α is the index of each point in the original point set P, c^o and r^o are the coordinate and color parts of F_o, c^c and r^c are the coordinate and color parts of F_c, and λ₁ and λ₂ are two hyperparameters of the graph convolution network, set to 1/3 and 2/3 respectively; the loss function is used to train the graph convolution network in step S2 to further adjust the parameters of its encoder and decoder.
Steps S2-S4 realize the fine-tuning of the parameters of the pre-trained graph convolution network. Specifically, in this embodiment:
(1) The pre-trained encoder and decoder are fixed first, and the fully connected layer after the decoder is replaced by a multi-layer perceptron (MLP) whose per-point feature dimension change is set to (16, 32, 9). A point cloud P is constructed from the target semantic segmentation dataset D (constructed in the same way as in the pre-training step above; only the dataset changes), and the point cloud passes through the network of FIG. 1 to output the feature F_o, whose dimension is [65536, 9].
(2) At the same time, a neighborhood is constructed for each point of P using k-NN, with the number of neighborhood points k set to 16. The feature of one point is computed as:

$$\bar{c}_i = \frac{1}{k}\sum_{n=1}^{k} c_i^n, \qquad \bar{r}_i = \frac{1}{k}\sum_{n=1}^{k} r_i^n, \qquad \sigma_i^2 = \frac{1}{k}\sum_{n=1}^{k}\left(r_i^n - \bar{r}_i\right)^2$$

where c̄_i denotes the mean of the neighborhood coordinate channels of the point, r̄_i the mean of its neighborhood color channels, and σ_i² the variance of its neighborhood color channels; i takes 1, 2, 3, so that each feature channel corresponds to a computed feature distance, and n is the index over the k neighboring points. The feature of this point is expressed as (c̄₁, c̄₂, c̄₃, r̄₁, r̄₂, r̄₃, σ₁², σ₂², σ₃²); all points of the constructed P then give the feature F_c, whose dimension is [65536, 9].
(3) The distance between F_o and F_c is calculated as the loss for training the network of FIG. 1. Feature distances associated with coordinates use the Euclidean distance, and feature distances associated with color use the Manhattan distance. The coordinate distance loss function L_xyz is:

$$L_{xyz} = \frac{1}{N}\sum_{\alpha=1}^{N}\left\| c_{\alpha}^{o} - c_{\alpha}^{c} \right\|_{2}$$

The color distance loss function L_rgb is:

$$L_{rgb} = \frac{1}{N}\sum_{\alpha=1}^{N}\left\| r_{\alpha}^{o} - r_{\alpha}^{c} \right\|_{1}$$

Finally, the loss function is:

$$L = \lambda_{1} L_{xyz} + \lambda_{2} L_{rgb}$$

where α is the index of each point in the original point set P, and λ₁ and λ₂ are two hyperparameters set to 1/3 and 2/3.
Training is optimized with stochastic gradient descent for a fixed 30 rounds. This pre-training fine-tunes the parameters of the encoder and decoder to adapt them to the encoding of the dataset D.
(III) Generating pseudo labels and performing semantic segmentation with the semi-supervised learning network (realized by the following steps S5 and S6)
S5, taking the original point set P as the target semantic segmentation dataset D, which contains a small amount of labeled data and a large amount of unlabeled data, where the labeled data accounts for 1%-10% of the points of the original point set P, and then using the labeled data to assign pseudo labels to the unlabeled data in the semi-supervised learning network;
S6, using the network with the pseudo labels assigned in step S5 for inference, performing semantic segmentation and predicting the category of each point.
The step S5 specifically comprises:
S51, taking the original point set P as the target semantic segmentation dataset D; D is a set containing N points; in the original point set P, the point set with labeled data is denoted P_L with N_L points, and the point set of unlabeled data is denoted P_U with N_U points, so that P_L ∪ P_U = D and N_L + N_U = N;
S52, using the encoder and decoder trained and adjusted in step S4, replacing the MLP outputting 9 dimensions in step S4 with an MLP outputting d dimensions, and denoting the output d-dimensional vector of each point as f;
S53, denoting the feature corresponding to a labeled point in D as f^L and the feature corresponding to an unlabeled point as f^U, so that the features of D are {f^L, f^U}; f^L and f^U are both d-dimensional vectors with subscripts distinguishing different points, and D contains c categories indexed by j, 0 < j ≤ c, where c is the actual number of categories to be semantically segmented;
S54, selecting from the known labeled data the points belonging to category j and computing the feature average of these points to obtain the class average feature vector f̄_j:

$$\bar{f}_j = \frac{1}{N_j}\sum_{y_l = j} f_l^{L}$$

where N_j denotes the number of points whose category is j and y_l = j indicates that the category of the point corresponding to f_l^L is j; then, for the N_L labeled points of the input, an average feature vector f̄_j is computed for each category present, 0 < j ≤ c; for the remaining categories not present among the labeled points, f̄_j is recorded as the zero vector;
S55, computing the similarity matrix simi between the feature vectors f_s^U of the unlabeled points and f̄_j:

$$simi_s^j = e^{-\left\| \bar{f}_j - f_s^{U} \right\|_2}$$

where ‖f̄_j − f_s^U‖₂ is the Euclidean distance between the average feature vector of category j and the vector corresponding to the unlabeled point s; the superscript of simi denotes the category, the subscript denotes a certain point, 0 < simi_s^j ≤ 1, e is the base of the natural logarithm with the bracketed term as its exponent, and the dimension of simi is c × N_U;
S56, mapping the feature vector f in step S53 to the vector p as the prediction result, where p for each point is a c-dimensional vector.
For the N_L labeled points, class prediction is directly realized using a Softmax classifier and a cross-entropy loss function, and the loss computed on these points is L_label.
For the N_U unlabeled points, pseudo labels are first generated and then compared with p; if points whose pseudo labels have low confidence were used, larger errors would be introduced into the segmentation result. Therefore, the N_max points with the highest confidence can be selected from the similarity matrix simi category by category, and the category with the highest confidence is then chosen point by point for the selected points.
Specifically: the N_max points with the highest confidence are selected from the similarity matrix simi category by category; assume N_pse points are selected in total, N_max ≤ N_pse ≤ c·N_max; the category with the highest confidence is selected point by point for the selected points, and the maximum confidence of the pseudo labels of these N_pse points and the corresponding label values are updated;
The predictive loss function of the unlabeled points is designed as:

$$L_{unlabel} = -\sum_{s=1}^{N_U} \mathbb{1}\left(s \in N_{pse}\right) \sum_{m=1}^{c} y_{s,m} \log p_{s,m}$$

where the subscript s denotes any one of the N_U unlabeled points and serves as their index, c is the number of categories contained in D, m is the index over the categories, p_{s,m} is the probability value of the final predicted label, y_{s,m} represents the pseudo label category and takes 1 when the pseudo label category and the predicted category m are the same, otherwise 0, and 𝟙(s ∈ N_pse) takes 1 when the point is among the N_pse selected points, otherwise 0;
S57, the loss function L_total of the whole graph convolution network is:

$$L_{total} = L_{label} + \gamma L_{unlabel}$$

where the weight γ increases with the training round: with epoch denoting the current training round and max-epoch the maximum training round, a smaller weight γ is used at the beginning of training.
The step S6 specifically comprises:
Iterating the pseudo-label assignment and training process of steps S51-S57 until the network converges on the target dataset D; in the final prediction, the trained network is used with the computation of the similarity matrix simi removed, a Softmax classifier is applied to all points of the current input P, and the remaining points of the dataset are read in and iterated over, realizing semantic segmentation and class prediction for all points of the target dataset D.
Specifically, this embodiment modifies the layer after the decoder in FIG. 1 into two MLPs: the per-point feature dimension change of the first MLP is set to (16, 32, 32), outputting the feature f with dimension [65536, 32]; the per-point feature dimension change of the second MLP is set to (32, 32, 8), outputting the feature p with dimension [65536, 8]. The modified network architecture is shown in FIG. 2.
The target semantic segmentation dataset D must contain a small amount of labeled data and a large amount of unlabeled data, and in step S5 the labeled data is used to assign pseudo labels to the unlabeled data. A point cloud P is constructed from the target semantic segmentation dataset D (in the same way as in the self-learning pre-training task above), and each P constructed during semi-supervised training must have 1%-10% of its points carrying label information. For example, in one training round, the 65536 points contain 4096 labeled points spanning 5 categories, so the labeled share is 6.25%. Using f^L, the feature average of the labeled points of each category is computed to obtain the class average feature vector f̄_j:

$$\bar{f}_j = \frac{1}{N_j}\sum_{y_l = j} f_l^{L}$$

where y_l = j indicates that a labeled point belongs to category j with corresponding feature f_l^L. Then, for the 4096 labeled points of the input, an average feature vector f̄_j is computed for each of the 5 categories present, where 0 < j ≤ 8; for the remaining 3 absent categories, f̄_j is recorded as the zero vector.
Next, the similarity matrix simi between the feature vectors f_s^U of the unlabeled points and f̄_j is computed:

$$simi_s^j = e^{-\left\| \bar{f}_j - f_s^{U} \right\|_2}$$

where 0 < simi_s^j ≤ 1, and the dimension of simi is [8, 61440].
Then, the feature vector f is mapped to the vector p as the prediction result, where p for each point is an 8-dimensional vector. The 4096 labeled points directly realize class prediction using a Softmax classifier and a cross-entropy loss function. For the 61440 unlabeled points, however, pseudo labels need to be generated for comparison with p. First, the N_max points with the highest confidence are selected from the similarity matrix simi category by category, and the category with the highest confidence is then selected point by point for the selected points. Assume N_pse points are selected in total, N_max ≤ N_pse ≤ 8·N_max; the maximum confidence of the pseudo labels of these N_pse points and the corresponding label values are then updated. N_max is set to 50% of the number of points of each category. The number of training rounds is set to 100 and Adam is selected as the optimization method.
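The per-class pseudo-label selection just described can be sketched as follows; `per_class_n` (the 50%-of-each-category quota) is assumed to be precomputed, and the function name is illustrative.

```python
import torch

def select_pseudo_labels(simi, per_class_n):
    """Per-class top-confidence selection from simi (c, N_u): for each
    category, take that category's most confident points, then assign each
    selected point the category with the highest confidence.

    per_class_n: per-category selection quota, e.g. a list of ints.
    Returns (mask, pseudo_y): boolean mask over unlabeled points and labels.
    """
    c, n_u = simi.shape
    mask = torch.zeros(n_u, dtype=torch.bool)
    for j in range(c):
        k = min(per_class_n[j], n_u)                # quota for category j
        mask[torch.topk(simi[j], k).indices] = True # most confident points of class j
    pseudo_y = simi.argmax(dim=0)                   # highest-confidence category per point
    return mask, pseudo_y
```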
Finally, the computation of the similarity matrix simi is removed from the trained network, i.e. the computation from f^U to simi and from simi to the pseudo labels is removed. Then, for the point cloud P input to the network at test time, a Softmax classifier is used for all its points, and P is constructed iteratively until all points of D have been read and their label values predicted, realizing the semantic segmentation.
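A minimal inference sketch of this final step, with `model` standing in for the trained network of FIG. 2 (similarity branch removed) and `blocks` for the iteratively constructed 65536-point inputs:

```python
import torch

@torch.no_grad()
def predict_labels(model, blocks):
    """Apply the trained network block by block until all of D is read,
    returning the predicted label value of every point."""
    preds = []
    for pts in blocks:                  # each pts: (65536, 6) point cloud P
        logits = model(pts)             # (65536, 8) per-point class scores
        preds.append(logits.softmax(dim=1).argmax(dim=1))
    return torch.cat(preds)
```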
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. The urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network is characterized by comprising the following steps:
S1, pre-training a graph convolution network with a public, labeled urban road dataset to obtain the initialization parameters of each layer in the graph convolution network;
S2, inputting an original point set P into the initialized graph convolution network at one time, where the points in P contain only coordinate xyz and color rgb information, and outputting the feature vector F_o;
The step S2 specifically comprises the following steps:
S21, encoding the original point set P with the encoder of the graph convolution network to obtain the encoding feature F_E; for the original point set P input once into the graph convolution network, where the points contain only the 6-dimensional features of the coordinates xyz and the colors rgb, denoted as the feature F_0, F_0 is mapped to 16 dimensions through an MLP, the output feature serves as the input of the first graph convolution module, the output of each graph convolution module is the input of the next, and the feature output after the four graph convolution modules is the final encoding feature F_E;
S22, decoding the encoding feature F_E with the decoder of the graph convolution network to obtain the decoding feature F_D;
S23, mapping the decoding feature F_D through an MLP to the feature vector F_o, where the dimension of each point in F_o is expressed as (x̄, ȳ, z̄, r₁¹, r₂¹, r₃¹, r₁², r₂², r₃²), with (x̄, ȳ, z̄) and the r terms representing the encoded coordinate and color features respectively; the subscript of r denotes the feature channel, a superscript 1 denotes the mean, and a superscript 2 denotes the variance;
S3, for each point of the original point set P in step S2, using k-NN to find its k nearest points to form a neighborhood, and computing the feature vector F_c from the neighborhood of each point;
S4, computing the distance between the feature vectors F_o and F_c as a loss function for adjusting the parameters of the graph convolution network in step S2;
The step S4 specifically comprises:
Assume that the original point set P input into the graph convolution network in step S2 contains N points; the coordinate distance is computed as the Euclidean distance and the color distance as the Manhattan distance;
The loss function of the coordinate distance L_xyz is:

$$L_{xyz} = \frac{1}{N}\sum_{\alpha=1}^{N}\left\| c_{\alpha}^{o} - c_{\alpha}^{c} \right\|_{2}$$

The loss function of the color distance L_rgb is:

$$L_{rgb} = \frac{1}{N}\sum_{\alpha=1}^{N}\left\| r_{\alpha}^{o} - r_{\alpha}^{c} \right\|_{1}$$

Finally, the loss function is:

$$L = \lambda_{1} L_{xyz} + \lambda_{2} L_{rgb}$$

where α is the index of each point in the original point set P, c_α^o and r_α^o are the coordinate and color parts of F_o, c_α^c and r_α^c are the coordinate and color parts of F_c, and λ₁ and λ₂ are two hyperparameters of the graph convolution network, set to 1/3 and 2/3 respectively; the loss function is used to train the graph convolution network in step S2 and further adjust the parameters of its encoder and decoder;
S5, taking the original point set P as the target semantic segmentation dataset D, which contains labeled data and unlabeled data, where the labeled data accounts for 1%-10% of the points of the original point set P, and then using the labeled data to assign pseudo labels to the unlabeled data in the semi-supervised learning network;
S6, using the network with the pseudo labels assigned in step S5 for inference, performing semantic segmentation and predicting the category of each point.
2. The urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network as set forth in claim 1, wherein the step S3 specifically comprises:
For each point of the original point set P in step S2, using k-NN to find its k nearest points to form a neighborhood, and computing the feature vector F_c from the neighborhood of each point, where the dimension of each point is expressed as (c̄₁, c̄₂, c̄₃, r̄₁, r̄₂, r̄₃, σ₁², σ₂², σ₃²);
The calculation process is as follows:

$$\bar{c}_i = \frac{1}{k}\sum_{n=1}^{k} c_i^n, \qquad \bar{r}_i = \frac{1}{k}\sum_{n=1}^{k} r_i^n, \qquad \sigma_i^2 = \frac{1}{k}\sum_{n=1}^{k}\left(r_i^n - \bar{r}_i\right)^2$$

where c̄_i denotes the mean of the neighborhood coordinate channels of each point, r̄_i denotes the mean of the neighborhood color channels of each point, σ_i² denotes the variance of the neighborhood color channels of each point, i takes 1, 2, 3 so that each feature channel of F_o corresponds to one computed feature distance, and n is the index over the k neighboring points of each point.
3. The urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network as set forth in claim 2, wherein the step S5 specifically comprises:
S51, taking the original point set P as the target semantic segmentation dataset D; D is a set containing N points; in the original point set P, the point set with labeled data is denoted P_L with N_L points, and the point set of unlabeled data is denoted P_U with N_U points, so that P_L ∪ P_U = D and N_L + N_U = N;
S52, using the encoder and decoder trained and adjusted in step S4, replacing the MLP outputting 9 dimensions in step S4 with an MLP outputting d dimensions, and denoting the output d-dimensional vector of each point as f;
S53, denoting the feature corresponding to a labeled point in D as f^L and the feature corresponding to an unlabeled point as f^U, so that the features of D are {f^L, f^U}, where f^L and f^U are both d-dimensional vectors and subscripts are used to distinguish different points; D contains c categories indexed by j, 0 < j ≤ c, where c is the actual number of categories to be semantically segmented;
S54, selecting from the known labeled data the points belonging to category j and computing the feature average of these points to obtain the class average feature vector f̄_j:

$$\bar{f}_j = \frac{1}{N_j}\sum_{y_l = j} f_l^{L}$$

where N_j denotes the number of points whose category is j and y_l = j indicates that the category of the point corresponding to f_l^L is j; then, for the N_L labeled points of the input, an average feature vector f̄_j is computed for each category present, 0 < j ≤ c; for the remaining categories not present among the labeled points, f̄_j is recorded as the zero vector;
S55, computing the similarity matrix simi between the feature vectors f_s^U of the N_U unlabeled points and f̄_j:

$$simi_s^j = e^{-\left\| \bar{f}_j - f_s^{U} \right\|_2}$$

where ‖f̄_j − f_s^U‖₂ is the Euclidean distance between the average feature vector of category j and the vector corresponding to the unlabeled point s; the superscript of simi denotes the category, the subscript of simi denotes any one of the points, 0 < simi_s^j ≤ 1, e is the base of the natural logarithm with the bracketed term as its exponent, and the dimension of simi is c × N_U;
S56, mapping the feature vector f in step S53 to the vector p as the prediction result, where p for each point is a c-dimensional vector;
For the N_L labeled points, class prediction is directly realized using a Softmax classifier and a cross-entropy loss function, and the loss computed on these points is L_label;
For the N_U unlabeled points, pseudo labels are first generated and then compared with p, specifically: first, the N_max points with the highest confidence are selected from the similarity matrix simi category by category; assume N_pse points are selected in total, N_max ≤ N_pse ≤ c·N_max; then the category with the highest confidence is selected point by point for the selected points, and the maximum confidence of the pseudo labels of these N_pse points and the corresponding label values are updated;
The predictive loss function of the unlabeled points is designed as:

$$L_{unlabel} = -\sum_{s=1}^{N_U} \mathbb{1}\left(s \in N_{pse}\right) \sum_{m=1}^{c} y_{s,m} \log p_{s,m}$$

where the subscript s denotes any one of the N_U unlabeled points and serves as their index, c is the number of categories contained in D, m is the index over the categories, p_{s,m} is the probability value of the final predicted label, y_{s,m} represents the pseudo label category and takes 1 when the pseudo label category and the predicted category m are the same, otherwise 0, and 𝟙(s ∈ N_pse) takes 1 when the point is among the N_pse selected points, otherwise 0;
S57, the loss function L_total of the whole graph convolution network is:

$$L_{total} = L_{label} + \gamma L_{unlabel}$$

where the weight γ increases with the training round: with epoch denoting the current training round and max-epoch the maximum training round, a smaller weight γ is used at the beginning of training.
4. The urban scene semantic segmentation method based on graph convolution and a semi-supervised learning network as set forth in claim 3, wherein the step S6 specifically comprises:
Iterating the pseudo-label assignment and training process of steps S51-S57 until the network converges on the target dataset D; in the final prediction, the trained network is used with the computation of the similarity matrix simi removed, a Softmax classifier is applied to all points of the current input P, and the remaining points of the dataset are read in and iterated over, realizing semantic segmentation and class prediction for all points of the target dataset D.
CN202310596881.7A 2023-05-25 2023-05-25 Urban scene semantic segmentation method based on graph convolution and semi-supervised learning network Active CN116310350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310596881.7A CN116310350B (en) 2023-05-25 2023-05-25 Urban scene semantic segmentation method based on graph convolution and semi-supervised learning network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310596881.7A CN116310350B (en) 2023-05-25 2023-05-25 Urban scene semantic segmentation method based on graph convolution and semi-supervised learning network

Publications (2)

Publication Number Publication Date
CN116310350A CN116310350A (en) 2023-06-23
CN116310350B true CN116310350B (en) 2023-08-18

Family

ID=86785552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310596881.7A Active CN116310350B (en) 2023-05-25 2023-05-25 Urban scene semantic segmentation method based on graph convolution and semi-supervised learning network

Country Status (1)

Country Link
CN (1) CN116310350B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863432B (en) * 2023-09-04 2023-12-22 之江实验室 Weak supervision laser travelable region prediction method and system based on deep learning
CN117576217B (en) * 2024-01-12 2024-03-26 电子科技大学 Object pose estimation method based on single-instance image reconstruction

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070779A (en) * 2020-08-04 2020-12-11 武汉大学 Remote sensing image road segmentation method based on convolutional neural network weak supervised learning
CN112785611A (en) * 2021-01-29 2021-05-11 昆明理工大学 3D point cloud weak supervision semantic segmentation method and system
CN112861722A (en) * 2021-02-09 2021-05-28 中国科学院地理科学与资源研究所 Remote sensing land utilization semantic segmentation method based on semi-supervised depth map convolution
CN113936217A (en) * 2021-10-25 2022-01-14 华中师范大学 Priori semantic knowledge guided high-resolution remote sensing image weakly supervised building change detection method
CN114187446A (en) * 2021-12-09 2022-03-15 厦门大学 Cross-scene contrast learning weak supervision point cloud semantic segmentation method
US11450008B1 (en) * 2020-02-27 2022-09-20 Amazon Technologies, Inc. Segmentation using attention-weighted loss and discriminative feature learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570610B (en) * 2021-07-26 2022-05-13 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11450008B1 (en) * 2020-02-27 2022-09-20 Amazon Technologies, Inc. Segmentation using attention-weighted loss and discriminative feature learning
CN112070779A (en) * 2020-08-04 2020-12-11 武汉大学 Remote sensing image road segmentation method based on convolutional neural network weak supervised learning
CN112785611A (en) * 2021-01-29 2021-05-11 昆明理工大学 3D point cloud weak supervision semantic segmentation method and system
CN112861722A (en) * 2021-02-09 2021-05-28 中国科学院地理科学与资源研究所 Remote sensing land utilization semantic segmentation method based on semi-supervised depth map convolution
CN113936217A (en) * 2021-10-25 2022-01-14 华中师范大学 Priori semantic knowledge guided high-resolution remote sensing image weakly supervised building change detection method
CN114187446A (en) * 2021-12-09 2022-03-15 厦门大学 Cross-scene contrast learning weak supervision point cloud semantic segmentation method

Also Published As

Publication number Publication date
CN116310350A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN116310350B (en) Urban scene semantic segmentation method based on graph convolution and semi-supervised learning network
CN111612066B (en) Remote sensing image classification method based on depth fusion convolutional neural network
CN110796168A (en) Improved YOLOv 3-based vehicle detection method
CN112507793A (en) Ultra-short-term photovoltaic power prediction method
CN112149547B (en) Remote sensing image water body identification method based on image pyramid guidance and pixel pair matching
CN112541355B (en) Entity boundary type decoupling few-sample named entity recognition method and system
CN111914611B (en) Urban green space high-resolution remote sensing monitoring method and system
CN113487066A (en) Long-time-sequence freight volume prediction method based on multi-attribute enhanced graph convolution-Informer model
CN113449594A (en) Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
CN113256649B (en) Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN115482491B (en) Bridge defect identification method and system based on transformer
CN111967325A (en) Unsupervised cross-domain pedestrian re-identification method based on incremental optimization
CN112712052A (en) Method for detecting and identifying weak target in airport panoramic video
CN114299286A (en) Road scene semantic segmentation method based on category grouping in abnormal weather
Tian et al. Semantic segmentation of remote sensing image based on GAN and FCN network model
CN117237660A (en) Point cloud data processing and segmentation method based on deep learning feature aggregation
CN111368843B (en) Method for extracting lake on ice based on semantic segmentation
CN117011701A (en) Remote sensing image feature extraction method for hierarchical feature autonomous learning
CN116524197A (en) Point cloud segmentation method, device and equipment combining edge points and depth network
CN115965867A (en) Remote sensing image earth surface coverage classification method based on pseudo label and category dictionary learning
CN114694019A (en) Remote sensing image building migration extraction method based on anomaly detection
CN116071661B (en) Urban road scene semantic segmentation method based on laser point cloud
CN116452794B (en) Directed target detection method based on semi-supervised learning
CN113421269B (en) Real-time semantic segmentation method based on double-branch deep convolutional neural network
CN117152427A (en) Remote sensing image semantic segmentation method and system based on diffusion model and knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant