CN113657387A - Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network - Google Patents

Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network

Info

Publication number
CN113657387A
Authority
CN
China
Prior art keywords
network
point cloud
student
teacher
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110764019.3A
Other languages
Chinese (zh)
Other versions
CN113657387B (en)
Inventor
张扬刚
陈涛
廖永斌
叶创冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202110764019.3A
Publication of CN113657387A
Application granted
Publication of CN113657387B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical fields of deep learning and computer vision, and specifically relates to a semi-supervised three-dimensional point cloud semantic segmentation method based on a neural network. The invention adopts a semi-supervised learning paradigm combined with a three-dimensional point cloud semantic segmentation network model to form a complete semi-supervised three-dimensional point cloud semantic segmentation framework. The segmentation network model is divided into a student network and a teacher network, and the two networks adopt the same SSCN backbone network. The input of the student network is the original, untransformed point cloud, and the input of the teacher network is the transformed point cloud. The labeled part of the student network output is supervised by the corresponding labels, and consistency supervision is applied between the whole outputs of the student network and the teacher network so as to update the weights of the student network; the weights of the teacher network are obtained by taking the exponential moving average of the student network weights. Experiments show that semi-supervised learning with labeled and unlabeled data significantly improves the performance of the network at every labeling rate.

Description

Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network
Technical Field
The invention belongs to the technical field of deep learning and computer vision, and particularly relates to a three-dimensional point cloud semantic segmentation method.
Background
In recent years, deep learning has achieved excellent performance on a variety of computer vision tasks, particularly in the image domain. However, for some applications with practical significance, such as automatic driving, virtual reality, and augmented reality, it is necessary to acquire richer information than a mere picture to achieve better scene understanding. Three-dimensional data acquired by a lidar or an RGB-D depth camera, which is usually represented in the form of a point cloud, is a good complement to two-dimensional picture data. The three-dimensional point cloud is composed of a large number of points with three-dimensional coordinates and colors, is an intuitive three-dimensional data format, contains abundant environmental space information compared with a two-dimensional image, is more beneficial to scene understanding, and has become a main representation form of a plurality of three-dimensional visual analysis tasks.
Among all three-dimensional visual analysis tasks, point cloud semantic segmentation is an essential and key task for three-dimensional scene understanding. In recent years point cloud semantic segmentation has made great progress, but existing methods are trained in a fully supervised manner and rely heavily on large amounts of finely annotated data, which is expensive and time-consuming to obtain. Moreover, compared with classification and detection tasks, semantic segmentation requires dense point-level annotation, which takes even longer and costs even more. For example, the number of points in an indoor scene is often on the order of millions, and annotating a single scene can take several hours. Semi-supervised learning is a way to reduce the cost of data annotation: it improves the performance of existing models by using a small amount of labeled data together with a large amount of unlabeled data. In many fields, labels can only be provided by domain experts, whereas unlabeled data is readily available. Unlike fully supervised learning, semi-supervised learning can improve performance by adding extra unlabeled data for training, and it is a new way to overcome data hunger.
Some related algorithms for semi-supervised learning and point cloud semantic segmentation are briefly introduced below.
1. Semi-supervised learning
Algorithms for semi-supervised learning can be roughly divided into three categories: methods based on generative adversarial networks (GANs), entropy minimization methods, and consistency regularization methods. Among the GAN-based methods, [1] generates additional annotated data for network training, and [2] trains a discriminator to constrain the difference between predictions and labels. Among the entropy minimization methods, [3] exploits unlabeled data by minimizing its entropy loss, while [4] achieves implicit entropy minimization by constructing pseudo labels from high-confidence predictions on unlabeled data. Among the consistency regularization methods, [5] takes different image crops as input and forces their predictions to be consistent; the Mean Teacher model [6] consists of a teacher branch and a student branch with the same structure, where the parameters of the student branch are updated by the optimizer and the parameters of the teacher branch are the exponential moving average of the student network parameters. Owing to its simple and effective framework, Mean Teacher has remained the most common structure among consistency regularization methods, and in this invention the Mean Teacher framework is likewise chosen as the supervision paradigm for the point cloud semantic segmentation task.
2. Point cloud semantic segmentation
Existing point cloud semantic segmentation methods can be divided into two categories: point-based methods and projection-based methods. Point-based methods take the raw point cloud as input, but unstructured and unordered point clouds are difficult to process. PointNet [7] uses shared multi-layer perceptrons and transformation matrix modules for point-level feature learning, followed by a symmetric function for global feature learning; PointNet++ [8] further introduces a hierarchical structure for feature learning, so it can learn more accurate local geometric features and richer local structural information for each point. Projection-based methods convert the unordered point cloud into an intermediate regular representation, which is then fed into a backbone network for feature extraction: [9] first projects the point cloud onto synthesized two-dimensional images, learns image features with a 2D CNN, and obtains the final semantic segmentation result by fusing the image features and projecting the result back onto the point cloud; [10] uses a range image as the intermediate representation and proposes a new post-processing algorithm to overcome the problems caused by discretization. SSCN [11] first voxelizes the input point cloud and proposes a new sparse convolution method to alleviate the heavy computational burden of point cloud processing.
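To make the voxelization step relied on by SSCN-style sparse convolution concrete, the following is a minimal PyTorch sketch of the coordinate quantization idea; the voxel size, tensor layout and function name are illustrative assumptions, not the SSCN implementation itself.

    import torch

    def voxelize(points, voxel_size=0.05):
        # points: (N, 6) tensor; columns 0..2 are xyz coordinates, columns 3..5 are colors.
        # Quantize xyz coordinates to an integer voxel grid and record, for every point,
        # which voxel it falls into; sparse convolution then operates only on occupied voxels.
        coords = torch.floor(points[:, :3] / voxel_size).long()
        voxels, point_to_voxel = torch.unique(coords, dim=0, return_inverse=True)
        return voxels, point_to_voxel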
Disclosure of Invention
The invention aims to provide a semi-supervised three-dimensional point cloud semantic segmentation method based on a neural network, which has low data annotation requirement, high accuracy and good robustness.
The invention provides a semi-supervised three-dimensional point cloud semantic segmentation method based on a neural network, whose overall structure is described as follows: the whole design is based on deep learning, adopts the semi-supervised Mean Teacher paradigm, and is combined with a three-dimensional point cloud semantic segmentation network model to form the complete semi-supervised three-dimensional point cloud semantic segmentation framework. The structure of the segmentation network model is as follows: it is divided into an upper branch and a lower branch; the upper branch is called the student network and the lower branch the teacher network, and the two branches share the same structure, i.e. both adopt the same three-dimensional semantic segmentation backbone network. The input of the student network is the original, untransformed point cloud, and the input of the teacher network is the transformed point cloud. The labeled part of the student network output is supervised by the corresponding labels, and consistency supervision is applied between the whole outputs of the student network and the teacher network so as to update the weights of the student network; the weights of the teacher network are obtained by taking the exponential moving average of the student network weights.
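The two-branch structure can be sketched in a few lines of PyTorch; the backbone is treated as an arbitrary module standing in for the three-dimensional segmentation network, and the function name is an illustrative assumption.

    import copy
    import torch

    def build_student_teacher(backbone: torch.nn.Module):
        # The student is trained by back-propagation; the teacher is a copy of the same
        # architecture whose weights are only ever updated by an exponential moving average.
        student = backbone
        teacher = copy.deepcopy(backbone)
        for p in teacher.parameters():
            p.requires_grad_(False)  # no gradient updates for the teacher branch
        return student, teacher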
The method of the invention comprises the following specific steps.
Step 1: and dividing a training data set.
The training samples for semi-supervised learning consist of labeled data and unlabeled data. For an existing labeled dataset, a certain proportion (for example, between 10% and 90%) of the samples is kept as labeled training samples, and the labels of the remaining samples are removed so that they serve as unlabeled training samples. Alternatively, labeled and unlabeled training samples can be collected independently. Note that the object classes contained in the labeled samples must cover all the object classes to be segmented.
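A minimal sketch of this split, assuming the labeled dataset is a list of (point cloud, labels) pairs; the 20% ratio and the function name are illustrative assumptions.

    import random

    def split_dataset(samples, labeled_ratio=0.2, seed=0):
        # samples: list of (points, labels) pairs from a fully labeled dataset.
        # Keep a fraction as labeled training samples and strip the labels from the rest.
        rng = random.Random(seed)
        indices = list(range(len(samples)))
        rng.shuffle(indices)
        n_labeled = int(len(samples) * labeled_ratio)
        labeled = [samples[i] for i in indices[:n_labeled]]
        unlabeled = [samples[i][0] for i in indices[n_labeled:]]  # drop labels, keep points
        return labeled, unlabeled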
Step 2: and (5) pre-training the network.
The backbone network shared by the teacher network and the student network is pre-trained using the labeled data obtained by partitioning or collection in step 1; the pre-training is fully supervised, and the loss function used is the standard cross-entropy loss.
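A minimal sketch of this fully supervised pre-training loop with the standard cross-entropy loss; the data loader interface, backbone output shape and hyperparameter values are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def pretrain(backbone, labeled_loader, epochs=250, lr=1e-3, device="cuda"):
        # Fully supervised pre-training of the shared backbone on labeled point clouds.
        backbone.to(device).train()
        optimizer = torch.optim.Adam(backbone.parameters(), lr=lr)
        for _ in range(epochs):
            for points, labels in labeled_loader:          # points: (B, p, 6), labels: (B, p)
                points, labels = points.to(device), labels.to(device)
                logits = backbone(points)                   # (B, p, num_classes) per-point scores
                loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return backbone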
Step 3: network training.
The labeled and unlabeled point cloud samples fed into the network are denoted as x^l = {x_i^l} and x^u = {x_i^u}, respectively, where x_i ∈ R^(p×6) represents the p points contained in each training sample together with their coordinate and color information. A batch of training samples is denoted as x^l ∪ x^u, and its scaled and rotated version as τ(x^l ∪ x^u). x^l ∪ x^u and τ(x^l ∪ x^u) are fed into the student branch and the teacher branch respectively, and the corresponding outputs are denoted as ŷ^S and ŷ^T.
Before training begins, the student network and the teacher network are both initialized with the weights obtained in the pre-training of step 2. Then, in each training iteration, the labeled part ŷ^(S,l) of the student output ŷ^S is supervised by the corresponding label information y through the supervised loss L_sup, while the full outputs ŷ^S and ŷ^T of the student and teacher networks are supervised by our designed consistency loss L_con, described in detail below:
L_con = KL( f_S(x^l ∪ x^u) || f_T(τ(x^l ∪ x^u)) )
where f_T and f_S denote the teacher network and the student network respectively, τ denotes the scaling and rotation transformation mentioned above, and KL denotes the Kullback-Leibler divergence. The overall loss function L is written as:
L = L_sup + ω_c · L_con
where ω_c is a consistency weight parameter.
The student network updates its parameters by optimizing the loss function L; the parameters of the teacher network are obtained as the exponential moving average (EMA) of the student network parameters, according to:
θ′_t = α·θ′_(t−1) + (1 − α)·θ_t
where θ′_t and θ_t denote the weights of the teacher network and the student network at the t-th iteration, and α is a weight hyperparameter. In other words, the teacher weights after the current iteration equal the previous teacher weights multiplied by α, plus the freshly updated student weights multiplied by (1 − α).
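Under the definitions above, one training iteration can be sketched as follows in PyTorch; the per-point KL computation and the EMA update follow the formulas in the text, while the backbone interface, transform τ and batching details are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def train_step(student, teacher, optimizer, x_l, y_l, x_u, tau, omega_c, alpha):
        # x_l: (B_l, p, 6) labeled point clouds, y_l: (B_l, p) labels, x_u: (B_u, p, 6) unlabeled.
        x = torch.cat([x_l, x_u], dim=0)                     # x^l ∪ x^u fed to the student
        student_logits = student(x)                          # student predictions ŷ^S
        with torch.no_grad():
            teacher_logits = teacher(tau(x))                 # teacher predictions ŷ^T on τ(x)

        # supervised loss L_sup on the labeled part of the student output
        n_l = x_l.shape[0]
        sup_loss = F.cross_entropy(student_logits[:n_l].flatten(0, 1), y_l.flatten())

        # consistency loss L_con = KL(f_S || f_T), averaged over all points
        log_p_s = F.log_softmax(student_logits, dim=-1)
        log_p_t = F.log_softmax(teacher_logits, dim=-1)
        con_loss = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1).mean()

        loss = sup_loss + omega_c * con_loss                 # L = L_sup + ω_c · L_con
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # EMA update of the teacher: θ'_t = α·θ'_(t-1) + (1 − α)·θ_t
        with torch.no_grad():
            for t_p, s_p in zip(teacher.parameters(), student.parameters()):
                t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
        return loss.item()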
Step 4: network inference.
At inference time, an ideal three-dimensional point cloud semantic segmentation result can be obtained with either the trained teacher network or the trained student network; the two networks achieve similar segmentation performance.
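A minimal inference sketch under the same assumptions; either the trained student or the trained teacher can be passed in, since their segmentation performance is similar.

    import torch

    @torch.no_grad()
    def predict(model, points, device="cuda"):
        # points: (p, 6) tensor for a single point cloud; returns a (p,) tensor of class ids.
        model.to(device).eval()
        logits = model(points.unsqueeze(0).to(device))       # (1, p, num_classes)
        return logits.argmax(dim=-1).squeeze(0).cpu()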
In the invention, semi-supervised learning combines labeled data with additional unlabeled data to effectively improve model performance. SSCN is selected as the backbone network for point cloud semantic segmentation, and two loss functions are designed to force the teacher model and the student model to produce consistent predictions. The network structure of the invention is simple and effective, and extensive experiments show that semi-supervised learning with labeled and unlabeled data can significantly improve the performance of the network at every labeling rate.
Drawings
FIG. 1 is a diagram of the semi-supervised three-dimensional point cloud semantic segmentation framework provided by the present invention.
Detailed Description
In the following, an embodiment of the invention is described on a three-dimensional scene point cloud dataset.
Description of the data set: the experiments use the three-dimensional scene point cloud dataset of [12] (ScanNet), which contains 1513 scans reconstructed from 707 indoor scenes and is officially split into 1201 training samples and 312 validation samples.
Training experiment setup:
the section introduces training settings for semantic segmentation of point clouds in three-dimensional scenes, codes are written by PyTorch, and 1201 training samples in a data set introduced by the contents are selected as training samples. Furthermore, all experiments in this section were performed according to the following experimental setup:
data set partitioning:
according to the proportion of labeled samples, dividing 1201 training samples into seven groups of experiments of 10%, 20%, 30%, 40%, 50%, 70% and 100%, and removing labels from the rest samples of each group of experiments to serve as unlabeled samples.
Pre-training stage:
Learning rate: 0.001.
Training period: approximately 250 passes (epochs) over the training set.
Batch size: 32.
Optimizer: Adam.
SSCN hyperparameters: network width m = 16, convolution block repetition factor 1, voxel size 1/20, one test view, and no residual blocks.
Training stage:
Learning rate: 0.001, reduced to 1/10 of its previous value every 50 training epochs.
Training period: approximately 250 passes (epochs) over the training set.
Batch size: 6 labeled samples and 24 unlabeled samples per batch.
Optimizer: Adam.
Consistency weight ω_c: increased gradually from 0 to 1 over the first 40000 steps.
EMA weight hyperparameter α: 0.99 for the first 40000 steps, 0.999 afterwards (a minimal sketch of both schedules follows this list).
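The two schedules can be written as simple functions of the training step; the sigmoid-shaped ramp for ω_c is an illustrative assumption, since the text only states that ω_c increases gradually from 0 to 1 over the first 40000 steps.

    import math

    def consistency_weight(step, ramp_up_steps=40000):
        # ω_c: ramp gradually from 0 to 1 over the first ramp_up_steps, then stay at 1.
        if step >= ramp_up_steps:
            return 1.0
        phase = 1.0 - step / ramp_up_steps
        return math.exp(-5.0 * phase * phase)       # sigmoid-shaped ramp (assumed form)

    def ema_alpha(step, switch_step=40000):
        # α: 0.99 for the first switch_step iterations, 0.999 afterwards.
        return 0.99 if step < switch_step else 0.999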
Test experiment setup:
Validation set: the 312 validation samples of the dataset.
Evaluation metric: mean Intersection-over-Union (mIoU).
Baseline: an SSCN network trained with only the same labeled samples and evaluated on the same validation set.
Labeled sample proportion    10%     20%     30%     40%     50%     70%     100%
Baseline (SSCN)              40.49   50.04   53.39   53.62   55.86   57.71   60.04
The invention                42.74   51.86   55.84   55.87   57.77   59.19   61.76
Transductive learning result verification:
Transductive learning refers to performing inference on the unlabeled samples used during training, and it is a common evaluation mode in semi-supervised learning. This section reports the results of the invention under this evaluation mode.
Evaluation set: the unlabeled samples under each labeling proportion.
Evaluation metric: mean Intersection-over-Union (mIoU).
Baseline: an SSCN network trained with only the same labeled samples and evaluated on the same unlabeled samples.
Labeled sample proportion    10%     20%     30%     40%     50%     70%
Baseline (SSCN)              44.42   57.23   61.26   63.48   65.92   68.49
The invention                46.90   59.29   63.50   65.29   67.47   70.36
Result analysis:
The semi-supervised three-dimensional point cloud semantic segmentation method provided by the invention improves the accuracy of the existing three-dimensional point cloud method both on the validation set and under the transductive evaluation mode. It can therefore improve three-dimensional point cloud semantic segmentation accuracy using only a small number of labeled samples together with a large number of unlabeled samples, greatly reducing the dependence on data annotation compared with existing three-dimensional point cloud semantic segmentation methods.
This specification presents a specific embodiment for the purpose of illustrating the context and method of practicing the invention. The details introduced in the examples are not intended to limit the scope of the claims but to aid in the understanding of the process described herein. Those skilled in the art will understand that: various modifications, changes or substitutions to the preferred embodiment steps are possible without departing from the spirit and scope of the invention and its appended claims. Therefore, the present invention should not be limited to the disclosure of the preferred embodiments and the accompanying drawings.
References
[1] N. Souly, C. Spampinato, M. Shah, Semi-supervised semantic segmentation using generative adversarial network, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5688-5696.
[2] S. Mittal, M. Tatarchenko, T. Brox, Semi-supervised semantic segmentation with high- and low-level consistency, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[3] Y. Grandvalet, Y. Bengio, Semi-supervised learning by entropy minimization, Advances in Neural Information Processing Systems 17 (2004) 529-536.
[4] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, C. Raffel, FixMatch: Simplifying semi-supervised learning with consistency and confidence, arXiv preprint arXiv:2001.07685.
[5] S. Laine, T. Aila, Temporal ensembling for semi-supervised learning, arXiv preprint arXiv:1610.02242.
[6] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in: Advances in Neural Information Processing Systems, 2017, pp. 1195-1204.
[7] C. R. Qi, H. Su, K. Mo, L. J. Guibas, PointNet: Deep learning on point sets for 3D classification and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652-660.
[8] C. R. Qi, L. Yi, H. Su, L. J. Guibas, PointNet++: Deep hierarchical feature learning on point sets in a metric space, Advances in Neural Information Processing Systems 30 (2017) 5099-5108.
[9] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, M. Felsberg, Deep projective 3D semantic segmentation, in: International Conference on Computer Analysis of Images and Patterns, Springer, 2017, pp. 95-107.
[10] A. Milioto, I. Vizzo, J. Behley, C. Stachniss, RangeNet++: Fast and accurate LiDAR semantic segmentation, in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 4213-4220.
[11] B. Graham, M. Engelcke, L. Van Der Maaten, 3D semantic segmentation with submanifold sparse convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9224-9232.
[12] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, M. Nießner, ScanNet: Richly-annotated 3D reconstructions of indoor scenes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828-5839.

Claims (1)

1. A semi-supervised three-dimensional point cloud semantic segmentation method based on a neural network, characterized in that a semi-supervised learning Mean Teacher paradigm is adopted and combined with a three-dimensional point cloud semantic segmentation backbone network to form the whole semi-supervised three-dimensional point cloud semantic segmentation framework; the structure of the segmentation network model is as follows: it is divided into an upper branch and a lower branch, the upper branch being called the student network and the lower branch the teacher network; the student network and the teacher network share the same structure, i.e. both adopt the same three-dimensional semantic segmentation backbone network; the input of the student network is the original, untransformed point cloud, and the input of the teacher network is the transformed point cloud; the labeled part of the student network output is supervised by the corresponding labels, and consistency supervision is applied between the whole outputs of the student network and the teacher network so as to update the weights of the student network, while the weights of the teacher network are obtained by taking the exponential moving average of the student network weights;
the three-dimensional point cloud semantic segmentation comprises the following specific steps:
step 1: partitioning a training data set
The training samples for semi-supervised learning consist of labeled data and unlabeled data; for an existing labeled dataset, a certain proportion of the samples is kept as labeled training samples and the labels of the remaining samples are removed so that they serve as unlabeled training samples; alternatively, labeled and unlabeled training samples are collected independently; the object classes contained in the labeled samples must cover all the object classes to be segmented;
step 2: network pre-training
The backbone network shared by the teacher network and the student network is pre-trained using the labeled data obtained by partitioning or collection in step 1; the pre-training is fully supervised, and the loss function used is the standard cross-entropy loss function;
step 3: network training
The labeled and unlabeled point cloud samples fed into the network are denoted as x^l = {x_i^l} and x^u = {x_i^u}, respectively, where x_i ∈ R^(p×6) represents the p points contained in each training sample together with their coordinate and color information; a batch of training samples is denoted as x^l ∪ x^u, and its scaled and rotated version as τ(x^l ∪ x^u); x^l ∪ x^u and τ(x^l ∪ x^u) are fed into the student branch and the teacher branch respectively, and the corresponding outputs are denoted as ŷ^S and ŷ^T;
before training begins, the student network and the teacher network are initialized with the weights obtained in the pre-training of step 2; then, in each training iteration, the labeled part ŷ^(S,l) of the student output ŷ^S is supervised by the corresponding label information y through the supervised loss L_sup, and the outputs ŷ^S and ŷ^T are supervised by the designed consistency loss function L_con:
L_con = KL( f_S(x^l ∪ x^u) || f_T(τ(x^l ∪ x^u)) )
where f_T and f_S denote the teacher network and the student network respectively, τ denotes the scaling and rotation transformation mentioned above, and KL denotes the Kullback-Leibler divergence; the overall loss function L is written as:
L = L_sup + ω_c · L_con
where ω_c is a consistency weight parameter;
the student network updates its parameters by optimizing the loss function L; the teacher network is obtained by taking the exponential moving average of the student network parameters, according to the formula:
θ′_t = α·θ′_(t−1) + (1 − α)·θ_t
where θ′_t and θ_t denote the weights of the teacher network and the student network at the t-th iteration, respectively, and α is a weight hyperparameter;
step 4: network inference
At inference time, an ideal three-dimensional point cloud semantic segmentation result is obtained using either the trained teacher network or the trained student network.
CN202110764019.3A 2021-07-07 2021-07-07 Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network Active CN113657387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110764019.3A CN113657387B (en) 2021-07-07 2021-07-07 Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110764019.3A CN113657387B (en) 2021-07-07 2021-07-07 Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network

Publications (2)

Publication Number Publication Date
CN113657387A (en) 2021-11-16
CN113657387B CN113657387B (en) 2023-10-13

Family

ID=78477165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110764019.3A Active CN113657387B (en) 2021-07-07 2021-07-07 Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network

Country Status (1)

Country Link
CN (1) CN113657387B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114187446A (en) * 2021-12-09 2022-03-15 厦门大学 Cross-scene contrast learning weak supervision point cloud semantic segmentation method
CN114400043A (en) * 2022-01-20 2022-04-26 复旦大学 Semi-supervised metagenome binning method based on twin neural network
CN115082800A (en) * 2022-07-21 2022-09-20 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation method
CN115131366A (en) * 2021-11-25 2022-09-30 北京工商大学 Multi-mode small target image full-automatic segmentation method and system based on generation type confrontation network and semi-supervision field self-adaptation
CN116012840A (en) * 2022-11-21 2023-04-25 浙江大学 Three-dimensional point cloud semantic segmentation labeling method based on active learning and semi-supervision
WO2023116635A1 (en) * 2021-12-24 2023-06-29 中国科学院深圳先进技术研究院 Mutual learning-based semi-supervised medical image segmentation method and system
CN118379744A (en) * 2024-06-25 2024-07-23 中国科学技术大学 Semi-supervised scene text recognition method, system, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
KR20190138238A (en) * 2018-06-04 2019-12-12 삼성전자주식회사 Deep Blind Transfer Learning
CN111489358A (en) * 2020-03-18 2020-08-04 华中科技大学 Three-dimensional point cloud semantic segmentation method based on deep learning
CN111862171A (en) * 2020-08-04 2020-10-30 万申(北京)科技有限公司 CBCT and laser scanning point cloud data tooth registration method based on multi-view fusion
CN112085821A (en) * 2020-08-17 2020-12-15 万申(北京)科技有限公司 Semi-supervised-based CBCT (cone beam computed tomography) and laser scanning point cloud data registration method
US20210004974A1 (en) * 2019-07-06 2021-01-07 Toyota Research Institute, Inc. Systems and methods for semi-supervised depth estimation according to an arbitrary camera
CN112233124A (en) * 2020-10-14 2021-01-15 华东交通大学 Point cloud semantic segmentation method and system based on countermeasure learning and multi-modal learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
KR20190138238A (en) * 2018-06-04 2019-12-12 삼성전자주식회사 Deep Blind Transfer Learning
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
US20210004974A1 (en) * 2019-07-06 2021-01-07 Toyota Research Institute, Inc. Systems and methods for semi-supervised depth estimation according to an arbitrary camera
CN111489358A (en) * 2020-03-18 2020-08-04 华中科技大学 Three-dimensional point cloud semantic segmentation method based on deep learning
CN111862171A (en) * 2020-08-04 2020-10-30 万申(北京)科技有限公司 CBCT and laser scanning point cloud data tooth registration method based on multi-view fusion
CN112085821A (en) * 2020-08-17 2020-12-15 万申(北京)科技有限公司 Semi-supervised-based CBCT (cone beam computed tomography) and laser scanning point cloud data registration method
CN112233124A (en) * 2020-10-14 2021-01-15 华东交通大学 Point cloud semantic segmentation method and system based on countermeasure learning and multi-modal learning

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131366A (en) * 2021-11-25 2022-09-30 北京工商大学 Multi-mode small target image full-automatic segmentation method and system based on generation type confrontation network and semi-supervision field self-adaptation
CN114187446A (en) * 2021-12-09 2022-03-15 厦门大学 Cross-scene contrast learning weak supervision point cloud semantic segmentation method
CN114187446B (en) * 2021-12-09 2024-09-06 厦门大学 Weak supervision point cloud semantic segmentation method for cross-scene contrast learning
WO2023116635A1 (en) * 2021-12-24 2023-06-29 中国科学院深圳先进技术研究院 Mutual learning-based semi-supervised medical image segmentation method and system
CN114400043A (en) * 2022-01-20 2022-04-26 复旦大学 Semi-supervised metagenome binning method based on twin neural network
CN115082800A (en) * 2022-07-21 2022-09-20 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation method
CN115082800B (en) * 2022-07-21 2022-11-15 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation method
CN116012840A (en) * 2022-11-21 2023-04-25 浙江大学 Three-dimensional point cloud semantic segmentation labeling method based on active learning and semi-supervision
CN116012840B (en) * 2022-11-21 2023-08-18 浙江大学 Three-dimensional point cloud semantic segmentation labeling method based on active learning and semi-supervision
CN118379744A (en) * 2024-06-25 2024-07-23 中国科学技术大学 Semi-supervised scene text recognition method, system, equipment and storage medium
CN118379744B (en) * 2024-06-25 2024-08-20 中国科学技术大学 Semi-supervised scene text recognition method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN113657387B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN113657387B (en) Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network
Yang et al. Lego: Learning edge with geometry all at once by watching videos
Melekhov et al. Dgc-net: Dense geometric correspondence network
Liu et al. Deep learning markov random field for semantic segmentation
CN108229479B (en) Training method and device of semantic segmentation model, electronic equipment and storage medium
EP3608844A1 (en) Methods for training a crnn and for semantic segmentation of an inputted video using said crnn
Sun et al. Efficient spatial-temporal information fusion for lidar-based 3d moving object segmentation
Bansal et al. Pixelnet: Towards a general pixel-level architecture
JP6395158B2 (en) How to semantically label acquired images of a scene
CN105095862B (en) A kind of human motion recognition method based on depth convolution condition random field
CN113657560B (en) Weak supervision image semantic segmentation method and system based on node classification
CN108241854B (en) Depth video saliency detection method based on motion and memory information
Károly et al. Optical flow-based segmentation of moving objects for mobile robot navigation using pre-trained deep learning models
CN116310128A (en) Dynamic environment monocular multi-object SLAM method based on instance segmentation and three-dimensional reconstruction
Ding et al. Global relational reasoning with spatial temporal graph interaction networks for skeleton-based action recognition
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN115482387A (en) Weak supervision image semantic segmentation method and system based on multi-scale class prototype
Qin et al. Depth estimation by parameter transfer with a lightweight model for single still images
CN113223037B (en) Unsupervised semantic segmentation method and unsupervised semantic segmentation system for large-scale data
Dhingra et al. Border-seggcn: Improving semantic segmentation by refining the border outline using graph convolutional network
Zhang et al. Small target detection based on squared cross entropy and dense feature pyramid networks
Srinivasan et al. An Efficient Video Inpainting Approach Using Deep Belief Network.
Kolesnikov et al. Closed-form training of conditional random fields for large scale image segmentation
He et al. Building extraction based on U-net and conditional random fields
CN115019342B (en) Endangered animal target detection method based on class relation reasoning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant