CN113657387B

CN113657387B - Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network

Info

Publication number: CN113657387B
Application number: CN202110764019.3A
Authority: CN
Inventors: 张扬刚; 陈涛; 廖永斌; 叶创冠
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2021-07-07
Filing date: 2021-07-07
Publication date: 2023-10-13
Anticipated expiration: 2041-07-07
Also published as: CN113657387A

Abstract

The invention belongs to the technical field of deep learning and computer vision, and in particular relates to a semi-supervised three-dimensional point cloud semantic segmentation method based on a neural network. The invention combines a semi-supervised learning model with a three-dimensional point cloud semantic segmentation network model to form a complete semi-supervised three-dimensional point cloud semantic segmentation framework. The segmentation network model is divided into a student network and a teacher network, both of which adopt the same SSCNs network. The input of the student network is the original, untransformed point cloud, and the input of the teacher network is the transformed point cloud. The output of the student network on the labeled part is supervised by the corresponding labels, while the overall outputs of the student network and the teacher network are supervised for consistency; the weights of the student network are updated accordingly, and the weights of the teacher network are obtained as an exponential moving average of the weights of the student network. Experiments show that semi-supervised learning with labeled and unlabeled data significantly improves the performance of the network at every labeling rate.

Description

Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network

Technical Field

The invention belongs to the technical field of deep learning and computer vision, and particularly relates to a three-dimensional point cloud semantic segmentation method.

Background

In recent years, deep learning has achieved excellent performance in a variety of computer vision tasks, particularly in the image domain. However, some practically important applications, such as autonomous driving, virtual reality, and augmented reality, need more information than images alone to achieve better scene understanding. Three-dimensional data acquired by lidar or RGB-D depth cameras, typically represented as a point cloud, is a good complement to two-dimensional image data. A three-dimensional point cloud consists of a large number of points with three-dimensional coordinates and colors. It is an intuitive three-dimensional data format that contains richer spatial information about the environment than a two-dimensional image, is more conducive to scene understanding, and has become the main representation for many three-dimensional visual analysis tasks.

Among all three-dimensional visual analysis tasks, point cloud semantic segmentation is a key task in three-dimensional scene understanding. Point cloud semantic segmentation has made great progress in recent years, but existing methods are all trained in a fully supervised manner and depend heavily on large amounts of finely annotated data, which is expensive and time-consuming to produce. In addition, semantic segmentation requires dense point-level annotation, which is more time-consuming and costly than annotation for classification and detection tasks. For example, an indoor scene often contains on the order of millions of points, and annotating it takes several hours. Semi-supervised learning reduces the cost of data annotation: it improves the performance of existing models by combining a small amount of labeled data with a large amount of unlabeled data. In many fields, labels can only be provided by experts in the relevant field, whereas unlabeled data is easy to obtain. Unlike fully supervised learning, semi-supervised learning can improve performance by adding extra unlabeled data to training, and is a new way to overcome data scarcity.

Some related algorithms for semi-supervised learning and point cloud semantic segmentation are briefly described below.

1. Semi-supervised learning

Semi-supervised learning algorithms can be broadly divided into three categories: methods based on generative adversarial networks (GANs), entropy minimization methods, and consistency regularization methods. Among the GAN-based methods, [1] generates additional labeled data for training the network, and [2] trains a discriminator to constrain the difference between predictions and labels. Among the entropy minimization methods, [3] exploits unlabeled data by minimizing its entropy loss, and [4] constructs pseudo labels from high-confidence predictions on unlabeled data to achieve implicit entropy minimization. Among the consistency regularization methods, [5] takes different image crops as input and forces their predictions to be consistent. The Mean Teacher model [6] consists of a teacher branch and a student branch with the same structure: the parameters of the student branch are updated by an optimizer, while the parameters of the teacher branch are an exponential moving average of the parameters of the student network. Owing to its simple and effective structure, Mean Teacher has long been the most common architecture among consistency regularization methods.

2. Point cloud semantic segmentation

Existing point cloud semantic segmentation methods can be divided into two types: point-based methods and projection-based methods. Point-based methods take the raw point cloud as input, but have difficulty handling unstructured and unordered point clouds. PointNet [7] uses shared multi-layer perceptrons and transformation matrix modules for point-level feature learning, and then uses a symmetric function for global feature learning. PointNet++ [8] further introduces hierarchical feature learning, so that more accurate local texture features and richer local structure information can be learned for each point. Projection-based methods generally convert the unordered point cloud into an intermediate regular representation, which is then fed into a backbone network for feature extraction. [9] first projects the point cloud onto synthesized two-dimensional images, learns image features with a 2D CNN, and obtains the final semantic segmentation result by fusing the image features and projecting them back onto the point cloud. [10] uses a range image as the intermediate representation and proposes a novel post-processing algorithm to overcome the problems caused by discretization. SSCNs [11] first voxelize the input point cloud and propose a new sparse convolution method to alleviate the heavy computational burden of point clouds.

Disclosure of Invention

The invention aims to provide a semi-supervised three-dimensional point cloud semantic segmentation method based on a neural network, which has low data annotation requirements, high accuracy and good robustness.

The overall structure of the semi-supervised three-dimensional point cloud semantic segmentation method based on a neural network provided by the invention is as follows. The whole design is based on deep learning: it adopts the semi-supervised Mean Teacher model and combines it with a three-dimensional point cloud semantic segmentation network model to form a complete semi-supervised three-dimensional point cloud semantic segmentation framework. The segmentation network model is divided into an upper branch and a lower branch. The upper branch is called the student network and the lower branch is called the teacher network; the two networks adopt the same structure, that is, the same three-dimensional semantic segmentation backbone network. The input of the student network is the original, untransformed point cloud, and the input of the teacher network is the transformed point cloud. The output of the student network on the labeled part is supervised by the corresponding labels, while the overall outputs of the student network and the teacher network are supervised for consistency; the weights of the student network are updated accordingly, and the weights of the teacher network are obtained as an exponential moving average of the weights of the student network.
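The transformation applied to the teacher input is described in step 3 as scaling and rotation, but the text does not give the rotation axis or the parameter ranges. The following is a minimal sketch only, assuming a uniform scale factor and a rotation about the vertical (z) axis; the function names `scale_and_rotate` and `random_transform` and the scale range 0.9 to 1.1 are hypothetical choices, not values taken from the invention.

```python
import math
import random


def scale_and_rotate(points, scale, angle):
    """Scale and rotate (x, y, z, r, g, b) points about the z axis.

    Only the coordinates are transformed; colors are passed through unchanged.
    The choice of the z axis is an assumption made for illustration.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    transformed = []
    for x, y, z, r, g, b in points:
        new_x = scale * (cos_a * x - sin_a * y)
        new_y = scale * (sin_a * x + cos_a * y)
        transformed.append((new_x, new_y, scale * z, r, g, b))
    return transformed


def random_transform(points, rng, scale_range=(0.9, 1.1)):
    """Draw a random scale and angle (hypothetical ranges) and apply them."""
    scale = rng.uniform(*scale_range)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return scale_and_rotate(points, scale, angle)
```

Because scaling and rotation do not reorder the points, the i-th output of the teacher branch still corresponds to the i-th point seen by the student branch, which is what makes a point-wise consistency comparison possible.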

The specific steps of the method of the invention are as follows.

Step 1: the training data set is partitioned.

The training samples for semi-supervised learning consist of two parts: labeled data and unlabeled data. For an existing labeled data set, a certain proportion (for example, 10% to 90%) of the samples is kept as labeled training samples, and the labels of the remaining samples are removed so that they serve as unlabeled training samples. Alternatively, labeled and unlabeled training samples can be collected independently. Note that the object classes contained in the labeled samples must cover all the object classes to be segmented.
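The partitioning in step 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used by the inventors: the function name `split_labeled_unlabeled`, the random shuffling, the rounding rule, and the retry loop used to satisfy the class-coverage requirement are all hypothetical.

```python
import random


def split_labeled_unlabeled(samples, labeled_ratio, all_classes, seed=0, max_tries=100):
    """Split sample ids into a labeled and an unlabeled subset.

    samples: dict mapping sample id -> set of object classes present in it.
    The labeled subset is re-drawn until it covers every class in all_classes,
    mirroring the requirement that labeled samples contain all classes.
    """
    rng = random.Random(seed)
    ids = sorted(samples)
    n_labeled = max(1, round(labeled_ratio * len(ids)))
    for _ in range(max_tries):
        rng.shuffle(ids)
        labeled = ids[:n_labeled]
        covered = set().union(*(samples[i] for i in labeled))
        if covered >= set(all_classes):
            # The remaining samples would have their labels removed.
            return sorted(labeled), sorted(ids[n_labeled:])
    raise ValueError("could not find a labeled subset covering every class")
```

For example, calling it with `labeled_ratio=0.1` on a dictionary of annotated scans returns roughly 10% of the ids as labeled and the rest as unlabeled.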

Step 2: the network is pre-trained.

The backbone network used by the teacher network and the student network is pre-trained on the labeled data obtained by the partitioning or collection in step 1. Pre-training is fully supervised, and the loss function used is the standard cross-entropy loss.
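For reference, the standard cross-entropy loss averaged over points can be written out in plain Python as below. This is a didactic sketch; the function names are hypothetical, and an actual training run would use the loss provided by the deep learning framework rather than this code.

```python
import math


def softmax(logits):
    """Convert the class logits of one point into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pointwise_cross_entropy(logits_per_point, labels):
    """Mean cross entropy over points, given one integer class label per point."""
    if len(logits_per_point) != len(labels):
        raise ValueError("need exactly one label per point")
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(logits_per_point, labels)]
    return sum(losses) / len(losses)
```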

Step 3: network training.

The labeled and unlabeled point cloud samples fed to the network are denoted $x^l$ and $x^u$ respectively, where each sample $x_i \in \mathbb{R}^{p \times 6}$ holds the $p$ points contained in that training sample together with their coordinates and color information. A batch of training samples is denoted $x^l \cup x^u$, and its scaled and rotated version is denoted $\tilde{x}^l \cup \tilde{x}^u$. The batches $x^l \cup x^u$ and $\tilde{x}^l \cup \tilde{x}^u$ are used as the inputs of the student branch and the teacher branch respectively, and the corresponding outputs are denoted $\hat{y}^l \cup \hat{y}^u$ and $\tilde{y}^l \cup \tilde{y}^u$.

Before training starts, the student network and the teacher network are each initialized with the weights obtained by the pre-training in step 2. Then, at each training iteration, the labeled part $\hat{y}^l$ of the student output is supervised by the corresponding annotation $y$ through the supervised loss $L_{sup}$, while the student output $\hat{y}^l \cup \hat{y}^u$ and the teacher output $\tilde{y}^l \cup \tilde{y}^u$ are supervised by the consistency loss $L_{con}$ designed in the invention, specified as follows:

[equation missing from the extracted text]

where $f_T$ and $f_S$ denote the teacher network and the student network respectively, $\tau$ denotes the scaling and rotation transformation mentioned above, and KL denotes the Kullback-Leibler (KL) divergence. The overall loss function $L$ is:

[equation missing from the extracted text]

where $\omega_c$ is the consistency weight parameter.
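The exact form of the consistency loss and of the overall loss is not legible in this text, so the following is only a hedged sketch of one plausible reading. It assumes that the consistency term is a mean per-point KL divergence between the teacher and student class distributions, and that the overall loss is the supervised loss plus the consistency loss weighted by the consistency weight. The direction of the KL divergence, the additive combination, and the function names are all assumptions.

```python
import math


def softmax(logits):
    """Convert the class logits of one point into probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_consistency(teacher_logits, student_logits, eps=1e-12):
    """Mean per-point KL(teacher || student); the direction is an assumption."""
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        t_prob, s_prob = softmax(t_logits), softmax(s_logits)
        total += sum(t * math.log(max(t, eps) / max(s, eps))
                     for t, s in zip(t_prob, s_prob))
    return total / len(teacher_logits)


def total_loss(supervised_loss, consistency_loss, omega_c):
    """Assumed combination: supervised loss + omega_c * consistency loss."""
    return supervised_loss + omega_c * consistency_loss
```

Under this reading, the consistency term is zero when the two branches predict identical distributions for every point, and a consistency weight of 0 reduces training to purely supervised learning on the labeled part.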

The student network updates its parameters by optimizing the overall loss function. The teacher network is obtained by taking an exponential moving average (EMA) of the parameters of the student network, according to the following formula:

$$\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha) \theta_t$$

where $\theta'_t$ and $\theta_t$ denote the weights of the teacher network and the student network at the $t$-th iteration, and $\alpha$ is a weight hyperparameter. That is, the teacher weights at iteration $t$ equal the teacher weights from the previous iteration multiplied by $\alpha$, plus the student weights just updated at iteration $t$ multiplied by $1 - \alpha$.
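The exponential moving average update can be illustrated with a small sketch. Representing each network as a dictionary that maps a parameter name to a list of floats is an assumption made to keep the example self-contained; `ema_update` is a hypothetical name.

```python
def ema_update(teacher, student, alpha):
    """Update the teacher in place: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t.

    teacher, student: dicts mapping parameter name -> list of floats.
    """
    for name, student_values in student.items():
        teacher[name] = [alpha * t + (1.0 - alpha) * s
                         for t, s in zip(teacher[name], student_values)]
    return teacher
```

With a value of alpha close to 1, the teacher changes slowly and acts as a smoothed copy of recent student weights.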

Step 4: network inference.

At inference time, a satisfactory three-dimensional point cloud semantic segmentation result can be obtained with either the trained teacher network or the trained student network; the two networks have similar segmentation performance.

In the invention, semi-supervised learning trains jointly on unlabeled and labeled data, which effectively improves the performance of the model. The invention selects SSCNs as the backbone network for point cloud semantic segmentation and designs two loss functions that force the teacher model and the student model to make the same predictions. The network structure of the invention is simple and effective, and extensive experiments show that semi-supervised learning with labeled and unlabeled data significantly improves the performance of the network at every labeling rate.

Drawings

Fig. 1 is a semi-supervised three-dimensional point cloud semantic segmentation framework diagram proposed by the present invention.

Detailed Description

In the following, an embodiment of the invention is described on a three-dimensional scene point cloud dataset.

Data set description: the three-dimensional scene point cloud data set is from [12]. It contains 1513 scans reconstructed from 707 indoor scenes and is officially split into 1201 training samples and 312 validation samples.

Training experiment setting:

This section describes the training settings for semantic segmentation of three-dimensional scene point clouds. The code is written in PyTorch, and the 1201 training samples of the data set described above are used as training samples. All experiments in this section follow the experimental setup below:

data set partitioning:

According to the proportion of labeled samples, the 1201 training samples are divided into seven groups of experiments with 10%, 20%, 30%, 40%, 50%, 70%, and 100% labeled samples; in each group, the labels of the remaining samples are removed and they are used as unlabeled samples.

Pre-training stage:

Learning rate: 0.001.

Training epochs: about 250 passes over the training set.

Batch size: 32.

Optimization algorithm: Adam.

Hyperparameters of SSCNs: network width m = 16, convolution block repetition factor 1, voxel size 1/20, number of test views 1, residual blocks not used.
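To make the voxel size setting concrete, the sketch below shows how points could be quantized onto a regular grid with voxel size 1/20. It is a plain-Python illustration only and does not reproduce the input layer of SSCNs; the function name `voxelize` is hypothetical, and the voxel size is assumed to be expressed in the same units as the point coordinates.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size=1.0 / 20.0):
    """Group point indices by the integer voxel that contains each point.

    points: sequence of tuples whose first three entries are x, y, z.
    Returns a dict mapping (i, j, k) voxel index -> list of point indices.
    """
    voxels = defaultdict(list)
    for index, point in enumerate(points):
        x, y, z = point[:3]
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels[key].append(index)
    return dict(voxels)
```

Only occupied voxels appear in the result, which is the property that sparse convolution exploits to reduce the computational burden.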

Training phase:

Learning rate: 0.001, reduced to 1/10 of its previous value every 50 training epochs.

Batch size: 6 labeled samples and 24 unlabeled samples.

Optimization algorithm: Adam.

Consistency weight $\omega_c$: increased gradually from 0 to 1 over the first 40000 steps.

Weight hyperparameter $\alpha$: 0.99 for the first 40000 steps, 0.999 afterwards.
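The three schedules in the training settings above can be sketched as small functions. The step decay and the switch of the weight hyperparameter follow the stated settings; the shape of the consistency weight ramp is not specified in the text, so a linear ramp is assumed here, and all function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, step=50, factor=0.1):
    """Learning rate reduced to 1/10 of its previous value every 50 epochs."""
    return base_lr * factor ** (epoch // step)


def consistency_weight(step, ramp_steps=40000):
    """Consistency weight rising from 0 to 1 over the first 40000 steps.

    A linear ramp is an assumption; the text only says the weight increases gradually.
    """
    return min(1.0, max(0.0, step / ramp_steps))


def ema_alpha(step, switch_step=40000):
    """Weight hyperparameter: 0.99 for the first 40000 steps, 0.999 afterwards."""
    return 0.99 if step < switch_step else 0.999
```

A smaller value of alpha early in training lets the teacher follow the quickly changing student, while the larger value later gives a longer averaging window.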

Test experiment settings:

Validation set: the 312 validation samples in the dataset.

Evaluation metric: mean Intersection-over-Union (mIoU).

Baseline: the result obtained by training the SSCNs network on the same number of labeled samples and then running inference on the same validation set.
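The mIoU metric can be computed from per-point predictions and labels as sketched below. This is a generic illustration, not the evaluation script used in the experiments; the function name `mean_iou` is hypothetical, and skipping classes that appear in neither the predictions nor the labels is an assumption about how absent classes are handled.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes, from per-point class ids."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both pred and target (assumption)
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)
```

For four points with predictions `[0, 0, 1, 1]` and labels `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mIoU is 7/12.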

Labeled sample proportion	10%	20%	30%	40%	50%	70%	100%
Baseline SSCNs	40.49	50.04	53.39	53.62	55.86	57.71	60.04
The invention	42.74	51.86	55.84	55.87	57.77	59.19	61.76

Transductive learning result verification:

Transductive learning refers to running inference on the unlabeled samples used during training, and is a common evaluation mode in semi-supervised learning. This section shows the results of the invention in this evaluation mode.

Evaluation set: the unlabeled training samples at each labeling proportion.

Evaluation metric: mean Intersection-over-Union (mIoU).

Labeled sample proportion	10%	20%	30%	40%	50%	70%
Baseline SSCNs	44.42	57.23	61.26	63.48	65.92	68.49
The invention	46.90	59.29	63.50	65.29	67.47	70.36

Analysis of results:

The semi-supervised three-dimensional point cloud semantic segmentation method provided by the invention improves on the accuracy of the SSCNs baseline both on the validation set (by 1.48 to 2.45 mIoU points across labeling proportions) and in the transductive evaluation mode (by 1.55 to 2.48 mIoU points). Semantic segmentation accuracy on three-dimensional point clouds can therefore be improved with a small number of labeled samples and a large number of unlabeled samples, which greatly reduces the dependence on data annotation compared with conventional three-dimensional point cloud semantic segmentation methods.

A specific embodiment is provided herein to illustrate the invention and how it is carried out. The details of the embodiment are not intended to limit the scope of the claims, but to aid in understanding the method of the invention. Those skilled in the art will appreciate that various modifications, changes, or substitutions of the steps of the preferred embodiment are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the preferred embodiment and the drawings.

References

[1] N. Souly, C. Spampinato, M. Shah, Semi supervised semantic segmentation using generative adversarial network, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5688-5696.

[2] S. Mittal, M. Tatarchenko, T. Brox, Semi-supervised semantic segmentation with high- and low-level consistency, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3] Y. Grandvalet, Y. Bengio, Semi-supervised learning by entropy minimization, Advances in Neural Information Processing Systems 17 (2004) 529-536.

[4] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, C. Raffel, FixMatch: Simplifying semi-supervised learning with consistency and confidence, arXiv preprint arXiv:2001.07685.

[5] S. Laine, T. Aila, Temporal ensembling for semi-supervised learning, arXiv preprint arXiv:1610.02242.

[6] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in: Advances in Neural Information Processing Systems, 2017, pp. 1195-1204.

[7] C. R. Qi, H. Su, K. Mo, L. J. Guibas, PointNet: Deep learning on point sets for 3D classification and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652-660.

[8] C. R. Qi, L. Yi, H. Su, L. J. Guibas, PointNet++: Deep hierarchical feature learning on point sets in a metric space, Advances in Neural Information Processing Systems 30 (2017) 5099-5108.

[9] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, M. Felsberg, Deep projective 3D semantic segmentation, in: International Conference on Computer Analysis of Images and Patterns, Springer, 2017, pp. 95-107.

[10] A. Milioto, I. Vizzo, J. Behley, C. Stachniss, RangeNet++: Fast and accurate LiDAR semantic segmentation, in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 4213-4220.

[11] B. Graham, M. Engelcke, L. Van Der Maaten, 3D semantic segmentation with submanifold sparse convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9224-9232.

[12] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, M. Nießner, ScanNet: Richly-annotated 3D reconstructions of indoor scenes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828-5839.

Claims

1. A semi-supervised three-dimensional point cloud semantic segmentation method based on a neural network, characterized in that the semi-supervised Mean Teacher model is adopted and combined with a three-dimensional point cloud semantic segmentation backbone network to form a complete semi-supervised three-dimensional point cloud semantic segmentation framework; the structure of the segmentation network model is as follows: it is divided into an upper branch and a lower branch, wherein the upper branch is called the student network and the lower branch is called the teacher network, and the student network and the teacher network adopt the same structure, namely, both adopt the three-dimensional semantic segmentation backbone network; the input of the student network is the original, untransformed point cloud, and the input of the teacher network is the transformed point cloud; the output of the student network on the labeled part is supervised by the corresponding labels, while the overall outputs of the student network and the teacher network are supervised for consistency, so that the weights of the student network are updated, and the weights of the teacher network are obtained as an exponential moving average of the weights of the student network;

the three-dimensional point cloud semantic segmentation comprises the following specific steps:

step 1: partitioning training data sets

The training samples for semi-supervised learning consist of two parts, namely labeled data and unlabeled data; a certain proportion of labeled training samples is taken from an existing labeled data set, and the labels of the remaining samples are removed so that they serve as unlabeled training samples; or labeled training samples and unlabeled training samples are collected independently; here, the object classes contained in the labeled samples cover all the object classes to be segmented;

step 2: network pre-training

The backbone network used by the teacher network and the student network is pre-trained on the labeled data obtained by the partitioning or collection in step 1, and pre-training is fully supervised; the loss function adopted in the training process is the standard cross-entropy loss function;

step 3: network training

The labeled and unlabeled point cloud samples fed to the network are denoted $x^l$ and $x^u$ respectively, where each sample $x_i \in \mathbb{R}^{p \times 6}$ holds the $p$ points contained in that training sample together with their coordinates and color information; a batch of training samples is denoted $x^l \cup x^u$, and its scaled and rotated version is denoted $\tilde{x}^l \cup \tilde{x}^u$; $x^l \cup x^u$ and $\tilde{x}^l \cup \tilde{x}^u$ are used as the inputs of the student branch and the teacher branch respectively, and the corresponding outputs are denoted $\hat{y}^l \cup \hat{y}^u$ and $\tilde{y}^l \cup \tilde{y}^u$;

Before training starts, the student network and the teacher network are each initialized with the weights obtained by the pre-training in step 2; then, at each training iteration, the labeled part $\hat{y}^l$ of the student output is supervised by the corresponding annotation $y$ through the supervised loss $L_{sup}$, while the student output $\hat{y}^l \cup \hat{y}^u$ and the teacher output $\tilde{y}^l \cup \tilde{y}^u$ are supervised by the designed consistency loss $L_{con}$:

[equation missing from the extracted text]

wherein $f_T$ and $f_S$ denote the teacher network and the student network respectively, $\tau$ denotes the scaling and rotation transformation mentioned above, and KL denotes the KL divergence; the overall loss function $L$ is:

[equation missing from the extracted text]

wherein $\omega_c$ is the consistency weight parameter;

the student network updates its parameters by optimizing the overall loss function $L$; the teacher network is obtained by taking an exponential moving average of the parameters of the student network, and the specific calculation formula is as follows:

$$\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha) \theta_t$$

wherein $\theta'_t$ and $\theta_t$ denote the weights of the teacher network and the student network at the $t$-th iteration respectively, and $\alpha$ is a weight hyperparameter;

step 4: network inference

At inference time, a satisfactory three-dimensional point cloud semantic segmentation result is obtained with the trained teacher network or the trained student network.