CN113095328A

CN113095328A - Self-training-based semantic segmentation method guided by Gini index

Info

Publication number: CN113095328A
Application number: CN202110318561.6A
Authority: CN
Inventors: 王立春; 胡玉杰; 王少帆; 孔德慧; 李敬华; 尹宝才
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2021-07-09
Anticipated expiration: 2041-03-25
Also published as: CN113095328B

Abstract

The invention discloses a self-training-based semantic segmentation method guided by a kini index, which is provided by the invention.

Description

Self-training-based semantic segmentation method guided by Gini index

Technical Field

The invention relates to a self-training-based field self-adaptive semantic labeling method, which is different from the traditional method in the selection of a pseudo label, determines the pseudo label by taking a Gini index as a basis, belongs to the field of pattern recognition and computer vision, and can be applied to the automatic driving and robot visual navigation technology.

Background

The self-training-based domain adaptive semantic segmentation method uses two types of data: the method comprises the steps that labeled source domain data and unlabeled target domain data are obtained, labels are used as supervision information in a source domain, pseudo labels are used as supervision information in a target domain, a network is trained on the basis of the supervision information, and then a model with a good semantic labeling effect on a target domain image is learned. Accurate unsupervised domain adaptive semantic segmentation is important for applications with obvious data difference between a model learning stage and a model using stage, such as automatic driving, robot navigation and the like.

The main idea of unsupervised domain adaptation based on self-training is to create a pseudo label and use the pseudo label as a real label of a target domain image in a training phase. The biggest problem to be solved by the unsupervised field adaptive method based on self-training is how to acquire correct pseudo labels, and wrong pseudo labels may finally cause 'confirmation deviation', namely, the wrong pseudo labels become noise when being used as supervision information, so that the expression performance of a trained model is worse.

In order to efficiently obtain as correct a pseudo tag as possible, existing strategies include: selecting a pseudo label based on the prediction of the network output; the pseudo-label is selected based on a measure of uncertainty of the network output prediction. The method comprises the steps that a threshold value is set in advance, and pixels with Softmax scores larger than the threshold value are endowed with class labels corresponding to the maximum prediction score value as pseudo labels of the pixels. The method can generate error labels in the initial stage of iteration, but as the number of iterations increases, the performance of the classifier on the test data set is improved, so that the accuracy of the labels is improved. The problem with this approach is that choosing a pseudo label in cases where the model is highly uncertain about the prediction of the pixel (e.g., boundary pixels) is error prone, i.e., a softmax score above a threshold does not represent that the corresponding predicted label is correct. In response to this problem, researchers have proposed to measure the uncertainty of the network output prediction and select pseudo labels based on the uncertainty, which is typically an approach of this type, which calculates the Entropy of the network output prediction to measure the uncertainty of the prediction and selects pseudo labels based on the Entropy to improve the reliability of the pseudo labels. However, in the entropy-based gradient back propagation optimization process, the categories which are easy to classify are optimized in a biased manner, namely the optimization weight of the categories which are difficult to classify is smaller than that of the categories which are easy to classify, so that the problem that the pixel accuracy which is difficult to classify is not high is caused.

Disclosure of Invention

In order to effectively improve the accuracy of unsupervised domain self-adaptive semantic segmentation based on a self-training frame, the invention provides that the uncertainty of output prediction is measured by using a kini index, and the selection of a pseudo label is guided by using the kini index, namely, a class label corresponding to the maximum softmax score is given to a pixel with the kini index smaller than a set threshold value as the pseudo label. The abscissa shown in fig. 1 is the output prediction probability and the ordinate is the gradient based on the uncertainty measure (entropy information or kini index) when propagating backwards. The counter-propagating gradients in the two intervals of prediction probability [0.75,0.9] and prediction probability [0.9,1] in FIG. 1 are compared: when uncertainty of output prediction is calculated based on entropy information, the gradient for back propagation is far greater than the prediction probability [0.75,0.9] in the prediction probability [0.9,1] interval; when the uncertainty of output prediction is calculated by adopting the Gini index, the difference between the gradient used for back propagation in the prediction probability [0.9,1] interval and the gradient used for back propagation in the prediction probability [0.75,0.9] interval is not large. That is, the gradient calculated based on the kini index does not pay much attention to the points in the [0.9,1] interval during the back propagation process, and the model gives relatively greater update weight to the classes over the prediction probability [0.75,0.9] interval. Research results have indicated that the IOU value of a class is positively correlated with the class prediction probability. Because the relevant entropy index of the Gini index is more focused on outputting the corresponding point of the prediction probability [0.75,0.9] interval during model training, the IOU value of the prediction probability in the [0.75,0.9] interval can be improved. Because the pseudo label selected on the interval [0.9,1] is relatively accurate, the accuracy of class prediction on the interval [0.75,0.9] is ensured, the correctness of the pseudo label is ensured, more correct supervision information is favorably introduced, and the introduction of noise is avoided.

In the invention, the source domain refers to a synthetic data set, and the target domain refers to a real data set. The network comprises two sub-networks with the same structure and different parameters, which are respectively called G_teAnd G_st. During training, network G_teThe source domain and target domain images are input for training. After training is complete, use G_tePerforming semantic segmentation on the target domain image based on G_teAnd predicting and calculating a kini index of the output target domain image and guiding the acquisition of the target domain image pseudo label by using the kini index. The obtained pseudo label is used for training G_st。

The method comprises the following specific steps:

step (1), a RGB image is taken from a source domain data set and a target domain data set at random respectively as a batch input semantic segmentation network G_te；

Step (2), calculating cross entropy loss of the source domain image based on the output prediction graph and the group channel of the last two layers of the network, and carrying out weighted summation on the loss of the last two layers;

step (3), respectively calculating the Gini index and the uncertainty loss of the output prediction images of the last two layers of the target domain image, and weighting and summing the losses of the last two layers;

step (4), summing the weighting loss calculated in the step (2) and the weighting loss calculated in the step (3), and iterating until the loss of the model is smaller than a certain threshold value by using an error back propagation optimization model, and finishing training of the batch data;

step (5), returning to step 1 to continue to select new batch data, repeating steps 1 to 5 until 2000 batches are trained and storing the trained models;

step (6), repeating the steps (1) to (5) until 120000 batch data are trained, namely, 60 models are saved;

step (7), testing 60 stored models in a target domain verification set, calculating output prediction of images of a target domain training set by using the model with the best accuracy on the verification set, calculating a kini index corresponding to the output prediction, and endowing the images of the target domain training set with a pseudo label based on the kini index;

step (8), respectively taking an RGB image as a batch as input randomly from the source domain data set and the target domain data set, and training a semantic segmentation network G_st；

Step (9), calculating cross entropy loss of the source domain image based on the output prediction graph and the group prediction of the last two layers, and carrying out weighted summation on the loss of the last two layers;

step (10), calculating cross entropy losses of the output prediction images of the last two layers and the pseudo labels of the target domain images respectively, and performing weighted summation on the losses of the last two layers;

step (11), respectively calculating the Gini index and the uncertainty loss of the output prediction images of the last two layers of the target domain images, and carrying out weighted summation on the losses of the last two layers of the target domain images;

step (12), summing the weighting losses of the step (9), the step (10) and the step (11), and iterating until the loss of the model is smaller than a certain threshold value by using an error back propagation optimization model, and finishing the training of the batch data;

step (14), the step 8 is returned to continue to select new batch data, and the steps 8 to 12 are repeated until 2000 batches are trained and the trained models are stored, wherein 120000 batches are trained in total, namely 60 models are stored;

and (15) testing the 60 stored models in the target domain test set during testing, and taking the optimal segmentation result.

Compared with the prior art, the method has the advantages that the pseudo label of the target domain image selected by taking the uncertainty of the Kernel index measurement as the basis is more reliable, and the semantic annotation accuracy of the target domain is improved.

Drawings

FIG. 1: the gradient values VS predict the probability.

FIG. 2: the self-adaptive semantic segmentation network structure chart of the self-supervision field.

FIG. 3: semantic segmentation network (G)_teAnd G_st) Structure diagram.

FIG. 4: ASPP module structure chart.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and examples.

The invention utilizes the synthetic data set as a source domain and the real data set as a target domain. The network comprises two sub-networks with the same structure and different parameters, which are respectively called G_teAnd G_st. During training, the network G is semantically segmented_teThe source domain and target domain images are input for training. After training is completed, the network G is semantically segmented_teAnd calculating the prediction of the target domain training image, calculating the predicted kini index of the target domain training image and guiding the acquisition of the target domain image pseudo label by using the kini index. Target domain image pseudo label for training semantic segmentation network G_st。

The method comprises the following specific steps:

Step (2), calculating cross entropy loss of the source domain image based on the output prediction graph and the group treth of the last two layers of the network, and carrying out weighted summation on the loss of the last two layers of the source domain;

step (3), respectively calculating the Gini index and the uncertainty loss of the output prediction images of the last two layers of the target domain images, and carrying out weighted summation on the losses of the last two layers of the target domain images;

step (7), testing 60 stored models in a target domain verification set, calculating output prediction of images of a target domain training set by using the model with the best accuracy, calculating a kini index corresponding to the output prediction, and endowing the images of the target domain training set with a pseudo label based on the kini index;

Step (9), calculating cross entropy loss of the source domain image based on the output prediction graph and the group prediction of the last two layers, and performing weighted summation on the loss of the last two layers of the source domain;

step (10), calculating cross entropy losses of the output prediction images of the last two layers of the target domain image and the pseudo label respectively, and performing weighted summation on the losses of the last two layers of the target domain image;

step (14), the step 8 is returned to continue to select new batch data, and the steps 8 to 12 are repeated until 2000 batches are trained and the trained models are stored, wherein 120000 batches are trained in total, namely 60 models are stored; and (15) testing the 60 stored models in the target domain test set during testing, and taking the optimal segmentation result.

The model constructed by the method provided by the invention is an unsupervised field self-adaptive network, the overall structure of the network is shown as figure 2, and the network comprises two sub-networks G_teAnd G_st. First training network G_te(ii) a Then according to network G_teSelecting a pseudo label of the target domain image for the uncertainty measurement result of the target domain prediction, and training a semantic segmentation network G after giving the pseudo label to the corresponding target domain image_stAnd the prediction accuracy of the target domain image is improved by adding effective supervision information.

1. Semantic segmentation netLuo G_teAnd G_stThe network structure of (2):

semantic segmentation network G_teAnd G_stThe network structure of the system is the same, the Deeplab-V2 is used as a basic network architecture and is composed of an encoder and a decoder, and the specific network structure diagram is shown in FIG. 3.

The encoder uses Resnet101 as the basic network, and the network structure parameters are shown in Table 1. The encoder is composed of four blocks of convolutional layers Conv _1, Conv _2, Conv _3, Conv _4 and Conv _5, each block comprises 3 residual modules, 4 residual modules, 23 residual modules and 3 residual modules, and the active functions are ReLU functions.

The convolution layer Conv _1 includes 64 7 × 7 filters having stride 2 and padding 3. In four blocks, Conv _2 contains one 3 × 3 max pooling layer and 3 residual modules; stride 2, no padding, of 1 x 1 filter of the Conv _3 first residual block; the 3 × 3 filter of the first residual block of Conv _4 is a hole convolution with stride 1, variance 2, padding 2; the 3 × 3 filter of the first residual block of Conv _5 is a hole convolution with stride 1, variance 4, padding 4; in the remaining residual blocks not specifically described above, all 3 × 3 filters are convolutions in which stride is 1 and padding is 1, and all 1 × 1 filters are convolutions in which stride is 1 and no padding is present.

The decoder inputs the feature maps obtained by Conv _4 and Conv _5 into the ASPP module, the final feature map output by the ASPP is 1/8 of the original image, the original image is restored to the original image size by bilinear interpolation, and finally the boundary is smoothed by using CRF to obtain the final semantic segmentation result. The structure of the ASPP module is shown in fig. 4, and the detailed parameters are shown in table 2.

2. Selection of pseudo tags based on kini index

For the target domain training set image, the target domain image x_t∈R^H×W×3Is finally predicted to be a segmentation map

Semantic-dependent partitioning of networks G_teCalculation, as shown in equation (1): the specific calculation method for predicting the selected pseudo label is as follows:

wherein,

is a semantically segmented network G_teConv _5 of (2) output target area image x_tThe predicted partition map of (a) is,

is a semantically segmented network G_teConv _4 output target area image x_tPredicted partition map of beta₃Is a hyper-parameter.

Target field image x_tThe pixel at the (h, w) position belongs to class c, if and only if the predicted value of the pixel belonging to class c is maximum and the corresponding Gini index value of the pixel is

(

Is calculated in a manner that the values of the equations (7), (8) and (9)) are less than v^(c)，v^(c)Is a hyper-parameter. The specific calculation method is as shown in formula (2), wherein

Is the target field image x_tPrediction of the pixel at the (h, w) position.

3. Semantic segmentation network G_teLoss function of

The loss of the unsupervised domain adaptive network comprises the source domain segmentation loss and the uncertainty loss of the target domain prediction.

i. Source domain partition loss

For source domain data, the invention uses the traditional cross entropy as a loss function to calculate the segmentation loss, and the corresponding segmentation loss is calculated based on the prediction output by Conv _5 and Conv _4 respectively

And

the sum of these two segmentation losses is the source domain segmentation loss L_seg(x_s,y_s)：

Wherein x is_s∈R^H×W×3Is a source domain RGB image with the resolution of H multiplied by W; y is_s∈R^H×W×CIs a source domain image x_sC is the number of classes;

is a semantically segmented network G_teConv _5 output Source Domain image x_sThe predicted partition map of (1);

is a semantically segmented network G_teConv _4 output Source Domain image x of (1)_sThe predicted partition map of (1); beta is a₁Is a hyper-parameter.

Uncertainty loss for target domain prediction

The method measures the uncertainty of target domain prediction by using the Gini index, and obtains a high-confidence prediction result for a target domain image by minimizing the Gini index and constraining an inter-domain adaptive network.

Calculating pixel-level kini indexes for target domain image predictions output by Conv _5 and Conv _4, respectively

And

the pixel-level kini index is calculated as follows:

x_t∈R^H×W×3is a target domain RGB image with resolution of H multiplied by W;

is based on a semantic segmentation network G_teConv _5 of (2) output target area image x_tA kini index map calculated from the predicted segmentation map of (1),

is the corresponding pixel-level kini index;

is based on a semantic segmentation network G_teConv _4 output target area image x_tA kini index map calculated from the predicted segmentation map of (1),

is the corresponding pixel-level kini index;

is a semantically segmented network G_teTarget of Conv _5 output ofDomain image x_tThe predicted partition map of (1);

is a semantically segmented network G_teConv _4 output target area image x_tIs predicted for the segmentation map.

Target field image x_tThe pixel-level kini index of (c) is calculated as follows:

wherein C is the number of classes, β₂Is a hyper-parameter.

Uncertainty loss of target domain prediction:

semantic segmentation network G_teTotal loss of L (x)_s,x_t) Comprises the following steps:

L(x_s，x_t)＝L_seg(x_s，y_s)+μ₁L_Gini(x_t) (10)

wherein mu₁Is a hyper-parameter.

4. Semantic segmentation network G_stLoss function of

Semantic segmentation network G_stThe loss of (2) includes a source domain partitioning loss, and the uncertainty of the target domain prediction loses a pseudo-label partitioning loss of the target domain.

The pseudo label calculation of the target domain adopts the traditional cross entropy loss function calculation, and the corresponding segmentation loss is calculated based on the predictions output by Conv _5 and Conv _4 respectively

And

the sum of these two segmentation losses is the target domain segmentation loss

Wherein x is_t∈R^H×W×3Is a target domain RGB image with resolution of H multiplied by W;

is the target field image x_tC is the number of classes;

is a semantically segmented network G_stConv _5 of (2) output target area image x_tThe predicted partition map of (1);

is a semantically segmented network G_stConv _4 output target area image x_sThe predicted partition map of (1); beta is a₂Is a hyper-parameter.

Semantic segmentation network G_stTotal loss of L (x)_s,x_t) Comprises the following steps:

wherein mu₂And mu₃Is a hyper-parameter.

Examples

1. Experimental data set

The method provided by the invention performs experiments on a common unsupervised self-adaptive data set GTA5-Cityscapes, wherein a synthetic data set GTA5 is used as a source domain, and a real data set Cityscapes is used as a target domain. Models were evaluated on the cityscaps validation set.

GTA 5: the composite data set GTA5 contains 24966 composite images with a resolution of 1914 × 1052 and a corresponding ground-truth. These composite images are collected from a city wind-light video game based on los angeles city. The automatically generated ground-truth contains 33 classes. The method of performing experiments on GTA5-Cityscapes generally only considers 19 classes compatible with the Cityscapes dataset, and the present invention is no exception.

Cityscaps: as a dataset collected from the real world, cityscaps provides 3975 images with fine segmentation annotations. The training set contained 2975 images and the validation set contained 500 images.

2. Evaluation index of experiment

The present invention uses an Intersection-over-Union (IoU) to evaluate the performance of semantic segmentation. IoU values are between [0, 1], the larger the value is, the better the segmentation effect is, IoU is defined as follows:

IoU＝TP/(TP+FP+FN)

where TP, FP and FN are the number of true positive (true positive), false positive (false positive) and false negative (false negative) pixels, respectively. The mlou in tables 3 and 4 is the mean IoU of class 19.

3. Network training

The unsupervised domain adaptive network batch size is 2, the resolution of the source domain input image is 1280 × 720, the resolution of the target domain input image is 1024 × 512, and resize to 321 × 321 is required respectively. During training, the label is reduced by 8 times, and the output prediction graph of the network is subjected to loss calculation; the output prediction graph of the network is enlarged by 8 times and compared with label during testing. Beta is a₁、β₃Set to 0.1; beta is a₂Set to 0.2, mu₁、μ₂、μ₃Set to 0.01. Semantic segmentation network G_teAnd semantic segmentation network G_stThe encoders of ResNet-101 are pre-trained based on ImageNet. We use SGD optimizationTraining apparatus G_stAnd G_stThe initial learning rate is 2.5 × 10^-4。

4. Results of the experiment

The invention performs experiments on a common unsupervised adaptive data set GTA 5-Cityscapes. Table 3 and table 4 show that the invention uses the self-supervision method guided by the kini index on the MinEnt method (network in which Vu et al uses entropy as a measure to directly minimize) and the AdvEnt method (network in which Vu et al uses entropy as a measure to fight) respectively, and it can be seen that mlou is significantly improved when the invention adds the target domain pseudo tag on the basis of the common method, and the semantic segmentation effect is improved compared with the common SSL and ESL pseudo tag self-training methods.

Table 1: encoder structure parameters

Table 2: decoder structure parameter

Table 3 improved experimental result comparison

Table 4 improved experimental results comparison

Claims

1. A self-training-based semantic segmentation method guided by a Gini index is characterized by comprising the following steps of: using the synthetic data set as a source domain and the real data set as a target domain; during training, inputting a source domain and a target domain image into an inter-domain adaptive network for training, and after the training is finished, dividing the target domain image and inputting the divided target domain image into the intra-domain adaptive network for training to obtain an optimal segmentation result;

the method comprises the following specific steps:

and (15) testing the 60 stored models in the target domain test set during testing to obtain a final segmentation result.

2. The method of claim 1, wherein the self-training semantic segmentation method based on the guiding of the kini index is characterized in that: the built model is an unsupervised field self-adaptive network, and the whole network structure comprises two sub-networks G_teAnd G_st(ii) a First training network G_te(ii) a Then according to network G_teSelecting a pseudo label of the target domain image for the uncertainty measurement result of the target domain prediction, and training a semantic segmentation network G after giving the pseudo label to the corresponding target domain image_stAnd the prediction accuracy of the target domain image is improved by adding effective supervision information.