CN114943834A

CN114943834A - Full scene semantic segmentation method based on prototype queue learning under few-label samples

Info

Publication number: CN114943834A
Application number: CN202210390663.3A
Authority: CN
Inventors: 袁媛; 王子超; 姜志宇
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2022-04-14
Filing date: 2022-04-14
Publication date: 2022-08-26
Anticipated expiration: 2042-04-14
Also published as: CN114943834B

Abstract

The invention discloses a full scene semantic segmentation method based on prototype queue learning under few labeled samples, which comprises the steps of firstly, performing prototype queue segmentation, performing mask average pooling on a feature map by using a label image to generate a foreground prototype and a background prototype, storing the foreground prototype and the background prototype into a prototype queue, and calculating the cosine distance of the feature map to obtain a new prediction probability map; and calculating the prediction probability graph by adopting an argmax function to obtain a mask label of the segmentation result, performing mask average pooling on the feature graph by utilizing the mask label, generating a foreground prototype and a background prototype at the second stage, storing the foreground prototype and the background prototype into a prototype queue, and calculating the cosine distance between the foreground prototype and the feature graph to obtain a final segmentation result. The method reduces the dependence on model parameters, improves the generalization and realizes better segmentation effect by using less labeled samples.

Description

Full scene semantic segmentation method based on prototype queue learning under few-label samples

Technical Field

The invention belongs to the technical field of pattern recognition, and particularly relates to a full scene semantic segmentation method.

Background

Image Semantic Segmentation (Semantic Segmentation) is the pixel-level classification of images according to the Semantic class to which the pixels in a scene belong. The semantic segmentation method based on deep learning usually needs a large number of dense pixel-level labels, but the labeling of samples in an actual task is time-consuming and labor-consuming, and the labeling of samples in a specific task is difficult to obtain. Based on the above, the full scene semantic segmentation under the condition of few labeled samples aims at realizing the division of all pixels in the image according to the semantic categories under the condition of only a few labeled samples. The technology plays a key role in the application of practical high-complexity and strong dynamic scenes such as city planning, precision agriculture, forest inspection, national defense and military and the like.

With the development of deep learning, the semantic segmentation field makes many progresses, and a small sample semantic segmentation technology under the condition of few labeled samples is developed to a certain extent by combining the migration effect of meta-learning and the less sample adaptability of metric learning. However, the current small sample semantic segmentation mainly focuses on segmenting foreground objects and background, and often neglects the requirement of multi-class semantic segmentation. How to guide a test sample by fully utilizing a small number of marked samples in a metric learning mode is an important problem in a small sample semantic segmentation technology. Wang et al in the literature "Kaixin Wang, Jun Hao View, Yingtian Zou, Daquan Zhou, and Jianshi Feng. Panet: Few-shot image segmentation with protocol alignment in IEEE International Conference on Computer Vision,2019, pp.9197-9206" reverse alignment regularization of the process of prototype guided segmentation, thereby enhancing the propagation of key semantics. Wang et al in the literature "Haochen Wang, Xudong Zhang, Yutao Hu, Yandan Yang, Xianbin Cao, and Xiianong Zhen. Few-shot segmentation with a removal characterization networks. in European Conference on Computer Vision,2020, pp.730-746" established pixel-to-pixel correlations, replacing prototypes generated by mask pooling to deepen guided segmentation of test samples by sample labels.

Furthermore, the use of potentially new classes of information in the background helps to alleviate the problem of feature confusion, i.e. further enhances the efficient representation of different semantic classes. The Yang et al document "life Yang, Wei Zhuo, Lei Qi, Yinghuan Shi, and Yang gao. minor classes for raw-shot segmentation. in IEEE International Conference on Computer Vision,2021, pp.8721-8730" introduces an additional branched network to utilize the potential new class information, and realizes more stable prototype guidance by correcting the foreground and background on this basis. In addition, the conventional small sample segmentation method has a coarse process of extracting prototypes, thereby causing loss of detail information when the masks are averaged and pooled. The loss of detail information can be reduced by performing iterative optimization on the prototype extraction process, and important and comprehensive semantic information can be retained, for example, c.zhang et al in documents "Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua shell.canet: Class-advertising segmentation networks with iterative refinement and objective raw-shot learning in IEEE Conference Computer Vision and Pattern Recognition,2019, pp.5217-5226" designs an iterative optimization module to optimize the segmentation process, but the above method does not directly update the prototype, so that the detail information lost by extracting the prototype is difficult to be supplemented. The loss of detail information can be further reduced by means of iterative optimization, but the optimization of the prototype extraction process is still insufficient.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a full scene semantic segmentation method based on prototype queue learning under few labeled samples, which comprises the steps of firstly carrying out prototype queue segmentation, carrying out mask average pooling on a feature map by utilizing a label image to generate a foreground prototype and a background prototype, storing the foreground prototype and the background prototype into a prototype queue, and then calculating the cosine distance of the feature map to obtain a new prediction probability map; and calculating the prediction probability graph by adopting an argmax function to obtain a mask label of the segmentation result, performing mask average pooling on the feature graph by utilizing the mask label, generating a foreground prototype and a background prototype at the second stage, storing the foreground prototype and the background prototype into a prototype queue, and calculating the cosine distance between the foreground prototype and the feature graph to obtain a final segmentation result. The method reduces the dependence on model parameters, improves the generalization and realizes better segmentation effect by using less labeled samples.

The technical scheme adopted by the invention for solving the technical problem comprises the following steps:

step 1: dividing a prototype queue;

step 1-1: uniformly cutting the training image and the corresponding label image pair into a fixed size; establishing an empty prototype queue;

step 1-2: taking a training image as input data, and generating a feature map F through a feature extractor;

step 1-3: carrying out mask average pooling on the feature map F by using the label image M to generate a foreground prototype p _c And background prototype p _bg ：

Wherein, (x, y) represents the coordinate of the pixel point, 1[ ] represents the indicating function, namely the function value is 1 when the formula in the bracket is correct, otherwise, it is 0; c is a foreground category set, C is a foreground category in the image, and h and w are the length and the width of the input image respectively;

step 1-4: the foreground is prototyped p _c And background prototype p _bg Storing the foreground categories into an original queue, wherein the number of the foreground categories in the original queue is multiple, and the number of the background categories in the original queue is only one;

step 1-5: repeating the steps 1-2 to 1-4, and traversing all the training images and the corresponding label images; when storing in the prototype queue, if the foreground prototype or the background prototype generated later has the foreground prototype or the background prototype of the same category in the prototype queue, covering the foreground prototype or the background prototype of the same category in the prototype queue;

step 1-6: respectively calculating the cosine distance between the foreground prototype and the background prototype of different categories in the prototype queue and each pixel position in the feature map F to obtain a preliminary prediction probability map

Connecting P and F, and performing convolution calculation to obtain a new prediction probability map P _final The calculation is as follows:

P _final ＝Conv(Concat(F，P)) (3)

prediction probability map P _final The result is the preliminary prediction segmentation result;

step 2: second stage segmentation constraints;

step 2-1: using argmax function to predict probability map P _final Calculating to obtain a segmentation result mask label, and then carrying out binarization to uniformly label the non-foreground category as a background category to obtain a mask label only containing the foreground category and the background category;

step 2-2: performing mask average pooling on the feature graph F by using a mask label to generate a foreground prototype and a background prototype at the second stage;

step 2-3: storing the foreground prototype and the background prototype in the second stage into a prototype queue, and covering the foreground prototype or the background prototype in the prototype queue if the foreground prototype or the background prototype in the same category exists in the prototype queue;

step 2-4: respectively calculating the cosine distance between the foreground prototype and the background prototype of different categories in the prototype queue obtained in the step 2-3 and each pixel position in the feature map F to obtain a second-stage prediction probability map

Second stage predictive probability map

The final segmentation result is obtained;

and step 3: training according to the overall loss function to obtain a final segmentation model;

step 3-1: evaluating the loss;

using predictive probability maps P _final And the label image M calculates the evaluation loss of the preliminary segmentation result on the foreground category as follows:

wherein the content of the first and second substances,

for each position in the input image a probability of being predicted as foreground, c _fg A foreground category label; n represents the product of h and w;

using second stage predictive probability maps

And calculating the evaluation loss of the second-stage segmentation result of the foreground class by the label image M as follows:

wherein the content of the first and second substances,

representing the probability that each position of the input image in the second stage prediction result is predicted to be foreground;

the evaluation loss was calculated as follows:

L _eval ＝L _seg +L _t-s (6)

step 3-2: a multi-class loss;

multi-class loss L _mult The calculation is as follows:

wherein the pseudo label

Calculating a preliminary prediction probability map P by adopting an argmax function; multi-class prediction probability map

The characteristic diagram F is obtained through convolution operation and up-sampling calculation;

representing each position of input image in multiple classes of prediction resultsThe probability of prediction as class cl;

step 3-3: a background hiding class loss function;

calculating constraint loss for background region of input image, and using label image M and prediction probability map P by cross entropy formula _final Calculating false positive rate of background region, namely background Entropy loss Encopy _bg Loss of background Entropy Encopy _bg Describing the probability that a background region is not mispredicted as foreground, is calculated as follows:

in order to prevent the background area from being predicted as the foreground, increase the background Entropy value, reduce the probability that the hidden class of the background area is wrongly predicted, and lose the background Entropy _bg The addition loss constraints are as follows:

wherein λ is a background optimization weight parameter;

step 3-4: overall loss function:

Loss＝L _eval +L _blr +α×L _mult (10)

wherein alpha is a multi-class constraint weight parameter, and the value range is between 0 and 1.

Preferably, the training image and the corresponding label image pair are uniformly cropped to be a fixed size of 512 × 512 in step 1-1.

Preferably, said λ ranges between 1 and 2.

The invention has the following beneficial effects:

1. and expanding the foreground and background segmentation of the small sample to full scene multi-class semantic segmentation. The prototype queue provided by the invention can be used for updating and storing different types of prototypes and guiding multi-type segmentation. Different from the traditional method which is suitable for simple scenes, the method can realize the analysis of multi-class scenes.

2. The multi-class segmentation effect is better, and the multi-class segmentation can be realized by inputting single-class labels. The multi-class guiding branch designed by the invention adopts the preliminary multi-class segmentation result as a pseudo label to replace a single-class label guiding model to learn multi-class characteristics, thereby realizing better multi-class segmentation effect.

3. And the segmentation robustness is stronger under the condition of lacking sample labeling. The method is based on small sample learning and metric learning, and extracts image features and maps the image features to a feature metric space. The pixel-level multi-class segmentation is completed in a measuring mode, dependence on model parameters is reduced, generalization is improved, a better segmentation effect is achieved by using fewer labeled samples, and robustness is stronger in an environment where sample labeling is lack.

4. The accuracy rate and the average intersection ratio of the segmentation results are higher. The background hiding optimization module and the two-stage segmentation module further optimize the segmentation result, and can help the model architecture better analyze the scene.

5. The technology has more practical and industrial values. The method expands the small sample segmentation to more practical multi-class semantic segmentation, can meet the industrial requirements of urban planning, precision agriculture, automatic driving and the like, only needs fewer labeled samples, reduces the labeling cost, and is more suitable for practical application scenes.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a semantic segmentation result comparison graph generated by the method and the comparison method of the present invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

The invention discloses a full scene semantic segmentation framework based on prototype queue learning under less sample labeling, which mainly solves the problems of multi-class semantic segmentation and background potential class in small sample semantic segmentation. In particular, the present invention aims to solve the following aspects:

1. the existing small sample semantic segmentation technology only segments the foreground and the background but does not analyze the background in a complex scene, and more practical multi-class small sample semantic segmentation is realized.

2. The prior art lacks of fully utilizing potential new class information contained in the background class in the training sample.

3. In the prior art, local detail information is easy to lose by extracting the average pooling of the masks adopted by the semantic category feature prototype.

A full scene semantic segmentation method based on prototype queue learning under few labeled samples comprises the following steps:

step 1: dividing a prototype queue;

Wherein, (x, y) represents the coordinates of the pixel points, C is a foreground category set, C is a foreground category in the image, and h and w are the length and width of the input image respectively;

P _final ＝Conv(Concat(F，P)) (3)

step 2: two-stage segmentation constraints;

step 2-2: performing mask average pooling on the feature map F by using a mask label to generate a second-stage foreground prototype and a background prototype;

step 2-3: storing the foreground prototype and the background prototype in the second stage into a prototype queue, and covering the foreground prototype or the background prototype in the prototype queue if the same type of foreground prototype or background prototype exists in the prototype queue;

step 2-4: respectively calculating cosine distances between foreground prototypes and background prototypes of different categories in the prototype queue obtained in the step 2-3 and each pixel position in the feature map F to obtain a second-stage prediction probability map

Second stage predictive probability map

The second stage segmentation result is obtained;

step 3-1: evaluating the loss;

using predictive probability maps P _final And the evaluation loss of the label image M on the foreground classification calculation preliminary segmentation result is as follows:

wherein the content of the first and second substances,

for each position in the input image, the probability of being predicted as foreground, c _fg A foreground category label;

using second stage predictive probability maps

And calculating the evaluation loss of the segmentation result of the second stage on the foreground category by the label image M as follows:

the evaluation loss was calculated as follows:

L _eval ＝L _seg +L _t-s (6)

step 3-2: a multi-class loss;

multiple class loss L _mult The calculation is as follows:

wherein the pseudo label

Directly performing convolution and up-sampling calculation on the feature map F to obtain a feature map F;

step 3-3: a background hiding class loss function;

in order to prevent the background area from being predicted as the foreground, increase the background Entropy value, reduce the probability that the hidden class of the background area is mispredicted, and lose the background Entropy _bg The addition loss constraints are as follows:

wherein λ is a background optimization weight parameter;

step 3-4: overall loss function:

Loss＝L _eval +L _blr +α×L _mult (10)

The specific embodiment is as follows:

1. simulation conditions

The invention is a simulation by using Pythrch on an operating system with a central processing unit of Intel (R) Xeon (R) Silver 4110CPU @2.10GHz and a memory 40G, Linux. The data used in the simulation is an open data set.

2. Emulated content

The data used in the simulation were from the UDD and Vaihingen datasets. The UDD dataset contains 141 RGB pictures taken with a drone, containing six categories, cut into 2439 image blocks of 720 × 720 pixels. The Vaihingen dataset is an aerial photograph dataset published by the ispss, having a total of 33 RGB pictures, containing six categories, cut into 426 512 × 512 image blocks. Five pictures and corresponding class labels of the five pictures are selected as small samples for model training in each class, and the rest pictures are used for testing. In order to ensure that the fairness training samples of the experiment are randomly selected for five times, the test indexes are the average values of five groups of experiment indexes.

In order to prove the effectiveness of the algorithm, the invention selects PANET, HRNet and HRNet + to compare on two data sets. Among these, PANET is a reference to "Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jianshi Feng. Panet: few-shot image segmentation with prototypic alignment in IEEE International Conference on Computer Vision,2019, pp.9197-9206 ", which is a classic thumbnail semantic segmentation algorithm; HRNet is a classic semantic segmentation algorithm proposed in the documents "Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang.deep high-resolution representation for human position estimation. in IEEE Conference on Computer Vision and Pattern registration, 2019, pp.5693-5703", which we use to verify the effect of a fine tuning method on multi-type small sample segmentation tasks; HRNet + is a model obtained by taking HRNet as a feature extractor and adopting a small sample segmentation experiment method based on measurement to improve, and is a basic network of the invention. PQLNet is the method proposed in the present invention, OA and mlou are evaluation indexes for semantic segmentation quality of small samples, and the comparison results are shown in table 1:

TABLE 1 comparative results

As can be seen from table 1, the present invention is superior to other algorithms in OA and mlou indices on the UDD dataset and the Vaihingen dataset.

FIG. 2 is a graph of semantic segmentation results generated by the method of the present invention and the comparison algorithm. Compared with a comparison algorithm, the multi-class feature segmentation method has more accurate multi-class segmentation edges, and can prove that the multi-class feature segmentation method effectively utilizes multi-class joint information and increases the feature discrimination of different classes. In addition, the invention also achieves the effects of eliminating particles and thinning edges, thereby proving the effects of background hidden type distribution optimization and two-stage segmentation modules.

Claims

1. A full scene semantic segmentation method based on prototype queue learning under few labeled samples is characterized by comprising the following steps:

step 1: dividing a prototype queue;

Wherein, (x, y) represents the pixel point coordinate, 1[ ] represents the indication function, namely the function value is 1 when the formula in the bracket is correct, otherwise it is 0; c is a foreground category set, C is a foreground category in the image, and h and w are the length and the width of the input image respectively;

P _final ＝Conv(Concat(F,P)) (3)

step 2: second stage segmentation constraints;

step 2-1: using argmax function to predict probability map P _final Calculating to obtain a mask label of a segmentation result, and then carrying out binarization to uniformly mark the non-foreground category as a background category to obtain a mask label only containing the foreground category and the background category;

Second stage predictive probability map

The final segmentation result is obtained;

step 3-1: evaluating the loss;

wherein the content of the first and second substances,

using second stage predictive probability maps

wherein the content of the first and second substances,

the evaluation loss was calculated as follows:

L _eval ＝L _seg +L _t-s (6)

step 3-2: a multi-class loss;

multi-class loss L _mult The calculation is as follows:

wherein the pseudo label

representing the probability that each position of the input image in the multi-class prediction result is predicted to be a class cl;

step 3-3: a background hiding class loss function;

wherein λ is a background optimization weight parameter;

step 3-4: the overall loss function:

Loss＝L _eval +L _blr +α×L _mult (10)

2. The method for full scene semantic segmentation based on prototype-queue learning under few-label samples according to claim 1, wherein the training image and the corresponding label image pair are uniformly cropped to have a fixed size of 512 x 512 in step 1-1.

3. The method for full scene semantic segmentation based on prototype queue learning under few labeled samples according to claim 1, wherein the λ is in a range from 1 to 2.