CN112801104B - Image pixel level pseudo label determination method and system based on semantic segmentation

Image pixel level pseudo label determination method and system based on semantic segmentation

Info

Publication number
CN112801104B
CN112801104B (application CN202110074943.9A)
Authority
CN
China
Prior art keywords
image
pixel
map
semantic segmentation
target
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110074943.9A
Other languages
Chinese (zh)
Other versions
CN112801104A (en)
Inventor
于哲舟
张哲�
王碧琳
李志远
王兰亭
赵凤志
Current Assignee
Jilin University
Original Assignee
Jilin University
Priority date
Filing date
Publication date
Application filed by Jilin University
Priority to CN202110074943.9A
Publication of CN112801104A
Application granted
Publication of CN112801104B
Status: Expired - Fee Related
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G06F18/24137 - Distances to cluster centroïds
    • G06F18/2414 - Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention relates to a semantic segmentation-based image pixel level pseudo label determination method and system. The method comprises: acquiring a first image and performing feature extraction on it to obtain a first feature map; obtaining a second feature map and a plurality of third feature maps from the first feature map; further obtaining a plurality of first pixel relation measurement matrices and, by fusing them, a second pixel relation measurement matrix; obtaining a fourth feature map from the second feature map and the second pixel relation measurement matrix; further obtaining a tensor matrix and the functional relation between the tensor matrix and the image output probability; training a classification network with the loss function corresponding to this functional relation; obtaining a target location map and a background target map from the trained classification network and the fourth feature map; training a semantic segmentation network model with the first image, the target location map and the background target map; and inputting the image to be detected into the trained semantic segmentation network model to obtain image pixel-level pseudo labels. The invention can thus obtain pixel-level pseudo labels for a segmentation network.

Description

Image pixel level pseudo label determination method and system based on semantic segmentation
Technical Field
The invention relates to the field of image semantic segmentation, in particular to a semantic segmentation-based image pixel level pseudo tag determination method and system.
Background
Because semantic segmentation labels require annotating every pixel of an image, producing them costs a great deal of time and effort. The generation of training data sets has long been the bottleneck of semantic segmentation research. Obtaining pixel-level annotations for a given image in an inexpensive and efficient manner is therefore a promising direction for future semantic segmentation work.
Disclosure of Invention
The invention aims to provide a method and a system for determining image pixel-level pseudo labels based on semantic segmentation, which can obtain pixel-level pseudo labels for a segmentation network from images and their image-level classification labels.
In order to achieve the purpose, the invention provides the following scheme:
a semantic segmentation based image pixel level pseudo label determination method comprises the following steps:
acquiring an initial image and preprocessing the initial image to obtain a first image;
performing feature extraction on the first image by using a feature extractor to obtain a first feature map;
inputting the first feature map into a dilated-convolution pixel relation model to obtain a second feature map and a plurality of third feature maps;
performing a matrix product between the second feature map and each third feature map to correspondingly obtain a plurality of first pixel relation measurement matrices;
averaging the plurality of first pixel relation measurement matrices to obtain a second pixel relation measurement matrix;
performing a matrix product between the second feature map and the second pixel relation measurement matrix to obtain a fourth feature map;
inputting the fourth feature map into a global average pooling layer to obtain a tensor matrix;
inputting the tensor matrix into a softmax classification layer for classification to obtain a functional relation between the tensor matrix and the image output probability;
training a classification network with the loss function corresponding to this functional relation to obtain a trained classification network;
obtaining a target location map and a background target map according to the trained classification network and the fourth feature map;
obtaining a semantic segmentation network model;
training the semantic segmentation network model according to the first image, the target location map and the background target map to obtain a trained semantic segmentation network model;
and inputting the image to be detected into the trained semantic segmentation network model to obtain the image pixel level pseudo label.
Optionally, the initial image is randomly scaled to a size in the range [321, 481], and the picture is then cropped to 321 × 321 to obtain the first image.
Optionally, the feature extractor is an improved VGG-16 network model in which the last two pooling layers of the VGG-16 model structure are removed.
Optionally, the functional relation between the tensor matrix and the image output probability is

$P_n = \frac{\exp(w_n^{\top} F_C)}{\sum_{k} \exp(w_k^{\top} F_C)}$

where $w_n$ denotes the weight parameter of $F_C$ for class n, $P_n$ denotes the image output probability of class n, and $F_C$ denotes the tensor matrix.
Optionally, the semantic segmentation network model is a DeepLab-ASPP network model.
Optionally, the initial image employs a PASCAL VOC 2012 data set.
Optionally, after the step of inputting the first feature map into the dilated-convolution pixel relation model to obtain a second feature map and a plurality of third feature maps, and before the step of performing a matrix product between the second feature map and each third feature map to obtain the plurality of first pixel relation measurement matrices, the method further includes:
and performing size reshaping on the second feature map and the plurality of third feature maps.
Optionally, the first feature map is obtained from the conv5_3 layer of the improved VGG-16 network model.
Optionally, the first feature map size is C × H × W, where C is the number of channels and H and W are the feature map height and width.
An image pixel level pseudo tag determination system based on semantic segmentation, comprising:
the system comprises a preprocessing module, a first image acquisition module, a second image acquisition module and a second image acquisition module, wherein the preprocessing module is used for acquiring an initial image and preprocessing the initial image to obtain a first image;
the characteristic extraction module is used for extracting the characteristics of the first image by using a characteristic extractor to obtain a first characteristic diagram;
the first input module is used for inputting the first feature map into the dilated-convolution pixel relation model to obtain a second feature map and a plurality of third feature maps;
the first matrix product operation module is used for performing a matrix product between the second feature map and each third feature map to correspondingly obtain a plurality of first pixel relation measurement matrices;
the matrix fusion module is used for averaging the plurality of first pixel relation measurement matrices to obtain a second pixel relation measurement matrix;
the second matrix product operation module is used for performing a matrix product between the second feature map and the second pixel relation measurement matrix to obtain a fourth feature map;
the second input module is used for inputting the fourth feature map into a global average pooling layer to obtain a tensor matrix;
the classification module is used for inputting the tensor matrix into a softmax classification layer for classification to obtain a functional relation between the tensor matrix and the image output probability;
the first network training module is used for training a classification network according to the loss function corresponding to the function relation to obtain the trained classification network;
the target location map and background target map determining module is used for obtaining a target location map and a background target map according to the trained classification network and the fourth feature map;
the model acquisition module is used for acquiring a semantic segmentation network model;
the second network training module is used for training the semantic segmentation network model according to the first image, the target location map and the background target map to obtain a trained semantic segmentation network model;
and the pseudo label determining module is used for inputting the image to be detected into the trained semantic segmentation network model to obtain the image pixel level pseudo label.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the invention, a deep neural classification network is designed to provide a high-quality pseudo label training segmentation network for a segmentation network, so that image semantic segmentation is carried out. A void convolution pixel relation network is provided, a pixel relation model between a void convolution characteristic diagram and a general convolution characteristic diagram is generated by combining the void convolution and an attention mechanism in a classification network, and a class excitation diagram generated by the classification network can highlight a more complete target area, so that a segmentation network pseudo label with higher quality is generated, and the segmentation capability of the segmentation network is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a semantic segmentation image pixel level pseudo tag determination method of the present invention;
FIG. 2 is a diagram of the dilated-convolution pixel relation network architecture of the present invention;
FIG. 3 is a diagram of the dilated-convolution pixel relation model of the present invention;
FIG. 4 is a block diagram of a semantic segmentation image pixel level pseudo tag determination system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention aims to provide a method and a system for determining image pixel-level pseudo labels based on semantic segmentation, which can obtain the pixel-level pseudo labels of a segmentation network through images and labels of the classification network.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the invention discloses a method for determining semantic segmented image pixel level pseudo labels, comprising:
step 101: the method comprises the steps of obtaining an initial image and preprocessing the initial image to obtain a first image.
Step 102: and performing feature extraction on the first image by using a feature extractor to obtain a first feature map.
Step 103: and inputting the first characteristic diagram into a cavity convolution pixel relation model to obtain a second characteristic diagram and a plurality of third characteristic diagrams.
Step 104: and performing matrix product operation on the second characteristic diagram and each third characteristic diagram respectively to correspondingly obtain a plurality of first pixel relation measurement matrixes.
Step 105: and carrying out average fusion on the plurality of first pixel relation measurement matrixes to obtain a second pixel relation measurement matrix.
Step 106: and performing matrix product operation on the second characteristic diagram and the second pixel relation measurement matrix to obtain a fourth characteristic diagram.
Step 107: and inputting the fourth feature map into a global average pooling layer to obtain a tensor matrix.
Step 108: and inputting the tensor matrix into a softmax classification layer for classification to obtain a functional relation between the tensor matrix and the image output probability.
Step 109: and training a classification network according to the loss function corresponding to the function relationship to obtain the trained classification network.
Step 110: and obtaining a target position image and a background target image according to the trained classification network and the fourth feature image.
Step 111: and acquiring a semantic segmentation network model.
Step 112: and training the semantic segmentation network model according to the first image, the target position graph and the background target graph to obtain the trained semantic segmentation network model.
Step 113: and inputting the image to be detected into the trained semantic segmentation network model to obtain the image pixel level pseudo label.
Step 101 specifically includes:
the PASCAL VOC 2012 data set (20 foreground classes and one background class) is used, which includes 1464 pictures as training set, 1449 pictures as verification set, and 1456 pictures as test set.
The pictures are randomly scaled to a size in the range [321, 481] and then cropped to 321 × 321 to form the input image set A of the network.
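For concreteness, a minimal Python (PIL) sketch of this preprocessing step follows; the assumption that the rescale targets the shorter side, and the function name, are illustrative rather than taken from the patent:

```python
import random
from PIL import Image

def preprocess(img: Image.Image) -> Image.Image:
    """Randomly rescale, then take a random 321x321 crop (step 101)."""
    # Rescale so the shorter side lands in [321, 481] (assumed interpretation
    # of "randomly scaled by [321, 481]").
    target = random.randint(321, 481)
    w, h = img.size
    scale = target / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    # Random 321x321 crop.
    w, h = img.size
    left = random.randint(0, w - 321)
    top = random.randint(0, h - 321)
    return img.crop((left, top, left + 321, top + 321))
```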
Step 102 specifically includes:
2.1 A VGG-16 model pre-trained on the ImageNet database is used as the feature extractor.
2.2 The last two pooling layers in the VGG-16 model structure are removed to improve the resolution of the feature map.
2.3 The input image set A is passed into the modified VGG-16 model for feature extraction.
2.4 The feature map Z of size C × H × W is obtained at the conv5_3 layer of the VGG-16 model, where C is the number of channels and H and W are the feature map height and width.
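A minimal PyTorch sketch of this modified backbone, assuming torchvision's layer layout for VGG-16 (indices 23 and 30 are pool4 and pool5); this is one plausible implementation, not the patent's own code:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained VGG-16 (older torchvision uses pretrained=True).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
layers = list(vgg.features.children())

# Keep everything up to relu5_3 (index 29) and drop pool4 (index 23), so the
# last two pooling layers are removed and the output is the conv5_3 response.
backbone = nn.Sequential(*[m for i, m in enumerate(layers[:30]) if i != 23])

x = torch.randn(1, 3, 321, 321)
z = backbone(x)        # first feature map Z
print(z.shape)         # torch.Size([1, 512, 40, 40]): C=512, H=W=40
```

With a 321 × 321 input, removing pool4 and pool5 leaves a 40 × 40, 512-channel conv5_3 feature map, i.e. 8× rather than 32× downsampling.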
Steps 103-106 specifically comprise:
3.1 The feature map obtained from the modified VGG-16 is fed into the dilated-convolution pixel relation model (DCPAM).
3.2 The DCPAM passes the feature map Z (the first feature map) through a dilated convolution unit and a standard convolution unit, respectively, to obtain the feature maps $D \in R^{C \times H \times W}$ (the third feature maps) and $S \in R^{C \times H \times W}$ (the second feature map).
3.3 S and D are reshaped to size $R^{C \times N}$, where N = H × W is the total number of positions in the S and D feature maps.
3.4 The first pixel relation measurement matrix $A \in R^{N \times N}$ between the reshaped feature maps S and D is obtained by matrix-multiplying them. Its entries are

$A_{i,j} = S_i^{\top} D_j$

where i and j are position indices in the reshaped feature maps, and $S_i$ and $D_j$ are the feature vectors of the reshaped feature maps S and D at positions i and j.
3.5 Each entry $A_{i,j}$ is normalized along j by a softmax layer:

$A_{i,j} \leftarrow \frac{\exp(A_{i,j})}{\sum_{j'=1}^{N} \exp(A_{i,j'})}$
3.6 The multiple first pixel relation measurement matrices between the feature map output by the standard convolution unit (the second feature map) and the feature maps output by convolution kernels with different dilation rates in the dilated convolution unit (the third feature maps) are averaged to obtain the second pixel relation measurement matrix

$A = \frac{1}{m} \sum_{d} A_d$

where d denotes the dilation rate, m the number of dilation rates, and $A_d$ the first pixel relation measurement matrix between the feature map output by the standard convolution unit and the feature map output by the convolution kernel with dilation rate d.
3.7 The reshaped feature map S generated by the standard convolution unit is strengthened with the second pixel relation measurement matrix A: a matrix multiplication is performed between S and A, the result is reshaped back to $R^{C \times H \times W}$, and an element-wise sum with S yields the enhanced feature map (fourth feature map) $E \in R^{C \times H \times W}$:

$E_u = \lambda \sum_{j=1}^{N} A_{u,j} S_j + S_u$

where u indexes the N positions and λ is initialized to 0 and gradually learned through training.
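Putting substeps 3.1-3.7 together, a minimal PyTorch sketch of the DCPAM module follows; the kernel size of 3 × 3 and the dilation rates (2, 4, 6) are assumptions, since the text does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCPAM(nn.Module):
    """Dilated-convolution pixel relation model (substeps 3.1-3.7)."""

    def __init__(self, channels: int, dilations=(2, 4, 6)):
        super().__init__()
        # Standard convolution unit -> second feature map S.
        self.standard = nn.Conv2d(channels, channels, 3, padding=1)
        # Dilated convolution unit: one kernel per dilation rate -> third feature maps D.
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)
        self.lam = nn.Parameter(torch.zeros(1))  # lambda, initialized to 0

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        b, c, h, w = z.shape
        n = h * w
        s = self.standard(z)
        s_flat = s.view(b, c, n)                 # reshape S to C x N
        a_sum = 0.0
        for conv in self.dilated:
            d_flat = conv(z).view(b, c, n)       # reshape each D to C x N
            a = torch.bmm(s_flat.transpose(1, 2), d_flat)  # A_{i,j} = S_i . D_j
            a_sum = a_sum + F.softmax(a, dim=-1)           # softmax normalization
        a_mean = a_sum / len(self.dilated)       # average fusion -> second matrix A
        # Strengthen S with A, reshape back, and add the residual connection.
        e = torch.bmm(s_flat, a_mean.transpose(1, 2)).view(b, c, h, w)
        return self.lam * e + s                  # fourth (enhanced) feature map E
```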
Step 107 specifically includes:
the obtained enhanced feature map (fourth feature map) is transferred into an average pooling layer, and the result of performing a global average pooling layer for channel C is
Figure BDA0002907252110000074
Finally, a tensor matrix R is obtainedC×H×W∈RC×1×1As an image representation.
Step 108 specifically includes:
The image-representation tensor matrix is substituted into a softmax classification layer for classification; the softmax output for class n is

$P_n = \frac{\exp(w_n^{\top} F_C)}{\sum_{k} \exp(w_k^{\top} F_C)}$

where $w_n$ denotes the weight parameter of $F_C$ for class n, $P_n$ denotes the image output probability of class n, and $F_C$ denotes the tensor matrix.
Step 109 specifically includes:
The cross-entropy loss function is computed from the image class labels and used to train the classification network; the loss function is

$L_{cls} = -\sum_{n} y_n \log P_n$

where $y_n$ is the image-level label of class n in the dataset. This loss function optimizes the classification network by stochastic gradient descent.
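Steps 107-109 amount to a standard global-average-pooling classification head. A minimal sketch, assuming a single-label cross-entropy as the reading of the patent's loss (class names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """GAP over E, then a linear layer whose weights w_n feed the softmax."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                      # step 107
        self.fc = nn.Linear(channels, num_classes, bias=False)  # weights w_n

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        f_c = self.gap(e).flatten(1)   # tensor matrix F_C
        return self.fc(f_c)            # logits w_n^T F_C; softmax lives in the loss

head = ClassificationHead(channels=512, num_classes=20)
logits = head(torch.randn(4, 512, 40, 40))
labels = torch.randint(0, 20, (4,))
loss = nn.CrossEntropyLoss()(logits, labels)  # softmax + cross-entropy (step 109)
loss.backward()                               # optimized by SGD during training
```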
Step 110 specifically includes:
7.1 The weight parameters $w_n$ between the GAP layer and the classification layer of the trained classification network are applied to the enhanced feature map E (the fourth feature map) to obtain the target location map:

$M_n(i, j) = \sum_{c=1}^{C} w_{n,c} E_c(i, j)$

where $M_n(i, j)$ denotes the target location map belonging to class n.
7.2 The location maps of the lowest-probability classes in the classification network highlight regions unrelated to the target; the first X classes with the lowest probability are therefore taken as the background target and their location maps are merged through the balance fusion function B(x). (The fusion equation and the definition of B(x) appear only as images in the source and are not recoverable here.)
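A minimal sketch of substeps 7.1-7.2, computing class activation maps from the trained classifier weights; the plain averaging used for the background fusion stands in for the unrecoverable B(x) and is an assumption:

```python
import torch

def target_location_maps(e: torch.Tensor, fc_weight: torch.Tensor) -> torch.Tensor:
    # e: (B, C, H, W) enhanced features; fc_weight: (num_classes, C) = w_n.
    # M_n(i, j) = sum_c w_{n,c} * E_c(i, j)
    return torch.einsum("nc,bchw->bnhw", fc_weight, e)

def background_map(maps: torch.Tensor, probs: torch.Tensor, x: int = 3) -> torch.Tensor:
    # Fuse the location maps of the x lowest-probability classes.
    # Plain averaging stands in for the patent's balance fusion function B(x).
    idx = probs.argsort(dim=1)[:, :x]                      # x lowest classes per image
    picked = torch.stack([maps[b, idx[b]] for b in range(maps.size(0))])
    return picked.mean(dim=1)

e = torch.randn(2, 512, 40, 40)
w = torch.randn(20, 512)                  # trained weights between GAP and classifier
m = target_location_maps(e, w)            # (2, 20, 40, 40) target location maps
bg = background_map(m, torch.rand(2, 20)) # (2, 40, 40) background target map
```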
Steps 111-112 train the segmentation network using the target location map as its pseudo labels; the hyper-parameter settings for training the segmentation network follow.
The method specifically comprises the following steps:
8.1 DeepLab-ASPP is adopted as a semantic segmentation network model.
8.2 The pixels with the top 20% highest values in the target location map from substep 7.1 are taken as the foreground target.
8.3 The pixels with the top 30% highest values in the background target map from substep 7.2 are taken as the background target; p is set to 3 and q to the number of dataset categories minus p.
8.4 All unassigned and conflicting pixels are ignored during training.
8.5 The PASCAL VOC 2012 dataset is used as the training data of the segmentation network, defined as G; any training image satisfies g ∈ G.
8.6 The label set is defined as $N = n_{fg} \cup n_{bg}$, where $n_{fg}$ are the foreground labels and $n_{bg}$ the background labels.
8.7 The segmentation network model is defined as f(g; θ), where θ is an optimizable parameter. $f_{u,c}(g; \theta)$ denotes the conditional probability, modeled by the segmentation model, of label c at position u of the class confidence map.
8.8 The balanced seed loss function is defined as:

$L_{seed} = -\frac{1}{\sum_{c \in n_{fg}} |H_c|} \sum_{c \in n_{fg}} \sum_{u \in H_c} \log f_{u,c}(g;\theta) - \frac{1}{\sum_{c \in n_{bg}} |H_c|} \sum_{c \in n_{bg}} \sum_{u \in H_c} \log f_{u,c}(g;\theta)$

where $H_c$ denotes the pixel-level segmentation pseudo label generated from the target location map $M_n(i, j)$ and |·| denotes the number of pixels.
8.9 The auxiliary seed loss function is defined as:

$L_{seg} = -\frac{1}{\sum_{c \in N} |\hat{H}_c|} \sum_{c \in N} \sum_{u \in \hat{H}_c} \log f_{u,c}(g;\theta)$

where $\hat{H}_c$ denotes the target location labels predicted on-line for the image by the segmentation model.
8.10 The boundary constraint loss function is defined through a fully connected Conditional Random Field (CRF):

$L_{boundary} = \frac{1}{H \times W} \sum_{u} \sum_{c \in N} R_{u,c}(g, f(g;\theta)) \log \frac{R_{u,c}(g, f(g;\theta))}{f_{u,c}(g;\theta)}$

where $R_{u,c}(g, f(g;\theta))$ is the output probability map of the fully connected CRF.
8.11 The loss function of the final model is defined as $L = L_{seed} + L_{seg} + L_{boundary}$.
8.12 The mini-batch is set to 10 images, momentum to 0.9 and weight decay to 0.0005. The initial learning rate is 5e-3 and is divided by 10 every 2000 iterations; training is terminated after 10000 iterations.
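As one reading of substeps 8.8-8.12, a minimal sketch of the balanced seed loss; the seed tensor layout, the ignore value -1 and the foreground/background index split are assumptions, and L_seg and L_boundary (the latter via a fully connected CRF, e.g. the pydensecrf package) would be added analogously:

```python
import torch
import torch.nn.functional as F

def balanced_seed_loss(logits: torch.Tensor, seeds: torch.Tensor,
                       num_fg: int) -> torch.Tensor:
    # logits: (B, K, H, W) segmentation scores f_{u,c}; seeds: (B, H, W) pseudo
    # labels H_c, with unassigned/conflicting pixels marked -1 (substep 8.4).
    # Classes 0..num_fg-1 are assumed foreground, the rest background.
    log_p = F.log_softmax(logits, dim=1)
    picked = log_p.gather(1, seeds.clamp(min=0).unsqueeze(1)).squeeze(1)
    loss = logits.new_zeros(())
    for classes in (range(num_fg), range(num_fg, logits.size(1))):
        mask = torch.zeros_like(seeds, dtype=torch.bool)
        for c in classes:
            mask |= seeds == c
        if mask.any():
            # Average separately over foreground and background seed pixels.
            loss = loss - picked[mask].mean()
    return loss

logits = torch.randn(2, 21, 41, 41, requires_grad=True)  # 20 fg classes + bg
seeds = torch.randint(-1, 21, (2, 41, 41))
balanced_seed_loss(logits, seeds, num_fg=20).backward()
```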
After the step 112, the trained semantic segmentation network model is obtained, and then the image to be detected can be directly input into the trained semantic segmentation network model to obtain the image pixel level pseudo label.
In addition, the invention also discloses a semantic segmentation-based image pixel level pseudo label determination system. As shown in fig. 4, the system comprises:
the preprocessing module 201 is configured to acquire an initial image and preprocess the initial image to obtain a first image.
A feature extraction module 202, configured to perform feature extraction on the first image by using a feature extractor to obtain a first feature map.
A first input module 203, configured to input the first feature map into the dilated-convolution pixel relation model to obtain a second feature map and multiple third feature maps.
And a first matrix product operation module 204, configured to perform matrix product operation on the second feature map and each third feature map respectively, so as to obtain a plurality of first pixel relationship measurement matrices correspondingly.
A matrix fusion module 205, configured to perform average fusion on the plurality of first pixel relationship measurement matrices to obtain a second pixel relationship measurement matrix.
And a second matrix product operation module 206, configured to perform matrix product operation on the second feature map and the second pixel relation measurement matrix to obtain a fourth feature map.
A second input module 207, configured to input the fourth feature map into a global average pooling layer, so as to obtain a tensor matrix.
And the classification module 208 is configured to input the tensor matrix into a softmax classification layer for classification, so as to obtain a functional relationship between the tensor matrix and the image output probability.
And the first network training module 209 is configured to train a classification network according to the loss function corresponding to the functional relationship, so as to obtain a trained classification network.
And a target location map and background target map determining module 210, configured to obtain a target location map and a background target map according to the trained classification network and the fourth feature map.
And the model obtaining module 211 is configured to obtain a semantic segmentation network model.
And the second network training module 212 is configured to train the semantic segmentation network model according to the first image, the target location map and the background target map, so as to obtain a trained semantic segmentation network model.
And a pseudo label determining module 213, configured to input the image to be detected into the trained semantic segmentation network model to obtain an image pixel-level pseudo label.
The invention also discloses the following technical effects:
1. By combining the advantages of dilated convolution and the attention mechanism, the method can effectively enlarge the highlighted target region, enhance the generation of class-related target regions while suppressing class-unrelated regions, and obtain higher-quality semantic segmentation pseudo labels, further improving the segmentation capability of the segmentation network.
2. According to the invention, a deep neural classification network is designed to provide high-quality pseudo labels for training a segmentation network, which then performs image semantic segmentation. A dilated-convolution pixel relation network is proposed: by combining dilated convolution with an attention mechanism in the classification network, a pixel relation model between the dilated-convolution feature maps and the standard-convolution feature map is generated, and the class activation maps produced by the classification network highlight a more complete target region, thereby generating higher-quality pseudo labels for the segmentation network and improving its segmentation capability.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A semantic segmentation-based image pixel level pseudo label determination method is characterized by comprising the following steps:
acquiring an initial image and preprocessing the initial image to obtain a first image;
performing feature extraction on the first image by using a feature extractor to obtain a first feature map;
inputting the first feature map into a dilated-convolution pixel relation model to obtain a second feature map and a plurality of third feature maps;
wherein inputting the first feature map into the dilated-convolution pixel relation model to obtain the second feature map and the plurality of third feature maps specifically comprises: passing the first feature map through a dilated convolution unit and a standard convolution unit, respectively, to obtain the third feature maps and the second feature map;
performing a matrix product between the second feature map and each third feature map to correspondingly obtain a plurality of first pixel relation measurement matrices;
averaging the plurality of first pixel relation measurement matrices to obtain a second pixel relation measurement matrix;
performing a matrix product between the second feature map and the second pixel relation measurement matrix to obtain a fourth feature map;
inputting the fourth feature map into a global average pooling layer to obtain a tensor matrix;
inputting the tensor matrix into a softmax classification layer for classification to obtain a functional relation between the tensor matrix and the image output probability;
the function relation between the tensor matrix and the image output probability is
Figure FDA0003286605860000011
Wherein the content of the first and second substances,
Figure FDA0003286605860000012
indicating that for classes n, FCWeight parameter of PnRepresenting the probability of image output of class n, FCA tensor matrix is represented, C represents the first eigen mapThe number of channels of (a);
training a classification network according to the loss function corresponding to the function relationship to obtain a trained classification network;
obtaining a target location map and a background target map according to the trained classification network and the fourth feature map;
the obtaining of the target position map and the background target map according to the trained classification network and the fourth feature map specifically includes:
weighting parameters between the GAP layer and the classification layer in the trained classification network
Figure FDA0003286605860000013
Transmitting the fourth feature map into a classification network, and operating to obtain a target position map; the target position map is
Figure FDA0003286605860000014
Wherein M isn(i, j) represents a target location map belonging to category n; i and j represent feature map location indices;
according to the regions unrelated to the target that are highlighted by the location maps of the lowest-probability classes in the classification network, taking the first X classes with the lowest probability and merging their location maps through the balance fusion function B(x) to form the background target map (the fusion equation and the definition of B(x) appear only as images in the source and are not recoverable here);
obtaining a semantic segmentation network model; the semantic segmentation network model is a DeepLab-ASPP network model;
training the semantic segmentation network model according to the first image, the target location map and the background target map to obtain a trained semantic segmentation network model;
the training of the semantic segmentation network model according to the first image, the target location map and the background target map to obtain the trained semantic segmentation network model specifically comprises the following steps:
selecting the pixels with the top 20% highest values in the target location map as the foreground target;
selecting the pixels with the top 30% highest values in the background target map as the background target, and setting p = 3 and q = the number of dataset categories minus p;
ignoring all unassigned and conflicting pixels;
adopting the PASCAL VOC 2012 data set as the training data of the semantic segmentation network model, defining the training data as G, where any training image satisfies g ∈ G;
defining the label set as $N = n_{fg} \cup n_{bg}$, where $n_{fg}$ are foreground labels and $n_{bg}$ background labels;
defining the semantic segmentation network model as f(g; θ), where θ is an optimizable parameter and $f_{u,c}(g; \theta)$ denotes the conditional probability, modeled by the segmentation model, of label c at position u of the class confidence map;
defining a balanced seed loss function:
$L_{seed} = -\frac{1}{\sum_{c \in n_{fg}} |H_c|} \sum_{c \in n_{fg}} \sum_{u \in H_c} \log f_{u,c}(g;\theta) - \frac{1}{\sum_{c \in n_{bg}} |H_c|} \sum_{c \in n_{bg}} \sum_{u \in H_c} \log f_{u,c}(g;\theta)$

where $H_c$ denotes the pixel-level segmentation pseudo label generated from the target location map $M_n(i, j)$, and |·| denotes the number of pixels;
defining a helper seed loss function:
$L_{seg} = -\frac{1}{\sum_{c \in N} |\hat{H}_c|} \sum_{c \in N} \sum_{u \in \hat{H}_c} \log f_{u,c}(g;\theta)$

where $\hat{H}_c$ denotes the target location labels predicted on-line for the image by the segmentation model;
the boundary constraint loss function is defined by the conditional random field:
$L_{boundary} = \frac{1}{H \times W} \sum_{u} \sum_{c \in N} R_{u,c}(g, f(g;\theta)) \log \frac{R_{u,c}(g, f(g;\theta))}{f_{u,c}(g;\theta)}$

where $R_{u,c}(g, f(g;\theta))$ is the output probability map of the fully connected CRF;
the loss function of the final model is defined as: l ═ Lseed+Lseg+Lboundary
setting the mini-batch to 10 images, momentum to 0.9 and weight decay to 0.0005, with an initial learning rate of 5e-3 that is divided by 10 every 2000 iterations, training being terminated after 10000 iterations;
and inputting the image to be detected into the trained semantic segmentation network model to obtain the image pixel level pseudo label.
2. The semantic segmentation based image pixel level pseudo label determination method according to claim 1, characterized in that the initial image is randomly scaled to a size in the range [321, 481] and the picture is then cropped to 321 × 321 to obtain the first image.
3. The image pixel-level pseudo tag determination method based on semantic segmentation according to claim 1, wherein the feature extractor is an improved VGG-16 network model which removes the last two pooling layers in the VGG-16 model structure.
4. The semantic segmentation based image pixel level pseudo label determination method according to claim 1, characterized in that the initial image employs the PASCAL VOC 2012 data set.
5. The image pixel-level pseudo tag determination method based on semantic segmentation according to claim 1, wherein after the step of inputting the first feature map into the dilated-convolution pixel relation model to obtain a second feature map and a plurality of third feature maps, and before the step of performing a matrix product between the second feature map and each of the third feature maps to correspondingly obtain a plurality of first pixel relation measurement matrices, the method further comprises:
and performing size reshaping on the second feature map and the plurality of third feature maps.
6. The image pixel-level pseudo tag determination method based on semantic segmentation according to claim 3, wherein the first feature map is obtained through a conv5_3 layer of a modified VGG-16 network model.
7. The image pixel-level pseudo label determination method based on semantic segmentation according to claim 1 or 6, wherein the first feature map size is C × H × W, where C is the number of channels of the first feature map and H and W are the feature map height and width.
8. An image pixel level pseudo label determination system based on semantic segmentation, comprising:
the system comprises a preprocessing module, a first image acquisition module, a second image acquisition module and a second image acquisition module, wherein the preprocessing module is used for acquiring an initial image and preprocessing the initial image to obtain a first image;
the characteristic extraction module is used for extracting the characteristics of the first image by using a characteristic extractor to obtain a first characteristic diagram;
the first input module is used for inputting the first feature map into the dilated-convolution pixel relation model to obtain a second feature map and a plurality of third feature maps;
the first input module is specifically configured to pass the first feature map through a dilated convolution unit and a standard convolution unit, respectively, to obtain the third feature maps and the second feature map;
the first matrix product operation module is used for performing a matrix product between the second feature map and each third feature map to correspondingly obtain a plurality of first pixel relation measurement matrices;
the matrix fusion module is used for averaging the plurality of first pixel relation measurement matrices to obtain a second pixel relation measurement matrix;
the second matrix product operation module is used for performing a matrix product between the second feature map and the second pixel relation measurement matrix to obtain a fourth feature map;
the second input module is used for inputting the fourth feature map into a global average pooling layer to obtain a tensor matrix;
the classification module is used for inputting the tensor matrix into a softmax classification layer for classification to obtain a functional relation between the tensor matrix and the image output probability; the functional relation between the tensor matrix and the image output probability is

$P_n = \frac{\exp(w_n^{\top} F_C)}{\sum_{k} \exp(w_k^{\top} F_C)}$

where $w_n$ denotes the weight parameter of $F_C$ for class n, $P_n$ denotes the image output probability of class n, $F_C$ denotes the tensor matrix, and C denotes the number of channels of the first feature map;
the first network training module is used for training a classification network according to the loss function corresponding to the function relation to obtain the trained classification network;
the target position graph and background target graph determining module is used for obtaining a target position graph and a background target graph according to the trained classification network and the fourth feature graph;
the target position map and background target map determining module specifically comprises:
weighting parameters between the GAP layer and the classification layer in the trained classification network
Figure FDA0003286605860000043
Transmitting the fourth feature map into a classification network, and operating to obtain a target position map; the target position map is
Figure FDA0003286605860000044
Wherein M isn(i, j) represents a target location map belonging to category n; i and j represent feature map location indices;
according to the region which is highlighted by the position map of the class with the lowest probability in the classification network and is irrelevant to the target, adopting the front X classes with the lowest probability as background target maps:
Figure FDA0003286605860000051
wherein b (x) is the equilibrium fusion function:
Figure FDA0003286605860000052
the model acquisition module is used for acquiring a semantic segmentation network model; the semantic segmentation network model is a DeepLab-ASPP network model;
the second network training module is used for training the semantic segmentation network model according to the first image, the target position graph and the background target graph to obtain a trained semantic segmentation network model;
the second network training module specifically includes:
selecting the pixels with the top 20% highest values in the target location map as the foreground target;
selecting the pixels with the top 30% highest values in the background target map as the background target, and setting p = 3 and q = the number of dataset categories minus p;
ignoring all unassigned and conflicting pixels;
adopting the PASCAL VOC 2012 data set as the training data of the semantic segmentation network model, defining the training data as G, where any training image satisfies g ∈ G;
defining the label set as $N = n_{fg} \cup n_{bg}$, where $n_{fg}$ are foreground labels and $n_{bg}$ background labels;
defining the semantic segmentation network model as f(g; θ), where θ is an optimizable parameter and $f_{u,c}(g; \theta)$ denotes the conditional probability, modeled by the segmentation model, of label c at position u of the class confidence map;
defining a balanced seed loss function:
$L_{seed} = -\frac{1}{\sum_{c \in n_{fg}} |H_c|} \sum_{c \in n_{fg}} \sum_{u \in H_c} \log f_{u,c}(g;\theta) - \frac{1}{\sum_{c \in n_{bg}} |H_c|} \sum_{c \in n_{bg}} \sum_{u \in H_c} \log f_{u,c}(g;\theta)$

where $H_c$ denotes the pixel-level segmentation pseudo label generated from the target location map $M_n(i, j)$, and |·| denotes the number of pixels;
defining a helper seed loss function:
$L_{seg} = -\frac{1}{\sum_{c \in N} |\hat{H}_c|} \sum_{c \in N} \sum_{u \in \hat{H}_c} \log f_{u,c}(g;\theta)$

where $\hat{H}_c$ denotes the target location labels predicted on-line for the image by the segmentation model;
the boundary constraint loss function is defined by the conditional random field:
$L_{boundary} = \frac{1}{H \times W} \sum_{u} \sum_{c \in N} R_{u,c}(g, f(g;\theta)) \log \frac{R_{u,c}(g, f(g;\theta))}{f_{u,c}(g;\theta)}$

where $R_{u,c}(g, f(g;\theta))$ is the output probability map of the fully connected CRF;
the loss function of the final model is defined as $L = L_{seed} + L_{seg} + L_{boundary}$;
setting the mini-batch to 10 images, momentum to 0.9 and weight decay to 0.0005, with an initial learning rate of 5e-3 that is divided by 10 every 2000 iterations, training being terminated after 10000 iterations;
and the pseudo label determining module is used for inputting the image to be detected into the trained semantic segmentation network model to obtain the image pixel level pseudo label.
CN202110074943.9A 2021-01-20 2021-01-20 Image pixel level pseudo label determination method and system based on semantic segmentation Expired - Fee Related CN112801104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110074943.9A CN112801104B (en) 2021-01-20 2021-01-20 Image pixel level pseudo label determination method and system based on semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110074943.9A CN112801104B (en) 2021-01-20 2021-01-20 Image pixel level pseudo label determination method and system based on semantic segmentation

Publications (2)

Publication Number Publication Date
CN112801104A CN112801104A (en) 2021-05-14
CN112801104B true CN112801104B (en) 2022-01-07

Family

ID=75810732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110074943.9A Expired - Fee Related CN112801104B (en) 2021-01-20 2021-01-20 Image pixel level pseudo label determination method and system based on semantic segmentation

Country Status (1)

Country Link
CN (1) CN112801104B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223037B (en) * 2021-05-31 2023-04-07 南开大学 Unsupervised semantic segmentation method and unsupervised semantic segmentation system for large-scale data
CN114693967B (en) * 2022-03-20 2023-10-31 电子科技大学 Multi-classification semantic segmentation method based on classification tensor enhancement
CN116664845B (en) * 2023-07-28 2023-10-13 山东建筑大学 Intelligent engineering image segmentation method and system based on inter-block contrast attention mechanism


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN109509192A (en) * 2018-10-18 2019-03-22 天津大学 Merge the semantic segmentation network in Analysis On Multi-scale Features space and semantic space
CN110136141A (en) * 2019-04-24 2019-08-16 佛山科学技术学院 A kind of image, semantic dividing method and device towards complex environment
WO2020220126A1 (en) * 2019-04-30 2020-11-05 Modiface Inc. Image processing using a convolutional neural network to track a plurality of objects
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
WO2020243826A1 (en) * 2019-06-04 2020-12-10 University Of Manitoba Computer-implemented method of analyzing an image to segment article of interest therein
CN110428428A (en) * 2019-07-26 2019-11-08 长沙理工大学 A kind of image, semantic dividing method, electronic equipment and readable storage medium storing program for executing
CN110544258A (en) * 2019-08-30 2019-12-06 北京海益同展信息科技有限公司 Image segmentation method and device, electronic equipment and storage medium
CN111680695A (en) * 2020-06-08 2020-09-18 河南工业大学 Semantic segmentation method based on reverse attention model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Image Semantic Segmentation Based on Deep Feature Fusion" (《基于深度特征融合的图像语义分割》); Zhou Pengcheng, Gong Shengrong, Zhong Shan, Bao Zongming, Dai Xinghua; Computer Science (《计算机科学》); 2020-02-15; vol. 47, no. 2; pp. 126-134 *

Also Published As

Publication number Publication date
CN112801104A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112801104B (en) Image pixel level pseudo label determination method and system based on semantic segmentation
Zhang et al. Deep gated attention networks for large-scale street-level scene segmentation
Kim et al. Fully deep blind image quality predictor
CN109345508B (en) Bone age evaluation method based on two-stage neural network
CN110706302B (en) System and method for synthesizing images by text
Mnih Machine learning for aerial image labeling
WO2019089578A1 (en) Font identification from imagery
CN111738344B (en) Rapid target detection method based on multi-scale fusion
CN111369581A (en) Image processing method, device, equipment and storage medium
KR20230004710A (en) Processing of images using self-attention based neural networks
CN113111716B (en) Remote sensing image semiautomatic labeling method and device based on deep learning
CN112906794A (en) Target detection method, device, storage medium and terminal
CN111461043A (en) Video significance detection method based on deep network
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN116363489A (en) Copy-paste tampered image data detection method, device, computer and computer-readable storage medium
CN115222998A (en) Image classification method
CN114511785A (en) Remote sensing image cloud detection method and system based on bottleneck attention module
CN109447897B (en) Real scene image synthesis method and system
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network
CN110866552A (en) Hyperspectral image classification method based on full convolution space propagation network
US20190156182A1 (en) Data inference apparatus, data inference method and non-transitory computer readable medium
CN112529081B (en) Real-time semantic segmentation method based on efficient attention calibration
Xu et al. Steganography algorithms recognition based on match image and deep features verification
CN114596466A (en) Multi-modal image missing completion classification method based on tensor network model
CN113095328A (en) Self-training-based semantic segmentation method guided by Gini index

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220107