CN114463335A - Weak supervision semantic segmentation method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114463335A
Authority
CN
China
Prior art keywords
training
label
semantic segmentation
branch
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111602397.8A
Other languages
Chinese (zh)
Inventor
张兆翔
李靖
樊峻菘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111602397.8A priority Critical patent/CN114463335A/en
Publication of CN114463335A publication Critical patent/CN114463335A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image

Abstract

The embodiments of the present application disclose a weakly supervised semantic segmentation method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a picture to be recognized and inputting it into a semantic segmentation model to obtain a semantic segmentation result. The semantic segmentation model is obtained by training a basic semantic segmentation model on training pseudo labels; the training pseudo labels are obtained by recognizing training pictures with a dual-branch model; the dual-branch model is obtained through iterative training based on a first training label and a second training label; the first training label is an initial label generated from the class activation map (CAM); the second training label is an online label output by the dual-branch model. By training the dual-branch model in an iteratively optimized manner, the method predicts higher-quality object boundaries and segmentation results, from which high-quality pseudo labels are finally generated for training the basic semantic segmentation model, so that a high-precision semantic segmentation model is obtained.

Description

Weak supervision semantic segmentation method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of computer vision, in particular to a weak supervision semantic segmentation method and device, electronic equipment and a storage medium.
Background
Semantic segmentation is an important and classical computer vision task with wide applications in image editing, scene analysis, and other areas. Although significant progress has been made in semantic segmentation based on deep neural networks, these methods rely heavily on time-consuming and labor-intensive pixel-level segmentation labels.
In order to reduce the cost of image labeling, weakly supervised semantic segmentation methods based on image-level category labels have been widely studied. Most existing methods train a classification network with the category labels, obtain the position and shape information of foreground objects from the class activation map (CAM) of the last convolutional layer of the classification network, generate initial labels (seed labels), train a standard semantic segmentation model with these initial labels, and finally predict the semantic segmentation result of the picture to be recognized with the trained segmentation model. However, the foreground regions of the activation map are usually only locally highlighted, so the initial labels generally mark only part of each foreground object; the resulting low foreground recall limits the performance of the segmentation model.
Some recent works train a boundary detection model with the initial labels (seed labels) to extract foreground object boundaries (contours), and propagate foreground category scores under the contour constraint, making the highlighted foreground regions in the activation map more complete. However, the object boundary maps (contour maps) predicted by these boundary detection models contain many false positives (edges inside objects), which block the foreground category score propagation; the highlighted foreground regions in the modified activation map therefore remain incomplete, and the recall of the initial labels is still low.
Disclosure of Invention
Because the existing methods have the above problems, the embodiments of the present application provide a weakly supervised semantic segmentation method and apparatus, an electronic device, and a storage medium, focusing on the problem that the recall of the initial labels is low.
Specifically, the embodiment of the present application provides the following technical solutions:
in a first aspect, an embodiment of the present application provides a weak supervised semantic segmentation method, including:
acquiring a picture to be recognized, and inputting the picture to be recognized into a semantic segmentation model to obtain a semantic segmentation result of the picture to be recognized;
the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features.
Optionally, the CAM is obtained by performing feature recognition on the picture by using a classification network model; the classification network model is obtained after training based on the image category labels.
Optionally, the training pseudo tag is obtained by identifying a picture by a dual-branch model, and includes:
and obtaining the training pseudo label according to a semantic segmentation prediction result obtained by identifying the picture by the semantic segmentation branch and an object boundary result obtained by identifying the picture by the object boundary detection branch.
Optionally, the dual-branch model is obtained after iterative training based on a first training label and a second training label, and includes:
processing the CAM and generating a first training label offline; under the constraint of an object boundary graph generated by the object boundary detection branch, a foreground category score in an initial segmentation probability graph generated by the semantic segmentation branch is propagated in a foreground category score propagation mode to obtain a modified segmentation probability graph, and a second training label is generated based on the modified segmentation probability graph;
according to the first training label and the second training label, supervising and training the object boundary detection branch and the semantic segmentation branch in the double-branch model;
processing the initial segmentation probability map with a dense conditional random field (denseCRF) to obtain a background reference label, and correcting the second training label according to the background reference label to obtain a corrected second training label; and supervising the training of the object boundary detection branch in the double-branch model according to the first training label and the corrected second training label.
Optionally, the obtaining of the training pseudo label according to the semantic segmentation prediction result obtained by recognizing the picture with the semantic segmentation branch and the object boundary result obtained by recognizing the picture with the object boundary detection branch includes:
after the picture is subjected to multi-scale scaling and horizontal turning, inputting the trained semantic segmentation branch to obtain a semantic segmentation prediction result, and inputting the trained object boundary detection branch to obtain an object boundary result;
and generating the training pseudo label according to the semantic segmentation prediction result and the object boundary result.
In a second aspect, an embodiment of the present application provides a weak supervised semantic segmentation apparatus, including:
the processing module is used for acquiring a picture to be recognized and inputting the picture to be recognized into a semantic segmentation model to obtain a semantic segmentation result of the picture to be recognized;
the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features.
Optionally, the CAM is obtained by performing feature recognition on the picture by using a classification network model; the classification network model is obtained after training based on the image category labels.
Optionally, the processing module is specifically configured to:
and obtaining the training pseudo label according to a semantic segmentation prediction result obtained by identifying the picture by the semantic segmentation branch and an object boundary result obtained by identifying the picture by the object boundary detection branch.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the weak supervised semantic segmentation method according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the weak supervised semantic segmentation method according to the first aspect is implemented.
According to the technical scheme, the image to be recognized is input into a semantic segmentation model, and a semantic segmentation result of the image to be recognized is obtained; the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features. Therefore, the two branches of the double-branch model are subjected to iterative optimization through the online labels, the foreground category scores in the segmentation results are propagated to the periphery under the constraint of the object boundary in forward propagation, the second training label is generated, the label predicts a more complete and accurate foreground region, and the two branches of the double-branch model are well optimized in backward propagation. 
Compared with the existing scheme of supervising the object boundary branch using only the initial label (the first training label), the method and apparatus can effectively suppress false positives (boundaries inside objects) in the object boundary map, which helps the foreground category scores propagate from salient to non-salient regions. The embodiments of the present application optimize the segmentation result of the segmentation submodel by score propagation to generate the training pseudo labels; compared with traditional CAM-based methods, the generated training pseudo labels are more accurate, so a basic semantic segmentation model with higher performance can be trained and the accuracy of the semantic segmentation result is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart illustrating steps of a weakly supervised semantic segmentation method according to an embodiment of the present application;
FIG. 2 is a second flowchart illustrating steps of a weakly supervised semantic segmentation method according to an embodiment of the present application;
FIG. 3 is a block diagram of an iteratively trained two-branch model provided by an embodiment of the present application;
fig. 4 is a schematic diagram of a network structure of a dual-branch model according to an embodiment of the present disclosure;
fig. 5 is a second schematic diagram of a network structure of a dual-branch model according to an embodiment of the present application;
fig. 6 is a third schematic diagram of a network structure of a dual-branch model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a weakly supervised semantic segmentation apparatus provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Fig. 1 and fig. 2 are flowcharts of the weakly supervised semantic segmentation method provided by the embodiments of the present application, fig. 3 is a framework diagram of the iteratively trained dual-branch model, and fig. 4 to fig. 6 are schematic diagrams of the network structure of the dual-branch model. The weakly supervised semantic segmentation method provided by the embodiments of the present application is explained in detail below with reference to fig. 1 to fig. 6. As shown in fig. 1, the method includes:
step 101: acquiring a picture to be recognized, and inputting the picture to be recognized into a semantic segmentation model to obtain a semantic segmentation result of the picture to be recognized;
the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features.
In this step, it should be noted that a classification network model first needs to be trained with the picture category labels. As shown in fig. 3, the backbone of the classification network model is initialized with ImageNet pre-trained weights; the fully connected layer used for classification has no bias term, and its weights are randomly initialized. During training, input pictures are randomly augmented before being fed to the network, which is optimized with SGD.
In this step, after the classification network model is trained, a training picture is input into it and the un-pooled feature map F of the last convolutional layer is taken. The weights of the fully connected layer are reshaped into 1×1 convolution kernels, F is convolved with them, and the result is passed through a ReLU activation function, yielding the class activation map (CAM).
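The CAM computation described above (1×1 convolution with the bias-free classifier weights, followed by ReLU) can be sketched as follows. This is an illustrative NumPy sketch, not part of the patent; the function name `compute_cam` and the toy shapes are assumptions:

```python
import numpy as np

def compute_cam(feature_map, fc_weights):
    """Compute class activation maps from the last conv feature map.

    feature_map: (C, H, W) un-pooled output of the last convolutional layer.
    fc_weights:  (K, C) weights of the bias-free fully connected layer.
    Returns:     (K, H, W) CAM, one map per class, after ReLU.
    """
    # A 1x1 convolution with the FC weights is a per-pixel matrix product.
    cam = np.einsum('kc,chw->khw', fc_weights, feature_map)
    return np.maximum(cam, 0.0)  # ReLU

# Toy example: 4 feature channels, 3 classes, 5x5 spatial grid.
F = np.random.rand(4, 5, 5)
W = np.random.rand(3, 4)
cam = compute_cam(F, W)
```

In a real pipeline the CAM would then be upsampled to the input size, as the next step describes.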
In this step, after the CAM is obtained, it is upsampled to the size of the input picture, and the channels of categories not present in the picture are set to 0. A constant background channel is then appended to the CAM: setting the background channel to τ1 gives CAM1, and setting it to τ2 (τ2 < τ1) gives CAM2. CAM1 and CAM2 are each passed through an argmax function to obtain two initial labels, which are refined with the pydensecrf package to obtain Yfg and Ybg. Pixels labeled background in Yfg but foreground in Ybg are relabeled as uncertain (value 255); modifying Yfg in this way yields the first training label Yinit (Yinit[i] denotes the i-th pixel of Yinit, and similarly for Yfg[i] and Ybg[i]):

Yinit[i] = 255 if Yfg[i] is background and Ybg[i] is foreground; otherwise Yinit[i] = Yfg[i].
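The two-threshold seed-label construction can be sketched as follows. This NumPy sketch follows the rule stated above but omits the denseCRF refinement step; the function names are hypothetical:

```python
import numpy as np

UNCERTAIN = 255  # label value for pixels left unsupervised

def seed_label(cam, tau):
    """Append a constant background channel tau to the (K, H, W) CAM and
    take the per-pixel argmax: 0 = background, 1..K = foreground class."""
    k, h, w = cam.shape
    bg = np.full((1, h, w), tau)
    return np.argmax(np.concatenate([bg, cam]), axis=0)

def initial_label(cam, tau_fg, tau_bg):
    """Two-threshold seed label (tau_bg < tau_fg).

    The strict threshold tau_fg yields confident foreground (Yfg); the
    loose threshold tau_bg yields confident background (Ybg).  Pixels
    that Yfg calls background but Ybg calls foreground are ambiguous
    and are relabeled UNCERTAIN (255)."""
    y_fg = seed_label(cam, tau_fg)
    y_bg = seed_label(cam, tau_bg)
    y_init = y_fg.copy()
    y_init[(y_fg == 0) & (y_bg > 0)] = UNCERTAIN
    return y_init
```

With one foreground class and thresholds 0.7 / 0.3, a pixel whose CAM score lies between the two thresholds ends up uncertain, which matches the intent of the rule above.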
In this step, as shown in fig. 4, the dual-branch model uses ResNet50 or ResNet101 as the backbone. The stride of ResNet stages 4 and 5 is changed from 2 to 1, so that stage 5 outputs a feature map Fs8 with an overall stride of 8, and the dilation of the convolutional layers in stages 4 and 5 is adjusted so that the receptive field at each location of Fs8 is as large as at the corresponding location of the original ResNet. The dual-branch model is composed of a semantic segmentation branch (segmentation submodel) and an object boundary detection branch (object boundary submodel). As shown in fig. 5, the segmentation submodel is constructed by adding a seg head after stage 5; the seg head adopts an ASPP module consisting of four different 3×3 convolutions. Fs8 is fed through the four convolutions, the results are summed, upsampled 2× in the spatial dimensions, and finally passed through softmax to obtain the semantic segmentation probability map M. As shown in fig. 6, the object boundary submodel is constructed by reducing the output features of stages 1 to 5 to 32 channels each through five edge_layers, concatenating the resulting five feature maps, and feeding them into edge_layer6 (a 1×1 convolution), which outputs a 1-channel object boundary map; a sigmoid function maps it to [0, 1], and the result is denoted B.
In this step, besides the initial label Yinit (the first training label), the segmentation result M and the object boundary B are also used during the training phase of the dual-branch model to generate an online label Yonline (the second training label). Training pictures are augmented by random scaling and cropping before being input into the dual-branch model, so only a rectangular region R of the input picture I contains content of the original picture; the other regions are zero-padded during augmentation. The zero-padded area of Yonline is set to 255 (uncertain label), and the labels of the rectangular region R in Yonline are obtained by score propagation using the valid region R' of M and B (R' corresponds to R). Yonline fuses the information of M and B, so its result is more accurate. On the other hand, a small number of highlighted background regions in M expand rapidly through score propagation, so the generated Yonline contains many false positive labels where these regions are predicted as foreground. Yonline therefore needs to be corrected by resetting these erroneous labels to background labels, which yields the corrected second label Yrefine.
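The patent describes propagating foreground category scores under the boundary constraint but does not spell out the update rule. The following is a minimal NumPy sketch assuming 4-connected max-propagation gated by (1 − boundary); the function name `propagate_scores` and the specific rule are assumptions:

```python
import numpy as np

def propagate_scores(m_fg, boundary, n_iter=10):
    """Spread foreground class scores to neighbouring pixels, attenuated
    by the boundary map so scores do not leak across object contours.

    m_fg:     (H, W) foreground probability from the segmentation branch.
    boundary: (H, W) boundary strength in [0, 1] from the boundary branch.
    """
    score = m_fg.copy()
    pass_through = 1.0 - boundary  # strong boundary -> little propagation
    for _ in range(n_iter):
        for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)):
            neighbour = np.roll(score, shift, axis=axis)
            # zero out the wrapped-around row/column introduced by roll
            if axis == 0:
                if shift == 1: neighbour[0, :] = 0
                else: neighbour[-1, :] = 0
            else:
                if shift == 1: neighbour[:, 0] = 0
                else: neighbour[:, -1] = 0
            score = np.maximum(score, neighbour * pass_through)
    return score
```

In this toy formulation a high boundary value at a pixel blocks incoming scores, which illustrates why false-positive internal boundaries (the problem noted in the Background) stall the propagation.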
In this step, after the first training label Yinit, the second training label Yonline, and the corrected second training label Yrefine are obtained, Yinit and Yonline are used with a cross-entropy loss to supervise the segmentation prediction result M of the dual-branch model. Yinit and Yrefine can be used to obtain a semantic affinity matrix between different pixels, and the object boundary prediction B of the dual-branch model is supervised indirectly by using this matrix to supervise the affinity matrix generated from B.
In this step, after the training of the dual-branch model is completed, the pictures in the training set are multi-scale scaled and horizontally flipped and then input into the trained dual-branch model; the segmentation submodel produces a semantic segmentation prediction result, the object boundary submodel produces an object boundary result, and a training pseudo label is then generated by score propagation from the two results.
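The test-time ensembling can be sketched as follows. For brevity this NumPy sketch averages over horizontal flips only, standing in for the full multi-scale + flip scheme; the function names and the uncertainty threshold are assumptions:

```python
import numpy as np

def tta_predict(model, image):
    """Average model predictions over horizontal flips (a simplified
    stand-in for the multi-scale + flip ensembling described above).

    model: callable mapping an (H, W, 3) image to an (H, W, K) score map.
    """
    pred = model(image)
    flipped = model(image[:, ::-1])[:, ::-1]  # flip input, un-flip output
    return (pred + flipped) / 2.0

def pseudo_label(avg_scores, uncertain_thresh=0.5):
    """Turn averaged scores into a pseudo label, marking low-confidence
    pixels as uncertain (255) so they are ignored during training."""
    label = np.argmax(avg_scores, axis=-1)
    label[avg_scores.max(axis=-1) < uncertain_thresh] = 255
    return label
```

The full pipeline would also rescale each prediction back to the original resolution before averaging; that step is omitted here.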
In this step, a basic semantic segmentation model (for example, DeepLab) is trained with the generated pseudo labels. After training is completed, the picture to be recognized is input into the semantic segmentation model to obtain its semantic segmentation result.
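Training on pseudo labels that contain uncertain pixels (255) requires a loss that skips those pixels. A minimal NumPy sketch of such a masked cross-entropy follows (the function name and interface are assumptions, not the patent's implementation):

```python
import numpy as np

def cross_entropy_ignore(probs, labels, ignore_index=255):
    """Mean pixel-wise cross-entropy that skips uncertain pixels.

    probs:  (H, W, K) predicted class probabilities (each pixel sums to 1).
    labels: (H, W) integer labels; ignore_index pixels contribute no loss.
    """
    valid = labels != ignore_index
    if not valid.any():
        return 0.0
    p = probs[valid, labels[valid]]          # probability of the true class
    return float(-np.log(np.clip(p, 1e-12, None)).mean())
```

Deep learning frameworks expose the same behaviour via an ignore-index option on their cross-entropy losses, which is the typical way uncertain pseudo-label pixels are excluded.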
According to the above technical solution, the picture to be recognized is input into a semantic segmentation model to obtain its semantic segmentation result. The semantic segmentation model is obtained by training a basic semantic segmentation model based on training pseudo labels; the training pseudo labels are obtained by recognizing pictures with a dual-branch model; the dual-branch model is obtained after iterative training based on a first training label and a second training label. The first training label is the initial label Yinit generated from the class activation map CAM, and contains the position and shape information of the foreground objects in the picture. The second training label Yonline is the online label output by the dual-branch model, generated from the prediction results of the semantic segmentation branch and the object boundary detection branch. The dual-branch model consists of the segmentation branch and the object boundary detection branch, which share one trunk branch for extracting features from the input picture. The two branches of the dual-branch model are thus iteratively optimized through the online label Yonline: in the forward pass, the foreground category scores in the segmentation result are propagated to their surroundings under the constraint of the object boundary, and the generated Yonline marks a more complete and accurate foreground region; in the backward pass, both branch submodels of the dual-branch model are optimized.
Compared with existing schemes that supervise the object boundary branch using only the initial label Yinit (the first training label), the embodiments of the present application can effectively suppress false positives (boundaries inside objects) in the object boundary map, which helps the foreground category scores propagate from salient to non-salient regions. The segmentation prediction result of the segmentation submodel is optimized by score propagation to generate the training pseudo labels; compared with traditional CAM-based methods, the generated pseudo labels are more accurate, so a basic semantic segmentation model with higher performance can be trained and the accuracy of the semantic segmentation result is improved.
Based on the content of the above embodiment, in this embodiment, the CAM is obtained by performing feature identification on the picture by using a classification network model; the classification network model is obtained after training based on the image category label; the picture category labels are provided by a training data set.
Based on the content of the foregoing embodiment, in this embodiment, the training pseudo tag is obtained by identifying a picture by a dual-branch model, and includes:
and obtaining a semantic segmentation prediction result after identifying the picture according to the semantic segmentation branch, obtaining an object boundary result after identifying the picture according to the object boundary detection branch, and obtaining the training pseudo label by using the two results.
In this embodiment, it should be noted that each picture in the training set is multi-scale scaled and horizontally flipped to generate several pictures, which are input into the trained dual-branch model to obtain several semantic segmentation prediction results and object boundary results. The averages of these results are taken, and the training pseudo label is generated from the averages in a manner similar to the generation of the second training label.
Based on the content of the foregoing embodiment, in this embodiment, the dual-branch model is obtained after performing iterative training based on the first training label and the second training label, and includes:
processing the CAM and generating a first training label offline; under the constraint of an object boundary graph generated by the object boundary detection branch, a foreground category score in an initial segmentation probability graph generated by the semantic segmentation branch is propagated in a foreground category score propagation mode to obtain a modified segmentation probability graph, and a second training label is generated based on the modified segmentation probability graph;
supervising training of the object boundary detection branches in the two-branch model according to the first training label and the second training label;
when the second training label is used for monitoring the object boundary detection branch, the second training label can be corrected to a certain extent, and the corrected second training label is used as a monitoring signal. Firstly, processing the initial segmentation probability map based on the dense conditional random field dense CRF to obtain a background reference label, and then correcting the second training label according to the background reference label to obtain a corrected second training label;
and finally, supervising and training the object boundary sub-model in the double-branch model according to the first training label and the corrected second training label.
In this embodiment, it should be noted that after the first training label is generated from the activation map (CAM), a second training label is obtained through network foreground category score propagation, and the segmentation sub-model in the dual-branch model is supervised and trained according to the first training label and the second training label. Furthermore, the second training label may be corrected in order to better supervise the object boundary branch. The reason is that a small number of highlighted background areas in the segmentation probability map expand rapidly through score propagation, so the generated second training label predicts these areas as foreground and contains many false positive labels. The second training label therefore needs to be corrected so that these false labels become background labels: dense CRF processing is performed on the segmentation result to obtain a reference label, the second training label is corrected according to the reference label to obtain a corrected second training label, and the object boundary sub-model in the dual-branch model is then supervised and trained according to the first training label and the corrected second training label. Therefore, in the embodiment of the application, during back-propagation of the dual-branch model, the initial label (the first training label) and the online label (the second training label) supervise the training of the semantic segmentation sub-model, while the initial label and the corrected second training label supervise the object boundary sub-model.
The initial label initializes and stabilizes the training process; the online label and the corrected second training label integrate information from both branch sub-models, so the two branch sub-models are iteratively optimized during training, and the corrected second training label avoids, to a certain extent, the adverse effects of low-quality object segmentation results. After training of the dual-branch network is completed, the segmentation prediction of the segmentation sub-model is refined using the object boundary information, and the training pseudo label is generated.
The present application will be specifically described below with reference to specific examples.
The first embodiment:
in this embodiment, a certain semantic segmentation database is taken as an example; the database contains 21 semantic categories in total and has 10582 training images with corresponding semantic segmentation labels. In this embodiment, only image category labels are used, which can be obtained by converting the semantic segmentation labels.
Fig. 2 is a flowchart of the present invention, and as shown in the drawing, the weak supervised semantic segmentation method provided in the embodiment of the present application specifically includes the following steps:
Step S0, train a classification network using the picture category labels, as shown in fig. 3. A classical model such as resnet50 may be adopted; the network backbone weights are initialized from the backbone of an ImageNet pre-trained model, and the fully connected layer used for classification has no bias and its weights are randomly initialized. During training, the input picture is randomly scaled (long edge within the range 320-640), randomly flipped horizontally, and pixel-value normalized (pixel values are first divided by 255 to fall in [0,1], then the RGB channels are normalized with means 0.485, 0.456, 0.406 and standard deviations 0.229, 0.224, 0.225, respectively), and then randomly cropped to 512 × 512, with the missing part zero-padded during cropping. The cropped pictures are input into the network for training and optimized with SGD (stochastic gradient descent); the backbone learning rate is 0.1, the learning rate of the fully connected classification layer is 1.0, the batch_size is 16, and 5 epochs are trained.
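The normalization described in step S0 can be sketched as follows (an illustrative Python sketch; the function name is assumed, and zero-padding to a fixed canvas stands in for the random scale/flip/crop augmentation, which is omitted):

```python
import numpy as np

def preprocess(img, out_size=512):
    """Normalize an HxWx3 uint8 RGB image as in step S0 and zero-pad
    it onto an out_size x out_size canvas (crop/flip/scale omitted)."""
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    x = img.astype(np.float32) / 255.0   # map pixel values to [0, 1]
    x = (x - mean) / std                 # per-channel normalization
    h, w = x.shape[:2]
    canvas = np.zeros((out_size, out_size, 3), dtype=np.float32)
    canvas[:min(h, out_size), :min(w, out_size)] = x[:out_size, :out_size]
    return canvas
```

In the real pipeline the crop is random and 512 × 512; here the picture is simply placed at the top-left corner, with the insufficient part filled with 0 as described above.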
Step S1, after training of the classification network is completed, the training picture is input into the network, the last convolution layer outputs a feature map F, the weights of the fully connected layer are taken as the weights of a 1 × 1 convolution kernel, F is convolved with them, and the result is input into a ReLU activation function. An activation map (CAM) of the same size as F with 20 channels (20 foreground object classes) is obtained, as shown in fig. 3.
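Step S1 can be sketched as follows (a minimal numpy sketch; the function name is illustrative — a 1 × 1 convolution with the FC weights reduces to a matrix product over the channel dimension):

```python
import numpy as np

def compute_cam(feature_map, fc_weight):
    """Reuse the classifier's bias-free FC weights as a 1x1 convolution
    over the last feature map, then apply ReLU (step S1).

    feature_map: (C, H, W) output of the last conv layer.
    fc_weight:   (num_classes, C) fully connected weights."""
    C, H, W = feature_map.shape
    # 1x1 convolution == matrix product over the channel dimension
    cam = fc_weight @ feature_map.reshape(C, H * W)  # (num_classes, H*W)
    cam = np.maximum(cam, 0.0)                       # ReLU
    return cam.reshape(-1, H, W)
```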
Step S2, the input image size is 512 × 512, and the activation map (CAM) is upsampled to 512 × 512. Channels corresponding to classes that do not appear in the input image are set to 0, and the values of the remaining channels are normalized to [0,1] (each channel is divided by its maximum over all positions). A background channel with a constant value of 0.3 is added before the first channel to obtain a new activation map CAM1. CAM1 is input into an argmax function, taking the maximum over the channel dimension to obtain a segmentation label, and dense conditional random field processing is applied to the segmentation label using the pydensecrf package to obtain the label Yfg (addPairwiseGaussian parameters sxy=3, compat=3; addPairwiseBilateral parameters sxy=50, srgb=5, compat=10; unary_from_labels parameters gt_prob=0.7, zero_unsure=False; inference run 10 times). Similarly, a background channel with value 0.05 is added before the first channel and the same subsequent operations are performed, yielding the label Ybg. For Yfg, pixels that are background in Yfg but foreground in Ybg are re-marked as uncertain pixels, giving Yinit (512 × 512), as shown below (Yinit[i] denotes the ith pixel of Yinit, and similarly for Yfg[i], Ybg[i]; 0 denotes the background category and 255 denotes an uncertain pixel):
Yinit[i] = 255, if Yfg[i] = 0 and Ybg[i] ≠ 0; otherwise Yinit[i] = Yfg[i]
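The merging rule above can be sketched as follows (an illustrative numpy sketch of the Yinit construction; the function name is assumed):

```python
import numpy as np

def merge_labels(y_fg, y_bg):
    """Build Y_init from the two dense-CRF labels: keep Y_fg, but where
    Y_fg says background (0) and Y_bg says foreground, the pixel is
    re-marked as uncertain (255)."""
    y_init = y_fg.copy()
    y_init[(y_fg == 0) & (y_bg != 0)] = 255
    return y_init
```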
Step S3, a dual-branch network backbone is constructed, as shown in fig. 4. The picture input to the dual-branch network is 512 × 512; resnet50 or resnet101 is selected as the backbone, and the stride of resnet stages 4 and 5 is changed from 2 to 1. Meanwhile, the dilation of the 3 × 3 convolutions from the 2nd layer to the last layer of stage 4 is set to 2 and that of the first 3 × 3 convolution to 1, and the dilation of the 3 × 3 convolutions from the 2nd layer to the last layer of stage 5 is set to 4 and that of the first 3 × 3 convolution to 2. Stage 5 thus finally outputs a feature map Fs8 with stride 8 (size 64 × 64), and the receptive field of Fs8 at each location is as large as that at the corresponding location of the original resnet network.
Step S4, the segmentation branch sub-model of the dual-branch network is constructed by adding a seg head after stage 5, adopting an ASPP module, as shown in fig. 5, composed of four 3 × 3 convolutions with bias, output channel 21, and dilations 6, 12, 18 and 24 respectively. Fs8 is input to the four convolutions and their results are summed, then 2× upsampling is performed and softmax is computed over the channel dimension to obtain the semantic segmentation result M, which has 21 channels (foreground + background) and size 128 × 128.
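The segmentation head of step S4 can be sketched as follows (an assumption-level PyTorch reimplementation, not the original code; the input channel count of Fs8 is an assumption and is left configurable):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsppHead(nn.Module):
    """ASPP-style seg head: four dilated 3x3 convolutions with bias,
    summed, 2x upsampled, softmaxed over the channel dimension."""
    def __init__(self, in_ch=2048, num_classes=21):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, num_classes, 3, padding=d, dilation=d, bias=True)
            for d in (6, 12, 18, 24))

    def forward(self, f_s8):                      # f_s8: (N, in_ch, 64, 64)
        out = sum(b(f_s8) for b in self.branches)
        out = F.interpolate(out, scale_factor=2, mode="bilinear",
                            align_corners=False)  # -> (N, 21, 128, 128)
        return torch.softmax(out, dim=1)          # segmentation result M
```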
Step S5, the object boundary branch sub-model of the dual-branch network is constructed. As shown in fig. 6, the output features of stages 1 through 5 are each reduced to 32 channels by five edge_layers, called edge_layer1, edge_layer2, ..., edge_layer5. Each edge_layer consists, in turn, of a 1 × 1 convolution, a group norm layer (with 4 groups), and a ReLU layer; edge_layer3, edge_layer4 and edge_layer5 perform 2× upsampling before the ReLU layer. The five resulting feature maps are concatenated and input into edge_layer6 (a 1 × 1 convolution), which outputs an object boundary map with 1 channel whose values are mapped to [0,1] by a sigmoid function, denoted B, of size 128 × 128.
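The boundary head of step S5 can be sketched as follows (an assumption-level PyTorch sketch; the per-stage channel counts and spatial sizes — stages 1-2 at 128 × 128 and stages 3-5 at 64 × 64 for a 512 × 512 input — are assumptions inferred from the text):

```python
import torch
import torch.nn as nn

class EdgeHead(nn.Module):
    """Five edge_layers (1x1 conv -> GroupNorm(4 groups) -> ReLU, with 2x
    upsampling before the ReLU for the three deepest stages), concatenated
    and fused by a final 1x1 conv into a 1-channel boundary map."""
    def __init__(self, stage_channels=(64, 256, 512, 1024, 2048)):
        super().__init__()
        self.reduce = nn.ModuleList()
        for i, c in enumerate(stage_channels):
            layers = [nn.Conv2d(c, 32, 1), nn.GroupNorm(4, 32)]
            if i >= 2:  # edge_layer3..5: upsample 2x before the ReLU
                layers.append(nn.Upsample(scale_factor=2, mode="bilinear",
                                          align_corners=False))
            layers.append(nn.ReLU(inplace=True))
            self.reduce.append(nn.Sequential(*layers))
        self.fuse = nn.Conv2d(5 * 32, 1, 1)  # edge_layer6

    def forward(self, feats):  # five stage features, aligned to 128x128
        maps = [layer(f) for layer, f in zip(self.reduce, feats)]
        b = self.fuse(torch.cat(maps, dim=1))
        return torch.sigmoid(b)  # boundary map B with values in [0, 1]
```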
Step S6, the segmentation result M and the object boundary B are used to generate the online label Yonline (the second training label, 512 × 512). Before a training picture is input into the model, random scaling and cropping augmentation is applied, so only a certain rectangular region R (h × w) of the input picture I contains content of the original picture, and the other regions are zero-padded. Region R corresponds to a valid region R′ (h/4 × w/4) in M and B. During the forward pass of the dual-branch model, the R′ regions of M and B, denoted M^R′ and B^R′, are selected, and the label of region R in Yonline, denoted Y^R_online, is generated through score propagation; the zero-padded area of Yonline is set to 255 (uncertain label).
The score propagation process is described below. To reduce computation and facilitate batch processing, M^R′ and B^R′ are resized to 64 × 64, denoted M64 and B64. First, a pixel-affinity sparse matrix A of size 4096 × 4096 is computed based on B64: for two pixels i, j in B64 whose distance does not exceed 3, the maximum boundary confidence β is taken over i, j and the pixels near their connecting line (the several pixels vertically closest to the i-j segment), and (1 − β)^10 is taken as the affinity of i and j, i.e. A_i,j = A_j,i = (1 − β)^10; if the distance between pixels m, n exceeds 3, A_m,n = A_n,m = 0. Pixel affinity is propagated by matrix multiplication: each column of A is normalized so that it sums to 1, and the normalized A is repeatedly multiplied with itself to obtain Â. Â is dense and describes the semantic affinity between long-distance pixels; when two pixels are far apart, their affinity cannot be computed accurately from the boundary confidences on their connecting line alone, so the long-distance pixel affinity is obtained by continued matrix multiplication.
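The affinity construction described above can be sketched as follows (a simplified, assumption-level numpy sketch: sampling along the connecting line stands in for "the pixels vertically closest to the line", and a brute-force dense matrix stands in for the sparse one):

```python
import numpy as np

def affinity_matrix(boundary, radius=3, power=10):
    """For pixel pairs at distance <= radius, take the maximum boundary
    confidence beta sampled along the segment between them and set
    A[i, j] = (1 - beta) ** power; farther pairs stay 0."""
    h, w = boundary.shape
    n = h * w
    A = np.zeros((n, n), dtype=np.float32)
    ys, xs = np.divmod(np.arange(n), w)  # pixel coordinates by flat index
    for i in range(n):
        for j in range(i + 1, n):
            if (ys[i] - ys[j]) ** 2 + (xs[i] - xs[j]) ** 2 > radius ** 2:
                continue
            # sample boundary confidence along the connecting line
            ts = np.linspace(0.0, 1.0, num=2 * radius)
            ly = np.round(ys[i] + ts * (ys[j] - ys[i])).astype(int)
            lx = np.round(xs[i] + ts * (xs[j] - xs[i])).astype(int)
            beta = boundary[ly, lx].max()
            A[i, j] = A[j, i] = (1.0 - beta) ** power
    return A
```

Column normalization and repeated multiplication of the result then yield the dense propagated matrix Â.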
After Â is obtained, the channels of M64 (the 64 × 64 segmentation result) corresponding to categories not contained in the input picture are set to 0, and the background channel is set to 0.25. For each category i contained in the input picture, the ith channel of M64 is flattened into a 1 × 4096 vector, its values are normalized to [0,1], and it is matrix-multiplied with Â to obtain a new vector V_i; V_i adjusted to 64 × 64 size, V_i^64×64, is the corrected ith channel. The corrected segmentation result M̂ is finally obtained and input into an argmax function, taking the maximum over the channel dimension to obtain the corresponding online label Y^R′_online. This is adjusted to the size of R (h × w) to obtain Y^R_online, and the zero-padded region is filled with 255 to obtain the complete Yonline. V_i is computed as follows (Vec(·) denotes vectorization, M64_i denotes the ith channel of M64, and label_I denotes the category labels of the input picture):
V_i = normalize_[0,1](Vec(M64_i)) · Â, for each category i ∈ label_I
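The channel correction above can be sketched as follows (an illustrative numpy sketch; the function name and the handling of the fixed background channel are assumptions based on the surrounding text):

```python
import numpy as np

def propagate_scores(seg, A_hat, present_classes, bg_score=0.25):
    """Zero the channels of absent classes, fix the background channel
    to bg_score, normalize each present channel to [0, 1] and multiply
    it, flattened, by the propagated affinity matrix A_hat."""
    k, h, w = seg.shape
    out = np.zeros_like(seg)
    out[0] = bg_score                         # background channel
    for c in present_classes:                 # foreground classes in the image
        v = seg[c].reshape(-1)                # Vec(M64_c), 1 x (h*w)
        m = v.max()
        if m > 0:
            v = v / m                         # normalize to [0, 1]
        out[c] = (v @ A_hat).reshape(h, w)    # propagate along affinities
    return out
```

Taking argmax over the channel dimension of the returned map then yields the online label for the valid region.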
The Yonline obtained above is used to supervise the segmentation branch; to better supervise the object boundary branch, some corrections are made to Yonline. In the 64 × 64 segmentation result, the channels corresponding to categories not contained in the input picture are set to 0 and the background channel is set to 0.05; dense CRF processing is then performed and the result is adjusted to the size of R (h × w) to obtain Y_crf (DenseCRF parameters: iter_max=10, pos_xy_std=1, pos_w=3, bi_xy_std=67, bi_rgb_std=3, bi_w=4). Since the background threshold used to generate Y_crf (0.05) is much smaller than the background threshold used to generate Y^R_online (0.25), the former has higher confidence in background regions, so the background regions of Y_crf are used to correct the labels of Y^R_online, yielding Y^R_refine, as shown below (Y^R_refine[i] denotes the ith pixel of Y^R_refine):
Y^R_refine[i] = 0, if Y_crf[i] = 0; otherwise Y^R_refine[i] = Y^R_online[i]
The zero-padded area is filled with 255 to obtain the complete Yrefine, and Yrefine and Yinit are used to supervise the object boundary branch.
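The correction step can be sketched as follows (a minimal numpy sketch; the function name is assumed, and 0 denotes the background label as above):

```python
import numpy as np

def refine_online_label(y_online, y_crf):
    """Where the dense-CRF reference label (generated with the much lower
    background threshold) says background, overwrite the online label
    with background; other pixels are kept unchanged."""
    y_refine = y_online.copy()
    y_refine[y_crf == 0] = 0
    return y_refine
```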
Step S7, the dual-branch network is trained, as shown in fig. 3. M is supervised with a Cross Entropy loss using Yinit and Yonline, while a semantic affinity matrix between different pixels is obtained from Yinit and Yrefine and used to indirectly supervise B.
Yinit or Yrefine is downsampled to the size of B (128 × 128). Only the semantic affinity between pixels with definite class labels is considered to supervise the boundary confidence at the relevant positions in B: for each pixel p, the class labels of all other pixels at a distance of no more than 10 are considered; pixels with the same class label as p form positive pairs with p, pixels with different class labels form negative pairs, and pixels with uncertain labels are ignored. For each pair, the maximum boundary confidence in B over the two pixels and the pixels near their connecting line is supervised: the label of this maximum is set to 0 for positive pairs and to 1 for negative pairs, supervised through a Binary Cross Entropy loss. The total loss function is (L_CE is the Cross Entropy loss, L_A is the inter-pixel affinity loss):
L = L_A(B, Y_refine) + L_A(B, Y_init) + L_CE(M, Y_online) + L_CE(M, Y_init)
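The two cross-entropy terms of the total loss can be sketched as follows (an assumption-level PyTorch sketch operating on pre-softmax scores; the affinity terms L_A on the boundary map are omitted for brevity, and the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def dual_supervision_ce(m_logits, y_online, y_init, ignore=255):
    """Sum of the cross-entropy losses against the online and initial
    labels; pixels labelled 255 are uncertain and excluded.

    m_logits: (N, C, H, W) pre-softmax segmentation scores.
    y_online, y_init: (N, H, W) integer label maps."""
    return (F.cross_entropy(m_logits, y_online, ignore_index=ignore)
            + F.cross_entropy(m_logits, y_init, ignore_index=ignore))
```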
When the dual-branch network is trained, the input picture is randomly scaled by a factor in [0.5,1.5], randomly flipped horizontally, its pixel values are normalized to [-1,1], and it is then randomly cropped to 512 × 512, with the missing part zero-padded during cropping. The cropped pictures are input into the network for training; the backbone learning rate is 0.0025, all edge_layer and seg head learning rates are 0.025, the batch_size is 10, and 19 epochs are trained.
Step S8, after model training is finished, each of the 10582 training pictures is horizontally flipped and scaled by factors of 1, 1.5 and 2 to obtain 6 pictures, which are input into the dual-branch network; the segmentation result and the object boundary take the averages of the 6 results, M_ave and B_ave. At this point all regions of M_ave and B_ave are valid. As in the generation of Yonline in step S6, an affinity matrix Â is generated based on B_ave (when generating the sparse affinity matrix A, pixel pairs whose distance does not exceed 5 are considered), and score propagation is performed on M_ave to obtain the training pseudo label.
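The test-time averaging of step S8 can be sketched as follows (an illustrative numpy sketch; `predict_fn` is an assumed callable taking an (H, W, 3) image and a scale factor and returning a (classes, h, w) score map at a fixed common resolution):

```python
import numpy as np

def averaged_prediction(predict_fn, image, scales=(1.0, 1.5, 2.0)):
    """Run each scale with and without horizontal flip (6 variants),
    undo the flip on the flipped predictions, and average the maps."""
    outs = []
    for s in scales:
        for flip in (False, True):
            img = image[:, ::-1] if flip else image    # horizontal flip
            p = predict_fn(img, s)
            outs.append(p[:, :, ::-1] if flip else p)  # flip prediction back
    return np.mean(outs, axis=0)
```

The same averaging is applied to the boundary output to obtain B_ave before generating the pseudo labels.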
Step S9, a basic semantic segmentation model (such as DeepLab) is trained using the generated pseudo labels; after training, the picture to be recognized is input into the semantic segmentation model to obtain the semantic segmentation result of the picture.
According to the technical scheme, a classification model is trained with image classification labels, an activation map (CAM) is used to obtain the first training label (initial label) of the training picture, and the first training label serves as a supervision signal to train a dual-branch model that predicts the object boundary and the semantic segmentation result. During training of the dual-branch model, a second training label (online label) is generated using the object boundary and the semantic segmentation prediction, the object boundary and semantic segmentation branches are supervised, and iterative optimization is performed. After model training is finished, a high-quality training pseudo label is generated using the object boundary and the semantic segmentation prediction, a standard semantic segmentation model is trained, and this model performs semantic segmentation on pictures. On the one hand, the segmentation result predicted by the network is more accurate than the activation map (CAM); on the other hand, iterative optimization reduces false positives in the object boundaries and facilitates foreground category score propagation, so the finally generated training pseudo labels mark more complete foreground regions, and the segmentation results of the basic semantic segmentation model trained on these pseudo labels are more accurate.
Based on the same inventive concept, another embodiment of the present invention provides a weakly supervised semantic segmentation apparatus, as shown in fig. 7, including:
the processing module 1 is used for acquiring a picture to be recognized and inputting the picture to be recognized into a semantic segmentation model to obtain a semantic segmentation result of the picture to be recognized;
the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features.
The weak supervised semantic segmentation apparatus described in this embodiment may be used to implement the above method embodiments, and the principle and technical effect are similar, which are not described herein again.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device, which refers to the schematic structural diagram of the electronic device shown in fig. 8, and specifically includes the following contents: a processor 801, a memory 802, a communication interface 803, and a communication bus 804;
the processor 801, the memory 802 and the communication interface 803 complete mutual communication through the communication bus 804; the communication interface 803 is used for realizing information transmission between devices;
the processor 801 is configured to call a computer program in the memory 802, and when the processor executes the computer program, the processor implements all the steps of one of the weak supervised semantic segmentation methods described above, for example: acquiring a picture to be recognized, and inputting the picture to be recognized into a semantic segmentation model to obtain a semantic segmentation result of the picture to be recognized; the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features.
Based on the same inventive concept, yet another embodiment of the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor implements all the steps of one of the above-mentioned weakly supervised semantic segmentation methods, such as: acquiring a picture to be recognized, and inputting the picture to be recognized into a semantic segmentation model to obtain a semantic segmentation result of the picture to be recognized; the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features. In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. 
Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the weak semantic segmentation method described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A weakly supervised semantic segmentation method is characterized by comprising the following steps:
acquiring a picture to be recognized, and inputting the picture to be recognized into a semantic segmentation model to obtain a semantic segmentation result of the picture to be recognized;
the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features.
2. The weak supervised semantic segmentation method according to claim 1, wherein the CAM is obtained by performing feature recognition on the picture by a classification network model; the classification network model is obtained after training based on the image category labels.
3. The weakly supervised semantic segmentation method according to claim 1, wherein the training pseudo labels are obtained by identifying pictures by a dual-branch model, and include:
and obtaining the training pseudo label according to a semantic segmentation prediction result obtained by identifying the picture by the semantic segmentation branch and an object boundary result obtained by identifying the picture by the object boundary detection branch.
4. The weak supervised semantic segmentation method according to claim 1, wherein the dual-branch model is obtained after iterative training based on a first training label and a second training label, and includes:
processing the CAM and generating a first training label offline; under the constraint of an object boundary graph generated by the object boundary detection branch, a foreground category score in an initial segmentation probability graph generated by the semantic segmentation branch is propagated in a foreground category score propagation mode to obtain a modified segmentation probability graph, and a second training label is generated based on the modified segmentation probability graph;
according to the first training label and the second training label, supervising and training the object boundary detection branch and the semantic segmentation branch in the double-branch model;
processing the initial segmentation probability map based on the dense conditional random field dense CRF to obtain a background reference label, and correcting the second training label according to the background reference label to obtain a corrected second training label; and monitoring and training the object boundary submodel in the double-branch model according to the first training label and the modified second training label.
5. The weak supervision semantic segmentation method according to claim 3 or 4, wherein obtaining the training pseudo label according to a semantic segmentation prediction result obtained by recognizing a picture according to the semantic segmentation branch and an object boundary result obtained by recognizing the picture according to the object boundary submodel comprises:
after the picture is subjected to multi-scale scaling and horizontal turning, inputting the trained semantic segmentation branch to obtain a semantic segmentation prediction result, and inputting the trained object boundary detection branch to obtain an object boundary result;
and generating the training pseudo label according to the semantic segmentation prediction result and the object boundary result.
6. A weakly supervised semantic segmentation apparatus, comprising:
the processing module is used for acquiring a picture to be recognized and inputting the picture to be recognized into a semantic segmentation model to obtain a semantic segmentation result of the picture to be recognized;
the semantic segmentation model is obtained by training a basic semantic segmentation model based on a training pseudo label; the training pseudo label is obtained by identifying the picture by a double-branch model; the double-branch model is obtained after iterative training is carried out on the basis of a first training label and a second training label; wherein the first training label is an initial label generated by a classification network activation map (CAM); the initial label comprises foreground object position and shape information of the picture; the second training label is an online label output by the double-branch model; the online label is generated based on a semantic segmentation branch prediction result and an object boundary detection branch prediction result; the double-branch model is composed of the semantic segmentation branch and the object boundary detection branch, and the semantic segmentation branch and the object boundary detection branch share one trunk branch for extracting picture features.
7. The weakly supervised semantic segmentation apparatus according to claim 6, wherein the CAM is obtained by performing feature recognition on the picture by a classification network model; the classification network model is obtained after training based on the image category labels.
8. The weakly supervised semantic segmentation apparatus according to claim 6, wherein the processing module is specifically configured to:
and obtaining the training pseudo label according to a semantic segmentation prediction result obtained by identifying the picture by the semantic segmentation branch and an object boundary result obtained by identifying the picture by the object boundary detection branch.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the weakly supervised semantic segmentation method of any of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the weakly supervised semantic segmentation method according to any one of claims 1 to 5.
CN202111602397.8A 2021-12-24 2021-12-24 Weak supervision semantic segmentation method and device, electronic equipment and storage medium Pending CN114463335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111602397.8A CN114463335A (en) 2021-12-24 2021-12-24 Weak supervision semantic segmentation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111602397.8A CN114463335A (en) 2021-12-24 2021-12-24 Weak supervision semantic segmentation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114463335A true CN114463335A (en) 2022-05-10

Family

ID=81408245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111602397.8A Pending CN114463335A (en) 2021-12-24 2021-12-24 Weak supervision semantic segmentation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114463335A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998595A (en) * 2022-07-18 2022-09-02 赛维森(广州)医疗科技服务有限公司 Weak supervision semantic segmentation method, semantic segmentation method and readable storage medium
CN114998595B (en) * 2022-07-18 2022-11-08 赛维森(广州)医疗科技服务有限公司 Weak supervision semantic segmentation method, semantic segmentation method and readable storage medium
CN115471662A (en) * 2022-11-03 2022-12-13 深圳比特微电子科技有限公司 Training method, recognition method, device and storage medium of semantic segmentation model

Similar Documents

Publication Publication Date Title
CN108470320B (en) Image stylization method and system based on CNN
WO2020238560A1 (en) Video target tracking method and apparatus, computer device and storage medium
US20220165045A1 (en) Object recognition method and apparatus
US11823443B2 (en) Segmenting objects by refining shape priors
CN109726627B (en) Neural network model training and universal ground wire detection method
EP4099220A1 (en) Processing apparatus, method and storage medium
CN111902825A (en) Polygonal object labeling system and method for training object labeling system
CN109960742B (en) Local information searching method and device
CN111612008A (en) Image segmentation method based on convolution network
CN112132156A (en) Multi-depth feature fusion image saliency target detection method and system
CN112308866B (en) Image processing method, device, electronic equipment and storage medium
CN114463335A (en) Weak supervision semantic segmentation method and device, electronic equipment and storage medium
US11163989B2 (en) Action localization in images and videos using relational features
CN113657560B (en) Weak supervision image semantic segmentation method and system based on node classification
CN113076871A (en) Fish shoal automatic detection method based on target shielding compensation
CN112927209B (en) CNN-based significance detection system and method
CN111523463B (en) Target tracking method and training method based on matching-regression network
CN111598087B (en) Irregular character recognition method, device, computer equipment and storage medium
CN111028923A (en) Digital pathological image dyeing normalization method, electronic device and storage medium
US20230153965A1 (en) Image processing method and related device
CN114861842B (en) Few-sample target detection method and device and electronic equipment
CN112861718A (en) Lightweight feature fusion crowd counting method and system
CN114897136A (en) Multi-scale attention mechanism method and module and image processing method and device
CN113569852A (en) Training method and device of semantic segmentation model, electronic equipment and storage medium
CN113393434A (en) RGB-D significance detection method based on asymmetric double-current network architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination