CN113159048A

CN113159048A - Weak supervision semantic segmentation method based on deep learning

Info

Publication number: CN113159048A
Application number: CN202110441665.6A
Authority: CN
Inventors: 颜成钢; 张二四; 高宇涵; 朱晨瑞; 孙垚棋; 张继勇; 李宗鹏; 张勇东
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-04-23
Filing date: 2021-04-23
Publication date: 2021-07-23

Abstract

The invention discloses a weakly supervised semantic segmentation method based on deep learning. First, a pre-trained ResNet50 is fine-tuned on an existing data set. The trained ResNet50 is then used to obtain the corresponding class activation maps, pseudo segmentation labels are obtained from them with a set threshold, and a fully connected conditional random field (Dense conditional random field, Dense CRF) is used to refine the labels. Finally, the segmentation network is trained with the refined pseudo labels. The method completes object classification and semantic segmentation using only image-level labels, which greatly reduces the manpower and material resources consumed by manual annotation. Compared with existing weakly supervised methods, the method is more efficient and yields more accurate localization.

Description

Weak supervision semantic segmentation method based on deep learning

Technical Field

The invention belongs to the field of image processing, relates to semantic segmentation of images, and particularly relates to a deep learning method for performing semantic segmentation on images by using image-level labels.

Background

Image segmentation is one of the basic and key techniques of image understanding. Conventional image segmentation methods mainly include threshold-based, region-based, edge-detection-based, wavelet-analysis and wavelet-transform-based, Markov-random-field-based, genetic-algorithm-based, and clustering-based methods. Owing to their inherent limitations, these methods perform poorly on complex images such as natural images. With the development of deep learning, convolutional neural networks have been increasingly applied to image segmentation, and segmentation accuracy has improved steadily from FCN, UNet, and DilatedNet to DeepLab and PSPNet. However, training a semantic segmentation model requires a label for every pixel in an image. Producing such annotations is very time-consuming and far more complex than annotating for image classification or object detection, so annotating masks to train a segmentation model usually consumes a large amount of manpower. To solve this problem, researchers have studied training semantic segmentation models with labels that are relatively easy to obtain, such as image-level labels, bounding boxes, scribbles, or points; this is weakly supervised semantic segmentation. Among these, image-level labels are the most widely used because they are the easiest to obtain and the least costly.

Since an image category label contains no position information, an additional method must be used to locate the target object in the image during segmentation. Class Activation Mapping (CAM) is one of the most common localization methods. In networks such as VGGNet and GoogLeNet, the features extracted by the last convolutional layer are globally average-pooled and fed into a fully connected layer for classification; CAM projects the class scores output by the fully connected layer back onto the last feature map of the convolutional neural network, thereby roughly locating the object to be segmented. Mainstream weakly supervised semantic segmentation methods based on image-level labels use CAM to locate the segmentation target, and SEC is a representative example. SEC proposes three principles (seed, expand, and constrain), uses CAMs to compute the contribution of local image regions to each class score in the final image classification, and roughly estimates the region in which each class of object appears. However, this localization often covers only the most salient part of the target region and is sometimes wrong altogether, so the resulting class activation map needs to be corrected.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: the localization region obtained from a conventional class activation map is often not accurate enough and needs to be corrected, so a model trained on segmentation labels derived from that map is not accurate enough either.

In view of this situation, the invention provides a weakly supervised semantic segmentation method based on deep learning. A unified framework integrating the classification network and the segmentation network is established, and a multi-scale inference method for class activation maps is proposed, which improves the accuracy of weakly supervised localization. In addition, a new UNet-based segmentation network, Mixed-UNet, is proposed, which improves semantic segmentation accuracy.

The method first fine-tunes a pre-trained ResNet50 on an existing data set, then uses the trained ResNet50 to obtain the corresponding class activation maps, obtains pseudo segmentation labels with a set threshold, and refines the labels with a fully connected conditional random field (Dense conditional random field, Dense CRF). Finally, the segmentation network is trained with the refined pseudo labels. The method specifically comprises the following steps:

Step (1): train the classification network.

ResNet50 pre-trained on the ImageNet data set is used as the classification framework, and the convolutions in its fourth and fifth convolutional blocks are replaced with dilated (atrous) convolutions with a dilation rate of 2. This yields a larger receptive field while keeping the spatial resolution unchanged, giving denser feature responses at the same computational cost. Raw data are first collected and labeled by professionals, and the labeled data set is divided into a training set, a validation set, and a test set. Depending on the actual situation, the data set has K categories, and c denotes any one of them. The classification framework is then fine-tuned on the training set with classification labels, which completes the training of the classification network.

Step (2): after the classification network has been trained, each training image is resampled at multiple scales by bilinear interpolation, with sampling rates of 0.5, 1.0, 1.5, and 2.0. The images at the four scales are input into the classification network to obtain class activation maps at four scales. These maps are then resampled to the size of the original input image, fused, and averaged to obtain the fused class activation map, i.e., the corrected class activation map.

Step (3): the corrected class activation maps are normalized, and a segmentation map for each class is obtained by threshold segmentation according to the actual number of classes.

Step (4): the training set is taken as input and the segmentation maps obtained in step (3) are taken as labels for semantic segmentation. The original input images and the segmentation maps obtained in step (3) are input together into a fully connected conditional random field (Dense conditional random field, Dense CRF) to refine the labels. A UNet-based semantic segmentation network (Mixed-UNet) is then trained with both the refined labels and the labels obtained in step (3). The overall framework of the semantic segmentation network fuses two UNets, which share one feature extractor in the feature extraction stage and split into two branches in the up-sampling stage. A Pyramid Pooling Module (PPM) is added to the first branch, and an attention mechanism is added to the second branch.

Step (5): in the training stage, the training set is taken as input, the loss function is computed on the output, the network parameters are adjusted by the back-propagation algorithm, and the model is validated on the validation set. The model that performs best on the validation set is taken as the final model. After training, the trained model is tested on the test set.
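The effect of the dilated convolution in step (1) can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 3 × 3 kernel and the padding rule are assumptions chosen to show that a dilation rate of 2 widens the area covered by a kernel from 3 to 5 pixels per side without changing the output size or the number of weights.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Side length of the area covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def output_size(input_size: int, kernel_size: int, dilation: int, stride: int = 1) -> int:
    """Output length of a convolution padded by d * (k - 1) // 2 on each side (assumed rule)."""
    padding = dilation * (kernel_size - 1) // 2
    span = effective_kernel_size(kernel_size, dilation)
    return (input_size + 2 * padding - span) // stride + 1


# A 3 x 3 kernel (assumed size) has 9 weights whatever the dilation rate is.
for dilation in (1, 2):
    print(dilation, effective_kernel_size(3, dilation), output_size(32, 3, dilation))
```

With stride 1 and this padding, both dilation rates keep a 32-pixel side at 32 pixels, which is the sense in which the resolution and the computation stay unchanged while the receptive field grows.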

The beneficial effects of the invention are as follows:

The method completes object classification and semantic segmentation using only image-level labels, which greatly reduces the manpower and material resources consumed by manual annotation. Compared with existing weakly supervised methods, the method is more efficient and yields more accurate localization.

Drawings

FIG. 1 is a flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of Mixed-UNet according to an embodiment of the present invention;

FIG. 3 is a diagram of a PPM module according to an embodiment of the present invention;

FIG. 4 is a diagram of an SE module according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail below with reference to the accompanying drawings and examples.

The invention provides a unified framework for image classification and semantic segmentation that completes both image classification and weakly supervised semantic segmentation using only image-level labels. The implementation flow is shown in FIG. 1. The weakly supervised semantic segmentation method based on deep learning comprises the following steps:

Step (1): train the classification network.

Raw data are first collected and labeled by professionals, and the labeled data set is divided into a training set, a validation set, and a test set. Depending on the actual situation, the data set has K categories, and c denotes any one of them. ResNet50 pre-trained on the ImageNet data set is used as the classification framework, and the convolutions in its fourth and fifth convolutional blocks are replaced with dilated convolutions with a dilation rate of 2. For the same feature map size, dilated convolution obtains a larger receptive field and therefore denser feature responses. The fully connected layer of ResNet50 is then removed, Global Average Pooling (GAP) is applied to the final feature map, and the pooled features are used as the input to a fully connected layer that produces the desired output (classification or otherwise). With this connection structure, the importance of image regions can be identified by projecting the weights of the output layer back onto the convolutional feature maps. The detailed calculation process is as follows:

For any given input image, let $f_k(x, y)$ denote the activation of unit $k$ in the last convolutional layer at spatial position $(x, y)$. For unit $k$, the result of global average pooling is $F^k = \sum_{x,y} f_k(x, y)$. Thus, for a given class $c$, the input to the softmax, $S_c$, is

$$S_c = \sum_k w_k^c F^k,$$

where $w_k^c$ is the response weight of the $k$-th unit for class $c$; it reflects the importance of $F^k$ for class $c$. Substituting $F^k$ into $S_c$ gives

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y} \sum_k w_k^c f_k(x, y).$$

Finally, the classification network is trained on the training set to fine-tune its parameters so that it performs well on the specific data set.
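The projection of the output-layer weights back onto the feature maps can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function names, array shapes, and random inputs are assumptions, and the class activation map is taken to be the per-position sum of the last-layer activations weighted by the class weights, as in the standard CAM formulation.

```python
import numpy as np


def class_activation_map(features: np.ndarray, fc_weights: np.ndarray, c: int) -> np.ndarray:
    """Weighted sum over channels of the last-layer activations.

    features:   (num_channels, H, W) activations f_k(x, y)
    fc_weights: (num_classes, num_channels) weights w_k^c
    Returns an (H, W) map whose entries are sum_k w_k^c * f_k(x, y).
    """
    return np.tensordot(fc_weights[c], features, axes=([0], [0]))


def class_score(features: np.ndarray, fc_weights: np.ndarray, c: int) -> float:
    """S_c = sum_k w_k^c * F^k, with F^k the spatial sum of channel k."""
    pooled = features.sum(axis=(1, 2))
    return float(fc_weights[c] @ pooled)


# Illustrative random inputs: 4 channels on a 3 x 3 grid, 2 classes.
rng = np.random.default_rng(0)
features = rng.random((4, 3, 3))
fc_weights = rng.random((2, 4))
cam = class_activation_map(features, fc_weights, c=1)
# Summing the map over all positions recovers the class score.
print(np.isclose(cam.sum(), class_score(features, fc_weights, c=1)))
```

The final check mirrors the exchange of the two sums in the derivation: the class score is the spatial sum of the map, so large map values mark the regions that contribute most to class c.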

Step (2): after the classification network has been trained in step (1), each image in the training set is sampled four times with sampling rates of 0.5, 1.0, 1.5, and 2.0, and the resulting images at four scales are input into the classification network ResNet50 to obtain the class activation map corresponding to each input image. The four class activation maps at different scales are then resampled to the size of the original input image, fused, and averaged. The specific process is as follows:

For the $i$-th input image, let $M_{i,j}^c$ denote the class activation map of class $c$ at scale $j$, resampled to the original input size, where $j \in \{0.5, 1.0, 1.5, 2.0\}$. The class activation map after fusing the four scales is

$$M_i^c = \frac{1}{4} \sum_{j} M_{i,j}^c.$$

On the one hand, multi-scale fused inference improves localization accuracy and avoids target localization errors; on the other hand, fusing the inference results of different scales compensates to some extent for the loss of detail in single-scale inference.
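The multi-scale fusion of step (2) can be sketched as follows. This is a sketch under stated assumptions: `cam_fn` is a hypothetical stand-in for the trained classifier's class activation map for one class, images are treated as single-channel 2-D arrays for brevity, and the hand-written bilinear resize uses an align-corners sampling grid, which the source does not specify.

```python
import numpy as np


def bilinear_resize(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a 2-D array by bilinear interpolation (align-corners grid, an assumption)."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def fuse_multiscale_cams(image, cam_fn, scales=(0.5, 1.0, 1.5, 2.0)):
    """Average the class activation maps computed at several input scales."""
    h, w = image.shape
    resized_cams = []
    for s in scales:
        scaled = bilinear_resize(image, max(1, round(h * s)), max(1, round(w * s)))
        cam = cam_fn(scaled)  # hypothetical: CAM of the classifier at this scale
        resized_cams.append(bilinear_resize(cam, h, w))  # back to the input size
    return np.mean(resized_cams, axis=0)


# Stand-in cam_fn that returns its input, used only to exercise the code path.
fused = fuse_multiscale_cams(np.full((8, 8), 3.0), cam_fn=lambda x: x)
print(fused.shape)
```

In a real pipeline `cam_fn` would run the fine-tuned classifier and return a map at the feature-map resolution; the resize back to the input size handles that case as well.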

Step (3): normalize the fused class activation map. Let $x_c$ be the value at any position in the class activation map of class $c$; the following normalization formula maps the values of the class activation map to numbers in $[0, 1]$.

After normalization, the class activation map is thresholded: positions whose value is greater than or equal to the threshold are set to foreground, and positions whose value is smaller than the threshold are set to background. Let $f(x, y)$ be the value of the class activation map at spatial position $(x, y)$; the segmentation is

$$g(x, y) = \begin{cases} 1, & f(x, y) \ge T \\ 0, & f(x, y) < T \end{cases}$$

where $T$ is the set threshold.
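Step (3) can be sketched as below. The normalization formula is not reproduced in this text, so min-max scaling is used here purely as an assumption that satisfies the stated [0, 1] range; the threshold value 0.5 and the function names are likewise illustrative, since the source only says that T is a set threshold.

```python
import numpy as np


def min_max_normalize(cam: np.ndarray) -> np.ndarray:
    """Map the values of a class activation map to [0, 1] (assumed min-max scaling)."""
    lo, hi = cam.min(), cam.max()
    span = hi - lo
    if span == 0:
        # A constant map carries no localization signal; return all zeros.
        return np.zeros_like(cam, dtype=float)
    return (cam - lo) / span


def threshold_segment(cam: np.ndarray, threshold: float) -> np.ndarray:
    """Foreground (1) where the value is >= threshold, background (0) elsewhere."""
    return (cam >= threshold).astype(np.uint8)


cam = np.array([[0.0, 2.0],
                [4.0, 8.0]])
normalized = min_max_normalize(cam)
pseudo_label = threshold_segment(normalized, threshold=0.5)  # 0.5 is illustrative
print(pseudo_label)
```

The comparison uses "greater than or equal", matching the rule that values equal to the threshold count as foreground.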

Step (4): the segmentation maps obtained in step (3) are used as labels for training the semantic segmentation network. The semantic segmentation network, Mixed-UNet, is shown in FIG. 2. Its overall framework fuses two UNets, which share one feature extractor in the feature extraction stage and split into two branches in the up-sampling stage. A Pyramid Pooling Module (PPM) is added to the first branch, and an attention mechanism (SE module) is added to the second branch.

FIG. 3 is a diagram of the PPM module according to an embodiment of the present invention. The pyramid pooling module reduces the loss of context information between different sub-regions by fusing features at 4 pyramid scales. The first row is the coarsest level, where global pooling produces a single-bin output; the remaining three rows produce pooled features at different scales. To preserve the weight of the global feature, if the pyramid has N levels, a 1 × 1 convolution after each level reduces the number of channels to 1/N of the original. The pooled features are then up-sampled by bilinear interpolation to the size of the feature map before pooling and finally concatenated.

As shown in FIG. 4, a Squeeze-and-Excitation (SE) block is added to the second branch. The aim is to improve the representational power of the network by modeling the dependencies between channels and recalibrating the features channel by channel, so that the network can use global information to selectively enhance informative features and suppress useless ones. The first step is the squeeze operation, which takes the global spatial feature of each channel in the feature map as the representation of that channel, forming a channel descriptor. The second step is the excitation operation, which learns how strongly the segmentation network depends on each channel and adjusts the weight of each channel in the feature map accordingly; the adjusted feature map is the output of the SE block. The multi-scale information extracted by the first branch and the attention information extracted by the second branch are concatenated and convolved to obtain the final segmentation result.
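The squeeze and excitation operations of the second branch can be illustrated with a small NumPy sketch. The source does not give the internal layers of the SE block, so the two fully connected layers with ReLU and sigmoid, the omission of biases, and the reduction to C // 4 hidden units are assumptions borrowed from the commonly used SE design; all names are hypothetical.

```python
import numpy as np


def se_block(features: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Channel-wise recalibration of a (C, H, W) feature map.

    w1: (C_reduced, C) weights of the first fully connected layer (assumed)
    w2: (C, C_reduced) weights of the second fully connected layer (assumed)
    """
    # Squeeze: global average over space gives one descriptor per channel.
    descriptor = features.mean(axis=(1, 2))
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = np.maximum(w1 @ descriptor, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Scale: each channel is multiplied by its learned gate.
    return features * gates[:, None, None]


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))   # 8 channels on a 4 x 4 grid (illustrative)
w1 = rng.normal(size=(2, 8))            # reduction to 8 // 4 = 2 hidden units
w2 = rng.normal(size=(8, 2))
recalibrated = se_block(features, w1, w2)
print(recalibrated.shape)
```

The output has the same shape as the input, so the block can be dropped into the up-sampling branch without changing the surrounding layers.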

The labels are refined with a Dense CRF. Specifically, the training images and the labels are input into the Dense CRF together, and the Dense CRF model refines the labels according to the pixel correlations in the training images to obtain finer output. The loss function of the semantic segmentation network has two parts: Loss1, the loss between the segmentation result and the label obtained by threshold segmentation in step (3), and Loss2, the loss between the segmentation result and the label refined by the Dense CRF. The total loss is Loss = Loss1 + Loss2. Each part takes the form of a weighted cross-entropy loss, formulated as follows:

$$L(p, \hat{p}) = -\left(\beta \, p \log \hat{p} + (1 - p) \log(1 - \hat{p})\right)$$

where $p$ denotes the ground-truth label and $\hat{p}$ denotes the predicted label. $\beta$ is a manually set hyperparameter in the range 0 to 1 by which all positive samples are weighted, which effectively mitigates class imbalance.
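A minimal sketch of the two-part loss follows, assuming the common weighted cross-entropy form in which β multiplies the positive-sample term and the per-pixel losses are averaged. The function names, the clipping constant, and the averaging are assumptions for illustration, not details taken from the source.

```python
import numpy as np


def weighted_cross_entropy(label, prediction, beta, eps=1e-7):
    """Mean of -(beta * p * log(p_hat) + (1 - p) * log(1 - p_hat)) over all pixels."""
    p = np.asarray(label, dtype=float)
    p_hat = np.clip(np.asarray(prediction, dtype=float), eps, 1 - eps)  # avoid log(0)
    per_pixel = -(beta * p * np.log(p_hat) + (1 - p) * np.log(1 - p_hat))
    return float(per_pixel.mean())


def total_loss(prediction, threshold_label, crf_label, beta):
    """Loss = Loss1 (threshold labels) + Loss2 (Dense CRF refined labels)."""
    loss1 = weighted_cross_entropy(threshold_label, prediction, beta)
    loss2 = weighted_cross_entropy(crf_label, prediction, beta)
    return loss1 + loss2


prediction = np.array([0.9, 0.2, 0.6, 0.1])   # illustrative foreground probabilities
threshold_label = np.array([1, 0, 0, 0])       # pseudo label from thresholding
crf_label = np.array([1, 0, 1, 0])             # pseudo label after refinement
print(total_loss(prediction, threshold_label, crf_label, beta=0.5))
```

Training against both label sets lets pixels on which the two agree dominate the gradient, while pixels where they disagree contribute opposing terms.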

Step (5): during training, the model is validated on the validation set every 100 iterations, and the validated model is saved. After training, the model that performs best on the validation set is selected as the final model. In the testing stage, the test-set images are input to obtain the classification and segmentation results for each image.
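The validate-and-select loop of step (5) can be sketched in plain Python. The callables `train_step` and `validate` are hypothetical placeholders for one optimization step and one validation pass; "higher score is better" is an assumption, since the source does not name the validation metric.

```python
def train_with_validation(train_step, validate, num_iterations, eval_every=100):
    """Run training, validate every `eval_every` iterations, and track the best model.

    train_step(iteration) -> model state after that iteration (hypothetical callable)
    validate(state)       -> validation score, higher is better (assumption)
    """
    best_score = float("-inf")
    best_iteration = None
    checkpoints = {}
    for iteration in range(1, num_iterations + 1):
        state = train_step(iteration)
        if iteration % eval_every == 0:
            score = validate(state)
            checkpoints[iteration] = (score, state)  # save every validated model
            if score > best_score:
                best_score, best_iteration = score, iteration
    return best_iteration, best_score, checkpoints


# Toy stand-ins: the "model state" is the iteration count and the score peaks at 300.
best_iteration, best_score, checkpoints = train_with_validation(
    train_step=lambda iteration: iteration,
    validate=lambda state: -abs(state - 300),
    num_iterations=500,
)
print(best_iteration, best_score, len(checkpoints))
```

Keeping every validated checkpoint matches the text; in practice one might store only the best one to save disk space.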

Claims

1. A weakly supervised semantic segmentation method based on deep learning, characterized by comprising the following steps:

step (1), training a classification network;

using ResNet50 pre-trained on the ImageNet data set as the classification framework, and replacing the convolutions in its fourth and fifth convolutional blocks with dilated convolutions with a dilation rate of 2; first collecting raw data and having it labeled by professionals; dividing the labeled data set into a training set, a validation set, and a test set; depending on the actual situation, the data set has K categories, and c denotes any one of them; then fine-tuning the classification framework on the training set with classification labels to complete the training of the classification network;

step (2), after the classification network has been trained, resampling each training image at multiple scales by bilinear interpolation, with sampling rates of 0.5, 1.0, 1.5, and 2.0; inputting the images at the four scales into the classification network to obtain class activation maps at four scales, then resampling these maps to the size of the original input image, fusing them, and averaging them to obtain the fused class activation map, i.e., the corrected class activation map;

step (3), normalizing the corrected class activation maps, and obtaining a segmentation map for each class by threshold segmentation according to the actual number of classes;

step (4), taking the training set as input and the segmentation maps obtained in step (3) as labels for semantic segmentation, and inputting the original input images and the segmentation maps obtained in step (3) together into a fully connected conditional random field (Dense conditional random field, Dense CRF) to refine the labels; training a UNet-based semantic segmentation network (Mixed-UNet) with both the refined labels and the labels obtained in step (3), wherein the overall framework of the semantic segmentation network fuses two UNets, which share one feature extractor in the feature extraction stage and split into two branches in the up-sampling stage; a pyramid pooling module is added to the first branch, and an attention mechanism is added to the second branch;

step (5), in the training stage, taking the training set as input, computing the loss function on the output, adjusting the network parameters by the back-propagation algorithm, and validating the model on the validation set; taking the model that performs best on the validation set as the final model; and, after training, testing the trained model on the test set.

2. The weakly supervised semantic segmentation method based on deep learning according to claim 1, wherein step (1) is specifically as follows:

first collecting raw data and having it labeled by professionals; dividing the labeled data set into a training set, a validation set, and a test set; depending on the actual situation, the data set has K categories, and c denotes any one of them; using ResNet50 pre-trained on the ImageNet data set as the classification framework, and replacing the convolutions in the fourth and fifth convolutional blocks of ResNet50 with dilated convolutions with a dilation rate of 2; then removing the fully connected layer of ResNet50, applying global average pooling to the final feature map, and using the pooled features as the input to a fully connected layer that produces the required output; the detailed calculation process is as follows:

for any given input image, $f_k(x, y)$ denotes the activation of unit $k$ in the last convolutional layer at spatial position $(x, y)$; for unit $k$, the result of global average pooling is $F^k = \sum_{x,y} f_k(x, y)$; thus, for a given class $c$, the input to the softmax, $S_c$, is

$$S_c = \sum_k w_k^c F^k,$$

where $w_k^c$ is the response weight of the $k$-th unit for class $c$ and reflects the importance of $F^k$ for class $c$; substituting $F^k$ into $S_c$ gives:

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y} \sum_k w_k^c f_k(x, y).$$

3. The weakly supervised semantic segmentation method based on deep learning according to claim 2, wherein step (2) is specifically as follows:

after the classification network has been trained, sampling each image in the training set four times with sampling rates of 0.5, 1.0, 1.5, and 2.0, and inputting the resulting images at four scales into the classification network ResNet50 to obtain the class activation map corresponding to each input image; then resampling the four class activation maps at different scales to the size of the original input image, fusing them, and averaging them; the specific process is as follows:

for the $i$-th input image, $M_{i,j}^c$ denotes the class activation map of class $c$ at scale $j$, resampled to the original input size, where $j \in \{0.5, 1.0, 1.5, 2.0\}$; the class activation map after fusing the four scales is:

$$M_i^c = \frac{1}{4} \sum_{j} M_{i,j}^c.$$

4. The weakly supervised semantic segmentation method based on deep learning according to claim 3, wherein step (3) is specifically as follows:

normalizing the fused class activation map; letting $x_c$ be the value at any position in the class activation map of class $c$, the following normalization formula maps the values of the class activation map to numbers in $[0, 1]$;

after normalization, thresholding the class activation map, wherein positions whose value is greater than or equal to the threshold are set to foreground and positions whose value is smaller than the threshold are set to background; letting $f(x, y)$ be the value of the class activation map at spatial position $(x, y)$, the segmentation is:

$$g(x, y) = \begin{cases} 1, & f(x, y) \ge T \\ 0, & f(x, y) < T \end{cases}$$

where $T$ is the set threshold.

5. The weakly supervised semantic segmentation method based on deep learning according to claim 4, wherein step (4) is specifically as follows:

using the segmentation maps obtained in step (3) as labels for training the semantic segmentation network; the overall framework of the semantic segmentation network fuses two UNets, which share one feature extractor in the feature extraction stage and split into two branches in the up-sampling stage; a pyramid pooling module is added to the first branch, and an attention mechanism is added to the second branch; the pyramid pooling module reduces the loss of context information between different sub-regions by fusing features at 4 pyramid scales, wherein the first row is the coarsest level, where global pooling produces a single-bin output, and the remaining three rows produce pooled features at different scales; to preserve the weight of the global feature, if the pyramid has N levels, a 1 × 1 convolution after each level reduces the number of channels to 1/N of the original; the pooled features are then up-sampled by bilinear interpolation to the size of the feature map before pooling and finally concatenated; a Squeeze-and-Excitation (SE) block is added to the second branch, wherein the first step is the squeeze operation, which takes the global spatial feature of each channel in the feature map as the representation of that channel, forming a channel descriptor; the second step is the excitation operation, which learns how strongly the segmentation network depends on each channel and adjusts the weight of each channel in the feature map accordingly, the adjusted feature map being the output of the SE block; the multi-scale information extracted by the first branch and the attention information extracted by the second branch are concatenated and convolved to obtain the final segmentation result;

the labels are refined with a Dense CRF; specifically, the training images and the labels are input into the Dense CRF together, and the Dense CRF model refines the labels according to the pixel correlations in the training images to obtain finer output; the loss function of the semantic segmentation network has two parts, namely Loss1, the loss between the segmentation result and the label obtained by threshold segmentation in step (3), and Loss2, the loss between the segmentation result and the label refined by the Dense CRF; the total loss is Loss = Loss1 + Loss2; each part takes the form of a weighted cross-entropy loss, formulated as follows:

$$L(p, \hat{p}) = -\left(\beta \, p \log \hat{p} + (1 - p) \log(1 - \hat{p})\right)$$

where $p$ denotes the ground-truth label and $\hat{p}$ denotes the predicted label; $\beta$ is a manually set hyperparameter in the range 0 to 1 by which all positive samples are weighted, which effectively mitigates class imbalance.

6. The weakly supervised semantic segmentation method based on deep learning according to claim 5, wherein, during the training of step (5), the model is validated on the validation set every 100 iterations and the validated model is saved; after training, the model that performs best on the validation set is selected as the final model; in the testing stage, the test-set images are input to obtain the classification and segmentation results for each image.