CN109902809B - Semantic segmentation model assisted by a generative adversarial network - Google Patents

Semantic segmentation model assisted by a generative adversarial network

Info

Publication number
CN109902809B
CN109902809B
Authority
CN
China
Prior art keywords
segmentation
model
feature
semantic segmentation
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910154150.0A
Other languages
Chinese (zh)
Other versions
CN109902809A (en)
Inventor
郭子豪
王永松
郑云彬
高峰
刘丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Kangqiao Electronic Co ltd
University of Electronic Science and Technology of China
Original Assignee
Chengdu Kangqiao Electronic Co ltd
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kangqiao Electronic Co ltd, University of Electronic Science and Technology of China
Priority to CN201910154150.0A priority Critical patent/CN109902809B/en
Publication of CN109902809A publication Critical patent/CN109902809A/en
Application granted granted Critical
Publication of CN109902809B publication Critical patent/CN109902809B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a semantic segmentation model assisted by a generative adversarial network, which comprises a semantic segmentation generation model designed on the basis of VGG/ResNet50; an adversarial model whose inputs are the original image, the real segmentation map and the generated segmentation map; and a loss function with an added adversarial loss term, the adversarial loss term being added to the original cross entropy classification loss function and defined by a binary cross entropy function. The invention improves segmentation accuracy mainly by using a generative adversarial network structure to assist the semantic segmentation model: the strong feature learning ability of the adversarial model is used to distinguish the features of the generated segmentation map from those of the real segmentation map and to pull their mathematical distributions closer together, so that during training the generation model gradually learns the relationships between pixels, the spatial continuity of the pixels in the segmented image is enhanced, and the segmentation accuracy is improved. At the same time, the time cost that general post-processing techniques incur to improve segmentation accuracy is avoided.

Description

Semantic segmentation model assisted by a generative adversarial network
Technical Field
The invention belongs to the technical field of deep learning semantic segmentation, and particularly relates to a semantic segmentation model assisted by a generative adversarial network.
Background
Semantic segmentation is one of the classic challenges in the field of computer vision, whose goal is pixel-level labelling of a given image. It is one of the cornerstone techniques of image understanding and plays an important role in autonomous driving systems, unmanned aerial vehicle applications, wearable devices, VR technology and so on. The most advanced semantic segmentation techniques at the present stage are realized with convolutional neural networks, which extract the semantic and spatial information in a picture through the strong feature extraction and learning ability of convolution. However, whether in a segmentation model with a fully convolutional network structure or in a segmentation model with the U-shaped structure introduced by U-Net, the prediction for each pixel is made independently of the other pixels during model training. Various post-processing techniques, such as the fully connected conditional random field (DenseCRF) and CRF-as-RNN, are therefore commonly used to learn the relationships between pixels and enhance the spatial continuity of the segmented picture. However, these post-processing techniques are complex to implement and slow to run, which makes them hard to apply to video segmentation or real-time segmentation.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a semantic segmentation model assisted by a generative adversarial network.
In order to solve the technical problem, the technical solution of the invention is realized as follows:
A semantic segmentation model assisted by a generative adversarial network, comprising:
a semantic segmentation generation model designed on the basis of VGG/ResNet50;
a classic classification network such as VGG or ResNet50 is used as the feature extractor, and the last few fully connected layers are converted into convolutional layers, so that the original classification network becomes a fully convolutional network; the feature map output by the final layer of the model is enlarged by deconvolution, bilinear interpolation or a similar method to obtain the segmentation picture corresponding to the current input picture;
an adversarial model whose inputs are the original image, the real segmentation map and the generated segmentation map;
the adversarial model consists of a picture feature extractor, a segmentation feature extractor and a feature fusion device; the shallow part of a classic classification network such as VGG/ResNet50 extracts features from the picture, which are combined with the segmentation features extracted by the segmentation feature extractor and fed into the feature fusion device for judgement;
and a loss function with an added adversarial loss term;
an adversarial loss term defined by a binary cross entropy function is added to the original cross entropy classification loss function; the closer the generated segmentation map output by the generation model is to the real segmentation map, the smaller the computed loss of the generation model on the adversarial loss term and the larger the computed loss of the adversarial model, and vice versa, which improves the generation quality of the generation model.
Further, the weights of the fully connected layers are reshaped into the convolution kernel parameters of the corresponding fully convolutional layers.
Further, the feature map received by the first deconvolution layer is downscaled by a factor of 32 relative to the original image; after deconvolution upsampling it is merged with the feature map output by block 3 to form a new feature map, and by analogy the new feature map obtained after fusion with the feature map output by block 2 is upsampled by 8× interpolation to obtain a finer segmentation result.
Further, in ResNet50 the standard convolutional layers in blocks 3 and 4 are replaced by dilated (hole) convolutional layers without increasing the number of network parameters, thereby enlarging the receptive field of the feature maps.
Further, the dilated convolution kernel enlarges the effective kernel size by inserting 0s into the kernel.
Further, a global average pooling layer is introduced into the original ASPP module: all pixels of a feature map are averaged so that the feature map effectively becomes a feature vector, which is recombined by a convolutional layer, upsampled again and merged with the other feature maps; averaging the whole feature map filters out a large amount of detail and keeps only the global information of the feature map, and because the sliding window covers the whole feature map, the receptive field of the next convolutional layer is expanded to the entire image.
Further, the adversarial model can judge the source of the current feature map by checking whether the input segmentation feature map contains only 0s and 1s.
The invention has the following advantages and positive effects:
The invention improves segmentation accuracy mainly by using a generative adversarial network structure to assist the semantic segmentation model: the generated segmentation map produced by the generation model and the real segmentation map from the source data set are fed into the adversarial model, the strong feature learning ability of the adversarial model is used to distinguish the features of the generated segmentation map from those of the real segmentation map and to pull their mathematical distributions closer together, so that during training the generation model gradually learns the relationships between pixels, the spatial continuity of the pixels in the segmented image is enhanced, and segmentation accuracy improves. Meanwhile, the structure, parameters and computation of the original segmentation model need not be changed, and the time cost that general post-processing techniques incur to improve segmentation accuracy is avoided.
Drawings
FIG. 1 is a general block diagram of the model of the present invention;
FIG. 2 is a diagram of the VGG16-based generation model architecture;
FIG. 3 is a diagram of the ResNet50-based generation model architecture;
FIG. 4 compares the standard convolution and dilated (hole) convolution structures;
FIG. 5 is a diagram of the adversarial network model architecture;
FIG. 6 compares semantic segmentation results.
Detailed Description
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In order to help the model learn the relationships between pixels and improve its segmentation accuracy, while neither increasing the model structure or parameter count nor changing the model's running time, the invention provides a structure in which a generative adversarial network assists the semantic segmentation model in learning the relationships between pixels, so that the segmentation pictures generated by the original model gain enhanced spatial continuity. The embodiments of the invention are described in detail below.
A semantic segmentation model assisted by a generative adversarial network, as shown in FIGS. 1 to 6, comprises:
1. Semantic segmentation generation model based on VGG/ResNet50
The traditional VGG model is an image classification model in which three fully connected layers follow block 4. Before entering the fully connected layers, the feature map is flattened into one dimension: the feature vectors along the spatial axes are concatenated, turning the feature map into a single vector.
This facilitates semantic feature extraction but largely destroys the spatial structure of the object, whereas semantic segmentation needs to obtain both the semantic features and the spatial features of the object. Therefore, in order to preserve the spatial features of the feature map, the fully connected layers of the network are converted into fully convolutional layers, and the weights of the fully connected layers can be reshaped into the convolution kernel parameters of the corresponding convolutional layers. For the fully convolutional layer of block 5, for example, the weight matrix of the former fully connected layer has shape (25088, 4096).
Since the feature vector input to that fully connected layer has shape (25088,), obtained by flattening a feature map of shape (7, 7, 512), the convolution kernel size of the fully convolutional layer can be set to (7, 7) with 512 input channels and 4096 output channels, after which the fully connected weights are grafted onto it by reshaping. Moreover, the feature map received by the first deconvolution layer is downscaled by a factor of 32 relative to the original image, and using it directly for deconvolution back to the original size gives a poor segmentation result. Therefore, after deconvolution upsampling, it is merged with the feature map output by block 3 to form a new feature map.
By analogy, the new feature map obtained after fusion with the feature map output by block 2 is upsampled by 8× interpolation, which yields a finer segmentation result.
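As an illustration of the two steps above (grafting fully connected weights onto an equivalent convolution, then fusing skip features and upsampling), a minimal PyTorch sketch follows; the function and class names, channel counts (4096, 512, 256) and layer choices are assumptions for a VGG16-style backbone rather than the exact configuration of the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fc_to_conv(fc, in_channels=512, kernel_size=7):
    # Graft the weights of a (25088 -> 4096) fully connected layer onto the
    # equivalent Conv2d(512, 4096, kernel_size=7); the weight matrix is only reshaped.
    out_features, _ = fc.weight.shape                      # (4096, 25088)
    conv = nn.Conv2d(in_channels, out_features, kernel_size)
    conv.weight.data.copy_(
        fc.weight.data.view(out_features, in_channels, kernel_size, kernel_size))
    conv.bias.data.copy_(fc.bias.data)
    return conv

class FCNDecoder(nn.Module):
    # Upsample the 1/32 score map by deconvolution, fuse it with the block-3 (1/16)
    # and block-2 (1/8) feature maps, then apply 8x bilinear interpolation.
    def __init__(self, num_classes, c5=4096, c3=512, c2=256):
        super().__init__()
        self.score5 = nn.Conv2d(c5, num_classes, 1)
        self.score3 = nn.Conv2d(c3, num_classes, 1)
        self.score2 = nn.Conv2d(c2, num_classes, 1)
        self.up1 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)

    def forward(self, f5, f3, f2, out_size):
        x = self.up1(self.score5(f5)) + self.score3(f3)    # 1/32 -> 1/16, fuse block 3
        x = self.up2(x) + self.score2(f2)                  # 1/16 -> 1/8,  fuse block 2
        x = F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
        return torch.softmax(x, dim=1)                     # per-pixel class probabilities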
The structure of the VGG-based generation model is shown in FIG. 2, and that of the ResNet50-based generation model in FIG. 3. In ResNet50, without increasing the number of network parameters, the standard convolutional layers in blocks 3 and 4 are replaced by dilated (hole) convolutional layers so as to enlarge the receptive field of the feature maps. The comparison between the standard convolution kernel and the dilated convolution kernel is shown in FIG. 4: compared with a standard kernel, the dilated kernel has the same number of parameters, and its effective size is enlarged by inserting 0s into the kernel.
In FIG. 4, the receptive field of the current convolution kernel grows from 3×3 to 5×5, and as the number of layers increases, the receptive field of the deeper convolution kernels grows exponentially. The enlarged receptive field lets every pixel of a deep feature map gather more information from the shallow feature maps, which improves the fineness of the segmentation. Meanwhile, a global average pooling layer is introduced into the original ASPP module to form an enhanced ASPP. The global average pooling layer is equivalent to an average pooling layer whose sliding window is the whole feature map: all pixels of a feature map are averaged, so the feature map effectively becomes a feature vector, which is recombined by a convolutional layer, upsampled again and merged with the other feature maps.
Global average pooling proved to be an effective operation in experiments: averaging the whole feature map filters out a large amount of detail and keeps only the global information of the feature map, and because the sliding window covers the whole feature map, the receptive field of the next convolutional layer is expanded to the entire image. The detailed network structures of G-VGG16 and G-ResNet50 are compared in Table 1.
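A minimal PyTorch sketch of such an ASPP module extended with a global average pooling branch is given below; the module name, channel sizes and dilation rates (6, 12, 18) are illustrative assumptions rather than the configuration used in the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedASPP(nn.Module):
    # ASPP with parallel dilated convolutions plus a global-average-pooling branch.
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.global_pool = nn.AdaptiveAvgPool2d(1)         # average the whole feature map
        self.global_conv = nn.Conv2d(in_ch, out_ch, 1)     # recombine the resulting vector
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        g = self.global_conv(self.global_pool(x))          # (N, out_ch, 1, 1)
        g = F.interpolate(g, size=(h, w), mode='bilinear', align_corners=False)
        feats.append(g)                                    # re-enlarge and merge
        return self.project(torch.cat(feats, dim=1))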
TABLE 1 Structure of the generation models (G-VGG16 / G-ResNet50)
[Table 1 is provided as an image in the original publication and is not reproduced here.]
2. Adversarial model taking the original image, the real segmentation map and the generated segmentation map as input
Unlike a typical adversarial model, which only receives the generated result and the real result, the adversarial model here also receives the original image; the original image is passed through the feature extractor and its features are then merged with the features of the segmentation map. The adversarial model has to judge the source of a segmentation map, yet a segmentation map has a simple spatial structure and highly generalized semantic information: if the real and generated segmentation maps were fed in directly for adversarial training, the adversarial model would converge quickly by telling them apart on details and would not learn the intended high-order relationships between pixels. For example, the real segmentation map fed to the model is first one-hot encoded into an H × W × classes feature map in which, for each pixel, the describing feature vector equals 1 only in the dimension of the class the pixel belongs to and 0 elsewhere, whereas the generated segmentation map delivered by the generation model is also an H × W × classes feature map whose values along the classes dimension are all floating point numbers between 0 and 1. The adversarial model could therefore determine the source of the current feature map simply by checking whether the input segmentation feature map contains only 0s and 1s.
To solve this problem, on the one hand, feature extraction of the original picture is added: a feature extractor (for example the first few layers of VGG/ResNet) extracts features of the picture, giving a first class of low-level features. Meanwhile, the segmentation map is already an abstracted, highly condensed representation of the information, and a deep network would lose even more of it, so a shallow small network transforms the segmentation map features, giving a second class of low-level features. The two classes of low-level features are then combined and fused by a convolutional neural network.
On the other hand, the 0/1 one-hot feature map of the real segmentation map is converted into a floating-point feature map by a scaling transformation. For a pixel i in the real segmentation map, let its one-hot feature vector be v_i and its scaled feature vector be v̂_i. A fixed value ε is set, meaning that after the scaling transformation the dimension of the feature vector whose value was 1 (assume it is position l) may not fall below ε. The feature vector u_i of the pixel at the same position in the generated segmentation map is obtained at the same time; the scaled value at position l is then

v̂_i^l = max(ε, u_i^l)

while the scaled value at every other position c is

v̂_i^c = u_i^c · (1 − v̂_i^l) / (1 − u_i^l)
By this scaling transformation, when the pixel class prediction in the generated segmentation map is correct and its probability exceeds ε, the feature vector obtained by transforming the pixel at the same position in the real segmentation map becomes exactly the same as the one in the generated segmentation map. If the probability is below ε, the probability values of the other dimensions are rescaled accordingly.
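A small PyTorch sketch of this scaling transformation is given below, assuming the generated map holds per-pixel class probabilities; the function name and the value of ε (0.3) are illustrative assumptions.

import torch

def scale_onehot(onehot, generated, eps=0.3):
    # onehot, generated: (N, classes, H, W). Soften the real one-hot map using the
    # generated probabilities: the true-class value becomes max(eps, u_l) and the
    # remaining probability mass is redistributed over the other classes.
    generated = generated.detach()                              # real sample must not backprop into g
    true_prob = (generated * onehot).sum(dim=1, keepdim=True)   # u_i^l
    clamped = true_prob.clamp(min=eps)                          # v_i^l = max(eps, u_i^l)
    scale = (1.0 - clamped) / (1.0 - true_prob).clamp(min=1e-8) # rescale factor for c != l
    softened = generated * scale * (1 - onehot) + clamped * onehot
    return softened                                             # each pixel vector still sums to 1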
Through these two operations, the obvious difference between the real segmentation map and the generated segmentation map is further reduced, the difficulty for the adversarial network to distinguish real from fake samples is increased, and the generation model is helped to discover the spatial continuity between pixels.
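The overall layout of the adversarial model built from these components (picture feature extractor, segmentation feature extractor, feature fusion network) can be sketched in PyTorch as follows; the class name, channel widths and layer depths are assumptions, and the final judgement layer described next is omitted here.

import torch
import torch.nn as nn

class AdversarialFeatureNet(nn.Module):
    # Picture feature extractor (shallow VGG-like layers), segmentation feature
    # extractor (small shallow network), and a convolutional feature fusion stage.
    def __init__(self, num_classes, base=64):
        super().__init__()
        self.image_features = nn.Sequential(               # first class of low-level features
            nn.Conv2d(3, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.seg_features = nn.Sequential(                  # second class of low-level features
            nn.Conv2d(num_classes, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.fusion = nn.Sequential(                        # fuse the two feature sets
            nn.Conv2d(2 * base, 2 * base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * base, 4 * base, 3, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, seg_map, image):
        f = torch.cat([self.seg_features(seg_map), self.image_features(image)], dim=1)
        return self.fusion(f)                               # fused features for the judgement head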
In addition, to further enhance the distinguishing ability of the adversarial model, the output differs from that of a typical adversarial network, which fully combines the feature maps of the last layer and then applies a single sigmoid normalization to output a value near 0 or 1. Instead, the last layer converts the feature map into a 4 × 4 judgement map in which each cell represents the probability that the corresponding region belongs to a real or a synthesized segmentation map. The normalization is carried out independently on each cell, which prevents the gradients of the whole feature map from being adjusted when only one part of the segmentation map deviates severely, and thereby improves the robustness of the model.
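A minimal sketch of such a judgement head in PyTorch follows; the pooling-based construction and layer sizes are assumptions, the point being only that each cell of the 4 × 4 map is normalized by its own sigmoid.

import torch
import torch.nn as nn

class JudgementHead(nn.Module):
    # Reduce the fused features to a 4x4 judgement map; each cell is normalized
    # independently by its own sigmoid instead of one global sigmoid output.
    def __init__(self, in_ch=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)                 # coarse 4x4 grid over the image
        self.score = nn.Conv2d(in_ch, 1, 1)                 # one real/fake logit per cell

    def forward(self, fused):
        return torch.sigmoid(self.score(self.pool(fused)))  # (N, 1, 4, 4) probabilities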
TABLE 2 Structure of the adversarial model
[Table 2 is provided as an image in the original publication and is not reproduced here.]
3. Loss function with an added adversarial loss term
Because a generative adversarial structure is introduced into semantic segmentation, the generation network and the adversarial network are trained separately during training so that they adjust each other adversarially. Two loss functions therefore need to be designed, one for training the generation model and one for training the adversarial model. In addition, while one model is being trained, the weights of the other model must be fixed and kept constant.
For the generation model, a multi-class cross entropy loss function is used to train the model to classify each pixel independently. Assuming the input image is an H × W × 3 RGB image x, the segmentation picture produced by the generation model is g(x), an H × W × classes map of per-pixel class probabilities.
Therefore, the loss function for training the generation model alone is defined as follows:

loss_mce(y, g(x)) = −Σ_{h,w,c} y_{h,w,c} · log g(x)_{h,w,c}
where y is the one-hot representation of the class to which each pixel belongs. After the generation model has been trained to convergence, the generative adversarial structure is introduced for training. Under this structure, when the generation model is trained, an image x is first fed into the generation model, the generated segmentation result g(x) is fed into the adversarial model together with x, and a binary cross entropy loss function is used to compute the loss of the adversarial output d(g(x), x). The binary cross entropy loss function is defined as follows:

loss_bce(z, ẑ) = −[ z · log ẑ + (1 − z) · log(1 − ẑ) ]
Since the task of the generation model is to fool the current adversarial model so that it cannot tell the source of the segmentation picture fed into it, the label used for g(x) is 1 (indicating that the source is the data set).
The adversarial loss term for the generation model could therefore be +loss_bce(1, d(g(x), x)); using −loss_bce(0, d(g(x), x)) instead makes the gradient descent of the generation model more stable while the adversarial model judges whether the segmentation picture is real or synthesized. The loss function for training the generation model under the adversarial structure is thus defined as follows:
loss_g = loss_mce(y, g(x)) − loss_bce(0, d(g(x), x))
When training the adversarial model, since it has to distinguish whether the segmentation picture fed to it is real or synthesized, the label of a segmentation picture synthesized by the generation model is 0 and the label of a segmentation picture taken from the data set is 1. The adversarial network should push its prediction towards 1 when the input segmentation picture is real and towards 0 when it is synthesized. The loss function for training the adversarial model is thus defined as follows:
loss_d = loss_bce(1, d(y, x)) + loss_bce(0, d(g(x), x))
Thus, for a training set containing N pictures, with y_i denoting the segmentation picture corresponding to picture x_i, the loss function of the entire model is defined as follows:

loss = Σ_{i=1}^{N} [ loss_mce(y_i, g(x_i)) + λ · ( loss_bce(1, d(y_i, x_i)) − loss_bce(0, d(g(x_i), x_i)) ) ]
where λ is a hyper-parameter used to adjust the loss contributed by the adversarial network during the early stage of training.
Since the generation model already generates reasonably well when adversarial training starts while the adversarial model is still in its initialized state, the loss the adversarial model provides at first is large, so the hyper-parameter is needed to scale it down and reduce the gradient that the adversarial loss contributes to the adjustment of the generation model.
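The alternating training procedure described above can be sketched in PyTorch as follows; the function name, optimizer handling and the value of λ (0.1) are illustrative assumptions, with g denoting the generation model and d the adversarial model (taking a segmentation map and the image as input).

import torch
import torch.nn.functional as F

def train_step(g, d, opt_g, opt_d, x, y_onehot, lam=0.1):
    target = y_onehot.argmax(dim=1)                         # integer class map (N, H, W)

    # ---- train the generation model; the adversarial model's weights stay fixed ----
    for p in d.parameters():
        p.requires_grad_(False)
    pred = g(x)                                             # per-pixel class probabilities
    loss_mce = F.nll_loss(pred.clamp_min(1e-8).log(), target)
    d_fake = d(pred, x)
    # loss_g = loss_mce(y, g(x)) - lambda * loss_bce(0, d(g(x), x))
    loss_g = loss_mce - lam * F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # ---- train the adversarial model; the generator's output is detached ----
    for p in d.parameters():
        p.requires_grad_(True)
    d_real = d(y_onehot, x)                                 # or the scaled one-hot map
    d_fake = d(g(x).detach(), x)
    # loss_d = loss_bce(1, d(y, x)) + loss_bce(0, d(g(x), x))
    loss_d = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()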
The invention improves segmentation accuracy mainly by using a generative adversarial network structure to assist the semantic segmentation model: the generated segmentation map produced by the generation model and the real segmentation map from the source data set are fed into the adversarial model, the strong feature learning ability of the adversarial model is used to distinguish the features of the generated segmentation map from those of the real segmentation map and to pull their mathematical distributions closer together, so that during training the generation model gradually learns the relationships between pixels, the spatial continuity of the pixels in the segmented image is enhanced, and segmentation accuracy improves. Meanwhile, the structure, parameters and computation of the original segmentation model need not be changed, and the time cost that general post-processing techniques incur to improve segmentation accuracy is avoided.
It is obvious to a person skilled in the art that the invention is not restricted to details of the above-described exemplary embodiments, but that it can be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description is organized by embodiments, each embodiment does not necessarily contain only a single independent technical solution; this manner of description is adopted only for clarity, and those skilled in the art should treat the description as a whole; the technical solutions of the various embodiments may also be combined appropriately to form other embodiments understandable to those skilled in the art.

Claims (6)

1. A method for constructing a semantic segmentation model assisted by a generative adversarial network, characterized in that the constructed semantic segmentation model assisted by the generative adversarial network comprises a semantic segmentation generation model and an adversarial model, the semantic segmentation generation model and the adversarial model each being configured with a loss function;
Wherein,
semantic segmentation generation model:
the semantic segmentation generation model is designed on the basis of VGG/ResNet50 and is used for: taking VGG or ResNet50 as the feature extractor and converting the last few fully connected layers into convolutional layers, so that the original classification network becomes a fully convolutional network; and enlarging the feature map output by the final layer of the semantic segmentation generation model by deconvolution or bilinear interpolation to obtain the segmentation picture corresponding to the current input picture;
adversarial model:
the original image, the real segmentation map and the generated segmentation map are input into the adversarial model, which consists of an image feature extractor, a segmentation feature extractor and a feature fusion device; the image feature extractor extracts image features from the original image using the shallow part of VGG/ResNet50, the segmentation feature extractor extracts segmentation features from the real segmentation map, and the feature fusion device fuses the image features with the segmentation features; the segmentation features extracted by the segmentation feature extractor include the scaled values v̂_i^l and v̂_i^c obtained by the scaling transformation

v̂_i^l = max(ε, u_i^l)

v̂_i^c = u_i^c · (1 − v̂_i^l) / (1 − u_i^l)

wherein ε represents a set fixed value, l represents the position corresponding to the vector dimension whose value is 1 in the feature vector of the real segmentation map, c represents a position other than position l, i represents the index of the pixel, v̂_i^l represents the scaled feature value of pixel i at position l in the real segmentation map, v̂_i^c represents the scaled feature value of pixel i at a position other than l in the real segmentation map, u_i^l represents the feature value of pixel i at position l in the generated segmentation map, and u_i^c represents the feature value of pixel i at a position other than l in the generated segmentation map;
loss function:
the two loss functions are referred to as the generation loss function and the adversarial loss function; the generation loss function is used to train the semantic segmentation generation model and the adversarial loss function is used to train the adversarial model; the generation loss function and the adversarial loss function are combined into the loss function of the whole adversarial-network-assisted semantic segmentation model, referred to as the total model loss function; the total model loss function adds an adversarial loss term to the original cross entropy classification loss function, the adversarial loss term being defined by a binary cross entropy function, such that the closer the generated segmentation map output by the semantic segmentation generation model is to the real segmentation map, the smaller the computed loss of the semantic segmentation generation model on the adversarial loss term and the larger the computed loss of the adversarial model.
2. The method for constructing a semantic segmentation model assisted by a generative adversarial network according to claim 1, characterized in that: the weights of the fully connected layers are reshaped into the convolution kernel parameters of the corresponding fully convolutional layers.
3. The method for constructing a semantic segmentation model assisted by a generative adversarial network according to claim 1, characterized in that: the feature map received by the first deconvolution layer is downscaled by a factor of 32 relative to the original image; after deconvolution upsampling it is merged with the feature map output by block 3 to form a new feature map, and by analogy the new feature map obtained after fusion with the feature map output by block 2 is upsampled by 8× interpolation to obtain a finer segmentation result.
4. The method for constructing a semantic segmentation model assisted by a generative adversarial network according to claim 3, characterized in that: in ResNet50 the standard convolutional layers in blocks 3 and 4 are replaced by dilated (hole) convolutional layers without increasing the number of network parameters, thereby enlarging the receptive field of the feature maps.
5. The method for constructing a semantic segmentation model assisted by a generative adversarial network according to claim 4, characterized in that: the dilated convolution kernel enlarges the effective kernel size by inserting 0s into the kernel.
6. The method for constructing a semantic segmentation model assisted by a generative adversarial network according to any one of claims 1 to 5, characterized in that: a global average pooling layer is introduced into the original ASPP module; all pixels of a feature map are averaged so that the feature map effectively becomes a feature vector, which is recombined by a convolutional layer, upsampled again and merged with the other feature maps; averaging the whole feature map filters out a large amount of detail and keeps only the global information of the feature map, and because the sliding window covers the whole feature map, the receptive field of the next convolutional layer is expanded to the entire image.
CN201910154150.0A 2019-03-01 2019-03-01 Semantic segmentation model assisted by a generative adversarial network Active CN109902809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910154150.0A CN109902809B (en) 2019-03-01 2019-03-01 Semantic segmentation model assisted by a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910154150.0A CN109902809B (en) 2019-03-01 2019-03-01 Semantic segmentation model assisted by a generative adversarial network

Publications (2)

Publication Number Publication Date
CN109902809A CN109902809A (en) 2019-06-18
CN109902809B true CN109902809B (en) 2022-08-12

Family

ID=66946028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910154150.0A Active CN109902809B (en) 2019-03-01 2019-03-01 Semantic segmentation model assisted by a generative adversarial network

Country Status (1)

Country Link
CN (1) CN109902809B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110443805B (en) * 2019-07-09 2021-08-17 浙江大学 Semantic segmentation method based on pixel density
CN110414526B (en) * 2019-07-31 2022-04-08 达闼科技(北京)有限公司 Training method, training device, server and storage medium for semantic segmentation network
CN110490884B (en) * 2019-08-23 2023-04-28 北京工业大学 Lightweight network semantic segmentation method based on countermeasure
CN111340819B (en) * 2020-02-10 2023-09-12 腾讯科技(深圳)有限公司 Image segmentation method, device and storage medium
CN112562855B (en) * 2020-12-18 2021-11-02 深圳大学 Hepatocellular carcinoma postoperative early recurrence risk prediction method, medium and terminal equipment
CN113657486B (en) * 2021-08-16 2023-11-07 浙江新再灵科技股份有限公司 Multi-label multi-attribute classification model building method based on elevator picture data
CN116030397B (en) * 2023-03-27 2023-08-01 湖南大学 Endoscopic surgery video segmentation method based on time sequence information interaction

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961217A (en) * 2018-06-08 2018-12-07 南京大学 A kind of detection method of surface flaw based on positive example training

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150104102A1 (en) * 2013-10-11 2015-04-16 Universidade De Coimbra Semantic segmentation method with second-order pooling
US11676296B2 (en) * 2017-08-11 2023-06-13 Sri International Augmenting reality using semantic segmentation
CN107767384B (en) * 2017-11-03 2021-12-03 电子科技大学 Image semantic segmentation method based on countermeasure training
CN108268870B (en) * 2018-01-29 2020-10-09 重庆师范大学 Multi-scale feature fusion ultrasonic image semantic segmentation method based on counterstudy
CN109190707A (en) * 2018-09-12 2019-01-11 深圳市唯特视科技有限公司 A kind of domain adapting to image semantic segmentation method based on confrontation study

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961217A (en) * 2018-06-08 2018-12-07 南京大学 A kind of detection method of surface flaw based on positive example training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Adversarial Learning for Semi-Supervised Semantic Segmentation; Wei-Chih Hung et al.; arXiv; 2018-07-24; pp. 1-17 *

Also Published As

Publication number Publication date
CN109902809A (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN109902809B (en) Semantic segmentation model assisted by a generative adversarial network
Anwar et al. Image colorization: A survey and dataset
CN110706157B (en) Face super-resolution reconstruction method for generating confrontation network based on identity prior
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
CN107274419B (en) Deep learning significance detection method based on global prior and local context
Zheng et al. Learning Cross-scale Correspondence and Patch-based Synthesis for Reference-based Super-Resolution.
CN111639692A (en) Shadow detection method based on attention mechanism
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN110717856A (en) Super-resolution reconstruction algorithm for medical imaging
CN112163449A (en) Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN111696035A (en) Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm
CN108921783B (en) Satellite image super-resolution reconstruction method based on mixed loss function constraint
CN110689482A (en) Face super-resolution method based on supervised pixel-by-pixel generation countermeasure network
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN111861880A (en) Image super-fusion method based on regional information enhancement and block self-attention
CN111696038A (en) Image super-resolution method, device, equipment and computer-readable storage medium
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN105931189B (en) Video super-resolution method and device based on improved super-resolution parameterized model
Wang et al. Towards high-quality thermal infrared image colorization via attention-based hierarchical network
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
CN102222321A (en) Blind reconstruction method for video sequence
Hou et al. Joint learning of image deblurring and depth estimation through adversarial multi-task network
CN110766609B (en) Depth-of-field map super-resolution reconstruction method for ToF camera
CN115100409B (en) Video portrait segmentation algorithm based on twin network
CN111986210A (en) Medical image small focus segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant