CN112085738B - Image segmentation method based on a generative adversarial network
- Publication number: CN112085738B (application number CN202010816515.4A)
- Authority: CN (China)
- Prior art keywords: image, network, layer, segmentation, loss
- Prior art date: 2020-08-14
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/11—Region-based segmentation
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06T7/136—Segmentation; Edge detection involving thresholding
- G06T7/90—Determination of colour characteristics
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention discloses an image segmentation method based on a generative adversarial network (GAN). First, a deep residual network modeled on VGG is constructed as the basic image semantic segmentation network: downsampling is performed by convolutional layers with a stride of 2, the network ends with average pooling and a 1000-way fully connected layer with softmax, there are 101 weighted layers in total, and a prediction layer outputs the predicted classes. Next, a generator is constructed with a 4-layer convolution and 4-layer deconvolution structure, taking the prediction-layer output together with the original image as input and producing a reconstructed image corresponding to the original image. Finally, a discriminator is constructed from 4 convolutional layers, using ReLU as the activation function except in the last layer, and taking the original image and the reconstructed image output by the generator as input. The invention can obtain higher-level features while avoiding excessive computation; because the segmentation loss produced by the basic segmentation network is included in the total loss function of the generative adversarial network, the model parameters are learned more accurately and the segmentation result is more precise.
Description
Technical Field
The invention relates to the technical field of computer vision image segmentation, and in particular to an image segmentation method based on a generative adversarial network (GAN).
Background
Image segmentation is the technique and process of dividing an image into a number of specific regions with unique properties and extracting targets of interest, and it is a key step on the way from image processing to image analysis. Existing image segmentation methods mainly fall into the following categories: threshold-based methods, region-based methods, edge-based methods, methods based on specific theories, and so on. From a mathematical point of view, image segmentation is the process of dividing a digital image into mutually disjoint regions. It is also a labeling process: pixels belonging to the same region are assigned the same index. In recent years, deep learning techniques have been widely applied to all kinds of data processing tasks, such as images, speech and text, and the generative adversarial network (GAN) and reinforcement learning (RL) have become the rising stars of the deep learning framework. A traditional generative model must define a parametric expression of the probability distribution and then train the model by maximizing a likelihood function; the gradient expansion of such models usually contains an expectation term, so an exact solution is difficult to obtain and approximation is generally needed. To overcome these difficulties of solution accuracy and computational complexity, the generative adversarial network (a generative network plus a discriminative network) was creatively proposed. The GAN model does not need to represent the likelihood function of the data directly, yet it can generate samples with the same distribution as the original data.
The idea of applying GAN to image segmentation is to segment an image with a generative network (i.e., a segmentation network), to have a discriminative network (the discriminator) judge whether a given segmentation is a ground-truth segmentation or one produced by the generative network, and to train the discriminative network and the generative network alternately, following the usual GAN training method.
In recent years, much image segmentation research has been devoted to proposing effective methods for image segmentation, and it has been shown that when the amount of image sample data is very large, a deep neural network can learn high-level features that are more salient and diverse than those of manually designed feature extraction methods. However, more layers are not always better: when the number of layers reaches a certain point, the features seem ever more "advanced", but in fact such depth may cause a degradation problem, i.e., the accuracy saturates (which is not surprising) and then drops rapidly as the depth of the network increases. Surprisingly, this degradation is not caused by overfitting; rather, adding more layers to a model of reasonable depth produces a higher error rate, so-called "model degradation". To solve the above technical problem, the present invention proposes an image segmentation method based on a generative adversarial network.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides an image segmentation method based on a generative adversarial network, which can obtain higher-level features, avoid excessive computation, and produce more precise segmentation results.
The content of the invention is as follows: the invention provides an image segmentation method based on a generative adversarial network, which specifically comprises the following steps:
(1) constructing an image semantic segmentation network: on the basis of a VGG-16 fully convolutional neural network, residual mapping is used to build more layers, deepening the network while keeping it easy to train; an original RGB image is input, image features are extracted, and a confidence map is output through a prediction layer;
(2) constructing a generative adversarial network, in which the generator consists of 4 convolutional layers and 4 deconvolutional layers and the discriminator consists of 4 convolutional layers, with ReLU as the activation function; the generative adversarial network is pre-trained with the calibration parameters (ground truth) of the original RGB images as input;
(3) using the confidence map output by the prediction layer in step (1) together with the original RGB image as the input of the generator of the generative adversarial network to obtain a reconstructed image corresponding to the original image; then using the reconstructed image and the original image together as the input of the discriminator, and obtaining the final semantic segmentation result through further feature extraction and classification;
(4) defining the loss functions: the confidence map output by the prediction layer of the image semantic segmentation network is passed through argmax to obtain a prediction mask, in which each value represents the category of the corresponding pixel of the input image; the difference between this mask and the calibration parameters of the original image is defined as the segmentation loss, the discrimination loss is computed from the discriminator's outputs on the original image and on the reconstructed image output by the generator, and the difference between the reconstructed image and the original image is defined as the content loss; the segmentation loss, discrimination loss and content loss together form the total loss function of the network;
(5) training the whole model, formed by the image semantic segmentation network and the generative adversarial network, in an end-to-end manner to obtain the parameters, then inputting test data and obtaining the image semantic segmentation result through the discriminator.
Further, the step (1) includes the steps of:
(11) based on the VGG-16 fully convolutional neural network, the number of network layers is deepened by residual mapping to extract higher-level features; instead of the traditional approach of directly stacking layers to fit the desired underlying mapping H(x), the stacked layers fit the residual mapping F(x) = H(x) - x, so that the original mapping becomes F(x) + x:

$y = F(x, \{W_i\}) + x$

where x denotes the input, y denotes the output, and F(x, {W_i}) is the target residual mapping function to be learned;
(12) inputting an original RGB image, extracting image features, and outputting a confidence map of pixel classes through the prediction layer; the difference between this confidence map and the calibration parameters of the original image is taken as the segmentation loss, and a cross-entropy loss function is used in place of a classification loss function or a mean-square loss function:

$l_S = -\frac{1}{N}\sum_{x=1}^{N}\sum_{i=1}^{C} Y_{xi}\log P_{xi}$

where N is the total number of pixels, C is the number of channels (classes), P_{xi} is the probability that pixel x is assigned to class i, and Y_{xi} is the corresponding calibration probability.
Further, the step (2) comprises the steps of:
(21) defining a generator and a discriminator: the generator consists of 4 convolutional layers and 4 deconvolutional layers, with the dropout rate set to 0.5 and ReLU as the activation function; the discriminator consists of 4 convolutional layers, with ReLU as the activation function except in the last layer;

the generator and the discriminator use strided convolution in place of the spatial pooling function, and the three-dimensional array s(f) is defined as p-norm pooling:

$s_{i,j,u}(f) = \Big(\sum_{h=1}^{k}\sum_{w=1}^{k}\big|f_{g(h,w,i,j,u)}\big|^{p}\Big)^{1/p}$

where f denotes the feature map output by a convolutional layer, a three-dimensional array of size W × H × N with H rows, W columns and N channels; g(h, w, i, j, u) = (r·i + h, r·j + w, u) is the mapping from s back to f; k is the pooling size, u indexes the filter, and r is the stride; if r > k, the pooling windows do not overlap;
(22) pre-training the generative adversarial network with the calibration parameters of the original RGB images as input.
Further, the step (4) comprises the steps of:
(41) the segmentation loss represents the difference between the prediction mask, obtained by applying argmax to the output of the prediction layer of the basic image semantic segmentation network, and the calibration parameters of the original RGB image, and is defined as:

$l_S = -\frac{1}{N}\sum_{x=1}^{N}\sum_{i=1}^{C} Y_{xi}\log P_{xi}$

where P_{xi} is the probability that pixel x is assigned to class i and Y_{xi} is the corresponding calibration probability;
(42) the discrimination loss is computed from the discriminator's outputs on the original RGB image and on the reconstructed image output by the generator, and is defined as:

$l_A = \log(D(I)) + \log(1 - D(I_R))$

where I denotes the original image and I_R denotes the reconstructed image;
(43) the content loss represents the difference between the generator output and the original image, and is defined as:

$l_C = \frac{1}{w\,h}\sum_{i=1}^{w}\sum_{j=1}^{h}\big(I_{ij} - I^{R}_{ij}\big)^{2}$

where I_{ij} denotes a single pixel of the calibration parameters of the original image and I^R_{ij} denotes the corresponding pixel of the reconstructed image;
(44) the overall loss function of the network is:

$l = l_S + \lambda_1 l_C + \lambda_2 l_A$

where λ_1 and λ_2 are weight parameters.
Beneficial effects: compared with the prior art, the invention has the following advantages. 1. The generative network and the segmentation network are designed separately, i.e., a segmentation network is placed in front of the generative network, so the whole model consists of three parts: the segmentation network, the generative network and the discriminative network, and the total loss function accordingly comprises the segmentation loss, the content loss and the discrimination loss. Higher-level features can thus be obtained while avoiding excessive computation, and because the segmentation loss produced by the basic segmentation network is included in the total loss function of the generative adversarial network, the model parameters are learned more accurately and the segmentation result is more precise. 2. To let the segmentation network obtain high-level features without significantly increasing the computational load, a DeepLab + ResNet-101 architecture is used.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of building blocks used in the present invention;
FIG. 3 is a schematic diagram of the generative network model of the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the accompanying drawings.
the invention provides an image segmentation method based on a generated countermeasure network, which introduces the generated countermeasure network (GAN), constructs a basic image segmentation network through a depth residual error network, acquires image characteristics, outputs a confidence map through a prediction layer, is used as the input of a generator together with an original RGB image, outputs a reconstructed image, is input into a discriminator together with the original image, distinguishes the difference between the two images by the discriminator and defines a loss function. Loss is gradually reduced by training the model, so that the output of the discriminator is similar to the calibration parameters of the original image as much as possible. As shown in fig. 1, the method specifically comprises the following steps:
step 1: and constructing a basic image semantic segmentation network, constructing more layers by using residual mapping on the basis of a VGG-16 full convolution neural network, further deepening the network, simplifying the training of the network, inputting an original RGB image, extracting image characteristics, and outputting a confidence map through a prediction layer.
The VGG-16 fully convolutional neural network is used as the prototype, and the number of network layers is deepened by means of residual mapping, so that higher-level features are extracted while the problem of model degradation is avoided.
When a deep network starts to converge, the problem of "model degradation" tends to appear: as the depth of the network increases, the accuracy saturates and then drops rapidly. Surprisingly, this degradation is not caused by overfitting; rather, adding more layers to a model of reasonable depth leads to a higher error rate. For such cases a deeper model has a solution by construction: the added layers are built as identity mappings, while the other layers are copied directly from the shallower model.
The invention adopts a deep residual learning framework to solve the model degradation caused by increasing network depth: instead of each stack of layers directly fitting the desired underlying mapping, the layers fit a residual mapping. Assuming the desired underlying mapping is H(x), the stacked nonlinear layers fit another mapping F(x) = H(x) - x, so the original mapping becomes F(x) + x, called the residual mapping; it can be inferred that the residual mapping is easier to optimize than the original, unreferenced mapping. This is realized by the "shortcut connections" of a feedforward neural network, as shown in fig. 2; this structure is called a building block. It is defined as follows:
$y = F(x, \{W_i\}) + x$    (1)

where x denotes the input, y denotes the output, and F(x, {W_i}) is the target residual mapping function to be learned.
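As a concrete illustration, the building block of equation (1) can be sketched in PyTorch; the two-convolution realization of F(x, {W_i}) and the channel count are assumptions chosen for illustration, not prescribed by the patent.

```python
import torch
import torch.nn as nn

class BuildingBlock(nn.Module):
    """Residual building block y = F(x, {W_i}) + x (equation (1)).

    A minimal sketch: F(x, {W_i}) is realized here as two 3x3
    convolutions with batch normalization, which is an assumption.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(  # F(x, {W_i})
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # shortcut connection adds the input x to the residual F(x)
        return self.relu(self.f(x) + x)
```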
If the added layers can be constructed as identity mappings, the training error of the deepened network should not increase; the model degradation problem suggests that multiple nonlinear layers have difficulty approximating an identity mapping. After the network is rewritten with residual mappings, if the identity mapping were optimal, the optimization problem would become very simple: the parameters of the nonlinear layers would simply be driven to 0. In practice the identity mapping is unlikely to be optimal, but the rewriting provides effective preconditioning for the problem: if the optimal function is close to an identity mapping, the optimization becomes easier.
An original RGB image is input, image features are extracted, and the prediction layer outputs a confidence map of the image, from which a prediction mask is obtained via argmax; the difference between the prediction mask and the calibration parameters of the original RGB image is regarded as the segmentation loss. Since segmentation is essentially a dense classification task, a cross-entropy loss function is used in place of a classification loss function or a mean-square loss function.
In the present invention, a multi-label cross-entropy loss function is used to measure the performance of the segmentation network. In essence, segmentation is a dense classification task, so the cross-entropy loss replaces the classification loss and the mean-square loss; it is defined as follows:

$l_S = -\frac{1}{N}\sum_{x=1}^{N}\sum_{i=1}^{C} Y_{xi}\log P_{xi}$    (2)

where N is the total number of pixels, C is the number of channels (classes), P_{xi} is the probability that pixel x is assigned to class i, and Y_{xi} is the corresponding calibration probability.
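A minimal sketch of the segmentation loss of equation (2), assuming the confidence map and calibration map have been flattened to shape (N, C); the normalization by N and the variable names are illustrative assumptions.

```python
import torch

def segmentation_loss(P: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Multi-label cross-entropy l_S over N pixels and C classes.

    P: (N, C) predicted per-pixel class probabilities (rows sum to 1).
    Y: (N, C) one-hot calibration (ground-truth) probabilities.
    """
    eps = 1e-12  # guards log(0)
    return -(Y * torch.log(P + eps)).sum(dim=1).mean()
```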
Step 2: construct the generative adversarial network (GAN). The generator consists of 4 convolutional layers and 4 deconvolutional layers; the discriminator consists of 4 convolutional layers, assisted by ReLU as the activation function. The GAN is pre-trained with the calibration parameters (ground truth) of the original RGB images as input; through pre-training, the GAN learns the distribution relationship between the ground truth and the original RGB images.
A generator consisting of 4 convolutional layers and 4 deconvolutional layers, with the dropout rate set to 0.5 to prevent overfitting, and a discriminator consisting of 4 convolutional layers, using ReLU as the activation function except in the last layer, are defined.
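A sketch of such a generator under the stated constraints (4 convolutional layers, 4 deconvolutional layers, dropout rate 0.5, ReLU activations, and the Tanh output layer described further below); the channel widths, kernel sizes, strides and the placement of dropout on the deconvolutional layers are assumptions.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """4 strided convolutions followed by 4 deconvolutions.

    in_ch = 3 (RGB) + C (confidence-map channels) reflects the assumed
    concatenation of the original image and the confidence map.
    """
    def __init__(self, in_ch: int):
        super().__init__()
        def down(i, o):
            # strided convolution replaces spatial pooling
            return nn.Sequential(
                nn.Conv2d(i, o, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(o), nn.ReLU(inplace=True))
        def up(i, o, last=False):
            layers = [nn.ConvTranspose2d(i, o, kernel_size=4, stride=2, padding=1)]
            if last:
                layers.append(nn.Tanh())  # bounded output; no BN on the output layer
            else:
                layers += [nn.BatchNorm2d(o), nn.ReLU(inplace=True),
                           nn.Dropout(0.5)]  # dropout rate 0.5
            return nn.Sequential(*layers)
        self.net = nn.Sequential(
            down(in_ch, 64), down(64, 128), down(128, 256), down(256, 512),
            up(512, 256), up(256, 128), up(128, 64), up(64, 3, last=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # reconstructed image I_R
```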
The generator and the discriminator are essentially convolutional neural networks (CNNs), but they differ from the conventional CNN in the following three aspects:
the use of stride convolution instead of spatial pooling functions (such as max pooling) allows the network to learn its own spatial downsampling, and the application of this method in generating networks and discriminating networks allows it to learn its own spatial upsampling.
Let f denote the feature map output by a convolutional layer, a three-dimensional array of size W × H × N with H rows, W columns and N channels, and define the three-dimensional array s(f) as p-norm pooling:

$s_{i,j,u}(f) = \Big(\sum_{h=1}^{k}\sum_{w=1}^{k}\big|f_{g(h,w,i,j,u)}\big|^{p}\Big)^{1/p}$    (3)

where g(h, w, i, j, u) = (r·i + h, r·j + w, u) is the mapping from s back to f, k is the pooling size, u indexes the filter, and r is the stride. If r > k, the pooling windows do not overlap; in most CNN networks, however, they usually do overlap, for example with k = 3 and r = 2.
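For reference, PyTorch exposes p-norm pooling directly, and the strided convolution that replaces it is an ordinary convolution with stride r; the values p = 2, k = 3 and r = 2 below match the overlapping-pooling example above, and the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

f = torch.randn(1, 16, 32, 32).abs()  # feature map with N = 16 channels

# p-norm pooling s(f) with p = 2, pooling size k = 3, stride r = 2
s = F.lp_pool2d(f, norm_type=2, kernel_size=3, stride=2)

# learned downsampling: a strided convolution in place of pooling
down = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
print(s.shape, down(f).shape)
```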
Second, the fully connected layers after the topmost convolutional layer are eliminated. Global average pooling improves the stability of the model but reduces the convergence speed; a middle ground, directly connecting the highest convolutional features to the input of the generator and the output of the discriminator respectively, also works well. The first layer of the GAN, which takes the uniform noise distribution Z as input, is called fully connected only because it is a matrix multiplication; its result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. For the discriminator, the last convolutional layer is flattened and fed into a single sigmoid classification function; see the schematic diagram of the generator in fig. 3.
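A corresponding sketch of the discriminator (4 convolutional layers, ReLU everywhere except the last layer, which is flattened into a single sigmoid output); the channel widths and the pooling used before the sigmoid are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """4 strided convolutions; the last feature map is flattened and
    fed to a single sigmoid unit, so D(x) lies in (0, 1)."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        def block(i, o, bn=True):
            layers = [nn.Conv2d(i, o, kernel_size=4, stride=2, padding=1)]
            if bn:
                layers.append(nn.BatchNorm2d(o))  # no BN on the input layer (see below)
            layers.append(nn.ReLU(inplace=True))
            return nn.Sequential(*layers)
        self.features = nn.Sequential(
            block(in_ch, 64, bn=False), block(64, 128),
            block(128, 256), block(256, 512))
        self.classify = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classify(self.features(x))
```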
The GAN is pre-trained with the calibration parameters of the original RGB images as input. Through pre-training, the GAN learns the distribution relationship between the calibration parameters and the original RGB images, so that the output of the prediction layer in the basic image semantic segmentation network shares the same distribution as the original RGB image; pre-training also reduces the difference between the reconstructed image subsequently output by the generator and the original image.
There are two points to note here:
1. batch Normalization (BN). The batch normalization operation is performed by normalizing the inputs to each cell to 0 mean and unit variance, which helps to deal with training problems caused by poor initialization and helps the gradient flow to deeper networks. This demonstrates that it is critical to have a deep layer generator to start learning, preventing the generator from folding all samples to one point, which is a common failure model in GAN, however, employing BN directly for all layers can lead to sample oscillation and model instability. This is avoided by not employing BN for the output layer of the generator and the input layer of the discriminator.
During training, once the parameters of a layer are updated, the distribution of that layer's output (the input of the next layer) changes, which makes training complicated: a small learning rate and good initial weights must be used, training is slow, and it becomes difficult to train with saturating activation functions (such as sigmoid, which saturates on both sides). We call this phenomenon "internal covariate shift", and it can be solved by normalizing the inputs of the network layers.
Whenever the network's parameters are updated during training, the input distribution of every layer except the input layer keeps changing (the input layer's data is normalized per sample beforehand), because updates to the parameters of earlier layers change the input distribution of later layers. Take the second layer of the network as an example: its input is computed from the parameters and input of the first layer, and since the first layer's parameters change throughout training, the distribution of the input data of every subsequent layer necessarily changes as well. This change in the data distribution of the intermediate layers during training is the "internal covariate shift" mentioned above, and the BN algorithm was designed to solve exactly this problem.
2. The generator uses ReLU as the activation function, and its output layer uses the Tanh activation function. Using a bounded activation function allows the model to learn more quickly to saturate and to cover the color space of the training distribution. In the discriminator, ReLU is used as the activation function except in the last layer.
Step 3: the confidence map output by the prediction layer in step 1 and the original RGB image are used together as the input of the generator in the GAN, yielding a reconstructed image corresponding to the original image; owing to the pre-training of the GAN in step 2, the difference between the reconstructed image and the original image is greatly reduced. The reconstructed image and the original image are then used together as the input of the discriminator, and the final output image is obtained through further feature extraction and classification.
Step 4: define the loss functions. The confidence map output by the prediction layer of the basic image segmentation network is passed through argmax to obtain a prediction mask, in which each value represents the category of the corresponding pixel of the input image. The difference between this mask and the calibration parameters of the original image is defined as the segmentation loss; the discrimination loss is computed from the discriminator's outputs on the original image and on the reconstructed image output by the generator; and the difference between the reconstructed image and the original image is defined as the content loss. The segmentation loss, the discrimination loss and the content loss together form the total loss function of the network.
In the min-max optimization process, the discriminator can be regarded as supervising the generator: the generator outputs a reconstructed image I_R corresponding to the original image, and the discriminator distinguishes between the original image and the reconstructed image I_R. The generator and the discriminator of the generative adversarial network play the following min-max game over the loss:

$\min_G \max_D \; \log(D(I)) + \log(1 - D(G(p_{seg})))$    (4)

The input of the generator is the prediction-layer output p_seg of the basic image segmentation network together with the original image I, θ is the activation function, and the output is the reconstructed image I_R; the generator's operation can be written as G(p_seg). The input of the discriminator is the reconstructed image I_R and the original image I, which it learns to distinguish.
The basic image segmentation network is used to obtain the confidence map p_seg ∈ R^{C·w·h}, where C is the number of classes of the data set and w and h are the width and height of the confidence map. The basic segmentation of the original image is then obtained by taking argmax over the prediction-layer output, and the difference between it and the calibration parameters is regarded as the segmentation loss l_S; see equation (2).
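Concretely, the prediction mask follows from an argmax over the class dimension of the confidence map; the shapes below are illustrative (C = 21 is an assumed class count).

```python
import torch

p_seg = torch.softmax(torch.randn(1, 21, 128, 128), dim=1)  # (B, C, h, w)
pred_mask = p_seg.argmax(dim=1)  # (B, h, w): a class index per pixel
```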
The original image and the reconstructed image output by the generator are used together as the input of the discriminator to obtain the output image; the adversarial term computed from the discriminator's outputs is called the discrimination loss, and the difference between the reconstructed image and the original image is called the content loss.
In the present invention, the total network loss l comprises three parts, namely the segmentation loss l_S, the content loss l_C and the discrimination loss l_A:

$l = l_S + \lambda_1 l_C + \lambda_2 l_A$    (5)

where λ_1 and λ_2 are weight parameters.
The content loss l_C measures the difference between the reconstructed image I_R output by the generator G and the original image I; here the mean square error (MSE) function is used to form l_C:

$l_C = \frac{1}{w\,h}\sum_{i=1}^{w}\sum_{j=1}^{h}\big(I_{ij} - I^{R}_{ij}\big)^{2}$    (6)

where I_{ij} denotes a single pixel of the calibration parameters of the original image and I^R_{ij} denotes the corresponding pixel of the reconstructed image.
The discrimination loss l_A measures the difference between the discriminator D's responses to the original image and to the reconstructed image:

$l_A = \log(D(I)) + \log(1 - D(I_R))$    (7)

where I denotes the original image and I_R denotes the reconstructed image.
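Putting equations (2) and (5)-(7) together, a hedged sketch of the total loss; the helper names and the default weights are assumptions, and l_A is implemented exactly as written in equation (7), leaving the min-max sign handling of adversarial training to the surrounding optimization.

```python
import torch
import torch.nn.functional as F

def total_loss(P, Y, I, I_R, D, lambda1=1.0, lambda2=1.0):
    """l = l_S + lambda1 * l_C + lambda2 * l_A (equation (5)).

    P, Y : (N, C) flattened confidence map and one-hot calibration map
    I    : original image; I_R: reconstructed image from the generator
    D    : discriminator returning probabilities in (0, 1)
    """
    eps = 1e-12
    l_S = -(Y * torch.log(P + eps)).sum(dim=1).mean()      # equation (2)
    l_C = F.mse_loss(I_R, I)                               # equation (6)
    l_A = (torch.log(D(I) + eps)
           + torch.log(1 - D(I_R) + eps)).mean()           # equation (7)
    return l_S + lambda1 * l_C + lambda2 * l_A
```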
Step 5: train the model in an end-to-end manner to obtain the parameters, then input test data and obtain the image semantic segmentation result through the discriminator.
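A minimal end-to-end training sketch for step 5, reusing the modules and the total_loss helper sketched above and assuming a data loader that yields pairs of images and one-hot calibration maps; the optimizer, learning rate, and the collapse of the alternating adversarial updates into a single joint step are simplifying assumptions.

```python
import torch

def train(seg_net, G, D, loader, epochs=10, lr=2e-4, lambda1=1.0, lambda2=1.0):
    """End-to-end training of segmentation network + GAN (step 5).

    loader yields (I, Y): original images (B, 3, h, w) and one-hot
    calibration maps (B, C, h, w).
    """
    params = (list(seg_net.parameters()) + list(G.parameters())
              + list(D.parameters()))
    opt = torch.optim.Adam(params, lr=lr)  # optimizer choice is an assumption
    for _ in range(epochs):
        for I, Y in loader:
            p_seg = seg_net(I)                        # confidence map (B, C, h, w)
            I_R = G(torch.cat([I, p_seg], dim=1))     # reconstructed image
            C = p_seg.size(1)
            P = p_seg.permute(0, 2, 3, 1).reshape(-1, C)
            T = Y.permute(0, 2, 3, 1).reshape(-1, C)
            loss = total_loss(P, T, I, I_R, D, lambda1, lambda2)
            opt.zero_grad()
            loss.backward()
            opt.step()
```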
Claims (3)
1. An image segmentation method based on a generative adversarial network, characterized by comprising the following steps:
(1) constructing an image semantic segmentation network: on the basis of a VGG-16 fully convolutional neural network, residual mapping is used to build more layers, deepening the network while keeping it easy to train; an original RGB image is input, image features are extracted, and a confidence map is output through a prediction layer;
(2) constructing a generative adversarial network, in which the generator consists of 4 convolutional layers and 4 deconvolutional layers and the discriminator consists of 4 convolutional layers, with ReLU as the activation function; the generative adversarial network is pre-trained with the calibration parameters of the original RGB images as input;
(3) using the confidence map output by the prediction layer in step (1) together with the original RGB image as the input of the generator of the generative adversarial network to obtain a reconstructed image corresponding to the original image; then using the reconstructed image and the original image together as the input of the discriminator, and obtaining the final semantic segmentation result through further feature extraction and classification;
(4) defining the loss functions: the confidence map output by the prediction layer of the image semantic segmentation network is passed through argmax to obtain a prediction mask, in which each value represents the category of the corresponding pixel of the input image; the difference between this mask and the calibration parameters of the original image is defined as the segmentation loss, the discrimination loss is computed from the discriminator's outputs on the original image and on the reconstructed image output by the generator, and the difference between the reconstructed image and the original image is defined as the content loss; the segmentation loss, discrimination loss and content loss together form the total loss function of the network;
(5) training the whole model, formed by the image semantic segmentation network and the generative adversarial network, in an end-to-end manner to obtain the parameters, then inputting test data and obtaining the image semantic segmentation result through the discriminator;
the step (4) comprises the following steps:
(41) the segmentation loss represents the difference between the prediction mask, obtained by applying argmax to the output of the prediction layer of the basic image semantic segmentation network, and the calibration parameters of the original RGB image, and is defined as:

$l_S = -\frac{1}{N}\sum_{x=1}^{N}\sum_{i=1}^{C} Y_{xi}\log P_{xi}$

where N is the total number of pixels, C is the number of channels, P_{xi} is the probability that pixel x is assigned to class i, and Y_{xi} is the corresponding calibration probability;
(42) the discrimination loss is computed from the discriminator's outputs on the original RGB image and on the reconstructed image output by the generator, and is defined as:

$l_A = \log(D(I)) + \log(1 - D(I_R))$

where I denotes the original image and I_R denotes the reconstructed image;
(43) the content loss represents the difference between the generator output and the original image, and is defined as:

$l_C = \frac{1}{w\,h}\sum_{i=1}^{w}\sum_{j=1}^{h}\big(I_{ij} - I^{R}_{ij}\big)^{2}$

where I_{ij} denotes a single pixel of the calibration parameters of the original image and I^R_{ij} denotes the corresponding pixel of the reconstructed image;
(44) the overall loss function of the network is:

$l = l_S + \lambda_1 l_C + \lambda_2 l_A$

where λ_1 and λ_2 are weight parameters.
2. The image segmentation method based on a generative adversarial network according to claim 1, wherein the step (1) comprises the following steps:
(11) based on the VGG-16 fully convolutional neural network, the number of network layers is deepened by residual mapping to extract higher-level features; instead of the traditional approach of directly stacking layers to fit the desired underlying mapping H(x), the stacked layers fit the residual mapping F(x) = H(x) - x, so that the original mapping becomes F(x) + x:

$y = F(x, \{W_i\}) + x$

where x denotes the input, y denotes the output, and F(x, {W_i}) is the target residual mapping function to be learned;
(12) inputting an original RGB image, extracting image features, and outputting a confidence map of pixel classes through the prediction layer; the difference between this confidence map and the calibration parameters of the original image is taken as the segmentation loss, and a cross-entropy loss function is used in place of a classification loss function or a mean-square loss function:

$l_S = -\frac{1}{N}\sum_{x=1}^{N}\sum_{i=1}^{C} Y_{xi}\log P_{xi}$

where N is the total number of pixels, C is the number of channels (classes), P_{xi} is the probability that pixel x is assigned to class i, and Y_{xi} is the corresponding calibration probability.
3. The image segmentation method based on a generative adversarial network according to claim 1, wherein the step (2) comprises the following steps:
(21) defining a generator and a discriminator: the generator consists of 4 convolutional layers and 4 deconvolutional layers, with the dropout rate set to 0.5 and ReLU as the activation function; the discriminator consists of 4 convolutional layers, with ReLU as the activation function except in the last layer;

the generator and the discriminator use strided convolution in place of the spatial pooling function, and the three-dimensional array s(f) is defined as p-norm pooling:

$s_{i,j,u}(f) = \Big(\sum_{h=1}^{k}\sum_{w=1}^{k}\big|f_{g(h,w,i,j,u)}\big|^{p}\Big)^{1/p}$

where f denotes the feature map output by a convolutional layer, a three-dimensional array of size W × H × N with H rows, W columns and N channels; g(h, w, i, j, u) = (r·i + h, r·j + w, u) is the mapping from s back to f; k is the pooling size, u indexes the filter, and r is the stride; if r > k, the pooling windows do not overlap;
(22) pre-training the generative adversarial network with the calibration parameters of the original RGB images as input.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010816515.4A | 2020-08-14 | 2020-08-14 | Image segmentation method based on a generative adversarial network |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN112085738A | 2020-12-15 |
| CN112085738B | 2022-08-26 |
Family
ID=73728054

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010816515.4A (Active) | Image segmentation method based on a generative adversarial network | 2020-08-14 | 2020-08-14 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN112085738B (en) |
Families Citing this family (7)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112270686B * | 2020-12-24 | 2021-03-16 | 北京达佳互联信息技术有限公司 | Image segmentation model training method, image segmentation method and device, and electronic equipment |
| CN115035139A * | 2021-03-03 | 2022-09-09 | 复旦大学 | Sublingual vein segmentation method based on a generative adversarial network |
| CN113362286B * | 2021-05-24 | 2022-02-01 | 江苏星月测绘科技股份有限公司 | Natural resource element change detection method based on deep learning |
| CN114758364B * | 2022-02-09 | 2022-09-23 | 四川大学 | Industrial Internet of Things scene fusion positioning method and system based on deep learning |
| CN114549454A * | 2022-02-18 | 2022-05-27 | 岳阳珞佳智能科技有限公司 | Online monitoring method and system for chip glue-climbing height on a production line |
| CN114638851B * | 2022-05-17 | 2022-09-27 | 广州优刻谷科技有限公司 | Image segmentation method, system and storage medium based on a generative adversarial network |
| CN115331012B * | 2022-10-14 | 2023-03-24 | 山东建筑大学 | Joint generative image instance segmentation method and system based on zero-shot learning |
Patent Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109949317A * | 2019-03-06 | 2019-06-28 | 东南大学 | Semi-supervised image instance segmentation method based on progressive adversarial learning |
| CN110490884A * | 2019-08-23 | 2019-11-22 | 北京工业大学 | Lightweight network semantic segmentation method based on adversarial learning |
Also Published As

| Publication number | Publication date |
|---|---|
| CN112085738A | 2020-12-15 |
Legal Events

| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| CB02 | Change of applicant information | Address after: 66 Xinmofan Road, Gulou District, Nanjing, Jiangsu, 210003; Applicant after: NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS. Address before: No. 186 Software Avenue, Yuhuatai District, Nanjing, Jiangsu, 210012; Applicant before: NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |