CN111680695A - Semantic segmentation method based on reverse attention model

Semantic segmentation method based on reverse attention model

Info

Publication number
CN111680695A
CN111680695A
Authority
CN
China
Prior art keywords
model
attention
output
semantic segmentation
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010513903.5A
Other languages
Chinese (zh)
Inventor
Li Lei
Dong Zhuoli
Fei Xuan
Mu Yashuang
Li Weidong
Wang Guicai
Shi Shuaifeng
Li Zheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology
Priority to CN202010513903.5A
Publication of CN111680695A
Legal status: Pending

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a semantic segmentation method based on a reverse attention model. First, an image data set is acquired and a training set and a test set are constructed. A deep semantic segmentation network model is then built, comprising a basic network model and a reverse attention model. The features output by the basic network are fed into the reverse attention model to compute attention views, which are applied back, stage by stage, to the low-level output features of the basic semantic segmentation network and fused with the output features of the basic network and their upsampled versions to obtain the final segmentation result. Because the model computes the attention views only from the output features of the basic semantic segmentation network and uses them to guide the merging of the low-level features into those output features, noise in the model's low-level features is suppressed, and the robustness and segmentation accuracy of the semantic segmentation model are improved. In addition, a Gumbel softmax-based loss function is applied to the high-level output features of the basic semantic segmentation model to accelerate model training.

Description

Semantic segmentation method based on reverse attention model
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a semantic segmentation method based on a reverse attention model.
Background
In recent years, deep learning has developed rapidly. Deep learning models, represented by the convolutional neural network (CNN), have revived neural networks after a long period of dormancy and set off a wave of deep learning research in academia and industry.
To overcome the limitation that deep-network-based segmentation models require a fixed input image size, Long and Shelhamer of UC Berkeley proposed the Fully Convolutional Network (FCN) for image semantic segmentation. By replacing fully connected layers with convolutions and mapping the network's dense prediction maps back onto the original image through deconvolution and upsampling, the FCN achieves end-to-end semantic segmentation and can process images of any size. Enlarging the receptive field is an important factor in capturing image semantics, but repeated downsampling easily causes loss of image detail, boundary offset, and similar problems. On this basis, the DeepLab V2, DeepLab V3, DeepLab V3+, PSPNet, U-Net and other improved models were successively proposed, refining the model architecture, the upsampling strategy, the receptive field size, and so on; in particular, the introduction of atrous (dilated) convolution in the DeepLab series effectively improved segmentation accuracy.
At present, semantic segmentation methods based on deep learning largely build on the idea of the fully convolutional network, and segmentation accuracy has improved greatly. However, most algorithms are developed on public data sets, while real scene images often contain many small targets or complex scenes, which challenges existing models. In recent years, researchers have applied attention models to convolutional neural networks, attempting to extract accurate pixel-level attention features from the high-level features of CNNs and thereby improve segmentation from another angle. The attention model in deep learning imitates the attention mechanism of the human brain, assigning larger weights to the objects of interest. Li et al. proposed a pyramid attention network that exploits the global context information of an image for semantic segmentation, combining an attention mechanism with a spatial pyramid to extract accurate, dense features and obtain pixel labels. Fu et al. proposed a scene segmentation network integrating a dual attention mechanism. The most recent semantic segmentation framework is the Gated Shape CNN (GSCNN) proposed by Takikawa et al., which uses a gated convolutional layer to convert information in the regular semantic stream into object shape information, thereby adding a shape stream to the typical network architecture; that is, the gated convolutional layer combines the shape stream with the regular stream to obtain the final segmentation result.
However, this approach increases the number of learnable parameters and the complexity of the model. In addition, cross attention models and convolutional block attention modules are also widely used.
Disclosure of Invention
The invention aims to provide a semantic segmentation method based on a reverse attention model, which is used for improving the performance of image semantic segmentation.
In order to solve the technical problems, the technical scheme of the invention is as follows: a semantic segmentation method based on a reverse attention model comprises the following steps:
(1) acquiring an image data set, and constructing a training set and a test set;
(2) constructing a deep semantic segmentation network model, wherein the deep semantic segmentation network model comprises a basic network model and a reverse attention model;
the basic network model comprises a plurality of convolution modules and an ASPP output module which are sequentially connected, wherein the ASPP output module is used for outputting the output characteristics of the basic network model;
the processing procedure of the reverse attention model is as follows:
1) passing the output features of the basic network model through a convolutional layer to obtain dimension-reduced output features, feeding these into an attention calculation model to obtain a first attention view, point-multiplying the dimension-reduced output features with the first attention view, and then adding the dimension-reduced output features to obtain first output features;
2) upsampling the first output features to obtain at least two features of different scales; computing an attention view for the feature at each scale; point-multiplying each of the resulting attention views with the corresponding feature map of the basic network model; adding each point-multiplied result to the feature at the corresponding scale to obtain output feature maps of different scales; and fusing the output feature maps of different scales with the first output features to obtain the output result;
(3) inputting a training set into the deep semantic segmentation network model for training to obtain a trained deep semantic segmentation network model;
(4) and inputting the test set into the trained deep semantic segmentation network model to obtain an image segmentation result.
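As a rough illustration only, steps (1)-(4) could be organized as in the following PyTorch-style sketch; the patent prescribes no implementation, so the dataset split, optimizer, and all hyperparameters here are assumptions, and the model argument stands in for the deep semantic segmentation network described above.

```python
# Minimal sketch of steps (1)-(4): build the train/test split, train the
# deep semantic segmentation network, then segment the test images.
# The 80/20 split, SGD, and all hyperparameters are assumptions.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split

def run(dataset, model, num_epochs=50, device="cuda"):
    n_train = int(0.8 * len(dataset))                       # step (1): split
    train_set, test_set = random_split(
        dataset, [n_train, len(dataset) - n_train])
    train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=1)

    model = model.to(device)                                # step (2): model
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(num_epochs):                             # step (3): train
        for img, lab in train_loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(img.to(device)), lab.to(device))
            loss.backward()
            opt.step()

    model.eval()                                            # step (4): test
    with torch.no_grad():
        return [model(img.to(device)).argmax(1).cpu()
                for img, _ in test_loader]
```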
The invention has the beneficial effects that:
according to the invention, the attention view iteration of the high-level features of the model is reacted to the low-level features of the model to improve the precision of the semantic segmentation result. Compared with the traditional attention model in the semantic segmentation method, the self-attention model in the traditional method is a mode of calculating the attention view of the output feature of the current layer in the model and then applying the attention view to the output feature of the current layer, the method utilizes the semantic information in the reverse attention model to inhibit the noise in the low-layer feature of the model, namely the attention view is calculated by the last output feature of the basic model, the low-layer feature of the model is merged into the last output feature of the basic model by taking the last output feature as a guide, and the last output feature merges the semantic information of the high layer and the boundary information of the low layer. The reverse attention model in the invention can also accelerate the convergence process of the backward propagation process parameters of the deep convolutional neural network model. In addition, the difference between the model prediction boundary and the marked image boundary is calculated by using a Gumbel softmax-based loss function, so that the deep convolutional neural network model is guided to pay more attention to the boundary information of the image, and the model training speed is accelerated.
Further, the loss functions adopted by the deep semantic segmentation network model comprise a cross entropy loss function and a Gumbel softmax-based loss function.
Further, the attention calculation model combines channel attention and spatial attention, namely:

M(F) = σ(M_c(F) + M_s(F)),

where F ∈ R^{H×W×C} is the input feature, H is the image height, W is the image width, and C is the number of channels; M_c is the channel attention function and M_s is the spatial attention function, the subscripts c and s denoting channel and spatial attention; σ is the sigmoid function. M_c and M_s are defined as follows:

M_c(F) = BN(MLP(AvgPool(F))) = BN(w_1(w_0 AvgPool(F) + b_0) + b_1)

M_s(F) = BN(f_3^{1×1}(f_2^{3×3}(f_1^{3×3}(f_0^{1×1}(F)))))

where MLP denotes a multilayer perceptron (i.e., fully connected layers), AvgPool is the average pooling layer, and BN is batch normalization; w_0, w_1 are network weights and b_0, b_1 are bias parameters, with w_0 ∈ R^{C/r×C}, b_0 ∈ R^{C/r}, w_1 ∈ R^{C×C/r} and b_1 ∈ R^C; r is the channel reduction ratio and C is the number of channels; f_0, f_1, f_2, f_3 are convolution operations, and 1×1 and 3×3 are the convolution kernel sizes.
Further, the Gumbel softmax-based loss function is computed as follows:

1) The first output features are passed through a convolutional layer whose output dimension equals the number of semantic segmentation classes, giving a feature matrix Y ∈ R^{N×c}; for each sample i in the matrix (y_i = [y_{i1}, ..., y_{ic}]), c independent samples ε_1, ..., ε_c are drawn from the uniform distribution U(0, 1);

2) the noise is computed as G_i = -log(-log(ε_i));

3) the randomly generated samples are added to the network model output features Y to obtain the Gumbel distribution: v_i = [y_{i1} + G_1, ..., y_{ic} + G_c];

4) the output class probabilities are computed through a softmax function, yielding classes that approximate one-hot form:

σ_τ(v_i) = exp(v_i / τ) / Σ_j exp(v_j / τ)

where τ is a temperature parameter that controls how closely the Gumbel softmax output approximates one-hot form: the smaller the temperature value, the closer the output is to one-hot form, and conversely the closer it is to a uniform distribution; v_i and v_j are the Gumbel distributions obtained by adding noise to the samples y;

5) Gaussian smoothing is applied to σ_τ(v_i) and its gradient is computed to obtain the boundary information B̂; the labeled image is then converted to one-hot form, Gaussian smoothing is applied, and the gradient information B is computed; the L1 norm between B̂ and B serves as the loss function of the last layer of the basic network model.
Further, when the basic network model is the VGG16-based DeepLab V2, the basic network model comprises five feature extraction blocks and an ASPP module; the five feature extraction blocks comprise, in order, a first convolution module, a first pooling layer, a second convolution module, a second pooling layer, a third convolution module, a third pooling layer, a fourth convolution module, a fourth pooling layer, a fifth convolution module, and a fifth pooling layer; each convolution module comprises 2-3 convolutional layers, and the convolutional layers of the fourth and fifth convolution modules are atrous convolutions; the ASPP module is a pyramid structure with atrous convolution.
Further, the first convolution module in VGG16 comprises two 3×3 convolutional layers with output dimension 64; the second convolution module comprises two 3×3 convolutional layers with output dimension 128; the third convolution module comprises three 3×3 convolutional layers with output dimension 256; the fourth convolution module comprises three 3×3 convolutional layers with output dimension 512; the fifth convolution module comprises three 3×3 convolutional layers with output dimension 512; and the output of the fifth convolution module is connected to the ASPP module.
Further, the deep semantic segmentation model is a VGG16-based DeepLab V3 model.
Drawings
FIG. 1 is a schematic diagram of the reverse attention model-based semantic segmentation method of the present invention, built on DeepLab V2.
Detailed Description
For purposes of illustrating the objects, aspects and advantages of the present invention in detail, the present invention is further described in detail below with reference to specific implementation steps and the accompanying drawings.
The invention provides a semantic segmentation method based on a reverse attention model. A reverse attention model is introduced into a common fully convolutional semantic segmentation model; the attention views of the model's high-level output features are applied back to its low-level features, multi-feature fusion is performed, boundary information is retained in the segmentation result, part of the noise information is filtered out, and the accuracy of the semantic segmentation result is improved.
In the invention, a Gumbel softmax-based loss function is applied to the attention-enhanced last-layer output features of the basic semantic segmentation model. Because the Gumbel softmax output is closer to a one-hot classification result, the boundary error can be computed through this loss function, and the convergence of the model's training parameters can be accelerated.
Specifically, the semantic segmentation method of the present invention is described below, taking the DeepLab V2 network architecture as an example.
It should be noted that the deep semantic segmentation network model in this application is built on the architecture of a traditional, classical semantic segmentation model; the basic network architecture can be VGG16, ResNet, or the like.
As shown in FIG. 1, taking image data from a flat grain warehouse (single-storey barn) as an example, a deep semantic segmentation network model is constructed on the basis of the DeepLab V2 network model; it comprises a VGG16 feature extraction module, an ASPP module, a reverse attention model, a cross entropy loss function, a Gumbel softmax-based loss function, and so on. The VGG16 network architecture comprises five feature extraction blocks; each convolution module comprises 2-3 convolutional layers, each convolutional layer is followed by a ReLU nonlinearity, and each convolution module is followed by a pooling layer. The convolutional layers of the fourth and fifth convolution modules are atrous convolutions, and the ASPP (Atrous Spatial Pyramid Pooling) module is a pyramid structure with atrous convolution.
Specifically, the semantic segmentation method of the present embodiment includes the following steps:
step 1, acquiring an image data set, and constructing a training set and a test set;
step 2, constructing a deep semantic segmentation network model, including a basic network model and a reverse attention model;
wherein the basic network model is a VGG16-based DeepLab V2 model or a VGG16-based DeepLab V3 model;
In this embodiment, taking the VGG16-based DeepLab V2 model as an example, the basic network model comprises five feature extraction blocks and an ASPP output module; the five feature extraction blocks comprise, in order, a first convolution module, a first pooling layer, a second convolution module, a second pooling layer, a third convolution module, a third pooling layer, a fourth convolution module, a fourth pooling layer, a fifth convolution module, and a fifth pooling layer; each convolution module comprises 2-3 convolutional layers, and the convolutional layers of the fourth and fifth convolution modules are atrous convolutions. The first convolution module comprises two 3×3 convolutional layers with output dimension 64; the second comprises two 3×3 convolutional layers with output dimension 128; the third comprises three 3×3 convolutional layers with output dimension 256; the fourth comprises three 3×3 convolutional layers with output dimension 512; the fifth comprises three 3×3 convolutional layers with output dimension 512. The output of the fifth convolution module is connected to the ASPP output module, which is a pyramid structure with atrous convolution and outputs the output features of the basic network model.
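A minimal sketch of such a backbone is given below, assuming PyTorch; the dilation rates, the ASPP rates, and keeping stride-2 pooling only after the first three blocks (with dilation standing in for the last two pools) are assumptions rather than values taken from the patent.

```python
# Hedged sketch of the backbone described above (not the patentee's code):
# a VGG16-style extractor whose fourth and fifth blocks use atrous
# convolution, followed by an ASPP output module. Channel widths follow
# the text (64, 128, 256, 512, 512).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs, dilation=1):
    """n_convs 3x3 conv + ReLU layers; dilation > 1 gives atrous convolution."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3,
                             padding=dilation, dilation=dilation),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated 3x3 branches, summed."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])

    def forward(self, x):
        return sum(b(x) for b in self.branches)

class VGG16DeepLabV2(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.block1 = conv_block(3, 64, 2)        # low-level feature F_l^1
        self.block2 = conv_block(64, 128, 2)      # low-level feature F_l^2
        self.block3 = conv_block(128, 256, 3)
        self.block4 = conv_block(256, 512, 3, dilation=2)   # atrous
        self.block5 = conv_block(512, 512, 3, dilation=4)   # atrous
        self.pool = nn.MaxPool2d(2)
        self.aspp = ASPP(512, num_classes)

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(self.pool(f1))
        f3 = self.block3(self.pool(f2))
        f4 = self.block4(self.pool(f3))           # stride kept at 8 via dilation
        f5 = self.block5(f4)
        return f1, f2, self.aspp(f5)              # low-level features + output
```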
The processing procedure of the reverse attention model is as follows:
1) The output features of the last layer of the basic network model (i.e., the ASPP layer in FIG. 1) are reduced in dimension by a 1×1 convolutional layer to obtain the feature F_h, which is fed into the attention calculation model to obtain the first attention view M(F_h); the view is applied back to the feature F_h to obtain

F_0 = F_h ⊕ (F_h ⊙ M(F_h));

2) taking the output feature F_0 as the high-level feature, two attention views A_i (i = 1, 2) of different scales are computed and applied back to the two low-level output features F_l^i of the basic network model (the output features of the i-th layer); the processed features are then fused with F_0 after scale alignment, and the fused features pass through two 3×3 convolutional layers and one 1×1 convolutional layer to obtain the output result.

If the scales of F_l^i and F_0 differ, F_0 is first upsampled by interpolation, and a 3×3 convolutional layer reduces the number of channels of F_l^i to the same number as F_0; the attention view A_i is then computed from the (upsampled) F_0 and applied to fuse F_l^i with F_0 as follows:

F_out^i = Up(F_0) ⊕ (F_l^i ⊙ A_i)

where ⊙ denotes element-wise multiplication and ⊕ denotes element-wise addition.
The attention calculation model in this embodiment combines channel attention and spatial attention, namely: M(F) = σ(M_c(F) + M_s(F)), where F ∈ R^{H×W×C} is the input feature, H is the image height, W is the image width, and C is the number of channels; M_c is the channel attention function and M_s is the spatial attention function, with subscripts c and s denoting channel and spatial attention; σ is the sigmoid function. M_c and M_s are defined as follows:

M_c(F) = BN(MLP(AvgPool(F))) = BN(w_1(w_0 AvgPool(F) + b_0) + b_1)

M_s(F) = BN(f_3^{1×1}(f_2^{3×3}(f_1^{3×3}(f_0^{1×1}(F)))))

where MLP denotes a multilayer perceptron (i.e., fully connected layers), AvgPool is the average pooling layer, and BN is batch normalization; w_0, w_1 are network weights and b_0, b_1 are bias parameters, with w_0 ∈ R^{C/r×C}, b_0 ∈ R^{C/r}, w_1 ∈ R^{C×C/r} and b_1 ∈ R^C; r is the channel reduction ratio and C is the number of channels; f_0, f_1, f_2, f_3 are convolution operations, and 1×1 and 3×3 are the convolution kernel sizes.
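A sketch of this attention module is given below, assuming PyTorch; the reduction ratio default and the dilation used in the two 3×3 convolutions are assumptions (the text fixes only the kernel sizes).

```python
# Minimal sketch of the attention model M(F) = sigmoid(Mc(F) + Ms(F)):
# channel branch AvgPool -> MLP -> BN, spatial branch
# 1x1 -> 3x3 -> 3x3 -> 1x1 -> BN, as described in the text.
import torch
import torch.nn as nn

class ReverseAttentionView(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # Mc: channel attention (1x1 convs act as the MLP weights w0, w1)
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
            nn.BatchNorm2d(channels))
        # Ms: spatial attention, f0 (1x1) -> f1, f2 (3x3) -> f3 (1x1)
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1),
            nn.Conv2d(channels // r, channels // r, 3, padding=4, dilation=4),
            nn.Conv2d(channels // r, channels // r, 3, padding=4, dilation=4),
            nn.Conv2d(channels // r, 1, 1),
            nn.BatchNorm2d(1))

    def forward(self, F):
        # broadcast-add the (N,C,1,1) and (N,1,H,W) maps, squash to (0,1)
        return torch.sigmoid(self.channel(F) + self.spatial(F))
```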
The low-level output features in this embodiment are the outputs of the first and second convolution modules (as shown in FIG. 1); the output features F_out^1 and F_out^2 are thus obtained as described above.

3) F_out^1, F_out^2 and F_0 are concatenated, and the final features are output through two 3×3×256 convolutional layers, one 1×1×C convolutional layer, and an upsampling layer.
It should be noted that when the two low-level output features of the basic network model are merged into the high-level features, a convolution operation is applied to each branch's output features, which reduces the number of feature channels, lowers the computational complexity, and reduces the influence of noise in the low-level features.
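Pulling steps 1)-3) together, a hedged sketch of the whole reverse attention head might look as follows; it reuses the ReverseAttentionView sketch above, and sharing one attention module across scales, the channel width c, and the interpolation mode are our assumptions.

```python
# Hedged sketch of the full reverse attention head (steps 1-3): compute
# F0 = Fh + Fh * M(Fh) from the ASPP output, apply attention views derived
# only from F0 to the two low-level features, then fuse everything through
# two 3x3 and one 1x1 convolutional layers.
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class ReverseAttentionHead(nn.Module):
    def __init__(self, base_ch, low_chs, num_classes, c=256):
        super().__init__()
        self.reduce_h = nn.Conv2d(base_ch, c, 1)     # 1x1 dimension reduction
        self.att = ReverseAttentionView(c)
        # 3x3 convs shrinking each low-level branch to c channels
        self.reduce_l = nn.ModuleList(
            [nn.Conv2d(lc, c, 3, padding=1) for lc in low_chs])
        self.fuse = nn.Sequential(
            nn.Conv2d(c * (len(low_chs) + 1), 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, f_base, lows):
        fh = self.reduce_h(f_base)
        f0 = fh + fh * self.att(fh)                  # F0 = Fh + Fh * M(Fh)
        outs = [f0]
        for fl, red in zip(lows, self.reduce_l):
            up = Fn.interpolate(f0, size=fl.shape[2:], mode='bilinear',
                                align_corners=False)  # match scale of F_l^i
            a = self.att(up)                          # A_i comes from F0 only
            outs.append(up + red(fl) * a)             # Up(F0) + F_l^i * A_i
        size = outs[-1].shape[2:]                     # finest map
        outs = [Fn.interpolate(o, size=size, mode='bilinear',
                               align_corners=False) for o in outs]
        return self.fuse(torch.cat(outs, dim=1))
```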
Step 3: the training set is input into the constructed deep semantic segmentation network model for training, yielding a trained deep semantic segmentation network model;
it should be noted that, in the model training process, the adopted loss functions include a cross entropy loss function and a loss function based on Gumbel softmax;
specifically, the calculation process of the loss function based on Gumbel softmax is as follows:
1) aiming at the first output feature of the last layer of the basic network model, the feature of which the output dimension is the semantic segmentation class number through the convolution layer is YN×cWhere N is the product of the length and width of the feature matrix, in which each sample i (y) isi=[yi1,...,yic]) C independent samples ∈ each subject to a uniform distribution of U (0, 1) are generated1,...,∈c...;
2) Calculated noise is Gi=-log(-log(∈i));
3) Adding the randomly generated samples and the network model output characteristics Y to obtain Gumbel distribution: v. ofi=[yi1+G1,...,yic+Gc];
4) Calculating the output characteristic probability size through a softmax function so as to obtain a final class approximate to a one-hot form:
Figure BDA0002529283230000074
wherein tau is a temperature parameter, the output degree of the Gumbel softmax approximate to one-hot is controlled, the smaller the temperature coefficient value is, the more approximate to one-hot form the output result is, otherwise, the more approximate to uniform distribution is; v. ofiAnd vjIs the Gumbel distribution obtained after adding noise to the sample y.
5) To sigmaτ(vi) Performing Gaussian smoothing, calculating its gradient to obtain boundary information
Figure BDA0002529283230000081
Then, the marked image is converted into a one-hot form, Gaussian smoothing is carried out, gradient information B is calculated, and calculation is carried out
Figure BDA0002529283230000082
And B as a loss function of the last layer of the underlying network model.
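A minimal sketch of this boundary loss, under our reading of steps 1)-5), is given below; the box blur standing in for the Gaussian smoothing, the finite-difference gradient, and the default temperature are assumptions.

```python
# Minimal sketch of the Gumbel softmax boundary loss (our reading of
# steps 1-5, not the patentee's code): add Gumbel noise to the class
# logits, take a low-temperature softmax so the maps approach one-hot
# form, extract boundaries via smoothing plus a spatial gradient, and
# penalise the L1 difference from the boundaries of the one-hot labels.
import torch
import torch.nn.functional as Fn

def gumbel_boundary_loss(logits, target, num_classes, tau=0.5, blur=3):
    # logits: (N, C, H, W) class scores; target: (N, H, W) integer labels
    eps = torch.rand_like(logits).clamp(1e-10, 1.0 - 1e-7)
    g = -torch.log(-torch.log(eps))                  # G_i = -log(-log(eps_i))
    soft = Fn.softmax((logits + g) / tau, dim=1)     # near one-hot for small tau

    onehot = Fn.one_hot(target, num_classes).permute(0, 3, 1, 2).float()

    def boundaries(x):
        # smooth, then take the gradient magnitude as boundary strength
        x = Fn.avg_pool2d(x, blur, stride=1, padding=blur // 2)
        dy = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs()
        dx = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs()
        return dy[:, :, :, :-1] + dx[:, :, :-1, :]   # crop to a common shape

    return Fn.l1_loss(boundaries(soft), boundaries(onehot))
```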
The cross entropy loss of a single sample is expressed as:

L_CE = -Σ_i ŷ_i log(y_i)

where ŷ is the true one-hot label of the sample x, y_i is the softmax probability value output by the model, and i indexes the i-th entry of the vector; the final cross entropy loss is the average of the sample loss values over each batch.
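During training the two losses are used together; the weighting below is purely illustrative (the patent does not state a weight), and gumbel_boundary_loss refers to the sketch above.

```python
# Illustrative combination of the two training losses; the boundary weight
# 0.1 is an assumption, and gumbel_boundary_loss is the sketch above.
import torch.nn.functional as Fn

def total_loss(logits, target, num_classes):
    ce = Fn.cross_entropy(logits, target)   # batch-averaged cross entropy
    return ce + 0.1 * gumbel_boundary_loss(logits, target, num_classes)
```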
In this embodiment, thanks to the reverse attention model, the result obtained by the semantic segmentation method fuses semantic information with boundary information, and the Gumbel softmax-based boundary loss function guides model training and accelerates the convergence of the model parameters.
It is worth noting that the reverse attention model proposed by the invention can be applied to one or more of the lower layers of the model, and the strategy can be applied to any basic semantic segmentation model.
Step 4: the test set is input into the trained deep semantic segmentation network model to obtain the image segmentation results.
To verify the superiority of the method, a series of comparative experiments based on the reverse attention model (BA for short) were designed on DeepLab V2 + VGG16, and the segmentation results were evaluated using the standard semantic segmentation criteria IoU and F1.
Two data sets were used: the first is the image set collected inside the flat grain warehouse; the second is the VOC2012 image data set, used to further verify the method of the present invention.
First, tests on the images taken inside the single-storey barn:
(1) 120 images collected inside the barn are used as the training image set and are manually annotated to obtain the ground truth; 20 images are used as the validation image set.
the size of each collected image is 1080 × 1920, and the images are shot under different illumination and preset angles.
(2) The annotated training image set is augmented to obtain the training images required for the subsequent training of the deep convolutional neural network model.
The training set is augmented by cropping regions of interest of a specified size with a certain stride, adjusting the image Gamma correction parameters, scaling the image, flipping the image, rotating the image by no more than ±10° relative to the original, adding Gaussian noise, and so on.
The original training image and its labeled image are transformed simultaneously during augmentation; whenever interpolation of the labeled image is required, nearest-neighbour interpolation is used.
Specifically, for any training image and its labeled image from the flat grain warehouse, regions of interest of a specified size are cropped starting from the top-left corner of the image with a certain stride; flipping, Gamma parameter adjustment, rotation within ±10°, Gaussian noise addition and other operations, each with its own parameter set, are then applied to each region. Training image sets close to the target number are finally generated: the final training set contains 5600 images and the validation set 120 images.
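A hedged sketch of such an augmentation pipeline is given below (scaling is omitted for brevity; the crop size, stride, flip probability, Gamma range, and noise level are assumptions, not values from the patent).

```python
# Hedged sketch of the described augmentation: stride-based ROI cropping,
# then random flip, Gamma correction, rotation within +/-10 degrees, and
# Gaussian noise; the label map receives the same geometric transforms
# with nearest-neighbour interpolation.
import random
import numpy as np
import cv2

def augment(image, label, crop=512, stride=256):
    samples = []
    h, w = image.shape[:2]
    for y in range(0, h - crop + 1, stride):
        for x in range(0, w - crop + 1, stride):
            img = image[y:y + crop, x:x + crop].copy()
            lab = label[y:y + crop, x:x + crop].copy()
            if random.random() < 0.5:                       # horizontal flip
                img, lab = img[:, ::-1].copy(), lab[:, ::-1].copy()
            gamma = random.uniform(0.7, 1.4)                # Gamma correction
            img = np.clip(255.0 * (img / 255.0) ** gamma, 0, 255).astype(np.uint8)
            angle = random.uniform(-10, 10)                 # small rotation only
            M = cv2.getRotationMatrix2D((crop / 2, crop / 2), angle, 1.0)
            img = cv2.warpAffine(img, M, (crop, crop))
            lab = cv2.warpAffine(lab, M, (crop, crop),
                                 flags=cv2.INTER_NEAREST)   # nearest for labels
            noise = np.random.normal(0, 5, img.shape)       # Gaussian noise
            img = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
            samples.append((img, lab))
    return samples
```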
In this method, the original training images cannot be rotated by large angles, because the grain surface and the grain loading line have a certain semantic-context relationship.
Table 1 shows the comparative test results of the reverse attention model (BA for short) on DeepLab V2 + VGG16 over the single-storey barn image set. As can be seen from Table 1, adding the reverse attention model to the original model effectively improves its segmentation performance.
TABLE 1. Evaluation of training results on the single-storey barn image set
Second, tests based on the VOC2012 database:
the labeling image set comprises 12081 images and marking images, and is divided into a training image set and a verification image set, wherein the training image set comprises 10582 images, and the verification image set comprises 1499 images.
Table 2 shows the test results of the invention on the VOC2012 data set using DeepLab V2 + VGG16; it can be seen that adding the Gumbel softmax attention model to the original model effectively improves its segmentation performance.
TABLE 2. Evaluation of training results on the VOC2012 image set
Therefore, the semantic segmentation method based on the reverse attention model further improves image segmentation performance.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (7)

1. A semantic segmentation method based on a reverse attention model is characterized by comprising the following steps:
(1) acquiring an image data set, and constructing a training set and a test set;
(2) constructing a deep semantic segmentation network model, wherein the deep semantic segmentation network model comprises a basic network model and a reverse attention model;
the basic network model comprises a plurality of convolution modules and an ASPP output module which are sequentially connected, wherein the ASPP output module is used for outputting the output characteristics of the basic network model;
the processing procedure of the reverse attention model is as follows:
1) passing the output features of the basic network model through a convolutional layer to obtain dimension-reduced output features, feeding these into an attention calculation model to obtain a first attention view, point-multiplying the dimension-reduced output features with the first attention view, and then adding the dimension-reduced output features to obtain first output features;
2) upsampling the first output features to obtain at least two features of different scales; computing an attention view for the feature at each scale; point-multiplying each of the resulting attention views with the corresponding feature map of the basic network model; adding each point-multiplied result to the feature at the corresponding scale to obtain output feature maps of different scales; and fusing the output feature maps of different scales with the first output features to obtain the output result;
(3) inputting a training set into the deep semantic segmentation network model for training to obtain a trained deep semantic segmentation network model;
(4) and inputting the test set into the trained deep semantic segmentation network model to obtain an image segmentation result.
2. The reverse attention model-based semantic segmentation method according to claim 1, wherein the loss functions adopted by the deep semantic segmentation network model comprise a cross entropy loss function and a Gumbel softmax-based loss function.
3. The reverse attention model-based semantic segmentation method according to claim 1, wherein the attention calculation model combines channel attention and spatial attention, namely:

M(F) = σ(M_c(F) + M_s(F)),

where F ∈ R^{H×W×C} is the input feature, H is the image height, W is the image width, and C is the number of channels; M_c is the channel attention function and M_s is the spatial attention function, with subscripts c and s denoting channel and spatial attention; σ is the sigmoid function; M_c and M_s are defined as follows:

M_c(F) = BN(MLP(AvgPool(F))) = BN(w_1(w_0 AvgPool(F) + b_0) + b_1)

M_s(F) = BN(f_3^{1×1}(f_2^{3×3}(f_1^{3×3}(f_0^{1×1}(F)))))

where MLP denotes a multilayer perceptron, i.e., fully connected layers; AvgPool is the average pooling layer; BN is batch normalization; w_0, w_1 are network weights and b_0, b_1 are bias parameters, with w_0 ∈ R^{C/r×C}, b_0 ∈ R^{C/r}, w_1 ∈ R^{C×C/r} and b_1 ∈ R^C; r is the channel reduction ratio and C is the number of channels; f_0, f_1, f_2, f_3 are convolution operations, and 1×1 and 3×3 are the convolution kernel sizes.
4. The reverse attention model-based semantic segmentation method according to claim 2, wherein the Gumbel softmax-based loss function is computed as follows:

(1) the first output features are passed through a convolutional layer whose output dimension equals the number of semantic segmentation classes, giving a feature matrix Y ∈ R^{N×c}; for each sample i in the matrix (y_i = [y_{i1}, ..., y_{ic}]), c independent samples ε_1, ..., ε_c are drawn from the uniform distribution U(0, 1);

(2) the noise is computed as G_i = -log(-log(ε_i));

(3) the randomly generated samples are added to the network model output features Y to obtain the Gumbel distribution: v_i = [y_{i1} + G_1, ..., y_{ic} + G_c];

(4) the output class probabilities are computed through a softmax function, yielding classes that approximate one-hot form:

σ_τ(v_i) = exp(v_i / τ) / Σ_j exp(v_j / τ)

where τ is a temperature parameter that controls how closely the Gumbel softmax output approximates one-hot form: the smaller the temperature value, the closer the output is to one-hot form, and conversely the closer it is to a uniform distribution; v_i and v_j are the Gumbel distributions obtained by adding noise to the samples y;

(5) Gaussian smoothing is applied to σ_τ(v_i) and its gradient is computed to obtain the boundary information B̂; the labeled image is then converted to one-hot form, Gaussian smoothing is applied, and the gradient information B is computed; the L1 norm between B̂ and B serves as the loss function of the last layer of the basic network model.
5. The reverse attention model-based semantic segmentation method according to claim 1, wherein when the basic network model is the VGG16-based DeepLab V2, the basic network model comprises five feature extraction blocks and an ASPP module, the five feature extraction blocks comprising, in order, a first convolution module, a first pooling layer, a second convolution module, a second pooling layer, a third convolution module, a third pooling layer, a fourth convolution module, a fourth pooling layer, a fifth convolution module and a fifth pooling layer; each convolution module comprises 2-3 convolutional layers, and the convolutional layers of the fourth and fifth convolution modules are atrous convolutions; the ASPP module is a pyramid structure with atrous convolution.
6. The reverse attention model-based semantic segmentation method of claim 5, wherein the first convolution module in VGG16 comprises two 3×3 convolutional layers with output dimension 64; the second convolution module comprises two 3×3 convolutional layers with output dimension 128; the third convolution module comprises three 3×3 convolutional layers with output dimension 256; the fourth convolution module comprises three 3×3 convolutional layers with output dimension 512; the fifth convolution module comprises three 3×3 convolutional layers with output dimension 512; and the output of the fifth convolution module is connected to the ASPP module.
7. The reverse attention model-based semantic segmentation method according to claim 1, wherein the deep semantic segmentation model is a VGG16-based DeepLab V3 model.
CN202010513903.5A 2020-06-08 2020-06-08 Semantic segmentation method based on reverse attention model Pending CN111680695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010513903.5A CN111680695A (en) 2020-06-08 2020-06-08 Semantic segmentation method based on reverse attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010513903.5A CN111680695A (en) 2020-06-08 2020-06-08 Semantic segmentation method based on reverse attention model

Publications (1)

Publication Number Publication Date
CN111680695A true CN111680695A (en) 2020-09-18

Family

ID=72454054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010513903.5A Pending CN111680695A (en) 2020-06-08 2020-06-08 Semantic segmentation method based on reverse attention model

Country Status (1)

Country Link
CN (1) CN111680695A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381097A (en) * 2020-11-16 2021-02-19 西南石油大学 Scene semantic segmentation method based on deep learning
CN112488115A (en) * 2020-11-23 2021-03-12 石家庄铁路职业技术学院 Semantic segmentation method based on two-stream architecture
CN112580654A (en) * 2020-12-25 2021-03-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Semantic segmentation method for ground objects of remote sensing image
CN112613517A (en) * 2020-12-17 2021-04-06 深圳大学 Endoscopic instrument segmentation method, endoscopic instrument segmentation apparatus, computer device, and storage medium
CN112801104A (en) * 2021-01-20 2021-05-14 吉林大学 Image pixel level pseudo label determination method and system based on semantic segmentation
CN113052860A (en) * 2021-04-02 2021-06-29 首都师范大学 Three-dimensional cerebral vessel segmentation method and storage medium
CN113298154A (en) * 2021-05-27 2021-08-24 安徽大学 RGB-D image salient target detection method
CN113392711A (en) * 2021-05-19 2021-09-14 中国科学院声学研究所南海研究站 Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN113435411A (en) * 2021-07-26 2021-09-24 中国矿业大学(北京) Improved DeepLabV3+ based open pit land utilization identification method
CN113486897A (en) * 2021-07-29 2021-10-08 辽宁工程技术大学 Semantic segmentation method for convolution attention mechanism up-sampling decoding
CN113537228A (en) * 2021-07-07 2021-10-22 中国电子科技集团公司第五十四研究所 Real-time image semantic segmentation method based on depth features
CN113643311A (en) * 2021-06-28 2021-11-12 清华大学 Image segmentation method and device for boundary error robustness
CN114140437A (en) * 2021-12-03 2022-03-04 杭州电子科技大学 Fundus hard exudate segmentation method based on deep learning
CN114140469A (en) * 2021-12-02 2022-03-04 北京交通大学 Depth hierarchical image semantic segmentation method based on multilayer attention
CN115587967A (en) * 2022-09-06 2023-01-10 杭州电子科技大学 Fundus image optic disk detection method based on HA-UNet network
CN117079142A (en) * 2023-10-13 2023-11-17 昆明理工大学 Anti-attention generation countermeasure road center line extraction method for automatic inspection of unmanned aerial vehicle
CN117236433A (en) * 2023-11-14 2023-12-15 山东大学 Intelligent communication perception method, system, equipment and medium for assisting blind person life

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110210485A (en) * 2019-05-13 2019-09-06 常熟理工学院 The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN110458165A (en) * 2019-08-14 2019-11-15 贵州大学 A kind of natural scene Method for text detection introducing attention mechanism
US20200134380A1 (en) * 2018-10-30 2020-04-30 Beijing Horizon Robotics Technology Research And Development Co., Ltd. Method for Updating Neural Network and Electronic Device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200134380A1 (en) * 2018-10-30 2020-04-30 Beijing Horizon Robotics Technology Research And Development Co., Ltd. Method for Updating Neural Network and Electronic Device
CN110210485A (en) * 2019-05-13 2019-09-06 常熟理工学院 The image, semantic dividing method of Fusion Features is instructed based on attention mechanism
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110458165A (en) * 2019-08-14 2019-11-15 贵州大学 A kind of natural scene Method for text detection introducing attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shen Wenxiang; Qin Pinle; Zeng Jianchao: "Indoor crowd detection network based on multi-level features and hybrid attention mechanism", Journal of Computer Applications, no. 12, 10 December 2019 (2019-12-10), pages 3496-3502 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381097A (en) * 2020-11-16 2021-02-19 西南石油大学 Scene semantic segmentation method based on deep learning
CN112488115A (en) * 2020-11-23 2021-03-12 石家庄铁路职业技术学院 Semantic segmentation method based on two-stream architecture
CN112488115B (en) * 2020-11-23 2023-07-25 石家庄铁路职业技术学院 Semantic segmentation method based on two-stream architecture
CN112613517A (en) * 2020-12-17 2021-04-06 深圳大学 Endoscopic instrument segmentation method, endoscopic instrument segmentation apparatus, computer device, and storage medium
CN112613517B (en) * 2020-12-17 2022-02-18 深圳大学 Endoscopic instrument segmentation method, endoscopic instrument segmentation apparatus, computer device, and storage medium
CN112580654A (en) * 2020-12-25 2021-03-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Semantic segmentation method for ground objects of remote sensing image
CN112801104B (en) * 2021-01-20 2022-01-07 吉林大学 Image pixel level pseudo label determination method and system based on semantic segmentation
CN112801104A (en) * 2021-01-20 2021-05-14 吉林大学 Image pixel level pseudo label determination method and system based on semantic segmentation
CN113052860A (en) * 2021-04-02 2021-06-29 首都师范大学 Three-dimensional cerebral vessel segmentation method and storage medium
CN113052860B (en) * 2021-04-02 2022-07-19 首都师范大学 Three-dimensional cerebral vessel segmentation method and storage medium
CN113392711A (en) * 2021-05-19 2021-09-14 中国科学院声学研究所南海研究站 Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN113392711B (en) * 2021-05-19 2023-01-06 中国科学院声学研究所南海研究站 Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN113298154B (en) * 2021-05-27 2022-11-11 安徽大学 RGB-D image salient object detection method
CN113298154A (en) * 2021-05-27 2021-08-24 安徽大学 RGB-D image salient target detection method
CN113643311B (en) * 2021-06-28 2024-04-09 清华大学 Image segmentation method and device with robust boundary errors
CN113643311A (en) * 2021-06-28 2021-11-12 清华大学 Image segmentation method and device for boundary error robustness
CN113537228A (en) * 2021-07-07 2021-10-22 中国电子科技集团公司第五十四研究所 Real-time image semantic segmentation method based on depth features
CN113435411A (en) * 2021-07-26 2021-09-24 中国矿业大学(北京) Improved DeepLabV3+ based open pit land utilization identification method
CN113486897A (en) * 2021-07-29 2021-10-08 辽宁工程技术大学 Semantic segmentation method for convolution attention mechanism up-sampling decoding
CN114140469A (en) * 2021-12-02 2022-03-04 北京交通大学 Depth hierarchical image semantic segmentation method based on multilayer attention
CN114140469B (en) * 2021-12-02 2023-06-23 北京交通大学 Depth layered image semantic segmentation method based on multi-layer attention
CN114140437A (en) * 2021-12-03 2022-03-04 杭州电子科技大学 Fundus hard exudate segmentation method based on deep learning
CN115587967A (en) * 2022-09-06 2023-01-10 杭州电子科技大学 Fundus image optic disk detection method based on HA-UNet network
CN115587967B (en) * 2022-09-06 2023-10-10 杭州电子科技大学 Fundus image optic disk detection method based on HA-UNet network
CN117079142A (en) * 2023-10-13 2023-11-17 昆明理工大学 Anti-attention generation countermeasure road center line extraction method for automatic inspection of unmanned aerial vehicle
CN117079142B (en) * 2023-10-13 2024-01-26 昆明理工大学 Anti-attention generation countermeasure road center line extraction method for automatic inspection of unmanned aerial vehicle
CN117236433A (en) * 2023-11-14 2023-12-15 山东大学 Intelligent communication perception method, system, equipment and medium for assisting blind person life
CN117236433B (en) * 2023-11-14 2024-02-02 山东大学 Intelligent communication perception method, system, equipment and medium for assisting blind person life

Similar Documents

Publication Publication Date Title
CN111680695A (en) Semantic segmentation method based on reverse attention model
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
CN110533041B (en) Regression-based multi-scale scene text detection method
CN109035267B (en) Image target matting method based on deep learning
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN113763442A (en) Deformable medical image registration method and system
CN115082293A (en) Image registration method based on Swin transducer and CNN double-branch coupling
CN115731441A (en) Target detection and attitude estimation method based on data cross-modal transfer learning
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN112884668A (en) Lightweight low-light image enhancement method based on multiple scales
CN116824239A (en) Image recognition method and system based on transfer learning and ResNet50 neural network
CN111652273A (en) Deep learning-based RGB-D image classification method
CN111860683A (en) Target detection method based on feature fusion
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN114267025A (en) Traffic sign detection method based on high-resolution network and light-weight attention mechanism
CN114998566A (en) Interpretable multi-scale infrared small and weak target detection network design method
CN114359297A (en) Attention pyramid-based multi-resolution semantic segmentation method and device
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN115049945B (en) Unmanned aerial vehicle image-based wheat lodging area extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination