CN112329778A - Semantic segmentation method for introducing feature cross attention mechanism - Google Patents

Semantic segmentation method for introducing feature cross attention mechanism

Info

Publication number
CN112329778A
CN112329778A (application CN202011144252.3A)
Authority
CN
China
Prior art keywords
features
attention
module
feature
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011144252.3A
Other languages
Chinese (zh)
Inventor
彭思齐 (Peng Siqi)
曾海波 (Zeng Haibo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University
Priority to CN202011144252.3A
Publication of CN112329778A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention addresses the problems that the Deeplabv3+ model segments image target edges inaccurately, fits image features slowly, and cannot effectively exploit attention information. A feature cross attention module is added to the model; the cross attention network consists of two branches and a feature cross attention module. The shallow branch extracts low-level spatial information, and the deep branch extracts high-level contextual features, so that important features are extracted more finely. The method designs and implements the connection between the feature cross attention mechanism and the encoding module of Deeplabv3+, feeding the output features of the Deeplabv3+ encoding module into the feature cross attention module for convolution operations that recalibrate the original features. The decoding module of Deeplabv3+ acquires the spatial features and channel features from the two branches respectively, and then fuses the acquired features to obtain the more important ones. The improved model is verified on the Pascal Voc2012 dataset, and the results show that the model with the added feature cross attention mechanism effectively remedies the defects of the original model, segments targets more finely, and better addresses problems such as rough segmentation boundaries.

Description

Semantic segmentation method for introducing feature cross attention mechanism
Technical Field
The invention belongs to the field of semantic segmentation, relates to a semantic segmentation model that introduces an attention mechanism, and particularly relates to a model that introduces a dual-attention-mechanism method into Deeplabv3+.
Background
At present, convolutional neural networks, with their rich representational capability, have greatly advanced visual tasks, and image semantic segmentation is one of the key tasks driving computer vision. As a classic computer vision problem (alongside image classification and object recognition and detection), image semantic segmentation has long been an important research direction; its essence is to classify the pixels in a picture. It is widely applied in related fields such as autonomous driving, land-cover classification, cloud detection, and medical detection. Among image semantic segmentation methods evaluated on the Pascal Voc2012 dataset, the currently popular models include the FCN, U-Net, and Deeplab families; they suffer from problems such as insufficient refinement of segmentation-result edges, rough boundaries in parts of the segmented image, and failure to fully exploit relationships between long-distance pixel classes.
Disclosure of Invention
In view of the above problems, the present invention proposes a semantic segmentation method based on a cross attention mechanism to solve, or at least partially mitigate, the above drawbacks of existing semantic segmentation methods.
The semantic segmentation method with the cross attention mechanism provided by the invention comprises the following steps:
the invention provides a model that introduces a cross attention mechanism into Deeplabv3+, wherein the cross attention model consists of a spatial attention module and a channel attention module;
the channel attention module extracts pixel information extracted from a high-layer convolution layer in the Deeplabv3+ model and is used for extracting deep spatial information;
learning the feature weight by using a space attention module through a network according to the loss;
the spatial attention module is used for endowing important feature map with large weight, and the model is trained in a mode of invalid or unimportant feature map with small weight to achieve better result;
the method for extracting the attention of the feature channel is basically similar to SEnet, a maxpool feature extraction method is added on the basis of SEnet, and the final output result is obtained by adding the average pooling result and the maximum pooling result;
the method for extracting the characteristics with the Avgpool is the same as the method for extracting the Avgpool in SENEt;
when the two pools are used, a shared MLP is used for attention inference to save parameters, and the two aggregated channel features are located in the same semantic embedding space;
each channel of the channel attention module features represents a special detector, and the channel attention is concerned about what features are meaningful;
in order to summarize spatial characteristics, the invention adopts two modes of global average pooling and maximum pooling to respectively utilize different information, and the operation process of the channel attention module is as follows.
$M_c(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$
wherein MLP is a multilayer perceptron and σ is the sigmoid activation function; the input is a feature F of size H × W × C;
first, global average pooling and global max pooling over the spatial dimensions produce two 1 × 1 × C channel descriptions;
each description is fed into a two-layer neural network, in which the first layer has C/r neurons and a ReLU activation function;
the second layer has C neurons, and the two-layer network is shared between the two descriptions;
the two resulting features are added, and the weight coefficient Mc is obtained through a sigmoid activation function;
finally, the weight coefficient is multiplied by the original feature F to obtain the new, rescaled feature;
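For concreteness, the following is a minimal PyTorch sketch of this channel attention operation; the class name and the reduction ratio r = 16 are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as described above: a shared two-layer MLP over
    globally average-pooled and max-pooled descriptors, summed and passed
    through a sigmoid to produce the per-channel weight coefficient Mc."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP (C -> C/r -> C), implemented with 1x1 convolutions so
        # the same weights serve both pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # 1x1xC avg-pool branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # 1x1xC max-pool branch
        mc = self.sigmoid(avg + mx)                               # weight coefficient Mc
        return x * mc                                             # rescale the original feature F
```

Sharing the MLP between the two pooled descriptors is what saves parameters and keeps the two aggregated channel features in the same embedding space, as described above.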
the spatial attention module extracts pixel content information extracted from a low-layer convolution layer in a Deeplabv3+ model and is used for extracting shallow spatial information;
spatial attention mechanisms are where meaningful features are of concern;
a spatial attention module, giving a H × W × C feature F;
firstly, respectively carrying out average pooling and maximum pooling of one channel dimension to obtain two HxWx1 channel descriptions, and splicing the two descriptions together according to the channel;
using the extracted features to get the right through a convolution layer with convolution kernel size of 7 multiplied by 7, the activation function is Sigmoid, and obtaining the weight coefficient Ms;
finally multiplying the weighting coefficient by the characteristic F' to obtain a new characteristic;
the operation formula of the space attention module is as follows;
$M_s(F) = \sigma\left(f^{7\times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\right)$
wherein σ is the sigmoid activation function, f^{7×7} is a convolutional layer with a 7 × 7 kernel, and [·;·] denotes concatenation of feature maps along the channel dimension.
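A corresponding PyTorch sketch of the spatial attention operation (again, names are illustrative assumptions; the 7 × 7 kernel follows the description above):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention as described above: channel-wise average and max
    pooling concatenated into an HxWx2 map, a 7x7 convolution, and a
    sigmoid producing the spatial weight map Ms."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(x, dim=1, keepdim=True)    # HxWx1 average over channels
        mx, _ = torch.max(x, dim=1, keepdim=True)   # HxWx1 max over channels
        ms = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # weight map Ms
        return x * ms                                # reweight the input feature
```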
The feature cross attention module of the invention extracts shallow spatial information with the spatial attention module and then captures contextual information using the channel attention mechanism;
the output features of the two branches differ: the high-level features mainly contain category information, so the channel attention module can be used to extract features from the high-level information, while the low level corresponds to more spatial information;
the feature cross attention module cannot directly sample and fuse the low-level features, so the spatial attention module is used to perform feature extraction on them;
in the added FCA module, the high-level features of the channel attention module provide context information, while the low-level features extracted by the spatial attention module refine the pixel localization;
first, the output features of the two branches are cascaded, and the cascaded features undergo convolution, batch normalization, and ReLU processing;
then the fused features and the output of the spatial branch processed by the SA module are used as input to help refine localization;
after the SA module's features undergo convolution, normalization, and a sigmoid nonlinearity, they are multiplied by the fused features;
the output of the spatial attention block, together with the context features of the context branch, is applied to a channel attention block, and the context features are compressed along the spatial dimensions by global average pooling and max pooling to obtain two vectors;
the two vectors are passed through a shared fully connected layer and a sigmoid operator to generate an attention map, followed finally by convolution, batch normalization, and ReLU fusion;
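The following is a hedged PyTorch sketch of how the described FCA fusion might be assembled; it assumes the high-level (context) feature has already been upsampled to the low-level feature's spatial resolution, and all module and channel names are illustrative assumptions rather than the patent's exact implementation:

```python
import torch
import torch.nn as nn

class FeatureCrossAttention(nn.Module):
    """Sketch of the FCA fusion described above: concatenate the two branch
    outputs and fuse them (conv + BN + ReLU), gate the fused feature with the
    spatial branch (conv + BN + sigmoid), re-weight the gated feature with a
    channel attention map (shared FC over avg/max-pooled vectors), add back
    the fused feature, and apply a final conv + BN + ReLU."""
    def __init__(self, low_ch: int, high_ch: int, out_ch: int, reduction: int = 16):
        super().__init__()
        self.fuse = nn.Sequential(                 # cascade + conv/BN/ReLU
            nn.Conv2d(low_ch + high_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.sa_gate = nn.Sequential(              # spatial branch: conv + BN + sigmoid
            nn.Conv2d(low_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.Sigmoid())
        self.ca_fc = nn.Sequential(                # shared FC layer for channel attention
            nn.Linear(out_ch, out_ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(out_ch // reduction, out_ch))
        self.out = nn.Sequential(                  # final conv + BN + ReLU fusion
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([low, high], dim=1))  # concatenate branch outputs
        spatial = self.sa_gate(low) * fused               # spatial gating of fused feature
        b, c, _, _ = spatial.shape
        avg = self.ca_fc(spatial.mean(dim=(2, 3)))        # globally average-pooled vector
        mx = self.ca_fc(spatial.amax(dim=(2, 3)))         # globally max-pooled vector
        attn = torch.sigmoid(avg + mx).view(b, c, 1, 1)   # channel attention map
        return self.out(spatial * attn + fused)           # re-weight, add, final fusion
```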
the semantic segmentation model of the feature cross attention module has better effects of extracting the target edge information and the content information:
1. the introduction of a characteristic cross attention mechanism into the DeeplabV3+ model is proposed, and the Deeplabv3+ model based on the cross attention mechanism is proposed.
2. The significance degree of pixel features is distinguished by emphasizing meaningful feature information on channel and space dimensions and carrying out convolution operation to redistribute weights, the more important the pixel features are, the more important the obtained weights are, and then the segmentation of the image is obtained through the joint learning of a main branch and a cross attention module.
3. The attention mechanism is a simple and effective lightweight module, and adding this module adds little additional computation.
4. After Deeplabv3+ is introduced, due to the fact that important information is selectively concerned by an attention mechanism, network areas of the improved network are divided more accurately, ideal target areas can be divided, the edges of objects can be accurately divided, and the problem that semantic division and marking are unreasonable is effectively solved.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a schematic model diagram of Deeplabv3+
FIG. 2 is a block diagram of a channel attention mechanism
FIG. 3 is a block diagram of a space-based mechanism
FIG. 4 is a block diagram of a feature cross attention mechanism
Fig. 5 is a schematic diagram of the Deeplabv3+ model with the cross attention mechanism
Figs. 6, 7 and 8 are graphs of the test results of the Deeplabv3+ model with the introduced feature cross attention mechanism in the practice of the present invention.
Detailed Description
In order to make the purpose and technical solution of the present invention more clearly understood, the following detailed description is made with reference to the accompanying drawings and examples, and the application principle of the present invention is described in detail.
The embodiment of the invention provides a Deeplabv3+ model that introduces a feature cross attention mechanism. Fig. 5 shows a schematic diagram of the modified Deeplabv3+ model, and the specific operation flow is shown in Fig. 5.
The Deeplabv3+ model is built on the Xception network; the final fully connected layer is removed first to achieve end-to-end output.
The last two pooling layers of the Xception network are removed: convolution itself has translational invariance, and pooling layers further enhance this property of the network, since pooling is inherently a process of blurring location. Semantic segmentation is an end-to-end problem in which each pixel must be classified accurately and pixel position matters; using too much pooling makes the feature layer too small and the included features too sparse, which hurts semantic segmentation, so part of the pooling must be removed.
This increases the density of the features and enlarges the receptive field, and the classification precision is improved by using a conditional random field (CRF).
An ASPP (Atrous Spatial Pyramid Pooling) structure is adopted; it applies hole (atrous) convolution operations with different sampling rates to the input feature map in parallel, i.e., it captures image context information at multiple scales.
ASPP is improved by using a 1 × 1 convolution, that is, when the rate becomes large, a degenerate version of the 3 × 3 convolution is used in its place to reduce the number of parameters; another improvement is to add image-level output, which may be called global pooling, to supplement global features.
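A sketch of a typical ASPP block consistent with this description; the sampling rates (6, 12, 18) and the 256 output channels are common Deeplabv3+ defaults assumed here, not values stated in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel atrous convolutions at
    several sampling rates, a 1x1 convolution, and an image-level (global
    pooling) branch, concatenated and projected back to out_ch channels."""
    def __init__(self, in_ch: int, out_ch: int = 256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([nn.Sequential(   # 1x1 convolution branch
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))])
        for r in rates:  # 3x3 atrous convolutions with different sampling rates
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
        self.image_pool = nn.Sequential(  # global pooling branch for global features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.ReLU(inplace=True))
        self.project = nn.Sequential(     # fuse all branches back to out_ch
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```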
All convolutional and pooling layers are replaced with depthwise separable convolutions, with BN and ReLU applied after each 3 × 3 depthwise separable convolution.
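A minimal sketch of one such 3 × 3 depthwise separable convolution block with BN and ReLU:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise separable convolution with BN and ReLU, as used in
    place of standard convolutions in the Xception backbone."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, dilation: int = 1):
        super().__init__()
        # Depthwise step: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=dilation, dilation=dilation,
                                   groups=in_ch, bias=False)
        # Pointwise step: 1x1 convolution that mixes channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```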
The feature-channel attention extraction method is basically similar to that of SENet, with a maxpool feature extraction step added on top of SENet; the final output is obtained by adding the average-pooling and max-pooling results. The feature extraction with Avgpool is the same as the Avgpool extraction in SENet. Furthermore, when the two pooling operations are used, a shared MLP performs the attention inference to save parameters, and the two aggregated channel features both lie in the same semantic embedding space. Each channel of the channel attention module's features acts as a dedicated detector, and channel attention is concerned with which features are meaningful. In order to summarize the spatial features, both global average pooling and max pooling are adopted so as to exploit different information; the operation of the channel attention module is shown in the following formula.
$M_c(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$
The spatial attention mechanism of the invention is concerned with where the meaningful features are. Given an H × W × C feature F, average pooling and max pooling along the channel dimension are first performed to obtain two H × W × 1 descriptions, which are concatenated along the channel. The concatenated feature is then passed through a convolutional layer with a 7 × 7 kernel and a sigmoid activation function, yielding the weight coefficient Ms. Finally, multiplying the weight coefficient by the feature F' gives the new feature. The operation is shown in the following formula.
$M_s(F) = \sigma\left(f^{7\times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\right)$
The Deeplabv3+ model with the added feature cross attention mechanism is implemented mainly as an encoder and a decoder: the encoder comprises the depthwise separable convolution layers and an ASPP layer, while the decoder fuses the low-level features and restores the feature map. Thanks to the separable convolutions, the proposed model is faster and stronger, and its computational complexity is significantly reduced.
The feature cross attention module extracts shallow spatial information using the spatial attention module (Fig. 3) and captures context information using the channel attention mechanism (Fig. 2).
The output features of the two branches differ: because the high-level features mainly contain category information, the channel attention module can be used to extract features from the high-level information, while the low level corresponds to more spatial information that cannot be directly sampled and fused, so the spatial attention module is used to extract features from the low level.
In the added feature cross attention mechanism, the high-level features of the channel attention module provide context information, while the low-level features extracted by the spatial attention module refine the pixel localization. First, the output features of the two branches are cascaded, and the cascaded features undergo convolution, batch normalization, and ReLU processing,
and then the fused features and the output of the spatial branch processed by the SA module are used as input to help refine localization. The SA module's features, after convolution, normalization, and a sigmoid nonlinearity, are multiplied by the fused features. The output of the spatial attention block, together with the context features of the context branch, is applied to a channel attention block, and the context features are compressed along the spatial dimensions by global average pooling and max pooling to obtain two vectors. The two vectors are then passed through a shared fully connected layer and a sigmoid operator to generate an attention map, followed finally by convolution, batch normalization, and ReLU fusion.
To better mine the spatial features and channel features in the decoder, after the hole convolution is performed, a 1 × 1 convolution extracts the shallow features, and an SA attention module is then added to obtain better shallow spatial information.
A channel attention mechanism is added after the feature obtained by four-times upsampling of the feature information produced by the ASPP operation, to obtain higher-level context channel information. The added modules have little influence on the structure of the original network model and add almost no extra training parameters or overhead, while letting the model obtain more important spatial features and channel features.
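A sketch of this decoder-side placement, reusing the ChannelAttention and SpatialAttention modules sketched earlier; the channel sizes follow common Deeplabv3+ conventions and are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# Assumes the ChannelAttention and SpatialAttention classes sketched
# earlier in this document are in scope.

class AttentionDecoder(nn.Module):
    """Decoder-side attention placement described above: a 1x1 convolution
    plus spatial attention on the shallow (low-level) feature, and channel
    attention on the 4x-upsampled ASPP output."""
    def __init__(self, low_ch: int, aspp_ch: int = 256, low_out: int = 48):
        super().__init__()
        self.low_proj = nn.Sequential(        # 1x1 conv on the shallow feature
            nn.Conv2d(low_ch, low_out, 1, bias=False),
            nn.BatchNorm2d(low_out), nn.ReLU(inplace=True))
        self.sa = SpatialAttention()          # refine shallow spatial information
        self.ca = ChannelAttention(aspp_ch)   # higher-level context channel info

    def forward(self, low: torch.Tensor, aspp_out: torch.Tensor) -> torch.Tensor:
        low = self.sa(self.low_proj(low))     # SA after the 1x1 convolution
        up = F.interpolate(aspp_out, size=low.shape[2:],
                           mode='bilinear', align_corners=False)  # 4x upsampling
        up = self.ca(up)                      # CA after the upsampling
        return torch.cat([low, up], dim=1)    # fused decoder feature
```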
In the FCA module, the output feature maps of the two branches are first concatenated, and then a 3 × 3 convolution, batch normalization, and a ReLU unit are applied to the concatenated feature map.
The spatial branch features undergo a 3 × 3 convolution, batch normalization, and a sigmoid nonlinearity, and are then multiplied by the fused features; the output of the spatial attention block and the context features of the context branch are applied to the channel attention block.
The context features are compressed along the spatial dimensions by global average pooling and max pooling to obtain two vectors. These two vectors are then applied to the shared fully connected layer and sigmoid operator to generate the attention map. The attention map is next multiplied by the output features from the spatial attention block and added to the fused features.
The application effect of the invention is described in detail below in combination with a Matlab/Simulink simulation diagram:
The visualization results are shown in Figs. 6, 7 and 8. It can be seen that our model outperforms the original model in overall, edge, and detail aspects; in terms of edge detail, the network with the added cross attention mechanism learns and exploits the information in the target region well and aggregates features from the target region, and our model's feature refinement process ultimately guides the network to make reasonable use of the given features. The proposed Deeplabv3+ model with the introduced cross attention module refines the attention mechanism into two different modules, achieves a clear performance improvement while keeping the computational cost small, designs a dual-branch network to improve the context features, and simultaneously encodes low-level spatial information.

Claims (6)

1. A Deeplabv3+ model incorporating a feature cross attention module (FCA), comprising:
a feature cross attention module (FCA) that extracts shallow spatial information with a spatial attention module
and captures context information using a channel attention mechanism;
wherein the output features of the two branches differ, the high-level features mainly comprising category information;
the channel attention module is used to extract features from the high-level information, while the low level corresponds to more spatial information that cannot be directly sampled and fused, and the spatial attention module is used to extract features from the low-level information.
2. The Deeplabv3+ model according to claim 1, wherein the added FCA module comprises:
high-level features of its channel attention module that are used to provide context information, while the low-level features extracted by the spatial attention module are used to refine the pixel localization;
the output features of the two branches are cascaded, and the cascaded features undergo convolution, batch normalization, and ReLU processing;
and the fused features and the output of the spatial branch processed by the SA module are used as input to help refine localization.
3. The model of claim 1 or 2, further comprising:
after the SA module's features undergo normalization and a sigmoid nonlinear convolution, they are multiplied by the fused features;
the output of the spatial attention block, together with the context features of the context branch, is applied to a channel attention block, and the context features are compressed along the spatial dimensions by global pooling and max pooling to obtain two vectors;
and the two vectors are passed through a shared fully connected layer and a sigmoid operator to generate an attention map, followed finally by convolution, batch normalization, and ReLU fusion.
4. The model of claim 3, further comprising:
to better mine the spatial features and channel features in the decoder, after the original image undergoes hole convolution, the shallow features are extracted using a 1 × 1 convolution, and an SA attention module is then added to obtain better shallow spatial information;
and a channel attention mechanism is added after the feature obtained by four-times upsampling of the feature information produced by the ASPP operation on the original image, to obtain higher-level context channel information.
5. The model of claim 4, further comprising:
the added module has little influence on the structure of the original network model and adds almost no extra training parameters or overhead, while the model obtains more important spatial features and channel features;
in the FCA module, the output features of the two branches are first concatenated, and then 3 × 3 convolution, batch normalization, and ReLU units are applied to the concatenated features;
and after the spatial branch feature undergoes a 3 × 3 convolution, batch normalization, and a sigmoid nonlinearity, it is multiplied by the fusion feature.
6. The model of claim 1, comprising:
applying the output of the spatial attention block and the context features of the context branch to the channel attention block;
compressing the context features along the spatial dimensions with global pooling and max pooling to obtain two vectors;
and applying these two vectors to the shared fully connected layer and sigmoid operator to generate the attention map, which is next multiplied by the output features from the spatial attention block and added to the fused features.
CN202011144252.3A 2020-10-23 2020-10-23 Semantic segmentation method for introducing feature cross attention mechanism Pending CN112329778A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011144252.3A CN112329778A (en) 2020-10-23 2020-10-23 Semantic segmentation method for introducing feature cross attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011144252.3A CN112329778A (en) 2020-10-23 2020-10-23 Semantic segmentation method for introducing feature cross attention mechanism

Publications (1)

Publication Number Publication Date
CN112329778A true CN112329778A (en) 2021-02-05

Family

ID=74311590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011144252.3A Pending CN112329778A (en) 2020-10-23 2020-10-23 Semantic segmentation method for introducing feature cross attention mechanism

Country Status (1)

Country Link
CN (1) CN112329778A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516022A (en) * 2021-04-23 2021-10-19 黑龙江机智通智能科技有限公司 Fine-grained classification system for cervical cells
CN113435253A (en) * 2021-05-31 2021-09-24 西安电子科技大学 Multi-source image combined urban area ground surface coverage classification method
CN113435253B (en) * 2021-05-31 2022-12-02 西安电子科技大学 Multi-source image combined urban area ground surface coverage classification method
CN114119698A (en) * 2021-06-18 2022-03-01 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN113435578A (en) * 2021-06-25 2021-09-24 重庆邮电大学 Feature map coding method and device based on mutual attention and electronic equipment
CN113435578B (en) * 2021-06-25 2022-04-05 重庆邮电大学 Feature map coding method and device based on mutual attention and electronic equipment
CN113989234A (en) * 2021-10-28 2022-01-28 杭州中科睿鉴科技有限公司 Image tampering detection method based on multi-feature fusion
CN114972130A (en) * 2022-08-02 2022-08-30 深圳精智达技术股份有限公司 Training method, device and training equipment for denoising neural network
CN116503428A (en) * 2023-06-27 2023-07-28 吉林大学 Image feature extraction method and segmentation method based on refined global attention mechanism
CN116503428B (en) * 2023-06-27 2023-09-08 吉林大学 Image feature extraction method and segmentation method based on refined global attention mechanism

Similar Documents

Publication Publication Date Title
CN112329778A (en) Semantic segmentation method for introducing feature cross attention mechanism
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN109543502B (en) Semantic segmentation method based on deep multi-scale neural network
CN108171701B (en) Significance detection method based on U network and counterstudy
CN111563909B (en) Semantic segmentation method for complex street view image
CN113408321B (en) Real-time target detection method and device for lightweight image and video data
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN113011336B (en) Real-time street view image semantic segmentation method based on deep multi-branch aggregation
CN114463340B (en) Agile remote sensing image semantic segmentation method guided by edge information
Zeng et al. Deeplabv3+ semantic segmentation model based on feature cross attention mechanism
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN114120148B (en) Method for detecting changing area of remote sensing image building
Liu et al. Road segmentation with image-LiDAR data fusion in deep neural network
CN114299305B (en) Saliency target detection algorithm for aggregating dense and attention multi-scale features
CN116051977A (en) Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm
CN117541505A (en) Defogging method based on cross-layer attention feature interaction and multi-scale channel attention
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN116363361A (en) Automatic driving method based on real-time semantic segmentation network
CN116246109A (en) Multi-scale hole neighborhood attention computing backbone network model and application thereof
CN113627368B (en) Video behavior recognition method based on deep learning
CN113284042B (en) Multi-path parallel image content characteristic optimization style migration method and system
Yi et al. Gated residual feature attention network for real-time Dehazing
Yanqin et al. Crowd density estimation based on conditional random field and convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210205