CN115170801A - FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion - Google Patents

FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion

Info

Publication number
CN115170801A
Authority
CN
China
Prior art keywords
double, model, FDA, attention, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210852168.XA
Other languages
Chinese (zh)
Inventor
张小国
滕浩
丁立早
杜文俊
王琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202210852168.XA
Publication of CN115170801A
Legal status: Pending

Classifications

    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/40: Extraction of image or video features
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks


Abstract

The invention provides an FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion, which mainly comprises the following steps: building a ResNet-50 feature extraction network according to the DeepLabv3+ model framework and appending an atrous spatial pyramid pooling (ASPP) module after the feature extraction network; designing a dual-attention mechanism feature fusion module; designing a multi-level feature fusion structure based on the dual-attention mechanism feature fusion module, feeding the high-level and low-level feature maps into it, and obtaining the semantic segmentation result through a depthwise separable convolution and upsampling, at which point the FDA-DeepLab model is fully built; initializing the FDA-DeepLab backbone with a pre-trained model, training the model with an improved loss function to optimize training, and using the trained FDA-DeepLab model and the DeepLabv3+ model to segment the test set and compare performance.

Description

FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion
Technical Field
The invention relates to an FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion, and belongs to the field of image processing.
Background
Conventional semantic segmentation faces several challenges: the successive downsampling operations in a conventional classification CNN continuously degrade the resolution of the feature map, and the multi-scale detection problem is usually handled by rescaling and aggregating feature maps, which is computationally expensive. The DeepLab model was developed to address these problems, and the DeepLabv3+ model emerged from its continued development. DeepLabv3+ uses DeepLabv3 as the Encoder to extract multi-scale features, adds a Decoder on this basis, and thus forms a new scheme fusing ASPP with an Encoder-Decoder structure, effectively improving the object boundaries of the segmentation output. In practice, however, several problems remain:
1. Similar objects are prone to misjudgment.
2. Small targets are easily missed.
3. The boundary segmentation error is large.
4. The predicted output contains holes.
Disclosure of Invention
The invention aims to: in view of the problems in the prior art, the invention provides an FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion, which mainly solves the problem of breaks and holes in objects segmented by the original DeepLabv3+ model; the problem of large image-boundary segmentation error in the original DeepLabv3+ model; the problem that the original DeepLabv3+ model easily misjudges similar objects; and the problems of imbalanced sample classes and imbalanced sample classification difficulty in the dataset during actual training.
The technical scheme is as follows:
An FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion, characterized by comprising the following steps:
Step 1: constructing a feature extraction network and an atrous spatial pyramid pooling (ASPP) module according to the DeepLabv3+ model framework;
Step 2: designing a dual-attention mechanism feature fusion module;
Step 3: designing a multi-level feature fusion structure based on the dual-attention mechanism feature fusion module;
Step 4: performing a depthwise separable convolution and upsampling on the output image obtained from the feature fusion structure, completing the model construction;
Step 5: training the model, improving the loss function to optimize training, and comparing the performance of different models.
The step 1 comprises the following steps:
Step 1.1: a feature extraction network is built with a ResNet-50 convolutional neural network model, yielding low-level feature maps at downsampling rates of 4, 8 and 16.
Step 1.2: an atrous spatial pyramid pooling (ASPP) module is appended after the feature extraction network to obtain the high-level feature map, as sketched below.
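For illustration, a minimal PyTorch sketch of step 1 follows; it is a sketch under assumptions rather than the patented implementation: the atrous rates [6, 12, 18], the reuse of torchvision's ASPP block and ResNet-50 (torchvision >= 0.13), and the dropping of layer4 are all assumptions, since the patent fixes none of them.

    import torch
    from torchvision.models import resnet50
    from torchvision.models.segmentation.deeplabv3 import ASPP

    class Encoder(torch.nn.Module):
        """Step 1 sketch: ResNet-50 backbone tapped at strides 4/8/16, plus ASPP."""
        def __init__(self):
            super().__init__()
            r = resnet50(weights="IMAGENET1K_V1")  # ImageNet pre-training, cf. step 5.1
            self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
            self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3
            # layer4 is dropped so the total downsampling factor stays 16 (step 5.1);
            # the atrous rates below are an assumption
            self.aspp = ASPP(in_channels=1024, atrous_rates=[6, 12, 18], out_channels=256)

        def forward(self, x):
            c1 = self.layer1(self.stem(x))    # low-level map, downsampling rate 4
            c2 = self.layer2(c1)              # low-level map, downsampling rate 8
            c3 = self.layer3(c2)              # low-level map, downsampling rate 16
            return c1, c2, c3, self.aspp(c3)  # high-level map from ASPP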
The step 2 comprises the following steps:
Step 2.1: for a given dual-attention mechanism fusion module, let the low-resolution feature map input be $U_{LI}$, with feature-map resolution $H' \times W'$, and the high-resolution feature map input be $U_{HI}$, with feature-map resolution $H \times W$;
Step 2.2: upsample $U_{LI}$ to obtain $U_{L'I'}$, so that the resolution of $U_{L'I'}$ matches that of $U_{HI}$, i.e. $H \times W$:

$$U_{L'I'} = f_{up}(U_{LI}), \qquad U_{L'I'} \in \mathbb{R}^{H \times W \times C}$$

where $f_{up}$ denotes the upsampling operation, generally bilinear interpolation;
Step 2.3: apply a channel attention operation to $U_{L'I'}$ to obtain $U_{LI'}$, and apply a spatial attention operation to $U_{HI}$ to obtain the weight map $F_S$; multiplying $F_S$ with $U_{LI'}$ yields $U_{LO'}$:

$$U_{LI'} = f(W_R * z) \cdot U_{L'I'}$$
$$F_S = \left[f(s_{1,1}), f(s_{1,2}), \ldots, f(s_{i,j}), \ldots, f(s_{H,W})\right]$$
$$U_{LO'} = F_S \otimes U_{LI'}$$

where $f(\cdot)$ denotes the Sigmoid function, $s$ is the mapped feature, $W_R$ are the parameters of the corresponding convolution, $z$ is the compressed feature, and $\otimes$ denotes element-wise multiplication;
Step 2.4: add $U_{LO'}$ and $U_{HI}$, and apply a $1 \times 1$ convolution kernel to reduce the dimension:

$$U_O = c(U_{LO'} + U_{HI})$$

where $c$ denotes the $1 \times 1$ convolution operation.
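A minimal PyTorch sketch of this module follows. The SE-style squeeze for the channel gate, the single 1 x 1 convolution for the spatial gate, and the reduction ratio of 16 are assumptions read into the formulas above; the patent does not pin these hyper-parameters down.

    import torch.nn as nn
    import torch.nn.functional as F

    class DualAttentionFusion(nn.Module):
        """Step 2 sketch; both inputs are assumed to already carry `channels` channels."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            # channel attention: z = global average pool, then f(W_R * z), f = Sigmoid
            self.channel_gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),  # W_R (first stage)
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),  # W_R (second stage)
                nn.Sigmoid())
            # spatial attention: per-pixel mapping s followed by Sigmoid -> F_S
            self.spatial_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
            self.out_conv = nn.Conv2d(channels, channels, 1)  # the 1x1 convolution c(.)

        def forward(self, u_li, u_hi):
            # step 2.2: f_up, bilinear upsampling to H x W
            u_up = F.interpolate(u_li, size=u_hi.shape[-2:],
                                 mode="bilinear", align_corners=False)
            u_lip = self.channel_gate(u_up) * u_up  # step 2.3: U_LI'
            f_s = self.spatial_gate(u_hi)           # step 2.3: F_S from U_HI
            u_lop = f_s * u_lip                     # step 2.3: U_LO' = F_S (x) U_LI'
            return self.out_conv(u_lop + u_hi)      # step 2.4: U_O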
The step 3 comprises the following steps:
Step 3.1: the low-level feature map with downsampling rate 16 from step 1.1 and the high-level feature map from step 1.2 are passed through the dual-attention mechanism feature fusion module designed in step 2 to obtain output feature map 1;
Step 3.2: the low-level feature map with downsampling rate 8 from step 1.1 and output feature map 1 from step 3.1 are passed through the module to obtain output feature map 2;
Step 3.3: the low-level feature map with downsampling rate 4 from step 1.1 and output feature map 2 from step 3.2 are passed through the module to obtain output feature map 3, as sketched in the code after this list;
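Continuing the sketches above, the cascade of step 3 could be wired as follows. The 1 x 1 alignment convolutions, and the ResNet-50 channel counts 256/512/1024 they assume, are added here only so the sketch composes; the patent does not state how channel widths are matched.

    import torch.nn as nn

    class FDADecoder(nn.Module):
        """Step 3 sketch: three cascaded DualAttentionFusion modules (defined above)."""
        def __init__(self, ch=256):
            super().__init__()
            self.align16 = nn.Conv2d(1024, ch, 1)  # stride-16 low-level map -> ch
            self.align8 = nn.Conv2d(512, ch, 1)    # stride-8 low-level map -> ch
            self.align4 = nn.Conv2d(256, ch, 1)    # stride-4 low-level map -> ch
            self.fuse16 = DualAttentionFusion(ch)
            self.fuse8 = DualAttentionFusion(ch)
            self.fuse4 = DualAttentionFusion(ch)

        def forward(self, c1, c2, c3, aspp_out):
            out1 = self.fuse16(aspp_out, self.align16(c3))  # step 3.1: output map 1
            out2 = self.fuse8(out1, self.align8(c2))        # step 3.2: output map 2
            return self.fuse4(out2, self.align4(c1))        # step 3.3: output map 3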
the step 4 comprises the following steps:
step 4.1: the output signature 3 obtained in step 3.3 is subjected to a depth separable convolution with a convolution kernel of 3 x 3 and a 4-fold upsampling.
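A hedged sketch of this head is shown below; num_classes = 21 assumes the PASCAL VOC 2012 label set used in step 5.1, and the single depthwise/pointwise pair is one plausible reading of the patent's "depthwise separable convolution".

    import torch.nn as nn
    import torch.nn.functional as F

    class SegHead(nn.Module):
        """Step 4 sketch: 3x3 depthwise separable convolution, then 4x upsampling."""
        def __init__(self, channels=256, num_classes=21):
            super().__init__()
            self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
            self.pointwise = nn.Conv2d(channels, num_classes, 1)

        def forward(self, x):
            x = self.pointwise(self.depthwise(x))  # depthwise separable 3x3 conv
            return F.interpolate(x, scale_factor=4, mode="bilinear",
                                 align_corners=False)  # 4x upsampling to input size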
The step 5 comprises the following steps:
Step 5.1: train the model. The FDA-DeepLab backbone is initialized with a ResNet-50 model pre-trained on the ImageNet dataset. The batch size is set to 10, the number of iteration steps to 40000, the total downsampling factor of the basic feature extraction network to 16, the initial learning rate to 0.007, and the training input size to 513 × 513; a poly learning-rate policy is adopted.
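The poly policy can be sketched as follows. The decay exponent 0.9, the SGD optimizer and its momentum of 0.9, and the placeholder model are all assumptions: the patent specifies the policy name and the initial rate but nothing further.

    import torch

    base_lr, max_iter = 0.007, 40000  # from step 5.1
    model = torch.nn.Conv2d(3, 21, 1)  # placeholder for the assembled FDA-DeepLab network
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: (1.0 - step / max_iter) ** 0.9)  # poly decay
    # call scheduler.step() once per training iteration, for max_iter steps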
Step 5.2: optimize training by improving the loss function. A focal loss replaces the conventional cross-entropy loss:

$$L_{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$$

where $\alpha_t$ is an inter-class weighting parameter in $[0, 1]$, $(1 - p_t)^{\gamma}$ is the easy/hard-sample modulating factor, $\gamma$ is the focusing parameter, and $p_t$ is the predicted probability of the true label. In this experiment, $\gamma = 2$ and $\alpha = 0.25$.
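A minimal sketch of this focal loss for per-pixel multi-class logits follows; handling of an ignore index is omitted for brevity.

    import torch.nn as nn
    import torch.nn.functional as F

    class FocalLoss(nn.Module):
        """Step 5.2 sketch: focal loss with gamma=2, alpha=0.25."""
        def __init__(self, gamma=2.0, alpha=0.25):
            super().__init__()
            self.gamma, self.alpha = gamma, alpha

        def forward(self, logits, target):
            # logits: (N, C, H, W); target: (N, H, W) int64 class labels
            log_pt = F.log_softmax(logits, dim=1)
            log_pt = log_pt.gather(1, target.unsqueeze(1)).squeeze(1)  # log p_t
            pt = log_pt.exp()                                          # p_t
            return (-self.alpha * (1.0 - pt) ** self.gamma * log_pt).mean()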
Step 5.3: test the model. MIoU (mean intersection over union) is adopted as the performance evaluation index; it is simple, highly representative, and the most common evaluation standard in the semantic segmentation field.
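MIoU can be computed from the confusion matrix, for example as in the following sketch; pred and gt are integer label maps of equal shape with values in [0, num_classes), and classes absent from both are skipped via the NaN-mean.

    import numpy as np

    def mean_iou(pred, gt, num_classes=21):
        """Step 5.3 sketch: mean intersection-over-union from a confusion matrix."""
        mask = (gt >= 0) & (gt < num_classes)
        conf = np.bincount(num_classes * gt[mask].astype(int) + pred[mask],
                           minlength=num_classes ** 2).reshape(num_classes, num_classes)
        inter = np.diag(conf)
        union = conf.sum(axis=0) + conf.sum(axis=1) - inter
        with np.errstate(divide="ignore", invalid="ignore"):
            iou = inter / union
        return float(np.nanmean(iou))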
Beneficial effects:
1. The problems of breaks and holes in objects segmented by the original DeepLabv3+ model are solved;
2. the problem of large image-boundary segmentation error in the original DeepLabv3+ model is solved;
3. the problem that the original DeepLabv3+ model easily misjudges similar objects is solved;
4. the problems of imbalanced dataset sample classes and imbalanced sample classification difficulty during actual training are solved.
Drawings
FIG. 1 is a network model diagram of the original DeepLabv3+ model;
FIG. 2 is a structural diagram of the dual-attention mechanism fusion module designed by the invention;
FIG. 3 is an overall architecture diagram of the invention;
FIG. 4 shows the results of the attention mechanism comparison experiment;
FIG. 5 shows the results of the multi-feature fusion comparison experiment under the dual-attention mechanism;
FIG. 6 shows the results of the loss function comparison experiment;
FIG. 7 shows the results of the comparison experiment before and after the DeepLabv3+ improvement;
FIG. 8 shows the results of the comparison experiment across different algorithms.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings; the described embodiments are some, but not all, embodiments of the invention. Accordingly, the following detailed description of the embodiments, as provided in the accompanying drawings, is not intended to limit the scope of the claimed invention.
As shown in the figures, the FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion comprises the following steps:
Step 1: constructing a feature extraction network and an atrous spatial pyramid pooling (ASPP) module according to the DeepLabv3+ model framework;
Step 2: designing a dual-attention mechanism feature fusion module;
Step 3: designing a multi-level feature fusion structure based on the dual-attention mechanism feature fusion module;
Step 4: performing a depthwise separable convolution and upsampling on the output image obtained from the feature fusion structure, completing the model construction;
Step 5: training the model, improving the loss function to optimize training, and comparing the performance of different models.
The step 1 comprises the following steps:
Step 1.1: a feature extraction network is built with a ResNet-50 convolutional neural network model, yielding low-level feature maps at downsampling rates of 4, 8 and 16.
Step 1.2: an atrous spatial pyramid pooling (ASPP) module is appended after the feature extraction network to obtain the high-level feature map.
The step 2 comprises the following steps:
Step 2.1: for a given dual-attention mechanism fusion module, let the low-resolution feature map input be $U_{LI}$, with feature-map resolution $H' \times W'$, and the high-resolution feature map input be $U_{HI}$, with feature-map resolution $H \times W$;
Step 2.2: upsample $U_{LI}$ to obtain $U_{L'I'}$, so that the resolution of $U_{L'I'}$ matches that of $U_{HI}$, i.e. $H \times W$:

$$U_{L'I'} = f_{up}(U_{LI}), \qquad U_{L'I'} \in \mathbb{R}^{H \times W \times C}$$

where $f_{up}$ denotes the upsampling operation, generally bilinear interpolation;
Step 2.3: apply a channel attention operation to $U_{L'I'}$ to obtain $U_{LI'}$, and apply a spatial attention operation to $U_{HI}$ to obtain the weight map $F_S$; multiplying $F_S$ with $U_{LI'}$ yields $U_{LO'}$:

$$U_{LI'} = f(W_R * z) \cdot U_{L'I'}$$
$$F_S = \left[f(s_{1,1}), f(s_{1,2}), \ldots, f(s_{i,j}), \ldots, f(s_{H,W})\right]$$
$$U_{LO'} = F_S \otimes U_{LI'}$$

where $f(\cdot)$ denotes the Sigmoid function, $s$ is the mapped feature, $W_R$ are the parameters of the corresponding convolution, $z$ is the compressed feature, and $\otimes$ denotes element-wise multiplication;
Step 2.4: add $U_{LO'}$ and $U_{HI}$, and apply a $1 \times 1$ convolution kernel to reduce the dimension:

$$U_O = c(U_{LO'} + U_{HI})$$

where $c$ denotes the $1 \times 1$ convolution operation.
The step 3 comprises the following steps:
Step 3.1: the low-level feature map with downsampling rate 16 from step 1.1 and the high-level feature map from step 1.2 are passed through the dual-attention mechanism feature fusion module designed in step 2 to obtain output feature map 1;
Step 3.2: the low-level feature map with downsampling rate 8 from step 1.1 and output feature map 1 from step 3.1 are passed through the module to obtain output feature map 2;
Step 3.3: the low-level feature map with downsampling rate 4 from step 1.1 and output feature map 2 from step 3.2 are passed through the module to obtain output feature map 3;
the step 4 comprises the following steps:
step 4.1: the output signature 3 obtained in step 3.3 is subjected to a depth separable convolution with a convolution kernel of 3 x 3 and a 4-fold upsampling.
The step 5 comprises the following steps:
Step 5.1: train the model. The constructed FDA-DeepLab model is trained on the public PASCAL VOC 2012 dataset. The FDA-DeepLab backbone is initialized with a ResNet-50 model pre-trained on the ImageNet dataset. The batch size is set to 10, the number of iteration steps to 40000, the total downsampling factor of the basic feature extraction network to 16, the initial learning rate to 0.007, and the training input size to 513 × 513; a poly learning-rate policy is adopted.
Step 5.2: optimize training by improving the loss function. A focal loss replaces the conventional cross-entropy loss:

$$L_{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$$

where $\alpha_t$ is an inter-class weighting parameter in $[0, 1]$, $(1 - p_t)^{\gamma}$ is the easy/hard-sample modulating factor, $\gamma$ is the focusing parameter, and $p_t$ is the predicted probability of the true label. In this experiment, $\gamma = 2$ and $\alpha = 0.25$.
Step 5.3: test the model. MIoU (mean intersection over union) is adopted as the performance evaluation index; it is simple, highly representative, and the most common evaluation standard in the semantic segmentation field.
By adding the dual-attention mechanism feature fusion module, the multi-level feature fusion structure and the improved loss function to the DeepLabv3+ model, the method remedies the shortcomings of the original DeepLabv3+ model in image semantic segmentation and improves the segmentation quality of images.
Here, fig. 1 is a network model diagram of the original DeepLabv3+ model. The original DeepLabv3+ model is divided into two modules, the Encoder and the Decoder. Specifically, the Encoder module comprises a backbone network responsible for basic feature extraction and an ASPP module; effective extraction of image features is the key to high-precision semantic segmentation. The Decoder module progressively upsamples the feature map produced by the Encoder and fuses high- and low-level features following the FPN feature-fusion idea, alleviating the loss of detail during feature extraction and finally producing the semantic segmentation result.
Fig. 2 is a structural diagram of the dual-attention mechanism fusion module designed by the invention. The invention combines the advantages of two attention mechanisms, effectively fusing low-level spatial detail with high-level semantic cues to obtain a more effective attention model. The currently common fusion approach applies both attention operations to the same feature map and fuses the results, differing mainly in the choice of fusion scheme. A high-resolution low-level feature map is suited to a spatial attention operation, which extracts the spatial position information of the input image and localizes its important parts; a low-resolution high-level feature map is suited to a channel attention operation, which focuses on the more relevant feature channels and ignores other interference. The invention therefore applies a different attention mechanism to feature maps of different resolutions before fusing them, improving the fusion effect.
Fig. 3 shows the overall architecture of the invention.
Figs. 4, 5, 6, 7 and 8 show, respectively, the results of the attention mechanism comparison experiment on the PASCAL VOC 2012 validation set, the multi-feature fusion comparison experiment under the dual-attention mechanism, the loss function comparison experiment, the comparison before and after the DeepLabv3+ improvement, and the comparison across different algorithms. The experimental results show that the dual-attention module designed by the invention outperforms a single channel-attention or spatial-attention mechanism as well as the original model; the feature fusion method based on the dual-attention mechanism outperforms other fusion methods; and the focal loss designed by the invention also yields a measurable gain on a public dataset whose class distribution is relatively balanced. Overall, by designing the dual-attention mechanism feature fusion module, the multi-level feature fusion structure and the improved loss function on top of the DeepLabv3+ model, the invention obtains the FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion, remedies the defects of the original DeepLabv3+ model and improves the segmentation effect.
The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also considered within the scope of the invention.

Claims (6)

1. An FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion, characterized by comprising the following steps:
step 1: constructing a feature extraction network and an atrous spatial pyramid pooling (ASPP) module according to the DeepLabv3+ model framework;
step 2: designing a dual-attention mechanism feature fusion module;
step 3: designing a multi-level feature fusion structure based on the dual-attention mechanism feature fusion module;
step 4: performing a depthwise separable convolution and upsampling on the output image obtained from the feature fusion structure, completing the model construction;
step 5: training the model, improving the loss function to optimize training, and comparing the performance of different models.
2. The FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1.1: building a feature extraction network with a ResNet-50 convolutional neural network model to obtain low-level feature maps at downsampling rates of 4, 8 and 16;
step 1.2: appending an atrous spatial pyramid pooling (ASPP) module after the feature extraction network to obtain the high-level feature map.
3. The FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion as claimed in claim 2, wherein the step 2 comprises the following steps:
step 2.1: for a given dual-attention mechanism fusion module, letting the low-resolution feature map input be $U_{LI}$, with feature-map resolution $H' \times W'$, and the high-resolution feature map input be $U_{HI}$, with feature-map resolution $H \times W$;
step 2.2: upsampling $U_{LI}$ to obtain $U_{L'I'}$, so that the resolution of $U_{L'I'}$ matches that of $U_{HI}$, i.e. $H \times W$:

$$U_{L'I'} = f_{up}(U_{LI}), \qquad U_{L'I'} \in \mathbb{R}^{H \times W \times C}$$

where $f_{up}$ denotes the upsampling operation, generally bilinear interpolation;
step 2.3: applying a channel attention operation to $U_{L'I'}$ to obtain $U_{LI'}$, and applying a spatial attention operation to $U_{HI}$ to obtain the weight map $F_S$; multiplying $F_S$ with $U_{LI'}$ to obtain $U_{LO'}$:

$$U_{LI'} = f(W_R * z) \cdot U_{L'I'}$$
$$F_S = \left[f(s_{1,1}), f(s_{1,2}), \ldots, f(s_{i,j}), \ldots, f(s_{H,W})\right]$$
$$U_{LO'} = F_S \otimes U_{LI'}$$

where $f(\cdot)$ denotes the Sigmoid function, $s$ is the mapped feature, $W_R$ are the parameters of the corresponding convolution, $z$ is the compressed feature, and $\otimes$ denotes element-wise multiplication;
step 2.4: adding $U_{LO'}$ and $U_{HI}$, and applying a $1 \times 1$ convolution kernel to reduce the dimension:

$$U_O = c(U_{LO'} + U_{HI})$$

where $c$ denotes the $1 \times 1$ convolution operation.
4. The FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion as claimed in claim 2, wherein the step 3 comprises the following steps:
step 3.1: passing the low-level feature map with downsampling rate 16 obtained in step 1.1 and the high-level feature map obtained in step 1.2 through the dual-attention mechanism feature fusion module designed in step 2 to obtain output feature map 1;
step 3.2: passing the low-level feature map with downsampling rate 8 obtained in step 1.1 and output feature map 1 obtained in step 3.1 through the module to obtain output feature map 2;
step 3.3: passing the low-level feature map with downsampling rate 4 obtained in step 1.1 and output feature map 2 obtained in step 3.2 through the module to obtain output feature map 3.
5. The FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion as claimed in claim 4, wherein the step 4 comprises the following steps:
step 4.1: passing the output feature map 3 obtained in step 3.3 through a depthwise separable convolution with a 3 × 3 kernel and 4× upsampling.
6. The FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion as claimed in claim 1, wherein the step 5 comprises the following steps:
step 5.1: training the model; initializing the FDA-DeepLab backbone with a ResNet-50 model pre-trained on the ImageNet dataset; setting the batch size to 10, the number of iteration steps to 40000, the total downsampling factor of the basic feature extraction network to 16, the initial learning rate to 0.007 and the training input size to 513 × 513; adopting a poly learning-rate policy;
step 5.2: improving the loss function to optimize training; replacing the conventional cross-entropy loss with a focal loss:

$$L_{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$$

where $\alpha_t$ is an inter-class weighting parameter, $(1 - p_t)^{\gamma}$ is the easy/hard-sample modulating factor, $\gamma$ is the focusing parameter, and $p_t$ is the predicted probability of the true label; setting $\gamma = 2$, $\alpha = 0.25$;
step 5.3: testing the model; adopting MIoU as the performance evaluation index.
CN202210852168.XA 2022-07-20 2022-07-20 FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion Pending CN115170801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210852168.XA CN115170801A (en) 2022-07-20 2022-07-20 FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210852168.XA CN115170801A (en) 2022-07-20 2022-07-20 FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion

Publications (1)

Publication Number Publication Date
CN115170801A 2022-10-11

Family

ID=83494735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210852168.XA Pending CN115170801A (en) 2022-07-20 2022-07-20 FDA-DeepLab semantic segmentation algorithm based on dual-attention mechanism fusion

Country Status (1)

Country Link
CN (1) CN115170801A (en)


Cited By (4)

* Cited by examiner, † Cited by third party

• CN117237644A* (priority 2023-11-10, published 2023-12-15, 广东工业大学): Forest residual fire detection method and system based on infrared small target detection
• CN117237644B* (priority 2023-11-10, published 2024-02-13, 广东工业大学): Forest residual fire detection method and system based on infrared small target detection
• CN117409208A* (priority 2023-12-14, published 2024-01-16, 武汉纺织大学): Real-time clothing image semantic segmentation method and system
• CN117409208B* (priority 2023-12-14, published 2024-03-08, 武汉纺织大学): Real-time clothing image semantic segmentation method and system


Legal Events

• PB01: Publication
• SE01: Entry into force of request for substantive examination