CN116012581A - Image segmentation method based on dual attention fusion - Google Patents

Image segmentation method based on dual attention fusion

Info

Publication number
CN116012581A
Authority
CN
China
Prior art keywords
attention
convolution
module
image
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211633594.0A
Other languages
Chinese (zh)
Inventor
袁非牛
汤照达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Normal University
Original Assignee
Shanghai Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Normal University
Priority to CN202211633594.0A
Publication of CN116012581A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image segmentation method based on dual attention fusion. The method comprises the steps of constructing a data set according to a target task; constructing a segmentation network model and inputting the image samples in the data set into it for training, wherein the segmentation network model adopts a U-shaped structure and comprises an encoding module, a dual-attention gated fusion module and a decoding module; the encoding module encodes an input image to obtain an initial feature map; the dual-attention gated fusion module comprises a multi-scale weighted channel attention branch and a global spatial self-attention branch, fuses the features extracted by the two branches through a gating mechanism module, and outputs the final feature map; the decoding module decodes the final feature map to obtain a segmented image. Finally, the image data to be segmented are input into the trained segmentation network model to obtain a high-precision segmentation result of the target image.

Description

Image segmentation method based on dual attention fusion
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image segmentation method based on dual attention fusion.
Background
Medical images play a critical role in diagnosing diseases. Accurately segmenting organs from medical images by manual methods is difficult and time-consuming, and such methods depend heavily on personal experience. The rapid development of modern image processing and artificial intelligence (AI) provides doctors with more critical information about lesions, and AI techniques are improving the accuracy of disease diagnosis while reducing diagnosis time. Accurately segmenting specific organs from medical images remains a challenging task that is important for clinical diagnosis.
In recent years, research on deep-learning-based medical image segmentation has made breakthroughs, but segmenting multiple organs from medical images remains very difficult, for the following reasons. First, the human body packs multiple organs into a small space, which leads to large deformations of the same organ across different people; for example, the colon and pancreas in the abdomen have very different shapes in different people. Second, existing scanners have limited scanning quality, which leads to blurred boundaries, low contrast and high noise; in abdominal CT slices, the boundary between the pancreatic head and the duodenum is generally blurred, which limits segmentation accuracy. Third, human organs have different sizes and shapes in medical images, so more abstract, high-level features extracted at different scales are needed to capture image semantic information.
Most existing methods are image segmentation algorithms based on convolutional neural networks (CNNs), which improve the feature extraction capability of the network by adding new modules to a basic segmentation framework in order to obtain higher-precision segmentation results. However, CNN-based methods, while good at capturing detailed information in local features, have insufficient capability for modeling long-range relationships across the whole image. The Transformer, originally proposed in the field of natural language processing, has been widely applied in computer vision with good results; how to exploit the respective advantages of convolution and the Transformer is therefore a problem well worth exploring.
Disclosure of Invention
The invention provides an image segmentation method based on dual attention fusion, realized by a dual-attention gated fusion network segmentation model.
The invention can be realized by the following technical scheme:
An image segmentation method based on dual attention fusion comprises the following steps:
S1, constructing a data set according to a target task, wherein the data set comprises a plurality of image samples with pixel-level labels for a specific target;
S2, constructing a segmentation network model and inputting the image samples in the data set into the constructed segmentation network model for training,
wherein the segmentation network model adopts a U-shaped structure and comprises an encoding module, a dual-attention gated fusion module and a decoding module; the encoding module is used for encoding an input image to obtain an initial feature map; the dual-attention gated fusion module comprises a CNN-based multi-scale weighted channel attention branch and a Transformer-based global spatial self-attention branch, fuses the features extracted by the two branches through a gating mechanism module, and is used for acquiring a final feature map; the decoding module is used for decoding the final feature map to obtain a segmented image;
the multi-scale weighted channel attention branch is used for extracting inter-class response features to improve classification accuracy and acquire a multi-scale feature map, and the global spatial self-attention branch is used for extracting long-distance dependency features to improve localization accuracy and acquire a global feature map;
S3, inputting the image data to be segmented into the trained segmentation network model to obtain a high-precision segmentation result of the target image.
Further, the multi-scale weighted channel attention branch comprises a multi-scale convolution operation and a weighted channel attention operation in series,
for an input initial feature map, on the one hand a 1×1 convolution first reduces the number of channels to 1/8 of the original, generating feature map t1; on the other hand, after the initial feature map is halved in size and channel number, 3×3, 5×5 and 7×7 convolutions are performed in parallel, each compressing the channel number to 1/8 of the initial feature map, yielding three further feature maps t2, t3 and t4, for a total of four feature maps t1, t2, t3 and t4;
SE channel attention operations are then performed on the four feature maps t1, t2, t3 and t4 respectively to obtain four groups of channel attention coefficients; four trainable weight values are assigned to the four groups of coefficients respectively to obtain four groups of weighted channel attention coefficients, which are concatenated and normalized by a Softmax function; the resulting channel attention coefficients are then multiplied channel-wise onto a combined feature map formed by concatenating the four feature maps t1, t2, t3 and t4, and a further convolution finally yields a multi-scale feature map with 1/2 of the original channel number, realizing multi-scale information extraction from the initial feature map.
Further, the global spatial self-attention branch includes a plurality of Transformer attention blocks in series, each Transformer attention block including a layer normalization module, multi-head self-attention (MSA), a multi-layer perceptron (MLP) and a residual module.
Further, twelve Transformer attention blocks are provided, the 5×5 convolution uses four groups, and the 7×7 convolution uses eight groups.
Furthermore, the gating mechanism module adopts a GRU (gated recurrent unit) structure comprising a reset gate and an update gate, and uses a sigmoid function as the control activation function; it enhances the beneficial features of the input multi-scale feature map and global feature map and suppresses adverse factors, so as to fully fuse the feature information, remove repeated redundant information, and obtain the final feature map.
Furthermore, the encoding module uses the residual convolution blocks of a ResNet-50 network as its main structure, with several convolution and pooling operations inserted as connections;
the decoding module comprises three identical upsampling blocks and a segmentation map output block, where each upsampling block comprises one 2× upsampling, one concatenation with the same-scale encoder feature, and two convolution operations, the latter of which halves the number of channels of the feature map; the segmentation map output block comprises a convolution operation, whose number of output channels equals the number of classes to be segmented, and a 2× upsampling operation.
The beneficial technical effects of the invention are as follows:
(1) The proposed dual-attention gated fusion segmentation network model adopts a U-shaped encoding-decoding structure, with the dual-attention gated fusion module added between the encoding module and the decoding module at the bottom of the model. This greatly improves the feature extraction capability of the model, and the high-precision segmentation result of an image is output directly at the end of the network, greatly improving the accuracy of automatic segmentation.
(2) The different branches in the dual-attention gated fusion module perform different feature extraction roles. The multi-scale weighted channel attention branch adjusts the importance of multi-scale features using learnable weight parameters, effectively extracting inter-class response features from the feature map to improve classification accuracy; the Transformer global spatial self-attention branch, formed by stacking several Transformer self-attention blocks, effectively extracts long-distance dependencies in the feature map to improve localization accuracy.
(3) For the feature maps obtained from the different branches of the dual-attention gated fusion module, the gated recurrent unit (GRU) can effectively fuse the high-level features from the two branches: the GRU's update gate determines how much of the low-level and high-level features enters the next stage so as to emphasize important information, and the reset gate forgets information unfavorable to segmentation, effectively avoiding redundancy in the feature information.
(4) The powerful encoding module and the dual-attention gated fusion module jointly produce high-level features carrying both global and multi-scale information. Decoding is performed on these high-level features, and in the decoding stage they are concatenated with the same-scale features from the encoding module, enriching the decoded features again. This is ultimately reflected in the segmentation result and effectively improves the segmentation accuracy of the network.
Drawings
FIG. 1 is a flow chart illustrating an image segmentation method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the overall structure of the dual-attention gated fusion segmentation network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the dual-attention gated fusion module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the Transformer global self-attention module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the gated recurrent unit (GRU) according to an embodiment of the invention;
FIG. 6 is a diagram comparing the segmentation results of an embodiment of the present invention with those of other algorithms.
Detailed Description
The following detailed description of the invention refers to the accompanying drawings and preferred embodiments.
The present invention provides an image segmentation method with dual-attention gated fusion. In a preferred embodiment, the overall flow of the method is shown in FIG. 1, and the method comprises the following steps:
S1, constructing a data set according to a target task, wherein the data set comprises a plurality of image samples with pixel-level labels for a specific target; the samples may be three-dimensional images. The sample data are preprocessed according to specific requirements and split in a certain proportion to construct a training data set and a validation data set.
In this embodiment, an abdominal multi-organ dataset is selected, comprising a plurality of previously labeled three-dimensional organ scan images with manually segmented label maps, together with multi-organ scan originals for automatic segmentation; the manually segmented label maps divide each image into 9 region classes: aorta, gall bladder, left kidney, right kidney, liver, pancreas, spleen, stomach, and background.
When building the training and validation sets, each original three-dimensional scan is sliced at equal intervals along the axial plane, yielding several two-dimensional slices per volume; the slices are then cropped at a fixed scale, randomly rotated and randomly flipped with a certain probability, and finally normalized to realize data augmentation. In this embodiment, the full dataset is randomly split into a training set and a validation set at a sample ratio of 8:2, where the training set is used only in training iterations and the validation set is used only to test accuracy in the model screening stage.
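The following is a minimal Python sketch of this slicing, augmentation and splitting step. The center-crop choice, the rotation/flip probabilities of 0.5, and the function names are illustrative assumptions; in practice the same geometric transforms must also be applied to the corresponding label map.

```python
import numpy as np

def volume_to_slices(volume):
    """Slice a 3D scan of shape (D, H, W) into 2D slices along the axial plane."""
    return [volume[i] for i in range(volume.shape[0])]

def augment_slice(img, crop=224, p=0.5, rng=None):
    """Fixed-scale crop (assumed center crop), random rotation/flip, then normalization."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    out = img[top:top + crop, left:left + crop]
    if rng.random() < p:                                  # random 90-degree rotation
        out = np.rot90(out, k=int(rng.integers(1, 4)))
    if rng.random() < p:                                  # random horizontal/vertical flip
        out = np.flip(out, axis=int(rng.integers(0, 2)))
    return (out - out.mean()) / (out.std() + 1e-8)        # zero-mean, unit-variance

def split_dataset(samples, rng=None):
    """Random 8:2 split by sample count into training and validation sets."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(samples))
    cut = int(0.8 * len(samples))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]
```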
S2, constructing a segmentation network model based on dual-attention gated fusion, comprising an encoding module, a dual-attention gated fusion module and a decoding module, wherein the dual-attention gated fusion module comprises a CNN-based multi-scale weighted channel attention branch and a Transformer-based global spatial self-attention branch whose features are fused through a gating mechanism module.
As shown in FIG. 2, the segmentation network model adopts a U-shaped encoding-decoding structure, with the dual-attention gated fusion module added at the bottom of the model. The encoding module mainly uses the residual convolution blocks of a ResNet-50 network as its main structure, with several convolution and pooling operations inserted as connections. The dual-attention gated fusion module comprises a CNN-based multi-scale weighted channel attention branch and a Transformer-based global spatial self-attention branch whose features are fused through a gating mechanism module. The decoding module consists of three identical upsampling blocks and a segmentation map output block. The multi-scale weighted channel attention branch extracts inter-class response features to improve classification accuracy and produces a multi-scale feature map, while the global spatial self-attention branch extracts long-distance dependency features to improve localization accuracy and produces a global feature map.
In this embodiment, after an input 224×224×3 two-dimensional image slice enters the encoding module, the following four stages of operations are performed:
The first stage consists of two downsampling operations. The first downsampling is realized by a downsampling convolution block formed by a stride-2 convolution, a Group Normalization operation and a ReLU activation function in series, producing a 112×112×64 output; the second downsampling is a max pooling operation, producing a 56×56×64 output.
The second stage consists of two residual block operations at the same scale. Each residual block has two branches: one branch is an identity connection; the other is formed by three convolution blocks in series, where the first and second each consist of a stride-1 convolution, a Group Normalization operation and a ReLU activation function in series, and the third consists only of a stride-1 convolution and a Group Normalization operation in series. The first convolution block compresses the number of channels of the feature map and the third restores the original number; finally the two branches are added and activated by a ReLU function.
The first residual block operation expands the number of channels by a factor of 4 without changing the feature map size, giving a 56×56×256 feature map; the second residual block does not change the shape of the feature map.
The third and fourth stages each consist of one downsampling residual block operation and one same-scale residual block operation. The downsampling residual block has two branches: one branch is a downsampling convolution block formed by a stride-2 convolution and a Group Normalization operation in series; the other is formed by three convolution blocks in series, where the first consists of a stride-1 convolution, a Group Normalization operation and a ReLU activation function, the second consists of a stride-2 convolution, a Group Normalization operation and a ReLU activation function, and the third consists only of a stride-1 convolution and a Group Normalization operation. The first convolution block compresses the number of channels of the feature map and the third restores the original number; finally the two branches are added and output through a ReLU activation function.
Thus, the feature map has shape 28×28×512 after the third stage and 14×14×1024 after the fourth stage, yielding an initial feature map with 1024 channels that has been downsampled four times relative to the original input (1/16 resolution).
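A minimal PyTorch sketch of the residual blocks described above. The 1×1/3×3/1×1 kernel pattern and the group count of 32 are assumptions consistent with a standard ResNet-50 bottleneck; the text specifies only the strides and the Group Normalization/ReLU placement.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual block: 1x1 conv compresses channels, 3x3 conv (stride 2 when
    downsampling), 1x1 conv restores channels, each with GroupNorm; the shortcut
    is identity at the same scale, or a strided 1x1 conv + GroupNorm otherwise."""
    def __init__(self, in_ch, out_ch, stride=1, groups=32):
        super().__init__()
        mid = out_ch // 4                      # channel compression inside the block
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.GroupNorm(groups, mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, bias=False),
            nn.GroupNorm(groups, mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.GroupNorm(groups, out_ch),
        )
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:     # downsampling convolution block on the shortcut
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.GroupNorm(groups, out_ch),
            )
    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))
```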
The multi-scale weighted channel attention branch of the dual-attention gated fusion module is shown in FIG. 3. In this embodiment, after the 14×14×1024 initial feature map enters the multi-scale weighted channel attention branch, on the one hand a 1×1 convolution produces a 14×14×128 feature map t1; on the other hand, a downsampling convolution first produces a 7×7×512 feature map, to which a 3×3 convolution, a 5×5 grouped convolution with four groups, and a 7×7 grouped convolution with eight groups are applied respectively to obtain feature maps t2, t3 and t4. SE channel attention operations are then applied to t1, t2, t3 and t4 respectively, giving four groups of channel attention coefficients c1, c2, c3 and c4; learnable weight coefficients w1, w2, w3 and w4 are assigned to the four groups, giving new channel attention coefficients w1c1, w2c2, w3c3 and w4c4, which are concatenated and normalized by a Softmax function into a 1×512 channel attention coefficient c;
on the other side, the four feature maps t1, t2, t3 and t4 are concatenated into a 14×14×512 combined feature map; the channel attention coefficient c and the combined feature map undergo a channel-wise multiplication, finally producing the 14×14×512 multi-scale feature map output by the multi-scale weighted channel attention branch.
For the global self-attention branch of the dual-attention gated fusion module, after the 14×14×1024 initial feature map enters the branch, a convolution operation and a reshape operation first reconstruct it into 196 feature vectors of dimension 768, position encoding is added, and the vectors are input into a stacked sequence of Transformer attention blocks. Inside each Transformer attention block the feature vectors undergo the processing shown in FIG. 4, modeling global long-distance dependencies. In this embodiment, twelve Transformer attention blocks are applied in sequence, and the resulting feature vectors are finally reconstructed into a 14×14×512 global feature map.
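A minimal sketch of one such Transformer attention block, assuming the standard pre-norm arrangement and the ViT-Base settings of 12 heads with token dimension 768 (the text names only the layer normalization, MSA, MLP and residual components):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm Transformer block: LayerNorm -> multi-head self-attention (MSA)
    with residual, then LayerNorm -> MLP with residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
    def forward(self, x):                       # x: (B, 196, dim) token sequence
        h = self.norm1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]   # residual connection
        return x + self.mlp(self.norm2(x))                 # residual connection
```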
The gating mechanism module of the dual-attention gated fusion module adopts a GRU (gated recurrent unit) structure with two main gates: a "reset gate" controlling the state information of one input and an "update gate" controlling the information of the other, both using a sigmoid function as the control activation function. Through these two gates, the GRU module can flexibly control the flow of feature information, enhancing the beneficial features of the input multi-scale feature map and global feature map and suppressing adverse factors, so that the feature information is fully fused and repeated redundant information is removed to obtain the final feature map. Concretely, the 14×14×512 feature maps produced by the multi-scale weighted channel attention branch and the global self-attention branch are input into the gated recurrent unit (GRU) shown in FIG. 5; the update gate in the GRU structure determines the information flow entering the next stage so as to emphasize important feature information, and the reset gate discards information unfavorable to segmentation, effectively avoiding redundancy in the feature information and outputting a final 14×14×512 feature map of the same shape as the inputs.
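A sketch of this gated fusion, assuming the multi-scale map acts as the GRU input and the global map as the hidden state, with 1×1 convolutions standing in for the GRU's linear maps so the gates operate per spatial location (the text names only the two gates and the sigmoid activation):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """GRU-style fusion of the two 14x14x512 branch outputs."""
    def __init__(self, ch=512):
        super().__init__()
        self.update = nn.Conv2d(2 * ch, ch, 1)   # update gate z
        self.reset = nn.Conv2d(2 * ch, ch, 1)    # reset gate r
        self.cand = nn.Conv2d(2 * ch, ch, 1)     # candidate state
    def forward(self, x, h):                     # x: multi-scale map, h: global map
        z = torch.sigmoid(self.update(torch.cat([x, h], dim=1)))  # how much new info passes
        r = torch.sigmoid(self.reset(torch.cat([x, h], dim=1)))   # forgets unhelpful info
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n               # fused 14x14x512 final feature map
```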
The decoding module of the network model decodes the 14×14×512 high-level semantic feature map output by the dual-attention gated fusion module. It consists of three identical upsampling blocks and a segmentation map output block; each upsampling block comprises one 2× upsampling, one concatenation with the same-scale encoder feature, and two convolution operations, the latter of which halves the number of channels of the feature map. In this embodiment, the 14×14×512 feature map is upsampled four times in total, finally restoring a 224×224×64 map, and one convolution changes the number of channels to 9, the number of classes to be segmented, giving the final 224×224×9 segmentation map.
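A PyTorch sketch of the decoder blocks; bilinear upsampling and 3×3 kernels are assumptions, while the channel halving, skip concatenation and 9-class output follow the text.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """2x upsampling, concatenation with the same-scale encoder feature, then
    two convs, the second of which halves the channel count."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch + skip_ch, in_ch, 3, padding=1),
                                   nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                   nn.ReLU(inplace=True))
    def forward(self, x, skip):
        return self.conv2(self.conv1(torch.cat([self.up(x), skip], dim=1)))

class OutBlock(nn.Module):
    """Output block: 1x1 conv to the 9 class channels, then a final 2x upsampling."""
    def __init__(self, in_ch=64, n_classes=9):
        super().__init__()
        self.head = nn.Conv2d(in_ch, n_classes, 1)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
    def forward(self, x):
        return self.up(self.head(x))             # (B, 9, 224, 224)

# Illustrative wiring with assumed encoder skips e1 (112x112x64), e2 (56x56x256), e3 (28x28x512):
# up1 = UpBlock(512, 512, 256); up2 = UpBlock(256, 256, 128); up3 = UpBlock(128, 64, 64)
```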
S3, after configuring the loss function, optimizer and training parameters for the model training stage, the training data set is input into the segmentation network for model training, and model screening is performed by evaluating performance on the validation data set to obtain an optimal image segmentation model.
In this embodiment, the input is a 224×224×3 slice. The initial learning rate of the model is set to 0.01 and gradually decays to 0.001 with increasing iterations according to a poly learning rate decay strategy; the batch size per iteration is set to 12, and the number of training epochs is 150. The optimizer is stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay coefficient of 1e-4; the loss function is a weighted sum of a Dice loss and a cross-entropy loss with both weights set to 0.5; the remaining network parameters follow conventional settings and are not repeated here. In this embodiment, after the model has trained for 100 epochs, a validation evaluation with the current checkpoint is performed once per training epoch and the evaluation accuracy is recorded; after all 150 epochs, the checkpoint with the highest validation index is taken as the optimal model weights. The final performance of the model can be judged using common evaluation metrics such as prediction accuracy, or the segmentation quality can be judged manually.
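A sketch of this training configuration; `model`, `train_loader`, `dice_loss` and `ce_loss` are assumed defined elsewhere, and the poly decay is written to interpolate from 0.01 to 0.001 with an assumed power of 0.9.

```python
import torch
from torch.optim import SGD

EPOCHS, BASE_LR, MIN_LR = 150, 0.01, 0.001

optimizer = SGD(model.parameters(), lr=BASE_LR, momentum=0.9, weight_decay=1e-4)

def poly_lr(epoch, power=0.9):
    """Poly decay from 0.01 toward 0.001 over 150 epochs (power is an assumption)."""
    return MIN_LR + (BASE_LR - MIN_LR) * (1 - epoch / EPOCHS) ** power

for epoch in range(EPOCHS):
    for g in optimizer.param_groups:           # apply the scheduled learning rate
        g['lr'] = poly_lr(epoch)
    for images, labels in train_loader:        # batch size 12
        logits = model(images)                 # (12, 9, 224, 224)
        loss = 0.5 * dice_loss(logits, labels) + 0.5 * ce_loss(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```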
S4, inputting the three-dimensional image data to be segmented into the optimal dual-attention gated fusion segmentation network to obtain a high-precision segmentation result of the target image.
After the same data preprocessing operations, the multi-organ scan originals for automatic segmentation are input into the optimal image segmentation network obtained through training to obtain segmentation results; FIG. 6 compares the image segmentation results of this method with those of other algorithms.
In this method, CNN and Transformer are applied in combination to improve segmentation performance: a U-shaped framework serves as the network backbone; multi-scale channel attention is obtained in a weighted manner so that the features contain more spatial context information; the Transformer global self-attention branch, built by stacking several Transformer blocks, is designed to extract long-distance dependencies; the gated dual attention module effectively combines the advantages of the multi-scale channel attention branch features and the Transformer global self-attention branch features, significantly enhancing features favorable to medical image segmentation while suppressing unfavorable information; finally, decoding is performed on the high-level features while the decoding stage concatenates them with same-scale features from the encoding module, enriching the decoded features again and effectively improving the segmentation accuracy of the network.
While particular embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are merely illustrative, and that many changes and modifications may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims.

Claims (6)

1. An image segmentation method based on dual attention fusion is characterized by comprising the following steps:
S1, constructing a data set according to a target task, wherein the data set comprises a plurality of image samples with pixel-level labels for a specific target;
S2, constructing a segmentation network model and inputting the image samples in the data set into the constructed segmentation network model for training,
wherein the segmentation network model adopts a U-shaped structure and comprises an encoding module, a dual-attention gated fusion module and a decoding module; the encoding module is used for encoding an input image to obtain an initial feature map; the dual-attention gated fusion module comprises a CNN-based multi-scale weighted channel attention branch and a Transformer-based global spatial self-attention branch, fuses the features extracted by the two branches through a gating mechanism module, and is used for acquiring a final feature map; the decoding module is used for decoding the final feature map to obtain a segmented image;
the multi-scale weighted channel attention branch is used for extracting inter-class response features to improve classification accuracy and acquire a multi-scale feature map, and the global spatial self-attention branch is used for extracting long-distance dependency features to improve localization accuracy and acquire a global feature map;
S3, inputting the image data to be segmented into the trained segmentation network model to obtain a high-precision segmentation result of the target image.
2. The dual attention fusion based image segmentation method as set forth in claim 1, wherein: the multi-scale weighted channel attention branch comprises a multi-scale convolution operation and a weighted channel attention operation in series,
for an input initial feature map, on the one hand a 1×1 convolution first reduces the number of channels to 1/8 of the original, generating feature map t1; on the other hand, after the initial feature map is halved in size and channel number, 3×3, 5×5 and 7×7 convolutions are performed in parallel, each compressing the channel number to 1/8 of the initial feature map, yielding three further feature maps t2, t3 and t4, for a total of four feature maps t1, t2, t3 and t4;
SE channel attention operations are then performed on the four feature maps t1, t2, t3 and t4 respectively to obtain four groups of channel attention coefficients; four trainable weight values are assigned to the four groups of coefficients respectively to obtain four groups of weighted channel attention coefficients, which are concatenated and normalized by a Softmax function; the resulting channel attention coefficients are then multiplied channel-wise onto a combined feature map formed by concatenating the four feature maps t1, t2, t3 and t4, and a further convolution finally yields a multi-scale feature map with 1/2 of the original channel number, realizing multi-scale information extraction from the initial feature map.
3. The dual attention fusion based image segmentation method as set forth in claim 2, wherein: the global spatial self-attention branch comprises a plurality of Transformer attention blocks connected in series, each Transformer attention block comprising a layer normalization module, multi-head self-attention (MSA), a multi-layer perceptron (MLP) and a residual module.
4. The dual attention fusion based image segmentation method as set forth in claim 3, wherein: twelve Transformer attention blocks are provided, the 5×5 convolution uses four groups, and the 7×7 convolution uses eight groups.
5. The dual attention fusion based image segmentation method as set forth in claim 1, wherein: the gating mechanism module adopts a GRU (gated recurrent unit) structure comprising a reset gate and an update gate, uses a sigmoid function as the control activation function, and enhances the beneficial features of the input multi-scale feature map and global feature map while suppressing adverse factors, so that the feature information is fully fused, repeated redundant information is removed, and the final feature map is obtained.
6. The dual attention fusion based image segmentation method as set forth in claim 1, wherein: the encoding module uses the residual convolution blocks of a ResNet-50 network as its main structure, with several convolution and pooling operations inserted as connections;
the decoding module comprises three identical upsampling blocks and a segmentation map output block, where each upsampling block comprises one 2× upsampling, one concatenation with the same-scale encoder feature, and two convolution operations, the latter of which halves the number of channels of the feature map; the segmentation map output block comprises a convolution operation, whose number of output channels equals the number of classes to be segmented, and a 2× upsampling operation.
CN202211633594.0A 2022-12-19 2022-12-19 Image segmentation method based on dual attention fusion Pending CN116012581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211633594.0A CN116012581A (en) 2022-12-19 2022-12-19 Image segmentation method based on dual attention fusion


Publications (1)

Publication Number Publication Date
CN116012581A true CN116012581A (en) 2023-04-25

Family

ID=86022302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211633594.0A Pending CN116012581A (en) 2022-12-19 2022-12-19 Image segmentation method based on dual attention fusion

Country Status (1)

Country Link
CN (1) CN116012581A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116258652A (en) * 2023-05-11 2023-06-13 四川大学 Text image restoration model and method based on structure attention and text perception
CN116258652B (en) * 2023-05-11 2023-07-21 四川大学 Text image restoration model and method based on structure attention and text perception
CN116403064A (en) * 2023-06-07 2023-07-07 苏州浪潮智能科技有限公司 Picture processing method, model, basic block structure, device and medium
CN116403064B (en) * 2023-06-07 2023-08-25 苏州浪潮智能科技有限公司 Picture processing method, system, equipment and medium
CN117372565A (en) * 2023-12-06 2024-01-09 合肥锐视医疗科技有限公司 Respiration gating CT imaging method based on neural network time phase discrimination
CN117372565B (en) * 2023-12-06 2024-03-15 合肥锐视医疗科技有限公司 Respiration gating CT imaging method based on neural network time phase discrimination
CN117726954A (en) * 2024-02-09 2024-03-19 成都信息工程大学 Sea-land segmentation method and system for remote sensing image
CN117726954B (en) * 2024-02-09 2024-04-30 成都信息工程大学 Sea-land segmentation method and system for remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination