CN116416434A - Medical image segmentation method based on Swin Transformer fusing multi-scale features and a multi-attention mechanism - Google Patents
Medical image segmentation method based on Swin Transformer fusing multi-scale features and a multi-attention mechanism Download PDF Info
- Publication number
- CN116416434A CN116416434A CN202310429066.1A CN202310429066A CN116416434A CN 116416434 A CN116416434 A CN 116416434A CN 202310429066 A CN202310429066 A CN 202310429066A CN 116416434 A CN116416434 A CN 116416434A
- Authority
- CN
- China
- Prior art keywords
- layer
- input
- module
- Swin Transformer
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/03—Recognition of patterns in medical or anatomical images
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
Title: Medical image segmentation method based on Swin Transformer fusing multi-scale features and a multi-attention mechanism. Abstract: The invention provides a medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism, comprising: S1, building an encoding module with a plurality of downsampling layers based on a Swin Transformer network, and continuously downsampling a sample medical image to obtain sample image features at four different scales; S2, multiplicatively fusing the four scales of sample image features to obtain four multi-scale fused sample image features; S3, building a decoder with a plurality of upsampling layers based on an attention module combining spatial and channel attention, and continuously upsampling and decoding the four fused sample image features to obtain the corresponding segmentation result of the sample image. The invention uses the self-attention mechanism in the Swin Transformer module to exploit context information quickly and fully, fuses multi-scale features to obtain a perception of comprehensive information, and combines multiple attention mechanisms to suppress irrelevant features and enhance relevant features, thereby improving medical image segmentation accuracy.
Description
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism.
Background
With the advancement and innovation of medical technology, medical image analysis has become an indispensable tool and technical means in medical research and clinical diagnosis. Digital image segmentation techniques, such as artificial-neural-network-based and knowledge-based segmentation algorithms, are increasingly used in medical analysis. Medical image segmentation can effectively delineate human tissues, organs and lesions and at the same time estimate their size and extent, supporting subsequent medical work. By means of segmentation, the relevant images can be usefully decomposed and understood, which also facilitates mutual registration and fusion of the images.
Existing medical image segmentation methods mainly take the fully convolutional U-Net architecture as their backbone, and U-Net-based convolutional neural networks dominate the field. U-Net is a typical encoder-decoder model whose bilaterally symmetric structure consists of three parts: a contracting (encoding) path on the left side of the network, an expanding (decoding) path on the right side, and skip connections. The contracting path is a classical convolutional neural network that extracts input image features. The expanding path upsamples the high-dimensional feature maps through transposed convolution, restoring feature-map resolution while halving the number of channels. The skip connections fuse the multi-scale features obtained in the encoder and compensate for the information lost during downsampling. To alleviate the semantic gap between encoder and decoder layers at the skip connections, the U-Net++ architecture was developed for medical segmentation, offering higher accuracy and faster convergence. U-Net and U-Net++, however, do not fully explore information at all scales and cannot clearly capture organ positions and boundaries, which led to the development and application of U-Net3+. Meanwhile, architectures such as Res-UNet, Attention-UNet, Dense-UNet and U2-Net have been developed on the basis of U-Net for medical image segmentation.
Convolutional neural networks have achieved good results in medical image segmentation. However, the inherent inductive bias of convolution hinders further improvement of segmentation networks: locality and weight sharing prevent a purely convolutional network from grasping global information. A Transformer can capture long-range dependencies in the feature maps and thus compensates for this weakness. The Transformer was originally designed for natural language processing and translation tasks and is now also used in image processing; its attention mechanism grasps image information globally. The Swin Transformer uses shifted windows to compute self-attention within non-overlapping local windows while allowing cross-window connections, which greatly reduces the computation of the traditional Transformer that attends over the entire image. The present invention uses a Swin Transformer network as the encoding part of a medical image segmentation model, adds multi-scale feature fusion, and then decodes with spatial and channel attention modules, forming a medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism.
Disclosure of Invention
To solve the problem that the convolutional neural networks of traditional medical image segmentation grasp overall information insufficiently due to locality and weight sharing, the invention provides a medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism, adopting the following technical solution:
the utility model provides a medical image segmentation method based on a Swin transform fusion multi-scale feature and a multi-attention mechanism, which is characterized by comprising the following steps: step S1, a coding module with a plurality of downsampling layers is established based on a Swin transform network, and continuous downsampling is carried out on a sample medical image, so that sample image features with four different scales are obtained; step S2, carrying out multiplication fusion based on the four sample image features with different scales to obtain four sample image features after multi-scale fusion; step S3, a decoder with a plurality of upsampling layers is built based on an attention module combining the space and the channel; and carrying out continuous up-sampling coding based on the four fused sample image features to obtain a corresponding segmentation result of the sample image.
The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism provided by the invention may further have the feature that, in the encoder, the first encoding layer consists of a Patch Partition layer, a Linear Embedding layer and two Swin Transformer blocks; the second encoding layer consists of a Patch Merging layer and two Swin Transformer blocks; the third encoding layer consists of a Patch Merging layer and six Swin Transformer blocks; the fourth encoding layer consists of a Patch Merging layer and two Swin Transformer blocks. Patch Partition divides the input image into a number of patches; Linear Embedding maps the input to an arbitrary dimension; Patch Merging reduces resolution and enlarges the receptive field, similar to a pooling layer in a convolutional neural network.
The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism provided by the invention may further have the feature that, in the Swin Transformer based encoding module, the Swin Transformer network is computed as:

Ẑ^l = W-MSA(LN(Z^(l-1))) + Z^(l-1)

Z^l = MLP(LN(Ẑ^l)) + Ẑ^l

Ẑ^(l+1) = SW-MSA(LN(Z^l)) + Z^l

Z^(l+1) = MLP(LN(Ẑ^(l+1))) + Ẑ^(l+1)

where Z^(l-1) denotes the input features of the l-th Swin Transformer module; Ẑ^l denotes the output of the l-th layer W-MSA; Z^l denotes the output features of the l-th Swin Transformer module, which are also the input features of layer l+1; Ẑ^(l+1) denotes the output of the (l+1)-th layer SW-MSA; Z^(l+1) denotes the output features of the (l+1)-th Swin Transformer module; LN denotes layer normalization and MLP the multilayer perceptron. W-MSA is the window-partitioned multi-head self-attention (MSA) layer and SW-MSA is the MSA layer with shifted windows, where MSA computes attention globally within its scope; the attention formula is:

Attention(Q, K, V) = SoftMax(QK^T/√d + B)V

where Q, K and V are obtained by multiplying the input features by three matrices; Q represents the information to query, K the information queried against, and V the values obtained by the query; d denotes the channel dimension of the partitioned features; and the values in B come from a bias matrix.
The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism provided by the invention may further have the feature that, in step S1, the sample medical image is encoded four times by the encoder to output sample image features at four different scales; the channel dimensions of the four groups of sample image features are 128, 256, 512 and 1024 respectively, and the spatial resolutions are 96, 48, 24 and 12 respectively; the sample image feature maps at the four scales produced by the successive encoding layers are denoted d_i, i = 1, 2, 3, 4.
The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism provided by the invention may further have the feature that, in step S2, the four scales of sample image features output by the encoder are multiplicatively fused to obtain four multi-scale fused sample image features; the four fused image features are denoted e_i, i = 1, 2, 3, 4; the fused sample image features are expressed as:
e_1 = g_1(d_1) × g_2(d_2) × g_3(d_3) × g_4(d_4)

e_2 = g_2(d_2) × g_3(d_3) × g_4(d_4)

e_3 = g_3(d_3) × g_4(d_4)

e_4 = g_4(d_4)
where g_i(d_i), i = 1, 2, 3, 4, denotes the feature transformation applied to each scale: g_i uses a 1×1 convolution to reduce the number of channels and upsampling to restore resolution, and the scales are fused by element-wise multiplication.
The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism provided by the invention may further have the feature that, in step S3, the attention module combining spatial and channel attention (SCSE) in the decoder is mainly formed by combining a channel attention module (CSE) and a spatial attention module (SSE).
the channel attention module (CSE) firstly carries out global pooling on the input feature map to obtain a pooling vector V; then inputting V into the full connection layer FC1, and performing ReLU activation to obtain a vector V mid The method comprises the steps of carrying out a first treatment on the surface of the Will V mid Putting into the full connection layer FC2, and then performing Sigmoid activation to obtain V c The method comprises the steps of carrying out a first treatment on the surface of the Finally, input image and V c Re-weighting the feature dimensions according to the space domain to form an output feature map U of the CSE CSE The method comprises the steps of carrying out a first treatment on the surface of the Feature map V in channel attention Module (CES) c The expression is:
V c =Sigmoid(W 2 (ReLU(W 1 Pooling(Input))))
in which W is 1 Being fully-connected to layer FC1Weight, W 2 The Pooling is the Pooling layer, which is the weight of the fully connected layer FC 2.
The spatial attention module (SSE) first convolves the input with a single 1×1 convolution kernel and then applies a Sigmoid activation to obtain a feature map Q; finally, the input is re-weighted by Q over the spatial domain to form the SSE output feature map. The feature map Q in the spatial attention module (SSE) is expressed as:

Q = Sigmoid(Conv(Input))

where Conv is a 1×1 convolution.
The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism provided by the invention may further have the feature that, in step S3, a decoder with a plurality of upsampling layers is built based on the spatial and channel attention module; the decoder comprises four decoding layers and consists of decoding blocks, upsampling blocks and concatenation operations; the feature maps decoded by the four decoder layers are denoted w_i, i = 1, 2, 3, 4. The decoding blocks comprise two kinds, decoding block 1 and decoding block 2. Decoding block 1 first applies a spatial and channel attention module (SCSE), then performs two 3×3 convolution operations, and finally applies an SCSE module once more; decoding block 2 first applies an SCSE module, then a 3×3 convolution, an upsampling and a 1×1 convolution, and finally an SCSE module. The expressions of decoding block 1 and decoding block 2 are:

Decoder1(Input) = SCSE(Conv1(SCSE(Input)))

Decoder2(Input) = SCSE(Conv3(Upsample(Conv2(SCSE(Input)))))

where Conv1 denotes two 3×3 convolutions, Conv2 denotes one 3×3 convolution, Conv3 denotes one 1×1 convolution, and Upsample denotes upsampling.
Among the decoding layers, the first layer feeds the fused feature map e_4 into decoding block 1 to obtain the decoded feature map w_1; the second layer takes w_1 as input, upsamples it once, concatenates the result with feature map e_3, and finally passes it through decoding block 1 to obtain the decoded feature map w_2; the third layer takes w_2 as input, upsamples it once, concatenates the result with feature map e_2, and finally passes it through decoding block 1 to obtain the decoded feature map w_3; the fourth layer takes w_3 as input, upsamples it once, concatenates the result with feature map e_1, and finally passes it through decoding block 2 to obtain the decoded feature map w_4. The decoded w_i, i = 1, 2, 3, 4, are expressed as:

w_1 = Decoder1(e_4)

w_2 = Decoder1(Concat(e_3, Upsample(w_1)))

w_3 = Decoder1(Concat(e_2, Upsample(w_2)))

w_4 = Decoder2(Concat(e_1, Upsample(w_3)))

where Concat denotes the concatenation operation and Upsample denotes upsampling.
The technical solution provided by the invention has the following functions and effects:
the utility model provides a medical image segmentation method based on a Swin transform fusion multi-scale feature and a multi-attention mechanism, which is based on a medical image segmentation model of a coder of the Swin transform, a decoder of the multi-scale feature fusion and combination space and channel attention module.
In the Swin Transformer based encoder, the self-attention mechanism in the Swin Transformer module can make full use of context information quickly and establish long-range dependencies. The encoder not only overcomes the small receptive field and locality bias of traditional neural networks but also, compared with the traditional Transformer model, adds the concept of shifted windows, greatly reducing the amount of computation.
In the multi-scale fusion, the fusion model adopted by the invention fuses the four feature maps of different scales obtained from the four encoder layers with higher-level semantic features, so that the network model attends to shallow and deep features simultaneously and thereby obtains a perception of comprehensive features. Multiplication is used during feature fusion because it amplifies the difference between noisy and normal regions, reducing the interference of shallow noise in the fusion process.
In the decoder combining the spatial and channel attention module (SCSE), the input of each decoder layer is the concatenation of the laterally connected fused feature map from the same-level encoder and the output feature map of the previous decoder layer, combining the richer detail of the high-resolution feature maps with the stronger semantics of the low-resolution ones. In the decoder, the SCSE module is mainly formed by combining the channel attention module (CSE) and the spatial attention module (SSE). The CSE module works on the channel dimension: it learns the importance of each feature channel and assigns each a corresponding weight, exposing how important each channel is, and thereby suppresses irrelevant features and enhances relevant ones from the channel perspective. The SSE module works on the spatial dimension: it transforms the information in the image from one space to another and assigns each spatial position a weight according to its importance, so that during segmentation the region of interest is enhanced while the background and irrelevant regions are suppressed.
Drawings
Fig. 1 is a flowchart of the steps of the medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism according to an embodiment of the invention.
Fig. 2 is a model architecture diagram of the medical image segmentation network according to an embodiment of the invention.
Fig. 3 is a schematic structural diagram of the Swin Transformer module according to an embodiment of the invention.
Fig. 4 is a schematic structural diagram of the multi-scale feature fusion module according to an embodiment of the invention.
Fig. 5 is a schematic structural diagram of the spatial and channel attention (SCSE) module according to an embodiment of the invention.
Fig. 6 is a schematic structural diagram of the Decoder Block 1 module according to an embodiment of the invention.
Fig. 7 is a schematic structural diagram of the Decoder Block 2 module according to an embodiment of the invention.
Detailed description of the preferred embodiments
To make the objects, technical solutions, creative features and advantages of the invention clearer, the medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism is described in further detail below with reference to an embodiment and the drawings. It should be understood that the specific embodiments described here are for illustration only and are not intended to limit the scope of the invention.
<Example>
Fig. 1 is a flowchart of the steps of the medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism in an embodiment of the invention, and fig. 2 is the model architecture diagram of the medical image segmentation network in the embodiment.
As shown in fig. 1 and 2, the medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism comprises the following steps:
step S1, a coding module with a plurality of downsampling layers is established based on a Swin converter network, and sample medical images are subjected to continuous downsampling, so that sample image features with four different scales are obtained.
In the embodiment, the encoder is used to extract image features and the decoder is used to restore image resolution, enabling pixel-level segmentation. The encoder-decoder structure includes a 4-layer encoder, a 4-layer decoder, and an intermediate multi-scale fusion module. The encoder consists of Patch Partition, Linear Embedding, Swin Transformer and Patch Merging layers. The decoder consists of spatial and channel attention modules (SCSE), convolutional layers, upsampling layers, and feature concatenation.
In the encoder, the first encoding layer consists of a Patch Partition layer, a Linear Embedding layer and two Swin Transformer blocks; the second encoding layer consists of a Patch Merging layer and two Swin Transformer blocks; the third encoding layer consists of a Patch Merging layer and six Swin Transformer blocks; the fourth encoding layer consists of a Patch Merging layer and two Swin Transformer blocks. Patch Partition divides the input image into a number of patches; Linear Embedding maps the input to an arbitrary dimension; Patch Merging reduces resolution and enlarges the receptive field, similar to a pooling layer in a convolutional neural network.
In the encoder, the structure of the Swin Transformer encoding module is shown schematically in fig. 3; it is composed of two consecutive Swin Transformer blocks. The former block includes a window-based MSA layer and an MLP layer connected in series; an LN layer is placed before both the window-based MSA layer and the MLP layer, and a residual connection is applied after each. The latter block includes a shifted-window-based MSA layer and an MLP layer connected in series; an LN layer is likewise placed before both, and a residual connection is applied after each. The computation of the Swin Transformer network is:
Ẑ^l = W-MSA(LN(Z^(l-1))) + Z^(l-1)

Z^l = MLP(LN(Ẑ^l)) + Ẑ^l

Ẑ^(l+1) = SW-MSA(LN(Z^l)) + Z^l

Z^(l+1) = MLP(LN(Ẑ^(l+1))) + Ẑ^(l+1)

where Z^(l-1) denotes the input features of the l-th Swin Transformer module; Ẑ^l denotes the output of the l-th layer W-MSA; Z^l denotes the output features of the l-th Swin Transformer module, which are also the input features of layer l+1; Ẑ^(l+1) denotes the output of the (l+1)-th layer SW-MSA; Z^(l+1) denotes the output features of the (l+1)-th Swin Transformer module. W-MSA is the window-partitioned multi-head self-attention (MSA) layer and SW-MSA is the MSA layer with shifted windows, where MSA computes attention globally within its scope; the attention formula is:

Attention(Q, K, V) = SoftMax(QK^T/√d + B)V

where Q, K and V are obtained by multiplying the input features by three matrices; Q represents the information to query, K the information queried against, and V the values obtained by the query; d denotes the channel dimension of the partitioned features; and the values in B come from a bias matrix.
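As a concrete illustration, the following is a minimal PyTorch sketch of such a pair of blocks (W-MSA followed by SW-MSA). It is a simplified sketch rather than the patented implementation: the bias B is kept as a freely learned per-window parameter instead of a relative position bias table, the attention mask normally applied to shifted windows is omitted, and names such as WindowMSA and SwinBlockPair are illustrative.

```python
import torch
import torch.nn as nn

class WindowMSA(nn.Module):
    """Multi-head self-attention computed inside non-overlapping windows."""
    def __init__(self, dim, window, heads):
        super().__init__()
        self.window, self.heads = window, heads
        self.scale = (dim // heads) ** -0.5          # the 1/sqrt(d) factor
        self.qkv = nn.Linear(dim, dim * 3)           # produces Q, K, V
        self.proj = nn.Linear(dim, dim)
        # B: learned bias added to the attention logits of every window
        self.bias = nn.Parameter(torch.zeros(heads, window**2, window**2))

    def forward(self, x):                            # x: (B, H, W, C); H, W divisible by window
        b, h, wd, c = x.shape
        w = self.window
        # partition the feature map into w x w windows
        x = x.view(b, h // w, w, wd // w, w, c).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, c)                  # (num_windows * B, w*w, C)
        qkv = self.qkv(x).reshape(-1, w * w, 3, self.heads, c // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)         # each (nW*B, heads, w*w, C/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(-1, w * w, c)
        out = self.proj(out).view(b, h // w, wd // w, w, w, c)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, wd, c)

class SwinBlockPair(nn.Module):
    """W-MSA block then SW-MSA block: LN -> attention -> residual, LN -> MLP -> residual."""
    def __init__(self, dim, window=12, heads=4):
        super().__init__()
        self.shift = window // 2
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.wmsa = WindowMSA(dim, window, heads)
        self.swmsa = WindowMSA(dim, window, heads)
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2))

    def forward(self, z):                            # z: (B, H, W, C)
        z = self.wmsa(self.norms[0](z)) + z          # Z_hat^l = W-MSA(LN(Z^{l-1})) + Z^{l-1}
        z = self.mlps[0](self.norms[1](z)) + z       # Z^l = MLP(LN(Z_hat^l)) + Z_hat^l
        s = self.shift                               # shift by half a window, attend, shift back
        zs = torch.roll(self.norms[2](z), (-s, -s), (1, 2))
        z = torch.roll(self.swmsa(zs), (s, s), (1, 2)) + z   # Z_hat^{l+1} (mask omitted)
        z = self.mlps[1](self.norms[3](z)) + z       # Z^{l+1}
        return z
```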
The sample medical image is encoded four times by the encoder to output sample image features at four different scales. For an input 384×384×3 sample medical image, the channel dimensions of the four output groups of sample image features are 128, 256, 512 and 1024, and the spatial resolutions are 96, 48, 24 and 12. The sample image feature maps at the four scales produced by the successive encoding layers are denoted d_i, i = 1, 2, 3, 4. The features after each encoding layer are expressed as:

d_1 = Swin Transformer(Linear Embedding(Patch Partition(Input)))

d_2 = Swin Transformer(Patch Merging(d_1))

d_3 = Swin Transformer(Patch Merging(d_2))

d_4 = Swin Transformer(Patch Merging(d_3))
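Using the SwinBlockPair sketch above, the whole four-stage encoder can be wired up as below. Implementing Patch Partition plus Linear Embedding as one 4×4 strided convolution and Patch Merging as a 2×2 strided convolution is a simplifying assumption (the text describes them as separate splitting/mapping/merging steps); the channel widths and block counts follow this embodiment.

```python
import torch.nn as nn

class SwinEncoder(nn.Module):
    def __init__(self, dims=(128, 256, 512, 1024)):
        super().__init__()
        # Patch Partition + Linear Embedding, assumed fused into one strided conv
        self.embed = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        # Patch Merging, assumed as a 2x2 strided conv that doubles channels
        self.merge = nn.ModuleList(
            nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2) for i in range(3))
        depths = (1, 1, 3, 1)                     # pairs of blocks: 2, 2, 6, 2 blocks
        self.stages = nn.ModuleList(
            nn.Sequential(*[SwinBlockPair(dims[i]) for _ in range(depths[i])])
            for i in range(4))

    def forward(self, x):                         # x: (B, 3, 384, 384), NCHW
        d = []
        for i in range(4):
            x = self.embed(x) if i == 0 else self.merge[i - 1](x)
            x = x.permute(0, 2, 3, 1)             # to (B, H, W, C) for the Swin blocks
            x = self.stages[i](x)
            x = x.permute(0, 3, 1, 2)             # back to NCHW
            d.append(x)                           # d1..d4 at resolutions 96, 48, 24, 12
        return d
```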
Step S2: the four scales of sample image features are multiplicatively fused to obtain four multi-scale fused sample image features. The four differently scaled sample image features d_1, d_2, d_3, d_4 produced by the encoder are input into the multi-scale feature fusion module, shown in fig. 4, to obtain the four multi-scale fused sample image features, denoted e_i, i = 1, 2, 3, 4. The fused sample image features are expressed as:

e_1 = g_1(d_1) × g_2(d_2) × g_3(d_3) × g_4(d_4)

e_2 = g_2(d_2) × g_3(d_3) × g_4(d_4)

e_3 = g_3(d_3) × g_4(d_4)

e_4 = g_4(d_4)

where g_i(d_i), i = 1, 2, 3, 4, denotes the feature transformation applied to each scale: g_i uses a 1×1 convolution to reduce the number of channels and upsampling to restore resolution, and the scales are fused by element-wise multiplication. Fusing the four input feature maps of different scales with higher-level semantic features makes the network model attend to shallow and deep features simultaneously, yielding a perception of comprehensive features. Multiplication is used during fusion because it amplifies the difference between noisy and normal regions, reducing the interference of shallow noise in the fusion process.
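A minimal PyTorch sketch of this fusion module follows (NCHW tensors). It assumes every g_i projects to one common channel width with its 1×1 convolution and restores resolution with bilinear upsampling so that the element-wise products are well defined; the common width and the interpolation mode are assumptions not fixed by the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, in_chs=(128, 256, 512, 1024), out_ch=64):
        super().__init__()
        # g_i: 1x1 convolution reducing channels; upsampling later restores resolution
        self.reduce = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)

    def g(self, j, d, size):
        # transform d_j and resize it to the target scale
        return F.interpolate(self.reduce[j](d), size=size,
                             mode="bilinear", align_corners=False)

    def forward(self, d):                    # d = [d1, d2, d3, d4], fine to coarse, NCHW
        e = []
        for i in range(4):                   # e_i = product over j >= i of g_j(d_j)
            size = d[i].shape[-2:]           # target resolution of scale i
            prod = self.g(i, d[i], size)
            for j in range(i + 1, 4):
                prod = prod * self.g(j, d[j], size)
            e.append(prod)
        return e                             # [e1, e2, e3, e4]
```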
Step S3: a decoder with a plurality of upsampling layers is built based on an attention module combining spatial and channel attention, and the four fused sample image features are continuously upsampled and decoded to obtain the corresponding segmentation result of the sample image.
The decoder consists of spatial and channel attention modules (SCSE), convolutional layers, upsampling layers, and feature concatenation.
As shown in fig. 5, the spatial and channel attention module (SCSE) is mainly formed by combining the channel attention module (CSE) and the spatial attention module (SSE).
The channel attention module (CSE) first applies global pooling to the input feature map to obtain a pooling vector V; V is then fed into fully connected layer FC1 followed by ReLU activation to obtain a vector V_mid; V_mid is fed into fully connected layer FC2 followed by Sigmoid activation to obtain V_c; finally, the input feature map is re-weighted by V_c along the channel dimension to form the CSE output feature map U_CSE. The feature maps V_c and U_CSE in the channel attention module (CSE) are expressed as:

V_c(Input) = Sigmoid(W_2(ReLU(W_1 Pooling(Input))))

U_CSE(Input) = V_c(Input) ⊕ Input

where W_1 is the weight of fully connected layer FC1, W_2 is the weight of fully connected layer FC2, and Pooling is the pooling layer.
The spatial attention module (SSE) first convolves the input with a single 1×1 convolution kernel and then applies a Sigmoid activation to obtain a feature map Q; finally, the input is re-weighted by Q over the spatial domain to form the SSE output feature map. The feature maps Q, U_SSE and U_SCSE are expressed as:

Q(Input) = Sigmoid(Conv(Input))

U_SSE(Input) = Q(Input) ⊕ Input

U_SCSE(Input) = SCSE(Input) = U_CSE(Input) ⊕ U_SSE(Input)

where Conv is a 1×1 convolution.
From the channel and spatial perspectives respectively, the CSE and SSE modules assign weights according to the importance of each channel or position, suppressing irrelevant features and enhancing relevant ones. During segmentation, the spatial and channel attention module (SCSE) thus enhances the region of interest while suppressing the background and irrelevant regions.
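A hedged PyTorch sketch of the SCSE module follows: CSE re-weights channels from the pooling vector, SSE re-weights positions from the 1×1 convolution map, and the two branches are combined. Reading the ⊕ of the expressions above as element-wise combination, with the two re-weighted maps added at the end, is an assumption (it matches common SCSE implementations), as is the reduction ratio of the fully connected layers.

```python
import torch
import torch.nn as nn

class SCSE(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W_1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W_2
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)      # SSE 1x1 convolution

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        v = x.mean(dim=(2, 3))                   # global pooling -> V
        v_mid = torch.relu(self.fc1(v))          # FC1 + ReLU -> V_mid
        v_c = torch.sigmoid(self.fc2(v_mid))     # FC2 + Sigmoid -> V_c
        u_cse = x * v_c.view(b, c, 1, 1)         # channel-wise re-weighting -> U_CSE
        q = torch.sigmoid(self.conv(x))          # spatial map Q, shape (B, 1, H, W)
        u_sse = x * q                            # spatial re-weighting -> U_SSE
        return u_cse + u_sse                     # combined U_SCSE
```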
A decoder with a plurality of upsampling layers is built based on the spatial and channel attention module. The decoder comprises four decoding layers and consists of decoding blocks, upsampling blocks and concatenation operations; the feature maps decoded by the four decoder layers are denoted w_i, i = 1, 2, 3, 4.
As shown in figs. 6 and 7, the decoding blocks comprise two kinds, decoding block 1 and decoding block 2. Decoding block 1 first applies a spatial and channel attention module (SCSE), then performs two 3×3 convolution operations, and finally applies an SCSE module once more; decoding block 2 first applies an SCSE module, then a 3×3 convolution, an upsampling and a 1×1 convolution, and finally an SCSE module. The expressions of decoding block 1 and decoding block 2 are:

Decoder1(Input) = SCSE(Conv1(SCSE(Input)))

Decoder2(Input) = SCSE(Conv3(Upsample(Conv2(SCSE(Input)))))

where Conv1 denotes two 3×3 convolutions, Conv2 denotes one 3×3 convolution, Conv3 denotes one 1×1 convolution, and Upsample denotes upsampling.
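Reusing the SCSE sketch above, the two decoding blocks can be sketched as follows; the ReLU activations between the convolutions and the concrete channel widths are assumptions, since the expressions only fix the order of the operations.

```python
import torch.nn as nn

class DecoderBlock1(nn.Module):
    """SCSE -> two 3x3 convolutions (Conv1) -> SCSE."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.scse_in, self.scse_out = SCSE(in_ch), SCSE(out_ch)
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.scse_out(self.conv1(self.scse_in(x)))

class DecoderBlock2(nn.Module):
    """SCSE -> 3x3 convolution (Conv2) -> Upsample -> 1x1 convolution (Conv3) -> SCSE."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.scse_in, self.scse_out = SCSE(in_ch), SCSE(out_ch)
        self.conv2 = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv3 = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.scse_out(self.conv3(self.up(self.conv2(self.scse_in(x)))))
```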
Among the decoding layers, the first layer feeds the fused feature map e_4 into decoding block 1 to obtain the decoded feature map w_1; the second layer takes w_1 as input, upsamples it once, concatenates the result with feature map e_3, and finally passes it through decoding block 1 to obtain the decoded feature map w_2; the third layer takes w_2 as input, upsamples it once, concatenates the result with feature map e_2, and finally passes it through decoding block 1 to obtain the decoded feature map w_3; the fourth layer takes w_3 as input, upsamples it once, concatenates the result with feature map e_1, and finally passes it through decoding block 2 to obtain the decoded feature map w_4. The decoded w_i, i = 1, 2, 3, 4, are expressed as:

w_1 = Decoder1(e_4)

w_2 = Decoder1(Concat(e_3, Upsample(w_1)))

w_3 = Decoder1(Concat(e_2, Upsample(w_2)))

w_4 = Decoder2(Concat(e_1, Upsample(w_3)))

where Concat denotes the concatenation operation and Upsample denotes upsampling.
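Putting the pieces together, the four decoding layers wire up as sketched below, with three DecoderBlock1 instances, one DecoderBlock2, and a shared 2× upsampling layer; the channel width ch per fused map is illustrative and assumes all e_i share one channel count (as in the fusion sketch above).

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.dec1a = DecoderBlock1(ch, ch)
        self.dec1b = DecoderBlock1(2 * ch, ch)     # input is Concat(e_i, Upsample(w))
        self.dec1c = DecoderBlock1(2 * ch, ch)
        self.dec2 = DecoderBlock2(2 * ch, ch)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, e1, e2, e3, e4):
        w1 = self.dec1a(e4)                                   # w_1 = Decoder1(e_4)
        w2 = self.dec1b(torch.cat([e3, self.up(w1)], dim=1))  # Concat along channels
        w3 = self.dec1c(torch.cat([e2, self.up(w2)], dim=1))
        w4 = self.dec2(torch.cat([e1, self.up(w3)], dim=1))   # segmentation features
        return w4
```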
A medical image to be segmented is input into the model of this medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism; after passing through the encoder, the multi-scale feature fusion module and the decoder, the segmented image is output as w_4.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiment, and that the invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The present embodiment is therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of an embodiment, not every embodiment contains only one independent technical solution; this manner of description is for clarity only, and the specification should be taken as a whole, with the technical solutions of the embodiments combinable as appropriate to form other implementations apparent to those skilled in the art.
Claims (5)
1. A medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism, characterized by comprising the following steps:
step S1, building an encoding module with a plurality of downsampling layers based on a Swin Transformer network, and continuously downsampling a sample medical image to obtain sample image features at four different scales;
step S2, multiplicatively fusing the four scales of sample image features to obtain four multi-scale fused sample image features;
step S3, building a decoder with a plurality of upsampling layers based on an attention module combining spatial and channel attention, and continuously upsampling and decoding the four fused sample image features to obtain the corresponding segmentation result of the sample image.
2. The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism according to claim 1, characterized in that:
in step S1, the encoding module with a plurality of downsampling layers is built based on a Swin Transformer network; the encoder comprises 4 encoding layers and consists of Patch Partition, Linear Embedding, Swin Transformer and Patch Merging layers;
the first encoding layer consists of a Patch Partition layer, a Linear Embedding layer and two Swin Transformer blocks; the second encoding layer consists of a Patch Merging layer and two Swin Transformer blocks; the third encoding layer consists of a Patch Merging layer and six Swin Transformer blocks; the fourth encoding layer consists of a Patch Merging layer and two Swin Transformer blocks;
Patch Partition divides the input image into a plurality of patches;
Linear Embedding maps the input to an arbitrary dimension;
Patch Merging reduces resolution and enlarges the receptive field, similar to a pooling layer in a convolutional neural network;
in the Swin Transformer encoding module, the Swin Transformer network is computed as:

Ẑ^l = W-MSA(LN(Z^(l-1))) + Z^(l-1)

Z^l = MLP(LN(Ẑ^l)) + Ẑ^l

Ẑ^(l+1) = SW-MSA(LN(Z^l)) + Z^l

Z^(l+1) = MLP(LN(Ẑ^(l+1))) + Ẑ^(l+1)

in the above, Z^(l-1) denotes the input features of the l-th Swin Transformer module; Ẑ^l denotes the output of the l-th layer W-MSA; Z^l denotes the output features of the l-th Swin Transformer module as well as the input features of layer l+1; Ẑ^(l+1) denotes the output of the (l+1)-th layer SW-MSA; Z^(l+1) denotes the output features of the (l+1)-th Swin Transformer module;
W-MSA is the window-partitioned MSA layer; SW-MSA is the MSA layer with shifted windows;
MSA computes the attention mechanism globally, with the attention formula:

Attention(Q, K, V) = SoftMax(QK^T/√d + B)V

in the above, Q, K and V are obtained by multiplying the input features by three matrices; Q represents the information to query, K the information queried against, and V the values obtained by the query; d denotes the channel dimension of the partitioned features; the values in B come from the bias matrix;
the sample medical image is encoded four times by the encoder to output sample image features at four different scales; the channel dimensions of the four groups of sample image features are 128, 256, 512 and 1024 respectively, and the spatial resolutions are 96, 48, 24 and 12 respectively; the sample image feature maps at the four scales produced by the successive encoding layers are denoted d_i, i = 1, 2, 3, 4.
3. The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism according to claim 1, characterized in that:
in step S2, the four scales of sample image features are multiplicatively fused to obtain four multi-scale fused sample image features; the four fused image features are denoted e_i, i = 1, 2, 3, 4; the fused sample image features are expressed as:

e_1 = g_1(d_1) × g_2(d_2) × g_3(d_3) × g_4(d_4)

e_2 = g_2(d_2) × g_3(d_3) × g_4(d_4)

e_3 = g_3(d_3) × g_4(d_4)

e_4 = g_4(d_4)

in the above, g_i(d_i), i = 1, 2, 3, 4, denotes the feature transformation applied to each scale; g_i uses a 1×1 convolution to reduce the number of channels and upsampling to restore resolution; the scales are fused by element-wise multiplication.
4. The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism according to claim 1, characterized in that:
in step S3, the attention module combining spatial and channel attention is denoted SCSE and is mainly formed by combining a channel attention module (CSE) and a spatial attention module (SSE);
the channel attention module (CSE) first applies global pooling to the input feature map to obtain a pooling vector V; V is then fed into fully connected layer FC1 followed by ReLU activation to obtain a vector V_mid; V_mid is fed into fully connected layer FC2 followed by Sigmoid activation to obtain V_c; finally, the input feature map is re-weighted by V_c along the channel dimension to form the CSE output feature map U_CSE; the feature map V_c in the channel attention module (CSE) is expressed as:

V_c = Sigmoid(W_2(ReLU(W_1 Pooling(Input))))

in the above, W_1 is the weight of fully connected layer FC1, W_2 is the weight of fully connected layer FC2, and Pooling is the pooling layer;
the spatial attention module (SSE) first convolves the input with a single 1×1 convolution kernel and then applies a Sigmoid activation to obtain a feature map Q; finally, the input is re-weighted by Q over the spatial domain to form the SSE output feature map; the feature map Q in the spatial attention module (SSE) is expressed as:

Q = Sigmoid(Conv(Input))

in the above, Conv is a 1×1 convolution.
5. The medical image segmentation method based on a Swin Transformer fusing multi-scale features and a multi-attention mechanism according to claim 1, characterized in that:
in step S3, the decoder with a plurality of upsampling layers is built based on the spatial and channel attention module; the decoder comprises four decoding layers and consists of decoding blocks, upsampling blocks and concatenation operations; the feature maps decoded by the four decoder layers are denoted w_i, i = 1, 2, 3, 4;
the decoding blocks comprise two kinds, decoding block 1 and decoding block 2; decoding block 1 first applies a spatial and channel attention module (SCSE), then performs two 3×3 convolution operations, and finally applies an SCSE module once more; decoding block 2 first applies an SCSE module, then a 3×3 convolution, an upsampling and a 1×1 convolution, and finally an SCSE module; the expressions of decoding block 1 and decoding block 2 are:

Decoder1(Input) = SCSE(Conv1(SCSE(Input)))

Decoder2(Input) = SCSE(Conv3(Upsample(Conv2(SCSE(Input)))))

in the above, Conv1 denotes two 3×3 convolutions, Conv2 denotes one 3×3 convolution, Conv3 denotes one 1×1 convolution, and Upsample denotes upsampling;
the first decoding layer feeds the fused feature map e_4 into decoding block 1 to obtain the decoded feature map w_1; the second decoding layer takes w_1 as input, upsamples it once, concatenates the result with feature map e_3, and finally passes it through decoding block 1 to obtain the decoded feature map w_2; the third decoding layer takes w_2 as input, upsamples it once, concatenates the result with feature map e_2, and finally passes it through decoding block 1 to obtain the decoded feature map w_3; the fourth decoding layer takes w_3 as input, upsamples it once, concatenates the result with feature map e_1, and finally passes it through decoding block 2 to obtain the decoded feature map w_4; the decoded w_i, i = 1, 2, 3, 4, are expressed as:

w_1 = Decoder1(e_4)

w_2 = Decoder1(Concat(e_3, Upsample(w_1)))

w_3 = Decoder1(Concat(e_2, Upsample(w_2)))

w_4 = Decoder2(Concat(e_1, Upsample(w_3)))

in the above, Concat denotes the concatenation operation and Upsample denotes upsampling.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310429066.1A CN116416434A (en) | 2023-04-21 | 2023-04-21 | Medical image segmentation method based on Swin Transformer fusing multi-scale features and multi-attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310429066.1A CN116416434A (en) | 2023-04-21 | 2023-04-21 | Medical image segmentation method based on Swin Transformer fusing multi-scale features and multi-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116416434A true CN116416434A (en) | 2023-07-11 |
Family
ID=87052866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310429066.1A Pending CN116416434A (en) | 2023-04-21 | 2023-04-21 | Medical image segmentation method based on Swin Transformer fusing multi-scale features and multi-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116416434A (en) |
-
2023
- 2023-04-21 CN CN202310429066.1A patent/CN116416434A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117152441A (en) * | 2023-10-19 | 2023-12-01 | 中国科学院空间应用工程与技术中心 | Biological image instance segmentation method based on cross-scale decoding |
CN117152441B (en) * | 2023-10-19 | 2024-05-07 | 中国科学院空间应用工程与技术中心 | Biological image instance segmentation method based on cross-scale decoding |
CN118172357A (en) * | 2024-04-15 | 2024-06-11 | 江苏省人民医院(南京医科大学第一附属医院) | Lumbosacral plexus nerve root segmentation method, medium and equipment based on magnetic resonance diffusion tensor imaging |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111627019B (en) | Liver tumor segmentation method and system based on convolutional neural network | |
CN116416434A (en) | Medical image segmentation method based on Swin Transformer fusing multi-scale features and multi-attention mechanism | |
CN113012172A (en) | AS-UNet-based medical image segmentation method and system | |
CN113240683B (en) | Attention mechanism-based lightweight semantic segmentation model construction method | |
CN113808075A (en) | Two-stage tongue picture identification method based on deep learning | |
CN116309648A (en) | Medical image segmentation model construction method based on multi-attention fusion | |
CN114972746B (en) | Medical image segmentation method based on multi-resolution overlapping attention mechanism | |
CN114663440A (en) | Fundus image focus segmentation method based on deep learning | |
CN115456927A (en) | Brain medical image synthesis method and system, electronic equipment and storage medium | |
CN112819876A (en) | Monocular vision depth estimation method based on deep learning | |
CN116596949A (en) | Medical image segmentation method based on conditional diffusion model | |
CN114596318A (en) | Breast cancer magnetic resonance imaging focus segmentation method based on Transformer | |
CN116228785A (en) | Pneumonia CT image segmentation method based on improved Unet network | |
CN115984560A (en) | Image segmentation method based on CNN and Transformer | |
CN116797541A (en) | Transformer-based lung CT image super-resolution reconstruction method | |
CN117333750A (en) | Spatial registration and local global multi-scale multi-modal medical image fusion method | |
CN116468605A (en) | Video super-resolution reconstruction method based on time-space layered mask attention fusion | |
CN115249382A (en) | Method for detecting silence living body based on Transformer and CNN | |
Wang et al. | Msfnet: multistage fusion network for infrared and visible image fusion | |
CN114140322A (en) | Attention-guided interpolation method and low-delay semantic segmentation method | |
CN116051609B (en) | Unsupervised medical image registration method based on band-limited deformation Fourier network | |
CN117237320A (en) | Multi-context brain tumor segmentation system based on scale fusion guidance | |
CN117475268A (en) | Multimode medical image fusion method based on SGDD GAN | |
CN115731280A (en) | Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network | |
CN116362995A (en) | Tooth image restoration method and system based on standard prior |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |