CN115661462A - Medical image segmentation method based on convolution and deformable self-attention mechanism

Medical image segmentation method based on convolution and deformable self-attention mechanism

Info

Publication number
CN115661462A
CN115661462A (application CN202211422579.1A)
Authority
CN
China
Prior art keywords
convolution
attention mechanism
medical image
deformable self
image segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211422579.1A
Other languages
Chinese (zh)
Inventor
高宇飞
马自行
石磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University filed Critical Zhengzhou University
Priority to CN202211422579.1A
Publication of CN115661462A
Legal status: Pending

Landscapes

  • Image Processing (AREA)

Abstract

The invention pertains to the field of medical image segmentation and provides a medical image segmentation method based on convolution and a deformable self-attention mechanism, comprising the following steps. S1: preprocess and augment the CT images; S2: construct a U-Net architecture model, Med-CaDA, based on convolution and a deformable self-attention mechanism; S3: adopt a Dice loss function for the U-Net architecture model constructed in step S2; S4: train the U-Net architecture model constructed in step S2 with the Adam optimization algorithm; S5: measure the segmentation accuracy with two indices, the Dice score and the 95% Hausdorff distance. The invention constructs a segmentation model on the U-Net framework and proposes a CaDA block based on convolution and a deformable self-attention mechanism, which retains the advantage of convolution in extracting local information while exploiting the ability of the deformable self-attention mechanism to capture global dependencies, achieving good medical image segmentation accuracy at low computational cost.

Description

Medical image segmentation method based on convolution and deformable self-attention mechanism
Technical Field
The invention belongs to the technical field of medical image segmentation, and particularly relates to a medical image segmentation method based on convolution and a deformable self-attention mechanism.
Background
With the continuous progress of medical imaging and computer vision technology, medical image analysis has become an indispensable tool and technical means in medical research and clinical disease diagnosis. As an important component of medical image analysis, medical image segmentation plays a very important role in clinical diagnosis and treatment.
In recent years, the fully convolutional neural network architecture U-Net has held a dominant position in medical image segmentation. By constructing a symmetric encoder-decoder architecture with skip connections, U-Net has achieved great success in various medical image segmentation tasks. The encoder consists of a series of convolutional and down-sampling layers and extracts deep features; the decoder up-samples the deep features back to the input size and, through skip connections, fuses the multi-scale features produced by the encoder to compensate for the spatial information lost during down-sampling. Inspired by the U-Net design, methods such as Res-UNet, Att-UNet, U-Net++, and UNet3+ have been developed for the segmentation of organs and lesions, and their excellent performance demonstrates the strong feature-learning capability of convolutional neural networks.
However, the lack of ability to capture long-range dependencies makes it difficult for convolutional-neural-network-based methods to meet the segmentation accuracy requirements of medical applications. Transformers have achieved breakthroughs in several areas of computer vision, yielding a series of high-performing methods such as ViT, PVT, and Swin Transformer. Inspired by this, Transformers are increasingly being applied in the field of medical image segmentation. Recently, the Deformable Attention Transformer, by designing a deformable self-attention mechanism, overcame the drawback of methods such as PVT and Swin Transformer that critical related information may be lost, and achieved state-of-the-art accuracy on the ImageNet dataset.
Disclosure of Invention
The invention provides a medical image segmentation method based on convolution and a deformable self-attention mechanism, and aims to solve the problem that convolutional neural networks lack the ability to capture long-range dependencies.
The invention is realized in such a way that a medical image segmentation method based on convolution and a deformable self-attention mechanism comprises the following steps:
Step S1: preprocessing and augmenting the CT images;
Step S2: constructing a U-Net architecture model, Med-CaDA, based on convolution and a deformable self-attention mechanism;
Step S3: adopting a Dice loss function for the U-Net architecture model constructed in step S2;
Step S4: training the U-Net architecture model constructed in step S2 with the Adam optimization algorithm;
Step S5: measuring the segmentation accuracy with two indices, the Dice score and the 95% Hausdorff distance.
Further, in step S1, the preprocessing is standardized using the Z-score method, and the data augmentation comprises padding, random cropping, random flipping, and random intensity shifting.
Further, in step S2, the U-Net architecture model is composed of an encoder, a decoder, and skip connections; the encoder comprises an Embedding layer, CaDA blocks composed of convolution and a deformable self-attention mechanism, and down-sampling layers, and the decoder comprises up-sampling layers, CaDA blocks, and an expansion layer.
Further, the Embedding layer maps the input image to a high-dimensional space, with an output size of H/4 × W/4 × D/4 × C.
Further, the CaDA block is used to extract local and global information and comprises a bottleneck residual module and a deformable self-attention mechanism. The bottleneck residual module is composed of two 1 × 1 × 1 convolutions and a depthwise separable convolution; it extracts local information while keeping the input and output feature map sizes consistent. The bottleneck residual module can be expressed as:
Bottleneck(X) = Conv(F(Conv(X)))    (1)
F(X) = DWConv(X) + X    (2)
where Conv() denotes convolution and DWConv() denotes depthwise separable convolution; BatchNorm normalization and the GELU activation function are omitted from the formulas.
The deformable self-attention mechanism is a variant of self-attention that captures global dependencies without a large computational cost; given an input of size H × W × D × C, its output is also H × W × D × C, so the input and output dimensions are consistent.
Further, the down-sampling layer and the up-sampling layer are respectively 2 × 2 × 2 convolution and 2 × 2 × 2 deconvolution.
Furthermore, the expansion layer applies a 4 × 4 × 4 deconvolution with stride 4 to restore the feature map from H/4 × W/4 × D/4 × C to the original image size H × W × D × K, where K is the number of final segmentation classes.
Further, in step S3, the Dice loss function is implemented by the following formula:
L_Dice = 1 − 2|A ∩ B| / (|A| + |B|)
where |A ∩ B| denotes the intersection of A and B, and |A| and |B| denote the number of elements of A and B, respectively.
Further, in step S4, the learning rate of the Adam optimization algorithm is set to 1e-4.
Further, in step S5, the Dice score index is:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
and the 95% Hausdorff distance index is:
HD95(A, B) = max( P95{ min_{b∈B} distance(a, b) : a ∈ A }, P95{ min_{a∈A} distance(a, b) : b ∈ B } )
where distance(a, b) is the distance between points a and b, P95 denotes the 95th percentile, and A and B are two irregular regions.
Compared with the prior art, the invention has the following advantage: it provides a medical image segmentation method based on convolution and a deformable self-attention mechanism, which fully extracts local and global context information by combining convolution with a Transformer attention mechanism, greatly improving medical image segmentation accuracy while keeping the computational cost low.
Drawings
FIG. 1 is a schematic diagram of the process steps of the present invention;
FIG. 2 is a model diagram of a U-Net architecture model Med-CaDA in the present invention;
FIG. 3 is a block model diagram of the CaDA of the present invention;
FIG. 4 is a flow chart of the deformable self-attention mechanism of the present invention.
Detailed Description
In order to make the above objects, technical solutions, and effects of the present invention clearly understood, the invention is described fully below with reference to the accompanying drawings and the detailed description. It should be noted that numerous specific details are set forth in the following description to provide a thorough understanding of the present invention; however, the invention may be practiced otherwise than as specifically described, and the scope of the invention is therefore not limited by the specific embodiments disclosed below.
Referring to fig. 1, the present invention provides a technical solution: a medical image segmentation method based on convolution and a deformable self-attention mechanism comprises the following steps:
Step S1: preprocessing and augmenting the CT images;
in this embodiment, the pre-processing is normalized using the method of Z-score to enhance the foreground and background discrimination. The implementation formula is as follows:
z = (x − μ) / σ    (1)
where μ is the mean of all sample data and σ is the standard deviation of all sample data.
The data augmentation comprises padding, random cropping, random flipping, and random intensity shifting. Specifically, the input image of size 240 × 240 × 155 is padded to 240 × 240 × 160, then randomly cropped to 128 × 128 × 128, and finally random flipping and random intensity shifting are applied in sequence, as sketched below; this alleviates the small size of the dataset and improves the generalization ability of the model.
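The following is a minimal NumPy sketch of this preprocessing and augmentation pipeline; the function names, flip probability, and shift range are illustrative assumptions, not values taken from the patent.

import numpy as np

def z_score(volume):
    # Z-score normalization: z = (x - mu) / sigma over the whole volume.
    mu, sigma = volume.mean(), volume.std()
    return (volume - mu) / (sigma + 1e-8)

def pad_depth(volume, target_d=160):
    # Zero-pad the last (depth) axis, e.g. 240 x 240 x 155 -> 240 x 240 x 160.
    pad = target_d - volume.shape[-1]
    return np.pad(volume, [(0, 0)] * (volume.ndim - 1) + [(0, pad)])

def random_crop(volume, size=(128, 128, 128)):
    # Randomly crop a 128 x 128 x 128 patch from the three spatial axes.
    h, w, d = volume.shape[-3:]
    y = np.random.randint(0, h - size[0] + 1)
    x = np.random.randint(0, w - size[1] + 1)
    z = np.random.randint(0, d - size[2] + 1)
    return volume[..., y:y + size[0], x:x + size[1], z:z + size[2]]

def random_flip_and_shift(volume, shift=0.1):
    # Random flips along each spatial axis, then a random additive intensity shift.
    for axis in (-3, -2, -1):
        if np.random.rand() < 0.5:
            volume = np.flip(volume, axis=axis)
    return volume + np.random.uniform(-shift, shift)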
Step S2: constructing a U-Net architecture model Med-CaDA based on convolution and a deformable self-attention mechanism;
in this embodiment, the U-Net architecture model Med-CaDA is composed of an encoder, a decoder and a jump connection as shown in fig. 2, the encoder includes an Embedding layer, a CaDA block composed of convolution and a deformable self-attention mechanism, and a down-sampling layer, and the decoder includes an up-sampling layer, a CaDA block, and an extension layer.
The Embedding layer maps the original image of input size H × W × D × 4 to a high-dimensional space, producing an output of size H/4 × W/4 × D/4 × C. Its convolution sequence consists of a 3 × 3 × 3 convolution with stride 2 and padding 1, a 3 × 3 × 3 convolution with stride 1 and padding 1, a 3 × 3 × 3 convolution with stride 2 and padding 1, and a 3 × 3 × 3 convolution with stride 1 and padding 1, as sketched below.
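As a hedged PyTorch sketch of this Embedding layer (the intermediate channel width and the normalization/activation placement are assumptions not fixed by the text):

import torch
import torch.nn as nn

class Embedding3D(nn.Module):
    # Four 3x3x3 convolutions with strides 2, 1, 2, 1 and padding 1:
    # (N, 4, H, W, D) -> (N, C, H/4, W/4, D/4).
    def __init__(self, in_ch=4, embed_dim=48):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv3d(in_ch, embed_dim // 2, 3, stride=2, padding=1),
            nn.BatchNorm3d(embed_dim // 2), nn.GELU(),
            nn.Conv3d(embed_dim // 2, embed_dim // 2, 3, stride=1, padding=1),
            nn.BatchNorm3d(embed_dim // 2), nn.GELU(),
            nn.Conv3d(embed_dim // 2, embed_dim, 3, stride=2, padding=1),
            nn.BatchNorm3d(embed_dim), nn.GELU(),
            nn.Conv3d(embed_dim, embed_dim, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.proj(x)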
The CaDA block is used to extract local and global information and comprises a bottleneck residual module and a deformable self-attention mechanism. The bottleneck residual module consists, in sequence, of a 1 × 1 × 1 convolution with stride 1, a 3 × 3 × 3 depthwise separable convolution, and a 1 × 1 × 1 convolution with stride 1; it extracts local information while keeping the input and output feature map sizes consistent. The first 1 × 1 × 1 convolution reduces the channel dimension and the second 1 × 1 × 1 convolution restores it.
The bottleneck residual module can be expressed as:
Bottleneck(X)=Conv(F(Conv(X))) (2)
F(X)=DWConv(X)+X (3)
where Conv() denotes convolution and DWConv() denotes depthwise separable convolution. The first Conv() reduces the channel dimension to 1/R of the original (R = 4 in practice), and the second Conv() restores the reduced dimension, thereby reducing the number of parameters; see the sketch after this paragraph. In addition, BatchNorm normalization and the GELU activation function are omitted from the formulas.
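A minimal PyTorch sketch of this bottleneck residual module, assuming R = 4 and a channel-first (N, C, D, H, W) layout; the normalization and activation placement are assumptions:

import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Sequential(nn.Conv3d(channels, mid, 1),
                                    nn.BatchNorm3d(mid), nn.GELU())
        # Depthwise 3x3x3 convolution: groups == channels.
        self.dwconv = nn.Conv3d(mid, mid, 3, padding=1, groups=mid)
        self.expand = nn.Conv3d(mid, channels, 1)

    def forward(self, x):
        y = self.reduce(x)       # Conv(): C -> C/R
        y = self.dwconv(y) + y   # F(X) = DWConv(X) + X
        return self.expand(y)    # Conv(): C/R -> C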
The deformable self-attention mechanism is a variant of self-attention that captures global dependencies without a large computational cost; given an input of size H × W × D × C, its output is also H × W × D × C, so the input and output dimensions are consistent.
As shown in fig. 4, specifically: for an input X, a set of grid points P ∈ H_g × W_g × D_g × 3 is first generated. The query Q learns the offsets ΔP of the grid points P through the Offset network (to prevent ΔP from becoming too large, its magnitude is controlled with a learnable scale parameter s and tanh()); sampling points X_sampled are then obtained from the original feature map X by bilinear interpolation. Finally, the keys K and values V are generated from the sampling points, self-attention is computed together with the query Q, and the output X_o is produced.
This can be expressed as:
Q = X·W_q    (4)
ΔP = s·tanh(Offset(Q))    (5)
X_sampled = BI(X, P + ΔP)    (6)
K = X_sampled·W_k;  V = X_sampled·W_v    (7)
Z = softmax(Q·K^T / √d)·V    (8)
X_o = Z·W_o    (9)
where Offset() is an offset-learning network consisting of a depthwise separable convolution and a 1 × 1 × 1 convolution, used to generate the offsets ΔP of size H_g × W_g × D_g × 3, and BI denotes bilinear interpolation. A simplified sketch follows.
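Below is a simplified, single-head PyTorch 2.x sketch of this deformable self-attention. The grid size, the offset scale s, and the pooling used to feed the offset network are illustrative assumptions (the formulas above derive the offsets from Q directly); multi-head splitting and normalization layers are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSelfAttention3D(nn.Module):
    def __init__(self, dim, grid_size=(8, 8, 8), offset_scale=2.0):
        super().__init__()
        self.grid_size, self.s = grid_size, offset_scale
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)
        # Offset network: depthwise separable conv, then 1x1x1 conv -> 3 offsets.
        self.offset = nn.Sequential(
            nn.Conv3d(dim, dim, 3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv3d(dim, 3, 1),
        )

    def forward(self, x):
        # x: (N, C, D, H, W); output has the same shape.
        n, c, d, h, w = x.shape
        q = self.w_q(x.flatten(2).transpose(1, 2))          # Q = X W_q, (N, DHW, C)

        # Reference grid P in normalized [-1, 1] coordinates, order (x, y, z).
        gd, gh, gw = self.grid_size
        zs, ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, gd, device=x.device),
            torch.linspace(-1, 1, gh, device=x.device),
            torch.linspace(-1, 1, gw, device=x.device), indexing="ij")
        p = torch.stack([xs, ys, zs], dim=-1).expand(n, gd, gh, gw, 3)

        # Delta P = s * tanh(Offset(.)); here predicted from a pooled feature map
        # for brevity, rather than from the projected queries.
        q_map = F.adaptive_avg_pool3d(x, self.grid_size)
        dp = self.s * torch.tanh(self.offset(q_map))        # (N, 3, gd, gh, gw)
        dp = dp.permute(0, 2, 3, 4, 1)                      # (N, gd, gh, gw, 3)

        # X_sampled = BI(X, P + Delta P) via (tri)linear grid sampling.
        x_sampled = F.grid_sample(x, p + dp, align_corners=True)
        kv = x_sampled.flatten(2).transpose(1, 2)           # (N, G, C)
        k, v = self.w_k(kv), self.w_v(kv)

        z = F.scaled_dot_product_attention(q, k, v)         # softmax(QK^T/sqrt(d))V
        return self.w_o(z).transpose(1, 2).reshape(n, c, d, h, w)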
The down-sampling layer and the up-sampling layer are, respectively, a 2 × 2 × 2 convolution with stride 2 and a 2 × 2 × 2 deconvolution with stride 2. Starting from the H/4 × W/4 × D/4 × C feature map produced by the Embedding layer, the cascaded down-sampling layers progressively halve each spatial dimension of the feature map, and the cascaded up-sampling layers progressively restore it to the H/4 × W/4 × D/4 × C size.
The expansion layer applies a 4 × 4 × 4 deconvolution with stride 4 to restore the feature map from H/4 × W/4 × D/4 × C to the original image size H × W × D × K, where K is the number of final segmentation classes.
Step S3: adopting a Dice loss function for the U-Net architecture model constructed in step S2;
in this embodiment, the Dice loss function is implemented by the following formula:
L_Dice = 1 − 2|A ∩ B| / (|A| + |B|)
where |A ∩ B| denotes the intersection of A and B, and |A| and |B| denote the number of elements of A and B, respectively.
During training, the Dice loss function measures the degree of difference between the prediction and the ground truth, and the network weights are updated to bring the prediction closer to it; a minimal implementation sketch is given below.
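A common soft-Dice implementation consistent with the formula above, assuming softmax class probabilities and a one-hot ground truth; the smoothing term eps is an implementation detail not given in the patent:

import torch

def dice_loss(pred, target, eps=1e-5):
    # pred, target: (N, K, H, W, D); pred holds per-class probabilities.
    dims = (0, 2, 3, 4)
    intersection = (pred * target).sum(dims)           # |A ∩ B| per class
    cardinality = pred.sum(dims) + target.sum(dims)    # |A| + |B| per class
    dice = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice.mean()                             # averaged over K classes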
Step S4: training the U-Net architecture model constructed in step S2 with the Adam optimization algorithm;
in this embodiment, the learning rate of the Adam optimization algorithm is set to 1e-4, the training round is 800, and the size of the batch size is 4. The overfitting was mitigated with L2 regularization, with the weight decay set to 1e-5. In addition, a cosine learning rate attenuation strategy is adopted, and the optimal solution can be achieved in the training process.
Step S5: measuring the segmentation accuracy with two indices, namely the Dice score and the 95% Hausdorff distance.
In this embodiment, the Dice score index is:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
and the 95% Hausdorff distance index is:
HD95(A, B) = max( P95{ min_{b∈B} distance(a, b) : a ∈ A }, P95{ min_{a∈A} distance(a, b) : b ∈ B } )
where P95 denotes the 95th percentile.
Here distance(a, b) is the distance between points a and b, and A and B are two irregular regions. Both metrics can be computed as sketched below.
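A NumPy/SciPy sketch of both metrics for non-empty binary masks; it materializes the full pairwise distance matrix, so it is suitable only for modest region sizes:

import numpy as np
from scipy.spatial.distance import cdist

def dice_score(a, b):
    # Dice = 2|A ∩ B| / (|A| + |B|) for boolean masks a, b.
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hd95(a, b):
    # 95th percentile of directed point-to-set distances, symmetrized.
    pa, pb = np.argwhere(a), np.argwhere(b)       # voxel coordinates of each region
    dist = cdist(pa, pb)                          # pairwise distances distance(a, b)
    return max(np.percentile(dist.min(axis=1), 95),   # A -> B
               np.percentile(dist.min(axis=0), 95))   # B -> A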
The method applies the deformable self-attention Transformer to the medical image segmentation task for the first time and, combined with the bottleneck residual convolution module, proposes the CaDA block, which preserves the model's local-information extraction capability while enabling it to capture global dependencies. Finally, experiments on the BraTS2020 dataset show that, compared with the classical U-Net architecture model, the proposed model achieves large improvements on both indices in the ET (enhancing tumor), WT (whole tumor), and TC (tumor core) regions, and that, compared with current state-of-the-art methods, it achieves comparable segmentation accuracy at a lower computational cost. The segmentation results are given in Table 1 below:
TABLE 1 Comparison of segmentation results
The above description covers preferred embodiments of the present invention and should not be taken as limiting the invention; rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (10)

1. A medical image segmentation method based on convolution and a deformable self-attention mechanism, characterized in that it comprises the following steps:
Step S1: preprocessing and augmenting the CT images;
Step S2: constructing a U-Net architecture model, Med-CaDA, based on convolution and a deformable self-attention mechanism;
Step S3: adopting a Dice loss function for the U-Net architecture model constructed in step S2;
Step S4: training the U-Net architecture model constructed in step S2 with the Adam optimization algorithm;
Step S5: measuring the segmentation accuracy with two indices, the Dice score and the 95% Hausdorff distance.
2. A method for medical image segmentation based on convolution and a deformable self-attention mechanism as claimed in claim 1, wherein: in step S1, the preprocessing is standardized using the Z-score method, and the data augmentation comprises padding, random cropping, random flipping, and random intensity shifting.
3. A method for medical image segmentation based on convolution and a deformable self-attention mechanism as claimed in claim 1, wherein: in step S2, the U-Net architecture model is composed of an encoder, a decoder, and skip connections; the encoder comprises an Embedding layer, CaDA blocks composed of convolution and a deformable self-attention mechanism, and down-sampling layers, and the decoder comprises up-sampling layers, CaDA blocks, and an expansion layer.
4. A method for medical image segmentation based on convolution and a deformable self-attention mechanism as claimed in claim 3, wherein: the Embedding layer maps the input image to a high-dimensional space, with an output size of H/4 × W/4 × D/4 × C.
5. A method for medical image segmentation based on convolution and a deformable self-attention mechanism as claimed in claim 3, wherein: the CaDA block is used to extract local and global information and comprises a bottleneck residual module and a deformable self-attention mechanism; the bottleneck residual module is composed of two 1 × 1 × 1 convolutions and a depthwise separable convolution, and is used to extract local information while keeping the input and output feature map sizes consistent; the bottleneck residual module can be expressed as:
Bottleneck(X) = Conv(F(Conv(X)))    (1)
F(X) = DWConv(X) + X    (2)
where Conv() denotes convolution and DWConv() denotes depthwise separable convolution.
6. A method for medical image segmentation based on convolution and deformable self-attention mechanism as claimed in claim 3 wherein: the down-sampling layer and the up-sampling layer are respectively 2 × 2 × 2 convolution and 2 × 2 × 2 deconvolution.
7. A method for medical image segmentation based on convolution and a deformable self-attention mechanism as claimed in claim 4, wherein: the expansion layer applies a 4 × 4 × 4 deconvolution to restore the feature map from H/4 × W/4 × D/4 × C to the original image size H × W × D × K, where K is the number of final segmentation classes.
8. A method for medical image segmentation based on convolution and a deformable self-attention mechanism as claimed in claim 1, wherein: step S3 is implemented by the following formula:
L_Dice = 1 − 2|A ∩ B| / (|A| + |B|)
where |A ∩ B| denotes the intersection of A and B, and |A| and |B| denote the number of elements of A and B, respectively.
9. A method for medical image segmentation based on convolution and a deformable self-attention mechanism as claimed in claim 1, wherein: in step S4, the learning rate of the Adam optimization algorithm is set to 1e-4.
10. A method for medical image segmentation based on convolution and a deformable self-attention mechanism as claimed in claim 1, wherein: the Dice score index of step S5 is:
Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
and the 95% Hausdorff distance index is:
HD95(A, B) = max( P95{ min_{b∈B} distance(a, b) : a ∈ A }, P95{ min_{a∈A} distance(a, b) : b ∈ B } )
where distance(a, b) is the distance between points a and b, P95 denotes the 95th percentile, and A and B are two irregular regions.
CN202211422579.1A 2022-11-14 2022-11-14 Medical image segmentation method based on convolution and deformable self-attention mechanism Pending CN115661462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211422579.1A CN115661462A (en) 2022-11-14 2022-11-14 Medical image segmentation method based on convolution and deformable self-attention mechanism

Publications (1)

Publication Number Publication Date
CN115661462A (en) 2023-01-31

Family

ID=85020547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211422579.1A Pending CN115661462A (en) 2022-11-14 2022-11-14 Medical image segmentation method based on convolution and deformable self-attention mechanism

Country Status (1)

Country Link
CN (1) CN115661462A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030260A (en) * 2023-03-27 2023-04-28 湖南大学 Surgical whole-scene semantic segmentation method based on long-strip convolution attention



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination