CN114049314A - Medical image segmentation method based on feature rearrangement and gated axial attention - Google Patents

Medical image segmentation method based on feature rearrangement and gated axial attention

Info

Publication number
CN114049314A
CN114049314A
Authority
CN
China
Prior art keywords
attention
feature
block
rearrangement
axial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111262731.XA
Other languages
Chinese (zh)
Inventor
俞俊
于云杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111262731.XA priority Critical patent/CN114049314A/en
Publication of CN114049314A publication Critical patent/CN114049314A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30024Cell structures in vitro; Tissue sections in vitro

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a medical image segmentation method based on feature rearrangement and gated axial attention. First, all original images and ground-truth segmentation maps are unified to the same size, and the resized training images are randomly flipped horizontally/vertically to improve sample diversity. Second, downsampling is performed by feature rearrangement, which better preserves the features of the original image, and upsampling is performed by inverse feature rearrangement, which improves the decoding capability of the network. Then, a global branch and a local branch are trained cooperatively to extract the global and local information interactions of the image, respectively. Finally, the two kinds of information are combined to segment the image. Together with optimized parameter settings and suitable training techniques, a more accurate medical image segmentation result is achieved.

Description

Medical image segmentation method based on feature rearrangement and gated axial attention
Technical Field
The invention relates to the field of medical image segmentation, in particular to an end-to-end medical image segmentation method.
Background
With the popularization of deep convolutional neural networks (CNNs) in computer vision, CNNs have been applied to medical image segmentation tasks. Networks such as U-Net, Res-UNet and U-Net++ have been proposed specifically for image segmentation across various medical imaging modalities. These methods achieve good performance on many difficult datasets, demonstrating the effectiveness of CNNs in learning discriminative features to segment organs or lesions from medical scans. However, because of the inherent locality of the convolution operation, it is difficult for CNN-based methods to learn exact global and long-range semantic information interactions. The resulting segmentation accuracy does not fully satisfy medical applications, and medical image segmentation remains a challenging task in medical image analysis.
Recently, Transformer-based approaches have become highly popular in the field of computer vision. The main reason for the success of Transformers is their ability to learn long-term dependencies between input tokens, i.e., global and long-range semantic information interactions. The axial attention block, a Transformer variant, decomposes 2D self-attention into two 1D self-attentions and introduces position-sensitive axial attention, which has been used for panoptic segmentation. Gated axial attention, a further Transformer variant, adds a gating mechanism that controls the flow of attention information in the network.
Most existing medical image segmentation models use a convolutional layer or a pooling layer to downsample the original image, and use convolution plus bilinear interpolation for upsampling. Downsampling with a convolutional or pooling layer loses part of the original image information; the invention instead uses feature rearrangement to store the original information in the C channel. Because bilinear interpolation depends on a fixed interpolation formula, it adapts poorly to different datasets, so the invention performs upsampling by inverse feature rearrangement. At the same time, the convolution block is replaced by a gated axial attention block with global and long-range semantic information interaction, so as to achieve better segmentation accuracy on medical images.
Disclosure of Invention
The invention provides a medical image segmentation method based on feature rearrangement and gated axial attention. The method adopts feature rearrangement and a gated axial attention mechanism, and trains a global branch and a local branch cooperatively in an end-to-end manner, so that the global and local information interaction features of the input original image can be effectively extracted. The flow of information in the network is controlled by the gating mechanism, allowing the model to learn good positional bias information on small-sample datasets. Experimental results show that the method segments medical images more accurately.
A medical image segmentation method based on feature rearrangement and gated axial attention comprises the following steps:
step 1, acquiring a data set; selecting 3 data sets from existing public medical image segmentation data sets;
The 3 data sets are: the gland segmentation dataset GlaS, containing 85 training images and 80 test images; the nucleus segmentation dataset MoNuSeg, containing 30 training images and 14 test images; and the nucleus segmentation dataset TNBC, containing 35 training images and 15 test images.
Step 2, data processing; on the medical image segmentation datasets acquired in step 1, the images are adjusted to the same size; the adjusted training sample images are then randomly flipped horizontally/vertically, thereby increasing the diversity of the training samples;
step 3, defining a medical image segmentation model based on feature rearrangement and gated axial attention, wherein the model comprises a global branch and a local branch; taking the training image processed in the step 2 and a real segmentation graph of the training image as input;
step 4, loss function; the loss function measures the error between the predicted value and the ground-truth label; a cross-entropy loss function is employed here;
step 5, defining an Adam optimizer and setting a reasonable learning rate for the model; the initial learning rate is set to 0.001 and is decayed as the number of epochs increases during training, being multiplied by 0.8 every 50 epochs, which effectively suppresses oscillation and helps find better network parameters; meanwhile, L2 regularization is adopted to effectively reduce overfitting;
the learning rate attenuation formula is as follows (3):
l_p = l_0 \times 0.8^{\lfloor p/50 \rfloor} \qquad (3)
where p is the number of training epochs and l_0 is the initial learning rate. The hyper-parameter of the L2 regularization term is set to 0.0005.
Step 6, network training and testing; the global branch and the local branch in step 3 are trained cooperatively, and evaluation is performed on the test set provided by each dataset during training, using the average IoU and average F1 score as metrics.
Further, the data processing in step 2 is specifically implemented as follows:
firstly, the original images and ground-truth segmentation maps in the dataset are resized to 128 × 128;
then, the resized training images and the corresponding segmentation maps are randomly flipped horizontally/vertically with a probability of 50%.
Further, the global branch and the local branch of the model described in step 3 are specifically implemented as follows:
global branch code (encode) part:
3-1, a 7 × 7 convolution kernel with stride 1 and Padding 3 is applied to the input training sample image, preserving the input height H and width W, followed by a BatchNorm layer and a ReLU activation function to obtain a feature block;
3-2, feature rearrangement is performed on the feature block: it is divided into 2 × 2 patch blocks along H and W, downsampling (B, C, H, W) to (B, 4C, H/2, W/2); B is the number of images input at one time, C is the number of channels of the feature block, and H and W are the height and width of the feature block, respectively; feature rearrangement keeps the features of adjacent elements in the C channel and retains information better than a pooling layer;
3-3, the rearranged feature block is passed through a convolution that keeps the height and width unchanged, plus a BatchNorm layer and a ReLU activation function, to obtain a feature block x, enhancing information flow within local patch blocks;
3-4, the feature block x is input into a gated axial attention block; the gated axial attention block first applies a 1 × 1 convolution plus a BatchNorm layer and a ReLU activation function; gated axial attention is then applied along the width axis of the tensor, as expressed by the following equation (1):
y_{ij} = \sum_{w=1}^{N} \operatorname{softmax}\big( q_{ij}^{\top} k_{iw} + G_Q\, q_{ij}^{\top} r^{q}_{iw} + G_K\, k_{iw}^{\top} r^{k}_{iw} \big)\big( G_{V1}\, v_{iw} + G_{V2}\, r^{v}_{iw} \big) \qquad (1)
where N denotes the width after the last downsampling; q_{ij} = W_q x, k_{ij} = W_k x and v_{ij} = W_v x denote the query, key and value at the j-th element of the i-th row, respectively; W_q, W_k and W_v are the parameters of the linear transformations applied to the input x, and y is the output;
r^q, r^k and r^v are learnable position bias terms, commonly called relative position encodings;
G_Q, G_K, G_V1 and G_V2 are learnable gating parameters used to control the effect of the learned relative position encodings on encoding non-local context; gated axial attention is then applied along the height axis of the tensor, in the same way as along the width axis; finally, a 1 × 1 convolution plus a BatchNorm layer and a ReLU activation function are applied, and a residual connection is established around each gated axial attention block;
3-5, the feature block obtained after the 1 gated axial attention block undergoes feature rearrangement for downsampling and finally passes through two gated axial attention blocks, yielding a compact feature block x1 with long-range dependence and global information;
Global branch decoding (decoder) section:
Firstly, the feature block x1 passes through two gated axial attention blocks, and a 1 × 1 convolution expands the C channel so that upsampling can be performed by inverse feature rearrangement; inverse feature rearrangement is then applied for upsampling; compared with bilinear interpolation upsampling, inverse feature rearrangement is more flexible, since its parameters are learned by the network whereas bilinear interpolation is fixed by a formula; secondly, a gated axial attention block and a 1 × 1 convolution expand the C channel again, followed by inverse feature rearrangement for upsampling, yielding a feature block x2 that has long-range dependence and global information and the same size as the input original image; all corresponding blocks of the decoding part and the encoding part are connected by skip connections implemented as addition;
Local branch portion:
301. The input training sample image is divided into 32 × 32 patch blocks along H and W, and the following operations are then performed for each patch block:
a 7 × 7 convolution kernel with stride 1 and Padding 3 is applied, preserving the input height and width, followed by a BatchNorm layer and a ReLU activation function to obtain a feature block;
302. The feature block obtained in step 301 undergoes feature rearrangement downsampling, followed twice by a convolution that keeps the height and width unchanged, a BatchNorm layer and a ReLU activation function, giving a feature block x';
303. x' is passed through 3 stages of gated axial attention blocks plus feature rearrangement downsampling, using 3, 4 and 1 gated axial attention blocks respectively, giving a feature block x'1; x'1 is then upsampled through 3 stages of gated axial attention blocks plus inverse feature rearrangement, using 1, 4 and 3 gated axial attention blocks respectively;
304. The patch blocks are spliced in their original order to obtain a feature block x'2 that has local information and the same size as the input original image;
Further, the loss function in step 4 is as follows (2):
L_{CE}(p, p') = -\frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big( p(x,y)\log p'(x,y) + \big(1 - p(x,y)\big)\log\big(1 - p'(x,y)\big) \Big) \qquad (2)
where H and W are dimensions of the image, p (x, y) corresponds to a pixel in the image, and p' (x, y) represents the output prediction for a particular location (x, y).
Further, the cooperative training and evaluation indexes in step 6 are as follows:
for cooperative training of the global branch and the local branch, the feature blocks x2 and x'2 obtained in step 3 are added, and a 1 × 1 convolution reduces the number of C channels to the number of classes to be segmented;
the average IoU and average F1 score mean that the IoU and F1 score of each class are first calculated between each predicted segmentation map and the corresponding ground-truth segmentation map and then averaged; the mean IoU and mean F1 score over all test images are then summed and divided by the number of test images to obtain the average IoU and average F1 score; these two indices effectively evaluate the segmentation accuracy of the model.
The invention has the following beneficial effects:
The method adopts feature rearrangement for downsampling, which preserves more original image information than convolution or pooling, and inverse feature rearrangement for upsampling, which is more flexible than bilinear interpolation because the corresponding parameters are learned by the network rather than fixed by a formula. A gated axial attention model with a global branch and a local branch is adopted, so that both local and global information interactions are taken into account, while the gating mechanism controls the flow of information in the network. Together with suitable training techniques and well-chosen network parameters, optimization algorithm and learning rate settings, the accuracy of medical image segmentation is improved.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of the network framework of the present invention.
Fig. 3 is a graph comparing the segmentation effect of the present invention with other methods.
Detailed Description
The invention is further illustrated by the following figures and examples.
As shown in fig. 1, a method for segmenting a medical image based on feature rearrangement and gated axial attention specifically includes the following steps:
Step 1, data set acquisition: 3 data sets are selected from the existing public medical image segmentation data sets. The data sets used in the invention are the gland segmentation dataset GlaS, containing 85 training images and 80 test images; the nucleus segmentation dataset MoNuSeg, containing 30 training images and 14 test images; and the nucleus segmentation dataset TNBC, containing 35 training images and 15 test images.
Step 2, data processing: the original images and ground-truth segmentation maps in the data sets are unified to a size of 128 × 128. The resized training images and the corresponding segmentation maps are then randomly flipped horizontally/vertically with a probability of 50%, which helps avoid overfitting and improves model performance to a certain extent.
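As an illustration, a minimal sketch of this preprocessing in PyTorch/torchvision is given below; the function name and the joint handling of image and mask are assumptions, since the invention does not prescribe a particular implementation.

```python
import random
import torchvision.transforms.functional as TF

def preprocess(image, mask, size=(128, 128), train=True):
    # Resize the original image and its ground-truth segmentation map to 128 x 128.
    image = TF.resize(image, list(size))
    mask = TF.resize(mask, list(size), interpolation=TF.InterpolationMode.NEAREST)
    if train:
        # Random horizontal / vertical flips with 50% probability, applied jointly
        # so that the image and its segmentation map stay aligned.
        if random.random() < 0.5:
            image, mask = TF.hflip(image), TF.hflip(mask)
        if random.random() < 0.5:
            image, mask = TF.vflip(image), TF.vflip(mask)
    return image, mask
```

Nearest-neighbour interpolation is used for the mask so that resizing does not introduce fractional label values.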
Step 3, as shown in FIG. 2, the network framework of the medical image segmentation model based on feature rearrangement and gated axial attention consists of 2 parts: a global branch and a local branch. The training image processed in step 2 and its ground-truth segmentation map are taken as input.
The global branch and the local branch are specifically realized as follows:
global branch code (encode) part:
Firstly, a 7 × 7 convolution kernel with stride 1 and Padding 3 is applied to the input original image, preserving the input height and width, followed by a BatchNorm layer and a ReLU activation function.
Next, feature rearrangement is performed: the feature map is divided into 2 × 2 patch blocks along H and W, i.e., (B, C, H, W) is downsampled to (B, 4C, H/2, W/2), where B is the number of images input at a time, C is the number of channels of the feature block, and H and W are its height and width. This preserves the features of neighbouring elements in the C channel and retains information better than a pooling layer.
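The shape change described above corresponds to a 2 × 2 space-to-depth operation; a minimal sketch using PyTorch's pixel_unshuffle / pixel_shuffle is shown below, which the invention does not name explicitly and is given here only as an assumed equivalent.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 16, 128, 128)                   # (B, C, H, W)

# Feature rearrangement: each 2x2 spatial patch is stacked onto the channel axis,
# so no pixel values are discarded (unlike pooling or strided convolution).
down = F.pixel_unshuffle(x, downscale_factor=2)    # (2, 64, 64, 64) = (B, 4C, H/2, W/2)

# Inverse feature rearrangement, used later in the decoder for upsampling.
up = F.pixel_shuffle(down, upscale_factor=2)       # back to (2, 16, 128, 128)

assert torch.equal(up, x)                          # the rearrangement is lossless
```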
Then, the rearranged feature block is passed 2 times through a convolution that keeps the height and width unchanged, plus a BatchNorm layer and a ReLU activation function, to obtain a feature block x, enhancing information flow within local patch blocks.
The feature block x is then input into the gated axial attention block. The gated axial attention block first applies a 1 × 1 convolution plus a BatchNorm layer and a ReLU activation function. Gated axial attention is then applied along the width axis of the tensor, as given by equation (1):
y_{ij} = \sum_{w=1}^{N} \operatorname{softmax}\big( q_{ij}^{\top} k_{iw} + G_Q\, q_{ij}^{\top} r^{q}_{iw} + G_K\, k_{iw}^{\top} r^{k}_{iw} \big)\big( G_{V1}\, v_{iw} + G_{V2}\, r^{v}_{iw} \big) \qquad (1)
where N denotes the width after the last downsampling; q_{ij} = W_q x, k_{ij} = W_k x and v_{ij} = W_v x denote the query, key and value at the j-th element of the i-th row, respectively; W_q, W_k and W_v are the parameters of the linear transformations applied to the input x, and y is the output; r^q, r^k and r^v are learnable position bias terms, commonly called relative position encodings; G_Q, G_K, G_V1 and G_V2 are learnable gating parameters used to control the effect of the learned relative position encodings on encoding non-local context. Gated axial attention is then applied along the height axis of the tensor, in the same way as along the width axis. Finally, a 1 × 1 convolution plus a BatchNorm layer and a ReLU activation function are applied, and the feature block establishes a residual connection around each gated axial attention block.
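For illustration only, a simplified single-head sketch of gated attention along the width axis follows; it uses per-position (rather than relative) encodings and an assumed gate layout (G_Q, G_K, G_V1, G_V2), so it should be read as a rough approximation of equation (1), not as the exact block used in the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAxialAttentionW(nn.Module):
    """Simplified single-head gated attention applied along the width axis."""
    def __init__(self, channels, width):
        super().__init__()
        self.to_q = nn.Linear(channels, channels, bias=False)   # W_q
        self.to_k = nn.Linear(channels, channels, bias=False)   # W_k
        self.to_v = nn.Linear(channels, channels, bias=False)   # W_v
        # Learnable position bias terms (simplified: one vector per width position).
        self.r_q = nn.Parameter(0.02 * torch.randn(width, channels))
        self.r_k = nn.Parameter(0.02 * torch.randn(width, channels))
        self.r_v = nn.Parameter(0.02 * torch.randn(width, channels))
        # Learnable gates controlling how much the positional terms contribute.
        self.g_q = nn.Parameter(torch.zeros(1))
        self.g_k = nn.Parameter(torch.zeros(1))
        self.g_v1 = nn.Parameter(torch.ones(1))
        self.g_v2 = nn.Parameter(torch.zeros(1))

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.permute(0, 2, 3, 1)                # (B, H, W, C); attention runs over W per row
        q, k, v = self.to_q(t), self.to_k(t), self.to_v(t)
        # Content term q^T k plus gated positional terms q^T r^q and k^T r^k.
        logits = torch.einsum('bhic,bhjc->bhij', q, k)
        logits = logits + self.g_q * torch.einsum('bhic,jc->bhij', q, self.r_q)
        logits = logits + self.g_k * torch.einsum('bhjc,jc->bhj', k, self.r_k).unsqueeze(2)
        attn = F.softmax(logits / c ** 0.5, dim=-1)
        # Gated value aggregation: G_V1 * v + G_V2 * r^v.
        out = self.g_v1 * torch.einsum('bhij,bhjc->bhic', attn, v)
        out = out + self.g_v2 * torch.einsum('bhij,jc->bhic', attn, self.r_v)
        return out.permute(0, 3, 1, 2)           # back to (B, C, H, W)

# Usage: a 64-channel feature block of width 32.
block = GatedAxialAttentionW(channels=64, width=32)
y = block(torch.randn(2, 64, 32, 32))            # output shape (2, 64, 32, 32)
```

Attention along the height axis is obtained by swapping the roles of H and W, and in the full block a 1 × 1 convolution, BatchNorm, ReLU and a residual connection wrap the attention, as described above.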
The feature block x passes through one gated axial attention block, then feature rearrangement is performed for downsampling, and finally two gated axial attention blocks are applied, yielding a compact feature block x1 with long-range dependence and global information.
Global branch decoding (decoder) section:
Firstly, the feature block x1 passes through two gated axial attention blocks and the C channel is expanded using a 1 × 1 convolution so that upsampling can be performed by inverse feature rearrangement; inverse feature rearrangement is then applied for upsampling. Compared with bilinear interpolation upsampling, inverse feature rearrangement is more flexible, since its parameters are learned by the network whereas bilinear interpolation is fixed by a formula. Secondly, a gated axial attention block and a 1 × 1 convolution expand the C channel again, followed by inverse feature rearrangement for upsampling, yielding a feature block x2 that has long-range dependence and global information and the same size as the input original image. All corresponding blocks of the decoding part and the encoding part are connected by skip connections implemented as addition.
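A minimal sketch of one such upsampling block follows, assuming the channel expansion is a 1 × 1 convolution to four times the target channel count followed by inverse rearrangement (PixelShuffle); the module name and channel numbers are illustrative.

```python
import torch
import torch.nn as nn

class InverseRearrangeUp(nn.Module):
    """1x1 conv expands the C channel, then inverse feature rearrangement upsamples 2x."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Expand to 4*out_ch so that PixelShuffle(2) yields out_ch channels at 2x resolution.
        self.expand = nn.Conv2d(in_ch, 4 * out_ch, kernel_size=1)
        self.up = nn.PixelShuffle(2)       # the learned parameters live in the 1x1 conv

    def forward(self, x, skip=None):
        x = self.up(self.expand(x))        # (B, out_ch, 2H, 2W)
        if skip is not None:
            x = x + skip                   # skip connection from the encoder, by addition
        return x

# Usage: upsample a (B, 128, 32, 32) feature block to (B, 64, 64, 64).
up_block = InverseRearrangeUp(128, 64)
y = up_block(torch.randn(2, 128, 32, 32), skip=torch.randn(2, 64, 64, 64))
```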
Partial branching portion:
Firstly, the input original image is divided into 32 × 32 patch blocks along H and W; the following operations are then performed for each patch block:
using 7 × 7 convolution kernel, step size is 1, Padding is set to 3, the height and width of the input are preserved, and then mapped through BatchNorm layer and ReLU activation function.
Secondly, the feature block obtained in the previous step undergoes feature rearrangement downsampling, followed 2 times by a convolution that keeps the height and width unchanged, a BatchNorm layer and a ReLU activation function, giving a feature block x'.
Then, x' is passed through 3 stages of gated axial attention blocks plus feature rearrangement downsampling, using 3, 4 and 1 gated axial attention blocks respectively, giving a feature block x'1. x'1 is then upsampled through 3 stages of gated axial attention blocks plus inverse feature rearrangement, using 1, 4 and 3 gated axial attention blocks respectively.
Finally, the patch blocks are spliced in their original order to obtain a feature block x'2 that has local information and the same size as the input original image.
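The split-and-splice of the local branch can be illustrated with plain tensor reshapes; the helper names below are hypothetical and the per-patch processing itself is omitted.

```python
import torch

def split_into_patches(x, p=32):
    # (B, C, H, W) -> (B * num_patches, C, p, p), preserving the original order.
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // p, p, w // p, p).permute(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, c, p, p), (b, c, h, w)

def splice_patches(patches, shape, p=32):
    # Inverse operation: splice the (processed) patch blocks back in their original order.
    b, c, h, w = shape
    x = patches.reshape(b, h // p, w // p, c, p, p).permute(0, 3, 1, 4, 2, 5)
    return x.reshape(b, c, h, w)

img = torch.randn(2, 3, 128, 128)
patches, shape = split_into_patches(img)                  # 2 x 16 patch blocks of 32 x 32
assert torch.equal(splice_patches(patches, shape), img)   # splicing restores the image
```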
Step 4, defining a loss function, which measures the error between the predicted value and the ground-truth label. The cross-entropy loss commonly used in segmentation tasks is adopted, as in equation (2):
L_{CE}(p, p') = -\frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big( p(x,y)\log p'(x,y) + \big(1 - p(x,y)\big)\log\big(1 - p'(x,y)\big) \Big) \qquad (2)
where W and H are the dimensions of the image, p(x, y) corresponds to a pixel in the image, and p'(x, y) represents the output prediction at a particular location (x, y).
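A sketch of equation (2) for a single-channel prediction follows, assuming the prediction has already been mapped through the sigmoid; the function name is hypothetical.

```python
import torch

def cross_entropy_loss(pred, target, eps=1e-7):
    # pred, target: tensors of shape (B, 1, H, W); pred holds sigmoid probabilities p'(x, y),
    # target holds the ground-truth labels p(x, y) in {0, 1}.
    pred = pred.clamp(eps, 1 - eps)
    loss = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred))
    return loss.mean()    # average over the W x H pixels (and the batch), as in equation (2)
```

Up to the numerical clamping, this is the same quantity as torch.nn.functional.binary_cross_entropy(pred, target).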
Step 5, defining an Adam optimizer and setting a reasonable learning rate for the model: the initial learning rate is set to 0.001 and is decayed as the number of epochs increases during training, being multiplied by 0.8 every 50 epochs, which effectively suppresses oscillation and helps find better network parameters. Meanwhile, L2 regularization is adopted to effectively reduce overfitting, with the hyper-parameter of the regularization term set to 0.0005. The learning rate decay formula is defined as equation (3):
l_p = l_0 \times 0.8^{\lfloor p/50 \rfloor} \qquad (3)
where p is the number of training epochs and l_0 is the initial learning rate.
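A sketch of the optimizer and schedule described here, using a placeholder model; mapping the L2 regularization term onto Adam's weight_decay is a reasonable but assumed choice.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)   # placeholder network, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)

# l_p = l_0 * 0.8 ** (p // 50): multiply the base learning rate by 0.8 every 50 epochs.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.8 ** (epoch // 50))

for epoch in range(400):
    # ... one cooperative training epoch of the global and local branches ...
    scheduler.step()
```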
Step 6, network training and testing: the global branch and the local branch are trained cooperatively; the feature blocks x2 and x'2 obtained in step 3 are added, followed by a 1 × 1 convolution, so that the C channel is reduced to the number of classes to be segmented. The resulting prediction block is then mapped by a sigmoid function along channel C, the cross-entropy loss between the prediction and the ground-truth segmentation map is computed, and the gradients are finally updated with the Adam optimizer defined in step 5. A total of 400 training epochs are run.
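The fusion step can be sketched as follows; the channel count 64 and class count 2 are assumed purely for illustration.

```python
import torch
import torch.nn as nn

num_classes = 2                                   # assumed number of segmentation classes
fuse = nn.Conv2d(64, num_classes, kernel_size=1)  # 1x1 conv reduces C to the class count

x2 = torch.randn(1, 64, 128, 128)         # global-branch feature block x2 (assumed 64 channels)
x2_local = torch.randn(1, 64, 128, 128)   # local-branch feature block x'2

# Add the two branch outputs, reduce the C channel, then map channel C with a sigmoid
# to obtain per-class prediction maps.
pred = torch.sigmoid(fuse(x2 + x2_local))         # (1, num_classes, 128, 128)
```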
Testing is performed once every 50 training epochs: the C channel is first mapped by a sigmoid function, and each pixel is then assigned to the class with the maximum probability. The indices used for evaluation are the average IoU and average F1 score: the IoU and F1 score of each class are first calculated between each predicted segmentation map and the corresponding ground-truth segmentation map and then averaged; the mean IoU and mean F1 score over all test images are then summed and divided by the number of test images to obtain the average IoU and average F1 score. These two indices effectively evaluate the segmentation accuracy of the model.
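A per-class IoU/F1 sketch consistent with this evaluation is given below; averaging over classes and over test images is left to the caller, and the helper name is hypothetical.

```python
import torch

def iou_and_f1(pred, target, eps=1e-7):
    # pred, target: boolean masks of one class for one image (after the per-pixel argmax).
    pred, target = pred.bool(), target.bool()
    inter = (pred & target).sum().float()
    union = (pred | target).sum().float()
    iou = inter / (union + eps)
    # F1 (Dice): 2*TP / (2*TP + FP + FN) = 2*inter / (|pred| + |target|)
    f1 = 2 * inter / (pred.sum().float() + target.sum().float() + eps)
    return iou, f1
```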
The comparison model used in the experiments is MedT, the recent best-performing model on the GlaS and MoNuSeg datasets. The comparison of experimental indices is given in Table 1 below, and the comparison of segmentation effects is shown in fig. 3.
Table 1. Comparison of the present invention with the MedT model on the evaluation indices (the table is provided as an image in the original publication).

Claims (5)

1. A medical image segmentation method based on feature rearrangement and gated axial attention is characterized by comprising the following steps:
step 1, acquiring a data set; selecting 3 data sets from existing public medical image segmentation data sets;
step 2, data processing; on the medical image segmentation data set acquired in the step 1, adjusting images in the data set to be the same in size; then, randomly turning the adjusted training sample image horizontally/vertically, thereby increasing the diversity of the training sample;
step 3, defining a medical image segmentation model based on feature rearrangement and gated axial attention, wherein the model comprises a global branch and a local branch; taking the training image processed in the step 2 and a real segmentation graph of the training image as input;
step 4, a loss function; the function of the loss function is to measure the error between the predicted value and the real sample mark; here a cross entropy loss function is employed;
step 5, defining an Adam optimizer and setting a reasonable learning rate for the model; the initial learning rate is set to 0.001 and is decayed as the number of epochs increases during training, being multiplied by 0.8 every 50 epochs, so that oscillation is effectively suppressed and better network parameters are found; meanwhile, L2 regularization is adopted to effectively reduce overfitting;
and 6, training and testing the network, cooperatively training the global branch and the local branch in the step 3, and evaluating on the test set provided by each data set while training, wherein the evaluation adopts average IoU and average F1 score.
2. The method for segmenting medical images based on feature rearrangement and gated axial attention according to claim 1, wherein the data processing in step 2 is implemented as follows:
firstly, an original image in a data set and a real segmentation map are formed into a size of 128 x 128 through resize;
and finally, randomly turning the training image subjected to resize and the corresponding segmentation image horizontally/vertically with the probability of 50%.
3. The method for segmenting medical images based on feature rearrangement and gated axial attention according to claim 2, wherein the model global branch and the model local branch in step 3 are implemented as follows:
global branch code (encode) part:
3-1, using a 7 × 7 convolution kernel for an input training sample image, setting the step length to be 1 and Padding to be 3, reserving the input height H and width W, and mapping by a BatchNorm layer and a ReLU activation function to obtain a feature block;
3-2, performing characteristic rearrangement on the characteristic blocks, dividing H and W patches into 2 multiplied by 2 patch blocks, and downsampling (B, C, H, W) into (B,4C, H/2, W/2); b is the number of pictures input at one time, C is the number of channels of the feature block, and H and W are the height and width of the feature block respectively; the characteristic rearrangement can keep the characteristics of the adjacent elements on the C channel, and compared with the adoption of a pooling layer, the information can be better kept;
3-3, carrying out convolution with unchanged height and width and a BatchNorm layer and a ReLU activation function mapping on the feature block after feature rearrangement to obtain a feature block x, and enhancing information circulation of a local panel block;
3-4, inputting the characteristic block x into a gated axial attention block; a gated axial attention block is first mapped through a 1 × 1 convolution plus the BatchNorm layer and the ReLU activation function; second, applying a gated axial attention along the width axis of the tensor, as expressed by the following equation (1):
y_{ij} = \sum_{w=1}^{N} \operatorname{softmax}\big( q_{ij}^{\top} k_{iw} + G_Q\, q_{ij}^{\top} r^{q}_{iw} + G_K\, k_{iw}^{\top} r^{k}_{iw} \big)\big( G_{V1}\, v_{iw} + G_{V2}\, r^{v}_{iw} \big) \qquad (1)
where N denotes the width after the last downsampling; q_{ij} = W_q x, k_{ij} = W_k x and v_{ij} = W_v x denote the query, key and value at the j-th element of the i-th row, respectively; W_q, W_k and W_v are the parameters of the linear transformations applied to the input x, and y is the output; r^q, r^k and r^v are learnable position bias terms, commonly called relative position encodings; G_Q, G_K, G_V1 and G_V2 are learnable gating parameters used to control the effect of the learned relative position encodings on encoding non-local context; gated axial attention is then applied along the height axis of the tensor, in the same way as along the width axis; finally, a 1 × 1 convolution plus a BatchNorm layer and a ReLU activation function are applied, and a residual connection is established around each gated axial attention block;
3-5, the feature block obtained after the 1 gated axial attention block undergoes feature rearrangement for downsampling and finally passes through two gated axial attention blocks, yielding a compact feature block x1 with long-range dependence and global information;
Global branch decoding (decoder) section:
firstly, the feature block x1 passes through two gated axial attention blocks, and a 1 × 1 convolution expands the C channel so that upsampling can be performed by inverse feature rearrangement; inverse feature rearrangement is then applied for upsampling; compared with bilinear interpolation upsampling, inverse feature rearrangement is more flexible, since its parameters are learned by the network whereas bilinear interpolation is fixed by a formula; secondly, a gated axial attention block and a 1 × 1 convolution expand the C channel again, followed by inverse feature rearrangement for upsampling, yielding a feature block x2 that has long-range dependence and global information and the same size as the input original image; all corresponding blocks of the decoding part and the encoding part are connected by skip connections implemented as addition;
local branch portion:
301. the input training sample image is divided into 32 x 32 patch blocks on H and W patches, and then for each patch block the following operations are performed:
7 multiplied by 7 convolution kernel, the step length is 1, Padding is set to be 3, the input height and width are reserved, and a characteristic block is obtained through BatchNorm layer and ReLU activation function mapping;
302. performing feature rearrangement downsampling on the feature block obtained in the step 301, and performing convolution without changing height and width, a BatchNorm layer and a ReLU activation function mapping for 2 times to obtain a feature block x';
303. x' is passed through 3 stages of gated axial attention blocks plus feature rearrangement downsampling, using 3, 4 and 1 gated axial attention blocks respectively, giving a feature block x'1; x'1 is then upsampled through 3 stages of gated axial attention blocks plus inverse feature rearrangement, using 1, 4 and 3 gated axial attention blocks respectively;
304. The patch blocks are spliced in their original order to obtain a feature block x'2 that has local information and the same size as the input original image;
4. The method for medical image segmentation based on feature rearrangement and gated axial attention according to claim 3, wherein the loss function in step 4 is as follows (2):
L_{CE}(p, p') = -\frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big( p(x,y)\log p'(x,y) + \big(1 - p(x,y)\big)\log\big(1 - p'(x,y)\big) \Big) \qquad (2)
where H and W are dimensions of the image, p (x, y) corresponds to a pixel in the image, and p' (x, y) represents the output prediction for a particular location (x, y).
5. The method of feature rebinning and gated axial attention based medical image segmentation according to claim 4, wherein the co-training and evaluation index of step 6 is as follows:
for cooperative training of the global branch and the local branch, the feature blocks x2 and x'2 obtained in step 3 are added, and a 1 × 1 convolution reduces the number of C channels to the number of classes to be segmented;
the average IoU and the average F1 score mean that IoU and F1 scores of different categories of each predicted segmentation map and each real segmentation map are calculated firstly, and then the average values are taken; adding the mean IoU and the mean F1 score of all test sets respectively, and dividing by the number of test images to obtain a mean IoU and a mean F1 score; the two indexes can effectively evaluate the accuracy of model segmentation.
CN202111262731.XA 2021-10-28 2021-10-28 Medical image segmentation method based on feature rearrangement and gated axial attention Pending CN114049314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111262731.XA CN114049314A (en) 2021-10-28 2021-10-28 Medical image segmentation method based on feature rearrangement and gated axial attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111262731.XA CN114049314A (en) 2021-10-28 2021-10-28 Medical image segmentation method based on feature rearrangement and gated axial attention

Publications (1)

Publication Number Publication Date
CN114049314A true CN114049314A (en) 2022-02-15

Family

ID=80206285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111262731.XA Pending CN114049314A (en) 2021-10-28 2021-10-28 Medical image segmentation method based on feature rearrangement and gated axial attention

Country Status (1)

Country Link
CN (1) CN114049314A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638842A (en) * 2022-03-15 2022-06-17 桂林电子科技大学 Medical image segmentation method based on MLP
CN114638842B (en) * 2022-03-15 2024-03-22 桂林电子科技大学 Medical image segmentation method based on MLP
CN114863165A (en) * 2022-04-12 2022-08-05 南通大学 Vertebral body bone density classification method based on fusion of image omics and deep learning features
CN116433660A (en) * 2023-06-12 2023-07-14 吉林禾熙科技开发有限公司 Medical image data processing device, electronic apparatus, and computer-readable storage medium
CN116433660B (en) * 2023-06-12 2023-09-15 吉林禾熙科技开发有限公司 Medical image data processing device, electronic apparatus, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN111681252B (en) Medical image automatic segmentation method based on multipath attention fusion
CN108734659B (en) Sub-pixel convolution image super-resolution reconstruction method based on multi-scale label
CN111627019B (en) Liver tumor segmentation method and system based on convolutional neural network
CN110020989B (en) Depth image super-resolution reconstruction method based on deep learning
CN114049314A (en) Medical image segmentation method based on feature rearrangement and gated axial attention
CN113177882B (en) Single-frame image super-resolution processing method based on diffusion model
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN111626927B (en) Binocular image super-resolution method, system and device adopting parallax constraint
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
KR20200144398A (en) Apparatus for performing class incremental learning and operation method thereof
JP7337268B2 (en) Three-dimensional edge detection method, device, computer program and computer equipment
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
KR102543690B1 (en) Image Upscaling Apparatus And Method Based On Learning With Privileged Information
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN113436224B (en) Intelligent image clipping method and device based on explicit composition rule modeling
CN115457057A (en) Multi-scale feature fusion gland segmentation method adopting deep supervision strategy
CN109741258B (en) Image super-resolution method based on reconstruction
CN103208109A (en) Local restriction iteration neighborhood embedding-based face hallucination method
CN113962905A (en) Single image rain removing method based on multi-stage feature complementary network
CN117593275A (en) Medical image segmentation system
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium
CN116452619A (en) MRI image segmentation method based on high-resolution network and boundary enhancement
Wang et al. Deep locally linear embedding network
CN113793269B (en) Super-resolution image reconstruction method based on improved neighborhood embedding and priori learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination