CN111612790A - Medical image segmentation method based on T-shaped attention structure - Google Patents
- Publication number
- CN111612790A (application CN202010355011.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- attention
- network structure
- layer
- medical image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20021—Dividing image into blocks, subimages or windows
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30041—Eye; Retina; Ophthalmic
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a medical image segmentation method based on a T-shaped attention structure: an improved U-Net model in which a gated attention mechanism built around a T-shaped attention module is introduced into the decoding network structure, and ResNet residual modules are added to both the encoding and decoding network structures. Through the gated attention mechanism with the T-shaped attention module, global feature information can be extracted from the shallow input, solving the problem of insufficient image features caused by direct skip connections. The T-shaped attention modules share model parameters, reducing time and space complexity. A residual module is added to each convolutional layer, alleviating vanishing and exploding gradients during training and making the network easier to optimize.
Description
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a novel U-Net-based method, built on a T-shaped attention structure, for medical image segmentation.
Background
The rapid development of deep learning has had a great impact on many fields. Medical image segmentation is a difficult visual task, and the diversity and individual characteristics of imaging data sets are important reasons for the persistent research interest: data sets differ greatly in cohort size, image dimensionality, image size, voxel intensity range and intensity interpretation. Class labels in images may be highly unbalanced, labels may be ambiguous, and the quality of expert annotations varies greatly between data sets. Furthermore, some data sets are very inhomogeneous in image geometry, or slice misalignment may occur.
In the current medical image segmentation field, U-Net is the most widely applied and best-performing model; its encoder (down-sampling)-decoder (up-sampling) structure with skip connections is a very classical design. However, the skip connections of the U-Net network have a drawback: at each level the encoder feature is passed to the decoding layer without any processing, so the shallow features may be insufficiently transformed.
In recent years, semantic segmentation frameworks based on fully convolutional networks (FCN) have made remarkable progress, but some problems remain. The pixels of a whole medical image may be highly correlated, so the required receptive field is larger than for a general image, and a global receptive field yields the richest information. However, the FCN architecture limits itself to a local receptive field and short-range context; the dilated convolutions in ASPP can only gather information from some surrounding pixels and cannot produce dense context information, and the same holds for pyramid pooling.
Some attention-based methods proposed by researchers require generating large attention maps to measure the relationship between each pair of pixels. They improve accuracy, but their time and space complexity is high, the computation and memory footprint are large, and training becomes more difficult.
Disclosure of Invention
Aiming at the possibly insufficient transformation of shallow features and the vanishing and exploding gradients caused by the direct fusion in the conventional U-Net skip connections, the invention provides a medical image segmentation method based on a T-shaped attention structure, which also reduces the computation and memory footprint of the attention mechanism.
The invention comprises the following steps:
step one, preprocessing the medical image data and dividing it to obtain a training set and a test set;
step two, improving the U-Net network to obtain a prediction model:
a gated attention mechanism with a T-shaped attention module is introduced into the decoding network structure of the U-Net network, and ResNet residual modules are added to the encoding and decoding network structures;
step three, inputting the training set data into the prediction model for training:
the training set is input into the prediction model, using random initialization and stochastic gradient descent optimization; the initial learning rate, segmentation-layer learning rate, momentum and weight decay coefficient are set; training according to the set training strategy yields the trained network model;
step four, inputting the test set into the prediction model trained in step three to obtain the segmentation data.
Further, the acquired medical image data are preprocessed and divided to obtain a training set, a validation set and a test set, with the following specific operations:
the medical images are resized to a uniform n × n square; eighty percent of the samples in the medical image data set are taken as the training set and the remaining twenty percent as the test set. In the training and testing stages a patch-based (image blocking) method is adopted: each picture of the training set is divided into a number of small blocks, each called a patch. Of the training-set patches, 90% are used as training data and 10% as validation data.
Preferably, the prediction model in step two comprises an encoding network structure and a decoding network structure: the encoding network structure contains 4 convolutional layers and 3 max-pooling layers, and the decoding network structure contains 3 upsampling layers and 3 convolutional layers. Each convolutional layer contains three convolution-block operations, and the last two convolution blocks of each convolutional layer in both the encoding and decoding network structures are replaced with residual convolution blocks. To combine the prediction of the last layer with the shallower layers of the encoding network structure, skip connections are added between corresponding layers, and a gated attention structure is added to each skip connection to extract features from the shallow input.
The gated attention structure first fuses the input encoder feature map with the convolved decoder feature map, performs dimensionality reduction with a 1 × 1 convolution layer, feeds the reduced feature map into two serial T-shaped attention modules, and fuses the output of the T-shaped attention modules with the input encoder feature map to obtain a feature map containing global information.
Furthermore, the T-shaped attention module collects relevant information along the horizontal and vertical directions to enhance pixel-level representativeness. For a given local feature H, 1 × 1 convolutions first produce Q, K and V; each pixel in Q is fused with all pixels of K in the same row and column as that pixel to obtain A; the result of fusing A with V is further fused with H, yielding the correlation of each pixel with the pixels in its row and column. Two T-shaped attention modules are connected in series to obtain global information, and the two modules share parameter information.
Further, in step three the training set data are input into the prediction model for training, using random initialization and stochastic gradient descent: a group of samples is randomly drawn from the input training set, the model is updated once according to the gradient after training on it, then another group of samples is drawn and the model is updated again. The initial learning rate, segmentation-layer learning rate, momentum, weight decay coefficient and number of iterations are set in sequence. After training, the validation set is input to obtain the validation result.
The invention has the following beneficial effects:
the invention provides an improved U-Net model for medical image segmentation. Through the gated attention mechanism with the T-shaped attention module, it extracts global feature information from the shallow input, solving the problem of insufficient image features caused by direct skip connections. The T-shaped attention modules share model parameters, reducing time and space complexity. A residual module is added to each convolutional layer, alleviating vanishing and exploding gradients during training and making the network easier to optimize.
Drawings
FIG. 1 is a block diagram of an embodiment of the method of the present invention;
FIG. 2 is a diagram of a gated attention structure in an embodiment of the method of the present invention;
FIG. 3 is a T-shaped attention module diagram according to an embodiment of the method of the present invention.
Detailed Description
For the purpose of better explanation and to facilitate understanding, the present invention is described in detail below through specific embodiments with reference to the accompanying drawings. The invention is intended to cover alternatives, modifications and equivalents that may be included within the spirit and scope of the invention as defined by the appended claims. In the following detailed description, certain specific details are set forth to provide a thorough understanding of the present invention; it will be apparent to one skilled in the art that the invention may be practiced without these specific details.
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Example one
As shown in fig. 1, a method for segmenting a medical image based on a T-shaped attention structure includes the following steps:
step one, preprocessing the medical image data and dividing it to obtain a training set and a test set;
the medical images are resized to a uniform n × n square; eighty percent of the samples in the medical image data set are taken as the training set and the remaining twenty percent as the test set. Because the data set is scarce, a patch-based (image blocking) method is adopted in the training and testing stages: each picture of the training set is divided into a number of small blocks, each called a patch. Of the training-set patches, 90% are used as training data and 10% as validation data.
In this embodiment, the medical data set is the DRIVE data set, consisting of 40 color retinal images, of which 32 are used for training and the remaining 8 for testing. Each original image is 565 × 584 pixels; to make the images uniform n × n squares, each image is cropped to contain only the data from column 9 to column 574, giving 565 × 565 pixels. Each picture of the training set is divided into patches: the training-set images are randomly partitioned into 320000 patches, of which 288000 are used for training and the remaining 32000 for validation;
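The random patch partitioning described above can be sketched as follows. This is an illustrative sketch only: the 48 × 48 patch size and the function names are assumptions, since the patent does not state the patch dimensions.

```python
import numpy as np

def extract_random_patches(images, n_patches, patch=48, seed=0):
    """Randomly crop square patches from a stack of images.

    images: array of shape (N, H, W); n_patches crops are drawn
    uniformly over images and positions. patch=48 is an assumed
    size -- the patent does not specify it.
    """
    rng = np.random.default_rng(seed)
    n, h, w = images.shape[:3]
    out = []
    for _ in range(n_patches):
        k = rng.integers(n)                 # pick an image
        y = rng.integers(h - patch + 1)     # top-left corner
        x = rng.integers(w - patch + 1)
        out.append(images[k, y:y + patch, x:x + patch])
    return np.stack(out)

def split_train_val(patches, val_frac=0.1, seed=0):
    """90/10 split of the patches into training and validation data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(patches))
    n_val = int(len(patches) * val_frac)
    return patches[idx[n_val:]], patches[idx[:n_val]]
```

On DRIVE this would be called with the 565 × 565 cropped images and n_patches=320000, after which split_train_val yields the 288000/32000 split.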
step two, improving the U-Net network to obtain a prediction model:
the U-Net network model comprises an encoding network structure and a decoding network structure, wherein the encoding structure is used for extracting the characteristics of input characteristics, and the decoding structure predicts a segmentation model of an input image on the basis of extracting the characteristics, and the segmentation model has the same size as the input image. A gated attention mechanism with a T attention module is introduced into a decoding network structure of the U-Net network and is used for transferring image characteristics of each layer of the encoding network to the decoding network. In order to solve the problems of gradient loss and gradient explosion in training, a ResNet residual module is added into an encoding network structure and a decoding network structure;
the U-Net network model comprises: the encoding network structure comprises 4 convolutional layers and 3 maximum pooling layers, and the decoding network structure comprises 3 upsampling layers and 3 convolutional layers. Wherein each convolutional layer contains three convolutional block operations, replacing the last two convolutional blocks of each convolutional layer in the coding network structure and the decoding network structure with residual convolutional blocks. In order to combine the prediction of the last layer and the previous shallower layer of the coding network structure, jump structures are added to connect the last layer and the previous shallower layer of the coding network structure, and a gating attention structure is added to each layer of jump structures to perform shallow input feature extraction.
The gated attention structure first fuses the input encoder feature map with the convolved decoder feature map, performs dimensionality reduction with a 1 × 1 convolution layer, feeds the reduced feature map into two serial T-shaped attention modules, and fuses the output of the T-shaped attention modules with the input encoder feature map to obtain a feature map containing global information.
In order to reduce computational complexity while obtaining sufficient global information, the T-shaped attention module collects relevant information along the horizontal and vertical directions to enhance pixel-level representativeness. For a given local feature H, 1 × 1 convolutions first produce Q, K and V; each pixel in Q is fused with all pixels of K in the same row and column as that pixel to obtain A; the result of fusing A with V is further fused with H, yielding the correlation of each pixel with the pixels in its row and column. Two T-shaped attention modules are connected in series to obtain global information, and the two modules share parameter information.
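A NumPy sketch of such a T-shaped attention module follows. The patent leaves the "fusion operation" unspecified, so dot-product affinities with a softmax are assumed here, and the 1 × 1 convolutions are expressed as channel-mixing matrices; this is an illustration of the row-plus-column ("T") attention pattern, not the patented implementation.

```python
import numpy as np

def t_attention(H, Wq, Wk, Wv):
    """T-shaped attention over a feature map H of shape (C, h, w).

    Wq, Wk: (C', C) and Wv: (C, C) play the role of 1x1 convolutions.
    Each pixel attends only to the pixels in its own row and column
    (the 'T' shape); assumed fusion: dot product + softmax.
    """
    C, h, w = H.shape
    Q = np.einsum('dc,chw->dhw', Wq, H)
    K = np.einsum('dc,chw->dhw', Wk, H)
    V = np.einsum('dc,chw->dhw', Wv, H)
    out = np.empty_like(V)
    for i in range(h):
        for j in range(w):
            q = Q[:, i, j]                              # query at (i, j)
            k_col, v_col = K[:, :, j], V[:, :, j]       # same column
            k_row, v_row = K[:, i, :], V[:, i, :]       # same row
            e = np.concatenate([q @ k_col, q @ k_row])  # affinities, (h+w,)
            a = np.exp(e - e.max())
            a /= a.sum()                                # softmax weights
            v = np.concatenate([v_col, v_row], axis=1)  # (C, h+w)
            out[:, i, j] = v @ a                        # weighted sum
    return H + out  # residual fusion with the input feature H
```

Applying the module twice in series (with the same Wq, Wk, Wv, i.e. shared parameters) lets information propagate from any pixel to any other, which is how the two serial modules yield global context.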
step three, inputting the training set data into the prediction model for training:
the training set is input into the prediction model, using random initialization and stochastic gradient descent: a group of samples is randomly drawn from the training set, the model is updated once according to the gradient after training on it, then another group of samples is drawn and the model is updated again. With a large number of samples, a model whose loss is within an acceptable range can be obtained without training on all samples, and the samples are randomly shuffled in each iteration. The initial learning rate, segmentation-layer learning rate, momentum and weight decay coefficient are set, and training according to the set strategy yields the trained network model;
the batch size for stochastic gradient descent optimization was 32, epoch was 150, initial learning rate was 0.001, final segmentation layer learning rate was 0.01, momentum was 0.9, and weight decay was 0.0005.
And inputting the verification set after the training is finished to obtain a verification result.
step four, inputting the test set into the prediction model trained in step three to obtain the segmentation data;
evaluation indexes are established, including Accuracy (AC), Sensitivity (SE), Specificity (SP), F1 score, Dice coefficient (DC) and Jaccard index (JA). With TP the true positives, TN the true negatives, FP the false positives and FN the false negatives, the standard definitions are:

AC = (TP + TN) / (TP + TN + FP + FN)
SE = TP / (TP + FN)
SP = TN / (TN + FP)
F1 = 2TP / (2TP + FP + FN)
DC = 2TP / (2TP + FP + FN)
JA = TP / (TP + FP + FN)
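Given binary prediction and ground-truth masks, these indexes can be computed as in the following sketch (for binary masks the F1 score and the Dice coefficient coincide):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute AC, SE, SP, F1/Dice and Jaccard from binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # true positives
    tn = np.sum(~pred & ~gt)    # true negatives
    fp = np.sum(pred & ~gt)     # false positives
    fn = np.sum(~pred & gt)     # false negatives
    return {
        "AC": (tp + tn) / (tp + tn + fp + fn),
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),  # equals Dice for binary masks
        "JA": tp / (tp + fp + fn),
    }
```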
Claims (5)
1. A medical image segmentation method based on a T-shaped attention structure, characterized by comprising the following steps:
step one, preprocessing the medical image data and dividing it to obtain a training set and a test set;
step two, improving the U-Net network to obtain a prediction model:
a gated attention mechanism with a T-shaped attention module is introduced into the decoding network structure of the U-Net network, and ResNet residual modules are added to the encoding and decoding network structures;
step three, inputting the training set data into the prediction model for training:
the training set is input into the prediction model, using random initialization and stochastic gradient descent optimization; the initial learning rate, segmentation-layer learning rate, momentum and weight decay coefficient are set; training according to the set training strategy yields the trained network model;
step four, inputting the test set into the prediction model trained in step three to obtain the segmentation data.
2. The medical image segmentation method based on a T-shaped attention structure according to claim 1, characterized in that the acquired medical image data are preprocessed and divided to obtain a training set, a validation set and a test set, as follows:
the medical images are resized to a uniform n × n square; eighty percent of the samples in the medical image data set are taken as the training set and the remaining twenty percent as the test set; in the training and testing stages a patch-based (image blocking) method is adopted, each picture of the training set being divided into a number of small blocks, each called a patch; of the training-set patches, 90% are used as training data and 10% as validation data.
3. The medical image segmentation method based on a T-shaped attention structure according to claim 2, characterized in that the prediction model in step two comprises an encoding network structure and a decoding network structure, wherein the encoding network structure contains 4 convolutional layers and 3 max-pooling layers, and the decoding network structure contains 3 upsampling layers and 3 convolutional layers; each convolutional layer contains three convolution-block operations, and the last two convolution blocks of each convolutional layer in both the encoding and decoding network structures are replaced with residual convolution blocks; to combine the prediction of the last layer with the shallower layers of the encoding network structure, skip connections are added between corresponding layers, and a gated attention structure is added to each skip connection to extract features from the shallow input;
the gated attention structure first fuses the input encoder feature map with the convolved decoder feature map, performs dimensionality reduction with a 1 × 1 convolution layer, feeds the reduced feature map into two serial T-shaped attention modules, and fuses the output of the T-shaped attention modules with the input encoder feature map to obtain a feature map containing global information;
furthermore, to reduce computational complexity while obtaining sufficient global information, the T-shaped attention module collects relevant information along the horizontal and vertical directions to enhance pixel-level representativeness: for a given local feature H, 1 × 1 convolutions first produce Q, K and V; each pixel in Q is fused with all pixels of K in the same row and column as that pixel to obtain A; the result of fusing A with V is further fused with H, yielding the correlation of each pixel with the pixels in its row and column; two T-shaped attention modules are connected in series to obtain global information, and the two modules share parameter information.
4. The medical image segmentation method based on a T-shaped attention structure according to claim 3, characterized in that in step three the training set data are input into the prediction model for training, using random initialization and stochastic gradient descent: a group of samples is randomly drawn from the input training set, the model is updated once according to the gradient after training on it, then another group of samples is drawn and the model is updated again; the initial learning rate, segmentation-layer learning rate, momentum, weight decay coefficient and number of iterations are set in sequence; after training, the validation set is input to obtain the validation result.
5. The method of claim 4, wherein the batch size of the stochastic gradient descent optimization is 32, the number of epochs is 150, the initial learning rate is 0.001, the learning rate of the final segmentation layer is 0.01, the momentum is 0.9, and the weight decay is 0.0005.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010355011.7A CN111612790B (en) | 2020-04-29 | 2020-04-29 | Medical image segmentation method based on T-shaped attention structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111612790A true CN111612790A (en) | 2020-09-01 |
CN111612790B CN111612790B (en) | 2023-10-17 |
Family
ID=72205614
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010355011.7A Active CN111612790B (en) | 2020-04-29 | 2020-04-29 | Medical image segmentation method based on T-shaped attention structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111612790B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112183507A (en) * | 2020-11-30 | 2021-01-05 | 北京沃东天骏信息技术有限公司 | Image segmentation method, device, equipment and storage medium |
CN112381172A (en) * | 2020-11-28 | 2021-02-19 | 桂林电子科技大学 | InSAR interference image phase unwrapping method based on U-net |
CN112651979A (en) * | 2021-01-11 | 2021-04-13 | 华南农业大学 | Lung X-ray image segmentation method, system, computer equipment and storage medium |
CN112669285A (en) * | 2020-12-29 | 2021-04-16 | 中山大学 | Fundus image blood vessel segmentation method based on shared decoder and residual error tower type structure |
CN113139972A (en) * | 2021-03-22 | 2021-07-20 | 杭州电子科技大学 | Cerebral apoplexy MRI image focus region segmentation method based on artificial intelligence |
CN113160240A (en) * | 2021-03-09 | 2021-07-23 | 温州医科大学附属眼视光医院 | Cyclic hopping deep learning network |
CN113159056A (en) * | 2021-05-21 | 2021-07-23 | 中国科学院深圳先进技术研究院 | Image segmentation method, device, equipment and storage medium |
CN113240691A (en) * | 2021-06-10 | 2021-08-10 | 南京邮电大学 | Medical image segmentation method based on U-shaped network |
CN113326851A (en) * | 2021-05-21 | 2021-08-31 | 中国科学院深圳先进技术研究院 | Image feature extraction method and device, electronic equipment and storage medium |
CN113837193A (en) * | 2021-09-23 | 2021-12-24 | 中南大学 | Zinc flotation froth image segmentation algorithm based on improved U-Net network |
CN114757938A (en) * | 2022-05-16 | 2022-07-15 | 国网四川省电力公司电力科学研究院 | Transformer oil leakage identification method and system |
CN115131612A (en) * | 2022-07-02 | 2022-09-30 | 哈尔滨理工大学 | Retina OCT image classification method based on recursive residual error network |
CN115410070A (en) * | 2022-11-01 | 2022-11-29 | 宏景科技股份有限公司 | Crop disease condition training and evaluating method based on improved Unet network structure |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120189161A1 (en) * | 2011-01-25 | 2012-07-26 | Electronics And Telecommunications Research Institute | Visual attention apparatus and control method based on mind awareness and display apparatus using the visual attention apparatus |
CN103345763A (en) * | 2013-06-25 | 2013-10-09 | 西安理工大学 | Method for calculating motion attention based on multiscale variable-block |
CN107291945A (en) * | 2017-07-12 | 2017-10-24 | 上海交通大学 | The high-precision image of clothing search method and system of view-based access control model attention model |
WO2018068532A1 (en) * | 2016-10-11 | 2018-04-19 | 京东方科技集团股份有限公司 | Image encoding and decoding devices, image processing system, image encoding and decoding methods, and training method |
CN108537793A (en) * | 2018-04-17 | 2018-09-14 | 电子科技大学 | A kind of pulmonary nodule detection method based on improved u-net networks |
CN109191472A (en) * | 2018-08-28 | 2019-01-11 | 杭州电子科技大学 | Based on the thymocyte image partition method for improving U-Net network |
CN109509178A (en) * | 2018-10-24 | 2019-03-22 | 苏州大学 | A kind of OCT image choroid dividing method based on improved U-net network |
CN110189334A (en) * | 2019-05-28 | 2019-08-30 | 南京邮电大学 | The medical image cutting method of the full convolutional neural networks of residual error type based on attention mechanism |
CN110298844A (en) * | 2019-06-17 | 2019-10-01 | 艾瑞迈迪科技石家庄有限公司 | X-ray contrastographic picture blood vessel segmentation and recognition methods and device |
US10430946B1 (en) * | 2019-03-14 | 2019-10-01 | Inception Institute of Artificial Intelligence, Ltd. | Medical image segmentation and severity grading using neural network architectures with semi-supervised learning techniques |
CN110675406A (en) * | 2019-09-16 | 2020-01-10 | 南京信息工程大学 | CT image kidney segmentation algorithm based on residual double-attention depth network |
EP3611699A1 (en) * | 2018-08-14 | 2020-02-19 | Siemens Healthcare GmbH | Image segmentation using deep learning techniques |
CN110889852A (en) * | 2018-09-07 | 2020-03-17 | 天津大学 | Liver segmentation method based on residual error-attention deep neural network |
- 2020-04-29: application CN202010355011.7A granted as CN111612790B (active)
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120189161A1 (en) * | 2011-01-25 | 2012-07-26 | Electronics And Telecommunications Research Institute | Visual attention apparatus and control method based on mind awareness and display apparatus using the visual attention apparatus |
CN103345763A (en) * | 2013-06-25 | 2013-10-09 | 西安理工大学 | Method for calculating motion attention based on multiscale variable-block |
WO2018068532A1 (en) * | 2016-10-11 | 2018-04-19 | 京东方科技集团股份有限公司 | Image encoding and decoding devices, image processing system, image encoding and decoding methods, and training method |
CN107291945A (en) * | 2017-07-12 | 2017-10-24 | 上海交通大学 | The high-precision image of clothing search method and system of view-based access control model attention model |
CN108537793A (en) * | 2018-04-17 | 2018-09-14 | 电子科技大学 | A kind of pulmonary nodule detection method based on improved u-net networks |
EP3611699A1 (en) * | 2018-08-14 | 2020-02-19 | Siemens Healthcare GmbH | Image segmentation using deep learning techniques |
CN109191472A (en) * | 2018-08-28 | 2019-01-11 | 杭州电子科技大学 | Based on the thymocyte image partition method for improving U-Net network |
CN110889852A (en) * | 2018-09-07 | 2020-03-17 | 天津大学 | Liver segmentation method based on residual error-attention deep neural network |
CN109509178A (en) * | 2018-10-24 | 2019-03-22 | 苏州大学 | A kind of OCT image choroid dividing method based on improved U-net network |
US10430946B1 (en) * | 2019-03-14 | 2019-10-01 | Inception Institute of Artificial Intelligence, Ltd. | Medical image segmentation and severity grading using neural network architectures with semi-supervised learning techniques |
CN110189334A (en) * | 2019-05-28 | 2019-08-30 | 南京邮电大学 | Medical image segmentation method based on attention-mechanism residual fully convolutional neural network |
CN110298844A (en) * | 2019-06-17 | 2019-10-01 | 艾瑞迈迪科技石家庄有限公司 | Vessel segmentation and recognition method and device for X-ray angiography images |
CN110675406A (en) * | 2019-09-16 | 2020-01-10 | 南京信息工程大学 | CT image kidney segmentation algorithm based on residual dual-attention deep network |
Non-Patent Citations (4)
Title |
---|
DONG LI ET AL: "Unified Spatio-Temporal Attention Networks for Action Recognition in Videos", IEEE TRANSACTIONS ON MULTIMEDIA *
YAN GUANGYU; LIU ZHENGXI: "Real-time semantic segmentation algorithm based on hybrid attention", Modern Computer (现代计算机), no. 10 *
MEI XUZHANG; JIANG HONG; SUN JUN: "Retinal vessel image segmentation based on dense attention networks", Computer Engineering (计算机工程), no. 03 *
BIAN XIAOYONG ET AL: "Remote sensing image scene classification based on scale attention network", Journal of Computer Applications (计算机应用) *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112381172A (en) * | 2020-11-28 | 2021-02-19 | 桂林电子科技大学 | InSAR interference image phase unwrapping method based on U-net |
CN112381172B (en) * | 2020-11-28 | 2022-09-16 | 桂林电子科技大学 | InSAR interference image phase unwrapping method based on U-net |
CN112183507A (en) * | 2020-11-30 | 2021-01-05 | 北京沃东天骏信息技术有限公司 | Image segmentation method, device, equipment and storage medium |
CN112183507B (en) * | 2020-11-30 | 2021-03-19 | 北京沃东天骏信息技术有限公司 | Image segmentation method, device, equipment and storage medium |
CN112669285A (en) * | 2020-12-29 | 2021-04-16 | 中山大学 | Fundus image blood vessel segmentation method based on shared decoder and residual tower structure |
CN112669285B (en) * | 2020-12-29 | 2022-03-08 | 中山大学 | Fundus image blood vessel segmentation method based on shared decoder and residual tower structure |
CN112651979A (en) * | 2021-01-11 | 2021-04-13 | 华南农业大学 | Lung X-ray image segmentation method, system, computer equipment and storage medium |
CN112651979B (en) * | 2021-01-11 | 2023-10-10 | 华南农业大学 | Lung X-ray image segmentation method, system, computer equipment and storage medium |
CN113160240A (en) * | 2021-03-09 | 2021-07-23 | 温州医科大学附属眼视光医院 | Cyclic hopping deep learning network |
CN113139972A (en) * | 2021-03-22 | 2021-07-20 | 杭州电子科技大学 | Stroke lesion region segmentation method for MRI images based on artificial intelligence |
WO2022242131A1 (en) * | 2021-05-21 | 2022-11-24 | 中国科学院深圳先进技术研究院 | Image segmentation method and apparatus, device, and storage medium |
CN113326851A (en) * | 2021-05-21 | 2021-08-31 | 中国科学院深圳先进技术研究院 | Image feature extraction method and device, electronic equipment and storage medium |
CN113159056A (en) * | 2021-05-21 | 2021-07-23 | 中国科学院深圳先进技术研究院 | Image segmentation method, device, equipment and storage medium |
CN113326851B (en) * | 2021-05-21 | 2023-10-27 | 中国科学院深圳先进技术研究院 | Image feature extraction method and device, electronic equipment and storage medium |
CN113159056B (en) * | 2021-05-21 | 2023-11-21 | 中国科学院深圳先进技术研究院 | Image segmentation method, device, equipment and storage medium |
CN113240691A (en) * | 2021-06-10 | 2021-08-10 | 南京邮电大学 | Medical image segmentation method based on U-shaped network |
WO2022257408A1 (en) * | 2021-06-10 | 2022-12-15 | 南京邮电大学 | Medical image segmentation method based on u-shaped network |
CN113837193A (en) * | 2021-09-23 | 2021-12-24 | 中南大学 | Zinc flotation froth image segmentation algorithm based on improved U-Net network |
CN113837193B (en) * | 2021-09-23 | 2023-09-01 | 中南大学 | Zinc flotation froth image segmentation method based on improved U-Net network |
CN114757938A (en) * | 2022-05-16 | 2022-07-15 | 国网四川省电力公司电力科学研究院 | Transformer oil leakage identification method and system |
CN114757938B (en) * | 2022-05-16 | 2023-09-15 | 国网四川省电力公司电力科学研究院 | Transformer oil leakage identification method and system |
CN115131612A (en) * | 2022-07-02 | 2022-09-30 | 哈尔滨理工大学 | Retinal OCT image classification method based on recursive residual network |
CN115410070A (en) * | 2022-11-01 | 2022-11-29 | 宏景科技股份有限公司 | Crop disease condition training and evaluating method based on improved Unet network structure |
Also Published As
Publication number | Publication date |
---|---|
CN111612790B (en) | 2023-10-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111612790B (en) | Medical image segmentation method based on T-shaped attention structure | |
CN113159051B (en) | Remote sensing image lightweight semantic segmentation method based on edge decoupling | |
US20220004744A1 (en) | Human posture detection method and apparatus, device and storage medium | |
CN111524144B (en) | Intelligent lung nodule diagnosis method based on GAN and Unet network | |
CN107742061A (en) | Protein-protein interaction prediction method, system and device | |
CN113468996B (en) | Camouflage object detection method based on edge refinement | |
EP4016454A1 (en) | Three-dimensional edge detection method and apparatus, storage medium and computer device | |
CN107247952B (en) | Deep supervision-based visual saliency detection method for cyclic convolution neural network | |
CN111046793B (en) | Tomato disease identification method based on deep convolutional neural network | |
CN112348830B (en) | Multi-organ segmentation method based on improved 3D U-Net | |
CN112686902A (en) | Two-stage calculation method for brain glioma identification and segmentation in nuclear magnetic resonance image | |
CN112288749A (en) | Skull image segmentation method based on depth iterative fusion depth learning model | |
CN115393293A (en) | Electron microscope red blood cell segmentation and positioning method based on UNet network and watershed algorithm | |
CN115761297A (en) | Method for automatically identifying landslide by attention neural network based on edge guidance | |
CN114511710A (en) | Image target detection method based on convolutional neural network | |
CN114092815A (en) | Remote sensing intelligent extraction method for large-range photovoltaic power generation facility | |
CN116012653A (en) | Hyperspectral image classification method and system based on attention residual unit neural network | |
CN111310820A (en) | Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration | |
CN117422691A (en) | Concrete crack detection method and system combining linear guide and grid optimization | |
CN112837276A (en) | Brain glioma segmentation method based on cascaded deep neural network model | |
CN116778318A (en) | Convolutional neural network remote sensing image road extraction model and method | |
CN114140472B (en) | Cross-level information fusion medical image segmentation method | |
CN113962332A (en) | Salient target identification method based on self-optimization fusion feedback | |
CN114529794A (en) | Infrared and visible light image fusion method, system and medium | |
CN114581789A (en) | Hyperspectral image classification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||