CN111681252B - Medical image automatic segmentation method based on multipath attention fusion - Google Patents
Medical image automatic segmentation method based on multipath attention fusion
- Publication number: CN111681252B
- Application number: CN202010479507.5A
- Authority: CN (China)
- Legal status: Active
Classifications
- G06T7/11 — Region-based segmentation (G06T7/10 Segmentation; Edge detection)
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06T2207/10081 — Computed x-ray tomography [CT]
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30061 — Lung
- G06T2207/30088 — Skin; Dermal
Abstract
The invention belongs to the technical field of medical image processing and computer vision, and relates to a medical image automatic segmentation method based on multi-path attention fusion. The method comprises: acquiring a medical image data set, dividing it into a training set and a verification set, augmenting the images in the training set, and normalizing the augmented training images and the verification images; inputting the training images into a multi-path attention fusion network model and training under the guidance of a cross-entropy loss function to obtain segmentation result maps; selecting the model with the highest accuracy on the verification set, loading it into the multi-path attention fusion network, and inputting the test set to obtain the segmentation result maps of the images. The invention addresses two shortcomings of existing networks in medical image segmentation that lead to poor segmentation results: the encoder cannot effectively improve feature quality at different scales, and it is difficult to control the inter-layer dependence between the low-level structural features and high-level semantic features of the network.
Description
Technical Field
The invention belongs to the technical field of medical image processing and computer vision, and particularly relates to a medical image automatic segmentation method based on multipath attention fusion.
Background
Medical images play a key role in medical treatment and diagnosis. The goal of computer-aided diagnosis (CAD) systems is to provide physicians with an accurate interpretation of medical images so that a large number of patients can be treated better. Moreover, automated processing of medical images reduces the time, cost and error of human-based processing. One of the main areas of research in this field is medical image segmentation, which is a key step in many medical imaging studies.
Like other research domains in computer vision, deep learning networks achieve excellent results in medical imaging and outperform non-deep techniques. Deep neural networks are mainly used for classification tasks, where the network outputs the probability of a single label, or of a set of labels, for a given input image. These networks benefit from structural advances such as improved activation functions, efficient optimization algorithms and regularization techniques. Given their large number of parameters, these networks require a large amount of training data to achieve good generalization behavior.
The fully convolutional network (FCN) was one of the earliest deep networks applied to image segmentation. U-Net extends the FCN framework with a standard encoder-decoder structure and has become one of the most popular architectures for medical images; given sufficient training data, such deep networks achieve good segmentation results. The U-Net consists of an encoding path, in which feature maps of successively reduced size are extracted from the input data, and a decoding path, which generates a segmentation map of the same size as the input by up-convolution. Its most important modification concerns the skip connections: feature maps extracted in the encoder are fed to the decoder and concatenated with the corresponding decoder features. However, U-Net has several shortcomings. In its hierarchical conversions, the learning processes at different pooling levels share the same data path, so the generated multi-scale feature maps may not separate the target as well as expected, and the encoder may lose part of the spatial information because of the pooling layers. A single U-Net uses two simple consecutive 3 × 3 convolutional layers, and such standard convolutional layers struggle to preserve rich spatial information. A further disadvantage is that U-Net uses simple skip connections that merely splice encoder features with decoder features, so the features output by the decoder contain redundancy, which affects the final segmentation result.
Disclosure of Invention
In order to improve the segmentation performance of medical images, the invention provides a medical image automatic segmentation method based on multi-path attention fusion, which comprises the following steps:
s1, acquiring a medical picture data set, dividing the data set into a training set and a verification set, augmenting the pictures in the training set, and normalizing the augmented training-set pictures and the verification-set pictures;
s2, inputting the pictures in the training set into the multi-path attention fusion network model, and outputting under the guidance of the cross entropy loss function to obtain a segmentation result graph;
s3, verifying the accuracy of the multi-path attention fusion network model after each iterative training by using verification set data, and taking the network parameter with the highest accuracy as the network parameter of the multi-path attention fusion network model;
and S4, inputting the image data which is subjected to the normalization processing and needs to be segmented into the multipath attention fusion network model to obtain a segmentation result graph.
Further, the process of augmenting the training set picture includes:
rotating the pictures in the training set by 10 degrees, 20 degrees, -10 degrees and-20 degrees, and storing the rotated pictures;
turning the pictures in the training set up and down and left and right, and storing the turned pictures;
performing elastic transformation on the pictures in the training set, and storing the pictures after the elastic transformation;
scaling the pictures in the training set by a random factor in the (20%, 80%) range, and storing the scaled pictures;
taking the original pictures in the training set together with the pictures produced by the above processing as the training set, completing the augmentation.
Further, the multi-path attention fusion network model comprises a multi-path encoder, an attention fusion module and a decoder with reconstructed upsampling, wherein:
the attention fusion module takes two features as input each time; the two features are concatenated and passed through a convolution operation to obtain a combined feature A. Feature A is then processed sequentially by a convolution operation, a ReLU activation function and a second convolution operation, followed by a sigmoid, yielding a feature map of dimension 1 × 1 × C, where C is the number of channels of the feature. This feature map is multiplied with feature A to selectively filter it, and the selectively filtered feature is summed with feature A to obtain the final output feature;
the decoder with reconstruction upsampling comprises three reconstruction-upsampling layers. In the first layer, the bottommost feature of the first path of the multi-path encoder is upsampled, spliced with the feature output by the third attention fusion module, and the spliced feature is input into a decoding module. In the second layer, the feature output by the first decoding module is upsampled, spliced with the feature output by the second attention fusion module, and input into a decoding module. In the third layer, the feature output by the second decoding module is upsampled, spliced with the feature output by the first attention fusion module, and input into a decoding module. Finally, a 1 × 1 convolution operation and a sigmoid activation function are applied to the feature output by the third decoding module to obtain the final segmentation result map.
Further, the operations performed in the residual module include:
201: inputting a feature map of size h × w × c into a 3 × 3 convolutional layer;
202: inputting the convolution result of 201 into standard normalization and a ReLU activation function;
203: inputting the result of 202 into a 3 × 3 convolutional layer, standard normalization and a ReLU activation function;
204: inputting the result of 203 into a 3 × 3 convolutional layer;
205: summing the convolution result of 204 and the convolution result of 201;
206: inputting the result of 205 into standard normalization and a ReLU activation function to obtain a feature of size h × w × c;
where h denotes the height of the feature map, w denotes the width of the feature map, and c denotes the number of channels of the feature map.
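Steps 201-206 can be sketched in plain NumPy. The kernels `k1`-`k3` are illustrative placeholders, and a simple per-map standardization stands in for learned batch normalization; this is not the patent's exact trained layer, only a shape-faithful sketch:

```python
import numpy as np

def conv3x3(x, k):
    """'Same'-padded 3x3 convolution; x: (h, w, c_in), k: (3, 3, c_in, c_out)."""
    h, w, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, k.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += np.einsum('hwc,co->hwo', xp[i:i + h, j:j + w], k[i, j])
    return out

def bn_relu(x):
    """Stand-in for standard normalization + ReLU on one feature map."""
    x = (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-5)
    return np.maximum(x, 0.0)

def residual_block(x, k1, k2, k3):
    """Steps 201-206: conv (201), BN+ReLU (202), conv+BN+ReLU (203),
    conv (204), sum with the first conv's output (205), BN+ReLU (206)."""
    y1 = conv3x3(x, k1)
    y = bn_relu(y1)
    y = bn_relu(conv3x3(y, k2))
    y = conv3x3(y, k3)
    return bn_relu(y1 + y)  # output keeps the h x w x c shape
```

Note that the identity shortcut is taken from the output of the first convolution (step 201), so input and output shapes match whenever the three kernels preserve the channel count.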
Further, the operations at the attention fusion module include:
211: for a feature M of resolution h × w × c from the multi-path encoder and a feature F of resolution h/2 × w/2 × 2c output by the previous attention fusion module, first perform an upsampling operation on feature F so that its resolution becomes h × w, then splice the upsampled F with M and input the spliced feature into a 3 × 3 convolutional layer;
212: inputting the convolution result of 211 into standard normalization and a ReLU activation function;
213: inputting the result of 212 into a global average pooling function to obtain a feature map of dimension 1 × 1 × c, where c is the number of channels of the feature;
214: inputting the result of 213 sequentially into a 1 × 1 convolutional layer, a ReLU activation function and a second 1 × 1 convolutional layer, and finally into a sigmoid function, obtaining a feature map of dimension 1 × 1 × c;
215: multiplying the feature map of 214 with the convolution result of 211 to obtain the selectively filtered feature;
216: summing the result of 215 and the convolution result of 211 to obtain the final output feature.
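A NumPy sketch of steps 211-216. As an assumption for brevity, a 1 × 1 channel-mixing matrix `w_mix` stands in for the 3 × 3 convolution of step 211, and `w1`/`w2` are illustrative 1 × 1 projection weights:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (h, w, c) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def attention_fuse(m, f, w_mix, w1, w2):
    """Steps 211-216: splice the encoder feature m (h x w x c) with the
    upsampled previous fusion output f (h/2 x w/2 x 2c), mix channels
    with ReLU (211-212), squeeze by global average pooling (213), build
    a sigmoid channel gate (214), then gate and add back (215-216)."""
    u = np.maximum(np.concatenate([m, upsample2x(f)], axis=-1) @ w_mix, 0.0)
    s = u.mean(axis=(0, 1))                                    # 1 x 1 x c descriptor
    a = 1.0 / (1.0 + np.exp(-(np.maximum(s @ w1, 0.0) @ w2)))  # gate in (0, 1)
    return u * a + u                                           # filter, then sum
```

Because the gate `a` lies in (0, 1) per channel, the residual addition `u * a + u` can only rescale channels between 1x and 2x, so the fusion never suppresses a channel entirely.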
Further, the operations at the decoder module include:
221: for a feature F of resolution h × w × c from the attention fusion module and a feature D of resolution h/2 × w/2 × c output by the previous decoder module, first perform an upsampling operation on feature D so that its resolution becomes h × w × c, then splice the upsampled D with F, and finally input the spliced feature into a 3 × 3 convolutional layer;
222: inputting the resulting feature of 221 into a 3 × 3 convolutional layer;
223: inputting the convolution result of 222 into standard normalization and a ReLU activation function;
224: inputting the result of 223 into a 3 × 3 convolutional layer, standard normalization and a ReLU activation function;
225: inputting the result of 224 into a 3 × 3 convolutional layer;
226: summing the convolution result of 225 and the convolution result of 222;
227: inputting the result of 226 into standard normalization and a ReLU activation function to obtain the h × w × c feature.
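The shape bookkeeping of the decoder module can be sketched as follows; the single 1 × 1 mixing matrix `w` is an illustrative stand-in for the block's 3 × 3 convolutions, normalizations and residual sum, which are assumptions of this sketch rather than the patent's exact layers:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (h, w, c) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decode_block(f, d, w):
    """Steps 221-227, condensed: upsample the coarser decoder feature d
    (h/2 x w/2 x c), splice it with the attention-fusion feature f
    (h x w x c), and project back to c channels with ReLU."""
    x = np.concatenate([f, upsample2x(d)], axis=-1)  # (h, w, 2c)
    return np.maximum(x @ w, 0.0)                    # (h, w, c)
```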
The invention has the beneficial effects that:
1) for medical image segmentation, sufficient training samples are difficult to obtain, so the training data set is augmented by rotation, flipping and elastic transformation, which greatly increases the effective number of training pictures;
2) the invention improves on the single-structure encoder by using a four-path encoding structure, which effectively improves feature quality at different scales and controls the inter-layer dependence between low-level structural features and high-level semantic features;
3) the invention improves the convolution blocks of the standard U-Net network by replacing them with residual modules, which accelerates network training and improves segmentation accuracy;
4) the invention improves on simple skip connections by using an attention fusion mechanism to effectively and progressively accumulate image representations from different semantic levels and then pass the recombined features to the back end of the network; the novel attention-based feature fusion establishes visual associations among different paths and explores skip connections in a data-driven manner;
5) the invention has good robustness and accuracy, performing well on skin cancer lesion segmentation and on lung CT images.
Drawings
FIG. 1 is a flow chart of a medical image automatic segmentation method based on multi-path attention fusion according to the present invention;
FIG. 2 is a schematic diagram of the multi-path attention fusion network structure according to the present invention;
FIG. 3 is a schematic view of an attention fusion structure according to the present invention;
FIG. 4 is a schematic diagram of a test data set of the present invention and the segmentation results obtained by the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a medical image automatic segmentation method based on multipath attention fusion, as shown in figure 1, comprising the following steps:
s1, acquiring a medical picture data set, dividing the data set into a training set and a verification set, augmenting the pictures in the training set, and normalizing the augmented training-set pictures and the verification-set pictures;
s2, inputting the pictures in the training set into the multi-path attention fusion network model, and outputting under the guidance of the cross entropy loss function to obtain a segmentation result graph;
s3, verifying the accuracy of the multi-path attention fusion network model after each iterative training by using verification set data, and taking the network parameter with the highest accuracy as the network parameter of the multi-path attention fusion network model;
and S4, inputting the image data which is subjected to the normalization processing and needs to be segmented into the multipath attention fusion network model to obtain a segmentation result graph.
Example 1
The images in the medical image data set are divided into a training set and a verification set, the training set is used for training the model, the verification set is used for optimizing various indexes of the model, and for medical image segmentation, enough training samples are not easy to obtain, so that the images in the training set are augmented, and the augmentation operation comprises the following steps:
rotating the pictures in the training set by 10 degrees, 20 degrees, -10 degrees and-20 degrees, and storing the rotated pictures;
turning the pictures in the training set up and down and left and right, and storing the turned pictures;
performing elastic transformation on the pictures in the training set, and storing the pictures after the elastic transformation;
scaling the pictures in the training set by a random factor in the (20%, 80%) range, and storing the scaled pictures;
taking the original pictures in the training set together with the pictures produced by the above processing as the training set, completing the augmentation.
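The flip and scaling augmentations above can be sketched with NumPy alone; rotation by arbitrary angles and elastic transformation would typically use an image library such as scipy.ndimage and are not shown. The function names and the nearest-neighbour `center_zoom` are illustrative stand-ins, not the patent's exact procedure:

```python
import numpy as np

def flip_augment(img):
    """Return up-down and left-right flipped copies of a 2-D image."""
    return [np.flipud(img), np.fliplr(img)]

def center_zoom(img, factor):
    """Crude zoom-in: crop the central (factor*h, factor*w) region and
    resize it back to the original size by nearest-neighbour indexing."""
    h, w = img.shape
    ch, cw = int(h * factor), int(w * factor)   # crop size
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h               # nearest-neighbour map
    cols = np.arange(w) * cw // w
    return crop[rows][:, cols]
```

Each transformed copy is stored alongside the originals, so the effective training-set size grows by one picture per transformation applied.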
The method for normalizing the pictures in the training set and the verification set after augmentation comprises the following steps:
I = (I - M) / Std
where I denotes the image intensity, M denotes the mean of the image data, and Std denotes the standard deviation of the image data.
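A minimal sketch of this normalization (the function name is illustrative):

```python
import numpy as np

def normalize(img):
    """Zero-mean, unit-variance normalization: I = (I - M) / Std,
    where M and Std are the mean and standard deviation of the image."""
    m, std = img.mean(), img.std()
    return (img - m) / std
```

After this step every picture has approximately zero mean and unit standard deviation, which keeps the input scale consistent across the training, verification and test sets.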
Training data are input into the multi-path attention fusion network model, which is trained under the guidance of a cross-entropy loss function to output segmentation result maps. The multi-path attention fusion network model mainly comprises a multi-path encoder, an attention fusion module and a decoder with reconstruction upsampling, wherein:
the multi-path encoder comprises 4 paths with different lengths: the first path comprises 4 residual network modules and 3 maximum pooling operations, the second path comprises 3 residual network modules and 2 maximum pooling operations, the third path comprises 2 residual network modules and 1 maximum pooling operation, and the fourth path comprises only 1 residual network module. As shown in fig. 2, the multi-path encoder includes four paths, i.e., a first, second, third and fourth path from the left side to the right side of the diagram, and a max-pooling operation is performed between two residual blocks in each path;
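The feature-map resolutions along each path follow directly from the block counts; a small sketch, under the assumption of 2 × 2 max-pooling and resolution-preserving ('same'-padded) residual blocks:

```python
def path_resolutions(input_hw, n_res, n_pool):
    """Feature-map sizes along one encoder path: a residual block keeps
    the resolution, and each 2x2 max-pool between blocks halves it."""
    h, w = input_hw
    sizes = []
    for i in range(n_res):
        sizes.append((h, w))
        if i < n_pool:
            h, w = h // 2, w // 2
    return sizes

# the four paths of the multi-path encoder: (residual blocks, poolings)
paths = {1: (4, 3), 2: (3, 2), 3: (2, 1), 4: (1, 0)}
```

For a 256 × 256 input, the first path therefore produces features at 256, 128, 64 and 32 pixels per side, while the fourth path keeps the full resolution throughout.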
two features are input into the attention fusion module each time. As shown in fig. 3, this embodiment concatenates the two features and applies a 3 × 3 convolution to obtain a combined feature A. Feature A is then processed by standard normalization and a ReLU activation function, followed by global average pooling, then passed sequentially through a 1 × 1 convolutional layer, a ReLU activation function and a second 1 × 1 convolutional layer, and finally processed by a sigmoid to obtain a feature map of dimension 1 × 1 × C, where C is the number of channels of the feature. This feature map is multiplied with feature A to selectively filter it, and the selectively filtered feature is summed with feature A to obtain the final output feature. As shown in fig. 2, the invention performs fusion three times: the input of the first fusion is the output of the first path and the output of the second path; the input of the second fusion is the output of the third path and the output of the first-layer attention module; the input of the third fusion is the output of the fourth path and the output of the second-layer attention module;
the decoder with reconstruction upsampling comprises three reconstruction-upsampling layers. In the first layer, the bottommost feature of the first path of the multi-path encoder is upsampled, spliced with the feature output by the third attention fusion module, and the spliced feature is input into a decoding module. In the second layer, the feature output by the first decoding module is upsampled, spliced with the feature output by the second attention fusion module, and input into a decoding module. In the third layer, the feature output by the second decoding module is upsampled, spliced with the feature output by the first attention fusion module, and input into a decoding module. Finally, a 1 × 1 convolution operation and a sigmoid activation function are applied to the feature output by the third decoding module to obtain the final segmentation result map.
The above operations in the attention fusion module can be expressed as:
u_i = C([e_i, S(x_{i-1})])
s_i = P_avg(R(B(u_i)))
a_i = σ(C_{1×1}(R(C_{1×1}(s_i))))
x_i = a_i ⊗ u_i + u_i
wherein u_i denotes the combined feature after splicing and convolution; C(·) denotes a 3 × 3 convolutional layer, S(·) denotes an upsampling operation, and [·,·] denotes a splicing operation; e_i and x_{i-1} denote the feature of the i-th layer encoder and the output of the (i-1)-th attention fusion module, respectively; s_i denotes the channel descriptor after pooling; B(·) and R(·) denote the standard normalization and ReLU activation functions, respectively, and P_avg denotes global average pooling; a_i denotes the channel attention weights, C_{1×1} denotes a 1 × 1 convolutional layer, and σ denotes the sigmoid activation function; x_i denotes the output feature of the module, ⊗ denotes multiplication, and + denotes pixel summation.
During training, the error between the prediction result and the label is computed by a cross-entropy loss function and propagated backwards through the gradient; the Adam optimization algorithm is adopted to update the model parameters, with the learning rate set to 0.0001. The cross-entropy loss function is expressed as:
L = -Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]
wherein y_i denotes the label of the picture and ŷ_i denotes the prediction result of the picture.
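The binary cross-entropy above can be sketched as follows; the clipping constant `eps` is an assumption added for numerical stability, not part of the formula:

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Pixel-wise binary cross entropy between a label map y (0/1) and
    a predicted probability map p, averaged over all pixels."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

An uninformative prediction of 0.5 everywhere gives a loss of ln 2 ≈ 0.693, while a perfect prediction drives the loss toward zero.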
During testing, the model with the highest accuracy of the verification set is selected, the model is loaded, then the test set is subjected to normalization processing, and the test set is input into a network to obtain a final segmentation result.
Example 2
The segmentation method of embodiment 1 is adopted. The implementation uses the Keras and TensorFlow open-source deep learning libraries, trains on an NVIDIA GeForce RTX 2080 Ti GPU, adopts the Adam optimization algorithm, and sets the learning rate to 0.0001. The 2018 ISIC skin cancer lesion segmentation data set and the LUNA lung CT data set are used.
One data set of this example, provided by the 2018 skin cancer lesion segmentation challenge, contains 2954 skin cancer lesion pictures in total, each of size 700 × 900 with a corresponding segmentation label map. 1815 pictures are used as the training set, 59 pictures as the verification set, and the remaining 520 pictures as the test set. To facilitate network training, all pictures are resized to 256 × 256. The test data are shown in fig. 4: the first row is the original data and the second row the corresponding labels; the first and second columns are from the LUNA data set, and the third and fourth columns from the ISIC data set.
In this example, four evaluation indexes are used: F1-score, Accuracy, Sensitivity and Specificity; the larger these indexes, the more accurate the segmentation. F1-score is used to assess the lesion area. As can be seen from Table 1, on the 2018 ISIC skin cancer lesion segmentation data set, compared with U-Net, R2-Unet, BCD-Unet and U-Net++, the present method improves the main indexes F1-score, Accuracy and Sensitivity while using fewer parameters; F1-score, the most important index, is improved by 2.98% over the U-Net method.
The fourth row of fig. 4 shows the result of the inventive network segmentation.
TABLE 1 Comparison of experimental results of this method and other methods on the ISIC 2018 data set

| Method | F1-score | Accuracy | Sensitivity | Specificity | Parameters (M) |
| --- | --- | --- | --- | --- | --- |
| U-Net | 0.8607 | 0.9417 | 0.8092 | 0.9796 | 9 |
| R2-Unet | 0.8740 | 0.9479 | 0.8104 | 0.9873 | 17.6 |
| BCD-Unet | 0.8637 | 0.9444 | 0.7822 | 0.9878 | 20.6 |
| U-Net++ | 0.8756 | 0.9472 | 0.8343 | 0.9795 | 9 |
| The present method | 0.8905 | 0.9527 | 0.8649 | 0.9779 | 4.3 |
Example 3
The segmentation method of embodiment 1 is used. Unlike embodiment 2, this embodiment uses the LUNA data set provided by the 2017 Kaggle lung nodule competition, which comprises 730 pictures and 730 corresponding segmentation label maps; each picture is 512 × 512 pixels. 70% of the pictures are used as the training set, 10% as the verification set, and the remaining 20% of the data as the test set.
Because the data are few, the training data set is augmented using rotation, flipping, elastic transformation and similar techniques, so that the network achieves good robustness and segmentation accuracy.
Four evaluation indexes are used: F1-score, Accuracy, Sensitivity and Specificity; the larger these indexes, the more accurate the segmentation. As can be seen from Table 2, on the LUNA data set, compared with U-Net, R2-Unet, BCD-Net and U-Net++, the present method improves the main indexes F1-score, Accuracy, Sensitivity and Specificity while using fewer parameters.
TABLE 2 Comparison of the results of this method and other methods on the LUNA data set

| Method | F1-score | Accuracy | Sensitivity | Specificity | Parameters (M) |
| --- | --- | --- | --- | --- | --- |
| U-Net | 0.9658 | 0.9872 | 0.9696 | 0.9872 | 9 |
| R2-Unet | 0.9823 | 0.9918 | 0.9832 | 0.9944 | 17.6 |
| BCD-Net | 0.9904 | 0.9972 | 0.9910 | 0.9982 | 20.6 |
| U-Net++ | 0.9899 | 0.9971 | 0.9942 | 0.9975 | 9 |
| The present method | 0.9924 | 0.9978 | 0.9944 | 0.9984 | 4.3 |
From the above table it can be seen that the present invention is effective and has significant advantages over the prior art methods of the same type.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (6)
1. A medical image automatic segmentation method based on multipath attention fusion is characterized by comprising the following steps:
s1, acquiring a medical picture data set, dividing the data set into a training set and a verification set, augmenting the pictures in the training set, and normalizing the augmented training-set pictures and the verification-set pictures;
s2, inputting the pictures in the training set into a multipath attention fusion network model, and outputting the pictures under the guidance of a cross entropy loss function to obtain a segmentation result graph, wherein the multipath attention fusion network model comprises a multipath encoder, an attention fusion module and a decoder with reconstruction upsampling, and the method comprises the following steps:
the multi-path encoder comprises 4 paths with different lengths, wherein the first path comprises 4 residual network modules and 3 maximum pooling operations, the second path comprises 3 residual network modules and 2 maximum pooling operations, the third path comprises 2 residual network modules and 1 maximum pooling operation, and the fourth path comprises 1 residual network module; wherein the operations performed in the residual module include:
201: inputting a feature map of size h × w × c into a 3 × 3 convolutional layer;
202: inputting the convolution result of 201 into standard normalization and a ReLU activation function;
203: inputting the result of 202 into a 3 × 3 convolutional layer, standard normalization and a ReLU activation function;
204: inputting the result of 203 into a 3 × 3 convolutional layer;
205: summing the convolution result of 204 and the convolution result of 201;
206: inputting the result of 205 into standard normalization and a ReLU activation function to obtain a feature of size h × w × c;
wherein h denotes the height of the feature map, w denotes the width of the feature map, and c denotes the number of channels of the feature map;
the attention fusion module takes two features as input at a time; the two features are concatenated and passed through a convolution operation to obtain a combined feature A; a convolution operation, a ReLU activation function and a further convolution operation are applied to the combined feature A in sequence, followed by a sigmoid, to obtain a feature map of dimension 1 × 1 × C, where C is the number of channels of the feature; this feature map is multiplied by the feature A to selectively filter it, and the selectively filtered feature is summed with the feature A to obtain the final output feature;
the decoder with reconstruction upsampling comprises three layers of reconstruction upsampling; the first layer upsamples the bottommost feature of the first path of the multipath encoder, splices it with the feature output by the third-layer attention fusion module, and inputs the spliced feature into a decoding module; the second layer upsamples the feature output by the first-layer decoding module, splices it with the feature output by the second-layer attention fusion module, and inputs the result into a decoding module; the third layer upsamples the feature output by the second-layer decoding module, splices it with the feature output by the first-layer attention fusion module, and inputs the result into a decoding module; finally, a 1 × 1 convolution operation and a sigmoid activation function are applied to the feature output by the third-layer decoding module to obtain the final segmentation result map;
s3, verifying the accuracy of the multipath attention fusion network model on the verification set data after each training iteration, and taking the network parameters with the highest accuracy as the parameters of the multipath attention fusion network model;
s4, inputting normalized image data to be segmented into the multipath attention fusion network model to obtain a segmentation result map.
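The residual module recited in steps 201-206 of claim 1 can be sketched as follows. This is a minimal PyTorch sketch under assumptions, not the patented implementation: "standard normalization" is read as batch normalization, padding of 1 is assumed so the 3 × 3 convolutions preserve the h × w resolution, and the class name `ResidualModule` is a placeholder.

```python
import torch
import torch.nn as nn


class ResidualModule(nn.Module):
    """Residual module of steps 201-206: the skip connection sums the
    output of the first 3x3 convolution (step 201) with the output of
    the third (step 204), keeping the h x w x c shape throughout."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # step 201
        self.bn1 = nn.BatchNorm2d(channels)                       # step 202
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # step 203
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)  # step 204
        self.bn3 = nn.BatchNorm2d(channels)                       # step 206
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.conv1(x)                       # 201
        y = self.relu(self.bn1(y1))              # 202
        y = self.relu(self.bn2(self.conv2(y)))   # 203
        y = self.conv3(y)                        # 204
        y = y + y1                               # 205: sum with result of 201
        return self.relu(self.bn3(y))            # 206
```

Note that, unlike a textbook residual block, the skip here starts after the first convolution rather than at the module input, following the literal wording of step 205.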
2. The method for automatic segmentation of medical images based on multi-path attention fusion as claimed in claim 1, wherein the augmenting process of the training set picture comprises:
rotating the pictures in the training set by 10 degrees, 20 degrees, -10 degrees and -20 degrees, and storing the rotated pictures;
flipping the pictures in the training set vertically and horizontally, and storing the flipped pictures;
performing elastic transformation on the pictures in the training set, and storing the elastically transformed pictures;
scaling the pictures in the training set by factors within the (20%, 80%) range, and storing the scaled pictures;
taking the original pictures in the training set together with the pictures obtained by the above processing as the training set, completing the augmentation.
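The augmentation of claim 2 can be sketched with NumPy and SciPy as below. This is an illustrative sketch under assumptions: the elastic transformation is omitted for brevity, and the three scale factors are arbitrary picks inside the (20%, 80%) range the claim specifies.

```python
import numpy as np
from scipy import ndimage


def augment(img: np.ndarray) -> list:
    """Return the original picture plus its augmented copies:
    rotations of +/-10 and +/-20 degrees, vertical and horizontal
    flips, and scaled versions (elastic transform omitted here)."""
    out = [img]
    # rotations by 10, 20, -10, -20 degrees, keeping the original size
    for angle in (10, 20, -10, -20):
        out.append(ndimage.rotate(img, angle, reshape=False, mode="nearest"))
    out.append(np.flipud(img))   # vertical flip
    out.append(np.fliplr(img))   # horizontal flip
    # example scale factors inside the (20%, 80%) range of the claim
    for s in (0.2, 0.5, 0.8):
        out.append(ndimage.zoom(img, s))
    return out
```

In practice the augmented copies would be written to disk alongside the originals, as the claim describes.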
3. The method for automatic segmentation of medical images based on multi-path attention fusion as claimed in claim 1, wherein the normalization process is expressed as:
I=(I-M)/Std;
where I denotes the image intensity, M denotes the mean of the image data, and Std denotes the standard deviation of the image data.
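The z-score normalization of claim 3 is a one-liner; a minimal NumPy sketch, reading I as the array of image intensities:

```python
import numpy as np


def normalize(img: np.ndarray) -> np.ndarray:
    """I = (I - M) / Std: subtract the mean intensity and divide by
    the standard deviation, giving zero mean and unit variance."""
    return (img - img.mean()) / img.std()
```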
4. The method for automatic segmentation of medical images based on multi-path attention fusion as claimed in claim 1, wherein the operation in the attention fusion module comprises:
211: for a feature M of resolution h × w × c from the multipath encoder and a feature F of resolution h/2 × w/2 × 2c output by the previous attention fusion module, first performing an upsampling operation on the feature F so that its spatial resolution becomes h × w, then splicing the upsampled feature F with the feature M, and inputting the spliced feature into a 3 × 3 convolutional layer;
212: inputting the convolution result of 211 into batch normalization and a ReLU activation function;
213: inputting the result of 212 into a global average pooling function to obtain a feature map of dimension 1 × 1 × c, where c is the number of channels of the feature;
214: inputting the result of 213 into a 1 × 1 convolutional layer, a ReLU activation function and a further 1 × 1 convolutional layer in sequence, and finally into a sigmoid function, to obtain a feature map of dimension 1 × 1 × c;
215: multiplying the feature map of 214 by the convolution result of 211 to obtain a selectively filtered feature;
216: summing the selectively filtered feature of 215 and the convolution result of 211 to obtain the final output feature.
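Steps 211-216 can be sketched in PyTorch as below. Several details are assumptions not fixed by the claim: the 3 × 3 convolution after concatenation is taken to reduce the c + 2c input channels back to c, the width of the bottleneck between the two 1 × 1 convolutions (the `reduction` parameter) is a free choice, and bilinear interpolation is used for the upsampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_


class AttentionFusion(nn.Module):
    """Attention fusion module of steps 211-216: fuse an encoder
    feature M (h x w x c) with the previous fusion output F
    (h/2 x w/2 x 2c) via a channel-attention reweighting."""

    def __init__(self, c: int, reduction: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(3 * c, c, 3, padding=1)  # 211: concat has c + 2c channels
        self.bn = nn.BatchNorm2d(c)                    # 212
        self.fc1 = nn.Conv2d(c, c // reduction, 1)     # 214: first 1x1 conv
        self.fc2 = nn.Conv2d(c // reduction, c, 1)     # 214: second 1x1 conv
        self.relu = nn.ReLU(inplace=True)

    def forward(self, m: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        # 211: upsample f to the resolution of m, concatenate, 3x3 conv
        f = F_.interpolate(f, size=m.shape[2:], mode="bilinear", align_corners=False)
        a = self.conv(torch.cat([m, f], dim=1))
        a = self.relu(self.bn(a))                       # 212
        w = F_.adaptive_avg_pool2d(a, 1)                # 213: 1 x 1 x c descriptor
        w = torch.sigmoid(self.fc2(self.relu(self.fc1(w))))  # 214
        return a * w + a                                # 215 + 216
```

The final line applies the per-channel weights (step 215) and adds back the unweighted feature (step 216), so channels the sigmoid suppresses are attenuated rather than zeroed out.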
5. The method of claim 1, wherein the operations at the decoder module comprise:
221: for a feature F of resolution h × w × c from the attention fusion module and a feature D of resolution h/2 × w/2 × 2c output by the previous decoder module, first performing an upsampling operation on the feature D so that its spatial resolution becomes h × w, then splicing the feature F with the upsampled feature D, and finally inputting the spliced feature into a 3 × 3 convolutional layer;
222: inputting the resulting feature of 221 into a 3 × 3 convolutional layer;
223: inputting the convolution result of 222 into batch normalization and a ReLU activation function;
224: inputting the result of 223 into a further 3 × 3 convolutional layer, batch normalization and a ReLU activation function;
225: inputting the result of 224 into a 3 × 3 convolutional layer;
226: summing the convolution result of 225 and the convolution result of 222;
227: inputting the result of 226 into batch normalization and a ReLU activation function to obtain the h × w × c feature.
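Steps 221-227 mirror the residual module of claim 1 after a fuse-and-upsample step, and can be sketched in PyTorch as below. Assumptions not fixed by the claim: the fusing 3 × 3 convolution reduces the c + 2c concatenated channels back to c, bilinear interpolation is used for the upsampling, and the skip of step 226 is read as summing the outputs of the first and last 3 × 3 convolutions of the residual-style part (steps 222 and 225).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_


class DecoderModule(nn.Module):
    """Decoder module of steps 221-227: upsample the deeper feature D,
    splice with the attention-fusion feature F, then apply a
    residual-style block of 3x3 convolutions."""

    def __init__(self, c: int):
        super().__init__()
        self.fuse = nn.Conv2d(3 * c, c, 3, padding=1)   # 221: concat has c + 2c channels
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)      # 222
        self.bn1 = nn.BatchNorm2d(c)                    # 223
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)      # 224
        self.bn2 = nn.BatchNorm2d(c)
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)      # 225
        self.bn3 = nn.BatchNorm2d(c)                    # 227
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        # 221: upsample d to the resolution of f, concatenate, 3x3 conv
        d = F_.interpolate(d, size=f.shape[2:], mode="bilinear", align_corners=False)
        x = self.fuse(torch.cat([f, d], dim=1))
        y1 = self.conv1(x)                              # 222
        y = self.relu(self.bn1(y1))                     # 223
        y = self.relu(self.bn2(self.conv2(y)))          # 224
        y = self.conv3(y)                               # 225
        y = y + y1                                      # 226: skip from 222
        return self.relu(self.bn3(y))                   # 227
```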
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010479507.5A CN111681252B (en) | 2020-05-30 | 2020-05-30 | Medical image automatic segmentation method based on multipath attention fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111681252A CN111681252A (en) | 2020-09-18 |
CN111681252B true CN111681252B (en) | 2022-05-03 |
Family
ID=72453028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010479507.5A Active CN111681252B (en) | 2020-05-30 | 2020-05-30 | Medical image automatic segmentation method based on multipath attention fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111681252B (en) |
Families Citing this family (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112164074B (en) * | 2020-09-22 | 2021-08-10 | 江南大学 | 3D CT bed fast segmentation method based on deep learning |
CN112132813B (en) * | 2020-09-24 | 2022-08-05 | 中国医学科学院生物医学工程研究所 | Skin ultrasonic image segmentation method based on improved UNet network model |
CN112348839B (en) * | 2020-10-27 | 2024-03-15 | 重庆大学 | Image segmentation method and system based on deep learning |
CN112183561B (en) * | 2020-11-09 | 2024-04-30 | 山东中医药大学 | Combined fusion-subtraction automatic encoder algorithm for image feature extraction |
CN112258129A (en) * | 2020-11-12 | 2021-01-22 | 拉扎斯网络科技(上海)有限公司 | Distribution path prediction network training and distribution resource scheduling method and device |
CN112330542B (en) * | 2020-11-18 | 2022-05-03 | 重庆邮电大学 | Image reconstruction system and method based on CRCSAN network |
CN112216371B (en) * | 2020-11-20 | 2022-07-12 | 中国科学院大学 | Multi-path multi-scale parallel coding and decoding network image segmentation method, system and medium |
CN112614112B (en) * | 2020-12-24 | 2023-05-12 | 苏州大学 | Segmentation method for stripe damage in MCSLI image |
CN112734762B (en) * | 2020-12-31 | 2022-10-11 | 西华师范大学 | Dual-path UNet network tumor segmentation method based on covariance self-attention mechanism |
CN112950639B (en) * | 2020-12-31 | 2024-05-10 | 山西三友和智慧信息技术股份有限公司 | SA-Net-based MRI medical image segmentation method |
CN112767502B (en) * | 2021-01-08 | 2023-04-07 | 广东中科天机医疗装备有限公司 | Image processing method and device based on medical image model |
CN112651979B (en) * | 2021-01-11 | 2023-10-10 | 华南农业大学 | Lung X-ray image segmentation method, system, computer equipment and storage medium |
CN112862830B (en) * | 2021-01-28 | 2023-12-22 | 陕西师范大学 | Multi-mode image segmentation method, system, terminal and readable storage medium |
CN112927236B (en) * | 2021-03-01 | 2021-10-15 | 南京理工大学 | Clothing analysis method and system based on channel attention and self-supervision constraint |
CN112927209B (en) * | 2021-03-05 | 2022-02-11 | 重庆邮电大学 | CNN-based significance detection system and method |
CN113065578B (en) * | 2021-03-10 | 2022-09-23 | 合肥市正茂科技有限公司 | Image visual semantic segmentation method based on double-path region attention coding and decoding |
CN113139972A (en) * | 2021-03-22 | 2021-07-20 | 杭州电子科技大学 | Cerebral apoplexy MRI image focus region segmentation method based on artificial intelligence |
US11580646B2 (en) | 2021-03-26 | 2023-02-14 | Nanjing University Of Posts And Telecommunications | Medical image segmentation method based on U-Net |
CN113129316A (en) * | 2021-04-15 | 2021-07-16 | 重庆邮电大学 | Heart MRI image multi-task segmentation method based on multi-mode complementary information exploration |
CN113128583B (en) * | 2021-04-15 | 2022-08-23 | 重庆邮电大学 | Medical image fusion method and medium based on multi-scale mechanism and residual attention |
CN113012155B (en) * | 2021-05-07 | 2023-05-05 | 刘慧烨 | Bone segmentation method in hip joint image, electronic equipment and storage medium |
CN113343995A (en) * | 2021-05-07 | 2021-09-03 | 西安智诊智能科技有限公司 | Image segmentation method based on reverse attention network |
CN113379773B (en) * | 2021-05-28 | 2023-04-28 | 陕西大智慧医疗科技股份有限公司 | Segmentation model establishment and segmentation method and device based on dual-attention mechanism |
CN113744279B (en) * | 2021-06-09 | 2023-11-14 | 东北大学 | Image segmentation method based on FAF-Net network |
CN113240691B (en) * | 2021-06-10 | 2023-08-01 | 南京邮电大学 | Medical image segmentation method based on U-shaped network |
CN113361445B (en) * | 2021-06-22 | 2023-06-20 | 华南理工大学 | Attention mechanism-based document binarization processing method and system |
CN113421276B (en) * | 2021-07-02 | 2023-07-21 | 深圳大学 | Image processing method, device and storage medium |
CN113256641B (en) * | 2021-07-08 | 2021-10-01 | 湖南大学 | Skin lesion image segmentation method based on deep learning |
CN113393469A (en) * | 2021-07-09 | 2021-09-14 | 浙江工业大学 | Medical image segmentation method and device based on cyclic residual convolutional neural network |
CN113744275B (en) * | 2021-07-26 | 2023-10-20 | 重庆邮电大学 | Feature transformation-based three-dimensional CBCT tooth image segmentation method |
CN113570611A (en) * | 2021-07-27 | 2021-10-29 | 华北理工大学 | Mineral real-time segmentation method based on multi-feature fusion decoder |
CN113642581B (en) * | 2021-08-12 | 2023-09-22 | 福州大学 | Image semantic segmentation method and system based on coding multipath semantic cross network |
CN113706544B (en) * | 2021-08-19 | 2023-08-29 | 天津师范大学 | Medical image segmentation method based on complete attention convolutional neural network |
CN113850821A (en) * | 2021-09-17 | 2021-12-28 | 武汉兰丁智能医学股份有限公司 | Attention mechanism and multi-scale fusion leukocyte segmentation method |
CN114022486A (en) * | 2021-10-19 | 2022-02-08 | 西安工程大学 | Medical image segmentation method based on improved U-net network |
CN114119627B (en) * | 2021-10-19 | 2022-05-17 | 北京科技大学 | High-temperature alloy microstructure image segmentation method and device based on deep learning |
CN114155231A (en) * | 2021-12-08 | 2022-03-08 | 电子科技大学 | Medical image fusion algorithm based on improved Unet network |
CN114332535B (en) * | 2021-12-30 | 2022-07-15 | 宁波大学 | sMRI image classification method based on high-resolution complementary attention UNet classifier |
CN114581859B (en) * | 2022-05-07 | 2022-09-13 | 北京科技大学 | Converter slag discharging monitoring method and system |
CN114757938B (en) * | 2022-05-16 | 2023-09-15 | 国网四川省电力公司电力科学研究院 | Transformer oil leakage identification method and system |
CN114782440B (en) * | 2022-06-21 | 2022-10-14 | 杭州三坛医疗科技有限公司 | Medical image segmentation method and electronic equipment |
CN115239716B (en) * | 2022-09-22 | 2023-01-24 | 杭州影想未来科技有限公司 | Medical image segmentation method based on shape prior U-Net |
CN116402780B (en) * | 2023-03-31 | 2024-04-02 | 北京长木谷医疗科技股份有限公司 | Thoracic vertebra image segmentation method and device based on double self-attention and deep learning |
CN116612131B (en) * | 2023-05-22 | 2024-02-13 | 山东省人工智能研究院 | Cardiac MRI structure segmentation method based on ADC-UNet model |
CN117422880B (en) * | 2023-12-18 | 2024-03-22 | 齐鲁工业大学(山东省科学院) | Segmentation method and system combining improved attention mechanism and CV model |
CN117974670B (en) * | 2024-04-02 | 2024-06-04 | 齐鲁工业大学(山东省科学院) | Image analysis method, device, equipment and medium for fusing scattering network |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106937113A (en) * | 2011-12-05 | 2017-07-07 | 同济大学 | Method for compressing image and device based on mixing colourity sample rate |
CN109191476A (en) * | 2018-09-10 | 2019-01-11 | 重庆邮电大学 | The automatic segmentation of Biomedical Image based on U-net network structure |
CN109671094A (en) * | 2018-11-09 | 2019-04-23 | 杭州电子科技大学 | A kind of eye fundus image blood vessel segmentation method based on frequency domain classification |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10755391B2 (en) * | 2018-05-15 | 2020-08-25 | Adobe Inc. | Digital image completion by learning generation and patch matching jointly |
- 2020-05-30: CN application CN202010479507.5A, patent CN111681252B (en), status Active
Non-Patent Citations (4)
Title |
---|
Deep Learning Techniques for Medical Image Segmentation: Achievements and Challenges;Mohammad Hesam Hesamian等;《J Digit Imaging》;20190529;第32卷(第4期);582-596 * |
LVC-Net: Medical image segmentation with noisy label based on local visual cues;Yucheng Shu等;《International Conference on Medical Image Computing and Computer-Assisted Intervention》;20191010;558-566 * |
Weight-Adjusted Image Semantic Segmentation Algorithm Based on Multipath Networks;Qin Xiaofei et al.;《Optical Instruments (光学仪器)》;20200228;Vol. 42 (No. 1);46-51 *
Multimodal Brain Image Fusion Method Based on Adaptive Cloud Model;Zhao Jia et al.;《Computer Science (计算机科学)》;20161130;Vol. 43 (No. 11);391-296, 321 *
Also Published As
Publication number | Publication date |
---|---|
CN111681252A (en) | 2020-09-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111681252B (en) | Medical image automatic segmentation method based on multipath attention fusion | |
WO2022047625A1 (en) | Image processing method and system, and computer storage medium | |
CN110889852B (en) | Liver segmentation method based on residual error-attention deep neural network | |
CN110889853B (en) | Tumor segmentation method based on residual error-attention deep neural network | |
CN113012172A (en) | AS-UNet-based medical image segmentation method and system | |
CN111860528B (en) | Image segmentation model based on improved U-Net network and training method | |
CN112465827A (en) | Contour perception multi-organ segmentation network construction method based on class-by-class convolution operation | |
CN110223304B (en) | Image segmentation method and device based on multipath aggregation and computer-readable storage medium | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN110930378B (en) | Emphysema image processing method and system based on low data demand | |
CN113468996B (en) | Camouflage object detection method based on edge refinement | |
CN112184582B (en) | Attention mechanism-based image completion method and device | |
CN113706545A (en) | Semi-supervised image segmentation method based on dual-branch nerve discrimination dimensionality reduction | |
CN114663440A (en) | Fundus image focus segmentation method based on deep learning | |
CN110599495B (en) | Image segmentation method based on semantic information mining | |
CN114049314A (en) | Medical image segmentation method based on feature rearrangement and gated axial attention | |
Xu et al. | AutoSegNet: An automated neural network for image segmentation | |
CN114821050A (en) | Named image segmentation method based on transformer | |
CN116524307A (en) | Self-supervision pre-training method based on diffusion model | |
CN117058307A (en) | Method, system, equipment and storage medium for generating heart three-dimensional nuclear magnetic resonance image | |
CN112990359B (en) | Image data processing method, device, computer and storage medium | |
Zhu et al. | Brain tumor segmentation for missing modalities by supplementing missing features | |
Lai et al. | Generative focused feedback residual networks for image steganalysis and hidden information reconstruction | |
CN110458849B (en) | Image segmentation method based on feature correction | |
CN116779091A (en) | Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||