CN115457051A - Liver CT image segmentation method based on global self-attention and multi-scale feature fusion - Google Patents
Info
- Publication number
- CN115457051A (application CN202211064580.1A)
- Authority
- CN
- China
- Prior art keywords
- attention
- features
- feature
- convolution
- scale
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 210000004185 liver Anatomy 0.000 title claims abstract description 49
- 238000000034 method Methods 0.000 title claims abstract description 42
- 230000004927 fusion Effects 0.000 title claims abstract description 38
- 238000003709 image segmentation Methods 0.000 title claims abstract description 19
- 230000011218 segmentation Effects 0.000 claims abstract description 52
- 238000012545 processing Methods 0.000 claims abstract description 13
- 238000007781 pre-processing Methods 0.000 claims abstract description 11
- 230000003187 abdominal effect Effects 0.000 claims abstract description 6
- 238000005070 sampling Methods 0.000 claims abstract description 5
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 3
- 230000006870 function Effects 0.000 claims description 17
- 238000012549 training Methods 0.000 claims description 13
- 210000000056 organ Anatomy 0.000 claims description 11
- 238000004364 calculation method Methods 0.000 claims description 10
- 230000000694 effects Effects 0.000 claims description 9
- 238000000605 extraction Methods 0.000 claims description 8
- 238000013507 mapping Methods 0.000 claims description 7
- 238000010606 normalization Methods 0.000 claims description 7
- 230000009466 transformation Effects 0.000 claims description 6
- 239000011159 matrix material Substances 0.000 claims description 4
- 238000012360 testing method Methods 0.000 claims description 3
- 238000012795 verification Methods 0.000 claims description 3
- 238000002591 computed tomography Methods 0.000 description 36
- 208000014018 liver neoplasm Diseases 0.000 description 6
- 238000013135 deep learning Methods 0.000 description 5
- 230000007246 mechanism Effects 0.000 description 5
- 206010019695 Hepatic neoplasm Diseases 0.000 description 4
- 230000008569 process Effects 0.000 description 4
- 206010028980 Neoplasm Diseases 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 238000000137 annealing Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 238000003745 diagnosis Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 201000007270 liver cancer Diseases 0.000 description 2
- 238000010801 machine learning Methods 0.000 description 2
- 210000001015 abdomen Anatomy 0.000 description 1
- 238000003759 clinical diagnosis Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000013170 computed tomography imaging Methods 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 230000005764 inhibitory process Effects 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/90—Dynamic range modification of images or parts thereof
- G06T5/94—Dynamic range modification of images or parts thereof based on local image properties, e.g. for local contrast enhancement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10072—Tomographic images
- G06T2207/10081—Computed x-ray tomography [CT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30056—Liver; Hepatic
Abstract
The invention relates to a liver CT image segmentation method based on global self-attention and multi-scale feature fusion, and belongs to the technical field of medical image processing. The method comprises the following steps: (1) acquiring an abdominal CT dataset and preprocessing it; (2) extracting multi-scale features with a ResNeXt convolutional neural network, introducing multi-scale spatial information; (3) passing the multi-scale features through a global self-attention module to obtain a globally fused self-attention feature; (4) extracting features from the fused feature with an improved convolution module and finally upsampling to obtain the segmentation result. The method is validated on the public LiTS dataset: the mean Dice value of the overlap between the predicted and ground-truth segmentations reaches 96.4%, 4.3 percentage points higher than the classical UNet model.
Description
Technical Field
The invention relates to a liver CT image segmentation method based on global self-attention and multi-scale feature fusion, and belongs to the technical field of medical image processing.
Background
Liver cancer is among the cancers with the fastest-growing morbidity and mortality worldwide. Computed tomography (CT) is a common clinical method for tumor diagnosis, mainly because CT imaging largely avoids the organ-overlap problem found in other imaging techniques and is therefore better suited to tumor identification. Liver segmentation is a key step in the interventional clinical diagnosis and analysis of liver cancer; accurate liver segmentation results can greatly improve the efficiency with which physicians read CT images, so that diagnosis and treatment plans can be made as early as possible.
As the number of CT images grows, the scan data of a single case typically comprises hundreds of CT slices, and analyzing them manually one by one suffers from subjective interference, inconsistent standards, complex workflow, heavy time and labor cost, and poor repeatability. Accurate automatic segmentation of the liver in abdominal CT images therefore has far higher value than manual segmentation. The current difficulty of liver segmentation mainly lies in the low internal contrast of the liver, the small intensity difference between the liver and neighboring organs, the blurred boundaries between adjacent organs, and the large variation in liver shape. Liver segmentation from CT images is thus a challenging task.
Automatic liver segmentation is mainly addressed by three families of methods. 1) Traditional image segmentation: the segmentation task is solved with shallow features such as grayscale and texture, which makes these methods sensitive to noisy pixels and unable to exploit deeper image features. 2) Machine learning: data patterns are learned from large-scale data, but most machine learning algorithms require carefully hand-crafted image features, and both the expressiveness of the features and the final segmentation result are limited by how the features are chosen. 3) Deep learning: increasingly abstract features are extracted without additional intermediate steps, and the feature selection is continuously adjusted according to the results, greatly improving accuracy. Existing deep learning methods generally segment better than traditional image processing, but they remain insufficient for segmenting the liver and liver tumors, and do not adequately account for characteristics such as the blurred boundaries and variable positions that the liver and liver tumors exhibit in CT images. During downsampling, many extracted features contribute little or nothing to the segmentation result, yet they are not attenuated and are expressed in the same way as the key segmentation features, which harms the segmentation result.
In addition, the skip-connection scheme of the traditional U-Net introduces a semantic gap that causes feature mismatch, and some multi-scale methods do not fully consider the associations among features, which degrades the performance of the segmentation model.
Disclosure of Invention
In order to solve the above problems, the invention provides a liver CT image segmentation method based on global self-attention and multi-scale feature fusion. ResNeXt, which uses grouped convolution, is selected as the image feature extraction network to obtain richer image features without increasing computation time. The problem of blurred liver boundaries is addressed by extracting and fusing features at different scales through a multi-scale architecture. Since relationships necessarily exist between the liver and other organs in a CT image, a self-attention mechanism is introduced to capture the relationships among the extracted features. Finally, the features are fused through a residual convolution block with an improved attention method, so that the features are better expressed and a better liver segmentation result is obtained.
The technical solution of the invention is as follows: a liver CT image segmentation method based on global self-attention and multi-scale feature fusion, comprising the following specific steps:
Step 1, image preprocessing: process the CT images in the LiTS dataset according to the HU value range to increase contrast, and expand the dataset with random flipping and similar augmentations.
Step 2, obtaining same-dimension features and multi-scale features: after the Step 1 preprocessing, extract image features with a ResNeXt convolutional neural network, obtain convolution features of uniform dimension via linear transformation, and build the multi-scale features from these convolution features.
Step 3, obtaining the global self-attention fusion feature: feed the multi-scale features obtained in Step 2 through a global self-attention (Non-Local) module to obtain a self-attention fusion feature containing global information, so as to capture the relationship between target features and surrounding features.
Step 4, extract features from the self-attention fusion feature obtained in Step 3 through an improved convolution module, highlighting the contribution of important semantic features in the channel dimension, and finally upsample to obtain the segmentation result.
Further, the specific steps of Step1 are as follows:
Step1.1, process the CT images in the LiTS dataset according to the HU value range corresponding to the liver to increase contrast: clip CT values to the range −130 HU to 230 HU, i.e., window width 360 HU and window level 50 HU, and then normalize the processed CT images.
Step1.2, data expansion: apply random horizontal flipping, vertical flipping, scaling, and cropping for augmentation. After random expansion, the data are split, with 82% used as the training set and the remaining 18% as the test set; the training set is further divided into training data and validation data in a ratio of 8.
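The windowing and normalization in Step1.1 can be sketched as follows; the function name and array-based interface are illustrative, not from the patent:

```python
import numpy as np

def window_and_normalize(ct_hu, hu_min=-130.0, hu_max=230.0):
    """Clip a CT slice to the liver window described above
    (window width 360 HU, window level 50 HU, i.e. [-130, 230] HU)
    and linearly rescale the result to [0, 1]."""
    clipped = np.clip(ct_hu.astype(np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)
```

With this window, the window level of 50 HU maps exactly to the middle of the normalized range, 0.5.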
Further, the specific steps of Step2 are as follows:
Step2.1, after image preprocessing, use the first five layers of the ResNeXt-101 network as the feature extraction layers. The convolution in each ResNeXt block is divided into 32 paths, each with an intermediate channel dimension of 4; different paths act as different feature subspaces for extracting different semantic features, and the sparser relationship between the convolution kernels of different paths reduces the risk of overfitting.
Step2.2, unify the channel dimensions of the Layer1–4 outputs of the ResNeXt network to 64 via linear transformation, and upsample the feature maps to the Layer1 size. Concatenate the four features and compress the concatenated feature to 64 channels with a 1 × 1 convolution to obtain the multi-scale feature, whose channel count and feature map size are consistent with those of the features after Layer1–4 processing.
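A minimal PyTorch sketch of Step2.2; the ResNeXt-101 stage channel counts (256, 512, 1024, 2048) and the use of bilinear upsampling are assumptions, since the patent only fixes the unified dimension of 64:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Project four backbone stage outputs to 64 channels, upsample each
    to the Layer1 resolution, concatenate, and compress back to 64
    channels with a 1x1 convolution."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), dim=64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_channels)
        self.compress = nn.Conv2d(4 * dim, dim, 1)  # 1x1 fusion conv

    def forward(self, feats):
        target = feats[0].shape[-2:]  # Layer1 spatial size
        mapped = [F.interpolate(p(f), size=target, mode='bilinear',
                                align_corners=False)
                  for p, f in zip(self.proj, feats)]
        return self.compress(torch.cat(mapped, dim=1))
```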
Further, the Step3 comprises the following specific steps:
Step3.1, different organs in an abdominal CT image are interrelated, and exploiting this relationship can improve liver segmentation. Inspired by the non-local means algorithm, which computes the correlation between the current position and other positions in the image, the multi-scale feature from Step 2 is passed through three separate linear mappings, implemented as 1 × 1 convolutions, to obtain the Key, Query, and Value embedded-space features.
Step3.2 calculates the similarity between the Key and Query features. The correlation function follows the Gaussian form chosen by the non-local means method:

$f(x_i, x_j) = e^{\theta(x_i)^{\mathsf T} \phi(x_j)}$

where $x_i$ is the i-th position of the input feature map, j ranges over all positions that may be associated with i, and $\theta$ and $\phi$ denote the Query and Key embeddings. The computed similarity is then used to weight Value to obtain the self-attention feature.
Step3.3 passes the self-attention feature through a Softmax layer to obtain the self-attention-weighted output, so that the learned long-range dependencies are incorporated into the output feature. The overall calculation is:

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$

where C(x) is the Softmax normalization function, g linearly maps the representation of input position j (typically implemented as a 1 × 1 convolution), and f computes the correlation between the i-th and j-th input positions.
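Steps 3.1–3.3 correspond to a standard embedded-Gaussian non-local block. The following PyTorch sketch assumes an inner embedding dimension of 32 and a residual output path, details the patent does not pin down:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local (global self-attention) block: 1x1
    convolutions produce Query/Key/Value, a softmax over the pairwise
    similarities plays the role of C(x), and a residual connection
    merges the attended feature back into the input."""
    def __init__(self, channels, inner=32):
        super().__init__()
        self.query = nn.Conv2d(channels, inner, 1)
        self.key = nn.Conv2d(channels, inner, 1)
        self.value = nn.Conv2d(channels, inner, 1)  # the mapping g
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, inner)
        k = self.key(x).flatten(2)                     # (b, inner, hw)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, hw, inner)
        attn = torch.softmax(q @ k, dim=-1)            # f(x_i, x_j) / C(x)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.out(y) + x                         # residual fusion
```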
Further, the Step4 comprises the following specific steps:
Step4.1, the fusion feature from Step 3, containing multi-scale information and self-attention relationships, undergoes further feature extraction through an improved convolution module. The multi-scale self-attention fusion feature is first mapped to the specified channel dimension by a 1 × 1 convolution; the sum of the features produced by a 1 × 1 convolution and a 3 × 3 convolution is then taken.
Step4.2 uses a channel attention (CA) module acting on the channel dimension to recalibrate the feature channels, and uses a residual path to fuse the original feature with the channel-attention feature, giving the output of the residual module:

$Y_{MRA}(X) = Y_{CA}(W_L X + W_E X) + X$

where $Y_{MRA}(X)$ denotes the multi-level residual attention convolution operation and X the input feature; $W_L$ is a 1 × 1 convolution matrix that linearly maps the original input, equivalent to a residual path; $W_E$ is a 3 × 3 convolution matrix for feature extraction of the input; and $Y_{CA}$ denotes the channel attention operation.
Step4.3, the extracted features follow the multi-path parallel idea of ensemble learning to produce four groups of segmentation outputs, which are averaged to form the final output.
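The formula $Y_{MRA}(X) = Y_{CA}(W_L X + W_E X) + X$ can be sketched in PyTorch as below; the squeeze-and-excitation form of the channel attention $Y_{CA}$ and its reduction ratio are assumptions, as the patent does not specify them:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention standing in for Y_CA (an assumed
    variant): global average pool, bottleneck MLP, sigmoid gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class MRABlock(nn.Module):
    """Y_MRA(X) = Y_CA(W_L X + W_E X) + X: a 1x1 linear path W_L, a 3x3
    extraction path W_E, channel attention on their sum, and an outer
    residual connection back to the input."""
    def __init__(self, channels):
        super().__init__()
        self.w_l = nn.Conv2d(channels, channels, 1)             # W_L
        self.w_e = nn.Conv2d(channels, channels, 3, padding=1)  # W_E
        self.ca = ChannelAttention(channels)                    # Y_CA

    def forward(self, x):
        return self.ca(self.w_l(x) + self.w_e(x)) + x
```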
The invention is further explained below with respect to Step 1 and Step 4:
1) Data preprocessing method:
The original CT image covers a wide range of CT values, its overall contrast is poor, and the gray-level differences between organs are too small to distinguish. Medical CT images are usually processed with the HU value range of the target organ to increase contrast. The liver is usually processed with a window width of 150 and a window level of 30, but the liver and liver tumors differ in gray level; processing only by the liver's HU values inevitably loses gray information in part of the tumor region, discarding important information and hurting training. To address this, the CT value range of −130 HU to 230 HU, i.e., window width 360 HU and window level 50 HU, was obtained by analyzing the histogram distribution of HU values. After normalization, the processed image enhances the contrast between organs while maximally retaining the information of the target region, and is better suited to model training; the image before processing is shown in Fig. 3(a) and the image after processing in Fig. 3(b).
2) Designing a loss function:
For the imbalance between positive and negative samples in the dataset, binary cross-entropy (BCE) and Dice loss (DL) are combined as a weighted sum to form the training loss. The loss function L is:

$L = \omega\, L_{BCE}(y, \hat y) + (1 - \omega)\left(1 - \frac{2\sum y \hat y + \epsilon}{\sum y + \sum \hat y + \epsilon}\right)$

where y denotes the ground-truth segmentation map, $\hat y$ the segmentation map predicted by the model, $\omega = 0.5$ the weight between the two losses, and $\epsilon = 1.0$ the smoothing term set to avoid a zero denominator.
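A sketch of the weighted BCE + Dice loss with ω = 0.5 and ε = 1.0; the exact soft-Dice formulation (intersection over the sum of masses) is a common variant assumed here:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(pred, target, omega=0.5, eps=1.0):
    """Weighted sum of binary cross-entropy and Dice loss; the
    smoothing term eps=1.0 guards against a zero denominator."""
    bce = F.binary_cross_entropy(pred, target)
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return omega * bce + (1 - omega) * dice
```

When prediction and ground truth agree perfectly, both terms vanish and the loss is zero.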
The invention has the beneficial effects that:
1. For the characteristics of the liver segmentation task, the proposed liver CT image segmentation method based on global self-attention and multi-scale feature fusion selects a multi-scale strategy to extract varied features: the multi-scale features extracted by different network layers introduce multi-scale spatial information and address problems such as blurred boundaries in liver segmentation. A global self-attention mechanism is introduced to model the relationships among image features of different semantic categories, so that the model better captures the association between the semantic features of the liver and those of other organs, alleviating the problem of large variation in liver shape.
2. Each channel dimension corresponds to one type of semantic information, which maps to one type of image feature in the original image. The image features corresponding to the liver should carry higher importance, and identical channel weights clearly do not favor the expression of key features. For this problem, the invention designs a multi-level residual attention convolution (MRA) module to highlight important semantic features in the channel dimension.
In summary, the liver CT image segmentation method based on global self-attention and multi-scale feature fusion first uses a ResNeXt convolutional neural network to obtain multi-scale features from the abdominal CT image, then uses the global self-attention module to capture spatial positional relationships, combined with the multi-level residual attention module to highlight important semantic features in the channel dimension; finally, the accuracy of liver image segmentation is improved.
Drawings
FIG. 1 is a diagram of the liver CT image segmentation method based on global self-attention and multi-scale feature fusion;
FIG. 2 is a schematic diagram of a global self-attention-based module according to the present invention;
FIG. 3 compares images before and after the preprocessing of the invention; (a) the image before processing; (b) the image after processing;
FIG. 4 is a visual comparison of the segmentation results of the invention; (a) the CT picture; (b) the reference segmentation standard; (c) the segmentation result of the original model; (d) the result after adding the self-attention module; (e) the result after adding the improved convolution module.
Detailed Description
Example 1: as shown in Figs. 1–4, a liver CT image segmentation method based on global self-attention and multi-scale feature fusion specifically comprises the following steps:
Step 1, image preprocessing: process the CT images in the LiTS dataset according to the HU value range to increase contrast, and expand the dataset with random flipping and similar augmentations.
Further, the specific steps of Step1 are as follows:
Step1.1, process the CT images in the LiTS dataset according to the HU value range corresponding to the liver to increase contrast: clip CT values to the range −130 HU to 230 HU, i.e., window width 360 HU and window level 50 HU, and then normalize the processed CT images.
Step1.2, data expansion: apply random horizontal flipping, vertical flipping, scaling, and cropping for augmentation. After random expansion, the data are split, with 82% used as the training set and the remaining 18% as the test set; the training set is further divided into training data and validation data in a ratio of 8.
Step 2, obtaining same-dimension features and multi-scale features: after the Step 1 preprocessing, extract image features with a ResNeXt convolutional neural network, obtain convolution features of uniform dimension via linear transformation, and build the multi-scale features from these convolution features.
Further, the specific steps of Step2 are as follows:
Step2.1, after image preprocessing, use the first five layers of the ResNeXt-101 network as the feature extraction layers; the convolution in each ResNeXt block is divided into 32 paths, each with an intermediate channel dimension of 4, and different paths act as different feature subspaces for extracting different semantic features.
Step2.2, unify the channel dimensions of the Layer1–4 outputs of the ResNeXt network to 64 via linear transformation, and upsample the feature maps to the Layer1 size. Concatenate the four features and compress the concatenated feature to 64 channels with a 1 × 1 convolution to obtain the multi-scale feature, whose channel count and feature map size are consistent with those of the features after Layer1–4 processing.
Step 3, obtaining the global self-attention fusion feature: feed the multi-scale features obtained in Step 2 through a global self-attention (Non-Local) module to obtain a self-attention fusion feature containing global information, so as to capture the relationship between target features and surrounding features.
Further, the specific steps of Step3 are as follows:
Step3.1, certain relationships exist among different organs in abdominal CT images, and capturing them can improve liver segmentation. Inspired by the non-local means algorithm, the invention adopts Non-Local global self-attention: starting from the multi-scale feature obtained in Step 2, three separate linear mappings, implemented as 1 × 1 convolutions, yield the Key, Query, and Value embedded-space features. The multi-scale fusion features under different attention methods are compared in Table 1.
TABLE 1 Comparison of multi-scale fusion features using different attention methods
Step3.2 calculates the similarity between the Key and Query features. The correlation function follows the Gaussian form chosen by the non-local means method:

$f(x_i, x_j) = e^{\theta(x_i)^{\mathsf T} \phi(x_j)}$

where $x_i$ is the i-th position of the input feature map, j ranges over all positions that may be associated with i, and $\theta$ and $\phi$ denote the Query and Key embeddings. The computed similarity is then used to weight Value to obtain the self-attention feature.
Step3.3 passes the self-attention feature through a Softmax layer to obtain the self-attention-weighted output, so that the learned long-range dependencies are incorporated into the output feature. The overall calculation is:

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$

where C(x) is the Softmax normalization function, g linearly maps the representation of input position j (typically implemented as a 1 × 1 convolution), and f computes the correlation between the i-th and j-th input positions.
Step 4, extract features from the self-attention fusion feature obtained in Step 3 through an improved convolution module, highlighting the contribution of important semantic features in the channel dimension, and finally upsample to obtain the segmentation result.
Further, the specific steps of Step4 are as follows:
Step4.1, the fusion feature from Step 3, containing multi-scale information and self-attention relationships, undergoes further information extraction through an improved convolution module. The multi-scale self-attention fusion feature is first mapped to the specified channel dimension by a 1 × 1 convolution; the sum of the features produced by a 1 × 1 convolution and a 3 × 3 convolution is then taken.
Step4.2 uses a channel attention (CA) module acting on the channel dimension to recalibrate the feature channels, and uses a residual path to fuse the original feature with the channel-attention feature, giving the output of the residual module:

$Y_{MRA}(X) = Y_{CA}(W_L X + W_E X) + X$

where $Y_{MRA}(X)$ denotes the multi-level residual attention convolution operation and X the input feature; $W_L$ is a 1 × 1 convolution matrix that linearly maps the original input, equivalent to a residual path; $W_E$ is a 3 × 3 convolution matrix for feature extraction of the input; and $Y_{CA}$ denotes the channel attention operation.
Step4.3, the extracted features follow the multi-path parallel idea of ensemble learning to produce four groups of segmentation outputs, which are averaged to form the final output.
The experiments were run on a hardware platform with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz and an NVIDIA TITAN Xp GPU, under Ubuntu 18.04.1; the software platform comprises the CUDA GPU parallel computing architecture and the PyTorch deep learning framework based on the Python programming language. The Adam optimizer is used, and the learning rate follows a cosine annealing strategy with an initial learning rate of 0.001 and a minimum of 0.00001, reset every 30 epochs. Training runs for 80 epochs with a batch size of 4.
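The optimizer and learning-rate schedule described above map directly onto PyTorch's built-ins; the one-layer model below is a stand-in for the actual segmentation network:

```python
import torch
import torch.nn as nn

# Adam optimizer with cosine annealing restarted every 30 epochs,
# decaying from 1e-3 down to 1e-5, as described in the experiment setup.
model = nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=30, eta_min=1e-5)
```

In a training loop, `scheduler.step()` is called once per epoch; at epoch 30 the learning rate resets to its initial value.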
Table 2 compares the proposed method with methods from the medical image segmentation field on the LiTS dataset, including classical segmentation algorithms such as UNet and FCN, methods such as DAF and MsAUNet that likewise use multi-scale feature fusion, and well-known liver segmentation algorithms such as H-DenseUNet and Multiple UNet. The liver segmentation result of the proposed method far exceeds classical segmentation algorithms including FCN: the mean Dice value of the overlap between the predicted and ground-truth segmentations reaches 96.4%, 4.3 percentage points higher than the classical UNet model. Since the invention achieves results comparable to 3D methods while using a 2D model, it is strongly competitive for the liver segmentation task.
TABLE 2 comparison with existing Process
Fig. 4 shows the experimental segmentation results of the invention on the LiTS dataset: (a) the CT picture; (b) the reference segmentation standard; (c) the segmentation result of the original model; (d) the result after adding the self-attention module; (e) the result after adding the improved convolution module. After the self-attention mechanism is added, erroneous predictions in regions outside the liver are alleviated to some extent and some false-positive predictions are eliminated, further demonstrating the effect of self-attention in the liver segmentation task. Further adding the MRA module with its refined channel attention mechanism successfully eliminates most false-positive predictions by enhancing or suppressing semantic features along the channel dimension, and the segmentation edges become closer to the true segmentation result.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (5)
1. The liver CT image segmentation method based on the fusion of global self-attention and multi-scale features is characterized by comprising the following specific operation steps:
step1, image preprocessing: processing the CT image in the LiTS data set according to the HU value range to increase the contrast, and then expanding the data set;
step2, acquiring the same dimension characteristic and the multi-scale characteristic: after the Step1 preprocessing operation, extracting image features by using a ResNeXt convolutional neural network, and obtaining convolution features of uniform dimensions and multi-scale features based on the convolution features through linear transformation;
step3, obtaining a global self-attention fusion characteristic: obtaining a self-attention fusion feature containing global information through a global self-attention module Non-Local by using the multi-scale feature obtained in Step2 so as to capture the relation between the target feature and the surrounding features;
and Step4, extracting the features of the self-attention fusion features obtained in Step3 through an improved convolution module, highlighting the effect of important semantic features in channel dimensions, and finally performing up-sampling to obtain a segmentation result.
2. The liver CT image segmentation method based on the fusion of the global self-attention and the multi-scale features as claimed in claim 1, wherein Step1 comprises the following specific steps:
Step1.1 the CT image in the LiTS dataset is processed according to the HU value range corresponding to the liver organ to increase contrast; processing is performed with CT values ranging from -130 HU to 230 HU, i.e., window width 360 HU and window level 50 HU, and then a normalization operation is performed on the processed CT image;
the Step1.2 data expansion adopts the modes of random horizontal turning, vertical turning, zooming and cutting to carry out data enhancement; after random expansion, the data are divided, wherein 82% is used as a training set, the rest 18% is used as a test set, and the training set is further divided into training data and verification data according to the proportion of 8.
3. The liver CT image segmentation method based on the fusion of the global self-attention and the multi-scale features as claimed in claim 1, wherein Step2 comprises the following specific steps:
Step2.1 after image preprocessing, the first five layers of a ResNeXt-101 network are used as feature extraction layers; the convolution in each ResNeXt block is divided into 32 paths, and the intermediate channel dimension processed by each path is 4; different paths are equivalent to different feature subspaces used for extracting different semantic features, while the convolution kernels of different paths have sparser relations, reducing the risk of overfitting;
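The "sparser relations" of the 32-path grouped convolution can be made concrete by comparing weight counts; the 128-channel stage below (32 paths × width 4) is a standard ResNeXt configuration used here for illustration, not a figure taken from the patent.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k×k convolution whose channels are split into
    `groups` independent paths (bias terms ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

# A 3×3 stage with cardinality 32 and per-path width 4 (32 × 4 = 128
# channels) versus the equivalent dense convolution:
dense = conv_params(128, 128, 3, groups=1)     # every input feeds every output
grouped = conv_params(128, 128, 3, groups=32)  # each path sees only 4 channels
```

The grouped form carries 32× fewer weights (4 608 vs. 147 456), since each output channel connects only to the 4 channels of its own path.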
Step2.2 unifies the channel dimensions of the Layer1-4 output results in the ResNeXt network structure to 64 through linear transformation, and upsamples the feature maps to the same size as Layer1; the four features are then concatenated, and the concatenated features are compressed to 64 channels through a 1×1 convolution to obtain the multi-scale features, whose channel number and feature map size are consistent with the dimensions of the features processed from Layer1-4.
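Step 2.2 can be sketched in NumPy as follows; 1×1 convolutions become channel-mixing matrix products and upsampling is nearest-neighbour. The channel widths 256-2048 are the standard ResNeXt-101 layer widths, the spatial sizes are a small hypothetical patch, and random matrices stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1×1 convolution as a channel mix: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def upsample_nn(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

# Hypothetical Layer1-4 outputs: spatial size halves, channels double.
feats = [rng.standard_normal((c, s, s))
         for c, s in [(256, 16), (512, 8), (1024, 4), (2048, 2)]]

# Project every map to 64 channels, upsample to the Layer1 size, concatenate.
unified = [upsample_nn(conv1x1(f, rng.standard_normal((64, f.shape[0]))),
                       16 // f.shape[1])
           for f in feats]
stacked = np.concatenate(unified, axis=0)                        # (4*64, 16, 16)
multi_scale = conv1x1(stacked, rng.standard_normal((64, 256)))   # back to 64 ch
```

The final `multi_scale` tensor has 64 channels at the Layer1 resolution, matching the claim that the fused feature keeps the unified channel number and map size.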
4. The liver CT image segmentation method based on the fusion of the global self-attention and the multi-scale features as claimed in claim 1, wherein Step3 comprises the following specific steps:
Step3.1 certain relations exist among the different organs in an abdominal CT image, and acquiring these relations can improve the liver organ segmentation effect; inspired by the idea in the non-local mean algorithm of calculating the correlation between the current position and other positions in the image, the multi-scale features obtained in Step2 are passed through three separate linear mappings to obtain the Key, Query and Value embedded-space features, each linear mapping being realized by a 1×1 convolution;
Step3.2 calculates the similarity between the features Key and Query; the correlation function follows the Gaussian form selected in the non-local mean, with the calculation formula:

f(x_i, x_j) = e^(θ(x_i)^T φ(x_j))

where x_i is the i-th position of the input feature map, j indexes all positions possibly related to i, and θ, φ are the Query and Key embeddings; the calculated similarity is weighted onto Value to obtain the self-attention feature;
Step3.3 obtains the self-attention weights from the attention features through a Softmax layer, thereby integrating the learned long-range dependencies into the output features; the overall calculation formula is:

y_i = (1 / C(x)) Σ_j f(x_i, x_j) g(x_j)
where C(x) is the Softmax normalization function, the function g linearly maps the representation of the input position j, typically through a 1×1 convolution, and the function f calculates the correlation between the input i-th position and the j-th position.
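A minimal NumPy sketch of the non-local block described in Steps 3.1-3.3: positions are flattened so the 1×1 convolutions producing Query, Key and Value become matrix products, and random matrices stand in for the learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def nonlocal_block(x, w_q, w_k, w_v):
    """y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j), with the Gaussian
    similarity f(x_i, x_j) = exp(q_i . k_j); dividing by C(x) is softmax."""
    q, k, g = x @ w_q, x @ w_k, x @ w_v           # Query, Key, Value embeddings
    scores = q @ k.T                               # pairwise q_i . k_j
    scores -= scores.max(axis=1, keepdims=True)    # for numerical stability
    f = np.exp(scores)
    attn = f / f.sum(axis=1, keepdims=True)        # softmax normalization C(x)
    return attn @ g                                # weight Values by similarity

x = rng.standard_normal((16, 64))                  # a 4×4 map, 64 ch, flattened
y = nonlocal_block(x, *(rng.standard_normal((64, 32)) for _ in range(3)))
```

Each output position is a similarity-weighted mixture over all 16 positions, which is how the block captures relations between the target feature and distant surrounding features.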
5. The liver CT image segmentation method based on the fusion of the global self-attention and the multi-scale features as claimed in claim 1, wherein Step4 comprises the following specific steps:
Step4.1 the fusion features containing multi-scale information and self-attention relations obtained in Step3 are further processed by the improved convolution module to extract their information; the multi-scale self-attention fusion feature first passes through a 1×1 convolution that maps the feature channels to a specified dimension, and then the sum of the features produced by a 1×1 convolution and a 3×3 convolution is obtained;
Step4.2 uses an attention module acting on the channel dimension to perform channel recalibration on the feature channels, and uses a residual path to fuse the original feature with the channel attention feature to obtain the output feature of the residual module; the specific calculation is as shown in the formula:
Y_MRA(X) = Y_CA(W_L X + W_E X) + X

where Y_MRA(X) denotes the multi-level residual attention convolution operation, X denotes the input feature, W_L is a 1×1 convolution matrix used for linear mapping of the original input and is equivalent to a residual path; W_E is a 3×3 convolution matrix used for feature extraction from the input features, and Y_CA denotes the channel attention operation;
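The MRA formula can be sketched in NumPy as follows. Two elements are assumptions, flagged in the comments: the channel attention Y_CA is written as a squeeze-and-excitation-style gate (the claim does not spell out its internals), and the 3×3 path W_E is reduced to a channel mix for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Assumed SE-style Y_CA: global average pool -> two linear layers -> gate."""
    s = x.mean(axis=(1, 2))                         # squeeze to one value per channel
    gate = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))    # excitation, weights in (0, 1)
    return x * gate[:, None, None]                  # recalibrate each channel

def mra_block(x, w_l, w_e, w1, w2):
    """Y_MRA(X) = Y_CA(W_L X + W_E X) + X; both conv paths are written as
    channel mixes here (the real W_E is a 3×3 convolution)."""
    mix = lambda w: np.tensordot(w, x, axes=([1], [0]))
    return channel_attention(mix(w_l) + mix(w_e), w1, w2) + x  # residual add of X

c = 64
x = rng.standard_normal((c, 8, 8))
y = mra_block(x, rng.standard_normal((c, c)), rng.standard_normal((c, c)),
              rng.standard_normal((c // 4, c)), rng.standard_normal((c, c // 4)))
```

The sigmoid gate only scales channels (it cannot flip their sign), which is how the module enhances or suppresses semantic features along the channel dimension, while the `+ x` term preserves the original input as a residual.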
the characteristics extracted by Step4.3 adopt a multi-path parallel idea of ensemble learning to obtain four groups of segmentation outputs, and the four groups of outputs are calculated and averaged to be used as a final output result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211064580.1A CN115457051A (en) | 2022-08-31 | 2022-08-31 | Liver CT image segmentation method based on global self-attention and multi-scale feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115457051A true CN115457051A (en) | 2022-12-09 |
Family
ID=84299992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211064580.1A Pending CN115457051A (en) | 2022-08-31 | 2022-08-31 | Liver CT image segmentation method based on global self-attention and multi-scale feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115457051A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115984574A (en) * | 2023-03-20 | 2023-04-18 | 北京航空航天大学 | Image information extraction model and method based on cyclic transform and application thereof |
CN116152278A (en) * | 2023-04-17 | 2023-05-23 | 杭州堃博生物科技有限公司 | Medical image segmentation method and device and nonvolatile storage medium |
CN116248959A (en) * | 2023-05-12 | 2023-06-09 | 深圳市橙视科技发展有限公司 | Network player fault detection method, device, equipment and storage medium |
CN116681958A (en) * | 2023-08-04 | 2023-09-01 | 首都医科大学附属北京妇产医院 | Fetal lung ultrasonic image maturity prediction method based on machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113077471B (en) | Medical image segmentation method based on U-shaped network | |
CN110930397B (en) | Magnetic resonance image segmentation method and device, terminal equipment and storage medium | |
CN115457051A (en) | Liver CT image segmentation method based on global self-attention and multi-scale feature fusion | |
CN112927255B (en) | Three-dimensional liver image semantic segmentation method based on context attention strategy | |
CN111784671A (en) | Pathological image focus region detection method based on multi-scale deep learning | |
CN112270666A (en) | Non-small cell lung cancer pathological section identification method based on deep convolutional neural network | |
CN111612008A (en) | Image segmentation method based on convolution network | |
CN110853011B (en) | Method for constructing convolutional neural network model for pulmonary nodule detection | |
CN113569724B (en) | Road extraction method and system based on attention mechanism and dilation convolution | |
CN115393584A (en) | Establishment method based on multi-task ultrasonic thyroid nodule segmentation and classification model, segmentation and classification method and computer equipment | |
KR20220144687A (en) | Dual attention multiple instance learning method | |
CN115457057A (en) | Multi-scale feature fusion gland segmentation method adopting deep supervision strategy | |
CN112750137A (en) | Liver tumor segmentation method and system based on deep learning | |
CN114511523B (en) | Gastric cancer molecular subtype classification method and device based on self-supervision learning | |
CN116363081A (en) | Placenta implantation MRI sign detection classification method and device based on deep neural network | |
CN116825363B (en) | Early lung adenocarcinoma pathological type prediction system based on fusion deep learning network | |
CN117611599B (en) | Blood vessel segmentation method and system integrating centre line diagram and contrast enhancement network | |
CN114693671A (en) | Lung nodule semi-automatic segmentation method, device, equipment and medium based on deep learning | |
CN112489062B (en) | Medical image segmentation method and system based on boundary and neighborhood guidance | |
CN117409201A (en) | MR medical image colorectal cancer segmentation method and system based on semi-supervised learning | |
CN116884597A (en) | Pathological image breast cancer molecular typing method and system based on self-supervision pre-training and multi-example learning | |
CN115131628A (en) | Mammary gland image classification method and equipment based on typing auxiliary information | |
CN111598144B (en) | Training method and device for image recognition model | |
Salunkhe et al. | Rapid tri-net: breast cancer classification from histology images using rapid tri-attention network | |
Sun et al. | DARMF-UNet: A dual-branch attention-guided refinement network with multi-scale features fusion U-Net for gland segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||