CN111325751A - CT image segmentation system based on attention convolution neural network - Google Patents
- Publication number
- CN111325751A (application CN202010190946.4A)
- Authority
- CN
- China
- Prior art keywords
- module
- attention
- convolution
- feature
- pooling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/12—Edge-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10072—Tomographic images
- G06T2207/10081—Computed x-ray tomography [CT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20172—Image enhancement details
- G06T2207/20192—Edge enhancement; Edge preservation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
Abstract
The invention provides a CT image segmentation system based on an attention convolutional neural network, comprising a feature encoding module, a semantic information extraction attention module, a feature fusion pooling attention module, and a feature map decoding module. The feature encoding module gradually reduces the size of the feature map of an input image using a parallel convolutional neural network, and extracts image semantic information and spatial information simultaneously through network-layer multiplexing and the interception and fusion of features at each layer. The semantic information extraction attention module generates attention features using pooling and further refines the semantic information features extracted by the feature encoding module. The feature fusion pooling attention module combines the refined semantic information features with the semantic and spatial information features concatenated in parallel by the feature encoding module to form an attention feature map. The feature map decoding module uses convolution and up-sampling modules to restore the attention feature map, step by step and finely, to the size of the input image. By fusing attention modules, the invention achieves efficient and accurate image segmentation.
Description
Technical Field
The invention relates to the technical field of image understanding, and in particular to a CT image segmentation system based on an attention convolutional neural network.
Background
Image segmentation is an important fundamental research problem in the field of computer vision, and medical image segmentation is one of its applications: it can accurately and rapidly locate large numbers of patient lesions in a short time. How to apply image segmentation techniques effectively to medical images has therefore become a major task for researchers.
Medical image segmentation classifies the semantic content of an image pixel by pixel by extracting medical image features. It must accurately locate each object and determine the class to which it belongs, and clearly delineate object boundaries so that objects of different classes can be distinguished.
At present, many medical image segmentation methods are in wide use at home and abroad, of which the traditional methods mainly include the following. Threshold-based segmentation is relatively simple to implement, but it is unsuitable for multi-channel images and for images whose feature values differ little; it is difficult to obtain accurate results when the image contains no obvious grey-level differences or when the grey-value ranges of the objects overlap substantially. Edge-based segmentation offers fast search and good edge localization, but it cannot recover good region structure, and at detection time there is a trade-off between noise resistance and detection precision. The active contour model method, also known as the Snake model, starts from an initial curve carrying an energy function; through energy minimization the curve gradually deforms and moves toward the contour of the target to be detected, finally converging to the target boundary to yield a smooth, continuous contour. The original Snake model has difficulty capturing concave target boundaries and is sensitive to the initial contour, among other shortcomings, so many improved methods followed.
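The limitation of threshold segmentation described above can be seen in a few lines of code. The sketch below (hypothetical values, plain Python for illustration only) binarizes a toy "CT slice" with a single global threshold; it works here only because the object and background grey ranges are disjoint, which is exactly the assumption that fails on real CT data with overlapping grey values.

```python
def threshold_segment(image, t):
    """Binary segmentation: label a pixel 1 (foreground) if its grey value exceeds t."""
    return [[1 if px > t else 0 for px in row] for row in image]

# A toy 4x4 "slice": a bright object (~200) on a dark background (~50).
slice_ = [
    [50, 52, 48, 51],
    [49, 200, 210, 50],
    [51, 205, 198, 49],
    [50, 48, 52, 50],
]
mask = threshold_segment(slice_, 128)  # clean separation: grey ranges do not overlap
```

If the object and background grey values overlapped around 128, no single threshold could separate them, which is the failure mode the text describes.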
In addition, neural-network-based segmentation methods have popularized end-to-end convolutional networks for semantic segmentation since Long et al. proposed the FCN (Fully Convolutional Networks) algorithm in 2014. FCN reuses a pretrained ImageNet network for the segmentation problem, uses deconvolution layers for up-sampling, and introduces skip connections to reduce the coarseness of the up-sampled output; nevertheless, FCN results still fall short of practical requirements. Although the skip structure improves accuracy, the model does not handle image edge information well, and in classifying pixels one by one FCN does not fully consider the relationships between pixels and thus lacks spatial consistency. Vijay et al. proposed the SegNet algorithm in 2015, which transfers the max-pooling indices to the decoder, improving segmentation resolution. In an FCN network a coarse segmentation map is generated by convolutional layers and some skip connections, and more skip connections are introduced to improve the result; however, FCN copies the full encoder features, while SegNet copies only the max-pooling indices, which makes SegNet more memory-efficient than FCN.
The U-Net proposed by Ronneberger et al. combines shallow and deep semantic information and segments medical images with an encoder-decoder architecture, but its feature extraction is limited. Yu et al. proposed dilated convolutions in 2016, which enlarge the receptive field exponentially without reducing spatial dimensions. In DeepLab, discussed next, dilated convolution is called atrous convolution. The last two pooling layers are removed from the pretrained classification network (here VGG, the Visual Geometry Group network) and the subsequent convolutional layers are replaced with dilated convolutions. DeepLabV2 and V3 use dilated convolution, implement Atrous Spatial Pyramid Pooling (ASPP) in the spatial dimension, and apply a fully connected conditional random field; dilated convolution enlarges the receptive field without increasing the number of parameters.
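The claim that dilated convolution enlarges the receptive field exponentially without adding parameters can be checked numerically. The helper below (an illustrative sketch, not code from the patent) applies the standard receptive-field recurrence r ← r + (k − 1) · d · j, where j is the product of the strides of earlier layers: three 3 × 3 layers with dilations 1, 2, 4 already see a 15 × 15 window, versus 7 × 7 without dilation, with identical parameter counts.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.
    Each layer is (kernel_size, stride, dilation): the field grows by
    (k - 1) * dilation * jump, and jump multiplies by the stride."""
    rf, jump = 1, 1
    for k, s, d in layers:
        rf += (k - 1) * d * jump
        jump *= s
    return rf

# Three 3x3 stride-1 layers, dilations doubling as in Yu et al.:
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])   # 15
ordinary = receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)])  # 7
```

Doubling the dilation at each layer keeps the receptive field growing geometrically while each layer still holds only 3 × 3 weights.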
Zhao et al. proposed PSPNet (Pyramid Scene Parsing Network) in 2017. The algorithm introduces a pyramid pooling module to aggregate background information and uses an auxiliary loss. Global scene classification matters because it provides clues to the distribution of the classes to be segmented, and the pyramid pooling module uses large-kernel pooling layers to capture this information. Like the dilated convolution methods mentioned above, PSPNet also improves the ResNet structure with dilated convolution and adds a pyramid pooling module that concatenates the ResNet feature map with the up-sampled outputs of parallel pooling layers, whose kernels cover the whole image, half of it, and small regions respectively.
Chen et al. again proposed the DeepLabV3+ model in 2018, applying a spatial pyramid pooling module and an encoder-decoder structure to deep neural networks for the semantic segmentation task. The former encodes multi-scale context information by probing the input features with filters or pooling operations at multiple rates and multiple effective fields of view, while the latter captures sharper object boundaries by gradually restoring spatial information. The algorithm combines the advantages of both, extending DeepLabV3 by adding a simple and efficient decoder module to refine the segmentation results, especially along object boundaries. By further exploring the Xception model and applying depthwise separable convolution to the ASPP and decoder modules, a faster and stronger encoder-decoder network is constructed, though at the cost of heavy computing-resource consumption. The pyramid structure, used as a module for semantic segmentation, integrates well, can easily be added to any neural network structure, and obtains excellent results in extracting context information. However, the pyramid structure also has shortcomings: it does not explain well which of the extracted information the network should actually value.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides a CT image segmentation system based on an attention convolutional neural network. Using a deep learning method and fused attention modules, it designs an accurate and efficient segmentation model, improving the execution efficiency of existing CT image segmentation methods and obtaining more accurate segmentation results.
In order to solve the technical problems, the invention adopts the following technical scheme:
a CT image segmentation system based on an attention convolution neural network comprises a feature coding module, a semantic information extraction attention module, a feature fusion pooling attention module and a feature graph code module; the feature coding module gradually reduces the size of a feature map of an input image by using a parallel convolution neural network, and realizes the simultaneous extraction of semantic information features and spatial information features of the image through network layer multiplexing and interception and fusion of features of each layer; the semantic information extraction attention module generates attention features by using pooling, and further refines and refines the semantic information features extracted by the feature coding module; the feature fusion pooling attention module is connected in parallel with the average pooling by using maximum pooling and average pooling, and combines semantic information features refined by the semantic information extraction attention module with semantic information and spatial information features spliced by the feature coding module to form an attention feature map; and the feature map decoding module gradually and finely restores the attention feature map fused by the feature fusion pooling attention module into the size of the input image by using a convolution module and an up-sampling module.
Compared with the prior art, the CT image segmentation system based on an attention convolutional neural network provided by the invention first gradually reduces the size of the feature map of the input image using a convolutional neural network, extracting rich semantic information features to optimize the classification task, while the network design reduces the loss caused by compressing spatial information features during semantic feature extraction. It then optimizes semantic information extraction with the semantic information extraction attention module. Next, the feature fusion pooling attention module combines the semantic information features refined by the semantic information extraction attention module with the semantic and spatial information features concatenated by the feature encoding module, and fuses them through pooling attention to obtain an attention feature map. Finally, the feature map decoding module performs up-sampling and convolution operations to restore the attention feature map, step by step and finely, to the size of the input image. Moreover, compared with current typical segmentation networks, the segmentation model provided by the invention adapts better to CT image data set segmentation.
Further, the feature encoding module comprises a first convolution module, a second convolution module, first to fourth bottleneck channels and a first concatenation operation module, arranged in sequence. The first convolution module comprises a convolutional layer followed by batch normalization; the second convolution module comprises a convolutional layer, batch normalization and a ReLU activation function in sequence. The first to fourth bottleneck channels are arranged in parallel: from the first channel to the fourth, the number of bottleneck layers per channel decreases, the output feature maps of the second to fourth channels are successively smaller than that of the first, and the number of channels of the feature map finally output by each bottleneck path increases with the number of layers. The semantic information features and spatial information features extracted by the four bottleneck channels are concatenated by the first concatenation operation module.
Further, the convolution kernel size of each convolutional layer is 3 × 3 and the stride is 2.
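The effect of a 3 × 3, stride-2 convolution on feature-map size follows the usual formula o = ⌊(i + 2p − k)/s⌋ + 1. The snippet below (assuming padding 1, which the text does not state explicitly, and an illustrative 512 × 512 input) shows that each such layer halves the spatial size, so the two stacked convolution modules reduce a 512 × 512 input to 128 × 128.

```python
def conv_out(i, k=3, s=2, p=1):
    """Spatial output size of a convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

after_first = conv_out(512)           # 256: one 3x3 stride-2 layer halves the size
after_second = conv_out(after_first)  # 128: the second convolution module halves it again
```

Halving twice before the bottleneck channels keeps the subsequent computation cheap, consistent with the stated goal of reducing the calculation amount.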
Further, the numbers of bottleneck layers in the first to fourth bottleneck channels are 4, 3, 2 and 1 respectively; the feature maps output by the second to fourth bottleneck channels are 1/2, 1/4 and 1/8 the size of that of the first; and the numbers of channels of the output feature maps of the first to fourth bottleneck channels are 128, 256, 512 and 1024 respectively.
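The stated size ratios and channel counts of the four bottleneck channels can be tabulated with a small helper (a bookkeeping sketch of the ratios above; the base spatial size of 64 is an assumed example, not a value from the text):

```python
def encoder_path_shapes(base_size):
    """(spatial size, channel count) at the output of each bottleneck channel,
    following the ratios 1, 1/2, 1/4, 1/8 and channels 128..1024 stated above."""
    ratios = [1, 2, 4, 8]
    channels = [128, 256, 512, 1024]
    return [(base_size // r, c) for r, c in zip(ratios, channels)]

shapes = encoder_path_shapes(64)
# First channel keeps full resolution with 128 channels; the fourth is 1/8 size with 1024 channels.
```

The pattern mirrors the usual encoder trade-off: as spatial resolution drops along a path, channel depth grows, so the shallow paths carry spatial detail while the deep path carries semantics.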
Further, each bottleneck layer comprises three convolution units, an addition unit and a ReLU activation function unit arranged in sequence; each convolution unit comprises a convolution kernel, batch normalization and a ReLU activation function arranged in sequence; and the addition unit is also connected by a skip connection to the feature map input to the convolution kernel of the first convolution unit.
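The bottleneck layer's skip connection is the standard residual pattern y = ReLU(F(x) + x). A minimal pure-Python sketch (operating on flat vectors rather than feature maps, with a stand-in transform, purely for illustration):

```python
def relu(v):
    """Elementwise ReLU on a flat vector."""
    return [max(0.0, x) for x in v]

def bottleneck_unit(x, transform):
    """Residual unit: y = ReLU(F(x) + x).
    The skip connection adds the input back, so the network can learn to
    fall back to (near-)identity when the transform is unhelpful."""
    fx = transform(x)
    return relu([a + b for a, b in zip(fx, x)])

# Stand-in for the three convolution units: a simple scaling.
y = bottleneck_unit([1.0, -2.0, 3.0], lambda v: [0.5 * t for t in v])
```

The addition before the final ReLU is what lets gradients flow directly through the skip path, which is the usual justification for this layout.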
Further, the semantic information extraction attention module comprises a first channel attention module, a second channel attention module, a global pooling module, a multiplication operation module and a second concatenation operation module. The first and second channel attention modules are arranged in parallel. Each channel attention module comprises, in sequence: global average pooling, which captures the contextual semantic feature information in the input feature map; a convolution module, which computes the semantic information weights; batch normalization and a Sigmoid activation function, which refine the semantic information extraction; and a multiplication operation, which multiplies the refined semantic information with the input feature map. The multiplication operation module multiplies the feature map output by the second channel attention module with the output feature map processed by the global pooling module, and the second concatenation operation module concatenates the feature map output by the first channel attention module with the output feature map of the multiplication operation module. The input feature maps of the two channel attention modules are obtained by connecting the semantic information features extracted by the feature encoding module.
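The per-channel attention computation described above — global average pooling followed by a Sigmoid gate that rescales each channel — can be sketched in pure Python. Note this sketch omits the learned convolution and batch normalization that produce the weights in the actual module; the Sigmoid is applied directly to the channel mean, purely to illustrate the gating mechanism.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap):
    """fmap: a list of channels, each a 2-D list of floats.
    Each channel is rescaled by sigmoid(its global average) — a simplified
    gate standing in for the module's learned conv + BN + Sigmoid weight."""
    out = []
    for ch in fmap:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        w = sigmoid(mean)                       # attention weight in (0, 1)
        out.append([[w * px for px in row] for row in ch])
    return out

# Two 2x2 channels: one strongly activated, one suppressed.
fm = [[[2.0, 2.0], [2.0, 2.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
gated = channel_attention(fm)
```

The effect is that channels with strong average response are passed through nearly unchanged while weak channels are attenuated, which is the "refining" role the module plays.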
Further, the feature fusion pooling attention module comprises a third convolution module, an average pooling path, a max pooling path and a two-path pooling multiplication operation module. The third convolution module extracts the mixed information features of the fused semantic and spatial information features while converting the channels of the information; the average pooling path and the max pooling path, arranged in parallel, each process the features extracted by the third convolution module; and the two-path pooling multiplication operation module multiplies the two processed features from the average pooling path and the max pooling path to form the attention feature map.
Further, the average pooling path processes the features with two serially connected average pooling modules as the first feature extraction path, and the max pooling path processes the features with two serially connected max pooling modules as the second feature extraction path.
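A toy version of the two pooling paths and their elementwise product (a pure-Python sketch on a single-channel 4 × 4 map with a single 2 × 2 pooling stage per path; the real module stacks two pooling modules per path and operates on multi-channel feature maps after the third convolution module):

```python
def pool2x2(grid, op):
    """Apply op to each non-overlapping 2x2 block (even dimensions assumed)."""
    out = []
    for i in range(0, len(grid), 2):
        row = []
        for j in range(0, len(grid[0]), 2):
            block = [grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1]]
            row.append(op(block))
        out.append(row)
    return out

def fused_pool_attention(grid):
    """Average pooling path and max pooling path in parallel,
    then elementwise multiplication of the two results."""
    avg = pool2x2(grid, lambda b: sum(b) / 4.0)
    mx = pool2x2(grid, max)
    return [[a * m for a, m in zip(ra, rm)] for ra, rm in zip(avg, mx)]

grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
att = fused_pool_attention(grid)
```

Multiplying the two pooled maps amplifies regions where both the average response and the peak response are strong, which is one plausible reading of why the module combines the two pooling statistics rather than using either alone.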
Further, the feature map decoding module comprises a first up-sampling module, a fourth convolution module, a second up-sampling module, a fifth convolution module and a sixth convolution module arranged in sequence. The feature maps output by the first up-sampling module and the fourth convolution module are the same size, and the feature maps output by the second up-sampling module, the fifth convolution module and the sixth convolution module are all the same size as the input image.
Further, the sampling coefficients of the first and second up-sampling modules are both 2.
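Up-sampling with coefficient 2 can be illustrated with nearest-neighbour interpolation (an assumption for this sketch — the text does not specify the interpolation mode), which duplicates each pixel into a 2 × 2 block so the feature map doubles in both dimensions:

```python
def upsample2x(grid):
    """Nearest-neighbour 2x up-sampling: each pixel becomes a 2x2 block."""
    out = []
    for row in grid:
        wide = [px for px in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                       # duplicate each row
    return out

up = upsample2x([[1, 2], [3, 4]])
```

Two such stages interleaved with convolutions, as in the decoding module, restore the attention feature map to the input-image size while the convolutions smooth the blocky artefacts that raw duplication introduces.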
Drawings
FIG. 1 is a schematic block diagram of a CT image segmentation system based on an attention convolution neural network according to the present invention.
Fig. 2 is a schematic structural diagram of the feature encoding module of fig. 1.
Fig. 3 is a schematic diagram of the structure of each bottleneck layer in the feature encoding module of fig. 2.
FIG. 4 is a block diagram of a channel attention module of the semantic information extraction attention module of FIG. 1.
FIG. 5 is a schematic diagram of the structure of the feature fusion pooling attention module of FIG. 1.
Fig. 6 is a schematic structural diagram of the feature map decoding module of fig. 1.
FIG. 7 is a graph illustrating the FCN and FEM training process.
FIG. 8 is a schematic diagram of an image comparison of pancreas segmentation test results provided by the present invention.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further explained below with reference to the specific drawings.
Referring to fig. 1, the present invention provides a CT image segmentation system based on an attention convolutional neural network, which comprises a feature encoding module, a semantic information extraction attention module, a feature fusion pooling attention module and a feature map decoding module. The feature encoding module gradually reduces the size of the feature map of an input image using a parallel convolutional neural network, and extracts the semantic information features and spatial information features of the image simultaneously through network-layer multiplexing and the interception and fusion of features at each layer. The semantic information extraction attention module generates attention features using pooling and further refines the semantic information features extracted by the feature encoding module. The feature fusion pooling attention module places max pooling and average pooling in parallel, and combines the semantic information features refined by the semantic information extraction attention module with the semantic and spatial information features concatenated by the feature encoding module to form an attention feature map. The feature map decoding module uses convolution and up-sampling modules to restore the attention feature map, step by step and finely, to the size of the input image.
Compared with the prior art, the CT image segmentation system based on an attention convolutional neural network provided by the invention first gradually reduces the size of the feature map of the input image using a convolutional neural network, extracting rich semantic information features to optimize the classification task, while the network design reduces the loss caused by compressing spatial information features during semantic feature extraction. It then optimizes semantic information extraction with the semantic information extraction attention module. Next, the feature fusion pooling attention module combines the semantic information features refined by the semantic information extraction attention module with the semantic and spatial information features concatenated by the feature encoding module, and fuses them through pooling attention to obtain an attention feature map. Finally, the feature map decoding module performs up-sampling and convolution operations to restore the attention feature map, step by step and finely, to the size of the input image. Moreover, compared with current typical segmentation networks, the segmentation model provided by the invention adapts better to CT image data set segmentation.
Specifically, the design background of the feature encoding module is as follows. As is well known, for semantic segmentation tasks spatial information and semantic information are equally important. The traditional deep learning approach uses serially stacked convolutions, reducing the feature map size step by step through convolution and pooling to extract semantic and spatial information — for example FCN, SegNet, U-Net and DeepLab. However, spatial information is inevitably lost as the feature map shrinks, so many models improve on this point: DeepLabV3 and PSPNet extract spatial information using pyramid pooling and dilated convolution; BiSeNet extracts spatial features by adding a separate shallow path; DenseASPP minimizes the loss of spatial features using a dense connection structure; and PAN adds attention modules at the tail and middle of the backbone network to strengthen the network's spatial feature extraction. Yet over-emphasizing spatial information prevents very accurate semantic information from being obtained, creating a dilemma. The invention designs a network that performs the two complex tasks of semantic information extraction and spatial information extraction simultaneously: with only a small increase in network parameters, spatial and semantic information features are extracted at the same time through network-layer multiplexing and the interception and fusion of features at each layer, without additional loss.
As a specific embodiment, please refer to fig. 2. The feature encoding module comprises a first convolution module, a second convolution module, first to fourth bottleneck channels and a first concatenation (concat) operation module, arranged in sequence. The first convolution module comprises a convolutional layer (Conv) followed by batch normalization (BN); the second convolution module comprises a convolutional layer (Conv), batch normalization (BN) and a ReLU activation function in sequence. The first to fourth bottleneck channels are arranged in parallel: from the first channel to the fourth, the number of bottleneck layers (Bottleneck) per channel decreases, the output feature maps of the second to fourth channels are successively smaller than that of the first, and the number of channels of the feature map finally output by each bottleneck path increases with the number of layers. The semantic information features and spatial information features extracted by the four bottleneck channels are concatenated by the first concatenation (concat) operation module. In the design of the feature encoding module provided by this embodiment, the traditional serial convolution scheme is replaced by a parallel scheme that extracts semantic and spatial information features simultaneously. The bottleneck layers are arranged as four parallel paths when the network is designed: spatial information features are retained because the feature map size does not change along each path; the combination of multi-scale feature maps is achieved because the feature map sizes of the channels differ; and because the feature map size of each successive path is gradually reduced, semantic information features are extracted at the top layer of each path.
As a preferred embodiment, referring to fig. 2, the convolution kernel size of the convolution layers is 3 × 3 with a stride of 2, so that the first and second convolution modules shrink the feature map of the input image and reduce the amount of computation.
As a preferred embodiment, please refer to fig. 2, the number of the bottleneck layers in the first to fourth bottleneck paths is 4, 3, 2, 1, respectively, the sizes of the feature maps output by the second to fourth bottleneck paths are 1/2, 1/4, 1/8, respectively, compared to the first bottleneck path, and the number of channels of the output feature maps in the first to fourth bottleneck paths is 128, 256, 512, 1024, respectively, thereby better extracting the semantic information features and the spatial information features at the same time.
As a specific embodiment, referring to fig. 3, each bottleneck layer includes three convolution units, an addition unit (Add) and a ReLU activation function unit arranged in sequence; each convolution unit includes a convolution kernel (Conv2D), batch normalization (BN) and a ReLU activation function arranged in sequence. The addition unit is also skip-connected to the feature map that is input to the first convolution unit, so that with the skip connection and the ReLU activation added to the convolution layers, the network can learn to select its own path through the convolutional neural network, further improving accuracy.
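The bottleneck layer just described can be sketched roughly as follows. The Conv-BN-ReLU units, the Add of the block input (skip connection) and the final ReLU follow the text; the 1x1-3x3-1x1 layout and the channel widths are assumptions.

```python
# Hedged sketch of one bottleneck layer: three Conv2D-BN-ReLU units,
# an Add of the block input (skip connection), then a final ReLU.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        def unit(c_in, c_out, k):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        self.body = nn.Sequential(
            unit(channels, channels // 2, 1),       # reduce channels
            unit(channels // 2, channels // 2, 3),  # spatial mixing
            unit(channels // 2, channels, 1),       # restore channels
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # Add the skip, then ReLU
```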
Specifically, for the semantic information features, the invention designs a dedicated Semantic Information Extraction Attention Module (SIEAM) for this task. As a specific embodiment, referring to fig. 1 and 4, the semantic information extraction attention module includes a first channel attention module, a second channel attention module, a global pooling module, a multiplication operation module and a second splicing operation module. The first and second channel attention modules are arranged in parallel, and each includes: a global average pooling that captures context semantic feature information in the input feature map; a convolution (Conv2D) that computes semantic information weights; batch normalization (BN) and a Sigmoid activation function that refine the semantic information after the convolution; and a multiplication (Mul) operation that multiplies the refined semantic information with the input feature map, so that the refined map acts as a weight on the input and accomplishes the task of sharpening the semantic information. The multiplication (Mul) operation module multiplies the feature map output by the second channel attention module with the output feature map processed by the global pooling module, and the second splicing (concat) operation module splices the feature map output by the first channel attention module with the output of the multiplication operation module. The input feature maps of the two channel attention modules are taken from the semantic information features extracted by the feature encoding module: as shown in fig. 2, the leftmost bottleneck layer (Bottleneck) and the upper bottleneck layer second from the left are rich in semantic information features, so the two channel attention modules of the SIEAM are connected to these two bottleneck layers in one-to-one correspondence: the leftmost bottleneck layer connects to the second channel attention module, and the second-from-left upper bottleneck layer connects to the first channel attention module. The semantic information features extracted by the two bottleneck layers thus serve as the input feature maps of the two channel attention modules; after being refined by the semantic information extraction attention module, they are sent to the feature fusion pooling attention module for integration. In this way the SIEAM integrates a large amount of global context semantic information at only a small additional computational cost.
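One channel-attention branch of the SIEAM can be sketched as below. The sequence of global average pooling, weight convolution, BN, Sigmoid and the Mul against the input follows the text; the exact layer sizes are assumptions.

```python
# Hedged sketch of one SIEAM channel-attention branch: global average
# pooling -> 1x1 convolution (semantic weights) -> BN -> Sigmoid, and
# a Mul that applies the refined weights to the input feature map.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # capture global context
            nn.Conv2d(channels, channels, 1),  # semantic-weight conv
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),                      # refine into (0, 1)
        )

    def forward(self, x):
        return x * self.weight(x)  # Mul: weights rescale the input map
```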
Specifically, the design background of the feature fusion pooling attention module is as follows. Although the feature encoding module fully extracts the spatial information of the image features, and the semantic information extraction attention module extracts refined semantic information, the two kinds of information are not yet matched, and a module is needed to integrate them properly rather than fusing them crudely. The invention therefore provides a Feature Fusion Pooling Attention Module (FFPAM): semantic information features and spatial information features are fused by the FFPAM and applied to the feature map as attention information, so that context semantic information and spatial information are fully fused and segmentation precision is improved.
As a specific embodiment, referring to fig. 5, the feature fusion pooling attention module includes a third convolution module (a Conv2D-BN-ReLU unit) that extracts the mixed features of the fused semantic and spatial information while converting the number of channels, an average pooling path, a maximum pooling path, and a two-path pooling multiplication operation module. The average pooling path and the maximum pooling path are arranged in parallel and each processes the features extracted by the third convolution module; the two-path pooling multiplication operation module multiplies the two processed feature maps to form an attention feature map. By fusing the spatial and semantic information features through the two parallel pooling paths, the invention enlarges the receptive field of the model and strengthens its feature extraction capability, and the attention feature map formed by multiplying the two paths carries the characteristics of both average pooling and maximum pooling. This attention map is multiplied with the input feature map and superimposed on it as a weight, and finally a skip-connection structure as in ResNet is used, which reduces any negative influence of the attention module on the input feature map before the final feature map is output.
The feature fusion pooling attention module of this embodiment successfully combines context semantic information and image spatial information through the multiplication of the two routes, yielding higher precision. To verify the effectiveness of average pooling and maximum pooling, the invention tested five configurations: single-path maximum pooling, single-path average pooling, two-path pooling addition, two-path pooling concatenation, and two-path pooling multiplication. Experiments confirm that two-path pooling multiplication indeed gives the best precision; this module alone improves the Dice similarity score by 2.71%.
As a preferred embodiment, referring to fig. 5, the average pooling path processes features as the first feature extraction path using two serially connected average pooling modules (each an AvgPool-Conv2D-ReLU unit): the output of the ReLU activation in the second average pooling module is multiplied with the input feature map of the path, and the resulting product is added to that input feature map to give the final output of the path. The maximum pooling path processes features as the second feature extraction path using two serially connected maximum pooling modules (each a MaxPool-Conv2D-ReLU unit) in the same way: the output of the ReLU activation in the second maximum pooling module is multiplied with the input feature map of the path, and the product is added to that input to give the final output of the path. Finally, the outputs of the two paths are multiplied with the features extracted by the third convolution module (i.e. the output of its ReLU activation), the product is added (Add) to the features extracted by the third convolution module, and the result passes through a ReLU activation function to form the attention feature map.
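The two pooling paths and their combination can be sketched as below. The two serial (pool -> Conv2D -> ReLU) units per path, the multiply-then-add within each path, and the final two-path multiply -> Add -> ReLU follow the text; stride-1 pooling with padding is an assumption made so the element-wise Mul and Add against the path input stay shape-compatible.

```python
# Hedged sketch of the FFPAM pooling paths (assumed stride-1 pooling).
import torch
import torch.nn as nn

def pool_unit(channels, pool_cls):
    return nn.Sequential(
        pool_cls(kernel_size=3, stride=1, padding=1),  # size-preserving
        nn.Conv2d(channels, channels, 1),
        nn.ReLU(inplace=True),
    )

class PoolingPath(nn.Module):
    def __init__(self, channels, pool_cls):
        super().__init__()
        self.units = nn.Sequential(pool_unit(channels, pool_cls),
                                   pool_unit(channels, pool_cls))

    def forward(self, x):
        return x * self.units(x) + x  # weight the input, residual add

def ffpam_attention(feat):
    # feat: features from the third convolution module
    c = feat.shape[1]
    avg = PoolingPath(c, nn.AvgPool2d)(feat)
    mx = PoolingPath(c, nn.MaxPool2d)(feat)
    return torch.relu(avg * mx * feat + feat)  # multiply, Add, ReLU
```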
As a specific embodiment, referring to fig. 6, the feature map decoding module includes, arranged in sequence, a first upsampling module (Upsample), a fourth convolution module (a Conv-BN-ReLU unit), a second upsampling module (Upsample), a fifth convolution module (a Conv-BN-ReLU unit), and a sixth convolution module (a Conv-BN-ReLU unit). The feature maps output by the first upsampling module and the fourth convolution module have the same size (e.g. 96, 128), and the feature maps output by the second upsampling module, the fifth convolution module and the sixth convolution module (e.g. 192, 256) all have the same size as the input image. In this embodiment the three convolution modules refine the upsampled information, refining the segmentation result step by step and ultimately improving precision.
As a specific embodiment, the sampling coefficient of the first and second upsampling modules is 2. Specifically, the existing bilinear interpolation method can be used, i.e. 2x upsampling by bilinear interpolation, while the following convolution module refines away part of the spatial information loss that bilinear upsampling introduces, thereby reducing the spatial information lost in sampling.
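One decoder step, as just described, can be sketched as a 2x bilinear upsample followed by a refining Conv-BN-ReLU module; the channel counts in the sketch are assumptions.

```python
# Hedged sketch of one decoder step: 2x bilinear upsampling followed
# by a Conv-BN-ReLU module that refines the interpolation result.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpRefine(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)  # 2x bilinear upsample
        return self.refine(x)  # convolution refines interpolation loss
```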
When designing the CT image (e.g. pancreas image) segmentation system model provided by the present invention, a data set must first be prepared and preprocessed into the input required by the model, which also improves the robustness of the model. Specifically, the data preprocessing includes processing each slice and clipping all pixel values greater than 240 to 240 and all values less than -100 to -100, according to the formulas:
imagePixel[imagePixel < low_range] = low_range
imagePixel[imagePixel > high_range] = high_range
wherein imagePixel is an image pixel, low_range is -100, and high_range is 240. Each slice is then normalized so that its pixel intensities are mapped into (-1, 1).
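The clipping and normalisation above can be written as a NumPy sketch. The (-100, 240) window comes from the text; mapping that window linearly onto (-1, 1) is an assumption about the exact normalisation used.

```python
# Hedged preprocessing sketch: clip to the stated window, then map
# the window linearly onto [-1, 1].
import numpy as np

LOW_RANGE, HIGH_RANGE = -100.0, 240.0

def preprocess(slice_hu: np.ndarray) -> np.ndarray:
    clipped = np.clip(slice_hu, LOW_RANGE, HIGH_RANGE)  # window pixels
    # map [LOW_RANGE, HIGH_RANGE] linearly onto [-1, 1]
    return 2.0 * (clipped - LOW_RANGE) / (HIGH_RANGE - LOW_RANGE) - 1.0
```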
The data set preparation includes: adopting the NIH pancreas segmentation dataset and using 4-fold cross-validation, the data set is divided into three parts, namely a training set, a validation set and a test set. The training and validation sets together contain 62 samples, and the test set contains 20 samples. During training an Adam optimizer is used, with the initial learning rate set to 10^-5; every 10 epochs (one epoch being one pass over all samples in the training set) the learning rate decays by a factor of 0.2, for a total of 100 epochs in the experiment. The results show that training on the medical images from scratch achieves better performance and shorter training time than fine-tuning a model pre-trained on natural images.
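The stated step-decay schedule can be written as a small pure-Python sketch; reading "decayed by 0.2" as a multiplicative factor applied every 10 epochs is an assumption, matching common step-decay conventions.

```python
# Hedged sketch of the schedule: lr = 1e-5 * 0.2 ** (epoch // 10).
def learning_rate(epoch: int, base_lr: float = 1e-5,
                  gamma: float = 0.2, step: int = 10) -> float:
    # the rate is constant within each 10-epoch window
    return base_lr * gamma ** (epoch // step)
```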
Compared with the prior art, the CT image segmentation system based on the attention convolution neural network has the following advantages:
First, in the feature encoding module, an experiment was run against FCN as the backbone baseline. With strategies such as learning-rate decay, parameter initialization, input regularization and overfitting prevention, and the training process repeated for 100 epochs, the object segmentation in the scheme images of the proposed system reaches a high Dice value. Because semantic information and spatial information are considered simultaneously, convergence is very fast and the loss value is lower than the baseline FCN, which is likewise reflected in a Dice value higher than FCN's.
Second, the cross-parallel network used in the invention learns more features than FCN when extracting image information. As shown in table 1 below, while its parameter count is far smaller than that of FCN with a VGG16 backbone, the network of the invention scores much higher than FCN in precision, recall and Dice score, which proves the effectiveness of the feature encoding module used in the invention.
TABLE 1
Model | Average dice% | Maximum dice% | Minimum dice% | Precision | Recall | Parameters |
FCN | 69.02±6.3 | 76.14 | 49.48 | 0.7092 | 0.6754 | 134.3M |
FEM | 78.93±5.6 | 86.54 | 65.15 | 0.8339 | 0.7543 | 16.15M |
Third, for the feature fusion pooling attention module, as shown in table 2 below, experiments were run first with a single path and then with the average pooling path multiplied by the maximum pooling path. The multiplication greatly improves every index, all of which are higher than in the preceding configuration; multiplying the results of the two paths successfully combines the context semantic information with the image spatial information and brings high precision.
TABLE 2
Fourth, as shown in table 3 below, the framework used in the invention achieves a large increase in Dice value while its parameter count is much smaller than those of the current typical networks FCN and U-Net.
TABLE 3
Model | Backbone | Dice% | Parameters |
FCN | VGG16 | 80.3 | 134.3M |
U-Net | VGG16 | 79.7 | 23.3M |
BiSeNet | XceptionV1 | 82.8 | 44.8M |
Framework used in this system | FEM | 86.6 | 18.9M |
Fifth, as shown in table 4 below, the invention is compared with current typical networks to observe each model's adaptability to the pancreatic CT dataset. Of the 82 available samples, most models use a 62/20 training/test split, with #Folds being the number of folds for cross-validation; it can be seen that the system model of the invention scores higher than these current typical models.
TABLE 4
Sixth, the ablation experiment for each module uses the same 20 samples as the test set as in the previous experiments, and the precision, recall and Dice value are then measured on those 20 samples. As shown in table 5 below, Base + Decoder + ARM + GAM is much higher than the others in recall and Dice value, with only its precision slightly lower than that of Base + Decoder + ARM, which verifies the validity of stacking all the modules.
TABLE 5
Model | Average dice% | Maximum dice% | Minimum dice% | Precision | Recall | Parameters |
FCN (baseline) | 69.02±6.3 | 76.14 | 49.48 | 0.7092 | 0.6754 | 134.3M |
FEM+FDM | 82.81±4.2 | 88.54 | 74.07 | 0.8477 | 0.8115 | 16.15M |
FEM+FDM+SIEAM | 83.91±4.4 | 89.70 | 73.89 | 0.8726 | 0.8106 | 18.96M |
FEM+FDM+SIEAM+FFPAM | 86.62±3.6 | 91.31 | 78.91 | 0.8607 | 0.8737 | 19.8M |
Referring to fig. 8, row 1 shows the images before segmentation, row 2 the ground-truth labels (GT), row 3 the test results of FCN segmentation, row 4 the test results of U-Net segmentation, row 5 the test results of FEM + FDM segmentation, and row 6 the test results of FEM + FDM + SIEAM + FFPAM, i.e. the final algorithm. As the figure shows, because FCN directly upsamples the small segmented feature map with a transposed convolution, its results lack edge smoothness and present a mosaic-like segmentation. U-Net, with its gentler upsampling, smooths away FCN's hard edges well, but it generates many small extra fragments in the detailed segmentation, although no such small fragments appear on the 2nd, 3rd and 4th segmentation prediction maps of row 4. In row 5, the FEM + FDM used by the invention effectively preserves the spatial and semantic information of the image, so the fragments produced during U-Net's segmentation are effectively reduced and the whole picture becomes clean; in the detailed segmentation, however, some shortcomings remain: for example, the folds of the pancreas are not effectively segmented, in one case the pancreas is not segmented at all, and in another too much of the pancreatic area is segmented. On this basis the invention adds the two attention modules, focusing on resolving these detail defects. In row 6, the final model used by the invention effectively resolves the fragmentation around the segmented target, is more complete in the detail regions than FEM + FDM, and is closer to GT overall.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.
Claims (10)
1. A CT image segmentation system based on an attention convolutional neural network, characterized by comprising a feature encoding module, a semantic information extraction attention module, a feature fusion pooling attention module and a feature map decoding module; the feature encoding module gradually reduces the size of the feature map of an input image using a parallel convolutional neural network, and extracts semantic information features and spatial information features of the image simultaneously through network layer multiplexing and the interception and fusion of the features of each layer; the semantic information extraction attention module generates attention features using pooling, and further refines the semantic information features extracted by the feature encoding module; the feature fusion pooling attention module uses maximum pooling and average pooling connected in parallel, and combines the semantic information features refined by the semantic information extraction attention module with the semantic and spatial information features spliced by the feature encoding module to form an attention feature map; and the feature map decoding module gradually and finely restores the attention feature map fused by the feature fusion pooling attention module to the size of the input image using convolution modules and upsampling modules.
2. The CT image segmentation system based on the attention convolutional neural network according to claim 1, characterized in that the feature encoding module comprises a first convolution module, a second convolution module, first to fourth bottleneck paths and a first splicing operation module; the first convolution module comprises a convolution layer and batch regularization arranged in sequence; the second convolution module comprises a convolution layer, batch regularization and a ReLu activation function arranged in sequence; the first to fourth bottleneck paths are arranged in parallel, the number of bottleneck layers in each path decreasing from the first to the fourth bottleneck path, the sizes of the feature maps output by the second to fourth bottleneck paths decreasing successively compared with the first bottleneck path, and the number of channels of the feature map finally output by each bottleneck path increasing with the number of layers; and the first splicing operation module splices the semantic information features and spatial information features extracted by the four bottleneck paths.
3. The attention convolution neural network-based CT image segmentation system of claim 2, wherein the convolution kernel size of the convolution layer is 3 × 3 with a step size of 2.
4. The attention convolution neural network-based CT image segmentation system according to claim 2, wherein the number of bottleneck layers in the first to fourth bottleneck paths is 4, 3, 2, 1, respectively, and the sizes of feature maps output by the second to fourth bottleneck paths compared with the first bottleneck path are 1/2, 1/4, 1/8, respectively, and the number of channels of output feature maps in the first to fourth bottleneck paths is 128, 256, 512 and 1024, respectively.
5. The attention convolution neural network-based CT image segmentation system according to claim 2, wherein each bottleneck layer comprises three convolution units, an addition unit and a ReLu activation function unit which are sequentially arranged, each convolution unit comprises a convolution kernel, a batch regularization and a ReLu activation function which are sequentially arranged, and the addition unit is also in jump connection with a feature map in the convolution kernel input to the first convolution unit.
6. The CT image segmentation system based on the attention convolution neural network according to claim 1, characterized in that the semantic information extraction attention module comprises a first channel attention module, a second channel attention module, a global pooling module, a multiplication operation module and a second splicing operation module; the first channel attention module and the second channel attention module are arranged in parallel, and each of the channel attention modules comprises a global average pooling for capturing context semantic feature information in an input feature map, a convolution for calculating semantic information weights, a batch regularization and Sigmoid activation function for refining semantic information extraction, and a multiplication operation for multiplying the refined semantic information with the input feature map; the multiplication operation module is used for multiplying the feature map output by the second channel attention module with an output feature map processed by the global pooling module; the second splicing operation module is used for splicing the feature map output by the first channel attention module and the output feature map of the multiplication operation module; and the input feature maps of the two channel attention modules are obtained from the semantic information features extracted by the feature encoding module.
7. The CT image segmentation system based on the attention convolutional neural network of claim 1, wherein the feature fusion pooling attention module comprises a third convolution module, an average pooling path, a maximum pooling path and a two-way pooling multiplication operation module, the third convolution module is used for extracting mixed information features of the fused semantic information features and spatial information features and simultaneously converting channels of information, the average pooling path and the maximum pooling path are arranged in parallel and are respectively used for processing the features extracted by the third convolution module, and the two-way pooling multiplication operation module is used for multiplying the two paths of features processed by the average pooling path and the maximum pooling path to form an attention feature map.
8. The attention convolution neural network-based CT image segmentation system of claim 7, wherein the average pooling pass uses two serially connected average pooling modules to process features as a first pass of feature extraction and the max pooling pass uses two serially connected max pooling modules to process features as a second pass of feature extraction.
9. The CT image segmentation system based on the attention convolution neural network according to claim 1, characterized in that the feature map decoding module comprises a first upsampling module, a fourth convolution module, a second upsampling module, a fifth convolution module and a sixth convolution module arranged in sequence; the feature maps output by the first upsampling module and the fourth convolution module have the same size, and the feature maps output by the second upsampling module, the fifth convolution module and the sixth convolution module all have the same size as the input image.
10. The attention convolution neural network-based CT image segmentation system of claim 9, wherein a sampling coefficient of the first and second upsampling modules is 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010190946.4A CN111325751B (en) | 2020-03-18 | 2020-03-18 | CT image segmentation system based on attention convolution neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010190946.4A CN111325751B (en) | 2020-03-18 | 2020-03-18 | CT image segmentation system based on attention convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111325751A true CN111325751A (en) | 2020-06-23 |
CN111325751B CN111325751B (en) | 2022-05-27 |
Family
ID=71171544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010190946.4A Expired - Fee Related CN111325751B (en) | 2020-03-18 | 2020-03-18 | CT image segmentation system based on attention convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111325751B (en) |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111798428A (en) * | 2020-07-03 | 2020-10-20 | 南京信息工程大学 | Automatic segmentation method for multiple tissues of skin pathological image |
CN111914947A (en) * | 2020-08-20 | 2020-11-10 | 华侨大学 | Image instance segmentation method, device and equipment based on feature fusion and storage medium |
CN112085741A (en) * | 2020-09-04 | 2020-12-15 | 厦门大学 | Stomach cancer pathological section segmentation algorithm based on deep learning |
CN112085760A (en) * | 2020-09-04 | 2020-12-15 | 厦门大学 | Prospect segmentation method of laparoscopic surgery video |
CN112084911A (en) * | 2020-08-28 | 2020-12-15 | 安徽清新互联信息科技有限公司 | Human face feature point positioning method and system based on global attention |
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Semantic segmentation method of attention mechanism based on deep learning |
CN112365480A (en) * | 2020-11-13 | 2021-02-12 | 哈尔滨市科佳通用机电股份有限公司 | Brake pad loss fault identification method for brake clamp device |
CN112446891A (en) * | 2020-10-23 | 2021-03-05 | 浙江工业大学 | Medical image segmentation method based on U-Net network brain glioma |
CN112509052A (en) * | 2020-12-22 | 2021-03-16 | 苏州超云生命智能产业研究院有限公司 | Method and device for detecting fovea maculata, computer equipment and storage medium |
CN112580654A (en) * | 2020-12-25 | 2021-03-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Semantic segmentation method for ground objects of remote sensing image |
CN112598650A (en) * | 2020-12-24 | 2021-04-02 | 苏州大学 | Combined segmentation method for optic cup optic disk in fundus medical image |
CN112767502A (en) * | 2021-01-08 | 2021-05-07 | 广东中科天机医疗装备有限公司 | Image processing method and device based on medical image model |
CN112927255A (en) * | 2021-02-22 | 2021-06-08 | 武汉科技大学 | Three-dimensional liver image semantic segmentation method based on context attention strategy |
CN113033572A (en) * | 2021-04-23 | 2021-06-25 | 上海海事大学 | Obstacle segmentation network based on USV and generation method thereof |
CN113065412A (en) * | 2021-03-12 | 2021-07-02 | 武汉大学 | Improved Deeplabv3+ based aerial image electromagnetic medium semantic recognition method and device |
CN113112465A (en) * | 2021-03-31 | 2021-07-13 | 上海深至信息科技有限公司 | System and method for generating carotid intima-media segmentation model |
CN113129321A (en) * | 2021-04-20 | 2021-07-16 | 重庆邮电大学 | Turbine blade CT image segmentation method based on full convolution neural network |
CN113158802A (en) * | 2021-03-22 | 2021-07-23 | 安徽理工大学 | Smart scene segmentation technique |
CN113298174A (en) * | 2021-06-10 | 2021-08-24 | 东南大学 | Semantic segmentation model improvement method based on progressive feature fusion |
CN113298825A (en) * | 2021-06-09 | 2021-08-24 | 东北大学 | Image segmentation method based on MSF-Net network |
CN113361537A (en) * | 2021-07-23 | 2021-09-07 | 人民网股份有限公司 | Image semantic segmentation method and device based on channel attention |
CN113378791A (en) * | 2021-07-09 | 2021-09-10 | 合肥工业大学 | Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion |
CN113436094A (en) * | 2021-06-24 | 2021-09-24 | 湖南大学 | Gray level image automatic coloring method based on multi-view attention mechanism |
CN113610164A (en) * | 2021-08-10 | 2021-11-05 | 北京邮电大学 | Fine-grained image recognition method and system based on attention balance |
CN113689326A (en) * | 2021-08-06 | 2021-11-23 | 西南科技大学 | Three-dimensional positioning method based on two-dimensional image segmentation guidance |
CN113689434A (en) * | 2021-07-14 | 2021-11-23 | 淮阴工学院 | Image semantic segmentation method based on strip pooling |
CN113744279A (en) * | 2021-06-09 | 2021-12-03 | 东北大学 | Image segmentation method based on FAF-Net network |
CN114038037A (en) * | 2021-11-09 | 2022-02-11 | 合肥工业大学 | Expression label correction and identification method based on separable residual attention network |
CN114049315A (en) * | 2021-10-29 | 2022-02-15 | 北京长木谷医疗科技有限公司 | Joint recognition method, electronic device, storage medium, and computer program product |
CN114638256A (en) * | 2022-02-22 | 2022-06-17 | 合肥华威自动化有限公司 | Transformer fault detection method and system based on sound wave signals and attention network |
CN116229065A (en) * | 2023-02-14 | 2023-06-06 | 湖南大学 | Multi-branch fusion-based robotic surgical instrument segmentation method |
CN116630626A (en) * | 2023-06-05 | 2023-08-22 | 吉林农业科技学院 | Connected double-attention multi-scale fusion semantic segmentation network |
CN116630626B (en) * | 2023-06-05 | 2024-04-26 | 吉林农业科技学院 | Connected double-attention multi-scale fusion semantic segmentation network |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN107506774A (en) * | 2017-10-09 | 2017-12-22 | 深圳市唯特视科技有限公司 | A kind of segmentation layered perception neural networks method based on local attention mask |
US20180144208A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Adaptive attention model for image captioning |
CN110211127A (en) * | 2019-08-01 | 2019-09-06 | 成都考拉悠然科技有限公司 | Image segmentation method based on bicoherence network |
US10482603B1 (en) * | 2019-06-25 | 2019-11-19 | Artificial Intelligence, Ltd. | Medical image segmentation using an integrated edge guidance module and object segmentation network |
CN110490891A (en) * | 2019-08-23 | 2019-11-22 | 杭州依图医疗技术有限公司 | Method, device and computer-readable storage medium for segmenting an object of interest in an image |
CN110490813A (en) * | 2019-07-05 | 2019-11-22 | 特斯联(北京)科技有限公司 | Feature map enhancement method, apparatus, device and medium for convolutional neural networks |
CN110532955A (en) * | 2019-08-30 | 2019-12-03 | 中国科学院宁波材料技术与工程研究所 | Instance segmentation method and device based on feature attention and sub-upsampling |
US20200027211A1 (en) * | 2018-07-17 | 2020-01-23 | International Business Machines Corporation | Knockout Autoencoder for Detecting Anomalies in Biomedical Images |
US20200065969A1 (en) * | 2018-08-27 | 2020-02-27 | Siemens Healthcare Gmbh | Medical image segmentation from raw data using a deep attention neural network |
- 2020-03-18: CN application CN202010190946.4A filed; granted as CN111325751B; status: not active (Expired - Fee Related)
Non-Patent Citations (3)
Title |
---|
Changqian Yu et al., "BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation", arXiv:1808.00897 * |
Yong An, Jianwu Long, "A segmentation network with multi-attention and its application to SAR image analysis", IEEJ Transactions on Electrical and Electronic Engineering * |
Tao Yongcai et al., "A news text classification method combining pooling and attention", Journal of Chinese Computer Systems * |
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111798428B (en) * | 2020-07-03 | 2023-05-30 | 南京信息工程大学 | Automatic segmentation method for multiple tissues of skin pathology image |
CN111798428A (en) * | 2020-07-03 | 2020-10-20 | 南京信息工程大学 | Automatic segmentation method for multiple tissues of skin pathological image |
CN111914947B (en) * | 2020-08-20 | 2024-04-16 | 华侨大学 | Image instance segmentation method, device, equipment and storage medium based on feature fusion |
CN111914947A (en) * | 2020-08-20 | 2020-11-10 | 华侨大学 | Image instance segmentation method, device and equipment based on feature fusion and storage medium |
CN112084911A (en) * | 2020-08-28 | 2020-12-15 | 安徽清新互联信息科技有限公司 | Human face feature point positioning method and system based on global attention |
CN112084911B (en) * | 2020-08-28 | 2023-03-07 | 安徽清新互联信息科技有限公司 | Human face feature point positioning method and system based on global attention |
CN112085760A (en) * | 2020-09-04 | 2020-12-15 | 厦门大学 | Foreground segmentation method for laparoscopic surgery video |
CN112085741A (en) * | 2020-09-04 | 2020-12-15 | 厦门大学 | Gastric cancer pathological section segmentation algorithm based on deep learning |
CN112085741B (en) * | 2020-09-04 | 2024-03-26 | 厦门大学 | Gastric cancer pathological section segmentation algorithm based on deep learning |
CN112085760B (en) * | 2020-09-04 | 2024-04-26 | 厦门大学 | Foreground segmentation method for laparoscopic surgery video |
CN112446891A (en) * | 2020-10-23 | 2021-03-05 | 浙江工业大学 | Brain glioma medical image segmentation method based on U-Net network |
CN112446891B (en) * | 2020-10-23 | 2024-04-02 | 浙江工业大学 | Brain glioma medical image segmentation method based on U-Net network |
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Attention-mechanism semantic segmentation method based on deep learning |
CN112365480A (en) * | 2020-11-13 | 2021-02-12 | 哈尔滨市科佳通用机电股份有限公司 | Brake pad loss fault identification method for brake clamp device |
CN112509052B (en) * | 2020-12-22 | 2024-04-23 | 苏州超云生命智能产业研究院有限公司 | Method, device, computer equipment and storage medium for detecting the macular fovea |
CN112509052A (en) * | 2020-12-22 | 2021-03-16 | 苏州超云生命智能产业研究院有限公司 | Method and device for detecting the macular fovea, computer equipment and storage medium |
CN112598650A (en) * | 2020-12-24 | 2021-04-02 | 苏州大学 | Joint segmentation method for optic cup and optic disc in fundus medical images |
CN112580654A (en) * | 2020-12-25 | 2021-03-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Semantic segmentation method for ground objects of remote sensing image |
CN112767502B (en) * | 2021-01-08 | 2023-04-07 | 广东中科天机医疗装备有限公司 | Image processing method and device based on medical image model |
CN112767502A (en) * | 2021-01-08 | 2021-05-07 | 广东中科天机医疗装备有限公司 | Image processing method and device based on medical image model |
CN112927255B (en) * | 2021-02-22 | 2022-06-21 | 武汉科技大学 | Three-dimensional liver image semantic segmentation method based on context attention strategy |
CN112927255A (en) * | 2021-02-22 | 2021-06-08 | 武汉科技大学 | Three-dimensional liver image semantic segmentation method based on context attention strategy |
CN113065412A (en) * | 2021-03-12 | 2021-07-02 | 武汉大学 | Improved Deeplabv3+ based aerial image electromagnetic medium semantic recognition method and device |
CN113158802A (en) * | 2021-03-22 | 2021-07-23 | 安徽理工大学 | Smart scene segmentation technique |
CN113112465A (en) * | 2021-03-31 | 2021-07-13 | 上海深至信息科技有限公司 | System and method for generating carotid intima-media segmentation model |
CN113129321A (en) * | 2021-04-20 | 2021-07-16 | 重庆邮电大学 | Turbine blade CT image segmentation method based on full convolution neural network |
CN113033572B (en) * | 2021-04-23 | 2024-04-05 | 上海海事大学 | Obstacle segmentation network based on USV and generation method thereof |
CN113033572A (en) * | 2021-04-23 | 2021-06-25 | 上海海事大学 | Obstacle segmentation network based on USV and generation method thereof |
CN113744279A (en) * | 2021-06-09 | 2021-12-03 | 东北大学 | Image segmentation method based on FAF-Net network |
CN113744279B (en) * | 2021-06-09 | 2023-11-14 | 东北大学 | Image segmentation method based on FAF-Net network |
CN113298825B (en) * | 2021-06-09 | 2023-11-14 | 东北大学 | Image segmentation method based on MSF-Net network |
CN113298825A (en) * | 2021-06-09 | 2021-08-24 | 东北大学 | Image segmentation method based on MSF-Net network |
CN113298174B (en) * | 2021-06-10 | 2022-04-29 | 东南大学 | Semantic segmentation model improvement method based on progressive feature fusion |
CN113298174A (en) * | 2021-06-10 | 2021-08-24 | 东南大学 | Semantic segmentation model improvement method based on progressive feature fusion |
CN113436094A (en) * | 2021-06-24 | 2021-09-24 | 湖南大学 | Gray level image automatic coloring method based on multi-view attention mechanism |
CN113378791A (en) * | 2021-07-09 | 2021-09-10 | 合肥工业大学 | Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion |
CN113378791B (en) * | 2021-07-09 | 2022-08-05 | 合肥工业大学 | Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion |
CN113689434A (en) * | 2021-07-14 | 2021-11-23 | 淮阴工学院 | Image semantic segmentation method based on strip pooling |
CN113689434B (en) * | 2021-07-14 | 2022-05-27 | 淮阴工学院 | Image semantic segmentation method based on strip pooling |
CN113361537B (en) * | 2021-07-23 | 2022-05-10 | 人民网股份有限公司 | Image semantic segmentation method and device based on channel attention |
CN113361537A (en) * | 2021-07-23 | 2021-09-07 | 人民网股份有限公司 | Image semantic segmentation method and device based on channel attention |
CN113689326B (en) * | 2021-08-06 | 2023-08-04 | 西南科技大学 | Three-dimensional positioning method based on two-dimensional image segmentation guidance |
CN113689326A (en) * | 2021-08-06 | 2021-11-23 | 西南科技大学 | Three-dimensional positioning method based on two-dimensional image segmentation guidance |
CN113610164A (en) * | 2021-08-10 | 2021-11-05 | 北京邮电大学 | Fine-grained image recognition method and system based on attention balance |
CN113610164B (en) * | 2021-08-10 | 2023-12-22 | 北京邮电大学 | Fine granularity image recognition method and system based on attention balance |
CN114049315A (en) * | 2021-10-29 | 2022-02-15 | 北京长木谷医疗科技有限公司 | Joint recognition method, electronic device, storage medium, and computer program product |
CN114038037B (en) * | 2021-11-09 | 2024-02-13 | 合肥工业大学 | Expression label correction and identification method based on separable residual attention network |
CN114038037A (en) * | 2021-11-09 | 2022-02-11 | 合肥工业大学 | Expression label correction and identification method based on separable residual attention network |
CN114638256A (en) * | 2022-02-22 | 2022-06-17 | 合肥华威自动化有限公司 | Transformer fault detection method and system based on sound wave signals and attention network |
CN116229065A (en) * | 2023-02-14 | 2023-06-06 | 湖南大学 | Multi-branch fusion-based robotic surgical instrument segmentation method |
CN116229065B (en) * | 2023-02-14 | 2023-12-01 | 湖南大学 | Multi-branch fusion-based robotic surgical instrument segmentation method |
CN116630626B (en) * | 2023-06-05 | 2024-04-26 | 吉林农业科技学院 | Connected double-attention multi-scale fusion semantic segmentation network |
CN116630626A (en) * | 2023-06-05 | 2023-08-22 | 吉林农业科技学院 | Connected double-attention multi-scale fusion semantic segmentation network |
Also Published As
Publication number | Publication date |
---|---|
CN111325751B (en) | 2022-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111325751B (en) | CT image segmentation system based on attention convolution neural network | |
Zhou et al. | GMNet: Graded-feature multilabel-learning network for RGB-thermal urban scene semantic segmentation | |
CN109241982B (en) | Target detection method based on deep and shallow layer convolutional neural network | |
CN111340814B (en) | RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution | |
CN113642390B (en) | Street view image semantic segmentation method based on local attention network | |
CN111325165B (en) | Urban remote sensing image scene classification method considering spatial relationship information | |
CN111612008A (en) | Image segmentation method based on convolution network | |
CN110223304B (en) | Image segmentation method and device based on multipath aggregation and computer-readable storage medium | |
CN113807355A (en) | Image semantic segmentation method based on coding and decoding structure | |
CN111797841B (en) | Visual saliency detection method based on depth residual error network | |
CN116309648A (en) | Medical image segmentation model construction method based on multi-attention fusion | |
CN110866938B (en) | Full-automatic video moving object segmentation method | |
CN112149526B (en) | Lane line detection method and system based on long-distance information fusion | |
CN114119975A (en) | Language-guided cross-modal instance segmentation method | |
CN115620010A (en) | Semantic segmentation method for RGB-T bimodal feature fusion | |
CN113076957A (en) | RGB-D image saliency target detection method based on cross-modal feature fusion | |
CN113963170A (en) | RGBD image saliency detection method based on interactive feature fusion | |
CN116129289A (en) | Attention edge interaction optical remote sensing image saliency target detection method | |
CN114219824A (en) | Visible light-infrared target tracking method and system based on deep network | |
CN110599495B (en) | Image segmentation method based on semantic information mining | |
Chen et al. | Adaptive fusion network for RGB-D salient object detection | |
Zhang et al. | CSNet: a ConvNeXt-based Siamese network for RGB-D salient object detection | |
CN113870286A (en) | Foreground segmentation method based on multi-level feature and mask fusion | |
CN111612803B (en) | Vehicle image semantic segmentation method based on image definition | |
TWI809957B (en) | Object detection method and electronic apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220527 |