CN117392082A - Liver CT image segmentation method and system based on full-scale jump connection - Google Patents

Liver CT image segmentation method and system based on full-scale jump connection

Info

Publication number
CN117392082A
CN117392082A (application number CN202311326888.3A)
Authority
CN
China
Prior art keywords
layer
decoder
liver
full
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311326888.3A
Other languages
Chinese (zh)
Inventor
陈从平
石井
张春生
徐志伟
陈奔
陆鹏
李明春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou University
Original Assignee
Changzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou University filed Critical Changzhou University
Priority to CN202311326888.3A priority Critical patent/CN117392082A/en
Publication of CN117392082A publication Critical patent/CN117392082A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30056Liver; Hepatic

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular to a liver CT image segmentation method and system based on full-scale jump connection. The method comprises: acquiring liver CT images, then slicing and preprocessing them; taking a UNet3+ network as the backbone and introducing attention modules between the encoder and decoder; leading a branch out of the last decoder layer into a classification guidance module that judges whether the feature map contains organ features; multiplying that judgment with the output of each decoder node and sending the results to a deep-supervision module; and comparing the output of each decoder layer with the label image of corresponding size to compute a separate loss value per layer. The invention addresses the problem of important regional features being lost when features at different network scales are fused.

Description

Liver CT image segmentation method and system based on full-scale jump connection
Technical Field
The invention relates to the technical field of image processing, in particular to a liver CT image segmentation method and system based on full-scale jump connection.
Background
Liver segmentation is an important task in medical image processing, with major significance for the diagnosis of liver disease, surgical planning, and treatment monitoring. With the continued advance of medical imaging technology, CT and MRI scans of the liver have become routine clinical examinations, producing large amounts of liver image data. However, these images typically contain many anatomical structures and tissues, including the liver, blood vessels, and gall bladder, so accurate segmentation methods are required to separate the liver from the surrounding tissues and structures.
With the development of convolutional neural networks, semantic segmentation has made rapid progress and plays a major role in medical image segmentation. Three basic frameworks are common: UNet, PSPNet, and the fully convolutional network (FCN). UNet is built on the classical encoder-decoder structure: skip connections combine low-level detail feature maps from the encoder at the current scale with high-level semantic feature maps from the decoder at the corresponding next scale. UNet and a series of improved versions are widely applied in medical image segmentation and have quickly become the baseline for most medical semantic-segmentation tasks, on which much subsequent research has been built. UNet3+, the full-scale skip-connection variant of UNet, is one of the most representative UNet-based structures; it provides a skip-connection scheme that can obtain sufficient information from the feature maps at every scale.
Disclosure of Invention
Aiming at the defects of existing methods, the invention addresses the problem of important regional features being lost when features at different network scales are fused.
The technical scheme adopted by the invention is as follows: a liver CT image segmentation method and system based on full-scale jump connection, comprising the following steps:
step one, acquiring a liver CT image, and slicing and preprocessing;
Further, the preprocessing includes: random rotation, random horizontal flipping, random vertical flipping, and random color jitter.
Further, the preprocessing also comprises setting the Hounsfield-unit window width to 300 HU and the window level to 50 HU.
Step two, taking a UNet3+ network as the backbone and introducing an attention module between the encoder and decoder; leading a branch out of the last decoder layer into a classification guidance module that judges whether the feature map contains organ features; and multiplying that judgment with the output of each decoder node and sending the results to the deep-supervision module.
Further, the second step specifically includes:
Step 21, constructing a five-layer encoder, wherein each encoder layer takes VGG16 as its backbone and applies two 3×3 convolutions with stride 1 and padding 1; a ReLU activation follows each convolution, and a max-pooling operation follows each layer;
Step 22, constructing a five-layer decoder, wherein the feature map of each decoder layer is formed from the encoder layers above it, via max pooling, 3×3 convolution, and full-scale inter skip connections, and the decoder layers below it, via bilinear upsampling, 3×3 convolution, and full-scale intra skip connections; the five feature maps at different scales are first concatenated and then passed through a 3×3 convolution, a BN normalization layer, and a ReLU activation layer, finally outputting the feature map of the decoder node;
And step 23, adding an attention gate module between an encoder layer and each decoder layer of the backbone structure. Before each encoder feature is skip-connected to a decoder, features of different scales are fused through an attention gate. The attention gate module has two input signals, the encoding of the current layer and an upsampled signal from another scale; it applies convolutions to both inputs, adds them, and passes the sum through activation functions and a further convolution to obtain the attention coefficient α.
Further, the output feature of the attention gate module is given by:
X_att = X_encoder × α
where X_encoder is the encoder output and α is the attention coefficient.
And step three, comparing the output of each decoder layer with the label image of corresponding size and computing a separate loss value per layer.
Further, the loss value is computed as:
L = (1/N) Σ_{i=1..5} [ ℓ_fl(Y_i, Ŷ_i) + ℓ_lsd(Y_i, Ŷ_i) + ℓ_lh(Y_i, Ŷ_i) ]
where Y_i ∈ Y is the label image of the i-th layer, Ŷ_i is the output of the i-th-layer decoder node, N denotes the training batch, ℓ_fl is the focal loss, ℓ_lsd is the Laplace-smoothed Dice loss, and ℓ_lh is the Lovász hinge loss.
Further, a liver CT image segmentation system based on full-scale jump connection includes: a memory for storing instructions executable by the processor; and the processor is used for executing instructions to realize a liver CT image segmentation method based on full-scale jump connection.
Further, a computer readable medium storing computer program code, characterized in that the computer program code, when executed by a processor, implements a liver CT image segmentation method based on full-scale jump connection.
The invention has the beneficial effects that:
1. By making maximal use of full-scale feature maps and an attention-guiding mechanism, the invention focuses better on the region of interest, improving segmentation accuracy for small targets and challenging regions;
2. The attention mechanism strengthens attention to different regions and improves their representation in the data, yielding better segmentation results;
3. The constructed mixed loss function gives smooth gradients, reduces overfitting, handles sample imbalance, and captures sharp boundaries at both large scales and fine structures.
Drawings
FIG. 1 is a structural diagram of the liver CT image segmentation method and system based on full-scale jump connection of the present invention;
FIG. 2 is a schematic diagram of the feature-map aggregation process at decoder layer X_De3;
FIG. 3 is a schematic diagram of the attention-module mechanism;
FIG. 4 is a schematic diagram of deep supervision;
FIG. 5 is a schematic diagram of classification guidance;
fig. 6 shows the attention coefficients in the first-level attention gate over different training epochs (0, 10, 30, 60, 90), with the model gradually learning to focus on the location of the liver.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples, which are simplified schematic illustrations showing only the basic structure of the invention and thus showing only those constructions that are relevant to the invention.
As shown in fig. 1, the liver CT image segmentation method based on full-scale jump connection includes the following steps:
Step one, collecting a public liver CT scan dataset, then slicing and preprocessing;
The LiTS liver CT image dataset is obtained, comprising 131 training CT scans and 70 test CT scans; the training scans are annotated by professional physicians and randomly divided into training, test, and validation sets at a ratio of 8:1:1. Data augmentation is very important for medical image segmentation, as it reduces the overfitting produced by network training, so the training samples are expanded by random rotation, random horizontal flipping, random vertical flipping, and random color jitter (brightness, contrast, and saturation changes). The liver is taken as the positive class and all other tissue as the negative class. During image preprocessing the Hounsfield-unit (HU) window width is set to 300 HU and the window level to 50 HU, removing most irrelevant organs and tissues; after windowing, the dataset is preprocessed with min-max normalization.
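The windowing and normalization step can be sketched as follows. This is an illustrative helper, not code from the patent; the function name `window_and_normalize` and the sample HU values are assumptions, while the window width (300 HU) and level (50 HU) follow the text above.

```python
import numpy as np

def window_and_normalize(ct_slice, window_width=300.0, window_level=50.0):
    """Clip a CT slice (in Hounsfield units) to the given window, then
    min-max normalize the clipped range to [0, 1]."""
    lo = window_level - window_width / 2.0   # -100 HU with the patent's settings
    hi = window_level + window_width / 2.0   # +200 HU
    clipped = np.clip(ct_slice, lo, hi)
    return (clipped - lo) / (hi - lo)

# Illustrative 2x2 slice: air-like, window floor, soft tissue, above window ceiling
slice_hu = np.array([[-500.0, -100.0], [50.0, 300.0]])
out = window_and_normalize(slice_hu)
```

Values below −100 HU and above +200 HU saturate at 0 and 1, which is what removes most non-liver tissue before normalization.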
Step two, building the model structure: a UNet3+ network serves as the backbone, and attention modules suited to fusing features of different scales are added between the encoder and decoder at each of the first through fifth layers. The attention modules let the network pay more attention to the target segmentation region and suppress regions of no interest, improving segmentation accuracy. The last decoder layer outputs a second branch to the classification guidance module, which judges whether the feature map contains organ features; its result is multiplied with each node's output and sent to the deep-supervision module.
Step 21, constructing the encoder, which has five layers of nodes obtained by successive downsampling. The encoding stage of the UNet3+ network is in effect a feature-extraction process. The encoder's structure is similar to VGG16, whose depth gives it strong feature-representation capability, meaning it can extract rich feature information from the input image, which benefits the liver segmentation task. Each encoder layer therefore takes VGG16 as its backbone: two 3×3 convolutions with stride 1 and padding 1, each followed by a ReLU activation layer, and then max pooling with stride 2 for downsampling.
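The spatial-size bookkeeping of this five-level encoder can be sketched with the standard convolution output formula. This is illustrative only: the helper names and the 320×320 input size (used later for training) are assumptions, not code from the patent.

```python
def conv3x3_out(size, kernel=3, stride=1, padding=1):
    """Output size of a square convolution: floor((n + 2p - k)/s) + 1.
    With k=3, s=1, p=1 the spatial size is preserved."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_feature_sizes(input_size=320, depth=5):
    """Feature-map sizes at each of the five encoder levels:
    two same-size 3x3 convs per level, then a stride-2 max pool between levels."""
    sizes, s = [], input_size
    for level in range(depth):
        s = conv3x3_out(conv3x3_out(s))  # two 3x3 convs leave the size unchanged
        sizes.append(s)
        if level < depth - 1:
            s = s // 2                   # stride-2 max pooling halves the resolution
    return sizes

sizes = encoder_feature_sizes(320)       # five scales for X_En1..X_En5
```

On a 320×320 input this gives the five scales 320, 160, 80, 40, and 20, which is the pyramid the full-scale skip connections draw from.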
Step 22, constructing a decoder structure, wherein the decoder has five layers of nodes;
the decoder profile for each layer is defined by: encoders above this layer are connected by max pooling, convolution of 3*3, and skip between full scale; the decoder under the layer is connected with full-scale internal jump through bilinear upsampling and convolution of 3*3, and the feature graphs of the total 5 different scales are spliced firstly and then pass through a 3*3 convolution, a BN normalization layer and a ReLU activation function layer, and finally the feature graphs of the decoder nodes are output; the decoder of each layer can connect shallow layers and deep layers in a full-scale jump connection mode so as to reserve the bottom layer characteristics, avoid the performance degradation of a model caused by adding a plurality of layers, have the capability of exploring enough information from the full scale, clearly know the position and the boundary of organs, and each decoder layer is used for aggregating small-scale and same-scale feature images in the encoder and large-scale feature images in the decoder, so that fine-granularity semantic information and coarse-granularity semantic information under the full scale are obtained.
As shown in FIG. 2, the detailed feature-map aggregation at decoder layer X_De3 is as follows. Because the feature maps at each scale differ in size and channel count, a 3×3 convolution outputs 64 channels for every scale, and pooling or upsampling operations bring each map to the size of the target decoder layer. The sources of X_De3's feature map are three parts. The first part is the feature maps of the smaller-scale encoders X_En1 and X_En2, which transmit low-level semantic information through non-overlapping max-pooling downsampling, reducing their sizes by factors of 4 and 2 respectively. The second part is the feature map of the same-scale encoder X_En3, which the target decoder receives directly. The third part is the feature maps of the decoders X_De5 and X_De4 (X_De5 coinciding with the bottleneck encoder X_En5), which transmit high-level semantic information through bilinear interpolation, enlarging their resolutions by factors of 4 and 2 respectively. Convolution changes the channel counts to reduce redundant information, yielding five feature maps of 64 channels each; these five maps are stacked, convolved with 320 filters of size 3×3, and finally passed through a BN normalization layer and ReLU activation layer to obtain the feature map of X_De3.
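The resampling plan for X_De3 can be sketched as plain bookkeeping. The spatial sizes assume a 320×320 input (so X_De3 sits at 80×80); the dictionary keys and helper name are illustrative, not from the patent.

```python
# Spatial sizes of the five source maps feeding decoder layer X_De3,
# assuming a 320x320 network input (illustrative values).
source_sizes = {"X_En1": 320, "X_En2": 160, "X_En3": 80, "X_De4": 40, "X_De5": 20}
TARGET = 80              # resolution of X_De3
UNIFIED_CHANNELS = 64    # each source is reduced to 64 channels by a 3x3 conv

def resample_plan(sizes, target):
    """Decide, per source map, how to reach the target scale."""
    plan = {}
    for name, size in sizes.items():
        if size > target:
            plan[name] = "maxpool x%d" % (size // target)    # non-overlapping max pooling
        elif size < target:
            plan[name] = "bilinear x%d" % (target // size)   # bilinear upsampling
        else:
            plan[name] = "identity"                          # same-scale encoder
    return plan

plan = resample_plan(source_sizes, TARGET)
stacked_channels = UNIFIED_CHANNELS * len(source_sizes)  # 5 x 64 = 320 into the 3x3 conv
```

The 320 stacked channels match the 320 filters of the fusion convolution described above.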
As shown in FIG. 1, step 23 introduces attention gate modules between the encoder and decoder.
An attention gate module is added between an encoder layer and each decoder layer of the backbone. The module has two input signals, the encoding of the current layer and an upsampled signal, which are added so that overlapping regions receive larger activation values, guiding and strengthening learning related to the target region while suppressing irrelevant regions. As shown in FIG. 1, X_De1 is stacked from the feature maps output by five codecs of different scales: X_En1 and X_De2 through X_De5; the attention gate between X_De1 and X_En1 takes X_De2 as its input.
X_De2 is stacked from the feature maps of X_En1, X_En2, and X_De3 through X_De5; the attention gate between X_En1 and X_De2 takes X_En2 as its input, and the gate between X_En2 and X_De2 takes X_De3 as its input.
X_De3 is stacked from the feature maps of X_En1 through X_En3, X_De4, and X_De5; the attention gate between X_En1 and X_De3 takes X_En2 as its input, the gate between X_En2 and X_De3 takes X_De3 as its input, and the gate between X_En3 and X_De3 takes X_De4 as its input.
X_De4 is stacked from the feature maps of X_En1 through X_En4 and X_De5; the attention gate between X_En1 and X_De4 takes X_En2 as its input, the gate between X_En2 and X_De4 takes X_De3 as its input, the gate between X_En3 and X_De4 takes X_De4 as its input, and the gate between X_En4 and X_De4 takes X_De5 as its input. X_De5 and X_En5 form an ordinary codec structure.
The attention coefficient α of the attention gate module is:
α = σ2(φ(σ1(W_x X_encoder + W_g X_upsample + b_g)) + b_φ) ∈ [0, 1]
where X_encoder is the encoder output, X_upsample is the upsampled signal, σ1 and σ2 are the ReLU and Sigmoid activation functions, W_x and W_g are weights, b_g and b_φ are biases, and φ is a 1×1 convolution. The attention coefficient α is larger in the target-organ region and smaller in the background, achieving the screening that improves segmentation accuracy. Finally, the encoded signal of the current layer is multiplied pixel-wise by the attention coefficient α before being sent to upsampling; the output feature of the attention gate module is:
X_att = X_encoder × α
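A minimal numerical sketch of the gate's arithmetic follows. It is illustrative, not the patent's implementation: scalar weights stand in for the 1×1 convolutions W_x, W_g, and φ, and the input arrays are invented examples.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x_encoder, x_upsample,
                   w_x=1.0, w_g=1.0, b_g=0.0, w_phi=1.0, b_phi=0.0):
    """alpha = sigmoid(phi(relu(W_x*x + W_g*g + b_g)) + b_phi); output = x * alpha.
    Scalar weights stand in for the real module's 1x1 convolutions."""
    inner = relu(w_x * x_encoder + w_g * x_upsample + b_g)
    alpha = sigmoid(w_phi * inner + b_phi)
    return x_encoder * alpha, alpha

x = np.array([0.0, 1.0, 4.0])   # encoder activations (illustrative)
g = np.array([0.0, 1.0, 4.0])   # upsampled gating signal (illustrative)
gated, alpha = attention_gate(x, g)
```

Positions where encoder and gating signals agree strongly get coefficients near 1, while weakly activated positions stay near 0.5 and below after training shifts the biases.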
FIG. 6 shows how the attention coefficients computed from a training image change gradually during training: the attention gate initially distributes weights evenly over all locations and, as the training epochs increase, progressively localizes to the target organ relevant to the segmentation task, strengthening those weights while suppressing irrelevant tissue. The attention mechanism thus trains multiple attention gates at every image scale, letting the model focus on the specific target organ during learning and strengthening the reliability of the skip connections.
As shown in FIG. 4, the deep-supervision structure: a 3×3 convolution and a ReLU activation function follow each decoder output node. Because the decoder outputs at different scales have different sizes, the label image (GT) is reduced by downsampling. The label images come from the LiTS liver CT dataset; downsampling is performed with MaxPool2d, and ×2, ×4, ×8, ×16 denote scaling factors of 2, 4, 8, and 16.
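The MaxPool2d-based label downsampling can be sketched without a deep-learning framework; for a binary mask, non-overlapping max pooling keeps any foreground pixel present in each window. The helper below is an illustrative stand-in, not the patent's code.

```python
import numpy as np

def maxpool2d(mask, factor):
    """Non-overlapping max pooling of a 2D label mask by an integer factor,
    mimicking MaxPool2d(kernel_size=factor, stride=factor)."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).max(axis=(1, 3))

gt = np.zeros((16, 16))
gt[3, 5] = 1.0                                   # a single foreground pixel
gt_x2, gt_x4 = maxpool2d(gt, 2), maxpool2d(gt, 4)  # GT for two coarser decoder scales
```

Max pooling (rather than averaging) guarantees the downsampled GT stays a hard 0/1 mask and never erases a small foreground region entirely within a pooling window.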
Step three, comparing each decoder layer's output with the label image of corresponding size to compute a loss value per layer, then summing the five layers' loss values to obtain the loss used to train the network. To make full use of the feature maps at different semantic levels at the nodes, a mixed loss function is proposed, combining the Laplace-smoothed Dice loss, the focal loss, and the Lovász hinge loss. The mixed loss gives smooth gradients, reduces overfitting, handles sample imbalance, and captures sharp boundaries at both large scales and fine structures. It is written as:
L = (1/N) Σ_{i=1..5} [ ℓ_fl(Y_i, Ŷ_i) + ℓ_lsd(Y_i, Ŷ_i) + ℓ_lh(Y_i, Ŷ_i) ]
where Y_i ∈ Y is the label image of the i-th layer, Ŷ_i is the output of the i-th-layer decoder node, N denotes the training batch, ℓ_fl is the focal loss, ℓ_lsd is the Laplace-smoothed Dice loss, and ℓ_lh is the Lovász hinge loss.
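Two of the three loss terms can be sketched directly; the Lovász hinge is omitted here for brevity. This is an illustrative numpy sketch with assumed function names and sample vectors, not the patent's implementation.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights easy examples by the factor (1 - p_t)^gamma."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * (1.0 - p) ** gamma * np.log(p)
                    + (1.0 - y) * p ** gamma * np.log(1.0 - p))

def smooth_dice_loss(p, y, smooth=1.0):
    """Dice loss with additive (Laplace) smoothing to avoid division by zero."""
    inter = (p * y).sum()
    return 1.0 - (2.0 * inter + smooth) / (p.sum() + y.sum() + smooth)

y = np.array([1.0, 0.0, 1.0, 0.0])        # toy ground-truth mask (flattened)
p_good = np.array([0.9, 0.1, 0.9, 0.1])   # confident, mostly correct prediction
p_bad = np.array([0.1, 0.9, 0.1, 0.9])    # confidently wrong prediction
loss_good = focal_loss(p_good, y) + smooth_dice_loss(p_good, y)
loss_bad = focal_loss(p_bad, y) + smooth_dice_loss(p_bad, y)
```

The focal term handles class imbalance, while the smoothed Dice term optimizes overlap directly; both decrease together as predictions improve.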
Classification guidance module structure: in most medical image segmentation, false positives on organ-free images are almost inevitable, likely an over-segmentation phenomenon caused by noise in the image background present in the shallow layers. To address the false-positive problem for organ-free images, the invention attaches the classification guidance module to the fifth decoder node of the network. That node produces a second output path consisting of dropout, a 1×1 convolution, adaptive max pooling, and a Sigmoid activation, yielding a two-dimensional tensor whose values represent the probability that a target organ is present. An Argmax function reduces this to a single output of 0 or 1, which is finally multiplied with the branches output by the five decoder nodes of the network structure.
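The gating effect of the classification branch can be sketched as follows. This is an illustrative stand-in (invented helper name and toy maps), not the patent's exact code: it models only the final Argmax-and-multiply step, assuming the two-element probability tensor has already been produced.

```python
import numpy as np

def classification_gate(class_probs, decoder_outputs):
    """Argmax over the (organ-absent, organ-present) pair gives 0 or 1; every
    deep-supervision branch is multiplied by that scalar, zeroing all predicted
    masks on slices judged to contain no organ."""
    gate = int(np.argmax(class_probs))   # 0 = no organ, 1 = organ present
    return [out * gate for out in decoder_outputs]

maps = [np.full((4, 4), 0.8) for _ in range(5)]   # five decoder-branch outputs (toy)
kept = classification_gate([0.1, 0.9], maps)      # organ present -> maps unchanged
zeroed = classification_gate([0.9, 0.1], maps)    # organ absent  -> maps zeroed
```

Because the gate is a hard 0/1 rather than a probability, an organ-free slice suppresses every false-positive pixel at once instead of merely attenuating it.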
Model training: the training and validation sets are first downsampled to 320×320, then fed into the constructed model for training. The loss values are obtained from the mixed loss function in the deep-supervision module by summing the losses of the five different scales, and the model parameters are saved after training.
The preprocessed images are fed into the constructed model for training. An Adam optimizer computes the optimal weights during backpropagation, all hyperparameters take their default values, and the learning rate is adjusted by cosine annealing. After a training image enters the network, every decoder layer outputs a segmented image. The output of the lowest decoder layer passes through the classification guidance module (CGM) to judge whether the image is a false positive; if it is an organ image, each node's output branch is multiplied by 1 and remains unchanged. The five decoder layers' segmented images at different scales are compared against label images at the corresponding scales to compute loss values, which are passed to the Adam optimizer; the Adam optimizer adjusts the parameters of the network structure according to these loss values.
The hardware system used is Windows 10 Professional with two Intel(R) Xeon(R) Gold 6248R CPUs @ 3.00 GHz and 128 GB of RAM; the software environment is PyCharm with the Python 3.9 programming language, and the network is built on the PyTorch framework.
Model evaluation: the following indices measure network performance: the parameter count of the network model, Dice, and IoU. Dice and IoU both lie in the range [0, 1], with values closer to 1 indicating better model performance; a smaller parameter count also indicates a better model. The Dice and IoU indices are defined as:
Dice = 2TP / (2TP + FP + FN)
IoU = TP / (TP + FP + FN)
where true positive (TP) denotes correctly classified target pixels, true negative (TN) correctly classified background pixels, false positive (FP) incorrectly classified background pixels, and false negative (FN) incorrectly classified target pixels.
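The two overlap metrics can be computed directly from the confusion-matrix counts. The helper below is illustrative (the counts are invented examples), but the formulas are the standard Dice and IoU definitions.

```python
def dice_iou(tp, fp, fn):
    """Dice = 2TP / (2TP + FP + FN); IoU = TP / (TP + FP + FN).
    TN does not appear in either overlap metric."""
    dice = 2.0 * tp / (2.0 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou

dice, iou = dice_iou(tp=8, fp=2, fn=2)   # toy counts
```

Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank models identically but Dice is always the larger of the two for imperfect predictions.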
Table 1 is an evaluation index for liver segmentation according to the present invention:
the unet3+ network of the present invention uses full-size skip connection to transfer the extracted features from the encoder to the target decoder, facilitating integration of the layered representation; an attention gate mechanism is adopted between the skipped connection of the encoder and the decoder, so that the features extracted on different scales are selectively combined in the extended extension path, and the accuracy of the model is improved; deep and shallow features are processed simultaneously by a attention mechanism, thereby enhancing the importance of the target region while suppressing irrelevant background regions.
Taking the above preferred embodiments of the invention as illustration, persons skilled in the relevant art can make various changes and modifications without departing from the scope of the technical idea of the invention. The technical scope of the invention is not limited to this description, but must be determined according to the scope of the claims.

Claims (8)

1. The liver CT image segmentation method based on full-scale jump connection is characterized by comprising the following steps of:
step one, acquiring a liver CT image, and slicing and preprocessing;
step two, taking a UNet3+ network as the backbone and introducing an attention module between the encoder and decoder; leading a branch out of the last decoder layer into a classification guidance module that judges whether the feature map contains organ features; multiplying that judgment with the output of each decoder node and sending the results to a deep-supervision module;
and step three, comparing the output of each decoder layer with the label image of corresponding size and computing a separate loss value per layer.
2. The liver CT image segmentation method based on full-scale jump connection as recited in claim 1, wherein the preprocessing comprises: random rotation, random horizontal flipping, random vertical flipping, and random color jitter.
3. The liver CT image segmentation method based on full-scale jump connection of claim 2, wherein the preprocessing further comprises setting a Hounsfield unit window width to 300Hu and a window level to 50Hu.
4. The liver CT image segmentation method based on full-scale jump connection of claim 1, wherein the step two specifically comprises:
step 21, constructing a five-layer encoder, wherein each encoder layer takes VGG16 as its backbone and applies two 3×3 convolutions with stride 1 and padding 1; a ReLU activation follows each convolution, and a max-pooling operation follows each layer;
step 22, constructing a five-layer decoder, wherein the feature map of each decoder layer is formed from the encoder layers above it, via max pooling, 3×3 convolution, and full-scale inter skip connections, and the decoder layers below it, via bilinear upsampling, 3×3 convolution, and full-scale intra skip connections; the five feature maps at different scales are first concatenated and then passed through a 3×3 convolution, a BN normalization layer, and a ReLU activation layer, finally outputting the feature map of the decoder node;
and step 23, adding an attention gate module between an encoder layer and each decoder layer of the backbone structure, wherein before each encoder feature is skip-connected to a decoder, features of different scales are fused through an attention gate; the attention gate module has two input signals, the encoding of the current layer and an upsampled signal from another scale, and applies convolutions, an addition, activation functions, and a further convolution to obtain the attention coefficient α.
5. The liver CT image segmentation method based on full-scale jump connection as recited in claim 4, wherein the output feature of the attention gate module is given by:
X_att = X_encoder × α
where X_encoder is the encoder output and α is the attention coefficient.
6. The liver CT image segmentation method based on full-scale jump connection of claim 1, wherein the loss value is computed as:
L = (1/N) Σ_{i=1..5} [ ℓ_fl(Y_i, Ŷ_i) + ℓ_lsd(Y_i, Ŷ_i) + ℓ_lh(Y_i, Ŷ_i) ]
where Y_i ∈ Y is the label image of the i-th layer, Ŷ_i is the output of the i-th-layer decoder node, N denotes the training batch, ℓ_fl is the focal loss, ℓ_lsd is the Laplace-smoothed Dice loss, and ℓ_lh is the Lovász hinge loss.
7. A liver CT image segmentation system based on full-scale jump connection, characterized by comprising: a memory for storing instructions executable by a processor; and a processor for executing the instructions to implement the liver CT image segmentation method based on full-scale jump connection as claimed in any one of claims 1-6.
8. A computer readable medium storing computer program code, characterized in that the computer program code, when executed by a processor, implements the liver CT image segmentation method based on full-scale jump connection as claimed in any one of claims 1-6.
CN202311326888.3A 2023-10-13 2023-10-13 Liver CT image segmentation method and system based on full-scale jump connection Pending CN117392082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311326888.3A CN117392082A (en) 2023-10-13 2023-10-13 Liver CT image segmentation method and system based on full-scale jump connection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311326888.3A CN117392082A (en) 2023-10-13 2023-10-13 Liver CT image segmentation method and system based on full-scale jump connection

Publications (1)

Publication Number Publication Date
CN117392082A true CN117392082A (en) 2024-01-12

Family

ID=89471382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311326888.3A Pending CN117392082A (en) 2023-10-13 2023-10-13 Liver CT image segmentation method and system based on full-scale jump connection

Country Status (1)

Country Link
CN (1) CN117392082A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212241A (en) * 2024-05-22 2024-06-18 齐鲁工业大学(山东省科学院) Neck MRI image analysis method based on dual-stage granularity network
CN118470049A (en) * 2024-07-09 2024-08-09 中国科学院苏州生物医学工程技术研究所 Method and medium for classifying and dividing ultrasonic image of dialysis passage and calculating vascular stenosis rate
CN118470049B (en) * 2024-07-09 2024-10-15 中国科学院苏州生物医学工程技术研究所 Method and medium for classifying and dividing ultrasonic image of dialysis passage and calculating vascular stenosis rate


Similar Documents

Publication Publication Date Title
CN111784671B (en) Pathological image focus region detection method based on multi-scale deep learning
CN113077471B (en) Medical image segmentation method based on U-shaped network
Pinaya et al. Unsupervised brain imaging 3D anomaly detection and segmentation with transformers
CN111627019B (en) Liver tumor segmentation method and system based on convolutional neural network
Wang et al. Dual encoding u-net for retinal vessel segmentation
CN111242233B (en) Alzheimer disease classification method based on fusion network
CN112767417B (en) Multi-modal image segmentation method based on cascaded U-Net network
CN113205524B (en) Blood vessel image segmentation method, device and equipment based on U-Net
CN116739985A (en) Pulmonary CT image segmentation method based on transducer and convolutional neural network
CN114998265A (en) Liver tumor segmentation method based on improved U-Net
CN112381846A (en) Ultrasonic thyroid nodule segmentation method based on asymmetric network
CN117392082A (en) Liver CT image segmentation method and system based on full-scale jump connection
CN110930378A (en) Emphysema image processing method and system based on low data demand
Liao et al. Joint image quality assessment and brain extraction of fetal MRI using deep learning
CN116091412A (en) Method for segmenting tumor from PET/CT image
CN117934824A (en) Target region segmentation method and system for ultrasonic image and electronic equipment
CN116486156A (en) Full-view digital slice image classification method integrating multi-scale feature context
CN116433654A (en) Improved U-Net network spine integral segmentation method
CN114581459A (en) Improved 3D U-Net model-based segmentation method for image region of interest of preschool child lung
CN117523204A (en) Liver tumor image segmentation method and device oriented to medical scene and readable storage medium
Hou High-fidelity diabetic retina fundus image synthesis from freestyle lesion maps
CN118172372A (en) Cross-modal tumor automatic segmentation method based on PET-CT medical image and storage medium
CN117474933A (en) Medical image segmentation method based on cavity convolution attention mechanism
CN115410032A (en) OCTA image classification structure training method based on self-supervision learning
CN113327221A (en) Image synthesis method and device fusing ROI (region of interest), electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination