CN117745745B - CT image segmentation method based on context fusion perception - Google Patents


Info

Publication number
CN117745745B
Authority
CN
China
Prior art keywords
representing
output
stage
convolution
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410180218.3A
Other languages
Chinese (zh)
Other versions
CN117745745A (en)
Inventor
刘敏
汪嘉正
申文婷
张哲
王耀南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202410180218.3A priority Critical patent/CN117745745B/en
Publication of CN117745745A publication Critical patent/CN117745745A/en
Application granted granted Critical
Publication of CN117745745B publication Critical patent/CN117745745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a CT image segmentation method based on context fusion perception. The method constructs a CT image segmentation model comprising a backbone structure with an encoder and a decoder, a parallel cavity convolution module PDCM, a pyramid fusion module PFM and a position attention module PAM, and jointly optimizes the model with a mixed loss of cross entropy and dice loss. The encoder encodes the input image and outputs encoding results at different stages. The PFM modules cascade the encoding results of the different stages, perform context feature fusion through separable cavity convolutions with different rates, and skip-connect their outputs to the decoder stage of the same level. The PDCM module enhances and fuses the final encoder feature map through six different branches and feeds the refined high-order feature map to the decoder. The PAM module applies multi-layer position attention to the feature map output by each decoder stage to locate and segment the target. The accuracy of target segmentation is thereby improved.

Description

CT image segmentation method based on context fusion perception
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a CT image segmentation method based on context fusion perception.
Background
Early detection and accurate diagnosis are key to improving the cure and survival rates of many diseases. Even in high-income countries, the survival rate of liver cancer remains unsatisfactory and has improved little in recent decades. Moreover, because the liver is largely shielded by the right ribs, conventional physical examination often fails to reveal inconspicuous liver tumors, so early detection of liver cancer is challenging.
The advent of computed tomography (CT) imaging has revolutionized the diagnosis of liver tumors. This scanning technique uses X-rays to create detailed images of the body that make tumors inside or adjacent to the liver visually observable. However, owing to the statistical uncertainty of CT physical measurements, various kinds of noise (such as quantum noise and electronic noise) are introduced into the CT image during imaging, so the contrast of the CT image is low and the boundaries of lesion areas are difficult to distinguish. Meanwhile, because most early tumors are small and their lesion features are not obvious, the human eye may be unable to identify them accurately in the CT image. These limitations reduce the accuracy and efficiency of diagnosis and make it difficult for physicians to analyze the disease and formulate a treatment plan. It is therefore necessary to develop an accurate CT image segmentation method that addresses the difficult problems of blurred boundaries and tiny-target segmentation, and helps doctors complete early disease diagnosis and clinical planning.
In recent years, many researchers have attempted to alleviate these problems from multiple angles. On the one hand, for the blurred-boundary problem caused by low target contrast and noise, CPFNet adds two pyramid modules to the encoder-decoder structure, which effectively enlarges the receptive field of the network, improves its ability to integrate global information, and to some extent relieves the influence of CT image background noise; MCI-Net adds a multi-scale context extraction module that combines four cascaded hybrid dilated convolution branches to deeply capture partial detail features, realizing effective identification of some low-contrast features in CT images. However, these methods lack a means of effectively fusing global and local information, and it is difficult to achieve effective segmentation of blurred boundaries with one-sided information alone, which limits their application in clinical decision-making. On the other hand, introducing attention mechanisms can focus the network on important areas of the image, which offers an approach to the problem of locating tiny objects. Some researchers have introduced three attention modules into a CNN, acting respectively on the spatial position, channel number and scale of the feature map, to achieve accurate medical image segmentation; however, this work is mainly oriented toward general medical image segmentation and provides no dedicated solution for small targets. To effectively address the above problems, a CT image segmentation method based on context fusion awareness is therefore proposed.
Disclosure of Invention
Aiming at the technical problems, the invention provides a CT image segmentation method based on context fusion perception.
The technical scheme adopted for solving the technical problems is as follows:
A CT image segmentation method based on context fusion awareness, the method comprising the steps of:
S100: constructing a CT image segmentation model comprising a backbone structure with an encoder and a decoder, a parallel cavity convolution module PDCM, a pyramid fusion module PFM and a position attention module PAM, and utilizing cross entropy and dice loss as mixing loss to jointly optimize the model;
S200: acquiring an input image, encoding the input image by using an improved ResNet encoder, and outputting encoding results of different stages, wherein the encoding results output by the encoder in each stage have different scales;
S300: the PFM module is utilized to respectively cascade the encoding results of different stages of the encoder, context feature fusion is carried out through separable cavity convolution of different rates, and the output is connected with the decoder of the same stage in a jumping manner;
S400: the PDCM module is used to enhance and fuse the final output feature map of the encoder through six different branches, and the refined high-order feature map is fed to the decoder;
S500: the PAM module is used to locate and segment the target through multi-layer position attention applied to the feature maps output by each stage of the decoder.
Preferably, cross entropy and dice loss are used as a mixed loss in S100 to jointly optimize the model, in particular:
L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\big[g_i\log p_i + (1-g_i)\log(1-p_i)\big]  (1)
L_{dice} = 1 - \frac{2\sum_{i=1}^{N} g_i p_i}{\sum_{i=1}^{N} g_i + \sum_{i=1}^{N} p_i}  (2)
L_{mix} = L_{ce} + L_{dice}  (3)
Wherein, L_{ce} represents the cross entropy loss, L_{dice} represents the dice loss, L_{mix} represents the mixed loss, g_i represents the true (ground-truth) value of the i-th pixel, p_i represents the predicted value of the i-th pixel, and N represents the number of pixels of the sample; the real label and the prediction result are composed of the g_i and the p_i respectively.
Preferably, the improved ResNet encoder in S200 is specifically:
the residual network ResNet pre-trained on ImageNet is used as the backbone structure of the encoder, the last pooling layer and the fully connected layer of ResNet are removed, and the primary features extracted by the residual module of each stage are output before the down-sampling operation;
to achieve effective extraction of the features of each stage, a convolution layer and a ReLU nonlinear activation layer are added after the primary features output by the residual module of each stage, yielding the output features of that stage, where the index runs over the different encoding stages.
Preferably, S300 includes:
The PFM module first integrates the output features of different scales from each stage of the backbone encoder using a continuous convolution layer; the feature maps of the stages are then brought to a common scale by bilinear interpolation up-sampling and concatenated, wherein the output features of each stage are context-fused only with deeper features; deep features are then extracted from the different levels through separable cavity convolutions with different rates, the outputs are integrated through a series of convolution and down-sampling operations, and the final result is skip-connected to the decoder of the same stage; wherein the continuous convolution layer is formed by successive convolutions alternating with batch normalization layers and ReLU nonlinear activation layers.
Preferably, to fuse multiple layers of context information, the model uses a total of 4 PFM modules, each expressed mathematically as:
(4)
(5)
(6)
Wherein, the quantities appearing in equations (4)-(6) are: the index of the different encoding stages of the encoder; the output features of each encoder stage; the integrated output feature map; the feature map of each PFM module after preliminary integration and cascading; the output of each PFM module after processing; the cascading (concatenation) operation; the up-sampling operation and its up-sampling multiple; the separable hole convolution operation at a given rate; and the convolution with batch normalization and ReLU nonlinear activation.
Preferably, S400 includes:
After the final high-order output features of the encoder are obtained, they are enhanced and fused through six different branches: five branches contain hole convolutions of different numbers and rates, and the last branch is a residual branch used to prevent vanishing gradients. At the end of each hole convolution branch, a sequential operation of convolution, batch normalization and ReLU nonlinear activation is applied as a correction. After the five hole convolution branches reshape the high-order features, the results are spliced by cascading, the channels are integrated through a series of convolution operations, and the result is finally added element by element to the residual branch to obtain the output features, which serve as the input of the decoder.
Preferably, PDCM branches and outputs can be expressed mathematically as:
(7)
(8)
(9)
(10)
(11)
(12)
(13)
Wherein, the quantities appearing in equations (7)-(13) are: the output feature map of the fifth stage of the encoder; the output of each of the six branches; the hole convolution operation at a given rate; the cascading (concatenation) operation; the continuous operation of convolution, batch normalization and ReLU nonlinear activation; and the point-by-point addition of matrices.
Preferably, S500 includes:
S510: extracting an output feature map from each stage of a decoder, acquiring a first type feature and a second type feature in the feature map of each stage by using average pooling and maximum pooling operation, and fusing context information of different stages into a multi-layer mixed feature map in a point-by-point adding and cascading mode;
S520: the multi-layer mixed feature map is used for adaptively adjusting the importance weight of each position point through a position attention module, so that the key target is effectively perceived and positioned; and finally, integrating the restored two-dimensional feature map into the size of the output image to obtain a final segmentation result.
Preferably, S510 is specifically:
The output feature maps of the individual stages are extracted from the decoder. Each stage's output feature map is first expanded to the same scale as the final decoder output through the same continuous convolution and up-sampling operations as in the PFM module; a first-type feature and a second-type feature are then obtained from each stage's feature map through average pooling and maximum pooling, the two types of features of each stage are fused by point-by-point addition, and the features extracted at the different stages are combined by cascading to obtain a multi-layer hybrid feature map;
S520 specifically comprises: the global first-type feature and the global second-type feature are obtained from the multi-layer hybrid feature map through global average pooling and global maximum pooling along the channel dimension, and the two types of features are then reshaped into one-dimensional vectors; a multi-layer perceptron MLP adaptively adjusts the importance weight of each position point in the two types of features, the weights are restored into two-dimensional feature maps, and the position attention weights of the two feature maps are fused by point-by-point addition; the resulting attention weights are fused with the multi-layer hybrid feature map by point-by-point multiplication, so that the position information of the region of interest is weighted through multi-layer fusion to obtain a position-weighted feature map, realizing effective perception and positioning of key targets; finally, the position-weighted feature map is up-sampled to the size of the output image and the channels are integrated through a series of convolution operations to obtain the final output result.
Preferably, the PAM module overall flow is expressed mathematically as:
(14)
(15)
(16)
(17)
(18)
Wherein, the quantities appearing in equations (14)-(18) are: the multi-layer hybrid feature map obtained after cascading the features extracted at each stage; the average pooling operation along the channel dimension; the maximum pooling operation along the channel dimension; the feature maps extracted from the stages of the decoder; the operation of reshaping a two-dimensional feature map into a one-dimensional feature vector; the reshaped one-dimensional global first-type and global second-type feature vectors; the operation of restoring a one-dimensional feature vector to a two-dimensional feature map; the multi-layer perceptron; the point-wise multiplication of matrices; the position-weighted feature map obtained after the multi-layer position attention; and the final output result.
In the CT image segmentation method based on context fusion perception, the edge information of the target is accurately perceived through the reconstructed skip connections, the context information is deeply fused using cavity convolutions with different rates and a multi-dimensional attention mechanism, accurate positioning and segmentation of tiny targets are achieved, and the segmentation accuracy for blurred-boundary targets and tiny targets in CT images is effectively improved.
Drawings
FIG. 1 is a flow chart of a CT image segmentation method based on context fusion awareness according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an overall network structure of a CT image segmentation model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fourth stage PFM module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a PDCM module structure according to an embodiment of the present invention;
FIG. 5 is a schematic diagram showing a PAM module according to an embodiment of the present invention;
Fig. 6 is a schematic diagram illustrating an effect of CT image segmentation according to an embodiment of the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings.
In one embodiment, as shown in Fig. 1 and Fig. 2, a CT image segmentation method based on context fusion awareness comprises the following steps:
S100: a CT image segmentation model is constructed comprising a backbone structure with encoder and decoder, a parallel hole convolution module PDCM, a pyramid fusion module PFM and a position attention module PAM, the model is jointly optimized using cross entropy and dice loss as mixing loss.
In one embodiment, the model is jointly optimized in S100 using cross entropy and dice loss as a mixed loss, specifically:
L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\big[g_i\log p_i + (1-g_i)\log(1-p_i)\big]  (1)
L_{dice} = 1 - \frac{2\sum_{i=1}^{N} g_i p_i}{\sum_{i=1}^{N} g_i + \sum_{i=1}^{N} p_i}  (2)
L_{mix} = L_{ce} + L_{dice}  (3)
Wherein, L_{ce} represents the cross entropy loss, L_{dice} represents the dice loss, L_{mix} represents the mixed loss, g_i represents the true (ground-truth) value of the i-th pixel, p_i represents the predicted value of the i-th pixel, and N represents the number of pixels of the sample; the real label and the prediction result are composed of the g_i and the p_i respectively.
In particular, the cross entropy loss L_{ce} and the dice loss L_{dice} together constitute the mixed loss L_{mix}, which jointly optimizes the learning of the model. When the segmentation content of a sample is unbalanced, the dice loss tends to favor clear large targets while the cross entropy loss increases the learning weight of blurred small targets, so combining the two effectively improves the network's ability to learn different targets and improves segmentation precision.
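A minimal sketch of this mixed loss, assuming a PyTorch-style implementation with a single-channel sigmoid output and an unweighted sum of the two terms (neither detail is spelled out above), could be:

```python
import torch
import torch.nn.functional as F

def mixed_loss(pred_logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical sketch of the cross-entropy + dice mixed loss.

    pred_logits: raw network outputs, shape (B, 1, H, W)
    target:      binary ground-truth masks (float, values in {0, 1}), same shape
    """
    # Pixel-wise binary cross entropy, averaged over all pixels (equation (1))
    ce = F.binary_cross_entropy_with_logits(pred_logits, target)

    # Soft dice loss on the predicted probabilities (equation (2))
    probs = torch.sigmoid(pred_logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)

    # Mixed loss (equation (3)): unweighted sum of the two terms
    return ce + dice.mean()
```

In training, the per-batch loss would then simply be mixed_loss(model(images), masks), letting the dice term dominate for clear large targets while the cross-entropy term keeps gradients on blurred small targets.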
S200: an input image is obtained, the input image is encoded by using a modified ResNet encoder, the encoding results of different stages are output, and the encoding results output by the encoder in each stage are different in scale.
In one embodiment, the improved ResNet encoder in S200 is specifically:
the residual network ResNet pre-trained on ImageNet is used as the backbone structure of the encoder, the last pooling layer and the fully connected layer of ResNet are removed, and the primary features extracted by the residual module of each stage are output before the down-sampling operation;
to achieve effective extraction of the features of each stage, a convolution layer and a ReLU nonlinear activation layer are added after the primary features output by the residual module of each stage, yielding the output features of that stage, where the index runs over the different encoding stages.
Specifically, the last pooling layer and the fully connected layer of ResNet are removed so that the backbone can be better fused with the subsequent modules; in addition, the nonlinear features are further activated while the scale of the feature map is kept unchanged, which facilitates effective extraction and fusion of the features in the subsequent modules.
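A minimal sketch of such a stage-wise encoder, assuming a torchvision ResNet-34 backbone and a 3x3 kernel for the added per-stage convolution (both illustrative choices, not details taken from the text), could look like:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class StagewiseEncoder(nn.Module):
    """Sketch of the modified ResNet encoder: the final average pooling and
    fully connected layers are dropped, and the features of every stage are
    refined by an extra convolution + ReLU before being passed on."""

    def __init__(self):
        super().__init__()
        backbone = resnet34(weights="IMAGENET1K_V1")  # ImageNet-pretrained (torchvision >= 0.13)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)
        self.stages = nn.ModuleList([
            nn.Sequential(backbone.maxpool, backbone.layer1),   # stage 2
            backbone.layer2,                                    # stage 3
            backbone.layer3,                                    # stage 4
            backbone.layer4,                                    # stage 5
        ])
        channels = (64, 64, 128, 256, 512)
        # extra per-stage convolution + ReLU on the primary features (kernel assumed)
        self.refine = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))
            for c in channels
        ])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        feats.append(self.refine[0](x))                 # stage 1
        for i, stage in enumerate(self.stages, start=1):
            x = stage(x)
            feats.append(self.refine[i](x))             # stages 2..5
        return feats                                     # one feature map per encoding stage
```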
S300: and cascading the coding results of different stages of the encoder by using the PFM module, performing context feature fusion by separable cavity convolution of different rates, and jumping-connecting the output with the decoder of the same stage.
In one embodiment, S300 includes:
The PFM module first integrates the output features of different scales from each stage of the backbone encoder using a continuous convolution layer; the feature maps of the stages are then brought to a common scale by bilinear interpolation up-sampling and concatenated, wherein the output features of each stage are context-fused only with deeper features; deep features are then extracted from the different levels through separable cavity convolutions with different rates, the outputs are integrated through a series of convolution and down-sampling operations, and the final result is skip-connected to the decoder of the same stage; wherein the continuous convolution layer is formed by successive convolutions alternating with batch normalization layers and ReLU nonlinear activation layers.
Specifically, the continuous convolution layer stabilizes the distribution of the output features and integrates the output features of the different stages into the same number of channels; the feature maps of different scales are then brought to the same scale by bilinear interpolation up-sampling and cascaded into a new feature map.
Further, separable hole convolutions with different rates are designed according to the output of each stage (the rates are chosen to ensure continuity of the feature regions within the receptive field); they probe the generated feature map at different levels to obtain context information, and the results are then skip-connected to the decoder. The number of separable hole convolutions equals the number of input levels, and compared with ordinary convolutions, separable hole convolutions enlarge the receptive field while reducing the number of parameters. The PFM therefore alleviates the network's insufficient acquisition of global information from the input feature map.
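A rough sketch of one pyramid fusion module is given below; the channel width, the dilation rates and the choice of up-sampling the deeper maps to the stage resolution are assumptions, and the final convolution-and-down-sampling step is simplified to a single continuous convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def continuous_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """'Continuous convolution layer': three convolutions alternating with
    batch normalization and ReLU (1x1/3x3/1x1 kernel sizes are assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class SeparableHoleConv(nn.Module):
    """Depthwise 3x3 dilated (hole) convolution followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int, rate: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=rate, dilation=rate, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class PFM(nn.Module):
    """Sketch of one pyramid fusion module: a stage's features are fused only
    with deeper stages, each fused map is probed by a separable hole convolution
    of a different rate, and the result is skip-connected to the same-stage decoder."""
    def __init__(self, in_channels, width=64, rates=(1, 2, 4, 8, 16)):
        super().__init__()
        n = len(in_channels)                       # this stage plus all deeper stages
        self.align = nn.ModuleList(continuous_conv(c, width) for c in in_channels)
        self.branches = nn.ModuleList(
            SeparableHoleConv(width * n, width, rates[i]) for i in range(n))
        self.integrate = continuous_conv(width * n, width)

    def forward(self, feats):
        # feats[0] is this stage's map; deeper maps are upsampled to its size
        base = self.align[0](feats[0])
        size = base.shape[-2:]
        aligned = [base] + [
            F.interpolate(self.align[i](f), size=size, mode="bilinear", align_corners=False)
            for i, f in enumerate(feats[1:], start=1)]
        cascaded = torch.cat(aligned, dim=1)
        out = torch.cat([branch(cascaded) for branch in self.branches], dim=1)
        return self.integrate(out)                 # skip-connected to the decoder
```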
In one embodiment, as shown in FIG. 3, to fuse multiple layers of context information, the model uses a total of 4 PFM modules, each expressed mathematically as:
(4)
(5)
(6)
Wherein, the quantities appearing in equations (4)-(6) are: the index of the different encoding stages of the encoder; the output features of each encoder stage; the integrated output feature map; the feature map of each PFM module after preliminary integration and cascading; the output of each PFM module after processing; the cascading (concatenation) operation; the up-sampling operation and its up-sampling multiple; the separable hole convolution operation at a given rate; and the convolution with batch normalization and ReLU nonlinear activation.
S400: and the PDCM module is utilized to enhance and fuse the characteristic diagram finally output by the encoder through six different branches, and the characteristic diagram of the higher order is transformed and then sent to the decoder.
In one embodiment, S400 includes:
After the final high-order output features of the encoder are obtained, they are enhanced and fused through six different branches: five branches contain hole convolutions of different numbers and rates, and the last branch is a residual branch used to prevent vanishing gradients. At the end of each hole convolution branch, a sequential operation of convolution, batch normalization and ReLU nonlinear activation is applied as a correction. After the five hole convolution branches reshape the high-order features, the results are spliced by cascading, the channels are integrated through a series of convolution operations, and the result is finally added element by element to the residual branch to obtain the output features, which serve as the input of the decoder.
Specifically, as shown in FIG. 4, the hole convolution branches of the PDCM contain hole convolutions of different numbers and different rates (the rates are chosen to maximize the receptive field without exceeding the size of the high-order features); the hole convolutions provide receptive fields of different sizes for comprehensively extracting high-order feature information at different scales. At the end of each branch, a sequential operation of convolution, batch normalization and ReLU nonlinear activation is applied as a correction. After the five hole convolution branches finely reshape the high-order features, the features are spliced by cascading, and the feature channels are then integrated using the same continuous convolution operation as in the PFM module.
Further, among the PDCM branches, the last branch serves as a residual branch: the high-order input features are fused with the integrated features extracted by the other branches through element-by-element addition to prevent vanishing gradients.
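A compact sketch of the parallel dilated-convolution module follows; the per-branch depths and the rates 1/2/4/8/16 are assumptions, while the five-branch-plus-residual layout, the correction at the end of each branch and the cascaded channel integration follow the description above:

```python
import torch
import torch.nn as nn

class PDCM(nn.Module):
    """Sketch of the parallel dilated (hole) convolution module: five hole
    convolution branches with different depths and rates, a correction at the
    end of each branch, cascaded channel integration, and a residual branch
    added element-wise."""
    def __init__(self, channels: int, rates=(1, 2, 4, 8, 16)):
        super().__init__()
        self.branches = nn.ModuleList()
        for depth, rate in enumerate(rates, start=1):
            layers = []
            for _ in range(depth):   # branch i stacks i hole convolutions (assumed)
                layers += [nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
                           nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
            # correction at the end of the branch: 1x1 conv + BN + ReLU (kernel assumed)
            layers += [nn.Conv2d(channels, channels, 1),
                       nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
            self.branches.append(nn.Sequential(*layers))
        # channel integration after cascading the five branch outputs
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * len(rates), channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        cascaded = torch.cat([branch(x) for branch in self.branches], dim=1)
        # sixth (residual) branch: element-wise addition with the input
        return self.fuse(cascaded) + x
```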
In one embodiment, PDCM branches and outputs can be expressed mathematically as:
(7)
(8)
(9)
(10)
(11)
(12)
(13)
Wherein, the quantities appearing in equations (7)-(13) are: the output feature map of the fifth stage of the encoder; the output of each of the six branches; the hole convolution operation at a given rate; the cascading (concatenation) operation; the continuous operation of convolution, batch normalization and ReLU nonlinear activation; and the point-by-point addition of matrices.
Further, the decoder has the same number of decoding modules as the encoder. Each decoding module consists of the continuous operation of convolution, batch normalization and bilinear interpolation up-sampling, which reshapes the number of channels and restores the feature map stage by stage. Before the output feature map of the current stage is sent to the next decoding stage, it is fused with the output of the corresponding PFM module by point-by-point addition; the fused feature map is then sent to the decoder of the next stage for the same operation, finally yielding the per-stage decoder output feature maps required by the PAM module.
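A minimal sketch of one such decoding stage, assuming a 3x3 kernel and a fixed x2 up-sampling factor, with the point-wise fusion of the same-stage PFM output shown explicitly:

```python
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Sketch of one decoding stage: convolution, batch normalization and
    bilinear up-sampling; the same-stage PFM output is then fused by
    point-by-point addition before being handed to the next stage."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch))

    def forward(self, x, pfm_skip=None):
        x = self.conv(x)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        if pfm_skip is not None:
            x = x + pfm_skip   # point-wise fusion with the same-stage PFM output
        return x
```

Stacking such blocks from the PDCM output back toward the input resolution would yield the per-stage decoder feature maps consumed by the PAM module.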
S500: the PAM module is utilized to locate and segment the target through multiple layers of position attention for each stage of characteristic diagram output by the decoder.
In one embodiment, S500 includes:
S510: extracting an output feature map from each stage of a decoder, acquiring a first type feature and a second type feature in the feature map of each stage by using average pooling and maximum pooling operation, and fusing context information of different stages into a multi-layer mixed feature map in a point-by-point adding and cascading mode;
S520: the multi-layer mixed feature map is used for adaptively adjusting the importance weight of each position point through a position attention module, so that the key target is effectively perceived and positioned; and finally, integrating the restored two-dimensional feature map into the size of the output image to obtain a final segmentation result.
Specifically, the average pooling operation averages, for every spatial point of a feature map, the values over all channels to obtain a pooled map with a single channel; the features appearing in this map are the first type of features. The maximum pooling operation takes, for every spatial point, the maximum value over the channels to obtain a single-channel pooled map; these strongest-responding features are the second type of features.
In one embodiment, as shown in fig. 5, S510 is specifically:
The output feature maps of the individual stages are extracted from the decoder. Each stage's output feature map is first expanded to the same scale as the final decoder output through the same continuous convolution and up-sampling operations as in the PFM module; a first-type feature and a second-type feature are then obtained from each stage's feature map through average pooling and maximum pooling, the two types of features of each stage are fused by point-by-point addition, and the features extracted at the different stages are combined by cascading to obtain a multi-layer hybrid feature map;
S520 specifically comprises: the global first-type feature and the global second-type feature are obtained from the multi-layer hybrid feature map through global average pooling and global maximum pooling along the channel dimension, and the two types of features are then reshaped into one-dimensional vectors; a multi-layer perceptron MLP adaptively adjusts the importance weight of each position point in the two types of features, the weights are restored into two-dimensional feature maps, and the position attention weights of the two feature maps are fused by point-by-point addition; the resulting attention weights are fused with the multi-layer hybrid feature map by point-by-point multiplication, so that the position information of the region of interest is weighted through multi-layer fusion to obtain a position-weighted feature map, realizing effective perception and positioning of key targets; finally, the position-weighted feature map is up-sampled to the size of the output image and the channels are integrated through a series of convolution operations to obtain the final output result.
Specifically, the position attention module adaptively adjusts the importance weight of each position point of the multi-layer hybrid feature map, realizing effective perception and positioning of key targets.
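An illustrative sketch of the position-attention step applied to the multi-layer hybrid feature map built in S510 is shown below; the MLP width, the sigmoid normalization of the fused weights and the output head are assumptions not stated in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAM(nn.Module):
    """Sketch of the position attention module acting on the multi-layer
    hybrid feature map (spatial_size x spatial_size assumed fixed)."""
    def __init__(self, in_channels: int, spatial_size: int, num_classes: int = 1, hidden: int = 256):
        super().__init__()
        n_pos = spatial_size * spatial_size
        # shared MLP that adaptively re-weights every spatial position
        self.mlp = nn.Sequential(
            nn.Linear(n_pos, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, n_pos))
        # channel integration after up-sampling to the output image size
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, num_classes, 1))

    def forward(self, mixed, out_size):
        b, _, h, w = mixed.shape
        # global first-type / second-type features along the channel dimension
        avg_map = mixed.mean(dim=1, keepdim=True)           # (B, 1, H, W)
        max_map = mixed.max(dim=1, keepdim=True).values     # (B, 1, H, W)
        # reshape to one-dimensional position vectors and re-weight with the MLP
        w_avg = self.mlp(avg_map.flatten(1)).view(b, 1, h, w)
        w_max = self.mlp(max_map.flatten(1)).view(b, 1, h, w)
        attention = torch.sigmoid(w_avg + w_max)            # fused position weights
        weighted = mixed * attention                        # position-weighted features
        # up-sample to the output image size and integrate the channels
        weighted = F.interpolate(weighted, size=out_size, mode="bilinear", align_corners=False)
        return self.head(weighted)
```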
In one embodiment, the PAM module overall flow is expressed mathematically as:
(14)
(15)
(16)
(17)
(18)
Wherein, the quantities appearing in equations (14)-(18) are: the multi-layer hybrid feature map obtained after cascading the features extracted at each stage; the average pooling operation along the channel dimension; the maximum pooling operation along the channel dimension; the feature maps extracted from the stages of the decoder; the operation of reshaping a two-dimensional feature map into a one-dimensional feature vector; the reshaped one-dimensional global first-type and global second-type feature vectors; the operation of restoring a one-dimensional feature vector to a two-dimensional feature map; the multi-layer perceptron; the point-wise multiplication of matrices; the position-weighted feature map obtained after the multi-layer position attention; and the final output result.
In an embodiment of the invention, the CT image segmentation effect on liver tumor regions is shown in Fig. 6, where white lines outline the segmentation result of the invention and black lines mark the real tumor regions annotated by several doctors with years of clinical experience. The comparison of the two contours shows that the segmentation result overlaps the real label to a high degree overall and remains accurate for tumor regions with blurred boundaries and tiny features, demonstrating the effectiveness of the invention for segmenting blurred-boundary and tiny targets in CT images.
In the CT image segmentation method based on context fusion perception, the edge information of the target is accurately perceived through the reconstructed skip connections, the context information is deeply fused using cavity convolutions with different rates and a multi-dimensional attention mechanism, and accurate positioning and segmentation of tiny targets are achieved; the segmentation accuracy for blurred-boundary targets and tiny targets in CT images is effectively improved, assisting doctors in early disease diagnosis, formulation of follow-up treatment plans and other clinical applications.
The CT image segmentation method based on context fusion awareness provided by the invention is described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the core concepts of the invention. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.

Claims (7)

1. A CT image segmentation method based on context fusion awareness, the method comprising the steps of:
S100: constructing a CT image segmentation model comprising a backbone structure with an encoder and a decoder, a parallel cavity convolution module PDCM, a pyramid fusion module PFM and a position attention module PAM, and utilizing cross entropy and dice loss as mixing loss to jointly optimize the model;
S200: acquiring an input image, encoding the input image by using an improved ResNet encoder, and outputting encoding results of different stages, wherein the encoding results output by the encoder in each stage have different scales;
S300: the PFM module is utilized to respectively cascade the encoding results of different stages of the encoder, context feature fusion is carried out through separable cavity convolution of different rates, and the output is connected with the decoder of the same stage in a jumping manner; s300 includes:
The PFM module first integrates the output features of different scales from each stage of the backbone encoder using a continuous convolution layer; the feature maps of the stages are then brought to a common scale by bilinear interpolation up-sampling and concatenated, wherein the output features of each stage are context-fused only with deeper features; deep features are then extracted from the different levels through separable cavity convolutions with different rates, the outputs are integrated through a series of convolution and down-sampling operations, and the final result is skip-connected to the decoder of the same stage; wherein the continuous convolution layer is formed by successive convolutions alternating with batch normalization layers and ReLU nonlinear activation layers;
S400: the PDCM module is used to enhance and fuse the final output feature map of the encoder through six different branches, and the refined high-order feature map is fed to the decoder; S400 includes:
After the final high-order output features of the encoder are obtained, they are enhanced and fused through six different branches: five branches contain hole convolutions of different numbers and rates, and the last branch is a residual branch used to prevent vanishing gradients. At the end of each hole convolution branch, a sequential operation of convolution, batch normalization and ReLU nonlinear activation is applied as a correction. After the five hole convolution branches reshape the high-order features, the results are spliced by cascading, the channels are integrated through a series of convolution operations, and the result is finally added element by element to the residual branch to obtain the output features, which serve as the input of the decoder;
S500: the PAM module is used to locate and segment the target through multi-layer position attention applied to the feature maps output by each stage of the decoder; S500 includes:
S510: extracting an output feature map from each stage of a decoder, acquiring a first type feature and a second type feature in the feature map of each stage by using average pooling and maximum pooling operation, and fusing context information of different stages into a multi-layer mixed feature map in a point-by-point adding and cascading mode;
S520: the multi-layer mixed feature map is used for adaptively adjusting the importance weight of each position point through a position attention module, so that the key target is effectively perceived and positioned; and finally, integrating the restored two-dimensional feature map into the size of the output image to obtain a final segmentation result.
2. The method according to claim 1, characterized in that the model is jointly optimized in S100 using cross entropy and dice loss as a mixed loss, in particular:
L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\big[g_i\log p_i + (1-g_i)\log(1-p_i)\big]  (1)
L_{dice} = 1 - \frac{2\sum_{i=1}^{N} g_i p_i}{\sum_{i=1}^{N} g_i + \sum_{i=1}^{N} p_i}  (2)
L_{mix} = L_{ce} + L_{dice}  (3)
Wherein, L_{ce} represents the cross entropy loss, L_{dice} represents the dice loss, L_{mix} represents the mixed loss, g_i represents the true (ground-truth) value of the i-th pixel, p_i represents the predicted value of the i-th pixel, and N represents the number of pixels of the sample; the real label and the prediction result are composed of the g_i and the p_i respectively.
3. The method of claim 1, wherein the improved ResNet encoder in S200 is specifically:
the residual network ResNet pre-trained on ImageNet is used as the backbone structure of the encoder, the last pooling layer and the fully connected layer of ResNet are removed, and the primary features extracted by the residual module of each stage are output before the down-sampling operation;
to achieve effective extraction of the features of each stage, a convolution layer and a ReLU nonlinear activation layer are added after the primary features output by the residual module of each stage, yielding the output features of that stage, where the index runs over the different encoding stages.
4. A method according to claim 3, characterized in that to fuse multiple layers of context information, the model uses a total of 4 PFM modules, each mathematically represented as:
(4)
(5)
(6)
Wherein, the quantities appearing in equations (4)-(6) are: the index of the different encoding stages of the encoder; the output features of each encoder stage; the integrated output feature map; the feature map of each PFM module after preliminary integration and cascading; the output of each PFM module after processing; the cascading (concatenation) operation; the up-sampling operation and its up-sampling multiple; the separable hole convolution operation at a given rate; and the convolution with batch normalization and ReLU nonlinear activation.
5. The method of claim 4, wherein PDCM branches and outputs are represented mathematically as:
(7)
(8)
(9)
(10)
(11)
(12)
(13)
Wherein, the quantities appearing in equations (7)-(13) are: the output feature map of the fifth stage of the encoder; the output of each of the six branches; the hole convolution operation at a given rate; the cascading (concatenation) operation; the continuous operation of convolution, batch normalization and ReLU nonlinear activation; and the point-by-point addition of matrices.
6. The method according to claim 5, wherein S510 is specifically:
The output feature maps of the individual stages are extracted from the decoder. Each stage's output feature map is first expanded to the same scale as the final decoder output through the same continuous convolution and up-sampling operations as in the PFM module; a first-type feature and a second-type feature are then obtained from each stage's feature map through average pooling and maximum pooling, the two types of features of each stage are fused by point-by-point addition, and the features extracted at the different stages are combined by cascading to obtain a multi-layer hybrid feature map;
S520 specifically comprises: the global first-type feature and the global second-type feature are obtained from the multi-layer hybrid feature map through global average pooling and global maximum pooling along the channel dimension, and the two types of features are then reshaped into one-dimensional vectors; a multi-layer perceptron MLP adaptively adjusts the importance weight of each position point in the two types of features, the weights are restored into two-dimensional feature maps, and the position attention weights of the two feature maps are fused by point-by-point addition; the resulting attention weights are fused with the multi-layer hybrid feature map by point-by-point multiplication, so that the position information of the region of interest is weighted through multi-layer fusion to obtain a position-weighted feature map, realizing effective perception and positioning of key targets; finally, the position-weighted feature map is up-sampled to the size of the output image and the channels are integrated through a series of convolution operations to obtain the final output result.
7. The method of claim 6, wherein the PAM module overall flow is expressed mathematically as:
(14)
(15)
(16)
(17)
(18)
Wherein, the quantities appearing in equations (14)-(18) are: the multi-layer hybrid feature map obtained after cascading the features extracted at each stage; the average pooling operation along the channel dimension; the maximum pooling operation along the channel dimension; the feature maps extracted from the stages of the decoder; the operation of reshaping a two-dimensional feature map into a one-dimensional feature vector; the reshaped one-dimensional global first-type and global second-type feature vectors; the operation of restoring a one-dimensional feature vector to a two-dimensional feature map; the multi-layer perceptron; the point-wise multiplication of matrices; the position-weighted feature map obtained after the multi-layer position attention; and the final output result.
CN202410180218.3A 2024-02-18 2024-02-18 CT image segmentation method based on context fusion perception Active CN117745745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410180218.3A CN117745745B (en) 2024-02-18 2024-02-18 CT image segmentation method based on context fusion perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410180218.3A CN117745745B (en) 2024-02-18 2024-02-18 CT image segmentation method based on context fusion perception

Publications (2)

Publication Number Publication Date
CN117745745A CN117745745A (en) 2024-03-22
CN117745745B true CN117745745B (en) 2024-05-10

Family

ID=90279616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410180218.3A Active CN117745745B (en) 2024-02-18 2024-02-18 CT image segmentation method based on context fusion perception

Country Status (1)

Country Link
CN (1) CN117745745B (en)

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598714A (en) * 2019-08-19 2019-12-20 中国科学院深圳先进技术研究院 Cartilage image segmentation method and device, readable storage medium and terminal equipment
CN111444924A (en) * 2020-04-20 2020-07-24 中国科学院声学研究所南海研究站 Method and system for detecting plant diseases and insect pests and analyzing disaster grades
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism
WO2021031066A1 (en) * 2019-08-19 2021-02-25 中国科学院深圳先进技术研究院 Cartilage image segmentation method and apparatus, readable storage medium, and terminal device
WO2021104056A1 (en) * 2019-11-27 2021-06-03 中国科学院深圳先进技术研究院 Automatic tumor segmentation system and method, and electronic device
CN113850825A (en) * 2021-09-27 2021-12-28 太原理工大学 Remote sensing image road segmentation method based on context information and multi-scale feature fusion
US11222217B1 (en) * 2020-08-14 2022-01-11 Tsinghua University Detection method using fusion network based on attention mechanism, and terminal device
CN114037833A (en) * 2021-11-18 2022-02-11 桂林电子科技大学 Semantic segmentation method for Miao-nationality clothing image
CN114219968A (en) * 2021-11-29 2022-03-22 太原理工大学 MA-Xnet-based pavement crack segmentation method
CN114677514A (en) * 2022-04-19 2022-06-28 苑永起 Underwater image semantic segmentation model based on deep learning
CN114897094A (en) * 2022-06-01 2022-08-12 西南科技大学 Esophagus early cancer focus segmentation method based on attention double-branch feature fusion
CN114972756A (en) * 2022-05-30 2022-08-30 湖南大学 Semantic segmentation method and device for medical image
CN115170582A (en) * 2022-06-13 2022-10-11 武汉科技大学 Liver image segmentation method based on multi-scale feature fusion and grid attention mechanism
WO2022227913A1 (en) * 2021-04-25 2022-11-03 浙江师范大学 Double-feature fusion semantic segmentation system and method based on internet of things perception
CN115457021A (en) * 2022-09-30 2022-12-09 云南大学 Skin disease image segmentation method and system based on joint attention convolution neural network
CN115546570A (en) * 2022-08-25 2022-12-30 西安交通大学医学院第二附属医院 Blood vessel image segmentation method and system based on three-dimensional depth network
CN115713624A (en) * 2022-09-02 2023-02-24 郑州大学 Self-adaptive fusion semantic segmentation method for enhancing multi-scale features of remote sensing image
CN115797931A (en) * 2023-02-13 2023-03-14 山东锋士信息技术有限公司 Remote sensing image semantic segmentation method based on double-branch feature fusion
CN116580192A (en) * 2023-04-18 2023-08-11 湖北工业大学 RGB-D semantic segmentation method and system based on self-adaptive context awareness network
CN116681888A (en) * 2023-04-28 2023-09-01 中科超精(南京)科技有限公司 Intelligent image segmentation method and system
CN116912503A (en) * 2023-09-14 2023-10-20 湖南大学 Multi-mode MRI brain tumor semantic segmentation method based on hierarchical fusion strategy
CN117496144A (en) * 2023-11-02 2024-02-02 四川大学 Multi-attention codec network and system applied to skin-loss segmentation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11270447B2 (en) * 2020-02-10 2022-03-08 Hong Kong Applied Science And Technology Institute Company Limited Method for image segmentation using CNN

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021031066A1 (en) * 2019-08-19 2021-02-25 中国科学院深圳先进技术研究院 Cartilage image segmentation method and apparatus, readable storage medium, and terminal device
CN110598714A (en) * 2019-08-19 2019-12-20 中国科学院深圳先进技术研究院 Cartilage image segmentation method and device, readable storage medium and terminal equipment
WO2021104056A1 (en) * 2019-11-27 2021-06-03 中国科学院深圳先进技术研究院 Automatic tumor segmentation system and method, and electronic device
CN111444924A (en) * 2020-04-20 2020-07-24 中国科学院声学研究所南海研究站 Method and system for detecting plant diseases and insect pests and analyzing disaster grades
US11222217B1 (en) * 2020-08-14 2022-01-11 Tsinghua University Detection method using fusion network based on attention mechanism, and terminal device
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism
WO2022227913A1 (en) * 2021-04-25 2022-11-03 浙江师范大学 Double-feature fusion semantic segmentation system and method based on internet of things perception
CN113850825A (en) * 2021-09-27 2021-12-28 太原理工大学 Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN114037833A (en) * 2021-11-18 2022-02-11 桂林电子科技大学 Semantic segmentation method for Miao-nationality clothing image
CN114219968A (en) * 2021-11-29 2022-03-22 太原理工大学 MA-Xnet-based pavement crack segmentation method
CN114677514A (en) * 2022-04-19 2022-06-28 苑永起 Underwater image semantic segmentation model based on deep learning
CN114972756A (en) * 2022-05-30 2022-08-30 湖南大学 Semantic segmentation method and device for medical image
CN114897094A (en) * 2022-06-01 2022-08-12 西南科技大学 Esophagus early cancer focus segmentation method based on attention double-branch feature fusion
CN115170582A (en) * 2022-06-13 2022-10-11 武汉科技大学 Liver image segmentation method based on multi-scale feature fusion and grid attention mechanism
CN115546570A (en) * 2022-08-25 2022-12-30 西安交通大学医学院第二附属医院 Blood vessel image segmentation method and system based on three-dimensional depth network
CN115713624A (en) * 2022-09-02 2023-02-24 郑州大学 Self-adaptive fusion semantic segmentation method for enhancing multi-scale features of remote sensing image
CN115457021A (en) * 2022-09-30 2022-12-09 云南大学 Skin disease image segmentation method and system based on joint attention convolution neural network
CN115797931A (en) * 2023-02-13 2023-03-14 山东锋士信息技术有限公司 Remote sensing image semantic segmentation method based on double-branch feature fusion
CN116580192A (en) * 2023-04-18 2023-08-11 湖北工业大学 RGB-D semantic segmentation method and system based on self-adaptive context awareness network
CN116681888A (en) * 2023-04-28 2023-09-01 中科超精(南京)科技有限公司 Intelligent image segmentation method and system
CN116912503A (en) * 2023-09-14 2023-10-20 湖南大学 Multi-mode MRI brain tumor semantic segmentation method based on hierarchical fusion strategy
CN117496144A (en) * 2023-11-02 2024-02-02 四川大学 Multi-attention codec network and system applied to skin-loss segmentation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Branch Aggregation Attention Network for Robotic Surgical Instrument Segmentation; Wenting Shen et al.; IEEE Transactions on Medical Imaging; 2023-06-21; Vol. 42, No. 11; 3408-3419 *
LSKANet: Long Strip Kernel Attention Network for Robotic Surgical Scene Segmentation; Min Liu et al.; IEEE Transactions on Medical Imaging; 2023-11-28; 1-15 *
Image semantic segmentation technology based on convolutional neural networks; 田启川, 孟颖; Journal of Chinese Computer Systems (小型微型计算机系统); 2020-05-29 (No. 06); 184-195 *
Real-time semantic segmentation algorithm based on feature fusion; 蔡雨, 黄学功, 张志安, 朱新年, 马祥; Laser & Optoelectronics Progress (激光与光电子学进展); 2020-12-31 (No. 02); 137-144 *
Segmentation of lung images using an improved convolutional neural network; 钱宝鑫, 肖志勇, 宋威; Journal of Frontiers of Computer Science and Technology (计算机科学与探索); 2020-12-31 (No. 08); 102-111 *

Also Published As

Publication number Publication date
CN117745745A (en) 2024-03-22

Similar Documents

Publication Publication Date Title
Wang et al. Hybrid dilation and attention residual U-Net for medical image segmentation
CN111951288B (en) Skin cancer lesion segmentation method based on deep learning
CN112102321A (en) Focal image segmentation method and system based on deep convolutional neural network
CN113506310B (en) Medical image processing method and device, electronic equipment and storage medium
CN113436173B (en) Abdominal multi-organ segmentation modeling and segmentation method and system based on edge perception
Ding et al. FTransCNN: Fusing Transformer and a CNN based on fuzzy logic for uncertain medical image segmentation
CN110648331B (en) Detection method for medical image segmentation, medical image segmentation method and device
CN112288041B (en) Feature fusion method of multi-mode deep neural network
Yamanakkanavar et al. MF2-Net: A multipath feature fusion network for medical image segmentation
CN115375711A (en) Image segmentation method of global context attention network based on multi-scale fusion
CN116309651B (en) Endoscopic image segmentation method based on single-image deep learning
Shan et al. SCA-Net: A spatial and channel attention network for medical image segmentation
CN117078930A (en) Medical image segmentation method based on boundary sensing and attention mechanism
CN114399510B (en) Skin focus segmentation and classification method and system combining image and clinical metadata
CN116051589A (en) Method and device for segmenting lung parenchyma and pulmonary blood vessels in CT image
CN117392153B (en) Pancreas segmentation method based on local compensation and multi-scale adaptive deformation
Dai et al. CAN3D: Fast 3D medical image segmentation via compact context aggregation
Ghaleb Al-Mekhlafi et al. Hybrid techniques for diagnosing endoscopy images for early detection of gastrointestinal disease based on fusion features
Ma et al. Segmenting lung lesions of COVID-19 from CT images via pyramid pooling improved Unet
CN116935044B (en) Endoscopic polyp segmentation method with multi-scale guidance and multi-level supervision
Li et al. MFA-Net: Multiple Feature Association Network for medical image segmentation
CN117745745B (en) CT image segmentation method based on context fusion perception
Zhao et al. Multi-to-binary network (MTBNet) for automated multi-organ segmentation on multi-sequence abdominal MRI images
CN116542988A (en) Nodule segmentation method, nodule segmentation device, electronic equipment and storage medium
CN116309679A (en) MLP-like medical image segmentation method suitable for multiple modes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant