CN113889234A - Medical image segmentation method based on channel mixing coding and decoding network - Google Patents
- Publication number
- CN113889234A (application CN202111154112.9A)
- Authority
- CN
- China
- Prior art keywords
- medical image
- characteristic diagram
- channel
- feature map
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/40—ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- Radiology & Medical Imaging (AREA)
- Epidemiology (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
The invention relates to a medical image segmentation method based on a channel-mixing encoding-decoding network, comprising the following steps: acquiring a medical image to be segmented; feeding the medical image to be segmented into a trained network model for recognition to obtain a segmented medical image. The backbone of the trained network model is a symmetric U-Net structure, in which each feature map of the decoder part is obtained by processing the same-sized encoder feature map through a self-attention module and a max-upsampling index matrix, respectively, and fusing the two results. Most segmentation algorithms borrow the skip connections of U-Net to fuse information at different scales and guide information transfer; unlike these conventional algorithms, the method proposes a channel-mixing self-attention module to replace the skip-connection structure and max-index upsampling to replace transposed convolution, thereby realizing feature upsampling in the decoding network.
Description
Technical Field
The invention relates to the technical field of computer image recognition, and in particular to a medical image segmentation method based on a channel-mixing encoding-decoding network.
Background
Computer vision technology is currently applied in many scenarios, including image classification, object detection, three-dimensional reconstruction, and semantic segmentation. With the rapid development of Internet communication, the competitiveness of intelligent products demands technical breakthroughs in higher-level semantic scene understanding. Semantic segmentation, a core problem of computer vision, can therefore help more and more products automatically and efficiently understand the relevant knowledge or semantics in images or videos, achieving the goal of intelligence, reducing manual interaction, and improving the user experience. Such products are already widely used in autonomous driving, human-computer interaction, computational photography, image search engines, augmented reality, and other fields.
The semantic segmentation problem in computer vision is essentially a process of reasoning from coarse to fine. It starts from the classification problem, i.e., roughly predicting the object class of an input sample; next comes localization and detection of the target object, which predicts not only the class but also additional spatial information for each class, such as the center point or bounding box of the object region. On this basis, semantic segmentation can be understood as fine-grained prediction in the detection field: a test image is input to the segmentation network, which predicts a heat map whose size matches the input image and whose number of channels equals the number of classes; each channel represents the probability that each spatial position belongs to the corresponding class, so classification can be carried out pixel by pixel.
The fully convolutional network (FCN) became the foundation for applying deep learning to the semantic segmentation problem. It accepts an input image of arbitrary size and upsamples the feature map of the last convolution of the encoding network through several deconvolution layers, restoring it to the size of the input image, so that a prediction is generated for each pixel while the spatial information of the original input is preserved. Many semantic segmentation models were subsequently derived from the FCN, such as U-Net, a symmetric network with skip connections between encoder and decoder; the DeepLab series, which introduces dilated convolution and post-processing optimization with a conditional random field (CRF); and ParseNet, which fuses features using context information.
Medical image segmentation is a complex and key step in medical image processing and analysis. Its goal is to segment the parts of a medical image that carry special meaning and to extract relevant features, providing a reliable basis for clinical diagnosis and pathology research and helping doctors make more accurate diagnoses. However, automatically segmenting targets from medical images with common algorithms remains difficult: medical images are highly complex, lack simple linear features, and suffer from partial volume effects, gray-scale inhomogeneity, artifacts, and small gray-value differences between soft tissues, so segmentation accuracy is often low.
Disclosure of Invention
The invention aims to provide a medical image segmentation method based on a channel-mixing encoding-decoding network that can better segment complex images.
In order to achieve this purpose, the invention adopts the following technical scheme: a medical image segmentation method based on a channel-mixing encoding-decoding network comprises the following steps: acquiring a medical image to be segmented; feeding the medical image to be segmented into a trained network model for recognition to obtain a segmented medical image; the backbone of the trained network model is a symmetric U-Net structure, and each feature map of the decoder part of the U-Net structure is obtained by processing the same-sized encoder feature map through a self-attention module and a max-upsampling index matrix, respectively, and fusing the two results.
Compared with the prior art, the invention has the following technical effects: most segmentation algorithms borrow the skip connections of U-Net to fuse information at different scales and guide information transfer; unlike these conventional algorithms, the method proposes a channel-mixing self-attention module to replace the skip-connection structure and max-index upsampling to replace transposed convolution, realizing feature upsampling in the decoding network; in addition, a staged combined-loss strategy is provided to balance training speed and training accuracy.
Drawings
FIG. 1 is a block diagram of the architecture of a network model in the present invention;
FIG. 2 is a detailed block diagram of a network model in the present invention;
FIG. 3 is a detailed block diagram of the self-attention module of the present invention;
FIG. 4 is a schematic diagram of max-upsampling index matrix concatenation.
Detailed Description
The present invention will be described in further detail with reference to fig. 1 to 4.
Referring to FIG. 1, a medical image segmentation method based on a channel-mixing encoding-decoding network includes the following steps: acquiring a medical image to be segmented; feeding the medical image to be segmented into a trained network model for recognition to obtain a segmented medical image. The backbone of the trained network model is a symmetric U-Net structure, and each feature map of the decoder part of the U-Net structure is obtained by processing the same-sized encoder feature map through a self-attention module and a max-upsampling index matrix, respectively, and fusing the two results. Most segmentation algorithms borrow the skip connections of U-Net to fuse information at different scales and guide information transfer; unlike these conventional algorithms, the method proposes a channel-mixing self-attention module to replace the skip-connection structure and max-index upsampling to replace transposed convolution, realizing feature upsampling in the decoding network. The self-attention module encodes the global feature information of a large region and adds it to the local feature information, so that the local features of the feature map carry information dependent on the global spatial features, further enhancing the representation capability of the module. The max-upsampling index matrix stores the positions of the pooled points in the symmetric encoding stage; in the decoding stage, the upsampling operation uses the index matrix recorded by the symmetric encoder to determine which position of the original 2x2 region each pooled 1x1 feature point came from.
Through the processing of the self-attention module and the maximum up-sampling index matrix, more information can be fused when the feature map is subjected to image semantic reduction, and the final segmentation precision is improved.
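The pooling-with-indices and index-based upsampling described above can be sketched in NumPy as follows. The function names are our own illustrations, not from the patent; a real implementation would use a framework's built-in max-pool/unpool with indices:

```python
import numpy as np

def max_pool_2x2_with_indices(x):
    """2x2 max pooling over a 2D map that also records, for each pooled
    point, the flat position of the maximum inside the input map."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2), dtype=x.dtype)
    indices = np.zeros((h // 2, w // 2), dtype=np.int64)
    for i in range(h // 2):
        for j in range(w // 2):
            block = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            k = int(np.argmax(block))  # position inside the 2x2 block
            pooled[i, j] = block.flat[k]
            indices[i, j] = (2 * i + k // 2) * w + (2 * j + k % 2)
    return pooled, indices

def max_unpool_2x2(pooled, indices, out_shape):
    """Decoder-side upsampling: each pooled value goes back to the exact
    position recorded in the index matrix; all other positions stay zero."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 3., 2.],
              [2., 6., 1., 0.]])
pooled, idx = max_pool_2x2_with_indices(x)   # pooled -> [[4, 5], [6, 3]]
restored = max_unpool_2x2(pooled, idx, x.shape)
```

Unlike transposed convolution, this upsampling has no learned weights; it only reuses the positional information saved by the symmetric encoder.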
There are many ways to configure a U-Net structure around the self-attention module and the max-upsampling index matrix; the invention preferably adopts the scheme shown in FIG. 2. Because the network structure is complicated and inconvenient to express directly, it is presented here by describing the concrete steps of image processing. Specifically, the U-Net structure processes an input image into a segmented medical image as follows: S100, perform a convolution operation on the input image to obtain feature map a_1; the number of channels, height, and width of the input image are denoted C, H, and W. S200, perform convolution and pooling on feature map a_i to obtain feature map a_{i+1}, where i ∈ {1, 2, …, n−1}; that is, convolve and pool a_1 to obtain a_2, convolve and pool a_2 to obtain a_3, and so on, finally obtaining feature map a_n. S300, process feature map a_i with the self-attention module and feature map b_{i+1} with the max-upsampling index matrix, and fuse the two results to obtain feature map b_i. S400, perform a convolution operation on feature map b_1 to obtain feature map c. S500, perform a softmax operation on feature map c to obtain the segmented medical image. In the above steps, the input image and feature map c have the same width and height; the number of channels of feature map c is K, where K is the number of classes, and the value in each channel represents the probability that each pixel position belongs to that class.
In step S500, the number of channels K of feature map c is the number of classes after the medical image is segmented; the segmented medical image is a heat map, and each of the K classes in the heat map is displayed in a different color. Feature maps a_i and b_i have exactly the same size, i.e., equal channel number, height, and width. Through these steps, the input image can be conveniently processed into a classified heat map.
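The per-pixel softmax of step S500 over the K channels of feature map c can be sketched as follows (illustrative NumPy only; the function name is our own):

```python
import numpy as np

def pixelwise_softmax(c):
    """Softmax over the channel axis of a K x H x W score map, yielding
    per-pixel class probabilities (step S500)."""
    e = np.exp(c - c.max(axis=0, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=0, keepdims=True)

K, H, W = 3, 2, 2
c = np.arange(K * H * W, dtype=float).reshape(K, H, W)
probs = pixelwise_softmax(c)     # K x H x W, channel sums equal 1
labels = probs.argmax(axis=0)    # H x W map of predicted classes
```

Taking the argmax over channels turns the probability heat map into the per-pixel class map that is finally colored and displayed.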
A specific note: a typical U-Net model comprises an encoder and a decoder. The encoder extracts high-level semantic features from the feature maps, the decoder restores the image semantics, and the two are connected through the bottom-most feature map, which can be assigned to either the encoder or the decoder, or treated as standing on its own. In the present invention, for convenience of description, feature map a_n and feature map b_n refer to the same thing, namely the bottom-most feature map shown in FIG. 2.
In conventional algorithm models, upsampling is generally realized by transposed convolution, and the encoder feature maps are copied directly into the decoder; the present improvement targets exactly this. There are many concrete schemes for implementing the self-attention module and the max-upsampling index processing. Preferably, in step S200, a max-upsampling index matrix d_i is generated when feature map a_i is pooled; this matrix, shown in FIG. 4, records the positions of the pooled points in the encoding stage. Step S300 then comprises the following steps: S310, pass feature map a_i through the self-attention module to obtain feature map b_{i,1}; S320, perform a convolution operation on feature map b_{i+1} to obtain an intermediate feature map; S330, upsample the intermediate feature map by max-pooling indices according to the max-upsampling index matrix d_i to obtain feature map b_{i,2}; S340, fuse feature maps b_{i,1} and b_{i,2} by channel to obtain feature map b_i. In the above steps, feature maps a_i, b_{i,1}, b_{i,2}, and b_i all have exactly the same size. The channel fusion is: select some channels from feature map b_{i,1} and the remaining channels from feature map b_{i,2} to jointly form feature map b_i; preferably, in step S340, half of the channels are taken from b_{i,1} and half from b_{i,2}. Through these steps, the image semantics are restored and the final classification heat map is more accurate.
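The channel fusion of step S340 can be sketched as below. The text only fixes that half the channels come from each branch; taking the first half of b_{i,1} and the second half of b_{i,2} is one plausible reading, not the patent's stated choice:

```python
import numpy as np

def channel_fusion(b_att, b_unpool):
    """Form b_i by taking half of the channels from the self-attention
    branch (b_{i,1}) and half from the max-index-upsampling branch
    (b_{i,2}), as in step S340. Which halves are taken is an assumption."""
    C = b_att.shape[0]
    assert b_att.shape == b_unpool.shape and C % 2 == 0
    return np.concatenate([b_att[:C // 2], b_unpool[C // 2:]], axis=0)

b_att = np.ones((4, 2, 2))         # stand-in for b_{i,1}
b_unpool = 2 * np.ones((4, 2, 2))  # stand-in for b_{i,2}
b_i = channel_fusion(b_att, b_unpool)
```

The fused map keeps the size of its inputs, as required by the statement that a_i, b_{i,1}, b_{i,2}, and b_i are all the same size.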
Referring to FIG. 3, step S310 further comprises the following steps: S311, perform convolution operations on part of the channels of feature map a_i to obtain feature map f and feature map g, respectively; S312, perform a convolution operation on feature map a_i to obtain feature map h; S313, reshape feature maps f and g into two-dimensional matrices of size (number of channels) × (height × width); S314, transpose the two-dimensional matrix corresponding to feature map f and multiply it by the two-dimensional matrix corresponding to feature map g to obtain the attention-layer matrix; S315, multiply feature map h by the attention-layer matrix to obtain feature map b_{i,1}. Preferably, in step S311, the "part" of the channels is 1/4 or 1/8, i.e., the ratio of the channel number of feature maps f and g to the channel number of feature map a_i is 1/4 or 1/8. The 1/8 case is described in detail below in conjunction with FIG. 3.
Suppose the feature map a_i to be processed has size C × H × W and the channel fraction equals 1/8. After the convolution operations, feature maps f and g both have size (C/8) × H × W; each is then converted to a two-dimensional matrix of size (C/8) × (H × W), i.e., a matrix with C/8 rows and H × W elements per row. The matrix corresponding to feature map f is transposed to obtain an (H × W) × (C/8) matrix, which is multiplied by the matrix corresponding to feature map g to obtain a two-dimensional matrix of size (H × W) × (H × W): the attention-layer matrix. Finally, feature map h is multiplied by the attention-layer matrix to obtain feature map b_{i,1}, which has the same size as a_i. Through these steps, the global feature information of a large region is encoded and added to the local feature information, so that the local features of the feature map carry information dependent on the global spatial features.
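The matrix arithmetic of steps S311 to S315 can be traced shape-by-shape in NumPy. The random matrices below are stand-ins for the learned 1x1 convolutions (an assumption for illustration), and no softmax normalisation is applied because the text does not mention one:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
a_i = rng.standard_normal((C, H, W))

# Stand-ins for the learned 1x1 convolutions: f and g reduce to C/8
# channels (the 1/8 case described in the text), h keeps C channels.
Wf = rng.standard_normal((C // 8, C))
Wg = rng.standard_normal((C // 8, C))
Wh = rng.standard_normal((C, C))

x = a_i.reshape(C, H * W)           # S313: flatten spatial dimensions
f = Wf @ x                          # (C/8, H*W)
g = Wg @ x                          # (C/8, H*W)
h = Wh @ x                          # (C,   H*W)

attn = f.T @ g                      # S314: (H*W, H*W) attention-layer matrix
b_i1 = (h @ attn).reshape(C, H, W)  # S315: same size as a_i
```

Every entry of the (H × W) × (H × W) attention-layer matrix relates one spatial position to another, which is how global spatial information enters the local features.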
For medical image segmentation, cross-entropy loss and dice loss are the commonly used loss functions. The present invention uses a weighted combination of the two, with the weighting controlled by a time coefficient λ. Specifically, the loss function used in training the U-Net is:

loss = λ · L_CE + (1 − λ) · L_Dice

L_CE = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} q_ij · log(p_ij)

L_Dice = 1 − (2 Σ_{i=1}^{N} Σ_{j=1}^{K} p_ij · q_ij + ε) / (Σ_{i=1}^{N} Σ_{j=1}^{K} p_ij + Σ_{i=1}^{N} Σ_{j=1}^{K} q_ij + ε)

where p_ij is the predicted probability that the ith sample belongs to the jth class, q_ij is the probability that the ith sample carries the jth class label, N is the number of samples, K is the number of classes, ε is a smoothing factor, and λ is a time coefficient used to change the weighting of the two losses. Dice loss handles class imbalance well, but when the values of p and q are very small the computed gradient can be very large, making training unstable; the time coefficient λ is therefore introduced. Early in training, λ is close to 1 and the cross-entropy term dominates; as training progresses, λ gradually decreases and 1 − λ increases, so the cross-entropy weight falls and the dice weight rises; late in training, λ is close to 0 and the dice term dominates.
There are many ways to compute the time coefficient λ; it need only be slightly below and close to 1 early in training and slightly above and close to 0 late in training. Preferably, in the present invention, the time coefficient is computed from t, the ratio of the current iteration number to the loss-switching number T, where λ0 and T are predefined hyperparameters and λ0 ranges from 5 to 10. A time coefficient set this way weights the two losses more reasonably.
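As a sketch, one decreasing schedule consistent with the stated behaviour (close to 1 early, close to 0 late, driven by t = iteration / T and a hyperparameter λ0 in [5, 10]) is an exponential decay. The exact formula in the patent is not legible in this text, so the form below is an assumption:

```python
import math

def time_coefficient(iteration, T, lam0):
    """lam = exp(-lam0 * t) with t = iteration / T: starts near 1 and
    decays towards 0. The exponential form is an assumption; the patent
    only fixes the endpoint behaviour and lam0 in [5, 10]."""
    t = iteration / T
    return math.exp(-lam0 * t)

lam_start = time_coefficient(0, 100, 5)    # 1.0 at the first iteration
lam_mid = time_coefficient(50, 100, 5)
lam_end = time_coefficient(100, 100, 5)    # exp(-5), close to 0
```

Any monotonically decreasing function with these endpoints would serve the same purpose of switching the loss weighting from cross-entropy to dice.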
After the network model is constructed according to the preceding steps, it needs to be trained. Training can proceed according to the following steps:
1. preparing a public data set, and determining the input size of the data set;
2. build the training network and determine every network parameter and hyperparameter used in training: the hyperparameters λ0 and T mentioned above, plus the usual U-Net training parameters such as the initial learning rate and the total number of epochs; the learning rate is reduced after a certain number of iterations, so its value gradually decreases as iteration progresses;
3. inputting a sample set, and performing batch sampling training;
4. finally, validate and test the model with a validation set; after the tested network model is saved, it is used to recognize the pictures to be examined.
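The learning-rate reduction mentioned in step 2 can be realised, for example, with a step-decay schedule. The exact schedule is not specified in the text, so this form and its parameters are illustrative choices:

```python
def step_decay_lr(initial_lr, iteration, step_size, gamma=0.1):
    """Multiply the learning rate by gamma every step_size iterations.
    Step decay is one common choice matching 'reduced after a certain
    number of iterations'; the patent does not name a specific schedule."""
    return initial_lr * (gamma ** (iteration // step_size))

lr_start = step_decay_lr(0.1, 0, 30)    # 0.1
lr_after = step_decay_lr(0.1, 30, 30)   # 0.01 (first reduction)
lr_late = step_decay_lr(0.1, 65, 30)    # 0.001 (second reduction)
```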
Claims (10)
1. A medical image segmentation method based on a channel-mixing encoding-decoding network, characterized by comprising the following steps:
acquiring a medical image to be segmented;
importing the medical image to be segmented into a trained network model for recognition to obtain a segmented medical image;
the backbone of the trained network model is a symmetric U-Net structure, and each feature map of the decoder part of the U-Net structure is obtained by processing the same-sized encoder feature map through a self-attention module and a max-upsampling index matrix, respectively, and fusing the two results.
2. The medical image segmentation method based on a channel-mixing encoding-decoding network according to claim 1, characterized in that the U-Net structure processes an input image into a segmented medical image according to the following steps:
S100, perform a convolution operation on the input image to obtain feature map a_1;
S200, perform convolution and pooling on feature map a_i to obtain feature map a_{i+1}, where i ∈ {1, 2, …, n−1}, and a_n is the bottom-most feature map b_n;
S300, process feature map a_i with the self-attention module and feature map b_{i+1} with the max-upsampling index matrix, and fuse the two results to obtain feature map b_i;
S400, perform a convolution operation on feature map b_1 to obtain feature map c;
S500, perform a softmax operation on feature map c to obtain the segmented medical image;
in the above steps, the input image has the same width and height as feature map c, and feature maps a_i and b_i have exactly the same size.
3. The medical image segmentation method based on a channel-mixing encoding-decoding network according to claim 1, characterized in that in step S200 a max-upsampling index matrix d_i is generated when feature map a_i is pooled, and step S300 comprises the following steps:
S310, pass feature map a_i through the self-attention module to obtain feature map b_{i,1};
S320, perform a convolution operation on feature map b_{i+1} to obtain an intermediate feature map;
S330, upsample the intermediate feature map by max-pooling indices according to the max-upsampling index matrix d_i to obtain feature map b_{i,2};
S340, fuse feature maps b_{i,1} and b_{i,2} by channel to obtain feature map b_i;
in the above steps, feature maps a_i, b_{i,1}, b_{i,2}, and b_i all have exactly the same size; the channel fusion is: selecting some channels from feature map b_{i,1} and the remaining channels from feature map b_{i,2} to jointly form feature map b_i.
4. The medical image segmentation method based on a channel-mixing encoding-decoding network according to claim 3, characterized in that step S310 comprises the following steps:
S311, perform convolution operations on part of the channels of feature map a_i to obtain feature map f and feature map g, respectively;
S312, perform a convolution operation on feature map a_i to obtain feature map h;
S313, reshape feature maps f and g into two-dimensional matrices of size (number of channels) × (height × width);
S314, transpose the two-dimensional matrix corresponding to feature map f and multiply it by the two-dimensional matrix corresponding to feature map g to obtain the attention-layer matrix;
S315, multiply feature map h by the attention-layer matrix to obtain feature map b_{i,1}.
5. The medical image segmentation method based on a channel-mixing encoding-decoding network according to claim 3, characterized in that in step S340 half of the channels are selected from feature map b_{i,1} and half from feature map b_{i,2} to constitute feature map b_i.
6. The medical image segmentation method based on a channel-mixing encoding-decoding network according to claim 2, characterized in that in step S500 the number of channels K of feature map c is the number of classes after the medical image is segmented; the segmented medical image is a heat map, and each of the K classes in the heat map is displayed in a different color.
7. The medical image segmentation method based on a channel-mixing encoding-decoding network according to claim 4, characterized in that in step S311 the "part" of the channels is 1/4 or 1/8.
8. The medical image segmentation method based on a channel-mixing encoding-decoding network according to any one of claims 1 to 7, characterized in that the loss function in training the U-Net is:
loss = λ · L_CE + (1 − λ) · L_Dice, with
L_CE = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} q_ij · log(p_ij),
L_Dice = 1 − (2 Σ_{i,j} p_ij · q_ij + ε) / (Σ_{i,j} p_ij + Σ_{i,j} q_ij + ε),
where p_ij is the predicted probability that the ith sample belongs to the jth class, q_ij is the probability that the ith sample carries the jth class label, N is the number of samples, K is the number of classes, ε is a smoothing factor, and λ is a time coefficient used to change the weighting of the two losses.
9. The medical image segmentation method based on a channel-mixing encoding-decoding network according to claim 8, characterized in that the time coefficient λ is computed from t, the ratio of the current iteration number to the loss-switching number T, where λ0 and T are predefined hyperparameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111154112.9A CN113889234A (en) | 2021-09-29 | 2021-09-29 | Medical image segmentation method based on channel mixing coding and decoding network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111154112.9A CN113889234A (en) | 2021-09-29 | 2021-09-29 | Medical image segmentation method based on channel mixing coding and decoding network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113889234A true CN113889234A (en) | 2022-01-04 |
Family
ID=79008408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111154112.9A Pending CN113889234A (en) | 2021-09-29 | 2021-09-29 | Medical image segmentation method based on channel mixing coding and decoding network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113889234A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114419318A (en) * | 2022-01-18 | 2022-04-29 | 北京工业大学 | Medical image segmentation method based on deep learning |
CN114612479A (en) * | 2022-02-09 | 2022-06-10 | 苏州大学 | Medical image segmentation method based on global and local feature reconstruction network |
CN115731243A (en) * | 2022-11-29 | 2023-03-03 | 北京长木谷医疗科技有限公司 | Spine image segmentation method and device based on artificial intelligence and attention mechanism |
CN115731243B (en) * | 2022-11-29 | 2024-02-09 | 北京长木谷医疗科技股份有限公司 | Spine image segmentation method and device based on artificial intelligence and attention mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114529825B (en) | Target detection model, method and application for fire fighting access occupied target detection | |
CN113889234A (en) | Medical image segmentation method based on channel mixing coding and decoding network | |
CN111738363B (en) | Alzheimer disease classification method based on improved 3D CNN network | |
CN111210435A (en) | Image semantic segmentation method based on local and global feature enhancement module | |
CN110246148B (en) | Multi-modal significance detection method for depth information fusion and attention learning | |
CN112396607A (en) | Streetscape image semantic segmentation method for deformable convolution fusion enhancement | |
CN114943963A (en) | Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network | |
CN110929736A (en) | Multi-feature cascade RGB-D significance target detection method | |
CN110490082A (en) | A kind of road scene semantic segmentation method of effective integration neural network characteristics | |
CN112132834B (en) | Ventricular image segmentation method, ventricular image segmentation system, ventricular image segmentation device and storage medium | |
CN111798469A (en) | Digital image small data set semantic segmentation method based on deep convolutional neural network | |
CN111401436A (en) | Streetscape image segmentation method fusing network and two-channel attention mechanism | |
CN116309648A (en) | Medical image segmentation model construction method based on multi-attention fusion | |
CN113192073A (en) | Clothing semantic segmentation method based on cross fusion network | |
CN114549574A (en) | Interactive video matting system based on mask propagation network | |
CN116469100A (en) | Dual-band image semantic segmentation method based on Transformer | |
CN113111906B (en) | Method for generating confrontation network model based on condition of single pair image training | |
CN117351363A (en) | Remote sensing image building extraction method based on transducer | |
CN116703885A (en) | Swin transducer-based surface defect detection method and system | |
CN116205962B (en) | Monocular depth estimation method and system based on complete context information | |
CN114049314A (en) | Medical image segmentation method based on feature rearrangement and gated axial attention | |
CN117095287A (en) | Remote sensing image change detection method based on space-time interaction transducer model | |
Yang et al. | Xception-based general forensic method on small-size images | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN114463187B (en) | Image semantic segmentation method and system based on aggregation edge features |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | |
Address after: 230088 21/F, Building A1, Phase I, Zhong'an Chuanggu Science Park, No. 900 Wangjiang West Road, High-tech Zone, Hefei, Anhui
Applicant after: HEFEI HIGH DIMENSIONAL DATA TECHNOLOGY Co.,Ltd.
Address before: 230088 Block C, Building J2, Innovation Industrial Park, 2800 Innovation Avenue, High-tech Zone, Hefei City, Anhui Province
Applicant before: HEFEI HIGH DIMENSIONAL DATA TECHNOLOGY Co.,Ltd.