CN113344933B - Glandular cell segmentation method based on multi-level feature fusion network

Info

Publication number
CN113344933B
CN113344933B
Authority
CN
China
Prior art keywords
convolution
layer
decoder
image
encoder
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110608183.5A
Other languages
Chinese (zh)
Other versions
CN113344933A (en)
Inventor
饶云波
王艺霖
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110608183.5A
Publication of CN113344933A
Application granted
Publication of CN113344933B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G06T 9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30024 Cell structures in vitro; Tissue sections in vitro

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a glandular cell segmentation method based on a multi-level feature fusion network, belonging to the technical field of medical image processing. The glandular cell segmentation network model provided by the invention comprises an encoder and a decoder. In the encoder stage, the input image is down-sampled to generate feature maps of different scales, which are then channel-concatenated with the feature maps of corresponding scale produced by max pooling inside the encoder, realizing multi-level feature input and strengthening the propagation of image features. In the decoder stage, the feature map information retained at different levels during down-sampling is concatenated with the feature maps of corresponding size during each up-sampling operation, recombining shallow image features and compensating for the loss of pixel positions; this reduces the error in predicting and locating the pixel positions of the feature maps during decoding and achieves efficient glandular cell image segmentation.

Description

Glandular cell segmentation method based on multi-level feature fusion network
Technical Field
The invention belongs to the technical field of medical image processing, and particularly relates to a segmentation method for a gland cell image.
Background
A normal gland consists of a tubular luminal area and epithelial nuclei surrounding the cytoplasm; a malignant tumor arising from glandular epithelial cells is called an adenocarcinoma. Conventional treatment plans often depend on the grade and stage of the adenocarcinoma. Annotation and segmentation of gland morphology in histopathological images is an important step and means by which medical experts judge the grade of cancers such as colon, breast and prostate cancer. This work matters greatly for patient treatment, because accurate gland segmentation facilitates targeted, personalized treatment design and thereby raises the cure rate. However, manual annotation and segmentation of glandular cells by medical professionals demands intense concentration and is both laborious and time-consuming. Moreover, since judgment of gland cell morphology is subjective, long periods of fatiguing work greatly increase the experts' error rate and with it the probability of danger to patients' lives. Clinical practice therefore places high demands on automatic glandular cell image segmentation methods: they must reduce the workload of medical experts while delivering high efficiency, high precision and high reliability.
The rise of computational pathology has driven the development of automated gland segmentation methods aimed at overcoming the challenges of manual segmentation. The task is nevertheless very challenging, because glandular cells vary greatly in appearance and certain glandular and non-glandular tissue structures are difficult to distinguish. Furthermore, a measure of uncertainty is essential for diagnostic decision-making. Automated gland segmentation is therefore crucial for improving the efficiency and reliability of clinical work and reducing the workload on pathologists. Past research points to two main reasons why high-precision, high-efficiency segmentation of glandular cells is difficult:
(1) gland morphology differs greatly between benign and malignant cases: benign glands generally have a circular structure, malignant glands generally have irregular structures, and malignant gland morphology varies more widely with cancer grade, which increases the difficulty of accurately delineating single gland cells from tissue;
(2) the gaps between adjacent gland cells are narrow, and delineating the boundary of each gland cell requires higher resolution, which increases the difficulty of accurately segmenting each gland cell.
the automated and accurate segmentation of glandular cells allows medical professionals to extract important morphological features from large-scale glandular tissues conveniently and quickly. The traditional method for segmenting the glandular tissues needs a large number of human-computer interaction processes in the execution process, and has the problems of low accuracy, weak generalization capability and anti-interference capability and the like.
Compared with traditional medical image segmentation algorithms, deep-learning-based methods achieve good accuracy and efficiency, as shown by neural networks such as FCN, U-Net, SegNet and ResU-Net. Neural-network-based algorithms apply repeated nonlinear transformations to the pixel information of a medical image through the convolution operations of filters, extracting abstract features of different levels and strengths; the target objects of interest in the original image are then analysed and judged by combining this feature information to obtain the desired result. Among deep-learning-based medical image segmentation algorithms, the most representative is U-Net, which has a symmetric encoding-decoding topology; thanks to its distinctive skip-connection structure, high-level image features can be combined with low-level ones, so the network obtains more image detail during decoding and achieves a better segmentation effect. However, when extracting features with convolution kernels in the encoder stage, U-Net filters out or loses valuable position information in the image to some extent, which makes the pixel positions recovered in the decoder stage inaccurate and the segmentation boundaries unsatisfactory.
For image feature extraction and classification, more and more neural network algorithms have come into view, among which the ResNeXt network and the Squeeze-and-Excitation Networks (SE-Net) stand out. The ResNeXt network combines the ResNet and Inception networks: ResNet holds that deepening the network improves its quality, while Inception holds that widening the network improves deep-network performance. To exploit both advantages, ResNeXt uses grouped convolution and proposes a new concept called cardinality. In grouped convolution, different groups belong to different subspaces, and each subspace can learn a different feature representation, so ResNeXt can acquire richer feature information. Cardinality denotes the number of convolution groups; the design is similar to Inception blocks, except that every small block inside has the same structure, i.e. the ResNeXt network is built by stacking ResNeXt blocks with identical convolution parameters. The grouped convolution also incorporates the core idea proposed by the residual network, identity mapping, further improving the network's ability to extract and classify image features.
The filter is the core of a convolutional neural network: for the local receptive field covered by each layer, it processes and analyses feature information over the image space to construct the image's information features. To enhance a network's ability to extract information, researchers often strengthen the encoding quality of the image space. Such approaches, however, ignore the relationship between feature channels and network performance. SE-Net breaks with this convention: rather than extracting features over the spatial dimensions of the image, it studies the feature channels to explore the relationship between image features and channels. As the name implies, SE-Net involves two critical operations, feature compression (Squeeze) and feature excitation (Excitation). It is a new learning strategy that establishes interdependencies between feature channels through Squeeze and Excitation and adaptively learns and calibrates the weight of each feature channel. Stacking these modules together forms the SE-Net neural network.
In implementing the technical solution of the invention, the inventors found that improving the accuracy of neural-network glandular cell segmentation requires attention to and optimization of two aspects: first, in the encoder stage, how to extract more effective image features during down-sampling so that pixel classification is more accurate; second, in the decoder stage, where high-level feature maps must be expanded to the size of the original image, how to locate pixels effectively at their corresponding positions during up-sampling so that pixel prediction is more accurate. Achieving the desired effect in either aspect is a considerable challenge.
Disclosure of Invention
The invention provides a gland cell segmentation method based on a multi-level feature fusion network, which is used for realizing accurate segmentation of example objects in a gland cell image.
The invention provides a gland cell segmentation method based on a multi-level feature fusion network, which comprises the following steps:
setting a glandular cell segmentation network model, wherein the model adopts a U-shaped encoder-decoder structure;
the encoder comprises M+1 encoder sub-blocks. The first encoder sub-block comprises a convolution block and at least one EBA1 connected in sequence; the second to Mth encoder sub-blocks share the same structure, each comprising a splicing (channel concatenation) layer, at least one EBA1 and a max-pooling layer connected in sequence; the (M+1)th encoder sub-block comprises a splicing layer and at least one EBA1 connected in sequence. The output of each encoder sub-block is connected to the splicing layer of the next encoder sub-block, and each splicing layer is also connected to a down-sampling layer; the down-sampling layers generate down-sampled feature maps of different sizes from the encoder's input image, which are channel-concatenated with the feature maps of corresponding scale produced by the max-pooling layer of the corresponding sub-block;
the convolution block comprises a convolution layer, a batch normalization layer and an activation function layer which are connected in sequence;
the decoder comprises M (M > 1) decoder sub-blocks. The first to (M-1)th decoder sub-blocks share the same structure, each comprising a splicing layer and at least one EBA2 connected in sequence; the Mth decoder sub-block comprises a splicing layer, at least one EBA2, a convolution layer with 1x1 filters and an output layer using a Sigmoid activation function, connected in sequence. The output of each decoder sub-block is connected to the splicing layer of the next decoder sub-block through an up-sampling layer, and that splicing layer is also connected to the output of the encoder sub-block at the position symmetrical to the current decoder sub-block. The input of the splicing layer of the first decoder sub-block is connected to the output of the (M+1)th encoder sub-block through an up-sampling layer and directly to the output of the Mth encoder sub-block, so that each feature map generated by the decoder during up-sampling is channel-concatenated with the feature map extracted and retained by the corresponding EBA1 during down-sampling;
the EBA1 comprises a plurality of convolution groups, each containing the same number of filters; with the number of filters per group defined as N, a dimension-reduction convolution block with N/2 filters and a dimension-increase convolution block with N filters are connected in sequence after the first convolution group, an SE module based on feature-channel learning is connected after the last convolution group, and the feature map output by the SE module is fused with the input feature map of the EBA1 to obtain the output feature map of the EBA1. The dimension-reduction and dimension-increase convolution blocks share the same structure: a convolution layer, a batch normalization layer and an activation function layer in sequence;
the EBA2 comprises a plurality of convolution groups; a dimension-reduction convolution block with N/4 filters and a dimension-increase convolution block with N/2 filters are connected in sequence after the first convolution group, an SE module based on feature-channel learning is connected after the last convolution group, and the input feature map of the EBA2, after passing through a convolution block with N/2 filters, is fused with the feature map output by the SE module to obtain the output feature map of the EBA2;
performing grayscale conversion on the original glandular cell images to obtain training images, performing pixel enhancement on the label images of the original glandular cell images to obtain training label images, and thereby obtaining a training dataset; performing deep learning training of the glandular cell segmentation network model on the training dataset until a preset end-of-training condition is met, obtaining the trained glandular cell segmentation network model;
and performing grayscale conversion on the original glandular cell image to be segmented to obtain the image to be segmented, inputting it into the trained glandular cell segmentation network model, and obtaining the segmentation result from the model's output.
Further, each encoder sub-block includes two EBA1 modules connected in series, and each decoder sub-block includes one EBA2.
Further, the SE module includes a global average pooling layer, a first full connection layer, a second full connection layer, and a channel-by-channel weighting layer, which are connected in sequence, and the input of the channel-by-channel weighting layer further includes an input feature map of the SE module, the first full connection layer is used to reduce the number of channels of the feature map, and the second full connection layer is used to increase the number of channels of the feature map.
Further, the activation function adopted by the first fully-connected layer of the SE module is ReLU, and the activation function adopted by the second fully-connected layer is Sigmoid.
Further, the specific network structures of EBA1 and EBA2 are:
the EBA1 includes two grouped convolutions, each with 32 convolution groups and 3x3 convolution kernels; the activation function of the first grouped convolution is ReLU and that of the second is Linear; the dimension-reduction convolution block of EBA1 has a 1x1 kernel and Linear activation, and the dimension-increase convolution block of EBA1 has a 3x3 kernel and Linear activation;
the EBA2 includes two grouped convolutions, each with 32 convolution groups and 3x3 convolution kernels; the activation function of the first grouped convolution is ReLU and that of the second is Linear; the dimension-reduction convolution block of EBA2 with N/4 filters has a 1x1 kernel and Linear activation, the convolution block of EBA2 with N/2 filters has a 1x1 kernel and ReLU activation, and the dimension-increase convolution block of EBA2 with N/2 filters has a 1x1 kernel and ReLU activation.
Further, binary cross entropy is adopted as the loss function when the glandular cell segmentation network model is trained on the training dataset.
Further, the segmentation result is post-processed and optimized using a fully connected conditional random field (DCRF).
In another aspect, the present invention further provides a glandular cell segmentation device with a multi-level feature fusion network, which comprises a computer for receiving an original glandular cell image, wherein the computer is programmed to execute any one of the above-mentioned glandular cell segmentation methods.
The technical scheme provided by the invention at least has the following beneficial effects:
the network model for segmentation adopted by the invention consists of an encoder and a decoder, the features of the target object are extracted as much as possible through the encoder and are accurately classified, then the pixels are restored and positioned on the basis of the low-resolution feature map through the decoder, and finally the accurate segmentation of the example object in the glandular cell image is realized. Two Efficient Bottleneck Architectures (EBAs) with different functions are arranged to run through the whole encoding-decoding process, and meanwhile, a mode of adding feature pixels point by point is introduced into the EBAs to realize multi-level feature fusion, so that the capability of spreading image features to a deeper network is enhanced, the whole segmentation network extracts feature information more finely, and the learning capability of the feature information is stronger. In the stage of an encoder, feature maps of different scales are generated by downsampling the feature maps of the input end, and then the feature maps are spliced with the feature map of the corresponding proportion generated in the encoder in the largest pooling mode to realize multi-level feature input and enhance the propagation of image features; in the decoder stage, according to the feature map information of different levels reserved by downsampling, when the upsampling operation in the decoder stage is carried out, the feature maps with corresponding sizes are spliced, the shallow image features are combined again, the loss of pixel positions is compensated, the error when the pixel positions of the feature images are predicted and positioned in the decoding process is reduced, and the efficient gland cell image segmentation task is realized.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic processing process diagram of a gland cell segmentation method based on a multi-level feature fusion network according to an embodiment of the present invention;
FIG. 2 is a schematic representation of the original glandular cell images used in an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a pre-processing procedure for an image dataset of glandular cells according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of two efficient bottleneck structures according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an SE module employed in the embodiments of the present invention;
FIG. 6 is a schematic diagram of an encoder structure used in an embodiment of the present invention;
FIG. 7 is a block diagram of a decoder used in an embodiment of the present invention;
FIG. 8 is a diagram of a neural network architecture based on multi-level feature fusion as utilized in an embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating post-processing optimization of glandular cell images by DCRF as used in the examples of the present invention;
FIG. 10 shows the result of segmenting an image of a benign glandular cell according to an embodiment of the present invention;
FIG. 11 is a segmentation result of an image of malignant glandular cells according to an embodiment of the present invention;
in fig. 6 to 8, numerals above the rectangular frame indicate the number of channels of the feature map.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Aiming at the challenging task existing in the glandular cell image segmentation, the embodiment of the invention provides a network model based on multi-level feature fusion. In the network model, the characteristic graph information of different levels of the image is combined, the characteristics on a plurality of receptive fields are gathered, the characteristic information of multiple levels of the image is embedded, and the capability of extracting the characteristics is improved by grasping the global and local spatial context information, which is the idea of multi-level characteristic fusion. The network model consists of an encoder and a decoder, the features of the target object are extracted as much as possible through the encoder and are accurately classified, then the pixels are restored and positioned on the basis of a low-resolution feature map through the decoder, and finally accurate segmentation of the example object in the glandular cell image is achieved.
Specifically, the invention first designs two Efficient Bottleneck Architectures (EBAs) with different functions and runs them through the whole encoding-decoding process, while introducing point-by-point addition of feature pixels inside the EBAs to realize multi-level feature fusion; this strengthens the propagation of image features into deeper layers, lets the whole network extract feature information more carefully, and improves its learning of that information. In the encoder stage, the input feature maps are down-sampled to generate feature maps of different scales, which are concatenated with the feature maps of corresponding scale produced by max pooling in the encoder, realizing multi-level feature input and strengthening the propagation of image features. In the decoder stage, according to the feature map information of different levels retained during down-sampling, feature maps of corresponding size are concatenated during each up-sampling operation, recombining shallow image features, compensating for the loss of pixel positions, reducing the error in predicting and locating the pixel positions of the feature maps during decoding, and realizing efficient glandular cell image segmentation.
Referring to fig. 1, taking glandular cells as an example, the specific implementation steps of the segmentation method based on multi-level feature fusion provided by the embodiment of the present invention include:
step 1, preprocessing a glandular cell image.
Owing to the particularity of medical image data, the image data output by medical imaging equipment cannot be used directly; it must be preprocessed before it can be used for neural network training and experiments. In the embodiment of the present invention, the original glandular cell images come from the official dataset of the 2015 MICCAI (Medical Image Computing and Computer Assisted Intervention) gland segmentation challenge. This dataset contains 165 pairs of original images and corresponding label maps. The label maps are BMP files, a standard image file format, of size 775 × 522; the originals have a bit depth of 24 and the expert-annotated label maps a bit depth of 8, but the pixel contrast of the original label maps is low, making them hard to recognize with the naked eye. The dataset covers grades from benign to malignant glands and includes widely differing cases; as shown in fig. 2, from left to right (columns a, b, c and d in the figure) it ranges from benign gland morphology through generally malignant gland morphology to more severe gland morphology. Experimenting on data of various forms allows the performance (robustness) of the segmentation method provided by the embodiment of the invention to be evaluated better.
To suit the training of the multi-level feature fusion neural network model provided by the embodiment of the invention, and to address the low pixel contrast of the expert-annotated label maps that makes them hard to recognize by eye, the original images and the label images are preprocessed and adjusted separately. First, the original image is converted to grayscale: its bit depth is adjusted from 24 to 8, i.e. the original colour image becomes a grayscale image. Then a pixel enhancement operation is applied to the original label image: each pixel is multiplied by 255, which increases the contrast between pixels so that the glandular cells in the label map can be clearly distinguished by eye. The glandular cell image preprocessing process and its result are shown in fig. 3.
Preprocessing the raw glandular cell dataset in this way yields data usable by the neural network of the present invention.
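As an illustration, these two preprocessing steps can be sketched with OpenCV (a minimal sketch; the filenames are hypothetical placeholders, not names from the dataset):

```python
import cv2
import numpy as np

# Grayscale conversion: 24-bit colour original -> 8-bit grayscale image
orig = cv2.imread("train_01.bmp")
gray = cv2.cvtColor(orig, cv2.COLOR_BGR2GRAY)
cv2.imwrite("train_01_gray.bmp", gray)

# Pixel enhancement of the expert label map: multiply every pixel by 255
# (values saturate at 255, so all annotated pixels become clearly visible)
label = cv2.imread("train_01_anno.bmp", cv2.IMREAD_GRAYSCALE)
enhanced = np.clip(label.astype(np.int32) * 255, 0, 255).astype(np.uint8)
cv2.imwrite("train_01_anno_enhanced.bmp", enhanced)
```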
Step 2, designing an efficient bottleneck structure.
For the glandular cell image, to achieve accurate feature extraction and classification of the target objects at the encoder stage and accurate localization and boundary refinement of their contours and edges at the decoder stage, and thereby complete accurate segmentation, this embodiment designs and implements two kinds of EBAs, applied in the encoder and the decoder respectively; their structural design is shown in fig. 4.
Fig. 4 shows the efficient bottleneck structures provided by the embodiment of the present invention: EBA1 on the left and EBA2 on the right. The EBAs mainly adopt the idea of grouped convolution: several convolution groups are computed in parallel and independently of each other, and feature information from different subspaces can be acquired through these groups, so more image features are obtained. Unlike ResNeXt, the EBAs are substantially improved, specifically:
first, EBAs use a packet convolution with 32 convolution groups and a filter size of 3x3, which is different from the conventional bottleneck structure using a 1x1 convolution, in order to encode more expressive spatial information. For each convolution in the packet convolution, the topology of the convolution is identical, and a convolution with a filter size of 3 × 3 is firstly adopted, the step size is 1, and simultaneously the policy of same padding (filling 0 around the input so as to make the input and output sizes the same after the convolution operation) is adopted to ensure that the size of the feature map is kept unchanged during the forward propagation of the network. The convolution is normally adjusted by adopting batch normalization operation after 3x3 convolution, so that the subsequent activation function can be enabled to carry out nonlinear transformation to the maximum extent, the training speed and the convergence speed of the network are improved, and the disappearance of the gradient in the learning process is prevented. And then performing nonlinear learning using the ReLU as an activation function.
Compared with ordinary convolution: suppose the input feature map has width W, height H and N dimensions (channels), the number of filters is F and each filter has size K × K × N, so the output feature map has size W × H × F. The total parameter count of the ordinary convolution is:

Parameter_ordinary = K × K × N × F

For grouped convolution with the same input (width W, height H, N dimensions), F filters of size K × K and G convolution groups, the number of filters per group is F/G, the feature map fed to each group has size W × H × (N/G), the corresponding filter size is K × K × (N/G), and each group outputs a feature map of size W × H × (F/G). The outputs of the G groups are then concatenated to obtain the final output feature map of size W × H × F, so the total parameter count of the grouped convolution is:

Parameter_grouped = K × K × (N/G) × (F/G) × G = (K × K × N × F) / G

When G = N, the total parameter count of the grouped convolution reduces to:

Parameter_grouped = K × K × F

As these formulas show, under the same conditions the parameter count of a grouped convolution is 1/G that of an ordinary convolution, so grouped convolution reduces the number of parameters. In particular, when the number of convolution groups G equals the input feature map dimension N, the parameter count is reduced further still, which greatly lightens the network model and lowers its computational cost.
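For a quick numerical check of these formulas, the following sketch computes the parameter counts for illustrative values (the sizes chosen here are examples, not values from the patent):

```python
def conv_param_count(k, n_in, n_out, groups=1):
    # K x K x (N/G) weights per filter, F/G filters per group, G groups
    return k * k * (n_in // groups) * (n_out // groups) * groups

ordinary = conv_param_count(3, 64, 64)             # K*K*N*F = 36864
grouped = conv_param_count(3, 64, 64, groups=32)   # 1/32 of the ordinary count
extreme = conv_param_count(3, 64, 64, groups=64)   # G = N case: K*K*F = 576
assert grouped == ordinary // 32 and extreme == 3 * 3 * 64
```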
Then, after the first grouped convolution, the EBA applies a convolution with 1x1 filters that halves the number of filters, followed by batch normalization and a Linear activation function. Next, a 1x1 convolution that doubles the number of filters restores the channel dimension of the feature map, followed by batch normalization and a ReLU activation function.
Next comes another grouped convolution with 32 groups; unlike the initial grouped convolution, Linear is used as the activation function after the 3x3 convolution and batch normalization.
After that, an SE (Squeeze-and-Excitation Networks) module based on feature-channel learning is used in the EBAs to strengthen the propagation of useful feature information by learning the feature channels of the feature map, further improving the network's ability to extract features of the instance objects in the glandular cell image; the structure of the SE module is shown in fig. 5. First, global average pooling is applied along the spatial dimensions of the feature map to compress each feature channel, generating a multi-dimensional feature vector with one entry per channel of the feature map.
Suppose the input feature map is M = {m_1, m_2, ..., m_n}, where m_n denotes the feature map in the nth dimension, of size W × H. After global average pooling the output is O = {o_1, o_2, ..., o_n}, where the feature vector in the kth dimension, o_k, is:

o_k = (1 / (W × H)) × Σ_{i=1..W} Σ_{j=1..H} m_k(i, j)

where m_k(i, j) denotes the feature value of the feature map m_k at position (i, j).
The multi-dimensional feature vector is then fed to a bottleneck structure consisting of two fully connected layers, which learns a corresponding set of weight values between 0 and 1. These weights, learned through the layer parameters, express the relationships among the feature vectors and reveal the correlation among the channels of the feature map, i.e. the importance of each feature channel.
Let the weights of the two fully connected layers be ω_1 and ω_2. The output of the first fully connected layer, F_1 = {f_11, f_12, ..., f_1n}, is:

F_1 = ReLU[f(o_k, ω_1)], k ∈ [1, n]

where f_1n denotes the output of the first fully connected layer in the nth dimension and f(·) denotes the fully connected operation.

The output of the second fully connected layer, F_2 = {f_21, f_22, ..., f_2n}, is:

F_2 = Sigmoid[f(F_1, ω_2)] = Sigmoid{f[ReLU(f(o_k, ω_1)), ω_2]}, k ∈ [1, n]

where f_2n denotes the output of the second fully connected layer in the nth dimension.
In the bottleneck structure formed by the two fully connected layers, the first fully connected layer reduces the number of channels of the feature map, i.e. it performs dimension reduction; its parameter R is a hyperparameter, and the dimension after reduction is N/R, where N is the dimension of the input feature map. A ReLU activation function then propagates and learns the output of the first fully connected layer nonlinearly. The second fully connected layer increases the number of channels, i.e. it performs dimension raising, restoring the feature map directly to the original dimension N, with a Sigmoid activation function. Two consecutive fully connected layers are used because the preceding global average pooling of the SE module operates on the pixel information of each feature channel in isolation, with every channel independent of the others; the two fully connected layers therefore fuse the feature map information across channels and improve the generalization ability of the network model. Finally, the learned per-channel weight values are applied, channel by channel through multiplication, to the feature map input to the SE module, completing the recalibration of the feature map.
Let scale(·) denote the channel-wise multiplicative weighting of the feature map. The output after the "feature recalibration" strategy is:

Output = scale(m_k, f_2k), k ∈ [1, n]
the channels in the feature map are recalibrated through the SE module, the weight values corresponding to the channels in the feature map are learned, the importance degree of each channel is mastered according to the weight values, and therefore the useful and expressive feature channels are enhanced, and the useless and poor feature channels are restrained.
Finally, in EBA1, the initial input is fused with the output of the SE module, realizing the core idea of the residual network: identity mapping. This addresses training degradation and vanishing or exploding gradients, while also achieving multi-level feature fusion, promoting feature reuse and further strengthening feature propagation and extraction. In EBA2, because the numbers of input and output feature channels differ, identity mapping is not used; instead, a convolution with stride 1 and 1x1 filters is applied, followed by batch normalization and a ReLU activation function. This connection is called a shortcut.
The two kinds of EBAs run through the whole neural network. Through the series of convolution operations inside the EBA modules, the encoder can effectively extract detailed feature information of the objects of interest in the glandular cell image and generate refined feature maps, improving classification precision between pixels; the decoder can then accurately predict the position information of the target objects and locate their contours precisely, maximizing the segmentation accuracy of the glandular cell image.
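Based on the structure described in this step, the two bottleneck blocks might be sketched as follows (a hedged Keras sketch reusing the se_block helper above; where the embodiment and the summary differ on kernel sizes or activations, and for the width of the EBA2 grouped convolutions, the choices below are assumptions):

```python
from tensorflow.keras import layers

def eba1(x, groups=32):
    """Encoder bottleneck: grouped 3x3 convs, channel bottleneck, SE, identity mapping."""
    n = x.shape[-1]                                            # N must be divisible by 32
    y = layers.Conv2D(n, 3, padding="same", groups=groups)(x)  # grouped conv, 32 groups
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(n // 2, 1, padding="same")(y)            # dimension reduction, linear activation
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(n, 3, padding="same")(y)                 # dimension increase (3x3 per the summary;
    y = layers.BatchNormalization()(y)                         # the embodiment text says 1x1 + ReLU)
    y = layers.Conv2D(n, 3, padding="same", groups=groups)(y)  # second grouped conv, linear activation
    y = layers.BatchNormalization()(y)
    y = se_block(y)                                            # se_block from the previous sketch
    return layers.Add()([x, y])                                # identity mapping

def eba2(x, groups=32):
    """Decoder bottleneck: output has N/2 channels, shortcut is 1x1 conv + BN + ReLU."""
    n = x.shape[-1]
    y = layers.Conv2D(n // 2, 3, padding="same", groups=groups)(x)  # group-conv width assumed N/2
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(n // 4, 1, padding="same")(y)                 # dimension reduction to N/4
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(n // 2, 1, padding="same")(y)                 # dimension increase to N/2
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(n // 2, 3, padding="same", groups=groups)(y)  # second grouped conv
    y = layers.BatchNormalization()(y)
    y = se_block(y)
    s = layers.Conv2D(n // 2, 1, padding="same")(x)                 # shortcut branch
    s = layers.BatchNormalization()(s)
    s = layers.ReLU()(s)
    return layers.Add()([y, s])
```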
Step 3, designing the encoder structure.
To better extract the features of the glandular cell image, the encoder is designed as shown in fig. 6. For efficient feature extraction and classification of glandular cells, the first efficient bottleneck architecture is applied in the encoder. In addition, in the encoder stage, feature maps of different sizes are generated from the input image of the whole network by different down-sampling operations and are recorded and retained; during encoding, these are then fused by channel concatenation with the feature maps of corresponding scale produced by the max-pooling layers, realizing multi-level feature fusion to strengthen image feature extraction and classification.
The entire encoder consists of one convolution layer with stride 1 and 3x3 filters, ten EBA1 modules, four max-pooling layers, and four convolution layers responsible for down-sampling. After the first convolution and after each max-pooling operation, several (preferably two) identical EBA1 modules are used in series to deepen the network and improve its feature extraction capability. Because this task leans towards segmenting the texture, contour and similar features of the objects of interest, max pooling at the encoder stage helps the network reduce complexity and computation while filtering out some useless information. Max pooling, however, may also filter out feature information useful to the network. To address this drawback, the invention applies the concept of multi-level feature fusion: the initial input of the whole network undergoes several convolutional down-sampling operations to produce feature maps of different scales, corresponding respectively to the feature maps generated by the first, second, third and fourth max-pooling operations, and each pair is channel-concatenated to serve as the input of the subsequent convolution operations. In this way the loss of useful feature information caused by max pooling is alleviated, multi-level feature fusion is realized, and image feature extraction is improved.
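A sketch of this encoder layout follows (the stage channel widths are assumptions; the patent fixes only the topology of one initial convolution block, two EBA1 per sub-block, four max-pooling stages and the concatenation with convolutionally down-sampled copies of the input):

```python
from tensorflow.keras import layers

def build_encoder(inp, widths=(64, 128, 256, 512, 1024)):   # widths are assumed
    x = layers.Conv2D(widths[0], 3, padding="same")(inp)    # initial conv block
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = eba1(eba1(x))                                       # two EBA1 in series
    skips = [x]
    for i, w in enumerate(widths[1:], start=1):
        p = layers.MaxPooling2D(2)(x)                       # max pooling
        d = layers.Conv2D(w - p.shape[-1], 3, strides=2 ** i,
                          padding="same")(inp)              # down-sampled copy of the input
        x = layers.Concatenate()([p, d])                    # channel splicing
        x = eba1(eba1(x))
        if i < len(widths) - 1:
            skips.append(x)                                 # retained for the decoder
    return x, skips
```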
Step 4, designing the decoder structure.
In the decoder stage, in order to achieve accurate segmentation of glandular cells, the decoder was designed as shown in fig. 7. For glandular cells, in order to better predict and locate the contours and boundaries of the target object, a second efficient bottleneck architecture is applied in the decoder. In addition, for each feature map generated by the decoder during the up-sampling process, channel splicing is performed corresponding to the feature map extracted and retained by each EBA1 during the down-sampling process by the encoder, so as to realize multi-level feature fusion. Therefore, the low-level features generated by the shallow coding layer are combined with the high-level features generated by the deep decoding layer to obtain more representative feature information, and the multi-level image features are analyzed to improve the result of pixel prediction and further realize better segmentation effect.
The entire decoder consists of one convolution layer with stride 1 and 1x1 filters, four up-sampling convolution layers, and four EBA2 modules. In the step where the decoder expands and restores the resolution of the feature maps, the invention uses a combination of up-sampling and convolution rather than deconvolution directly. Research shows that deconvolution easily causes uneven pixel overlap, with the feature information of some regions more abstract than that of others; the phenomenon is especially pronounced when the filter size is not divisible by the stride, producing artifacts across the whole feature image that seriously affect the final segmentation result. Because the neural network records and retains pixel position information at the encoder stage, the up-sampling at the decoder stage can restore pixel information to its original positions with the help of that information, accurately recovering the low-resolution feature maps; this alleviates the loss of detail and pixel information caused by the convolution and pooling operations in the encoder and preserves edge details at higher segmentation accuracy.
At the end of the decoder, which is also the end of the whole neural network, a convolution layer with the step size of 1 and the filter size of 1x1 is used, and finally, a segmentation probability map is obtained by using a Sigmoid activation function.
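This decoder can be sketched as follows (a minimal sketch matching the topology above; it reuses eba2 from the earlier sketch and assumes the up-sampling convolution restores the channel width of the corresponding skip connection):

```python
from tensorflow.keras import layers

def build_decoder(x, skips):
    for skip in reversed(skips):
        x = layers.UpSampling2D(2)(x)                       # up-sampling + conv, not deconvolution
        x = layers.Conv2D(skip.shape[-1], 3, padding="same",
                          activation="relu")(x)
        x = layers.Concatenate()([x, skip])                 # splice with the retained encoder feature map
        x = eba2(x)                                         # one EBA2 per decoder sub-block
    x = layers.Conv2D(1, 1, padding="same")(x)              # final 1x1 convolution
    return layers.Activation("sigmoid")(x)                  # segmentation probability map
```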
Step 5, building the neural network model based on multi-level feature fusion.
In the embodiment of the present invention, the neural network model based on multi-level feature fusion consists of an encoder and a decoder, as shown in fig. 8. The whole network starts with the encoder: after down-sampling through four max-pooling layers the image resolution is reduced 16-fold, the feature maps are then expanded 16-fold through four up-samplings to restore the image, and finally a 1x1 convolution and a Sigmoid activation function produce the final classification prediction.
The neural network adopted in the embodiment of the invention embodies the concept of multi-level feature fusion several times over. First, within the efficient bottleneck structures, the EBAs use identity mappings and shortcuts to achieve multi-level feature fusion by pixel-wise superposition. Second, in the encoder stage, the feature map input at the start of the network is down-sampled several times into feature maps of different scales, each corresponding to a feature map of the same scale generated by a max-pooling layer during encoding, and channel concatenation is used to realize multi-level feature fusion. Finally, in the decoder stage, the feature maps extracted and retained during the encoder's down-sampling are channel-concatenated with the feature maps of the same scale generated by the corresponding up-sampling, again realizing multi-level feature fusion.
During training, the neural network provided by the embodiment of the invention promotes the propagation and reuse of image features, combines and captures global and local contextual feature information, alleviates the loss of useful feature information caused by convolution and pooling operations, further improves the extraction and classification of image features, strengthens prediction and localization on low-resolution feature maps, and finally achieves accurate segmentation of glandular cell images.
Step 6, training the network model and optimizing its parameters.
To improve the robustness of the neural network provided by the embodiment of the invention and reduce overfitting, the training images and their corresponding label images are stored under matching filenames in two folders, and data enhancement strategies such as translation, cropping, rotation and elastic deformation are used to expand the training dataset.
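One possible implementation of such an augmentation pipeline uses the albumentations library (a sketch; the concrete limits and the crop size are assumptions, and train_image / train_label stand for one preprocessed image-label pair):

```python
import albumentations as A

augment = A.Compose([
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1,
                       rotate_limit=30, p=0.5),   # translation and rotation
    A.RandomCrop(height=512, width=512),          # cropping (assumed size)
    A.ElasticTransform(p=0.5),                    # elastic deformation
])
out = augment(image=train_image, mask=train_label)  # same transform for image and label
aug_image, aug_label = out["image"], out["mask"]
```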
In this embodiment, no pre-trained model is loaded during training; the filters are initialized from a standard normal distribution, the network optimizer is Adam with an initial learning rate of 0.0001, and the number of images per batch during training is 4. Binary cross entropy is used as the loss function, computed as:

Loss = -(1/n) × Σ_{i=1..n} [ y_i × log(ŷ_i) + (1 - y_i) × log(1 - ŷ_i) ]

where ŷ_i denotes the predicted class value of pixel i, y_i denotes its class label, and n denotes the number of pixels.
The neural network provided by the embodiment of the invention iteratively optimizes its parameters through back-propagation, and the trained network parameters and model are stored as HDF5 files. This yields the trained neural network model based on multi-level feature fusion.
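Put together, the training configuration of this step corresponds roughly to the following sketch (the input size, epoch count and variable names are assumptions; train_images / train_labels stand for the preprocessed, augmented dataset):

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(512, 512, 1))   # images assumed resized/cropped to a multiple of 16
bottom, skips = build_encoder(inp)        # from the encoder sketch above
out = build_decoder(bottom, skips)        # from the decoder sketch above
model = tf.keras.Model(inp, out)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # initial learning rate 0.0001
    loss="binary_crossentropy",                              # loss function of step 6
    metrics=["accuracy"],
)
model.fit(train_images, train_labels, batch_size=4, epochs=100)  # batch size 4; epochs assumed
model.save("gland_segmentation.h5")       # parameters and model stored in HDF5 format
```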
The preprocessed image to be segmented is input into the trained multi-level feature fusion neural network model; a class prediction for each pixel is obtained from the model's output, and the segmentation result is obtained from the segmentation objects corresponding to those classes.
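Inference on a new image might then look like this (a sketch; the filenames, the [0, 1] scaling and the 0.5 decision threshold are assumptions not stated in the patent):

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("gland_segmentation.h5")
img = cv2.imread("test_01.bmp", cv2.IMREAD_GRAYSCALE)   # preprocessed (grayscale) input
x = img.astype("float32")[None, ..., None] / 255.0      # assumed [0, 1] scaling
prob = model.predict(x)[0, ..., 0]                      # per-pixel class probability
mask = (prob > 0.5).astype(np.uint8)                    # assumed 0.5 decision threshold
```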
Further, in step 7, the segmentation result is post-processed and optimized using a fully connected conditional random field (DCRF).
Image context information expresses the spatial relationships between pixel class labels and plays an important role in structured prediction tasks. Combining image context information with high-level semantic information is key to image segmentation. Because the noisy tissue in glandular cell images closely resembles the target objects and the images contain much noise, a DCRF model is well suited to post-processing optimization here, and its structured prediction capability can play an important role. For the characteristics of the glandular cell image, in the embodiment of the invention the output (segmentation result) of the multi-level feature fusion network is mapped to a probability map, a random field model is constructed with the DCRF, and the pixels of the probability map are modelled. Unary and pairwise potential functions are established using the gray intensity and Euclidean distances of the glandular cell image, capturing local and global spatial information, fully combining the contextual information of the image, detecting the coupling relationships between pixels, gathering similar pixels together as much as possible, and optimizing the whole probability map.
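A sketch of such DCRF post-processing with the pydensecrf library (the kernel widths and compatibility values are assumed settings, not values from the patent; prob_map is the network's probability output and rgb_image the original colour image):

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def dcrf_refine(prob_map, rgb_image, n_iters=5):
    """Refine a probability map with a fully connected CRF."""
    h, w = prob_map.shape
    probs = np.stack([1.0 - prob_map, prob_map]).astype(np.float32)  # background / gland
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))                      # unary potential
    d.addPairwiseGaussian(sxy=3, compat=3)                           # Euclidean-distance term
    d.addPairwiseBilateral(sxy=50, srgb=13,
                           rgbim=np.ascontiguousarray(rgb_image),
                           compat=10)                                # gray/colour intensity term
    q = d.inference(n_iters)
    return np.argmax(q, axis=0).reshape(h, w)                        # refined binary mask
```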
The results of optimization using DCRF modelling are shown in fig. 9; each row shows one group of glandular cell images, two groups in total. The first column is the original input image, the second the probability map output by the multi-level feature fusion network, and the third the segmentation result after DCRF optimization.
Fig. 9 shows noise regions around and inside the gland cells in the probability map of the second row, caused by the abundant noise information in the glandular cell image; these regions suggest the area could be a gland cell. After the unary and pairwise potential functions are established, bilateral relationships are built between the pixels of the gland cell regions and the noise regions and the pixel classes are determined: pixels in the noise regions that belong to the gland cell class are assigned to the gland cells, and redundant and isolated pixels are removed, improving the segmentation effect and accuracy.
The embodiment of the invention was tested using the preprocessed glandular cell image data and evaluated for glandular cell image segmentation; the test data comprise two categories, benign and malignant glandular cell images. The experimental results on the benign glandular cell test data are shown in fig. 10, which presents four groups of segmentation results: the left two columns are the original image and the original label map respectively, and the right column is the segmentation result. For benign glandular cell images, faced with the low contrast between gland cells and surrounding tissue and the adhesion of gland cells caused by the very similar, varied noisy tissue around them, the embodiment of the invention exploits the advantages of the multi-level feature fusion algorithm, learning and combining features from different spatial positions and feature dimensions to richly extract and predict the features of the gland cell objects. The experimental results show that the embodiment of the invention segments the gland cell objects accurately.
In addition to the benign test data, experiments were also run on malignant glandular cell images to verify the generalization ability and robustness of the embodiment of the invention. Benign gland cells generally have clear boundaries and a smooth, regular morphology, mostly appearing as ovals and circles. Malignant gland cells are usually not enclosed by a membrane, so the cells spread outwards; they are not only large but also irregular in shape, which makes them more similar to the surrounding tissue and their boundaries harder to identify, so accurately segmenting malignant glandular cell images is extremely difficult. The experimental results of the invention on malignant gland cells are shown in fig. 11. Even though malignant gland cells come with more noisy tissue and widely differing morphological structures, the method's strong extraction of feature and spatial information still segments them clearly.
Qualitative evaluation experiments were performed on benign and malignant glandular cell images respectively. The experiments show that, for both benign and malignant gland cells, even with abundant noisy tissue around the gland cells and high similarity to that surrounding tissue, the embodiment of the invention still maintains high accuracy in single gland cell segmentation.
Finally, it should be noted that the above examples are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the inventive concept.

Claims (7)

1. A glandular cell segmentation method based on a multi-level feature fusion network is characterized by comprising the following steps:
setting a glandular cell segmentation network model, wherein the glandular cell segmentation network model adopts a U-shaped encoder-decoder structure;
the encoder comprises M+1 encoder sub-blocks, wherein M is a positive integer greater than 1; the first encoder sub-block comprises a convolution block and at least one EBA1 connected in sequence, the second to Mth encoder sub-blocks have the same structure, each comprising a splicing layer, at least one EBA1 and a maximum pooling layer connected in sequence, and the (M+1)th encoder sub-block comprises a splicing layer and at least one EBA1 connected in sequence; the output of each encoder sub-block is connected to the splicing layer of the next encoder sub-block, and each splicing layer is also connected to a down-sampling layer, the down-sampling layers being used to generate down-sampled feature maps of different sizes from the input image of the encoder, so as to realize channel splicing with the feature map of the corresponding scale generated by the maximum pooling layer of the corresponding sub-block;
the convolution block comprises a convolution layer, a batch normalization layer and an activation function layer which are connected in sequence;
the decoder comprises M decoder sub-blocks, wherein the first to (M-1)th decoder sub-blocks have the same structure, each comprising a splicing layer and at least one EBA2 connected in sequence, and the Mth decoder sub-block comprises a splicing layer, at least one EBA2, a convolution layer with 1×1 filters and an output layer adopting a Sigmoid activation function, connected in sequence; the output of each decoder sub-block is connected to the splicing layer of the next decoder sub-block through an up-sampling layer, and each splicing layer is also connected to the output of the encoder sub-block at the position symmetrical to the decoder sub-block in which it is located; the input of the splicing layer of the first decoder sub-block is connected to the output of the (M+1)th encoder sub-block through an up-sampling layer and is directly connected to the output of the Mth encoder sub-block, so that each feature map generated by the decoder during up-sampling is channel-spliced with the feature map extracted and retained by each EBA1 during down-sampling;
the EBA1 comprises a plurality of convolution groups, each containing the same number of filters; with the number of filters in each convolution group defined as N, a dimension-reduction convolution block with N/2 filters and a dimension-increase convolution block with N filters are connected in sequence after the first convolution group, an SE module based on feature channel learning is connected after the last convolution group, and the feature map output by the SE module is fused with the input feature map of the EBA1 to obtain the output feature map of the EBA1; the dimension-reduction convolution block and the dimension-increase convolution block have the same structure, each comprising a convolution layer, a batch normalization layer and an activation function layer connected in sequence;
the EBA2 comprises a plurality of convolution groups; a dimension-reduction convolution block with N/4 filters and a dimension-increase convolution block with N/2 filters are connected in sequence after the first convolution group, an SE module based on feature channel learning is connected after the last convolution group, and the input feature map of the EBA2, after passing through a dimension-reduction convolution block with N/2 filters, is fused with the feature map output by the SE module to obtain the output feature map of the EBA2;
carrying out gray-scale conversion processing on the original glandular cell images to obtain training images, and carrying out pixel enhancement processing on the label images of the original glandular cell images to obtain training label images, thereby obtaining a training data set; carrying out deep learning training on the glandular cell segmentation network model based on the training data set until a preset training end condition is met, to obtain the trained glandular cell segmentation network model;
and carrying out gray-scale conversion processing on the original glandular cell image to be segmented to obtain an image to be segmented, inputting the image to be segmented into the trained glandular cell segmentation network model, and obtaining the segmentation result of the image to be segmented based on the output of the model.
2. The method of claim 1, wherein each encoder sub-block comprises two serially connected EBA1 blocks, and each decoder sub-block comprises one EBA2 block.
3. The method of claim 1 or 2, wherein the SE module comprises a global average pooling layer, a first fully connected layer and a second fully connected layer connected in sequence, followed by a channel-by-channel weighting layer whose input further comprises the input feature map of the SE module; the first fully connected layer is used to decrease the number of channels of the feature map, and the second fully connected layer is used to increase the number of channels of the feature map.
4. The method of claim 3, wherein the activation function employed by the first fully-connected layer of the SE module is ReLU and the activation function employed by the second fully-connected layer is Sigmoid.
5. The method of claim 1, wherein the specific network structures of EBA1 and EBA2 are:
the EBA1 comprises two convolution groups, each with 32 filters and a 3×3 convolution kernel; the activation function of the first convolution group is ReLU and that of the second convolution group is Linear; the dimension-reduction convolution block of the EBA1 has a 1×1 convolution kernel and a Linear activation function, and the dimension-increase convolution block of the EBA1 has a 3×3 convolution kernel and a Linear activation function;
the EBA2 comprises two convolution groups, each with 32 filters and a 3×3 convolution kernel; the activation function of the first convolution group is ReLU and that of the second convolution group is Linear; the dimension-reduction convolution block of the EBA2 with N/4 filters has a 1×1 convolution kernel and a Linear activation function, the dimension-reduction convolution block of the EBA2 with N/2 filters has a 1×1 convolution kernel and a ReLU activation function, and the dimension-increase convolution block of the EBA2 with N/2 filters has a 1×1 convolution kernel and a ReLU activation function.
6. The method of claim 1, wherein the deep learning training of the glandular cell segmentation network model based on the training dataset uses binary cross entropy as a loss function.
7. The method of claim 1, wherein post-processing optimization is performed on the segmentation result using a fully connected conditional random field (DCRF).
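By way of illustration only, the following is a minimal Keras (TensorFlow) sketch of the convolution block, the SE module and the EBA1/EBA2 blocks of claims 1, 3 and 5, with N = 32. The claims do not specify the fusion operation, the channel-reduction ratio of the SE module or how channel counts are matched at the fusions, so the additive fusion in EBA1, the channel concatenation in EBA2 and the reduction ratio of 4 are assumptions of this sketch rather than features of the claimed method.

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel, activation):
    # Convolution layer -> batch normalization layer -> activation layer (claim 1).
    x = layers.Conv2D(filters, kernel, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation(activation)(x)

def se_module(x, reduction=4):  # the reduction ratio is an assumed value
    # Global average pooling -> FC (decrease channels, ReLU) -> FC (restore
    # channels, Sigmoid) -> channel-by-channel weighting (claims 3 and 4).
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)
    w = layers.Dense(c // reduction, activation="relu")(w)
    w = layers.Dense(c, activation="sigmoid")(w)
    w = layers.Reshape((1, 1, c))(w)
    return layers.Multiply()([x, w])

def eba1(x, n=32):
    # Two 3x3 convolution groups (ReLU, then Linear) with a 1x1
    # dimension-reduction block (N/2 filters, Linear) and a 3x3
    # dimension-increase block (N filters, Linear) between them, an SE
    # module after the last group, and fusion with the block input
    # (claims 1 and 5); additive fusion is assumed here.
    y = conv_block(x, n, 3, "relu")          # first convolution group
    y = conv_block(y, n // 2, 1, "linear")   # dimension reduction, N/2
    y = conv_block(y, n, 3, "linear")        # dimension increase, N
    y = conv_block(y, n, 3, "linear")        # second convolution group
    y = se_module(y)
    return layers.Add()([x, y])

def eba2(x, n=32):
    # As EBA1, but with a 1x1 reduction to N/4 filters and a 1x1 increase
    # to N/2 filters between the groups, and a skip path through a 1x1
    # dimension-reduction block with N/2 filters (claims 1 and 5);
    # concatenation is assumed for the fusion so channel counts need not match.
    skip = conv_block(x, n // 2, 1, "relu")
    y = conv_block(x, n, 3, "relu")
    y = conv_block(y, n // 4, 1, "linear")
    y = conv_block(y, n // 2, 1, "relu")
    y = conv_block(y, n, 3, "linear")
    y = se_module(y)
    return layers.Concatenate()([skip, y])

inputs = layers.Input((64, 64, 32))   # toy input with 32 channels so the
outputs = eba2(eba1(inputs))          # additive fusion in eba1 is well-defined
model = tf.keras.Model(inputs, outputs)

The U-shaped encoder-decoder of claim 1 would stack such blocks with maximum pooling, up-sampling and splicing (concatenation) layers; that wiring is omitted here for brevity.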
CN202110608183.5A 2021-06-01 2021-06-01 Glandular cell segmentation method based on multi-level feature fusion network Expired - Fee Related CN113344933B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608183.5A CN113344933B (en) 2021-06-01 2021-06-01 Glandular cell segmentation method based on multi-level feature fusion network

Publications (2)

Publication Number Publication Date
CN113344933A CN113344933A (en) 2021-09-03
CN113344933B (en) 2022-05-03

Family

ID=77474081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110608183.5A Expired - Fee Related CN113344933B (en) 2021-06-01 2021-06-01 Glandular cell segmentation method based on multi-level feature fusion network

Country Status (1)

Country Link
CN (1) CN113344933B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332482A (en) * 2022-01-04 2022-04-12 电子科技大学 Lightweight target detection method based on feature fusion
CN114581456B (en) * 2022-05-09 2022-10-14 深圳市华汉伟业科技有限公司 Multi-image segmentation model construction method, image detection method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146885A (en) * 2018-08-17 2019-01-04 深圳蓝胖子机器人有限公司 Image partition method, equipment and computer readable storage medium
CN111968120A (en) * 2020-07-15 2020-11-20 电子科技大学 Tooth CT image segmentation method for 3D multi-feature fusion
CN112150428A (en) * 2020-09-18 2020-12-29 青岛大学 Medical image segmentation method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11087165B2 (en) * 2018-11-29 2021-08-10 Nec Corporation Method and system for contextualizing automatic image segmentation and regression

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Symmetric Fully Convolutional Residual Network With DCRF for Accurate Tooth Segmentation; Yunbo Rao et al.; Digital Object Identifier; 2020-05-29; 92028-92038 *
Boosting the rule-out accuracy of deep disease detection using class weight modifiers; Alexandros Karargyris et al.; https://arxiv.org/pdf/1906.09354v1.pdf; 2019-06-21; 1-5 *
MobileNetV2: Inverted Residuals and Linear Bottlenecks; Mark Sandler et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-17; 4510-4520 *
STAR-CAPS: Capsule Networks with Straight-Through Attentive Routing; Karim Ahmed et al.; 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019-12-31; 1-10 *
Research on a multi-receptive-field visual edge intelligent recognition method for identifying ice coating thickness on transmission lines; Ma Fuqi et al.; Power System Technology; 2020-08-28; 1-9 *

Similar Documents

Publication Publication Date Title
CN109886986B (en) Dermatoscope image segmentation method based on multi-branch convolutional neural network
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN113344933B (en) Glandular cell segmentation method based on multi-level feature fusion network
CN111598894B (en) Retina blood vessel image segmentation system based on global information convolution neural network
CN110782427B (en) Magnetic resonance brain tumor automatic segmentation method based on separable cavity convolution
CN112150476A (en) Coronary artery sequence vessel segmentation method based on space-time discriminant feature learning
CN113034505A (en) Glandular cell image segmentation method and device based on edge perception network
CN113610859B (en) Automatic thyroid nodule segmentation method based on ultrasonic image
CN112818920B (en) Double-temporal hyperspectral image space spectrum joint change detection method
CN110599502A (en) Skin lesion segmentation method based on deep learning
CN114821682B (en) Multi-sample mixed palm vein identification method based on deep learning algorithm
CN113706545A (en) Semi-supervised image segmentation method based on dual-branch nerve discrimination dimensionality reduction
CN115311194A (en) Automatic CT liver image segmentation method based on transformer and SE block
Shan et al. SCA-Net: A spatial and channel attention network for medical image segmentation
CN114119525A (en) Method and system for segmenting cell medical image
CN116758336A (en) Medical image intelligent analysis system based on artificial intelligence
CN117036288A (en) Tumor subtype diagnosis method for full-slice pathological image
CN115661165A (en) Glioma fusion segmentation system and method based on attention enhancement coding and decoding network
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN116823868A (en) Melanin tumor image segmentation method
CN116912253A (en) Lung cancer pathological image classification method based on multi-scale mixed neural network
CN113192076B (en) MRI brain tumor image segmentation method combining classification prediction and multi-scale feature extraction
CN114882282A (en) Neural network prediction method for colorectal cancer treatment effect based on MRI and CT images
CN115147303A (en) Two-dimensional ultrasonic medical image restoration method based on mask guidance
CN114140830A (en) Repeated identification inhibition method based on circulating tumor cell image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220503