CN112614131A

CN112614131A - Pathological image analysis method based on deformation representation learning

Info

Publication number: CN112614131A
Application number: CN202110027548.5A
Authority: CN
Inventors: 张玥杰; 徐际岚
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2021-01-10
Filing date: 2021-01-10
Publication date: 2021-04-06

Abstract

The invention belongs to the technical field of medical image processing, and particularly relates to a pathological image analysis method based on deformation representation learning. The method comprises the steps of constructing an automatic supervision deformation representation learning model for pathological image analysis and then for classification and segmentation of pathological images; the learning model includes: the system comprises a deformation module, a local heterogeneous characteristic sensing module and a global homogeneous characteristic sensing module; the deformation module is used for carrying out elastic deformation operation on the image; the local heterogeneous characteristic sensing module is used for learning structural difference information of local areas in the image due to deformation; the module comprises a feature extractor network, a multi-scale feature network and a discriminator network; and the global homogeneous feature perception module is used for realizing the learning process of the network on the global features of the pathological images. The method can learn the ability of extracting local structural features without marking data, and can learn the global semantic information of the pathological image; compared with the best self-supervision learning method at present, the performance is greatly improved.

Description

Pathological image analysis method based on deformation representation learning

Technical Field

The invention belongs to the technical field of medical image processing, and particularly relates to a pathological image analysis method.

Background

In recent years, deep learning has achieved a pleasing outcome in a number of different computer vision applications. As one of the very considerable problems, there is a need and market for automated pathological image analysis, which can provide reliable statistical data for subsequent analysis and clinical diagnosis, especially in the diagnosis and intervention treatment of some important diseases (such as cancer, survival prediction, etc.). Many successful methods use fully supervised learning to extract effective features in images, but the training process of these models requires a large number of labels as labeling information to construct supervision signals, which is very difficult in practical application. The method is mainly characterized in that a large amount of labels need to consume huge manpower and material resources, and meanwhile, labeling pathological images needs a professional doctor to spend time and energy, so that the working efficiency of the doctor is reduced.

Recently, unsupervised learning has raised a wave of heat, and there are many influential unsupervised tasks to complete learning by exploring the characteristics of data itself rather than labels. This is in line with the trend of medical images, because there is a lot of label-free data, and only a small part of it carries label data. As a new unsupervised learning paradigm, the purpose of the self-supervised learning is to enable the network to learn a series of characteristics rich in semantic information by constructing a pre-training task without real labels and only artificially constructing some supervision signals. In the previous work, there are some different design ideas with the training task, including the turning angle of the predicted image, the coloring of the image, the filling of the hole in the image, etc., which are proved to be very effective. Generally, a way to prove whether a pre-training task for supervised learning is valuable is measured by different downstream task performances. In the process, the network weight obtained in the pre-training process can be used for initializing parameters of downstream tasks, and the learned high-level representation can be directly extracted and applied to different downstream tasks. These downstream tasks are of a wide variety, and image classification, image segmentation, object detection, and the like are common. Although there have been a great many self-supervised learning works successfully applied to natural images, how to learn from pathological images to an effective high-level representation in a self-supervised manner still remains a very challenging problem, which is mainly due to the cross-domain gap existing between natural images and pathological images.

In biomedical image processing, deformation is often used to perform data enhancement expansion or image registration. It has been observed that deformation is a very important feature in pathological images. For example, for benign images, the glandular structures are very regular, usually appearing as circles, ellipses, etc.; however, in the case of hypo-differentiation or severe differentiation, the glands become very irregular, severely degenerated or deformed structures. Therefore, the invention uses deformation as a supervision signal, and provides an automatic supervision deformation representation learning framework to assist pathological image analysis. The objective is to hopefully play a role in different pathological image analysis downstream tasks, such as pathological image segmentation and classification, etc., by learning useful features. When constructing the task of self-supervised deformation representation learning, two important properties are based on local heterogeneity and global homogeneity. As shown in fig. 1, given an original pathological image and a deformed image, there are local structural differences caused by deformation, and these differences can be reflected in the feature space. In particular, the mutual information is used to measure the original pathological image and the deformed pathological image. The main purpose is to expect that the network can extract some key high-level features (such as the presence or absence of glands) to make a decision rather than noise information in the image. Meanwhile, structural differences are likely to occur at multiple positions and multiple scales of the image, and the network has more robust multi-scale sensing capability due to the fact that a feature enhancement module is provided and inserted into the existing network for end-to-end training. In contrast, since the original image and the deformed image are always generated by transforming one image, the original image and the deformed image have more similar context information than other images in the data set. As shown in the right half of fig. 1, global similarity can be considered as the two having a closer distance in the feature space, and such a feature space is desired to be learned. To this end, two different methods are proposed to pull similar samples closer together in the feature space and dissimilar samples farther apart in the feature space, referred to as hard target global homogeneity and soft target global homogeneity, respectively. And a local heterogeneity and global homogeneity representation learning framework is utilized, so that the network learns local and global pathological image features in a self-supervision manner. Based on evaluation performed on different downstream tasks, namely migration learning to pathological image segmentation and semi-supervised classification tasks, the strong generalization capability of the proposed method is embodied on the two tasks.

Disclosure of Invention

The invention aims to provide a pathological image analysis method capable of utilizing self-supervision learning to effectively extract image characteristics and act on various different downstream tasks on the premise of a large amount of unmarked pathological image data and a small amount of marked pathological image data.

The invention provides a pathological image analysis method (DRL) by utilizing Self-supervised Learning, which is based on a Deformation Representation Learning technology with high-efficiency data utilization rate, namely, firstly constructing an Self-supervised Deformation Representation Learning model for pathological image analysis and then classifying and segmenting pathological images; the self-supervision deformation representation learning model specifically comprises three components, namely a deformation module, a local heterogeneous characteristic sensing module and a global homogeneous characteristic sensing module; wherein:

and the deformation module is used for performing elastic deformation operation on the input image. The deformation operation adopts a transformation mode of elastic deformation commonly used in medical image modeling, essentially, a random variable in the x direction and the y direction within the range of [ -1,1] is generated at each position (i, j) of an image, the random variables respectively represent random offset in two directions, and then the random offset is subjected to convolution operation through a Gaussian core with zero mean and variance sigma as shown below.

And obtaining a smooth offset map through convolution operation, and guiding the original map to perform position transformation by using the offset of each point. Here, the converted map is obtained simply by using a bilinear interpolation method. At this point, the entire image has completed random elastic deformation.

The local heterogeneous feature perception module is used for learning structure difference information of local regions in the image due to deformation. The module contains three learnable networks: a feature extractor network, a multi-scale feature network, and a discriminator network. The whole local feature perception learning process is completed by three steps:

firstly, an image feature encoder is adopted to obtain high-level semantic features of an input image (namely a pathological image) and generate image feature representation. From the original input pathology image x_iAnd corresponding deformed pathological image

Respectively obtaining corresponding feature extractors through a feature extractor network: f (x)_i(ii) a Theta) and

where θ is a parameter of the feature extractor F. To estimate the amount of information that the high-level features can represent the original input x, a common method is to use mutual information to represent:

I(X,Y)＝KL(p(x,y)||p(x)p(y)) (2)

wherein p (X, Y) is the joint probability distribution of X and Y; p (x) and p (y) are marginal probability distributions, maximizing mutual information of both is equivalent to optimizing the KL divergence between the two distributions. As shown in fig. 1, for image x_iAnd its deformed image

It is generally desirable to have a deep representation of each of the two

Can contain structural differences after deformation, thus maximizing

Both carrying information of the same image. At the same time, to minimize

Since both actually contain the structural differences of deformation on the same image. However, since mutual information is very inconvenient to calculate in high dimensions, we choose to estimate the value of mutual information using JS distance for this purpose, namely:

wherein sp (x) log (1+ e)^x) Is the softplus function. D (X; phi) is a discriminator with a parameter phi and is used for distinguishing positive samples from negative samples.

The second step is multi-scale feature extraction, and firstly positive sample pairs are constructed

And negative sample pair

The principle of construction is that the original pathological image x_iAnd a feature representation

Are different representations from the same image, but the negative example is from a deformed pathological image

And the original pathological imageIs characterized by

The combination of the components is characterized in that the structural difference caused by deformation is contained. Due to the fact that structural differences caused by deformation have multi-scale characteristics, single-scale features are formed

And

obtaining F (x) by using feature enhancement module FEM_i(ii) a Theta) and

for representing multi-scale features, x in positive sample pairs_iReplacement by F (x)_i(ii) a Theta), of negative sample pairs

Is replaced by

To this end, the negative examples are paired

The method reflects the difference of the original image in a single-scale characteristic and a deformation structure on multiple scales of the deformation image. The feature enhancement module FEM is used because considering that in some pathological images, some glands with severe differentiation have large cavity sizes in the middle and even span the whole image, in which case a single-scale neural network cannot effectively capture multi-scale information spread over multiple positions. In the classical segmentation model, the pyramid pooling model of the void space uses 4 parallel convolution kernels with void rates of 1, 6, 12 and 18 in the original module, i.e. the sizes of the convolution kernels are fixed. Compared with the convolution kernel with fixed size, the feature enhancement module FEM used in the invention adopts a dynamic mode of determining the size of the convolution kernel according to the size of the feature map. Specifically, k is defined as log₂M convolution kernels with the size of s and the void ratio of r₀＝1,r₁＝2,r₂＝6,…,r_k-1＝6×2^k-3Then the actual size of these convolution kernels is:

wherein i is more than or equal to 0 and less than or equal to k-1. By definition of the hole convolution, assuming that the original convolution kernel size s is 3, the maximum convolution kernel and its actual convolution kernel size can be deduced

Thus, the size of the hole convolution kernel can be ensured to be larger than the size of the feature map, namely, a sufficient receptive field can be ensured to cover the whole feature map. In addition, the feature enhancement module FEM may also ensure that long range dependencies in the image can be maintained. When a larger feature size such as a 5 × 5 convolution kernel is used, the actual convolution kernel size is larger and still meets the requirements. Through the convolution kernels of the k different void rates, the features of the image can be extracted in parallel, and the features are connected to form the final multi-scale feature. To ensure fairness, the feature enhancement module is only used in the framework of deformation representation learning to fully extract features of the image and not used for downstream tasks.

Thirdly, respectively sending the positive and negative sample pairs into a discriminator D (X)^EncF (X; θ); phi), where the network parameter of the discriminator is phi, it is desirable that the discriminator outputs 1 for positive sample pairs and 0 for negative sample pairs. Specifically, it consists of three fully convolutional layers with dimensions 512, 256, and 256, respectively. Next, a local heterogeneity loss function is calculated from the output of the arbiter, namely:

where N represents the number of all samples and sp () represents the softplus function. In the whole calculation process，

F(x_i(ii) a Theta) are all the same size, and width and height are both M. In the construction process of the local heterogeneity loss function, the supervised label information is formed by structural differences brought by deformation operation. In the whole process, the images do not need to be manually marked, so that the network end-to-end self-supervision learning is realized.

And the global homogeneous feature perception module is used for realizing the learning process of the network on the global features of the pathological images. First, every sample x in a batch (batch)_jAll go through a feature extractor to obtain features

Then all the obtained F (x) is obtained through a characteristic enhancement module_j(ii) a θ), finally normalized by the norm of L2, after which features are aggregated into a Global feature F using Global Average Pooling (GAP)_G(x_j(ii) a θ). To this end, all the multi-scale features in one batch are acquired. Next, a pair of positive samples F is constructed in the feature space_G(x_i；θ),

And negative sample pair F_G(x_i；θ),

Where j ≠ i.

Next, a global homogeneity loss function is constructed in two different ways, and a supervised network learns global characteristics:

first, the global homogeneity of the hard target is used, i.e. the distance of the positive sample pair in the feature space is directly pulled close and the distance of the negative sample pair in the feature space is pulled far by the contrast loss function. To this end, the distance metric is defined as the cosine distance, i.e.:

calculating a global homogeneity loss function based on a hard target according to a given cosine distance metric, namely:

the hard object can effectively reduce the distance between the original image and the deformation image in the feature space, but different slices are probably from the same original full-resolution image due to the characteristics of pathological slice images. Therefore, considering the similarity of the original and other slice images as well, then defining soft objects on the basis of hard objects is:

wherein Z is a normalization factor. Next, the soft target is applied to the contrast loss function, the distance between the original image and more similar targets is simultaneously reduced, and the calculation process is regarded as the reverse calculation for realizing the hard target, so that the soft target global homogeneous loss function is as follows:

finally, the overall loss function is calculated as:

L＝L_LSH+λL_GCH (10)

through a pathological image analysis method constructed by a local heterogeneous characteristic sensing module and a global homogeneous characteristic sensing module, the network learns the capacity of extracting effective structural characteristics of pathological images in a self-supervision learning mode and has wide generalization capacity.

The invention also provides a generalization ability evaluation based on the pathological image analysis method based on deformation representation learning, wherein the generalization ability evaluation comprises pathological image segmentation and pathological image classification.

Pathological image segmentation: and initializing a segmentation network by using the self-supervision deformation expression learning pre-trained feature extractor, and performing end-to-end training on pathological image data with segmentation labels and labels. Compared with other pre-training methods, the method also utilizes a similar mode to initialize parameters, a training target adopts a cross entropy loss function, and the training parameters are updated by utilizing random gradient descent until the model converges and then is evaluated.

Classifying pathological images: the task includes two different evaluation modes:

(1) in the evaluation method, a pre-trained feature extractor is used for initializing a classification network, and then the whole network is finely adjusted on labeled data with different proportions, namely end-to-end training is carried out. Here, several discrete proportions are selected, the interval being between 0.01% and 100%;

(2) and in the semi-supervised classification mode, linear classifier evaluation is adopted, and after a classification network is initialized by using a pre-trained feature extractor, parameters of a feature extractor part are fixed, only the classifier part of the classification network is trained, and classifiers are connected on different intermediate convolutional layers to evaluate the feature performances of different layers.

The advantages of the invention include:

firstly, exploring deformation is a very important characteristic in a pathological image, constructing a pathological image analysis method based on deformation expression learning by taking the deformation as a supervision signal in self-supervision learning for the first time, and designing the proposed framework specially for pathological image analysis, so that the framework can be used as a brand new paradigm for preprocessing in pathological image analysis.

Secondly, a local heterogeneity feature-based sensing module is firstly provided, a mode that a discriminator is used for distinguishing local structural differences between an original image and a deformed image is used, the network is supervised to learn high-level rich semantic features, and meanwhile, a feature enhancement module is combined, so that the network can learn the capability of extracting local structural features without marking data in a self-supervision learning framework.

Moreover, on the basis of the self-supervision contrast learning loss widely used for natural images, a global homogeneous feature perception module is provided by combining the self characteristics of pathological images, wherein the global homogeneous loss of hard targets and soft targets is utilized, a feature space can be effectively constructed to enable similar samples to be closer in distance, and therefore the global semantic information of the pathological images is learned.

Finally, the invention obtains good results on the public data sets GLaS and PCam, has greatly improved performance compared with the best self-supervision learning method at present, is also comparable with the network performance pre-trained by ImageNet, and can shorten the distance between the public data sets GLaS and PCam and the fully-supervised learning model.

Drawings

The left image of fig. 1 is a schematic diagram of local heterogeneity in an image, and the right image is a schematic diagram of global homogeneity.

FIG. 2 is a complete framework diagram of the morphable representation learning of the present invention.

Detailed Description

As is known in the background art, most of the previous studies have faced the following two problems:

1. pathological image analysis requires a large amount of labeled data related to tasks, consumes a lot of time and energy of doctors, and has high requirement on professional knowledge, so that the generalization capability of most fully supervised models is limited by a small amount of labeled data.

2. The self-supervision representation learning task emerging on the natural image is inconvenient to migrate to the medical image analysis, and because the two images have large domain difference, a proper representation learning method needs to be designed based on the self characteristics of pathological image data.

The invention carries out deep research aiming at the two problems, firstly, the self characteristics of the pathological images are explored, a monitoring signal is constructed by focusing on the deformation characteristics in the pathological images, a self-monitoring deformation expression learning framework is provided, and the network learns the semantic characteristics of the pathological images without artificial labels by utilizing the local heterogeneity and the global homogeneity of the original images and the deformation images, so that the generalization performance of the model is enhanced, and the method can be adapted to different downstream tasks, namely various pathological image analysis tasks.

The pathological image deformation characteristics which are first explored by the invention are shown in figure 1, and the gland shape in the benign pathological image is observed to be in a round or elliptical regular shape, while the gland in the malignant case is in a very irregular structure such as distortion and deformation, so that the structural difference is related to the property of the pathological image. The other information in fig. 1 is a simple illustration of local heterogeneity and global homogeneity, which has already been mentioned above and will not be described herein again.

The implementation process of the invention mainly comprises the steps of extracting the semantic features of the pathological images through a feature extractor, and respectively capturing the local structural difference and the global context similarity of the pathological images through a local heterogeneity loss function and a global homogeneity loss function so as to supervise the network learning of the high-level semantic features. The whole pre-training process does not need manual labels, and the feature learning is completely carried out in a self-supervision learning mode. After the pre-training stage is finished, the feature extractor is partially taken out to initialize network parameters of downstream tasks including pathological image classification, pathological image segmentation and the like. Given a small amount of labeled data related to different tasks for evaluation, the self-supervision deformation representation learning of the invention can embody strong generalization performance on a plurality of different tasks as pre-training.

As shown in fig. 2, the deformation expression learning method provided by the present invention is composed of three parts in total.

The first part is to perform appropriate deformation operation on the original image, first generate a random offset represented by a random variable with x direction and y direction within the range of [ -1,1] at each position (i, j) of the image, and then pass through a zero-mean, σ -variance gaussian kernel, that is:

and then, performing convolution on the original offset map by using a Gaussian kernel to obtain a smoothly transited offset map, guiding the original map to perform position transformation by using each point on the smoothly transited offset map, and obtaining the transformed map by simply using a bilinear interpolation method. Besides the transformation operation, there are transformation methods such as affine transformation, projective transformation, etc., but elastic transformation, which is a non-rigid transformation, can have a more flexible transformation method. For example, transforming from straight lines to curved lines or vice versa, with some randomness at each location, prevents the network from finding trivial, low-complexity trivial solutions in subsequent representation learning processes.

In this embodiment, the second step constructs a loss function based on local heterogeneity. First, from an input image x_iAnd

respectively obtaining corresponding x by a feature extractor^Enc,

Wherein the feature extractor is F and the parameter is theta, and then constructing a positive sample pair (x)_i,x^Enc) And negative sample pair

Here, the mutual information is used to weigh the original image and features. The significance is that the larger the mutual information is, the more valuable semantic information rather than random noise in the image can be represented by the proven features better, and different from the traditional supervised learning framework in which a loss function with labeled information is used to supervise network learning, the network can learn the general high-level feature representation by using the mutual information, and the generalization capability is stronger. First, the mutual information can be expressed as:

I(X,Y)＝KL(p(x,y)||p(x)p(y)) (2)

where p (x, y) is the joint probability distribution, p (x), and p (y) is the edge probability distribution. Here, x will be input_iAnd the features replace X and Y in the formula, respectively, then the goal can be expressed as finding the optimal network parameters to maximize the mutual information, i.e.:

next, how to construct the loss function by deformation is described. As shown in FIG. 1, given an image x_iAnd its deformed image

It is expected that the deep representation of both should also be able to contain structural differences after deformation, as they carry the information of the same image. At the same time, to minimize

Since they actually contain the deformed structural differences on the same image. However, since mutual information is very inconvenient to compute in high dimensions, it is chosen to estimate the value of the mutual information using JSD for this purpose, namely:

wherein sp (x) log (1+ e)^x) Is the softplus function. In practice, D (X; φ) is a discriminator with parameters φ, which distinguishes between positive and negative examples. This is achieved using three full convolutional layers with dimensions 512, 256 and 256, respectively. In this embodiment, it is observed that in some pathological images, the size of the cavity in the middle of some glands with severe differentiation is large, even spanning the whole image, and in this case, a single-scale neural network cannot effectively capture multi-scale information spread over multiple positions. Thereby characterizing a single scale

And

obtaining F (x) by a feature enhancement module FEM_i(ii) a Theta) and

for representing multi-scale featuresX in the positive sample pair_iReplacement by F (x)_i(ii) a Theta), of negative sample pairs

Is replaced by

By constructing the feature enhancement module, local difference information on different scales is maintained through modeling, so that the network is more robust to the structural difference of different pathological images.

In the present embodiment, the pathological images have features of global similarity in addition to the features of local heterogeneity. Considering that the original image and the deformed image do not change the pixel information of the original image after being deformed, the process can be simply understood as a pixel small-amplitude rearrangement process, so that the global context information does not change greatly, i.e. the high-level semantic information of the global context information still remains similar, especially compared with other pathological images in the data set. To maintain this global homogeneity, each sample x in a batch is then first sampled_jAll go through a feature extractor to obtain features

Then all the obtained F (x) is obtained through a characteristic enhancement module_j(ii) a θ), finally normalized by the L2 norm, followed by clustering of features into a global feature F using global mean pooling_G(x_j(ii) a θ). So far, all the multi-scale features in one batch are obtained. Next, a pair of positive samples F is constructed in the feature space_G(x_i；θ),

And a negative sample pair. First, the cosine distance is chosen as a metric, namely:

secondly, the distance between the positive sample pairs is drawn in, the drawingThe distance between far negative sample pairs can be realized by a contrast loss function, and the deformation sample in the positive sample pair is considered as the original image sample characteristic F_G(x_i(ii) a Theta) hard object desired to be approximated in feature space

Thus, the hard target global similarity loss can be expressed as:

the hard target is a single target which considers the deformation image as the original image approximation, but in the actual data set, because a plurality of different pathological section images are possibly from the same pathological full-resolution image, even a plurality of different images of the same patient should have certain similarity. Therefore, considering a single similar approximation target, such a similarity relationship between the original image and other images is ignored. To allow sufficient consideration of the similarity between the original and more objects without additional supervision information (e.g. whether the different images are from the same patient or the same full resolution map), soft objects are defined

The following were used:

wherein Z is a normalization factor. Such soft target can be regarded as a weighted combination of a plurality of targets, and the weight information of the soft target is obtained by transforming the high-level features, and belongs to a part which can be adaptively adjusted in end-to-end training. By constructing a soft target and then implementing the calculation of a similar reverse hard target, the soft target global homogeneous loss function is defined as:

finally, the overall loss function is calculated as:

L＝L_LSH+λL_GCH (9)

by constructing a local heterogeneous loss function and a global homogeneous loss function, the network completes the deformation expression learning process in a self-supervision normal form without manual labels. The embodiment also comprises the step of applying the pre-trained feature extraction part after deformation representation learning to different tasks of pathological image classification and segmentation.

In this embodiment, two common data sets are selected for the source of the pathological image data to perform network training, i.e., GLaS and pca. GLaS is a standard data set provided on the MICCAI 2015 gland segmentation competition, containing 85 training images (37 benign and 48 malignant) and 80 test set images (37 benign and 43 malignant), all images obtained at 20-fold magnification from 16H & E stained tissue sections. Most images are 755 pixels wide and 522 pixels high, and contain segmentation labels marking the position of the gland. The PCam data set consisted of 327,680 image slices, each 96X 96 image slice, from 400 image slices, each image slice containing a classification label to indicate whether metastatic tissue was present. The whole data set is divided into 70%, 15% and 15% as a label training set, a verification set and a test set respectively.

Experiments in the implementation of the invention were trained on 2 nvidiageforcgtxtitanxp GPUs, Adam optimizers were selected from the training deformation representation learning models, and the initial learning rate was defined as 0.0001 and the blocksize was set to 8. On the evaluation of the segmentation and classification tasks, the hyperparameter λ in the loss function is set to 0.01 and 0.5 by grid search in the pre-training phase, respectively. In order to maximize the use of unlabeled data, the training set and validation set of the GLaS dataset were cropped using a sliding window of 256 and a step size of 64 to obtain a larger-scale slice dataset. Meanwhile, VGG is adopted as a default feature extractor, which takes the VGG-like model as a comparison method into consideration. In addition, the value of the standard deviation sigma of the magnification factor a and the Gaussian kernel in the elastic deformation is researched by utilizing grid search, and finally the deformation effect obtained in the range of sigma belonging to [0.03S,0.05S ] is moderate, wherein S represents the smaller of the width and the height of the image. In the present embodiment, the F1 value, the Dice coefficient at the Object level, and the hough distance at the Object level are selected as evaluation indexes of segmentation. And because the positive and negative samples are balanced, the performance of the classification model is measured by adopting the classification accuracy. In the downstream segmentation task, through the experiment of a verification set, a random gradient descent SGD optimizer with an initial learning rate of 0.005 is adopted, the batch size is set to be 2, and the whole segmentation network adopts an encoder-decoder structure. The encoder is initialized by using pre-trained parameters, and the decoder is built into a structure symmetrical to the encoder and uses random initialization. On the classification task, the present embodiment employs an Adam optimizer, the learning rate is set to 0.005 and the batch size is determined to be 128. In order to fully play the role of the feature extractor, namely fully show the feature level learned by the deformation representation learning, the feature enhancement module is not combined and adopted, and on the other hand, the network parameters can be reduced.

The model which is provided by the invention and is learned and pre-trained through deformation representation has good effect on two data sets of GLaS and PCam. In the GLaS, the pre-trained segmentation models obtained 0.900, 0.896, and 50.55 at the F1 score, the Object level Dice coefficient, and the Object level hausdorff distance, respectively, exceeded about 4% of the randomly initialized baseline segmentation network and also exceeded about 1% of the pre-trained models on ImageNet. Compared with the performance of each convolution module of the pre-training model, the performance of the pre-training model shows certain migration capacity on both deep layers and shallow layers, and the generalization effect of the deep layers is better relatively. On PCam, when semi-supervised classification evaluation is adopted, when the pre-training model is subjected to fine tuning by using a small amount of labeled data of 0.01%, 1% and 10%, the average effect exceeds 10% of that of the baseline model, and the learned rich characteristics of the pre-training model are completely shown. In the process of using linear classification capability evaluation, a feature extractor part is fixed, only partial parameters of a classifier are trained, and the result also proves that the performance of a pre-trained model on a plurality of convolution layers can exceed that of a method which is obviously performed on a natural image. Finally, ablation experiments are also used to prove that the local heterogeneity loss function and the global homogeneity loss function can assist in the pre-trained feature learning.

In summary, the invention provides a novel self-supervision deformation representation learning framework aiming at the deformation characteristics of the pathological image on the premise of dealing with less labeled data, high labeling cost and more labeled samples. The framework mainly aims to construct two loss functions of local heterogeneity and global homogeneity through deformation, so that a network can learn local structural features and global semantic information of pathological images under a self-supervision paradigm, and can be generalized to different downstream tasks including pathological image classification and pathological image segmentation through transfer learning. The invention obtains good results on the public data sets GLaS and PCam, the performance is greatly improved compared with other self-supervision methods widely used at present, and the model can be used as a brand-new pre-training method for pathological image analysis.

Pathological section analysis of tissue is a fundamental step in the disease assessment process. The automatic pathological image analysis needs a large amount of task-related labeled data to train the model, insufficient labeled data amount can limit the generalization capability of the supervised learning model, and the work is time-consuming and labor-consuming in real life. The invention provides a learning framework based on self-supervision deformation expression, which learns semantic information from non-labeled data. As a new expression learning paradigm, the framework utilizes deformation as a supervisory signal for expression learning based on local heterogeneity and global homogeneity of pathological images. Given an original image and a morphed image, both produce a distortion difference in local structure, but at a global level they still retain global similarity to other pathological images in the dataset. For this purpose, a feature extractor is specially trained to measure local structural differences by calculating mutual information and global similarity by comparing losses. Based on evaluations performed on different downstream tasks, including pathological image segmentation and classification, the proposed framework goes beyond much of the previous work of self-supervised learning and is comparable to the level of pre-trained networks based on ImageNet. Therefore, the method can be used as a brand-new pre-training paradigm for pathological image analysis.

Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto, and those skilled in the art can make possible variations and modifications of the present invention using the above-described methods and techniques without departing from the spirit and scope of the present invention. Therefore, any simple modification, equivalent change and modification made to the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention, unless the content of the technical solution of the present invention is departed from.

Claims

1. A pathological image analysis method using self-supervision learning is characterized in that a deformation representation learning technology with high efficiency data utilization rate is adopted, and the method comprises the steps of constructing a self-supervision deformation representation learning model for pathological image analysis and then for pathological image classification and segmentation; the self-supervision deformation representation learning model specifically comprises three parts, namely a deformation module, a local heterogeneous characteristic sensing module and a global homogeneous characteristic sensing module; wherein:

the deformation module is used for performing elastic deformation operation on the input image;

the local heterogeneous feature perception module is used for learning structure difference information of local regions in the image due to deformation; the module contains three learnable networks: a feature extractor network, a multi-scale feature network, and a discriminator network; the whole learning process of the local heterogeneous feature perception module is divided into three steps:

firstly, acquiring high-level semantic features of an input image by adopting an image feature encoder to generate image feature representation; in particular, from the original input pathology image x_iAnd corresponding deformed pathological image

wherein θ is a parameter of the feature extractor F; to estimate the amount of information that the high-level features can represent the original input x, it is represented by mutual information:

I(X，Y)＝KL(p(x，y)||p(x)p(y)) (2)

wherein p (X, Y) is the joint probability distribution of X and Y; p (x) and p (y) are marginal probability distributions, maximizing mutual information of the two is equivalent to optimizing the KL divergence between the two distributions; for image x_iAnd its deformed image

It is desirable to have a deep representation of each of the two

Can contain structural differences after deformation, thus maximizing

The two images carry the information of the same image; at the same time, to minimize

Because both actually contain the deformation structure difference on the same image; for this purpose JS distance is used to estimate the value of the mutual information, i.e.:

wherein sp (x) log (1+ e)^x) Is the softplus function; d (X; phi) is a discriminator with phi as a parameter and is used for distinguishing positive samples from negative samples;

secondly, extracting multi-scale features; first, positive sample pairs are constructed

And negative sample pair

Is a different representation method from the same image, and the negative sample is a deformed pathological image

And features of the original pathology image

The combination is formed, and the structural difference caused by deformation is contained in the combination; due to the fact that structural differences caused by deformation have multi-scale characteristics, single-scale features are formed

And

obtaining F (x) by using feature enhancement module FEM_i(ii) a Theta) and

Is replaced by

Thus, the negative sample pair

The difference between the single-scale characteristics of the original image and the deformation structure of the deformation image on multiple scales is reflected;

thirdly, respectively sending the positive and negative sample pairs into a discriminator D (X)^EncF (X; θ); phi) which is a network parameter of the discriminator, and requires the discriminator to output 1 to the positive sample pair and 0 to the negative sample pair; specifically, the discriminator consists of three full convolution layers with dimensions 512, 256, and 256, respectively; calculating a local heterogeneity loss function from the output of the discriminator, namely:

wherein N represents the number of all samples, and sp () represents the softplus function; in the whole process of the calculation,

F(x_i(ii) a Theta) are all the same size, and the width and the height are both M; in the construction process of the local heterogeneity loss function, the supervised label information is formed by the structural difference brought by deformation operation, and the whole process does not need to manually label the image, so that the end-to-end network self-supervision learning is realized;

the global homogeneous feature perception module is used for realizing the learning process of the network on the global features of the pathological images; first, each sample x in a batch_jAll go through a feature extractor to obtain features

Then all the obtained F (x) is obtained through a characteristic enhancement module_j(ii) a θ), finally normalized by the L2 norm, followed by clustering of features into a global feature F using global mean pooling_G(x_j(ii) a θ); all multi-scale features in one batch are obtained; next, a pair of positive samples F is constructed in the feature space_G(x_i；θ)，

And negative sample pair F_G(x_i；θ)，

Wherein j ≠ i;

then, constructing a global homogeneity loss function, and supervising network learning global feature learning:

according to the global homogeneity of the hard target, namely, the distance of the positive sample pair in the feature space is directly shortened by using a contrast loss function, and the distance of the negative sample pair in the feature space is lengthened; to this end, the distance metric is defined as the cosine distance, i.e.:

the hard target can effectively shorten the distance between the original image and the deformation image in the feature space;

defining a soft target on the basis of a hard target as:

wherein Z is a normalization factor; and (3) applying the soft target to a contrast loss function, simultaneously shortening the distance between the original image and more similar targets, and regarding the calculation process as the reverse calculation for realizing the hard target, so that the soft target global homogeneous loss function is as follows:

finally, the overall loss function is calculated as:

L＝L_LSH+λL_GCH (10)。

2. the pathological image analysis method using self-supervised learning as defined in claim 1, wherein the elastic deformation operation is to generate a random variable in the x-direction and the y-direction within the range of [ -1,1] at each position (i, j) of the image, which represents the random offset in two directions, and then perform convolution operation on the random offset by using a zero-mean variance and a variance of σ as follows:

and obtaining a smooth offset map through convolution operation, and guiding the original map to perform position transformation by using the offset of each point.

3. The pathological image analysis method using self-supervised learning as defined in claim 1, wherein the feature enhancement module FEM dynamically determines the size of the convolution kernel according to the feature map size; specifically, k is defined as log₂M convolution kernels with the size of s and the void ratio of r₀＝1，r₁＝2，r₂＝6，...，r_k-1＝6×2^k-3Then the actual size of these convolution kernels is:

wherein i is more than or equal to 0 and less than or equal to k-1.

4. The pathological image analysis method using self-supervised learning according to claim 3, wherein the pathological images are segmented and classified using the constructed self-supervised deformation representation learning model;

pathological image segmentation: initializing a segmentation network by using the self-supervision deformation expression learning pre-trained feature extractor, and realizing end-to-end training of pathological image data with segmentation labels and labels;

classifying pathological images: two different ways are involved:

(1) a semi-supervised classification mode, namely, adopting a few data classification, firstly initializing a classification network by using a pre-trained feature extractor, and then finely adjusting the whole network on labeled data with different proportions, namely, performing end-to-end training;

(2) semi-supervised classification mode-linear classifier is adopted, after the classification network is initialized by using the pre-trained feature extractor, the parameters of the feature extractor part are fixed, only the classifier part of the classification network is trained, and classifiers are connected on different intermediate convolution layers to evaluate the feature performances of different layers.