CN112767325B

CN112767325B - Automatic detection method and system for cancer pathology image

Info

Publication number: CN112767325B
Application number: CN202110015477.7A
Authority: CN
Inventors: 姚海龙; 孟昕悦; 游惠捷
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2021-01-05
Filing date: 2021-01-05
Publication date: 2024-06-28
Anticipated expiration: 2041-01-05
Also published as: CN112767325A

Abstract

Embodiments of the invention provide a method and a system for automatically detecting cancer pathology images. The method comprises: extracting a foreground image set from the full-slice image set using a preset extraction algorithm; performing blocking processing on the foreground image set to obtain a blocked image set, and extracting the class labels of the labeled data set; inputting the blocked image set and the class labels into an EM model semi-supervised learning framework for model training to obtain the distribution probability of the labeled data set and the distribution probability of the unlabeled data set; calculating both distribution probabilities according to a probability map standardization algorithm to obtain a cancer probability distribution map; and processing the full-slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result. For automatic cancer region detection in full-slice pathology images, the invention provides a semi-supervised algorithm framework based on expectation maximization, which enables more accurate pathological typing and grading of tissue in the primary cancer region.

Description

Automatic detection method and system for cancer pathology image

Technical Field

The invention relates to the technical field of pathological image diagnosis, in particular to an automatic detection method and an automatic detection system for cancer pathological images.

Background

At present, cancer diagnosis still relies on manual methods. This is particularly true of breast cancer, where histopathological examination is usually required to reach the gold-standard diagnosis. Pathologists must analyze tissue sections, carefully examine lymph node sections of the breast, and observe tissue morphology in order to determine the pathological stage.

Manual diagnosis is prone to missed and false detections, and many studies have found that different pathologists often give different diagnoses for the same patient. Studies have indicated that pathologists agree in only 75.3% of breast cancer diagnoses, and for some unusual breast cancer categories the agreement is only 48%. A pathological section viewed under a microscope usually contains hundreds of billions of pixels, and a pathologist must examine more than one section per patient. The examination therefore takes considerable time; combined with a shortage of medical staff, this lengthens the interval between examination and diagnosis, so that the examination loses its timeliness and the patient may miss the optimal treatment window. In addition, reading slides requires not only medical equipment but also professionals with deep expertise and accumulated practice; in areas with insufficient medical resources, a severe shortage of professionals and equipment leads to misdiagnosis. Beyond the supply of medical resources, pathologists carry an excessive patient load, so small cancer areas are easily missed when slides are read quickly, which reduces diagnostic accuracy.

With the rapid development of artificial intelligence, many researchers have attempted to use artificial intelligence algorithms to automate the pathological grading of breast cancer, hoping to reduce labor costs by analyzing breast cancer images automatically. However, artificial intelligence algorithms generally rely on large amounts of labeled training data, and effective labeled data are difficult to obtain for medical images. This limits diagnostic accuracy, which is critical in medical diagnosis. Existing algorithms use a conditional generative adversarial network (CGAN) to generate images that approximate real data, but this introduces spurious features into the original image that can lead to subsequent misjudgment. Therefore, if the limitation imposed by small training data sets can be overcome, such algorithms can be better applied in practice.

Recent studies have shown that whole-slide images (WSI; called full-slice images in this document) are widely used for breast pathology diagnosis, because diagnosing from a WSI is equivalent to the conventional method of observing sections under a microscope. A WSI generally cannot be fed directly into a neural network. Existing algorithms either crop the WSI directly into the network's standard input size, which loses the global information of the pathology image so that global features cannot be extracted, or downsample the image to the standard input size, which lowers the resolution so that local detail features cannot be extracted. For breast cancer pathology images, the former cannot accurately distinguish carcinoma in situ from invasive carcinoma, and the latter impairs the extraction of key mitotic features. A pathology image feature extraction method that retains both global features and local detail features is therefore needed.

Disclosure of Invention

The embodiment of the invention provides an automatic detection method and an automatic detection system for a cancer pathology image, which are used for solving the defects in the prior art.

In a first aspect, an embodiment of the present invention provides a method for automatically detecting a cancer pathology image, including:

Reading a whole slice image set of the cancer to be detected;

extracting a foreground image set of the full-slice image set by adopting a preset extraction algorithm;

Performing blocking processing on the foreground image set to obtain a blocked image set, and extracting the class labels of the labeled data set in the blocked image set;

Inputting the blocked image set and the class labels into an EM model semi-supervised learning framework for model training to obtain a labeled data set distribution probability and an unlabeled data set distribution probability;

Calculating the labeled data set distribution probability and the unlabeled data set distribution probability according to a probability map standardization algorithm to obtain a cancer probability distribution map;

and processing the whole slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result.

Further, the extracting the foreground image set of the full-slice image set by adopting a preset extraction algorithm specifically includes:

Constructing an undirected graph, enabling each node in the undirected graph to correspond to one pixel point in the full-slice image set, and extracting the connection between adjacent pixels corresponding to each node to obtain an edge set;

Acquiring edge weights in the edge set, and calculating the edge weights to generate a minimum spanning tree by using a Kruskal algorithm;

Deleting edges with weights larger than a preset weight threshold in the minimum spanning tree to obtain a plurality of subtrees, and calculating RGB average values of the subtrees to obtain the maximum value of the RGB average values;

And dividing subtrees with the RGB average value larger than a preset RGB threshold value into background images, and taking the rest subtrees as the foreground image set.

Further, the step of performing a blocking process on the foreground image set to obtain a blocked image set, and extracting class labels of a label data set in the blocked image set specifically includes:

dividing the full-slice image set into a plurality of small blocks with preset sizes according to a preset overlapping rate;

And judging small blocks whose foreground area ratio is below a first preset proportion to be background areas and discarding them, performing contour labeling on the remaining small blocks, and taking the cancer classification label that occupies the largest area of a small block as the class label.

Further, the judging of small blocks whose foreground area ratio is below the first preset proportion to be background areas and discarding them, performing contour labeling on the remaining small blocks, and taking the cancer classification label that occupies the largest area of a small block as the class label, further includes:

If the normal region occupies more than a second preset proportion of the area of the small block, marking the small block as a normal region;

If the small block contains only a normal region and a region of one cancer type, and the region of that cancer type occupies more than a third preset proportion of the area of the small block, marking the small block as a region of that cancer type;

If the small block contains regions of at least two cancer types, and the region of each cancer type occupies more than the third preset proportion of the area of the small block, classifying the small block as a noise region and discarding it.

Further, the step of inputting the segmented image set and the class label into an EM model semi-supervised learning framework for model training to obtain a labeled data set distribution probability and an unlabeled data set distribution probability, specifically includes:

Extracting the labeled data set from the blocked image set, acquiring the class labels corresponding to the labeled data set, and training a CNN model based on the labeled data set and the class labels to obtain model initialization parameters;

Extracting the unlabeled data set from the blocked image set, estimating a probability map of the unlabeled data set based on the initialized CNN model, mapping the probability map to a preset interval to obtain corresponding mapping values, assigning the mapping values to the corresponding pixels to obtain a heat map, mapping the heat map to a classification map based on a preset threshold vector, estimating labels of the unlabeled data set based on the classification map and a collaborative filtering algorithm, and constructing an unlabeled training data set from the labels of the unlabeled data set;

And retraining the CNN model based on the unlabeled training data set and the labeled data set, optimizing the maximum likelihood function, and updating the model parameters.

Further, the training of the CNN model based on the labeled data set and the class labels to obtain model initialization parameters further includes:

Sampling the labeled data set with replacement by using a hard example mining algorithm, and performing model training with the iteratively sampled data;

Calculating effective coefficients of the blocks in the labeled data set with the initialized CNN model, sorting the blocks by effective coefficient, and extracting the blocks ranked within a preset top proportion for model training;

Correspondingly, the retraining of the CNN model based on the unlabeled training data set and the labeled data set, optimizing the maximum likelihood function, and updating the model parameters further includes:

Extracting block features of the unlabeled training data set and the labeled data set by using the CNN model;

Calculating, based on the feature vectors, the similarity between each unlabeled block and all labeled blocks, and extracting for each unlabeled block the labeled blocks whose similarity is greater than a preset similarity threshold to form a similarity set;

And labeling the unlabeled data by majority voting over the similarity set, and if the label is consistent with the label predicted by the CNN model, adding the corresponding unlabeled data to the unlabeled training data set for the next round of training.

Further, the calculating of the labeled data set distribution probability and the unlabeled data set distribution probability according to the probability map standardization algorithm to obtain a cancer probability distribution map specifically includes:

Setting the preset threshold vector as threshold intervals between 0 and 1 defined by a plurality of endpoints, and mapping the probability vector to the threshold intervals to obtain a cancer intensity characterization value;

Assigning the cancer intensity characterization value to the probability image pixels corresponding to the blocks to obtain the intensity value of each pixel, classifying the intensity values according to the class map to obtain a plurality of cancer classification intervals, and forming the cancer probability distribution map from the plurality of cancer classification intervals.

In a second aspect, an embodiment of the present invention further provides an automatic cancer pathology image detection system, including:

the acquisition module is used for reading a full-slice image set of the cancer to be detected;

The first extraction module is used for extracting a foreground image set of the full-slice image set by adopting a preset extraction algorithm;

The second extraction module is used for performing blocking processing on the foreground image set to obtain a blocked image set, and extracting the class labels of the labeled data set in the blocked image set;

The training module is used for inputting the blocked image set and the class labels into an EM model semi-supervised learning framework for model training to obtain a labeled data set distribution probability and an unlabeled data set distribution probability;

The classification module is used for calculating the labeled data set distribution probability and the unlabeled data set distribution probability according to a probability map standardization algorithm to obtain a cancer probability distribution map;

And the processing module is used for processing the whole slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result.

In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the program to implement the steps of the method for automatically detecting a cancer pathology image according to any one of the above.

In a fourth aspect, embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the cancer pathology image automatic detection method according to any one of the above.

According to the method and system for automatically detecting cancer pathology images provided by the embodiments of the invention, a semi-supervised algorithm framework based on expectation maximization is provided for automatic cancer region detection in full-slice pathology images, which achieves more accurate pathological typing and grading of tissue in the primary cancer region with a limited amount of data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of an automatic detection method for cancer pathology image according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a result of a foreground extraction algorithm according to an embodiment of the present invention;

FIG. 3 is a flowchart of a foreground extraction algorithm based on a Kruskal algorithm provided by an embodiment of the invention;

FIG. 4 is a semi-supervised learning framework diagram of an EM model provided by an embodiment of the present invention;

FIG. 5 is a flowchart of an unlabeled dataset selection algorithm provided by an embodiment of the present invention;

FIG. 6 is a graph of results of an iterative example of an EM algorithm provided by an embodiment of the present invention;

FIG. 7 is a first diagram of accurate automatic breast cancer detection results provided by an embodiment of the present invention;

FIG. 8 is a second diagram of accurate automatic breast cancer detection results provided by an embodiment of the present invention;

FIG. 9 is a schematic diagram of an automatic detection system for cancer pathology image according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.

FIG. 1 is a flow chart of an automatic detection method for cancer pathology images according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

S1, reading a full-slice image set of a cancer to be detected;

S2, extracting a foreground image set of the full-slice image set by adopting a preset extraction algorithm;

S3, performing blocking processing on the foreground image set to obtain a blocked image set, and extracting the class labels of the labeled data set in the blocked image set;

S4, inputting the blocked image set and the class labels into an EM model semi-supervised learning framework for model training to obtain a labeled data set distribution probability and an unlabeled data set distribution probability;

S5, calculating the labeled data set distribution probability and the unlabeled data set distribution probability according to a probability map standardization algorithm to obtain a cancer probability distribution map;

S6, processing the whole slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result.

The method first extracts the foreground of the pathological full-slice image to obtain a foreground image, then divides the foreground image into a plurality of blocks with their corresponding annotation information. The blocks are split into labeled and unlabeled blocks, and an EM model semi-supervised learning framework is trained on them to obtain the distribution probability of the labeled data set and of the unlabeled data set. From these a cancer probability distribution map is derived, and finally a standardized automatic detection result is obtained.

For automatic cancer region detection in full-slice pathology images, the invention provides a semi-supervised algorithm framework based on expectation maximization, which achieves more accurate pathological typing and grading of tissue in the primary cancer region with a limited amount of data.

Based on the above embodiment, step S2 in the method specifically includes:

Specifically, a pathological full-slice image (WSI) typically contains billions of pixels and requires special processing before it can be used as algorithm input. A WSI contains large background areas with no tissue features, so cropping it directly introduces many invalid samples. It is therefore necessary to remove the background of the WSI and extract a region of interest (ROI) before cropping and similar operations. Experiments show that common foreground extraction methods, such as Otsu, cannot effectively extract sparse tissue regions from a WSI. To extract sparse tissue regions effectively, the invention provides a foreground extraction algorithm based on the Kruskal method. For a given WSI, an undirected graph $G(V, E)$ is first constructed, in which each node $v_{i,j} \in V$ corresponds to one pixel and the edge set $E = \{(v_{i,j}, v_{i+1,j}), (v_{i,j}, v_{i,j+1}), (v_{i,j}, v_{i-1,j}), (v_{i,j}, v_{i,j-1})\}$ corresponds to the connections between adjacent pixels. The weight of an edge is set to $w(v_{i,j}, v_{i+1,j}) = \lVert v_{i,j} - v_{i+1,j} \rVert$. The minimum spanning tree $T$ is then computed with the Kruskal algorithm, and every edge whose weight exceeds a given threshold (set to 100 in the experiments) is deleted from $T$, which yields a series of subtrees $T_1, T_2, \ldots, T_n$. The RGB average of each subtree, $\mathrm{RGB}(T_1), \mathrm{RGB}(T_2), \ldots, \mathrm{RGB}(T_n)$, is calculated, and the maximum of these averages is denoted $u$. Finally, the RGB average of each subtree is compared with $u$: a subtree whose RGB average is greater than $u - 45$ is classified as background.

The foreground extraction algorithm adopted by the invention extracts sparse regions in the full-slice image more effectively and yields a more accurate foreground image.
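The foreground extraction steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the edge weight is the Euclidean distance between RGB values and that the "RGB average" of a subtree is the mean over all channels and pixels. The function names `find` and `extract_foreground` are hypothetical. Because Kruskal accepts edges in increasing weight order, stopping at the threshold gives the same subtrees as building the full minimum spanning tree and then deleting its heavy edges.

```python
import math


def find(parent, a):
    # Union-find lookup with path halving.
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a


def extract_foreground(img, edge_thresh=100.0, bg_margin=45.0):
    """img: H x W nested list of (r, g, b) tuples.
    Returns an H x W nested list of booleans (True = foreground)."""
    h, w = len(img), len(img[0])

    def idx(i, j):
        return i * w + j

    # 4-neighbour edges; right and down neighbours cover every pair once.
    # ASSUMPTION: weight = Euclidean distance between the two RGB vectors.
    edges = []
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (0, 1)):
                ni, nj = i + di, j + dj
                if ni < h and nj < w:
                    wt = math.dist(img[i][j], img[ni][nj])
                    edges.append((wt, idx(i, j), idx(ni, nj)))
    edges.sort()

    # Kruskal: merge components in increasing weight order and stop at the
    # threshold, which is equivalent to cutting heavy edges from the MST.
    parent = list(range(h * w))
    for wt, a, b in edges:
        if wt > edge_thresh:
            break
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb

    # Mean intensity per subtree (ASSUMPTION: averaged over R, G and B).
    sums, counts = {}, {}
    for i in range(h):
        for j in range(w):
            r = find(parent, idx(i, j))
            sums[r] = sums.get(r, 0.0) + sum(img[i][j]) / 3.0
            counts[r] = counts.get(r, 0) + 1
    means = {r: sums[r] / counts[r] for r in sums}
    u = max(means.values())

    # Subtrees brighter than u - bg_margin are treated as background.
    return [[means[find(parent, idx(i, j))] <= u - bg_margin
             for j in range(w)] for i in range(h)]
```

A pure-Python loop like this is only practical on a heavily downsampled thumbnail; the resolution at which the graph is built is not stated in the text.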

Based on any of the above embodiments, step S3 in the method specifically includes:

Wherein, the judging of small blocks whose foreground area ratio is below a preset proportion to be background areas and discarding them, performing contour labeling on the remaining small blocks, and taking the cancer classification label that occupies the largest area of a small block as the class label, further includes:

Specifically, after the foreground region and its pixels are obtained, the invention crops the WSI into 1536×1536 small blocks with a 50% overlap rate. If the foreground area ratio of a small block is less than 40%, the small block is judged to be background and discarded. Of the 30 WSIs in the data set, only 10 have contour annotations. In the annotated WSIs, pathologists outline benign regions, carcinoma in situ regions, and invasive carcinoma regions separately. The label of each small block must therefore be generated automatically from the contour annotations. In general, the cancer type that occupies the largest area of a small block is taken as the label of that block.

In addition, there are three special cases to consider:

(1) If the area of the normal region occupies two-thirds of the area of the patch, the patch is marked as a normal region;

(2) If a patch contains only two types of tissue, one normal and the other one of the three cancer types (benign, carcinoma in situ, invasive carcinoma), and the cancer area occupies more than one third of the area of the patch, the patch is marked as the corresponding cancer type;

(3) If a patch contains two or three cancer types (benign, carcinoma in situ, invasive carcinoma), and each cancer type occupies more than one third of the area of the patch, the patch is marked as a noise region and discarded, because such patches mislead model training and degrade model accuracy.
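The labeling rules above can be expressed as a small decision function. The sketch below is illustrative only: the function name `label_block`, the dictionary input format, and the label strings are hypothetical, and reading "occupies two-thirds" as "at least two-thirds" is an assumption.

```python
CANCER_TYPES = ("benign", "in_situ", "invasive")


def label_block(fractions, foreground_ratio, min_foreground=0.4):
    """fractions: dict mapping 'normal' and cancer type names to the
    fraction of the block area they occupy.
    Returns a label string, or None when the block is discarded."""
    # Blocks with too little tissue are background.
    if foreground_ratio < min_foreground:
        return None

    normal = fractions.get("normal", 0.0)
    cancers = {c: fractions[c] for c in CANCER_TYPES
               if fractions.get(c, 0.0) > 0.0}

    # Case (1): mostly normal tissue.
    # ASSUMPTION: "occupies two-thirds" is read as "at least two-thirds".
    if normal >= 2.0 / 3.0:
        return "normal"

    # Case (3): several cancer types, each above one third -> noise.
    if len(cancers) >= 2 and all(a > 1.0 / 3.0 for a in cancers.values()):
        return None

    # Case (2) and the general rule: the cancer type with the largest area.
    if cancers:
        return max(cancers, key=cancers.get)
    return "normal"
```

For example, a block that is 60% normal and 40% invasive carcinoma is labeled `"invasive"`, while a block split evenly between carcinoma in situ and invasive carcinoma is discarded as noise.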

In this way, the blocks and their corresponding class labels can be extracted accurately. For slices without annotations, which are the majority, only the blocks are extracted. All the resulting small blocks (1536×1536) form the data set that serves as input to the automatic detection framework.
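The cropping step can be sketched as below, assuming a boolean foreground mask at the same resolution as the image. With a 1536-pixel block and 50% overlap the stride is 768 pixels. The helper names are hypothetical, and the handling of partial blocks at the right and bottom borders (dropped here) is an assumption, since the text does not specify it.

```python
def tile_origins(width, height, tile=1536, overlap=0.5):
    """Top-left corners of all full blocks inside a width x height image."""
    stride = max(1, int(tile * (1 - overlap)))  # 768 for the values in the text
    xs = range(0, max(width - tile, 0) + 1, stride)
    ys = range(0, max(height - tile, 0) + 1, stride)
    return [(x, y) for y in ys for x in xs]


def foreground_ratio(mask, x, y, tile):
    """Fraction of foreground pixels in the block with top-left corner (x, y)."""
    rows = mask[y:y + tile]
    hits = sum(sum(1 for v in row[x:x + tile] if v) for row in rows)
    return hits / float(tile * tile)


def keep_tiles(mask, tile, overlap=0.5, min_foreground=0.4):
    """Origins of blocks whose foreground ratio reaches the 40% threshold."""
    h, w = len(mask), len(mask[0])
    return [(x, y) for x, y in tile_origins(w, h, tile, overlap)
            if foreground_ratio(mask, x, y, tile) >= min_foreground]
```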

Based on any of the above embodiments, step S4 in the method specifically includes:

Wherein, the training the CNN model based on the noted dataset and the noted category label to obtain model initialization parameters further includes:

Specifically, FIG. 4 shows a schematic diagram of the EM semi-supervised learning framework. Only part of the data in the data set is labeled; the labeled part is defined as the set $D$ and the unlabeled part as the set $U$. $y_i$ denotes the label of block $x_i \in D$, and the hidden variable $z_j$ denotes the label of block $x_j \in U$. The CNN model is trained on the labeled data set $D$ to obtain the initial model parameters $\theta^0$, and the probability map $P(z_j \mid x_j)$ of each unlabeled block $x_j \in U$ is estimated with the initialized CNN model. In the E step, the labels of the unlabeled data set are estimated, and the unlabeled data that receive labels are defined as the set $E$. In the M step, the data set $E$ obtained in the E step and the labeled data set $D$ are used to maximize the probability $P(X \mid \theta, Z)$ and update the parameters of the CNN model. Within the framework, the probability map $P(z_j \mid x_j)$ is mapped to a value between 0 and 1, and this value is assigned to the corresponding pixels to obtain a heat map, as shown in (C) of FIG. 4. Based on the set threshold vector $\beta^* = (\beta^1, \beta^2, \beta^3)$, the heat map is mapped to a classification map, as shown in (D) of FIG. 4. The labels of the unlabeled blocks are assigned according to the classification map and the collaborative filtering algorithm; the resulting blocks, together with the labeled blocks, are used as model input for retraining, and the parameters of the CNN model are updated.

During initialization, the blocks are assumed to be independent and identically distributed, and the model is trained on the labeled data set $D$ to obtain the initial model parameters $\theta^0$, which are given by the following formula:

In the E step, based on the model parameters $\theta^t$ at iteration $t$, the probability map $P(z_j \mid x_j, \theta^t)$ of the unlabeled data is estimated and then mapped to $P_{norm}(z_j \mid x_j, \theta^t) \in [0, 1]$, which yields a continuous heat map. A classification map is then obtained from the class threshold vector $\beta^*$ and the heat map $P_{norm}(z_j \mid x_j, \theta^t)$.

Based on the classification map and the collaborative filtering algorithm, the label $c_j$ of each unlabeled block is estimated to construct the unlabeled training data set $E^t$.

In the M step, the CNN model is retrained on the unlabeled training data set $E^t$ obtained in the E step together with the labeled data set $D$, the maximum likelihood function is optimized, and the model parameters are updated. The flow is shown in FIG. 5.
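The alternation between the E step and the M step can be sketched as a generic loop. This is only an outline under stated assumptions: the callbacks `train`, `predict` and `select` are hypothetical placeholders for CNN training, CNN prediction and the collaborative filtering check, and the toy nearest-centroid functions stand in for the CNN purely so that the loop can be run.

```python
def em_semi_supervised(labeled, unlabeled, train, predict, select, n_iter=3):
    """labeled: list of (x, y) pairs (set D); unlabeled: list of x (set U).
    train(pairs) -> model, predict(model, x) -> label,
    select(model, x, label) -> bool (keep this pseudo-label or not)."""
    model = train(labeled)            # initialisation on D only
    pseudo = []
    for _ in range(n_iter):
        # E step: estimate labels of unlabeled data and keep the trusted ones.
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        pseudo = [(x, y) for x, y in pseudo if select(model, x, y)]
        # M step: retrain on D together with the pseudo-labeled set.
        model = train(labeled + pseudo)
    return model, pseudo


# Toy stand-ins (NOT a CNN): one centroid per class on scalar features.
def train_centroids(pairs):
    groups = {}
    for x, y in pairs:
        groups.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in groups.items()}


def predict_centroid(model, x):
    return min(model, key=lambda y: abs(model[y] - x))


def accept_confident(model, x, y, margin=1.0):
    # Keep a pseudo-label only when the nearest centroid is clearly nearest.
    d = sorted(abs(c - x) for c in model.values())
    return len(d) < 2 or d[1] - d[0] >= margin
```

For instance, `em_semi_supervised([(0.0, "normal"), (10.0, "invasive")], [1.0, 9.0, 5.0], train_centroids, predict_centroid, accept_confident)` keeps the two clear-cut samples and leaves the ambiguous one at 5.0 unlabeled.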

Assuming that $x_j \mid \theta$ obeys a uniform distribution, the objective function $Q(\theta, \theta^t)$ reduces to the following form:

In the initialization stage, the labeled data are sampled with replacement by using a hard example mining algorithm, and the model is trained iteratively on the sampled data. During training, samples that the model misclassifies are called hard examples. Hard examples are resampled during training so that the model sees them more often, which strengthens the model's learning and memory of them and improves its accuracy. The effective coefficient $\alpha$ is defined by the following formula; the higher it is, the harder and more valuable the sample is for the model.

Here $c_i$ denotes the label of block $x_i$, and the formula also uses the probability map of block $x_i$. In the initialization phase, the CNN model is initialized on the pathology image data. The slice data blocks are extracted with the foreground extraction algorithm and the block annotation extraction method. The effective coefficients of the resulting blocks are calculated with the initialized model, the blocks are sorted by effective coefficient, and the top 20% are extracted to train the model.
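The ranking step of hard example mining can be sketched as follows. The formula for the effective coefficient is not reproduced in this text, so the sketch assumes a stand-in, $\alpha_i = 1 - p_{c_i}(x_i)$ (one minus the probability the model assigns to the true class). That definition and the function names are assumptions for illustration only.

```python
import math


def effective_coefficients(probs, labels):
    """probs: list of per-class probability vectors; labels: true class indices.
    ASSUMPTION: alpha = 1 - probability assigned to the true class."""
    return [1.0 - p[c] for p, c in zip(probs, labels)]


def select_hard_examples(probs, labels, top_fraction=0.2):
    """Indices of the blocks with the highest effective coefficients."""
    alpha = effective_coefficients(probs, labels)
    k = max(1, math.ceil(top_fraction * len(alpha)))
    order = sorted(range(len(alpha)), key=lambda i: alpha[i], reverse=True)
    return order[:k]
```

The selected indices would then be used to draw the training blocks for the next round; how often each hard example is resampled is not specified in the text.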

In the E step, block labels are obtained algorithmically, and the unlabeled block training data set $E$ is constructed for retraining in the M step. First, the features of all labeled and unlabeled blocks are extracted with the CNN model. Based on the feature vectors, the similarity $sim(x_i, y_j)$ between each unlabeled block $x_i$ and every labeled block $y_j$ is calculated, and for each unlabeled block $x_i$ the labeled blocks whose similarity exceeds a threshold are collected into a similarity set. The unlabeled block is then labeled by majority voting over its similarity set; if this label is consistent with the label predicted by the M-step CNN model, the unlabeled block $x_i$ is added to the data set $E^t$ for the next round of training.
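The selection of unlabeled blocks can be sketched as below. The text does not name the similarity measure, so cosine similarity is assumed here, and the threshold value 0.9 and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    # ASSUMPTION: cosine similarity; the text only says "similarity".
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def vote_label(feat, labeled_feats, labeled_labels, sim_thresh=0.9):
    """Majority label among labeled blocks similar to feat, or None."""
    similar = [lab for f, lab in zip(labeled_feats, labeled_labels)
               if cosine(feat, f) > sim_thresh]
    if not similar:
        return None
    return Counter(similar).most_common(1)[0][0]


def build_pseudo_set(unlabeled_feats, cnn_preds, labeled_feats,
                     labeled_labels, sim_thresh=0.9):
    """Keep (index, label) only where the vote agrees with the CNN prediction."""
    kept = []
    for i, (feat, pred) in enumerate(zip(unlabeled_feats, cnn_preds)):
        voted = vote_label(feat, labeled_feats, labeled_labels, sim_thresh)
        if voted is not None and voted == pred:
            kept.append((i, pred))
    return kept
```

Requiring agreement between the neighbour vote and the model prediction acts as a filter: blocks with no sufficiently similar labeled neighbour, or with conflicting evidence, stay out of the training set for that round.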

Based on any of the above embodiments, step S5 in the method specifically includes:

Specifically, in the probability map mapping rule, $i$ denotes the index of the maximum probability in the probability vector $p(x)$. In the experiments, the class threshold vector $\beta^*$ is set to $[0, 0.25, 0.5, 0.75, 1.0]$, and the probability map $p(x)$ is mapped to a value $s(x)$ between 0 and 1. $s(x)$ represents the intensity of the cancer: the $s(x)$ value of a normal block is close to 0, and that of an invasive carcinoma block is close to 1. $s(x)$ is assigned to the probability image pixels covered by the block; if a pixel belongs to several foreground blocks, its intensity is set to the average over the blocks to which it belongs. The heat map is then mapped to a class map with the class threshold vector $\beta^* = [0.1, 0.5, 0.75]$, and each pixel is classified according to its intensity value: (1) [0, 0.1] is classified as normal (Normal), (2) [0.25, 0.5] as benign (Benign), (3) [0.5, 0.75] as carcinoma in situ (In-Situ), and (4) [0.75, 1.0] as invasive carcinoma (Invasive). After standardization, a block with a lower probability value has high uncertainty, and its $s(x)$ value is pulled toward the left end of the interval; a block with a higher probability value has low uncertainty, and its $s(x)$ value is pulled toward the right end.

The class vector $\beta^*$ is set as: (1) [0, 0.1] is classified as normal (Normal), (2) [0.25, 0.5] as benign (Benign), (3) [0.5, 0.75] as carcinoma in situ (In-Situ), and (4) [0.75, 1.0] as invasive carcinoma (Invasive). In the general method, low-certainty noise cannot be smoothed by setting the class threshold vector, whereas $\beta^*$ removes low-certainty noise well, so the resulting heat map is more continuous and smooth. An adaptive class threshold vector $\beta^{**}$ is further adopted, which is adjusted adaptively according to the distribution of cancer in the slice to generate a continuous and smooth class map. Relevant prior knowledge is studied, an objective function is designed based on it, and the threshold vector is adjusted adaptively by optimizing this objective function.
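The averaging over overlapping blocks and the thresholding into classes can be sketched as follows. The formula that maps $p(x)$ to $s(x)$ is not reproduced in this text, so the sketch takes the per-block intensity $s$ as a given input. It classifies strictly by the threshold vector [0.1, 0.5, 0.75], which means intensities between 0.1 and 0.25 fall into the benign class even though the listed benign interval starts at 0.25; that choice, the zero intensity for uncovered pixels, and the function names are assumptions.

```python
import bisect

CLASS_THRESH = [0.1, 0.5, 0.75]
CLASS_NAMES = ["Normal", "Benign", "In-Situ", "Invasive"]


def heat_map(width, height, tiles, tile):
    """tiles: list of (x, y, s) with top-left corner and block intensity s.
    Each pixel receives the mean intensity of the blocks that cover it."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for x0, y0, s in tiles:
        for y in range(y0, min(y0 + tile, height)):
            for x in range(x0, min(x0 + tile, width)):
                total[y][x] += s
                count[y][x] += 1
    # ASSUMPTION: pixels covered by no block get intensity 0 (background).
    return [[total[y][x] / count[y][x] if count[y][x] else 0.0
             for x in range(width)] for y in range(height)]


def classify(s, thresholds=CLASS_THRESH):
    """Map an intensity in [0, 1] to a class name; upper bounds are inclusive."""
    return CLASS_NAMES[bisect.bisect_left(thresholds, s)]


def class_map(heat):
    return [[classify(v) for v in row] for row in heat]
```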

Finally, based on the probability map standardization, a visualized result of the automatic detection pipeline is obtained. In fig. 6, (a), (b), (c) and (d) show, for slice A03, the detection result of the first iteration of the EM-30% model, the detection result of the second iteration, the detection result of the third iteration, and the labeled class map, respectively. These results show that the EM model reaches at least a local optimum after multiple iterations of optimization. As in the iterative training of a neural network model, the optimal solution is approached through iterative optimization. As shown in fig. 6, after the EM model has iterated several times, the class map predicted by the model contains fewer noise points and the detection result is closer to the labeling result.

Generally, so that a doctor can better observe the detection result, a slice contour map is generated based on the class map, as shown in fig. 7 and 8. First, the class map is scanned with a Gaussian filter, which smooths inconsistent areas and removes noise points. Then, a dilation algorithm is applied to connect pixel areas of the same category within a neighborhood, generating connected regions. Finally, hole noise in the connected regions is removed by morphological opening and closing operations. The kernel size for the opening and closing operations is estimated from the average area of the noise points; with the kernel set accordingly, noise blocks in the image are removed, continuous contours are generated, and the detection result for breast carcinoma in situ is visualized.
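The morphological part of this post-processing can be illustrated on a binary mask of a single class. The following is a minimal pure-Python sketch under stated assumptions: a square k×k kernel, pixels outside the image treated as 0, and closing applied before opening; the Gaussian filtering step is omitted, and the function names are illustrative rather than taken from the patent. In practice a library routine such as OpenCV's `cv2.morphologyEx` or `scipy.ndimage.binary_opening` would normally be used instead.

```python
def _apply(mask, k, combine):
    """Slide a k x k window over a 0/1 mask; outside pixels count as 0."""
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                      for yy in range(y - r, y + r + 1)
                      for xx in range(x - r, x + r + 1)]
            out[y][x] = combine(window)
    return out


def dilate(mask, k=3):
    """A pixel becomes 1 if any pixel in its window is 1."""
    return _apply(mask, k, max)


def erode(mask, k=3):
    """A pixel stays 1 only if every pixel in its window is 1."""
    return _apply(mask, k, min)


def closing(mask, k=3):
    """Dilation followed by erosion: fills holes smaller than the kernel."""
    return erode(dilate(mask, k), k)


def opening(mask, k=3):
    """Erosion followed by dilation: removes specks smaller than the kernel."""
    return dilate(erode(mask, k), k)


# A 5x5 region with a one-pixel hole, plus one isolated noise pixel.
mask = [[0] * 12 for _ in range(7)]
for y in range(1, 6):
    for x in range(1, 6):
        mask[y][x] = 1
mask[3][3] = 0   # hole inside the region
mask[3][9] = 1   # isolated noise point

closed = closing(mask)     # hole is filled, noise point survives
cleaned = opening(closed)  # noise point is removed, region is kept
```

The kernel size `k` plays the role of the opening and closing kernel that the text estimates from the average noise area: noise smaller than the kernel is removed, while larger regions keep their extent.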

As shown in fig. 8, slice A08 contains many cancer lesion areas of the Invasive class, and the Kwok method tends to misdetect small areas of invasive cancer as carcinoma in situ. Analysis shows that the main reason is that the pixel-average labeling method used by Kwok introduces labeling errors. For example, when a block is labeled in which invasive cancer occupies one third of the area and the rest is normal tissue, the pixel-averaging method will, with high probability, label the block as carcinoma in situ or benign. As a result, most of the invasive cancer area in A08 is labeled as carcinoma in situ, and false detection occurs.
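The arithmetic behind this example can be made concrete with a small sketch. It assumes an ordinal encoding of the classes (Normal = 0, Benign = 1, In-Situ = 2, Invasive = 3) and a block label obtained by rounding the mean pixel index; both are illustrative assumptions, since the text does not describe the exact encoding used by the Kwok method.

```python
from fractions import Fraction

CLASS_NAMES = ["Normal", "Benign", "In-Situ", "Invasive"]


def pixel_average_label(area_fractions):
    """Label a block by the rounded mean of its pixels' class indices.

    area_fractions: {class_name: fraction of the block area}, summing to 1.
    The ordinal encoding Normal=0 ... Invasive=3 is an assumption.
    """
    mean_index = sum(Fraction(frac) * CLASS_NAMES.index(name)
                     for name, frac in area_fractions.items())
    return CLASS_NAMES[round(mean_index)]


# One third invasive cancer, two thirds normal tissue: mean index = 1.
one_third = pixel_average_label(
    {"Invasive": Fraction(1, 3), "Normal": Fraction(2, 3)})

# Three fifths invasive cancer, two fifths normal tissue: mean index = 9/5.
three_fifths = pixel_average_label(
    {"Invasive": Fraction(3, 5), "Normal": Fraction(2, 5)})
```

Under these assumptions the first block is labeled Benign and the second In-Situ, although neither contains any benign or in-situ tissue, which matches the kind of labeling error described above.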

The FSL method misses a large number of detections and cannot detect small areas of invasive cancer; analysis shows that the main cause is data imbalance. The most common form of data imbalance is imbalance between categories: after the slices are divided into blocks, there are 5562 normal blocks, 1530 benign blocks, 786 carcinoma in situ blocks, and 2450 invasive cancer blocks. The numbers of benign and carcinoma in situ blocks are too small, and imbalance between classes can cause the model to over-fit, so that it tends to assign most blocks to the most numerous classes in order to achieve the lowest loss. Therefore, data augmentation and data balancing are performed. Besides imbalance between categories, the dataset also exhibits imbalance between slices: for example, in the benign block dataset, 60% of the blocks originate from slice A02 but only 5% from slice A07, with the remaining 35% coming from other slices. Imbalance between slices can lead the model to neglect the features of A07 during training, so that it fails to detect benign regions in A07. The EM method alleviates data imbalance between categories and between slices by making effective use of the unlabeled dataset. In real-world diagnosis, small cancer areas are often critical but are easily missed by doctors and pathologists; as shown in fig. 8, the method of the present invention can effectively detect small cancer areas, which demonstrates the effectiveness and usability of the algorithm.
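The degree of class imbalance in the block counts quoted above can be quantified with a short sketch. The text does not state how the balancing was carried out, so the two schemes shown here (oversampling each class up to the size of the largest class, and inverse-frequency loss weights) are common choices offered as assumptions, and the function names are hypothetical.

```python
import math

# Block counts per class as reported in the text.
counts = {"Normal": 5562, "Benign": 1530, "In-Situ": 786, "Invasive": 2450}


def oversampling_factors(counts):
    """How many times each class would be repeated (or augmented) so that it
    reaches at least the size of the largest class."""
    target = max(counts.values())
    return {name: math.ceil(target / n) for name, n in counts.items()}


def inverse_frequency_weights(counts):
    """Loss weights proportional to 1 / class frequency, scaled so that a
    perfectly balanced dataset would give every class a weight of 1."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}


factors = oversampling_factors(counts)
weights = inverse_frequency_weights(counts)
```

With these counts the carcinoma in situ class is roughly seven times smaller than the normal class, so it receives the largest factor and the largest weight under either scheme.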

The cancer pathology image automatic detection system provided by the embodiment of the invention is described below; the system described below and the cancer pathology image automatic detection method described above may be read with reference to each other.

Fig. 9 is a schematic structural diagram of an automatic cancer pathology image detection system according to an embodiment of the present invention, as shown in fig. 9, including: an acquisition module 91, a first extraction module 92, a second extraction module 93, a training module 94, a classification module 95 and a processing module 96; wherein:

The acquisition module 91 is configured to read a full-slice image set of the cancer to be detected; the first extraction module 92 is configured to extract a foreground image set of the full-slice image set by using a preset extraction algorithm; the second extraction module 93 is configured to perform blocking processing on the foreground image set to obtain a blocked image set, and to extract the class labels of the labeled data set in the blocked image set; the training module 94 is configured to input the blocked image set and the class labels into an EM model semi-supervised learning framework for model training, so as to obtain a distribution probability of the labeled data set and a distribution probability of the unlabeled data set; the classification module 95 is configured to calculate the distribution probability of the labeled data set and the distribution probability of the unlabeled data set according to a probability map standardization algorithm, so as to obtain a cancer probability distribution map; the processing module 96 is configured to process the full-slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result.
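The data flow between the six modules can be sketched as a simple composition of stages. This is only a structural illustration: the class name `DetectionSystem`, the field names and the call signatures are hypothetical, and each stage is a placeholder callable rather than an implementation of the corresponding module.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DetectionSystem:
    """Wires six stage callables in the order of modules 91 to 96."""
    acquire: Callable[[Any], Any]              # acquisition module 91
    extract_foreground: Callable[[Any], Any]   # first extraction module 92
    split_and_label: Callable[[Any], tuple]    # second extraction module 93
    train: Callable[[Any, Any], tuple]         # training module 94
    standardize: Callable[[Any, Any], Any]     # classification module 95
    process: Callable[[Any, Any], Any]         # processing module 96

    def run(self, source):
        slides = self.acquire(source)
        foreground = self.extract_foreground(slides)
        blocks, labels = self.split_and_label(foreground)
        p_labeled, p_unlabeled = self.train(blocks, labels)
        probability_map = self.standardize(p_labeled, p_unlabeled)
        return self.process(slides, probability_map)
```

Keeping each module behind a plain callable lets any stage, for example the foreground extraction algorithm, be swapped out without changing the rest of the flow.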

Fig. 10 illustrates a physical structure diagram of an electronic device. As shown in fig. 10, the electronic device may include: a processor 1010, a communication interface 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 may invoke logic instructions in the memory 1030 to perform a method for automatic detection of cancer pathology images, the method comprising: reading a full-slice image set of the cancer to be detected; extracting a foreground image set of the full-slice image set by adopting a preset extraction algorithm; performing blocking processing on the foreground image set to obtain a blocked image set, and extracting the class labels of the labeled data set in the blocked image set; inputting the blocked image set and the class labels into an EM model semi-supervised learning framework for model training to obtain a distribution probability of the labeled data set and a distribution probability of the unlabeled data set; calculating the distribution probability of the labeled data set and the distribution probability of the unlabeled data set according to a probability map standardization algorithm to obtain a cancer probability distribution map; and processing the full-slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result.

Further, the logic instructions in the memory 1030 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, embodiments of the present invention also provide a computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to perform the method for automatically detecting a cancer pathology image provided by the above method embodiments, the method including: reading a full-slice image set of the cancer to be detected; extracting a foreground image set of the full-slice image set by adopting a preset extraction algorithm; performing blocking processing on the foreground image set to obtain a blocked image set, and extracting the class labels of the labeled data set in the blocked image set; inputting the blocked image set and the class labels into an EM model semi-supervised learning framework for model training to obtain a distribution probability of the labeled data set and a distribution probability of the unlabeled data set; calculating the distribution probability of the labeled data set and the distribution probability of the unlabeled data set according to a probability map standardization algorithm to obtain a cancer probability distribution map; and processing the full-slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result.

In still another aspect, embodiments of the present invention further provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the cancer pathology image automatic detection method provided in the above embodiments, the method comprising: reading a full-slice image set of the cancer to be detected; extracting a foreground image set of the full-slice image set by adopting a preset extraction algorithm; performing blocking processing on the foreground image set to obtain a blocked image set, and extracting the class labels of the labeled data set in the blocked image set; inputting the blocked image set and the class labels into an EM model semi-supervised learning framework for model training to obtain a distribution probability of the labeled data set and a distribution probability of the unlabeled data set; calculating the distribution probability of the labeled data set and the distribution probability of the unlabeled data set according to a probability map standardization algorithm to obtain a cancer probability distribution map; and processing the full-slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and of course may also be implemented by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An automatic detection method for cancer pathology image, characterized by comprising the following steps:

reading a full-slice image set of the cancer to be detected;

processing the full-slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result;

the method for extracting the foreground image set of the full-slice image set by adopting a preset extraction algorithm specifically comprises the following steps:

2. The automatic detection method of cancer pathology image according to claim 1, wherein the step of performing a blocking process on the foreground image set to obtain a blocked image set, and extracting class labels with label data sets in the blocked image set specifically comprises:

3. The automatic detection method of cancer pathology image according to claim 2, wherein the determining small blocks with foreground area areas exceeding a preset proportion as background areas and discarding the small blocks, performing contour labeling on the remaining small blocks, and taking the cancer classification label occupying the largest area of the small blocks as the class label, further comprising:

4. The method for automatically detecting cancer pathology image according to claim 1, wherein the inputting the segmented image set and the class label into an EM model semi-supervised learning framework for model training, to obtain a labeled dataset distribution probability and an unlabeled dataset distribution probability, specifically comprises:

5. The automatic detection method of cancer pathology image according to claim 4, wherein training the CNN model based on the labeled data set and the class labels to obtain model initialization parameters further comprises:

6. The method according to claim 4, wherein the calculating the distribution probability of the labeled dataset and the distribution probability of the unlabeled dataset according to the probability map normalization algorithm, to obtain the cancer probability distribution map, specifically comprises:

Assigning the cancer intensity representation value to probability image pixels corresponding to the blocks, obtaining the intensity value of each pixel, classifying the intensity value according to a class diagram to obtain a plurality of cancer classification intervals, and forming the cancer probability distribution map by the plurality of cancer classification intervals.

7. An automatic detection device for cancer pathology image, characterized by comprising:

the processing module is used for processing the full-slice image set based on the cancer probability distribution map to obtain a standardized automatic detection result;

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method for automatically detecting cancer pathology image according to any one of claims 1 to 6 when the computer program is executed by the processor.

9. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the cancer pathology image automatic detection method according to any one of claims 1 to 6.