CN110993106A - Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information - Google Patents
Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information
- Publication number
- CN110993106A (application CN201911265751.5A)
- Authority
- CN
- China
- Prior art keywords
- clinical information
- pathological image
- recurrence risk
- risk
- recurrence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention discloses a method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information, belonging to the technical field of constructing postoperative cancer recurrence risk prediction models. The method takes as basic variables the patient's clinical information and the pathological image features of the patient's tumor region extracted with image processing techniques, computes the interactions among these basic variables as input data, fits a survival random forest model, and predicts the patient's survival time. The results of the embodiment show that the cross-validated performance index (C-index) of the proposed model is higher than that of models using only pathological image features or only clinical information, so the accuracy of postoperative liver cancer recurrence risk prediction is markedly improved. The invention also provides a postoperative recurrence risk classification index that divides patients into a higher-risk subgroup and a lower-risk subgroup, which can help doctors design targeted treatment plans.
Description
Technical Field
The invention belongs to the technical field of construction of a cancer postoperative recurrence risk prediction model, and particularly relates to a liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information. Specifically, the method is based on a survival random forest model, combines pathological image characteristics and clinical parameter characteristics, and accurately predicts the postoperative recurrence risk of the liver cancer patient.
Background
Cancer is a health problem shared by all people. According to 2019 statistics from the National Cancer Center, liver cancer accounted for 9.42% of all cancer cases in 2015, ranking fourth in incidence among malignant tumors, and for 13.94% of all cancer deaths, ranking second in mortality among malignant tumors. Liver cancer therefore has high incidence and mortality and seriously threatens public health. The high postoperative recurrence risk of liver cancer is one of the main reasons for the high mortality of liver cancer patients. If the postoperative recurrence risk can be predicted well, doctors can design targeted treatment plans for patients, which is of great significance for postoperative treatment and prognosis.
With the progress of science and technology, artificial intelligence has developed rapidly in recent years, and recurrence risk prediction with artificial intelligence algorithms is gaining ground. Random forest is a common machine learning method that achieves high prediction accuracy while also performing feature screening. Survival analysis, in turn, is designed for survival time data that contain censoring or truncation. A survival random forest model, which combines survival analysis with the random forest model, can analyze survival data while making full use of the strengths of random forests.
For clinical treatment of liver cancer, pathological images are the diagnostic gold standard. The morphology, color and texture of cells and the morphology of specific tissue structures in a pathological image are usually related to the occurrence or progression of disease, so the rich information in pathological images has the potential to improve recurrence risk prediction. However, judging recurrence risk by manual inspection of pathological images alone is unreliable because of subjective factors and the visual limits of the naked eye. A stable method that mines pathological image information in depth to predict recurrence risk is therefore needed.
Against this background, the invention provides a method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information. The invention automatically extracts image feature information from H&E-stained histopathological images of liver cancer using image processing techniques, and combines it with the patient's routine clinical examination information and the survival random forest method to grade the patient's postoperative recurrence risk more accurately.
Disclosure of Invention
The invention aims to provide a method for predicting the postoperative recurrence risk of liver cancer that combines pathological images and clinical information. The input data of the method comprise pathological image features extracted with image processing techniques and the patient's clinical information, such as cancer TNM staging. The prediction framework is a survival random forest model; unlike a conventional random forest, which handles only regression and classification problems, a survival random forest is designed for survival time data. The performance index of the method is the C-index, which estimates the probability that the model's prediction is concordant with the actually observed outcome and is commonly used in statistical analysis to assess the discrimination and consistency of a prognostic model. The method specifically comprises the following steps:
step 1: extracting image features from the pathological image, and collating the medically meaningful variables in the clinical information;
step 2: data processing, including missing-value handling, dummy variable setting, removal of obviously unreasonable variables, and normalization;
step 3: combining the pathological image features and the clinical information, and calculating the interactions between them as input data;
step 4: selecting a better survival random forest model through cross-validation, using the C-index as the performance index;
step 5: generating a recurrence risk classification index with the survival random forest model, and classifying patients into recurrence risk subgroups.
The image features of the pathological image extracted in the step 1 specifically include the following features:
- WSI_snu_osmerci_ngtdm_Strength_range: the H&E-stained pathological image is deconvolved from the RGB color space into the HEO color space, the minimum bounding rectangle of each cell nucleus is cropped from the O-channel image, and the Neighbouring Gray Tone Difference Matrix (NGTDM) is calculated. The NGTDM is calculated as follows:
let the cropped region contain n gray levels $N_1, N_2, \dots, N_n$; the NGTDM is then an n×4 matrix whose j-th row is $[N_j, F_j, P_j, S_j]$, where $N_j$ is the value of the j-th gray level, $F_j$ is the number of pixels with gray level $N_j$, $P_j$ is the relative frequency of $N_j$, and $S_j$ is the sum, over all pixels with gray level $N_j$, of the absolute difference between $N_j$ and the mean gray value of the pixel's neighbourhood,
the Strength feature is then computed from the NGTDM:
finally, the range (maximum minus minimum) of the Strength feature over all cell nuclei is calculated as the WSI_snu_osmerci_ngtdm_Strength_range feature;
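The Strength formula itself does not survive in this text. The sketch below is a hedged illustration, not the patent's implementation: it builds the n×4 NGTDM described above and then applies the standard NGTDM Strength definition, $\sum_j \sum_k (P_j + P_k)(N_j - N_k)^2 / \sum_j S_j$, which is an assumption suggested by the feature name. The 8-pixel neighbourhood, the restriction to interior pixels, and the function names `ngtdm`, `strength` and `strength_range` are likewise assumptions made for the example.

```python
import numpy as np

def ngtdm(region):
    """Rows [N_j, F_j, P_j, S_j] for a 2-D gray-level region of at least 3x3 pixels.

    Assumption: 8-pixel neighbourhood, interior pixels only.
    """
    region = np.asarray(region, dtype=float)
    height, width = region.shape
    stats = {}  # gray level -> (count F_j, difference sum S_j)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = region[y - 1:y + 2, x - 1:x + 2]
            neighbour_mean = (window.sum() - region[y, x]) / 8.0
            level = float(region[y, x])
            count, diff = stats.get(level, (0, 0.0))
            stats[level] = (count + 1, diff + abs(level - neighbour_mean))
    total = sum(count for count, _ in stats.values())
    return np.array([[lvl, c, c / total, s] for lvl, (c, s) in sorted(stats.items())])

def strength(matrix):
    """Standard NGTDM Strength (assumed definition, see lead-in)."""
    levels, p, s = matrix[:, 0], matrix[:, 2], matrix[:, 3]
    level_diff = levels[:, None] - levels[None, :]
    numerator = ((p[:, None] + p[None, :]) * level_diff ** 2).sum()
    denominator = s.sum()
    return numerator / denominator if denominator > 0 else 0.0

def strength_range(nucleus_regions):
    """Range (max minus min) of Strength over all nucleus regions."""
    values = [strength(ngtdm(r)) for r in nucleus_regions]
    return max(values) - min(values)
```

In practice each element of `nucleus_regions` would be the O-channel crop of one nucleus; a uniform region has no gray-level differences and yields a Strength of 0 in this sketch.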
- MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean: taking the cell nucleus with the minimum nucleus density (number of local cell nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped, the minimum bounding rectangle of each cell nucleus is cropped from the H-channel image of the local region, and the Gray Level Dependence Matrix (GLDM) is calculated. The GLDM is calculated as follows:
let the gray level of pixel m be $g_m$ and the gray level of pixel n be $g_n$; when the distance between m and n is less than d and $|g_m - g_n| \le \alpha$, pixel n is called a gray-level dependent pixel of m, and the number of gray-level dependent pixels is counted for each pixel of gray level $g_m$. Let P(i, j) be the value in row i, column j (starting from 0) of the GLDM; it is the number of pixels of gray level i that have j gray-level dependent pixels,
the LDLGLE (LargeDependenceLowGrayLevelEmphasis) feature is then calculated from the GLDM:
where $N_g$ is the number of gray levels in the image (number of GLDM rows), $N_d$ is the number of distinct dependence sizes in the image (number of GLDM columns), and $N_z$ is the number of dependence zones in the image (sum of the GLDM entries),
finally, the mean of the LDLGLE feature over all cell nuclei in the local region is calculated as the MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean feature;
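The LDLGLE formula does not survive in this text. As a hedged reconstruction, assuming the feature follows the standard Large Dependence Low Gray Level Emphasis definition for GLDM texture features (an assumption suggested by the feature name and by the symbols $N_g$, $N_d$ and $N_z$), it would read:

```latex
\mathrm{LDLGLE} \;=\; \frac{1}{N_z} \sum_{i=1}^{N_g} \sum_{j=1}^{N_d} \frac{P(i,j)\, j^{2}}{i^{2}}
```

In this standard form i and j denote the gray-level value and the dependence size rather than zero-based indices, so the denominator is never zero; the value grows when many pixels with low gray levels have large numbers of dependent neighbours.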
- MND_smu_osnsci_single_fractal_dim_mean: taking the cell nucleus with the minimum nucleus density (number of local cell nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped; then, centered on each cell nucleus in the O-channel image of the local region, a fixed-size K×K region is cropped and its fractal dimension is calculated. The fractal dimension is calculated as follows:
set the fractal sizes $S_1, S_2, \dots, S_n$; for fractal size $S_j$, divide the image into $S_j \times S_j$ blocks, compute the range of pixel values within each block, and take the average $Nr_j$ of all these ranges; then fit a linear regression model of Nr on S, whose coefficient is the fractal dimension of the image,
finally, the mean of the fractal dimension feature over all cell nuclei in the local region is calculated as the MND_smu_osnsci_single_fractal_dim_mean feature;
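The range-based procedure above can be sketched as follows. This is a minimal illustration that follows the text literally (a linear regression of the mean block range Nr on the block size S); the block sizes `(2, 4, 8)`, the dropping of incomplete edge blocks, and the function names are assumptions for the example. Many fractal-dimension estimators regress on logarithms instead, which the text does not state.

```python
import numpy as np

def mean_block_range(image, size):
    """Average (max - min) of pixel values over all size x size blocks."""
    height, width = image.shape
    height, width = height - height % size, width - width % size  # drop partial blocks
    blocks = image[:height, :width].reshape(height // size, size, width // size, size)
    ranges = blocks.max(axis=(1, 3)) - blocks.min(axis=(1, 3))
    return ranges.mean()

def fractal_coefficient(image, sizes=(2, 4, 8)):
    """Slope of the linear regression of Nr_j on S_j, as described in the text."""
    image = np.asarray(image, dtype=float)
    mean_ranges = [mean_block_range(image, s) for s in sizes]
    slope, _intercept = np.polyfit(sizes, mean_ranges, 1)
    return slope
```

For an 8×8 gradient image whose pixel value is row index plus column index, the mean block ranges are 2, 6 and 14 for sizes 2, 4 and 8, so the fitted slope is 2.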
- WSI_snu_esmerci_lbp3_5_disorder: the H&E-stained pathological image is deconvolved from the RGB color space into the HEO color space, the minimum bounding rectangle of each cell nucleus is cropped from the E-channel image, and the statistical features of the Local Binary Pattern (LBP) are calculated as follows:
if the pixel value $g_i$ of pixel i is greater than or equal to the pixel value $g_j$ of a neighbourhood pixel j, the local binary value at the position of pixel j is 0; if $g_i$ is less than $g_j$, the local binary value at the position of pixel j is 1. The local binary values in the neighbourhood of pixel i are concatenated into a binary number, which is the local binary pattern of pixel i. All binary numbers are then mapped to uniform, rotation-invariant patterns, and the frequency of the local binary patterns of all pixels in the image is counted as the local binary pattern feature of the image,
finally, the disorder value of the local binary pattern feature for pattern 5 over all cell nuclei is calculated as the WSI_snu_esmerci_lbp3_5_disorder feature, where the disorder value is calculated as follows:
where std is the standard deviation and mean is the mean;
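The per-pixel part of the local binary pattern described above can be sketched as below. This is an illustrative sketch only: the 8-pixel neighbourhood, the clockwise neighbour order, and the rotation-invariant uniform labelling (a pattern with at most two 0/1 transitions is labelled by its number of ones, every other pattern gets the label P + 1) are assumptions based on the common LBP convention, and the function names are hypothetical. The disorder formula is not reproduced in this text and is not guessed here.

```python
from collections import Counter

# Clockwise 8-neighbourhood offsets (dy, dx), an assumed ordering.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_bits(image, y, x):
    """Bit is 0 when the centre pixel >= neighbour, 1 otherwise (as in the text)."""
    centre = image[y][x]
    return [0 if centre >= image[y + dy][x + dx] else 1 for dy, dx in OFFSETS]

def uniform_rotation_invariant_label(bits):
    """Number of ones for uniform patterns, len(bits) + 1 for all others."""
    transitions = sum(bits[k] != bits[(k + 1) % len(bits)] for k in range(len(bits)))
    return sum(bits) if transitions <= 2 else len(bits) + 1

def lbp_histogram(image):
    """Relative frequency of each label over the interior pixels of the image."""
    labels = [
        uniform_rotation_invariant_label(lbp_bits(image, y, x))
        for y in range(1, len(image) - 1)
        for x in range(1, len(image[0]) - 1)
    ]
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in counts}
```

Under these assumptions, the frequency of label 5 in `lbp_histogram` would be the per-nucleus quantity that is then aggregated over nuclei.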
- MED_smu_esmerci_ngtdm_Busyness_range: taking the cell nucleus whose nucleus density (number of local cell nuclei / local area) is the median in the pathological image as the center, a local region of fixed size K×K pixels is cropped, the minimum bounding rectangle of each cell nucleus is cropped from the H-channel image of the local region, the NGTDM is calculated, and the Busyness feature is then calculated:
finally, the range of the Busyness feature over all cell nuclei is calculated as the MED_smu_esmerci_ngtdm_Busyness_range feature;
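The Busyness formula does not survive in this text. As a hedged reconstruction, assuming the standard NGTDM Busyness definition (an assumption based on the feature name) and an NGTDM with rows $[N_j, F_j, P_j, S_j]$, it would read:

```latex
\mathrm{Busyness} \;=\; \frac{\sum_{j=1}^{n} P_j\, S_j}{\sum_{j=1}^{n} \sum_{k=1}^{n} \left| N_j P_j - N_k P_k \right|},
\qquad P_j \neq 0,\; P_k \neq 0
```

A high value indicates rapid intensity changes between a pixel and its neighbourhood.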
- WSI_th_ori_firstorder_Range: the stained pathological image is downscaled by a fixed factor, and the range of pixel values in the H channel is then calculated as the WSI_th_ori_firstorder_Range feature;
- MED_e_ori_ngtdm_Complexity: taking the cell nucleus whose nucleus density (number of local cell nuclei / local area) is the median in the pathological image as the center, a local region of fixed size K×K pixels is cropped, the NGTDM of the local region in the E channel is calculated, and the Complexity feature is then calculated as the MED_e_ori_ngtdm_Complexity feature, where the Complexity feature is calculated as follows:
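The Complexity formula does not survive in this text. As a hedged reconstruction, assuming the standard NGTDM Complexity definition (an assumption based on the feature name) and an NGTDM with rows $[N_j, F_j, P_j, S_j]$, it would read:

```latex
\mathrm{Complexity} \;=\; \frac{1}{N_v} \sum_{j=1}^{n} \sum_{k=1}^{n}
\left| N_j - N_k \right| \, \frac{P_j S_j + P_k S_k}{P_j + P_k},
\qquad N_v = \sum_{j=1}^{n} F_j,\; P_j \neq 0,\; P_k \neq 0
```

Here $N_v$ is the total number of pixels counted in the matrix; the value is high for non-uniform regions with many abrupt gray-level changes.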
- MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean: taking the cell nucleus with the maximum nucleus density (number of local cell nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped, the minimum bounding rectangle of each cell nucleus is cropped from the H-channel image of the local region, the GLDM is calculated, the LDLGLE feature is then calculated, and finally the mean of the LDLGLE feature over all cell nuclei in the local region is calculated as the MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean feature;
- MED_smu_osnsci_lbp4_12_range: taking the cell nucleus whose nucleus density (number of local cell nuclei / local area) is the median in the pathological image as the center, a local region of fixed size K×K pixels is cropped; then, centered on each cell nucleus in the O-channel image of the local region, a fixed-size K×K region is cropped, and the range of the local binary pattern feature for pattern 12 over all cell nuclei in the cropped regions is calculated as the MED_smu_osnsci_lbp4_12_range feature;
- MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis: taking the cell nucleus with the minimum nucleus density (number of local cell nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped, and the Gray Level Size Zone Matrix (GLSZM) of the local region in the E channel is calculated. The value P(i, j) in row i, column j of the GLSZM is the number of connected regions with gray level i and size j. Finally, the LAHGLE (LargeAreaHighGrayLevelEmphasis) feature is calculated from the GLSZM as the MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis feature, where the LAHGLE feature is calculated as follows:
where $N_g$ is the number of gray levels in the image (number of GLSZM rows), $N_d$ is the number of distinct connected-region sizes in the image (number of GLSZM columns), and $N_z$ is the number of connected regions in the image (sum of the GLSZM entries).
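The LAHGLE formula does not survive in this text. As a hedged reconstruction, assuming the standard Large Area High Gray Level Emphasis definition for GLSZM texture features (an assumption based on the feature name), with $N_g$ gray levels, $N_d$ taken here as the number of distinct connected-region sizes, and $N_z$ the total number of connected regions, it would read:

```latex
\mathrm{LAHGLE} \;=\; \frac{1}{N_z} \sum_{i=1}^{N_g} \sum_{j=1}^{N_d} P(i,j)\, i^{2}\, j^{2}
```

The value grows when large connected regions occur at high gray levels.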
The step 1 of sorting out medically significant variables in the clinical information specifically includes:
- the cancer TNM staging index, which divides the patient's cancer into five stages, stage 0 to stage IV, according to the size of the primary tumor, the degree of spread to regional lymph nodes, and whether distant metastasis has occurred; each stage can be subdivided further;
- the biopsy collection method, which includes lobectomy and segmentectomy;
- the patient ethnicity information, which includes Asian, Caucasian, African (Black) and Indian.
The step 2 specifically comprises:
step 2.1: missing values are handled separately for the clinical information and the pathological image features: variables with many missing values are deleted (for example, more than 5 missing samples), samples with many missing values are deleted (for example, more than 20% of values missing), the missing values of each continuous variable are then filled with its mean, and the missing values of each discrete variable are filled by sampling from its non-missing values;
step 2.2: the discrete multi-valued categorical variables are converted to dummy variables after missing-value handling;
step 2.3: obviously unreasonable variables are removed from the data obtained in step 2.2, including variables with zero variance and discrete variables with severely imbalanced categories;
step 2.4: the continuous variables of the data obtained in step 2.3 are normalized.
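Steps 2.1 to 2.4 can be sketched with pandas as follows. This is a minimal sketch under stated assumptions, not the patent's code: the function name `preprocess`, the default thresholds (taken from the examples in step 2.1), min-max scaling as the normalization, and the omission of the imbalance check in step 2.3 are all choices made for illustration.

```python
import numpy as np
import pandas as pd

def preprocess(df, max_missing_per_variable=5, max_missing_frac_per_sample=0.2, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2.1: drop variables, then samples, with too many missing values.
    df = df.loc[:, df.isna().sum(axis=0) <= max_missing_per_variable]
    df = df.loc[df.isna().mean(axis=1) <= max_missing_frac_per_sample].copy()
    continuous = df.select_dtypes(include="number").columns
    discrete = df.columns.difference(continuous)
    # Continuous variables: fill with the mean.
    df[continuous] = df[continuous].fillna(df[continuous].mean())
    # Discrete variables: fill by sampling from the observed values.
    for col in discrete:
        missing = df[col].isna()
        if missing.any():
            observed = df.loc[~missing, col].to_numpy()
            df.loc[missing, col] = rng.choice(observed, size=int(missing.sum()))
    # Step 2.2: dummy variables for the discrete variables.
    df = pd.get_dummies(df, columns=list(discrete), dtype=float)
    # Step 2.3: remove zero-variance variables (imbalance check omitted here).
    df = df.loc[:, df.var(axis=0) > 0].copy()
    # Step 2.4: min-max normalization of continuous variables (assumed scheme).
    kept = [c for c in continuous if c in df.columns]
    df[kept] = (df[kept] - df[kept].min()) / (df[kept].max() - df[kept].min())
    return df
```

Sampling from the observed values keeps the category proportions of a discrete variable roughly unchanged, which is why it is preferred here over filling with a single most frequent value.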
After the data processing according to the step 2, the clinical information variables specifically include:
- pathologic_stage.StageI: 1 if the TNM stage is Stage I, otherwise 0;
- specimen_collection_method_name.Lobectomy: 1 if the biopsy sample was collected by lobectomy, otherwise 0;
- asian: 1 if the patient's ethnicity is Asian, otherwise 0;
- relative_family_cancer_history.YES: 1 if the patient has a family history of hereditary disease, otherwise 0;
- pathologic_stage.StageII: 1 if the TNM stage is Stage II, otherwise 0;
giving 5 clinical information variables in total.
In step 3, the interactions are calculated by taking the image features and the clinical information as basic variables and calculating the product of any two variables as the interaction between them; the resulting interaction matrix is the input data of the survival random forest model.
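The pairwise products can be sketched as below. This is a minimal illustration; the function name `interaction_matrix`, the column naming scheme, and the choice to include each unordered pair once (without squared terms) are assumptions, since the text only says that the product of any two variables is their interaction.

```python
from itertools import combinations

import numpy as np

def interaction_matrix(basic_variables, names):
    """Return pairwise products of the columns and a name for each product."""
    X = np.asarray(basic_variables, dtype=float)
    pairs = list(combinations(range(X.shape[1]), 2))
    interactions = np.column_stack([X[:, a] * X[:, b] for a, b in pairs])
    interaction_names = [f"{names[a]}*{names[b]}" for a, b in pairs]
    return interactions, interaction_names
```

With the 15 basic variables of the embodiment (10 image features and 5 clinical variables), this construction would give 105 interaction columns per patient.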
The step 4 specifically includes:
step 4.1: setting a survival random forest parameter variation range;
step 4.2: for each group of parameters, randomly divide the input data into k parts such that the proportion of censored events in each part is approximately equal to the overall proportion; take each part in turn as the validation set, input the remaining k-1 parts into the survival random forest model as the training set, and calculate the C-index from the predicted and true values of the validation set. The basic principles of the random forest model and of permutation feature importance are as follows:
- randomly draw $N_i$ samples with replacement from the N samples, and build a decision tree from these $N_i$ samples;
- when building the decision tree, randomly select $M_j$ attributes from the M attributes, choose one of these $M_j$ attributes by a splitting criterion (such as the Gini coefficient) as the split attribute of the node, and split the data; keep growing the tree until a stopping condition is reached (for example, no further split is possible or the number of splits reaches 10);
- repeat the two steps above, and build a random forest from the large number of decision trees obtained;
- for each decision tree, randomly permute the observed values of a given feature, compare the resulting prediction accuracy with the original accuracy, and take the difference as the permutation importance of that feature for the tree; the average over all decision trees is the importance of the feature. Repeating this for every feature gives the importance of all features;
step 4.3: according to the C-index values, select the group of survival random forest parameters with the better prediction performance, and fit the model on all input data to build the final random forest prediction model.
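The two model-independent pieces of step 4.2, the censoring-stratified split and the C-index, can be sketched in plain Python as below. This is a hedged sketch: the function names are hypothetical, the C-index follows Harrell's common definition (a pair is comparable when the patient with the shorter time had an observed event, and ties in predicted risk count as one half), and the survival random forest that would supply the risk scores is not shown.

```python
import random

def stratified_folds(events, k, seed=0):
    """Split sample indices into k folds with similar censoring proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for status in (0, 1):  # 0 = censored, 1 = event observed
        indices = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

For each parameter group one would train on k-1 folds, score the held-out fold with `concordance_index`, and compare the k resulting values across parameter groups; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.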
The step 5 specifically includes:
step 5.1: calculating the recurrence risk classification index; two kinds of index are included:
- sort the input features of the random forest prediction model built in step 4 by permutation importance, and select the most important features as recurrence risk classification indices;
- use the random forest prediction model built in step 4 to make a prediction from the independent variables of the input data to be classified, and take the predicted value as the recurrence risk classification index;
step 5.2: if the recurrence risk classification index is a discrete variable, patients are grouped by each discrete value; if it is a continuous variable, all patients are divided into two groups according to whether the index is greater than the median (or mean).
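Both parts of step 5 can be sketched in plain Python as follows. This is an illustrative, model-agnostic sketch rather than the patent's procedure: `score_fn` stands for any function that scores a fitted model on a data set (for example, a C-index), and the names `permutation_importance` and `split_by_median` as well as the labels "high" and "low" are assumptions. The sketch also shuffles a feature over the whole data set, whereas the text describes a per-tree computation inside the forest.

```python
import random
from statistics import median

def permutation_importance(score_fn, rows, column, n_repeats=20, seed=0):
    """Mean drop in score when the values of one column are shuffled."""
    rng = random.Random(seed)
    baseline = score_fn(rows)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        drops.append(baseline - score_fn(permuted))
    return sum(drops) / n_repeats

def split_by_median(index_values):
    """Label each patient by whether a continuous index exceeds the median."""
    cut = median(index_values)
    return ["high" if value > cut else "low" for value in index_values]
```

A feature the model ignores gets an importance of 0, while a feature the model relies on gets a positive importance; the median split then turns the chosen index into the two subgroups of step 5.2.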
The advantages of the invention are as follows. It provides an accurate method for predicting liver cancer recurrence risk that mines the key recurrence-related factors in pathological images in depth, and its prediction performance is markedly better than prediction from clinical information alone. It is more stable than recurrence risk prediction made by doctors on the basis of medical knowledge and experience. The embodiment shows that good prediction performance is obtained with only 10 pathological image features and 5 items of clinical information, which greatly reduces the computational burden and allows wide application. In addition, the recurrence risk classification index calculated by the invention distinguishes patients with higher and lower recurrence risk, and has considerable value for medical research and application.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 compares the C-index of the model used in the present invention with that of other models. From left to right along the abscissa, the box plots are: a prediction model using 64128 pathological image features and 13 items of clinical information (imcln); a prediction model using the 64128 pathological image features (im); a prediction model using the 13 items of clinical information (cln); a prediction model using the 5 screened items of clinical information (cln.s); a prediction model using the 10 screened pathological image features (im.s); a prediction model using the 5 screened items of clinical information and the 10 screened pathological image features (imcln.s); a prediction model using the 5 screened items of clinical information, the 10 screened pathological image features and the interactions among all these variables (imcln.smi); and a prediction model that forms 15 variables from the 5 screened items of clinical information and the 10 screened pathological image features and uses only the interactions among them, excluding the 15 variables themselves (imcln.si).
FIG. 3 shows the Kaplan-Meier curves of the high and low recurrence risk patients classified with the recurrence risk classification index of the present invention. The abscissa represents the survival time of the patients and the ordinate represents the survival rate; the horizontal and vertical dashed lines show the difference in median survival between the two patient groups, and p < 0.0001 indicates that the log-rank test p-value for the survival distributions of the patient subgroups is less than 0.0001.
Detailed Description
The invention provides a method for predicting the recurrence risk of liver cancer by combining pathological image features and clinical information. The technical features and advantages of the invention are described below with reference to the figures and the embodiment.
The data of the embodiment come from the public database TCGA-LIHC, and the code is implemented in Python 3.7 and R 3.6. The specific implementation is shown in FIG. 1. The liver cancer recurrence risk prediction method combining pathological image features and clinical information provided by the invention comprises the following steps:
step 1: extracting image features from the pathological image, and collating the medically meaningful variables in the clinical information;
step 2: data processing, including missing-value handling, dummy variable setting, removal of obviously unreasonable variables, and normalization;
step 3: combining the pathological image features and the clinical information, and calculating the interactions between them as input data;
step 4: selecting a better survival random forest model through cross-validation, using the C-index as the performance index;
step 5: classifying patients into recurrence risk subgroups using indices extracted from the survival random forest model, calculating the survival function of each subgroup, plotting Kaplan-Meier curves, and fitting a Cox proportional hazards model to evaluate the classification indices.
The pathological image features extracted in the step 1 specifically include:
WSI_snu_osmerci_ngtdm_Strength_range,
MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,
MND_smu_osnsci_single_fractal_dim_mean,
WSI_snu_esmerci_lbp3_5_disorder,
MED_smu_esmerci_ngtdm_Busyness_range,
WSI_th_ori_firstorder_Range,
MED_e_ori_ngtdm_Complexity,
MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,
MED_smu_osnsci_lbp4_12_range,
MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis,
for a total of 10 pathological image features.
The clinical information collated in step 1 comprises the cancer TNM staging index, the biopsy tissue collection method and patient ethnicity information.
In step 2, data processing is performed separately on the pathological image features and on the clinical information, and specifically comprises the following steps:
Step 2.1: missing value processing: deleting features whose missing values exceed 5; deleting samples in which more than 50% of the values are missing and whose missing entries are not shared by the remaining samples; filling missing values of continuous features with the mean; and filling missing values of discrete features by sampling from the non-missing values, so that the proportions of the discrete values among the non-missing entries are preserved;
Step 2.2: setting dummy variables for discrete features: if the discrete values of a feature are independent of one another, each discrete value is set as its own dummy variable; if the discrete values of a feature are related to one another, the dummy variables are set according to that relationship; if the discrete values of a feature are imbalanced, merging dummy variables is considered;
Step 2.3: deleting discrete features with imbalanced data (for example A:B = 100:1), and deleting continuous features whose variance is 0;
Step 2.4: normalizing each continuous feature.
After data processing, 10 pathological image features and 5 clinical information variables are obtained.
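The data processing in steps 2.1 to 2.4 can be illustrated with a minimal Python sketch. It is not the code of the embodiment: the function names are hypothetical, missing entries are assumed to be represented by `None`, and z-score scaling is an assumed choice because the text does not say which normalization is used.

```python
import random
from statistics import mean, pstdev

def fill_continuous(values):
    """Step 2.1: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def fill_discrete(values, rng):
    """Step 2.1: fill None entries by sampling from the observed values,
    which preserves the observed category proportions in expectation."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def one_hot(values, level):
    """Step 2.2: dummy variable that is 1 where the entry equals `level`."""
    return [1 if v == level else 0 for v in values]

def is_imbalanced(dummy, max_ratio=100):
    """Step 2.3: flag a 0/1 feature whose class ratio reaches max_ratio:1."""
    ones = sum(dummy)
    zeros = len(dummy) - ones
    minority, majority = min(ones, zeros), max(ones, zeros)
    return minority == 0 or majority / minority >= max_ratio

def standardize(values):
    """Steps 2.3 and 2.4: z-score scaling (an assumed normalization);
    returns None for a zero-variance feature, which step 2.3 deletes."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return None
    return [(v - m) / s for v in values]
```

A feature would pass through `fill_continuous` or `fill_discrete` first, then through `one_hot` (discrete) or `standardize` (continuous), with `is_imbalanced` and the `None` return value marking features to drop.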
The interaction in step 3 refers to the data obtained by multiplying the 15 variables obtained in step 2 with one another in pairs, which gives a total of 105 interaction variables (the number of unordered pairs of 15 variables, 15 × 14 / 2 = 105).
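As a small illustration of the pairwise products described above, the following Python sketch builds one interaction term for every unordered pair of 15 variables. The variable names `x1` to `x15` and the sample values are placeholders, not the features of the embodiment.

```python
from itertools import combinations

def pairwise_interactions(row, names):
    """Return one product per unordered pair of distinct variables,
    keyed as "a*b"."""
    return {f"{a}*{b}": row[a] * row[b] for a, b in combinations(names, 2)}

# 15 placeholder variables standing in for the 10 image features
# and 5 clinical variables of a single patient.
names = [f"x{i}" for i in range(1, 16)]
row = {name: float(i) for i, name in enumerate(names, start=1)}

interactions = pairwise_interactions(row, names)
print(len(interactions))  # 105
```

Using `combinations` rather than a full double loop avoids both the squared terms and the duplicated `b*a` products, which is what makes the count 105 rather than 225.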
The step 4 specifically includes:
step 4.1: setting a survival random forest parameter variation range;
Step 4.2: for each group of parameters, randomly dividing the input data into 3 parts such that the proportion of censored events in each part is approximately the same as the overall proportion of censored events; taking each part in turn as the validation set and inputting the remaining 2 parts into the survival random forest model as the training set; and calculating the C-index from the predicted values and the true values of the validation set;
Step 4.3: according to the mean of the 3-fold cross-validation C-index obtained in step 4.2, selecting the group of survival random forest model parameters with the better prediction effect, and fitting all input data to construct the final survival random forest prediction model.
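The cross-validation loop of steps 4.1 to 4.3 can be sketched in plain Python as follows. This is an illustrative sketch only: the survival random forest itself is not implemented, and `fit_predict` is a hypothetical callable standing in for "train on these indices, return one risk score per held-out index". The C-index shown is Harrell's concordance index, which is assumed to be the variant meant by the text.

```python
import random

def stratified_folds(events, k=3, seed=0):
    """Split sample indices into k folds whose censoring rate stays close
    to the overall rate. events[i] is 1 for an observed recurrence and
    0 for a censored sample."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for status in (0, 1):
        idx = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each stratum out round-robin
    return folds

def concordance_index(times, events, risk):
    """Harrell's C-index: fraction of comparable pairs in which the sample
    with the earlier observed event has the higher predicted risk.
    Ties in predicted risk count as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i has an observed event before j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

def cross_validated_c_index(times, events, fit_predict, k=3, seed=0):
    """Mean validation C-index over k stratified folds."""
    folds = stratified_folds(events, k, seed)
    scores = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        risk = fit_predict(train, held_out)
        scores.append(concordance_index(
            [times[i] for i in held_out],
            [events[i] for i in held_out],
            risk,
        ))
    return sum(scores) / k
```

Parameter selection then amounts to calling `cross_validated_c_index` once per candidate parameter group and keeping the group with the highest mean score before refitting on all data.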
The step 5 specifically includes:
Step 5.1: predicting all input data with the survival random forest model constructed in step 4, and taking the predicted value as the recurrence risk classification index of each patient;
Step 5.2: dividing the patients into two subgroups according to whether each patient's recurrence risk classification index is larger than the median;
Step 5.3: calculating the survival function of each of the two subgroups, drawing the Kaplan-Meier curve of each subgroup from its survival function, and testing the classification result with the log-rank test p-value;
Step 5.4: fitting a Cox proportional hazards model of patient survival time on the recurrence risk classification index, and determining the effectiveness of the recurrence risk classification index from the Cox proportional hazards model and the hypothesis test result of its coefficient.
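Steps 5.2 and 5.3 can be sketched with a hand-written Kaplan-Meier estimator and a two-group log-rank test. The sketch below is illustrative and uses only the Python standard library; the function names are hypothetical, and a real analysis would normally rely on an established survival analysis package.

```python
import math
from statistics import median

def split_by_median(risk):
    """Step 5.2: indices of patients above the median risk, and the rest."""
    cut = median(risk)
    high = [i for i, r in enumerate(risk) if r > cut]
    low = [i for i, r in enumerate(risk) if r <= cut]
    return high, low

def kaplan_meier(times, events):
    """Step 5.3: Kaplan-Meier estimate as [(t, S(t))] at each event time."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1 - d / at_risk
        curve.append((t, surv))
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    all_t = list(times_a) + list(times_b)
    all_e = list(events_a) + list(events_b)
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({t for t, e in zip(all_t, all_e) if e == 1}):
        n_a = sum(1 for u in times_a if u >= t)   # at risk in group a
        n = sum(1 for u in all_t if u >= t)       # at risk overall
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e == 1)
        d = sum(1 for u, e in zip(all_t, all_e) if u == t and e == 1)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```

The p-value uses the identity that the upper tail of a chi-square variable with one degree of freedom equals `erfc(sqrt(x / 2))`, which avoids any dependency beyond the standard library.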
In the embodiment of the present invention, the mean C-index when predicting with the interactions (imcln.si) was 0.765. The mean C-index when predicting with clinical information only (cln.s) was 0.690, and the t-test comparing it with the C-index obtained with the interactions gave a p-value of 1.519e-13. The mean C-index when predicting with pathological image features only (im.s) was 0.707, and the t-test comparing it with the C-index obtained with the interactions gave a p-value of 4.712e-8. The t-test comparing the C-index obtained with pathological image features only (im.s) against the C-index obtained with clinical information only (cln.s) gave a p-value of 0.026. These results show that pathological image features significantly improve prediction accuracy relative to clinical information alone, and that the interaction between clinical information and pathological image features plays an important role in improving the accuracy of recurrence risk prediction.
In the embodiment of the present invention, the log-rank test p-value for the recurrence-free survival distributions of the two subgroups classified by the recurrence risk classification index is less than 0.0001, which indicates that the index is valid and reliable for classifying patients into high and low recurrence risk. In addition, a Cox proportional hazards model was fitted with the recurrence risk classification index as the only continuous independent variable. The log-rank test p-value of the model is 1e-13 (the null hypothesis of this test is that all coefficients are 0, and it is used to test whether the independent variable has a significant influence on the model's prediction), and the Wald test p-value of the coefficient is 2.26e-9. The recurrence risk classification index is therefore significantly associated with patient survival time, which further demonstrates its effectiveness.
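The single-covariate Cox fit and the Wald test of its coefficient can be illustrated with the following Python sketch. It maximizes the Cox partial likelihood by Newton-Raphson and is an assumption-laden illustration rather than the code of the embodiment: the function name is hypothetical, tied event times are handled in the Breslow manner, there is no step-size control, and the toy data in the usage note are invented for demonstration only.

```python
import math

def fit_cox_single(times, events, x, iterations=25):
    """Fit h(t | x) = h0(t) * exp(beta * x) for one covariate.
    Returns (beta, standard error, two-sided Wald p-value)."""
    beta, info = 0.0, 0.0
    for _ in range(iterations):
        score, info = 0.0, 0.0
        for i, (t_i, e_i) in enumerate(zip(times, events)):
            if e_i != 1:
                continue  # censored samples only appear in risk sets
            risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
            w = [math.exp(beta * x[j]) for j in risk_set]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk_set))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk_set))
            score += x[i] - s1 / s0              # gradient of log partial likelihood
            info += s2 / s0 - (s1 / s0) ** 2     # observed information
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    se = 1 / math.sqrt(info)
    z = beta / se
    # Two-sided p-value of a standard normal statistic.
    return beta, se, math.erfc(abs(z) / math.sqrt(2))
```

For example, with the invented data `times = [1, 2, 3, 4, 5, 6, 7, 8]`, `events = [1, 1, 0, 1, 1, 0, 1, 1]` and `x = [2.0, 1.0, 2.5, 0.5, 1.5, 0.2, 1.0, 0.3]`, the call `fit_cox_single(times, events, x)` returns a positive coefficient, meaning a higher index value is associated with earlier events in that toy sample.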
The above examples are only intended to illustrate the technical solution of the invention and to confirm the effectiveness and superiority of the proposed method, without limiting it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (8)
1. A liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information, characterized in that the method comprises the following steps:
Step 1: extracting image features from the pathological images, and collating the medically meaningful variables in the clinical information;
Step 2: data processing, including missing value processing, dummy variable setting, removal of obviously unreasonable variables, and normalization;
Step 3: combining the pathological image features and the clinical information, and calculating the interactions between them as input data;
Step 4: selecting a better survival random forest model through cross validation, using the C-index as the performance evaluation index;
Step 5: generating a recurrence risk classification index with the survival random forest model, and classifying patients into recurrence risk subgroups.
2. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the image features of the pathological image in step 1 comprise:
WSI_snu_osmerci_ngtdm_Strength_range,
MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,
MND_smu_osnsci_single_fractal_dim_mean,
WSI_snu_esmerci_lbp3_5_disorder,
MED_smu_esmerci_ngtdm_Busyness_range,
WSI_th_ori_firstorder_Range,
MED_e_ori_ngtdm_Complexity,
MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,
MED_smu_osnsci_lbp4_12_range,
MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis,
for a total of 10 image features, the calculation method of which is described in detail in the specification.
3. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the clinical information variables collated in step 1 comprise the cancer TNM staging index, the biopsy tissue collection method and patient ethnicity information.
4. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that step 2 is performed separately on the pathological image features and on the clinical information variables, and the clinical information variables after data processing comprise:
staged i: 1 if the TNM stage is Stage I, otherwise 0;
specific_collection_method_name.lobectomy: 1 if the biopsy sample was collected by lobectomy, otherwise 0;
asian: 1 if the patient's race is Asian, otherwise 0;
relative_family_list_history.yes: 1 if there is a family history of hereditary disease, otherwise 0;
staged II: 1 if the TNM stage is Stage II, otherwise 0;
for a total of 5 clinical information variables.
5. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the interactions in step 3 are new variables obtained by multiplying the 15 variables, formed by the clinical information variables and the pathological image features, with one another in pairs, and the number of interaction variables is 105.
6. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the specific steps of selecting a better model through cross validation in step 4 are as follows:
Step 4.1: setting the variation range of the survival random forest parameters;
Step 4.2: for each group of parameters, randomly dividing the input data into k parts such that the proportion of censored events in each part is approximately the same as the overall proportion of censored events; taking each part in turn as the validation set and inputting the remaining k-1 parts into the survival random forest model as the training set; and calculating the C-index from the predicted values and the true values of the validation set;
Step 4.3: selecting the group of survival random forest model parameters with the better prediction effect according to the C-index value, fitting all input data to construct the final random forest prediction model, and predicting the recurrence risk of new patients with this random forest model.
7. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the recurrence risk classification index in step 5 comprises two kinds of index: (1) ranking the input features of the random forest prediction model constructed in step 4 by permutation importance, and selecting the feature with the highest importance as the recurrence risk classification index; (2) using the random forest prediction model constructed in step 4 to predict from the independent variables of the input data to be classified, and taking the predicted value as the recurrence risk classification index.
8. The method for classifying patients by the recurrence risk classification index according to claim 7, characterized in that: if the recurrence risk classification index is a discrete variable, the patients are classified according to each discrete value of the index; if the recurrence risk classification index is a continuous variable, the patients are divided into high and low recurrence risk subgroups according to whether the index is greater than the median or the mean.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911265751.5A CN110993106A (en) | 2019-12-11 | 2019-12-11 | Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110993106A true CN110993106A (en) | 2020-04-10 |
Family
ID=70092314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911265751.5A Pending CN110993106A (en) | 2019-12-11 | 2019-12-11 | Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110993106A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106355208A (en) * | 2016-08-31 | 2017-01-25 | 广州精点计算机科技有限公司 | Data prediction analysis method based on COX model and random survival forest |
CN106815481A (en) * | 2017-01-19 | 2017-06-09 | 中国科学院深圳先进技术研究院 | A kind of life cycle Forecasting Methodology and device based on image group |
WO2019224044A1 (en) * | 2018-05-22 | 2019-11-28 | Koninklijke Philips N.V. | Performing a prognostic evaluation |
CN109642258A (en) * | 2018-10-17 | 2019-04-16 | 上海允英医疗科技有限公司 | A kind of method and system of tumor prognosis prediction |
CN110111892A (en) * | 2019-04-29 | 2019-08-09 | 杭州电子科技大学 | A kind of postoperative short-term relapse and metastasis risk evaluating system of NSCLC patient |
Non-Patent Citations (1)
Title |
---|
Cui Haibo et al.: "Esophageal cancer risk prediction based on random survival forest and network topology information" * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111554402A (en) * | 2020-04-24 | 2020-08-18 | 山东省立医院 | Machine learning-based method and system for predicting postoperative recurrence risk of primary liver cancer |
CN111784637A (en) * | 2020-06-04 | 2020-10-16 | 复旦大学附属中山医院 | Prognostic characteristic visualization method, system, equipment and storage medium |
CN112768060A (en) * | 2020-07-14 | 2021-05-07 | 福州宜星大数据产业投资有限公司 | Liver cancer postoperative recurrence prediction method based on random survival forest and storage medium |
CN111985584A (en) * | 2020-09-30 | 2020-11-24 | 平安科技(深圳)有限公司 | Disease auxiliary detection equipment, method, device and medium based on multi-mode data |
CN112309571B (en) * | 2020-10-30 | 2022-04-15 | 电子科技大学 | Screening method of prognosis quantitative characteristics of digital pathological image |
CN112309571A (en) * | 2020-10-30 | 2021-02-02 | 电子科技大学 | Screening method of prognosis quantitative characteristics of digital pathological image |
CN112562855B (en) * | 2020-12-18 | 2021-11-02 | 深圳大学 | Hepatocellular carcinoma postoperative early recurrence risk prediction method, medium and terminal equipment |
CN112562855A (en) * | 2020-12-18 | 2021-03-26 | 深圳大学 | Hepatocellular carcinoma postoperative early recurrence risk prediction method |
CN112908470A (en) * | 2021-02-08 | 2021-06-04 | 深圳市人民医院 | Hepatocellular carcinoma prognosis scoring system based on RNA binding protein gene and application thereof |
CN112908470B (en) * | 2021-02-08 | 2023-10-03 | 深圳市人民医院 | Hepatocellular carcinoma prognosis scoring system based on RNA binding protein gene and application thereof |
CN112991320A (en) * | 2021-04-07 | 2021-06-18 | 德州市人民医院 | System and method for predicting hematoma expansion risk of cerebral hemorrhage patient |
CN113180633A (en) * | 2021-04-28 | 2021-07-30 | 济南大学 | MR image liver cancer postoperative recurrence risk prediction method and system based on deep learning |
CN113850753B (en) * | 2021-08-17 | 2023-09-01 | 苏州鸿熙融合智能医疗科技有限公司 | Medical image information computing method, device, edge computing equipment and storage medium |
CN113850753A (en) * | 2021-08-17 | 2021-12-28 | 苏州鸿熙融合智能医疗科技有限公司 | Medical image information calculation method and device, edge calculation equipment and storage medium |
CN113724876A (en) * | 2021-09-10 | 2021-11-30 | 南昌大学第二附属医院 | Intra-stroke hospital complication prediction model based on multi-mode fusion and DFS-LLE algorithm |
CN113808747A (en) * | 2021-10-11 | 2021-12-17 | 南昌大学第二附属医院 | Ischemic stroke recurrence prediction method |
CN113808747B (en) * | 2021-10-11 | 2023-12-26 | 南昌大学第二附属医院 | Ischemic cerebral apoplexy recurrence prediction method |
CN114037774A (en) * | 2022-01-10 | 2022-02-11 | 雅安市人民医院 | Method and device for sequencing and transmitting images of cross sections of cranium and brain and storage medium |
CN114037774B (en) * | 2022-01-10 | 2022-03-08 | 雅安市人民医院 | Method and device for sequencing and transmitting images of cross sections of cranium and brain and storage medium |
CN114549896A (en) * | 2022-01-24 | 2022-05-27 | 清华大学 | Heterogeneous high-order representation method and device for full-view image for survival prediction |
CN114549896B (en) * | 2022-01-24 | 2024-08-16 | 清华大学 | Heterogeneous high-order representation method and device for full-field image for survival prediction |
CN118645251A (en) * | 2024-08-16 | 2024-09-13 | 上海孪心医疗科技有限公司 | Risk stratification method and system for prognosis of heart failure and atrial fibrillation and catheter ablation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110993106A (en) | Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information | |
CN107103187B (en) | Lung nodule detection grading and management method and system based on deep learning | |
CN109872772B (en) | Method for excavating colorectal cancer radiotherapy specific genes by using weight gene co-expression network | |
Dimitoglou et al. | Comparison of the C4.5 and a Naïve Bayes classifier for the prediction of lung cancer survivability | |
CN112382392A (en) | System for be used for pulmonary nodule risk assessment | |
US20070019854A1 (en) | Method and system for automated digital image analysis of prostrate neoplasms using morphologic patterns | |
CN113140258A (en) | Method for screening potential prognosis biomarkers of lung adenocarcinoma based on tumor infiltrating immune cells | |
CN107909102A (en) | A kind of sorting technique of histopathology image | |
CN115588507A (en) | Prognosis model of lung adenocarcinoma EMT related gene, construction method and application | |
CN107169497A (en) | A kind of tumor imaging label extracting method based on gene iconography | |
Paul et al. | Gland segmentation from histology images using informative morphological scale space | |
CN111062425A (en) | Unbalanced data set processing method based on C-K-SMOTE algorithm | |
CN113903471A (en) | Gastric cancer patient survival risk prediction method based on histopathology image and gene expression data | |
Lopez et al. | A new set of wavelet-and fractals-based features for Gleason grading of prostate cancer histopathology images | |
KR20240012738A (en) | Cluster analysis system and method of artificial intelligence classification for cell nuclei of prostate cancer tissue | |
CN115537467A (en) | Establishment method and application of ovarian cancer survival prognosis prediction molecular model based on deep neural network | |
WO2006122251A2 (en) | Method and system for automated digital image analysis of prostrate neoplasms using morphologic patterns | |
CN113571189A (en) | Establishment method of prediction model for survival benefit of gallbladder cancer patient after radiotherapy and chemotherapy | |
Radhakrishnan et al. | Detection of non-small cell lung cancer using histopathological images by the approach of deep learning | |
CN116504314B (en) | Gene regulation network construction method based on cell dynamic differentiation | |
CN112435133A (en) | Medical insurance combined fraud detection method, device and equipment based on graph analysis | |
CN117912694A (en) | Multi-mode cancer survival risk prediction method based on deep learning | |
CN111793692A (en) | Characteristic miRNA expression profile combination and lung squamous carcinoma early prediction method | |
Kabir et al. | Classification models and survival analysis for prostate cancer using RNA sequencing and clinical data | |
CN116313111A (en) | Breast cancer risk prediction method, system, medium and equipment based on combined model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | Inventor after: Yang Anli, Deng Feiwen, Hua Rui, Zhang Youlong, Li Jialu. Inventor before: Hua Rui, Zhang Youlong, Li Jialu |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200410 |