CN110993106A - Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information - Google Patents

Publication number
CN110993106A
Authority
CN
China
Prior art keywords
clinical information
pathological image
recurrence risk
risk
recurrence
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911265751.5A
Other languages
Chinese (zh)
Inventor
华芮
张游龙
李嘉路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huajia Biological Intelligence Technology Co ltd
Original Assignee
Shenzhen Huajia Biological Intelligence Technology Co ltd

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information, belonging to the technical field of constructing postoperative cancer recurrence risk prediction models. The method takes as basic variables the patient's clinical information and the pathological image features of the patient's tumor region extracted by image processing, computes the interactions among these basic variables as input data, fits a survival random forest model, and predicts the patient's survival time. The results of the embodiment show that the cross-validated C-index of the proposed model is higher than that obtained with pathological image features alone or with clinical information alone, so the accuracy of postoperative recurrence risk prediction for liver cancer is significantly improved. In addition, the invention provides a recurrence risk classification index by which patients can be divided into a higher-risk subgroup and a lower-risk subgroup, helping doctors devise a targeted treatment plan for each patient.

Description

Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information
Technical Field
The invention belongs to the technical field of construction of a cancer postoperative recurrence risk prediction model, and particularly relates to a liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information. Specifically, the method is based on a survival random forest model, combines pathological image characteristics and clinical parameter characteristics, and accurately predicts the postoperative recurrence risk of the liver cancer patient.
Background
Cancer is a health problem shared by all populations. According to 2019 statistics from the national cancer center, liver cancer accounted for 9.42% of all cancer cases in 2015, ranking fourth in incidence among malignant tumors, and for 13.94% of all cancer deaths, ranking second in mortality among malignant tumors. Liver cancer therefore has both high incidence and high mortality and seriously threatens public health. The high risk of postoperative recurrence is one of the main reasons for the high mortality of liver cancer patients. If the postoperative recurrence risk can be predicted well, doctors can devise a targeted treatment plan for each patient, which is of great significance for postoperative treatment and prognosis.
Artificial intelligence has developed rapidly in recent years, and its use for recurrence risk prediction is becoming increasingly common. Random forest is a widely used machine learning method that achieves high prediction accuracy while also performing feature screening. Survival analysis, on the other hand, is a family of methods designed for survival time data that are censored or truncated. A survival random forest model, built by combining survival analysis with the random forest model, can analyze survival data while retaining the advantages of the random forest.
In the clinical treatment of liver cancer, the pathological image is the diagnostic gold standard. The shape, color and texture of cells and the morphology of specific tissue structures in a pathological image are usually related to the occurrence or progression of disease, so the rich information contained in pathological images has the potential to improve recurrence risk prediction. However, judging recurrence risk merely by manually inspecting pathological images is unreliable because of subjective factors and the limits of the naked eye. What is needed is a stable method that mines pathological image information in depth to predict recurrence risk.
Against this background, the invention provides a method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information. The invention automatically extracts image feature information from H&E-stained histopathology images of liver cancer by image processing and combines it with the patient's routine clinical examination information in a survival random forest, so as to grade the patient's postoperative recurrence risk more accurately.
Disclosure of Invention
The invention aims to provide a liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information. The input data of the method comprise pathological image features extracted by image processing and clinical information of the patient such as the cancer TNM stage. The prediction model framework of the method is the survival random forest model; unlike the traditional random forest model, which can only handle regression and classification problems, the survival random forest model is designed to handle survival time data. The efficacy evaluation index of the method is the C-index, which estimates the probability that the model's prediction is concordant with the actually observed outcome and is commonly used in statistical analysis to evaluate the discrimination and consistency of a prognostic model. The method specifically comprises the following steps:
step 1: extracting image features from the pathological image, and collating the medically significant variables in the clinical information;
step 2: data processing, including missing-value handling, dummy variable setting, removal of obviously unreasonable variables, and normalization;
step 3: combining the pathological image features and the clinical information, and calculating the interactions between them as input data;
step 4: selecting a better survival random forest model by cross-validation, using the C-index as the efficacy evaluation index;
step 5: generating a recurrence risk classification index with the survival random forest model and using it to classify patients into recurrence risk subgroups.
The image features of the pathological image extracted in the step 1 specifically include the following features:
- WSI_snu_osmerci_ngtdm_Strength_range: the H&E-stained pathological image is deconvolved from the RGB color space to the HEO color space, the minimum bounding rectangle of each cell nucleus is cropped from the O channel image, and a Neighbouring Gray Tone Difference Matrix (NGTDM) is calculated. The NGTDM is calculated as follows:
let the cropped region contain n gray levels N_1, N_2, …, N_n; the NGTDM is then an n×4 matrix whose j-th row is [N_j, F_j, P_j, S_j], where N_j is the value of the j-th gray level, F_j is the number of occurrences of N_j, P_j is the relative frequency of N_j, and S_j is the sum, over all pixels with gray level N_j, of the absolute difference between N_j and the mean gray value of the pixel's neighbourhood,
the Strength feature is then computed from the NGTDM:
[Formula image for the Strength feature is not reproduced in this text.]
finally, the range (maximum minus minimum) of the Strength feature over all cell nuclei is calculated as the WSI_snu_osmerci_ngtdm_Strength_range feature;
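Because the Strength formula survives only as an image reference, the following is a minimal sketch rather than the patent's exact computation. It builds the n×4 matrix described above and then applies the commonly used NGTDM Strength definition, sum over i, j of (P_i + P_j)(N_i - N_j)^2 divided by the sum of S_i, which is an assumption here. The function names, the square neighbourhood and the `distance` parameter are hypothetical.

```python
import numpy as np

def ngtdm(region, distance=1):
    """Rows [N_j, F_j, P_j, S_j] for every gray level present in a 2-D region."""
    region = np.asarray(region, dtype=float)
    h, w = region.shape
    sums, counts = {}, {}
    for r in range(h):
        for c in range(w):
            window = region[max(r - distance, 0):r + distance + 1,
                            max(c - distance, 0):c + distance + 1]
            n_neighbours = window.size - 1
            if n_neighbours == 0:
                continue  # a pixel without neighbours contributes nothing
            g = region[r, c]
            # mean of the neighbourhood, excluding the centre pixel itself
            mean_nb = (window.sum() - g) / n_neighbours
            sums[g] = sums.get(g, 0.0) + abs(g - mean_nb)
            counts[g] = counts.get(g, 0) + 1
    total = sum(counts.values())
    rows = [[g, counts[g], counts[g] / total, sums[g]] for g in sorted(counts)]
    return np.array(rows, dtype=float).reshape(-1, 4)

def ngtdm_strength(matrix):
    """Assumed standard definition: sum (P_i + P_j)(N_i - N_j)^2 / sum S_i."""
    levels, p, s = matrix[:, 0], matrix[:, 2], matrix[:, 3]
    diff_sq = (levels[:, None] - levels[None, :]) ** 2
    numerator = ((p[:, None] + p[None, :]) * diff_sq).sum()
    denominator = s.sum()
    return float(numerator / denominator) if denominator > 0 else 0.0

def strength_range(nucleus_regions):
    """Range (max minus min) of Strength over the cropped nucleus regions."""
    values = [ngtdm_strength(ngtdm(region)) for region in nucleus_regions]
    return max(values) - min(values)
```

A call such as `strength_range(list_of_nucleus_crops)` would then give one value per whole-slide image, assuming the nucleus crops have already been produced by a separate segmentation step.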
- MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean: taking as the center the cell nucleus at which the cell nucleus density (number of local cell nuclei / local area) in the pathological image is at its minimum, a local region of fixed size K×K pixels is cropped; the minimum bounding rectangle of each cell nucleus is cropped from the H channel image of the local region, and a Gray Level Dependence Matrix (GLDM) is calculated. The GLDM is calculated as follows:
let the gray level of pixel m be g_m and the gray level of pixel n be g_n; when the distance between m and n is less than d and |g_m - g_n| ≤ α, pixel n is called a gray-level-dependent pixel of m, and the number of gray-level-dependent pixels of each pixel is counted. Let P(i, j) be the value in row i and column j (starting from 0) of the GLDM; it means that, among all pixels with gray level i, the number of pixels that have j gray-level-dependent pixels is P(i, j),
the LDLGLE (LargeDependenceLowGrayLevelEmphasis) feature is then calculated from the GLDM:
[Formula image for the LDLGLE feature is not reproduced in this text.]
where N_g is the number of gray levels in the image (number of GLDM rows), N_d is the number of distinct dependence sizes in the image (number of GLDM columns), and N_z is the number of dependence zones in the image (sum of the GLDM),
finally, the mean of the LDLGLE feature over all cell nuclei in the local region is calculated as the MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean feature;
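For orientation, the widely used definition of this GLDM feature is shown below. Treating it as the formula behind the missing image is an assumption, and the indices are taken to start at 1 here so that the division by i^2 is well defined.

```latex
\mathrm{LDLGLE}
  = \frac{1}{N_z} \sum_{i=1}^{N_g} \sum_{j=1}^{N_d} \frac{P(i,j)\, j^{2}}{i^{2}},
\qquad
N_z = \sum_{i=1}^{N_g} \sum_{j=1}^{N_d} P(i,j)
```

Under this definition the feature is large when many pixels combine a low gray level (small i) with a large number of dependent neighbours (large j).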
- MND_smu_osnsci_single_fractal_dim_mean: taking as the center the cell nucleus at which the cell nucleus density (number of local cell nuclei / local area) in the pathological image is at its minimum, a local region of fixed size K×K pixels is cropped; then, centered on each cell nucleus in the O channel image of the local region, a region of fixed size K×K is cropped and its fractal dimension is calculated. The fractal dimension is calculated as follows:
let the fractal sizes be S_1, S_2, …, S_n; for fractal size S_j, the image is divided into small blocks of S_j×S_j pixels, the range of the pixel values in each small block is computed, and the mean Nr_j of all the ranges is calculated. A linear regression model of Nr on S is then fitted, and its coefficient is the fractal dimension of the image,
finally, the mean of the fractal dimension over all cell nuclei in the local region is calculated as the MND_smu_osnsci_single_fractal_dim_mean feature;
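The block-range procedure above can be sketched as follows. This follows the text literally, regressing the mean range Nr directly on the block size S; many box-counting variants regress on logarithms instead, and the source does not say which is intended. The function names and the default block sizes are hypothetical.

```python
import numpy as np

def mean_block_range(image, size):
    """Mean of (max - min) over all non-overlapping size x size blocks."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    ranges = []
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            block = image[r:r + size, c:c + size]
            ranges.append(block.max() - block.min())
    return float(np.mean(ranges))

def fractal_dimension(image, sizes=(2, 4, 8)):
    """Slope of the least-squares line of Nr_j against S_j."""
    nr = [mean_block_range(image, s) for s in sizes]
    slope, _intercept = np.polyfit(sizes, nr, 1)
    return float(slope)
```

For a horizontal intensity ramp the range inside an S×S block is S - 1, so the fitted slope is 1; the per-nucleus values would then be averaged to obtain the feature.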
- WSI_snu_esmerci_lbp3_5_disorder: the H&E-stained pathological image is deconvolved from the RGB color space to the HEO color space, the minimum bounding rectangle of each cell nucleus is cropped from the E channel image, and the statistical features of the Local Binary Pattern (LBP) are calculated as follows:
if the pixel value g_i of pixel i is greater than or equal to the pixel value g_j of a neighbourhood pixel j, the local binary value at the position of pixel j is 0; if g_i is less than g_j, the local binary value at the position of pixel j is 1. The local binary values in the neighbourhood of pixel i are concatenated into a binary number, which is the local binary pattern of pixel i. Uniform and rotation-invariant processing is applied to all binary numbers, and the frequencies of the local binary patterns of all pixels in the image are counted as the local binary pattern features of the image,
finally, the disorder value of the local binary pattern feature for pattern 5 over all cell nuclei is calculated as the WSI_snu_esmerci_lbp3_5_disorder feature, where the disorder value is calculated as:
[Formula image for the disorder value is not reproduced in this text.]
where std denotes the standard deviation and mean denotes the mean;
- MED_smu_esmerci_ngtdm_Busyness_range: taking as the center the cell nucleus at which the cell nucleus density (number of local cell nuclei / local area) in the pathological image is at the median, a local region of fixed size K×K pixels is cropped; the minimum bounding rectangle of each cell nucleus is cropped from the H channel image of the local region, the NGTDM is calculated, and the Busyness feature is then calculated:
[Formula image for the Busyness feature is not reproduced in this text.]
finally, the range of the Busyness feature over all cell nuclei is calculated as the MED_smu_esmerci_ngtdm_Busyness_range feature;
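As a point of reference, the commonly used NGTDM Busyness definition is given below in the N_j, P_j, S_j notation of the NGTDM rows. Whether the missing formula image matches it exactly is an assumption.

```latex
\mathrm{Busyness}
  = \frac{\sum_{i=1}^{n} P_i S_i}
         {\sum_{i=1}^{n} \sum_{j=1}^{n} \left| N_i P_i - N_j P_j \right|},
\qquad P_i \neq 0,\; P_j \neq 0
```

The value grows when gray levels change rapidly between a pixel and its neighbourhood.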
- WSI_th_ori_firstorder_Range: the pathological staining image is downsampled by a factor of several, and the range of the pixel values in the H channel is then calculated as the WSI_th_ori_firstorder_Range feature;
- MED_e_ori_ngtdm_Complexity: taking as the center the cell nucleus at which the cell nucleus density (number of local cell nuclei / local area) in the pathological image is at the median, a local region of fixed size K×K pixels is cropped, the NGTDM of the local region in the E channel is calculated, and the Complexity feature is then calculated as the MED_e_ori_ngtdm_Complexity feature, where the Complexity feature is calculated as:
[Formula image for the Complexity feature is not reproduced in this text.]
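The commonly used NGTDM Complexity definition is sketched below in the N_j, P_j, S_j notation of the NGTDM rows. It is an assumption that the missing formula image follows it, and N_v (the number of pixels that have a valid neighbourhood) is a symbol introduced here for illustration only.

```latex
\mathrm{Complexity}
  = \frac{1}{N_v} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \left| N_i - N_j \right| \frac{P_i S_i + P_j S_j}{P_i + P_j},
\qquad P_i \neq 0,\; P_j \neq 0
```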
- MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean: taking as the center the cell nucleus at which the cell nucleus density (number of local cell nuclei / local area) in the pathological image is at its maximum, a local region of fixed size K×K pixels is cropped; the minimum bounding rectangle of each cell nucleus is cropped from the H channel image of the local region, the GLDM is calculated, the LDLGLE feature is then calculated, and finally the mean of the LDLGLE feature over all cell nuclei in the local region is calculated as the MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean feature;
- MED_smu_osnsci_lbp4_12_range: taking as the center the cell nucleus at which the cell nucleus density (number of local cell nuclei / local area) in the pathological image is at its maximum, a local region of fixed size K×K pixels is cropped; then, centered on each cell nucleus in the O channel image of the local region, a region of fixed size K×K is cropped, and the range of the local binary pattern feature for pattern 12 over all cell nuclei in the cropped regions is calculated as the MED_smu_osnsci_lbp4_12_range feature;
- MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis: taking as the center the cell nucleus at which the cell nucleus density (number of local cell nuclei / local area) in the pathological image is at its minimum, a local region of fixed size K×K pixels is cropped, and the Gray Level Size Zone Matrix (GLSZM) of the local region in the E channel is calculated, where the value P(i, j) in row i and column j of the GLSZM is the number of connected regions that have gray level i and size j. Finally, the LAHGLE (LargeAreaHighGrayLevelEmphasis) feature is calculated from the GLSZM as the MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis feature, where the LAHGLE feature is calculated as:
[Formula image for the LAHGLE feature is not reproduced in this text.]
where N_g is the number of gray levels in the image (number of GLSZM rows), N_d is the number of distinct zone sizes in the image (number of GLSZM columns), and N_z is the number of zones in the image (sum of the GLSZM).
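For reference, the widely used definition of this GLSZM feature is shown below, with i indexing gray levels (rows), j indexing zone sizes (columns) and indices starting at 1. Treating it as the content of the missing formula image is an assumption.

```latex
\mathrm{LAHGLE}
  = \frac{1}{N_z} \sum_{i=1}^{N_g} \sum_{j=1}^{N_d} P(i,j)\, i^{2} j^{2},
\qquad
N_z = \sum_{i=1}^{N_g} \sum_{j=1}^{N_d} P(i,j)
```

The feature is dominated by large connected regions of high gray level.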
The step 1 of sorting out medically significant variables in the clinical information specifically includes:
- the cancer TNM staging index divides the patient's cancer into five stages, Stage 0 to Stage IV, according to the size of the primary tumor, the degree of spread to regional lymph nodes, and whether distant metastasis has occurred; each stage can be subdivided further;
- the biopsy tissue collection methods include lobectomy and segmentectomy (segmental resection);
- the patient ethnicity information includes Asian, Caucasian, African/Black and Indian.
The step 2 specifically comprises:
step 2.1: missing values are handled separately for the clinical information and the pathological image features: variables with many missing values are deleted (for example, missing in more than 5 samples), samples with many missing values are deleted (for example, more than 20% of values missing), the missing values of each continuous variable are then filled with the mean, and the missing values of each discrete variable are filled by sampling from its non-missing values;
step 2.2: the discrete multi-valued categorical variables obtained after missing-value handling are converted into dummy variables;
step 2.3: obviously unreasonable variables are removed from the data obtained in step 2.2, including variables with zero variance and discrete variables whose categories are severely imbalanced;
step 2.4: the continuous variables of the data obtained in step 2.3 are normalized.
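Steps 2.1 to 2.4 can be illustrated with a few small helpers. This is a sketch under stated assumptions: the helper names are hypothetical, missing continuous values are represented as NaN and missing discrete values as None, and min-max scaling is used for normalization because the text does not say which normalization is intended.

```python
import numpy as np

def impute_mean(column):
    """Step 2.1 (continuous): replace NaN with the mean of the observed values."""
    column = np.asarray(column, dtype=float)
    filled = column.copy()
    filled[np.isnan(column)] = np.nanmean(column)
    return filled

def impute_by_sampling(column, rng):
    """Step 2.1 (discrete): replace None by sampling from the observed values,
    which keeps the category proportions roughly unchanged."""
    observed = [v for v in column if v is not None]
    return [v if v is not None else observed[rng.integers(len(observed))]
            for v in column]

def to_dummies(column):
    """Step 2.2: one 0/1 indicator array per category."""
    return {cat: np.array([1 if v == cat else 0 for v in column])
            for cat in sorted(set(column))}

def is_zero_variance(column):
    """Step 2.3: flag variables that carry no information."""
    return bool(np.var(np.asarray(column, dtype=float)) == 0.0)

def min_max_normalize(column):
    """Step 2.4: scale to [0, 1]; min-max scaling is an assumption here."""
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    if span == 0:
        return np.zeros_like(column)
    return (column - column.min()) / span
```

A typical call order would be imputation, then `to_dummies` for categorical columns, then dropping columns flagged by `is_zero_variance`, then `min_max_normalize` on what remains.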
After the data processing according to the step 2, the clinical information variables specifically include:
- staged i: 1 if the TNM stage is Stage I, otherwise 0;
- specific_collection_method_name.lobectomy: 1 if the biopsy sample was collected by lobectomy, otherwise 0;
- asian: 1 if the patient's ethnicity is Asian, otherwise 0;
- relative_family_cancer_history.yes: 1 if there is a family history of hereditary disease, otherwise 0;
- patholog_stage.stagei: 1 if the TNM stage is Stage II, otherwise 0;
for a total of 5 clinical information variables.
The interaction in step 3 is calculated by taking the image features and the clinical information as basic variables and computing the product of any two variables as the interaction between them; the resulting interaction matrix is the input data of the survival random forest model.
The step 4 specifically includes:
step 4.1: setting a survival random forest parameter variation range;
step 4.2: for each group of parameters, the input data are randomly divided into k parts such that the proportion of censored events in each part is approximately the same as the overall proportion; each part is taken in turn as the validation set while the remaining k-1 parts are input to the survival random forest model as the training set, and the C-index is calculated from the predicted and true values on the validation set. The basic principles of the random forest model and of permutation feature importance are as follows:
- randomly draw N_i samples with replacement from the N samples, and use these N_i samples to build a decision tree;
- when building the decision tree, randomly select M_j attributes from the M attributes, choose one of these M_j attributes as the splitting attribute of the node according to some splitting criterion (such as the Gini coefficient), and split the data; continue building the tree until a stopping condition is reached (for example, no further split is possible or the number of splits reaches 10);
- repeat the two steps above, and build the random forest from the large number of decision trees obtained;
- while building the decision trees, permute the observed values of a given feature and compare the resulting prediction accuracy with the original accuracy; the difference is the permutation importance of that feature for the tree, and the mean of these importances over all decision trees is the importance of the feature; repeating this process for every feature gives the importance of all features;
step 4.3: according to the C-index values, select the group of survival random forest parameters with the better prediction performance, and fit all the input data with it to build the final random forest prediction model.
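The C-index used for model selection can be computed as sketched below (Harrell's formulation for right-censored data). The function name is hypothetical, and the sketch assumes the model output is a risk score where a higher score means earlier recurrence; if the model predicts survival time instead, the scores would need to be negated first.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs whose predicted order matches the observed order.

    times: observed follow-up times
    events: 1 if recurrence was observed, 0 if the observation was censored
    risk_scores: higher score = higher predicted risk (assumed convention)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in prediction count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of the validation-set patients.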
The step 5 specifically includes:
step 5.1: calculating a recurrence risk classification index; two kinds of index are included:
- sorting the input features of the random forest prediction model constructed in step 4 by permutation importance, and selecting the features with the highest importance as recurrence risk classification indexes;
- applying the random forest prediction model constructed in step 4 to the input data to be classified, and taking the predicted values as the recurrence risk classification index;
step 5.2: if the recurrence risk classification index is a discrete variable, the patients are classified according to each discrete value; if it is a continuous variable, all patients are divided into two groups according to whether the index is greater than the median (or mean).
The advantages of the invention are as follows. It provides an accurate liver cancer recurrence risk prediction method that mines in depth the key factors related to recurrence risk in pathological images, and its prediction performance is significantly better than prediction using clinical information alone. Compared with recurrence risk prediction made by doctors on the basis of medical knowledge and experience, the method is more stable. The specific embodiment shows that a good prediction result can be obtained with only 10 pathological image features and 5 items of clinical information, which greatly reduces the computational burden and allows the method to be applied widely. In addition, the recurrence risk classification index calculated by the invention can distinguish patients with higher recurrence risk from those with lower recurrence risk, which is of great value for medical research and application.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 compares the C-index of the model used in the present invention with that of other models. From left to right along the abscissa, the box plots are: a prediction model using 64128 pathological image features and 13 items of clinical information (imcln); a prediction model using 64128 pathological image features (im); a prediction model using 13 items of clinical information (cln); a prediction model using the 5 screened items of clinical information (cln.s); a prediction model using the 10 screened pathological image features (im.s); a prediction model using the 5 screened items of clinical information and the 10 screened pathological image features (imcln.s); a prediction model using the 5 screened items of clinical information, the 10 screened pathological image features and the interactions among all variables (imcln.smi); and a prediction model that builds 15 variables from the 5 screened items of clinical information and the 10 screened pathological image features and uses only the interactions among them, excluding the 15 variables themselves (imcln.si).
FIG. 3 is a Kaplan-Meier curve for high and low risk of recurrence patients classified using the risk of recurrence classification index of the present invention. The abscissa axis of the graph represents the survival time of the patient, the ordinate axis represents the survival rate of the patient, the horizontal and vertical dashed lines are used to show median survival difference between the two patient populations, and p <0.0001 represents the log rank test p-value of the survival distribution between the patient subpopulations of less than 0.0001.
Detailed Description
The invention provides a method for predicting recurrence risk of liver cancer by combining pathological image characteristics and clinical information, and technical characteristics and advantages of the invention are described in the following by combining figures and embodiments.
The data of the embodiment come from the public database TCGA-LIHC, and the code is implemented in Python 3.7 and R 3.6. The specific implementation is shown in FIG. 1. The liver cancer recurrence risk prediction method combining pathological image features and clinical information provided by the invention comprises the following steps:
step 1: extracting image features from the pathological image, and collating the medically significant variables in the clinical information;
step 2: data processing, including missing-value handling, dummy variable setting, removal of obviously unreasonable variables, and normalization;
step 3: combining the pathological image features and the clinical information, and calculating the interactions between them as input data;
step 4: selecting a better survival random forest model by cross-validation, using the C-index as the efficacy evaluation index;
step 5: classifying patients into recurrence risk subgroups using an index extracted from the survival random forest model, calculating the survival function of each subgroup, drawing Kaplan-Meier curves, and fitting a Cox proportional hazards model to evaluate the classification index.
The pathological image features extracted in the step 1 specifically include:
WSI_snu_osmerci_ngtdm_Strength_range,
MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,
MND_smu_osnsci_single_fractal_dim_mean,
WSI_snu_esmerci_lbp3_5_disorder,
MED_smu_esmerci_ngtdm_Busyness_range,
WSI_th_ori_firstorder_Range,
MED_e_ori_ngtdm_Complexity,
MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,
MED_smu_osnsci_lbp4_12_range,
MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis,
there are 10 pathological image features.
The clinical treatment information collated in the step 1 comprises a cancer TNM staging index, a biopsy tissue collection method and patient ethnicity information.
In the step 2, the data processing is performed on the pathological image features and the clinical information respectively, and the data processing specifically comprises the following steps:
step 2.1: handling missing values, including deleting features with more than 5 missing values, deleting samples in which more than 50% of the values are missing (the remaining samples have no such loss), filling the missing values of continuous features with the mean, and filling the missing values of discrete features by sampling from the non-missing values (the proportions of the non-missing discrete values must be preserved);
step 2.2: setting dummy variables for discrete features: if the discrete values of a feature are independent of one another, each discrete value is set as a separate dummy variable; if the discrete values of a feature are related to one another, the dummy variables are set according to that relationship; if the values of a feature are imbalanced, merging dummy variables is considered;
step 2.3: deleting discrete features with imbalanced data (for example, A : B = 100 : 1) and deleting continuous features with zero variance;
step 2.4: normalizing each continuous feature;
after data processing, 10 pathological image characteristics and 5 clinical information variables are obtained.
The interaction in step 3 refers to the data obtained by multiplying the 15 variables obtained in step 2 with one another in pairs, giving 105 interaction terms in total.
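The count of 105 is consistent with taking the product of every unordered pair of distinct variables, since 15 choose 2 is 105. A minimal sketch under that reading follows; the function name and the `a*b` naming of the new columns are hypothetical.

```python
from itertools import combinations

def pairwise_interactions(rows, names):
    """Return the products of every unordered pair of distinct variables.

    rows: list of samples, each a list of variable values in the order of `names`
    names: list of variable names
    """
    pairs = list(combinations(range(len(names)), 2))
    new_names = [f"{names[a]}*{names[b]}" for a, b in pairs]
    new_rows = [[row[a] * row[b] for a, b in pairs] for row in rows]
    return new_rows, new_names
```

With 15 input variables this yields 105 columns per patient, which would form the interaction matrix passed to the survival random forest.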
The step 4 specifically includes:
step 4.1: setting a survival random forest parameter variation range;
step 4.2: for each group of parameters, randomly dividing the input data into 3 parts such that the proportion of censored events in each part is approximately the same as in the whole data set, taking each part in turn as the validation set, inputting the remaining 2 parts as the training set into the survival random forest model, and calculating the C-index from the predicted values and true values of the validation set;
step 4.3: according to the mean of the 3-fold cross-validation C-index obtained in step 4.2, selecting the group of survival random forest parameters with the better prediction effect, and fitting all input data to construct the final survival random forest prediction model.
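The cross-validation loop of steps 4.2 and 4.3 can be sketched without committing to a particular survival random forest library. In the sketch below, `fit_predict` is a hypothetical callback that trains on the training indices and returns one risk score per validation sample; Harrell's concordance index is assumed as the C-index, and all function names are illustrative.

```python
import random


def stratified_folds(events, k, seed=0):
    """Split sample indices into k folds so that each fold holds roughly the
    same share of censored samples (event == 0) as the whole data set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for flag in (0, 1):
        idx = [i for i, e in enumerate(events) if e == flag]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def c_index(times, events, risk):
    """Harrell's C-index: share of comparable pairs in which the sample with
    the earlier observed event has the higher predicted risk (ties count 0.5)."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # the earlier member of a pair must be an observed event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


def cv_c_index(times, events, fit_predict, k=3, seed=0):
    """Mean validation C-index over k stratified folds for one parameter set."""
    scores = []
    for fold in stratified_folds(events, k, seed):
        held_out = set(fold)
        train = [i for i in range(len(times)) if i not in held_out]
        risk = fit_predict(train, fold)  # hypothetical model callback
        scores.append(c_index([times[i] for i in fold],
                              [events[i] for i in fold], risk))
    return sum(scores) / k


# Toy check with an "oracle" that ranks patients perfectly by event time.
times = list(range(1, 19))
events = [0] * 6 + [1] * 12
oracle = lambda train, valid: [-times[i] for i in valid]
mean_c = cv_c_index(times, events, oracle, k=3)
```

In practice `cv_c_index` would be evaluated once per parameter group, and the group with the highest mean C-index would then be refitted on all input data, as step 4.3 describes.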
The step 5 specifically includes:
step 5.1: predicting all input data with the survival random forest model constructed in step 4, and taking the predicted value as the recurrence risk classification index of each patient;
step 5.2: dividing the patients into two subgroups according to whether each patient's recurrence risk classification index is larger than the median;
step 5.3: calculating the survival functions of the two subgroups, drawing the Kaplan-Meier curve of each subgroup from its survival function, and testing the classification result with the p-value of the log-rank test;
step 5.4: fitting a Cox proportional hazards model of patient survival time on the recurrence risk classification index, and determining the effectiveness of the recurrence risk classification index from the hypothesis test results of the model and of its coefficient.
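Steps 5.2 and 5.3 can be illustrated with a small standard-library sketch: a median split, a Kaplan-Meier estimate, and a two-group log-rank test. The function names and the toy data are assumptions made for illustration; the Cox model of step 5.4 is not covered here and would normally be fitted with a dedicated survival analysis package.

```python
from math import erfc, sqrt
from statistics import median


def kaplan_meier(times, events):
    """Return [(t, S(t))] evaluated at each distinct observed event time."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - d / at_risk
        curve.append((t, surv))
    return curve


def logrank_p(times, events, group):
    """Two-group log-rank test; returns the chi-square (1 d.f.) p-value."""
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                       # at risk, all
        n1 = sum(1 for u, g in zip(times, group) if u >= t and g)  # at risk, group 1
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d1 = sum(1 for u, e, g in zip(times, events, group)
                 if u == t and e and g)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 d.f.


# Toy data, invented for illustration: higher risk index, earlier recurrence.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
risk_index = [8, 7, 6, 5, 4, 3, 2, 1]

cut = median(risk_index)
high = [r > cut for r in risk_index]  # step 5.2: median split
km_high = kaplan_meier([t for t, h in zip(times, high) if h],
                       [e for e, h in zip(events, high) if h])
p_value = logrank_p(times, events, high)
```

A small `p_value` indicates that the recurrence-free survival distributions of the high-risk and low-risk subgroups differ, which is the criterion step 5.3 applies.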
In the embodiment of the invention, the mean C-index when predicting with the interactions (imcln.si) is 0.765, the mean C-index when predicting with clinical information only (cln.s) is 0.690, and the p-value of the t-test between these two sets of C-index values is 1.519e-13; the mean C-index when predicting with pathological image features only (im.s) is 0.707, and the p-value of the t-test between these C-index values and those obtained with the interactions is 4.712e-8; the p-value of the t-test between the C-index obtained with pathological image features only (im.s) and the C-index obtained with clinical information only (cln.s) is 0.026. The results of this embodiment show not only that pathological image features significantly improve prediction accuracy relative to clinical information alone, but also that the interaction between clinical information and pathological image features plays an important role in improving the accuracy of recurrence risk prediction.
In the embodiment of the invention, the p-value of the log-rank test on the recurrence-free survival distributions of the two subgroups classified by the recurrence risk classification index is less than 0.0001, which indicates that the index is valid and reliable for separating patients with high and low recurrence risk; in addition, when a Cox proportional hazards model is fitted with the recurrence risk classification index as the only continuous independent variable, the p-value of the model's log-rank test is 1e-13 (the null hypothesis of this test is that all coefficients are 0, and it tests whether the independent variable has a significant influence on the model's prediction), and the p-value of the Wald test of the coefficient is 2.26e-9, so the recurrence risk classification index is significantly associated with patient survival time, which further demonstrates its effectiveness.
The above examples are only intended to illustrate the technical solution of the invention and to confirm the effectiveness and superiority of the proposed method, without limiting it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information, characterized in that the method comprises the following steps:
step 1: extracting image features of the pathological image, and collating the medically meaningful variables in the clinical information;
step 2: data processing, including missing value processing, dummy variable setting, removal of obviously unreasonable variables, and normalization;
step 3: combining the pathological image features and the clinical information, and calculating the interactions between them as input data;
step 4: selecting a better survival random forest model through cross validation, using the C-index as the performance evaluation index;
step 5: generating a recurrence risk classification index with the survival random forest model, and classifying patients into recurrence risk subgroups.
2. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the image features of the pathological image in step 1 comprise:
WSI_snu_osmerci_ngtdm_Strength_range,
MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,
MND_smu_osnsci_single_fractal_dim_mean,
WSI_snu_esmerci_lbp3_5_disorder,
MED_smu_esmerci_ngtdm_Busyness_range,
WSI_th_ori_firstorder_Range,
MED_e_ori_ngtdm_Complexity,
MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,
MED_smu_osnsci_lbp4_12_range,
MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis,
for a total of 10 image features, the calculation of which is described in detail in the specification.
3. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the clinical information variables collated in step 1 comprise the cancer TNM staging index, the biopsy tissue collection method, and patient ethnicity information.
4. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that step 2 is performed separately on the pathological image features and the clinical information variables, wherein the clinical information variables after data processing comprise:
staged i: 1 if the TNM stage is Stage I, otherwise 0;
specific_collection_method_name.lobectomy: 1 if the biopsy sample was collected by lobectomy, otherwise 0;
asian: 1 if the patient's ethnicity is Asian, otherwise 0;
relative_family_list_history.yes: 1 if there is a family history of hereditary disease, otherwise 0;
staged II: 1 if the TNM stage is Stage II, otherwise 0;
for a total of 5 clinical information variables.
5. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the interactions in step 3 are new variables obtained by multiplying the 15 variables, consisting of the clinical information variables and the pathological image features, with each other in pairs, and the number of interaction variables is 105.
6. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the specific steps of selecting a better model through cross validation in step 4 are as follows:
step 4.1: setting the variation range of the survival random forest parameters;
step 4.2: for each group of parameters, randomly dividing the input data into k parts such that the proportion of censored events in each part is approximately the same as in the whole data set, taking each part in turn as the validation set, inputting the remaining k-1 parts as the training set into the survival random forest model, and calculating the C-index from the predicted values and true values of the validation set;
step 4.3: selecting the group of survival random forest model parameters with the better prediction effect according to the C-index value, fitting all input data to construct the final random forest prediction model, and predicting the recurrence risk of new patients with this random forest model.
7. The liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information according to claim 1, characterized in that the recurrence risk classification index in step 5 includes two kinds of index: sorting the input features of the random forest prediction model constructed in step 4 by permutation importance and selecting the feature with the highest importance as a recurrence risk classification index; and predicting on the independent variables of the input data to be classified with the random forest prediction model constructed in step 4, and taking the predicted value as a recurrence risk classification index.
8. The method according to claim 7, characterized in that patients are classified by the recurrence risk classification index as follows: if the recurrence risk classification index is a discrete variable, the patients are classified according to each discrete value of the index; if the recurrence risk classification index is a continuous variable, the patients are classified into high and low recurrence risk subgroups according to whether the index is greater than the median or the mean.
CN201911265751.5A 2019-12-11 2019-12-11 Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information Pending CN110993106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911265751.5A CN110993106A (en) 2019-12-11 2019-12-11 Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911265751.5A CN110993106A (en) 2019-12-11 2019-12-11 Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information

Publications (1)

Publication Number Publication Date
CN110993106A true CN110993106A (en) 2020-04-10

Family

ID=70092314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911265751.5A Pending CN110993106A (en) 2019-12-11 2019-12-11 Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information

Country Status (1)

Country Link
CN (1) CN110993106A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355208A (en) * 2016-08-31 2017-01-25 广州精点计算机科技有限公司 Data prediction analysis method based on COX model and random survival forest
CN106815481A (en) * 2017-01-19 2017-06-09 中国科学院深圳先进技术研究院 A kind of life cycle Forecasting Methodology and device based on image group
CN109642258A (en) * 2018-10-17 2019-04-16 上海允英医疗科技有限公司 A kind of method and system of tumor prognosis prediction
CN110111892A (en) * 2019-04-29 2019-08-09 杭州电子科技大学 A kind of postoperative short-term relapse and metastasis risk evaluating system of NSCLC patient
WO2019224044A1 (en) * 2018-05-22 2019-11-28 Koninklijke Philips N.V. Performing a prognostic evaluation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
崔海波 等: "基于随机生存森林与网络拓扑信息的食管癌风险预测" *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111554402A (en) * 2020-04-24 2020-08-18 山东省立医院 Machine learning-based method and system for predicting postoperative recurrence risk of primary liver cancer
CN111784637A (en) * 2020-06-04 2020-10-16 复旦大学附属中山医院 Prognostic characteristic visualization method, system, equipment and storage medium
CN112768060A (en) * 2020-07-14 2021-05-07 福州宜星大数据产业投资有限公司 Liver cancer postoperative recurrence prediction method based on random survival forest and storage medium
CN111985584A (en) * 2020-09-30 2020-11-24 平安科技(深圳)有限公司 Disease auxiliary detection equipment, method, device and medium based on multi-mode data
CN112309571B (en) * 2020-10-30 2022-04-15 电子科技大学 Screening method of prognosis quantitative characteristics of digital pathological image
CN112309571A (en) * 2020-10-30 2021-02-02 电子科技大学 Screening method of prognosis quantitative characteristics of digital pathological image
CN112562855B (en) * 2020-12-18 2021-11-02 深圳大学 Hepatocellular carcinoma postoperative early recurrence risk prediction method, medium and terminal equipment
CN112562855A (en) * 2020-12-18 2021-03-26 深圳大学 Hepatocellular carcinoma postoperative early recurrence risk prediction method
CN112908470A (en) * 2021-02-08 2021-06-04 深圳市人民医院 Hepatocellular carcinoma prognosis scoring system based on RNA binding protein gene and application thereof
CN112908470B (en) * 2021-02-08 2023-10-03 深圳市人民医院 Hepatocellular carcinoma prognosis scoring system based on RNA binding protein gene and application thereof
CN112991320A (en) * 2021-04-07 2021-06-18 德州市人民医院 System and method for predicting hematoma expansion risk of cerebral hemorrhage patient
CN113180633A (en) * 2021-04-28 2021-07-30 济南大学 MR image liver cancer postoperative recurrence risk prediction method and system based on deep learning
CN113850753B (en) * 2021-08-17 2023-09-01 苏州鸿熙融合智能医疗科技有限公司 Medical image information computing method, device, edge computing equipment and storage medium
CN113850753A (en) * 2021-08-17 2021-12-28 苏州鸿熙融合智能医疗科技有限公司 Medical image information calculation method and device, edge calculation equipment and storage medium
CN113724876A (en) * 2021-09-10 2021-11-30 南昌大学第二附属医院 Intra-stroke hospital complication prediction model based on multi-mode fusion and DFS-LLE algorithm
CN113808747A (en) * 2021-10-11 2021-12-17 南昌大学第二附属医院 Ischemic stroke recurrence prediction method
CN113808747B (en) * 2021-10-11 2023-12-26 南昌大学第二附属医院 Ischemic cerebral apoplexy recurrence prediction method
CN114037774A (en) * 2022-01-10 2022-02-11 雅安市人民医院 Method and device for sequencing and transmitting images of cross sections of cranium and brain and storage medium
CN114037774B (en) * 2022-01-10 2022-03-08 雅安市人民医院 Method and device for sequencing and transmitting images of cross sections of cranium and brain and storage medium
CN114549896A (en) * 2022-01-24 2022-05-27 清华大学 Heterogeneous high-order representation method and device for full-view image for survival prediction
CN114549896B (en) * 2022-01-24 2024-08-16 清华大学 Heterogeneous high-order representation method and device for full-field image for survival prediction
CN118645251A (en) * 2024-08-16 2024-09-13 上海孪心医疗科技有限公司 Risk stratification method and system for prognosis of heart failure and atrial fibrillation and catheter ablation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Yang Anli

Inventor after: Deng Feiwen

Inventor after: Hua Rui

Inventor after: Zhang Youlong

Inventor after: Li Jialu

Inventor before: Hua Rui

Inventor before: Zhang Youlong

Inventor before: Li Jialu

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200410