CN110993106A

CN110993106A - Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information

Info

Publication number: CN110993106A
Application number: CN201911265751.5A
Authority: CN
Inventors: 华芮; 张游龙; 李嘉路
Original assignee: Shenzhen Huajia Biological Intelligence Technology Co ltd
Current assignee: Shenzhen Huajia Biological Intelligence Technology Co ltd
Priority date: 2019-12-11
Filing date: 2019-12-11
Publication date: 2020-04-10

Abstract

The invention discloses a method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information, belonging to the technical field of constructing postoperative cancer recurrence risk prediction models. The method takes the patient's clinical information and the pathological image features of the patient's tumor region, extracted with image processing techniques, as basic variables, computes the interactions among these basic variables as input data, fits a survival random forest model, and accurately predicts the patient's survival time. The results of the embodiment show that the cross-validated C-index of the proposed model is higher than that obtained when predicting with pathological image features alone or with clinical information alone, so the accuracy of postoperative recurrence risk prediction for liver cancer is significantly improved. In addition, the invention provides a classification index of postoperative recurrence risk that divides patients into a higher-risk subgroup and a lower-risk subgroup, which helps doctors formulate a targeted treatment plan for each patient.

Description

Liver cancer postoperative recurrence risk prediction method combining pathological image and clinical information

Technical Field

The invention belongs to the technical field of constructing postoperative cancer recurrence risk prediction models, and particularly relates to a liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information. Specifically, the method is based on a survival random forest model and combines pathological image features with clinical parameter features to accurately predict the postoperative recurrence risk of liver cancer patients.

Background

Cancer is a health problem shared by all of humanity. According to 2019 statistics from the National Cancer Center, liver cancer accounted for 9.42% of all cancer cases in 2015, ranking fourth in incidence among malignant tumors, and for 13.94% of all cancer deaths, ranking second in mortality among malignant tumors. Liver cancer therefore has both high incidence and high mortality and seriously threatens public health. The high risk of postoperative recurrence is one of the main reasons for the high mortality of liver cancer patients. If the postoperative recurrence risk can be predicted well, doctors can formulate a targeted treatment plan for each patient, which is of great significance for postoperative treatment and prognosis.

With the progress of science and technology, artificial intelligence has developed rapidly in recent years, and the use of artificial intelligence algorithms for recurrence risk prediction is gradually gaining ground. Random forest is a common machine learning method that achieves high prediction accuracy while also performing feature screening. Survival analysis, in turn, is a family of methods designed for survival time data that contain censoring or truncation. A survival random forest model, built by combining survival analysis with the random forest model, can analyze survival data while making full use of the favorable properties of random forests.

In the clinical management of liver cancer, pathological images are the diagnostic gold standard. The shape, color and texture of cells in a pathological image, as well as the morphology of specific tissue structures, are usually related to the occurrence or progression of disease, so the rich information contained in pathological images has the potential to improve recurrence risk prediction. However, judging recurrence risk only by manually inspecting pathological images is unreliable because of subjective factors and the limits of the naked eye. A stable method that mines pathological image information in depth to predict recurrence risk is therefore needed.

Against this background, the invention provides a method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information. The invention automatically extracts image feature information from H&E-stained histopathological images of liver cancer using image processing techniques, and combines it with the patient's routine clinical examination information and the survival random forest method to grade the patient's postoperative recurrence risk more accurately.

Disclosure of Invention

The invention aims to provide a liver cancer postoperative recurrence risk prediction method combining pathological images and clinical information, characterized in that: the input data of the method comprise pathological image features extracted with image processing techniques and clinical information of the patient such as cancer TNM staging; the prediction model framework of the method is based on a survival random forest model, which, unlike the traditional random forest model that can only handle regression and classification problems, is designed specifically to handle survival time data; the efficacy evaluation index of the method is the C-index, which estimates the probability that the model's prediction is concordant with the actually observed outcome and is commonly used in statistical analysis to evaluate the discriminative ability and consistency of a prognostic model; the method specifically comprises the following steps:

step 1: extracting image features from the pathological image, and sorting out the medically significant variables in the clinical information;

step 2: data processing, including missing value processing, dummy variable setting, removal of obviously unreasonable variables, and normalization;

step 3: combining the pathological image features and the clinical information, and calculating the interactions between them as input data;

step 4: selecting a better survival random forest model through cross-validation, using the C-index as the efficacy evaluation index;

step 5: generating a recurrence risk classification index with the survival random forest model, and classifying the patients into recurrence risk subgroups.

The image features of the pathological image extracted in the step 1 specifically include the following features:

- WSI_snu_osmerci_ngtdm_Strength_range: the H&E-stained pathological image is deconvolved from the RGB color space to the HEO color space, the minimum bounding rectangle of each cell nucleus is cropped from the O channel image, and a Neighbouring Gray Tone Difference Matrix (NGTDM) is calculated. The NGTDM is calculated as follows:

let the cropped region contain n gray levels N_1, N_2, …, N_n; the NGTDM is then an n×4 matrix whose j-th row is [N_j, F_j, P_j, S_j], where N_j is the value of the j-th gray level, F_j is the number of occurrences of N_j, P_j is the relative frequency of N_j, and S_j is the sum, over all pixels with gray level N_j, of the absolute difference between N_j and the mean gray value of the pixel's neighbourhood,

the Strength feature is then computed from the NGTDM:

finally, the range (maximum minus minimum) of the Strength features of all cell nuclei is calculated as the WSI_snu_osmerci_ngtdm_Strength_range feature;
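The NGTDM construction described above can be sketched in plain Python. This is a minimal illustration rather than the patented implementation: the 8-connected neighbourhood, the function names `ngtdm` and `strength`, and the Strength formula itself are assumptions. The formula image is not present in this text, so the sketch uses the standard NGTDM definition, Strength = Σ_i Σ_j (P_i + P_j)(N_i - N_j)² / Σ_i S_i.

```python
from collections import defaultdict

def ngtdm(image):
    """Return rows [N_j, F_j, P_j, S_j] for a 2-D list of integer gray levels.

    Assumption: the neighbourhood is the 8-connected ring around each pixel.
    """
    h, w = len(image), len(image[0])
    count = defaultdict(int)    # F_j: number of pixels with gray level N_j
    diff = defaultdict(float)   # S_j: summed |N_j - neighbourhood mean|
    for r in range(h):
        for c in range(w):
            neigh = [image[rr][cc]
                     for rr in range(max(r - 1, 0), min(r + 2, h))
                     for cc in range(max(c - 1, 0), min(c + 2, w))
                     if (rr, cc) != (r, c)]
            if not neigh:       # a 1x1 region has no neighbourhood
                continue
            g = image[r][c]
            count[g] += 1
            diff[g] += abs(g - sum(neigh) / len(neigh))
    total = sum(count.values())
    return [[g, count[g], count[g] / total, diff[g]] for g in sorted(count)]

def strength(matrix):
    """Strength under the assumed standard NGTDM definition (0 if all S_j are 0)."""
    num = sum((pi + pj) * (gi - gj) ** 2
              for gi, _, pi, _ in matrix for gj, _, pj, _ in matrix)
    den = sum(row[3] for row in matrix)
    return num / den if den else 0.0

# Range of the per-nucleus Strength values (toy crops standing in for nuclei).
crops = [[[1, 1], [1, 2]], [[5, 5], [5, 5]]]
values = [strength(ngtdm(crop)) for crop in crops]
strength_range = max(values) - min(values)
```

The per-nucleus crops here are toy arrays; in the method they would be the bounding rectangles cut from the O channel.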

- MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean: taking the cell nucleus with the minimum nuclear density (number of local nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped; the minimum bounding rectangle of each cell nucleus is then cropped from the H channel image of the local region, and a Gray Level Dependence Matrix (GLDM) is calculated. The GLDM is calculated as follows:

let the gray level of pixel m be g_m and the gray level of pixel n be g_n; when the distance between m and n is less than d and |g_m - g_n| ≤ α, pixel n is called a gray-level-dependent pixel of m; the number of gray-level-dependent pixels of each pixel is counted; let P(i, j) be the value in row i and column j (counting from 0) of the GLDM, meaning that among all pixels with gray level i there are P(i, j) pixels that have j gray-level-dependent pixels,

the LDLGLE (Large Dependence Low Gray Level Emphasis) feature is then calculated from the GLDM:

where N_g is the number of gray levels in the image (number of GLDM rows), N_d is the number of distinct dependence counts in the image (number of GLDM columns), and N_z is the total number of dependence entries in the image (sum of the GLDM),

finally, the mean of the LDLGLE features of all cell nuclei in the local region is calculated as the MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean feature;
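A rough sketch of the GLDM counting rule and of LDLGLE follows. Several points are assumptions rather than statements from the specification: the distance is taken as Chebyshev distance, gray levels are assumed to be positive integers (so that dividing by i² is defined), and LDLGLE is taken to be the standard definition (1/N_z) Σ_i Σ_j P(i, j)·j²/i², since the formula image is missing from this text. The names `gldm` and `ldlgle` are hypothetical.

```python
from collections import Counter

def gldm(image, d=2, alpha=0):
    """Map (gray level i, dependence count j) -> number of pixels P(i, j).

    A neighbour n depends on m when its Chebyshev distance to m is
    strictly less than d and |g_m - g_n| <= alpha.
    """
    h, w = len(image), len(image[0])
    counts = Counter()
    for r in range(h):
        for c in range(w):
            g = image[r][c]
            dep = 0
            for rr in range(max(r - d + 1, 0), min(r + d, h)):
                for cc in range(max(c - d + 1, 0), min(c + d, w)):
                    if (rr, cc) != (r, c) and abs(g - image[rr][cc]) <= alpha:
                        dep += 1
            counts[(g, dep)] += 1
    return counts

def ldlgle(counts):
    """Assumed definition: (1 / N_z) * sum of P(i, j) * j^2 / i^2."""
    n_z = sum(counts.values())
    return sum(p * j * j / (i * i) for (i, j), p in counts.items()) / n_z

toy = [[1, 1], [2, 2]]
value = ldlgle(gldm(toy))
```

Implementations that count the centre pixel in its own dependence size would shift every j by one, so the numeric values depend on that convention.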

- MND_smu_osnsci_single_fractal_dim_mean: taking the cell nucleus with the minimum nuclear density (number of local nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped; a fixed-size K×K region is then cropped around the center of each cell nucleus in the O channel image of the local region, and the fractal dimension of that region is calculated. The fractal dimension is calculated as follows:

set the fractal sizes S_1, S_2, …, S_n; for fractal size S_j, divide the image into small blocks of S_j × S_j pixels, compute the range of pixel values within each block, and compute the mean Nr_j of all these ranges; then construct a linear regression model of Nr on S, the coefficient of which is the fractal dimension of the image,

finally, the mean of the fractal dimension features of all cell nuclei in the local region is calculated as the MND_smu_osnsci_single_fractal_dim_mean feature;
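The block-range regression above can be illustrated as follows. The sketch takes the description literally (ordinary least squares of Nr on S, slope as the result); skipping incomplete blocks at the image border is an assumption, and the function names are hypothetical. Some fractal estimators regress on a log-log scale instead, which the text does not state.

```python
def mean_block_range(image, s):
    """Mean of (max - min) over all complete s x s blocks of a 2-D list."""
    h, w = len(image), len(image[0])
    ranges = []
    for r in range(0, h - s + 1, s):
        for c in range(0, w - s + 1, s):
            block = [image[rr][cc]
                     for rr in range(r, r + s) for cc in range(c, c + s)]
            ranges.append(max(block) - min(block))
    return sum(ranges) / len(ranges)

def regression_slope(xs, ys):
    """Slope of the ordinary least squares line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def fractal_dimension(image, sizes):
    """Regress the mean block range Nr_j on the block size S_j."""
    nr = [mean_block_range(image, s) for s in sizes]
    return regression_slope(sizes, nr)

# Toy 4x4 gradient image: pixel value = row index + column index.
toy = [[r + c for c in range(4)] for r in range(4)]
dim = fractal_dimension(toy, [1, 2, 4])
```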

- WSI_snu_esmerci_lbp3_5_disorder: the H&E-stained pathological image is deconvolved from the RGB color space to the HEO color space, the minimum bounding rectangle of each cell nucleus is cropped from the E channel image, and the statistical features of the Local Binary Pattern (LBP) are calculated. The LBP statistical features are calculated as follows:

if the pixel value g_i of pixel i is greater than or equal to the pixel value g_j of neighborhood pixel j, the local binary value at the position of pixel j is 0; if g_i is less than g_j, the local binary value at the position of pixel j is 1; the local binary values in the neighborhood of pixel i are concatenated into a binary number, which is the local binary pattern of pixel i; uniform and rotation-invariant processing is applied to all binary numbers, and the frequency of the local binary patterns of all pixels in the image is counted as the local binary pattern feature of the image,

finally, the disorder value, over all cell nuclei, of the local binary pattern feature for pattern value 5 is calculated as the WSI_snu_esmerci_lbp3_5_disorder feature, where the disorder value is calculated as follows:

where std is the standard deviation and mean is the mean;
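The thresholding and histogram steps of the LBP feature can be sketched as below. The 8-pixel neighbourhood at radius 1 and the "riu2" mapping (uniform patterns are coded by their number of ones, all others share one extra code) are assumptions; the specification only says that uniform and rotation-invariant processing is applied, and does not state the radius or sampling that "lbp3_5" refers to. The disorder formula is not reproduced because it is missing from this text.

```python
from collections import Counter

# Neighbour offsets in circular order around the centre pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_bits(image, r, c):
    """0 where the centre is >= the neighbour, 1 where it is smaller."""
    g = image[r][c]
    return [0 if g >= image[r + dr][c + dc] else 1 for dr, dc in OFFSETS]

def riu2_code(bits):
    """Rotation-invariant uniform code (assumed 'riu2' mapping)."""
    n = len(bits)
    transitions = sum(bits[k] != bits[(k + 1) % n] for k in range(n))
    return sum(bits) if transitions <= 2 else n + 1

def lbp_histogram(image):
    """Relative frequency of each code over the interior pixels."""
    h, w = len(image), len(image[0])
    codes = Counter(riu2_code(lbp_bits(image, r, c))
                    for r in range(1, h - 1) for c in range(1, w - 1))
    total = sum(codes.values())
    return {code: n / total for code, n in codes.items()}

edge = [[9, 9, 9], [1, 5, 1], [1, 1, 1]]      # three brighter neighbours in a row
checker = [[9, 1, 9], [1, 5, 1], [9, 1, 9]]   # alternating, non-uniform pattern
```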

- MED_smu_esmerci_ngtdm_Busyness_range: taking the cell nucleus whose nuclear density (number of local nuclei / local area) is at the median in the pathological image as the center, a local region of fixed size K×K pixels is cropped; the minimum bounding rectangle of each cell nucleus is cropped from the H channel image of the local region, the NGTDM is calculated, and the Busyness feature is then calculated:

finally, the range of the Busyness features of all cell nuclei is calculated as the MED_smu_esmerci_ngtdm_Busyness_range feature;

- WSI_th_ori_firstorder_Range: the stained pathological image is downscaled by a certain factor, and the range of pixel values in the H channel is then calculated as the WSI_th_ori_firstorder_Range feature;

- MED_e_ori_ngtdm_Complexity: taking the cell nucleus whose nuclear density (number of local nuclei / local area) is at the median in the pathological image as the center, a local region of fixed size K×K pixels is cropped, the NGTDM of the local region in the E channel is calculated, and the Complexity feature is then calculated as the MED_e_ori_ngtdm_Complexity feature, where the Complexity feature is calculated as follows:

- MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean: taking the cell nucleus with the maximum nuclear density (number of local nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped; the minimum bounding rectangle of each cell nucleus is cropped from the H channel image of the local region, the GLDM is calculated, the LDLGLE feature is then calculated, and finally the mean of the LDLGLE features of all cell nuclei in the local region is calculated as the MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean feature;

- MED_smu_osnsci_lbp4_12_range: taking the cell nucleus with the maximum nuclear density (number of local nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped; a fixed-size K×K region is then cropped around the center of each cell nucleus in the O channel image of the local region, and the range, over all cell nuclei, of the local binary pattern feature for pattern value 12 in the cropped regions is calculated as the MED_smu_osnsci_lbp4_12_range feature;

- MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis: taking the cell nucleus with the minimum nuclear density (number of local nuclei / local area) in the pathological image as the center, a local region of fixed size K×K pixels is cropped, and the Gray Level Size Zone Matrix (GLSZM) of the local region in the E channel is calculated, where the value P(i, j) in row i and column j of the GLSZM means that the number of connected domains with gray level i and size j is P(i, j); finally, the LAHGLE (Large Area High Gray Level Emphasis) feature is calculated from the GLSZM as the MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis feature, where the LAHGLE feature is calculated as follows:

where N_g is the number of gray levels in the image (number of GLSZM rows), N_d is the number of distinct connected-domain sizes in the image (number of GLSZM columns), and N_z is the number of connected domains in the image (sum of the GLSZM).
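The formula images for Busyness, Complexity and LAHGLE are not present in this text. For orientation only, the definitions commonly used in texture analysis for features with these names are given below; whether the patent uses exactly these forms is an assumption. Here p_i and s_i correspond to P_j and S_j of the NGTDM rows, the sums over i and j run over gray levels with non-zero frequency, N_{v,p} (a symbol introduced here) is the number of pixels that have a valid neighbourhood, and P(i, j) in the last line is the GLSZM entry.

```latex
\text{Busyness} = \frac{\sum_{i} p_i s_i}{\sum_{i}\sum_{j} \lvert i\,p_i - j\,p_j \rvert},
\qquad p_i \neq 0,\; p_j \neq 0

\text{Complexity} = \frac{1}{N_{v,p}} \sum_{i}\sum_{j} \lvert i - j \rvert \,
\frac{p_i s_i + p_j s_j}{p_i + p_j},
\qquad p_i \neq 0,\; p_j \neq 0

\text{LAHGLE} = \frac{1}{N_z} \sum_{i}\sum_{j} P(i,j)\, i^{2} j^{2}
```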

The step 1 of sorting out medically significant variables in the clinical information specifically includes:

- the cancer TNM staging index divides the patient's cancer into five stages, Stage 0 to Stage IV, according to the size of the primary tumor, the degree of spread to regional lymph nodes and whether distant metastasis has occurred; each stage can be further subdivided;

- the biopsy tissue collection methods include lobectomy and segmentectomy;

- the patient ethnicity information includes Asian, Caucasian, African Black and Indian.

The step 2 specifically comprises:

step 2.1: missing value processing is performed on the clinical information and the pathological image features separately, including deleting variables with many missing values (for example, more than 5 missing samples), deleting samples with many missing values (for example, more than 20% missing), then filling the missing values of each continuous variable with the mean, and filling the missing values of each discrete variable by sampling from its non-missing values;

step 2.2: converting the multi-valued discrete categorical variables, after missing value processing, into dummy variables;

step 2.3: removing obviously unreasonable variables from the data obtained in step 2.2, including variables with a variance of 0 and discrete variables whose categories are severely imbalanced;

step 2.4: normalizing the continuous variables of the data obtained in step 2.3.
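The data processing steps can be sketched with the standard library alone. This is an illustration under assumptions: missing values are represented by `None`, normalization is min-max scaling (the text does not say which normalization is used), and all function and column names are hypothetical.

```python
import random
from statistics import mean, pvariance

def impute_continuous(values):
    """Step 2.1: replace None with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def impute_discrete(values, rng):
    """Step 2.1: replace None with a value sampled from the observed ones."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def dummy_encode(name, values):
    """Step 2.2: one 0/1 column per category of a discrete variable."""
    return {f"{name}.{cat}": [int(v == cat) for v in values]
            for cat in sorted(set(values))}

def drop_constant(columns):
    """Step 2.3: remove columns whose variance is 0."""
    return {k: v for k, v in columns.items() if pvariance(v) > 0}

def min_max(values):
    """Step 2.4: assumed min-max normalization to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

rng = random.Random(0)
race = impute_discrete(["asian", None, "white"], rng)
columns = drop_constant(dummy_encode("race", race))
feature = min_max(impute_continuous([2.0, None, 6.0]))
```

Sampling from the observed values keeps the category proportions of the non-missing data in expectation, and running `drop_constant` before `min_max` avoids a division by zero on constant columns.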

After the data processing according to the step 2, the clinical information variables specifically include:

- pathologic_stage.StageI: 1 if the TNM stage is Stage I, 0 otherwise;

- specific_collection_method_name.lobectomy: 1 if the biopsy sample was collected by lobectomy, 0 otherwise;

- asian: 1 if the patient's race is Asian, 0 otherwise;

- relative_family_cancer_history.yes: 1 if there is a family history of cancer, 0 otherwise;

- pathologic_stage.StageII: 1 if the TNM stage is Stage II, 0 otherwise;

for a total of 5 clinical information variables.

The interaction in step 3 is calculated by taking the image features and the clinical information as basic variables and computing the product of any two variables as the interaction between them; the resulting interaction matrix is the input data of the survival random forest model.
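As a small illustration of the pairwise products, the sketch below builds one interaction term per unordered pair of distinct variables. The function name and the `a*b` naming of the terms are hypothetical. With 15 basic variables this gives C(15, 2) = 105 terms, which agrees with the count given in the detailed description.

```python
from itertools import combinations

def interaction_features(row):
    """row: dict of variable name -> value for one patient.

    Returns the product of every unordered pair of distinct variables.
    """
    return {f"{a}*{b}": row[a] * row[b]
            for a, b in combinations(sorted(row), 2)}

patient = {f"v{i}": float(i) for i in range(15)}   # 15 toy basic variables
terms = interaction_features(patient)
n_terms = len(terms)
```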

The step 4 specifically includes:

step 4.1: setting the variation range of the survival random forest parameters;

step 4.2: for each group of parameters, randomly dividing the input data into k parts such that the proportion of censored events in each part is approximately the same as the overall proportion, taking each part in turn as the validation set, inputting the remaining k-1 parts as the training set into the survival random forest model, and calculating the C-index from the predicted and true values of the validation set; the basic principles of the random forest model and of permutation feature importance are as follows:

- randomly selecting N_i samples from the N samples with replacement, and constructing a decision tree with these N_i samples;

- when constructing the decision tree, randomly selecting M_j attributes from the M attributes, selecting one of these M_j attributes by a certain splitting criterion (such as the Gini coefficient) as the branching attribute of the node to split the data, and growing the decision tree until a stopping condition is reached (for example, no further branching is possible or the number of branchings reaches 10);

- repeating the two steps above, and constructing a random forest from the large number of decision trees so obtained;

- in the process of constructing a decision tree, permuting the observed values of a given feature and comparing the resulting prediction accuracy with the original accuracy, the difference between the two being the permutation importance of that feature for the tree; the average of these importances over all decision trees is taken as the importance of the feature; repeating this process for every feature gives the importance of all features;

step 4.3: selecting the group of survival random forest parameters with the better prediction effect according to the C-index values, and fitting all input data to construct the final random forest prediction model.
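The C-index used for model selection can be sketched as Harrell's concordance index for right-censored data. This is a minimal version under assumptions: `events[i]` is 1 when the event was observed and 0 when the patient is censored, a higher `risks[i]` means earlier expected recurrence, and ties in predicted risk count as one half. The function name is hypothetical.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs whose predicted order matches the outcome.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before time j, so censored patients never play the role of i.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

times = [1, 2, 3]
events = [1, 1, 0]
perfect = concordance_index(times, events, [3, 2, 1])
reversed_order = concordance_index(times, events, [1, 2, 3])
uninformative = concordance_index(times, events, [1, 1, 1])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is how the mean values reported in the embodiment can be read.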

The step 5 specifically includes:

step 5.1: calculating the recurrence risk classification index; two kinds of index are available:

- sorting the input features of the random forest prediction model constructed in step 4 by permutation importance, and selecting the feature with the highest importance as the recurrence risk classification index;

- predicting on the independent variables of the input data to be classified with the random forest prediction model constructed in step 4, and taking the predicted values as the recurrence risk classification index;

step 5.2: if the recurrence risk classification index is a discrete variable, classifying the patients according to each discrete value; if it is a continuous variable, dividing all patients into two groups according to whether the index is greater than the median (or mean).
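The grouping rule of step 5.2 can be sketched as follows. The function name and the "high"/"low" labels are hypothetical, and labelling the above-median group "high" assumes that a larger index value means a higher recurrence risk, which depends on what the model predicts.

```python
from collections import defaultdict
from statistics import median

def classify_patients(index_values, discrete=False):
    """Group patient positions by the recurrence risk classification index.

    Discrete index: one group per distinct value.
    Continuous index: two groups, above the median and not above it.
    """
    groups = defaultdict(list)
    if discrete:
        for patient, value in enumerate(index_values):
            groups[value].append(patient)
    else:
        cut = median(index_values)
        for patient, value in enumerate(index_values):
            groups["high" if value > cut else "low"].append(patient)
    return dict(groups)

by_median = classify_patients([0.1, 0.9, 0.5, 0.7])
by_value = classify_patients([0, 1, 1, 0], discrete=True)
```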

The advantages of the invention are as follows: it provides an accurate liver cancer recurrence risk prediction method that can mine, in depth, the key factors related to recurrence risk in pathological images, and its prediction performance is significantly better than prediction using clinical information alone; it is more stable than recurrence risk prediction made by doctors on the basis of medical knowledge and experience; as the specific embodiment shows, it achieves a good prediction effect with only 10 pathological image features and 5 items of clinical information, which greatly reduces the computational burden and allows wide application; and the recurrence risk classification index calculated by the invention can distinguish patients with higher and lower recurrence risk, which is of great value for medical research and application.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a C-index comparison between the model used in the present invention and other models. From left to right along the abscissa, the box plots are, in turn: a prediction model using 64128 pathological image features and 13 items of clinical information (imcln); a prediction model using 64128 pathological image features (im); a prediction model using 13 items of clinical information (cln); a prediction model using the 5 screened items of clinical information (cln.s); a prediction model using the 10 screened pathological image features (im.s); a prediction model using the 5 screened items of clinical information and the 10 pathological image features (imcln.s); a prediction model using the 5 screened items of clinical information, the 10 pathological image features and the interactions among all variables (imcln.smi); and a model that builds 15 variables from the 5 screened items of clinical information and the 10 pathological image features and predicts with the interactions among these variables only, excluding the 15 variables themselves (imcln.si).

FIG. 3 shows the Kaplan-Meier curves of the high and low recurrence risk patients classified with the recurrence risk classification index of the invention. The abscissa represents patient survival time and the ordinate represents patient survival rate; the horizontal and vertical dashed lines show the difference in median survival between the two patient groups; p < 0.0001 indicates that the p-value of the log-rank test on the survival distributions of the patient subgroups is less than 0.0001.

Detailed Description

The invention provides a method for predicting recurrence risk of liver cancer by combining pathological image characteristics and clinical information, and technical characteristics and advantages of the invention are described in the following by combining figures and embodiments.

The data of the embodiment are derived from the public database TCGA-LIHC, and the code is implemented in Python 3.7 and R 3.6. The specific implementation is shown in FIG. 1. The liver cancer recurrence risk prediction method combining pathological image features and clinical information provided by the invention comprises the following steps:

step 5: classifying the patients into recurrence risk subgroups using the index extracted from the survival random forest model, calculating the survival functions of the resulting subgroups, drawing Kaplan-Meier curves, and fitting a Cox proportional hazards model to evaluate the classification index.

The pathological image features extracted in the step 1 specifically include:

WSI_snu_osmerci_ngtdm_Strength_range，

MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean，

MND_smu_osnsci_single_fractal_dim_mean，

WSI_snu_esmerci_lbp3_5_disorder，

MED_smu_esmerci_ngtdm_Busyness_range，

WSI_th_ori_firstorder_Range，

MED_e_ori_ngtdm_Complexity，

MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean，

MED_smu_osnsci_lbp4_12_range，

MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis，

for a total of 10 pathological image features.

The clinical treatment information collated in the step 1 comprises a cancer TNM staging index, a biopsy tissue collection method and patient ethnicity information.

In step 2, data processing is performed on the pathological image features and the clinical information separately, and specifically comprises the following steps:

step 2.1: missing value processing, including deleting features with more than 5 missing values, deleting samples with more than 50% of their values missing (the remaining samples show no comparable degree of missingness), filling the missing values of continuous features with the mean, and filling the missing values of discrete features by sampling from the non-missing values (the proportions of the discrete values among the non-missing values must be preserved);

step 2.2: setting dummy variables for the discrete features: if the categories within a feature are independent of each other, each category is set as its own dummy variable; if the categories within a feature are related to each other, dummy variables are set according to that relationship; if the data are imbalanced across the categories of a feature, merging dummy variables is considered;

step 2.3: deleting discrete features with imbalanced data (for example A:B = 100:1), and deleting continuous features with a variance of 0;

step 2.4: normalizing each continuous feature;

after data processing, 10 pathological image features and 5 clinical information variables are obtained.

The interactions in step 3 are the data obtained by multiplying the 15 variables obtained in step 2 with each other in pairs, giving 105 interaction terms in total.

The step 4 specifically includes:

step 4.1: setting the variation range of the survival random forest parameters;

step 4.2: for each group of parameters, randomly dividing the input data into 3 parts such that the proportion of censored events in each part is approximately the same as the overall proportion, taking each part in turn as the validation set, inputting the remaining 2 parts as the training set into the survival random forest model, and calculating the C-index from the predicted and true values of the validation set;

step 4.3: selecting the group of survival random forest parameters with the better prediction effect according to the mean of the 3-fold cross-validation C-index obtained in step 4.2, and fitting all input data to construct the final survival random forest prediction model.
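The censoring-stratified split of step 4.2 can be sketched as below. This is one simple way to keep the censoring proportion roughly equal across folds (shuffle censored and uncensored patients separately, then deal them out in turn); the function names, the fixed seed and the 0/1 coding of `events` are assumptions, and the survival random forest fit itself is not shown.

```python
import random

def stratified_folds(events, k, seed=0):
    """Split sample indices into k folds with similar censoring proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for status in (0, 1):                     # 0 = censored, 1 = event observed
        idx = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)          # deal out round-robin
    return folds

def cross_validation_splits(events, k, seed=0):
    """Yield (training indices, validation indices) for each of the k folds."""
    folds = stratified_folds(events, k, seed)
    for v in range(k):
        train = [i for f, fold in enumerate(folds) if f != v for i in fold]
        yield train, folds[v]

events = [1] * 6 + [0] * 3                    # toy data: 6 events, 3 censored
folds = stratified_folds(events, 3)
splits = list(cross_validation_splits(events, 3))
```

For each parameter group, the model would be fitted on each training index set, the C-index computed on the matching validation set, and the three values averaged.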

The step 5 specifically includes:

step 5.1: predicting on all input data with the survival random forest model constructed in step 4, and taking the predicted value as the recurrence risk classification index of each patient;

step 5.2: dividing the patients into two subgroups according to whether their recurrence risk classification index is greater than the median;

step 5.3: calculating the survival functions of the two subgroups, drawing the Kaplan-Meier curve of each subgroup from its survival function, and testing the classification result with the p-value of the log-rank test;

step 5.4: fitting a Cox proportional hazards model of patient survival time on the recurrence risk classification index, and determining the effectiveness of the index from the hypothesis test results of the model and of its coefficient.
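The evaluation in step 5.3 can be sketched with a Kaplan-Meier estimator and a two-group log-rank test written from their textbook definitions. This is an illustration, not the code of the embodiment: the function names are hypothetical, `events` uses 1 for an observed event and 0 for censoring, and the p-value uses the fact that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x / 2)). The Cox model fit of step 5.4 is not shown.

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time."""
    curve, s = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        s *= 1 - deaths / at_risk
        curve.append((t, s))
    return curve

def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test p-value (chi-square, 1 degree of freedom)."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    in_a = [True] * len(times_a) + [False] * len(times_b)
    observed = expected = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)
        n_a = sum(1 for u, a in zip(times, in_a) if u >= t and a)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d_a = sum(1 for u, e, a in zip(times, events, in_a) if u == t and e and a)
        observed += d_a
        expected += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0                            # no information to separate groups
    chi2 = (observed - expected) ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2))

curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
p_same = logrank_p([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
p_apart = logrank_p([1, 2, 3], [1, 1, 1], [10, 11, 12], [1, 1, 1])
```

With the two subgroups from the median split as inputs, `kaplan_meier` gives the points of each curve and `logrank_p` the p-value used to test the separation.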

In the embodiment of the invention, the mean C-index when predicting with the interactions (imcln.si) was 0.765. The mean C-index when predicting with clinical information only (cln.s) was 0.690, and the p-value of the t-test comparing it with the C-index obtained with the interactions was 1.519e-13. The mean C-index when predicting with pathological image features only (im.s) was 0.707, and the p-value of the t-test comparing it with the C-index obtained with the interactions was 4.712e-8. The p-value of the t-test comparing the C-index obtained with pathological image features only (im.s) and that obtained with clinical information only (cln.s) was 0.026. These results show not only that pathological image features significantly improve prediction accuracy relative to clinical information, but also that the interactions between clinical information and pathological image features play an important role in improving the accuracy of recurrence risk prediction.

In the embodiment of the invention, the p-value of the log-rank test on the relapse-free survival distributions of the two subgroups obtained with the recurrence risk classification index is less than 0.0001, which indicates that the index is valid and reliable for separating patients with high and low recurrence risk. In addition, a Cox proportional hazards model was fitted with the recurrence risk classification index as the only continuous independent variable. The p-value of the log-rank test of the model is 1e-13 (the null hypothesis of this test is that all coefficients are 0; it tests whether the independent variables have a significant influence on the model's prediction), and the p-value of the Wald test of the coefficient is 2.26e-9. The recurrence risk classification index is therefore significantly associated with patient survival time, which further supports its effectiveness.

The above examples are only intended to illustrate the technical solution of the invention and to confirm the effectiveness and superiority of the proposed method, without limiting it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information, characterized in that the method comprises the following steps:

2. The method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information according to claim 1, characterized in that the image features of the pathological image in step 1 comprise:

WSI_snu_osmerci_ngtdm_Strength_range,

MND_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,

MND_smu_osnsci_single_fractal_dim_mean,

WSI_snu_esmerci_lbp3_5_disorder,

MED_smu_esmerci_ngtdm_Busyness_range,

WSI_th_ori_firstorder_Range,

MED_e_ori_ngtdm_Complexity,

MXD_smu_hsmerci_gldm_LargeDependenceLowGrayLevelEmphasis_mean,

MED_smu_osnsci_lbp4_12_range,

MND_e_ori_glszm_LargeAreaHighGrayLevelEmphasis,

for a total of 10 image features; the calculation method of these 10 image features is described in detail in the specification.

3. The method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information according to claim 1, characterized in that the clinical information variables collated in step 1 comprise the cancer TNM staging index, the biopsy tissue collection method, and patient ethnicity information.

4. The method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information according to claim 1, characterized in that step 2 is performed separately on the pathological image features and on the clinical information variables, and the clinical information variables after data processing comprise:

staged i: 1 if the TNM stage is Stage I, otherwise 0;

specific_collection_method_name.lobectomy: 1 if the biopsy sample was collected by lobectomy, otherwise 0;

asian: 1 if the patient's race is Asian, otherwise 0;

relative_family_list_history.yes: 1 if there is a family history of genetic disease, otherwise 0;

staged II: 1 if the TNM stage is Stage II, otherwise 0;

for a total of 5 clinical information variables.

5. The method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information according to claim 1, characterized in that the interactions in step 3 are new variables obtained by multiplying the 15 variables (the clinical information variables and the pathological image features) with one another in pairs, giving 105 interaction variables.
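The count of 105 follows from taking every unordered pair of the 15 variables, C(15, 2) = 15 × 14 / 2 = 105. A minimal sketch of the pairwise products is given below; the function name `pairwise_interactions`, the `a*b` naming of the new variables, and the placeholder variables `v1` to `v15` are assumptions made for illustration only.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_interactions(row: Dict[str, float]) -> Dict[str, float]:
    """Return the product of every unordered pair of distinct variables."""
    names: List[str] = sorted(row)
    return {f"{a}*{b}": row[a] * row[b] for a, b in combinations(names, 2)}


# Hypothetical patient record with 15 placeholder variables v1 ... v15.
patient = {f"v{i}": float(i) for i in range(1, 16)}
interactions = pairwise_interactions(patient)
print(len(interactions))  # 105
```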

6. The method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information according to claim 1, characterized in that the specific steps of selecting a better model through cross-validation in step 4 are as follows:

step 4.1: setting the range over which the survival random forest parameters vary;

step 4.2: for each set of parameters, randomly dividing the input data into k parts such that the proportion of censored events in each part is approximately the same as the overall proportion of censored events, taking each part in turn as the validation set, inputting the remaining k-1 parts into the survival random forest model as the training set, and calculating the C-index from the predicted values and the true values of the validation set;

step 4.3: selecting the set of survival random forest model parameters with the better prediction performance according to the C-index values, fitting all the input data to construct the final random forest prediction model, and predicting the recurrence risk of a new patient with this random forest model.
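Steps 4.1 to 4.3 can be sketched as follows, assuming Harrell's C-index and a fold assignment stratified on censoring status. The survival random forest itself is not implemented here: `fit` and `make_fit` are hypothetical callables standing in for whatever survival random forest library is used, and all function names are illustrative.

```python
import random
from typing import List


def concordance_index(times, events, risk) -> float:
    """Harrell's C-index: a higher risk score should mean earlier recurrence."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            # A pair is comparable when i has an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def stratified_folds(events, k: int, seed: int = 0) -> List[List[int]]:
    """Split indices into k folds with roughly equal censoring proportions."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for status in (0, 1):  # 0 = censored, 1 = event observed
        idx = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validated_cindex(X, times, events, fit, k: int = 5, seed: int = 0) -> float:
    """Mean validation C-index; fit(X, times, events) returns a risk function."""
    scores = []
    for fold in stratified_folds(events, k, seed):
        held_out = set(fold)
        train = [i for i in range(len(times)) if i not in held_out]
        predict = fit([X[i] for i in train],
                      [times[i] for i in train],
                      [events[i] for i in train])
        risk = [predict(X[i]) for i in fold]
        scores.append(concordance_index([times[i] for i in fold],
                                        [events[i] for i in fold], risk))
    return sum(scores) / len(scores)


def select_best(param_grid, X, times, events, make_fit, k: int = 5):
    """Return the parameter set with the highest mean validation C-index."""
    return max(param_grid,
               key=lambda p: cross_validated_cindex(X, times, events, make_fit(p), k))
```

After `select_best` returns, the chosen parameters would be used to refit the model on all input data, as in step 4.3. A fold with no comparable pairs yields a NaN score, so in practice k should be small enough that every fold contains observed events.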

7. The method for predicting the postoperative recurrence risk of liver cancer by combining pathological images and clinical information according to claim 1, characterized in that the recurrence risk classification index in step 5 includes two kinds of index: first, the input features of the random forest prediction model constructed in step 4 are ranked by permutation importance, and the features with the highest importance are selected as recurrence risk classification indexes; second, the random forest prediction model constructed in step 4 is applied to the independent variables of the input data to be classified, and the predicted value is used as a recurrence risk classification index.
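Permutation importance measures how much a model's score drops when one input feature is shuffled. The sketch below illustrates the idea under stated assumptions: `predict` and `score` are hypothetical callables, and the toy demonstration uses negative mean squared error as the score, whereas in the setting of this claim the score would be the C-index of the survival random forest.

```python
import random
from typing import Callable, List, Sequence


def permutation_importance(predict: Callable[[List[float]], float],
                           X: Sequence[List[float]],
                           y: Sequence[float],
                           score: Callable[[Sequence[float], Sequence[float]], float],
                           n_repeats: int = 10,
                           seed: int = 0) -> List[float]:
    """Mean drop in score after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = score(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - score(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


def neg_mse(y_true, y_pred) -> float:
    """Toy score for the demonstration: higher is better."""
    return -sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): the target depends on feature 0 only.
X_demo = [[float(i), float((i * 7) % 10)] for i in range(10)]
y_demo = [2.0 * row[0] for row in X_demo]
importance = permutation_importance(lambda row: 2.0 * row[0], X_demo, y_demo, neg_mse)
ranking = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
print(ranking)
```

The features at the head of `ranking` are the candidates for use as a recurrence risk classification index.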

8. A method for classifying patients using the recurrence risk classification index according to claim 7, characterized in that: if the recurrence risk classification index is a discrete variable, the patients are classified according to each discrete value of the index; and if the recurrence risk classification index is a continuous variable, the patients are divided into high and low recurrence risk subgroups according to whether the index is greater than its median or its mean.
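The classification rule of claim 8 can be sketched as follows. The function names are hypothetical, and the sketch only reports whether each patient lies above the cut-off; which side corresponds to the higher recurrence risk depends on the direction of the chosen index and is left as an assumption for the user to settle.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, Hashable, List, Sequence


def split_continuous(index_values: Sequence[float], use: str = "median") -> List[bool]:
    """True for patients whose index is greater than the median (or mean)."""
    cut = median(index_values) if use == "median" else mean(index_values)
    return [v > cut for v in index_values]


def group_discrete(index_values: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Map each discrete index value to the positions of the patients having it."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for patient, value in enumerate(index_values):
        groups[value].append(patient)
    return dict(groups)


# Hypothetical index values for four patients.
print(split_continuous([0.1, 0.9, 0.5, 0.7]))
print(group_discrete([0, 1, 1, 0]))
```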