CN112185549A - Esophageal squamous carcinoma risk prediction method based on clinical phenotype and logistic regression analysis - Google Patents

Publication number
CN112185549A
Authority
CN
China
Prior art keywords
esophageal squamous
index
cell carcinoma
squamous cell
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011052229.1A
Other languages
Chinese (zh)
Other versions
CN112185549B (en
Inventor
王延峰
凌丹
张桢桢
孙军伟
王妍
王英聪
黄春
张勋才
王立东
宋昕
赵学科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University of Light Industry
Original Assignee
Zhengzhou University of Light Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University of Light Industry filed Critical Zhengzhou University of Light Industry
Priority to CN202011052229.1A priority Critical patent/CN112185549B/en
Publication of CN112185549A publication Critical patent/CN112185549A/en
Application granted granted Critical
Publication of CN112185549B publication Critical patent/CN112185549B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Abstract

The invention provides an esophageal squamous cell carcinoma risk prediction method based on clinical phenotype and logistic regression analysis, which is used to evaluate the prognostic survival risk of esophageal squamous cell carcinoma patients. The method comprises the following steps: first, characteristic indexes are screened from the clinical detection data of esophageal squamous cell carcinoma patients, and a decision tree classifier is constructed from these characteristic indexes; second, the decision tree classifier divides the patients into early-stage patients and middle-and-late-stage patients; then, blood index information from the week before surgery is obtained, the blood indexes highly correlated with the patients' survival risk are screened out, and a logistic regression model is constructed; next, the blood indexes of the classified patients are input into the logistic regression model to obtain the probability value of the prognostic survival risk; finally, the prognostic survival risk is judged from this probability value. The method can accurately judge the postoperative survival state of esophageal squamous cell carcinoma patients, improve the performance of risk prediction, and reduce its cost.

Description

Esophageal squamous carcinoma risk prediction method based on clinical phenotype and logistic regression analysis
Technical Field
The invention relates to the technical field of machine learning, in particular to an esophageal squamous cell carcinoma risk prediction method based on clinical phenotype and logistic regression analysis.
Background
As the incidence of cancer rises, model-based prediction of cancer prognosis has been widely applied to different diseases, yet accurate prognosis for cancer patients remains a primary open problem. Clinically detected data are typically multicollinear, high-dimensional, and noisy, so they suffer from information redundancy, nonlinearity, and similar problems. High dimensionality in particular is a major obstacle to data mining: on the one hand, it makes data processing computationally expensive; on the other hand, the data cannot directly reflect essential attributes. In recent years, researchers at home and abroad have studied the curse of dimensionality and have worked on methods for extracting features from biological information.
Feature selection and model construction are research hotspots and key concerns in academia and in the medical field. Good feature selection can improve the performance of a model, help to understand the characteristics and underlying structure of the data, and help to improve the model further. In the prior art, the following methods are used for feature selection and model construction on training data: (1) one-way analysis of variance tests each feature, measures the relationship between the feature and the dependent variable, and discards uninformative features; (2) the Pearson correlation coefficient measures the linear correlation between variables; (3) linear regression is a common modeling method. All of these approaches screen the characteristic variables with conventional methods and then build the prediction model, so the recognition rate of existing models is low, whereas the medical field currently requires a method that can accurately judge prognostic risk.
Disclosure of Invention
In view of the defects described in the background, the invention provides an esophageal squamous cell carcinoma risk prediction method based on clinical phenotype and logistic regression analysis, which solves the technical problem of the low recognition rate caused by incomplete feature screening in existing prediction models.
The technical scheme of the invention is realized as follows:
A method for predicting esophageal squamous cell carcinoma risk based on clinical phenotype and logistic regression analysis comprises the following steps:
Step one: acquire clinical detection data of esophageal squamous cell carcinoma patients, and screen out from the clinical detection data the characteristic indexes that are highly correlated with the classification of esophageal squamous cell carcinoma patients;
Step two: construct a decision tree classifier from the characteristic indexes that are highly correlated with the classification of esophageal squamous cell carcinoma patients;
Step three: input the characteristic indexes of the esophageal squamous cell carcinoma patients to be classified into the decision tree classifier to obtain the classification results of the patients;
Step four: obtain the blood index information of esophageal squamous cell carcinoma patients from the week before surgery, and screen out the blood indexes that are highly correlated with the survival risk of the patients by constructing ROC curves of the blood index information;
Step five: construct a logistic regression model from the blood indexes that are highly correlated with the survival risk of esophageal squamous cell carcinoma patients;
Step six: input the blood indexes of the esophageal squamous cell carcinoma patients classified in step three into the logistic regression model to obtain the probability value of the prognostic survival risk of the patients;
Step seven: judge whether the probability value of the prognostic survival risk is greater than a threshold γ; if so, the prognostic survival risk is high, otherwise it is low, where the threshold γ is the critical value between high risk and low risk obtained from the ROC curve.
The indexes in the clinical detection data of esophageal squamous cell carcinoma patients comprise sex, pathological diagnosis, tumor site, tumor length, tumor width, tumor thickness, tumor type, degree of pathological differentiation, degree of tumor infiltration, negativity, lymph node positive metastasis, T stage, N stage, M stage, and eighth-edition TNM stage.
The characteristic indexes that are highly correlated with the classification of esophageal squamous cell carcinoma patients are screened from the clinical detection data as follows:
S11, calculate the chi-square values of all indexes in the clinical detection data, look up each chi-square value in the chi-square table to obtain the P value of each index, and select the indexes with P < 0.05 as preliminary characteristic indexes; the preliminary characteristic indexes are sex, degree of pathological differentiation, degree of tumor infiltration, and lymph node positive metastasis;
S12, for each preliminary characteristic index, calculate the information entropy before attribute division and the information entropy after attribute division, and calculate the information gain of the index from the two;
S13, screen the preliminary characteristic indexes according to their information gain to obtain the characteristic indexes that are highly correlated with the classification of esophageal squamous cell carcinoma patients; these characteristic indexes comprise the degree of tumor infiltration and lymph node positive metastasis.
The chi-square values of all indexes in the clinical detection data are calculated as follows:
$$\chi_k^2 = \sum_{i=1}^{m_k} \sum_{j=1}^{2} \frac{(A_{kij} - T_{kij})^2}{T_{kij}}$$
where $k$ denotes the index category, $k \in \{1, 2, \ldots, n_k\}$, $n_k$ denotes the total number of indexes, $\chi_k^2$ denotes the chi-square value of index $k$, $i$ denotes the attribute category of the index, $i \in \{1, 2, \ldots, m_k\}$, $m_k$ denotes the total number of attribute categories of index $k$, $j$ denotes the classification category of the esophageal squamous cell carcinoma patient, $j \in \{1, 2\}$, $A_{kij}$ is the actual number of patients of class $j$ whose index $k$ has attribute value $i$, and $T_{kij}$ is the theoretical number of patients of class $j$ whose index $k$ has attribute value $i$.
The theoretical number $T_{kij}$ of patients of class $j$ whose index $k$ has attribute value $i$ is calculated as follows:
$$T_{kij} = \frac{\left(\sum_{j'=1}^{2} A_{kij'}\right)\left(\sum_{i'=1}^{m_k} A_{ki'j}\right)}{\sum_{i'=1}^{m_k} \sum_{j'=1}^{2} A_{ki'j'}}$$
The information entropy before attribute division is calculated as follows:
$$\mathrm{InfoBefore}(H(x)) = H(x) = -\sum_{j=1}^{2} P(x_j) \log_2 P(x_j)$$
where $\mathrm{InfoBefore}(H(x))$ denotes the information entropy of the event $x$ that a patient is diagnosed with esophageal squamous cell carcinoma when the index category is not considered, $H(x)$ denotes the information entropy of event $x$, $P(x_j)$ denotes the probability that the patient belongs to the $j$-th class of esophageal squamous cell carcinoma, $j$ denotes the classification category of the esophageal squamous cell carcinoma patient, and $j \in \{1, 2\}$;
The information entropy after attribute division is calculated as follows:
$$\mathrm{InfoAfter}(H(x_k)) = H(x_k) = -\sum_{i=1}^{m_k} P(x_{ki}) \sum_{j=1}^{2} P(x_{kij}) \log_2 P(x_{kij})$$
where $\mathrm{InfoAfter}(H(x_k))$ denotes the information entropy of the event $x_k$ that a patient is diagnosed with esophageal squamous cell carcinoma when the index category is considered, $k$ denotes the index category, $k \in \{1, 2, \ldots, n_k\}$, $n_k$ denotes the total number of indexes, $H(x_k)$ denotes the information entropy of event $x_k$, $x_{ki}$ denotes the event that a patient whose index $k$ has attribute value $i$ is diagnosed with esophageal squamous cell carcinoma, $P(x_{ki})$ denotes the probability of event $x_{ki}$, $x_{kij}$ denotes the event that a patient whose index $k$ has attribute value $i$ belongs to class $j$, $P(x_{kij})$ denotes the probability of event $x_{kij}$ among patients with attribute value $i$, $i$ denotes the attribute category of the current index, $i \in \{1, 2, \ldots, m_k\}$, and $m_k$ denotes the total number of attribute categories of the current index;
The information gain of a preliminary characteristic index is calculated as follows:
$$\Delta H(x_k) = \mathrm{InfoBefore}(H(x_k)) - \mathrm{InfoAfter}(H(x_k))$$
where $\Delta H(x_k)$ denotes the information gain of preliminary characteristic index $k$, $\mathrm{InfoBefore}(H(x_k))$ denotes the information entropy of event $x_k$ when the index category is not considered, and $\mathrm{InfoAfter}(H(x_k))$ denotes the information entropy of event $x_k$ when the index category is considered.
The decision tree classifier is constructed as follows: lymph node positive metastasis is taken as the root node of the decision tree, and the degree of tumor infiltration is taken as the leaf node of the decision tree.
The blood index information includes a white blood cell count, a lymphocyte count, a monocyte count, a neutrophil count, an eosinophil count, a basophil count, a red blood cell count, a hemoglobin concentration, a platelet count, total protein, albumin, globulin, prothrombin time, activated partial thromboplastin time, thrombin time, and fibrinogen.
The blood indexes that are highly correlated with the survival risk of esophageal squamous cell carcinoma patients are screened by constructing ROC curves of the blood index information, as follows:
The ROC curve of each blood index in the blood index information is drawn, and the AUC and P' value of each blood index are obtained from its ROC curve;
According to statistical theory, the area under the ROC curve (AUC) lies between 0.5 and 1.0, so the blood indexes with AUC > 0.5 and P' < 0.05 are selected as the blood indexes highly correlated with the survival risk of esophageal squamous cell carcinoma patients; these comprise the variables leukocyte count, monocyte count, neutrophil count, eosinophil count, and total protein.
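The AUC used in this screening step can be computed without drawing the curve, because it equals the probability that a randomly chosen positive case has a higher index value than a randomly chosen negative case. The following minimal Python sketch illustrates that pairwise formulation. The function name `roc_auc`, the sample values, and the 0/1 labels are hypothetical illustrations rather than data from the invention, and the significance value P' is not computed here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: blood index values, one per patient.
    labels: 1 for the positive outcome, 0 for the negative outcome.
    Ties between a positive and a negative case count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical index values and outcomes (illustration only).
example_scores = [6.1, 7.4, 5.2, 8.0, 4.9, 6.8]
example_labels = [0, 1, 0, 1, 1, 0]
example_auc = roc_auc(example_scores, example_labels)

# Screening rule from the text, applied to the AUC part only.
passes_auc_filter = example_auc > 0.5
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a cohort of a few hundred cases; a rank-based formulation would be preferable for much larger data sets.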
The logistic regression model is as follows:
$$\operatorname{logit}(p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m$$
where $p$ denotes the probability that an esophageal squamous cell carcinoma patient is classified as low risk, $\operatorname{logit}(p) = \ln\frac{p}{1-p}$ denotes the log odds of that probability, $X_1$ denotes the value of the 1st variable, $X_2$ denotes the value of the 2nd variable, $X_m$ denotes the value of the $m$-th variable, $m$ denotes the number of variables in the logistic regression model, $\beta_0$ denotes the constant term of the logistic regression model, $\beta_1$ denotes the coefficient of variable $X_1$, $\beta_2$ denotes the coefficient of variable $X_2$, and $\beta_m$ denotes the coefficient of variable $X_m$.
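To illustrate how the model turns a set of variable values into a probability, the following minimal Python sketch inverts the logit and then applies the threshold comparison described in step seven. The function names, the coefficient values, the input values, and the threshold are hypothetical placeholders chosen for illustration; they are not fitted values from the invention.

```python
import math


def logistic_probability(intercept, coefficients, values):
    """Return p such that logit(p) = intercept + sum(beta_i * x_i)."""
    if len(coefficients) != len(values):
        raise ValueError("one coefficient is required per variable")
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    # Numerically stable inverse of the logit function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify_risk(risk_probability, gamma):
    """Step seven: high risk if the probability exceeds the threshold gamma."""
    return "high risk" if risk_probability > gamma else "low risk"


# Hypothetical coefficients for five blood indexes (illustration only).
beta0 = -1.0
betas = [0.2, 0.5, -0.1, 0.3, -0.05]
x = [6.0, 0.4, 3.5, 0.1, 70.0]

p = logistic_probability(beta0, betas, x)
print(p, classify_risk(p, gamma=0.5))
```

In practice the coefficients would be estimated from the screened blood indexes and the threshold would be taken from the ROC curve of the fitted model, as the method describes.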
The beneficial effects of this technical scheme are as follows:
(1) The invention uses the chi-square value, information entropy, and information gain to screen the characteristic variables of clinical text data, so that the characteristic variables distinguishing early-stage from middle-and-late-stage esophageal squamous cell carcinoma can be identified effectively.
(2) The invention analyzes the survival probability curves of early-stage and middle-and-late-stage esophageal squamous cell carcinoma patients and the difference in prognostic survival between the two groups; it constructs a multivariable prediction model from the blood test data of the patients in the week before surgery; and it uses the multivariable prediction model to judge the prognostic risk of esophageal squamous cell carcinoma patients, so that the postoperative survival state of the patients can be judged accurately, the performance of risk prediction is improved, and the cost of risk prediction is reduced.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a decision diagram provided by an embodiment of the present invention;
FIG. 3 is a graph of a survival curve analysis provided by an embodiment of the present invention;
FIG. 4 is a graph of a leucocyte count ROC curve analysis provided by an embodiment of the present invention;
FIG. 5 is a graph of a monocyte count ROC curve analysis provided by an embodiment of the present invention;
FIG. 6 is a graph of a neutrophil count ROC curve analysis provided by an embodiment of the present invention;
FIG. 7 is a graph of an eosinophil count ROC curve provided by an embodiment of the present invention;
FIG. 8 is a graphical representation of ROC curve analysis of total protein provided by an example of the present invention;
FIG. 9 is a diagram of ROC curve analysis for a multivariate probability prediction model provided by an embodiment of the present invention;
FIG. 10 is a diagram of ROC curve analysis of the PNI model provided by an embodiment of the present invention;
FIG. 11 is a risk assessment diagram for different models provided by embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a method for predicting esophageal squamous cell carcinoma risk based on clinical phenotype and logistic regression analysis, which comprises the following specific steps:
Step one: acquire clinical detection data of esophageal squamous cell carcinoma patients, and screen out from the clinical detection data the characteristic indexes that are highly correlated with the classification of esophageal squamous cell carcinoma patients. The indexes in the clinical detection data comprise sex, pathological diagnosis, tumor site, tumor length, tumor width, tumor thickness, tumor type, degree of pathological differentiation, degree of tumor infiltration, negativity, lymph node positive metastasis, T stage, N stage, M stage, and eighth-edition TNM stage. The embodiment of the invention includes the data of 418 patients with esophageal squamous cell carcinoma, of whom 260 are male (62.2%) and 158 are female (37.8%). The tumor was located in the upper thoracic region in 79 cases (18.9%), in the middle thoracic region in 279 cases (66.7%), and in the lower thoracic region in 60 cases (14.4%), indicating that the tumors mostly occurred in the middle thoracic region. The tumor was highly differentiated in 26 cases (6.2%), moderately differentiated in 224 cases (53.6%), and poorly differentiated in 168 cases (40.2%). In most patients (62.0%) the tumor had infiltrated the fibrous membrane, in 111 cases (26.6%) the muscular layer, and in a few cases the mucosal and submucosal layers. Lymph node positive metastasis was present in nearly 50% of the patients.
The characteristic indexes that are highly correlated with the classification of esophageal squamous cell carcinoma patients are screened from the clinical detection data as follows:
S11, calculate the chi-square values of all indexes in the clinical detection data, look up each chi-square value in the chi-square table to obtain the P value of each index, and select the indexes with P < 0.05 as preliminary characteristic indexes;
The chi-square values of all indexes in the clinical detection data are calculated as follows:
$$\chi_k^2 = \sum_{i=1}^{m_k} \sum_{j=1}^{2} \frac{(A_{kij} - T_{kij})^2}{T_{kij}}$$
where $k$ denotes the index category, $k \in \{1, 2, \ldots, n_k\}$, $n_k$ denotes the total number of indexes, $\chi_k^2$ denotes the chi-square value of index $k$, $i$ denotes the attribute category of the index, $i \in \{1, 2, \ldots, m_k\}$, $m_k$ denotes the total number of attribute categories of index $k$, $j$ denotes the classification category of the esophageal squamous cell carcinoma patient, $j \in \{1, 2\}$, $A_{kij}$ is the actual number of patients of class $j$ whose index $k$ has attribute value $i$, and $T_{kij}$ is the theoretical number of patients of class $j$ whose index $k$ has attribute value $i$.
The chi-square value of each index is looked up in Table 1 to obtain the probability P value of the index; if the P value is less than 0.05, the index is selected as a preliminary characteristic index.
TABLE 1 Chi-square distribution table
Figure BDA0002709923180000061
where v denotes the degrees of freedom, calculated as v = (number of rows - 1) × (number of columns - 1).
Taking the clinical detection index gender as an example, the influence of gender on the classification of esophageal squamous cell carcinoma is analyzed. The actual statistical distribution of esophageal squamous cell carcinoma in men and women is shown in Table 2:
TABLE 2 Actual statistical distribution of esophageal squamous cell carcinoma in men and women
| | Early esophageal squamous carcinoma | Middle and late stage esophageal squamous carcinoma | Total | Proportion of early esophageal squamous carcinoma |
|---|---|---|---|---|
| Male | A111 = 57 | A112 = 203 | 260 | 21.9% |
| Female | A121 = 58 | A122 = 100 | 158 | 36.7% |
| Total | 115 | 303 | 418 | 27.5% |
Here A111 = 57 is the actual number of men judged to have early esophageal squamous cell carcinoma, A112 = 203 is the actual number of men judged to have middle and late stage esophageal squamous cell carcinoma, A121 = 58 is the actual number of women judged to have early esophageal squamous cell carcinoma, and A122 = 100 is the actual number of women judged to have middle and late stage esophageal squamous cell carcinoma. The proportion of early esophageal squamous cell carcinoma is 21.9% among male patients and 36.7% among female patients; the difference between the two may be caused by sampling error, or by an influence of gender on whether a patient is diagnosed with early esophageal squamous cell carcinoma.
First, assume that gender has no effect on the classification of esophageal squamous cell carcinoma, that is, gender is independent of whether the carcinoma is in the early stage. Under this assumption, the proportion of early-stage patients among all diagnosed patients, (57+58)/(57+58+203+100) = 27.5%, applies equally to men and women; for example, the theoretical number of early-stage men is 260 × 27.5% ≈ 72. The theoretical values shown in Table 3 are obtained in this way.
TABLE 3 Theoretical distribution of esophageal squamous cell carcinoma in men and women
| | Early esophageal squamous carcinoma | Middle and late stage esophageal squamous carcinoma | Total |
|---|---|---|---|
| Male | T111 = 72 | T112 = 188 | 260 |
| Female | T121 = 43 | T122 = 115 | 158 |
| Total | 115 | 303 | 418 |
Here T111 = 72 is the theoretical number of men judged to have early esophageal squamous cell carcinoma, T112 = 188 is the theoretical number of men judged to have middle and late stage esophageal squamous cell carcinoma, T121 = 43 is the theoretical number of women judged to have early esophageal squamous cell carcinoma, and T122 = 115 is the theoretical number of women judged to have middle and late stage esophageal squamous cell carcinoma.
If gender had no effect on the classification of esophageal squamous cell carcinoma, the calculated chi-square value would be very small. The chi-square value for gender is calculated as follows:
$$\chi^2 = \frac{(57-72)^2}{72} + \frac{(203-188)^2}{188} + \frac{(58-43)^2}{43} + \frac{(100-115)^2}{115} \approx 11.5$$
Next, the probability P value is obtained from the chi-square distribution table and used to judge whether the difference is significant. Looking up the chi-square distribution requires the degrees of freedom, which here are v = (2 - 1) × (2 - 1) = 1. According to Table 1, with 1 degree of freedom a chi-square value of 11.5 corresponds to a P value of less than 0.05, so the assumption that gender has no effect on the diagnosis of early esophageal squamous cell carcinoma is rejected; that is, gender does influence whether a patient is diagnosed with early esophageal squamous cell carcinoma.
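The gender example can be checked numerically. The following minimal Python sketch derives the theoretical counts from the row and column totals of Table 2 and evaluates the chi-square statistic. The function names are illustrative assumptions. Rounding the theoretical counts to whole patients reproduces the values in Table 3; leaving them unrounded gives a somewhat smaller statistic that still lies well above 3.84, the critical value for 1 degree of freedom at P = 0.05. The P value helper relies on the identity that, for 1 degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x/2)).

```python
import math


def expected_counts(observed, round_counts=False):
    """Theoretical counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    if round_counts:
        # The worked example rounds theoretical counts to whole patients.
        expected = [[float(round(e)) for e in row] for row in expected]
    return expected


def chi_square(observed, expected):
    """Sum of (A - T)^2 / T over all cells of the contingency table."""
    return sum(
        (a - t) ** 2 / t
        for obs_row, exp_row in zip(observed, expected)
        for a, t in zip(obs_row, exp_row)
    )


def p_value_df1(chi2):
    """Upper-tail P value of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows: male, female. Columns: early stage, middle and late stage (Table 2).
observed = [[57, 203], [58, 100]]

expected_rounded = expected_counts(observed, round_counts=True)
chi2_rounded = chi_square(observed, expected_rounded)
chi2_exact = chi_square(observed, expected_counts(observed))
p_rounded = p_value_df1(chi2_rounded)
```

For tables with more than two rows or columns the same `chi_square` function applies, but the P value must then be taken from the chi-square distribution with the corresponding degrees of freedom.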
The results of the chi-square test analysis are shown in Table 2: gender (P = 0.001), degree of pathological differentiation (P = 0.000), degree of tumor infiltration (P = 0.000), and lymph node positive metastasis (P = 0.000) are significantly correlated with the classification of patients into early-stage and middle-and-late-stage esophageal squamous cell carcinoma, whereas the tumor site (P = 0.227) is not. Therefore, the preliminary characteristic indexes are sex, degree of pathological differentiation, degree of tumor infiltration, and lymph node positive metastasis;
S12, for each preliminary characteristic index, calculate the information entropy before attribute division and the information entropy after attribute division, and calculate the information gain of the index from the two;
Information entropy is a quantitative measure of the information content of a system and is used primarily to measure uncertainty. In machine learning classification problems, a larger entropy indicates greater uncertainty about the class, and vice versa.
The information entropy before attribute division is calculated as follows:
$$\mathrm{InfoBefore}(H(x)) = H(x) = -\sum_{j=1}^{2} P(x_j) \log_2 P(x_j)$$
where $\mathrm{InfoBefore}(H(x))$ denotes the information entropy of the event $x$ that a patient is diagnosed with esophageal squamous cell carcinoma when the index category is not considered, $H(x)$ denotes the information entropy of event $x$, $P(x_j)$ denotes the probability that the patient belongs to the $j$-th class of esophageal squamous cell carcinoma, $j$ denotes the classification category of the esophageal squamous cell carcinoma patient, and $j \in \{1, 2\}$.
The information entropy after attribute division is calculated as follows:
$$\mathrm{InfoAfter}(H(x_k)) = H(x_k) = -\sum_{i=1}^{m_k} P(x_{ki}) \sum_{j=1}^{2} P(x_{kij}) \log_2 P(x_{kij})$$
where $\mathrm{InfoAfter}(H(x_k))$ denotes the information entropy of the event $x_k$ that a patient is diagnosed with esophageal squamous cell carcinoma when the index category is considered, $k$ denotes the index category, $k \in \{1, 2, \ldots, n_k\}$, $n_k$ denotes the total number of indexes, $H(x_k)$ denotes the information entropy of event $x_k$, $x_{ki}$ denotes the event that a patient whose index $k$ has attribute value $i$ is diagnosed with esophageal squamous cell carcinoma, $P(x_{ki})$ denotes the probability of event $x_{ki}$, $x_{kij}$ denotes the event that a patient whose index $k$ has attribute value $i$ belongs to class $j$, $P(x_{kij})$ denotes the probability of event $x_{kij}$ among patients with attribute value $i$, $i$ denotes the attribute category of the current index, $i \in \{1, 2, \ldots, m_k\}$, and $m_k$ denotes the total number of attribute categories of the current index.
According to the statistical probability formula, the proportion of early esophageal squamous cell carcinoma patients among all samples is P1 = 115/418 = 0.275, and the proportion of middle and late stage esophageal squamous cell carcinoma patients is P2 = 303/418 = 0.725, so the information entropy of the esophageal squamous cell carcinoma event can be calculated:
$$\begin{aligned} H &= -P_1 \log_2 P_1 - P_2 \log_2 P_2 \\ &= -0.275 \log_2 0.275 - 0.725 \log_2 0.725 \\ &\approx 0.8487 \end{aligned}$$
The base of the logarithm is 2. The only requirement is that a low-probability event corresponds to a high amount of information, so the choice of base is arbitrary; the invention simply follows the convention of information theory and takes 2 as the base. The information content of the current data in its original state, measured by entropy, is therefore 0.8487.
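The entropy of the undivided data can be reproduced with a few lines of Python. The following minimal sketch computes the base-2 entropy directly from the class counts (115 early-stage and 303 middle-and-late-stage patients); the function name `entropy` is an illustrative assumption.

```python
import math


def entropy(counts):
    """Base-2 information entropy of a class distribution given as counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one observation is required")
    # Classes with zero count contribute nothing to the entropy.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Early stage vs. middle and late stage patients among the 418 cases.
h_before = entropy([115, 303])
print(round(h_before, 4))
```

Working from counts rather than from probabilities rounded to three decimals avoids a small rounding difference in the last digit.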
S13, the larger the information gain value of the characteristic index is, the larger the classification correlation between the index and the patients with esophageal squamous cell carcinoma is, and the preliminary characteristic index is screened according to the information gain value to obtain the characteristic index with high classification correlation with the patients with esophageal squamous cell carcinoma; wherein the characteristic indexes comprise tumor infiltration degree and lymph node positive metastasis.
The information gain is an index used for selecting a feature in the decision tree algorithm, and the larger the information gain is, the better the selectivity of the feature is. The method for calculating the information gain of the preliminary characteristic index comprises the following steps:
ΔH(x_k) = InfoBefore(H(x_k)) - InfoAfter(H(x_k)),
where ΔH(x_k) denotes the information gain of preliminary characteristic index k, and InfoBefore(H(x_k)) denotes the information entropy of event x_k when the index category is not considered.
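The information gain formula can be sketched in Python as follows. The sketch assumes that InfoAfter is the usual size-weighted sum of the branch entropies; the function names and the branch counts in `example` are hypothetical and are not taken from the patent's tables.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(branches):
    """branches: one [early, middle_late] count pair per attribute value."""
    class_totals = [sum(column) for column in zip(*branches)]
    n = sum(class_totals)
    info_before = entropy(class_totals)
    # Assumed form of InfoAfter: branch entropies weighted by branch size.
    info_after = sum(sum(b) / n * entropy(b) for b in branches)
    return info_before - info_after

# Hypothetical two-valued attribute splitting 418 patients (115 early, 303 middle/late).
example = [[40, 200], [75, 103]]
gain = information_gain(example)
print(round(gain, 4))
```

An attribute that separates the two classes perfectly recovers the full entropy before the split, while an attribute whose branches all have the same class mix yields a gain of zero.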
Building on the entropy calculation above, the early and middle and late stage esophageal squamous carcinoma patient data are split by the gender attribute. The split yields two parts (male and female), and the information entropy of each branch is calculated as follows:
Figure BDA0002709923180000081
Figure BDA0002709923180000082
therefore, the entropy of the information divided by gender is:
Figure BDA0002709923180000083
finally, the information gain divided based on the attribute of gender is as follows:
ΔH(gender) = InfoBefore(H(gender)) - InfoAfter(H(gender)) = 0.8487 - 0.8304 = 0.0183
When the early and middle and late stage esophageal squamous carcinoma data are split by the tumor site attribute, the data divide into three parts, and the information entropy of each branch is calculated as follows:
Figure BDA0002709923180000084
Figure BDA0002709923180000091
Figure BDA0002709923180000092
therefore, the entropy of information after division according to tumor sites is:
Figure BDA0002709923180000093
finally, the information gain divided based on the tumor part is as follows:
ΔH(tumor site) = InfoBefore(H(tumor site)) - InfoAfter(H(tumor site))
= 0.8487 - 0.8438 = 0.0049
When the early and middle and late stage esophageal squamous carcinoma data are split by the tumor differentiation degree attribute, the data divide into three parts (high, medium and low differentiation), and the information entropy of each branch is calculated as follows:
Figure BDA0002709923180000094
Figure BDA0002709923180000095
Figure BDA0002709923180000096
therefore, the information entropy after division according to the degree of tumor differentiation is:
Figure BDA0002709923180000097
finally, the information gain divided based on the attribute of tumor differentiation degree is as follows:
ΔH(degree of tumor differentiation) = InfoBefore(H(degree of tumor differentiation)) - InfoAfter(H(degree of tumor differentiation))
= 0.8487 - 0.7967 = 0.0520
When the early and middle and late stage esophageal squamous carcinoma data are split by the tumor infiltration degree attribute, the data divide into four parts, and the information entropy of each branch is calculated as follows:
Figure BDA0002709923180000098
Figure BDA0002709923180000099
Figure BDA00027099231800000910
Figure BDA00027099231800000911
therefore, the information entropy divided according to the tumor infiltration degree is:
Figure BDA0002709923180000101
finally, the information gain divided based on the attribute of tumor infiltration degree is as follows:
ΔH(degree of tumor infiltration) = InfoBefore(H(degree of tumor infiltration)) - InfoAfter(H(degree of tumor infiltration))
= 0.8487 - 0.6036 = 0.2451
When the early and middle and late stage esophageal squamous carcinoma patient data are split by the lymph node positive metastasis attribute, the data divide into two parts (non-metastasis and metastasis), and the information entropy of each branch is calculated as follows:
Figure BDA0002709923180000102
Figure BDA0002709923180000103
therefore, the information entropy after division depending on whether lymph nodes are metastasized is:
Figure BDA0002709923180000104
finally, the information gain of the classification based on the attribute of whether lymph nodes are metastasized is:
ΔH(lymph node positive metastasis) = InfoBefore(H(lymph node positive metastasis)) - InfoAfter(H(lymph node positive metastasis))
= 0.8487 - 0.5099 = 0.3388
The information entropy and information gain values for the early and middle-late classification of esophageal squamous carcinoma for each attribute are shown in table 4.
TABLE 4 Single-factor analysis chart for early and middle-late esophageal squamous carcinoma patients
Figure BDA0002709923180000105
Figure BDA0002709923180000111
The P values in Table 4 were obtained by single-factor analysis. According to the chi-square analysis of each index, gender (P = 0.001), pathological differentiation degree (P = 0.000), tumor infiltration degree (P = 0.000) and lymph node positive metastasis (P = 0.000) are risk factors associated with early versus middle and late stage esophageal squamous cell carcinoma. Combining the information entropy and information gain results, the information gains of two risk factors, tumor infiltration degree (ΔH = 0.2451) and lymph node positive metastasis (ΔH = 0.3388), are relatively large, so these two risk factors are used as the node information of the decision tree.
Step two: constructing a decision tree classifier according to the characteristic indexes with high classification correlation with esophageal squamous carcinoma patients;
The decision tree classifier is constructed as follows: lymph node positive metastasis is taken as the root node of the decision tree and tumor infiltration degree as the leaf node. That is, the attribute with the largest information gain, lymph node positive metastasis, is the first node and tumor infiltration degree is the second node; the resulting decision tree model is shown in FIG. 2. The collected data are substituted into the decision tree classification model for verification and the probability distribution under each node is counted; the model classifies early and middle and late stage esophageal squamous cell carcinoma patients with an accuracy of 95.2%.
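A two-level tree of this shape can be sketched in plain Python. This is an illustrative sketch only: the tree is flattened into a lookup table keyed by the path (lymph node metastasis, infiltration level), each leaf is labelled by majority vote, and the infiltration level names, the records and the function names are all hypothetical rather than taken from FIG. 2 or the patent's data.

```python
from collections import Counter, defaultdict

def build_two_level_tree(records):
    """Root split on lymph node metastasis, second split on infiltration level.
    Each leaf stores the majority stage label of the records that reach it."""
    buckets = defaultdict(Counter)
    for metastasis, infiltration, stage in records:
        buckets[(metastasis, infiltration)][stage] += 1
    return {path: votes.most_common(1)[0][0] for path, votes in buckets.items()}

def classify(tree, metastasis, infiltration, default="middle_late"):
    """Follow the two splits; unseen paths fall back to a default label."""
    return tree.get((metastasis, infiltration), default)

# Hypothetical records: (lymph node metastasis, infiltration level, stage).
records = [
    (False, "mucosa", "early"), (False, "mucosa", "early"),
    (False, "muscularis", "early"), (False, "muscularis", "middle_late"),
    (False, "muscularis", "early"),
    (True, "muscularis", "middle_late"), (True, "adventitia", "middle_late"),
    (False, "adventitia", "middle_late"),
]
tree = build_two_level_tree(records)
accuracy = sum(classify(tree, m, i) == s for m, i, s in records) / len(records)
print(accuracy)
```

Counting how many records reach each leaf and agree with its label corresponds to the verification step described above, in which the collected data are passed back through the tree.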
Step three: and inputting the characteristic indexes of the esophageal squamous cell carcinoma patients to be classified into a decision tree classifier to obtain classification results of the esophageal squamous cell carcinoma patients.
Survival rate calculation for esophageal squamous carcinoma patients:
S(t)=S(t-1)S(t|t-1)
where the survival rate, also called the survival probability or survival function, is the probability that a patient's survival time exceeds time t and is denoted S(t); that is, S(t) is the t-year survival rate, and S(t | t-1) is the conditional probability of surviving year t given survival through year t-1. The curve plotted with time t as abscissa and S(t) as ordinate is called the survival curve. It is a descending curve: the steeper the descent, the lower the survival rate or the shorter the survival time, and its slope indicates the death rate.
As shown in FIG. 3, Kaplan-Meier analysis shows a significant difference between the early group and the middle and late group: the survival time of middle and late stage esophageal squamous cell carcinoma patients is significantly shorter than that of early stage patients (log-rank test, χ² = 19.580, P = 0.000). According to follow-up data analysis, the 3-year survival rate of the early group is more than 70%, while that of the middle and late group is 54.03%; the 6-year survival rate of the early group is 49.95%, while that of the middle and late group is 27.07%; the 11-year survival rate of the early group is 37.41%, while that of the middle and late group is only 15.45%.
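The recursion S(t) = S(t-1)S(t | t-1) amounts to a running product of yearly conditional survival probabilities. The sketch below assumes S(0) = 1; the function name and the three input probabilities are hypothetical and are not the follow-up data of the patent.

```python
def survival_curve(conditional_survival):
    """Return [S(1), S(2), ...] from yearly conditional probabilities S(t | t-1),
    assuming S(0) = 1."""
    curve, s = [], 1.0
    for p in conditional_survival:
        s *= p  # S(t) = S(t-1) * S(t | t-1)
        curve.append(s)
    return curve

# Hypothetical conditional survival probabilities for three consecutive years.
curve = survival_curve([0.9, 0.8, 0.75])
print([round(s, 2) for s in curve])
```

Because every conditional probability is at most 1, the resulting sequence can never increase, which is why the survival curve is always a descending curve.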
Step four: obtaining blood index information of a week before an esophageal squamous cell carcinoma patient operation, and screening out a blood index with high association with survival risk of the esophageal squamous cell carcinoma patient by constructing an ROC curve of the blood index information;
The blood index information includes white blood cell count (10^9/L), lymphocyte count (10^9/L), monocyte count (10^9/L), neutrophil count (10^9/L), eosinophil count (10^9/L), basophil count (10^9/L), red blood cell count (10^9/L), hemoglobin concentration (g/L), platelet count (10^9/L), total protein (g/L), albumin (g/L), globulin (g/L), prothrombin time (s), activated partial thromboplastin time (s), thrombin time (s) and fibrinogen (mg/dL).
The method for screening the blood index with high survival risk correlation with the esophageal squamous cell carcinoma patient by constructing the ROC curve of the blood index information comprises the following steps:
ROC curves are drawn for all blood indexes in the blood index information, and the AUC and P' value of each blood index are obtained from its ROC curve. According to statistical theory, the area under the ROC curve lies between 0.5 and 1.0, and AUC > 0.5 with P' < 0.05 indicates better classification. Blood indexes with AUC > 0.5 and P' < 0.05 are screened out as the blood indexes highly correlated with the survival risk of esophageal squamous cell carcinoma patients; these are white blood cell count, monocyte count, neutrophil count, eosinophil count and total protein.
A receiver operating characteristic curve (ROC curve) is a graphical analysis tool used to select the best univariate classification model and to set the optimal threshold within a model; AUC (area under the curve) denotes the area under the ROC curve. ROC curve analysis was performed on the 16 blood indexes above, with the following results: white blood cell count (10^9/L) (AUC = 0.663, P = 0.007), lymphocyte count (10^9/L) (AUC = 0.508, P = 0.893), monocyte count (10^9/L) (AUC = 0.669, P = 0.005), neutrophil count (10^9/L) (AUC = 0.650, P = 0.010), eosinophil count (10^9/L) (AUC = 0.647, P = 0.015), basophil count (10^9/L) (AUC = 0.555, P = 0.362), red blood cell count (10^9/L) (AUC = 0.455, P = 0.454), hemoglobin concentration (g/L) (AUC = 0.427, P = 0.227), platelet count (10^9/L) (AUC = 0.584, P = 0.162), total protein (g/L) (AUC = 0.622, P = 0.043), albumin (g/L) (AUC = 0.605, P = 0.082), globulin (g/L) (AUC = 0.537, P = 0.539), prothrombin time (s) (AUC = 0.443, P = 0.346), activated partial thromboplastin time (s) (AUC = 0.609, P = 0.072), thrombin time (s) (AUC = 0.407, P = 0.125) and fibrinogen (mg/dL) (AUC = 0.597, P = 0.107). Specific results are shown in Table 5.
TABLE 5 univariate ROC Curve analysis
Figure BDA0002709923180000131
The P values in Table 5 were obtained by ROC curve analysis. Receiver operating characteristic analysis of the 16 blood indexes was used to screen out the risk factors with a large influence on prognostic survival risk. To draw an ROC curve, a series of cutoff values is set for a continuous variable and the sensitivity and specificity are calculated at each cutoff; sensitivity is plotted on the vertical axis against (1 - specificity) on the horizontal axis, and the larger the area under the curve, the higher the classification accuracy. The point on the curve closest to the upper left corner of the plot is the cutoff with both high sensitivity and high specificity, that is, the optimal classification threshold of a single index. The threshold is selected by the Youden index, calculated as: Youden index = sensitivity + specificity - 1. ROC curve analysis identified five important characteristic variables: white blood cell count, monocyte count, neutrophil count, eosinophil count and total protein. The specific results are as follows.
FIG. 4 shows white blood cell count (10^9/L) (AUC = 0.663, 95% CI: [0.549, 0.778], P = 0.007): sensitivity is 71.9%, specificity is 63.9%, Youden index = sensitivity + specificity - 1 = 0.719 + 0.639 - 1 = 0.358, and the optimal threshold is 6.05, dividing white blood cell count into two groups (high white blood cell count group: > 6.05 × 10^9/L; low white blood cell count group: ≤ 6.05 × 10^9/L).
FIG. 5 shows monocyte count (10^9/L) (AUC = 0.669, 95% CI: [0.561, 0.776], P = 0.005): sensitivity is 78.1%, specificity is 43.4%, Youden index = sensitivity + specificity - 1 = 0.781 + 0.434 - 1 = 0.215, and the optimal threshold is 0.35, dividing monocyte count into two groups (high monocyte count group: > 0.35 × 10^9/L; low monocyte count group: ≤ 0.35 × 10^9/L).
FIG. 6 shows neutrophil count (10^9/L) (AUC = 0.650, 95% CI: [0.537, 0.764], P = 0.010): sensitivity is 78.1%, specificity is 55.4%, Youden index = sensitivity + specificity - 1 = 0.781 + 0.554 - 1 = 0.335, and the optimal threshold is 3.35, dividing neutrophil count into two groups (high neutrophil count group: > 3.35 × 10^9/L; low neutrophil count group: ≤ 3.35 × 10^9/L).
FIG. 7 shows eosinophil count (10^9/L) (AUC = 0.647, 95% CI: [0.538, 0.756], P = 0.015): sensitivity is 84.4%, specificity is 42.4%, Youden index = sensitivity + specificity - 1 = 0.844 + 0.424 - 1 = 0.226, and the optimal threshold is 0.05, dividing eosinophil count into two groups (high eosinophil count group: > 0.05 × 10^9/L; low eosinophil count group: ≤ 0.05 × 10^9/L).
FIG. 8 shows total protein (g/L) (AUC = 0.622, 95% CI: [0.515, 0.729], P = 0.043): sensitivity is 90.6%, specificity is 32.5%, Youden index = sensitivity + specificity - 1 = 0.906 + 0.325 - 1 = 0.231, and the optimal threshold is 67.5, dividing total protein into two groups (high total protein group: > 67.5 g/L; low total protein group: ≤ 67.5 g/L).
Each index is thus divided qualitatively into a high group and a low group; the high group is coded as "1" and the low group as "0".
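The ROC procedure described above (sweep cutoffs, compute sensitivity and specificity, pick the cutoff with the largest Youden index) can be sketched in plain Python. The data below are hypothetical white blood cell counts, the function names are illustrative, and the AUC is computed with the standard rank interpretation (the probability that a positive case has a higher value than a negative case), which is an assumption about how the patent's software computed it.

```python
def roc_points(values, labels):
    """(threshold, sensitivity, specificity) for the rule 'value > threshold => positive'."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    points = []
    for t in sorted(set(values)):
        sensitivity = sum(v > t for v in pos) / len(pos)
        specificity = sum(v <= t for v in neg) / len(neg)
        points.append((t, sensitivity, specificity))
    return points

def auc(values, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(values, labels):
    """Cutoff maximizing the Youden index = sensitivity + specificity - 1."""
    return max(roc_points(values, labels), key=lambda p: p[1] + p[2] - 1)

# Hypothetical white blood cell counts (10^9/L); label 1 marks the outcome of interest.
wbc = [4.1, 5.0, 5.6, 6.0, 6.3, 6.8, 7.2, 7.9, 8.4, 5.2]
labels = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
threshold, sens, spec = best_threshold(wbc, labels)
youden = sens + spec - 1
high_group = [int(v > threshold) for v in wbc]  # 1 = high group, 0 = low group
```

The last line mirrors the coding used in the text: values above the optimal threshold fall in the high group ("1") and values at or below it in the low group ("0").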
Step five: constructing a logistic regression model according to blood indexes which are screened out by the ROC curve and have high association with the survival risk of the esophageal squamous cell carcinoma patient;
the logistic regression model is as follows:
logit(p) = β0 + β1X1 + β2X2 + … + βmXm
where p denotes the probability that an esophageal squamous carcinoma patient is classified as low risk; logit(p) denotes the log odds of that probability; X1 denotes the value of the 1st variable, X2 the value of the 2nd variable, and Xm the value of the m-th variable; m denotes the number of variables in the logistic regression model; β0 denotes the constant term of the logistic regression model; and β1, β2, …, βm denote the coefficients corresponding to the variables X1, X2, …, Xm in the logistic regression model.
Step six: inputting the blood indexes of the esophageal squamous cell carcinoma patients classified in the third step into a logistic regression model to obtain the probability value of the prognosis survival risk of the esophageal squamous cell carcinoma patients;
Step seven: Judge whether the probability value of the prognosis survival risk is greater than a threshold γ; if so, the prognosis survival risk is high, otherwise it is low. The threshold γ is the critical value between high risk and low risk obtained from the ROC curve.
According to the single-factor analysis of the blood indexes, five indexes (white blood cell count, monocyte count, neutrophil count, eosinophil count and total protein) are selected as characteristic variables for predictive modeling. The data are divided into a test set and a verification set; the test set is analyzed and used for modeling, and the validity and reliability of the model are then verified with the verification set.
The five predictor variables, leukocyte, monocyte, neutrophil, eosinophil and total protein, were included in the test set for multivariate regression analysis with results shown in table 6.
TABLE 6 multivariate logistic regression analysis
Figure BDA0002709923180000151
Further, the logistic regression model is established as follows:
logit(p) = 0.241 - 2.554 × white blood cell count - 0.453 × monocyte count + 1.012 × neutrophil count - 2.484 × eosinophil count - 0.527 × total protein
Further, the probability estimation formula can be obtained as follows:
Figure BDA0002709923180000152
where p denotes the probability that an esophageal squamous carcinoma patient is classified as low risk; logit(p) denotes the log odds of that probability; X1 denotes the value of the white blood cell count, X2 the value of the monocyte count, X3 the value of the neutrophil count, X4 the value of the eosinophil count and X5 the value of total protein; β0 = 0.241 denotes the constant term of the logistic regression model; and β1 = -2.554, β2 = -0.453, β3 = 1.012, β4 = -2.484 and β5 = -0.527 denote the coefficients corresponding to the variables X1 to X5. The ROC curve of the resulting multivariate probability prediction regression model for risk prediction is shown in FIG. 9; the accuracy of risk prediction is 86.1%, and the prediction effects at different cutoff values of the multivariate probability prediction model are shown in Table 7. The sensitivity is 79.7%, the specificity is 82.6%, Youden index = sensitivity + specificity - 1 = 0.797 + 0.826 - 1 = 0.623, and the corresponding optimal threshold is 0.046755. In other words, when the calculated probability value is less than the threshold the subject is at low prognostic risk, and otherwise the subject is at high prognostic risk.
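The fitted model can be evaluated with a short Python sketch. It uses the coefficients and the threshold 0.046755 stated in the text and the standard inverse-logit transform p = 1/(1 + exp(-logit(p))). It assumes that each input is the 0/1 group code from the ROC step (high group = 1, low group = 0); the dictionary keys and function names are hypothetical.

```python
import math

# Coefficients as stated in the text; inputs are assumed to be the 0/1 group codes.
INTERCEPT = 0.241
COEFFICIENTS = {
    "white_blood_cell": -2.554,
    "monocyte": -0.453,
    "neutrophil": 1.012,
    "eosinophil": -2.484,
    "total_protein": -0.527,
}
THRESHOLD = 0.046755  # optimal cutoff reported in the text

def predicted_probability(groups):
    """Inverse-logit of the linear predictor."""
    logit = INTERCEPT + sum(COEFFICIENTS[name] * groups[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-logit))

def risk_label(p, threshold=THRESHOLD):
    # Decision rule as stated in the text: below the threshold is low risk.
    return "low" if p < threshold else "high"

all_high = {name: 1 for name in COEFFICIENTS}  # every index in its high group
all_low = {name: 0 for name in COEFFICIENTS}   # every index in its low group
p_high = predicted_probability(all_high)
p_low = predicted_probability(all_low)
print(round(p_high, 4), risk_label(p_high))
print(round(p_low, 4), risk_label(p_low))
```

With five binary inputs the model can produce at most 32 distinct probability values, so the cutoff list in a table such as Table 7 is necessarily finite.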
TABLE 7 prediction Effect based on different cutoff values of the multivariate probability prediction model
Cutoff value | Sensitivity | 1-specificity | Youden index
0 1 1 0
0.003969 1 0.913 0.087
0.00665 1 0.87 0.13
0.01084 0.855 0.261 0.594
0.01373 0.826 0.261 0.565
0.02627 0.797 0.174 0.623
0.046755 0.71 0.087 0.623
0.057084 0.696 0.087 0.609
0.061088 0.652 0.043 0.609
0.077959 0.565 0.043 0.522
0.094332 0.507 0.043 0.464
0.097361 0.478 0.043 0.435
0.118622 0.449 0.043 0.406
0.142759 0.42 0.043 0.377
0.151804 0.406 0.043 0.363
0.185282 0.377 0.043 0.334
0.220002 0.362 0.043 0.319
0.274615 0.348 0.043 0.305
0.376103 0.319 0.043 0.276
0.438091 0.159 0.043 0.116
0.503579 0.13 0.043 0.087
0.616943 0.043 0.043 0
0.725873 0.014 0.043 -0.029
1 0 0 0
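The Youden index column of Table 7 can be recomputed from the other two columns, since sensitivity + specificity - 1 equals sensitivity - (1 - specificity). The sketch below copies only the rows of Table 7 around the peak; the function and variable names are hypothetical.

```python
# (cutoff, sensitivity, 1 - specificity) rows copied from Table 7 around the peak.
rows = [
    (0.00665, 1.0, 0.87),
    (0.01084, 0.855, 0.261),
    (0.01373, 0.826, 0.261),
    (0.02627, 0.797, 0.174),
    (0.046755, 0.71, 0.087),
    (0.057084, 0.696, 0.087),
    (0.061088, 0.652, 0.043),
]

def youden(sensitivity, false_positive_rate):
    # sensitivity + specificity - 1 == sensitivity - (1 - specificity)
    return round(sensitivity - false_positive_rate, 3)

scores = {cutoff: youden(sens, fpr) for cutoff, sens, fpr in rows}
best = max(scores.values())
best_cutoffs = [cutoff for cutoff, j in scores.items() if j == best]
print(best, best_cutoffs)
```

In these rows the maximum value 0.623 is reached at two cutoffs, 0.02627 and 0.046755, so listing every cutoff that attains the maximum is more informative than returning only the first.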
The prognostic nutritional index (PNI) model commonly used in medicine is as follows:
PNI = albumin + 5 × lymphocyte count
Risk prediction was performed on the test set with this model. As shown in FIG. 10, the accuracy of prognostic risk prediction is 57.9%, with a sensitivity of 61.9%, a specificity of 63.1%, Youden index = 0.619 + 0.631 - 1 = 0.250, and a corresponding optimal threshold of 51.75.
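The PNI formula is simple enough to state directly in Python. The sketch below assumes albumin in g/L and lymphocyte count in 10^9/L, the units listed for the blood indexes; the patient values and the function name are hypothetical, and the sketch only reports whether the value lies above the reported cutoff of 51.75 without assigning a risk label.

```python
def prognostic_nutritional_index(albumin_g_per_l, lymphocyte_count_1e9_per_l):
    """PNI = albumin (g/L) + 5 x lymphocyte count (10^9/L)."""
    return albumin_g_per_l + 5 * lymphocyte_count_1e9_per_l

PNI_THRESHOLD = 51.75  # optimal cutoff reported in the text

# Hypothetical patient values, for illustration only.
pni = prognostic_nutritional_index(42.0, 1.8)
above_cutoff = pni > PNI_THRESHOLD
print(pni, above_cutoff)
```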
And verifying the multivariate regression model and the PNI model in a verification set, and evaluating the prediction accuracy of the model.
The final classification validation of the multivariate probabilistic predictive regression model in the validation set is shown in table 8:
TABLE 8 Classification matrix
Figure BDA0002709923180000171
According to the classification matrix, 10 patients are actually low risk and are predicted low risk by the model; 4 are actually low risk but predicted high risk; 3 are actually high risk but predicted low risk; and 6 are actually high risk and predicted high risk.
Evaluation indexes of the prediction model are as follows:
Figure BDA0002709923180000172
Figure BDA0002709923180000173
Figure BDA0002709923180000174
Figure BDA0002709923180000175
Figure BDA0002709923180000176
Acc denotes the proportion of correctly classified results among all observations; PPV denotes the proportion predicted as low risk among all results that are actually low risk; NPV denotes the proportion predicted as high risk among all results that are actually high risk; R denotes the recall rate, the ratio of the number of samples both predicted and actually low risk to the number of samples predicted as low risk; F1-measure denotes the weighted harmonic mean of PPV and R.
According to the above formulas, the accuracy of the multivariate probability prediction model is Acc = 16/23 ≈ 69.57%; among all results that are actually low risk, the proportion predicted as low risk is PPV = 10/14 ≈ 71.43%; among all results that are actually high risk, the proportion predicted as high risk is NPV = 6/9 ≈ 66.67%; the ratio of the number of samples both predicted and actually low risk to the number of samples predicted as low risk, that is, the recall rate, is R = 10/13 ≈ 76.92%; and the weighted harmonic mean of PPV and R is F1-measure = 2 × 0.7143 × 0.7692/(0.7143 + 0.7692) ≈ 74.07%.
Similarly, the final classification result verification of the PNI model in the verification set is shown in table 9:
TABLE 9 Classification matrix
Figure BDA0002709923180000181
According to the classification matrix, 5 patients are actually low risk and are predicted low risk by the model; 9 are actually low risk but predicted high risk; 2 are actually high risk but predicted low risk; and 7 are actually high risk and predicted high risk.
Then, according to the evaluation formulas above, the accuracy of the PNI prediction model is Acc = 12/23 ≈ 52.17%; among all results that are actually low risk, the proportion predicted as low risk is PPV = 5/14 ≈ 35.71%; among all results that are actually high risk, the proportion predicted as high risk is NPV = 7/9 ≈ 77.78%; the ratio of the number of samples both predicted and actually low risk to the number of samples predicted as low risk, that is, the recall rate, is R = 5/7 ≈ 71.43%; and the weighted harmonic mean of PPV and R is F1-measure = 2 × 0.3571 × 0.7143/(0.3571 + 0.7143) ≈ 47.62%.
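The evaluation indexes for both models can be recomputed from the four cell counts of each classification matrix. The sketch below follows the definitions of Acc, PPV, NPV, R and F1-measure exactly as worded in the text, with low risk as the reference class; the function and argument names are hypothetical.

```python
def evaluate(low_as_low, low_as_high, high_as_low, high_as_high):
    """Metrics as defined in the text. Arguments are counts named 'actual_as_predicted'."""
    total = low_as_low + low_as_high + high_as_low + high_as_high
    acc = (low_as_low + high_as_high) / total
    ppv = low_as_low / (low_as_low + low_as_high)      # actual low risk that is predicted low
    npv = high_as_high / (high_as_low + high_as_high)  # actual high risk that is predicted high
    recall = low_as_low / (low_as_low + high_as_low)   # predicted low risk that is actually low
    f1 = 2 * ppv * recall / (ppv + recall)
    return {"Acc": acc, "PPV": ppv, "NPV": npv, "R": recall, "F1": f1}

multivariate = evaluate(10, 4, 3, 6)  # counts from Table 8
pni_model = evaluate(5, 9, 2, 7)      # counts from Table 9

for name, metrics in (("multivariate", multivariate), ("PNI", pni_model)):
    print(name, {k: round(100 * v, 2) for k, v in metrics.items()})
```

Computing F1 from the unrounded PPV and R gives 200/270 for the multivariate model and 1400/2940 for the PNI model, which agree with the 74.07% and 47.62% obtained from the rounded values in the text.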
Finally, the multivariate probability prediction model is compared with the PNI model to obtain the evaluation indexes, as shown in FIG. 11. By combining various risk evaluation indexes, the multivariate probability prediction model established by the invention has better prediction capability.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for predicting esophageal squamous carcinoma risk based on clinical phenotype and logistic regression analysis is characterized by comprising the following steps:
step one: acquiring clinical detection data of patients with esophageal squamous cell carcinoma, and screening out characteristic indexes with high classification correlation with patients with esophageal squamous cell carcinoma according to the clinical detection data;
step two: constructing a decision tree classifier according to the characteristic indexes with high classification correlation with esophageal squamous carcinoma patients;
step three: inputting the characteristic indexes of the esophageal squamous cell carcinoma patients to be classified into a decision tree classifier to obtain classification results of the esophageal squamous cell carcinoma patients;
step four: obtaining blood index information of a week before an esophageal squamous cell carcinoma patient operation, and screening out a blood index with high association with survival risk of the esophageal squamous cell carcinoma patient by constructing an ROC curve of the blood index information;
step five: constructing a logistic regression model according to blood indexes with high correlation with survival risks of esophageal squamous cell carcinoma patients;
step six: inputting the blood indexes of the esophageal squamous cell carcinoma patients classified in the third step into a logistic regression model to obtain the probability value of the prognosis survival risk of the esophageal squamous cell carcinoma patients;
step seven: and judging whether the probability value of the prognosis survival risk is greater than a threshold gamma, if so, the prognosis survival risk is high risk, otherwise, the prognosis survival risk is low risk, wherein the threshold gamma represents a critical value of high risk and low risk constructed by an ROC curve.
2. The esophageal squamous cancer risk prediction method based on clinical phenotype and logistic regression analysis as claimed in claim 1, wherein the indices in the clinical test data of esophageal squamous cancer patients comprise sex, pathological diagnosis, tumor location, tumor length, tumor width, tumor thickness, tumor type, pathological differentiation degree, tumor infiltration degree, negative, lymph node positive metastasis, T stage, N stage, M stage, TNM version.
3. The esophageal squamous cancer risk prediction method based on clinical phenotype and logistic regression analysis according to claim 2, wherein the method for screening the characteristic index with high classification correlation with esophageal squamous cancer patients according to clinical detection data comprises the following steps:
s11, calculating chi-square values of all indexes in clinical detection data, corresponding the chi-square values to a chi-square table one by one to obtain P values of all indexes, and screening out indexes with P <0.05 as preliminary characteristic indexes; wherein the primary characteristic indexes specifically refer to sex, pathological differentiation degree, tumor infiltration degree and lymph node positive metastasis;
s12, respectively calculating the information entropy of each preliminary characteristic index before attribute division and the information entropy after attribute division, and calculating the information gain of the preliminary characteristic index according to the information entropy before attribute division and the information entropy after attribute division;
s13, screening the preliminary characteristic indexes according to the information gain to obtain characteristic indexes with high classification correlation with esophageal squamous cell carcinoma patients; the characteristic indexes with high relevance to the classification of esophageal squamous cell carcinoma patients comprise tumor infiltration degree and lymph node positive metastasis.
4. The esophageal squamous cancer risk prediction method based on clinical phenotype and logistic regression analysis as claimed in claim 3, wherein the chi-squared value of all indexes in the clinical test data is calculated by:
Figure FDA0002709923170000021
wherein k denotes the index category, k ∈ {1, 2, …, n_k}; n_k denotes the total number of indices; χ_k² denotes the chi-square value of index k; i denotes the attribute category of the index, i ∈ {1, 2, …, m_k}; m_k denotes the total number of attribute categories of index k; j denotes the classification category of the esophageal squamous carcinoma patient, j ∈ {1, 2}; A_kij denotes the actual number of patients of index k with attribute value i belonging to the j-th class of esophageal squamous carcinoma patients; and T_kij denotes the theoretical number of patients of index k with attribute value i belonging to the j-th class of esophageal squamous carcinoma patients.
5. The method of claim 4, wherein the theoretical number T_kij of patients of index k with attribute value i belonging to the j-th class of esophageal squamous carcinoma patients is calculated as:
Figure FDA0002709923170000022
6. the esophageal squamous cancer risk prediction method based on clinical phenotype and logistic regression analysis as claimed in claim 3, wherein the calculation method of the information entropy before attribute classification is as follows:
Figure FDA0002709923170000023
wherein InfoBefore(H(x)) denotes the information entropy of the confirmed-diagnosis esophageal squamous cell carcinoma event x when the index category is not considered, H(x) denotes the information entropy of event x, P(x_j) denotes the probability that a patient belongs to the j-th class of esophageal squamous carcinoma events, j denotes the classification category of the esophageal squamous carcinoma patient, and j ∈ {1, 2};
the method for calculating the information entropy after attribute division comprises the following steps:
Figure FDA0002709923170000024
wherein InfoAfter(H(x_k)) denotes the information entropy of the confirmed-diagnosis esophageal squamous carcinoma event x_k when the index category is considered; k denotes the index category, k ∈ {1, 2, …, n_k}; n_k denotes the total number of indices; H(x_k) denotes the information entropy of event x_k; x_ki denotes the event that a patient whose attribute value for index k is i is diagnosed with esophageal squamous carcinoma; P(x_ki) denotes the probability of event x_ki; x_kij denotes the event that a patient whose attribute value for index k is i belongs to the j-th class; i denotes the attribute category of the current index, i ∈ {1, 2, …, m_i}; and m_i denotes the total number of attribute categories of the current index;
the method for calculating the information gain of the preliminary characteristic index comprises the following steps:
ΔH(x_k) = InfoBefore(H(x_k)) - InfoAfter(H(x_k)),
wherein ΔH(x_k) denotes the information gain of preliminary characteristic index k, and InfoBefore(H(x_k)) denotes the information entropy of event x_k when the index category is not considered.
7. The esophageal squamous cancer risk prediction method based on clinical phenotype and logistic regression analysis as claimed in claim 3, wherein the decision tree classifier is constructed by the following steps: and (3) taking the positive lymph node metastasis as a root node of the decision tree, and taking the tumor infiltration degree as a leaf node of the decision tree to construct a decision tree classifier.
8. The method of claim 1, wherein the blood index information includes white blood cell count, lymphocyte count, monocyte count, neutrophil count, eosinophil count, basophil count, erythrocyte count, hemoglobin concentration, platelet count, total protein, albumin, globulin, prothrombin time, activated partial thromboplastin time, and fibrinogen.
9. The esophageal squamous cancer risk prediction method based on clinical phenotype and logistic regression analysis according to claim 8, wherein the method for screening the blood index with high association with the survival risk of esophageal squamous cancer patients by constructing the ROC curve of the blood index information comprises the following steps:
respectively drawing ROC curves of all blood indexes in the blood index information, and obtaining the AUC and P' values of each blood index according to the ROC curves;
according to statistical theory, the area under an ROC curve lies between 0.5 and 1.0, and a blood index with AUC > 0.5 and P' < 0.05 is selected as a blood index highly correlated with the survival risk of the esophageal squamous cell carcinoma patient; the blood indices highly correlated with the survival risk of esophageal squamous cell carcinoma patients include white blood cell count, monocyte count, neutrophil count, eosinophil count and total protein.
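The screening rule of claim 9 can be sketched in Python as follows. The claim does not state how the P' value is obtained, so this sketch assumes a two-sided Mann-Whitney U test with a normal approximation and no tie correction. The names `auc_and_p` and `screen_indices` are hypothetical.

```python
import math


def auc_and_p(values, outcomes):
    """Return (AUC, p) for one blood index against binary outcomes (1 or 0).

    The AUC equals the Mann-Whitney U statistic divided by n1 * n0.
    The p-value uses a normal approximation without tie correction,
    which is an assumption of this sketch.
    """
    pos = [v for v, y in zip(values, outcomes) if y == 1]
    neg = [v for v, y in zip(values, outcomes) if y == 0]
    n1, n0 = len(pos), len(neg)
    # Count positive/negative pairs where the positive value is higher;
    # ties contribute one half.
    u = sum(1.0 if p > n else 0.5 if p == n else 0.0
            for p in pos for n in neg)
    auc = u / (n1 * n0)
    mean_u = n1 * n0 / 2
    sd_u = math.sqrt(n1 * n0 * (n1 + n0 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return auc, p_value


def screen_indices(indices, outcomes, alpha=0.05):
    """Keep the indices with AUC > 0.5 and p < alpha."""
    selected = []
    for name, values in indices.items():
        auc, p_value = auc_and_p(values, outcomes)
        if auc > 0.5 and p_value < alpha:
            selected.append(name)
    return selected
```

An index whose lower values accompany the positive outcome gives an AUC below 0.5 under this convention and would be rejected unless its sign is flipped first.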
10. The method of claim 9, wherein the logistic regression model is:
$$\operatorname{logit}(p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m$$
wherein $p$ denotes the probability that an esophageal squamous carcinoma patient is classified as low risk; $\operatorname{logit}(p) = \ln\frac{p}{1-p}$ denotes the log-odds of the patient being classified as low risk; $X_1$ denotes the value of the 1st variable, $X_2$ denotes the value of the 2nd variable, and $X_m$ denotes the value of the $m$-th variable; $m$ denotes the number of variables in the logistic regression model; $\beta_0$ denotes the constant term of the logistic regression model; and $\beta_1$, $\beta_2$, and $\beta_m$ denote the coefficients corresponding to the variables $X_1$, $X_2$, and $X_m$, respectively.
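To show how the model of claim 10 turns variable values into a probability, the sketch below inverts the logit with the logistic function. The function name `low_risk_probability` is hypothetical, and the patent's fitted coefficients are not reproduced here, so any coefficient values passed in are the caller's own.

```python
import math


def low_risk_probability(beta0, betas, xs):
    """Return p such that logit(p) = beta0 + sum(beta_i * x_i)."""
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

A linear predictor of 0 corresponds to p = 0.5, and a predictor of ln 3 corresponds to odds of 3 to 1, that is p = 0.75.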
CN202011052229.1A 2020-09-29 2020-09-29 Esophageal squamous carcinoma risk prediction system based on clinical phenotype and logistic regression analysis Active CN112185549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011052229.1A CN112185549B (en) 2020-09-29 2020-09-29 Esophageal squamous carcinoma risk prediction system based on clinical phenotype and logistic regression analysis

Publications (2)

Publication Number Publication Date
CN112185549A true CN112185549A (en) 2021-01-05
CN112185549B CN112185549B (en) 2022-08-02

Family

ID=73946978

Country Status (1)

Country Link
CN (1) CN112185549B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130041684A1 (en) * 2010-04-23 2013-02-14 Peter Kotanko System And Method Of Identifying When A Patient Undergoing Hemodialysis Is At Increased Risk Of Death By A Logistic Regression Model
TW201804348A (en) * 2016-07-29 2018-02-01 長庚醫療財團法人林口長庚紀念醫院 Method for analyzing cancer detection result by establishing cancer prediction model and combining tumor marker kits analyzing the cancer detection result by using the established cancer prediction model and combining the detection results of the tumor marker kits
WO2019197624A2 (en) * 2018-04-12 2019-10-17 Uea Enterprises Limited Improved classification and prognosis of prostate cancer
CN111128385A (en) * 2020-01-17 2020-05-08 河南科技大学第一附属医院 Prognosis early warning system for esophageal squamous carcinoma and application thereof
CN111710423A (en) * 2020-06-17 2020-09-25 上海市精神卫生中心(上海市心理咨询培训中心) Method for determining mood disorder morbidity risk probability based on regression model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
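Liu Mimi et al.: "Study of the C5.0 decision tree for early gastric cancer risk screening", Chinese Journal of Cancer Prevention and Treatment (中华肿瘤防治杂志) *
Sun Xiaoguang et al.: "Survival prediction of patients with gynecological malignant tumors", National Medical Journal of China (中华医学杂志) *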

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066577A (en) * 2021-03-02 2021-07-02 四川省肿瘤医院 Esophagus squamous carcinoma survival rate prediction system based on blood coagulation index
CN113066577B (en) * 2021-03-02 2024-01-26 四川省肿瘤医院 Esophageal squamous carcinoma survival rate prediction system based on coagulation index
CN112992368A (en) * 2021-04-09 2021-06-18 中山大学附属第三医院(中山大学肝脏病医院) Prediction model system and recording medium for prognosis of severe spinal cord injury
CN112992346A (en) * 2021-04-09 2021-06-18 中山大学附属第三医院(中山大学肝脏病医院) Method for establishing prediction model for prognosis of severe spinal cord injury
CN113270188A (en) * 2021-05-10 2021-08-17 北京市肿瘤防治研究所 Method and device for constructing prognosis prediction model of patient after esophageal squamous carcinoma radical treatment
CN113192632A (en) * 2021-05-24 2021-07-30 哈尔滨理工大学 Breast cancer classification method based on weighted association rule algorithm
CN113520319A (en) * 2021-07-12 2021-10-22 吾征智能技术(北京)有限公司 Epileptic event risk management method and system based on logistic regression
CN113851216A (en) * 2021-09-23 2021-12-28 首都医科大学附属北京天坛医院 Acute ischemic stroke clinical phenotype construction method, key biomarker screening method and application thereof
CN114496306A (en) * 2022-01-28 2022-05-13 北京大学口腔医学院 Machine learning-based prognosis survival stage prediction method and system
CN114724717A (en) * 2022-04-20 2022-07-08 山东大学齐鲁医院 Stomach early cancer high-risk screening system
CN114724717B (en) * 2022-04-20 2024-04-12 山东大学齐鲁医院 Stomach early cancer high risk screening system

Also Published As

Publication number Publication date
CN112185549B (en) 2022-08-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant