CN114639482A - IDPC and LASSO-based esophageal squamous carcinoma prognosis survival risk assessment method

Publication number
CN114639482A
Authority
CN
China
Prior art keywords
esophageal squamous
lasso
idpc
patients
survival
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210276812.3A
Other languages
Chinese (zh)
Inventor
凌丹
刘安浩
王延峰
王妍
孙军伟
栗朝松
王英聪
王立东
宋昕
赵学科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University of Light Industry
Original Assignee
Zhengzhou University of Light Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University of Light Industry
Priority to CN202210276812.3A
Publication of CN114639482A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides an IDPC and LASSO-based method for assessing the prognostic survival risk of esophageal squamous cell carcinoma, comprising the following steps: first, pathological data of esophageal squamous carcinoma patients are acquired, a decision tree is constructed from the important pathological factors determined by the chi-square test and information gain, and the patients are divided into an early group and a middle-to-late group; second, preoperative routine blood and biochemical indices are obtained for the patients in the early group and in the middle-to-late group, and the indices significantly related to postoperative survival risk are selected with LASSO; then, IDPC is used to cluster the patients of the early group and of the middle-to-late group into different clusters, and for each cluster a nomogram based on logistic regression (LR) is constructed to predict the survival risk of the patients; finally, the performance of the nomograms is evaluated using the confusion matrix and the area under the receiver operating characteristic curve (AUC). The invention can accurately judge the prognostic survival risk of esophageal squamous cell carcinoma patients and can help doctors make diagnostic decisions so as to provide effective treatment for the patient.

Description

IDPC and LASSO-based esophageal squamous carcinoma prognosis survival risk assessment method
Technical Field
The invention relates to the technical field of esophageal squamous cell carcinoma risk assessment, in particular to an IDPC and LASSO based esophageal squamous cell carcinoma prognosis survival risk assessment method.
Background
The TNM staging system proposed by the American Joint Committee on Cancer (AJCC) has been widely applied to prognosis prediction for patients with esophageal squamous cell carcinoma. However, the pathogenesis of esophageal squamous carcinoma is complex, and assessing the survival risk of diagnosed patients with the TNM staging system alone has limitations. Endoscopic examination can also be used to determine the survival risk of patients with esophageal squamous carcinoma, but it is expensive for the patient. Classifying survival risk from clinicopathological and routine blood examination information is a challenge for computer-aided systems. In recent years, a number of machine learning methods, such as neural networks, support vector machines and random forests, have been used to predict the prognostic survival of patients with esophageal squamous carcinoma. However, it is difficult for the user to inspect the internal structure of the nonlinear models these methods produce, and the importance of individual indices cannot be determined. Meanwhile, feature extraction and clustering of biological information remain difficult problems for researchers at home and abroad. The medical field therefore needs a method that can conveniently and intuitively identify the index factors influencing the prognostic survival risk of esophageal squamous cell carcinoma and accurately judge the prognostic risk.
Disclosure of Invention
Aiming at the defects in the background technology, the invention provides an IDPC and LASSO-based esophageal squamous cell carcinoma prognosis survival risk assessment method, which solves the technical problems of unclear internal structure, incomplete index variable screening and low prediction capability of the existing prediction model.
The technical scheme of the invention is realized as follows:
an IDPC and LASSO-based esophageal squamous carcinoma prognosis survival risk assessment method comprises the following steps:
step one: acquiring pathological data of esophageal squamous carcinoma patients;
step two: using the pathological data of the esophageal squamous carcinoma patients, constructing a decision tree from the important pathological factors determined by the chi-square test and information gain, and dividing the patients into an early group and a middle-to-late group;
step three: obtaining the preoperative routine blood and biochemical indices of the esophageal squamous carcinoma patients in the early group and in the middle-to-late group respectively, and selecting the indices significantly related to postoperative survival risk using the least absolute shrinkage and selection operator (LASSO);
step four: clustering the esophageal squamous carcinoma patients of the early group and of the middle-to-late group into different clusters respectively, using an improved density peak clustering algorithm based on cosine distance and K nearest neighbors (IDPC);
step five: for each cluster, constructing a nomogram based on a logistic regression model to predict the survival risk of the esophageal squamous cell carcinoma patients;
step six: evaluating the performance of the nomograms of step five using the confusion matrix and the area under the receiver operating characteristic (ROC) curve.
Preferably, the pathological data of the esophageal squamous carcinoma patient comprise sex, age, tumor size, differentiation degree, infiltration degree and lymph node metastasis.
Preferably, the chi-square test statistic is:
$$\chi^2 = \sum_{i=1}^{m_i}\sum_{j=1}^{m_j}\frac{(A_{ij}-T_{ij})^2}{T_{ij}}$$
where $m_i$ and $m_j$ denote the number of variable values and the number of sample classes, respectively, $i$ indexes the value of the variable, $j$ indexes the class of the esophageal squamous carcinoma patient sample, $A_{ij}$ is the actual (observed) count of samples whose variable takes value $i$ and that belong to class $j$, and $T_{ij}$ is the corresponding expected count, defined as:
Figure BDA0003556075450000022
Preferably, the information gain ratio is calculated as:
$$g_r = \frac{\Delta H}{\mathrm{InfoBefore}(H)}$$
where $g_r$ is the information gain ratio, $\Delta H$ is the information gain of the attribute, and InfoBefore(H) is the information entropy before attribute classification. The information gain ΔH is calculated as:
ΔH = InfoBefore(H) - InfoAfter(H);
where InfoAfter(H) is the information entropy after attribute classification;
the information entropy before attribute classification, InfoBefore(H), and the information entropy after attribute classification, InfoAfter(H), are calculated respectively as:
$$\mathrm{InfoBefore}(H) = -P(x_1)\log_2 P(x_1) - P(x_2)\log_2 P(x_2)$$
Figure BDA0003556075450000025
where $P(x_1)$ is the probability of event $x_1$, $P(x_2)$ is the probability of event $x_2$, and [Figure BDA0003556075450000026] is the probability of event [Figure BDA0003556075450000027].
Preferably, the common biochemical indices of blood of the esophageal squamous carcinoma patient include White Blood Cell Count (WBCC), lymphocyte count (LYC), monocyte count (MOC), neutrophil count (NEC), eosinophil count (EOS), basophil count (BAC), erythrocyte count (ERY), Hemoglobin (HGB), platelet count (THC), Total Protein (TP), Albumin (ALB), Globulin (GLO), Prothrombin Time (PT), Activated Partial Thromboplastin Time (APTT), Thrombin Time (TT), Fibrinogen (FIB).
Preferably, the indices significantly related to postoperative survival risk are selected with the least absolute shrinkage and selection operator (LASSO) as follows:
$$\hat{\beta} = \arg\min_{\beta}\left\{\lVert Y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1\right\}$$
where $Y$ is the $n \times 1$ vector of actual values corresponding to the samples $X$, $X$ is the $n \times p$ matrix of input samples for the LASSO regression, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the $p \times 1$ vector of regression coefficients, $\lambda\lVert\beta\rVert_1$ is the penalty term, and $\lambda > 0$ is a tuning parameter that balances the penalty term against the empirical risk.
Preferably, in step four, the minimum distance $\delta_i$ between a data point and the cluster center in the improved density peak clustering algorithm based on cosine distance and K nearest neighbors is calculated as:
Figure BDA0003556075450000033
where $\rho_{i'}$ is the local density, $d_{i'j''}$ is the cosine distance between $x_{i'}$ and $x_{j''}$, $x_{i'}$ denotes the $i'$-th patient sample, $x_{j''}$ denotes the $j''$-th patient sample, and $N$ is the number of patient samples;
the cosine distance $d_{i'j''}$ between $x_{i'}$ and $x_{j''}$ is calculated as:
Figure BDA0003556075450000034
where $x_{i'a}$ is the value of feature $a$ in sample $x_{i'}$, $x_{j''a}$ is the value of feature $a$ in sample $x_{j''}$, and $L$ is the number of features;
the local density $\rho_{i'}$ is calculated as:
Figure BDA0003556075450000035
where $kNN(x_{i'})$ is the set of K nearest neighbors of $x_{i'}$;
the K-nearest-neighbor set $kNN(x_{i'})$ of $x_{i'}$ is calculated as:
$kNN(x_{i'}) = \{x_{j''} \in X \mid d(x_{i'}, x_{j''}) \le d(x_{i'}, NN_k(x_{i'}))\}$;
where $d(x_{i'}, x_{j''})$ is the cosine distance between $x_{i'}$ and $x_{j''}$, and $NN_k(x_{i'})$ is the $k$-th nearest neighbor of $x_{i'}$.
Preferably, the logistic regression model is:
Figure BDA0003556075450000036
Figure BDA0003556075450000041
wherein,
Figure BDA0003556075450000042
$p(y'=1 \mid x')$ is the probability that the input variable $x'$ belongs to the positive class, $p(y'=0 \mid x')$ is the probability that the input variable $x'$ belongs to the negative class, $\alpha_0$ is a constant, and $\alpha_1$ is the regression coefficient of the input variable $x'$.
Preferably, the constant $\alpha_0$ and the regression coefficient $\alpha_1$ in the logistic regression model are calculated as:
Figure BDA0003556075450000043
where $L(\alpha_0, \alpha_1)$ is the likelihood function used to estimate $\alpha_0$ and $\alpha_1$, $t = 1, 2, \ldots, n$, $n$ is the number of samples, and $y_t'$ is the predicted value for the input variable $x_t'$.
Preferably, the nomogram based on the logistic regression model is constructed as follows: a score is assigned to each value level of each influencing factor according to the size of its regression coefficient in the logistic regression model, and the scores are added to obtain a total score; the 3-year and 5-year survival probabilities are then read off by locating the total score on the total-score scale.
Compared with the prior art, the invention has the following beneficial effects:
1) The invention screens out the pathological indices significantly related to the survival risk of esophageal squamous cell carcinoma by the chi-square test and information entropy and constructs a decision tree, so that esophageal squamous cell carcinoma patients are effectively divided into an early group and a middle-to-late group.
2) The early group and the middle-to-late group are each divided into different clusters by combining LASSO regression analysis with IDPC, which lays the groundwork for constructing a high-accuracy survival risk prediction model for esophageal squamous cell carcinoma patients.
3) For the different patient clusters, nomogram models based on logistic regression are constructed, providing users with an accurate, intuitive and easy-to-use prognostic survival risk assessment system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a decision tree that considers the degree of lymph node metastasis and infiltration;
FIG. 3 shows the selection of the tuning parameter λ in LASSO by the minimum criterion, where (a) is the LASSO regression analysis for patients with early esophageal squamous carcinoma and (b) is the LASSO regression analysis for patients with middle and late esophageal squamous carcinoma;
FIG. 4 is a decision graph of DPC algorithm for patients with early esophageal squamous carcinoma, (a) is a decision graph of DPC-LASSO, and (b) is an IDPC-LASSO decision graph with cosine distance and KNN;
FIG. 5 is a decision map of the DPC algorithm for patients with middle and advanced esophageal cancer, (a) is a decision map of DPC-LASSO, (b) is an IDPC-LASSO decision map with cosine distance and KNN;
FIG. 6 is a nomogram model of patients with esophageal squamous carcinoma, wherein (a) is a nomogram for predicting 5-year survival probability of patients with early esophageal squamous carcinoma cluster 1, (b) is a nomogram for predicting 5-year survival probability of patients with early esophageal squamous carcinoma cluster 2, (c) is a nomogram for predicting 3-year survival probability of patients with intermediate and late esophageal squamous carcinoma cluster 1, and (d) is a nomogram for predicting 3-year survival probability of patients with intermediate and late esophageal squamous carcinoma cluster 2;
FIG. 7 shows the results of model comparisons of different clustering algorithms, where (a) is the ROC curve based on the survival risk model of early esophageal squamous carcinoma patients, and (b) is the ROC curve of the survival risk model of middle and late esophageal squamous carcinoma patients;
FIG. 8 is a graph of the results of different model tests based on patients with early esophageal squamous carcinoma;
FIG. 9 is a graph of the results of different model tests based on patients with middle and advanced esophageal squamous carcinoma.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides an IDPC and LASSO-based esophageal squamous cell carcinoma prognosis survival risk assessment method, which includes the following specific steps:
Step one: patients who do not meet the inclusion criteria are removed, and pathological data of the esophageal squamous carcinoma patients are acquired. The data set of this embodiment comprises 418 samples in total. According to the Union for International Cancer Control (UICC) standard, the following samples were excluded: (a) patients who also had other malignancies; (b) patients whose surgery was not performed successfully; (c) patients who died of heart disease, lung cancer, liver cancer or acute infection; (d) patients with incomplete follow-up data and unknown prognosis. The pathological data acquired for the esophageal squamous carcinoma patients include sex, age, tumor size, degree of differentiation, degree of infiltration and lymph node metastasis.
Step two: using the pathological data of the esophageal squamous carcinoma patients, a decision tree is constructed from the important pathological factors determined by the chi-square test and information gain, and the patients are divided into an early group and a middle-to-late group;
the chi-square test statistic is:
$$\chi^2 = \sum_{i=1}^{m_i}\sum_{j=1}^{m_j}\frac{(A_{ij}-T_{ij})^2}{T_{ij}}$$
where $m_i$ and $m_j$ denote the number of variable values and the number of sample classes, respectively, $i$ indexes the value of the variable, $j$ indexes the class of the esophageal squamous carcinoma patient sample, $A_{ij}$ is the actual (observed) count of samples whose variable takes value $i$ and that belong to class $j$, and $T_{ij}$ is the corresponding expected count, defined as:
Figure BDA0003556075450000061
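The chi-square statistic described above can be sketched in a few lines of Python. This is only an illustration: the definition of the expected count is given in the patent as a figure that is not reproduced in the text, so the sketch assumes the usual independence form (row total times column total divided by the grand total), and the 2x2 table is hypothetical rather than taken from Table 1.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, a_ij in enumerate(row):
            # ASSUMPTION: expected count under independence,
            # row total * column total / grand total.
            t_ij = row_totals[i] * col_totals[j] / total
            stat += (a_ij - t_ij) ** 2 / t_ij
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical table: rows are the values of one pathological variable,
# columns are the early group and the middle-to-late group.
table = [[30, 10], [20, 40]]
stat, dof = chi_square(table)
print(round(stat, 4), dof)
```

The statistic is then compared with the chi-square distribution with `dof` degrees of freedom to obtain the P value reported for each pathological factor.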
The information gain ratio is calculated as:
$$g_r = \frac{\Delta H}{\mathrm{InfoBefore}(H)}$$
where $g_r$ is the information gain ratio, $\Delta H$ is the information gain of the attribute, and InfoBefore(H) is the information entropy before attribute classification;
the information gain ΔH is calculated as:
ΔH = InfoBefore(H) - InfoAfter(H);
where InfoAfter(H) is the information entropy after attribute classification;
the information entropy before attribute classification, InfoBefore(H), and the information entropy after attribute classification, InfoAfter(H), are calculated respectively as:
$$\mathrm{InfoBefore}(H) = -P(x_1)\log_2 P(x_1) - P(x_2)\log_2 P(x_2)$$
Figure BDA0003556075450000064
where $P(x_1)$ is the probability of event $x_1$, $P(x_2)$ is the probability of event $x_2$, and [Figure BDA0003556075450000065] is the probability of event [Figure BDA0003556075450000066].
Clinical records were collected for 418 patients with esophageal squamous carcinoma, of whom 115 (27.5%) were in the early stage and 303 (72.5%) were in the middle-to-late stage. These diagnostic records were saved in text format. 260 patients (62.2%) were male and 158 (37.8%) were female. The tumor was located in the upper thoracic segment in 79 cases (18.9%), in the middle thoracic segment in 279 cases (66.7%), and in the lower thoracic segment in 60 cases (14.4%). 26 tumors (6.2%) were highly differentiated, 224 (53.6%) were moderately differentiated, and 168 (40.2%) were poorly differentiated. The tumor had infiltrated the mucosal layer in 14 patients (3.3%), the submucosa in 34 patients (8.1%), the muscular layer in 111 patients (26.6%), and the fibrous layer in 259 patients (62.0%). 204 cases (48.8%) were negative for lymph node metastasis and 214 cases (51.2%) were positive. Table 1 shows the chi-square test results for the pathological features of the esophageal squamous carcinoma patients. As can be seen from Table 1, stage (early versus middle-to-late) was significantly associated with sex (P < 0.001), tumor size (P < 0.001), degree of differentiation (P < 0.001), degree of infiltration (P < 0.001) and lymph node metastasis (P < 0.001). Table 1 also shows that stage was not associated with tumor site (P = 0.227) or age (P = 0.642).
TABLE 1 Chi-square test results of pathological characteristics of patients with esophageal squamous carcinoma
Figure BDA0003556075450000071
Table 2 shows the information entropy analysis of the significant pathological factors of early and middle-to-late esophageal squamous carcinoma patients. As can be seen from Table 2, lymph node metastasis (H = 0.6036, ΔH = 0.2451, g_r = 28.88%) and degree of infiltration (H = 0.5099, ΔH = 0.3388, g_r = 39.92%) are the important factors separating early from middle-to-late esophageal squamous carcinoma patients and are used to construct the decision tree.
TABLE 2 Information entropy analysis of significant pathological factors of early and middle-to-late esophageal squamous carcinoma patients
Figure BDA0003556075450000072
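The values quoted from Table 2 can be checked numerically. The sketch below assumes base-2 logarithms and assumes that the H values quoted in the text are the entropies after splitting on the attribute; under those assumptions the entropy of the 115 / 303 stage split and the quoted H values reproduce the reported ΔH and g_r. The function names are illustrative only.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain_and_ratio(info_before, info_after):
    """Information gain and gain ratio as described in the text."""
    delta_h = info_before - info_after
    return delta_h, delta_h / info_before


# 115 early-stage and 303 middle-to-late-stage patients (from the text).
info_before = entropy([115, 303])

# H values quoted in the text, ASSUMED to be the entropy after the split.
dh_lymph, gr_lymph = gain_and_ratio(info_before, 0.6036)
dh_infil, gr_infil = gain_and_ratio(info_before, 0.5099)

print(round(info_before, 4))                  # entropy before splitting
print(round(dh_lymph, 4), round(gr_lymph, 4)) # lymph node metastasis
print(round(dh_infil, 4), round(gr_infil, 4)) # degree of infiltration
```

With these inputs the gain ratios come out at about 0.289 and 0.399, matching the 28.88% and 39.92% reported for lymph node metastasis and degree of infiltration.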
The decision tree for distinguishing early from middle-to-late stage esophageal squamous carcinoma (ESCC) patients is shown in FIG. 2. The model was evaluated by 10-fold cross-validation. The entire cohort was randomly divided into 10 sub-cohorts; the predictive model was first fitted on 80% of the population (training set), and the remaining 20% (test set) was used to evaluate the performance of the decision tree. The accuracy of the decision model reached 95.2%, which helps distinguish early-stage ESCC patients from middle-to-late-stage ESCC patients, so that the survival risk of the different populations can then be calculated.
Step three: the preoperative routine blood and biochemical indices of the esophageal squamous carcinoma patients in the early group and in the middle-to-late group are obtained respectively, and the indices significantly related to postoperative survival risk are selected using the least absolute shrinkage and selection operator (LASSO). The routine blood and biochemical indices of the esophageal squamous carcinoma patients include white blood cell count (WBCC), lymphocyte count (LYC), monocyte count (MOC), neutrophil count (NEC), eosinophil count (EOS), basophil count (BAC), erythrocyte count (ERY), hemoglobin (HGB), platelet count (THC), total protein (TP), albumin (ALB), globulin (GLO), prothrombin time (PT), activated partial thromboplastin time (APTT), thrombin time (TT) and fibrinogen (FIB).
Based on the constructed decision tree, 418 patients were divided into two groups, an early group and a middle and late group. Each group was divided into two independent queues at an 8:2 ratio, with 80% used as the training set and 20% used as the test set. All statistical analyses were considered significant at a two-tailed P < 0.1. The study of early esophageal squamous carcinoma included 115 patients. According to the follow-up survival time and the 5-year survival probability, patients with early esophageal squamous carcinoma are divided into two types, namely high-risk patients and low-risk patients. The study of middle and advanced esophageal squamous carcinoma included 303 patients. Patients in the middle and late stages are also classified into two categories, high risk and low risk, according to follow-up survival time and 3-year survival probability.
The indices significantly related to postoperative survival risk are selected with the least absolute shrinkage and selection operator (LASSO) as follows:
$$\hat{\beta} = \arg\min_{\beta}\left\{\lVert Y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1\right\}$$
where $Y$ is the $n \times 1$ vector of actual values corresponding to the samples $X$, $X$ is the $n \times p$ matrix of input samples for the LASSO regression, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the $p \times 1$ vector of regression coefficients, $\lambda\lVert\beta\rVert_1$ is the penalty term, and $\lambda > 0$ is a tuning parameter that balances the penalty term against the empirical risk.
FIG. 3(a) shows the LASSO regression analysis for patients with early esophageal squamous carcinoma. At the most suitable tuning parameter, λ = 0.1256, the non-zero coefficients retained by the LASSO analysis identify the 5 most important indices in the final model: MOC, ALB, PT, NEC and ERY.
FIG. 3(b) shows the LASSO regression analysis for patients with middle and late esophageal squamous carcinoma. At the most suitable tuning parameter, λ = 0.1249, the non-zero coefficients retained by the LASSO analysis identify the 3 most important indices in the final model: PT, WBCC and ALB.
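How LASSO drives weak coefficients exactly to zero, leaving only the retained indices, can be illustrated with a small coordinate-descent sketch. The patent does not state which solver or objective scaling was used; the version below assumes the common form (1/2n)·||y − Xβ||² + λ·||β||₁ with no intercept, so its λ is not directly comparable with the values 0.1256 and 0.1249 above. The data and function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z towards zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.

    ASSUMPTION: this objective scaling and the absence of an intercept are
    illustrative choices, not taken from the patent.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_wo_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return beta


# Hypothetical standardized data: the first index carries the signal,
# the second contributes only a small effect.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.1, 2.9, -2.9, -3.1]
beta = lasso_cd(X, y, lam=0.5)
print(beta)  # the weak second coefficient is shrunk exactly to zero
```

Indices whose coefficients remain non-zero at the chosen λ are the ones kept for the later clustering and regression steps.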
Step four: the esophageal squamous carcinoma patients of the early group and of the middle-to-late group are clustered into different clusters respectively, using an improved density peak clustering (DPC) algorithm based on cosine distance and K nearest neighbors (IDPC);
in step four, the minimum distance $\delta_i$ between a data point and the cluster center in the improved density peak clustering algorithm based on cosine distance and K nearest neighbors is calculated as:
Figure BDA0003556075450000091
where $\rho_{i'}$ is the local density, $d_{i'j''}$ is the cosine distance between $x_{i'}$ and $x_{j''}$, $x_{i'}$ denotes the $i'$-th patient sample, $x_{j''}$ denotes the $j''$-th patient sample, and $N$ is the number of patient samples;
the cosine distance $d_{i'j''}$ between $x_{i'}$ and $x_{j''}$ is calculated as:
Figure BDA0003556075450000092
where $x_{i'a}$ is the value of feature $a$ in sample $x_{i'}$, $x_{j''a}$ is the value of feature $a$ in sample $x_{j''}$, and $L$ is the number of features;
the local density $\rho_{i'}$ is calculated as:
Figure BDA0003556075450000093
where $kNN(x_{i'})$ is the set of K nearest neighbors of $x_{i'}$;
the K-nearest-neighbor set $kNN(x_{i'})$ of $x_{i'}$ is calculated as:
$kNN(x_{i'}) = \{x_{j''} \in X \mid d(x_{i'}, x_{j''}) \le d(x_{i'}, NN_k(x_{i'}))\}$;
where $d(x_{i'}, x_{j''})$ is the cosine distance between $x_{i'}$ and $x_{j''}$, and $NN_k(x_{i'})$ is the $k$-th nearest neighbor of $x_{i'}$.
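The quantities that feed the decision graph (local density ρ and distance δ) can be sketched as follows. The exact formulas are given in the patent as figures that are not reproduced in the text, so three assumptions are made here and should be read as such: cosine distance is taken as one minus cosine similarity, the local density is taken as exp(−mean distance to the k nearest neighbours), and δ follows the standard density peak definition (distance to the nearest point of higher density, or the largest distance for the densest point). The sample data are hypothetical.

```python
import math


def cosine_distance(u, v):
    """ASSUMPTION: cosine distance = 1 - cosine similarity (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)


def idpc_decision_values(samples, k):
    """Return (rho, delta) for each sample, the two axes of the decision graph."""
    n = len(samples)
    d = [[cosine_distance(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    rho = []
    for i in range(n):
        knn = sorted(d[i][j] for j in range(n) if j != i)[:k]
        # ASSUMPTION: density decays with the mean distance to the k nearest neighbours.
        rho.append(math.exp(-sum(knn) / k))
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        # Standard DPC rule: nearest point of higher density; the densest
        # point(s) instead take their largest distance to any other point.
        delta.append(min(higher) if higher else max(d[i]))
    return rho, delta


# Hypothetical samples forming two groups that differ in direction.
samples = [(1.0, 0.0), (1.0, 0.05), (1.0, -0.05),
           (0.0, 1.0), (0.05, 1.0), (-0.05, 1.0)]
rho, delta = idpc_decision_values(samples, k=2)
gamma = [r * dl for r, dl in zip(rho, delta)]
centers = sorted(range(len(samples)), key=lambda i: gamma[i], reverse=True)[:2]
print(sorted(centers))
```

Points with both large ρ and large δ stand out in the upper right of the decision graph and are taken as cluster centers; each remaining point is then assigned to the cluster of its nearest higher-density neighbour.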
Early-stage patients: early esophageal squamous carcinoma patients were clustered with the DPC and IDPC algorithms respectively, based on the 5 important indices determined by the LASSO algorithm. The clustering results of DPC and IDPC are shown in FIG. 4. FIG. 4(a) shows the decision graph of the DPC algorithm based on Euclidean distance. Only one cluster center appears in the upper right corner, so the patient samples cannot be clustered effectively. The decision graph of the IDPC algorithm with cosine distance and KNN is shown in FIG. 4(b). Two samples have significantly larger ρ and δ, indicating that the early esophageal squamous carcinoma patients are divided into two classes. In the upper right corner there are two cluster centers: the center of cluster 1 is MOC = 0.4, NEC = 2.2, ERY = 4.13, ALB = 40, PT = 7.1, and the center of cluster 2 is MOC = 0.5, NEC = 5.9, ERY = 4.35, ALB = 49, PT = 12.9.
Middle-to-late-stage patients: middle and late esophageal squamous carcinoma patients were clustered with the DPC and IDPC algorithms respectively, based on the 3 important indices determined by the LASSO algorithm. Again, the DPC algorithm cannot cluster the patient samples effectively (FIG. 5(a)). Two samples in FIG. 5(b) have significantly larger ρ and δ, indicating that the IDPC algorithm can divide the patients into two classes. The center of cluster 1 is PT = 10.9, WBCC = 5, ALB = 37, and the center of cluster 2 is PT = 11.1, WBCC = 8, ALB = 50.
By comparing fig. 4 and 5, it can be observed that the distance δ and the cluster density ρ obtained by the cosine distance and KNN are sufficiently large, which facilitates the IDPC algorithm to construct two centers for early and middle-late esophageal squamous carcinoma patients, respectively. The result shows that the proposed IDPC algorithm based on cosine distance and KNN can improve the clustering capability of the DPC algorithm.
Step five: for each cluster, constructing a nomogram based on a Logistic Regression (LR) model to predict survival risk of esophageal squamous cell carcinoma patients;
the logistic regression model is as follows:
Figure BDA0003556075450000101
Figure BDA0003556075450000102
wherein,
Figure BDA0003556075450000103
$p(y'=1 \mid x')$ is the probability that the input variable $x'$ belongs to the positive class, $p(y'=0 \mid x')$ is the probability that the input variable $x'$ belongs to the negative class, $\alpha_0$ is a constant, and $\alpha_1$ is the regression coefficient of the input variable $x'$.
The constant $\alpha_0$ and the regression coefficient $\alpha_1$ in the logistic regression model are calculated as:
Figure BDA0003556075450000104
where $L(\alpha_0, \alpha_1)$ is the likelihood function used to estimate $\alpha_0$ and $\alpha_1$, $t = 1, 2, \ldots, n$, $n$ is the number of samples, and $y_t'$ is the predicted value for the input variable $x_t'$.
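A minimal sketch of fitting a single-variable logistic regression by maximizing the likelihood is given below. The patent's own formulas are figures that are not reproduced in the text, so the sketch assumes the standard logistic form p(y' = 1 | x') = 1 / (1 + exp(−(α0 + α1·x'))) and the usual Bernoulli log-likelihood; plain gradient ascent is an illustrative choice of optimizer, and the data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(a0, a1, xs, ys):
    """Bernoulli log-likelihood of (a0, a1) under the ASSUMED standard logistic model."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a0 + a1 * x)
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return ll


def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """Estimate a0 and a1 by gradient ascent on the mean log-likelihood."""
    a0, a1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        errors = [y - sigmoid(a0 + a1 * x) for x, y in zip(xs, ys)]
        a0 += lr * sum(errors) / n
        a1 += lr * sum(e * x for e, x in zip(errors, xs)) / n
    return a0, a1


# Hypothetical standardized index values and risk labels (1 = positive class).
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, -1.5, 1.5, 0.2]
ys = [0, 0, 1, 0, 1, 1, 1, 0, 1, 0]
a0, a1 = fit_logistic(xs, ys)
print(round(a0, 3), round(a1, 3))
```

The fitted coefficients play the role of α0 and α1 above; with several indices the same procedure applies with one coefficient per index.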
Based on the blood routine biochemical examination indexes which are obviously related to the survival risk of early and middle and late esophageal squamous cell carcinoma patients, LR models are respectively established for different clusters of early and middle and late esophageal squamous cell carcinoma patients. The LR model for esophageal squamous carcinoma patients is shown in table 3.
TABLE 3 LR model of esophageal squamous carcinoma patients
Figure BDA0003556075450000105
Figure BDA0003556075450000111
The nomogram is constructed as follows: a score is assigned to each value level of each influencing factor according to the size of its regression coefficient in the multi-factor LR model, and the scores are added to obtain a total score. The 3-year and 5-year survival probabilities are then read off by locating the total score on the total-score scale. Since most early esophageal squamous carcinoma patients survive more than 5 years while most middle and late esophageal squamous carcinoma patients survive less than 5 years, the 5-year survival risk is predicted for early patients and the 3-year survival risk for middle and late patients. FIG. 6 shows the nomogram models for esophageal squamous carcinoma patients, where (a) is the nomogram predicting the 5-year survival probability of cluster 1 of early patients, (b) is the nomogram predicting the 5-year survival probability of cluster 2 of early patients, (c) is the nomogram predicting the 3-year survival probability of cluster 1 of middle and late patients, and (d) is the nomogram predicting the 3-year survival probability of cluster 2 of middle and late patients.
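The scoring idea behind a nomogram can be sketched as follows. This uses one common convention, which is an assumption rather than the patent's stated procedure: the factor with the largest effect (|coefficient| × value range) spans 0 to 100 points, the other factors are scaled proportionally, and the total score is mapped back to a probability through the logistic function. All coefficients, ranges and values are hypothetical and are not taken from Table 3.

```python
import math


def nomogram_points(coefs, ranges, values):
    """Points per factor, total points, and the effect that corresponds to 100 points."""
    max_effect = max(abs(b) * (hi - lo) for b, (lo, hi) in zip(coefs, ranges))
    points = []
    for b, (lo, hi), v in zip(coefs, ranges, values):
        ref = lo if b >= 0 else hi  # value that scores zero points
        points.append(100.0 * b * (v - ref) / max_effect)
    return points, sum(points), max_effect


def probability_from_total(total, intercept, coefs, ranges, max_effect):
    """Map a total score back to the model probability of the positive class."""
    base = intercept + sum(b * (lo if b >= 0 else hi)
                           for b, (lo, hi) in zip(coefs, ranges))
    linear_predictor = base + total * max_effect / 100.0
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical two-factor model.
intercept = 1.0
coefs = [0.8, -0.05]
ranges = [(0.0, 1.0), (30.0, 50.0)]
values = [0.5, 40.0]

points, total, max_effect = nomogram_points(coefs, ranges, values)
prob = probability_from_total(total, intercept, coefs, ranges, max_effect)
print([round(p, 1) for p in points], round(total, 1), round(prob, 4))
```

Because the total score is a rescaled linear predictor, reading a probability off the total-score axis gives the same result as evaluating the logistic model directly.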
Step six: the performance of the nomograms in step five was evaluated using the confusion matrix and the area under the subject operating characteristic curve (ROC) curve.
The confusion matrix is shown in table 4. Wherein True Positive (TP) indicates that the predicted result is positive and the actual result is positive; false Positive (FP) indicates that the predicted result is positive, while the actual result is negative; true Negative (TN) means that the predicted result is negative, while the actual result is negative; false Negatives (FN) indicate negative results in the prediction and positive results in the actual results.
TABLE 4 confusion matrix
Figure BDA0003556075450000112
Four classification indices are defined from the parameters of the confusion matrix to evaluate the performance of the classification model, namely accuracy (Acc), Positive Predictive Value (PPV), recall (R) and F1-score, defined as follows:
Figure BDA0003556075450000113
Figure BDA0003556075450000114
Figure BDA0003556075450000115
Figure BDA0003556075450000116
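As a hedged illustration of the four indices, the sketch below computes them from confusion-matrix counts using the standard textbook definitions. It is assumed, not confirmed, that the formulas in the figures above match these definitions; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix indices (assumed to match the patent's figures)."""
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    ppv = tp / (tp + fp)                     # positive predictive value
    recall = tp / (tp + fn)                  # recall
    f1 = 2 * ppv * recall / (ppv + recall)   # harmonic mean of PPV and recall
    return {"Acc": acc, "PPV": ppv, "R": recall, "F1": f1}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```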
The ROC curves of the models on the test set are compared in FIG. 7. As shown in FIG. 7(a), for early-stage esophageal squamous carcinoma patients, the LASSO-IDPC-LR model achieves an AUC of 0.881 (95% CI: 0.779-0.983) on cluster 1, which is 0.014 higher than that of the LASSO-Kmeans-LR model, and an AUC of 0.873 (95% CI: 0.776-0.970) on cluster 2, which is 0.067 higher than that of the LASSO-Kmeans-LR model. As shown in FIG. 7(b), for middle-to-late-stage patients, the LASSO-IDPC-LR model achieves an AUC of 0.802 (95% CI: 0.722-0.883) on cluster 1, which is 0.104 higher than that of the LASSO-Kmeans-LR model, and an AUC of 0.774 (95% CI: 0.680-0.869) on cluster 2, which is 0.095 higher than that of the LASSO-Kmeans-LR model. On all four clusters, therefore, the IDPC algorithm outperforms the Kmeans algorithm in assessing the prognostic survival risk of esophageal squamous carcinoma patients.
Based on the confusion-matrix evaluation indices, the test-set results of the models are shown in FIGS. 8 and 9. For early-stage esophageal squamous carcinoma patients, the performance of the different models is shown in FIG. 8. The TNM staging system performs worst, with Acc, R, PPV and F1-score of 75.3%, 50.8%, 60.8% and 0.554, respectively. The LASSO-IDPC-LR model performs best when compared with the LASSO-Kmeans-LR model, the LASSO-LR model and the TNM staging system, with Acc, R, PPV and F1-score of 81.4%, 83.7%, 80.4% and 0.82, respectively; relative to the LASSO-Kmeans-LR model, these are improvements of 8.9%, 14.9%, 9% and 0.118, respectively. For middle-to-late-stage esophageal squamous carcinoma patients, the performance of the different models is shown in FIG. 9. As can be seen from FIG. 9, the proposed LASSO-IDPC-LR model again outperforms the other models, namely the LASSO-Kmeans-LR model, the LASSO-LR model and the TNM staging system, with Acc, R, PPV and F1-score of 75.1%, 67.6%, 75.3% and 0.712, respectively.
As can be seen from FIGS. 8 and 9, the conventional TNM staging method has difficulty accurately predicting the survival risk of esophageal squamous carcinoma patients. The results show that combining the LASSO algorithm with the LR model improves the predictive performance of the LR model, and that introducing the IDPC algorithm improves it further. In addition, the IDPC algorithm clusters better than the Kmeans algorithm. In summary, the method first selects important indicators using the LASSO algorithm, then clusters the patients using the IDPC algorithm, and finally establishes multiple linear prognostic evaluation models using LR.
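For readers unfamiliar with how an AUC value such as those above is obtained, the following minimal sketch computes it from predicted scores with the rank-based (Mann-Whitney) formulation. The text does not state how the AUC or its 95% CI was computed, so this is an assumption for illustration; the function name and the data are hypothetical.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a sort-based implementation would be preferable for large test sets.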
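The IDPC clustering step relies on cosine distance and K-nearest-neighbor sets. The sketch below illustrates those two building blocks only. It assumes the common definition of cosine distance as one minus cosine similarity, which may differ from the patent's own formula, and it excludes a sample from its own neighbor set; the function names and sample vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity of two feature vectors (assumed definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def knn_set(i, samples, k):
    """Indices j whose distance to sample i is at most that of its k-th neighbor."""
    dists = sorted(
        (cosine_distance(samples[i], x), j)
        for j, x in enumerate(samples)
        if j != i  # the sample itself is left out of its neighbor set
    )
    kth = dists[k - 1][0]
    return [j for d, j in dists if d <= kth]


# Made-up two-feature samples, for illustration only.
samples = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (-1.0, 0.0)]
print(knn_set(0, samples, 2))
```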
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. An IDPC and LASSO based esophageal squamous carcinoma prognosis survival risk assessment method, characterized by comprising the following steps:
step one: acquiring pathological data of esophageal squamous carcinoma patients;
step two: using the pathological data of the esophageal squamous carcinoma patients, constructing a decision tree from the important pathological factors determined by the chi-square test and information gain, and dividing the patients into an early group and a middle-late group;
step three: obtaining the preoperative routine blood biochemical indicators of the esophageal squamous carcinoma patients in the early group and in the middle-late group, respectively, and selecting the indicators significantly associated with postoperative survival risk using the least absolute shrinkage and selection operator (LASSO);
step four: clustering the esophageal squamous carcinoma patients of the early group and of the middle-late group, respectively, into different clusters using an improved density peak clustering (IDPC) algorithm based on cosine distance and K nearest neighbors;
step five: for each cluster, constructing a nomogram based on a logistic regression model to predict the survival risk of the esophageal squamous carcinoma patients;
step six: evaluating the performance of the nomograms from step five using the confusion matrix and the area under the receiver operating characteristic (ROC) curve.
2. The IDPC and LASSO based prognostic survival risk assessment method of claim 1 wherein the pathological data of the patients with esophageal squamous carcinoma include sex, age, tumor size, degree of differentiation, degree of infiltration and lymph node metastasis.
3. The IDPC and LASSO based esophageal squamous cancer prognostic survival risk assessment method according to claim 1, wherein said chi-square test method is:
Figure FDA0003556075440000011
where $m_i$ and $m_j$ denote the number of variables and the number of samples, respectively; $i$ denotes the value of the variable and $j$ denotes the value of the esophageal squamous carcinoma patient sample; $A_{ij}$ is the actual value for a sample whose variable value is $i$ and which belongs to the $j$-th esophageal squamous carcinoma patient; and $T_{ij}$ is the expected value for a sample whose variable value is $i$ and which belongs to the $j$-th esophageal squamous carcinoma patient, where $T_{ij}$ is defined as:
Figure FDA0003556075440000012
4. The IDPC and LASSO based esophageal squamous cancer prognostic survival risk assessment method according to claim 1, wherein said information gain is calculated by:
Figure FDA0003556075440000013
where $g_r$ is the information gain ratio, $\Delta H$ is the information gain of the attribute, and $\mathrm{InfoBefore}(H)$ is the information entropy before attribute classification;
the information gain $\Delta H$ is calculated as:
$\Delta H = \mathrm{InfoBefore}(H) - \mathrm{InfoAfter}(H)$;
where $\mathrm{InfoAfter}(H)$ is the information entropy after attribute classification;
the information entropy $\mathrm{InfoBefore}(H)$ before attribute classification and the information entropy $\mathrm{InfoAfter}(H)$ after attribute classification are calculated, respectively, as:
Figure FDA0003556075440000021
Figure FDA0003556075440000022
where $P(x_1)$ is the probability that event $x_1$ occurs, $P(x_2)$ is the probability that event $x_2$ occurs, and
Figure FDA0003556075440000023
is the probability that event
Figure FDA0003556075440000024
occurs.
5. The IDPC and LASSO based esophageal squamous cancer prognostic survival risk assessment method according to claim 1, wherein said blood biochemical indicators of the esophageal squamous carcinoma patients include white blood cell count (WBCC), lymphocyte count (LYC), monocyte count (MOC), neutrophil count (NEC), eosinophil count (EOS), basophil count (BAC), erythrocyte count (ERY), hemoglobin (HGB), platelet count (THC), total protein (TP), albumin (ALB), globulin (GLO), prothrombin time (PT), activated partial thromboplastin time (APTT), thrombin time (TT) and fibrinogen (FIB).
6. The IDPC and LASSO based esophageal squamous cancer prognostic survival risk assessment method according to claim 1, wherein the indicators significantly associated with postoperative survival risk are selected using the least absolute shrinkage and selection operator as follows:
Figure FDA0003556075440000025
where $Y$ is an $n \times 1$ vector of the actual values corresponding to the samples $X$; $X$ is an $n \times p$ matrix of the input samples of the LASSO regression; $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the $p \times 1$ vector of regression coefficients;
Figure FDA0003556075440000026
is the penalty term; and $\lambda > 0$ is a tuning parameter that balances the penalty term against the empirical risk.
7. The IDPC and LASSO based esophageal squamous cancer prognostic survival risk assessment method according to claim 1, wherein, in step four, the minimum distance $\delta_i$ between a data point and a cluster center in the improved density peak clustering algorithm based on cosine distance and K nearest neighbors is calculated as:
Figure FDA0003556075440000027
where $\rho_{i'}$ is the local density; $d_{i'j''}$ is the cosine distance between $x_{i'}$ and $x_{j''}$; $x_{i'}$ denotes the $i'$-th patient sample; $x_{j''}$ denotes the $j''$-th patient sample; and $N$ is the number of patient samples;
the cosine distance $d_{i'j''}$ between $x_{i'}$ and $x_{j''}$ is calculated as:
Figure FDA0003556075440000031
where $x_{i'a}$ is the value of feature $a$ in sample $x_{i'}$; $x_{j''a}$ is the value of feature $a$ in sample $x_{j''}$; and $L$ is the number of features;
the local density $\rho_{i'}$ is calculated as:
Figure FDA0003556075440000032
where $kNN(x_{i'})$ is the set of K nearest neighbors of $x_{i'}$;
the K-nearest-neighbor set $kNN(x_{i'})$ of $x_{i'}$ is calculated as:
$kNN(x_{i'}) = \{ x_{j''} \in X \mid d(x_{i'}, x_{j''}) \le d(x_{i'}, NN_k(x_{i'})) \}$;
where $d(x_{i'}, x_{j''})$ is the cosine distance between $x_{i'}$ and $x_{j''}$, and $NN_k(x_{i'})$ is the $k$-th nearest neighbor of $x_{i'}$.
8. The IDPC and LASSO based esophageal squamous cancer prognostic survival risk assessment method according to claim 1, wherein said logistic regression model is:
Figure FDA0003556075440000033
Figure FDA0003556075440000034
wherein,
Figure FDA0003556075440000035
$p(y = 1 \mid x')$ is the probability that the input variable $x'$ belongs to the positive class, $p(y = 0 \mid x')$ is the probability that the input variable $x'$ belongs to the negative class, $\alpha_0$ is a constant, and $\alpha_1$ is the regression coefficient of the input variable $x'$.
9. The IDPC and LASSO based esophageal squamous cancer prognostic survival risk assessment method according to claim 8, wherein the constant $\alpha_0$ and the regression coefficient $\alpha_1$ of the logistic regression model are calculated as:
Figure FDA0003556075440000036
where $L(\alpha_0, \alpha_1)$ is the likelihood function used to estimate $\alpha_0$ and $\alpha_1$; $t' = 1, 2, \ldots, n$, where $n$ is the number of samples; and $y_{t'}$ is the predicted value for the input variable $x'_{t'}$.
10. The IDPC and LASSO based esophageal squamous cancer prognostic survival risk assessment method according to claim 8 or 9, wherein the nomogram based on the logistic regression model is constructed as follows: each level of every influencing factor is assigned a score according to the magnitude of its regression coefficient in the logistic regression model, and the scores are summed to give a total score; the 3-year survival probability and the 5-year survival probability are obtained by locating the total score on the total-score scale.
CN202210276812.3A 2022-03-21 2022-03-21 IDPC and LASSO-based esophageal squamous carcinoma prognosis survival risk assessment method Pending CN114639482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210276812.3A CN114639482A (en) 2022-03-21 2022-03-21 IDPC and LASSO-based esophageal squamous carcinoma prognosis survival risk assessment method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210276812.3A CN114639482A (en) 2022-03-21 2022-03-21 IDPC and LASSO-based esophageal squamous carcinoma prognosis survival risk assessment method

Publications (1)

Publication Number Publication Date
CN114639482A true CN114639482A (en) 2022-06-17

Family

ID=81949717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210276812.3A Pending CN114639482A (en) 2022-03-21 2022-03-21 IDPC and LASSO-based esophageal squamous carcinoma prognosis survival risk assessment method

Country Status (1)

Country Link
CN (1) CN114639482A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117524486A (en) * 2024-01-04 2024-02-06 北京市肿瘤防治研究所 TTE model establishment method for predicting non-progressive survival probability of postoperative patient
CN117524486B (en) * 2024-01-04 2024-04-05 北京市肿瘤防治研究所 TTE model establishment method for predicting non-progressive survival probability of postoperative patient

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination