CN113782197B

CN113782197B - COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm

Info

Publication number: CN113782197B
Application number: CN202110887756.2A
Authority: CN
Inventors: 贾立静; 李静; 陈威; 张恒; 魏子健; 王佳明; 郏瑞琪; 俞哲媛; 王照鸿; 李秀成
Original assignee: Beijing Jiaotong University; First Medical Center of PLA General Hospital
Current assignee: Beijing Jiaotong University; First Medical Center of PLA General Hospital
Priority date: 2021-08-03
Filing date: 2021-08-03
Publication date: 2022-11-04
Anticipated expiration: 2041-08-03
Also published as: CN113782197A

Abstract

The invention provides a COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm, which comprises the following steps: extracting COVID-19 patient data from a database, and dividing the patient data into an experimental group and a control group according to whether the patient's disease state changes; interpolating missing values of the indexes through random forest regression; screening the indexes input into the model, and taking the screened indexes as key risk factors for identifying the disease deterioration of the patient; inputting the key risk factors of a patient into an XGBoost model and a logistic regression model; selecting the XGBoost model, which has the better predictive performance, generating an index combination, predicting by using the XGBoost model, and recording a prediction result; defining the early warning range of each key index; when a key risk index of the patient enters the early warning range, sending an alarm prompt to the medical staff; and, by combining the calculation results of the algorithm with the clinical experience of doctors, providing two index combinations, consisting of 15 first-group indexes and 5 second-group indexes, for predicting the condition of a COVID-19 patient.

Description

COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm

Technical Field

The invention relates to the technical field of machine learning, in particular to a COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm.

Background

The proliferation of new coronavirus disease (COVID-19) infection cases presents a huge challenge to the management of medical resources. Although approximately 81% of patients with COVID-19 exhibit mild or moderate symptoms, some patients experience a sudden worsening of the disease and the disease progresses rapidly to a severe or critical condition. Therefore, early intervention in the exacerbation of a patient with COVID-19 would greatly aid in patient management and allocation of medical resources. In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

1) In published studies on poor prognosis of COVID-19, most studies still use statistical methods to analyze and describe the characteristics and outcomes of COVID-19 patients. Early risk factors are identified by comparing severe with non-severe patients. Statistical methods, however, cannot predict a poor prognosis for a patient.

2) In studies using machine learning algorithms to predict poor prognosis in COVID-19 patients, the outcomes that researchers predict are limited to ICU admission or death. No researchers are currently concerned with state transitions during disease progression.

3) Although machine learning in existing research obtains good prediction results, the models require a large number of indexes, the sampling is complex, various laboratory indexes are included, and a long time is needed to obtain all the indexes required by the model. The availability of the indexes and the timeliness of the machine learning model in real-world use are ignored.

4) Meanwhile, whether research is based on a traditional statistical method or a machine learning method, it only identifies risk factors for deterioration, hospitalization or death of a patient, and does not provide an early warning range for each index.

Disclosure of Invention

The object of the present invention is to solve at least one of the technical drawbacks mentioned.

Therefore, the invention aims to provide a COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm.

In order to achieve the above object, an embodiment of the present invention provides a COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm, including:

Step S1, extracting COVID-19 patient data from a database, and dividing the patient data into an experimental group and a control group according to the change in the patient's disease state;

Step S2, for each time point at which a patient in the experimental group changes from mild to severe, acquiring the latest recorded values of all indexes within the past preset number of days as a positive sample; for all time points of the patients in the control group, acquiring the latest recorded values of all indexes within the past preset number of days as negative samples, and randomly sampling the negative samples until their number equals the number of positive samples; mixing the positive samples and the negative samples, and interpolating the missing values of each index;

Step S3, calculating the model performance under different index numbers and parameters by adopting a stepwise backward regression method through ten-fold cross validation, screening the indexes according to the model performance, and taking the screened indexes as key risk factors for identifying the disease deterioration of COVID-19 patients;

Step S4, inputting the key risk factors of the patient into the XGBoost model and the logistic regression model, and comparing the AUC values of the two models;

Step S5, selecting the XGBoost model, which has the higher AUC value, as the model with the better predictive performance, calculating the mean |SHAP| value of every index in the model and ranking the indexes, generating an index combination according to the ranking, predicting by using the XGBoost model, and recording the prediction result;

Step S6, defining the early warning range of each key index in the SHAP partial dependence graph according to the result shown by the graph and in combination with clinical experience; identifying, through the machine learning model, the risk probability of a mild patient deteriorating into a severe patient, and sending an alarm prompt to medical staff when a key risk index of the patient enters the early warning range;

Step S7, screening out 15 first-group indexes and 5 second-group indexes by combining the calculation result of the algorithm with the clinical experience of doctors, wherein the 15 first-group indexes comprise: prothrombin time (PT), prothrombin activity (PTA), lactate dehydrogenase (LDH), international normalized ratio (INR), heart rate, body mass index (BMI), D-dimer, creatine kinase (CK), hematocrit, urine specific gravity, magnesium, globulin, activated partial thromboplastin time (APTT), lymphocyte count (L%), and platelet count;

the 5 second-group indexes comprise: PT, heart rate, BMI, hematocrit (HCT), and complications;

the two index combinations, one consisting of the 15 first-group indexes and one consisting of the 5 second-group indexes, are used to predict the condition of a COVID-19 patient.

Further, in the step S1, the patient data includes: demographics, underlying diseases, vital signs, coagulation, blood routine, blood biochemistry and urine routine.

Further, in the step S2, after the positive sample and the negative sample are mixed, a random forest regression algorithm is used to interpolate missing values of each index of the patient data.

Further, in the step S3, the model performance includes: accuracy, recall, F1, AUC, and the 95% confidence interval (CI) of the AUC.

Further, in step S3, calculating the model performance under different index numbers and parameters by the stepwise backward regression method includes: putting all the features into an XGBoost model, and calculating the SHAP value of each feature; in each iteration, deleting the feature with the smallest absolute SHAP value from the model; repeating this until no feature meets the iteration criterion; and recording the AUC value of each iteration.

Further, in the step S4, the index combination includes: pulse index and BMI index.

Further, the preset number of days is 10 to 20 days.

Further, in step S3, the number of indexes input into the model is selected according to the 1 standard error rule, and the selected indexes are used as key risk factors for identifying the disease deterioration of COVID-19 patients.

Further, in step S6, a positive SHAP value in the SHAP graph indicates that an index value in that range contributes positively to the prediction result, i.e., to disease deterioration, whereas a negative SHAP value indicates that an index value in that range contributes negatively to disease deterioration; the range in which the SHAP value is positive in the SHAP partial dependence graph is defined as the early warning range.

Further, continuous variables are represented using the mean and the median and are compared between groups using the Wilcoxon rank sum test, and categorical variables are described as frequencies and percentages and compared using the chi-square test.

According to the COVID-19 patient outcome prediction method based on the interpretable machine learning algorithm of the embodiment of the invention:

1) An accurate and effective model is constructed by using an interpretable machine learning algorithm to predict whether a mild/moderate patient will deteriorate into a severe/critical case.

2) Two index combinations of different sizes are formed according to the different requirements on timeliness and accuracy in different scenarios. From the viewpoint of accuracy, the combination of 15 indexes can be selected, with which the model prediction accuracy is high; from the viewpoint of applicability and practicality, the combination of 5 indexes can be selected. Reducing the number of indexes needed for model prediction shortens the waiting time. The invention can obtain a prediction result using only 5 indexes, of which 2 are laboratory indexes, and the patient index data can be obtained quickly by bedside testing.

3) Early risk factors for patient deterioration are identified, and SHAP is used to indicate the approximate early warning range of each of these risk factors for severe COVID-19.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow diagram of a COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a model processing flow according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a data extraction flow according to an embodiment of the invention;

fig. 4 is a diagram illustrating changes in AUC values of the XGBoost model during feature quantity iteration according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating a key risk factor | SHAP | mean ordering according to an embodiment of the invention;

FIG. 6A is a partial dependency graph (BMI) of SHAP according to an embodiment of the invention;

FIG. 6B is a gender interaction chart (pulse) according to an embodiment of the present invention;

FIGS. 7A-7E are SHAP partial dependence graphs of coagulation indices according to embodiments of the present invention;

FIGS. 8A-8C are SHAP partial dependency graphs of blood routine according to embodiments of the present invention;

FIGS. 9A to 9D are SHAP partial dependence graphs of blood biochemistry according to embodiments of the invention;

FIG. 10 is a SHAP partial dependence graph of the urine routine index according to an embodiment of the present invention;

fig. 11A to 11C are model ROC graphs under different index combinations according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary and intended to be illustrative of the invention and are not to be construed as limiting the invention.

In order to solve the problems, the invention provides a method for predicting the disease condition deterioration of a COVID-19 patient based on an interpretable machine learning method, determines an early warning index of the disease deterioration of the COVID-19 patient, and provides an approximate warning range of the early warning index.

As shown in fig. 1 and fig. 2, the COVID-19 patient outcome prediction method based on the interpretable machine learning algorithm of the embodiment of the present invention includes the following steps:

Step S1, extracting COVID-19 patient data from a database, and dividing the patient data into an experimental group and a control group according to the change in the patient's disease state.

In an embodiment of the invention, the patient data comprises indices of demographics (e.g., age and gender), underlying diseases, vital signs, coagulation, blood routine, blood biochemistry and urine routine, as shown in Table 1.

Specifically, as shown in fig. 3, according to the clinical typing standard of the "Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia" published by the General Office of the National Health Commission, the COVID-19 patient data, including the indices of the patient's demographics, underlying diseases, vital signs, coagulation, blood routine, blood biochemistry and urine routine, is extracted from the database. The data are divided into a severe group (experimental group) and a non-severe group (control group) according to whether the patient's disease changed from mild/moderate to severe/critical.

TABLE 1

Step S2, performing data cleaning and missing-value imputation on the patient data to handle abnormal data and missing data.

In this step, for each time point at which a patient in the experimental group changes from mild to severe, the latest recorded values of all indexes within the past preset number of days are obtained and used as a positive sample; for all time points of the patients in the control group, the latest recorded values of all indexes within the past preset number of days are obtained as negative samples, which are randomly sampled down to the same number as the positive samples; the positive and negative samples are mixed, and the missing values of each index are interpolated.
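The sample construction described in this step can be sketched roughly as follows. This is a minimal illustration, not the implementation of the invention: the record layout `(date, index name, value)`, the function names `latest_values` and `undersample`, the 15-day window and all data values are assumptions made for the example.

```python
import random
from datetime import date, timedelta

WINDOW_DAYS = 15  # assumed window length for this sketch


def latest_values(records, time_point, window_days=WINDOW_DAYS):
    """Return {index_name: latest value} recorded in the window before time_point.

    records: list of (record_date, index_name, value) tuples for one patient
    (a hypothetical layout chosen for this example).
    """
    start = time_point - timedelta(days=window_days)
    latest = {}
    for rec_date, name, value in sorted(records):
        if start <= rec_date <= time_point:
            latest[name] = value  # later dates overwrite earlier ones
    return latest


def undersample(negatives, n_positive, seed=0):
    """Randomly keep as many negative samples as there are positive samples."""
    rng = random.Random(seed)
    return rng.sample(negatives, min(n_positive, len(negatives)))


# Made-up records for one patient whose state changes on 2020-03-01.
records = [
    (date(2020, 2, 1), "PT", 13.0),    # older than the window, ignored
    (date(2020, 2, 20), "PT", 14.5),
    (date(2020, 2, 25), "PT", 15.2),   # latest PT inside the window
    (date(2020, 2, 22), "heart_rate", 96),
]
sample = latest_values(records, time_point=date(2020, 3, 1))
print(sample)

negatives = [{"id": i} for i in range(10)]
balanced = undersample(negatives, n_positive=3)
print(len(balanced))
```

An index with no record inside the window is simply absent from the returned dictionary; in the method such gaps are the missing values that are interpolated afterwards.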

In the embodiment of the invention, after the positive sample and the negative sample are mixed, the missing values of all indexes of the patient data are interpolated by adopting a random forest regression algorithm.

In yet another embodiment of the present invention, the predetermined number of days is 10 days to 20 days. Preferably, the predetermined number of days is 15 days.

Step S3, calculating the model performance under different index numbers and parameters by adopting a stepwise backward regression method through ten-fold cross validation.

In this step, ten-fold cross validation is employed: the data set is divided into 10 portions, and in each round 9 portions are used as training data and 1 portion is used as test data. The model performance under different index numbers and parameters is calculated. The model performance comprises: accuracy, recall, F1, AUC, and the 95% confidence interval (CI) of the AUC.
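For illustration, the ten-fold partition can be sketched with the standard library alone. The function name `k_fold_indices`, the fixed seed and the sample count of 100 are assumptions of this example; the text does not specify how the folds are drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=100, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so every sample is used once for testing and nine times for training.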

Then, the indexes are screened according to the model performance: the number of indexes input into the model is selected according to the 1 standard error rule (1 SE), and the screened indexes are used as key risk factors for identifying the disease deterioration of COVID-19 patients.
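One common reading of the 1 standard error rule is to choose the smallest number of indexes whose mean cross-validated AUC lies within one standard error of the best mean AUC. The sketch below follows that reading; the function name `one_se_rule` and the per-fold AUC values are invented for the example and are not results of the invention.

```python
from math import sqrt
from statistics import mean, stdev


def one_se_rule(cv_auc_by_n_features):
    """Pick the smallest feature count whose mean AUC is within 1 SE of the best.

    cv_auc_by_n_features: {n_features: [AUC of each cross-validation fold]}.
    """
    summary = {
        n: (mean(aucs), stdev(aucs) / sqrt(len(aucs)))
        for n, aucs in cv_auc_by_n_features.items()
    }
    best_n = max(summary, key=lambda n: summary[n][0])
    best_mean, best_se = summary[best_n]
    threshold = best_mean - best_se
    eligible = [n for n, (m, _) in summary.items() if m >= threshold]
    return min(eligible)


# Illustrative per-fold AUC values (made up for this example).
cv_auc = {
    30: [0.90, 0.92, 0.91, 0.93, 0.89],
    15: [0.90, 0.91, 0.91, 0.92, 0.89],
    5: [0.84, 0.86, 0.85, 0.83, 0.85],
}
chosen = one_se_rule(cv_auc)
print(chosen)
```

With these numbers the 30-index model has the best mean AUC, but the 15-index model falls within one standard error of it and is therefore preferred as the more parsimonious choice.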

In this step, calculating the model performance under different index numbers and parameters by the stepwise backward regression method includes: putting all the features into an XGBoost model, and calculating the SHAP (SHapley Additive exPlanations) value of each feature; in each iteration, deleting the feature with the smallest absolute SHAP value from the model; repeating this until no feature meets the iteration criterion; and recording the AUC value of each iteration. Fig. 4 is a schematic diagram of the change in AUC values of the XGBoost model during the feature quantity iteration according to the embodiment of the present invention.
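The elimination loop can be sketched generically as below. The two callables are placeholders: in the setting described here `importance_fn` would return the mean |SHAP| of each feature from a fitted XGBoost model and `score_fn` the cross-validated AUC, but this example substitutes toy functions so that it runs without external libraries. The stopping condition `min_features` is an assumption, since the iteration criterion is not spelled out in the text.

```python
def backward_elimination(features, importance_fn, score_fn, min_features=1):
    """Repeatedly drop the least important feature and record the score.

    importance_fn(features) -> {feature: importance}
    score_fn(features) -> float
    Returns a list of (feature_subset, score), one entry per iteration.
    """
    current = list(features)
    history = [(tuple(current), score_fn(current))]
    while len(current) > min_features:
        importance = importance_fn(current)
        weakest = min(current, key=lambda f: abs(importance[f]))
        current.remove(weakest)
        history.append((tuple(current), score_fn(current)))
    return history


# Toy stand-ins (hypothetical values, not results from the invention).
TOY_IMPORTANCE = {"PT": 0.9, "BMI": 0.5, "noise": 0.01}


def toy_importance(features):
    return {f: TOY_IMPORTANCE[f] for f in features}


def toy_score(features):
    return 0.5 + 0.2 * ("PT" in features) + 0.1 * ("BMI" in features)


history = backward_elimination(["PT", "BMI", "noise"], toy_importance, toy_score)
for subset, score in history:
    print(len(subset), subset, round(score, 3))
```

The recorded history corresponds to the curve of AUC against the number of features, from which the final number of indexes is then chosen.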

The following describes Shapley additive explanations:

Shapley additive explanations (SHAP) are used to improve the interpretability of the results. This model-agnostic approach is a recently developed method for interpreting the output of machine learning models. The goal of SHAP is to interpret the prediction for an instance x by computing the contribution of each feature to the prediction. In other words, this technique determines Shapley values from coalitional game theory. The specific calculation can be expressed as:

$$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j$$

where g is the explanation model, $z' \in \{0,1\}^M$ is the vector of simplified features, M is the maximum number of features, $\phi_0$ is the constant term of the explanation model, and $\phi_j \in \mathbb{R}$ is the attribution of the j-th feature. In addition, SHAP partial dependence graphs are used to illustrate the effect of a change in a single feature on the severity of COVID-19. A SHAP dependence graph represents the marginal effect of each feature on the predicted outcome of the machine learning model, and may reveal the exact form of the relationship (e.g., linear, monotonic, or more complex). After considering the individual features, an additional combined feature effect (interaction effect) was also considered in this study.
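To make the additive explanation model concrete, the sketch below computes exact Shapley values for a tiny made-up model with two binary features and checks that the constant term plus the attributions reproduces the model output. It enumerates all coalitions, which is only feasible for very small M; practical SHAP implementations for tree models use faster algorithms. The toy model and the names `shapley_values` and `toy_value` are assumptions of this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a coalition value function.

    value_fn(frozenset_of_feature_indices) -> model output when only those
    features are present.
    """
    players = range(n_features)
    phi = [0.0] * n_features
    for j in players:
        others = [p for p in players if p != j]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value_fn(s | {j}) - value_fn(s))
    return phi


def toy_value(coalition):
    """Toy model f = 1 + 2*x0 + 3*x1 + 4*x0*x1 (present = 1, absent = 0)."""
    x0 = 1 if 0 in coalition else 0
    x1 = 1 if 1 in coalition else 0
    return 1 + 2 * x0 + 3 * x1 + 4 * x0 * x1


phi0 = toy_value(frozenset())        # constant term of the explanation
phi = shapley_values(toy_value, 2)   # one attribution per feature
print(phi0, phi, phi0 + sum(phi))
```

The interaction term of the toy model is split equally between the two features, and the attributions sum to the difference between the full prediction and the constant term.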

Step S4, inputting the key risk factors of the patient into the XGBoost model and the logistic regression model, and comparing the AUC values of the two models, wherein the evaluation indexes of the models are the same as those in step S3.

The extreme gradient boosting (XGBoost) model applied in the present invention is explained below:

XGBoost is derived from the gradient boosted decision tree and produces good results for a variety of machine learning problems. In this model, the importance of candidate predictors is ranked by selection frequency. The sum of these importance values is then scaled to 100, which means that each value can be interpreted as a share of the overall model importance. Furthermore, a single prediction in XGBoost can be explained by decomposing the decision path into one component per feature. In this way, a decision can be tracked through the tree, and a prediction can be interpreted by the contribution added at each decision node.

XGBoost iteratively fits the residuals of the previous model using decision trees as weak learners. Furthermore, the algorithm employs regularization to control the complexity of the trees, thereby avoiding overfitting and simplifying the model. This principle can be explained as follows. Let

$$D = \{(x_i, y_i)\} \quad (|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$$

represent a data set with n samples and m features. The output of the tree additive model can be defined as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,$$

in which

$$F = \{ f(x) = \omega_{q(x)} \} \quad (q: \mathbb{R}^m \to \{1, \dots, T\},\ \omega \in \mathbb{R}^T),$$

K is the number of trees, and T indicates the number of leaves. Each tree $f_k$ can be divided into a structural component q and a leaf weight ω. If complex relationships in the data are to be captured, the hyper-parameters have to be tuned. Therefore, a grid search is performed to determine the optimal combination of parameter values.

The logistic regression model is explained as follows:

Logistic regression (LR) is a traditional machine learning algorithm that has been widely used in medical classification tasks. Rather than fitting a straight line or hyperplane, it uses the logistic function to constrain the output of a linear equation to between 0 and 1. The logistic function is defined as:

$$\text{logistic}(\eta) = \frac{1}{1 + \exp(-\eta)}$$

The interpretability of the algorithm comes from how the prediction changes when a feature $x_j$ is increased by 1 unit. Logistic regression is an extension of the linear regression model.
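As a small worked illustration of this interpretability, the sketch below evaluates a logistic model before and after one feature is increased by 1 unit. For a logistic model, such an increase multiplies the odds by exp of that feature's coefficient. The coefficients, the bias and the input values are hypothetical and chosen only for the example.

```python
from math import exp


def logistic(eta):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-eta))


def predict_proba(x, weights, bias):
    """Probability of the positive class for one sample."""
    eta = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(eta)


def odds(p):
    return p / (1.0 - p)


weights, bias = [0.7, -0.2], -1.0                     # hypothetical coefficients
p_before = predict_proba([1.0, 2.0], weights, bias)
p_after = predict_proba([2.0, 2.0], weights, bias)    # first feature + 1 unit
odds_ratio = odds(p_after) / odds(p_before)
print(round(p_before, 4), round(p_after, 4), round(odds_ratio, 4))
```

The printed odds ratio equals exp(0.7) regardless of the values of the other features, which is what makes each coefficient directly interpretable.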

Step S5, selecting the XGBoost model, which has the higher AUC value, as the model with the better predictive performance, calculating the mean |SHAP| value of every index in the model and ranking the indexes, generating an index combination according to the ranking, predicting by using the XGBoost model, and recording the prediction result. Fig. 5 is a diagram illustrating the ranking of the key risk factors by mean |SHAP| according to an embodiment of the invention. PT: prothrombin time; LDH: lactate dehydrogenase; INR: international normalized ratio; DD: D-dimer; CK: creatine kinase; APTT: activated partial thromboplastin time; L: lymphocyte count; SHAP: Shapley additive explanations.

Specifically, after sorting, 5 more important and clinically easily-obtained indexes are selected by integrating the sorting results of the indexes and doctor suggestions, are used as a simplified version of index combination, then the XGboost model is used for prediction, and the prediction results are recorded.
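The ranking part of this step can be sketched as follows. The SHAP matrix is assumed to hold one row per sample and one column per index; the values and the function name `rank_by_mean_abs_shap` are invented for illustration. The clinical judgement that is combined with the ranking in the method is outside the scope of the code.

```python
def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by the mean absolute SHAP value over all samples."""
    n = len(shap_matrix)
    means = {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Made-up SHAP values: rows are samples, columns are indexes.
names = ["magnesium", "PT", "BMI"]
shap_matrix = [
    [0.0, 0.4, -0.1],
    [0.1, -0.6, 0.3],
    [-0.1, 0.5, -0.2],
]
ranking = rank_by_mean_abs_shap(shap_matrix, names)
top_two = [name for name, _ in ranking[:2]]
print(ranking)
print(top_two)
```

Taking the absolute value before averaging matters: an index that pushes predictions strongly in both directions would otherwise average out to a small value.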

In an embodiment of the invention, the combination of indicators comprises: pulse index and BMI index. FIG. 6A is a partial dependency graph (BMI) of SHAP according to an embodiment of the invention; FIG. 6B is a gender interaction diagram (pulse) according to an embodiment of the present invention.

Fig. 7A to 7E are SHAP partial dependence graphs of coagulation indices according to an embodiment of the present invention: (A) PT, (B) PTA, (C) INR, (D) D-dimer, (E) APTT. PT: prothrombin time; PTA: prothrombin activity; INR: international normalized ratio; APTT: activated partial thromboplastin time.

FIGS. 8A-8C are SHAP partial dependence graphs of blood routine indices according to embodiments of the present invention: (A) hematocrit, (B) L%, (C) platelet count. L: lymphocyte count.

FIGS. 9A to 9D are SHAP partial dependence graphs of blood biochemistry indices according to embodiments of the present invention: (A) LDH, (B) CK, (C) magnesium, (D) globulin. LDH: lactate dehydrogenase; CK: creatine kinase.

Step S6, defining the early warning range of each key index in the SHAP partial dependence graph according to the result shown by the graph and in combination with clinical experience. A positive SHAP value in the graph indicates that an index value in that range contributes positively to the prediction result, namely disease deterioration; conversely, a negative SHAP value indicates that an index value in that range contributes negatively to disease deterioration. The range in which the SHAP value is positive in the SHAP partial dependence graph is defined as the early warning range. The risk probability that a mild patient deteriorates into a severe patient is identified through the machine learning model, and when a key risk index of the patient enters the early warning range, an alarm prompt is sent to medical staff, which provides a reference for clinical intervention and improves the quality of care.
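A rough sketch of how a warning range could be read off a dependence graph is given below: the points are sorted by index value and consecutive points with a positive SHAP value are merged into intervals. The point values are invented and are not clinical thresholds, the function names are hypothetical, and the adjustment by clinical experience that the step describes is not modelled.

```python
def warning_ranges(points):
    """Return [(low, high)] intervals where consecutive points have SHAP > 0.

    points: iterable of (index_value, shap_value) pairs from a dependence graph.
    """
    ranges, start, prev = [], None, None
    for value, shap_value in sorted(points):
        if shap_value > 0:
            if start is None:
                start = value
            prev = value
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges


def in_warning_range(value, ranges):
    """True if a measured value falls inside any early warning interval."""
    return any(low <= value <= high for low, high in ranges)


# Made-up dependence points for one generic index.
points = [(18, 0.3), (20, 0.1), (22, -0.2), (25, -0.3),
          (28, -0.1), (30, 0.2), (33, 0.5)]
ranges = warning_ranges(points)
print(ranges)
print(in_warning_range(31, ranges), in_warning_range(24, ranges))
```

An alert would then be raised whenever a newly measured value of a key index satisfies `in_warning_range`.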

Step S7, screening out 15 first-group indexes and 5 second-group indexes by integrating the calculation result of the algorithm and the clinical experience of doctors.

The 15 first-group indexes include: prothrombin time (PT), prothrombin activity (PTA), lactate dehydrogenase (LDH), international normalized ratio (INR), heart rate, body mass index (BMI), D-dimer, creatine kinase (CK), hematocrit, urine specific gravity, magnesium, globulin, activated partial thromboplastin time (APTT), lymphocyte count (L%), and platelet count. The 5 second-group indexes include: PT, heart rate, BMI, hematocrit (HCT) and complications.

The following explains a model evaluation index calculation method in the present invention:

In the field of machine learning, a confusion matrix is a visualization tool for evaluating the quality of a classification model. Each column of the matrix represents the class predicted by the model for a sample; each row of the matrix represents the true class of a sample. Table 2 shows the confusion matrix for a binary classification model.

TABLE 2

Here, true positive (TP) means that the true class of the sample is the positive class and the model also predicts the positive class.

False negative (FN) means that the true class of the sample is the positive class, but the model predicts the negative class.

False positive (FP) means that the true class of the sample is the negative class, but the model predicts the positive class.

True negative (TN) means that the true class of the sample is the negative class and the model also predicts the negative class.

The indexes for evaluating the model accuracy derived from the confusion matrix are as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

indicates the accuracy of the model. In general, the higher the accuracy of the model, the better the model performance.

$$\text{Precision} = \frac{TP}{TP + FP}$$

indicates the precision. Generally, the higher the precision, the better the model.

$$\text{Recall} = \frac{TP}{TP + FN}$$

indicates the recall rate. Generally, the higher the recall rate, the more positive samples are predicted correctly by the model, and the better the model.

In general, the higher the Precision value and the higher the Recall value, the better the model works. In practice, however, the two can conflict. For example, in an extreme case where the model returns only one result and that result is correct, Precision is 100% but Recall is very low; whereas if all results are returned, Recall is 100% but Precision is very low. The most common remedy is therefore to introduce a composite index, the F-Measure (also known as F-Score, i.e., the weighted harmonic mean of Precision and Recall):

$$F_\alpha = \frac{(1 + \alpha^2) \cdot \text{Precision} \cdot \text{Recall}}{\alpha^2 \cdot \text{Precision} + \text{Recall}}$$

where α = 1 is used, which gives F1 = 2 · Precision · Recall / (Precision + Recall).
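The confusion-matrix indexes can be computed directly from the four counts, as in the sketch below. The counts are invented for illustration, and the F-Measure is written in its usual weighted form with parameter α, which reduces to F1 at α = 1.

```python
def classification_metrics(tp, fn, fp, tn, alpha=1.0):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = ((1 + alpha ** 2) * precision * recall
                 / (alpha ** 2 * precision + recall))
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up counts: 50 truly positive and 50 truly negative samples.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 4))
```

With these counts the precision (about 0.67) is lower than the recall (0.80), and the F1 value (about 0.73) lies between the two.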

In the present example, the efficiency of the model was evaluated by ROC curve and AUC.

A receiver operating characteristic curve (ROC curve) is also called a sensitivity curve; each point on the ROC curve reflects the sensitivity to the same signal stimulus.

Horizontal axis: false positive rate (FPR), i.e., 1 − specificity: the proportion of all negative instances that are predicted to be positive.

Vertical axis: true positive rate (TPR), i.e., sensitivity (coverage of the positive class).

For a binary problem, instances are classified into positive (positive) or negative (negative) classes. However, in practice, four cases occur when classifying.

(1) If an instance is positive and is predicted to be positive, it is a true positive (TP).

(2) If an instance is positive but is predicted to be negative, it is a false negative (FN).

(3) If an instance is negative but is predicted to be positive, it is a false positive (FP).

(4) If an instance is negative and is predicted to be negative, it is a true negative (TN).

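TP: number of correct positive predictions (hits);

FN: number of misses, i.e., positive instances that are not detected;

FP: number of false alarms, i.e., negative instances incorrectly predicted as positive;

TN: number of correct rejections, i.e., negative instances correctly predicted as negative.

These cases are tabulated in Table 3 below, with 1 denoting the positive class and 0 the negative class: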

TABLE 3

From the above table, the calculation formula for the horizontal and vertical axes is:

(1) True positive rate (TPR): TP/(TP + FN), the proportion of all positive instances that the classifier predicts as positive. This is the sensitivity.

(2) False positive rate (FPR): FP/(FP + TN), the proportion of all negative instances that the classifier predicts as positive. This is 1 − specificity.

(3) True negative rate (TNR): TN/(FP + TN), the proportion of all negative instances that the classifier predicts as negative; TNR = 1 − FPR. This is the specificity.

AUC (area under the curve): the area under the ROC curve, which lies between 0 and 1 (0.5 corresponds to random guessing). The AUC can be used as a single number to evaluate the quality of the classifier; the larger the value, the better.

The AUC value is a probability: when a positive sample and a negative sample are selected at random, the AUC is the probability that the current classification algorithm ranks the positive sample ahead of the negative sample according to the computed score. The larger the AUC value, the more likely the classifier is to rank positive samples ahead of negative samples, and thus the better the classification.
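The two views of the AUC, as a ranking probability and as the area under the ROC curve, can be checked against each other on a small example. The scores and labels below are invented, and the function names are chosen for this sketch only.

```python
def auc_pairwise(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]   # made-up model scores
labels = [1, 1, 0, 1, 0, 0]               # 1 = positive, 0 = negative
auc_rank = auc_pairwise(scores, labels)
auc_area = auc_trapezoid(roc_points(scores, labels))
print(round(auc_rank, 4), round(auc_area, 4))
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so both computations give 8/9.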

Fig. 11A to 11C are model ROC graphs under different index combinations according to an embodiment of the present invention: receiver operating characteristic curves for predicting exacerbation of COVID-19 with (A) LR (15-index combination), (B) XGBoost (15-index combination), and (C) XGBoost (5-index combination). AUC: area under the curve; LR: logistic regression; XGBoost: extreme gradient boosting.

The invention incorporates the data of inpatients at Huoshenshan Hospital in Wuhan, China, from February 2 to April 1, 2020. All patients were diagnosed with COVID-19 pneumonia by positive nucleic acid testing, and were classified into 4 clinical types (mild, moderate, severe and critical) according to the Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia (eighth edition) issued by the National Health Commission of the People's Republic of China. In the present invention, both mild and moderate patients are treated as mild cases, and all other patients are considered severe cases. The present invention predicts whether a patient's condition will transition from mild to severe.

The present invention collects the electronic medical records (EMR) of all patients during their stay at Huoshenshan Hospital, including epidemiology, demographics, clinical features, laboratory indices, past medical history, exposure history, symptoms, chest computed tomography and any treatment (i.e., antiviral treatment, corticosteroid treatment, respiratory support and renal replacement therapy). All data were reviewed by a team of physicians. In order to more accurately identify high-risk factors that cause mild patients to deteriorate to severe/critical conditions, mild patients were divided into a severe group (experimental group) and a non-severe group (control group) according to whether or not they deteriorated to severe during hospitalization (see fig. 2A). Since disease progression is dynamic, 35.7% of patients in the severe group experienced more than one exacerbation during hospitalization (2.9 per patient on average). Each transition from mild to severe was considered a positive sample in the study, while patients in the control group remained in a mild condition throughout. Because negative samples are abundant, their number is markedly higher than that of the positive samples, which causes class imbalance between positive and negative samples. Therefore, random undersampling is employed to balance the numbers of positive and negative samples.

The model inputs include three general categories of features common in EMRs: (1) demographic variables (such as age and gender); (2) comorbidities; (3) clinical and laboratory results. Any feature with more than 50% of its values missing is excluded. All remaining missing values were filled in using a random forest (see FIG. 2B). This procedure produced 82 features for inclusion in the model (see Table 4).
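The 50% missingness filter can be sketched as follows. This is an illustration only: it assumes each record is a dictionary with `None` marking a missing value, and the name `drop_sparse_features` and the toy rows are hypothetical. The subsequent random forest imputation is not shown.

```python
def drop_sparse_features(rows, max_missing=0.5):
    """Keep features whose fraction of missing (None) values is at most max_missing."""
    features = sorted({name for row in rows for name in row})
    kept = []
    for name in features:
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing / len(rows) <= max_missing:
            kept.append(name)
    return kept

# Hypothetical toy records; the values are not patient data.
rows = [
    {"age": 61, "PT": 13.2, "ferritin": None},
    {"age": 47, "PT": None, "ferritin": None},
    {"age": 70, "PT": 14.8, "ferritin": 410.0},
    {"age": 55, "PT": 12.9, "ferritin": None},
]
kept = drop_sparse_features(rows)
```

In this toy input `ferritin` is missing in three of four records, so it is excluded, while `PT` (one of four missing) is retained for imputation.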

In addition, the performance of the model was evaluated using 10-fold cross-validation. In this process, the data set is randomly partitioned into 10 equally sized subsamples, 9 of which are used to train the model, and then the model is validated with the remaining 1.
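The partitioning described above can be sketched with the standard library alone. The helper name `k_fold_indices` and its `seed` parameter are assumptions made for illustration; a library routine for k-fold splitting would normally be used instead.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

splits = list(k_fold_indices(100, k=10, seed=1))
```

Each sample appears in exactly one validation fold, so every record is used once for validation and nine times for training.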

The machine learning algorithms employed by the present invention are described below. Various machine learning and deep learning algorithms have been applied to this type of problem. However, most techniques prioritize the prediction or classification task and neglect interpretation, or the selection and analysis of informative risk factors. A deep learning model comprises many hidden layers, requires a large number of training samples, and is difficult for a clinician to interpret. Thus, the present invention considers interpretability to be a core requirement for machine learning model selection. XGBoost and Logistic Regression (LR) were used to predict whether mild COVID-19 patients will progress to severe disease.

The invention uses the mean (standard deviation, SD) and median (interquartile range, IQR) to represent continuous variables. Since the data of the present invention are not normally distributed, continuous variables are compared using the Wilcoxon rank-sum test. Categorical variables are described as frequencies and percentages and compared using the chi-square test. All statistical tests were two-tailed. Table 4 shows the population-based baseline characteristics of the patients.
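As a small illustration of the descriptive summaries mentioned above, the following sketch computes the mean, SD, median and quartiles for one continuous variable. The function name `describe_continuous` and the sample values are hypothetical, and the quartile convention (Python's default "exclusive" method) is an assumption, since the text does not state which one was used. The hypothesis tests themselves are not shown.

```python
from statistics import mean, median, quantiles, stdev

def describe_continuous(values):
    """Return mean, SD, median and the quartiles bounding the IQR."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' method
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": (q1, q3),
    }

# Hypothetical prothrombin time values in seconds, for illustration only.
pt_seconds = [12.1, 12.8, 13.0, 13.4, 13.9, 14.6, 15.2, 17.8]
summary = describe_continuous(pt_seconds)
```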

TABLE 4

Note: DM, diabetes mellitus; CAD, coronary artery heart disease; CCB, calcium channel blockers; ARB, angiotensin II type 1 receptor blockers.

TABLE 5 Sample-based patient baseline characteristics

Note: BMI: body mass index; CRP: C-reactive protein; N: neutrophil count; M: monocyte count; MCHC: mean corpuscular hemoglobin concentration; MCH: mean corpuscular hemoglobin; L: lymphocyte count; RDW: red blood cell volume distribution width; MCV: mean corpuscular volume; PT: prothrombin time; TT: thrombin time; INR: international normalized ratio; APTT: activated partial thromboplastin time; α-HBDH: α-hydroxybutyrate dehydrogenase; γ-GT: γ-glutamyl transpeptidase; LDH: lactate dehydrogenase; BUN: blood urea nitrogen; Tbi: total bilirubin; Cl: chloride; Alb: albumin; dTbi: direct bilirubin; ALP: alkaline phosphatase; Cr: creatinine; CK: creatine kinase; CK-MB: creatine kinase isoenzyme; CysC: cystatin C; ALT: alanine aminotransferase; AST: aspartate aminotransferase.

Table 6 LR and XGBoost prediction results.

Note: the numbers in parentheses represent the number of indexes in the model.

With the worldwide outbreak of COVID-19, SARS-CoV-2 infection has become a serious threat to public health. Therefore, early prediction and aggressive treatment of mild patients at high risk of progressing to severe disease is critical to reduce mortality, optimize treatment strategies, and maintain the efficient operation of medical systems. The invention presents an efficient XGBoost-based prediction model (AUC 0.8271, 95% CI: 0.8144, 0.8397). This model outperforms traditional logistic regression (AUC 0.6532, 95% CI: ...). In addition, the invention uses SHAP plots to visually explain feature importance and to determine the risk factors that lead to the development of severe COVID-19.
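For readers unfamiliar with the AUC metric quoted above, the following sketch computes it directly from its definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical; the text does not say how the AUC or its confidence interval were computed in practice.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks for three positive and three negative samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

In this toy case eight of the nine positive/negative pairs are ranked correctly, so the AUC is 8/9.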

Each sample in the dataset contained 82 features, including indicators of underlying disease, vital signs, coagulation, routine blood tests, blood biochemistry, and routine urinalysis. From a research point of view, the invention needs to include a set of indexes that sufficiently reflects the state of the patient. From a practical point of view, however, obtaining all the physiological and laboratory indexes takes too long, and the patient's condition may deteriorate during the waiting period. The length of time that a patient waits for index results affects the timeliness of diagnosis and treatment. Therefore, the invention reduces the number of indexes required by the model. The invention uses a stepwise backward regression method to run experiments on models with different numbers of indexes and different parameters, and records the model performance in each experiment. The procedure is to put all features into the XGBoost model and calculate the SHAP value of each feature. In each iteration, the feature with the smallest absolute SHAP value is removed from the model. This process is repeated until no features meet the iteration criterion, and the AUC value is recorded for each iteration.
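The elimination loop described above can be sketched generically. In this illustration the model training and the SHAP computation are passed in as callables, so no particular library API is assumed; `backward_elimination`, `fit_and_score`, `importance`, `min_features` and the toy stand-ins are all hypothetical names. The stopping rule (`min_features`) is also an assumption, since the text does not spell out the iteration criterion.

```python
def backward_elimination(features, fit_and_score, importance, min_features=1):
    """Repeatedly drop the least important feature and record the score.

    fit_and_score(features) -> (model, auc)
    importance(model, features) -> {feature: importance value, e.g. a SHAP summary}
    """
    current = list(features)
    history = []
    while True:
        model, auc = fit_and_score(current)
        history.append((list(current), auc))
        if len(current) <= min_features:
            break
        scores = importance(model, current)
        weakest = min(current, key=lambda f: abs(scores[f]))
        current.remove(weakest)
    return history

# Hypothetical stand-ins so the sketch runs without a trained model.
TOY_IMPORTANCE = {"PT": 0.9, "HCT": 0.6, "pulse": 0.4, "BMI": 0.3, "noise": 0.01}

def toy_fit_and_score(feats):
    # Pretend the AUC grows with the total importance of the retained features.
    return None, 0.5 + 0.2 * sum(TOY_IMPORTANCE[f] for f in feats)

def toy_importance(model, feats):
    return {f: TOY_IMPORTANCE[f] for f in feats}

history = backward_elimination(list(TOY_IMPORTANCE), toy_fit_and_score, toy_importance)
```

The returned `history` pairs each feature subset with its score, which is the record needed to choose a feature count afterwards.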

According to the one-standard-error rule, 15 indexes (PT, PTA, INR, DD, APTT, LDH, CK, magnesium, globulin, HCT, L%, platelet count, pulse, BMI and urine specific gravity) are selected. With these indexes the model still achieves a good AUC value, so the number of indexes is reduced while good prediction performance is maintained. The SHAP plot is used to explain the overall behavior of XGBoost in terms of feature contributions, which improves the interpretability of the model.
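The selection rule (the claims call it the 1 standard error principle) can be sketched as follows: among all candidate feature counts, choose the smallest one whose mean cross-validated AUC lies within one standard error of the best mean. This is one common reading of the rule and is stated here as an assumption; the function name `one_standard_error_choice` and the per-fold numbers are made up for illustration and are not results of the invention.

```python
from math import sqrt
from statistics import mean, stdev

def one_standard_error_choice(fold_aucs_by_size):
    """Pick the smallest feature count whose mean AUC is within one SE of the best."""
    summary = {
        size: (mean(aucs), stdev(aucs) / sqrt(len(aucs)))
        for size, aucs in fold_aucs_by_size.items()
    }
    best_size = max(summary, key=lambda s: summary[s][0])
    best_mean, best_se = summary[best_size]
    threshold = best_mean - best_se
    eligible = [s for s, (m, _) in summary.items() if m >= threshold]
    return min(eligible)

# Made-up per-fold AUC values keyed by number of features (illustration only).
fold_aucs = {
    8: [0.84, 0.82, 0.83, 0.85, 0.81],
    4: [0.83, 0.82, 0.82, 0.84, 0.81],
    2: [0.80, 0.79, 0.80, 0.81, 0.78],
}
chosen = one_standard_error_choice(fold_aucs)
```

With these made-up numbers the four-feature model is chosen: its mean AUC falls within one standard error of the eight-feature model, while the two-feature model does not.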

In addition, the invention is the first to propose that a simplified index combination can still achieve good prediction performance: a rapid preliminary screening result can be obtained using only two laboratory indexes (PT and HCT) combined with three basic indexes (pulse, BMI and presence of underlying disease) (AUC 0.8017, 95% CI: 0.7905, 0.8130). By providing tiered index combinations, the different requirements of medical personnel for prediction accuracy and timeliness in clinical treatment can be met.

The new coronary pneumonia patient outcome prediction method based on the interpretable machine learning algorithm is a new interpretable machine learning method that can predict the probability of disease deterioration in a COVID-19 patient, identify early risk factors for deterioration, and provide an approximate range for each risk factor using SHAP plots.

The invention adopts a machine learning method to predict the disease deterioration of the patient with COVID-19, identifies the risk factors and further determines the approximate early warning range of the risk factors. The invention has the following main beneficial effects:

1) An accurate and effective model is constructed using an interpretable machine learning algorithm to predict whether mild/moderate patients will deteriorate into severe/critical cases.

2) Two index combinations are formed according to the different indexes required by the model. To meet the different requirements for timeliness and accuracy in different scenarios, the two combinations contain different numbers of indexes. From the viewpoint of accuracy, the combination of 15 indexes can be selected, with which the model prediction accuracy is high; from the viewpoint of applicability and practicality, the combination of 5 indexes can be selected. The simplified model reduces the number of indexes required for prediction, which shortens the waiting time. The invention can obtain a prediction result using only 5 indexes, of which 2 are laboratory indexes, and the patient's index data can be obtained quickly by bedside testing.

3) Early risk factors for patient deterioration are identified, and SHAP is used to indicate the approximate early warning range of each of these risk factors for severe new coronary pneumonia.
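The idea of reading an early warning range off a SHAP dependence plot can be sketched as follows. The sketch assumes the plot is available as (feature value, SHAP value) pairs and treats every run of consecutive points with a positive SHAP value as a warning interval; the names `warning_ranges` and `in_warning_range` and the pulse values are hypothetical and are not clinical thresholds. According to the text, the final range is also adjusted using clinical experience, which is not modeled here.

```python
def warning_ranges(points):
    """Return (low, high) feature-value intervals where the SHAP value is positive.

    points: iterable of (feature_value, shap_value) pairs from a dependence plot.
    """
    ranges = []
    start = end = None
    for value, shap_value in sorted(points):
        if shap_value > 0:
            if start is None:
                start = value
            end = value
        elif start is not None:
            ranges.append((start, end))
            start = end = None
    if start is not None:
        ranges.append((start, end))
    return ranges

def in_warning_range(value, ranges):
    """True when a new measurement falls inside any warning interval."""
    return any(low <= value <= high for low, high in ranges)

# Hypothetical (pulse, SHAP value) pairs; not clinical thresholds.
pulse_points = [(60, -0.2), (72, -0.1), (85, -0.05), (98, 0.1), (110, 0.3), (125, 0.4)]
pulse_ranges = warning_ranges(pulse_points)
alert = in_warning_range(104, pulse_ranges)
```

A monitoring system could call `in_warning_range` on each new measurement and raise an alert for medical staff when it returns true.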

In summary, the present invention establishes a high-performance prediction model based on the interpretable machine learning algorithm XGBoost using EMR data from 3028 patients (AUC 0.8271, 95% CI: 0.8144, 0.8397). The 15 high-risk factors for severe progression of COVID-19 and their approximate early warning ranges are determined. In addition, the invention is the first to propose that a simplified index combination can still achieve good prediction performance: a rapid preliminary screening result can be obtained using only two laboratory indexes (PT and HCT) combined with two basic indexes (pulse and BMI) (AUC 0.8017, 95% CI: 0.7905, 0.8130). By providing tiered index combinations, the different requirements of medical personnel for prediction accuracy and timeliness in clinical practice can be met. In short, in the fight against the new coronary pneumonia epidemic, the invention can help reduce mortality, improve prognosis and optimize clinical treatment.

In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and that those skilled in the art may make variations, modifications, substitutions and alterations within the scope of the present invention without departing from the spirit and scope of the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A new coronary pneumonia patient outcome prediction method based on an interpretable machine learning algorithm is characterized by comprising the following steps:

Step S1, extracting COVID-19 patient data from a database, and dividing the patient data into an experimental group and a control group according to how each patient's condition progresses;

Step S2, for each time point at which a patient in the experimental group changes from mild to severe, acquiring the latest recorded values of all indexes within the preceding preset number of days as a positive sample; for all time points of the patients in the control group, acquiring the latest recorded values of all indexes within the preceding preset number of days as negative samples, and randomly sampling the negative samples down to the same number as the positive samples; mixing the positive samples and the negative samples, and interpolating the missing values of all indexes;

Step S3, through ten-fold cross-validation, calculating the model performance under different numbers of indexes and different parameters by a stepwise backward regression method, screening the indexes input into the model according to the model performance, and taking the screened indexes as key risk factors for identifying disease deterioration in COVID-19 patients;

Step S6, defining the early warning range of each key index in the SHAP partial dependence plot according to the prediction result shown in the SHAP partial dependence plot and in combination with clinical experience; identifying, through the machine learning model, the risk probability of a mild patient deteriorating into a severe patient, and sending an alarm prompt to medical staff when a key risk index of the patient enters its early warning range;

Step S7, screening out 15 first-group indexes and 5 second-group indexes by combining the calculation results of the algorithm with the clinical experience of doctors, wherein the 15 first-group indexes comprise: prothrombin time (PT), prothrombin activity (PTA), lactate dehydrogenase (LDH), international normalized ratio (INR), heart rate, body mass index (BMI), D-dimer, creatine kinase (CK), hematocrit, urine specific gravity, magnesium, globulin, activated partial thromboplastin time (APTT), lymphocyte count (L%), and platelet count;

the two index combinations, consisting of the 15 first-group indexes and the 5 second-group indexes, are used to predict the disease state of patients with new coronary pneumonia.

2. The new coronary pneumonia patient outcome prediction method based on the interpretable machine learning algorithm according to claim 1, wherein in step S1, the patient data comprise: demographics, underlying diseases, vital signs, coagulation, routine blood tests, blood biochemistry and routine urinalysis.

3. The new coronary pneumonia patient outcome prediction method based on the interpretable machine learning algorithm as claimed in claim 1, wherein in step S2, after the positive samples and the negative samples are mixed, the missing values of each index of the patient data are interpolated using a random forest regression algorithm.

4. The new coronary pneumonia patient outcome prediction method based on the interpretable machine learning algorithm of claim 1, wherein in step S3 the model performance includes: accuracy, recall, F1 score, AUC, and the 95% CI of the AUC.

5. The method for predicting the outcome of a new coronary pneumonia patient based on the interpretable machine learning algorithm of claim 1, wherein in step S3, calculating the model performance under different numbers of indexes and different parameters by the stepwise backward regression method comprises: putting all the features into an XGBoost model and calculating the SHAP value of each feature; in each iteration, deleting the feature with the smallest absolute SHAP value from the model; repeating these steps until no features meet the iteration criterion; and recording the AUC value of each iteration.

6. The method for predicting the outcome of a new coronary pneumonia patient based on interpretable machine learning algorithm according to claim 1, wherein in the step S4, the index combination comprises: pulse index and BMI index.

7. The method for predicting the outcome of a new coronary pneumonia patient based on the interpretable machine learning algorithm of claim 1, wherein the preset number of days is 10 days to 20 days.

8. The method for predicting the outcome of a new coronary pneumonia patient based on the interpretable machine learning algorithm as claimed in claim 1, wherein in step S3, the number of indexes input into the model is selected according to the one-standard-error rule, and the selected indexes are used as the key risk factors for identifying disease deterioration in COVID-19 patients.

9. The method of claim 1, wherein in step S6, a positive SHAP value for an index in the SHAP plot indicates that the corresponding value range contributes positively to the prediction result, i.e. disease deterioration, whereas a negative SHAP value indicates that the corresponding value range contributes negatively to disease deterioration, and the range with positive SHAP values in the SHAP partial dependence plot is defined as the early warning range.

10. The method of claim 1, wherein the mean and median are used to represent continuous variable measurements, continuous variables are compared using the Wilcoxon rank-sum test, and categorical variables are described as frequencies and percentages and compared using the chi-square test.