CN113782197A

CN113782197A - Novel coronavirus pneumonia (COVID-19) patient outcome prediction method based on interpretable machine learning algorithm

Info

Publication number: CN113782197A
Application number: CN202110887756.2A
Authority: CN
Inventors: 贾立静; 李静; 陈威; 张恒; 魏子健; 王佳明; 郏瑞琪; 俞哲媛; 王照鸿; 李秀成
Original assignee: Beijing Jiaotong University; First Medical Center of PLA General Hospital
Current assignee: Beijing Jiaotong University; First Medical Center of PLA General Hospital
Priority date: 2021-08-03
Filing date: 2021-08-03
Publication date: 2021-12-10
Anticipated expiration: 2041-08-03
Also published as: CN113782197B

Abstract

The invention provides a novel coronavirus pneumonia (COVID-19) patient outcome prediction method based on an interpretable machine learning algorithm, which comprises the following steps: extracting COVID-19 patient data from a database, and dividing the patients into an experimental group and a control group according to whether their condition deteriorated; imputing missing values of the indexes through random forest regression; screening the indexes input into the model, and taking the screened indexes as key risk factors for identifying deterioration of the patient's condition; inputting the key risk factors of a patient into an XGBoost model and a logistic regression model; selecting the XGBoost model, which has the better predictive performance, generating an index combination, predicting with the XGBoost model, and recording the prediction result; defining the early warning range of each key index; sending an alarm prompt to medical staff when a key risk index of the patient enters its early warning range; and, combining the calculation results of the algorithm with the clinical experience of doctors, providing two index combinations, one consisting of 15 first-group indexes and one consisting of 5 second-group indexes, for predicting the condition of COVID-19 patients.

Description

Novel coronavirus pneumonia (COVID-19) patient outcome prediction method based on interpretable machine learning algorithm

Technical Field

The invention relates to the technical field of machine learning, in particular to a novel coronavirus pneumonia (COVID-19) patient outcome prediction method based on an interpretable machine learning algorithm.

Background

The proliferation of new coronavirus disease (COVID-19) infection cases presents a huge challenge to the management of medical resources. Although approximately 81% of patients with COVID-19 exhibit mild or moderate symptoms, some patients experience a sudden worsening of the disease and the disease progresses rapidly to a severe or critical condition. Therefore, early intervention in the exacerbation of a patient with COVID-19 would greatly aid in patient management and allocation of medical resources. In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

1) In published studies on poor prognosis of COVID-19, most still use statistical methods to analyze and describe the characteristics and outcomes of COVID-19 patients. Early risk factors are identified by comparing severe and non-severe patients. However, such statistical methods cannot predict a poor prognosis for an individual patient.

2) In studies that use machine learning algorithms to predict poor prognosis in COVID-19 patients, the predicted outcomes have been limited to ICU admission or death. No study has yet addressed state transitions during disease progression.

3) Although existing research shows that machine learning achieves good prediction results, the models require a large number of indexes, sampling is complex, various laboratory indexes are included, and obtaining all the indexes required by a model takes a long time. The availability and timeliness of the indexes when a machine learning model is used in a real setting are ignored.

4) Meanwhile, whether based on traditional statistical methods or on machine learning methods, existing research only identifies the risk factors that lead to deterioration, hospitalization or death, and does not provide the early warning ranges corresponding to those indexes.

Disclosure of Invention

The object of the present invention is to solve at least one of the technical drawbacks mentioned.

Therefore, the invention aims to provide a novel coronavirus pneumonia (COVID-19) patient outcome prediction method based on an interpretable machine learning algorithm.

In order to achieve the above object, an embodiment of the present invention provides a COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm, including:

step S1, extracting COVID-19 patient data from the database, and dividing the patient data into an experimental group and a control group according to whether the patient's condition changed from mild to severe;

step S2, for each time point at which a patient in the experimental group changes from mild to severe, obtaining the latest recorded values of all indexes within the past preset number of days as a positive sample; for all time points of the patients in the control group, obtaining the latest recorded values of all indexes within the past preset number of days as negative samples, and randomly sampling the negative samples down to the same number as the positive samples; mixing the positive samples and the negative samples, and imputing missing values of all indexes;

step S3, calculating the model performance under different numbers of indexes and different parameters by a stepwise backward regression method with ten-fold cross validation, screening the indexes according to model performance, and taking the screened indexes as key risk factors for identifying deterioration of COVID-19 patients;

step S4, inputting the key risk factors of the patient into the XGBoost model and the logistic regression model, and comparing the AUC values of the two models;

step S5, selecting the XGBoost model, which has the higher AUC value, as the model with better predictive performance, calculating the mean |SHAP| value of every index in the model, ranking the indexes, generating an index combination according to the ranking, predicting with the XGBoost model, and recording the prediction result;

step S6, according to the SHAP partial dependence plots and in combination with clinical experience, defining the early warning range of each key index in its SHAP partial dependence plot; identifying, through the machine learning model, the risk probability that a mild patient deteriorates into a severe patient, and sending an alarm prompt to medical staff when a key risk index of the patient enters its early warning range;

step S7, integrating the calculation results of the algorithm with the clinical experience of doctors to screen out 15 first-group indexes and 5 second-group indexes, wherein the 15 first-group indexes comprise: prothrombin time (PT), prothrombin activity (PTA), lactate dehydrogenase (LDH), international normalized ratio (INR), heart rate, body mass index (BMI), D-dimer, creatine kinase (CK), hematocrit (HCT), urine specific gravity, magnesium, globulin, activated partial thromboplastin time (APTT), lymphocyte count (L%), and platelet count;

the 5 second-group indexes comprise: PT, heart rate, BMI, HCT, and complications;

the two index combinations, consisting of the 15 first-group indexes and the 5 second-group indexes respectively, are used to predict the condition of COVID-19 patients.

Further, in the step S1, the patient data includes indexes of: demographics, underlying diseases, vital signs, blood coagulation, blood routine, blood biochemistry and urine routine.

Further, in step S2, after the positive and negative samples are mixed, a random forest regression algorithm is used to interpolate missing values of each index of the patient data.

Further, in the step S3, the model performance includes: accuracy, recall, F1, AUC, and the 95% confidence interval (CI) of the AUC.

Further, in step S3, calculating the model performance under different numbers of indexes and parameters by the stepwise backward regression method includes: putting all features into an XGBoost model and calculating the SHAP value of each feature; in each iteration, deleting the feature with the smallest absolute SHAP value from the model; repeating until no feature meets the iteration criterion; and recording the AUC value of each iteration.

Further, in the step S4, the index combination includes: pulse index and BMI index.

Further, the preset number of days is 10 to 20 days.

Further, in step S3, the number of indexes input into the model is selected according to the one-standard-error (1SE) rule, and the selected indexes are used as key risk factors for identifying deterioration of COVID-19 patients.

Further, in step S6, a positive SHAP value in the SHAP plot indicates that the index, within that range, contributes positively to the predicted outcome, namely disease deterioration, whereas a negative SHAP value indicates that the index, within that range, contributes negatively to disease deterioration; the range in which the SHAP value is positive in the SHAP partial dependence plot is defined as the early warning range.

Further, continuous variables are represented by their mean and median and compared using the Wilcoxon rank sum test, and categorical variables are described as frequencies and percentages and compared using the chi-square test.

According to the COVID-19 patient outcome prediction method based on the interpretable machine learning algorithm of the embodiment of the invention:

1) An accurate and effective model is constructed using an interpretable machine learning algorithm to predict whether a mild/moderate patient will deteriorate into a severe/critical case.

2) Two index combinations are formed according to the indexes required by the model. Because different scenarios place different demands on timeliness and accuracy, the two combinations contain different numbers of indexes. Where accuracy matters most, the combination of 15 indexes can be selected, which gives high prediction accuracy; where applicability and practicality matter most, the combination of 5 indexes can be selected. Reducing the number of indexes needed for prediction shortens the waiting time. The invention can obtain a prediction result using only 5 indexes, of which 2 are laboratory indexes, and the patient's index data can be obtained quickly through bedside testing.

3) Early risk factors for patient deterioration are identified, and SHAP is used to indicate the approximate early warning ranges of these risk factors for severe COVID-19.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow chart of a COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a model processing flow according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a data extraction flow according to an embodiment of the invention;

FIG. 4 is a diagram illustrating changes in the AUC value of the XGBoost model as the number of features is iterated, according to an embodiment of the invention;

FIG. 5 is a diagram illustrating the key risk factors ranked by mean |SHAP| value according to an embodiment of the invention;

FIG. 6A is a partial dependency graph (BMI) of SHAP according to an embodiment of the invention;

FIG. 6B is a gender interaction graph (pulse) according to an embodiment of the present invention;

FIGS. 7A-7E are SHAP partial dependence graphs of coagulation indices according to embodiments of the present invention;

FIGS. 8A-8C are SHAP partial dependency graphs of blood routine according to embodiments of the present invention;

FIGS. 9A to 9D are SHAP partial dependence graphs of blood biochemistry according to embodiments of the present invention;

FIG. 10 is a SHAP partial dependence graph of urine routine according to an embodiment of the present invention;

FIGS. 11A to 11C are model ROC graphs under different index combinations according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be illustrative of the invention and are not to be construed as limiting the invention.

In order to solve the problems, the invention provides a method for predicting the disease condition deterioration of a COVID-19 patient based on an interpretable machine learning method, determines an early warning index of the disease deterioration of the COVID-19 patient, and provides an approximate warning range of the early warning index.

As shown in FIG. 1 and FIG. 2, the COVID-19 patient outcome prediction method based on the interpretable machine learning algorithm of the embodiment of the present invention includes the following steps:

Step S1, COVID-19 patient data is extracted from the database and divided into an experimental group and a control group according to whether the patient's condition changed from mild to severe.

In an embodiment of the invention, the patient data comprises indexes of demographics (e.g., age and gender), underlying diseases, vital signs, blood coagulation, blood routine, blood biochemistry and urine routine, as shown in Table 1.

Specifically, as shown in FIG. 3, COVID-19 patient data, including the patient's demographic, underlying disease, vital sign, blood coagulation, blood routine, blood biochemistry and urine routine indexes, is extracted from the database according to the clinical classification criteria of the "Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia" published by the General Office of the National Health Commission. The data are then divided into a severe group (experimental group) and a non-severe group (control group) according to whether the patient's condition changed from mild/moderate to severe/critical.

TABLE 1

Step S2, data cleaning and data imputation are performed on the patient data to handle abnormal data and missing data.

In this step, for each time point at which a patient in the experimental group changes from mild to severe, the latest recorded values of all indexes within the past preset number of days are obtained as a positive sample; for all time points of the patients in the control group, the latest recorded values of all indexes within the past preset number of days are obtained as negative samples, which are randomly sampled down to the same number as the positive samples; the positive and negative samples are then mixed, and the missing values of each index are imputed.

In the embodiment of the invention, after the positive and negative samples are mixed, the missing values of all indexes of the patient data are imputed using a random forest regression algorithm.

In yet another embodiment of the present invention, the predetermined number of days is 10 days to 20 days. Preferably, the predetermined number of days is 15 days.
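The sample construction in step S2 can be illustrated with a minimal sketch. It assumes each record is a (date, index name, value) tuple and uses the preferred 15-day window; the function names `latest_values` and `undersample`, the record layout, and all values are hypothetical and are not taken from the embodiment.

```python
import random
from datetime import date, timedelta


def latest_values(records, time_point, window_days=15):
    """Return the most recent value of each index recorded within
    `window_days` before `time_point` (inclusive)."""
    start = time_point - timedelta(days=window_days)
    latest = {}
    for day, index_name, value in sorted(records):
        if start <= day <= time_point:
            latest[index_name] = value  # later records overwrite earlier ones
    return latest


def undersample(negatives, n_positive, seed=0):
    """Randomly draw as many negative samples as there are positive samples."""
    rng = random.Random(seed)
    return rng.sample(negatives, n_positive)


# Hypothetical records for one patient: (date, index name, value)
records = [
    (date(2020, 2, 1), "PT", 13.1),
    (date(2020, 2, 10), "PT", 14.6),
    (date(2020, 2, 12), "heart_rate", 96),
    (date(2020, 1, 1), "BMI", 27.0),  # outside the 15-day window, so ignored
]
sample = latest_values(records, date(2020, 2, 14))
print(sample)

# Hypothetical pool of 10 negative samples, balanced against 3 positives
negatives = [{"id": i} for i in range(10)]
balanced = undersample(negatives, n_positive=3)
print(len(balanced))
```

Indexes with no record inside the window (BMI in this toy case) are simply absent from the sample; these are the gaps that the imputation step then fills.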

Step S3, the model performance under different numbers of indexes and different parameters is calculated by a stepwise backward regression method with ten-fold cross validation.

In this step, ten-fold cross validation divides the data set into 10 parts; in each round, 9 parts are used as training data and the remaining part is used as test data. The model performance under different numbers of indexes and parameters is then calculated. The model performance comprises: accuracy, recall, F1, AUC, and the 95% confidence interval (CI) of the AUC.

Then, the indexes are screened according to model performance: the number of indexes input into the model is selected according to the one-standard-error (1SE) rule, and the screened indexes are used as key risk factors for identifying deterioration of COVID-19 patients.

In this step, calculating the model performance under different numbers of indexes and parameters by the stepwise backward regression method includes: putting all features into an XGBoost model and calculating the SHAP (SHapley Additive exPlanations) value of each feature; in each iteration, deleting the feature with the smallest absolute SHAP value from the model; repeating until no feature meets the iteration criterion; and recording the AUC value of each iteration. FIG. 4 is a schematic diagram of the changes in the AUC value of the XGBoost model as the number of features is iterated, according to the embodiment of the present invention.
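The backward elimination loop and the one-standard-error selection described above can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the model training and SHAP computation are replaced by two caller-supplied functions (`importance_fn` and `cv_auc_fn`, both hypothetical), the stopping criterion is simplified to a minimum feature count, and the 1SE rule is taken in its common form (the smallest feature set whose mean AUC is within one standard error of the best mean AUC).

```python
import statistics


def backward_elimination(features, importance_fn, cv_auc_fn, min_features=1):
    """Repeatedly drop the feature with the smallest mean |SHAP| and record
    the cross-validated AUCs obtained with each feature set."""
    history = []
    current = list(features)
    while len(current) >= min_features:
        history.append((list(current), cv_auc_fn(current)))
        if len(current) == min_features:
            break
        importance = importance_fn(current)
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    return history


def one_standard_error_choice(history):
    """Pick the smallest feature set whose mean AUC lies within one
    standard error of the best mean AUC."""
    stats = []
    for feats, aucs in history:
        mean = statistics.mean(aucs)
        se = statistics.stdev(aucs) / len(aucs) ** 0.5
        stats.append((feats, mean, se))
    _, best_mean, best_se = max(stats, key=lambda s: s[1])
    eligible = [s for s in stats if s[1] >= best_mean - best_se]
    return min(eligible, key=lambda s: len(s[0]))[0]


# Hypothetical stand-ins for a trained model: fixed importances and fold AUCs
toy_importance = {"PT": 0.9, "BMI": 0.6, "heart_rate": 0.5, "noise": 0.01}
toy_auc = {
    4: [0.90, 0.92, 0.91],
    3: [0.90, 0.91, 0.92],
    2: [0.85, 0.86, 0.84],
    1: [0.70, 0.72, 0.71],
}

history = backward_elimination(
    list(toy_importance),
    importance_fn=lambda feats: {f: toy_importance[f] for f in feats},
    cv_auc_fn=lambda feats: toy_auc[len(feats)],
)
selected = one_standard_error_choice(history)
print(selected)
```

In a real pipeline, `cv_auc_fn` would retrain the model on each fold and `importance_fn` would return the mean absolute SHAP value per feature for the refitted model.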

Shapley additive explanations are described below:

SHapley Additive exPlanations (SHAP) are used to improve the interpretability of the results. This model-agnostic approach is a recently developed method for interpreting the output of machine learning models. The goal of SHAP is to explain the prediction for an instance x by computing the contribution of each feature to that prediction. In other words, the technique computes Shapley values from coalitional game theory. The calculation can be expressed as:

$$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j$$

where $g$ is the explanation model, $z' \in \{0, 1\}^M$ is the vector of simplified features, $M$ is the maximum feature size, $\phi_0$ is the constant term of the explanation model, and $\phi_j \in \mathbb{R}$ is the attribution of the $j$-th feature. In addition, SHAP partial dependence plots are used to illustrate the effect of a change in a single feature on the severity of COVID-19. A SHAP dependence plot represents the marginal effect of each feature on the predicted outcome of the machine learning model, and may reveal the exact form of the relationship (e.g., linear, monotonic, or more complex). After considering the individual features, the present study also considered combined feature effects (interaction effects).
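The additive attribution idea behind SHAP can be checked on a toy model with an exact Shapley computation. The sketch below averages marginal contributions over all feature orderings, which is only feasible for a handful of features. Representing an absent feature by a single baseline value is a simplifying assumption (SHAP implementations average over background data), and `toy_model` and its coefficients are hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings. Absent features take their baseline value."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        z = list(baseline)
        prev = model(z)
        for j in order:
            z[j] = x[j]           # switch feature j from baseline to its value
            curr = model(z)
            phi[j] += curr - prev  # marginal contribution of feature j
            prev = curr
    return [p / len(orderings) for p in phi]


def toy_model(z):
    # Hypothetical risk score with an interaction between features 0 and 1
    return 0.1 + 0.5 * z[0] + 0.2 * z[1] + 0.3 * z[0] * z[1] - 0.1 * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi0 = toy_model(baseline)            # constant term of the explanation
phi = shapley_values(toy_model, x, baseline)
print(phi0, phi, phi0 + sum(phi), toy_model(x))
```

The constant term plus the per-feature attributions reproduces the model output for the instance, and the interaction term is split equally between the two interacting features.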

And step S4, inputting the key risk factors of the patient into the XGboost model and the logistic regression model, and comparing the AUC values of the two models, wherein the evaluation indexes of the models are the same as those in the step S3.

The extreme gradient boosting (XGBoost) model applied in the present invention is explained below:

XGBoost is derived from gradient boosted decision trees and produces strong results on a variety of machine learning problems. In this model, the importance of candidate predictors is ranked by selection frequency. The importance values are then scaled so that they sum to 100, which means that each value can be interpreted as a share of the overall model importance. Furthermore, a single prediction in XGBoost can be explained by breaking the decision path into one component per feature. In this way, a decision can be traced through the tree, and a prediction can be interpreted through the contribution added at each decision node.

XGBoost uses decision trees as weak classifiers, with each new tree iteratively fitting the residuals of the previous model. Furthermore, the algorithm employs regularization to control the complexity of the trees, thereby avoiding overfitting and simplifying the model. The principle can be explained as follows. Let

$$D = \{(x_i, y_i)\}, \quad |D| = n, \; x_i \in \mathbb{R}^m, \; y_i \in \mathbb{R}$$

denote a data set with n samples and m features. The output of the additive tree model is

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where K is the number of trees. The space of trees F can be defined as:

$$F = \{ f(x) = \omega_{q(x)} \}, \quad q: \mathbb{R}^m \to \{1, \dots, T\}, \; \omega \in \mathbb{R}^T$$

where q maps a sample to a leaf index and T indicates the number of leaves. Each tree $f_k$ is thus determined by its structure q and its leaf weights ω. To capture complex relationships in the data, the hyper-parameters have to be tuned. Therefore, a grid search is performed to determine the optimal combination of parameter values.
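A small sketch may help make the additive tree model concrete: each tree is a structure q that maps a sample to a leaf index, plus a vector of leaf weights ω, and the ensemble score is the sum of the selected leaf weights. The two trees, thresholds and weights below are hypothetical. The complexity penalty γT + ½λ‖ω‖² is the regularization term of the standard XGBoost formulation and is included as an assumption, since the text above does not spell it out.

```python
def make_tree(q, weights):
    """A tree f(x) = w[q(x)]: q maps a sample to a leaf index,
    and `weights` holds one score per leaf."""
    return lambda x: weights[q(x)]


def ensemble_score(trees, x):
    """Additive tree model: the raw score is the sum of all tree outputs."""
    return sum(f(x) for f in trees)


def complexity(weights, gamma=1.0, lam=1.0):
    """Regularization term of the standard XGBoost objective (assumed here):
    gamma * T + 0.5 * lam * sum(w^2), with T the number of leaves."""
    return gamma * len(weights) + 0.5 * lam * sum(w * w for w in weights)


# Two hypothetical single-split trees (stumps) over a patient record
w1 = [-0.4, 0.6]
w2 = [-0.2, 0.3]
tree1 = make_tree(lambda x: 0 if x["PT"] < 14.0 else 1, w1)
tree2 = make_tree(lambda x: 0 if x["BMI"] < 25.0 else 1, w2)

patient = {"PT": 15.0, "BMI": 22.0}
score = ensemble_score([tree1, tree2], patient)
print(score)            # 0.6 from tree1 plus -0.2 from tree2
print(complexity(w1))   # penalty for a two-leaf tree
```

Larger γ or λ values penalize trees with more leaves or larger leaf weights, which is why they are typical candidates for a grid search.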

The logistic regression model is described below:

Logistic regression (LR) is a traditional machine learning algorithm that has been widely used in medical classification tasks. Rather than fitting a straight line or hyperplane directly, it uses the logistic function to constrain the output of a linear equation to between 0 and 1. The logistic function is defined as:

$$\mathrm{logistic}(\eta) = \frac{1}{1 + \exp(-\eta)}$$

The interpretability of the algorithm comes from how the prediction changes when a feature $x_j$ is changed by 1 unit. Logistic regression is an extension of the linear regression model.
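The effect of a one-unit change in a feature can be illustrated numerically. In the sketch below, the intercept and coefficients are hypothetical; the point is that increasing feature j by 1 multiplies the odds p / (1 - p) by exp(β_j), regardless of the other feature values.

```python
import math


def logistic(eta):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_proba(intercept, coefs, x):
    """Logistic regression prediction for one sample."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return logistic(eta)


def odds(p):
    return p / (1.0 - p)


# Hypothetical fitted parameters for two features
intercept, coefs = -1.0, [0.8, -0.3]

x = [1.0, 2.0]
x_plus = [2.0, 2.0]  # first feature increased by 1 unit

p0 = predict_proba(intercept, coefs, x)
p1 = predict_proba(intercept, coefs, x_plus)
odds_ratio = odds(p1) / odds(p0)
print(p0, p1, odds_ratio, math.exp(coefs[0]))
```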

Step S5, the XGBoost model, which has the higher AUC value, is selected as the model with better predictive performance; the mean |SHAP| value of every index in the model is calculated and the indexes are ranked; an index combination is generated according to the ranking; prediction is performed with the XGBoost model; and the prediction result is recorded. FIG. 5 is a diagram illustrating the key risk factors ranked by mean |SHAP| value according to an embodiment of the invention. PT: prothrombin time; LDH: lactate dehydrogenase; INR: international normalized ratio; DD: D-dimer; CK: creatine kinase; APTT: activated partial thromboplastin time; L: lymphocyte count; SHAP: SHapley Additive exPlanations.

Specifically, after sorting, 5 more important indexes which are easy to obtain clinically are selected by integrating the sorting result of the indexes and doctor suggestions, the indexes are used as a simplified index combination, an XGboost model is used for prediction, and the prediction result is recorded.
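The ranking by mean |SHAP| value can be sketched in a few lines. The SHAP matrix below (one row per sample, one column per index) is hypothetical, and the function name `rank_by_mean_abs_shap` is illustrative. The final choice of a simplified combination also involves doctor suggestions and clinical availability, which code cannot capture.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across samples."""
    n = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[i]) for row in shap_rows) / n
        for i, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values for three samples and three indexes
names = ["PT", "heart_rate", "BMI"]
shap_rows = [
    [0.30, -0.10, 0.05],
    [-0.50, 0.20, -0.05],
    [0.40, -0.30, 0.20],
]

ranking = rank_by_mean_abs_shap(shap_rows, names)
top2 = [name for name, _ in ranking[:2]]
print(ranking)
print(top2)
```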

In an embodiment of the present invention, the index combination includes: pulse index and BMI index. FIG. 6A is a partial dependency graph (BMI) of SHAP according to an embodiment of the invention; FIG. 6B is a gender interaction diagram (pulse) according to an embodiment of the present invention.

FIGS. 7A to 7E are SHAP partial dependence graphs of coagulation indexes according to an embodiment of the present invention: (A) PT, (B) PTA, (C) INR, (D) D-dimer, (E) APTT. PT: prothrombin time; PTA: prothrombin activity; INR: international normalized ratio; APTT: activated partial thromboplastin time.

FIGS. 8A to 8C are SHAP partial dependence graphs of blood routine indexes according to embodiments of the present invention: (A) hematocrit, (B) L%, (C) platelet count. L: lymphocyte count.

FIGS. 9A to 9D are SHAP partial dependence graphs of blood biochemistry indexes according to embodiments of the present invention: (A) LDH, (B) CK, (C) magnesium, (D) globulin. LDH: lactate dehydrogenase; CK: creatine kinase.

Step S6, according to the SHAP partial dependence plots and in combination with clinical experience, the early warning range of each key index is defined in its SHAP partial dependence plot. A positive SHAP value in the plot indicates that the index, within that range, contributes positively to the predicted outcome, namely disease deterioration; conversely, a negative SHAP value indicates that the index, within that range, contributes negatively to disease deterioration. The range in which the SHAP value is positive is defined as the early warning range. The machine learning model identifies the risk probability that a mild patient deteriorates into a severe patient, and when a key risk index of the patient enters its early warning range, an alarm prompt is sent to medical staff, which provides a reference for clinical intervention and improves the quality of care.
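As a rough illustration of step S6, the sketch below derives candidate warning ranges from the points of a SHAP dependence plot by collecting runs of consecutive index values whose SHAP value is positive, and then checks whether a new measurement falls inside such a range. The PT values and SHAP values are hypothetical, and the function names are illustrative. In the method described here, the final ranges are set in combination with clinical experience, so this is only a starting point.

```python
def warning_ranges(values, shap_values):
    """Return (low, high) intervals of the index over which consecutive
    points (sorted by index value) have a positive SHAP value."""
    points = sorted(zip(values, shap_values))
    ranges, start, last = [], None, None
    for v, s in points:
        if s > 0:
            if start is None:
                start = v      # a positive run begins here
            last = v
        elif start is not None:
            ranges.append((start, last))  # the positive run has ended
            start = None
    if start is not None:
        ranges.append((start, last))
    return ranges


def in_warning_range(value, ranges):
    """True when a measurement falls inside any warning interval."""
    return any(lo <= value <= hi for lo, hi in ranges)


# Hypothetical dependence-plot points for PT (seconds) and their SHAP values
pt = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
shap = [-0.3, -0.2, -0.05, 0.1, 0.4, 0.6]

ranges = warning_ranges(pt, shap)
print(ranges)
print(in_warning_range(15.2, ranges), in_warning_range(12.5, ranges))
```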

And step S7, screening out 15 first group indexes and 5 second group indexes by combining the calculation result of the algorithm and the clinical experience of the doctor.

Wherein the 15 first set of metrics include: prothrombin Time (PT), prothrombin activity (PTA), Lactate Dehydrogenase (LDH), International Normalized Ratio (INR), heart rate, Body Mass Index (BMI), D-dimer, Creatine Kinase (CK), hematocrit, urine specific gravity, magnesium, globulin, Activated Partial Thromboplastin Time (APTT), lymphocyte count (L%), and platelet count. The 5 second set of metrics include: PT, heart rate, BMI, HCT and complications.

The following explains a model evaluation index calculation method in the present invention:

In the field of machine learning, a confusion matrix is a visualization tool for evaluating the quality of a classification model. Each column of the matrix represents the class predicted by the model, and each row represents the true class of the samples. Table 2 shows the confusion matrix of a binary classification model.

TABLE 2

Here, True Positive (TP) denotes the true positive class, i.e., the true class of the sample is positive and the model also predicts it as positive.

False Negative (FN) denotes the false negative class, i.e., the true class of the sample is positive, but the model predicts it as negative.

False Positive (FP) denotes the false positive class, i.e., the true class of the sample is negative, but the model predicts it as positive.

True Negative (TN) denotes the true negative class, i.e., the true class of the sample is negative and the model also predicts it as negative.

The indexes for evaluating model accuracy derived from the confusion matrix are:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

which indicates the accuracy of the model. In general, the higher the accuracy, the better the model.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

which indicates the precision. In general, the higher the precision, the better the model.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

which indicates the recall. In general, the higher the recall, the more positive samples the model predicts correctly, and the better the model.

In general, the higher the precision and the recall, the better the model. In practice, however, the two can conflict. For example, in an extreme case where the model returns only one result and that result is correct, precision is 100% but recall is very low; whereas if all results are returned, recall is 100% but precision is very low. The most common approach is therefore to introduce a combined evaluation index, the F-measure (also known as the F-score, i.e., the weighted harmonic mean of precision and recall):

$$F = \frac{(\alpha^2 + 1) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\alpha^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$

where α = 1, which gives the F1 score, $F_1 = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$.
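The evaluation indexes above can be computed directly from the four confusion matrix counts. The sketch below uses hypothetical labels (1 for the positive class, 0 for the negative class) and does not guard against empty denominators, which a production implementation would need to handle.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def evaluation_metrics(tp, fn, fp, tn, alpha=1.0):
    """Accuracy, precision, recall and F-measure (alpha = 1 gives F1)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = ((alpha ** 2 + 1) * precision * recall
                 / (alpha ** 2 * precision + recall))
    return accuracy, precision, recall, f_measure


# Hypothetical true labels and model predictions for 10 samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f1 = evaluation_metrics(*counts)
print(counts)
print(accuracy, precision, recall, f1)
```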

In the present example, the efficiency of the model was evaluated by ROC curve and AUC.

A receiver operating characteristic curve (ROC curve) is also called a sensitivity curve; the points on the curve all reflect responses to the same signal stimulus, obtained under different decision criteria.

Horizontal axis: false positive rate (FPR), i.e., 1 - specificity, the proportion of all actual negative instances that are predicted as positive.

Vertical axis: true positive rate (TPR), i.e., sensitivity (coverage of the positive class).

For a binary problem, instances are classified into positive (positive) or negative (negative) classes. However, in practice, four cases occur when classifying.

(1) If an instance is positive and is predicted as positive, it is a true positive (TP).

(2) If an instance is positive but is predicted as negative, it is a false negative (FN).

(3) If an instance is negative but is predicted as positive, it is a false positive (FP).

(4) If an instance is negative and is predicted as negative, it is a true negative (TN).

TP: number of correct positive predictions (hits).

FN: number of misses, i.e., matches that were not found.

FP: number of false alarms, i.e., incorrect matches.

TN: number of correct rejections, i.e., non-matches that were correctly rejected.

These cases are tabulated in Table 3 below, where 1 denotes the positive class and 0 denotes the negative class:

TABLE 3

From the above table, the calculation formula for the horizontal and vertical axes is:

(1) True positive rate (TPR): TP/(TP + FN), the proportion of all actual positive instances that the classifier predicts as positive. This is the sensitivity.

(2) False positive rate (FPR): FP/(FP + TN), the proportion of all actual negative instances that the classifier predicts as positive. This equals 1 - specificity.

(3) True negative rate (TNR): TN/(FP + TN), the proportion of all actual negative instances that the classifier predicts as negative; TNR = 1 - FPR. This is the specificity.

AUC (area under the curve): the area under the ROC curve, which lies between 0 and 1. The AUC gives a single number for evaluating the quality of a classifier; the larger the value, the better.

The AUC is a probability: when one positive sample and one negative sample are selected at random, it is the probability that the classification algorithm ranks the positive sample ahead of the negative sample according to the computed score. The larger the AUC, the more likely the algorithm is to rank positive samples ahead of negative samples, and hence the better the classification.
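The probabilistic reading of the AUC can be verified with a direct pairwise count: compare every positive sample with every negative sample and record how often the positive one receives the higher score, counting ties as one half. The labels and scores below are hypothetical, and this quadratic-time version is meant for illustration rather than for large data sets.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative sample."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted risk scores
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]

auc = auc_by_ranking(y_true, scores)
print(auc)  # 8 of the 9 positive-negative pairs are ordered correctly
```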

FIGS. 11A to 11C are model ROC graphs under different index combinations according to an embodiment of the present invention: receiver operating characteristic curves of (A) LR (15-index combination), (B) XGBoost (15-index combination), and (C) XGBoost (5-index combination) for predicting COVID-19 exacerbation. AUC: area under the curve; LR: logistic regression; XGBoost: extreme gradient boosting.

The invention incorporates the data of inpatients in the warrior fire mountain hospital in china from 2 months 2 to 4 months 1 day in 2020. All patients were positively diagnosed as COVID-19 pneumonia by nucleic acid detection, and classified into light, medium, heavy and critical 4 clinical classifications according to the diagnosis and treatment protocol for novel coronavirus pneumonia (eighth edition) specified by the national health and health committee of the people's republic of China. In the present invention, both mild and moderate patients are treated as mild cases. All other patients were considered to be severely ill. In the present study, the present invention predicts whether a patient's condition will transition from mild to severe.

The present invention collects Electronic Medical Records (EMR) of all patients during hospital stay in the fire mountain hospital including epidemiology, demographics, clinical features, laboratory indices, past medical history, exposure history, symptoms, chest computer tomography and any treatment (i.e. antiviral treatment, corticosteroid treatment, respiratory support and renal replacement therapy). All data were reviewed by a team of physicians. In order to more accurately identify high-risk factors that cause mild patients to worsen to severe/critical conditions, mild patients were classified into severe (experimental group) and non-severe (control group) according to whether or not they worsen to severe during hospitalization (see fig. 2A). Since the disease progression is dynamic, more than one exacerbation (on average 2.9 per patient) occurs in 35.7% of patients in the severe group during hospitalization. Each transition from light to heavy in the patient was considered a positive sample in the study, while the control group of patients continued to be in a light state. Due to the abundant sources of the negative samples, the number of the negative samples is obviously higher than that of the positive samples, so that the class imbalance between the positive samples and the negative samples is caused. Therefore, a random undersampling technique is employed to balance the number of positive and negative samples.

The model inputs include three general categories of features common in EMRs: (1) demographic variables (such as age and gender); (2) complications; (3) clinical and laboratory results. Any feature with more than 50% of its values missing was excluded. All remaining missing values in the resulting data were filled in using a random forest (see FIG. 2B). This procedure produced 82 features for inclusion in the model (see table 4).
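The exclusion rule for sparsely recorded features can be sketched as follows. The sketch assumes each record is a dictionary that maps a feature name to a value, with `None` marking a missing value; the function name `drop_sparse_features` and the toy records are hypothetical. The subsequent random forest imputation is not shown.

```python
def drop_sparse_features(rows, max_missing_fraction=0.5):
    """Return the names of features whose missing fraction does not exceed the limit.

    Assumption: a missing value is an absent key or a value of None.
    """
    if not rows:
        return []
    names = sorted({name for row in rows for name in row})
    kept = []
    for name in names:
        missing = sum(1 for row in rows if row.get(name) is None)
        # Exclude only features that are missing in MORE than 50% of records.
        if missing / len(rows) <= max_missing_fraction:
            kept.append(name)
    return kept


# Toy records (invented values): CK is missing in 3 of 4 rows and is dropped.
rows = [
    {"PT": 12.1, "HCT": None, "CK": None},
    {"PT": None, "HCT": 0.41, "CK": None},
    {"PT": 13.0, "HCT": 0.39, "CK": None},
    {"PT": 11.8, "HCT": None, "CK": 80.0},
]
kept = drop_sparse_features(rows)
```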

In addition, the performance of the model was evaluated using 10-fold cross-validation. In this process, the data set is randomly partitioned into 10 equally sized subsamples; in each round, 9 of them are used to train the model and the remaining 1 is used to validate it.
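The partitioning behind 10-fold cross-validation can be sketched as below. This is an illustrative index splitter only, assuming that each of the 10 subsamples serves once as the validation set; the function name `k_fold_indices` and the seed are hypothetical, and model training is left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation


# With 100 samples, every round trains on 90 samples and validates on 10.
splits = list(k_fold_indices(100, k=10))
```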

The machine learning algorithms employed by the present invention are described below. Various machine learning and deep learning algorithms have been applied to this type of problem. However, most techniques prioritize prediction or classification while neglecting interpretation and the selection and analysis of informative risk factors. Deep learning models contain many hidden layers, need a large number of training samples, and are difficult for a clinician to interpret. The present invention therefore treats interpretability as a core requirement when selecting a machine learning model. XGBoost and Logistic Regression (LR) were used to predict whether mild COVID-19 patients will progress to severe disease.

The invention uses the mean (standard deviation, SD) and the median (interquartile range, IQR) to describe continuous variables. Since the data of the present invention are not normally distributed, continuous variables are compared between groups using the Wilcoxon rank-sum test. Categorical variables are described as frequencies and percentages and compared using the chi-square test. All statistical tests were two-tailed. Table 4 shows the patient-based baseline characteristics.

TABLE 4

Note: DM, diabetes mellitus; CAD, coronary artery disease; CCB, calcium channel blocker; ARB, angiotensin II type 1 receptor blocker.

TABLE 5 sample-based patient Baseline characteristics

Note: BMI, body mass index; CRP, C-reactive protein; N, neutrophil count; M, monocyte count; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; PT, prothrombin time; TT, thrombin time; INR, international normalized ratio; APTT, activated partial thromboplastin time; alpha-HBDH, alpha-hydroxybutyrate dehydrogenase, hydroxyl-promoter, gamma-promoter, cholesterol-promoter.

Table 6 LR and XGBoost prediction results.

Note: the numbers in parentheses represent the number of indexes in the model.

With the worldwide outbreak of COVID-19, SARS-CoV-2 infection has become a serious threat to public health. Early prediction and active treatment of mild patients at high risk of deterioration are therefore critical to reducing mortality, optimizing treatment strategies, and keeping medical systems operating efficiently. The invention shows that an efficient XGBoost-based prediction model (AUC 0.8271, 95% CI: 0.8144, 0.8397) can use existing EMR data to estimate the probability that a mild patient will deteriorate into a severe patient. The model outperformed traditional logistic regression (AUC 0.6532, 95% CI: 0.6421, 0.6642). In addition, the invention uses SHAP plots to explain feature importance visually and to determine the risk factors that lead to severe COVID-19.

Each sample in the dataset contained 82 features, including indicators of underlying disease, vital signs, coagulation, routine blood tests, blood biochemistry, and routine urine tests. From a research point of view, the model should include a set of indexes that sufficiently reflects the patient's state; from a practical point of view, however, obtaining all physiological and laboratory indexes takes too long. While waiting for results, the patient's condition may deteriorate, so the waiting time affects the timeliness of diagnosis and treatment. The invention therefore reduces the number of indexes the model requires. It uses a stepwise backward regression method to run experiments on models with different numbers of indexes and different parameters, and records the model performance in each experiment. First, all features are put into the XGBoost model and the SHAP value of each feature is calculated. In each iteration, the feature with the smallest absolute SHAP value is removed from the model. This process is repeated until no feature meets the iteration criterion, and the AUC value is recorded for each iteration.
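The elimination loop described above can be sketched in a model-agnostic way. In this sketch, `importance_fn` and `score_fn` are hypothetical callbacks that stand in for "fit XGBoost and compute a per-feature SHAP magnitude" and "compute the cross-validated AUC"; the stopping rule (`min_features`) is an assumption, since the text does not spell out the iteration criterion, and all toy numbers are invented.

```python
def backward_eliminate(features, importance_fn, score_fn, min_features=1):
    """Repeatedly drop the feature with the smallest absolute importance.

    Returns a history of (feature_list, score) pairs, one per iteration.
    Assumption: iteration stops once only `min_features` features remain.
    """
    current = list(features)
    history = []
    while True:
        history.append((list(current), score_fn(current)))
        if len(current) <= min_features:
            break
        importances = importance_fn(current)
        weakest = min(current, key=lambda name: abs(importances[name]))
        current.remove(weakest)
    return history


# Toy stand-ins (invented): fixed importances instead of SHAP values from a fitted model.
toy_importance = {"PT": 0.9, "HCT": 0.6, "Pulse": 0.4, "BMI": 0.3, "noise": 0.01}


def toy_importance_fn(feats):
    return {name: toy_importance[name] for name in feats}


def toy_score_fn(feats):
    # Placeholder for a cross-validated AUC; it only rewards informative features.
    return 0.5 + 0.1 * sum(name != "noise" for name in feats)


history = backward_eliminate(list(toy_importance), toy_importance_fn, toy_score_fn)
```

In a real pipeline the importances would be recomputed from a refitted model at every iteration, which is why `importance_fn` receives the current feature list rather than a fixed table.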

According to the one-standard-error rule, 15 indexes (PT, PTA, INR, DD, APTT, LDH, CK, magnesium, globulin, HCT, L%, platelet count, pulse, BMI, and urine specific gravity) were selected. With these indexes the model still achieves a good AUC, so the number of indexes is reduced while good prediction performance is maintained. SHAP plots are used to explain the overall behavior of XGBoost in terms of feature contributions, which improves the interpretability of the model.
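Claim 8 names the selection criterion as the 1 standard error rule. One common reading of that rule is to pick the smallest model whose mean AUC lies within one standard error of the best mean AUC. The sketch below follows that reading as an assumption; the function name `one_standard_error_choice` and the toy results are hypothetical and are not the invention's measurements.

```python
def one_standard_error_choice(results):
    """Pick the smallest model within one standard error of the best mean AUC.

    results: list of (n_features, mean_auc, standard_error) tuples.
    """
    best = max(results, key=lambda r: r[1])
    threshold = best[1] - best[2]
    eligible = [r for r in results if r[1] >= threshold]
    return min(eligible, key=lambda r: r[0])


# Toy results (invented): the 4-feature model is within one SE of the best model.
results = [
    (8, 0.80, 0.02),
    (6, 0.79, 0.02),
    (4, 0.785, 0.02),
    (2, 0.70, 0.02),
]
chosen = one_standard_error_choice(results)
```

Here the threshold is 0.80 minus 0.02, so the models with 8, 6, and 4 features qualify and the 4-feature model is chosen.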

In addition, the invention is the first to show that a simplified index combination can also achieve good prediction performance: a rapid preliminary screening result can be obtained using only two laboratory indexes (PT and HCT) and three basic indexes (pulse, BMI, and presence of underlying disease) (AUC 0.8017, 95% CI: 0.7905, 0.8130). Combining the indexes in this tiered way meets the different requirements of medical personnel for prediction accuracy and timeliness in clinical practice.

The COVID-19 patient outcome prediction method based on an interpretable machine learning algorithm predicts the probability that the condition of a COVID-19 patient will deteriorate, identifies early risk factors for that deterioration, and uses SHAP plots to provide an approximate range for each risk factor.

The invention adopts a machine learning method to predict the disease deterioration of the patient with COVID-19, identifies the risk factors and further determines the approximate early warning range of the risk factors. The invention has the following main beneficial effects:

2) Two index combinations are formed according to the number of indexes the model requires. Because different scenarios place different demands on timeliness and accuracy, the two combinations contain different numbers of indexes. Where accuracy matters most, the combination of 15 indexes can be selected, which gives high prediction accuracy; where applicability and practicality matter most, the combination of 5 indexes can be selected. The simplified model reduces the number of indexes required for prediction, which shortens the waiting time. The invention can obtain a prediction result using only 5 indexes, of which 2 are laboratory indexes, and the patient's index data can be obtained quickly by bedside testing.

3) Early risk factors for patient deterioration are identified, and SHAP is used to indicate the approximate early-warning ranges of these risk factors for severe COVID-19.
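Claim 9 defines the early-warning range as the range of an indicator over which its SHAP value is positive. A minimal sketch of that definition follows, assuming the per-sample SHAP values of one indicator have already been computed and that the positive region can be summarized by its smallest and largest indicator values; the function names and the toy numbers are hypothetical and are not clinical thresholds.

```python
def warning_range(values, shap_values):
    """Return (low, high) spanning the indicator values whose SHAP value is positive.

    Assumption: the positive region is summarized by one interval, which is only
    an approximation if positive and negative SHAP values interleave.
    """
    positive = [v for v, s in zip(values, shap_values) if s > 0]
    if not positive:
        return None
    return min(positive), max(positive)


def in_warning_range(value, rng):
    """True if a new measurement falls inside the early-warning range."""
    return rng is not None and rng[0] <= value <= rng[1]


# Toy numbers (invented): indicator measurements and their SHAP values.
values = [11.0, 12.0, 13.5, 14.0, 15.2]
shap_values = [-0.3, -0.1, 0.05, 0.2, 0.4]
rng = warning_range(values, shap_values)
alert = in_warning_range(14.0, rng)
```

A measurement that falls inside the returned interval would be the trigger for the alarm prompt to medical staff described in the abstract.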

In summary, the invention builds a high-performance prediction model from the EMR data of 3028 patients using the interpretable machine learning algorithm XGBoost (AUC 0.8271, 95% CI: 0.8144, 0.8397). It determines 15 high-risk factors for the deterioration of COVID-19 and their approximate early-warning ranges. In addition, the invention is the first to show that a simplified index combination can also achieve good prediction performance: a rapid preliminary screening result (AUC 0.8017, 95% CI: 0.7905, 0.8130) can be obtained using only two laboratory indexes (PT and HCT) and three basic indexes (pulse, BMI, and presence of underlying disease). Combining the indexes in this tiered way meets the different requirements of medical personnel for prediction accuracy and timeliness in clinical practice. In short, in the fight against the COVID-19 epidemic, the invention can help to reduce mortality, improve prognosis, and optimize clinical treatment.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A new coronary pneumonia patient outcome prediction method based on an interpretable machine learning algorithm is characterized by comprising the following steps:

step S7, the calculation result of the algorithm and the clinical experience of the doctor are integrated, and 15 first group indexes and 5 second group indexes are screened out, wherein the 15 first group indexes comprise: prothrombin time (PT), prothrombin activity (PTA), lactate dehydrogenase (LDH), international normalized ratio (INR), heart rate, body mass index (BMI), D-dimer, creatine kinase (CK), hematocrit, urine specific gravity, magnesium, globulin, activated partial thromboplastin time (APTT), lymphocyte count (L%), and platelet count;

the two index combinations, consisting of the 15 first group indexes and the 5 second group indexes, are used to predict the disease status of patients with COVID-19.

2. The new coronary pneumonia patient outcome prediction method based on interpretable machine learning algorithm of claim 1, wherein in the step S1, the patient data comprises: demographics, underlying disease, vital signs, blood coagulation, routine blood tests, blood biochemistry, and routine urine tests.

3. The method for predicting the outcome of a new coronary pneumonia patient based on the interpretable machine learning algorithm as claimed in claim 1, wherein in step S2, after mixing the positive and negative samples, the missing values of each index of the patient data are interpolated by using a random forest regression algorithm.

4. The method for new coronary pneumonia patient outcome prediction based on interpretable machine learning algorithm of claim 1, wherein in the step S3, the model performance includes: accuracy, recall, F1 score, AUC, and the 95% CI of the AUC.

5. The method for predicting the outcome of a new coronary pneumonia patient based on the interpretable machine learning algorithm of claim 1, wherein in the step S3, a stepwise backward regression method is used to calculate the model performance under different index numbers and parameters, which comprises: putting all the characteristics into an XGboost model, and calculating the SHAP value of each characteristic; and for each iteration, deleting the feature with the minimum SHAP absolute value from the model, repeating continuously until no feature meets the iteration standard, and recording the AUC value of each iteration process.

6. The method for predicting new coronary pneumonia patient outcome based on interpretable machine learning algorithm of claim 1, wherein in the step S4, the index combination includes: pulse index and BMI index.

7. The method for predicting the outcome of a new coronary pneumonia patient based on the interpretable machine learning algorithm of claim 1, wherein the preset number of days is 10 days to 20 days.

8. The method for predicting the outcome of a new coronary pneumonia patient based on the interpretable machine learning algorithm of claim 1, wherein in step S3, the number of indexes input into the model is selected according to the one-standard-error rule, and the selected indexes are used as the key risk factors for identifying the disease deterioration of the patient with COVID-19.

9. The method for predicting the outcome of a patient with new coronary pneumonia according to claim 1, wherein in step S6, a range of an indicator over which its SHAP value in the SHAP plot is positive indicates that the indicator in that range contributes positively to the prediction result, i.e. disease deterioration, whereas a range over which its SHAP value is negative indicates that the indicator in that range contributes negatively to the prediction result; that is, the range in which the SHAP value is positive in the SHAP partial dependence plot is defined as the warning range.

10. The method of claim 1, wherein the mean and the median are used to describe continuous variables, continuous variables are compared between groups using the Wilcoxon rank-sum test, and categorical variables are described as frequencies and percentages and compared using the chi-square test.