CN111261282A

CN111261282A - Sepsis early prediction method based on machine learning

Info

Publication number: CN111261282A
Application number: CN202010068293.2A
Authority: CN
Inventors: 付梦莎; 袁家斌
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2020-06-09

Abstract

The invention discloses a machine-learning-based method for early prediction of sepsis. First, clinical data recorded within 24 hours after a patient enters the ICU (intensive care unit) are extracted from the electronic medical record; the data comprise multiple variables, including demographics (such as age and sex), vital sign variables (such as heart rate and systolic pressure) and laboratory measurement indexes (such as creatinine and platelet count). The data are preprocessed and input into an improved deep forest algorithm model for training, and after training and tuning the model outputs the patient's probability of illness. The model can also rank the feature variables of sepsis and output the early warning factors that have an important influence on early sepsis prediction. Finally, the corresponding variables of a patient to be predicted are entered into the trained model, and an early prediction of sepsis can be made for that patient. The method can assist doctors in clinical decision making and improve prediction accuracy.

Description

Sepsis early prediction method based on machine learning

Technical Field

The invention belongs to the field of medical data mining, and particularly relates to a sepsis early-stage prediction method based on machine learning.

Background

Sepsis is a life-threatening disease: a systemic inflammatory response syndrome caused by infection, and one of the main high-risk complications and causes of death among ICU patients. An estimated 30 million people worldwide suffer from sepsis each year and more than 6 million die from it, and treatment is both very costly and very risky. Because of its morbidity, mortality and expensive treatment, sepsis has become a public health problem of high global concern. The clinical diagnostic definition of sepsis has evolved from version 1.0 to 3.0 and continues to change. The latest clinical definition, Sepsis-3, was proposed in 2016 by the European Society of Intensive Care Medicine. Clinical research on the pathogenesis of sepsis has made some progress, but the pathogenesis is complex and involves many variable factors, and diagnostic accuracy still needs to be improved. Studies have shown that early detection of sepsis and timely antibiotic treatment are critical to improving outcomes, with mortality increasing by 4%-8% for every hour that treatment is delayed. Identifying as early as possible the patients who may develop sepsis and treating them promptly, and studying the key influencing factors closely related to the occurrence of sepsis, are therefore of great value for improving patient prognosis. Most current studies take a medical point of view and rely on statistical analysis and simple logistic regression models; few use machine learning algorithms for early prediction of sepsis. Current studies indicate that sepsis is a major cause of late death. The 24 hours after a patient enters the ICU are a critical period during which most disease transitions occur.
The clinical data recorded within those 24 hours are therefore especially valuable for early diagnosis of sepsis. In addition, because medical data are private, the data in many published studies come from a single hospital and are difficult to share, so research methods and results are neither reproducible nor comparable. With the advent of data-driven intelligent medicine, more and more medical staff hope to use machine learning to mine medical data and thereby deepen their understanding of diseases and improve diagnostic efficiency.

Disclosure of Invention

To solve the problems in the prior art that sepsis in ICU patients is difficult to diagnose clinically and is diagnosed with low accuracy, the invention provides a machine-learning-based method for early prediction of sepsis. The method extracts a plurality of examination variables recorded in a patient's electronic medical record, preprocesses the data, constructs a prediction model using an improved deep forest algorithm, and outputs the patient's probability of sepsis, so that patients with sepsis are predicted and identified early. It has the beneficial effect of high prediction accuracy and at the same time explores the key influencing factors closely related to the occurrence of sepsis.

In order to achieve the purpose, the invention adopts the technical scheme that:

A sepsis early prediction method based on machine learning comprises the following steps:

step 1, defining the diagnosis task, and extracting from the patient's electronic medical record or a medical data set a plurality of predictive feature variables recorded within 24 hours after the patient enters the ICU;

step 2, preprocessing the extracted data, which contain missing and abnormal values to varying degrees, wherein the preprocessing comprises variable screening, missing value filling, abnormal value processing and feature extraction;

step 3, inputting the preprocessed data into a prediction model constructed using an improved deep forest algorithm;

step 4, training the prediction model, finding the optimal parameters through training, and continuing to tune them until the model performance is stable and optimal;

step 5, after the model is trained, carrying out early sepsis prediction on a new patient, and simultaneously outputting the importance metric value of each feature variable.

Further, the predictive feature variables extracted in step 1 comprise vital sign variables, laboratory measurement indexes, and demographic information.

Further, the data preprocessing in the step 2 comprises:

variable screening, namely setting a missing-rate threshold and removing variables whose missing rate is too high, and simultaneously performing low-variance filtering, namely calculating the variance of each variable, setting a threshold, and filtering out variables whose variance is below the threshold; zero-variance features are removed, because a variance of 0 indicates that the value of the variable never changes and the variable therefore provides no discrimination for the model;

missing value filling, namely filling the missing values by MissForest model prediction, wherein MissForest is a non-parametric missing value filling method that uses an algorithm model to predict and fill the missing values;

outlier processing, using the 6σ principle, where σ denotes the standard deviation of the data and the usual 3σ upper and lower bounds are widened: if an item of a patient's data deviates from the mean of the data set to which it belongs by more than 6 standard deviations, i.e. the value lies outside [U-6σ, U+6σ], where U denotes the mean of the data set, the value is replaced by the lower or upper limit value respectively;

feature extraction, namely expanding the feature variables: because medical scoring systems judge the severity of a patient's condition from maximum and minimum values, each variable is expanded into its maximum value, minimum value and average value.

Further, in the model construction of step 3, an improved deep forest algorithm model is used to construct the prediction model;

the improved deep forest algorithm model comprises the following specific steps:

step (1): two tree-based algorithms, random forest and XGBoost, are selected as the base learners of each layer, and each layer uses 2 identical random forest models and 2 identical XGBoost models as its four base-learner prediction models; each model is trained on the data using k-fold cross validation, where k = 10, and outputs its probability vector X_i;

step (2): after k-fold training, the accuracy of each model's predictions on the training data is calculated; after normalization this accuracy serves as the weight parameter of the model, recorded as w'_i, and is calculated as follows:

w′_i = w_i / ∑_j w_j

wherein: w_i denotes the prediction accuracy on the training data of the i-th forest algorithm model after 10-fold training, the sum runs over the four base learners of the layer, and w'_i denotes the weight of the prediction probability of the i-th model;

step (3): the prediction probabilities of the four base-learner prediction models of each layer are fused by weighting, and the resulting probability vector is recorded as X_prob; the final probability vector X_prob output by each layer is then concatenated with the original features and input into the next layer; X_prob is calculated as follows:

X_prob = ∑ w′_i * X_i

wherein X_i denotes the predicted probability vector output by the i-th forest, each layer being trained on the data using k-fold cross validation with k = 10;

step (4): steps (1) to (3) are repeated on the next layer, and training continues until the training accuracy no longer improves, at which point training stops automatically and the weighted probability of the last layer is output as the prediction result.

Further, in the training and tuning of the model parameters in step 4, each layer of the algorithm model used for training and prediction consists of 2 identical random forests and 2 identical XGBoost models, and each forest comprises 100 trees; each layer is trained on the data using 10-fold cross validation, and the data are divided into a training set and a test set in a 0.8/0.2 ratio; prediction accuracy is used as the evaluation index, and training terminates when the prediction accuracy of the next layer no longer improves.

Further, the importance metric value of each feature variable output in step 5 is generated during model training, and the feature importance finally output by each layer is F'_n, calculated as follows:

F′_n = w′_i * F_i

wherein n denotes the n-th layer, i denotes the i-th forest algorithm model of each layer, and F_i denotes the feature importances of the i-th forest model averaged over the 10 folds of training;

finally, the feature importance of the last layer is selected and output, giving the importance metric of each feature variable.

Compared with the prior art, the invention has the following beneficial effects:

The invention takes medical data recorded within 24 hours after a patient enters the ICU and preprocesses them by variable screening, missing value filling, abnormal value processing and feature extraction. For the task of early sepsis prediction it constructs an improved deep forest algorithm model. Unlike the original deep forest architecture, which directly stacks the prediction probabilities of every forest in the previous layer, the improved model takes a weighted average of the class distribution probabilities output by the forests, with the weight parameters determined by the accuracy of the trained models. This reduces dimensionality, retains only the key extracted information as new features for prediction, and greatly increases training speed. In addition, the method emphasizes feature importance scoring: each layer outputs the contribution of the features in its respective trained models, and the overall importance scores of all features are finally calculated to obtain the early warning factors that have an important influence on early prediction of sepsis. Using machine learning to predict and classify sepsis yields higher accuracy and can help doctors assess a patient's condition, pay close attention to patients who may develop sepsis, intervene in time, and improve the prognosis of ICU patients.

Drawings

FIG. 1 is an overall flow diagram of the present invention;

FIG. 2 is a SOFA score table;

FIG. 3 is a diagram of a cascaded forest structure;

FIG. 4 is a diagram of the improved deep forest structure of the present invention;

FIG. 5 is a schematic representation of the ranking of feature importance metric values.

Detailed Description

In order to better explain the technical scheme, the following is made a more detailed description with reference to the embodiment.

Step 1: Early prediction of sepsis is a predictive classification task, and the first requirement is data. Data are the cornerstone of the research, and each disease problem needs its corresponding data as support. ICU patients undergo many examinations and their data records are complex, so the required predictive feature variables must be extracted from the patient's electronic medical record or from a medical data set, chiefly the clinical data recorded within 24 hours after entering the ICU.

1.1 predictive feature variables commonly used in medical scoring systems

Clinically, sepsis is diagnosed mainly by assessing the patient's condition with medical scoring systems such as the SOFA score. The variables commonly used in these scoring systems fall into three types: vital sign variables (e.g. heart rate, pulse, systolic blood pressure, arterial pressure), laboratory measurement indexes (e.g. creatinine, direct bilirubin, serum glucose, lactate), and demographic information (e.g. age, weight, length of stay, type of stay). FIG. 2 shows the variables needed in the SOFA score table. To compare the performance of machine learning methods with that of traditional clinical diagnosis, other studies often extract and use the variables that appear in medical scoring systems, which makes comparison easier.

Step 2: after data is extracted, before the data is input into a prediction model, the data needs to be preprocessed, including variable screening, missing value filling, abnormal value processing, feature extraction and other operations of the data, so that the data quality is improved.

The quality of a raw data set is generally low: it has many problems, cannot be used directly, and needs a certain amount of preprocessing. Data from real scenarios often have many defects, and data derived from a hospital system can be distorted during collection for various reasons such as human error, equipment error or system loopholes. ICU data are sparse, with a large number of missing values and some abnormal values. The measurement frequencies of the extracted variables are also inconsistent: some laboratory test indexes such as blood culture or white blood cell count may take 1-5 days to return results, some variables such as heart rate or respiratory rate are sampled in real time, and others may be sampled every few hours or once a day. The data therefore need to be preprocessed before prediction, so that each kind of dirty data is handled in a corresponding way and usable data of a consistent standard are obtained. The data preprocessing mainly comprises the following steps:

Variable screening: a missing-rate threshold is set and variables whose missing rate is too high are removed. At the same time, low-variance filtering is performed: the variance of each variable is calculated, a threshold is set, and variables whose variance is below the threshold are filtered out. Zero-variance features are generally removed, because a variance of 0 means that the value of the variable never changes and the variable provides no discrimination for the model.
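The screening rule above can be sketched in a few lines of NumPy. This is only an illustration: the function name `screen_variables`, the default thresholds (`max_missing_rate=0.5`, `min_variance=0.0`) and the toy variable names are assumptions, since the text does not state concrete threshold values.

```python
import numpy as np


def screen_variables(X, names, max_missing_rate=0.5, min_variance=0.0):
    """Drop columns with too many missing values or too little variance.

    Thresholds are illustrative assumptions, not values from the text.
    """
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # Fraction of missing (NaN) entries in this variable.
        if np.isnan(col).mean() > max_missing_rate:
            continue
        observed = col[~np.isnan(col)]
        variance = observed.var() if observed.size else 0.0
        # A variance at or below the threshold (including 0) is filtered out.
        if variance <= min_variance:
            continue
        keep.append(j)
    return X[:, keep], [names[j] for j in keep]


nan = float("nan")
X = [[70, nan, 1], [80, nan, 1], [90, 5.0, 1], [60, nan, 1]]
names = ["heart_rate", "lactate", "constant_flag"]
screened, kept = screen_variables(X, names)
print(kept)
```

Here `lactate` is dropped because three of its four values are missing, and `constant_flag` is dropped because its variance is zero.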

Outlier processing: because anomalies in medical data are difficult to define, the 6σ principle is used, where σ denotes the standard deviation of the data and the usual 3σ upper and lower bounds are widened. If an item of a patient's data deviates from the mean of the data set to which it belongs by more than 6 standard deviations, i.e. the value lies outside [U-6σ, U+6σ], where U denotes the mean of the data set, the value is replaced by the lower or upper limit value respectively.
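A minimal sketch of the 6σ rule, assuming it is applied to one variable at a time and that missing values (NaN) are ignored when computing the mean and standard deviation. The function name `clip_six_sigma` and the toy data are hypothetical.

```python
import numpy as np


def clip_six_sigma(values, k=6.0):
    """Replace values outside [U - k*sigma, U + k*sigma] by the nearer bound."""
    x = np.asarray(values, dtype=float)
    mean = np.nanmean(x)   # U: mean of the observed values
    std = np.nanstd(x)     # sigma: standard deviation of the observed values
    lower, upper = mean - k * std, mean + k * std
    # np.clip leaves NaN entries untouched, so missing values stay missing.
    return np.clip(x, lower, upper), (lower, upper)


# 100 plausible heart-rate readings plus one recording error.
readings = [79.0, 81.0] * 50 + [10000.0]
clipped, (lower, upper) = clip_six_sigma(readings)
print(clipped[-1] < 10000.0)
```

Note that a single extreme value can only exceed 6σ when the data set is reasonably large, because the outlier itself inflates the standard deviation.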

Missing value filling: missing values are filled by MissForest model prediction. MissForest is a non-parametric missing value filling method that uses an algorithm model to predict and fill the missing values. In principle, the columns with known data serve as features and the column with missing values serves as the label. The rows where the label is present form the training set and the rows where it is missing form the test set; a random forest is trained, the missing values are predicted and updated, and the process iterates, with the filled-in predictions used to predict the remaining missing data.
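The iterative idea can be approximated with scikit-learn, as sketched below. This is not the original MissForest implementation: it substitutes scikit-learn's `IterativeImputer` with a `RandomForestRegressor` as the per-column model, and it assumes all variables are numeric. The function name, tree count and iteration count are illustrative choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer


def impute_missforest_style(X, n_trees=50, max_iter=5, seed=0):
    """Fill NaNs column by column with random-forest predictions, iterating."""
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=n_trees, random_state=seed),
        max_iter=max_iter,
        random_state=seed,
    )
    return imputer.fit_transform(np.asarray(X, dtype=float))


# Synthetic data: column 3 is strongly related to column 0.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(60, 4))
X_full[:, 3] = X_full[:, 0] + 0.1 * rng.normal(size=60)
mask = rng.random(X_full.shape) < 0.1          # hide roughly 10% of the entries
X_missing = np.where(mask, np.nan, X_full)
X_filled = impute_missforest_style(X_missing, n_trees=20)
```

Observed entries are left unchanged; only the hidden entries are replaced by predictions.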

Feature extraction, that is, the expansion of variables: common medical scoring systems generally judge the severity of a patient's condition from extreme (maximum and minimum) values, so each feature is expanded in three respects: maximum value, minimum value and mean value.
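As a small illustration of this expansion, the sketch below turns each variable's readings from the first 24 hours into three features. The input layout (a dictionary of reading lists with `None` for missing readings) and the names `expand_features`, `heart_rate` and `creatinine` are assumptions made for the example.

```python
from statistics import mean


def expand_features(readings_24h):
    """Expand each variable into its max, min and mean over the 24-hour window."""
    features = {}
    for name, values in readings_24h.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue  # nothing recorded for this variable
        features[f"{name}_max"] = max(observed)
        features[f"{name}_min"] = min(observed)
        features[f"{name}_mean"] = mean(observed)
    return features


patient = {"heart_rate": [88, 102, None, 95], "creatinine": [1.1, 1.4]}
features = expand_features(patient)
print(features["heart_rate_max"], features["heart_rate_min"], features["heart_rate_mean"])
```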

Step 3: After preprocessing, the data are input into the prediction model, which is constructed using an improved deep forest algorithm. In medical data mining with machine learning, different algorithms need to be selected or constructed for different disease diagnosis tasks in order to achieve the best prediction effect. Machine learning offers many algorithms to choose from, such as logistic regression, support vector machines and K-nearest neighbors, as well as various improved algorithms built on these basic ones. With the continuing development of neural networks, deep learning methods such as transfer learning and reinforcement learning are also widely used, with outstanding results. The deep forest algorithm is competitive with deep neural networks, has fewer hyper-parameters, trains efficiently, and performs well on small-scale data sets. In intelligent computer-aided diagnosis, many studies have obtained high accuracy with deep learning, but because a neural network is an uninterpretable black box, doctors place little trust in it. Tree models are highly interpretable, since the reason for a prediction can be explained from the splitting path in the decision tree.

3.1 deep forest model

The multi-Grained Cascade Forest algorithm (gcForest), also called the deep forest algorithm, was proposed in 2017 by Professor Zhou Zhihua et al. It is a multi-layer classifier in which each layer is an ensemble of multiple base classifiers. The gcForest structure resembles a deep neural network and learns representations layer by layer. Deep neural networks are widely applied because of their strong representation learning capability, automatic feature transformation and complex models, but they also suffer from high computational complexity, too many parameters and a lack of interpretability. gcForest offers a new direction for exploring deep algorithms other than neural networks. The gcForest algorithm consists of two parts: the first is multi-grained scanning, used to extract data features; the second is the cascade forest, used for iterative training to improve the classification result. The two parts can be used in combination or separately. Whereas the basic building block of a neural network is the neuron, the basic unit of gcForest is a forest, which is an ensemble model formed by a plurality of decision trees, so the deep forest can also be regarded as an ensemble model. FIG. 3 is a diagram of a cascaded forest structure of the gcForest model.

3.2 improved deep forest model

gcForest is in effect a framework for deep forest algorithms and needs to be adapted to the specific application scenario. On the basis of the gcForest framework, an improved deep forest algorithm is proposed for sepsis prediction; its structure is shown in FIG. 4. Unlike the original gcForest architecture, which directly stacks the prediction probabilities of each forest of the previous layer, the improved model takes a weighted average of the class distribution probabilities output by the forests, with the weight parameters determined by the accuracy of the trained models. The weight parameter w'_i of each forest model is calculated as follows:

w′_i = w_i / ∑_j w_j

wherein w_i denotes the prediction accuracy on the training data of the i-th forest algorithm model after 10-fold training, and the sum runs over the four models of the layer;

then the prediction probabilities of the four models are fused by weighting, and the resulting probability vector is recorded as X_prob, calculated as follows:

X_prob = ∑ w′_i * X_i

wherein X_i denotes the predicted probability vector output by the i-th forest, each layer being trained on the data using k-fold (k = 10) cross validation.

The final probability vector X_prob is concatenated with the initial features and input into the next layer, and so on until model training is finished; the weighted probability of the last layer is output as the prediction result.
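The weighting and fusion within one layer can be illustrated numerically. The sketch below assumes that normalization means dividing each accuracy by the sum of the four accuracies, so that the weights sum to 1; the accuracy values and probability vectors are made up for the example.

```python
import numpy as np


def normalize_weights(accuracies):
    """Turn per-model accuracies w_i into weights w'_i that sum to 1 (assumed normalization)."""
    w = np.asarray(accuracies, dtype=float)
    return w / w.sum()


def fuse_probabilities(prob_vectors, weights):
    """Compute X_prob as the weighted sum of the per-model probability vectors X_i."""
    stacked = np.stack(prob_vectors)              # (n_models, n_samples, n_classes)
    return np.tensordot(weights, stacked, axes=1)  # (n_samples, n_classes)


accuracies = [0.90, 0.88, 0.92, 0.90]             # made-up accuracies of the 4 base learners
weights = normalize_weights(accuracies)
probs = [
    np.array([[0.8, 0.2]]),
    np.array([[0.7, 0.3]]),
    np.array([[0.9, 0.1]]),
    np.array([[0.6, 0.4]]),
]
x_prob = fuse_probabilities(probs, weights)

# The next layer sees the original features plus only the fused vector.
original_features = np.array([[88.0, 1.2, 63.0]])
next_layer_input = np.hstack([original_features, x_prob])
print(next_layer_input.shape)
```

With four base learners and two classes, stacking every forest's output would add eight columns per layer, whereas the fused vector adds only two, which is the dimensionality reduction the text refers to.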

Taking the weighted average of the probabilities predicted by all forest models reduces dimensionality, retains only the key extracted information as new features for prediction, and greatly increases training speed. In addition, the feature importance of each layer can be output during training and the final value obtained by averaging; the model weight parameter w'_i can likewise be used to assist in this computation.

Step 4: Train the model. Training time varies with the scale of the training data. Model training is in essence a process of parameter optimization: the optimal hyper-parameters and the parameters of each base model are found through training, and tuning then continues according to the evaluation index until the model performance is stable and optimal. Each layer of the algorithm model used for training and prediction consists of 2 identical random forests and 2 identical XGBoost models, and each forest comprises 100 trees. Each layer is trained using k-fold (k = 10) cross validation, and the data are divided into a training set and a test set in a 0.8/0.2 ratio. Prediction accuracy is used as the evaluation index, and training terminates when the prediction accuracy of the next layer no longer improves.
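The training loop described above might be sketched as follows with scikit-learn. Several points are assumptions rather than details from the text: `GradientBoostingClassifier` stands in for XGBoost to keep the example dependency-free, the weights are computed from out-of-fold accuracy, class labels are assumed to be 0 and 1, and the names `make_layer`, `train_cascade`, `predict_proba` and `max_layers` are hypothetical. The demo uses fewer trees than the 100 stated in the text only to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split


def make_layer(n_trees, seed):
    """Four base learners: 2 random forests and 2 boosted-tree models (XGBoost stand-in)."""
    return [
        RandomForestClassifier(n_estimators=n_trees, random_state=seed),
        RandomForestClassifier(n_estimators=n_trees, random_state=seed + 1),
        GradientBoostingClassifier(n_estimators=n_trees, random_state=seed + 2),
        GradientBoostingClassifier(n_estimators=n_trees, random_state=seed + 3),
    ]


def train_cascade(X, y, n_trees=100, k=10, max_layers=5):
    """Add layers until the fused cross-validated accuracy stops improving."""
    layers, best_acc, features = [], -1.0, X
    for depth in range(max_layers):
        models = make_layer(n_trees, seed=10 * depth)
        # X_i: out-of-fold class probabilities from k-fold cross validation.
        probs = [cross_val_predict(m, features, y, cv=k, method="predict_proba")
                 for m in models]
        acc = np.array([(p.argmax(axis=1) == y).mean() for p in probs])  # w_i
        weights = acc / acc.sum()                                        # w'_i
        x_prob = np.tensordot(weights, np.stack(probs), axes=1)         # X_prob
        layer_acc = (x_prob.argmax(axis=1) == y).mean()
        if layer_acc <= best_acc:
            break  # no improvement: stop growing the cascade
        best_acc = layer_acc
        for m in models:
            m.fit(features, y)  # refit on all training data for later prediction
        layers.append((models, weights))
        features = np.hstack([X, x_prob])  # original features + fused probabilities
    return layers


def predict_proba(layers, X):
    """Pass new samples through the cascade and return the last layer's X_prob."""
    features, x_prob = X, None
    for models, weights in layers:
        probs = np.stack([m.predict_proba(features) for m in models])
        x_prob = np.tensordot(weights, probs, axes=1)
        features = np.hstack([X, x_prob])
    return x_prob


# Synthetic stand-in data with a 0.8/0.2 train/test split.
X, y = make_classification(n_samples=200, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
layers = train_cascade(X_train, y_train, n_trees=10, max_layers=3)
test_prob = predict_proba(layers, X_test)
```

To follow the text more closely, the two boosted-tree entries could be replaced by XGBoost classifiers and `n_trees` set to 100.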

Step 5: After the model is trained, sepsis prediction is performed for a new patient. The model outputs the predicted probability of disease, a value between 0 and 1; patients with a predicted probability greater than or equal to 0.5 are classified as 1, meaning they are considered likely to develop sepsis, whereas a predicted probability below 0.5 is classified as 0, meaning the patient is not expected to develop sepsis. After model training is finished, the importance metric value of each feature variable can also be output. The feature importance finally output by each layer is F'_n, calculated as follows:

F′_n = w′_i * F_i

wherein n denotes the n-th layer, i denotes the i-th forest algorithm model of each layer, and F_i denotes the feature importances of the i-th forest model averaged over the 10 folds of training.

Finally, the feature importance of the last layer is output. As shown in FIG. 5, the abscissa represents the importance metric value of a feature and the ordinate represents the corresponding feature variable. The longer the bar, the larger the value. All feature importances sum to 1.
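A small numeric sketch of the two outputs of step 5 follows. It assumes that the layer importance is the sum over the four models of w'_i * F_i (the formula in the text shows only the product, so the summation is an interpretation), that each model's importances already sum to 1, and that the weights sum to 1. The feature names and numbers are invented for illustration.

```python
import numpy as np


def layer_feature_importance(importances, weights):
    """Combine per-model importances F_i with weights w'_i (sum over i is assumed)."""
    F = np.asarray(importances, dtype=float)   # (n_models, n_features)
    w = np.asarray(weights, dtype=float)       # (n_models,)
    return w @ F                                # (n_features,)


def classify(probabilities, threshold=0.5):
    """Label 1 (possible sepsis) when the predicted probability is >= threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


feature_names = ["lactate", "heart_rate", "age"]   # hypothetical variables
F = [
    [0.50, 0.30, 0.20],
    [0.40, 0.40, 0.20],
    [0.60, 0.20, 0.20],
    [0.50, 0.25, 0.25],
]
w = [0.25, 0.25, 0.25, 0.25]
scores = layer_feature_importance(F, w)
ranked = [feature_names[j] for j in np.argsort(scores)[::-1]]
labels = classify([0.2, 0.5, 0.9])
print(ranked, labels)
```

Because the weights and each row of importances sum to 1, the combined scores also sum to 1, which matches the statement that all feature importances sum to 1.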

The invention first extracts the patient's predictive variables from an electronic medical record or a medical data set, chiefly three major types of multidimensional variable information recorded within 24 hours after the patient enters the ICU: vital sign variables, laboratory measurement indexes and demographic information. The extracted data are then preprocessed, including variable screening, abnormal value processing, missing value filling and feature extraction, which completes the cleaning and processing of the data. The processed data are input into the constructed prediction model, which uses the improved deep forest algorithm model. After training and parameter tuning, the prediction model can make predictions for a new patient: the patient's data for the variables used in the trained model are input, and an early prediction of sepsis is made for that patient. In addition, the trained algorithm model can rank the feature variables of sepsis and output the early warning factors that have an important influence on early sepsis prediction. A higher score for an early warning factor indicates a closer relationship with the onset of sepsis. The technical scheme of the invention can assist doctors in making clinical decisions and improve the accuracy of predicting the patient's condition.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A sepsis early prediction method based on machine learning is characterized by comprising the following steps:

step 5, after the model is trained, carrying out sepsis prediction on a new patient, and simultaneously outputting the importance metric value of each feature variable.

2. The machine learning-based sepsis early prediction method of claim 1, characterized in that: the predictive feature variables extracted in step 1 comprise vital sign variables, laboratory measurement indexes, and demographic information.

3. The machine learning-based early prediction of sepsis method of claim 1, characterized by: the data preprocessing in the step 2 comprises the following steps:

outlier processing, using the 6σ principle, where σ denotes the standard deviation of the data: if an item of a patient's data deviates from the mean of the data set to which it belongs by more than 6 standard deviations, i.e. the value lies outside [U-6σ, U+6σ], where U denotes the mean of the data set, the value is replaced by the lower or upper limit value respectively;

4. The machine learning-based early prediction of sepsis method of claim 1, characterized by: in the model building in the step 3, an improved deep forest algorithm model is used for building a prediction model;

step (1): selecting two tree-based algorithms, random forest and XGBoost, as the base learners of each layer, wherein each layer uses 2 identical random forest models and 2 identical XGBoost models as its four base-learner prediction models; each model is trained on the data using k-fold cross validation, where k = 10, and outputs its probability vector X_i;

X_prob = ∑ w′_i * X_i

5. The machine learning-based sepsis early prediction method of claim 1, characterized in that: in the training and tuning of the model parameters in step 4, each layer of the algorithm model used for training and prediction consists of 2 identical random forests and 2 identical XGBoost models, and each forest comprises 100 trees; each layer is trained on the data using 10-fold cross validation, and the data are divided into a training set and a test set in a 0.8/0.2 ratio; prediction accuracy is used as the evaluation index, and training terminates when the prediction accuracy of the next layer no longer improves.

6. The machine learning-based sepsis early prediction method of claim 1, characterized in that: the importance metric value of each feature variable output in step 5 is generated during model training, and the feature importance finally output by each layer is F'_n, calculated as follows:

F′_n = w′_i * F_i