CN111261282A - Sepsis early prediction method based on machine learning - Google Patents

Sepsis early prediction method based on machine learning

Publication number
CN111261282A
Authority
CN
China
Prior art keywords
prediction
model
data
training
sepsis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010068293.2A
Other languages
Chinese (zh)
Inventor
付梦莎
袁家斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202010068293.2A
Publication of CN111261282A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a machine-learning-based method for the early prediction of sepsis. First, clinical data recorded within 24 hours after a patient enters the ICU (intensive care unit) are extracted from the electronic medical record; the data comprise demographic variables (such as age and sex), vital sign variables (such as heart rate and systolic blood pressure) and laboratory measurement indices (such as creatinine and platelet count). The data are preprocessed and input into an improved deep forest algorithm model for training, and after training and tuning the model outputs the patient's probability of developing sepsis. The model can also rank the feature variables and output the early warning factors that have an important influence on early sepsis prediction. Finally, the corresponding variables of a patient to be predicted are entered into the trained model to obtain an early sepsis prediction for that patient. The method can assist doctors in clinical decision making and improve prediction accuracy.

Description

Sepsis early prediction method based on machine learning
Technical Field
The invention belongs to the field of medical data mining, and particularly relates to a sepsis early-stage prediction method based on machine learning.
Background
Sepsis is a life-threatening disease: a systemic inflammatory response syndrome caused by infection, and one of the main causes of high-risk complications and death in ICU patients. An estimated 30 million people worldwide suffer from sepsis each year and more than 6 million die of it, and treatment is both costly and risky. Owing to its morbidity, mortality and treatment cost, sepsis has become a public health problem of global concern. The clinical definition of sepsis has evolved from Sepsis-1.0 to Sepsis-3.0 and continues to change; the latest definition, Sepsis-3, was proposed in 2016 by the European Society of Intensive Care Medicine. Clinical research on the pathogenesis of sepsis has made some progress, but the pathogenesis is complex and involves many variables, and diagnostic accuracy still needs to be improved. Studies have shown that early detection of sepsis and timely antibiotic treatment are critical to improving patient outcomes, with mortality increasing by 4%-8% for every hour that treatment is delayed. Identifying patients who may develop sepsis as early as possible, treating them promptly, and studying the key factors closely related to the onset of sepsis are therefore of great value for improving patient prognosis. Most current studies take a medical point of view and rely on statistical analysis and simple logistic regression models; few use machine learning algorithms for the early prediction of sepsis. Current studies indicate that sepsis is a major cause of late death. The 24 hours after a patient enters the ICU are a critical period during which most changes in the patient's condition occur.
Clinical data from these 24 hours are therefore especially valuable for the early diagnosis of sepsis. In addition, because medical data are private, many published studies are based on data from one specific hospital that are difficult to share, so their methods and results can be neither reproduced nor compared. With the advent of data-driven intelligent medicine, more and more medical staff hope to mine medical data with machine learning methods in order to deepen their understanding of diseases and improve diagnostic efficiency.
Disclosure of Invention
In order to solve the problems in the prior art that sepsis in ICU patients is difficult to diagnose clinically and is diagnosed with low accuracy, the invention provides an early sepsis prediction method based on machine learning. The method extracts a plurality of examination variables recorded in the patient's electronic medical record, preprocesses the data, constructs a prediction model using an improved deep forest algorithm, and outputs the probability that the patient will develop sepsis, so that patients with sepsis are predicted and identified early. The method has the beneficial effect of high prediction accuracy and also identifies the key factors closely related to the onset of sepsis.
In order to achieve the purpose, the invention adopts the technical scheme that:
a sepsis early prediction method based on machine learning comprises the following steps:
step 1, firstly, defining the diagnosis task, and extracting, from the patient's electronic medical record or a medical data set, a plurality of predictive feature variables recorded within 24 hours after the patient enters the ICU;
step 2, after the data are extracted, preprocessing the data, since they contain missing and abnormal values to varying degrees, wherein the preprocessing comprises variable screening, missing value filling, outlier processing and feature extraction;
step 3, after data preprocessing, inputting the data into the constructed prediction model, and constructing the prediction model by using an improved deep forest algorithm;
step 4, training the prediction model, finding the optimal parameters through training, and continuously adjusting the optimal parameters to ensure that the model effect is stable and optimal;
step 5, after the model is trained, carrying out early sepsis prediction on a new patient, and simultaneously outputting the importance metric value of each feature variable.
Further, the predictive feature variables extracted in step 1 include vital sign variables, laboratory measurement indices and demographic information.
Further, the data preprocessing in the step 2 comprises:
variable screening, namely setting a missing rate threshold and removing variables whose missing rate is too high, and simultaneously performing low-variance filtering, namely calculating the variance of each variable, setting a threshold, and filtering out any variable whose variance is below the threshold; zero-variance features are selected for removal, since a variance of 0 indicates that the value of the variable never changes and the variable has no discriminative power for the model;
missing value filling, namely filling the missing values by the MissForest model prediction method, wherein MissForest is a non-parametric missing value filling method that uses an algorithm model to predict and fill the missing values;
outlier processing, wherein the 6σ principle is used, σ denoting the standard deviation of the data, which widens the upper and lower bounds of the usual 3σ rule: if an item of a patient's data deviates from the mean of the data set to which it belongs by more than 6 standard deviations, i.e. the value lies outside [U - 6σ, U + 6σ], where U denotes the mean of the data set, the value is replaced by the lower or upper bound respectively;
and feature extraction, namely expanding the feature variables: since medical scoring systems judge the severity of a patient's condition from maximum and minimum values, each variable is expanded into its maximum value, minimum value and mean value.
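The 6σ outlier rule described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `clip_six_sigma`, the use of the population standard deviation, and the convention that `None` marks a missing value are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def clip_six_sigma(values, k=6.0):
    """Replace values outside [U - k*sigma, U + k*sigma] with the nearest bound.

    `values` holds one variable across the data set; None marks a missing
    entry and is passed through untouched (an assumed convention).
    """
    observed = [v for v in values if v is not None]
    u = mean(observed)          # U: mean of the data set
    sigma = pstdev(observed)    # population standard deviation (assumption)
    lower, upper = u - k * sigma, u + k * sigma
    return [v if v is None else min(max(v, lower), upper) for v in values]


# Hypothetical heart-rate column with one implausible reading.
heart_rate = [80.0] * 99 + [1000.0]
cleaned = clip_six_sigma(heart_rate)
```

Note that a single extreme value also inflates σ, so with few observations no value can exceed the 6σ bound; the rule only becomes active on reasonably large data sets.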
Furthermore, in the model construction in the step 3, an improved deep forest algorithm model is used to construct a prediction model;
the improved deep forest algorithm model comprises the following specific steps:
step (1): two tree-based algorithms, random forest and XGBoost, are selected as the base learners of each layer, and each layer uses 2 identical random forest models and 2 identical XGBoost models as its four base-learner prediction models. Each model is trained on the data using k-fold cross validation, wherein k is 10, and outputs a probability vector X_i;
step (2): after the k-fold training, the accuracy of the prediction result on the training data is calculated, and this accuracy, after normalization, is used as the weight parameter of the model and is recorded as w'_i; the calculation formula is as follows:
$$ w'_i = \frac{W_i}{\sum_{j=1}^{4} W_j} $$
wherein: W_i represents the accuracy of the prediction result on the training data of the ith forest algorithm model after 10-fold training; w'_i represents the weight of the prediction probability of the ith model;
step (3): the prediction probabilities of the four base-learner prediction models used by each layer are fused by weighting, and the resulting probability vector is recorded as X_prob; the probability vector X_prob output by each layer is then concatenated with the original features and input into the next layer. The calculation formula of X_prob is as follows:
$$ X_{prob} = \sum_{i=1}^{4} w'_i X_i $$
wherein X_i represents the predicted probability vector output by the ith forest, each layer being trained on the data using k-fold cross validation with k = 10;
step (4): steps (1) to (3) are repeated on the next layer and training continues until the training accuracy no longer improves, at which point training stops automatically and the weighted probability of the last layer is output as the prediction result.
Further, in the training and tuning of the model parameters in step 4, each layer of the algorithm model used for training and prediction uses 2 identical random forests and 2 identical XGBoost models, and each forest comprises 100 trees; each layer is trained using 10-fold cross validation, and the data are split into a training set and a test set in a 0.8/0.2 ratio; the evaluation index is the prediction accuracy, and training is terminated when the prediction accuracy of the next layer no longer improves.
Further, the importance metric value of each feature variable output in step 5 is generated during model training, and the feature importance finally output by each layer is F'_n, calculated as follows:
F'_n = w'_i * F_i
wherein n represents the nth layer, i represents the ith forest algorithm model of each layer, and F_i represents the feature importances of the ith forest model averaged over the 10-fold training;
and finally, selecting and outputting the feature importance of the last layer, and further obtaining the importance metric of each feature variable.
Compared with the prior art, the invention has the following beneficial effects:
The invention takes medical data extracted within 24 hours after a patient enters the ICU and preprocesses them by variable screening, missing value filling, outlier processing, feature extraction and the like, and constructs an improved deep forest algorithm model for the task of early sepsis prediction. Unlike the original deep forest architecture, which directly stacks the probabilities predicted by each forest of the previous layer, the improved model takes a weighted average of the class distribution probabilities output by the forests, with the weight parameters determined by the accuracy of the trained models. This reduces dimensionality, keeps only the key extracted information as new features for prediction, and greatly speeds up training. In addition, the method emphasizes scoring the importance of the features: each layer outputs the contribution of the features in its trained models, and the overall importance scores of all features are finally calculated to obtain the early warning factors that have an important influence on early sepsis prediction. Using machine learning to predict and classify sepsis achieves higher accuracy and can help doctors assess the patient's condition, watch closely the patients who may develop sepsis, intervene in time, and improve the prognosis of ICU patients.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a SOFA score table;
FIG. 3 is a diagram of a cascaded forest structure;
FIG. 4 is a diagram of the improved deep forest structure of the present invention;
FIG. 5 is a schematic representation of the ranking of feature importance metric values.
Detailed Description
In order to better explain the technical scheme, a more detailed description is given below with reference to an embodiment.
A sepsis early prediction method based on machine learning comprises the following steps:
step 1: early prediction of sepsis is a predictive classification task that requires data acquisition first. The data is the cornerstone of research, and for different disease problems, the corresponding data is needed as a support. The ICU patient has many examinations and has complicated data records, so that a plurality of required prediction characteristic variables need to be extracted from the electronic medical record records or medical data sets of the patient, and clinical data within 24 hours after entering the ICU are mainly extracted;
1.1 predictive feature variables commonly used in medical scoring systems
Clinically, sepsis is diagnosed mainly by assessing the patient's condition with medical scoring systems such as the SOFA score. The variables commonly used in these scoring systems fall into three types: vital sign variables (e.g. heart rate, pulse, systolic blood pressure, arterial pressure), laboratory measurements (e.g. creatinine, direct bilirubin, serum glucose, lactate) and demographic information (e.g. age, weight, length of stay, type of stay). FIG. 2 shows the variables needed in the SOFA score table. To compare the performance of machine learning methods with traditional clinical diagnosis, other studies often extract and use the variables that appear in medical scoring systems, which makes comparison easier.
Step 2: after data is extracted, before the data is input into a prediction model, the data needs to be preprocessed, including variable screening, missing value filling, abnormal value processing, feature extraction and other operations of the data, so that the data quality is improved.
Generally, the quality of a raw data set is low: it has many problems, cannot be used directly, and needs some preprocessing. Real-world data often have many defects, and data exported from a hospital system are distorted for various reasons, such as human error, equipment error or flaws in the collection process. ICU data are sparse and contain a large number of missing values and some abnormal values. The extracted variables are also measured at inconsistent frequencies: some laboratory tests, such as blood culture or white blood cell count, may need 1-5 days to return results; some variables, such as heart rate or respiratory rate, are sampled in real time; and others may be sampled every few hours or once a day. The data therefore need to be preprocessed before prediction. The goal of data processing is to handle each kind of dirty data appropriately and obtain usable data of a consistent standard. The data preprocessing mainly comprises the following steps:
Variable screening: a missing rate threshold is set and variables whose missing rate is too high are removed. Low-variance filtering is performed at the same time: the variance of each variable is calculated, a threshold is set, and a variable is filtered out when its variance is below the threshold. Zero-variance features are generally selected for removal; a variance of 0 means that the value of the variable never changes, so the variable has no discriminative power for the model.
Outlier processing: because abnormal values in medical data are difficult to define, the 6σ principle is used, where σ denotes the standard deviation of the data; this widens the upper and lower bounds of the usual 3σ rule. If an item of a patient's data deviates from the mean of the data set to which it belongs by more than 6 standard deviations, i.e. the value lies outside [U - 6σ, U + 6σ], where U denotes the mean of the data set, the value is replaced by the lower or upper bound respectively.
Missing value filling: the missing values are filled by the MissForest model prediction method. MissForest is a non-parametric missing value filling method that uses an algorithm model to predict and fill missing values. The algorithm takes the columns with known data as features and a column with missing values as the label. The rows in which the label is present form the training set and the rows in which it is missing form the test set; a random forest is trained to predict and update the missing values, the predicted values are filled in, and the procedure iterates to predict the remaining missing data.
Feature extraction: this is the expansion of variables. Common medical scoring systems generally judge the severity of a patient's condition from extreme values, so each variable is expanded in three ways: its maximum value, its minimum value and its mean value.
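The variable screening and feature expansion steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patent's code: the function names, the `None` convention for missing values, and the default thresholds (a missing rate of 0.5 and a variance of 0.0) are hypothetical, since the patent does not give threshold values.

```python
from statistics import mean, pvariance


def screen_variables(columns, max_missing_rate=0.5, min_variance=0.0):
    """Keep variables whose missing rate and variance pass the thresholds.

    `columns` maps a variable name to its per-patient values (None = missing).
    Both thresholds are illustrative defaults, not values from the patent.
    """
    kept = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        missing_rate = 1 - len(observed) / len(values)
        if missing_rate > max_missing_rate:
            continue  # missing rate too high
        if len(observed) < 2 or pvariance(observed) <= min_variance:
            continue  # zero (or too low) variance: no discriminative power
        kept[name] = values
    return kept


def expand_features(name, readings):
    """Summarise one patient's 24-hour readings as maximum, minimum and mean."""
    observed = [r for r in readings if r is not None]
    if not observed:
        return {f"{name}_max": None, f"{name}_min": None, f"{name}_mean": None}
    return {
        f"{name}_max": max(observed),
        f"{name}_min": min(observed),
        f"{name}_mean": mean(observed),
    }


# Hypothetical example data.
columns = {
    "heart_rate": [80, 95, None, 110],
    "unit_flag": [1, 1, 1, 1],             # constant, so it is dropped
    "rare_lab": [None, None, None, 2.0],   # 75% missing, so it is dropped
}
selected = screen_variables(columns)
features = expand_features("hr", [80, None, 110, 95])
```

Screening works on columns across patients, while the expansion works on one patient's repeated readings of a single variable, which is why the two functions take differently shaped inputs.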
Step 3: after data preprocessing, the data are input into the constructed prediction model, which is built using an improved deep forest algorithm. When mining medical data with machine learning, different algorithms need to be selected or constructed for different disease diagnosis tasks in order to achieve the best prediction effect. Many machine learning algorithms are available, such as logistic regression, support vector machines and K-nearest neighbours, as well as various improved algorithms built on these basic ones. With the continuing development of neural networks, deep learning methods such as transfer learning and reinforcement learning are also widely used with outstanding results. The deep forest algorithm is competitive with deep neural networks while having fewer hyper-parameters, high training efficiency and excellent performance on small-scale data sets. In intelligent computer-aided diagnosis, many studies have obtained high accuracy with deep learning, but because a neural network is a black box that cannot be explained, doctors place little trust in it. Tree models are interpretable: the reason for a prediction can be explained from the splitting path in the decision tree.
3.1 deep forest model
The multi-Grained Cascade Forest algorithm (gcForest), also called the deep forest algorithm, was proposed in 2017 by Professor Zhi-Hua Zhou et al. It is a multi-layer classifier in which each layer is an ensemble of multiple base classifiers. The structure of gcForest resembles that of a deep neural network, and it learns representations layer by layer. Deep neural networks are widely applied because of their strong representation learning ability, automatic feature transformation and high model complexity, but they also suffer from high computational cost, an excessive number of parameters and a lack of interpretability. gcForest offers a new way of exploring deep algorithms other than neural networks. The gcForest algorithm consists of two parts: the first part is multi-grained scanning, which extracts features from the data; the second part is a cascade forest, which is trained iteratively to improve the classification result. The two parts can be used together or separately. Whereas the basic unit of a neural network is the neuron, the basic unit of gcForest is a forest, which is itself an ensemble of decision trees, so the deep forest can also be regarded as an ensemble model. FIG. 3 is a diagram of the cascade forest structure of the gcForest model.
3.2 improved deep forest model
gcForest is in fact a framework for deep forest algorithms and needs to be adapted to the specific application scenario. On the basis of the gcForest framework, an improved deep forest algorithm is proposed for sepsis prediction; its structure is shown in FIG. 4. Unlike the original gcForest architecture, which directly stacks the probabilities predicted by each forest of the previous layer, the improved algorithm takes a weighted average of the class distribution probabilities output by the forests, with the weight parameters determined by the accuracy of the trained models. The weight parameter w'_i of each forest model is calculated as follows:
$$ w'_i = \frac{W_i}{\sum_{j=1}^{4} W_j} $$
wherein W_i represents the accuracy of the prediction result on the training data of the ith forest algorithm model after 10-fold training.
The prediction probabilities of the four models are then fused by weighting, and the resulting output probability vector is recorded as X_prob; the calculation formula is as follows:
$$ X_{prob} = \sum_{i=1}^{4} w'_i X_i $$
wherein X_i represents the predicted probability vector output by the ith forest, each layer being trained on the data using k-fold (k = 10) cross validation.
The resulting probability vector X_prob is concatenated with the original features and input into the next layer; this continues until model training is finished, and the weighted probability of the last layer is output as the prediction result.
Taking a weighted average of the probabilities predicted by all forest models reduces dimensionality, keeps only the key extracted information as new features for prediction, and greatly speeds up training. In addition, the feature importance of each layer can be output during training and finally averaged; the model weight parameter w'_i can likewise be used to assist in this computation.
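A minimal Python sketch of the accuracy-weighted fusion described in this section follows. It assumes that normalization means dividing each model's accuracy by the sum of all accuracies, which is an interpretation of the text rather than a confirmed formula; the function names and example numbers are hypothetical.

```python
def normalize_weights(accuracies):
    """Turn per-model accuracies W_i into weights w'_i that sum to 1.

    Sum-normalization is an assumption about the patent's formula.
    """
    total = sum(accuracies)
    return [a / total for a in accuracies]


def fuse_probabilities(prob_vectors, weights):
    """Weighted average of the class-probability vectors X_i: the vector X_prob."""
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_vectors))
        for c in range(n_classes)
    ]


def next_layer_input(original_features, fused):
    """Concatenate the original features with X_prob for the next layer."""
    return list(original_features) + list(fused)


# Hypothetical layer: two random forests and two XGBoost models.
accuracies = [0.8, 0.8, 0.9, 0.9]
probs = [[0.4, 0.6], [0.3, 0.7], [0.2, 0.8], [0.1, 0.9]]  # [P(no sepsis), P(sepsis)]
weights = normalize_weights(accuracies)
x_prob = fuse_probabilities(probs, weights)
augmented = next_layer_input([72.0, 1.1], x_prob)
```

With two classes and four base learners, the original gcForest would append 4 x 2 = 8 probability features per layer, whereas this fusion appends only 2, which is the dimensionality reduction the text refers to.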
Step 4: the model is trained; the training time varies with the scale of the training data. Model training is essentially a process of parameter optimization: the optimal hyper-parameters and the parameters of each base model are found through training and then tuned continuously according to the evaluation index until the model performance is stable and optimal. Each layer of the algorithm model used for training and prediction uses 2 identical random forests and 2 identical XGBoost models, and each forest comprises 100 trees. Each layer is trained using k-fold (k = 10) cross validation, with the data split into a training set and a test set in a 0.8/0.2 ratio. The evaluation index is the prediction accuracy, and training is terminated when the prediction accuracy of the next layer no longer improves.
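The layer-by-layer stopping rule can be sketched as a small Python loop. The sketch deliberately leaves out the actual model fitting: `train_layer` is a hypothetical callable that is assumed to train the four base learners of one layer and return the layer's accuracy together with the fused probability vector for each sample. The `max_layers` cap and the choice to discard the layer that fails to improve are also assumptions.

```python
def grow_cascade(train_layer, features, labels, max_layers=10):
    """Add cascade layers until the accuracy stops improving.

    train_layer(inputs, labels) -> (accuracy, fused_probs) is a hypothetical
    callable standing in for fitting one layer's base learners.
    Returns the (accuracy, fused_probs) pairs of the layers that were kept.
    """
    inputs = [list(row) for row in features]
    best_accuracy = float("-inf")
    history = []
    for _ in range(max_layers):
        accuracy, fused = train_layer(inputs, labels)
        if accuracy <= best_accuracy:
            break  # no improvement: drop this layer and stop training
        best_accuracy = accuracy
        history.append((accuracy, fused))
        # Next layer sees the original features plus this layer's fused probabilities.
        inputs = [list(row) + list(p) for row, p in zip(features, fused)]
    return history


# Usage with a stub layer whose accuracy plateaus at the third layer.
_accuracies = iter([0.80, 0.85, 0.85])

def stub_layer(inputs, labels):
    return next(_accuracies), [[0.5, 0.5] for _ in inputs]

layers = grow_cascade(stub_layer, [[1.0, 2.0], [3.0, 4.0]], [0, 1])
```

Because each layer always receives the original features plus only the previous layer's fused vector, the input width stays constant from the second layer onward.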
Step 5: after the model is trained, it outputs a sepsis prediction for a new patient. The model outputs the predicted probability of disease as a value between 0 and 1; patients with a predicted probability greater than or equal to 0.5 are classified as 1, meaning that the patient is considered likely to develop sepsis, whereas a predicted probability below 0.5 is classified as 0, meaning that the patient is not expected to develop sepsis. After model training is finished, the importance metric value of each feature variable can also be output. The feature importance finally output by each layer is F'_n, calculated as follows:
F'_n = w'_i * F_i
wherein n represents the nth layer, i represents the ith forest algorithm model of each layer, and F_i represents the feature importances of the ith forest model averaged over the 10-fold training.
Finally, the feature importance of the last layer is output. As shown in FIG. 5, the abscissa represents the importance metric value of a feature and the ordinate represents the corresponding feature variable. The longer the bar, the larger the value. The importance values of all features sum to 1.
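The 0.5 decision threshold and the weighted feature importance can be sketched as follows. The sketch assumes that the per-model terms w'_i * F_i are summed over the four models of a layer, which is consistent with the statement that the importances sum to 1 but is not spelled out in the text; all names and numbers are hypothetical.

```python
def classify(probability, threshold=0.5):
    """Return 1 (sepsis likely) if probability >= threshold, else 0."""
    return 1 if probability >= threshold else 0


def layer_feature_importance(weights, importances):
    """Combine per-model importances F_i using the model weights w'_i.

    Summing w'_i * F_i over the models is an assumption; if the weights and
    each F_i sum to 1, the combined importances also sum to 1.
    """
    n_features = len(importances[0])
    return [
        sum(w * f[j] for w, f in zip(weights, importances))
        for j in range(n_features)
    ]


def rank_features(names, scores):
    """Sort features from most to least important."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical last layer with four equally weighted models and three features.
weights = [0.25, 0.25, 0.25, 0.25]
importances = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.2, 0.2],
    [0.5, 0.3, 0.2],
]
scores = layer_feature_importance(weights, importances)
ranking = rank_features(["lactate", "heart_rate", "age"], scores)
label = classify(0.62)
```

The ranked list corresponds to the bar chart in FIG. 5, with the top entries acting as the early warning factors.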
The invention first extracts the patient's predictive variables from an electronic medical record or a medical data set, mainly three major types of multidimensional variable information recorded within 24 hours after the patient enters the ICU: vital sign variables, laboratory measurement indices and demographic information. The extracted data are then preprocessed, including variable screening, outlier processing, missing value filling and feature extraction, which completes the cleaning and processing of the data. The processed data are input into the constructed prediction model, which uses an improved deep forest algorithm model. After training and parameter tuning, the prediction model can make predictions for a new patient: the new patient's values for the same variables used in training are input, and an early sepsis prediction is produced for that patient. In addition, the trained algorithm model can rank the feature variables and output the early warning factors that have an important influence on early sepsis prediction. The higher the score of an early warning factor, the closer its relationship with the onset of sepsis. The technical scheme of the invention can assist doctors in making clinical decisions and improve the accuracy of predicting the patient's condition.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (6)

1. A sepsis early prediction method based on machine learning is characterized by comprising the following steps:
step 1, firstly, defining the diagnosis task, and extracting, from the patient's electronic medical record or a medical data set, a plurality of predictive feature variables recorded within 24 hours after the patient enters the ICU;
step 2, after the data are extracted, preprocessing the data, since they contain missing and abnormal values to varying degrees, wherein the preprocessing comprises variable screening, missing value filling, outlier processing and feature extraction;
step 3, after data preprocessing, inputting the data into the constructed prediction model, and constructing the prediction model by using an improved deep forest algorithm;
step 4, training the prediction model, finding the optimal parameters through training, and continuously adjusting the optimal parameters to ensure that the model effect is stable and optimal;
and 5, after the model is trained, carrying out sepsis prediction on a new patient, and simultaneously outputting the importance metric value of each characteristic variable.
2. The machine learning-based early prediction of sepsis method of claim 1, characterized by: the predictive characteristic variables extracted in step 1 comprise vital sign variables, laboratory measurement indexes, and demographic information.
3. The machine learning-based early prediction of sepsis method of claim 1, characterized by: the data preprocessing in the step 2 comprises the following steps:
variable screening: setting a missing-rate threshold and removing variables whose missing rate is too high; at the same time performing low-variance filtering, namely calculating the variance of each variable, setting a threshold, and filtering out any variable whose variance is below the threshold; zero-variance features are removed, since a variance of 0 indicates that the value of the variable does not change and the variable therefore has no discriminative power for the model;
missing value filling: filling the missing values using the MissForest model prediction method, wherein MissForest is a non-parametric missing value filling method that predicts and fills the missing values using an algorithm model;
outlier processing: the 6σ principle is used, where σ represents the standard deviation of the data and U represents the mean of the data set; if an item of a patient's data deviates from the mean of the data set to which it belongs by more than 6 standard deviations, i.e. the value lies outside [U - 6σ, U + 6σ], it is replaced by the minimum or maximum limit value respectively;
feature extraction: expanding the characteristic variables; because medical scoring systems judge the severity of a patient's condition from maximum and minimum values, the features are expanded using the maximum value, the minimum value and the mean value.
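The screening, outlier and feature-expansion steps of claim 3 can be illustrated with a minimal sketch using only the Python standard library. The function names, the `None` marker for missing values and the default thresholds are assumptions made for illustration and do not come from the patent; MissForest filling is not shown.

```python
from statistics import mean, pstdev, pvariance


def screen_variables(columns, max_missing_rate=0.5, min_variance=0.0):
    """Drop variables that are mostly missing or (near-)constant.

    `columns` maps a variable name to a list of values, with None marking
    a missing value. Both thresholds are illustrative assumptions.
    """
    kept = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        missing_rate = 1 - len(observed) / len(values)
        if missing_rate > max_missing_rate:
            continue  # missing rate too high
        if len(observed) < 2 or pvariance(observed) <= min_variance:
            continue  # zero or low variance: no discriminative power
        kept[name] = values
    return kept


def clip_outliers(values, k=6.0):
    """Replace values outside [U - k*sigma, U + k*sigma] by the nearest limit."""
    observed = [v for v in values if v is not None]
    u, sigma = mean(observed), pstdev(observed)
    low, high = u - k * sigma, u + k * sigma
    return [v if v is None else min(max(v, low), high) for v in values]


def expand_features(series):
    """Summarise one patient's 24-hour series as maximum, minimum and mean."""
    observed = [v for v in series if v is not None]
    return {"max": max(observed), "min": min(observed), "mean": mean(observed)}
```

In this sketch the mean and standard deviation are computed over the observed values of one variable across the data set, and missing entries are passed through unchanged so that a separate filling step can handle them.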
4. The machine learning-based early prediction of sepsis method of claim 1, characterized by: in the model building in the step 3, an improved deep forest algorithm model is used for building a prediction model;
the improved deep forest algorithm model comprises the following specific steps:
step (1): selecting two tree-based algorithms, random forest and XGBoost, as the base learners of each layer, wherein each layer uses 2 identical random forest models and 2 identical XGBoost models as its four base learner prediction models; each model is trained on the data using k-fold cross validation, where k = 10, and the probability vector X_i of each forest is output;
step (2): after k-fold training, the accuracy of each model's prediction result on the training data is also calculated; after normalization, this accuracy is used as the weight parameter of the model, denoted w'_i, and calculated as follows:
Figure FDA0002376595180000021
wherein: wiShowing the accuracy of the predicted result of the training data of the ith forest algorithm model after 10-fold training; w'iA weight representing a prediction probability of the ith model;
and (3): the prediction probabilities of the four base learning device prediction models used by each layer are subjected to weighted fusion, and the finally output probability vector is recorded as XprobThen the final probability vector X of each layer outputprobInput into the next layer, X, in connection with the original featuresprobThe calculation formula of (2) is as follows:
Xprob=∑w′i*Xi
wherein, XiAnd the predicted probability vector output by the ith forest is represented by training data by adopting k-fold cross validation at each layer, wherein k is 10.
And (4): and (4) repeating the steps (1) to (3) on the next layer, continuing training until the training precision is not improved any more, automatically stopping, and outputting the weighted probability of the last layer as a prediction result.
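The weighted fusion of steps (2) and (3) can be sketched as follows. The weight formula itself is only available here as a figure reference, so the sketch assumes plain sum-normalisation of the accuracies (w'_i = W_i / Σ_j W_j); that choice, and all function names, are assumptions rather than the patent's stated method.

```python
def normalize_weights(accuracies):
    """Turn per-learner accuracies W_i into weights w'_i that sum to 1.

    Assumption: simple sum-normalisation; the claim's own formula is not
    reproduced in the text.
    """
    total = sum(accuracies)
    return [a / total for a in accuracies]


def fuse_probabilities(prob_vectors, weights):
    """Compute X_prob = sum_i w'_i * X_i, class by class, for one sample."""
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_vectors))
        for c in range(n_classes)
    ]


def next_layer_input(original_features, x_prob):
    """Concatenate the original features with the fused class probabilities."""
    return list(original_features) + list(x_prob)


# Four base learners (2 random forests, 2 XGBoost models) for one patient;
# the accuracies and probabilities below are made-up illustrative values.
weights = normalize_weights([0.8, 0.8, 0.9, 0.7])
x_prob = fuse_probabilities(
    [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.6, 0.4]], weights
)
augmented = next_layer_input([37.2, 110.0, 1.8], x_prob)
```

Because the weights sum to 1 and each learner's probabilities sum to 1, the fused vector is itself a valid probability vector under this assumption.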
5. The method for the early prediction of sepsis based on machine learning according to claim 1, characterized in that: in the training and tuning of the model parameters in step 4, each layer of the algorithm model used for training and prediction uses 2 identical random forests and 2 identical XGBoost models, and each forest comprises 100 trees; the data in each layer are trained using 10-fold cross validation, and the data are divided into a training set and a test set in a 0.8/0.2 ratio; prediction accuracy is used as the evaluation index, and training is terminated if the prediction accuracy of the next layer no longer improves.
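The 0.8/0.2 split and the layer-by-layer stopping rule of claim 5 can be sketched as below. `train_layer` is a hypothetical callable standing in for training one layer (2 random forests and 2 XGBoost models with 10-fold cross validation) and returning its prediction accuracy; `max_layers` and the random seed are likewise assumptions. The claim does not say whether the final, non-improving layer is kept, and this sketch discards it.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into training and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def grow_cascade(train_layer, max_layers=20):
    """Add layers until the prediction accuracy stops improving.

    `train_layer(depth)` is a hypothetical callable that trains layer
    `depth` and returns its accuracy. Returns the accuracies of the
    layers that were kept.
    """
    best_accuracy = float("-inf")
    history = []
    for depth in range(1, max_layers + 1):
        accuracy = train_layer(depth)
        if accuracy <= best_accuracy:
            break  # no improvement: stop without keeping this layer
        best_accuracy = accuracy
        history.append(accuracy)
    return history
```

The depth of the cascade is therefore determined by the data rather than fixed in advance, which is the usual motivation for this kind of stopping rule in deep forest models.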
6. The machine learning-based early prediction of sepsis method of claim 1, characterized by: the importance metric value of each characteristic variable output in step 5 is generated during the model training process, and the final feature importance output by each layer is denoted F'_n, calculated as follows:
F'_n = w'_i * F_i
wherein n denotes the n-th layer, i denotes the i-th forest algorithm model of each layer, and F_i denotes the feature importances of the i-th forest model averaged over the 10 training folds;
finally, the feature importance of the last layer is selected and output, giving the importance metric of each characteristic variable.
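A minimal sketch of the importance weighting in claim 6 follows. The claim writes F'_n = w'_i * F_i without an explicit sum; the sketch assumes the weighted importances are summed over the base learners of the layer, by analogy with X_prob. That summation, the function names and the example variable names are all assumptions.

```python
def layer_feature_importance(importances, weights):
    """Combine per-learner importances F_i using the weights w'_i.

    Assumption: the weighted importances are summed over the learners of
    the layer, giving one score per feature.
    """
    n_features = len(importances[0])
    return [
        sum(w * f[j] for w, f in zip(weights, importances))
        for j in range(n_features)
    ]


def rank_features(names, scores):
    """Sort features from most to least important."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Two learners with equal weight; all values are made up for illustration.
scores = layer_feature_importance(
    [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]], [0.5, 0.5]
)
ranking = rank_features(["heart_rate", "lactate", "age"], scores)
```

Features at the top of the ranking correspond to the early warning factors described in the specification.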
CN202010068293.2A 2020-01-21 2020-01-21 Sepsis early prediction method based on machine learning Pending CN111261282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010068293.2A CN111261282A (en) 2020-01-21 2020-01-21 Sepsis early prediction method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010068293.2A CN111261282A (en) 2020-01-21 2020-01-21 Sepsis early prediction method based on machine learning

Publications (1)

Publication Number Publication Date
CN111261282A true CN111261282A (en) 2020-06-09

Family

ID=70952505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010068293.2A Pending CN111261282A (en) 2020-01-21 2020-01-21 Sepsis early prediction method based on machine learning

Country Status (1)

Country Link
CN (1) CN111261282A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696667A (en) * 2020-06-11 2020-09-22 吾征智能技术(北京)有限公司 Common gynecological disease prediction model construction method and prediction system
CN111897857A (en) * 2020-08-06 2020-11-06 暨南大学附属第一医院(广州华侨医院) ICU (intensive care unit) duration prediction method after aortic dissection cardiac surgery
CN111951975A (en) * 2020-08-19 2020-11-17 哈尔滨工业大学 Sepsis early warning method based on deep learning model GPT-2
CN111951964A (en) * 2020-07-30 2020-11-17 山东大学 Method and system for rapidly detecting novel coronavirus pneumonia
CN111986814A (en) * 2020-08-21 2020-11-24 南通大学 Modeling method of lupus nephritis prediction model of lupus erythematosus patient
CN112069820A (en) * 2020-09-10 2020-12-11 杭州中奥科技有限公司 Model training method, model training device and entity extraction method
CN112331350A (en) * 2020-10-14 2021-02-05 华南师范大学 Method, system and storage medium for predicting early shift into intensive care unit
CN112365943A (en) * 2020-10-22 2021-02-12 杭州未名信科科技有限公司 Method and device for predicting length of stay of patient, electronic equipment and storage medium
CN112633601A (en) * 2020-12-31 2021-04-09 天津开心生活科技有限公司 Method, device, equipment and computer medium for predicting disease event occurrence probability
CN112908480A (en) * 2021-03-17 2021-06-04 上海电气集团股份有限公司 Organ failure early warning method and system, electronic equipment and storage medium
CN112992368A (en) * 2021-04-09 2021-06-18 中山大学附属第三医院(中山大学肝脏病医院) Prediction model system and recording medium for prognosis of severe spinal cord injury
CN113017831A (en) * 2021-02-26 2021-06-25 上海鹰瞳医疗科技有限公司 Method and equipment for predicting arch height after artificial lens implantation
CN113017572A (en) * 2021-03-17 2021-06-25 上海交通大学医学院附属瑞金医院 Severe warning method and device, electronic equipment and storage medium
CN113057589A (en) * 2021-03-17 2021-07-02 上海电气集团股份有限公司 Method and system for predicting organ failure infection diseases and training prediction model
CN113057588A (en) * 2021-03-17 2021-07-02 上海电气集团股份有限公司 Disease early warning method, device, equipment and medium
CN113096814A (en) * 2021-05-28 2021-07-09 哈尔滨理工大学 Alzheimer disease classification prediction method based on multi-classifier fusion
CN113284615A (en) * 2021-06-16 2021-08-20 北京大学人民医院 XGboost algorithm-based gastrointestinal stromal tumor prediction method and system
CN113517066A (en) * 2020-08-03 2021-10-19 东南大学 Depression assessment method and system based on candidate gene methylation sequencing and deep learning
CN113539394A (en) * 2020-12-31 2021-10-22 内蒙古卫数数据科技有限公司 Multi-disease prediction method based on medical inspection data
CN113593708A (en) * 2021-07-12 2021-11-02 杭州电子科技大学 Sepsis prognosis prediction method based on integrated learning algorithm
CN113744869A (en) * 2021-09-07 2021-12-03 中国医科大学附属盛京医院 Method for establishing early screening of light chain amyloidosis based on machine learning and application thereof
CN113871009A (en) * 2021-09-27 2021-12-31 山东师范大学 Sepsis prediction system, storage medium and apparatus in intensive care unit
CN114420300A (en) * 2022-01-20 2022-04-29 北京大学第六医院 Chinese old cognitive impairment prediction model
CN114724701A (en) * 2022-03-11 2022-07-08 梁娜 Noninvasive ventilation curative effect prediction system based on superposition integration algorithm and automatic encoder
WO2022216220A1 (en) * 2021-04-07 2022-10-13 Biosigns Pte. Ltd. Method and system for personalized prediction of infection and sepsis
CN115579147A (en) * 2022-09-26 2023-01-06 一选(浙江)医疗科技有限公司 Sepsis recognition model training method, sepsis early warning method and device
CN116580847A (en) * 2023-07-14 2023-08-11 天津医科大学总医院 Modeling method and system for prognosis prediction of septic shock
CN117238510A (en) * 2023-11-16 2023-12-15 天津中医药大学第二附属医院 Sepsis prediction method and system based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874663A (en) * 2017-01-26 2017-06-20 中电科软件信息服务有限公司 Cardiovascular and cerebrovascular disease Risk Forecast Method and system
CN109119167A (en) * 2018-07-11 2019-01-01 山东师范大学 Pyemia anticipated mortality system based on integrated model
CN109872819A (en) * 2019-01-30 2019-06-11 杭州脉兴医疗科技有限公司 A kind of acute kidney injury incidence rate forecasting system based on Intensive Care Therapy detection

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696667A (en) * 2020-06-11 2020-09-22 吾征智能技术(北京)有限公司 Common gynecological disease prediction model construction method and prediction system
CN111951964A (en) * 2020-07-30 2020-11-17 山东大学 Method and system for rapidly detecting novel coronavirus pneumonia
CN113517066A (en) * 2020-08-03 2021-10-19 东南大学 Depression assessment method and system based on candidate gene methylation sequencing and deep learning
CN111897857A (en) * 2020-08-06 2020-11-06 暨南大学附属第一医院(广州华侨医院) ICU (intensive care unit) duration prediction method after aortic dissection cardiac surgery
CN111951975A (en) * 2020-08-19 2020-11-17 哈尔滨工业大学 Sepsis early warning method based on deep learning model GPT-2
CN111951975B (en) * 2020-08-19 2022-03-25 哈尔滨工业大学 Sepsis early warning method based on deep learning model GPT-2
CN111986814B (en) * 2020-08-21 2024-01-16 南通大学 Modeling method of lupus nephritis prediction model of lupus erythematosus patient
CN111986814A (en) * 2020-08-21 2020-11-24 南通大学 Modeling method of lupus nephritis prediction model of lupus erythematosus patient
CN112069820A (en) * 2020-09-10 2020-12-11 杭州中奥科技有限公司 Model training method, model training device and entity extraction method
CN112069820B (en) * 2020-09-10 2024-05-24 杭州中奥科技有限公司 Model training method, model training device and entity extraction method
CN112331350A (en) * 2020-10-14 2021-02-05 华南师范大学 Method, system and storage medium for predicting early shift into intensive care unit
CN112365943A (en) * 2020-10-22 2021-02-12 杭州未名信科科技有限公司 Method and device for predicting length of stay of patient, electronic equipment and storage medium
CN112633601A (en) * 2020-12-31 2021-04-09 天津开心生活科技有限公司 Method, device, equipment and computer medium for predicting disease event occurrence probability
CN113539394A (en) * 2020-12-31 2021-10-22 内蒙古卫数数据科技有限公司 Multi-disease prediction method based on medical inspection data
CN113017831A (en) * 2021-02-26 2021-06-25 上海鹰瞳医疗科技有限公司 Method and equipment for predicting arch height after artificial lens implantation
CN112908480A (en) * 2021-03-17 2021-06-04 上海电气集团股份有限公司 Organ failure early warning method and system, electronic equipment and storage medium
CN113057589A (en) * 2021-03-17 2021-07-02 上海电气集团股份有限公司 Method and system for predicting organ failure infection diseases and training prediction model
CN113017572A (en) * 2021-03-17 2021-06-25 上海交通大学医学院附属瑞金医院 Severe warning method and device, electronic equipment and storage medium
CN113057588A (en) * 2021-03-17 2021-07-02 上海电气集团股份有限公司 Disease early warning method, device, equipment and medium
WO2022216220A1 (en) * 2021-04-07 2022-10-13 Biosigns Pte. Ltd. Method and system for personalized prediction of infection and sepsis
CN112992368B (en) * 2021-04-09 2023-06-20 中山大学附属第三医院(中山大学肝脏病医院) Prediction model system and storage medium for severe spinal cord injury prognosis
CN112992368A (en) * 2021-04-09 2021-06-18 中山大学附属第三医院(中山大学肝脏病医院) Prediction model system and recording medium for prognosis of severe spinal cord injury
CN113096814A (en) * 2021-05-28 2021-07-09 哈尔滨理工大学 Alzheimer disease classification prediction method based on multi-classifier fusion
CN113284615A (en) * 2021-06-16 2021-08-20 北京大学人民医院 XGboost algorithm-based gastrointestinal stromal tumor prediction method and system
CN113593708A (en) * 2021-07-12 2021-11-02 杭州电子科技大学 Sepsis prognosis prediction method based on integrated learning algorithm
CN113744869A (en) * 2021-09-07 2021-12-03 中国医科大学附属盛京医院 Method for establishing early screening of light chain amyloidosis based on machine learning and application thereof
CN113744869B (en) * 2021-09-07 2024-03-26 中国医科大学附属盛京医院 Method for establishing early screening light chain type amyloidosis based on machine learning and application thereof
CN113871009A (en) * 2021-09-27 2021-12-31 山东师范大学 Sepsis prediction system, storage medium and apparatus in intensive care unit
CN114420300A (en) * 2022-01-20 2022-04-29 北京大学第六医院 Chinese old cognitive impairment prediction model
CN114420300B (en) * 2022-01-20 2023-08-04 北京大学第六医院 Chinese senile cognitive impairment prediction model
CN114724701A (en) * 2022-03-11 2022-07-08 梁娜 Noninvasive ventilation curative effect prediction system based on superposition integration algorithm and automatic encoder
CN115579147A (en) * 2022-09-26 2023-01-06 一选(浙江)医疗科技有限公司 Sepsis recognition model training method, sepsis early warning method and device
CN115579147B (en) * 2022-09-26 2024-02-09 一选(浙江)医疗科技有限公司 Sepsis recognition model training method, sepsis early warning method and sepsis early warning device
CN116580847B (en) * 2023-07-14 2023-11-28 天津医科大学总医院 Method and system for predicting prognosis of septic shock
CN116580847A (en) * 2023-07-14 2023-08-11 天津医科大学总医院 Modeling method and system for prognosis prediction of septic shock
CN117238510A (en) * 2023-11-16 2023-12-15 天津中医药大学第二附属医院 Sepsis prediction method and system based on deep learning
CN117238510B (en) * 2023-11-16 2024-02-06 天津中医药大学第二附属医院 Sepsis prediction method and system based on deep learning

Similar Documents

Publication Publication Date Title
CN111261282A (en) Sepsis early prediction method based on machine learning
CN110827993A (en) Early death risk assessment model establishing method and device based on ensemble learning
CN111951975B (en) Sepsis early warning method based on deep learning model GPT-2
CN112201330B (en) Medical quality monitoring and evaluating method combining DRGs tool and Bayesian model
Zhou et al. Modeling methodology for early warning of chronic heart failure based on real medical big data
CN111081379B (en) Disease probability decision method and system thereof
CN114023441A (en) Severe AKI early risk assessment model and device based on interpretable machine learning model and development method thereof
CN113838577B (en) Convenient layered old people MODS early death risk assessment model, device and establishment method
CN113470816A (en) Machine learning-based diabetic nephropathy prediction method, system and prediction device
Wang et al. Predictive classification of ICU readmission using weight decay random forest
CN113593708A (en) Sepsis prognosis prediction method based on integrated learning algorithm
CN115482932A (en) Multivariate blood glucose prediction algorithm based on transfer learning and glycosylated hemoglobin
Van Steenkiste et al. Sensor fusion using backward shortcut connections for sleep apnea detection in multi-modal data
CN111986814A (en) Modeling method of lupus nephritis prediction model of lupus erythematosus patient
KR102169637B1 (en) Method for predicting of mortality risk and device for predicting of mortality risk using the same
Alghatani et al. Precision clinical medicine through machine learning: using high and low quantile ranges of vital signs for risk stratification of ICU patients
KR102421172B1 (en) Smart Healthcare Monitoring System and Method for Heart Disease Prediction Based On Ensemble Deep Learning and Feature Fusion
US6941288B2 (en) Online learning method in a decision system
CN117116477A (en) Construction method and system of prostate cancer disease risk prediction model based on random forest and XGBoost
CN114464319B (en) AMS susceptibility assessment system based on slow feature analysis and deep neural network
Shahul et al. Machine Learning Based Analysis of Sepsis
Rajmohan et al. G-Sep: A Deep Learning Algorithm for Detection of Long-Term Sepsis Using Bidirectional Gated Recurrent Unit
CN115083616A (en) Chronic nephropathy subtype mining system based on self-supervision graph clustering
Umut et al. Prediction of sepsis disease by Artificial Neural Networks
CN112365992A (en) Medical examination data identification and analysis method based on NRS-LDA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination