CN112992368A - Prediction model system and recording medium for prognosis of severe spinal cord injury - Google Patents

Info

Publication number
CN112992368A
CN112992368A CN202110383930.XA CN202110383930A CN112992368A CN 112992368 A CN112992368 A CN 112992368A CN 202110383930 A CN202110383930 A CN 202110383930A CN 112992368 A CN112992368 A CN 112992368A
Authority
CN
China
Prior art keywords
algorithm
probability
patient
discharge
blood cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110383930.XA
Other languages
Chinese (zh)
Other versions
CN112992368B (en
Inventor
戎利民
范国鑫
刘华清
庞卯
刘斌
张良明
黄桂芳
韩蓝青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Research Institute Of Tsinghua Pearl River Delta
Third Affiliated Hospital Sun Yat Sen University
Original Assignee
Research Institute Of Tsinghua Pearl River Delta
Third Affiliated Hospital Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Research Institute Of Tsinghua Pearl River Delta, Third Affiliated Hospital Sun Yat Sen University filed Critical Research Institute Of Tsinghua Pearl River Delta
Priority to CN202110383930.XA priority Critical patent/CN112992368B/en
Publication of CN112992368A publication Critical patent/CN112992368A/en
Application granted granted Critical
Publication of CN112992368B publication Critical patent/CN112992368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Primary Health Care (AREA)
  • Evolutionary Biology (AREA)
  • Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a prediction model system for prognosis of severe spinal cord injury, which comprises: establishing a clinical feature database of spinal cord injury patient cases; constructing a prediction model of severe spinal cord injury prognosis; predicting, according to the final prediction model, the probability values that the patient's discharge endpoint is death, continued professional rehabilitation care, or return home; and inputting these probability values into Formula 1 to give the final predicted probability values, wherein Formula 1 is:
Figure DDA0003014109990000011
The invention also discloses a computer-readable recording medium. The invention calculates the probability of each discharge endpoint based on clinical medical history and identifies important clinical features that affect the clinical outcome of patients with severe spinal cord injury.

Description

Prediction model system and recording medium for prognosis of severe spinal cord injury
Technical Field
The present invention relates to a prediction model system and a recording medium for prognosis of severe spinal cord injury.
Background
Spinal cord injury patients are often admitted to the Intensive Care Unit (ICU) because of major trauma or serious complications, so their prognosis is a significant concern for clinicians and patients' families. However, accurately predicting the prognosis of severe spinal cord injury remains a clinical problem. In practice, physicians often judge a patient's prognosis empirically in order to develop a treatment plan, but this approach often fails to provide an objective, quantifiable probability of prognosis for the patient's condition. A system that accurately and objectively predicts the prognosis of patients with severe spinal cord injury is therefore important for assisting clinicians.
Disclosure of Invention
In order to overcome the drawbacks of the prior art, the present invention provides a predictive model system for prognosis of severe spinal cord injury and a computer-readable recording medium thereof, which can calculate the probability of discharge endpoint based on clinical medical history and find out the important clinical features affecting the discharge endpoint of patients with severe spinal cord injury.
The invention is realized by the following technical route:
a prediction model system for prognosis of severe spinal cord injury, comprising: establishing a clinical feature database of spinal cord injury patient cases;
constructing a prediction model of severe spinal cord injury prognosis: clinical features are extracted from the clinical feature database, and missing data are handled by different imputation methods according to the type of each extracted feature: continuous variable features are imputed by predictive mean matching, binary variable features by logistic regression, and multi-class variable features by multinomial (polytomous) regression; the resulting features are randomly divided into a training data set and a test data set in a reasonable proportion; algorithm combination models are built from feature selection methods and machine learning classification algorithms, wherein the feature selection methods screen clinical features with significant predictive value and the selected clinical features are used to train the machine learning classification algorithms; from the algorithm combination models, those with the best micro-average area under the curve (AUC) for predicting the patient's discharge endpoint (three categories: return home, continued professional rehabilitation care, and death) are selected, and the final prediction model is constructed by the stacking ensemble method;
the probability values of death, continued professional rehabilitation care, and return home at the patient's discharge endpoint, as predicted by the algorithm combination models, are input into Formula 1, which gives the final predicted probability values,
Formula 1:
Figure BDA0003014109970000021
In the above formula,
$P_\theta(X)$ represents the discharge endpoint category probabilities, where $p(y=1 \mid X;\theta)$ is the probability of death, $p(y=2 \mid X;\theta)$ is the probability of continued professional rehabilitation care, and $p(y=3 \mid X;\theta)$ is the probability of return home;
$\theta_j = [\theta_{j,1}\ \theta_{j,2}\ \theta_{j,3}\ \dots\ \theta_{j,3n-2}\ \theta_{j,3n-1}\ \theta_{j,3n}]$ (where $j \in \{1,2,3\}$) represents the coefficients, where $n$ is the number of base classifiers;
$\theta_1 = [\theta_{1,1}\ \theta_{1,2}\ \theta_{1,3}\ \dots\ \theta_{1,3n-2}\ \theta_{1,3n-1}\ \theta_{1,3n}]$
$\theta_2 = [\theta_{2,1}\ \theta_{2,2}\ \theta_{2,3}\ \dots\ \theta_{2,3n-2}\ \theta_{2,3n-1}\ \theta_{2,3n}]$
$\theta_3 = [\theta_{3,1}\ \theta_{3,2}\ \theta_{3,3}\ \dots\ \theta_{3,3n-2}\ \theta_{3,3n-1}\ \theta_{3,3n}]$
where $\theta_{i,j}$ ($i \in \{1,2,3\}$; $j \in \{1,2,\dots,3n\}$) are the coefficients of the pre-trained integrated prediction model;
$X = [x_1\ x_2\ x_3\ \dots\ x_{3n-2}\ x_{3n-1}\ x_{3n}]$ represents the discharge endpoint probabilities predicted by the $n$ base classifiers, where $x_{3k-2}$ is the probability, predicted by the $k$-th ($k = 1,2,\dots,n$) base classifier, that the discharge endpoint is death; $x_{3k-1}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is continued professional rehabilitation care; and $x_{3k}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is return home; the symbol "…" denotes omitted terms.
$y=1 \mid X;\theta$ denotes that, with the discharge endpoint probabilities $X$ predicted by the $n$ base classifiers as input, the algorithm predicts the patient's discharge endpoint to be category 1; $y=1$ means the algorithm predicts discharge endpoint category 1, death; $y=2$ means the algorithm predicts discharge endpoint category 2, continued professional rehabilitation care; $y=3$ means the algorithm predicts discharge endpoint category 3, return home;
$T$ denotes vector transposition;
$\theta_1^{T}X$ denotes the product of the transposed column vector $\theta_1$ and the vector $X$;
$\theta_j^{T}$ denotes the column vector $\theta_j$ transposed into a row vector; $\theta$ denotes the coefficients of the algorithm, whose specific values are obtained through training.
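Formula 1 itself survives in this text only as an image reference. Given the symbols defined for it (the class probabilities $p(y=j \mid X;\theta)$, the coefficient vectors $\theta_j$, and the products $\theta_j^{T}X$), one form consistent with those definitions is the standard softmax (multinomial logistic) function. The following is a reconstruction under that assumption, not a transcription of the original figure:

```latex
P_\theta(X) =
\begin{bmatrix}
p(y=1 \mid X;\theta) \\
p(y=2 \mid X;\theta) \\
p(y=3 \mid X;\theta)
\end{bmatrix}
=
\frac{1}{\sum_{j=1}^{3} e^{\theta_j^{T} X}}
\begin{bmatrix}
e^{\theta_1^{T} X} \\
e^{\theta_2^{T} X} \\
e^{\theta_3^{T} X}
\end{bmatrix}
```

Under this form the three outputs are non-negative and sum to 1, which matches their description as the probabilities of death, continued professional rehabilitation care, and return home.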
Feature selection is performed on the training data set by the feature selection algorithms, and cross-validation yields an AUC matrix on the validation set; the vertical axis of the AUC matrix lists the feature selection methods and the horizontal axis lists the machine learning classification algorithms, so that each cell corresponds to one algorithm combination model (a feature selection method paired with a machine learning classification algorithm). According to the prediction performance of the algorithm combination models on the validation set, the three algorithm combinations with the largest micro-average AUC are selected and integrated by the stacking method to obtain the final prediction model.
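The micro-average AUC used as the selection criterion can be illustrated with a small sketch. It assumes the common definition: the one-vs-rest indicators and predicted scores of all three classes are pooled, and a single AUC is computed over the pool. The function names and the toy data are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def micro_average_auc(y_true: Sequence[int],
                      proba: Sequence[Sequence[float]],
                      n_classes: int = 3) -> float:
    """Pool one-vs-rest indicators and scores over all classes, then compute one AUC."""
    flat_labels: List[int] = []
    flat_scores: List[float] = []
    for y, row in zip(y_true, proba):
        for c in range(n_classes):
            flat_labels.append(1 if y == c else 0)
            flat_scores.append(row[c])
    return binary_auc(flat_labels, flat_scores)


# Hypothetical toy data: 0 = death, 1 = continued rehabilitation care, 2 = return home.
y_true = [0, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.3, 0.4, 0.3],
]
auc = micro_average_auc(y_true, proba)
```

The pairwise comparison is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable on real data.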
There are 7 feature selection methods, used to screen clinical features with significant predictive value: maximal information coefficient (MIC), embedded random forest (RF), recursive feature elimination (RFE), embedded linear support vector classifier (LSVC), embedded logistic regression (LR), embedded tree, and minimum redundancy maximum relevance (mRMR). There are 15 machine learning classification algorithms: logistic regression, linear discriminant analysis (LDA), support vector machine (SVM), K-nearest neighbors (KNN), Gaussian naive Bayes (NB), decision tree, extra trees, random forest, bagging, adaptive boosting (AdaBoost), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), multilayer perceptron (MLP), and deep neural network (DNN).
The clinical features are as follows: demographic information includes race, gender, age, body mass index, admission type, ICU type, admission source, ICU length of stay, length of stay after ICU discharge, and the like; vital signs include respiratory rate, heart rate, systolic and diastolic blood pressure, and mean arterial pressure; laboratory data include white blood cell count, red blood cell count (RBC), platelet count, basophils, eosinophils, neutrophils, lymphocytes, monocytes, red blood cell distribution width (RDW), hemoglobin, hematocrit, mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), prothrombin time (PT), activated partial thromboplastin time (APTT), international normalized ratio (INR), fraction of inspired oxygen (FiO2), partial pressure of oxygen (PaO2), partial pressure of carbon dioxide (PaCO2), pH, bicarbonate, lactate, base excess (BE), anion gap, potassium, sodium, calcium, magnesium, chloride, phosphate, blood urea nitrogen (BUN), creatinine, albumin, blood glucose, and the like; medication and treatment conditions include mechanical ventilation, morphine sulfate, cefazolin, potassium chloride (KCl), glucocorticoids, dopamine, dobutamine, epinephrine, and norepinephrine;
clinical features for which missing cases account for 50% or more of all cases are deleted directly, including red blood cell distribution width (RDW), partial pressure of oxygen (PaO2), race, mean corpuscular volume (MCV), lactate, and morphine sulfate; age, body mass index, white blood cell count, red blood cell count (RBC), platelet count, basophils, eosinophils, neutrophils, lymphocytes, monocytes, red blood cell distribution width (RDW), hemoglobin, hematocrit, mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), prothrombin time (PT), activated partial thromboplastin time (APTT), international normalized ratio (INR), partial pressure of oxygen (PaO2), partial pressure of carbon dioxide (PaCO2), pH, bicarbonate, lactate, base excess (BE), anion gap, potassium, sodium, calcium, magnesium, chloride, phosphate, blood urea nitrogen (BUN), creatinine, albumin, blood glucose, respiratory rate, heart rate, systolic and diastolic blood pressure, mean arterial pressure, ICU length of stay, length of stay after ICU discharge, and fraction of inspired oxygen (FiO2) are continuous variable features; mechanical ventilation, morphine sulfate, cefazolin, potassium chloride (KCl), glucocorticoids, dopamine, dobutamine, epinephrine, and norepinephrine are binary variable features; race, gender, ICU type, admission source, and the like are multi-class variable features; the multi-class variable features and binary variable features such as mechanical ventilation and morphine sulfate are converted into dummy variable form, finally yielding 55 different features.
The three optimal algorithm combinations are embedded tree * gradient boosting decision tree (GBDT), embedded tree * extreme gradient boosting (XGBoost), and embedded LSVC * extreme gradient boosting (XGBoost).
The computer-readable recording medium records a program that causes a computer to execute the following steps: predicting, according to the final prediction model of claim 1, the probability values that the patient's discharge endpoint is death, continued professional rehabilitation care, or return home; and inputting said probability values into Formula 1 to give the final predicted probability values,
Formula 1:
Figure BDA0003014109970000041
In the above formula,
$P_\theta(X)$ represents the discharge endpoint category probabilities, where $p(y=1 \mid X;\theta)$ is the probability of death, $p(y=2 \mid X;\theta)$ is the probability of continued professional rehabilitation care, and $p(y=3 \mid X;\theta)$ is the probability of return home;
$\theta_j = [\theta_{j,1}\ \theta_{j,2}\ \theta_{j,3}\ \dots\ \theta_{j,3n-2}\ \theta_{j,3n-1}\ \theta_{j,3n}]$ (where $j \in \{1,2,3\}$) represents the coefficients, where $n$ is the number of base classifiers; $j \in \{1,2,3\}$ has the same meaning as $j = 1,2,3$.
$\theta_1 = [\theta_{1,1}\ \theta_{1,2}\ \theta_{1,3}\ \dots\ \theta_{1,3n-2}\ \theta_{1,3n-1}\ \theta_{1,3n}]$
$\theta_2 = [\theta_{2,1}\ \theta_{2,2}\ \theta_{2,3}\ \dots\ \theta_{2,3n-2}\ \theta_{2,3n-1}\ \theta_{2,3n}]$
$\theta_3 = [\theta_{3,1}\ \theta_{3,2}\ \theta_{3,3}\ \dots\ \theta_{3,3n-2}\ \theta_{3,3n-1}\ \theta_{3,3n}]$
where $\theta_{i,j}$ ($i = 1,2,3$; $j = 1,2,\dots,3n$) are the coefficients of the pre-trained integrated prediction model;
$X = [x_1\ x_2\ x_3\ \dots\ x_{3n-2}\ x_{3n-1}\ x_{3n}]$ represents the discharge endpoint probabilities predicted by the $n$ base classifiers, where $x_{3k-2}$ is the probability, predicted by the $k$-th ($k = 1,2,\dots,n$) base classifier, that the discharge endpoint is death; $x_{3k-1}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is continued professional rehabilitation care; and $x_{3k}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is return home;
$y=1 \mid X;\theta$ denotes that, with the discharge endpoint probabilities $X$ predicted by the $n$ base classifiers as input, the algorithm predicts the patient's discharge endpoint to be category 1; $y=2$ means the algorithm predicts discharge endpoint category 2, continued professional rehabilitation care; $y=3$ means the algorithm predicts discharge endpoint category 3, return home;
$T$ denotes vector transposition;
$\theta_1^{T}X$ denotes the product of the transposed column vector $\theta_1$ and the vector $X$;
$\theta_j^{T}$ denotes the column vector $\theta_j$ transposed into a row vector; $\theta$ denotes the coefficients of the algorithm, whose specific values are obtained through training.
The three optimal algorithm combinations are embedded tree * gradient boosting decision tree (GBDT), embedded tree * extreme gradient boosting (XGBoost), and embedded LSVC * extreme gradient boosting (XGBoost).
The invention has the following advantages:
The prediction model system for the prognosis of severe spinal cord injury can calculate the probability of each discharge endpoint based on clinical medical history and identify important clinical features that influence the clinical outcome of patients with severe spinal cord injury. By screening clinical features with significant predictive value through feature selection methods and training the machine learning classification algorithms on the selected features, the invention can construct an accurate machine learning model for predicting the prognosis of patients with severe spinal cord injury. By computing the prediction performance of the machine learning classification algorithms on the validation data set, the invention builds an AUC matrix that displays the prediction accuracy of all 105 models at once.
Drawings
FIG. 1 is a flow chart of a system for predicting the prognosis of severe spinal cord injury according to the present invention;
FIG. 2 is the AUC matrix obtained by the present invention, showing the prediction performance of the machine learning classification algorithms on the training data set.
Detailed Description
The invention relates to a prediction model system for prognosis of severe spinal cord injury, constructed from the clinical data of patients with severe spinal cord injury, which comprises: establishing a clinical feature database of spinal cord injury patient cases; and constructing a prediction model of severe spinal cord injury prognosis: clinical features are extracted from the clinical feature database, and missing data are handled by different imputation methods according to the type of each extracted feature: continuous variable features are imputed by predictive mean matching, binary variable features by logistic regression, and multi-class variable features by multinomial (polytomous) regression; the resulting features are randomly divided into a training data set and a test data set in a reasonable proportion; algorithm combination models are built from feature selection methods and machine learning classification algorithms, wherein the feature selection methods screen clinical features with significant predictive value and the selected clinical features are used to train the machine learning classification algorithms; from the algorithm combination models, those with the best micro-average area under the curve (AUC) for predicting the patient's discharge endpoint (three categories: return home, continued professional rehabilitation care, and death) are selected, and the final prediction model is constructed by the stacking ensemble method.
The probability values of death, continued professional rehabilitation care, and return home at the patient's discharge endpoint, as predicted by the algorithm combination models, are input into Formula 1, which gives the final predicted probability values,
Formula 1:
Figure BDA0003014109970000061
In the above formula,
$P_\theta(X)$ represents the discharge endpoint category probabilities, where $p(y=1 \mid X;\theta)$ is the probability of death, $p(y=2 \mid X;\theta)$ is the probability of continued professional rehabilitation care, and $p(y=3 \mid X;\theta)$ is the probability of return home;
$\theta_j = [\theta_{j,1}\ \theta_{j,2}\ \theta_{j,3}\ \dots\ \theta_{j,3n-2}\ \theta_{j,3n-1}\ \theta_{j,3n}]$ (where $j \in \{1,2,3\}$) represents the coefficients, where $n$ is the number of base classifiers;
$\theta_1 = [\theta_{1,1}\ \theta_{1,2}\ \theta_{1,3}\ \dots\ \theta_{1,3n-2}\ \theta_{1,3n-1}\ \theta_{1,3n}]$
$\theta_2 = [\theta_{2,1}\ \theta_{2,2}\ \theta_{2,3}\ \dots\ \theta_{2,3n-2}\ \theta_{2,3n-1}\ \theta_{2,3n}]$
$\theta_3 = [\theta_{3,1}\ \theta_{3,2}\ \theta_{3,3}\ \dots\ \theta_{3,3n-2}\ \theta_{3,3n-1}\ \theta_{3,3n}]$
where $\theta_{i,j}$ ($i = 1,2,3$; $j = 1,2,\dots,3n$) are the coefficients of the pre-trained integrated prediction model;
$X = [x_1\ x_2\ x_3\ \dots\ x_{3n-2}\ x_{3n-1}\ x_{3n}]$ represents the discharge endpoint probabilities predicted by the $n$ base classifiers, where $x_{3k-2}$ is the probability, predicted by the $k$-th ($k = 1,2,\dots,n$) base classifier, that the discharge endpoint is death; $x_{3k-1}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is continued professional rehabilitation care; and $x_{3k}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is return home;
$y=1 \mid X;\theta$ denotes the prediction given the input patient features; $y=1$ means the algorithm predicts discharge endpoint category 1, death; $y=2$ means the algorithm predicts discharge endpoint category 2, continued professional rehabilitation care; $y=3$ means the algorithm predicts discharge endpoint category 3, return home;
$T$ denotes vector transposition;
$\theta_1^{T}X$ denotes the product of the transposed column vector $\theta_1$ and the vector $X$;
$\theta_j^{T}$ denotes the column vector $\theta_j$ transposed into a row vector; $\theta$ denotes the coefficients of the algorithm, whose specific values are obtained through training.
For example, $x_1, x_2, x_3$ respectively represent the probabilities, predicted by the 1st optimal algorithm combination, that the discharge endpoint is death, continued professional rehabilitation care, or return home; $x_4, x_5, x_6$ respectively represent the same three probabilities as predicted by the 2nd optimal algorithm combination; and $x_7, x_8, x_9$ respectively represent the same three probabilities as predicted by the 3rd optimal algorithm combination.
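As an illustration of how the nine stacked probabilities (n = 3 base classifiers) could be turned into three final probabilities, here is a minimal sketch. It assumes that Formula 1, which is only available as an image reference in this text, has the standard softmax form over the products $\theta_j^{T}X$. The function name `softmax_stack` and all numeric values of `X` and `theta` are hypothetical; real coefficients would come from training.

```python
import math
from typing import List, Sequence


def softmax_stack(X: Sequence[float], theta: Sequence[Sequence[float]]) -> List[float]:
    """Return [p(death), p(continued care), p(return home)] from stacked base-classifier outputs.

    Assumption: class j gets the score theta_j^T X and the scores are normalized by softmax.
    """
    if any(len(row) != len(X) for row in theta):
        raise ValueError("each coefficient vector must have the same length as X")
    logits = [sum(t * x for t, x in zip(row, X)) for row in theta]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of three base classifiers: (death, continued care, return home) each.
X = [0.10, 0.30, 0.60,
     0.20, 0.30, 0.50,
     0.05, 0.35, 0.60]

# Hypothetical coefficients: theta_j simply sums the class-j probability of every base classifier.
theta = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
]

probs = softmax_stack(X, theta)
predicted_class = probs.index(max(probs)) + 1  # 1 = death, 2 = continued care, 3 = return home
```

With these illustrative coefficients the third class receives the largest score, so the sketch predicts category 3 (return home).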
Referring to FIG. 1, the clinical data of the severe spinal cord injury patient according to the present invention include clinical data from the first examination, used for early prediction, and clinical data from the most recent examination, used for near-term prediction; the predicted discharge endpoint falls into three categories: death (Death), continued professional rehabilitation care (FMC), and return home (Home). The final curve in FIG. 1 shows the results of the first integrated prediction model in predicting the discharge endpoint: the overall AUC was 0.878, and the per-class AUCs were 0.968 for death, 0.828 for continued professional rehabilitation care, and 0.831 for return home.
The construction of the final prediction model comprises the following steps:
(1) Including clinical features of the patient that have potential predictive value: demographic information includes race, gender, age, body mass index, admission type, ICU type, admission source, ICU length of stay, length of stay after ICU discharge, and the like; vital signs include respiratory rate, heart rate, systolic and diastolic blood pressure, and mean arterial pressure; laboratory data include white blood cell count, red blood cell count (RBC), platelet count, basophils, eosinophils, neutrophils, lymphocytes, monocytes, red blood cell distribution width (RDW), hemoglobin, hematocrit, mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), prothrombin time (PT), activated partial thromboplastin time (APTT), international normalized ratio (INR), fraction of inspired oxygen (FiO2), partial pressure of oxygen (PaO2), partial pressure of carbon dioxide (PaCO2), pH, bicarbonate, lactate, base excess (BE), anion gap, potassium, sodium, calcium, magnesium, chloride, phosphate, blood urea nitrogen (BUN), creatinine, albumin, blood glucose, and the like; medication and treatment conditions include mechanical ventilation, morphine sulfate, cefazolin, potassium chloride (KCl), glucocorticoids, dopamine, dobutamine, epinephrine, and norepinephrine;
(2) Preprocessing the clinical features: missing data are handled by different imputation methods depending on the type of clinical feature. Specifically, any clinical feature for which missing cases account for more than 50% of all cases is deleted directly. Clinical features for which missing cases account for less than 50% of all cases are imputed with the R package "mice" according to feature type: continuous variable features are imputed by predictive mean matching (prior art); binary variable features are imputed by logistic regression (prior art); and multi-class variable features are imputed by multinomial (polytomous) regression (prior art).
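The per-feature decision described in step (2) can be sketched as a small planning function. This is only an illustration of the rule (drop above the missing-data threshold, otherwise pick an imputation method by variable type); it does not perform the imputation. The labels "pmm", "logreg", and "polyreg" are the method names used by the R package mice for predictive mean matching, logistic regression, and polytomous regression. The function name, the "drop" and "keep" labels, and the toy columns are hypothetical.

```python
from typing import Dict, List, Optional

# Imputation method per variable type, using the method names of the R package mice.
METHOD_BY_TYPE = {"continuous": "pmm", "binary": "logreg", "multiclass": "polyreg"}


def plan_imputation(columns: Dict[str, List[Optional[object]]],
                    kinds: Dict[str, str],
                    max_missing: float = 0.5) -> Dict[str, str]:
    """Decide, for each feature, whether to drop it, keep it as is, or which method to impute with.

    Assumption: a feature is dropped when its missing fraction exceeds max_missing,
    following the "greater than 50%" wording of step (2).
    """
    plan: Dict[str, str] = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values) / len(values)
        if missing > max_missing:
            plan[name] = "drop"
        elif missing == 0:
            plan[name] = "keep"  # complete feature, nothing to impute
        else:
            plan[name] = METHOD_BY_TYPE[kinds[name]]
    return plan


# Hypothetical toy columns; None marks a missing value.
columns = {
    "lactate": [None, None, None, 1.2],
    "heart_rate": [80, None, 92, 75],
    "mechanical_ventilation": [1, 0, None, 1],
    "icu_type": ["MICU", None, "SICU", "MICU"],
    "age": [54, 61, 47, 70],
}
kinds = {
    "lactate": "continuous",
    "heart_rate": "continuous",
    "mechanical_ventilation": "binary",
    "icu_type": "multiclass",
    "age": "continuous",
}
plan = plan_imputation(columns, kinds)
```

The resulting plan could then be handed to whichever imputation tool is used; the patent itself names the R package for that step.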
(3) All multi-class variable features in the imputed feature data are converted into dummy variables. For example, a dummy variable reflecting gender may be: male = (0, 0); female = (1, 0); not specified = (0, 1).
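The gender example corresponds to dummy coding with a reference level: the first level maps to all zeros and each remaining level gets its own indicator. A minimal sketch follows; the function name `dummy_encode` and the level labels are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def dummy_encode(value: str, levels: Sequence[str]) -> Tuple[int, ...]:
    """Dummy-code a category with levels[0] as the reference (all-zero) level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One indicator per non-reference level.
    return tuple(1 if value == level else 0 for level in levels[1:])


GENDER_LEVELS = ["male", "female", "not specified"]
encoded: Dict[str, Tuple[int, ...]] = {g: dummy_encode(g, GENDER_LEVELS) for g in GENDER_LEVELS}
```

A feature with k levels therefore becomes k - 1 binary columns, which is how a handful of multi-class features can expand the final feature count.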
(4) The preprocessed clinical data set is randomly divided into a training data set (60%), a validation data set (20%), and a test data set (20%).
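A random 60/20/20 split of this kind can be sketched as follows. The function name and the fixed seed are illustrative assumptions; the patent does not say whether the split is stratified by discharge endpoint, and this sketch is not.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_60_20_20(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the cases and cut them into training (60%), validation (20%), and test (rest)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100  # integer arithmetic avoids float rounding
    n_valid = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical case identifiers 0..99.
train, valid, test = split_60_20_20(range(100))
```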
(5) N × M algorithm combinations are included: N general feature selection methods are applied to screen clinical features with significant predictive value. The feature selection methods are maximal information coefficient (MIC), embedded random forest (RF), recursive feature elimination (RFE), embedded linear support vector classifier (LSVC), embedded logistic regression (LR), embedded tree, and minimum redundancy maximum relevance (mRMR). The selected features are then used to train M machine learning classification algorithms: logistic regression, linear discriminant analysis (LDA), support vector machine (SVM), K-nearest neighbors (KNN), Gaussian naive Bayes (NB), decision tree, extra trees, random forest, bagging, adaptive boosting (AdaBoost), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), multilayer perceptron (MLP), and deep neural network (DNN). All of the feature selection methods are prior art. With N = 7 and M = 15, there are N × M = 105 algorithm combinations.
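The 105 combinations are simply the Cartesian product of the two lists. The short sketch below enumerates them; the string labels are shorthand for the methods named above and are used only as identifiers.

```python
from itertools import product

FEATURE_SELECTORS = [
    "MIC", "embedded RF", "RFE", "embedded LSVC",
    "embedded LR", "embedded tree", "mRMR",
]
CLASSIFIERS = [
    "logistic regression", "LDA", "SVM", "KNN", "Gaussian NB",
    "decision tree", "extra trees", "random forest", "bagging",
    "AdaBoost", "GBDT", "XGBoost", "LightGBM", "MLP", "DNN",
]

# Every (feature selection method, classifier) pair is one algorithm combination,
# i.e. one cell of the AUC matrix.
combinations = list(product(FEATURE_SELECTORS, CLASSIFIERS))
n_combinations = len(combinations)
```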
Constructing the final model by first-stage ensemble learning: according to the prediction performance of the N × M algorithm combinations on the validation data set, the three algorithm combinations with the largest area under the curve (AUC) are selected and combined by the stacking method to obtain a preliminary integrated model (hereinafter referred to as the first integrated model). (For the stacking method, see https://www.jianshu.com/p/7fc9aa03ec11.)
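The selection and stacking step can be sketched in two small functions: one picks the three best cells of the AUC matrix, the other concatenates the class probabilities of the chosen models into the input vector X of the second-level model. The function names, the placeholder combination names, and every AUC value below are hypothetical and are not results from the patent.

```python
from typing import Dict, List, Sequence, Tuple

Combo = Tuple[str, str]  # (feature selection method, classifier)


def top_k_combinations(auc_matrix: Dict[Combo, float], k: int = 3) -> List[Combo]:
    """Return the k combinations with the largest validation AUC, best first."""
    ranked = sorted(auc_matrix.items(), key=lambda item: item[1], reverse=True)
    return [combo for combo, _ in ranked[:k]]


def stack_inputs(per_model_proba: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate [p_death, p_care, p_home] of each selected model into one vector X."""
    stacked: List[float] = []
    for proba in per_model_proba:
        if len(proba) != 3:
            raise ValueError("each base classifier must output three class probabilities")
        stacked.extend(proba)
    return stacked


# Hypothetical AUC matrix with placeholder names and made-up values.
auc_matrix = {
    ("selector_a", "clf_x"): 0.71,
    ("selector_a", "clf_y"): 0.80,
    ("selector_b", "clf_x"): 0.77,
    ("selector_b", "clf_y"): 0.83,
    ("selector_c", "clf_x"): 0.65,
}
best_three = top_k_combinations(auc_matrix)

# Hypothetical predictions of the three selected models for one patient.
X = stack_inputs([[0.1, 0.3, 0.6], [0.2, 0.3, 0.5], [0.05, 0.35, 0.6]])
```

The vector X has 3n = 9 entries for n = 3 base classifiers, matching the layout $x_1, \dots, x_9$ described for the final model.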
When the integrated model obtained in the previous step includes more than 10 features, a simplified, more practical version of the integrated model (hereinafter referred to as the second integrated model) is constructed as follows: the importance of each feature included in the integrated model of step (5) is evaluated on the test data set using the permutation feature importance method. The features are ranked by importance in descending order; only the 10 most important features of the first integrated model are retained, the remaining features are discarded, and the model is retrained on the training data set to obtain the second integrated model. Because it includes only the ten most important features, this model is more practical to use.
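Permutation feature importance measures how much a model's score drops when one feature column is shuffled. The sketch below illustrates the idea with a stand-in model and plain accuracy as the score; the patent evaluates models by AUC, so the metric, the toy model `predict`, and the toy data are all assumptions made for brevity.

```python
import random
from typing import Callable, List, Sequence


def permutation_importance(predict: Callable[[Sequence[float]], int],
                           X: Sequence[Sequence[float]],
                           y: Sequence[int],
                           n_repeats: int = 5,
                           seed: int = 0) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows: Sequence[Sequence[float]]) -> float:
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances: List[float] = []
    for j in range(len(X[0])):
        drops: List[float] = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            permuted = [list(row[:j]) + [v] + list(row[j + 1:]) for row, v in zip(X, column)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in model: uses only feature 0 and ignores feature 1.
predict = lambda row: 1 if row[0] > 0.5 else 0
X = [[0.9, 5.0], [0.8, 1.0], [0.2, 4.0], [0.1, 2.0], [0.7, 3.0], [0.3, 6.0]]
y = [1, 1, 0, 0, 1, 0]

importances = permutation_importance(predict, X, y)
# Rank features by importance, largest first, and keep at most the top 10.
top_features = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)[:10]
```

Shuffling the ignored feature cannot change any prediction, so its importance is exactly zero, while the informative feature can only lose accuracy when shuffled.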
Screening clinical features with significant predictive value in step (3): for any feature selection method that does not itself limit the number of features to select, a feature selection number is set. The feature selection method (one of maximal information coefficient (MIC), embedded random forest (embedded RF), recursive feature elimination (RFE), embedded linear support vector classifier (embedded LSVC), embedded logistic regression (embedded LR), embedded tree, and minimum redundancy-maximum relevance (mRMR)) screens out that number of features on the training data set, and a basic classification algorithm is trained on them by cross validation (a common method; see https://zhuanlan.zhihu.com/p/24825503?refer=rdatamining) to obtain its predictive performance on this set of features. Predictive performance is measured by AUC: the higher the AUC, the better the performance. Different feature selection numbers are then set by traversal (that is, every candidate value is processed in turn) and the above steps are repeated, giving the predictive performance of the basic classification algorithm for each feature selection number. The number at which the basic algorithm performs best is chosen as the optimal feature selection number and is set as the feature selection number of the feature selection algorithm. Preferably, logistic regression is used as the basic classification algorithm.
The machine learning classification algorithms in step (3) are trained as follows. For any combination of a feature selection algorithm and a machine learning classification algorithm, training is divided into three steps: first, the features screened out on the training data set by the feature selection algorithm are used to find the optimal hyper-parameter combination of the machine learning algorithm by grid search or random search; then, the structure of the machine learning classification algorithm is determined from the optimal hyper-parameter combination found; finally, the machine learning classification algorithm is trained by cross validation to obtain its predictive performance on the training data set.
The three steps are described further below:
a: the parameters to be searched for each classification algorithm, and their value ranges, are shown in the following table. Grid search or random search is used to find the optimal combination of parameter values within these ranges.
Definition of a hyper-parameter: a hyper-parameter is a parameter whose value is set before the learning process starts, rather than a parameter obtained by training.
Properties of hyper-parameters: 1. they define higher-level concepts about the model, such as its complexity or learning capacity; 2. they cannot be learned directly from the data during standard model training and must be defined in advance; 3. their values can be decided by setting different values, training different models and selecting the values that test best.
Figure BDA0003014109970000101
Figure BDA0003014109970000111
LR: logistic regression; LDA: linear discriminant analysis; SVM: support vector machine; KNN: k nearest neighbors; Gaussian NB: Gaussian naive Bayes; DT: decision tree; ET: extra decision trees; RF: random forest; AdaBoost: adaptive boosting; Bagging: bagging algorithm; GBDT: gradient boosting decision tree; XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine; MLP: multilayer perceptron; DNN: deep neural network; clf: classifier; invscaling: inverse scaling; relu: rectified linear unit.
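Step a can be sketched as an exhaustive grid search. The actual parameter table exists here only as figure images, so the grid and the scoring function below are hypothetical placeholders; in practice the score would be a cross-validated AUC.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; not the value ranges from the patent's table.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}

def toy_score(params):
    """Placeholder for a cross-validated AUC; peaks at C=1.0 with 'l2'."""
    bonus = 0.02 if params["penalty"] == "l2" else 0.0
    return 0.8 - 0.01 * abs(params["C"] - 1.0) + bonus

best_params, best_score = grid_search(param_grid, toy_score)
```

Random search would differ only in sampling a fixed number of combinations instead of enumerating all of them.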
b: determining the structure of the machine learning classification algorithm: the optimal parameter value combination found in step a is assigned to the corresponding machine learning classification algorithm, which fixes its structure.
c: obtaining the predictive performance of the machine learning classification algorithms on the training data set, i.e., forming an AUC matrix. As shown in fig. 2, the ordinate lists the 7 feature selection methods and the abscissa lists the 15 machine learning algorithms, giving 105 models. Each cell of fig. 2 holds the AUC value of the corresponding model, so the AUC matrix displays the predictive accuracy of all 105 models at once.
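Selecting the best combinations from such an AUC matrix can be sketched as follows. The AUC values are invented purely for illustration and are not results from the patent; only a handful of the 105 cells are shown.

```python
import heapq

# Placeholder AUC values, invented for illustration only (NOT patent data).
auc_matrix = {
    ("embedded tree", "GBDT"): 0.90,
    ("embedded tree", "XGBoost"): 0.89,
    ("embedded LSVC", "XGBoost"): 0.88,
    ("RFE", "LR"): 0.80,
    ("MIC", "KNN"): 0.70,
}

# Keep the three (selector, classifier) cells with the largest AUC.
top_three = heapq.nlargest(3, auc_matrix.items(), key=lambda item: item[1])
best_combinations = [combo for combo, _ in top_three]
```

The selected combinations would then serve as the base classifiers of the stacked model.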
The clinical data used by the invention are patient cases diagnosed with spinal cord injury, extracted from the public MIMIC-III-v1.4, MIMIC-IV-v0.4 and eICU-v2.0 databases; the number of patient cases is 1566. The prediction target is the patient discharge endpoint, which has three classes: returning home, continuing professional rehabilitation and nursing care, and death.
As described in more detail below:
For the clinical features in step 1), missing data are handled by different imputation methods according to the type of clinical feature. Features whose missing cases account for 50% or more of the total cases, namely red blood cell distribution width (RDW) and partial pressure of oxygen (PaO2), are removed directly. Features whose missing cases account for more than 0 and less than 50% of the total cases are ethnicity, mean corpuscular volume (MCV), lactate and morphine sulfate. MCV and lactate are continuous variables and are imputed by predictive mean matching; morphine sulfate and dopamine are binary variables and are imputed by logistic regression; ethnicity is a multi-class variable and is imputed by multinomial regression.
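The missing-data rules above amount to a small decision table. The sketch below only returns the name of the rule that applies; the function name and the type labels are hypothetical, and the imputation methods themselves are not implemented.

```python
def imputation_plan(missing_fraction, variable_type):
    """Return the handling rule for a feature, following the text's thresholds."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    if missing_fraction >= 0.5:
        return "drop feature"
    if missing_fraction == 0.0:
        return "no imputation needed"
    methods = {
        "continuous": "predictive mean matching",
        "binary": "logistic regression",
        "multiclass": "multinomial regression",
    }
    return methods[variable_type]

print(imputation_plan(0.12, "continuous"))
```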
The multi-class features are race, gender, ICU type and source of admission; these are converted into dummy variables. A total of 70 different features are finally obtained. Demographic information includes race, gender, age, body mass index, admission type, ICU type, source of admission, ICU duration, length of stay after ICU discharge, and the like. Vital signs include respiratory rate, heart rate, systolic and diastolic blood pressure, and mean arterial pressure. Laboratory data include white blood cell count, red blood cell count (RBC), platelet count, basophils, eosinophils, neutrophils, lymphocytes, monocytes, red blood cell distribution width (RDW), hemoglobin, hematocrit, mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), prothrombin time (PT), activated partial thromboplastin time (APTT), international normalized ratio (INR), fraction of inspired oxygen (FiO2), partial pressure of oxygen (PaO2), partial pressure of carbon dioxide (PaCO2), pH, bicarbonate, lactate, base excess (BE), anion gap, potassium, sodium, calcium, magnesium, chloride, phosphate, blood urea nitrogen (BUN), creatinine, albumin, blood glucose, and the like. Drug use and treatment conditions include mechanical ventilation, morphine sulfate, cefazolin, potassium chloride (KCl), glucocorticoids, dopamine, dobutamine, epinephrine and norepinephrine.
After preprocessing, the whole clinical data set is randomly divided into a training data set, a validation data set and a test data set in the proportion 60% : 20% : 20%.
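A random 60/20/20 split can be sketched with the standard library alone. The function name and the fixed seed are assumptions; the record count of 1566 comes from the text, and integer indices stand in for patient records.

```python
import random

def split_60_20_20(records, seed=0):
    """Shuffle the records and cut them into 60% / 20% / 20% parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.6)
    n_val = int(len(shuffled) * 0.2)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # remainder, so no record is lost
    return train, validation, test_set

train, validation, test_set = split_60_20_20(range(1566))
print(len(train), len(validation), len(test_set))
```

Giving the remainder to the last part avoids dropping records when the total is not divisible by five.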
As described above, the feature selection methods are maximal information coefficient (MIC), embedded random forest (embedded RF), recursive feature elimination (RFE), embedded linear support vector classifier (embedded LSVC), embedded logistic regression (embedded LR), embedded tree, and minimum redundancy-maximum relevance (mRMR). The embedded methods (embedded RF, embedded LSVC, embedded LR and embedded tree) select an optimal feature combination on their own, without the number of features being set, so they are used directly to select the optimal feature combinations. In the end, 14, 23, 17, 18 and 26 different features are selected by embedded RF, embedded LSVC, embedded LR and embedded tree, respectively. The other three methods, MIC, RFE and mRMR, do not themselves limit the number of features selected.
Therefore, for these three feature selection methods, the feature selection number k is set in the range 5 to 70. Starting from k = 5, the k features screened out on the training data set by the feature selection algorithm are used to train a basic classification algorithm (logistic regression) by cross validation, giving the predictive performance of the basic classification algorithm on this set of k features. The feature selection number is then traversed, i.e., k = 5, 6, ..., 70, and the above steps are repeated to obtain the predictive performance of the basic classification algorithm for each feature selection number. The optimal feature selection number k_best is the one at which the basic algorithm performs best, and it is set as the feature selection number of the feature selection algorithm. In the end, the optimal feature numbers for MIC, RFE and mRMR are 28, 26 and 19, respectively.
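The traversal over k can be sketched as a simple loop. Both callables below are hypothetical stand-ins: `rank_features` represents a feature selection method asked for k features, and `cv_score` represents the cross-validated AUC of the logistic regression baseline. The toy score peaks at an arbitrary k and carries no data from the patent.

```python
def select_k_best(rank_features, cv_score, k_min=5, k_max=70):
    """Try every k in [k_min, k_max] and keep the best-scoring one."""
    best_k, best_score = None, float("-inf")
    for k in range(k_min, k_max + 1):
        features = rank_features(k)   # k features chosen by the selector
        score = cv_score(features)    # baseline performance on them
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Hypothetical stand-ins so the sketch runs on its own.
def toy_rank_features(k):
    return list(range(k))

def toy_cv_score(features):
    return 0.85 - 0.001 * abs(len(features) - 20)

best_k, best_score = select_k_best(toy_rank_features, toy_cv_score)
```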
The features selected in the previous step are used to train 15 machine learning classification algorithms: logistic regression, linear discriminant analysis (LDA), support vector machine (SVM), K nearest neighbors (KNN), Gaussian naive Bayes (NB), decision tree, extra trees, random forest, bagging, adaptive boosting (AdaBoost), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), multilayer perceptron (MLP) and deep neural network (DNN). For any combination of a feature selection algorithm and a machine learning classification algorithm, training is divided into three steps: first, the features screened out on the training data set by the feature selection algorithm are used to find the optimal hyper-parameter combination of the machine learning algorithm by grid search or random search; then, the structure of the machine learning classification algorithm is determined from the optimal hyper-parameter combination found; finally, the machine learning classification algorithm is trained by cross validation to obtain its predictive performance on the training data set.
Constructing the final model by ensemble learning: based on the predictive performance of the combinations of the 7 feature selection algorithms and 15 machine learning classification algorithms on the validation data set, the combinations with the best micro-averaged area under the curve (AUC) for predicting the patient discharge endpoint (three classes: returning home, continuing professional rehabilitation and nursing care, and death) are:
optimal algorithm combination 1: embedded tree + gradient boosting decision tree (GBDT);
optimal algorithm combination 2: embedded tree + extreme gradient boosting (XGBoost);
optimal algorithm combination 3: embedded LSVC + extreme gradient boosting (XGBoost);
wherein:
the features screened by the embedded tree include: Glasgow Coma Scale (GCS) total score, length of hospital stay, mechanical ventilation, systolic blood pressure, diastolic blood pressure, ICU duration, length of stay after leaving the ICU, albumin, heart rate, cefazolin, lactate, bicarbonate, red blood cell distribution width (RDW), mean arterial pressure, hemoglobin, age, heart rate (HR), potassium chloride, blood urea nitrogen, total number of diagnoses, morphine, blood chloride, blood glucose, white blood cells (WBC), sodium, and fraction of inspired oxygen (FiO2);
the features screened by the embedded LSVC include: epinephrine, norepinephrine, fraction of inspired oxygen (FiO2), systolic blood pressure, cefazolin, glucocorticoids, bicarbonate, Glasgow Coma Scale (GCS) total score, length of hospital stay, mechanical ventilation, hemoglobin, age, heart rate (HR), albumin, potassium chloride, blood urea nitrogen, total number of diagnoses, blood chloride, lactate, thromboplastin time, mean arterial pressure, white blood cells (WBC), red blood cells, platelets, and blood glucose.
The three algorithm combinations are integrated by the stacking method to construct the first integrated model (also called the basic model or integrated prediction model). It predicts the clinical outcome at discharge for severe spinal cord injury (death, continuing professional rehabilitation and nursing care, or returning home) and outputs the corresponding probability values.
It can be seen that the three optimal algorithm combinations above use 38 features in total. Because the first integrated model therefore includes more than 10 features, a simplified and more practical version (the second integrated model) is constructed as follows: the importance of each feature included in the first integrated model of step (5) is evaluated on the test data set using permutation feature importance. The features are sorted by importance in descending order; only the 10 most important features of the first integrated model are kept, and the rest are discarded.
The 10 features retained are: length of hospital stay, Glasgow Coma Scale (GCS) total score, age, fraction of inspired oxygen (FiO2), blood glucose, heart rate, red blood cell distribution width (RDW), albumin, blood urea nitrogen, and total number of diagnoses.
The features discarded are: blood chloride, lactate, blood glucose, thromboplastin time (PTT), mean arterial pressure, white blood cells, platelets, sodium, heart rate, systolic arterial pressure, hemoglobin, mechanical ventilation, morphine, cefazolin, potassium chloride, ICU duration, and the like.
The second integrated model is obtained by retraining on the training data set with the 10 retained features. The coefficients of the model are: $\theta_1 = [5.6, 1.9, 0.4, 4.2, 1.5, 1.1, 4.6, 1.4, 1.2]$, $\theta_2 = [0.7, 3.9, 1.8, 0.8, 3.7, 1.9, 0.6, 3.6, 1.6]$, $\theta_3 = [0.7, 2.3, 4.1, 0.9, 2.2, 3.4, 1.1, 1.9, 3.9]$. Because the model includes only the ten most important features, it is highly practical.
The predictive performance of the final prediction model, i.e., its AUC, is then tested on the test data set.
A computer-readable recording medium of the present invention records a program for running a prognosis prediction for severe spinal cord injury, the program causing a computer to carry out the following steps: determining values from a medical history collected from a patient and entering said values into said program; executing the program and then calculating the probability of each discharge endpoint by the following formula 1 based on the values determined in the above step; the values are the probability values, predicted according to the final prediction model, that the patient's discharge endpoint is death, continuing professional rehabilitation and nursing care, or returning home;
equation 1
Figure BDA0003014109970000161
In the above-mentioned formula,
$P_\theta(X)$ denotes the discharge endpoint class probabilities, where $p(y=1 \mid X;\theta)$ is the probability of death, $p(y=2 \mid X;\theta)$ is the probability of continuing professional rehabilitation and nursing care, and $p(y=3 \mid X;\theta)$ is the probability of returning home;
$\theta_j = [\theta_{j,1}\ \theta_{j,2}\ \theta_{j,3}\ \dots\ \theta_{j,3n-2}\ \theta_{j,3n-1}\ \theta_{j,3n}]$ (where $j \in \{1,2,3\}$) denotes the coefficients, where $n$ is the number of base classifiers;
Figure BDA0003014109970000162
Figure BDA0003014109970000163
Figure BDA0003014109970000164
where $\theta_{i,j}$ ($i = 1,2,3$; $j = 1,2,3,\dots,3n$) are the coefficients of the pre-trained integrated prediction model, i.e., the final prediction model;
$X = [x_1\ x_2\ x_3\ \dots\ x_{3n-2}\ x_{3n-1}\ x_{3n}]$ denotes the discharge endpoint probabilities predicted by the $n$ base classifiers, where $x_{3k-2}$ is the probability, predicted by the $k$-th ($k = 1,2,\dots,n$) base classifier, that the discharge endpoint is death; $x_{3k-1}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is continuing professional rehabilitation and nursing care; and $x_{3k}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is returning home. Specifically, $x_1, x_2, x_3$ are the probabilities of death, continuing professional rehabilitation and nursing care, and returning home predicted by the 1st optimal algorithm combination; $x_4, x_5, x_6$ are the same three probabilities predicted by the 2nd optimal algorithm combination; and $x_7, x_8, x_9$ are the same three probabilities predicted by the 3rd optimal algorithm combination;
$y = 1 \mid X;\theta$ denotes that, given as input the vector $X$ of discharge endpoint probabilities predicted by the $n$ base classifiers, the algorithm predicts the patient's discharge endpoint to be class 1; $y = 2$ denotes that the algorithm predicts discharge endpoint class 2, continuing professional rehabilitation and nursing care; $y = 3$ denotes that the algorithm predicts discharge endpoint class 3, returning home;
$T$ denotes the transpose of a vector;
Figure BDA0003014109970000165
denotes the transpose of the column vector $\theta_1$ multiplied by the vector $X$, i.e., the inner product of $\theta_1$ and $X$;
Figure BDA0003014109970000166
denotes the transpose of the column vector $\theta_j$ into a row vector; $\theta$ holds the coefficients of the algorithm, whose specific values are obtained through training.
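Formula 1 itself appears here only as a figure reference. The definitions around it (one coefficient vector $\theta_j$ per class, the inner products $\theta_j^{T} X$, and three class probabilities) are consistent with a softmax (multinomial logistic) combiner. Assuming that form, which is an inference and not a transcription of the figure, formula 1 could be written as:

```latex
P_\theta(X) =
\begin{bmatrix}
p(y=1 \mid X;\theta) \\
p(y=2 \mid X;\theta) \\
p(y=3 \mid X;\theta)
\end{bmatrix}
= \frac{1}{\sum_{j=1}^{3} \exp\!\left(\theta_j^{T} X\right)}
\begin{bmatrix}
\exp\!\left(\theta_1^{T} X\right) \\
\exp\!\left(\theta_2^{T} X\right) \\
\exp\!\left(\theta_3^{T} X\right)
\end{bmatrix}
```

Under this reading the three outputs are non-negative and sum to 1, as class probabilities must.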
Taking 3 algorithm combinations as an example, formula 1 gives the probability of continuing professional rehabilitation and nursing care as follows:
Figure BDA0003014109970000171
The recording medium of the invention predicts the clinical outcome at discharge for severe spinal cord injury: death, continuing professional rehabilitation and nursing care, or returning home.
The final prediction model is built from the 3 optimal algorithm combinations. Each combination predicts 3 probability values (death, continuing professional rehabilitation and nursing care, returning home); the nine probability values are input into formula 1, which integrates the predictions of the three combinations and gives the final predicted probability values (death, continuing professional rehabilitation and nursing care, returning home).
The invention thus indicates whether a patient can recuperate at home or should continue professional rehabilitation and nursing care, so that further medical treatment can be arranged accordingly.
The following example illustrates the case of continuing professional rehabilitation and nursing care:
the 3 optimal algorithm combinations are: embedded tree + gradient boosting decision tree (GBDT), embedded tree + extreme gradient boosting (XGBoost), and embedded LSVC + extreme gradient boosting (XGBoost).
The feature values of a certain patient A are as follows.
Length of stay: 12.6 days;
Glasgow Coma Scale (GCS) total score: 15;
age: 70 years;
fraction of inspired oxygen FiO2: 0.5;
albumin: 4.1 g/dL;
red blood cell distribution width RDW: 0.1;
blood urea nitrogen: 27.2 mg/dL;
blood sugar: 10 mg/dL;
heart rate: 58;
total number of diagnoses: 2;
Algorithm combination 1 predicts the probabilities that the patient's discharge endpoint is death, continuing professional rehabilitation and nursing care, and returning home as 0.11, 0.75 and 0.14, respectively; algorithm combination 2 predicts them as 0.09, 0.79 and 0.12; and algorithm combination 3 predicts them as 0.18, 0.61 and 0.21. That is, for this patient $X = [0.11, 0.75, 0.14, 0.09, 0.79, 0.12, 0.18, 0.61, 0.21]$. The coefficients of the second integrated prediction model are $\theta_1 = [5.6, 1.9, 0.4, 4.2, 1.5, 1.1, 4.6, 1.4, 1.2]$, $\theta_2 = [0.7, 3.9, 1.8, 0.8, 3.7, 1.9, 0.6, 3.6, 1.6]$ and $\theta_3 = [0.7, 2.3, 4.1, 0.9, 2.2, 3.4, 1.1, 1.9, 3.9]$.
Substituting the above values into formula 1 gives the probability that the patient's discharge endpoint is death:
Figure BDA0003014109970000181
the probability that the patient's discharge endpoint is continuing professional rehabilitation and nursing care is:
Figure BDA0003014109970000182
the probability that the patient's discharge endpoint is returning home is:
Figure BDA0003014109970000183
Figure BDA0003014109970000191
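The worked formulas for this example are available only as figure references, so the sketch below recomputes the example under an explicit assumption: that formula 1 is a softmax over the inner products $\theta_j^{T} X$. The vector X and the coefficients are the values given in the text; the softmax form, the function name and the class labels are assumptions.

```python
import math

# Nine base-classifier probabilities for the example patient (from the text).
X = [0.11, 0.75, 0.14, 0.09, 0.79, 0.12, 0.18, 0.61, 0.21]

# Coefficients of the second integrated model (from the text).
THETA = [
    [5.6, 1.9, 0.4, 4.2, 1.5, 1.1, 4.6, 1.4, 1.2],  # theta_1: death
    [0.7, 3.9, 1.8, 0.8, 3.7, 1.9, 0.6, 3.6, 1.6],  # theta_2: rehabilitation
    [0.7, 2.3, 4.1, 0.9, 2.2, 3.4, 1.1, 1.9, 3.9],  # theta_3: home
]

ENDPOINTS = ["death", "continued rehabilitation and nursing care", "home"]

def softmax_probabilities(theta, x):
    """ASSUMED form of formula 1: softmax over the inner products theta_j . x."""
    logits = [sum(t * v for t, v in zip(row, x)) for row in theta]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax_probabilities(THETA, X)
predicted = ENDPOINTS[probabilities.index(max(probabilities))]
print(predicted, [round(p, 3) for p in probabilities])
```

Under this assumption the largest probability falls on the second class, in line with the example being framed around continued rehabilitation and nursing care.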

Claims (7)

1. A prediction model system for prognosis of severe spinal cord injury, comprising:
establishing a clinical feature database of patient cases of spinal cord injury;
constructing a prediction model of severe spinal cord injury prognosis: extracting clinical features from the clinical feature database and handling missing data by different imputation methods according to the type of the extracted clinical features, wherein continuous variable features are imputed by predictive mean matching, binary variable features are imputed by logistic regression, and multi-class variable features are imputed by multinomial regression; randomly dividing the different features finally obtained into a training data set, a validation data set and a test data set in a reasonable proportion; building algorithm combination models from feature selection methods and machine learning classification algorithms, wherein the feature selection methods are used to screen clinical features with significant predictive value and the selected clinical features are used to train the machine learning classification algorithms; selecting, from the algorithm combination models, the algorithm combination models with the best micro-averaged area under the curve (AUC) for predicting the patient discharge endpoint, and constructing a final prediction model using the stacking ensemble method, wherein the patient discharge endpoint has three classes: home rest, continuing professional rehabilitation and nursing care, and death;
inputting the probability values of death, continuing professional rehabilitation and nursing care, and home rest at the patient's discharge endpoint, as predicted by the algorithm combination models, into formula 1 to give a final predicted probability value,
said formula 1
Figure FDA0003014109960000011
In the above-mentioned formula,
$P_\theta(X)$ denotes the discharge endpoint class probabilities, where $p(y=1 \mid X;\theta)$ is the probability of death, $p(y=2 \mid X;\theta)$ is the probability of continuing professional rehabilitation and nursing care, and $p(y=3 \mid X;\theta)$ is the probability of home rest;
Figure FDA0003014109960000012
(where $j = 1,2,3$) denotes the coefficients, where $n$ denotes the number of base classifiers;
Figure FDA0003014109960000013
Figure FDA0003014109960000014
Figure FDA0003014109960000015
where $\theta_{i,j}$ ($i = 1,2,3$; $j = 1,2,3,\dots,3n$) are the coefficients of the pre-trained integrated prediction model;
$X = [x_1\ x_2\ x_3\ \dots\ x_{3n-2}\ x_{3n-1}\ x_{3n}]$ denotes the discharge endpoint probabilities predicted by the $n$ base classifiers, where $x_{3k-2}$ is the probability, predicted by the $k$-th ($k = 1,2,\dots,n$) base classifier, that the discharge endpoint is death; $x_{3k-1}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is continuing professional rehabilitation and nursing care; and $x_{3k}$ is the probability, predicted by the $k$-th base classifier, that the discharge endpoint is home rest;
in $y = 1 \mid X;\theta$, $X$ represents the entered patient features; $y = 1$ means that the algorithm predicts discharge endpoint class 1, death; $y = 2$ means that the algorithm predicts discharge endpoint class 2, continuing professional rehabilitation and nursing care; $y = 3$ means that the algorithm predicts discharge endpoint class 3, home rest;
$T$ denotes the transpose of a vector;
Figure FDA0003014109960000021
denotes the transpose of the column vector $\theta_1$ multiplied by the vector $X$;
Figure FDA0003014109960000022
denotes the transpose of the column vector $\theta_j$ into a row vector; $\theta$ holds the coefficients of the algorithm, whose specific values are obtained through training; Λ denotes an ellipsis.
2. The severe spinal cord injury prognosis prediction model system as claimed in claim 1, wherein an AUC matrix of the validation data set is established, the ordinate of the AUC matrix being the feature selection methods and the abscissa being the machine learning classification algorithms, so that a number of algorithm combination models of feature selection method and machine learning classification algorithm are constructed; according to the predictive performance of these algorithm combination models on the validation data set, the three algorithm combinations with the largest micro-averaged area under the curve (AUC) are selected and integrated by the stacking method to obtain the final prediction model.
3. The severe spinal cord injury prognosis prediction model system of claim 1, wherein 7 feature selection methods are used to screen clinical features with significant predictive value, the 7 feature selection methods being maximal information coefficient MIC, embedded random forest RF, recursive feature elimination RFE, embedded linear support vector classifier (embedded LSVC), embedded logistic regression (embedded LR), embedded tree, and minimum redundancy-maximum relevance mRMR; and there are 15 machine learning classification algorithms, the 15 machine learning classification algorithms being logistic regression, linear discriminant analysis LDA, support vector machine SVM, K nearest neighbors KNN, Gaussian naive Bayes NB, decision tree, extra trees, random forest, bagging, adaptive boosting AdaBoost, gradient boosting decision tree GBDT, extreme gradient boosting XGBoost, light gradient boosting machine LightGBM, multilayer perceptron MLP and deep neural network DNN.
4. The system of claim 1, wherein the clinical features are: demographic information, including race, gender, age, body mass index, admission type, ICU type, source of admission, ICU duration, and length of stay after discharge; vital signs, including respiratory rate, heart rate, systolic and diastolic blood pressure, and mean arterial pressure; laboratory data, including white blood cell count, red blood cell count RBC, platelet count, basophils, eosinophils, neutrophils, lymphocytes, monocytes, red blood cell distribution width RDW, hemoglobin, hematocrit, mean corpuscular hemoglobin MCH, mean corpuscular hemoglobin concentration MCHC, mean corpuscular volume MCV, prothrombin time PT, activated partial thromboplastin time APTT, international normalized ratio INR, fraction of inspired oxygen FiO2, partial pressure of oxygen PaO2, partial pressure of carbon dioxide PaCO2, pH, bicarbonate, lactate, base excess BE, anion gap, potassium, sodium, calcium, magnesium, chloride, phosphate, blood urea nitrogen BUN, creatinine, albumin, blood glucose, and the like; and drug use and treatment conditions, including mechanical ventilation, morphine sulfate, cefazolin, potassium chloride KCl, glucocorticoids, dopamine, dobutamine, epinephrine and norepinephrine;
clinical features whose missing cases account for 50% or more of the total cases are deleted directly, including red blood cell distribution width RDW, partial pressure of oxygen PaO2, ethnicity, mean corpuscular volume MCV, lactate and morphine sulfate; age, body mass index, white blood cell count, red blood cell count RBC, platelet count, basophils, eosinophils, neutrophils, lymphocytes, monocytes, red blood cell distribution width RDW, hemoglobin, hematocrit, mean corpuscular hemoglobin MCH, mean corpuscular hemoglobin concentration MCHC, mean corpuscular volume MCV, prothrombin time PT, activated partial thromboplastin time APTT, international normalized ratio INR, partial pressure of oxygen PaO2, partial pressure of carbon dioxide PaCO2, pH, bicarbonate, lactate, base excess BE, anion gap, potassium, sodium, calcium, magnesium, chloride, phosphate, blood urea nitrogen BUN, creatinine, albumin, blood glucose, respiratory rate, heart rate, systolic blood pressure, diastolic blood pressure, mean arterial pressure, ICU duration, length of stay after leaving the ICU, and fraction of inspired oxygen FiO2 are continuous variable features; mechanical ventilation, morphine sulfate, cefazolin, potassium chloride KCl, glucocorticoids, dopamine, dobutamine, epinephrine and norepinephrine are binary variable features; race, gender, ICU type and source of admission are converted into dummy variables; and a total of 55 different features are finally obtained.
5. The system of claim 1, wherein the optimal three algorithm combinations are embedded tree with GBDT, embedded tree with XGBoost, and embedded LSVC with XGBoost.
6. A computer-readable recording medium in which a program for running prognosis prediction of severe spinal cord injury is recorded, the program causing a computer to: predict, with the final prediction model of claim 1, the probability values that the patient's discharge endpoint is death, continuing professional rehabilitation and nursing care, or home rest; input the probability values into formula 1; and give a final predicted probability value,
said formula 1
Figure FDA0003014109960000041
In the above formula,
P_θ(X) represents the discharge endpoint category probabilities, where p(y = 1 | X; θ) represents the probability of death, p(y = 2 | X; θ) represents the probability of continued professional rehabilitation therapy, and p(y = 3 | X; θ) represents the probability of home rest;
Figure FDA0003014109960000042
(where j = 1, 2, 3) denotes the coefficients, where n denotes the number of base classifiers;
Figure FDA0003014109960000043
Figure FDA0003014109960000044
Figure FDA0003014109960000045
wherein θ_{i,j} (i = 1, 2, 3; j = 1, 2, 3, ..., 3n) are the coefficients of the pre-trained integrated predictive model;
X = [x_1 x_2 x_3 … x_{3n-2} x_{3n-1} x_{3n}] represents the discharge endpoint probabilities predicted by the n base classifiers, where x_{3k-2} represents the probability, predicted by the k-th (k = 1, 2, ..., n) base classifier, that the discharge endpoint is death; x_{3k-1} represents the probability, predicted by the k-th base classifier, that the discharge endpoint is continued professional rehabilitation therapy; and x_{3k} represents the probability, predicted by the k-th base classifier, that the discharge endpoint is home rest;
in y = 1 | X; θ, the term after the bar represents the input patient characteristics; y = 1 means that the algorithm predicts the patient's discharge endpoint to be category 1, death; y = 2 means that the algorithm predicts the patient's discharge endpoint to be category 2, continued professional rehabilitation therapy; y = 3 means that the algorithm predicts the patient's discharge endpoint to be category 3, home rest;
T represents the transpose of a vector;
θ_1^T X represents the transpose of the column vector θ_1 multiplied by the vector X;
θ_j^T represents the column vector θ_j transposed into a row vector; θ denotes the coefficients of the algorithm, whose specific values are obtained through training; Λ represents an ellipsis.
7. The computer-readable recording medium according to claim 1, wherein the three optimal algorithm combinations are embedded tree with gradient boosting decision tree (GBDT), embedded tree with extreme gradient boosting (XGBoosting), and embedded LSVC with extreme gradient boosting (XGBoosting).
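The combination step that claim 6 calls formula 1 can be illustrated with a short sketch. Formula 1 itself survives above only as an image placeholder, so the sketch assumes it has the standard softmax (multinomial logistic) form, p(y = j | X; θ) = exp(θ_j^T X) / Σ_{l=1..3} exp(θ_l^T X), which is consistent with the definitions of θ_j^T X, the three discharge endpoint categories, and the 3n-element input X. The function name `discharge_endpoint_probabilities` and all numeric values are hypothetical; real coefficients would come from training.

```python
import math


def discharge_endpoint_probabilities(x, theta):
    """Combine base-classifier outputs into three endpoint probabilities.

    x     : list of 3n probabilities; for base classifier k (1-based),
            x[3k-3], x[3k-2], x[3k-1] are its predicted probabilities of
            death, continued rehabilitation therapy, and home rest.
    theta : three coefficient vectors (theta_1, theta_2, theta_3), each of length 3n.

    Assumption: formula 1 is a softmax over the scores theta_j^T x.
    """
    if len(theta) != 3 or any(len(t) != len(x) for t in theta):
        raise ValueError("theta must hold three vectors with the same length as x")
    scores = [sum(c * v for c, v in zip(t, x)) for t in theta]  # theta_j^T x
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example with n = 2 base classifiers (so x has 6 entries).
x = [0.1, 0.6, 0.3,   # classifier 1: death, rehabilitation, home rest
     0.2, 0.5, 0.3]   # classifier 2: death, rehabilitation, home rest
theta = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # theta_1 weights the death outputs
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],  # theta_2 weights the rehabilitation outputs
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],  # theta_3 weights the home-rest outputs
]

p = discharge_endpoint_probabilities(x, theta)
predicted_category = p.index(max(p)) + 1  # 1 = death, 2 = rehabilitation, 3 = home rest
print(p, predicted_category)
```

The three outputs always sum to 1, so they can be read directly as the final predicted probabilities of the three discharge endpoints.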
CN202110383930.XA 2021-04-09 2021-04-09 Prediction model system and storage medium for severe spinal cord injury prognosis Active CN112992368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110383930.XA CN112992368B (en) 2021-04-09 2021-04-09 Prediction model system and storage medium for severe spinal cord injury prognosis

Publications (2)

Publication Number Publication Date
CN112992368A (en) 2021-06-18
CN112992368B (en) 2023-06-20

Family

ID=76339663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110383930.XA Active CN112992368B (en) 2021-04-09 2021-04-09 Prediction model system and storage medium for severe spinal cord injury prognosis

Country Status (1)

Country Link
CN (1) CN112992368B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109119167A (en) * 2018-07-11 2019-01-01 山东师范大学 Pyemia anticipated mortality system based on integrated model
CN110051324A (en) * 2019-03-14 2019-07-26 深圳大学 A kind of acute respiratory distress syndrome anticipated mortality method and system
CN111261282A (en) * 2020-01-21 2020-06-09 南京航空航天大学 Sepsis early prediction method based on machine learning
CN112185549A (en) * 2020-09-29 2021-01-05 郑州轻工业大学 Esophageal squamous carcinoma risk prediction method based on clinical phenotype and logistic regression analysis
US20210020312A1 (en) * 2019-07-17 2021-01-21 Regents Of The University Of Minnesota Efficient and lightweight patient-mortality-prediction system with modeling and reporting at time of admission

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420300A (en) * 2022-01-20 2022-04-29 北京大学第六医院 Chinese old cognitive impairment prediction model
CN114420300B (en) * 2022-01-20 2023-08-04 北京大学第六医院 Chinese senile cognitive impairment prediction model
CN115662613A (en) * 2022-09-28 2023-01-31 中日友好医院(中日友好临床医学研究所) Barotrauma prediction method and device
CN116151470A (en) * 2023-03-06 2023-05-23 联宝(合肥)电子科技有限公司 Product quality prediction method, device, equipment and storage medium
CN117727448A (en) * 2024-02-06 2024-03-19 四川省医学科学院·四川省人民医院 Medical conjuncted-based intelligent decision control system for pressure injury
CN117727448B (en) * 2024-02-06 2024-04-19 四川省医学科学院·四川省人民医院 Medical conjuncted-based intelligent decision control system for pressure injury

Also Published As

Publication number Publication date
CN112992368B (en) 2023-06-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant