CN113470837A - Infection screening method based on combination of decision tree model and logistic regression model - Google Patents

Publication number
CN113470837A
Authority
CN
China
Prior art keywords
model
logistic regression
regression model
combination
decision tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111019378.2A
Other languages
Chinese (zh)
Inventor
商春恒
王云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Microelectronics of CAS
Guangdong Greater Bay Area Institute of Integrated Circuit and System
Original Assignee
Institute of Microelectronics of CAS
Guangdong Greater Bay Area Institute of Integrated Circuit and System
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Microelectronics of CAS, Guangdong Greater Bay Area Institute of Integrated Circuit and System filed Critical Institute of Microelectronics of CAS
Priority to CN202111019378.2A priority Critical patent/CN113470837A/en
Publication of CN113470837A publication Critical patent/CN113470837A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Optimization (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • Epidemiology (AREA)
  • Computational Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Operations Research (AREA)
  • Computing Systems (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses an infection screening method based on a combination of a decision tree model and a logistic regression model. The method is convenient to operate and can improve the accuracy of infection screening. It is implemented with a vital sign monitor that is in communication connection with a remote data service platform, and the remote data service platform carries out infection screening according to clinical data. The method comprises the following steps: detecting and acquiring clinical data of a user through the vital sign monitor; randomly dividing the clinical data into a training set and a test set, and equally dividing the training set into two parts, training set A and training set B; constructing a decision tree model based on training set A and, at the same time, carrying out feature selection on training set A to obtain key features; taking the key feature vectors of training set B as input of the constructed decision tree model to obtain newly constructed feature vectors; merging the newly constructed feature vectors with the key feature vectors into combined feature vectors; constructing a logistic regression model based on the combined feature vectors; and carrying out prediction classification on the test set based on the combination of the decision tree model and the logistic regression model to obtain a classification result.

Description

Infection screening method based on combination of decision tree model and logistic regression model
Technical Field
The invention relates to the technical field of data analysis, in particular to an infection screening method based on a combination of a decision tree model and a logistic regression model.
Background
At present, hospitals mainly screen for infectious diseases using computed tomography (CT), clinical characteristics and body temperature detection as diagnostic methods. These methods are limited by medical technology and region, and still suffer from delayed acquisition of detection results, poor detection accuracy and high infection risk.
For example, during the current large-scale epidemic of the novel coronavirus, effective ways to control the spread of the disease are large-scale screening, isolation and treatment of patients, and symptom monitoring. Most existing detection is based on RT-PCR (reverse transcription polymerase chain reaction), but during the peak of the COVID-19 outbreak RT-PCR kits were in serious short supply, and hospitals used CT, clinical characteristics and body temperature detection as alternative diagnostic methods. These methods must be operated by professional medical personnel and involve complicated steps, and the assessment of clinical characteristics is influenced by the experience and subjective judgment of the medical personnel, so detection accuracy is poor.
Disclosure of Invention
To address the delayed acquisition of detection results and the poor detection accuracy that arise when CT, clinical characteristics and body temperature detection are used as diagnostic methods in the prior art, the invention provides an infection screening method based on the combination of a decision tree model and a logistic regression model, which is convenient to operate and can improve the accuracy of infection screening.
In order to achieve the purpose, the invention adopts the following technical scheme:
An infection screening method based on a combination of a decision tree model and a logistic regression model is implemented with a vital sign monitor. The vital sign monitor is used for detecting clinical data of a user and is in communication connection with a remote data service platform through a communication module, and the remote data service platform is used for carrying out infection screening according to the clinical data. The infection screening method is characterized by comprising the following steps:
S1, detecting and acquiring clinical data of the user through the vital sign monitor;
S2, randomly dividing the clinical data into a training set and a test set, and equally dividing the training set into two parts: training set A and training set B;
S3, training the XGBoost model based on training set A to construct the XGBoost model, and at the same time performing feature selection on training set A to select key features;
S4, selecting the key feature vectors of the corresponding key features in training set B, taking the key feature vectors as the input of the constructed XGBoost model, and performing OneHot coding on the leaf-node output of the XGBoost model to obtain newly constructed feature vectors;
S5, merging the newly constructed feature vector and the key feature vector to obtain a combined feature vector;
S6, training a Logistic regression model based on the combined feature vector to construct the Logistic regression model;
S7, performing prediction classification on the test set based on the combination of the XGBoost model and the Logistic regression model to obtain a classification result.
It is further characterized in that the method further comprises the steps of,
in step S1, the users include healthy people and patients, the clinical data are feature vectors related to the disease condition, and the clinical data include: mean respiratory rate, median respiratory rate, maximum respiratory rate, minimum respiratory rate, mean heart rate, median heart rate, maximum heart rate, minimum heart rate, percentage of wakefulness, percentage of REM sleep, percentage of light sleep, percentage of deep sleep, sleep latency, sleep duration, sleep efficiency, sleep score, body movement density, body movement minute ratio, number of awakenings, number of turn-overs, number of apneas during sleep, apnea-hypopnea index, number of apneas during REM sleep, number of apneas during light sleep, number of apneas during deep sleep;
in step S2, randomly extracting 75% of the clinical data as the training set, and the remaining 25% as the test set;
in steps S3 and S4, the calculation method of the decision tree model includes:
suppose that $f_t(x_i)$ is the output of the $t$-th tree, $\hat{y}_i^{(t)}$ is the current output of the model, and $y_i$ is the actual result; then
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i), \quad t = 1, \dots, T,$$
where $T$ is the total number of decision trees and $t$ denotes the $t$-th iteration; that is, in each iteration an optimal model is found and added to the existing model so that the predicted value approaches the true value;
in step S3, the feature selection is implemented based on the SHAP values of the XGBoost model: the influence of each item of clinical data in training set A on the result is described by its SHAP value, and the key features are obtained according to this influence;
the key features include: number of apneas during REM sleep, mean heart rate, sleep duration, sleep latency, median heart rate, percentage of light sleep, apnea-hypopnea index;
in step S6, the Logistic regression model is calculated as follows:
the Logistic regression model is a binary classification model. On the basis of linear regression, it maps the value of the linear function to the interval (0, 1) through a sigmoid function, and the result is taken as the class probability. y is a binary dependent variable indicating whether the subject is a patient: y = 0 represents healthy and y = 1 represents patient; x = {x1, x2, x3, …, xp} represents the corresponding p-dimensional explanatory variables. The probability $P(y=1 \mid x)$ represents the probability that y belongs to class 1 given the combined feature vector x. Let $z = w^{T}x$; then $P(y=1 \mid x)$ can be expressed as:
$$P(y=1 \mid x) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-w^{T}x}}$$
The above formula is the logistic regression model, where $w = (w_1, w_2, \dots, w_p)$ represents the coefficients corresponding to each feature. The parameters $w$ can be determined using a parameter estimation method such as gradient descent. Here $e$ is the base of the natural logarithm and $T$ denotes the transpose of $w$.
Step S7 further includes: S71, model verification, in which the constructed combination of the XGBoost model and the Logistic regression model is verified by ten-fold cross-validation and the hyper-parameters of the corresponding model are determined; S72, selecting five existing models, namely the XGBoost model, the Logistic regression model, the KNN model (k-nearest neighbors algorithm), the SVM model (support vector machine) and the RF model (random forest), and verifying the five models by ten-fold cross-validation to determine the hyper-parameters of the corresponding models;
in step S71, the ten-fold cross-validation method divides the training set into ten parts, takes each part in turn as the validation set and the remaining nine parts as the training set, and verifies the model; each verification yields a corresponding accuracy (or error rate), the average of the ten accuracies (or error rates) is used as the estimate of the algorithm's performance, and the hyper-parameters are determined according to this estimate;
in step S71, the hyper-parameters of the combination of the XGBoost model and the Logistic regression model include: learning rate (learning_rate), maximum number of iterations (n_estimators), maximum depth (max_depth), gamma value;
in step S72, the hyper-parameters of the XGBoost model include: learning rate (learning_rate), maximum number of iterations (n_estimators), maximum depth (max_depth), gamma value;
the hyper-parameters of the Logistic regression model include: C (i.e., the inverse of the regularization coefficient λ), regularization term (i.e., penalty);
the hyper-parameters of the KNN model include: the number of neighboring points (i.e., k_neighbors), the initial value (P);
the hyper-parameters of the SVM model include: kernel function (kernel), C (inverse of the regularization coefficient λ), gamma value;
the hyper-parameters of the RF model include: maximum number of iterations (i.e., n_estimators), maximum depth (i.e., max_depth);
the method further comprises step S8, classifying the test set based on the five existing models;
step S9, evaluating the classification results of the combination of the XGBoost model and the Logistic regression model and of the five existing models, wherein the evaluation indexes include: prediction accuracy, recall, and the area under the ROC curve (AUC, i.e., Area Under Curve).
By adopting the method of the invention, the following beneficial effects can be achieved: the XGBoost model is first used to split on the features in the clinical data, yielding the features and thresholds that work best at each node; at the same time, feature selection is performed on the clinical data to select the features that carry the greatest risk. Feature selection helps to speed up model construction, enhance the generalization ability of the model and reduce over-fitting. A Logistic regression model is then constructed from the best-performing features (the newly constructed feature vectors) and the higher-risk key feature vectors, and the test set is classified with this Logistic regression model, which greatly improves the classification accuracy and thus the accuracy of infection screening.
Drawings
FIG. 1 is a flow chart of the infection screening method of the present invention;
FIG. 2 is a schematic diagram of the structural features of the XGBoost model according to the present invention;
FIG. 3a is a graphical representation in which the invention uses SHAP values to describe the feature importance of clinical data;
FIG. 3b is a bar graph depicting feature importance using the average of the absolute values of SHAP values in accordance with the present invention;
FIG. 4a is a schematic diagram of the confusion matrices of the Logistic Regression model, the SVM model and the XGBoost model according to the present invention;
FIG. 4b is a schematic diagram of the confusion matrices of the KNN model, the RF model and the XGBoost + LR model.
Detailed Description
An infection screening method based on a combination of a decision tree model and a logistic regression model is implemented with a vital sign monitor (an existing non-contact vital sign monitor, model YOLI-RD 200-G-WH). The vital sign monitor, which comprises a biological radar, is in communication connection with a remote data service platform through a communication module, and together they form a non-contact vital sign monitoring system. The non-contact vital sign monitor was used to monitor the vital signs of patients (in this embodiment, COVID-19 patients in a hospital in Wuhan). 140 radar monitoring records (i.e., clinical data) from 23 patients were collected and analyzed together with 144 sleep monitoring records from healthy people. The patients were monitored in a centralized monitoring ward, and the healthy people in their own homes. Basic information for all subjects, such as age, sex, underlying diseases and medication, was tabulated; this information was widely distributed, with no clustering on any particular feature, so these other factors were regarded as minor interference and the controlled comparison was regarded as valid.
A doctor can collect a patient's data simply by placing the vital sign monitor at the head of the patient's bed. This collection method does not interfere with the patient's daily activities and is easy for medical personnel to operate. Medical personnel can obtain the clinical data through the remote data service platform, which reduces the need for operations such as CT; detection is therefore convenient, contact is reduced, and the risk of infection is lowered.
The non-contact vital sign monitoring system consists of the non-contact vital sign monitor and the remote data service platform. The monitor transmits a radar signal, filters the radar echo signal to separate the heartbeat, respiration and body movement signals, and extracts data such as respiratory rate, heart rate, body movement and bed-exit events. The monitoring data are uploaded to the data platform in real time over Wi-Fi. After the patient's night of sleep, the system (the remote data service platform) performs sleep analysis and apnea analysis on the night's data and produces a sleep monitoring report. The output of the sleep monitoring report is the clinical data, 25 items in total, which comprehensively reflect the patient's night-time respiration, heartbeat, sleep structure, sleep quality, body movement, apnea and so on. These 25 items are used in the infection screening method described below.
Referring to FIG. 1, an infection screening method based on a combination of a decision tree model and a logistic regression model includes: S1, detecting and acquiring clinical data of the user through the vital sign monitor. The users include healthy people and patients, and the clinical data comprise 25 features: mean respiratory rate (meanRR), median respiratory rate (medRR), maximum respiratory rate (maxRR), minimum respiratory rate (minRR), mean heart rate (meanHR), median heart rate (medHR), maximum heart rate (maxHR), minimum heart rate (minHR), percentage of wakefulness (awakPrct), percentage of REM sleep (remspprct), percentage of light sleep (lightSPrct), percentage of deep sleep (delesprct), sleep latency (latnMin), sleep duration (slepMin), sleep efficiency (sleffective), sleep score (slepScore), body movement density (meand), body movement minute ratio (movMinPrct), number of awakenings (awittims), number of turn-overs (turnover), number of apneas during sleep (epottims), apnea-hypopnea index (rei), number of apneas during REM sleep (REM), number of apneas during light sleep (sataliments), number of apneas during deep sleep (ahemims).
S2, shuffling the data of the patients and the healthy people, randomly extracting 75% of the data in proportion to the labels as the training set and using the remaining 25% as the test set. The training set is used for training and constructing the model combination, and the test set is used for testing the screening performance of the model. The training set is divided into two parts: training set A and training set B.
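As an illustration only, the label-proportional 75/25 split of step S2 and the division of the training set into A and B might be sketched as follows. The function names, the fixed random seed and the use of plain index lists are assumptions made for this sketch; the record counts (140 patient records, 144 healthy records) are those of the embodiment.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx) keeping the label proportions in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    rng.shuffle(train_idx)                        # mix the classes again
    rng.shuffle(test_idx)
    return train_idx, test_idx

def halve(indices):
    """Split an index list into two halves (training set A and training set B)."""
    mid = len(indices) // 2
    return indices[:mid], indices[mid:]

# 140 patient records (label 1) and 144 healthy records (label 0).
labels = [1] * 140 + [0] * 144
train_idx, test_idx = stratified_split(labels)
set_a, set_b = halve(train_idx)
```

With these counts the training set holds 213 records, so the two halves necessarily differ in size by one record.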
S3, training an XGBoost model (the decision tree model) based on training set A to construct the XGBoost model, and at the same time performing feature selection on training set A to select key features;
Feature selection helps to speed up model construction, enhance the generalization ability of the model and reduce over-fitting. A good global feature importance measure needs to be consistent and accurate, and the present application uses SHAP values to describe and evaluate the importance of features. SHAP values allow an overall visualization of the features, and FIG. 3a plots the effect of twenty features in the clinical data on each sample. Each row represents a feature; the abscissa is the SHAP value (the influence on the model output), the ordinate is the feature, and the middle vertical line marks zero risk. The farther a row's points lie from the center, the greater the associated risk, i.e., the more strongly the feature affects the disease prediction. For example, the figure shows that the value of the feature "rematims" increases the patient's risk; the value of "meanHR" also increases the risk, but its importance is lower than that of "rematims". The figure also shows that the influence of "rematims", "meanHR", "slepMin", "latnMin", "medHR", "rightspt", "AHI", "maxRR", "maxHR", "meanNMD", "movmjnct", "minRR", "slotmartim", "satprct" and "waktims" on disease risk decreases in that order.
FIG. 3b takes the mean of the absolute SHAP values of each feature, mean(|Tree SHAP|), as the importance of that feature. The abscissa in FIG. 3b is mean(|Tree SHAP|) (the average influence on the model output) and the ordinate is the feature, giving a standard bar graph in which the features are sorted by mean(|Tree SHAP|). It can be seen that the number of apneas during REM sleep is the strongest factor for distinguishing patients.
According to the importance degree of the feature value, the first 7 features are selected as key features to train the XGboost model and the Logistic regression model, and the key features comprise: REM phase apnea number, heart rate mean, sleep duration, sleep latency, heart rate median, percentage of shallow sleep, apnea hypopnea index.
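The ranking by mean absolute SHAP value and the selection of the top features can be sketched as follows. This is a minimal illustration, assuming the per-sample SHAP values have already been computed by some explainer and are available as a list of rows; the numbers below are made up for the example and are not results from the patent, and the function name is hypothetical.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean(|SHAP|) over all samples, most important first."""
    n_samples = len(shap_rows)
    scores = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_rows) / n_samples
        scores.append((name, mean_abs))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up SHAP values for three samples and four features (illustration only).
feature_names = ["meanHR", "slepMin", "latnMin", "medHR"]
shap_rows = [
    [0.40, -0.10, 0.05, 0.00],
    [-0.30, 0.20, -0.05, 0.10],
    [0.50, -0.30, 0.05, -0.10],
]
ranking = rank_by_mean_abs_shap(shap_rows, feature_names)
top_k = [name for name, _ in ranking[:2]]  # the embodiment keeps the first 7
```

Taking the absolute value before averaging matters: a feature that pushes some samples strongly toward "patient" and others strongly toward "healthy" would otherwise average out to near zero.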
S4, selecting the key feature vectors of the corresponding key features in training set B, taking the key feature vectors as the input of the constructed XGBoost model, and performing OneHot coding on the leaf-node output of the XGBoost model to obtain newly constructed feature vectors;
in steps S3 and S4, the XGBoost model is an ensemble learning algorithm based on gradient boosting that uses decision trees as base learners. The algorithm keeps adding trees, each grown by repeated feature splitting. Each added tree in effect learns a new function that fits the residual of the previous prediction. Within each tree a sample falls into one leaf node, each leaf node corresponds to a score, and the scores from all trees are added to obtain the predicted value of the sample. The specific calculation process of the XGBoost model is as follows:
Suppose that $f_t(x_i)$ is the output of the $t$-th tree, $\hat{y}_i^{(t)}$ is the current output of the model, and $y_i$ is the actual result; then
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i), \quad t = 1, \dots, T,$$
where $T$ is the total number of decision trees and $t$ denotes the $t$-th iteration; that is, in each iteration an optimal model (a new function) is found and added to the existing model so that the predicted value moves closer to the true value. The optimal model is constructed by minimizing a loss function $L$ (Loss Function). When the training data set is small the model overfits easily, so a regularization term is generally added to reduce the complexity of the model. The regularized loss is calculated as:
$$\min_{f \in F} \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + \lambda J(f)$$
where $F$ is the hypothesis space, $N$ is the number of training samples, $J(f)$ controls the model complexity, $\lambda$ is the regularization coefficient, and $\min_{f \in F}$ means finding, over the whole hypothesis space, the parameters that minimize the expression.
Thus, the objective function is:
$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)$$
where $\sum_{i} l(y_i, \hat{y}_i)$ is the training error and $\sum_{t} \Omega(f_t)$ is the penalty on model complexity (the sum of the complexities of all trees). This term includes two parts, the number of leaf nodes and the values of the leaf nodes, and is expressed as
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$$
where $T$ is here the number of leaf nodes and $\lVert w \rVert$ is the modulus of the leaf node vector. $\gamma$ represents the difficulty of node splitting, and $\lambda$ is the regularization coefficient.
Performing a second-order Taylor expansion of the objective function:
$$Obj^{(t)} \approx \sum_{i} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$
where $g_i$ and $h_i$ are the first and second derivatives of $l$ with respect to $\hat{y}_i^{(t-1)}$.
Let $I_L$ and $I_R$ denote the sample sets of the left and right child nodes after a split, with $I = I_L \cup I_R$. For each leaf node, the nodes are split according to gain. The gain is defined as:
$$Gain = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^{2}}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^{2}}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^{2}}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$
The first two terms on the right side of the equals sign are the scores of the left and right subtrees after the split, the third term is the score of the parent node before the split, and the last term $\gamma$ is the complexity introduced by the additional leaf node (namely the difficulty of splitting the node).
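A small numerical sketch of the split gain may help. It assumes the standard XGBoost form of the gain, one half of (left score + right score - parent score) minus gamma, with each score equal to the squared gradient sum divided by the Hessian sum plus lambda. The function names and the toy gradients are hypothetical and chosen only for illustration.

```python
def node_score(grad_sum, hess_sum, lam):
    """Structure score G^2 / (H + lambda) of a node."""
    return grad_sum ** 2 / (hess_sum + lam)

def split_gain(grads, hessians, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node; left_mask[i] is True if sample i goes left."""
    g_left = sum(g for g, m in zip(grads, left_mask) if m)
    h_left = sum(h for h, m in zip(hessians, left_mask) if m)
    g_right = sum(grads) - g_left
    h_right = sum(hessians) - h_left
    return 0.5 * (
        node_score(g_left, h_left, lam)
        + node_score(g_right, h_right, lam)
        - node_score(g_left + g_right, h_left + h_right, lam)
    ) - gamma

# Toy gradients: two samples pull the prediction down, two pull it up.
grads = [-1.0, -1.0, 1.0, 1.0]
hessians = [1.0, 1.0, 1.0, 1.0]
good_gain = split_gain(grads, hessians, [True, True, False, False])
bad_gain = split_gain(grads, hessians, [True, False, True, False])
```

The first split separates the samples by gradient sign and yields a positive gain (4/3 with lambda = 1 and gamma = 0); the second mixes the signs in both children and yields a gain of zero, so it would not be chosen.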
Constructing a new feature vector in step S4 means that every leaf node of every decision tree in the XGBoost model is used as a new feature, so the number of constructed features equals the number of leaf nodes in the XGBoost model. Each feature takes the value 0 or 1: for each decision tree, if an input sample falls into a leaf node, the value for that leaf node is 1, otherwise it is 0. In FIG. 2, the XGBoost model is obtained by training on training set A and includes two decision trees, and each leaf node is a new feature. A sample falls into the first leaf node of tree1 (the first tree) and the second leaf node of tree2 (the second tree), so the newly constructed feature vector is [1,0,0,0,1]; the first three entries correspond to the leaf nodes of tree1 and the last two to those of tree2.
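The encoding in the FIG. 2 example can be reproduced with a few lines. This sketch assumes that tree1 has three leaf nodes and tree2 has two (inferred from the length and layout of the example vector) and that leaf positions are given as 0-based indices; the function name is hypothetical.

```python
def one_hot_leaves(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree; leaf indices are 0-based."""
    encoded = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1          # mark the leaf the sample fell into
        encoded.extend(block)
    return encoded

# First leaf of tree1 (index 0) and second leaf of tree2 (index 1).
new_features = one_hot_leaves(leaf_indices=[0, 1], leaves_per_tree=[3, 2])
```

Each tree contributes exactly one 1 to the vector, so the number of ones always equals the number of trees.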
And S5, merging the newly constructed feature vector and the key feature vector to obtain a combined feature vector.
S6, training the Logistic regression model based on the combined feature vector to construct the Logistic regression model;
The Logistic regression model is a binary classification model. On the basis of linear regression, it maps the value of the linear function to the interval (0, 1) through a sigmoid function, and the result is taken as the class probability. y is a binary dependent variable indicating whether the subject is a patient: y = 0 represents healthy and y = 1 represents patient; x = {x1, x2, x3, …, xp} represents the corresponding p-dimensional explanatory variables. The probability $P(y=1 \mid x)$ represents the probability that y belongs to class 1 given a feature vector x. Let $z = w^{T}x$; then $P(y=1 \mid x)$ can be expressed as:
$$P(y=1 \mid x) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-w^{T}x}}$$
The above formula is referred to as the logistic regression model, where $w = (w_1, w_2, \dots, w_p)$ represents the coefficients corresponding to each feature. The parameters $w$ can be found using parameter estimation methods such as gradient descent. Here $e$ denotes the base of the natural logarithm and $T$ denotes the transpose of $w$.
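As a minimal sketch of how the coefficients might be estimated by gradient descent, the following fits a logistic regression on a one-dimensional toy data set. It assumes the usual sigmoid of a linear function with an added bias term (the bias, the learning rate, the number of epochs and the toy data are all assumptions of this sketch, not values from the patent).

```python
import math

def sigmoid(z):
    """Map a real value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(samples, labels, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, p = len(samples), len(samples[0])
    weights, bias = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for x, y in zip(samples, labels):
            # Gradient of the log-loss w.r.t. the linear score is (prediction - label).
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j in range(p):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Toy data: larger feature values belong to class 1 (patient).
samples = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
weights, bias = fit_logistic(samples, labels)
probs = [sigmoid(bias + weights[0] * x[0]) for x in samples]
```

In the method described here the input rows would be the combined feature vectors of step S5 rather than a single raw feature.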
S7, performing prediction classification on the test set based on the combination of the XGBoost model and the Logistic regression model to obtain a classification result. The combination of the XGBoost model and the Logistic regression model consists of two parts: the XGBoost model extracts features from the training set to serve as new training input data, and the Logistic regression model serves as the classifier for the new training input data.
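How the two parts fit together at prediction time can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the two small trees, their thresholds, the Logistic regression weights and bias, and the 0.5 decision threshold are all hypothetical, and trees are represented as nested dictionaries rather than as a trained XGBoost model.

```python
import math

def leaf_index(tree, x):
    """Walk a binary tree (nested dicts) and return the index of the leaf x falls into."""
    node = tree
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

def combined_vector(trees, leaves_per_tree, x):
    """One-hot encode the leaf reached in each tree, then append the key features."""
    encoded = []
    for tree, n_leaves in zip(trees, leaves_per_tree):
        block = [0.0] * n_leaves
        block[leaf_index(tree, x)] = 1.0
        encoded.extend(block)
    return encoded + list(x)

def predict_proba(weights, bias, vector):
    """Logistic regression probability of class 1 for the combined vector."""
    z = bias + sum(w * v for w, v in zip(weights, vector))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical trees over two key features: x[0] and x[1].
tree1 = {"feature": 0, "threshold": 5.0,
         "left": {"leaf": 0},
         "right": {"feature": 1, "threshold": 70.0,
                   "left": {"leaf": 1}, "right": {"leaf": 2}}}
tree2 = {"feature": 1, "threshold": 80.0, "left": {"leaf": 0}, "right": {"leaf": 1}}

x = [2.0, 85.0]
vector = combined_vector([tree1, tree2], [3, 2], x)
weights = [0.5, 0.0, -0.5, -0.3, 0.3, 0.1, 0.0]   # hypothetical coefficients
p = predict_proba(weights, bias=-0.5, vector=vector)
label = 1 if p >= 0.5 else 0
```

The design point is that the trees contribute non-linear, threshold-based indicator features, while the final decision remains a linear model over those indicators plus the original key features.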
Before the combination of the XGBoost model and the Logistic regression model is tested with the test set, the combination is verified. S71, model verification: the constructed combination of the XGBoost model and the Logistic regression model is verified by ten-fold cross-validation and the hyper-parameters of the corresponding model are determined. Ten-fold cross-validation means that the training set is divided into ten parts; each part in turn is taken as the validation set and the other nine parts as the training set, and the model is verified. Each verification yields a corresponding accuracy (or error rate), the average of the ten accuracies (or error rates) is used as the estimate of the algorithm's performance, and the hyper-parameters are determined according to this estimate. S72, five existing models are selected: the XGBoost model, the Logistic regression model, the KNN model (k-nearest neighbors algorithm), the SVM model (support vector machine) and the RF model (random forest); these are likewise verified by ten-fold cross-validation to determine the hyper-parameters of the corresponding models.
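The fold bookkeeping of ten-fold cross-validation can be sketched as below. The function names are hypothetical, `evaluate` stands for any caller-supplied routine that trains on the given training indices and returns the accuracy on the validation indices, and the sample count of 213 is only an illustrative training-set size; indices are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_accuracy(n_samples, evaluate, k=10):
    """Average the accuracy returned by evaluate(train_idx, val_idx) over k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(213, k=10))
```

To choose hyper-parameters, `cross_val_accuracy` would be called once per candidate setting and the setting with the highest average accuracy kept.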
The hyper-parameters of each model are as follows. The hyper-parameters of the combination of the XGBoost model and the Logistic regression model comprise: learning rate (learning_rate), maximum number of iterations (n_estimators), maximum depth (max_depth) and gamma value. The hyper-parameters of the XGBoost model comprise: learning rate (learning_rate), maximum number of iterations (n_estimators), maximum depth (max_depth) and gamma value. The hyper-parameters of the Logistic regression model comprise: C (the inverse of the regularization coefficient λ) and the regularization term (penalty). The hyper-parameters of the KNN model comprise: the number of neighbors (k_neighbors) and an initial value (P). The hyper-parameters of the SVM model comprise: the kernel function K (kernel), C (the inverse of the regularization coefficient λ) and the gamma value. The hyper-parameters of the RF model comprise: maximum number of iterations (n_estimators) and maximum depth (max_depth). The final parameter settings of the six models (algorithms) are shown in Tables 1a and 1b.
TABLE 1a Final parameter settings for six algorithms
Figure 221817DEST_PATH_IMAGE034
TABLE 1b Final parameter settings for six algorithms
Figure 96232DEST_PATH_IMAGE035
The above hyper-parameters are the final parameter settings of the corresponding models, and the models configured with the hyper-parameters in Tables 1a and 1b are the optimized models.
Step S73: the test set is classified using the six models configured with their corresponding hyper-parameters.
Step S8: the classification results of the combination of the XGBoost model and the Logistic regression model (XGBoost + LR) and of the five models are evaluated. The evaluation indexes comprise precision (Precision), recall (Recall) and the area under the ROC curve (AUC, Area Under Curve); the larger the AUC value, the better the classification performance of the corresponding model.
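For reference, the three evaluation indexes can be computed as in the following sketch. The AUC is computed here as the fraction of positive and negative sample pairs in which the positive sample receives the higher score (ties counted as one half), which is equivalent to the area under the ROC curve. The labels, scores and the 0.5 decision threshold are made-up illustrative values.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy, made-up labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

In this toy case two of the three positives are detected and one negative is flagged, so precision and recall are both 2/3, and 8 of the 9 positive-negative pairs are ranked correctly.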
The constructed model combination is evaluated on the acquired 25 items of data. In the evaluation, the confusion matrices of the six algorithms (Logistic Regression, KNN, SVM, RF, XGBoost, XGBoost + LR) are first obtained, as shown in FIG. 4, where the horizontal axis represents the predicted value (Predicted label), the vertical axis represents the true value (True label), and each entry of a confusion matrix is the number of samples with the corresponding true value and predicted value. The Recall of the XGBoost + LR combined model reaches 0.971, which indicates good accuracy. Second, the six algorithms are compared; to reduce the effect of randomness, data extraction and modeling are repeated 1000 times, and the mean result of each algorithm is given in Table 2.
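The two evaluation steps above can be sketched as follows: a binary confusion matrix with true labels as rows and predicted labels as columns, and a loop that repeats random data extraction and modeling and averages the resulting recall. The threshold model `fit_threshold`, the 25% test fraction, the seed and the toy data are assumptions made for illustration and are not the models or data of the application.

```python
import random

def confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows are true labels (0, 1), columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def recall_from_matrix(matrix):
    """Recall = TP / (TP + FN), read from the row of true label 1."""
    fn, tp = matrix[1]
    return tp / (tp + fn) if tp + fn else 0.0

def fit_threshold(X_train, y_train):
    """Toy model: threshold halfway between the class means of feature 0."""
    pos = [x[0] for x, t in zip(X_train, y_train) if t == 1]
    neg = [x[0] for x, t in zip(X_train, y_train) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def repeated_mean_recall(X, y, repeats=1000, test_fraction=0.25, seed=0):
    """Average test-set recall over repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = int(len(X) * test_fraction)
    total = 0.0
    for _ in range(repeats):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        threshold = fit_threshold([X[i] for i in train_idx],
                                  [y[i] for i in train_idx])
        y_true = [y[i] for i in test_idx]
        y_pred = [1 if X[i][0] > threshold else 0 for i in test_idx]
        total += recall_from_matrix(confusion_matrix(y_true, y_pred))
    return total / repeats

# Toy, made-up data: 20 samples of each class, separable on one feature.
X = [[float(i)] for i in range(40)]
y = [0] * 20 + [1] * 20
example = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
mean_recall = repeated_mean_recall(X, y, repeats=200)
```

Averaging over many random splits reduces the dependence of the reported figure on any single partition of the data, which is the stated purpose of repeating the extraction and modeling.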
TABLE 2 comparison of classification results of six models
Figure 800883DEST_PATH_IMAGE036
The data in Table 2 show that the XGBoost + LR combined model is more accurate than the single models, with a Recall of 96.8%, a Precision of 92.5% and an AUC of 98.0%. A model with this performance is sufficient for clinical use and can effectively help doctors judge accurately whether a patient is infected.
The infection screening method has the following advantages. First, multiple items of clinical data are used for the judgment, and the patient's night-time sleep data are associated with infectious disease prediction, which improves the reliability of the prediction. Second, the classification algorithm based on the combination of XGBoost and Logistic regression strengthens feature selection, can measure the differences between patients and healthy persons across different features, and achieves higher accuracy than traditional machine learning classification algorithms.
The above is only a preferred embodiment of the present application, and the present invention is not limited to the above embodiments. It is to be understood that other modifications and variations directly derived or suggested to those skilled in the art without departing from the spirit and scope of the invention are to be considered as included within the scope of the invention.

Claims (10)

1. An infection screening method based on a combination of a decision tree model and a logistic regression model, implemented on the basis of a vital sign monitor, wherein the vital sign monitor is used to detect clinical data of a user and is communicatively connected to a remote data service platform through a communication module, and the remote data service platform is used to perform infection screening according to the clinical data, characterized in that the infection screening method comprises the following steps: S1, detecting and acquiring clinical data of the user;
S2, randomly dividing the clinical data into a training set and a test set, and equally dividing the training set into two parts: training set A and training set B;
S3, training on training set A to construct the XGBoost model, and at the same time performing feature selection on training set A to select key features;
S4, selecting, in training set B, the key feature vectors corresponding to the key features, taking the key feature vectors as the input of the constructed XGBoost model, and performing one-hot encoding on the leaf-node outputs of the XGBoost model to obtain newly constructed feature vectors;
S5, merging the newly constructed feature vectors and the key feature vectors to obtain combined feature vectors;
S6, training on the combined feature vectors to construct the Logistic regression model;
and S7, based on the combination of the XGBoost model and the Logistic regression model, performing predictive classification on the test set to obtain a classification result.
2. The infection screening method based on a combination of decision tree model and logistic regression model as claimed in claim 1, wherein in step S1, the users comprise healthy persons and patients, the clinical data are feature vectors related to the disease condition, and the clinical data comprise: mean respiratory rate, median respiratory rate, maximum respiratory rate, minimum respiratory rate, mean heart rate, median heart rate, maximum heart rate, minimum heart rate, percentage of wakefulness, percentage of REM sleep, percentage of light sleep, percentage of deep sleep, sleep latency, sleep duration, sleep efficiency, sleep score, body movement density, body movement minute ratio, number of awakenings, number of turnovers, number of apneas during sleep, apnea-hypopnea index, number of apneas during REM sleep, number of apneas during light sleep, number of apneas during deep sleep.
3. The infection screening method based on a combination of decision tree model and logistic regression model as claimed in claim 1 or 2, wherein in step S2, 75% of the clinical data is randomly drawn as the training set and the remaining 25% is used as the test set.
4. The infection screening method based on the combination of decision tree model and logistic regression model as claimed in claim 3, wherein the calculation manner of the decision tree model in steps S3, S4 comprises:
suppose that
Figure 421056DEST_PATH_IMAGE001
is the output of the t-th tree,
Figure 347424DEST_PATH_IMAGE002
is the current output of the model, and
Figure 427375DEST_PATH_IMAGE003
is the actual result; then
Figure 515417DEST_PATH_IMAGE004
where T denotes the total number of decision trees and t denotes the t-th iteration; that is, in each iteration an optimal model is found and added to the existing model, so that the predicted value approaches the true value.
5. The infection screening method based on a combination of decision tree model and logistic regression model as claimed in claim 4, wherein in step S3, the feature selection is implemented based on SHAP values, the SHAP values describing the influence of the clinical data in training set A on the result, and the key features are obtained according to that influence.
6. The infection screening method based on a combination of decision tree model and logistic regression model as claimed in claim 5, wherein the key features comprise: number of apneas during REM sleep, mean heart rate, sleep duration, sleep latency, median heart rate, percentage of light sleep, apnea-hypopnea index.
7. The infection screening method based on the combination of decision tree model and Logistic regression model as claimed in claim 1 or 5, wherein in step S6, the Logistic regression model is calculated by the following steps:
the Logistic regression model is a binary classification model which, on the basis of linear regression, maps the input function value to the interval from 0 to 1 through a sigmoid function, this value serving as the probability used to decide each class; let y be the binary dependent variable indicating whether the user is a patient, where y = 0 indicates healthy and y = 1 indicates a patient, and let x = {x1, x2, x3, …, xp} be the corresponding p-dimensional explanatory variables; let
Figure 313609DEST_PATH_IMAGE005
represent the probability that y belongs to class 1 given the combined feature vector x, and let
Figure 778088DEST_PATH_IMAGE006
Figure 712546DEST_PATH_IMAGE007
can then be expressed as:
Figure 237068DEST_PATH_IMAGE008
the above formula is the Logistic regression model, wherein
Figure 256977DEST_PATH_IMAGE009
represents the coefficients corresponding to the features; a parameter estimation method such as gradient descent can be used to obtain the parameters
Figure 259568DEST_PATH_IMAGE010
Here e denotes the base of the natural logarithm, and T denotes that the matrix
Figure 314112DEST_PATH_IMAGE011
is transposed.
8. The infection screening method based on the combination of the decision tree model and the Logistic regression model as claimed in claim 7, wherein step S7 further comprises: S71, performing model validation by ten-fold cross-validation, validating the constructed combination of the XGBoost model and the Logistic regression model, and determining the hyper-parameters of the corresponding model; S72, selecting five existing models, namely the XGBoost model, the Logistic regression model, the KNN model, the SVM model and the RF model, and validating them by ten-fold cross-validation to determine the hyper-parameters of the corresponding models.
9. The infection screening method based on the combination of decision tree model and Logistic regression model according to claim 8, wherein in step S71, the hyper-parameters of the combination of the XGBoost model and the Logistic regression model comprise: learning rate, maximum number of iterations, maximum depth and gamma value;
in step S72, the hyper-parameters of the XGBoost model comprise: learning rate, maximum number of iterations, maximum depth and gamma value;
the hyper-parameters of the Logistic regression model comprise: C and the regularization term;
the hyper-parameters of the KNN model comprise: the number of neighbors and an initial value;
the hyper-parameters of the SVM model comprise: the kernel function K, C and the gamma value;
the hyper-parameters of the RF model comprise: maximum number of iterations and maximum depth.
10. The infection screening method based on a combination of decision tree model and logistic regression model according to claim 8, further comprising steps S8 and S9: S8, classifying the test set using the five models; S9, evaluating the classification result of the combination of the XGBoost model and the Logistic regression model and the classification results of the five models, wherein the evaluation indexes comprise: prediction precision and recall.
CN202111019378.2A 2021-09-01 2021-09-01 Infection screening method based on combination of decision tree model and logistic regression model Pending CN113470837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111019378.2A CN113470837A (en) 2021-09-01 2021-09-01 Infection screening method based on combination of decision tree model and logistic regression model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111019378.2A CN113470837A (en) 2021-09-01 2021-09-01 Infection screening method based on combination of decision tree model and logistic regression model

Publications (1)

Publication Number Publication Date
CN113470837A true CN113470837A (en) 2021-10-01

Family

ID=77867108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111019378.2A Pending CN113470837A (en) 2021-09-01 2021-09-01 Infection screening method based on combination of decision tree model and logistic regression model

Country Status (1)

Country Link
CN (1) CN113470837A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111834010A (en) * 2020-05-25 2020-10-27 重庆工贸职业技术学院 COVID-19 detection false negative identification method based on attribute reduction and XGboost
CN112259221A (en) * 2020-10-21 2021-01-22 北京大学第一医院 Lung cancer diagnosis system based on multiple machine learning algorithms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHANG C等: ""A COVID-19 Non-contact Screening System Based on XGBoost and Logistic Regression"", 《JMIR PREPRINTS. 15/01/2021:27151》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211001
