CN113470837A - Infection screening method based on combination of decision tree model and logistic regression model

Info

Publication number: CN113470837A
Application number: CN202111019378.2A
Authority: CN
Inventors: 商春恒; 王云峰
Original assignee: Institute of Microelectronics of CAS; Guangdong Greater Bay Area Institute of Integrated Circuit and System
Current assignee: Institute of Microelectronics of CAS; Guangdong Greater Bay Area Institute of Integrated Circuit and System
Priority date: 2021-09-01
Filing date: 2021-09-01
Publication date: 2021-10-01

Abstract

The invention discloses an infection screening method based on a combination of a decision tree model and a logistic regression model. The method is convenient to operate and improves infection screening accuracy. It is implemented on a vital sign monitor that is in communication connection with a remote data service platform, and the remote data service platform carries out infection screening according to clinical data. The method comprises the following steps: detecting and acquiring clinical data of a user through the vital sign monitor; randomly dividing the clinical data into a training set and a test set, and equally dividing the training set into a training set A and a training set B; constructing a decision tree model based on training set A and, at the same time, carrying out feature selection on training set A; taking the key feature vectors of training set B as input to the constructed decision tree model to obtain newly constructed feature vectors; constructing a logistic regression model based on the combined feature vectors; and carrying out prediction classification on the test set based on the combination of the decision tree model and the logistic regression model to obtain a classification result.

Description

Infection screening method based on combination of decision tree model and logistic regression model

Technical Field

The invention relates to the technical field of data analysis, in particular to an infection screening method based on a combination of a decision tree model and a logistic regression model.

Background

At present, hospitals mainly screen for infectious diseases using CT (computed tomography), clinical characteristics and body temperature detection as diagnostic methods. However, these methods are limited by medical technology and region, and still suffer from delayed acquisition of detection results, poor detection accuracy and high infection risk.

For example, in the current large-scale epidemic of the novel coronavirus, effective ways to control the spread of the disease are large-scale screening, isolation and treatment of patients, and symptom monitoring. Most existing detection is based on RT-PCR (reverse transcription polymerase chain reaction), but during the outbreak peak of COVID-19, RT-PCR kits were in seriously short supply, and hospitals used computed tomography (CT), clinical characteristics and body temperature detection as alternative diagnostic methods. These methods require professional medical personnel, the operation steps are complicated, and the assessment of clinical characteristics is influenced by the experience and subjective judgment of the medical personnel, so detection accuracy is poor.

Disclosure of Invention

Aiming at the problems of delayed acquisition of detection results and poor detection accuracy caused by using CT, clinical characteristics and body temperature detection as diagnostic methods in the prior art, the invention provides an infection screening method based on the combination of a decision tree model and a logistic regression model, which is convenient to operate and can improve the accuracy of infection screening.

In order to achieve the purpose, the invention adopts the following technical scheme:

An infection screening method based on a combination of a decision tree model and a logistic regression model is realized based on a vital sign monitor, the vital sign monitor is used for detecting clinical data of a user, the vital sign monitor is in communication connection with a remote data service platform through a communication module, and the remote data service platform is used for carrying out infection screening according to the clinical data, and the infection screening method is characterized by comprising the following steps:

S1, detecting and acquiring clinical data of the user through the vital sign monitor;

S2, randomly dividing the clinical data into a training set and a test set, and equally dividing the training set into two parts: training set A and training set B;

S3, training an XGBoost model based on training set A to construct the XGBoost model and, at the same time, performing feature selection on training set A to select key features;

S4, selecting, in training set B, the key feature vectors corresponding to the key features, taking the key feature vectors as input to the constructed XGBoost model, and performing one-hot encoding on the leaf-node outputs of the XGBoost model to obtain newly constructed feature vectors;

S5, merging the newly constructed feature vectors and the key feature vectors to obtain combined feature vectors;

S6, training a Logistic regression model based on the combined feature vectors to construct the Logistic regression model;

S7, based on the combination of the XGBoost model and the Logistic regression model, performing prediction classification on the test set to obtain a classification result.

It is further characterized in that the method further comprises the steps of,

in step S1, the user includes a healthy person and a patient, the clinical data is a feature vector related to a disease condition, and the clinical data includes: respiratory rate mean, median respiratory rate, maximum respiratory rate, minimum respiratory rate, mean heart rate, median heart rate, maximum heart rate, minimum heart rate, percentage waking, percentage REM sleep, percentage light sleep, percentage deep sleep, sleep latency, length of sleep, sleep efficiency, sleep score, body movement density, body movement minute ratio, number of waking, number of turning, number of apneas during sleep, hypopnea index for apnea, number of REM apneas, number of apneas during light sleep, number of apneas during deep sleep;

in step S2, randomly extracting 75% of the clinical data as the training set, and the remaining 25% as the test set;

in steps S3 and S4, the calculation method of the decision tree model includes:

suppose that $f_t(x_i)$ is the output result of the t-th tree, $\hat{y}_i^{(t)}$ is the current output result of the model, and $y_i$ is the actual result, then

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i), \quad t = 1, \dots, T$$

where T represents a total of T decision trees and t represents the t-th iteration, namely, an optimal model is searched for each time and added to the existing model so that the predicted value approaches the true value;

in step S3, the feature selection is implemented based on the SHAP values of the XGBoost model: the influence of the clinical data in training set A on the result is described by the SHAP values, and the key features are obtained according to this influence;

the key features include: REM phase apnea number, heart rate mean, sleep duration, sleep latency, heart rate median, shallow sleep percentage, apnea hypopnea index;

in step S6, the Logistic regression model is calculated in a manner including:

the Logistic regression model is a binary classification model that, on the basis of linear regression, maps the input function value to the interval 0-1 through a sigmoid function as the probability of each class; y is a binary dependent variable indicating whether the subject is a patient, y = 0 represents health, y = 1 represents a patient, and x = {x1, x2, x3, …, xp} represents the corresponding p-dimensional explanatory variables; the probability $P(y=1 \mid x)$ represents the probability that y belongs to class 1 given the combined feature vector x; let

$$g(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_p x_p = \theta^{T} x \quad (\text{taking } x_0 = 1),$$

then $P(y=1 \mid x)$ can be expressed as:

$$P(y=1 \mid x) = \frac{1}{1 + e^{-g(x)}},$$

the above formula is the logistic regression model, wherein $\theta = (\theta_0, \theta_1, \dots, \theta_p)$ represents the coefficients corresponding to each feature. The parameters $\theta$ can be determined using a parameter estimation method such as gradient descent, where e is the base of the natural logarithm and T denotes the transpose of the $\theta$ matrix.

Step S7 further includes: S71, model verification, in which the constructed combination of the XGBoost model and the Logistic regression model is verified by ten-fold cross-validation to determine the hyper-parameters of the corresponding model; S72, selecting five existing models, namely the XGBoost model, the Logistic regression model, the KNN model (K-nearest neighbors algorithm), the SVM model (support vector machine) and the RF model (random forest), and verifying the five models by ten-fold cross-validation to determine the hyper-parameters of the corresponding models;

in step S71, ten-fold cross-validation divides the training set into ten parts, takes each part in turn as the optimization validation set, and uses the remaining nine parts as the optimization training set to verify the model; each verification yields a corresponding accuracy (or error rate), the average of the ten accuracies (or error rates) is used as the estimate of the algorithm's precision, and the hyper-parameters are determined according to this estimate;

in step S71, the hyper-parameters of the combination of the XGBoost model and the Logistic regression model include: learning rate (learning_rate), maximum number of iterations (n_estimators), maximum depth (max_depth), gamma value;

in step S72, the hyper-parameters of the XGBoost model include: learning rate (learning_rate), maximum number of iterations (n_estimators), maximum depth (max_depth), gamma value;

the hyper-parameters of the Logistic regression model include: C (i.e., the inverse of the regularization coefficient λ), regularization term (i.e., penalty);

the hyper-parameters of the KNN model include: the number of adjacent points (i.e., k_neighbors), the initial value (P);

the hyper-parameters of the SVM model include: kernel function (kernel), C (inverse of the regularization coefficient λ), gamma value;

the hyper-parameters of the RF model include: maximum number of iterations (i.e., n_estimators), maximum depth (i.e., max_depth);

the method further comprises the step S8 of classifying the test set based on the existing five models;

step S9, evaluating the classification results of the combination of the XGBoost model and the Logistic regression model and of the five existing models, wherein the evaluation indexes include: precision, recall, and the area under the ROC curve (AUC, i.e., Area Under Curve).

By adopting the method of the invention, the following beneficial effects can be achieved: the XGBoost model first splits on the features in the clinical data to obtain, at each node, the feature and threshold with the best effect; at the same time, feature selection is performed on the clinical data to select the features associated with higher risk, which helps to speed up model construction, enhance the generalization ability of the model and reduce over-fitting; a Logistic regression model is then constructed based on the best-effect features (the newly constructed feature vectors) and the higher-risk key feature vectors, and the test set is classified based on this Logistic regression model, which improves classification accuracy and therefore the accuracy of infection screening.

Drawings

FIG. 1 is a flow chart of the infection screening method of the present invention;

FIG. 2 is a schematic diagram of the structural features of the XGboost model according to the present invention;

FIG. 3a is a graphical representation of the invention using SHAP values to describe the characteristic significance of clinical data;

FIG. 3b is a bar graph depicting feature importance using the average of the absolute values of SHAP values in accordance with the present invention;

FIG. 4a is a schematic diagram of a confusion matrix structure of a Logistic Regression model, an SVM model and an XGboost model according to the present invention;

FIG. 4b is a schematic diagram of the confusion matrix structure of the KNN model, the RF model and the XGboost + LR model.

Detailed Description

An infection screening method based on a combination of a decision tree model and a logistic regression model is realized based on a vital sign monitor (an existing non-contact vital sign monitor, model YOLI-RD 200-G-WH). The vital sign monitor is in communication connection with a remote data service platform through a communication module to form a non-contact vital sign monitoring system, and the vital sign monitor comprises a biological radar. The non-contact vital sign monitor is used to monitor the vital signs of patients (in this embodiment, COVID-19 patients in a Wuhan hospital). 140 records of radar monitoring data (namely clinical data) from 23 patients were collected and analyzed together with 144 records of sleep monitoring data from healthy people. The patients' data were collected in a centralized monitoring ward, and the healthy people's data were collected in their own homes. Basic information such as age, sex, underlying disease and medication was recorded for all subjects; this information was widely distributed with no clustering on any particular attribute, so these other factors were regarded as minor interference and the controlled comparison was considered valid.

Medical staff can collect a patient's data simply by placing the vital sign monitor at the head of the patient's bed. This collection method does not interfere with the patient's daily activities and is easy for medical personnel to operate; they can obtain the clinical data through the remote data service platform, which reduces the need for operations such as CT. Detection is therefore convenient, and the reduced contact lowers the infection risk.

The non-contact vital sign monitoring system consists of the non-contact vital sign monitor and the remote data service platform. The monitor transmits a radar signal, filters the radar echo signal to separate the heartbeat, respiration and body movement signals, and extracts data such as respiratory rate, heart rate, body movement and bed-leaving events. The monitoring data are uploaded to the data platform in real time over WiFi. After the patient has slept through the night, the system (the remote data service platform) performs sleep analysis and apnea analysis on the night's data and produces a sleep monitoring report. The output of the sleep monitoring report is the clinical data, 25 items in total, which comprehensively reflect the patient's night-time respiration, heartbeat, sleep structure, sleep quality, body movement and apnea. These 25 items are used in the infection screening method described below.

Referring to FIG. 1, an infection screening method based on a combination of a decision tree model and a logistic regression model includes: S1, detecting and acquiring clinical data of the user through the vital sign monitor; the users include healthy people and patients, and the clinical data comprises 25 features: mean respiratory rate (meanRR), median respiratory rate (medRR), maximum respiratory rate (maxRR), minimum respiratory rate (minRR), mean heart rate (meanHR), median heart rate (medHR), maximum heart rate (maxHR), minimum heart rate (minHR), percentage wakefulness (awakPrct), percentage REM sleep (remspprct), percentage light sleep (lightSPrct), percentage deep sleep (delesprct), sleep latency (latnMin), sleep duration (slepMin), sleep efficiency (sleffective), sleep score (slepScore), body movement density (meand), body movement minute ratio (movMinPrct), number of awakenings (awittims), number of turn-overs (turnover), number of apneas during sleep (epottims), apnea hypopnea index (rei), number of apneas during REM (REM), number of apneas during light sleep (sataliments), number of apneas during deep sleep (ahemims).

S2, shuffling the data of the patients and the healthy people, randomly extracting 75% of the data in proportion to the labels as the training set, and using the remaining 25% as the test set; the training set is used to train and construct the model combination, and the test set is used to test the screening performance of the model; the training set is equally divided into two parts: training set A and training set B.
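The split in S2 can be sketched as follows. This is only an illustration under stated assumptions: the function names `stratified_split` and `halve` are hypothetical, the labels simply mirror the 140 patient and 144 healthy records of the embodiment, and the patent does not say which tool was actually used for the split.

```python
import random

def stratified_split(labels, train_frac=0.75, seed=0):
    """Return (train_idx, test_idx), keeping the label ratio in both parts."""
    rng = random.Random(seed)
    by_label = {}
    for i, y in enumerate(labels):
        by_label.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_label.values():
        rng.shuffle(idx)                      # shuffle within each label group
        cut = round(len(idx) * train_frac)    # 75% of this label goes to training
        train += idx[:cut]
        test += idx[cut:]
    return train, test

def halve(indices, seed=0):
    """Shuffle and split indices into two near-equal parts (training sets A and B)."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    mid = len(idx) // 2
    return idx[:mid], idx[mid:]

# Illustrative labels: 1 = patient record, 0 = healthy record.
labels = [1] * 140 + [0] * 144
train_idx, test_idx = stratified_split(labels)
set_a, set_b = halve(train_idx)
```

Splitting per label group keeps the patient-to-healthy ratio roughly the same in the training and test sets, which is one reading of "in proportion to the labels".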

S3, training an XGBoost model (the decision tree model) based on training set A to construct the XGBoost model, and performing feature selection on training set A to select key features;

Feature selection helps to speed up model construction, enhance the generalization ability of the model and reduce over-fitting. A good global feature importance metric needs to be both consistent and accurate, and the present application uses SHAP values to describe and evaluate the importance of features. SHAP values allow an overall visualization of the features: FIG. 3a plots the effect of twenty features in the clinical data on each sample. Each row represents a feature, the abscissa represents the SHAP value (the influence on the model output), the ordinate represents the feature, and the middle vertical line represents zero risk. The farther the points of a row lie from the center, the greater the influence of that feature on disease risk, i.e., the more important the feature. For example, the figure shows that the value of the feature "rematims" increases the risk of being a patient; the value of "meanHR" also increases the risk, but its importance is lower than that of "rematims". The figure further shows that the influence on disease risk decreases in the order "rematims", "meanHR", "slepMin", "latnMin", "medHR", "rightspt", "AHI", "maxRR", "maxHR", "meanNMD", "movmjnct", "minRR", "slotmartim", "satprct" and "waktims".

FIG. 3b takes the mean of the absolute SHAP values of each feature, mean(|Tree SHAP|), as the importance of that feature. The abscissa in FIG. 3b represents mean(|Tree SHAP|) (the average influence on the model output) and the ordinate is the feature, giving a standard bar chart in which the features are sorted by mean(|Tree SHAP|). It can be seen that the number of apneas during REM is the strongest factor for distinguishing patients.

According to the importance degree of the feature value, the first 7 features are selected as key features to train the XGboost model and the Logistic regression model, and the key features comprise: REM phase apnea number, heart rate mean, sleep duration, sleep latency, heart rate median, percentage of shallow sleep, apnea hypopnea index.
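The ranking by mean(|SHAP|) and the selection of the top features can be sketched as below. This is a minimal sketch that assumes the per-sample SHAP values have already been computed elsewhere; the helper name `top_k_by_mean_abs` and the numbers in `shap_rows` are made up for illustration and are not data from the patent.

```python
def top_k_by_mean_abs(shap_rows, feature_names, k):
    """Rank features by the mean absolute SHAP value and keep the first k."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(feature_names, key=lambda name: importance[name], reverse=True)
    return ranked[:k], importance

# Hypothetical SHAP values: one row per sample, one column per feature.
feature_names = ["meanHR", "slepMin", "maxRR"]
shap_rows = [
    [0.4, -0.1, 0.05],
    [-0.6, 0.2, -0.05],
    [0.5, -0.3, 0.0],
]
key_features, importance = top_k_by_mean_abs(shap_rows, feature_names, k=2)
```

Taking the absolute value before averaging matters: a feature that pushes some samples strongly toward "patient" and others strongly toward "healthy" would otherwise average out to near zero.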

In steps S3 and S4, the XGBoost model is an ensemble learning algorithm based on gradient boosting iterations that uses decision trees as base learners. The algorithm keeps adding trees, growing each tree by repeated feature splitting; each added tree in effect learns a new function that fits the residual of the previous prediction. In each tree a sample falls into one leaf node, each leaf node corresponds to a score, and the scores from all trees are added to obtain the predicted value of the sample. The specific calculation process of the XGBoost model is as follows:

Suppose that $f_t(x_i)$ is the output result of the t-th tree, $\hat{y}_i^{(t)}$ is the current output of the model, and $y_i$ is the actual result, then

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i), \quad t = 1, \dots, T$$

where T represents a total of T decision trees and t represents the t-th iteration; namely, an optimal model (a new function) is searched for each time and added to the existing model so that the predicted value moves closer to the true value. The optimal model is constructed by minimizing a loss function $L$ (Loss Function); when the training data set is small, over-fitting occurs easily, so a regularization term generally needs to be added to reduce the complexity of the model. The loss function L is calculated as:

$$\min_{f \in F} \; \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + \lambda J(f);$$

where F is the hypothesis space, N is the number of training samples, $J(f)$ controls the model complexity, λ represents the regularization coefficient, and $\min_{f \in F}$ denotes finding the parameters that minimize the expression over the whole hypothesis space.

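Thus, the objective function is:

$$Obj = \sum_{i=1}^{N} l\big(y_i, \hat{y}_i\big) + \sum_{k} \Omega(f_k)$$

where $\sum_{i} l(y_i, \hat{y}_i)$ is the training error and $\sum_{k} \Omega(f_k)$ is the penalty on model complexity (the sum of the complexities of all trees). This term includes two parts, the number of leaf nodes and the values of the leaf nodes, and is expressed as

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2};$$

where T here denotes the number of leaf nodes of the tree, $\lVert w \rVert$ is the modulus of the leaf node vector, γ represents the difficulty of node splitting, and λ represents the regularization coefficient.

Performing a second-order Taylor expansion on the objective function:

$$Obj^{(t)} \approx \sum_{i=1}^{N} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t)$$

where $g_i$ and $h_i$ are the first and second derivatives of $l\big(y_i, \hat{y}_i^{(t-1)}\big)$ with respect to $\hat{y}_i^{(t-1)}$. Let $I_L$ and $I_R$ respectively represent the sample sets of the left and right subtree nodes after splitting, with $G_L = \sum_{i \in I_L} g_i$, $H_L = \sum_{i \in I_L} h_i$, $G_R = \sum_{i \in I_R} g_i$ and $H_R = \sum_{i \in I_R} h_i$.

For each leaf node, the node is split according to the gain, which is defined as:

$$Gain = \frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$

The first two terms on the right side of the equal sign are the scores of the left and right subtrees after splitting, the third term is the score of the parent node before splitting, and the last term γ is the complexity introduced by the additional leaf node (namely the difficulty of splitting the node).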
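The gain described above (left score plus right score, minus parent score, minus γ) can be sketched numerically. The sketch assumes the standard XGBoost form in which a node's score is G²/(H + λ), where G and H are the sums of the first and second derivatives of the loss over the samples in the node; the function names and the numbers are illustrative only.

```python
def leaf_score(G, H, lam):
    """Score of a node: G^2 / (H + lambda), assuming the standard XGBoost form."""
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain = 1/2 * (left score + right score - parent score) - gamma."""
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    return 0.5 * (children - parent) - gamma

# Illustrative gradient sums for a candidate split.
gain = split_gain(G_L=-4.0, H_L=3.0, G_R=5.0, H_R=4.0, lam=1.0, gamma=0.5)
should_split = gain > 0  # a split is only worthwhile if the gain is positive
```

With these numbers the left score is 4, the right score is 5 and the parent score is 0.125, so the gain is 0.5 × 8.875 − 0.5 = 3.9375. A larger γ makes splitting harder, which is why γ is described as the difficulty of node splitting.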

Constructing a new feature vector in step S4 means that each leaf node of every decision tree in the XGBoost model is used as a new feature, so the number of constructed features equals the number of leaf nodes of the XGBoost model. Each feature is 0 or 1: for each decision tree, if an input sample falls into a leaf node, the value of that leaf node is 1; otherwise it is 0. In FIG. 2, the XGBoost model is obtained by training on training set A and includes two decision trees, and each leaf node is a new feature. A sample falls into the first leaf node of tree1 (the first tree) and into the second leaf node of tree2 (the second tree), so the newly constructed feature vector is [1,0,0,0,1].

And S5, merging the newly constructed feature vector and the key feature vector to obtain a combined feature vector.
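Steps S4 and S5 can be sketched with the two-tree example of FIG. 2. The sketch assumes that tree1 has three leaves and tree2 has two, which is inferred from the five-element vector [1,0,0,0,1]; the helper `one_hot_leaves` and the key feature values are hypothetical.

```python
def one_hot_leaves(leaf_index_per_tree, leaves_per_tree):
    """One-hot encode, tree by tree, the leaf that a sample falls into."""
    vec = []
    for leaf, n_leaves in zip(leaf_index_per_tree, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1          # mark the leaf reached in this tree
        vec.extend(block)
    return vec

# The sample reaches leaf 0 of tree1 (3 leaves) and leaf 1 of tree2 (2 leaves).
new_features = one_hot_leaves([0, 1], [3, 2])

# S5: append the (hypothetical) key feature values of the same sample.
key_feature_vector = [72.0, 410.0]
combined = new_features + key_feature_vector
```

The combined vector therefore carries both the tree-derived indicator features and the original key features into the Logistic regression model.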

S6, training the Logistic regression model based on the combined feature vector to construct the Logistic regression model;

The Logistic regression model is a binary classification model that, on the basis of linear regression, maps the input function value to the interval 0-1 through a sigmoid function as the probability of each class. y is a binary dependent variable indicating whether the subject is a patient: y = 0 represents health and y = 1 represents a patient, and x = {x1, x2, x3, …, xp} represents the corresponding p-dimensional explanatory variables. The probability $P(y=1 \mid x)$ represents the probability that y belongs to class 1 given a feature vector x. Let

$$g(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_p x_p = \theta^{T} x \quad (\text{taking } x_0 = 1),$$

then $P(y=1 \mid x)$ can be expressed as:

$$P(y=1 \mid x) = \frac{1}{1 + e^{-g(x)}}$$

The above formula is referred to as the logistic regression model, wherein $\theta = (\theta_0, \theta_1, \dots, \theta_p)$ represents the coefficients corresponding to each feature. The parameters $\theta$ can be found using parameter estimation methods such as gradient descent; e denotes the base of the natural logarithm, and T denotes the transpose of the $\theta$ matrix.
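The sigmoid mapping and gradient-descent parameter estimation described for S6 can be sketched as follows. This is a plain batch gradient descent on the log-loss under assumed settings (learning rate 0.1, 2000 epochs, no regularization); the coefficient name `theta` and the toy data are illustrative, and the patent does not specify the actual solver.

```python
import math

def sigmoid(z):
    """Map a real value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """P(y = 1 | x); theta[0] is the intercept, theta[1:] the feature coefficients."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Estimate theta by batch gradient descent on the average log-loss."""
    p = len(X[0])
    n = len(X)
    theta = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(theta, xi) - yi   # derivative of log-loss w.r.t. z
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta

# Toy one-dimensional data: small values are healthy (0), large values are patients (1).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
theta = fit_logistic(X, y)
p_low = predict_proba(theta, [0.0])
p_high = predict_proba(theta, [3.0])
```

In the method described here, each row of `X` would be a combined feature vector rather than a single number.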

And S7, based on the combination of the XGboost model and the Logistic regression model, performing prediction classification on the test set to obtain a classification result. The XGboost model and the Logistic regression model combination are composed of two parts, wherein the XGboost model is used for extracting features in a training set to serve as new training input data, and the Logistic regression model serves as a classifier of the new training input data.

Before the combination of the XGBoost model and the Logistic regression model is tested with the test set, the combination is verified. S71, model verification: the constructed combination of the XGBoost model and the Logistic regression model is verified by ten-fold cross-validation to determine the hyper-parameters of the corresponding model. Ten-fold cross-validation means that the training set is divided into ten parts; each part in turn is taken as the optimization validation set and the other nine parts are used as the optimization training set to verify the model. Each verification yields a corresponding accuracy (or error rate), the average of the ten accuracies (or error rates) is used as the estimate of the algorithm's precision, and the hyper-parameters are determined according to this estimate. S72, five existing models are selected: the XGBoost model, the Logistic regression model, the KNN model (K-nearest neighbors algorithm), the SVM model (support vector machine) and the RF model (random forest); the five models are verified by ten-fold cross-validation to determine the hyper-parameters of the corresponding models.
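The ten-fold procedure can be sketched as index bookkeeping. The sketch assumes the sample indices have already been shuffled, uses 213 samples purely as an illustrative training-set size, and takes a hypothetical `evaluate(train_idx, val_idx)` callback that would train a model with one hyper-parameter setting and return its accuracy.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

def cross_val_score(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train_idx, val_idx) over the k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Illustrative use: ten folds over 213 training samples.
splits = list(k_fold_indices(213, k=10))
```

To choose hyper-parameters, `cross_val_score` would be computed for each candidate setting and the setting with the best average kept.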

The hyper-parameters of each model are specified as follows: the hyper-parameters of the combination of the XGBoost model and the Logistic regression model include: learning rate (learning_rate), maximum number of iterations (n_estimators), maximum depth (max_depth), gamma value; the hyper-parameters of the XGBoost model include: learning rate (learning_rate), maximum number of iterations (n_estimators), maximum depth (max_depth), gamma value; the hyper-parameters of the Logistic regression model include: C (inverse of the regularization coefficient λ), regularization term (penalty); the hyper-parameters of the KNN model include: the number of adjacent points (k_neighbors), an initial value (P); the hyper-parameters of the SVM model include: kernel function (kernel), C (inverse of the regularization coefficient λ), gamma value; the hyper-parameters of the RF model include: maximum number of iterations (n_estimators), maximum depth (max_depth). The final parameter settings of the six models (algorithms) are shown in Tables 1a and 1b.

TABLE 1a Final parameter settings for six algorithms

TABLE 1b Final parameter settings for six algorithms

The above hyper-parameters are respectively the final parameter settings of the corresponding models, and the model containing the corresponding hyper-parameters of table 1 is the optimized model.

And step S73, classifying the test set based on the six models containing the corresponding hyper-parameters.

Step S8, the classification results of the combination of the XGBoost model and the Logistic regression model (XGBoost + LR) and of the five models are evaluated, and the evaluation indexes comprise: precision (Precision), recall (Recall) and the area under the ROC curve (AUC, Area Under Curve); the larger the AUC value, the better the classification effect of the corresponding model.
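The three evaluation indexes can be computed as in the sketch below. It uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), and AUC as the fraction of patient/healthy pairs in which the patient receives the higher score); the labels and scores are made up for illustration and are not results from the patent.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # threshold at 0.5
precision, recall = precision_recall(y_true, y_pred)
area = auc(y_true, scores)
```

Recall is the index most directly tied to screening, since a missed patient (a false negative) lowers recall but not precision.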

The constructed model combination is evaluated using the acquired 25 items of data. In the evaluation process, the confusion matrices of the six algorithms (Logistic Regression, KNN, SVM, RF, XGBoost, XGBoost + LR) are first obtained, as shown in FIGS. 4a and 4b, where the horizontal axis represents the predicted value (Predicted label), the vertical axis represents the true value (True label), and the values in each confusion matrix are the numbers of samples with the corresponding true and predicted values. It can be seen that the Recall of the XGBoost + LR combined model reaches 0.971, showing better accuracy. Secondly, the six algorithms are compared; to reduce randomness, data extraction and modeling are repeated 1000 times, and the average result of each algorithm is given in Table 2.

TABLE 2 comparison of classification results of six models

The data in Table 2 show that the XGBoost + LR combined model is more accurate than the single models, with a Recall of 96.8%, a Precision of 92.5% and an AUC of 98.0%. A model with this performance is considered sufficient for clinical use and can effectively help doctors judge accurately whether a patient is infected.

The infection screening method has the following advantages. Firstly, multiple items of clinical data are used for the judgment, and the patient's night-time sleep data are related to infectious disease prediction, which improves the reliability of the prediction. Secondly, a classification algorithm based on the combination of XGBoost and Logistic regression is used, which strengthens feature selection, makes it possible to measure how different features differ between patients and healthy people, and achieves higher accuracy than traditional machine learning classification algorithms.

The above is only a preferred embodiment of the present application, and the present invention is not limited to the above embodiments. It is to be understood that other modifications and variations directly derived or suggested to those skilled in the art without departing from the spirit and scope of the invention are to be considered as included within the scope of the invention.

Claims

1. An infection screening method based on a combination of a decision tree model and a logistic regression model is realized based on a vital sign monitor, the vital sign monitor is used for detecting clinical data of a user, the vital sign monitor is in communication connection with a remote data service platform through a communication module, and the remote data service platform is used for carrying out infection screening according to the clinical data, and the infection screening method is characterized by comprising the following steps: S1, detecting and acquiring clinical data of the user;

2. The infection screening method based on a combination of decision tree model and logistic regression model as claimed in claim 1, wherein in step S1, the user comprises a healthy person and a patient, the clinical data is a feature vector related to a disease condition, and the clinical data comprises: mean respiratory rate, median respiratory rate, maximum respiratory rate, minimum respiratory rate, mean heart rate, median heart rate, maximum heart rate, minimum heart rate, percentage waking, percentage REM sleep, percentage light sleep, percentage deep sleep, sleep latency, length of sleep, sleep efficiency, sleep score, body movement density, body movement minute ratio, number of waking, number of turning, number of apneas during sleep, hypopnea index of apnea, number of REM apneas, number of apneas during light sleep, number of apneas during deep sleep.

3. The infection screening method based on a combination of decision tree model and logistic regression model as claimed in claim 1 or 2, wherein 75% of the clinical data is randomly drawn as the training set and the remaining 25% is drawn as the testing set in step S2.

4. The infection screening method based on the combination of decision tree model and logistic regression model as claimed in claim 3, wherein the calculation manner of the decision tree model in steps S3, S4 comprises:

suppose that $f_t(x_i)$ is the output result of the t-th tree, $\hat{y}_i^{(t)}$ is the current output result of the model, and $y_i$ is the actual result, then

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i), \quad t = 1, 2, \dots, T,$$

wherein T represents the total number of decision trees and t represents the t-th iteration, namely, an optimal model is searched for at each iteration and added to the existing model, so that the predicted value approaches the true value.
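The additive idea of claim 4, in which each iteration searches for a new tree and adds it to the existing model so that the prediction approaches the true value, can be illustrated with the sketch below. It is plain gradient boosting on squared error with depth-1 regression trees, not the regularized objective of XGBoost; the names `fit_stump` and `boost`, the `learning_rate` shrinkage factor, and the toy data are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (a single threshold) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # requires at least two distinct x values
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Build the model additively: new prediction = old prediction + new tree."""
    trees = []
    preds = [0.0] * len(xs)
    history = []  # mean squared error after each iteration
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return sum(learning_rate * t(x) for t in trees)

    return predict, history


# Hypothetical one-dimensional toy data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
predict, history = boost(xs, ys)
print(history[0], history[-1])
```

The training error recorded in `history` does not increase from one iteration to the next, which is the sense in which the predicted value approaches the true value as trees are added.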

5. The infection screening method based on a combination of decision tree model and logistic regression model as claimed in claim 4, wherein in step S3, the feature selection is implemented based on SHAP values, the influence of the clinical data in the training set A on the result is described by the SHAP values, and the key features are obtained according to the influence.
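How SHAP values can describe the influence of each feature and yield a ranking of key features may be sketched as follows. The sketch computes exact Shapley values by enumerating feature coalitions for a small toy model, with absent features set to a baseline value; it does not use the `shap` library or a tree-specific algorithm, and the toy model, feature names, samples, and baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def rank_features(model, samples, baseline, names):
    """Rank features by mean absolute Shapley value over the samples."""
    totals = [0.0] * len(names)
    for x in samples:
        for i, p in enumerate(shapley_values(model, x, baseline)):
            totals[i] += abs(p) / len(samples)
    return sorted(zip(names, totals), key=lambda item: item[1], reverse=True)


def toy_model(z):
    """Hypothetical model: the third feature has no influence on the output."""
    return 3.0 * z[0] + 1.0 * z[1] + 0.0 * z[2]


names = ["feature_a", "feature_b", "feature_c"]  # hypothetical feature names
samples = [[1.0, 1.0, 1.0], [2.0, 0.0, 5.0], [0.0, 3.0, 2.0]]
baseline = [0.0, 0.0, 0.0]
ranking = rank_features(toy_model, samples, baseline, names)
print(ranking)
```

Features at the top of `ranking` would be retained as key features; the enumeration grows exponentially with the number of features, so a practical implementation on 25 clinical features would rely on an efficient tree-based SHAP algorithm instead.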

6. The infection screening method based on a combination of decision tree model and logistic regression model as claimed in claim 5, wherein the key features comprise: number of apneas during REM sleep, mean heart rate, sleep duration, sleep latency, median heart rate, percentage of light sleep, and apnea-hypopnea index.

7. The infection screening method based on the combination of decision tree model and Logistic regression model as claimed in claim 1 or 5, wherein in step S6, the Logistic regression model is calculated by the following steps:

the logistic regression model is a binary classification model which, on the basis of linear regression, maps the input function value to the interval from 0 to 1 through a sigmoid function; let y be the binary dependent variable indicating whether the user is a patient, where y = 0 indicates healthy and y = 1 indicates a patient, and let x = {x1, x2, x3, …, xp} be the corresponding p-dimensional explanatory variables; let

$$P(y = 1 \mid x)$$

represent the probability that y belongs to class 1 given the combined feature vector x, and let

$$z = \theta^{T} x,$$

then $P(y = 1 \mid x)$ can be expressed as:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-\theta^{T} x}},$$

the above formula is the logistic regression model, wherein $\theta$ expresses the coefficients corresponding to the features, and the parameters $\theta$ can be obtained by using a gradient descent parameter estimation method; e denotes the base of the natural logarithm, and T denotes the transpose of the matrix $\theta$.
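The sigmoid mapping and the gradient descent parameter estimation named in claim 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficient vector is named `theta`, an intercept term `theta[0]` is added, and the learning rate, iteration count, and one-dimensional toy data are hypothetical values chosen only to make the sketch run.

```python
import math


def sigmoid(z):
    """Map a real value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Probability that y = 1 given x; theta[0] is an assumed intercept."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)


def fit_logistic(X, y, learning_rate=0.1, n_iters=2000):
    """Estimate the coefficients by batch gradient descent on the log-loss."""
    theta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iters):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = predict_proba(theta, xi) - yi  # derivative of log-loss w.r.t. z
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        theta = [t - learning_rate * g / len(X) for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: y = 0 denotes healthy, y = 1 denotes a patient.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
theta = fit_logistic(X, y)
print(predict_proba(theta, [0.0]), predict_proba(theta, [5.0]))
```

In the combined method, `X` would hold the combined feature vectors rather than raw values, and a sample would be classified as a patient when the returned probability exceeds a chosen threshold such as 0.5.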

8. The infection screening method based on the combination of the decision tree model and the Logistic regression model as claimed in claim 7, wherein step S7 further comprises the steps of: S71, performing model verification by using a ten-fold cross-validation method to verify the constructed combination of the XGBoost model and the Logistic regression model and to determine the hyper-parameters of the corresponding model; S72, selecting five existing models, namely the XGBoost model, the Logistic regression model, the KNN model, the SVM model and the RF model, and verifying them by the ten-fold cross-validation method to determine the hyper-parameters of the corresponding models.
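The ten-fold cross-validation of steps S71 and S72 can be sketched as follows. The helper names `k_fold_indices` and `cross_validate`, the fixed seed, and the trivial majority-class model used in the demonstration are assumptions for illustration; in practice `fit` would train one of the models with a candidate hyper-parameter setting, and the setting with the best mean score would be kept.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def cross_validate(fit, score, X, y, k=10):
    """Mean validation score of one hyper-parameter setting over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [X[i] for i in val_idx],
                            [y[i] for i in val_idx]))
    return sum(scores) / len(scores)


def majority_fit(X_train, y_train):
    """Hypothetical stand-in model: always predict the most frequent label."""
    return max(set(y_train), key=y_train.count)


def accuracy_score(label, X_val, y_val):
    """Fraction of validation samples whose label equals the predicted label."""
    return sum(1 for t in y_val if t == label) / len(y_val)


X = list(range(50))          # placeholder feature values (hypothetical)
y = [1] * 30 + [0] * 20      # placeholder labels (hypothetical)
mean_score = cross_validate(majority_fit, accuracy_score, X, y, k=10)
print(mean_score)
```

Each sample is used for validation exactly once and for training in the other nine folds, so the mean score is less sensitive to one particular split than a single hold-out estimate.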

9. The infection screening method based on the combination of decision tree model and Logistic regression model according to claim 8, wherein in step S71, the hyper-parameters of the combination of the XGBoost model and the Logistic regression model include: learning rate, maximum number of iterations, maximum depth and gamma value;

in step S72, the hyper-parameters of the XGBoost model include: learning rate, maximum number of iterations, maximum depth and gamma value;

the hyper-parameters of the Logistic regression model include: C and the regularization term;

the hyper-parameters of the KNN model include: the number of neighboring points and the initial value;

the hyper-parameters of the SVM model include: the kernel function K, C and the gamma value;

the hyper-parameters of the RF model include: maximum number of iterations, maximum depth.

10. The infection screening method based on a combination of decision tree model and logistic regression model according to claim 8, further comprising steps S8 and S9: S8, classifying the test set by using the five models; S9, evaluating the classification result of the combination of the XGBoost model and the Logistic regression model and the classification results of the five models, wherein the evaluation indexes comprise: prediction accuracy and recall rate.
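The two evaluation indexes of step S9 can be computed as in the sketch below. The function names and the example labels are hypothetical; label 1 is assumed to denote a patient and label 0 a healthy person, consistent with the definition of y in claim 7.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives (patients) that the model identified."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical test-set labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(accuracy(y_true, y_pred), recall(y_true, y_pred))
```

In a screening setting the recall rate is the more safety-relevant of the two indexes, since a low recall means infected users are being classified as healthy.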