CN111243751B - Heart disease prediction method based on dual feature selection and XGBoost algorithm

Info

Publication number: CN111243751B
Application number: CN202010052452.XA
Authority: CN
Inventors: 孙昊; 崔子超
Original assignee: Hebei University of Technology
Current assignee: Hebei University of Technology
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2022-04-22
Anticipated expiration: 2040-01-17
Also published as: CN111243751A

Abstract

The invention discloses a heart disease prediction method based on dual feature selection and the XGBoost algorithm.

Description

Heart disease prediction method based on dual feature selection and XGBoost algorithm

Technical Field

The invention belongs to the technical field of medical data analysis, and particularly relates to a heart disease prediction method based on dual feature selection and the XGBoost algorithm.

Background

Heart disease is a common and serious cardiovascular disease. Cardiovascular diseases are among the greatest threats to health in China and worldwide, and they place a heavy burden on China's medical system. The "Global Burden of Disease Study 2013", published in The Lancet, evaluated mortality in 190 countries between 1990 and 2013. Coronary heart disease, chronic lung disease and stroke were the three leading causes of death in China, together accounting for up to 46% of deaths in that year, and the number continues to rise. A heart disease prediction model can be trained on existing medical data to provide health guidance for patients. Current methods for predicting heart disease include:

1. A Support Vector Machine (SVM) is used to predict whether a patient has heart disease. A. Gavhane et al. proposed a heart disease prediction model based on a support vector machine (A.Gavhane, G.Kokkula, I.Pandya, P.K.Dedevakar. Prediction of Heart Disease Using Machine Learning [C]. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2018: 1275-.). However, it is difficult to choose a kernel function for the SVM algorithm, and training the model consumes a large amount of space and time.

2. A decision tree algorithm is used to train a model that predicts whether a patient is ill. A. J. Aljaaf et al. proposed prediction with a decision tree model (A.J.Aljaaf; D.Al-Jumeily; A.J.Hussain; T.Dawson; P Fergus; M.Al-Jumaily, Predicting the candidate of heart failure with a multi level assessment using a decision tree [C], 2015 Third International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), 2015: 101-106), but decision trees are easily affected by outliers and are prone to overfitting.

3. Multiple simple classifiers are integrated by majority voting. M. Shouman et al. (M.Shouman, T.Turner, R.Stocker, Using data mining techniques in heart disease diagnosis and treatment [C]. Japan-Egypt Conference on Electronics, Communications and Computers. 2012, 173-177.) combined decision tree, Bayesian classifier and support vector machine algorithms and trained a new classifier based on their votes, but it did not achieve better generalization.

4. Wuhaxing discloses a heart disease risk prediction system (CN 109377470A), which uses medical images of the patient's heart to perform recognition and classification, myocardial shape feature vector extraction, electrocardiogram feature extraction and the like on cardiac ultrasound videos, and then trains a deep neural network on these features. However, the system requires a large number of cardiac ultrasound cases, feature recognition and extraction are difficult, training the model places high demands on hardware, and the overall implementation is difficult.

5. CN 110265146 A proposes a heart disease prediction method based on a Bagging-Fuzzy-GBDT algorithm, which fuzzifies part of the data using fuzzy logic, combines the fuzzified data with the GBDT algorithm, and resamples m times with the Bagging algorithm to increase data diversity. However, the GBDT algorithm uses only the classification and regression tree (CART) as its base classifier; compared with the XGBoost algorithm, its choice of base classifier is limited, it cannot handle missing values, and it overfits easily.

The XGBoost (eXtreme Gradient Boosting) algorithm was proposed by Tianqi Chen of the University of Washington and has attracted wide attention for its efficiency and accuracy. The algorithm can use different base classifiers, adds a regularization term to the loss function, and performs pruning, all of which reduce the risk of overfitting.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a heart disease prediction method based on dual feature selection and the XGBoost algorithm, which improves prediction accuracy and model generalization while reducing training time.

The technical problem is solved by the following technical scheme: a heart disease prediction method based on dual feature selection and the XGBoost algorithm, characterized by comprising the following specific steps:

firstly, preprocessing an open-source heart disease data set to obtain a sample data set D of size N;

secondly, judging the importance of the features of the sample data set D with a random forest algorithm, which gives the importance of every feature and ranks the features by importance;

thirdly, performing feature correlation analysis on the sample data set D using the Spearman rank correlation coefficient method;

fourthly, performing dual feature selection according to the feature importance ranking obtained in the second step and the feature correlation analysis obtained in the third step, and selecting the most appropriate features;

and fifthly, setting up the scikit-learn machine learning framework on Windows, building a heart disease prediction model based on the XGBoost algorithm, training the model with 70-90% of the data set restricted to the features selected in the fourth step as the training set, and tuning the parameters of and testing the trained model with the remaining 10-30% as the test set, to obtain a heart disease prediction result.

The invention has the beneficial effects that: compared with the prior art, the invention has the following outstanding characteristics:

(1) A. Gavhane et al. (A.Gavhane, G.Kokkula, I.Pandya, P.K.Dedevakar. Prediction of Heart Disease Using Machine Learning [C]. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2018: 1275-.) proposed a heart disease prediction model based on a Support Vector Machine (SVM). However, it is difficult to choose a kernel function for the SVM algorithm, and training the model consumes a large amount of space and time. The technical scheme of the invention instead performs heart disease prediction based on the XGBoost algorithm, so the two methods differ substantially.

(2) A. J. Aljaaf et al. (A.J.Aljaaf; D.Al-Jumeily; A.J.Hussain; T.Dawson; P Fergus; M.Al-Jumaily, Predicting the candidate of heart failure with a multi level assessment using a decision tree [C], 2015 Third International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), 2015: 101-106.) proposed prediction with a decision tree model, but decision trees are susceptible to outliers and overfit easily. M. Shouman et al. (M.Shouman, T.Turner, R.Stocker, Using data mining techniques in heart disease diagnosis and treatment [C]. Japan-Egypt Conference on Electronics, Communications and Computers. 2012, 173-177.) combined decision tree, Bayesian classifier and support vector machine algorithms and trained a new classifier based on their votes, but it did not achieve better generalization. The technical scheme of the invention performs heart disease prediction based on the XGBoost algorithm. XGBoost is a Boosting algorithm and a boosted tree model: it integrates multiple tree models into a strong classifier, which greatly improves performance.

(3) The earlier patent CN109377470A proposes a heart disease risk prediction system that must perform recognition and classification, myocardial shape feature vector extraction, electrocardiogram feature extraction and the like on cardiac ultrasound videos. It needs a large number of cases, feature recognition and extraction are difficult, training the model places high demands on hardware, and the system as a whole is difficult to implement. In addition, patent CN110265146A proposes a heart disease prediction method based on a Bagging-Fuzzy-GBDT algorithm. To overcome the defects of CN109377470A and substantially advance heart disease prediction technology, the invention develops a new heart disease prediction method based on dual feature selection and the XGBoost algorithm. It selects features with a random forest algorithm and feature correlation analysis, avoiding the complex recognition, classification and feature extraction steps that CN109377470A requires to process videos. Compared with the GBDT algorithm used by CN110265146A, the XGBoost algorithm is an improvement built on GBDT: it adds a regularization term to the cost function to control model complexity and reduce model variance, so that the learned model is simpler and more robust and overfitting is prevented.

Compared with the prior art, the invention has the following remarkable improvements:

(1) The XGBoost algorithm is a Boosting algorithm that integrates multiple tree models into a strong classifier, and it improves on GBDT. XGBoost supports user-defined objective and evaluation functions, which improves flexibility; it adds regularization to the cost function to prevent overfitting; and it grows all candidate subtrees from top to bottom and then prunes them from bottom to top, so it is less likely to fall into a local optimum.

(2) In the heart disease prediction method based on dual feature selection and the XGBoost algorithm, the original data are preprocessed and then analyzed with a random forest algorithm and feature correlation analysis. A feature index is calculated from the importance ranking of the features and the correlation between each feature and the sample label, and features are selected for model training accordingly. This overcomes the shortcomings of existing heart disease prediction methods, which need many features and lack accuracy.

Drawings

FIG. 1 is a flow chart of the steps of one embodiment of the method of the present invention;

FIG. 2 is a graph of feature importance ranking based on a random forest algorithm according to an embodiment of the method of the present invention;

FIG. 3 is a feature correlation analysis heat map of one embodiment of the method of the present invention;

FIG. 4 is a graph of dual feature selection indices for one embodiment of the method of the present invention;

FIG. 5 is a graph showing comparison of evaluation indexes in different methods;

FIG. 6 is a ROC plot of the method of the present invention;

FIG. 7 is a graph comparing ROC curves for different methods;

in the figures: age-age; sex-gender (1 = male; 0 = female); cp-chest pain type; trestbps-resting blood pressure (in mm Hg on admission); chol-serum cholesterol (mg/dl); fbs-fasting blood sugar > 120 mg/dl; restecg-resting electrocardiogram results; thalach-maximum heart rate achieved; exang-exercise-induced angina; oldpeak-ST depression induced by exercise relative to rest; slope-slope of the peak exercise ST segment; ca-number of major vessels (0-3) colored by fluoroscopy; thal-thalassemia, where 3 is normal, 6 is fixed defect, and 7 is reversible defect;

Decision Tree-decision tree; SVM-support vector machine; GBDT-gradient boosting decision tree; majorityvote-majority voting (decision tree, support vector machine, K-nearest neighbor); XGboost-the method of the invention.

Detailed Description

The embodiments of the present invention will be described in detail with reference to the accompanying drawings.

As shown in fig. 1, the invention provides a heart disease prediction method based on dual feature selection and an XGBoost algorithm, which is characterized by comprising the following specific steps:

The open-source heart disease data set is preprocessed in the first step as follows: to address missing data, abnormal data, and features with multiple categories, missing values are filled, abnormal records are deleted, multi-category features are ordinally mapped or one-hot encoded, and the data are standardized.

Standardization rescales each feature to mean 0 and variance 1, the scale of a standard normal distribution. The standardization formula is:

$$x' = \frac{x - \mu_x}{\sigma_x} \tag{2}$$

In formula (2), $\mu_x$ and $\sigma_x$ are respectively the mean and standard deviation of a given feature over the samples.
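As a minimal sketch of formula (2), the following standard-library Python function standardizes one feature column. The function name `standardize`, the use of the population standard deviation, and the sample readings are illustrative assumptions, not part of the patent.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores (x - mean) / standard deviation for one feature column."""
    mu = mean(values)
    sigma = pstdev(values)  # ASSUMPTION: population standard deviation
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical resting blood pressure readings (mm Hg), not taken from the data set.
trestbps = [120, 130, 140, 150, 160]
z = standardize(trestbps)
```

After this transformation the column has mean 0 and variance 1, so features measured in different units become comparable.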

The importance of the features in the second step is calculated as follows. A random forest comprises multiple decision trees, and each node in a decision tree is a condition on a certain feature. These conditions divide the data into two parts so that the samples in each part tend to belong to the same class, and the process is repeated to classify the data. The contribution of each feature in each tree of the random forest is evaluated and then averaged; this average is the importance of the feature. The contribution is measured with the Gini index. Let VIM denote the variable importance measure and GI the Gini index, and assume there are c features $x_1, x_2, x_3, \ldots, x_c$. The goal is to calculate, for each feature $x_j$, its Gini importance score $VIM_j^{(Gini)}$, i.e. the average change in node-splitting impurity attributable to the j-th feature over all decision trees of the random forest.

The Gini index is calculated as follows:

$$GI_m = \sum_{k=1}^{K} p_{mk}\,(1 - p_{mk}) = 1 - \sum_{k=1}^{K} p_{mk}^2 \tag{3}$$

In formula (3), K is the number of classes and $p_{mk}$ is the proportion of class k in node m.

The importance of feature $x_j$ at node m is the change in the Gini index before and after node m is split:

$$VIM_{jm}^{(Gini)} = GI_m - \frac{N_l}{N}\,GI_l - \frac{N_r}{N}\,GI_r \tag{4}$$

In formula (4), N is the number of samples in node m, $N_l$ and $N_r$ are the numbers of samples in the two new nodes after the split, and $GI_l$ and $GI_r$ are the Gini indices of the two new nodes.

Then the importance of $x_j$ in the i-th tree is:

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)} \tag{5}$$

In formula (5), M is the set of all nodes in decision tree i at which feature $x_j$ appears.

Assuming there are n trees in the random forest, the importance of feature $x_j$ is:

$$VIM_j^{(Gini)} = \frac{1}{n} \sum_{i=1}^{n} VIM_{ij}^{(Gini)} \tag{6}$$

Finally, the scores are normalized:

$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{i=1}^{c} VIM_i^{(Gini)}} \tag{7}$$

In this way the importance of every feature is obtained, and the features are ranked by importance.
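The Gini-based importance calculation can be illustrated with a small standard-library Python sketch. It assumes the split contribution is the sample-weighted drop in the Gini index and that per-tree scores are averaged before normalization; the function names, the toy node and the per-tree numbers are all hypothetical.

```python
def gini(labels):
    """Gini index of a node: 1 - sum over classes of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """ASSUMPTION: sample-weighted drop in Gini index when a node is split."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

def normalized_importance(per_tree):
    """per_tree: one dict per tree mapping feature -> summed Gini decrease in that tree."""
    features = sorted({f for tree in per_tree for f in tree})
    # Average each feature's contribution over the trees of the forest.
    avg = {f: sum(t.get(f, 0.0) for t in per_tree) / len(per_tree) for f in features}
    total = sum(avg.values())
    # Normalize so that the importances of all features sum to 1.
    return {f: v / total for f, v in avg.items()}

# Hypothetical node with 4 healthy (0) and 4 diseased (1) samples, split perfectly.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
drop = gini_decrease(parent, parent[:4], parent[4:])

# Hypothetical per-tree decreases for two features in a two-tree forest.
vim = normalized_importance([{"cp": 0.30, "chol": 0.10}, {"cp": 0.50, "chol": 0.10}])
ranking = sorted(vim, key=vim.get, reverse=True)
```

In practice a library random forest would supply these scores directly; the sketch only makes the arithmetic behind the ranking visible.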

The third step proceeds as follows. The Spearman rank correlation coefficient, denoted by the Greek letter ρ, is used to estimate the correlation between two variables X and Y, where the relationship between the variables can be described by a monotonic function. The value of ρ between two variables lies between -1 and +1.

The Spearman correlation coefficient is:

$$\rho = 1 - \frac{6 \sum_{i=1}^{N'} d_i^2}{N'\,({N'}^2 - 1)} \tag{8}$$

In formula (8), N' is the number of elements in each of the variables X and Y. X and Y are sorted, and the i-th values of the two variables are denoted $X'_i$ and $Y'_i$; $x'_i$ and $y'_i$ are the ranks of $X'_i$ and $Y'_i$ in X and Y respectively, and $d_i = x'_i - y'_i$. This completes the correlation analysis of the features.
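The rank-difference form of the Spearman coefficient can be sketched in standard-library Python as follows. The helper names and the sample values are hypothetical, and the closed-form expression is exact only when neither variable contains tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), with d_i the rank difference."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical values for two features; not taken from the data set.
age = [29, 41, 54, 63, 70]
thalach = [190, 172, 160, 165, 130]
rho = spearman(age, thalach)
```

A coefficient close to -1 or +1 indicates a strongly monotonic relationship, which is what the heat map of fig. 3 visualizes for every pair of features.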

The fourth step proceeds as follows. From the feature importance ranking of the sample data set D obtained in the second step and the feature correlation analysis obtained in the third step, the feature importance values and the correlation coefficient between each feature and the result label are extracted, and the correlation coefficients are standardized according to formula (2). A feature index is then calculated for each feature from its importance value and its correlation with the result label. The feature indices are normalized and sorted, a threshold is set, and features are selected accordingly, which completes the feature selection.

The feature index in the fourth step is calculated as follows:

First, the feature importance value $VIM_j$ is extracted. The correlation coefficient between each feature and the result label is then normalized, with $Index_j$ denoting the normalized correlation coefficient. Finally the feature index (index of variables), denoted IOV, is calculated; its mathematical formula is formula (12):

The threshold for the feature index in the fourth step is set to 0.1. Features whose feature index is greater than 0.1 are selected as the features for training the final model; in this way 9 of the 13 features of the open-source heart disease data set can be selected.
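The selection step can be sketched in standard-library Python. Because formula (12) is not reproduced in this text, the sketch ASSUMES that the feature index is the product of the importance value and the absolute correlation with the label; that combination rule, the function names and all numbers below are hypothetical. Only the min-max normalization and the 0.1 threshold come from the description.

```python
def min_max(values):
    """Scale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def select_features(importance, correlation, threshold=0.1):
    """Return (normalized feature index per feature, selected features in rank order)."""
    names = list(importance)
    # ASSUMPTION: combine importance and |correlation| by multiplication.
    raw = [importance[f] * abs(correlation[f]) for f in names]
    iov = dict(zip(names, min_max(raw)))
    ranked = sorted(names, key=iov.get, reverse=True)
    return iov, [f for f in ranked if iov[f] > threshold]

# Hypothetical importance values and label correlations for four features.
importance = {"cp": 0.40, "thalach": 0.30, "chol": 0.20, "fbs": 0.10}
correlation = {"cp": 0.45, "thalach": -0.40, "chol": 0.05, "fbs": 0.02}
iov, selected = select_features(importance, correlation)
```

With these made-up inputs only `cp` and `thalach` exceed the threshold; a feature that is important to the forest but nearly uncorrelated with the label, such as `chol` here, is dropped.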

As shown in fig. 2, the feature importance ranking of the method of the present invention is implemented as follows. First, an open-source heart disease data set is taken as the original data, and missing value handling, abnormal value handling, ordinal mapping or one-hot encoding of multi-category data, and standardization are applied to it. The constructed random forest model is then trained on the processed data to obtain the feature importances, which are sorted in descending order.

As shown in fig. 3, the feature correlation analysis of the method of the present invention is implemented as follows. Feature correlation analysis is carried out directly on the data processed in the first step using the Spearman method, and a heat map is generated, so that the correlations among features and between each feature and the result can be seen at a glance.

Fig. 4 shows a dual feature selection index of the method of the present invention, which is calculated based on the feature importance and the correlation of the features with the result tags shown in fig. 2 and 3, and can provide a great help for selecting features.

Fig. 5 compares the evaluation indexes of the method of the present invention with those of several other algorithms, and shows that the method has certain advantages in both accuracy and F1 score (calculated from precision and recall).

Fig. 6 shows the receiver operating characteristic (ROC) curve of the method of the present invention. The ROC curve is a useful tool for selecting classification models based on performance indexes such as the false positive rate and the true positive rate. The diagonal of the ROC plot corresponds to random guessing. From the ROC curve, the area under the curve (AUC) can be calculated to characterize the performance of the classification model.

Fig. 7 compares the AUC of the method of the present invention with that of several current algorithms: decision tree, support vector machine (SVM), GBDT, and majority voting (decision tree, support vector machine, K-nearest neighbor). AUC is an important evaluation index in machine learning. The comparison shows that the heart disease prediction method based on dual feature selection and the XGBoost algorithm has the highest AUC. SVM, GBDT and majority voting differ by 0.01 in their results, whereas the method of the invention improves markedly on each of the other algorithms, indicating better performance and reliability.

In the heart disease prediction method based on dual feature selection and the XGBoost algorithm, feature importance is determined with a random forest, feature correlation analysis is carried out, and a new feature evaluation measure, the feature index, is calculated. Features are selected according to the feature index, which effectively reduces the number of features used. The selected features are then used to train the prediction model. This overcomes the poor accuracy and large feature counts of the prior art, and the method has high application value.

Examples

The heart disease prediction method based on dual feature selection and the XGBoost algorithm comprises the following specific steps:

First, the data are preprocessed. The original heart disease data set has problems such as missing data, abnormal data, and features with multiple categories, so missing values are filled, abnormal records are deleted, multi-category features are ordinally mapped or one-hot encoded, and the data are standardized. The standardization formula is:

$$x' = \frac{x - \mu_x}{\sigma_x} \tag{2}$$

In formula (2), $\mu_x$ and $\sigma_x$ are respectively the mean and standard deviation of a given feature over the samples. This completes the data processing;

Secondly, feature importance is evaluated with a random forest. A random forest comprises multiple decision trees, and each node in a decision tree is a condition on a certain feature. These conditions divide the data into two parts so that the samples in each part tend to belong to the same class, and the process is repeated to classify the data.

The idea of the evaluation is to judge how much each feature contributes in each tree of the random forest and then take the average; this average is the importance of the feature.

The contribution is measured with the Gini index. Let VIM denote the variable importance measure and GI the Gini index, and assume there are c features $x_1, x_2, x_3, \ldots, x_c$. The goal is to calculate the Gini importance score $VIM_j^{(Gini)}$ of each feature $x_j$.

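The Gini index is calculated as follows:

$$GI_m = \sum_{k=1}^{K} p_{mk}\,(1 - p_{mk}) = 1 - \sum_{k=1}^{K} p_{mk}^2 \tag{3}$$

In formula (3), K is the number of classes and $p_{mk}$ is the proportion of class k in node m.

The importance of feature $x_j$ at node m is the change in the Gini index before and after node m is split:

$$VIM_{jm}^{(Gini)} = GI_m - \frac{N_l}{N}\,GI_l - \frac{N_r}{N}\,GI_r \tag{4}$$

In formula (4), N is the number of samples in node m, $N_l$ and $N_r$ are the numbers of samples in the two new nodes after the split, and $GI_l$ and $GI_r$ are the Gini indices of the two new nodes.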

Then the importance of $x_j$ in the i-th tree is:

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)} \tag{5}$$

where M is the set of all nodes in decision tree i at which feature $x_j$ appears.

Finally, the scores are normalized:

$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{i=1}^{c} VIM_i^{(Gini)}} \tag{7}$$

In this way the importance of every feature is obtained, and features of high importance can be selected as required. This completes the feature importance ranking;

Thirdly, the Spearman rank correlation coefficient, denoted by the Greek letter ρ, is used to estimate the correlation between two variables X and Y, where the relationship between the variables can be described by a monotonic function. The value of ρ between two variables lies between -1 and +1.

The Spearman correlation coefficient is:

$$\rho = 1 - \frac{6 \sum_{i=1}^{N'} d_i^2}{N'\,({N'}^2 - 1)} \tag{8}$$

In formula (8), N' is the number of elements in each of the variables X and Y. X and Y are sorted (in ascending or descending order), and the i-th values of the two variables are denoted $X'_i$ and $Y'_i$; $x'_i$ and $y'_i$ are the ranks of $X'_i$ and $Y'_i$ in X and Y respectively, and $d_i = x'_i - y'_i$. This completes the correlation analysis of the features;

fourthly, according to the importance ranking of the features of the sample data set D obtained in the second step and the feature correlation analysis of the sample data set D obtained in the third step, double feature selection is carried out, and the most appropriate feature is selected; the specific operation is as follows:

Let r be the correlation coefficient; the relationship between the degree of correlation and the correlation coefficient is:

|r| > 0.95: significant correlation

|r| ≥ 0.8: high correlation

0.5 ≤ |r| < 0.8: moderate correlation

0.3 ≤ |r| < 0.5: low correlation

|r| < 0.3: the relationship is very weak and the variables are considered uncorrelated
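The correlation bands above can be expressed as a small lookup function. This is only an illustrative sketch: the function name and the returned labels are hypothetical, and the treatment of the exact boundary values (0.3, 0.5, 0.8) is an assumption, since the source does not state which side each boundary belongs to.

```python
def correlation_strength(r):
    """Map a correlation coefficient to a qualitative band based on |r|."""
    a = abs(r)
    if a > 0.95:
        return "significant"
    if a >= 0.8:   # ASSUMPTION: boundary values belong to the higher band
        return "high"
    if a >= 0.5:
        return "moderate"
    if a >= 0.3:
        return "low"
    return "negligible"

# Hypothetical coefficients, including a negative one: only the magnitude matters.
bands = [correlation_strength(r) for r in (0.97, -0.85, 0.6, 0.3, 0.1)]
```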

The second step yields the importance ranking of the features, and the third step yields their correlation analysis. First, the correlation coefficient between each feature and the result label is extracted and standardized according to formula (2); the standardized correlation coefficient is denoted Index. The feature index (index of variables), denoted IOV, is then obtained:

After the feature indices are calculated, they are normalized and sorted. The threshold for the feature index is set to 0.1, and features whose feature index is greater than 0.1 are selected as the features for training the final model. The selected features thus have both a certain importance and a certain correlation with the result, which completes the feature selection.

Normalization here means scaling values to the interval [0,1], a special case of min-max scaling. The normalization formula is:

$$IOV_j^{*} = \frac{IOV_j - IOV_{j\text{-}min}}{IOV_{j\text{-}max} - IOV_{j\text{-}min}} \tag{1}$$

In formula (1), $IOV_j$ is the feature index of a given feature, and $IOV_{j\text{-}max}$ and $IOV_{j\text{-}min}$ are respectively the maximum and minimum values of the feature index.

Fifthly, the scikit-learn machine learning framework is set up on Windows and a heart disease prediction model based on the XGBoost algorithm is built. The model is trained with 80% of the data set restricted to the features selected in the fourth step as the training set, and the remaining 20% is used as the test set to tune the parameters of and test the trained model, yielding a heart disease prediction result.
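The 80%/20% partition can be sketched with the Python standard library alone. The function name, the fixed random seed and the integer stand-ins for samples are illustrative assumptions; a machine learning framework would normally provide an equivalent utility.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # hypothetical sample identifiers
train, test = split_train_test(rows, test_fraction=0.2)
```

Shuffling before splitting avoids a test set that reflects only the order in which records were collected; the two parts are disjoint and together cover every sample.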

XGBoost is a tree ensemble model that uses the sum of the predictions of K trees (K being the total number of trees) as the prediction for a sample.

For a given data set with N samples, each with m features:

$$D = \{(X_i, y_i)\} \quad (|D| = N,\ X_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}) \tag{9}$$

In formula (9), $X_i$ denotes the i-th sample and $y_i$ the label of the i-th sample. The prediction of the XGBoost model is:

$$\hat{y}_i = \phi(X_i) = \sum_{k=1}^{K} f_k(X_i),\quad f_k \in F,\qquad F = \{ f(x) = w_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T) \tag{10}$$

In formula (10), $\hat{y}_i$ is the predicted value of the model, K is the number of trees, $f_k$ is the k-th tree model, q is the structure of a tree, which maps each sample to its corresponding leaf node, $w_{q(x)}$ is the score of the leaf node to which structure q assigns x (w being the set of scores of all leaf nodes of the tree), and T is the number of leaf nodes of each tree.

As formula (10) shows, the predicted value of XGBoost is the sum of the predicted values of the individual trees, i.e. the sum of the scores of the corresponding leaf nodes of each tree. The goal is to learn these K tree models f(x). To learn the model f(x), the following objective function is defined:

$$\psi(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),\qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2 \tag{11}$$

In formula (11), the first term on the right-hand side is the loss function term, i.e. the training error, where l is a differentiable convex function; the second term is the regularization term, i.e. the sum of the complexities of the trees, which controls the complexity of the model and prevents overfitting. The goal is to find the model f(x) that minimizes ψ(φ). γ and λ are coefficients that require tuning in practical applications.
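The additive prediction and the regularized objective can be made concrete with a small standard-library Python sketch. It assumes the tree complexity penalty takes the form γT + ½λ‖w‖², as in the published description of XGBoost, and uses the logistic loss as one example of a differentiable convex loss; the two hand-written stumps and the sample records are hypothetical and are not trained models.

```python
import math

def predict_margin(trees, x):
    """Additive model: sum of the leaf scores f_k(x) over all K trees."""
    return sum(tree(x) for tree in trees)

def tree_complexity(leaf_weights, gamma, lam):
    """ASSUMED penalty: gamma * (number of leaves) + 0.5 * lambda * sum of squared scores."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def log_loss(margin, y):
    """Logistic loss for a label y in {0, 1} given a raw margin (one convex choice)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def objective(trees, leaves, data, gamma=1.0, lam=1.0):
    """Training loss over all samples plus the complexity of every tree."""
    loss = sum(log_loss(predict_margin(trees, x), y) for x, y in data)
    penalty = sum(tree_complexity(w, gamma, lam) for w in leaves)
    return loss + penalty

# Two hypothetical decision stumps and their leaf scores.
trees = [
    lambda x: 0.8 if x["cp"] > 0 else -0.6,
    lambda x: 0.5 if x["thalach"] < 150 else -0.4,
]
leaves = [[0.8, -0.6], [0.5, -0.4]]
data = [({"cp": 2, "thalach": 140}, 1), ({"cp": 0, "thalach": 170}, 0)]

margin = predict_margin(trees, data[0][0])
total = objective(trees, leaves, data)
```

Raising γ or λ increases the penalty for trees with many leaves or large leaf scores, which is how the regularization term discourages overfitting.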

In the above examples, random forests, feature correlation analysis and scikit-learn are well known in the art.

Matters not described in this specification are applicable to the prior art.

Claims

1. A heart disease prediction method based on dual feature selection and the XGBoost algorithm, characterized by comprising the following specific steps:

fifthly, setting up the scikit-learn machine learning framework on Windows, building a heart disease prediction model based on the XGBoost algorithm, training the heart disease prediction model with 70-90% of the data set restricted to the features selected in the fourth step as the training set, and tuning the parameters of and testing the trained heart disease prediction model with the remaining 10-30% as the test set, to obtain a heart disease prediction result;

XGBoost is a tree ensemble model that uses the sum of the predictions of the K trees as the prediction for a sample;

for a given data set with N samples, each with m features:

$$D = \{(X_i, y_i)\} \quad (|D| = N,\ X_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}) \tag{9}$$

in formula (9), $X_i$ denotes the i-th sample and $y_i$ the label of the i-th sample; the prediction of the XGBoost model is:

$$\hat{y}_i = \phi(X_i) = \sum_{k=1}^{K} f_k(X_i),\quad f_k \in F,\qquad F = \{ f(x) = w_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T) \tag{10}$$

in formula (10), $\hat{y}_i$ is the predicted value of the model, K is the number of trees, $f_k$ is the k-th tree model, q is the structure of a tree, which maps each sample to its corresponding leaf node, $w_{q(x)}$ is the score of the leaf node to which structure q assigns x, w being the set of scores of all leaf nodes of the tree, and T is the number of leaf nodes of each tree;

as formula (10) shows, the predicted value of XGBoost is the sum of the predicted values of the individual trees, i.e. the sum of the scores of the corresponding leaf nodes of each tree; the goal is to learn these K tree models f(x); to learn the model f(x), the following objective function is defined:

$$\psi(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),\qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2 \tag{11}$$

in formula (11), the first term on the right-hand side is the loss function term, i.e. the training error, where l is a differentiable convex function, and the second term is the regularization term, i.e. the sum of the complexities of the trees, which controls the complexity of the model and prevents overfitting; the goal is to find the model f(x) that minimizes ψ(φ); γ and λ are coefficients;

the open-source heart disease data set is preprocessed in the first step as follows: to address missing data, abnormal data, and features with multiple categories in the open-source heart disease data set, missing values are filled, abnormal records are deleted, multi-category features are ordinally mapped or one-hot encoded, and the data are standardized;

standardization rescales each feature to mean 0 and variance 1, the scale of a standard normal distribution; the standardization formula is:

$$x' = \frac{x - \mu_x}{\sigma_x} \tag{2}$$

in formula (2), $\mu_x$ and $\sigma_x$ are respectively the mean and standard deviation of a given feature over the samples;

the importance of the features in the second step is calculated as follows: the random forest comprises a plurality of decision trees, and each node in a decision tree is a condition on a certain feature; the condition divides the data into two parts so that the samples in each part tend to belong to the same class, and this process is repeated to classify the data; the contribution of each feature in each tree of the random forest is evaluated and then averaged, and this average is the importance of the feature; the contribution is calculated with the Gini index, the variable importance score is denoted by VIM and the Gini index by GI; assuming there are c features x₁, x₂, x₃, ..., x_c in total, the Gini index score VIM_j of each feature x_j is calculated,

namely the average change in node-splitting impurity of the j-th feature over all decision trees in the random forest;

the formula for calculating the Gini index is as follows:

$$GI_m = 1 - \sum_{k=1}^{K} p_{mk}^{2} \qquad (3)$$

in the above formula (3), K is the number of classes, and p_mk represents the proportion of class k in node m;

in the above formula (4), N represents the number of samples in the node, N_l and N_r respectively represent the numbers of samples in the two new nodes after branching, and GI_l and GI_r respectively represent the Gini indices of the two new nodes after branching;
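The Gini index of a node and its change at a split can be sketched as follows. Weighting the Gini indices of the two new nodes by N_l/N and N_r/N is an assumption about the form of formula (4), chosen because it is a common convention; the function names are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_decrease(parent, left, right):
    """Change in Gini index when a node is split into two new nodes.

    The child indices are weighted by their share of the samples (assumed weighting).
    """
    n = len(parent)
    weighted_children = (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)
    return gini_index(parent) - weighted_children
```

A split that separates the classes perfectly removes all impurity, so its decrease equals the Gini index of the parent node.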

then the importance of x_j in the i-th tree is:

$$VIM_{ij} = \sum_{m \in M} VIM_{jm} \qquad (5)$$

in the above formula (5), VIM_jm is the importance of x_j at node m, and M is the set of all nodes of decision tree i at which the feature x_j appears;

and finally, the importance scores are normalized:

the importance of all the features is obtained according to the above method, and the features are ranked by importance;
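The aggregation over trees and the ranking can be sketched as follows. The sketch assumes that each tree is given as a list of (feature index, Gini decrease) pairs, one per split node, and that the final normalization divides each score by the sum of all scores; both the representation and the normalization are assumptions, and the function names are hypothetical.

```python
def feature_importance(forest, num_features):
    """Sum each feature's Gini decrease within every tree, average over trees, then normalize.

    `forest` is a list of trees; each tree is a list of (feature_index, decrease) pairs.
    Normalizing by the sum of all scores is an assumption.
    """
    totals = [0.0] * num_features
    for tree in forest:
        for feature_index, decrease in tree:
            totals[feature_index] += decrease
    averaged = [total / len(forest) for total in totals]
    scale = sum(averaged)
    return [value / scale for value in averaged]


def rank_features(importance):
    """Return the feature indices ordered from most to least important."""
    return sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
```

With this normalization the importance scores sum to 1, which makes the scores of different features directly comparable.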

the third step comprises the following specific process: the Spearman rank correlation coefficient, denoted by the Greek letter ρ, is used to estimate the correlation between two variables X and Y, the correlation between the variables being described by a monotonic function; the value of ρ between the two variables lies between -1 and +1;

the Spearman correlation coefficient is:

$$\rho = 1 - \frac{6 \sum_{i=1}^{N'} d_i^{2}}{N'\left((N')^{2} - 1\right)} \qquad (8)$$

in the above formula (8), N' is the number of elements in each of the variables X and Y; X and Y are sorted, the i-th values of the two random variables are denoted by X'_i and Y'_i respectively, x'_i and y'_i are the ranks of X'_i and Y'_i in X and Y respectively, and d_i = x'_i - y'_i; the correlation analysis of the features is thereby completed;
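The rank-difference calculation described for formula (8) can be sketched as follows. The sketch assumes that neither variable contains tied values, since the rank-difference form of the Spearman coefficient is exact only in that case; the function names are hypothetical.

```python
def ranks(values):
    """Rank of each value within its variable, starting at 1 (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for position, index in enumerate(order, start=1):
        rank[index] = position
    return rank


def spearman(x, y):
    """Spearman rank correlation from the squared rank differences d_i."""
    n = len(x)
    squared_diffs = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))
```

In the method, one of the two variables would be a feature column and the other the result label, giving one coefficient per feature.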

the fourth step comprises the following specific process: according to the feature importance ranking of the sample data set D obtained in the second step and the feature correlation analysis of the sample data set D obtained in the third step, the feature importance value of each feature is extracted, and the correlation coefficient between each feature and the result label is extracted and standardized according to formula (2); a feature index is calculated for each feature from the extracted feature importance value and the correlation between the feature and the result label; the feature indexes are then normalized and ranked, a threshold is set, and the features are selected according to the threshold, thereby completing the feature selection;

the specific method for calculating the feature index is as follows:

first, the feature importance value VIM_j and the correlation value between each feature and the result label are extracted; the correlation value is then standardized, with index_j denoting the standardized correlation coefficient; the feature index, denoted by IOV, is then calculated according to the following formula (12):

the normalization refers to scaling the values of the features to the interval [0,1], and the normalization formula is:

$$x_{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

2. The heart disease prediction method based on dual feature selection and the XGboost algorithm as claimed in claim 1, wherein the threshold set for selecting features by the feature index is 0.1, the features whose feature index is larger than 0.1 are selected as the features for training the final model, and 9 features are selected from the 13 features of the open-source heart disease data set.
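The normalization and threshold selection can be sketched as follows. The sketch takes already computed feature index values as input, assumes min-max scaling to the interval [0,1], and uses the threshold 0.1 stated above; the function names are hypothetical, and any feature names or index values passed in are illustrative rather than taken from the data set.

```python
def min_max_normalize(values):
    """Scale the values linearly to the interval [0, 1] (min-max scaling is an assumption)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def select_features(names, feature_index, threshold=0.1):
    """Normalize the feature indexes, rank them, and keep the features above the threshold."""
    normalized = min_max_normalize(feature_index)
    ranked = sorted(zip(names, normalized), key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ranked if score > threshold]
```

Because the comparison is strict, a feature whose normalized index equals the threshold is not selected, and the feature with the smallest index always normalizes to 0 and is dropped.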