CN111243751B - Heart disease prediction method based on dual feature selection and XGboost algorithm - Google Patents

Publication number
CN111243751B
Authority
CN
China
Prior art keywords
feature
characteristic
importance
heart disease
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010052452.XA
Other languages
Chinese (zh)
Other versions
CN111243751A (en)
Inventor
孙昊
崔子超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology
Priority to CN202010052452.XA
Publication of CN111243751A
Application granted
Publication of CN111243751B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Abstract

The invention discloses a heart disease prediction method based on dual feature selection and an XGboost algorithm.

Description

Heart disease prediction method based on dual feature selection and XGboost algorithm
Technical Field
The invention belongs to the technical field of medical data analysis, and particularly relates to a heart disease prediction method based on dual feature selection and an XGboost algorithm.
Background
Heart disease is a common and serious cardiovascular disease. Cardiovascular diseases are among the greatest threats to health in China and worldwide, and they place a heavy burden on China's medical system. The "Global Burden of Disease Study 2013", published in The Lancet, evaluated patient mortality in 190 countries between 1990 and 2013. Coronary heart disease, chronic lung disease, and stroke were the three leading fatal diseases in China, with a combined death rate of up to 46% in that year, and the number continues to rise. A heart disease prediction model can be trained on existing medical data to provide health guidance for patients. Current methods for predicting heart disease include:
1. A support vector machine (SVM) is used to predict whether a patient has heart disease. A.Gavhane et al. proposed a heart disease prediction model based on a support vector machine (A.Gavhane, G.Kokkula, I.Pandya, P.K.Dedevakar. Prediction of Heart Disease Using Machine Learning [C]. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2018: 1275-.). However, choosing a kernel function for an SVM is difficult, and training the model consumes a large amount of memory and time.
2. A decision tree algorithm is used to train a model that predicts whether a patient is ill. A.J.Aljaaf et al. proposed prediction with a decision tree model (A.J.Aljaaf; D.Al-Jumeily; A.J.Hussain; T.Dawson; P.Fergus; M.Al-Jumaily. Predicting the likelihood of heart failure with a multi level risk assessment using decision tree [C]. 2015 Third International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), 2015: 101-106.), but decision trees are easily affected by outliers and are prone to overfitting.
3. Several simple classifiers are combined by majority voting. M.Shouman et al. (M.Shouman, T.Turner, R.Stocker. Using data mining techniques in heart disease diagnosis and treatment [C]. Japan-Egypt Conference on Electronics, Communications and Computers, 2012: 173-177.) combined decision tree, Bayesian classifier, and support vector machine algorithms and trained a new classifier from their votes, but the result did not generalize better.
4. Wuhaxing discloses a heart disease risk prediction system (CN 109377470 A) that takes medical images of the patient's heart, performs recognition and classification, myocardial shape feature vector extraction, electrocardiogram feature extraction, and similar steps on cardiac ultrasound videos, and then trains a deep neural network on these features. However, the system requires a large number of cardiac ultrasound cases, feature recognition and extraction are difficult, training places high demands on hardware, and the system as a whole is hard to implement.
5. Patent CN 110265146 A proposes a heart disease prediction method based on the Bagging-Fuzzy-GBDT algorithm, which fuzzifies part of the data using fuzzy logic, combines the fuzzified data with the GBDT algorithm, and resamples m times with the Bagging algorithm to increase data diversity. However, the GBDT algorithm uses only classification and regression trees (CART) as the base classifier; compared with the XGBoost algorithm, its choice of base classifier is limited, it cannot handle missing values, and it overfits easily.
The XGBoost (eXtreme Gradient Boosting) algorithm was proposed by Tianqi Chen of the University of Washington and has attracted wide attention for its efficiency and accuracy. The algorithm allows different base classifiers to be selected, adds a regularization term to the loss function, and prunes trees, which reduces the risk of overfitting.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a heart disease prediction method based on dual feature selection and the XGBoost algorithm, which improves prediction accuracy and model generalization while reducing training time.
The technical problem is solved by the following technical scheme: a heart disease prediction method based on dual feature selection and the XGBoost algorithm, characterized by the following specific steps:
firstly, preprocessing an open-source heart disease data set to obtain a sample data set D of size N;
secondly, evaluating the importance of the features of the sample data set D with a random forest algorithm, which gives the importance of every feature and ranks the features by importance;
thirdly, performing feature correlation analysis on the sample data set D using the Spearman rank correlation coefficient;
fourthly, performing dual feature selection according to the feature importance ranking obtained in the second step and the feature correlation analysis obtained in the third step, and selecting the most suitable features;
and fifthly, setting up the scikit-learn machine learning framework on Windows, building a heart disease prediction model based on the XGBoost algorithm, training the model with 70-90% of the data set restricted to the features selected in the fourth step as the training set, and tuning the parameters of and testing the trained model with the remaining 10-30% as the test set to obtain the heart disease prediction result.
The invention has the beneficial effects that: compared with the prior art, the invention has the following outstanding characteristics:
(1) A support vector machine (SVM) has been used to predict whether a patient has heart disease. A.Gavhane et al. proposed a heart disease prediction model based on a support vector machine (A.Gavhane, G.Kokkula, I.Pandya, P.K.Dedevakar. Prediction of Heart Disease Using Machine Learning [C]. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2018: 1275-.). However, choosing a kernel function for an SVM is difficult, and training the model consumes a large amount of memory and time. The technical scheme of the invention predicts heart disease with the XGBoost algorithm, so the two methods differ substantially.
(2) A.J.Aljaaf et al. (A.J.Aljaaf; D.Al-Jumeily; A.J.Hussain; T.Dawson; P.Fergus; M.Al-Jumaily. Predicting the likelihood of heart failure with a multi level risk assessment using decision tree [C]. 2015 Third International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), 2015: 101-106.) proposed prediction with a decision tree model, but decision trees are easily affected by outliers and are prone to overfitting. M.Shouman et al. (M.Shouman, T.Turner, R.Stocker. Using data mining techniques in heart disease diagnosis and treatment [C]. Japan-Egypt Conference on Electronics, Communications and Computers, 2012: 173-177.) combined decision tree, Bayesian classifier, and support vector machine algorithms and trained a new classifier from their votes, but the result did not generalize better. The technical scheme of the invention predicts heart disease with the XGBoost algorithm. XGBoost is a Boosting algorithm and a boosted tree model: it integrates many tree models into one strong classifier, which greatly improves performance.
(3) The earlier patent CN109377470A proposes a heart disease risk prediction system that performs recognition and classification, myocardial shape feature vector extraction, electrocardiogram feature extraction, and similar steps on cardiac ultrasound videos. It requires a large number of cases, feature recognition and extraction are difficult, training places high demands on hardware, and the system as a whole is hard to implement. In addition, patent CN110265146A proposes a heart disease prediction method based on the Bagging-Fuzzy-GBDT algorithm. To overcome the defects of CN109377470A and substantially advance heart disease prediction, the invention develops a new heart disease prediction method based on dual feature selection and the XGBoost algorithm. It selects features using a random forest algorithm and feature correlation analysis, which avoids the complex recognition, classification, and feature extraction steps that CN109377470A needs when processing videos. Compared with the GBDT algorithm used by CN110265146A, the XGBoost algorithm improves on GBDT by adding a regularization term to the cost function. The term controls model complexity and reduces model variance, so the learned model is simpler and more robust and overfitting is prevented.
Compared with the prior art, the invention has the following remarkable improvements:
(1) The XGBoost algorithm is a Boosting algorithm that integrates many tree models into one strong classifier, and it improves on GBDT. XGBoost supports user-defined objective and evaluation functions, which increases flexibility; it adds regularization to the cost function to prevent overfitting; and it first builds every subtree that can be built from the top down and then prunes in reverse from the bottom up, so it is less likely to become trapped in a local optimum.
(2) In the heart disease prediction method based on dual feature selection and the XGBoost algorithm, the original data are preprocessed and then analyzed with a random forest algorithm and feature correlation analysis. A feature index is calculated from the importance ranking of the features and the correlation between each feature and the sample label, and the features used for model training are selected according to this index. This overcomes the defects of existing heart disease prediction methods, which need many features and lack accuracy.
Drawings
FIG. 1 is a flow chart of the steps of one embodiment of the method of the present invention;
FIG. 2 is a graph of feature importance ranking based on a random forest algorithm according to an embodiment of the method of the present invention;
FIG. 3 is a feature correlation analysis heat map of one embodiment of the method of the present invention;
FIG. 4 is a graph of dual feature selection indices for one embodiment of the method of the present invention;
FIG. 5 is a graph showing comparison of evaluation indexes in different methods;
FIG. 6 is a ROC plot of the method of the present invention;
FIG. 7 is a graph comparing ROC curves for different methods;
In the figures: age: age; sex: gender (1 = male; 0 = female); cp: chest pain type; trestbps: resting blood pressure (in mm Hg on admission); chol: serum cholesterol (mg/dl); fbs: fasting blood sugar > 120 mg/dl; restecg: resting electrocardiogram results; thalach: maximum heart rate achieved; exang: exercise-induced angina; oldpeak: ST depression induced by exercise relative to rest; slope: slope of the peak exercise ST segment; ca: number of major vessels (0-3) colored by fluoroscopy; thal: thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect);
Decision Tree: decision tree; SVM: support vector machine; GBDT: gradient boosting decision tree; Majority vote: majority voting (decision tree, support vector machine, K-nearest neighbors); XGBoost: the method of the invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, the invention provides a heart disease prediction method based on dual feature selection and the XGBoost algorithm, characterized by the following specific steps:
firstly, preprocessing an open-source heart disease data set to obtain a sample data set D of size N;
secondly, evaluating the importance of the features of the sample data set D with a random forest algorithm, which gives the importance of every feature and ranks the features by importance;
thirdly, performing feature correlation analysis on the sample data set D using the Spearman rank correlation coefficient;
fourthly, performing dual feature selection according to the feature importance ranking obtained in the second step and the feature correlation analysis obtained in the third step, and selecting the most suitable features;
and fifthly, setting up the scikit-learn machine learning framework on Windows, building a heart disease prediction model based on the XGBoost algorithm, training the model with 70-90% of the data set restricted to the features selected in the fourth step as the training set, and tuning the parameters of and testing the trained model with the remaining 10-30% as the test set to obtain the heart disease prediction result.
The open-source heart disease data set is preprocessed in the first step as follows: the data set contains missing data, abnormal data, and features with multiple categories, so missing data are filled, abnormal data are deleted, multi-category data are converted by ordered mapping or one-hot encoding, and the data are standardized.
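The preprocessing operations named above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the toy records, the median fill, the plausible range for `chol`, and the helper names are all assumptions introduced for the example.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"chol": 233.0, "cp": "typical", "slope": "up"},
    {"chol": None, "cp": "atypical", "slope": "flat"},
    {"chol": 250.0, "cp": "non-anginal", "slope": "down"},
    {"chol": 5000.0, "cp": "typical", "slope": "up"},  # implausible value
    {"chol": 204.0, "cp": "typical", "slope": "flat"},
]

def drop_abnormal(rows, key, low, high):
    """Delete rows whose value lies outside [low, high]; missing values are kept."""
    return [r for r in rows if r[key] is None or low <= r[key] <= high]

def fill_missing(rows, key):
    """Replace missing values of a numeric feature with the column median (assumed strategy)."""
    fill = median(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def ordered_map(rows, key, order):
    """Map an ordinal category to its position in `order`."""
    return [{**r, key: order.index(r[key])} for r in rows]

def one_hot(rows, key):
    """Expand a nominal category into 0/1 indicator columns."""
    categories = sorted({r[key] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != key}
        for c in categories:
            new_row[f"{key}_{c}"] = 1 if r[key] == c else 0
        encoded.append(new_row)
    return encoded

clean = drop_abnormal(rows, "chol", 100.0, 600.0)
clean = fill_missing(clean, "chol")
clean = ordered_map(clean, "slope", ["down", "flat", "up"])
clean = one_hot(clean, "cp")
```

Ordered mapping suits categories with a natural order, while one-hot encoding avoids imposing an order on nominal categories.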
Standardization rescales each feature sequence so that its mean is 0 and its variance is 1. The standardization formula is:
$$x^{*} = \frac{x - \mu_x}{\sigma_x} \tag{2}$$
In formula (2), $\mu_x$ and $\sigma_x$ are the mean and standard deviation of a given feature of the samples.
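As a small worked example of formula (2), the sketch below standardizes one feature column. The sample values are hypothetical, and using the population standard deviation is an assumption, since the text does not say which estimator is meant.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Apply formula (2): subtract the mean and divide by the standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - mu) / sigma for v in values]

trestbps = [120.0, 130.0, 140.0, 150.0, 160.0]  # hypothetical resting blood pressures
z = standardize(trestbps)
```

After the transformation the column has mean 0 and standard deviation 1, which puts features measured in different units on a comparable scale.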
The importance of the features is calculated in the second step as follows. A random forest contains many decision trees, and each node in a decision tree is a condition on one feature. The condition divides the data into two parts so that similar samples fall into the same part, and the process is repeated to classify the data. The contribution of each feature to each tree in the random forest is evaluated and then averaged, and this value is the importance of the feature. The contribution is calculated with the Gini index. The variable importance score (variable importance measure) is denoted VIM and the Gini index is denoted GI. Assuming there are c features $x_1, x_2, x_3, \ldots, x_c$ in total, the Gini index score $VIM_j^{(Gini)}$ of each feature $x_j$ is calculated, i.e. the average change in node-splitting impurity caused by the j-th feature over all decision trees of the random forest.
The Gini index is calculated as follows:
$$GI_m = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^{2} \tag{3}$$
In formula (3), K is the number of classes and $p_{mk}$ is the proportion of class k in node m.
The importance of feature $x_j$ at node m is the change in the Gini index before and after node m branches:
$$VIM_{jm}^{(Gini)} = GI_m - \frac{N_l}{N} GI_l - \frac{N_r}{N} GI_r \tag{4}$$
In formula (4), N is the number of samples in the node, $N_l$ and $N_r$ are the numbers of samples in the two new nodes after branching, and $GI_l$ and $GI_r$ are the Gini indices of the two new nodes after branching.
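The Gini index of a node and the importance gained by one split can be sketched as below. The class labels are hypothetical, and weighting each child by its share of the parent's samples is an assumed reading of formula (4) based on the symbols N, N_l and N_r defined in the text.

```python
from collections import Counter

def gini(labels):
    """Formula (3): GI = 1 - sum over classes of p_k^2 for the labels in one node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def node_importance(parent, left, right):
    """Gini decrease of one split, children weighted by sample share (assumed form of formula (4))."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

parent = [1, 1, 1, 1, 0, 0, 0, 0]             # 4 diseased, 4 healthy
left, right = [1, 1, 1, 0], [1, 0, 0, 0]      # children after a hypothetical split
decrease = node_importance(parent, left, right)
```

Here the parent node is maximally mixed (Gini index 0.5), each child has Gini index 0.375, and the split therefore reduces impurity by 0.125.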
The importance of $x_j$ in the i-th tree is then:
$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)} \tag{5}$$
In formula (5), the set M is the set of all nodes in decision tree i at which feature $x_j$ appears.
Assuming there are n trees in the random forest, the importance of feature $x_j$ is:
$$VIM_{j}^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)} \tag{6}$$
Finally, the scores are normalized:
$$VIM_j = \frac{VIM_{j}^{(Gini)}}{\sum_{j'=1}^{c} VIM_{j'}^{(Gini)}} \tag{7}$$
With this method the importance of every feature is obtained, and the features are ranked by importance.
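The aggregation over nodes and trees followed by normalization can be sketched as below. The per-node Gini decreases are hypothetical numbers, and summing (rather than averaging) over trees is an assumption that does not affect the result, because the final normalization removes any constant factor.

```python
# Hypothetical Gini decreases: one dict per tree, mapping a feature to the
# decreases recorded at every node where that feature was used for a split.
forest = [
    {"cp": [0.20, 0.05], "thalach": [0.10], "fbs": []},
    {"cp": [0.15], "thalach": [0.10, 0.10], "fbs": [0.05]},
]

def feature_importance(forest):
    """Sum node decreases within each tree, sum over trees, then normalize to total 1."""
    totals = {}
    for tree in forest:
        for feature, decreases in tree.items():
            totals[feature] = totals.get(feature, 0.0) + sum(decreases)
    grand_total = sum(totals.values())
    return {feature: value / grand_total for feature, value in totals.items()}

vim = feature_importance(forest)
ranking = sorted(vim, key=vim.get, reverse=True)  # most important feature first
```

In practice a library implementation would normally supply these scores; for example, scikit-learn's `RandomForestClassifier` exposes impurity-based scores through its `feature_importances_` attribute.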
The third step proceeds as follows. The Spearman rank correlation coefficient, denoted by the Greek letter ρ, estimates the correlation between two variables X and Y, where the relationship between the variables can be described by a monotonic function. The value of ρ between the two variables lies between -1 and +1.
The Spearman correlation coefficient is:
$$\rho = 1 - \frac{6 \sum_{i=1}^{N'} d_i^{2}}{N'\left(N'^{2} - 1\right)} \tag{8}$$
In formula (8), N' is the number of elements in each of the variables X and Y. X and Y are sorted, and the i-th values of the two random variables are denoted $X'_i$ and $Y'_i$; $x'_i$ and $y'_i$ are the ranks of $X'_i$ and $Y'_i$ in X and Y, respectively, and $d_i = x'_i - y'_i$. This completes the correlation analysis of the features.
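A minimal sketch of formula (8) follows. It assumes there are no tied values, which is the condition under which the d_i-based formula is exact; the sample values are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Formula (8): rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

age = [29, 41, 54, 63, 70]            # hypothetical values
thalach = [202, 172, 160, 150, 143]   # strictly decreasing as age increases
rho = spearman(age, thalach)
```

Because `thalach` falls monotonically as `age` rises in this toy data, the ranks are exactly reversed and ρ equals -1. With tied values, averaged ranks and the Pearson correlation of the ranks would be needed instead, as library routines such as `scipy.stats.spearmanr` do.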
The fourth step proceeds as follows. From the feature importance ranking obtained in the second step and the feature correlation analysis obtained in the third step, the feature importance values are extracted, and the correlation coefficient between each feature and the result label is extracted and normalized according to formula (2). The feature index of each feature is calculated from its importance value and its correlation with the result label. The feature indices are then normalized and sorted, a threshold is set, and the features are selected, which completes the feature selection.
The feature index in the fourth step is calculated as follows:
First, the feature importance value $VIM_j$ and the correlation value between each feature and the result label are extracted. The correlation value is normalized, and the normalized correlation coefficient is denoted $Index_j$. The feature index (index of variables), denoted IOV, is then calculated as shown in formula (12):
[Formula (12): the feature index $IOV_j$ computed from $VIM_j$ and $Index_j$; the equation image is not reproduced in this text.]
The threshold for the feature index in the fourth step is 0.1. Features whose feature selection index is greater than 0.1 are selected as the features of the final training model; 9 of the 13 features of the open-source heart disease data set can be selected in this way.
As shown in fig. 2, feature importance ranking in the method of the invention is implemented as follows: the open-source heart disease data set is taken as the original data and undergoes missing value processing, abnormal value processing, ordered mapping or one-hot encoding of multi-category data, and standardization. The constructed random forest model is trained on the processed data to obtain the feature importance, and the features are sorted by importance in descending order.
As shown in fig. 3, feature correlation analysis in the method of the invention is implemented as follows: feature correlation analysis is performed directly on the data processed in the first step using the Spearman method, and a heat map is generated so that the correlation among features and the correlation between features and the result can be seen at a glance.
Fig. 4 shows the dual feature selection index of the method of the invention. It is calculated from the feature importance shown in fig. 2 and the correlation between features and the result label shown in fig. 3, and it is a useful aid for selecting features.
Fig. 5 compares the evaluation metrics of the method of the invention with those of several other algorithms. It shows that the method of the invention has an advantage in both accuracy and F1 score (computed from precision and recall).
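The metrics compared in fig. 5 can be computed from a confusion matrix as sketched below. The labels and predictions are hypothetical and are not results from the patent.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

labels = [1, 1, 1, 1, 0, 0, 0, 0]        # hypothetical ground truth
predictions = [1, 1, 1, 0, 1, 0, 0, 0]   # hypothetical model output
accuracy, precision, recall, f1 = classification_metrics(labels, predictions)
```

With three true positives, one false positive, one false negative and three true negatives, all four metrics equal 0.75 in this toy case.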
Fig. 6 shows the receiver operating characteristic (ROC) curve of the method of the invention. The ROC curve plots the true positive rate of the model against its false positive rate and is a useful tool for selecting classification models. The diagonal of the ROC plot corresponds to random guessing. From the ROC curve, the area under the curve (AUC) can be calculated to characterize the performance of the classification model.
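One way to compute the AUC without drawing the curve is to use its rank interpretation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses that equivalence; the labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]  # hypothetical predicted probabilities
auc = roc_auc(labels, scores)
```

Eight of the nine positive/negative pairs are ordered correctly here, so the AUC is 8/9. A value of 0.5 corresponds to the diagonal, i.e. random guessing.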
As shown in fig. 7, the AUC of the method of the invention is compared with that of several current algorithms: decision tree, support vector machine (SVM), GBDT, and majority voting (decision tree, support vector machine, K-nearest neighbors). AUC is an important evaluation metric in machine learning. The comparison shows that the heart disease prediction method based on dual feature selection and the XGBoost algorithm has the highest AUC. SVM, GBDT, and majority voting differ from it by 0.01, while the improvement over the other algorithms is considerably larger, which indicates better performance and reliability.
In the heart disease prediction method based on dual feature selection and the XGBoost algorithm, feature importance is determined with a random forest, feature correlation analysis is performed, and a new feature evaluation metric, the feature index, is calculated. Features are selected according to the feature index, which effectively reduces the number of features used. The selected features are then used to train the prediction model. This overcomes the poor accuracy and large number of features of the prior art and gives the method high application value.
Examples
The heart disease prediction method based on the dual feature selection and the XGboost algorithm comprises the following specific steps:
firstly, preprocessing an open-source heart disease data set to obtain a sample data set D with the size of N;
The preprocessing proceeds as follows: the original heart disease data set contains missing data, abnormal data, and features with multiple categories, so it requires missing-data filling, abnormal-data deletion, ordered mapping or one-hot encoding of multi-category data, and data standardization.
Standardization rescales each feature sequence so that its mean is 0 and its variance is 1. The standardization formula is:
$$x^{*} = \frac{x - \mu_x}{\sigma_x} \tag{2}$$
In formula (2), $\mu_x$ and $\sigma_x$ are the mean and standard deviation of a given feature of the samples. This completes the processing of the data;
secondly, judging the importance of the characteristics of the sample data set D through a random forest algorithm, wherein the random forest algorithm gives the importance of all the characteristics and sorts the importance of the characteristics;
the random forest comprises a plurality of decision trees, and each node in the decision trees is a condition of a certain characteristic. These conditions are used to divide the data into two parts, so that each part is classified into the same set, and the process is repeated to classify the data.
The evaluation idea is to judge how much each feature contributes to each tree in the random forest and then take the average value, and this value is the importance of the feature.
The contribution is calculated with the Gini index. The variable importance score (variable importance measure) is denoted VIM and the Gini index is denoted GI. Assuming there are c features $x_1, x_2, x_3, \ldots, x_c$ in total, the Gini index score $VIM_j^{(Gini)}$ of each feature $x_j$ is calculated, i.e. the average change in node-splitting impurity caused by the j-th feature over all decision trees of the random forest.
The Gini index is calculated as follows:
$$GI_m = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^{2} \tag{3}$$
In formula (3), K is the number of classes and $p_{mk}$ is the proportion of class k in node m.
The importance of feature $x_j$ at node m is the change in the Gini index before and after node m branches:
$$VIM_{jm}^{(Gini)} = GI_m - \frac{N_l}{N} GI_l - \frac{N_r}{N} GI_r \tag{4}$$
In formula (4), N is the number of samples in the node, $N_l$ and $N_r$ are the numbers of samples in the two new nodes after branching, and $GI_l$ and $GI_r$ are the Gini indices of the two new nodes after branching.
The importance of $x_j$ in the i-th tree is then:
$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)} \tag{5}$$
In formula (5), the set M is the set of all nodes in decision tree i at which feature $x_j$ appears.
Assuming there are n trees in the random forest, the importance of feature $x_j$ is:
$$VIM_{j}^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)} \tag{6}$$
Finally, the scores are normalized:
$$VIM_j = \frac{VIM_{j}^{(Gini)}}{\sum_{j'=1}^{c} VIM_{j'}^{(Gini)}} \tag{7}$$
With this method the importance of every feature is obtained, and features of high importance can be selected as required. This completes the feature importance ranking;
thirdly, performing feature correlation analysis on the sample data set D using the Spearman rank correlation coefficient;
This step uses the Spearman rank correlation coefficient, denoted by the Greek letter ρ, to estimate the correlation between two variables X and Y, where the relationship between the variables can be described by a monotonic function. The value of ρ between the two variables lies between -1 and +1.
The Spearman correlation coefficient is:
$$\rho = 1 - \frac{6 \sum_{i=1}^{N'} d_i^{2}}{N'\left(N'^{2} - 1\right)} \tag{8}$$
In formula (8), N' is the number of elements in each of the variables X and Y. X and Y are sorted (in ascending or descending order), and the i-th values of the two random variables are denoted $X'_i$ and $Y'_i$; $x'_i$ and $y'_i$ are the ranks of $X'_i$ and $Y'_i$ in X and Y, respectively, and $d_i = x'_i - y'_i$. This completes the correlation analysis of the features;
fourthly, performing dual feature selection according to the feature importance ranking obtained in the second step and the feature correlation analysis obtained in the third step, and selecting the most suitable features; the specific operation is as follows:
Let r be the correlation coefficient. The relationship between the degree of correlation and the correlation coefficient is:
|r| > 0.95: significant correlation
0.8 ≤ |r| ≤ 0.95: high correlation
0.5 ≤ |r| < 0.8: moderate correlation
0.3 ≤ |r| < 0.5: low correlation
|r| < 0.3: very weak relationship, considered uncorrelated
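The bands above amount to a simple lookup on |r|. The sketch below encodes them; the handling of the exact boundary values (0.3, 0.5, 0.8, 0.95) is an assumption, since the source text does not state clearly which side each boundary belongs to.

```python
def correlation_strength(r):
    """Map a correlation coefficient to a qualitative band (boundary handling is assumed)."""
    magnitude = abs(r)
    if magnitude > 0.95:
        return "significant"
    if magnitude >= 0.8:
        return "high"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.3:
        return "low"
    return "uncorrelated"

# Hypothetical feature-to-label coefficients.
bands = {name: correlation_strength(r) for name, r in
         {"cp": 0.45, "oldpeak": -0.43, "fbs": -0.03}.items()}
```

The sign of r is ignored because only the strength of the relationship matters for feature selection, not its direction.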
The second step gives the importance ranking of the features and the third step gives the correlation analysis of the features. First, the correlation coefficient between each feature and the result label is extracted and normalized according to formula (2); the normalized correlation coefficient is denoted Index. The feature index (index of variables), denoted IOV, is then obtained:
[Formula (12): the feature index $IOV_j$ computed from $VIM_j$ and $Index_j$; the equation image is not reproduced in this text.]
After the feature indices are obtained, they are normalized and sorted. The threshold of the feature selection index is set to 0.1, and features whose feature selection index is greater than 0.1 are selected as the features of the final training model. The selected features are both important and correlated with the result, which completes the feature selection.
The normalization refers to scaling the values of the features to the interval [0,1], which is a special case of min-max scaling. The normalized formula is:
$$IOV_j^{*} = \frac{IOV_j - IOV_{j\text{-min}}}{IOV_{j\text{-max}} - IOV_{j\text{-min}}} \quad (1)$$
In formula (1), $IOV_j$ represents a specific sample value, and $IOV_{j\text{-max}}$ and $IOV_{j\text{-min}}$ are respectively the maximum and the minimum of the feature.
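The normalize, sort and threshold procedure of the fourth step can be sketched as follows. The sketch assumes that the feature index of each feature has already been computed; it does not reproduce the formula for the feature index itself. The function names and the example values in the usage note are hypothetical.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (v - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lowest) / (highest - lowest) for v in values]


def select_features(iov_by_feature, threshold=0.1):
    """Normalize feature indexes, sort them, keep those above the threshold.

    iov_by_feature: dict mapping a feature name to its precomputed feature index.
    Returns a list of (name, normalized_index) pairs in descending order.
    """
    names = list(iov_by_feature)
    scaled = min_max_normalize([iov_by_feature[name] for name in names])
    ranked = sorted(zip(names, scaled), key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in ranked if score > threshold]
```

For example, with made-up indexes `{"a": 2.0, "b": 1.0, "c": 0.0, "d": 0.05}` the normalized values are 1.0, 0.5, 0.0 and 0.025, so only `a` and `b` pass the 0.1 threshold.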
Fifth step: a scikit-learn machine learning framework is set up on Windows, and a heart disease prediction model based on the XGBoost algorithm is built. 80% of the data set restricted to the features selected in the fourth step is used as the training set to train the heart disease prediction model, and the remaining 20% is used as the test set for parameter tuning and testing of the trained model, yielding the heart disease prediction result.
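The 80%/20% partition described in this step can be illustrated with the standard library alone. This is a minimal sketch under assumptions: the function name `split_train_test`, the shuffling before the cut and the fixed seed are illustrative choices, not details given in the text, and in practice a library utility would normally be used instead.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Every row ends up in exactly one of the two parts, so the test set stays unseen during training.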
XGBoost is a tree ensemble model that uses the sum of the predicted values of K trees (K being the total number of trees) as the prediction for a sample.
For a given data set with N samples, each having m features:
$$D = \{(X_i, y_i)\} \quad (|D| = N,\ X_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}) \quad (9)$$
In formula (9), $X_i$ denotes the i-th sample and $y_i$ denotes the label of the i-th sample.
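$$\hat{y}_i = \phi(X_i) = \sum_{k=1}^{K} f_k(X_i), \quad f_k \in F, \quad F = \{ f(x) = w_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T) \quad (10)$$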
In formula (10), $\hat{y}_i$ represents the predicted value of the model, K the number of trees, and $f_k$ the k-th tree model; q represents the structure of a tree, which maps each sample to its corresponding leaf node; $w_{q(x)}$ is the score of that leaf node, w being the set of scores of all leaf nodes of the tree q; and T represents the number of leaf nodes of the tree.
As formula (10) shows, the predicted value of XGBoost is the sum of the predicted values of all trees, i.e. the sum of the scores of the corresponding leaf node in each tree. The goal is to learn these K tree models f(x). To learn the model f(x), the following objective function is defined:
$$\psi(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \quad (11)$$
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$
In formula (11), the first term on the right-hand side is the loss function term, i.e. the training error, where the loss is a differentiable convex function; the second term is the regularization term, i.e. the sum of the complexities of all trees, which controls the complexity of the model and prevents overfitting. The goal is to find the model f(x) that minimizes ψ(φ). γ and λ are coefficients that need to be tuned in practical applications.
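The additive prediction and the regularized objective described above can be made concrete with a toy example. This is a sketch under stated assumptions, not the XGBoost implementation: each tree is represented as a hypothetical pair `(q, w)`, where `q` maps a sample to a leaf index and `w` is the list of leaf scores; the regularizer is assumed to be the standard XGBoost form γT + ½λ‖w‖²; and squared error stands in for the differentiable convex loss.

```python
def predict(trees, x):
    """Additive prediction: sum the leaf score w[q(x)] over all K trees."""
    return sum(w[q(x)] for q, w in trees)


def tree_complexity(w, gamma, lam):
    """Assumed regularizer: gamma * T + 0.5 * lam * sum of squared leaf scores."""
    return gamma * len(w) + 0.5 * lam * sum(score * score for score in w)


def squared_error(y_hat, y):
    """Example loss; any differentiable convex loss could be used instead."""
    return (y_hat - y) ** 2


def objective(trees, samples, gamma, lam, loss=squared_error):
    """Training loss over all samples plus the complexity of every tree."""
    training_loss = sum(loss(predict(trees, x), y) for x, y in samples)
    regularization = sum(tree_complexity(w, gamma, lam) for _, w in trees)
    return training_loss + regularization


# Two hand-written stumps on a made-up two-feature sample (illustrative only).
toy_trees = [
    (lambda x: 0 if x[0] < 50 else 1, [-0.5, 0.5]),
    (lambda x: 0 if x[1] < 200 else 1, [-0.25, 0.25]),
]
toy_samples = [((60, 180), 1.0), ((40, 150), 0.0)]
```

With these toy values, `predict(toy_trees, (60, 180))` adds 0.5 from the first stump and −0.25 from the second. Raising γ penalizes trees with more leaves, and raising λ penalizes large leaf scores, which is how the second term limits model complexity.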
In the above embodiments, random forest, feature correlation analysis and scikit-learn are methods well known in the art.
Matters not described in this specification are covered by the prior art.

Claims (2)

1. A heart disease prediction method based on dual feature selection and the XGBoost algorithm, characterized by comprising the following specific steps:
first step, preprocessing an open-source heart disease data set to obtain a sample data set D of size N;
second step, judging the importance of the features of the sample data set D by a random forest algorithm, wherein the random forest algorithm gives the importance of all features and ranks the features by importance;
third step, performing feature correlation analysis on the sample data set D using the Spearman rank correlation coefficient method;
fourth step, performing dual feature selection according to the feature importance ranking of the sample data set D obtained in the second step and the feature correlation analysis of the sample data set D obtained in the third step, and selecting the most suitable features;
fifth step, setting up a scikit-learn machine learning framework on Windows, building a heart disease prediction model based on the XGBoost algorithm, training the heart disease prediction model with 70-90% of the data set restricted to the features selected in the fourth step as the training set, and performing parameter tuning and testing on the trained heart disease prediction model with the remaining 10-30% as the test set, to obtain a heart disease prediction result;
XGBoost is a tree ensemble model that uses the sum of the predicted values of K trees as the prediction for a sample;
for a given data set with N samples, each having m features:
$$D = \{(X_i, y_i)\} \quad (|D| = N,\ X_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}) \quad (9)$$
in formula (9), $X_i$ denotes the i-th sample and $y_i$ denotes the label of the i-th sample;
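$$\hat{y}_i = \phi(X_i) = \sum_{k=1}^{K} f_k(X_i), \quad f_k \in F, \quad F = \{ f(x) = w_{q(x)} \}\ (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T) \quad (10)$$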
in formula (10), $\hat{y}_i$ represents the predicted value of the model, K the number of trees, and $f_k$ the k-th tree model; q represents the structure of a tree, which maps each sample to its corresponding leaf node; $w_{q(x)}$ is the score of that leaf node, w being the set of scores of all leaf nodes of the tree q; T represents the number of leaf nodes of the tree;
as formula (10) shows, the predicted value of XGBoost is the sum of the predicted values of all trees, i.e. the sum of the scores of the corresponding leaf node in each tree; the goal is to learn these K tree models f(x); to learn the model f(x), the following objective function is defined:
$$\psi(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \quad (11)$$
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$
in formula (11), the first term on the right-hand side is the loss function term, i.e. the training error, where the loss is a differentiable convex function, and the second term is the regularization term, i.e. the sum of the complexities of all trees, which controls the complexity of the model and prevents overfitting; the goal is to find the model f(x) that minimizes ψ(φ); γ and λ represent coefficients;
the preprocessing of the open-source heart disease data set in the first step is as follows: for missing data, abnormal data, and features having multiple categories in the open-source heart disease data set, missing-data filling, abnormal-data deletion, ordered mapping or one-hot encoding of multi-category data, and data standardization are performed;
the standardization sets the mean of a feature sequence to 0 and its variance to 1, so that the feature values follow a standard normal distribution; the standardization formula is:
$$x^{*} = \frac{x - \mu_x}{\sigma_x} \quad (2)$$
in formula (2), $\mu_x$ and $\sigma_x$ respectively represent the mean and the standard deviation of a given feature over the samples;
the importance of the features in the second step is calculated as follows: a random forest comprises a plurality of decision trees, and each node of a decision tree is a condition on a certain feature; the condition divides the data into two parts so that samples of the same class fall into the same set, and this process is repeated until the data are classified; the contribution of each feature on each tree of the random forest is determined and then averaged, and this average is the importance of the feature; the contribution is measured by the Gini index; the variable importance score is denoted VIM and the Gini index is denoted GI; assuming there are c features $x_1, x_2, x_3, \ldots, x_c$ in total, the Gini index score $VIM_j^{(Gini)}$ of each feature $x_j$ is calculated, namely the average change in node-splitting impurity of the j-th feature over all decision trees in the random forest;
the Gini index is calculated as follows:
$$GI_m = \sum_{k=1}^{K} p_{mk}(1 - p_{mk}) = 1 - \sum_{k=1}^{K} p_{mk}^2 \quad (3)$$
in formula (3), K is the number of classes and $p_{mk}$ is the proportion of class k in node m;
the importance of feature $x_j$ at node m, i.e. the change in the Gini index before and after the split at node m, is:
Figure FDA0003516717770000033
in formula (4), N is the number of samples in the node, $N_l$ and $N_r$ are respectively the numbers of samples in the two new nodes after the split, and $GI_l$ and $GI_r$ are respectively the Gini indexes of the two new nodes after the split;
the importance of $x_j$ in the i-th tree is then:
Figure FDA0003516717770000034
the set M in formula (5) is the set of all nodes of decision tree i at which feature $x_j$ appears;
assuming the random forest contains n trees in total, the importance of feature $x_j$ is:
Figure FDA0003516717770000035
finally, the importance scores are normalized:
Figure FDA0003516717770000036
the importance of all features is obtained by the above method, and the features are ranked by importance;
the specific process of the third step is as follows: the Spearman rank correlation coefficient, denoted by the Greek letter ρ, is used to estimate the correlation between two variables X and Y, where the correlation between the variables is described by a monotonic function; the value of ρ between the two variables lies between -1 and +1;
the Spearman correlation coefficient is:
$$\rho = 1 - \frac{6\sum_{i=1}^{N'} d_i^2}{N'(N'^2 - 1)} \quad (8)$$
in formula (8), N' is the number of elements in the variables X and Y; X and Y are sorted, the i-th values of the two random variables are denoted $X'_i$ and $Y'_i$, $x'_i$ and $y'_i$ are the ranks of $X'_i$ and $Y'_i$ in X and Y respectively, and $d_i = x'_i - y'_i$; this completes the correlation analysis of the features;
the specific process of the fourth step is as follows: according to the feature importance ranking of the sample data set D obtained in the second step and the feature correlation analysis of the sample data set D obtained in the third step, the feature importance values are extracted, and the correlation coefficient between each feature and the result label is extracted and standardized according to formula (2); a feature index is calculated for each feature from the extracted feature importance value and the correlation between the feature and the result label; the feature indexes are then normalized and sorted, a threshold is set, and features are selected, which completes the feature selection;
the feature index is calculated as follows:
first, the feature importance value $VIM_j$ and the correlation coefficient between each feature and the result label are extracted; the correlation coefficient is then standardized, with $index_j$ denoting the standardized correlation coefficient; the feature index, denoted IOV, is then calculated according to formula (12):
Figure FDA0003516717770000042
the normalization refers to scaling the feature values to the interval [0,1], and the normalization formula is:
$$IOV_j^{*} = \frac{IOV_j - IOV_{j\text{-min}}}{IOV_{j\text{-max}} - IOV_{j\text{-min}}} \quad (1)$$
in formula (1), $IOV_j$ represents a specific sample value, and $IOV_{j\text{-max}}$ and $IOV_{j\text{-min}}$ are respectively the maximum and the minimum of the feature.
2. The heart disease prediction method based on dual feature selection and the XGBoost algorithm according to claim 1, characterized in that the threshold set for selecting by feature index is 0.1, the features whose feature selection index is greater than 0.1 are selected as the features for training the final model, and 9 features are selected from the 13 features of the open-source heart disease data set.
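The Gini-based feature importance recited in the second step of claim 1 can be sketched with plain Python. This is an illustration under assumptions rather than the claimed formulas: the node importance is taken to be the sample-weighted decrease in Gini index, GI_m − (N_l/N)·GI_l − (N_r/N)·GI_r, which is one common reading of a formula involving N, N_l and N_r; the split records and the function names are hypothetical.

```python
def gini(class_counts):
    """Gini index of a node: 1 - sum over classes of p_mk squared."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def node_importance(parent, left, right):
    """Sample-weighted Gini decrease of one split (assumed weighting)."""
    n, n_left, n_right = sum(parent), sum(left), sum(right)
    return (gini(parent)
            - (n_left / n) * gini(left)
            - (n_right / n) * gini(right))


def forest_importance(splits_by_tree, n_features):
    """Accumulate per-feature importance over all trees, then normalize.

    splits_by_tree: one list per tree of (feature_index, parent_counts,
    left_counts, right_counts) records, where counts are per-class sample counts.
    """
    totals = [0.0] * n_features
    for splits in splits_by_tree:
        for feature, parent, left, right in splits:
            totals[feature] += node_importance(parent, left, right)
    grand_total = sum(totals)
    if grand_total == 0:
        return totals
    return [value / grand_total for value in totals]
```

A split that separates the classes perfectly, such as `[5, 5]` into `[5, 0]` and `[0, 5]`, removes all impurity and therefore contributes the full parent Gini index of 0.5 to its feature. Because the scores are normalized at the end, summing or averaging over trees gives the same ranking.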
CN202010052452.XA 2020-01-17 2020-01-17 Heart disease prediction method based on dual feature selection and XGboost algorithm Active CN111243751B (en)

Publications (2)

Publication Number Publication Date
CN111243751A CN111243751A (en) 2020-06-05
CN111243751B true CN111243751B (en) 2022-04-22

Family

ID=70874603

Country Status (1)

Country Link
CN (1) CN111243751B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant