CN110895969B - Atrial fibrillation prediction decision tree and pruning method thereof - Google Patents

Publication number
CN110895969B
Authority
CN
China
Prior art keywords
atrial fibrillation
patient
decision tree
judging
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811068303.1A
Other languages
Chinese (zh)
Other versions
CN110895969A (en
Inventor
张敏
张树龙
汪祖民
杨慧英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Application filed by Dalian University
Priority to CN201811068303.1A
Publication of CN110895969A
Application granted
Publication of CN110895969B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Measurement And Recording Of Electrical Phenomena And Electrical Characteristics Of The Living Body (AREA)

Abstract

An atrial fibrillation prediction decision tree and a pruning method thereof belong to the field of data processing and address the problem of constructing a decision tree that mines the indexes affecting atrial fibrillation prediction. The root node of the decision tree is the A peak, the attribute with the maximum information gain rate, whose normal range is 41 to 87. On the first branch, where a denotes the value of the A peak, a patient with a <= 0 is judged to have atrial fibrillation; because the data contain no negative values, this is the case a = 0. When a > 0, the ef attribute is considered next, and when the ef value is smaller than 58 the patient is judged to be normal. The decision tree provides important guidance for determining key reference indexes that affect atrial fibrillation.

Description

Atrial fibrillation prediction decision tree and pruning method thereof
Technical Field
The invention belongs to the field of data processing, and relates to a method for constructing an atrial fibrillation prediction decision tree and the decision tree.
Background
Atrial fibrillation is a supraventricular tachyarrhythmia characterized by rapid, disordered atrial electrical activity. On an electrocardiogram it appears mainly as the disappearance of P waves, which are replaced by irregular fibrillation waves, and as absolutely irregular RR intervals (when atrioventricular conduction is present). This is also the main basis for diagnosing atrial fibrillation in current medical practice. Atrial fibrillation is medically classified, mainly according to the duration of an episode, into paroxysmal atrial fibrillation (paroxysmal AF), persistent atrial fibrillation (persistent AF), long-standing persistent atrial fibrillation (long-standing persistent AF), and permanent atrial fibrillation (permanent AF). The specific classification is shown in Table 1.1.
TABLE 1.1 detailed classification of medical atrial fibrillation
Atrial fibrillation is a very common arrhythmia in clinical practice. Its incidence in China is 0.5% to 1% and rises with age. The risk of atrial fibrillation in patients with hypertension is 1.7 times higher than in patients with normal blood pressure, and 33% of atrial fibrillation cases are currently attributed to hypertension. Given the high incidence of atrial fibrillation in hypertensive patients, atrial fibrillation is even regarded as another manifestation of hypertensive target-organ damage. However, there is currently no good clinical index for predicting the occurrence of AF in patients with hypertension. In addition, some patients with atrial fibrillation have no obvious clinical symptoms and are therefore unknowingly exposed to the risk of various critical conditions; by the time symptoms appear or the disease strikes suddenly, organic cardiovascular lesions have often already developed, seriously affecting the patient's health and even endangering life. It is therefore important to study the probability of atrial fibrillation in the hypertensive population.
There are currently many methods for predicting atrial fibrillation. In the medical field, prediction starts from the treatment of atrial fibrillation. Internationally, the CHA2DS2-VASc score (hypertension, age, diabetes, stroke, vascular disease, sex, congestive heart failure) and the HATCH score (hypertension, age, ischemic attack, chronic obstructive pulmonary disease, heart failure) are used to predict atrial fibrillation, but both scores have various limitations, so the prediction method is not standardized and the results are inaccurate. In the computer field, whether a patient has atrial fibrillation is generally determined from the patient's electrocardiogram, using factors such as P-wave detection and analysis of how the RR-interval distribution changes over time; the algorithms used are either statistical or based on machine learning. Other approaches detect characteristic indexes of the body with a smart watch, or scan the face with a smartphone and predict from facial color; for asymptomatic patients, the Holter heart rate is sometimes measured directly with a medical instrument. These approaches still lack standardization and have no specific criteria.
Disclosure of Invention
In order to solve the problem of constructing a decision tree to mine out indexes affecting atrial fibrillation prediction, the invention provides the following technical scheme:
An atrial fibrillation prediction decision tree, wherein the root node of the decision tree is the A peak, the attribute with the maximum information gain rate, whose normal range is 41 to 87. On the first branch of the decision tree, where a denotes the value of the A peak, the patient is judged to have atrial fibrillation when a <= 0; because the data contain no negative values, this is the case a = 0. When a > 0, the ef attribute is considered next, and when the ef value is smaller than 58 the patient is judged to be normal.
Another atrial fibrillation prediction decision tree, wherein the root node of the decision tree is XGN. When the XGN grade is greater than 1, the patient is judged to have atrial fibrillation; when the XGN grade is less than or equal to 1, the A peak is considered next. When the A peak is 0, FS is considered: when FS is greater than 0, the patient is judged to have atrial fibrillation; otherwise FJB is considered. When FJB is less than or equal to 0, LVPWD is considered; when LVPWD is less than or equal to 9, the EF value is considered, and the patient is judged to be normal when EF is less than or equal to 57 and to have atrial fibrillation otherwise. Tracing back to the right branch of LVPWD: when LVPWD is greater than 9, the value of FDMB1 is considered; when it is less than or equal to 101, the patient is judged to have atrial fibrillation; otherwise LAD is considered, and the patient is judged to have atrial fibrillation when LAD is less than or equal to 50 and to be normal otherwise. Tracing back to the right branch of FJB: when FJB is greater than 0, GXB is considered, and the patient is judged to be normal when GXB is less than or equal to 2 and to have atrial fibrillation otherwise. Tracing back to the right branch of FS: when FS is greater than 0, the patient is judged to have atrial fibrillation. Tracing back to the right branch of the A peak: when A is greater than 0, TNB is considered; when TNB is less than or equal to 0, the patient is judged to be normal; otherwise FDMB is considered; when FDMB is greater than 0, the patient is judged to be normal; otherwise the E value is considered; when E is greater than 72, the patient is judged to have atrial fibrillation; otherwise the MCHC value is considered, and the patient is judged to have atrial fibrillation when MCHC is less than or equal to 338 and to be normal otherwise. This completes the traversal of the whole decision tree.
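The traversal of the second decision tree can be written out as nested conditions. The following Python sketch is one possible encoding, not code from the patent: the dictionary keys reuse the attribute abbreviations from the text, the function name `predict_af` is hypothetical, and the return labels "f" (atrial fibrillation) and "z" (normal) follow the class labels used in the experiments.

```python
def predict_af(p):
    """Classify one patient record (a dict keyed by attribute abbreviation).

    Returns "f" for atrial fibrillation or "z" for normal.
    Hypothetical helper; thresholds are copied from the tree described in the text.
    """
    if p["XGN"] > 1:
        return "f"
    if p["A"] <= 0:  # A peak recorded as 0
        if p["FS"] > 0:
            return "f"
        if p["FJB"] <= 0:
            if p["LVPWD"] <= 9:
                return "z" if p["EF"] <= 57 else "f"
            # right branch of LVPWD
            if p["FDMB1"] <= 101:
                return "f"
            return "f" if p["LAD"] <= 50 else "z"
        # right branch of FJB
        return "z" if p["GXB"] <= 2 else "f"
    # right branch of the A peak (A > 0)
    if p["TNB"] <= 0:
        return "z"
    if p["FDMB"] > 0:
        return "z"
    if p["E"] > 72:
        return "f"
    return "f" if p["MCHC"] <= 338 else "z"
```

Only the attributes on the path actually taken are read, so a record needs to supply just those keys; for example, `predict_af({"XGN": 2})` is decided at the root.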
A method of pruning an atrial fibrillation prediction decision tree, comprising:
1) Three misclassified-sample counts are calculated: the sum of the misclassified samples over all leaf nodes of the subtree Tv, denoted E1; the number of misclassified samples when the subtree Tv is pruned and replaced with a leaf node, denoted E2; and the number of misclassified samples of the largest branch of the subtree Tv, denoted E3;
2) The three counts are compared: if E1 is the smallest, no pruning is performed; if E2 is the smallest, the subtree Tv is pruned and replaced with a leaf node; if E3 is the smallest, the subtree Tv is replaced with its largest branch.
Beneficial effects: the invention provides a prediction decision tree that gives important guidance for determining key reference indexes affecting atrial fibrillation, and the pruning method suits this decision tree, improving construction efficiency and accuracy.
Drawings
FIG. 1 is a schematic diagram of a decision tree structure;
FIG. 2 is a schematic illustration of a medical data manuscript;
FIG. 3 is a schematic diagram of a derived Excel table;
FIG. 4 is a schematic view of the ultrasound properties of the heart;
FIG. 5 is a schematic diagram of the Weka operation interface;
FIG. 6 is a schematic diagram of a decision tree using default values;
FIG. 7 is a schematic diagram of decision tree accuracy;
FIG. 8 is a schematic diagram of a decision tree of 154 factors;
FIG. 9 is a schematic diagram of decision tree accuracy.
Detailed Description
Example 1:
To solve the problem of constructing a decision tree for atrial fibrillation prediction, the invention provides the following technical scheme: a method of constructing an atrial fibrillation prediction decision tree, comprising:
Step 1: if all samples in the data set S belong to the same category, create a leaf node, mark it with the corresponding category label, and stop building the tree; otherwise, go to step 2;
Step 2: calculate the information gain rate Gain-rate(A) of every attribute in the data set S;
Step 3: select the attribute A with the maximum information gain rate;
Step 4: establish the attribute A as the root node of the decision tree T, where T is the decision tree to be built;
Step 5: divide the data set into subsets according to the values of the attribute A, and recursively execute steps 1-4 on each subset Sv to construct a subtree Tv, where Sv is the subset of samples in which attribute A takes the value v;
Step 6: add the subtree Tv to the corresponding branch of the decision tree T;
Step 7: end the loop to obtain the decision tree T.
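The seven steps can be sketched as a short recursive function. This is a minimal Python illustration under stated assumptions, not the J48 implementation used in the experiments: attributes are categorical, the gain rate is taken to be the standard C4.5 ratio of information gain to split information, and the names `build_tree`, `gain_rate`, `a_zero`, and `ef_low` are hypothetical. The majority-class leaf used when no attributes remain is a practical stopping rule that the steps do not spell out.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_rate(rows, labels, attr):
    """Information gain of `attr` divided by its split information (assumed C4.5 form)."""
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    if split_info == 0:  # attribute takes a single value: nothing to split on
        return 0.0
    return (entropy(labels) - cond) / split_info

def build_tree(rows, labels, attrs):
    # Step 1: all samples share one class, so create a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption: with no attributes left, fall back to the majority class.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2-3: pick the attribute with the maximum gain rate.
    best = max(attrs, key=lambda a: gain_rate(rows, labels, a))
    # Step 4: make it the root of this (sub)tree.
    node = {"attr": best, "branches": {}}
    remaining = [a for a in attrs if a != best]
    # Steps 5-6: recurse on each subset Sv and attach the subtree Tv.
    for v in {row[best] for row in rows}:
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        node["branches"][v] = build_tree(
            [r for r, _ in subset], [y for _, y in subset], remaining
        )
    # Step 7: recursion ends and the tree is returned.
    return node

# Toy data, invented for illustration only.
rows = [
    {"a_zero": "yes", "ef_low": "yes"}, {"a_zero": "yes", "ef_low": "no"},
    {"a_zero": "no", "ef_low": "yes"}, {"a_zero": "no", "ef_low": "no"},
    {"a_zero": "no", "ef_low": "yes"}, {"a_zero": "yes", "ef_low": "yes"},
]
labels = ["f", "f", "z", "f", "z", "f"]
tree = build_tree(rows, labels, ["a_zero", "ef_low"])
```

On this toy data `a_zero` has the higher gain rate, so it becomes the root, and `ef_low` is used only on the branch where the classes are still mixed.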
Further, the data are processed as follows: a record whose class label is missing is deleted directly; a missing attribute value is assigned to the most common class or replaced with the most common value; a continuous-valued attribute is handled by sorting the data, splitting the data set with each value in turn as a threshold, calculating the information gain of each split, selecting the threshold with the maximum gain, and dividing the data set with that threshold.
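The two missing-value rules can be illustrated with a small Python sketch. It assumes that records are dictionaries and that a missing entry is `None`, an empty string, or "?"; the marker set, the function name `clean_records`, and the sample records are all hypothetical and are not taken from the patent.

```python
from collections import Counter

# Assumed markers for a missing entry; 0 is deliberately NOT treated as missing,
# because an A peak of 0 is a meaningful value in the text.
MISSING = (None, "", "?")

def clean_records(records, label_key):
    """Drop records without a class label, then fill gaps with the most common value."""
    kept = [dict(r) for r in records if r.get(label_key) not in MISSING]
    attrs = {k for r in kept for k in r if k != label_key}
    for attr in attrs:
        present = [r[attr] for r in kept if r.get(attr) not in MISSING]
        if not present:
            continue  # nothing to infer a replacement from
        most_common = Counter(present).most_common(1)[0][0]
        for r in kept:
            if r.get(attr) in MISSING:
                r[attr] = most_common
    return kept

# Invented sample records for illustration only.
records = [
    {"ef": "low", "label": "f"},
    {"ef": "low", "label": "z"},
    {"ef": None, "label": "f"},     # missing attribute value: filled with "low"
    {"ef": "high", "label": None},  # missing class label: record is dropped
]
cleaned = clean_records(records, "label")
```

The input list is left unchanged because each kept record is copied before it is filled.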
Further, a pruning operation is performed on the decision tree:
1) Three misclassified-sample counts are calculated: the sum of the misclassified samples over all leaf nodes of the subtree Tv, denoted E1; the number of misclassified samples when the subtree Tv is pruned and replaced with a leaf node, denoted E2; and the number of misclassified samples of the largest branch of the subtree Tv, denoted E3;
2) The three counts are compared: if E1 is the smallest, no pruning is performed; if E2 is the smallest, the subtree Tv is pruned and replaced with a leaf node; if E3 is the smallest, the subtree Tv is replaced with its largest branch.
Further, the splitting attribute is selected according to the information gain rate.
The formula of the information entropy is:
$$H(S) = -\sum_{i} p(c_i)\,\log_2 p(c_i) \qquad (1)$$
where S represents the dataset, $c_i$ represents the i-th class, and $p(c_i)$ represents the probability that class $c_i$ is selected.
When the decision tree is split, the information entropy of a characteristic attribute is calculated. Suppose the characteristic attribute A has n different values and therefore divides the dataset S into n small datasets $s_i$, each selected with probability $p(s_i)$. By equation (1), the information entropy of each small dataset $s_i$ is $H(s_i)$, and the information entropy of the characteristic attribute A is:
$$H(A) = \sum_{i=1}^{n} p(s_i)\,H(s_i) \qquad (2)$$
The information gain calculation formula is:
$$\mathrm{Info\_Gain}(A) = H(S) - H(A) \qquad (3)$$
the information gain ratio calculation formula is:
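The expression that belongs here is not preserved in this text. For reference, the standard C4.5 definition divides the information gain of attribute A by its split information. The name Split_Info is introduced here as an assumption and is not taken from the patent; n and $s_i$ are the number of subsets and the subsets produced by attribute A, as defined above.

```latex
\mathrm{Split\_Info}(A) = -\sum_{i=1}^{n} p(s_i)\,\log_2 p(s_i),
\qquad
\mathrm{Gain\_rate}(A) = \frac{\mathrm{Info\_Gain}(A)}{\mathrm{Split\_Info}(A)} \qquad (4)
```

Dividing by the split information penalizes attributes that fragment the data into many small subsets, which is the bias of plain information gain that C4.5 was designed to correct.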
Further, the constructed decision tree is adjusted repeatedly by changing the parameters of the decision tree algorithm, so that both its accuracy and its branch attribute values are optimal. The J48 algorithm has 11 modifiable parameters: binarySplits, debug, saveInstance, subtreeRaising, unpruned, and useLaplace keep their default values, while the five parameters ConfidenceFactor, minNumObj, numFolds, seed, and reduceErrorPruning are modified and verified so as to approach the most accurate values for the medical data. The processed data file is loaded into the weka software, the algorithm is selected, its parameters are modified, and the result is run; experiments are carried out on the possible values of each parameter, and the best experimental result is finally selected.
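One way to organize such a parameter sweep is to enumerate the combinations up front. The Python sketch below only builds option lists and does not call Weka; `j48_option_grid` and the candidate values are hypothetical. It assumes J48's command-line flags -C (confidence factor), -M (minimum instances per leaf), -R (reduced-error pruning), -N (number of folds) and -Q (seed), and it assumes that -C is not combined with -R while -N and -Q matter only together with -R, which is how J48's option handling is understood to behave.

```python
from itertools import product

def j48_option_grid(confidence_factors, min_num_objs, num_folds, seeds):
    """Yield one J48 option list per parameter combination (hypothetical helper)."""
    # Confidence-based pruning: vary -C and -M.
    for c, m in product(confidence_factors, min_num_objs):
        yield ["-C", str(c), "-M", str(m)]
    # Reduced-error pruning: vary -N, -Q and -M; -C is left out (assumed incompatible).
    for m, n, q in product(min_num_objs, num_folds, seeds):
        yield ["-R", "-N", str(n), "-Q", str(q), "-M", str(m)]

# Candidate values chosen for illustration only.
grid = list(j48_option_grid([0.1, 0.25], [2, 5], [3], [1]))
```

Each option list can then be run in turn and the accuracy recorded, so that the best-performing setting is chosen by comparison rather than by hand.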
The experiment is divided into two branches:
The first branch experiment is performed on several attributes of the cardiac ultrasound index, where the last column is the class label, f denotes atrial fibrillation and z denotes normal, and every parameter of the algorithm uses its default value. According to the resulting decision tree, three cardiac ultrasound attributes have a large influence on atrial fibrillation: the A peak, ef, and lasd. Specifically, the root node of the decision tree is the A peak, the attribute with the maximum information gain rate, whose normal range is 41 to 87. On the first branch of the decision tree, where a denotes the value of the A peak, the patient is judged to have atrial fibrillation when a <= 0; because the data contain no negative values, this is the case a = 0. When a > 0, the ef attribute is considered next, and when the ef value is smaller than 58 the patient is judged to be normal;
The second branch experiment collects the patients' characteristic indexes, with the detection items of routine blood tests, alpha function, coagulation, liver function, blood lipids, and cardiac ultrasound as attribute columns; the last column is the class label, f denotes atrial fibrillation and z denotes normal, and every parameter of the algorithm uses its default value. The attributes that appear in the resulting decision tree are XGN, A peak (cardiac ultrasound index), FS (rheumatic heart valve disease), FJB (interstitial lung disease), LVPWD (cardiac ultrasound index), EF (cardiac ultrasound index), FDMB1 (pulmonary valve blood flow velocity), FDMB (pulmonary valve), LAD (cardiac ultrasound index), GXB (coronary heart disease), TNB (diabetes), MCHC (hemoglobin concentration), and E peak (cardiac ultrasound index). According to the decision tree, when the XGN grade is greater than 1, the patient is judged to have atrial fibrillation; when the XGN grade is less than or equal to 1, the A peak is considered next. When the A peak is 0, FS is considered: when FS is greater than 0, the patient is judged to have atrial fibrillation; otherwise FJB is considered. When FJB is less than or equal to 0, LVPWD is considered; when LVPWD is less than or equal to 9, the EF value is considered, and the patient is judged to be normal when EF is less than or equal to 57 and to have atrial fibrillation otherwise. Tracing back to the right branch of LVPWD: when LVPWD is greater than 9, the value of FDMB1 is considered; when it is less than or equal to 101, the patient is judged to have atrial fibrillation; otherwise LAD is considered, and the patient is judged to have atrial fibrillation when LAD is less than or equal to 50 and to be normal otherwise. Tracing back to the right branch of FJB: when FJB is greater than 0, GXB is considered, and the patient is judged to be normal when GXB is less than or equal to 2 and to have atrial fibrillation otherwise. Tracing back to the right branch of FS: when FS is greater than 0, the patient is judged to have atrial fibrillation. Tracing back to the right branch of the A peak: when A is greater than 0, TNB is considered; when TNB is less than or equal to 0, the patient is judged to be normal; otherwise FDMB is considered; when FDMB is greater than 0, the patient is judged to be normal; otherwise the E value is considered; when E is greater than 72, the patient is judged to have atrial fibrillation; otherwise the MCHC value is considered, and the patient is judged to have atrial fibrillation when MCHC is less than or equal to 338 and to be normal otherwise. This completes the traversal of the whole decision tree.
Example 2:
The present disclosure employs data mining to build a standardized decision tree model for medical reference.
The standard terminology used herein is explained:
Data mining (DM) works on large, multi-faceted data accumulated over long periods and extracts valuable patterns, relationships, and knowledge that people can understand. It mines data and discovers knowledge without any prior assumptions. Data mining is a technique for finding rules in large amounts of data by analyzing each record, and it consists mainly of three steps: data preparation, rule searching, and rule representation. Data mining tasks include association analysis, cluster analysis, classification analysis, anomaly analysis, specific group analysis, and evolution analysis. The invention performs classification analysis to determine whether patients with hypertension have atrial fibrillation.
The decision tree algorithm is a typical algorithm for classification and prediction in data mining; its computational complexity is low and its output is easy to interpret. The invention applies decision tree algorithms to predicting the probability of atrial fibrillation in hypertensive patients.
A decision tree is a basic classification and regression method. The decision tree model has a tree structure and, in a classification problem, represents the process of classifying instances based on their features. Compared with naive Bayesian classification, a decision tree has the advantage that its construction requires no domain knowledge or parameter setting, so in practice decision trees are better suited to exploratory knowledge discovery. Decision tree algorithms include the ID3 algorithm, the C4.5 algorithm, and the CART algorithm. The invention uses the C4.5 algorithm for its experiments. C4.5 is mainly an improvement on ID3: when ID3 selects an attribute by information gain, it favors attributes with many values. To solve this problem, C4.5 replaces the information gain with the information gain rate. A decision tree is composed of a root node, a series of internal nodes, and leaf nodes; each node has only one parent node and two or more child nodes, and the nodes are connected by branches. Each internal node corresponds to a non-category attribute or a combination of attributes, each edge corresponds to a possible value of that attribute, and each leaf node corresponds to a value of the category attribute. An example of a decision tree structure is shown in FIG. 1.
Given this understanding of decision trees, the method is suited to classifying the indexes used in atrial fibrillation prediction and comprises the following steps:
C4.5 algorithm flow
Step 1: if all samples in the data set S belong to the same category, create a leaf node, mark it with the corresponding category label, and stop building the tree; otherwise, go to step 2;
Step 2: calculate the information gain rate Gain-rate(A) of every attribute in the data set S;
Step 3: select the attribute A with the maximum information gain rate;
Step 4: establish the attribute A as the root node of the decision tree T, where T is the decision tree to be built;
Step 5: divide the data set into subsets according to the values of the attribute A, and recursively execute steps 1-4 on each subset Sv to construct a subtree Tv, where Sv is the subset of samples in which attribute A takes the value v;
Step 6: add the subtree Tv to the corresponding branch of the decision tree T;
Step 7: end the loop to obtain the decision tree T.
Numerical value processing: training data with missing attribute values can be processed. A record whose class label is missing is deleted directly; a missing attribute value is assigned to the most common class or replaced with the most common value. Continuous-valued attributes can also be processed: the data are sorted, the data set is split with each value in turn as a threshold, the information gain of each split is calculated, the threshold with the maximum gain is selected, and the data set is divided with that threshold.
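The continuous-value procedure can be sketched in a few lines of Python. This is an illustration only: the function name `best_threshold` and the sample values are hypothetical, and the split convention (values less than or equal to the threshold go left) is an assumption chosen to match the "a <= 0" style of branch used in the trees described in this document.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try each distinct value as a threshold and return (threshold, information gain)."""
    pairs = sorted(zip(values, labels))  # sort the data by attribute value
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    # The largest value is skipped: it would leave the right-hand subset empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - cond
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Invented A-peak-like values for illustration only.
values = [0, 0, 0, 45, 60, 72]
labels = ["f", "f", "f", "z", "z", "z"]
threshold, gain = best_threshold(values, labels)
```

For these made-up values the threshold 0 separates the two classes completely, so it is selected with the maximum possible gain for a balanced two-class set.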
Pruning: the generation process above builds a decision tree from the training dataset, but the tree's accuracy and other performance still have to be evaluated. Because the resulting tree is based purely on the training dataset, it may overfit. To address this, the decision tree must be pruned. The basic idea of pruning is to remove the parts of the tree (subtrees) that do not contribute to classification accuracy on unseen test samples. Two approaches improve recursive branching to produce simpler, more understandable trees: pre-pruning and post-pruning.
Pre-pruning: a decision is made before branching to prevent the dataset from generating too many branches; pruning is performed while the decision tree is being constructed.
Post-pruning: redundant branches are pruned, mainly to counter the influence of noise.
Because the J48 algorithm employed by the invention uses post-pruning, the post-pruning method is described in detail here. Post-pruning methods include REP (Reduced Error Pruning), PEP (Pessimistic Error Pruning), MEP (Minimum Error Pruning), CCP (Cost-Complexity Pruning), and others. The default pruning method of the C4.5 algorithm is the REP pruning method. Its basic idea is as follows:
1) Three misclassified-sample counts are calculated: the sum of the misclassified samples over all leaf nodes of the subtree Tv, denoted E1; the number of misclassified samples when the subtree Tv is pruned and replaced with a leaf node, denoted E2; and the number of misclassified samples of the largest branch of the subtree Tv, denoted E3.
2) The three counts are compared. If E1 is the smallest, no pruning is performed; if E2 is the smallest, the subtree Tv is pruned and replaced with a leaf node; if E3 is the smallest, a grafting strategy is adopted, that is, the subtree Tv is replaced with its largest branch.
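The comparison of E1, E2 and E3 can be sketched as follows. The Python below is a minimal illustration, not the J48 implementation: `leaf_errors` and `prune_decision` are hypothetical names, and the handling of ties (preferring the leaf, then the graft, so that the smaller tree wins) is an assumption, since the text does not say what happens when two counts are equal.

```python
from collections import Counter

def leaf_errors(labels):
    """Misclassified count when a single leaf predicts the majority class of `labels`."""
    return len(labels) - max(Counter(labels).values()) if labels else 0

def prune_decision(e1, e2, e3):
    """Return "keep", "leaf", or "graft" for a subtree Tv given its three error counts.

    e1: errors summed over the leaves of Tv (no pruning)
    e2: errors if Tv is replaced by one leaf
    e3: errors if Tv is replaced by its largest branch
    Tie-breaking toward the simpler tree is an assumption of this sketch.
    """
    smallest = min(e1, e2, e3)
    if e2 == smallest:
        return "leaf"
    if e3 == smallest:
        return "graft"
    return "keep"

# Invented example: a subtree with two leaves.
leaves = [["f", "f", "z"], ["z", "z", "f"]]
e1 = sum(leaf_errors(leaf) for leaf in leaves)              # errors of the subtree as built
e2 = leaf_errors([y for leaf in leaves for y in leaf])      # errors of one replacement leaf
decision = prune_decision(e1, e2, 4)                        # 4 is a made-up E3
```

Here the subtree makes fewer errors than either replacement, so it is kept unpruned.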
Split attribute selection: the criterion for selecting the splitting attribute is the fundamental difference between decision tree algorithms. As mentioned above, ID3 selects the splitting attribute by information gain and C4.5 selects it by information gain rate. Information entropy is the expected value of information; for a data set, it expresses how disordered the data set is. The more categories a data set contains, the greater its information entropy. The formula is:
$$H(S) = -\sum_{i} p(c_i)\,\log_2 p(c_i) \qquad (1)$$
where S represents the dataset, $c_i$ represents the i-th class, and $p(c_i)$ represents the probability that class $c_i$ is selected.
When the decision tree is split, the information entropy of a characteristic attribute is calculated. Suppose the characteristic attribute A has n different values and therefore divides the dataset S into n small datasets $s_i$, each selected with probability $p(s_i)$. By equation (1), the information entropy of each small dataset $s_i$ is $H(s_i)$, and the information entropy of the characteristic attribute A is:
$$H(A) = \sum_{i=1}^{n} p(s_i)\,H(s_i) \qquad (2)$$
The information gain calculation formula is:
$$\mathrm{Info\_Gain}(A) = H(S) - H(A) \qquad (3)$$
the information gain ratio calculation formula is:
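The gain-rate expression itself is not preserved in this text. The Python sketch below therefore assumes the standard C4.5 form, in which the information gain is divided by the split information of the attribute; the function names and the six-sample toy split are hypothetical and serve only to show the quantities in equations (1) to (3) being computed.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S): information entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def attribute_entropy(subsets):
    """H(A): entropy of the subsets produced by attribute A, weighted by subset size."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)

def info_gain(labels, subsets):
    """Info_Gain(A) = H(S) - H(A)."""
    return entropy(labels) - attribute_entropy(subsets)

def gain_rate(labels, subsets):
    """Information gain divided by split information (assumed C4.5 form)."""
    total = len(labels)
    split_info = sum(
        -(len(s) / total) * math.log2(len(s) / total) for s in subsets if s
    )
    return info_gain(labels, subsets) / split_info if split_info > 0 else 0.0

# Toy split, invented for illustration: an attribute with two values.
subsets = [["f", "f", "f"], ["z", "f", "z"]]
labels = [y for s in subsets for y in s]
h_s = entropy(labels)
gain = info_gain(labels, subsets)
rate = gain_rate(labels, subsets)
```

Because the two subsets are the same size, the split information is exactly 1 bit and the gain rate equals the information gain; with uneven or more numerous subsets the two values diverge.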
Algorithm application
Description of data: the data used by the invention were provided by a hospital in Dalian and come from actual measurements of patients with hypertension; 360 records were taken in total. The test reports mainly include white blood cell count (WBC), absolute granulocyte count (Neu#), NT-proBNP, EF (ejection fraction), LVEF (left ventricular ejection fraction), hypertension grade, and whether atrial fibrillation is present. Part of the original data is shown in FIG. 2.
Data preprocessing: the Weka platform runs on .csv files, whereas our data file is an Excel table, so the first step is to convert the data file into a .csv file. Indicators in the hospital data that the invention does not consider are filtered out, leaving only the items under study. Abnormal data are deleted, and null attribute values are handled automatically by the J48 algorithm. Because the 154-dimensional data are large in scale, the invention also extracts, with reference to the relevant medical standards, 11 dimensions of data, namely the cardiac ultrasound indexes such as ef (ejection fraction), A peak, and E peak, for a more targeted experiment. The simplified data are shown in FIG. 3.
Operating environment: the Waikato Environment for Knowledge Analysis (WEKA) is free, non-commercial, Java-based open-source machine learning and data mining software, developed mainly in New Zealand. The official website is http://weka.wikispaces.com/. As an open data mining platform, WEKA integrates a large number of machine learning algorithms for data mining tasks, including data preprocessing, association analysis, classification, regression, and clustering, together with visualization in an interactive interface. WEKA can be embedded into MyEclipse for convenient secondary development, such as modifying or adding the latest data mining algorithms, and mining results can be displayed in various forms so that users can find the knowledge they need clearly and conveniently. Before mining, JDBC must be configured and the database driver loaded. The Weka control platform and operation interface are shown in FIG. 4. To use the Weka software, first open the control platform and select the first option, Explorer, to start an experiment; the interface that opens is shown in the operation interface diagram. The data for the experiment are first selected through the Open file option, and different operations are then performed on the data as the experiment requires, such as data preprocessing, classification, clustering, and association rules, each with its own parameters. For the experiments here, the J48 algorithm among the classification algorithms is selected. The software operation interface is shown in FIG. 5.
Constructing the decision tree: the decision tree for a dataset is not unique, and unfortunately constructing an optimal decision tree is an NP problem, so how to construct a good decision tree is the focus of research. The invention adjusts the constructed decision tree repeatedly by changing the parameters of the decision tree algorithm, so that both the accuracy and the branch attribute values of the tree are optimal. The J48 algorithm has 11 modifiable parameters: binarySplits, debug, saveInstance, subtreeRaising, unpruned, and useLaplace keep their default values, and the five parameters ConfidenceFactor, minNumObj, numFolds, seed, and reduceErrorPruning are modified. The experiments of the invention mainly modify and verify these five parameters to approach the most accurate values for the medical data and thereby raise the accuracy of the decision tree. The weka software works like a black box: the processed data file is loaded into weka, the desired algorithm is selected, its parameters are modified, and the result is run. Experiments were carried out on the possible values of each parameter, and the best experimental results were selected, as follows. The experiment is divided into two parts. The first part is performed on 11 cardiac ultrasound attributes, where the last column is the class label, f denotes atrial fibrillation and z denotes normal. The experimental data contain 360 records in total, 186 men and 174 women; 178 have atrial fibrillation and 182 are normal (here "normal" means a patient with hypertension only). All algorithm parameters use default values, and the experimental result is shown in FIG. 6.
From the decision tree above, among the cardiac ultrasound attributes the three attributes A peak, ef and lasd have a large influence on atrial fibrillation. The root node of the tree is the A peak, the attribute with the largest information gain rate; its normal range is 41 to 87, and its disappearance means that atrial fibrillation has occurred. On the first branch, when a <= 0 the patient is judged to have atrial fibrillation; since the data contain no negative values, this means a = 0. When a > 0, the ef attribute is considered next: when the ef value is smaller than 58, the patient is judged to be normal. The rest of the decision tree is analyzed in the same way. The accuracy screenshot for the decision tree reports the accuracy, error rate, kappa value and so on, which can be used to evaluate the quality of the algorithm. The invention mainly uses the accuracy as the criterion. As can be seen from FIG. 7, the accuracy was 83.0556%.
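For reference, the accuracy, error rate and kappa value mentioned above can all be derived from a confusion matrix. An accuracy of 83.0556% on 360 records corresponds to 299 correctly classified records (299/360). The sketch below is a minimal illustration: the function name is an assumption, and the individual cell counts are made up, chosen only so that they total 360 with 299 on the diagonal. They are not the counts reported in FIG. 7.

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: product of the row and column marginals of each class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (total * total)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


if __name__ == "__main__":
    # Hypothetical counts (rows: true f, true z); NOT taken from FIG. 7.
    example = [[150, 28], [33, 149]]
    acc, kappa = accuracy_and_kappa(example)
    print(f"accuracy = {acc:.4%}, error rate = {1 - acc:.4%}, kappa = {kappa:.4f}")
```

Kappa corrects the raw accuracy for the agreement expected by chance, which is why it is usually reported alongside the accuracy for roughly balanced two-class data like this.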
The second part of the experimental data contains 308 records. Each record has 154 attribute columns of patient indicators, covering test items such as routine blood tests, alpha function, coagulation, liver function, blood lipids and cardiac ultrasound; the last column is the class label, where f denotes atrial fibrillation and z denotes normal. The data include 162 men and 146 women; 128 have atrial fibrillation and 180 are normal. As above, all algorithm parameters use their default values, and the experimental results are shown in FIG. 8.
From the decision tree we can see that, among the 154 attributes, the following contribute to atrial fibrillation: XGN (cardiac function grade), A peak (cardiac ultrasound index), FS (rheumatic heart valve disease), FJB (interstitial lung disease), LVPWD (cardiac ultrasound index), EF (cardiac ultrasound index), FDMB1 (pulmonary valve blood flow velocity), FDMB (pulmonary valve), LAD (cardiac ultrasound index), GXB (coronary heart disease), TNB (diabetes mellitus), MCHC (hemoglobin concentration) and E peak (cardiac ultrasound index). Some of these 13 indices, such as the effect of hemoglobin concentration on atrial fibrillation, have not yet attracted enough attention in medicine.
Specifically, the root node of the decision tree is XGN, which indicates that this index strongly affects the occurrence of atrial fibrillation. When the XGN grade is greater than 1, the patient is judged to have atrial fibrillation; when it is less than or equal to 1, the A peak is considered next. When the A peak is 0, FS is considered: when FS is greater than 0, the patient is judged to have atrial fibrillation; otherwise FJB is considered. When FJB is less than or equal to 0, LVPWD is considered. When LVPWD is less than or equal to 9, the EF value (EF1 in the decision tree) is considered: when EF is less than or equal to 57, the patient is judged to be normal, otherwise to have atrial fibrillation. Returning to the right branch of LVPWD, when LVPWD is greater than 9, FDMB1 is considered: when it is less than or equal to 101, the patient is judged to have atrial fibrillation; otherwise LAD is considered, and the patient is judged to have atrial fibrillation when LAD is less than or equal to 50 and normal otherwise. Returning to the right branch of FJB, when FJB is greater than 0, GXB is considered: when GXB is less than or equal to 2, the patient is judged to be normal, otherwise to have atrial fibrillation. Returning to the right branch of FS, the patient is judged to have atrial fibrillation when FS is greater than 0. Returning to the right branch of the A peak, when A is greater than 0, TNB is considered: when TNB is less than or equal to 0, the patient is judged to be normal; otherwise FDMB is considered. When FDMB is greater than 0, the patient is judged to be normal; otherwise the E value is considered. When E is greater than 72, the patient is judged to have atrial fibrillation; otherwise the MCHC value is considered, and the patient is judged to have atrial fibrillation when MCHC is less than or equal to 338 and normal otherwise. The whole decision tree is traversed in this way. The accuracy of this model was 85.0649%.
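The branch-by-branch description above can be written down as nested conditions. The Python sketch below is one reading of that description, not code from the invention: the function name and the dictionary keys are assumptions chosen to match the index abbreviations, and the XGN > 1 branch follows the wording of claim 1. The labels f and z are the class labels used in the data.

```python
def predict_af(p):
    """Classify one patient record (a dict of index values) as 'f' or 'z'.

    'f' = atrial fibrillation, 'z' = normal. Only the keys on the path that is
    actually followed need to be present in the record.
    """
    if p["XGN"] > 1:                      # branch taken from claim 1
        return "f"
    if p["A"] <= 0:                       # A peak is 0
        if p["FS"] > 0:
            return "f"
        if p["FJB"] <= 0:
            if p["LVPWD"] <= 9:
                return "z" if p["EF"] <= 57 else "f"
            if p["FDMB1"] <= 101:
                return "f"
            return "f" if p["LAD"] <= 50 else "z"
        return "z" if p["GXB"] <= 2 else "f"
    # A peak greater than 0
    if p["TNB"] <= 0:
        return "z"
    if p["FDMB"] > 0:
        return "z"
    if p["E"] > 72:
        return "f"
    return "f" if p["MCHC"] <= 338 else "z"


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    record = {"XGN": 1, "A": 60, "TNB": 1, "FDMB": 0, "E": 70, "MCHC": 338}
    print(predict_af(record))
```

Writing the tree this way makes it easy to check that every path ends in exactly one of the two labels.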
Considering both the decision trees and their accuracy across the above experiments, the invention selects the tree of FIG. 8 as the final model. This model takes more factors into account and does so more comprehensively, it is concise and intuitive for medical staff, and it has also been approved from a medical standpoint.
The medical community has no unified, standardized model for predicting atrial fibrillation, and a hypertensive patient is more likely to develop atrial fibrillation than an ordinary person. To address this, the invention draws on the medical literature on atrial fibrillation prediction and provides a decision-tree-based atrial fibrillation prediction method. The method builds an intuitive, concise decision tree for reference in medical research. The model is built from a large amount of real medical data so that it is as comprehensive and accurate as possible; its accuracy is 85.0649%. Building the model not only mines the potential relations among the various medical indices of hypertensive patients but also identifies the indices most likely to lead to atrial fibrillation, some of which have not previously received close attention in medicine. In future work, first, the data volume will be increased so that the model generalizes better and overfitting is prevented; second, machine learning algorithms will be used for larger-scale and better classification, in order to establish a practical, standard decision tree.
Example 3:
To address the problem of identifying indices that accurately reflect atrial fibrillation, the invention provides a method for selecting indices through an atrial fibrillation artificial intelligence experiment, comprising the following steps:
S1: constructing a decision tree;
S2: adjusting parameters to optimize the decision tree;
S3: carrying out experiments on the possible values of each parameter and finally selecting the optimal experimental result, which provides the main indices for decision tree prediction.
Further, the main indices are three of the cardiac ultrasound attributes: A peak, ef and lasd.
Further, the main indices are XGN (cardiac function grade), A peak (cardiac ultrasound index), FS (rheumatic heart valve disease), FJB (interstitial lung disease), LVPWD (cardiac ultrasound index), EF (cardiac ultrasound index), FDMB1 (pulmonary valve blood flow velocity), FDMB (pulmonary valve), LAD (cardiac ultrasound index), GXB (coronary heart disease), TNB (diabetes mellitus), MCHC (hemoglobin concentration) and E peak (cardiac ultrasound index).
The decision tree is constructed as described in Examples 1 and 2.
The invention also relates to the application of the prediction decision tree in atrial fibrillation prediction.
The invention provides a method for constructing a decision tree that can be used to predict atrial fibrillation and fully sets out the construction process, so that a standard model built by a mathematical method can be established in the field of atrial fibrillation prediction. The decision tree is of important guiding significance for determining key reference indices that influence atrial fibrillation.
Example 4:
Atrial fibrillation (AF) is one of the most common clinical arrhythmias. Its prevalence in the general population is about 0.4% to 1.0% and increases with age: studies have shown a prevalence of only 0.1% in people under 55 years old and as high as 9% in people over 80. The common clinical complication of atrial fibrillation is systemic thromboembolism. Stroke is the main embolic event caused by atrial fibrillation and the complication with the highest disability rate among atrial fibrillation patients: atrial fibrillation increases the incidence of stroke 5-fold and mortality 2-fold, with ischemic stroke the main cause of the increased mortality. Atrial fibrillation is an independent risk factor for ischemic stroke, whose incidence rises with age. Other hazards of atrial fibrillation include heart failure, electrical disturbance and sudden death caused by loss of the atrial booster pump function and by an irregular, rapid ventricular rate.
Accurately predicting the occurrence of atrial fibrillation and applying effective preventive measures are an important part of atrial fibrillation treatment. At present, atrial fibrillation is diagnosed mainly by electrocardiogram and its extensions, such as the dynamic electrocardiogram, the monitored electrocardiogram and the implantable long-term electrocardiogram. In recent years, combining electrocardiogram technology with artificial intelligence has achieved great results. However, although diagnosis of atrial fibrillation based on electrocardiogram technology, which is more than 100 years old, is highly accurate, its missed-diagnosis rate is also high, especially for paroxysmal atrial fibrillation and asymptomatic atrial fibrillation, which are no less harmful than symptomatic atrial fibrillation. This technology develops a new atrial fibrillation diagnosis system based on clinical big data and artificial intelligence (AI), intended to replace traditional electrocardiographic diagnosis or, at the least, to serve as a screening and diagnosis system for atrial fibrillation patients before electrocardiographic examination and as an important supplement to the classical electrocardiographic examination.
Methods and technology: the research uses the information integration platform of the applicant's affiliated hospital, the Affiliated Zhongshan Hospital of Dalian University, to analyze all clinical, imaging, laboratory and other data of hypertensive patients, and uses big data processing methods, such as the decision tree method described in Example 3, to build an automatic intelligent diagnosis model such as a decision tree model. On this basis, the model is used to diagnose and analyze atrial fibrillation complicating hypertension, the deep learning capability of the AI system is used to further correct and continuously improve the model, and a complete artificial intelligence diagnosis system for atrial fibrillation is finally developed. By closely combining clinical big data with AI, the invention is expected to open up a new breakthrough in predicting the occurrence of AF through big data processing and AI self-learning, and to provide an important diagnostic tool for atrial fibrillation prevention and control strategies.
AI model building: using the information integration platform of the applicant's affiliated hospital, the Affiliated Zhongshan Hospital of Dalian University, big data processing is performed on the clinical data (medical history, physical examination, physical and chemical examinations and so on) of hypertensive patients registered at the hospital from January 2010 to December 2017, and a preliminary diagnosis model is established.
AI model verification: the relevant parameter data of hospitalized hypertensive patients are input into the computer, and the preliminary AI model is used to check the model's diagnostic capability (including prediction sensitivity, specificity, coincidence rate and prediction efficiency).
AI model refinement: the model is continuously corrected and refined through the AI's own deep learning capability, so that it is gradually developed and perfected.
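The verification step can be illustrated with the usual definitions of sensitivity and specificity. The Python sketch below is a minimal example under stated assumptions: the function name is hypothetical, the "coincidence rate" is read here as the overall agreement between the model's output and the clinical diagnosis, and "prediction efficiency" is not computed because the text does not define it.

```python
def diagnostic_metrics(y_true, y_pred, positive="f"):
    """Return sensitivity, specificity and overall agreement for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),        # share of true AF cases detected
        "specificity": tn / (tn + fp),        # share of normal cases cleared
        "agreement": (tp + tn) / len(pairs),  # assumed meaning of coincidence rate
    }


if __name__ == "__main__":
    # Hypothetical labels, for illustration only ('f' = AF, 'z' = normal).
    truth = ["f", "f", "f", "z", "z"]
    model = ["f", "f", "z", "z", "f"]
    print(diagnostic_metrics(truth, model))
```

For a screening tool, sensitivity is the figure tied most directly to the missed-diagnosis problem raised earlier in this example.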
While the invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (2)

1. A judgment method based on an atrial fibrillation prediction decision tree, characterized in that:
the construction method of the atrial fibrillation prediction decision tree comprises the following steps:
step 1: if all samples in the data set S belong to the same category, creating a leaf node, marking it with the corresponding category label, and stopping tree building; otherwise, proceeding to step 2;
step 2: calculating the information gain rate Gain-rate(A) of every attribute in the data set S;
step 3: selecting the attribute A with the maximum information gain rate;
step 4: establishing the attribute A as the root node of a decision tree T, where T is the decision tree to be built;
step 5: dividing the data set into several subsets according to the different values of the attribute A, and recursively executing steps 1 to 4 on each subset Sv to construct a subtree Tv, where Sv is the subset of samples whose value of attribute A is v;
step 6: adding the subtree Tv to the corresponding branch of the decision tree T;
step 7: ending the loop and obtaining the decision tree T;
the atrial fibrillation prediction decision tree is a computer-processed decision model, and the decision process is specifically as follows: if the root node of the decision tree is the A peak, which has the maximum information gain rate and whose normal range is 41 to 87, then on the first branch of the decision tree, where a denotes the value of the A peak, the patient is judged to have atrial fibrillation when a <= 0; since the data contain no negative values, this means the patient is judged to have atrial fibrillation when a = 0; when a > 0, the ef attribute is considered next, and when the ef value is smaller than 58, the patient is judged to be normal;
if the root node of the decision tree is XGN: when the XGN grade is greater than 1, the patient is judged to have atrial fibrillation; when the XGN grade is less than or equal to 1, the A peak is considered next; when the A peak is 0, FS is considered; when FS is greater than 0, the patient is judged to have atrial fibrillation, otherwise FJB is considered; when FJB is less than or equal to 0, LVPWD is considered; when LVPWD is less than or equal to 9, EF is considered, and when EF is less than or equal to 57, the patient is judged to be normal, otherwise to have atrial fibrillation; returning to the right branch of LVPWD, when LVPWD is greater than 9, the value of FDMB1 is considered, and when it is less than or equal to 101, the patient is judged to have atrial fibrillation, otherwise LAD is considered, and the patient is judged to have atrial fibrillation when LAD is less than or equal to 50 and normal otherwise; returning to the right branch of FJB, when FJB is greater than 0, GXB is considered, and when GXB is less than or equal to 2, the patient is judged to be normal, otherwise to have atrial fibrillation; returning to the right branch of FS, the patient is judged to have atrial fibrillation when FS is greater than 0; returning to the right branch of the A peak, when A is greater than 0, TNB is considered, and when TNB is less than or equal to 0, the patient is judged to be normal, otherwise FDMB is considered; when FDMB is greater than 0, the patient is judged to be normal, otherwise the E value is considered; when E is greater than 72, the patient is judged to have atrial fibrillation, otherwise the MCHC value is considered, and when MCHC is less than or equal to 338, the patient is judged to have atrial fibrillation, otherwise the patient is judged to be normal; the whole decision tree is traversed in this way;
wherein the splitting attribute is selected according to the information gain rate:
the formula of the information entropy is:
$H(S) = -\sum_{i} p(c_i) \log_2 p(c_i)$ (1)
wherein S represents the data set, $c_i$ represents the i-th class, and $p(c_i)$ represents the probability that class $c_i$ is selected;
when the decision tree is split, the information entropy of a feature attribute is calculated: assuming the feature attribute A has n different values, A divides the data set S into n small data sets $s_i$, and the probability of each small data set being selected is $p(s_i)$; by equation (1), the information entropy of each small data set $s_i$ is $H(s_i)$, so the information entropy of the feature attribute A is calculated as:
$H(A) = \sum_{i=1}^{n} p(s_i) H(s_i)$ (2)
the information gain calculation formula is:
$\mathrm{Info\_Gain}(A) = H(S) - H(A)$ (3)
the information gain ratio calculation formula is:
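The construction steps 1 to 7 and the entropy-based split criterion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of the invention. The gain-rate formula is not reproduced in the text above, so the sketch assumes the standard C4.5 definition, Info_Gain(A) divided by the split information -sum p(s_i) log2 p(s_i). It handles only discrete attribute values, and the function names, the majority-vote fallback and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S) = -sum p(c_i) log2 p(c_i) over the class labels in S."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_rate(rows, labels, attr):
    """Info_Gain(attr) / split information (assumed C4.5 definition)."""
    total = len(rows)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    h_a = sum(len(s) / total * entropy(s) for s in subsets.values())
    split_info = -sum(len(s) / total * log2(len(s) / total) for s in subsets.values())
    if split_info == 0:                  # attribute has a single value: no split
        return 0.0
    return (entropy(labels) - h_a) / split_info


def build_tree(rows, labels, attrs):
    """Build a tree as nested dicts {attr: {value: subtree}}; leaves are labels."""
    if len(set(labels)) == 1:            # step 1: all samples in one class
        return labels[0]
    if not attrs:                        # assumption: majority vote when no attributes remain
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain_rate(rows, labels, a))  # steps 2-3
    tree = {best: {}}                    # step 4: best attribute becomes the node
    rest = [a for a in attrs if a != best]
    for v in sorted({row[best] for row in rows}):                # step 5
        idx = [i for i, row in enumerate(rows) if row[best] == v]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        tree[best][v] = build_tree(sub_rows, sub_labels, rest)   # step 6
    return tree                          # step 7


if __name__ == "__main__":
    # Hypothetical toy data: 'a_zero' marks an absent A peak, 'dm' diabetes.
    rows = [{"a_zero": 1, "dm": 0}, {"a_zero": 1, "dm": 1},
            {"a_zero": 0, "dm": 0}, {"a_zero": 0, "dm": 1}]
    labels = ["f", "f", "z", "z"]
    print(build_tree(rows, labels, ["a_zero", "dm"]))
```

J48 additionally handles numeric attributes by choosing a threshold for a two-way split, which is how conditions such as "LVPWD <= 9" arise; that part is omitted here to keep the sketch short.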
2. The judgment method based on an atrial fibrillation prediction decision tree as claimed in claim 1, characterized in that:
the method for pruning the atrial fibrillation prediction decision tree comprises the following steps:
1) computing three counts of misclassified samples: the sum of the misclassified samples over all leaf nodes of the subtree Tv, denoted E1; the number of misclassified samples when the subtree Tv is pruned and replaced with a leaf node, denoted E2; and the number of misclassified samples of the largest branch of the subtree Tv, denoted E3;
2) comparing the three: if E1 is the smallest, no pruning is performed; if E2 is the smallest, pruning is performed and the subtree Tv is replaced with a leaf node; if E3 is the smallest, the subtree Tv is replaced with its largest branch.
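The comparison in step 2) amounts to choosing the option with the fewest misclassified samples. A minimal Python sketch follows; the function name and the returned strings are hypothetical, and the tie-breaking rule (prefer the simpler tree when counts are equal) is an assumption, since the claim does not say what happens on ties.

```python
def pruning_action(e1, e2, e3):
    """Pick the pruning action with the fewest misclassified samples.

    e1: errors summed over all leaves of subtree Tv (keep the subtree)
    e2: errors if Tv is replaced by a single leaf
    e3: errors of the largest branch of Tv (replace Tv with that branch)
    """
    # The middle number is only a tie-breaker: lower means a simpler result.
    # This tie order is an assumption, not part of the claim.
    candidates = [
        (e2, 0, "replace_with_leaf"),
        (e3, 1, "replace_with_largest_branch"),
        (e1, 2, "keep_subtree"),
    ]
    return min(candidates)[2]


if __name__ == "__main__":
    # Hypothetical error counts, for illustration only.
    print(pruning_action(e1=3, e2=5, e3=6))
```

Applied bottom-up to every internal node, this rule gives the pruning pass described in the claim.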
CN201811068303.1A 2018-09-13 2018-09-13 Atrial fibrillation prediction decision tree and pruning method thereof Active CN110895969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811068303.1A CN110895969B (en) 2018-09-13 2018-09-13 Atrial fibrillation prediction decision tree and pruning method thereof

Publications (2)

Publication Number Publication Date
CN110895969A CN110895969A (en) 2020-03-20
CN110895969B true CN110895969B (en) 2023-12-15

Family

ID=69785498

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant