AU2021103976A4 - Asthma diagnosis system based on decision tree and improved SMOTE algorithm - Google Patents
- Publication number
- AU2021103976A4
- Authority
- AU
- Australia
- Prior art keywords
- data
- asthma
- decision tree
- model
- diagnosis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Epidemiology (AREA)
- Physics & Mathematics (AREA)
- Pathology (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Primary Health Care (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The present disclosure relates to the technical field of data mining, and in particular to an asthma diagnosis system based on a decision tree and the improved SMOTE algorithm. The data consist of blood routine physical examination data of normal people and blood routine physical examination data of asthma patients. Particle Swarm Optimization (PSO) is used to optimize the sampling rate of the SMOTE over-sampling technique, so that an improved SMOTE over-sampling technique is obtained. This algorithm is used to over-sample the data set, and the data are then modeled and diagnosed by the decision tree algorithm. Compared with traditional diagnosis, which depends on patients' symptoms, the asthma diagnosis system can automatically determine whether a patient is suffering from asthma from his/her blood routine physical examination data, which reduces the influence of physicians' fatigue, misjudgment or inexperience and improves the efficiency of asthma diagnosis. The present disclosure can be applied to intelligent detection of asthma.
DRAWINGS
FIG. 1 (sheet 1/2): System flow chart. Model training and validation: data acquisition; data processing; oversampling processing; blood routine database (for training); construct and train decision tree model; blood routine database (for validation); decision "Does the model work?" (Yes/No); decision tree model of asthma diagnosis. Application of disease diagnosis: data acquisition; data processing; results of intelligent diagnosis by machine; auxiliary diagnosis by physician.
Description
[01] The present disclosure relates to the technical field of data mining, and in particular to an asthma diagnosis system based on a decision tree and the improved SMOTE algorithm.
[02] Bronchial asthma (asthma for short) is a chronic inflammatory disease of the airway, which involves various cells (such as eosinophils, mast cells, T lymphocytes, neutrophils, and airway epithelial cells) and cellular components. Asthma is an allergic inflammatory reaction of the airway. Its clinical manifestations in an acute attack include repeated wheezing, dyspnea, chest tightness and cough, and decreased exercise tolerance accompanied by airway hyper-responsiveness and obstruction. Asthma is a chronic respiratory disease that seriously threatens human health; it has a high incidence, cannot be cured, and seriously affects the normal work and life of patients. Many patients who are not treated in time, or who are treated incorrectly, suffer further damage to their lung function. A severe asthma attack, if not treated in time, can even endanger the patient's life.
[03] Statistically, about 300 million people in the world are suffering from asthma, and the number of affected patients is increasing exponentially. By 2025, another 100 million people may be affected by asthma. Commonly used methods for evaluating asthma, such as sputum smear observation of eosinophils, pulmonary function testing (SPIR) and the impulse oscillometry system (IOS), are difficult to perform, time-consuming, strenuous, and expensive. These detection methods require a large number of professionals with expertise and diagnostic experience, but the number of such professionals is small relative to the large patient base, which causes great fatigue among medical staff and makes misdiagnosis more likely. Moreover, because unified clinical indexes are lacking, different physicians may give different diagnosis results, which is highly restrictive and dangerous. Some patients have paroxysmal cough as their only symptom, which is often misdiagnosed as bronchitis in clinical practice, while some teenagers have chest distress and shortness of breath during exercise as their only clinical manifestation. If physicians do not know enough about asthma or have incorrect ideas about its clinical diagnosis, they can easily make a misdiagnosis or miss the diagnosis.
[04] In the present disclosure, we focus on asthma, use the blood routine data of asthma patients obtained from relevant departments of hospitals, and combine the data with data mining algorithms from machine learning to establish an asthma diagnosis model system that assists physicians in clinical diagnosis, thus achieving early diagnosis and treatment and helping patients reduce the incidence of asthma.
[05] In view of the above problems, the present disclosure provides an asthma diagnosis system based on a decision tree and the improved SMOTE algorithm, which includes a primary module of data acquisition, a primary module of oversampling processing, a primary module of decision tree, a primary training module and a primary detection module;
[06] The primary module of data acquisition is used for acquiring asthma data, preprocessing the acquired asthma data to obtain preprocessed data, and inputting the preprocessed data into the primary module of oversampling processing;
[07] The primary module of over-sampling processing is used for processing input data and randomly dividing the processed data into two groups, namely a training sample set and a validation sample set;
[08] The primary module of decision tree is used for constructing an asthma diagnosis model;
[09] The primary training module trains the constructed asthma disease diagnosis model by using the training sample set, and obtains the trained asthma diagnosis model;
[10] The primary detection module is used for loading the trained asthma diagnosis model, and validating the trained asthma diagnosis model by using the validation sample set;
[11] If the trained asthma diagnosis model has an asthma diagnosis accuracy of greater than or equal to 85% on the validation sample set, the trained asthma diagnosis model is used as the final model, and the final model is used for asthma diagnosis;
[12] Otherwise, the parameters of the constructed asthma model are adjusted, and the constructed asthma model is retrained by using the training sample set until the asthma diagnosis accuracy of the trained model on the validation sample set is greater than or equal to 85%, then the final model is obtained, and the final model is used for asthma diagnosis.
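The accept-or-retrain rule in paragraphs [11] and [12] can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the function `train_until_accurate`, the toy one-feature "model" and the candidate parameter list are all hypothetical stand-ins; only the 85% threshold comes from the text.

```python
from typing import Callable, Iterable, Optional, Tuple

def train_until_accurate(
    candidate_params: Iterable[dict],
    train: Callable[[dict], object],
    validate: Callable[[object], float],
    threshold: float = 0.85,  # acceptance threshold stated in the text
) -> Optional[Tuple[object, dict, float]]:
    """Try parameter settings in order and return the first model whose
    validation accuracy reaches the threshold, or None if none does."""
    for params in candidate_params:
        model = train(params)
        accuracy = validate(model)
        if accuracy >= threshold:
            return model, params, accuracy
    return None

# Toy stand-in for the decision tree: a single cut-off on one feature.
validation_set = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]

def toy_train(params: dict) -> float:
    # A real system would fit a tree on the training sample set here.
    return params["cutoff"]

def toy_validate(cutoff: float) -> float:
    # Fraction of validation samples classified correctly by the cut-off.
    hits = sum((x > cutoff) == bool(y) for x, y in validation_set)
    return hits / len(validation_set)

result = train_until_accurate(
    [{"cutoff": 0.0}, {"cutoff": 0.3}, {"cutoff": 0.5}], toy_train, toy_validate
)
```

In this toy run the first two settings fall short of the threshold and the third is accepted as the final model.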
[13] The present disclosure has the following beneficial effects:
[14] For asthma diagnosis in the prior art, physicians made determinations according to their own experience in combination with patients' characteristics. According to the present disclosure, physicians may carry out diagnosis simply from the patients' blood routine physical examination data, which gives physicians substantial assistance, reduces the medical burden, and makes the diagnosis faster.
[15] Fig. 1 is a structure schematic view of the system according to the present disclosure.
[16] Fig. 2 is a flow chart of the improved SMOTE algorithm according to the present disclosure.
[17] In order to make the technical schemes provided by the present disclosure clearer, the present disclosure will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present disclosure without limiting the same.
[18] As shown in Fig. 1, the present disclosure discloses an asthma diagnosis system based on a decision tree and an improved SMOTE algorithm, which includes a primary module of data acquisition, a primary module of oversampling processing, a primary module of decision tree, a primary training module and a primary detection module;
[19] The specific steps are as follows:
[20] In Step 1, the primary module of data acquisition obtains 1800 entries of blood routine physical examination data of patients and normal people from the outpatient department of Wuxi People's Hospital, 400 of which are from patients. The outpatient data mainly comprise the basic information of patients and various asthma-related detection indexes.
[21] In Step 2, the data is cleaned, including missing value cleaning, format content cleaning, logic error cleaning, non-required data cleaning, correlation verification and other steps as follows:
[22] 2.1) The missing value cleaning step includes: determining the range of missing values, calculating the missing value ratio for each field, and formulating strategies according to the missing value ratio and the significance of the field; and removing unnecessary or meaningless fields, such as a patient's physical examination serial number.
[23] 2.2) The missing values are filled in. Different filling methods are used for missing values with different features, such as filling according to a physician's experience, or filling with special values, the median, or hot-deck imputation.
[24] 2.3) Data are reacquired: when a feature is very important but its missing ratio is too high, it is necessary to contact the outpatient department to reacquire the data.
[25] 2.4) Cleaning the format content resolves the following problems: time and date values have inconsistent display formats, the content contains characters that should not be present, and the content of a field is inconsistent with what the field should contain.
[26] 2.5) The operation of cleaning logical errors is to remove some data that can be found problematic by using simple logical reasoning, so as to prevent the analysis results from deviating. The step mainly includes eliminating duplication, removing unreasonable values, correcting contradictory contents and so on.
[27] 2.6) Non-required data cleaning is to delete unnecessary fields.
[28] 2.7) The purpose of correlation verification is to ensure the correctness of the correlation among data when the data comes from multiple tables or databases, so as to prevent errors from occurring in the correlation or contradictions from occurring among data.
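Part of the cleaning described in 2.1) to 2.7) can be illustrated with a short sketch: computing the missing ratio per field, dropping fields that are mostly missing, filling the rest with the median, and eliminating duplicates. The field names, the 0.5 drop threshold and the helper names `missing_ratio` and `clean` are assumptions made for illustration; the source does not specify them.

```python
from statistics import median
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]

def missing_ratio(records: List[Record], field: str) -> float:
    """Share of records in which `field` is absent or None."""
    return sum(1 for r in records if r.get(field) is None) / len(records)

def clean(records: List[Record], drop_above: float = 0.5) -> List[Record]:
    """Drop mostly-missing fields, median-fill the rest, remove duplicates."""
    fields = sorted({name for r in records for name in r})
    # Keep only fields whose missing ratio is acceptable (hypothetical rule).
    kept = [f for f in fields if missing_ratio(records, f) <= drop_above]
    medians = {
        f: median(r[f] for r in records if r.get(f) is not None) for f in kept
    }
    filled = [
        {f: (r[f] if r.get(f) is not None else medians[f]) for f in kept}
        for r in records
    ]
    # Logical-error cleaning, reduced here to eliminating exact duplicates.
    seen, unique = set(), []
    for r in filled:
        key = tuple(r[f] for f in kept)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

raw = [
    {"age": 30, "eosinophil_ratio": 2.0, "serial": None},
    {"age": None, "eosinophil_ratio": 6.0, "serial": None},
    {"age": 50, "eosinophil_ratio": 4.0, "serial": 7},
    {"age": 50, "eosinophil_ratio": 4.0, "serial": None},
]
cleaned = clean(raw)
```

Here the `serial` field is dropped because three of four values are missing, the missing age is filled with the median of the observed ages, and the two records that become identical are merged into one.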
[29] In Step 3, the discrete data is pre-processed, which includes the following steps:
[30] Discrete data generally cannot be used by the model directly, so the features of the discrete data must first be digitized. The One-Hot encoding scheme is adopted in the present disclosure. One-Hot encoding, also known as one-bit valid encoding, uses an N-bit status register to encode N states, wherein each state has its own register bit and only one bit is valid at any time. One-Hot encoding represents categorical variables as binary vectors. It first maps each category value to an integer value, and then represents each integer value as a binary vector in which all entries are zero except the entry at the index of that integer, which is set to 1. With One-Hot encoding, the values of a discrete feature are extended into Euclidean space, and each value of the discrete feature corresponds to a point in that space. Since the calculation of distance or similarity among features is very important, and the commonly used distance or similarity calculations are performed in Euclidean space, using One-Hot encoding for discrete features makes the calculation of distance among features more reasonable.
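As a minimal sketch of the One-Hot scheme described above, the following maps each category to an integer index and then to a binary vector with a single 1. The function name `one_hot` and the example feature values are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

def one_hot(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Map each category to an integer index, then to a binary vector
    that is all zeros except for a 1 at that index."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        vectors.append(vector)
    return categories, vectors

categories, encoded = one_hot(["female", "male", "female"])
```

With this encoding any two distinct categories are the same Euclidean distance apart, which is the property the paragraph relies on when it calls the distance calculation more reasonable.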
[31] In Step 4: the primary module of over-sampling processing includes the following steps:
[32] Firstly, the K-means clustering algorithm is used to cluster the samples of the minority class into a fixed number K of clusters and to record each cluster center.
[33] Wherein:
$$E = \sum_{i=1}^{K} \sum_{x_j \in N_i} \left\| x_j - z_i \right\|^2$$
[34] In the above formula, $x_j$ represents the j-th data sample in the data set; $N_i$ represents the i-th cluster; $z_i$ represents the cluster center of the i-th cluster.
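To make the clustering criterion concrete, the sketch below evaluates the within-cluster sum of squared distances E for a given assignment of minority samples to clusters. It only computes the objective; the iterative K-means updates are omitted, and the helper names and sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def centroid(points: Sequence[Point]) -> Point:
    """Cluster center: coordinate-wise mean of the cluster members."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))

def kmeans_objective(clusters: List[List[Point]], centers: List[Point]) -> float:
    """Sum over clusters of squared Euclidean distances to the cluster center."""
    total = 0.0
    for members, center in zip(clusters, centers):
        for x in members:
            total += sum((a - b) ** 2 for a, b in zip(x, center))
    return total

# Two hypothetical minority-class clusters in a 2-D feature space.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [centroid(c) for c in clusters]
E = kmeans_objective(clusters, centers)
```

Each of the four points lies at distance 1 from its center, so E equals 4.0 for this assignment.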
[35] m sampling points are selected from the n samples nearest to the minority class sample. The sampling rate is optimized by the particle swarm optimization (PSO) algorithm.
[36] In the formula:
[37] $$v_i^d = w \, v_i^d + c_1 r_1 \left( \mathrm{pbest}_i^d - x_i^d \right) + c_2 r_2 \left( \mathrm{gbest}^d - x_i^d \right)$$
[38] $$x_i^d = x_i^d + v_i^d$$
[39] w represents the inertia factor, whose value is non-negative; i denotes the i-th particle and d denotes the d-th dimension of the particle; $x_i^d$ and $v_i^d$ are the position and velocity of particle i in dimension d; $c_1$ and $c_2$ are the acceleration coefficients; $r_1$ and $r_2$ are two random numbers in [0,1] (for different dimensions of a particle, $r_1$ and $r_2$ have different values); $\mathrm{pbest}_i$ refers to the position where particle i obtains its highest (lowest) fitness, and gbest refers to the position where the whole system obtains the highest (lowest) fitness. In this way, the optimal sampling rate may be found.
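The velocity and position updates can be sketched for a one-dimensional search over the sampling rate. This is an illustrative toy, not the patented procedure: the fitness function is a hypothetical stand-in for the validation error as a function of sampling rate, and the coefficient values, bounds and swarm size are assumptions.

```python
import random
from typing import Callable, Tuple

def pso_minimize(
    fitness: Callable[[float], float],
    lower: float,
    upper: float,
    n_particles: int = 10,
    n_iters: int = 50,
    w: float = 0.7,    # inertia factor (assumed value)
    c1: float = 1.5,   # acceleration coefficients (assumed values)
    c2: float = 1.5,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """One-dimensional PSO; returns (gbest, gbest fitness, initial best fitness)."""
    rng = random.Random(seed)
    x = [rng.uniform(lower, upper) for _ in range(n_particles)]
    v = [0.0] * n_particles
    pbest = list(x)
    pbest_fit = [fitness(p) for p in x]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g], pbest_fit[g]
    initial_best_fit = gbest_fit
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Velocity update, then position update clamped to the bounds.
            v[i] = w * v[i] + c1 * r1 * (pbest[i] - x[i]) + c2 * r2 * (gbest - x[i])
            x[i] = min(max(x[i] + v[i], lower), upper)
            f = fitness(x[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = x[i], f
                if f < gbest_fit:
                    gbest, gbest_fit = x[i], f
    return gbest, gbest_fit, initial_best_fit

def toy_fitness(rate: float) -> float:
    # Hypothetical error curve with its minimum at a sampling rate of 3.0.
    return (rate - 3.0) ** 2

best_rate, best_fit, initial_fit = pso_minimize(toy_fitness, 1.0, 5.0)
```

In a real system the fitness call would train and validate a model at the candidate sampling rate, which is far more expensive than this toy function.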
[40] After selecting the original point and the sampling rate, new minority class samples are generated.
[41] In the formula:
$$X_{new} = X + \mathrm{rand}(0,1) \times (M_i - X), \quad i = 1, 2, \ldots, N$$
[42] $X_{new}$ is a newly inserted sample; X is the selected original sample data; rand(0,1) represents a random number between 0 and 1; $M_i$ is the best sampling point optimized by PSO among the nearest samples of the original sample data X.
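The interpolation formula can be illustrated directly: each synthetic sample lies on the line segment between the original sample X and one selected neighbour M_i. The function name `synthesize` and the sample coordinates are illustrative; how the neighbours are chosen (K-means and PSO in the text) is outside this sketch.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def synthesize(x: Point, neighbours: Sequence[Point], rng: random.Random) -> List[Point]:
    """Create one synthetic sample per neighbour: X + rand(0,1) * (M_i - X)."""
    new_points = []
    for m in neighbours:
        gap = rng.random()  # random number in [0, 1)
        new_points.append(tuple(xi + gap * (mi - xi) for xi, mi in zip(x, m)))
    return new_points

x = (1.0, 2.0)                         # selected original minority sample
neighbours = [(3.0, 2.0), (1.0, 6.0)]  # hypothetical selected sampling points
new_points = synthesize(x, neighbours, random.Random(42))
```

Because the random factor lies between 0 and 1, every coordinate of a synthetic point stays between the corresponding coordinates of X and M_i.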
[43] In Step 5, the primary decision tree module includes the following steps:
[44] 1) In the attribute space of the training samples, a region is segmented into two sub-regions and the output value of each sub-region is determined. By recursively executing this step, a decision tree is constructed. The optimal segmentation variable j and segmentation point s are selected by solving:
[45] $$\min_{j,s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right]$$
[46] $R_1$ and $R_2$ represent the segmented sub-regions. The variable j is traversed, and for each fixed segmentation variable j the segmentation point s is scanned, so as to find the pair (j, s) for which the above formula attains the minimum error.
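[47] 2) The region is segmented with the selected pair (j, s) and the corresponding output values are determined:
[48] $$R_1(j,s) = \{ x \mid x^{(j)} \le s \}, \quad R_2(j,s) = \{ x \mid x^{(j)} > s \}$$
[49] $$c_m = \frac{1}{N_m} \sum_{x_i \in R_m(j,s)} y_i, \quad x \in R_m, \ m = 1, 2$$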
[50] 3) Steps 1) and 2) are applied repeatedly to the two sub-regions until the stopping conditions are met.
[51] 4) The feature space is segmented into M regions $R_1, R_2, R_3, \ldots, R_M$ and the decision tree is constructed:
[52] $$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m)$$
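Steps 1) to 4) can be sketched as a small recursive regression tree: scan every variable j and split point s, keep the pair with the smallest summed squared error, split, and recurse. This is a minimal illustration under assumptions (stop when a node is pure or cannot be split; dictionary-based tree nodes; hypothetical data), not the system's actual implementation.

```python
from typing import List, Optional, Sequence, Tuple

def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)

def sse(values: Sequence[float]) -> float:
    """Squared error of a region around its mean output value c_m."""
    c = mean(values)
    return sum((y - c) ** 2 for y in values)

def best_split(X: List[Tuple[float, ...]], y: List[float]) -> Optional[Tuple[float, int, float]]:
    """Scan every variable j and split point s; return (error, j, s) or None."""
    best = None
    for j in range(len(X[0])):
        for s in sorted({x[j] for x in X}):
            left = [yi for xi, yi in zip(X, y) if xi[j] <= s]
            right = [yi for xi, yi in zip(X, y) if xi[j] > s]
            if not left or not right:
                continue  # a valid split needs two non-empty regions
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, s)
    return best

def build(X: List[Tuple[float, ...]], y: List[float]) -> dict:
    """Recursively split until a node is pure or cannot be split further."""
    split = best_split(X, y) if len(set(y)) > 1 else None
    if split is None:
        return {"value": mean(y)}
    _, j, s = split
    left = [(xi, yi) for xi, yi in zip(X, y) if xi[j] <= s]
    right = [(xi, yi) for xi, yi in zip(X, y) if xi[j] > s]
    return {
        "j": j,
        "s": s,
        "left": build([x for x, _ in left], [v for _, v in left]),
        "right": build([x for x, _ in right], [v for _, v in right]),
    }

def predict(tree: dict, x: Tuple[float, ...]) -> float:
    """f(x): output value c_m of the region R_m that contains x."""
    while "value" not in tree:
        tree = tree["left"] if x[tree["j"]] <= tree["s"] else tree["right"]
    return tree["value"]

# Hypothetical data: feature 0 is uninformative, feature 1 separates the labels.
X = [(5.0, 1.0), (3.0, 2.0), (4.0, 3.0), (5.0, 10.0), (3.0, 11.0), (4.0, 12.0)]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
tree = build(X, y)
```

On this data the root splits on feature 1 at 3.0, and both resulting regions are pure, so the tree has a single split and two leaves.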
[53] In Step 6, the MEP post-pruning algorithm is adopted, which includes the following steps:
[54] 1) If there are K classes of samples in total, the probability that a training sample at the decision tree node t belongs to class i is as follows:
[55] $$P_i(t) = \frac{n_i(t) + P_{ai} \cdot m}{n(t) + m}$$
[56] Wherein: $P_{ai}$ is the prior probability of the class-i samples, namely the proportion of class-i samples in the whole data set; m is the influence factor of the prior probability $P_{ai}$ on the posterior probability, and m is not a fixed value. The prediction error rate E(t) of the node t is then defined by the following formula:
[57] $$E(t) = \min_i \{ 1 - P_i(t) \} = \min_i \left\{ \frac{n(t) - n_i(t) + (1 - P_{ai}) \cdot m}{n(t) + m} \right\}$$
[58] If the prior probabilities of all classes are the same, namely $P_{ai} = 1/K$ (i = 1, 2, ..., K), and m = K, then E(t) can be expressed as
[59] $$E(t) = \frac{n(t) - n_c(t) + K - 1}{n(t) + K}$$
[60] In the above formulas: n(t) is the total number of samples at the node t; $n_i(t)$ is the number of class-i samples at the node t; $n_c(t)$ is the number of samples of the primary class at the node t.
[61] Finally, the error E(t) of each non-leaf node and the error $E(T_t)$ of the sub-tree $T_t$ rooted at that node are calculated respectively. If E(t) is greater than $E(T_t)$, the sub-tree is retained; otherwise the sub-tree is cut off.
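The pruning test can be sketched with exact fractions. The node error follows the m-estimate form given above; the comparison rule (keep the sub-tree only if the node's own error exceeds the weighted error of its children) is the usual minimum-error-pruning rule and is assumed here. Children are treated as leaves for simplicity, and all counts are hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence

def node_error(counts: Sequence[int], priors: Sequence[Fraction], m: int) -> Fraction:
    """E(t) = min_i (n(t) - n_i(t) + (1 - P_ai) * m) / (n(t) + m)."""
    n = sum(counts)
    return min(
        (n - n_i + (1 - p_i) * m) / Fraction(n + m)
        for n_i, p_i in zip(counts, priors)
    )

def subtree_error(children: List[Sequence[int]], priors: Sequence[Fraction], m: int) -> Fraction:
    """Error of the sub-tree: child errors weighted by each child's sample share."""
    total = sum(sum(child) for child in children)
    return sum(
        Fraction(sum(child), total) * node_error(child, priors, m)
        for child in children
    )

def should_prune(parent: Sequence[int], children: List[Sequence[int]],
                 priors: Sequence[Fraction], m: int) -> bool:
    """Prune unless the node's own error is greater than its sub-tree's error."""
    return node_error(parent, priors, m) <= subtree_error(children, priors, m)

priors = [Fraction(1, 2), Fraction(1, 2)]  # equal priors, K = 2
m = 2                                      # m = K, as in the special case above

parent = [6, 4]                            # class counts at node t
useful_split = [[6, 0], [0, 4]]            # children separate the classes
useless_split = [[3, 2], [3, 2]]           # children look like the parent
```

With equal priors and m = K the node error reduces to (n(t) - n_c(t) + K - 1) / (n(t) + K), which for the parent node is 5/12; the split that separates the classes is kept and the uninformative one is pruned.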
[62] In Step 7: the system is constructed and the visual design is executed, including the following steps:
[63] The trained model is used to construct the system, and a visual operation interface is designed. After the user inputs his/her blood routine data into the system, the system diagnoses whether he/she is suffering from asthma according to each entry of the user's data. After testing on a large amount of data, the validation accuracy of the system reaches more than 96.5%, which is valuable in practical application.
[64] The above description only aims at providing preferred embodiments of the present disclosure, but not limiting the present disclosure in other forms. Anyone skilled in this art may use the technical content disclosed above to change or modify the embodiments herein into equivalent embodiments with equivalent variations to be applied in other fields. However, any simple modification, equivalent variation and modification made to the above embodiment according to the technical essence of the present disclosure without departing from the technical scheme content thereof still falls within the claimed scope of the technical scheme of the present disclosure.
Claims (4)
1. An asthma diagnosis system based on a decision tree and the improved SMOTE algorithm, comprising a primary module of data acquisition, a primary module of oversampling processing, a primary module of decision tree, a primary training module and a primary detection module;
The primary module of data acquisition is used for acquiring physical examination data of blood routine, preprocessing the acquired data to obtain preprocessed data, and inputting the preprocessed data into the primary module of oversampling processing;
The primary module of over-sampling processing is used for processing input data and dividing the processed and balanced data into two groups, namely a training sample set and a validation sample set;
The primary module of over-sampling processing consists of a PSO optimization module, a newly generated sample detection module and a correlation sorting module;
The PSO optimization module implements a SMOTE over-sampling method based on the PSO algorithm; in order to improve the accuracy of the model diagnosis, it is necessary to over-sample the minority-class asthma samples; to address the blindness of neighbor selection caused by the fixed sampling rate of traditional SMOTE, PSO is used herein to optimize the over-sampling rate of SMOTE and select an optimal sampling rate;
The newly generated sample detection module addresses the fuzzy boundary issue of the sample points newly generated by SMOTE, and frames a space centered on each newly generated point; if the ratio of minority-class samples to majority-class samples in that space is less than 1/2, the newly generated sample is considered a "garbage point" and is discarded; otherwise, it is retained;
The correlation sorting module performs feature selection on the whole data set including the generated data, sorts the features according to the correlation among the data, and selects the features ranked before the median as the data set for training the decision tree model;
The primary module of decision tree is used for constructing an asthma diagnosis model; as asthma diagnosis is a binary classification issue and the feature values are continuous, for which the CART regression tree algorithm is suitable, the CART regression tree algorithm is adopted. Moreover, most of the data sets contain little data owing to the unbalanced distribution of samples, and the ID3 and C4.5 algorithms respectively use information gain and information gain ratio for node calculations, which leads node selection to favor multi-valued features, thereby affecting the accuracy. Therefore, the CART regression tree algorithm can better deal with continuous feature values, and it is more advantageous when the mean squared error is used as the criterion for selecting nodes. Although the pre-pruning algorithm is simple, it may lose important information, so the MEP post-pruning algorithm is adopted. The MEP post-pruning algorithm requires no additional pruning set, so it can be applied in a wider range. Firstly, the K-fold cross-validation method is introduced to select the optimal influence factor m, and then m is substituted into the MEP algorithm to prune the original decision tree. In this way, a more accurate and precise decision tree can be obtained, and the influence characteristics of the decision tree can be retained at the same time;
The primary training module trains the constructed asthma disease diagnosis model by using the training sample set, and obtains the trained asthma diagnosis model; the specific process of this step is as follows: cross-validation and grid search are used to construct the decision tree model, wherein is selected as the fold number of the cross-validation of training set and testing set, and the ratio of training set to testing set is 4:1. The training set is used for model training and the testing set is used for model checking. The value range of each parameter is divided into grid cells, and the results for different parameter values are compared to find the globally optimal or nearly globally optimal target value and parameter solution;
The primary detection module is used for loading the trained asthma diagnosis model, and validating the trained asthma diagnosis model by using the validation sample set;
If the trained asthma diagnosis model has an asthma diagnosis accuracy of greater than or equal to 85% on the validation sample set, the trained asthma diagnosis model is used as the final model, and the final model is used for asthma diagnosis;
Otherwise, the parameters of the constructed asthma model are adjusted, and the constructed asthma model is retrained by using the training sample set until the asthma diagnosis accuracy of the trained model on the validation sample set is greater than or equal to 85%, then the final model is obtained, and the final model is used for asthma diagnosis.
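The "garbage point" check performed by the newly generated sample detection module in claim 1 can be sketched as follows. The claim does not say how the space around a new point is framed, so this sketch assumes a Euclidean ball of a hypothetical radius and reads the criterion as the ratio of minority to majority samples inside that ball; a point with no majority neighbours is kept.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]

def keep_synthetic(point: Point, minority: Sequence[Point],
                   majority: Sequence[Point], radius: float) -> bool:
    """Keep a synthetic point unless minority/majority inside the ball is below 1/2."""
    def inside(p: Point) -> bool:
        # Squared Euclidean distance compared with the squared radius.
        return sum((a - b) ** 2 for a, b in zip(p, point)) <= radius ** 2

    n_minority = sum(inside(p) for p in minority)
    n_majority = sum(inside(p) for p in majority)
    if n_majority == 0:
        return True  # no majority samples nearby: nothing blurs the boundary
    return n_minority / n_majority >= 0.5

# Hypothetical 2-D samples.
minority = [(0.1, 0.0), (0.0, 0.2), (5.5, 5.0)]
majority = [(0.3, 0.0), (5.0, 5.2), (5.1, 5.0), (4.9, 5.0)]

kept = keep_synthetic((0.0, 0.0), minority, majority, radius=1.0)
discarded = keep_synthetic((5.0, 5.0), minority, majority, radius=1.0)
```

The first synthetic point sits among minority samples and is kept; the second is surrounded mostly by majority samples and is discarded.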
2. The asthma diagnosis system based on a decision tree and the improved SMOTE algorithm according to claim 1, wherein the primary module of data acquisition acquires blood routine physical examination data from hospitals, wherein the physical examination data of asthma patients are taken as positive samples, and a large number of physical examination data of people not suffering from asthma are taken as negative samples. Each examinee is taken as a sample, and each sample has 23 features as follows: gender, age, basophil ratio, basophil count, eosinophil ratio, eosinophil count, HCT, hemoglobin, lymphocyte ratio, lymphocyte count, average erythrocyte hemoglobin content, average erythrocyte hemoglobin concentration, average erythrocyte volume, monocyte ratio, monocyte count, average platelet volume, neutrophil ratio, neutrophil count, PCT, platelet distribution width, platelet count, red blood cell count, red blood cell distribution width, white blood cell count, diagnosis result, etc.
3. The asthma diagnosis system based on a decision tree and the improved SMOTE algorithm according to claim 1, wherein the primary module of over-sampling processing comprises the following processing steps: Firstly, the K-means clustering algorithm is used to cluster the minority class samples into a fixed number K of clusters, and each cluster center is recorded, wherein:

$E = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - z_j \rVert^2$

In the above formula, $x_i$ represents the i-th data sample in the data set; $C_j$ represents the j-th cluster; $z_j$ represents the cluster center of the j-th cluster.
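As an illustration of the clustering step, the sketch below runs a plain K-means loop on a handful of two-dimensional points and evaluates the within-cluster squared-distance objective E. The initialization (the first K points serve as initial centers), the fixed iteration count, and the toy points are assumptions made for the example and are not taken from the claim.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain K-means; assumes the first k points are acceptable initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each sample to its nearest cluster center.
            j = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:
                # Move the center to the mean of its members.
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def objective(points, centers):
    """E: sum over samples of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, z) for z in centers) for p in points)


# Toy minority-class samples forming two obvious groups.
minority = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centers = kmeans(minority, k=2)
E = objective(minority, centers)
```

K-means only finds a local optimum, so a poor choice of initial centers can leave E far from its minimum; in practice several initializations are usually compared.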
m sampling points are selected from the n samples nearest to the minority class sample. The sampling rate is optimized by the particle swarm optimization (PSO) algorithm. In the formula:

$v_i^d = w \times v_i^d + c_1 \times r_1 \times (pbest_i^d - x_i^d) + c_2 \times r_2 \times (gbest^d - x_i^d)$

$x_i^d = x_i^d + v_i^d$

w represents the inertia factor, whose value is non-negative; i denotes the i-th particle and d denotes the d-th dimension of the particle. $r_1$ and $r_2$ are two random numbers in [0,1] (for different dimensions of a particle, $r_1$ and $r_2$ take different values); $pbest_i$ refers to the position where the i-th particle obtains its highest (lowest) fitness, and $gbest$ refers to the position where the whole system obtains the highest (lowest) fitness. In this way, the optimal sampling rate can be found. After the original point and the sampling rate have been selected, new minority class samples are generated. In the formula:
$X_{new} = X + \mathrm{rand}(0,1) \times (M_i - X), \quad i = 1, 2, \ldots, N$

$X_{new}$ is a newly inserted sample; X is the selected original sample data; rand(0,1) represents a random number between 0 and 1; $M_i$ is the best sampling point optimized by PSO among the nearest samples of the original sample data X.
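The two update rules of this claim, the PSO velocity/position update and the interpolation that creates a new sample, can be sketched as below. The claim does not state the fitness function, the swarm size, or the values of w, c1 and c2, so the quadratic toy fitness and all numeric settings here are assumptions chosen only to make the example run; `pso_minimize` and `synthesize` are hypothetical names.

```python
import random


def pso_minimize(fitness, dim, n_particles=20, iterations=60,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` with the velocity and position updates given in the claim."""
    rng = random.Random(seed)
    xs = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_fit = [fitness(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                # r1 and r2 are redrawn for every dimension of every particle.
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            fit = fitness(xs[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = xs[i][:], fit
    return gbest, gbest_fit


def synthesize(x, m, rng):
    """X_new = X + rand(0,1) * (M - X): a point on the segment between X and M."""
    gap = rng.random()
    return [xi + gap * (mi - xi) for xi, mi in zip(x, m)]


# Toy fitness (an assumption): pretend the best sampling rate is 0.3.
best_rate, best_fit = pso_minimize(lambda p: (p[0] - 0.3) ** 2, dim=1)

# Interpolate between an original minority sample X and a chosen sampling point M.
x_new = synthesize([1.0, 2.0], [2.0, 4.0], random.Random(1))
```

Because a single random factor is applied to every coordinate, each synthetic sample lies on the straight line between the original sample and the chosen sampling point.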
4. The asthma diagnosis system based on a decision tree and the improved SMOTE algorithm according to claim 1, wherein the primary decision tree module comprises the following processing steps: after the positive and negative samples are balanced, a CART regression tree is generated; the MEP post-pruning algorithm is then applied to the generated decision tree.
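The post-pruning step can be illustrated with a short sketch, assuming that MEP here denotes minimum error pruning in the Niblett-Bratko form, where a node with N samples, n_c of them in its majority class, and k classes has expected error (N - n_c + k - 1) / (N + k). The dictionary-based tree layout and the class counts are hypothetical and serve only this example.

```python
def static_error(class_counts):
    """Expected error if the node is turned into a leaf: (N - n_c + k - 1) / (N + k)."""
    n = sum(class_counts)
    k = len(class_counts)
    return (n - max(class_counts) + k - 1) / (n + k)


def prune(node):
    """Prune bottom-up in place; return the expected error of the (pruned) subtree.

    A node is a dict {'counts': [per-class sample counts], 'children': [nodes]}.
    """
    leaf_error = static_error(node['counts'])
    if not node['children']:
        return leaf_error
    n = sum(node['counts'])
    # Backed-up error: children's errors weighted by their share of the samples.
    backed_up = sum(sum(c['counts']) / n * prune(c) for c in node['children'])
    if leaf_error <= backed_up:
        node['children'] = []  # the split does not reduce the expected error
        return leaf_error
    return backed_up


def leaf(counts):
    return {'counts': counts, 'children': []}


# A useful split: the children are much purer than the parent, so it is kept.
useful = {'counts': [10, 10], 'children': [leaf([9, 1]), leaf([1, 9])]}
# A useless split: the children are no purer than the parent, so it is pruned.
useless = {'counts': [18, 2], 'children': [leaf([9, 1]), leaf([9, 1])]}

useful_error = prune(useful)
useless_error = prune(useless)
```

For the first tree the backed-up error (about 0.167) is lower than the leaf error (0.5), so the split is kept; for the second tree the leaf error (3/22) is lower than the backed-up error, so the subtree collapses into a leaf.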
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110302072.1 | 2021-03-22 | ||
CN202110302072.1A CN112951413B (en) | 2021-03-22 | 2021-03-22 | Asthma diagnosis system based on decision tree and improved SMOTE algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021103976A4 true AU2021103976A4 (en) | 2021-09-09 |
Family
ID=76227537
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021103976A Ceased AU2021103976A4 (en) | 2021-03-22 | 2021-07-08 | Asthma diagnosis system based on decision tree and improved SMOTE algorithm |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN112951413B (en) |
AU (1) | AU2021103976A4 (en) |
WO (1) | WO2022198761A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114611616A (en) * | 2022-03-16 | 2022-06-10 | 吕少岚 | Unmanned aerial vehicle intelligent fault detection method and system based on integrated isolated forest |
CN115169556A (en) * | 2022-07-25 | 2022-10-11 | 美的集团(上海)有限公司 | Model pruning method and device |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114091026A (en) * | 2021-11-25 | 2022-02-25 | 云南电网有限责任公司信息中心 | Integrated learning-based network abnormal intrusion detection method and system |
CN116434950B (en) * | 2023-06-05 | 2023-08-29 | 山东建筑大学 | Diagnosis system for autism spectrum disorder based on data clustering and ensemble learning |
CN117198517B (en) * | 2023-06-27 | 2024-04-30 | 安徽省立医院(中国科学技术大学附属第一医院) | Modeling method of motion reactivity assessment and prediction model based on machine learning |
CN117316295A (en) * | 2023-09-13 | 2023-12-29 | 哈尔滨工业大学 | Endocrine disease cell identification method based on cell heterogeneity gene and pathway function |
CN117637154B (en) * | 2024-01-27 | 2024-03-29 | 南通大学附属医院 | Nerve internal department severe index prediction method and system based on optimization algorithm |
CN117743957B (en) * | 2024-02-06 | 2024-05-07 | 北京大学第三医院(北京大学第三临床医学院) | Data sorting method and related equipment of Th2A cells based on machine learning |
CN117766155B (en) * | 2024-02-22 | 2024-05-10 | 中国人民解放军海军青岛特勤疗养中心 | Dynamic blood pressure medical data processing system based on artificial intelligence |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930856A (en) * | 2016-03-23 | 2016-09-07 | 深圳市颐通科技有限公司 | Classification method based on improved DBSCAN-SMOTE algorithm |
JP2020004178A (en) * | 2018-06-29 | 2020-01-09 | ルネサスエレクトロニクス株式会社 | Learning model evaluation method, learning method, device, and program |
CN109147949A (en) * | 2018-08-16 | 2019-01-04 | 辽宁大学 | A method of based on post-class processing come for detecting teacher's sub-health state |
CN111145902A (en) * | 2019-12-06 | 2020-05-12 | 江苏大学 | Asthma diagnosis method based on improved artificial neural network |
CN112102945B (en) * | 2020-11-09 | 2021-02-05 | 电子科技大学 | Device for predicting severe condition of COVID-19 patient |
2021
- 2021-03-22 CN CN202110302072.1A patent/CN112951413B/en active Active
- 2021-05-10 WO PCT/CN2021/092681 patent/WO2022198761A1/en active Application Filing
- 2021-07-08 AU AU2021103976A patent/AU2021103976A4/en not_active Ceased
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114611616A (en) * | 2022-03-16 | 2022-06-10 | 吕少岚 | Unmanned aerial vehicle intelligent fault detection method and system based on integrated isolated forest |
CN114611616B (en) * | 2022-03-16 | 2023-02-07 | 吕少岚 | Unmanned aerial vehicle intelligent fault detection method and system based on integrated isolated forest |
CN115169556A (en) * | 2022-07-25 | 2022-10-11 | 美的集团(上海)有限公司 | Model pruning method and device |
CN115169556B (en) * | 2022-07-25 | 2023-08-04 | 美的集团(上海)有限公司 | Model pruning method and device |
Also Published As
Publication number | Publication date |
---|---|
CN112951413B (en) | 2023-07-21 |
WO2022198761A1 (en) | 2022-09-29 |
CN112951413A (en) | 2021-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2021103976A4 (en) | Asthma diagnosis system based on decision tree and improved SMOTE algorithm | |
CN109350032B (en) | Classification method, classification system, electronic equipment and storage medium | |
CN107066791A (en) | A kind of aided disease diagnosis method based on patient's assay | |
CN110246577B (en) | Method for assisting gestational diabetes genetic risk prediction based on artificial intelligence | |
CN111951965B (en) | Panoramic health dynamic monitoring and predicting system based on time sequence knowledge graph | |
CN113274031B (en) | Arrhythmia classification method based on depth convolution residual error network | |
CN111145902A (en) | Asthma diagnosis method based on improved artificial neural network | |
Inan et al. | A hybrid probabilistic ensemble based extreme gradient boosting approach for breast cancer diagnosis | |
CN113470816A (en) | Machine learning-based diabetic nephropathy prediction method, system and prediction device | |
WO2021073255A1 (en) | Time series clustering-based medication reminder method and related device | |
CN112652398A (en) | New coronary pneumonia severe prediction method and system based on machine learning algorithm | |
CN108346474A (en) | The electronic health record feature selection approach of distribution within class and distribution between class based on word | |
CN115691788A (en) | Dual attention coupling network diabetes classification system based on heterogeneous data | |
CN116564521A (en) | Chronic disease risk assessment model establishment method, medium and system | |
CN109907751B (en) | Laboratory chest pain data inspection auxiliary identification method based on artificial intelligence supervised learning | |
CN113674824B (en) | Disease coding method and system based on regional medical big data | |
Sari et al. | Best performance comparative analysis of architecture deep learning on ct images for lung nodules classification | |
Zhang et al. | A deep Bayesian neural network for cardiac arrhythmia classification with rejection from ECG recordings | |
Chandra et al. | Application Of Machine Learning K-Nearest Neighbour Algorithm To Predict Diabetes | |
CN111261283B (en) | Electrocardiosignal deep neural network modeling method based on pyramid convolution layer | |
CN117195027A (en) | Cluster weighted clustering integration method based on member selection | |
Ali et al. | Cardiovascular disease detection using multiple machine learning algorithms and their performance analysis | |
Xu et al. | Hybrid label noise correction algorithm for medical auxiliary diagnosis | |
Ahouz et al. | Extracting rules for diagnosis of diabetes using genetic programming | |
Chudacek et al. | Comparison of seven approaches for holter ECG clustering and classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |