AU2021103976A4

AU2021103976A4 - Asthma diagnosis system based on decision tree and improved SMOTE algorithm

Info

Publication number: AU2021103976A4
Application number: AU2021103976A
Authority: AU
Inventors: Wen Chen; Yubao CUI; Zhifeng Liu; Ya Ma; Limin Xia; Conghua Zhou
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2021-03-22
Filing date: 2021-07-08
Publication date: 2021-09-09
Anticipated expiration: 2029-07-08
Also published as: CN112951413A; WO2022198761A1; CN112951413B

Abstract

The present disclosure relates to the technical field of data mining, and in particular to an asthma diagnosis system based on a decision tree and an improved SMOTE algorithm. The data is composed of blood routine physical examination data of normal people and blood routine physical examination data of asthma patients. Particle Swarm Optimization (PSO) is used to optimize the sampling rate of the SMOTE over-sampling technique, so as to obtain an improved SMOTE over-sampling technique. This algorithm is used to over-sample the data set, and the data is then modeled and diagnosed by the decision tree algorithm. Compared with traditional diagnosis, which depends on the symptoms of patients, the asthma diagnosis system can automatically determine whether a patient suffers from asthma according to his/her blood routine physical examination data, which reduces the influence of fatigue, misjudgment or inexperience of physicians and improves the efficiency of asthma diagnosis. The present disclosure can be applied to intelligent detection of asthma.

Description

DRAWINGS

FIG. 1: Flow chart of the system. Model training and validation: data acquisition, data processing, oversampling processing, blood routine database (for training), construction and training of the decision tree model, and validation against the blood routine database (for validation); if the model does not work it is retrained, and if it works it becomes the decision tree model of asthma diagnosis. Application of disease diagnosis: data acquisition, data processing, results of intelligent diagnosis by machine, auxiliary diagnosis by physician.

ASTHMA DIAGNOSIS SYSTEM BASED ON DECISION TREE AND IMPROVED SMOTE ALGORITHM

TECHNICAL FIELD

[01] The present disclosure relates to the technical field of data mining, and in particular to an asthma diagnosis system based on a decision tree and the improved SMOTE algorithm.

BACKGROUND ART

[02] Bronchial asthma (asthma for short) is a chronic inflammatory disease of the airway, which involves various cells (such as eosinophils, mast cells, T lymphocytes, neutrophils, and airway epithelial cells) and cellular components. Asthma is an allergic inflammatory reaction of the airway. Its clinical manifestations in an acute attack include recurrent wheezing, dyspnea, chest tightness and cough, and decreased exercise tolerance accompanied by airway hyper-responsiveness and obstruction. Asthma is a chronic respiratory disease that seriously threatens human health; it has a high incidence, cannot be cured, and seriously affects the work and daily life of patients. Many patients who do not receive treatment in time, or who are treated incorrectly, suffer further damage to their lung function. A severe asthma attack, if not treated in time, can even endanger the patient's life.

[03] Statistically, about 300 million people in the world are suffering from asthma, and the number of affected patients is increasing exponentially. By 2025, another 100 million people may be affected by asthma. Commonly used methods for evaluating asthma, such as sputum smear observation of eosinophils, pulmonary function testing (SPIR) and the impulse oscillometry system (IOS), are difficult to perform, time-consuming, laborious, and expensive. These methods require a large number of professionals with expertise and diagnostic experience, but the number of such professionals is small relative to the large patient population, which causes great fatigue among medical staff and makes misdiagnosis more likely. Moreover, because unified clinical indexes are lacking, different physicians may give different diagnosis results, which is highly restrictive and dangerous. In some patients paroxysmal cough is the only symptom, which is often misdiagnosed as bronchitis in the clinic, while in some teenagers chest tightness and shortness of breath during exercise are the only clinical manifestations. If physicians do not know enough about asthma or have incorrect ideas about clinical diagnosis, misdiagnosis or missed diagnosis can easily occur.

[04] In the present disclosure, we focus on asthma, use the blood routine data of asthma patients obtained from relevant departments of hospitals, and combine the data with related data mining algorithms of machine learning to establish an asthma diagnosis model system, so as to help physicians working on clinical diagnosis, thus achieving early diagnosis and treatment and helping patients reduce the incidence of asthma.

SUMMARY

[05] In view of the above problems, the present disclosure provides an asthma diagnosis system based on a decision tree and the improved SMOTE algorithm, which includes a primary module of data acquisition, a primary module of oversampling processing, a primary module of decision tree, a primary training module and a primary detection module;

[06] The primary module of data acquisition is used for acquiring asthma data, preprocessing the acquired asthma data to obtain preprocessed data, and inputting the preprocessed data into the primary module of oversampling processing;

[07] The primary module of over-sampling processing is used for processing input data and randomly dividing the processed data into two groups, namely a training sample set and a validation sample set;

[08] The primary module of decision tree is used for constructing an asthma diagnosis model;

[09] The primary training module trains the constructed asthma disease diagnosis model by using the training sample set, and obtains the trained asthma diagnosis model;

[10] The primary detection module is used for loading the trained asthma diagnosis model, and validating the trained asthma diagnosis model by using the validation sample set;

[11] If the trained asthma diagnosis model has an asthma diagnosis accuracy of greater than or equal to 85% on the validation sample set, the trained asthma diagnosis model is used as the final model, and the final model is used for asthma diagnosis;

[12] Otherwise, the parameters of the constructed asthma model are adjusted, and the constructed asthma model is retrained by using the training sample set until the asthma diagnosis accuracy of the trained model on the validation sample set is greater than or equal to 85%, then the final model is obtained, and the final model is used for asthma diagnosis.
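The accept-or-retrain rule of paragraphs [11] and [12] can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the disclosed implementation: `train_fn`, `candidate_params` and the toy classifier `toy_train` are hypothetical names introduced here, and the disclosure does not say how the parameters are adjusted between rounds.

```python
def accuracy(model, samples):
    """Fraction of (features, label) pairs that the model classifies correctly."""
    return sum(1 for x, label in samples if model(x) == label) / len(samples)


def train_until_accurate(train_fn, candidate_params, train_set, val_set, threshold=0.85):
    """Try parameter settings in turn; accept the first model whose
    validation accuracy is greater than or equal to the threshold."""
    for params in candidate_params:
        model = train_fn(train_set, params)
        acc = accuracy(model, val_set)
        if acc >= threshold:
            return model, params, acc
    raise RuntimeError("no candidate parameters reached the required accuracy")


def toy_train(train_set, cutoff):
    """Hypothetical stand-in for model training: a one-feature threshold rule.
    It ignores train_set and only exists to exercise the loop."""
    return lambda x: int(x[0] > cutoff)
```

Here the candidate parameter list plays the role of the "parameter adjustment" step; any other search strategy could be substituted without changing the acceptance test.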

[13] The present disclosure has the following beneficial effects:

[14] In the prior art, physicians diagnose asthma according to their own experience in combination with the characteristics of the patient. According to the present disclosure, a diagnosis can be made simply from the patient's blood routine physical examination data, which provides substantial assistance to physicians, reduces the medical burden, and makes diagnosis faster.

BRIEF DESCRIPTION OF THE DRAWINGS

[15] Fig. 1 is a structure schematic view of the system according to the present disclosure.

[16] Fig. 2 is a flow chart of the improved SMOTE algorithm according to the present disclosure.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[17] In order to make the technical schemes provided by the present disclosure clearer, the present disclosure will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present disclosure without limiting the same.

[18] As shown in Fig. 1, the present disclosure discloses an asthma diagnosis system based on a decision tree and an improved SMOTE algorithm, which includes a primary module of data acquisition, a primary module of oversampling processing, a primary module of decision tree, a primary training module and a primary detection module.

[19] The specific steps are as follows:

[20] In Step 1, the primary module of data acquisition obtains 1800 entries of blood routine physical examination data of patients and normal people from the outpatient department of Wuxi People's Hospital, including 400 patients, and the outpatient data mainly relates to the basic information of patients and various asthma-related detection indexes.

[21] In Step 2, the data is cleaned, including missing value cleaning, format content cleaning, logic error cleaning, non-required data cleaning, correlation verification and other steps as follows:

[22] 2.1) The missing value cleaning step includes: determining a missing value range, calculating a missing value ratio for each field, and formulating strategies according to the missing value ratio and field significance; removing unnecessary fields and deleting some meaningless fields, such as a patient's physical examination serial number.

[23] 2.2) The missing values are filled in. Different filling methods are used for missing values of different features, such as filling according to a physician's experience, or filling with special values, the median, or hot-deck imputation.

[24] 2.3) Reacquiring data: where a feature is very important but its missing ratio is too high, it is necessary to contact the outpatient department to reacquire the data.

[25] 2.4) Cleaning the format content addresses the following problems: time and date values have inconsistent display formats, the content contains characters that should not be present, and the content of a field is inconsistent with what the field should contain.

[26] 2.5) The operation of cleaning logical errors is to remove some data that can be found problematic by using simple logical reasoning, so as to prevent the analysis results from deviating. The step mainly includes eliminating duplication, removing unreasonable values, correcting contradictory contents and so on.

[27] 2.6) Non-required data cleaning is to delete unnecessary fields.

[28] 2.7) The purpose of correlation verification is to ensure the correctness of the correlation among data when the data comes from multiple tables or databases, so as to prevent errors from occurring in the correlation or contradictions from occurring among data.
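The missing-value handling of steps 2.1) and 2.2) might be sketched as follows. This is only an illustration under stated assumptions: the field names, the 50% drop threshold and the choice of median filling are hypothetical, since the disclosure leaves the strategy to the missing ratio and the significance of each field.

```python
from statistics import median


def missing_ratio(rows, field):
    """Fraction of rows in which `field` is missing (None or absent)."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)


def clean_missing(rows, fields, drop_threshold=0.5):
    """Drop fields whose missing ratio exceeds the threshold (assumed rule),
    then fill the remaining gaps with the per-field median."""
    kept = [f for f in fields if missing_ratio(rows, f) <= drop_threshold]
    fill = {f: median(r[f] for r in rows if r.get(f) is not None) for f in kept}
    cleaned = [
        {f: (r[f] if r.get(f) is not None else fill[f]) for f in kept}
        for r in rows
    ]
    return cleaned, kept
```

Median filling only suits numeric fields; filling by a physician's experience or by hot-deck imputation, both named in step 2.2), would need field-specific logic that is not shown here.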

[29] In Step 3, the discrete data is pre-processed, which includes the following steps:

[30] For the preprocessing of discrete data, an ordinary encoding scheme is generally not applied directly; instead, the features of the discrete data are digitized. The One-Hot encoding scheme is adopted in the present disclosure. One-Hot encoding, also known as one-bit valid encoding, uses an N-bit status register to encode N states, wherein each state has its own register bit and only one bit is valid at any time. One-Hot encoding represents categorical variables as binary vectors. It requires mapping the categorical values to integer values, and then representing each integer value as a binary vector in which all entries are zero except the entry at the index of the integer, which is marked as 1. By using One-Hot encoding, the values of discrete features are extended to the Euclidean space, and each value of a discrete feature corresponds to a point in the Euclidean space. Since the calculation of distance or similarity among features is very important, and the commonly used distance or similarity calculations are carried out in the Euclidean space, using One-Hot encoding for discrete features makes the calculation of distance among features more reasonable.
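The two-stage mapping described above (category to integer index, then index to binary vector) can be illustrated with a short sketch. The function name and the alphabetical ordering of categories are assumptions made for the example.

```python
def one_hot_encode(values):
    """Map each categorical value to an integer index, then to a binary
    vector that is all zeros except for a 1 at that index."""
    categories = sorted(set(values))  # assumed ordering: alphabetical
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return categories, vectors
```

With this representation any two distinct categories are the same Euclidean distance apart, which is the property the paragraph relies on when it calls the distance calculation more reasonable.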

[31] In Step 4: the primary module of over-sampling processing includes the following steps:

[32] Firstly, the K-means clustering algorithm is used to cluster the samples of the minority classes into a fixed number K of clusters, and each cluster center is recorded:

$$E = \sum_{i=1}^{K} \sum_{x_j \in N_i} \lVert x_j - z_i \rVert^2$$

[33] Wherein:

[34] In the above formula, $x_j$ represents the j-th data sample in the data set; $N_i$ represents the i-th cluster; $z_i$ represents the cluster center of the i-th cluster.
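A plain K-means loop that minimizes the objective E above could look like the following sketch. It is not the disclosed implementation: initializing the centers with the first K points and stopping when the centers no longer move are assumptions made to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iters=100):
    """Return (centers, clusters, E) for the minority-class points."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    for _ in range(iters):
        clusters = assign(points, centers)
        new_centers = [
            [sum(col) / len(c) for col in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    clusters = assign(points, centers)
    # E: sum of squared distances of each sample to its cluster center
    e = sum(sq_dist(p, centers[i]) for i, c in enumerate(clusters) for p in c)
    return centers, clusters, e
```

The recorded centers are what the later over-sampling steps would work from; K-means only guarantees a local minimum of E, so the result depends on the seeds.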

[35] m sampling points are selected from n samples nearest to the minority class sample. The sampling rate is optimized by particle swarm optimization (PSO) algorithm.

[36] In the formula:

[37] $$v_i^d = w\, v_i^d + c_1 r_1 \left(\mathrm{pbest}_i^d - x_i^d\right) + c_2 r_2 \left(\mathrm{gbest}^d - x_i^d\right)$$

[38] $$x_i^d = x_i^d + v_i^d$$

[39] w represents the inertia factor, whose value is non-negative; i represents the i-th particle and d represents the d-th dimension of the particle; $c_1$ and $c_2$ are the acceleration coefficients; $r_1$ and $r_2$ represent two random numbers in [0,1] (for different dimensions of a particle, $r_1$ and $r_2$ have different values); pbest refers to the position where the particle obtains the highest (lowest) fitness, and gbest refers to the position where the whole system obtains the highest (lowest) fitness. In this way, the optimal sampling rate may be found.
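The velocity and position updates can be sketched for a one-dimensional search, such as a single sampling rate. This is a hedged illustration only: the disclosure does not specify the fitness function, the parameter values or the bounds, so `fitness`, `low`, `high`, the default coefficients and the clamping of positions are all assumptions introduced here, and the sketch minimizes the fitness.

```python
import random


def pso_minimize(fitness, low, high, n_particles=10, iters=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """One-dimensional PSO: returns (gbest, fitness(gbest))."""
    rng = random.Random(seed)
    x = [rng.uniform(low, high) for _ in range(n_particles)]
    v = [0.0] * n_particles
    pbest = x[:]                                  # best position per particle
    pbest_fit = [fitness(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g], pbest_fit[g]     # best position of the swarm
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
            v[i] = w * v[i] + c1 * r1 * (pbest[i] - x[i]) + c2 * r2 * (gbest - x[i])
            # x = x + v, clamped to the search interval (assumption)
            x[i] = min(high, max(low, x[i] + v[i]))
            f = fitness(x[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = x[i], f
                if f < gbest_fit:
                    gbest, gbest_fit = x[i], f
    return gbest, gbest_fit
```

In the system described here the fitness would presumably be derived from how well a model trained on the over-sampled data performs, but that choice is not stated in the text and is left to the caller.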

[40] After selecting the original point and the sampling rate, new minority class samples are generated.

[41] In the formula:

$$X_{new} = X + \mathrm{rand}(0,1) \times (M_i - X), \quad i = 1, 2, \ldots, N$$

[42] $X_{new}$ is a newly inserted sample; X is the selected original sample data; rand(0,1) represents a random number between 0 and 1; $M_i$ is the best sampling point optimized by PSO among the nearest samples of the original sample data X.
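The interpolation step can be written out directly. In this sketch the list `neighbors` stands in for the sampling points $M_i$; how those points are chosen by PSO is not reproduced, and the function names and the seeded random generator are assumptions for the example.

```python
import random


def synthesize(x, m, rng):
    """X_new = X + rand(0,1) * (M - X): a point on the segment from X to M."""
    gap = rng.random()
    return [xi + gap * (mi - xi) for xi, mi in zip(x, m)]


def oversample(x, neighbors, n_new, seed=0):
    """Generate n_new synthetic minority samples around the original sample x."""
    rng = random.Random(seed)
    return [synthesize(x, rng.choice(neighbors), rng) for _ in range(n_new)]
```

Every synthetic sample lies between the original sample and one of its selected neighbors, so the number of new samples per original point is what the sampling rate controls.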

[43] In Step 5, the primary decision tree module includes the following steps:

[44] 1) In the attribute space of the training samples, a region is segmented into two sub-regions and the output value of each sub-region is determined. By recursively executing this step, a decision tree is constructed. The optimal segmentation variable j and segmentation point s are selected by solving

[45] $$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\right]$$

[46] $R_1$ and $R_2$ represent the segmented sub-regions. The variables j are traversed, and for each fixed segmentation variable j the segmentation point s is scanned, so as to find the pair (j, s) for which the above formula attains the minimum error.

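[47] 2) The region is segmented with the selected pair (j, s) and the corresponding output values are determined:

[48] $$R_1(j,s) = \{x \mid x^{(j)} \le s\}, \qquad R_2(j,s) = \{x \mid x^{(j)} > s\}$$

[49] $$c_m = \frac{1}{N_m}\sum_{x_i \in R_m(j,s)} y_i, \qquad x \in R_m,\ m = 1, 2$$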

[50] 3) Steps 1) and 2) are applied repeatedly to the two sub-regions until the stopping conditions are met.

[51] 4) The feature space is segmented into M regions $R_1, R_2, \ldots, R_M$ and a decision tree is constructed:

[52] $$f(x) = \sum_{m=1}^{M} c_m\, I(x \in R_m)$$
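Steps 1) to 4) can be sketched as a small least-squares regression tree. The sketch makes several assumptions that are not in the text: candidate segmentation points are midpoints between adjacent distinct feature values, recursion stops when a split no longer reduces the squared error or fewer than `min_samples` samples remain, and the tree is stored as nested tuples.

```python
def sse(ys):
    """Squared error of ys around their mean c_m."""
    if not ys:
        return 0.0
    c = sum(ys) / len(ys)
    return sum((y - c) ** 2 for y in ys)


def best_split(X, y):
    """Scan every variable j and candidate point s; return (error, j, s)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2  # assumed candidate: midpoint of adjacent values
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, s)
    return best


def build_tree(X, y, min_samples=2):
    """Leaf = mean output c_m; inner node = (j, s, left_subtree, right_subtree)."""
    split = best_split(X, y) if len(y) >= min_samples else None
    if split is None or split[0] >= sse(y):
        return sum(y) / len(y)
    _, j, s = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] <= s]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] > s]
    return (j, s,
            build_tree([r for r, _ in left], [t for _, t in left], min_samples),
            build_tree([r for r, _ in right], [t for _, t in right], min_samples))


def predict(tree, x):
    """f(x): follow the splits to the region containing x and return its c_m."""
    while isinstance(tree, tuple):
        j, s, left, right = tree
        tree = left if x[j] <= s else right
    return tree
```

For a two-class problem with labels 0 and 1, the leaf value is the fraction of positive samples in the region, which can be thresholded to give a diagnosis.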

[53] In Step 6, the MEP post-pruning algorithm is adopted, which includes the following steps:

[54] 1) If there are k classes of samples in total, the probability that a training sample at the decision tree node t belongs to class i is as follows:

[55] $$p_i(t) = \frac{n_i(t) + p_{ai}\, m}{n(t) + m}$$

[56] Wherein: $p_{ai}$ is the prior probability of the class-i samples, namely the proportion of class-i samples in the whole data set; m is the influence factor of $p_{ai}$ on the posterior probability, so that m is not a fixed value. The prediction error rate E(t) of the node t is then defined by the following formula:

[57] $$E(t) = \min_i \{1 - p_i(t)\} = \min_i \left\{\frac{n(t) - n_i(t) + (1 - p_{ai})\, m}{n(t) + m}\right\}$$

[58] If the prior probabilities of all classes are the same, namely $p_{ai} = 1/k$ (i = 1, 2, ..., k), and m = k, then E(t) can be expressed as

[59] $$E(t) = \frac{n(t) - n_c(t) + (k - 1)}{n(t) + k}$$

[60] In the above formulas: n(t) is the total number of samples at the node t; $n_i(t)$ is the number of class-i samples at the node t; $n_c(t)$ is the number of samples of the primary class at the node t.

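[61] Finally, for each non-leaf node t, the error E(T_t) of the sub-tree rooted at t is calculated as the weighted sum of the errors of its child nodes and compared with the error E(t) of the node itself. If the error of the sub-tree is smaller, the sub-tree is retained; otherwise the sub-tree is cut off.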
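The pruning test can be sketched with the m-probability error estimate, $E(t) = \min_i \{(n(t) - n_i(t) + (1 - p_{ai})\,m) / (n(t) + m)\}$. The sketch follows the usual minimum error pruning convention, which is an assumption here because the comparison rule is not spelled out in the text: a node's own error is compared with the sample-weighted error of its children, and the sub-tree is cut off when it brings no improvement. The dictionary layout of a node (`counts`, `children`) is hypothetical.

```python
def expected_error(class_counts, priors, m):
    """Smallest m-estimate error over the classes at a node."""
    n = sum(class_counts)
    return min((n - n_i + (1 - p) * m) / (n + m)
               for n_i, p in zip(class_counts, priors))


def subtree_error(node, priors, m):
    """Error of a sub-tree: leaf error, or sample-weighted error of the children."""
    if not node["children"]:
        return expected_error(node["counts"], priors, m)
    n = sum(node["counts"])
    return sum(sum(c["counts"]) / n * subtree_error(c, priors, m)
               for c in node["children"])


def prune(node, priors, m):
    """Bottom-up pruning: turn a node into a leaf when its own error
    is not larger than the error of its sub-tree (assumed tie rule)."""
    for child in node["children"]:
        prune(child, priors, m)
    if node["children"] and (expected_error(node["counts"], priors, m)
                             <= subtree_error(node, priors, m)):
        node["children"] = []
    return node
```

With two classes, equal priors and m = 2, the leaf error reduces to (n(t) - n_c(t) + 1) / (n(t) + 2), where n_c(t) is the count of the larger class at the node.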

[62] In Step 7: the system is constructed and the visual design is executed, including the following steps:

[63] The trained model is used to construct the system, and a visual operation interface is designed. After the user inputs his/her blood routine data into the system, the system diagnoses whether he/she suffers from asthma according to each entry of the user's data. After testing on a large amount of data, the validation accuracy of the system reaches more than 96.5%, which is valuable in practical application.

[64] The above description only aims at providing preferred embodiments of the present disclosure, but not limiting the present disclosure in other forms. Anyone skilled in this art may use the technical content disclosed above to change or modify the embodiments herein into equivalent embodiments with equivalent variations to be applied in other fields. However, any simple modification, equivalent variation and modification made to the above embodiment according to the technical essence of the present disclosure without departing from the technical scheme content thereof still falls within the claimed scope of the technical scheme of the present disclosure.

Claims

WHAT IS CLAIMED IS:

1. An asthma diagnosis system based on a decision tree and the improved SMOTE algorithm, comprising a primary module of data acquisition, a primary module of oversampling processing, a primary module of decision tree, a primary training module and a primary detection module; The primary module of data acquisition is used for acquiring physical examination data of blood routine, preprocessing the acquired data to obtain preprocessed data, and inputting the preprocessed data into the primary module of oversampling processing; The primary module of over-sampling processing is used for processing input data and dividing the processed and balanced data into two groups, namely a training sample set and a validation sample set; The primary module of over-sampling processing consists of a PSO optimization module, a newly generated sample detection module and a correlation sorting module; The PSO optimization module is an SMOTE over-sampling method based on a PSO algorithm; in order to improve the accuracy of the model diagnosis, it is necessary to over-sample asthma samples of minority classes; aiming at the blindness of neighboring selection due to fixed sampling rates of traditional SMOTE, PSO is used herein to optimize the over-sampling rate of SMOTE and select an optimal sampling rate. The newly generated sample detection module focuses on the fuzzy boundary issue of the newly generated sample points by SMOTE, and frames a space with the newly generated points being the center. If the samples of minority classes/majority classes are less than 1/2, the newly generated samples are considered as "garbage points", and are discarded, otherwise, they are retained. The correlation sorting module selects features of a whole data set of the generated data, sorts the features according to the correlation among the data, and selects the features before a median as the data set for training the decision tree model. 
The primary module of decision tree is used for constructing an asthma diagnosis model; As the asthma diagnosis is a binary classification issue, and the feature values are continuous values, for which the CART regression tree algorithm is suitable, the CART regression tree algorithm is adopted. Moreover, since most of the data sets are less in data due to the unbalanced distribution of samples, ID3 and C4.5 algorithms respectively use information gain and information gain rate for node calculations. This will lead to the selection of nodes tending to a multi-class feature, thereby affecting the accuracy. Therefore, a CART regression tree algorithm can better deal with continuous feature values, and it is more advantageous when a mean square deviation is used as a standard for selecting nodes. As the pre-pruning algorithm is simple but may lose more important information, the MEP post-pruning algorithm is adopted. For the MEP post-pruning algorithm, no additional pruning set is required, so that it can be applied in a wider range. Firstly, the K-fold cross-validation method is introduced to select the optimal influence factor m, and then m is substituted into the MEP algorithm to prune the original decision tree. In this way, a more accurate and precise decision tree can be obtained, and the influence characteristics of the decision tree can be retained at the same time. The primary training module trains the constructed asthma disease diagnosis model by using the training sample set, and obtains the trained asthma diagnosis model; the specific process of this step is as follows: Cross-validation and grid search are used to construct the decision tree model, wherein is selected as the fold number of the cross-validation of training set and testing set, and a ratio of training set to testing set is 4:1. The training set is used for model training and the testing set is used for model checking.
Each parameter value is divided into cells, results of different parameters are compared to find out the global optimal or nearly global optimal target value and parameter solution. The primary detection module is used for loading the trained asthma diagnosis model, and validating the trained asthma diagnosis model by using the validation sample set; If the trained asthma diagnosis model has an asthma diagnosis accuracy of greater than or equal to 85% on the validation sample set, the trained asthma diagnosis model is used as the final model, and the final model is used for asthma diagnosis; Otherwise, the parameters of the constructed asthma model are adjusted, and the constructed asthma model is retrained by using the training sample set until the asthma diagnosis accuracy of the trained model on the validation sample set is greater than or equal to 85%, then the final model is obtained, and the final model is used for asthma diagnosis.

2. The asthma diagnosis system based on a decision tree and the improved SMOTE algorithm according to claim 1, wherein the primary module of data acquisition acquires blood routine physical examination data from hospitals, wherein the physical examination data of asthma patients are taken as positive samples, and a large number of physical examination data of people not suffering from asthma are taken as negative samples. Each examinee is taken as a sample, and each sample has 23 features as follows: gender, age, basophil ratio, basophil count, eosinophil ratio, eosinophil count, HCT, hemoglobin, lymphocyte ratio, lymphocyte count, average erythrocyte hemoglobin content, average erythrocyte hemoglobin concentration, average erythrocyte volume, monocyte ratio, monocyte count, average platelet volume, neutrophil ratio, neutrophil count, PCT, platelet distribution width, platelet count, red blood cell count, red blood cell distribution width, white blood cell count, diagnosis result, etc.

3. The asthma diagnosis system based on a decision tree and the improved SMOTE algorithm according to claim 1, wherein the primary module of over-sampling processing comprises the following processing steps: Firstly, the K-means clustering algorithm is used to cluster samples of minority classes to form fixed K clusters and record each cluster center, wherein:

$$E = \sum_{i=1}^{K} \sum_{x_j \in N_i} \lVert x_j - z_i \rVert^2$$

In the above formula, $x_j$ represents the j-th data sample in the data set; $N_i$ represents the i-th cluster; $z_i$ represents the cluster center of the i-th cluster.

m sampling points are selected from n samples nearest to the minority class sample. The sampling rate is optimized by the particle swarm optimization (PSO) algorithm. In the formula:

$$v_i^d = w\, v_i^d + c_1 r_1 \left(\mathrm{pbest}_i^d - x_i^d\right) + c_2 r_2 \left(\mathrm{gbest}^d - x_i^d\right)$$

$$x_i^d = x_i^d + v_i^d$$

w represents the inertia factor, whose value is non-negative; i represents the i-th particle and d represents the d-th dimension of the particle; $r_1$ and $r_2$ represent two random numbers in [0,1] (for different dimensions of a particle, $r_1$ and $r_2$ have different values); pbest refers to the position where the particle obtains the highest (lowest) fitness, and gbest refers to the position where the whole system obtains the highest (lowest) fitness. Therefore, the optimal sampling rate may be found out. After selecting the original point and the sampling rate, new minority class samples are generated. In the formula:

$$X_{new} = X + \mathrm{rand}(0,1) \times (M_i - X), \quad i = 1, 2, \ldots, N$$

$X_{new}$ is a newly inserted sample; X is the selected original sample data; rand(0,1) represents a random number between 0 and 1; $M_i$ is the best sampling point optimized by PSO among the nearest samples of the original sample data X.

4. The asthma diagnosis system based on a decision tree and the improved SMOTE algorithm according to claim 1, wherein the primary decision tree module comprises the following processing steps: After the positive and negative samples are balanced, a CART regression tree is generated; The MEP post-pruning algorithm is adopted for the generated decision tree.