CN105760889A - Efficient imbalanced data set classification method - Google Patents


Info

Publication number
CN105760889A
CN105760889A
Authority
CN
China
Prior art keywords
sample
decision tree
feature
minority class
class sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610114730.3A
Other languages
Chinese (zh)
Inventor
陈宗海
曹璨
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201610114730.3A priority Critical patent/CN105760889A/en
Publication of CN105760889A publication Critical patent/CN105760889A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention discloses an efficient imbalanced data set classification method. Building on the traditional SMOTE method, it takes boundary samples and isolated points into account to obtain an approximately balanced data set; it then makes the class decision with a subspace-selection and tree-model scheme based on an ensemble random forest, so that the two classes of data can be separated clearly. In this way precision and recall are improved, the result is close to the actual situation, and the invention can be applied in real industrial analysis.

Description

An efficient imbalanced data set classification method
Technical field
The present invention relates to the technical field of data processing, and in particular to an efficient imbalanced data set classification method.
Background art
Classification is one of the most important problems in data analysis, and real data sets often suffer from class imbalance. On imbalanced data sets, conventional classification methods such as decision trees, SVMs and Bayesian networks perform poorly, because these traditional algorithms implicitly assume balanced classes. When a data set is imbalanced, the majority class dominates the training of the classification model, so minority-class samples are hard to identify. Yet in data classification it is usually precisely the minority class that matters, for example predicting telecom customer churn, detecting abnormal credit-card transactions, predicting network intrusions, and diagnosing diseases in the medical domain.
At present, research on imbalanced data set classification falls into two areas: the data level and the algorithm level. Data-level work focuses on adding, deleting and reconstructing data, while algorithm-level work improves the classifier itself.
However, the new samples synthesized by the current data-level SMOTE method have ambiguous class membership, and the method does not consider the particularities of isolated points and boundary samples. At the algorithm level, the random forest method has outstanding advantages: it is a strong classifier composed of many decision trees. Independent, identically distributed random vectors determine the growth of each tree, and the final model output is determined by the vote of all trees. Although the traditional random forest algorithm guarantees randomness, it does not account for the particularity of imbalanced data sets: it gives no extra resources to the minority class, and it lets every tree vote when forming the decision forest, which is unfavorable for classifying imbalanced data sets.
Summary of the invention
The object of the present invention is to provide an efficient imbalanced data set classification method that can clearly separate the two classes of data, stay close to the true result, and at the same time avoid over-fitting to a certain extent.
This object is achieved through the following technical solution:
An efficient imbalanced data set classification method, comprising:
performing k-nearest-neighbor search and linear interpolation on the majority-class and minority-class samples of the imbalanced data set based on the BSMOTE sampling technique, and merging the newly synthesized minority-class samples with the imbalanced data set, thereby obtaining an approximately balanced data set;
sampling the approximately balanced data set to obtain N subsets, each containing approximately equal numbers of minority-class and majority-class samples; extracting features of the minority-class and majority-class samples from the N subsets to obtain a combined feature set; and dividing the combined feature set into two feature subsets by comparing each feature's occurrence count against a threshold;
extracting S reference feature sets from the two feature subsets by stratified sampling, each set being used to construct one decision tree, thereby obtaining S decision trees;
screening a number of decision trees out of the S decision trees based on an index measuring each tree's classification performance and a metric measuring the similarity between trees, to form the final decision forest and thereby classify the imbalanced data set.
Further, performing k-nearest-neighbor search and linear interpolation on the majority-class and minority-class samples of the imbalanced data set based on the BSMOTE sampling technique, merging the newly synthesized minority-class samples with the imbalanced data set, and thereby obtaining an approximately balanced data set includes:
the imbalanced data set contains two classes of samples: the more numerous class is the majority class and the less numerous class is the minority class;
finding the k nearest neighbors of every minority-class sample in the sample space of the imbalanced data set, and dividing the minority class into 3 sample sets according to the ratio of majority-class to minority-class samples in each k-neighborhood: a safe sample set, a danger sample set and an isolated sample set;
safe, danger and isolated samples are defined as follows: if, in the k-neighborhood of the current minority-class sample, the number of majority-class samples is smaller than the number of minority-class samples, the current sample is a safe sample; if the number of majority-class samples is not smaller than the number of minority-class samples and the number of minority-class samples is not 0, it is a danger sample; if the k-neighborhood consists entirely of majority-class samples, it is an isolated sample;
for each minority-class sample in the danger set, finding its k nearest neighbors among all minority-class samples and performing linear interpolation between the original sample and s minority-class samples randomly selected from the k-neighborhood, synthesizing s new minority-class samples; if the danger set contains d_num minority-class samples in total, s × d_num new minority-class samples are synthesized altogether;
merging the newly synthesized minority-class samples with the imbalanced data set to form the approximately balanced data set.
Further, sampling the approximately balanced data set to obtain N subsets each containing approximately equal numbers of minority-class and majority-class samples, extracting the features of the minority-class and majority-class samples from the N subsets to obtain a combined feature set, and dividing the combined feature set into two feature subsets by comparing each feature's occurrence count against a threshold includes:
sampling the majority-class samples of the approximately balanced data set with replacement into a bag based on the Bagging method, and loading minority-class samples into the bag so that the bag holds approximately equal numbers of the two classes; repeating this operation N−1 more times to obtain N subsets in total;
taking the N subsets as the training sample sets for feature selection and performing feature extraction with a correlation-based feature selection method to obtain a combined feature set;
counting how many times each feature occurs in the combined feature set (the more often a feature occurs, the more important it is) and comparing each count against a threshold δ1; features occurring more than δ1 times are placed into feature subset A as good features, and features occurring no more than δ1 times are placed into feature subset B as poor features.
Further, extracting S reference feature sets from the two feature subsets by stratified sampling, each set being used to construct one decision tree, thereby obtaining S decision trees, includes:
determining the ratio of the numbers of features in the sorted feature subsets A and B;
drawing features from A and B according to that ratio, merging the features drawn from A and B each time into one reference feature set, and performing S draws in total to obtain S reference feature sets;
for each reference feature set, choosing the split points of the tree model using an attribute-selection metric and constructing a decision tree, thereby obtaining S decision trees.
Further, screening a number of decision trees out of the S decision trees based on an index measuring each tree's classification performance and a metric measuring the similarity between trees, forming the final decision forest, and thereby classifying the imbalanced data set includes:
computing the index measuring each tree's classification performance: for every decision tree, the area under the ROC curve (AUC) is used as the index; the larger the AUC, the closer the ROC curve lies to the upper left, and the better the classification performance;
computing the metric measuring the similarity between decision trees: each tree's classification results on the imbalanced data are used to obtain the similarity metric between every pair of trees;
setting a classification-strength threshold δ2, comparing each tree's classification-performance index against δ2, and keeping the trees above the threshold; then setting a similarity threshold δ3, traversing the pairwise correlations of the remaining trees, and for every pair whose similarity exceeds δ3 deleting the tree with the weaker classification strength;
the decision trees surviving this screening form the final decision forest; the outputs of the trees in the forest are tallied to obtain the corresponding classification result.
As can be seen from the above technical solution, on the basis of the traditional SMOTE method the invention adds consideration of boundary samples and isolated points, thereby obtaining an approximately balanced data set; it then makes the class decision with a subspace-selection and tree-model scheme based on an ensemble random forest, so the two classes can be separated clearly, precision and recall are improved, the result is close to the true situation, and the method can be applied in real industrial analysis.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the efficient imbalanced data set classification method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the BSMOTE sampling technique provided by an embodiment of the present invention;
Fig. 3 is a flowchart of the feature classification provided by an embodiment of the present invention;
Fig. 4 is a flowchart of decision tree construction based on stratified sampling provided by an embodiment of the present invention;
Fig. 5 is a flowchart of screening decision trees to form the decision forest provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of decision forest voting provided by an embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides an efficient imbalanced data set classification method which, as shown in Fig. 1, mainly comprises the following steps:
Step 11: performing k-nearest-neighbor search and linear interpolation on the majority-class and minority-class samples of the imbalanced data set based on the BSMOTE sampling technique, and merging the newly synthesized minority-class samples with the imbalanced data set to obtain an approximately balanced data set.
In this step the traditional BSMOTE method is improved: following the k-nearest-neighbor and linear-interpolation ideas, consideration of the samples near the decision boundary is added on top of the traditional SMOTE method. The principle is shown in Fig. 2 and is as follows:
In this embodiment the imbalanced data set contains two classes of samples: the more numerous class is the majority class (the black-background samples in Fig. 2, 10 in number) and the less numerous class is the minority class (the white-background samples in Fig. 2, 4 in number);
the k nearest neighbors of every minority-class sample are found in the sample space of the imbalanced data set, and the minority class is divided into 3 sample sets according to the ratio of majority-class to minority-class samples in each k-neighborhood: a safe sample set, a danger sample set and an isolated sample set;
safe, danger and isolated samples are defined as follows: if, in the k-neighborhood of the current minority-class sample, the number of majority-class samples is smaller than the number of minority-class samples, the current sample is a safe sample; if the number of majority-class samples is not smaller than the number of minority-class samples and the number of minority-class samples is not 0, it is a danger sample; if the k-neighborhood consists entirely of majority-class samples, it is an isolated sample;
In this embodiment, samples in the isolated set are likely to be noise points and are therefore not considered. Likewise, safe samples are generally held to have little influence on minority-class classification performance and are not considered either.
For each minority-class sample in the danger set, its k nearest neighbors are found among all minority-class samples, and linear interpolation is performed between the original sample and s minority-class samples randomly selected from the k-neighborhood, synthesizing s new minority-class samples; if the danger set contains d_num minority-class samples in total, s × d_num new minority-class samples are synthesized altogether;
the newly synthesized minority-class samples are merged with the imbalanced data set to form the approximately balanced data set.
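The three-way split and interpolation of step 11 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: samples are assumed to be 2-D tuples, distance is assumed Euclidean, and the names `split_minority` and `bsmote` are ours.

```python
import math
import random

def knn(x, pool, k):
    """Return the k points of `pool` nearest to `x` (Euclidean distance)."""
    return sorted(pool, key=lambda p: math.dist(x, p))[:k]

def split_minority(minority, majority, k):
    """Divide minority samples into safe / danger / isolated sets by the
    majority-class count among each sample's k nearest neighbors."""
    maj = set(majority)
    safe, danger, isolated = [], [], []
    for x in minority:
        pool = [p for p in minority if p != x] + list(majority)
        n_maj = sum(p in maj for p in knn(x, pool, k))
        n_min = k - n_maj
        if n_min == 0:
            isolated.append(x)   # neighborhood is all majority class
        elif n_maj < n_min:
            safe.append(x)       # mostly minority-class neighbors
        else:
            danger.append(x)     # near the decision boundary
    return safe, danger, isolated

def bsmote(minority, majority, k, s, rng):
    """Synthesize s new samples per danger sample by linear interpolation
    toward randomly chosen minority-class neighbors."""
    _, danger, _ = split_minority(minority, majority, k)
    new = []
    for x in danger:
        neighbors = knn(x, [p for p in minority if p != x], k)
        for nb in rng.sample(neighbors, s):
            t = rng.random()     # interpolation factor in [0, 1)
            new.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return new
```

With d_num danger samples, `bsmote` yields s × d_num new samples, which are then appended to the original data set.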
Step 12: sampling the approximately balanced data set to obtain N subsets, each containing approximately equal numbers of minority-class and majority-class samples; extracting the features of the minority-class and majority-class samples from the N subsets to obtain a combined feature set; and dividing the combined feature set into two feature subsets by comparing each feature's occurrence count against a threshold.
In this step, the approximately balanced data set obtained in step 11 is repeatedly sampled in a predetermined way to obtain N balanced bag subsets; ensemble feature selection is then performed on the N bag subsets, and according to the importance of the features a good feature set and a poor feature set are selected for the imbalanced data. The procedure is shown in Fig. 3 and mainly comprises:
1) sampling the majority-class samples of the approximately balanced data set with replacement into a bag based on the Bagging method, and loading minority-class samples into the bag so that the bag holds approximately equal numbers of the two classes; repeating this operation N−1 more times to obtain N subsets in total;
2) taking the N subsets as the training sample sets for feature selection and performing feature extraction with a correlation-based feature selection method to obtain a combined feature set;
3) counting how many times each feature occurs in the combined feature set (the more often a feature occurs, the more important it is) and comparing each count against a threshold δ1; features occurring more than δ1 times are placed into feature subset A as good features, and features occurring no more than δ1 times are placed into feature subset B as poor features.
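Step 12 can be sketched as follows. The per-bag feature selector is left abstract (passed in as a callable), since the text only specifies "a correlation-based feature selection method" without naming one; function names are ours.

```python
import random
from collections import Counter

def balanced_bags(majority, minority, n_bags, rng):
    """Build N bags; each pairs all minority samples with an equal-size
    with-replacement draw from the majority class (the Bagging step)."""
    return [[rng.choice(majority) for _ in range(len(minority))] + list(minority)
            for _ in range(n_bags)]

def split_features(bags, select_features, delta1):
    """Run the per-bag feature selector, count how often each feature is
    chosen across the bags, and split features at the occurrence threshold."""
    counts = Counter()
    for bag in bags:
        counts.update(select_features(bag))
    good = {f for f, c in counts.items() if c > delta1}   # feature subset A
    poor = {f for f, c in counts.items() if c <= delta1}  # feature subset B
    return good, poor
```

A feature chosen in most bags is stable under resampling and lands in subset A; a feature chosen only occasionally lands in subset B.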
Step 13: extracting S reference feature sets from the two feature subsets by stratified sampling; each set is used to construct one decision tree, giving S decision trees.
This step is the decision tree construction process; as shown in Fig. 4 it mainly comprises:
1) determining the ratio of the numbers of features in the sorted feature subsets A and B;
2) drawing features from A and B according to that ratio, merging the features drawn from A and B each time into one reference feature set, and performing S draws in total to obtain S reference feature sets;
3) for each reference feature set, choosing the split points of the tree model using an attribute-selection metric and constructing a decision tree, thereby obtaining S decision trees.
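The stratified draw of step 13 can be sketched as below. One assumption is ours: each reference set uses half of all available features, which reproduces the 6-from-A, 4-from-B split of the worked example (12 good and 8 poor features); the tree construction itself is omitted, since any attribute-selection metric (e.g. Gini or information gain) could be plugged in.

```python
import random

def reference_feature_sets(subset_a, subset_b, n_trees, rng):
    """Draw S reference feature sets by stratified sampling: the per-set
    counts from A and B follow the size ratio of the two subsets.
    Assumed: each set uses half of all features."""
    a, b = sorted(subset_a), sorted(subset_b)
    per_set = (len(a) + len(b)) // 2
    n_a = round(per_set * len(a) / (len(a) + len(b)))
    n_b = per_set - n_a
    return [rng.sample(a, n_a) + rng.sample(b, n_b) for _ in range(n_trees)]
```

Because every reference set mixes good and poor features in a fixed ratio, each tree sees a mostly strong but still diverse feature subspace.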
Step 14: screening a number of decision trees out of the S decision trees based on an index measuring each tree's classification performance and a metric measuring the similarity between trees, forming the final decision forest and thereby classifying the imbalanced data set.
The implementation of this step is shown in Fig. 5 and mainly comprises:
1) computing the index measuring each tree's classification performance: for every decision tree, the area under the ROC curve (AUC) is used as the index; the larger the AUC, the closer the ROC curve lies to the upper left, and the better the classification performance;
2) computing the metric measuring the similarity between decision trees: each tree's classification results on the imbalanced data are used to obtain the similarity metric between every pair of trees;
3) setting a classification-strength threshold δ2, comparing each tree's classification-performance index against δ2, and keeping the trees above the threshold; then setting a similarity threshold δ3, traversing the pairwise correlations of the remaining trees, and for every pair whose similarity exceeds δ3 deleting the tree with the weaker classification strength;
4) the decision trees surviving this screening form the final decision forest, as shown in Fig. 6; the outputs of the trees in the forest are tallied to obtain the corresponding classification result.
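The two-stage screening of step 14 can be sketched as below. Trees are abstracted to identifiers with precomputed AUC scores and pairwise correlations; how AUC and the correlations are computed is outside this sketch, and the function name is ours.

```python
def prune_forest(auc, corr, delta2, delta3):
    """auc: {tree_id: AUC}. corr: {(i, j): similarity for pairs with i < j}.
    Keep trees whose AUC exceeds delta2, then for each remaining pair more
    similar than delta3 drop the tree with the lower AUC."""
    kept = {t for t, a in auc.items() if a > delta2}
    for (i, j), c in sorted(corr.items()):
        if c > delta3 and i in kept and j in kept:
            kept.discard(i if auc[i] < auc[j] else j)
    return kept
```

Applied to numbers like those in the worked example (10 trees, δ2 = 0.75, δ3 = 0.5, correlated pairs 1–3 and 4–7), this keeps 6 trees, dropping the weaker tree of each correlated pair.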
To aid understanding, the present invention is introduced below with a concrete example. Note that the specific parameter values involved in the example are only illustrative and not limiting; in practice, users can set the parameter values according to their actual situation.
The example follows the same steps as the foregoing embodiment; the detailed process is as follows:
Following step 11 with Fig. 2: in the figure k = 4, s = 1 and d_num = 2. There are 4 minority-class samples and 10 majority-class samples. Numbering the minority-class samples 1, 2, 3, 4 from left to right, with k = 4 neighbors it can be determined from the definitions that samples 1 and 4 are isolated samples while samples 2 and 3 are danger samples, so samples 2 and 3 undergo linear interpolation. Sample 2 has 2 minority-class samples in its 4-neighborhood; 1 of them is randomly selected and 1 new minority-class sample is synthesized. Likewise, sample 3 has only 1 minority-class sample in its 4-neighborhood, so that one is chosen and 1 sample is synthesized. In total s × d_num = 1 × 2 = 2 new samples are synthesized. The minority class then has 4 + 2 = 6 samples, so the gap with the majority class shrinks and the data set tends toward balance. When s and d_num are larger, more new samples than the original number of minority-class samples can be synthesized.
Take a data set of 5000 samples and first balance it according to step 11. Suppose the original data set has 4800 majority-class samples and 200 minority-class samples, and the BSMOTE method synthesizes 400 minority-class samples; the new data set then has 5400 samples in total, with 4800 majority-class and 600 minority-class samples. The imbalance ratio of majority to minority is thus adjusted from 24:1 to 8:1. Bagging-based feature subset selection according to step 12: all 600 minority-class samples are taken, 600 majority-class samples are drawn at random with replacement, and together they form one subset. This is repeated N = 10 times in total, giving 10 subsets. Feature extraction is performed on each subset, yielding one feature set per subset, for example g1 = {D1, D2, D4, …, D20}, g2 = {D2, D3, …, D19}, g3 = {D1, D3, …, D18, D20}, …, g10 = {D1, D2, D3, …, D19, D20}; the 10 feature sets are merged into one large combined feature set. The number of occurrences of each feature Di is then counted; with the occurrence threshold set to δ1 = 8, features occurring 8 or more times are placed into feature subset A, and features occurring fewer than 8 times are placed into the poor feature subset B.
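The arithmetic of this paragraph can be checked in a few lines (the variable names are ours, chosen for the sketch):

```python
# Worked example: 4800 majority / 200 minority samples; BSMOTE synthesizes
# 400 new minority samples.
majority, minority, synthesized = 4800, 200, 400
new_minority = minority + synthesized   # minority class after oversampling
total = majority + new_minority         # overall size of the new data set
ratio_before = majority // minority     # original imbalance ratio (x:1)
ratio_after = majority // new_minority  # imbalance ratio after BSMOTE
print(total, ratio_before, ratio_after)  # 5400 24 8
```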
Following step 13: feature subset A now has 12 features and B has 8, a ratio of 3:2. Using stratified sampling, 6 features are drawn from A and 4 from B each time and merged into one reference feature set. S = 10 draws (with replacement across draws) are performed in total, giving 10 reference feature sets. Tree models are then split and constructed on each, yielding 10 decision trees.
Following step 14: each decision tree classifies the data set, and the AUC of each tree is computed as the index measuring the classification performance of the 10 trees. For example, with the classification-strength threshold δ2 = 0.75, 8 trees meet the requirement. For the similarity metric, a separate test set that was not part of the training set is prepared in advance and split into 3 groups; each tree classifies the 3 groups, giving 3 × 8 results in total. The results of every pair of trees are compared and the correlation coefficient between the two trees is computed, which requires C(8,2) = 8 × 7 / 2 = 28 comparisons. Suppose the similarity threshold is δ3 = 0.5 and trees 1 and 3, as well as trees 4 and 7, are found to exceed it. The weaker tree of each pair is then deleted according to classification strength: for example, if trees 1 and 3 have strengths 0.81 and 0.77 and trees 4 and 7 have strengths 0.82 and 0.85, trees 3 and 4 are deleted and trees 1 and 7 remain. After screening, the remaining 6 decision trees form the decision forest. Finally, the majority vote of the 6 decision trees gives the final decision and prediction.
From the above description of the embodiments, a person skilled in the art can clearly understand that the above embodiments can be implemented in software, or in software plus a necessary general hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, USB flash drive, portable hard drive, etc.) and includes instructions that cause a computer device (a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person familiar with the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. An efficient imbalanced data set classification method, characterized by comprising:
performing k-nearest-neighbor search and linear interpolation on the majority-class and minority-class samples of the imbalanced data set based on the BSMOTE sampling technique, and merging the newly synthesized minority-class samples with the imbalanced data set, thereby obtaining an approximately balanced data set;
sampling the approximately balanced data set to obtain N subsets, each containing approximately equal numbers of minority-class and majority-class samples; extracting features of the minority-class and majority-class samples from the N subsets to obtain a combined feature set; and dividing the combined feature set into two feature subsets by comparing each feature's occurrence count against a threshold;
extracting S reference feature sets from the two feature subsets by stratified sampling, each set being used to construct one decision tree, thereby obtaining S decision trees;
screening a number of decision trees out of the S decision trees based on an index measuring each tree's classification performance and a metric measuring the similarity between trees, to form the final decision forest and thereby classify the imbalanced data set.
2. The efficient imbalanced data set classification method according to claim 1, characterized in that performing k-nearest-neighbor and linear interpolation calculations on the majority class samples and minority class samples in the imbalanced data set based on the BSMOTE sampling technique, and merging the resulting new minority class samples with the imbalanced data set to obtain an approximately balanced data set, comprises:
The imbalanced data set comprises two classes of samples: the class with more samples is the majority class, and the class with fewer samples is the minority class;
Finding the k nearest neighbors of every minority class sample in the sample space of the imbalanced data set, and dividing the minority class into 3 sample sets according to the ratio of the number of majority class samples to the number of minority class samples within each k-neighborhood: a safe sample set, a danger sample set and an isolated sample set;
Safe, danger and isolated samples are defined as follows: if the number of majority class samples in the k-neighborhood of the current minority class sample is less than the number of minority class samples, the current minority class sample is a safe sample; if the number of majority class samples is not less than the number of minority class samples and the number of minority class samples is not 0, the current minority class sample is a danger sample; if the k-neighborhood of the current minority class sample contains only majority class samples, the current minority class sample is an isolated sample;
For each minority class sample in the danger sample set, finding its k nearest neighbors within the minority class sample space, and performing linear interpolation between the original minority class sample and s minority class samples randomly selected from the k-neighborhood, thereby synthesizing s new minority class samples; if the danger sample set contains d_num minority class samples, a total of s × d_num new minority class samples are finally synthesized;
Merging the synthesized new minority class samples with the imbalanced data set to form an approximately balanced data set.
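The BSMOTE resampling described above can be sketched in Python as follows. This is a minimal illustration of the claimed steps, not the patented implementation: the function name `borderline_smote`, the plain numpy k-nearest-neighbor search, and the uniform random interpolation gap are assumptions added for the example.

```python
import numpy as np

def borderline_smote(X_maj, X_min, k=5, s=2, rng=None):
    """Classify minority samples as safe/danger/isolated by k-NN over the
    whole sample space, then linearly interpolate around the danger samples.
    Safe and isolated samples are left untouched, as in the claim."""
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_maj, X_min])
    is_maj = np.array([True] * len(X_maj) + [False] * len(X_min))
    danger = []
    for x in X_min:
        # k nearest neighbours in the full sample space (index 0 is x itself)
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        n_maj = is_maj[nn].sum()
        n_min = k - n_maj
        # danger: majority >= minority in the neighbourhood, but not all majority
        if n_maj >= n_min and n_min > 0:
            danger.append(x)
    synthetic = []
    for x in danger:
        # k nearest neighbours among minority samples only
        d = np.linalg.norm(X_min - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        for j in rng.choice(nn, size=s, replace=False):
            gap = rng.random()  # linear interpolation with a random gap in [0, 1)
            synthetic.append(x + gap * (X_min[j] - x))
    return np.array(synthetic)  # s * d_num new minority samples
```

Given majority samples `X_maj` and minority samples `X_min`, the returned array holds the s × d_num synthetic minority samples that are then merged back into the original data set.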
3. The efficient imbalanced data set classification method according to claim 1, characterized in that sampling the approximately balanced data set to obtain N subsets each containing approximately equal numbers of minority class samples and majority class samples, extracting the features of the minority class samples and majority class samples in the N subsets to obtain a combined feature set, and dividing the combined feature set into two feature subsets by comparing the occurrence count of each feature in the combined feature set with a threshold, comprises:
Sampling the majority class samples of the approximately balanced data set with replacement into a bag based on the Bagging method, and loading minority class samples into the bag so that the bag contains approximately equal numbers of the two classes of samples; repeating the above operation N-1 times to obtain N subsets in total;
Taking the N subsets as the input training sample sets for feature selection, and performing feature extraction with a correlation-based feature selection method to obtain the combined feature set;
Counting the number of occurrences of each feature in the combined feature set, where more occurrences indicate higher feature importance, and comparing the occurrence count of each feature with a threshold δ1; features whose occurrence count is greater than the threshold δ1 are put into feature subset A as good features, and features whose occurrence count is not greater than the threshold δ1 are put into feature subset B as poor features.
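The subset construction and occurrence-based feature split described above can be sketched as follows. The helper names (`make_balanced_bags`, `split_features_by_occurrence`) are assumptions, and the correlation-based feature selector itself is left abstract: it is represented only by the per-subset feature lists it would return.

```python
import numpy as np
from collections import Counter

def make_balanced_bags(X_maj, X_min, N, rng=None):
    """Draw N Bagging subsets, each pairing a with-replacement bootstrap of the
    majority class (sized to match the minority class) with the minority class."""
    rng = np.random.default_rng(rng)
    bags = []
    for _ in range(N):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=True)
        bags.append((X_maj[idx], X_min))
    return bags

def split_features_by_occurrence(feature_lists, delta1):
    """Count how often each feature was selected across the N subsets and split
    the combined feature set into a good subset A and a poor subset B."""
    counts = Counter(f for feats in feature_lists for f in feats)
    A = {f for f, c in counts.items() if c > delta1}   # good features
    B = {f for f, c in counts.items() if c <= delta1}  # poor features
    return A, B
```

A feature selected on most of the N balanced subsets is stable under resampling, which is why occurrence count serves as an importance proxy here.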
4. The efficient imbalanced data set classification method according to claim 1, characterized in that extracting S reference feature sets from the two feature subsets by stratified sampling, each reference feature set being used to construct a decision tree so as to obtain S decision trees, comprises:
Determining the ratio of the numbers of features in the sorted feature subset A and feature subset B;
Extracting features according to the ratio of feature subset A to feature subset B, and merging the features extracted each time from feature subset A and feature subset B into one reference feature set; performing S such extractions, thereby obtaining S reference feature sets;
For each reference feature set, selecting the split points of the tree model using an attribute selection metric and constructing a decision tree, thereby obtaining S decision trees.
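The stratified extraction of reference feature sets can be sketched as follows; this is a hedged example under assumed names, drawing from subsets A and B in proportion to their sizes. The construction of each decision tree from a reference feature set (with an attribute selection metric such as Gini index or information gain) is left out of the sketch.

```python
import numpy as np

def stratified_feature_sets(A, B, S, n_feats, rng=None):
    """Draw S reference feature sets, taking features from the good subset A
    and the poor subset B in proportion to |A| : |B| (stratified sampling)."""
    rng = np.random.default_rng(rng)
    A, B = sorted(A), sorted(B)  # fix an order so sampling is reproducible
    n_a = max(1, round(n_feats * len(A) / (len(A) + len(B))))
    n_b = n_feats - n_a
    sets = []
    for _ in range(S):
        fa = list(rng.choice(A, size=min(n_a, len(A)), replace=False))
        fb = list(rng.choice(B, size=min(n_b, len(B)), replace=False))
        sets.append(fa + fb)
    return sets
```

Each of the S returned feature sets would then be used to train one decision tree restricted to those feature columns.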
5. The efficient imbalanced data set classification method according to claim 1, 3 or 4, characterized in that screening a number of decision trees from the S decision trees, based on an index measuring decision tree classification performance and a metric measuring the similarity between decision trees, to form the final decision forest and thereby realize imbalanced data set classification, comprises:
Calculating the index measuring decision tree classification performance: for each decision tree, taking the area under the ROC curve (AUC) as the index of classification performance; the larger the AUC, the closer the ROC curve is to the upper left corner and the better the classification performance;
Calculating the metric measuring the similarity between decision trees: using the classification results of the decision trees on the imbalanced data to obtain the similarity metric between each pair of decision trees;
Setting a classification strength threshold δ2, comparing the classification performance index of each decision tree with the threshold δ2, and retaining the decision trees above the threshold; setting a similarity threshold δ3, traversing the pairwise correlations of the retained decision trees, and for each pair of decision trees whose similarity exceeds the threshold δ3, deleting the one with the weaker classification strength;
The decision trees that pass the above screening form the final decision forest; the outputs of the decision trees in the decision forest are tallied to obtain the corresponding classification result.
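The two-stage screening of claim 5 can be sketched as follows. The function name and the concrete similarity measure (fraction of identical predictions on the data) are assumptions for illustration; the claim itself leaves the exact similarity metric open.

```python
import numpy as np

def prune_forest(aucs, predictions, delta2, delta3):
    """Keep trees whose AUC exceeds delta2 (classification strength threshold),
    then for every pair of kept trees that is too similar (> delta3), drop the
    tree with the lower AUC. Returns the indices of the surviving trees."""
    keep = [i for i, auc in enumerate(aucs) if auc > delta2]
    removed = set()
    for i in range(len(keep)):
        for j in range(i + 1, len(keep)):
            a, b = keep[i], keep[j]
            if a in removed or b in removed:
                continue
            # similarity = fraction of identical predictions on the data
            sim = np.mean(predictions[a] == predictions[b])
            if sim > delta3:
                removed.add(a if aucs[a] < aucs[b] else b)
    return [i for i in keep if i not in removed]
```

The surviving trees form the decision forest; a final label would then be obtained by tallying (e.g. majority-voting) their per-sample outputs.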
CN201610114730.3A 2016-03-01 2016-03-01 Efficient imbalanced data set classification method Pending CN105760889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610114730.3A CN105760889A (en) 2016-03-01 2016-03-01 Efficient imbalanced data set classification method


Publications (1)

Publication Number Publication Date
CN105760889A true CN105760889A (en) 2016-07-13

Family

ID=56331548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610114730.3A Pending CN105760889A (en) 2016-03-01 2016-03-01 Efficient imbalanced data set classification method

Country Status (1)

Country Link
CN (1) CN105760889A (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794195A (en) * 2015-04-17 2015-07-22 南京大学 Data mining method for finding potential telecommunication users changing cell phones


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUI HAN, WEN-YUAN WANG, AND BING-HUAN MAO: "Advances in Intelligent Computing", 31 December 2005, Springer Berlin Heidelberg *
WANG HE-YONG: "Combination approach of SMOTE and biased-SVM for imbalanced datasets", IJCNN: IEEE International Joint Conference on Neural Networks *
XIAO Jian: "Research on imbalanced data classification methods based on random forests", China Masters' Theses Full-text Database *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294490B (en) * 2015-06-08 2019-12-24 富士通株式会社 Feature enhancement method and device for data sample and classifier training method and device
CN106294490A (en) * 2015-06-08 2017-01-04 富士通株式会社 The feature Enhancement Method of data sample and device and classifier training method and apparatus
CN106203519A (en) * 2016-07-17 2016-12-07 合肥赑歌数据科技有限公司 Fault pre-alarming algorithm based on taxonomic clustering
CN106372655A (en) * 2016-08-26 2017-02-01 南京邮电大学 Synthetic method for minority class samples in non-balanced IPTV data set
CN106529598A (en) * 2016-11-11 2017-03-22 北京工业大学 Classification method and system based on imbalanced medical image data set
CN106529598B (en) * 2016-11-11 2020-05-08 北京工业大学 Method and system for classifying medical image data sets based on imbalance
CN107391569A (en) * 2017-06-16 2017-11-24 阿里巴巴集团控股有限公司 Identification, model training, Risk Identification Method, device and the equipment of data type
US11100220B2 (en) 2017-06-16 2021-08-24 Advanced New Technologies Co., Ltd. Data type recognition, model training and risk recognition methods, apparatuses and devices
US11113394B2 (en) 2017-06-16 2021-09-07 Advanced New Technologies Co., Ltd. Data type recognition, model training and risk recognition methods, apparatuses and devices
CN107391569B (en) * 2017-06-16 2020-09-15 阿里巴巴集团控股有限公司 Data type identification, model training and risk identification method, device and equipment
CN107437095A (en) * 2017-07-24 2017-12-05 腾讯科技(深圳)有限公司 Classification determines method and device
CN107451694B (en) * 2017-08-03 2020-10-02 重庆大学 Application prediction method for context awareness and self-adaptation in mobile system
CN107451694A (en) * 2017-08-03 2017-12-08 重庆大学 It is a kind of to be used for context-aware and adaptive applied forecasting method in mobile system
CN109558962A (en) * 2017-09-26 2019-04-02 中国移动通信集团山西有限公司 Predict device, method and storage medium that telecommunication user is lost
CN108647138A (en) * 2018-02-27 2018-10-12 中国电子科技集团公司电子科学研究院 A kind of Software Defects Predict Methods, device, storage medium and electronic equipment
CN108960561A (en) * 2018-05-04 2018-12-07 阿里巴巴集团控股有限公司 A kind of air control model treatment method, device and equipment based on unbalanced data
CN109902805A (en) * 2019-02-22 2019-06-18 清华大学 The depth measure study of adaptive sample synthesis and device
CN110135614A (en) * 2019-03-26 2019-08-16 广东工业大学 It is a kind of to be tripped prediction technique based on rejecting outliers and the 10kV distribution low-voltage of sampling techniques
CN110084609A (en) * 2019-04-23 2019-08-02 东华大学 A kind of transaction swindling behavior depth detection method based on representative learning
CN110084609B (en) * 2019-04-23 2023-06-02 东华大学 Transaction fraud behavior deep detection method based on characterization learning
WO2020220220A1 (en) * 2019-04-29 2020-11-05 西门子(中国)有限公司 Classification model training method and device, and computer-readable medium
CN110991551A (en) * 2019-12-13 2020-04-10 北京百度网讯科技有限公司 Sample processing method, sample processing device, electronic device and storage medium
CN110991551B (en) * 2019-12-13 2023-09-15 北京百度网讯科技有限公司 Sample processing method, device, electronic equipment and storage medium
CN112560900A (en) * 2020-09-08 2021-03-26 同济大学 Multi-disease classifier design method for sample imbalance
CN112560900B (en) * 2020-09-08 2023-01-20 同济大学 Multi-disease classifier design method for sample imbalance
CN112463972B (en) * 2021-01-28 2021-05-18 成都数联铭品科技有限公司 Text sample classification method based on class imbalance
CN112463972A (en) * 2021-01-28 2021-03-09 成都数联铭品科技有限公司 Sample classification method based on class imbalance
CN113434401A (en) * 2021-06-24 2021-09-24 杭州电子科技大学 Software defect prediction method based on sample distribution characteristics and SPY algorithm
CN113628701A (en) * 2021-08-12 2021-11-09 上海大学 Material performance prediction method and system based on density unbalance sample data
CN113628701B (en) * 2021-08-12 2024-04-26 上海大学 Material performance prediction method and system based on density imbalance sample data
CN115544902A (en) * 2022-11-29 2022-12-30 四川骏逸富顿科技有限公司 Pharmacy risk level identification model generation method and pharmacy risk level identification method

Similar Documents

Publication Publication Date Title
CN105760889A (en) Efficient imbalanced data set classification method
Skryjomski et al. Influence of minority class instance types on SMOTE imbalanced data oversampling
CN107391772B (en) Text classification method based on naive Bayes
CN103164713B (en) Image classification method and device
CN109344618B (en) Malicious code classification method based on deep forest
CN104598586B (en) The method of large-scale text categorization
CN106096727A (en) A kind of network model based on machine learning building method and device
CN105426426A (en) KNN text classification method based on improved K-Medoids
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN101807254A (en) Implementation method for data characteristic-oriented synthetic kernel support vector machine
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
CN112699936B (en) Electric power CPS generalized false data injection attack identification method
Xie et al. Imbalanced big data classification based on virtual reality in cloud computing
JP5765583B2 (en) Multi-class classifier, multi-class classifying method, and program
CN117371511A (en) Training method, device, equipment and storage medium for image classification model
JP5892275B2 (en) Multi-class classifier generation device, data identification device, multi-class classifier generation method, data identification method, and program
CN115545111B (en) Network intrusion detection method and system based on clustering self-adaptive mixed sampling
Chang The application of machine learning models in company bankruptcy prediction
Ma The Research of Stock Predictive Model based on the Combination of CART and DBSCAN
CN111383716B (en) Screening method, screening device, screening computer device and screening storage medium
CN103207893A (en) Classification method of two types of texts on basis of vector group mapping
Dražić et al. Technology matching of the patent documents using clustering algorithms
CN114185956A (en) Data mining method based on canty and k-means algorithm
JP4125951B2 (en) Text automatic classification method and apparatus, program, and recording medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160713