CN110516741A - Class-overlap imbalanced data classification method based on dynamic classifier selection - Google Patents
Class-overlap imbalanced data classification method based on dynamic classifier selection
- Publication number
- CN110516741A CN201910802242.5A
- Authority
- CN
- China
- Prior art keywords
- sample
- classification
- classifier
- cluster
- cmaj
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
Abstract
The invention discloses a class-overlap imbalanced data classification method based on dynamic classifier selection. A semi-unsupervised hierarchical clustering algorithm is first applied to divide the data set into multiple balanced subsets whose sample spaces contain no class overlap. Base classifiers are then built on these subsets to form a candidate classifier pool. In order to select the most suitable base classifier from the candidate pool for the classification of each test sample, a weighting mechanism highlights the competence of classifiers that are stronger at classifying the minority-class samples in the region surrounding the test sample.
Description
Technical field
The invention belongs to the field of artificial intelligence, and specifically relates to a class-overlap imbalanced data classification method based on dynamic classifier selection.
Background Art
Imbalanced data refers to learning samples that contain multiple classes in which the number of samples of some classes is far smaller than that of the others. Classes with few samples are usually called minority classes, and the remaining classes with more samples are called majority classes. The imbalanced classification problem is a data-mining classification problem in which samples of different classes are unevenly distributed over the sample space. Rare information often carries more value and deserves discovery, attention, and more accurate screening and classification. Many real-world problems exhibit a similar situation: the sample sizes of some classes are far smaller than those of the others, yet the samples of those classes are clearly important. If classification is performed with traditional learning algorithms without any special modification, correct classification is difficult, because traditional classification algorithms are biased toward the majority classes and tend to misclassify the minority-class samples as majority-class ones. Consequently, when traditional machine-learning classification algorithms in data mining process data with imbalanced characteristics, the final classification accuracy often falls short of expectations, and the classifier obtained by such training also has significant limitations. The most common symptom is that the classification accuracy on majority-class samples is much higher than that on minority-class samples, and samples that actually belong to the minority class are easily misassigned to the majority class.
In real life, data imbalance is also very common. Many classification problems are inherently skewed by their scenario, such as detecting fraudulent transactions in credit-card data, detecting fraudulent spam in e-mail text, categorical data on luxury goods in recommender systems, and diagnostic data in the medical field. In such problems, what people usually care about is the classification accuracy on the minority-class samples. For example, when judging whether a patient has cancer, the cost of misclassifying a sick person as healthy is much higher than that of misclassifying a healthy person as sick, because in the latter case further examination and treatment are still available. In practice, however, samples of healthy people usually account for the vast majority of the training data and only a few samples correspond to cancer, so directly applying traditional data-mining classification methods makes it very difficult to identify the sick.

Some imbalanced data samples arise from human factors during data collection, for instance when certain classes involve privacy concerns that make the records hard or too costly to acquire. Other imbalance problems come from the decomposition of multi-class classification problems. Some classification algorithms, such as logistic regression and support vector machines (SVM), cannot be applied directly to multi-class problems and must first decompose the original problem into several binary subproblems; this easily turns an originally balanced classification problem into an imbalanced one, and makes an already imbalanced problem even more imbalanced. Imbalanced classification problems are therefore widespread in real life, and research on mining such data sets is of great practical significance. At present, most algorithms proposed for imbalanced data focus solely on solving the original imbalance problem. However, class imbalance is usually accompanied by other data-complexity problems, such as class overlap. Class overlap refers to minority-class samples appearing in majority-class regions, mostly near the decision boundary. Some existing algorithms proposed for imbalanced data can even aggravate the class-overlap problem after processing the data, which ultimately degrades their performance.
Summary of the invention
To overcome the above shortcomings of the prior art, this application provides a class-overlap imbalanced data classification method based on dynamic classifier selection that improves the final classification accuracy on class-overlapped imbalanced data.
To achieve the above object, the technical solution of this application is as follows: a class-overlap imbalanced data classification method based on dynamic classifier selection. A semi-unsupervised hierarchical clustering algorithm is first applied to divide the data set into multiple balanced subsets whose sample spaces contain no class overlap. Base classifiers are then built on these subsets to form a candidate classifier pool. In order to select the most suitable base classifier from the candidate pool for the classification of each test sample, a weighting mechanism highlights the competence of classifiers that are stronger at classifying the minority-class samples in the region surrounding the test sample. The specific implementation steps are as follows:
Step 1: generate the candidate classifier pool.
Class overlap is an obstacle to learning from imbalanced data: the samples of different classes should be balanced and should not contain class-overlap regions. Most current data-preprocessing techniques in dynamic-selection ensembles use bagging; that is, the original learning set is sampled and classifiers are built on the sampled data sets to generate the candidate classifier pool. However, each data subset obtained by the bagging sampling strategy is still imbalanced, so the generalization performance of the final ensemble model remains poor. The semi-unsupervised hierarchical algorithm instead obtains data subsets free of class overlap according to the following steps:
Step 11: treat the N majority-class samples as N individual clusters, generating N clusters Cmaj of size 1;
Step 12: find the pair of clusters Cmaj_a and Cmaj_b with the smallest squared Euclidean distance, and record that distance as Dist;
Step 13: compute the squared Euclidean distance from each minority-class sample to Cmaj_a and Cmaj_b. If the distances from some minority-class sample to both clusters are smaller than Dist, a minority-class sample lies between Cmaj_a and Cmaj_b; mark the pair Cmaj_a, Cmaj_b so that they are never merged. Otherwise no minority-class sample lies between them, and Cmaj_a and Cmaj_b are merged into a new cluster Cmaj_c, leaving N-1 clusters in total;
Step 14: recompute the squared Euclidean distances between the newly generated cluster Cmaj_c and the remaining clusters;
Step 15: repeat steps 12-14 until no new clusters can be merged.
The above procedure generates m majority-class sample clusters. Each of the m clusters is combined with the minority-class samples to produce m subsets; since the numbers of majority- and minority-class samples inside each subset may be unequal, SMOTE oversampling is applied to the m subsets to obtain m balanced subsets. The selected base classifiers are then trained on the generated balanced data sets to form the candidate classifier pool.
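Steps 11-15 can be sketched in plain NumPy as follows (a minimal illustration, not the patented implementation; the function name `semi_unsupervised_clustering` and the use of centroid distance as the inter-cluster distance are assumptions):

```python
import numpy as np

def semi_unsupervised_clustering(majority, minority):
    """Sketch of steps 11-15: agglomeratively merge majority-class
    clusters, but never merge a pair of clusters when some minority
    sample is closer to both of them than they are to each other."""
    clusters = [[i] for i in range(len(majority))]   # step 11: one cluster per sample
    blocked = set()                                  # pairs marked "never merge"

    def centroid(c):
        return majority[c].mean(axis=0)

    def sqdist(a, b):
        return float(np.sum((a - b) ** 2))           # squared Euclidean distance

    while True:
        # step 12: closest pair of clusters not yet blocked
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if frozenset((id(clusters[i]), id(clusters[j]))) in blocked:
                    continue
                d = sqdist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best:
                    best, pair = d, (i, j)
        if pair is None:
            break                                    # step 15: nothing left to merge
        i, j = pair
        ci, cj = centroid(clusters[i]), centroid(clusters[j])
        # step 13: does a minority sample sit between the two clusters?
        if any(sqdist(x, ci) < best and sqdist(x, cj) < best for x in minority):
            blocked.add(frozenset((id(clusters[i]), id(clusters[j]))))
        else:
            merged = clusters[i] + clusters[j]       # step 14: merge into a new cluster
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
    return clusters                                  # lists of majority-sample indices
```

Each returned list of indices corresponds to one majority-class cluster Cmaj; a pair with a minority sample between it stays split, which is what keeps the resulting subsets free of class overlap.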
Step 2: dynamically select the most suitable base classifier set.
The base-classifier subsets generated by semi-unsupervised hierarchical clustering contain no class overlap in the sample space, and each subset combines the minority-class samples with majority-class samples of similar attributes. Effective classifiers must be selected from the candidate pool to classify each sample x_query. A crucial step is therefore how to measure the competence of the candidate classifiers. Although many methods have been proposed to estimate classifier competence, all of them are based on the premise of a balanced data scenario.
In order to select classifier sets that are more competent at classifying minority-class samples, a dynamic-selection algorithm is proposed in which a candidate classifier that correctly classifies more of the minority-class samples in the competence region is assigned higher competence. The main goal is thus to describe the process of selecting suitable base classifiers for each sample x_query to be classified in an imbalanced data scenario. The key step is to evaluate the performance of each candidate classifier on the competence region to which x_query belongs, defined by the k nearest neighbours of x_query. A DES (dynamic ensemble selection) classifier system handles the imbalanced data set by selecting the classifiers that are more competent at classifying the minority-class samples within the competence region. The specific steps are as follows:
Step 21: in the validation set, compute the k nearest-neighbour samples of the current sample x_query to be classified, denoted Ψ;
Step 22: for each base classifier h_i in the candidate pool, take the Ψ obtained in step 21 as input and record the predicted output;
Step 23: from the predicted outputs and the true labels, compute TP, FN, FP and TN according to Table 1. The minority-class and majority-class samples in Table 1 refer to the sample distribution within the predicted region Ψ.
Table 1: confusion matrix

| | Predicted minority class | Predicted majority class |
---|---|---
| Actual minority class | TP | FN |
| Actual majority class | FP | TN |

The precision on the minority-class samples is calculated according to formula (1):
Precision = TP / (TP + FP) (1)
The recall on the minority-class samples is calculated according to formula (2):
Recall = TP / (TP + FN) (2)
The competence of each classifier in the pool is calculated according to formula (3):
W_i = Precision × Recall (3)
Since the divided subsets contain no class-overlap region in the sample space, the sample to be predicted should be classified into the nearest subset class; the weighted competence of each base classifier is calculated according to formula (4):
Compen_i = W_i / D_i (4)
where D_i is the average distance from the current sample to be predicted to the samples in the i-th subset. The Compen_i values are sorted in descending order and the top-ranked base classifier is selected.
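Under the assumption that the weighted competence of formula (4) takes the form Compen_i = W_i / D_i (the formula image is not reproduced here), steps 21-23 can be sketched as follows; the function and class names are illustrative, and any classifier object with a scikit-learn-style `predict` method would fit:

```python
import numpy as np

def select_classifier(pool, subsets_X, X_val, y_val, x_query, k=7, minority=1):
    """Sketch of step 2: score every classifier in the pool on the k
    validation neighbours of x_query (formulas (1)-(3)), weight by the
    average distance to the classifier's training subset (assumed
    formula (4): Compen_i = W_i / D_i), and return the top-ranked one."""
    # step 21: k nearest validation neighbours of the query sample
    d = np.linalg.norm(X_val - x_query, axis=1)
    idx = np.argsort(d)[:k]
    X_knn, y_knn = X_val[idx], y_val[idx]

    best_score, best_clf = -np.inf, None
    for clf, X_sub in zip(pool, subsets_X):
        pred = clf.predict(X_knn)                       # step 22: predict the region
        # step 23: confusion-matrix counts for the minority class
        tp = np.sum((pred == minority) & (y_knn == minority))
        fp = np.sum((pred == minority) & (y_knn != minority))
        fn = np.sum((pred != minority) & (y_knn == minority))
        precision = tp / (tp + fp) if tp + fp else 0.0  # formula (1)
        recall = tp / (tp + fn) if tp + fn else 0.0     # formula (2)
        w = precision * recall                          # formula (3)
        d_i = np.mean(np.linalg.norm(X_sub - x_query, axis=1))
        compen = w / d_i                                # formula (4), assumed form
        if compen > best_score:
            best_score, best_clf = compen, clf
    return best_clf
```

The division by D_i realises the nearest-subset preference described above: of two equally accurate classifiers, the one trained on the subset closer to x_query wins.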
Step 3: weighted output of the selected classifier.
The currently selected base classifier outputs two probabilities p1 and p2 for the sample to be predicted, corresponding to classes c1 and c2; the final output is obtained according to formula (5):
Output = argmax_j (p_j / Dist_j), j ∈ {1, 2} (5)
where Dist_j is the average distance from the sample to be predicted to the samples of the j-th class.
Steps 2 and 3 are repeated until the classification of all prediction samples is complete.
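A minimal sketch of the distance-weighted output of step 3, under the assumption that formula (5) reweights each class probability by the inverse of the average distance to that class (the formula image itself is not reproduced here):

```python
import numpy as np

def weighted_output(p, dists):
    """Assumed form of formula (5): scale each class probability p_j by
    1 / Dist_j so the query leans toward the nearer class, then
    renormalise the scores to sum to one."""
    w = np.asarray(p, dtype=float) / np.asarray(dists, dtype=float)  # p_j / Dist_j
    return w / w.sum()                                               # normalised scores

# e.g. a 50/50 classifier output is pulled toward the class whose
# samples lie four times closer to the query.
```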
By adopting the above technical scheme, the present invention achieves the following technical effects:
(1) Traditional ensemble learning algorithms mostly balance the data by bagging or after bagging, so the balanced subsets obtained this way still contain class overlap, and an ensemble model built on them still generalizes poorly. By contrast, the semi-unsupervised hierarchical clustering algorithm divides the majority-class samples into subsets that contain no class overlap, and the submodels built on them can effectively improve generalization ability.
(2) Traditional classifier-selection algorithms mostly compute accuracy to choose the best classifier individual or set. The weighted dynamic classifier selection algorithm of this application considers both the classifier's accuracy in predicting minority-class samples and the relationship between the prediction sample and the learning samples, namely that a sample should preferentially be classified into the nearest class.
Detailed description of the invention
Fig. 1 is the flow chart of the application.
Specific Embodiments
The specific embodiments refer to Fig. 1, the flow chart of the implementation steps of the present invention; the implementation process of the invention is described in detail in conjunction with the figure. The embodiments of the present invention are implemented on the premise of the technical scheme of the present invention, and detailed implementations and specific operating processes are given, but the protection scope of the present invention is not limited to the following embodiments.
Embodiment 1
This embodiment provides a class-overlap imbalanced data classification method based on dynamic classifier selection, comprising the generation of a candidate classifier pool, the dynamic selection of the base classifier with the strongest classification competence, and the weighted output of the base classifier, and it comprises the following steps in order:
(1) divide the majority-class samples of the class-overlapped imbalanced data set into m majority-class sample subclusters with the semi-unsupervised hierarchical clustering algorithm;
(2) merge the m majority-class subclusters with the minority-class samples to obtain m imbalanced binary subsets;
(3) oversample the imbalanced subsets with the SMOTE algorithm to obtain m balanced binary subsets;
(4) train homogeneous classifiers with the same learning algorithm on the subsets obtained in step (3) to constitute the candidate classifier pool;
(5) use the dynamic classifier selection algorithm to pick out from the pool the candidate sub-classifier with the strongest classification competence on the samples in the region surrounding the test sample;
(6) output the prediction result of the classifier obtained in step (5) according to a distance-based weighting rule.
Using the semi-unsupervised hierarchical clustering algorithm, the majority-class samples of data containing class overlap are divided into subclusters that contain no class overlap. Unlike the subset-building schemes of traditional data preprocessing and ensemble learning, which only consider the quantitative difference in imbalanced data and can even aggravate the class-overlap phenomenon after processing, the semi-unsupervised hierarchical clustering algorithm guarantees that the sample positions in each subset do not overlap with the minority-class samples.
The dynamic classifier selection algorithm selects the strongest classifier from the pool. Base-classifier competence is assessed from the classification of the validation-set nearest neighbours of the test sample: the k nearest neighbours of the current test sample are taken from the validation set, and every sub-classifier in the pool predicts these k neighbours. Unlike traditional dynamic-selection algorithms, the newly proposed algorithm, while guaranteeing classification accuracy, regards classifiers that correctly classify more of the minority-class samples among the k neighbours as having stronger classification competence, and selects that base classifier to participate in the final decision.
Embodiment 2
This embodiment uses a common imbalanced data set, glass2, collected in KEEL. The glass2 data set contains 214 samples in total, each with 9 attributes; 17 are minority-class samples and 197 are majority-class samples, giving an imbalance ratio of 11.59. The specific imbalanced-data classification process is as follows:
Step 1: generation of the candidate classifier pool
(1) first apply semi-unsupervised hierarchical clustering to the 197 majority-class samples to obtain m majority-class subclusters, and merge each of the m clusters with the minority-class samples to obtain m binary data sets;
(2) oversample the m data sets with the SMOTE algorithm to obtain m balanced binary data sets;
(3) train classifiers on the m data sets with the decision-tree learning algorithm to obtain a pool of m candidate base classifiers.
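Step 1 of this embodiment can be sketched as follows (an illustrative sketch, not the patented implementation: it uses a minimal hand-rolled SMOTE-style interpolation instead of a full SMOTE library, scikit-learn decision trees, and assumed function names):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def smote_balance(X_min, n_needed, k=5, rng=None):
    """Minimal SMOTE-style sketch: synthesise n_needed minority samples
    by interpolating between a random minority sample and one of its k
    nearest minority neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    synth = []
    for _ in range(n_needed):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]                 # nearest minority neighbours
        j = rng.choice(nbrs)
        gap = rng.random()                            # random point on the segment
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

def build_pool(majority_clusters, X_min):
    """Step 1 sketch: one balanced binary subset (cluster + oversampled
    minority) per majority cluster, one decision tree per subset."""
    pool, subsets = [], []
    for X_maj in majority_clusters:
        extra = smote_balance(X_min, max(0, len(X_maj) - len(X_min)))
        X_min_bal = np.vstack([X_min, extra]) if len(extra) else X_min
        X = np.vstack([X_maj, X_min_bal])
        y = np.array([0] * len(X_maj) + [1] * len(X_min_bal))  # 0 = majority
        pool.append(DecisionTreeClassifier(random_state=0).fit(X, y))
        subsets.append(X_maj)
    return pool, subsets
```

In practice a library implementation of SMOTE (e.g. from imbalanced-learn) would replace `smote_balance`; the sketch only shows the interpolation idea the embodiment relies on.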
Step 2: dynamic classifier selection
(1) for the current sample to be predicted, select its 7 nearest neighbours in the validation set;
(2) record the classification of these 7 nearest neighbours by each base classifier in the candidate pool, and calculate the weighted competence Compen of each base classifier according to formulas (1)-(4);
(3) select the base classifier corresponding to the largest Compen value.
Step 3: classifier output
The currently selected classifier outputs two probabilities p1 and p2 for the sample to be predicted, corresponding to classes c1 and c2, and the output is produced according to formula (5).
Steps 2 and 3 are repeated until the classification of all test samples is complete.
To better illustrate the effectiveness of the algorithm, the glass2 data set is also classified with the decision-tree algorithm alone and with the decision-tree algorithm after SMOTE preprocessing as comparison algorithms, and AUC is used as the evaluation metric to quantify the final output.
Table 2: classification results of different methods on the glass2 data set
As can be seen from Table 2, in the glass2 imbalanced classification experiment the dynamic classifier selection method proposed in this application achieves an AUC of 0.8608, a considerable improvement in classification performance over the other typical classification algorithms. The experimental results show that the method can effectively combine the respective advantages of semi-unsupervised hierarchical clustering and dynamic classifier selection, and improves the classification accuracy on class-overlapped imbalanced data.
Claims (4)
1. A class-overlap imbalanced data classification method based on dynamic classifier selection, characterized in that the specific implementation steps are as follows:
Step 1: generate the candidate classifier pool;
Step 2: dynamically select the base classifier set;
Step 3: weighted output of the selected classifier;
Step 4: repeat steps 2 and 3 until the classification of all prediction samples is complete.
2. The class-overlap imbalanced data classification method based on dynamic classifier selection according to claim 1, characterized in that step 1 obtains data subsets free of class overlap using the semi-unsupervised hierarchical algorithm according to the following steps:
Step 11: treat the N majority-class samples as N individual clusters, generating N clusters Cmaj of size 1;
Step 12: find the pair of clusters Cmaj_a and Cmaj_b with the smallest squared Euclidean distance, and record that distance as Dist;
Step 13: compute the squared Euclidean distance from each minority-class sample to Cmaj_a and Cmaj_b. If the distances from some minority-class sample to both clusters are smaller than Dist, a minority-class sample lies between Cmaj_a and Cmaj_b; mark the pair Cmaj_a, Cmaj_b so that they are never merged. Otherwise no minority-class sample lies between them, and Cmaj_a and Cmaj_b are merged into a new cluster Cmaj_c, leaving N-1 clusters in total;
Step 14: recompute the squared Euclidean distances between the newly generated cluster Cmaj_c and the remaining clusters;
Step 15: repeat steps 12-14 until no new clusters can be merged.
The above procedure generates m majority-class sample clusters. Each of the m clusters is combined with the minority-class samples to produce m subsets; since the numbers of majority- and minority-class samples inside each subset may be unequal, SMOTE oversampling is applied to the m subsets to obtain m balanced subsets. The selected base classifiers are then trained on the generated balanced data sets to form the candidate classifier pool.
3. The class-overlap imbalanced data classification method based on dynamic classifier selection according to claim 1, characterized in that the specific steps of step 2 are as follows:
Step 21: in the validation set, compute the k nearest-neighbour samples of the current sample x_query to be classified, denoted Ψ;
Step 22: for each base classifier h_i in the candidate pool, take the Ψ obtained in step 21 as input and record the predicted output;
Step 23: from the predicted outputs and the true labels, compute TP, FN, FP and TN according to Table 1;
Table 1: confusion matrix

| | Predicted minority class | Predicted majority class |
---|---|---
| Actual minority class | TP | FN |
| Actual majority class | FP | TN |

The precision on the minority-class samples is calculated according to formula (1):
Precision = TP / (TP + FP) (1)
The recall on the minority-class samples is calculated according to formula (2):
Recall = TP / (TP + FN) (2)
The competence of each classifier in the pool is calculated according to formula (3):
W_i = Precision × Recall (3)
Since the divided subsets contain no class-overlap region in the sample space, the sample to be predicted should be classified into the nearest subset class; the weighted competence of each base classifier is calculated according to formula (4):
Compen_i = W_i / D_i (4)
where D_i is the average distance from the current sample to be predicted to the samples in the i-th subset. The Compen_i values are sorted in descending order and the top-ranked base classifier is selected.
4. The class-overlap imbalanced data classification method based on dynamic classifier selection according to claim 1, characterized in that the specific steps of step 3 are as follows:
The currently selected base classifier outputs two probabilities p1 and p2 for the sample to be predicted, corresponding to classes c1 and c2; the final output is obtained according to formula (5):
Output = argmax_j (p_j / Dist_j), j ∈ {1, 2} (5)
where Dist_j is the average distance from the sample to be predicted to the samples of the j-th class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910802242.5A CN110516741A (en) | 2019-08-28 | 2019-08-28 | Class-overlap imbalanced data classification method based on dynamic classifier selection
Publications (1)
Publication Number | Publication Date |
---|---|
CN110516741A true CN110516741A (en) | 2019-11-29 |
Family
ID=68628384
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910802242.5A Pending CN110516741A (en) | 2019-08-28 | 2019-08-28 | Class-overlap imbalanced data classification method based on dynamic classifier selection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110516741A (en) |
-
2019
- 2019-08-28 CN CN201910802242.5A patent/CN110516741A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111210343A (en) * | 2020-02-21 | 2020-05-29 | 浙江工商大学 | Credit card fraud detection method based on unbalanced stream data classification |
CN111210343B (en) * | 2020-02-21 | 2022-03-29 | 浙江工商大学 | Credit card fraud detection method based on unbalanced stream data classification |
CN111695626A (en) * | 2020-06-10 | 2020-09-22 | 湖南湖大金科科技发展有限公司 | High-dimensional unbalanced data classification method based on mixed sampling and feature selection |
CN111695626B (en) * | 2020-06-10 | 2023-10-31 | 湖南湖大金科科技发展有限公司 | High-dimensionality unbalanced data classification method based on mixed sampling and feature selection |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191129 |