CN110533116A - Euclidean distance-based adaptive ensemble method for imbalanced data classification - Google Patents

Euclidean distance-based adaptive ensemble method for imbalanced data classification Download PDF

Info

Publication number
CN110533116A
CN110533116A (application CN201910832525.4A)
Authority
CN
China
Prior art keywords
sample
classifier
classification
test
base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910832525.4A
Other languages
Chinese (zh)
Inventor
王宾 (Wang Bin)
陈东 (Chen Dong)
张强 (Zhang Qiang)
魏小鹏 (Wei Xiaopeng)
周昌军 (Zhou Changjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University
Priority to CN201910832525.4A
Publication of CN110533116A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/285Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system

Abstract

The invention discloses a Euclidean distance-based adaptive ensemble method for imbalanced data classification. First, several diverse balanced subsets are obtained by a random balance method, and base classifiers are then built on these balanced subsets. A classifier pre-selection algorithm is added before the dynamic selection algorithm. After the base classifiers have been screened, a new dynamic selection algorithm is proposed that evaluates how the classifiers perform in the region surrounding the sample to be classified: the more minority-class samples a classifier correctly classifies within that region, the stronger its ability is judged to be. Finally, a distance-based adaptive ensemble rule combines the predictions of the selected base classifiers into the output. The method builds base classifiers on diverse subsets, the proposed dynamic selection algorithm picks out the sub-classifiers with the strongest classification ability, and the proposed ensemble rule provides better outputs, which together effectively improve classification accuracy on imbalanced data.

Description

Euclidean distance-based adaptive ensemble method for imbalanced data classification
Technical field
The invention belongs to the field of artificial intelligence, and specifically relates to a Euclidean distance-based adaptive ensemble method for imbalanced data classification.
Background technique
Imbalanced data refers to training samples in which the number of samples in one or more classes differs greatly from the number in the other classes. According to research reports, class imbalance problems arise in many fields of the real world, such as facial age estimation, oil-leak detection in satellite images, anomaly detection, fraudulent credit card transaction identification, software defect prediction, and image annotation. Researchers therefore pay close attention to the data imbalance problem and have held several symposia and workshops on it, for example at the Association for the Advancement of Artificial Intelligence (AAAI) in 2000, the International Conference on Machine Learning (ICML) in 2003, and the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) Explorations in 2004.
For binary imbalanced classification problems, training samples are generally divided into a majority class and a minority class. In general, people care more about the minority-class samples than the majority-class ones. In credit card fraud identification, for example, the cost of misclassifying a fraudulent transaction as a normal one is much higher than the cost of misclassifying a normal transaction as fraudulent, because in the latter case staff can contact the cardholder to confirm whether the transaction was initiated by them. When the number of minority-class samples is far below the number of majority-class samples, the consequences can be serious. Most traditional classification algorithms, such as decision trees, k-nearest neighbors, and RIPPER, tend to produce models that maximize overall classification accuracy, so the minority class is often ignored. For example, on a data set in which only 1% of the samples belong to the minority class, a model that classifies every sample as the majority class still reaches 99% overall accuracy, yet such a highly "accurate" classifier misclassifies exactly the minority-class samples that we want to classify correctly.
Ensemble learning methods are increasingly applied to imbalanced data classification in machine learning and data mining, but most such algorithms improve prediction accuracy on imbalanced data only to a limited extent. Each base classifier is an expert on a local region, yet these methods do not account for the fact that base classifiers differ in their ability to classify different test samples; letting the weaker base classifiers participate in the final ensemble harms the generalization ability of the ensemble model. The subsets generated for training the base classifiers should also be diverse, to guarantee the diversity of the base classifiers. Moreover, the ensemble rules of most ensemble learning methods are determined by majority voting, which does not consider the relationship between the training samples and the test samples, so the predictions given by the base classifiers after optimization still cannot be improved further.
Summary of the invention
To solve the problems that sub-classifier diversity in ensemble learning is insufficient, and that the design of ensemble rules does not account for poorly performing base classifiers, this application proposes a Euclidean distance-based adaptive ensemble method for imbalanced data classification that improves classification accuracy on imbalanced data.
To achieve the above object, the technical solution of the present invention is a Euclidean distance-based adaptive ensemble method for imbalanced data classification, comprising the following steps:
Step 1: preprocess the data to obtain diverse balanced subsets;
Step 2: on the m balanced subsets, use the same classification learning algorithm to obtain m homogeneous classifiers and build a candidate classifier pool;
Step 3: pre-select the base classifiers in the candidate classifier pool, deleting classifiers that lack the ability to classify minority-class samples;
Step 4: use a dynamic selection algorithm to pick out, from the classifier pool screened in Step 3, the candidate sub-classifiers with strong ability to classify samples in the region surrounding the test sample, forming a base classifier set;
Step 5: use a distance-based adaptive ensemble rule to combine the selected base classifier set's predictions for the test sample into the output.
Further, in Step 1, the data preprocessing includes splitting the data into a training set, a validation set, and a test set, and obtaining balanced subsets from the training set by random balancing. Specific steps:
1. Divide the original data set into training set S_train, validation set S_va, and test set S_test in the quantity ratio a:b:c, ensuring that the ratio of majority-class to minority-class samples in each of the divided training, validation, and test sets is consistent with the ratio in the original data set;
2. Generate a random number num_rand according to formula (1):
num_rand = S_min + rand(0,1) * (S_max - S_min)    (1)
where S_min is the number of minority-class samples in training set S_train, rand(0,1) is a random number between 0 and 1, and S_max is the number of majority-class samples in S_train;
3. Randomly sample without replacement from the majority-class samples in S_train until the newly formed sample set contains num_rand samples; at the same time, oversample the minority class according to formula (2), adding each generated sample z to the minority-class samples, and repeat the oversampling until the number of minority-class samples reaches num_rand; merging the newly formed majority-class samples with the oversampled minority-class samples then yields one balanced subset;
z = β·p + (1-β)·q    (2)
where p and q are minority-class samples in S_train and β is a random number between 0 and 1;
4. Repeat steps 2 and 3 until m balanced subsets are obtained (a code sketch of this procedure follows).
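The following Python sketch illustrates the random balance procedure of formulas (1) and (2); the function name random_balance_subsets and the 0 = majority / 1 = minority label convention are illustrative choices, not part of the patent.

import numpy as np

def random_balance_subsets(X_maj, X_min, m, rng=None):
    """Step 1: undersample the majority class and oversample the minority
    class until both reach num_rand, m times over."""
    rng = np.random.default_rng(rng)
    s_min, s_max = len(X_min), len(X_maj)
    subsets = []
    for _ in range(m):
        # Formula (1): num_rand = S_min + rand(0,1) * (S_max - S_min)
        num_rand = int(s_min + rng.random() * (s_max - s_min))
        num_rand = max(num_rand, s_min)  # guard against rounding below S_min
        # Undersample the majority class without replacement.
        maj = X_maj[rng.choice(s_max, size=num_rand, replace=False)]
        # Oversample the minority class, formula (2): z = beta*p + (1-beta)*q.
        minority = list(X_min)
        while len(minority) < num_rand:
            p, q = X_min[rng.choice(s_min, size=2)]
            beta = rng.random()
            minority.append(beta * p + (1.0 - beta) * q)
        X = np.vstack([maj, np.asarray(minority)])
        y = np.concatenate([np.zeros(len(maj)), np.ones(len(minority))])  # 1 = minority
        subsets.append((X, y))
    return subsets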
Further, in Step 2, the candidate classifier pool is constructed as follows: on each of the m subsets obtained in Step 1, apply the same classification learning algorithm, yielding m homogeneous base classifiers that form the candidate classifier pool (a sketch follows).
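A minimal sketch of pool construction, assuming scikit-learn decision trees as the single learning algorithm (the embodiment below uses decision trees; any one algorithm satisfies Step 2):

from sklearn.tree import DecisionTreeClassifier

def build_pool(subsets):
    """Step 2: train one homogeneous classifier per balanced subset."""
    return [DecisionTreeClassifier().fit(X, y) for X, y in subsets]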
Further, in Step 3, the base classifiers in the candidate classifier pool need to be pre-selected. Specific steps:
1. For the current sample x_q to be classified in test set S_test, compute its k nearest neighbors in validation set S_va. If the k nearest neighbors contain samples of different classes, record the current k neighbors as Ψ; if the k nearest neighbors all belong to the same class, skip pre-selection and go to Step 4;
2. Take Ψ, with its labels removed, as input; each base classifier h_i in the candidate classifier pool predicts labels for Ψ, producing output y_p;
3. Compare the predicted output y_p with the true labels y of Ψ; delete any base classifier that cannot simultaneously classify correctly at least one minority-class sample and one majority-class sample. After deletion, n base classifiers remain in the candidate pool (a sketch of this pre-selection follows).
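A sketch of the pre-selection step, using scikit-learn's NearestNeighbors over the validation set; the helper name preselect and the 1 = minority / 0 = majority labels are assumptions carried over from the earlier sketch.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def preselect(pool, X_va, y_va, x_q, k=7):
    """Step 3: keep only classifiers that correctly classify at least one
    minority and one majority sample among the k validation neighbours of x_q."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_va)
    idx = nn.kneighbors(x_q.reshape(1, -1), return_distance=False)[0]
    X_psi, y_psi = X_va[idx], y_va[idx]
    if len(np.unique(y_psi)) < 2:
        return pool          # neighbours share one class: skip pre-selection
    kept = []
    for clf in pool:
        correct = clf.predict(X_psi) == y_psi
        if correct[y_psi == 1].any() and correct[y_psi == 0].any():
            kept.append(clf)
    return kept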
Further, in Step 4, the pre-selected candidate classifiers need to be dynamically selected. Specific steps:
1. For the current sample x_q to be classified in test set S_test, compute its k nearest neighbors in validation set S_va and denote the k samples as £;
2. Take £, with its labels removed, as input; each base classifier h_i in the candidate classifier pool predicts labels for £, producing output y_out. From the predicted output y_out and the true labels y, compute each base classifier's ability weight according to formula (3), where I(·) is an indicator function and θ_j is the weight coefficient of the class of the j-th sample; θ_j gives the minority class the larger coefficient, so that correctly classified minority-class samples contribute more to the weight;
3. After the ability weights are computed, sort them by value and take the top P% of the n base classifiers to form the base classifier set C' (a sketch follows).
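Formula (3) and the θ_j values are reproduced only as images in the original publication, so the weight computed below is a plausible reconstruction from the surrounding text: a θ-weighted count of correctly predicted neighbors, with θ favoring the minority class. The particular values θ = (1.0, 2.0) are assumptions.

import numpy as np

def dynamic_select(pool, X_nbrs, y_nbrs, top_percent=15, theta=(1.0, 2.0)):
    """Step 4: rank classifiers by ability weight on the k neighbours of the
    test sample and keep the top P%."""
    weights = []
    for clf in pool:
        y_out = clf.predict(X_nbrs)
        # Assumed form of formula (3): W = sum_j theta_j * I(y_out_j == y_j),
        # with theta indexed by class label (0 = majority, 1 = minority).
        weights.append(sum(theta[int(y)] * (pred == y)
                           for pred, y in zip(y_out, y_nbrs)))
    order = np.argsort(weights)[::-1]
    t = max(1, int(np.ceil(len(pool) * top_percent / 100.0)))
    return [pool[i] for i in order[:t]]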
Further, in Step 5, the classifier set C' obtained by selection gives the ensemble prediction output for the current sample to be classified. Specific steps:
1. Compute parameters R1 and R2 according to formulas (4) and (5), where t is the number of base classifiers in set C', P_i1 and P_i2 are the probabilities of the minority and majority class that the i-th classifier gives for the test sample, D_i1 and D_i2 are the average Euclidean distances from the test sample to the minority-class and majority-class training samples of the i-th base classifier, and α is an adaptive parameter that must be established for each classification algorithm;
Before computing any distance, the samples are normalized according to formula (6):
x̃_i = (x_i - x_min) / (x_max - x_min)    (6)
where x̃_i and x_i are the values after and before normalization, and x_max and x_min are the maximum and minimum values in the sample data;
2. Compare the values of R1 and R2: if R1 > R2, the current sample is classified as minority class; otherwise it is classified as majority class.
Repeat Steps 3, 4, and 5 until every sample in test set S_test has been classified (a sketch of the ensemble rule follows).
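Formulas (4) and (5) likewise appear only as images in the publication. From the variable definitions (P_i1, P_i2, D_i1, D_i2, α), one natural reading is that each classifier's class probability is discounted by the α-th power of the mean Euclidean distance to that class's training samples; the sketch below uses R_c = Σ_i P_ic / D_ic^α and should be treated as an assumed form, not the patent's exact formula. Formula (6), min-max normalization, is reconstructed directly from its variable definitions.

import numpy as np

def minmax(x, x_min, x_max):
    """Formula (6): min-max normalisation applied before any distance."""
    return (x - x_min) / (x_max - x_min)

def adaptive_ensemble_predict(selected, train_sets, x_q, alpha=1.0):
    """Step 5: selected[i] was trained on train_sets[i] = (X_i, y_i);
    class labels are 1 = minority, 0 = majority."""
    r1 = r2 = 0.0
    for clf, (X_i, y_i) in zip(selected, train_sets):
        proba = clf.predict_proba(x_q.reshape(1, -1))[0]
        classes = list(clf.classes_)
        p_min, p_maj = proba[classes.index(1)], proba[classes.index(0)]
        d_min = np.linalg.norm(X_i[y_i == 1] - x_q, axis=1).mean()
        d_maj = np.linalg.norm(X_i[y_i == 0] - x_q, axis=1).mean()
        r1 += p_min / d_min ** alpha   # R1: minority-class evidence (assumed form)
        r2 += p_maj / d_maj ** alpha   # R2: majority-class evidence (assumed form)
    return 1 if r1 > r2 else 0         # R1 > R2 -> minority class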
Through the above method, the present invention achieves the following:
(1) The subsets obtained with the random balance method are diverse, which guarantees that the base classifiers built on them are diverse.
(2) The added pre-selection method ensures that the subsequent dynamic selection algorithm can select base classifiers faster and better.
(3) The dynamic selection algorithm selects base classifiers with stronger ability for each sample to be classified, avoiding the decline in generalization performance caused by letting poorly performing base classifiers take part in the final decision output.
(4) The proposed ensemble rule combines the outputs of the base classifiers and considers the relationship between the training set and the test set, namely that a sample to be classified should preferentially be assigned to the class of its nearest samples. This ensemble rule effectively fuses multiple output values into the ensemble output and improves ensemble output accuracy.
Description of the drawings
Fig. 1 is a flowchart of the implementation of the invention.
Specific embodiment
With reference to Fig. 1, the flowchart of the steps of the invention, the implementation process is explained in detail. The embodiments of the present invention are implemented on the premise of the technical scheme of the present invention, and detailed implementation modes and specific operating processes are given, but the protection scope of the present invention is not limited to the following embodiments.
A Euclidean distance-based adaptive ensemble method for imbalanced data classification, comprising the generation of a candidate classifier pool, the dynamic selection of a set of base classifiers with stronger classification ability, and the adaptive ensemble output of the base classifiers, in the following steps:
(1) Preprocess the data to obtain a training set, a validation set, and a test set, and apply the random balance method to the training set to obtain m balanced subsets;
(2) On these m balanced subsets, use the same classification learning algorithm to obtain m homogeneous classifiers and build the candidate classifier pool;
(3) Pre-select the base classifiers in the candidate classifier pool, deleting classifiers that lack the ability to classify minority-class samples;
(4) Use the dynamic selection algorithm to pick out, from the classifier pool screened in step (3), the candidate sub-classifiers with the strongest ability to classify samples in the region surrounding the test sample;
(5) Use the distance-based adaptive ensemble rule to combine the selected base classifiers' predictions for the test sample into the output;
Diverse subsets are obtained with the random balance method: a value between the majority-class sample count and the minority-class sample count is assigned at random, the majority class is undersampled and the minority class is oversampled to that value to balance the data, and these steps are repeated until the desired number of subsets has been generated.
To reduce the number of candidate classifiers, improve the efficiency of the dynamic selection algorithm, and make it easier for the dynamic selection algorithm to pick stronger base classifiers, a pre-selection method deletes part of the base classifiers. Specifically, the k nearest neighbors of the current test sample are taken from the validation set and used as input to every sub-classifier in the candidate classifier pool, each of which predicts outputs for them; base classifiers that lack the ability to distinguish minority-class samples are deleted.
The classification ability of each base classifier is calculated from how it classifies the validation-set nearest neighbors of the test sample. Specifically, the k nearest neighbors of the current test sample are taken from the validation set, and each base classifier in the candidate pool predicts outputs for these k neighbors; the base classifiers that classify the minority class well while still maintaining overall accuracy are chosen. This differs from traditional dynamic selection algorithms, which are mostly designed to guarantee overall accuracy, so that on imbalanced samples the base classifiers they choose can be biased toward the majority class.
Each base classifier gives an output for the sample to be predicted. The ensemble rule considers not only the output of each base classifier but also the relationship between the sample to be classified and the training samples. In formulas (4) and (5), t is the number of base classifiers, P_i1 and P_i2 are the probabilities of class 1 and class 2 that the i-th classifier gives for the test sample, D_i1 and D_i2 are the average Euclidean distances from the test sample to the class-1 and class-2 training samples of the i-th base classifier, and α is an adaptive parameter that must be established for each classification algorithm.
If R1 > R2, the current sample is classified as class 1; otherwise it is classified as class 2.
This embodiment uses the ecoli046vs5 data set from KEEL, a public standard imbalanced-data repository. The ecoli046vs5 data set contains 203 samples in total, each with 7 attributes: 20 minority-class samples and 183 majority-class samples, for an imbalance ratio of 9.15. The specific imbalanced data classification process is as follows:
(1) Divide the original imbalanced learning sample set so that the numbers of samples in the training set, the validation set, and the test set are in the ratio 8:1:1, ensuring that the ratio of majority-class to minority-class samples in each divided set is consistent with that of the original imbalanced learning sample set.
(2) Random balancing on the training set proceeds as follows:
1. Generate a random number num_rand according to formula (1);
2. Oversample the minority-class samples in S_train according to formula (2) until their number reaches num_rand, and undersample the majority-class samples until their number reaches num_rand, obtaining one balanced subset;
3. Repeat steps 1 and 2 until 100 balanced subsets are obtained.
(3) On these 100 balanced subsets, use the decision tree algorithm to obtain 100 homogeneous classifiers, which build the candidate classifier pool;
(4) Execute the pre-selection method on the base classifiers obtained in step (3), as follows:
1. For the current sample x_q to be classified in test set S_test, compute its 7 nearest neighbors in validation set S_va. If the 7 nearest neighbors contain samples of different classes, record the current 7 neighbors as Ψ; if the 7 nearest neighbors all belong to the same class, skip to step (5);
2. Take Ψ, with its labels removed, as input; each base classifier h_i in the candidate classifier pool predicts labels for Ψ, producing output y_p;
3. Compare the predicted output y_p with the true labels y of Ψ; delete any base classifier that cannot simultaneously classify correctly at least one minority-class sample and one majority-class sample. After deletion, n base classifiers remain in the candidate pool.
(5) Dynamically select among the n base classifiers obtained in step (4), as follows:
1. For the current sample x_q to be classified in test set S_test, compute its 7 nearest neighbors in validation set S_va and denote the 7 samples as £;
2. Take £, with its labels removed, as input; each base classifier h_i in the candidate classifier pool predicts labels for £, producing output y_out. From the predicted output y_out and the true labels y, compute each base classifier's ability weight according to formula (3);
3. After the ability weights are computed, sort them by value and take the top 15% of the n base classifiers to form the base classifier set C'.
(6) To determine the α value in formulas (4) and (5), cross-validation over different α values is performed on the validation set; with decision trees the resulting α value is 1. Substitute this α into formulas (4) and (5), compute R1 and R2, and compare them: if R1 > R2, the current sample is classified as minority class, otherwise as majority class.
Repeat steps (4), (5), and (6) until every sample in test set S_test has been classified (an end-to-end sketch under the embodiment's settings follows).
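An end-to-end sketch tying together the hypothetical helpers from the earlier sketches under the embodiment's settings (m = 100, decision trees, k = 7, top 15%, α = 1); the data arrays X_maj_train, X_min_train, X_va, y_va, and X_test are assumptions, and all features are assumed to be min-max normalized per formula (6) beforehand.

from sklearn.neighbors import NearestNeighbors

subsets = random_balance_subsets(X_maj_train, X_min_train, m=100)
pool = build_pool(subsets)
subset_of = dict(zip(map(id, pool), subsets))  # map each classifier to its training subset

nn = NearestNeighbors(n_neighbors=7).fit(X_va)
predictions = []
for x_q in X_test:
    candidates = preselect(pool, X_va, y_va, x_q, k=7)
    idx = nn.kneighbors(x_q.reshape(1, -1), return_distance=False)[0]
    selected = dynamic_select(candidates, X_va[idx], y_va[idx], top_percent=15)
    sel_sets = [subset_of[id(clf)] for clf in selected]
    predictions.append(adaptive_ensemble_predict(selected, sel_sets, x_q, alpha=1.0))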
To better illustrate the validity of the algorithm, it is compared against the plain decision tree algorithm and against the decision tree algorithm applied after SMOTE processing, and AUC is used as the evaluation index to quantify the final output.
Table 1: Classification results of different methods on the ecoli046vs5 data set
As can be seen from Table 1, in the ecoli046vs5 imbalanced data classification experiment, the Euclidean distance-based adaptive ensemble method proposed in this application obtains an AUC value of 0.9192, an improvement in classification performance over the other typical processing methods. The experimental results show that the method effectively combines the respective advantages of the dynamic selection algorithm and the ensemble rule design, and can effectively improve both the prediction accuracy on imbalanced data and the generalization ability of the ensemble model.

Claims (5)

1. A Euclidean distance-based adaptive ensemble method for imbalanced data classification, characterized by comprising the following steps:
Step 1: preprocess the data to obtain diverse balanced subsets;
Step 2: on the m balanced subsets, use the same classification learning algorithm to obtain m homogeneous classifiers and build a candidate classifier pool;
Step 3: pre-select the base classifiers in the candidate classifier pool, deleting classifiers that lack the ability to classify minority-class samples;
Step 4: use a dynamic selection algorithm to pick out, from the classifier pool screened in Step 3, the candidate sub-classifiers with strong ability to classify samples in the region surrounding the test sample, forming a base classifier set;
Step 5: use a distance-based adaptive ensemble rule to combine the selected base classifier set's predictions for the test sample into the output.
2. The Euclidean distance-based adaptive ensemble method for imbalanced data classification according to claim 1, characterized in that, in Step 1, the data preprocessing includes splitting the data into a training set, a validation set, and a test set, and obtaining balanced subsets from the training set by random balancing; specific steps:
1. Divide the original data set into training set S_train, validation set S_va, and test set S_test in the quantity ratio a:b:c, ensuring that the ratio of majority-class to minority-class samples in each of the divided training, validation, and test sets is consistent with the ratio in the original data set;
2. Generate a random number num_rand according to formula (1):
num_rand = S_min + rand(0,1) * (S_max - S_min)    (1)
where S_min is the number of minority-class samples in training set S_train, rand(0,1) is a random number between 0 and 1, and S_max is the number of majority-class samples in S_train;
3. Randomly sample without replacement from the majority-class samples in S_train until the newly formed sample set contains num_rand samples; at the same time, oversample the minority class according to formula (2), adding each generated sample z to the minority-class samples, and repeat the oversampling until the number of minority-class samples reaches num_rand; merging the newly formed majority-class samples with the oversampled minority-class samples then yields one balanced subset;
z = β·p + (1-β)·q    (2)
where p and q are minority-class samples in S_train and β is a random number between 0 and 1;
4. Repeat steps 2 and 3 until m balanced subsets are obtained.
3. The Euclidean distance-based adaptive ensemble method for imbalanced data classification according to claim 1, characterized in that, in Step 3, the base classifiers in the candidate classifier pool need to be pre-selected; specific steps:
1. For the current sample x_q to be classified in test set S_test, compute its k nearest neighbors in validation set S_va; if the k nearest neighbors contain samples of different classes, record the current k neighbors as Ψ; if the k nearest neighbors all belong to the same class, skip pre-selection and go to Step 4;
2. Take Ψ, with its labels removed, as input; each base classifier h_i in the candidate classifier pool predicts labels for Ψ, producing output y_p;
3. Compare the predicted output y_p with the true labels y of Ψ; delete any base classifier that cannot simultaneously classify correctly at least one minority-class sample and one majority-class sample; after deletion, n base classifiers remain in the candidate pool.
4. The Euclidean distance-based adaptive ensemble method for imbalanced data classification according to claim 1, characterized in that, in Step 4, the pre-selected candidate classifiers need to be dynamically selected; specific steps:
1. For the current sample x_q to be classified in test set S_test, compute its k nearest neighbors in validation set S_va and denote the k samples as £;
2. Take £, with its labels removed, as input; each base classifier h_i in the candidate classifier pool predicts labels for £, producing output y_out; from the predicted output y_out and the true labels y, compute each base classifier's ability weight according to formula (3), where I(·) is an indicator function and θ_j is the weight coefficient of the class of the j-th sample, defined so that the minority class receives the larger coefficient;
3. After the ability weights are computed, sort them by value and take the top P% of the n base classifiers to form the base classifier set C'.
5. The Euclidean distance-based adaptive ensemble method for imbalanced data classification according to claim 4, characterized in that, in Step 5, the classifier set C' obtained by selection gives the ensemble prediction output for the current sample to be classified; specific steps:
1. Compute parameters R1 and R2 according to formulas (4) and (5), where t is the number of base classifiers in set C', P_i1 and P_i2 are the probabilities of the minority and majority class that the i-th classifier gives for the test sample, D_i1 and D_i2 are the average Euclidean distances from the test sample to the minority-class and majority-class training samples of the i-th base classifier, and α is an adaptive parameter;
Before computing any distance, the samples are normalized according to formula (6):
x̃_i = (x_i - x_min) / (x_max - x_min)    (6)
where x̃_i is the value after normalization, x_i is the value before normalization, and x_max and x_min are the maximum and minimum values in the sample data;
2. Compare the values of R1 and R2: if R1 > R2, the current sample is classified as minority class; otherwise it is classified as majority class.
CN201910832525.4A 2019-09-04 2019-09-04 Euclidean distance-based adaptive ensemble method for imbalanced data classification Pending CN110533116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910832525.4A CN110533116A (en) 2019-09-04 2019-09-04 Euclidean distance-based adaptive ensemble method for imbalanced data classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910832525.4A CN110533116A (en) 2019-09-04 2019-09-04 Euclidean distance-based adaptive ensemble method for imbalanced data classification

Publications (1)

Publication Number Publication Date
CN110533116A true CN110533116A (en) 2019-12-03

Family

ID=68666803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910832525.4A Pending CN110533116A (en) Euclidean distance-based adaptive ensemble method for imbalanced data classification

Country Status (1)

Country Link
CN (1) CN110533116A (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111080442A (en) * 2019-12-21 2020-04-28 湖南大学 Credit scoring model construction method, device, equipment and storage medium
CN111210343A (en) * 2020-02-21 2020-05-29 浙江工商大学 Credit card fraud detection method based on unbalanced stream data classification
CN111210343B (en) * 2020-02-21 2022-03-29 浙江工商大学 Credit card fraud detection method based on unbalanced stream data classification
CN112035719A (en) * 2020-09-01 2020-12-04 渤海大学 Class imbalance data classification method and system based on convex polyhedron classifier
CN112035719B (en) * 2020-09-01 2024-02-20 渤海大学 Category imbalance data classification method and system based on convex polyhedron classifier
CN113204481A (en) * 2021-04-21 2021-08-03 武汉大学 Class imbalance software defect prediction method based on data resampling
CN113204481B (en) * 2021-04-21 2022-03-04 武汉大学 Class imbalance software defect prediction method based on data resampling
CN113673573A (en) * 2021-07-22 2021-11-19 华南理工大学 Anomaly detection method based on self-adaptive integrated random fuzzy classification
CN113673573B (en) * 2021-07-22 2024-04-30 华南理工大学 Abnormality detection method based on self-adaptive integrated random fuzzy classification
CN114220026A (en) * 2021-12-30 2022-03-22 杭州电子科技大学 Sea surface small target detection method based on multi-classification idea
CN114548327A (en) * 2022-04-27 2022-05-27 湖南工商大学 Software defect prediction method, system, device and medium based on balanced subsets

Similar Documents

Publication Publication Date Title
CN110533116A (en) Euclidean distance-based adaptive ensemble method for imbalanced data classification
CN109492026B (en) Telecommunication fraud classification detection method based on improved active learning technology
CN106326913A (en) Money laundering account determination method and device
WO2019179403A1 (en) Fraud transaction detection method based on sequence width depth learning
Ahalya et al. Data clustering approaches survey and analysis
CN110147321A (en) Recognition method for defect-prone high-risk modules based on software networks
CN107766418A (en) Credit assessment method based on a fusion model, electronic device and storage medium
CN108363810A (en) Text classification method and device
CN106228554B (en) Fuzzy rough set coal dust image segmentation method based on multiple attribute reduction
CN108319987A (en) Combined filter-wrapper traffic feature selection method based on support vector machines
CN109886284B (en) Fraud detection method and system based on hierarchical clustering
CN107273387A (en) Ensemble classification for high-dimensional and imbalanced data
CN109739844A (en) Data classification method based on decaying weights
CN110147760A (en) Efficient method for feature extraction and identification of power quality disturbance images
CN112633337A (en) Unbalanced data processing method based on clustering and boundary points
CN110377605A (en) Sensitive attribute identification and classification grading method for structured data
CN112417176B (en) Method, equipment and medium for mining implicit association relation between enterprises based on graph characteristics
CN110134719A (en) Sensitive attribute identification and classification grading method for structured data
CN104850868A (en) Customer segmentation method based on k-means and neural network clustering
CN112001788A (en) Credit card default fraud identification method based on RF-DBSCAN algorithm
CN106934410A (en) Data classification method and system
CN109993042A (en) Face recognition method and device
CN110334773A (en) Machine-learning-based screening method for model input features
Dong Application of Big Data Mining Technology in Blockchain Computing
CN110516741A (en) Class-overlap imbalanced data classification method based on dynamic classifier selection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191203