CN108764346A - An entropy-based hybrid-sampling ensemble classifier - Google Patents

An entropy-based hybrid-sampling ensemble classifier

Info

Publication number
CN108764346A
CN108764346A (application CN201810536985.8A)
Authority
CN
China
Prior art keywords
sample
entropy
class
training
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810536985.8A
Other languages
Chinese (zh)
Inventor
王喆
李冬冬
程阳
杜文莉
张静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology filed Critical East China University of Science and Technology
Priority to CN201810536985.8A priority Critical patent/CN108764346A/en
Publication of CN108764346A publication Critical patent/CN108764346A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an entropy-based hybrid-sampling ensemble classifier. First, the information entropy of each original training sample is computed. The negative-class samples are then split into two groups according to their entropy. Next, the lower-entropy group is randomly divided into several subsets, and each subset is fused with the other group to form a new negative subset. The positive class is then oversampled, and the enlarged positive class is merged with each negative subset to form a new data view. Finally, each generated view is handled by its own classifier, and the classifiers are combined into an ensemble. At test time, a sample is fed into the trained model for recognition. Compared with traditional sampling methods, the invention introduces entropy to avoid the excessive randomness of sampling, prevents the loss of sample information, and mitigates the sample-overlap problem. By fully accounting for the degree of class imbalance, the invention derives an appropriate sampling rate for each dataset and offers a new method for, and a new way of thinking about, the imbalance problem.

Description

An entropy-based hybrid-sampling ensemble classifier
Technical field
The present invention relates to an entropy-based hybrid-sampling ensemble classifier and belongs to the field of pattern recognition.
Background technology
Because of its prevalence in everyday life and in scientific research, the class-imbalance problem has become one of the most important topics in data mining and machine learning. Imbalance arises whenever the goal is to handle rare but important cases, i.e. when one class has far fewer samples than another. Fraud detection, disease diagnosis, and access control are typical examples. In fraud detection, fraudulent cases account for only a small fraction of regular transactions. An access-control system spends most of its time handling requests from family members, while records of strangers are rare; in practice, mistaking a stranger for a family member is far more serious than mistaking a family member for a stranger. The two classes therefore deserve different levels of attention when such problems are handled. To describe the imbalance problem precisely, the class with many samples is called the majority or negative class, and the class with few samples the minority or positive class. The ratio of negative to positive sample counts is called the imbalance ratio (IR) and measures how unbalanced a dataset is.
Although standard algorithms achieve good results on balanced datasets, they usually show poor recognition of the positive class on imbalanced problems. Two families of techniques are commonly used to address this. The first operates at the data level and refers to sampling methods: they balance the data as far as possible in a preprocessing stage and are independent of any particular classifier. The second operates at the algorithm level and includes thresholding, one-class learning, and cost-sensitive learning. Unlike data-level methods, algorithm-level methods do not change the sample distribution; instead they design algorithms better suited to the imbalance problem.
The present invention mainly proposes a new sampling method that takes the sample distribution into account during sampling while also avoiding the loss of sample information. On the one hand, the method uses information entropy to measure the distribution of the original samples. In information theory, entropy is a tool for quantifying the certainty of a sample. Samples close to the boundary between the positive and negative classes play an important role in classification and deserve special attention; such samples usually have low classification certainty, so their entropy tends to differ from that of samples far from the boundary. To set boundary samples apart from the rest, the method treats them differently during training. On the other hand, to avoid losing sample information during undersampling, the negative class is partitioned directly into several subsets rather than reduced to a selected subset, so all negative samples take part in training. Meanwhile, the positive class is oversampled to balance each new data subset. These subsets are treated as several data views. Because the positive class in each view only needs to grow to the size of a negative subset, the number of synthetic samples created by oversampling stays small, which mitigates the resulting sample-overlap problem. Finally, the invention combines the hybrid-sampling strategy with an ensemble method into a new classification model, which is of real significance for research on classification algorithms for the imbalance problem.
Invention content
Technical problem: The present invention provides a hybrid-sampling-plus-ensemble classification model that addresses the imbalance problem. During undersampling, information entropy is introduced to measure the sample distribution, which reduces the excessive randomness of traditional sampling and lets important samples be treated differently during sampling. At the same time, no sample information is lost: all training samples are retained while sampling. During oversampling, the number of newly created synthetic samples is controlled to mitigate the sample-overlap problem. Finally, the data views produced by sampling are combined into an ensemble. The proposed model is independent of any specific classifier and accommodates classifiers of many types, so the strengths of each classifier can be fully exploited and the final classification result improved.
Technical solution: First, the raw sample data is divided into a training set and a test set. Second, the information entropy of each sample in the original training set is computed. Third, the negative-class samples are split into two groups by comparing their entropy with a preset threshold: the entropy of the first group exceeds the threshold, and the entropy of the second group does not. Fourth, the second group is randomly divided into M parts, and each part is fused with the first group to form a new subset. Fifth, the positive class is oversampled until it balances the negative-class subsets, and the enlarged positive class is combined with each negative subset to form M new data views. Finally, the M views are handled by M base classifiers, and an ensemble method produces the training result. In the test step, a test sample is substituted into the discriminant function of the model for recognition.
The technical solution can be refined further. When computing sample entropy during training, the invention is based on the k-nearest-neighbour algorithm, although any nearest-neighbour method could in fact be used. The number of views M is determined by the imbalance ratio IR of each dataset; in the experiments, M is chosen from {1, 2, ..., round(IR)}, where round(IR) rounds IR to the nearest integer. Finally, in the ensemble stage the invention uses majority voting, although other ensemble methods are equally applicable.
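As a small illustration of how the candidate view counts follow from the imbalance ratio, the sketch below computes IR from a ±1 label vector and enumerates {1, ..., round(IR)}. The function names are our own, not from the patent:

```python
def imbalance_ratio(y):
    """IR = (#negative samples) / (#positive samples); labels are +1 / -1."""
    n_pos = sum(1 for t in y if t == 1)
    n_neg = len(y) - n_pos
    return n_neg / n_pos

def candidate_M(y):
    """Candidate view counts M, chosen from {1, 2, ..., round(IR)}."""
    return list(range(1, round(imbalance_ratio(y)) + 1))
```

For instance, a dataset with 2 positive and 6 negative samples has IR = 3.0, so M would be chosen from {1, 2, 3}.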
Advantageous effects: Compared with the prior art, the present invention has the following advantages:
The invention proposes an entropy-based hybrid-sampling algorithm that tackles imbalanced data at the data level. Sampling algorithms are easy to implement and achieve considerable results on imbalanced data, so they are widely used in research. Yet existing sampling methods still have defects: undersampling may discard the information of important samples, and oversampling may cause sample overlap. To solve these problems, this document presents a new hybrid-sampling ensemble method called EHSEL. On the one hand, it takes the sample distribution into account during sampling, so important samples are treated differently. On the other hand, it retains all original samples during training, avoiding any loss of sample information. In addition, the algorithm only enlarges the positive class to the size of each negative subset, which markedly reduces the effect of sample overlap. Several types of base classifier are used to train the data views produced in the preprocessing stage, and an ensemble method then combines these base classifiers. To verify the method, experiments were designed on 48 real KEEL imbalanced benchmark datasets comparing EHSEL_BP with BP, EHSEL_GFRNN with GFRNN, and EHSEL_SVM with SVM. The results show that the method significantly improves the classification results of the original base classifiers on imbalanced datasets. Further experiments compared EHSEL_BP with six related algorithms on imbalanced datasets; EHSEL_BP ranked first among all algorithms, showing that it outperforms the comparison algorithms on imbalanced data and that the proposed method is a feasible and effective approach for imbalanced datasets.
Description of the drawings
Fig. 1 is the overall flow chart of the entropy-based hybrid-sampling ensemble learning model of the present invention.
Specific implementation
To describe the content of the invention more clearly, it is further explained below with reference to an example and the accompanying drawing. The example given here does not limit the scope covered by the invention. The entropy-based hybrid-sampling ensemble classifier algorithm of the invention comprises the following steps:
Step 1: Input the training samples $\{(x_i, y_i)\}_{i=1}^{N}$, where N is the number of training samples; $y_i = +1$ means sample $x_i$ belongs to the positive class and $y_i = -1$ means $x_i$ belongs to the negative class.
Step 2: First compute the information entropy of each training sample, as follows:
Step 2.1: Compute the pairwise Euclidean distance between all samples:

$$D(x_i, x_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2} \qquad (1)$$

where $D(x_i, x_j)$ is the distance between samples $x_i$ and $x_j$, and $d$ is the sample dimension.
Step 2.2: Next compute the information entropy of each training sample. The entropy of sample $x_i$ is

$$E(x_i) = -\sum_{j=1}^{C} P_j(x_i) \log P_j(x_i) \qquad (2)$$

where $C$ is the number of classes in the training samples and $P_j(x_i)$ is the probability, computed by the nearest-neighbour rule, that $x_i$ belongs to class $j$. The probability $P_j(x_i)$ is obtained as follows:
Step 2.2.1: For each sample $x_i$, compute its class probabilities:

$$P_j(x_i) = \frac{num_j}{k} \qquad (3)$$

where $num_j$ is the number of samples in the candidate neighbour set $x_{candi}$ that belong to class $j$, and $k$ is the number of neighbours in the k-nearest-neighbour method.
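Steps 2.1 through 2.2.1 can be sketched in Python as follows — a minimal, illustrative implementation of the k-nearest-neighbour entropy, with names of our own choosing rather than anything prescribed by the patent:

```python
import math
from collections import Counter

def knn_entropy(X, y, k=5):
    """Per-sample information entropy E(x_i) = -sum_j P_j log P_j, where
    P_j(x_i) = num_j / k is the class-j fraction among the k nearest
    neighbours of x_i (the sample itself is excluded)."""
    def dist(a, b):
        # Euclidean distance, formula (1)
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    entropies = []
    for i, xi in enumerate(X):
        nbrs = sorted((j for j in range(len(X)) if j != i),
                      key=lambda j: dist(xi, X[j]))[:k]
        counts = Counter(y[j] for j in nbrs)
        e = -sum((c / k) * math.log(c / k) for c in counts.values())
        entropies.append(e)
    return entropies
```

Samples deep inside one class get entropy 0 (all neighbours agree), while samples on the class boundary approach log 2, which is exactly the distinction Step 3 exploits.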
Step 3: Let the entropies of all samples be $\{E_1, E_2, \ldots, E_N\}$. The negative-class samples are first divided into two groups according to these entropies. Given a threshold $\alpha$, the first group is

$$G_1 = \{x_i \mid x_i \in x_{neg},\ E_i > \alpha\} \qquad (4)$$

and the second group is

$$G_2 = \{x_j \mid x_j \in x_{neg},\ E_j \le \alpha\} \qquad (5)$$

where $x_{neg}$ denotes the negative-class data.
Step 4: Keep the first group intact. Then randomly partition the second group into M subsets, written $\{Sub_1, Sub_2, \ldots, Sub_M\}$; M is called the resampling rate. Finally, fuse the retained first group into each subset, forming M subsets of the original negative class, denoted $\{S_1, S_2, \ldots, S_M\}$.
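Steps 3 and 4 — thresholding the negative class by entropy and fusing the high-entropy group $G_1$ into each random part of $G_2$ — might look like the sketch below. This is an illustrative implementation of our own; the patent does not prescribe this exact code:

```python
import random

def split_negative(neg_idx, entropies, alpha, M, seed=0):
    """Split negative-class sample indices into G1 (entropy > alpha, kept
    whole) and G2 (entropy <= alpha, randomly cut into M parts), then fuse
    G1 into every part so no negative sample is ever discarded."""
    g1 = [i for i in neg_idx if entropies[i] > alpha]
    g2 = [i for i in neg_idx if entropies[i] <= alpha]
    rng = random.Random(seed)
    rng.shuffle(g2)
    size = len(g2) // M
    parts = [g2[j * size:(j + 1) * size] for j in range(M - 1)]
    parts.append(g2[(M - 1) * size:])   # last part absorbs the remainder
    return [g1 + p for p in parts]      # S_i = G1 ∪ Sub_i
```

Note the design point the description emphasizes: because $G_1$ is copied into every subset and $G_2$ is partitioned rather than subsampled, the union of the subsets covers all negative samples.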
Step 5: With the negative-class sampling complete, turn next to the positive class. To bring the positive class as close as possible to the size of the negative subsets, the SMOTE algorithm is used here to increase the number of positive samples; how many new samples are needed depends on each negative subset $S_i$. Let the original positive samples be $S_{pos}$ and the newly generated ones $S_{smo}$. The enlarged positive class is then

$$S'_{pos} = S_{pos} + S_{smo} \qquad (6)$$
Step 6: Finally, combine the positive class generated above with the negative subsets to form M new groups of data:

$$V_i = S'_{pos} + S_i, \quad i = 1, 2, \ldots, M \qquad (7)$$

Each group of data can be regarded as a data view, so the method ultimately produces M new views from the original training set through hybrid sampling.
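Steps 5 and 6 can be sketched as below. The `smote_like` helper is a minimal interpolation-based stand-in for SMOTE — it synthesizes points on line segments between a positive sample and one of its nearest positive neighbours — not the reference SMOTE implementation, and both function names are our own:

```python
import random

def smote_like(pos, n_new, k=2, seed=0):
    """Generate n_new synthetic positives by linear interpolation between a
    random positive sample and one of its k nearest positive neighbours."""
    rng = random.Random(seed)
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    new = []
    for _ in range(n_new):
        x = rng.choice(pos)
        nbrs = sorted((p for p in pos if p is not x),
                      key=lambda p: dist2(x, p))[:k]
        nb = rng.choice(nbrs)
        t = rng.random()
        new.append(tuple(xi + t * (ni - xi) for xi, ni in zip(x, nb)))
    return new

def make_views(pos, neg_subsets, seed=0):
    """V_i = (S_pos + S_smo) + S_i: grow the positive class only up to the
    size of each negative subset, then pair them into one view per subset."""
    views = []
    for i, sub in enumerate(neg_subsets):
        extra = max(0, len(sub) - len(pos))
        pos_new = list(pos) + smote_like(pos, extra, seed=seed + i)
        views.append((pos_new, sub))
    return views
```

Because `extra` is capped at the size of each negative subset, the number of synthetic samples stays small — the property the description credits with mitigating sample overlap.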
Step 7: Next, train a separate base classifier on each view. Because the sampling above operates in the preprocessing stage, it is independent of the classifier: almost any classifier can be used with this method. Here three different types of classifier are used: the BP, GFRNN, and SVM algorithms.
The BP algorithm is a classical neural network algorithm. Suppose there are L layers and layer $l$ has $S_l$ neurons. For a sample $(x_i, y_i)$ the cost function is

$$J(W, b; x_i, y_i) = \frac{1}{2} \left\| h_{W,b}(x_i) - y_i \right\|^2 \qquad (8)$$

where W holds the connection weights between layers, b the biases, and $h_{W,b}(x_i)$ is the network output for a sample. The cost function over all samples can then be written

$$J(W, b) = \frac{1}{N} \sum_{i=1}^{N} J(W, b; x_i, y_i) + \frac{\lambda}{2} \sum_{l=1}^{L-1} \sum_{i=1}^{S_{l+1}} \sum_{j=1}^{S_l} \left( W_{ij}^{(l)} \right)^2 \qquad (9)$$

where $l = 1, 2, \ldots, L$ and $W_{ij}^{(l)}$ is the connection weight between neuron $i$ of layer $l+1$ and neuron $j$ of layer $l$.
The GFRNN algorithm decides the class of a query point z from the gravitation acting on it. Its main formula is

$$F_c(z) = \sum_{x_j \in X_{candi},\, y_j = c} F(z, x_j) \qquad (10)$$

where c indexes the classes and $X_{candi}$ is the candidate sample set determined by the nearest-neighbour method. The gravitation between the query point z and a sample $x_j$ is computed as

$$F(z, x_j) = \frac{1}{d(z, x_j)^2} \qquad (11)$$

where $d(z, x_j)$ is the Euclidean distance between z and $x_j$.
SVM is a discriminative method whose idea is to find a boundary that separates the positive and negative classes as widely as possible. The objective function of SVM is

$$\min_{w,\, b,\, \zeta} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \zeta_i \quad \text{s.t.} \ \ y_i \left( w^{\mathrm{T}} \phi(x_i) + b \right) \ge 1 - \zeta_i, \ \ \zeta_i \ge 0, \ i = 1, 2, \ldots, N \qquad (12)$$

where $\phi$ denotes the kernel feature map, $\zeta_i$ are slack variables that allow some unimportant samples to be misclassified, and C is the regularization parameter.
Step 8: The base classifiers produce M trained models, one per data view. To combine these models, majority voting decides the final result. Suppose the output of model j for test sample $x_i$ is $p_{i,j}$, $j = 1, 2, \ldots, M$. The majority vote for $x_i$ is

$$\hat{y}_i = \arg\max_{c \in \{+1, -1\}} \sum_{j=1}^{M} \mathbb{I}(p_{i,j} = c) \qquad (13)$$

where $\hat{y}_i$ is the predicted class label of $x_i$, the labels +1 and -1 denote the positive and negative classes respectively, and M is the number of data views.
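The majority vote of Step 8 can be sketched as follows. The tie-breaking rule (falling back to the positive class) is our own illustrative choice, since the text does not specify one:

```python
from collections import Counter

def majority_vote(preds):
    """Combine the M per-view predictions for one sample by majority vote.
    On a tie, default to the positive (minority) class -- an assumption of
    this sketch, not stated in the source."""
    top = Counter(preds).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return +1
    return top[0][0]
```

With an odd M (the usual choice for voting ensembles) the tie branch never fires.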
The specific embodiments of the invention have been described above with reference to the drawing. Those skilled in the art will understand, however, that various improvements and equivalent substitutions can be made without departing from the spirit and principles of the invention. The claims of the invention cover such improved and equivalently substituted schemes, all of which fall within its scope of protection.
Experimental design
Choice of experimental datasets: the real imbalanced datasets used here are selected from the KEEL imbalanced benchmarks. Their details are listed in the table below.
Dataset      IR     Size  Dimension
pima         1.87   768   8
vehicle0     3.25   846   18
glass04vs5   9.22   92    9
glass2       11.59  214   9
abalone918   16.40  713   8
yeast5       32.73  1484  8
All algorithms select their parameters as follows. Each dataset is first divided into two parts by 5-fold cross-validation, one part for training and one for testing. The training set is then further divided into five folds numbered 1 to 5: folds 1-4 train while fold 5 validates, then folds 1, 2, 3, and 5 train while fold 4 validates, and so on for five rounds in total. After the best parameters are chosen, the model is evaluated on the test set.
Comparison models: In the experiments we choose M BP networks as base classifiers; the resulting system is named EHSEL_BP. Its classification performance is compared against the hybrid-sampling algorithm REA, the hybrid algorithm EasyEnsemble, the undersampling algorithm Bagging, the boosting algorithm AdaBoost, and the oversampling algorithm SMOTE-SVM.
Performance metric: Because the task concerns imbalanced datasets, overall accuracy is not an appropriate evaluation index. The average accuracy (AACC) is therefore used as the evaluation criterion. AACC accounts for the accuracy of each class, so it avoids situations where the positive class is badly misclassified yet the overall evaluation still looks good because the negative class is rarely misclassified. In practice, AACC is computed from the correct-classification rates of the positive and negative classes, denoted $TP_{rate}$ and $TN_{rate}$ respectively:

$$AACC = \frac{TP_{rate} + TN_{rate}}{2} \qquad (14)$$
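The AACC criterion is simply the mean of the two per-class recalls, so each class carries equal weight regardless of imbalance. A direct sketch (helper name is our own):

```python
def aacc(y_true, y_pred):
    """Average accuracy: (TP_rate + TN_rate) / 2, i.e. the mean of the
    positive-class recall and the negative-class recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == -1)
    return 0.5 * (tp / n_pos + tn / n_neg)
```

For example, a classifier that labels everything negative scores 0.5 under AACC even though its overall accuracy on a highly imbalanced set would be near 1.0.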
Experimental results:
Each cell of the results table gives the prediction result under the AACC criterion together with its standard deviation; each row corresponds to a dataset and each column to an algorithm. The best result on each dataset is shown in bold, and each algorithm's rank score is given in brackets.
From the classification results and rank scores of each algorithm, EHSEL_BP achieves the best result on every dataset, and its stability across all datasets gives it the highest average classification performance and a top average rank score.

Claims (5)

1. An entropy-based hybrid-sampling ensemble classifier, characterized in that training the classifier comprises the following steps:
1) dividing the raw sample data into a training set and a test set;
2) computing the information entropy of each sample on the original training set;
3) splitting the negative-class samples into two groups by comparing their entropy with a preset threshold, the entropy of the first group exceeding the threshold and the entropy of the second group not exceeding it;
4) randomly dividing the second group into M parts, and fusing each part with the first group to form a new subset;
5) oversampling the positive class until it balances the negative-class subsets, and combining it with each negative subset to form M new data views;
6) handling the views with M base classifiers and obtaining the training result of the training samples with an ensemble method;
7) in the test step, substituting a test sample into the corresponding discriminant function of the model for recognition.
2. The entropy-based hybrid-sampling ensemble classifier according to claim 1, characterized in that in the second training step the training samples are divided into two groups by the size of their entropy, specifically: given a threshold $\alpha$, the first group is

$$G_1 = \{x_i \mid x_i \in x_{neg},\ E_i > \alpha\}$$

and the second group is

$$G_2 = \{x_j \mid x_j \in x_{neg},\ E_j \le \alpha\}$$

where $x_{neg}$ denotes the negative-class data.
3. The entropy-based hybrid-sampling ensemble classifier according to claim 1, characterized in that in step 4) the number M of random equal parts of the negative class is determined by the imbalance ratio of the dataset: if the imbalance ratio is IR, then M is chosen from {1, 2, ..., round(IR)}, where round(·) rounds IR to the nearest integer; in this way the invention computes a suitable undersampling rate for each dataset while fully accounting for its degree of imbalance.
4. The entropy-based hybrid-sampling ensemble classifier according to claim 1, characterized in that in the third training step the second group is divided into M parts at random: if the second group's sample indices are {1, 2, ..., N}, the grouping is {1, 2, ..., floor(N/M)}, {floor(N/M)+1, floor(N/M)+2, ..., 2·floor(N/M)}, ..., {(M-1)·floor(N/M)+1, (M-1)·floor(N/M)+2, ..., N}, where floor(·) rounds down.
5. The entropy-based hybrid-sampling ensemble classifier according to claim 1, wherein in the fifth training step M new data views are generated from the raw dataset and a separate base classifier is trained on each view; the base classifiers may be M classifiers of a single type or M classifiers of mixed types, and may be, but are not limited to, the following types:
the BP algorithm, a classical neural network algorithm: suppose there are L layers and layer $l$ has $S_l$ neurons; for a sample $(x_i, y_i)$ the cost function is

$$J(W, b; x_i, y_i) = \frac{1}{2} \left\| h_{W,b}(x_i) - y_i \right\|^2$$

where W holds the connection weights between layers, b the biases, and $h_{W,b}(x_i)$ is the network output for a sample; the cost function over all samples can then be written

$$J(W, b) = \frac{1}{N} \sum_{i=1}^{N} J(W, b; x_i, y_i) + \frac{\lambda}{2} \sum_{l=1}^{L-1} \sum_{i=1}^{S_{l+1}} \sum_{j=1}^{S_l} \left( W_{ij}^{(l)} \right)^2$$

where $l = 1, 2, \ldots, L$ and $W_{ij}^{(l)}$ is the connection weight between neuron $i$ of layer $l+1$ and neuron $j$ of layer $l$;
the GFRNN algorithm, which decides the class of a query point z from the gravitation acting on it; its main formula is

$$F_c(z) = \sum_{x_j \in X_{candi},\, y_j = c} F(z, x_j)$$

where c indexes the classes and $X_{candi}$ is the candidate sample set determined by the nearest-neighbour method; the gravitation between the query point z and a sample $x_j$ is computed as

$$F(z, x_j) = \frac{1}{d(z, x_j)^2}$$

where $d(z, x_j)$ is the Euclidean distance between z and $x_j$;
SVM, a discriminative method whose idea is to find a boundary that separates the positive and negative classes as widely as possible; the objective function of SVM is

$$\min_{w,\, b,\, \zeta} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \zeta_i \quad \text{s.t.} \ \ y_i \left( w^{\mathrm{T}} \phi(x_i) + b \right) \ge 1 - \zeta_i, \ \ \zeta_i \ge 0, \ i = 1, 2, \ldots, N$$

where $\phi$ denotes the kernel feature map, $\zeta_i$ are slack variables that allow some unimportant samples to be misclassified, and C is the regularization parameter; since the invention is independent of any specific classifier, it is robust across classifier types.
CN201810536985.8A 2018-05-30 2018-05-30 An entropy-based hybrid-sampling ensemble classifier Pending CN108764346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810536985.8A CN108764346A (en) 2018-05-30 2018-05-30 An entropy-based hybrid-sampling ensemble classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810536985.8A CN108764346A (en) 2018-05-30 2018-05-30 An entropy-based hybrid-sampling ensemble classifier

Publications (1)

Publication Number Publication Date
CN108764346A true CN108764346A (en) 2018-11-06

Family

ID=64004108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810536985.8A Pending CN108764346A (en) 2018-05-30 2018-05-30 An entropy-based hybrid-sampling ensemble classifier

Country Status (1)

Country Link
CN (1) CN108764346A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934203A (en) * 2019-03-25 2019-06-25 南京大学 A kind of cost-sensitive increment type face identification method based on comentropy selection
CN109934203B (en) * 2019-03-25 2023-09-29 南京大学 Cost-sensitive incremental face recognition method based on information entropy selection
CN110266672A (en) * 2019-06-06 2019-09-20 华东理工大学 Network inbreak detection method based on comentropy and confidence level down-sampling
CN110266672B (en) * 2019-06-06 2021-09-28 华东理工大学 Network intrusion detection method based on information entropy and confidence degree downsampling
CN111260120A (en) * 2020-01-12 2020-06-09 桂林理工大学 Weather data entropy value-based weather day prediction method
CN112583414A (en) * 2020-12-11 2021-03-30 北京百度网讯科技有限公司 Scene processing method, device, equipment, storage medium and product
WO2022257458A1 (en) * 2021-06-08 2022-12-15 平安科技(深圳)有限公司 Vehicle insurance claim behavior recognition method, apparatus, and device, and storage medium
CN117076861A (en) * 2023-08-18 2023-11-17 深圳市深国际湾区投资发展有限公司 Data fusion-based related data processing system, method and medium

Similar Documents

Publication Publication Date Title
CN108764346A (en) An entropy-based hybrid-sampling ensemble classifier
CN105589806B (en) A kind of software defect tendency Forecasting Methodology based on SMOTE+Boosting algorithms
CN104809877B (en) The highway place traffic state estimation method of feature based parameter weighting GEFCM algorithms
CN102521656B (en) Integrated transfer learning method for classification of unbalance samples
CN103020642A (en) Water environment monitoring and quality-control data analysis method
CN110309888A (en) A kind of image classification method and system based on layering multi-task learning
CN105956621B (en) A kind of flight delay method for early warning based on evolution sub- sampling integrated study
CN111582596A (en) Pure electric vehicle endurance mileage risk early warning method integrating traffic state information
CN105260738A (en) Method and system for detecting change of high-resolution remote sensing image based on active learning
CN110009030A (en) Sewage treatment method for diagnosing faults based on stacking meta learning strategy
CN105740914A (en) Vehicle license plate identification method and system based on neighboring multi-classifier combination
CN108877947A (en) Depth sample learning method based on iteration mean cluster
CN112633337A (en) Unbalanced data processing method based on clustering and boundary points
CN103426004B (en) Model recognizing method based on error correcting output codes
CN106482967A (en) A kind of Cost Sensitive Support Vector Machines locomotive wheel detecting system and method
CN104123678A (en) Electricity relay protection status overhaul method based on status grade evaluation model
Gupta et al. Clustering-Classification based prediction of stock market future prediction
CN105975611A (en) Self-adaptive combined downsampling reinforcing learning machine
CN110246134A (en) A kind of rail defects and failures sorter
CN112001788A (en) Credit card default fraud identification method based on RF-DBSCAN algorithm
CN104809476A (en) Multi-target evolutionary fuzzy rule classification method based on decomposition
CN110020868A (en) Anti- fraud module Decision fusion method based on online trading feature
CN105512675A (en) Memory multi-point crossover gravitational search-based feature selection method
Cao et al. Imbalanced data classification using improved clustering algorithm and under-sampling method
CN109685133A (en) The data classification method of prediction model low cost, high discrimination based on building

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106