CN105868775A - Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm - Google Patents

Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm

Info

Publication number
CN105868775A
Authority
CN
China
Prior art keywords
sample
algorithm
minority class
pso
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610172812.3A
Other languages
Chinese (zh)
Inventor
张春慨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yitong Technology Co Ltd
Original Assignee
Shenzhen Yitong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yitong Technology Co Ltd filed Critical Shenzhen Yitong Technology Co Ltd
Priority to CN201610172812.3A priority Critical patent/CN105868775A/en
Publication of CN105868775A publication Critical patent/CN105868775A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an imbalanced sample classification method based on the PSO (Particle Swarm Optimization) algorithm. The PSO algorithm is used to optimize the sampling rates of boundary samples and safe samples during oversampling so as to obtain the optimal oversampling rate, and the features are optimized at the same time so as to select the most representative feature combination that simplifies computation and improves the classification result. The method uses AUC/F-Measure as the fitness function of the algorithm, thereby improving the performance of the final classifier.

Description

Imbalanced sample classification method based on the PSO algorithm
Technical field
The invention belongs to the field of classification and optimization in data mining, and in particular relates to a classification method suitable for imbalanced samples.
Background art
In imbalanced classification problems, an imbalanced data set is one in which, over the whole sample space, the number of samples of one class differs greatly from the number of samples of the remaining class or classes. In such cases it is usually the minority class that requires more attention. In medical diagnosis applications, for example, data sets for cancer or heart disease are imbalanced samples; the objects of interest in such data are the diseased samples, and only by classifying the attributes of these samples accurately can a patient's condition be diagnosed correctly and timely, targeted treatment be given.
When handling imbalanced classification problems, traditional classifiers pursue a high overall classification accuracy and therefore often simply classify minority-class samples as majority-class samples. This practice can label a potential patient's information as healthy, so that the patient may miss the best period for treatment, causing irreparable harm.
In imbalanced classification problems, many parameters affect the final classification result to some degree, and different parameter combinations can lead to very different results. These parameters are usually specified manually, and there is no index for verifying their effectiveness; they can only be judged by the final classification result. Parameters that are tuned by repeated manual adjustment usually converge to only a local optimum, and a global optimum is rarely obtained.
In the prior art, imbalanced classification problems are handled either by data-resampling-based methods or by algorithm-improvement-based methods.
Data-resampling-based methods: the imbalance of a sample set is mainly reflected in the imbalance of the numbers of samples of the two classes, which makes classification difficult for traditional classifiers. The first idea is therefore to work at the data level and change the number of original data samples so that the sample set becomes roughly balanced, thereby removing the classification difficulty. At the data level, scholars at home and abroad have mainly proposed the ideas of oversampling, undersampling and hybrid sampling.
Oversampling, as the term suggests, adds minority-class samples by some method so that the originally imbalanced data set becomes roughly balanced. If the oversampling method is inappropriate, however, overfitting may result.
The SMOTE algorithm is currently the most widely used oversampling technique. Instead of simply oversampling by duplicating neighbouring samples, it inserts synthetic copies of minority-class samples between existing minority-class samples. The technique was mainly inspired by an algorithm proposed in a handwriting-recognition project. The SMOTE oversampling procedure is as follows: for each minority-class sample, select the K other minority-class sample points nearest to it, then generate a number of synthetic minority-class sample points by linear interpolation between this point and those K neighbours, using a random interpolation factor together with the oversampling rate. The SMOTE interpolation formula is shown below:
p_i = x + rand(0,1) × (y_i − x),  i = 1, 2, …, N    (1)
The SMOTE algorithm mainly interpolates between similar (that is, neighbouring) minority-class samples, so the synthetic samples it generates are reasonably representative. The overfitting problem is therefore avoided in the SMOTE algorithm, and the decision space of the minority class is extended; likewise, the method can also be applied to the majority-class sample space to shrink the decision space of the majority class.
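For illustration only, the following minimal Python sketch shows how formula (1) generates synthetic minority-class samples by interpolating between a sample and its K nearest minority-class neighbours; the function name `smote_oversample` and the default values of `os_rate` and `k` are assumptions, not part of the patent.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, os_rate=1.0, k=5, rng=None):
    """Generate synthetic minority samples with formula (1):
    p = x + rand(0,1) * (y - x), y drawn from x's k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n_new = int(os_rate * len(X_min))            # number of synthetic points
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))             # pick a minority sample x
        j = rng.choice(idx[i, 1:])               # pick one of its k neighbours y
        gap = rng.random()                       # rand(0,1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic) if synthetic else np.empty((0, X_min.shape[1]))
```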
In 2013 Lou Xiaojun et al. proposed a clustering-based oversampling method that takes boundary-sample information into account. The method determines the boundary samples of the minority and majority classes by one-sided selection, clusters all minority-class samples into clusters that include the boundary region, and performs an appropriate amount of oversampling within each cluster. The synthetic minority-class samples produced in this way are more similar to the original samples and better represent the distribution of the parent sample space. At the same time, by oversampling the boundary samples in a targeted manner, the definition of the boundary is sharpened and the importance of the minority-class boundary samples is emphasised.
The idea of undersampling is to remove, by some method and according to certain rules, particular samples from the majority class so that the sample space becomes roughly balanced. Undersampling also has drawbacks: it can easily cause the loss of important information carried by representative samples.
Hybrid sampling fuses an oversampling algorithm with an undersampling algorithm and applies both when resampling the original data set to balance it. A large body of research shows that hybrid sampling is superior to a single resampling method, and that oversampling first and undersampling afterwards generally gives better results.
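For orientation only, a hybrid-sampling step of this kind can be sketched with the imbalanced-learn library; the sampling ratios 0.5 and 0.8 below are placeholder assumptions, and the patent's own method instead optimizes these rates with PSO.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def hybrid_resample(X, y, os_ratio=0.5, us_ratio=0.8, k=5, seed=0):
    # Step 1: oversample the minority class up to os_ratio * (majority size).
    X_os, y_os = SMOTE(sampling_strategy=os_ratio, k_neighbors=k,
                       random_state=seed).fit_resample(X, y)
    # Step 2: undersample the majority class so that minority/majority = us_ratio.
    X_bal, y_bal = RandomUnderSampler(sampling_strategy=us_ratio,
                                      random_state=seed).fit_resample(X_os, y_os)
    return X_bal, y_bal
```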
Algorithm-improvement-based methods: in imbalanced classification problems, resampling at the data level can eliminate the influence of imbalance to some extent and improve the classification of the minority class, but the essence of data-level methods is to change the distribution of the data set, and this change may affect the authenticity of the classification result. Improvements at the algorithm level avoid this drawback of the data level and can also improve minority-class classification to some extent. According to the research of scholars at home and abroad in recent years, work at the algorithm level mainly involves cost-sensitive learning, ensemble learning and one-class learning.
Ensemble learning is a widely used algorithmic framework for imbalanced classification. Boosting is one of the most classical ensemble methods, and AdaBoost is the most frequently used algorithm of the Boosting family. AdaBoost iteratively trains different weak classifiers and combines them into a strong classifier, updating the sample weights used for the next round of training according to the classification result of each iteration. From the training process of AdaBoost its main advantages can be seen: high classification accuracy, sub-classifiers that can be adjusted as required, classification results that are easy to understand, and no overfitting.
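For context, a standard AdaBoost classifier can be trained as follows with scikit-learn; the synthetic data set here is purely illustrative and not part of the patent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy imbalanced data set: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# AdaBoost iteratively reweights samples and combines the weak learners by weighted vote.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```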
When balancing an imbalanced sample space at the data level, some parameters must be specified manually. In the Borderline algorithm, for example, the parameter K is needed: the K-nearest-neighbour method finds the K nearest samples of each minority-class sample in the whole data set S and stores them in the set KNNsmin corresponding to that Smin sample. Likewise, when oversampling with the SMOTE algorithm, the oversampling rate (OsRate) is also specified manually, and different oversampling rates have a large influence on the final classification result. When hybrid sampling is used, the undersampling rate of the majority-class sample points (UsRate) raises the same parameter problem.
At the algorithm level, some parameters must also be specified manually. In AdaBoost, an algorithm of the Boosting family, every base classifier has a voting weight, and this weight coefficient is determined from the classification error rate of that base classifier. The weight coefficients of the base classifiers, however, influence each other; determining them simply from the classification error seldom yields the optimal final classification result.
Summary of the invention
To solve the problems of the prior art, the invention provides an imbalanced sample classification method based on the PSO algorithm. The PSO (Particle Swarm Optimization) algorithm is used for adaptive parameter optimization, so that a series of optimal coefficient combinations is obtained and the final classification result is greatly improved.
The present invention is achieved through the following technical solution:
An imbalanced sample classification method based on the PSO algorithm, in which a particle of the PSO algorithm is represented as OsRate_i, UsRate and Fec_j, where OsRate_i is the oversampling rate of the i-th cluster, i = 1, 2, …, N, N is the number of clusters formed by the DBSCAN algorithm, and Fec_j is the j-th feature of a sample, j = 1, 2, …, M. The method includes: (1) dividing the data set into a training set Train and a test set Test; (2) generating initial solutions x_i, i = 1, 2, …, SN, in the search space, where SN is the population size; (3) setting the global optimum gbest = 0; (4) executing steps (5) and (6) for MCN (maximum number of cycles) iterations; (5) for j from 1 to SN, obtaining the current solution x_j; according to the obtained solution, reselecting the features to generate a new data set, and learning through k-fold cross-validation and a classifier to obtain the corresponding AUC or F-Measure; (6) obtaining each particle's pbest according to the velocity and position update formulas of the PSO algorithm, and updating the global optimum gbest at the same time; (7) combining the obtained oversampling rates, undersampling rate and features, performing hybrid sampling to build the training data set, then training with the classifier to obtain the final AUC.
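A minimal sketch of the fitness evaluation in steps (5)–(6), assuming a particle is laid out as [OsRate_1 … OsRate_N, UsRate, Fec_1 … Fec_M]; the helper `resample_with_rates` is a hypothetical stand-in for the hybrid-sampling step, the decision tree is only an illustrative classifier, and the simple 0.5 threshold on Fec_j is a simplification of the sigmoid rule described later.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

def fitness(particle, X, y, n_clusters, resample_with_rates, k_folds=5):
    """Decode one particle and return its AUC fitness (steps (5)-(6))."""
    os_rates = particle[:n_clusters]              # OsRate_i per minority cluster
    us_rate = particle[n_clusters]                # UsRate for the majority boundary samples
    feat_mask = particle[n_clusters + 1:] > 0.5   # Fec_j -> 0/1 feature selection (simplified)
    if not feat_mask.any():                       # keep at least one feature
        feat_mask[0] = True
    aucs = []
    cv = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=0)
    for tr, te in cv.split(X, y):
        # Hybrid sampling is applied to the training fold only, with the decoded rates.
        X_tr, y_tr = resample_with_rates(X[tr][:, feat_mask], y[tr], os_rates, us_rate)
        clf = DecisionTreeClassifier().fit(X_tr, y_tr)
        score = clf.predict_proba(X[te][:, feat_mask])[:, 1]
        aucs.append(roc_auc_score(y[te], score))
    return float(np.mean(aucs))                   # AUC used as the fitness value
```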
The beneficial effects of the invention are as follows: by using the particle swarm optimization algorithm, the sampling rates of boundary samples and safe samples during oversampling are optimized to obtain the optimal oversampling rate; the features are optimized at the same time, so that the most representative feature combination that simplifies computation and improves the classification result is selected; AUC/F-Measure is used as the fitness function of the algorithm, thereby improving the performance of the final classifier.
Brief description of the drawings
Fig. 1 is the pseudo-code of the PSO algorithm;
Fig. 2 is a schematic diagram of the representation of a particle in the PSO algorithm;
Fig. 3 is the pseudo-code of the oversampling-rate optimization by the PSO algorithm of the present invention.
Detailed description of the invention
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
The Particle Swarm Optimization (PSO) algorithm was proposed by Kennedy and Dr. Eberhart in 1995. It is an evolutionary algorithm inspired by imitating the group behaviour of organisms in nature, such as insect swarms, herds and flocks of birds. These colonial organisms search for food, mates and other things in a way they themselves understand; each member of the group can change its own behaviour pattern, or that of the group, by learning from its own experience and from the experience of other group members, finally completing the global search of the group. A real-life example illustrates the process of the PSO algorithm: suppose a flock of birds searches aimlessly for food in a given region; the birds know nothing about the exact location of the food and do not know in which direction to fly to approach it. How can the birds find the food along an optimal search route? The correct way is to obtain information from the other particles (birds) that are closer to the food, and then move toward the region near the bird closest to the food and search there.
In computer science, PSO is a computational method that tries to obtain an optimal candidate solution by iteratively optimizing a problem with respect to a specific fitness function. In PSO, a "particle" represents an individual in the whole collective search space. Taking the flock-of-birds example above, the optimization process of PSO is as follows: initialise the positions and flight velocities of the birds, where the velocity encodes the trajectory a particle will follow in the next search step, then search for the optimal solution in the current region. In each update of the PSO algorithm two important values are retained: one is the individual best solution of each particle in the swarm, denoted pbest; the other is the global best solution, which represents the best solution found in the whole search space of the swarm, denoted gbest. The velocity and position of each particle are continually updated according to these two values, and every iteration is updated according to the following two formulas:
v_i^(t+1) = w × v_i^t + c1 × r1 × (pbest_i^t − x_i^t) + c2 × r2 × (gbest^t − x_i^t)    (2)
x_i^(t+1) = x_i^t + v_i^(t+1)    (3)
where i = {1, 2, …, SN}; the other parameters are explained in Table 1 below:
Table 1 Explanation of the PSO algorithm parameters
w: inertia weight
c1, c2: learning (acceleration) factors
r1, r2: random numbers uniformly distributed in [0, 1]
v_i^t, x_i^t: velocity and position of particle i at iteration t
pbest: individual best solution of a particle
gbest: global best solution of the whole swarm
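A minimal NumPy sketch of the update rules (2) and (3); the default values of w, c1 and c2 are common choices and are assumptions here, not values specified by the patent.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One velocity/position update per formulas (2) and (3).
    x, v, pbest: arrays of shape (SN, D); gbest: array of shape (D,)."""
    rng = np.random.default_rng(rng)
    r1 = rng.random(x.shape)        # r1, r2 ~ U(0, 1)
    r2 = rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # formula (2)
    x = x + v                                                    # formula (3)
    return x, v
```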
When the particle swarm optimization algorithm is applied to the optimization of parameters or weight coefficients in a concrete application, a globally optimal solution can generally be obtained, thereby improving the final experimental results. The pseudo-code of the PSO algorithm is shown in Fig. 1.
By using the particle swarm optimization algorithm, the sampling rates of boundary samples and safe samples during oversampling are optimized to obtain the optimal oversampling rate; the features are optimized at the same time, so that the most representative feature combination that simplifies computation and improves the classification result is selected; AUC/F-Measure is used as the fitness function of the algorithm, thereby improving the performance of the final classifier. The representation of a particle in the particle swarm optimization algorithm is shown in Fig. 2. Here OsRate denotes the oversampling rate, OsRate_i being the oversampling rate of the i-th cluster (i = 1, 2, …, N), where N is the number of clusters formed by the improved DBSCAN algorithm; UsRate denotes the undersampling rate of the majority-class boundary samples; Fec denotes a feature, Fec_i being the i-th feature of a sample (i = 1, 2, …, M). The feature vector here is represented in binary form: a 1 means the feature at that position is helpful to the classification result and needs to be retained, while a 0 means the feature does not help improve the classification result and can be removed to simplify the training process.
AUC (Area under the Receiver Operating Characteristic Curve) is a visualised evaluation method for classifier performance, represented in a two-dimensional coordinate system. The X-axis is the proportion of misclassified minority (positive) samples (FP_Rate) and the Y-axis is the proportion of correctly classified minority (positive) samples (TP_Rate). After classifying a group of sample data, a classifier produces points (FP_Rate, TP_Rate); adjusting the classifier's threshold produces multiple points that form the ROC curve, and the AUC is the area enclosed below the curve (toward the lower-right corner). The larger the AUC, the stronger the recognition capability of the classifier.
F-Measure is the metric most frequently used in evaluating imbalanced classification, given by the formula below. F-Measure is a compound of the recall, the precision and a balance factor; when both Recall and Precision reach high values, the F-Measure also reaches an ideal result.
F-Measure = (1 + β²) × Recall × Precision / (β² × Recall + Precision)    (4)
where β is the balance factor regulating recall and precision (β is usually set to 1).
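For reference, the two fitness measures can be computed with scikit-learn; β = 1 here, matching the usual setting for formula (4), and the variable names are illustrative.

```python
from sklearn.metrics import roc_auc_score, fbeta_score

def evaluate(y_true, y_score, y_pred, beta=1.0):
    """y_score: predicted positive-class probabilities; y_pred: hard 0/1 predictions."""
    auc = roc_auc_score(y_true, y_score)                              # area under the ROC curve
    f_measure = fbeta_score(y_true, y_pred, beta=beta, pos_label=1)   # formula (4)
    return auc, f_measure
```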
The clustering process of the standard DBSCAN algorithm is as follows: according to a preset neighbourhood radius and density, find in the whole data set the points that satisfy the condition, called core points, and then expand from these core points. The expansion method is to find, starting from each core point, the other sample points that are density-connected to it: traverse all core points in the ε-neighbourhood of the core point (boundary points cannot be expanded), find the other minority-class sample points density-connected to this core point, judge whether each such point is itself a core point, and continue expanding from it until no further density-connected core point that can be expanded is found. Next, rescan the minority-class samples that have not yet been clustered, find among the remaining data the other core points that have not been assigned to any cluster, and expand them with the same method until no core points remain unclustered in the whole data set. After clustering finishes, the minority-class sample points that do not belong to any cluster are noise points.
The improved DBSCAN algorithm can handle imbalance within a class; its basic idea is as follows:
First, considering that the distribution density of the minority-class samples within a class is uneven, a group of EPS values based on the distribution density can be obtained. In a data set with within-class imbalance, each minority-class sample point lies at a different distance from the other minority-class sample points, i.e. the distribution densities differ. The distribution density is measured by the distances from any minority-class sample to its K nearest minority-class sample points. The concrete method is as follows: for an arbitrary minority-class sample point X_i, find its K nearest minority-class sample points, compute the distance from X_i to each of these K minority-class samples, and then average these distances. From the average distance obtained, the distribution density of the minority-class sample point X_i is obtained; this average-distance measure of distribution density can be computed for every minority-class sample point.
Then the computed average distances of all minority-class sample points are assembled into a distance vector array. Taking these average distances as a raw data set, this data set is clustered by distance. After the distance array has been clustered into N clusters, all distances within each cluster are summed and averaged, and the resulting mean value is taken as the neighbourhood threshold of that cluster; by computing this mean value for each of the N clusters, N neighbourhood thresholds EPS_i (i = 1, 2, …, N) are obtained.
Next, these N neighbourhood thresholds are sorted from small to large and the sorted thresholds are saved in an array, ready for the successive determination of the EPS parameter of the improved DBSCAN algorithm in the next step.
In the subsequent clustering, the smallest value in the threshold array is first selected as the EPS value of the DBSCAN algorithm (MinPts can be specified manually and remains constant during training), and all minority-class samples are clustered, yielding several minority-class clusters that satisfy this density; the minority-class samples that do not satisfy the condition are labelled as noise samples. The next threshold in the array is then used to run DBSCAN again on the minority-class samples labelled as noise points, likewise yielding some clusters and the remaining noise sample points.
Finally, the above operation is repeated, using the thresholds in the array from small to large to cluster the minority-class samples labelled as noise points with DBSCAN. When all minority-class samples have been clustered with the different EPS values, all clustering operations on the minority-class samples are complete; the data that are still not assigned to any cluster at the end are noise data.
With the improved DBSCAN algorithm, not only can the clusters of the minority class be produced so that oversampling can then be performed within these clusters, but the problems of uneven within-class distribution and of data fragmentation (small disjuncts) are also fully addressed.
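A compressed sketch of this multi-threshold procedure, assuming scikit-learn's NearestNeighbors, KMeans and DBSCAN as building blocks; the values of K, N and MinPts are illustrative, not those of the patent.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans, DBSCAN

def improved_dbscan(X_min, k=5, n_eps=3, min_pts=5):
    # 1) Average distance of each minority sample to its k nearest minority neighbours.
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    avg_dist = dist[:, 1:].mean(axis=1)
    # 2) Cluster the average distances and take each cluster mean as a threshold EPS_i.
    km = KMeans(n_clusters=n_eps, n_init=10, random_state=0).fit(avg_dist.reshape(-1, 1))
    eps_list = sorted(avg_dist[km.labels_ == c].mean() for c in range(n_eps))
    # 3) Apply DBSCAN with increasing EPS; only still-unclustered ("noise") points move on.
    labels = np.full(len(X_min), -1)
    next_id, remaining = 0, np.arange(len(X_min))
    for eps in eps_list:
        if len(remaining) < min_pts:
            break
        sub = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_min[remaining])
        for c in set(sub) - {-1}:
            labels[remaining[sub == c]] = next_id
            next_id += 1
        remaining = remaining[sub == -1]
    return labels          # entries still equal to -1 are the final noise points
```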
The pseudo-code of the PSO algorithm for the oversampling-rate optimization problem and the feature-selection optimization in imbalanced classification is shown in Fig. 3, where MCN is the maximum number of cycles and NumFolds is the parameter K of K-fold cross-validation.
The particle swarm optimization algorithm mainly optimizes continuous values, whereas the feature vector here is discrete. To let PSO handle the feature vector as well, a sigmoid function is used: the generated continuous velocity v is converted into 0/1 discrete values with formulas (5) and (6) below, so that PSO can be applied to the selection of the feature-optimization set.
v_i^t′ = sigmoid(v_i^t) = 1 / (1 + e^(−v_i^t))    (5)
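A small sketch of how the continuous velocities of the feature part of a particle can be squashed with formula (5) and then thresholded into a 0/1 feature mask; the stochastic thresholding step is the usual binary-PSO rule and is an assumption here, since formula (6) is not reproduced above.

```python
import numpy as np

def binarize_features(v_feat, rng=None):
    """Map continuous feature velocities to a 0/1 selection mask via the sigmoid of formula (5)."""
    rng = np.random.default_rng(rng)
    s = 1.0 / (1.0 + np.exp(-v_feat))                   # formula (5): sigmoid(v)
    return (rng.random(v_feat.shape) < s).astype(int)   # 1 = keep the feature, 0 = drop it
```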
After the boundary-sample sampling rate and the safe-sample sampling rate of each cluster, together with the features that improve the classification result, have been determined by the PSO algorithm, the most representative data features and data-sample set can be selected by feature extraction; the minority-class samples are then oversampled with the oversampling rate obtained by the optimization, finally yielding the desired balanced data set.
The base classifier of the AdaBoost algorithm can be any single classifier; common machine learning algorithms such as decision trees, neural networks and SVMs are generally used as base classifiers nowadays. Most ensemble learning methods use the same single base-classifier learning algorithm to construct a homogeneous ensemble learning framework, or use different base classifiers to construct a heterogeneous ensemble learning framework.
The idea of AdaBoost is to adjust and normalise the sample weights according to the classification result of the current iteration, thereby ensuring that in the next iteration the weak classifier concentrates more on the samples misclassified by the previous classifier; the update policy of the sample weights is therefore one factor affecting the classification result. At the same time, the weight coefficients of the base classifiers also affect the final classification of the AdaBoost algorithm, so how to set the voting weight coefficients of the base classifiers is also one of the main research topics of the present invention.
The voting weight coefficients of the base classifiers in the AdaBoost algorithm are determined following the same idea: the coefficients of the multiple base classifiers are taken as a "particle" in the PSO optimization algorithm, a specific setting is carried out, and the optimal combination of the voting weight coefficients of these base classifiers in the AdaBoost algorithm is finally obtained.
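Conceptually, each candidate weight vector can be scored by the AUC of the weighted vote it produces; the following is a hedged sketch under that assumption, with illustrative function names and with predict_proba outputs used as the base-classifier votes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def voting_weight_fitness(weights, base_clfs, X_val, y_val):
    """Fitness of one 'particle' of base-classifier voting weights: AUC of the weighted vote."""
    weights = np.clip(weights, 0.0, None)
    weights = weights / (weights.sum() + 1e-12)              # normalise the voting weights
    probs = [clf.predict_proba(X_val)[:, 1] for clf in base_clfs]
    combined = np.average(np.vstack(probs), axis=0, weights=weights)
    return roc_auc_score(y_val, combined)
```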
The above content is a further detailed description of the present invention in combination with concrete preferred embodiments, and it cannot be concluded that the concrete implementation of the present invention is confined to these descriptions. For those of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions can also be made without departing from the inventive concept of the present invention, and all of them shall be regarded as falling within the protection scope of the present invention.

Claims (4)

1. An imbalanced sample classification method based on the PSO algorithm, characterised in that a particle of the PSO algorithm in the method is represented as OsRate_i, UsRate and Fec_j, where OsRate_i is the oversampling rate of the i-th cluster, i = 1, 2, …, N, N is the number of clusters formed by the DBSCAN algorithm, and Fec_j is the j-th feature of a sample, j = 1, 2, …, M; the method comprises: (1) dividing the data set into a training set Train and a test set Test; (2) generating initial solutions x_i, i = 1, 2, …, SN, in the search space, where SN is the population size; (3) setting the global optimum gbest = 0; (4) executing steps (5) and (6) for MCN (maximum number of cycles) iterations; (5) for j from 1 to SN, obtaining the current solution x_j; according to the obtained solution, reselecting the features to generate a new data set, and learning through k-fold cross-validation and a classifier to obtain the corresponding AUC or F-Measure; (6) obtaining each particle's pbest according to the velocity and position update formulas of the PSO algorithm, and updating the global optimum gbest at the same time; (7) combining the obtained oversampling rates, undersampling rate and features, performing hybrid sampling to build the training data set, then training with the classifier to obtain the final AUC.
2. The imbalanced sample classification method according to claim 1, characterised in that: the DBSCAN algorithm is an improved DBSCAN algorithm that can handle within-class imbalance, and its basic idea is: first, considering that the distribution density of the minority-class samples within a class is uneven, a group of EPS values based on the distribution density is obtained; then the computed average distances of all minority-class sample points are assembled into a distance vector array, and, taking these average distances as a raw data set, this data set is clustered by distance; after the distance array has been clustered into N clusters, all distances within each cluster are summed and averaged, the resulting mean value is taken as the neighbourhood threshold of that cluster, and by computing this mean value for each of the N clusters, N neighbourhood thresholds EPS_i, i = 1, 2, …, N, are obtained; next, the N neighbourhood thresholds are sorted from small to large and saved in an array; in the subsequent clustering, the smallest value in the threshold array is first selected as the EPS value of the DBSCAN algorithm and all minority-class samples are clustered, then the next threshold in the array is used to run DBSCAN again on the minority-class samples labelled as noise points, likewise yielding some clusters and the remaining noise sample points; finally, the above operation is repeated, and when all minority-class samples have been clustered with the different EPS values, all clustering operations on the minority-class samples are complete; the data that are still not assigned to any cluster at the end are noise data.
3. The imbalanced sample classification method according to claim 1, characterised in that: the PSO algorithm mainly optimizes continuous values, whereas the feature vector here is of discrete type; to enable the PSO algorithm to process the discrete feature vector, a sigmoid function is used to convert the generated continuous velocity values into 0/1 discrete values.
4. The imbalanced sample classification method according to claim 1, characterised in that: after the boundary-sample sampling rate and the safe-sample sampling rate of each cluster, together with the features that improve the classification result, have been determined by the PSO algorithm, the most representative data features and data-sample set are selected by feature extraction, the minority-class samples are then oversampled with the oversampling rate obtained by the optimization, and the desired balanced data set is finally obtained.
CN201610172812.3A 2016-03-23 2016-03-23 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm Pending CN105868775A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610172812.3A CN105868775A (en) 2016-03-23 2016-03-23 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610172812.3A CN105868775A (en) 2016-03-23 2016-03-23 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm

Publications (1)

Publication Number Publication Date
CN105868775A true CN105868775A (en) 2016-08-17

Family

ID=56624706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610172812.3A Pending CN105868775A (en) 2016-03-23 2016-03-23 Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm

Country Status (1)

Country Link
CN (1) CN105868775A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154163A (en) * 2016-12-06 2018-06-12 北京京东尚科信息技术有限公司 Data processing method, data identification and learning method and its device
CN108154163B (en) * 2016-12-06 2020-11-24 北京京东尚科信息技术有限公司 Data processing method, data recognition and learning method, apparatus thereof, and computer readable medium
CN106951665A (en) * 2017-04-28 2017-07-14 成都理工大学 Swarm optimization method and device based on crossover operator
CN107145659A (en) * 2017-04-28 2017-09-08 成都理工大学 A kind of predictor selection method and device for being used to evaluate risk of landslip
WO2019033636A1 (en) * 2017-08-16 2019-02-21 哈尔滨工业大学深圳研究生院 Method of using minimized-loss learning to classify imbalanced samples
CN107578061A (en) * 2017-08-16 2018-01-12 哈尔滨工业大学深圳研究生院 Based on the imbalanced data classification issue method for minimizing loss study
CN107563435A (en) * 2017-08-30 2018-01-09 哈尔滨工业大学深圳研究生院 Higher-dimension unbalanced data sorting technique based on SVM
WO2019041629A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for classifying high-dimensional imbalanced data based on svm
CN109409433A (en) * 2018-10-31 2019-03-01 北京邮电大学 A kind of the personality identifying system and method for social network user
CN109447158A (en) * 2018-10-31 2019-03-08 中国石油大学(华东) A kind of Adaboost Favorable Reservoir development area prediction technique based on unbalanced data
CN109858564A (en) * 2019-02-21 2019-06-07 上海电力学院 Modified Adaboost-SVM model generating method suitable for wind electric converter fault diagnosis
CN109858564B (en) * 2019-02-21 2023-05-05 上海电力学院 Improved Adaboost-SVM model generation method suitable for wind power converter fault diagnosis
CN109887005A (en) * 2019-02-26 2019-06-14 华北理工大学 The TLD target tracking algorism of view-based access control model attention mechanism
CN109887005B (en) * 2019-02-26 2023-05-30 天津城建大学 TLD target tracking method based on visual attention mechanism
CN109948732A (en) * 2019-03-29 2019-06-28 济南大学 Abnormal cell DISTANT METASTASES IN classification method and system based on non-equilibrium study
CN109948732B (en) * 2019-03-29 2020-12-22 济南大学 Abnormal cell distant metastasis classification method and system based on unbalanced learning
CN111710427A (en) * 2020-06-17 2020-09-25 广州市金域转化医学研究院有限公司 Cervical precancerous early lesion stage diagnosis model and establishment method
CN112022149B (en) * 2020-09-04 2022-10-04 无锡博智芯科技有限公司 Atrial fibrillation detection method based on electrocardiosignals
CN112022149A (en) * 2020-09-04 2020-12-04 无锡博智芯科技有限公司 Atrial fibrillation detection method based on electrocardiosignals
CN112770256B (en) * 2021-01-06 2022-09-09 重庆邮电大学 Node track prediction method in unmanned aerial vehicle self-organizing network
CN112770256A (en) * 2021-01-06 2021-05-07 重庆邮电大学 Node track prediction method in unmanned aerial vehicle self-organizing network
CN113628701A (en) * 2021-08-12 2021-11-09 上海大学 Material performance prediction method and system based on density unbalance sample data
CN113628701B (en) * 2021-08-12 2024-04-26 上海大学 Material performance prediction method and system based on density imbalance sample data
CN115905894A (en) * 2023-01-10 2023-04-04 佰聆数据股份有限公司 Equipment residual period analysis method and device based on small sample unbalanced data
CN116108387A (en) * 2023-04-14 2023-05-12 湖南工商大学 Unbalanced data oversampling method and related equipment

Similar Documents

Publication Publication Date Title
CN105868775A (en) Imbalance sample classification method based on PSO (Particle Swarm Optimization) algorithm
Sun et al. Improved monarch butterfly optimization algorithm based on opposition-based learning and random local perturbation
Alswaitti et al. Density-based particle swarm optimization algorithm for data clustering
Karaboga et al. Fuzzy clustering with artificial bee colony algorithm
CN106682682A (en) Method for optimizing support vector machine based on Particle Swarm Optimization
Lan et al. A two-phase learning-based swarm optimizer for large-scale optimization
CN106202952A (en) A kind of Parkinson disease diagnostic method based on machine learning
Örkcü et al. Estimating the parameters of 3-p Weibull distribution using particle swarm optimization: A comprehensive experimental comparison
CN110245252A (en) Machine learning model automatic generation method based on genetic algorithm
CN106604229A (en) Indoor positioning method based on manifold learning and improved support vector machine
CN107992895A (en) A kind of Boosting support vector machines learning method
CN111986811A (en) Disease prediction system based on big data
CN109086412A (en) A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
CN109784405A (en) Cross-module state search method and system based on pseudo label study and semantic consistency
Sasmal et al. A comprehensive survey on aquila optimizer
CN110070116A (en) Segmented based on the tree-shaped Training strategy of depth selects integrated image classification method
CN109492748A (en) A kind of Mid-long term load forecasting method for establishing model of the electric system based on convolutional neural networks
Fei et al. Research on data mining algorithm based on neural network and particle swarm optimization
CN110598836B (en) Metabolic analysis method based on improved particle swarm optimization algorithm
Naik et al. A global-best harmony search based gradient descent learning FLANN (GbHS-GDL-FLANN) for data classification
Ding et al. An improved SFLA-kmeans algorithm based on approximate backbone and its application in retinal fundus image
CN105550711A (en) Firefly algorithm based selective ensemble learning method
CN106056167A (en) Normalization possibilistic fuzzy entropy clustering method based on Gaussian kernel hybrid artificial bee colony algorithm
CN105913085A (en) Tensor model-based multi-source data classification optimizing method and system
CN108549936A (en) The Enhancement Method that self organizing neural network topology based on deep learning is kept

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160817