CN105975992A - Unbalanced data classification method based on adaptive upsampling - Google Patents


Info

Publication number
CN105975992A
CN105975992A (application CN201610331709.9A)
Authority
CN
China
Prior art date
Legal status
Pending
Application number
CN201610331709.9A
Other languages
Chinese (zh)
Inventor
吕卫
李喆
褚晶辉
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201610331709.9A
Publication of CN105975992A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2148: Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade

Abstract

The invention relates to an unbalanced data classification method based on adaptive upsampling. The method comprises the following steps: calculating the total number of new positive samples to be generated; calculating a probability density distribution over the positive samples, using Euclidean distance as the metric; determining, for each positive sample, the number of new samples to generate from it; generating the new positive samples and adding the newly generated positive sample points to the original unbalanced training set so that the positive and negative samples are equal in number, thereby obtaining a new balanced training set containing n_n positive samples and n_n negative samples; and training on the newly generated balanced training set with the Adaboost algorithm, obtaining the final classification model after T iterations. The invention improves classification performance on unbalanced datasets.

Description

An unbalanced-dataset classification method based on adaptive upsampling
Technical Field
The present invention relates to the technical field of pattern recognition, and specifically to a classifier for unbalanced datasets.
Background technology
With the rapid development of data mining, pattern recognition and machine learning, data classification is applied in, and plays a significant role in, many fields, including image retrieval, medical detection and diagnosis, lie detection, text classification and crude-oil leakage detection. However, classical classification algorithms such as support vector machines, artificial neural networks and linear discriminant analysis all assume at design time that the classes in the training set contain roughly equal numbers of samples. In practice, in the fields above, the number of abnormal samples (positive samples) is often far smaller than the number of normal samples (negative samples). To obtain a higher overall accuracy, a classical classifier then pays more attention to the negative class: the classification boundary shifts toward the positive samples, a large number of positive samples are misclassified as negative, and classification performance on the positive class ultimately degrades. Since in most cases the abnormal samples are the more valuable ones for decision-making, classification algorithms for unbalanced datasets that raise positive-class accuracy have become a research hotspot.
In recent years, researchers have proposed a variety of classification methods for unbalanced datasets. According to the object they act on, these methods fall into two broad classes: data-level methods and algorithm-level methods.
Data-level methods change the data distribution by resampling so that the numbers of positive and negative samples become roughly equal, thereby balancing the data. Either downsampling the negative samples or upsampling the positive samples can achieve this. The patent "Protein-nucleotide binding site prediction method based on supervised upsampling learning" (CN104077499A) adopts upsampling, increasing the number of positive samples to obtain a balanced dataset for training a support vector machine. However, because this kind of method simply duplicates positive samples and adds them back to the original dataset, each positive sample is in effect trained on repeatedly; overfitting easily occurs, and classifier performance ultimately declines. The patent "Automatic traffic incident detection method for unbalanced datasets based on sub-sampling" (CN103927874A) uses downsampling: a subset of samples is drawn at random from the negative class and combined with all the positive samples to form the training set. But because a large number of negative samples are discarded, the method cannot guarantee that the extracted negative subset represents the original sample set well, so the training effect is still not ideal.
Algorithm-level methods address the imbalance problem by improving the classification algorithm rather than by changing the data distribution. Adaboost is one of the classical algorithm-level methods. It cascades multiple classifiers and continually increases the weights of misclassified samples, raising the cost of misclassifying such samples again and thereby improving classification accuracy. However, because the traditional Adaboost algorithm does not itself pay special attention to the positive samples, its effect on unbalanced data is still not ideal.
The above analysis shows that although both data-level and algorithm-level methods can alleviate the impact of data imbalance on classification, each kind of method has its limitations.
Summary of the invention
The object of the present invention is to overcome the deficiencies of existing methods and propose an unbalanced-dataset classification algorithm based on adaptive upsampling, so as to improve classification performance on unbalanced datasets. The technical scheme is as follows:
An unbalanced-dataset classification method based on adaptive upsampling. Let the original unbalanced dataset contain n_p positive samples and n_n negative samples. The method comprises the following steps:
(1) Compute the imbalance ratio IR of the dataset from n_p and n_n, and from IR compute the total number G of new positive samples to generate;
(2) Using Euclidean distance as the metric, for each positive sample i find its K nearest neighbours in the unbalanced dataset and count the proportion of negative samples among these K neighbours, denoted p_i. Sum the p_i values over all positive samples and normalize; denote the normalized value r_i, so that the r_i values of all positive samples sum to 1, i.e. the r_i form a probability density distribution. r_i is called the probability of positive sample i;
(3) For each positive sample i, determine the number g_i of new samples to generate from it, according to G and the probability r_i obtained in step (2);
(4) For each positive sample i, randomly select g_i of the K nearest neighbours obtained in step (2), pair each with sample i, and pick a random point on the line segment joining each pair to obtain a new positive sample. When generation is complete, G new positive sample points have been produced; add them to the original unbalanced training set so that the numbers of positive and negative samples are equal, yielding a new balanced training set with n_n positive and n_n negative samples;
(5) Let T be the number of Adaboost iterations. Train on the newly generated balanced training set with the Adaboost algorithm; after T iterations the final classification model is obtained.
The present invention targets unbalanced datasets with an algorithm that combines a data-level method and an algorithm-level method, improving and optimizing the upsampling algorithm: upsampling is applied mainly to positive sample points near the positive-negative boundary, while positive samples far from the boundary are left unprocessed, so as to obtain a better classification effect on unbalanced datasets. Combining the advantages of adaptive upsampling with those of the Adaboost algorithm ensures that the new positive samples generated by upsampling are concentrated near the boundary, while the classifier ensemble performs boosting to improve overall classifier performance. Experimental comparison shows that the present invention has a clear advantage on multiple classifier evaluation indices.
Brief description of the drawings
Fig. 1 is a flow chart of the Adaboost boosting algorithm.
Fig. 2 is a flow chart of the present invention.
Detailed description of the invention
The present invention is inspired by the adaptive upsampling algorithm and by the Adaboost algorithm shown in Fig. 1, and combines the two into an integrated classifier. The present invention is described in further detail below with reference to the accompanying drawings.
(1) Obtain the test and training data: the present invention uses the vehicle-type recognition database from the KEEL repository, which contains 846 samples in total. The positive samples are the van data, 199 in total, i.e. n_p = 199. The negative samples comprise the data of three vehicle types, bus, Opel car and Saab car, 647 in total, i.e. n_n = 647. The database contains 18 feature dimensions, including torque, turning radius and maximum braking distance. The imbalance ratio is computed by formula (1):
IR = n_n / n_p    (1)
For this experiment the imbalance ratio is 647/199 ≈ 3.25.
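Formula (1) is plain arithmetic; a quick check with the sample counts quoted above (an illustrative sketch, not part of the patent):

```python
# Sample counts from the KEEL "vehicle" dataset as used in the embodiment.
n_p = 199   # positive samples (vans)
n_n = 647   # negative samples (bus, Opel, Saab)

IR = n_n / n_p              # formula (1): imbalance ratio
print(round(IR, 2))         # 3.25
```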
(2) The number of new positive samples to generate is computed by formula (2):
G = (n_n - n_p) × β    (2)
where β is a constant between 0 and 1. When β = 1, the numbers of positive and negative samples after upsampling are exactly equal and the dataset is fully balanced; the present invention takes β = 1. It follows that the number of new positive samples to generate is 448. The positive samples are then adaptively upsampled according to this value so that the positive and negative sample counts balance. Specifically: for each positive sample, with Euclidean distance as the metric, the proportion p_i of negative samples among its K nearest sample points is computed:
p_i = k_i / K,  i = 1, …, n_p    (3)
where k_i is the number of negative samples among the K nearest neighbours of positive sample i. To judge reliably whether each positive sample lies near the positive-negative boundary, K should be large; but as K grows, the amount of computation also grows substantially. To keep computational complexity low, the present invention makes a compromise between these two demands and takes K = 5. All p_i are then normalized so that they form a probability density distribution, and the number of new positive samples that each positive sample should generate is computed:
g_i = (p_i / Σ_{j=1}^{n_p} p_j) × G    (4)
Formula (4) shows that sample points near the boundary, whose neighbourhoods contain more negative samples, will be used to generate more positive samples, while sample points far from the boundary, whose neighbourhoods consist entirely of positive samples, will not be used to generate any. Then, for each positive sample, g_i of its K nearest sample points are selected at random, and new positive samples are generated by the method of formula (5):
new_i = x_i + λ(x_ni - x_i)    (5)
where new_i is the newly generated sample point, λ is a random number between 0 and 1, and x_ni is the randomly selected neighbouring sample point. For each positive sample this process is performed g_i times. After sample generation completes, the newly generated sample points are added to the original unbalanced training set, yielding the new balanced training set. This adaptive upsampling method ensures that the newly generated training set no longer suffers from imbalance, and that the new samples lie mainly in the boundary region where positive and negative samples are hardest to distinguish.
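The generation procedure of formulas (2)-(5) can be sketched in numpy. This is a minimal illustrative implementation, not the patented embodiment: the function name is ours, g_i is rounded down to an integer (the patent does not say how fractional counts are handled), and the interpolation partner is drawn from the same K-neighbour set used for p_i, which here may include negative samples.

```python
import numpy as np

def adaptive_upsample(X_pos, X_neg, K=5, beta=1.0, rng=None):
    """ADASYN-style adaptive upsampling following steps (1)-(4).

    The fraction of negatives among each positive sample's K nearest
    neighbours (formula (3)) decides how many synthetic points it spawns
    (formula (4)); new points lie on segments joining the sample to
    randomly chosen neighbours (formula (5))."""
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_pos, X_neg])
    is_neg = np.array([False] * len(X_pos) + [True] * len(X_neg))

    G = int(round((len(X_neg) - len(X_pos)) * beta))      # formula (2)

    p = np.empty(len(X_pos))
    neighbours = []
    for i, x in enumerate(X_pos):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                                     # exclude the point itself
        nn_idx = np.argsort(d)[:K]
        neighbours.append(nn_idx)
        p[i] = is_neg[nn_idx].mean()                      # formula (3)

    if p.sum() == 0:                                      # no positives near the boundary
        return np.empty((0, X_pos.shape[1]))
    r = p / p.sum()                                       # normalised density
    g = np.floor(r * G).astype(int)                       # formula (4), rounded down

    new_points = []
    for i, x in enumerate(X_pos):
        for _ in range(g[i]):
            xn = X_all[rng.choice(neighbours[i])]         # random neighbour partner
            lam = rng.random()
            new_points.append(x + lam * (xn - x))         # formula (5)
    return np.array(new_points) if new_points else np.empty((0, X_pos.shape[1]))
```

With β = 1 and floor rounding the number of generated points is at most G; a production implementation would redistribute the remainder so that the classes balance exactly.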
As can be seen from Figs. 1 and 2, if simple random upsampling were used instead, every positive sample point would merely be duplicated, and the newly generated sample points would coincide exactly with the original positive sample points distributed over the whole positive sample space. Adaptive upsampling, by contrast, generates positive samples that differ from the original sample points, and the newly generated positive samples all lie near the boundary.
(3) The present invention uses five-fold cross-validation to train and test on the unbalanced dataset. Both training and testing use the Adaboost classification algorithm with a C4.5 decision tree as the base classifier, where the minimum leaf node count of the C4.5 decision tree is set to 2, the confidence level to 0.25, and the tree is pruned after training. All data are normalized before entering the classifier, i.e. the minimum value of the data is 0 and the maximum is 1. Positive samples are labelled +1 and negative samples -1.
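The normalization described above is the standard min-max scaling; a short sketch under the assumption (not stated in the patent) that the scaling is applied per feature column:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column to [0, 1], as required before classification."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (X - lo) / span
```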
The balanced positive and negative samples are divided into training and test sets by five-fold cross-validation; the training set then contains about 518 positive and 518 negative samples, i.e. m = 1036 training samples in total. Taking the Adaboost iteration count T = 10, training proceeds as follows:
1. Denote the weight of each sample in round t by D_t(i), where t takes integer values from 1 to T and indicates the current iteration round, and i is the sample index. Initialize the weights to D_1(i) = 1/m, i = 1, …, m.
2. Train the classifier h_t on the weighted training set, and after training compute its training error rate
ε_t = Σ_{i=1}^{m} D_t(i)·[y_i ≠ h_t(x_i)]    (6)
where t = 1, …, T is the current iteration round, ε_t is the training error rate of round t, D_t(i) is the weight of each sample in that round, y_i is the class label of sample x_i, taking the value +1 or -1, and h_t(x_i) is the label assigned to x_i by the trained classifier; the bracket [·] equals 1 when its condition holds and 0 otherwise.
3. Let α_t be the weight in the final vote of the classifier obtained after round t. From the training error rate of each round, the weight of the classifier generated in that round is
α_t = (1/2) ln((1 - ε_t) / ε_t)    (7)
Meanwhile, the weight of each sample in the next round of iteration is updated to
D_{t+1}(i) = D_t(i)·exp[-α_t y_i h_t(x_i)] / Z_t    (8)
where Z_t is the sum of the updated, unnormalized sample weights in the current round, used to normalize the weights so that they again form a distribution.
4. Perform steps 2 and 3 a total of T times to complete the whole iteration and weight-update process, finishing classifier training. For a test sample x to be classified, the classification result is
H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )    (9)
Formula (7) shows that the weight of each sub-classifier is determined by its classification error rate: a classifier with a lower error rate obtains a higher weight in the voting process of formula (9). Moreover, for a single sample, formula (8) shows that if the sample's original label differs from the classification result, the exponent is greater than 0 and the exponential factor is greater than 1, so the sample's weight in the next round of iteration increases; otherwise, the sample's weight in the next round decreases.
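The training loop of steps 1-4 and formulas (6)-(9) can be sketched as follows. This is an illustrative numpy version only: one-level decision stumps stand in for the patent's pruned C4.5 base classifier, and the error-clipping constant is our addition to avoid log(0) when a stump classifies perfectly.

```python
import numpy as np

def train_adaboost(X, y, T=10):
    """Adaboost with decision stumps as base learners; labels y are +/-1.
    Implements the weight and vote updates of formulas (6)-(8)."""
    m = len(y)
    D = np.full(m, 1.0 / m)                               # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        # exhaustively pick the stump minimising the weighted error
        best = None
        for f in range(X.shape[1]):
            for thr in np.unique(X[:, f]):
                for sign in (1, -1):
                    pred = np.where(X[:, f] <= thr, sign, -sign)
                    err = D[pred != y].sum()              # formula (6)
                    if best is None or err < best[0]:
                        best = (err, f, thr, sign, pred)
        err, f, thr, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)             # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)             # formula (7)
        D = D * np.exp(-alpha * y * pred)                 # formula (8), numerator
        D /= D.sum()                                      # divide by Z_t
        stumps.append((f, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(stumps, alphas, X):
    """Weighted vote of formula (9): H(x) = sign(sum_t alpha_t h_t(x))."""
    score = np.zeros(len(X))
    for (f, thr, sign), alpha in zip(stumps, alphas):
        score += alpha * np.where(X[:, f] <= thr, sign, -sign)
    return np.where(score >= 0, 1, -1)
```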
The test-set samples are fed into the trained classifier to obtain the final classification results of the test samples, as shown in Fig. 2.
Table 1 compares the test results obtained by classifying the unbalanced dataset directly with a C4.5 decision tree, by classifying with C4.5 after random upsampling of the positive samples, and by the method of the present invention. Classifier performance is evaluated with indices including sensitivity, specificity and their geometric mean:
Table 1. Classification results and comparison (the best result under each index is marked in bold)
The data in Table 1 show that although classifying directly with the C4.5 decision tree achieves the highest specificity, its sensitivity is the lowest, demonstrating that the data imbalance significantly harms classification performance: the boundary region of the positive class is encroached upon, and a large number of positive samples are misclassified as negative. After simple random upsampling this problem is alleviated, but the gap between sensitivity and specificity remains large. The present invention achieves good sensitivity and specificity simultaneously, and their geometric mean is the highest among the compared methods, demonstrating that the present invention gives the best trade-off between sensitivity and specificity.
In summary, the present invention obtains a good classification effect on unbalanced datasets and effectively eliminates the negative influence of the data imbalance problem on classification.

Claims (1)

1. An unbalanced-dataset classification method based on adaptive upsampling, wherein the original unbalanced dataset contains n_p positive samples and n_n negative samples, the method comprising the following steps:
(1) computing the imbalance ratio IR of the dataset from n_p and n_n, and from IR the total number G of new positive samples to generate;
(2) using Euclidean distance as the metric, finding for each positive sample i its K nearest neighbours in the unbalanced dataset and counting the proportion of negative samples among these K neighbours, denoted p_i; summing the p_i values over all positive samples and normalizing, the normalized value being denoted r_i, so that the r_i values of all positive samples sum to 1, i.e. the r_i form a probability density distribution, r_i being called the probability of positive sample i;
(3) for each positive sample i, determining from G and the probability r_i obtained in step (2) the number g_i of new samples to generate from it;
(4) for each positive sample i, randomly selecting g_i of the K nearest neighbours obtained in step (2), pairing each with sample i, and randomly picking a point on the line segment joining each pair to obtain a new positive sample; after generation completes, G new positive sample points having been produced, adding them to the original unbalanced training set so that the numbers of positive and negative samples are equal, yielding a new balanced training set with n_n positive and n_n negative samples;
(5) letting T be the number of Adaboost iterations, training on the newly generated balanced training set with the Adaboost algorithm, and obtaining the final classification model after T iterations.
CN201610331709.9A 2016-05-18 2016-05-18 Unbalanced data classification method based on adaptive upsampling Pending CN105975992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610331709.9A CN105975992A (en) 2016-05-18 2016-05-18 Unbalanced data classification method based on adaptive upsampling


Publications (1)

Publication Number Publication Date
CN105975992A true CN105975992A (en) 2016-09-28

Family

ID=56955297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610331709.9A Pending CN105975992A (en) 2016-05-18 2016-05-18 Unbalanced data classification method based on adaptive upsampling

Country Status (1)

Country Link
CN (1) CN105975992A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927874A (en) * 2014-04-29 2014-07-16 东南大学 Automatic incident detection method based on under-sampling and used for unbalanced data set
CN104573708A (en) * 2014-12-19 2015-04-29 天津大学 Ensemble-of-under-sampled extreme learning machine
CN104951809A (en) * 2015-07-14 2015-09-30 西安电子科技大学 Unbalanced data classification method based on unbalanced classification indexes and integrated learning
CN105373806A (en) * 2015-10-19 2016-03-02 河海大学 Outlier detection method based on uncertain data set


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAIBO HE et al.: "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning", 2008 IEEE International Joint Conference on Neural Networks *
LIU Yuxia et al.: "A new oversampling algorithm, DB_SMOTE", Computer Engineering and Applications *
TAO Xinmin et al.: "A survey of classification algorithms for imbalanced data", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108133223A (en) * 2016-12-01 2018-06-08 富士通株式会社 The device and method for determining convolutional neural networks CNN models
CN108133223B (en) * 2016-12-01 2020-06-26 富士通株式会社 Device and method for determining convolutional neural network CNN model
CN108629413A (en) * 2017-03-15 2018-10-09 阿里巴巴集团控股有限公司 Neural network model training, trading activity Risk Identification Method and device
CN108629413B (en) * 2017-03-15 2020-06-16 创新先进技术有限公司 Neural network model training and transaction behavior risk identification method and device
CN107273916A (en) * 2017-05-22 2017-10-20 上海大学 The unknown Information Hiding & Detecting method of steganographic algorithm
CN107273916B (en) * 2017-05-22 2020-10-16 上海大学 Information hiding detection method for unknown steganography algorithm
CN110163226A (en) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 Equilibrating data set generation method and apparatus and classification method and device
CN108334455A (en) * 2018-03-05 2018-07-27 清华大学 The Software Defects Predict Methods and system of cost-sensitive hypergraph study based on search
CN108334455B (en) * 2018-03-05 2020-06-26 清华大学 Software defect prediction method and system based on search cost-sensitive hypergraph learning
CN108776711A (en) * 2018-03-07 2018-11-09 中国电力科学研究院有限公司 A kind of electrical power system transient sample data extracting method and system
CN108733633A (en) * 2018-05-18 2018-11-02 北京科技大学 A kind of the unbalanced data homing method and device of sample distribution adjustment
CN109086412A (en) * 2018-08-03 2018-12-25 北京邮电大学 A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT
CN110998648A (en) * 2018-08-09 2020-04-10 北京嘀嘀无限科技发展有限公司 System and method for distributing orders
CN109614967A (en) * 2018-10-10 2019-04-12 浙江大学 A kind of detection method of license plate based on negative sample data value resampling
CN109614967B (en) * 2018-10-10 2020-07-17 浙江大学 License plate detection method based on negative sample data value resampling
WO2020082734A1 (en) * 2018-10-24 2020-04-30 平安科技(深圳)有限公司 Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium
CN109327464A (en) * 2018-11-15 2019-02-12 中国人民解放军战略支援部队信息工程大学 Class imbalance processing method and processing device in a kind of network invasion monitoring
CN109740750A (en) * 2018-12-17 2019-05-10 北京深极智能科技有限公司 Method of data capture and device
CN109756494A (en) * 2018-12-29 2019-05-14 中国银联股份有限公司 A kind of negative sample transform method and device
CN109756494B (en) * 2018-12-29 2021-04-16 中国银联股份有限公司 Negative sample transformation method and device
CN109862392B (en) * 2019-03-20 2021-04-13 济南大学 Method, system, device and medium for identifying video traffic of internet game
CN109862392A (en) * 2019-03-20 2019-06-07 济南大学 Recognition methods, system, equipment and the medium of internet gaming video flow
CN111062806A (en) * 2019-12-13 2020-04-24 合肥工业大学 Personal finance credit risk evaluation method, system and storage medium
CN111062806B (en) * 2019-12-13 2022-05-10 合肥工业大学 Personal finance credit risk evaluation method, system and storage medium
CN111652268A (en) * 2020-04-22 2020-09-11 浙江盈狐云数据科技有限公司 Unbalanced stream data classification method based on resampling mechanism
CN111598189A (en) * 2020-07-20 2020-08-28 北京瑞莱智慧科技有限公司 Generative model training method, data generation method, device, medium, and apparatus
CN111598189B (en) * 2020-07-20 2020-10-30 北京瑞莱智慧科技有限公司 Generative model training method, data generation method, device, medium, and apparatus

Similar Documents

Publication Publication Date Title
CN105975992A (en) Unbalanced data classification method based on adaptive upsampling
CN107563435A (en) Higher-dimension unbalanced data sorting technique based on SVM
CN103632168B (en) Classifier integration method for machine learning
CN103728551B (en) A kind of analog-circuit fault diagnosis method based on cascade integrated classifier
CN105844287B (en) A kind of the domain adaptive approach and system of classification of remote-sensing images
CN101944174B (en) Identification method of characters of licence plate
CN108764366A (en) Feature selecting and cluster for lack of balance data integrate two sorting techniques
CN104598885B (en) The detection of word label and localization method in street view image
CN114241273B (en) Multi-modal image processing method and system based on Transformer network and hypersphere space learning
CN108985327B (en) Terrain matching area self-organization optimization classification method based on factor analysis
CN106202952A (en) A kind of Parkinson disease diagnostic method based on machine learning
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN104881671B (en) A kind of high score remote sensing image Local Feature Extraction based on 2D Gabor
CN109214460A (en) Method for diagnosing fault of power transformer based on Relative Transformation Yu nuclear entropy constituent analysis
CN106845717A (en) A kind of energy efficiency evaluation method based on multi-model convergence strategy
CN103020122A (en) Transfer learning method based on semi-supervised clustering
CN105426919A (en) Significant guidance and unsupervised feature learning based image classification method
CN108460421A (en) The sorting technique of unbalanced data
CN106682606A (en) Face recognizing method and safety verification apparatus
CN102156871A (en) Image classification method based on category correlated codebook and classifier voting strategy
CN106845387A (en) Pedestrian detection method based on self study
CN110059716A Construction of a CNN-LSTM-SVM network model and MOOC dropout prediction method
CN103886030B (en) Cost-sensitive decision-making tree based physical information fusion system data classification method
CN110363230A (en) Stacking integrated sewage handling failure diagnostic method based on weighting base classifier
CN110009030A (en) Sewage treatment method for diagnosing faults based on stacking meta learning strategy

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160928