CN107688831A - An imbalanced-data classification method based on cluster-based undersampling - Google Patents

An imbalanced-data classification method based on cluster-based undersampling Download PDF

Info

Publication number
CN107688831A
CN107688831A (Application CN201710784810.4A)
Authority
CN
China
Prior art keywords
sample
training set
cluster
majority class
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710784810.4A
Other languages
Chinese (zh)
Inventor
曹路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuyi University
Original Assignee
Wuyi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuyi University filed Critical Wuyi University
Priority to CN201710784810.4A priority Critical patent/CN107688831A/en
Publication of CN107688831A publication Critical patent/CN107688831A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses an imbalanced-data classification method based on cluster-based undersampling, comprising the following steps: cluster the majority-class samples of the training set with the clustering-by-fast-search-and-find-of-density-peaks algorithm, obtaining a clustering result that divides the majority-class samples of the training set into N clusters; combine each cluster of majority-class samples with the minority-class samples of the training set to form a new sample set, classify it with a support vector machine, and obtain the support vectors of the majority-class samples of the training set; gather the support vectors extracted from every cluster together with the minority-class samples of the training set to form a new training set; train a support vector machine on the new training set and evaluate its performance with the cross-validation set. The invention not only shortens the classifier's training time but also improves the recognition rate of the minority-class samples without harming the recognition rate of the majority class, improving the overall performance of the classifier.

Description

An imbalanced-data classification method based on cluster-based undersampling
Technical field
The present invention relates to the field of pattern recognition, and in particular to an imbalanced-data classification method based on cluster-based undersampling.
Background technology
Classification is a central research topic in pattern recognition, machine learning, and related fields, with very broad practical applications such as handwritten digit recognition in banking systems, face recognition in security and surveillance systems, and intrusion detection in network security. A number of relatively mature classification methods already exist, including decision trees, k-nearest neighbors, neural networks, and support vector machines; among these, the support vector machine has attracted particular attention for its complete theoretical foundation and strong empirical results. These traditional methods, however, are all built on the assumption that the class distribution is balanced: their main goal is to maximize overall classification performance, and they work well on evenly distributed data sets. Data collected in practice often exhibit imbalanced sample sizes between classes as well as noise, on which traditional classifiers fall short of the expected results.
Imbalanced data sets are ubiquitous in practice, for example in defective-product detection on production lines, credit-card fraud detection, and medical diagnosis. In such data sets the class with more samples is called the majority class and the class with fewer samples the minority class; the majority class can be far larger than the minority class. In imbalanced classification problems, recognizing the minority-class samples is usually the emphasis of the classification: among the products on a production line, most are qualified and only a small fraction are defective, and a traditional classifier whose recognition rate on defective products is very low cannot truly fulfil the purpose of detecting them. How to improve classifier performance on imbalanced problems, raising the minority-class recognition rate without harming majority-class classification accuracy, is therefore an urgent problem to be solved.
Research on classifying imbalanced data sets falls into two broad lines. The first works at the algorithm level, modifying existing algorithms so that the classification is biased towards the minority class; a typical example is the cost-sensitive support vector machine, which improves the minority class's classification accuracy by assigning minority-class samples a higher weight. The second works at the data level, preprocessing the imbalanced data set with sampling techniques so that the training set contains roughly equal numbers of minority- and majority-class samples.
Sampling techniques divide into oversampling and undersampling. Oversampling increases the number of minority-class samples by simple duplication or by heuristic methods; typical examples are random oversampling and the SMOTE (Synthetic Minority Over-sampling Technique) algorithm. SMOTE constructs new sample points by random interpolation between a given minority-class sample point and its K nearest neighbors, which improves imbalanced-classification performance to some extent. But neither random oversampling nor SMOTE follows the underlying distribution of the data itself: when the generated samples are inconsistent with the original distribution, noise is inevitably introduced, which not only invites overfitting but also increases algorithmic complexity, making these methods ill-suited to the current trend towards big data.
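As a concrete illustration of the interpolation idea behind SMOTE (a minimal NumPy sketch of the interpolation step only, not the full published algorithm; the function name and parameters are illustrative):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours (the core step of SMOTE)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    # indices of the k nearest neighbours of each minority point
    nn = np.argsort(d, axis=1)[:, :k]
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                      # random minority point
        j = nn[i, rng.integers(min(k, n - 1))]   # one of its neighbours
        gap = rng.random()                       # interpolation factor in [0, 1]
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)
```

Because each synthetic point lies on the segment between two real minority samples, it can fall in regions where the true minority distribution has no support, which is the noise-introduction problem the passage describes.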
Undersampling reduces the number of majority-class samples by deleting some majority-class points; typical examples are random undersampling and the OSS (One Side Selection) algorithm. OSS divides the majority-class samples into noise, boundary, redundant, and safe samples, and removes the noise and boundary points using the Tomek-links technique to reduce the majority-class sample count. Because it removes sample points, undersampling lowers algorithmic complexity and shortens training time; however, deleting majority-class samples risks losing representative majority-class information and shifting the classification surface.
The content of the invention
The main object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an imbalanced-data classification method based on cluster-based undersampling that improves the recognition rate of the minority-class samples while preserving majority-class classification accuracy, thereby improving classification performance on imbalanced data sets.
The principle of the invention is as follows. The support vector machine is a classifier that depends heavily on its support vectors; exploiting this key property, the invention proposes an imbalanced-data classification method based on cluster-based undersampling. First, the majority class is divided into different clusters by the clustering-by-fast-search-and-find-of-density-peaks algorithm. Then each majority-class cluster is combined with the minority-class sample points to build a training set; a support vector machine is trained on it to obtain that cluster's support vectors. All support vectors of all clusters are retained and the non-support-vectors are deleted, yielding a new set of majority-class sample points and thus a relatively balanced data set. Finally a support vector machine is used to classify the newly obtained data set.
The present invention uses following technical scheme:
An imbalanced-data classification method based on cluster-based undersampling, comprising the following steps:
(1) Divide the imbalanced data set into a training set and a cross-validation set;
(2) Extract the majority-class samples and the minority-class samples from the training set;
(3) Cluster the majority-class samples of the training set with the clustering-by-fast-search-and-find-of-density-peaks algorithm, obtaining a clustering result that divides the majority-class samples of the training set into N clusters;
(4) Combine each cluster of majority-class samples in the training set with the minority-class samples of the training set to form a new sample set, classify it with a support vector machine, and obtain the support vectors of the majority-class samples in the training set;
(5) Gather the support vectors extracted from every cluster together with the minority-class samples of the training set to form a new training set;
(6) Train a support vector machine on the new training set and evaluate performance with the cross-validation set.
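The six steps above can be sketched end to end as follows (a minimal illustration using scikit-learn; k-means stands in for the density-peaks clustering of step (3), labels are assumed to be 0 for the majority class and 1 for the minority class, and the function name is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans   # stand-in for density-peaks clustering
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def cluster_undersample_svm(X, y, n_clusters=3, random_state=0):
    """Steps (1)-(6): split, cluster the majority class, keep only the
    majority-class support vectors of each (cluster + minority) SVM, and
    train the final classifier on the reduced training set.
    Assumes label 0 = majority class, label 1 = minority class."""
    # (1) split into training and cross-validation parts
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=random_state)
    # (2) separate majority and minority samples
    X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
    # (3) cluster the majority class into N clusters
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X_maj)
    # (4) per cluster: train an SVM on cluster + minority samples and
    #     keep only the majority-class support vectors
    kept = []
    for c in range(n_clusters):
        Xc = X_maj[labels == c]
        Xs = np.vstack([Xc, X_min])
        ys = np.r_[np.zeros(len(Xc)), np.ones(len(X_min))]
        svm = SVC(kernel="rbf", C=1.0).fit(Xs, ys)
        sv = svm.support_[svm.support_ < len(Xc)]  # indices into Xc only
        kept.append(Xc[sv])
    # (5) new training set: all retained support vectors + minority samples
    X_new = np.vstack(kept + [X_min])
    y_new = np.r_[np.zeros(sum(len(k) for k in kept)), np.ones(len(X_min))]
    # (6) train the final SVM; evaluate on the held-out part
    clf = SVC(kernel="rbf", C=1.0).fit(X_new, y_new)
    return clf, (X_te, y_te)
```

The reduced training set contains only the majority points that influenced some cluster's decision surface, which is the mechanism the steps above describe.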
Further, in step (1), the ratio of the training set to the cross-validation set can be allocated as needed; ten-fold cross-validation is typical, i.e. the data set is divided into ten parts, of which nine are used as the training set and one as the test set.
Further, in step (3), the clustering algorithm is implemented as follows: 1) compute the local density ρ_i of each majority-class sample point x_i from its definition; 2) sort the points by ρ_i in descending order; 3) for each point in that order, compute its distance δ_i from the nearest-higher-density-point distance formula; 4) draw the decision graph of ρ_i against δ_i and select the cluster centers, taken as the sample points for which both ρ_i and δ_i are large; 5) once the cluster centers are obtained, assign the remaining points to the clusters of their centers. The local density is defined as ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, d_ij is the distance from majority-class sample point x_i to the other points, and d_c is a distance threshold. The nearest-higher-density-point distance is defined as δ_i = min_{j: ρ_j > ρ_i} d_ij, i.e. the distance from x_i to the closest sample point whose density is higher than that of x_i.
Further, in step (4), when obtaining the support vectors of each cluster of majority-class samples in the training set, the number of support vectors can be controlled by adjusting the penalty parameter C and the kernel parameter of the support vector machine. The support vectors play the decisive role in a support vector machine's classification and carry the important information of the majority class; retaining each cluster's support vectors therefore retains the most informative majority-class samples, while the majority-class points that are not support vectors are discarded, achieving the goal of reducing the number of majority-class sample points.
Further, in step (5), preferably, the union of the support vectors of all clusters should be close in size to the number of minority-class samples in the training set.
Further, in step (6), preferably, classification performance can be assessed with the geometric mean accuracy G-mean and with the F-measure, the harmonic mean of the minority class's precision and recall.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The complexity of the support vector machine is O(N³), where N is the number of training samples. The undersampling method used by the invention reduces the training-sample scale; compared with conventional oversampling methods (such as random oversampling and the SMOTE algorithm), it shortens training time and better suits the current trend towards big data.
(2) The technical scheme provided by the invention exploits the decisive role that support vectors play in a support vector machine classifier: the majority class is divided into several clusters by a clustering algorithm, and from each cluster the support vectors, the samples decisive for classification, are extracted and retained as the informative samples. Compared with other undersampling methods (such as random undersampling and the OSS algorithm), this better preserves the information of the majority-class samples.
(3) In the invention, all of the training set's relatively few minority-class samples take part in the classification training process, which guarantees the minority class's role in classification, raises the contribution of the minority-class samples, and strengthens the classifier's overall performance.
Brief description of the drawings
Fig. 1 is a block diagram of the implementation of the method of the invention;
Fig. 2 is a schematic diagram of the relations among the original support-vector classification surface, the ideal classification surface, the two classes of sample points, and the support vectors in the invention;
Fig. 3 is a schematic diagram of the three-way relation among the support-vector classification surface after cluster-based undersampling, the two classes of sample points, and the support vectors in the invention;
Fig. 4 compares the F-measure values of the data sets under different methods;
Fig. 5 compares the G-mean values of the data sets under different methods.
Embodiment
The present invention is described in further detail below with reference to embodiments and the accompanying drawings, but the embodiments of the invention are not limited thereto. To avoid obscuring the invention, common techniques such as support vector machine theory are not described in detail.
The specific implementation steps of the imbalanced-data classification method based on cluster-based undersampling provided by the invention are as follows:
(1) Divide the imbalanced data set into a training set and a cross-validation set, written D = Tr ∪ Te, where D is the imbalanced data set, Tr is the training set, and Te denotes the cross-validation set. The ratio of the training set to the cross-validation set can be allocated as needed; ten-fold cross-validation is typical, i.e. the data set is divided into ten parts, of which nine are used as the training set and one as the test set.
(2) Extract the majority-class samples Ma and the minority-class samples Mi from the training set Tr.
(3) Cluster the majority-class samples of the training set with the clustering-by-fast-search-and-find-of-density-peaks algorithm, obtaining a clustering result that divides the majority-class samples of the training set into N clusters.
The clustering-by-fast-search-and-find-of-density-peaks algorithm rests on two assumptions: 1) a cluster center is surrounded by neighbor points with lower local density, and 2) it lies at a relatively large distance from any point with higher density.
The local density is defined as ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, d_ij is the distance from majority-class sample point x_i to the other points, and d_c is a distance threshold; in this example, d_c is taken as 0.01.
The nearest-higher-density-point distance is defined as δ_i = min_{j: ρ_j > ρ_i} d_ij, i.e. the distance from sample point x_i to the closest sample point whose density is higher than that of x_i.
The clustering algorithm in the invention is implemented as follows:
1) compute the local density ρ_i of every point from its definition;
2) sort the points by ρ_i in descending order;
3) for each point in that order, compute its distance δ_i from the nearest-higher-density-point distance formula;
4) draw the decision graph of ρ_i against δ_i and select the cluster centers;
5) once the cluster centers are obtained, assign the remaining points to the clusters of their centers.
The clustering algorithm involved in the invention only needs to compute the distances once; no iterative computation is required.
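A minimal NumPy sketch of steps 1)-5), under the assumption that the cluster centers are taken automatically as the points with the largest ρ_i·δ_i product rather than read off a decision graph by hand:

```python
import numpy as np

def density_peaks(X, d_c, n_clusters):
    """Density-peaks clustering: one distance computation, no iteration."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # 1) local density: rho_i = sum_j chi(d_ij - d_c), chi(x) = 1 if x < 0
    rho = (d < d_c).sum(axis=1) - 1          # "- 1" excludes the point itself
    # 2) sort by density, descending
    order = np.argsort(-rho)
    # 3) delta_i: distance to the nearest point of higher (or equal, earlier-
    #    ranked) density; the densest point gets the maximum distance
    delta = np.full(n, d.max())
    nn_higher = np.arange(n)
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]                # all points denser than point i
        j = higher[np.argmin(d[i, higher])]
        delta[i], nn_higher[i] = d[i, j], j
    # 4) cluster centers: the points with the largest rho * delta
    centers = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # 5) assign the rest, in decreasing density, to the cluster of their
    #    nearest denser neighbour
    for i in order:
        if labels[i] < 0:
            labels[i] = labels[nn_higher[i]]
    return labels
```

The single pairwise-distance computation up front and the absence of any convergence loop are what the sentence above refers to.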
(4) Combine each cluster of majority-class samples in the training set with the minority-class samples of the training set to form a new sample set, classify it with a support vector machine, and obtain the support vectors of the majority-class samples in the training set. The support vectors play the decisive role in a support vector machine's classification and carry the important information of the majority class; retaining each cluster's support vectors retains the most informative majority-class samples, while the majority-class points that are not support vectors are weeded out, reducing the number of majority-class sample points. When obtaining the support vectors of each cluster of majority-class samples in the training set, the number of support vectors can be controlled by adjusting the penalty parameter C and the kernel parameter of the support vector machine.
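The claim that the number of support vectors can be steered through the penalty parameter C (and the kernel parameter) can be checked directly with scikit-learn's `SVC` (synthetic two-class data; the specific parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping classes: majority around 0, minority around 2
X = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(2, 1, (20, 2))])
y = np.r_[np.zeros(80), np.ones(20)]

# count support vectors for increasing penalty C: a small C allows a wide,
# soft margin that keeps many training points as (bounded) support vectors,
# while a large C narrows the margin and keeps fewer
n_sv = {C: len(SVC(kernel="rbf", C=C, gamma=0.5).fit(X, y).support_)
        for C in (0.01, 1.0, 100.0)}
```

A smaller C thus retains more majority points after the undersampling step, while a larger C prunes the clusters more aggressively.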
(5) Gather the support vectors extracted from every cluster together with the minority-class samples of the training set to form a new training set. By the nature of the support vector machine, the number of support vectors is smaller than the number of samples contained in each cluster. Preferably, the union of the support vectors of all clusters should be close in size to the number of minority-class samples in the training set.
(6) Train a support vector machine on the new training set and evaluate performance with the cross-validation set. Preferably, classification performance is assessed with the geometric mean accuracy G-mean and with the F-measure, the harmonic mean of the minority class's precision and recall. Both G-mean and F-measure are built on the confusion matrix. G-mean takes the accuracies of both classes into account and can be used to evaluate the overall classification performance of the system: the larger the G-mean value, the better the system classifies overall. In an imbalanced system, the F-measure is used to evaluate the classification performance on the minority-class samples: the larger the F-measure value, the better the minority-class samples are classified.
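The two evaluation measures can be computed from the confusion matrix as follows (a plain-Python sketch with label 1 taken as the minority, i.e. positive, class):

```python
def gmean_fmeasure(y_true, y_pred):
    """G-mean and minority-class F-measure from the confusion matrix,
    with label 1 as the minority (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0        # minority-class accuracy
    specificity = tn / (tn + fp) if tn + fp else 0.0   # majority-class accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = (recall * specificity) ** 0.5             # geometric mean
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)       # harmonic mean
    return g_mean, f_measure
```

G-mean balances the per-class accuracies, so a classifier that ignores the minority class scores near zero even when its overall accuracy is high.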
The embodiment is illustrated below through a practical scenario.
The embodiment tests two data sets with very different imbalance ratios, both taken from the UCI machine learning repository of the University of California, Irvine. The first is the Haberman's Survival Data Set, which contains the University of Chicago hospital's judgements on the survival or death of breast-cancer patients operated on between 1958 and 1970, i.e. a two-class problem; it has 306 samples, each described by 3 attributes: the patient's age at the time of the operation, the year of the patient's operation, and the number of positive axillary nodes detected. The second data set is Letter Recognition, in which each sample is one of the 26 letters of the English alphabet rendered in black-and-white pixels, i.e. the number of classes is 26; it has 20,000 samples in total, and each letter is converted into 16 numerical features, so the attributes are 16-dimensional. In this embodiment, the details of the two data sets are shown in Table 1, where the imbalance ratio is the ratio of the majority class to the minority class.
Table 1. Data sets

Data set    Samples   Attributes   Majority/Minority   Imbalance ratio
Haberman    306       3            225/81              2.78
Letter      20000     16           19266/734           26.25
It should be noted that, to simplify the experiment, the Letter data set is converted into a two-class problem: the 734 samples of the letter Z are taken as the minority class, and the remaining letters are merged into the majority class.
In the embodiment, the data set is divided randomly while ensuring that the imbalance ratio does not change in the partitioning; the training set is 90% of the full sample set and the test set is 10%.
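A random split that keeps the imbalance ratio unchanged can be done by drawing the test fraction from each class separately (a minimal NumPy sketch; the function name is illustrative):

```python
import numpy as np

def stratified_split(X, y, test_frac=0.1, rng=None):
    """Random 90/10 split that preserves the imbalance ratio by sampling
    the test fraction from each class separately."""
    rng = np.random.default_rng(rng)
    test_idx = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        # take the same fraction of every class into the test set
        test_idx.extend(idx[:max(1, int(round(test_frac * len(idx))))])
    mask = np.zeros(len(y), bool)
    mask[test_idx] = True
    return X[~mask], y[~mask], X[mask], y[mask]
```

Because each class contributes the same fraction, the majority/minority ratio in both resulting parts matches the full data set's ratio.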
The embodiment compares the method proposed by the invention with direct support-vector-machine classification (SVM) and with support-vector-machine classification after random undersampling (RUS+SVM). The experimental results are as follows:
Table 2. F-measure values of the data sets under different methods

Data set    SVM     RUS+SVM   Proposed method
Haberman    0.627   0.612     0.635
Letter      0.576   0.581     0.594
Table 3. G-mean values of the data sets under different methods

Data set    SVM     RUS+SVM   Proposed method
Haberman    0.683   0.677     0.691
Letter      0.607   0.615     0.627
The results show that the proposed method holds an advantage on data sets of different imbalance ratios: the improvements in F-measure and G-mean indicate that the proposed method not only improves the overall classification performance on imbalanced data but also improves the classification accuracy of the minority class.

Claims (6)

1. An imbalanced-data classification method based on cluster-based undersampling, characterized by comprising the following steps:
(1) dividing the imbalanced data set into a training set and a cross-validation set;
(2) extracting the majority-class samples and the minority-class samples from the training set;
(3) clustering the majority-class samples of the training set with the clustering-by-fast-search-and-find-of-density-peaks algorithm, obtaining a clustering result that divides the majority-class samples of the training set into N clusters;
(4) combining each cluster of majority-class samples in the training set with the minority-class samples of the training set to form a new sample set, classifying it with a support vector machine, and obtaining the support vectors of the majority-class samples in the training set;
(5) gathering the support vectors extracted from every cluster together with the minority-class samples of the training set to form a new training set;
(6) training a support vector machine on the new training set and evaluating performance with the cross-validation set.
2. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (1) the ratio of the training set to the cross-validation set is allocated as needed, using ten-fold cross-validation, i.e. the data set is divided into ten parts, of which nine are used as the training set and one as the test set.
3. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (3) the clustering algorithm is implemented as follows: 1) compute the local density ρ_i of each majority-class sample point x_i from its definition; 2) sort the points by ρ_i in descending order; 3) for each point in that order, compute its distance δ_i from the nearest-higher-density-point distance formula; 4) draw the decision graph of ρ_i against δ_i and select the cluster centers, taken as the sample points with larger ρ_i and δ_i values; 5) assign the remaining sample points to the clusters of their centers; the local density is defined as ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, d_ij is the distance from majority-class sample point x_i to the other points, and d_c is a distance threshold; the nearest-higher-density-point distance is defined as δ_i = min_{j: ρ_j > ρ_i} d_ij.
4. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (4), when obtaining the support vectors of each cluster of majority-class samples in the training set, the number of support vectors is controlled by adjusting the penalty parameter C and the kernel parameter of the support vector machine; the support vectors play the decisive role in the support vector machine's classification and carry the important information of the majority class; retaining the support vectors of each cluster retains the most informative majority-class samples, while the majority-class points that are not support vectors are weeded out, achieving the goal of reducing the number of majority-class sample points.
5. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (5) the union of the support vectors of all clusters should be close in size to the number of minority-class samples in the training set.
6. The imbalanced-data classification method based on cluster-based undersampling of claim 1, characterized in that in step (6) the standard for assessing classification performance is the geometric mean accuracy G-mean together with the F-measure, the harmonic mean of the minority class's precision and recall.
CN201710784810.4A 2017-09-04 2017-09-04 An imbalanced-data classification method based on cluster-based undersampling Pending CN107688831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710784810.4A CN107688831A (en) 2017-09-04 2017-09-04 An imbalanced-data classification method based on cluster-based undersampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710784810.4A CN107688831A (en) 2017-09-04 2017-09-04 An imbalanced-data classification method based on cluster-based undersampling

Publications (1)

Publication Number Publication Date
CN107688831A true CN107688831A (en) 2018-02-13

Family

ID=61155779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710784810.4A Pending CN107688831A (en) 2017-09-04 An imbalanced-data classification method based on cluster-based undersampling

Country Status (1)

Country Link
CN (1) CN107688831A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875365A (en) * 2018-04-22 2018-11-23 北京光宇之勋科技有限公司 A kind of intrusion detection method and intrusion detection detection device
CN108875365B (en) * 2018-04-22 2023-04-07 湖南省金盾信息安全等级保护评估中心有限公司 Intrusion detection method and intrusion detection device
CN108629633A (en) * 2018-05-09 2018-10-09 浪潮软件股份有限公司 A kind of method and system for establishing user's portrait based on big data
US20210158078A1 (en) * 2018-09-03 2021-05-27 Ping An Technology (Shenzhen) Co., Ltd. Unbalanced sample data preprocessing method and device, and computer device
US11941087B2 (en) * 2018-09-03 2024-03-26 Ping An Technology (Shenzhen) Co., Ltd. Unbalanced sample data preprocessing method and device, and computer device
CN109360206A (en) * 2018-09-08 2019-02-19 华中农业大学 Crop field spike of rice dividing method based on deep learning
CN109490704A (en) * 2018-10-16 2019-03-19 河海大学 A kind of Fault Section Location of Distribution Network based on random forests algorithm
CN109783586B (en) * 2019-01-21 2022-10-21 福州大学 Water army comment detection method based on clustering resampling
CN109783586A (en) * 2019-01-21 2019-05-21 福州大学 Waterborne troops's comment detection system and method based on cluster resampling
CN109871901A (en) * 2019-03-07 2019-06-11 中南大学 A kind of unbalanced data classification method based on mixing sampling and machine learning
US11954685B2 (en) 2019-03-07 2024-04-09 Sony Corporation Method, apparatus and computer program for selecting a subset of training transactions from a plurality of training transactions
CN111080442A (en) * 2019-12-21 2020-04-28 湖南大学 Credit scoring model construction method, device, equipment and storage medium
CN113936185A (en) * 2021-09-23 2022-01-14 杭州电子科技大学 Software defect data self-adaptive oversampling method based on local density information

Similar Documents

Publication Publication Date Title
CN107688831A (en) An imbalanced-data classification method based on cluster-based undersampling
Li et al. Adaptive multi-objective swarm fusion for imbalanced data classification
CN105224695B (en) A kind of text feature quantization method and device and file classification method and device based on comentropy
CN104866572B (en) A kind of network short text clustering method
CN108304884A (en) A kind of cost-sensitive stacking integrated study frame of feature based inverse mapping
CN105389480B (en) Multiclass imbalance genomics data iteration Ensemble feature selection method and system
CN105760889A (en) Efficient imbalanced data set classification method
CN108509982A (en) A method of the uneven medical data of two classification of processing
CN106537422A (en) Systems and methods for capture of relationships within information
CN105069470A (en) Classification model training method and device
CN105184316A (en) Support vector machine power grid business classification method based on feature weight learning
CN109284626A (en) Random forests algorithm towards difference secret protection
CN107122382A (en) A kind of patent classification method based on specification
CN107832412B (en) Publication clustering method based on literature citation relation
CN110991653A (en) Method for classifying unbalanced data sets
CN105045913B (en) File classification method based on WordNet and latent semantic analysis
CN106156372A (en) The sorting technique of a kind of internet site and device
Dubey et al. A systematic review on k-means clustering techniques
CN105893876A (en) Chip hardware Trojan horse detection method and system
CN106228554A (en) Fuzzy coarse central coal dust image partition methods based on many attribute reductions
CN105938523A (en) Feature selection method and application based on feature identification degree and independence
CN110533116A (en) Based on the adaptive set of Euclidean distance at unbalanced data classification method
CN105389471A (en) Method for reducing training set of machine learning
CN102629272A (en) Clustering based optimization method for examination system database
Devi et al. A relative evaluation of the performance of ensemble learning in credit scoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180213