CN108229507A - Data classification method and device - Google Patents


Info

Publication number
CN108229507A
CN108229507A (Application CN201611149072.8A)
Authority
CN
China
Prior art keywords
data
classifier
sample data
negative sample
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611149072.8A
Other languages
Chinese (zh)
Inventor
陈新河
李慧芳
赵静
詹文浩
张诺亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN201611149072.8A priority Critical patent/CN108229507A/en
Publication of CN108229507A publication Critical patent/CN108229507A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses a data classification method and device, relating to the field of data analysis. For data with imbalanced positive and negative samples, the present invention divides the negative sample data into multiple sub-classes; the negative sample data of each sub-class are then combined with the positive sample data and the combined set is divided into two classes, yielding multiple classifiers together with the separation of the two classes of data divided by each classifier, from which the weight of each classifier is obtained. Finally, a final classifier is determined from the weights and the classifiers. A classifier whose positive and negative sample data lie closer together is assigned a larger weight, so that in actual classification the minority samples are not treated as outliers of the majority samples and assigned to the majority class. The method of the present invention neither removes nor adds sample data, so it does not lose important information of the sample data and does not cause over-fitting; and because the features of the negative sample data and the classification effect of each classifier are taken into account in the classification process, the overall classification effect on the sample data is effectively improved.

Description

Data classification method and device
Technical field
The present invention relates to the field of data analysis, and in particular to a data classification method and device.
Background technology
In many real-world problems, the positive and negative sample data that can be obtained are imbalanced. For example, among the products inspected daily by a quality inspector, the defect rate is well below the pass rate; in cancer screening, the number of residents with cancer is far smaller than the healthy population. In general, such minority samples are of greater significance for studying the characteristics of the data and are referred to as positive sample data, while the majority samples are referred to as negative sample data.
Traditional classification algorithms reduce the error rate by minimizing a loss function and do not take the data distribution into account, so they tend to be biased toward the majority class. In the worst case, instances of the minority class are treated as outliers of the majority class and ignored.
Existing methods for handling imbalanced positive and negative sample data are mainly under-sampling and over-sampling, which balance the data set by reducing the data of the majority class or increasing the data of the minority class. However, under-sampling discards many data points and can cause the majority class to lose much important information, while over-sampling adds repeated minority samples, which easily causes over-fitting and increases computation time and storage overhead. Neither method classifies data with imbalanced positive and negative samples well.
Summary of the invention
One object of the present invention is to propose a data classification method that improves the classification of data with imbalanced positive and negative samples.
According to one aspect of the present invention, a data classification method is provided, including: dividing sample data into positive sample data and negative sample data, wherein the ratio of the quantity of the negative sample data to the positive sample data is greater than a threshold; dividing the negative sample data into multiple sub-classes according to the similarity between data points of the negative sample data; combining the negative sample data of each sub-class with the positive sample data as one group of training data, obtaining multiple groups of training data; training a support vector machine on each group of training data, obtaining a classifier and the separation of the two classes of data divided by that classifier; determining the weight of each classifier according to the separation of the two classes of data it divides, wherein the smaller the separation of the two classes of data divided by a classifier, the larger the weight of that classifier; determining a final classifier according to the weight of each classifier and each classifier; and classifying test data using the final classifier.
In one embodiment, dividing the negative sample data into multiple sub-classes according to the similarity between data points of the negative sample data includes: determining the number of sub-classes into which the negative sample data are divided according to the ratio of the quantities of the negative and positive sample data; and dividing the negative sample data into the determined number of sub-classes using a clustering algorithm according to the similarity between data points of the negative sample data.
In one embodiment, a classifier is the optimal separating hyperplane expression obtained by support vector machine training, and the separation of the two classes of data divided by each classifier is the maximum margin of that classifier.
In one embodiment, determining the final classifier according to the weight of each classifier and each classifier includes: computing a weighted sum of the optimal separating hyperplane expressions of the classifiers according to their weights, obtaining the optimal separating hyperplane expression of the final classifier.
In one embodiment, the weight of a classifier is the reciprocal of the maximum margin of that classifier.
According to another aspect of the present invention, a data classification device is provided, including: a positive/negative sample division module for dividing sample data into positive sample data and negative sample data, wherein the ratio of the quantity of the negative sample data to the positive sample data is greater than a threshold; a negative sample division module for dividing the negative sample data into multiple sub-classes according to the similarity between data points of the negative sample data; a training data generation module for combining the negative sample data of each sub-class with the positive sample data as one group of training data, obtaining multiple groups of training data; a classifier training module for training a support vector machine on each group of training data, obtaining a classifier and the separation of the two classes of data divided by that classifier; a classifier weight determination module for determining the weight of each classifier according to the separation of the two classes of data it divides, wherein the smaller the separation of the two classes of data divided by a classifier, the larger the weight of that classifier; a final classifier determination module for determining a final classifier according to the weight of each classifier and each classifier; and a data classification module for classifying test data using the final classifier.
In one embodiment, the negative sample division module is configured to determine the number of sub-classes into which the negative sample data are divided according to the ratio of the quantities of the negative and positive sample data, and to divide the negative sample data into the determined number of sub-classes using a clustering algorithm according to the similarity between data points of the negative sample data.
In one embodiment, a classifier is the optimal separating hyperplane expression obtained by support vector machine training, and the separation of the two classes of data divided by each classifier is the maximum margin of that classifier.
In one embodiment, the final classifier determination module is configured to compute a weighted sum of the optimal separating hyperplane expressions of the classifiers according to their weights, obtaining the optimal separating hyperplane expression of the final classifier.
In one embodiment, the weight of a classifier is the reciprocal of the maximum margin of that classifier.
For the negative sample data in a set of data with imbalanced positive and negative samples, the present invention divides the negative sample data into multiple sub-classes, so that the negative sample data in each sub-class are few relative to the total quantity of negative sample data and each sub-class represents one type of negative sample data. The negative sample data of each sub-class are then combined with the positive sample data and divided into two classes, yielding multiple classifiers and the separation of the two classes of data divided by each classifier, from which the weight of each classifier is obtained. Finally, the final classifier is determined from the weights and the classifiers; a classifier whose positive and negative sample data lie closer together is given a larger weight, so that in actual classification the minority samples are not treated as outliers of the majority samples and assigned to the majority class. The method of the present invention neither removes nor adds sample data, does not lose important information of the sample data, and does not cause over-fitting; because the features of the negative sample data and the classification effect of each classifier are taken into account in the classification process, the overall classification effect on the sample data is effectively improved.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings.
Description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 shows a schematic flow chart of the data classification method of one embodiment of the present invention.
Fig. 2 shows a schematic diagram of the data classification method of another embodiment of the present invention.
Fig. 3 shows a schematic structural diagram of the data classification device of one embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention or its application or use. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the prior art, when classifying data with imbalanced positive and negative samples, under-sampling discards many data points and causes the majority class to lose much important information, while over-sampling adds repeated minority samples, which easily causes over-fitting and increases computation time and storage overhead; neither method classifies such data well. This scheme is proposed to address that problem.
The data classification method of the present invention is described with reference to Fig. 1 and Fig. 2.
Fig. 1 is a flow chart of an embodiment of the data classification method of the present invention. As shown in Fig. 1, the method of this embodiment includes:
In step S102, the sample data are divided into positive sample data and negative sample data.
In the field of data classification, the class of data that is more valuable for data analysis and classification is referred to as positive sample data. Positive sample data are usually scarce, so it is necessary to judge whether the ratio of the quantity of negative sample data to positive sample data exceeds a threshold; the method of this scheme is more effective when the negative samples reach a certain proportion relative to the positive samples. For example, in credit card fraud detection, the collected sample data include both normal-transaction data and fraud data for the subsequent classifier model training process. The amount of normal-transaction data is generally far larger than the amount of fraud data, while the fraud data, which clearly identify fraudulent behavior, have more research value. Therefore, the fraud data are taken as positive sample data and the normal-transaction data as negative sample data.
As shown in Fig. 2, the small shaded rectangle at the top represents the positive sample data with the smaller data volume, and the larger blank rectangle below represents the negative sample data with the larger data volume.
In step S104, the negative sample data are divided into multiple sub-classes according to the similarity between data points of the negative sample data.
Specifically, the number of sub-classes into which the negative sample data are divided is determined from the ratio of the quantities of the negative and positive sample data. For example, if the quantity of negative sample data is F and the quantity of positive sample data is T, the ratio of the quantities is F/T, and rounding gives K = [F/T], the number of sub-classes into which the negative sample data are divided. Then, a clustering algorithm divides the negative sample data into K sub-classes according to the similarity between data points of the negative sample data; the clustering algorithm is, for example, the K-means algorithm, the K-medoids algorithm, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), or the EM algorithm (Expectation-Maximization Algorithm), among others.
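By way of illustration only (the following sketch is not part of the patent disclosure), the sub-class count K = [F/T] and the clustering of step S104 might look as follows in Python, assuming scikit-learn's KMeans; the function and variable names are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def split_negative_into_subclasses(X_neg, n_pos, random_state=0):
    """Divide the negative samples into K = [F / T] sub-classes by clustering."""
    F = len(X_neg)                 # quantity of negative sample data
    K = max(1, F // n_pos)         # number of sub-classes, K = [F / T]
    labels = KMeans(n_clusters=K, n_init=10,
                    random_state=random_state).fit_predict(X_neg)
    return K, labels
```

With, say, 60 negative and 10 positive samples, this yields K = 6 sub-classes, each negative point receiving a sub-class label in 0..5.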
A clustering algorithm classifies data based on the similarity between samples; the similarity of sample data can be measured, for example, by the distance between data points or by a similarity metric. Taking the K-means algorithm as an example, dividing the negative sample data into K sub-classes is briefly described as follows:
(1) K = [F/T] objects are randomly selected from the negative sample data as the initial cluster centers; (2) each sample in the negative sample data set is assigned to the closest cluster according to the minimum-distance principle; (3) the K cluster centers are recalculated from the clustering result and taken as the new cluster centers; (4) steps (2) and (3) are repeated until the cluster centers no longer change or the change is smaller than a predetermined threshold. Since clustering algorithms are common, the other clustering algorithms are not described here.
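The four enumerated steps can be sketched directly in NumPy (illustrative only, not part of the disclosure; the tolerance, iteration cap, and empty-cluster guard are assumptions not stated in the text):

```python
import numpy as np

def kmeans_subclasses(X, K, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) randomly select K objects from the data as initial cluster centers
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # (2) assign each sample to the closest center (minimum-distance principle)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # (3) recalculate the K cluster centers from the clustering result
        new_centers = centers.copy()
        for k in range(K):
            members = X[assign == k]
            if len(members):
                new_centers[k] = members.mean(axis=0)
        # (4) stop when the centers change less than a predetermined threshold
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return assign, centers
```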
Using a clustering algorithm, the negative sample data can be divided into sub-classes with different characteristics, so that when each sub-class is combined with the positive sample data for classification, the difference between the negative sample data of that sub-class and the positive sample data is effectively reflected, and the resulting classifier can separate negative sample data of that characteristic from the positive sample data. If the negative sample data were divided into sub-classes arbitrarily, the data characteristics of the sub-classes would not differ noticeably, so that when each sub-class is combined with the positive sample data for classification, the resulting classifiers would differ little and the classification effect could not be effectively improved. Therefore, first dividing the negative sample data into multiple sub-classes with a clustering algorithm helps improve the classification effect.
As shown in Fig. 2, the blank rectangle representing the negative sample data with the larger data volume is divided into multiple small rectangles.
In step S106, the negative sample data of each sub-class are combined with the positive sample data as one group of training data, obtaining multiple groups of training data.
For example, the negative sample data are divided into K sub-classes, and the negative sample data of each sub-class are combined with the positive sample data as one group of training data, giving K groups of training data.
As shown in Fig. 2, each small rectangle of negative sample data is merged with the shaded rectangle of positive sample data, forming multiple rectangles that contain both kinds of data.
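Forming the K groups of training data of step S106 might be sketched as follows (illustrative only, not part of the disclosure; the label convention y = +1 for positive and y = -1 for negative samples is an assumption):

```python
import numpy as np

def build_training_groups(X_pos, X_neg, sub_labels, K):
    """Combine each negative sub-class with all positive samples into one group."""
    groups = []
    for k in range(K):
        X_neg_k = X_neg[sub_labels == k]
        X_k = np.vstack([X_pos, X_neg_k])
        y_k = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg_k))])
        groups.append((X_k, y_k))
    return groups
```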
In step S108, a support vector machine is trained on each group of training data, obtaining a classifier and the separation of the two classes of data divided by that classifier.
An SVM (Support Vector Machine) can find a separating hyperplane in n-dimensional space that divides the data points of the space into two classes. The hyperplane of a classifier is expressed as f(x) = W^T x + b, and setting f(x) = 0 gives the optimal separating hyperplane of the classifier. For example, training the K groups of training data gives K classifiers, and classifier i is the optimal separating hyperplane expression obtained by SVM training, f_i(x) = W_i^T x + b_i, where x is a data point in the space, W_i and b_i are the parameters of the optimal separating hyperplane, and 1 ≤ i ≤ K indexes the i-th classifier. The separation of the two classes of data divided by a classifier can be represented by the maximum margin L_i of that classifier, i.e., the distance between the support vectors on either side of the optimal separating hyperplane, 1 ≤ i ≤ K. The larger L_i is, the more widely separated the two classes of data divided by the SVM classifier are.
As shown in Fig. 2, each group of data is divided into two classes by an SVM; the circles and crosses represent the two separated classes, and the line between them is the optimal separating hyperplane of the classifier.
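Training one SVM of step S108 and reading off W_i, b_i, and the maximum margin L_i = 2 / ||W_i|| might be sketched as follows, assuming scikit-learn's SVC with a linear kernel (illustrative only, not part of the disclosure):

```python
import numpy as np
from sklearn.svm import SVC

def train_group(X_k, y_k, C=1e6):
    """Train one linear SVM; return its hyperplane (W_i, b_i) and margin L_i."""
    clf = SVC(kernel="linear", C=C).fit(X_k, y_k)
    W = clf.coef_.ravel()            # parameters of the optimal separating hyperplane
    b = clf.intercept_[0]
    L = 2.0 / np.linalg.norm(W)      # maximum margin: distance between support vectors
    return W, b, L
```

On a 1-D toy set with classes at x = -1 and x = +1 and a large C (near hard-margin), the hyperplane is x = 0 and the margin L is 2, the gap between the closest points of the two classes.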
In step S110, the weight of each classifier is determined according to the separation of the two classes of data divided by that classifier.
Here, the smaller the separation of the two classes of data divided by a classifier, i.e., the smaller L_i is, the larger the weight of classifier i. A small separation between the two classes means they are hard to distinguish, and in actual classification such closely spaced classes must also be distinguished accurately; therefore, such a classifier is given a larger weight. Specifically, 1/L_i can be set as the weight of classifier i; other weighting schemes can also be used as required.
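The weighting rule w_i = 1/L_i described above can be sketched as follows (illustrative only, not part of the disclosure; names are hypothetical):

```python
import numpy as np

def classifier_weights(margins):
    """Weight of classifier i is the reciprocal of its maximum margin L_i,
    so a smaller margin (more closely spaced classes) gives a larger weight."""
    return 1.0 / np.asarray(margins, dtype=float)
```

Margins of 0.5, 1.0, and 2.0 yield weights 2.0, 1.0, and 0.5, so the classifier whose two classes lie closest contributes most to the combination.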
In step S112, the final classifier is determined according to the weight of each classifier and each classifier.
Specifically, a weighted sum of the optimal separating hyperplane expressions of the classifiers is computed according to their weights, obtaining the optimal separating hyperplane expression of the final classifier.
As shown in Fig. 2, the optimal separating hyperplanes are combined by a weighted sum, and the resulting final separating hyperplane separates the positive sample data from the negative sample data in the sample data.
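The weighted summation of step S112 might be sketched as follows (illustrative only, not part of the disclosure; the convention that f(x) > 0 denotes the positive class is an assumption):

```python
import numpy as np

def combine_hyperplanes(Ws, bs, weights):
    """Weighted sum of K separating hyperplanes (W_i, b_i) into a final one."""
    W_final = sum(a * np.asarray(W, dtype=float) for a, W in zip(weights, Ws))
    b_final = float(np.dot(weights, bs))
    return W_final, b_final

def classify(W, b, X):
    """Classify test points with the final hyperplane f(x) = W^T x + b."""
    return np.sign(np.asarray(X) @ W + b)
```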
In step S114, test data are classified using the final classifier.
The final classifier obtained by training can classify new test data. For example, in credit card fraud detection, a final classifier that separates fraudulent behavior from normal behavior is obtained through steps S102 to S114; when new transaction data arrive, inputting them into the final classifier determines whether the new transaction data represent fraudulent behavior.
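Putting steps S102 to S114 together on toy imbalanced data (illustrative only, not part of the disclosure; scikit-learn is assumed and all names are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=0.0, scale=0.4, size=(10, 2))                # minority (positive)
X_neg = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2))
                   for c in ((4.0, 0.0), (0.0, 4.0), (4.0, 4.0))])  # majority (negative)

# S104: divide the negative data into K = [F / T] sub-classes
K = len(X_neg) // len(X_pos)
sub = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X_neg)

# S106-S110: one linear SVM per sub-class, weight w_i = 1 / L_i
Ws, bs, weights = [], [], []
for k in range(K):
    X_k = np.vstack([X_pos, X_neg[sub == k]])
    y_k = np.concatenate([np.ones(len(X_pos)), -np.ones((sub == k).sum())])
    clf = SVC(kernel="linear", C=10.0).fit(X_k, y_k)
    W, b = clf.coef_.ravel(), clf.intercept_[0]
    L = 2.0 / np.linalg.norm(W)        # maximum margin of classifier k
    Ws.append(W); bs.append(b); weights.append(1.0 / L)

# S112: weighted sum of the K separating hyperplanes
W_f = sum(a * W for a, W in zip(weights, Ws))
b_f = float(np.dot(weights, bs))

# S114: classify test data with the final classifier
def predict(X):
    return np.sign(np.asarray(X) @ W_f + b_f)
```

A point near the positive cluster at the origin lands on the positive side of the final hyperplane, while a point deep inside a negative cluster lands on the negative side.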
In the method of the above embodiment, the negative sample data in a set of data with imbalanced positive and negative samples are divided into multiple sub-classes, so that the negative sample data in each sub-class are few relative to the total quantity of negative sample data and each sub-class represents one type of negative sample data. The negative sample data of each sub-class are then combined with the positive sample data and divided into two classes, yielding multiple classifiers and the separation of the two classes of data divided by each classifier, from which the weight of each classifier is obtained. Finally, the final classifier is determined from the weights and the classifiers; a classifier whose positive and negative sample data lie closer together is given a larger weight, so that in actual classification even closely spaced classes are distinguished accurately and the minority samples are not treated as outliers of the majority samples and assigned to the majority class. The method of the present invention neither removes nor adds sample data, does not lose important information of the sample data, and does not cause over-fitting; because the features of the negative sample data and the classification effect of each classifier are taken into account in the classification process, the overall classification effect on the sample data is effectively improved.
The present invention further provides a data classification device, which is described with reference to Fig. 3.
Fig. 3 is a structural diagram of an embodiment of the data classification device of the present invention. As shown in Fig. 3, the device 30 includes:
a positive/negative sample division module 302 for dividing sample data into positive sample data and negative sample data, wherein the ratio of the quantity of the negative sample data to the positive sample data is greater than a threshold;
a negative sample division module 304 for dividing the negative sample data into multiple sub-classes according to the similarity between data points of the negative sample data.
Specifically, the negative sample division module 304 is configured to determine the number of sub-classes into which the negative sample data are divided according to the ratio of the quantities of the negative and positive sample data, and to divide the negative sample data into the determined number of sub-classes using a clustering algorithm according to the similarity between data points of the negative sample data.
a training data generation module 306 for combining the negative sample data of each sub-class with the positive sample data as one group of training data, obtaining multiple groups of training data;
a classifier training module 308 for training a support vector machine on each group of training data, obtaining a classifier and the separation of the two classes of data divided by that classifier.
Here, the classifier is the optimal separating hyperplane expression obtained by support vector machine training, and the separation of the two classes of data divided by each classifier is the maximum margin of that classifier.
a classifier weight determination module 310 for determining the weight of each classifier according to the separation of the two classes of data divided by each classifier.
Here, the smaller the separation of the two classes of data divided by a classifier, the larger the weight of that classifier; for example, the weight of a classifier is the reciprocal of the maximum margin of that classifier.
a final classifier determination module 312 for determining the final classifier according to the weight of each classifier and each classifier.
Specifically, the final classifier determination module 312 is configured to compute a weighted sum of the optimal separating hyperplane expressions of the classifiers according to their weights, obtaining the optimal separating hyperplane expression of the final classifier.
a data classification module 314 for classifying test data using the final classifier.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A data classification method, characterized by comprising:
dividing sample data into positive sample data and negative sample data, wherein the ratio of the quantity of the negative sample data to the positive sample data is greater than a threshold;
dividing the negative sample data into multiple sub-classes according to the similarity between data points of the negative sample data;
combining the negative sample data of each sub-class with the positive sample data as one group of training data, obtaining multiple groups of training data;
training a support vector machine on each group of training data, obtaining a classifier and the separation of the two classes of data divided by the classifier;
determining the weight of each classifier according to the separation of the two classes of data divided by each classifier, wherein the smaller the separation of the two classes of data divided by a classifier, the larger the weight of that classifier;
determining a final classifier according to the weight of each classifier and each classifier; and
classifying test data using the final classifier.
2. The method according to claim 1, characterized in that
dividing the negative sample data into multiple sub-classes according to the similarity between data points of the negative sample data comprises:
determining the number of sub-classes into which the negative sample data are divided according to the ratio of the quantities of the negative sample data and the positive sample data; and
dividing the negative sample data into the determined number of sub-classes using a clustering algorithm according to the similarity between data points of the negative sample data.
3. The method according to claim 1, characterized in that
the classifier is the optimal separating hyperplane expression obtained by support vector machine training; and
the separation of the two classes of data divided by each classifier is the maximum margin of that classifier.
4. The method according to claim 3, characterized in that
determining a final classifier according to the weight of each classifier and each classifier comprises:
computing a weighted sum of the optimal separating hyperplane expressions of the classifiers according to their weights, obtaining the optimal separating hyperplane expression of the final classifier.
5. The method according to claim 3, characterized in that
the weight of the classifier is the reciprocal of the maximum margin of the classifier.
6. A data classification device, characterized by comprising:
a positive/negative sample division module for dividing sample data into positive sample data and negative sample data, wherein the ratio of the quantity of the negative sample data to the positive sample data is greater than a threshold;
a negative sample division module for dividing the negative sample data into multiple sub-classes according to the similarity between data points of the negative sample data;
a training data generation module for combining the negative sample data of each sub-class with the positive sample data as one group of training data, obtaining multiple groups of training data;
a classifier training module for training a support vector machine on each group of training data, obtaining a classifier and the separation of the two classes of data divided by the classifier;
a classifier weight determination module for determining the weight of each classifier according to the separation of the two classes of data divided by each classifier, wherein the smaller the separation of the two classes of data divided by a classifier, the larger the weight of that classifier;
a final classifier determination module for determining a final classifier according to the weight of each classifier and each classifier; and
a data classification module for classifying test data using the final classifier.
7. The device according to claim 6, characterized in that
the negative sample division module is configured to determine the number of sub-classes into which the negative sample data are divided according to the ratio of the quantities of the negative sample data and the positive sample data, and to divide the negative sample data into the determined number of sub-classes using a clustering algorithm according to the similarity between data points of the negative sample data.
8. The device according to claim 6, characterized in that
the classifier is the optimal separating hyperplane expression obtained by support vector machine training; and
the separation of the two classes of data divided by each classifier is the maximum margin of that classifier.
9. The device according to claim 8, characterized in that
the final classifier determination module is configured to compute a weighted sum of the optimal separating hyperplane expressions of the classifiers according to their weights, obtaining the optimal separating hyperplane expression of the final classifier.
10. The device according to claim 8, wherein:
The weight of a classifier is the reciprocal of the maximum separation margin of that classifier.
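Combining claims 8 and 10: with margin 2/‖w‖, the weight 1/margin works out to ‖w‖/2, so classifiers whose two classes sit closer together contribute more to the final classifier. A one-line sketch (function name hypothetical):

```python
import numpy as np

def classifier_weight(w):
    # Claim 10: weight = 1 / (maximum separation margin) = 1 / (2 / ||w||) = ||w|| / 2.
    return np.linalg.norm(w) / 2.0
```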
CN201611149072.8A 2016-12-14 2016-12-14 Data classification method and device Pending CN108229507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611149072.8A CN108229507A (en) 2016-12-14 2016-12-14 Data classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611149072.8A CN108229507A (en) 2016-12-14 2016-12-14 Data classification method and device

Publications (1)

Publication Number Publication Date
CN108229507A true CN108229507A (en) 2018-06-29

Family

ID=62638197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611149072.8A Pending CN108229507A (en) 2016-12-14 2016-12-14 Data classification method and device

Country Status (1)

Country Link
CN (1) CN108229507A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7421417B2 (en) * 2003-08-28 2008-09-02 Wisconsin Alumni Research Foundation Input feature and kernel selection for support vector machine classification
CN101901345A (en) * 2009-05-27 2010-12-01 复旦大学 Classification method of differential proteomics
CN103995821A (en) * 2014-03-14 2014-08-20 盐城工学院 Selective clustering integration method based on spectral clustering algorithm
CN104573708A (en) * 2014-12-19 2015-04-29 天津大学 Ensemble-of-under-sampled extreme learning machine
CN104809226A (en) * 2015-05-07 2015-07-29 武汉大学 Method for early classifying imbalance multi-variable time sequence data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
汪洪桥, 蔡艳宁, 王仕成, 付光远, 孙富春: "Multiple Kernel Methods for Pattern Analysis and Their Applications", 31 March 2014, National Defense Industry Press *
陈瑞雪: "Research on Support Vector Machine Classification Methods for Imbalanced Data", China Master's Theses Full-text Database (electronic journal) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272056A (en) * 2018-10-30 2019-01-25 成都信息工程大学 The method of data balancing method and raising data classification performance based on pseudo- negative sample
CN109272056B (en) * 2018-10-30 2021-09-21 成都信息工程大学 Data balancing method based on pseudo negative sample and method for improving data classification performance
CN109670971A (en) * 2018-11-30 2019-04-23 平安医疗健康管理股份有限公司 Judgment method, device, equipment and the computer storage medium of abnormal medical expenditure
CN109558543A (en) * 2018-12-11 2019-04-02 拉扎斯网络科技(上海)有限公司 A kind of specimen sample method, specimen sample device, server and storage medium
CN111666872A (en) * 2020-06-04 2020-09-15 电子科技大学 Efficient behavior identification method under data imbalance
CN111666872B (en) * 2020-06-04 2022-08-05 电子科技大学 Efficient behavior identification method under data imbalance

Similar Documents

Publication Publication Date Title
CN109952614B (en) Biological particle classification system and method
CN103632168B (en) Classifier integration method for machine learning
CN103136504B (en) Face identification method and device
CN108229507A (en) Data classification method and device
CN107194803A Device for assessing borrower credit risk in P2P online lending
CN108363810A Text classification method and device
CN107682109B Interference signal classification and identification method for UAV communication systems
CN106326913A (en) Money laundering account determination method and device
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
CN110533116A Adaptive ensemble method for imbalanced data classification based on Euclidean distance
CN109886284A (en) Fraud detection method and system based on hierarchical clustering
CN111861103A (en) Fresh tea leaf classification method based on multiple features and multiple classifiers
CN108629373A (en) A kind of image classification method, system, equipment and computer readable storage medium
CN112633337A (en) Unbalanced data processing method based on clustering and boundary points
CN103177266A (en) Intelligent stock pest identification system
CN104850868A Customer segmentation method based on k-means and neural network clustering
CN110264454A Cervical cancer histopathological image diagnosis method based on multi-hidden-layer conditional random fields
CN106570076A (en) Computer text classification system
CN110046593A Complex power quality disturbance recognition method based on segmentally improved S-transform and random forest
CN109829498A Clustering-based coarse classification method, apparatus, terminal device and storage medium
CN104134073B One-class classification method for remote sensing images based on one-class normalization
CN109359680A Method and device for automatic identification of blasted rock and extraction of fragment-size features
CN108875801A Smart-grid-based load curve classification system
CN105760471B Two-class text classification method based on a combined convex linear perceptron
CN110516741A Dynamic-classifier-selection-based classification method for class-overlapped imbalanced data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180629