CN103500205B - Non-uniform big data classifying method - Google Patents

Non-uniform big data classifying method

Info

Publication number
CN103500205B
CN103500205B (application CN201310452365.3A)
Authority
CN
China
Prior art keywords
class
classifier
classification
data
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310452365.3A
Other languages
Chinese (zh)
Other versions
CN103500205A (en)
Inventor
朱晓峰 (Zhu Xiaofeng)
张师超 (Zhang Shichao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201310452365.3A priority Critical patent/CN103500205B/en
Publication of CN103500205A publication Critical patent/CN103500205A/en
Application granted granted Critical
Publication of CN103500205B publication Critical patent/CN103500205B/en


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)

Abstract

The invention relates to a non-uniform big data classification method for classifying data sets whose classes are unevenly distributed and which cannot be loaded into computer memory. First, the sample size is determined theoretically for a downsampling method, and the number of classifiers is determined by the number of samples. An ensemble classifier is built for each class of the big data. When a test instance is classified, the ensemble classifiers of all classes classify it, and the class whose ensemble classifier achieves the highest classification rate is taken as the class of the test instance. The method's time complexity is linear in the size of the big data, and it reduces the bias of non-uniform big data classification results. Furthermore, the ensemble classifiers improve accuracy. The method is easy to implement, and writing the code involves only some simple mathematical models.

Description

Non-uniform big data classification method
Technical Field
The present invention relates to the field of computer science and technology and to the field of information technology, in particular to big data, and specifically to a processing method for non-uniform big data classification.
Background Art
Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools under existing physical conditions and within an acceptable time. Big data has the following characteristics: Volume (large data volume), Variety (wide variety of data types), Value (low value density), and Velocity (fast processing speed), abbreviated as the 4Vs.
Current big data research generally falls into two broad classes. The first is the challenge big data poses to system architecture. The HADOOP clusters of many well-known web sites currently hold tens of PB of raw data, with redundancy, which must be scanned and updated daily. To guarantee that a single-node or single-rack failure does not affect operation, HADOOP generally uses a 3-replica policy, so the cost of data must be considered in both the time dimension and the space dimension. Therefore, building efficient management mechanisms in which massive numbers of small files coexist with large files, while supporting the storage, management, and access of structured, semi-structured, and unstructured data, are all problems that have to be considered. The second is the challenge of big data knowledge discovery and the challenge big data poses to mining algorithms. The foremost need is scalability of the algorithms. Some classic data mining and machine learning algorithms, such as KNN density estimation, non-parametric Bayes, support vector machines, Gaussian process regression, and hierarchical clustering, have at least quadratic complexity and therefore cannot be applied well in big data mining. This calls for designing more efficient algorithms, i.e., O(n log n) or O(n).
Judging from the existing literature on big data mining, research on learning from big data concentrates mainly on upgrading and improving conventional methods in four areas: classification, clustering, retrieval, and incremental (batch, online, or parallel) learning. At present there is relatively little research on handling non-uniform big data. As with other research on big data knowledge discovery, a big data classification problem must first consider the complexity of the algorithm. Second, applying existing classification algorithms (which assume the classes of the data are uniformly distributed) directly to non-uniform big data easily causes bias, i.e., the classification results favor the large class (the class containing a very large proportion of the instances, e.g., more than 90% in a two-class problem). Finally, when common algorithms are applied to non-uniform (imbalanced) data classification, they usually pursue minimum classification error but ignore the misclassification costs of the non-uniform classes.
Non-uniform big data classification is thus an extremely challenging problem, raising a series of basic questions urgently awaiting solution: where to start, how to use big data for intelligent activities, and so on.
Summary of the Invention
The present invention studies the non-uniform big data classification problem.
It is an object of the invention to provide a simple and effective non-uniform big data classification method that solves both the bias problem that easily arises in big data classification and the high-complexity problem of big data algorithms. That is, the method reaches linear-complexity non-uniform big data classification through downsampling and two-class (one-vs-all) decomposition of the big data, solves the bias problem and improves classification accuracy by ensembling the results of multiple classifiers, and is robust, i.e., noise-tolerant.
The concrete steps of the method are as follows:
(1) Obtain the number of instances of each class of the big data; the classes are denoted m_i, i = 1, 2, ..., M;
(2) For each class m_i, sample out D_i data sets by the downsampling method, where the data volume n_i of each data set is determined by n_i = (t_{a/2} / (2ε))², in which t_{a/2} is the critical value at the chosen confidence level, obtained from the t distribution, and ε is the maximum allowable error. Each class m_i is sampled in this way to obtain its D_i data sets.
(3) For the D_i data sets of each class m_i, build D_i classifiers with the one-vs-all method (i.e., all instances of the current class are the positive class and all instances of the other classes are the negative class); that is, build one classifier per data set.
(4) Perform ensemble learning over the D_i classifiers of each class m_i. According to ensemble learning theory, an ensemble classifier can be formed from multiple meta-classifiers according to an integration principle. All meta-classifiers should classify quickly, be mutually independent, and each have an error rate no higher than 50%. Common classifiers of this kind, such as the nearest neighbor algorithm, the decision tree method, neural network methods, or the forest tree method, meet these requirements. Typical integration principles include bagging, AdaBoost, and selective ensemble. In the invention, the D_i classifiers obtained for each class m_i are ensembled with the forward greedy ensemble method.
(5) Testing: each test instance is classified against every class, and the class whose result has the highest accuracy among the M results obtained is the class of the test instance (a sketch of this voting step follows).
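For illustration only, a minimal Python sketch of this test step (not part of the patent; it assumes each class's ensemble is a list of trained one-vs-all meta-classifiers following the scikit-learn predict convention, and uses the positive-vote rate as the per-class classification rate):

```python
def predict(x, ensembles):
    """Classify one test instance: every class's ensemble votes on its
    one-vs-all question, and the class whose ensemble reports the highest
    positive-vote rate wins (cf. the 85%/89%/90% example in Embodiment 2)."""
    rates = {label: sum(int(clf.predict([x])[0]) for clf in clfs) / len(clfs)
             for label, clfs in ensembles.items()}
    return max(rates, key=rates.get)
```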
The goal of step (2) is to solve the algorithm complexity problem, i.e., to build classifiers from part of the original data, obtained by downsampling, rather than from all of it. To improve classification accuracy, a multiple-sampling strategy is adopted: sampling is repeated several times, each sample size satisfies the above requirement, and the number of sampling rounds is set by the user.
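For illustration, a minimal Python sketch of the sample-size computation n_i = (t_{a/2} / (2ε))²; the use of scipy's normal approximation for the critical value and the function name are assumptions, not part of the patent:

```python
from scipy.stats import norm

def sample_size(confidence, max_error):
    """Worst-case sample size n = (t_{a/2} / (2 * eps))^2, approximating the
    t critical value by the normal one (accurate for the large n used here)."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    return int(round((z / (2 * max_error)) ** 2))

# 95% confidence and 1% error give (1.96 / 0.02)^2 = 9604, matching
# Embodiment 2; Embodiment 1's 16641 corresponds to t_{a/2} = 2.58 at 99%.
print(sample_size(0.95, 0.01))  # 9604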
The concrete steps of the downsampling method of step (2) are as follows:
A. When sampling each class m_i, the sample size is no less than the theoretical requirement above (i.e., (t_{a/2} / (2ε))² instances), and the number of samples is the number of meta-classifiers to be built. In generating one sample for a class, first obtain the number of instances of the current class. The invention treats the current class as class A and refers to all other classes collectively as class non-A. Then the instance counts of class A and class non-A are compared. Denote by #(A), #(~A), #(R), and #(T) the sizes of class A, of class non-A, of the data that computer memory can hold, and of the theoretically required sample. If (#(A) >> #(R)) && (#(A) > #(T)), then extract from class A about as many instances as from class non-A; if (#(~A) >> #(R)) && (#(~A) > #(T)), then extract from class non-A about as many instances as from class A (a sketch of one such balanced draw follows this list).
B. Repeat the above process until D_i samples have been drawn for each class m_i. For simplicity, the invention fixes D_i at n.
C. At this point the whole data set has generated D = M·n samples.
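A minimal sketch of one such balanced draw, assuming numpy arrays X (features) and y (labels); the function name and the half-and-half split are illustrative of the roughly equal extraction described in step A (compare the 10,000 + 10,000 draw of Embodiment 1):

```python
import numpy as np

def draw_balanced_subset(X, y, positive_class, n_required, seed=None):
    """One downsampling round: take about half the subset from the current
    class (A) and half from the union of the other classes (non-A), keeping
    the total no smaller than the theoretical size (t_{a/2}/(2*eps))^2."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == positive_class)
    neg = np.flatnonzero(y != positive_class)
    half = n_required // 2 + 1
    take = np.concatenate([
        rng.choice(pos, size=min(half, pos.size), replace=False),
        rng.choice(neg, size=min(half, neg.size), replace=False),
    ])
    rng.shuffle(take)
    return X[take], y[take]
```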
Through step (2), the invention thus obtains M·n samples, with n samples per class. In step (3) of the invention, n meta-classifiers in total are built for the n samples of each class m_i;
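A sketch of this step, taking scikit-learn's nearest neighbor classifier as the meta-classifier (one of the options the patent names); the helper name and data layout are assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def train_one_vs_all(subsets, positive_class, k=1):
    """Build one meta-classifier per sampled data set, relabeling the current
    class as positive (1) and all other classes as negative (0)."""
    classifiers = []
    for X, y in subsets:  # the data sets drawn for class m_i by downsampling
        y_binary = (np.asarray(y) == positive_class).astype(int)
        classifiers.append(KNeighborsClassifier(n_neighbors=k).fit(X, y_binary))
    return classifiers
```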
Then in step (4) of the invention the n meta-classifiers obtained are combined into one ensemble classifier, i.e., the forward greedy ensemble method is adopted. Its steps are as follows (a sketch follows the list):
D. Build the candidate classifier set CCS = {C_1, ..., C_n} and the selected classifier set SCS = { };
E. Among the classifiers C_i, choose the classifier with the best accuracy, remove it from CCS, and add it to SCS;
F. Add each classifier C_j currently in CCS to SCS and validate; if the classification result exceeds the threshold specified in advance by the user, jump to E and move C_j from CCS to SCS. Otherwise jump to step (5); at that point the learning of the ensemble classifier is complete;
G. Repeat F until CCS is the empty set.
At this point, for the M classes, the invention has built M ensemble classifiers C_i, i = 1, ..., M in total. Each ensemble classifier contains at most n meta-classifiers.
The above steps ensure that few classifiers are retained, which keeps the testing process fairly simple.
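A sketch of steps D-G under stated assumptions (a held-out validation set with 0/1 labels and a user threshold are supplied; candidate classifiers follow the scikit-learn predict convention; all names are illustrative):

```python
def forward_greedy_ensemble(candidates, X_val, y_val, threshold):
    """Steps D-G: seed SCS with the most accurate candidate (E), then keep
    moving candidates from CCS to SCS while adding one lifts the ensemble's
    validation accuracy past the user threshold (F), until CCS is empty (G)."""
    def accuracy(clfs):
        votes = sum(clf.predict(X_val) for clf in clfs)  # majority vote
        return (((votes > len(clfs) / 2).astype(int)) == y_val).mean()

    ccs = list(candidates)                        # step D: candidate set CCS
    best = max(ccs, key=lambda c: accuracy([c]))
    scs = [best]                                  # step E: seed selected set SCS
    ccs.remove(best)
    improved = True
    while ccs and improved:                       # steps F and G
        improved = False
        for clf in list(ccs):
            if accuracy(scs + [clf]) > threshold:
                scs.append(clf)
                ccs.remove(clf)
                improved = True
    return scs
```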
The non-uniform big data classification method implemented through the above steps has the following features. First, because the downsampling method makes the instance counts of the classes as balanced as possible during classification, it effectively avoids the problem of the classification being biased toward the large class. Second, classifying by means of sampling keeps the complexity of the whole classification algorithm at most linear. Third, to prevent sampling from reducing classification accuracy, the invention improves accuracy by two means: the multiple-sampling method and the forward greedy ensemble method.
The invention uses the sampling method to reduce class imbalance and to lower the complexity of the algorithm; it samples multiple times, builds one meta-classifier per sampling round, and uses ensemble learning to combine all the meta-classifiers and improve classification performance.
Sampling big data: it is usually extremely difficult to classify over the whole of a big data set, and even when feasible the complexity is very high. The sampling method makes big data classification feasible and reduces the complexity of classification to linear, which is exactly the result big data mining hopes for.
Sample size and number of samples: the sample size is obtained from theory, guaranteeing that the error between the result after sampling and the result on the original data is minimal. Extracting multiple samples helps improve classification performance.
The one-vs-all classification method has been verified to be a very effective method for handling non-uniform data sets. The invention applies it to non-uniform big data classification, where it solves the non-uniform classification problem on the one hand and the high-complexity problem of big data classification on the other.
Meta-classifiers make classification on large data sets quicker, and ensemble learning can effectively improve the performance of the meta-classifiers. The forward greedy ensemble guarantees that, while improving the meta-classifiers' performance, it also reduces the complexity of the classifier, which is a strong guarantee of linear complexity when processing big data.
Specific Embodiments
Embodiment 1
A given simulated big data set contains 2,000,000 instances, each of dimension 1000. The whole data set is divided into two classes: the first class contains 1,990,000 instances and the second class only 10,000. This data set is randomly generated and constitutes an imbalanced big data two-class classification problem.
(1) Set the confidence level to 99% and the limit of error to 1%. The sample size of each data set of each class is therefore 16641 (taking t_{a/2} = 2.58, n = (2.58 / (2 × 0.01))² = 129² = 16641). Accordingly, 10,000 instances are extracted from class A (the class containing 1,990,000 instances) and joined with the 10,000 instances of class non-A, so each data set contains 20,000 instances. An ordinary PC can easily classify a data set of 20,000 instances with a common meta-classifier.
(2) Following the above method, this embodiment generates 10 sub-data sets in total. 10 classifiers are built with the nearest neighbor algorithm, with k set to 1 through 10 respectively.
(3) These 10 meta-classifiers are combined with the forward greedy ensemble method into one ensemble classifier.
(4) A given test instance is classified with the ensemble classifier obtained above. If the classification result exceeds 50%, the test instance is judged to belong to class A; otherwise it belongs to class non-A.
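Tying the sketches above together at toy scale (a hypothetical, scaled-down stand-in for this embodiment; the Gaussian data, sizes, seeds, and threshold are all illustrative, not the patent's):

```python
import numpy as np

# Hypothetical stand-in for Embodiment 1: two Gaussian blobs, with class "A"
# heavily outnumbering class "B" (19,900 vs. 100 instances, 20 features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (19900, 20)), rng.normal(3, 1, (100, 20))])
y = np.array(["A"] * 19900 + ["B"] * 100)

# Draw 10 balanced subsets and train 10 nearest-neighbor meta-classifiers
# (k = 1..10), reusing the sketches defined earlier.
subsets = [draw_balanced_subset(X, y, "A", n_required=150, seed=s)
           for s in range(10)]
metas = [train_one_vs_all([sub], "A", k=k + 1)[0]
         for k, sub in enumerate(subsets)]

# Select an ensemble on a held-out balanced draw, then classify one instance.
X_val, y_val = draw_balanced_subset(X, y, "A", n_required=150, seed=99)
ensemble = forward_greedy_ensemble(metas, X_val, (y_val == "A").astype(int),
                                   threshold=0.5)
rate = sum(int(c.predict([X[0]])[0]) for c in ensemble) / len(ensemble)
print("class A" if rate > 0.5 else "class non-A")
```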
Embodiment 2
A given simulated big data set contains 20,000,000 instances, each of dimension 1000. The whole data set is divided into three classes: class A contains 12,000,000 instances, class B 7,900,000, and class C 100,000. This data set is randomly generated and constitutes an imbalanced big data multi-class classification problem.
(1) Set the confidence level to 95% and the limit of error to 1%. The sample size of each data set of each class is 9604 (taking t_{a/2} = 1.96, n = (1.96 / (2 × 0.01))² = 98² = 9604). Since it is somewhat difficult for an ordinary computer to process 300,000 data items, all three classes need to be sampled.
(2) Class A is sampled into 10 data sets, each containing 20,000 instances (note: any number of instances exceeding 9604 suffices). Specifically, 10,000 samples are first drawn at random from class A, then 5,000 samples at random from class B and 5,000 at random from class C, yielding one sub-data set of 20,000 samples. Repeating this sampling 10 times yields the 10 sub-data sets for class A. By analogy, 10 sub-data sets are sampled for class B and 10 for class C. Altogether, 30 sub-data sets are generated in this process.
(3) The 10 data sets of class A are handled with the one-vs-all classification method, i.e., class A is one class and classes B and C together form the other class, and 10 classifiers are built from 10 meta-classifiers: 9 nearest neighbor classifiers with k from 1 to 9, and one C5.0 decision tree classifier.
(4) In the same way, 10 meta-classifiers are built for class B, and 10 meta-classifiers are likewise built for class C.
(5) The 10 meta-classifiers of class A are combined with the forward greedy ensemble method into one ensemble classifier. Likewise, the 10 meta-classifiers of class B and the 10 of class C are each combined into one ensemble classifier.
(6) A given test instance is classified with the three ensemble classifiers obtained above. If the classification result of the class-A ensemble classifier is 85%, that of the class-B ensemble classifier is 89%, and that of the class-C ensemble classifier is 90%, the test instance is judged to belong to class C.

Claims (6)

1. A classification method for non-uniform big data, comprising the following steps:
(1) obtaining the classes of the instances of the big data, the classes being denoted m_i, i = 1, 2, ..., M;
(2) for each class m_i, sampling out D_i data sets by the downsampling method;
(3) building one meta-classifier for each data set;
(4) performing ensemble learning on the D_i classifiers of each class m_i;
(5) testing: each instance is classified within each class m_i, and the class with the highest accuracy among the M results obtained is the class of the test instance.
2. The method according to claim 1, wherein the data volume n_i of each data set in step (2) is determined by n_i = (t_{a/2} / (2ε))², where t_{a/2} is the critical value at the chosen confidence level, obtained from the t distribution, and ε is the set maximum allowable error.
3. The method according to claim 1, wherein in step (3) the method of building the D_i meta-classifiers for the D_i data sets of each class m_i is selected from: the two-class (one-vs-all) method, the nearest neighbor algorithm, the decision tree method, neural networks, or the forest tree method.
4. The method according to claim 1, wherein in step (3) the method of building the D_i meta-classifiers for the D_i data sets of each class m_i is the two-class (one-vs-all) method.
5. The method according to claim 1, wherein in step (4) the forward greedy ensemble method is applied to the D_i meta-classifiers of each class m_i for ensemble learning, yielding one ensemble classifier.
6. The method according to claim 1, wherein in step (4) the detailed process of the forward greedy ensemble method is as follows:
D. building the candidate classifier set CCS = {C_1, ..., C_n} and the selected classifier set SCS = { };
E. among the classifiers C_i, choosing the classifier with the best accuracy, removing it from CCS, and adding it to SCS;
F. adding each classifier C_j currently in CCS to SCS and validating; if the classification result exceeds the threshold specified in advance by the user, jumping to E and moving C_j from CCS to SCS, until CCS is the empty set; otherwise jumping to step (5);
whereby, for the M classes, M ensemble classifiers C_i, i = 1, ..., M, are built in total, each ensemble classifier comprising at most n meta-classifiers.
CN201310452365.3A 2013-09-29 2013-09-29 Non-uniform big data classifying method Expired - Fee Related CN103500205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310452365.3A CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310452365.3A CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Publications (2)

Publication Number Publication Date
CN103500205A CN103500205A (en) 2014-01-08
CN103500205B true CN103500205B (en) 2017-04-12

Family

ID=49865415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310452365.3A Expired - Fee Related CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Country Status (1)

Country Link
CN (1) CN103500205B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399413A * 2019-07-04 2019-11-01 博彦科技股份有限公司 (Beyondsoft Corporation) Data sampling method, apparatus, storage medium and processor

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156029A * 2015-03-24 2016-11-23 中国人民解放军国防科学技术大学 (National University of Defense Technology) Multi-label imbalanced virtual-asset data classification method based on ensemble learning
CN107193836B * 2016-03-15 2021-08-10 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.) Identification method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404009A * 2008-10-31 2009-04-08 金蝶软件(中国)有限公司 (Kingdee Software (China) Co., Ltd.) Data classification filtering method, system and equipment
CN103268336A * 2013-05-13 2013-08-28 刘峰 (Liu Feng) Fast data and big data combined data processing method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053537B2 (en) * 2011-09-21 2015-06-09 Tandent Vision Science, Inc. Classifier for use in generating a diffuse image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404009A * 2008-10-31 2009-04-08 金蝶软件(中国)有限公司 (Kingdee Software (China) Co., Ltd.) Data classification filtering method, system and equipment
CN103268336A * 2013-05-13 2013-08-28 刘峰 (Liu Feng) Fast data and big data combined data processing method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Adapted One-versus-All Decision Trees for Data Stream Classification; Sattar Hashemi et al.; IEEE Transactions on Knowledge and Data Engineering; May 31, 2009; Vol. 21, No. 5; pp. 624-637 *
Intrusion detection based on Bagging support vector machine ensembles (基于Bagging支持向量机集成的入侵检测研究); Gu Yu et al.; Microelectronics & Computer (微电子学与计算机); Dec. 31, 2005; Vol. 22, No. 5; pp. 17-19 *


Also Published As

Publication number Publication date
CN103500205A (en) 2014-01-08

Similar Documents

Publication Publication Date Title
Triguero et al. ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem
Wang et al. How many software metrics should be selected for defect prediction?
CN102737126B (en) Classification rule mining method under cloud computing environment
CN107292186A Model training method and device based on random forest
Tsai et al. Evolutionary instance selection for text classification
CN105913077A (en) Data clustering method based on dimensionality reduction and sampling
CN102609714A (en) Novel classifier based on information gain and online support vector machine, and classification method thereof
CN103500205B (en) Non-uniform big data classifying method
CN110647995A (en) Rule training method, device, equipment and storage medium
CN107577792A Method and system for automatic clustering of business data
Li et al. Scalable random forests for massive data
Luo et al. A hybrid particle swarm optimization for high-dimensional dynamic optimization
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
CN102567405A (en) Hotspot discovery method based on improved text space vector representation
CN103207804A (en) MapReduce load simulation method based on cluster job logging
CN102426598A (en) Method for clustering Chinese texts for safety management of network content
Almunirawi et al. A comparative study on serial decision tree classification algorithms in text mining
Saidi et al. Feature selection using genetic algorithm for big data
Gupta et al. Feature selection: an overview
CN113743004B (en) Quantum Fourier transform-based full-element productivity calculation method
Li et al. Pruning SMAC search space based on key hyperparameters
CN114880690A Source data time sequence refinement method based on edge computing
Jiang et al. Scaling deep phylogenetic embedding to ultra-large reference trees: a tree-aware ensemble approach
Song et al. A dynamic ensemble framework for mining textual streams with class imbalance
CN111143560A (en) Short text classification method, terminal equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170412