CN103500205B - Non-uniform big data classifying method - Google Patents

Non-uniform big data classifying method

Info

Publication number
CN103500205B
CN103500205B (application CN201310452365.3A)
Authority
CN
China
Prior art keywords
class
classifier
classification
data
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310452365.3A
Other languages
Chinese (zh)
Other versions
CN103500205A (en)
Inventor
朱晓峰 (Zhu Xiaofeng)
张师超 (Zhang Shichao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201310452365.3A priority Critical patent/CN103500205B/en
Publication of CN103500205A publication Critical patent/CN103500205A/en
Application granted granted Critical
Publication of CN103500205B publication Critical patent/CN103500205B/en


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)

Abstract

The invention relates to a non-uniform big data classification method for classifying data sets whose classes are unevenly distributed and which cannot be loaded into computer memory. First, the sample size is determined theoretically for a downsampling method, and the number of classifiers is determined by the number of samples. An ensemble classifier is built for each class of the big data. When a test instance is classified, the ensemble classifiers of all classes classify it, and the class whose ensemble classifier achieves the highest classification rate is taken as the class of the test instance. The method's time complexity is linear in the size of the big data, and it reduces the bias of non-uniform big data classification results. Furthermore, the ensemble classifiers improve accuracy. The method is easy to implement, and writing the code involves only some simple mathematical models.

Description

Non-uniform big data classification method
Technical Field
The present invention relates to the field of computer science and technology and to the field of information technology, in particular to big data, and specifically to a processing method for non-uniform big data classification.
Background Art
Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools under existing physical conditions and within an acceptable time. Big data has the following characteristics: Volume (large data volume), Variety (wide variety of data types), Value (low value density), and Velocity (fast processing speed), abbreviated as the 4Vs.
Current big data research generally falls into two broad classes. The first is the challenge big data poses to system architecture. The HADOOP clusters of many well-known web sites currently hold tens of PB of raw data, with redundancy, which must be scanned and updated daily. To guarantee that a single-node or single-rack failure does not affect operation, HADOOP generally uses a 3-replica policy, so the cost of data must be considered in both the time dimension and the space dimension. Therefore, building efficient management mechanisms in which massive numbers of small files coexist with large files, while supporting the storage, management, and access of structured, semi-structured, and unstructured data, are all problems that have to be considered. The second is the challenge of big data knowledge discovery and the challenge big data poses to mining algorithms. The foremost need is scalability of the algorithms. Some classic data mining and machine learning algorithms, such as KNN density estimation, non-parametric Bayes, support vector machines, Gaussian process regression, and hierarchical clustering, have at least quadratic complexity and therefore cannot be applied well in big data mining. This calls for designing more efficient algorithms, i.e., O(n log n) or O(n).
Judging from the existing literature on big data mining, research on learning from big data concentrates mainly on upgrading and improving conventional methods in four areas: classification, clustering, retrieval, and incremental (batch, online, or parallel) learning. At present there is relatively little research on handling non-uniform big data. As with other research on big data knowledge discovery, a big data classification problem must first consider the complexity of the algorithm. Second, applying existing classification algorithms (which assume the classes of the data are uniformly distributed) directly to non-uniform big data easily causes bias, i.e., the classification results favor the large class (the class containing a very large proportion of the instances, e.g., more than 90% in a two-class problem). Finally, when common algorithms are applied to non-uniform (imbalanced) data classification, they usually pursue minimum classification error but ignore the misclassification costs of the non-uniform classes.
Non-uniform big data classification is thus an extremely challenging problem, raising a series of basic questions urgently awaiting solution: where to start, how to use big data for intelligent activities, and so on.
Summary of the Invention
The present invention studies the non-uniform big data classification problem.
It is an object of the invention to provide a simple and effective non-uniform big data classification method that solves both the bias problem that easily arises in big data classification and the high-complexity problem of big data algorithms. That is, the method reaches linear-complexity non-uniform big data classification through downsampling and two-class (one-vs-all) decomposition of the big data, solves the bias problem and improves classification accuracy by ensembling the results of multiple classifiers, and is robust, i.e., noise-tolerant.
The concrete steps of the method are as follows:
(1) Obtain the number of instances of each class of the big data; the classes are denoted m_i, i = 1, 2, ..., M;
(2) For each class m_i, sample out D_i data sets by the downsampling method, where the data volume n_i of each data set is determined by n_i = (t_{a/2} / (2ε))², in which t_{a/2} is the critical value at the chosen confidence level, obtained from the t distribution, and ε is the maximum allowable error. Each class m_i is sampled in this way to obtain its D_i data sets.
(3) For the D_i data sets of each class m_i, build D_i classifiers with the one-vs-all method (i.e., all instances of the current class are the positive class and all instances of the other classes are the negative class); that is, build one classifier per data set.
(4) Perform ensemble learning over the D_i classifiers of each class m_i. According to ensemble learning theory, an ensemble classifier can be formed from multiple meta-classifiers according to an integration principle. All meta-classifiers should classify quickly, be mutually independent, and each have an error rate no higher than 50%. Common classifiers of this kind, such as the nearest neighbor algorithm, the decision tree method, neural network methods, or the forest tree method, meet these requirements. Typical integration principles include bagging, AdaBoost, and selective ensemble. In the invention, the D_i classifiers obtained for each class m_i are ensembled with the forward greedy ensemble method.
(5) Testing: each test instance is classified against every class, and the class whose result has the highest accuracy among the M results obtained is the class of the test instance (a sketch of this voting step follows).
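For illustration only, a minimal Python sketch of this test step (not part of the patent; it assumes each class's ensemble is a list of trained one-vs-all meta-classifiers following the scikit-learn predict convention, and uses the positive-vote rate as the per-class classification rate):

```python
def predict(x, ensembles):
    """Classify one test instance: every class's ensemble votes on its
    one-vs-all question, and the class whose ensemble reports the highest
    positive-vote rate wins (cf. the 85%/89%/90% example in Embodiment 2)."""
    rates = {label: sum(int(clf.predict([x])[0]) for clf in clfs) / len(clfs)
             for label, clfs in ensembles.items()}
    return max(rates, key=rates.get)
```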
The goal of step (2) is to solve the algorithm complexity problem, i.e., to build classifiers from part of the original data, obtained by downsampling, rather than from all of it. To improve classification accuracy, a multiple-sampling strategy is adopted: sampling is repeated several times, each sample size satisfies the above requirement, and the number of sampling rounds is set by the user.
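For illustration, a minimal Python sketch of the sample-size computation n_i = (t_{a/2} / (2ε))²; the use of scipy's normal approximation for the critical value and the function name are assumptions, not part of the patent:

```python
from scipy.stats import norm

def sample_size(confidence, max_error):
    """Worst-case sample size n = (t_{a/2} / (2 * eps))^2, approximating the
    t critical value by the normal one (accurate for the large n used here)."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    return int(round((z / (2 * max_error)) ** 2))

# 95% confidence and 1% error give (1.96 / 0.02)^2 = 9604, matching
# Embodiment 2; Embodiment 1's 16641 corresponds to t_{a/2} = 2.58 at 99%.
print(sample_size(0.95, 0.01))  # 9604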
The concrete steps of the downsampling method of step (2) are as follows:
A. When sampling each class m_i, the sample size is no less than the theoretical requirement above (i.e., (t_{a/2} / (2ε))² instances), and the number of samples is the number of meta-classifiers to be built. In generating one sample for a class, first obtain the number of instances of the current class. The invention treats the current class as class A and refers to all other classes collectively as class non-A. Then the instance counts of class A and class non-A are compared. Denote by #(A), #(~A), #(R), and #(T) the sizes of class A, of class non-A, of the data that computer memory can hold, and of the theoretically required sample. If (#(A) >> #(R)) && (#(A) > #(T)), then extract from class A about as many instances as from class non-A; if (#(~A) >> #(R)) && (#(~A) > #(T)), then extract from class non-A about as many instances as from class A (a sketch of one such balanced draw follows this list).
B. Repeat the above process until D_i samples have been drawn for each class m_i. For simplicity, the invention fixes D_i at n.
C. At this point the whole data set has generated D = M·n samples.
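A minimal sketch of one such balanced draw, assuming numpy arrays X (features) and y (labels); the function name and the half-and-half split are illustrative of the roughly equal extraction described in step A (compare the 10,000 + 10,000 draw of Embodiment 1):

```python
import numpy as np

def draw_balanced_subset(X, y, positive_class, n_required, seed=None):
    """One downsampling round: take about half the subset from the current
    class (A) and half from the union of the other classes (non-A), keeping
    the total no smaller than the theoretical size (t_{a/2}/(2*eps))^2."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == positive_class)
    neg = np.flatnonzero(y != positive_class)
    half = n_required // 2 + 1
    take = np.concatenate([
        rng.choice(pos, size=min(half, pos.size), replace=False),
        rng.choice(neg, size=min(half, neg.size), replace=False),
    ])
    rng.shuffle(take)
    return X[take], y[take]
```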
Through step (2), the invention thus obtains M·n samples, with n samples per class. In step (3) of the invention, n meta-classifiers in total are built for the n samples of each class m_i;
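A sketch of this step, taking scikit-learn's nearest neighbor classifier as the meta-classifier (one of the options the patent names); the helper name and data layout are assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def train_one_vs_all(subsets, positive_class, k=1):
    """Build one meta-classifier per sampled data set, relabeling the current
    class as positive (1) and all other classes as negative (0)."""
    classifiers = []
    for X, y in subsets:  # the data sets drawn for class m_i by downsampling
        y_binary = (np.asarray(y) == positive_class).astype(int)
        classifiers.append(KNeighborsClassifier(n_neighbors=k).fit(X, y_binary))
    return classifiers
```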
Then in step (4) of the invention the n meta-classifiers obtained are combined into one ensemble classifier, i.e., the forward greedy ensemble method is adopted. Its steps are as follows (a sketch follows the list):
D. Build the candidate classifier set CCS = {C_1, ..., C_n} and the selected classifier set SCS = { };
E. Among the classifiers C_i, choose the classifier with the best accuracy, remove it from CCS, and add it to SCS;
F. Add each classifier C_j currently in CCS to SCS and validate; if the classification result exceeds the threshold specified in advance by the user, jump to E and move C_j from CCS to SCS. Otherwise jump to step (5); at that point the learning of the ensemble classifier is complete;
G. Repeat F until CCS is the empty set.
At this point, for the M classes, the invention has built M ensemble classifiers C_i, i = 1, ..., M in total. Each ensemble classifier contains at most n meta-classifiers.
The above steps ensure that few classifiers are retained, which keeps the testing process fairly simple.
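A sketch of steps D-G under stated assumptions (a held-out validation set with 0/1 labels and a user threshold are supplied; candidate classifiers follow the scikit-learn predict convention; all names are illustrative):

```python
def forward_greedy_ensemble(candidates, X_val, y_val, threshold):
    """Steps D-G: seed SCS with the most accurate candidate (E), then keep
    moving candidates from CCS to SCS while adding one lifts the ensemble's
    validation accuracy past the user threshold (F), until CCS is empty (G)."""
    def accuracy(clfs):
        votes = sum(clf.predict(X_val) for clf in clfs)  # majority vote
        return (((votes > len(clfs) / 2).astype(int)) == y_val).mean()

    ccs = list(candidates)                        # step D: candidate set CCS
    best = max(ccs, key=lambda c: accuracy([c]))
    scs = [best]                                  # step E: seed selected set SCS
    ccs.remove(best)
    improved = True
    while ccs and improved:                       # steps F and G
        improved = False
        for clf in list(ccs):
            if accuracy(scs + [clf]) > threshold:
                scs.append(clf)
                ccs.remove(clf)
                improved = True
    return scs
```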
The non-uniform big data classification method implemented through the above steps has the following features. First, because the downsampling method makes the instance counts of the classes as balanced as possible during classification, it effectively avoids the problem of the classification being biased toward the large class. Second, classifying by means of sampling keeps the complexity of the whole classification algorithm at most linear. Third, to prevent sampling from reducing classification accuracy, the invention improves accuracy by two means: the multiple-sampling method and the forward greedy ensemble method.
The invention uses the sampling method to reduce class imbalance and to lower the complexity of the algorithm; it samples multiple times, builds one meta-classifier per sampling round, and uses ensemble learning to combine all the meta-classifiers and improve classification performance.
Sampling big data: it is usually extremely difficult to classify over the whole of a big data set, and even when feasible the complexity is very high. The sampling method makes big data classification feasible and reduces the complexity of classification to linear, which is exactly the result big data mining hopes for.
Sample size and number of samples: the sample size is obtained from theory, guaranteeing that the error between the result after sampling and the result on the original data is minimal. Extracting multiple samples helps improve classification performance.
The one-vs-all classification method has been verified to be a very effective method for handling non-uniform data sets. The invention applies it to non-uniform big data classification, where it solves the non-uniform classification problem on the one hand and the high-complexity problem of big data classification on the other.
Meta-classifiers make classification on large data sets quicker, and ensemble learning can effectively improve the performance of the meta-classifiers. The forward greedy ensemble guarantees that, while improving the meta-classifiers' performance, it also reduces the complexity of the classifier, which is a strong guarantee of linear complexity when processing big data.
Specific Embodiments
Embodiment 1
A given simulated big data set contains 2,000,000 instances, each of dimension 1000. The whole data set is divided into two classes: the first class contains 1,990,000 instances and the second class only 10,000. This data set is randomly generated and constitutes an imbalanced big data two-class classification problem.
(1) Set the confidence level to 99% and the limit of error to 1%. The sample size of each data set of each class is therefore 16641 (taking t_{a/2} = 2.58, n = (2.58 / (2 × 0.01))² = 129² = 16641). Accordingly, 10,000 instances are extracted from class A (the class containing 1,990,000 instances) and joined with the 10,000 instances of class non-A, so each data set contains 20,000 instances. An ordinary PC can easily classify a data set of 20,000 instances with a common meta-classifier.
(2) Following the above method, this embodiment generates 10 sub-data sets in total. 10 classifiers are built with the nearest neighbor algorithm, with k set to 1 through 10 respectively.
(3) These 10 meta-classifiers are combined with the forward greedy ensemble method into one ensemble classifier.
(4) A given test instance is classified with the ensemble classifier obtained above. If the classification result exceeds 50%, the test instance is judged to belong to class A; otherwise it belongs to class non-A.
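Tying the sketches above together at toy scale (a hypothetical, scaled-down stand-in for this embodiment; the Gaussian data, sizes, seeds, and threshold are all illustrative, not the patent's):

```python
import numpy as np

# Hypothetical stand-in for Embodiment 1: two Gaussian blobs, with class "A"
# heavily outnumbering class "B" (19,900 vs. 100 instances, 20 features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (19900, 20)), rng.normal(3, 1, (100, 20))])
y = np.array(["A"] * 19900 + ["B"] * 100)

# Draw 10 balanced subsets and train 10 nearest-neighbor meta-classifiers
# (k = 1..10), reusing the sketches defined earlier.
subsets = [draw_balanced_subset(X, y, "A", n_required=150, seed=s)
           for s in range(10)]
metas = [train_one_vs_all([sub], "A", k=k + 1)[0]
         for k, sub in enumerate(subsets)]

# Select an ensemble on a held-out balanced draw, then classify one instance.
X_val, y_val = draw_balanced_subset(X, y, "A", n_required=150, seed=99)
ensemble = forward_greedy_ensemble(metas, X_val, (y_val == "A").astype(int),
                                   threshold=0.5)
rate = sum(int(c.predict([X[0]])[0]) for c in ensemble) / len(ensemble)
print("class A" if rate > 0.5 else "class non-A")
```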
Embodiment 2
A given simulated big data set contains 20,000,000 instances, each of dimension 1000. The whole data set is divided into three classes: class A contains 12,000,000 instances, class B 7,900,000, and class C 100,000. This data set is randomly generated and constitutes an imbalanced big data multi-class classification problem.
(1) Set the confidence level to 95% and the limit of error to 1%. The sample size of each data set of each class is 9604 (taking t_{a/2} = 1.96, n = (1.96 / (2 × 0.01))² = 98² = 9604). Since it is somewhat difficult for an ordinary computer to process 300,000 data items, all three classes need to be sampled.
(2) Class A is sampled into 10 data sets, each containing 20,000 instances (note: any number of instances exceeding 9604 suffices). Specifically, 10,000 samples are first drawn at random from class A, then 5,000 samples at random from class B and 5,000 at random from class C, yielding one sub-data set of 20,000 samples. Repeating this sampling 10 times yields the 10 sub-data sets for class A. By analogy, 10 sub-data sets are sampled for class B and 10 for class C. Altogether, 30 sub-data sets are generated in this process.
(3) The 10 data sets of class A are handled with the one-vs-all classification method, i.e., class A is one class and classes B and C together form the other class, and 10 classifiers are built from 10 meta-classifiers: 9 nearest neighbor classifiers with k from 1 to 9, and one C5.0 decision tree classifier.
(4) In the same way, 10 meta-classifiers are built for class B, and 10 meta-classifiers are likewise built for class C.
(5) The 10 meta-classifiers of class A are combined with the forward greedy ensemble method into one ensemble classifier. Likewise, the 10 meta-classifiers of class B and the 10 of class C are each combined into one ensemble classifier.
(6) A given test instance is classified with the three ensemble classifiers obtained above. If the classification result of the class-A ensemble classifier is 85%, that of the class-B ensemble classifier is 89%, and that of the class-C ensemble classifier is 90%, the test instance is judged to belong to class C.

Claims (6)

1. A classification method for non-uniform big data, comprising the following steps:
(1) obtaining the classes of the instances of the big data, the classes being denoted m_i, i = 1, 2, ..., M;
(2) for each class m_i, sampling out D_i data sets by the downsampling method;
(3) building one meta-classifier for each data set;
(4) performing ensemble learning on the D_i classifiers of each class m_i;
(5) testing: each instance is classified within each class m_i, and the class with the highest accuracy among the M results obtained is the class of the test instance.
2. The method according to claim 1, wherein the data volume n_i of each data set in step (2) is determined by n_i = (t_{a/2} / (2ε))², where t_{a/2} is the critical value at the chosen confidence level, obtained from the t distribution, and ε is the set maximum allowable error.
3. The method according to claim 1, wherein in step (3) the method of building the D_i meta-classifiers for the D_i data sets of each class m_i is selected from: the two-class (one-vs-all) method, the nearest neighbor algorithm, the decision tree method, neural networks, or the forest tree method.
4. The method according to claim 1, wherein in step (3) the method of building the D_i meta-classifiers for the D_i data sets of each class m_i is the two-class (one-vs-all) method.
5. The method according to claim 1, wherein in step (4) the forward greedy ensemble method is applied to the D_i meta-classifiers of each class m_i for ensemble learning, yielding one ensemble classifier.
6. The method according to claim 1, wherein in step (4) the detailed process of the forward greedy ensemble method is as follows:
D. building the candidate classifier set CCS = {C_1, ..., C_n} and the selected classifier set SCS = { };
E. among the classifiers C_i, choosing the classifier with the best accuracy, removing it from CCS, and adding it to SCS;
F. adding each classifier C_j currently in CCS to SCS and validating; if the classification result exceeds the threshold specified in advance by the user, jumping to E and moving C_j from CCS to SCS, until CCS is the empty set; otherwise jumping to step (5);
whereby, for the M classes, M ensemble classifiers C_i, i = 1, ..., M, are built in total, each ensemble classifier comprising at most n meta-classifiers.
CN201310452365.3A 2013-09-29 2013-09-29 Non-uniform big data classifying method Expired - Fee Related CN103500205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310452365.3A CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310452365.3A CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Publications (2)

Publication Number Publication Date
CN103500205A CN103500205A (en) 2014-01-08
CN103500205B true CN103500205B (en) 2017-04-12

Family

ID=49865415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310452365.3A Expired - Fee Related CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Country Status (1)

Country Link
CN (1) CN103500205B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399413A * 2019-07-04 2019-11-01 博彦科技股份有限公司 (Beyondsoft Corporation) Data sampling method, apparatus, storage medium and processor

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156029A * 2015-03-24 2016-11-23 中国人民解放军国防科学技术大学 (National University of Defense Technology) Multi-label imbalanced virtual-asset data classification method based on ensemble learning
CN107193836B * 2016-03-15 2021-08-10 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.) Identification method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404009A * 2008-10-31 2009-04-08 金蝶软件(中国)有限公司 (Kingdee Software (China) Co., Ltd.) Data classification filtering method, system and equipment
CN103268336A * 2013-05-13 2013-08-28 刘峰 (Liu Feng) Fast data and big data combined data processing method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053537B2 (en) * 2011-09-21 2015-06-09 Tandent Vision Science, Inc. Classifier for use in generating a diffuse image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404009A * 2008-10-31 2009-04-08 金蝶软件(中国)有限公司 (Kingdee Software (China) Co., Ltd.) Data classification filtering method, system and equipment
CN103268336A * 2013-05-13 2013-08-28 刘峰 (Liu Feng) Fast data and big data combined data processing method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Adapted One-versus-All Decision Trees for Data Stream Classification; Sattar Hashemi et al.; IEEE Transactions on Knowledge and Data Engineering; May 31, 2009; Vol. 21, No. 5; pp. 624-637 *
Intrusion detection based on Bagging support vector machine ensembles (基于Bagging支持向量机集成的入侵检测研究); Gu Yu et al.; Microelectronics & Computer (微电子学与计算机); Dec. 31, 2005; Vol. 22, No. 5; pp. 17-19 *


Also Published As

Publication number Publication date
CN103500205A (en) 2014-01-08

Similar Documents

Publication Publication Date Title
Triguero et al. ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem
Wang et al. How many software metrics should be selected for defect prediction?
CN102737126B (en) Classification rule mining method under cloud computing environment
CN107292186A Model training method and device based on random forest
Tsai et al. Evolutionary instance selection for text classification
CN105913077A (en) Data clustering method based on dimensionality reduction and sampling
CN102609714A (en) Novel classifier based on information gain and online support vector machine, and classification method thereof
CN103500205B (en) Non-uniform big data classifying method
CN110647995A (en) Rule training method, device, equipment and storage medium
CN107577792A Method and system for automatic clustering of business data
Li et al. Scalable random forests for massive data
Luo et al. A hybrid particle swarm optimization for high-dimensional dynamic optimization
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
CN102567405A (en) Hotspot discovery method based on improved text space vector representation
CN103207804A (en) MapReduce load simulation method based on cluster job logging
CN102426598A (en) Method for clustering Chinese texts for safety management of network content
Almunirawi et al. A comparative study on serial decision tree classification algorithms in text mining
Saidi et al. Feature selection using genetic algorithm for big data
Gupta et al. Feature selection: an overview
CN113743004B (en) Quantum Fourier transform-based full-element productivity calculation method
Li et al. Pruning SMAC search space based on key hyperparameters
CN114880690A Source data time sequence refinement method based on edge computing
Jiang et al. Scaling deep phylogenetic embedding to ultra-large reference trees: a tree-aware ensemble approach
Song et al. A dynamic ensemble framework for mining textual streams with class imbalance
CN111143560A (en) Short text classification method, terminal equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170412