CN103500205B - Non-uniform big data classifying method - Google Patents
- Publication number
- CN103500205B CN103500205B CN201310452365.3A CN201310452365A CN103500205B CN 103500205 B CN103500205 B CN 103500205B CN 201310452365 A CN201310452365 A CN 201310452365A CN 103500205 B CN103500205 B CN 103500205B
- Authority
- CN
- China
- Prior art keywords
- class
- classifier
- classification
- data
- big data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000000034 method Methods 0.000 title claims abstract description 59
- 238000012360 testing method Methods 0.000 claims abstract description 11
- 238000005070 sampling Methods 0.000 claims description 20
- 238000003066 decision tree Methods 0.000 claims description 3
- 238000013528 artificial neural network Methods 0.000 claims 1
- 230000010287 polarization Effects 0.000 abstract 1
- 238000011160 research Methods 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 3
- 238000007418 data mining Methods 0.000 description 2
- 230000010354 integration Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000004088 simulation Methods 0.000 description 2
- 239000000470 constituent Substances 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000013480 data collection Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000036039 immunity Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 230000010076 replication Effects 0.000 description 1
- 238000005303 weighing Methods 0.000 description 1
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
Abstract
The invention relates to a non-uniform big data classification method for classifying data sets whose classes cannot all be processed in computer memory and whose class distribution is non-uniform. First, the sample size is determined theoretically by a downsampling method, and the number of classifiers is determined by the number of samples. An ensemble classifier is built for each class of the big data. When a test instance is classified, the ensemble classifiers of all classes are applied, and the class whose ensemble classifier yields the highest classification rate is taken as the class of the test instance. The method is linear in time complexity with respect to the big data and reduces the bias of non-uniform big data classification results; furthermore, the ensemble classifiers improve accuracy. The method is easy to implement, and writing its code involves only some simple mathematical models.
Description
Technical field
The present invention relates to the fields of computer science and technology and of information technology, in particular to big data, and more particularly to a method for classifying non-uniform big data.
Background technology
Big data refers to data collections that cannot be captured, managed, and processed with conventional software tools under existing physical conditions and within an allowable time. Big data has the following features: Volume (large data volume), Variety (wide variety of data), Value (low value density), and Velocity (fast processing speed), abbreviated as 4V.
Current big data research generally falls into two broad categories. First, the challenge big data poses to system architecture. The raw data capacity of the HADOOP clusters at many well-known Web sites reaches tens of PB, contains redundancy, and needs to be scanned and updated daily. HADOOP generally uses a 3-replica policy to guarantee that a single-node or single-rack failure does not affect operation, so cost must be considered in both the time and space dimensions of the data. Therefore, constructing efficient management mechanisms in which large numbers of small files and large files coexist, while supporting the storage, management, and access of structured, semi-structured, and unstructured data, are all problems that must be considered. Second, the challenge of big data knowledge discovery and the challenge big data poses to mining algorithms. The first issue is the scalability of algorithms. Some classic data mining and machine learning algorithms, such as KNN density estimation, nonparametric BAYES, support vector machines, Gaussian process regression, and hierarchical clustering, have at least quadratic complexity and therefore cannot be applied well to big data mining. This calls for designing more efficient algorithms, i.e., O(nlogn) or O(n).
Judging from the large body of existing literature on big data mining, big data learning research concentrates mainly on upgrading and improving conventional methods in four areas: classification, clustering, retrieval, and incremental (batch, online, or parallel) learning. Research on handling non-uniform big data problems is relatively scarce. As with other research on big data knowledge discovery, the big data classification problem must first consider the complexity of the algorithm. Second, directly applying existing classification algorithms (which assume the class distribution of the data is uniform) to non-uniform big data easily causes bias, i.e., classification results favor the large class (the class containing a very large proportion of the instances, e.g., more than 90% in a two-class problem). Finally, applying common algorithms to non-uniform (imbalanced) data classification usually pursues minimum classification error while ignoring the misclassification cost of the non-uniform classes. Non-uniform big data classification is thus an extremely challenging problem, raising a series of basic questions urgently awaiting solution: where to start, how to use big data to perform intelligent activities, and so on.
Content of the invention
The present invention studies the non-uniform big data classification problem.
It is an object of the invention to provide a simple and effective non-uniform big data classification method. The method solves the bias problem that easily arises in big data classification and the high complexity of big data algorithms. Specifically, the method achieves linear-complexity classification of non-uniform big data through downsampling (Downsampling) and one-vs-all two-class decomposition of the big data; it solves the bias problem and improves classification accuracy by ensembling the results of multiple classifiers (ensemble); and it is robust (robust) to noise.
The specific steps of the method are as follows:
(1) Obtain the number m_i of instances of each class of the big data, i = 1, 2, ..., M;
(2) For each class m_i, sample D_i data sets using the downsampling method. The data volume n_i of each data set is determined by n_i = (t_{a/2}/(2ε))², where t_{a/2} is the critical value at the chosen confidence level, obtained from the t distribution, and ε is the maximum allowable error. In this way, D_i data sets are sampled for each class m_i.
(3) For the D_i data sets of each class m_i, build D_i classifiers with the one-vs-all method (i.e., all instances of the current class are the positive class and all instances of the other classes are the negative class), that is, one classifier per data set.
(4) Perform ensemble learning on the D_i classifiers of each class m_i. According to ensemble learning theory, an ensemble classifier can be formed from multiple meta-classifiers according to an integration principle. All meta-classifiers should classify quickly, be mutually independent, and each should have an error rate no higher than 50%. Common classifiers such as the nearest-neighbor algorithm, the decision tree method, neural networks, or the forest tree method (Forest tree) meet these requirements. Common integration principles include bagging, Adaboost, and selective ensemble. In the present invention, the D_i classifiers obtained for each class m_i undergo ensemble learning using the forward greedy ensemble method (forward greedy ensemble).
(5) Test: classify each test instance with the ensemble classifier of every class; among the M results obtained, the class whose ensemble classifier has the highest accuracy is the class of the test instance.
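A minimal sketch of the one-vs-all labeling of step (3) and the decision rule of step (5); the helper names `one_vs_all_labels` and `predict` are illustrative rather than from the patent, and each per-class ensemble is reduced to a callable returning a confidence score:

```python
def one_vs_all_labels(labels, positive_class):
    """Step (3): relabel a multi-class label list for one-vs-all training.
    Instances of the current class become +1, all other classes -1."""
    return [1 if y == positive_class else -1 for y in labels]

def predict(instance, ensembles):
    """Step (5): score the instance with the ensemble of every class and
    return the class whose ensemble reports the highest confidence."""
    return max(ensembles, key=lambda cls: ensembles[cls](instance))

# Toy usage with dummy confidence scores, mirroring embodiment 2:
ensembles = {"A": lambda x: 0.85, "B": lambda x: 0.89, "C": lambda x: 0.90}
winner = predict(None, ensembles)  # "C"
```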
The aim of step (2) is to solve the algorithm complexity problem, i.e., through downsampling the classifiers are built with part of the original data rather than all of it. To improve classification accuracy, a multiple-sampling strategy is adopted, i.e., sampling is repeated; each sample size satisfies the above rule, and the number of samplings is determined by the user.
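As a sketch, the sample-size rule of step (2) can be computed as follows; the function name `sample_size` is illustrative, and the two-decimal critical values (t_{a/2} = 2.58 at 99% confidence, 1.96 at 95%) are assumed, consistent with the figures 16641 and 9604 used in the embodiments:

```python
def sample_size(t_half_alpha: float, eps: float) -> int:
    """Sample size n = (t_{a/2} / (2 * eps))^2, rounded to the nearest
    integer. t_half_alpha is the critical value at the chosen confidence
    level (e.g. 2.58 for 99%, 1.96 for 95%); eps is the maximum
    allowable error (e.g. 0.01 for 1%)."""
    return round((t_half_alpha / (2.0 * eps)) ** 2)

# Values matching the two embodiments:
n_99 = sample_size(2.58, 0.01)  # 16641 (99% confidence, 1% error)
n_95 = sample_size(1.96, 0.01)  # 9604  (95% confidence, 1% error)
```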
The specific steps of the downsampling method of step (2) of the present invention are as follows:
a. When sampling each class m_i, the sample size is no less than the theoretical requirement above (i.e., n_i = (t_{a/2}/(2ε))² instances), and the number of samples equals the number of meta-classifiers to be built. In generating each sample for a class, first obtain the number of instances of the current class. The present invention treats the current class as class A and all other classes collectively as class non-A, and then analyzes the instance counts of class A and class non-A. Let #(A), #(~A), #(R), and #(T) denote the sizes of class A, class non-A, the data that fits in computer memory, and the theoretically required sample, respectively. If (#(A) >> #(R)) && (#(A) > #(T)), extract from class A roughly as many instances as class non-A has; if (#(~A) >> #(R)) && (#(~A) > #(T)), extract from class non-A roughly as many instances as class A has.
b. Repeat the above process until D_i samples have been drawn for each class m_i. For simplicity, the present invention fixes D_i to n.
c. So far the whole data set generates D = M*n samples in total.
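The downsampling of steps a to c can be sketched as follows; this is a simplified illustration in which the memory constraint #(R) is omitted and the larger side is always cut down to the size of the smaller side (the function names and the use of `random.sample` are assumptions):

```python
import random

def draw_subdataset(class_a, non_a):
    """Step a: one sub-data-set for class A versus class non-A. The larger
    side is downsampled to the size of the smaller side, producing a
    balanced two-class data set."""
    k = min(len(class_a), len(non_a))
    pos = [(x, 1) for x in random.sample(class_a, k)]   # class A -> +1
    neg = [(x, -1) for x in random.sample(non_a, k)]    # class non-A -> -1
    return pos + neg

def downsample(class_a, non_a, n_sets):
    """Steps b-c: repeat the draw to obtain n sub-data-sets for the class."""
    return [draw_subdataset(class_a, non_a) for _ in range(n_sets)]

# Toy usage: 1000 class-A items versus 50 non-A items, 10 balanced draws
subsets = downsample(list(range(1000)), list(range(1000, 1050)), 10)
```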
Through step (2), the present invention obtains M*n samples, with n classifiers per group of data. In step (3) of the present invention, n meta-classifiers in total are built from the n samples of each class m_i.
Then in step (4) of the present invention, the n meta-classifiers obtained are integrated into one ensemble classifier, i.e., the forward greedy ensemble method is adopted. Its steps are as follows:
d. Build the candidate classifier set CCS = {C_1, ..., C_n} and the selected classifier set SCS = {};
e. Among the classifiers C_i, choose the one with the best accuracy, remove it from CCS, and add it to SCS;
f. Add each classifier C_j currently in CCS to SCS and validate; if the classification result exceeds the threshold specified in advance by the user, jump to e and move C_j from CCS to SCS; otherwise jump to step (5), which indicates that learning of the ensemble classifier is complete;
g. Repeat f until CCS is the empty set.
So far, for the M classes, the present invention builds M ensemble classifiers C_i, i = 1, ..., M in total. Each ensemble classifier contains n meta-classifiers.
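A minimal sketch of steps d to g, assuming each meta-classifier is a callable returning +1 or -1 and validation accuracy is measured by majority vote on a held-out set (the helper `accuracy`, the threshold semantics, and the toy classifiers are illustrative, not from the patent):

```python
def accuracy(classifiers, val_x, val_y):
    """Majority-vote accuracy of a set of classifiers on a validation set."""
    correct = 0
    for x, y in zip(val_x, val_y):
        votes = sum(clf(x) for clf in classifiers)
        pred = 1 if votes > 0 else -1
        correct += (pred == y)
    return correct / len(val_y)

def forward_greedy_ensemble(candidates, val_x, val_y, threshold):
    """Steps d-g: start from the single most accurate classifier (e), then
    repeatedly move into SCS any candidate that keeps validation accuracy
    at or above the user-specified threshold (f), until no candidate
    helps or CCS is empty (g)."""
    ccs = list(candidates)                                  # d. CCS
    best = max(ccs, key=lambda c: accuracy([c], val_x, val_y))
    ccs.remove(best)
    scs = [best]                                            # d./e. SCS
    improved = True
    while ccs and improved:
        improved = False
        for clf in list(ccs):                               # f. try each C_j
            if accuracy(scs + [clf], val_x, val_y) >= threshold:
                scs.append(clf)
                ccs.remove(clf)
                improved = True
                break
    return scs

# Toy usage: a good, a biased, and an inverted classifier
val_x, val_y = [-2, -1, 1, 2], [-1, -1, 1, 1]
clf_good = lambda x: 1 if x > 0 else -1
clf_pos = lambda x: 1
clf_bad = lambda x: -1 if x > 0 else 1
selected = forward_greedy_ensemble([clf_good, clf_pos, clf_bad],
                                   val_x, val_y, threshold=0.9)
```

Here `selected` keeps the good and the biased classifier (their joint vote still classifies the toy set perfectly) and rejects the inverted one, illustrating how the greedy loop keeps the ensemble small.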
The above steps ensure that fewer classifiers are obtained, which makes the testing process fairly simple.
The non-uniform big data classification method implemented through the above steps has the following features. First, because the downsampling method is used in the classification process, the instance data of each class are as balanced as possible, which effectively prevents the classification from being biased toward the large class. Second, classifying with the sampling method keeps the complexity of the whole classification algorithm at most linear. Third, to avoid the loss of classification accuracy caused by sampling, the present invention improves accuracy by two methods, namely the multiple-sampling method and the forward greedy ensemble method.
The present invention uses the sampling method to reduce class imbalance and to lower the complexity of the algorithm; it samples multiple times, builds one meta-classifier per sampling, and uses ensemble learning to combine all meta-classifiers to improve classification performance.
Sampling big data: classification over an entire big data set is usually extremely difficult, and even when feasible its complexity is very high. The sampling method makes big data classification feasible and reduces the complexity of classification to linear, which is exactly the result expected of big data mining.
Sample size and number of samples: the sample size is obtained from theory, ensuring that the error between the result obtained after sampling and the result on the original data is minimized. Drawing multiple samples helps improve classification performance.
The one-vs-all classification method has been verified to be a very effective method for non-uniform data sets. Applying it to non-uniform big data classification solves, on the one hand, the non-uniform classification problem and, on the other hand, the high complexity of big data classification.
Meta-classifiers make classification on large data sets quicker, and ensemble learning effectively improves meta-classifier performance. Moreover, the forward greedy ensemble reduces the complexity of the classifier while improving the performance of the meta-classifiers, which strongly guarantees linear complexity when processing big data.
Specific embodiment
Embodiment 1
A given simulated big data set contains 2,000,000 instances, each of dimension 1000. The whole data set is divided into two classes: the first class contains 1,990,000 instances and the second class contains only 10,000. The data set is randomly generated and constitutes a two-class imbalanced big data classification problem.
(1) Set the confidence level to 99% and the limit of error to 1%, so the sample size of each data set of every class is 16641. Proportionally, 10,000 instances are extracted from class A (the data set containing 1,990,000 instances) and combined with the 10,000 instances of class non-A, so each data set contains 20,000 instances. An ordinary PC can generally classify a data set of 20,000 instances easily with a common meta-classifier.
(2) According to the above method, this example generates 10 sub-data-sets in total. 10 classifiers are built using the nearest-neighbor algorithm, with k set to 1 through 10 respectively.
(3) From these 10 classifiers, the 10 meta-classifiers are integrated into one ensemble classifier using the forward greedy ensemble method.
(4) A given test instance is classified with the ensemble classifier obtained above. If the classification result exceeds 50%, the test instance is judged to belong to class A; otherwise it belongs to class non-A.
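The decision rule in step (4) of this embodiment, the fraction of meta-classifiers voting for class A, can be sketched as follows (the helper names are illustrative and the meta-classifiers are stubs):

```python
def vote_fraction(meta_classifiers, instance):
    """Fraction of meta-classifiers voting for class A (label +1)."""
    votes = sum(1 for clf in meta_classifiers if clf(instance) == 1)
    return votes / len(meta_classifiers)

def classify_two_class(meta_classifiers, instance):
    """Class A if more than 50% of the meta-classifiers vote positive,
    otherwise class non-A."""
    return "A" if vote_fraction(meta_classifiers, instance) > 0.5 else "non-A"

# Toy usage: 7 of 10 stub meta-classifiers vote positive
metas = [lambda x: 1] * 7 + [lambda x: -1] * 3
label = classify_two_class(metas, None)  # "A"
```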
Embodiment 2
A given simulated big data set contains 20,000,000 instances, each of dimension 1000. The whole data set is divided into three classes: class A contains 12,000,000 instances, class B contains 7,900,000, and class C contains 100,000. The data set is randomly generated and constitutes a multi-class imbalanced big data classification problem.
(1) Set the confidence level to 95% and the limit of error to 1%, so the sample size of each data set of every class is 9604. Since an ordinary computer has some difficulty processing 300,000 data items, all three classes need to be sampled.
(2) Class A is sampled 10 times, and each data set contains 20,000 instances (note: it suffices that the number of instances exceeds 9604). Specifically, 10,000 samples are randomly drawn from class A, then 5000 samples from class B and 5000 samples from class C, yielding one sub-data-set of 20,000 samples. Repeating this sampling 10 times gives the 10 sub-data-sets of class A. By analogy, 10 sub-data-sets are sampled for each of class B and class C. In total, this process produces 30 sub-data-sets.
(3) For the 10 data sets of class A, the one-vs-all classification method is used, i.e., class A is one class and classes B and C together form the other class, and 10 classifiers are built from 10 meta-classifiers. The meta-classifiers are 9 nearest-neighbor classifiers with k from 1 to 9, plus one decision tree C5.0 classifier.
(4) In the same way, 10 meta-classifiers are built for class B and another 10 for class C.
(5) The 10 meta-classifiers of class A are integrated into one ensemble classifier using the forward greedy ensemble method. Likewise, the 10 meta-classifiers of class B and those of class C are each integrated into one ensemble classifier.
(6) A given test instance is classified with the three ensemble classifiers obtained above. If the classification result of the class A ensemble classifier is 85%, that of the class B ensemble classifier is 89%, and that of the class C ensemble classifier is 90%, the test instance is judged to belong to class C.
Claims (6)
1. A method for classifying non-uniform big data, comprising the steps of:
(1) obtaining the number of instances of each class of the big data, the classes being denoted m_i, i = 1, 2, ..., M;
(2) sampling D_i data sets for each class m_i using the downsampling method;
(3) building a meta-classifier for each data set;
(4) performing ensemble learning on the D_i classifiers of each class m_i;
(5) testing: classifying each instance in each class m_i; among the M results obtained, the class with the highest accuracy is the class of the test instance.
2. The method according to claim 1, wherein the data volume n_i of each data set in step (2) is determined by n_i = (t_{a/2}/(2ε))², where t_{a/2} is the critical value at the chosen confidence level, obtained from the t distribution, and ε is the set maximum allowable error.
3. The method according to claim 1, wherein in step (3) the method of building D_i meta-classifiers from the D_i data sets of each class m_i is selected from: the two-class method, the nearest-neighbor algorithm, the decision tree method, the neural network method, or the forest tree method.
4. The method according to claim 1, wherein in step (3) the method of building D_i meta-classifiers from the D_i data sets of each class m_i is the two-class method.
5. The method according to claim 1, wherein in step (4) the forward greedy ensemble method is used to perform ensemble learning on the D_i meta-classifiers of each class m_i to obtain one ensemble classifier.
6. The method according to claim 1, wherein in step (4) the detailed process of the forward greedy ensemble method is as follows:
d. building the candidate classifier set CCS = {C_1, ..., C_n} and the selected classifier set SCS = {};
e. among the classifiers C_i, choosing the one with the best accuracy, removing it from CCS and adding it to SCS;
f. adding each classifier C_j currently in CCS to SCS and validating; if the classification result exceeds the threshold specified in advance by the user, jumping to e and moving C_j from CCS to SCS, until CCS is the empty set; otherwise jumping to step (5);
whereby, for the M classes, M ensemble classifiers C_i, i = 1, ..., M, are built in total, each ensemble classifier containing n meta-classifiers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310452365.3A CN103500205B (en) | 2013-09-29 | 2013-09-29 | Non-uniform big data classifying method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310452365.3A CN103500205B (en) | 2013-09-29 | 2013-09-29 | Non-uniform big data classifying method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103500205A CN103500205A (en) | 2014-01-08 |
CN103500205B true CN103500205B (en) | 2017-04-12 |
Family
ID=49865415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310452365.3A Expired - Fee Related CN103500205B (en) | 2013-09-29 | 2013-09-29 | Non-uniform big data classifying method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103500205B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399413A (en) * | 2019-07-04 | 2019-11-01 | 博彦科技股份有限公司 | Sampling of data method, apparatus, storage medium and processor |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156029A (en) * | 2015-03-24 | 2016-11-23 | 中国人民解放军国防科学技术大学 | The uneven fictitious assets data classification method of multi-tag based on integrated study |
CN107193836B (en) * | 2016-03-15 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Identification method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101404009A (en) * | 2008-10-31 | 2009-04-08 | 金蝶软件(中国)有限公司 | Data classification filtering method, system and equipment |
CN103268336A (en) * | 2013-05-13 | 2013-08-28 | 刘峰 | Fast data and big data combined data processing method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9053537B2 (en) * | 2011-09-21 | 2015-06-09 | Tandent Vision Science, Inc. | Classifier for use in generating a diffuse image |
- 2013-09-29: CN CN201310452365.3A patent/CN103500205B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101404009A (en) * | 2008-10-31 | 2009-04-08 | 金蝶软件(中国)有限公司 | Data classification filtering method, system and equipment |
CN103268336A (en) * | 2013-05-13 | 2013-08-28 | 刘峰 | Fast data and big data combined data processing method and system |
Non-Patent Citations (2)
Title |
---|
Adapted One-versus-All Decision Trees for Data Stream Classification; Sattar Hashemi et al.; IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING; 2009-05-31; Vol. 21 (No. 5); pp. 624-637 *
Intrusion Detection Research Based on Bagging Support Vector Machine Ensemble; Gu Yu et al.; Microelectronics & Computer; 2005-12-31; Vol. 22 (No. 5); pp. 17-19 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399413A (en) * | 2019-07-04 | 2019-11-01 | 博彦科技股份有限公司 | Sampling of data method, apparatus, storage medium and processor |
Also Published As
Publication number | Publication date |
---|---|
CN103500205A (en) | 2014-01-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Triguero et al. | ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem | |
Wang et al. | How many software metrics should be selected for defect prediction? | |
CN102737126B (en) | Classification rule mining method under cloud computing environment | |
CN107292186A (en) | A kind of model training method and device based on random forest | |
Tsai et al. | Evolutionary instance selection for text classification | |
CN105913077A (en) | Data clustering method based on dimensionality reduction and sampling | |
CN102609714A (en) | Novel classifier based on information gain and online support vector machine, and classification method thereof | |
CN103500205B (en) | Non-uniform big data classifying method | |
CN110647995A (en) | Rule training method, device, equipment and storage medium | |
CN107577792A (en) | A kind of method and its system of business data automatic cluster | |
Li et al. | Scalable random forests for massive data | |
Luo et al. | A hybrid particle swarm optimization for high-dimensional dynamic optimization | |
CN103268346A (en) | Semi-supervised classification method and semi-supervised classification system | |
CN102567405A (en) | Hotspot discovery method based on improved text space vector representation | |
CN103207804A (en) | MapReduce load simulation method based on cluster job logging | |
CN102426598A (en) | Method for clustering Chinese texts for safety management of network content | |
Almunirawi et al. | A comparative study on serial decision tree classification algorithms in text mining | |
Saidi et al. | Feature selection using genetic algorithm for big data | |
Gupta et al. | Feature selection: an overview | |
CN113743004B (en) | Quantum Fourier transform-based full-element productivity calculation method | |
Li et al. | Pruning SMAC search space based on key hyperparameters | |
CN114880690A (en) | Source data time sequence refinement method based on edge calculation | |
Jiang et al. | Scaling deep phylogenetic embedding to ultra-large reference trees: a tree-aware ensemble approach | |
Song et al. | A dynamic ensemble framework for mining textual streams with class imbalance | |
CN111143560A (en) | Short text classification method, terminal equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20170412 |