CN103500205A - Non-uniform big data classifying method - Google Patents
Non-uniform big data classifying method
- Publication number: CN103500205A (application CN201310452365.3A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F16/35—Information retrieval of unstructured textual data; Clustering; Classification
- G06F16/90—Details of database functions independent of the retrieved data types
Abstract
The invention relates to a non-uniform big data classification method for data sets that are too large to fit in computer memory and whose classes are unevenly distributed. First, the sample size is determined theoretically by a downsampling method, and the number of classifiers is determined by the number of samples. An integrated (ensemble) classifier is then built for each category of the big data. To label a test case, the integrated classifiers of all categories classify it, and the category whose integrated classifier reports the highest classification score is taken as the category of the test case. The method has time complexity linear in the size of the big data and reduces the bias of non-uniform big data classification results toward the majority class; in addition, the integrated classifiers improve accuracy. The method is easy to implement, requiring only simple mathematical models in code.
Description
Technical field
The present invention relates to the fields of computer science and information technology, specifically to processing methods for big data, and in particular to a method for classifying non-uniform big data.
Background art
Big data refers to data collections whose content cannot be captured, managed and processed with conventional software tools under existing physical conditions. Big data has the following characteristics: Volume (large data quantity), Variety (many data types), Value (low value density) and Velocity (fast processing requirements), abbreviated 4V.
Current big data research generally falls into two broad classes. First, the challenge big data poses to system architecture. In the HADOOP clusters of many well-known web sites, raw data capacity now reaches tens of petabytes, contains redundancy, and must be scanned and updated every day. To guarantee that a single-node or single-rack failure does not affect operation, HADOOP usually adopts a three-replica policy, so cost must be considered along both the time dimension and the space dimension. Building efficient mechanisms that manage both massive numbers of small files and very large files, and that support the storage, management and access of structured, semi-structured and unstructured data, are therefore problems that must be considered. Second, the challenge big data poses to knowledge discovery and data mining algorithms. The first issue to face is the scalability of the algorithms. Classic data mining and machine learning algorithms such as KNN density estimation, non-parametric Bayes, support vector machines, Gaussian process regression and hierarchical clustering have at least quadratic complexity and therefore cannot be applied well in big data mining. More efficient algorithms, i.e. O(n log n) or O(n), need to be designed.
Judging from the large existing literature on big data mining, research on big data learning concentrates mainly on upgrading and improving classic methods in four respects: classification, clustering, retrieval and incremental (batch, online or parallel) learning. There is comparatively little research on processing non-uniform big data. As with other big data knowledge discovery problems, the first consideration in big data classification is the complexity of the algorithm. Second, applying an existing classification algorithm (which assumes the different classes are uniformly distributed) directly to non-uniform big data easily causes bias: the classification result leans toward the majority class, i.e. the class containing a very large proportion of the examples, for instance more than 90% in a two-class problem. Finally, for non-uniform (imbalanced) data classification, common algorithms usually pursue the minimum classification error while ignoring the unequal misclassification costs of the non-uniform classes.
Non-uniform big data classification is thus a very challenging problem, and a series of basic questions, such as where to start and how to use big data for intelligent activities, urgently awaits solutions.
Summary of the invention
The present invention studies the problem of classifying non-uniform big data.
The object of the present invention is to provide a simple and effective classification method for non-uniform big data. The method solves the bias problem to which big data classification is prone and the high complexity of big data algorithms. By downsampling the big data and using one-vs-all two-class classification, the method achieves non-uniform big data classification with linear complexity; by integrating the results of multiple classifiers (ensemble), it solves the bias problem, improves classification accuracy, and is robust, i.e. resistant to noise.
The concrete steps of the method are as follows:
(1) Obtain the number m_i of examples in each class of the big data, i = 1, 2, ..., M.
(2) Use the downsampling method to draw D_i sample data sets for each class m_i. The size n_i of each sample data set is determined by n_i = t_{α/2}² · p(1-p) / ε², where t_{α/2} is the value corresponding to the confidence level, obtainable from the critical values of the t distribution, ε is the maximum permissible error, and p(1-p) is taken at its worst case p = 0.5; this matches the sample sizes 16641 and 9604 computed in the embodiments below. In this way D_i sample data sets are drawn for each class m_i.
(3) For the D_i data sets of each class m_i, use the one-vs-all method (all examples of the current class form the positive class; all examples of the other classes form the negative class) to build D_i classifiers, one classifier per data set.
(4) Carry out ensemble learning on the D_i classifiers of each class m_i. According to ensemble learning theory, an integrated classifier can be formed from a set of meta classifiers combined under an integration principle. All meta classifiers should be fast and mutually independent, and the error rate of each classifier should not exceed 50%. Common classifiers of this kind, such as nearest neighbour, decision trees, neural networks or forest trees, meet these requirements. Common integration principles include bagging, adaboost and selective ensemble. In the present invention, the D_i classifiers obtained for each class m_i are combined by ensemble learning using the forward greedy ensemble method.
(5) Testing: classify each example against every class; among the M results, the class with the highest accuracy is taken as the category of the test case.
The target of step (2) is to solve the algorithm complexity problem: by downsampling, classifiers are built from part of the raw data rather than from all of it. To improve classification accuracy, a multiple-sampling strategy is adopted: sampling is repeated several times, each sample size satisfies the rule above, and the number of samplings is chosen by the user.
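As a minimal sketch of the sample-size rule of step (2) (SciPy is assumed; the function name and the use of the normal approximation to the t critical value are ours, not the patent's):

```python
import math
from scipy.stats import norm

def sample_size(confidence: float, epsilon: float, p: float = 0.5) -> int:
    """n = t_{alpha/2}^2 * p * (1 - p) / epsilon^2, with p = 0.5 as the
    conservative worst case for a proportion. For the large samples
    involved here the t critical value is approximated by the normal
    critical value."""
    t = norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    return math.ceil(t * t * p * (1 - p) / (epsilon * epsilon))

print(sample_size(0.95, 0.01))  # 9604, as in embodiment 2 (t = 1.96)
print(sample_size(0.99, 0.01))  # 16588; embodiment 1 rounds t to 2.58, giving 16641
```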
The concrete steps of the downsampling method of step (2) are as follows:
A. When each class m_i is sampled, the sample size must be no less than the prescribed value n_i above, and the number of samples equals the number of meta classifiers to be built. In generating a sample for a class, first obtain the number of examples of the current class. The present invention calls the current class "class A" and all other classes together "non-A". Then compare the magnitudes of class A and non-A. Write #(A), #(~A), #(R) and #(T) for the numbers of examples of class A, of non-A, of the data that fits in computer memory, and of the theoretically required sample, respectively. If #(A) >> #(R) and #(A) > #(T), extract from class A roughly as many examples as non-A contains; if #(~A) >> #(R) and #(~A) > #(T), extract from non-A roughly as many examples as class A contains.
B. Repeat the above process until D_i samples have been drawn for each class m_i. For simplicity, the present invention fixes D_i at n.
C. At this point the whole data set has generated D = M·n samples.
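A non-authoritative sketch of steps A-C (NumPy is assumed; the helper and variable names are illustrative, and sampling with replacement when a class holds fewer examples than half the subset is our assumption, not stated in the description):

```python
import numpy as np

def balanced_subsets(X, y, target_class, n_subsets, subset_size, seed=None):
    """Steps A-C of the downsampling method: draw n_subsets sub-datasets
    for one class, each mixing roughly equal numbers of examples from the
    current class ("class A") and from all other classes ("non-A").
    Labels are returned in one-vs-all form: 1 = class A, 0 = non-A."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == target_class)          # class A indices
    neg = np.flatnonzero(y != target_class)          # non-A indices
    half = subset_size // 2
    subsets = []
    for _ in range(n_subsets):                       # repeated sampling (step B)
        idx = np.concatenate([
            rng.choice(pos, size=half, replace=len(pos) < half),
            rng.choice(neg, size=half, replace=len(neg) < half),
        ])
        rng.shuffle(idx)
        subsets.append((X[idx], (y[idx] == target_class).astype(int)))
    return subsets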
Through step (2) the present invention obtains M·n samples, with n classifiers per group of data. Step (3) of the present invention builds, in total, n meta classifiers for the samples of each class m_i, as sketched below.
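A sketch of step (3), mirroring embodiment 1 where the meta classifiers are nearest-neighbour classifiers with k = 1, ..., n (scikit-learn is assumed; the description requires only one meta classifier per data set, so k-NN here is one admissible choice, not the only one):

```python
from sklearn.neighbors import KNeighborsClassifier

def train_meta_classifiers(subsets):
    """Step (3): build one meta classifier per sub-dataset. Varying k
    keeps the members fast, simple and mutually different, as required
    of meta classifiers in step (4)."""
    members = []
    for k, (X_sub, y_sub) in enumerate(subsets, start=1):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_sub, y_sub)
        members.append(clf)
    return members
```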
Step (4) of the present invention then combines the n meta classifiers thus obtained into one integrated classifier, using the forward greedy ensemble method. Its steps are as follows:
D. Build the candidate classifier set CCS = {C_1, ..., C_m} and the selected classifier set SCS = {}.
E. Among the classifiers C_i, choose the one with the best accuracy, remove it from CCS and add it to SCS.
F. Add each classifier C_j currently in CCS to SCS and verify it; if the classification result exceeds the threshold specified in advance by the user, jump to E and move C_j from CCS to SCS. Otherwise jump to step (5); the integrated classifier has then finished learning.
G. Repeat F until CCS is the empty set.
At this point, for the M classes, the present invention has built M integrated classifiers C_i, i = 1, ..., M, each comprising n meta classifiers.
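One reading of steps D-G as code (a sketch only: the majority-vote scoring, the held-out validation data and the exact stopping test against the user threshold are our assumptions, since the description leaves them open):

```python
import numpy as np

def ensemble_accuracy(classifiers, X, y):
    """Majority-vote accuracy of a set of binary (one-vs-all) classifiers."""
    votes = np.mean([c.predict(X) for c in classifiers], axis=0)
    return float(np.mean((votes > 0.5).astype(int) == y))

def forward_greedy_ensemble(members, X_val, y_val, threshold):
    """Steps D-G: greedily move classifiers from the candidate set CCS
    into the selected set SCS while the validation accuracy of the
    growing ensemble keeps exceeding the previous best and the
    user-specified threshold."""
    ccs = list(members)                       # candidate classifier set (CCS)
    scs = []                                  # selected classifier set (SCS)
    best = 0.0
    while ccs:
        # try each remaining candidate added to the current selection
        accs = [ensemble_accuracy(scs + [c], X_val, y_val) for c in ccs]
        j = int(np.argmax(accs))
        if scs and accs[j] <= max(best, threshold):
            break                             # learning finished (jump to step (5))
        best = accs[j]
        scs.append(ccs.pop(j))                # move C_j from CCS into SCS
    return scs
```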
The above steps guarantee that few classifiers are retained, which keeps the test process simple.
The non-uniform big data classification method implemented through the above steps has the following characteristics. First, because the downsampling method balances the example data of the classes as far as possible, the classification is effectively prevented from leaning toward the majority class. Second, classifying on samples keeps the complexity of the whole classification algorithm at most linear. Third, to avoid the loss of classification accuracy that sampling may cause, the present invention improves accuracy by two means: repeated sampling and the forward greedy ensemble method.
The present invention uses sampling to reduce class imbalance and to lower the complexity of the algorithm; it samples repeatedly, builds one meta classifier per sample, and combines all meta classifiers by ensemble learning to improve classification performance.
Sampling the big data: classifying over an entire big data set is usually very difficult, and even where feasible the complexity is very high. Sampling makes the classification of big data feasible and reduces its complexity to linear, exactly the result expected of big data mining.
Sample size and number of samples: the sample size is derived theoretically, guaranteeing that the error between the sampled result and the baseline result is minimal. Extracting several samples helps improve classification performance.
The one-vs-all classification method has been shown to be a very effective way of handling non-uniform data sets. Using it for non-uniform big data classification solves the non-uniform classification problem on the one hand, and the high complexity of big data classification on the other.
Meta classifiers make classification on large data sets faster, and ensemble learning effectively improves their performance. The forward greedy ensemble further reduces the complexity of the classifier while improving meta classifier performance, a strong guarantee of processing big data with linear complexity.
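The test rule of step (5), keeping the class whose integrated classifier scores highest, might look like the following sketch (the `ensembles` mapping from class label to selected meta classifiers is an illustrative assumption):

```python
import numpy as np

def predict(x, ensembles):
    """Step (5): score the test example with every class's integrated
    classifier and return the class whose ensemble reports the highest
    fraction of positive (one-vs-all) votes."""
    scores = {}
    for label, members in ensembles.items():
        votes = [clf.predict(x.reshape(1, -1))[0] for clf in members]
        scores[label] = float(np.mean(votes))
    # e.g. {'A': 0.85, 'B': 0.89, 'C': 0.90} -> 'C', as in embodiment 2
    return max(scores, key=scores.get)
```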
Embodiment
Embodiment 1
The given simulated big data instance contains 2,000,000 examples, each of dimension 1000. The whole data set divides into two classes: the first class contains 1,990,000 examples and the second class only 10,000. The data set is randomly generated and constitutes an imbalanced two-class big data classification problem.
(1) Choose a confidence level of 99% and an error limit of 1%. The sample size for each data set of each class is therefore 16641 (with t_{α/2} = 2.58: 2.58² × 0.25 / 0.01² = 16641). Extract 10,000 examples from class A (the data set containing 1,990,000 examples) and add the 10,000 examples of non-A, so that each data set comprises 20,000 examples. An ordinary PC can easily apply a common meta classifier to classify a data set of 20,000 examples.
(2) By this method the example generates 10 sub-data sets in total. Build 10 classifiers with the nearest neighbour algorithm, setting k to 1 through 10 respectively.
(3) Combine these 10 meta classifiers into one integrated classifier by the forward greedy ensemble method.
(4) For a given test case, classify it with the integrated classifier obtained above. If the classification score exceeds 50%, the test case is judged to belong to class A; otherwise it belongs to non-A.
Embodiment 2
The given simulated big data instance contains 20,000,000 examples, each of dimension 1000. The whole data set divides into three classes: class A contains 12,000,000 examples, class B 7,900,000 and class C 100,000. The data set is randomly generated and constitutes an imbalanced multi-class big data classification problem.
(1) Choose a confidence level of 95% and an error limit of 1%. The sample size for each data set of each class is 9604 (1.96² × 0.25 / 0.01² ≈ 9604). Because an ordinary computer has some difficulty processing even 300,000 records, all three classes must be sampled.
(2) Draw 10 samples for class A, each data set comprising 20,000 examples (note: the number of examples need only exceed 9604). Specifically, first randomly draw 10,000 examples from class A, then 5,000 from class B and 5,000 from class C; this yields one sub-data set of 20,000 examples. Repeat this sampling 10 times to obtain the 10 sub-data sets of the class A sample. Likewise draw 10 sub-data sets each for class B and class C. In total this process generates 30 sub-data sets.
(3) For the 10 data sets of class A, adopt the one-vs-all classification method: class A is one class, and classes B and C together form the other. Build 10 classifiers from 10 meta classifiers: nine nearest-neighbour classifiers with k from 1 to 9, and one C5.0 decision tree classifier.
(4) In the same way build 10 meta classifiers for class B and 10 for class C.
(5) Combine the 10 meta classifiers of class A into one integrated classifier by the forward greedy ensemble method; likewise combine the 10 meta classifiers of class B and those of class C, yielding one integrated classifier for each.
(6) For a given test case, classify it with the three integrated classifiers obtained above. If the classification score of the class A integrated classifier is 85%, that of class B 89% and that of class C 90%, the test case is judged to belong to class C.
Claims (7)
1. A method for classifying non-uniform big data, comprising the steps of:
(1) obtaining the number m_i of examples in each class of the big data, i = 1, 2, ..., M;
(2) using the downsampling method to draw D_i sample data sets for each class m_i;
(3) building one meta classifier for each data set;
(4) carrying out ensemble learning on the D_i classifiers of each class m_i;
(5) testing: classifying each example against each class m_i, the class with the highest accuracy among the M results obtained being the category of the test case.
2. The method according to claim 1, wherein the size n_i of each data set in step (2) is determined by n_i = t_{α/2}² · p(1-p) / ε², where t_{α/2} is the value corresponding to the confidence level, obtained from the critical values of the t distribution, ε is the maximum permissible error that has been set, and p is taken as 0.5.
3. The method according to claim 1 or 2, wherein the detailed process of step (2) is as follows:
A. the current class is called class A and all other classes together non-A; the magnitudes of class A and non-A are then compared; writing #(A), #(~A), #(R) and #(T) respectively for the numbers of examples of class A, of non-A, of the data that fits in computer memory, and of the theoretically required sample: if #(A) >> #(R) and #(A) > #(T), roughly as many examples as non-A contains are extracted from class A; if #(~A) >> #(R) and #(~A) > #(T), roughly as many examples as class A contains are extracted from non-A;
B. the above process is repeated until D_i samples have been drawn for each class m_i, with D_i fixed at n;
C. the whole data set generates D = M·n sub-data sets.
4. The method according to claim 1, wherein in step (3) the method used to build the D_i meta classifiers for the D_i data sets of each class m_i is selected from: two-class classification, nearest neighbour, decision tree, neural network or forest tree methods.
5. The method according to claim 1 or 4, wherein in step (3) the method used to build the D_i meta classifiers for the D_i data sets of each class m_i is two-class classification.
6. The method according to claim 1, wherein in step (4) the D_i meta classifiers of each class m_i undergo ensemble learning by the forward greedy ensemble method, yielding one integrated classifier.
7. The method according to claim 1 or 6, wherein in step (4) the detailed process of the forward greedy ensemble method is as follows:
D. build the candidate classifier set CCS = {C_1, ..., C_m} and the selected classifier set SCS = {};
E. among the classifiers C_i, choose the one with the best accuracy, remove it from CCS and add it to SCS;
F. add each classifier C_j currently in CCS to SCS and verify it; if the classification result exceeds the threshold specified in advance by the user, jump to E and move C_j from CCS to SCS; otherwise jump to step (5);
G. repeat F until CCS is the empty set;
whereby, for the M classes, M integrated classifiers C_i, i = 1, ..., M, are built in total, each comprising n meta classifiers.
Priority Application (1)
- CN201310452365.3A (CN103500205B, Non-uniform big data classifying method), priority and filing date 2013-09-29

Publications (2)
- CN103500205A, published 2014-01-08
- CN103500205B, granted 2017-04-12

Family
- ID=49865415
- CN201310452365.3A, filed 2013-09-29, granted as CN103500205B, status: Expired - Fee Related
Patent Citations (3)
- CN101404009A, 金蝶软件(中国)有限公司, priority 2008-10-31, published 2009-04-08: Data classification filtering method, system and equipment
- US20130071033A1, Tandent Vision Science, Inc., priority 2011-09-21, published 2013-03-21: Classifier for use in generating a diffuse image
- CN103268336A, 刘峰, priority 2013-05-13, published 2013-08-28: Fast data and big data combined data processing method and system

Non-Patent Citations (2)
- Sattar Hashemi et al., "Adapted One-versus-All Decision Trees for Data Stream Classification", IEEE Transactions on Knowledge and Data Engineering
- 谷雨 et al., "基于Bagging支持向量机集成的入侵检测研究" (Research on intrusion detection based on Bagging SVM ensembles), 《微电子学与计算机》 (Microelectronics & Computer)

Cited By (3)
- CN106156029A, 中国人民解放军国防科学技术大学, priority 2015-03-24, published 2016-11-23: Multi-label imbalanced virtual-asset data classification method based on ensemble learning
- CN107193836A, 腾讯科技(深圳)有限公司, priority 2016-03-15, published 2017-09-22, granted as CN107193836B 2021-08-10: Identification method and device
- CN110399413A, 博彦科技股份有限公司, priority 2019-07-04, published 2019-11-01: Data sampling method, apparatus, storage medium and processor
Legal Events
- C06 / PB01: Publication
- C10 / SE01: Entry into substantive examination
- GR01: Patent grant
- CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 2017-04-12)