CN103500205A - Non-uniform big data classifying method - Google Patents

Non-uniform big data classifying method

Info

Publication number
CN103500205A
CN103500205A (application CN201310452365.3A)
Authority
CN
China
Prior art keywords
category
class
data
classifier
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310452365.3A
Other languages
Chinese (zh)
Other versions
CN103500205B (en)
Inventor
Zhu Xiaofeng (朱晓峰)
Zhang Shichao (张师超)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201310452365.3A priority Critical patent/CN103500205B/en
Publication of CN103500205A publication Critical patent/CN103500205A/en
Application granted granted Critical
Publication of CN103500205B publication Critical patent/CN103500205B/en
Expired - Fee Related (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)

Abstract

The invention relates to a non-uniform big data classification method for classifying data sets that cannot be classified within computer memory and whose class distribution is non-uniform. First, the sample size is determined theoretically for a downsampling procedure, and the number of classifiers is determined by the number of samples. An ensemble classifier is built for each class of the big data. When a test instance is classified, the ensemble classifiers of all classes are applied, and the class whose ensemble classifier yields the highest classification score is taken as the class of the test instance. The method has time complexity linear in the size of the big data and reduces the bias of non-uniform big data classification results toward the majority class. Furthermore, the ensemble classifiers improve accuracy. The method is easy to implement, and its code involves only simple mathematical models.

Description

Non-uniform big data classification method
Technical field
The present invention relates to the fields of computer science and technology and information technology, in particular to methods for processing big data, and more particularly to a method for classifying non-uniform big data.
Background art
Big data refers to data collections whose content cannot be captured, managed and processed with conventional software tools under existing physical conditions. Big data has the following characteristics: Volume (large data volume), Variety (diverse data types), Value (low value density) and Velocity (high processing speed), abbreviated as the 4Vs.
Current big data research generally falls into two broad classes. First, the challenge big data poses to system architecture. In the HADOOP clusters of many well-known web sites, the raw data capacity now reaches tens of PB, contains redundancy, and must be scanned and updated every day. To ensure that a single-node or single-rack failure does not affect operation, HADOOP usually adopts a 3-replica policy, so data cost must be considered in both the time and space dimensions. Therefore, building efficient mechanisms for managing both massive small files and large files, and simultaneously supporting the storage, management and access of structured, unstructured and semi-structured data, are all problems that must be considered. Second, the challenge big data poses to knowledge discovery and mining algorithms. The first issue to face is the scalability of algorithms. Classic data mining and machine learning algorithms such as KNN density estimation, nonparametric Bayes, support vector machines, Gaussian process regression and hierarchical clustering have complexity that is at least quadratic, so none of them can be applied well in big data mining. More efficient algorithms, i.e. O(n log n) or O(n), therefore need to be designed.
Judging from the large body of literature on big data mining, research on big data learning mainly concentrates on upgrading and improving the classic methods in four areas: classification, clustering, retrieval and incremental (batch, online or parallel) learning. Research on processing the non-uniform big data problem is currently scarce. As with other big data knowledge discovery problems, the big data classification problem must first consider the complexity of the algorithm. Second, directly applying existing classification algorithms (which assume that the class distribution of the data is uniform) to non-uniform big data easily causes bias, i.e. the classification results favor the majority class (the class containing a very large proportion of the instances, e.g. more than 90% in a two-class problem). Finally, for non-uniform (imbalanced) data classification problems, common algorithms usually pursue a minimum classification error while ignoring the misclassification cost of the non-uniform classes.
Non-uniform big data classification is thus a very challenging problem, and a series of basic questions, such as where to start and how to use big data for intelligent activities, remain to be solved urgently.
Summary of the invention
The present invention addresses the problem of classifying non-uniform big data.
The object of the present invention is to provide a simple and effective method for classifying non-uniform big data. The method solves both the bias problem to which big data classification is prone and the high complexity of big data algorithms. By downsampling the big data and using two-class (one-vs-all) classification, the method achieves non-uniform big data classification with linear complexity; by integrating the results of multiple classifiers (ensemble), it solves the bias problem, improves classification accuracy, and is robust, i.e. noise-tolerant.
The specific steps of the method are as follows:
(1) Obtain the number m_i of instances of each class of the big data, i = 1, 2, ..., M;
(2) Adopt the downsampling method to draw D_i data sets for each class m_i. The size n_i of each data set is determined by

n_i = (t_{α/2} / (2ε))^2,

where t_{α/2} denotes the critical value of the t distribution at the given confidence level and ε denotes the maximum permissible error. (This formula reproduces the figures used in the embodiments: with critical value 2.58 and ε = 1% it gives 16641, and with 1.96 and 1% it gives 9604.) By this sampling method, D_i samples are drawn for each class m_i.
(3) For the D_i data sets of each class m_i, build D_i classifiers with the one-vs-all method (all instances of the current class form the positive class and all instances of the other classes form the negative class), one classifier per data set.
(4) Perform ensemble learning on the D_i classifiers of each class m_i. According to ensemble learning theory, an ensemble classifier can be formed from a set of meta-classifiers combined according to an integration principle. All meta-classifiers should classify quickly, be mutually independent, and each have an error rate no higher than 50%. Common classifiers such as the nearest neighbor algorithm, the decision tree method, neural networks or the forest tree method meet these requirements. Common integration principles include bagging, adaboost and selective ensemble. In the present invention, the D_i classifiers obtained for each class m_i are combined by the forward greedy ensemble method.
(5) Test: classify each test instance within each class; the class whose result has the highest accuracy among the M classes is taken as the class of the test instance.
The goal of step (2) is to solve the algorithm complexity problem: by downsampling, classifiers are built from part of the raw data rather than all of it. To improve classification accuracy, a multiple-sampling strategy is adopted; the sample size of each sampling meets the above rule, and the number of samplings is determined by the user. A sketch of the sample-size computation is given below.
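For illustration only, the following minimal Python sketch (not part of the patent) computes the step (2) sample-size rule, approximating the t critical value by the normal quantile; the function name and library choice are assumptions.

```python
# Hedged sketch of the step (2) sample-size rule n_i = (t_{a/2} / (2*eps))^2.
# Assumes the large-sample case, where the t critical value is well
# approximated by the normal quantile; all names are illustrative.
from scipy.stats import norm

def required_sample_size(confidence: float, eps: float) -> int:
    """Smallest n_i meeting the confidence/maximum-error bound of step (2)."""
    t_crit = norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value
    return int(round((t_crit / (2 * eps)) ** 2))  # e.g. (1.96/0.02)^2 = 9604

print(required_sample_size(0.95, 0.01))  # 9604, matching embodiment 2
print(required_sample_size(0.99, 0.01))  # ~16587; the patent rounds 2.58 -> 16641
```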
The specific steps of the downsampling method of step (2) of the present invention are as follows:
A. When sampling each class m_i, the sample size must be no less than the theoretical minimum n_i defined above, and the number of samples equals the number of meta-classifiers to be built. In generating the samples of each class, first obtain the number of instances of the current class. The present invention calls the current class "class A" and collectively calls all other classes "non-class A". Then the magnitudes of class A and non-class A are compared. Denote by #(A), #(~A), #(R) and #(T) respectively the number of class A instances, the number of non-class A instances, the data capacity of the computer memory and the theoretically required sample size. If #(A) >> #(R) and #(A) > #(T), extract from class A about as many instances as there are in non-class A; if #(~A) >> #(R) and #(~A) > #(T), extract from non-class A about as many instances as there are in class A.
B. Repeat the above process until D_i samples have been drawn for each class m_i. For simplicity, the present invention fixes D_i to n.
C. At this point the whole data set has generated D = M*n samples. A sketch of this procedure is given below.
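As a non-authoritative illustration, the sketch below implements steps A-C under simplifying assumptions (the whole data set fits in a NumPy array, and "about as many instances" is taken as an equal split); all names are invented for the example.

```python
# Illustrative sketch of steps A-C: for each class, draw n balanced
# one-vs-all sub-datasets by downsampling. Assumes X (features) and y
# (labels) fit in memory; in the patent's setting only the drawn
# sub-datasets would be materialized.
import numpy as np

def downsample_subsets(X, y, n_subsets, subset_size, seed=0):
    rng = np.random.default_rng(seed)
    subsets = {}
    for label in np.unique(y):
        pos = np.flatnonzero(y == label)   # class A (current class)
        neg = np.flatnonzero(y != label)   # non-class A (all other classes)
        half = subset_size // 2
        subsets[label] = []
        for _ in range(n_subsets):         # step B: repeat n times per class
            p = rng.choice(pos, size=min(half, pos.size), replace=False)
            q = rng.choice(neg, size=min(half, neg.size), replace=False)
            idx = np.concatenate([p, q])   # step A: balanced A / non-A draw
            subsets[label].append((X[idx], (y[idx] == label).astype(int)))
    return subsets                         # step C: M*n sub-datasets overall
```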
Through step (2) the present invention obtains M*n samples, each class having a group of n data sets. Step (3) of the present invention builds one meta-classifier on each sample of each class m_i, i.e. n meta-classifiers per class;
Step (4) of the present invention then integrates the n meta-classifiers obtained into one ensemble classifier, adopting the forward greedy ensemble method. Its steps are as follows:
D. Build the candidate classifier set CCS = {C_1, ..., C_m} and the selected classifier set SCS = {};
E. From the classifiers C_i in CCS, choose the classifier with the best accuracy, remove it from CCS and add it to SCS;
F. Add each classifier C_j currently in CCS to SCS and validate; if the classification result exceeds the threshold specified in advance by the user, jump to E, moving C_j from CCS to SCS. Otherwise jump to step (5); at this point the ensemble classifier has finished learning;
G. Repeat F until CCS is an empty set.
So far, for the M classes, the present invention has built M ensemble classifiers C_i, i = 1, ..., M in total; each ensemble classifier comprises n meta-classifiers. A sketch of the selection procedure is given below.
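The following is a hedged sketch of steps D-G; the majority-vote accuracy measure, validation-set interface and stopping details are assumptions filled in for illustration, since the patent leaves them to the user.

```python
# Sketch of the forward greedy ensemble (steps D-G): greedily move
# classifiers from the candidate set CCS to the selected set SCS while the
# validated ensemble result exceeds the user-specified threshold.
import numpy as np

def forward_greedy_ensemble(classifiers, X_val, y_val, threshold):
    def acc(members):  # majority-vote accuracy of a set of classifiers
        votes = np.mean([c.predict(X_val) for c in members], axis=0)
        return np.mean((votes >= 0.5) == y_val)

    ccs = list(classifiers)                       # step D: candidate set
    scs = []                                      # step D: selected set
    best = max(ccs, key=lambda c: acc([c]))       # step E: most accurate
    ccs.remove(best); scs.append(best)
    progressing = True
    while ccs and progressing:                    # step G: until CCS empty
        progressing = False
        for c in list(ccs):                       # step F: try candidates
            if acc(scs + [c]) > threshold:        # exceeds user threshold
                ccs.remove(c); scs.append(c)      # move C_j into SCS
                progressing = True
                break                             # jump back as in step F
        # else: ensemble learning is finished; proceed to testing (step 5)
    return scs
```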
The above steps ensure that fewer classifiers are retained, which makes the testing process simpler.
The non-uniform big data classification method implemented by the above steps has the following characteristics. First, because the downsampling method balances the instance data of each class as far as possible, the classification process effectively avoids the problem of results being biased toward the majority class. Second, classifying with the sampling method makes the complexity of the whole classification algorithm at most linear. Third, to prevent sampling from reducing classification accuracy, the present invention improves accuracy by two means: the multiple-sampling method and the forward greedy ensemble method.
The present invention uses the sampling method to reduce imbalanced classification and to reduce the complexity of the algorithm; it samples repeatedly, builds one meta-classifier per sampling, and combines all meta-classifiers by ensemble learning to improve classification performance.
Sampling the big data: classifying over the whole of a big data set is usually very difficult, and even when feasible its complexity is very high. The sampling method makes classification of big data feasible and reduces its complexity to linear, which is exactly the result expected of big data mining.
Sample size and number of samples: the sample size is obtained theoretically, guaranteeing that the error between the sampled result and the baseline result is minimal. Drawing multiple samples helps improve classification performance;
The one-vs-all classification method has been proved to be a very effective way of handling non-uniform data sets. The present invention applies this method to non-uniform big data classification; it solves the non-uniform classification problem on the one hand and the high complexity of big data classification on the other. A toy illustration follows;
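A toy illustration of the one-vs-all relabeling (an assumed example, not from the patent):

```python
# One-vs-all relabeling used in step (3): the current class is the positive
# class, all remaining classes together form the negative class.
import numpy as np

y = np.array([0, 2, 1, 0, 2, 2])       # assumed multi-class labels
current = 2                             # class treated as "class A"
y_binary = (y == current).astype(int)   # 1 = class A, 0 = non-class A
print(y_binary)                         # [0 1 0 0 1 1]
```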
Meta-classifiers make classification on large data sets faster, and ensemble learning effectively improves the performance of the meta-classifiers. Moreover, the forward greedy ensemble reduces the complexity of the classifier while improving the meta-classifiers' performance, which is a strong guarantee that big data is processed with linear complexity.
Embodiments
Embodiment 1
The given simulated big data set contains 2,000,000 instances, each of dimension 1000. The whole data set is divided into two classes: the first class contains 1,990,000 instances and the second class only 10,000. The data set is generated randomly and constitutes an imbalanced big data two-class classification problem.
(1) Set the confidence level to 99% and the limit of error to 1%, so the sample size of each data set of each class is 16641. Proportionally extract 10,000 instances from class A (the data set containing 1,990,000 instances) and add the 10,000 instances of non-class A, so that each data set comprises 20,000 instances. An ordinary PC can usually apply a common meta-classifier to classify a data set of 20,000 instances with ease.
(2) According to the above method, this example generates 10 sub-data sets in total. Use the nearest neighbor algorithm to build 10 classifiers, with k set to 1 through 10 respectively.
(3) Combine these 10 meta-classifiers with the forward greedy ensemble method, assembling one ensemble classifier.
(4) For a given test instance, classify it with the ensemble classifier obtained above. If the classification result exceeds 50%, the test instance is judged to belong to class A; otherwise it belongs to non-class A. The sketch below compresses this embodiment.
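The following compressed sketch of Embodiment 1 shrinks the data by roughly a factor of 1000 so it runs quickly, with scikit-learn's nearest-neighbor classifier standing in for the meta-classifiers and simple vote averaging standing in for the forward greedy ensemble step; the synthetic Gaussian data is an assumption made for illustration.

```python
# Compressed sketch of Embodiment 1 (sizes shrunk ~1000x to run quickly).
# Simple vote averaging stands in for the forward greedy ensemble step.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 1.0, (1990, 10))    # class A, the majority class
X_b = rng.normal(0.5, 1.0, (10, 10))      # non-class A, the minority class

classifiers = []
for k in range(1, 11):                    # one balanced sub-dataset per k
    idx = rng.choice(len(X_a), size=10, replace=False)
    X = np.vstack([X_a[idx], X_b])        # ~equal A / non-A instances
    y = np.array([1] * 10 + [0] * 10)     # 1 = class A, 0 = non-class A
    classifiers.append(KNeighborsClassifier(n_neighbors=k).fit(X, y))

x_test = rng.normal(0.0, 1.0, (1, 10))
score = np.mean([c.predict(x_test)[0] for c in classifiers])
print("class A" if score > 0.5 else "non-class A")  # the 50% rule of step (4)
```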
Embodiment 2
The given simulated big data set contains 20,000,000 instances, each of dimension 1000. The whole data set is divided into three classes: class A contains 12,000,000 instances, class B 7,900,000 and class C 100,000. The data set is generated randomly and constitutes an imbalanced big data multi-class classification problem.
(1) Set the confidence level to 95% and the limit of error to 1%; the sample size of each data set of each class is 9604. Because a general computer has some difficulty processing even 300,000 data items, all three classes need to be sampled.
(2) Draw 10 samples for class A, each data set comprising 20,000 instances (note: the number of instances need only exceed 9604). Specifically, first randomly draw 10,000 instances from class A, then 5,000 from class B and 5,000 from class C, obtaining one sub-data set of 20,000 instances. Repeat this sampling 10 times to obtain the 10 sub-data sets of the class A samples. By analogy, draw 10 sub-data sets each for class B and class C. In total, this process generates 30 sub-data sets.
(3) For the 10 data sets of class A, adopt the one-vs-all classification method, taking class A as one class and classes B and C together as the other, and build 10 classifiers using 10 meta-classifiers: 9 nearest neighbor classifiers with k from 1 to 9, and one C5.0 decision tree classifier.
(4) In the same way, build 10 meta-classifiers for class B and 10 meta-classifiers for class C.
(5) Combine the 10 meta-classifiers of class A with the forward greedy ensemble method, assembling one ensemble classifier. Likewise combine the 10 meta-classifiers of class B and of class C, obtaining one ensemble classifier for each.
(6) For a given test instance, classify it with the three ensemble classifiers obtained above. If the classification result of the class A ensemble classifier is 85%, that of the class B ensemble classifier is 89% and that of the class C ensemble classifier is 90%, the test instance is judged to belong to class C.

Claims (7)

1. A method for classifying non-uniform big data, comprising the steps of:
(1) obtaining the number m_i of instances of each class of the big data, i = 1, 2, ..., M;
(2) adopting the downsampling method to draw D_i data sets for each class m_i;
(3) building one meta-classifier for each data set;
(4) performing ensemble learning on the D_i classifiers of each class m_i;
(5) testing: classifying each instance within each class m_i, the class with the highest accuracy among the M results obtained being the class of the test instance.
2. The method according to claim 1, wherein the data volume n_i of each data set in step (2) is determined by n_i = (t_{α/2} / (2ε))^2,
where t_{α/2} denotes the critical value of the t distribution at the given confidence level and ε denotes the maximum permissible error that is set.
3. The method according to claim 1 or 2, wherein the detailed process of step (2) is as follows:
A. taking the current class as class A and collectively calling the other classes non-class A; then comparing the magnitudes of class A and non-class A; denoting by #(A), #(~A), #(R) and #(T) respectively the number of class A instances, the number of non-class A instances, the data capacity of the computer memory and the theoretically required sample size; if #(A) >> #(R) and #(A) > #(T), extracting from class A about as many instances as there are in non-class A; if #(~A) >> #(R) and #(~A) > #(T), extracting from non-class A about as many instances as there are in class A;
B. repeating the above process until D_i data sets are drawn for each class m_i, with D_i fixed to n;
C. the whole data set generating D = M*n sub-data sets.
4. The method according to claim 1, wherein in step (3) the method of building D_i meta-classifiers for the D_i data sets of each class m_i is selected from: two-class classification, the nearest neighbor algorithm, the decision tree method, neural networks or the forest tree method.
5. The method according to claim 1 or 4, wherein in step (3) the method of building D_i meta-classifiers for the D_i data sets of each class m_i is two-class classification.
6. The method according to claim 1, wherein in step (4) the forward greedy ensemble method is used to perform ensemble learning on the D_i meta-classifiers of each class m_i, obtaining one ensemble classifier.
7. The method according to claim 1 or 6, wherein in step (4) the detailed process of the forward greedy ensemble method is as follows:
D. building the candidate classifier set CCS = {C_1, ..., C_m} and the selected classifier set SCS = {};
E. from the classifiers C_i in CCS, choosing the classifier with the best accuracy, removing it from CCS and adding it to SCS;
F. adding each classifier C_j currently in CCS to SCS and validating; if the classification result exceeds the threshold specified in advance by the user, jumping to E and moving C_j from CCS to SCS; otherwise jumping to step (5);
G. repeating F until CCS is an empty set,
whereby, for the M classes, M ensemble classifiers C_i, i = 1, ..., M are built in total, each ensemble classifier comprising n meta-classifiers.
CN201310452365.3A 2013-09-29 2013-09-29 Non-uniform big data classifying method Expired - Fee Related CN103500205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310452365.3A CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310452365.3A CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Publications (2)

Publication Number Publication Date
CN103500205A true CN103500205A (en) 2014-01-08
CN103500205B CN103500205B (en) 2017-04-12

Family

ID=49865415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310452365.3A Expired - Fee Related CN103500205B (en) 2013-09-29 2013-09-29 Non-uniform big data classifying method

Country Status (1)

Country Link
CN (1) CN103500205B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156029A (zh) * 2015-03-24 2016-11-23 National University of Defense Technology of the Chinese People's Liberation Army (中国人民解放军国防科学技术大学) Multi-label imbalanced virtual asset data classification method based on ensemble learning
CN107193836A (zh) * 2016-03-15 2017-09-22 Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司) Identification method and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399413A (zh) * 2019-07-04 2019-11-01 Beyondsoft Corporation (博彦科技股份有限公司) Data sampling method, apparatus, storage medium and processor

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404009A (zh) * 2008-10-31 2009-04-08 Kingdee Software (China) Co., Ltd. (金蝶软件(中国)有限公司) Data classification filtering method, system and equipment
US20130071033A1 (en) * 2011-09-21 2013-03-21 Tandent Vision Science, Inc. Classifier for use in generating a diffuse image
CN103268336A (zh) * 2013-05-13 2013-08-28 Liu Feng (刘峰) Fast data and big data combined data processing method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404009A (zh) * 2008-10-31 2009-04-08 Kingdee Software (China) Co., Ltd. (金蝶软件(中国)有限公司) Data classification filtering method, system and equipment
US20130071033A1 (en) * 2011-09-21 2013-03-21 Tandent Vision Science, Inc. Classifier for use in generating a diffuse image
CN103268336A (zh) * 2013-05-13 2013-08-28 Liu Feng (刘峰) Fast data and big data combined data processing method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SATTAR HASHEMI et al.: "Adapted One-versus-All Decision Trees for Data Stream Classification", IEEE Transactions on Knowledge and Data Engineering *
GU Yu et al.: "Intrusion detection research based on Bagging support vector machine ensemble" (基于Bagging支持向量机集成的入侵检测研究), Microelectronics & Computer (微电子学与计算机) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156029A (zh) * 2015-03-24 2016-11-23 National University of Defense Technology of the Chinese People's Liberation Army (中国人民解放军国防科学技术大学) Multi-label imbalanced virtual asset data classification method based on ensemble learning
CN107193836A (zh) * 2016-03-15 2017-09-22 Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司) Identification method and device
CN107193836B (en) * 2016-03-15 2021-08-10 腾讯科技(深圳)有限公司 Identification method and device

Also Published As

Publication number Publication date
CN103500205B (en) 2017-04-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170412