CN103500205A - Non-uniform big data classifying method - Google Patents
- Publication number
- CN103500205A (application number CN201310452365.3A)
- Authority
- CN
- China
- Prior art keywords
- class
- classifier
- classification
- classifiers
- big data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/90—Details of database functions independent of the retrieved data types
Abstract
The invention provides a classification method for non-uniform big data, i.e., for datasets that cannot be classified within computer memory and whose classes are unevenly distributed. First, downsampling is used to determine the sample size according to statistical theory, and the number of classifiers is determined by the number of samples drawn. An ensemble classifier is then built for each class of the big data. To test an instance, it is classified by the ensemble classifier of every class, and the class whose ensemble classifier reports the highest classification rate is taken as the class of the test instance. The method classifies big data in linear time, reduces the bias of classification results on non-uniform big data, and improves classifier accuracy through the design of the ensemble classifiers. Moreover, the invention is easy to implement: coding it involves only a few simple mathematical models.
Description
Technical Field
The present invention relates to the fields of computer science and information technology, specifically to big data, and in particular to a method for classifying non-uniform big data.
Background Art
Big data refers to data collections whose content cannot be captured, managed, and processed with conventional software tools under existing physical conditions and within an acceptable time. Big data has the following characteristics, commonly abbreviated as the 4Vs: Volume (large amount of data), Variety (many data types), Value (low value density), and Velocity (high processing speed).
Current big data research generally falls into two categories. First, the challenge big data poses to system architecture. The HADOOP clusters of many well-known websites now hold tens of petabytes of raw data, with redundancy, that must be scanned and updated every day. In addition, to ensure that the failure of a single node or a single rack does not affect operation, HADOOP typically adopts a three-replica strategy, so data costs must be weighed in both the time and the space dimension. Building an efficient mechanism in which large-scale small-file management and large-file management coexist, while supporting the storage, management, and access of structured, semi-structured, and unstructured data, is therefore a necessary consideration. Second, the challenge big data poses to knowledge discovery and mining algorithms. The first issue is algorithm scalability. Classic data mining and machine learning algorithms such as KNN density estimation, nonparametric Bayes, support vector machines, Gaussian process regression, and hierarchical clustering have at least quadratic complexity and therefore cannot be applied well to big data mining. More efficient algorithms, i.e., O(n log n) or O(n), are needed.
Judging from the extensive literature on big data mining, research on big data learning has mainly focused on upgrading and improving traditional methods in four areas: partitioning, clustering, retrieval, and incremental (batch, online, or parallel) learning. There has been comparatively little work on non-uniform big data. As with other research on big data knowledge discovery, the first consideration in big data classification is algorithmic complexity. Second, existing classification algorithms (which assume the classes of the data are uniformly distributed) tend to produce bias when applied directly to non-uniform big data: the classification results favor the majority class, i.e., the class containing a very large proportion of the instances (for example, more than 90% in a two-class problem). Finally, when common algorithms are applied to non-uniform (imbalanced) classification, they usually minimize the classification error while ignoring the cost of misclassifying the minority classes.
Classifying non-uniform big data is thus an extremely challenging problem, raising a series of basic questions that urgently need answers: where to start, how to use big data for intelligent activities, and so on.
Summary of the Invention
The present invention addresses the problem of classifying non-uniform big data.
The purpose of the invention is to provide a simple and effective method for classifying non-uniform big data. The method addresses the bias that big data classification is prone to, as well as the high complexity of big data algorithms. Specifically, by downsampling the big data and decomposing it into two-class (one-vs-all) problems, the method achieves non-uniform big data classification with linear complexity; by combining the results of multiple classifiers (an ensemble), it mitigates bias, improves classification accuracy, and gains robustness, i.e., resistance to noise.
The specific steps of the method are as follows:
(1) Obtain the number of instances mi of each class in the big data, i = 1, 2, ..., M;
(2) Use the downsampling method to draw Di sample datasets for each class mi. The size ni of each sampled dataset is determined by ni = (ta/2)^2 / (4ε^2), where ta/2 is the critical value of the t distribution at the chosen confidence level and ε is the maximum allowable error. In this way Di datasets are sampled for each class mi (a worked computation of ni is sketched after this step list).
(3) For each of the Di datasets of class mi, build a classifier using the one-vs-all method (all instances of the current class form the positive class; all instances of the other classes form the negative class), yielding Di classifiers, one per dataset.
(4) Perform ensemble learning on the Di classifiers of each class mi. According to ensemble learning theory, an ensemble classifier can be assembled from multiple meta-classifiers following an ensemble principle. All meta-classifiers should classify quickly and be mutually independent, and the error rate of each should be no higher than 50%. Common classifiers such as nearest neighbors, decision trees, neural networks, or forest trees satisfy these requirements. Typical ensemble principles include bagging, adaboost, and selective ensemble. In the present invention, the Di classifiers obtained for each class mi are combined by the forward greedy ensemble method.
(5) Testing: classify each test instance against every class; the class whose ensemble yields the highest accuracy among the M results is taken as the class of the test instance.
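As a worked illustration of the sample-size rule in step (2), the following Python sketch (an illustrative assumption of this description, not part of the claimed method) computes ni from the confidence level and the maximum allowable error, using the normal approximation to the t critical value:

```python
from scipy.stats import norm

def sample_size(confidence: float, epsilon: float) -> int:
    """Worst-case (p = 0.5) sample size: n = t^2 / (4 * eps^2), where t is the
    two-sided critical value for the given confidence level."""
    t = norm.ppf(1 - (1 - confidence) / 2)  # e.g. 1.96 for 95%, 2.576 for 99%
    return round(t ** 2 / (4 * epsilon ** 2))

print(sample_size(0.95, 0.01))  # 9604, matching Example 2 below
print(sample_size(0.99, 0.01))  # 16587; Example 1's 16641 uses the rounded value t = 2.58
```

The two printed values reproduce the sample sizes used in the embodiments below, up to rounding of the critical value.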
The goal of step (2) is to address algorithmic complexity: through downsampling, a classifier is built from part of the original data rather than all of it. To improve classification accuracy, sampling is repeated multiple times; each sample satisfies the size requirement above, and the number of repetitions is chosen by the user.
The downsampling method of step (2) proceeds as follows:
A. When sampling each class mi, the sample size must be no less than the value prescribed above (i.e., ni = (ta/2)^2 / (4ε^2) instances), and the number of samples equals the number of meta-classifiers to be built. When generating a sample for a class, the number of instances in the current class is obtained first. The current class is treated as class A, and all other classes are collectively treated as non-A. Next, the magnitudes of A and non-A are compared. Let #(A), #(~A), #(R), and #(T) denote the sizes of class A, of non-A, of what fits in computer memory, and of the theoretically required sample. If (#(A) >> #(R)) && (#(A) > #(T)), instances are drawn from class A until it is roughly the size of non-A; if (#(~A) >> #(R)) && (#(~A) > #(T)), instances are drawn from non-A until it is roughly the size of class A.
B. Repeat the above process until Di samples have been drawn for each class mi. For brevity, Di is fixed at n in this invention.
C. At this point the whole dataset has produced D = M*n samples (a schematic rendering of steps A-C follows below).
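A minimal sketch of steps A-C, assuming for demonstration that the data fit in a NumPy array; the helper names, shapes, and exact balancing rule are illustrative assumptions rather than the patented procedure:

```python
import numpy as np

def downsample_class(X, y, cls, n_required, rng):
    """Step A: draw one balanced one-vs-all dataset for class `cls` (class A):
    up to n_required/2 instances from A and from non-A, capped by availability."""
    idx_a = np.flatnonzero(y == cls)
    idx_rest = np.flatnonzero(y != cls)
    half = max(n_required // 2, 1)
    sel_a = rng.choice(idx_a, size=min(half, idx_a.size), replace=False)
    sel_rest = rng.choice(idx_rest, size=min(half, idx_rest.size), replace=False)
    sel = np.concatenate([sel_a, sel_rest])
    labels = (y[sel] == cls).astype(int)   # one-vs-all: current class = 1, rest = 0
    return X[sel], labels

def build_subsets(X, y, classes, n_required, n_subsets, seed=0):
    """Steps B-C: repeat the draw n_subsets (= n) times per class -> M * n datasets."""
    rng = np.random.default_rng(seed)
    return {c: [downsample_class(X, y, c, n_required, rng) for _ in range(n_subsets)]
            for c in classes}
```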
Through step (2), M*n samples are obtained, with n sampled datasets (and hence n classifiers) per class. Step (3) then builds a total of n meta-classifiers from the n samples of each class mi;
Step (4) then integrates the resulting n meta-classifiers into one ensemble classifier via the forward greedy ensemble method, as follows:
D. Construct the candidate classifier set CCS = {C1, ..., Cn} and the selected classifier set SCS = {};
E. Among the classifiers Ci, select the one with the best accuracy, remove it from CCS, and add it to SCS;
F. Add each classifier Cj currently in CCS to SCS for validation. If the classification result exceeds the threshold specified in advance by the user, move Cj from CCS to SCS and return to E. Otherwise go to step (5); at that point the ensemble classifier has finished learning;
G. Repeat F until CCS is empty.
At this point, for the M classes, the invention has built M ensemble classifiers Ci, i = 1, ..., M, each containing n meta-classifiers.
The above steps keep the resulting classifiers small, which makes the testing procedure simple.
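Steps D-G admit, for example, the following simplified rendering; accuracy_of, a callable that evaluates a set of classifiers on a validation set, and the stopping rule are assumed stand-ins:

```python
def forward_greedy_ensemble(classifiers, accuracy_of, threshold):
    """Steps D-G, simplified: seed SCS with the single most accurate classifier,
    then greedily move any candidate from CCS to SCS whose inclusion keeps the
    validation result above `threshold`; stop when no candidate helps."""
    ccs = list(classifiers)                         # candidate set CCS (step D)
    best = max(ccs, key=lambda c: accuracy_of([c]))  # step E
    scs = [best]                                     # selected set SCS
    ccs.remove(best)
    improved = True
    while ccs and improved:                          # steps F-G
        improved = False
        for c in list(ccs):
            if accuracy_of(scs + [c]) > threshold:
                scs.append(c)
                ccs.remove(c)
                improved = True
    return scs
```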
The non-uniform big data classification method implemented by the above steps has the following characteristics. First, because downsampling keeps the instance counts of the classes as balanced as possible during classification, the problem of the classifier favoring the majority class is effectively avoided. Second, classifying by sampling keeps the complexity of the whole classification algorithm at most linear. Third, to prevent sampling from lowering classification accuracy, the invention improves accuracy in two ways: repeated sampling, and the forward greedy combination of classification results.
The invention uses sampling to reduce class imbalance and to lower algorithmic complexity; it samples multiple times, builds a meta-classifier for each sample, and uses ensemble learning to combine all meta-classifiers and improve classification performance.
Sampling big data: classifying over an entire big dataset is usually very difficult, and even when feasible the complexity is high. Sampling makes classification of big data practical and reduces its complexity to linear, which is exactly the result big data mining expects.
Sample size and number of samples: the sample size is derived from statistical theory, guaranteeing that the error between the sampled result and the result on the original data is minimized. Drawing multiple samples helps improve classification performance;
The one-vs-all classification method has been shown to be very effective on non-uniform datasets. The invention applies it to the classification of non-uniform big data, solving the non-uniform classification problem on the one hand and the high complexity of big data classification on the other;
Meta-classifiers make classification on large datasets faster, and ensemble learning effectively improves their performance. Moreover, the forward greedy ensemble improves the meta-classifiers' performance while also reducing the complexity of the final classifier, which strongly supports linear-complexity processing of big data.
Detailed Description of the Embodiments
Example 1
Suppose a simulated big dataset contains two million instances, each of dimension 1000. The dataset has two classes: the first contains 1.99 million instances and the second only 10,000. The dataset is randomly generated and constitutes an imbalanced big data two-class classification problem.
(1) Set the confidence level to 99% and the maximum allowable error to 1%, so the required sample size per dataset per class is 16,641. 10,000 instances are drawn from class A (the dataset of 1.99 million instances) and combined with the 10,000 non-A instances, so each dataset contains 20,000 instances. An ordinary PC can easily apply common meta-classifiers to a dataset of 20,000 instances.
(2) Following the method above, this example generates 10 sub-datasets in total, and 10 nearest neighbor classifiers are built with k set to 1 through 10 respectively.
(3) The 10 meta-classifiers are then combined into a single ensemble classifier using the forward greedy ensemble method.
(4) A given test instance is classified with the ensemble classifier obtained above. If the classification result exceeds 50%, the test instance is judged to belong to class A; otherwise it belongs to non-A.
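Using the build_subsets helper sketched earlier, Example 1 can be exercised end to end roughly as below; the data are shrunk so the demonstration runs in seconds, and a plain majority vote stands in for the forward greedy ensemble:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Scaled-down stand-in for Example 1's 1.99M-vs-10k two-class problem.
X = rng.normal(size=(20_000, 20))
y = np.array([0] * 19_900 + [1] * 100)   # class A = 0 (majority), non-A = 1

# Step (2) of the example: 10 subsets for class A, one kNN per subset, k = 1..10.
subsets = build_subsets(X, y, classes=[0], n_required=2_000, n_subsets=10)
classifiers = [KNeighborsClassifier(n_neighbors=k).fit(Xs, ys)
               for k, (Xs, ys) in enumerate(subsets[0], start=1)]

# Step (4): each meta-classifier votes 1 for "class A"; average the votes.
x_test = X[:1]
votes = np.mean([c.predict(x_test)[0] for c in classifiers])
print("class A" if votes > 0.5 else "non-A")
```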
Example 2
Suppose a simulated big dataset contains twenty million instances, each of dimension 1000. The dataset has three classes: class A contains 12 million instances, class B 7.9 million, and class C 100,000. The dataset is randomly generated and constitutes an imbalanced big data multi-class classification problem.
(1) Set the confidence level to 95% and the maximum allowable error to 1%, so the required sample size per dataset per class is 9,604. Since an ordinary computer has some difficulty processing even 300,000 records, all three classes must be sampled.
(2) Draw 10 datasets for class A, each containing 20,000 instances (note: the number of instances only needs to exceed 9,604). More specifically, 10,000 samples are drawn at random from class A, then 5,000 from class B and 5,000 from class C, giving one sub-dataset of 20,000 samples. Repeating this sampling 10 times yields the 10 sub-datasets for class A. By analogy, 10 sub-datasets are drawn for class B and 10 for class C, so the process produces 30 sub-datasets in total.
(3) For the 10 datasets of class A, the one-vs-all method is applied: class A forms one class while classes B and C together form the other, and 10 meta-classifiers are built: nine nearest neighbor classifiers with k from 1 to 9, and one C5.0 decision tree classifier.
(4) In the same way, 10 meta-classifiers are built for class B and 10 for class C.
(5) The 10 meta-classifiers of class A are combined into one ensemble classifier with the forward greedy ensemble method; likewise the 10 meta-classifiers of class B and the 10 of class C are each combined into an ensemble classifier.
(6) A given test instance is classified with the three ensemble classifiers obtained above. If the ensemble classifier of A reports a classification result of 85%, that of B 89%, and that of C 90%, the test instance is judged to belong to class C.
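The decision rule of step (6) is an argmax over the per-class ensemble scores; a minimal sketch, assuming each ensemble classifier is exposed as a callable returning its classification rate for the instance:

```python
def predict_class(ensembles, x):
    """Step (6): return the label whose ensemble classifier reports the
    highest classification rate for instance x."""
    return max(ensembles, key=lambda label: ensembles[label](x))

# With the rates quoted in Example 2 (A: 85%, B: 89%, C: 90%), C wins:
demo = {"A": lambda x: 0.85, "B": lambda x: 0.89, "C": lambda x: 0.90}
print(predict_class(demo, x=None))   # -> C
```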
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310452365.3A CN103500205B (en) | 2013-09-29 | 2013-09-29 | Non-uniform big data classifying method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103500205A true CN103500205A (en) | 2014-01-08 |
CN103500205B CN103500205B (en) | 2017-04-12 |
Family
ID=49865415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310452365.3A Expired - Fee Related CN103500205B (en) | 2013-09-29 | 2013-09-29 | Non-uniform big data classifying method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103500205B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399413A (en) * | 2019-07-04 | 2019-11-01 | 博彦科技股份有限公司 | Data sampling method and apparatus, storage medium, and processor |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101404009A (en) * | 2008-10-31 | 2009-04-08 | 金蝶软件(中国)有限公司 | Data classification filtering method, system and equipment |
US20130071033A1 (en) * | 2011-09-21 | 2013-03-21 | Tandent Vision Science, Inc. | Classifier for use in generating a diffuse image |
CN103268336A (en) * | 2013-05-13 | 2013-08-28 | 刘峰 | Fast data and big data combined data processing method and system |
Non-Patent Citations (2)
Title |
---|
SATTAR HASHEMI et al.: "Adapted One-versus-All Decision Trees for Data Stream Classification", IEEE Transactions on Knowledge and Data Engineering *
谷雨 et al.: "Research on Intrusion Detection Based on Bagging Support Vector Machine Ensembles" (基于Bagging支持向量机集成的入侵检测研究), 《微电子学与计算机》 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156029A (en) * | 2015-03-24 | 2016-11-23 | 中国人民解放军国防科学技术大学 | Multi-label imbalanced virtual asset data classification method based on ensemble learning |
CN107193836A (en) * | 2016-03-15 | 2017-09-22 | 腾讯科技(深圳)有限公司 | A recognition method and device |
CN107193836B (en) * | 2016-03-15 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Identification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN103500205B (en) | 2017-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Susan et al. | The balancing trick: Optimized sampling of imbalanced datasets—A brief survey of the recent State of the Art | |
Bechini et al. | A MapReduce solution for associative classification of big data | |
Wang et al. | Machine learning in big data | |
Popat et al. | Review and comparative study of clustering techniques | |
US10579661B2 (en) | System and method for machine learning and classifying data | |
Pan et al. | Graph stream classification using labeled and unlabeled graphs | |
Taha et al. | Multilabel over-sampling and under-sampling with class alignment for imbalanced multilabel text classification | |
CN104239553A (en) | Entity recognition method based on Map-Reduce framework | |
Aggarwal et al. | Towards community detection in locally heterogeneous networks | |
Yao et al. | Scalable svm-based classification in dynamic graphs | |
Berrocal et al. | Exploring void search for fault detection on extreme scale systems | |
Meira et al. | Fast anomaly detection with locality-sensitive hashing and hyperparameter autotuning | |
Almunirawi et al. | A comparative study on serial decision tree classification algorithms in text mining | |
Yu et al. | Mining emerging patterns by streaming feature selection | |
Radhakrishna et al. | Document clustering using hybrid XOR similarity function for efficient software component reuse | |
Chen et al. | Active learning for unbalanced data in the challenge with multiple models and biasing | |
CN103500205B (en) | Non-uniform big data classifying method | |
Patil et al. | Enriched over_sampling techniques for improving classification of imbalanced big data | |
Naik et al. | Large scale hierarchical classification: state of the art | |
CN106033432A (en) | Multi-category unbalanced virtual asset data classification method based on decomposition strategy | |
Masoumi et al. | File fragment recognition based on content and statistical features | |
Marath et al. | Large-scale web page classification | |
Lathiya et al. | Improved CURE clustering for big data using Hadoop and Mapreduce | |
CN118643444A (en) | Big data anomaly detection method, device, equipment, storage medium and product | |
Gupta et al. | Feature selection: an overview |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170412 |