CN103500205A - Non-uniform big data classifying method - Google Patents
Non-uniform big data classifying method
- Publication number: CN103500205A (application CN201310452365.3A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F16/35—Information retrieval of unstructured textual data; Clustering; Classification
- G06F16/90—Details of database functions independent of the retrieved data types
Abstract
The invention relates to a non-uniform big data classification method for data sets that are too large to fit in computer memory and whose classes are unevenly distributed. First, the sample size is determined theoretically by a downsampling method, and the number of classifiers is determined by the number of samples. An integrated (ensemble) classifier is then built for each category of the big data. To label a test case, the integrated classifiers of all categories classify it, and the category whose integrated classifier reports the highest classification score is taken as the category of the test case. The method has time complexity linear in the size of the big data and reduces the bias of non-uniform big data classification results toward the majority class; in addition, the integrated classifiers improve accuracy. The method is easy to implement, requiring only simple mathematical models in code.
Description
Technical field
The present invention relates to the fields of computer science and information technology, specifically to processing methods for big data, and in particular to a method for classifying non-uniform big data.
Background art
Big data refers to data collections whose content cannot be captured, managed and processed with conventional software tools under existing physical conditions. Big data has the following characteristics: Volume (large data quantity), Variety (many data types), Value (low value density) and Velocity (fast processing requirements), abbreviated 4V.
Current big data research generally falls into two broad classes. First, the challenge big data poses to system architecture. In the HADOOP clusters of many well-known web sites, raw data capacity now reaches tens of petabytes, contains redundancy, and must be scanned and updated every day. To guarantee that a single-node or single-rack failure does not affect operation, HADOOP usually adopts a three-replica policy, so cost must be considered along both the time dimension and the space dimension. Building efficient mechanisms that manage both massive numbers of small files and very large files, and that support the storage, management and access of structured, semi-structured and unstructured data, are therefore problems that must be considered. Second, the challenge big data poses to knowledge discovery and data mining algorithms. The first issue to face is the scalability of the algorithms. Classic data mining and machine learning algorithms such as KNN density estimation, non-parametric Bayes, support vector machines, Gaussian process regression and hierarchical clustering have at least quadratic complexity and therefore cannot be applied well in big data mining. More efficient algorithms, i.e. O(n log n) or O(n), need to be designed.
Judging from the large existing literature on big data mining, research on big data learning concentrates mainly on upgrading and improving classic methods in four respects: classification, clustering, retrieval and incremental (batch, online or parallel) learning. There is comparatively little research on processing non-uniform big data. As with other big data knowledge discovery problems, the first consideration in big data classification is the complexity of the algorithm. Second, applying an existing classification algorithm (which assumes the different classes are uniformly distributed) directly to non-uniform big data easily causes bias: the classification result leans toward the majority class, i.e. the class containing a very large proportion of the examples, for instance more than 90% in a two-class problem. Finally, for non-uniform (imbalanced) data classification, common algorithms usually pursue the minimum classification error while ignoring the unequal misclassification costs of the non-uniform classes.
Non-uniform big data classification is thus a very challenging problem, and a series of basic questions, such as where to start and how to use big data for intelligent activities, urgently awaits solutions.
Summary of the invention
The present invention studies the problem of classifying non-uniform big data.
The object of the present invention is to provide a simple and effective classification method for non-uniform big data. The method solves the bias problem to which big data classification is prone and the high complexity of big data algorithms. By downsampling the big data and using one-vs-all two-class classification, the method achieves non-uniform big data classification with linear complexity; by integrating the results of multiple classifiers (ensemble), it solves the bias problem, improves classification accuracy, and is robust, i.e. resistant to noise.
The concrete steps of the method are as follows:
(1) Obtain the number m_i of examples in each class of the big data, i = 1, 2, ..., M.
(2) Use the downsampling method to draw D_i sample data sets for each class m_i. The size n_i of each sample data set is determined by n_i = t_{α/2}² · p(1-p) / ε², where t_{α/2} is the value corresponding to the confidence level, obtainable from the critical values of the t distribution, ε is the maximum permissible error, and p(1-p) is taken at its worst case p = 0.5; this matches the sample sizes 16641 and 9604 computed in the embodiments below. In this way D_i sample data sets are drawn for each class m_i.
(3) For the D_i data sets of each class m_i, use the one-vs-all method (all examples of the current class form the positive class; all examples of the other classes form the negative class) to build D_i classifiers, one classifier per data set.
(4) Carry out ensemble learning on the D_i classifiers of each class m_i. According to ensemble learning theory, an integrated classifier can be formed from a set of meta classifiers combined under an integration principle. All meta classifiers should be fast and mutually independent, and the error rate of each classifier should not exceed 50%. Common classifiers of this kind, such as nearest neighbour, decision trees, neural networks or forest trees, meet these requirements. Common integration principles include bagging, adaboost and selective ensemble. In the present invention, the D_i classifiers obtained for each class m_i are combined by ensemble learning using the forward greedy ensemble method.
(5) Testing: classify each example against every class; among the M results, the class with the highest accuracy is taken as the category of the test case.
The target of step (2) is to solve the algorithm complexity problem: by downsampling, classifiers are built from part of the raw data rather than from all of it. To improve classification accuracy, a multiple-sampling strategy is adopted: sampling is repeated several times, each sample size satisfies the rule above, and the number of samplings is chosen by the user.
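As a minimal sketch of the sample-size rule of step (2) (SciPy is assumed; the function name and the use of the normal approximation to the t critical value are ours, not the patent's):

```python
import math
from scipy.stats import norm

def sample_size(confidence: float, epsilon: float, p: float = 0.5) -> int:
    """n = t_{alpha/2}^2 * p * (1 - p) / epsilon^2, with p = 0.5 as the
    conservative worst case for a proportion. For the large samples
    involved here the t critical value is approximated by the normal
    critical value."""
    t = norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    return math.ceil(t * t * p * (1 - p) / (epsilon * epsilon))

print(sample_size(0.95, 0.01))  # 9604, as in embodiment 2 (t = 1.96)
print(sample_size(0.99, 0.01))  # 16588; embodiment 1 rounds t to 2.58, giving 16641
```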
The concrete steps of the downsampling method of step (2) are as follows:
A. When each class m_i is sampled, the sample size must be no less than the prescribed value n_i above, and the number of samples equals the number of meta classifiers to be built. In generating a sample for a class, first obtain the number of examples of the current class. The present invention calls the current class "class A" and all other classes together "non-A". Then compare the magnitudes of class A and non-A. Write #(A), #(~A), #(R) and #(T) for the numbers of examples of class A, of non-A, of the data that fits in computer memory, and of the theoretically required sample, respectively. If #(A) >> #(R) and #(A) > #(T), extract from class A roughly as many examples as non-A contains; if #(~A) >> #(R) and #(~A) > #(T), extract from non-A roughly as many examples as class A contains.
B. Repeat the above process until D_i samples have been drawn for each class m_i. For simplicity, the present invention fixes D_i at n.
C. At this point the whole data set has generated D = M·n samples.
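A non-authoritative sketch of steps A-C (NumPy is assumed; the helper and variable names are illustrative, and sampling with replacement when a class holds fewer examples than half the subset is our assumption, not stated in the description):

```python
import numpy as np

def balanced_subsets(X, y, target_class, n_subsets, subset_size, seed=None):
    """Steps A-C of the downsampling method: draw n_subsets sub-datasets
    for one class, each mixing roughly equal numbers of examples from the
    current class ("class A") and from all other classes ("non-A").
    Labels are returned in one-vs-all form: 1 = class A, 0 = non-A."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == target_class)          # class A indices
    neg = np.flatnonzero(y != target_class)          # non-A indices
    half = subset_size // 2
    subsets = []
    for _ in range(n_subsets):                       # repeated sampling (step B)
        idx = np.concatenate([
            rng.choice(pos, size=half, replace=len(pos) < half),
            rng.choice(neg, size=half, replace=len(neg) < half),
        ])
        rng.shuffle(idx)
        subsets.append((X[idx], (y[idx] == target_class).astype(int)))
    return subsets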
Through step (2) the present invention obtains M·n samples, with n classifiers per group of data. Step (3) of the present invention builds, in total, n meta classifiers for the samples of each class m_i, as sketched below.
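A sketch of step (3), mirroring embodiment 1 where the meta classifiers are nearest-neighbour classifiers with k = 1, ..., n (scikit-learn is assumed; the description requires only one meta classifier per data set, so k-NN here is one admissible choice, not the only one):

```python
from sklearn.neighbors import KNeighborsClassifier

def train_meta_classifiers(subsets):
    """Step (3): build one meta classifier per sub-dataset. Varying k
    keeps the members fast, simple and mutually different, as required
    of meta classifiers in step (4)."""
    members = []
    for k, (X_sub, y_sub) in enumerate(subsets, start=1):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_sub, y_sub)
        members.append(clf)
    return members
```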
Step (4) of the present invention then combines the n meta classifiers thus obtained into one integrated classifier, using the forward greedy ensemble method. Its steps are as follows:
D. Build the candidate classifier set CCS = {C_1, ..., C_m} and the selected classifier set SCS = {}.
E. Among the classifiers C_i, choose the one with the best accuracy, remove it from CCS and add it to SCS.
F. Add each classifier C_j currently in CCS to SCS and verify it; if the classification result exceeds the threshold specified in advance by the user, jump to E and move C_j from CCS to SCS. Otherwise jump to step (5); the integrated classifier has then finished learning.
G. Repeat F until CCS is the empty set.
At this point, for the M classes, the present invention has built M integrated classifiers C_i, i = 1, ..., M, each comprising n meta classifiers.
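One reading of steps D-G as code (a sketch only: the majority-vote scoring, the held-out validation data and the exact stopping test against the user threshold are our assumptions, since the description leaves them open):

```python
import numpy as np

def ensemble_accuracy(classifiers, X, y):
    """Majority-vote accuracy of a set of binary (one-vs-all) classifiers."""
    votes = np.mean([c.predict(X) for c in classifiers], axis=0)
    return float(np.mean((votes > 0.5).astype(int) == y))

def forward_greedy_ensemble(members, X_val, y_val, threshold):
    """Steps D-G: greedily move classifiers from the candidate set CCS
    into the selected set SCS while the validation accuracy of the
    growing ensemble keeps exceeding the previous best and the
    user-specified threshold."""
    ccs = list(members)                       # candidate classifier set (CCS)
    scs = []                                  # selected classifier set (SCS)
    best = 0.0
    while ccs:
        # try each remaining candidate added to the current selection
        accs = [ensemble_accuracy(scs + [c], X_val, y_val) for c in ccs]
        j = int(np.argmax(accs))
        if scs and accs[j] <= max(best, threshold):
            break                             # learning finished (jump to step (5))
        best = accs[j]
        scs.append(ccs.pop(j))                # move C_j from CCS into SCS
    return scs
```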
The above steps guarantee that few classifiers are retained, which keeps the test process simple.
The non-uniform big data classification method implemented through the above steps has the following characteristics. First, because the downsampling method balances the example data of the classes as far as possible, the classification is effectively prevented from leaning toward the majority class. Second, classifying on samples keeps the complexity of the whole classification algorithm at most linear. Third, to avoid the loss of classification accuracy that sampling may cause, the present invention improves accuracy by two means: repeated sampling and the forward greedy ensemble method.
The present invention uses sampling to reduce class imbalance and to lower the complexity of the algorithm; it samples repeatedly, builds one meta classifier per sample, and combines all meta classifiers by ensemble learning to improve classification performance.
Sampling the big data: classifying over an entire big data set is usually very difficult, and even where feasible the complexity is very high. Sampling makes the classification of big data feasible and reduces its complexity to linear, exactly the result expected of big data mining.
Sample size and number of samples: the sample size is derived theoretically, guaranteeing that the error between the sampled result and the baseline result is minimal. Extracting several samples helps improve classification performance.
The one-vs-all classification method has been shown to be a very effective way of handling non-uniform data sets. Using it for non-uniform big data classification solves the non-uniform classification problem on the one hand, and the high complexity of big data classification on the other.
Meta classifiers make classification on large data sets faster, and ensemble learning effectively improves their performance. The forward greedy ensemble further reduces the complexity of the classifier while improving meta classifier performance, a strong guarantee of processing big data with linear complexity.
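The test rule of step (5), keeping the class whose integrated classifier scores highest, might look like the following sketch (the `ensembles` mapping from class label to selected meta classifiers is an illustrative assumption):

```python
import numpy as np

def predict(x, ensembles):
    """Step (5): score the test example with every class's integrated
    classifier and return the class whose ensemble reports the highest
    fraction of positive (one-vs-all) votes."""
    scores = {}
    for label, members in ensembles.items():
        votes = [clf.predict(x.reshape(1, -1))[0] for clf in members]
        scores[label] = float(np.mean(votes))
    # e.g. {'A': 0.85, 'B': 0.89, 'C': 0.90} -> 'C', as in embodiment 2
    return max(scores, key=scores.get)
```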
Embodiment
Embodiment 1
The given simulated big data instance contains 2,000,000 examples, each of dimension 1000. The whole data set divides into two classes: the first class contains 1,990,000 examples and the second class only 10,000. The data set is randomly generated and constitutes an imbalanced two-class big data classification problem.
(1) Choose a confidence level of 99% and an error limit of 1%. The sample size for each data set of each class is therefore 16641 (with t_{α/2} = 2.58: 2.58² × 0.25 / 0.01² = 16641). Extract 10,000 examples from class A (the data set containing 1,990,000 examples) and add the 10,000 examples of non-A, so that each data set comprises 20,000 examples. An ordinary PC can easily apply a common meta classifier to classify a data set of 20,000 examples.
(2) By this method the example generates 10 sub-data sets in total. Build 10 classifiers with the nearest neighbour algorithm, setting k to 1 through 10 respectively.
(3) Combine these 10 meta classifiers into one integrated classifier by the forward greedy ensemble method.
(4) For a given test case, classify it with the integrated classifier obtained above. If the classification score exceeds 50%, the test case is judged to belong to class A; otherwise it belongs to non-A.
Embodiment 2
The given simulated big data instance contains 20,000,000 examples, each of dimension 1000. The whole data set divides into three classes: class A contains 12,000,000 examples, class B 7,900,000 and class C 100,000. The data set is randomly generated and constitutes an imbalanced multi-class big data classification problem.
(1) Choose a confidence level of 95% and an error limit of 1%. The sample size for each data set of each class is 9604 (1.96² × 0.25 / 0.01² ≈ 9604). Because an ordinary computer has some difficulty processing even 300,000 records, all three classes must be sampled.
(2) Draw 10 samples for class A, each data set comprising 20,000 examples (note: the number of examples need only exceed 9604). Specifically, first randomly draw 10,000 examples from class A, then 5,000 from class B and 5,000 from class C; this yields one sub-data set of 20,000 examples. Repeat this sampling 10 times to obtain the 10 sub-data sets of the class A sample. Likewise draw 10 sub-data sets each for class B and class C. In total this process generates 30 sub-data sets.
(3) For the 10 data sets of class A, adopt the one-vs-all classification method: class A is one class, and classes B and C together form the other. Build 10 classifiers from 10 meta classifiers: nine nearest-neighbour classifiers with k from 1 to 9, and one C5.0 decision tree classifier.
(4) In the same way build 10 meta classifiers for class B and 10 for class C.
(5) Combine the 10 meta classifiers of class A into one integrated classifier by the forward greedy ensemble method; likewise combine the 10 meta classifiers of class B and those of class C, yielding one integrated classifier for each.
(6) For a given test case, classify it with the three integrated classifiers obtained above. If the classification score of the class A integrated classifier is 85%, that of class B 89% and that of class C 90%, the test case is judged to belong to class C.
Claims (7)
1. A method for classifying non-uniform big data, comprising the steps of:
(1) obtaining the number m_i of examples in each class of the big data, i = 1, 2, ..., M;
(2) using the downsampling method to draw D_i sample data sets for each class m_i;
(3) building one meta classifier for each data set;
(4) carrying out ensemble learning on the D_i classifiers of each class m_i;
(5) testing: classifying each example against each class m_i, the class with the highest accuracy among the M results obtained being the category of the test case.
2. The method according to claim 1, wherein the size n_i of each data set in step (2) is determined by n_i = t_{α/2}² · p(1-p) / ε², where t_{α/2} is the value corresponding to the confidence level, obtained from the critical values of the t distribution, ε is the maximum permissible error that has been set, and p is taken as 0.5.
3. The method according to claim 1 or 2, wherein the detailed process of step (2) is as follows:
A. the current class is called class A and all other classes together non-A; the magnitudes of class A and non-A are then compared; writing #(A), #(~A), #(R) and #(T) respectively for the numbers of examples of class A, of non-A, of the data that fits in computer memory, and of the theoretically required sample: if #(A) >> #(R) and #(A) > #(T), roughly as many examples as non-A contains are extracted from class A; if #(~A) >> #(R) and #(~A) > #(T), roughly as many examples as class A contains are extracted from non-A;
B. the above process is repeated until D_i samples have been drawn for each class m_i, with D_i fixed at n;
C. the whole data set generates D = M·n sub-data sets.
4. The method according to claim 1, wherein in step (3) the method used to build the D_i meta classifiers for the D_i data sets of each class m_i is selected from: two-class classification, nearest neighbour, decision tree, neural network or forest tree methods.
5. The method according to claim 1 or 4, wherein in step (3) the method used to build the D_i meta classifiers for the D_i data sets of each class m_i is two-class classification.
6. The method according to claim 1, wherein in step (4) the D_i meta classifiers of each class m_i undergo ensemble learning by the forward greedy ensemble method, yielding one integrated classifier.
7. The method according to claim 1 or 6, wherein in step (4) the detailed process of the forward greedy ensemble method is as follows:
D. build the candidate classifier set CCS = {C_1, ..., C_m} and the selected classifier set SCS = {};
E. among the classifiers C_i, choose the one with the best accuracy, remove it from CCS and add it to SCS;
F. add each classifier C_j currently in CCS to SCS and verify it; if the classification result exceeds the threshold specified in advance by the user, jump to E and move C_j from CCS to SCS; otherwise jump to step (5);
G. repeat F until CCS is the empty set;
whereby, for the M classes, M integrated classifiers C_i, i = 1, ..., M, are built in total, each comprising n meta classifiers.
Priority Application (1)
- CN201310452365.3A (CN103500205B, Non-uniform big data classifying method), priority and filing date 2013-09-29

Publications (2)
- CN103500205A, published 2014-01-08
- CN103500205B, granted 2017-04-12

Family
- ID=49865415
- CN201310452365.3A, filed 2013-09-29, granted as CN103500205B, status: Expired - Fee Related
Patent Citations (3)
- CN101404009A, 金蝶软件(中国)有限公司, priority 2008-10-31, published 2009-04-08: Data classification filtering method, system and equipment
- US20130071033A1, Tandent Vision Science, Inc., priority 2011-09-21, published 2013-03-21: Classifier for use in generating a diffuse image
- CN103268336A, 刘峰, priority 2013-05-13, published 2013-08-28: Fast data and big data combined data processing method and system

Non-Patent Citations (2)
- Sattar Hashemi et al., "Adapted One-versus-All Decision Trees for Data Stream Classification", IEEE Transactions on Knowledge and Data Engineering
- 谷雨 et al., "基于Bagging支持向量机集成的入侵检测研究" (Research on intrusion detection based on Bagging SVM ensembles), 《微电子学与计算机》 (Microelectronics & Computer)

Cited By (3)
- CN106156029A, 中国人民解放军国防科学技术大学, priority 2015-03-24, published 2016-11-23: Multi-label imbalanced virtual-asset data classification method based on ensemble learning
- CN107193836A, 腾讯科技(深圳)有限公司, priority 2016-03-15, published 2017-09-22, granted as CN107193836B 2021-08-10: Identification method and device
- CN110399413A, 博彦科技股份有限公司, priority 2019-07-04, published 2019-11-01: Data sampling method, apparatus, storage medium and processor
Legal Events
- C06 / PB01: Publication
- C10 / SE01: Entry into substantive examination
- GR01: Patent grant
- CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 2017-04-12)