KR20130063565A - Combination of multiple classifiers using bagging in semi-supervised learning - Google Patents


Info

Publication number
KR20130063565A
Authority
KR
South Korea
Prior art keywords
unlabeled data
data
supervised learning
semi
classifier
Prior art date
Application number
KR1020110129967A
Other languages
Korean (ko)
Inventor
조윤진
우호영
Original Assignee
조윤진
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 조윤진 filed Critical 조윤진
Priority to KR1020110129967A priority Critical patent/KR20130063565A/en
Publication of KR20130063565A publication Critical patent/KR20130063565A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method of predicting labels for unlabeled data with a semi-supervised algorithm so that the data can be used by a supervised learning algorithm, and to a method of improving classifier performance when only a small amount of labeled data is available. To improve classification accuracy over existing semi-supervised learning methods, a semi-supervised algorithm predicts labels for the unlabeled data, and the unlabeled data whose predicted labels have the highest reliability are added to the training set. The final model is generated as a combined model using bagging, one of the ensemble methods. The present invention allows existing semi-supervised learning methods and supervised learning algorithms to be substituted into this framework, and improves classification accuracy compared to the existing boosting-based method.

Description

Device for constructing ensemble type data mining model using unlabeled data and its method {Combination of Multiple Classifiers using bagging in Semi-supervised Learning}

The present invention relates to a method of predicting labels for unlabeled data with a semi-supervised algorithm so that the data can be used by a supervised learning algorithm, and to a method of improving classifier performance using only a small amount of labeled data.

Ensemble algorithms in data mining have been studied extensively in recent years, beginning with Breiman's bagging technique.

Numerous studies have been conducted to further improve on Breiman's bagging technique described above. Results of this research include the boosting algorithm proposed by Freund and Schapire, and the arcing algorithm and the Random Forest algorithm proposed by Breiman.

Among these ensemble techniques, the boosting algorithm proposed by Freund and Schapire has recently given rise to various improved algorithms on the strength of its superior predictive power. These improved algorithms include the Real Boosting algorithm proposed by Schapire and Singer and the Gradient Boosting algorithm proposed by Friedman.

That is, the ensemble algorithms used in conventional data mining are mainly based on the boosting algorithm.

However, a major drawback of the boosting method is that its classification accuracy is strongly influenced by the accuracy of label prediction, and the equations used to compute the reliability of the predicted labels for the unlabeled data are difficult to verify properly.

In addition, while conventional techniques often use unlabeled data with a specific supervised learning classifier built into the classification model, it is not easy to find a method of using the unlabeled data independently of the choice of supervised learning classifier.

To solve the problems of the prior art described above, the present invention provides an ensemble construction apparatus, and a corresponding method, that uses bagging to achieve high classification accuracy by exploiting not only the labeled data in the training set but also the unlabeled data.

A method of increasing classification accuracy using unlabeled data according to the present invention includes: a first step of selecting some unlabeled data from an unlabeled data set and predicting labels for the selected unlabeled data; a second step of adding the unlabeled data whose predicted labels have high reliability to the training set and then training a supervised classifier; and a third step of generating an ensemble classifier model by combining the classifiers obtained by repeating the first and second steps. FIG. 3 shows an overview of the algorithm of the present invention.

In the first step, label prediction for the unlabeled data is performed by Linear Neighborhood Propagation (LNP), a semi-supervised algorithm; the first step is executed once for each base classifier to be combined into the ensemble classifier.

In the second step, based on the reliability of the predicted labels of the unlabeled data, the unlabeled data with high reliability are added to the training set in such a way that the class labels remain balanced.

After the first and second steps, the third step outputs the result using majority voting, one of the methods of combining independent base classifiers.

The present invention predicts labels for unlabeled data, enlarges the training set with the unlabeled data and their predicted labels, and improves the predictive accuracy of the classifier through ensemble learning; moreover, any desired semi-supervised algorithm and supervised learning classifier can be substituted in.

FIG. 1 is a flowchart illustrating a model generation method according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating data classification through the generated model of the present invention.
FIG. 3 is an overview showing the model generation method of the present invention.

The novelty of the approach according to the present invention lies in utilizing the Linear Neighborhood Propagation (LNP) method, a semi-supervised algorithm, to exploit unlabeled data in constructing the training set, and in a new method of generating an ensemble classifier model using bagging.

An advantage of the present invention is that unlabeled data, which is easy to obtain in many applications, can be used with minimal modification to existing methods. Another advantage of the present invention is its broad applicability: it can be used in place of many proven semi-supervised algorithms and supervised learning classifiers.

To evaluate the advantages of the approach according to the present invention, it is compared with SemiBoost [1], an existing semi-supervised learning model that uses the boosting method. On three data sets, the classification accuracy of the present model is shown to be improved over the existing method.

In the past, semi-supervised learning methods were proposed to address the fact that, in classification problems, unlabeled data is easy to obtain while labeled data is difficult to obtain. Recent research applies ensemble classifiers to semi-supervised learning, most commonly via bagging and boosting. Bagging, the classifier combination method used in the present invention, has the key feature that the base classifiers making up the ensemble are built independently of one another; boosting, by contrast, has the characteristic that earlier classifiers affect the training of later ones.

In the present invention, a method of increasing classification accuracy using unlabeled data includes: a first step of selecting some unlabeled data from an unlabeled data set and predicting labels for the selected unlabeled data; a second step of adding the unlabeled data whose predicted labels have high confidence to the training set and then training a supervised base classifier; and an ensemble classifier model generated by combining the classifiers obtained by repeating the first and second steps.

The first step in utilizing the unlabeled data is to predict labels for the unlabeled data, as indicated in FIG. 1 (S601). The second step uses the unlabeled data with the highest confidence among the unlabeled data with predicted labels to increase the predictive power of the classifiers: as shown in FIG. 1, after the unlabeled data with predicted labels are included in the training data set, one base classifier is trained on that training data (S602). In the next step, the final model is generated by combining the base classifiers created by repeating this process (S603). A supervised learning classifier is used as the base classifier. The steps are described in sections a, b, and c.

a. Step 1: Predict Labels for Unlabeled Data

In the machine learning domain, semi-supervised classification can provide a solution to this step. Typical classifiers use only labeled data for training. In contrast, semi-supervised classification uses unlabeled data together with labeled data to train better classifiers: the unlabeled data is used to modify or re-rank the hypotheses obtained from the labeled data.

In general, there are many semi-supervised methods, and the present invention has the advantage that any of them can be used selectively. An example is given here using a semi-supervised method called LNP. Linear Neighborhood Propagation (LNP) [3] is a graph-based method that propagates labels: it predicts the labels of unlabeled data from labeled data using the proximity between neighboring data points. A weight matrix representing the association between each data sample and its neighbors is constructed, and the labels of the unlabeled data are predicted by label propagation over this matrix. The difference from the method of [3] is that in the present invention labels are predicted using only some of the unlabeled data, rather than all of it.
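The graph-based propagation described above can be sketched as follows. This is a minimal illustration, not the patent's exact procedure: row-normalized inverse-distance k-nearest-neighbor weights stand in for the quadratic-program reconstruction weights of the original LNP, and the function name and parameters (`k`, `alpha`, `n_iter`) are illustrative.

```python
import numpy as np

def lnp_propagate(X, y, n_labeled, k=5, alpha=0.99, n_iter=100):
    """Graph-based label propagation in the spirit of LNP.

    X: (n, d) points; the first n_labeled rows carry labels y in {0..C-1}.
    Returns (predicted labels, confidences) for all n points.
    """
    n = len(X)
    # Squared pairwise distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # k-nearest-neighbor weight matrix, each row normalized to sum to 1
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]         # skip self at position 0
        w = 1.0 / (np.sqrt(d2[i, nbrs]) + 1e-12)  # inverse-distance weights
        W[i, nbrs] = w / w.sum()
    # Iterate F <- alpha * W F + (1 - alpha) * Y, keeping the seeds anchored
    C = int(y.max()) + 1
    Y = np.zeros((n, C))
    Y[np.arange(n_labeled), y] = 1.0
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (W @ F) + (1 - alpha) * Y
    conf = F.max(1) / (F.sum(1) + 1e-12)          # relative confidence per point
    return F.argmax(1), conf
```

The returned confidences are what the second step thresholds on when choosing which predicted points enter the training set.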

b. Step 2: Training the Base Classifier by Including Predicted Unlabeled Data in the Training Set

In the second step, based on the reliability of the predicted labels of the unlabeled data, the unlabeled data with high reliability are added to the training set in such a way that the class labels remain balanced. The measure of reliability is the probability that, when labels are propagated from the labeled data to an unlabeled point, the point receives that label; this probability can be read off as the weight of the edge connecting two points on the graph. By selecting the unlabeled data with high reliability, a training set is constructed from the few labeled data and the predicted unlabeled data, and the base classifier is trained as shown in step S602 of FIG. 1.
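The class-balanced selection of high-reliability predictions described above can be sketched as follows; the function name, `per_class` parameter, and array names are illustrative, not taken from the patent.

```python
import numpy as np

def select_confident(X_unlab, pred_labels, confidences, per_class):
    """Keep the `per_class` most confident predictions of each class,
    so the class labels in the augmented training set stay balanced."""
    keep = []
    for c in np.unique(pred_labels):
        idx = np.where(pred_labels == c)[0]
        # Indices of this class, sorted by descending confidence
        top = idx[np.argsort(confidences[idx])[::-1][:per_class]]
        keep.extend(top.tolist())
    keep = np.array(sorted(keep))
    return X_unlab[keep], pred_labels[keep]
```

The selected points and their predicted labels are then appended to the labeled data before one base classifier is trained.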

c. Step 3: Generate the combined classification model after training base classifiers through the set number of repetitions

The third step generates the final combined model by combining the base classifiers trained through the first and second steps. The final model uses a bagging method; a characteristic feature of bagging is that each classifier is generated independently, which has the advantage that the generation of the classifiers can be processed in parallel to speed up the computation.
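Because the base classifiers do not depend on one another, their training parallelizes trivially. In the sketch below a toy nearest-centroid classifier stands in for the (interchangeable) supervised base classifier, and stdlib threads provide the parallelism; all names are illustrative, and each classifier here trains on its own bootstrap sample as ordinary bagging does.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def train_centroid_classifier(X, y):
    """Toy stand-in for the supervised base classifier: nearest centroid."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroid(model, X):
    classes, centroids = model
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return classes[d2.argmin(axis=1)]

def bagged_ensemble(X, y, n_classifiers=5, seed=0):
    """Train each base classifier on its own bootstrap sample.

    No classifier depends on another, so training runs in parallel
    threads -- the independence that bagging permits.
    """
    rng = np.random.default_rng(seed)
    samples = [rng.integers(0, len(X), len(X)) for _ in range(n_classifiers)]
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(
            lambda idx: train_centroid_classifier(X[idx], y[idx]), samples))
    return models
```

A boosting ensemble could not be parallelized this way, since each of its classifiers is trained on a reweighting produced by the previous one.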

d. Output form of the generated model

The output of the combined model is generated using the majority voting method, one of several combination methods: the label selected by the largest number of base classifiers is output. FIG. 2 illustrates the process.
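Majority voting over the independent base classifiers can be sketched as follows; the `votes` matrix layout is an assumption for illustration (one row per classifier, one column per sample).

```python
import numpy as np

def majority_vote(votes):
    """Combine base-classifier outputs.

    votes: (n_classifiers, n_samples) array of predicted labels.
    Returns, per sample, the label chosen by the most classifiers
    (ties resolved in favor of the smallest label).
    """
    votes = np.asarray(votes)
    out = []
    for col in votes.T:                     # one column per sample
        vals, counts = np.unique(col, return_counts=True)
        out.append(vals[counts.argmax()])
    return np.array(out)
```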

Claims (6)

1. A method of constructing an ensemble model, comprising:
a first step of selecting some unlabeled data from an unlabeled data set and predicting labels for the selected unlabeled data by semi-supervised learning;
a second step of training base classifiers after combining the unlabeled data with predicted labels and the labeled data into a training set; and
a third step of building the ensemble model by combining the base classifiers created through iterations of the first and second steps,
wherein the semi-supervised learning method and the base classifier used in constructing the ensemble model are replaceable.
2. The method of claim 1, wherein in the first step, the label prediction of the unlabeled data uses reliability-based data obtained by modifying Linear Neighborhood Propagation (LNP), a semi-supervised method.
3. The method of claim 1, wherein in the first step, a semi-supervised method other than LNP (Linear Neighborhood Propagation) is used.
4. The method of claim 1, wherein in the third step, various supervised learning classifiers are used as the base classifier.
5. The method of claim 1, wherein the semi-supervised method and the supervised method are used in combination.
KR1020110129967A 2011-12-07 2011-12-07 Combination of multiple classifiers using bagging in semi-supervised learning KR20130063565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020110129967A KR20130063565A (en) 2011-12-07 2011-12-07 Combination of multiple classifiers using bagging in semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020110129967A KR20130063565A (en) 2011-12-07 2011-12-07 Combination of multiple classifiers using bagging in semi-supervised learning

Publications (1)

Publication Number Publication Date
KR20130063565A true KR20130063565A (en) 2013-06-17

Family

ID=48860827

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020110129967A KR20130063565A (en) 2011-12-07 2011-12-07 Combination of multiple classifiers using bagging in semi-supervised learning

Country Status (1)

Country Link
KR (1) KR20130063565A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110249341A (en) * 2017-02-03 2019-09-17 皇家飞利浦有限公司 Classifier training
KR102033136B1 (en) 2019-04-03 2019-10-16 주식회사 루닛 Method for machine learning based on semi-supervised learning and apparatus thereof
KR20190117969A (en) * 2018-04-09 2019-10-17 주식회사 뷰노 Method for semi supervised reinforcement learning using data with label and data without label together and apparatus using the same
CN110348241A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter under data sharing strategy cooperates with prognosis prediction system
KR20190135329A (en) * 2018-05-28 2019-12-06 삼성에스디에스 주식회사 Computing system and method for data labeling thereon
WO2020004867A1 (en) * 2018-06-29 2020-01-02 주식회사 디플리 Machine learning method and device enabling automatic labeling
KR20200094938A (en) * 2019-01-31 2020-08-10 동서대학교 산학협력단 Data imbalance solution method using Generative adversarial network
KR20210012761A (en) * 2019-07-26 2021-02-03 주식회사 수아랩 Method for managing data
KR20210012762A (en) * 2019-07-26 2021-02-03 주식회사 수아랩 Method to decide a labeling priority to a data
CN112835797A (en) * 2021-02-03 2021-05-25 杭州电子科技大学 Metamorphic relation prediction method based on program intermediate structure characteristics
US11017294B2 (en) 2016-12-16 2021-05-25 Samsung Electronics Co., Ltd. Recognition method and apparatus
CN112989841A (en) * 2021-02-24 2021-06-18 中国搜索信息科技股份有限公司 Semi-supervised learning method for emergency news identification and classification

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11017294B2 (en) 2016-12-16 2021-05-25 Samsung Electronics Co., Ltd. Recognition method and apparatus
CN110249341A (en) * 2017-02-03 2019-09-17 皇家飞利浦有限公司 Classifier training
KR20190117969A (en) * 2018-04-09 2019-10-17 주식회사 뷰노 Method for semi supervised reinforcement learning using data with label and data without label together and apparatus using the same
KR20190135329A (en) * 2018-05-28 2019-12-06 삼성에스디에스 주식회사 Computing system and method for data labeling thereon
WO2020004867A1 (en) * 2018-06-29 2020-01-02 주식회사 디플리 Machine learning method and device enabling automatic labeling
KR20200002149A (en) * 2018-06-29 2020-01-08 주식회사 디플리 Method and Device for Machine Learning able to automatically-label
KR20200094938A (en) * 2019-01-31 2020-08-10 동서대학교 산학협력단 Data imbalance solution method using Generative adversarial network
KR102033136B1 (en) 2019-04-03 2019-10-16 주식회사 루닛 Method for machine learning based on semi-supervised learning and apparatus thereof
CN110348241A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter under data sharing strategy cooperates with prognosis prediction system
KR20210012762A (en) * 2019-07-26 2021-02-03 주식회사 수아랩 Method to decide a labeling priority to a data
KR20210012761A (en) * 2019-07-26 2021-02-03 주식회사 수아랩 Method for managing data
CN112835797A (en) * 2021-02-03 2021-05-25 杭州电子科技大学 Metamorphic relation prediction method based on program intermediate structure characteristics
CN112835797B (en) * 2021-02-03 2024-03-29 杭州电子科技大学 Metamorphic relation prediction method based on program intermediate structure characteristics
CN112989841A (en) * 2021-02-24 2021-06-18 中国搜索信息科技股份有限公司 Semi-supervised learning method for emergency news identification and classification

Similar Documents

Publication Publication Date Title
KR20130063565A (en) Combination of multiple classifiers using bagging in semi-supervised learning
Ye et al. Bp-transformer: Modelling long-range context via binary partitioning
Jia et al. Neural extractive summarization with hierarchical attentive heterogeneous graph network
Huang et al. Unignn: a unified framework for graph and hypergraph neural networks
Luo et al. Semi-supervised neural architecture search
De Haan et al. Patterns in transitions: Understanding complex chains of change
Agrawal et al. An online tool for predicting fatigue strength of steel alloys based on ensemble data mining
Yigit A weighting approach for KNN classifier
Amiri et al. Adventures in data analysis: A systematic review of Deep Learning techniques for pattern recognition in cyber-physical-social systems
Zhang et al. Few-Shot Audio Classification with Attentional Graph Neural Networks.
US20210110273A1 (en) Apparatus and method with model training
JP2013196680A (en) Concept recognition method and concept recognition device based on co-learning
Gao et al. Heterogeneous graph neural architecture search
McLeod et al. A modular system for the harmonic analysis of musical scores using a large vocabulary
CN105760929A (en) Layered global optimization method based on DFP algorithm and differential evolution
Wang et al. Cross-modal graph with meta concepts for video captioning
CN117237621A (en) Small sample semantic segmentation algorithm based on pixel-level semantic association
Goswami et al. A new evaluation measure for feature subset selection with genetic algorithm
Feng et al. Prototypical networks relation classification model based on entity convolution
Zhou et al. Hierarchical knowledge propagation and distillation for few-shot learning
Zhou et al. Lie detection from speech analysis based on k–svd deep belief network model
Saunders et al. Automated Machine Learning for Positive-Unlabelled Learning
Wei et al. Negatives make a positive: An embarrassingly simple approach to semi-supervised few-shot learning
Hillel New perspectives on the performance of machine learning classifiers for mode choice prediction
Tan et al. Dual-gradients localization framework with skip-layer connections for weakly supervised object localization

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application