KR20130063565A - Combination of multiple classifiers using bagging in semi-supervised learning - Google Patents
- Publication number
- KR20130063565A
- Authority
- KR
- South Korea
- Prior art keywords
- unlabeled data
- data
- semi-supervised learning
- classifier
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a method for predicting labels of unlabeled data with a semi-supervised algorithm and for improving the performance of a classifier using only a small amount of labeled data. To improve classification accuracy over existing semi-supervised learning methods, a semi-supervised algorithm is used to predict labels for the unlabeled data, and the unlabeled data whose predictions have the highest reliability are added to the training set. The final model is then generated as a combined model using bagging, one of the ensemble methods. In the present invention, the semi-supervised learning method and the supervised learning algorithm can each be replaced with existing methods, and classification accuracy is improved compared to the existing boosting method.
Description
The present invention relates to a method for predicting labels of unlabeled data with a semi-supervised algorithm and for improving the performance of a classifier using only a small amount of labeled data.
Ensemble algorithms in data mining have been actively studied in recent years, beginning with Breiman's bagging technique.
In other words, numerous studies have been conducted to further improve Breiman's bagging technique. This line of research produced the boosting algorithm proposed by Freund and Schapire, the arcing algorithm proposed by Breiman, and the Random Forest algorithm, also proposed by Breiman.
Among these ensemble techniques, the boosting algorithm proposed by Freund and Schapire has, owing to its superior predictive power, recently given rise to various improved algorithms, including the Real Boosting algorithm proposed by Schapire and Singer and the Gradient Boosting algorithm proposed by Friedman.
That is, the ensemble algorithms used in conventional data mining are mainly based on boosting.
However, boosting has a significant drawback: its labeling accuracy can be strongly affected by the accuracy of label prediction, and the formulas used to compute the reliability of predictions on unlabeled data are not properly verified, nor is such verification easy.
In addition, although conventional techniques often use unlabeled data with a supervised learning classifier built into the classification model, it is not easy to find a method that uses the unlabeled data independently of the supervised learning classifier.
To solve the above problems of the prior art, the present invention provides an apparatus and method for constructing an ensemble using bagging that achieves high classification accuracy by exploiting not only the labeled data in the training set but also the unlabeled data.
A method of increasing classification accuracy using unlabeled data according to the present invention includes: a first step of selecting a subset of unlabeled data from an unlabeled data set and predicting labels for the selected data; a second step of adding the unlabeled data whose predicted labels have high reliability to the training set and then training a supervised classifier; and a step of generating an ensemble classifier model by combining the classifiers obtained by repeating the first and second steps. FIG. 3 shows an overview of the algorithm of the present invention.
In the first step, label prediction for the unlabeled data is performed by Linear Neighborhood Propagation (LNP), a semi-supervised algorithm, and this step is executed once for each base classifier that is combined into the ensemble classifier.
In the second step, based on the reliability of the predicted labels, the unlabeled data with high reliability are added to the training set in such a way that the class labels remain balanced.
After the first and second steps, the third step outputs the result using majority voting, one of the methods for combining independent base classifiers.
The present invention predicts labels for unlabeled data, enlarges the training set with the unlabeled data and their predicted labels, and improves the predictive accuracy of the classifier through ensemble learning. In addition, any desired semi-supervised algorithm and supervised learning classifier can be substituted.
FIG. 1 is a flowchart illustrating a model generation method according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating data classification through the generated model of the present invention.
FIG. 3 is an overview showing a model generation method of the present invention.
The novelty of the approach according to the present invention is a new method of generating an ensemble classifier model using bagging, in which Linear Neighborhood Propagation (LNP), a semi-supervised algorithm, uses unlabeled data to construct the training set.
One advantage of the present invention is that unlabeled data, which are easy to obtain in many applications, can be used with minimal modification to existing methods. Another advantage is broad applicability: the present invention can substitute many proven semi-supervised algorithms and supervised learning classifiers.
To evaluate the advantages of the approach according to the present invention, it is compared with SemiBoost [1], an existing semi-supervised learning model that uses the boosting method. On three data sets, the present model is shown to improve classification accuracy over the existing method.
In the past, semi-supervised learning methods were proposed to address the fact that, in classification problems, unlabeled data are easy to obtain while labeled data are difficult to obtain. Recent research applies ensemble classifiers to semi-supervised learning, most commonly through bagging and boosting. Bagging, the classifier combination method used in the present invention, has the key feature that the base classifiers forming the ensemble are generated independently of one another, whereas in boosting each classifier's training affects the next.
In the present invention, a method of increasing classification accuracy using unlabeled data includes: a first step of selecting arbitrary unlabeled data from an unlabeled data set and predicting labels for the selected data; a second step of adding the unlabeled data whose predictions have high confidence to the training set and then training a supervised base classifier; and a third step of generating an ensemble classifier model by combining the classifiers obtained by repeating the first and second steps.
The first step for utilizing the unlabeled data is to predict labels for the unlabeled data, as indicated in FIG. 1 (S601). The second step uses the unlabeled data with the highest confidence among those with predicted labels to increase the predictive power of the classifiers: as shown in FIG. 1, the unlabeled data with predicted labels are added to the training data set, and one base classifier is trained on that training data (S602). In the next step, the final model is generated by combining the base classifiers created by repeating this process (S603). A supervised classifier is used as the base classifier. The steps are described in sections a, b, and c.
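The three steps (S601-S603) can be sketched as a training loop. This is a minimal illustration, not the patented implementation: the nearest-centroid base learner and the `predict_labels` callback (standing in for a semi-supervised predictor such as LNP) are assumptions made for the sake of a runnable example.

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in for the supervised base classifier (illustrative only)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(0) for c in self.classes_])
        return self

    def predict(self, X):
        # assign each point to the class of its nearest centroid
        d = ((X[:, None] - self.centroids_[None]) ** 2).sum(-1)
        return self.classes_[d.argmin(1)]

def train_bagged_ensemble(X_lab, y_lab, X_unlab, predict_labels,
                          n_classifiers=5, rng=None):
    """S601-S603: pick a random subset of unlabeled data, predict its labels
    with a semi-supervised predictor, enlarge the training set, fit one base
    classifier, and repeat to build the bagged ensemble."""
    rng = rng or np.random.default_rng(0)
    ensemble = []
    for _ in range(n_classifiers):
        # S601: select some unlabeled points and predict their labels
        idx = rng.choice(len(X_unlab), size=max(1, len(X_unlab) // 2),
                         replace=False)
        y_pred = predict_labels(X_lab, y_lab, X_unlab[idx])
        # S602: enlarge the training set and train one base classifier
        X_tr = np.vstack([X_lab, X_unlab[idx]])
        y_tr = np.concatenate([y_lab, y_pred])
        ensemble.append(NearestCentroid().fit(X_tr, y_tr))
    return ensemble  # S603: combined by majority vote at prediction time
```

Because each base classifier is built independently (the bagging property the description emphasizes), the loop body could also be run in parallel.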
a. Step 1: Predict Labels for Unlabeled Data
In the machine learning domain, semi-supervised classification provides a solution to this step. Typical classifiers use only labeled data for training. In contrast, semi-supervised classification uses unlabeled data together with labeled data to train better classifiers: the unlabeled data are used to modify or re-rank the hypotheses obtained from the labeled data.
In general, there are many semi-supervised methods, and the present invention has the advantage that they can be used interchangeably. An example is given here using the semi-supervised method LNP. Linear Neighborhood Propagation (LNP) [3] is a graph-based semi-supervised method that propagates labels over a graph, predicting the labels of unlabeled data from the labeled data using the proximity between neighboring points. A weight matrix representing the association between each data sample and its neighbors is constructed, and the labels of the unlabeled data are predicted by label propagation over it. The difference from [3] is that in the present invention the labels are predicted for an arbitrarily selected subset of the unlabeled data rather than for all of it.
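A minimal sketch of graph-based label propagation in the spirit of LNP is shown below; the neighborhood size `k`, the damping factor `alpha`, and the iteration count are illustrative assumptions, not the exact formulation of [3].

```python
import numpy as np

def propagate_labels(X, y, n_iters=50, k=3, alpha=0.99):
    """Graph-based label propagation sketch in the spirit of LNP.

    X: (n, d) feature matrix; y: (n,) labels, with -1 marking unlabeled points.
    Returns a predicted label and a confidence score for every point.
    """
    n = len(X)
    # Pairwise squared distances -> row-normalized weights over k neighbors.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]      # skip the point itself
        w = np.exp(-d2[i, nbrs])
        W[i, nbrs] = w / w.sum()

    classes = np.unique(y[y >= 0])
    Y0 = np.zeros((n, len(classes)))           # one-hot seeds for labeled data
    for ci, c in enumerate(classes):
        Y0[y == c, ci] = 1.0

    F = Y0.copy()
    for _ in range(n_iters):                   # iterative label propagation
        F = alpha * W @ F + (1 - alpha) * Y0

    pred = classes[F.argmax(1)]
    conf = F.max(1) / np.clip(F.sum(1), 1e-12, None)  # normalized confidence
    return pred, conf
```

The confidence returned here plays the role of the reliability measure used in the second step: points whose propagated label mass is concentrated on one class score close to 1.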
b. Step 2: Training the Base Classifier by Including Predicted Unlabeled Data in the Training Set
In the second step, based on the reliability of the predicted labels, the unlabeled data with high reliability are added to the training set in such a way that the class labels remain balanced. The measure of reliability is the probability that a point keeps the same label when labels are propagated from the labeled data to the unlabeled data; this probability can be seen as the weight of the edge connecting two points on the graph. By selecting the unlabeled data with high reliability, a training set is constructed from the few labeled data and the predicted unlabeled data, and the base classifier is trained as shown in S602 of FIG. 1.
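The selection in this step can be sketched as follows: keep only the most reliably predicted unlabeled points, taking the same number from each class so that the class labels stay balanced. The `per_class` cutoff is an illustrative assumption; the patent does not fix a specific count.

```python
import numpy as np

def select_confident(pred, conf, labeled_mask, per_class=2):
    """Step 2 sketch: pick the most confidently predicted unlabeled points,
    the same number per class, so the training set stays class-balanced."""
    chosen = []
    for c in np.unique(pred):
        # unlabeled points predicted as class c, most confident first
        idx = np.where((~labeled_mask) & (pred == c))[0]
        idx = idx[np.argsort(-conf[idx])]
        chosen.extend(idx[:per_class].tolist())
    return sorted(chosen)
```

The indices returned would then be appended, with their predicted labels, to the labeled training set before fitting one base classifier.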
c. Step 3: Generate the Combined Classification Model after Training Base Classifiers through a Set Number of Repetitions
The third step generates the final combined model by combining the base classifiers built through the first and second steps. The final model uses the bagging method, whose characteristic is that each classifier is generated independently. This has the advantage that the generation of each classifier can be processed in parallel, speeding up the computation.
d. Output form of the generated model
The combined model produces its output using majority voting, one of several combination methods: it outputs the label selected by the largest number of classifiers. FIG. 2 illustrates the process.
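The voting rule above can be sketched in a few lines; the base classifiers are represented here simply by their per-sample prediction lists.

```python
from collections import Counter

def majority_vote(predictions):
    """Step d sketch: combine independent base classifiers' outputs by
    emitting, for each sample, the label chosen by the most classifiers.

    predictions: list of per-classifier label sequences, all the same length.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

For example, with three classifiers voting (0, 0, 1), (1, 1, 1), and (1, 0, 0) on three samples, the ensemble outputs the per-sample majorities.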
Claims (6)
A step of training base classifiers after forming a training set from the labeled data and the unlabeled data with predicted labels;
A third step of building an ensemble model by combining the base classifiers created through iterations of steps 1 and 2;
wherein the label prediction for the unlabeled data is characterized by reliability-based selection obtained by modifying Linear Neighborhood Propagation (LNP), a semi-supervised method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020110129967A KR20130063565A (en) | 2011-12-07 | 2011-12-07 | Combination of multiple classifiers using bagging in semi-supervised learning |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20130063565A true KR20130063565A (en) | 2013-06-17 |
Family
ID=48860827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020110129967A KR20130063565A (en) | 2011-12-07 | 2011-12-07 | Combination of multiple classifiers using bagging in semi-supervised learning |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20130063565A (en) |
- 2011-12-07: KR application KR1020110129967A filed (publication KR20130063565A, status: not active, Application Discontinuation)
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11017294B2 (en) | 2016-12-16 | 2021-05-25 | Samsung Electronics Co., Ltd. | Recognition method and apparatus |
CN110249341A (en) * | 2017-02-03 | 2019-09-17 | 皇家飞利浦有限公司 | Classifier training |
KR20190117969A (en) * | 2018-04-09 | 2019-10-17 | 주식회사 뷰노 | Method for semi supervised reinforcement learning using data with label and data without label together and apparatus using the same |
KR20190135329A (en) * | 2018-05-28 | 2019-12-06 | 삼성에스디에스 주식회사 | Computing system and method for data labeling thereon |
WO2020004867A1 (en) * | 2018-06-29 | 2020-01-02 | 주식회사 디플리 | Machine learning method and device enabling automatic labeling |
KR20200002149A (en) * | 2018-06-29 | 2020-01-08 | 주식회사 디플리 | Method and Device for Machine Learning able to automatically-label |
KR20200094938A (en) * | 2019-01-31 | 2020-08-10 | 동서대학교 산학협력단 | Data imbalance solution method using Generative adversarial network |
KR102033136B1 (en) | 2019-04-03 | 2019-10-16 | 주식회사 루닛 | Method for machine learning based on semi-supervised learning and apparatus thereof |
CN110348241A (en) * | 2019-07-12 | 2019-10-18 | 之江实验室 | A kind of multicenter under data sharing strategy cooperates with prognosis prediction system |
KR20210012762A (en) * | 2019-07-26 | 2021-02-03 | 주식회사 수아랩 | Method to decide a labeling priority to a data |
KR20210012761A (en) * | 2019-07-26 | 2021-02-03 | 주식회사 수아랩 | Method for managing data |
CN112835797A (en) * | 2021-02-03 | 2021-05-25 | 杭州电子科技大学 | Metamorphic relation prediction method based on program intermediate structure characteristics |
CN112835797B (en) * | 2021-02-03 | 2024-03-29 | 杭州电子科技大学 | Metamorphic relation prediction method based on program intermediate structure characteristics |
CN112989841A (en) * | 2021-02-24 | 2021-06-18 | 中国搜索信息科技股份有限公司 | Semi-supervised learning method for emergency news identification and classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| A201 | Request for examination | |
| E902 | Notification of reason for refusal | |
| E601 | Decision to refuse application | |