KR20130063565A - Combination of multiple classifiers using bagging in semi-supervised learning - Google Patents


Info

Publication number
KR20130063565A
Authority
KR
South Korea
Prior art keywords
unlabeled data
data
supervised learning
semi
classifier
Prior art date
Application number
KR1020110129967A
Other languages
Korean (ko)
Inventor
조윤진
우호영
Original Assignee
조윤진
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 조윤진 filed Critical 조윤진
Priority to KR1020110129967A priority Critical patent/KR20130063565A/en
Publication of KR20130063565A publication Critical patent/KR20130063565A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method of predicting labels for unlabeled data with a semi-supervised algorithm so that the data can be used by a supervised learning algorithm, and to a method of improving classifier performance when only a small amount of labeled data is available. To improve classification accuracy over existing semi-supervised learning methods, a semi-supervised algorithm predicts labels for the unlabeled data, and the unlabeled data whose predicted labels have the highest reliability are added to the training set. The final model is generated as a combined model using bagging, one of the ensemble methods. The present invention allows existing semi-supervised learning methods and supervised learning algorithms to be substituted into this framework, and improves classification accuracy compared to the existing boosting-based method.

Description

Device for constructing ensemble type data mining model using unlabeled data and its method {Combination of Multiple Classifiers using bagging in Semi-supervised Learning}

The present invention relates to a method of predicting labels for unlabeled data with a semi-supervised algorithm so that the data can be used by a supervised learning algorithm, and to a method of improving classifier performance using only a small amount of labeled data.

Ensemble algorithms in data mining have been studied extensively in recent years, beginning with Breiman's bagging technique.

Numerous studies have been conducted to further improve on Breiman's bagging technique described above. Results of this research include the boosting algorithm proposed by Freund and Schapire, and the arcing algorithm and the Random Forest algorithm proposed by Breiman.

Among these ensemble techniques, the boosting algorithm proposed by Freund and Schapire has recently given rise to various improved algorithms on the strength of its superior predictive power. These improved algorithms include the Real Boosting algorithm proposed by Schapire and Singer and the Gradient Boosting algorithm proposed by Friedman.

That is, the ensemble algorithms used in conventional data mining are mainly based on the boosting algorithm.

However, a major drawback of the boosting method is that its classification accuracy is strongly influenced by the accuracy of label prediction, and the equations used to compute the reliability of the predicted labels for the unlabeled data are difficult to verify properly.

In addition, while conventional techniques often use unlabeled data with a specific supervised learning classifier built into the classification model, it is not easy to find a method of using the unlabeled data independently of the choice of supervised learning classifier.

To solve the problems of the prior art described above, the present invention provides an ensemble construction apparatus, and a corresponding method, that uses bagging to achieve high classification accuracy by exploiting not only the labeled data in the training set but also the unlabeled data.

A method of increasing classification accuracy using unlabeled data according to the present invention includes: a first step of selecting some unlabeled data from an unlabeled data set and predicting labels for the selected unlabeled data; a second step of adding the unlabeled data whose predicted labels have high reliability to the training set and then training a supervised classifier; and a third step of generating an ensemble classifier model by combining the classifiers obtained by repeating the first and second steps. FIG. 3 shows an overview of the algorithm of the present invention.

In the first step, label prediction for the unlabeled data is performed by Linear Neighborhood Propagation (LNP), a semi-supervised algorithm; the first step is executed once for each base classifier to be combined into the ensemble classifier.

In the second step, based on the reliability of the predicted labels of the unlabeled data, the unlabeled data with high reliability are added to the training set in such a way that the class labels remain balanced.

After the first and second steps, the third step outputs the result using majority voting, one of the methods of combining independent base classifiers.

The present invention predicts labels for unlabeled data, enlarges the training set with the unlabeled data and their predicted labels, and improves the predictive accuracy of the classifier through ensemble learning; moreover, any desired semi-supervised algorithm and supervised learning classifier can be substituted in.

FIG. 1 is a flowchart illustrating a model generation method according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating data classification through the generated model of the present invention.
FIG. 3 is an overview showing the model generation method of the present invention.

The novelty of the approach according to the present invention lies in utilizing the Linear Neighborhood Propagation (LNP) method, a semi-supervised algorithm, to exploit unlabeled data in constructing the training set, and in a new method of generating an ensemble classifier model using bagging.

An advantage of the present invention is that unlabeled data, which is easy to obtain in many applications, can be used with minimal modification to existing methods. Another advantage of the present invention is its broad applicability: it can be used in place of many proven semi-supervised algorithms and supervised learning classifiers.

To evaluate the advantages of the approach according to the present invention, it is compared with SemiBoost [1], an existing semi-supervised learning model that uses the boosting method. On three data sets, the classification accuracy of the present model is shown to be improved over the existing method.

In the past, semi-supervised learning methods were proposed to address the fact that, in classification problems, unlabeled data is easy to obtain while labeled data is difficult to obtain. Recent research applies ensemble classifiers to semi-supervised learning, most commonly via bagging and boosting. Bagging, the classifier combination method used in the present invention, has the key feature that the base classifiers making up the ensemble are built independently of one another; boosting, by contrast, has the characteristic that earlier classifiers affect the training of later ones.

In the present invention, a method of increasing classification accuracy using unlabeled data includes: a first step of selecting some unlabeled data from an unlabeled data set and predicting labels for the selected unlabeled data; a second step of adding the unlabeled data whose predicted labels have high confidence to the training set and then training a supervised base classifier; and an ensemble classifier model generated by combining the classifiers obtained by repeating the first and second steps.

The first step in utilizing the unlabeled data is to predict labels for the unlabeled data, as indicated in FIG. 1 (S601). The second step uses the unlabeled data with the highest confidence among the unlabeled data with predicted labels to increase the predictive power of the classifiers: as shown in FIG. 1, after the unlabeled data with predicted labels are included in the training data set, one base classifier is trained on that training data (S602). In the next step, the final model is generated by combining the base classifiers created by repeating this process (S603). A supervised learning classifier is used as the base classifier. The steps are described in sections a, b, and c.

a. Step 1: Predict Labels for Unlabeled Data

In the machine learning domain, semi-supervised classification can provide a solution to this step. Typical classifiers use only labeled data for training. In contrast, semi-supervised classification uses unlabeled data together with labeled data to train better classifiers: the unlabeled data is used to modify or re-rank the hypotheses obtained from the labeled data.

In general, there are many semi-supervised methods, and the present invention has the advantage that any of them can be used selectively. An example is given here using a semi-supervised method called LNP. Linear Neighborhood Propagation (LNP) [3] is a graph-based method that propagates labels: it predicts the labels of unlabeled data from labeled data using the proximity between neighboring data points. A weight matrix representing the association between each data sample and its neighbors is constructed, and the labels of the unlabeled data are predicted by label propagation over this matrix. The difference from the method of [3] is that in the present invention labels are predicted using only some of the unlabeled data, rather than all of it.
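The graph-based propagation described above can be sketched as follows. This is a minimal illustration, not the patent's exact procedure: row-normalized inverse-distance k-nearest-neighbor weights stand in for the quadratic-program reconstruction weights of the original LNP, and the function name and parameters (`k`, `alpha`, `n_iter`) are illustrative.

```python
import numpy as np

def lnp_propagate(X, y, n_labeled, k=5, alpha=0.99, n_iter=100):
    """Graph-based label propagation in the spirit of LNP.

    X: (n, d) points; the first n_labeled rows carry labels y in {0..C-1}.
    Returns (predicted labels, confidences) for all n points.
    """
    n = len(X)
    # Squared pairwise distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # k-nearest-neighbor weight matrix, each row normalized to sum to 1
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]         # skip self at position 0
        w = 1.0 / (np.sqrt(d2[i, nbrs]) + 1e-12)  # inverse-distance weights
        W[i, nbrs] = w / w.sum()
    # Iterate F <- alpha * W F + (1 - alpha) * Y, keeping the seeds anchored
    C = int(y.max()) + 1
    Y = np.zeros((n, C))
    Y[np.arange(n_labeled), y] = 1.0
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (W @ F) + (1 - alpha) * Y
    conf = F.max(1) / (F.sum(1) + 1e-12)          # relative confidence per point
    return F.argmax(1), conf
```

The returned confidences are what the second step thresholds on when choosing which predicted points enter the training set.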

b. Step 2: Training the Base Classifier by Including Predicted Unlabeled Data in the Training Set

In the second step, based on the reliability of the predicted labels of the unlabeled data, the unlabeled data with high reliability are added to the training set in such a way that the class labels remain balanced. The measure of reliability is the probability that, when labels are propagated from the labeled data to an unlabeled point, the point receives that label; this probability can be read off as the weight of the edge connecting two points on the graph. By selecting the unlabeled data with high reliability, a training set is constructed from the few labeled data and the predicted unlabeled data, and the base classifier is trained as shown in step S602 of FIG. 1.
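The class-balanced selection of high-reliability predictions described above can be sketched as follows; the function name, `per_class` parameter, and array names are illustrative, not taken from the patent.

```python
import numpy as np

def select_confident(X_unlab, pred_labels, confidences, per_class):
    """Keep the `per_class` most confident predictions of each class,
    so the class labels in the augmented training set stay balanced."""
    keep = []
    for c in np.unique(pred_labels):
        idx = np.where(pred_labels == c)[0]
        # Indices of this class, sorted by descending confidence
        top = idx[np.argsort(confidences[idx])[::-1][:per_class]]
        keep.extend(top.tolist())
    keep = np.array(sorted(keep))
    return X_unlab[keep], pred_labels[keep]
```

The selected points and their predicted labels are then appended to the labeled data before one base classifier is trained.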

c. Step 3: Generate the combined classification model after training base classifiers through the set number of repetitions

The third step generates the final combined model by combining the base classifiers trained through the first and second steps. The final model uses a bagging method; a characteristic feature of bagging is that each classifier is generated independently, which has the advantage that the generation of the classifiers can be processed in parallel to speed up the computation.
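Because the base classifiers do not depend on one another, their training parallelizes trivially. In the sketch below a toy nearest-centroid classifier stands in for the (interchangeable) supervised base classifier, and stdlib threads provide the parallelism; all names are illustrative, and each classifier here trains on its own bootstrap sample as ordinary bagging does.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def train_centroid_classifier(X, y):
    """Toy stand-in for the supervised base classifier: nearest centroid."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroid(model, X):
    classes, centroids = model
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return classes[d2.argmin(axis=1)]

def bagged_ensemble(X, y, n_classifiers=5, seed=0):
    """Train each base classifier on its own bootstrap sample.

    No classifier depends on another, so training runs in parallel
    threads -- the independence that bagging permits.
    """
    rng = np.random.default_rng(seed)
    samples = [rng.integers(0, len(X), len(X)) for _ in range(n_classifiers)]
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(
            lambda idx: train_centroid_classifier(X[idx], y[idx]), samples))
    return models
```

A boosting ensemble could not be parallelized this way, since each of its classifiers is trained on a reweighting produced by the previous one.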

d. Output form of the generated model

The output of the combined model is generated using the majority voting method, one of several combination methods: the label selected by the largest number of base classifiers is output. FIG. 2 illustrates the process.
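Majority voting over the independent base classifiers can be sketched as follows; the `votes` matrix layout is an assumption for illustration (one row per classifier, one column per sample).

```python
import numpy as np

def majority_vote(votes):
    """Combine base-classifier outputs.

    votes: (n_classifiers, n_samples) array of predicted labels.
    Returns, per sample, the label chosen by the most classifiers
    (ties resolved in favor of the smallest label).
    """
    votes = np.asarray(votes)
    out = []
    for col in votes.T:                     # one column per sample
        vals, counts = np.unique(col, return_counts=True)
        out.append(vals[counts.argmax()])
    return np.array(out)
```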

Claims (6)

1. A method of constructing an ensemble model, comprising:
a first step of selecting some unlabeled data from an unlabeled data set and predicting labels for the selected unlabeled data by semi-supervised learning;
a second step of training base classifiers after combining the unlabeled data with predicted labels and the labeled data into a training set; and
a third step of building the ensemble model by combining the base classifiers created through iterations of the first and second steps,
wherein the semi-supervised learning method and the base classifier used in constructing the ensemble model are replaceable.
2. The method of claim 1, wherein in the first step, the label prediction of the unlabeled data uses reliability-based data obtained by modifying Linear Neighborhood Propagation (LNP), a semi-supervised method.
3. The method of claim 1, wherein in the first step, a semi-supervised method other than LNP (Linear Neighborhood Propagation) is used.
4. The method of claim 1, wherein in the third step, various supervised learning classifiers are used as the base classifier.
5. The method of claim 1, wherein the semi-supervised method and the supervised method are used in combination.
KR1020110129967A 2011-12-07 2011-12-07 Combination of multiple classifiers using bagging in semi-supervised learning KR20130063565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020110129967A KR20130063565A (en) 2011-12-07 2011-12-07 Combination of multiple classifiers using bagging in semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020110129967A KR20130063565A (en) 2011-12-07 2011-12-07 Combination of multiple classifiers using bagging in semi-supervised learning

Publications (1)

Publication Number Publication Date
KR20130063565A true KR20130063565A (en) 2013-06-17

Family

ID=48860827

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020110129967A KR20130063565A (en) 2011-12-07 2011-12-07 Combination of multiple classifiers using bagging in semi-supervised learning

Country Status (1)

Country Link
KR (1) KR20130063565A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110249341A (en) * 2017-02-03 2019-09-17 皇家飞利浦有限公司 Classifier training
KR102033136B1 (en) 2019-04-03 2019-10-16 주식회사 루닛 Method for machine learning based on semi-supervised learning and apparatus thereof
KR20190117969A (en) * 2018-04-09 2019-10-17 주식회사 뷰노 Method for semi supervised reinforcement learning using data with label and data without label together and apparatus using the same
CN110348241A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter under data sharing strategy cooperates with prognosis prediction system
KR20190135329A (en) * 2018-05-28 2019-12-06 삼성에스디에스 주식회사 Computing system and method for data labeling thereon
WO2020004867A1 (en) * 2018-06-29 2020-01-02 주식회사 디플리 Machine learning method and device enabling automatic labeling
KR20200094938A (en) * 2019-01-31 2020-08-10 동서대학교 산학협력단 Data imbalance solution method using Generative adversarial network
KR20210012761A (en) * 2019-07-26 2021-02-03 주식회사 수아랩 Method for managing data
KR20210012762A (en) * 2019-07-26 2021-02-03 주식회사 수아랩 Method to decide a labeling priority to a data
CN112835797A (en) * 2021-02-03 2021-05-25 杭州电子科技大学 Metamorphic relation prediction method based on program intermediate structure characteristics
US11017294B2 (en) 2016-12-16 2021-05-25 Samsung Electronics Co., Ltd. Recognition method and apparatus
CN112989841A (en) * 2021-02-24 2021-06-18 中国搜索信息科技股份有限公司 Semi-supervised learning method for emergency news identification and classification

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11017294B2 (en) 2016-12-16 2021-05-25 Samsung Electronics Co., Ltd. Recognition method and apparatus
CN110249341A (en) * 2017-02-03 2019-09-17 皇家飞利浦有限公司 Classifier training
KR20190117969A (en) * 2018-04-09 2019-10-17 주식회사 뷰노 Method for semi supervised reinforcement learning using data with label and data without label together and apparatus using the same
KR20190135329A (en) * 2018-05-28 2019-12-06 삼성에스디에스 주식회사 Computing system and method for data labeling thereon
WO2020004867A1 (en) * 2018-06-29 2020-01-02 주식회사 디플리 Machine learning method and device enabling automatic labeling
KR20200002149A (en) * 2018-06-29 2020-01-08 주식회사 디플리 Method and Device for Machine Learning able to automatically-label
KR20200094938A (en) * 2019-01-31 2020-08-10 동서대학교 산학협력단 Data imbalance solution method using Generative adversarial network
KR102033136B1 (en) 2019-04-03 2019-10-16 주식회사 루닛 Method for machine learning based on semi-supervised learning and apparatus thereof
CN110348241A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter under data sharing strategy cooperates with prognosis prediction system
KR20210012762A (en) * 2019-07-26 2021-02-03 주식회사 수아랩 Method to decide a labeling priority to a data
KR20210012761A (en) * 2019-07-26 2021-02-03 주식회사 수아랩 Method for managing data
CN112835797A (en) * 2021-02-03 2021-05-25 杭州电子科技大学 Metamorphic relation prediction method based on program intermediate structure characteristics
CN112835797B (en) * 2021-02-03 2024-03-29 杭州电子科技大学 Metamorphic relation prediction method based on program intermediate structure characteristics
CN112989841A (en) * 2021-02-24 2021-06-18 中国搜索信息科技股份有限公司 Semi-supervised learning method for emergency news identification and classification

Similar Documents

Publication Publication Date Title
KR20130063565A (en) Combination of multiple classifiers using bagging in semi-supervised learning
Ye et al. Bp-transformer: Modelling long-range context via binary partitioning
Jia et al. Neural extractive summarization with hierarchical attentive heterogeneous graph network
Huang et al. Unignn: a unified framework for graph and hypergraph neural networks
Luo et al. Semi-supervised neural architecture search
De Haan et al. Patterns in transitions: Understanding complex chains of change
Agrawal et al. An online tool for predicting fatigue strength of steel alloys based on ensemble data mining
Yigit A weighting approach for KNN classifier
Amiri et al. Adventures in data analysis: A systematic review of Deep Learning techniques for pattern recognition in cyber-physical-social systems
Zhang et al. Few-Shot Audio Classification with Attentional Graph Neural Networks.
US20210110273A1 (en) Apparatus and method with model training
JP2013196680A (en) Concept recognition method and concept recognition device based on co-learning
Gao et al. Heterogeneous graph neural architecture search
McLeod et al. A modular system for the harmonic analysis of musical scores using a large vocabulary
CN105760929A (en) Layered global optimization method based on DFP algorithm and differential evolution
Wang et al. Cross-modal graph with meta concepts for video captioning
CN117237621A (en) Small sample semantic segmentation algorithm based on pixel-level semantic association
Goswami et al. A new evaluation measure for feature subset selection with genetic algorithm
Feng et al. Prototypical networks relation classification model based on entity convolution
Zhou et al. Hierarchical knowledge propagation and distillation for few-shot learning
Zhou et al. Lie detection from speech analysis based on k–svd deep belief network model
Saunders et al. Automated Machine Learning for Positive-Unlabelled Learning
Wei et al. Negatives make a positive: An embarrassingly simple approach to semi-supervised few-shot learning
Hillel New perspectives on the performance of machine learning classifiers for mode choice prediction
Tan et al. Dual-gradients localization framework with skip-layer connections for weakly supervised object localization

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application