KR102201198B1

KR102201198B1 - Apparatus and method for classifying data by using machine learning and ensemble method

Info

Publication number: KR102201198B1
Application number: KR1020200061841A
Authority: KR
Inventors: 김한준; 이수은
Original assignee: 서울시립대학교 산학협력단
Priority date: 2020-05-22
Filing date: 2020-05-22
Publication date: 2021-01-11

Abstract

An apparatus for classifying data using machine learning and an ensemble method may comprise: a collection unit that collects training data; a generation unit that generates an ensemble network for each of a plurality of layers and generates a classification model including the ensemble networks generated for the respective layers; a learning unit that trains the classification model to classify the training data by inputting the training data into the classification model; and a classification unit that classifies data to be classified by inputting that data into the trained classification model. Accordingly, the accuracy of automatic classification and classification prediction can be improved.

Description

Device and method for classifying data using machine learning and ensemble techniques {APPARATUS AND METHOD FOR CLASSIFYING DATA BY USING MACHINE LEARNING AND ENSEMBLE METHOD}

The present invention relates to an apparatus and method for classifying data using an ensemble technique and machine learning.

Machine learning refers to techniques for creating a model that can solve a specific problem through a process of generalizing from input data. Creating a model with excellent performance requires high-quality training data and a learning algorithm for the generalization process.

One technique for improving model performance is the ensemble technique, which combines multiple weak learners to create a single strong learner.

Ensemble techniques include bagging, which uses voting; boosting, which uses weighted voting; and stacking, which uses the predicted values obtained from single models as training data.

Referring to FIG. 1A, the bagging technique samples the training data to construct N sub-training datasets of equal size, trains N models by feeding the N sub-training datasets to the N models one-to-one, and produces the final classification (or prediction) by voting. Bagging is fast because the individual models can learn their sub-training data independently and in parallel, and it improves performance because aggregating the diverse individual models reduces the variance of the combined prediction.
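The bagging procedure described above can be illustrated with a minimal sketch. The one-dimensional toy data, the mean-threshold learner, and the names `bootstrap_samples`, `train_threshold_model`, and `bagging_predict` are assumptions made for illustration; they do not come from the patent.

```python
import random
from collections import Counter

def bootstrap_samples(data, n_models, seed=0):
    """Draw n_models samples with replacement, each the same size as data."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]

def train_threshold_model(sample):
    """Toy learner (an assumption): threshold halfway between the class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # Degenerate sample with one class only: always predict that class.
        label = 0 if zeros else 1
        return lambda x: label
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > threshold)

def bagging_predict(models, x):
    """Final classification by unweighted majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
samples = bootstrap_samples(data, n_models=5)
models = [train_threshold_model(sample) for sample in samples]
print(bagging_predict(models, 0.05), bagging_predict(models, 0.95))
```

Each model sees only its own resampled data, so the models could be trained in parallel; the vote is what combines them.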

Referring to FIG. 1B, unlike bagging, the boosting technique does not treat the generated models equally; the weight assigned to each model is reflected in the vote for the final classification. In the initial stage, a first model is trained on all of the training data, and the training data is then classified with that first model. Weights for the training data are determined from this classification result: training data that the first model classified correctly receives a low weight, and misclassified training data receives a high weight. New models are generated successively by randomly sampling the reweighted training data, and classification is finally performed by weighted voting over the resulting models, which differ in reliability. Boosting improves model performance by reducing bias.
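As a rough illustration of the reweighting and weighted voting described above, the sketch below uses the AdaBoost update rule. AdaBoost is an assumption chosen for concreteness (the text does not name a specific boosting algorithm), and the labels are encoded as -1 and +1.

```python
import math

def update_weights(weights, y_true, y_pred):
    """One AdaBoost-style round: raise weights of misclassified samples.

    Returns the normalized sample weights and the model's vote weight (alpha).
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for an incorrect one.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [w / total for w in raw], alpha

def weighted_vote(alphas, predictions):
    """Final classification: each model votes with its own weight."""
    score = sum(a * p for a, p in zip(alphas, predictions))
    return 1 if score >= 0 else -1

y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]          # the first model misclassifies sample 1
weights = [0.25] * 4
new_weights, alpha = update_weights(weights, y_true, y_pred)
print(new_weights, alpha)
```

After one round the misclassified sample carries half of the total weight, so the next model sampled from the reweighted data concentrates on it.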

Referring to FIG. 1C, the stacking technique generates single models using two or more learning algorithms, and then generates a meta model (or meta classifier) using the predicted values obtained from those single models as training data.

That is, the stacking technique takes the predictions of single models trained on the input data as training data and classifies them through a meta model. Its structure has two layers: a first layer composed of the single models and a second layer composed of the meta model. Because stacking assumes that the individual models are independent, it is robust to outliers and achieves a misclassification rate lower than that of a single model, which gives it excellent performance.
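The two-layer structure can be sketched as follows. The threshold base learners, the lookup-table meta model, and the toy data are illustrative assumptions. For brevity the meta model is trained on in-sample base predictions; practical stacking usually uses out-of-fold predictions instead.

```python
from collections import Counter, defaultdict

def fit_threshold(index):
    """Toy base learner (an assumption): threshold one feature at the midpoint of the class means."""
    def fit(X, y):
        zeros = [row[index] for row, t in zip(X, y) if t == 0]
        ones = [row[index] for row, t in zip(X, y) if t == 1]
        threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return lambda row: int(row[index] > threshold)
    return fit

def fit_meta(meta_X, y):
    """Toy meta model (an assumption): majority label seen for each tuple of base predictions."""
    table = defaultdict(Counter)
    for row, t in zip(meta_X, y):
        table[tuple(row)][t] += 1
    fallback = Counter(y).most_common(1)[0][0]
    def predict(row):
        key = tuple(row)
        return table[key].most_common(1)[0][0] if key in table else fallback
    return predict

X = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.2), (0.7, 0.7), (0.8, 0.3), (0.9, 0.1)]
y = [0, 0, 0, 1, 1, 1]

# First layer: single models trained on the input data.
base_models = [fit_threshold(0)(X, y), fit_threshold(1)(X, y)]
# Their predictions become the training data of the second layer.
meta_X = [[model(row) for model in base_models] for row in X]
meta_model = fit_meta(meta_X, y)

def stacked_predict(row):
    return meta_model([model(row) for model in base_models])

print(stacked_predict((0.85, 0.6)), stacked_predict((0.15, 0.4)))
```

The second base model is deliberately weak on this data; the meta model learns from the pattern of base predictions which ones to trust.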

Korean Registered Patent Publication No. 10-1731626 (registered April 24, 2017)

An object of the present invention is to generate a classification model including an ensemble network generated for each of a plurality of layers, and to train the generated classification model to classify data.

However, the technical problems to be addressed by the present embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above-described technical object, an apparatus for classifying data using machine learning and an ensemble technique according to a first aspect of the present invention may include: a collection unit that collects training data; a generation unit that generates an ensemble network for each of a plurality of layers and generates a classification model including the ensemble networks generated for the respective layers; a learning unit that trains the classification model to classify the training data by inputting the training data into the classification model; and a classification unit that classifies data to be classified by inputting that data into the trained classification model.

A method of classifying data using machine learning and an ensemble technique, performed by a data classification apparatus according to a second aspect of the present invention, may include: collecting training data; generating an ensemble network for each of a plurality of layers and generating a classification model including the ensemble networks generated for the respective layers; training the classification model to classify the training data by inputting the training data into the classification model; and classifying data to be classified by inputting that data into the trained classification model.

The above-described means for solving the problem are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description of the invention.

According to any one of the above-described means for solving the problem, the present invention generates a classification model including an ensemble network generated for each of a plurality of layers and trains the model to classify data, thereby improving the accuracy of automatic classification and classification prediction.

This overcomes the performance limitation of the conventional two-layer stacking ensemble technique and provides stable automatic classification performance. As a result, the invention can be used in big data analysis platforms across various domains, which broadens its marketability.

FIGS. 1A to 1D are diagrams for explaining conventional data classification methods.
FIG. 2 is a block diagram of an apparatus for classifying data using machine learning and an ensemble technique according to an embodiment of the present invention.
FIGS. 3A to 3C are diagrams for explaining a method of generating a classification model using machine learning and an ensemble technique according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of classifying data using machine learning and an ensemble technique according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily practice the invention. However, the present invention may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar reference numerals denote similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, specific details for implementing the present invention are described with reference to the accompanying configuration diagrams and processing flowcharts.

The present invention generates a classification model based on a multiple stacking ensemble technique, in which each layer is built from ensemble networks trained by a stacking technique that takes the form of a deep neural network (DNN).

By increasing the number of layers of the classification model, in a manner similar to a DNN, which has two or more hidden layers in an artificial neural network architecture (see FIG. 1D), the present invention can improve the classification performance of the classification model.

That is, the conventional two-layer stacking ensemble technique is extended to N layers, and each layer is built from ensemble networks trained by the stacking ensemble technique.

Accordingly, when generating the N layers, the output values (classification values) of the trained ensemble network in the current layer are used as the input data of the ensemble network in the next layer, and this process of training the next layer's ensemble network is repeated.
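The layer-by-layer training loop described above might look like the following sketch. The threshold learner, the three-layer layout, and the names `train_stack` and `predict_stack` are assumptions for illustration, and the final layer is given a single model so that the stack emits one classification value.

```python
def fit_threshold(index):
    """Toy learner (an assumption): threshold one input at the midpoint of the class means."""
    def fit(X, y):
        zeros = [row[index] for row, t in zip(X, y) if t == 0]
        ones = [row[index] for row, t in zip(X, y) if t == 1]
        mean0, mean1 = sum(zeros) / len(zeros), sum(ones) / len(ones)
        threshold = (mean0 + mean1) / 2
        sign = 1 if mean1 >= mean0 else -1   # orientation of the positive class
        return lambda row: int(sign * (row[index] - threshold) > 0)
    return fit

def train_stack(layers, X, y):
    """Train layer by layer; each layer's outputs become the next layer's inputs."""
    trained, inputs = [], X
    for fits in layers:
        models = [fit(inputs, y) for fit in fits]
        trained.append(models)
        inputs = [[model(row) for model in models] for row in inputs]
    return trained

def predict_stack(trained, row):
    """Pass one sample through every trained layer in order."""
    for models in trained:
        row = [model(row) for model in models]
    return row

X = [[0.1, 0.9, 0.2], [0.2, 0.7, 0.1], [0.3, 0.8, 0.6],
     [0.7, 0.3, 0.4], [0.8, 0.2, 0.9], [0.9, 0.1, 0.8]]
y = [0, 0, 0, 1, 1, 1]
layers = [
    [fit_threshold(0), fit_threshold(1), fit_threshold(2)],  # layer 1: three learners
    [fit_threshold(0), fit_threshold(1)],                    # layer 2: two learners
    [fit_threshold(0)],                                      # layer 3: final output
]
trained = train_stack(layers, X, y)
print(predict_stack(trained, [0.85, 0.15, 0.3]))
```

From the second layer onward, each learner's `index` refers to the position of a previous-layer model's output rather than to an original feature.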

FIG. 2 is a block diagram of a data classification apparatus 100 using machine learning and an ensemble technique according to an embodiment of the present invention.

Referring to FIG. 2, the data classification apparatus 100 may include a collection unit 200, a generation unit 210, a learning unit 220, an evaluation unit 230, and a classification unit 240. However, the data classification apparatus 100 shown in FIG. 2 is only one implementation example of the present invention, and various modifications are possible based on the components shown in FIG. 2.

Hereinafter, FIG. 2 is described together with FIGS. 3A to 3C.

The collection unit 200 may collect training data for training the classification model.

The generation unit 210 may generate an ensemble network for each of a plurality of layers, and may generate a classification model including the ensemble networks generated for the respective layers.

The classification performance of the classification model may vary with the number of layers in which ensemble networks are built and with the type and number of classification algorithms used in the ensemble network of each layer.

For this reason, the generation unit 210 may determine the number of layers in which ensemble networks are to be built and the number of classification algorithms to be included in each layer.

The generation unit 210 may then generate the classification model based on the determined number of layers and the determined number of classification algorithms per layer.

Specifically, based on the determined number of layers and the determined number of classification algorithms per layer, the generation unit 210 may generate, for each layer, an ensemble network including a plurality of classification algorithms for ensemble learning, and may generate a classification model including the ensemble network generated for each layer.

Here, ensemble learning is a machine learning technique that combines multiple models, such as decision trees, to achieve better performance than a single model. Its core idea is to combine several weak classifiers into a strong classifier, which improves learning accuracy.

The plurality of classification algorithms may include, for example, at least one of a support vector machine (SVM) algorithm, a decision tree algorithm, a random forest algorithm, a K-nearest neighbor (KNN) algorithm, and a naive Bayes algorithm. The SVM algorithm may include a linear kernel variant and an RBF kernel variant.

The generation unit 210 may select k classification algorithms from an ensemble network that includes m classification algorithms, and may build the plurality of layers by generating further ensemble networks that include the selected k classification algorithms.

Here, because k classification algorithms are selected arbitrarily from the m classification algorithms, the number of possible combinations of classification algorithms is $\binom{m}{k}$. The generation unit 210 may generate an ensemble network by arbitrarily selecting one of all the possible combinations, and may determine the layer in which the generated ensemble network is to be placed.
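Choosing k algorithms out of m is a standard combination count, which can be checked with a short sketch. The algorithm names are taken from the description; k = 4 and the fixed random seed are assumptions chosen for illustration.

```python
import math
import random
from itertools import combinations

algorithms = ["SVM_Linear", "SVM_RBF", "Decision Tree",
              "Random Forest", "KNN", "Naive Bayes"]
m, k = len(algorithms), 4

# Enumerate every way of choosing k of the m algorithms.
all_combinations = list(combinations(algorithms, k))
print(len(all_combinations), math.comb(m, k))

# Arbitrarily pick one combination to form an ensemble network.
rng = random.Random(0)
chosen = rng.choice(all_combinations)
print(chosen)
```

With m = 6 and k = 4 there are 15 candidate ensemble networks for a layer.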

FIG. 3A shows a classification model 30 with three layers.

Referring to FIG. 3A, the generation unit 210 may build, in the first layer, a first ensemble network 30-1 including a plurality of classification algorithms (for example, six different classification algorithms).

Next, the generation unit 210 may randomly select one or more classification algorithms (for example, four) from those included in the first ensemble network 30-1, generate a second ensemble network 30-3 including the selected algorithms, and build it in the second layer. Each of the classification algorithms included in the second ensemble network 30-3 may be a different algorithm.

Finally, the generation unit 210 may randomly select one or more classification algorithms (for example, five) from those included in the first ensemble network 30-1, generate a third ensemble network 30-5 including the selected algorithms, and build it in the third layer.

That is, the present invention generates the ensemble network for the next layer from k classification algorithms selected arbitrarily from the m classification algorithms, and can repeat the stacking process while adjusting the number of layers built.
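The three-layer layout of FIG. 3A (six algorithms, then four, then five) can be sketched as a random selection from the first layer's algorithms. The function name `build_layers` and the fixed seed are assumptions for illustration.

```python
import random

FIRST_LAYER = ["SVM_Linear", "SVM_RBF", "Decision Tree",
               "Random Forest", "KNN", "Naive Bayes"]

def build_layers(first_layer, sizes, seed=0):
    """Layer 1 uses every algorithm; each later layer samples k distinct ones."""
    rng = random.Random(seed)
    layers = [list(first_layer)]
    for k in sizes:
        layers.append(rng.sample(first_layer, k))  # sampling without replacement
    return layers

layers = build_layers(FIRST_LAYER, sizes=[4, 5])
for depth, names in enumerate(layers, start=1):
    print(f"layer {depth}: {names}")
```

Changing the length of `sizes` changes the number of layers, which corresponds to repeating the stacking process with a different depth.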

In another embodiment, the generation unit 210 may select m-1 of the m classification algorithms and build the plurality of layers by generating further ensemble networks that include the selected m-1 classification algorithms.

Repeating this process, the number of possible combinations of classification algorithms is $\binom{m}{m-1} = m$, so m layers composed of m ensemble networks can be generated. Accordingly, the number of ensemble networks equals the number of classification algorithms, and the maximum number of layers that can be generated also equals the number of classification algorithms.
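Selecting m-1 algorithms out of m amounts to leaving exactly one algorithm out, so there are m such subsets. The sketch below checks this for the six algorithms named in the description; the variable names are illustrative assumptions.

```python
import math
from itertools import combinations

algorithms = ["SVM_Linear", "SVM_RBF", "Decision Tree",
              "Random Forest", "KNN", "Naive Bayes"]
m = len(algorithms)

# Every subset of size m-1 omits exactly one algorithm.
subsets = list(combinations(algorithms, m - 1))
excluded = [next(a for a in algorithms if a not in subset) for subset in subsets]

print(len(subsets), math.comb(m, m - 1))
for subset, left_out in zip(subsets, excluded):
    print(f"excluding {left_out}: {subset}")
```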

While the generation unit 210 is generating the classification model, the learning unit 220 may train the sub-classification models included in the classification model. Here, a sub-classification model may include at least one layer.

For example, the learning unit 220 may train a first sub-classification model that includes only the first layer. Then, once the generation unit 210 generates the second layer, the learning unit 220 may train a second sub-classification model that includes the first and second layers.

After the generation unit 210 generates the classification model, the learning unit 220 may train the classification model to classify the training data by inputting the training data into the classification model.

Referring to FIG. 3A, the learning unit 220 may input the first output values (the classified training data) output by each of the algorithms in the first ensemble network 30-1, which receives the training data 301, into the second ensemble network 30-3, and may train the second ensemble network 30-3 to classify the first output values with its one or more classification algorithms.

The learning unit 220 may then input the second output values (the reclassified training data) output by the one or more classification algorithms in the second ensemble network 30-3 into the third ensemble network 30-5, and may train the third ensemble network 30-5 to classify the second output values with its one or more classification algorithms.

That is, the first ensemble network 30-1, having received the training data 301, may output first output values as the results classified by each of its algorithms. These first output values are fed to the second ensemble network 30-3 as its input data.

The second ensemble network 30-3 may output second output values as the results classified by its one or more classification algorithms. These second output values are fed to the third ensemble network 30-5 as its input data.

The third ensemble network 30-5 may output the final output value 303, the final classification produced by its one or more classification algorithms.

Returning to FIG. 2, the evaluation unit 230 may evaluate, through logistic regression analysis, the classification performance of each of the ensemble networks generated for the respective layers.

Referring to FIG. 3B, the evaluation unit 230 may perform logistic regression analysis on the output values (first to sixth output values) output by each of the classification algorithms included in the first ensemble network built in the first layer (SVM_Linear, SVM_RBF, Decision Tree, Random Forest, KNN, and Naive Bayes), and may evaluate the classification performance of the first ensemble network based on the result of that analysis.

Likewise, the evaluation unit 230 may perform logistic regression analysis on the output values output by each of the one or more classification algorithms included in the second ensemble network built in the second layer (not shown), and may evaluate the classification performance of the second ensemble network based on the result of that analysis.

The evaluation unit 230 may also perform logistic regression analysis on the output values output by each of the one or more classification algorithms included in the third ensemble network built in the third layer (not shown), and may evaluate the classification performance of the third ensemble network based on the result of that analysis.
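One possible reading of this evaluation step is to fit a logistic regression on the outputs of a layer's classification algorithms and use its accuracy as the layer's score. The sketch below follows that reading; the gradient-descent fit, the hyperparameters, the toy output values, and the use of accuracy as the metric are all assumptions, since the description does not specify them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, t in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) - t
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, X, y):
    """Share of samples whose predicted probability falls on the correct side of 0.5."""
    preds = [int(sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) >= 0.5) for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Hypothetical outputs of three classification algorithms in one layer.
layer_outputs = [[0, 0, 1], [0, 0, 0], [0, 1, 0],
                 [1, 1, 0], [1, 1, 1], [1, 0, 1]]
labels = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(layer_outputs, labels)
score = accuracy(w, b, layer_outputs, labels)
print(w, b, score)
```

The fitted weights also indicate how much each algorithm's output contributes, which is one way to compare the ensemble networks of different layers.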

According to an embodiment of the present invention, as shown in FIG. 3C, the classification model includes m layers, and each layer includes m classification algorithms. The performance of the classification model is found to be high when the model is generated by selecting m-1 of the m classification algorithms.

For example, the first ensemble network built in the first layer includes six classification algorithms (SVM_Linear, SVM_RBF, Decision Tree, Random Forest, KNN, and Naive Bayes). A second ensemble network is built in the second layer from a combination of the five algorithms that remain after excluding SVM_Linear; it still includes six classification algorithms, so at least one algorithm may appear more than once. Likewise, a third ensemble network is built in the third layer from the five algorithms that remain after excluding SVM_RBF, again with six classification algorithms. Repeating this layer generation process yields a classification model with six layers, each containing an ensemble network of six classification algorithms.

The classification unit 240 may classify data to be classified by inputting that data into the trained classification model.

In this way, the present invention can provide a multiple stacking ensemble learning technique that improves on the conventional stacking technique. Because an ensemble network is built in each layer, the learning structure of the multiple stacking ensemble technique resembles a deep learning architecture and can provide better classification performance than conventional stacking.

Those skilled in the art will readily understand that the collection unit 200, the generation unit 210, the learning unit 220, the evaluation unit 230, and the classification unit 240 may each be implemented separately, or that one or more of them may be integrated.

FIG. 4 is a flowchart illustrating a method of classifying data using machine learning and an ensemble technique according to an embodiment of the present invention.

Referring to FIG. 4, in step S401, the data classification apparatus 100 may collect training data.

In step S403, the data classification apparatus 100 may generate an ensemble network for each of a plurality of layers and generate a classification model including the ensemble network generated for each of the plurality of layers.

In step S405, the data classification apparatus 100 may train the classification model to classify the training data by inputting the training data into the classification model.

In step S407, the data classification apparatus 100 may classify data to be classified by inputting that data into the trained classification model.

In the above description, steps S401 to S407 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
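
Steps S401 to S407 can be sketched end to end as follows. This is a toy illustration under stated assumptions rather than the implementation of the invention: the classes `NearestCentroid`, `OneNearestNeighbor`, and `MultiStackingModel`, the six-sample data set, and the majority vote over the last layer are all hypothetical, and the two simple learners stand in for the SVM, Decision Tree, and other algorithms named in the description. Each layer is trained on the outputs of the previous layer, with two learners in the first layer and one in the second.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


class NearestCentroid:
    """Toy learner: predicts the class whose mean feature vector is closest."""

    def fit(self, X, y):
        counts, sums = Counter(y), {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        return [
            min(self.centroids, key=lambda c: squared_distance(row, self.centroids[c]))
            for row in X
        ]


class OneNearestNeighbor:
    """Toy learner: predicts the label of the closest training sample."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        return [
            self.y[min(range(len(self.X)), key=lambda i: squared_distance(row, self.X[i]))]
            for row in X
        ]


class MultiStackingModel:
    """Layers of learners; each layer is fed the outputs of the previous layer."""

    def __init__(self, layers):
        self.layers = layers

    @staticmethod
    def _outputs(layer, features):
        columns = [learner.predict(features) for learner in layer]
        return [list(row) for row in zip(*columns)]  # one output row per sample

    def fit(self, X, y):
        features = X
        for layer in self.layers:
            for learner in layer:
                learner.fit(features, y)
            features = self._outputs(layer, features)
        return self

    def predict(self, X):
        features = X
        for layer in self.layers:
            features = self._outputs(layer, features)
        # Assumed final rule: majority vote over the last layer's outputs.
        return [Counter(row).most_common(1)[0][0] for row in features]


# S401: collect training data (hypothetical two-class toy data).
train_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
train_y = [0, 0, 0, 1, 1, 1]

# S403: generate a classification model with one ensemble network per layer.
model = MultiStackingModel([
    [NearestCentroid(), OneNearestNeighbor()],  # first layer: m = 2 learners
    [NearestCentroid()],                        # second layer: m - 1 = 1 learner
])

# S405: train the classification model on the training data.
model.fit(train_X, train_y)

# S407: classify new data with the trained model.
predictions = model.predict([[0.1, 0.1], [1.1, 1.0]])
print(predictions)
```

For brevity, each layer here is trained on the previous layer's predictions for the same training samples; practical stacking setups commonly use held-out or cross-validated predictions instead to limit overfitting.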

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. In addition, computer-readable media may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the claims set out below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: data classification apparatus
200: collection unit
210: generation unit
220: learning unit
230: evaluation unit
240: classification unit

Claims

An apparatus for classifying data using machine learning and an ensemble technique, the apparatus comprising:
a collection unit for collecting training data;
a generation unit for generating an ensemble network for each of a plurality of layers and for generating a classification model including the ensemble network generated for each of the plurality of layers;
a learning unit for training the classification model to classify the training data by inputting the training data into the classification model; and
a classification unit for classifying data to be classified by inputting that data into the trained classification model,
wherein the generation unit generates, for each of the plurality of layers, the ensemble network including a plurality of classification algorithms for ensemble learning,
wherein the generation unit builds, in a first layer, a first ensemble network including the plurality of classification algorithms, and builds, in a second layer, a second ensemble network including at least one classification algorithm randomly selected from among the plurality of classification algorithms included in the first ensemble network, and
wherein the first ensemble network includes m classification algorithms and the second ensemble network includes m-1 classification algorithms.

(Deleted)

The apparatus of claim 1,
wherein the generation unit determines the number of the plurality of layers and the number of classification algorithms included in each layer.

The apparatus of claim 3,
wherein the generation unit generates the classification model based on the determined number of layers and the determined number of classification algorithms included in each layer.

(Deleted)

The apparatus of claim 1,
wherein the plurality of classification algorithms included in the first ensemble network are different from one another.

The apparatus of claim 1,
wherein the learning unit inputs, into the second ensemble network, a first output value output from each of the plurality of classification algorithms of the first ensemble network into which the training data has been input, and trains the second ensemble network through the at least one classification algorithm included in the second ensemble network.

The apparatus of claim 1,
wherein the generation unit
builds, in a third layer, a third ensemble network including at least one classification algorithm randomly selected from among the plurality of classification algorithms included in the first ensemble network.

The apparatus of claim 8,
wherein the learning unit inputs, into the third ensemble network, a second output value output from the at least one classification algorithm included in the second ensemble network, and trains the third ensemble network through the at least one classification algorithm included in the third ensemble network.

A method of classifying data using machine learning and an ensemble technique, performed by a data classification apparatus, the method comprising:
collecting training data;
generating an ensemble network for each of a plurality of layers, and generating a classification model including the ensemble network generated for each of the plurality of layers;
training the classification model to classify the training data by inputting the training data into the classification model; and
classifying data to be classified by inputting that data into the trained classification model,
wherein generating the classification model includes generating, for each of the plurality of layers, the ensemble network including a plurality of classification algorithms for ensemble learning,
wherein generating the ensemble network includes:
building, in a first layer, a first ensemble network including the plurality of classification algorithms; and
building, in a second layer, a second ensemble network including at least one classification algorithm randomly selected from among the plurality of classification algorithms included in the first ensemble network, and
wherein the first ensemble network includes m classification algorithms and the second ensemble network includes m-1 classification algorithms.

The method of claim 10,
wherein generating the classification model
includes determining the number of the plurality of layers and the number of classification algorithms included in each layer.

The method of claim 11,
wherein generating the classification model
includes generating the classification model based on the determined number of layers and the determined number of classification algorithms included in each layer.