KR102314848B1 - Model generating method and apparatus for easy analysis, and data classifying method and apparatus using the model

Info

Publication number: KR102314848B1
Application number: KR1020210041359A
Authority: KR
Inventors: 최유리; 최정혁
Original assignee: 주식회사 솔리드웨어
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2021-10-19
Also published as: WO2022211180A1

Abstract

Disclosed are a method and apparatus for generating an interpretable model, and a data classification method and apparatus using the model. The model generating apparatus obtains a first classification result by classifying training data using an unsupervised learning model, and obtains a second classification result by classifying the training data using a supervised learning model. It cumulatively adds the second classification result to the training data and repeats the process of obtaining the first and second classification results until the two results are the same within a predefined tolerance, thereby generating the data classification model.

Description

Model generating method and apparatus for easy analysis, and data classifying method and apparatus using the model

An embodiment of the present invention relates to a method and apparatus for generating, in a readily interpretable form, a data classification model that classifies unlabeled data using an unsupervised learning model, and to a method and apparatus for classifying data using the data classification model so generated.

An unsupervised learning model classifies data into a plurality of clusters by considering only the characteristics of the data, with no predefined labels, and is distinguished from a supervised learning model, which is trained to partition data using labeled training data. Clustering algorithms based on unsupervised or semi-supervised learning use geometric properties of the data, such as inter-cluster or intra-cluster distance, as a measure for estimating the quality of clustering or anomaly detection. However, as the dimensionality of the data increases (that is, as the number of meaningful variables grows), the distance between data points can itself become a distorted indicator. In addition, in many cases unsupervised and semi-supervised learning algorithms do not consider the separability of differently classified data, that is, a measure of how easily the different classes can be told apart. Algorithms that cluster by sample density rather than distance also exist, but they too become inaccurate in high dimensions, because the sample size required for a stable estimate of the probability density grows exponentially. Algorithms that use the separability of data directly for anomaly detection (for example, isolation forest or extended isolation forest) also exist, but they must build a high-dimensional, nonlinear hyperplane-type discriminant function using decision trees alone, and therefore require a large number of trees, which can be inefficient.

The technical problem to be solved by an embodiment of the present invention is to provide a method and apparatus for generating a data classification model in which an unsupervised learning model is easy to interpret, by using the classification result of a supervised learning model, and a method and apparatus for classifying data using the data classification model so generated.

To achieve the above technical problem, an example of a model generating method according to an embodiment of the present invention is a model generating method performed by a model generating apparatus, comprising: obtaining a first classification result by classifying training data using an unsupervised learning model; obtaining a second classification result by classifying, using a supervised learning model, the training data labeled on the basis of the first classification result; if the first classification result and the second classification result differ, cumulatively adding the second classification result to the training data and re-performing the step of obtaining the first classification result and the step of obtaining the second classification result; and repeating the steps from obtaining the first classification result through the re-performing step until the first classification result and the second classification result are the same within a predefined tolerance.

To achieve the above technical problem, an example of a data classification apparatus according to an embodiment of the present invention comprises: a data augmentation unit that adds to the input data a prediction result obtained by feeding the input data into a supervised learning model, feeds the input data including the added prediction result into the supervised learning model again, and cumulatively adds the resulting prediction result to the input data, repeating this process a predetermined number of times; and a classification prediction unit that predicts a classification result by feeding the input data, with the prediction results cumulatively added, into an unsupervised learning model.

According to an embodiment of the present invention, the classification performance of an unsupervised learning model can be improved by using the classification result of a supervised learning model. In addition, by cumulatively adding the classification result of the supervised learning model as a variable of the unsupervised learning model's training data during training, the identifiability measure of the unsupervised learning model can be increased.

FIG. 1 is a diagram illustrating an example of a data classification model according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of augmented data according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of a method for generating a data classification model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the configuration of an embodiment of a data classification apparatus according to an embodiment of the present invention; and
FIG. 5 is a diagram illustrating the configuration of an example of a model generation unit according to an embodiment of the present invention.

Hereinafter, a model generating method and apparatus according to an embodiment of the present invention, and a data classification method and apparatus using a model generated thereby, are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an example of a data classification model according to an embodiment of the present invention.

Referring to FIG. 1, the data classification model 100 consists of supervised learning models 110, 112, 114 and an unsupervised learning model 120. In the data classification model 100, at least one supervised learning model 110, 112, 114 is connected in multiple stages, and the output of the last supervised learning model 114 is fed into the unsupervised learning model 120. The unsupervised learning model 120 classifies the input data 130 into a plurality of clusters (for example, a binary classification into two clusters), and the supervised learning model predicts whether the input data 130 belongs to a specific group.

The supervised learning models 110, 112, 114 may be binary classification models that classify data into two groups (that is, a positive group and a negative group) through their prediction results. The supervised learning models 110, 112, 114 may operate by adjusting a decision function that effectively distinguishes samples belonging to the positive group from samples belonging to the negative group. The output of this decision function may be a binary value or a continuous value; for a continuous value, a threshold may be set so that a sample is classified as positive when the output exceeds it. In other words, for n-dimensional data the decision function defines, within the n-dimensional Euclidean space containing the data, the boundary that most accurately separates positive and negative samples. This boundary is called the separating hyperplane.

When the data to be separated is noisy, so that many samples have similar characteristics but different classes, the separating hyperplane may fail to separate the data perfectly. The more noise the data contains, the worse the separating hyperplane, that is, the decision function of the supervised learning model, performs. Conversely, if even a supervised learning model of sufficient complexity cannot separate the training data, the classification scheme itself can be said to be hard to separate.

Accordingly, this embodiment designs the data classification model 100 so that, at each iterative training step, the classification result of the immediately preceding supervised learning model 110, 112, 114 is referenced as a sample feature by the unsupervised learning model 120. At each round, the attempt of the supervised learning model 110, 112, 114 to discriminate the unsupervised training result is added to the original input data 130 as a variable. This progressively injects a kind of identifiability measure into the data, producing a data classification model whose identifiability is considerably improved.

The number of stages of supervised learning models 110, 112, 114 constituting the data classification model 100 is determined during training of the data classification model 100. The process of training and generating the data classification model is described with reference to FIG. 3. In this embodiment, it is assumed that the data classification model 100 has been generated by training the supervised and unsupervised learning models through the method of FIG. 3.

The data classification apparatus feeds the input data 130 into the first supervised learning model 110 to obtain a first prediction result. It generates first augmented data by adding the first prediction result to the input data 130, and feeds this into the second supervised learning model 112 to obtain a second prediction result. It then generates second augmented data by adding the second prediction result to the first augmented data, and feeds this into the third supervised learning model to obtain a third prediction result. Repeating this procedure, the data classification apparatus feeds the (N-1)th augmented data into the Nth supervised learning model 114 to obtain the Nth prediction result.

The first to Nth supervised learning models 110, 112, 114 are all the same supervised learning model. In one embodiment, the first supervised learning model 110 that output the first prediction result is used as the second supervised learning model 112, and the second supervised learning model 112 that output the second prediction result is used as the third supervised learning model. That is, a single supervised learning model may be used N times in succession. In another embodiment, N identical supervised learning models 110, 112, 114 may be arranged in multiple stages as shown in this embodiment and output the first to Nth prediction results respectively.

The data classification apparatus generates Nth augmented data by adding the Nth prediction result of the Nth supervised learning model 114 to the (N-1)th augmented data, and feeds it into the unsupervised learning model 120 to obtain the final classification result for the input data 130.
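The multi-stage flow of FIG. 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the two models are passed in as plain callables, and the names `supervised_predict`, `unsupervised_predict`, and `n_stages` are hypothetical.

```python
from typing import Callable, List, Sequence

Sample = List[float]


def classify(
    samples: Sequence[Sample],
    supervised_predict: Callable[[List[Sample]], List[float]],
    unsupervised_predict: Callable[[List[Sample]], List[int]],
    n_stages: int,
) -> List[int]:
    """Append n_stages supervised predictions to each sample, then cluster."""
    # Copy the rows so the caller's input data is not mutated.
    augmented = [list(row) for row in samples]
    for _ in range(n_stages):
        predictions = supervised_predict(augmented)
        # Cumulatively add this stage's prediction as one more feature.
        augmented = [row + [p] for row, p in zip(augmented, predictions)]
    # The fully augmented data goes to the unsupervised model.
    return unsupervised_predict(augmented)


# Toy stand-ins (assumptions, only to make the sketch executable).
def toy_supervised(rows: List[Sample]) -> List[float]:
    return [1.0 if row[0] > 0.5 else 0.0 for row in rows]


def toy_unsupervised(rows: List[Sample]) -> List[int]:
    return [1 if row[-1] == 1.0 else 0 for row in rows]
```

Calling `classify([[0.2], [0.9]], toy_supervised, toy_unsupervised, n_stages=3)` widens each sample from one feature to four before the final clustering step.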

FIG. 2 is a diagram illustrating an example of augmented data according to an embodiment of the present invention.

Referring to FIG. 2, the data classification apparatus generates augmented data by cumulatively adding the prediction results (that is, the classification results) of the supervised learning model to the input data.

For example, the input data 130 of FIG. 1 may consist of at least one sample 210 and the values of a plurality of variables 220 for each sample. The data classification apparatus generates first augmented data by adding to the input data the first prediction result (Y1) obtained by feeding the input data 130 into the first supervised learning model 110. That is, the first augmented data consists of the values of variables X1 to Xn (220) and the value of variable Y1. By cumulatively adding the prediction results of m passes of the supervised learning model to the input data, the data classification apparatus can generate augmented data 200 that includes the prediction results as variables Y1 to Ym (230).
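The column layout of FIG. 2 can be illustrated with a small sketch. The representation of each sample as a dictionary keyed by column name, and the helper name `add_prediction_column`, are assumptions made for illustration only.

```python
from typing import Dict, List

Row = Dict[str, float]


def add_prediction_column(rows: List[Row], predictions: List[float]) -> List[Row]:
    """Return new rows with the next Y column (Y1, Y2, ...) appended."""
    if len(rows) != len(predictions):
        raise ValueError("one prediction is required per sample")
    if not rows:
        return []
    # Count the Y columns already present to name the next one.
    next_index = 1 + sum(1 for name in rows[0] if name.startswith("Y"))
    column = f"Y{next_index}"
    # Build new dictionaries so the original rows stay unchanged.
    return [{**row, column: p} for row, p in zip(rows, predictions)]
```

Applying the helper twice to rows holding `X1` and `X2` yields rows whose columns are `X1`, `X2`, `Y1`, `Y2`, matching the cumulative layout described above.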

FIG. 3 is a diagram illustrating an example of a method for generating a data classification model according to an embodiment of the present invention.

Referring to FIG. 3, the model generating apparatus classifies the training data 300 into a plurality of clusters using the unsupervised learning model 310. For example, the model generating apparatus may classify the training data 300 into two or more clusters using an unsupervised learning model 310 based on the K-means algorithm. In the following, for convenience of description, the unsupervised learning model 310 is assumed to be a binary classification model that classifies the training data into two clusters.

In one embodiment, the training data 300 used to train the unsupervised learning model 310 and the supervised learning model 330 for generating the data classification model may be data in which a binary characteristic appears in a separable form. In another embodiment, because approximation by supervised learning may be impossible when the complexity of the discriminant hyperplane of the unsupervised learning model 310 is excessively high, the training data 300 may be data for which the discriminant hyperplane of the classification result of the unsupervised learning model 310 can be approximated by supervised learning.

The model generating apparatus obtains the first classification result 320 using the unsupervised learning model 310, trains the supervised learning model 330 on the training data 300 labeled on the basis of the first classification result 320, and obtains the second classification result 340. For example, when the training data 300 consists of a plurality of samples 210 as in FIG. 2, each sample of the training data is classified into one of two clusters by the unsupervised learning model 310. Using the cluster assignments, the model generating apparatus labels the training data 300 into samples belonging to the prediction target group (that is, the positive group; for example, tagged '1') and samples belonging to the noise group (that is, the negative group; for example, tagged '0') to produce training data for supervised learning, trains the supervised learning model 330 on it, and obtains the second classification result (that is, the prediction result) from the trained model. By using the trained supervised learning model 330 to predict whether each sample of the training data 300 belongs to the prediction target group, the model generating apparatus can output the second classification result 340, in which the samples of the training data are classified into two groups: the prediction target group and the remaining group.

The model generating apparatus compares the first classification result 320 produced by the unsupervised learning model 310 with the second classification result 340 produced by the supervised learning model 330 to determine whether they are the same (350). Here, "the same" covers not only the case where the first classification result 320 and the second classification result 340 match perfectly, but also the case where the difference between them lies within a given tolerance. For example, the model generating apparatus may judge the results to be the same if the agreement ratio between the first classification result 320 and the second classification result 340 is at least a predefined ratio (for example, 95%, that is, a tolerance of 5%). Various conventional statistical methods for determining whether the first classification result 320 and the second classification result 340 are the same may also be applied to this embodiment.
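The agreement test of step 350 can be sketched as a comparison of the mismatch fraction against the tolerance. The function names and the default tolerance of 0.05 (taken from the 95% example above) are illustrative assumptions; the text allows other statistical tests as well.

```python
from typing import Sequence


def mismatch_ratio(first: Sequence[int], second: Sequence[int]) -> float:
    """Fraction of samples whose two classification results differ."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("results must be non-empty and of equal length")
    mismatches = sum(1 for a, b in zip(first, second) if a != b)
    return mismatches / len(first)


def is_same(first: Sequence[int], second: Sequence[int], tolerance: float = 0.05) -> bool:
    """True when the results agree perfectly or differ within the tolerance."""
    return mismatch_ratio(first, second) <= tolerance
```

With 25 samples and one disagreement the mismatch fraction is 0.04, so the two results count as the same under a 5% tolerance; with 20 samples and two disagreements (0.10) they do not.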

If the first classification result 320 and the second classification result 340 differ, the model generating apparatus generates first augmented training data by adding the second classification result (that is, the prediction result) 340 of the supervised learning model 330 to the training data 300 (370). For example, if the training data 300 contains variables X1 to Xn (220) as in FIG. 2, the model generating apparatus generates the first augmented training data by adding the second classification result 340 of the supervised learning model as variable Y1. The value in the Y1 column added to the first augmented training data indicates the group to which each sample belongs. For example, the Y1 value of a sample predicted to belong to the first group (that is, the positive group) according to the second classification result 340 may be set to '1', and the Y1 value of every other sample to '0'. Depending on the embodiment, the values of the Y variables 230 added to the augmented training data may be expressed in forms other than 0 and 1.

The model generating apparatus feeds the first augmented training data back into the unsupervised learning model 310 as its training data to obtain the first classification result 320. It then trains the supervised learning model 330 on the first augmented training data, labeled using the first classification result 320 for that data, and obtains the second classification result 340.

The model generating apparatus compares the first classification result 320 and the second classification result 340 for the first augmented training data and, if they differ, generates second augmented training data by cumulatively adding the second classification result 340 for the first augmented training data to the first augmented training data. For example, in the example of FIG. 2, the second augmented training data consists of the columns X1 to Xn (220), Y1, and Y2.

The model generating apparatus repeats the process of generating augmented training data, in which the second classification result 340 of the supervised learning model is cumulatively added to the training data 300, until the first classification result 320 and the second classification result 340 are the same. Once the first classification result 320 and the second classification result 340 are the same, the model generating apparatus ends the training of the data classification model. In another embodiment, the model generating apparatus may end training when the current second classification result of the supervised learning model 330 is the same as any one of the previous second classification results. The supervised learning models 110, 112, 114 and the unsupervised learning model 120 of the data classification model 100 are the supervised learning model 330 and the unsupervised learning model 310 trained in this embodiment.
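The training loop of FIG. 3 can be sketched as below. It is a minimal illustration under stated assumptions: the models are hypothetical callables (`cluster`, `fit_predict`), the toy stand-ins are not the algorithms of the patent, and the `max_rounds` safety limit is an addition that the text does not describe.

```python
from typing import Callable, List, Tuple

Sample = List[float]


def generate_model(
    data: List[Sample],
    cluster: Callable[[List[Sample]], List[int]],
    fit_predict: Callable[[List[Sample], List[int]], List[int]],
    tolerance: float = 0.05,
    max_rounds: int = 20,  # assumption: guard against non-convergence
) -> Tuple[List[Sample], int]:
    """Return the augmented training data and the number of added Y columns."""
    augmented = [list(row) for row in data]
    added = 0
    for _ in range(max_rounds):
        first = cluster(augmented)               # first classification result
        second = fit_predict(augmented, first)   # second classification result
        mismatches = sum(1 for a, b in zip(first, second) if a != b)
        if mismatches / len(augmented) <= tolerance:
            return augmented, added              # results agree: training ends
        # Results differ: cumulatively add the second result as a new column.
        augmented = [row + [float(p)] for row, p in zip(augmented, second)]
        added += 1
    raise RuntimeError("results did not agree within max_rounds")


def toy_cluster(rows: List[Sample]) -> List[int]:
    """Toy stand-in: positive when the mean of all features exceeds 0.5."""
    return [1 if sum(row) / len(row) > 0.5 else 0 for row in rows]


def toy_fit_predict(rows: List[Sample], labels: List[int]) -> List[int]:
    """Toy stand-in: threshold the first feature midway between class means."""
    pos = [row[0] for row, y in zip(rows, labels) if y == 1]
    neg = [row[0] for row, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        return list(labels)
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [1 if row[0] > threshold else 0 for row in rows]
```

The returned count of added columns corresponds to the number of supervised stages m used later at classification time.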

The model generating apparatus may determine the number of stages of the supervised learning models 110, 112, 114 in the data classification model 100 described with reference to FIG. 1 from the number of second classification results cumulatively added to the training data 300 by the time training ends. For example, when m Y columns 230 have been added to the training data as in FIG. 2, the number of stages of the supervised learning models 110, 112, 114 constituting the data classification model 100 is m. In this case, the data classification apparatus repeats m times the process of cumulatively adding the prediction results (that is, the classification results) of the supervised learning models 110, 112, 114 to the input data 130, generates the augmented input data (200 in FIG. 2) for the input data 130, and feeds it into the unsupervised learning model 120 to obtain the final classification result for the input data.

In another embodiment, when the unsupervised learning model includes an anomaly detection algorithm, a contamination ratio parameter may be used. Setting the contamination ratio accurately requires a high level of domain expertise and a deep understanding of the data itself, and even a user who has both can easily overestimate or underestimate it. An inaccurate contamination setting can significantly reduce the accuracy of anomaly detection and can lead to a large number of false positives or false negatives.

Accordingly, the model generating apparatus may include a process of correcting the initially set contamination value during training. The initial contamination value may be set in various ways depending on the embodiment. The model generating apparatus may correct the contamination value to the ratio of true positives among the samples on which the first classification result 320 and the second classification result 340 agree. That is, the model generating apparatus may set as the new contamination value the number of agreeing samples that belong to the first group (that is, the positive group), namely the true positives (TP), divided by the total number of agreeing samples, namely TP plus the true negatives (TN): contamination = TP / (TP + TN). The model generating apparatus may set the contamination value at every round.
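The correction formula contamination = TP / (TP + TN) can be worked through in a few lines. The function name is hypothetical, and labels are assumed to be 1 for the positive group and 0 for the negative group.

```python
from typing import Sequence


def updated_contamination(first: Sequence[int], second: Sequence[int]) -> float:
    """Contamination = TP / (TP + TN), counted over agreeing samples only."""
    # True positives: both results place the sample in the positive group.
    tp = sum(1 for a, b in zip(first, second) if a == 1 and b == 1)
    # True negatives: both results place the sample in the negative group.
    tn = sum(1 for a, b in zip(first, second) if a == 0 and b == 0)
    if tp + tn == 0:
        raise ValueError("the two results agree on no sample")
    return tp / (tp + tn)
```

For example, with eight samples of which two are agreed positives, five are agreed negatives, and one is disputed, the corrected contamination is 2 / 7; the disputed sample is excluded from both counts.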

In yet another embodiment, when an exceptional condition occurs in which TP or TN becomes 0 or the contamination value exceeds 0.5, the model generating apparatus may replace the algorithm of the supervised learning model 330 or the unsupervised learning model 310 used for training, add further variables to the training data 300, change the initial contamination value, or change the tolerance used to determine whether the first classification result 320 and the second classification result 340 are the same.

In yet another embodiment, when the performance of the supervised learning model is evaluated by classification accuracy, it is preferable to set the tolerance smaller than the contamination value. If the tolerance is larger than the contamination value, the supervised learning model 330 may be optimized into a meaningless model that judges every sample to be negative, because such a model disagrees with the unsupervised result only on the positive samples, whose share is the contamination value.
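The exceptional conditions and the tolerance guideline above could be checked together before each round. The sketch below is one possible arrangement; the function name, the message texts, and the choice to also flag a tolerance equal to the contamination value are assumptions.

```python
from typing import List


def training_warnings(tp: int, tn: int, contamination: float, tolerance: float) -> List[str]:
    """Collect the conditions that call for changing the training setup."""
    warnings: List[str] = []
    if tp == 0 or tn == 0:
        warnings.append("TP or TN is zero")
    if contamination > 0.5:
        warnings.append("contamination exceeds 0.5")
    if tolerance >= contamination:
        # An all-negative model mismatches only on the positive share,
        # so it would already pass the agreement test.
        warnings.append("tolerance is not smaller than contamination")
    return warnings
```

With 30 agreed positives, 970 agreed negatives, a contamination of 0.03, and a tolerance of 0.05, only the tolerance warning is raised: a model that labels every sample negative would mismatch on about 3% of the samples and still pass the 5% agreement test.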

FIG. 4 is a diagram illustrating the configuration of an embodiment of a data classification apparatus according to an embodiment of the present invention.

Referring to FIG. 4, the data classification apparatus 400 includes a model generation unit 450, a supervised learning model 410, an unsupervised learning model 420, a data augmentation unit 430, and a classification prediction unit 440. The data classification apparatus 400 may be implemented as a computing device including a memory and a processor. Each component may be implemented in software, loaded into the memory, and then executed by the processor.

The model generation unit 450 trains the supervised learning model 410 and the unsupervised learning model 420 by the method described with reference to FIG. 3, and generates the data classification model 100 of FIG. 1, which consists of the supervised learning models 410 connected in m cascaded stages and the unsupervised learning model 420. The detailed configuration of the model generation unit 450 is described with reference to FIG. 5.

Once the model generation unit 450 has generated the data classification model 100 of FIG. 1, composed of the supervised learning model 410 and the unsupervised learning model 420, the data augmentation unit 430 generates augmented data, as shown in FIG. 2, by cumulatively adding the prediction results of the supervised learning model 410 to the input data. The number of prediction results cumulatively added to the input data is determined when the model generation unit 450 trains the data classification model.

The classification prediction unit 440 inputs the augmented data generated by the data augmentation unit 430 into the unsupervised learning model 420 to obtain a classification result for the input data. Because the unsupervised learning model 420 classifies samples that contain not only the variable values present in the input data (for example, X1 to Xn (220) of FIG. 2) but also the prediction results cumulatively added by the supervised learning model (for example, Y1 to Ym (230) of FIG. 2), the classification result is more accurate and the explanatory power of the classification result of the unsupervised learning model 420 is improved.
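
The prediction path through the data augmentation unit 430 and the classification prediction unit 440 can be sketched as follows. This is an illustration only: the trained models are represented as plain Python callables, and `augment`, `classify`, and the toy rules `stage_1`, `stage_2`, and `unsupervised_predict` are hypothetical stand-ins, not the models of the invention.

```python
def augment(row, stages):
    """Append each supervised stage's prediction before calling the next stage.

    After m stages the row holds X1..Xn followed by Y1..Ym.
    """
    augmented = list(row)
    for predict in stages:
        augmented.append(predict(augmented))
    return augmented


def classify(row, stages, unsupervised_predict):
    """Feed the augmented row to the unsupervised model."""
    return unsupervised_predict(augment(row, stages))


# Toy stand-ins so the sketch runs (assumptions, not trained models).
def stage_1(r):                 # sees X1..Xn
    return 1 if r[0] > 5 else 0


def stage_2(r):                 # sees X1..Xn and Y1
    return 1 if r[-1] == 1 or r[1] > 5 else 0


def unsupervised_predict(r):    # sees X1..Xn, Y1 and Y2
    return 1 if r[-1] + r[-2] >= 1 else 0


print(augment([7.0, 1.0], [stage_1, stage_2]))
print(classify([2.0, 1.0], [stage_1, stage_2], unsupervised_predict))
```

The list of stages must have the same length and order as the stages produced during training, since each stage expects the columns added by the stages before it.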

FIG. 5 is a diagram illustrating the configuration of an example of the model generation unit according to an embodiment of the present invention.

Referring to FIG. 5, the model generation unit 450 includes a first classification unit 500, a second classification unit 510, a learning data augmentation unit 520, and an iteration unit 530. The model generation unit 450 may be implemented as a device separate from the data classification apparatus 400 of FIG. 4. That is, the model generation unit 450 may be implemented as a computing device including a memory and a processor, and may thus serve as the model generating device of FIG. 3.

The first classification unit 500 classifies the learning data into at least two clusters using the unsupervised learning model. The first classification unit 500 also uses the unsupervised learning model to classify, into at least two clusters, the augmented learning data in which the prediction results of the supervised learning model have been cumulatively added to the learning data.

The second classification unit 510 obtains the second classification result by training the supervised learning model on the learning data labeled with the first classification result that the unsupervised learning model produced in the first classification unit 500. Likewise, the second classification unit 510 obtains the second classification result by training the supervised learning model on the augmented learning data labeled with the first classification result obtained for that augmented learning data.

The learning data augmentation unit 520 generates augmented learning data by cumulatively adding the second classification result to the learning data. Each time classification is repeated by training the unsupervised learning model and the supervised learning model through the first classification unit 500 and the second classification unit 510, another second classification result is cumulatively added to the learning data.

The iteration unit 530 repeats the process of obtaining the first classification result and the second classification result using the augmented learning data, and of generating augmented learning data with the second classification result cumulatively added, until the first classification result and the second classification result are the same.
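
The cooperation of the four units can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the implementation of the invention: `generate_model`, `toy_cluster`, and `fit_stump` are hypothetical names, `max_rounds` is a safety cap added for illustration, and the two toy models exist only so that the loop runs. A supervised model is kept as a stage only when its output is added to the data, which is one reading of how the number of stages is fixed.

```python
def mismatch_ratio(first, second):
    """Fraction of samples on which the two classification results differ."""
    return sum(1 for a, b in zip(first, second) if a != b) / len(first)


def generate_model(rows, cluster, fit_supervised, tolerance=0.0, max_rounds=10):
    """Repeat classify -> train -> augment until the two results agree.

    cluster(rows) returns 0/1 labels (first classification result).
    fit_supervised(rows, labels) returns a predict(row) callable.
    Returns (stages, converged), one stage per cumulative addition.
    """
    data = [list(r) for r in rows]
    stages = []
    for _ in range(max_rounds):                  # safety cap (assumption)
        first = cluster(data)                    # first classification unit
        predict = fit_supervised(data, first)    # second classification unit
        second = [predict(r) for r in data]
        if mismatch_ratio(first, second) <= tolerance:
            return stages, True
        stages.append(predict)
        data = [r + [y] for r, y in zip(data, second)]   # cumulative addition
    return stages, False


# Toy stand-ins so the sketch runs; real models would replace both.
def toy_cluster(rows):
    """Flag rows whose column sum is above the average column sum."""
    scores = [sum(r) for r in rows]
    mean = sum(scores) / len(scores)
    return [1 if s > mean else 0 for s in scores]


def fit_stump(rows, labels):
    """Pick the single-feature threshold rule that matches the most labels."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            for sign in (1, 0):
                preds = [sign if r[j] >= t else 1 - sign for r in rows]
                score = sum(1 for p, y in zip(preds, labels) if p == y)
                if best is None or score > best[0]:
                    best = (score, j, t, sign)
    _, j, t, sign = best
    return lambda r: sign if r[j] >= t else 1 - sign


rows = [[1, 1], [1, 5], [5, 1], [5, 5], [1, 6]]
stages, converged = generate_model(rows, toy_cluster, fit_stump)
print(len(stages), converged)
```

In this toy run the two results disagree on the raw data, one supervised prediction is added as a new column, and the results then coincide, so a single stage is retained.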

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. A computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that it may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Claims

1. A model generation method performed by a model generation device, the method comprising:
obtaining a first classification result by classifying learning data using an unsupervised learning model;
obtaining a second classification result by classifying, using a supervised learning model, the learning data labeled based on the first classification result;
if the first classification result and the second classification result differ, cumulatively adding the second classification result to the learning data and re-performing the step of obtaining the first classification result and the step of obtaining the second classification result; and
repeating the steps from obtaining the first classification result through the re-performing until the first classification result and the second classification result are the same within a predefined tolerance range.

2. The method of claim 1,
wherein the unsupervised learning model is a binary classification model that divides the learning data into two clusters.

3. The method of claim 1, wherein the repeating step comprises:
determining that the first classification result and the second classification result are the same if their matching ratio is equal to or greater than a preset criterion.

4. The method of claim 1,
further comprising, before the re-performing step, correcting the contamination of the unsupervised learning model using the ratio of true positives among the samples on which the first classification result and the second classification result match.

5. The method of claim 1,
further comprising predicting a classification result for input data using the unsupervised learning model and the supervised learning model,
wherein the predicting step comprises:
repeating a process of adding, to the input data, a prediction result obtained by inputting the input data into the supervised learning model, and cumulatively adding back to the input data a prediction result obtained by inputting the input data including the added prediction result into the supervised learning model again; and
obtaining a classification result by inputting, into the unsupervised learning model, the input data to which the prediction results of the supervised learning model have been cumulatively added,
wherein the number of iterations of the cumulative adding process is determined by the number of times the step of obtaining the second classification result is re-performed.

6. (Deleted)

7. A data classification apparatus comprising:
a data augmentation unit that repeats, a predetermined number of times, a process of adding, to input data, a prediction result obtained by inputting the input data into a supervised learning model, and cumulatively adding back to the input data a prediction result obtained by inputting the input data including the added prediction result into the supervised learning model again;
a classification prediction unit that predicts a classification result by inputting, into an unsupervised learning model, the input data to which the prediction results have been cumulatively added; and
a model generation unit that trains a data classification model composed of the unsupervised learning model and the supervised learning model,
wherein the model generation unit comprises:
a first classification unit that obtains a first classification result by classifying learning data using the unsupervised learning model;
a second classification unit that obtains a second classification result by classifying, using the supervised learning model, the learning data labeled based on the first classification result;
a learning data augmentation unit that generates augmented learning data by cumulatively adding the second classification result to the learning data; and
an iteration unit that repeats, until the first classification result and the second classification result are the same, a process of re-obtaining the first classification result and the second classification result using the augmented learning data and generating augmented learning data to which the obtained second classification result is cumulatively added.

8. The apparatus of claim 7,
wherein the number of times the data augmentation unit cumulatively adds the prediction result of the supervised learning model to the input data is determined by the number of times the learning data augmentation unit cumulatively adds the second classification result to the learning data.

9. The apparatus of claim 7, wherein the model generation unit further comprises:
a contamination correction unit that corrects the contamination of the unsupervised learning model using the ratio of true positives among the samples on which the first classification result and the second classification result match.

10. A computer-readable recording medium on which is recorded a computer program for causing a computer to perform the method according to any one of claims 1 to 5.