KR102245270B1

KR102245270B1 - Method for oversampling minority category for training data

Info

Publication number: KR102245270B1
Application number: KR1020190022080A
Authority: KR
Inventors: 양지훈; 신동일
Original assignee: 서강대학교 산학협력단
Priority date: 2019-02-25
Filing date: 2019-02-25
Publication date: 2021-04-26
Also published as: KR20200103494A

Abstract

The present invention relates to an oversampling method for handling data imbalance in training data. The oversampling method comprises: an encoding step of applying an adversarial autoencoder (AAE) to the training data to extract features in a latent variable space; an oversampling step of applying SMOTE to the minority-category data in the latent variable space from which the features have been extracted; and a decoding step of decoding the oversampled latent variable space. The training data is thereby reconstructed so that the minority-category data and the majority-category data are balanced.

Description

Method for oversampling minority category for training data

The present invention relates to a method for oversampling training data, and more specifically to a method of oversampling minority-category data by extracting features of latent variables with an adversarial autoencoder, applying SMOTE in the latent variable space from which the features of the data have been extracted, and then decoding the result.

Training data for machine learning experiments is generally assumed to contain a similar number of samples in each category. In much real-world data, however, the number of samples differs across classes, which creates an imbalance problem; in that case, data in the minority category is likely to be misclassified, lowering performance. This side effect arises because machine learning algorithms are designed to optimize overall performance rather than to account for the relative distribution of each category.

Approaches to the imbalanced-data problem fall broadly into data-level, algorithm-level, and ensemble approaches. Data-level approaches balance the data by under-sampling, over-sampling, or a combination of the two. Algorithm-level approaches modify the error function of an existing machine learning algorithm or introduce a cost concept so that data in the minority group is given greater importance. Ensemble approaches combine several weak classifiers to improve the performance of the final classifier. Data-level approaches in particular, being independent of the machine learning algorithm, have been studied extensively. Under-sampling randomly samples data from the majority category so that it balances the minority category, whereas over-sampling repeatedly copies data from the minority category so that it balances the majority category.

A representative oversampling method is SMOTE (Synthetic Minority Over-sampling Technique). SMOTE generates new samples by synthesizing each minority-category sample with its k nearest neighbors. FIG. 1 is a set of graphs illustrating how SMOTE generates a synthetic sample. FIG. 1(a) shows the nearest neighbors selected for a minority-category sample, and FIG. 1(b) shows a synthetic sample generated at a random position between the sample and a selected nearest neighbor. This procedure is applied to randomly selected minority-category samples until the minority category reaches the desired balance ratio relative to the majority category. Because new samples are synthesized rather than the minority category being duplicated, this helps avoid overfitting.
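The interpolation shown in FIG. 1(b) can be written compactly. The symbols below ($x_i$ for the selected minority sample, $x_{nn}$ for the chosen nearest neighbor, $\lambda$ for the random number, $x_{\text{new}}$ for the synthetic sample) are assumed notation for illustration, following the usual SMOTE formulation rather than symbols recovered from this text:

```latex
x_{\text{new}} = x_i + \lambda \,\bigl(x_{nn} - x_i\bigr), \qquad \lambda \sim \mathcal{U}(0, 1)
```

For example, with $x_i = (1, 2)$, $x_{nn} = (3, 6)$ and $\lambda = 0.25$, the synthetic sample is $(1.5, 3)$, a point on the line segment joining the two minority samples.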

Meanwhile, an autoencoder is trained by unsupervised learning so that its output equals its input. FIG. 2 is a schematic diagram conceptually showing the structure of an autoencoder. Referring to FIG. 2, an autoencoder consists of an input layer holding the input data, a hidden layer holding the 'code' that represents the transformed features, and an output layer holding the data reconstructed from the input. The purpose of an autoencoder is to build a neural network that produces reconstructed data as similar as possible to the input data. That is, the encoder reduces the dimension by giving the hidden layer fewer neurons than the input layer, the decoder restores the reduced-dimensional data, and training minimizes the difference between the reconstructed data and the original data. Trained in this way, the autoencoder holds a compressed representation of the input data in the hidden layer, which makes dimensionality reduction possible. In other words, features of the data can be extracted by adjusting the number of neurons in the hidden layer, and the features can be transformed according to the purpose. Classification can also be performed in the reduced space.
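The encode, decode, and minimize-reconstruction-error loop described above can be sketched with a tiny linear autoencoder. This is a minimal illustration under stated assumptions, not the network used in the invention: the toy data, the purely linear layers without biases, the learning rate, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 200 samples in 5 dimensions that lie on a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing

n, d = X.shape
h = 2  # the hidden ("code") layer has fewer units than the input layer

# Linear encoder and decoder weights (no biases or nonlinearities in this sketch).
W_enc = 0.1 * rng.normal(size=(d, h))
W_dec = 0.1 * rng.normal(size=(h, d))


def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared difference between the input and its reconstruction."""
    X_hat = (X @ W_enc) @ W_dec
    return float(np.mean((X_hat - X) ** 2))


initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.01
for _ in range(2000):
    code = X @ W_enc                      # encoder: input -> code
    X_hat = code @ W_dec                  # decoder: code -> reconstruction
    grad_out = 2.0 * (X_hat - X) / (n * d)
    grad_W_dec = code.T @ grad_out
    grad_W_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= lr * grad_W_dec              # plain gradient descent
    W_enc -= lr * grad_W_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
code = X @ W_enc                          # reduced-dimensional features
```

After training, `code` holds the 2-dimensional representation of each 5-dimensional input, which is the reduced space in which further processing such as classification could take place.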

Features are extracted in the latent variable space with an adversarial autoencoder (AAE), one of the generative models that has an autoencoder structure. A plain autoencoder has the drawback that there is no correlation between the input variables and the latent variables. To compensate, the encoder and decoder are built as probabilistic models so that the latent variables follow a probability distribution similar to that of the training data, which allows better features to be extracted.

Korean Registered Patent Publication No. 10-1563406
Korean Registered Patent Publication No. 10-1843074
United States Patent Publication US 2016/0092789

An object of the present invention, devised to solve the problems described above, is to provide an apparatus and method for oversampling training data that resolve the imbalance between minority-category and majority-category data by applying an adversarial autoencoder to the training data to extract features in a latent variable space and oversampling the minority-category data in that latent variable space.

To achieve this technical object, an oversampling method for handling data imbalance in training data according to a feature of the present invention is a method in which each step is performed by a computer, comprising: an encoding step of extracting features in a latent variable space from the training data; an oversampling step of oversampling the minority-category data in the latent variable space from which the features have been extracted; and a decoding step of decoding the oversampled latent variable space. The training data is thereby reconstructed so that the minority-category data and the majority-category data are balanced.

In the oversampling method according to the above feature, step (a) preferably extracts features in the latent variable space of the training data using a generative model having an autoencoder structure, and more preferably that generative model is an adversarial autoencoder (AAE), which combines a variational autoencoder (VAE) with a generative adversarial network (GAN).

In the oversampling method according to the above feature, step (b) preferably uses SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority category by synthesizing, for each minority-category sample in the latent variable space, new samples from a preset number (k) of its nearest neighbors.

In step (b), it is preferable to select one sample of the minority category, select an arbitrary neighbor from among its preset number (k) of nearest neighbors, generate a random number between 0 and 1, and use that random number to generate a synthetic sample.
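The four sub-steps of step (b) can be sketched as follows. This is a minimal NumPy illustration, assuming Euclidean distance for the nearest-neighbor search and linear interpolation for the synthesis; the function name `smote` and its parameters are hypothetical and are not taken from the patent or from any library.

```python
import numpy as np


def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from the minority samples X_min."""
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                       # select one minority sample
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        dist[i] = np.inf                          # exclude the sample itself
        neighbors = np.argsort(dist)[:k]          # its k nearest neighbors
        nn = rng.choice(neighbors)                # select an arbitrary neighbor
        lam = rng.random()                        # random number in [0, 1)
        synthetic[j] = X_min[i] + lam * (X_min[nn] - X_min[i])
    return synthetic


rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 3))                  # toy minority samples (assumption)
X_syn = smote(X_min, n_new=25, k=3, rng=rng)
```

In the method described here the function would be applied to the latent codes of the minority category rather than to the raw features. Each synthetic point lies on a segment between two existing minority points, so it stays inside the region those points span.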

In the oversampling method according to the above feature, step (b) is preferably repeated until the minority-category data and the majority-category data reach the same ratio.

In the oversampling method according to the above feature, step (c) preferably either restores only the oversampled minority-category data and reconstructs the training data from the initial training data together with the restored oversamples, or restores both the initial training data and the oversampled minority-category data and reconstructs the training data from the restored training data together with the restored oversamples.
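The two ways of rebuilding the training set in step (c) can be sketched as one function with a `mode` switch. Everything named here is hypothetical: `rebuild_training_set`, the `encode`/`decode`/`oversample` callables, and the mode strings are illustration only, the invertible linear map stands in for a trained AAE encoder and decoder, the simple pairwise interpolation stands in for SMOTE, and the balancing rule assumes a two-class problem.

```python
import numpy as np


def rebuild_training_set(X, y, minority_label, encode, decode, oversample,
                         mode="oversamples_only"):
    """Encode, oversample the minority class in latent space, then decode."""
    y = np.asarray(y)
    is_min = y == minority_label
    # Two-class assumption: synthesize enough samples to match the majority count.
    n_needed = max(0, int((~is_min).sum() - is_min.sum()))
    Z = encode(X)                                  # encoding step
    Z_syn = oversample(Z[is_min], n_needed)        # oversampling step
    X_syn = decode(Z_syn)                          # decoding step
    if mode == "oversamples_only":
        X_base = X                                 # keep the initial training data
    elif mode == "decode_all":
        X_base = decode(Z)                         # use the restored training data
    else:
        raise ValueError(f"unknown mode: {mode}")
    X_out = np.vstack([X_base, X_syn])
    y_out = np.concatenate([y, np.full(n_needed, minority_label, dtype=y.dtype)])
    return X_out, y_out


# Stand-ins (assumptions): an invertible linear map instead of a trained AAE,
# and random pairwise interpolation instead of SMOTE.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [0.0, 0.5]])
A_inv = np.linalg.inv(A)


def encode(X):
    return X @ A


def decode(Z):
    return Z @ A_inv


def interpolate_oversample(Z_min, n_new):
    i = rng.integers(len(Z_min), size=n_new)
    j = rng.integers(len(Z_min), size=n_new)
    lam = rng.random((n_new, 1))
    return Z_min[i] + lam * (Z_min[j] - Z_min[i])


X = rng.normal(size=(12, 2))
y = np.array([0] * 9 + [1] * 3)
X_bal, y_bal = rebuild_training_set(X, y, 1, encode, decode, interpolate_oversample)
X_all, y_all = rebuild_training_set(X, y, 1, encode, decode, interpolate_oversample,
                                    mode="decode_all")
```

With a real autoencoder the decoder is only an approximate inverse of the encoder, so the two modes would differ: the first keeps the original samples untouched, while the second passes every sample through the same reconstruction.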

The oversampling method according to the present invention, configured as described above, can solve the data imbalance problem by combining an adversarial autoencoder (AAE) with SMOTE.

With the oversampling method according to the present invention, when adversarial autoencoder training extracts features well in the latent variable space of the data, the data distributions of the classes become clearly separated. Furthermore, by applying SMOTE in the latent variable space in which the data is clearly separated, oversampling the minority category until the minority and majority categories have the same ratio, and feeding the result to an SVM machine learning model, the method shows better performance than other conventional oversampling techniques.

In addition, by oversampling according to the present invention to reconstruct the training data and applying the reconstructed training data to a machine learning model, both accuracy and the reliability of that accuracy can be improved compared with conventional methods.

A comparative experiment to verify the performance of the method according to the present invention against existing oversampling methods was conducted under the conditions shown in Table 1. Nine data sets were used, an SVM was used as the machine learning model to obtain the scores, and 5-fold cross validation was used to increase the reliability of the results. The comparison used several performance measures. When measuring performance on imbalanced data, prediction accuracy is relatively unreliable: if the imbalance ratio is large, accuracy comes out high even when only the majority category is predicted. AUC and F1-score, in contrast, can measure performance regardless of imbalance, so results under these two measures are more reliable.
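The point about accuracy being misleading on imbalanced data can be checked with a small worked example. The sketch below uses made-up labels (5 minority samples out of 100), treats the minority class as the positive class for the F1-score, and defines F1 as 0 when there are no true positives; these choices and the function names are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # convention used here when nothing is correctly detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 5 minority (1) and 95 majority (0) samples.
y_true = [1] * 5 + [0] * 95
y_majority_only = [0] * 100          # a classifier that always predicts the majority

acc = accuracy(y_true, y_majority_only)
f1 = f1_score(y_true, y_majority_only)
```

Here `acc` is 0.95 even though no minority sample is detected, while `f1` is 0.0, which is why the F1-score is the more informative measure in this setting.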

FIG. 11 is a table showing the data sets used in the experiment comparing the performance of the present invention with existing methods. Referring to FIG. 11, nine data sets from the UCI repository were used, described as follows.
(1) Wisconsin breast cancer data: attributes related to breast cancer diagnosis, with positive and negative classes.
(2) Ionosphere data: collected radar observations of the ionosphere, with classes for the presence or absence of a pattern in the ionosphere.
(3) Pima Indians data: data on diabetes onset among Pima Indians, with attributes related to diabetes onset and classes for whether diabetes developed.
(4) Heart disease data: data on heart disease onset, with attributes related to heart disease onset and classes for whether heart disease developed.
(5) Glass data: data for identifying the type of glass, with glass attributes and two classes: building glass and all other glass.
(6) Vehicle shape data: data for identifying the shape of a vehicle, with vehicle shape attributes and two classes: van and all other vehicle shapes.
(7) Thyroid disease data: data on thyroid disease onset, with attributes related to thyroid disease and classes for whether the disease developed.
(8) Numeric segmentation data: data for discriminating the segmentation value of a number, with two classes: the number 0 and the remaining numbers 1 to 8.
(9) Wine type data: data on wine types, with attributes of the wine and three classes.

The results obtained with the performance measures prediction accuracy, AUC, and F1-score are described below. The experiments used the nine data sets described above and compared the method according to the present invention with existing oversampling techniques. Every experiment used 5-fold cross validation and reports the average score over the training/test splits. In each table, the best result is shown in bold and underlined.

FIG. 12 is a table showing the prediction accuracy results for all data sets in the experiment comparing the performance of the present invention with existing methods. Referring to FIG. 12, for six of the nine data sets the proposed technique gave higher classification accuracy than the existing techniques. However, because prediction accuracy simply counts correct and incorrect predictions, using the original data alone without addressing the imbalance gave the highest classification accuracy on three data sets and generally high classification accuracy on the remaining data sets.

FIG. 13 is a table showing the AUC results for all data sets in the experiment comparing the performance of the present invention with existing methods. Referring to FIG. 13, on eight of the nine data sets the proposed technique produced better results than the existing techniques. Because AUC reflects the sensitivity and specificity of the ROC curve, it yields a meaningful score regardless of data imbalance, and the original data did not achieve the highest score on any data set.

FIG. 14 is a table showing the F1-score results for all data sets in the experiment comparing the performance of the present invention with existing methods. Referring to FIG. 14, on eight of the nine data sets the proposed technique produced better results than the existing techniques. The F1-score is likewise derived as the harmonic mean of recall and precision, so using only the original data produced low scores overall.

FIG. 1 is a set of graphs illustrating how SMOTE generates a synthetic sample.
FIG. 2 is a structural diagram showing an autoencoder.
FIG. 3 is a flowchart sequentially showing an oversampling method according to a preferred embodiment of the present invention.
FIG. 4 is a structural diagram showing a variational autoencoder (VAE).
FIG. 5 is a structural diagram showing a generative adversarial network (GAN) model.
FIG. 6 is a structural diagram showing an adversarial autoencoder (AAE).
FIG. 7 is an algorithm showing, by way of example, the training procedure of the adversarial autoencoder model used for feature extraction in the latent variable space in the oversampling method according to the present invention.
FIG. 8 is a diagram showing the process of oversampling with SMOTE in the latent variable space and then decoding, as performed by the oversampling step and the decoding step of the oversampling method according to a preferred embodiment of the present invention.
FIG. 9 is a block diagram showing one embodiment of the decoding step in the oversampling method according to a preferred embodiment of the present invention.
FIG. 10 is a block diagram showing another embodiment of the decoding module in the oversampling method according to a preferred embodiment of the present invention.
FIG. 11 is a table showing the data sets used in the experiment comparing the performance of the present invention with existing methods.
FIG. 12 is a table showing the prediction accuracy results for all data sets in that experiment.
FIG. 13 is a table showing the AUC results for all data sets in that experiment.
FIG. 14 is a table showing the F1-score results for all data sets in that experiment.

The oversampling method according to the present invention is performed by a computer and is characterized in that it resolves data imbalance by applying an adversarial autoencoder to the training data to extract features in a latent variable space, applying SMOTE in that latent variable space to oversample the minority-category data, and then decoding the result to reconstruct the training data. The invention rests on the assumption that, in a latent variable space in which the features of the data have been extracted well, the minority-category and majority-category data distributions are clearly separated, so oversampling can be performed more accurately.

Hereinafter, the configuration and operation of an oversampling method according to a preferred embodiment of the present invention are described in detail with reference to the accompanying drawings.

FIG. 3 is a flowchart sequentially showing an oversampling method according to a preferred embodiment of the present invention. The oversampling method according to the present invention may be implemented as a program executable by a processor of a data processing device such as a computer, and comprises an encoding step (S100), an oversampling step (S110), and a decoding step (S120). The operation of each step is described in detail below.

In the encoding step (S100), an encoder is applied to the input training data to extract features in a latent variable space of reduced dimension. An adversarial autoencoder (AAE) is used as the encoder for extracting features in the latent variable space; as a generative model, the adversarial autoencoder uses latent variables for the purpose of generating data.

The adversarial autoencoder (AAE) is a generative model having an autoencoder structure, and combines a variational autoencoder (VAE) with a generative adversarial network (GAN).

The variational autoencoder (VAE) is a kind of generative model. A generative model models the conditional probability $p(x \mid z)$ of the data given a latent variable ($z$). The purpose of a VAE is to find a latent variable ($z$) that captures the features of the input ($x$) well. FIG. 4 is a structural diagram showing a variational autoencoder. Referring to FIG. 4, the encoder takes the input data ($x$) and produces $q(z \mid x)$, the probability of obtaining the latent variable ($z$), and the decoder uses the latent variable ($z$) produced by the encoder to produce $p(x \mid z)$, which restores the input data ($x$).

The decoder is a neural network that generates the input data ($x$) from the latent variable ($z$); to learn the probability distribution $p(x)$ of the input data ($x$), the decoder learns $p(x \mid z)$. The encoder is a neural network that obtains the latent variable ($z$) from the input data ($x$); its purpose is not to sample a random latent variable ($z$) but to produce a latent variable ($z$) from which the input data ($x$) can be generated well. In a VAE, this latent variable ($z$) is trained to follow a normal distribution. However, the probability distribution $p(z \mid x)$ of the latent variable ($z$) given the input data ($x$) is a posterior distribution, and it is difficult to compute directly. To solve this problem, variational inference is used: instead of the intractable $p(z \mid x)$, a distribution $q(z \mid x)$ is introduced and made to approximate $p(z \mid x)$.

In Equation 1, because the probability distribution $p(x)$ is inconvenient to compute directly, $\log p(x)$ is expanded instead, which yields $\mathcal{L}$, a lower bound of $\log p(x)$. In Equation 2, the right-hand term gives the reconstruction error and the left-hand term is the regularization part, which uses the Kullback-Leibler divergence (KLD). The KLD serves to minimize the distance between $q(z \mid x)$, from which the latent variable ($z$) is sampled, and the normal distribution $p(z)$. Here $\phi$ and $\theta$ are the parameters of $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$. Accordingly, $\phi$ and $\theta$ are adjusted to find the point that maximizes $\mathcal{L}$, so that $\log p(x)$ and $\mathcal{L}$ become equal.
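For reference, the standard evidence lower bound of a VAE has the shape that the description of Equations 1 and 2 suggests: a KL regularization term on the left and a reconstruction term on the right. The notation below follows the common VAE formulation and is an assumption rather than a transcription of the patent's equations:

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta, \phi; x)
  = -\,D_{\mathrm{KL}}\!\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr)
    + \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
```

The gap between the two sides equals $D_{\mathrm{KL}}\bigl(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\bigr)$, so maximizing $\mathcal{L}$ over $\phi$ and $\theta$ both tightens the bound and pulls the approximate posterior toward the true one.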

Meanwhile, a generative adversarial network (GAN) consists of two models that act as a generator and a discriminator. FIG. 5 is a structural diagram showing a GAN model. Referring to FIG. 5, the discriminator strives to judge only real data as real, while the generator produces fake data so that the discriminator cannot identify it as fake. The performance of the two models improves together, and in the end the discriminator can no longer distinguish real data from fake data.

Equation 3 is the objective function of the GAN.
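For reference, the standard GAN minimax objective can be written as follows. The notation ($D$ for the discriminator, $G$ for the generator, $p_{\text{data}}$ for the data distribution, $p_z$ for the noise distribution) is assumed from the usual GAN formulation rather than transcribed from the patent's equation:

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Both terms are at most 0 because $D$ outputs a probability: the first is largest when real data is scored near 1, and the second is largest when generated data is scored near 0.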

수학식 3에서, 먼저 판별자(

)가

를 최대화 하는 관점에서 보면

의 확률 분포이고

는 그 중 샘플링 데이터이다. 판별자(

)는 실제 데이터가 들어오면 1에 가깝게 확률을 추정하여 출력하고, 생성자(

)가 만들어낸 가짜 데이터가 들어오면 0에 가깝게 확률을 추정하여 출력한다. 따라서

를 사용했기 때문에 실제 데이터라면 최대값인 0에 가까운 값이 나오고, 가짜 데이터라면 무한대로 발산하게 하기 때문에

를 최대화하는 방향으로 학습하게 된다. 생성자(

) 부분인 오른쪽 부분이

를 최소화하는 관점에서 보면,

는 보통 정규분포로 사용하는 임의의 노이즈 분포이고

는 노이즈 분포에서 샘플링한 것이다. 이 입력을 생성자(

)에 넣어 만든 데이터를 판별자 (

)가 진짜로 판별하면

이기 때문에

는 1이며,

가 가짜로 판별하면

는 0이기 때문에

는 0에 가까운 최대값이 나오게 될 것이다. 따라서 생성자(

)는

를 최소화하는 방향으로 학습하게 된다. 즉

에 있어서

는 이를 최소화하는 방향으로,

는 최대화하는 방향으로 간다 In Equation 3, first the discriminator (

)end

From the perspective of maximizing

Is the probability distribution of

Is the sampling data among them. Discriminator(

) Is output by estimating the probability close to 1 when the actual data comes in, and the constructor (

When the fake data created by) comes in, it estimates the probability close to 0 and outputs it. therefore

Is used, so if it is real data, the maximum value is close to 0, and if it is fake data, it emits infinitely.

You will learn in the direction of maximizing. Constructor(

The right part, which is the) part

From the perspective of minimizing

Is an arbitrary noise distribution that is usually used as a normal distribution.

Is sampled from the noise distribution. This input into the constructor (

) And put the created data into the discriminator (

) Is really determined

Because

Is 1,

Is determined to be fake

Because is 0

Will result in a maximum value close to zero. So the constructor(

) Is

You will learn in the direction of minimizing. In other words

In

Is in the direction of minimizing this,

Goes in the direction of maximizing
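As a small numeric illustration of the behavior described above, the following sketch estimates the two logarithmic terms of the standard GAN value function from lists of discriminator outputs. The function name `gan_value`, the clipping constant `eps`, and the sample probabilities are assumptions made for illustration and do not come from the patent.

```python
import math

def gan_value(d_real, d_fake, eps=1e-12):
    """Sample estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real data, D(x)
    d_fake: discriminator outputs on generated data, D(G(z))
    eps:    assumed clipping constant that keeps log() finite
    """
    def clip(p):
        return min(max(p, eps), 1.0 - eps)

    real_term = sum(math.log(clip(p)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - clip(p)) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator separates real from fake well: the value is close to its maximum of 0.
v_good_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# Generator fools the discriminator, so D(G(z)) is near 1: the value drops sharply.
v_fooled = gan_value(d_real=[0.99, 0.98], d_fake=[0.99, 0.98])

print(v_good_d, v_fooled)
```

The first value sits just below 0 and the second is strongly negative, which matches the description: the discriminator pushes the value up and the generator pushes it down.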

FIG. 6 is a structural diagram of the adversarial autoencoder (AAE). As described above, the AAE combines a VAE and a GAN, with the VAE structure acting as the generator of the GAN. The generator's encoder receives data $x$ and samples a latent variable $z$, and the generator's decoder reconstructs the data $x$ from it. The AAE differs from the plain VAE in that a network acting as the GAN discriminator is added; this network distinguishes the fake latent variable $z$ sampled by the generator's encoder from the real latent variable $z$ sampled directly from $p(z)$.

Such a complex network is built because of a shortcoming of the VAE. The VAE assumes the prior distribution $p(z)$ to be the standard normal distribution and learns by fitting $q(z|x)$ to resemble it. The VAE is structured this way because a simple probability function such as the standard normal distribution is easy to sample from and makes the KL divergence (KLD) easy to compute. However, if the actual data distribution does not follow a normal distribution, or is more complex than one, VAE performance may suffer.

A GAN, by contrast, does not need to presuppose a specific probability distribution in the model: whatever distribution the data follows, it is trained to reduce the difference between the actual data distribution and the distribution produced by the generator. If the regularization term on the left-hand side of Equation 2 of the VAE is replaced with the GAN loss, distributions other than the normal distribution can be used for the prior and the posterior, which widens the range of model choices.

AAE training proceeds in the order autoencoder, discriminator, generator. First, the autoencoder is trained as in Equation 4: the reconstruction error is computed for the input data $x$, and the encoder parameters and the decoder parameters are updated. Next, as in Equation 5, the discriminator is trained on samples from the target distribution $p(z)$ and on latent variables sampled through the encoder, and the discriminator parameters are updated. Finally, as in Equation 6, the encoder parameters are updated through generator training.
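The three training phases can be sketched as three loss functions, one per phase. Equations 4 to 6 are not reproduced in this text, so the concrete choices below (mean squared error for reconstruction, binary cross-entropy for the discriminator, and a non-saturating generator loss) are assumptions commonly used for adversarial autoencoders and not necessarily the patent's exact formulas. All function names are hypothetical.

```python
import math

def reconstruction_loss(x, x_hat):
    """Phase 1 (autoencoder): mean squared error between input and reconstruction.
    Minimizing it updates both encoder and decoder parameters."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def discriminator_loss(d_prior, d_encoded, eps=1e-12):
    """Phase 2 (discriminator): binary cross-entropy.
    d_prior:   discriminator outputs for z drawn from the target distribution p(z)
    d_encoded: discriminator outputs for z produced by the encoder
    The discriminator should output 1 for the former and 0 for the latter."""
    real = -sum(math.log(max(p, eps)) for p in d_prior) / len(d_prior)
    fake = -sum(math.log(max(1.0 - p, eps)) for p in d_encoded) / len(d_encoded)
    return real + fake

def generator_loss(d_encoded, eps=1e-12):
    """Phase 3 (generator): only the encoder is updated, so that the
    discriminator scores its latent codes as if they came from p(z)."""
    return -sum(math.log(max(p, eps)) for p in d_encoded) / len(d_encoded)

# Toy values: a confident discriminator has low loss, and the encoder's
# loss falls as the discriminator is fooled more often.
print(reconstruction_loss([1.0, 2.0], [2.0, 4.0]))
print(discriminator_loss([0.9], [0.1]), discriminator_loss([0.5], [0.5]))
print(generator_loss([0.9]), generator_loss([0.1]))
```

In a real implementation each loss would be minimized with respect to the parameters named in its docstring, one phase after another within every training iteration.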

As described above, the oversampling method according to the present invention must train the AAE model to suit the characteristics of each data set so that the features of the data are well extracted in the latent variable space. Building the AAE model requires determining various parameters, including the number of hidden layers, the number of neurons in each hidden layer, and the activation function.

FIG. 7 is an algorithm illustrating, by way of example, the training process of the adversarial autoencoder model used for feature extraction in the latent variable space in the oversampling method according to the present invention. Referring to FIG. 7, training the adversarial autoencoder model comprises a parameter setting process, AAE training, and a feature extraction process.

To train the adversarial autoencoder, experiments were run on nine data sets while varying the parameters, and the parameters that gave the highest performance are shown in Table 2. Table 2 lists the optimal parameters for each data set described in FIG. 11.

Referring to Table 1, the encoder and the decoder of the autoencoder use the same number of layers and the same number of neurons. Because the data sets used in the experiments have few samples and few attributes, the number of layers was set to 2, 3, or 4, and the number of neurons in each layer was set smaller than the number of data attributes. Because relatively simple data were used, the latent variable dimension was fixed at two, and a normal distribution was used as the target distribution. ReLU was used as the activation function; it mitigates the vanishing gradient problem of the sigmoid function and has recently become the most widely used activation function. Dropout was used to prevent overfitting to the training data.
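The vanishing gradient issue mentioned for the sigmoid function can be seen numerically. The short sketch below compares the derivative of the sigmoid with that of ReLU at a few input values; the function names and the chosen inputs are illustrative assumptions only.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

# The sigmoid derivative shrinks quickly as the input grows,
# while the ReLU derivative stays at 1 for any positive input.
for x in (0.5, 5.0, 10.0):
    print(x, sigmoid_grad(x), relu_grad(x))
```

Because backpropagation multiplies such derivatives layer by layer, small sigmoid derivatives compound and the gradient fades, whereas ReLU passes the gradient through unchanged for active units.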

FIG. 8 is a diagram illustrating the process of oversampling with SMOTE in the latent variable space and then decoding, through the oversampling step and the decoding step, in the oversampling method according to a preferred embodiment of the present invention.

SMOTE (Synthetic Minority Over-sampling Technique) generates new samples by synthesizing each minority-category sample with its k nearest neighbors. A synthetic sample $x_{new}$ is obtained through Equation 7. For a minority-category sample $x_i$, one of its nearest neighbors $\hat{x}_i$ is selected, a random number $\lambda$ between 0 and 1 is generated, and the synthetic sample is produced according to Equation 7. This procedure is repeated on randomly selected minority-category samples until the minority category reaches the desired balance ratio relative to the majority category.
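The procedure above can be sketched in a few lines of standard-library Python. Equation 7 is not reproduced in this text, so the sketch assumes the usual SMOTE interpolation, $x_{new} = x_i + \lambda(\hat{x}_i - x_i)$, with Euclidean distance for the neighbor search. The function name `smote`, its parameters, and the toy latent points are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))          # random minority sample x_i
        x_i = minority[i]
        # k nearest neighbours of x_i among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x_i, p))
        x_nn = rng.choice(others[:k])             # one neighbour picked at random
        lam = rng.random()                        # random number in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic

# Toy two-dimensional latent codes of the minority category (assumed values).
minority_z = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_z = smote(minority_z, n_new=6, k=2)
print(new_z)
```

To balance the categories, `n_new` would typically be set to the difference between the majority and minority sample counts. Every synthetic point lies on a line segment between two existing minority points, so it stays inside the region the minority category already occupies.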

Referring to FIGS. 8 and 2, the oversampling step (S110) of the oversampling method according to the present invention applies the SMOTE algorithm in the latent variable space of the training data obtained once training in the encoding step is complete, and oversamples the minority-category data so that it matches the majority-category data in proportion.

Referring to FIGS. 8 and 2, the decoding step (S120) of the oversampling method according to the present invention decodes the features $z$ in the latent variable space, in which the minority-category data and the majority-category data have been balanced by the oversampling step, to restore the data.

Through this process the training data is reconstructed, and accuracy can be improved by performing machine learning on the reconstructed training data.

Meanwhile, the decoding step (S120) has two embodiments. FIG. 9 is a block diagram showing one embodiment of the decoding step in the oversampling method according to a preferred embodiment of the present invention. Referring to FIG. 9, in the decoding step according to this embodiment, once minority-category data has been generated by the oversampling module described above, only the oversampled minority-category data is restored, and the training data is reconstructed from the original training data (x) and the restored oversamples (Reconstructed Oversamples).

FIG. 10 is a block diagram showing an embodiment of the decoding module in the oversampling method according to a preferred embodiment of the present invention. Referring to FIG. 10, another embodiment of the decoding module (120) restores both the majority-category data and the ratio-matched minority-category data, that is, both the training data (x) and the oversampled data, and reconstructs the training data from the restored training data (Reconstructed x) and the restored oversamples (Reconstructed Oversamples).
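The difference between the two decoding embodiments can be sketched as a single function with a switch. The function name `rebuild_training_set`, the `restore_all` flag, and the toy `toy_decode` function are hypothetical; in practice `decode` would be the decoder of the trained AAE.

```python
def rebuild_training_set(x, y, z, z_synth, y_synth, decode, restore_all=False):
    """Rebuild the training set after oversampling in the latent space.

    x, y:             original samples and their labels
    z:                latent codes of the original samples
    z_synth, y_synth: oversampled latent codes and their labels
    decode:           hypothetical callable mapping a latent code to data space
    restore_all:      False -> keep original x, decode only the oversamples (FIG. 9)
                      True  -> decode both original and oversampled codes (FIG. 10)
    """
    restored_oversamples = [decode(code) for code in z_synth]
    if restore_all:
        base = [decode(code) for code in z]
    else:
        base = [list(sample) for sample in x]
    return base + restored_oversamples, list(y) + list(y_synth)

# Toy stand-in for a trained decoder (deliberately lossy, for illustration only).
def toy_decode(code):
    return [2.0 * v + 0.5 for v in code]

x = [[2.0, 4.0], [6.0, 8.0]]
y = [0, 1]
z = [[1.0, 2.0], [3.0, 4.0]]
z_synth, y_synth = [[2.0, 3.0]], [1]

data_a, labels_a = rebuild_training_set(x, y, z, z_synth, y_synth, toy_decode)
data_b, labels_b = rebuild_training_set(x, y, z, z_synth, y_synth, toy_decode,
                                        restore_all=True)
print(data_a)
print(data_b)
```

Both variants produce the same number of samples and the same labels. They differ only in whether the original samples pass through the decoder, which in the second variant places original and synthetic samples in the same reconstructed data space.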

The training data reconstructed through the processes described above has its minority-category data and majority-category data balanced, and applying it to machine learning can improve learning performance and accuracy.

The present invention has been described above with reference to its preferred embodiments, but these are merely examples and do not limit the invention. Those of ordinary skill in the art to which the invention pertains will understand that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the invention. Differences relating to such modifications and applications should be construed as falling within the scope of the invention as defined in the appended claims.

Claims

An oversampling method for handling data imbalance in training data, each step of which is performed by a computer, the method comprising:
(a) an encoding step of extracting features of a latent variable space with respect to the training data;
(b) an oversampling step of oversampling minority-category data in the latent variable space from which the features are extracted; and
(c) a decoding step of decoding the oversampled latent variable space,
wherein the training data is reconstructed so as to balance the minority-category data and the majority-category data.

The method of claim 1, wherein step (a) extracts features of the latent variable space of the training data using a generative model having an autoencoder structure.

The method of claim 2, wherein the generative model having the autoencoder structure is an adversarial autoencoder (AAE) that combines a variational autoencoder (VAE) and a generative adversarial network (GAN).

The method of claim 1, wherein step (b) oversamples the minority-category samples by generating new samples in the latent variable space, each synthesized from a minority-category sample and a preset number (k) of its nearest neighbors.

The method of claim 4, wherein step (b) generates a synthetic sample using one sample of the minority category, a randomly selected neighbor of that sample, and a random number between 0 and 1.

The method of claim 1, wherein in step (b), oversampling is performed using SMOTE (Synthetic Minority Over-sampling Technique).

The method of claim 1, wherein step (b) is performed iteratively until the minority-category data and the majority-category data reach the same ratio.

The method of claim 1, wherein step (c) restores only the oversampled minority-category data and reconstructs the training data from the initial training data and the restored oversamples.

The method of claim 1, wherein step (c) restores both the initial training data and the oversampled minority-category data, and reconstructs the training data from the restored training data and the restored oversamples.

A computer-readable recording medium on which is recorded a program implemented to execute, by a computer, the oversampling method for training data according to claim 1.