KR20200113397A

KR20200113397A - Method of under-sampling based ensemble for data imbalance problem

Info

Publication number: KR20200113397A
Application number: KR1020190033526A
Authority: KR
Inventors: 강대기
Original assignee: 동서대학교 산학협력단
Priority date: 2019-03-25
Filing date: 2019-03-25
Publication date: 2020-10-07
Also published as: KR102266950B1

Abstract

The present invention relates to an undersampling-based ensemble method for resolving data imbalance. The method comprises the steps of: dividing data into a majority class (normal companies) and a minority class (insolvent companies) based on a plurality of corporate insolvency data; forming sets of sub-instances by undersampling; measuring the similarity between the data of the population and the data of each subgroup in order to measure the information loss of the subgroup with respect to the population; training a base learner on each subgroup and constructing an ensemble; and evaluating the performance of each classifier on a test set held out for validation and measuring the statistical significance of the differences in performance.

Description

{Method of under-sampling based ensemble for data imbalance problem}

The present invention relates to an undersampling-based ensemble method for resolving data imbalance and, more specifically, to an ensemble learning method that applies undersampling to improve accuracy in cases where data imbalance is a problem.

In general, when a machine-learning application faces imbalanced data, two main problems arise.

First, classifier accuracy is not a suitable measure under data imbalance. The metric most commonly used to measure classifier performance is arithmetic-mean-based accuracy, calculated as the proportion of correctly classified instances among all training instances. An instance, in machine learning or deep learning, is equivalent to one case in statistics. In a database table it is one row, or record. For example, in a general machine learning problem it is one composite data structure made up of several values; in image data it is one image; and in a document processing problem it is one document.

Under data imbalance, however, arithmetic-mean accuracy has the drawback that the measured performance of a classifier is determined almost entirely by how accurately it classifies majority-class instances. For example, corporate insolvency is a very rare event: domestic credit rating agencies estimate the long-term average default rate of domestic externally audited corporations at about 3 to 5%. If the training data consist of 20,000 externally audited companies of which 600 are insolvent, then in the extreme case of judging every company to be normal, the arithmetic-mean accuracy is still 97%.
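The arithmetic behind this example can be checked in a few lines. The sketch below uses only the figures given in the text (20,000 companies, 600 insolvent); the variable names are illustrative.

```python
# Figures from the example: 20,000 audited companies, 600 of them insolvent.
total = 20_000
insolvent = 600
normal = total - insolvent

# A trivial classifier that labels every company "normal" gets every normal
# company right and every insolvent company wrong.
correct = normal
accuracy = correct / total
minority_recall = 0 / insolvent  # no insolvent company is ever detected

print(f"accuracy        = {accuracy:.2%}")
print(f"minority recall = {minority_recall:.2%}")
```

The high accuracy comes entirely from the majority class, while not a single minority-class instance is identified.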

Arithmetic-mean accuracy thus depends heavily on how accurately the majority class, normal companies, is classified. Problems with this kind of data imbalance are very common in practice: radar measurement that distinguishes enemy aircraft from friendly aircraft, diagnostic technology that detects machine failures in factories, detection of breast cancer, lung cancer, and the like in medical imaging, intrusion detection systems that detect hacker intrusions into an operating system, and systems that detect fires or disasters all face data imbalance.

For this reason, metrics that consider the accuracy on the majority and minority classes together, such as ROC (Receiver Operating Characteristic) analysis and geometric-mean-based accuracy, have recently been used in place of arithmetic-mean accuracy.
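For reference, geometric-mean accuracy is usually defined as the geometric mean of the per-class accuracies. The text does not spell out the formula, so the symbols below (TP, FN, TN, FP for the confusion-matrix counts, with the minority class taken as positive) are assumptions introduced for illustration. Applied to the 20,000-company example, the classifier that labels everything normal scores zero:

```latex
\[
\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}
\]
% Always-"normal" classifier on 20,000 companies, 600 of them insolvent:
% TP = 0, FN = 600, TN = 19400, FP = 0
\[
\text{G-mean} = \sqrt{\frac{0}{600} \cdot \frac{19400}{19400}} = 0,
\qquad
\text{accuracy} = \frac{0 + 19400}{20000} = 0.97
\]
```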

Second, the learning performance of the classifier degrades. Under data imbalance, ordinary machine learning lets the decision region of the majority class keep expanding, so the minority-class region gradually shrinks and the classification accuracy for the minority class drops sharply. Two broad approaches are used to address this encroachment on the decision boundary: classifier (or algorithm) modification and data manipulation.

Cost-adaptive strategies, a representative classifier modification technique, impose a penalty on misclassification. They have the advantage of not distorting the data distribution, but the disadvantage that their effect is negligible when the imbalance is very severe. Data manipulation techniques comprise undersampling and oversampling. Undersampling extracts, according to a set rule, as many majority-class instances as there are minority-class instances. It has the advantage of shortening training time, but removing majority-class instances causes a loss of information.

Oversampling is the opposite of undersampling: according to a set rule, it increases the number of minority-class instances until it matches the number of majority-class instances. It has the advantage of enlarging the scarce minority-class data so that a large volume of data is available for training, but the similarity among the minority-class data gives rise to overfitting.

Korean Patent Publication No. 2001-0087974
Korean Patent Publication No. 2007-0067484
Korean Patent Publication No. 2018-0130511
Korean Patent Registration No. 10-1563406

Accordingly, the present invention is intended to solve these problems. Its object is to provide an undersampling-based ensemble learning method for resolving data imbalance, which overcomes the loss of objectivity that occurs when the instances extracted under data imbalance do not sufficiently represent the characteristics of the population.

To achieve this object, the present invention provides an undersampling-based ensemble method for resolving data imbalance, comprising the steps of: dividing data into a majority class and a minority class based on a plurality of corporate insolvency data; forming sets of sub-instances by undersampling; measuring the similarity between the data of the population and the data of each subgroup in order to measure the information loss of the subgroup with respect to the population; training a base learner on each subgroup and constructing an ensemble; and evaluating the performance of each classifier on a test set held out for validation and measuring the statistical significance of the differences in performance.

In addition, in the above step, each set of instances is drawn at random from the majority-class instances, and the number of instances drawn equals the number of minority-class instances.

In addition, in the above step, the ensemble is constructed by any one method selected from boosting, bagging, arcing, and stacking.

In addition, the boosting algorithm is the AdaBoost algorithm or the GM-Boost algorithm.

Accordingly, the present invention mitigates the problem of data imbalance and enables balanced learning of the majority and minority classes. For classification problems in which the majority and minority classes are imbalanced, it performs undersampling that can be verified and applies ensemble learning algorithms over classifiers induced by machine learning algorithms, thereby achieving high accuracy and reliability.

FIG. 1 is an illustration of a boosting algorithm.
FIG. 2 is a flowchart of the undersampling-based ensemble method according to the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the elements of each drawing, the same elements are given the same numerals wherever possible, even when they appear in different drawings.

In addition, in describing the present invention below, detailed descriptions of related known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention.

Before addressing the details of the present invention, important terms are defined or clarified.

Undersampling is a method of selecting data from the full dataset at random and using the smaller dataset made up of the selected data.

There are many ways to select data at random, but the most important point is that the selection must be independent of the characteristics of the data. For example, drawing records from user purchase history data with 50% probability is acceptable, but using only the first half of a dataset sorted alphabetically by user name is not, because it can change the characteristics of the data.
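The difference between the two selection rules can be seen on a toy list. This is a minimal sketch: the romanized user names and the fixed seed are hypothetical, chosen only to make the contrast visible.

```python
import random

# Hypothetical user names, already in alphabetical order.
names = ["Ahn", "Bae", "Cho", "Do", "Eom", "Go", "Han", "Im", "Jang", "Kim",
         "Lee", "Min", "Na", "Oh", "Park", "Ryu", "Seo", "Tak", "Woo", "Yoon"]
half = len(names) // 2

# Taking the first half of a sorted list ties the selection to the sort key:
# every selected name sorts before every unselected one.
first_half = names[:half]
second_half = names[half:]

# A random draw of the same size does not depend on the ordering.
rng = random.Random(0)
random_half = rng.sample(names, half)

print("first half :", first_half)
print("random half:", sorted(random_half))
```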

The Random Under-Sampling (RUS) algorithm extracts instances at random from the majority class; the number of instances extracted must be smaller than the number of instances in the majority class.

Accordingly, the present invention is characterized by solving the data imbalance problem by running a machine learning algorithm on the undersampled subgroup data and combining the results into an ensemble.

There are many ways to construct such an ensemble; the present invention is described using a boosting algorithm as an example. Ensembles constructed by other methods such as bagging, arcing, and stacking are also possible and fall within the scope of the present invention.

The boosting algorithm is a method that repeats the step of building a new classifier focused on the misclassified instances. In a machine learning classification problem, one can attend to the instances that were misclassified and build a new classifier that classifies them better.

In that case, using the earlier classifier and the new classifier together improves overall classification performance. Boosting repeats this process of building a new classifier that concentrates on the misclassified instances and collects the resulting classifiers into a set; this procedure is the boosting algorithm, and the set of classifiers is called an ensemble.

That is, as shown in FIG. 1, a boosting algorithm combines weak classifiers into an overall strong classifier. It does so by weighting the outputs of other learning algorithms, called weak learners, and combining them into a single set.

The pseudo-code of AdaBoost, a representative boosting algorithm, is as follows.

[AdaBoost algorithm]

As above, AdaBoost expresses the final output of the boosted classifier as a weighted sum of the outputs of other learning algorithms (weak learners). AdaBoost can be applied in a wide range of situations because the subsequent weak learners can correct the instances misclassified by the earlier classifiers.
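The pseudo-code figure is not reproduced in this text. As a stand-in, the following is a minimal sketch of the standard discrete AdaBoost procedure, assuming labels in {-1, +1} and single-feature threshold classifiers (decision stumps) as the weak learner. The function names (`train_stump`, `adaboost`, `predict`) and the toy data are hypothetical and are not taken from the patent.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                # The stump predicts `pol` when x[f] <= thr, otherwise `-pol`.
                pred = [pol if x[f] <= thr else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, x):
    _, f, thr, pol = stump
    return pol if x[f] <= thr else -pol

def adaboost(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform instance weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)         # guard against log(…/0)
        if err >= 0.5:                     # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)   # vote weight of this stump
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified instances, lower the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted sum of the weak learners' votes."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold can separate.
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, x) for x in X])
```

On this toy set no single stump is correct on more than seven of the ten points, yet the weighted vote of three stumps reproduces every training label, which is the behavior the paragraph above describes.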

Another boosting algorithm is GM-Boost (geometric mean based boosting). GM-Boost is a boosting algorithm that, as an alternative for addressing the data imbalance problem of conventional boosting algorithms, is based on geometric-mean accuracy and geometric-mean error calculation.

As described above, an ensemble can be constructed in various ways, such as boosting, bagging, arcing, and stacking.

Such an ensemble is built from a set of classifiers produced by a particular machine learning algorithm.

In research on ensemble methods, the machine learning algorithm that trains each individual classifier of the ensemble is called the base learner or base learning algorithm.

Any learning algorithm can serve as the base learner; a representative base learner is the SVM. SVM stands for Support Vector Machine, a method that finds the optimal hyperplane separating two classes of data in feature space. The SVM algorithm identifies the support vectors, the data points of each class that lie closest to the other class, and seeks the hyperplane that maximizes the distance to them.
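For orientation, the maximum-margin idea is commonly written as the following optimization problem. This is the textbook hard-margin form for linearly separable data, not a formula given in the patent; the symbols (weight vector w, bias b, training points x_i with labels y_i in {-1, +1}) are assumptions introduced here.

```latex
\[
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n
\]
% The margin between the two classes is 2 / ||w||, so minimizing ||w||
% maximizes the distance from the hyperplane to the closest points
% (the support vectors, for which the constraint holds with equality).
```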

The undersampling method for resolving data imbalance according to the present invention follows the flowchart shown in FIG. 2.

That is, the method consists of: a step (S100) of dividing the given data into a majority class and a minority class; a step (S200) of forming sets of subgroup instances by Random Undersampling (RUS); a step (S300) of measuring the similarity between the instances of the population and the instances of each subgroup in order to measure the information loss of the subgroup with respect to the population; a step (S400) of training a base learner on each subgroup and constructing an ensemble; and a step (S500) of evaluating the performance of each classifier on a test set held out for validation and measuring the statistical significance of the differences in performance.

The steps S100 to S500 are described below.

In step S100, which divides the given data into a majority class and a minority class, the number of instances in each class of the data is counted, and the class with the relatively larger number is regarded as the majority class.
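Step S100 amounts to counting labels. The following is a minimal sketch, assuming a two-class problem with unequal class sizes; the function name `split_by_class_size` and the label strings are hypothetical.

```python
from collections import Counter

def split_by_class_size(labels):
    """Return (majority_label, minority_label, counts) for a list of class labels."""
    counts = Counter(labels)
    ranked = counts.most_common()      # (label, count) pairs, largest count first
    majority_label = ranked[0][0]
    minority_label = ranked[-1][0]
    return majority_label, minority_label, counts

# Hypothetical label column: 97 normal companies and 3 insolvent ones.
labels = ["normal"] * 97 + ["insolvent"] * 3
majority_label, minority_label, counts = split_by_class_size(labels)
print(majority_label, minority_label, dict(counts))
```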

In step S200, which forms sets of subgroup instances by Random Undersampling (RUS), instances are extracted at random from the majority-class instances.

Here, the number of instances drawn from the majority class is made equal to the number of minority-class instances. Undersampling is applied only to the majority-class instances, and its result forms the majority-class part of a subgroup. The undersampled majority-class instances are then combined with the existing minority-class instances to form the subgroup.
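Step S200 can be sketched as follows. The text does not state how many subgroups are drawn or whether sampling is with replacement, so the `n_subgroups` parameter, the fixed seed, and sampling without replacement within each subgroup are assumptions; the function name `make_subgroups` is hypothetical.

```python
import random

def make_subgroups(majority, minority, n_subgroups, seed=0):
    """Build balanced subgroups: a random majority sample plus all minority instances."""
    rng = random.Random(seed)
    subgroups = []
    for _ in range(n_subgroups):
        # Draw as many majority instances as there are minority instances
        # (without replacement inside one subgroup).
        sampled = rng.sample(majority, len(minority))
        subgroups.append(sampled + list(minority))
    return subgroups

# Hypothetical instances, tagged with their class for readability.
majority = [("normal", i) for i in range(97)]
minority = [("insolvent", i) for i in range(3)]
subgroups = make_subgroups(majority, minority, n_subgroups=5)

for sg in subgroups:
    print(len(sg), sum(1 for label, _ in sg if label == "insolvent"))
```

Each subgroup keeps every minority instance and sees a different slice of the majority class, so the ensemble as a whole can cover more of the majority data than any single undersampled set.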

In step S300, which measures the similarity between the instances of the population and the instances of each subgroup in order to measure the information loss of the subgroup with respect to the population, the information loss can be measured by computing the distance or divergence between the population and the subgroup. The distance between two different groups X₁ and X₂ is defined as follows.

[Equation not reproduced in this text]

Here N₁ is the number of instances of X₁, N₂ is the number of instances of X₂, [...], and dist may be any distance measure or divergence.

Commonly used measures include the Mahalanobis distance, the Euclidean distance, the Kullback-Leibler divergence, and the Jensen-Shannon divergence. If the measured distance is smaller than a predetermined threshold, the two groups can be regarded as similar.
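The defining equation for the group distance does not survive in this text. One reading that is consistent with the symbols N₁, N₂, and dist is the mean pairwise distance between the two groups; this is an assumption, and other readings (for example, the distance between the two group means) fit the surviving text equally well. A minimal sketch of the pairwise reading with the Euclidean distance, using hypothetical function names and toy points:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def group_distance(X1, X2, dist=euclidean):
    """ASSUMED form: average of dist(a, b) over all pairs a in X1, b in X2."""
    n1, n2 = len(X1), len(X2)
    return sum(dist(a, b) for a in X1 for b in X2) / (n1 * n2)

def is_similar(X1, X2, threshold, dist=euclidean):
    """Groups count as similar when the distance is below the chosen threshold."""
    return group_distance(X1, X2, dist) < threshold

# Toy population and subgroup (hypothetical 2-D feature vectors).
population = [(0.0, 0.0), (0.0, 2.0)]
subgroup = [(0.0, 0.0)]
d = group_distance(population, subgroup)
print(d, is_similar(population, subgroup, threshold=1.5))
```

Any other measure named in the text can be passed in through the `dist` parameter as long as it accepts two instances.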

In step S400, in which a base learner is trained on each subgroup to construct an ensemble, various ensemble learning algorithms can be used. Commonly used methods include boosting, bagging, arcing, and stacking. If boosting is chosen, AdaBoost or GM-Boost, described above, can be used.
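The overall shape of step S400, one base learner per subgroup and a rule for combining them, can be sketched as below. To keep the example dependency-free, the base learner is a nearest-centroid classifier rather than an SVM, and the combiner is a plain majority vote (a bagging-style rule) rather than boosting; both are substitutions made for illustration, and all names and data are hypothetical.

```python
from collections import Counter

def fit_centroids(samples):
    """Base learner: store the mean feature vector of each class."""
    sums, counts = {}, Counter()
    for x, label in samples:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_one(centroids, x):
    """Predict the class whose centroid is closest to x."""
    def sqdist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

def fit_ensemble(subgroups):
    """Train one base learner per balanced subgroup."""
    return [fit_centroids(sg) for sg in subgroups]

def predict_ensemble(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Three hand-made balanced subgroups: different majority samples, same minority set.
minority = [((5.0, 5.0), "insolvent"), ((6.0, 5.0), "insolvent")]
subgroups = [
    [((0.0, 0.0), "normal"), ((1.0, 0.0), "normal")] + minority,
    [((0.0, 1.0), "normal"), ((1.0, 1.0), "normal")] + minority,
    [((1.0, 0.0), "normal"), ((0.0, 1.0), "normal")] + minority,
]
models = fit_ensemble(subgroups)
print(predict_ensemble(models, (0.5, 0.5)), predict_ensemble(models, (5.5, 5.0)))
```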

In step S500, in which the performance of each classifier is evaluated on a test set held out for validation and the statistical significance of the performance differences is measured, several methods are possible. One may test whether the null hypothesis (that the performance improvement of the proposed algorithm is merely a chance result) can be rejected, or compute and compare confidence intervals at a given confidence level.

In this case, the result of the ensemble algorithm without undersampling serves as the control group, and the result of the ensemble algorithm with undersampling serves as the experimental group.
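The text leaves the choice of significance test open. As one possible instance, the sketch below runs an exact paired sign-flip permutation test on per-fold scores of the control and experimental groups. The test itself, the function name, and the score values are assumptions: the numbers are made-up placeholders, not results from the patent.

```python
from itertools import product

def paired_permutation_pvalue(control, experimental):
    """Two-sided exact sign-flip test on paired score differences."""
    diffs = [e - c for c, e in zip(control, experimental)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Under the null hypothesis each difference is equally likely to have
    # either sign, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total

# Placeholder per-fold scores (e.g. geometric-mean accuracy), NOT measured results.
control = [0.61, 0.58, 0.64, 0.60, 0.59, 0.62, 0.57, 0.63]        # no undersampling
experimental = [0.72, 0.70, 0.71, 0.69, 0.74, 0.73, 0.68, 0.75]   # with undersampling
p_value = paired_permutation_pvalue(control, experimental)
print(f"p = {p_value:.4f}")
```

With eight paired folds there are 256 sign assignments, so the smallest attainable two-sided p-value is 2/256; a small p-value is evidence against the null hypothesis that the improvement is due to chance.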

Accordingly, the present invention addresses the problem that, under data imbalance, generalization or objectivity is reduced because the extracted instances (cases) do not sufficiently represent the characteristics of the population. It mitigates the imbalance by undersampling so that the majority and minority classes can be learned in a balanced way, trains base learners on the result, and combines them into an ensemble. As embodiments, it has the excellent effect of resolving the data imbalance that arises in two-class classification problems such as intrusion detection (for example, a hacker breaking into a system and disrupting a computer) and corporate insolvency prediction (that is, predicting the rise and fall of a company).

As described above, the foregoing merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the invention pertains will be able to make various modifications, changes, and substitutions without departing from its essential characteristics.

Accordingly, the embodiments disclosed herein and the accompanying drawings are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments and drawings. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present invention.

Claims

An undersampling-based ensemble method for resolving data imbalance, the method comprising:
dividing data into a majority class and a minority class based on a plurality of corporate insolvency data (S100);
forming a set of sub-instances by undersampling (S200);
measuring the similarity between the data of the population and the data of the subgroup in order to measure the information loss of the subgroup with respect to the population (S300);
training a base learner on each subgroup and constructing an ensemble (S400); and
evaluating the performance of each classifier using a test set for validation and measuring the statistical significance of the differences in performance (S500).

The method of claim 1,
wherein, in step S200, the set of instances is extracted at random from the majority-class instances, and the number of instances extracted equals the number of minority-class instances.

The method of claim 1,
wherein, in step S400, the ensemble is constructed by one method selected from boosting, bagging, arcing, and stacking.

The method of claim 3,
wherein, when boosting is selected, the boosting algorithm is the AdaBoost algorithm or the GM-Boost algorithm.