KR20220132804A

KR20220132804A - Apparatus and method of recommending sampling method and classification algorithm by using metadata set

Info

Publication number: KR20220132804A
Application number: KR1020210037802A
Authority: KR
Inventors: 권오병; 김정훈
Original assignee: 경희대학교 산학협력단
Priority date: 2021-03-24
Filing date: 2021-03-24
Publication date: 2022-10-04
Also published as: KR102556796B1

Abstract

The present invention relates to a technique that generates a metadata set from the dataset features of open datasets and, when a dataset that a user intends to use is input, uses the metadata set to recommend an optimal algorithm type, parameter values, and data sampling method for that dataset. An embodiment of the present invention provides an apparatus for recommending a sampling method and a classification algorithm using a metadata set. The apparatus comprises: a dataset collection unit that collects an open dataset from an open database; a feature extraction unit that extracts a plurality of dataset features from the collected open dataset and preprocesses the extracted dataset features; a mapping processing unit that maps a sampling method and a classification algorithm to the preprocessed dataset features; a metadata set generation unit that generates a selection rule base for selecting a recommended sampling method and a recommended classification algorithm based on the mapped sampling method and the mapped classification algorithm, and generates a metadata set including the generated selection rule base and the preprocessed dataset features; and a recommendation unit that recommends at least one of a customized sampling method and a customized classification algorithm for a user dataset input by a user, using the generated metadata set. The present invention can reduce time and cost.

Description

Apparatus and method for recommending sampling method and classification algorithm using metadata set

The present invention relates to a technique for recommending a sampling method and a classification algorithm for an optimized artificial intelligence algorithm by using a metadata set. Specifically, it relates to a technique that generates a metadata set from the dataset characteristics of open datasets and, when a dataset that a user intends to use is input, uses the metadata set to recommend an optimal algorithm type, parameter values, and data sampling method for that dataset.

Across all industries, efforts continue to improve the accuracy of customer service and in-company decision-making through machine learning algorithms in order to secure a competitive advantage.

Existing AI development companies consume time and computing hardware resources indiscriminately as they repeat experiments to develop high-performing algorithms.

That is, selecting an optimized AI algorithm consumes many repeated experiments, much time, and many resources.

In addition, although many AI researchers publish algorithms with excellent performance, that performance may differ depending on the characteristics of the dataset.

Recently, efforts have been made in manufacturing and production to build smart factories equipped with artificial intelligence. Because such a technique can be used by AI consulting firms that lack prior knowledge at the early stage of an AI deployment, and in building smart factories, selecting the sampling method and classification algorithm for an AI algorithm is necessarily an important issue.

In addition, automatic selection and recommendation of AI algorithms is needed as a core component of platforms for developing machine-learning-based solutions, systems, and services, and the market for machine-learning-based development continues to grow substantially.

Algorithm performance differs across datasets because each algorithm was developed to perform well on datasets arising in particular circumstances.

The accuracy of a classification algorithm can be determined not only by the algorithm's characteristics and hyperparameters but also by the characteristics of the dataset.

Because the classification algorithms used in AI follow different classification strategies, some algorithm tends to be more efficient for a given set of dataset characteristics, so understanding the characteristics of the dataset can be important.

Research on the relationship between dataset characteristics, called meta-features, and classification algorithm performance is still insufficient, and meta-features that reflect the imbalance of multi-class data have not yet been studied.

Korean Patent Registration No. 10-2103902, "Component-based Machine Learning Automation Prediction Apparatus and Method"
Korean Patent Registration No. 10-2098897, "Self-learning System Based on Machine Learning Knowledge and Automated Machine Learning Procedure"
Korean Patent Registration No. 10-1864286, "Method and Apparatus Using Machine Learning Algorithm"
US Patent Publication No. 2020/0210775, "DATA STITCHING AND HARMONIZATION FOR MACHINE LEARNING"

An object of the present invention is to provide an apparatus for recommending a sampling method and a classification algorithm using a metadata set, which generates a metadata set from the dataset characteristics of open datasets and, when a dataset that a user intends to use is input, uses the metadata set to recommend an optimal algorithm type, parameter values, and data sampling method for that dataset.

An object of the present invention is to recommend the algorithm type, parameter values, and data sampling method that give the best performance, based on a metadata set generated by learning the dataset characteristics of open datasets in advance and converting them into metadata.

An object of the present invention is to reduce time and cost for data science and AI development companies by reducing repeated experiments, and to support environmentally friendly AI development and data analysis by reducing hardware use.

A user who enters a field where data science has not yet been applied may struggle because of a lack of prior knowledge. An object of the present invention is to automatically find and recommend similar machine learning algorithms by using datasets learned in advance from the open datasets collected from open databases.

An object of the present invention is to help novice data scientists and AI developers who lack know-how about data learn how to select sampling methods and classification algorithms for AI algorithms, by letting them refer to the way similar machine learning algorithms are automatically found and recommended from datasets learned in advance.

An apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention may include: a dataset collection unit that collects an open dataset from an open database; a feature extraction unit that extracts a plurality of dataset characteristics of the collected open dataset and preprocesses the extracted dataset characteristics; a mapping processing unit that maps a sampling method and a classification algorithm according to the preprocessed dataset characteristics; a metadata set generation unit that generates a selection rule base for selecting a recommended sampling method and a recommended classification algorithm based on the mapped sampling method and the mapped classification algorithm, and generates a metadata set including the generated selection rule base and the preprocessed dataset characteristics; and a recommendation unit that recommends at least one of a customized sampling method and a customized classification algorithm for a user dataset input by a user, using the generated metadata set.

The feature extraction unit may extract, from the collected open dataset, the plurality of dataset characteristics, including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset, and may preprocess the extracted dataset characteristics.
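As an illustration only, a few of the simpler characteristics listed above can be computed as in the following Python sketch. The function names, the dictionary keys, and the use of a majority-to-minority count ratio as the measure of class skew are assumptions made for this example; the specification does not define these formulas here. The sketch assumes a non-empty dataset.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of the class label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def basic_meta_features(rows, labels):
    """Compute a few simple dataset characteristics.

    rows   -- non-empty list of equal-length feature lists
    labels -- list of class labels, one per row
    """
    counts = Counter(labels)
    return {
        "n_variables": len(rows[0]),
        "n_instances": len(rows),
        "n_classes": len(counts),
        # Assumed skew measure: majority class count / minority class count.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
        "class_entropy": class_entropy(labels),
    }
```

Characteristics such as the silhouette score, hub score, and variable overlap need distance computations and are left out of this sketch.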

The feature extraction unit may divide the collected open dataset into a plurality of folds, determine the data in the folds other than one of the folds to be a plurality of training datasets, and extract the plurality of dataset characteristics from the determined training datasets.
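The fold-based selection of training datasets can be sketched as follows. This is a minimal example under stated assumptions: the name `leave_one_fold_out` and the round-robin assignment of instances to folds are choices made for illustration, and the specification does not say how instances are assigned to folds.

```python
def leave_one_fold_out(items, k):
    """Split items into k folds and return k training sets,
    each made of every fold except one held-out fold."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and len(items)")
    # Assumed assignment: item i goes to fold i mod k.
    folds = [items[i::k] for i in range(k)]
    training_sets = []
    for held_out in range(k):
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        training_sets.append(train)
    return training_sets
```

With ten instances and `k=5`, each of the five training sets holds the eight instances outside its held-out fold, and the dataset characteristics would then be extracted from each training set.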

When a missing value exists in the dataset from which the plurality of dataset characteristics were extracted and the variable containing the missing value is numeric, the feature extraction unit may preprocess the extracted dataset characteristics by handling the missing value using the mean of the corresponding class.

When a missing value exists in the dataset from which the plurality of dataset characteristics were extracted and the variable containing the missing value is nominal, the feature extraction unit may preprocess the extracted dataset characteristics by handling the missing value using the mode of the corresponding class.
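The class-wise handling of missing values for numeric and nominal variables can be sketched as below. The representation of a missing value as `None`, the function name `impute_by_class`, and the `numeric_columns` parameter are assumptions of this example, not part of the disclosure.

```python
from collections import Counter, defaultdict
from statistics import mean


def impute_by_class(rows, labels, numeric_columns):
    """Fill None values using statistics of rows that share the same class.

    Numeric columns use the class mean; all other columns use the class mode.
    """
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    filled = [list(row) for row in rows]  # copy so the input is not modified
    n_cols = len(rows[0])
    for label, members in by_class.items():
        for col in range(n_cols):
            observed = [r[col] for r in members if r[col] is not None]
            if not observed:
                continue  # no observed value in this class; leave the gap
            if col in numeric_columns:
                fill = mean(observed)
            else:
                fill = Counter(observed).most_common(1)[0][0]
            for row, lab in zip(filled, labels):
                if lab == label and row[col] is None:
                    row[col] = fill
    return filled
```

If a class has no observed value at all for a column, the sketch leaves the value missing rather than guessing from other classes.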

When class imbalance exists in the dataset from which the plurality of dataset characteristics were extracted, the feature extraction unit may preprocess the extracted dataset characteristics by resolving the class imbalance with one of two methods, chosen according to the imbalance: undersampling, which removes instances of the majority class, or oversampling, which duplicates instances of the minority class to match the majority class.
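The two ways of resolving class imbalance can be sketched with simple random resampling. The function names and the `seed` parameter are hypothetical, and choosing the removed or duplicated instances at random is an assumption of this example; the specification does not state how they are chosen.

```python
import random
from collections import defaultdict


def _group_by_class(rows, labels):
    """Group rows by their class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Drop rows at random until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(g) for g in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Duplicate rows at random until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(g) for g in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels
```

Undersampling discards information from the majority class, while oversampling keeps every instance at the cost of exact duplicates, which is why the choice between them is treated as something to recommend per dataset.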

The mapping processing unit may apply the preprocessed dataset characteristics to a plurality of sampling methods, calculate a sampling method accuracy for each of the applied sampling methods, and map the preprocessed dataset characteristics to a sampling method according to the calculated sampling method accuracy.

The mapping processing unit may apply the preprocessed dataset characteristics to a plurality of classification algorithms, calculate a classification algorithm accuracy for each of the applied classification algorithms, and map the preprocessed dataset characteristics to a classification algorithm according to the calculated classification algorithm accuracy.

The mapping processing unit may calculate the classification algorithm accuracy based on the characteristics and hyperparameters of each of the plurality of classification algorithms applied to the preprocessed dataset characteristics.
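The accuracy-based mapping described above can be sketched as follows: each candidate (a sampling method or a classification algorithm) is scored on a dataset, and the dataset's characteristics are paired with the highest-scoring candidate. The function names, the record layout, and the candidate names in the test data are hypothetical; how each candidate produces its predictions is outside this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def map_best_candidate(meta_features, y_true, predictions_by_candidate):
    """Pair a dataset's characteristics with its best-scoring candidate.

    predictions_by_candidate -- dict mapping a candidate name to the labels
    it predicted for the same instances as y_true
    """
    scores = {
        name: accuracy(y_true, y_pred)
        for name, y_pred in predictions_by_candidate.items()
    }
    best = max(scores, key=scores.get)
    return {"meta_features": meta_features, "best": best, "scores": scores}
```

Running the same routine once over sampling methods and once over classification algorithms yields the two mappings the text refers to.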

The metadata set generation unit may filter the preprocessed dataset characteristics that were applied to the mapped sampling method and the mapped classification algorithm, perform machine learning by feeding a plurality of datasets related to the filtered dataset characteristics into the recommended sampling method and the recommended classification algorithm, and generate, based on the machine learning, a selection rule base for selecting the recommended sampling method and the recommended classification algorithm.
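The specification does not say here which learner produces the selection rule base. Purely as an assumption for illustration, the sketch below learns a single threshold rule of the form "if a characteristic is at most t, select A, otherwise select B" from records that pair dataset characteristics with the best-performing candidate. The record layout (`"meta_features"`, `"best"`), the function name, and the candidate names in the usage lines are hypothetical.

```python
from collections import Counter


def learn_threshold_rule(records, feature):
    """Learn one 'feature <= threshold' rule that best separates the
    recorded best candidates (a one-level decision stump)."""
    values = sorted({r["meta_features"][feature] for r in records})
    best_rule, best_hits = None, -1
    for t in values:
        left = [r["best"] for r in records if r["meta_features"][feature] <= t]
        right = [r["best"] for r in records if r["meta_features"][feature] > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        # Number of records the rule would label correctly.
        hits = left.count(left_label) + right.count(right_label)
        if hits > best_hits:
            best_hits = hits
            best_rule = {
                "feature": feature,
                "threshold": t,
                "if_le": left_label,
                "otherwise": right_label,
            }
    return best_rule


records = [
    {"meta_features": {"imbalance_ratio": 1.0}, "best": "svm"},
    {"meta_features": {"imbalance_ratio": 1.5}, "best": "svm"},
    {"meta_features": {"imbalance_ratio": 4.0}, "best": "random_forest"},
    {"meta_features": {"imbalance_ratio": 9.0}, "best": "random_forest"},
]
rule = learn_threshold_rule(records, "imbalance_ratio")
```

A practical rule base would likely combine several characteristics, for example with a full decision tree; the single-feature rule only shows the shape such a rule could take.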

The metadata set generation unit may generate a metadata set that includes the plurality of datasets related to the filtered dataset characteristics and the generated selection rule base.

According to an embodiment of the present invention, the apparatus for recommending a sampling method and a classification algorithm using a metadata set may further include a metadata set storage unit that stores the generated metadata set.

The feature extraction unit may extract, from the input user dataset, a plurality of dataset characteristics including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset.

The feature extraction unit may divide the input user dataset into a plurality of folds, determine the data in the folds other than one of the folds to be a plurality of training datasets, and extract the plurality of dataset characteristics from the determined training datasets.

When a missing value exists in the dataset from which the plurality of dataset characteristics were extracted, the feature extraction unit may preprocess the extracted dataset characteristics by handling the missing value using the mean of the corresponding class if the variable containing it is numeric, and the mode of the corresponding class if the variable is nominal. When class imbalance exists in that dataset, the feature extraction unit may preprocess the extracted dataset characteristics by resolving the class imbalance with one of two methods, chosen according to the imbalance: undersampling, which removes instances of the majority class, or oversampling, which duplicates instances of the minority class to match the majority class.

The recommendation unit may recognize the preprocessed dataset characteristics of the user dataset, identify in the generated metadata set the dataset characteristics related to the recognized ones, and recommend at least one of the customized sampling method and the customized classification algorithm based on the identified dataset characteristics and the generated selection rule base.
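One possible reading of "identifying related dataset characteristics" is a nearest-neighbor lookup, sketched below. This is an assumption, not the disclosed method: the Euclidean distance, the function name `recommend`, the entry layout, and the example method names are all hypothetical.

```python
import math


def recommend(user_features, metadata_set):
    """Return the recommendation stored with the metadata entry whose
    characteristics are closest to the user's dataset characteristics."""
    keys = sorted(user_features)

    def distance(entry):
        stored = entry["meta_features"]
        return math.sqrt(sum((user_features[k] - stored[k]) ** 2 for k in keys))

    nearest = min(metadata_set, key=distance)
    return {"sampling": nearest["sampling"], "classifier": nearest["classifier"]}


# Hypothetical metadata set entries, for illustration only.
metadata_set = [
    {"meta_features": {"n_classes": 2, "imbalance_ratio": 1.0},
     "sampling": "none", "classifier": "svm"},
    {"meta_features": {"n_classes": 2, "imbalance_ratio": 8.0},
     "sampling": "oversampling", "classifier": "random_forest"},
]
result = recommend({"n_classes": 2, "imbalance_ratio": 6.5}, metadata_set)
```

The characteristics are compared on their raw scales here; in practice they would usually be normalized first so that large-valued characteristics such as the number of instances do not dominate the distance.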

According to an embodiment of the present invention, a method for recommending a sampling method and a classification algorithm using a metadata set may include: collecting, by a dataset collection unit, an open dataset from an open database; extracting, by a feature extraction unit, a plurality of dataset characteristics of the collected open dataset and preprocessing the extracted dataset characteristics; mapping, by a mapping processing unit, a sampling method and a classification algorithm according to the preprocessed dataset characteristics; generating, by a metadata set generation unit, a selection rule base for selecting a recommended sampling method and a recommended classification algorithm based on the mapped sampling method and the mapped classification algorithm, and generating a metadata set including the generated selection rule base and the preprocessed dataset characteristics; and recommending, by a recommendation unit, at least one of a customized sampling method and a customized classification algorithm for a user dataset input by a user, using the generated metadata set.

The step of extracting the plurality of dataset characteristics of the collected open dataset and preprocessing the extracted dataset characteristics may include: dividing the collected open dataset into a plurality of folds, determining the data in the folds other than one of the folds to be a plurality of training datasets, and extracting from the determined training datasets the plurality of dataset characteristics, including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset; when a missing value exists in the dataset from which the dataset characteristics were extracted, preprocessing the extracted dataset characteristics by handling the missing value using the mean of the corresponding class if the variable containing it is numeric, and the mode of the corresponding class if the variable is nominal; and, when class imbalance exists in that dataset, preprocessing the extracted dataset characteristics by resolving the class imbalance with one of two methods, chosen according to the imbalance: undersampling, which removes instances of the majority class, or oversampling, which duplicates instances of the minority class to match the majority class.

The step of mapping a sampling method and a classification algorithm according to the preprocessed dataset characteristics may include: applying the preprocessed dataset characteristics to a plurality of sampling methods, calculating a sampling method accuracy for each of the applied sampling methods, and mapping the preprocessed dataset characteristics to a sampling method according to the calculated sampling method accuracy; and applying the preprocessed dataset characteristics to a plurality of classification algorithms, calculating a classification algorithm accuracy based on the characteristics and hyperparameters of each of the applied classification algorithms, and mapping the preprocessed dataset characteristics to a classification algorithm according to the calculated classification algorithm accuracy.

The step of generating the selection rule base for selecting a recommended sampling method and a recommended classification algorithm based on the mapped sampling method and the mapped classification algorithm, and generating the metadata set including the generated selection rule base and the preprocessed dataset characteristics, may include: filtering the preprocessed dataset characteristics that were applied to the mapped sampling method and the mapped classification algorithm, performing machine learning by feeding a plurality of datasets related to the filtered dataset characteristics into the recommended sampling method and the recommended classification algorithm, and generating, based on the machine learning, a selection rule base for selecting the recommended sampling method and the recommended classification algorithm; and generating a metadata set that includes the plurality of datasets related to the filtered dataset characteristics and the generated selection rule base.

The present invention can provide an apparatus for recommending a sampling method and a classification algorithm using a metadata set, which generates a metadata set from the dataset characteristics of open datasets and, when a dataset that a user intends to use is input, uses the metadata set to recommend an optimal algorithm type, parameter values, and data sampling method for that dataset.

The present invention can recommend the algorithm type, parameter values, and data sampling method that give the best performance, based on a metadata set generated by learning the dataset characteristics of open datasets in advance and converting them into metadata.

The present invention can reduce time and cost for data science and AI development companies by reducing repeated experiments, and can support environmentally friendly AI development and data analysis by reducing hardware use.

A user who enters a field where data science has not yet been applied may struggle because of a lack of prior knowledge. The present invention can automatically find and recommend similar machine learning algorithms by using datasets learned in advance from the open datasets collected from open databases.

The present invention can help novice data scientists and AI developers who lack know-how about data learn how to select sampling methods and classification algorithms for AI algorithms, by letting them refer to the way similar machine learning algorithms are automatically found and recommended from datasets learned in advance.

FIG. 1 is a diagram illustrating an apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a procedure by which the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention generates a metadata set.
FIGS. 3 to 5 are diagrams illustrating a method for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the structure of the dataset characteristics collected by the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating an example of a metadata set generated by the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating an example of the results of a performance evaluation of classification algorithms by the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating an example of the results of a performance evaluation of sampling methods by the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

Hereinafter, various embodiments of this document are described with reference to the accompanying drawings.

The embodiments and the terms used in them are not intended to limit the technology described in this document to particular embodiments, and should be understood to include various modifications, equivalents, and/or alternatives of those embodiments.

In describing the various embodiments below, detailed descriptions of related well-known functions or configurations are omitted where they are judged likely to obscure the gist of the invention unnecessarily.

The terms used below are defined in consideration of their functions in the various embodiments and may vary according to the intention or custom of users or operators. Their definitions should therefore be based on the content of this specification as a whole.

도면의 설명과 관련하여, 유사한 구성요소에 대해서는 유사한 참조 부호가 사용될 수 있다.In connection with the description of the drawings, like reference numerals may be used for like components.

단수의 표현은 문맥상 명백하게 다르게 뜻하지 않는 한, 복수의 표현을 포함할 수 있다.The singular expression may include the plural expression unless the context clearly dictates otherwise.

In this document, expressions such as "A or B" or "at least one of A and/or B" may include all possible combinations of the items listed together.

Expressions such as "1st," "2nd," "first," or "second" may modify the corresponding components regardless of order or importance. They are used only to distinguish one component from another and do not limit the components.

When a (e.g., first) component is referred to as being "(functionally or communicatively) connected" or "coupled" to another (e.g., second) component, the component may be connected to the other component directly or through yet another component (e.g., a third component).

In this specification, "configured to (or set to)" may be used interchangeably, depending on the situation, with, for example, "suitable for," "having the capacity to," "adapted to," "made to," "capable of," or "designed to," whether in hardware or in software.

In some situations, the expression "a device configured to" may mean that the device is "capable of" operating together with other devices or components.

For example, the phrase "a processor configured (or set) to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing those operations, or a general-purpose processor (e.g., a CPU or an application processor) capable of performing those operations by executing one or more software programs stored in a memory device.

In addition, the term 'or' means an inclusive or rather than an exclusive or.

That is, unless stated otherwise or clear from the context, the expression 'x uses a or b' means any one of the natural inclusive permutations.

Terms such as '... unit' and '...er' used below refer to a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 is a diagram illustrating an apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

FIG. 1 illustrates the components of the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

Referring to FIG. 1, the apparatus 100 for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention includes a dataset collection unit 110, a feature extraction unit 120, a mapping processing unit 130, a metadata set generation unit 140, and a recommendation unit 150.

According to an embodiment of the present invention, the dataset collection unit 110 collects an open dataset from an open database.

Here, the open database may refer to a database that stores publicly available data.

That is, the dataset collection unit 110 collects an open dataset from an open database for prior learning. The open dataset may be data that a user of the apparatus 100 inputs from an open database, or data that the apparatus 100 itself collects by accessing an open database.

Meanwhile, the dataset collection unit 110 may also collect a user dataset when the user inputs one for testing.

According to an embodiment of the present invention, the feature extraction unit 120 may extract a plurality of dataset features from an open dataset or a user dataset and preprocess the extracted dataset features.

For example, the feature extraction unit 120 may extract, from the collected open dataset, a plurality of dataset features including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset, and may preprocess the extracted dataset features.

Likewise, the feature extraction unit 120 may extract, from the input user dataset, a plurality of dataset features including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset, and may preprocess the extracted dataset features.

That is, the feature extraction unit 120 may extract a plurality of dataset features including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset.
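As an illustration only, a few of the simpler features listed above can be computed as in the following sketch. The function name `basic_meta_features`, the dictionary keys, and the use of the majority-to-minority count ratio as the "degree of class skew" are assumptions made for this example; the specification does not define how each feature is calculated. The sketch assumes a non-empty dataset.

```python
import math
from collections import Counter


def basic_meta_features(rows, labels):
    """Compute a handful of simple dataset-level features.

    rows   -- list of equal-length feature lists (one per instance)
    labels -- list of class labels, one per instance (assumed non-empty)
    """
    counts = Counter(labels)
    n = len(labels)
    # Shannon entropy (in bits) of the class distribution.
    class_entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "n_variables": len(rows[0]) if rows else 0,
        "n_instances": n,
        "n_classes": len(counts),
        # Assumed skew measure: majority count divided by minority count.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
        "class_entropy": class_entropy,
    }
```

Features such as the silhouette score or hub score need distance computations over the instances and are left out of this sketch.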

Specifically, the feature extraction unit 120 may divide a user dataset or an open dataset into a plurality of folds, designate the data in all folds except one as a plurality of training datasets, and extract the plurality of dataset features from those training datasets.

Here, the data in the one remaining fold may be used as the test dataset for the classification algorithm that is selected once training on the plurality of training datasets is complete.

According to an embodiment of the present invention, the feature extraction unit 120 may determine whether missing values and class imbalance are present in the plurality of dataset features, and may perform preprocessing to resolve them.

For example, when a missing value exists in the dataset from which the plurality of dataset features are extracted and the variable containing the missing value is numeric, the feature extraction unit 120 may preprocess the extracted dataset features by filling the missing value with the mean of the corresponding class.

According to an embodiment of the present invention, when a missing value exists in the dataset from which the plurality of dataset features are extracted and the variable containing the missing value is nominal, the feature extraction unit 120 may preprocess the extracted dataset features by filling the missing value with the mode of the corresponding class.
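The class-wise treatment of missing values described above (class mean for a numeric variable, class mode for a nominal variable) might look like the following minimal sketch. The function name `impute_by_class`, the use of `None` as the missing-value marker, and the list-of-lists data layout are assumptions for illustration. The sketch also assumes every class has at least one observed value in the column being filled.

```python
from collections import Counter
from statistics import mean


def impute_by_class(rows, labels, column, numeric):
    """Return a copy of rows with None in `column` filled per class.

    numeric=True  -> fill with the mean of the same class
    numeric=False -> fill with the most frequent value (mode) of the same class
    """
    observed = {}
    for row, label in zip(rows, labels):
        if row[column] is not None:
            observed.setdefault(label, []).append(row[column])

    fill = {}
    for label, values in observed.items():
        if numeric:
            fill[label] = mean(values)
        else:
            fill[label] = Counter(values).most_common(1)[0][0]

    result = []
    for row, label in zip(rows, labels):
        new_row = list(row)  # copy so the input is not modified
        if new_row[column] is None:
            new_row[column] = fill[label]
        result.append(new_row)
    return result
```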

For example, when class imbalance exists in the dataset from which the plurality of dataset features are extracted, the feature extraction unit 120 may preprocess the dataset features by resolving the imbalance with one of two methods, chosen according to the imbalance present: under-sampling, which removes instances of the majority class, or over-sampling, which replicates instances of the minority class to match the majority class.
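The two class-imbalance remedies named above can be sketched as simple random resampling. This is one possible reading, not the specification's implementation: the function name `resample`, the `method` strings, and the choice of random (rather than informed) selection are assumptions for this example.

```python
import random
from collections import defaultdict


def resample(rows, labels, method, seed=0):
    """Balance classes by random under-sampling or over-sampling.

    method="under" -> shrink every class to the size of the smallest class
    method="over"  -> grow every class to the size of the largest class
                      by replicating randomly chosen instances
    """
    if method not in ("under", "over"):
        raise ValueError("method must be 'under' or 'over'")
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)

    sizes = [len(group) for group in groups.values()]
    target = min(sizes) if method == "under" else max(sizes)

    out_rows, out_labels = [], []
    for label, group in groups.items():
        if method == "under":
            chosen = rng.sample(group, target)
        else:
            extra = [rng.choice(group) for _ in range(target - len(group))]
            chosen = group + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * len(chosen))
    return out_rows, out_labels
```

Under-sampling discards majority-class instances, so it trades information for balance; over-sampling keeps all data but repeats minority-class instances.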

According to an embodiment of the present invention, the mapping processing unit 130 may map a sampling method and a classification algorithm to the plurality of dataset features preprocessed by the feature extraction unit 120.

For example, the mapping processing unit 130 may apply the preprocessed dataset features to a plurality of sampling methods, calculate a sampling method accuracy for each of the applied sampling methods, and map the preprocessed dataset features to a sampling method according to the calculated sampling method accuracy.

According to an embodiment of the present invention, the mapping processing unit 130 may apply the preprocessed dataset features to a plurality of classification algorithms, calculate a classification algorithm accuracy for each of the applied classification algorithms, and map the preprocessed dataset features to a classification algorithm according to the calculated classification algorithm accuracy.
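One way to read the mapping described above is "score every candidate, then record the best one alongside the dataset features." The sketch below follows that reading. The selection rule (highest F1-score, with G-mean as tie-breaker), the function name `map_best_method`, and the candidate names and scores in the usage note are all assumptions for illustration; the specification states only that mapping follows the calculated accuracy.

```python
def map_best_method(features, scores):
    """Pair dataset features with the best-scoring candidate.

    features -- dict of dataset features
    scores   -- {candidate_name: {"f1": float, "g_mean": float}}
    Returns a record {"features": ..., "best": candidate_name}.
    """
    # Assumed rule: prefer the higher F1-score, break ties with G-mean.
    best = max(scores, key=lambda name: (scores[name]["f1"], scores[name]["g_mean"]))
    return {"features": dict(features), "best": best}
```

For instance, with made-up scores `{"under_sampling": {"f1": 0.71, "g_mean": 0.80}, "over_sampling": {"f1": 0.78, "g_mean": 0.79}}`, the record would name `over_sampling` as the mapped sampling method.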

Here, the F1-score and the G-mean are used as the sampling method accuracy and the classification algorithm accuracy. The F1-score is a value determined from the true positive rate and the positive predictive value, and the G-mean can be regarded as a geometric mean of the true positive rate and the true negative rate.

For example, the F1-score is the harmonic mean of precision and recall.

The F1-score may be calculated using Equation 1 below, and the G-mean may be calculated using Equation 2 below.

[Equation 1]

Equation 1 may be calculated from precision and recall.

[Equation 2]

Equation 2 may be calculated from the true positive (TP) rate and the true negative (TN) rate.
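The bodies of Equations 1 and 2 are not reproduced in this text. As a sketch under the standard definitions, and consistent with the surrounding description (a harmonic mean of precision and recall, and a mean built from the true positive and true negative rates), the two measures are commonly written as follows. The confusion-matrix symbols TP, FP, TN, and FN are notation assumed here, not taken from the original equations.

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}
```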

According to an embodiment of the present invention, the mapping processing unit 130 may calculate the classification algorithm accuracy based on the characteristics and hyperparameters of each of the plurality of classification algorithms applied to the preprocessed dataset features.

That is, for the characteristics and hyperparameters of each classification algorithm, the mapping processing unit 130 may calculate the classification algorithm accuracy from precision and recall, or from the true positive (TP) rate and the true negative (TN) rate.
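The two accuracy measures can be computed from confusion-matrix counts as in the following sketch, which uses the standard definitions of the F1-score and G-mean. The function names and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
import math


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fp, tn, fn):
    """Geometric mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)
```

Both measures penalize a classifier that does well on only one class, which is why they are often preferred over plain accuracy on imbalanced data.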

According to an embodiment of the present invention, the metadata set generation unit 140 may generate a selection rule base for selecting a recommended sampling method and a recommended classification algorithm, based on the sampling method and the classification algorithm mapped by the mapping processing unit 130.

In addition, the metadata set generation unit 140 may generate a metadata set that includes the generated selection rule base and the preprocessed dataset features.

Here, the selection rule base may serve as the criterion for selecting the recommended sampling method and the recommended classification algorithm when a further test dataset is input.

According to an embodiment of the present invention, the metadata set generation unit 140 may filter the preprocessed dataset features applied to the mapped sampling method and the mapped classification algorithm, feed the plurality of datasets associated with the filtered dataset features into the recommended sampling method and the recommended classification algorithm for machine learning, and generate, based on the machine learning, a selection rule base for selecting the recommended sampling method and the recommended classification algorithm. Here, "based on the machine learning" may mean using the result of the machine learning.

In addition, the metadata set generation unit 140 may generate a metadata set that includes the plurality of datasets associated with the filtered dataset features and the generated selection rule base.

According to an embodiment of the present invention, the recommendation unit 150 may use the generated metadata set to recommend at least one of a customized sampling method and a customized classification algorithm for a user dataset input by a user.

That is, when the user inputs the user dataset to be used, the recommendation unit 150 scans the input dataset, automatically recognizes its features through the feature extraction unit 120, and, based on the metadata set generated by prior learning, automatically recommends or selects the optimal algorithm type, parameter values, and data sampling method suited to those features.

For example, the recommendation unit 150 may recognize the preprocessed dataset features of the user dataset, identify in the generated metadata set the dataset features related to the recognized features, and recommend at least one of a customized sampling method and a customized classification algorithm based on the identified dataset features and the generated selection rule base.
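The lookup described above (find metadata entries whose features are related to those of the user dataset, then recommend accordingly) could be approximated by a nearest-neighbor search, as in the sketch below. This is an assumption, not the specification's rule base: the Euclidean distance, the record layout, the function name `recommend`, and the sampling and classifier names in `METADATA_SET` are hypothetical placeholders.

```python
import math

# Hypothetical metadata records; values and method names are placeholders.
METADATA_SET = [
    {"features": {"n_classes": 2, "imbalance_ratio": 9.0, "class_entropy": 0.47},
     "sampling": "over_sampling", "classifier": "random_forest"},
    {"features": {"n_classes": 3, "imbalance_ratio": 1.1, "class_entropy": 1.58},
     "sampling": "none", "classifier": "svm"},
]


def recommend(user_features, metadata_set):
    """Return (sampling, classifier) of the record closest to user_features."""
    def distance(record):
        # Euclidean distance over the features present in user_features.
        return math.sqrt(sum(
            (record["features"][key] - value) ** 2
            for key, value in user_features.items()
        ))
    closest = min(metadata_set, key=distance)
    return closest["sampling"], closest["classifier"]
```

In practice the features would likely need scaling before distances are compared, since counts and entropies live on different ranges.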

According to an embodiment of the present invention, the apparatus 100 for recommending a sampling method and a classification algorithm using a metadata set may further include a metadata set storage unit (not shown) that stores the metadata set generated by the metadata set generation unit 140.

According to an embodiment of the present invention, when a user dataset is input, the apparatus 100 may directly retrieve the metadata set stored in the metadata set storage unit (not shown) and use it to recommend a sampling method and a classification algorithm suited to the user dataset.

Accordingly, the present invention can provide an apparatus for recommending a sampling method and a classification algorithm using a metadata set, which generates a metadata set from the dataset features of open datasets and, when a dataset the user intends to use is input, uses the metadata set to recommend the optimal algorithm type, parameter values, and data sampling method for that dataset.

In addition, by learning the dataset features of open datasets in advance and converting them into metadata, the present invention can recommend, based on the resulting metadata set, the algorithm type, parameter values, and data sampling method that give the best performance.

FIG. 2 is a diagram illustrating a procedure in which the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention generates a metadata set.

FIG. 2 illustrates a process in which the apparatus according to an embodiment of the present invention generates a metadata set from an open dataset.

Referring to FIG. 2, in step S201, the apparatus according to an embodiment of the present invention divides an open dataset collected from an open database into a plurality of folds.

For example, the apparatus divides one open dataset into 10 folds, uses the data in 9 of the folds as training datasets, and uses the data in the remaining fold as the test dataset.
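The 10-fold division described above can be sketched as follows. The function names and the round-robin assignment of rows to folds are assumptions for this example; the specification does not say how instances are assigned to folds, and a real pipeline would typically shuffle or stratify first.

```python
def split_into_folds(rows, k=10):
    """Deal rows round-robin into k folds of near-equal size."""
    return [rows[i::k] for i in range(k)]


def train_test_for_fold(folds, test_index):
    """Use one fold as the test dataset and the rest as training data."""
    test = folds[test_index]
    train = [
        row
        for index, fold in enumerate(folds)
        if index != test_index
        for row in fold
    ]
    return train, test
```

Calling `train_test_for_fold(folds, i)` for each `i` from 0 to 9 gives the ten train/test pairs used in 10-fold cross-validation.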

In step S202, the apparatus according to an embodiment of the present invention extracts a plurality of dataset features from the nine training datasets and the one test dataset.

Here, the extracted dataset features may include the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset.

That is, the apparatus according to an embodiment of the present invention may extract a plurality of dataset features for each dataset. The extracted dataset features may also be included in the metadata set.

In step S203, the apparatus according to an embodiment of the present invention may impute missing values in the extracted dataset features.

That is, when the variable containing a missing value is numeric, the apparatus imputes the missing value with the mean of the corresponding class, and when the variable is nominal, it imputes the missing value with the mode of the corresponding class.

In step S204, the apparatus according to an embodiment of the present invention may resolve class imbalance by applying a class imbalance resolution method.

Class imbalance, which strongly affects classification performance, refers to an imbalance among the values of the target variable to be classified. One remedy is cost-sensitive learning, which modifies an existing algorithm to mitigate the imbalance. Unlike data sampling methods that adjust the class distribution, cost-sensitive learning may reduce classification errors by applying a cost matrix to misclassified data.
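To make the cost-matrix idea concrete, the sketch below shows a minimum-expected-cost decision rule, one common form of cost-sensitive classification. It is background illustration only: the function name `min_cost_class`, the matrix layout, and the example costs are assumptions, and the specification does not state that the apparatus uses this rule.

```python
def min_cost_class(probabilities, cost_matrix):
    """Pick the class index with the lowest expected misclassification cost.

    probabilities    -- predicted probability of each true class
    cost_matrix[i][j] -- cost of predicting class j when the true class is i
    """
    n = len(probabilities)
    expected = [
        sum(probabilities[i] * cost_matrix[i][j] for i in range(n))
        for j in range(n)
    ]
    return min(range(n), key=expected.__getitem__)
```

With probabilities `[0.7, 0.3]` and a matrix `[[0, 1], [5, 0]]` that makes missing the minority class (index 1) five times as costly, the rule predicts class 1 even though class 0 is more probable.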

There are also over-sampling and under-sampling methods. Under-sampling balances the class distribution by removing instances of the majority class; its drawback is loss of information. Over-sampling, in contrast, may balance the distribution by replicating instances of the minority class to match the majority class.

When class imbalance exists in the dataset from which the plurality of dataset features are extracted, the apparatus according to an embodiment of the present invention may apply a class imbalance resolution method that resolves the imbalance with one of two methods, chosen according to the imbalance present: under-sampling, which removes instances of the majority class, or over-sampling, which replicates instances of the minority class to match the majority class.

In step S205, the apparatus according to an embodiment of the present invention extracts the preprocessed dataset features.

In step S206, the apparatus according to an embodiment of the present invention runs a test by applying the preprocessed dataset features corresponding to the training datasets to the classification algorithms.

In step S207, the apparatus according to an embodiment of the present invention evaluates performance based on the result of step S206.

That is, the apparatus may use K-fold cross-validation to perform sampling for each fold, extract data features and run a performance test on each fold, and evaluate performance from the resulting test results.

In step S208, the apparatus according to an embodiment of the present invention generates a metadata set.

That is, the apparatus generates the metadata set based on the plurality of dataset features extracted in step S202, the preprocessed dataset features extracted in step S205, the classification algorithms applied in step S206, and the performance evaluation results obtained in step S207.

In other words, the apparatus filters the preprocessed dataset features applied to the mapped sampling method and the mapped classification algorithm, feeds the plurality of datasets associated with the filtered dataset features into the recommended sampling method and the recommended classification algorithm for machine learning, generates, based on the machine learning, a selection rule base for selecting the recommended sampling method and the recommended classification algorithm, and generates a metadata set that includes both the generated selection rule base and the plurality of dataset features.

FIG. 3 is a diagram illustrating a method for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

FIG. 3 illustrates an embodiment in which the method according to an embodiment of the present invention identifies the features of a dataset extracted from an open dataset and recommends a customized algorithm.

Referring to FIG. 3, in step 301, the method according to an embodiment of the present invention collects an open dataset from an open database.

That is, the method may collect the open dataset from an open database.

In step 302, the method according to an embodiment of the present invention extracts a plurality of dataset features of the open dataset.

That is, the method divides the collected open dataset into a plurality of folds, designates the data in all folds except one as a plurality of training datasets, and extracts from those training datasets a plurality of dataset features including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset.

In step 303, the method according to an embodiment of the present invention preprocesses the plurality of dataset features.

That is, the method performs preprocessing to impute missing values in the plurality of dataset features and to resolve class imbalance.

In step 304, the recommendation method according to an embodiment of the present invention maps sampling methods to the plurality of dataset characteristics.

That is, the recommendation method may apply the preprocessed dataset characteristics to a plurality of sampling methods, calculate the accuracy obtained with each applied sampling method, and map the preprocessed dataset characteristics to a sampling method according to the calculated accuracy.

In step 305, the recommendation method according to an embodiment of the present invention maps classification algorithms to the plurality of dataset characteristics.

That is, the recommendation method may apply the preprocessed dataset characteristics to a plurality of classification algorithms, calculate the accuracy of each applied classification algorithm based on the characteristics and hyperparameters of that algorithm, and map the preprocessed dataset characteristics to a classification algorithm according to the calculated accuracy.

In step 306, the recommendation method according to an embodiment of the present invention generates a selection rule base from the mapped sampling method and classification algorithm.

That is, the recommendation method may filter the preprocessed dataset characteristics applied to the mapped sampling method and the mapped classification algorithm, feed the datasets associated with the filtered characteristics into the recommended sampling method and the recommended classification algorithm for machine learning, and generate, based on that machine learning, a selection rule base for selecting the recommended sampling method and the recommended classification algorithm.

In step 307, the recommendation method according to an embodiment of the present invention stores the selection rule base as a metadata set.

That is, the recommendation method may store the plurality of dataset characteristics, together with the selection rule base generated in step 306, as the metadata set.

In step 308, the recommendation method according to an embodiment of the present invention may use the metadata set to recommend at least one of a customized sampling method and a customized classification algorithm for a user dataset input by a user.

That is, the recommendation method may compare the dataset characteristics of the user dataset input by the user with the dataset characteristics contained in the metadata set, and recommend at least one of a customized sampling method and a customized classification algorithm.
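
The comparison in step 308 can be sketched as a nearest-record lookup. This is only an illustration under stated assumptions: the text does not say how characteristics are compared, so min-max scaling with Euclidean distance is an assumed choice, and the record layout, the name `METADATA_SET`, and all feature values are hypothetical.

```python
import math

# Hypothetical metadata set: each record pairs dataset characteristics with the
# sampling method and classifier that were mapped to them (values are made up).
METADATA_SET = [
    {"features": {"n_instances": 1000, "n_variables": 10, "n_classes": 2, "hhi": 0.82},
     "sampling": "SMOTE", "classifier": "RF"},
    {"features": {"n_instances": 200, "n_variables": 40, "n_classes": 3, "hhi": 0.35},
     "sampling": "Tomek", "classifier": "SVM"},
]


def _scale(name, value, records):
    """Min-max scale one characteristic using the range seen in the metadata set."""
    values = [r["features"][name] for r in records]
    lo, hi = min(values), max(values)
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def recommend(user_features, records=METADATA_SET):
    """Return (sampling, classifier) of the record closest to the user dataset."""
    def distance(record):
        return math.sqrt(sum(
            (_scale(k, user_features[k], records)
             - _scale(k, record["features"][k], records)) ** 2
            for k in user_features))

    best = min(records, key=distance)
    return best["sampling"], best["classifier"]
```

A rule base learned in step 306 could replace the distance lookup; the sketch only shows the idea of matching a user dataset against stored characteristics.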

FIG. 4 is a diagram for explaining the method for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

FIG. 4 illustrates an embodiment in which the recommendation method recommends at least one of a customized sampling method and a customized classification algorithm for an input user dataset.

Referring to FIG. 4, in step 401, the recommendation method according to an embodiment of the present invention receives a user dataset.

That is, the recommendation method receives a user dataset from a user; the user dataset can be regarded as the dataset that the user intends to use with an artificial intelligence algorithm.

In step 402, the recommendation method according to an embodiment of the present invention extracts a plurality of dataset characteristics of the user dataset.

That is, the recommendation method may extract from the input user dataset a plurality of dataset characteristics including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset.

In step 403, the recommendation method according to an embodiment of the present invention preprocesses the plurality of dataset characteristics.

That is, the recommendation method performs preprocessing to impute missing values and resolve class imbalance for the plurality of dataset characteristics extracted from the user dataset.

In step 404, the recommendation method according to an embodiment of the present invention determines, based on the metadata set, a recommended sampling method for the plurality of dataset characteristics.

That is, the recommendation method applies the dataset characteristics of the user dataset to the pre-trained recommended sampling methods contained in the metadata set, and determines the sampling method with the best performance as the recommended sampling method.

In step 405, the recommendation method according to an embodiment of the present invention determines, based on the metadata set, a recommended classification algorithm for the plurality of dataset characteristics.

That is, the recommendation method applies the dataset characteristics of the user dataset to the pre-trained recommended classification algorithms contained in the metadata set, and determines the classification algorithm with the best performance as the recommended classification algorithm. The type and parameter values of the classification algorithm may also be determined along with the recommended classification algorithm.

In step 406, the recommendation method according to an embodiment of the present invention learns the recommended sampling method and the recommended classification algorithm, and may recommend at least one of a customized sampling method and a customized classification algorithm.

That is, the recommendation method learns the recommended sampling method and the recommended classification algorithm determined by considering both the dataset characteristics of the user dataset and the dataset characteristics contained in the metadata set, and can thereby automatically determine and recommend a classification algorithm and a data sampling method customized to the user dataset.

FIG. 5 is a diagram for explaining the method for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

FIG. 5 illustrates an embodiment in which the recommendation method generates the metadata set as it collects a plurality of dataset characteristics.

Referring to FIG. 5, in step 501, the recommendation method according to an embodiment of the present invention applies a sampling method to a training dataset to generate a plurality of training datasets.

That is, the recommendation method generates a plurality of training datasets using a sampling method, which is one of the data sampling methods and samples the training data so that it suits a classification algorithm.

In step 502, the recommendation method according to an embodiment of the present invention extracts the characteristics of the dataset.

That is, the recommendation method may extract from each training dataset a plurality of dataset characteristics including the number of variables, the number of instances, the number of classes, the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, and the linearity and neighborhood of the dataset.

In step 503, the recommendation method according to an embodiment of the present invention determines whether missing values exist.

The recommendation method proceeds to step 504 if a missing value exists, and to step 507 if no missing value exists.

In step 504, the recommendation method according to an embodiment of the present invention determines whether the variable containing the missing value is numeric.

For example, the recommendation method proceeds to step 505 if the variable containing the missing value is numeric, and to step 506 if it is not.

In step 505, the recommendation method according to an embodiment of the present invention handles the missing value by filling in the mean of the corresponding class.

That is, the recommendation method may impute a numeric missing value by filling it with the class mean.

In step 506, the recommendation method according to an embodiment of the present invention handles the missing value by filling in the mode of the corresponding class.

That is, the recommendation method may impute a nominal missing value by filling it with the class mode.
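
Steps 503 to 506 can be sketched as follows. This is a minimal illustration, assuming rows are dictionaries and a missing value is represented as `None`; the function name `impute_by_class` and its `numeric` flag are hypothetical and stand in for the numeric check of step 504.

```python
from collections import Counter
from statistics import mean


def impute_by_class(rows, labels, column, numeric):
    """Fill None in `column` with the class mean (numeric) or class mode (nominal)."""
    fill = {}
    for cls in set(labels):
        observed = [r[column] for r, y in zip(rows, labels)
                    if y == cls and r[column] is not None]
        if not observed:
            continue  # no observed value in this class: leave its gaps untouched
        if numeric:
            fill[cls] = mean(observed)                        # step 505: class mean
        else:
            fill[cls] = Counter(observed).most_common(1)[0][0]  # step 506: class mode
    return [
        {**r, column: fill.get(y, r[column])} if r[column] is None else dict(r)
        for r, y in zip(rows, labels)
    ]
```

The function returns new rows rather than modifying the input, so the same raw data can be reused across folds.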

In step 507, the recommendation method according to an embodiment of the present invention performs sampling for each fold using K-fold cross-validation and stores the test results.

That is, the recommendation method samples the training dataset characteristics of each of the K folds, and derives and stores the test results for the sampled training dataset characteristics. For example, K may be 10.
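
The fold handling in step 507 can be sketched with a plain K-fold split. This is an illustrative sketch only: the shuffling, the `seed` parameter, and the function name `k_fold_indices` are assumptions, and the text does not say whether the folds are stratified by class.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold
```

A sampling method would be applied to the training indices of each fold only, so that the held-out fold keeps its original class distribution when the test result is recorded.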

In step 508, the recommendation method according to an embodiment of the present invention applies a class imbalance method and a classification algorithm.

That is, the recommendation method may apply a method for resolving class imbalance, and apply the test dataset characteristics associated with the sampled training dataset characteristics to the classification algorithm.

In step 509, the recommendation method according to an embodiment of the present invention measures classification performance.

That is, the recommendation method measures the classification performance of the result obtained with the classification algorithm; the F1-score and the G-mean may be used for this measurement.
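
The two metrics named in step 509 can be computed from a confusion matrix as sketched below. The sketch assumes a binary problem and uses the common definition of G-mean as the geometric mean of sensitivity and specificity; the function name `f1_and_gmean` and the `positive` parameter are hypothetical.

```python
import math


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1-score, G-mean) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return f1, g_mean
```

Both metrics stay low when the minority class is ignored, which is why they are often preferred over plain accuracy on imbalanced data.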

FIG. 6 is a diagram for explaining the structure of the dataset characteristics collected by the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

Referring to FIG. 6, the dataset characteristics (600) have a structure that can be divided into dataset complexity (610) and basic dataset characteristics (620).

For example, the dataset complexity (610) may include the degree of class skew, the class entropy, the degree of variable overlap, the silhouette score, the hub score, the variable entropy, the linearity of the dataset, the neighborhood, and the HHI (Herfindahl-Hirschman Index).

Meanwhile, the basic dataset characteristics (620) may include the number of instances, the number of missing values, the number of variables, and the number of classes.
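
The basic dataset characteristics (620) are simple counts, as the sketch below shows. It assumes the dataset is a list of equal-length rows with `None` marking a missing value; the function name and dictionary keys are hypothetical.

```python
def basic_characteristics(rows, labels):
    """Count instances, variables, classes, and missing values of a dataset."""
    return {
        "n_instances": len(rows),
        "n_variables": len(rows[0]) if rows else 0,
        "n_classes": len(set(labels)),
        "n_missing": sum(value is None for row in rows for value in row),
    }
```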

The HHI (Herfindahl-Hirschman Index) may express the market situation, that is, the competitive situation of an industry, as a value from 0 to 1, and Silhouette may be an evaluation index used to assess clusters in cluster analysis.

A higher Silhouette value indicates that the clusters are well formed: the members of each cluster are homogeneous, and at the same time the clusters are clearly distinct from one another.

The number of instances indicates the size of the dataset; a larger dataset contains more information.

However, larger data tends to contain more noise and takes longer to train on in machine learning, while data that is too small may be difficult to classify correctly because it carries too little information.

As for the number of variables, considering too few variables reduces the complexity of the predictive model, but accuracy may be lower than when many variables are considered.

The G-mean tends to rise as the number of variables decreases.

The number of classes is the dimension of the class variable. For a binary class, the expected probability of each class is 0.5; with four classes, it is 0.25.

In addition, because the expected probability differs between the binary case and the four-class case, and falls as the number of classes grows, the number of classes may be closely related to classification accuracy.

Missing values occur in many fields. They can adversely affect data mining, machine learning, and other information systems.

Missing values are typically caused by sensor faults, lack of response in scientific experiments, measurement faults, data transmission problems in digital systems, or respondents' reluctance to answer surveys.

If there are many missing values, there is too little information to classify the classes, which can cause misclassification.

The HHI is one of the indices of market concentration among the companies in an industry. Its advantage is that it expresses the competitive situation of the entire industry as a single value from 0 to 1, so the situation can be grasped intuitively at a glance.

The HHI can be applied to the balance of classes: if the proportion of each class is treated as a market share, the balance of the classes resembles the competitive state of a market.
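
Treating each class proportion as a market share, the HHI of the class distribution can be sketched as the sum of squared proportions, which is the standard form of the index on a 0 to 1 scale. The function name `class_hhi` is hypothetical.

```python
from collections import Counter


def class_hhi(labels):
    """Herfindahl-Hirschman Index of the class distribution (sum of squared shares)."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())
```

With this form, a perfectly balanced dataset with k classes scores 1/k, and the value approaches 1 as one class dominates.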

Entropy represents the amount and purity of information, and quantitatively measures the uncertainty in the amount of information produced by the given data: the closer the probability of occurrence is to 1, the smaller the amount of information, and the lower the probability, the greater the amount of information.

In other words, rarer information carries more information, and more common data is more consistent.

Entropy can therefore indicate whether the data is consistent and pure, without many unusual values.

Class entropy expresses class imbalance using entropy, and may be expressed as a higher number as the class imbalance becomes more severe.
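
The class entropy can be sketched with Shannon entropy over the class proportions. One caveat is worth stating as an assumption: plain Shannon entropy is largest when the classes are balanced, so a score that grows with imbalance, as described above, needs a transformation. The sketch returns both the raw entropy and one minus the normalized entropy as one possible such score; this transformation and the function name are assumptions, not taken from the text.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Return (Shannon entropy in bits, imbalance score in [0, 1])."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    if len(probs) < 2:
        return entropy, 1.0  # a single class is treated as fully imbalanced
    # Assumed imbalance score: 0 for balanced classes, near 1 when one class dominates.
    imbalance = 1.0 - entropy / math.log2(len(probs))
    return entropy, imbalance
```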

The Silhouette coefficient is an index used after cluster analysis to verify whether the clusters were formed correctly.

It makes it possible to judge how densely the values within each resulting cluster are grouped and whether the distance between clusters is sufficient. The closer the Silhouette value is to 1, the more appropriate the number of clusters is judged to be.

Regarding data linearity (data nonlinearity), the more complex the shape of the data, the harder the classification problem can be, and some classification algorithms solve classification problems using linear models.

For example, the present invention may use the nonlinearity of a linear classifier measure, which considers the linearity of the data and outliers at the same time.

A method may be used that compares the error of a classification obtained with a non-linear classifier against the error of a classification obtained with the linear kernel function of an SVM (support vector machine).

The hub score is an index that measures the cohesion of the data using network connectivity.

The hub score may be measured by constructing a network from the given data and counting the connections of each node.

The degree of variable overlap (feature overlap) expresses the degree of overlap between variables as a ratio. The less the data overlaps, the clearer the decision boundary becomes. Accordingly, the higher the overlap of the variables, the more advantageous undersampling methods such as Tomek link, ENN, and CNN may be.

Neighborhood relates to a cause of degraded classification performance in linear separation problems: when data is mislabeled, it ends up located in the wrong place.

For each instance, the distance to its nearest neighbor within the same class and the distance to its nearest neighbor in another class are calculated, and the ratio of the sum of the within-class distances to the sum of the between-class distances is used as the value of this characteristic for the dataset.
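
The ratio described above can be sketched as follows. This is an illustration assuming numeric feature vectors and Euclidean distance (the text does not name a distance); instances without a same-class or other-class neighbor are skipped, which is also an assumption. The function name `intra_extra_ratio` is hypothetical.

```python
import math


def intra_extra_ratio(points, labels):
    """Ratio of summed within-class to summed between-class nearest-neighbor distances."""
    intra_sum = 0.0
    extra_sum = 0.0
    for i, (p, y) in enumerate(zip(points, labels)):
        same = [math.dist(p, q)
                for j, (q, z) in enumerate(zip(points, labels)) if j != i and z == y]
        other = [math.dist(p, q) for q, z in zip(points, labels) if z != y]
        if not same or not other:
            continue  # no neighbor of one kind: this instance cannot contribute
        intra_sum += min(same)
        extra_sum += min(other)
    return intra_sum / extra_sum if extra_sum else float("inf")
```

A small ratio means each instance sits much closer to its own class than to any other class, which suggests an easier classification problem.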

Dimensionality may express, as a ratio, the difference between the number of variables after reduction by PCA (Principal Component Analysis) and the number of variables in the original data.

FIG. 7 is a diagram illustrating a metadata set generated by the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

Referring to FIG. 7, the example metadata set (700) may include various information such as an identification number, accuracy, G-mean score, F-score, algorithm data, method, ratio, and a plurality of dataset characteristics.

FIG. 8 is a diagram illustrating the results of a performance evaluation of classification algorithms by the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

Referring to FIG. 8, the graph (800) shows the type of classification algorithm on the horizontal axis and the frequency on the vertical axis.

The classification algorithms are k-NN (k-nearest neighbors), LR (logistic regression), NB (naïve Bayes), RF (random forest), and SVM (support vector machine).

The graph (800) shows that random forest (RF), an algorithm for classifying classes, outperforms the other algorithms.

Unlike the naïve Bayes classifier, SVM, and decision tree models, the k-NN method is a lazy learning method: it does not use the training data in advance and operates only when the data to be classified is given.

The logistic regression model has the functional form of the logistic cumulative distribution function (CDF).

The naïve Bayes classifier is a probabilistic approach to supervised learning that uses Bayes' theorem. The model is relatively simple and its computation is not complex, yet it is known to perform well.

Random forest (RF) is a classification algorithm based on decision trees. It is an ensemble technique that generates multiple decision trees and combines them by majority vote.

SVM (support vector machine) is a method that classifies n-dimensional data using an (n-1)-dimensional hyperplane.

도 9은 본 발명의 일실시예에 따른 메타데이터셋을 이용한 샘플링 방법 및 분류 알고리즘 추천 장치가 샘플링 방법과 관련하여 성능 평가한 결과를 예시하는 도면이다.9 is a diagram illustrating a result of performance evaluation in relation to a sampling method by the apparatus for recommending a sampling method and a classification algorithm using a metadata set according to an embodiment of the present invention.

도 9를 참고하면, 그래프(900)는 가로축에서 샘플링 방법의 종류를 나타내고, 세로축에서 주파수를 나타낸다.Referring to FIG. 9 , a graph 900 indicates the type of sampling method on the horizontal axis and frequency on the vertical axis.

샘플링 방법의 종류는 Adasyn(Adaptive Synthetic Sampling Approach for Imbalanced Learning), ENN(Edited Nearest Neighbors), NCL(Neighbourhood Cleaning Rule), ROS(Random Over Sampling), RUS(Random Under Sampling), SMOTE(Synthetic Minority Oversampling TEchnique) 및 Tomek을 포함한다.The sampling methods are Adasyn(Adaptive Synthetic Sampling Approach for Imbalanced Learning), ENN(Edited Nearest Neighbors), NCL(Neighborhood Cleaning Rule), ROS(Random Over Sampling), RUS(Random Under Sampling), SMOTE(Synthetic Minority Oversampling TEchnique) ) and Tomek.

Graph 900 confirms that ADASYN and SMOTE exhibit excellent performance.

Oversampling methods include ROS, SMOTE, and ADASYN, and undersampling methods include RUS, ENN, the Tomek link method, CNN, and NCL.

ROS, the random oversampling method, randomly selects data from the minority class and repeatedly samples it with replacement until the minority class reaches the same size as the majority class.
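Random oversampling as described above can be sketched in a few lines of standard-library Python. The function name `random_oversample`, its signature, and the fixed seed are assumptions made for illustration, not the embodiment's implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # choices() samples with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=target - n)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out
```

Because the added rows are exact copies of existing minority rows, no new information is created; only the class proportions change.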

SMOTE is a method that selects an arbitrary data point of the minority class and generates new synthetic data between that point and its k nearest neighbors (k-NN).
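The interpolation step of SMOTE can be illustrated as follows. This is a simplified sketch under stated assumptions: the name `smote`, the parameters `n_new`, `k`, and `seed`, and the brute-force neighbor search are illustrative choices, and at least two minority points are assumed to exist.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority points and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic
```

Each synthetic point lies on the line segment between a minority point and one of its neighbors, so the new data stays inside the region already occupied by the minority class.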

ADASYN (Adaptive Synthetic Sampling) is a SMOTE-based method that generates data while taking into account the density distribution of the minority class.

RUS is a sampling method that randomly deletes data from the majority class to match the proportion of the minority class. Like ROS, RUS has the advantage of being convenient to use, and for large-scale data it can reduce cost by reducing the number of data points; however, because data is removed at random, there is a high possibility of losing important information.
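A minimal sketch of random undersampling, assuming every class is reduced to the size of the smallest class. The function name `random_undersample` and the fixed seed are hypothetical.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))  # sampling without replacement
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]
```

The rows that are dropped are chosen without regard to their position in feature space, which is the source of the information loss noted above.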

The CNN (Condensed Nearest Neighbour) method, proposed by Hart (1968), selects an arbitrary data point from the training data and stores it in a set X, and then selects another data point to be considered for inclusion in set X.

The selected data point is classified with the NN (Nearest Neighbors) rule using the data in set X, and when it is misclassified it is stored in set X; another arbitrary data point is then selected and the NN rule is applied again.

The CNN method repeats this until all of the training data, other than the misclassified data stored in set X, is classified correctly, and it thereby leaves the data that makes the boundary between different classes clear.
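The CNN procedure described above can be sketched as repeated passes over the training data with a 1-nearest-neighbor rule. The function name `condensed_nn`, the shuffled visiting order, and the fixed seed are assumptions for illustration.

```python
import math
import random


def condensed_nn(X, y, seed=0):
    """Hart-style condensing: keep the points the current store misclassifies."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    store = [order[0]]  # set X starts with one arbitrary data point
    changed = True
    while changed:  # repeat until a full pass adds nothing to the store
        changed = False
        for i in order:
            if i in store:
                continue
            nearest = min(store, key=lambda j: math.dist(X[i], X[j]))
            if y[nearest] != y[i]:  # misclassified by the 1-NN rule
                store.append(i)
                changed = True
    store.sort()
    return [X[i] for i in store], [y[i] for i in store]
```

For two well-separated clusters the store typically ends up with very few points per class, because once one representative of each class is stored every remaining point is classified correctly.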

ENN is a variant of CNN; unlike CNN, a value included in set X can be removed from X if it is misclassified.
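In the wider literature, ENN is commonly formulated as an editing rule that removes every data point whose label disagrees with the majority label of its k nearest neighbors. The sketch below follows that common formulation, which is an assumption and may differ in detail from the embodiment; the name `edited_nn` and the default `k=3` are illustrative.

```python
import math
from collections import Counter


def edited_nn(X, y, k=3):
    """Drop points whose label differs from the majority of their k neighbors."""
    keep = []
    for i in range(len(X)):
        neighbors = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        votes = Counter(y[j] for j in neighbors)
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

With an odd k and two classes the vote cannot tie; other settings would need an explicit tie-breaking rule.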

The Tomek link method builds on the CNN sampling method and removes interior data located near the decision boundary.
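A Tomek link is usually defined as a pair of data points from different classes that are each other's nearest neighbor. The sketch below finds such pairs and removes the majority-class member of each; this removal policy and the names `tomek_links` and `remove_tomek_majority` are assumptions for illustration rather than the embodiment's implementation.

```python
import math
from collections import Counter


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, of opposite-class mutual nearest neighbors."""
    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )

    nn = [nearest(i) for i in range(len(X))]
    return [
        (i, nn[i])
        for i in range(len(X))
        if i < nn[i] and nn[nn[i]] == i and y[i] != y[nn[i]]
    ]


def remove_tomek_majority(X, y):
    """Remove the majority-class member of every Tomek link."""
    majority = Counter(y).most_common(1)[0][0]
    drop = {i for pair in tomek_links(X, y) for i in pair if y[i] == majority}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]
```

Removing only the majority-class member thins the majority class along the decision boundary while leaving the minority class untouched.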

NCL is a method that combines CNN (Condensed Nearest Neighbour) and ENN (Edited Nearest Neighbours).

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of the system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

100: apparatus for recommending a sampling method and a classification algorithm using a metadata set
110: dataset collection unit 120: feature extraction unit
130: mapping processing unit 140: metadata set generation unit
150: recommendation unit

Claims

1. An apparatus for recommending a sampling method and a classification algorithm using a metadata set, the apparatus comprising:
a dataset collection unit that collects an open dataset from an open database;
a feature extraction unit that extracts a plurality of dataset features of the collected open dataset and preprocesses the extracted plurality of dataset features;
a mapping processing unit that maps a sampling method and a classification algorithm according to the preprocessed plurality of dataset features;
a metadata set generation unit that generates a selection rule base for selecting a recommended sampling method and a recommended classification algorithm based on the mapped sampling method and the mapped classification algorithm, and generates a metadata set including the generated selection rule base and the preprocessed plurality of dataset features; and
a recommendation unit that recommends at least one of a customized sampling method and a customized classification algorithm, using the generated metadata set, for a user dataset input by a user.

2. The apparatus of claim 1, wherein the feature extraction unit extracts, from the collected open dataset, the plurality of dataset features including the number of variables, the number of instances, the number of classes, the degree of class bias, the entropy of the classes, the degree of overlap of the variables, the silhouette score, the hub score, the entropy of the variables, and the linearity and neighborliness of the dataset, and preprocesses the extracted plurality of dataset features.

3. The apparatus of claim 2, wherein the feature extraction unit divides the collected open dataset into a plurality of folds, determines the datasets included in the remaining folds, excluding one of the plurality of folds, as a plurality of training datasets, and extracts the plurality of dataset features from the determined plurality of training datasets.

4. The apparatus of claim 3, wherein, when a missing value exists in the dataset from which the plurality of dataset features are extracted and the variable containing the missing value is of a numeric type, the feature extraction unit preprocesses the extracted plurality of dataset features by handling the missing value using the mean value of the corresponding class.

5. The apparatus of claim 3, wherein, when a missing value exists in the dataset from which the plurality of dataset features are extracted and the variable containing the missing value is of a nominal type, the feature extraction unit preprocesses the extracted plurality of dataset features by handling the missing value using the mode of the corresponding class.

6. The apparatus of claim 3, wherein, when class imbalance exists in the dataset from which the plurality of dataset features are extracted, the feature extraction unit preprocesses the extracted plurality of dataset features by resolving the existing class imbalance using either one of two class imbalance resolution methods: an undersampling method that removes data of the majority class according to the existing class imbalance, and an oversampling method that replicates data of the minority class to match the majority class.

7. The apparatus of claim 2, wherein the mapping processing unit applies the preprocessed plurality of dataset features to a plurality of sampling methods, calculates a sampling method accuracy for each of the applied plurality of sampling methods, and maps the preprocessed plurality of dataset features to a sampling method according to the calculated sampling method accuracy.

8. The apparatus of claim 2, wherein the mapping processing unit applies the preprocessed plurality of dataset features to a plurality of classification algorithms, calculates a classification algorithm accuracy for each of the applied plurality of classification algorithms, and maps the preprocessed plurality of dataset features to a classification algorithm according to the calculated classification algorithm accuracy.

9. The apparatus of claim 8, wherein the mapping processing unit calculates the classification algorithm accuracy based on the characteristics and hyperparameters of each of the plurality of classification algorithms applied to the preprocessed plurality of dataset features.

10. The apparatus of claim 1, wherein the metadata set generation unit filters the preprocessed plurality of dataset features applied to the mapped sampling method and the mapped classification algorithm, performs machine learning by inputting a plurality of datasets related to the filtered plurality of dataset features to the recommended sampling method and the recommended classification algorithm, and generates, based on the machine learning, the selection rule base for selecting the recommended sampling method and the recommended classification algorithm.

11. The apparatus of claim 10, wherein the metadata set generation unit generates the metadata set including the plurality of datasets related to the filtered plurality of dataset features and the generated selection rule base.

12. The apparatus of claim 11, further comprising a metadata set storage unit that stores the generated metadata set.

13. The apparatus of claim 1, wherein the feature extraction unit extracts, from the input user dataset, a plurality of dataset features including the number of variables, the number of instances, the number of classes, the degree of class bias, the entropy of the classes, the degree of overlap of the variables, the silhouette score, the hub score, the entropy of the variables, and the linearity and neighborliness of the dataset.

14. The apparatus of claim 13, wherein the feature extraction unit divides the input user dataset into a plurality of folds, determines the datasets included in the remaining folds, excluding one of the plurality of folds, as a plurality of training datasets, and extracts the plurality of dataset features from the determined plurality of training datasets.

15. The apparatus of claim 14, wherein, when a missing value exists in the dataset from which the plurality of dataset features are extracted, the feature extraction unit preprocesses the extracted plurality of dataset features by handling the missing value using the mean value of the corresponding class if the variable containing the missing value is of a numeric type, and using the mode of the corresponding class if the variable containing the missing value is of a nominal type, and
wherein, when class imbalance exists in the dataset from which the plurality of dataset features are extracted, the feature extraction unit preprocesses the extracted plurality of dataset features by resolving the existing class imbalance using either one of two class imbalance resolution methods: an undersampling method that removes data of the majority class according to the existing class imbalance, and an oversampling method that replicates data of the minority class to match the majority class.

16. The apparatus of claim 15, wherein the recommendation unit recognizes the preprocessed plurality of dataset features of the user dataset, identifies, in the generated metadata set, a plurality of dataset features related to the recognized plurality of dataset features, and recommends at least one of the customized sampling method and the customized classification algorithm based on the identified plurality of dataset features and the generated selection rule base.

17. A method for recommending a sampling method and a classification algorithm using a metadata set, the method comprising:
collecting, by a dataset collection unit, an open dataset from an open database;
extracting, by a feature extraction unit, a plurality of dataset features of the collected open dataset, and preprocessing the extracted plurality of dataset features;
mapping, by a mapping processing unit, a sampling method and a classification algorithm according to the preprocessed plurality of dataset features;
generating, by a metadata set generation unit, a selection rule base for selecting a recommended sampling method and a recommended classification algorithm based on the mapped sampling method and the mapped classification algorithm, and generating a metadata set including the generated selection rule base and the preprocessed plurality of dataset features; and
recommending, by a recommendation unit, at least one of a customized sampling method and a customized classification algorithm, using the generated metadata set, for a user dataset input by a user.

18. The method of claim 17, wherein the extracting of the plurality of dataset features of the collected open dataset and the preprocessing of the extracted plurality of dataset features comprises:
dividing the collected open dataset into a plurality of folds, determining the datasets included in the remaining folds, excluding one of the plurality of folds, as a plurality of training datasets, and extracting, from the determined plurality of training datasets, the plurality of dataset features including the number of variables, the number of instances, the number of classes, the degree of class bias, the entropy of the classes, the degree of overlap of the variables, the silhouette score, the hub score, the entropy of the variables, and the linearity and neighborliness of the dataset in the collected open dataset;
when a missing value exists in the dataset from which the plurality of dataset features are extracted, preprocessing the extracted plurality of dataset features by handling the missing value using the mean value of the corresponding class if the variable containing the missing value is of a numeric type, and using the mode of the corresponding class if the variable containing the missing value is of a nominal type; and
when class imbalance exists in the dataset from which the plurality of dataset features are extracted, preprocessing the extracted plurality of dataset features by resolving the existing class imbalance using either one of two class imbalance resolution methods: an undersampling method that removes data of the majority class according to the existing class imbalance, and an oversampling method that replicates data of the minority class to match the majority class.

19. The method of claim 17, wherein the mapping of the sampling method and the classification algorithm according to the preprocessed plurality of dataset features comprises:
applying the preprocessed plurality of dataset features to a plurality of sampling methods, calculating a sampling method accuracy for each of the applied plurality of sampling methods, and mapping the preprocessed plurality of dataset features to a sampling method according to the calculated sampling method accuracy; and
applying the preprocessed plurality of dataset features to a plurality of classification algorithms, calculating a classification algorithm accuracy based on the characteristics and hyperparameters of each of the applied plurality of classification algorithms, and mapping the preprocessed plurality of dataset features to a classification algorithm according to the calculated classification algorithm accuracy.

20. The method of claim 17, wherein the generating of the selection rule base for selecting the recommended sampling method and the recommended classification algorithm based on the mapped sampling method and the mapped classification algorithm, and the generating of the metadata set including the generated selection rule base and the preprocessed plurality of dataset features, comprises:
filtering the preprocessed plurality of dataset features applied to the mapped sampling method and the mapped classification algorithm, performing machine learning by inputting a plurality of datasets related to the filtered plurality of dataset features to the recommended sampling method and the recommended classification algorithm, and generating, based on the machine learning, the selection rule base for selecting the recommended sampling method and the recommended classification algorithm; and
generating the metadata set including the plurality of datasets related to the filtered plurality of dataset features and the generated selection rule base.