KR20200049273A

KR20200049273A - A method and apparatus of data configuring learning data set for machine learning

Info

Publication number: KR20200049273A
Application number: KR1020180132323A
Authority: KR
Inventors: 한미란; 김근희
Original assignee: 주식회사 메디치소프트
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2020-05-08
Also published as: KR102147097B1

Abstract

The present invention relates to a method for constructing a labeled training data set for machine learning. The method comprises the steps of: calculating, by a utility module, N, the ratio (L1 / L2) of the total number (L1) of first-label data to the total number (L2) of second-label data included in the training data set, wherein L1 > L2; generating, by the utility module, a number of subsets corresponding to the calculated N; and constructing, by the utility module, each generated subset so that it includes the same number of first-label data and second-label data.

Description

{A method and apparatus of data configuring learning data set for machine learning}

The present invention relates to a method and apparatus for configuring a training data set for machine learning.

Machine learning is a way of training a computer using data, and includes supervised learning, unsupervised learning, and reinforcement learning.

Of these, supervised learning trains a computer on data for which a label, that is, an explicit correct answer, is given; in other words, it uses a labeled training data set.

After training on a training data set configured in this way, the computer can determine what result value (label) new data should have.

For a trained computer to be usable in a variety of environments, it must be trained on both correct and incorrect answers so that it can judge normal and abnormal situations accurately.

If the amounts of data for correct and incorrect answers are not comparable, that is, if one data set is far larger than the other, the computer inevitably learns in a biased way, and the judgments of a computer trained in this way are hard to trust. The training data set therefore needs to be configured appropriately so that the computer can learn without bias.
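As an illustration only (not part of the patent text), the following Python sketch shows why imbalance makes results hard to trust: a model that ignores its input and always predicts the more frequent label already scores a high accuracy. The label names and the 95/5 split are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical imbalanced set: 95 "normal" samples and 5 "abnormal" samples.
labels = ["normal"] * 95 + ["abnormal"] * 5
accuracy = majority_baseline_accuracy(labels)
print(f"always-majority accuracy: {accuracy:.2f}")
```

A classifier trained on such data can reach this score without ever recognizing the rarer label, which is the bias the description refers to.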

Registered Patent Publication No. 10-1590896, 2016.02.02.

To solve the problem described above, the present invention may provide a method of configuring a training data set for machine learning that resolves the data imbalance problem arising when the training data set is constructed.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method of configuring a training data set for machine learning according to an embodiment of the present invention for solving the above problem includes: calculating, by a utility module, N, the ratio (L1 / L2) of the total number (L1) of first-label data to the total number (L2) of second-label data included in the training data set (where L1 > L2); generating, by the utility module, a number of subsets corresponding to the calculated N; and configuring, by the utility module, each generated subset to include the same number of first-label data and second-label data.

In addition, other methods and other systems for implementing the present invention, and a computer-readable recording medium storing a computer program for executing the method, may be further provided.

A data classification program for machine-learning training data according to another embodiment of the present invention is combined with a computer, which is hardware, to execute the above-mentioned method of configuring a training data set for machine learning, and is stored in a medium.

According to the present invention as described above, the data imbalance problem in the training data set is resolved and the computer can be trained in a balanced way on that training data set, so that the computer can judge more accurately when new data is input.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a flowchart of the steps generally performed in machine learning.
FIG. 2 is a flowchart of a method of configuring a training data set for machine learning according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating training data received from a preprocessing module according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating training data in which the ratio of first-label data to second-label data has been adjusted according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating three generated subsets according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating generated 5-fold validation sets.
FIG. 7 is a diagram illustrating generated 3-fold validation sets for cross-validation according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the training process performed in machine learning.
FIG. 9 is a diagram illustrating a prediction step performed by applying a learning model generated through an embodiment of the present invention.
FIG. 10 is a block diagram of an apparatus for configuring a training data set for machine learning according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in a variety of different forms; the embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the art to which the invention pertains of its scope, and the present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the context specifically states otherwise. As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components other than those mentioned. Throughout the specification, the same reference numerals refer to the same components, and "and/or" includes each of the mentioned components and every combination of one or more of them. Although "first", "second", and so on are used to describe various components, these components are of course not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may of course be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly and specifically defined.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Before the description, the meaning of terms used in this specification is briefly explained. Note, however, that these explanations are intended to aid understanding of the specification and are not used to limit the technical spirit of the invention unless explicitly stated as limiting the invention.

Classifier module 150: determines which category the data belongs to.

Feature selection module 140: selects and extracts features from the input and collected data.

FIG. 1 is a flowchart of the steps generally performed in machine learning.

Referring to FIG. 1, the steps generally performed in machine learning include an available-data check and problem definition step, a data exploration and preprocessing step, a learning algorithm selection step, and a model validation and conclusion step.

First, the available data is checked, and collaborative interviews may be conducted. The hypothesis is then made concrete and the direction for the work ahead is established. A DB schema and design documents may be needed as advance preparation.

Next, the direction established in the previous step is reviewed.

The actual data is examined to identify the variables for each scale of measurement, and after the data has been explored, preprocessing is performed through scaling, handling of outliers and missing values, and the like.

Next comes the learning algorithm selection step, in which the direction reviewed in the previous step is verified.

The data is labeled and features are extracted, an algorithm is selected, and a single-model validation technique is applied.

Next, in the model validation and conclusion step, the performance of the overall model is validated and the final model is selected. Conclusions are then reached by deriving solutions and points of agreement.

The steps described above are the steps generally performed in machine learning. The embodiments of the present invention described below correspond to the learning algorithm selection step, and additionally address the model validation and conclusion step.

FIG. 2 is a flowchart of a method of configuring a training data set for machine learning according to an embodiment of the present invention, FIG. 3 is a diagram illustrating training data received from the preprocessing module 130 according to an embodiment of the present invention, FIG. 4 is a diagram illustrating training data in which the ratio of first-label data to second-label data has been adjusted according to an embodiment of the present invention, and FIG. 5 is a diagram illustrating three generated subsets according to an embodiment of the present invention.

With reference to FIGS. 2 to 5, the method of configuring a training data set for machine learning according to an embodiment of the present invention (subset configuration) is described.

It is assumed here that, as described with reference to FIG. 1, preprocessed training data is received from the preprocessing module 130.

As an example, FIG. 3 shows training data received from the preprocessing module 130.

In describing the embodiments of the present invention, a label means that the data has been labeled. For example, the first label may denote an explicit correct answer and the second label an explicit incorrect answer, so that machine learning proceeds on correct and incorrect answers; however, the first and second labels may also correspond to other feature values of the data.

First, the utility module 110 calculates N, the ratio (L1 / L2) of the total number (L1) of first-label data to the total number (L2) of second-label data included in the training data set, where L1 > L2 (step S510).

The conventional approach was to perform machine learning by training directly on the training data received from the preprocessing module 130, as in FIG. 3.

Machine learning itself could be performed this way, but because the ratio of first-label data to second-label data was not balanced, the data imbalance caused the model to be trained with a bias.

In other words, a classifier trained on data with imbalanced label values predicts biased values when applied in the field, which becomes a major cause of errors and accidents.

Here L1 > L2 is presupposed, and the ratio of the total number of first-label data to the total number of second-label data is calculated in order to obtain the natural number N.
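Step S510 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `subset_count`, the label names, and the 9/3 example counts are hypothetical, and the sketch covers only the case where L1 / L2 is already a natural number.

```python
from collections import Counter


def subset_count(labels, first_label, second_label):
    """Return N = L1 / L2 for the case where L2 divides L1 exactly."""
    counts = Counter(labels)
    l1, l2 = counts[first_label], counts[second_label]
    if l2 == 0 or l1 <= l2:
        raise ValueError("the method presupposes L1 > L2 > 0")
    if l1 % l2 != 0:
        raise ValueError("L1 / L2 is not a natural number; rounding is needed")
    return l1 // l2


# Hypothetical data: 9 first-label items and 3 second-label items.
labels = ["correct"] * 9 + ["incorrect"] * 3
n = subset_count(labels, "correct", "incorrect")
print(n)
```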

In some embodiments, the training data set may include n types of label data rather than two, from L1 first-label data through Ln n-th-label data (where L1 > L2 > ... > Ln); here the first-label data is the most numerous and the second-label data is the least numerous.

After step S510, the utility module 110 generates a number of subsets corresponding to N calculated in step S510 (step S520).

As shown in FIG. 5, a total of three subsets are generated: a first subset, a second subset, and a third subset.

The number of subsets is obtained by applying L1 / L2 = N, which gives (3 + 2 + 5) / (1 + 1 + 1) = 3.

After step S520, the utility module 110 configures each generated subset to include the same number of first-label data and second-label data (step S530).

As shown in FIG. 4, compared with FIG. 3, one first-label datum has been added to case 2 and one first-label datum has been removed from case 3, so that the overall ratio of first-label data to second-label data is adjusted to 3:1.

The utility module 110 then sorts each case's sets of first-label data and second-label data into the generated subsets, as in FIG. 5. Here, a set of first-label data and second-label data means first-label data and second-label data paired together, as in FIG. 5.

The utility module 110 includes L1 / N sets of first-label data and second-label data in each subset.

More specifically, the utility module 110 places L1 / N sets of first-label data and second-label data in each of the first through N-th subsets, and copies the second-label data of the first subset into any subset that is short of second-label data.

That is, the utility module 110 divides the first-label data included in the training data set into N equal parts, places one part in each subset, and copies the second-label data included in the training data set so that every subset contains all of the second-label data.
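The split-and-copy arrangement of steps S520 and S530 can be sketched as below. This is a minimal sketch, assuming L1 is an exact multiple of L2; `build_subsets` and the item names are hypothetical and do not come from the patent.

```python
def build_subsets(first_label_data, second_label_data):
    """Split the first-label data into N equal parts and copy all
    second-label data into every subset."""
    l1, l2 = len(first_label_data), len(second_label_data)
    if l2 == 0 or l1 <= l2 or l1 % l2 != 0:
        raise ValueError("this sketch assumes L1 > L2 > 0 and L1 divisible by L2")
    n = l1 // l2          # number of subsets (step S520)
    size = l1 // n        # L1 / N first-label items per subset (equals L2 here)
    subsets = []
    for i in range(n):
        first_part = first_label_data[i * size:(i + 1) * size]
        second_copy = list(second_label_data)  # a copy, not a shared list
        subsets.append({"first": first_part, "second": second_copy})
    return subsets


# Hypothetical data: 9 first-label items and 3 second-label items.
first = [f"a{i}" for i in range(9)]
second = ["b0", "b1", "b2"]
subsets = build_subsets(first, second)
for subset in subsets:
    print(subset)
```

Each first-label item appears in exactly one subset, while every second-label item is repeated in all N subsets, so each subset is balanced.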

In some embodiments, where the training data set includes n types of label data rather than two, from L1 first-label data through Ln n-th-label data (where L1 > L2 > ... > Ln), with the first-label data the most numerous and the second-label data the least numerous, the utility module 110 may configure each generated subset to include the same number of each of the first-label through n-th-label data. To this end, at least part of the second-label through n-th-label data may be copied.
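One possible reading of the n-label case is sketched below. The source does not say how labels of intermediate size are distributed, so the round-robin split and the cyclic copying used here are assumptions, as are the function name `build_multiclass_subsets` and the example data. N is taken as the largest count divided by the smallest, rounded up.

```python
import math
from itertools import cycle, islice


def build_multiclass_subsets(data_by_label):
    """Give every subset the same number of items for each label,
    copying items of the smaller labels where needed (assumed scheme)."""
    sizes = [len(items) for items in data_by_label.values()]
    if min(sizes) == 0:
        raise ValueError("every label needs at least one item")
    n = math.ceil(max(sizes) / min(sizes))   # number of subsets
    target = math.ceil(max(sizes) / n)       # items per label per subset
    subsets = [dict() for _ in range(n)]
    for label, items in data_by_label.items():
        for i in range(n):
            part = items[i::n]  # round-robin share of the original items
            # Pad with copies until this subset holds `target` items.
            padding = islice(cycle(items), target - len(part))
            subsets[i][label] = part + list(padding)
    return subsets


# Hypothetical three-label data set with 12, 6, and 3 items.
data = {
    "x": [f"x{i}" for i in range(12)],
    "y": [f"y{i}" for i in range(6)],
    "z": [f"z{i}" for i in range(3)],
}
multi_subsets = build_multiclass_subsets(data)
for subset in multi_subsets:
    print({label: len(items) for label, items in subset.items()})
```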

In some embodiments, in the step of calculating N (S510), the utility module 110 rounds N up when it is not a natural number, and in the configuration step (S530), the utility module 110 may copy first-label data of the first subset into any subset that is short of first-label data.

For example, if the training data set received from the preprocessing module 130 contains 13 first-label data (L1) and 3 second-label data (L2), N is calculated as 4.333. In this case N is not set to 4 but is rounded up to 5.

When N is set to 5 and five subsets are generated, sorting and copying the sets of first-label data and second-label data may leave some subsets with gaps in their first-label data. The first-label data of the first subset is therefore copied so that the ratio of first-label data to second-label data is maintained.
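The 13-to-3 example can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and giving each subset L2 first-label items (so that the two labels stay equal in number) is one interpretation of the description rather than a detail stated in the source.

```python
import math


def build_subsets_with_rounding(first_label_data, second_label_data):
    """Round N up and fill any short subset with first-label data
    copied from the first subset."""
    l1, l2 = len(first_label_data), len(second_label_data)
    if l2 == 0 or l1 <= l2:
        raise ValueError("the method presupposes L1 > L2 > 0")
    n = math.ceil(l1 / l2)  # 13 / 3 = 4.333... is rounded up to 5
    size = l2               # assumed: first-label items per subset
    chunks = [first_label_data[i * size:(i + 1) * size] for i in range(n)]
    first_chunk = chunks[0]
    subsets = []
    for chunk in chunks:
        shortfall = size - len(chunk)
        padded = chunk + first_chunk[:shortfall]  # copy from the first subset
        subsets.append({"first": padded, "second": list(second_label_data)})
    return subsets


first = [f"a{i}" for i in range(13)]
second = ["b0", "b1", "b2"]
rounded_subsets = build_subsets_with_rounding(first, second)
for subset in rounded_subsets:
    print(subset["first"], subset["second"])
```

With these inputs the fifth subset receives only one original first-label item, so two items are copied into it from the first subset.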

Steps S510, S520, and S530 above complete the configuration of the subsets.

The steps for performing cross-validation are described below.

FIG. 2 is a flowchart of a method of configuring a training data set for machine learning according to an embodiment of the present invention, FIG. 6 is a diagram illustrating generated 5-fold validation sets, and FIG. 7 is a diagram illustrating generated 3-fold validation sets for cross-validation according to an embodiment of the present invention.

When training data is split into training data and validation data, the validation results vary somewhat depending on how the split is made. For this reason, the data is split in several ways, training and validation are performed on each split, and the model's performance is judged from the mean and variance of the validation results.
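As a small illustration of judging a model from the mean and variance of its validation results, the sketch below uses Python's standard `statistics` module. The per-split accuracies are made-up numbers, and the use of the population variance is an assumption, since the source does not say which variance is meant.

```python
from statistics import mean, pvariance


def summarize_scores(scores):
    """Return the mean and population variance of per-split scores."""
    return mean(scores), pvariance(scores)


# Hypothetical validation accuracies from three different splits.
fold_accuracies = [0.90, 0.85, 0.95]
avg, var = summarize_scores(fold_accuracies)
print(f"mean={avg:.3f} variance={var:.5f}")
```

A high mean with a low variance suggests the result does not depend strongly on how the data happened to be split.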

Cross-validation is one method of splitting the data and performing training and validation several times. FIG. 6 illustrates the validation data for k-fold cross-validation with five iterations.

In an embodiment of the present invention, as shown in FIG. 7, the pipeline module 120 first generates N validation sets and, in each validation set, arranges N training data groups formed by combining the first-label data and second-label data of the subsets in equal proportion (step S540).

Here, N is the value N (= L1 / L2) calculated in step S510.

After step S540, the pipeline module 120 places validation data in one of the N training data groups of each validation set, using a different position in each validation set so that the positions of the validation data do not coincide across the N validation sets (step S550).
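Steps S540 and S550 resemble the usual k-fold layout with k = N, and can be sketched as below. The function name, the dictionary keys, and the toy subsets are hypothetical; each subset here stands for one balanced training data group.

```python
def build_validation_sets(subsets):
    """Build N validation sets from N subsets; set i uses position i
    for validation and the remaining positions for training."""
    n = len(subsets)
    validation_sets = []
    for i in range(n):
        validation_sets.append({
            "validation_index": i,
            "validation": subsets[i],
            "training": [subsets[j] for j in range(n) if j != i],
        })
    return validation_sets


# Hypothetical balanced subsets (one first-label and one second-label item each).
toy_subsets = [["a0", "b0"], ["a1", "b1"], ["a2", "b2"]]
validation_sets = build_validation_sets(toy_subsets)
for vs in validation_sets:
    print(vs["validation_index"], vs["validation"], vs["training"])
```

Because the validation position differs in every set, each subset is used for validation exactly once.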

Because the cross-validation style of arranging validation data is applied to validation sets built from the subsets generated in the subset configuration steps, more reliable validation can be performed.

After step S550, the pipeline module 120 performs N rounds of cross-validation using the N generated validation sets (step S560).

Through step S560, the pipeline module 120 performs the actual cross-validation using the validation sets generated in step S550.

After step S560, the pipeline module 120 checks the results of the cross-validation of step S560 and extracts error values (step S570).

Errors and error values are extracted through this validation process so that, in later steps, tuning and correction can take them into account and the optimal validation set can be selected.

After step S570, the pipeline module 120 tunes each validation set using the error values extracted from the cross-validation results of step S560, and stores the validation set that satisfies the optimal condition among the N validation sets (step S580).

Here, the validation set that satisfies the optimal condition means one or more validation sets, among those on which validation was performed, judged to have the best validation results.
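The flow of steps S560 to S580 can be sketched as below: run one round of validation per set, collect the error values, and keep the set with the lowest error. Everything here is illustrative: the threshold learner, the data, and the function names are hypothetical stand-ins, the tuning step is omitted, and "lowest error" is assumed as the optimal condition.

```python
def train_threshold(items):
    """Toy learner: a threshold halfway between the two class means."""
    pos = [x for x, y in items if y == 1]
    neg = [x for x, y in items if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def error_rate(threshold, items):
    """Fraction of items the threshold rule classifies incorrectly."""
    wrong = sum(1 for x, y in items if (x > threshold) != (y == 1))
    return wrong / len(items)


def cross_validate(validation_sets):
    """Return one error value per validation set (steps S560 and S570)."""
    errors = []
    for vs in validation_sets:
        training_items = [item for group in vs["training"] for item in group]
        model = train_threshold(training_items)
        errors.append(error_rate(model, vs["validation"]))
    return errors


def select_best(validation_sets, errors):
    """Pick the validation set with the lowest error (step S580)."""
    best = min(range(len(errors)), key=errors.__getitem__)
    return validation_sets[best], errors[best]


# Hypothetical balanced groups of (feature, label) pairs.
groups = [
    [(0.9, 1), (0.1, 0)],
    [(0.8, 1), (0.2, 0)],
    [(0.4, 1), (0.3, 0)],
]
cv_sets = [
    {"validation": groups[i],
     "training": [g for j, g in enumerate(groups) if j != i]}
    for i in range(len(groups))
]
errors = cross_validate(cv_sets)
best_set, best_error = select_best(cv_sets, errors)
print(errors, best_error)
```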

FIG. 8 is a diagram illustrating the training process performed in machine learning.

FIG. 8 summarizes what was described above with reference to FIGS. 2 to 7: the training process proceeds with the preprocessing module 130 performing preprocessing, the feature selection module 140 selecting features and storing a feature selection model, and the classifier module 150 training a classifier and storing a classifier model.

Specifically, the preprocessing module 130 performs and validates preprocessing and stores a preprocessing model, the feature selection module 140 selects and validates features and stores a feature selection model, and the classifier module 150 trains and validates a classifier and stores a classifier model, after which evaluation is performed.

By performing these processes, the data classification device 10 according to an embodiment of the present invention repeats the training process to find suitable hyperparameters: it repeats training while varying the classifier's hyperparameters and selects the parameters with good evaluation results.

Here, a hyperparameter means a parameter that determines the characteristics of the machine learning model.
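The repeat-and-select loop described above can be sketched as a simple search over candidate values. This is a hedged illustration: `grid_search` is a hypothetical helper, the data is made up, and the decision threshold of a toy classifier merely stands in for whatever hyperparameters the real classifier exposes.

```python
def grid_search(candidates, evaluate):
    """Evaluate every candidate value and return the one with the lowest error."""
    results = {value: evaluate(value) for value in candidates}
    best = min(results, key=results.get)
    return best, results


# Hypothetical labeled evaluation data: (feature, label) pairs.
eval_data = [(0.1, 0), (0.3, 0), (0.45, 0), (0.6, 1), (0.8, 1), (0.9, 1)]


def error_for_threshold(threshold):
    """Error of a toy classifier that predicts 1 when feature > threshold."""
    wrong = sum(1 for x, y in eval_data if (x > threshold) != (y == 1))
    return wrong / len(eval_data)


best_threshold, results = grid_search([0.2, 0.5, 0.7], error_for_threshold)
print(best_threshold, results)
```

In practice `evaluate` would retrain the classifier with the candidate value and return its cross-validation error rather than score a fixed rule.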

FIG. 9 is a diagram illustrating a prediction step performed by applying a learning model generated through an embodiment of the present invention.

Referring to FIG. 9, the prediction step is performed in a manner similar to the validation step, using the model generated and trained through the configurations according to an embodiment of the present invention. Because the new input data has no labels, evaluation cannot be performed.

도 10은 본 발명의 실시예에 머신러닝을 위한 학습데이터 세트의 구성 장치(10)의 블록도이다.10 is a block diagram of a configuration device 10 of a learning data set for machine learning according to an embodiment of the present invention.

본 발명의 실시예에 따른 머신러닝을 위한 학습데이터 세트의 구성 장치(10)는 유틸리티 모듈(110), 검증 모듈, 전처리 모듈(130), 특징선택 모듈(140) 및 분류기 모듈(150)을 포함한다.The apparatus 10 for constructing a learning data set for machine learning according to an embodiment of the present invention includes a utility module 110, a verification module, a pre-processing module 130, a feature selection module 140, and a classifier module 150 do.

다만, 몇몇 실시예에서 데이터 분류 장치(10)는 도 10에 도시된 구성요소보다 더 적은 수의 구성요소나 더 많은 구성요소를 포함할 수도 있다.However, in some embodiments, the data classification device 10 may include fewer components or more components than those illustrated in FIG. 10.

유틸리티 모듈(110)은 상기 학습데이터 세트에 포함된 제1라벨 데이터의 총 개수(L1)와 제2라벨 데이터의 총 개수(L2)의 비율(L1 / L2) N을 산출하고, 상기 산출된 N에 대응되는 개수의 서브 세트를 생성하며, 상기 생성된 각 서브 세트에 상기 제1라벨 데이터와 제2라벨 데이터의 세트를 동일한 개수와 비율로 구성한다.The utility module 110 calculates the ratio (L1 / L2) N of the total number (L1) of the first label data and the total number (L2) of the second label data included in the learning data set, and the calculated N The number of subsets corresponding to is generated, and the set of the first label data and the second label data are configured in the same number and ratio in each of the generated subsets.

또한, 유틸리티 모듈(110)은 복사하는 과정에서 각 서브 세트에 L1 / N 개의 제1라벨 데이터와 제2라벨 데이터의 세트를 분류하는 것을 특징으로 한다.In addition, the utility module 110 is characterized by classifying a set of L1 / N first label data and second label data into each subset during the copying process.

보다 상세하게는, 유틸리티 모듈(110)은 1서브 세트로부터 N서브 세트까지의 L1 / N 개의 제1라벨 데이터와 제2라벨 데이터의 세트를 분리하여 포함시키되, 제2라벨 데이터가 미달되는 서브 세트에 1서브 세트의 제2라벨 데이터를 복사하는 것을 특징으로 한다.More specifically, the utility module 110 separately includes a set of L1 / N first label data and second label data from one sub-set to N sub-sets, but the second label data is less than a subset It is characterized in that the second label data of one sub-set is copied.

When calculating N, if N is not a natural number, the utility module 110 rounds N up to the next integer. During the sorting and copying process, it copies first label data of the first subset into any subset that falls short of first label data.
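The subset construction described above can be sketched in Python as follows. This is only an illustrative reading of the description, not the applicant's implementation: the function name `build_balanced_subsets` and the dictionary keys are hypothetical, and the sketch assumes that each subset receives L2 items of first label data (which equals L1 / N when the division is exact) together with a copy of all second label data.

```python
import math


def build_balanced_subsets(first, second):
    """Split first-label items into N subsets, each paired with all second-label items.

    Hypothetical helper: `first` holds the L1 majority-label items and
    `second` holds the L2 minority-label items, with L1 > L2 > 0.
    """
    l1, l2 = len(first), len(second)
    if l2 == 0 or l1 <= l2:
        raise ValueError("expected L1 > L2 > 0")

    # N = L1 / L2, rounded up when the ratio is not a natural number.
    n = math.ceil(l1 / l2)

    subsets = []
    for i in range(n):
        # Take the next block of up to L2 first-label items.
        chunk = list(first[i * l2:(i + 1) * l2])
        # A short block (only possible for the last subset) is padded with
        # first-label items copied from the first subset.
        shortfall = l2 - len(chunk)
        if shortfall > 0:
            chunk += subsets[0]["first"][:shortfall]
        # Every subset receives a copy of all second-label items.
        subsets.append({"first": chunk, "second": list(second)})
    return subsets


# Example: L1 = 10, L2 = 3, so N = ceil(10 / 3) = 4 subsets of 3 + 3 items.
subsets = build_balanced_subsets(list(range(10)), ["a", "b", "c"])
```

In this example the fourth subset contains only one original first-label item (9), so two items (0 and 1) are copied from the first subset to keep every subset balanced.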

The pipeline module 120 generates N verification sets and places N learning group data in each verification set, each learning group data being formed by combining the first label data and the second label data of a subset in equal proportion.

The pipeline module 120 also places verification data in one of the N learning group data of each verification set, positioning the verification data differently in different verification sets so that the positions of the verification data placed in the N verification sets do not overlap.
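One way to read this arrangement is as the rotation used in ordinary N-fold cross-validation: in the k-th verification set, the k-th learning group serves as verification data and the remaining groups serve as training data. The Python sketch below illustrates that reading only; the function name `build_verification_sets` and the dictionary keys are hypothetical and do not appear in the specification.

```python
def build_verification_sets(groups):
    """Return N arrangements of the N learning groups.

    In arrangement k, group k is held out as verification data and the other
    groups are used for training, so no two arrangements hold out the same
    position.
    """
    verification_sets = []
    for k in range(len(groups)):
        verification_sets.append({
            "verification": groups[k],
            "training": [g for i, g in enumerate(groups) if i != k],
        })
    return verification_sets


# Example with three placeholder learning groups.
groups = [["g0"], ["g1"], ["g2"]]
verification_sets = build_verification_sets(groups)
```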

In one embodiment, the pipeline module 120 performs N rounds of cross-validation using the generated N verification sets. The pipeline module 120 then checks the results verified through the cross-validation and extracts an error value.

The pipeline module 120 then tunes each verification set using the extracted error value and stores the verification set that satisfies an optimal condition among the N verification sets.
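The cross-validation and selection steps can be sketched as below. The specification does not define the classifier, the error measure, the tuning procedure, or the "optimal condition", so everything here is an assumption made for illustration: a toy majority-label predictor stands in for the classifier, the error value is the misclassification rate, and the round with the lowest error is treated as optimal. All function names are hypothetical.

```python
from collections import Counter


def majority_label(training_rows):
    """Toy stand-in for a classifier: predict the most common training label."""
    return Counter(label for _, label in training_rows).most_common(1)[0][0]


def error_rate(predicted_label, verification_rows):
    """Fraction of verification rows whose label differs from the prediction."""
    wrong = sum(1 for _, label in verification_rows if label != predicted_label)
    return wrong / len(verification_rows)


def cross_validate(groups):
    """Run N rounds, holding out group k in round k; return one error per round."""
    errors = []
    for k, held_out in enumerate(groups):
        training = [row for i, g in enumerate(groups) if i != k for row in g]
        errors.append(error_rate(majority_label(training), held_out))
    return errors


# Each row is a (feature, label) pair; the values are placeholders.
groups = [
    [("x0", 1), ("x1", 1)],
    [("x2", 1), ("x3", 0)],
    [("x4", 1), ("x5", 1)],
]
errors = cross_validate(groups)

# Assumed selection rule: keep the round with the lowest error value.
best_round = min(range(len(errors)), key=errors.__getitem__)
```

With this toy data the error values are 0.0, 0.5, and 0.0, and the first round is selected because `min` returns the earliest of tied candidates.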

Because the remaining components differ from the method of configuring a learning data set for machine learning described above with reference to FIGS. 1 to 9 only in the category of the invention, redundant descriptions are omitted.

The method of configuring a learning data set for machine learning according to the embodiment of the present invention described above may be implemented as a learning-data program (or application) for machine learning and stored in a medium so as to be executed in combination with a server, which is hardware.

The program may include code written in a computer language such as C, C++, JAVA, or machine language that the processor (CPU) of the computer can read through the device interface of the computer, so that the computer can read the program and execute the methods implemented as the program. Such code may include functional code related to functions that define the operations needed to execute the methods, and control code related to the execution procedure needed for the processor of the computer to execute those operations in a predetermined order. The code may further include memory-reference code indicating the location (address) in the internal or external memory of the computer at which additional information or media needed for the processor to execute the operations should be referenced. In addition, when the processor of the computer needs to communicate with another remote computer or server to execute the operations, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer, and what information or media to transmit and receive during communication.

The storage medium refers to a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data for a short moment, such as a register, cache, or memory. Specific examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices. That is, the program may be stored in various recording media on various servers that the computer can access, or in various recording media on the user's computer. The medium may also be distributed across computer systems connected by a network, so that computer-readable code is stored in a distributed manner.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. The software module may reside in random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the technical field to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

10: configuration device
110: utility module
120: pipeline module
130: pre-processing module
140: feature selection module
150: classifier module

Claims

A method of configuring a labeled learning data set for machine learning, the method comprising:
calculating, by a utility module, a ratio N (= L1 / L2) of the total number (L1) of first label data to the total number (L2) of second label data included in the learning data set, where L1 > L2;
generating, by the utility module, a number of subsets corresponding to the calculated N; and
configuring, by the utility module, each of the generated subsets to include the same number of the first label data and the second label data.

The method of claim 1,
wherein, in the configuring step,
the utility module divides the first label data included in the learning data set into N equal parts and includes one part in each subset.

The method of claim 2,
wherein, in the configuring step,
the utility module copies the second label data included in the learning data set so that all of the second label data included in the learning data set is included in each subset.

The method of claim 1,
wherein the learning data set includes L1 items of first label data through Ln items of n-th label data (L1 > L2 > ... > Ln), and
wherein, in the configuring step,
the utility module configures each of the generated subsets to include the same number of the first label data through the n-th label data.

The method of claim 1, further comprising:
generating, by a pipeline module, N verification sets, and placing N learning group data in each verification set, each learning group data being formed by combining the first label data and the second label data of a subset in equal proportion; and
placing, by the pipeline module, verification data in one of the N learning group data of each verification set, the verification data being positioned differently in different verification sets so that the positions of the verification data placed in the N verification sets do not overlap.

The method of claim 5, further comprising:
performing, by the pipeline module, N rounds of cross-validation using the generated N verification sets; and
extracting, by the pipeline module, an error value by checking the results verified through the cross-validation.

The method of claim 6,
further comprising performing tuning for each verification set using the extracted error value, and storing a verification set that satisfies an optimal condition among the N verification sets.

A device for configuring a labeled learning data set for machine learning, the device comprising:
a utility module configured to calculate a ratio N (= L1 / L2) of the total number (L1) of first label data to the total number (L2) of second label data included in the learning data set, where L1 > L2, to generate a number of subsets corresponding to the calculated N, and to configure each of the generated subsets to include the same number of the first label data and the second label data.

The device of claim 8,
wherein the utility module
divides the first label data included in the learning data set into N equal parts and includes one part in each subset.

The device of claim 9,
wherein the utility module
copies the second label data included in the learning data set so that all of the second label data included in the learning data set is included in each subset.