KR20210060978A

KR20210060978A - Training Data Quality Assessment Technique for Machine Learning-based Software

Info

Publication number: KR20210060978A
Application number: KR1020190148654A
Authority: KR
Inventors: 홍장의; 김문현; 권용균
Original assignee: 충북대학교 산학협력단
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2021-05-27
Also published as: KR102303111B1

Abstract

Disclosed is a method for evaluating the quality of training data for machine learning-based software. The method includes the steps of: extracting data characteristics for each data evaluation criterion from input data to be used as training data; evaluating quality factors based on the extracted characteristics; and deriving the traceability quality of the data by synthesizing the evaluation results of each quality factor. By raising the quality of the training data used for machine learning, the invention can improve the effectiveness of learning.

Description

Training Data Quality Assessment Technique for Machine Learning-based Software

The present invention relates to a method for evaluating the quality of training data for machine learning-based software.

Recently, as research on artificial intelligence has become active in computer science, many algorithms for machine learning techniques that mimic the human learning process have been developed, and a growing body of software adopts them. Through machine learning, software extracts features from past and present data, generalizes from them, and uses the result to predict future data.

Many things matter in learning, but one of the most important questions is what material is learned from.

Learning from good, correct information yields better efficiency and better results than learning from poor information. This is not limited to people: in machine learning, too, what matters is which data is learned from. Several methods for constructing (or generating) training data have been introduced, but no criterion or method for evaluating the quality of the constructed data has been presented.

1. Korean Patent Registration No. 10-2005628 (2019.07.24)
2. Korean Patent Application Publication No. 10-2019-0044814 (2019.05.02)

An object of the present invention is to provide a method for evaluating the quality of training data for machine learning-based software that enables the training data to be evaluated effectively through new quality evaluation measures and methods.

To achieve the above object, the method for evaluating the quality of training data for machine learning-based software according to the present invention comprises: extracting data characteristics for each data evaluation criterion from input data to be used as training data; evaluating quality factors based on the extracted characteristics; and deriving the traceability quality of the data by synthesizing the evaluation results of each quality factor.

The step of extracting data characteristics for each data evaluation criterion comprises extracting data coverage-related characteristics, data distribution-related characteristics, data completeness-related characteristics, and data redundancy-related characteristics.

The step of evaluating quality factors based on the extracted characteristics comprises evaluating data coverage, data distribution, data completeness, and data redundancy based on the extracted characteristics.

The step of deriving the traceability quality of the data comprises: training an intelligent software system with the training data set; checking whether the training result satisfies the criteria; extracting input data-related characteristics; evaluating data traceability; determining whether traceability exists; and, if traceability exists, reporting the evaluation result and reconstructing the training data.

Data coverage is a measure of which types of data are present for the object to be learned, and the measure for calculating training data coverage is defined as in Equation (1) below.

Equation (1)

Data distribution is a measure for checking whether the training data follows a normal distribution, and the measure representing the distribution of the data is defined as in Equation (2) below.

Equation (2)

Data completeness is a measure of whether the training data set includes all attributes of the object to be learned, and the calculation of data completeness is defined as in Equation (3).

Equation (3)

Data redundancy is a measure of how much duplicated data the training data set contains. It can be determined through data similarity and calculated by Equation (4) below.

Equation (4)

(In Equation (4), n is the number of training data items belonging to the same type, and m is the number of types. The similarity Sim(d_j, d_ji) is the sum of the similarities between a data item d_j belonging to type j and all other data items in that type.)

Traceability is a measure used as a data quality evaluation criterion after machine learning has been performed. It can be expressed by Equation (5) below and is evaluated as a binary value indicating whether traceability exists.

Equation (5)

(If input data I_p having attribute p is mapped by function α to an element of the training data set D_L having a similar attribute, traceability is evaluated as 1; otherwise it is evaluated as 0. When the traceability value is 0, the learning model needs to be modified.)

The effects obtainable through the present invention are as follows.

According to the present invention, quality criteria are presented for training data, which had not previously been considered, so that a new technique can be applied in the field.

According to the present invention, only the necessary data can be selected and collected by taking the presented quality evaluation criteria into account while collecting and generating training data for an application that uses machine learning.

According to the present invention, raising the quality of the training data used for machine learning improves the effectiveness of learning and ultimately contributes to better application performance.

FIG. 1 is a diagram showing the training data quality evaluation process according to the present invention.
FIG. 2 is a diagram illustrating the overall process of FIG. 1 divided into detailed steps.
FIG. 3 is a diagram illustrating the process of deriving the traceability quality of data based on the results of training machine learning-based software with training data of satisfactory quality according to the present invention.

Hereinafter, a method for evaluating the quality of training data for machine learning-based software according to the present invention will be described in detail with reference to the accompanying drawings.

What the present invention proposes is a set of criteria for evaluating the quality of the training data used to train machine learning-based software. Using these criteria helps assure the quality of that training data and ultimately improves the performance of the software.

FIG. 1 is a diagram showing the training data quality evaluation process according to the present invention.

FIG. 1 shows the overall process of evaluating data quality. Data to be used as training data is taken as input, and the process consists of two stages: extracting data characteristics for each data evaluation criterion proposed by the present invention from the input data, and evaluating quality factors based on the extracted characteristics. Synthesizing the evaluation results of each quality factor yields measures by which the quality of the data can be assessed. A user who has evaluated the training data can decide, based on these results, whether to use the training data as is or to reconstruct it.

FIG. 2 is a diagram illustrating the overall process of FIG. 1 divided into detailed steps, and FIG. 3 is a diagram illustrating the process of deriving the traceability quality of data based on the results of training machine learning-based software with training data of satisfactory quality according to the present invention.

FIG. 2 specifies the four data quality evaluation criteria that the present invention proposes as pre-training criteria. To evaluate each data quality item, characteristics are extracted from the training data and a numerical measure is calculated from them. The user judges whether the measures calculated for the four items satisfy the criteria and decides whether to reconstruct the training data. If the values are satisfactory, the machine learning-based software is trained with that data set.

As shown in FIGS. 2 and 3, the present invention proposes five measures as training data quality evaluation criteria: (1) data coverage, (2) data distribution, (3) data completeness, (4) data redundancy, and (5) data traceability. Of these, data traceability serves as the post-training data quality criterion, and the other four serve as pre-training data quality criteria.

Here, "pre" and "post" refer to whether the evaluation takes place before or after machine learning is performed. Each data evaluation criterion is described below.

1. Data coverage

Data coverage is a measure of which types of data are present for the object to be learned.

To calculate this measure, the data types must be defined in advance. In general, data can be broadly classified into valid data and invalid data. This patent subdivides these further: valid data is divided into original data, similar data, and transformed data, and invalid data is divided into distorted data and error data (adversarial data). Types may be added to or removed from this classification depending on the application domain of the intelligent software. In every case, however, the types of training data must be defined in advance.

Data coverage requires that at least one data item exist for each of these subdivided training data types. If each of the five types presented contains at least one data item, the constructed data set covers all data types. The measure for calculating training data coverage is defined as in Equation (1).

Equation (1)

As defined in Equation (1), data coverage is the number of types for which at least one data item exists divided by the total number of types, expressed as a percentage. For example, if an image classifier has five predefined data types and the prepared training data contains three of them, the data coverage is 60%.
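The ratio described above can be sketched in Python as follows. This is a minimal illustration, assuming Equation (1) is the percentage of predefined types that have at least one training item; the function name `data_coverage` and the type label strings are hypothetical and do not appear in the patent.

```python
def data_coverage(defined_types, training_labels):
    """Percentage of predefined data types that have at least one training item.

    defined_types:   iterable of type names fixed in advance (assumed labels).
    training_labels: iterable with the type label of every training item.
    """
    defined = set(defined_types)
    if not defined:
        raise ValueError("the data types must be defined in advance")
    # Only labels that belong to the predefined classification are counted.
    present = set(training_labels) & defined
    return 100.0 * len(present) / len(defined)


# The five types named in the description, used here as plain strings.
TYPES = ["original", "similar", "transformed", "distorted", "adversarial"]
labels = ["original", "original", "similar", "distorted"]
coverage = data_coverage(TYPES, labels)  # three of five types are present
```

With three of the five types present, the sketch yields the 60% figure used in the example above.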

2. Data distribution

Data distribution is a measure for checking whether the training data follows a normal distribution.

Most naturally occurring data is normally distributed, but collected training data will not follow a normal distribution if too few items are collected or the collected data is biased. Good performance can be expected from machine learning-based software only when its training data also forms a normal distribution. The distribution of the data can therefore be checked before training to decide whether to reconstruct the data or proceed with training. The measure of data distribution is defined as the standard deviation calculated from the distance of each data item from the original data. It is expressed as Equation (2), in the same way as in general statistical analysis.

Equation (2)

In this patent, the constructed training data is defined to follow a standard normal distribution in order to improve the effectiveness of learning. This is because, when the training data follows a standard normal distribution centered on the original data (the median), a learning effect can be obtained across the various data types. Data distribution is therefore considered better the closer the standard deviation is to 1, and the significance level is defined as 95%.
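A minimal Python sketch of this check follows. It assumes that Equation (2) is the ordinary population standard deviation, as the text says it matches general statistical practice, and that each training item has already been reduced to a signed offset from the original data. The function names and the `tolerance` parameter are hypothetical; the patent does not state how close to 1 is close enough.

```python
import statistics


def distribution_measure(offsets):
    """Population standard deviation of the offsets from the original data."""
    return statistics.pstdev(offsets)


def is_near_unit_spread(offsets, tolerance=0.2):
    """True when the standard deviation lies within `tolerance` of 1.

    `tolerance` is an illustrative assumption, not a value from the patent.
    """
    return abs(distribution_measure(offsets) - 1.0) <= tolerance


balanced = [-1.0, 1.0, -1.0, 1.0]  # spread of exactly 1 around the center
collapsed = [0.0, 0.0, 0.0]        # every item identical to the original
```

A full normality test at the stated 95% level would need more than the standard deviation alone; this sketch covers only the "close to 1" criterion.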

3. Data completeness

Data completeness is a measure of whether the training data set includes all attributes of the object to be learned.

To calculate this measure, the user first selects the objects to be learned. A simple binary classifier only needs to learn two objects, whereas a higher-dimensional classifier requires several learning objects to be selected. The structural characteristics of the selected objects are then classified to determine the attributes that must be included. Based on these attributes, the measure is calculated by judging whether the training data set contains data that can represent each attribute. The calculation of data completeness is defined as in Equation (3).

Equation (3)

For example, in image-based object recognition, the total number of attributes is determined from the various structural shapes used to represent an object. For a person, attributes such as two feet, two hands, chest, back, pelvis (hips), head, eyes, nose, mouth, and ears can be identified, and data completeness checks that all of these attributes are present, without omission, across the training data as a whole. If the hips, defined as an attribute of a person, are missing from the training data, then a photograph showing only the hips cannot be judged to be part of a person.
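The check can be sketched in Python as follows. This is an illustration under the assumption that Equation (3) is the share of required attributes found anywhere in the training data; the function name `data_completeness` and the attribute strings are hypothetical.

```python
def data_completeness(required_attributes, training_items):
    """Share of required attributes found in the training data, plus the gaps.

    required_attributes: attributes the learned object must exhibit.
    training_items:      one set of annotated attributes per training item.
    Returns (ratio between 0 and 1, sorted list of missing attributes).
    """
    required = set(required_attributes)
    if not required:
        raise ValueError("at least one required attribute must be defined")
    # Pool the attributes seen across the whole training set.
    observed = set().union(*training_items)
    missing = required - observed
    ratio = (len(required) - len(missing)) / len(required)
    return ratio, sorted(missing)


person = {"foot", "hand", "hip", "head"}
items = [{"foot", "hand"}, {"head"}]
ratio, missing = data_completeness(person, items)
```

Returning the missing attributes alongside the ratio shows which kind of data has to be added when the training set is reconstructed.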

4. Data redundancy

Data redundancy is a measure of how much duplicated data the training data set contains.

If the training data contains many duplicates, training repeats the meaningless work of re-learning data that has already been learned, which harms learning efficiency. Exactly identical data items are unlikely to be included, but nearly identical ones may be. To prevent this, the data items that best show the attributes of the object to be learned, or that carry attributes that must be learned, are first selected and grouped into sets by data type. Here, a set of the same data type means one of the per-type sets defined under the data coverage criterion introduced first, that is, a set whose members share the same expected value. The data item under examination is then compared with the items in the selected set to measure similarity. Similarity can be measured by Euclidean distance, Minkowski distance, cosine similarity, and other methods, and an appropriate method is chosen according to the application domain. Data redundancy can be determined through data similarity and calculated by Equation (4) below.

Equation (4)

In Equation (4), n is the number of training data items belonging to the same type, and m is the number of types. The similarity Sim(d_j, d_ji) is obtained by taking a data item d_j belonging to type j, calculating its similarity to every other data item in that type, and summing the results. Dividing this by the total number of training data items gives the measure of data redundancy. If the calculated redundancy exceeds a specific threshold, data items are deleted starting with those having the highest similarity, and the data set is reconstructed.
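One reading of this calculation is sketched below in Python. It assumes that each data item is a numeric feature vector, that cosine similarity is the chosen measure, and that the summed within-type similarities are divided by the total number of items. The exact form of Equation (4) is not reproduced in this text, so the normalization and the function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def data_redundancy(items_by_type):
    """Summed within-type similarities divided by the total item count.

    items_by_type: dict mapping a type name to its list of feature vectors.
    """
    total_items = sum(len(items) for items in items_by_type.values())
    if total_items == 0:
        return 0.0
    total_similarity = 0.0
    for items in items_by_type.values():
        for i, reference in enumerate(items):
            # Compare the reference item with every other item of its type.
            total_similarity += sum(
                cosine_similarity(reference, other)
                for k, other in enumerate(items)
                if k != i
            )
    return total_similarity / total_items


groups = {"original": [[1.0, 0.0], [1.0, 0.0]], "similar": [[0.0, 1.0]]}
```

In this example the two identical vectors in the `original` group contribute a similarity of 1 in each direction, so the measure is 2 divided by 3 items.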

5. Data traceability

Data traceability is a measure of whether, when an intelligent software system trained on a training data set produces a result different from the expected one for new input data, it is possible to identify (trace) which data lowers the training performance. That is, when correct data is input the system must produce a correct result, and when incorrect data is input it must report that the data is incorrect. If incorrect data is nevertheless judged to be correct (a false positive), a problem in the training data can be suspected, and it must be possible to trace which training data caused the result.

Unlike the four training data quality measures described above, this measure serves as a post-training data quality criterion. That is, data traceability must be evaluated when either of the following two cases occurs after the machine learning-based software has been trained.

(1) A wrong result for correct input data (false positive): this case results from insufficient training on correct data. Data highly similar to the input data must therefore be additionally generated and included in the training data.

(2) A result of "correct" for incorrect input data (false negative): this case is a training error, and training data highly similar to the incorrect input data must be removed.
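The removal described in case (2) can be sketched in Python as follows. The sketch assumes that data items are numeric feature vectors and that "highly similar" means a cosine similarity at or above a cutoff. The function names and the 0.9 threshold are hypothetical and are not taken from the patent.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_by_similarity(offending_input, training_data, threshold=0.9):
    """Separate training items that closely resemble an offending input.

    Returns (items to keep, items flagged for removal). The threshold is an
    illustrative assumption.
    """
    keep, flagged = [], []
    for item in training_data:
        if cosine_similarity(offending_input, item) >= threshold:
            flagged.append(item)
        else:
            keep.append(item)
    return keep, flagged


training = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
keep, flagged = split_by_similarity([1.0, 0.0], training)
```

For case (1), the same similarity test could instead guide which newly generated items are close enough to the input to be added.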

Traceability of the training data can be expressed by Equation (5) below and is evaluated as a binary value indicating whether traceability exists.

Equation (5)

That is, if input data I_p having attribute p is mapped by function α to an element of the training data set D_L having a similar attribute, traceability is evaluated as 1; otherwise it is evaluated as 0. When the traceability value is 0, the learning model needs to be modified.
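The binary evaluation can be illustrated with the Python sketch below. The patent does not define how the mapping α decides that two attributes are similar, so the sketch assumes attribute sets compared by Jaccard similarity with a hypothetical threshold. The names `alpha` and `traceability` and the sample identifiers are illustrative only.

```python
def jaccard(a, b):
    """Overlap of two attribute sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def alpha(input_attributes, training_set, threshold=0.5):
    """Map an input to the most similar training item, or None if none is close.

    training_set: dict mapping an item identifier to its attribute set.
    The similarity function and threshold are assumptions of this sketch.
    """
    best_id, best_score = None, 0.0
    for item_id, attributes in training_set.items():
        score = jaccard(input_attributes, attributes)
        if score > best_score:
            best_id, best_score = item_id, score
    return best_id if best_score >= threshold else None


def traceability(input_attributes, training_set, threshold=0.5):
    """1 if the input maps to some training item, otherwise 0."""
    mapped = alpha(input_attributes, training_set, threshold)
    return 1 if mapped is not None else 0


D_L = {"img_001": {"head", "eyes", "nose"}, "img_002": {"hand", "foot"}}
```

A value of 1 also identifies which training item the input traces back to, which is the item to inspect when the training data is reconstructed.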

Hereinafter, the method for evaluating the quality of training data for machine learning-based software according to the present invention will be described in detail.

Referring to FIGS. 1 to 3, the method comprises extracting data characteristics for each data evaluation criterion from input data to be used as training data, evaluating quality factors based on the extracted characteristics, and deriving the traceability quality of the data by synthesizing the evaluation results of each quality factor.

In the step of extracting data characteristics for each data evaluation criterion from the input data to be used as training data, data coverage-related characteristics, data distribution-related characteristics, data completeness-related characteristics, and data redundancy-related characteristics are extracted.

In the step of evaluating quality factors based on the extracted characteristics, data coverage, data distribution, data completeness, and data redundancy are evaluated based on the extracted characteristics.

If the evaluation satisfies the criteria, the evaluation measures are reported. If it does not, the measures that fall below the criteria are reported, the training data is corrected and reconstructed, and the quality of the corrected training data is evaluated again.
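The pass or fail reporting described above can be sketched in Python as follows. The patent does not give numerical criteria, so every threshold in `EXAMPLE_CRITERIA` is a hypothetical placeholder, as are the function and key names.

```python
# Hypothetical acceptance rules, one predicate per pre-training measure.
EXAMPLE_CRITERIA = {
    "coverage": lambda value: value >= 100.0,             # percent of types
    "distribution": lambda value: abs(value - 1.0) <= 0.2,  # std dev near 1
    "completeness": lambda value: value >= 1.0,           # all attributes
    "redundancy": lambda value: value <= 0.3,             # low similarity
}


def below_criteria(measures, criteria=EXAMPLE_CRITERIA):
    """Return the measures that fail their criterion, keyed by name."""
    failures = {}
    for name, is_acceptable in criteria.items():
        if not is_acceptable(measures[name]):
            failures[name] = measures[name]
    return failures


measures = {
    "coverage": 60.0,
    "distribution": 1.05,
    "completeness": 1.0,
    "redundancy": 0.1,
}
failures = below_criteria(measures)
```

An empty result means training can proceed; otherwise the returned entries are the below-criteria measures to report before the training data is reconstructed and evaluated again.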

Thereafter, the step of deriving the traceability quality of the data by synthesizing the evaluation results of each quality factor includes the following process.

1) Training the intelligent software system with the training data set

2) Checking whether the training result satisfies the criteria

If it does, the process ends.

3) Extracting input data-related characteristics

If machine learning does not deliver the performance the user expected, the quality of data traceability is evaluated. For example, when correct data is input the system must produce a correct result, and when incorrect data is input it must report that the data is incorrect. If incorrect data is nevertheless judged to be correct (a false positive), a problem in the training data can be suspected.

4) Evaluating data traceability

Data traceability must be evaluated when either of the following two cases occurs after the machine learning-based software has been trained.

a) A wrong result for correct input data (false positive): this case results from insufficient training on correct data. Data highly similar to the input data must therefore be additionally generated and included in the training data.

b) A result of "correct" for incorrect input data (false negative): this case is a training error, and training data highly similar to the incorrect input data must be removed.

5) Determining whether traceability exists

If traceability exists, the evaluation result is reported and the training data is reconstructed.

If traceability does not exist, the learning model is improved and the process ends.
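The branching in steps 2) and 5) can be summarized in a short Python sketch. The function name and the returned strings are hypothetical; the sketch only encodes the decisions stated in the steps above.

```python
def next_action(result_meets_criteria, traceability):
    """Choose the follow-up step after training, per steps 2) and 5).

    result_meets_criteria: outcome of step 2).
    traceability:          binary value from step 4), 1 or 0.
    """
    if result_meets_criteria:
        return "end"
    if traceability == 1:
        return "report evaluation result and reconstruct training data"
    return "improve learning model and end"
```

Traceability is consulted only when the training result misses the criteria, which matches its role as a post-training measure.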

The method for evaluating the quality of training data for machine learning-based software according to the present invention can be used for the following purposes.

When developing autonomous vehicles, intelligent robots, and similar systems that embed machine learning-based software, suitable training data is required for the system to operate correctly. The method can therefore be used to evaluate the quality of the training data needed when developing such machine learning-based control software.

In addition, according to the present invention, suitable training data can be developed and generated by applying the proposed quality evaluation criteria when constructing training data for machine learning-based software.

Although the invention made by the present inventors has been described in detail with reference to the above embodiments, the present invention is not limited to those embodiments, and it is obvious to those of ordinary skill in the art that various modifications can be made without departing from its gist.

Claims

A method for evaluating the quality of training data for machine learning-based software, the method comprising:
extracting data characteristics for each data evaluation criterion from input data to be used as training data;
evaluating quality factors based on the extracted characteristics; and
deriving the traceability quality of the data by synthesizing the evaluation results of each quality factor.

The method of claim 1,
wherein the step of extracting data characteristics for each data evaluation criterion comprises
extracting data coverage-related characteristics, data distribution-related characteristics, data completeness-related characteristics, and data redundancy-related characteristics.

The method of claim 1,
wherein the step of evaluating quality factors based on the extracted characteristics comprises
evaluating data coverage, data distribution, data completeness, and data redundancy based on the extracted characteristics.

The method of claim 1,
wherein the step of deriving the traceability quality of the data comprises:
training an intelligent software system with a training data set;
checking whether the training result satisfies the criteria;
extracting input data-related characteristics;
evaluating data traceability;
determining whether traceability exists; and
if traceability exists, reporting the evaluation result and reconstructing the training data.

The method of claim 2,
wherein the data coverage is a measure of which types of data are present for the object to be learned, and
the measure for calculating the training data coverage is defined as in Equation (1) below.

Equation (1)

The method of claim 2,
wherein the data distribution is a measure for checking whether the training data follows a normal distribution, and
the measure representing the distribution of the data is defined as in Equation (2) below.

Equation (2)

The method of claim 2,
wherein the data completeness is a measure of whether the training data set includes all attributes of the object to be learned, and
the calculation of the data completeness is defined as in Equation (3).

Equation (3)

The method of claim 2,
wherein the data redundancy is a measure of how much duplicated data the training data set contains, and
the data redundancy is determined through data similarity and calculated by Equation (4) below.

Equation (4)

(In Equation (4), n is the number of training data items belonging to the same type, and m is the number of types. The similarity Sim(d_j, d_ji) is the sum of the similarities between a data item d_j belonging to type j and all other data items in that type.)

The method of claim 4,
wherein the traceability is a measure used as a data quality evaluation criterion after machine learning has been performed, and
the traceability is expressed by Equation (5) below and evaluated as a binary value indicating whether traceability exists.

Equation (5)

(If input data I_p having attribute p is mapped by function α to an element of the training data set D_L having a similar attribute, traceability is evaluated as 1; otherwise it is evaluated as 0. When the traceability value is 0, the learning model needs to be modified.)