KR102303111B1

KR102303111B1 - Training Data Quality Assessment Technique for Machine Learning-based Software

Info

Publication number: KR102303111B1
Application number: KR1020190148654A
Authority: KR
Inventors: 홍장의; 김문현; 권용균
Original assignee: 충북대학교 산학협력단
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2021-09-17
Also published as: KR20210060978A

Abstract

A method for evaluating the quality of training data for machine learning-based software is disclosed. The method comprises: extracting data characteristics for each data evaluation criterion from input data to be used as training data; evaluating quality factors based on the extracted characteristics; and deriving the traceability quality of the data by combining the evaluation results of the respective quality factors.

Description

Training Data Quality Assessment Technique for Machine Learning-based Software

The present invention relates to a method for evaluating the quality of training data for machine learning-based software.

As research on artificial intelligence has grown more active in computer science, many algorithms based on machine learning techniques that mimic the human learning process have been developed, and a wide range of software now adopts them. Through machine learning, software extracts features from past and present data, generalizes from them, and uses the result to predict future data.

Many things matter in learning, but one important question is what material to learn from.

Learning from good, correct information yields better efficiency and better results than learning from poor information. This is not limited to people: in machine learning, too, what matters is which data the system learns from. Several methods for constructing (or generating) training data have been introduced, but no criterion or method for evaluating the quality of the constructed data has been presented.

1. Republic of Korea Patent Registration No. 10-2005628 (2019.07.24)
2. Republic of Korea Patent Publication No. 10-2019-0044814 (2019.05.02)

An object of the present invention is to provide a method for evaluating the quality of training data for machine learning-based software that allows the training data to be evaluated effectively using new quality evaluation measures and methods.

To achieve this object, the method according to the present invention comprises: extracting data characteristics for each data evaluation criterion based on input data to be used as training data; evaluating quality factors based on the extracted characteristics; and deriving the traceability quality of the data by combining the evaluation results of the respective quality factors.

The step of extracting data characteristics for each data evaluation criterion comprises extracting characteristics related to data coverage, data distribution, data completeness, and data redundancy.

The step of evaluating quality factors based on the extracted characteristics comprises evaluating data coverage, data distribution, data completeness, and data redundancy based on the extracted characteristics.

The step of deriving the traceability quality of the data comprises: training an intelligent software system with the training data set; checking whether the training result satisfies a criterion; extracting characteristics related to the input data; evaluating data traceability; determining whether traceability exists; and, if traceability exists, reporting the evaluation result and reconstructing the training data.

Data coverage is a measure of the types of data present for the object to be learned, and the measure for calculating training data coverage is defined as in Equation (1) below.

Equation (1)

Data distribution is a measure for checking whether the training data follows a normal distribution, and the measure of data distribution is defined as in Equation (2) below.

Equation (2)

Data completeness is a measure of whether the training data set contains every attribute of the object to be learned, and data completeness is calculated as defined in Equation (3).

Equation (3)

Data redundancy is a measure of how much duplicated data the training data set contains. It can be determined from data similarity and calculated using Equation (4) below.

Equation (4)

(In Equation (4), n is the number of training data items belonging to the same type and m is the number of types. The data similarity Sim(d_j, d_ji) is obtained by calculating the similarity between a reference data item d_j belonging to type j and every other data item of that type, and summing the results.)

Traceability is a data quality evaluation measure applied after machine learning has been carried out. It can be expressed by Equation (5) below and is evaluated as a binary value indicating whether traceability exists.

Equation (5)

(If input data I_p having attribute p is mapped (by function α) to an element of the training data set D_L that has similar attributes, traceability is evaluated as 1; otherwise it is evaluated as 0. When the traceability value is 0, the learning model needs to be modified.)

The effects obtainable through the present invention are as follows.

By presenting quality criteria for training data that have not previously been considered, the present invention makes it possible to apply a new technique to the field.

When training data for an application that uses machine learning is collected and generated, only the necessary data can be selected and collected by taking the presented quality evaluation criteria into account.

By raising the quality of the training data used for machine learning, the present invention improves the learning effect and ultimately contributes to better application performance.

FIG. 1 is a diagram illustrating the training data quality evaluation process according to the present invention.
FIG. 2 is a diagram that breaks the outline process of FIG. 1 into detailed steps.
FIG. 3 is a diagram illustrating the process of deriving the traceability quality of the data from the results obtained when machine learning-based software is trained with training data of satisfactory quality according to the present invention.

Hereinafter, the method for evaluating the quality of training data for machine learning-based software according to the present invention is described in detail with reference to the accompanying drawings.

The present invention presents criteria for evaluating the quality of the training data used to train machine learning-based software. Applying these criteria helps guarantee the quality of the training data used by such software and ultimately contributes to improving its performance.

FIG. 1 is a diagram illustrating the training data quality evaluation process according to the present invention.

FIG. 1 shows the overall process of evaluating data quality. Data intended for training is first taken as input. The process then consists of two stages: extracting, from the input data, the data characteristics for each data evaluation criterion presented in the present invention; and evaluating the quality factors based on the extracted characteristics. Combining the evaluation results of the individual quality factors yields measures by which the quality of the data can be assessed. A user who has evaluated the training data can use these results to decide whether to use the training data as is or to reconstruct it.

FIG. 2 breaks the outline process of FIG. 1 into detailed steps, and FIG. 3 illustrates the process of deriving the traceability quality of the data from the results obtained when machine learning-based software is trained with training data of satisfactory quality according to the present invention.

FIG. 2 sets out the four data quality evaluation criteria that the present invention presents as pre-training criteria. To evaluate each quality item, characteristics are extracted from the training data and a numerical measure is computed from them. The user judges whether the measures computed for the four items satisfy the criteria and decides whether to reconstruct the training data. If the values are satisfactory, the machine learning-based software is trained with that data set.

As shown in FIGS. 2 and 3, the present invention presents five measures as training data quality evaluation criteria: (1) data coverage, (2) data distribution, (3) data completeness, (4) data redundancy, and (5) data traceability. Of these, data traceability serves as the post-training data quality criterion, and the remaining four serve as pre-training data quality criteria.

Here, "pre" and "post" refer to whether the evaluation takes place before or after machine learning is carried out. Each data evaluation criterion is described below.

1. Data coverage

Data coverage is a measure of the types of data present for the object to be learned.

To compute this measure, the data types must be defined in advance. In general, data can be broadly classified into valid data and invalid data. This patent refines the classification: valid data is divided into original data, similar data, and transformed data, and invalid data is divided into distorted data and error (adversarial) data. Types may be added to or removed from this classification depending on the application domain of the intelligent software, provided that the types of training data are defined in advance.

Data coverage requires that at least one data item exist for each of these refined training data types. If each of the five types above contains at least one data item, the constructed data can be said to cover all data types. The measure for calculating training data coverage is defined as in Equation (1).

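Equation (1)

$$\text{Data coverage} = \frac{\text{number of data types containing at least one data item}}{\text{total number of data types}} \times 100\ (\%)$$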

As defined in Equation (1), data coverage is the number of types that contain at least one data item divided by the total number of types, expressed as a percentage. For example, if an image classifier has five predefined data types and the prepared training data contains three of them, the training data coverage is 60%.
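The coverage calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type identifiers and the function name `data_coverage` are hypothetical and do not come from the patent, and each training item is assumed to already carry a type label.

```python
from collections import Counter

# Hypothetical identifiers for the five data types named in the description.
DATA_TYPES = ("original", "similar", "transformed", "distorted", "adversarial")


def data_coverage(type_labels, defined_types=DATA_TYPES):
    """Percentage of defined data types that contain at least one data item."""
    if not defined_types:
        raise ValueError("at least one data type must be defined in advance")
    counts = Counter(type_labels)
    covered = sum(1 for data_type in defined_types if counts[data_type] > 0)
    return 100.0 * covered / len(defined_types)


# Three of the five predefined types are present, as in the example above.
labels = ["original", "original", "similar", "distorted"]
coverage = data_coverage(labels)
print(coverage)  # 60.0
```

Passing a different `defined_types` tuple reflects the statement that types may be added or removed per application domain.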

2. Data distribution

Data distribution is a measure for checking whether the training data follows a normal distribution.

Most natural data is normally distributed, but training data does not follow a normal distribution when too few samples are collected or the collected data is biased. Good performance can be expected from machine learning-based software only when the training data also forms a normal distribution. The normality of the data can therefore be checked before training to decide whether to reconstruct the data or proceed with training. The measure of data distribution is defined as the standard deviation computed from the distance to the original data. It is expressed as Equation (2), in the same way as defined in general statistical analysis.

Equation (2)

In this patent, the constructed training data is required to follow a standard normal distribution in order to improve the learning effect. This is because, when the training data follows a standard normal distribution centered on the original data (the median), a learning effect can be obtained across the various data types. Data distribution is therefore considered better the closer the standard deviation is to 1, and the significance level is defined as 95%.
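A minimal sketch of the distribution check follows. Equation (2) is not reproduced in this text, so the code assumes the ordinary population standard deviation, applied to signed offsets of each training item from the original data. The function names and the `tolerance` cut-off are hypothetical; the patent only states that a value closer to 1 is better.

```python
import math


def distribution_score(offsets):
    """Population standard deviation of offsets measured from the original data."""
    if not offsets:
        raise ValueError("at least one offset is required")
    mean = sum(offsets) / len(offsets)
    variance = sum((x - mean) ** 2 for x in offsets) / len(offsets)
    return math.sqrt(variance)


def is_acceptable(offsets, tolerance=0.1):
    """True when the standard deviation is within `tolerance` of 1 (assumed cut-off)."""
    return abs(distribution_score(offsets) - 1.0) <= tolerance


offsets = [-1.0, 1.0, -1.0, 1.0]
print(distribution_score(offsets))  # 1.0
print(is_acceptable(offsets))       # True
```

A standard deviation near 1 does not by itself establish normality; a fuller check would add a normality test, which the description does not specify.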

3. Data completeness

Data completeness is a measure of whether the training data set contains every attribute of the object to be learned.

To compute this measure, the user first selects the objects to be learned. A simple binary classifier needs to learn only two objects, whereas a higher-dimensional classifier requires several objects to be selected. The structural characteristics of the selected objects are then classified to determine the attributes that must be included. Using these attributes as the basis, the measure is computed by determining whether the training data set contains data that can represent each attribute. Data completeness is calculated as defined in Equation (3).

Equation (3)

For example, in image-based object recognition, the total number of attributes is determined from the various structural shapes needed to represent an object. For a person, attributes such as the feet, hands, chest, back, pelvis (hip), head, eyes, nose, mouth, and ears can be identified, and data completeness checks whether all of these attributes appear somewhere in the training data without omission. If the hip, defined as an attribute of a person, is missing from the training data, then a photograph showing only the hip cannot be recognized as part of a person.
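The attribute check in the person example can be sketched as below. Equation (3) is not reproduced in this text, so expressing completeness as the percentage of required attributes found in the training data is an assumption, as are the function name `data_completeness` and the attribute identifiers.

```python
def data_completeness(required_attributes, samples):
    """Return (percentage of required attributes present, sorted missing attributes)."""
    required = set(required_attributes)
    if not required:
        raise ValueError("at least one required attribute must be defined")
    present = set()
    for sample_attributes in samples:
        present.update(sample_attributes)
    missing = required - present
    score = 100.0 * (len(required) - len(missing)) / len(required)
    return score, sorted(missing)


# Attributes listed for a person in the description (identifiers are hypothetical).
PERSON = ["feet", "hands", "chest", "back", "hip", "head", "eyes", "nose", "mouth", "ears"]
samples = [
    {"head", "eyes", "nose", "mouth", "ears"},
    {"chest", "back", "hands"},
    {"feet"},
]
score, missing = data_completeness(PERSON, samples)
print(score, missing)  # 90.0 ['hip']
```

Returning the missing attributes alongside the score shows which data would need to be added, such as images of the hip in this case.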

4. Data redundancy

Data redundancy is a measure of how much duplicated data the training data set contains.

If the training data contains many duplicates, the system goes through the pointless process of relearning data it has already learned, which harms learning efficiency. Exactly identical data items are unlikely to be included, but nearly identical ones may be. To prevent this, data items that best show the attributes of the object to be learned, or that have attributes that must be learned, are first selected and grouped into sets of the same data type. A set of the same data type here means one of the per-type sets defined under the data coverage criterion above, that is, a set whose members have the same expected value. The data item to be compared is then compared with the items in the selected set to measure similarity. Similarity can be measured with Euclidean distance, Minkowski distance, cosine similarity, and so on, and an appropriate method is chosen according to the application domain. Data redundancy can be determined from data similarity and calculated using Equation (4) below.

Equation (4)

In Equation (4), n is the number of training data items belonging to the same type and m is the number of types. The data similarity Sim(d_j, d_ji) is obtained by calculating the similarity between a reference data item d_j belonging to type j and every other data item of that type, and summing the results. Dividing this by the total number of training data items gives the measure of data redundancy. If the computed redundancy exceeds a given threshold, data items are deleted starting from the one with the highest similarity, and the data is reconstructed.
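The redundancy measure can be sketched as follows. Equation (4) is not reproduced in this text, so the code follows the verbal description: similarities to a per-type reference item are summed over all types and divided by the total number of training items. Cosine similarity is one of the methods the description names; the function names, the `groups` layout, and counting the reference item in the total are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def data_redundancy(groups, sim=cosine_similarity):
    """groups maps a type name to (reference item d_j, list of other items d_ji)."""
    total_similarity = 0.0
    total_items = 0
    for reference, others in groups.values():
        total_similarity += sum(sim(reference, other) for other in others)
        total_items += 1 + len(others)  # assumption: the reference item is counted
    if total_items == 0:
        raise ValueError("at least one training item is required")
    return total_similarity / total_items


groups = {
    "original": ([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "similar": ([0.0, 2.0], [[0.0, 1.0]]),
}
redundancy = data_redundancy(groups)
print(redundancy)  # 0.4
```

The `sim` parameter can be swapped for a distance-based similarity when the application domain calls for it.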

5. Data traceability

Data traceability is a measure of whether it is possible to identify (trace) which data degrades learning performance when an intelligent software system, after being trained with the training data set, produces a result for new input data that differs from the expected result. When correct data is input, the system should produce a correct result, and when incorrect data is input, it should report that the data is incorrect. If incorrect data is input and the system nevertheless judges it to be correct (false positive), a problem in the training data can be suspected, and it must be possible to trace which training data causes this result.

Unlike the four training data quality measures described above, this quality serves as a post-training data quality criterion. That is, data traceability must be evaluated when either of the following two cases occurs after the machine learning-based software has been trained.

(1) A wrong result for correct input data (false positive): this results from insufficient learning on correct data. Data highly similar to the input data should therefore be generated and added to the training data.

(2) A correct result for incorrect input data (false negative): this is a learning error, and training data highly similar to the incorrect input data should be removed.

Traceability of the training data can be expressed by Equation (5) below and is evaluated as a binary value indicating whether traceability exists.

Equation (5)

That is, if input data I_p having attribute p is mapped (by function α) to an element of the training data set D_L that has similar attributes, traceability is evaluated as 1; otherwise it is evaluated as 0. When the traceability value is 0, the learning model needs to be modified.
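A minimal sketch of the binary traceability evaluation follows. Equation (5) and the mapping function α are not reproduced in this text, so the code stands in for α with a Jaccard overlap between attribute sets. The overlap measure, the `min_overlap` threshold, and the sample identifiers are hypothetical choices, not the patent's definition.

```python
def traceability(input_attributes, training_set, min_overlap=0.5):
    """Return (1, sample_id) if a training sample has similar attributes, else (0, None)."""
    input_set = set(input_attributes)
    best_id, best_score = None, 0.0
    for sample_id, attributes in training_set.items():
        sample_set = set(attributes)
        union = input_set | sample_set
        score = len(input_set & sample_set) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = sample_id, score
    if best_id is not None and best_score >= min_overlap:
        return 1, best_id
    return 0, None


training_set = {
    "img_001": {"head", "eyes", "nose"},
    "img_002": {"hands", "feet"},
}
print(traceability({"head", "eyes"}, training_set))  # (1, 'img_001')
print(traceability({"hip"}, training_set))           # (0, None)
```

Returning the matched sample identifier along with the flag gives the user a starting point for adding or removing training data, as the two cases above require.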

Hereinafter, the method for evaluating the quality of training data for machine learning-based software according to the present invention is described in detail.

Referring to FIGS. 1 to 3, the method comprises: extracting data characteristics for each data evaluation criterion based on input data to be used as training data; evaluating quality factors based on the extracted characteristics; and deriving the traceability quality of the data by combining the evaluation results of the respective quality factors.

In the step of extracting data characteristics for each data evaluation criterion, characteristics related to data coverage, data distribution, data completeness, and data redundancy are extracted.

In the step of evaluating quality factors based on the extracted characteristics, data coverage, data distribution, data completeness, and data redundancy are evaluated based on the extracted characteristics.

If the evaluation satisfies the criteria, the evaluation measures are reported. If it does not, the measures that fall short are reported, the training data is revised and reconstructed, and the quality of the revised training data is evaluated again.
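The pass/fail decision on the four pre-training measures can be sketched as a table of checks. The patent does not state numeric thresholds, so every threshold below is a hypothetical placeholder, as are the function name `failed_criteria` and the dictionary keys.

```python
def failed_criteria(scores, checks):
    """Return the sorted names of the measures whose check does not pass."""
    return sorted(name for name, check in checks.items() if not check(scores[name]))


# Hypothetical acceptance checks; the patent leaves the criteria to the user.
checks = {
    "coverage": lambda value: value >= 100.0,            # every type present
    "distribution": lambda value: abs(value - 1.0) <= 0.1,
    "completeness": lambda value: value >= 100.0,        # every attribute present
    "redundancy": lambda value: value <= 0.3,
}

scores = {"coverage": 60.0, "distribution": 0.97, "completeness": 100.0, "redundancy": 0.2}
failed = failed_criteria(scores, checks)
print(failed)  # ['coverage']
```

An empty result means training can proceed; otherwise the listed measures indicate what to fix before the revised data is evaluated again.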

Thereafter, the step of deriving the traceability quality of the data by combining the evaluation results of the respective quality factors includes the following process.

1) Training the intelligent software system with the training data set

2) Checking whether the training result satisfies the criterion

If it does, the process ends.

3) Extracting characteristics related to the input data

If machine learning does not deliver the performance the user expected, the quality of data traceability is evaluated. For example, when correct data is input, the system should produce a correct result, and when incorrect data is input, it should report that the data is incorrect. If incorrect data is input and the system nevertheless judges it to be correct (false positive), a problem in the training data can be suspected.

4) Evaluating data traceability

Data traceability must be evaluated when either of the following two cases occurs after the machine learning-based software has been trained.

a) A wrong result for correct input data (false positive): this results from insufficient learning on correct data. Data highly similar to the input data should therefore be generated and added to the training data.

b) A correct result for incorrect input data (false negative): this is a learning error, and training data highly similar to the incorrect input data should be removed.

5) Determining whether traceability exists

If traceability exists, the evaluation result is reported and the training data is reconstructed.

If traceability does not exist, the learning model is improved and the process ends.
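The five steps above can be sketched as a single decision function. This is an illustration only: the function name, the returned action strings, and the `trace` callback (assumed to return a `(flag, sample_id)` pair, with flag 1 when traceability exists) are hypothetical, and training itself is left outside the sketch.

```python
def post_training_step(result_ok, failing_inputs, trace):
    """Choose the next action after one training round (steps 2 to 5 above)."""
    if result_ok:
        # Step 2: the training result satisfies the criterion, so the process ends.
        return {"action": "finish", "traced": {}}
    traced = {}
    for name, features in failing_inputs.items():
        # Steps 3 and 4: take each failing input's features and evaluate traceability.
        flag, sample_id = trace(features)
        if flag == 1:
            traced[name] = sample_id
    if traced:
        # Step 5: traceability exists, so report it and reconstruct the training data.
        return {"action": "reconstruct_training_data", "traced": traced}
    # Step 5: no traceability, so the learning model itself is improved.
    return {"action": "improve_model", "traced": {}}


def toy_trace(features):
    """Stand-in tracer: an input is traceable when it carries the 'head' attribute."""
    return (1, "img_001") if "head" in features else (0, None)


print(post_training_step(False, {"input_a": {"head"}}, toy_trace))
# {'action': 'reconstruct_training_data', 'traced': {'input_a': 'img_001'}}
```

Whether reconstruction means adding similar data or removing it depends on which of cases a) and b) applies to the traced input.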

The method for evaluating the quality of training data for machine learning-based software according to the present invention can be used for the following purposes.

When developing autonomous vehicles, intelligent robots, and similar systems with embedded machine learning-based software, suitable training data is needed for the system to operate correctly. The method can therefore be used to evaluate the quality of the training data required when developing such machine learning-based control software.

In addition, when training data for machine learning-based software is being constructed, suitable training data can be developed and generated by applying the proposed quality evaluation criteria.

Although the invention made by the present inventors has been described in detail with reference to the above embodiment, the present invention is not limited to that embodiment, and it will be apparent to those of ordinary skill in the art that various modifications can be made without departing from its gist.

Claims

A method for evaluating the quality of training data for machine learning-based software, each step of which is performed by a computing device, the method comprising:
extracting data characteristics for each data evaluation criterion based on input data to be used as training data;
evaluating quality factors based on the extracted characteristics; and
deriving the traceability quality of the data by combining the evaluation results of the respective quality factors,
wherein the step of deriving the traceability quality of the data comprises:
training an intelligent software system with the training data set;
checking whether the training result satisfies a criterion;
extracting characteristics related to the input data;
evaluating data traceability;
determining whether traceability exists; and
if traceability exists, reporting the evaluation result and reconstructing the training data.

(Deleted)

The method according to claim 1,
wherein the step of evaluating the quality factors based on the extracted characteristics comprises evaluating data coverage, data distribution, data integrity, and data redundancy based on the extracted characteristics.

(Deleted)

A method for evaluating the quality of learning data of machine learning-based software, in which each step is performed by a computing device, the method comprising:
extracting data characteristics for each data evaluation criterion based on input data to be used as learning data;
evaluating quality factors based on the extracted characteristics; and
deriving traceability quality of the data by combining the evaluation results of the respective quality factors,
wherein the step of extracting the data characteristics for each data evaluation criterion comprises extracting data coverage related characteristics, data distribution related characteristics, data integrity related characteristics, and data redundancy related characteristics,
wherein the data coverage is a measure indicating the type of data for the object to be learned, and
wherein the measure for calculating the learning data coverage is defined as in Equation (1) below.

Equation (1)

A method for evaluating the quality of learning data of machine learning-based software, in which each step is performed by a computing device, the method comprising:
extracting data characteristics for each data evaluation criterion based on input data to be used as learning data;
evaluating quality factors based on the extracted characteristics; and
deriving traceability quality of the data by combining the evaluation results of the respective quality factors,
wherein the step of extracting the data characteristics for each data evaluation criterion comprises extracting data coverage related characteristics, data distribution related characteristics, data integrity related characteristics, and data redundancy related characteristics,
wherein the data distribution is a measure for confirming whether the training data follows a normal distribution, and
wherein the measure indicating the distribution of the data is defined as in Equation (2) below.

Equation (2)
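
Equation (2) is not reproduced in this text, so the following Python sketch does not implement the claimed measure. It only illustrates one common way to check whether a numeric characteristic of the training data is approximately normally distributed: the Jarque-Bera statistic, computed from sample skewness and kurtosis. The function names and the 5% significance threshold are assumptions made for illustration.

```python
import math

# Chi-square critical value for 2 degrees of freedom at the 5% level
# (assumed threshold, not taken from the patent).
CHI2_2DF_5PCT = 5.991

def jarque_bera(values):
    """Return the Jarque-Bera statistic for a list of numbers."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4 (population form).
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("values have zero variance")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

def looks_normal(values):
    """True if normality is not rejected at the assumed 5% level."""
    return jarque_bera(values) < CHI2_2DF_5PCT
```

The statistic is only an asymptotic test, so for small data sets the result should be treated as a rough indication rather than a verdict.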

A method for evaluating the quality of learning data of machine learning-based software, in which each step is performed by a computing device, the method comprising:
extracting data characteristics for each data evaluation criterion based on input data to be used as learning data;
evaluating quality factors based on the extracted characteristics; and
deriving traceability quality of the data by combining the evaluation results of the respective quality factors,
wherein the step of extracting the data characteristics for each data evaluation criterion comprises extracting data coverage related characteristics, data distribution related characteristics, data integrity related characteristics, and data redundancy related characteristics,
wherein the data integrity is a measure indicating whether all properties of the object to be learned are included in the learning data set, and
wherein the data integrity is calculated as defined in Equation (3) below.

Equation (3)
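
Equation (3) is not reproduced in this text. As a hedged illustration of the idea stated in the claim (whether all properties of the object to be learned are included in the learning data set), the Python sketch below computes the fraction of required properties that appear with a non-missing value in at least one record. The ratio form, the function name `data_integrity`, and the dictionary-based record layout are assumptions, not the claimed formula.

```python
def data_integrity(required_properties, dataset):
    """Fraction of required properties present in at least one record.

    `dataset` is assumed to be a list of dicts mapping property name
    to value; a value of None is treated as missing.
    """
    required = set(required_properties)
    if not required:
        raise ValueError("required_properties must not be empty")
    present = set()
    for record in dataset:
        present.update(k for k, v in record.items() if v is not None)
    return len(required & present) / len(required)
```

Under this assumed definition a value of 1.0 means every required property is represented somewhere in the data set, and lower values indicate properties that are never observed.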

A method for evaluating the quality of learning data of machine learning-based software, in which each step is performed by a computing device, the method comprising:
extracting data characteristics for each data evaluation criterion based on input data to be used as learning data;
evaluating quality factors based on the extracted characteristics; and
deriving traceability quality of the data by combining the evaluation results of the respective quality factors,
wherein the step of extracting the data characteristics for each data evaluation criterion comprises extracting data coverage related characteristics, data distribution related characteristics, data integrity related characteristics, and data redundancy related characteristics,
wherein the data redundancy is a measure of how much overlapping data is included in the training data set, and
wherein the data redundancy is determined through data similarity and is calculated by Equation (4) below.

Equation (4)

(In Equation (4), n is the number of training data items belonging to the same type, and m is the number of types. The data similarity Sim(d_j, d_ji) is obtained by taking a data item d_j belonging to type j as the reference, calculating its similarity with each of the other data items, and summing the results.)
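
Equation (4) itself is not reproduced in this text, so the Python sketch below follows only the verbal description: within each type, a reference item is compared with the other items of that type and the similarities are summed. Several details are assumptions made for illustration: cosine similarity as the `Sim` function, the first item of each type as the reference, and averaging within and across types so that the score stays between 0 and 1 for non-negative feature vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def redundancy(data_by_type, sim=cosine_similarity):
    """Average similarity between a reference item and its peers, per type.

    `data_by_type` maps a type label to a list of feature vectors.
    """
    type_scores = []
    for items in data_by_type.values():
        if len(items) < 2:
            continue  # a single item has nothing to be redundant with
        reference, others = items[0], items[1:]
        total = sum(sim(reference, d) for d in others)
        type_scores.append(total / len(others))
    if not type_scores:
        return 0.0
    return sum(type_scores) / len(type_scores)
```

A score near 1 under these assumptions suggests that many items within a type are near duplicates of the reference item.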

The method according to claim 1,
wherein the traceability is a data quality evaluation criterion applied after machine learning, and
wherein the traceability is expressed by Equation (5) below and is evaluated as a binary value indicating whether traceability exists.

Equation (5)

(If the input data I_p with property p is mapped by the function α to elements of the training data set D_L that have similar properties, the traceability is evaluated as 1; otherwise, it is 0. If the traceability value is 0, the learning model needs to be modified.)
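
The binary evaluation described for Equation (5) can be sketched in Python as follows. The equation image is not reproduced in this text, so the concrete form of the mapping α is an assumption: here properties are represented as sets of labels, similarity is measured with the Jaccard index, and an input is considered mapped when at least one training element reaches a hypothetical threshold of 0.5.

```python
def jaccard(a, b):
    """Jaccard index of two collections of property labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def alpha(input_props, training_set, threshold=0.5):
    """Hypothetical mapping: training elements with similar properties."""
    return [d for d in training_set if jaccard(input_props, d) >= threshold]

def traceability(input_props, training_set, threshold=0.5):
    """Return 1 if the input maps to some training element, else 0."""
    return 1 if alpha(input_props, training_set, threshold) else 0
```

For example, with training elements `{"night", "rain", "pedestrian"}` and `{"day", "clear", "car"}`, an input `{"night", "rain", "cyclist"}` shares two of four distinct labels with the first element and is evaluated as 1, while an input `{"snow", "truck"}` is evaluated as 0, which under the description above would indicate that the learning model needs to be modified.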