KR101644740B1

KR101644740B1 - Method and system for evaluating data quality

Info

Publication number: KR101644740B1
Application number: KR1020150027944A
Authority: KR
Inventors: 이건명
Original assignee: 충북대학교 산학협력단
Priority date: 2015-02-27
Filing date: 2015-02-27
Publication date: 2016-08-01

Abstract

A method for evaluating data quality, performed in a data quality evaluation system, comprises: forming attribute sets of the data to be analyzed; calculating the probability of each attribute value for each attribute set; generating combinations of attribute values for each attribute set; calculating a salient degree for each combination of attribute values; calculating measures for quality evaluation for each combination of each attribute set by using the salient degree values of the combinations; and displaying the calculated measures as a graph. By presenting measures that express the amount of quality information in data, the invention lets a user check information about data quality and use that information in data transactions. The data quality evaluation system comprises: a data analysis module that receives and analyzes data; a calculation module that evaluates the data analyzed by the data analysis module; and a result derivation unit that displays the results calculated by the calculation module as a graph.

Description

[0001] The present invention relates to a method and system for evaluating data quality.

The present invention relates to evaluation measures for assessing data quality and to a method for evaluating data quality according to combinations of related attributes. More particularly, it relates to a method that receives, from a quality evaluator, attribute sets of the data that are expected to be related, and evaluates the quality of the data for each attribute set.

As the IT industry develops, the amount of data keeps growing, but data quality tends to decline because the data is not managed effectively. Moreover, because there are no clear criteria for data quality, users cannot trust the data they use. If quality criteria and a clear measure were provided, users could compare the quality of several data sets and obtain data of the desired quality.

The prior art addresses improvement or supplementation strategies for existing data in order to manage its quality. Such data quality management concerns data that a party holds itself. Data is now traded in the market as a product, however, and these quality management methods are not well suited to such a data market. In addition, because the evaluation criteria for data quality are unclear, it is not possible to tell how much better or worse the quality of one data set is than that of another.

As described above, data management by itself is not effective in a market where data is sold, so a means of expressing the quality information of data is needed. Given a measure that expresses quality information numerically or visually, users can choose data on the basis of its quality.

Korean Patent Laid-Open Publication No. 10-2009-0058143

The present invention has been devised to solve the problems described above. Its object is to provide evaluation criteria for assessing data quality and to give evaluators basic material for evaluation, thereby expressing the amount of quality information in data and, through this, providing users with high-quality data.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve this object, a data quality evaluation method according to the present invention, performed in a data quality evaluation system for evaluating the quality of data, includes: constructing attribute sets of the data to be analyzed; calculating the probability of each attribute value for each attribute set; generating combinations of attribute values for each attribute set; calculating a salient degree for each combination of attribute values; calculating measures for quality evaluation for each combination of each attribute set, using the salient degree values of the combinations; and displaying and outputting the calculated measures as a graph.

The step of calculating the measures for quality evaluation may include: calculating a maximum salient degree value for each attribute set; calculating, for each attribute set, the average of the largest salient degree values for a predetermined value k; calculating the average of all salient degree values for each attribute set; and calculating the entropy for each attribute set.

Let AS be a set of one or more attributes, $v(AS)$ the set of attribute values appearing in the attribute set AS, $a_i$ a particular attribute value in $v(AS)$, and $p_i$ the proportion of all data records in which the attribute value $a_i$ appears. Let $1/|v(AS)|$ be the probability when the $p_i$ are uniformly distributed, that is, the probability that any one of the attribute values occurs, and let $-\log_2 p_i$ be the amount of information in the sense of information theory. The salient degree may then be expressed by Equation 1:

$$SD(a_i) = \begin{cases} -\dfrac{1}{p_i}\log_2 p_i, & \text{if } p_i < \dfrac{1}{|v(AS)|} \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Equation 1)}$$
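As a hedged illustration of Equation 1, the following minimal Python sketch computes the salient degree of a single attribute value from its relative frequency. The function name `salient_degree` and its parameters are hypothetical names chosen for this example. The strict comparison with the uniform-distribution probability follows the rule stated in this document that values at or above that probability receive 0.

```python
import math


def salient_degree(p: float, n_values: int) -> float:
    """Salient degree of a value with relative frequency p.

    n_values is the number of distinct values |v(AS)| of the attribute set,
    so 1 / n_values is the uniform-distribution probability.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    uniform = 1.0 / n_values
    if p >= uniform:
        # Values at least as frequent as the uniform case are treated as normal.
        return 0.0
    # Information content -log2(p), scaled by 1/p to stress rare values.
    return -math.log2(p) / p
```

For instance, with four distinct values the uniform probability is 0.25, so a value seen in 12.5% of records gets a salient degree of 24, while a value seen in 25% of records gets 0.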

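Letting maxSD denote the maximum salient degree value, maxSD may be expressed by Equation 2, where $SD(a_i)$ is the salient degree of the attribute value $a_i$:

$$\mathit{maxSD} = \max_{a_i \in v(AS)} SD(a_i) \qquad \text{(Equation 2)}$$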

Letting avgTOP_kSD denote the average of the largest salient degree values for the predetermined value k, avgTOP_kSD may be expressed by Equation 3, where $SD_{(j)}$ is the j-th largest salient degree value in the attribute set:

$$\mathit{avgTOP_kSD} = \frac{1}{k}\sum_{j=1}^{k} SD_{(j)} \qquad \text{(Equation 3)}$$

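Letting avgSD denote the average of all salient degree values, taken over the attribute values that occur with less than the uniform-distribution probability, avgSD may be expressed by Equation 4, where $SD(a_i)$ is the salient degree of the attribute value $a_i$ and $L = \{a_i \in v(AS) : p_i < 1/|v(AS)|\}$:

$$\mathit{avgSD} = \frac{1}{|L|}\sum_{a_i \in L} SD(a_i) \qquad \text{(Equation 4)}$$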

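The entropy may be expressed by Equation 5, where $p_i$ is the proportion of all data records in which the attribute value $a_i$ appears:

$$\mathit{entropy} = -\sum_{a_i \in v(AS)} p_i \log_2 p_i \qquad \text{(Equation 5)}$$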
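The four measures can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the function name `quality_measures` and the dictionary keys are hypothetical, avgSD is averaged only over values rarer than the uniform-distribution probability (following the detailed description), and the entropy uses a base-2 logarithm to match $-\log_2 p_i$.

```python
import math


def quality_measures(probs, k):
    """Return the four measures for one attribute set.

    probs: relative frequencies of the distinct value combinations (sum to 1).
    k: how many of the largest salient degrees to average (assumed k >= 1).
    """
    uniform = 1.0 / len(probs)
    # Salient degree per value: zero unless the value is rarer than uniform.
    sd = [(-math.log2(p) / p) if p < uniform else 0.0 for p in probs]
    ranked = sorted(sd, reverse=True)
    top = ranked[:k]
    # Assumption: avgSD averages only the values below the uniform probability.
    rare = [s for s in sd if s > 0.0]
    return {
        "maxSD": ranked[0],
        "avgTOPkSD": sum(top) / len(top),
        "avgSD": sum(rare) / len(rare) if rare else 0.0,
        "entropy": -sum(p * math.log2(p) for p in probs if p > 0.0),
    }


measures = quality_measures([0.5, 0.25, 0.125, 0.125], k=3)
```

With these four probabilities only the two values at 0.125 fall below the uniform probability of 0.25, so maxSD and avgSD are both 24, the top-3 average is 16, and the entropy is 1.75 bits.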

In the step of displaying and outputting the calculated measures as a graph, the four measures maxSD, avgTOP_kSD, avgSD, and entropy may be displayed and output as a radial graph.
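A radial graph places each measure on its own axis and joins the points into a polygon. The sketch below only computes the polygon vertices and leaves the drawing to whatever plotting tool is used. The function name `radar_vertices` and the per-axis maximum `scale_max` are assumptions made for this example, since the source does not state how the axes are scaled.

```python
import math


def radar_vertices(measures, scale_max):
    """Map each measure onto its own axis of a radial graph.

    measures: dict of measure name -> value.
    scale_max: dict of measure name -> value drawn at radius 1.0 (assumed > 0).
    Returns a list of (name, x, y) vertices; the first axis points straight up
    and the remaining axes follow clockwise at equal angles.
    """
    names = list(measures)
    points = []
    for idx, name in enumerate(names):
        angle = math.pi / 2 - 2 * math.pi * idx / len(names)
        # Clip to the outer ring so one extreme value cannot leave the chart.
        radius = min(measures[name] / scale_max[name], 1.0)
        points.append((name, radius * math.cos(angle), radius * math.sin(angle)))
    return points


vertices = radar_vertices(
    {"maxSD": 24.0, "avgTOPkSD": 16.0, "avgSD": 24.0, "entropy": 1.75},
    {"maxSD": 48.0, "avgTOPkSD": 48.0, "avgSD": 48.0, "entropy": 2.0},
)
```

Scaling each axis separately is a design choice of this sketch: the salient degree measures and the entropy live on very different numeric ranges, so a shared scale would flatten the entropy axis.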

After the step of generating combinations of attribute values for each attribute set, if the probability of a generated combination is greater than or equal to the uniform-distribution probability, the salient degree value of the combination is calculated as 0; if the probability of the generated combination is less than the uniform-distribution probability, the method may proceed to the step of calculating the salient degree of the combination.

The data quality evaluation system of the present invention includes a data analysis module for receiving and analyzing data, a calculation module for evaluating the data analyzed by the data analysis module, and a result derivation unit for displaying and outputting the results calculated by the calculation module as a graph.

The data analysis module may include: an attribute set construction unit that receives related data attributes from the computer of an evaluator who evaluates the data; an attribute value extraction unit that extracts the related data attribute values from a database according to the data attributes received by the attribute set construction unit; and an attribute value combination unit that combines the data attribute values extracted by the attribute value extraction unit to generate the possible combination pairs.

The calculation module calculates, for each combination pair, a quality evaluation based on the probabilities for each attribute set, and the result derivation unit may display and output the quality evaluation results from the calculation module as a radial graph.

The calculation module may calculate, for each attribute set, a maximum salient degree measure, a measure that is the average of the largest salient degree values for a predetermined value k, a measure that is the average of all salient degree values, and an entropy measure.

In the system, the salient degree and the measures maxSD, avgTOP_kSD, avgSD, and entropy are defined by Equations 1 to 5, respectively, in the same way as in the method described above.

The result derivation unit may display and output the four measures maxSD, avgTOP_kSD, avgSD, and entropy as a radial graph.

According to the present invention, measures that express the amount of quality information in data are presented, so that a user can check information about data quality and put it to practical use in data transactions.

In addition, expressing the amount of quality information in the form of a radial graph makes information about data quality more easily accessible to the user.

FIG. 1 is a block diagram showing the configuration of a data quality evaluation system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a data quality evaluation method according to an embodiment of the present invention.
FIG. 3 is a radial graph representing the amount of quality information according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, elements, components, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the present invention, a detailed description of related known art is omitted where it is judged that it could unnecessarily obscure the gist of the invention.

FIG. 1 is a block diagram showing the configuration of a data quality evaluation system according to an embodiment of the present invention.

Referring to FIG. 1, the data quality evaluation system of the present invention includes a data analysis module 110, a calculation module 120, and a result derivation unit 130.

The data analysis module 110 is a module that receives and analyzes data.

The calculation module 120 is a module for evaluating the data analyzed by the data analysis module 110.

The result derivation unit 130 displays and outputs the results calculated by the calculation module 120 as a graph.

The data analysis module 110 includes an attribute set construction unit 112, an attribute value extraction unit 114, and an attribute value combination unit 116.

The attribute set construction unit 112 receives related data attributes from the computer 20 of an evaluator who evaluates the data.

The attribute value extraction unit 114 extracts the related data attribute values from the database 10 according to the data attributes received by the attribute set construction unit 112.

The attribute value combination unit 116 combines the data attribute values extracted by the attribute value extraction unit 114 to generate the possible combination pairs.

The calculation module 120 calculates, for each combination pair of each attribute set, a quality evaluation based on the probabilities for that attribute set.

The result derivation unit 130 displays and outputs the quality evaluation results from the calculation module 120 as a radial graph.
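The work of the data analysis module can be pictured with a small Python sketch. It assumes the records are already in memory as dictionaries rather than read from the database 10, and it counts only the value combinations that actually occur in the data; the function name `combination_probabilities` is hypothetical.

```python
from collections import Counter


def combination_probabilities(records, attribute_set):
    """Relative frequency of each value combination for one attribute set.

    records: iterable of dicts, one per data record.
    attribute_set: tuple of attribute names supplied by the evaluator.
    """
    # Attribute value extraction: project every record onto the attribute set.
    combos = [tuple(rec[name] for name in attribute_set) for rec in records]
    # Attribute value combination: count each distinct combination.
    counts = Counter(combos)
    total = len(combos)
    return {combo: count / total for combo, count in counts.items()}


records = [
    {"city": "A", "plan": "x"},
    {"city": "A", "plan": "x"},
    {"city": "A", "plan": "y"},
    {"city": "B", "plan": "x"},
]
probs = combination_probabilities(records, ("city", "plan"))
```

The resulting probabilities are what a calculation step would consume: here ("A", "x") has probability 0.5 and the other two combinations 0.25 each.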

FIG. 2 is a flowchart showing a data quality evaluation method according to an embodiment of the present invention.

Referring to FIG. 2, the data quality evaluation method according to an embodiment of the present invention proceeds as follows.

First, the attribute sets of the data to be analyzed are constructed (S201).

Next, the probability of each attribute value is calculated for each attribute set (S203).

Combinations of attribute values are then generated (S205).

For each combination of attribute values, it is checked whether the probability of the combination is greater than or equal to the uniform-distribution probability (S207).

If the probability of the combination is greater than or equal to the uniform-distribution probability, the salient degree value of the combination is calculated as 0 (S209); otherwise, the salient degree of the combination is calculated (S211).

After the salient degrees of the combinations have been calculated, the maximum salient degree value of the attribute set is calculated (S213).

Next, the average of the largest K% of the salient degree values of the attribute set is calculated (S215).

Next, the average of all salient degree values of the attribute set is calculated (S217).

Next, the entropy of the attribute set is calculated (S219).

Finally, the calculated values are output as a radial graph (S221).

To evaluate data quality in the present invention, the attributes or combinations of attributes judged to need analysis must first be identified. The calculation module receives the combinations of related attributes from the attribute value combination unit and measures the amount of quality information. The salient degree is presented as a measure that can express the amount of data quality information. For a set of related attributes, the salient degree looks at the probabilities in the data and takes a larger value for a combination of attribute values the less likely that combination is to occur.

The salient degree of data is expressed by Equation 1:

$$SD(a_i) = \begin{cases} -\dfrac{1}{p_i}\log_2 p_i, & \text{if } p_i < \dfrac{1}{|v(AS)|} \\ 0, & \text{otherwise} \end{cases} \qquad \text{(Equation 1)}$$

Here, AS is a set of one or more attributes, $v(AS)$ is the set of attribute values appearing in the attribute set AS, $a_i$ is a particular attribute value in $v(AS)$, and $p_i$ is the proportion of all data records in which the attribute value $a_i$ appears. $1/|v(AS)|$ is the probability when the $p_i$ are uniformly distributed, that is, the probability that any one of the attribute values occurs, and $-\log_2 p_i$ is the amount of information in the sense of information theory.

$-\log_2 p_i$ takes a larger value the more rarely a value appears. Because outliers are generally rare, a larger value of $-\log_2 p_i$ indicates a higher likelihood that the value is an outlier. This value is multiplied by $1/p_i$ to increase the sensitivity to small probability values.

FIG. 3 is a radial graph representing the amount of quality information according to an embodiment of the present invention.

In FIG. 3, the left side shows the function $-\log_2 p_i$ and the right side shows the function $-(1/p_i)\log_2 p_i$; the latter is more sensitive to small probability values. Attribute values whose probability is not below the uniform-distribution probability $1/|v(AS)|$ are given a value of 0, because a probability larger than that of the uniform distribution indicates normal data. Data quality is evaluated here by giving high values where the probability is markedly lower than the uniform-distribution probability, that is, for outlier data, so the salient degree in the remaining cases is set to 0.
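The difference in sensitivity between the two functions can be checked numerically. The short Python sketch below evaluates both at a few probabilities; the function names `information` and `weighted_information` are hypothetical labels for $-\log_2 p$ and $-(1/p)\log_2 p$.

```python
import math


def information(p):
    """Information content -log2(p) of an event with probability p."""
    return -math.log2(p)


def weighted_information(p):
    """Information content scaled by 1/p, as used in the salient degree."""
    return -math.log2(p) / p


for p in (0.25, 0.125, 0.0625):
    print(p, information(p), weighted_information(p))
```

Halving the probability from 0.25 to 0.125 raises the first function only from 2 to 3, but raises the second from 8 to 24, which is the increased sensitivity to small probabilities that the text describes.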

To express the amount of quality information in data visually on the basis of the salient degree, the present invention uses the four measures given by Equations 2 to 5.

In Equation 2, maxSD is the maximum salient degree value. The larger this value, the more it indicates that there is an attribute value with a very low frequency.

In Equation 3, avgTOP_kSD is the average of the k largest salient degree values, where k is set to the number corresponding to a specified ratio of the total number of data items. For example, integer values corresponding to 0.02% and 0.05% are used as typical values. If the attribute values falling within these ratios appear at very low rates, avgTOP_kSD is also large. A larger difference between this value and maxSD means that, among the rarely occurring values, there are some whose frequency is extremely low.
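How k follows from a ratio, and how the gap to maxSD is read, can be sketched as follows. The rounding rule in `top_k_count` (round to the nearest integer, but never below 1) is an assumption of this example, as are the function names and the sample salient degree values.

```python
def top_k_count(n_records, ratio):
    """Number of values to average for a given fraction of the data.

    Assumption: round to the nearest integer and use at least one value,
    so that small data sets still yield a usable k.
    """
    return max(1, round(n_records * ratio))


def avg_top_k(salient_degrees, k):
    """Average of the k largest salient degree values."""
    top = sorted(salient_degrees, reverse=True)[:k]
    return sum(top) / len(top)


sd = [64.0, 24.0, 8.0, 0.0, 0.0]  # illustrative salient degree values
k = top_k_count(n_records=10_000, ratio=0.0002)  # 0.02% of 10,000 records
gap = max(sd) - avg_top_k(sd, k)  # distance between maxSD and avgTOP_kSD
```

Here k is 2, the top-2 average is 44, and the gap of 20 to the maximum of 64 indicates that one value is far rarer than the next rarest.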

In Equation 4, avgSD is the average of the salient degree values of the attribute values that occur with less than the uniform-distribution probability.

In Equation 5, entropy is the entropy, which takes a larger value the more uniform the overall distribution of attribute values is. Because the maximum entropy is $\log|v(AS)|$, the distribution can be estimated from the difference between this value and entropy.
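The comparison with the maximum entropy can be made concrete with a small Python sketch. It assumes a base-2 logarithm for both quantities, and the function name `entropy_gap` is hypothetical.

```python
import math


def entropy_gap(probs):
    """Difference between the maximum entropy log2|v(AS)| and the entropy.

    A gap of 0 means the values are uniformly distributed; a larger gap
    means the distribution is more skewed.
    """
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    max_entropy = math.log2(len(probs))
    return max_entropy - entropy
```

For four equally likely values the gap is 0, while for the probabilities 0.5, 0.25, 0.125, and 0.125 the entropy is 1.75 bits against a maximum of 2 bits, giving a gap of 0.25.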

So that the four data quality measures described above can be checked at a glance, the present invention presents the results as a radial graph such as that of FIG. 3.

The graph in FIG. 3 can be interpreted through an understanding of each measure. In this example, the 0.02% avgTOP_kSD differs greatly from maxSD while the 0.05% avgTOP_kSD differs little from avgSD, so the attribute set can be interpreted as likely to contain unusual values. Using such a graph, one can select targets for analysis or determine whether analysis is needed. Visual presentation of this kind is a particularly important element of a data quality evaluation product.

The present invention has been described above using several preferred embodiments, but these embodiments are illustrative and not restrictive. A person of ordinary skill in the art to which the invention belongs will understand that various changes and modifications can be made without departing from the spirit of the invention and the scope of the rights set out in the appended claims.

100 Data quality evaluation system
110 Data analysis module
120 Calculation module
130 Result derivation unit
112 Attribute set construction unit
114 Attribute value extraction unit
116 Attribute value combination unit

Claims

A data quality evaluation method in a data quality evaluation system for evaluating the quality of data, the method comprising:
constructing attribute sets of data to be analyzed;
calculating, for each attribute set, a probability for each attribute value;
generating combinations of attribute values for each attribute set;
calculating a salient degree for each combination of attribute values;
calculating measures for quality evaluation for each combination of each attribute set, using the salient degree value of each combination of attribute values; and
displaying and outputting the calculated measures as a graph.

The method according to claim 1,
wherein the step of calculating the measures for quality evaluation comprises:
calculating a maximum salient degree value for each attribute set;
calculating, for each attribute set, an average of the largest salient degree values for a predetermined value k;
calculating an average of all salient degree values for each attribute set; and
calculating an entropy for each attribute set.

The method of claim 2,
wherein, when AS is a set of one or more attributes, $v(AS)$ is the set of attribute values appearing in the attribute set AS, $a_i$ is a particular attribute value in $v(AS)$, $p_i$ is the proportion of all data records in which the attribute value $a_i$ appears, $1/|v(AS)|$ is the probability when the $p_i$ are uniformly distributed, that is, the probability that any one of the attribute values occurs, and $-\log_2 p_i$ is the amount of information according to information theory,
the salient degree is expressed by Equation 1.

The method of claim 3,
wherein, when the maximum salient degree value is denoted maxSD,
maxSD is expressed by Equation 2.

The method of claim 4,
wherein, when the average of the largest salient degree values for the predetermined value k is denoted avgTOP_kSD,
avgTOP_kSD is expressed by Equation 3.

The method of claim 5,
wherein, when the average of all salient degree values is denoted avgSD,
avgSD is expressed by Equation 4.

The method of claim 6,
wherein the entropy is expressed by Equation 5.

The method of claim 7,
wherein, in the step of displaying and outputting the calculated measures as a graph,
the four measures maxSD, avgTOP_kSD, avgSD, and entropy are displayed and output as a radial graph.

The method according to claim 1,
wherein, after the step of generating the combinations of attribute values for each attribute set,
if the probability of a generated combination is greater than or equal to the uniform-distribution probability, the salient degree value of the combination is calculated as 0, and if the probability of the generated combination is less than the uniform-distribution probability, the method proceeds to the step of calculating the salient degree of the combination.

delete