KR20210085886A

KR20210085886A - Data profiling method and data profiling system using attribute value quality index

Info

Publication number: KR20210085886A
Application number: KR1020190179424A
Authority: KR
Inventors: 장원중
Original assignee: 가톨릭관동대학교산학협력단
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2021-07-08
Also published as: KR102365910B1

Abstract

In accordance with one embodiment of the present invention, provided is a data profiling method using an attribute value quality index. The data profiling method includes the following steps of: calculating at least one first statistical value regarding each attribute based on data included in each attribute defined in a data set; determining a weighted value assigned to the first statistical value in accordance with a preset first condition; calculating an attribute value quality index regarding each attribute based on a result of calculating the first statistical value and the weighted value assigned to the first statistical value; selecting at least one of the attributes, of which the attribute value quality index corresponds to a threshold condition, as a profiling target attribute; and calculating a quality index for the data set based on a result of calculating the first statistical value corresponding to the profiling target attribute and the weighted value assigned to the first statistical value. Therefore, the present invention is capable of increasing the accuracy of the attribute value quality index.

Description

Data profiling method and data profiling system using attribute value quality index

The present invention relates to a data profiling method and a data profiling system, and more particularly, to a data profiling method and a data profiling system using an attribute value quality index, which can efficiently profile the quality of data by utilizing the attribute value quality index of the data.

As we enter the era of the Fourth Industrial Revolution, more people and things are being connected to the Internet, and a large amount of data (big data) is being produced in the process (see FIG. 1).

In particular, the amount of digital information produced and distributed worldwide is increasing rapidly and is expected to reach 90 zettabytes (ZB) in 2020. In addition, the amount of digital information in Korea is expected to increase by about 57% per year on average.

A representative example of a technology that utilizes such big data is artificial intelligence (AI). Major companies around the world are focusing on securing AI technology, a core technology of the Fourth Industrial Revolution, to strengthen their competitiveness.

For example, global IT companies such as Google, Facebook, Microsoft, and IBM have adopted the development of AI algorithms as a major research field.

Machine learning, a core technology of AI, is a method in which a computer learns from data on its own to acquire high-level features.

For machine learning to be effective, not only a large amount of data but also high-quality data must be secured. This is because the higher the quality of the data provided for machine learning, the higher the accuracy of the judgments made using AI technology.

However, the development of data profiling technology for securing reliable, high-quality data is still insufficient.

In particular, data in digital form has increased significantly in recent years, so a profiling method suited to such data is required. However, a method for quickly and specifically profiling the quality of such data has not yet been established.

Therefore, there is a need to develop a technology that can efficiently profile the quality of data by systematically calculating a quality index of the data.

The technical problem to be solved by the present invention is to provide a data profiling method and a data profiling system using an attribute value quality index, which can efficiently profile the quality of data by utilizing the attribute value quality index of the data.

The technical problems to be solved by the present invention are not limited to the technical problem mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention belongs.

To achieve the above technical object, an embodiment of the present invention provides a data profiling method using an attribute value quality index, the method including: a) calculating at least one first statistical value for each attribute defined in a data set, based on the data included in that attribute; b) determining a weight assigned to the first statistical value according to a preset first condition; c) calculating an attribute value quality index for each attribute based on a result of an operation on the first statistical value and the weight assigned to the first statistical value; d) selecting, as a profiling target attribute, at least one of the attributes whose attribute value quality index meets a threshold condition; and e) calculating a quality index for the data set based on a result of an operation on the first statistical value corresponding to the profiling target attribute and the weight assigned to that first statistical value.

The first condition may include at least one detailed condition, the first statistical value may be calculated for each detailed condition belonging to the first condition, and the weight assigned to the first statistical value may be determined differently for each detailed condition to which the first statistical value corresponds.

The weight may be set higher for a detailed condition that assumes a higher data error rate.

The method may further include, before step c): calculating, for each attribute, at least one second statistical value calculated in a manner different from the first statistical value; and determining a weight to be assigned to the second statistical value according to a preset second condition. In step c), the attribute value quality index may be calculated based on the result of the operation on the first statistical value and the weight assigned to the first statistical value, and on a result of an operation on the second statistical value and the weight assigned to the second statistical value.

The first statistical value may be calculated based on at least one of a Z-Score, missing values, a minimum value, a maximum value, a mode, a mean, a variance, a standard deviation, a five-number summary, outliers, and near zero variance.

The method may further include, before step a), normalizing the data belonging to the data set for each attribute defined in the data set.

A computer-readable recording medium on which a computer program for executing the above method is recorded may be provided.

To achieve the above technical object, an embodiment of the present invention provides a data profiling system using an attribute value quality index, the system including: a statistical value calculation unit that calculates at least one first statistical value for each attribute defined in a data set, based on the data included in that attribute; a weighting unit that determines a weight assigned to the first statistical value according to a preset first condition; an attribute value quality index calculation unit that calculates an attribute value quality index for each attribute based on a result of an operation on the first statistical value and the weight assigned to the first statistical value; a profiling target attribute determination unit that selects, as a profiling target attribute, at least one of the attributes whose attribute value quality index meets a threshold condition; and a data profiling performing unit that calculates a quality index for the data set based on a result of an operation on the first statistical value corresponding to the profiling target attribute and the weight assigned to that first statistical value.

The statistical value calculation unit may calculate at least one second statistical value in a manner different from the first statistical value, the weighting unit may determine a weight to be assigned to the second statistical value according to a preset second condition, and the attribute value quality index may be calculated based on the result of the operation on the first statistical value and the weight assigned to the first statistical value, and on a result of an operation on the second statistical value and the weight assigned to the second statistical value.

The statistical value calculation unit may include a descriptive statistics analysis unit that normalizes, for each attribute, the data included in that attribute.

According to an embodiment of the present invention, attributes with a high need for profiling can be selected based on the attribute value quality index calculated for each attribute defined in the data set, so data profiling can be performed efficiently.

According to an embodiment of the present invention, the weight applied to the statistical value calculated for each attribute defined in the data set is assigned differently for each of a plurality of conditions, so the accuracy of the attribute value quality index can be increased.

The effects of the present invention are not limited to those described above, and should be understood to include all effects that can be inferred from the configuration of the invention described in the detailed description or the claims.

FIG. 1 is a diagram for explaining a conventional data set.
FIG. 2 is a block diagram schematically illustrating an overall system for providing data profiling according to an embodiment of the present invention.
FIG. 3 is a block diagram schematically illustrating a data profiling system according to an embodiment of the present invention.
FIG. 4 is a block diagram specifically illustrating a profiling target attribute selection unit according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the overall process of a data profiling system according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a data profiling method according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining an example of the process performed in each step of a data profiling method according to an embodiment of the present invention.

Hereinafter, the present invention will be described with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is therefore not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected (coupled, in contact, joined)" to another part, this includes not only the case where it is "directly connected" but also the case where it is "indirectly connected" with another member interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram schematically illustrating an overall system for providing data profiling according to an embodiment of the present invention.

As shown in FIG. 2, the overall system for providing data profiling according to an embodiment of the present invention may include an external server (A), a communication network (B), a data profiling system (1000), and a user device (C).

The external server (A) may store a data set to be profiled. The data set may take the form of a file or a database.

For example, the data set may consist of a plurality of data records. For example, the data set may include a plurality of predefined attributes.

For example, the data set may be structured data such as reference information, transaction information, or aggregate information. As another example, the data set may be semi-structured data such as HTML, XML, or GIS data. As yet another example, the data set may be unstructured data such as video or images.

In addition, the external server (A) may provide the data set to be profiled to the data profiling system (1000) or the user device (C) through the communication network (B).

The communication network (B) may be any known communication network, including wired and wireless communication.

For example, the communication network (B) may be a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), or the like. For example, the communication network (B) may be the well-known Internet or the World Wide Web (WWW).

In addition, the communication network (B) may interconnect, by wire or wirelessly, the external server (A), the data profiling system (1000), and the user device (C) included in the overall system according to an embodiment of the present invention.

The data profiling system (1000) may analyze the data set provided from the external server (A) to calculate a quality index for the data set.

Specifically, the data profiling system (1000) may calculate an attribute value quality index for each attribute predefined in the data set.

The data profiling system (1000) may then select, as a profiling target attribute, at least one of the attributes whose calculated attribute value quality index meets a threshold condition. In addition, the data profiling system (1000) may calculate the quality index for the data set based on the data belonging to the profiling target attribute.

The specific operation and configuration of the data profiling system (1000) are described in detail with reference to FIGS. 3 to 7.

The user device (C) may be connected to the external server (A) and the data profiling system (1000) through the communication network (B).

The user device (C) may receive the data profiling result, that is, the quality index for the data set, from the data profiling system (1000) and provide it to the user.

For example, the user device (C) may be any of various electronic devices such as a desktop computer, a laptop computer, a tablet, or a smartphone.

Meanwhile, the user device (C) may also upload a data set to be profiled to the data profiling system (1000).

FIG. 3 is a block diagram schematically illustrating the data profiling system (1000) according to an embodiment of the present invention. FIG. 4 is a block diagram specifically illustrating the profiling target attribute selection unit (200) according to an embodiment of the present invention.

As shown in FIG. 3, the data profiling system (1000) may include a data set management unit (100), a profiling target attribute selection unit (200), a data profiling performing unit (300), a memory (400), a communication unit (500), and a control unit (600).

The data set management unit (100) may obtain the data set to be profiled from the external server (A). In addition, the data set management unit (100) may convert the collected data set into a data set in a format suitable for data profiling.

The profiling target attribute selection unit (200) may select at least one of the plurality of attributes defined in the data set as a profiling target attribute.

To this end, as shown in (a) of FIG. 4, the profiling target attribute selection unit (200) may include a statistical value calculation unit (210), a weighting unit (220), an attribute value quality index calculation unit (230), and a profiling target attribute determination unit (240).

The statistical value calculation unit (210) may classify the data belonging to the data set according to a preset classification criterion for each attribute and calculate statistical values for the classified data.

Specifically, as shown in (b) of FIG. 4, the statistical value calculation unit (210) may include a descriptive statistics analysis unit (211), a first statistical value analysis unit (212), and a second statistical value analysis unit (213).

The descriptive statistics analysis unit (211) may normalize, for each attribute, the data included in each attribute defined in the data set. Here, normalization means standardizing the distribution of the data belonging to the data set for each attribute.

For example, the descriptive statistics analysis unit (211) may calculate an index value for the data belonging to each attribute predefined in the data set. For example, the index value may be any one of the mean, the variance, the standard deviation, and the mean deviation.

For example, the descriptive statistics analysis unit (211) may convert the data belonging to each attribute into Z-Scores based on the values of the data belonging to each attribute predefined in the data set and the index value.

The Z-Score may be calculated by Equation 1 below.

[Equation 1]

$$Z_i = \frac{x_i - \bar{x}}{s}$$

In Equation 1, $Z_i$ is the Z-Score of the i-th data item among the data belonging to the first attribute, $x_i$ is the value of the i-th data item among the data belonging to the first attribute, $\bar{x}$ is the mean of the data belonging to the first attribute, and $s$ is the standard deviation of the data belonging to the first attribute.
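
As a minimal sketch of Equation 1, the following Python function standardizes the values of a single attribute. The function name `z_scores` and the sample column are illustrative, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the description does not state which form of s is intended.

```python
import statistics


def z_scores(values):
    """Return the Z-Score of every value in one attribute (column).

    Assumption: s is the sample standard deviation; the description
    does not say whether the sample or population form is meant.
    Requires at least two values.
    """
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    if s == 0:
        # All values are identical, so every deviation from the mean is zero.
        return [0.0 for _ in values]
    return [(x - mean) / s for x in values]


# Hypothetical attribute values, for illustration only.
ages = [20, 22, 24, 26, 28]
print(z_scores(ages))
```

Values far from the attribute mean receive Z-Scores of large magnitude, which is what makes the Z-Score usable as a basis for outlier analysis.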

The first statistical value analysis unit (212) may calculate the first statistical value by analyzing, attribute by attribute, the data belonging to each attribute predefined in the data set.

For example, the first statistical value may be a statistical value calculated based on at least one of a Z-Score, missing values, a minimum value, a maximum value, a mode, a mean, a variance, a standard deviation, a five-number summary, outliers, and near zero variance.

For example, the first statistical value analysis unit (212) may analyze the missing values of each attribute predefined in the data set and calculate the result as the first statistical value.
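
Several of the statistics named above can be gathered per attribute with Python's standard `statistics` module. This is only an illustrative sketch: the function name `describe_attribute`, the use of `None` to mark a missing value, and the quartile convention are assumptions, not details taken from the description.

```python
import statistics


def describe_attribute(values):
    """Summarize one attribute; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    # statistics.quantiles uses the "exclusive" method by default; other
    # quartile conventions give slightly different Q1 and Q3 values.
    # At least two non-missing values are required.
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "missing_count": missing,
        "missing_ratio": missing / len(values),
        "min": min(present),
        "max": max(present),
        "mode": statistics.mode(present),
        "mean": statistics.mean(present),
        "variance": statistics.variance(present),
        "stdev": statistics.stdev(present),
        "five_number_summary": (min(present), q1, median, q3, max(present)),
    }


# Hypothetical column with one missing value.
summary = describe_attribute([1, 2, 2, 3, None, 4])
print(summary["missing_count"], summary["five_number_summary"])
```

Keeping the missing-value count separate from the other statistics makes it straightforward to treat it as the first statistical value in the example that follows in the text.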

The weighting unit (220) may determine the weight assigned to the first statistical value based on the first condition.

For example, when the first statistical value satisfies the first condition, the weighting unit (220) may assign a preset weight to the first statistical value.

Meanwhile, the first condition may include a plurality of detailed conditions.

The first statistical value may be calculated for each of the plurality of detailed conditions belonging to the first condition.

The weighting unit (220) may assign different weights to the first statistical values corresponding to the respective detailed conditions belonging to the first condition.

For example, when the first statistical value is a statistical value relating to missing values, the weight assigned to the first statistical value may be set as shown in Table 1.

[Table 1]
Category               Weight application criterion (first condition)   Weight
First condition - A    Number of missing values = 0                     0.0
First condition - B    0 < number of missing values ≤ 5%                1.2
First condition - C    5% < number of missing values ≤ 15%              1.5
First condition - D    Number of missing values > 15%                   2.0

Table 1 shows a case in which four detailed conditions (A, B, C, and D) belong to the first condition. As shown in Table 1, the weight applied to each detailed condition may be set higher for a detailed condition that assumes a higher data error rate.

Because the weight assigned to the first statistical value differs for each of the detailed conditions belonging to the first condition, the attribute value quality index of each attribute in the data set can be calculated more accurately.
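
The mapping in Table 1 could be sketched as a simple lookup. The sketch assumes that the percentages in Table 1 refer to the share of missing values among all records of an attribute; the function name `missing_value_weight` is hypothetical.

```python
def missing_value_weight(missing_count, total_count):
    """Map the share of missing values in one attribute to a Table 1 weight.

    Assumption: the percentages in Table 1 are the ratio of missing
    values to all records of the attribute.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if missing_count == 0:
        return 0.0  # first condition - A
    # Integer comparisons avoid floating-point rounding at the boundaries.
    if missing_count * 100 <= 5 * total_count:
        return 1.2  # first condition - B
    if missing_count * 100 <= 15 * total_count:
        return 1.5  # first condition - C
    return 2.0      # first condition - D


print(missing_value_weight(3, 100))   # 3% missing
print(missing_value_weight(20, 100))  # 20% missing
```

An attribute with no missing values receives a weight of 0.0 and therefore contributes nothing from this measurement item, while heavier loss of values is weighted progressively more.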

Meanwhile, the second statistical value analysis unit (213) may calculate the second statistical value by analyzing, attribute by attribute, the data belonging to each attribute predefined in the data set.

The second statistical value may be a statistical value calculated in a manner different from the first statistical value.

For example, when the first statistical value is a statistical value relating to missing values, the second statistical value may be a statistical value relating to outliers. In other words, the second statistical value analysis unit (213) may analyze the outliers of each attribute predefined in the data set and calculate the result as the second statistical value.

The weighting unit (220) may determine the weight assigned to the second statistical value based on the second condition.

For example, when the second statistical value satisfies the second condition, the weighting unit (220) may assign a preset weight to the second statistical value.

Meanwhile, the second condition may include a plurality of detailed conditions.

The second statistical value may be calculated for each of the plurality of detailed conditions belonging to the second condition.

The weighting unit (220) may assign different weights to the second statistical values corresponding to the respective detailed conditions belonging to the second condition.

For example, when the second statistical value is a statistical value relating to outliers, the weight assigned to the second statistical value may be set as shown in Table 2.

[Table 2]
Category                Weight application criterion (β, second condition)   Weight (α)
Second condition - A`   Z-Score ≤ | 2 |                                      0.0
Second condition - B`   | 2 | < Z-Score ≤ | 3 |                              1.2
Second condition - C`   | 3 | < Z-Score ≤ | 4 |                              1.5
Second condition - D`   Z-Score > | 4 |                                      2.0

Table 2 shows a case in which four detailed conditions (A`, B`, C`, and D`) belong to the second condition. Because the weight assigned to the second statistical value differs for each of the detailed conditions belonging to the second condition, the attribute value quality index of each attribute in the data set can be calculated more accurately.
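
Table 2 could be sketched as a per-record lookup. The sketch assumes that the bars in Table 2 denote the absolute value of the Z-Score, so that large negative and large positive deviations are weighted alike; the function names are hypothetical. The helper `count_weighted_records` counts the records whose weight is other than 0.0, which is how the description later characterizes the records extracted by the β criterion.

```python
def z_score_weight(z):
    """Map one record's Z-Score to a Table 2 weight (alpha).

    Assumption: the bars in Table 2 mean the absolute value of the Z-Score.
    """
    magnitude = abs(z)
    if magnitude <= 2:
        return 0.0  # second condition - A`
    if magnitude <= 3:
        return 1.2  # second condition - B`
    if magnitude <= 4:
        return 1.5  # second condition - C`
    return 2.0      # second condition - D`


def count_weighted_records(z_values):
    """Count the records whose Table 2 weight is other than 0.0."""
    return sum(1 for z in z_values if z_score_weight(z) != 0.0)


# Hypothetical Z-Scores for one attribute.
sample = [0.1, -2.5, 3.5, 4.2, 1.9]
print([z_score_weight(z) for z in sample])
print(count_weighted_records(sample))
```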

The attribute value quality index calculation unit 230 may calculate the attribute value quality index based on the first statistical value, the second statistical value, and the weight assigned to each statistical value.

For example, the attribute value quality index may be calculated by Equation 2 below.

[Equation 2]

In Equation 2, [symbol omitted] denotes the Attribute Value Quality Index (AVQI) of the i-th attribute, k denotes a measurement item, and n denotes the number of measurement items.

[symbol omitted] is the value obtained by multiplying by the weight corresponding to the weight application criterion of the k-th measurement item.

[symbol omitted] is the number of data records for which the weight of the measurement item is 0 or more, that is, the number of measurement items. (However, if no record falls under a weight application criterion whose weight is other than 0, the attribute value quality index of that attribute is 0.)

For example, [symbol omitted] can be calculated by Equation 3 below.

[Equation 3]

For example, [symbol omitted] can be calculated by Equation 4 below.

[Equation 4]

In Equations 3 and 4, the number of records extracted by the β criterion is the number of records for which the weight (α) of the weight application criterion (β) is other than 0.0.

The profiling target attribute determination unit 240 may select, as a profiling target attribute, at least one of the attributes whose attribute value quality index satisfies a threshold condition.

For example, the profiling target attribute determination unit 240 may determine that an attribute whose attribute value quality index is greater than a threshold value (e.g., 0) is a profiling target attribute.

Meanwhile, the data profiling performing unit 300 may calculate a quality index for the data set based on the result of an operation on the statistical values corresponding to the profiling target attributes selected by the profiling target attribute selection unit 200 and the weights assigned to those statistical values.

For example, the quality index for the data set may be calculated by Equation 5 below.

[Equation 5]

In Equation 5, SDVQI (Structured Data Value Quality Index) denotes the quality index for the data set, i denotes the i-th attribute, and n is the number of profiling target attributes.

[symbol omitted] is the value obtained by multiplying the statistical value of the i-th attribute by the weight (α) corresponding to the weight application criterion (β) of the i-th attribute.

[symbol omitted] is the number of data records corresponding to the weight application criterion (β) of the i-th attribute, that is, the number of measurement items.

For example, SDVQI may be a quality index for a structured data set.

Hereinafter, the data quality efficiency of the data profiling system 1000 according to an embodiment of the present invention will be examined with reference to the example of Table 3.

Table 3 shows the first statistical value (missing values) and the second statistical value (outliers) calculated using the Deli Weather data set registered on Kaggle.

[Table 3]

Weight application criteria by condition:

Category         Weight application criterion           Weight
Missing value    number of missing values = 0           0.0
Missing value    0 < number of missing values ≤ 5%      1.2
Missing value    5% < number of missing values ≤ 15%    1.5
Missing value    number of missing values > 15%         2.0
Outlier          Z-Score ≤ |2|                          0.0
Outlier          |2| < Z-Score ≤ |3|                    1.2
Outlier          |3| < Z-Score ≤ |4|                    1.5
Outlier          Z-Score > |4|                          2.0

Record counts (empty cells were lost, so the attribute column of each count is not recoverable):

Attributes dewptm, fog, hail, hum, pressurem:
Missing value, weight 1.2:    621, 757, 232
Outlier, weight 1.2:          826, 780
Outlier, weight 1.5:          26, 7038, 1
Outlier, weight 2.0:          5, 13, 2, 1

Attributes rain, snow, tempm, thunder, tornado, vism, wdird, wspdm, date, time (row boundaries also not recoverable):
673, 4428, 2358, 14755, 3080, 9, 294, 1, 34, 2652, 1, 4, 952, 2, 1, 3, 133

AVQI by attribute:

dewptm       0.028
fog          0.50
hail         1.0
hum          0.201
pressurem    0.203
rain         1.000
snow         1.000
tempm        0.201
thunder      1.000
tornado      1.000
vism         0.200
wdird        0.500
wspdm        0.241
date         0.000
time         0.000
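The missing-value criteria of Table 3 can be sketched in the same way. The function name is hypothetical, and the sketch assumes that the percentages in Table 3 are the share of an attribute's records that are missing. The source states the bands but not the denominator.

```python
def missing_value_weight(missing_count: int, total_records: int) -> float:
    """Return the Table 3 weight for one attribute's share of missing values."""
    if total_records <= 0 or not 0 <= missing_count <= total_records:
        raise ValueError("need total_records > 0 and 0 <= missing_count <= total_records")
    ratio = missing_count / total_records  # assumed denominator: all records
    if ratio == 0:
        return 0.0  # no missing values
    if ratio <= 0.05:
        return 1.2  # 0 < missing <= 5%
    if ratio <= 0.15:
        return 1.5  # 5% < missing <= 15%
    return 2.0      # missing > 15%
```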

As shown in Table 3, 15 attributes (dewptm, fog, hail, hum, pressurem, rain, snow, tempm, thunder, tornado, vism, wdird, wspdm, date, time) may be defined in the Deli Weather data set.

As described for Equations 2 to 4, the 13 attributes whose AVQI is greater than 0, out of the 15 attributes, may be selected as profiling target attributes.
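The selection step can be illustrated with the AVQI values listed in Table 3. The dictionary and function names below are hypothetical. The threshold of 0 is the example threshold given in the description.

```python
# AVQI values per attribute, copied from the AVQI row of Table 3.
AVQI_BY_ATTRIBUTE = {
    "dewptm": 0.028, "fog": 0.50, "hail": 1.0, "hum": 0.201, "pressurem": 0.203,
    "rain": 1.000, "snow": 1.000, "tempm": 0.201, "thunder": 1.000,
    "tornado": 1.000, "vism": 0.200, "wdird": 0.500, "wspdm": 0.241,
    "date": 0.000, "time": 0.000,
}


def select_profiling_targets(avqi: dict[str, float], threshold: float = 0.0) -> list[str]:
    """Keep the attributes whose AVQI is strictly greater than the threshold."""
    return [name for name, score in avqi.items() if score > threshold]


targets = select_profiling_targets(AVQI_BY_ATTRIBUTE)
```

With these values, `date` and `time` are excluded and the remaining 13 attributes are kept, which matches the count stated above.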

Meanwhile, a data quality efficiency measurement value (DQEM) may be calculated by Equation 6 below.

[Equation 6]

In Equation 6, S is the product of the number of data attributes and the number of data records, and m is the product of the number of profiling target attributes and the number of data records. The closer the data quality efficiency measurement value is to 100%, the better the performance.

Accordingly, in the example of Table 3, the data quality efficiency measurement value may be calculated as 13.333%.

That is, the data quality efficiency measurement value of the example of Table 3 (13.333%) is better than the value obtained when profiling target attributes are not selected based on the attribute value quality index (0%).
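The two reported values can be reproduced with a short calculation. Equation 6 itself is not available in this text, so the form used below, (1 - m/S) × 100, is an assumption inferred from the definitions of S and m and from the reported results (13.333% with 13 of 15 attributes selected, 0% with no selection). The function name is hypothetical, and the record count is an arbitrary placeholder because it cancels out of the ratio.

```python
def data_quality_efficiency(total_attributes: int, target_attributes: int,
                            record_count: int) -> float:
    """Return DQEM in percent, assuming DQEM = (1 - m / S) * 100."""
    if total_attributes <= 0 or record_count <= 0:
        raise ValueError("total_attributes and record_count must be positive")
    s = total_attributes * record_count   # S: all attributes x records
    m = target_attributes * record_count  # m: profiling targets x records
    return (1 - m / s) * 100


# 13 of 15 attributes selected; 1000 records is a placeholder value.
with_selection = data_quality_efficiency(15, 13, 1000)
# No selection: every attribute is profiled, so m equals S.
without_selection = data_quality_efficiency(15, 15, 1000)
```

Under this assumed form, the measure is the share of attribute-record cells that profiling can skip, which is consistent with the statement that values closer to 100% are better.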

The memory 400 may store the data set, the types of attributes defined in the data set, the statistical values for those attributes, the weights assigned to those statistical values, the profiling target attributes, the attribute value quality indexes, the quality index for the data set, and the like.

For example, the memory 400 may be a computer- or human-readable recording medium.

The communication unit 500 may communicate with an external server or a user device C through a communication network B.

The control unit 600 may control the operation of the components of the data profiling system 1000 and the data flow between those components.

FIG. 5 is a diagram for explaining the overall process of the data profiling system 1000 according to an embodiment of the present invention.

As shown in FIG. 5, the data set management unit 100 may obtain, from the external server A, a data set to be subjected to data profiling.

The data set management unit 100 may obtain one or more data sets. That is, the data set management unit 100 may obtain a plurality of data sets 101 and 102.

The statistical value calculation unit 210 may calculate statistical values for the data set obtained by the data set management unit 100.

The statistical value calculation unit 210 may include a plurality of statistical value calculation sub-units, each of which calculates statistical values for a different type of data.

For example, the statistical value calculation unit 210 may include a first statistical value calculation sub-unit 210a and a second statistical value calculation sub-unit 210b.

For example, the first statistical value calculation sub-unit 210a may calculate statistical values for the structured data among the data belonging to the data set.

Specifically, the first statistical value calculation sub-unit 210a may calculate the statistical values for the structured data using the descriptive statistics analysis unit 211, the first statistical value analysis unit 212, and the second statistical value analysis unit 213 described above.

For example, the second statistical value calculation sub-unit 210b may calculate statistical values for the unstructured data among the data belonging to the data set.

Specifically, the second statistical value calculation sub-unit 210b may include a TF-IDF (Term Frequency - Inverse Document Frequency) analysis unit 214 and a word embedding analysis unit 215.

The TF-IDF analysis unit 214 may analyze the importance of a specific word in the data set.

For example, the TF-IDF analysis unit 214 may calculate a term frequency (TF) value within the data set.

The TF-IDF analysis unit 214 may then calculate a document frequency (DF) value and an inverse document frequency (IDF) value based on the term frequency value, and may calculate a TF-IDF value based on the term frequency value, the document frequency value, and the inverse document frequency value.
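The TF, DF, and IDF quantities named above can be sketched as follows. The description does not state which TF-IDF variant is used, so this example assumes the common textbook form: raw term counts for TF and the natural logarithm of (number of documents / DF) for IDF. The function name and the tokenized input format are also assumptions.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: TF-IDF score} dictionary per tokenized document."""
    n_docs = len(documents)
    # DF: the number of documents in which each term appears at least once.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)  # TF: raw count of each term in this document
        scores.append({
            term: count * math.log(n_docs / df[term])  # TF x IDF
            for term, count in tf.items()
        })
    return scores


docs = [["rain", "fog", "rain"], ["fog", "snow"]]
scores = tf_idf(docs)
```

In this toy input, `fog` appears in every document and therefore scores 0, while `rain`, which is frequent in only one document, receives the highest score.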

The word embedding analysis unit 215 may analyze the relationships between words included in the data set using an artificial neural network. For example, the word embedding analysis unit 215 may analyze these relationships based on the Word2Vec algorithm.

As described with reference to FIG. 4, the quality index for the data set may be calculated based on the statistical values calculated by each statistical value calculation sub-unit.

For example, a first data set quality index may be calculated based on the statistical values calculated by the first statistical value calculation sub-unit 210a.

A second data set quality index may be calculated based on the statistical values calculated by the second statistical value calculation sub-unit 210b.

A comprehensive data set quality index may be calculated based on the first data set quality index and the second data set quality index.

FIG. 6 is a diagram for explaining a data profiling method according to an embodiment of the present invention. FIG. 7 is a diagram for explaining an example of the process performed in each step of the data profiling method according to an embodiment of the present invention.

As shown in FIGS. 6 and 7, a step of preparing a data set composed of a plurality of data records may be performed. (S100)

For example, the data set may be transmitted from an external server A or an external storage device (not shown) to the data profiling system 1000 via a communication network B.

Specifically, the data set may be transmitted to the data set management unit 100 of the data profiling system 1000 and may be stored in the memory 400.

For example, a data record may consist of data divided according to the attributes defined in the data set.

Next, a step of calculating at least one first statistical value for each attribute, based on the data included in each attribute defined in the data set, may be performed. (S200)

For example, the first statistical value may be calculated for each attribute.

For example, a first statistical value for a first attribute may be calculated. The first statistical value may be calculated separately for each of a plurality of detailed conditions (first condition - A, first condition - B, first condition - C) belonging to the first condition.

Depending on the required level of profiling, a second statistical value for the first attribute may also be calculated. The second statistical value may be calculated separately for each of a plurality of detailed conditions (second condition - A, second condition - B, second condition - C) belonging to the second condition.

Meanwhile, although FIG. 7 shows the same conditions and weights set for the first attribute and the second attribute, the present invention is not limited thereto. That is, the conditions and weights applied to the first attribute and the second attribute may differ.

Next, a step of determining the weight assigned to the first statistical value according to a preset first condition may be performed. (S300)

When the first statistical value satisfies the first condition, a preset weight may be assigned.

For example, when the first condition includes a plurality of detailed conditions, the weights (a1, a2, a3) assigned to the first statistical values corresponding to the respective detailed conditions may differ from one detailed condition to another.

In addition, when a second statistical value has been calculated in step S200, a preset weight may be assigned to the second statistical value based on the second condition.

Likewise, when the second condition includes a plurality of detailed conditions, the weights (b1, b2, b3) assigned to the second statistical values corresponding to the respective detailed conditions may differ from one detailed condition to another.

Next, a step of calculating an attribute value quality index for each attribute, based on the result of an operation on the first statistical value and the weight assigned to the first statistical value, may be performed. (S400)

The attribute value quality index may be calculated for each attribute belonging to the data set.

When a second statistical value has been calculated, the result of an operation on the second statistical value and the weight assigned to the second statistical value may also be used to calculate the attribute value quality index.

Next, a step of selecting, as a profiling target attribute, at least one of the attributes whose attribute value quality index satisfies a threshold condition may be performed. (S500)

For example, an attribute whose attribute value quality index is greater than a threshold value (e.g., 0) may be selected as a profiling target attribute.

Next, a step of calculating a quality index for the data set, based on the result of an operation on the first statistical value corresponding to the profiling target attribute and the weight assigned to the first statistical value, may be performed. (S600)

When a second statistical value has been calculated, the result of an operation on the second statistical value corresponding to the profiling target attribute and the weight assigned to the second statistical value may also be used to calculate the quality index for the data set.

For example, the quality index for the data set may be calculated with reference to Equation 5.

The above description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the claims below, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: data set management unit    200: profiling target attribute selection unit
210: statistical value calculation unit    220: weighting unit
230: attribute value quality index calculation unit    240: profiling target attribute determination unit
300: data profiling performing unit    400: memory
500: communication unit    600: control unit

Claims

1. A data profiling method using an attribute value quality index, comprising:
a) calculating at least one first statistical value for each attribute based on data included in each attribute defined in a data set;
b) determining a weight assigned to the first statistical value according to a preset first condition;
c) calculating an attribute value quality index for each attribute based on a result of an operation on the first statistical value and the weight assigned to the first statistical value;
d) selecting, as a profiling target attribute, at least one of the attributes whose attribute value quality index satisfies a threshold condition; and
e) calculating a quality index for the data set based on a result of an operation on the first statistical value corresponding to the profiling target attribute and the weight assigned to the first statistical value.

2. The method of claim 1,
wherein the first condition includes at least one detailed condition,
the first statistical value is calculated for each detailed condition belonging to the first condition, and
the weight assigned to the first statistical value is determined differently for each detailed condition to which the first statistical value corresponds.

3. The method of claim 2,
wherein a higher weight is determined for a detailed condition that presumes a higher data error rate.

4. The method of claim 1, further comprising, prior to step c):
calculating, for each attribute, at least one second statistical value calculated in a different manner from the first statistical value; and
determining a weight assigned to the second statistical value according to a preset second condition,
wherein, in step c), the attribute value quality index is calculated based on the result of the operation on the first statistical value and the weight assigned to the first statistical value, and on a result of an operation on the second statistical value and the weight assigned to the second statistical value.

5. The method of claim 1,
wherein the first statistical value is calculated based on at least one of a Z-Score, missing values, a minimum value, a maximum value, a mode, a mean value, a variance, a standard deviation, a five-number summary, outliers, and near-zero variance.

6. The method of claim 1, further comprising, before step a):
normalizing the data belonging to the data set for each attribute defined in the data set.

7. A computer-readable recording medium on which a computer program for executing the method according to any one of claims 1 to 6 is recorded.

8. A data profiling system using an attribute value quality index, comprising:
a statistical value calculation unit configured to calculate at least one first statistical value for each attribute based on data included in each attribute defined in a data set;
a weighting unit configured to determine a weight assigned to the first statistical value according to a preset first condition;
an attribute value quality index calculation unit configured to calculate an attribute value quality index for each attribute based on a result of an operation on the first statistical value and the weight assigned to the first statistical value;
a profiling target attribute determination unit configured to select, as a profiling target attribute, at least one of the attributes whose attribute value quality index satisfies a threshold condition; and
a data profiling performing unit configured to calculate a quality index for the data set based on a result of an operation on the first statistical value corresponding to the profiling target attribute and the weight assigned to the first statistical value.

9. The system of claim 8,
wherein the first condition includes at least one detailed condition,
the first statistical value is calculated for each detailed condition belonging to the first condition, and
the weight assigned to the first statistical value is determined differently for each detailed condition to which the first statistical value corresponds.

10. The system of claim 9,
wherein a higher weight is determined for a detailed condition that presumes a higher data error rate.

11. The system of claim 8,
wherein the statistical value calculation unit calculates at least one second statistical value calculated in a different manner from the first statistical value,
the weighting unit determines a weight assigned to the second statistical value according to a preset second condition, and
the attribute value quality index is calculated based on the result of the operation on the first statistical value and the weight assigned to the first statistical value, and on a result of an operation on the second statistical value and the weight assigned to the second statistical value.

12. The system of claim 8,
wherein the statistical value calculation unit includes a descriptive statistics analysis unit configured to normalize, for each attribute, the data included in that attribute.