KR102240496B1 - Data quality management system and method

Info

Publication number: KR102240496B1
Application number: KR1020200046436A
Authority: KR
Inventors: 이우용
Original assignee: 주식회사 한국정보기술단; 이우용
Priority date: 2020-04-17
Filing date: 2020-04-17
Publication date: 2021-04-15

Abstract

The present invention relates to a data quality management system and a method thereof. The system comprises: an account generation unit for generating account information in which a unique identification ID is assigned to an institution to be diagnosed and an administrator; a connection management unit for distributing client access or web access by determining whether access to a DB of the institution to be diagnosed is possible; a schema analysis unit for accessing the DB of the institution to be diagnosed, querying tables and columns, and analyzing a schema structure when direct access to the DB of the institution to be diagnosed is possible; a CSV registration unit for registering CSV files for the columns stored in the DB of the institution to be diagnosed when direct access to the DB of the institution to be diagnosed is not possible; and a data analysis unit for indexing inaccurate instances of data stored in the DB of the institution to be diagnosed according to set indicators and patterns and performing data profiling to extract invalid values, structural violations, and data rule violations. According to the present invention as described above, the data quality management system, which was limited to web-based operation, is operated by applying a web version or a client version depending on data characteristics and management purpose, thereby significantly reducing the analysis time required for data quality diagnosis and resolving security risks.

Description

Data quality management system and method

The present invention relates to a data quality management system and a method thereof. More particularly, it relates to a technique in which a data quality management system that has so far been operated only on a web basis is instead operated as a web version or a client version, depending on data characteristics and management purpose, thereby significantly shortening the analysis time required for data quality diagnosis and resolving security risks.

Conventional data quality management solutions are designed and operated on a web basis. As a result, data quality diagnosis places a load on the current system of the institution concerned, and data analysis takes a long time.

In addition, for public institutions, security is vital to data-related work. When data quality is managed through a web-based solution, however, integrity cannot be guaranteed, for example when verifying whether the managed data has been altered, damaged, or destroyed.

Moreover, most data quality management solutions are designed to meet standards announced by agencies or ministries. They are therefore restricted to specific indicators and patterns, without regard to the data characteristics of each institution or company, and lack diversity.

Furthermore, adding or changing an indicator or pattern to suit the data characteristics of a particular institution requires consulting or relying on the solution developer, which makes the institution dependent on that developer.

Republic of Korea Patent Publication No. 10-2007-0057806 (published June 7, 2007)

An object of the present invention is to significantly shorten the analysis time required for data quality diagnosis and to resolve security risks by operating a data quality management system, previously operated only on a web basis, as a web version or a client version depending on data characteristics and management purpose.

Another object of the present invention is to provide a client-based data quality management solution that takes the data characteristics of each institution or company into account, so that the indicators and patterns used for data analysis can be extended easily without dependence on the solution developer, and so that applying a variety of indicators and patterns improves the accuracy of data quality diagnosis results.

One embodiment of the present invention for achieving this technical object is a data quality management system comprising: an account generation unit that generates account information in which a unique identification ID is assigned to an institution to be diagnosed and an administrator; a connection management unit that determines whether the DB of the institution to be diagnosed can be accessed and distributes client access or web access accordingly; a schema analysis unit that, when direct access to the DB of the institution to be diagnosed is possible, accesses the DB, queries its tables and columns, and analyzes the schema structure; a CSV registration unit that, when direct access to the DB is not possible, registers CSV files for the columns stored in the DB; and a data analysis unit that indexes inaccurate instances of the data stored in the DB according to set indicators and patterns, and performs data profiling to extract invalid values, structural violations, and data rule violations.

Preferably, the connection management unit includes: an access mode detection module that determines whether direct access to the DB of the institution to be diagnosed, generated by the account generation unit, is possible; a client access module that activates a client access mode when direct access to the DB is possible; and a web access module that activates a web access mode when direct access to the DB is not possible.

The data analysis unit includes: a criterion setting module that receives indicators and patterns for data profiling; an instance indexing module that indexes instances of the data stored in the DB of the institution to be diagnosed so that they match the received indicators and patterns; and a profiling module that performs data profiling on the indexed instances to extract invalid values, structural violations, and data rule violations.

The system further includes a result output unit that sorts the profiling results of the data analysis unit by schema analysis information and generates a report conforming to a preset format.

A data quality management method according to an embodiment of the present invention, based on the system described above, comprises: step (a), in which the account generation unit generates account information in which a unique identification ID is assigned to an institution to be diagnosed and an administrator; step (b), in which the connection management unit determines whether direct access to the DB of the institution to be diagnosed is possible; step (c), in which, when step (b) determines that direct access is possible, the DB is accessed, its tables and columns are queried, and the schema structure is analyzed; and step (d), in which the data analysis unit performs data profiling on inaccurate instances of the data stored in the DB according to set indicators and patterns.

Preferably, the method includes step (e), in which, when step (b) determines that direct access to the DB of the institution to be diagnosed is not possible, the CSV registration unit registers CSV files for the columns stored in the DB.

According to the present invention as described above, a data quality management system that was previously operated only on a web basis is operated as a web version or a client version depending on data characteristics and management purpose, which significantly shortens the analysis time required for data quality diagnosis and resolves security risks.

According to the present invention, providing a client-based data quality management solution that takes the data characteristics of each institution or company into account makes it easy to extend the indicators and patterns used for data analysis without dependence on the solution developer, and applying a variety of indicators and patterns improves the accuracy of data quality diagnosis results.

FIG. 1 is a block diagram showing a data quality management system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the detailed configuration of the data quality management system according to an embodiment of the present invention.
FIG. 3A is an exemplary view showing schema analysis information for tables in the data quality management system according to an embodiment of the present invention.
FIG. 3B is an exemplary view showing schema analysis information for columns in the data quality management system according to an embodiment of the present invention.
FIG. 4 is an exemplary view showing manual registration of a CSV file by the CSV registration unit of the data quality management system according to an embodiment of the present invention.
FIG. 5A is an exemplary view showing selection of the tables and columns to be diagnosed in the data quality management system according to an embodiment of the present invention.
FIG. 5B is an exemplary view showing input of the indicators and patterns corresponding to a column to be diagnosed in the data quality management system according to an embodiment of the present invention.
FIG. 6 is a block diagram showing an additional configuration of the data quality management system according to an embodiment of the present invention.
FIG. 7 is an exemplary view showing a report of the data quality management system according to an embodiment of the present invention.
FIG. 8 is a flowchart showing a data quality management method according to an embodiment of the present invention.
FIG. 9 is a flowchart showing the detailed process of step S806 of the data quality management method according to an embodiment of the present invention.
FIG. 10 is a flowchart showing the detailed process of step S808 of the data quality management method according to an embodiment of the present invention.

Specific features and advantages of the present invention will become more apparent from the following detailed description based on the accompanying drawings. Before that, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concept of a term in order to describe his or her invention in the best way. It should also be noted that detailed descriptions of known functions and configurations related to the present invention have been omitted where they were judged likely to obscure the subject matter of the present invention unnecessarily.

Referring to FIG. 1, a data quality management system 100 according to an embodiment of the present invention comprises: an account generation unit 110 that generates account information in which a unique identification ID is assigned to a data quality diagnosis target (hereinafter, "diagnosis target institution") and an administrator; a connection management unit 120 that determines whether the database of the diagnosis target institution (hereinafter, "diagnosis target institution DB") can be accessed and distributes client access or web access accordingly; a schema analysis unit 130 that, when direct access to the diagnosis target institution DB is possible, accesses the DB, queries its tables and columns, and analyzes the schema structure; a CSV registration unit 140 that, when direct access to the diagnosis target institution DB is not possible, registers CSV files for the tables and columns stored in the DB; and a data analysis unit 150 that indexes inaccurate instances of the data stored in the diagnosis target institution DB according to set indicators and patterns, and performs data profiling to extract invalid values, structural violations, and data rule violations.

Although it is not described further below, the data quality management system according to an embodiment of the present invention may run in a WAS (Web Application Server) environment, in which case the platform required to run the system, including the programming language and the database, is installed automatically. The installed programming language may be Java and the installed database may be MariaDB, but embodiments of the present invention are not limited to these. In addition, an instance in an embodiment of the present invention is preferably understood as an object (a memory allocation) assigned to access the diagnosis target institution DB.

The detailed configuration of the data quality management system 100 according to an embodiment of the present invention is described below with reference to FIG. 2.

Specifically, the account generation unit 110 comprises an account management module 112 that generates account information by assigning a unique identification ID to each diagnosis target institution and administrator, and a DB creation module 114 that creates a diagnosis target institution DB for data quality management, matched to the account information.
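The role of the account management module 112 can be sketched roughly as follows. This is only an illustration: the patent does not say how the unique identification ID is produced, so the use of a random UUID, the `Account` class, the `create_account` function, and the `dq_` naming convention for the matched DB are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Account:
    """Account information for one institution/administrator pair (hypothetical shape)."""
    institution: str
    administrator: str
    # Assumption: a random UUID serves as the unique identification ID.
    account_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def db_name(self) -> str:
        # Assumption: the DB matched to this account is named after the account ID.
        return f"dq_{self.account_id}"


def create_account(institution: str, administrator: str) -> Account:
    """Issue a new account with a fresh identification ID."""
    return Account(institution, administrator)
```

Deriving the DB name from the account ID is one simple way to keep the account information and its diagnosis DB matched, as module 114 requires; a real deployment would more likely store the mapping explicitly.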

The connection management unit 120 comprises an access mode detection module 122 that determines whether direct access to the diagnosis target institution DB generated by the account generation unit 110 is possible, a client access module 124 that activates the client access mode when direct access to the diagnosis target institution DB is possible, and a web access module 126 that activates the web access mode when direct access to the diagnosis target institution DB is not possible.
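A minimal sketch of this mode selection is shown below. The patent does not specify how "direct access is possible" is detected, so treating a successful TCP connection to the DB endpoint as the test is an assumption, as are the names `AccessMode`, `tcp_probe`, and `select_access_mode`.

```python
import socket
from enum import Enum
from typing import Callable


class AccessMode(Enum):
    CLIENT = "client"  # direct DB access is possible
    WEB = "web"        # no direct access; data is supplied another way


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Assumed reachability test: can a TCP connection to the DB endpoint be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def select_access_mode(can_connect: Callable[[], bool]) -> AccessMode:
    """Pick the client mode when the probe succeeds, otherwise the web mode."""
    return AccessMode.CLIENT if can_connect() else AccessMode.WEB
```

A caller might write `select_access_mode(lambda: tcp_probe("db.example.org", 3306))`, where the host and port are placeholders. Passing the probe in as a callable keeps the decision logic separate from the network check, so a stricter test (for example, an authenticated login) could be substituted.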

The schema analysis unit 130 comprises a data indexing module 132 that accesses the diagnosis target institution DB and indexes its tables and columns, and a schema analysis module 134 that analyzes the set of objects containing the indexed tables and columns to generate schema analysis information.

Specifically, the data indexing module 132 searches the system catalog of the diagnosis target institution DB (the system tables that store table information, index information, view information, and so on) and indexes the information needed to analyze the table and column structure.

As shown in FIG. 3A, the schema analysis module 134 generates schema analysis information for the indexed tables by extracting the number of tables, the table name, the number of rows, the average row length, the number of indexes, the number of PK columns, and the table description.

As shown in FIG. 3B, the schema analysis module 134 likewise generates schema analysis information for the indexed columns by extracting the column name, data type, data length, data precision, data scale, nullability, default value, maximum and minimum values, and the column description.
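The catalog lookup described above can be illustrated with a small sketch. SQLite is used here purely as a stand-in, since the patent does not name the DBMS of the diagnosis target institution; `sqlite_master` and `PRAGMA table_info` are SQLite's own catalog facilities, while the function name `analyze_schema` and the shape of the returned dictionary are assumptions. Only a subset of the fields listed for FIGS. 3A and 3B is collected.

```python
import sqlite3


def analyze_schema(conn: sqlite3.Connection) -> dict:
    """Collect per-table and per-column schema information from the catalog."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    info = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        info[table] = {
            "row_count": row_count,
            "pk_column_count": sum(1 for c in columns if c[5] > 0),
            "columns": [
                {"name": c[1], "type": c[2], "nullable": not c[3], "default": c[4]}
                for c in columns
            ],
        }
    return info
```

On another DBMS the same information would come from that system's catalog views; the queries change, but the output can keep the same shape for the later profiling steps.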

When direct access to the diagnosis target institution DB is not permitted for security reasons, the CSV registration unit 140 is configured so that the data to be diagnosed is manually uploaded and registered as a CSV file, as in the example shown in FIG. 4.
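One way to picture CSV registration is to load an uploaded file into a local table that later steps can query. The sketch below assumes the CSV has a header row and stores every column as text in an SQLite table; the function name `register_csv` and these storage choices are illustrative assumptions, not details taken from the patent.

```python
import csv
import io
import sqlite3


def register_csv(conn: sqlite3.Connection, table: str, csv_text: str) -> int:
    """Load CSV text (header row plus data rows) into a new table; return rows loaded."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Assumption: all columns are stored as TEXT; typing is left to later analysis.
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)
```

Because the data ends up in an ordinary table, the same profiling code can run whether the data arrived through direct access or through an upload.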

The data analysis unit 150 comprises a criterion setting module 152 that receives the indicators and patterns for data profiling of the schema analysis information, an instance indexing module 154 that indexes instances of the data stored in the diagnosis target institution DB so that they match the received indicators and patterns, and a profiling module 156 that performs data profiling on the indexed instances to extract invalid values, structural violations, and data rule violations.

The indicators and patterns received by the criterion setting module 152 are entered as regular expressions or Structured Query Language (SQL) statements preset to suit the data characteristics and management purpose of the diagnosis target institution, and each indicator, pattern, regular expression, and SQL statement can be changed by the administrator.

As shown in FIG. 5A, the criterion setting module 152 receives a selection of the tables and columns to be diagnosed, made with reference to the schema analysis information, and, as shown in FIG. 5B, receives the indicators and patterns corresponding to each column.
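To make the regular-expression case concrete, the sketch below checks every value of one column against a pattern and collects the values that do not match. The function name `profile_column`, the result fields, and the decision to count missing values as invalid are assumptions for illustration; the patent does not define these details.

```python
import re
from typing import Iterable


def profile_column(values: Iterable, pattern: str) -> dict:
    """Check each value against a regular expression and collect the mismatches."""
    regex = re.compile(pattern)
    total = 0
    invalid = []
    for value in values:
        total += 1
        # Assumption: a missing value (None) counts as invalid for this indicator.
        if value is None or regex.fullmatch(str(value)) is None:
            invalid.append(value)
    return {"total": total, "invalid_count": len(invalid), "invalid_values": invalid}
```

For example, an administrator-supplied pattern such as `\d{3}-\d{4}-\d{4}` for a phone number column would flag `"12345"` and missing entries while accepting `"010-1234-5678"`. Since the pattern is plain data, it can be changed without modifying the code.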

An additional configuration of the data quality management system 100 according to an embodiment of the present invention is described below with reference to FIGS. 6 and 7.

Referring to FIG. 6, the data quality management system 100 according to an embodiment of the present invention further comprises a result output unit 160 that sorts the profiling results of the data analysis unit 150 by schema analysis information and generates a report conforming to a preset format.

As shown in FIG. 7, the report presents the invalid values, structural violations, and data rule violations in tables and graphs, and is generated in the format of any one of the document editing programs Notepad, Excel, Word, or Hwp, or as a PDF.
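As a rough illustration of the result output unit, the sketch below renders profiling results as a fixed-width text table of the kind that could be opened in Notepad. The input shape (a mapping from a column name to its `total` and `invalid_count`) and the function name `render_report` are assumptions; the patent's actual report layout is the one shown in FIG. 7.

```python
def render_report(results: dict) -> str:
    """Render {"table.column": {"total": int, "invalid_count": int}} as a text table."""
    lines = [f"{'column':<20}{'total':>8}{'invalid':>9}{'error %':>9}"]
    for column, r in sorted(results.items()):
        # Guard against empty columns to avoid division by zero.
        rate = 100.0 * r["invalid_count"] / r["total"] if r["total"] else 0.0
        lines.append(f"{column:<20}{r['total']:>8}{r['invalid_count']:>9}{rate:>9.1f}")
    return "\n".join(lines)
```

The other formats named above would need a document library rather than plain strings; only the plain-text case is sketched here.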

A data quality management method according to an embodiment of the present invention is described below with reference to FIG. 8.

First, the account generation unit generates account information in which a unique identification ID is assigned to the diagnosis target institution and the administrator (S802).

Next, the connection management unit determines whether direct access to the diagnosis target institution DB is possible (S804).

If step S804 determines that direct access to the diagnosis target institution DB is possible, the DB is accessed, its tables and columns are queried, and the schema structure is analyzed (S806).

The data analysis unit then performs data profiling on inaccurate instances of the data stored in the diagnosis target institution DB according to the set indicators and patterns (S808).

If, on the other hand, step S804 determines that direct access to the diagnosis target institution DB is not possible, the CSV registration unit registers CSV files for the columns stored in the diagnosis target institution DB, and the procedure moves on to step S808 (S810).
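The branching in FIG. 8 can be summarized in a few lines of code. The sketch below only records which steps run in which order; the function name `run_diagnosis` and the idea of passing the access check in as a callable are assumptions made for the example.

```python
from typing import Callable, List


def run_diagnosis(can_connect: Callable[[], bool]) -> List[str]:
    """Return the step labels of FIG. 8 in the order they would execute."""
    steps = ["S802"]          # generate account information
    steps.append("S804")      # determine whether direct DB access is possible
    if can_connect():
        steps.append("S806")  # direct access: analyze the schema structure
    else:
        steps.append("S810")  # no direct access: register CSV files instead
    steps.append("S808")      # data profiling runs on either path
    return steps
```

Both paths converge on step S808, so profiling does not need to know how the data was obtained.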

The detailed process of step S806 of the data quality management method according to an embodiment of the present invention is described below with reference to FIG. 9.

After step S804, the schema analysis unit accesses the diagnosis target institution DB and indexes its tables and columns (S902).

The schema analysis unit then analyzes the set of objects containing the indexed tables and columns to generate schema analysis information (S904).

The detailed process of step S808 of the data quality management method according to an embodiment of the present invention is described below with reference to FIG. 10.

After step S806, the data analysis unit receives the indicators and patterns for data profiling (S1002).

Next, the data analysis unit indexes instances of the data stored in the diagnosis target institution DB so that they match the received indicators and patterns (S1004).

The data analysis unit then performs data profiling on the indexed instances to extract invalid values, structural violations, and data rule violations (S1006).
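Where an indicator is expressed in SQL rather than as a regular expression, steps S1002 to S1006 might look like the sketch below, which counts the rows that satisfy an administrator-supplied violation condition. SQLite again stands in for the unspecified DBMS, and the function name `check_rule` and the convention that the rule is written as a WHERE clause describing a violation are assumptions.

```python
import sqlite3


def check_rule(conn: sqlite3.Connection, table: str, violation_where: str) -> dict:
    """Count rows in `table` that match a WHERE clause describing a rule violation.

    The clause is interpolated into the query as-is, so it must come from a
    trusted administrator, never from untrusted input.
    """
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    violations = conn.execute(
        f'SELECT COUNT(*) FROM "{table}" WHERE {violation_where}').fetchone()[0]
    return {"total": total, "violations": violations}
```

A rule such as `end_date < start_date` on a contract table would report every row whose end date precedes its start date. Rows with a missing date are not counted by this rule, because a comparison with NULL is not true in SQL; a separate completeness indicator would be needed for those.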

According to the embodiment of the present invention described above, a data quality management system that was previously operated only on a web basis is operated as a web version or a client version depending on data characteristics and management purpose, which significantly shortens the analysis time required for data quality diagnosis and resolves security risks.

The present invention has been described and illustrated above in connection with a preferred embodiment that exemplifies its technical idea. However, the present invention is not limited to the exact configuration and operation shown and described, and those skilled in the art will readily understand that many changes and modifications can be made to it without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be regarded as falling within the scope of the present invention.

100: data quality management system
110: account generation unit
112: account management module 114: DB creation module
120: connection management unit
122: access mode detection module 124: client access module
130: schema analysis unit
132: data indexing module 134: schema analysis module
140: CSV registration unit
150: data analysis unit 152: criterion setting module
154: instance indexing module 156: profiling module
160: result output unit

Claims

A data quality management system comprising:
an account generation unit that generates account information in which a unique identification ID is assigned to an institution to be diagnosed and an administrator;
a connection management unit that determines whether the DB of the institution to be diagnosed can be accessed and distributes client access or web access accordingly;
a schema analysis unit that, when direct access to the DB of the institution to be diagnosed is possible, accesses the DB, queries its tables and columns, and analyzes the schema structure;
a CSV registration unit that, when direct access to the DB of the institution to be diagnosed is not possible, registers CSV files for the columns stored in the DB; and
a data analysis unit that indexes inaccurate instances of the data stored in the DB of the institution to be diagnosed according to set indicators and patterns, and performs data profiling to extract invalid values, structural violations, and data rule violations,
wherein the connection management unit includes an access mode detection module that determines whether direct access to the DB of the institution to be diagnosed, generated by the account generation unit, is possible, a client access module that activates a client access mode when direct access to the DB is possible, and a web access module that activates a web access mode when direct access to the DB is not possible, and
wherein the schema analysis unit includes a data indexing module that accesses the DB of the institution to be diagnosed and indexes its tables and columns, and a schema analysis module that analyzes the set of objects containing the indexed tables and columns to generate schema analysis information.

(Deleted)

The system of claim 1,
wherein the data analysis unit comprises:
a criterion setting module that receives indicators and patterns for data profiling;
an instance indexing module that indexes instances of the data stored in the DB of the institution to be diagnosed so that they match the received indicators and patterns; and
a profiling module that performs data profiling on the indexed instances to extract invalid values, structural violations, and data rule violations.

The system of claim 1,
further comprising a result output unit that sorts the profiling results of the data analysis unit by schema analysis information and generates a report conforming to a preset format.

A data quality management method comprising:
(a) generating, by an account generation unit, account information in which a unique identification ID is assigned to an institution to be diagnosed and an administrator;
(b) determining, by a connection management unit, whether direct access to the DB of the institution to be diagnosed is possible;
(c) when step (b) determines that direct access to the DB of the institution to be diagnosed is possible, accessing the DB, querying its tables and columns, and analyzing the schema structure; and
(d) performing, by a data analysis unit, data profiling on inaccurate instances of the data stored in the DB of the institution to be diagnosed according to set indicators and patterns,
wherein step (b) includes detecting, by the connection management unit, an access mode by determining whether direct access to the DB of the institution to be diagnosed is possible, activating a client access mode when direct access to the DB is possible, and activating a web access mode when direct access to the DB is not possible,
wherein step (c) includes a data indexing step in which a schema analysis unit accesses the DB of the institution to be diagnosed and indexes its tables and columns, and a schema analysis step of analyzing the set of objects containing the indexed tables and columns to generate schema analysis information, and
wherein, when step (b) determines that direct access to the DB of the institution to be diagnosed is not possible, the method includes activating the web access mode and registering, by a CSV registration unit, CSV files for the columns stored in the DB.

(Deleted)