KR102262279B1

KR102262279B1 - Data processing for detecting outlier method and device thereof

Info

Publication number: KR102262279B1
Application number: KR1020190122137A
Authority: KR
Inventors: 이우성; 이용희; 오성준; 안덕호
Original assignee: (주)디지탈쉽
Priority date: 2019-10-02
Filing date: 2019-10-02
Publication date: 2021-06-08
Also published as: KR20210039654A

Abstract

A data processing method according to an embodiment of the present invention includes: acquiring data to be collected and preprocessing the data to detect normal values, outliers, and missing values; visualizing the detected normal values, outliers, and missing values through an interface; and, in response to a user's request, providing through the interface the metadata information that serves as the criterion for detecting the outliers.

Description

DATA PROCESSING FOR DETECTING OUTLIER METHOD AND DEVICE THEREOF

The present invention relates to the field of computer information processing, and more particularly to a data processing method, apparatus, and system for detecting outliers or missing values.

The content described below merely provides background information related to an embodiment of the present invention and does not constitute prior art.

In modern society, data has permeated a wide range of businesses and become a critical asset, and the era of big data has arrived. The explosive growth of high-volume, highly varied big data demands data processing and analysis capabilities from modern companies, while also giving them the opportunity to forecast the future accurately. Companies have looked for ways to use data actively to build their own business strategies and make tactical decisions, which gave rise to the concept of business intelligence (BI). What matters most today is processing and analyzing this big data, because intelligent and valuable information can be obtained only through such processing and analysis.

Advances in big data analysis technology have brought augmented analytics to the fore. Augmented analytics combines existing business intelligence with artificial intelligence and machine learning to predict what will happen next and to have the AI find and propose optimized alternatives accordingly. In other words, augmented analytics automates the series of processes from data preprocessing to feature engineering through machine learning rather than human judgment, and enables predictive analysis by working through larger volumes of data and uncovering hidden patterns.

Using augmented analytics in this way requires a technique for correcting outliers and missing values during data preprocessing. For the correction to be accurate, the metadata that serves as the criterion for outlier detection must be checked to confirm that it contains appropriate instructions. In practice, however, checking and modifying this metadata information is not easy.

In view of the above, one problem to be solved by the present invention is the need for an algorithm that processes and analyzes big data by refining and selecting data, so that augmented analytics can be applied to the large volumes of raw, unstructured data produced in the field.

In addition, a user interface needs to be provided so that the metadata information serving as the criterion for detecting outliers or missing values during data preprocessing can be checked and modified.

A data processing method according to an embodiment of the present invention may include: acquiring data to be collected and preprocessing the data to detect normal values, outliers, and missing values; visualizing the detected normal values, outliers, and missing values through an interface; and, in response to a user's request, providing through the interface the metadata information that serves as the criterion for detecting the outliers.

According to an embodiment, the metadata information may be index information including at least one of a data dictionary and a regular expression.

According to an embodiment, the interface may provide a search box so that the user can search the index information.

According to an embodiment, the index information may be added to, deleted, or modified in response to the user's command through the interface.

A data processing apparatus according to an embodiment of the present invention may include: a data collection unit that acquires data to be collected; a data preprocessing unit that preprocesses the data to be collected; a data quality management unit that detects normal values, outliers, and missing values in the preprocessed data; and an interface providing unit that visualizes the detected normal values, outliers, and missing values through an interface.

According to an embodiment, the interface providing unit may provide, in response to a user's request, the metadata information that serves as the criterion for detecting the outliers.

According to an embodiment, the interface providing unit may provide a search box so that the user can search the index information.

According to an embodiment, the index information may be added to, deleted, or modified in response to the user's command through the interface providing unit.

According to an embodiment of the present invention, an algorithm can be provided that processes and analyzes big data by refining and selecting data, so that augmented analytics can be applied to large volumes of raw, unstructured data.

In addition, a user interface is provided for checking the metadata information that serves as the criterion for detecting outliers or missing values during data preprocessing, so that the user can easily understand the information and modify it when necessary. This makes it easier to add new data or delete existing information.

FIG. 1 is a diagram illustrating a data processing apparatus for outlier detection according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the operation sequence of a data processing apparatus for outlier detection according to an embodiment of the present invention.
FIGS. 3 to 5 are diagrams illustrating the interface provided by a data processing apparatus for outlier detection according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a data processing method for outlier detection according to an embodiment of the present invention.

The specific structural and functional descriptions of the embodiments according to the concept of the present invention disclosed herein are given only to explain those embodiments. Embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Because embodiments according to the concept of the present invention can be modified in various ways and can take various forms, the embodiments are illustrated in the drawings and described in detail herein. This is not intended to limit the embodiments to the specific forms disclosed; they include all modifications and substitutes that fall within the spirit and technical scope of the present invention.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms serve only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined herein.

Hereinafter, embodiments of a data processing apparatus and method for outlier detection are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a data processing apparatus for outlier detection according to an embodiment of the present invention.

Referring to FIG. 1, the data processing apparatus 130 may be connected to the user terminal 110 through the network 120. The data processing apparatus 130 may be implemented as a server and made available to the user terminal through a web browser.

According to an embodiment, the user terminal 110 may be any kind of user terminal that can run a web browser, for example a personal computer, a laptop computer, a PDA, a smartphone, a smart pad, or a tablet PC.

The network 120 may be wired and/or wireless and may be, or may include, one or more wide area networks (WANs) and/or local area networks (LANs) capable of supporting secure communication. In some examples, the network 120 may include one or more cellular networks and/or the Internet, among other networks. The network 120 may operate according to one or more communication protocols such as LTE, CDMA, GSM, LPWAN, WiFi, Bluetooth, Ethernet, HTTP/S, TCP, and CoAP/DTLS. Although the network 120 is shown as a single network, it should be understood that it may consist of multiple separate networks that are communicatively linked to one another. In some examples, the network 120 may also facilitate secure communication between network components (for example, via encryption or other security measures). The network 120 may take other forms as well.

According to an embodiment, the data processing apparatus 130 for outlier detection may include a central processor 131, a data collection unit 132, a data preprocessing unit 133, a data quality management unit 134, an interface providing unit 135, and a communication unit 136.

The central processor 131 controls each component of the data processing apparatus 130. In practice, a single central processor may itself perform the role of each component.

According to an embodiment, the data collection unit 132 may collect the raw data to be collected. Here, the raw data may include unrefined data, unstructured data, and the like, written as Excel or other spreadsheet files.

According to an embodiment, the data preprocessing unit 133 may preprocess the collected raw data. Preprocessing may include data cleaning, data structuring, and data transformation.
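The specification names cleaning, structuring, and transformation but does not say how each is carried out. The following is only a minimal sketch of the first two, under assumptions that do not come from the patent: rows arrive as dictionaries, and the hypothetical `MISSING_TOKENS` set lists placeholder strings that should count as "no value".

```python
from typing import Optional

# Hypothetical placeholder strings treated as "no value" (an assumption, not from the patent).
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}


def clean_cell(raw: object) -> Optional[str]:
    """Cleaning: trim whitespace and map placeholder tokens to None."""
    if raw is None:
        return None
    text = str(raw).strip()
    return None if text.lower() in MISSING_TOKENS else text


def preprocess(rows: list, columns: list) -> list:
    """Structuring: give every row the same columns, filling gaps with None."""
    return [{col: clean_cell(row.get(col)) for col in columns} for row in rows]


raw_rows = [
    {"id": " 1 ", "email": "adh@digitalship.co.kr", "city": "Seoul"},
    {"id": "2", "email": "N/A"},  # the "city" key is absent in the raw data
]
for row in preprocess(raw_rows, ["id", "email", "city"]):
    print(row)
```

Mapping absent cells and placeholder tokens to a single `None` value at this stage keeps the later missing-value check trivial.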

According to an embodiment, the data quality management unit 134 may detect normal values, outliers, and missing values in the preprocessed data. In doing so, it may use metadata about data quality that has been collected in advance. The metadata may also be modified or extended when a new type appears. The metadata information may be index information including at least one of a data dictionary and a regular expression.

Metadata is attribute information about data: data attached to content according to fixed rules so that the information being sought can be found and used efficiently within a large body of information. It may record the location and substance of the content, information about the author, rights conditions, terms of use, usage history, and so on. In computing, metadata is generally used both to describe data and to find data quickly.

Here, an index means an index or catalog. When data is recorded, a table listing attributes such as the data's name and size together with its storage location, that is, data kept for reference, is called an index.

According to an embodiment, the interface providing unit 135 may visualize the detected normal values, outliers, and missing values and provide them to the user terminal. According to an embodiment, the interface providing unit 135 may provide, in response to a user's request, the metadata information that serves as the criterion for detecting the outliers.

The visualization is described in more detail with reference to FIGS. 3 and 4.

According to an embodiment, the interface providing unit 135 may provide a search box so that the user can search the index information.

According to an embodiment, the index information can be added to, deleted, or modified in response to the user's command through the interface providing unit.
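One possible way to model searchable, editable index information is sketched below. The class and method names (`IndexEntry`, `MetadataIndex`, `add`, `delete`, `modify`, `search`) are hypothetical, and the substring search is an assumption; the specification only states that the index holds a data dictionary and/or a regular expression and that the user can search, add, delete, and modify it.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IndexEntry:
    name: str                                      # e.g. "id" or "city"
    pattern: Optional[str] = None                  # regular expression, if any
    dictionary: set = field(default_factory=set)   # allowed values, if any


class MetadataIndex:
    """In-memory store of index entries keyed by name (illustrative only)."""

    def __init__(self) -> None:
        self._entries = {}

    def add(self, entry: IndexEntry) -> None:
        if entry.pattern is not None:
            re.compile(entry.pattern)  # reject an invalid expression early
        self._entries[entry.name] = entry

    def delete(self, name: str) -> None:
        self._entries.pop(name, None)

    def modify(self, name: str, *, pattern: Optional[str] = None,
               add_words=(), remove_words=()) -> None:
        entry = self._entries[name]
        if pattern is not None:
            re.compile(pattern)
            entry.pattern = pattern
        entry.dictionary |= set(add_words)
        entry.dictionary -= set(remove_words)

    def search(self, query: str) -> list:
        """Return names of entries whose name or dictionary words contain the query."""
        q = query.lower()
        return sorted(
            name for name, e in self._entries.items()
            if q in name.lower() or any(q in w.lower() for w in e.dictionary)
        )


index = MetadataIndex()
index.add(IndexEntry("id", pattern=r"\d+"))
index.add(IndexEntry("city", dictionary={"Seoul", "Busan"}))
index.modify("city", add_words={"Sejong"})   # a newly created city
print(index.search("sej"))                   # finds the "city" entry via its dictionary
index.delete("id")
```

A search box in the interface could forward its text to `search`, while add, delete, and modify commands map onto the other three methods.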

According to an embodiment, the communication unit 136 may transmit the interface to the user terminal and receive commands from the user.

FIG. 2 is a diagram illustrating the operation sequence of a data processing apparatus for outlier detection according to an embodiment of the present invention.

Referring to FIG. 2, the data processing apparatus 200 for outlier detection may include a gateway 220, a data collection unit 230, a data preprocessing unit 240, a data quality management unit 250, and a user interface 260.

According to an embodiment, the gateway 220 may receive the raw data 210 as input.

According to an embodiment, the data collection unit 230 may collect the original data.

According to an embodiment, the data preprocessing unit 240 may preprocess the collected original data.

According to an embodiment, the data quality management unit 250 may detect normal values, outliers, and missing values in the preprocessed data. In doing so, the data quality management unit 250 may evaluate data quality according to at least one of a predetermined data dictionary and a regular expression.

According to an embodiment, the user interface 260 may visualize the detected normal values, outliers, and missing values and present them to the user 290.

The user 290 can review the material presented through the user interface and judge whether there is a problem with the metadata.

According to an embodiment, the user 290 can use the provided interface to search the metadata that serves as the criterion for outlier detection, and to add to, delete, or modify it.

FIGS. 3 to 5 are diagrams illustrating the interface provided by a data processing apparatus for outlier detection according to an embodiment of the present invention.

Referring to FIG. 3, the interface presents the preprocessed raw data in the form of a table.

Here, green marks data detected as normal values, red marks data detected as outliers, and white marks data detected as missing values. Green, red, and white are used in this example, but other colors may be used instead. The color bar displayed below each column title indicates the proportions of normal values, outliers, and missing values in that column's data. A longer green segment means more normal values, a longer white segment means more missing values, and a longer red segment means more outliers.

The color bar thus acts as a data quality indicator, showing at a glance how valid the data values in the corresponding column are.
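As a rough illustration of how the segment lengths of such a quality bar could be derived, the sketch below turns per-cell labels into ratios and a fixed-width text bar. The function names, the segment order (normal, outlier, missing), the 20-character width, and the letters G/R/W standing in for green/red/white are all assumptions made for this example.

```python
from collections import Counter

CATEGORIES = ("normal", "outlier", "missing")
SYMBOLS = {"normal": "G", "outlier": "R", "missing": "W"}  # stand-ins for colors


def quality_ratios(labels: list) -> dict:
    """Share of each category among the labels of one column."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


def render_bar(labels: list, width: int = 20) -> str:
    """Text bar whose segment lengths follow the category ratios.

    Segment edges are rounded cumulatively, so the segments always add up
    to exactly `width` characters when every label is a known category.
    """
    total = len(labels)
    if total == 0:
        return " " * width
    counts = Counter(labels)
    bar, cumulative, previous_edge = "", 0, 0
    for category in CATEGORIES:
        cumulative += counts[category]
        edge = round(width * cumulative / total)
        bar += SYMBOLS[category] * (edge - previous_edge)
        previous_edge = edge
    return bar


email_labels = ["normal"] * 7 + ["outlier"] * 2 + ["missing"] * 1
print(quality_ratios(email_labels))
print(render_bar(email_labels))  # 14 G, 4 R, 2 W
```

Rounding the cumulative edges rather than each segment separately avoids bars that come out one character too long or too short.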

According to an embodiment, in the identifier column the regular expression requires id values to be integers only, so a value containing a special character (for example, @) or a letter (for example, A) can be judged to be an outlier.

According to another embodiment, in the email column there is an email format defined by a regular expression (matching, for example, adh@digitalship.co.kr), and data that does not match this regular expression can be judged to be an outlier.

According to yet another embodiment, in the city column the city names may be stored in a pre-stored data dictionary. If a value that is not listed in the pre-stored data dictionary is detected, it can be judged to be an outlier.
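The three checks above can be sketched as follows. The specific patterns and the city list are illustrative assumptions; in particular, `EMAIL_PATTERN` is a deliberately simplified expression, not the one used by the patented system, which the specification does not disclose.

```python
import re
from typing import Optional

# Illustrative rules only; the actual expressions and dictionary are not given in the patent.
ID_PATTERN = re.compile(r"\d+")  # integers only
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CITY_DICTIONARY = {"Seoul", "Busan", "Incheon"}


def classify(value: Optional[str], *, pattern=None, dictionary=None) -> str:
    """Label one cell as 'missing', 'outlier', or 'normal'."""
    if value is None or value.strip() == "":
        return "missing"
    if pattern is not None and pattern.fullmatch(value) is None:
        return "outlier"  # does not match the regular expression
    if dictionary is not None and value not in dictionary:
        return "outlier"  # not listed in the data dictionary
    return "normal"


print(classify("1024", pattern=ID_PATTERN))                      # normal
print(classify("10@4", pattern=ID_PATTERN))                      # outlier (special character)
print(classify("A17", pattern=ID_PATTERN))                       # outlier (letter)
print(classify("adh@digitalship.co.kr", pattern=EMAIL_PATTERN))  # normal
print(classify("adh.digitalship", pattern=EMAIL_PATTERN))        # outlier
print(classify("Atlantis", dictionary=CITY_DICTIONARY))          # outlier
print(classify(None, dictionary=CITY_DICTIONARY))                # missing
```

`fullmatch` is used rather than `search` so that a value such as `10@4` is rejected even though it contains digits.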

According to FIG. 3, the interface shows no outliers in the first name, last name, job, company, city, and date columns, and shows that outliers exist in the id, email, and state columns.

Referring to FIG. 4, the id column 410, first name column 420, last name column 430, email column 440, and job title column 450 can be seen. A color bar indicating data quality is provided at the bottom of the email column 440. For the email column, most of the data consists of normal values 441, but missing values 442 and outliers 443 are mixed in as well.

At this point, the user may wonder what the criterion for outlier detection is. Other data processing apparatuses, however, either do not disclose their metadata or do not allow it to be modified, added to, or deleted, which has made data analysis very difficult.

Referring to FIG. 5, when the email column is selected, the interface shows that the probability that the column contains email data is 99.8%, and if the user can see that the column clearly holds a different kind of data, the user can change it to another data type. In other words, the user can modify the attribute of the column in order to improve the accuracy of data processing.
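The specification does not say how the displayed probability is computed. One simple possibility, shown here purely as an assumption, is to score each candidate type by the share of non-missing values in the column that satisfy that type's rule, and to let a user-selected type take precedence. The names `CANDIDATE_TYPES`, `type_scores`, and `resolve_type` and both patterns are hypothetical.

```python
import re
from typing import Optional

# Hypothetical candidate types with simplified patterns (not from the patent).
CANDIDATE_TYPES = {
    "integer": re.compile(r"\d+"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
}


def type_scores(values: list) -> dict:
    """Share of non-missing values that match each candidate type."""
    present = [v for v in values if v]
    if not present:
        return {name: 0.0 for name in CANDIDATE_TYPES}
    return {
        name: sum(1 for v in present if pattern.fullmatch(v)) / len(present)
        for name, pattern in CANDIDATE_TYPES.items()
    }


def resolve_type(values: list, user_choice: Optional[str] = None) -> tuple:
    """Return (type, score); a type chosen by the user overrides the best guess."""
    scores = type_scores(values)
    if user_choice is not None:
        return user_choice, scores.get(user_choice, 0.0)
    best = max(scores, key=scores.get)
    return best, scores[best]


column = ["a@x.com", "b@y.org", None, "not-an-email", "c@z.kr"]
print(resolve_type(column))                         # ('email', 0.75)
print(resolve_type(column, user_choice="integer"))  # ('integer', 0.0)
```

Under this assumption, a figure such as 99.8% would simply mean that 99.8% of the non-missing values in the column match the email rule.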

In addition, the data dictionary and the regular expressions can be modified in response to the user's command. This is because changes need to be reflected immediately when, for example, a new city is created or a new rule changes the format of email addresses.

FIG. 6 is a flowchart illustrating a data processing method for outlier detection according to an embodiment of the present invention.

Referring to FIG. 6, in step S610, the data processing apparatus according to an embodiment may collect raw data.

In step S620, the data processing apparatus according to an embodiment may preprocess the collected data.

In step S630, the data processing apparatus according to an embodiment may provide a visualization interface for the outliers and missing values in the data. The data processing apparatus may detect normal values, outliers, and missing values in the preprocessed data. The normal values, outliers, and missing values can be detected according to at least one of a data dictionary and a regular expression.

In step S640, the data processing apparatus according to an embodiment may provide an interface to the metadata that serves as the criterion for outlier detection. Here, the metadata information may be index information including at least one of a data dictionary and a regular expression.

According to an embodiment, the interface may provide a search box so that the user can search the index information.
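Steps S610 to S640 can be strung together as in the sketch below. It assumes, for illustration only, that the raw data is CSV text and that the index information is a plain dictionary with hypothetical `regex` and `dictionary` keys; the function names mirror the steps but are not taken from the patent.

```python
import csv
import io
import re

RAW_CSV = "id,city\n1,Seoul\nA2,Busan\n3,\n4,Atlantis\n"  # made-up sample data

METADATA = {  # index information; contents are illustrative
    "id": {"regex": r"\d+"},
    "city": {"dictionary": ["Seoul", "Busan"]},
}


def collect(text: str) -> list:
    """S610: read raw rows (here from CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(rows: list) -> list:
    """S620: trim cells and turn empty cells into None."""
    return [
        {k: ((v.strip() or None) if v is not None else None) for k, v in row.items()}
        for row in rows
    ]


def detect(rows: list, metadata: dict) -> list:
    """S630: label each cell; the labels are what the interface would visualize."""
    def label(column, value):
        if value is None:
            return "missing"
        rule = metadata.get(column, {})
        if "regex" in rule and not re.fullmatch(rule["regex"], value):
            return "outlier"
        if "dictionary" in rule and value not in rule["dictionary"]:
            return "outlier"
        return "normal"
    return [{col: label(col, val) for col, val in row.items()} for row in rows]


def describe_criterion(column: str, metadata: dict) -> dict:
    """S640: return the detection criterion for a column the user selected."""
    return metadata.get(column, {})


labels = detect(preprocess(collect(RAW_CSV)), METADATA)
for row in labels:
    print(row)
print(describe_criterion("id", METADATA))
```

Keeping the criteria in a separate `METADATA` structure, rather than hard-coding them in `detect`, is what allows them to be shown to the user and edited without changing the detection code.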

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or if components of the described systems, structures, apparatuses, circuits, and the like are coupled or combined in a different form from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

110: user terminal
120: network
130: data processing apparatus
131: central processor
132: data collection unit
133: data preprocessing unit
134: data quality management unit
135: interface providing unit
136: communication unit

Claims

1. A data processing method comprising:
acquiring data to be collected, preprocessing the data to be collected, and detecting normal values, outliers, and missing values;
visualizing the detected normal values, outliers, and missing values through an interface; and
providing through the interface, in response to a user's request, metadata information that serves as a criterion for detecting the outliers,
wherein visualizing the detected normal values, outliers, and missing values through the interface includes displaying the preprocessed data through the interface in the form of a table consisting of rows and columns,
wherein the title row of the table includes color bars indicating the normal values, outliers, and missing values among the data of each column,
wherein each of the color bars is divided into partial bars whose lengths correspond to the respective ratios of normal values, outliers, and missing values among the data of each column,
wherein the partial bars are displayed in different colors corresponding to normal values, outliers, and missing values, respectively,
wherein providing the metadata information through the interface includes, when the user's request includes a selection input for any one of the columns, displaying metadata information that includes a first type as the criterion for detecting outliers in the data of that column, and
wherein the metadata information includes a probability that the column corresponds to the first type, and a second type to which the first type can be changed.

2. The method of claim 1,
wherein the metadata information is index information including at least one of a data dictionary and a regular expression.

3. The method of claim 2,
wherein the interface provides a search box so that the user can search the index information.

4. The method of claim 2,
wherein the index information can be added to, deleted, or modified in response to the user's command through the interface.

5. (Deleted)

6. A data processing apparatus comprising:
a data collection unit that acquires data to be collected;
a data preprocessing unit that preprocesses the data to be collected;
a data quality management unit that detects normal values, outliers, and missing values in the preprocessed data; and
an interface providing unit that visualizes the detected normal values, outliers, and missing values through an interface,
wherein the interface providing unit provides, in response to a user's request, metadata information that serves as a criterion for detecting the outliers,
wherein the interface providing unit displays the preprocessed data through the interface in the form of a table consisting of rows and columns,
wherein the title row of the table includes color bars indicating the normal values, outliers, and missing values among the data of each column,
wherein each of the color bars is divided into partial bars whose lengths correspond to the respective ratios of normal values, outliers, and missing values among the data of each column,
wherein the partial bars are displayed in different colors corresponding to normal values, outliers, and missing values, respectively,
wherein, when the user's request includes a selection input for any one of the columns, the interface providing unit displays metadata information that includes a first type as the criterion for detecting outliers in the data of that column, and
wherein the metadata information includes a probability that the column corresponds to the first type, and a second type to which the first type can be changed.

7. The apparatus of claim 6,
wherein the metadata information is index information including at least one of a data dictionary and a regular expression.

8. The apparatus of claim 7,
wherein the interface providing unit provides a search box so that the user can search the index information.

9. The apparatus of claim 7,
wherein the index information can be added to, deleted, or modified in response to the user's command through the interface providing unit.