KR20160104945A

KR20160104945A - Apparatus and Method for Evaluating Outlierness based on Data Association

Info

Publication number: KR20160104945A
Application number: KR1020150027914A
Authority: KR
Inventors: 이건명
Original assignee: 충북대학교 산학협력단
Priority date: 2015-02-27
Filing date: 2015-02-27
Publication date: 2016-09-06
Also published as: KR101692611B1

Abstract

The present invention discloses a data-association-based outlier evaluation apparatus and method. According to one aspect of the present invention, an outlier evaluation apparatus for input data entered into a database that stores data in table form comprises: a distribution module that, when the input data is entered, provides distribution information on mutually associated related attributes; and an evaluation module that uses the distribution information to identify whether the related attributes are of a numeric, categorical, or composite type and, according to the type of the related attributes, evaluates the outlier degree of the input data using at least one of the distance between the input data and the attributes within the related attributes and the combination frequency between the input data and the data of the related attributes.

Description

Apparatus and Method for Evaluating Outlierness based on Data Association

The present invention relates to outlier detection technology and, more specifically, to a data-association-based outlier evaluation apparatus and method that uses associations between data attributes.

Data in a database may consist of a plurality of instances and the attributes that distinguish each instance. As an example, as shown in FIG. 1, a database may contain software project data composed of 11 instances and 5 attributes. Here, an outlier is an instance that contains an abnormal value in one of its attributes, and an attribute containing an abnormal value is called an abnormal attribute.

When data in a database is used for decision making, its quality is very important. In practice, however, problematic data is inevitably entered or collected into the database as well, for example through practitioners' mistakes.

Accordingly, much research in the bioinformatics and data mining fields has addressed finding logical errors in database data (hereinafter, outlier determination).

Representative outlier determination techniques include the PANDA technique and the AOI technique. The PANDA technique ranks outliers by the sum of the noise factors over all attributes of each instance in the database. The AOI technique computes, for each instance, the sum of the noise factors with a given attribute included and with it excluded, and judges whether the attribute is abnormal from the difference between the resulting outlier ranks.

However, these conventional outlier determination techniques were intended to find specific instances containing erroneous information, or to judge that a value is unsuitable as a data attribute value.

The present invention was devised against the technical background described above, and its object is to provide a data-association-based outlier evaluation apparatus and method capable of evaluating the outlier degree of input data using associations between data attributes.

The objects of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

According to one aspect of the present invention, an outlier evaluation apparatus for input data entered into a database that stores data in table form comprises: a distribution module that, when the input data is entered, provides distribution information on mutually associated related attributes; and an evaluation module that uses the distribution information to identify whether the related attributes are of a numeric, categorical, or composite type and, according to the type of the related attributes, evaluates the outlier degree of the input data using at least one of the distance between the input data and the attributes within the related attributes and the combination frequency between the input data and the data of the related attributes.

According to another aspect of the present invention, an outlier evaluation method, performed by at least one processor, for input data entered into a database comprises: providing, when the input data is entered, distribution information on mutually associated related attributes; identifying, using the distribution information, whether the related attributes are of a numeric, categorical, or composite type; and evaluating, according to the type of the related attributes, the outlier degree of the input data using at least one of the distance between the input data and the attributes within the related attributes and the combination frequency between the input data and the data of the related attributes.

According to the present invention, the outlier degree of input data can be evaluated using associations between data attributes.

FIG. 1 is a diagram showing an example of a database that stores software project data in table form.
FIG. 2 is a block diagram showing an outlier evaluation apparatus according to the present invention.
FIGS. 3A to 3C are diagrams showing the outlier degree calculation process for numeric attributes according to the present invention.
FIGS. 4A and 4B are diagrams showing the outlier degree calculation process for categorical attributes according to the present invention.
FIGS. 5A and 5B are diagrams showing the outlier degree evaluation process for composite attributes according to the present invention.
FIG. 6 is a flowchart showing a data outlier evaluation method according to the present invention.

The foregoing and other objects, advantages, and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention pertains; the invention is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the invention. In this specification, the singular form also includes the plural unless the text specifically states otherwise. As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. FIG. 2 is a block diagram showing an outlier evaluation apparatus according to an embodiment of the present invention.

As shown in FIG. 2, the outlier evaluation apparatus 20 according to an embodiment of the present invention includes an input module 220, a distribution module 210, an evaluation module 230, a determination module 240, and a storage means 250.

The input module 220 receives, from a user terminal, input data to be entered into the database and temporarily stores it in a buffer (not shown). Here, the distribution module 210 may temporarily store the input data before integrity verification, prior to its storage in the database.

The database stores data for one or more attributes of at least one object in units of tables. The input data may be data entered into one row (or column) of a table. For example, if the attributes of a table are stored as columns, the input data may be one row of the table.

The distribution module 210 generates distribution information for the related attributes of a data table in the database, based on predefined association information.

To explain related attributes first: suppose the database contains attributes A (categorical), B (numeric), C (numeric), and D (numeric), where A and B are associated with each other and C and D are associated with each other. The distribution information of A and B is then classified as composite, a mixture of the categorical and numeric types, and the distribution information of C and D is classified as numeric.

Here, a numeric attribute is an attribute whose data consists of numbers, a categorical attribute is an attribute whose data is text-based, and the composite type covers related attributes that include both numeric and categorical types.
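
The type check described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names `attribute_type` and `related_group_type`, the column dictionary, and the rule "every value is a Python number" for the numeric type are all assumptions made for the example.

```python
from typing import Sequence


def attribute_type(values: Sequence[object]) -> str:
    """Classify one attribute: 'numeric' if every value is a number,
    otherwise 'categorical' (assumed rule; bools are not treated as numbers)."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numeric" if is_numeric else "categorical"


def related_group_type(columns: dict[str, Sequence[object]]) -> str:
    """Classify a group of mutually associated attributes as
    'numeric', 'categorical', or 'composite' (both kinds present)."""
    if not columns:
        raise ValueError("a related-attribute group needs at least one attribute")
    kinds = {attribute_type(values) for values in columns.values()}
    if kinds == {"numeric"}:
        return "numeric"
    if kinds == {"categorical"}:
        return "categorical"
    return "composite"


# Hypothetical data mirroring the A/B and C/D example in the text.
group_ab = {"A": ["web", "embedded"], "B": [12, 40]}
group_cd = {"C": [1.5, 2.0], "D": [30, 45]}
print(related_group_type(group_ab))  # composite
print(related_group_type(group_cd))  # numeric
```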

The distribution information includes at least one of cluster information of the related attributes and the frequencies of their possible combinations.

Meanwhile, when input data is determined to be normal data and is stored in the database, the distribution module 210 updates the distribution information for the related attributes that contain the input data.

The evaluation module 230 determines from the distribution information whether the attributes in the related attributes are of the numeric, categorical, or composite type. According to the identified type, it calculates the outlier degree of the input data using at least one of the distance between the input data and the attributes within the related attributes, as identified from the distribution information, and the combination frequency between the input data and the data corresponding to the attributes within the related attributes.

The evaluation module 230 includes first to third evaluation units 231 to 233, each of which calculates the outlier degree for one attribute type. Each evaluation unit is described below with reference to FIGS. 3A to 5B.

The determination module 240 compares the outlier degree of the input data with a preset threshold to determine whether the input data is outlier data, and notifies the user of the result.

Specifically, if the outlier degree is less than or equal to the preset threshold, the determination module 240 classifies the input data as outlier data. If the outlier degree exceeds the threshold, the determination module 240 classifies the input data as normal data.

Here, the threshold is the criterion for judging whether the input data is an outlier, and it may be calculated using at least one of the distance to, and the frequency of, normal data already entered in the database.

If the input data is classified as outlier data, the determination module 240 notifies the input source (an administrator or the like) of the determination result (input error). The determination module 240 may display the result, announce it by sound or the like, or send it by SMS or the like.

The input source can then recheck whether the input data is really outlier data and, if it is, correct the input data and feed back the corrected data. If the input data is not actually outlier data, the input source can instead feed back an instruction to store it without further editing.

The storage means 250 stores input data classified as normal data in the database.

In this way, by checking the difference in tendency between the input data and the data in the database, an embodiment of the present invention can notify the input source when the input data may be outlier data and help the input source recheck it for errors. An embodiment of the present invention can therefore improve the reliability and accuracy of the data in the database.

In addition, an embodiment of the present invention can determine whether input data is abnormal even when two or more attributes are combined in it.

Hereinafter, the evaluation module according to an embodiment of the present invention is described with reference to FIG. 2 and FIGS. 3A to 5B. FIGS. 3A to 3C show the outlier degree calculation process for numeric attributes, FIGS. 4A and 4B show the outlier degree calculation process for categorical attributes, and FIGS. 5A and 5B show the outlier degree evaluation process for composite attributes, each according to an embodiment of the present invention.

When the related attributes are numeric, as in FIG. 3A, the distribution module 210 clusters the data in the related attributes to generate at least one cluster, as in FIG. 3B, and provides distribution information that includes cluster information for the at least one cluster.

The first evaluation unit 231 selects the cluster C_i closest to the input data, using the distances between the input data and all clusters in the cluster information. It then applies the distance between that cluster and the input data to a fuzzy membership function to calculate, as given by Equation 1, the degree to which the input data belongs to the selected cluster C_i, and takes 1 minus this membership degree as the outlier degree.

Here, the fuzzy membership function μ_A(x) is a function whose values lie in the interval [0, 1], and it may have a form such as that shown in FIG. 3C. In other words, substituting the distance between the cluster and the input data into the fuzzy membership function converts it into a membership degree within the interval [0, 1].
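
The numeric case can be illustrated with the following minimal sketch. Several details are assumptions, since the text does not fix them: clusters are represented by centroids, distance is Euclidean, and the fuzzy membership function is a linearly decreasing one controlled by a hypothetical `width` parameter. Only the final step, outlier degree = 1 minus the membership degree of the nearest cluster, is taken directly from the description.

```python
import math
from typing import Sequence

Point = Sequence[float]


def nearest_cluster_distance(x: Point, centroids: Sequence[Point]) -> float:
    """Distance from x to the closest cluster, each cluster being
    represented by its centroid (assumed representation)."""
    return min(math.dist(x, c) for c in centroids)


def membership(distance: float, width: float) -> float:
    """Assumed fuzzy membership function: 1 at distance 0, falling
    linearly to 0 at distance >= width. Values stay within [0, 1]."""
    return max(0.0, 1.0 - distance / width)


def numeric_outlier_degree(x: Point, centroids: Sequence[Point], width: float) -> float:
    """Outlier degree = 1 - membership degree in the nearest cluster."""
    d = nearest_cluster_distance(x, centroids)
    return 1.0 - membership(d, width)


# Hypothetical cluster centroids for two related numeric attributes.
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(numeric_outlier_degree((0.0, 3.0), centroids, width=6.0))    # 0.5
print(numeric_outlier_degree((10.0, 10.0), centroids, width=6.0))  # 0.0
```

A point sitting on a centroid gets degree 0, and any point farther than `width` from every centroid gets degree 1; a different membership shape would change the values in between but not this overall behaviour.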

When the related attributes are categorical, as in FIG. 4A, the classification module () calculates at least one combination frequency for the possible combinations of the data of the related attributes and, as in FIG. 4B, provides distribution information that includes the calculated combination frequencies.

The second evaluation unit 232 uses the distribution information to calculate a normalized histogram over the possible combinations of the input data and the data of the related attributes. As given by Equation 2, it can then calculate the outlier degree of the input data from the difference between the number 1 and the frequency of the normalized histogram for the input data.

Because the histogram yields a plurality of frequency values, the second evaluation unit 232 may take the smallest of the resulting values as the outlier degree.

For reference, regions of the histogram with small frequency values, such as the one marked by the red ellipse in FIG. 4B, are regions that may be outliers. That is, such a region may be a combination that has not appeared before, or may contain data that has not yet been evaluated for outliers. The present invention therefore judges such data to be outlier data and asks the input source to recheck it.
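
A minimal sketch of the categorical case follows. The normalization scheme is an assumption (each combination count is divided by the total number of stored rows), as are the function names and the sample data; the description only states that the outlier degree comes from 1 minus the normalized histogram frequency of the input combination.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Combo = Tuple[str, ...]


def combination_histogram(rows: Iterable[Sequence[str]]) -> Dict[Combo, float]:
    """Count each value combination of the related categorical attributes
    and normalize by the total row count (assumed normalization)."""
    counts = Counter(tuple(row) for row in rows)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


def categorical_outlier_degree(x: Sequence[str], hist: Dict[Combo, float]) -> float:
    """Outlier degree = 1 - normalized frequency of the input combination.
    A combination never seen before has frequency 0, hence degree 1."""
    return 1.0 - hist.get(tuple(x), 0.0)


# Hypothetical stored rows for two related categorical attributes.
stored = [("web", "java")] * 3 + [("web", "python")]
hist = combination_histogram(stored)
print(categorical_outlier_degree(("web", "java"), hist))  # 0.25
print(categorical_outlier_degree(("db", "cobol"), hist))  # 1.0
```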

When the related attributes are composite, as in FIG. 5A, the distribution module 210 stratifies the data of the related attributes by categorical attribute value, as in FIG. 5B, to form at least one layer. It then clusters the data corresponding to the numeric attribute values in each layer and provides distribution information that includes at least one piece of cluster information.

The third evaluation unit 233 then refers to the distribution information and calculates the outlier degree of the input data for the at least one cluster in the cluster information, in the same way as when the related attributes are numeric. Specifically, the third evaluation unit 233 refers to the cluster information to calculate the distance between the clusters of each layer and the input data, and can calculate the outlier degree of the input data by applying the fuzzy membership function of Equation 1 to the distance to the cluster, among the clusters of each layer, that is closest to the input data.
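
The composite case can be sketched as below, under clearly labeled assumptions: each layer is summarized by a single centroid (the mean of its numeric values) as a stand-in for a real clustering step, the input is compared only against the layer matching its own categorical value, an unseen category is given degree 1, and the membership function is the same assumed linear one with a hypothetical `width` parameter. None of these choices is specified by the text.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]  # (categorical value, numeric value)


def stratify(rows: Iterable[Row]) -> Dict[str, List[float]]:
    """Group numeric values into layers keyed by the categorical value."""
    layers: Dict[str, List[float]] = defaultdict(list)
    for category, value in rows:
        layers[category].append(value)
    return dict(layers)


def layer_centroids(layers: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Stand-in for clustering: one centroid (the mean) per layer."""
    return {category: [fmean(values)] for category, values in layers.items()}


def composite_outlier_degree(x: Row, centroids: Dict[str, List[float]], width: float) -> float:
    """1 - membership in the nearest cluster of the input's own layer."""
    category, value = x
    if category not in centroids:
        return 1.0  # assumed handling of a category with no layer
    d = min(abs(value - c) for c in centroids[category])
    return 1.0 - max(0.0, 1.0 - d / width)


# Hypothetical rows: project size class and effort in person-months.
rows = [("small", 10.0), ("small", 12.0), ("large", 100.0), ("large", 104.0)]
cents = layer_centroids(stratify(rows))
print(composite_outlier_degree(("small", 13.0), cents, width=4.0))   # 0.5
print(composite_outlier_degree(("small", 100.0), cents, width=4.0))  # 1.0
```

The second call shows why stratification matters in this sketch: 100.0 is typical for the "large" layer but far from the only cluster of the "small" layer.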

In this way, an embodiment of the present invention detects input data as outlier data when its tendency differs by a certain amount from that of the data in the database, so that outlier data can be filtered out at the input stage. An embodiment of the present invention can therefore improve the quality of the data in the database and ensure its reliability.

Hereinafter, a data outlier evaluation method according to an embodiment of the present invention is described with reference to FIG. 6. FIG. 6 is a flowchart showing the data outlier evaluation method according to an embodiment of the present invention.

Referring to FIG. 6, if there is input data to be entered into the database (Yes in S610), the evaluation module 230 checks whether the attributes related to the input data are composite attributes (S620). At this point, the evaluation module 230 may receive the distribution information of the attributes related to the input data from the update module ().

If the related attributes of the input data are not composite, the evaluation module 230 checks whether they are categorical (S630).

For categorical attributes among the related attributes, the evaluation module 230 calculates the outlier degree of the input data using histogram frequencies (S640).

For composite attributes among the related attributes, the evaluation module 230 stratifies the data to form at least one layer and calculates the outlier degree using the difference (distance) between the input data and the clusters of each layer (S650).

For numeric attributes among the related attributes, the evaluation module 230 calculates the outlier degree of the input data using the difference between the input data and the clusters (S660).
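
The branching of steps S620 to S660 can be summarized in a short sketch. The function name `evaluation_step` and the string labels for the attribute types are hypothetical; the order of the checks (composite first, then categorical, with numeric as the remaining case) follows the flow described above.

```python
def evaluation_step(group_type: str) -> str:
    """Return the step of FIG. 6 that computes the outlier degree
    for a related-attribute group of the given type."""
    if group_type == "composite":      # S620: composite attributes?
        return "S650"                  # stratify, then cluster distance per layer
    if group_type == "categorical":    # S630: categorical attributes?
        return "S640"                  # histogram frequency
    if group_type == "numeric":
        return "S660"                  # distance to clusters
    raise ValueError(f"unknown attribute type: {group_type!r}")


for kind in ("composite", "categorical", "numeric"):
    print(kind, "->", evaluation_step(kind))
```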

The determination module 240 checks whether the calculated outlier degree is less than or equal to the threshold (S670). If a plurality of outlier degrees has been calculated, the determination module 240 may compare each of them with the threshold and detect the outlier data among them.

If the outlier degree is less than or equal to the threshold, the determination module 240 determines the input data to be outlier data (S680). The determination module 240 may then inform the user that the input data is outlier data, and may indicate which of the attributes of the input data was judged to be an outlier.

If the outlier degree exceeds the threshold, the determination module 240 determines the input data to be normal data (S690).

In this way, an embodiment of the present invention detects input data as outlier data when its tendency differs by a certain amount from that of the data in the database, so that outlier data can be filtered out at the input stage. An embodiment of the present invention can therefore improve the quality of the data in the database and ensure its reliability.

The configuration of the present invention has been described in detail above with reference to the accompanying drawings, but this is only an example, and those of ordinary skill in the art to which the invention pertains can of course make various modifications and changes within the scope of its technical idea. The scope of protection of the present invention should therefore not be limited to the embodiments described above but should be determined by the claims below.

Claims

1. A data outlier evaluation apparatus for input data entered into a database that stores data in table form, the apparatus comprising:
a distribution module that, when the input data is entered, provides distribution information on mutually associated related attributes; and
an evaluation module that uses the distribution information to identify whether the related attributes are of a numeric type, a categorical type, or a composite type and, according to the type of the related attributes, evaluates the outlier degree of the input data using at least one of a distance between the input data and an attribute within the related attributes and a combination frequency between the input data and data of the related attributes.

2. The data outlier evaluation apparatus of claim 1,
wherein, if the related attributes are of the numeric type, the distribution module provides the distribution information including at least one piece of cluster information generated by clustering the data corresponding to the related attributes, and
wherein the evaluation module includes a first evaluation unit that refers to the cluster information to calculate the distance between the at least one cluster and the input data, selects the cluster closest to the input data using the calculated distance, and calculates the outlier degree of the input data using the degree to which the input data belongs to the selected cluster.

3. The data outlier evaluation apparatus of claim 2,
wherein the first evaluation unit calculates a membership degree of the input data in the selected cluster by applying a fuzzy membership function to the distance between the input data and the selected cluster, and calculates the outlier degree of the input data by subtracting the membership degree from 1.

4. The data outlier evaluation apparatus of claim 1,
wherein, if the related attributes are of the categorical type, the distribution module provides the distribution information including at least one piece of combination frequency information for the possible combinations of the related attributes, and
wherein the evaluation module includes a second evaluation unit that calculates, using the combination frequency information, at least one frequency for the possible combinations of the input data and the related attributes, and calculates the outlier degree of the input data using the at least one frequency.

5. The data outlier evaluation apparatus of claim 4,
wherein the second evaluation unit calculates a normalized histogram using the at least one combination frequency and calculates, as the outlier degree, the result of subtracting each frequency of the histogram from 1.

6. The data outlier evaluation apparatus of claim 1,
wherein, if the attributes within the related attributes are of the composite type in which the numeric type and the categorical type are mixed, the distribution module stratifies the data corresponding to the related attributes by the categorical attribute values among the related attributes to form at least one attribute layer, and provides distribution information including at least one piece of cluster information obtained by clustering the data corresponding to the numeric attribute values in each of the at least one attribute layer, and
wherein the evaluation module calculates the outlier degree using the distance between the input data and each cluster given by the at least one piece of cluster information.

7. The data outlier evaluation apparatus of any one of claims 1 to 6,
further comprising a determination module that compares the outlier degree with a preset threshold to determine whether the input data is outlier data.

8. The data outlier evaluation apparatus of claim 1,
wherein, when the input data is determined to be normal data and is stored in the database, the distribution module generates the distribution information including the input data.

9. An outlier evaluation method, performed by at least one processor, for input data entered into a database, the method comprising:
providing, when the input data is entered, distribution information on mutually associated related attributes;
identifying, using the distribution information, whether the related attributes are of a numeric type, a categorical type, or a composite type; and
evaluating, according to the type of the related attributes, the outlier degree of the input data using at least one of a distance between the input data and an attribute within the related attributes and a combination frequency between the input data and data of the related attributes.

10. The method of claim 9,
wherein the providing includes, if the related attributes are of the numeric type, providing the distribution information including at least one piece of cluster information generated by clustering the data corresponding to the related attributes, and
wherein the evaluating includes referring to the cluster information to calculate the distance between the at least one cluster and the input data, selecting the cluster closest to the input data using the calculated distance, and calculating the outlier degree of the input data using the degree to which the input data belongs to the selected cluster.

11. The method of claim 9,
wherein the providing includes, if the related attributes are of the categorical type, providing the distribution information including at least one piece of combination frequency information for the possible combinations of the related attributes, and
wherein the evaluating includes calculating, using the combination frequency information, at least one frequency for the possible combinations of the input data and the related attributes, and calculating the outlier degree of the input data using the at least one frequency.

12. The method of claim 9,
wherein the providing includes, if the attributes within the related attributes are of the composite type in which the numeric type and the categorical type are mixed, stratifying the data corresponding to the related attributes by the categorical attribute values among the related attributes to form at least one attribute layer, and providing distribution information including at least one piece of cluster information obtained by clustering the data corresponding to the numeric attribute values in each of the at least one attribute layer, and
wherein the evaluating includes calculating the outlier degree using the distance between the input data and each cluster given by the at least one piece of cluster information.