KR20160104945A - Apparatus and Method for Evaluating Outlierness based on Data Association - Google Patents

Apparatus and Method for Evaluating Outlierness based on Data Association

Publication number
KR20160104945A
KR20160104945A KR1020150027914A KR20150027914A KR20160104945A KR 20160104945 A KR20160104945 A KR 20160104945A KR 1020150027914 A KR1020150027914 A KR 1020150027914A KR 20150027914 A KR20150027914 A KR 20150027914A KR 20160104945 A KR20160104945 A KR 20160104945A
Authority
KR
South Korea
Prior art keywords
input data
data
attribute
information
outlier
Prior art date
Application number
KR1020150027914A
Other languages
Korean (ko)
Other versions
KR101692611B1 (en)
Inventor
이건명
Original Assignee
충북대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 충북대학교 산학협력단
Priority to KR1020150027914A
Publication of KR20160104945A
Application granted
Publication of KR101692611B1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)

Abstract

The present invention discloses an apparatus and method for evaluating outlierness based on data association. According to one embodiment, the apparatus evaluates the outlierness of input data that is entered into a database storing data in table form, and comprises: a distribution module that provides distribution information on correlated related attributes when the input data is input; and an evaluation module that uses the distribution information to check whether the related attributes are of the numerical, categorical, or complex type and, according to that type, evaluates the degree of outlierness of the input data using at least one of a distance between the input data and the attributes in the related attributes and a frequency of the combination of the input data and the related attributes.

Description

Apparatus and Method for Evaluating Outlierness based on Data Association

The present invention relates to an outlier detection technique and, more particularly, to a data association based outlier assessment apparatus and method that utilizes associations between data attributes.

The data in the database may consist of a plurality of instances and attributes that distinguish each instance. As an example, as shown in FIG. 1, the database may include software project data composed of 11 instances and 5 attributes. Here, an outlier is an instance including an abnormal value in the attribute, and an attribute including an abnormal value is called an abnormal attribute.

When data in a database is used for decision making, its quality is very important. In practice, however, erroneous data is inevitably entered or collected into the database because of practitioners' mistakes.

Therefore, much research in fields such as bioinformatics and data mining aims to find logical errors in database data (hereinafter referred to as outlier determination).

Typical outlier detection techniques include the PANDA technique and the AOI technique. The PANDA technique determines the outlier rank by the sum of the noise factors over all attributes of each instance in the database. The AOI technique calculates the sum of the noise factors for each instance with a given attribute included and with it excluded, and determines the abnormality of that attribute from the difference between the two.

However, these conventional outlier determination techniques are aimed at finding a specific instance that contains erroneous information, or at determining that a data attribute value is inappropriate.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a data-association-based outlier evaluation apparatus and method that can evaluate the degree of outlierness of input data by using associations between data attributes.

The objects of the present invention are not limited to the above-mentioned objects, and other objects not mentioned can be clearly understood by those skilled in the art from the following description.

According to one aspect of the present invention, an apparatus for evaluating the outlierness of input data to be entered into a database that stores data in table form includes: a distribution module that provides distribution information on correlated related attributes when the input data is input; and an evaluation module that uses the distribution information to check whether the related attributes are of the numerical type, the categorical type, or the complex type and, according to the type of the related attributes, evaluates the degree of outlierness of the input data using at least one of a distance between the input data and the attributes in the related attributes and a combination frequency between the input data and the data of the related attributes.

According to another aspect of the present invention, a method for evaluating the outlierness of input data entered into a database, performed by at least one processor, includes: providing distribution information on correlated related attributes when the input data is input; checking, using the distribution information, whether the related attributes are of the numerical type, the categorical type, or the complex type; and evaluating the degree of outlierness of the input data, according to the type of the related attributes, using at least one of a distance between the input data and the attributes in the related attributes and a combination frequency between the input data and the data of the related attributes.

According to the present invention, the degree of outlierness of input data can be evaluated by using the associations between data attributes.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example of a database storing software project data in table form.
FIG. 2 is a configuration diagram showing an outlier evaluation apparatus according to the present invention.
FIGS. 3A to 3C are diagrams illustrating a process of calculating an outlier value for a numerical attribute according to the present invention.
FIGS. 4A and 4B are diagrams illustrating a process of calculating an outlier value for a categorical attribute according to the present invention.
FIGS. 5A and 5B are diagrams illustrating an outlier evaluation process for a complex attribute according to the present invention.
FIG. 6 is a flowchart illustrating a data outlier evaluation method according to the present invention.

The advantages and features of the present invention, and methods for achieving them, will become apparent with reference to the embodiments described in detail below together with the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art, and the invention is defined only by the scope of the claims. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In the present specification, the singular form includes the plural form unless otherwise specified. As used herein, the terms "comprises" and/or "comprising" do not exclude the presence or addition of one or more other components, steps, operations, and/or elements besides those stated.

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. FIG. 2 is a configuration diagram showing an outlier evaluation apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the outlier evaluation apparatus 20 according to the embodiment of the present invention includes an input module 220, a distribution module 210, an evaluation module 230, a determination module 240, and a storage unit 250.

The input module 220 receives, from a user terminal, input data to be entered into the database and temporarily stores it in a buffer (not shown). Here, the distribution module 210 may temporarily store the input data before its integrity is verified.

At this time, the database stores one or more attribute-specific data for at least one object in units of tables. The input data may be data input in one row (or column) of the table. For example, if an attribute in a table is stored in units of columns, the input data can be one row in the table.

The distribution module 210 generates distribution information for the related attributes of the data table in the database based on the predetermined association information.

First, the related attributes are explained. Suppose that attributes A (categorical), B (numerical), C (numerical), and D (numerical) exist in the database, that A and B are correlated, and that C and D are correlated. In this case, the distribution information of A and B is classified as the complex type, in which the categorical type and the numerical type are mixed, and the distribution information of C and D is classified as the numerical type.

Here, a numerical attribute is an attribute in which data is a number, a categorical attribute is an attribute in which data is text-based, and a complex type is an attribute including a numeric type and a categorical type.
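The A, B, C, D example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `related_type`, the dictionary `kinds`, and the list `associations` are hypothetical and do not appear in the specification, which does not say how the association information is stored.

```python
from typing import Dict, Iterable


def related_type(kinds: Dict[str, str], group: Iterable[str]) -> str:
    """Return 'numerical', 'categorical', or 'complex' for one group of related attributes."""
    seen = {kinds[name] for name in group}
    if not seen:
        raise ValueError("a group of related attributes must not be empty")
    if seen == {"numerical"}:
        return "numerical"
    if seen == {"categorical"}:
        return "categorical"
    # Any mixture of numerical and categorical attributes is the complex type.
    return "complex"


# Attribute kinds and predetermined associations from the example in the text.
kinds = {"A": "categorical", "B": "numerical", "C": "numerical", "D": "numerical"}
associations = [("A", "B"), ("C", "D")]

types = {group: related_type(kinds, group) for group in associations}
```

With these inputs, the pair (A, B) is classified as complex and the pair (C, D) as numerical, matching the classification described above.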

Further, the distribution information includes at least one of the cluster information of the related attributes and the combination frequencies.

Meanwhile, when the input data is determined to be normal data and is stored in the database, the distribution module 210 updates the distribution information on the related attributes so that it includes the input data.

The evaluation module 230 checks from the distribution information whether the related attributes are numerical, categorical, or complex and, according to the identified type, calculates the degree of outlierness of the input data using at least one of the distance between the input data and the attributes in the related attributes and the combination frequency between the input data and the data corresponding to the attributes in the related attributes.

The evaluation module 230 includes first to third evaluation units 231 to 233 for calculating the degree of outliers corresponding to each attribute type. Each evaluating unit will be described later with reference to Figs. 3A to 5B.

The determination module 240 compares the outlier value of the input data with a preset threshold to determine whether the input data is outlier data, and reports the determination result to the user.

More specifically, the determination module 240 classifies the input data as outlier data if the outlier value is below the preset threshold. On the other hand, if the outlier value exceeds the threshold, the determination module 240 classifies the input data as normal data.

Here, the threshold, which is the criterion for determining whether the input data is abnormal, can be calculated using at least one of the distance and the frequency of the normal data already entered in the database.

If the input data is classified as outlier data, the determination module 240 notifies the input source (an administrator or the like) of the determination result (input error). At this time, the determination module 240 may display the determination result, announce it by sound or the like, or send it by SMS or the like.

The input source then rechecks whether the input data is actually outlier data. If it is, the input source can correct the input data and feed back the corrected input data. If it is not, the input source can feed back an instruction to store the input data without further editing.

The storage unit 250 stores the input data classified as normal data in the database.

As described above, according to the embodiment of the present invention, the difference in tendency between the input data and the data in the database is checked, and if the input data may be outlier data, the input source can be prompted to recheck the input data. Therefore, embodiments of the present invention can improve the reliability and accuracy of the data in a database.

In addition, the embodiment of the present invention can determine whether an abnormality is present even if two or more attributes are combined in the input data.

Hereinafter, an evaluation module according to an embodiment of the present invention will be described with reference to FIG. 2 and FIGS. 3A to 5B. FIGS. 3A to 3C are diagrams illustrating an outlier calculation process for a numeric attribute according to an exemplary embodiment of the present invention. FIGS. 4A and 4B are diagrams illustrating an outlier calculation process for a categorical attribute according to an embodiment of the present invention. And FIGS. 5A and 5B are diagrams illustrating an outlier evaluation process for a complex attribute according to an embodiment of the present invention.

As shown in FIG. 3A, when the related attributes are of the numerical type, the distribution module 210 generates at least one cluster by clustering the data of the related attributes, as shown in FIG. 3B, and provides distribution information including the cluster information for the at least one cluster.
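The specification does not name a clustering algorithm. As a minimal sketch only, the following assumes a plain k-means over the rows of the numerical related attributes and uses the resulting centroids as the cluster information. The function `kmeans`, the seeding by the first `k` rows, and the sample `rows` are all hypothetical choices made for this illustration.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, iterations: int = 20) -> List[Point]:
    """Group points into k clusters and return the cluster centroids."""
    # Assumption: the first k rows seed the centroids, which keeps the sketch deterministic.
    centroids: List[Point] = list(points[:k])
    for _ in range(iterations):
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep it if the group is empty.
        updated = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Two numerical related attributes with two visibly separate groups of rows.
rows = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 10.0), (10.0, 9.0)]
cluster_info = kmeans(rows, k=2)
```

Any other clustering method could stand in here; what matters for the later steps is only that the distribution information exposes something a distance to each cluster can be measured against.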

The first evaluation unit 231 selects the cluster C_i closest to the input data, using the distances between the input data and all clusters in the cluster information. The first evaluation unit 231 then applies the distance between the selected cluster C_i and the input data x to the fuzzy membership function to obtain the degree \mu_{C_i}(x) to which the input data belongs to the selected cluster, and calculates the outlier value as the difference between 1 and that degree, as in Equation 1.

[Equation 1]
$$\mathrm{outlier}(x) = 1 - \mu_{C_i}(x)$$

Here, the fuzzy membership function \mu_A(x) takes values in the interval [0, 1], as shown in FIG. 3C. In other words, by applying the distance between a cluster and the input data to the fuzzy membership function, the degree to which the input data belongs to that cluster is mapped into the interval [0, 1].
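The numerical case can be sketched as follows. The shape of the membership function is shown only in the drawings, so the linear ramp below is an assumption: membership is 1 up to a distance `full` and falls linearly to 0 at a distance `zero`. The names `membership`, `numeric_outlier_value`, `full`, and `zero`, and the use of centroid distance as the cluster distance, are hypothetical.

```python
from math import dist
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def membership(distance: float, full: float, zero: float) -> float:
    """Hypothetical ramp: 1 up to `full`, then linearly down to 0 at `zero` (zero > full)."""
    if distance <= full:
        return 1.0
    if distance >= zero:
        return 0.0
    return (zero - distance) / (zero - full)


def numeric_outlier_value(
    x: Point, centroids: Sequence[Point], full: float = 1.0, zero: float = 5.0
) -> float:
    """Outlier value = 1 - membership of x in its closest cluster."""
    # Assumption: the distance to a cluster is the distance to its centroid.
    nearest = min(dist(x, c) for c in centroids)
    return 1.0 - membership(nearest, full, zero)


centroids = [(0.0, 0.0), (10.0, 10.0)]
near = numeric_outlier_value((0.0, 0.5), centroids)  # distance 0.5, inside the full range
mid = numeric_outlier_value((3.0, 0.0), centroids)   # distance 3.0, on the ramp
far = numeric_outlier_value((5.0, 5.0), centroids)   # distance about 7.07, beyond the ramp
```

Under these assumed parameters, `near` is 0.0, `mid` is 0.5, and `far` is 1.0, so the value grows as the input moves away from every known cluster.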

As shown in FIG. 4A, when the related attributes are categorical, the distribution module 210 calculates a combination frequency for each possible combination of the data of the related attributes, and provides distribution information including the calculated combination frequencies, as shown in FIG. 4B.

The second evaluation unit 232 uses the distribution information to calculate a normalized histogram \hat{h} of the possible combinations between the input data and the data of the related attributes, and can calculate the degree of outlierness of the input data as the difference between 1 and the normalized histogram frequency of the combination c_x that corresponds to the input data x, as in Equation 2.

[Equation 2]
$$\mathrm{outlier}(x) = 1 - \hat{h}(c_x)$$

Here, because the histogram yields a plurality of frequency values, the second evaluation unit 232 can use the smallest of them in calculating the outlier value.

For reference, a portion of the histogram with a small frequency value, such as the one marked by the red ellipse in FIG. 4B, is likely to correspond to outliers. In other words, that portion may contain data that has not appeared before or that has not yet undergone outlier evaluation. Accordingly, in the present invention, the corresponding data is determined to be outlier data, and the input source is requested to confirm it again.
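A minimal sketch of the categorical case follows. The specification does not say how the histogram is normalized, so dividing each count by the largest count is an assumption, as is scoring a never-seen combination with frequency 0. The function names and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Combo = Tuple[Hashable, ...]


def combination_frequencies(rows: Iterable[Combo]) -> Dict[Combo, float]:
    """Count each combination of categorical values and normalize to [0, 1]."""
    counts = Counter(rows)
    if not counts:
        return {}
    # Assumption: normalize by the most frequent combination.
    top = max(counts.values())
    return {combo: n / top for combo, n in counts.items()}


def categorical_outlier_value(combo: Combo, freqs: Dict[Combo, float]) -> float:
    """Outlier value = 1 - normalized frequency; unseen combinations count as frequency 0."""
    return 1.0 - freqs.get(combo, 0.0)


rows = [("web", "java")] * 6 + [("web", "python")] * 3 + [("embedded", "c")] * 6
freqs = combination_frequencies(rows)

common = categorical_outlier_value(("web", "java"), freqs)
rare = categorical_outlier_value(("web", "python"), freqs)
unseen = categorical_outlier_value(("embedded", "java"), freqs)
```

Here `common` is 0.0, `rare` is 0.5, and `unseen` is 1.0. Normalizing by the total count instead would be an equally plausible reading and would only rescale these values.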

As shown in FIG. 5A, when the related attributes are of the complex type, the distribution module 210 constructs at least one layer by stratifying the related-attribute data according to the categorical attribute value among the related attributes, as shown in FIG. 5B, and provides distribution information including at least one piece of cluster information obtained by clustering the data corresponding to the numerical attribute values in each layer.

Then, the third evaluation unit 233 refers to the distribution information and calculates the outlier value of the input data, for the at least one cluster in the cluster information, in the same manner as when the related attributes are numerical. Specifically, the third evaluation unit 233 refers to the cluster information, calculates the distance between the clusters of each layer and the input data, and calculates the degree of outlierness of the input data by applying the distance between the closest cluster and the input data to the same fuzzy membership function used for the numerical type.
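The layered evaluation can be sketched as below. Several points are assumptions made only for illustration: the layer is selected by the categorical value of the input data, an unseen categorical value yields the maximum outlier value, clusters are represented by centroids, and the membership function is the same hypothetical linear ramp with parameters `full` and `zero`.

```python
from math import dist
from typing import Dict, Hashable, Sequence, Tuple

Point = Tuple[float, ...]


def complex_outlier_value(
    category: Hashable,
    x: Point,
    layers: Dict[Hashable, Sequence[Point]],
    full: float = 1.0,
    zero: float = 5.0,
) -> float:
    """Score the numerical part x against the clusters of the layer for `category`."""
    centroids = layers.get(category)
    if not centroids:
        # Assumption: a category with no layer is treated as maximally unusual.
        return 1.0
    d = min(dist(x, c) for c in centroids)
    # Hypothetical ramp membership: 1 up to `full`, 0 from `zero` on (zero > full).
    if d <= full:
        mu = 1.0
    elif d >= zero:
        mu = 0.0
    else:
        mu = (zero - d) / (zero - full)
    return 1.0 - mu


# One layer per categorical value, each holding centroids of the numerical attribute.
layers = {"small": [(2.0,)], "large": [(20.0,)]}

typical = complex_outlier_value("small", (2.5,), layers)
mismatch = complex_outlier_value("small", (20.0,), layers)
```

The value 20.0 is ordinary within the "large" layer but scores 1.0 when paired with "small", which is the kind of cross-attribute inconsistency the layering is meant to expose.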

As described above, according to the embodiment of the present invention, when the tendency of the input data differs from that of the data in the database, the input data can be detected as outlier data and filtered out at the input stage. Therefore, the embodiment of the present invention can improve the quality of the data in the database and ensure its reliability.

Hereinafter, a data outlier evaluation method according to an embodiment of the present invention will be described with reference to FIG. 6. FIG. 6 is a flowchart illustrating a data outlier evaluation method according to an embodiment of the present invention.

Referring to FIG. 6, if there is input data to be input to the database (YES in S610), the evaluation module 230 confirms whether the attribute related to the input data is a complex attribute (S620). At this time, the evaluation module 230 can receive the distribution information of the attribute related to the input data from the update module ().

If the related attribute of the input data is not a composite attribute, the evaluation module 230 confirms whether the related attribute of the input data is a categorical attribute (S630).

If the related attribute is a categorical attribute, the evaluation module 230 calculates the outlier value of the input data using the histogram frequency for the categorical attributes among the related attributes (S640).

If the related attribute is a complex attribute, the evaluation module 230 forms at least one layer for the complex attribute among the related attributes and calculates the outlier value using the difference (distance) between the input data and the clusters of each layer (S650).

If the related attribute is neither complex nor categorical, that is, if it is numerical, the evaluation module 230 calculates the outlier value of the input data using the difference between the input data and the clusters for the numerical attribute among the related attributes (S660).

The determination module 240 determines whether the calculated outlier value is below the threshold (S670). At this time, when a plurality of outlier values have been calculated, the determination module 240 can compare each outlier value with the threshold and detect the outlier data among them.

If the outlier value is below the threshold, the determination module 240 determines the input data to be outlier data (S680). At this time, the determination module 240 can inform the user that the input data is outlier data, and can indicate which of the attributes of the input data was determined to be an outlier.

If the outlier value exceeds the threshold value, the determination module 240 determines the input data as normal data (S690).

As described above, according to the embodiment of the present invention, when the tendency of the input data differs from that of the data in the database, the input data can be detected as outlier data and filtered out at the input stage. Therefore, the embodiment of the present invention can improve the quality of the data in the database and ensure its reliability.

While the present invention has been described in detail with reference to the accompanying drawings, it is to be understood that the invention is not limited to the above-described embodiments. Those skilled in the art will appreciate that various modifications and variations are, of course, possible. Accordingly, the scope of protection of the present invention should not be limited to the above-described embodiments, but should be determined by the following claims.

Claims (12)

1. A data outlier evaluation apparatus for input data to be input into a database that stores data in table form, the apparatus comprising:
a distribution module that provides distribution information on correlated related attributes when the input data is input; and
an evaluation module that uses the distribution information to check whether the related attributes are of a numerical type, a categorical type, or a complex type and, according to the type of the related attributes, evaluates the degree of outlierness of the input data using at least one of a distance between the input data and the attributes in the related attributes and a combination frequency between the input data and the data of the related attributes.
2. The apparatus of claim 1,
wherein, if the related attributes are of the numerical type, the distribution module provides the distribution information including at least one piece of cluster information generated by clustering the data corresponding to the related attributes, and
wherein the evaluation module includes a first evaluation unit that refers to the cluster information to calculate the distance between the at least one cluster and the input data, selects the one cluster closest to the input data using the calculated distance, and calculates the outlier value of the input data using the degree to which the input data belongs to the selected cluster.
3. The apparatus of claim 2,
wherein the first evaluation unit calculates a membership degree of the input data for the one cluster by applying a fuzzy membership function to the distance between the input data and the one cluster, and calculates the outlier value of the input data by subtracting the membership degree from 1.
4. The apparatus of claim 1,
wherein, if the related attributes are of the categorical type, the distribution module provides the distribution information including at least one piece of combination frequency information for the possible combinations of the related attributes, and
wherein the evaluation module includes a second evaluation unit that uses the combination frequency information to calculate at least one frequency for the possible combinations of the input data and the related attributes, and calculates the outlier value of the input data using the at least one frequency.
5. The apparatus of claim 4,
wherein the second evaluation unit calculates a normalized histogram using the at least one combination frequency, and calculates, as the outlier value, the result of subtracting each frequency of the histogram from 1.
6. The apparatus of claim 1,
wherein, if the related attributes are of the complex type in which the numerical type and the categorical type are mixed, the distribution module forms at least one attribute layer by classifying the data corresponding to the related attributes based on the categorical attribute value among the related attributes, and provides distribution information including at least one piece of cluster information obtained by clustering the data corresponding to the numerical attribute values in each layer of the at least one attribute layer, and
wherein the evaluation module calculates the outlier value using the distance between the input data and the clusters identified by the at least one piece of cluster information.
7. The apparatus of any one of claims 1 to 6,
further comprising a determination module that determines whether the input data is outlier data by comparing the outlier value with a preset threshold.
8. The apparatus of claim 1,
wherein, when the input data is determined to be normal data, the distribution module stores the input data in the database and generates the distribution information including the input data.
9. A data outlier evaluation method for input data to be input into a database, performed by at least one processor, the method comprising:
providing distribution information on correlated related attributes when the input data is input;
checking, using the distribution information, whether the related attributes are of a numerical type, a categorical type, or a complex type; and
evaluating an outlier value of the input data, according to the type of the related attributes, using at least one of a distance between the input data and the attributes in the related attributes and a combination frequency between the input data and the data of the related attributes.
10. The method of claim 9,
wherein the providing includes, if the related attributes are of the numerical type, providing the distribution information including at least one piece of cluster information generated by clustering the data corresponding to the related attributes, and
wherein the evaluating includes referring to the cluster information to calculate the distance between the at least one cluster and the input data, selecting the one cluster closest to the input data using the calculated distance, and calculating the degree of outlierness of the input data using the degree to which the input data belongs to the selected cluster.
11. The method of claim 9,
wherein the providing includes, if the related attributes are of the categorical type, providing the distribution information including at least one piece of combination frequency information for the possible combinations of the related attributes, and
wherein the evaluating includes calculating at least one frequency for the possible combinations of the input data and the related attributes using the combination frequency information, and calculating the outlier value of the input data using the at least one frequency.
12. The method of claim 9,
wherein the providing includes, if the related attributes are of the complex type in which the numerical type and the categorical type are mixed, forming at least one attribute layer by classifying the data corresponding to the related attributes based on the categorical attribute value among the related attributes, and providing distribution information including at least one piece of cluster information obtained by clustering the data corresponding to the numerical attribute values in each layer of the at least one attribute layer, and
wherein the evaluating includes calculating the outlier value using the distance between the input data and the clusters identified by the at least one piece of cluster information.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150027914A KR101692611B1 (en) 2015-02-27 2015-02-27 Apparatus and Method for Evaluating Outlierness based on Data Association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150027914A KR101692611B1 (en) 2015-02-27 2015-02-27 Apparatus and Method for Evaluating Outlierness based on Data Association

Publications (2)

Publication Number Publication Date
KR20160104945A 2016-09-06
KR101692611B1 2017-01-17

Family

ID=56945872

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150027914A KR101692611B1 (en) 2015-02-27 2015-02-27 Apparatus and Method for Evaluating Outlierness based on Data Association

Country Status (1)

Country Link
KR (1) KR101692611B1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020241959A1 (en) * 2019-05-31 2020-12-03 주식회사 포스코아이씨티 System for detecting abnormal control data
CN112101765A (en) * 2020-09-08 2020-12-18 国网山东省电力公司菏泽供电公司 Abnormal data processing method and system for operation index data of power distribution network

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20230166608A (en) 2022-05-31 2023-12-07 삼성에스디에스 주식회사 Method for detecting outlier and system thereof
KR102470763B1 (en) 2022-10-13 2022-11-25 주식회사 비플컨설팅 Data outlier detection apparatus and method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110067647A (en) * 2009-12-15 2011-06-22 한국과학기술원 Pattern-based method and apparatus of identifying data with abnormal attributes

Also Published As

Publication number Publication date
KR101692611B1 (en) 2017-01-17

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant