WO2016195421A1

WO2016195421A1 - Method, system and non-transitory computer-readable recording medium for providing data profiling based on statistical analysis

Info

Publication number: WO2016195421A1
Application number: PCT/KR2016/005920
Authority: WO
Inventors: 장원중
Original assignee: 장원중
Priority date: 2015-06-04
Filing date: 2016-06-03
Publication date: 2016-12-08
Also published as: KR101632073B1

Abstract

According to one aspect of the present invention, provided is a method for providing data profiling based on a statistical analysis, comprising the steps of: calculating, on the basis of data belonging to each of attributes defined in a data set, at least one statistical value related to each of the attributes; determining a weight imparted to each of the attributes with reference to the at least one statistical value which has been calculated; and determining, among the attributes defined in the data set, at least one attribute having the weight greater than a preset level as an attribute subject to data profiling.

Description

Method, system and non-transitory computer readable recording medium for providing statistical analysis based data profiling

The present invention relates to a method, system and non-transitory computer readable recording medium for providing statistical analysis based data profiling.

In recent years, the volume of data generated through email, social network services (SNS), multimedia, mobile devices, and the Internet of Things (IoT) has been increasing rapidly, and the total amount of information has already exceeded the zettabyte (10²¹ bytes) level.

In addition, technologies for analyzing and utilizing big data are being actively researched across industries and have become a topic of worldwide interest. In Korea, the government is pursuing data sharing through the active opening of public information under its Government 3.0 initiative. A variety of technologies for providing useful services to users by exploiting such massive amounts of data are also being developed.

In such a situation, it is necessary to ensure the reliability of data quality. One example of a data quality diagnosis technique introduced in the related art is to diagnose data quality (that is, to perform data profiling) on only those attributes, among all attributes, that are considered important for the intended use of the data. With this conventional technique, however, the data profiling result may vary greatly depending on the subjective judgment of the administrator. For example, suppose a data set is used to mail promotional materials to customers. In this case, the attribute called address may be judged the most important, and data profiling may be performed only on the data belonging to that attribute. However, it is difficult to assume that the data set will be used only for mailing, and the judgment that the address attribute is important is merely the administrator's subjective judgment. As a result, although the data quality diagnosis can be carried out efficiently, the reliability of the data profiling result inevitably deteriorates.

Another example of a data quality diagnosis technique introduced in the related art is to perform data profiling on all attributes defined in a data set. With this technique, accurate data profiling results can be obtained, but too much time and effort is required because data profiling must be performed on all data included in the data set. For example, if a data set defines 100 attributes and holds 100 million transaction records, each with a value for every attribute, the total number of data items subject to data profiling is 10 billion (number of attributes x number of records = 100 x 100,000,000).

Therefore, there is a need for a data profiling technique with high efficiency while ensuring reliability.

The object of the present invention is to solve all the above-mentioned problems.

Another object of the present invention is to enable efficient data profiling while ensuring reliability, by calculating at least one statistical value for each attribute defined in a data set based on the data included in that attribute, determining a weight to be assigned to each attribute with reference to the calculated at least one statistical value, and determining at least one attribute whose weight is equal to or greater than a predetermined level, among the attributes defined in the data set, as an attribute to be subjected to data profiling.

A representative configuration of the present invention for achieving the above objects is as follows.

According to an aspect of the present invention, there is provided a method for providing statistical analysis-based data profiling, the method comprising: calculating at least one statistical value for each attribute defined in a data set, based on the data included in that attribute; determining a weight to be assigned to each attribute with reference to the calculated at least one statistical value; and determining at least one attribute whose weight is equal to or greater than a predetermined level, among the attributes defined in the data set, as an attribute to be subjected to data profiling.

According to another aspect of the present invention, there is provided a system for providing statistical analysis-based data profiling, the system comprising: a statistical value calculation unit for calculating at least one statistical value for each attribute defined in a data set, based on the data included in that attribute; a weighting unit for determining a weight to be assigned to each attribute with reference to the calculated at least one statistical value; and a target attribute determination unit for determining at least one attribute whose weight is equal to or greater than a predetermined level, among the attributes defined in the data set, as an attribute to be subjected to data profiling.

In addition, there are further provided other methods and systems for implementing the present invention, as well as a non-transitory computer readable recording medium on which a computer program for executing the method is recorded.

According to the present invention, data profiling is performed on the data included in those attributes, among the various attributes defined in the data set, that are determined through statistical analysis to have a high probability of error. Compared with the prior art of performing data profiling on data included in attributes selected arbitrarily according to the subjective judgment of the administrator, reliability is thereby greatly increased.

In addition, according to the present invention, efficiency can be significantly improved compared with the prior art of performing data profiling on the data of all attributes defined in the data set.

In addition, according to the present invention, the attributes to be subjected to data profiling can be determined by further reflecting the business rules (code values, business rules, etc.) applied to the data set together with the statistical analysis results, so that the completeness of data profiling can be increased.

FIG. 1 is a diagram illustrating a schematic configuration of an entire system for providing statistical analysis-based data profiling according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating an internal configuration of a data profiling system according to an embodiment of the present invention.

FIG. 3 is a diagram exemplarily illustrating an internal configuration of an attribute extraction unit according to an embodiment of the present invention.

FIG. 4 is a diagram conceptually illustrating a configuration for determining an attribute to be subjected to data profiling among the attributes defined in a data set according to an embodiment of the present invention.

100: network

200: data profiling system

210: data set management unit

220: attribute extraction unit

221: statistical value calculation unit

222: weighting unit

223: target attribute determination unit

230: data profiling unit

240: database

250: communication unit

260: control unit

300: user device

400: external server

The following detailed description of the invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the present invention are different from one another but need not be mutually exclusive. For example, certain shapes, structures, and characteristics described herein in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. In addition, it is to be understood that the location or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled. Like reference numerals in the drawings refer to the same or similar functions throughout the several aspects.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement the present invention.

Configuration of the entire system

As shown in FIG. 1, the entire system according to an embodiment of the present invention may include a communication network 100, a data profiling system 200, a user device 300, and an external server 400.

First, the communication network 100 according to an embodiment of the present invention may be configured regardless of communication mode, such as wired or wireless communication, and may be composed of various communication networks such as a local area network (LAN), a metropolitan area network (MAN), and a wide area network (WAN). Preferably, the communication network 100 as used herein may be the known Internet or World Wide Web (WWW). However, the communication network 100 is not limited thereto and may include, at least in part, a known wired/wireless data communication network, a known telephone network, or a known wired/wireless television communication network.

Next, according to an embodiment of the present invention, the data profiling system 200 may be a digital device equipped with memory means and a microprocessor and thus having computing capability. This data profiling system 200 may be a server system.

Specifically, according to an embodiment of the present invention, the data profiling system 200, as described in detail below, may calculate at least one statistical value for each attribute defined in a data set based on the data included in that attribute, determine a weight to be assigned to each attribute with reference to the calculated at least one statistical value, and determine at least one attribute whose weight is equal to or greater than a predetermined level, among the attributes defined in the data set, as an attribute to be subjected to data profiling, thereby enabling efficient data profiling while ensuring reliability.

The function of the data profiling system 200 will be described in more detail below. Although the data profiling system 200 has been described as above, this description is exemplary, and it will be apparent to those skilled in the art that at least some of the functions or components required for the data profiling system 200 may be realized in, or included in, the external server 400 as needed.

Next, according to an embodiment of the present invention, the user device 300 is a digital device capable of connecting to and communicating with the data profiling system 200 through the communication network 100. Any digital device equipped with memory means and a microprocessor and thus having computing capability may be adopted as the user device 300 according to the present invention.

Next, according to an embodiment of the present invention, the external server 400 is a server capable of connecting to and communicating with the data profiling system 200 through the communication network 100, and may perform a function of providing raw data or a data set in the form of a file or a database. For example, the external server 400 may provide reference information, transaction information, aggregate information, and the like as structured data; HTML, XML, GIS data, and the like as semi-structured data; and moving pictures, images, sound, documents, and the like as unstructured data.

Configuration of the Data Profiling System

Hereinafter, the internal configuration of the data profiling system performing important functions for the implementation of the present invention and the function of each component will be described.

Referring to FIGS. 2 and 3, the data profiling system 200 according to an embodiment of the present invention may include a data set management unit 210, an attribute extraction unit 220, a data profiling unit 230, a database 240, a communication unit 250, and a control unit 260. Here, the attribute extraction unit 220 may include a statistical value calculation unit 221, a weighting unit 222, and a target attribute determination unit 223. According to an embodiment of the present invention, at least some of the data set management unit 210, the attribute extraction unit 220, the data profiling unit 230, the database 240, the communication unit 250, and the control unit 260 may be program modules that communicate with an external system (not shown). Such program modules may be included in the data profiling system 200 in the form of operating systems, application program modules, and other program modules, and may be physically stored on various known storage devices. These program modules may also be stored in a remote storage device capable of communicating with the data profiling system 200. Such program modules include, but are not limited to, routines, subroutines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types, described below, in accordance with the present invention.

First, according to an embodiment of the present invention, the data set management unit 210 may perform a function of acquiring, from the external server 400, raw data or a data set to be subjected to data profiling (see FIG. 4(a)). In addition, according to an embodiment of the present invention, the data set management unit 210 may perform a function of converting the various types of raw data collected in this way into a data set having a format suitable for data profiling (see FIG. 4(b)).
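As an illustration of the conversion step described above, the following is a minimal sketch that turns raw CSV text into a column-oriented data set. The function name `to_dataset`, the CSV input format, and the convention that an empty cell or the string "NA" marks a missing value are assumptions made for this example; the description does not specify a concrete format.

```python
import csv
import io


def to_dataset(raw_csv: str) -> dict:
    """Convert raw CSV text into {attribute name: list of values}.

    Assumption: empty cells and the literal "NA" denote missing values,
    which are stored as None.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            cell = row[name]
            columns[name].append(None if cell in ("", "NA") else cell)
    return columns


# Hypothetical two-attribute sample in the style of the Bike Sharing Demand data.
raw = "temp,count\n9.84,16\n9.02,NA\n"
dataset = to_dataset(raw)
```

Storing the data per attribute rather than per record makes the later per-attribute statistics straightforward to compute.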

Next, according to an embodiment of the present invention, the attribute extraction unit 220 (specifically, the statistical value calculation unit 221) may calculate at least one statistical value for each attribute defined in the data set to be subjected to data profiling, based on the data included in that attribute.

Here, an attribute defined in the data set refers to an item that serves as a criterion for classifying the large number of data items (i.e., records) included in the data set. For example, in the Bike Sharing Demand data set, which describes bicycle rental demand according to weather conditions, attributes such as date, year, month, day, hour, season, holiday, working day, weather, humidity, casual rentals (casual), registered rentals (registered), rental count, temperature (temp), discomfort index (atemp), and wind speed (windspeed) may be defined.

Specifically, according to an embodiment of the present invention, the attribute extraction unit 220 may calculate statistical values that can be used as measures for estimating the possibility that an error occurs in the data included in each attribute defined in the data set. For example, statistical values such as missing values, minimum values, maximum values, mode values, average values, variances, standard deviations, five-number summaries, outliers, and near-zero variances may be calculated.

In addition, according to an embodiment of the present invention, the attribute extraction unit 220 (specifically, the weighting unit 222) may perform a function of determining a weight to be assigned to each attribute defined in the data set, with reference to the at least one statistical value calculated for that attribute as described above (see FIG. 4(c)).
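The per-attribute statistics listed above could be computed along the following lines. This is a minimal sketch using only the Python standard library; the function name `attribute_statistics` and the use of `None` for missing values are assumptions for illustration, and outlier testing and near-zero-variance detection are deliberately left out because the description does not state how they are computed.

```python
import statistics
from typing import Optional, Sequence


def attribute_statistics(values: Sequence[Optional[float]]) -> dict:
    """Summarise one attribute (column); None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    if len(present) < 2:
        raise ValueError("need at least two non-missing values")
    # Quartiles for the five-number summary (standard library default method).
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "missing": len(values) - len(present),
        "min": present[0],
        "max": present[-1],
        "mode": statistics.mode(present),
        "mean": statistics.mean(present),
        "variance": statistics.variance(present),  # sample variance
        "stdev": statistics.stdev(present),        # sample standard deviation
        "five_number": (present[0], q1, median, q3, present[-1]),
    }


# Hypothetical column with two missing entries.
stats = attribute_statistics([1.0, 2.0, 2.0, None, 3.0, 4.0, None])
```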

Specifically, if the at least one statistical value calculated for a first attribute defined in the data set satisfies a predetermined criterion, the attribute extraction unit 220 according to an embodiment of the present invention may determine that a predetermined weight is assigned to the first attribute. More specifically, the attribute extraction unit 220 may assign a higher weight to the first attribute as the probability that an error occurs in the data included in the first attribute becomes greater.

According to an embodiment of the present invention, the weight that may be assigned to each attribute defined in the data set may include a first weight and a second weight, and the first weight and the second weight may be determined independently of each other. Specifically, the attribute extraction unit 220 according to an embodiment of the present invention may assign the first weight to the first attribute when the probability that an error occurs in the data included in the first attribute corresponds to a preset level, and may additionally assign the second weight to the first attribute when that probability exceeds the preset level.

For example, the criteria for assigning the first weight and the second weight to the attributes defined in the data set may be set as shown in Table 1 and Table 2 below.

Table 1

First weighting criterion	First weight
A missing value (NA) exists	0.1
Near-zero variance exists	0.1
The standard deviation is a or more	0.1
The number of blank ("") entries exceeds b	0.1
The outlier Bonferroni p-value is less than c	0.1
The data time interval (last day - first day) is greater than the current time interval (current day - first day)	0.1

Table 2

Second weighting criterion	Second weight
The number of missing (NA) values is at least d% of the total number of data items	0.1
The outlier Bonferroni p-value is less than or equal to e (e is less than c in Table 1)	0.1

However, the first and second weighting criteria according to the present invention are not necessarily limited to those listed in Table 1 and Table 2 above, and may be changed as long as the object of the present invention can be achieved.
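The criteria of Tables 1 and 2 can be sketched as two independent scoring functions, each adding 0.1 per satisfied criterion. Everything concrete below is an assumption made for illustration: the threshold values standing in for a, b, c, d, and e (which the description leaves open), the `AttributeStats` structure, and the record total of 17,379, which is back-calculated from the 6,493 missing cases and 37.36% reported in the experimental example. The time-interval criterion of Table 1 is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class AttributeStats:
    missing: int         # number of missing (NA) values
    total: int           # number of records
    stdev: float         # standard deviation
    blank_count: int     # number of blank ("") entries
    outlier_p: float     # Bonferroni-adjusted outlier p-value
    near_zero_var: bool  # near-zero-variance flag


# Hypothetical thresholds standing in for a, b, c, d, e in Tables 1 and 2.
A_STDEV, B_BLANKS, C_P, D_MISSING_PCT, E_P = 100.0, 0, 0.05, 30.0, 0.01
STEP = 0.1  # weight added per satisfied criterion


def first_weight(s: AttributeStats) -> float:
    """Sum of 0.1 for each satisfied Table 1 criterion (time interval omitted)."""
    hits = [
        s.missing > 0,
        s.near_zero_var,
        s.stdev >= A_STDEV,
        s.blank_count > B_BLANKS,
        s.outlier_p < C_P,
    ]
    return round(STEP * sum(hits), 10)


def second_weight(s: AttributeStats) -> float:
    """Sum of 0.1 for each satisfied Table 2 criterion."""
    hits = [
        100.0 * s.missing / s.total >= D_MISSING_PCT,
        s.outlier_p <= E_P,
    ]
    return round(STEP * sum(hits), 10)


# Figures reported for the "Registered" attribute; total is an assumption.
registered = AttributeStats(missing=6493, total=17379, stdev=151.039,
                            blank_count=0, outlier_p=0.0, near_zero_var=False)
w1, w2 = first_weight(registered), second_weight(registered)
```

With these assumed thresholds the sketch yields a first weight of 0.3 and a second weight of 0.2 for the "Registered" attribute. Rounding after multiplication avoids floating-point drift when comparing against thresholds such as 0.3.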

In addition, according to an embodiment of the present invention, the attribute extraction unit 220 (specifically, the target attribute determination unit 223) may perform a function of determining at least one attribute whose assigned weight is equal to or greater than a predetermined level, among the attributes defined in the data set, as an attribute to be subjected to data profiling.

In addition, according to an embodiment of the present invention, the attribute extraction unit 220 may further refer to a business rule applied to the data set when determining at least one of the attributes defined in the data set as an attribute to be subjected to data profiling. In this case, the business rule may include a code value, a business rule, or the like applied to the data set.

For example, the attribute extraction unit 220 according to an embodiment of the present invention may calculate the geometric mean (GM) of the sums of the first and second weights over at least two attributes among the attributes defined in the data set, and may determine the at least two attributes forming a combination whose geometric mean is equal to or greater than a predetermined level as attributes to be subjected to data profiling. Here, the geometric mean (GM) of the sums of the first and second weights over at least two attributes may be calculated as in Equation 1 below.
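The threshold-based selection described above amounts to a simple filter. The following sketch uses the first weights reported in Table 4 of the experimental example; the function name `select_targets` and the short attribute keys are hypothetical.

```python
def select_targets(weights: dict, level: float) -> list:
    """Return the attributes whose weight is equal to or greater than `level`."""
    return [name for name, weight in weights.items() if weight >= level]


# First weights as listed in Table 4 (keys are illustrative short names).
first_weights = {
    "weather": 0.1, "temp": 0.1, "atemp": 0.1, "windspeed": 0.1,
    "casual": 0.2, "registered": 0.3, "count": 0.3,
}

strict = select_targets(first_weights, 0.3)   # only the highest-weight attributes
lenient = select_targets(first_weights, 0.1)  # every attribute that received a weight
```

Raising the level shrinks the set of attributes to profile, trading coverage for efficiency.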

Equation 1

$$GM(S) = \left( \prod_{i=1}^{n} \left( w_{1}(a_i) + w_{2}(a_i) \right) \right)^{1/n}$$

In Equation 1 above, S is a set of attributes (a_1, a_2, ..., a_i, ..., a_n), n is the number of attributes selected from S, a_i is the i-th attribute, w_1(a_i) is the first weight assigned to the i-th attribute, and w_2(a_i) is the second weight assigned to the i-th attribute.
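A minimal sketch of the combination-based selection follows, taking the geometric mean to be the n-th root of the product of the per-attribute sums (first weight + second weight). That reading of the formula, the function names, the threshold of 0.45, and the choice to enumerate fixed-size combinations are all assumptions for illustration; the weight pairs are taken from Table 4.

```python
import math
from itertools import combinations


def geometric_mean_of_sums(pairs):
    """Geometric mean of (first weight + second weight) over the given attributes."""
    sums = [w1 + w2 for w1, w2 in pairs]
    return math.prod(sums) ** (1.0 / len(sums))


def combos_at_or_above(weights, size, level):
    """Return attribute combinations of `size` whose geometric mean is >= level."""
    selected = []
    for names in combinations(sorted(weights), size):
        gm = geometric_mean_of_sums([weights[name] for name in names])
        if gm >= level:
            selected.append(names)
    return selected


# (first weight, second weight) pairs from Table 4; keys are illustrative.
weights = {
    "registered": (0.3, 0.2),
    "count": (0.3, 0.2),
    "casual": (0.2, 0.2),
    "temp": (0.1, 0.1),
}
pairs = combos_at_or_above(weights, size=2, level=0.45)
```

Because a geometric mean is pulled down sharply by any small factor, a combination only passes when all of its attributes carry substantial weight.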

As another example, the attribute extraction unit 220 according to an embodiment of the present invention may classify the attributes defined in the data set into at least one group based on the first weight and the second weight assigned to each attribute, and may determine at least one attribute belonging to at least one of those groups as an attribute to be subjected to data profiling.

Next, according to an embodiment of the present invention, the data profiling unit 230 may perform data profiling targeting only the at least one attribute determined as an attribute to be subjected to data profiling.

Meanwhile, according to an embodiment of the present invention, the database 240 may store raw data, data sets, statistical values calculated for the attributes defined in a data set, weights assigned to the attributes defined in a data set, information on the attributes determined as targets of data profiling, results of performing data profiling, and the like. The database 240 is a concept that encompasses a computer-readable recording medium, and may be a database in the broad sense, including file-system-based data records, as well as a database in the narrow sense.
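One simple way to form such groups is to bucket attributes by their exact (first weight, second weight) pair. The description does not say how groups are formed, so this rule, the function name `group_by_weights`, and the attribute keys are assumptions; the weight pairs come from Table 4.

```python
from collections import defaultdict


def group_by_weights(weights: dict) -> dict:
    """Group attribute names by their (first weight, second weight) pair."""
    groups = defaultdict(list)
    for name, pair in weights.items():
        groups[pair].append(name)
    return dict(groups)


# (first weight, second weight) pairs from Table 4; keys are illustrative.
table4 = {
    "weather": (0.1, 0.1), "temp": (0.1, 0.1), "atemp": (0.1, 0.1),
    "windspeed": (0.1, 0.1), "casual": (0.2, 0.2),
    "registered": (0.3, 0.2), "count": (0.3, 0.2),
}
groups = group_by_weights(table4)

# Pick every attribute in groups whose first weight is at least 0.3.
targets = [name for (w1, _), names in groups.items() if w1 >= 0.3 for name in names]
```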
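Restricting profiling to the selected attributes can be sketched as below. The check performed here (counting records and missing values) is only a placeholder for whatever profiling the system actually runs, and the names `profile`, `dataset`, and `targets` are hypothetical.

```python
def profile(dataset: dict, targets: list) -> dict:
    """Profile only the target attributes; other columns are never touched."""
    report = {}
    for name in targets:
        column = dataset[name]
        report[name] = {
            "records": len(column),
            "missing": sum(value is None for value in column),
        }
    return report


# Hypothetical three-attribute data set; None marks a missing value.
dataset = {
    "season": [1, 1, 2],
    "temp": [9.84, 9.02, 10.66],
    "count": [16, None, 32],
}
report = profile(dataset, targets=["count"])
```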

Next, according to an embodiment of the present invention, the communication unit 250 performs a function to enable the data profiling system 200 to communicate with the user device 300 or the external server 400.

Finally, the control unit 260 according to an embodiment of the present invention may control the flow of data among the data set management unit 210, the attribute extraction unit 220, the data profiling unit 230, the database 240, and the communication unit 250. That is, the control unit 260 controls the flow of data from outside the data profiling system 200 or between its components, so that the data set management unit 210, the attribute extraction unit 220, the data profiling unit 230, the database 240, and the communication unit 250 each perform their own functions.

Experimental Example

Hereinafter, an experimental result of performing data profiling according to a statistical analysis-based data profiling method provided by the data profiling system 200 according to the present invention will be described.

In this experiment, the "Bike Sharing Demand" data set registered on Kaggle was used, and a data quality efficiency measure (DQEM) was calculated to evaluate the performance of data profiling. Here, the data quality efficiency measure can be calculated as in Equation 2 below.

Equation 2

$$DQEM = \frac{S - m}{S} \times 100\ (\%)$$

In Equation 2, S is the product of the total number of attributes and the number of records (that is, the total number of data items included in the data set), and m is the product of the number of attributes subject to data profiling and the number of records.
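The measure can be sketched as a small function, assuming that Equation 2 has the form DQEM = (S - m) / S x 100. This form is an inference: it is consistent with the 87.5%, 56.25%, and 0% values reported in this experiment, but the function name `dqem` and the record count of 1,000 used below are illustrative assumptions. When every attribute has the same number of records, the result does not depend on the record count.

```python
def dqem(total_attributes: int, profiled_attributes: int, records: int) -> float:
    """Data quality efficiency measure in percent, assuming (S - m) / S * 100."""
    s = total_attributes * records     # all data items in the data set
    m = profiled_attributes * records  # data items actually profiled
    return 100.0 * (s - m) / s


two_attributes = dqem(16, 2, records=1000)
seven_attributes = dqem(16, 7, records=1000)
all_attributes = dqem(16, 16, records=1000)
```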

In this experiment, the statistical analysis-based data profiling method according to the present invention calculated, for seven of the sixteen attributes defined in the data set, statistical values suggesting that errors are likely to occur, and each of those attributes was given a first weight or a second weight according to preset conditions.

Table 3

No.	Attribute name	First weight related statistics	Second weight related statistics
1	Weather	Bonferroni p: 0	Bonferroni p: 0
2	Temp	Bonferroni p: 0	Bonferroni p: 0
3	Discomfort index (atemp)	Bonferroni p: 0	Bonferroni p: 0
4	Windspeed	Bonferroni p: 0	Bonferroni p: 0
5	Casual rent	Bonferroni p: 0	Bonferroni p: 0
		Missing value (NA): 6,493 cases	Missing value (NA): 37.36%
6	Registered	Bonferroni p: 0	Bonferroni p: 0
		Standard deviation (sd): 151.039	-
		Missing value (NA): 6,493 cases	Missing value (NA): 37.36%
7	Rental Count	Bonferroni p: 0	Bonferroni p: 0
		Standard deviation (sd): 181.144	-
		Missing value (NA): 6,493 cases	Missing value (NA): 37.36%

Table 4

No.	Attribute name	First weight	Second weight
1	Weather	0.1	0.1
2	Temp	0.1	0.1
3	Discomfort index (atemp)	0.1	0.1
4	Windspeed	0.1	0.1
5	Casual rent	0.2	0.2
6	Registered	0.3	0.2
7	Rental Count	0.3	0.2

Referring to Tables 3 and 4, it can be confirmed that a first weight or a second weight was assigned to the seven attributes, among the sixteen attributes defined in the data set, for which errors are likely to occur: weather, temperature (temp), discomfort index (atemp), wind speed, casual rentals, registered rentals, and rental count.

In the present experiment, (i) when data profiling was performed on only the two attributes having a first weight of 0.3 or higher among the 16 attributes defined in the data set, the data quality efficiency measure (DQEM) was calculated to be 87.5%, and (ii) when data profiling was performed on only the seven attributes having a first weight of 0.1 or more among the 16 attributes defined in the data set, the data quality efficiency measure was calculated to be 56.25%. These values are significantly higher than the data quality efficiency measure (0%) obtained under the prior art of performing data profiling on all 16 attributes defined in the data set.

Therefore, it can be seen that the present invention significantly improves the efficiency of data profiling. In addition, the present invention can increase reliability compared with the prior art of performing data profiling on data included in attributes selected arbitrarily according to the subjective judgment of the administrator.
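The weights in Table 4 can be reproduced from Table 3 by counting, for each attribute, how many first-weight and second-weight statistics are listed and adding 0.1 per entry. The sketch below assumes that each listed statistic corresponds to exactly one satisfied criterion; the dictionary keys and variable names are illustrative.

```python
STEP = 0.1  # weight per satisfied criterion, as in Tables 1 and 2

# (number of first-weight statistics, number of second-weight statistics)
# listed for each attribute in Table 3.
criteria_met = {
    "weather": (1, 1),
    "temp": (1, 1),
    "atemp": (1, 1),
    "windspeed": (1, 1),
    "casual": (2, 2),      # Bonferroni p and missing values
    "registered": (3, 2),  # Bonferroni p, standard deviation, missing values
    "count": (3, 2),
}

table4 = {
    name: (round(STEP * first, 10), round(STEP * second, 10))
    for name, (first, second) in criteria_met.items()
}
```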
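The reported figures can be checked by hand under the assumption that the measure has the form (S - m) / S x 100. Writing R for the number of records (a symbol introduced here only for illustration), R cancels in every case:

```latex
\begin{aligned}
\text{(i)}\quad  & \frac{16R - 2R}{16R} \times 100 = \frac{14}{16} \times 100 = 87.5\% \\
\text{(ii)}\quad & \frac{16R - 7R}{16R} \times 100 = \frac{9}{16} \times 100 = 56.25\% \\
\text{all attributes}\quad & \frac{16R - 16R}{16R} \times 100 = 0\%
\end{aligned}
```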

Embodiments according to the present invention described above may be implemented in the form of program instructions that can be executed by various computer components, and may be recorded on a non-transitory computer readable recording medium. The non-transitory computer readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the non-transitory computer readable recording medium may be those specially designed and configured for the present invention, or may be known and available to those skilled in the computer software arts. Examples of non-transitory computer readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code generated by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the process according to the invention, and vice versa.

Although the present invention has been described above with reference to specific matters such as specific components, limited embodiments, and drawings, these are provided only to assist in a more general understanding of the present invention, and the present invention is not limited to the above embodiments. Those skilled in the art to which the present invention pertains may make various modifications and variations from these descriptions.

Accordingly, the spirit of the present invention should not be limited to the above-described embodiments, and not only the appended claims but also all equivalents or equivalent modifications of the claims fall within the scope of the spirit of the present invention.

Claims

A method for providing data profiling based on statistical analysis, the method comprising:

calculating at least one statistical value for each attribute defined in a data set, based on the data included in that attribute;

determining a weight to be given to each attribute with reference to the calculated at least one statistical value; and

determining, among the attributes defined in the data set, at least one attribute whose weight is equal to or greater than a predetermined level as an attribute to be subjected to data profiling,

wherein,

in the weight determination step,

a predetermined weight is given to a first attribute if the at least one statistical value calculated for the first attribute satisfies a predetermined criterion.
The method of claim 1,

wherein the at least one statistical value includes at least one of a missing value, a minimum value, a maximum value, a mode value, an average value, a variance, a standard deviation, a five-number summary, an outlier, and a near-zero variance.
The method of claim 1,

wherein the weight is determined to be higher as the probability that an error occurs in the data included in the attribute is higher.
The method of claim 1,

wherein the weight includes at least one of a first weight and a second weight that are determined independently of each other.
The method of claim 4,

wherein, in the attribute determination step,

with reference to a geometric mean of the sums of the first weight and the second weight over at least two attributes, the at least two attributes forming a combination whose geometric mean is equal to or greater than a predetermined level are determined as attributes to be subjected to data profiling.
The method of claim 1,

wherein, in the attribute determination step,

the at least one attribute is determined as an attribute to be subjected to data profiling with further reference to a business rule applied to the data set.
The method of claim 1,

further comprising:

performing data profiling on the data set, targeting only the data included in the determined at least one attribute.
A non-transitory computer readable recording medium having recorded thereon a computer program for executing the method according to any one of claims 1 to 7.
A system for providing data profiling based on statistical analysis, the system comprising:

a statistical value calculation unit for calculating at least one statistical value for each attribute defined in a data set, based on the data included in that attribute;

a weighting unit for determining a weight to be given to each attribute with reference to the calculated at least one statistical value; and

a target attribute determination unit for determining, among the attributes defined in the data set, at least one attribute whose weight is equal to or greater than a predetermined level as an attribute to be subjected to data profiling,

wherein

the weighting unit gives a predetermined weight to a first attribute if the at least one statistical value calculated for the first attribute satisfies a predetermined criterion.