KR101632073B1 - Method, device, system and non-transitory computer-readable recording medium for providing data profiling based on statistical analysis - Google Patents
- Publication number
- KR101632073B1 (application KR1020150143390A)
- Authority
- KR
- South Korea
- Prior art keywords
- attribute
- data
- weight
- value
- profiling
Classifications
- G06F17/30318
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30598
- G06F17/30699
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Physics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Operations Research (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Algebra (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
The present invention relates to a method, system and non-transitory computer-readable recording medium for providing data profiling based on statistical analysis.
Recently, the volume of data generated through e-mail, social network services (SNS), multimedia, mobile devices, and the Internet of Things (IoT) has been increasing rapidly, reaching the zettabyte (10^21 bytes) scale.
In addition, technology for analyzing and utilizing big data is being actively researched across all industries and has become a global issue. In Korea, the government is also aiming to share data by actively opening public information under its Government 3.0 initiative. A variety of technologies are also being developed to provide useful services to users by utilizing this flood of data.
In this situation, the reliability of data quality must be ensured. One example of previously introduced data quality diagnosis technology is a technique in which data quality diagnosis (or data profiling) is performed only on the data included in attributes that a manager judges to be important. However, according to this related art, the data profiling result can change greatly depending on the subjective judgment of the manager. For example, suppose a data set is used for the purpose of mailing promotional materials to customers. In this case, data profiling may be performed only on the data included in the address attribute, which is judged to be the most important attribute. However, it is difficult to conclude that the data set will necessarily be used only for mailing, and the judgment that the address is the most important attribute is merely the manager's subjective judgment. Therefore, although data quality diagnosis can be performed efficiently, the reliability of the data profiling result is inevitably lowered.
Another example of previously introduced data quality diagnosis technology is a technique of performing data profiling on all the attributes defined in the data set. According to this conventional technology, accurate data profiling results can be obtained, but excessive time and effort are required because data profiling must be performed on all the data included in the data set. For example, suppose that the data set holds 100 million transaction records for each of the 100 attributes defined in it. In this case, the total number of data items targeted for data profiling is 10 billion (number of attributes x number of records = 100 x 100,000,000).
Therefore, there is a demand for a data profiling technique which can secure reliability and is highly efficient.
It is an object of the present invention to solve all the problems described above.
Another object of the present invention is to calculate at least one statistical value for each attribute based on the data included in each attribute defined in a data set, determine a weight to be given to each attribute with reference to the calculated at least one statistical value, and determine at least one attribute whose weight is equal to or higher than a predetermined level as an attribute to be subjected to data profiling, so that highly efficient data profiling can be performed while ensuring reliability.
In order to accomplish the above object, a representative structure of the present invention is as follows.
According to one aspect of the present invention, there is provided a method for providing data profiling based on statistical analysis, comprising the steps of: calculating at least one statistical value for each attribute based on the data included in each attribute defined in a data set; determining a weight to be given to each attribute with reference to the calculated at least one statistical value; and determining, among the attributes defined in the data set, at least one attribute whose weight is equal to or higher than a predetermined level as an attribute to be subjected to data profiling.
According to another aspect of the present invention, there is provided a system for providing data profiling based on statistical analysis, the system comprising: a statistical value calculating unit for calculating at least one statistical value for each attribute based on the data included in each attribute defined in a data set; a weight assigning unit for determining a weight to be given to each attribute with reference to the calculated at least one statistical value; and a target attribute determination unit for determining, among the attributes defined in the data set, at least one attribute whose weight is equal to or higher than a predetermined level as an attribute to be subjected to data profiling.
In addition, there is further provided a non-transitory computer-readable recording medium for recording a computer program for executing the method and a user device, system, and other methods for implementing the invention.
According to the present invention, data profiling is performed on the data included in those attributes, among the various attributes defined in the data set, that are determined through statistical analysis to have a high possibility of containing errors. Reliability can therefore be greatly increased compared with the prior art, in which data profiling is performed on the data included in attributes selected arbitrarily according to the subjective judgment of the administrator.
In addition, according to the present invention, efficiency can be remarkably improved compared with the prior art, in which data profiling is performed on the data of all attributes defined in a data set.
Further, according to the present invention, the attributes to be subjected to data profiling can be determined by additionally reflecting the business rules (code values, business regulations, etc.) applied to the data set together with the statistical analysis result, so that the effectiveness of data profiling can be further increased.
Brief description of the drawings
FIG. 1 is a diagram showing a schematic configuration of an overall system for providing statistical analysis-based data profiling according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram illustrating an internal configuration of a data profiling system according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an exemplary internal configuration of an attribute extraction unit according to an exemplary embodiment of the present invention.
FIG. 4 is a diagram conceptually showing a configuration for determining an attribute to be subjected to data profiling among attributes defined in a data set according to an embodiment of the present invention.
The following detailed description of the invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, certain features, structures, and characteristics described herein in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It is also to be understood that the position or arrangement of the individual components within each disclosed embodiment may be varied without departing from the spirit and scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is limited only by the appended claims, along with the full scope of equivalents to which such claims are entitled, if properly described. In the drawings, like reference numerals refer to the same or similar functions throughout the several views.
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings, so that those skilled in the art can easily carry out the present invention.
Configuration of the entire system
FIG. 1 is a diagram showing a schematic configuration of an overall system for providing statistical analysis-based data profiling according to an embodiment of the present invention.
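Referring to FIG. 1, an overall system according to an embodiment of the present invention may be configured to include a network 100, a data profiling system 200, a user device 300, and an external server 400.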
First, the
Next, in accordance with one embodiment of the present invention, the
Specifically, in accordance with one embodiment of the present invention, the
The function of the
Next, in accordance with an embodiment of the present invention, the
According to an embodiment of the present invention, the
Configuration of data profiling system
Hereinafter, the internal configuration of the data profiling system that performs an important function for the implementation of the present invention and the functions of the respective components will be described.
FIG. 2 is an exemplary diagram illustrating an internal configuration of a data profiling system according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an exemplary internal configuration of an attribute extraction unit according to an exemplary embodiment of the present invention.
FIG. 4 is a diagram conceptually showing a configuration for determining an attribute to be subjected to data profiling among attributes defined in a data set according to an embodiment of the present invention.
2 and 3, a
First, according to an embodiment of the present invention, the data
Next, in accordance with an embodiment of the present invention, the attribute extracting unit 220 (specifically, the statistical value calculating unit 221) may calculate at least one statistical value for each attribute based on the data included in each attribute defined in the data set to be subjected to data profiling.
Here, an attribute defined in the data set refers to an item that serves as a criterion for classifying the large amount of data (i.e., records) included in the data set. For example, in a data set on bike rental status according to weather conditions (Bike Sharing Demand), attributes such as date, year, month, day, hour, season, holiday, working day, weather, humidity, casual, registered, count, temp, atemp, and windspeed can be defined.
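The per-attribute statistics can be illustrated with a short Python sketch. It is not the patent's implementation: the list of statistics follows the one recited in the claims (missing values, minimum, maximum, mode, mean, variance, standard deviation, five-number summary, outliers, near-zero variance), while the function name `attribute_statistics`, the 1.5 x IQR outlier rule, the near-zero-variance thresholds, and the sample values are all assumptions made for illustration.

```python
import statistics
from collections import Counter


def attribute_statistics(values):
    """Summarize one attribute (column); None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: an outlier lies more than 1.5 * IQR outside the quartiles.
    outliers = [v for v in present if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    top = Counter(present).most_common(2)
    # Assumption: near-zero variance means one value dominates (frequency
    # ratio above 19) and fewer than 10% of the values are distinct.
    freq_ratio = top[0][1] / top[1][1] if len(top) > 1 else float("inf")
    unique_pct = 100 * len(set(present)) / len(present)
    return {
        "missing": len(values) - len(present),
        "min": present[0],
        "max": present[-1],
        "mode": top[0][0],
        "mean": statistics.mean(present),
        "variance": statistics.pvariance(present),
        "std_dev": statistics.pstdev(present),
        "five_number_summary": (present[0], q1, median, q3, present[-1]),
        "outliers": outliers,
        "near_zero_variance": freq_ratio > 19 and unique_pct < 10,
    }


# Hypothetical sample of a "windspeed" attribute with one missing record.
windspeed = [0.0, 6.0, 7.0, 7.0, 9.0, 11.0, None, 13.0, 57.0]
stats = attribute_statistics(windspeed)
```

In this sample the value 57.0 is flagged as an outlier and one record is counted as missing, which is the kind of signal the weighting step can then act on.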
Specifically, according to an embodiment of the present invention, the
According to an embodiment of the present invention, the attribute extracting unit 220 (specifically, the weight assigning unit 222) may determine a weight to be given to each attribute defined in the data set with reference to the at least one statistical value calculated for that attribute as described above (see FIG. 4(c)).
Specifically, if the at least one statistical value calculated with respect to the first attribute defined in the data set satisfies a preset criterion, the
Here, according to an embodiment of the present invention, the weight that can be given to each attribute defined in the data set may include a first weight and a second weight, and the first weight and the second weight may be determined independently of each other. In more detail, the
For example, the criteria for assigning the first weight and the second weight to attributes defined in the data set may be set as shown in Tables 1 and 2 below, respectively.
However, the criteria for assigning the first weight and the second weight according to the present invention are not necessarily limited to those listed in Table 1 or Table 2, and may be changed as needed within the scope of achieving the object of the present invention.
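The weighting step can be sketched as follows. Tables 1 and 2 are not reproduced in this text, so every criterion, threshold, and weight value in the sketch is hypothetical; only the overall shape is taken from the description: a predetermined weight is given when a statistical value meets a preset criterion, and the first and second weights are determined independently of each other.

```python
def assign_weights(stats, n_records):
    """Return (first_weight, second_weight) for one attribute.

    All thresholds and weight values are hypothetical placeholders for the
    criteria of Tables 1 and 2, which are not available here.
    """
    first_weight = 0.0
    second_weight = 0.0
    # Hypothetical first-weight criteria: many missing values, any outliers.
    if stats["missing_count"] / n_records > 0.05:
        first_weight += 0.2
    if stats["outlier_count"] > 0:
        first_weight += 0.1
    # Hypothetical second-weight criterion, evaluated independently.
    if stats["near_zero_variance"]:
        second_weight += 0.1
    return round(first_weight, 2), round(second_weight, 2)


# Hypothetical statistics for one attribute of a 1,000-record data set.
example = {"missing_count": 80, "outlier_count": 3, "near_zero_variance": False}
weights = assign_weights(example, n_records=1000)
```

Because the two weights are accumulated in separate variables, changing a first-weight criterion never affects the second weight, which mirrors the independence stated above.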
According to an embodiment of the present invention, the attribute extracting unit 220 (more specifically, the target attribute determination unit 223) may determine, among the attributes defined in the data set, at least one attribute whose assigned weight is greater than or equal to a predetermined level as an attribute to be subjected to data profiling.
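As a minimal sketch of this selection step, the following Python function keeps only the attributes whose weight reaches a given level. The function name `select_target_attributes` and the weight values are hypothetical; the two levels (0.3 and 0.1) are the ones used in the experimental example of this document.

```python
def select_target_attributes(weights, level):
    """Return the attributes whose weight is at or above `level`."""
    return [name for name, weight in weights.items() if weight >= level]


# Hypothetical first weights for a handful of attributes.
first_weights = {
    "weather": 0.1,
    "temp": 0.3,
    "windspeed": 0.3,
    "season": 0.0,
    "holiday": 0.0,
}

strict = select_target_attributes(first_weights, level=0.3)
relaxed = select_target_attributes(first_weights, level=0.1)
```

Raising the level narrows the set of attributes to profile, and lowering it widens the set at the cost of more profiling work.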
In addition, according to an embodiment of the present invention, the
For example, the
In Equation (1) above, $S$ is the set of attributes $(a_1, a_2, \ldots, a_i, \ldots, a_n)$, $n$ is the number of attributes selected from $S$, $a_i$ is the $i$-th attribute, $a_{i14}$ is the first weight given to the $i$-th attribute, and $a_{i15}$ is the second weight given to the $i$-th attribute.
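Equation (1) itself is not reproduced in this text. The sketch below therefore assumes a form suggested by the symbol definitions and by the claims, namely the geometric mean over a combination of attributes of the sum of each attribute's first and second weights, i.e. (product of (first weight + second weight) over the n attributes) raised to the power 1/n. The function names, weight values, and the level 0.25 are hypothetical.

```python
import math
from itertools import combinations


def combination_score(attrs, weights):
    """Geometric mean of (first weight + second weight) over `attrs`.

    Assumed form of Equation (1); the original equation is not available.
    """
    sums = [weights[a][0] + weights[a][1] for a in attrs]
    return math.prod(sums) ** (1 / len(sums))


def select_combinations(weights, size, level):
    """Return every `size`-attribute combination scoring at least `level`."""
    return [
        combo
        for combo in combinations(weights, size)
        if combination_score(combo, weights) >= level
    ]


# Hypothetical (first weight, second weight) pairs per attribute.
weights = {
    "temp": (0.3, 0.1),
    "windspeed": (0.3, 0.1),
    "weather": (0.1, 0.0),
    "season": (0.0, 0.0),
}

selected = select_combinations(weights, size=2, level=0.25)
```

A geometric mean drops to zero as soon as one attribute in the combination has no weight at all, so under this assumed form a combination only scores well when every member carries some weight.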
For example, the
Next, according to an embodiment of the present invention, the
Meanwhile, according to an embodiment of the present invention, the
Next, in accordance with an embodiment of the present invention, the
The
Experimental Example
Hereinafter, experimental results of data profiling according to the statistical analysis-based data profiling method provided by the
In this experiment, the "Bike Sharing Demand" data set registered on Kaggle was used, and a data quality efficiency measure (DQEM) was calculated to evaluate the performance of data profiling. The data quality efficiency measure can be calculated by the following Equation (2).
In Equation (2), S is the product of the total number of attributes and the number of records (i.e., the total number of data items included in the data set), and m is the product of the number of attributes subject to data profiling and the number of records.
In this experiment, the statistical analysis-based data profiling method according to the present invention calculated statistical values indicating a high probability of error occurrence for seven of the 16 attributes defined in the data set, and a first weight or a second weight was given to those attributes according to predetermined conditions.
Referring to Table 3 and Table 4, seven of the 16 attributes defined in the data set, including weather, temperature, discomfort index, wind intensity, temporary lease, and registration lease, were found to have a high possibility of error occurrence, and it can be confirmed that the first weight or the second weight was given to them.
In this experiment, (i) when data profiling was performed on only the two attributes having a first weight of 0.3 or more among the 16 attributes defined in the data set, the data quality efficiency measure (DQEM) was calculated to be 87.5%, and (ii) when data profiling was performed on only the seven attributes having a first weight of 0.1 or more among the 16 attributes, the data quality efficiency measure was calculated to be 56.25%. These values are significantly higher than the data quality efficiency measure (0%) calculated according to the prior art, which performs data profiling on all 16 attributes defined in the data set.
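Equation (2) is not reproduced in this text, but the three reported values are consistent with the form DQEM = (1 - m / S) x 100, using the definitions of S and m given above: 2 of 16 attributes gives 87.5%, 7 of 16 gives 56.25%, and 16 of 16 gives 0%. The following Python sketch checks that arithmetic. The formula is inferred from the reported values rather than taken from the original equation, and the record count is a hypothetical placeholder that cancels out of the ratio.

```python
def dqem(total_attributes, profiled_attributes, n_records):
    """Data quality efficiency measure in percent, assuming (1 - m / S) * 100."""
    s = total_attributes * n_records      # all data items in the data set
    m = profiled_attributes * n_records   # data items actually profiled
    return (1 - m / s) * 100


N_RECORDS = 1000  # hypothetical; any positive record count gives the same result

two_attributes = dqem(16, 2, N_RECORDS)
seven_attributes = dqem(16, 7, N_RECORDS)
all_attributes = dqem(16, 16, N_RECORDS)
```

Under this reading, the measure is simply the share of the data set that did not have to be profiled.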
Therefore, according to the present invention, it is confirmed that the efficiency of data profiling can be remarkably improved. In addition, according to the present invention, the reliability can be enhanced as compared with the prior art in which data profiling is performed on data included in an arbitrarily selected attribute in accordance with subjective judgment of an administrator.
The embodiments of the present invention described above can be implemented in the form of program instructions that can be executed through various computer components and recorded in a non-transitory computer-readable recording medium. The non-transitory computer-readable recording medium may include program instructions, data files, data structures, etc., either alone or in combination. The program instructions recorded on the non-transitory computer-readable recording medium may be those specially designed and constructed for the present invention or may be those known to those skilled in the computer software arts. Examples of non-transitory computer-readable recording media include magnetic media such as hard disks, floppy disks and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code such as that generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules for performing the processing according to the present invention, and vice versa.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. Those skilled in the art will appreciate that various modifications, additions and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims.
Therefore, the spirit of the present invention should not be construed as being limited to the above-described embodiments, and the following claims, as well as all equivalents of the claims, fall within the scope of the spirit of the present invention.
100: Network
200: Data Profiling System
210: Data Set Management Unit
220: Attribute extraction unit
221: statistical value calculating section
222: Weight assignment
223: target attribute determination unit
230: Data profiling performing unit
240: Database
250:
260:
300: User device
400: external server
Claims (10)
Calculating at least one statistical value for each attribute based on data contained in each attribute defined in the data set,
Determining a weight assigned to each attribute with reference to the calculated at least one statistical value, and
Determining at least one attribute whose weight is equal to or higher than a predetermined level among attributes defined in the data set as an attribute to be subjected to data profiling
In the weight determination step,
Wherein, if at least one statistical value calculated with respect to a first attribute satisfies a predetermined criterion, a predetermined weight is given to the first attribute.
Wherein the at least one statistical value includes at least one of a missing value, a minimum value, a maximum value, a mode, a mean, a variance, a standard deviation, a five-number summary, an outlier, and near-zero variance.
Wherein the weight is determined to be higher as the probability of occurrence of an error in the data included in the attribute is greater.
Wherein the weight includes at least one of a first weight and a second weight determined independently of each other.
In the attribute determination step,
Wherein, with reference to a geometric mean of the sums of the first weight and the second weight of at least two attributes, the at least two attributes constituting a combination whose geometric mean is equal to or higher than a predetermined level are determined as attributes to be subjected to data profiling.
In the attribute determination step,
And determining at least one attribute as an attribute to be subjected to data profiling with further reference to business rules applied to the data set.
Performing data profiling on the data set only for data included in the determined at least one attribute
A statistical value calculating unit for calculating at least one statistical value concerning each attribute based on data included in each attribute defined in the data set,
A weighting unit determining a weighting value to be given to each attribute with reference to the calculated at least one statistical value, and
A target attribute determination unit for determining, as an attribute to be subjected to data profiling, at least one attribute whose weight is equal to or higher than a predetermined level among the attributes defined in the data set
Wherein the weighting unit determines that a predetermined weight is given to the first attribute if at least one statistical value calculated with respect to the first attribute satisfies a predetermined criterion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2016/005920 WO2016195421A1 (en) | 2015-06-04 | 2016-06-03 | Method, system and non-transitory computer-readable recording medium for providing data profiling based on statistical analysis |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150079056 | 2015-06-04 | ||
KR20150079056 | 2015-06-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
KR101632073B1 (en) | 2016-06-20 |
Family
ID=56354579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150143390A KR101632073B1 (en) | 2015-06-04 | 2015-10-14 | Method, device, system and non-transitory computer-readable recording medium for providing data profiling based on statistical analysis |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR101632073B1 (en) |
WO (1) | WO2016195421A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102240496B1 (en) * | 2020-04-17 | 2021-04-15 | 주식회사 한국정보기술단 | Data quality management system and method |
KR20210085886A (en) * | 2019-12-31 | 2021-07-08 | 가톨릭관동대학교산학협력단 | Data profiling method and data profiling system using attribute value quality index |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110389295B (en) * | 2019-06-14 | 2022-03-25 | 福建省福联集成电路有限公司 | VBA language-based electrical data processing method and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20150015029A (en) * | 2008-10-23 | 2015-02-09 | 아브 이니티오 테크놀로지 엘엘시 | A method, a system, and a computer-readable medium storing a computer program for performing a data operation, measuring data quality, or joining data elements |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101033179B1 (en) * | 2003-09-15 | 2011-05-11 | 아브 이니티오 테크놀로지 엘엘시 | Data profiling |
US8869208B2 (en) * | 2011-10-30 | 2014-10-21 | Google Inc. | Computing similarity between media programs |
KR101530848B1 (en) * | 2012-09-20 | 2015-06-24 | 국립대학법인 울산과학기술대학교 산학협력단 | Apparatus and method for quality control using datamining in manufacturing process |
KR101448228B1 (en) * | 2013-02-12 | 2014-10-10 | 이주양 | Apparatus and Method for social data analysis |
- 2015-10-14 KR KR1020150143390A patent/KR101632073B1/en active IP Right Grant
- 2016-06-03 WO PCT/KR2016/005920 patent/WO2016195421A1/en active Application Filing
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20210085886A (en) * | 2019-12-31 | 2021-07-08 | 가톨릭관동대학교산학협력단 | Data profiling method and data profiling system using attribute value quality index |
KR102365910B1 (en) * | 2019-12-31 | 2022-02-22 | 가톨릭관동대학교산학협력단 | Data profiling method and data profiling system using attribute value quality index |
KR102240496B1 (en) * | 2020-04-17 | 2021-04-15 | 주식회사 한국정보기술단 | Data quality management system and method |
Also Published As
Publication number | Publication date |
---|---|
WO2016195421A1 (en) | 2016-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5917719B2 (en) | Method, apparatus and computer readable recording medium for image management in an image database | |
CA2985028C (en) | Gating decision system and methods for determining whether to allow material implications to result from online activities | |
CN109271420B (en) | Information pushing method, device, computer equipment and storage medium | |
US9836517B2 (en) | Systems and methods for mapping and routing based on clustering | |
US20140317756A1 (en) | Anonymization apparatus, anonymization method, and computer program | |
CN109522190B (en) | Abnormal user behavior identification method and device, electronic equipment and storage medium | |
KR101632073B1 (en) | Method, device, system and non-transitory computer-readable recording medium for providing data profiling based on statistical analysis | |
WO2020211146A1 (en) | Identifier association method and device, and electronic apparatus | |
CN108763956A (en) | A kind of stream data difference secret protection dissemination method based on fractal dimension | |
CN108470195A (en) | Video identity management method and device | |
CN110503566B (en) | Wind control model building method and device, computer equipment and storage medium | |
KR101163196B1 (en) | Method of managing customized social network map in application server providing customized content and computer readable medium thereof | |
CN105376223A (en) | Network identity relationship reliability calculation method | |
CN112101692B (en) | Identification method and device for mobile internet bad quality users | |
US20150302302A1 (en) | Method and device for predicting number of suicides using social information | |
CN106961441B (en) | User dynamic access control method for Hadoop cloud platform | |
Cai et al. | Tropical cyclone risk assessment for China at the provincial level based on clustering analysis | |
JP5847122B2 (en) | Evaluation apparatus, information providing system, evaluation method, and evaluation program | |
KR101959213B1 (en) | Method for predicting cyber incident and Apparatus thereof | |
Cheng et al. | Toward quantitative measures for the semantic quality of polygon generalization | |
Basik et al. | Slim: Scalable linkage of mobility data | |
CN111460796A (en) | Accidental sensitive word discovery method based on word network | |
KR102387284B1 (en) | Apparatus and method for forecasting heatwave Impact considering severity of health impacts and socio-economic vulnerability | |
JP5665685B2 (en) | Importance determination device, importance determination method, and program | |
JP6142617B2 (en) | Information processing apparatus, information processing method, and information processing program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant | ||
FPAY | Annual fee payment |
Payment date: 20190613 Year of fee payment: 4 |