WO2013146884A1 - Data cleansing system, method, and program - Google Patents

Info

Publication number
WO2013146884A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
rule
database
cfd
cleansing
Prior art date
Application number
PCT/JP2013/059007
Other languages
English (en)
Japanese (ja)
Inventor
綾子 星野
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Publication of WO2013146884A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs

Definitions

  • The present invention is based on Japanese Patent Application No. 2012-072128 (filed on March 27, 2012), the entire contents of which are incorporated herein by reference.
  • The present invention relates to a data cleansing system, method, and program.
  • CFD: conditional functional dependency.
  • A CFD is a rule stating that a functional dependency (abbreviated as “FD”), which represents a dependency between data attributes, holds for the set of tuples specified by a condition.
  • A tuple is a row in a relational table whose columns are attributes.
  • A CFD consists of a condition part and a premise part, which form the left-hand side (LHS) of the rule, and a specification of attribute values in the consequent part, which forms the right-hand side (RHS) of the rule.
  • “x1” means that the attribute takes a specific value.
  • The premise part specifies attributes only. An attribute value that is not fixed to a specific value (that is, a wildcard that matches an arbitrary value) is also called an “unnamed variable” (anonymous variable).
  • A rule whose consequent part is fixed to a specified value is called a “Constant CFD”.
  • A rule whose consequent part is not fixed to a specified value but expresses a dependency between attributes is called a “Variable CFD”. That is, if the right-hand side of the pattern is the unnamed variable '_' (tp[A] = _), the rule is a Variable CFD.
  • A violation of a rule such as rule 1, that is, a tuple t1 that satisfies the condition “X1 = x1” but has “t1[A] ≠ a”, is called a “single tuple violation”.
  • The tuple t1 is called a “violating tuple”.
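The notion of a single tuple violation can be illustrated with a minimal Python sketch. It assumes a Constant CFD whose left-hand side is a pattern of attribute values (with '_' as the unnamed variable) and whose right-hand side fixes one attribute to one value. The function names and the sample rows are hypothetical and are not taken from the specification.

```python
WILDCARD = "_"  # unnamed variable: matches any attribute value


def matches(row, pattern):
    """True if the row agrees with every non-wildcard value in the pattern."""
    return all(v == WILDCARD or row.get(a) == v for a, v in pattern.items())


def violating_tuples(rows, lhs_pattern, rhs_attr, rhs_value):
    """Tuples that match the LHS pattern but whose RHS attribute differs."""
    return [
        r for r in rows
        if matches(r, lhs_pattern) and r.get(rhs_attr) != rhs_value
    ]


# Hypothetical data: rule "X1 = x1 -> A = a"
rows = [
    {"ID": 1, "X1": "x1", "A": "a"},  # satisfies the rule
    {"ID": 2, "X1": "x1", "A": "b"},  # single tuple violation
    {"ID": 3, "X1": "x2", "A": "b"},  # condition does not apply
]
violation_ids = [r["ID"] for r in violating_tuples(rows, {"X1": "x1"}, "A", "a")]
```

A Variable CFD would instead require comparing pairs of tuples that agree on the premise attributes, which this sketch does not cover.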
  • Support is the number of tuples that match the condition part and the premise part (left side of the CFD: LHS) as well as the consequent part (right side of the CFD: RHS). Under another definition, it may be expressed as the ratio of the number of tuples matching both LHS and RHS to the total number of tuples.
  • Reliability (confidence) is the ratio of the number of tuples that satisfy the CFD rule to the number of tuples that match the condition part and the premise part.
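As a rough illustration of these two measures, the following Python sketch computes support (as a tuple count) and reliability for a Constant CFD. The rows are hypothetical sample data, not the contents of Table 1, and the function names are assumptions made for this example.

```python
WILDCARD = "_"  # unnamed variable: matches any attribute value


def matches(row, pattern):
    """True if the row agrees with every non-wildcard value in the pattern."""
    return all(v == WILDCARD or row.get(a) == v for a, v in pattern.items())


def support_and_reliability(rows, lhs_pattern, rhs_attr, rhs_value):
    """Support = tuples matching LHS and RHS; reliability = that count / LHS matches."""
    lhs_rows = [r for r in rows if matches(r, lhs_pattern)]
    both = [r for r in lhs_rows if r.get(rhs_attr) == rhs_value]
    support = len(both)
    reliability = len(both) / len(lhs_rows) if lhs_rows else 0.0
    return support, reliability


# Hypothetical rows; rule: (A = 1, B = _) -> C = c
rows = [
    {"ID": 1, "A": "1", "B": "x", "C": "c"},
    {"ID": 2, "A": "1", "B": "y", "C": "c"},
    {"ID": 3, "A": "1", "B": "x", "C": "c"},
    {"ID": 4, "A": "1", "B": "y", "C": "d"},
    {"ID": 5, "A": "2", "B": "x", "C": "c"},
]
support, reliability = support_and_reliability(rows, {"A": "1", "B": "_"}, "C", "c")
```

Here four tuples match the left-hand side and three of them also match the right-hand side, so the support is 3 and the reliability is 3/4.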
  • Support and reliability are described below using a specific example.
  • ID is a tuple ID.
  • A, B, and C are attributes. From the relational data set in Table 1, for example, CFD φ1: A, B → C (1, _
  • Patent Document 1 discloses an apparatus that extracts erroneous data using a conditional expression for checking the correctness of data in a database, corrects the extracted data, and updates the database with the corrected data.
  • Patent Document 2 discloses a data mining device that generates correlation rules whose correlation coefficient between database attributes is equal to or greater than a predetermined value, and that deletes from the set of inter-attribute correlation rules those for which only a correlation coefficient is produced and no true correlation exists.
  • Patent Document 3 discloses a configuration in which data cleansing / characterizing means, which performs data cleansing that replaces an abnormal data value with a specific value or deletes it, is executed repeatedly while a set value is changed from its initial value.
  • Patent Document 4 discloses finding a rule violation caused by a configuration data abnormality, displaying the rule that has been violated, displaying the table row corresponding to the rule violation, and executing a corrective action for the defective configuration data.
  • Patent Document 5 discloses creating an item database from a relational database, extracting correlation rules from the item database, and calculating the number of correlation rules corresponding to each item when reading the contents of the result rule file.
  • A data cleansing system receives cleansing target data and a data correction rule set (CFD set Σ) as input, and includes correction location extraction means (a correction location extraction device), correction content determination means (a correction content determination device), and correction result reflecting means (a correction result reflecting device), none of which are shown.
  • FIG. 16 shows the processing of Algorithm BATCHREPAIR shown in FIG. 4 of Non-Patent Document 1.
  • Comments and the like have been added to FIG. 4 of Non-Patent Document 1 to aid understanding.
  • The correction location extracting means refers to the rule set (CFD set Σ) and extracts violating tuples (specifically, the set of rule-violating tuples) from the data (cleansing target database D) (Line 4).
  • A correction rule, a correction location, and a correction destination value are selected by the correction content determination means PICKNEXT() according to the algorithm (Line 6).
  • Non-Patent Document 2 discloses a system including such an operator user interface.
  • For a correction rule, for example, the operator either changes the value (data) of an exception tuple in the cleansing target data that does not follow the rule, or accepts the current value.
  • When the execution and updating of corrections are repeated, with the operator approving correction rules recommended by the system, the regularity of the data increases with each iteration.
  • the CFD rule set may be provided in a data cleansing apparatus, or may be obtained by being extracted from cleansing target data as in the systems of Non-Patent Document 2 and Non-Patent Document 3.
  • data cleansing performed when data in the migration source system is migrated to the migration destination system will be described.
  • Data in the migration source system is called “cleansing target data”, and data already tested or used in the migration destination system is called “migration destination data”.
  • The column-by-column mapping (association) between the database of the migration source system and that of the migration destination system has been completed.
  • the first problem is that when performing the above-mentioned repetitive data cleansing, it is not known at what point cleansing may be terminated. The reason will be described below.
  • exception data for example, exception data within an allowable range
  • If cleansing of additional data is performed without using this information, data may be corrected erroneously because the operator fails to notice an allowable exception.
  • the second problem is that, for example, when data cleansing is performed using CFD rules, regularity is imposed more than necessary, and as a result, original information is lost. The reason will be described below.
  • The present invention has been made in view of the above problems, and its main purpose is to provide a system, method, and program that can present an end condition for data cleansing.
  • An apparatus is provided comprising: data analysis means for acquiring data configuration information of a first database, using the first database as input; rule extraction / selection means for selecting a data-independent CFD rule set by excluding, from the conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and rule number estimating means for calculating, as a cleansing end determination condition for a second database, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection means, the data configuration information of the first database acquired by the data analysis means, and the data configuration information of the second database to be cleansed.
  • A method is provided in which: the data analysis process receives the first database as input and obtains data configuration information of the first database; the rule extraction / selection process selects a data-independent CFD rule set by excluding, from the conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and the rule number estimation process calculates, as the cleansing end determination condition for the second database, an estimate of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process.
  • A data analysis process for obtaining data configuration information of the first database, using the first database as input; a rule extraction / selection process for selecting a data-independent CFD rule set by excluding, from the conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and a rule number estimation process for calculating, as the cleansing end determination condition for the second database, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process.
  • A conditional functional dependency (CFD) rule set is derived (extracted) from a cleansing target database, tuple data (attribute values, etc.) violating a CFD rule is corrected, and the correction is reflected in the database. While this cleansing process is performed repeatedly, the cleansing process for the cleansing target database is terminated once the total number of CFD rules established among the CFD rule set reaches, as a result of the corrections, a predetermined number (estimated value) calculated in advance.
  • the total number of CFD rules to be established in the cleansing target database is estimated on the basis of another database corresponding to the cleansing target database, and as a result of the correction to the cleansing target database.
  • the data migration to the migration destination system database is performed.
  • Based on the number of CFD rules that are satisfied in the migration destination database and the data size difference between the migration destination database and the migration source database (the database storing cleansing target data), an indication of the cleansing end condition is calculated.
  • The data analysis means (101) acquires, as configuration information of the first database (DB1), data size information such as the number of attributes and the number of tuples. The rule extraction / selection means (102) selects a data-independent CFD rule group (108) by excluding, from the CFD rule set extracted from the first database (DB1), CFD rules determined to have relatively high data dependency.
  • The rule number estimation means (103) calculates an estimate of the total number of CFD rules to be established in the second database (DB2) to be cleansed, using the data-independent CFD rule group (108) obtained by the rule extraction / selection means (102), the data size information of the first database (DB1) obtained by the data analysis means (101), and the data size information (107) of the second database (DB2) (106) storing the cleansing target data.
  • The user (operator) of this system can thus learn a guideline for the data cleansing end condition, that is, a value (estimated value) indicating how many of the extracted CFD rules should hold before the cleansing process is ended.
  • A cleansing end condition estimating device (abbreviated as “end condition estimating device”) 100, which provides a guideline (estimated value) for the data cleansing end condition, includes a database (DB1) 105, data analysis means 101 (data analysis apparatus), rule extraction / selection means 102 (rule extraction / selection apparatus), a database (DB2) 106, and rule number estimation means 103 (rule number estimation apparatus).
  • the database (DB1) 105 is a database of the migration destination system (referred to as “migration destination database”).
  • the database (DB2) 106 is a database that stores cleansing target data.
  • the database (DB1) 105 and the database (DB2) 106 are also referred to as the database DB1 and the database DB2 simply by removing the reference numbers.
  • the data analysis unit 101 reads the migration destination database DB1 and acquires information such as the number of attributes and the number of tuples.
  • The rule extraction / selection unit 102 extracts a rule group (CFD rule set) from the migration destination database DB1, excludes rules having high data dependency from the extracted rule group, and selects the remaining rules as the data independent rule group 108.
  • The rule extraction / selection means 102 divides the data in the migration destination database DB1 into training data (Dtrain) and test data (Dtest) in k ways (where k is a predetermined positive integer). From the CFD rules extracted from the training data (Dtrain), it selects, for example, rules that hold in the test data (Dtest) at least m times (0 < m ≤ k) out of the k times.
  • The rule number estimation means 103 reads the cleansing target data (database DB2) 106 and calculates an estimated number of rules that should hold in the cleansing target data.
  • The rule number estimation means 103 calculates the estimated number of CFD rules to be established in the database DB2 using, for example: the total number of rules in the CFD rule set (data independent rule group 108) selected by the rule extraction / selection means 102; the number of tuples of the database DB1 acquired by the data analysis means 101 and the number of different values in the specified column of the database DB1; and at least one of the number of tuples of the database DB2 storing the cleansing target data, the number of different values in the designated column of the database DB2, and the number of CFD rules established in the database DB2.
  • The column designation may, for example, be input to the cleansing end condition estimation device 100 as the parameter 104 by the user (worker) or the like via input means (not shown) and supplied to the rule number estimation means 103.
  • the data analysis unit 101, the rule extraction / selection unit 102, and the rule number estimation unit 103 may be realized by a program that operates on a computer constituting the end condition estimation apparatus 100.
  • The program is stored in, for example, a magnetic or optical recording medium (device) or a semiconductor memory (for example, a read-only memory or a rewritable nonvolatile memory such as an EEPROM: Electrically Erasable and Programmable Read Only Memory).
  • the data analysis unit 101 acquires information (data size information) related to the migration destination database DB1 (step A1 in FIG. 2).
  • The rule extraction / selection unit 102 divides the data in the migration destination database DB1, extracts rules, excludes the rules determined to have high data dependency from the extracted rule group Σ, and selects a data-independent rule group (step A2).
  • the rule number estimation means 103 calculates an approximate number (estimated value) of the total number of rules to be established in the cleansing target data (database DB2) (step A3).
  • step A2 in FIG. 2 will be described.
  • The rule extraction / selection means 102 divides the migration destination database DB1 into training data and test data in k ways (step A2-1).
  • The value of k is determined by the table size of the migration destination database DB1 acquired by the rule extraction / selection means 102 and by the sampling method (bootstrap method, cross-validation method, etc.).
  • Test data with an appropriate number of tuples n is extracted k times from the migration destination database DB1, and the rest is used as training data. The k test data sets may overlap with one another.
  • The method of dividing data into test data and training data is an important factor that determines the evaluation of the extracted rules. For example, when it is desired to obtain a set of rules that can be applied to data from different points in time, it is effective to sort the data by time stamp and then divide it. For this reason, the data may be rearranged before the division based on the prior knowledge of the worker or the like.
  • Each tuple serves as training data (k-1) times and as test data only once.
  • Two tuples ID1 and ID2 are used as test data, and eight tuples ID3 to ID10 are used as training data.
  • k sets (combinations) of such test data (2 tuples) and training data (8 tuples) can be obtained.
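The division described above can be sketched in Python as a plain cross-validation split. The sketch assumes k = 5 for a table of 10 tuples, so that each test set holds 2 tuples and each training set holds 8; the value of k and the function name `k_way_split` are assumptions made for this illustration.

```python
def k_way_split(tuple_ids, k):
    """Return k (test, training) pairs; each ID is test data exactly once."""
    n = len(tuple_ids)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of tuples")
    # Contiguous folds; sort tuple_ids beforehand (e.g. by time stamp) if needed.
    folds = [tuple_ids[i * n // k:(i + 1) * n // k] for i in range(k)]
    splits = []
    for test in folds:
        test_set = set(test)
        training = [t for t in tuple_ids if t not in test_set]
        splits.append((test, training))
    return splits


ids = [f"ID{i}" for i in range(1, 11)]
splits = k_way_split(ids, 5)
first_test, first_training = splits[0]
```

With this split, the first combination uses ID1 and ID2 as test data and ID3 to ID10 as training data. A bootstrap variant would instead draw the test tuples at random, which allows overlap between test sets.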
  • Any one of the existing algorithms is used in the present embodiment.
  • For example, those described in Patent Document 6, Non-Patent Document 4, and the like are used.
  • With the extraction algorithm, a uniquely determined CFD rule group (CFD set) Σ can be obtained by specifying the input data and appropriate values for the support threshold (min_supp) and reliability threshold (min_conf) parameters.
  • The CFD support threshold (min_supp) and reliability threshold (min_conf) parameters may be supplied to the rule extraction / selection unit 102 via the input unit (not shown) as the parameter 104 in FIG.
  • an example of a rule set (CFD set) extracted from the training data of FIG. 6A is the rule in the first column in the table of FIG.
  • The rule extraction / selection unit 102 repeats the following 3) and 4) for each rule (CFD) in the rule group Σi extracted from the training data (8 tuples) (step A2-4).
  • The rule extraction / selection means 102 evaluates whether the CFD rule φ is satisfied in the test data (the 2 tuples, out of 10, that are not training data) (step A2-5). That is, it evaluates whether the reliability of the CFD rule φ in the test data (the number of tuples satisfying both the premise part and the consequent part, divided by the number of tuples satisfying the premise part) satisfies a preset criterion.
  • If the criterion is not satisfied, the rule extraction / selection unit 102 excludes the CFD rule φ from the rule group Σi (step A2-6).
  • In step A2-5, the reliability columns obtained from the test data (for example, the first to fifth reliability columns) relate to the CFD rule φ1.
  • When the rule φ is not satisfied, instead of excluding it immediately, φ may be excluded from the rule group Σi only when it is not satisfied more than q times during the k tests.
  • The rule extraction / selection means 102 combines the rule groups Σ1 to Σk (takes the union of the k sets Σ1 to Σk) and outputs the resulting rule group Σ as the data independent rule group 108 (step A2-7).
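Steps A2-4 to A2-7 can be sketched as follows. This is a minimal illustration under several assumptions: rules are Constant CFDs, the extraction algorithm is passed in as a callback (`toy_extract` is a hypothetical stand-in, not the algorithm of Patent Document 6 or Non-Patent Document 4), and a rule whose premise matches no test tuple is treated as not established, a case the text does not address.

```python
def reliability(rule, rows):
    """Reliability of a constant rule on rows, or None if no row matches the LHS."""
    lhs, rhs_attr, rhs_value = rule
    matched = [r for r in rows if all(v == "_" or r.get(a) == v for a, v in lhs)]
    if not matched:
        return None
    return sum(r.get(rhs_attr) == rhs_value for r in matched) / len(matched)


def select_data_independent(splits, extract_rules, min_conf):
    """Union over all splits of the rules that also hold in that split's test data."""
    selected = set()
    for test_rows, training_rows in splits:
        sigma_i = set(extract_rules(training_rows))       # rule group for split i
        for rule in list(sigma_i):                         # step A2-4
            conf = reliability(rule, test_rows)            # step A2-5
            if conf is None or conf < min_conf:            # assumption: None fails
                sigma_i.discard(rule)                      # step A2-6
        selected |= sigma_i                                # step A2-7 (union)
    return selected


def toy_extract(rows):
    """Hypothetical extractor: COUNTRY = c -> TAX = t when t is unique for c."""
    taxes_by_country = {}
    for r in rows:
        taxes_by_country.setdefault(r["COUNTRY"], set()).add(r["TAX"])
    return [((("COUNTRY", c),), "TAX", next(iter(t)))
            for c, t in taxes_by_country.items() if len(t) == 1]


r = [{"COUNTRY": c, "TAX": t} for c, t in
     [("JP", "5"), ("JP", "5"), ("JP", "5"), ("US", "8"), ("US", "7"), ("JP", "5")]]
splits = [([r[0], r[3]], [r[1], r[2], r[4], r[5]]),
          ([r[1], r[4]], [r[0], r[2], r[3], r[5]])]
selected = select_data_independent(splits, toy_extract, min_conf=1.0)
```

In this toy data, the US rules found in each training set fail on the corresponding test set and are dropped, while the JP rule survives in both splits. The variant that tolerates failures up to q times would keep a per-rule failure counter across the k splits instead of discarding immediately.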
  • The rule extraction / selection unit 102 may consolidate the rule group Σ by omitting redundant rules and implied rules.
  • The rule extracting / selecting means 102 may output only the size (number of rules) of the rule group Σ.
  • step A3 in FIG. 2 will be described.
  • the rule number estimation means 103 obtains a data size comparison index based on the respective data size information in the databases DB1 and DB2 (step A3-1).
  • As the data size comparison index, for example, the number of tuples in the database or the number of different values in the specified column is used. The number of different values of the designated column is 10 when the attribute value of the column takes 10 different values, and 5 when it takes 5 different values.
  • The rule number estimating means 103 calculates the estimated number of rules to be established in the database DB2, Number_of_CFDs(DB2), using at least one of: the total number of CFD rules (data independent rule group) extracted and selected from the migration destination database DB1 by the rule extraction / selection means 102, Number_of_CFDs(DB1); the data size comparison index of the migration destination database DB1, DBSIZE(DB1); and the data size comparison index of the database DB2 storing the data to be cleansed, DBSIZE(DB2) (step A3-2).
  • An example of the calculation formula is given by the following formula (1).
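Formula (1) itself is not reproduced in this text. As a purely illustrative sketch, the Python code below assumes one plausible form, in which the rule count scales in proportion to the ratio of the data size comparison indexes; this proportional form is an assumption and should not be read as the formula of the specification. The helper names `db_size` and `estimate_rule_count` are also hypothetical.

```python
def db_size(rows, column=None):
    """Data size comparison index: tuple count, or distinct values of a column."""
    if column is None:
        return len(rows)
    return len({r[column] for r in rows})


def estimate_rule_count(num_cfds_db1, dbsize_db1, dbsize_db2):
    """ASSUMED form: Number_of_CFDs(DB1) * DBSIZE(DB2) / DBSIZE(DB1)."""
    if dbsize_db1 <= 0:
        raise ValueError("DBSIZE(DB1) must be positive")
    return num_cfds_db1 * dbsize_db2 / dbsize_db1


# Hypothetical numbers: 10 rules selected from DB1, 1000 tuples in DB1, 800 in DB2.
estimate = estimate_rule_count(10, 1000, 800)

# Distinct-value index on a hypothetical column.
sample = [{"COUNTRY": "JP"}, {"COUNTRY": "US"}, {"COUNTRY": "JP"}]
distinct_countries = db_size(sample, "COUNTRY")
```

Either index (tuple count or number of different values of the designated column) can be passed as the two size arguments, provided the same index is used for both databases.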
  • a rule with less generality (a rule with high data dependency) is excluded by using a bootstrap method or a cross-validation method. Therefore, a rule peculiar to the migration destination database DB1 is excluded.
  • the number of CFD rules to be applied to the database DB2 storing cleansing target data can be estimated.
  • Because the number of rules that should hold in the cleansing target data (DB2) is estimated, as an end condition, after such rules (for example, the rule φ2 in FIG. 7) have been excluded from the extracted rule group, the possibility of incorrect corrections is reduced.
  • The data cleansing system of the second embodiment includes: a data analysis unit 201 (second data analysis means) that obtains a CFD rule set (rule group 207) having a reliability of p or higher from a database (DB2) 206 that stores cleansing target data; a rule application automatic determination unit 202 that receives the rule group 207 from the data analysis unit 201 and automatically determines the correction contents for making the data of the database (DB2) 206 conform to the rules of the rule group 207; a data update unit 203 that updates data based on the determined correction content; an end condition estimation device 100; and an established rule number counter 204.
  • The end condition estimation apparatus 100 estimates the cleansing process end condition of the cleansing target database (DB2) 206 from the migration destination database (DB1) 205. It is composed of the cleansing end condition estimation apparatus 100 of the first embodiment.
  • the data analysis unit 201, the rule application automatic determination unit 202, and the data update unit 203 may be realized by a program that runs on a computer.
  • The program is stored in, for example, a magnetic or optical recording medium (device) or a semiconductor memory (for example, a read-only memory or a rewritable non-volatile memory such as an EEPROM: Electrically Erasable and Programmable Read Only Memory).
  • The data analysis unit 201 obtains from the cleansing target database DB2, using for example a method disclosed in Patent Document 6 or Non-Patent Document 4, a CFD rule set (rule group) having a reliability equal to or higher than a preset threshold, together with a violation tuple set consisting of the tuples that do not conform to each rule of the CFD rule set.
  • the rule application automatic determination unit 202 determines whether or not the contents of the tuple should be changed so as to eliminate the nonconformity of the set of the CFD rule set and the rule nonconforming tuple set (violating tuple set).
  • the data update unit 203 executes necessary changes (changes in the contents of the tuples that eliminate the rule nonconformity) to the data in the database DB2 in accordance with the determination result of the rule application automatic determination unit 202.
  • In FIG. 13, the automatic correction destination selection process in step B4 of FIG. 12 is replaced by a process that presents the correction to be applied to the worker, lets the worker decide whether to make the correction, and, when the correction is adopted, executes the correction and counts up the number of established rules by one.
  • the data analysis unit 201 analyzes the cleansing target data DB2, and extracts a CFD rule set having a reliability level equal to or higher than the threshold p and a violation tuple set for the rule (step B1 in FIG. 12).
  • the data analysis means 201 initializes the count value of the established rule number counter 204 (step B2 in FIG. 12).
  • the data shown in FIG. 14 is input to the data analysis unit 201 as data of the cleansing target database DB2.
  • FIG. 15 is a diagram illustrating an example of a CFD rule set with a reliability of 0.8 or more extracted from the data of FIG. 14 and a violation tuple set for those rules.
  • A notation (format) such as v(φx): {tuple1 or tuple2} is used to indicate that, for example, tuple 1 or tuple 2 violates the rule φx.
  • The CFD rules having a reliability of 1.0 are the seven rules φ1 to φ7 above the broken line in FIG. In this case, the count value of the established rule number counter 204 is initialized to “7”.
  • the data analyzing means 201 repeats the following steps B4-B6 (step B3).
  • the rule application automatic determination unit 202 selects a CFD rule to be applied, a location to be corrected, and a value to be corrected (step B4). That is, when there are a plurality of CFD rules having a reliability of less than 1.0, a rule to be applied first is selected from the plurality of CFD rules, and a violation tuple for the selected rule is matched with the rule. Then, the location to be corrected is automatically selected, and the value to be corrected is determined.
  • a method of ordering rules to be applied using an index based on an edit distance of a character string can be used.
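The text does not specify the index, so the following Python sketch shows only one possible reading: candidate corrections are ordered by the Levenshtein edit distance between the current value and the corrected value, so that the smallest change is tried first. The candidate tuples and rule names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def order_candidates(candidates):
    """Sort (rule_name, current_value, corrected_value) by edit distance."""
    return sorted(candidates, key=lambda c: edit_distance(c[1], c[2]))


# Hypothetical correction candidates
candidates = [
    ("rule_a", "Tokyo", "Osaka"),  # large change
    ("rule_b", "Tokio", "Tokyo"),  # one-character change
]
ordered = order_candidates(candidates)
```

Under this reading, a correction that looks like a typo fix is preferred over one that rewrites the value entirely.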
  • the data update unit 203 executes correction to the tuple selected by the rule application automatic determination unit 202 (step B5 in FIG. 12). That is, the data update unit 203 executes correction to the data selected by the rule application automatic determination unit 202 in step B4. Further, the data update unit 203 updates the reliability and violation tuple information regarding the established rule having the reliability of less than 1.0.
  • The data updating unit 203 increases the count value of the established rule number counter 204 by 1 (step B6 in FIG. 12).
  • the estimated value of the number of established rules by the termination condition estimation device 100 is “8”.
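The loop of steps B2 to B6 can be sketched in Python as follows. The sketch assumes that rules are tried in descending order of reliability and that applying a correction makes the rule hold completely; both are simplifications. The rule names, reliabilities, and the `apply_correction` callback are hypothetical.

```python
def run_cleansing(reliabilities, estimated_total, apply_correction):
    """Apply corrections until the established rule count reaches the estimate."""
    # Step B2: initialise the counter with the rules that already hold.
    established = sum(1 for c in reliabilities.values() if c >= 1.0)
    # Assumption: try rules with reliability < 1.0 in descending reliability.
    pending = sorted((n for n, c in reliabilities.items() if c < 1.0),
                     key=lambda n: reliabilities[n], reverse=True)
    applied = []
    for name in pending:                      # step B3 (repeat)
        if established >= estimated_total:    # end condition reached
            break
        apply_correction(name)                # steps B4-B5
        reliabilities[name] = 1.0
        established += 1                      # step B6
        applied.append(name)
    return established, applied


# Seven rules already hold; two hypothetical rules are below 1.0.
rules = {f"phi{i}": 1.0 for i in range(1, 8)}
rules.update({"rule_a": 0.9, "rule_b": 0.8})
log = []
established, applied = run_cleansing(rules, estimated_total=8,
                                     apply_correction=log.append)
```

With the counter initialised to 7 and an estimate of 8, the loop adopts one more rule and then stops, leaving the remaining low-reliability rule unapplied.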
  • To correct the tuples ID9 and ID10, changing the value of either PRICE or TAX of the ID9 or ID10 tuple is considered as a correction candidate.
  • the rule application automatic determination unit 202 automatically makes a determination regarding correction.
  • For automatic determination regarding correction, refer, for example, to the descriptions of Non-Patent Documents 2 and 3.
  • In Non-Patent Document 1, the determination regarding correction can be made manually by selecting from among the correction candidates.
  • φ14: (COUNTRY = JP, PRODUCT) → TAX is adopted and data correction is performed.
  • the rule application automatic determination unit 202 acquires the estimated value of the cleansing process end condition from the end condition estimation apparatus 100, determines the end condition, and automatically ends when the end condition is satisfied.
  • the configuration of FIG. 10 includes a rule application determination input unit 302 instead of the rule application automatic determination unit 202 of FIG. 9, and the operator 303 determines whether to apply the correction rule.
  • the rule application determination input unit 302 ends the iterative cleansing based on the instruction of the worker 303, but the worker 303 may determine whether or not to correct the data further based on the correction rule.
  • FIG. 13 is a flowchart for explaining a procedure in the case of performing a determination process of whether or not there is a correction by a human (worker 303 in FIG. 10) in FIG.
  • the automatic selection processing (performed by the rule application automatic determination means 202 in FIG. 9) of the rule to be applied, the location to be corrected, and the correction destination value in step B4 in FIG. 1, B4-2 is replaced.
  • FIG. 11 is a diagram illustrating a configuration of a second modification of the second embodiment. Referring to FIG. 11, the configuration is the same as that of FIG. 10 except that a value mapping unit 209 is provided. Description of the identical parts is omitted below.
  • The value mapping means 209 is used to eliminate differences in the expression of values (attribute values) between the databases DB1 and DB2, and the established rule number counter 204 is incremented by one only when a matching rule including the value (attribute value) is established in the database DB2.
  • the value mapping performed by the value mapping unit 209 is to associate value expressions in the corresponding columns (for example, “male” and “female” are associated with English “male” and “female”). Reference is made to the description of Patent Document 3 and the like.
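Value mapping of this kind can be illustrated with a small Python sketch that rewrites the values of one column through a lookup table before rules are compared across databases. The column name, the source-side codes "M" and "F", and the function name are hypothetical.

```python
def map_values(rows, column, mapping):
    """Return copies of rows with the column's values normalised via mapping."""
    return [{**r, column: mapping.get(r[column], r[column])} for r in rows]


# Hypothetical expression difference between DB2 and DB1
gender_map = {"M": "male", "F": "female"}
db2_rows = [{"ID": 1, "GENDER": "M"}, {"ID": 2, "GENDER": "F"},
            {"ID": 3, "GENDER": "male"}]
normalised = map_values(db2_rows, "GENDER", gender_map)
```

Values that have no entry in the mapping are left unchanged, so rows that already use the target expression pass through as they are.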
  • Since the end time is determined by the end condition estimating means (device) 100, it is not necessary to consider all of the enormous number of rules. This reduces the possibility of mistakenly adopting a rule that does not need to be applied as a result of examining rules with low reliability.
  • The invention can be applied to uses such as data cleansing at the time of data migration and data integration, and to any system that performs data cleansing of a database corresponding to a database used as a reference.
  • a cleansing end condition estimation device comprising: rule number estimation means for calculating an estimated value of the total number of CFD rules to be established in the database.
  • the cleansing end condition estimation device according to appendix 1, wherein the rule extraction/selection means divides the first database into training data and test data in k ways (where k is a predetermined positive integer), and selects, from the CFD rules extracted from the training data, a rule that is established in the test data in at least m of the k tests (where m is a predetermined positive integer less than or equal to k).
  • the cleansing end condition estimation device according to appendix 1, wherein the rule extraction/selection means divides the first database into training data and test data in k ways (where k is a predetermined positive integer), and selects, from the CFD rules extracted from the training data, a rule that is established in the test data in at least m of the k tests (where m is a predetermined positive integer less than or equal to k) with a reliability equal to or higher than a predetermined threshold value.
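The k-way split and "at least m of k tests" selection described in the two preceding appendices might look like the following Python sketch. It is not the specification's algorithm: `extract_rules` is a toy stand-in for CFD discovery that only finds constant rules of the form "column a = x implies column b = y", the folds are formed by row index, and a rule with no matching test rows is treated as not established. All function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def extract_rules(rows):
    """Toy rule discovery: constant rules (a, x, b, y) with no counterexample in rows."""
    seen = defaultdict(set)  # (a, x, b) -> set of values observed in column b
    for row in rows:
        for a, b in permutations(row, 2):
            seen[(a, row[a], b)].add(row[b])
    return {(a, x, b, next(iter(ys)))
            for (a, x, b), ys in seen.items() if len(ys) == 1}


def confidence(rule, rows):
    """Fraction of rows matching the condition that also satisfy the consequence."""
    a, x, b, y = rule
    matched = [row for row in rows if row[a] == x]
    if not matched:
        return 0.0  # assumption: an untested rule does not count as established
    return sum(row[b] == y for row in matched) / len(matched)


def select_rules(rows, k, m, threshold=1.0):
    """Keep rules that reach `threshold` confidence on the test part in >= m of k splits."""
    hits = Counter()
    for fold in range(k):
        test = rows[fold::k]
        train = [row for i, row in enumerate(rows) if i % k != fold]
        for rule in extract_rules(train):
            if confidence(rule, test) >= threshold:
                hits[rule] += 1
    return {rule for rule, n in hits.items() if n >= m}


# Illustrative data only: "name" is unique per row, so rules that mention it
# depend on individual tuples and never survive the test splits.
rows = [{"country": c, "code": d, "name": n} for c, d, n in [
    ("JP", "81", "a"), ("US", "1", "b"), ("JP", "81", "c"),
    ("US", "1", "d"), ("JP", "81", "e"), ("US", "1", "f")]]
selected = select_rules(rows, k=3, m=3)
```

With this data, only the rules linking `country` and `code` are selected, which mirrors the idea of discarding rules that depend heavily on the particular data.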
  • the cleansing end condition estimation device according to supplementary note 1 or 2, wherein the cleansing end condition estimation device is calculated using
  • the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases are respectively designated as the number of tuples of each database, the number of different values of the designated column, or the number of tuples.
  • (Appendix 6) A data cleansing system comprising: a first database; a second database to be cleansed; the cleansing end condition estimation device according to any one of appendices 1 to 5; second data analysis means for obtaining a CFD rule set from the second database; rule application determination means for determining data correction contents for matching data with the rules in the CFD rule set acquired by the second data analysis means; and a data update unit that updates data based on the correction content determined by the rule application determination unit.
  • (Appendix 7) The data cleansing system according to appendix 6, wherein the second data analysis means extracts a CFD rule set having a reliability equal to or higher than a predetermined threshold value from the second database.
  • the rule application determining unit is configured such that the number of established rules in the CFD rule set extracted by the second data analyzing unit is the total number of CFD rules to be established in the second database calculated by the end condition estimating device.
  • the data analysis process receives the first database as input and obtains data configuration information of the first database;
  • the rule extraction/selection process selects a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, any CFD rule that does not satisfy a predetermined criterion and is therefore determined to have high data dependency,
  • the rule number estimation process includes the number of CFD rule sets independent of data selected by the rule extraction / selection process;
  • the second database cleansing end determination condition is set as the second database cleansing end determination condition.
  • a cleansing end condition calculation method comprising: calculating an estimated value of the total number of CFD rules to be established in the database.
  • the cleansing end condition calculation method according to appendix 9, wherein the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and a rule that is established in the test data in at least m of the k tests (where m is a predetermined positive integer less than or equal to k) is selected from the CFD rules extracted from the training data.
  • the cleansing end condition calculation method according to appendix 9, wherein the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and a rule that is established in the test data in at least m of the k tests (where m is a predetermined positive integer less than or equal to k) with a reliability equal to or higher than a predetermined threshold value is selected from the CFD rules extracted from the training data.
  • the cleansing end condition calculation method according to appendix 9 or 10, wherein the cleansing end condition calculation method is performed using
  • the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases are respectively designated as the number of tuples of each database, the number of different values of the designated column, or the number of tuples.
  • (Appendix 14) A data cleansing method using the cleansing end condition calculation method according to any one of appendices 9 to 13, the data cleansing method comprising: a second data analysis process for obtaining a CFD rule set from the second database; a rule application determination process for determining data correction content for matching data to the rules in the CFD rule set acquired in the second data analysis process; and a data update process for updating data based on the correction content determined in the rule application determination process.
  • (Appendix 15) The data cleansing method according to appendix 14, wherein control is performed to end data cleansing when the number of established rules in the CFD rule set extracted in the second data analysis process reaches the estimated value of the total number of CFD rules to be established in the second database calculated by the end condition estimation device.
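The end-of-cleansing control in the two preceding appendices can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed method: rules are constant rules `(a, x, b, y)` meaning "column a = x implies column b = y", candidate rules are assumed to arrive already ordered (for example by reliability), the correction is simply to overwrite the consequence column, and every applied rule is counted as established once its corrections are made. The function name `cleanse` and the sample data are hypothetical.

```python
def cleanse(rows, candidate_rules, estimated_total):
    """Apply rules in order and stop once `estimated_total` rules are established."""
    rows = [dict(row) for row in rows]  # work on a copy; the input stays untouched
    established = 0
    for a, x, b, y in candidate_rules:
        if established >= estimated_total:
            break  # end condition reached: remaining rules are not examined
        for row in rows:
            # Rule application determination: the row matches the condition
            # but violates the consequence, so the correction is b := y.
            if row[a] == x and row[b] != y:
                row[b] = y  # data update
        established += 1
    return rows, established


# Illustrative data only.
dirty = [{"country": "JP", "code": "81"}, {"country": "JP", "code": "18"},
         {"country": "US", "code": "1"}, {"country": "US", "code": "2"}]
rules = [("country", "JP", "code", "81"),
         ("country", "US", "code", "1"),
         ("country", "US", "code", "9")]  # never applied when the estimate is 2
cleaned, established = cleanse(dirty, rules, estimated_total=2)
```

Stopping at the estimate is what keeps the third, lower-ranked rule from overwriting values that the second rule has already made consistent.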
  • CFD: conditional functional dependency
  • a rule extraction / selection process for selecting a CFD rule set independent of data The number of CFD rule sets independent of data selected by the rule extraction / selection process;
  • the second database cleansing end determination condition is set as the second database cleansing end determination condition.
  • the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and a rule that is established in the test data in at least m of the k tests (where m is a predetermined positive integer less than or equal to k) is selected from the CFD rules extracted from the training data.
  • (Appendix 18) the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and a rule that is established in the test data in at least m of the k tests (where m is a predetermined positive integer less than or equal to k) with a reliability equal to or higher than a predetermined threshold value is selected from the CFD rules extracted from the training data.
  • the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases are respectively designated as the number of tuples of each database, the number of different values of the designated column, or the number of tuples.
  • (Appendix 22) A program for causing the computer to execute each process of the program according to any one of appendices 17 to 21, and: a second data analysis process for obtaining a CFD rule set from a second database; a rule application determination process for determining data correction content for matching data to the rules in the CFD rule set acquired in the second data analysis process; and a data update process for updating data based on the correction content determined in the rule application determination process.
  • the program according to appendix 22, wherein control for terminating the data cleansing is performed when the number of established rules in the CFD rule set extracted in the second data analysis process reaches the estimated value of the total number of CFD rules to be established in the second database calculated by the end condition estimation device.
  • Description of symbols: 100 cleansing end condition estimation device (end condition estimation device); 101 data analysis unit; 102 rule extraction/selection unit; 103 rule number estimation unit; 104 parameter; 105, 205 database (destination database: DB1); 106, 206 database (database DB2 to be cleansed); 107 data size information; 108 data-independent rule group; 201 data analysis means; 202 rule application automatic judgment means; 203 data update means; 204 established rule number counter; 207, 208 rule group; 209 value mapping means; 302 rule application judgment input means; 303 worker

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system capable of indicating the end conditions of data cleansing. The system comprises: data analysis means (101) for acquiring data configuration information from a first database (DB1); rule extraction and selection means (102) for extracting a CFD rule set from the database (DB1) and excluding rules with high data dependency from the set; and rule number estimation means (103) for calculating, as the end condition of the recursive cleansing of the data of a second database (DB2), an estimated value of the number of rules to be established in the second database (DB2), using data configuration information from the first database (DB1) and from the second database (DB2) to be cleansed.
PCT/JP2013/059007 2012-03-27 2013-03-27 Data cleansing system, method, and program WO2013146884A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012-072128 2012-03-27
JP2012072128 2012-03-27

Publications (1)

Publication Number Publication Date
WO2013146884A1 (fr) 2013-10-03

Family

ID=49260133

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/059007 WO2013146884A1 (fr) 2012-03-27 2013-03-27 Data cleansing system, method, and program

Country Status (1)

Country Link
WO (1) WO2013146884A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090006302A1 (en) * 2007-06-29 2009-01-01 Wenfei Fan Methods and Apparatus for Capturing and Detecting Inconsistencies in Relational Data Using Conditional Functional Dependencies
US20090287721A1 (en) * 2008-03-03 2009-11-19 Lukasz Golab Generating conditional functional dependencies
JP2012141847A (ja) * 2011-01-04 2012-07-26 Hitachi Solutions Ltd Data migration system, data migration device, and data migration method

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914616A (zh) * 2014-03-18 2014-07-09 清华大学深圳研究生院 Emergency data quality control system and method
CN103914616B (zh) * 2014-03-18 2017-12-05 清华大学深圳研究生院 Emergency data quality control system and method
JP2017534108A (ja) * 2014-09-26 2017-11-16 オラクル・インターナショナル・コーポレイション Declarative language and visualization system for recommended data transformations and repairs
US10891272B2 (en) 2014-09-26 2021-01-12 Oracle International Corporation Declarative language and visualization system for recommended data transformations and repairs
US10915233B2 (en) 2014-09-26 2021-02-09 Oracle International Corporation Automated entity correlation and classification across heterogeneous datasets
US10976907B2 (en) 2014-09-26 2021-04-13 Oracle International Corporation Declarative external data source importation, exportation, and metadata reflection utilizing http and HDFS protocols
JP2021061063A (ja) * 2014-09-26 2021-04-15 オラクル・インターナショナル・コーポレイション Declarative language and visualization system for recommended data transformations and repairs
US11379506B2 (en) 2014-09-26 2022-07-05 Oracle International Corporation Techniques for similarity analysis and data enrichment using knowledge sources
JP7148654B2 (ja) 2014-09-26 2022-10-05 オラクル・インターナショナル・コーポレイション Declarative language and visualization system for recommended data transformations and repairs
US11693549B2 (en) 2014-09-26 2023-07-04 Oracle International Corporation Declarative external data source importation, exportation, and metadata reflection utilizing HTTP and HDFS protocols
CN104750861A (zh) * 2015-04-16 2015-07-01 中国电力科学研究院 Method and system for cleansing massive data of an energy storage power station
WO2016165378A1 (fr) * 2015-04-16 2016-10-20 国网新源张家口风光储示范电站有限公司 Method and system for cleansing massive data of an energy storage power station

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 13768917; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 13768917; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)