WO2013146884A1 - Data-cleansing system, method, and program - Google Patents

Data-cleansing system, method, and program

Info

Publication number
WO2013146884A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
rule
database
cfd
cleansing
Prior art date
Application number
PCT/JP2013/059007
Other languages
French (fr)
Japanese (ja)
Inventor
綾子 星野
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Publication of WO2013146884A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs

Definitions

  • the present invention is based on Japanese Patent Application No. 2012-072128 (filed on March 27, 2012), the entire contents of which are incorporated herein by reference.
  • the present invention relates to a data cleansing system, method and program.
  • CFD: conditional functional dependency
  • a CFD is a rule stating that a functional dependency (abbreviated as “FD”), which represents a dependency between data attributes, holds for the set of tuples specified by a condition.
  • a tuple represents a row in a relational table with attributes as columns.
  • the CFD is composed of a condition part and a premise part, which form the left side (LHS: Left Hand Side) of the rule, and a consequent part, which forms the right side (RHS: Right Hand Side) of the rule and specifies attribute values.
  • “x1” means that the attribute value is a specific value.
  • the premise part consists only of attribute designations. An attribute value that is not fixed to a specific value (that is, a wildcard that matches an arbitrary value) is referred to as an “unnamed variable” (anonymous variable).
  • a rule in which the consequent part is fixed to a specified value is referred to as a “Constant CFD”.
  • a rule in which the consequent part is not fixed to a specified value but expresses a dependency between attributes is referred to as a “Variable CFD”. That is, if the right side of the pattern is the unnamed variable '_' (tp[A] = _), the rule is referred to as a “Variable CFD”.
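  • the distinction between Constant CFDs and Variable CFDs can be sketched in Python as follows. This is a minimal illustration, not the representation used in the source: the class `CFD`, its fields, and the two sample rules are hypothetical, and the wildcard is written as the string "_".

```python
from dataclasses import dataclass

WILDCARD = "_"  # unnamed (anonymous) variable: matches any value


@dataclass(frozen=True)
class CFD:
    # Hypothetical representation: patterns are (attribute, value-or-wildcard) pairs.
    lhs: tuple  # ((attribute, pattern), ...) for the condition and premise parts
    rhs: tuple  # (attribute, pattern) for the consequent part

    def is_constant(self) -> bool:
        """Constant CFD: the consequent is fixed to a specific value."""
        return self.rhs[1] != WILDCARD

    def is_variable(self) -> bool:
        """Variable CFD: the consequent pattern is the unnamed variable."""
        return self.rhs[1] == WILDCARD


# Sample rules (illustrative values only).
constant_rule = CFD(lhs=(("X1", "x1"),), rhs=("A", "a"))
variable_rule = CFD(lhs=(("A", "1"), ("B", WILDCARD)), rhs=("C", WILDCARD))
```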
  • A violation of a rule such as rule 1, that is, a tuple t1 that satisfies the condition “X1 = x1” but has “t1[A] ≠ a”, is called a “single tuple violation”.
  • the tuple t1 is referred to as a “violating tuple”.
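  • a single tuple violation of a Constant CFD can be checked one row at a time. The sketch below assumes that tuples are Python dictionaries and that the rule is given as a condition (attribute to required value) plus one consequent attribute and value; the function name and the sample rows are hypothetical.

```python
def single_tuple_violations(rows, condition, attr, value):
    """Return the rows that match `condition` but whose `attr` differs from `value`."""
    violating = []
    for row in rows:
        matches = all(row.get(k) == v for k, v in condition.items())
        if matches and row.get(attr) != value:
            violating.append(row)
    return violating


# Illustrative data only: the rule "COUNTRY = JP implies TAX = 5" is an assumption.
rows = [
    {"ID": 1, "COUNTRY": "JP", "TAX": 5},
    {"ID": 2, "COUNTRY": "JP", "TAX": 8},
    {"ID": 3, "COUNTRY": "US", "TAX": 8},
]
violations = single_tuple_violations(rows, {"COUNTRY": "JP"}, "TAX", 5)
```

  • with this data, only the tuple with ID 2 matches the condition while contradicting the consequent, so it is the only violating tuple.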
  • The support degree is the number of tuples that match the condition part and premise part (left side of the CFD: LHS) and the consequent part (right side of the CFD: RHS). Under another definition, it may be expressed as the ratio of the number of tuples matching both LHS and RHS to the total number of tuples.
  • The reliability degree is the ratio of the number of tuples that satisfy the CFD rule to the number of tuples that match the condition part and the premise part.
  • the support degree and the reliability degree are described below using a specific example.
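  • the two measures can be sketched for a Constant CFD as follows. This is a simplified illustration under stated assumptions: tuples are dictionaries, the LHS and RHS are attribute-to-value mappings, Variable CFDs (which relate pairs of tuples) are not handled, and the table below is made up rather than taken from Table 1.

```python
def support_and_reliability(rows, lhs, rhs):
    """Support = tuples matching LHS and RHS; reliability = support / tuples matching LHS."""
    lhs_match = [r for r in rows if all(r.get(k) == v for k, v in lhs.items())]
    both = [r for r in lhs_match if all(r.get(k) == v for k, v in rhs.items())]
    support = len(both)
    # Assumption: reliability is reported as 0.0 when no tuple matches the LHS.
    reliability = support / len(lhs_match) if lhs_match else 0.0
    return support, reliability


# Hypothetical rows over attributes A, B, C.
rows = [
    {"A": 1, "B": "x", "C": "p"},
    {"A": 1, "B": "y", "C": "p"},
    {"A": 1, "B": "z", "C": "q"},
    {"A": 2, "B": "x", "C": "q"},
]
support, reliability = support_and_reliability(rows, {"A": 1}, {"C": "p"})
```

  • here three tuples match the LHS (A = 1) and two of them also match the RHS (C = p), so the support is 2 and the reliability is 2/3.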
  • ID is a tuple ID
  • A, B, and C are attributes. From the relational data set in Table 1, for example, CFD φ1: A, B → C (1, _
  • Patent Document 1 discloses an apparatus that extracts erroneous data using a conditional expression for checking the correctness of data in a database, corrects the extracted data, and updates the database with the corrected data.
  • Patent Document 2 discloses a data mining device that generates correlation rules whose correlation coefficient, based on the correlation coefficient between attributes of the database, is a predetermined value or more, and that deletes from the set of inter-attribute correlation rules those for which only a correlation coefficient was generated and which have no true correlation.
  • Patent Document 3 discloses a configuration in which data cleansing / characterizing means for performing data cleansing that replaces or deletes an abnormal value of data with a specific value is repeatedly performed while changing a set value from an initial value.
  • Patent Document 4 discloses finding a rule violation caused by a configuration data abnormality, displaying the violated rule, displaying the table row corresponding to the rule violation, and executing a corrective action for the defective configuration data.
  • Patent Document 5 discloses creating an item database from a relational database, extracting correlation rules from the item database, and calculating the number of correlation rules corresponding to each item when reading the contents of the result rule file.
  • a data cleansing system receives cleansing target data and a data correction rule set (ΣCFD) as input and comprises a correction location extraction means (correction location extraction device), a correction content determination means (correction content determination device), and a correction result reflecting means (correction result reflecting device), none of which is shown.
  • FIG. 16 shows the processing of Algorithm BATCHREPAIR shown in FIG. 4 of Non-Patent Document 1.
  • a comment or the like is added to FIG. 4 of Non-Patent Document 1 to help understanding.
  • the correction location extracting means refers to the rule set (CFD set Σ) and extracts violating tuples (specifically, the set of rule-violating tuples) from the data (cleansing target database D) (Line 4).
  • a correction rule, a correction location, and a correction destination value are selected by the correction content determination means PICKNEXT () according to the algorithm (Line 6).
  • Non-Patent Document 2 discloses a system including such an operator user interface.
  • for a correction rule, for example, an operator decides whether to change the value (data) of an exception tuple in the cleansing target data that does not follow the rule.
  • either the current value is accepted or the value (data) of the exception tuple is changed.
  • as execution and updating of corrections are repeated, with an operator approving the correction rules recommended by the system, the regularity of the data increases with each iteration.
  • the CFD rule set may be provided in a data cleansing apparatus, or may be obtained by being extracted from cleansing target data as in the systems of Non-Patent Document 2 and Non-Patent Document 3.
  • data cleansing performed when data in the migration source system is migrated to the migration destination system will be described.
  • Data in the migration source system is called “cleansing target data”, and data already tested or used in the migration destination system is called “migration destination data”.
  • the column-level mapping (association) between the database of the migration source system and the database of the migration destination system has been completed.
  • the first problem is that when performing the above-mentioned repetitive data cleansing, it is not known at what point cleansing may be terminated. The reason will be described below.
  • exception data for example, exception data within an allowable range
  • if cleansing of additional data is performed without using this information, data may be corrected erroneously without the operator noticing that an exception was allowable.
  • the second problem is that, for example, when data cleansing is performed using CFD rules, regularity is imposed more than necessary, and as a result, original information is lost. The reason will be described below.
  • the present invention has been made in view of the above problems, and its main purpose is to provide a system, method, and program that can present a data cleansing end condition.
  • an apparatus is provided comprising: data analysis means for acquiring data configuration information of a first database, taking the first database as input; rule extraction / selection means for selecting a data-independent CFD rule set by excluding, from the conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and rule number estimating means for calculating, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection means, the data configuration information of the first database acquired by the data analysis means, and the data configuration information of the second database.
  • a method is provided in which: a data analysis process receives the first database as input and obtains data configuration information of the first database; a rule extraction / selection process selects a data-independent CFD rule set by excluding, from the conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and a rule number estimation process calculates, as a cleansing end determination condition for the second database, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process.
  • a data analysis process for obtaining data configuration information of the first database using the first database as an input; a rule extraction / selection process for selecting a data-independent CFD rule set by excluding, from the conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and a rule number estimation process for calculating, as a cleansing end determination condition for the second database, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process.
  • when a cleansing process is repeated in which a conditional functional dependency (CFD) rule set is derived (extracted) from a cleansing target database, tuple data (attribute values, etc.) violating a CFD rule are corrected, and the correction is reflected in the database, the cleansing process for the cleansing target database is terminated once the total number of established CFD rules in the CFD rule set reaches, as a result of the corrections, a predetermined number (estimated value) calculated in advance.
  • the total number of CFD rules to be established in the cleansing target database is estimated on the basis of another database corresponding to the cleansing target database, and as a result of the correction to the cleansing target database.
  • the data migration to the migration destination system database is performed.
  • based on the number of CFD rules satisfied in the migration destination database and the difference in data size between the migration destination database and the migration source database (the database storing the cleansing target data), an indication of the cleansing end condition is calculated.
  • as configuration information of the first database (DB1), data size information such as the number of attributes and the number of tuples is acquired; the rule extraction / selection means (102) selects a data-independent CFD rule group (108) by excluding, from the CFD rule set extracted from the first database (DB1), CFD rules determined to have relatively high data dependency; and the rule number estimation means (103) calculates an estimate of the total number of CFD rules to be established in the second database (DB2) to be cleansed, using the data-independent CFD rule group (108) obtained by the rule extraction / selection means (102), the data size information of the first database (DB1) obtained by the data analysis means (101), and the data size information (107) of the second database (DB2: 106) storing the cleansing target data.
  • the user (operator) of this system can thereby learn a guideline for the data cleansing end condition, namely a value (estimated value) indicating how many extracted CFD rules should be established before the cleansing process is ended.
  • a cleansing end condition estimating device (abbreviated as “end condition estimating device”) 100, which provides a guideline (estimated value) for the data cleansing end condition, comprises a database (DB1) 105, data analysis means 101 (data analysis apparatus), rule extraction / selection means 102 (rule extraction / selection apparatus), a database (DB2) 106, and rule number estimation means 103 (rule number estimation apparatus).
  • the database (DB1) 105 is a database of the migration destination system (referred to as “migration destination database”).
  • the database (DB2) 106 is a database that stores cleansing target data.
  • the database (DB1) 105 and the database (DB2) 106 are also referred to as the database DB1 and the database DB2 simply by removing the reference numbers.
  • the data analysis unit 101 reads the migration destination database DB1 and acquires information such as the number of attributes and the number of tuples.
  • the rule extraction / selection unit 102 extracts a rule group (CFD rule set) from the migration destination database DB1, excludes rules having high data dependency from the extracted rule group, and selects the remaining rule group as the data independent rule group 108.
  • the rule extraction / selection means 102 divides the data in the migration destination database DB1 into training data (Dtrain) and test data (Dtest) in k ways (where k is a predetermined positive integer). From the CFD rules extracted from the training data (Dtrain), it selects, for example, rules that were established in the test data (Dtest) m or more times out of the k times (0 < m ≤ k).
  • the rule number estimation means 103 reads the data of the cleansing target data (database DB2) 106 and calculates an estimated value of the rule that should be established in the cleansing target data.
  • the rule number estimation means 103 calculates, for example, the estimated number of CFD rules to be established in the database DB2 using: the total number of rules in the CFD rule set (data independent rule group 108) selected by the rule extraction / selection means 102; the number of tuples of the database DB1 acquired by the data analysis means 101 and the number of different values of the specified column of the database DB1; and at least one of the number of tuples of the database DB2 storing the cleansing target data, the number of different values of the designated column of the database DB2, and the number of CFD rules established in the database DB2.
  • the column designation may, for example, be input to the cleansing end condition estimation device 100 by the user (worker) or the like via input means (not shown) as the parameter 104 and supplied to the rule number estimation means 103.
  • the data analysis unit 101, the rule extraction / selection unit 102, and the rule number estimation unit 103 may be realized by a program that operates on a computer constituting the end condition estimation apparatus 100.
  • the program is stored in, for example, a magnetic or optical recording medium (device) or a semiconductor memory (for example, a read-only memory or a rewritable nonvolatile memory (EEPROM: Electrically Erasable and Programmable Read Only Memory)).
  • the data analysis unit 101 acquires information (data size information) related to the migration destination database DB1 (step A1 in FIG. 2).
  • the rule extraction / selection unit 102 divides the data in the migration destination database DB1, extracts rules, excludes the rules determined to have high data dependency from the extracted rule group Σ, and selects a data-independent rule group (step A2).
  • the rule number estimation means 103 calculates an approximate number (estimated value) of the total number of rules to be established in the cleansing target data (database DB2) (step A3).
  • step A2 in FIG. 2 will be described.
  • the rule extraction / selection means 102 divides the migration destination database DB1 into training data and test data in k ways (step A2-1).
  • the value of k is determined by the table size and sampling method (bootstrap method, cross-validation method, etc.) of the migration destination database DB1 acquired by the rule extraction / selection means 102.
  • test data with an appropriate number of tuples n is extracted k times from the migration destination database DB1, and the rest is used as training data. At this time, the k test data sets may overlap.
  • Data division methods such as test data and training data are important factors that determine the evaluation of the extracted rules. For example, when it is desired to obtain a set of rules that can be applied to data that differs in time, it is effective to rearrange the data by time stamps and then divide the data. For this reason, before the division, the data may be rearranged based on the prior knowledge of the worker or the like.
  • each tuple serves as training data (k-1) times and as test data only once.
  • two tuples ID1 and ID2 are used as test data, and eight tuples ID3 to ID10 are used as training data.
  • k sets (combinations) of such test data (2 tuples) and training data (8 tuples) can be obtained.
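  • the division into test data and training data can be sketched as follows for the cross-validation case, in which each tuple is test data exactly once. The function name `k_fold_splits` is hypothetical, and the choice k = 5 for 10 tuples is an assumption made so that each split has 2 test tuples and 8 training tuples as in the example above.

```python
def k_fold_splits(tuples, k):
    """Return k (test, train) pairs; each tuple is test data exactly once."""
    fold_size = len(tuples) // k
    splits = []
    for i in range(k):
        start, end = i * fold_size, (i + 1) * fold_size
        if i == k - 1:
            end = len(tuples)  # the last fold absorbs any remainder
        test = tuples[start:end]
        train = tuples[:start] + tuples[end:]
        splits.append((test, train))
    return splits


ids = [f"ID{i}" for i in range(1, 11)]  # ID1 ... ID10
splits = k_fold_splits(ids, 5)
first_test, first_train = splits[0]
```

  • in the first split, ID1 and ID2 are the test data and ID3 to ID10 are the training data. If the data should be ordered first (for example by time stamp), the list can be sorted before it is passed to the function.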
  • any one of the existing algorithms is used in the present embodiment.
  • those described in Patent Document 6, Non-Patent Document 4, and the like are used.
  • with the extraction algorithm, specifying the input data and appropriate support threshold (min_supp) and reliability threshold (min_conf) parameters yields a uniquely determined CFD rule group (CFD set) Σ.
  • the CFD support threshold (min_supp) and reliability threshold (min_conf) parameters may be supplied to the rule extraction / selection unit 102 via the input unit (not shown) as the parameter 104 in FIG.
  • an example of a rule set (CFD set) extracted from the training data of FIG. 6A is the rule in the first column in the table of FIG.
  • the rule extraction / selection unit 102 repeats the following 3) and 4) for each rule (CFD) in the rule group Σi extracted from the training data (8 tuples) (step A2-4).
  • the rule extraction / selection means 102 evaluates whether or not the CFD rule φ is satisfied in the test data (the 2 tuples out of 10 that are not training data) (step A2-5). That is, it evaluates whether the reliability of the CFD rule φ in the test data (the number of tuples satisfying both the premise part and the consequent part / the number of tuples satisfying the premise part) satisfies a preset criterion.
  • when the criterion is not satisfied, the rule extraction / selection unit 102 excludes the CFD rule φ from the rule group Σi (step A2-6).
  • in step A2-5, the reliability columns obtained from the test data (for example, the first to fifth reliability columns) are related to the CFD rule φ1.
  • when the rule φ is not satisfied, instead of excluding it immediately, φ may be excluded from the rule group Σi only when it is not satisfied more than q times during the k tests.
  • the rule extraction / selection means 102 merges the rule groups Σ1 to Σk (takes the union of the k sets Σ1 to Σk) and outputs the rule group Σ as the data independent rule group 108 (step A2-7).
  • the rule extraction / selection unit 102 may consolidate the rule group Σ by omitting redundant rules and implied rules.
  • the rule extracting / selecting means 102 may output only the size (number of rules) of the rule group Σ.
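  • the selection of data-independent rules can be sketched as follows, using the variant in which a rule is kept when it is established in the test data at least m times out of the k tests. This is a simplified illustration: rules are Constant CFDs given as (LHS, RHS) dictionaries, rule extraction itself is assumed to have been done already, and a rule whose LHS matches no test tuple is treated as not established (an assumption the source does not state). All names and data are hypothetical.

```python
from collections import Counter


def reliability(rows, lhs, rhs):
    """Reliability of a constant rule in `rows`, or None if no tuple matches the LHS."""
    matched = [r for r in rows if all(r.get(a) == v for a, v in lhs.items())]
    if not matched:
        return None
    ok = [r for r in matched if all(r.get(a) == v for a, v in rhs.items())]
    return len(ok) / len(matched)


def select_rules(rules_per_split, test_sets, m, min_conf=1.0):
    """Keep the rules established (reliability >= min_conf) in at least m test sets."""
    established = Counter()
    for rules, test_rows in zip(rules_per_split, test_sets):
        for name, (lhs, rhs) in rules.items():
            conf = reliability(test_rows, lhs, rhs)
            if conf is not None and conf >= min_conf:
                established[name] += 1
    return {name for name, count in established.items() if count >= m}


# Hypothetical rules and k = 3 test sets of one tuple each.
rules = {
    "country_tax": ({"COUNTRY": "JP"}, {"TAX": 5}),
    "pen_price": ({"PRODUCT": "pen"}, {"PRICE": 100}),
}
test_sets = [
    [{"COUNTRY": "JP", "TAX": 5, "PRODUCT": "pen", "PRICE": 100}],
    [{"COUNTRY": "JP", "TAX": 5, "PRODUCT": "pen", "PRICE": 120}],
    [{"COUNTRY": "JP", "TAX": 5, "PRODUCT": "pen", "PRICE": 120}],
]
selected = select_rules([rules] * 3, test_sets, m=2)
```

  • the rule on TAX holds in all three test sets and is kept, while the rule on PRICE holds only once and is dropped as data dependent. Setting m = 1 corresponds to taking the union of the surviving rules of all splits.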
  • step A3 in FIG. 2 will be described.
  • the rule number estimation means 103 obtains a data size comparison index based on the respective data size information in the databases DB1 and DB2 (step A3-1).
  • as the data size comparison index, for example, the number of tuples in the database or the number of different values in the specified column is used. Note that the number of different values of the designated column is 10 when the attribute values of the column take 10 distinct values, and 5 when they take 5 distinct values.
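  • both variants of the data size comparison index are simple counts. The sketch below assumes tuples are dictionaries; the function name `dbsize` mirrors the DBSIZE notation of the text but is otherwise hypothetical, as are the sample rows.

```python
def dbsize(rows, column=None):
    """Number of tuples, or number of distinct values of `column` if one is given."""
    if column is None:
        return len(rows)
    return len({row[column] for row in rows})


rows = [
    {"ID": 1, "COUNTRY": "JP"},
    {"ID": 2, "COUNTRY": "JP"},
    {"ID": 3, "COUNTRY": "US"},
]
tuple_count = dbsize(rows)
distinct_countries = dbsize(rows, "COUNTRY")
```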
  • the rule number estimating means 103 calculates the estimated number Number_of_CFDs(DB2) of rules to be established in the database DB2 using at least one of: the total number of CFD rules (data independent rule group) extracted and selected from the migration destination database DB1 by the rule extraction / selection means 102, Number_of_CFDs(DB1); the data size comparison index of the migration destination database DB1, DBSIZE(DB1); and the data size comparison index of the database DB2 storing the data to be cleansed, DBSIZE(DB2) (step A3-2).
  • An example of the calculation formula is given by the following formula (1), for example.
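  • formula (1) itself is not reproduced in this text. Purely as an illustration, and as an assumption rather than the source's formula, the sketch below scales Number_of_CFDs(DB1) by the ratio DBSIZE(DB2) / DBSIZE(DB1); the function name and the rounding are likewise hypothetical.

```python
def estimate_rule_count(num_cfds_db1, dbsize_db1, dbsize_db2):
    """Assumed proportional estimate of the rules to be established in DB2.

    NOTE: this proportional form is an illustrative assumption, not formula (1).
    """
    if dbsize_db1 <= 0:
        raise ValueError("DBSIZE(DB1) must be positive")
    return round(num_cfds_db1 * dbsize_db2 / dbsize_db1)


# Hypothetical inputs: 10 data-independent rules in DB1, sizes 100 and 80.
estimate = estimate_rule_count(10, 100, 80)
```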
  • a rule with less generality (a rule with high data dependency) is excluded by using a bootstrap method or a cross-validation method. Therefore, a rule peculiar to the migration destination database DB1 is excluded.
  • the number of CFD rules to be applied to the database DB2 storing cleansing target data can be estimated.
  • after such a rule (for example, the rule φ2 in FIG. 7) is excluded from the extracted rule group, the number of rules that should be applied is estimated as the end condition for the cleansing target data (DB2), which reduces the possibility of incorrect corrections.
  • the data cleansing system of the second embodiment comprises: data analysis means 201 that obtains a CFD rule set (rule group 207) having a reliability of p or higher from a database (DB2) 206 storing cleansing target data; rule application automatic determination means 202 that receives the rule group 207 from the data analysis means 201 and automatically determines the correction contents for making the data of the database (DB2) 206 conform to the rules of the rule group 207; a data update unit 203 that updates data based on the determined correction contents; an end condition estimation device 100; and an established rule number counter 204.
  • the end condition estimation apparatus 100 estimates the cleansing process end condition of the cleansing target database (DB2) 206 from the migration destination database (DB1) 205, and is constituted by the cleansing end condition estimation apparatus 100 of the first embodiment.
  • the data analysis unit 201, the rule application automatic determination unit 202, and the data update unit 203 may be realized by a program that runs on a computer.
  • the program is stored in, for example, a magnetic or optical recording medium (device) or a semiconductor memory (for example, a read-only memory or a rewritable non-volatile memory (EEPROM: Electrically Erasable and Programmable Read Only Memory)).
  • the data analysis unit 201 obtains from the cleansing target database DB2, using for example a method disclosed in Patent Document 6 or Non-Patent Document 4, a CFD rule set (rule group) whose reliability is equal to or higher than a preset threshold, together with a violation tuple set consisting of the tuples that do not conform to each rule of the CFD rule set.
  • the rule application automatic determination unit 202 determines, for each pair of a CFD rule and its set of nonconforming tuples (violating tuple set), whether or not the contents of the tuples should be changed so as to eliminate the nonconformity.
  • the data update unit 203 executes necessary changes (changes in the contents of the tuples that eliminate the rule nonconformity) to the data in the database DB2 in accordance with the determination result of the rule application automatic determination unit 202.
  • in FIG. 13, the automatic correction destination selection process of step B4 in FIG. 12 is replaced by a process that presents the correction to be applied to the worker, lets the worker decide whether to correct, and, when the correction is adopted, executes the correction and counts up the number of established rules by one.
  • the data analysis unit 201 analyzes the cleansing target data DB2, and extracts a CFD rule set having a reliability level equal to or higher than the threshold p and a violation tuple set for the rule (step B1 in FIG. 12).
  • the data analysis means 201 initializes the count value of the established rule number counter 204 (step B2 in FIG. 12).
  • the data shown in FIG. 14 is input to the data analysis unit 201 as data of the cleansing target database DB2.
  • FIG. 15 is a diagram illustrating an example of a CFD rule set with a reliability of 0.8 or more extracted from the data of FIG. 14 and a violation tuple set for those rules.
  • a notation (format) such as v(φx): {tuple1 or tuple2} is used to indicate, for example, that tuple 1 or tuple 2 violates the rule φx.
  • the CFD rules having a reliability of 1.0 are the seven rules φ1 to φ7 above the broken line in FIG. In this case, the count value of the established rule number counter 204 is initialized to “7”.
  • the data analyzing means 201 repeats the following steps B4-B6 (step B3).
  • the rule application automatic determination unit 202 selects a CFD rule to be applied, a location to be corrected, and a value to be corrected (step B4). That is, when there are a plurality of CFD rules having a reliability of less than 1.0, a rule to be applied first is selected from the plurality of CFD rules, and a violation tuple for the selected rule is matched with the rule. Then, the location to be corrected is automatically selected, and the value to be corrected is determined.
  • a method of ordering rules to be applied using an index based on an edit distance of a character string can be used.
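  • the text does not spell out how the edit-distance index is used. As one possible reading, and only as an assumption, the sketch below computes the Levenshtein distance between strings and prefers the candidate correction that changes the current value the least; both function names and the sample values are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cheapest_correction(current, candidates):
    """Pick the candidate value with the smallest edit distance from the current value."""
    return min(candidates, key=lambda c: edit_distance(current, c))


best = cheapest_correction("Tokio", ["Tokyo", "Osaka"])
```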
  • the data update unit 203 executes correction to the tuple selected by the rule application automatic determination unit 202 (step B5 in FIG. 12). That is, the data update unit 203 executes correction to the data selected by the rule application automatic determination unit 202 in step B4. Further, the data update unit 203 updates the reliability and violation tuple information regarding the established rule having the reliability of less than 1.0.
  • the data updating unit 203 increases the count value of the established rule number counter 204 by 1 (step B6 in FIG. 12).
  • the estimated value of the number of established rules by the termination condition estimation device 100 is “8”.
  • to correct the tuples ID9 and ID10, changing the value of either PRICE or TAX of the ID9 or ID10 tuple is considered as a correction candidate.
  • the rule application automatic determination unit 202 automatically makes a determination regarding correction.
  • for the automatic determination regarding correction, reference is made, for example, to the descriptions of Non-Patent Documents 2 and 3.
  • as in Non-Patent Document 1, the determination regarding correction may also be made by manually selecting from among the correction candidates.
  • φ14: (COUNTRY = JP, PRODUCT) → TAX is adopted and data correction is performed.
  • the rule application automatic determination unit 202 acquires the estimated value of the cleansing process end condition from the end condition estimation apparatus 100, determines the end condition, and automatically ends when the end condition is satisfied.
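  • the iteration with its end condition can be sketched as follows. This is a simplified illustration under stated assumptions: each rule is reduced to a name and a reliability, applying a correction is assumed to make the rule fully established, and pending rules are ordered by descending reliability (the source mentions an edit-distance index instead). The function name, rule names, and reliabilities are hypothetical.

```python
def run_cleansing(reliabilities, estimated_total, apply_correction):
    """Repeat corrections until the established-rule count reaches the estimate."""
    # Initialize the counter with the rules that already hold (reliability 1.0).
    established = sum(1 for r in reliabilities.values() if r >= 1.0)
    # Assumption: try the most reliable pending rules first.
    pending = sorted(
        (name for name, r in reliabilities.items() if r < 1.0),
        key=lambda name: reliabilities[name],
        reverse=True,
    )
    for name in pending:
        if established >= estimated_total:  # end condition reached
            break
        apply_correction(name)              # select and execute the correction
        reliabilities[name] = 1.0           # the rule now holds for all tuples
        established += 1                    # count up the established rules
    return established


# Seven rules already established plus two pending ones; the estimate is 8.
rules = {f"rule{i}": 1.0 for i in range(1, 8)}
rules.update({"rule8": 0.9, "rule9": 0.8})
applied = []
final_count = run_cleansing(rules, estimated_total=8, apply_correction=applied.append)
```

  • with an initial count of 7 and an estimate of 8, the loop applies exactly one correction and stops, leaving the remaining low-reliability rule unexamined.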
  • the configuration of FIG. 10 includes a rule application determination input unit 302 instead of the rule application automatic determination unit 202 of FIG. 9, and the operator 303 determines whether to apply the correction rule.
  • the rule application determination input unit 302 ends the iterative cleansing based on the instruction of the worker 303, but the worker 303 may determine whether or not to correct the data further based on the correction rule.
  • FIG. 13 is a flowchart for explaining a procedure in the case of performing a determination process of whether or not there is a correction by a human (worker 303 in FIG. 10) in FIG.
  • the automatic selection processing (performed by the rule application automatic determination means 202 in FIG. 9) of the rule to be applied, the location to be corrected, and the correction destination value in step B4 in FIG. 1, B4-2 is replaced.
  • FIG. 11 is a diagram illustrating a configuration of a second modification of the second embodiment. Referring to FIG. 11, the configuration is the same as that of FIG. 10 except that a value mapping unit 209 is provided. Description of the identical parts is omitted below.
  • the value mapping means 209 is used to eliminate differences in how values (attribute values) are expressed in the databases DB1 and DB2, and the established rule number counter 204 is incremented by one only when a matching rule including the value (attribute value) is established in the database DB2.
  • the value mapping performed by the value mapping unit 209 associates value expressions in corresponding columns (for example, the Japanese words for “male” and “female” are associated with the English “male” and “female”). Reference is made to the description of Patent Document 3 and the like.
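  • value mapping of this kind can be sketched as a lookup table applied before rules are compared. The mapping entries, the function name, and the sample row below are assumptions for illustration; the source does not list the actual value pairs.

```python
# Hypothetical mapping from expressions used in one database to those of the other.
VALUE_MAP = {"男性": "male", "女性": "female"}


def normalize(value, mapping=VALUE_MAP):
    """Map a value to its counterpart expression; unknown values pass through unchanged."""
    return mapping.get(value, value)


row = {"ID": 1, "GENDER": "男性"}
normalized_row = {key: normalize(val) for key, val in row.items()}
```

  • after normalization, a rule that mentions “male” in one database can be matched against tuples of the other database that used the Japanese expression.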
  • because the end point is determined by the end condition estimating means (device) 100, it is not necessary to examine the entire enormous rule set. This reduces the possibility that, by going on to examine rules with low reliability, a rule that need not be applied is adopted by mistake.
  • the present invention can be applied to uses such as data cleansing during data migration and data integration, and to any system that cleanses a database for which a corresponding reference database exists.
  • a cleansing end condition estimation device comprising: rule number estimation means for calculating an estimated value of the total number of CFD rules to be established in the database.
  • the cleansing end condition estimation device according to appendix 1, wherein the rule extraction / selection means divides the first database into training data and test data in k ways (where k is a predetermined positive integer) and selects, from the CFD rules extracted from the training data, rules that are established in the test data at least m times out of the k tests (where m is a predetermined positive integer less than or equal to k).
  • the cleansing end condition estimation device according to appendix 1, wherein the rule extraction / selection means divides the first database into training data and test data in k ways (where k is a predetermined positive integer) and selects, from the CFD rules extracted from the training data, rules that are established in the test data m times out of the k tests (where m is a predetermined positive integer less than or equal to k) with a reliability equal to or higher than a predetermined threshold.
  • the cleansing end condition estimation device according to supplementary note 1 or 2, wherein the cleansing end condition estimation device is calculated using
  • the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases are respectively designated as the number of tuples of each database, the number of different values of the designated column, or the number of tuples.
  • (Appendix 6) A data cleansing system comprising: a first database; a second database to be cleansed; the cleansing end condition estimation device according to any one of appendices 1 to 5; second data analysis means for obtaining a CFD rule set from the second database; rule application determination means for determining the data correction content needed to make the data conform to the rules in the CFD rule set acquired by the second data analysis means; and a data update unit that updates data based on the correction content determined by the rule application determination means.
  • (Appendix 7) The data cleansing system according to appendix 6, wherein the second data analysis means extracts a CFD rule set having a reliability equal to or higher than a predetermined threshold value from the second database.
  • the rule application determining unit is configured such that the number of established rules in the CFD rule set extracted by the second data analyzing unit is the total number of CFD rules to be established in the second database calculated by the end condition estimating device.
  • a cleansing end condition calculation method in which: a data analysis process receives a first database as input and obtains data configuration information of the first database; a rule extraction / selection process selects a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and a rule number estimation process calculates, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process together with the data configuration information of the first database and of the second database.
  • the cleansing end condition calculation method according to appendix 9, wherein the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and rules that are established in the test data at least m times out of the k tests (where m is a predetermined positive integer less than or equal to k) are selected from the CFD rules extracted from the training data.
  • the cleansing end condition calculation method according to appendix 9, wherein the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and rules that are established in the test data m times out of the k tests (where m is a predetermined positive integer less than or equal to k) with a reliability equal to or higher than a predetermined threshold are selected from the CFD rules extracted from the training data.
  • the cleansing end condition calculation method according to appendix 9 or 10, wherein the cleansing end condition calculation method is performed using
  • the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases are respectively designated as the number of tuples of each database, the number of different values of the designated column, or the number of tuples.
  • (Appendix 14) A data cleansing method using the cleansing end condition calculation method according to any one of appendices 9 to 13, the data cleansing method comprising: a second data analysis process for obtaining a CFD rule set from the second database; a rule application determination process for determining the data correction content needed to make the data conform to the rules in the CFD rule set acquired in the second data analysis process; and a data update process for updating data based on the correction content determined in the rule application determination process.
  • (Appendix 15) The data cleansing method according to appendix 14, wherein control is performed to end data cleansing when the number of established rules in the CFD rule set extracted in the second data analysis process reaches the estimated value of the total number of CFD rules to be established in the second database calculated by the end condition estimation device.
  • a program causing a computer to execute: a rule extraction / selection process for selecting a data-independent conditional functional dependency (CFD) rule set; and a rule number estimation process for calculating, as a cleansing end determination condition for the second database, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process.
  • the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and rules that are established in the test data at least m times out of the k tests (where m is a predetermined positive integer less than or equal to k) are selected from the CFD rules extracted from the training data.
  • the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and rules that are established in the test data m times out of the k tests (where m is a predetermined positive integer less than or equal to k) with a reliability equal to or higher than a predetermined threshold are selected from the CFD rules extracted from the training data.
  • the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases are respectively designated as the number of tuples of each database, the number of different values of the designated column, or the number of tuples.
  • (Appendix 22) A program for causing the computer to execute each process of the program according to any one of appendices 17 to 21, together with: a second data analysis process for obtaining a CFD rule set from a second database; a rule application determination process for determining the data correction content needed to make the data conform to the rules in the CFD rule set acquired in the second data analysis process; and a data update process for updating data based on the correction content determined in the rule application determination process.
  • the program according to appendix 22, wherein control for ending the data cleansing is performed when the number of established rules in the CFD rule set extracted in the second data analysis process reaches the estimated value of the total number of CFD rules to be established in the second database calculated by the end condition estimation device.
  • Description of symbols: 100 cleansing end condition estimation device (end condition estimation device); 101 data analysis unit; 102 rule extraction / selection unit; 103 rule number estimation unit; 104 parameter; 105, 205 database (destination database: DB1); 106, 206 database (database DB2 to be cleansed); 107 data size information; 108 data-independent rule group; 201 data analysis means; 202 rule application automatic judgment means; 203 data update means; 204 established rule number counter; 207, 208 rule group; 209 value mapping means; 302 rule application judgment input means; 303 worker
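The rule selection described in the supplementary notes above (split the first database into training and test data k ways, then keep only rules that hold in the test data at least m times) can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the toy miner `extract_rules`, the check `holds`, the attribute names `X` and `A`, and the interleaved split are all assumptions introduced here.

```python
def extract_rules(train):
    """Toy miner (hypothetical): constant rules X=x -> A=a with no exception in train."""
    seen = {}
    for row in train:
        seen.setdefault(row["X"], set()).add(row["A"])
    return [(x, next(iter(vals))) for x, vals in seen.items() if len(vals) == 1]


def holds(rule, test):
    """Assumed meaning of 'established': some test row matches X=x, none contradicts A=a."""
    x, a = rule
    matching = [row for row in test if row["X"] == x]
    return bool(matching) and all(row["A"] == a for row in matching)


def select_stable_rules(rows, k, m):
    """Keep rules mined from training data that hold on test data in >= m of k splits."""
    counts = {}
    for fold in range(k):
        # Interleaved split (an assumption): every k-th row goes to the test data.
        test = [row for i, row in enumerate(rows) if i % k == fold]
        train = [row for i, row in enumerate(rows) if i % k != fold]
        for rule in extract_rules(train):
            if holds(rule, test):
                counts[rule] = counts.get(rule, 0) + 1
    return {rule for rule, n in counts.items() if n >= m}


# Hypothetical first database: X=p always gives A=1, X=q is inconsistent.
db1 = [
    {"X": "p", "A": 1}, {"X": "p", "A": 1}, {"X": "p", "A": 1},
    {"X": "q", "A": 2}, {"X": "q", "A": 3}, {"X": "p", "A": 1},
]
stable = select_stable_rules(db1, k=3, m=2)
```

In this toy data the rules about `q` appear in some training splits but never hold in the matching test split, so they are dropped as data-dependent, while the rule for `p` survives every split.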

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided is a system capable of indicating the end conditions of data cleansing. The system is provided with: a data analysis means (101) for acquiring data configuration information from a first database (DB1); a rule extraction and selection means (102) for extracting a CFD rule collection from the database (DB1) and excluding rules with high data dependence from among the collection; and a rule number estimation means (103) for calculating the estimated value of the number of rules to be placed in effect in a second database (DB2) as the end conditions of recursive data cleansing of the second database (DB2), using data configuration information from the first database (DB1) and the second database (DB2) to be cleansed.

Description

Data cleansing system, method and program
[Description of related applications]
The present invention is based on Japanese Patent Application No. 2012-072128 (filed on March 27, 2012), the entire contents of which are incorporated herein by reference.
The present invention relates to a data cleansing system, method and program.
Data cleansing techniques that correct data using conditional functional dependencies (abbreviated as “CFD”) are known. Data cleansing cleans a database by correcting erroneous data, removing duplicate data and the like, and making formats uniform.
A CFD is a rule stating that a functional dependency (abbreviated as “FD”), which represents a dependency between data attributes, holds for a tuple set specified by a condition. A tuple is a row in a relational table whose columns are attributes. A CFD consists of a condition part and a premise part, which form the left-hand side (LHS) of the rule, and a specification of attribute values in the consequent part, which forms the right-hand side (RHS) of the rule. The condition part designates a subset of the data (a tuple set); the statement that attribute X1 has attribute value x1 is written “X1 = x1”, where “x1” means that the attribute value is one particular value. The premise part consists of attribute designations only. That an attribute does not take a specific value (that is, a wildcard that matches any value) is written “X = _” or the like; ‘_’ is also called an “unnamed variable”.
The consequent part comes in two types:
(A) one consisting of an attribute and an attribute value (for example, Rule 1 below); and
(B) one specifying only an attribute (for example, Rule 2 below).
Type (A) is written, for example, “A = a”, and type (B), for example, “A = _”. When an attribute value is specified in the consequent part, the premise part can be omitted. The premise part and the consequent part may each consist of a plurality of attributes and their respective attribute values. Examples of CFD rules are shown below.
Rule 1: X1 → A (x1 || a)
Rule 2: X1, X2 → A (x1, _ || _)
Rule 1 means “when attribute X1 has attribute value x1, attribute A has attribute value a”. When Rule 1 holds, the consequent part takes the specified value throughout the tuple set that matches the condition part. That is, t[A] = a for every tuple t in the tuple set satisfying the condition X1 = x1 (t[A] denotes the value of attribute A in tuple t). A rule whose consequent part is fixed to a specified value in this way is called a “Constant CFD”.
Rule 2 means “when attribute X1 has attribute value x1, attribute A is determined by attribute X2”. When Rule 2 holds, a dependency exists between the attributes specified in the premise part and the consequent part within the tuple set that matches the condition part. That is, for any tuple pair t1, t2 in the tuple set satisfying the condition “X1 = x1”, if t1[X2] = t2[X2], then t1[A] = t2[A]. A rule whose consequent part is not fixed to a specified value but which expresses a dependency between attributes is called a “Variable CFD”. In other words, a rule is a “Variable CFD” when the right side of || in its pattern tuple is the unnamed variable ‘_’ (tp[A] = _).
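The notation above can be made concrete with a small data structure. The following is a minimal sketch under assumed names (`CFD`, `is_constant`, `condition`, `premise` are hypothetical and not part of the patent): it stores the attributes on each side together with the pattern tuple written around `||`, and derives the condition part, the premise part, and the Constant/Variable distinction from the wildcard `_`.

```python
from dataclasses import dataclass

WILDCARD = "_"  # the unnamed variable '_'


@dataclass(frozen=True)
class CFD:
    lhs: tuple          # LHS attributes, e.g. ("X1", "X2")
    rhs: str            # RHS attribute, e.g. "A"
    lhs_pattern: tuple  # pattern values left of '||'
    rhs_pattern: str    # pattern value right of '||'

    def is_constant(self):
        """Constant CFD: the consequent is pinned to a specific value."""
        return self.rhs_pattern != WILDCARD

    def condition(self):
        """Condition part: LHS attributes pinned to a constant."""
        return {a: v for a, v in zip(self.lhs, self.lhs_pattern) if v != WILDCARD}

    def premise(self):
        """Premise part: LHS attributes given by name only (wildcard pattern)."""
        return tuple(a for a, v in zip(self.lhs, self.lhs_pattern) if v == WILDCARD)


rule1 = CFD(("X1",), "A", ("x1",), "a")              # X1 -> A (x1 || a)
rule2 = CFD(("X1", "X2"), "A", ("x1", "_"), "_")     # X1, X2 -> A (x1, _ || _)
```

With this representation, `rule1` is a Constant CFD with an empty premise part, and `rule2` is a Variable CFD whose condition part is `X1 = x1` and whose premise part is `X2`.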
Rules 1 and 2 may also be written as follows (see FIG. 15, described later):
Rule 1: X1 = x1 → A = a
Rule 2: (X1 = x1, X2) → A
A violation of a rule such as Rule 1, that is, a tuple t1 that satisfies the condition “X1 = x1” but has t1[A] ≠ a, is called a “single-tuple violation”, and the tuple t1 is called a “violating tuple”.
A violation of a rule such as Rule 2, that is, a pair of tuples t1, t2 that satisfy the condition “X1 = x1” and have t1[X2] = t2[X2] but t1[A] ≠ t2[A], is called a “multi-tuple violation”, and the tuples t1 and t2 are called “violating tuples”.
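The two kinds of violation can be detected with a few lines of code. The sketch below is illustrative only: the function names, the dictionary-per-tuple representation, and the sample rows are assumptions made here, not taken from the patent.

```python
from collections import defaultdict


def single_tuple_violations(rows, condition, rhs_attr, rhs_value):
    """Constant CFD: rows that match the condition but whose RHS value differs."""
    return [r for r in rows
            if all(r[a] == v for a, v in condition.items())
            and r[rhs_attr] != rhs_value]


def multi_tuple_violations(rows, condition, premise_attrs, rhs_attr):
    """Variable CFD: groups that agree on the premise attributes but not on the RHS."""
    groups = defaultdict(list)
    for r in rows:
        if all(r[a] == v for a, v in condition.items()):
            groups[tuple(r[a] for a in premise_attrs)].append(r)
    return [g for g in groups.values()
            if len({r[rhs_attr] for r in g}) > 1]


# Hypothetical sample data.
rows = [
    {"id": 1, "X1": "x1", "X2": "p", "A": "a"},
    {"id": 2, "X1": "x1", "X2": "p", "A": "b"},
    {"id": 3, "X1": "x1", "X2": "q", "A": "a"},
    {"id": 4, "X1": "x2", "X2": "p", "A": "c"},
]
single = single_tuple_violations(rows, {"X1": "x1"}, "A", "a")   # Rule 1
multi = multi_tuple_violations(rows, {"X1": "x1"}, ["X2"], "A")  # Rule 2
```

Here tuple 2 violates Rule 1 on its own, while tuples 1 and 2 together violate Rule 2 because they share `X2 = p` but disagree on `A`. Tuple 4 does not satisfy the condition and is ignored by both checks.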
Support is the number of tuples that match the condition part and premise part (the left-hand side of the CFD: LHS) as well as the consequent part (the right-hand side of the CFD: RHS). Under an alternative definition, it is expressed as the ratio of the number of tuples matching both the LHS and the RHS to the total number of tuples.
Confidence (also translated as reliability) is the ratio of the number of tuples for which the CFD rule holds to the number of tuples that match the condition part and premise part. Support and confidence are explained below using a concrete example.
[Table 1: image JPOXMLDOC01-appb-T000001, table contents not reproduced]
In Table 1 above, ID is the tuple ID, and A, B, and C are attributes. From the relation in Table 1, the following CFD, for example, is extracted:
φ1: A, B → C (1, _ || _)
(when the value of A is 1, C is determined by B). In Table 1, tuple IDs 1, 2, and 3 and tuple IDs 8, 9, and 10 match this rule.
Since six tuples match both LHS(φ1) and RHS(φ1) of CFD φ1, the support is 6; alternatively, as a ratio to the total of 10 tuples, support = 6/10 = 0.6. Of the six tuples that match the condition part and premise part of CFD φ1, the CFD rule holds for all six, so confidence = 6/6 = 1.0 (= 100%).
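The arithmetic above can be reproduced in code. Because the contents of Table 1 are not available here, the sketch below uses a hypothetical 10-row table built so that tuple IDs 1, 2, 3, 8, 9, and 10 have A = 1 and B determines C among them. It also assumes one particular way of counting "tuples for which the rule holds" for a Variable CFD (the most frequent C value within each group of equal B); the source does not spell this out, so treat it as an assumption.

```python
from collections import Counter, defaultdict


def support_and_confidence(rows, condition, premise_attrs, rhs_attr):
    """Return (support count, support ratio, confidence) for a Variable CFD."""
    matching = [r for r in rows if all(r[a] == v for a, v in condition.items())]
    groups = defaultdict(Counter)
    for r in matching:
        groups[tuple(r[a] for a in premise_attrs)][r[rhs_attr]] += 1
    # Assumption: within each premise group, tuples carrying the majority
    # RHS value are the ones counted as satisfying the rule.
    satisfied = sum(c.most_common(1)[0][1] for c in groups.values())
    ratio = satisfied / len(rows)
    confidence = satisfied / len(matching) if matching else 0.0
    return satisfied, ratio, confidence


table = [  # hypothetical stand-in for Table 1: (ID, A, B, C)
    (1, 1, 1, 2), (2, 1, 1, 2), (3, 1, 2, 3), (4, 2, 1, 1), (5, 2, 1, 3),
    (6, 3, 2, 1), (7, 3, 2, 2), (8, 1, 2, 3), (9, 1, 3, 1), (10, 1, 3, 1),
]
rows = [dict(zip(("ID", "A", "B", "C"), t)) for t in table]
count, ratio, confidence = support_and_confidence(rows, {"A": 1}, ["B"], "C")
```

For this stand-in table the result matches the figures in the text: a support count of 6, a support ratio of 0.6, and a confidence of 1.0.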
For data cleansing systems, see, for example, Non-Patent Documents 1 and 2. Regarding the correction and updating of database data, Patent Document 1 discloses an apparatus that extracts erroneous data using conditional expressions that check the correctness of data in a database, corrects the extracted data, and updates the database with the corrected data.
Several patent documents were found in the applicant's prior art search. Among them, Patent Document 2 discloses a data mining apparatus configured to generate, based on the correlation coefficients between database attributes, only correlation rules whose correlation coefficient is at least a predetermined value, and to delete from the inter-attribute correlation rule set any rule that does not have a true correlation coefficient. Patent Document 3 discloses a configuration in which data cleansing / characterizing means, which performs data cleansing that replaces abnormal data values with specific values or deletes them, runs repeatedly while changing a setting value from its initial value. Patent Document 4 discloses finding rule violations caused by configuration data anomalies, displaying the violated rule, displaying the table rows corresponding to the rule violation, and executing corrective action on the faulty configuration data. Patent Document 5 discloses creating an item database from a relational database, extracting correlation rules from the item database, and calculating the number of correlation rules corresponding to each item when the contents of the result rule file are read.
A data cleansing system typically takes as input the data to be cleansed and a set of data correction rules (ΣCFD), and comprises correction location extraction means (a correction location extraction device), correction content determination means (a correction content determination device), and correction result reflecting means (a correction result reflecting device), none of which are illustrated. Given CFD rules, the system treats tuples that violate the rules as correction candidates and updates the data so that the CFD rules are no longer violated. FIG. 16 shows the processing of Algorithm BATCHREPAIR from Figure 4 of Non-Patent Document 1, with comments and the like added to aid understanding.
1) The correction location extraction means refers to the rule set (CFD set Σ) and extracts the violating tuples (specifically, the set of tuples that violate a rule) from the data (the database D to be cleansed) (Line 4).
As long as a violating tuple exists, steps 2) to 4) below are repeated (Line 5 to Line 8).
2) The correction content determination means PICKNEXT() selects, according to the algorithm, a correction rule, a correction location, and a target value (Line 6).
3) The correction result reflecting means RESOLVE(t, B, v, φ) executes the selected correction (Line 7).
4) Based on the executed correction, the set of violating tuples is updated (Update_Dirty_Tuples) (Line 8).
The process of steps 1) to 4) above is called “iterative data cleansing”.
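The control flow of steps 1) to 4) can be sketched as a simple loop. This is not the BATCHREPAIR algorithm of Non-Patent Document 1 (which chooses repairs using its own selection criteria); it is a deliberately naive stand-in restricted to Constant CFDs, and the rule format, the function names, and the `max_iterations` guard are assumptions introduced here.

```python
def find_violations(rows, rules):
    """Steps 1 and 4: (row index, rule) pairs that violate a Constant CFD."""
    return [(i, rule) for i, r in enumerate(rows) for rule in rules
            if all(r[a] == v for a, v in rule["if"].items())
            and r[rule["attr"]] != rule["value"]]


def pick_next(dirty):
    """Step 2: naive stand-in for PICKNEXT(); simply takes the first violation."""
    i, rule = dirty[0]
    return i, rule["attr"], rule["value"]


def resolve(rows, i, attr, value):
    """Step 3: naive stand-in for RESOLVE(); overwrites the attribute value."""
    rows[i][attr] = value


def iterative_cleansing(rows, rules, max_iterations=1000):
    """Repeat steps 2 to 4 while violating tuples remain; return the repair count."""
    dirty = find_violations(rows, rules)
    steps = 0
    # The iteration cap guards against rule sets that contradict each other.
    while dirty and steps < max_iterations:
        i, attr, value = pick_next(dirty)
        resolve(rows, i, attr, value)
        dirty = find_violations(rows, rules)
        steps += 1
    return steps


# Hypothetical rule X1 = x1 -> A = a and sample data.
rules = [{"if": {"X1": "x1"}, "attr": "A", "value": "a"}]
data = [{"X1": "x1", "A": "b"}, {"X1": "x1", "A": "a"}, {"X1": "x2", "A": "c"}]
repairs = iterative_cleansing(data, rules)
```

Note that the loop ends only when no violation remains (or the cap is hit), which is exactly the behavior the later discussion questions: in practice not every rule should be enforced, so a separate end condition is needed.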
In step 2) above the correction content is determined by the algorithm, but other systems instead present the correction candidates recommended by the algorithm to an operator and apply a correction only on the operator's instruction. Non-Patent Document 2 discloses a system provided with such an operator user interface.
In a system (apparatus) that performs data cleansing using CFD rules as correction rules, one of two actions is taken on the data to be cleansed: for example, an operator either changes the value (data) of an exception tuple that does not follow a rule so that it follows the rule, or accepts the current value and leaves the exception tuple unchanged. In a system where corrections are executed and the data updated iteratively as the operator approves correction rules recommended by the system, the regularity of the data increases with each iteration. The CFD rule set may be held by the data cleansing apparatus, or may be obtained by extraction from the data to be cleansed, as in the systems of Non-Patent Document 2 and Non-Patent Document 3.
Japanese Patent Laid-Open No. 2-301840
JP 2001-265596 A
JP 2004-29971 A
JP 2009-48611 A
Japanese Patent Laid-Open No. 11-250084
US Patent Application Publication No. 2010/0250596
US Pat. No. 7,720,873
The following is the result of an analysis of the related art made by the present inventor.
As an example, consider the data cleansing performed when data in a migration source system is migrated to a migration destination system. The data in the migration source system is called the “cleansing target data”, and the data already tested or in use in the migration destination system is called the “migration destination data”.
It is assumed that the mapping (association) between the databases of the migration source and migration destination systems has already been completed at the table and column level.
The first problem is that, when performing the iterative data cleansing described above, it is not known at what point the cleansing may be ended. The reason is explained below.
In practice it is almost never the case that “all rules in the rule set Σ must hold”. In reality, some of the rules in the rule set Σ may not hold. However, it is difficult for an operator to judge how many rules must have been applied in correcting and updating the data before the iterative data cleansing process may be ended.
In addition, even data that is used without problems in the migration destination system may contain exceptions to the rules (rule violations), for example exception data within a tolerable range. If additional data is cleansed without using this information, the operator may fail to notice an acceptable exception and apply an erroneous correction.
The second problem is that data cleansing using, for example, CFD rules imposes more regularity than necessary, and original information is lost as a result. The reason is explained below.
Data cleansing using CFD rules often yields more candidate correction rules than are actually needed. Correcting data (for example, attribute values) according to these correction rules imposes more regularity on the data than necessary, and information the data originally held is lost. Specifically, when a database is cleansed, correcting data according to a correction rule gives regularity to the tuple or tuples that violated the rule before the correction was applied. When the values of other attributes are then corrected on the basis of tuples that have been given this regularity, correction rules that were never needed may be extracted. Applying such unnecessary correction rules to the attribute values of the database imposes still more regularity than necessary, and the original information is lost.
It is therefore desirable to develop and implement a technique that gives guidance on how far cleansing must proceed before data correction may be ended in iterative data cleansing (a finding of the present inventor).
Accordingly, the present invention has been newly devised in view of the above problems, and its main object is to provide a system, method, and program capable of presenting an end condition for data cleansing.
According to the present invention, there is provided an apparatus comprising:
data analysis means for receiving a first database as input and acquiring data configuration information of the first database;
rule extraction / selection means for selecting a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and
rule number estimation means for calculating, as an end determination condition for cleansing a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection means, the data configuration information of the first database acquired by the data analysis means, and data configuration information of the second database.
According to the present invention, there is provided a method in which:
a data analysis process receives a first database as input and acquires data configuration information of the first database;
a rule extraction / selection process selects a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and
a rule number estimation process calculates, as an end determination condition for cleansing a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.
According to the present invention, there is provided a program causing a computer to execute:
a data analysis process for receiving a first database as input and acquiring data configuration information of the first database;
a rule extraction / selection process for selecting a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and
a rule number estimation process for calculating, as an end determination condition for cleansing a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.
According to the present invention, an end condition for data cleansing can be presented. Problems, operations, and effects other than those described above will be apparent to those skilled in the art from the disclosure of the following embodiments and the like.
A diagram showing the configuration of the first embodiment of the present invention.
A flowchart showing an example of the processing procedure of the first embodiment.
A flowchart showing an example of the processing procedure of step A2 of FIG. 2.
A flowchart showing an example of the processing procedure of step A3 of FIG. 2.
A diagram showing an example of the input data (data of the migration destination system) of the first embodiment.
A diagram showing an example of data division in the first embodiment.
A diagram showing an example of extracted rules and evaluation results in the first embodiment.
A diagram showing an example of the cleansing target data (data of the migration source system) in the first embodiment.
A diagram showing an example of the configuration of the second embodiment of the present invention.
A diagram showing a modification of the second embodiment of the present invention.
A diagram showing another modification of the second embodiment of the present invention.
A flowchart showing an example of the processing procedure of the second embodiment.
A flowchart showing the processing procedure of a modification of the second embodiment.
A diagram showing an example of the input data (data of the migration destination system) of the second embodiment.
A diagram showing an example of the output result of the second embodiment.
A diagram based on Figure 4 of Non-Patent Document 1.
Embodiments of the present invention will now be described. According to an embodiment of the present invention, a cleansing process is performed iteratively: a conditional functional dependency (CFD) rule set is derived (extracted) from the database to be cleansed, the data (attribute values and the like) of tuples that violate a CFD rule is corrected, and the corrected data is reflected in the database. When, as a result of the corrections, the total number of CFD rules in the CFD rule set that hold reaches a predetermined number (estimate) calculated in advance, the cleansing process for the database to be cleansed is terminated.
According to an embodiment of the present invention, there are provided a device, a method, a computer program, and a system that estimate the total number of CFD rules that should hold in the database to be cleansed, using another database corresponding to the database to be cleansed as a reference, and that determine whether to repeat the cleansing or to terminate it based on whether, as a result of the corrections to the database to be cleansed, the total number of CFD rules in the CFD rule set that hold has reached the estimate. As described above, data is migrated to the database of the migration destination system after the migration source database (the cleansing target data) has been cleansed. In data cleansing that uses a CFD rule set, a guideline for the cleansing end condition is calculated from the number of CFD rules that hold in the migration destination database and from the difference in data size between the migration destination database and the migration source database (the database storing the cleansing target data).
According to one exemplary embodiment, referring to FIG. 1 for example, the data analysis means (101) acquires, from the first database (105: DB1) of the migration destination system, data size information such as the number of attributes and the number of tuples as configuration information of the first database (DB1). The rule extraction/selection means (102) selects a data-independent CFD rule group (108) by excluding, from the CFD rule set extracted from the first database (DB1), the CFD rules judged to have relatively high data dependency. The rule number estimation means (103) calculates an estimate of the total number of CFD rules that should hold in the second database (DB2) to be cleansed from the data-independent CFD rule group (108) obtained by the rule extraction/selection means (102), the data size information of the first database (DB1) obtained by the data analysis means (101), and the data size information (107) of the second database (106: DB2) storing the cleansing target data.
With this configuration, in iterative data cleansing, the user (operator) of the system can learn a guideline for the data cleansing end condition, that is, an estimate of how many CFD rules must be extracted from the second database (DB2) before the cleansing process may be terminated. Presenting this guideline also helps prevent erroneous rules from being applied to the cleansing target data. Furthermore, out of the enormous number of CFD rules concerning the cleansing target data, an estimate of the number of rules that should be applied can be presented. As a result, it is possible to avoid the situation in the related art where data is corrected according to an erroneous rule and information held by the cleansing target data is lost.
<Embodiment 1>
Referring to FIG. 1, in the first embodiment of the present invention, a cleansing end condition estimation device 100 (abbreviated as "end condition estimation device"), which provides a guideline (estimate) for the data cleansing end condition, includes a database (DB1) 105, data analysis means 101 (data analysis device), rule extraction/selection means 102 (rule extraction/selection device), a database (DB2) 106, and rule number estimation means 103 (rule number estimation device).
In this embodiment, the database (DB1) 105 is the database of the migration destination system (referred to as the "migration destination database"). The database (DB2) 106 is the database that stores the cleansing target data. In the following, the database (DB1) 105 and the database (DB2) 106 are also referred to simply as database DB1 and database DB2, without reference numerals.
The data analysis means 101 reads the migration destination database DB1 and acquires information such as the number of attributes and the number of tuples.
The rule extraction/selection means 102 extracts a rule group (CFD rule set) from the migration destination database DB1, excludes the rules with high data dependency from the extracted rule group, and selects the remaining rules as the data-independent rule group 108. For example, the rule extraction/selection means 102 divides the data in the migration destination database DB1 into training data (Dtrain) and test data (Dtest) in k different ways (where k is a predetermined positive integer) and selects, from the CFD rules extracted from the training data (Dtrain), the rules that hold on the test data (Dtest) at least m times out of k (0 < m < k), for example.
The rule number estimation means 103 reads the cleansing target data (database DB2) 106 and calculates an estimate of the number of rules that should hold in the cleansing target data. For example, the rule number estimation means 103 calculates the estimate of the total number of CFD rules that should hold in the database DB2 using at least one of the following:
the total number of rules in the CFD rule set (data-independent rule group 108) selected by the rule extraction/selection means 102;
the number of tuples of the database DB1 and the number of distinct values in a designated column of the database DB1, as acquired by the data analysis means 101; and
the number of tuples of the database DB2 storing the cleansing target data, the number of distinct values in the designated column of the database DB2, and the number of CFD rules that hold in the database DB2.
The column may be designated, for example, as a parameter 104 that a user (operator) enters into the cleansing end condition estimation device 100 via input means (not shown) and that is supplied to the rule number estimation means 103. In FIG. 1, the data analysis means 101, the rule extraction/selection means 102, and the rule number estimation means 103 may be implemented as programs running on the computer that constitutes the end condition estimation device 100. The programs are stored, for example, on a magnetic or optical recording medium (device) or in a semiconductor memory (for example, a read-only memory or a rewritable non-volatile memory such as an EEPROM (Electrically Erasable and Programmable Read Only Memory)).
The operation (processing) of this embodiment will be described with reference to the flowchart of FIG. 2.
The data analysis means 101 acquires information (data size information) about the migration destination database DB1 (step A1 in FIG. 2).
The rule extraction/selection means 102 divides the data of the migration destination database DB1, extracts rules, excludes from the extracted rule group Σ the rules judged to have high data dependency, and selects the data-independent rule group (step A2).
The rule number estimation means 103 calculates an approximate value (estimate) of the total number of rules that should hold in the cleansing target data (database DB2) (step A3).
It is assumed that the columns of the migration destination database DB1 and of the database DB2 storing the migration source cleansing target data have been mapped to each other before the processing of FIG. 2 is executed. The values in the mapped columns, however, do not need to be mapped. When a database has multiple tables, steps A1 to A3 of FIG. 2 are performed for each table.
The processing of step A2 in FIG. 2 will be described with reference to the flowchart of FIG. 3.
The rule extraction/selection means 102 refers to the migration destination database DB1 and acquires information such as its table size (number of tuples). For example, if a table in the migration destination database DB1 has the contents shown in FIG. 5, the number of tuples (number of table rows) = 10 is acquired as the table size.
The rule extraction/selection means 102 divides the migration destination database DB1 into training data and test data in k different ways (step A2-1).
The value of k is determined by the table size of the migration destination database DB1 acquired by the rule extraction/selection means 102 and by the sampling method (the bootstrap method, the cross-validation method, or the like).
In the bootstrap method, test data consisting of a suitable number n of tuples is drawn k times from the migration destination database DB1, and the remainder is used as training data each time. The test data sets 1 to k may overlap one another.
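The bootstrap-style division described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `bootstrap_splits`, the fixed random seed, and the choice to draw n distinct tuples per round are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_splits(
    tuple_ids: Sequence[int], n_test: int, k: int, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Draw k test sets of n_test tuples each; the rest of each round is training data.

    Test sets from different rounds may overlap, as the text allows.
    """
    if not 0 < n_test < len(tuple_ids):
        raise ValueError("n_test must be between 1 and the table size - 1")
    rng = random.Random(seed)  # hypothetical fixed seed, only for reproducibility
    splits = []
    for _ in range(k):
        test = sorted(rng.sample(list(tuple_ids), n_test))
        train = [t for t in tuple_ids if t not in test]
        splits.append((train, test))
    return splits


# A table with 10 tuples (IDs 1 to 10), test size n = 2, k = 5 rounds.
splits = bootstrap_splits(range(1, 11), n_test=2, k=5)
```

Each element of `splits` is one (training data, test data) pair on which rule extraction and evaluation would then be run.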
How the data is divided into test data and training data is an important factor that determines the evaluation of the extracted rules. For example, to obtain a rule set that can be applied to data from different points in time, it is effective to sort the data by time stamp before dividing it. For this reason, the data may be sorted before division based on the operator's prior knowledge or the like.
In the cross-validation method, if there is enough data, 10-fold cross-validation with k = 10 is used, for example. (In k-fold cross-validation, the data is divided into k blocks (data sets); one block is used as test data and the remaining k-1 blocks as training data; the validation is run k times so that each of the k blocks serves as test data once; and the average of the k results is used as the estimate.)
If there is not enough data, as much data as possible is assigned to the training data: the training data is set with k = (number of tuples) - 2 and the test data consists of 2 tuples. Because testing a CFD rule requires at least two tuples (whether a CFD rule holds cannot be verified with a single tuple), the test data must contain two or more tuples.
In the k divisions, each tuple serves as training data (k-1) times and as test data exactly once.
FIGS. 6(A) and 6(B) show an example in which the data of FIG. 5 is divided into training data and test data with the number of test tuples n = 2. In FIGS. 6(A) and 6(B), the two tuples ID1 and ID2 are the test data and the eight tuples ID3 to ID10 are the training data. k such pairs (combinations) of test data (2 tuples) and training data (8 tuples) are obtained.
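As a hedged sketch of the cross-validation-style division, the following code splits a 10-tuple table into consecutive test blocks of two tuples so that every tuple is test data exactly once. The function name `split_k_ways` and the use of consecutive blocks are assumptions for illustration; the source does not fix how blocks are formed.

```python
from typing import List, Sequence, Tuple


def split_k_ways(
    tuple_ids: Sequence[int], n_test: int = 2
) -> List[Tuple[List[int], List[int]]]:
    """Return (training, test) pairs in which each tuple is test data exactly once."""
    if n_test < 2:
        raise ValueError("testing a CFD rule needs at least two tuples")
    if len(tuple_ids) % n_test != 0:
        raise ValueError("this sketch assumes the table size is a multiple of n_test")
    splits = []
    for start in range(0, len(tuple_ids), n_test):
        test = list(tuple_ids[start:start + n_test])
        train = [t for t in tuple_ids if t not in test]
        splits.append((train, test))
    return splits


# 10 tuples with test size 2 give k = 5 pairs; the first pair matches the
# FIG. 6 example (ID1 and ID2 as test data, ID3 to ID10 as training data).
splits = split_k_ways(range(1, 11), n_test=2)
first_train, first_test = splits[0]
```

With this layout each tuple appears in the training data k-1 = 4 times and in the test data once.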
For the divisions 1 to k of the data of the migration destination database DB1, that is, for each of the i = 1 to k combinations of training data and test data, the rule extraction/selection means 102 repeats the following items 1) to 4) (step A2-2).
1) The rule extraction/selection means 102 extracts a rule group Σi (i is the loop variable, i = 1, 2, ..., k) from the training data (step A2-3).
Any existing algorithm may be used for this extraction in the present embodiment; for example, those described in Patent Document 6 or Non-Patent Document 4 are used. Given the input data and suitable parameters, namely a support threshold (min_supp) and a confidence threshold (min_conf), the extraction algorithm yields a uniquely determined CFD rule group (CFD set) Σ. The parameters min_supp and min_conf may be supplied to the rule extraction/selection means 102 as the parameter 104 of FIG. 1 via input means (not shown).
For example, the rules in the first column of the table in FIG. 7 are an example of a rule set (CFD set) extracted from the training data of FIG. 6(A).
φ1: (COUNTRY=USA, PRICE) → TAX
This rule means that, in the set of tuples whose attribute COUNTRY has the value USA, the value of TAX is determined once the value of the attribute PRICE is determined.
φ2: (COUNTRY=JPN, PRODUCT=001_book) → TAX=50
This rule means that, in the set of tuples whose attribute COUNTRY has the value JPN and whose attribute PRODUCT has the value 001_book, TAX is always 50.
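To make the two rule forms concrete, here is a small sketch that encodes φ1 and φ2 and checks them against a few rows. The `CFD` class, the `holds` function, and the sample rows are hypothetical and are not taken from the figures; a condition value of `None` stands for "any value" (as for PRICE in φ1).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CFD:
    # (attribute, constant) pairs; None means the attribute may take any value.
    conditions: Tuple[Tuple[str, Optional[str]], ...]
    target: str
    target_value: Optional[str] = None  # set only for constant rules like φ2


def holds(rule: CFD, rows: List[Dict[str, str]]) -> bool:
    """Return True if no pair of selected rows contradicts the rule."""
    seen: Dict[Tuple[str, ...], str] = {}
    for row in rows:
        if any(c is not None and row[a] != c for a, c in rule.conditions):
            continue  # row is outside the tuple set selected by the condition
        if rule.target_value is not None and row[rule.target] != rule.target_value:
            return False
        key = tuple(row[a] for a, _ in rule.conditions)
        if seen.setdefault(key, row[rule.target]) != row[rule.target]:
            return False  # same left-hand side, different right-hand side
    return True


# Hypothetical rows for illustration only.
rows = [
    {"COUNTRY": "USA", "PRODUCT": "001_book", "PRICE": "100", "TAX": "8"},
    {"COUNTRY": "USA", "PRODUCT": "002_pen", "PRICE": "100", "TAX": "8"},
    {"COUNTRY": "JPN", "PRODUCT": "001_book", "PRICE": "1000", "TAX": "50"},
    {"COUNTRY": "JPN", "PRODUCT": "001_book", "PRICE": "1000", "TAX": "10"},
]
phi1 = CFD((("COUNTRY", "USA"), ("PRICE", None)), "TAX")
phi2 = CFD((("COUNTRY", "JPN"), ("PRODUCT", "001_book")), "TAX", "50")
phi1_ok = holds(phi1, rows)
phi2_ok = holds(phi2, rows)
```

On these rows φ1 holds (both USA rows with PRICE 100 have the same TAX), while φ2 is violated by the last row.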
2) For each rule (CFD) in the rule group Σi extracted from the training data (8 tuples), the rule extraction/selection means 102 repeats the following items 3) and 4) (step A2-4).
3) The rule extraction/selection means 102 evaluates whether CFD rule φ holds on the test data (the 2 tuples out of the 10 that are not training data) (step A2-5). That is, it evaluates whether the confidence of CFD rule φ on the test data (the number of tuples satisfying both the antecedent and the consequent, divided by the number of tuples satisfying the antecedent) meets a preset criterion.
4) If the confidence of CFD rule φ on the test data does not meet the preset criterion, the rule extraction/selection means 102 excludes CFD rule φ from the rule group Σi (step A2-6).
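Items 3) and 4) can be sketched for constant rules such as φ2 as follows. The helper names, the encoding of a rule as an (antecedent, consequent) pair, the `threshold` parameter, and the sample rows are assumptions for the example; support is taken here to be the number of test tuples matching the antecedent, which the source does not state explicitly.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]
Rule = Tuple[Dict[str, str], Tuple[str, str]]  # (antecedent, (attribute, value))


def support_and_confidence(rule: Rule, test_rows: List[Row]) -> Tuple[int, Optional[float]]:
    """Confidence = tuples satisfying antecedent and consequent / tuples satisfying antecedent."""
    antecedent, (attr, value) = rule
    matching = [r for r in test_rows if all(r[a] == v for a, v in antecedent.items())]
    if not matching:
        return 0, None  # no matching tuple: the test says nothing about the rule
    hits = sum(1 for r in matching if r[attr] == value)
    return len(matching), hits / len(matching)


def filter_rules(rules: List[Rule], test_rows: List[Row], threshold: float) -> List[Rule]:
    """Drop every rule whose confidence on the test data fails the criterion."""
    kept = []
    for rule in rules:
        _, confidence = support_and_confidence(rule, test_rows)
        if confidence is None or confidence > threshold:
            kept.append(rule)
    return kept


# Hypothetical two-tuple test data.
test_rows = [
    {"COUNTRY": "JPN", "PRODUCT": "001_book", "TAX": "50"},
    {"COUNTRY": "JPN", "PRODUCT": "001_book", "TAX": "10"},
]
phi2: Rule = ({"COUNTRY": "JPN", "PRODUCT": "001_book"}, ("TAX", "50"))
other: Rule = ({"COUNTRY": "USA"}, ("TAX", "8"))
kept = filter_rules([phi2, other], test_rows, threshold=0.8)
```

Here φ2 has confidence 0.5 on the test data and is excluded, while the USA rule matches no test tuple and is left in place.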
For example, when evaluating the CFD rules of FIG. 7, the preset criterion is assumed to be:
when Support > 0,
Confidence > 0.8.
That is, the parameter values are support threshold (min_supp) = 0 and confidence threshold (min_conf) = 0.8.
In step A2-5, the sequence of confidence values obtained on the respective test data (for example, the confidence values of the first through fifth tests) is, for CFD rule φ1,
φ1: [1.0, -, 1.0, 1.0, -]
(where "-" is written when the support is 0.0), so φ1 passes the above criterion.
For CFD rule φ2, on the other hand, the sequence is
φ2: [1.0, -, 0.5, 1.0, 1.0]
which does not meet the criterion (the confidence is 0.5 in the third test), so CFD rule φ2 is excluded from the rule group Σi.
Besides excluding a rule φ immediately when it fails to hold on the test data, a configuration may be used in which φ is excluded from the rule group Σi only when it fails to hold q or more times in the k tests.
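The following sketch applies the criterion (Support > 0 implies Confidence > 0.8) to the two confidence sequences quoted above, including the variant that tolerates fewer than q failures. The function name `keep_rule` and the use of `None` for the "-" entries are illustrative assumptions.

```python
from typing import List, Optional


def keep_rule(confidences: List[Optional[float]], min_conf: float = 0.8, q: int = 1) -> bool:
    """Keep the rule unless it fails the criterion q or more times.

    None stands for "-" (support 0), which counts neither as a pass nor as a failure.
    q = 1 corresponds to excluding the rule on its first failure.
    """
    failures = sum(1 for c in confidences if c is not None and c <= min_conf)
    return failures < q


phi1 = [1.0, None, 1.0, 1.0, None]
phi2 = [1.0, None, 0.5, 1.0, 1.0]

phi1_kept = keep_rule(phi1)             # no failures
phi2_kept = keep_rule(phi2)             # one failure (0.5), excluded when q = 1
phi2_kept_q2 = keep_rule(phi2, q=2)     # tolerated when two failures are required
```

With q = 1 the result matches the text: φ1 is kept and φ2 is excluded.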
The rule extraction/selection means 102 then combines the rule groups Σ1 to Σk (takes the union of the k sets Σ1 to Σk) and outputs the resulting rule group Σ as the data-independent rule group 108 (step A2-7). In doing so, the rule extraction/selection means 102 may consolidate the rule set by removing duplicate rules and rules implied by other rules from the rule group Σ.
Instead of outputting the rule group Σ (data-independent rule group 108) in step A2-7, the rule extraction/selection means 102 may output only the size of the rule group Σ (the number of rules).
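Step A2-7 amounts to a set union, sketched below. Rules are represented as plain strings purely for illustration, and the example rule groups are hypothetical; the union removes exact duplicates only, and removing rules implied by other rules would need additional logic that is not shown.

```python
from typing import Iterable, Set


def merge_rule_groups(groups: Iterable[Iterable[str]]) -> Set[str]:
    """Take the union of the rule groups Σ1..Σk; exact duplicates collapse."""
    merged: Set[str] = set()
    for group in groups:
        merged |= set(group)
    return merged


# Hypothetical rule groups surviving in three of the k divisions.
sigma_1 = ["(COUNTRY=USA, PRICE) -> TAX"]
sigma_2 = ["(COUNTRY=USA, PRICE) -> TAX", "(PRODUCT) -> PRICE"]
sigma_3: list = []

sigma = merge_rule_groups([sigma_1, sigma_2, sigma_3])
rule_count = len(sigma)  # the size alone may be output instead of the rules
```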
Step A3 of FIG. 2 will be described with reference to the flowchart of FIG. 4.
The rule number estimation means 103 obtains a data size comparison index from the data size information of each of the databases DB1 and DB2 (step A3-1).
Examples of the data size comparison index include:
・the number of tuples in the database, or
・the number of distinct values in a designated column.
The number of distinct values of a designated column is 10 if the attribute values in that column take 10 different values, and 5 if they take 5 different values.
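Both data size comparison indices are simple to compute. The sketch below uses hypothetical function names and hypothetical rows; it is meant only to show what "number of tuples" and "number of distinct values in a designated column" mean in practice.

```python
from typing import Dict, List

Row = Dict[str, str]


def tuple_count(rows: List[Row]) -> int:
    """Data size index 1: the number of tuples (table rows)."""
    return len(rows)


def distinct_count(rows: List[Row], column: str) -> int:
    """Data size index 2: the number of distinct values in the designated column."""
    return len({row[column] for row in rows})


# Hypothetical table for illustration.
rows = [
    {"COUNTRY": "USA", "PRODUCT": "001_book"},
    {"COUNTRY": "USA", "PRODUCT": "002_pen"},
    {"COUNTRY": "JPN", "PRODUCT": "001_book"},
]
n_tuples = tuple_count(rows)
n_countries = distinct_count(rows, "COUNTRY")
```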
The rule number estimation means 103 calculates the estimate Number_of_CFDs(DB2) of the number of rules that should hold in the database DB2 (step A3-2), using at least one of the following:
the total number of CFD rules (data-independent rule group) extracted and selected from the migration destination database DB1 by the rule extraction/selection means 102: Number_of_CFDs(DB1);
the data size comparison index of the migration destination database DB1: DBSIZE(DB1); and
the data size comparison index of the database DB2 storing the data to be cleansed: DBSIZE(DB2).
One example of the calculation is given by the following formula (1).
Number_of_CFDs(DB2) = Number_of_CFDs(DB1) × DBSIZE(DB2) / DBSIZE(DB1)   ... (1)
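Formula (1) scales the rule count of DB1 by the ratio of the data size indices. A minimal sketch, with a hypothetical function name and hypothetical input numbers:

```python
def estimate_rule_count(num_cfds_db1: int, dbsize_db1: int, dbsize_db2: int) -> float:
    """Formula (1): Number_of_CFDs(DB2) = Number_of_CFDs(DB1) * DBSIZE(DB2) / DBSIZE(DB1)."""
    if dbsize_db1 <= 0:
        raise ValueError("DBSIZE(DB1) must be positive")
    return num_cfds_db1 * dbsize_db2 / dbsize_db1


# Hypothetical numbers: 7 data-independent rules in DB1, which has 10 tuples,
# and a cleansing target DB2 with 20 tuples.
estimate = estimate_rule_count(num_cfds_db1=7, dbsize_db1=10, dbsize_db2=20)
```

The same function applies when the distinct-value count of a designated column is used as DBSIZE instead of the tuple count.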
In this embodiment, rules with little generality (rules with high data dependency) are excluded, for example by the bootstrap method or the cross-validation method. Rules specific to the migration destination database DB1 are thereby excluded, and the number of CFD rules that should be applied to the database DB2 storing the cleansing target data can be estimated.
For example, if rule φ2 of FIG. 7 were applied to the data of the cleansing target database DB2 (FIG. 8), the value 10 in the tuple with ID=3 would be wrongly corrected to 50. According to this embodiment, such a rule (for example, rule φ2 of FIG. 7) is excluded from the rule group extracted from the cleansing target database before the number of rules to be applied as the end condition for the cleansing target data (DB2) is estimated, which reduces the likelihood of erroneous corrections.
When the number of CFD rules extracted from the cleansing target data DB2 exceeds the number of CFD rules extracted from the migration destination database DB1, the cleansing of the cleansing target database DB2 is terminated.
<Embodiment 2>
Next, a data cleansing system according to the second embodiment of the present invention will be described. Referring to FIG. 9, the data cleansing system of the second embodiment includes: data analysis means (second data analysis means) 201 that obtains a CFD rule set (rule group 207) with a confidence of at least p from a database (DB2) 206 storing the data to be cleansed; rule application automatic determination means 202 that receives the rule group 207 from the data analysis means 201 and automatically determines the corrections needed to make the data of the database (DB2) 206 consistent with the rules of the rule group 207; data update means 203 that updates the data based on the determined corrections; the end condition estimation device 100; and an established rule number counter 204. The end condition estimation device 100 estimates, from the migration destination database (DB1) 205, the end condition of the cleansing process for the database (DB2) 206 to be cleansed, and consists of the cleansing end condition estimation device 100 of the first embodiment. The data analysis means 201, the rule application automatic determination means 202, and the data update means 203 may be implemented as programs running on a computer. The programs are stored, for example, on a magnetic or optical recording medium (device) or in a semiconductor memory (for example, a read-only memory or a rewritable non-volatile memory such as an EEPROM (Electrically Erasable and Programmable Read Only Memory)).
Using, for example, the techniques disclosed in Patent Document 6 or Non-Patent Document 4, the data analysis means 201 obtains from the cleansing target database DB2 a CFD rule set (rule group) whose confidence is at least a preset threshold, together with a violation tuple set consisting of the sets of tuples that do not conform to each rule of the CFD rule set.
For each pair of a CFD rule and its set of non-conforming tuples (violation tuple set), the rule application automatic determination means 202 determines whether the contents of the tuples should be changed so as to resolve the non-conformity.
The data update means 203 applies the necessary changes (changes to tuple contents that resolve the rule non-conformity) to the data of the database DB2 in accordance with the determination result of the rule application automatic determination means 202.
Next, the operation of this embodiment will be described with reference to the flowcharts of FIG. 12 and FIG. 13. FIG. 13 replaces the automatic selection of the correction in step B4 of FIG. 12 with a process in which the correction to be applied is presented to the operator, the operator decides whether to make the correction, and, if the correction is adopted, the correction is executed and the number of established rules is incremented by one.
The data analysis means 201 analyzes the cleansing target data DB2 and extracts a CFD rule set with a confidence of at least the threshold p and the set of tuples that violate the rules (step B1 of FIG. 12).
Next, the data analysis means 201 initializes the count value of the established rule number counter 204 (step B2 of FIG. 12). As an example, the initial value of the established rule number counter 204 is the number of CFD rules that hold with a confidence of 1.0 (= 100%) in the data of the cleansing target database DB2.
Suppose that, for example, the data of FIG. 14 is input to the data analysis means 201 as the data of the cleansing target database DB2. The data analysis means 201 extracts from the data of FIG. 14 a CFD rule set with a confidence of at least the threshold p = 0.8 and the set of tuples that violate those rules.
FIG. 15 shows an example of the set of CFD rules with a confidence of 0.8 or higher extracted from the data of FIG. 14, and the set of tuples that violate those rules. For some violations, two conflicting values are supported by the same number of tuples, so one of them is wrong but it cannot be determined to which value the attribute should be unified. In that case, a notation stating that, for a rule φx, either tuple 1 or tuple 2 causes the violation is used (for example, v(φx): {tuple1 or tuple2}). For example, the entry for CFD rule φ8 in FIG. 15,
φ8: (PRODUCT, PRICE) → ID (conf = 0.9), v(φ8): {tuple9 or tuple10}
indicates that tuple 9 or tuple 10 in FIG. 14 violates CFD rule φ8. That is, for tuples 9 and 10 in FIG. 14, the value of ID is not uniquely determined even when the values of PRODUCT and PRICE are fixed (tuples 9 and 10 in FIG. 14 are duplicates).
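As a rough illustration of how a confidence value and a violation set such as v(φ8) might be derived for one rule, the following Python sketch checks a rule of the form LHS → RHS against a small table. The confidence definition used here (the share of matching tuples that carry the most frequent RHS value within their LHS group), the function name `check_rule`, and the sample rows are assumptions made for illustration; they are not taken from FIG. 14 or FIG. 15.

```python
from collections import Counter, defaultdict

def check_rule(rows, lhs, rhs, condition=None):
    """Return (confidence, violations) for the rule lhs -> rhs.

    rows      : dict mapping tuple id -> dict of attribute values
    lhs, rhs  : tuple of attribute names, and one attribute name
    condition : optional dict attribute -> constant (the CFD pattern)
    violations: list of sets; a set with several ids is an undecided conflict
    Assumed definition: confidence = tuples that agree with the most frequent
    rhs value of their lhs group / all tuples matching the condition.
    """
    condition = condition or {}
    groups = defaultdict(list)
    for tid, row in rows.items():
        if all(row[a] == v for a, v in condition.items()):
            groups[tuple(row[a] for a in lhs)].append(tid)

    total = sum(len(ids) for ids in groups.values())
    kept = 0
    violations = []
    for ids in groups.values():
        counts = Counter(rows[t][rhs] for t in ids)
        best = max(counts.values())
        kept += best
        if len(counts) == 1:
            continue  # the group is consistent
        winners = [v for v, c in counts.items() if c == best]
        if len(winners) == 1:
            # Clear majority: tuples carrying another value are violations.
            violations += [{t} for t in ids if rows[t][rhs] != winners[0]]
        else:
            # Tie: the tuples conflict, but it is unclear which are wrong.
            violations.append(set(ids))
    confidence = kept / total if total else 1.0
    return confidence, violations

# Hypothetical rows: tuples 1 and 2 agree on (PRODUCT, PRICE) but not on ID.
rows = {
    1: {"PRODUCT": "pen", "PRICE": 100, "ID": "a"},
    2: {"PRODUCT": "pen", "PRICE": 100, "ID": "b"},
    3: {"PRODUCT": "ink", "PRICE": 300, "ID": "c"},
}
conf, v = check_rule(rows, ("PRODUCT", "PRICE"), "ID")
```

In this sketch a tie is reported as a single set of tuple ids, which corresponds to the "tuple a or tuple b" notation described above.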
The CFD rules with a confidence of 1.0 (conf = 1.0) are treated as established rules, and their number is set as the initial value of the established rule number counter 204. In the example of FIG. 15, the CFD rules with a confidence of 1.0 are the seven rules φ1 to φ7 above the broken line in FIG. 15. In this case, the count value of the established rule number counter 204 is initialized to "7".
Next, while violating tuples exist and the number of established rules is smaller than the estimate obtained by the end condition estimation device 100, the data analysis means 201 repeats the following steps B4 to B6 (step B3).
The rule application automatic determination means 202 selects the CFD rule to be applied, the location to be corrected, and the corrected value (step B4). That is, when there are several CFD rules with a confidence below 1.0, it selects from among them the rule to be applied first, automatically selects the location to be corrected so that the tuples violating the selected rule become consistent with it, and determines the corrected value. For this processing, a method that orders the rules to be applied using an index based on the edit distance between character strings, as described for example in Non-Patent Document 2, can be used.
Next, the data update means 203 executes the correction to the tuple selected by the rule application automatic determination means 202 in step B4 (step B5 in FIG. 12). The data update means 203 also updates the confidence and violating-tuple information for the rules whose confidence is below 1.0.
The data update means 203 then increments the count value of the established rule number counter 204 by one (step B6 in FIG. 12).
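Steps B1 to B6 can be sketched as a simple loop. The following Python outline is only a sketch under stated assumptions: `analyze`, `choose_fix`, and `apply_fix` are hypothetical callables standing in for the data analysis means 201, the rule application automatic determination means 202, and the data update means 203, and `target_count` stands for the estimate supplied by the end condition estimation device 100. Re-running the analysis after each correction is one possible way to refresh the confidence and violation information, not necessarily the way the embodiment does it.

```python
def iterative_cleansing(db, analyze, choose_fix, apply_fix, target_count, p=0.8):
    """Repeat corrections until no violation remains or enough rules hold.

    analyze(db, p)     -> list of (rule, confidence, violations)   (step B1)
    choose_fix(rules)  -> a correction for one rule with conf < 1  (step B4)
    apply_fix(db, fix) -> None, mutates db in place                (step B5)
    All three callables are hypothetical placeholders.
    """
    rules = analyze(db, p)                                        # B1
    established = sum(1 for _, conf, _ in rules if conf == 1.0)   # B2
    while established < target_count:                             # B3
        # Rules that still have violating tuples are repair candidates.
        pending = [r for r in rules if r[1] < 1.0 and r[2]]
        if not pending:
            break                       # no violating tuple is left
        fix = choose_fix(pending)                                 # B4
        apply_fix(db, fix)                                        # B5
        established += 1                                          # B6
        rules = analyze(db, p)          # refresh confidences and violations
    return established
```

The loop stops either when the counter reaches the estimate or when no rule with violating tuples is left, which mirrors the two conditions checked in step B3.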
Suppose that the estimate of the number of established rules produced by the end condition estimation device 100 is "8". In the case of FIG. 15, for example, the rules are considered in descending order of confidence, starting with the rule whose confidence is 0.9 (conf = 0.9), and for each rule the correction of the violating tuples according to that rule is examined.
The candidate locations for correction under rule φ8 in FIG. 15 are the tuples with ID = 9 and ID = 10. The possible corrections are to change the ID value of tuple ID9 or ID10, or to change the value of either PRICE or TAX in tuple ID9 or ID10. Alternatively, rule φ8 itself may be rejected.
In this embodiment, the rule application automatic determination means 202 makes the decisions about corrections automatically. For automatic decisions about corrections, see for example Non-Patent Documents 2 and 3.
Alternatively, the decisions about corrections may be made by having a person select from among the correction candidates, as disclosed in Non-Patent Document 1.
Suppose that, after the rules have been examined in this way starting from those with high confidence, rule φ14 in FIG. 15 (conf = 0.8),
φ14: (COUNTRY = JP, PRODUCT) → TAX
is adopted and the data is corrected. The tuple that violates rule φ14 is tuple 6 in FIG. 14 (TAX = 100), as given by v(φ14): {(tuple6)} for rule φ14 in FIG. 15.
As an example, the value of TAX in tuple 6 of FIG. 14 may be corrected to 50. When the value of the attribute TAX of tuple 6 in FIG. 14 is corrected from 100 to 50, the count value of the established rule number counter 204 becomes "8", and at that point the iterative cleansing process ends.
In FIG. 9, the rule application automatic determination means 202 obtains the estimate for the end condition of the cleansing process from the end condition estimation device 100, evaluates the end condition, and terminates the process automatically when the end condition is satisfied.
<Embodiment 2: Modification 1>
As Modification 1 of the second embodiment, as shown in FIG. 10, the change in the number of established rules in the cleansing target data and the predicted number of established rules may be presented to the operator 303, for example as a graph on a display device (not shown), as a guide to when cleansing should end, leaving the operator 303 to decide the actual end timing of data cleansing. In the modification of FIG. 10, the iterative cleansing process ends when the rule application determination input means 302 receives an end instruction from the operator 303.
A configuration of the second embodiment in which a person selects from among the correction candidates is described with reference to Modification 1 in FIG. 10. The configuration of FIG. 10 includes the rule application determination input means 302 in place of the rule application automatic determination means 202 of FIG. 9, so that the operator 303 decides whether to apply a correction rule. The rule application determination input means 302 ends the iterative cleansing based on an instruction from the operator 303; in addition, the operator 303 may decide whether the data is to be corrected based on a correction rule.
FIG. 13 is a flowchart of the procedure used when a person (the operator 303 in FIG. 10) decides whether each correction is made, as a variation of FIG. 12. In FIG. 13, the automatic selection in step B4 of FIG. 12 of the rule to be applied, the location to be corrected, and the corrected value (performed by the rule application automatic determination means 202 in FIG. 9) is replaced by steps B4-1 and B4-2. That is, the rule application determination input means 302 selects one correction to be applied and presents it to the operator (B4-1), and the operator 303 decides whether to make the correction. If the operator 303 answers that the correction is adopted (Yes branch of B4-2), the adopted correction is applied to the tuple (step B5) and the number of established rules is incremented by one (step B6).
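The operator-driven variant of FIG. 13 might look as follows in Python. This is a minimal sketch, not the embodiment itself: `proposals`, `ask_operator`, and `apply_fix` are hypothetical names, and the check against `target_count` is assumed to be carried over unchanged from step B3 of FIG. 12.

```python
def cleansing_with_operator(proposals, ask_operator, apply_fix,
                            established, target_count):
    """Apply only the corrections that the operator accepts.

    proposals    : iterable of candidate corrections, best first     (B4-1)
    ask_operator : callable(correction) -> True to adopt, else False (B4-2)
    apply_fix    : callable(correction) that performs the update     (B5)
    All names are hypothetical placeholders for illustration.
    """
    for correction in proposals:
        if established >= target_count:
            break                        # end condition reached
        if ask_operator(correction):     # "Yes" branch of B4-2
            apply_fix(correction)        # B5
            established += 1             # B6
    return established
```

Passing the operator's decision in as a callable keeps the loop independent of how the question is actually presented, for example on a console or in a graphical dialog.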
<Embodiment 2: Modification 2>
FIG. 11 shows the configuration of Modification 2 of the second embodiment. The configuration of FIG. 11 is the same as that of FIG. 10 except that a value mapping means 209 is provided. Description of the identical parts is omitted below.
In this modification, the value mapping means 209 is used to resolve differences in how values (attribute values) are expressed in databases DB1 and DB2, and the established rule number counter 204 is incremented by one only when a rule that matches, including its values (attribute values), holds in database DB2. The value mapping performed by the value mapping means 209 associates the value expressions in corresponding columns with each other (for example, associating "男性" and "女性" with the English "male" and "female"); see, for example, Patent Document 3.
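The effect of value mapping on rule matching can be illustrated with a short Python sketch. It assumes, purely for illustration, that a rule's constant pattern is held as a dictionary of attribute values, that DB2 uses the Japanese expressions while DB1 uses the English ones, and that the column is called `SEX`; none of these details are specified in the text.

```python
def rule_matches_after_mapping(pattern_db1, pattern_db2, value_map):
    """Compare two constant patterns after translating the DB2 values.

    pattern_db1, pattern_db2 : dict attribute -> constant value
    value_map                : dict attribute -> {DB2 value: DB1 value}
    Values without a mapping entry are compared unchanged.
    """
    translated = {
        attr: value_map.get(attr, {}).get(val, val)
        for attr, val in pattern_db2.items()
    }
    return translated == pattern_db1

# Hypothetical mapping for one column.
value_map = {"SEX": {"男性": "male", "女性": "female"}}
same = rule_matches_after_mapping(
    {"SEX": "male", "COUNTRY": "JP"},
    {"SEX": "男性", "COUNTRY": "JP"},
    value_map,
)
```

In a cleansing loop, the counter would then be incremented only for rules for which such a comparison succeeds.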
According to the embodiment described above, the end timing is determined by means of the end condition estimation means (device) 100, so it is no longer necessary to examine all of a very large number of rules. This reduces the possibility that a rule that need not be applied is adopted by mistake as a result of examining rules with low confidence. The above embodiment can be applied to uses such as data cleansing for data migration or data integration, and also to any system that cleanses the data of a database corresponding to a reference database.
The disclosures of the patent documents and non-patent documents cited above are incorporated herein by reference. Within the scope of the entire disclosure of the present invention (including the claims), and based on its basic technical concept, the embodiments and examples may be changed and adjusted. Various combinations and selections of the disclosed elements (including the elements of each claim, of each example, and of each drawing) are possible within the scope of the claims of the present invention. That is, the present invention naturally includes the various variations and modifications that a person skilled in the art could make in accordance with the entire disclosure, including the claims, and the technical concept.
The above embodiments may be summarized, without limitation, as the following appendices.
(Appendix 1)
A cleansing end condition estimation device comprising:
data analysis means for receiving a first database as input and acquiring data configuration information of the first database;
rule extraction/selection means for selecting a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that fail to satisfy a predetermined criterion and are therefore determined to be highly data-dependent; and
rule number estimation means for calculating, as an end determination condition for cleansing of a second database to be cleansed, an estimate of the total number of CFD rules that should hold in the second database, using the number of CFD rules in the data-independent CFD rule set selected by the rule extraction/selection means, the data configuration information of the first database acquired by the data analysis means, and data configuration information of the second database.
(Appendix 2)
The cleansing end condition estimation device according to Appendix 1, wherein the rule extraction/selection means divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and selects, from among the CFD rules extracted from the training data, the rules that hold on the test data in at least m of the k tests (where m is a predetermined positive integer not greater than k).
(Appendix 3)
The cleansing end condition estimation device according to Appendix 1, wherein the rule extraction/selection means divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and selects, from among the CFD rules extracted from the training data, the rules that hold on the test data in at least m of the k tests (where m is a predetermined positive integer not greater than k) with a confidence equal to or higher than a predetermined threshold.
(Appendix 4)
The cleansing end condition estimation device according to Appendix 1 or 2, wherein the rule number estimation means calculates the estimate Number_of_CFDs(DB2) of the total number of CFD rules that should hold in the second database using
Number_of_CFDs(DB2) = Number_of_CFDs(DB1) × DBSIZE(DB2) / DBSIZE(DB1)
(where Number_of_CFDs(DB1) is the total number of the selected CFD rules, and DBSIZE(DB1) and DBSIZE(DB2) are data size comparison indices of the first and second databases, respectively).
(Appendix 5)
The cleansing end condition estimation device according to Appendix 4, wherein each of the data size comparison indices DBSIZE(DB1) and DBSIZE(DB2) of the first and second databases includes the number of tuples of the respective database, the number of distinct values in a designated column, or a composite of the number of tuples and the number of distinct values in a designated column.
(Appendix 6)
A data cleansing system comprising:
a first database;
a second database to be cleansed;
the cleansing end condition estimation device according to any one of Appendices 1 to 5;
second data analysis means for obtaining a CFD rule set from the second database;
rule application determination means for determining the data corrections that make the data consistent with the rules in the CFD rule set obtained by the second data analysis means; and
data update means for updating the data based on the corrections determined by the rule application determination means.
(Appendix 7)
The data cleansing system according to Appendix 6, wherein the second data analysis means extracts from the second database a CFD rule set whose confidence is equal to or higher than a predetermined threshold.
(Appendix 8)
The data cleansing system according to Appendix 6, wherein the rule application determination means performs control to end data cleansing when the number of established rules in the CFD rule set extracted by the second data analysis means reaches the estimate, calculated by the end condition estimation device, of the total number of CFD rules that should hold in the second database.
(Appendix 9)
A cleansing end condition calculation method, wherein:
a data analysis process receives a first database as input and acquires data configuration information of the first database;
a rule extraction/selection process selects a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that fail to satisfy a predetermined criterion and are therefore determined to be highly data-dependent; and
a rule number estimation process calculates, as an end determination condition for cleansing of a second database to be cleansed, an estimate of the total number of CFD rules that should hold in the second database, using the number of CFD rules in the data-independent CFD rule set selected by the rule extraction/selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.
(Appendix 10)
The cleansing end condition calculation method according to Appendix 9, wherein the rule extraction/selection process divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and selects, from among the CFD rules extracted from the training data, the rules that hold on the test data in at least m of the k tests (where m is a predetermined positive integer not greater than k).
(Appendix 11)
The cleansing end condition calculation method according to Appendix 9, wherein the rule extraction/selection process divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and selects, from among the CFD rules extracted from the training data, the rules that hold on the test data in at least m of the k tests (where m is a predetermined positive integer not greater than k) with a confidence equal to or higher than a predetermined threshold.
(Appendix 12)
The cleansing end condition calculation method according to Appendix 9 or 10, wherein the rule number estimation process calculates the estimate Number_of_CFDs(DB2) of the total number of CFD rules that should hold in the second database using
Number_of_CFDs(DB2) = Number_of_CFDs(DB1) × DBSIZE(DB2) / DBSIZE(DB1)
(where Number_of_CFDs(DB1) is the total number of the selected CFD rules, and DBSIZE(DB1) and DBSIZE(DB2) are data size comparison indices of the first and second databases, respectively).
(Appendix 13)
The cleansing end condition calculation method according to Appendix 12, wherein each of the data size comparison indices DBSIZE(DB1) and DBSIZE(DB2) of the first and second databases includes the number of tuples of the respective database, the number of distinct values in a designated column, or a composite of the number of tuples and the number of distinct values in a designated column.
(Appendix 14)
A data cleansing method using the cleansing end condition calculation method according to any one of Appendices 9 to 13, the data cleansing method comprising:
a second data analysis process for obtaining a CFD rule set from the second database;
a rule application determination process for determining the data corrections that make the data consistent with the rules in the CFD rule set obtained in the second data analysis process; and
a data update process for updating the data based on the corrections determined in the rule application determination process.
(Appendix 15)
The data cleansing method according to Appendix 14, wherein the second data analysis process extracts from the second database a CFD rule set whose confidence is equal to or higher than a predetermined threshold.
(Appendix 16)
The data cleansing method according to Appendix 14, wherein the rule application determination process performs control to end data cleansing when the number of established rules in the CFD rule set extracted in the second data analysis process reaches the estimate, calculated by the end condition estimation device, of the total number of CFD rules that should hold in the second database.
(Appendix 17)
A program that causes a computer to execute:
a data analysis process for receiving a first database as input and acquiring data configuration information of the first database;
a rule extraction/selection process for selecting a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that fail to satisfy a predetermined criterion and are therefore determined to be highly data-dependent; and
a rule number estimation process for calculating, as an end determination condition for cleansing of a second database to be cleansed, an estimate of the total number of CFD rules that should hold in the second database, using the number of CFD rules in the data-independent CFD rule set selected by the rule extraction/selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.
(Appendix 18)
The program according to Appendix 17, wherein the rule extraction/selection process divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and selects, from among the CFD rules extracted from the training data, the rules that hold on the test data in at least m of the k tests (where m is a predetermined positive integer not greater than k).
(Appendix 19)
The program according to Appendix 17, wherein the rule extraction/selection process divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and selects, from among the CFD rules extracted from the training data, the rules that hold on the test data in at least m of the k tests (where m is a predetermined positive integer not greater than k) with a confidence equal to or higher than a predetermined threshold.
(Appendix 20)
The program according to Appendix 17 or 18, wherein the rule number estimation process calculates the estimate Number_of_CFDs(DB2) of the total number of CFD rules that should hold in the second database using
Number_of_CFDs(DB2) = Number_of_CFDs(DB1) × DBSIZE(DB2) / DBSIZE(DB1)
(where Number_of_CFDs(DB1) is the total number of the selected CFD rules, and DBSIZE(DB1) and DBSIZE(DB2) are data size comparison indices of the first and second databases, respectively).
(Appendix 21)
The program according to Appendix 20, wherein each of the data size comparison indices DBSIZE(DB1) and DBSIZE(DB2) of the first and second databases includes the number of tuples of the respective database, the number of distinct values in a designated column, or a composite of the number of tuples and the number of distinct values in a designated column.
(Appendix 22)
A program that causes the computer to execute:
each process of the program according to any one of Appendices 17 to 21;
a second data analysis process for obtaining a CFD rule set from the second database;
a rule application determination process for determining the data corrections that make the data consistent with the rules in the CFD rule set obtained in the second data analysis process; and
a data update process for updating the data based on the corrections determined in the rule application determination process.
(Appendix 23)
The program according to Appendix 22, wherein the second data analysis process extracts from the second database a CFD rule set whose confidence is equal to or higher than a predetermined threshold.
(Appendix 24)
The program according to Appendix 22, wherein the rule application determination process performs control to end data cleansing when the number of established rules in the CFD rule set extracted in the second data analysis process reaches the estimate, calculated by the end condition estimation device, of the total number of CFD rules that should hold in the second database.
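The estimation described in the appendices (selecting rules that hold in at least m of k tests, then scaling the rule count by the ratio of the data size comparison indices) can be sketched in Python as follows. This is an illustrative outline only. The function names, the `holds` rule checker, and the choice of treating a rule as "extracted from the training data" when it holds there are assumptions; the text does not fix how the composite size index is formed or whether the estimate is rounded, so the sketch covers only the tuple count and the distinct-value count and returns an unrounded value.

```python
def select_stable_rules(rules, splits, holds, m):
    """Keep the rules that pass at least m of the k train/test checks.

    rules  : candidate rules found in DB1
    splits : list of k (train, test) pairs
    holds  : callable(rule, data) -> bool (hypothetical rule checker)
    """
    stable = []
    for rule in rules:
        passed = sum(1 for train, test in splits
                     if holds(rule, train) and holds(rule, test))
        if passed >= m:
            stable.append(rule)
    return stable

def dbsize(rows, column=None):
    """Tuple count, or the number of distinct values in one column."""
    if column is None:
        return len(rows)
    return len({row[column] for row in rows})

def estimate_rule_count(n_rules_db1, dbsize_db1, dbsize_db2):
    """Number_of_CFDs(DB2) = Number_of_CFDs(DB1) * DBSIZE(DB2) / DBSIZE(DB1)."""
    if dbsize_db1 <= 0:
        raise ValueError("DBSIZE(DB1) must be positive")
    return n_rules_db1 * dbsize_db2 / dbsize_db1
```

For instance, with 10 selected rules, DBSIZE(DB1) = 1000, and DBSIZE(DB2) = 800, `estimate_rule_count(10, 1000, 800)` evaluates to 8.0.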
DESCRIPTION OF SYMBOLS
100 Cleansing end condition estimation device (end condition estimation device)
101 Data analysis means
102 Rule extraction/selection means
103 Rule number estimation means
104 Parameters
105, 205 Database (migration destination database: DB1)
106, 206 Database (cleansing target database: DB2)
107 Data size information
108 Data-independent rule group
201 Data analysis means
202 Rule application automatic determination means
203 Data update means
204 Established rule number counter
207, 208 Rule group
209 Value mapping means
302 Rule application determination input means
303 Operator

Claims (10)

  1.  A cleansing end condition estimation device comprising:
      data analysis means for receiving a first database as input and acquiring data configuration information of the first database;
      rule extraction/selection means for selecting a data-independent CFD rule set from a CFD (Conditional Functional Dependency) rule set extracted from the first database; and
      rule number estimation means for calculating, as an end determination condition for cleansing of a second database to be cleansed, an estimate of the total number of CFD rules that should hold in the second database, using the total number of the data-independent CFD rules selected by the rule extraction/selection means, the data configuration information of the first database acquired by the data analysis means, and data configuration information of the second database.
  2.  The cleansing end condition estimation device according to claim 1, wherein the rule extraction/selection means divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and
     selects, as the data-independent CFD rules, those rules among the CFD rules extracted from the training data that hold in the test data in at least m of the k tests (where m is a predetermined positive integer less than or equal to k).
  3.  The cleansing end condition estimation device according to claim 1, wherein the rule extraction/selection means divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and
     selects, as the data-independent CFD rules, those rules among the CFD rules extracted from the training data that hold in the test data in at least m of the k tests (where m is a predetermined positive integer less than or equal to k) with a reliability equal to or greater than a predetermined threshold.
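The m-of-k selection described in claims 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as implemented in the specification: the rule representation `(cond, lhs, rhs)`, the function names, the round-robin way of forming the k splits, and the majority-vote definition of reliability are all hypothetical choices made for the example. Rule extraction itself is left to a caller-supplied `extract_rules` function.

```python
from collections import Counter, defaultdict


def reliability(rule, rows):
    """Assumed reliability measure: among rows matching the rule's condition,
    the fraction that agrees with the majority RHS value for its LHS value."""
    cond, lhs, rhs = rule  # cond is a tuple of (attribute, value) pairs
    groups = defaultdict(Counter)
    for row in rows:
        if all(row.get(attr) == value for attr, value in cond):
            groups[row[lhs]][row[rhs]] += 1
    total = sum(sum(c.values()) for c in groups.values())
    if total == 0:
        return 0.0  # no matching rows: treat the rule as not established
    kept = sum(max(c.values()) for c in groups.values())
    return kept / total


def select_data_independent_rules(rows, extract_rules, k, m, threshold=1.0):
    """Keep rules that are extracted from the training part and hold on the
    test part (reliability >= threshold) in at least m of the k splits."""
    if not 1 <= m <= k:
        raise ValueError("m must satisfy 1 <= m <= k")
    votes = Counter()
    for i in range(k):
        # Assumption: split i uses every k-th row as test data.
        test = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        for rule in set(extract_rules(train)):
            if reliability(rule, test) >= threshold:
                votes[rule] += 1
    return {rule for rule, count in votes.items() if count >= m}
```

With `threshold=1.0` the sketch corresponds to the exact-holding condition of claim 2; a lower threshold corresponds to the reliability condition of claim 3.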
  4.  The cleansing end condition estimation device according to claim 1 or 2, wherein the rule number estimation means calculates the estimated value Number_of_CFDs(DB2) of the total number of CFD rules that should hold in the second database using the formula
     Number_of_CFDs(DB2) = Number_of_CFDs(DB1) × DBSIZE(DB2) / DBSIZE(DB1)
     (where Number_of_CFDs(DB1) is the total number of the selected CFD rules, and DBSIZE(DB1) and DBSIZE(DB2) are data size comparison indexes of the first and second databases, respectively).
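The formula of claim 4 is a simple proportional scaling, which can be sketched as below. The function and parameter names are illustrative only, and the claim does not state how a fractional result is rounded, so the sketch returns the unrounded value.

```python
def estimate_rule_count(num_cfds_db1, dbsize_db1, dbsize_db2):
    """Number_of_CFDs(DB2) = Number_of_CFDs(DB1) * DBSIZE(DB2) / DBSIZE(DB1)."""
    if dbsize_db1 <= 0:
        raise ValueError("DBSIZE(DB1) must be positive")
    return num_cfds_db1 * dbsize_db2 / dbsize_db1
```

For example, if 40 data-independent rules were selected from DB1 with DBSIZE(DB1) = 1000, and DBSIZE(DB2) = 250, the estimate for DB2 is 40 × 250 / 1000 = 10 rules.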
  5.  The cleansing end condition estimation device according to claim 4, wherein each of the data size comparison indexes DBSIZE(DB1) and DBSIZE(DB2) of the first and second databases includes:
     the number of tuples in the respective database; or
     the number of distinct values of a specified column; or
     a composite of the number of tuples and the number of distinct values of the specified column.
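The three size indexes named in claim 5 can be sketched as follows, with a table modeled as a list of dictionaries. The function names are hypothetical. The claim does not say how the composite value is formed, so the default used here (the product of the two counts) is purely an assumption for illustration; the `combine` parameter lets a different combination be supplied.

```python
def tuple_count(rows):
    """Number of tuples (rows) in the table."""
    return len(rows)


def distinct_count(rows, column):
    """Number of distinct values in the specified column."""
    return len({row[column] for row in rows})


def composite_size(rows, column, combine=lambda n, d: n * d):
    """Composite of tuple count and distinct-value count.
    Assumption: multiplication is used only as an illustrative default."""
    return combine(tuple_count(rows), distinct_count(rows, column))
```

Whichever index is chosen, the same index would need to be used for both DB1 and DB2 so that the ratio DBSIZE(DB2) / DBSIZE(DB1) is meaningful.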
  6.  A data cleansing system comprising:
     a first database;
     a second database to be cleansed;
     the cleansing end condition estimation device according to any one of claims 1 to 5;
     second data analysis means for obtaining a CFD rule set from the second database;
     rule application determination means for determining the data corrections needed to make the data conform to the rules in the CFD rule set obtained by the second data analysis means; and
     data update means for updating the data based on the corrections determined by the rule application determination means.
  7.  The data cleansing system according to claim 6, wherein the second data analysis means extracts, from the second database, a CFD rule set whose reliability is equal to or greater than a predetermined threshold.
  8.  The data cleansing system according to claim 6, wherein the rule application determination means performs control to terminate data cleansing when the number of established rules in the CFD rule set extracted by the second data analysis means reaches the estimated total number of CFD rules that should hold in the second database, as calculated by the cleansing end condition estimation device.
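The termination control of claim 8 can be sketched as a loop that counts established rules and stops once the estimated total is reached. The names `cleanse_until_target` and `apply_rule` are hypothetical, and `apply_rule` stands in for the rule application determination and data update steps; it is assumed to return True when a rule has been established on the data.

```python
def cleanse_until_target(candidate_rules, apply_rule, target_count):
    """Apply candidate rules in order, counting those that become established,
    and stop as soon as the count reaches target_count.
    Returns (established_rules, reached_target)."""
    established = []
    for rule in candidate_rules:
        if len(established) >= target_count:
            break  # end condition met: no further rules are examined
        if apply_rule(rule):
            established.append(rule)
    return established, len(established) >= target_count
```

Here `target_count` would be the estimate produced by the cleansing end condition estimation device, and the length of `established` plays the role of the established rule counter.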
  9.  A cleansing end condition calculation method, wherein:
     a data analysis process acquires data configuration information of a first database, taking the first database as input;
     a rule extraction/selection process selects a data-independent CFD rule set from a conditional functional dependency (CFD) rule set extracted from the first database; and
     a rule number estimation process calculates, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules that should hold in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction/selection process, the data configuration information of the first database acquired by the data analysis process, and data configuration information of the second database.
  10.  A program causing a computer to execute:
     a data analysis process of acquiring data configuration information of a first database, taking the first database as input;
     a rule extraction/selection process of selecting a data-independent CFD rule set from a conditional functional dependency (CFD) rule set extracted from the first database; and
     a rule number estimation process of calculating, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules that should hold in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction/selection process, the data configuration information of the first database acquired by the data analysis process, and data configuration information of the second database.
PCT/JP2013/059007 2012-03-27 2013-03-27 Data-cleansing system, method, and program WO2013146884A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012072128 2012-03-27
JP2012-072128 2012-03-27

Publications (1)

Publication Number Publication Date
WO2013146884A1 (en) 2013-10-03

Family

ID=49260133

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/059007 WO2013146884A1 (en) 2012-03-27 2013-03-27 Data-cleansing system, method, and program

Country Status (1)

Country Link
WO (1) WO2013146884A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914616A (en) * 2014-03-18 2014-07-09 清华大学深圳研究生院 Emergency data quality control system and emergency data quality control method
CN104750861A (en) * 2015-04-16 2015-07-01 中国电力科学研究院 Method and system for cleaning mass data of energy storage power station
JP2017534108A (en) * 2014-09-26 2017-11-16 オラクル・インターナショナル・コーポレイション Declarative language and visualization system for recommended data transformation and restoration
US10915233B2 (en) 2014-09-26 2021-02-09 Oracle International Corporation Automated entity correlation and classification across heterogeneous datasets
US11379506B2 (en) 2014-09-26 2022-07-05 Oracle International Corporation Techniques for similarity analysis and data enrichment using knowledge sources

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090006302A1 (en) * 2007-06-29 2009-01-01 Wenfei Fan Methods and Apparatus for Capturing and Detecting Inconsistencies in Relational Data Using Conditional Functional Dependencies
US20090287721A1 (en) * 2008-03-03 2009-11-19 Lukasz Golab Generating conditional functional dependencies
JP2012141847A (en) * 2011-01-04 2012-07-26 Hitachi Solutions Ltd Data migration system, data migration device and data migration method

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914616A (en) * 2014-03-18 2014-07-09 清华大学深圳研究生院 Emergency data quality control system and emergency data quality control method
CN103914616B (en) * 2014-03-18 2017-12-05 清华大学深圳研究生院 A kind of emergency data quality control system and method
JP2017534108A (en) * 2014-09-26 2017-11-16 オラクル・インターナショナル・コーポレイション Declarative language and visualization system for recommended data transformation and restoration
US10891272B2 (en) 2014-09-26 2021-01-12 Oracle International Corporation Declarative language and visualization system for recommended data transformations and repairs
US10915233B2 (en) 2014-09-26 2021-02-09 Oracle International Corporation Automated entity correlation and classification across heterogeneous datasets
US10976907B2 (en) 2014-09-26 2021-04-13 Oracle International Corporation Declarative external data source importation, exportation, and metadata reflection utilizing http and HDFS protocols
JP2021061063A (en) * 2014-09-26 2021-04-15 オラクル・インターナショナル・コーポレイション Declarative language and visualization system for recommended data transformations and repairs
US11379506B2 (en) 2014-09-26 2022-07-05 Oracle International Corporation Techniques for similarity analysis and data enrichment using knowledge sources
JP7148654B2 (en) 2014-09-26 2022-10-05 オラクル・インターナショナル・コーポレイション Declarative language and visualization system for recommended data transformation and restoration
US11693549B2 (en) 2014-09-26 2023-07-04 Oracle International Corporation Declarative external data source importation, exportation, and metadata reflection utilizing HTTP and HDFS protocols
CN104750861A (en) * 2015-04-16 2015-07-01 中国电力科学研究院 Method and system for cleaning mass data of energy storage power station
WO2016165378A1 (en) * 2015-04-16 2016-10-20 国网新源张家口风光储示范电站有限公司 Energy storage power station mass data cleaning method and system

Similar Documents

Publication Publication Date Title
WO2013146884A1 (en) Data-cleansing system, method, and program
US10289532B2 (en) Method and system for providing delta code coverage information
KR102214297B1 (en) Conditional validation rules
US9245233B2 (en) Automatic detection of anomalies in graphs
US20200272559A1 (en) Enhancing efficiency in regression testing of software applications
EP3165984A1 (en) An event analysis apparatus, an event analysis method, and an event analysis program
WO2014188502A1 (en) Management system, management program, and management method
CN110945559A (en) Method and system for optimized visual summary of temporal event data sequences
US10528534B2 (en) Method and system for deduplicating data
TWI726401B (en) Data processing method, data processing device, data processing system, and computer-readable recording medium
Morisse et al. Long-read error correction: a survey and qualitative comparison
Harder How multiple developers affect the evolution of code clones
US10346393B2 (en) Automatic enumeration of data analysis options and rapid analysis of statistical models
TWI709833B (en) Data processing method, data processing device, and computer-readable recording medium
US9348733B1 (en) Method and system for coverage determination
JP6310865B2 (en) Source code evaluation system and method
JP6689734B2 (en) Test script correction device and test script correction program
CN108509347B (en) Equivalent variant identification method and device
WO2013147172A1 (en) Cfd updating device and method, data cleansing apparatus and method, and programs
JP6447111B2 (en) Common information providing program, common information providing method, and common information providing apparatus
US20190012240A1 (en) Managing data with restoring from purging
CN111061613A (en) Front-end abnormity monitoring method and device and computer equipment
Virmani et al. Variegated data swabbing: An improved purge approach for data cleaning
CN104424398A (en) System and method for base sequence alignment
JP2020013385A (en) Information processing apparatus, patch application confirming system, patch application confirming method, and patch application confirming program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 13768917; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 13768917; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)