WO2013146884A1

WO2013146884A1 - Data-cleansing system, method, and program

Info

Publication number: WO2013146884A1
Application number: PCT/JP2013/059007
Authority: WO
Inventors: 綾子星野
Original assignee: 日本電気株式会社
Priority date: 2012-03-27
Filing date: 2013-03-27
Publication date: 2013-10-03

Abstract

Provided is a system capable of indicating the end conditions of data cleansing. The system is provided with: a data analysis means (101) for acquiring data configuration information from a first database (DB1); a rule extraction and selection means (102) for extracting a CFD rule collection from the database (DB1) and excluding rules with high data dependence from among the collection; and a rule number estimation means (103) for calculating the estimated value of the number of rules to be placed in effect in a second database (DB2) as the end conditions of recursive data cleansing of the second database (DB2), using data configuration information from the first database (DB1) and the second database (DB2) to be cleansed.

Description

Data cleansing system, method and program

[Description of related applications]
The present invention is based on a Japanese patent application: Japanese Patent Application No. 2012-072128 (filed on March 27, 2012), and the entire contents of this application are incorporated in the present specification by reference.
The present invention relates to a data cleansing system, method and program.

A data cleansing technique is known that corrects data using conditional functional dependencies (abbreviated as “CFD”). In this technique, cleansing is performed by correcting erroneous data in a database, removing duplicate data, and making the format uniform.

A CFD is a rule stating that a functional dependency (abbreviated as “FD”), which represents a dependency between data attributes, holds for a tuple set specified by a condition. A tuple is a row in a relational table whose columns are attributes. A CFD is composed of a condition part and a premise part, which form the left side (LHS: Left Hand Side) of the rule, and a consequent part, which forms the right side (RHS: Right Hand Side) of the rule and specifies attributes and, optionally, attribute values. The condition part designates a subset (tuple set) of the data; the fact that attribute X1 has attribute value x1 is written “X1 = x1”, where “x1” denotes a specific value. The premise part designates attributes only. An attribute value that is not fixed to a specific value (that is, a wildcard that matches an arbitrary value) is referred to as an “unnamed variable” (anonymous variable).

The consequent part takes one of two forms:
(A) designation of an attribute and an attribute value (for example, rule 1 below); or
(B) designation of an attribute only (for example, rule 2 below).

Case (A) is written, for example, as “A = a”, and case (B) as “A = _”. If the attribute value is specified in the consequent part, the premise part can be omitted. Moreover, the premise part and the consequent part may each designate a plurality of attributes and their respective attribute values. Examples of CFD rules are shown below.

Rule 1: X1 → A (x1 || a)
Rule 2: X1, X2 → A (x1, _ || _)

Rule 1 means “when attribute X1 has attribute value x1, attribute A has attribute value a”. When rule 1 is satisfied, the consequent part takes the specified value throughout the tuple set that matches the condition part. That is, t[A] = a for every tuple t in the tuple set that satisfies the condition X1 = x1 (t[A] denotes the value of attribute A in tuple t). A rule whose consequent part is fixed to a specified value is referred to as a “Constant CFD”.

Rule 2 means “when attribute X1 has attribute value x1, attribute A is determined by attribute X2”. When rule 2 is satisfied, a dependency holds between the attributes specified in the premise part and the consequent part throughout the tuple set that matches the condition part. That is, for any tuple pair t1, t2 in the tuple set that satisfies the condition “X1 = x1”, if t1[X2] = t2[X2], then t1[A] = t2[A]. A rule whose consequent part is not fixed to a specified value but expresses a dependency between attributes is referred to as a “Variable CFD”. That is, a rule is a “Variable CFD” if the right side of the pattern (after “||”) is the unnamed variable '_' (tp[A] = _).

Note that rules 1 and 2 may also be written as follows (see FIG. 15, described later):
Rule 1: X1 = x1 → A = a
Rule 2: (X1 = x1, X2) → A

A violation of a rule such as rule 1, that is, the case where a tuple t1 satisfies the condition “X1 = x1” but t1[A] ≠ a, is called a “single tuple violation”, and the tuple t1 is referred to as a “violating tuple”.

A violation of a rule such as rule 2, that is, the case where tuples t1 and t2 satisfy the condition “X1 = x1” and t1[X2] = t2[X2], but t1[A] = t2[A] does not hold, is called a “multi-tuple violation”, and the tuples t1 and t2 are referred to as “violating tuples”.
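The two kinds of violation defined above can be illustrated with a minimal Python sketch. The function names, the dictionary representation of tuples, and the sample rows are assumptions introduced for illustration only; they do not appear in the cited documents.

```python
from collections import defaultdict

WILDCARD = "_"  # unnamed variable: matches any value


def matches(tup, pattern):
    """True if the tuple agrees with every constant in the LHS pattern."""
    return all(v == WILDCARD or tup[a] == v for a, v in pattern.items())


def single_tuple_violations(rows, lhs, rhs_attr, rhs_value):
    """Constant CFD: rows matching lhs whose rhs_attr differs from rhs_value."""
    return [t for t in rows if matches(t, lhs) and t[rhs_attr] != rhs_value]


def multi_tuple_violations(rows, lhs, rhs_attr):
    """Variable CFD: groups of rows that match lhs and agree on every LHS
    attribute but disagree on rhs_attr."""
    groups = defaultdict(list)
    for t in rows:
        if matches(t, lhs):
            groups[tuple(t[a] for a in lhs)].append(t)
    return [g for g in groups.values() if len({t[rhs_attr] for t in g}) > 1]


# Hypothetical sample data (not taken from the specification).
rows = [
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "p", "A": "b"},
    {"X1": "x1", "X2": "q", "A": "a"},
    {"X1": "x9", "X2": "p", "A": "c"},
]
# Rule 1: X1 -> A (x1 || a)
rule1_violations = single_tuple_violations(rows, {"X1": "x1"}, "A", "a")
# Rule 2: X1, X2 -> A (x1, _ || _)
rule2_violations = multi_tuple_violations(rows, {"X1": "x1", "X2": WILDCARD}, "A")
```

In this sample, the second row violates rule 1 by itself, while the first and second rows together violate rule 2 because they share X2 = p but differ in A.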

Support is the number of tuples that match the condition part and the premise part (the left side of the CFD: LHS) as well as the consequent part (the right side of the CFD: RHS). Under another definition, it may be expressed as the ratio of the number of tuples in which LHS and RHS match to the total number of tuples.

“Confidence” (also referred to as reliability) is the ratio of the number of tuples for which the CFD rule is satisfied to the number of tuples that match the condition part and the premise part. Support and confidence are described below using a specific example.

In Table 1 above, ID is a tuple ID, and A, B, and C are attributes. From the relational data set in Table 1, for example, the CFD
φ1: A, B → C (1, _ || _)
(if the value of A is 1, C is determined by B) can be extracted. In Table 1, the tuples with IDs 1, 2, and 3 and the tuples with IDs 8, 9, and 10 match this rule.

Since the number of tuples that match both LHS(φ1) and RHS(φ1) of CFD φ1 is 6, the support is 6; expressed as a ratio to the total of 10 tuples, support = 6/10 = 0.6. Of the 6 tuples that match the condition part and the premise part of CFD φ1, the CFD rule is satisfied for all 6, so the confidence (reliability) = 6/6 = 1.0 (= 100%).
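As a minimal sketch of these two measures, the following Python function computes ratio-based support and confidence. For simplicity it handles only a Constant CFD (a variable CFD such as φ1 would additionally require grouping by the premise attributes), and the function name and the sample rows are assumptions that do not reproduce Table 1.

```python
def support_and_confidence(rows, lhs, rhs_attr, rhs_value):
    """Ratio-based support and confidence of a constant CFD.

    lhs is a dict of attribute -> required value (the condition part).
    Confidence is None when no row matches the LHS.
    """
    lhs_rows = [t for t in rows if all(t[a] == v for a, v in lhs.items())]
    both = [t for t in lhs_rows if t[rhs_attr] == rhs_value]
    support = len(both) / len(rows) if rows else 0.0
    confidence = len(both) / len(lhs_rows) if lhs_rows else None
    return support, confidence


# Hypothetical data: 10 tuples, 6 of which match A = 1 and all have C = "c".
rows = [{"A": 1, "C": "c"} for _ in range(6)] + [{"A": 2, "C": "d"} for _ in range(4)]
support, confidence = support_and_confidence(rows, {"A": 1}, "C", "c")
```

With these hypothetical rows, 6 of the 10 tuples match both sides and all 6 tuples matching the left side satisfy the rule, which mirrors the ratios 6/10 and 6/6 in the example above.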

For data cleansing systems, see, for example, Non-Patent Documents 1 and 2. In addition, regarding correction and update of database data, Patent Document 1 discloses an apparatus that extracts erroneous data using a conditional expression for checking the correctness of data in the database, corrects the extracted data, and updates the database with the corrected data.

In addition, several patent documents were found in the prior art search conducted by the applicant of the present application. Among them, Patent Document 2 discloses a data mining device that generates, from the correlation coefficients between attributes of a database, correlation rules whose correlation coefficient is equal to or greater than a predetermined value, and that deletes from the set of inter-attribute correlation rules any rule that shows only an apparent correlation and has no true correlation. Patent Document 3 discloses a configuration in which data cleansing / characterizing means, which performs data cleansing that replaces an abnormal data value with a specific value or deletes it, is executed repeatedly while a set value is changed from its initial value. Patent Document 4 discloses a technique that finds a rule violation caused by an abnormality in configuration data, displays the violated rule, displays the table row corresponding to the rule violation, and executes a corrective action for the defective configuration data. Patent Document 5 discloses creating an item database from a relational database, extracting correlation rules from the item database, and calculating the number of correlation rules corresponding to each item when reading the contents of the result rule file.

Typically, a data cleansing system receives cleansing target data and a data correction rule set (ΣCFD) as input, and comprises correction location extraction means (correction location extraction device), correction content determination means (correction content determination device), and correction result reflecting means (correction result reflecting device), none of which are shown. When a CFD rule is given, tuples that violate the rule are regarded as correction candidates, and the data is updated so that it no longer violates the CFD rule. FIG. 16 shows the processing of Algorithm BATCHREPAIR shown in FIG. 4 of Non-Patent Document 1. In FIG. 16, comments and the like are added to FIG. 4 of Non-Patent Document 1 to aid understanding.

1) The correction location extracting means refers to the rule set (CFD set Σ), and extracts violation tuples (specifically, rule violation tuple sets) in the data (cleansing target database D) (Line 4).

Then, as long as there is a violation tuple, the following steps 2) to 4) are repeated (Line 5 to Line 8).

2) A correction rule, a correction location, and a correction destination value are selected by the correction content determination means PICKNEXT () according to the algorithm (Line 6).

3) The selected correction is executed by the correction result reflecting means RESOLVE (t, B, v, φ) (Line 7).

4) Based on the correction execution, the violation tuple set is updated (Update_Dirty_Tuples) (Line 8).

The above steps 1) to 4) are called “iterative data cleansing”.
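The loop structure of steps 1) to 4) can be sketched as follows. This is a simplified illustration, not the BATCHREPAIR algorithm of Non-Patent Document 1: it handles only Constant CFDs, it picks the first remaining violation instead of using a cost-based PICKNEXT, and the function names, the rule representation, and the `max_iterations` guard are assumptions.

```python
def find_violations(rows, rules):
    """Step 1: collect (row index, rule) pairs that violate a constant CFD.

    Each rule is a triple (lhs, attr, value) with lhs a dict of constants.
    """
    dirty = []
    for i, t in enumerate(rows):
        for lhs, attr, value in rules:
            if all(t[a] == v for a, v in lhs.items()) and t[attr] != value:
                dirty.append((i, (lhs, attr, value)))
    return dirty


def iterative_cleanse(rows, rules, max_iterations=1000):
    """Repeat steps 2) to 4) while violating tuples remain."""
    rows = [dict(t) for t in rows]  # work on a copy
    dirty = find_violations(rows, rules)
    iterations = 0
    while dirty and iterations < max_iterations:
        i, (lhs, attr, value) = dirty[0]      # step 2: pick the next correction
        rows[i][attr] = value                 # step 3: reflect the correction
        dirty = find_violations(rows, rules)  # step 4: update the violating set
        iterations += 1
    return rows, iterations


# Hypothetical rule and data.
rules = [({"COUNTRY": "JPN"}, "TAX", 50)]
rows = [
    {"COUNTRY": "JPN", "TAX": 50},
    {"COUNTRY": "JPN", "TAX": 40},
    {"COUNTRY": "USA", "TAX": 7},
    {"COUNTRY": "JPN", "TAX": 0},
]
cleaned, iterations = iterative_cleanse(rows, rules)
```

The `max_iterations` guard is included because mutually conflicting rules could otherwise keep the loop running indefinitely.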

Some systems determine the correction contents automatically according to the algorithm, as in step 2) above. Other systems instead present the correction candidates recommended by the algorithm to an operator and reflect the corrections according to the operator's instructions. Non-Patent Document 2 discloses a system including such an operator user interface.

In a system (apparatus) that performs data cleansing using CFD rules as correction rules, for example, the operator decides, for an exception tuple in the cleansing target data that does not follow a rule, either to accept its current value or to change its value (data), and the corresponding processing is performed. In a system in which correction and updating are repeated as the operator approves correction rules recommended by the system, the regularity of the data increases with each iteration. The CFD rule set may be supplied to the data cleansing apparatus, or may be obtained by extraction from the cleansing target data, as in the systems of Non-Patent Document 2 and Non-Patent Document 3.

Japanese Patent Laid-Open No. 2-301840
JP 2001-265596 A
JP 2004-29971 A
JP 2009-48611 A
Japanese Patent Laid-Open No. 11-250084
US Patent Application Publication No. 2010/0250596
US Pat. No. 7,720,873

The following are the analysis results made by the present inventor regarding the related technology.

For example, data cleansing performed when data in the migration source system is migrated to the migration destination system will be described. Data in the migration source system is called “cleansing target data”, and data already tested or used in the migration destination system is called “migration destination data”.

Suppose that the mapping (association) between the databases of the migration source system and the migration destination system has been completed in units of columns.

The first problem is that when performing the above-mentioned repetitive data cleansing, it is not known at what point cleansing may be terminated. The reason will be described below.

In reality, it is almost impossible for all rules in the rule set Σ to be satisfied; some of the rules in the rule set Σ may legitimately not hold. However, it is difficult for an operator to judge how many rules the data must be corrected and updated to follow before the iterative data cleansing process can be ended.

Also, there may be exception data (for example, exception data within an allowable range) for rules in the data used without problems in the migration destination system. However, if cleansing of additional data is performed without using information regarding this point, there is a possibility that an erroneous correction of data may occur without an operator noticing an allowable exception.

The second problem is that, for example, when data cleansing is performed using CFD rules, regularity is imposed more than necessary, and as a result, original information is lost. The reason will be described below.

In data cleansing using CFD rules, more correction rule candidates than are originally required are often calculated. When data (for example, an attribute value) is corrected according to this correction rule, more regularities than necessary are imposed on the data. As a result, information originally possessed by the data is lost. That is, when cleansing the database, the data is corrected according to the correction rule, so that the tuple (group) that has violated the rule before the reflection of the correction has regularity. When a value of another attribute is corrected based on a tuple (group) to which regularity is given, an originally unnecessary correction rule may be extracted. By applying unnecessary correction rules and correcting the attribute values of the database, the regularity more than necessary is further imposed, and the original information is lost.

Therefore, for iterative data cleansing, it is desirable to develop and implement a method that provides a guideline for determining how far cleansing should proceed before the data correction is considered complete (a finding of the present inventor).

Therefore, the present invention has been created entirely in view of the above problems, and its main purpose is to provide a system, method, and program that can present the data cleansing end condition.

According to the present invention, there is provided an apparatus comprising:
data analysis means for receiving a first database as input and acquiring data configuration information of the first database;
rule extraction / selection means for selecting a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are therefore determined to have high data dependency; and
rule number estimation means for calculating, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection means, the data configuration information of the first database acquired by the data analysis means, and data configuration information of the second database.

According to the present invention, there is provided a method in which:
a data analysis process receives a first database as input and acquires data configuration information of the first database;
a rule extraction / selection process selects a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are therefore determined to have high data dependency; and
a rule number estimation process calculates, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.

According to the present invention, there is provided a program that causes a computer to execute:
a data analysis process for receiving a first database as input and acquiring data configuration information of the first database;
a rule extraction / selection process for selecting a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are therefore determined to have high data dependency; and
a rule number estimation process for calculating, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.

According to the present invention, it is possible to present an end condition for data cleansing. Problems, effects, etc. other than those described above will be apparent to those skilled in the art from disclosure of the following embodiments and the like.

FIG. 1 is a diagram showing the configuration of the first embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the processing procedure of the first embodiment.
FIG. 3 is a flowchart showing an example of the processing procedure of step A2 of FIG. 2.
FIG. 4 is a flowchart showing an example of the processing procedure of step A3 of FIG. 2.
FIG. 5 is a diagram showing an example of the input data (data of the migration destination system) of the first embodiment.
FIG. 6 is a diagram showing an example of data division in the first embodiment.
FIG. 7 is a diagram showing an example of the extracted rules and evaluation results in the first embodiment.
FIG. 8 is a diagram showing an example of the cleansing target data (data of the migration source system) in the first embodiment.
FIG. 9 is a diagram showing an example of the configuration of the second embodiment of the present invention.
FIG. 10 is a diagram showing a modification of the second embodiment of the present invention.
FIG. 11 is a diagram showing another modification of the second embodiment of the present invention.
FIG. 12 is a flowchart showing an example of the processing procedure of the second embodiment.
FIG. 13 is a flowchart showing the processing procedure of the modification of the second embodiment.
FIG. 14 is a diagram showing an example of the input data (data of the migration destination system) of the second embodiment.
FIG. 15 is a diagram showing an example of the output result of the second embodiment.
FIG. 16 is a diagram based on FIG. 4 of Non-Patent Document 1.

Embodiments of the present invention will be described. According to an embodiment of the present invention, a conditional functional dependency (CFD) rule set is derived (extracted) from a cleansing target database, tuple data (attribute values and the like) violating a CFD rule is corrected, and the corrected data is reflected in the database. When this cleansing process is performed repeatedly, the cleansing process for the cleansing target database is terminated once, as a result of the corrections, the total number of CFD rules established among the CFD rule set reaches a predetermined number (estimated value) calculated in advance.

According to an embodiment of the present invention, there are provided an apparatus, a method, a computer program, and a system configured to estimate the total number of CFD rules to be established in the cleansing target database on the basis of another database corresponding to the cleansing target database, and to determine whether cleansing is to be repeated further or terminated based on whether, as a result of the corrections to the cleansing target database, the total number of CFD rules established among the CFD rule set has reached the estimated value. As described above, the data is migrated to the database of the migration destination system after the migration source database (cleansing target data) has been cleansed. In data cleansing using the CFD rule set, a guideline for the cleansing end condition is calculated based on the number of CFD rules that hold in the migration destination database and the difference in data size between the migration destination database and the migration source database (the database storing the cleansing target data).
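The termination decision described above can be sketched as a small control loop. The sketch assumes that the estimated value has already been computed and that counting the established rules and applying one correction are supplied as callables; all names here are hypothetical, and the toy state merely stands in for a real database.

```python
def cleanse_until_estimate(count_established, apply_correction,
                           estimated_rule_count, max_iterations=100):
    """Repeat corrections until the number of established CFD rules reaches
    the estimate; returns the number of corrections applied."""
    for applied in range(max_iterations):
        if count_established() >= estimated_rule_count:
            return applied  # end condition met: stop cleansing
        apply_correction()
    return max_iterations


# Toy stand-in for a database: each correction makes one more rule hold.
state = {"established": 3}


def count_established():
    return state["established"]


def apply_correction():
    state["established"] += 1


corrections = cleanse_until_estimate(
    count_established, apply_correction, estimated_rule_count=5
)
```

Passing the two operations as callables keeps the stopping rule separate from how rules are extracted or how corrections are chosen.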

According to one exemplary embodiment, referring for example to FIG. 1, the data analysis means (101) acquires, from the first database (105: DB1) of the migration destination system, data size information such as the number of attributes and the number of tuples as the configuration information of the first database (DB1). The rule extraction / selection means (102) selects a data-independent CFD rule group (108) by excluding, from the CFD rule set extracted from the first database (DB1), CFD rules determined to have relatively high data dependency. The rule number estimation means (103) calculates an estimated value of the total number of CFD rules to be established in the second database (DB2) to be cleansed, using the data-independent CFD rule group (108) obtained by the rule extraction / selection means (102), the data size information of the first database (DB1) obtained by the data analysis means (101), and the data size information (107) of the second database (106: DB2) storing the cleansing target data. With this configuration, in iterative data cleansing, the user (operator) of the system can learn a guideline for the data cleansing end condition, namely a value (estimated value) indicating how many CFD rules extracted from the second database (DB2) to be cleansed should hold before the cleansing process is ended. In addition, presenting a guideline for the data cleansing end condition makes it possible to prevent erroneous rules from being applied to the cleansing target data. Furthermore, it is possible to indicate an estimated value of the number of rules, out of a huge CFD rule set, that should be applied to the cleansing target data. As a result, it is possible to avoid the situation, seen in the related technology, in which data is corrected according to erroneous rules and information held in the cleansing target data is lost.

<Embodiment 1>
Referring to FIG. 1, in the first embodiment of the present invention, a cleansing end condition estimating device (abbreviated as “end condition estimating device”) 100, which provides a guideline (estimated value) for the data cleansing end condition, comprises a database (DB1) 105, data analysis means 101 (data analysis apparatus), rule extraction / selection means 102 (rule extraction / selection apparatus), a database (DB2) 106, and rule number estimation means 103 (rule number estimation apparatus).

In this embodiment, the database (DB1) 105 is a database of the migration destination system (referred to as “migration destination database”). The database (DB2) 106 is a database that stores cleansing target data. In the following, the database (DB1) 105 and the database (DB2) 106 are also referred to as the database DB1 and the database DB2 simply by removing the reference numbers.

The data analysis unit 101 reads the migration destination database DB1 and acquires information such as the number of attributes and the number of tuples.

The rule extraction / selection unit 102 extracts a rule group (CFD rule set) from the migration destination database DB1, excludes rules having high data dependency from the extracted rule group, and selects the remaining rule group as the data-independent rule group 108. For example, the rule extraction / selection means 102 divides the data in the migration destination database DB1 into training data (Dtrain) and test data (Dtest) in k ways (where k is a predetermined positive integer), and selects, from the CFD rules extracted from the training data (Dtrain), for example, the rules that hold in the test data (Dtest) m or more times out of the k times (0 < m < k).

The rule number estimation means 103 reads the cleansing target data (database DB2) 106 and calculates an estimated value of the number of rules that should be established in the cleansing target data. The rule number estimation means 103 calculates the estimated value of the total number of CFD rules to be established in the database DB2 using, for example:
the total number of rules in the CFD rule set (data-independent rule group 108) selected by the rule extraction / selection means 102;
the number of tuples of the database DB1 acquired by the data analysis means 101 and the number of different values in a designated column of the database DB1; and
at least one of the number of tuples of the database DB2 storing the cleansing target data, the number of different values in the designated column of the database DB2, and the number of CFD rules established in the database DB2.
Note that the column designation may, for example, be input to the cleansing end condition estimation device 100 by the user (operator) or the like via input means (not shown) as the parameter 104 and supplied to the rule number estimation means 103. In FIG. 1, the data analysis unit 101, the rule extraction / selection unit 102, and the rule number estimation unit 103 may be realized by a program that runs on a computer constituting the end condition estimation apparatus 100. The program is stored in, for example, a magnetic or optical recording medium (device) or a semiconductor memory (for example, a read-only memory or a rewritable nonvolatile memory (EEPROM: Electrically Erasable and Programmable Read Only Memory)).

The operation (processing) of this embodiment will be described with reference to the flowchart of FIG. 2.

The data analysis unit 101 acquires information (data size information) related to the migration destination database DB1 (step A1 in FIG. 2).

The rule extraction / selection unit 102 divides the data in the migration destination database DB1, extracts rules, excludes the rules determined to have high data dependency from the extracted rule group Σ, and selects a data-independent rule group (step A2).

The rule number estimation means 103 calculates an approximate number (estimated value) of the total number of rules to be established in the cleansing target data (database DB2) (step A3).

Note that the columns of the migration destination database DB1 and of the database DB2 storing the migration source cleansing target data are assumed to have been associated with each other before the processing of FIG. 2 is executed. However, it is not necessary to associate values between the associated columns. When the database has a plurality of tables, the processing of steps A1 to A3 in FIG. 2 is performed for each table.

Referring to the flowchart in FIG. 3, the process in step A2 in FIG. 2 will be described.

The rule extraction / selection unit 102 refers to the migration destination database DB1 and acquires information such as the table size (number of tuples) of the migration destination database DB1. For example, when a table in the migration destination database DB1 has the contents shown in FIG. 5, the number of tuples (the number of rows in the table) = 10 is acquired as the table size.

The rule extraction / selection means 102 divides the migration destination database DB1 into training data and test data in k ways (step A2-1).

The value of k is determined by the table size and sampling method (boot-strap method, cross-validation method, etc.) of the migration destination database DB1 acquired by the rule extraction / selection means 102.

In the case of the bootstrap method, test data with an appropriate number of tuples n is extracted k times from the migration destination database DB1, and the rest is used as training data each time. In this case, the 1st to k-th test data sets may overlap.

The method of dividing the data into test data and training data is an important factor that determines the evaluation of the extracted rules. For example, when it is desired to obtain a set of rules that can be applied to data from different points in time, it is effective to sort the data by time stamp and then divide it. For this reason, before the division, the data may be sorted based on the prior knowledge of the operator or the like.

In the case of cross-validation, if the amount of data is sufficient, for example, 10-fold cross-validation with k = 10 is used. (In k-fold cross-validation, the data is divided into k blocks (data sets); one block is used as test data and the remaining k-1 blocks are used as training data; each of the k blocks is used as test data in turn, so that the test is performed k times; and the average of the k results is used as the estimated value.)

If the amount of data is not enough, in order to allocate as much data as possible to the training data, the training data is k = number of tuples−2 and the test data is 2 tuples. Since the test of the CFD rule requires at least two tuples (it is not possible to verify whether the CFD rule is established with only one tuple), the test data uses two or more tuples.

Over the k divisions, each tuple serves as training data (k-1) times and as test data only once.

FIGS. 6A and 6B show an example in which the data in FIG. 5 is divided into training data and test data with the number of test tuples n = 2. In FIGS. 6A and 6B, the two tuples ID1 and ID2 are used as test data, and the eight tuples ID3 to ID10 are used as training data. In this way, k sets (combinations) of test data (2 tuples) and training data (8 tuples) are obtained.
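The division into test and training data can be sketched as follows. The sketch assumes consecutive, non-overlapping test blocks of n tuples (the cross-validation style described above), which yields k = 5 combinations for 10 tuples and n = 2; the function name is hypothetical, and tuple IDs stand in for whole rows.

```python
def split_k_ways(rows, n_test):
    """Return (training, test) pairs using consecutive, non-overlapping
    test blocks of n_test rows; the remaining rows form the training data."""
    splits = []
    for start in range(0, len(rows) - n_test + 1, n_test):
        test = rows[start:start + n_test]
        train = rows[:start] + rows[start + n_test:]
        splits.append((train, test))
    return splits


ids = list(range(1, 11))       # tuple IDs 1..10, as in FIG. 5
splits = split_k_ways(ids, 2)  # the first split uses ID1 and ID2 as test data
first_train, first_test = splits[0]
```

A bootstrap-style division would instead sample the test tuples at random each time, in which case the test sets may overlap.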

For each of the 1st to k-th divisions of the data of the migration destination database DB1, that is, for each combination of training data and test data with i = 1 to k, the rule extraction / selection means 102 repeats the following steps 1) to 4) (step A2-2).

1) The rule extraction / selection means 102 extracts a rule group Σi (i is a loop variable, i = 1, 2,... K) from the training data (A2-3).

In the present embodiment, any existing algorithm may be used for this extraction, for example, those described in Patent Document 6, Non-Patent Document 4, and the like. With such an extraction algorithm, a uniquely determined CFD rule group (CFD set) Σ can be obtained by specifying the input data and appropriate values for the support threshold (min_supp) and confidence threshold (min_conf) parameters. The CFD support threshold (min_supp) and confidence threshold (min_conf) parameters may be supplied to the rule extraction / selection unit 102 via the input unit (not shown) as the parameter 104 in FIG. 1.

For example, the rules in the first column of the table in FIG. 7 are an example of a rule set (CFD set) extracted from the training data of FIG. 6A.

φ1: (COUNTRY = USA, PRICE) → TAX
is a rule meaning that, in the tuple set whose attribute COUNTRY has the value USA, once the value of attribute PRICE is determined, the value of TAX is also determined.

φ2: (COUNTRY = JPN, PRODUCT = 001_book) → TAX = 50
is a rule meaning that TAX is always 50 in the tuple set in which the value of the attribute COUNTRY is JPN and the value of the attribute PRODUCT is 001_book.

2) The rule extraction / selection unit 102 repeats the following 3) and 4) for the rule (CFD) in the rule group Σi extracted from the training data (8 tuples) (step A2-4).

3) The rule extraction / selection means 102 evaluates whether or not the CFD rule φ is satisfied in the test data (the 2 tuples, out of the 10, that are not in the training data) (step A2-5). That is, it evaluates whether or not the confidence of the CFD rule φ in the test data (the number of tuples satisfying both the premise part and the consequent part / the number of tuples satisfying the premise part) meets a preset criterion.

4) When the reliability in the test data of the CFD rule φ does not satisfy the preset standard, the rule extraction / selection unit 102 excludes the CFD rule φ from the rule group Σi (step A2-6).
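The reliability check in 3) can be illustrated with a minimal sketch. It assumes a constant CFD such as φ2, represented as a pair of a condition dictionary and a (consequent attribute, value) pair, and tuples represented as dictionaries. This representation, the function name confidence, and the two test tuples are hypothetical and are not taken from the embodiment.

```python
def confidence(rule, tuples):
    """Return (support count, reliability) of a constant CFD on the given tuples.

    Reliability = tuples satisfying premise and consequent / tuples satisfying premise.
    None is returned as the reliability when no tuple satisfies the premise.
    """
    lhs, (rhs_attr, rhs_value) = rule
    matching = [t for t in tuples if all(t.get(a) == v for a, v in lhs.items())]
    if not matching:
        return 0, None
    satisfied = sum(1 for t in matching if t.get(rhs_attr) == rhs_value)
    return len(matching), satisfied / len(matching)


# phi2: (COUNTRY = JPN, PRODUCT = 001_book) -> TAX = 50
phi2 = ({"COUNTRY": "JPN", "PRODUCT": "001_book"}, ("TAX", 50))

# Hypothetical test data (2 tuples), not the data of the figures.
test_data = [
    {"COUNTRY": "JPN", "PRODUCT": "001_book", "TAX": 50},
    {"COUNTRY": "JPN", "PRODUCT": "001_book", "TAX": 10},
]

support, conf = confidence(phi2, test_data)
print(support, conf)
```

With a criterion of reliability > 0.8, a rule evaluated this way on these two tuples would be excluded in 4).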

For example, suppose that the criterion used when evaluating the CFD rule of FIG. is
Support > 0 and Confidence > 0.8.
That is, the values of the parameters are the support threshold (min_supp) = 0 and the reliability threshold (min_conf) = 0.8.

In step A2-5, the sequence of reliabilities obtained from the test data (for example, the reliabilities of the first to fifth tests) for the CFD rule φ1 is
φ1: [1.0, -, 1.0, 1.0, -]
(where "-" indicates that the support value is 0.0), so φ1 passes the above criterion.

On the other hand, for the CFD rule φ2, the sequence is
φ2: [1.0, -, 0.5, 1.0, 1.0]
Since the criterion is not satisfied in the third test, the CFD rule φ2 is excluded from the rule group Σi.

Instead of excluding a rule φ immediately when it is not satisfied in the test data, φ may be excluded from the rule group Σi only when it is not satisfied more than q times during the k tests.
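The exclusion decision over the k tests can be sketched as follows. The reliability sequences for φ1 and φ2 are the ones given above, with "-" encoded as None. The function name keep_rule and the choice to treat a test with support 0 as neither a pass nor a failure are assumptions made for this illustration.

```python
MIN_CONF = 0.8  # reliability threshold (min_conf) from the example above


def keep_rule(confidences, min_conf=MIN_CONF, q=0):
    """Keep a rule unless it fails the criterion more than q times.

    None stands for "-" (support 0 in that test) and is not counted as a failure.
    q = 0 corresponds to excluding the rule on its first failure.
    """
    failures = sum(1 for c in confidences if c is not None and not c > min_conf)
    return failures <= q


phi1 = [1.0, None, 1.0, 1.0, None]
phi2 = [1.0, None, 0.5, 1.0, 1.0]

print(keep_rule(phi1))        # phi1 never fails the criterion
print(keep_rule(phi2))        # phi2 fails once (0.5) and is excluded
print(keep_rule(phi2, q=1))   # phi2 is tolerated when one failure is allowed
```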

Then, the rule extraction / selection means 102 merges the rule groups Σ1 to Σk (takes the union of the k sets Σ1 to Σk) and outputs the resulting rule group Σ as the data independent rule group 108 (step A2-7). At this time, the rule extraction / selection unit 102 may consolidate the rule group Σ by omitting redundant rules and rules implied by other rules.

Instead of outputting the rule group Σ (data independent rule group 108) in step A2-7, the rule extraction / selection means 102 may output only the size (number of rules) of the rule group Σ.

Referring to the flowchart in FIG. 4, step A3 in FIG. 2 will be described.

The rule number estimation means 103 obtains a data size comparison index based on the respective data size information in the databases DB1 and DB2 (step A3-1).

As the data size comparison index, for example,
・ the number of tuples in the database, or
・ the number of different values of a designated column
is used. Note that the number of different values of the designated column is 10 when the attribute value of the column takes 10 different values, and 5 when it takes 5 different values.
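Both indexes are simple counts. The following sketch assumes tuples stored as dictionaries. The function names and the sample rows are hypothetical.

```python
def tuple_count(rows):
    """Data size comparison index: number of tuples in the database."""
    return len(rows)


def distinct_value_count(rows, column):
    """Data size comparison index: number of different values of the designated column."""
    return len({row[column] for row in rows})


# Hypothetical rows, not the data of the figures.
rows = [
    {"ID": 1, "COUNTRY": "USA"},
    {"ID": 2, "COUNTRY": "JPN"},
    {"ID": 3, "COUNTRY": "USA"},
]

print(tuple_count(rows))                      # 3 tuples
print(distinct_value_count(rows, "COUNTRY"))  # 2 different values
```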

The rule number estimation means 103 calculates the estimated value Number_of_CFDs (DB2) of the number of rules to be established in the database DB2 (step A3-2), using at least one of the following:
・ the total number of CFD rules (data independent rule group) extracted and selected from the migration destination database DB1 by the rule extraction / selection means 102: Number_of_CFDs (DB1),
・ the data size comparison index of the migration destination database DB1: DBSIZE (DB1), and
・ the data size comparison index of the database DB2 storing the data to be cleansed: DBSIZE (DB2).
An example of the calculation formula is given by the following formula (1).

Number_of_CFDs (DB2) = Number_of_CFDs (DB1) × DBSIZE (DB2) / DBSIZE (DB1) ・・・ (1)
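Formula (1) scales the number of rules selected from DB1 by the ratio of the two data size comparison indexes. A minimal sketch follows. The function name and the numbers in the usage example are hypothetical.

```python
def estimate_number_of_cfds(n_cfds_db1, dbsize_db1, dbsize_db2):
    """Formula (1): Number_of_CFDs(DB2) = Number_of_CFDs(DB1) * DBSIZE(DB2) / DBSIZE(DB1)."""
    if dbsize_db1 <= 0:
        raise ValueError("DBSIZE(DB1) must be positive")
    return n_cfds_db1 * dbsize_db2 / dbsize_db1


# Hypothetical figures: 10 selected rules, DB1 with 1000 tuples, DB2 with 800 tuples.
print(estimate_number_of_cfds(10, 1000, 800))
```

Because the result is used as a count of rules, an implementation would also have to decide how to round a non-integer estimate. The embodiment does not specify this.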

In the present embodiment, rules with little generality (rules with high data dependency) are excluded by using, for example, a bootstrap method or a cross-validation method. Rules peculiar to the migration destination database DB1 are thereby excluded, and the number of CFD rules to be established in the database DB2 storing the cleansing target data can be estimated.

For example, when the rule φ2 of FIG. 7 is applied to the data of the cleansing target database DB2 (FIG. 8), the value 10 of the tuple with ID = 3 is corrected to 50. According to the present embodiment, such a rule (for example, the rule φ2 in FIG. 7) is excluded from the extracted rule group before the number of rules to be established is estimated as the end condition for the cleansing target data (DB2), which reduces the possibility of incorrect corrections.

When the number of CFD rules extracted from the cleansing target data DB2 exceeds the number of CFD rules extracted from the migration destination database DB1, the cleansing of the cleansing target database DB2 is finished.

<Embodiment 2>
Next, a data cleansing system according to a second embodiment of this invention will be described. Referring to FIG. 9, the data cleansing system of the second embodiment includes: a data analysis means (second data analysis means) 201 that obtains a CFD rule set (rule group 207) having a reliability of p or higher from a database (DB2) 206 that stores cleansing target data; a rule application automatic determination unit 202 that receives the rule group 207 from the data analysis unit 201 and automatically determines the correction contents for matching the data of the database (DB2) 206 with the rules of the rule group 207; a data update unit 203 that updates data based on the determined correction contents; an end condition estimation device 100; and an established rule number counter 204. The end condition estimation device 100 estimates the end condition of the cleansing process of the cleansing target database (DB2) 206 from the migration destination database (DB1) 205, and is composed of the cleansing end condition estimation device 100 of the first embodiment. The data analysis unit 201, the rule application automatic determination unit 202, and the data update unit 203 may be realized by a program that runs on a computer. The program is stored in, for example, a magnetic or optical recording medium (device) or a semiconductor memory (for example, a read-only memory or a rewritable non-volatile memory (EEPROM: Electrically Erasable and Programmable Read Only Memory)).

The data analysis unit 201 obtains, from the cleansing target database DB2, a CFD rule set (rule group) having a reliability equal to or higher than a preset threshold, for example by a method disclosed in Patent Document 6 or Non-Patent Document 4, together with a violation tuple set consisting of the tuples that do not conform to each rule of the CFD rule set.

The rule application automatic determination unit 202 determines, for each pair of a rule in the CFD rule set and its nonconforming tuple set (violation tuple set), whether or not the contents of the tuples should be changed so as to eliminate the nonconformity.

The data update unit 203 executes necessary changes (changes in the contents of the tuples that eliminate the rule nonconformity) to the data in the database DB2 in accordance with the determination result of the rule application automatic determination unit 202.

Next, the operation of the present embodiment will be described with reference to the flowcharts of FIGS. In FIG. 13, the automatic selection process in step B4 of FIG. 12 is replaced by a process that presents a correction to be applied to the worker, determines whether the worker adopts the correction, and, when the correction is adopted, executes it and counts up the number of established rules by one.

The data analysis unit 201 analyzes the cleansing target data DB2, and extracts a CFD rule set having a reliability level equal to or higher than the threshold p and a violation tuple set for the rule (step B1 in FIG. 12).

Next, the data analysis means 201 initializes the count value of the established rule number counter 204 (step B2 in FIG. 12). As an example, the initial value of the established rule number counter 204 is the number of CFD rules established with a reliability of 1.0 (= 100%) in the data of the cleansing target database DB2.

Suppose, for example, the data shown in FIG. 14 is input to the data analysis unit 201 as data of the cleansing target database DB2. The data analysis unit 201 extracts a CFD rule set having a reliability of the threshold p = 0.8 or more and a violation tuple set for these rules from the data of FIG.

FIG. 15 is a diagram illustrating an example of a CFD rule set with a reliability of 0.8 or more extracted from the data of FIG. 14 and a violation tuple set for those rules. In some cases, one of two values held by the same number of tuples is wrong, and it is not known to which attribute value the data should be unified. In such a case, a notation (format) such as v (φx): {tuple1 or tuple2} is used to indicate that either tuple 1 or tuple 2 violates the rule φx. For example, the description for the CFD rule φ8 in FIG. 15,
φ8: (PRODUCT, PRICE) → ID (conf = 0.9), v (φ8): {tuple9 or tuple10}
indicates that tuple 9 or tuple 10 in FIG. 14 violates the CFD rule φ8. That is, for tuples 9 and 10 in FIG. 14, the value of ID is not determined even when the value of PRICE is determined for a given value of PRODUCT (tuples 9 and 10 in FIG. 14 are duplicates).
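One way to obtain a violation tuple set, including the ambiguous "tuple a or tuple b" case, is to group the tuples by the premise attributes and compare the consequent values within each group. The sketch below makes an assumption that is not stated in the embodiment: the most frequent consequent value in a group is treated as correct, and a tie is reported as ambiguous. The function name, the return format, and the sample rows are also hypothetical.

```python
from collections import Counter, defaultdict


def violating_tuples(rows, condition, premise, consequent):
    """Return a list of (tuple IDs, is_ambiguous) for a CFD (condition, premise) -> consequent."""
    groups = defaultdict(list)
    for row in rows:
        # Only tuples selected by the condition part are checked.
        if all(row[attr] == value for attr, value in condition.items()):
            groups[tuple(row[attr] for attr in premise)].append(row)

    violations = []
    for group in groups.values():
        counts = Counter(row[consequent] for row in group)
        if len(counts) < 2:
            continue  # the consequent is uniquely determined: no violation
        top = max(counts.values())
        majority = [value for value, n in counts.items() if n == top]
        if len(majority) == 1:
            ids = [row["ID"] for row in group if row[consequent] != majority[0]]
            violations.append((ids, False))
        else:
            # Tie: it is not known which value is wrong ("tuple a or tuple b").
            violations.append(([row["ID"] for row in group], True))
    return violations


# Hypothetical rows, not the data of FIG. 14.
rows = [
    {"ID": 1, "COUNTRY": "JP", "PRODUCT": "book", "TAX": 50},
    {"ID": 2, "COUNTRY": "JP", "PRODUCT": "book", "TAX": 50},
    {"ID": 3, "COUNTRY": "JP", "PRODUCT": "book", "TAX": 100},
    {"ID": 4, "COUNTRY": "JP", "PRODUCT": "pen", "TAX": 10},
    {"ID": 5, "COUNTRY": "JP", "PRODUCT": "pen", "TAX": 20},
    {"ID": 6, "COUNTRY": "US", "PRODUCT": "book", "TAX": 7},
]

# Rule of the form (COUNTRY = JP, PRODUCT) -> TAX
result = violating_tuples(rows, {"COUNTRY": "JP"}, ["PRODUCT"], "TAX")
print(result)
```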

The CFD rules with reliability 1.0 (conf = 1.0) are regarded as established rules, and their number is set as the initial value of the established rule number counter 204. In the example of FIG. 15, the CFD rules having a reliability of 1.0 are the seven rules φ1 to φ7 above the broken line in FIG. In this case, the count value of the established rule number counter 204 is initialized to “7”.

Next, if there is a violation tuple and the number of established rules is smaller than the estimated value obtained by the end condition estimating apparatus 100, the data analyzing means 201 repeats the following steps B4-B6 (step B3).

The rule application automatic determination unit 202 selects a CFD rule to be applied, a location to be corrected, and a value to be corrected (step B4). That is, when there are a plurality of CFD rules having a reliability of less than 1.0, a rule to be applied first is selected from the plurality of CFD rules, and a violation tuple for the selected rule is matched with the rule. Then, the location to be corrected is automatically selected, and the value to be corrected is determined. As this process, for example, as described in Non-Patent Document 2, a method of ordering rules to be applied using an index based on an edit distance of a character string can be used.

Next, the data update unit 203 executes the correction to the tuple selected by the rule application automatic determination unit 202 in step B4 (step B5 in FIG. 12). Further, the data update unit 203 updates the reliability and violation tuple information of the rules having a reliability of less than 1.0.

The data update unit 203 increases the count value of the established rule number counter 204 by 1 (step B6 in FIG. 12).
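The control flow of steps B2 to B6 can be sketched as below. Rules are modeled as dictionaries with a name and a reliability, and the selection and correction of steps B4 and B5 are reduced to a caller-supplied adopt callback. These simplifications, the function name, and the sample rule list are assumptions for illustration. In particular, ordering by reliability stands in for whatever ordering method is actually used in step B4.

```python
def run_cleansing(rules, estimate, adopt):
    """Iterate over rules with reliability < 1.0 until the estimate is reached.

    rules:    list of {"name": str, "conf": float}
    estimate: estimated number of established rules (end condition)
    adopt:    callback deciding whether a rule's correction is applied (steps B4/B5)
    """
    # Step B2: the counter starts at the number of rules with reliability 1.0.
    established = sum(1 for r in rules if r["conf"] == 1.0)
    pending = sorted((r for r in rules if r["conf"] < 1.0),
                     key=lambda r: r["conf"], reverse=True)
    applied = []
    for rule in pending:
        # Step B3: stop once the number of established rules reaches the estimate.
        if established >= estimate:
            break
        if adopt(rule):
            applied.append(rule["name"])
            established += 1  # step B6
    return established, applied


# Hypothetical rule list: seven rules with reliability 1.0 and two with lower reliability.
rules = [{"name": f"phi{i}", "conf": 1.0} for i in range(1, 8)]
rules += [{"name": "phi8", "conf": 0.9}, {"name": "phi14", "conf": 0.8}]

# Here the correction for phi8 is assumed to be rejected and the one for phi14 adopted.
count, applied = run_cleansing(rules, estimate=8, adopt=lambda r: r["name"] != "phi8")
print(count, applied)
```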

Suppose that the estimated value of the number of established rules by the termination condition estimation device 100 is “8”. For example, in the case of FIG. 15, the application of correction of violation tuple data according to the rules is examined in the order of the rules with the highest reliability, including the rule with the reliability of 0.9 (conf = 0.9).

The candidate for the correction location of the rule φ8 in FIG. 15 is the tuple with ID = 9 or ID = 10. In this case, correcting the ID value of the tuple with ID = 9 or ID = 10, or changing the value of either PRICE or TAX of that tuple, is considered as a correction candidate. Alternatively, not adopting the rule φ8 itself may be considered.

In the present embodiment, the rule application automatic determination unit 202 automatically makes the determination regarding correction. For the automatic determination regarding correction, reference is made to, for example, the descriptions of Non-Patent Documents 2 and 3.

Alternatively, as disclosed in Non-Patent Document 1, regarding the determination regarding correction, it is possible to manually select from among correction candidates.

Thus, the rules are examined in descending order of reliability, and the rule φ14 in FIG. 15 (reliability (conf) = 0.8),
φ14: (COUNTRY = JP, PRODUCT) → TAX
is adopted and data correction is performed. From v (φ14): {(tuple6)} of the rule φ14 in FIG. 15, the tuple that violates the rule φ14 is the tuple 6 (TAX = 100) in FIG. 14.

As an example, a modification that changes the value of TAX of the tuple 6 in FIG. 14 from 100 to 50 is selected. When the value of the attribute TAX of the tuple 6 in FIG. 14 is corrected from 100 to 50, the count value of the established rule number counter 204 becomes “8”, and at this point, the iterative cleansing process ends.

In FIG. 9, the rule application automatic determination unit 202 acquires the estimated value of the cleansing process end condition from the end condition estimation apparatus 100, determines the end condition, and automatically ends when the end condition is satisfied.

<Embodiment 2: Modification 1>
As a first modification of the second embodiment, as shown in FIG. 10, the change in the number of established rules in the cleansing target data and the predicted value of the number of established rules may be displayed as a graph on a display device or the like (not shown), so that the worker 303 determines the actual data cleansing end timing. In the modified example of FIG. 10, when the rule application determination input unit 302 receives an end instruction from the worker 303, the iterative cleansing process is ended.

A configuration of the second embodiment in which the correction is manually selected from among correction candidates will be described with reference to Modification 1 of FIG. 10. The configuration of FIG. 10 includes a rule application determination input unit 302 instead of the rule application automatic determination unit 202 of FIG. 9, and the worker 303 determines whether to apply the correction rule. The rule application determination input unit 302 ends the iterative cleansing based on the instruction of the worker 303; the worker 303 may also determine, based on the correction rule, whether or not to correct the data further.

FIG. 13 is a flowchart for explaining a procedure in the case of performing a determination process of whether or not there is a correction by a human (worker 303 in FIG. 10) in FIG. Referring to FIG. 13, the automatic selection processing of the rule to be applied, the location to be corrected, and the correction destination value in step B4 in FIG. 12 (performed by the rule application automatic determination means 202 in FIG. 9) is replaced by steps B4-1 and B4-2. That is, the rule application determination input means 302 selects one modification to be applied and presents it to the worker 303 (step B4-1), and the worker 303 determines whether or not to adopt the modification (step B4-2). When the modification is adopted (Yes branch of step B4-2), the modification to the adopted tuple is executed (step B5), and the number of established rules is incremented by one (step B6).

<Embodiment 2: Modification 2>
FIG. 11 is a diagram illustrating a configuration of a second modification of the second embodiment. Referring to FIG. 11, the configuration is the same as that of FIG. 10 except that a value mapping unit 209 is provided. Below, the description of the same parts is omitted.

In this modification, the value mapping means 209 is used to eliminate differences in the expression of values (attribute values) between the databases DB1 and DB2, and the established rule number counter 204 is incremented by one only when a matching rule including the value (attribute value) is established in the database DB2. The value mapping performed by the value mapping unit 209 associates the value expressions in corresponding columns (for example, the Japanese expressions for "male" and "female" are associated with the English "male" and "female"). Reference is made to the description of Patent Document 3 and the like.
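A value mapping of this kind can be pictured as a per-column lookup table applied before rules are compared. The sketch below is only an illustration. The column name SEX, the mapping table, and the function name are hypothetical, and the actual mapping method is the one referred to in Patent Document 3.

```python
# Hypothetical per-column mapping from the expressions used in DB2 to those used in DB1.
VALUE_MAP = {
    "SEX": {"男": "male", "女": "female"},
}


def normalize(row, value_map):
    """Rewrite attribute values to the reference expressions; unmapped values are kept as-is."""
    return {attr: value_map.get(attr, {}).get(value, value)
            for attr, value in row.items()}


row_db2 = {"ID": 1, "SEX": "男", "COUNTRY": "JPN"}
print(normalize(row_db2, VALUE_MAP))
```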

According to the above-described embodiments, since the end time is determined by the end condition estimating means (device) 100, it is not necessary to consider all of an enormous number of rules. This reduces the possibility that a rule with low reliability is examined and a rule that does not need to be applied is adopted by mistake. In addition, the above embodiments can be applied to uses such as data cleansing at the time of data migration and data integration, and to any system that performs data cleansing of a database with respect to a database used as a reference.

It should be noted that the disclosures of the above patent documents and non-patent documents are incorporated herein by reference. Within the scope of the entire disclosure (including claims) of the present invention, the embodiments and examples can be changed and adjusted based on the basic technical concept. Various disclosed elements (including each element of each claim, each element of each embodiment, each element of each drawing, etc.) can be combined or selected within the scope of the claims of the present invention. That is, the present invention of course includes various variations and modifications that could be made by those skilled in the art according to the entire disclosure including the claims and the technical idea.

Although not particularly limited, the above embodiments may be described as in the following appendices.

(Appendix 1)
A cleansing end condition estimation device comprising:
data analysis means for acquiring data configuration information of a first database with the first database as an input;
rule extraction / selection means for selecting a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and
rule number estimation means for calculating, as a cleansing end determination condition of a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection means, the data configuration information of the first database acquired by the data analysis means, and data configuration information of the second database.

(Appendix 2)
The cleansing end condition estimation device according to appendix 1, wherein the rule extraction / selection means divides the first database into training data and test data in k ways (where k is a predetermined positive integer), and
selects, from the CFD rules extracted from the training data, rules that are established in the test data at least m times out of the k tests (where m is a predetermined positive integer less than or equal to k).

(Appendix 3)
The cleansing end condition estimation device according to appendix 1, wherein the rule extraction / selection means divides the first database into training data and test data in k ways (where k is a predetermined positive integer), and
selects, from the CFD rules extracted from the training data, rules that are established in the test data with a reliability equal to or higher than a predetermined threshold m times out of the k tests (where m is a predetermined positive integer less than or equal to k).

(Appendix 4)
The cleansing end condition estimation device according to appendix 1 or 2, wherein the rule number estimating means calculates the estimated value Number_of_CFDs (DB2) of the total number of CFD rules to be established in the second database using
Number_of_CFDs (DB2) = Number_of_CFDs (DB1) × DBSIZE (DB2) / DBSIZE (DB1)
(where Number_of_CFDs (DB1) is the total number of the selected CFD rules, and DBSIZE (DB1) and DBSIZE (DB2) are data size comparison indexes of the first and second databases, respectively).

(Appendix 5)
The cleansing end condition estimation device according to appendix 4, wherein each of the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases includes the number of tuples of the respective database, the number of different values of a designated column, or a composite of the number of tuples and the number of different values of the designated column.

(Appendix 6)
A first database;
A second database to be cleansed;
The cleansing end condition estimation device according to any one of appendices 1 to 5,
Second data analysis means for obtaining a CFD rule set from the second database;
Rule application determination means for determining data correction contents for matching data with the rules in the CFD rule set acquired by the second data analysis means;
A data cleansing system comprising: a data update unit that updates data based on the correction content determined by the rule application determination unit.

(Appendix 7)
The data cleansing system according to appendix 6, wherein the second data analysis means extracts a CFD rule set having a reliability equal to or higher than a predetermined threshold value from the second database.

(Appendix 8)
The data cleansing system according to appendix 6, wherein the rule application determining unit performs control to end the data cleansing when the number of established rules in the CFD rule set extracted by the second data analyzing unit reaches the estimated value, calculated by the end condition estimating device, of the total number of CFD rules to be established in the second database.

(Appendix 9)
A cleansing end condition calculation method, wherein:
a data analysis process receives a first database as input and obtains data configuration information of the first database;
a rule extraction / selection process selects a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and
a rule number estimation process calculates, as a cleansing end determination condition of a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected in the rule extraction / selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.

(Appendix 10)
The cleansing end condition calculation method according to appendix 9, wherein, in the rule extraction / selection process, the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and
rules that are established in the test data at least m times out of the k tests (where m is a predetermined positive integer less than or equal to k) are selected from the CFD rules extracted from the training data.

(Appendix 11)
The cleansing end condition calculation method according to appendix 9, wherein, in the rule extraction / selection process, the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and
rules that are established in the test data with a reliability equal to or higher than a predetermined threshold m times out of the k tests (where m is a predetermined positive integer less than or equal to k) are selected from the CFD rules extracted from the training data.

(Appendix 12)
The cleansing end condition calculation method according to appendix 9 or 10, wherein, in the rule number estimation process, the estimated value Number_of_CFDs (DB2) of the total number of CFD rules to be established in the second database is calculated using
Number_of_CFDs (DB2) = Number_of_CFDs (DB1) × DBSIZE (DB2) / DBSIZE (DB1)
(where Number_of_CFDs (DB1) is the total number of the selected CFD rules, and DBSIZE (DB1) and DBSIZE (DB2) are data size comparison indexes of the first and second databases, respectively).

(Appendix 13)
The cleansing end condition calculation method according to appendix 12, wherein each of the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases includes the number of tuples of the respective database, the number of different values of a designated column, or a composite of the number of tuples and the number of different values of the designated column.

(Appendix 14)
A data cleansing method using the cleansing end condition calculation method according to any one of appendices 9 to 13,
A second data analysis process for obtaining a CFD rule set from the second database;
A rule application determination process for determining data correction content for matching data to the rules in the CFD rule set acquired in the second data analysis process;
A data update process for updating data based on the correction content determined in the rule application determination process;
A data cleansing method comprising:

(Appendix 15)
The data cleansing method according to appendix 14, wherein the second data analysis process extracts a CFD rule set having a reliability equal to or higher than a predetermined threshold value from the second database.

(Appendix 16)
The data cleansing method according to appendix 14, wherein, in the rule application determination process, control is performed to end the data cleansing when the number of established rules in the CFD rule set extracted in the second data analysis process reaches the estimated value, calculated by the end condition estimation device, of the total number of CFD rules to be established in the second database.

(Appendix 17)
A program that causes a computer to execute:
a data analysis process for obtaining data configuration information of a first database using the first database as an input;
a rule extraction / selection process for selecting a data-independent CFD rule set by excluding, from a conditional functional dependency (CFD) rule set extracted from the first database, CFD rules that do not satisfy a predetermined criterion and are determined to have high data dependency; and
a rule number estimation process for calculating, as a cleansing end determination condition of a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of rules in the data-independent CFD rule set selected by the rule extraction / selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.

(Appendix 18)
The program according to appendix 17, wherein, in the rule extraction / selection process, the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and
rules that are established in the test data at least m times out of the k tests (where m is a predetermined positive integer less than or equal to k) are selected from the CFD rules extracted from the training data.

(Appendix 19)
The program according to appendix 17, wherein, in the rule extraction / selection process, the first database is divided into training data and test data in k ways (where k is a predetermined positive integer), and
rules that are established in the test data with a reliability equal to or higher than a predetermined threshold m times out of the k tests (where m is a predetermined positive integer less than or equal to k) are selected from the CFD rules extracted from the training data.

(Appendix 20)
The program according to appendix 17 or 18, wherein, in the rule number estimation process, the estimated value Number_of_CFDs (DB2) of the total number of CFD rules to be established in the second database is calculated using
Number_of_CFDs (DB2) = Number_of_CFDs (DB1) × DBSIZE (DB2) / DBSIZE (DB1)
(where Number_of_CFDs (DB1) is the total number of the selected CFD rules, and DBSIZE (DB1) and DBSIZE (DB2) are data size comparison indexes of the first and second databases, respectively).

(Appendix 21)
The program according to appendix 20, wherein each of the data size comparison indexes DBSIZE (DB1) and DBSIZE (DB2) of the first and second databases includes the number of tuples of the respective database, the number of different values of a designated column, or a composite of the number of tuples and the number of different values of the designated column.

(Appendix 22)
Each process of the program according to any one of appendices 17 to 21,
A second data analysis process for obtaining a CFD rule set from a second database;
A rule application determination process for determining data correction content for matching data to the rules in the CFD rule set acquired in the second data analysis process;
A data update process for updating data based on the correction content determined in the rule application determination process;
A program for causing the computer to execute.

(Appendix 23)
The program according to appendix 22, wherein the second data analysis process extracts a CFD rule set having a reliability equal to or higher than a predetermined threshold value from the second database.

(Appendix 24)
The program according to appendix 22, wherein, in the rule application determination process, control is performed to end the data cleansing when the number of established rules in the CFD rule set extracted in the second data analysis process reaches the estimated value, calculated by the end condition estimation device, of the total number of CFD rules to be established in the second database.

DESCRIPTION OF SYMBOLS
100 Cleansing end condition estimation device (end condition estimation device)
101 Data analysis means
102 Rule extraction / selection means
103 Rule number estimation means
104 Parameter
105, 205 Database (migration destination database: DB1)
106, 206 Database (cleansing target database: DB2)
107 Data size information
108 Data independent rule group
201 Data analysis means
202 Rule application automatic determination means
203 Data update means
204 Established rule number counter
207, 208 Rule group
209 Value mapping means
302 Rule application determination input means
303 Worker

Claims

A cleansing end condition estimating apparatus comprising:
data analysis means for acquiring data configuration information of a first database with the first database as an input;
rule extraction / selection means for selecting a data-independent CFD rule set from a CFD (Conditional Functional Dependency) rule set extracted from the first database; and
rule number estimating means for calculating, as a cleansing end determination condition of a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the total number of data-independent CFD rules selected by the rule extraction / selection means, the data configuration information of the first database acquired by the data analysis means, and data configuration information of the second database.
The rule extracting / selecting means divides the first database into training data and test data by k methods (where k is a predetermined positive integer),
Among the CFD rules extracted from the training data, a rule that has been established at least m times in the test time of k (where m is a predetermined positive integer less than or equal to k) in the test data is defined as the data-independent CFD rule. The cleansing end condition estimating device according to claim 1, wherein the cleansing end condition estimating device is selected.
The cleansing end condition estimation device according to claim 1, wherein the rule extraction / selection means divides the first database into training data and test data in k different ways (where k is a predetermined positive integer), and
selects, as the data-independent CFD rules, those CFD rules extracted from the training data that hold on the test data, with a reliability equal to or higher than a predetermined threshold value, in m of the k tests (where m is a predetermined positive integer less than or equal to k).
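The selection described in the two preceding claims can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the toy rule form `(A, a, B, b)` meaning "if A = a then B = b", the naive extractor `extract_rules`, the round-robin way of forming the k splits, and the function names themselves. The claims do not specify how rules are extracted or how the k divisions are made.

```python
from itertools import permutations

def extract_rules(rows):
    """Toy extractor (assumption): emit (A, a, B, b) when every row with A=a has B=b."""
    rules = set()
    if not rows:
        return rules
    for a_attr, b_attr in permutations(rows[0].keys(), 2):
        seen = {}
        for row in rows:
            seen.setdefault(row[a_attr], set()).add(row[b_attr])
        for a_val, b_vals in seen.items():
            if len(b_vals) == 1:
                rules.add((a_attr, a_val, b_attr, next(iter(b_vals))))
    return rules

def reliability(rule, rows):
    """Fraction of rows matching the condition A=a that also satisfy B=b.

    Returns None when no row matches the condition, i.e. the rule
    cannot be tested on these rows.
    """
    a_attr, a_val, b_attr, b_val = rule
    matched = [r for r in rows if r[a_attr] == a_val]
    if not matched:
        return None
    return sum(r[b_attr] == b_val for r in matched) / len(matched)

def select_data_independent_rules(rows, k, m, threshold=1.0):
    """Keep rules that hold on the test data in at least m of k splits."""
    hits = {}
    for i in range(k):
        # Assumed split scheme: every k-th row (offset i) is test data.
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        for rule in extract_rules(train):
            score = reliability(rule, test)
            if score is not None and score >= threshold:
                hits[rule] = hits.get(rule, 0) + 1
    return {rule for rule, n in hits.items() if n >= m}

# Hypothetical data: "id" is unique per row, so rules conditioned on it
# are data-dependent and can never be confirmed on unseen test rows.
rows = [
    {"id": j,
     "country": "JP" if j % 2 == 0 else "US",
     "code": "81" if j % 2 == 0 else "1"}
    for j in range(20)
]
selected = select_data_independent_rules(rows, k=5, m=5)
print(len(selected))
```

In this toy data, rules whose condition is a specific `id` value are extracted from the training rows but never match a test row, so they are dropped, which mirrors the intent of excluding rules with high data dependence.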
The cleansing end condition estimation device according to claim 1 or 2, wherein the rule number estimating means calculates the estimated value Number_of_CFDs(DB2) of the total number of CFD rules to be established in the second database as
Number_of_CFDs(DB2) = Number_of_CFDs(DB1) × DBSIZE(DB2) / DBSIZE(DB1)
(where Number_of_CFDs(DB1) is the total number of the selected CFD rules, and DBSIZE(DB1) and DBSIZE(DB2) are data size comparison indexes of the first and second databases, respectively).
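As a small worked illustration of the proportional estimate above, the following Python sketch evaluates the formula directly. The function name `estimate_rule_count` and the sample figures are hypothetical and are not taken from the specification.

```python
def estimate_rule_count(num_cfds_db1, dbsize_db1, dbsize_db2):
    """Number_of_CFDs(DB2) = Number_of_CFDs(DB1) * DBSIZE(DB2) / DBSIZE(DB1)."""
    if dbsize_db1 <= 0:
        raise ValueError("DBSIZE(DB1) must be positive")
    return num_cfds_db1 * dbsize_db2 / dbsize_db1

# Hypothetical figures: 40 data-independent rules selected from a DB1 of
# 10,000 tuples, and a DB2 of 2,500 tuples to be cleansed.
estimate = estimate_rule_count(40, 10_000, 2_500)
print(estimate)
```

The result is a real number; how it is rounded before being used as an end condition is not stated in the claim and is left open here.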
The cleansing end condition estimation device according to claim 4, wherein each of the data size comparison indexes DBSIZE(DB1) and DBSIZE(DB2) of the first and second databases is
the number of tuples in the respective database,
the number of distinct values in a designated column, or
a composite of the number of tuples and the number of distinct values in the designated column.
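The three kinds of data size comparison index named above might be computed as in the following Python sketch. The function names are hypothetical, and the weighted sum used for the composite index is purely an assumption: the claim does not state how the tuple count and the distinct-value count are combined.

```python
def dbsize_tuples(rows):
    """Index 1: the number of tuples in the database."""
    return len(rows)

def dbsize_distinct(rows, column):
    """Index 2: the number of distinct values in the designated column."""
    return len({row[column] for row in rows})

def dbsize_composite(rows, column, weight=0.5):
    """Index 3 (assumed form): weighted sum of the two counts above."""
    return weight * dbsize_tuples(rows) + (1 - weight) * dbsize_distinct(rows, column)

# Hypothetical table with four tuples and two distinct "country" values.
rows = [
    {"country": "JP", "city": "Tokyo"},
    {"country": "JP", "city": "Osaka"},
    {"country": "US", "city": "Boston"},
    {"country": "US", "city": "Austin"},
]
print(dbsize_tuples(rows), dbsize_distinct(rows, "country"),
      dbsize_composite(rows, "country"))
```

Whichever index is chosen, the same index has to be applied to both databases so that the ratio DBSIZE(DB2) / DBSIZE(DB1) is meaningful.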
A data cleansing system comprising:
a first database;
a second database to be cleansed;
the cleansing end condition estimation device according to any one of claims 1 to 5;
second data analysis means for obtaining a CFD rule set from the second database;
rule application determination means for determining the data corrections needed to make the data conform to the rules in the CFD rule set acquired by the second data analysis means; and
data update means for updating the data based on the corrections determined by the rule application determination means.
The data cleansing system according to claim 6, wherein the second data analyzing means extracts a CFD rule set having a reliability equal to or higher than a predetermined threshold value from the second database.
The data cleansing system according to claim 6, wherein the rule application determining means performs control to terminate data cleansing when the number of rules that hold in the CFD rule set extracted by the second data analyzing means reaches the estimated value, calculated by the cleansing end condition estimation device, of the total number of CFD rules to be established in the second database.
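The termination control described in this claim can be pictured with the following minimal Python sketch. It is a simplified illustration under stated assumptions, not the claimed system: rules use a toy form `(A, a, B, b)` meaning "if A = a then B = b", the candidate rule set is fixed instead of being re-extracted from the second database on each pass, and `holds`, `repair`, and `cleanse` are hypothetical names.

```python
def holds(rule, rows):
    """True when every row matching the condition A=a also has B=b."""
    a_attr, a_val, b_attr, b_val = rule
    return all(r[b_attr] == b_val for r in rows if r[a_attr] == a_val)

def repair(rule, rows):
    """Assumed correction: overwrite B with b in every row where A=a."""
    a_attr, a_val, b_attr, b_val = rule
    return [{**r, b_attr: b_val} if r[a_attr] == a_val else r for r in rows]

def cleanse(rows, candidate_rules, target, max_rounds=100):
    """Apply corrections until the number of established rules reaches target."""
    for _ in range(max_rounds):
        violated = [r for r in candidate_rules if not holds(r, rows)]
        established = len(candidate_rules) - len(violated)
        # End condition: enough rules hold, or nothing is left to fix.
        if established >= target or not violated:
            break
        rows = repair(violated[0], rows)
    return rows, sum(holds(r, rows) for r in candidate_rules)

# Hypothetical second database with one error per country.
db2 = [
    {"country": "JP", "code": "81"},
    {"country": "JP", "code": "18"},
    {"country": "US", "code": "1"},
    {"country": "US", "code": "11"},
]
rules = [("country", "JP", "code", "81"), ("country", "US", "code", "1")]

# A target of 1 stands in for the estimated value Number_of_CFDs(DB2).
cleaned, established = cleanse(db2, rules, target=1)
print(established)
```

With a target of 1 the loop stops after the first rule has been made to hold, leaving the second violation untouched, which shows how the estimate acts as a stopping point instead of cleansing continuing until every candidate rule is satisfied.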
A cleansing end condition calculation method comprising:
a data analysis process of acquiring data configuration information of a first database, with the first database as an input;
a rule extraction / selection process of selecting a data-independent CFD rule set from a CFD (Conditional Functional Dependency) rule set extracted from the first database; and
a rule number estimation process of calculating, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of data-independent CFD rules selected in the rule extraction / selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.
A program that causes a computer to execute:
a data analysis process of acquiring data configuration information of a first database, with the first database as an input;
a rule extraction / selection process of selecting a data-independent CFD rule set from a CFD (Conditional Functional Dependency) rule set extracted from the first database; and
a rule number estimation process of calculating, as a cleansing end determination condition for a second database to be cleansed, an estimated value of the total number of CFD rules to be established in the second database, using the number of data-independent CFD rules selected in the rule extraction / selection process, the data configuration information of the first database acquired in the data analysis process, and data configuration information of the second database.