WO2013172308A1

WO2013172308A1 - Rule discovery system, method, device, and program

Info

Publication number: WO2013172308A1
Application number: PCT/JP2013/063316
Authority: WO
Inventors: 裕貴中山
Original assignee: 日本電気株式会社
Priority date: 2012-05-14
Filing date: 2013-05-13
Publication date: 2013-11-21
Also published as: JPWO2013172308A1

Abstract

Provided are a device, a method, and a program with which a set of rules useful for ascertaining or correcting database content can be obtained with high efficiency. The present invention (see fig. 1) is provided with: a rule-candidate generation means (21) for generating new rule candidates on the basis of database content, set parameters, and previously generated rule candidates; and a rule-appropriateness determination means (22) for checking whether the rule candidates are appropriate for the database content.

Description

Rule discovery system, method, apparatus and program

[Description of related applications]
The present invention is based on a Japanese patent application: Japanese Patent Application No. 2012-110922 (filed on May 14, 2012), and the entire description of the application is incorporated herein by reference.
The present invention relates to a rule discovery technique, and more particularly to a system, method, apparatus, and program for database rule discovery.

Database rule discovery means, for example, expressing rules as CFDs (Conditional Functional Dependencies), generating CFD rule candidates, and outputting those candidates that match the contents of the database. The following outlines CFD, which is a prerequisite for understanding the invention.

A CFD is a rule indicating that a functional dependency (abbreviated as “FD”) between data attributes holds for a tuple set specified by a condition. It consists of the condition part and the premise part, which form the left side of the rule (LHS: Left-Hand Side) and specify attributes and attribute values, and the consequent part, which forms the right side of the rule (RHS: Right-Hand Side). The condition part and the consequent part are also called the conditional clause and the subordinate clause, respectively.

The condition part designates a subset (tuple set) of the data and expresses that attribute X has attribute value x as “X = x”. Here, “x” denotes a specific value. Such an expression of an attribute value is referred to as a “constant” (Constant).

The premise part designates only the attribute. The attribute value is not fixed to a specific value; it is a wildcard that matches an arbitrary value, written “X = _”. Such an expression of an attribute value is referred to as a “variable” (Variable). Here, “_” is also referred to as the “unnamed variable”.

There are two types of consequent part:
(A) designation of an attribute and an attribute value (for example, rule 1 below), written, for example, “A = a”;
(B) designation of an attribute only (for example, rule 2 below), written, for example, “A = _”.

If the attribute value is specified in the consequent part, the premise part can be omitted. Moreover, the premise part and the consequent part may each consist of designations of a plurality of attributes and their respective attribute values. Example rules are shown below.

Rule 1: X1 → A (x1 || a)
Rule 2: X1, X2 → A (x1, _ || _)

Rule 1 means “when attribute X1 has attribute value x1, attribute A has attribute value a”. When Rule 1 holds, the consequent part takes the specified value in the tuple set that matches the condition part. That is, t[A] = a for every tuple t of the tuple set that satisfies the condition X1 = x1 (t[A] denotes the value of attribute A in tuple t). A rule whose consequent part is fixed to a designated value in this way is referred to as a “constant CFD” (Constant CFD).

Rule 2 means “when attribute X1 has attribute value x1, attribute A is determined by attribute X2”. When Rule 2 holds, a functional dependency exists between the attributes specified in the premise part and the consequent part within the tuple set that matches the condition part. That is, for any tuple pair t1, t2 in the tuple set that satisfies the condition “X1 = x1”, if t1[X2] = t2[X2], then t1[A] = t2[A]. A rule in which the consequent part is not fixed to a specified value but a functional dependency holds between attributes is called a “variable CFD” (Variable CFD). That is, when the right side of “||” in the pattern is the unnamed variable “_” (tp[A] = _), the rule is a variable CFD.
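The two definitions above can be illustrated with a minimal Python sketch. The function names, the dictionary representation of tuples, and the sample rows are assumptions made for illustration only; they do not appear in the original description.

```python
WILDCARD = "_"  # the unnamed variable of the text


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern; "_" matches anything."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def holds_exactly(rows, lhs, rhs_attr, rhs_value):
    """Check whether a CFD holds on every tuple (confidence 1).

    lhs maps each left-side attribute to a constant or to "_".
    """
    selected = [r for r in rows if matches(r, lhs)]
    if rhs_value != WILDCARD:
        # Constant CFD: every selected tuple must carry the stated value.
        return all(r[rhs_attr] == rhs_value for r in selected)
    # Variable CFD: tuples that agree on all left-side attributes
    # must also agree on the consequent attribute.
    seen = {}
    for r in selected:
        key = tuple(r[a] for a in sorted(lhs))
        if seen.setdefault(key, r[rhs_attr]) != r[rhs_attr]:
            return False
    return True


# Hypothetical data, not taken from the publication.
rows = [
    {"X1": "x1", "X2": "u", "A": "a"},
    {"X1": "x1", "X2": "u", "A": "a"},
    {"X1": "x1", "X2": "v", "A": "a"},
    {"X1": "y1", "X2": "u", "A": "b"},
]
rule1_holds = holds_exactly(rows, {"X1": "x1"}, "A", "a")             # Rule 1
rule2_holds = holds_exactly(rows, {"X1": "x1", "X2": "_"}, "A", "_")  # Rule 2

# Adding one conflicting tuple breaks both rules.
broken = rows + [{"X1": "x1", "X2": "v", "A": "c"}]
rule1_broken = holds_exactly(broken, {"X1": "x1"}, "A", "a")
rule2_broken = holds_exactly(broken, {"X1": "x1", "X2": "_"}, "A", "_")
```

Grouping by all left-side attributes is sufficient for the variable case because the constant attributes are already equal among the selected tuples.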

The symbol “||” in the pattern tuple (x1 || a) of rule 1 separates the attribute value of X1 on the left side from that of A on the right side. “X1 → A (x1 || a)” of rule 1 is sometimes written as “(X1 → A, (x1 || a))”; the two forms differ only in the outer parentheses and the comma and obviously represent the same rule. Similarly, “X1, X2 → A (x1, _ || _)” of rule 2 is also written as “([X1, X2] → A, (x1, _ || _))”.

As an index of how well a CFD fits given data, the support or the confidence is used, for example. The support (Support) is the number of tuples that match the condition part and the premise part of the CFD.

The confidence (Confidence) is the ratio of the number of tuples that satisfy the CFD to the number of tuples that match the condition part and the premise part.
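As a rough illustration of the two measures, the following Python sketch computes support and confidence for a CFD with a constant consequent. The function names and sample rows are hypothetical, and restricting the confidence to constant consequents is a simplification made here, not something the description prescribes.

```python
WILDCARD = "_"


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def support(rows, lhs):
    """Number of tuples matching the condition part and premise part."""
    return sum(1 for r in rows if matches(r, lhs))


def confidence(rows, lhs, rhs_attr, rhs_value):
    """Share of LHS-matching tuples that carry rhs_value (constant consequent only)."""
    n = support(rows, lhs)
    if n == 0:
        return 0.0
    hits = sum(1 for r in rows if matches(r, lhs) and r[rhs_attr] == rhs_value)
    return hits / n


# Hypothetical data, not taken from the publication.
rows = [
    {"X1": "x1", "A": "a"},
    {"X1": "x1", "A": "a"},
    {"X1": "x1", "A": "b"},
    {"X1": "y1", "A": "a"},
]
sup = support(rows, {"X1": "x1"})                  # 3 tuples match the left side
conf = confidence(rows, {"X1": "x1"}, "A", "a")    # 2 of those 3 have A = a
conf_empty = confidence(rows, {}, "A", "a")        # empty left side matches all 4
```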

When multiple CFDs are given, a CFD that satisfies both of the conditions “left-reduced” and “most-general” is called “minimal”. First, “left-reduced”: a CFD is said to be “left-reduced” if the attribute set of its left side (LHS) does not contain the left-side attribute set of any other given CFD.

For example, when the following rules 3 and 4 are given, the left side of rule 4 contains the left side of rule 3 ({X1, Y} ⊂ {X1, X2, Y}), so rule 4 is not “left-reduced”. Conversely, the left side of rule 3 does not contain the left side of rule 4, so rule 3 is “left-reduced”. In this case, rule 4 can be deleted as a CFD that is redundant with respect to rule 3.

Rule 3: X1, Y → A (x1, _ || _)
Rule 4: X1, X2, Y → A (x1, x2, _ || _)
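The left-reduced test for rules 3 and 4 can be sketched as follows. The tuple representation `(lhs, rhs)` and the function name are hypothetical. The sketch also assumes that only rules with the same consequent are compared and that attribute Y is unconstrained (“_”) in rule 4.

```python
def is_left_reduced(rule, rules):
    """True if no other rule with the same consequent has a strictly smaller LHS attribute set."""
    lhs, rhs = rule
    attrs = set(lhs)
    return not any(
        other_rhs == rhs and set(other_lhs) < attrs
        for other_lhs, other_rhs in rules
    )


# Rules 3 and 4 of the text; each rule is (LHS pattern, (RHS attribute, RHS value)).
rule3 = ({"X1": "x1", "Y": "_"}, ("A", "_"))
rule4 = ({"X1": "x1", "X2": "x2", "Y": "_"}, ("A", "_"))
rules = [rule3, rule4]

rule3_left_reduced = is_left_reduced(rule3, rules)
rule4_left_reduced = is_left_reduced(rule4, rules)
```

The strict subset comparison `<` keeps a rule from disqualifying itself.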

Next, “most-general”: a CFD is said to be “most-general” if none of the constant attribute values on its left side can be replaced with “_” (Variable) to obtain another of the given CFDs.

For example, when the following rules 5 and 6 are given, rule 5 can be obtained by replacing the attribute value x2 of rule 6 with Variable. For this reason, rule 6 is not “most-general”. Conversely, rule 5 is said to be “most-general”. In this case, rule 6 can be deleted as a redundant CFD with respect to rule 5.

Rule 5: X1, X2 → A (x1, _ || a)
Rule 6: X1, X2 → A (x1, x2 || a)
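A corresponding sketch of the most-general test for rules 5 and 6 follows. As before, the representation and the function names are assumptions for illustration.

```python
WILDCARD = "_"


def generalizes(general, specific):
    """True if `general` equals `specific` except that some LHS constants became "_"."""
    (g_lhs, g_rhs), (s_lhs, s_rhs) = general, specific
    if g_rhs != s_rhs or set(g_lhs) != set(s_lhs) or g_lhs == s_lhs:
        return False
    return all(g_lhs[a] in (WILDCARD, s_lhs[a]) for a in s_lhs)


def is_most_general(rule, rules):
    """True if no other given rule is a generalization of `rule`."""
    return not any(generalizes(other, rule) for other in rules)


# Rules 5 and 6 of the text.
rule5 = ({"X1": "x1", "X2": "_"}, ("A", "a"))
rule6 = ({"X1": "x1", "X2": "x2"}, ("A", "a"))
rules = [rule5, rule6]

rule5_most_general = is_most_general(rule5, rules)
rule6_most_general = is_most_general(rule6, rules)
```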

This completes the overview of CFD.

An apparatus for discovering rules from a database includes a storage device, such as a magnetic disk, for storing CFDs; a calculation means (calculation unit) that generates CFD candidates and determines whether each candidate matches the contents of the database; and a storage means (storage unit) that stores the CFDs determined to match the contents in the storage device. The storage device stores the CFDs obtained by the rule discovery algorithm. The calculation means generates a CFD candidate to be checked, checks whether it matches the contents of the database, and outputs it as a valid CFD if it matches. The storage means stores the obtained valid CFDs in the storage device.

Database rule discovery techniques include, for example, the following, as described in Non-Patent Document 1:
(1) a method of generating constant CFD candidates from a free item set and the corresponding closed item set;
(2) a method of generating CFD candidates of the form X → A by enumerating lists of attribute-value pairs by breadth-first search, taking one pair as the consequent part (A) and the rest as the condition part (X);
(3) a method of generating CFD candidates by depth-first search, placing a free item set in the condition part and one attribute not included in the free item set in the consequent part, and adding other attributes to the condition part.

As described above, there is a certainty factor as an index indicating how much the contents of the database and the CFD match.

Non-Patent Document 2 discloses a method that uses breadth-first search to discover rules (CFDs) that do not completely match the contents of the database but have high confidence (Confidence), that is, CFDs whose confidence is equal to or greater than a threshold value (“approximately valid” CFDs, hereinafter referred to as “approximate CFDs”).

For example, Patent Document 1 discloses a rule base management device including: a rule base that stores rules each consisting of a condition part and a conclusion part; a case information database that stores case information on the results of applying rules; a relation part that associates each rule with the case information satisfying its condition; a case search unit that retrieves a case information set from the case information database, using the condition part of the rule to be validated as a key; and a validity check unit that calculates the proportion of case information in the retrieved set for which the conclusion part of the rule holds and checks the validity of the rule based on that proportion. Patent Document 2 discloses a configuration in which functional dependencies (FDs) between the attributes of a relation are discovered and normalization is performed by decomposing the relation.

International Publication No. 2004/36496
Japanese Patent Laid-Open No. 6-110749

The following is an analysis of related technologies performed by the present inventors.

The first problem is that the CFD discovery algorithm disclosed in Non-Patent Document 1 obtains only CFDs that are completely valid for the database, that is, CFDs whose confidence is 1, and cannot enumerate approximately valid CFDs.

The second problem is that the approximate CFD discovery algorithm disclosed in Non-Patent Document 2 requires an extremely long calculation time. The reason is that, for a large-scale database, in particular one with a large number of attributes, the number of CFD candidates generated undergoes a combinatorial explosion.

The present invention was devised in view of the above problems, and its purpose is to provide a system, an apparatus, a method, and a program capable of efficiently obtaining a set of rules useful for grasping or correcting the contents of a database.

According to the present invention, there is provided a rule discovery system comprising:
a storage device for storing a database;
a data processing device; and
an output device,
wherein the data processing device includes:
rule candidate generation means for generating rule candidates from the database; and
rule validity determination means for determining whether or not each rule candidate is valid for the contents of the database,
wherein the rule candidate generation means:
generates a set of items, each item consisting of an attribute-value pair in the database and having a frequency in the database equal to or higher than a predetermined threshold value; and
generates, as initial rule candidates, a set of rules whose condition part / premise part (LHS) is empty and whose result part (RHS) is one of the items, and stores the set in a storage unit,
wherein the rule validity determination means:
for each rule of the rule candidates generated by the rule candidate generation means, determines that the rule is valid if, for the tuples of the database that match the condition part / premise part of the rule, the result part of the rule matches at or above a predetermined certainty threshold, and outputs the rule to the output device,
wherein, when no rule candidates remain to be checked for validity, the rule candidate generation means uses the items as a new condition part / premise part, generates new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates, and stores them in the storage unit, and
wherein the determination of validity of the new rule candidates by the rule validity determination means and the generation by the rule candidate generation means of new rule candidates whose condition part / premise part is one larger than in the previously generated rule candidates are repeated until the rule candidate generation means can no longer generate such new rule candidates and the new rule candidates become empty.

According to the present invention, there is provided a rule discovery method for discovering rules from a database by a data processing device including rule candidate generation means and rule validity determination means, the method comprising the steps of:
(a) the rule candidate generation means reading the database, generating items each consisting of an attribute-value pair of the database and having a frequency equal to or higher than a predetermined threshold, generating, as initial rule candidates, a set of rules whose condition part / premise part (LHS) is empty and whose result part (RHS) is one of the items, and storing the rule set in a storage unit;
(b) the rule validity determination means determining, for a rule of the generated rule candidates, that the rule is valid if, for the tuples of the database that match the condition part / premise part of the rule, the result part of the rule matches at or above a predetermined certainty threshold, and outputting the rule from the output device;
(c) checking whether any rule candidates remain to be checked for validity and, if so, returning to step (b);
(d) when no rule candidates remain to be checked, the rule candidate generation means using the items as a new condition part / premise part, generating new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates, and storing them in the storage unit; and
(e) checking whether the new rule candidates generated in step (d) are empty, terminating the rule discovery if they are empty, and returning to step (b) if they are not.

According to the present invention, there is provided a program for causing a computer to execute:
(a) processing to read a database, generate items each consisting of an attribute-value pair of the database and having a frequency equal to or higher than a predetermined threshold value, generate, as initial rule candidates, rules whose condition part / premise part (LHS) is empty and whose result part (RHS) is one of the items, and store them in a storage unit;
(b) processing to determine, for a rule of the generated rule candidates, that the rule is valid if, for the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches at or above a predetermined certainty threshold, and to output the rule from the output device;
(c) processing to check whether any rule candidates remain to be checked for validity and, if so, to return to (b);
(d) processing to use, when no rule candidates remain to be checked, the items as a new condition part / premise part, generate new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates, and store them in the storage unit; and
(e) processing to check whether the new rule candidates generated in (d) are empty, terminate the rule discovery process if they are empty, and return to (b) if they are not.
According to the present invention, a computer-readable memory device or a magnetic / optical disk medium / device storing the program is also provided.

According to the present invention, there is provided a rule discovery apparatus comprising:
rule candidate generation means for generating new rule candidates based on the contents of a database and a frequency threshold set from an input device, or on already generated rule candidates; and
rule validity determination means for determining whether or not each rule candidate is valid for the contents of the database,
wherein the rule candidate generation means reads the database, generates items each consisting of an attribute-value pair of the database and having a frequency equal to or higher than the threshold value, generates, as initial rule candidates, rules whose condition part / premise part (LHS) is empty and whose result part (RHS) is one of the items, and generates rule candidates by breadth-first search in order of increasing size of the condition part / premise part,
wherein, for each rule of the generated rule candidates, the rule validity determination means determines that the rule is valid if, for the tuples of the database that match the condition part of the rule, the consequent part of the rule matches at or above a certainty threshold input from the input device, and outputs the rule to an output device,
wherein, when no rule candidates remain to be checked for validity, the rule candidate generation means excludes rules that are redundant with respect to the valid rules from the rule candidate search, uses the items as the condition part / premise part, and generates new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates, and
wherein the determination of validity of the new rule candidates by the rule validity determination means and the generation by the rule candidate generation means of new rule candidates whose condition part / premise part is one larger than in the previously generated rule candidates are repeated until the rule candidate generation means can no longer generate such new rule candidates and the new rule candidates become empty.

According to the present invention, it is possible to efficiently obtain a set of rules useful for grasping the contents of a database or performing correction.

FIG. 1 is a diagram showing the configuration of the first exemplary embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the first exemplary embodiment of the present invention.
FIG. 3 is a diagram for explaining a specific example of the operation of the first exemplary embodiment of the present invention.
FIG. 4 is a diagram showing the configuration of the second exemplary embodiment of the present invention.

Next, embodiments of the present invention will be described in detail with reference to the drawings. According to the present invention, rule (CFD) candidates are generated by breadth-first search in order of increasing size of the condition part / premise part, and when a valid rule is found, rules that are redundant with respect to it are pruned, that is, excluded from the subsequent search for candidate rules. This makes it possible to efficiently enumerate rules that are almost valid (approximate CFDs).

Based on the database and the setting parameters, a set of rules that almost match the contents of the database (approximate CFDs) is calculated. More specifically, the invention includes rule candidate generation means (apparatus) (21 in FIG. 1) for generating new rule candidates based on the contents of the database and the setting parameters, or on already generated rule candidates, and rule validity determination means (apparatus) (22 in FIG. 1) for checking whether the rule candidates are valid for the contents of the database.

The rule candidate generation means (device) (21 in FIG. 1) generates rule (CFD) candidates from the contents of the database or from the item set obtained in the previous step and, when a rule is valid, performs pruning so that rules redundant with respect to it are not output. Eventually, when the rule (CFD) candidates generated by the rule candidate generation means (device) (21 in FIG. 1) become empty (that is, when no further rule (CFD) candidates can be generated), the rule discovery calculation is terminated.

According to an embodiment of the system of one aspect, a storage device (3 in FIG. 1) for storing a database, a data processing device (2 in FIG. 1), and an output device (4 in FIG. 1) are provided. The data processing device (2) includes rule candidate generation means (21 in FIG. 1) for generating rule candidates from the database, and rule validity determination means (22 in FIG. 1) for determining whether or not the rule candidates are valid for the contents of the database.

The rule candidate generation means (21 in FIG. 1) is an item composed of attribute-value pairs in the database, and generates a set of items whose frequency in the database is equal to or higher than a predetermined threshold value. Then, as an initial value of the rule candidate, a rule set having the condition part / premise part as empty and the result part as the item is generated and stored in the storage unit.

For each rule of the rule candidates generated by the rule candidate generation means, the rule validity determination means (22 in FIG. 1) determines that the rule is valid if, for the tuples of the database that match the condition part / premise part of the rule, the result part of the rule matches at or above a predetermined certainty threshold, and outputs the rule to the output device (4 in FIG. 1). When no rule candidates remain to be checked for validity, the rule candidate generation means (21 in FIG. 1) uses the items as a new condition part / premise part, generates new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates, and stores them in the storage unit. The validity check (A3, A4, A5 in FIG. 2) by the rule validity determination means (22 in FIG. 1) on the new rule candidates and the generation (A6 in FIG. 2) by the rule candidate generation means (21 in FIG. 1) of new rule candidates whose condition part / premise part is one larger than before are repeated (A3 to A7 in FIG. 2) until the rule candidate generation means can no longer generate such new rule candidates and the new rule candidates become empty.

According to an embodiment of the rule discovery method of another aspect, the following steps are included.
(a) The rule candidate generation means (21 in FIG. 1) reads the database and generates items each consisting of an attribute-value pair of the database and having a frequency equal to or higher than a predetermined threshold (step A1 in FIG. 2). As initial rule candidates, it generates a set of rules whose condition part / premise part is empty and whose result part is one of the items, and stores the set in the storage unit (step A2 in FIG. 2).
(b) For a rule of the generated rule candidates, the rule validity determination means (22 in FIG. 1) determines that the rule is valid if, for the tuples of the database that match the condition part / premise part of the rule, the result part of the rule matches at or above a predetermined certainty threshold, and outputs the rule from the output device (steps A3 and A4 in FIG. 2).
(c) It is checked whether any rule candidates remain to be checked for validity; if so, the process returns to step (b) (step A5 in FIG. 2).
(d) When no rule candidates remain to be checked (when validity has been determined for all rule candidates), the rule candidate generation means uses the items as a new condition part / premise part, generates new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates, and stores them in the storage unit (step A6 in FIG. 2).
(e) It is determined whether the new rule candidates are empty, that is, whether no new rule candidate was generated in step (d) (step A7 in FIG. 2). If they are empty, the rule discovery ends; if they are not empty, the process returns to step (b).

According to the present invention, the discovery of almost valid rules (approximate CFDs) in database rule discovery can be sped up, which makes the invention suitable for discovering rules useful for grasping and correcting the contents of a database. The invention is described in detail below with reference to embodiments.

<Embodiment 1>
Referring to FIG. 1, the first exemplary embodiment of the present invention includes an input device 1 such as a keyboard, a data processing device 2 that operates under program control, a storage device 3, and an output device 4 such as a display device or a printing device.

The storage device 3 includes a database storage unit 31 composed of a magnetic disk device or the like. The database storage unit 31 stores a database. Data in this database is read out by the data processing device 2 to extract CFD rules.

The data processing device 2 includes a rule candidate generation unit 21 and a rule validity determination unit 22.

The rule candidate generation unit 21 generates rule candidates for the database stored in the database storage unit 31. In doing so, the rule candidate generation unit 21 uses the parameters given from the input device 1 and the rule candidates generated in the previous step (already generated rule candidates), and stores the generated rule candidates in the storage unit. The storage unit may be a storage unit (memory device) (not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation unit 21, or a predetermined storage area in the storage device 3.

The rule validity determination unit 22 checks whether each rule generated by the rule candidate generation unit 21 is valid. If the rule is valid, it is output to the output device 4.

Here, “valid” means that:
- the number of tuples in the database matching the rule is greater than or equal to a predetermined frequency threshold, and
- the result part of the rule and the content of the tuples match at or above the certainty threshold.

The parameters given from the input device 1 include the frequency threshold and the certainty threshold, and are referred to by the rule candidate generation means 21.
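The two-part validity condition can be sketched in Python as below. The function name `is_valid`, the restriction to constant attribute values, and the sample rows are assumptions for illustration. In particular, “tuples matching the rule” is read here as tuples matching both the left side and the result part, which is one possible interpretation of the text.

```python
def is_valid(rows, lhs, rhs_attr, rhs_value, k, p):
    """Validity test: frequency at least k and certainty at least p.

    lhs maps attributes to constant values; an empty dict matches every tuple.
    """
    matching = [r for r in rows if all(r[a] == v for a, v in lhs.items())]
    if not matching:
        return False
    hits = sum(1 for r in matching if r[rhs_attr] == rhs_value)
    return hits >= k and hits / len(matching) >= p


# Hypothetical four-tuple table, not taken from the publication.
rows = [
    {"attribute 1": "1", "attribute 2": "P"},
    {"attribute 1": "1", "attribute 2": "P"},
    {"attribute 1": "2", "attribute 2": "P"},
    {"attribute 1": "3", "attribute 2": "Q"},
]
valid_p = is_valid(rows, {}, "attribute 2", "P", k=2, p=0.66)   # 3 of 4 tuples
valid_1 = is_valid(rows, {}, "attribute 1", "1", k=2, p=0.66)   # 2 of 4 tuples
```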

FIG. 2 is a flowchart for explaining the operation of the present embodiment. The operation of the present embodiment will be described in detail with reference to FIGS.

The parameters given from the input device 1 and the contents of the database given from the database storage unit 31 are supplied to the rule candidate generation means 21. The rule candidate generation means 21 generates the attribute-value pairs (each called an “item”) that appear in the database (step A1). The rule candidate generation means 21 stores the generated item set in a storage unit (not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation means 21, or a predetermined storage area of the storage device 3.

The rule candidate generation means 21
- extracts, from the set of generated items, all items whose frequency is greater than or equal to the parameter (frequency threshold) k, and
- generates rules whose condition part (condition part / premise part) is empty and whose consequent part is an extracted item,
and uses these rules as the initial rule candidates (CFD candidates) (step A2). The rule candidate generation means 21 stores the generated initial rule candidates (CFD candidates) in a storage unit (not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation means 21, or a predetermined storage area of the storage device 3. In FIG. 2, the condition part / premise part (LHS) of a rule is shown simply as the condition part.

The rule validity determination means 22 checks each rule candidate (CFD candidate) generated by the rule candidate generation means 21 against the database stored in the database storage unit 31 to determine whether the rule is valid. Specifically, the rule validity determination means 22 determines that the rule is valid (Yes in step A3) if, for the tuples in the database that match the condition part of the rule, the result part of the rule and the content of the tuples match at or above the parameter (certainty threshold) p.

When a valid rule is obtained from the rule candidates, the rule validity determination means 22 outputs the valid rule (CFD) to the output device 4 (step A4). A rule that is not valid is not output to the output device 4.

When the rule validity determination means 22 determines that a rule is valid, the rule candidate generation means 21, in generating rule candidates of larger size, does not generate as candidates any rules whose consequent part contains the item that is the consequent part of the rule determined to be valid. That is, rule candidates of larger size are pruned.

Steps A3 and A4 are repeated, and when no rule candidates (CFD candidates) remain to be checked for validity (that is, when the rule candidates are empty) (Yes in step A5), the rule candidate generation means 21 uses the items as a new condition part and generates new rule candidates (CFD candidates) in which the size of the condition part is increased by one (step A6).

The rule candidate generation means 21 determines whether or not the new rule candidate generated in step A6 is empty (step A7).

If the new rule candidate generated in step A6 is empty, the rule discovery calculation process is terminated (Yes in step A7).

If the new rule candidate (CFD candidate) generated in step A6 is not empty, the process returns to step A3, where it is determined by the rule validity determination means 22 whether the rule is valid and a valid rule is output.

The series of steps A3 to A7 are repeated until no more rule candidates (CFD candidates) are generated in the rule candidate generation means 21 as a result of the determination in step A7.

Thus, in this embodiment, the search begins by enumerating the items whose frequency is greater than or equal to a predetermined value and then combines them, starting from small rule candidates (CFD candidates) and gradually generating larger ones. When a valid rule (a rule whose certainty is equal to or greater than the threshold value) is obtained, the rule is output, and the generation of rules redundant with respect to it is suppressed, so that rules can be discovered efficiently.
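The overall flow of steps A1 to A7 can be sketched as a level-wise loop. The following Python code is a minimal sketch under stated assumptions, not the implementation of the embodiment: it handles only constant items (no “_” items), treats a candidate as redundant when an already valid rule has the same consequent item and a left side that is a subset of the candidate's left side, and runs on a hypothetical four-tuple table. All names are illustrative.

```python
from collections import Counter
from itertools import combinations


def discover_rules(rows, k, p):
    """Breadth-first search for approximately valid rules with constant items."""
    # Step A1: items (attribute-value pairs) with frequency >= k.
    counts = Counter((a, v) for r in rows for a, v in r.items())
    items = sorted(item for item, c in counts.items() if c >= k)

    def is_valid(lhs, rhs):
        matching = [r for r in rows if all(r[a] == v for a, v in lhs)]
        if not matching:
            return False
        hits = sum(1 for r in matching if r[rhs[0]] == rhs[1])
        return hits >= k and hits / len(matching) >= p

    def is_redundant(lhs, rhs, found):
        # Assumed pruning test: same consequent, smaller-or-equal left side.
        return any(f_rhs == rhs and set(f_lhs) <= set(lhs) for f_lhs, f_rhs in found)

    valid = []
    size = 0
    candidates = [((), item) for item in items]   # step A2: empty left side
    while candidates:                              # step A7: stop when empty
        for lhs, rhs in candidates:                # steps A3 to A5
            if is_valid(lhs, rhs):
                valid.append((lhs, rhs))           # step A4: output
        size += 1                                  # step A6: grow left side by one
        candidates = []
        for lhs in combinations(items, size):
            attrs = [a for a, _ in lhs]
            if len(set(attrs)) < size:
                continue  # an attribute may appear only once on the left side
            for rhs in items:
                if rhs[0] not in attrs and not is_redundant(lhs, rhs, valid):
                    candidates.append((lhs, rhs))
    return valid


# Hypothetical table, not taken from the publication.
rows = [
    {"attribute 1": "1", "attribute 2": "P", "attribute 3": "S"},
    {"attribute 1": "1", "attribute 2": "P", "attribute 3": "S"},
    {"attribute 1": "2", "attribute 2": "P", "attribute 3": "T"},
    {"attribute 1": "3", "attribute 2": "Q", "attribute 3": "T"},
]
found = discover_rules(rows, k=2, p=0.66)
```

Because the rule with an empty left side and consequent “attribute 2 = P” is valid at the first level, no larger candidate with that consequent is generated afterwards, which is the pruning effect described above.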

Next, the operation of this embodiment will be described using specific examples. As shown in FIG. 3B (Table 1 below), for example, the database storage unit 31 is registered with a data set composed of the attributes and tuples shown in Table 1 below. The database example is a simplified example for the sake of explanation. FIG. 3A illustrates a specific example of an item set, a candidate for an initial rule (CFD), and a new rule (CFD) candidate corresponding to the steps in FIG. FIG. 3C is a diagram illustrating an example of a rule (approximate CFD) output from the output device 4 as a result of the calculation of FIG. In FIG. 3, the symbol “:” in “attribute 1: _” or the like is synonymous (same) as the symbol “=” in “attribute 1 = _”.

The rule candidate generation means 21 receives Table 1 above and the parameters
k = 2, p = 0.66.
Here,
k is the frequency threshold (lower limit) for determining a valid rule, and
p is the certainty factor threshold (lower limit).

The rule candidate generation means 21 extracts the list (item set) of all items whose frequency of appearance in the database is k = 2 or more: {
“Attribute 1 = _” (4),
“Attribute 2 = _” (4),
“Attribute 3 = _” (4),
“Attribute 1 = 1” (2),
“Attribute 2 = P” (3),
“Attribute 3 = S” (2),
“Attribute 3 = T” (2)}. The number in parentheses is the frequency of the item. The symbol “_”, as in “attribute 1 = _”, is an unnamed variable, that is, a wildcard that matches an arbitrary attribute value.
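The item extraction can be illustrated with a short sketch. Table 1 itself is not reproduced in this passage, so the four tuples in ROWS below are a hypothetical stand-in chosen only so that the item frequencies agree with the ones listed above; the function name frequent_items is likewise an assumption made for illustration.

```python
from collections import Counter

# Hypothetical stand-in for Table 1 (NOT the actual table of FIG. 3B).
ROWS = [
    {"attribute 1": "1", "attribute 2": "P", "attribute 3": "S"},
    {"attribute 1": "1", "attribute 2": "Q", "attribute 3": "S"},
    {"attribute 1": "2", "attribute 2": "P", "attribute 3": "T"},
    {"attribute 1": "3", "attribute 2": "P", "attribute 3": "T"},
]
WILDCARD = "_"  # unnamed variable that matches any attribute value


def frequent_items(rows, k):
    """Return {(attribute, value): frequency} for items with frequency >= k."""
    counts = Counter()
    for row in rows:
        for attr, value in row.items():
            counts[(attr, value)] += 1     # constant item, e.g. attribute 2 = P
            counts[(attr, WILDCARD)] += 1  # wildcard item, e.g. attribute 2 = _
    return {item: n for item, n in counts.items() if n >= k}


items = frequent_items(ROWS, k=2)
for (attr, value), n in sorted(items.items()):
    print(f"{attr} = {value} ({n})")
```

With k = 2 the values that occur only once in the stand-in table are dropped, leaving the seven items enumerated in the text.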

For each item in the extracted list, the rule candidate generation means 21 generates a rule whose condition part / premise part (LHS) is empty and whose consequent part (RHS) is that item, and temporarily stores these rules in a storage unit (not shown) as the initial rule (CFD) candidates (step A1 in FIGS. 2 and 3A).

In this case, the initial rule (CFD) candidates are as follows.
ψ1: empty → “attribute 1 = _”,
ψ2: empty → “attribute 2 = _”,
ψ3: empty → “attribute 3 = _”,
ψ4: empty → “attribute 1 = 1”,
ψ5: empty → “attribute 2 = P”,
ψ6: empty → “attribute 3 = S”,
ψ7: empty → “attribute 3 = T”
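As a minimal sketch of step A1, the initial candidates can be built directly from the item list. The representation of a rule as a pair (LHS tuple, RHS item) and the helper names initial_candidates and show are assumptions made for this illustration only.

```python
# The seven frequent items listed in the text, as (attribute, value) pairs.
ITEMS = [
    ("attribute 1", "_"),
    ("attribute 2", "_"),
    ("attribute 3", "_"),
    ("attribute 1", "1"),
    ("attribute 2", "P"),
    ("attribute 3", "S"),
    ("attribute 3", "T"),
]


def initial_candidates(items):
    """Step A1: one rule 'empty -> item' per frequent item."""
    return [((), item) for item in items]


def show(rule):
    """Render a rule in the 'LHS -> RHS' notation used in the text."""
    lhs, rhs = rule
    left = " and ".join(f"{a} = {v}" for a, v in lhs) or "empty"
    return f"{left} -> {rhs[0]} = {rhs[1]}"


candidates = initial_candidates(ITEMS)
for number, rule in enumerate(candidates, start=1):
    print(f"psi{number}: {show(rule)}")
```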

In the rule ψ5, the frequency of the empty condition part (which equals the total number of tuples) is 4, and the frequency of “attribute 2 = P” is 3.
Therefore, the certainty factor (confidence), which is the ratio of the frequency of “attribute 2 = P” (= 3) to the frequency of the empty condition part (= 4), is 3/4 = 0.75. This exceeds the certainty factor threshold (lower limit) p = 0.66.
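The ratio above can be checked with a small sketch. The table in ROWS is a hypothetical stand-in for Table 1 (chosen to match the quoted item frequencies), and the helper names matches and confidence are assumptions. The sketch covers only rules whose consequent part is a constant item; rules with a wildcard consequent need a different measure that is not shown here.

```python
# Hypothetical stand-in for Table 1 (NOT the actual table of FIG. 3B).
ROWS = [
    {"attribute 1": "1", "attribute 2": "P", "attribute 3": "S"},
    {"attribute 1": "1", "attribute 2": "Q", "attribute 3": "S"},
    {"attribute 1": "2", "attribute 2": "P", "attribute 3": "T"},
    {"attribute 1": "3", "attribute 2": "P", "attribute 3": "T"},
]


def matches(row, items):
    """True if the tuple agrees with every item; '_' matches any value."""
    return all(value == "_" or row.get(attr) == value for attr, value in items)


def confidence(rows, lhs, rhs):
    """Frequency of LHS together with RHS, divided by the frequency of LHS."""
    covered = [row for row in rows if matches(row, lhs)]
    if not covered:
        return 0.0
    hits = sum(1 for row in covered if matches(row, [rhs]))
    return hits / len(covered)


# psi5: empty -> attribute 2 = P
c = confidence(ROWS, [], ("attribute 2", "P"))
print(c, c >= 0.66)
```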

Therefore, the rule validity determination means 22 determines that
ψ5: empty → “attribute 2 = P”
is valid (step A3 in FIGS. 2 and 3A) and outputs it to the output device 4 (step A4 in FIGS. 2 and 3A).

At the same time, rules that are redundant with respect to ψ5 are excluded from the subsequent search for rule candidates. Specifically, in the later step (step A6 in FIGS. 2 and 3A), when the rule candidate generation means 21 generates an item set (itemset) that includes the item “attribute 2 = P”, it removes “attribute 2 = P” from the candidates for the consequent part (RHS) of that item set.

The remaining six rules (ψ1 to ψ4, ψ6, ψ7) are not valid for the contents of the database. As a result, the rule (CFD) candidate is empty (Yes in step A5 in FIG. 2).

Among the rule candidates ({ψ1, ψ2, ..., ψ7}), the rule (CFD) ψ5, which the rule validity determination means 22 has determined to be valid for the contents of the database, is output to the output device 4. Every rule whose validity has been determined is deleted from the rule candidates. That is, both the rule ψ5 determined to be valid and the rules determined to be invalid (ψ1 to ψ4, ψ6, ψ7) are deleted, and as a result the rule candidates awaiting the validity check become empty (none remain). As an alternative to deleting a rule from memory, deletion from the rule candidates may be implemented by preparing a deletion flag, for example of 1 bit, for each rule candidate. In this case, the rule candidates are determined to be empty when the deletion flags of all the rule candidates are on.
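The deletion-flag variant can be sketched as follows. The class name CandidatePool and its methods are hypothetical; the text only states that one flag per candidate may be kept and that the candidates count as empty once every flag is on.

```python
class CandidatePool:
    """Rule candidates with one deletion flag per rule (hypothetical sketch)."""

    def __init__(self, rules):
        self.rules = list(rules)
        self.deleted = [False] * len(self.rules)  # one flag per candidate

    def mark_checked(self, index):
        """Flag a rule as deleted once its validity has been determined."""
        self.deleted[index] = True

    def is_empty(self):
        """The pool counts as empty when every deletion flag is on."""
        return all(self.deleted)


pool = CandidatePool(["psi1", "psi2", "psi3", "psi4", "psi5", "psi6", "psi7"])
for i in range(len(pool.rules)):
    pool.mark_checked(i)  # valid and invalid rules are both flagged
print(pool.is_empty())
```

Keeping the rules in place and only flipping flags avoids reallocating the candidate list while it is being scanned.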

Since the rule candidates (CFD candidates) for validity determination are empty (Yes in step A5 in FIG. 2), the rule candidate generation means 21 generates new rules (step A6 in FIGS. 2 and 3A). Specifically, from the above set of items, it selects two elements that do not contradict each other (the same attribute does not take different values), whose combination is one larger in size than the elements before combination, and whose frequency is greater than or equal to the threshold k. It combines them (the combined element is called an itemset) and uses one of the elements as the consequent part and the rest as the condition part / premise part.
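A minimal sketch of this generation step, restricted to the first round (two single items combined into an itemset of size 2), is given below. ROWS is a hypothetical stand-in for Table 1, and the names support and next_candidates are assumptions. The sketch does not apply the pruning of redundant rules and makes no attempt to reproduce the exact candidate list of FIG. 3A.

```python
from itertools import combinations

# Hypothetical stand-in for Table 1 (NOT the actual table of FIG. 3B).
ROWS = [
    {"attribute 1": "1", "attribute 2": "P", "attribute 3": "S"},
    {"attribute 1": "1", "attribute 2": "Q", "attribute 3": "S"},
    {"attribute 1": "2", "attribute 2": "P", "attribute 3": "T"},
    {"attribute 1": "3", "attribute 2": "P", "attribute 3": "T"},
]
# The seven frequent items listed in the text.
ITEMS = [
    ("attribute 1", "_"), ("attribute 2", "_"), ("attribute 3", "_"),
    ("attribute 1", "1"), ("attribute 2", "P"),
    ("attribute 3", "S"), ("attribute 3", "T"),
]


def support(rows, itemset):
    """Number of tuples that match every item of the itemset."""
    return sum(
        all(value == "_" or row[attr] == value for attr, value in itemset)
        for row in rows
    )


def next_candidates(rows, items, k):
    """Combine two non-contradictory items and split them into rules."""
    rules = []
    for x, y in combinations(sorted(items), 2):
        if x[0] == y[0]:
            continue  # same attribute twice: the items contradict each other
        if support(rows, (x, y)) < k:
            continue  # itemset is not frequent enough
        # Either element may serve as the consequent part.
        rules.append(((x,), y))
        rules.append(((y,), x))
    return rules


rules = next_candidates(ROWS, ITEMS, k=2)
print(len(rules))
```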

As mentioned earlier, for the item set
{“attribute 2 = P”, “attribute 3 = T”}
obtained by combining the two items “attribute 2 = P” and “attribute 3 = T”, the item “attribute 2 = P” is excluded from the candidates for the consequent part.

Therefore, the rule candidate generation means 21 does not generate the
rule: “attribute 3 = T” → “attribute 2 = P”.
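The exclusion described above amounts to filtering candidates by their consequent part. In the sketch below, the rule representation (LHS tuple, RHS item) and the function name prune_redundant are assumptions made for illustration.

```python
def prune_redundant(candidates, valid_rhs):
    """Drop candidates whose consequent part already belongs to a valid rule."""
    return [(lhs, rhs) for lhs, rhs in candidates if rhs not in valid_rhs]


P = ("attribute 2", "P")
T = ("attribute 3", "T")

# Both rules that can be formed from the itemset {attribute 2 = P, attribute 3 = T}.
candidates = [((T,), P), ((P,), T)]

# psi5 (empty -> attribute 2 = P) is already valid, so P is excluded as a consequent.
kept = prune_redundant(candidates, valid_rhs={P})
print(kept)
```

Only “attribute 2 = P” → “attribute 3 = T” remains; the rule “attribute 3 = T” → “attribute 2 = P” is never handed to the validity check.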

Of the rule candidates newly generated by the rule candidate generation means 21 (step A6 in FIGS. 2 and 3A), the rule validity determination means 22 determines that
CFD ψ′1: “attribute 1 = _” → “attribute 3 = _”
(the value of attribute 1 uniquely determines the value of attribute 3)
is valid (frequency = 4, certainty factor = 1) (step A3 in FIGS. 2 and 3A), and ψ′1 is output from the output device 4 (step A4). At the same time, rules that are redundant with respect to ψ′1 are excluded from the subsequent search for rule candidates.

Similarly, the rule validity determination means 22 finds the following valid rules whose condition part / premise part has size 1 (frequency of k = 2 or more and certainty factor of p = 0.66 or more):
ψ′2: “attribute 3 = _” → “attribute 1 = _”,
ψ′3: “attribute 2 = _” → “attribute 3 = _”,
ψ′4: “attribute 3 = S” → “attribute 1 = 1”,
ψ′5: “attribute 1 = 1” → “attribute 3 = S”,
ψ′6: “attribute 2 = P” → “attribute 3 = T”
(steps A3 and A4 in FIGS. 2 and 3A). It outputs these rules to the output device 4 (step A4 in FIG. 2) and excludes the rules that are redundant with respect to each of them.

The rule validity determination means 22 has now determined the validity of all six rule (CFD) candidates whose condition part / premise part has size 1 (all six were determined to be valid), so the rule (CFD) candidates whose validity is to be checked are empty (Yes in step A5 in FIG. 2).

Therefore, the rule candidate generation means 21 tries to generate rule (CFD) candidates in which the size of the condition part / premise part is increased by one. However, no rule candidate with condition part / premise part size = 2 is generated. The rule (CFD) candidates with condition part / premise part size = 2 are therefore empty, and the rule discovery algorithm ends (Yes in step A7).

The obtained CFD is as shown in FIG.
empty → “attribute 2 = P”: frequency = 3, certainty factor = 0.75
“attribute 1 = _” → “attribute 3 = _”: frequency = 4, certainty factor = 1
“attribute 3 = _” → “attribute 1 = _”: frequency = 4, certainty factor = 0.75
“attribute 2 = _” → “attribute 3 = _”: frequency = 4, certainty factor = 0.75
“attribute 3 = S” → “attribute 1 = 1”: frequency = 2, certainty factor = 1
“attribute 1 = 1” → “attribute 3 = S”: frequency = 2, certainty factor = 1
“attribute 2 = P” → “attribute 3 = T”: frequency = 2, certainty factor = 0.66

In the output example shown in FIG. 3C, both the frequency and the certainty factor of each rule determined to be valid are output, but only one of the two may be output; they need not be output together. When a large number of rules are determined to be valid, the rules may, for example, be sorted and output in order of certainty factor. The output form of the rules determined to be valid is arbitrary.
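Sorting the output by certainty factor can be sketched as follows, using the seven rules and values listed above. Breaking ties by descending frequency is an assumption added for this illustration; the text only mentions ordering by certainty factor.

```python
# (rule, frequency, certainty factor) as listed in the output example.
RULES = [
    ("empty -> attribute 2 = P", 3, 0.75),
    ("attribute 1 = _ -> attribute 3 = _", 4, 1.0),
    ("attribute 3 = _ -> attribute 1 = _", 4, 0.75),
    ("attribute 2 = _ -> attribute 3 = _", 4, 0.75),
    ("attribute 3 = S -> attribute 1 = 1", 2, 1.0),
    ("attribute 1 = 1 -> attribute 3 = S", 2, 1.0),
    ("attribute 2 = P -> attribute 3 = T", 2, 0.66),
]

# Highest certainty factor first; ties broken by higher frequency (assumption).
ranked = sorted(RULES, key=lambda r: (-r[2], -r[1]))

for rule, frequency, certainty in ranked:
    print(f"{rule}: frequency = {frequency}, certainty factor = {certainty}")
```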

Here, when pruning is not performed, during the calculation process of FIG. a redundant rule (CFD) such as
CFD ψ: “attribute 3 = T” → “attribute 2 = P”
is generated from the item set {“attribute 2 = P”, “attribute 3 = T”}.

On the other hand, according to the present embodiment, the generation of redundant rules as described above can be prevented by pruning.

<Embodiment 2>
Next, a second embodiment of the present invention will be described in detail with reference to the drawings. Referring to FIG. 4, the second embodiment of the present invention includes a rule discovery program 5.

The rule discovery program 5 is read into the data processing device 6 and controls the operation of the data processing device 6. Under the control of the rule discovery program 5, the data processing device 6 executes the following processing, which is the same as the processing performed by the data processing device 2 in the first embodiment. The rule discovery program 5 includes:
(a) processing of reading the database, generating items, each consisting of an attribute-value pair of the database, whose frequency is equal to or higher than a predetermined threshold, generating, as the initial value of the rule candidates, rules (CFDs) whose condition part is empty and whose consequent part is one of the items, and storing them in the storage unit;
(b) processing of determining, for each rule (CFD) of the generated rule candidates, that the rule is valid (approximate CFD) and outputting it to the output device if, for the tuples of the database that match the condition part of the rule, the consequent part of the rule matches with a certainty factor equal to or higher than a predetermined threshold;
(c) processing of returning to (b) if the rule candidates (CFD candidates) for validity determination are not empty;
(d) processing of, when the rule candidates (CFD candidates) for validity determination are empty, using the items as new condition parts, generating new rule candidates (new CFD candidates) in which the size of the condition part is increased by one from the previously generated rule candidates, and storing them in the storage unit; and
(e) processing of determining whether or not the new rule candidates generated in the processing of (d) are empty, terminating the rule discovery processing if the new rule candidates (new CFD candidates) are empty, and returning to (b) if they are not empty.

In the processing of (b) that outputs a rule determined to be valid (approximate CFD) to the output device, instead of outputting the rule immediately, the rule may be temporarily stored in a list (linear list), and the list may be output when the rule candidates (CFD candidates) for validity determination become empty. The list is stored in a buffer of the storage device or the like. Any method can be used to control the output of the rules determined to be valid.
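Deferred output through a list can be sketched as follows. The class name RuleBuffer, its methods, and the use of a callable as the output device are hypothetical choices made for illustration.

```python
class RuleBuffer:
    """Collects valid rules and emits them in one batch (hypothetical sketch)."""

    def __init__(self, sink=print):
        self._pending = []  # linear list of rules awaiting output
        self._sink = sink   # stands in for the output device

    def add(self, rule):
        """Store a valid rule instead of outputting it immediately."""
        self._pending.append(rule)

    def flush(self):
        """Output all stored rules, e.g. once the candidates are empty."""
        for rule in self._pending:
            self._sink(rule)
        count = len(self._pending)
        self._pending.clear()
        return count


buffer = RuleBuffer()
buffer.add("empty -> attribute 2 = P")
buffer.flush()  # called when the rule candidates for validity determination are empty
```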

A command (such as a rule discovery program execution command) and setting parameters are input from the input device 1, and initial rule candidates are generated using the database stored in the database storage unit 31 in the storage device 3. Next, it is determined whether or not each generated rule candidate is valid. If it is valid, the rule is added to the list. When the coverage of the database by the set of rules stored in the list satisfies the cutoff condition, the set of rules in the list is displayed on the output device 4.

It should be noted that the disclosures of the above patent documents and non-patent documents are incorporated herein by reference. Within the scope of the entire disclosure (including claims) of the present invention, the embodiments and examples can be changed and adjusted based on the basic technical concept. Various disclosed elements (including each element of each claim, each element of each embodiment, each element of each drawing, etc.) can be combined or selected within the scope of the claims of the present invention. That is, the present invention of course includes various variations and modifications that could be made by those skilled in the art according to the entire disclosure including the claims and the technical idea. In particular, with respect to the numerical ranges described in this document, any numerical value or sub-range included in the range should be construed as being specifically described even if there is no specific description.

DESCRIPTION OF SYMBOLS
1 Input device
2, 6 Data processing device
3 Storage device
4 Output device
5 Rule discovery program
21 Rule candidate generation means
22 Rule validity determination means
31 Database storage unit

Claims

A storage device for storing the database;
A data processing device;
An output device;
With
The data processing device includes:
Rule candidate generation means for generating rule candidates from the database;
Rule validity determination means for determining whether or not the rule candidate is valid for the contents of the database;
With
The rule candidate generation means includes:
Generates a set of items, each consisting of an attribute-value pair in the database, whose frequency in the database is equal to or higher than a predetermined threshold;
As an initial value of the rule candidate, generates a set of rules each having an empty condition part / premise part and one of the items as its consequent part, and stores it in the storage unit,
The rule validity judging means is:
For each rule of the rule candidate generated by the rule candidate generation means,
The rule is determined to be valid and is output to the output device if, for the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches with a certainty factor equal to or higher than a predetermined threshold,
When the rule candidate to be validated becomes empty, the rule candidate generation means sets the item as a new condition part / premise part, and sets the size of the condition part / premise part to the rule candidate generated last time. Generate a new rule candidate increased by one and store it in the storage unit,
The determination of validity by the rule validity determination means for the new rule candidate generated by the rule candidate generation means, and
The generation by the rule candidate generation means of a new rule candidate in which the size of the condition part / premise part is increased by one from the previously generated rule candidate,
Are repeated until the rule candidate generation means can no longer generate a new rule candidate in which the size of the condition part / premise part is increased by one from the previously generated rule candidate and the new rule candidate becomes empty. A rule discovery system characterized by the above.
The rule discovery system according to claim 1, wherein, in the search for rule candidates by the rule candidate generation means, the consequent part of a rule determined to be valid by the rule validity determination means is excluded from the candidates for the consequent part, so that a redundant rule having the consequent part of the rule determined to be valid as its consequent part is not generated as the new rule candidate by the rule candidate generation means.
The rule discovery system according to claim 1, further comprising an input device for inputting the frequency threshold and the certainty threshold as setting parameters.
4. The rule discovery system according to claim 1, wherein the rule is a rule expressed by CFD (Conditional Functional Dependency).
In discovering a rule from a database by a data processing device equipped with a rule candidate generation unit and a rule validity determination unit,
(A) The rule candidate generation means reads the database, generates an item having a frequency equal to or higher than a predetermined threshold that includes a pair of attribute and value of the database,
As an initial value of the rule candidate, a rule set having a condition part / premise part as empty and a result part as the item is generated and stored in the storage unit,
(B) The rule validity determination means, for each rule of the generated rule candidate, determines that the rule is valid and outputs it from the output device if, for the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches with a certainty factor equal to or higher than a predetermined threshold,
(C) Check whether the rule candidate to be validated is empty, and if the rule candidate to be validated is not empty, the rule validity judging means returns to the step (b),
(D) When the rule candidate to be validated is empty, the rule candidate generation means sets the item as a new condition part / premise part, and sets the size of the condition part / premise part to the previously generated rule. Generate a new rule candidate, one more than the candidate, and store it in the storage unit,
(E) Check whether or not the new rule candidate generated in the step (d) is empty. If the new rule candidate is empty, the rule discovery is terminated.
If the new rule candidate is not empty, the method returns to step (b). A rule discovery method characterized by the above.
6. The rule discovery method according to claim 5, wherein, when a rule is determined to be valid by the rule validity determination means, the consequent part of the rule determined to be valid is excluded from the candidates for the consequent part in the rule candidate search, so that a redundant rule having the consequent part of the rule determined to be valid as its consequent part is not generated as the new rule candidate by the rule candidate generation means.
The rule discovery method according to claim 5 or 6, wherein the rule is a rule expressed in CFD (Conditional Functional Dependency).
(A) A process of reading out the database, generating items, each consisting of an attribute-value pair of the database, whose frequency is equal to or higher than a predetermined threshold, generating, as the initial value of the rule candidate, rules whose condition part / premise part is empty and whose consequent part is one of the items, and storing them in a storage unit;
(B) For the generated rule candidate rule, for a tuple of the database that matches the condition part / premise part of the rule, the rule consequent part has a predetermined certainty factor. If there is a match at a threshold value or higher, processing to determine that the rule is valid and output from the output device; and
(C) Checking whether or not the rule candidate for validity determination is empty, and if the rule candidate for determining validity is not empty, the process of returning to (b);
(D) When the rule candidate to be validated is empty, the item is set as a new condition part / premise part, and the size of the condition part / premise part is increased by one from the previously generated rule candidate. Processing to generate a new rule candidate and store it in the storage unit;
(E) Check whether or not the new rule candidate generated in the process of (d) is empty. If the new rule candidate is empty, the rule discovery process is terminated.
If the new rule candidate is not empty, the process returns to (b);
A program that causes a computer to execute.
9. The program according to claim 8, wherein, when a rule candidate is determined to be valid in the process (b), the consequent part of the rule determined to be valid is excluded from the candidates for the consequent part in the rule candidate search in the process (d), so that a redundant rule having the consequent part of the rule determined to be valid as its consequent part is not generated as a new rule candidate by the process (d).
Rule candidate generation means for generating a new rule candidate based on the contents of the database, the frequency threshold set from the input device, or the generated rule candidate;
Rule validity determination means for determining whether or not the rule candidate is valid for the contents of the database;
Including
The rule candidate generation means reads the database, generates items, each consisting of an attribute-value pair of the database, whose frequency is equal to or higher than the threshold, and generates, as an initial value of the rule candidate, rules whose condition part / premise part is empty and whose consequent part is one of the items,
Based on breadth-first search, generate rule candidates in order from smaller size of condition part / premise part,
For each rule of the generated rule candidate, the rule validity determination means determines that the rule is valid and outputs it to the output device if, for the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches with a certainty factor equal to or higher than the certainty factor threshold input from the input device,
When the rule candidate to be validated is empty, the rule candidate generation means excludes a rule that is redundant with respect to the valid rule from the rule candidate search, and uses the item as a condition part / premise part. , Generating a new rule candidate in which the size of the condition part / premise part is increased by one from the rule candidate generated previously,
Determination of validity by the validity determination means of the rule for the new rule candidate generated by the rule candidate generation means;
Generation of a new rule candidate in which the size of the condition part / premise part by the rule candidate generation unit is increased by one from the rule candidate generated previously;
Are repeated until the rule candidate generation means can no longer generate a new rule candidate in which the size of the condition part / premise part is increased by one from the previously generated rule candidate and the new rule candidate becomes empty. A rule discovery device characterized by the above.