WO2013172308A1 - Rule discovery system, method, device, and program - Google Patents
Rule discovery system, method, device, and program
- Publication number
- WO2013172308A1 (PCT/JP2013/063316)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- rule
- candidate
- rule candidate
- generated
- database
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
Definitions
- the present invention is based on a Japanese patent application: Japanese Patent Application No. 2012-110921 (filed on May 14, 2012), and the entire description of the application is incorporated herein by reference.
- the present invention relates to a rule discovery technique, and more particularly to a system, method, apparatus, and program for database rule discovery.
- in database rule discovery, rules are expressed, for example, as CFDs (Conditional Function Dependencies), and the CFD rules that match the contents of the database are output from the generated CFD rule candidates.
- CFD Conditional Function Dependency
- CFD is a rule indicating that a functional dependency (abbreviated as “FD”) between data attributes holds for a tuple set specified by a condition. It consists of the specification of attribute values in the condition part and premise part, which form the left side of the rule (LHS: Left Hand Side), and in the consequent part on the right side of the rule (RHS: Right Hand Side).
- LHS Left Hand Side
- RHS Right Hand Side
- the condition part and the result part are also called a conditional clause and a subordinate clause, respectively.
- x means that the attribute value is a specific value.
- Constant means, for example, “constant”.
- X _
- Such an expression of the attribute value is referred to as “variable” (“Variable” means, for example, “variable”).
- “_” is also referred to as an “unnamed variable”.
- a rule in which a dependency holds between attributes, although the value of the result part is not fixed to a specific value, is called a “variable CFD”. That is, when the right side of the pattern is the unnamed variable ‘_’ (tp[A] = _), the rule is referred to as a variable CFD.
- the support level is the number of tuples in which the condition part and the premise part of the CFD match.
- the confidence level is the ratio of the number of tuples that satisfy the CFD rule among the number of tuples that match the condition part and the premise part.
- a CFD is said to be “left-reduced” when its left side (LHS) attribute set does not contain the left side attribute set of another CFD.
- when the following rule 3 and rule 4 are given, the left side of rule 4 includes the left side of rule 3 ({X1} ⊂ {X1, X2}), so rule 4 is not “left-reduced”. Conversely, the left side of rule 3 does not include the left side of rule 4, so rule 3 is said to be “left-reduced”. In this case, rule 4 can be deleted as a redundant CFD with respect to rule 3.
- rule 5 can be obtained by replacing the attribute value x2 of rule 6 with Variable. For this reason, rule 6 is not “most-general”. Conversely, rule 5 is said to be “most-general”. In this case, rule 6 can be deleted as a redundant CFD with respect to rule 5.
- An apparatus for discovering a rule from a database includes a storage device, such as a magnetic disk, for storing CFDs; a calculation unit that generates a CFD candidate and determines whether the CFD candidate matches the contents of the database; and a storage unit that stores the CFDs determined to match the contents in the storage device.
- the storage means stores the CFD obtained by the rule discovery algorithm.
- the calculation means generates a CFD candidate to be checked, checks whether it matches the contents of the database, and outputs a valid CFD if it matches.
- the storage means stores the obtained valid CFD in the storage device.
- Non-Patent Document 1 discloses: (1) a method for generating constant CFD candidates from a free itemset and a corresponding closed itemset; (2) a method in which a list of attribute-value pairs is generated by breadth-first search, one pair is taken as the dependent term (A) and the rest as the conditional part (X), and CFD candidates of the form X → A are obtained; and (3) a method for generating CFD candidates by placing a free itemset in the conditional part and one attribute not included in the free itemset in the subordinate term (consequent part), and adding other attributes to the conditional part by depth-first search.
- Non-Patent Document 2 uses breadth-first search (breadth first search) as a discovery method for rules (CFD) that do not completely match the contents of the database but have high confidence (Confidence).
- CFD rules
- Patent Literature 1 discloses a rule base for storing a rule including a condition part and a conclusion part, a case information database for storing case information related to a rule application result, a rule, and a rule.
- the case search unit searches the case information set for the case information set from the case information database using the relation part that associates the case information satisfying the condition and the condition part of the rule to be validated as a key, and the conclusion part of the rule is satisfied in the case information set
- a rule base management device including a validity check unit that calculates a proportion of case information and checks validity of the rule based on the proportion.
- Patent Document 2 discloses a configuration in which a functional dependency (FD) between relation attributes is found and normalization is performed by dividing the relation.
- FD Functional Dependency
- the first problem is that the CFD discovery algorithm disclosed in Non-Patent Document 1 obtains only CFDs that are completely valid for a database, that is, CFDs having a certainty factor of 1; CFDs that are only approximately valid cannot be enumerated.
- the second problem is that the approximate CFD discovery algorithm disclosed in Non-Patent Document 2 has an extremely long calculation time. The reason is that, for a large-scale database, in particular one with a large number of attributes, the number of CFD candidates generated undergoes a combinatorial explosion.
- the present invention was devised in view of the above problems, and its purpose is to provide a system, apparatus, method, and program capable of efficiently obtaining a set of rules useful for grasping or correcting the contents of a database.
- a storage device for storing a database; a data processing device; and an output device. The data processing device includes rule candidate generation means for generating rule candidates from the database, and rule validity determination means for determining whether or not a rule candidate is valid for the contents of the database. The rule candidate generation means generates a set of items, each consisting of an attribute-value pair in the database, whose frequency in the database is equal to or higher than a predetermined threshold value, and, as an initial value of the rule candidates, generates a rule set in which the condition part / premise part (LHS) is empty and the result part (RHS) is one of the items, and stores it in the storage unit. For each rule of the rule candidates generated by the rule candidate generation means, the rule validity determination means judges the rule to be valid if, for the tuples of the database that match the condition part / premise part of the rule, the result part of the rule matches at or above a predetermined certainty threshold.
- the rule candidate generation means sets the item as a new condition part / premise part, and sets the size of the condition part / premise part to the rule candidate generated last time.
- the rule candidate generation unit repeats the process of generating new rule candidates, in which the size of the condition part / premise part is increased by one from the previously generated rule candidates, until no new rule candidate is generated and the new rule candidates become empty.
- the rule candidate generation means reads the database, generates an item having a frequency equal to or higher than a predetermined threshold that includes a pair of attribute and value of the database, As an initial value of a rule candidate, generating a rule set having a condition part / premise part (LHS) as empty and a result part (RHS) as the item, and storing the rule set in a storage unit;
- the rule validity determination means for the generated rule candidate rule, for the tuple of the database that matches the condition part / premise part of the rule, If there is a match with a predetermined certainty threshold or higher, the rule is determined to be valid and output from the output device,
- C) Check whether the rule candidate to be validated is empty, and if the rule candidate to be validated is not empty, the rule validity judging means returns to step (b) , (D) When the rule
- step (E) Check whether or not the new rule candidate generated in step (d) is empty. If the new rule candidate is empty, the rule discovery is terminated, and the new rule candidate is not empty. In this case, a rule finding method including the steps is provided, which returns to step (b).
- a database is read, an item including a pair of attributes and values of the database and having a frequency equal to or higher than a predetermined threshold value is generated, and the condition part A part (LHS) is empty, and a result part (RHS) generates a rule that is the item and stores it in a storage unit;
- the rule candidate rule for a tuple of the database that matches the condition part / premise part of the rule, the rule consequent part has a predetermined certainty factor.
- rule candidate generation means for generating a new rule candidate based on the content of the database and the frequency threshold set from the input device or the generated rule candidate;
- a rule validity determination means for determining whether or not the rule candidate is valid for the contents of the database,
- the rule candidate generation means reads the database, generates an item consisting of a pair of attribute and value of the database, the frequency of which is equal to or higher than the threshold value, and sets the condition part / premise part as an initial value of the rule candidate.
- the rule validity determination means inputs the rule consequent part from the input device for the tuple of the database that matches the condition part of the rule. If the match is greater than or equal to the certainty threshold, the rule is judged to be valid and output to the output device, When the rule candidate to be validated is empty, the rule candidate generating means excludes a rule that is redundant with respect to the valid rule from the rule candidate search, and uses the item as a condition part / premise part.
- FIG. 3 is a flowchart showing the operation of the first exemplary embodiment of the present invention. It is a figure for demonstrating the specific example of operation
- rules are generated in order starting from a candidate with a small size of the condition part / premise part, and when an appropriate rule is found, the rule becomes redundant.
- pruning the rules so as to be excluded from the subsequent search for candidate rules it is possible to efficiently enumerate the rules (approximate CFD) that are almost valid.
- Based on the database and setting parameters, a set of rules (approximate CFDs) that almost match the contents of the database is calculated. More specifically, there are provided a rule candidate generating means (apparatus) (21 in FIG. 1) for generating new rule candidates based on the contents of the database and setting parameters or on already generated rule candidates, and a rule validity judging means (apparatus) (22 in FIG. 1) for checking whether the rule candidates are valid for the contents of the database.
- the rule candidate generation means (device) (21 in FIG. 1) generates rule (CFD) candidates from the contents of the database or the item set obtained in the previous step and, when a rule is valid, performs pruning so that redundant rules are not output. Eventually, when the rule (CFD) candidates generated by the rule candidate generation means (device) (21 in FIG. 1) become empty (when no rule (CFD) candidate can be generated), the rule discovery calculation is terminated.
- a storage device (3 in FIG. 1) for storing a database, a data processing device (2 in FIG. 1), and an output device (4 in FIG. 1) are provided.
- the data processing device (2) includes rule candidate generation means (21 in FIG. 1) for generating rule candidates from the database, and determines whether or not the rule candidates are appropriate for the contents of the database. And a rule validity judging means (22 in FIG. 1).
- the rule candidate generation means (21 in FIG. 1) is an item composed of attribute-value pairs in the database, and generates a set of items whose frequency in the database is equal to or higher than a predetermined threshold value. Then, as an initial value of the rule candidate, a rule set having the condition part / premise part as empty and the result part as the item is generated and stored in the storage unit.
- for each rule of the rule candidates generated by the rule candidate generation means, the rule validity determination means determines the rule to be valid and outputs it to the output device (4 in FIG. 1) if, for the tuples of the database that match the condition part / premise part of the rule, the result part of the rule matches at or above a predetermined certainty threshold.
- the rule candidate generation means uses the items as a new condition part / premise part, generates new rule candidates in which the size of the condition part / premise part is increased by one from the previously generated rule candidates, and stores them in the storage unit.
- Validity check (A3, A4, A5 in FIG.
- the rule candidate generation means (21 in FIG. 1) reads out the database, and generates an item that is made up of a pair of attribute and value of the database and whose frequency is equal to or higher than a predetermined threshold (step A1 in FIG. 2). ), As an initial value of the rule candidate, a rule set having the condition part / premise part as empty and the result part as the item is generated and stored in the storage unit (step A2 in FIG. 2),
- the rule validity judging means (22 in FIG. 1) is configured such that, for the generated rule candidate rule, the rule against the tuple of the database matching the condition part / premise part of the rule.
- step (C) Check whether or not the rule candidate to be validated is empty (whether or not it still remains), and if the rule candidate to be validated is not empty, the rule validity judging means Returning to step (b) (step A5 in FIG. 2), (D) When the rule candidate for validity determination is empty (when validity determination has been performed for all rule candidates), the rule candidate generation means sets the item as a new condition part / premise part.
- a new rule candidate is generated by increasing the size of the condition part / premise part by one from the previously generated rule candidate, and is stored in the storage unit (step A6 in FIG. 2).
- it is determined whether or not the new rule candidate generated in step (d) is empty (step A7 in FIG. 2). If the new rule candidate is empty, rule discovery ends; if the new rule candidate is not empty, the process returns to step (b).
- an exemplary first embodiment of the present invention includes an input device 1 such as a keyboard, a data processing device 2 that operates under program control, a storage device 3, a display device, a printing device, and the like.
- the output device 4 is included.
- the storage device 3 includes a database storage unit 31 composed of a magnetic disk device or the like.
- the database storage unit 31 stores a database. Data in this database is read out by the data processing device 2 to extract CFD rules.
- the data processing device 2 includes a rule candidate generation unit 21 and a rule validity determination unit 22.
- the rule candidate generation unit 21 generates database rule candidates stored in the database storage unit 31.
- the rule candidate generating unit 21 generates rule candidates using the parameters given from the input device 1 and the rule candidates generated in the previous step (generated rule candidates), and the generated rule The candidate is stored in the storage unit.
- the storage unit may be a storage unit (memory device) (not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation unit 21, or a predetermined storage area in the storage device 3. It may be.
- the rule validity determination unit 22 checks whether the rule generated by the rule candidate generation unit 21 is a valid rule. If the rule is a valid rule, the rule is output to the output device 4.
- “appropriate” means that the number of tuples in the database matching the rule is greater than or equal to a predetermined frequency threshold, and that the result part of the rule and the content of the tuples match at or above the certainty threshold.
- the parameters given from the input device 1 include a frequency threshold and a certainty threshold, and the parameters are referred to by the rule candidate generation means 21.
- FIG. 2 is a flowchart for explaining the operation of the present embodiment. The operation of the present embodiment will be described in detail with reference to FIGS.
- the parameters given from the input device 1 and the contents of the database given from the database storage unit 31 are supplied to the rule candidate generating means 21.
- the rule candidate generation means 21 generates an attribute-value pair (this is called an “item”) that appears in the database (step A1).
- the rule candidate generation unit 21 stores the generated item set in a storage unit (not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation unit 21, or a predetermined storage area of the storage device 3.
- the rule candidate generation unit 21 stores the generated initial rule candidates (CFD candidates) in a storage unit (not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation unit 21, or the storage device 3. Store in a predetermined storage area.
- the rule condition / premise part (LHS) is shown as a condition part.
- the rule validity determination unit 22 compares each rule candidate (CFD candidate) generated by the rule candidate generation unit 21 against the database stored in the database storage unit 31 and checks whether the rule is valid. Specifically, for the tuples in the database that match the condition part of the rule, if the result part of the rule and the content of the tuples match at or above the parameter (certainty threshold) p, the rule validity determination means 22 determines that the rule is valid (Yes in step A3).
- when a valid rule is obtained from the rule candidates, the rule validity determining means 22 outputs the valid rule (CFD) to the output device 4 (step A4). If the rule is not valid, it is not output to the output device 4.
- CFD valid rule
- in generating rule candidates of a larger size, the rule candidate generation unit 21 does not generate, as rule candidates, rules whose result part contains the item that is the consequent part of the rule determined to be valid. That is, such rule candidates of a larger size are pruned.
- Steps A3 and A4 are repeated, and when there are no more rule candidates (CFD candidates) for validity determination (that is, when the rule candidates are empty) (Yes in step A5), the rule candidate generating means 21 uses the items as a new condition part and generates new rule candidates (CFD candidates) in which the size of the condition part is increased by one (step A6).
- CFD candidates rule candidates for validity determination
- the rule candidate generation means 21 determines whether or not the new rule candidate generated in step A6 is empty (step A7).
- if the new rule candidate generated in step A6 is empty, the rule discovery calculation process is terminated (Yes in step A7).
- if the new rule candidate (CFD candidate) generated in step A6 is not empty, the process returns to step A3, where the rule validity determination means 22 determines whether the rule is valid and a valid rule is output.
- FIG. 3B Table 1 below
- the database storage unit 31 is registered with a data set composed of the attributes and tuples shown in Table 1 below.
- the database example is a simplified example for the sake of explanation.
- FIG. 3A illustrates a specific example of an item set, a candidate for an initial rule (CFD), and a new rule (CFD) candidate corresponding to the steps in FIG.
- FIG. 3C is a diagram illustrating an example of a rule (approximate CFD) output from the output device 4 as a result of the calculation of FIG.
- the symbol _ is a variable that matches an arbitrary value.
- the number in parentheses is the frequency of the item.
- the rule candidate generation unit 21 sets the condition part / premise part (LHS) to empty for each item (item) in the extracted list, and sets each extracted item (item) as a result part ( RHS) is generated, and this is temporarily stored in a storage unit (not shown) as an initial candidate for the rule (CFD) (step A1 in FIGS. 2 and 3A).
- LHS condition part / premise part
- RHS result part
- the remaining six rules ( ⁇ 1 to ⁇ 4, ⁇ 6, ⁇ 7) are not valid for the contents of the database.
- the rule (CFD) candidate is empty (Yes in step A5 in FIG. 2).
- the rule (CFD) ⁇ 5 determined to be appropriate for the contents of the database by the rule validity determination means 22 is output to the output device 4.
- the rule whose validity has been determined is deleted from the rule candidates. That is, among the rule candidates ( ⁇ 1, ⁇ 2,..., ⁇ 7 ⁇ ), the validity is determined by the rule validity determining means 22, and the rule (CFD) ⁇ 5 determined to be valid, and Rules that are determined to be invalid ( ⁇ 1 to ⁇ 4, ⁇ 6, ⁇ 7) are deleted from the rule candidates, and as a result, the rule candidates that are checked for validity are empty (the rest are zero).
- deletion of a rule from a rule candidate may be configured such that a deletion flag of 1 bit or the like is prepared for each rule candidate rule in addition to deletion from a memory.
- the rule candidate is determined to be empty.
- since the rule candidates (CFD candidates) for validity determination are empty (Yes in step A5 in FIG. 2), the rule candidate generation unit 21 generates new rule candidates (step A6 in FIGS. 2 and 3A). Specifically, from the above set of items, it selects two elements that do not contradict each other (the same attribute does not take different values), whose combined set is one larger in size than before combination, and whose frequency is equal to or greater than the threshold value k, and combines them (the combined element is called an itemset). One of the elements of the itemset is taken as the consequent part, and the rest as the condition part / premise part.
- Attribute 2 P
- a rule that is redundant with respect to ⁇ ′ is excluded from the target of the subsequent rule candidate search.
- the rule validity determination means 22 has determined the validity of six rule candidates out of six rule (CFD) candidates whose size is 1 for the condition part and the premise part (determined that there is validity).
- the rule (CFD) candidate whose validity is to be checked is empty (Yes in step A5 in FIG. 2).
- the frequency and certainty factor of the rule determined to be valid are output together with the rule, but only one of the frequency and the certainty factor may be output, or neither needs to be output.
- the rules may be sorted and output in the order of certainty.
- the output form of the rule determined to be valid is arbitrary.
- the generation of redundant rules as described above can be prevented by pruning.
- the second embodiment of the present invention includes a rule finding program 5.
- the rule discovery program 5 is read into the data processing device 6 and controls the operation of the data processing device 6.
- the data processing device 6 executes the following processing, that is, the same processing as the processing by the data processing device 2 in the first embodiment, under the control of the rule finding program 5.
- the rule discovery program 5 includes: (a) processing to read out the database, generate items, each consisting of an attribute-value pair of the database, whose frequency is equal to or higher than a predetermined threshold, and, as an initial value of the rule candidates, generate rules (CFDs) in which the condition part is empty and the consequent part is one of the items, and store them in the storage unit; (b) processing in which, for each generated rule candidate (CFD), if, for the tuples of the database that match the condition part of the rule, the result part of the rule matches at or above a predetermined certainty threshold, the rule is determined to be valid (approximate CFD) and output to the output device; (c) processing to return to (b) if the rule candidates (CFD candidates) for validity determination are not empty; (d) processing in which, when the rule candidates (CFD candidates) for validity determination are empty, the items are used as a new condition part, and new rule candidates (new CFD candidates), in which the size of the condition part is increased by one from the previously generated rule candidates, are generated and stored in the storage unit; and (e) processing to determine whether or not the new rule candidates generated in (d) are empty, to terminate the rule discovery process if they are empty, and to return to (b) if they are not empty.
- in the process of outputting the rule (approximate CFD) determined to be valid in (b) to the output device, instead of immediately outputting the rule (approximate CFD) to the output device, the rule may be temporarily stored in a list (linear list).
- the list may be output when the rule candidate (CFD candidate) for determining validity becomes empty.
- the list is stored in a storage device buffer or the like. An arbitrary method can be used as output control for outputting a rule determined to be valid.
- a command (such as a rule discovery program execution command) and setting parameters are input from the input device 1, and an initial rule candidate is generated using a database stored in the database storage unit 31 in the storage device 3. Next, it is determined whether or not the generated rule candidate is valid. If it is valid, the rule is added to the list. When the coverage of the database satisfies the cutoff condition due to the set of rules stored in the list, the set of rules in the list is displayed on the output device 4.
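The level-wise procedure described in the items above (steps A1 to A7) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it handles only constant rules, it grows the left side of rejected candidates by one frequent item instead of combining itemsets as described for step A6, and the names `matches`, `support`, `confidence`, `discover` and the sample table are hypothetical. The parameters `k` and `p` stand for the frequency threshold and the certainty threshold.

```python
def matches(row, lhs):
    """True if the tuple carries every attribute-value item of the left side."""
    return all(row.get(attr) == value for attr, value in lhs)

def support(table, lhs):
    """Support: number of tuples that match the left side."""
    return sum(1 for row in table if matches(row, lhs))

def confidence(table, lhs, rhs):
    """Confidence: share of the matching tuples that also carry the result item."""
    matched = [row for row in table if matches(row, lhs)]
    if not matched:
        return 0.0
    attr, value = rhs
    return sum(1 for row in matched if row.get(attr) == value) / len(matched)

def discover(table, k, p):
    """Level-wise search for constant rules with support >= k and confidence >= p."""
    # Step A1: items (attribute-value pairs) whose frequency is at least k.
    counts = {}
    for row in table:
        for item in row.items():
            counts[item] = counts.get(item, 0) + 1
    items = sorted(item for item, n in counts.items() if n >= k)
    # Step A2: initial candidates have an empty left side and one item as result.
    candidates = [(frozenset(), rhs) for rhs in items]
    valid = []
    while candidates:  # Step A7: stop once no candidate is left
        # Steps A3 to A5: check every candidate of the current size.
        rejected = []
        for lhs, rhs in candidates:
            if support(table, lhs) >= k and confidence(table, lhs, rhs) >= p:
                valid.append((lhs, rhs))
            else:
                rejected.append((lhs, rhs))
        # Step A6 (simplified assumption): grow rejected left sides by one item.
        grown = set()
        for lhs, rhs in rejected:
            used = {attr for attr, _ in lhs} | {rhs[0]}
            for item in items:
                if item[0] in used:
                    continue  # an attribute may appear only once in a rule
                new_lhs = lhs | {item}
                if support(table, new_lhs) < k:
                    continue  # below the frequency threshold
                # Pruning: skip rules made redundant by an already valid rule.
                if any(v_rhs == rhs and v_lhs <= new_lhs for v_lhs, v_rhs in valid):
                    continue
                grown.add((new_lhs, rhs))
        candidates = sorted(grown, key=lambda c: (sorted(c[0]), c[1]))
    return valid

# Hypothetical sample data, not Table 1 of the specification.
table = [
    {"city": "Tokyo", "code": "03", "type": "A"},
    {"city": "Tokyo", "code": "03", "type": "B"},
    {"city": "Tokyo", "code": "03", "type": "A"},
    {"city": "Osaka", "code": "06", "type": "B"},
    {"city": "Osaka", "code": "06", "type": "A"},
    {"city": "Osaka", "code": "06", "type": "B"},
]
rules = discover(table, k=2, p=0.9)
```

On this sample the search returns four rules with a one-item left side (city to code and code to city); larger rules with the same result part are pruned as not left-reduced.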
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Provided are a device, a method, and a program with which a set of rules useful for ascertaining or correcting database content can be obtained with high efficiency. The present invention (see fig. 1) is provided with: a rule-candidate generation means (21) for generating new rule candidates on the basis of database content, set parameters, and previously generated rule candidates; and a rule-appropriateness determination means (22) for checking whether the rule candidates are appropriate for the database content.
Description
[Description of related applications]
The present invention is based on a Japanese patent application: Japanese Patent Application No. 2012-110921 (filed on May 14, 2012), and the entire description of the application is incorporated herein by reference.
The present invention relates to a rule discovery technique, and more particularly to a system, method, apparatus, and program for database rule discovery.
In database rule discovery, rules are expressed, for example, as CFDs (Conditional Function Dependencies), and the CFD rules that match the contents of the database are output from the generated CFD rule candidates. The following outlines CFD, which is a prerequisite for understanding the invention.
CFD is a rule indicating that a functional dependency (Functional Dependency, abbreviated as “FD”) between data attributes holds for a tuple set specified by a condition. It consists of the specification of attribute values in the condition part and premise part, which form the left side of the rule (LHS: Left Hand Side), and in the consequent part on the right side of the rule (RHS: Right Hand Side). The condition part and the consequent part are also called the conditional clause and the subordinate clause, respectively.
The condition part designates a subset of the data (a tuple set), and the fact that attribute X has attribute value x is written “X = x”. Here, “x” means that the attribute value is a specific value. Such an expression of the attribute value is called “Constant” (“Constant” means, for example, a constant).
The premise part consists only of the designation of attributes. The fact that an attribute value does not take a specific value (that is, it is a wildcard that matches an arbitrary value) is written “X = _”. Such an expression of the attribute value is called “Variable” (“Variable” means, for example, a variable). Here, ‘_’ is also referred to as an “unnamed variable”.
There are two types of consequent part:
(A) one consisting of an attribute and an attribute-value designation (for example, Rule 1 below); and
(B) one designating an attribute only (for example, Rule 2 below).
In case (A), the consequent part is written, for example, “A = a”; in case (B), it is written, for example, “A = _”. If an attribute value is specified in the consequent part, the premise part can be omitted. The premise part and the consequent part may each consist of a plurality of attributes and their respective attribute-value designations. Examples of rules are shown below.
Rule 1: X1 → A (x1 || a)
Rule 2: X1, X2 → A (x1, _ || _)
Rule 1 means “when attribute X1 has attribute value x1, attribute A has attribute value a”. When Rule 1 holds, the consequent part takes the specified value in the tuple set to which the condition part applies. That is, t[A] = a for every tuple t in the tuple set satisfying the condition X1 = x1 (where t[A] denotes the value of attribute A in tuple t). A rule whose consequent part is fixed to a specified value in this way is called a “Constant CFD”.
Rule 2 means “when attribute X1 has attribute value x1, attribute A is determined by attribute X2”. When Rule 2 holds, a functional dependency exists between the attributes specified in the premise part and the consequent part within the tuple set to which the condition part applies. That is, for any tuple pair t1, t2 in the tuple set satisfying the condition “X1 = x1”, if t1[X2] = t2[X2], then t1[A] = t2[A]. A rule in which the consequent part is not fixed to a specified value, but a dependency holds between the attributes, is called a “Variable CFD”. In other words, a CFD is a Variable CFD when the right side of ‘||’ in the pattern tuple is the unnamed variable ‘_’ (tp[A] = _).
The symbol ‘||’ in the pattern tuple (x1 || a) of Rule 1 separates the attribute value of X1 on the left side from that of A on the right side. Rule 1, “X1 → A (x1 || a)”, is sometimes written “(X → A, (x || a))”; the two notations differ only in the outer parentheses and the comma, and clearly denote the same rule. Similarly, Rule 2, “X1, X2 → A (x1, _ || _)”, is also written “([X1, X2] → A, (x1, _ || _))”.
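As an illustrative sketch only (it is not part of the original disclosure), the meaning of Rule 1 and Rule 2 can be checked on a small table as follows. The table contents and the function names `matches`, `holds_constant_cfd`, and `holds_variable_cfd` are hypothetical and chosen for this example.

```python
# Hypothetical toy relation; the attribute names X1, X2, A follow Rules 1 and 2.
rows = [
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "q", "A": "a"},
    {"X1": "y1", "X2": "p", "A": "b"},
]

WILDCARD = "_"  # the unnamed variable


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern; '_' matches any value."""
    return all(v == WILDCARD or row[attr] == v for attr, v in pattern.items())


def holds_constant_cfd(rows, lhs, rhs_attr, rhs_value):
    """Rule 1 style: every tuple t matching the LHS pattern has t[A] == a."""
    return all(r[rhs_attr] == rhs_value for r in rows if matches(r, lhs))


def holds_variable_cfd(rows, lhs, rhs_attr):
    """Rule 2 style: among tuples matching the LHS pattern,
    equal LHS values imply equal values of the RHS attribute."""
    seen = {}
    for r in rows:
        if not matches(r, lhs):
            continue
        key = tuple(r[attr] for attr in lhs)
        # Remember the first RHS value seen for this key; any later mismatch violates the rule.
        if seen.setdefault(key, r[rhs_attr]) != r[rhs_attr]:
            return False
    return True


rule1_ok = holds_constant_cfd(rows, {"X1": "x1"}, "A", "a")          # X1 -> A (x1 || a)
rule2_ok = holds_variable_cfd(rows, {"X1": "x1", "X2": "_"}, "A")    # X1, X2 -> A (x1, _ || _)
print(rule1_ok, rule2_ok)
```

Both checks require the rule to hold on every matching tuple, that is, they correspond to a confidence of 1.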
Support and confidence, for example, are used as indices of how valid a CFD is for given data. The support is the number of tuples that match the condition part and the premise part of the CFD.
The confidence is the proportion of tuples for which the CFD rule holds among the tuples that match the condition part and the premise part.
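The two indices can be illustrated with a short sketch for a Constant CFD. The data and the function name `support_and_confidence` are assumptions made for this example; the treatment of a rule with zero support (confidence reported as `None`) is also an assumption, since the text does not define that case.

```python
# Hypothetical toy relation for the rule X1 -> A (x1 || a).
rows = [
    {"X1": "x1", "A": "a"},
    {"X1": "x1", "A": "a"},
    {"X1": "x1", "A": "a"},
    {"X1": "x1", "A": "b"},
    {"X1": "y1", "A": "b"},
]


def support_and_confidence(rows, lhs, rhs_attr, rhs_value):
    """Support: number of tuples matching the LHS.
    Confidence: fraction of those tuples whose RHS attribute has the given value."""
    matched = [r for r in rows if all(r[k] == v for k, v in lhs.items())]
    support = len(matched)
    if support == 0:
        return 0, None  # confidence is undefined when nothing matches (assumption)
    satisfied = sum(1 for r in matched if r[rhs_attr] == rhs_value)
    return support, satisfied / support


support, confidence = support_and_confidence(rows, {"X1": "x1"}, "A", "a")
print(support, confidence)  # 4 tuples match X1 = x1; 3 of them have A = a
```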
Given a plurality of CFDs, a CFD that satisfies both of the two conditions “left-reduced” and “most-general” is said to be “minimal”. First, “left-reduced” is explained. Given a plurality of CFDs, a CFD whose left-hand side (LHS) attribute set does not contain the LHS attribute set of any other CFD is said to be “left-reduced”.
For example, given Rule 3 and Rule 4 below, the left-hand side of Rule 4 contains the left-hand side of Rule 3 (X1 ⊂ X1, X2), so Rule 4 is not “left-reduced”. Conversely, the left-hand side of Rule 3 does not contain the left-hand side of Rule 4, so Rule 3 is “left-reduced”. In this case, Rule 4 can be deleted as a CFD that is redundant with respect to Rule 3.
Rule 3: X1, Y → A (x1, _ || _)
Rule 4: X1, X2, Y → A (x1, x2 || _)
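The left-reduced test for Rules 3 and 4 can be sketched as follows. This is a hedged illustration: the dictionary representation of a rule and the function name `is_left_reduced` are hypothetical, the value of Y in Rule 4 is assumed to be ‘_’ (the pattern tuple above does not list it), and the comparison looks only at attribute sets of rules with the same consequent part, as in the definition given above.

```python
# Rules are represented as dicts; this representation is an assumption for illustration.
rule3 = {"lhs": {"X1": "x1", "Y": "_"}, "rhs": ("A", "_")}
rule4 = {"lhs": {"X1": "x1", "X2": "x2", "Y": "_"}, "rhs": ("A", "_")}  # Y = '_' assumed


def is_left_reduced(rule, rules):
    """A rule is left-reduced if no other rule with the same consequent part
    has an LHS attribute set that is a proper subset of this rule's LHS attribute set."""
    attrs = set(rule["lhs"])
    return not any(
        other is not rule
        and other["rhs"] == rule["rhs"]
        and set(other["lhs"]) < attrs  # proper-subset test on attribute names
        for other in rules
    )


rules = [rule3, rule4]
kept = [r for r in rules if is_left_reduced(r, rules)]
print([sorted(r["lhs"]) for r in kept])
```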
Next, “most-general” is explained. Given a plurality of CFDs, a CFD is said to be “most-general” if no attribute-value constant on its left-hand side can be replaced with ‘_’ (Variable).
For example, given Rule 5 and Rule 6 below, Rule 5 is obtained by replacing the attribute value x2 of Rule 6 with Variable. Rule 6 is therefore not “most-general”. Conversely, Rule 5 is “most-general”. In this case, Rule 6 can be deleted as a CFD that is redundant with respect to Rule 5.
Rule 5: X1, X2 → A (x1, _ || a)
Rule 6: X1, X2 → A (x1, x2 || a)
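The most-general test for Rules 5 and 6 can be sketched in the same spirit. The representation and the names `generalizations` and `is_most_general` are hypothetical; the sketch interprets “can be replaced with ‘_’” as “the replacement yields another CFD in the given set”, which is how the example above uses the term.

```python
WILDCARD = "_"

# Rules 5 and 6 as dicts; the representation is an assumption for illustration.
rule5 = {"lhs": {"X1": "x1", "X2": "_"}, "rhs": ("A", "a")}
rule6 = {"lhs": {"X1": "x1", "X2": "x2"}, "rhs": ("A", "a")}


def generalizations(rule):
    """Yield copies of the rule with exactly one LHS constant replaced by '_'."""
    for attr, value in rule["lhs"].items():
        if value != WILDCARD:
            yield {"lhs": {**rule["lhs"], attr: WILDCARD}, "rhs": rule["rhs"]}


def is_most_general(rule, rules):
    """A rule is most-general if none of its one-step generalizations is in the given set."""
    return not any(g in rules for g in generalizations(rule))


rules = [rule5, rule6]
minimal_wrt_generality = [r for r in rules if is_most_general(r, rules)]
print(minimal_wrt_generality)
```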
This completes the overview of CFDs.
An apparatus for discovering rules from a database comprises storage means (a storage unit), such as a magnetic disk, for storing CFDs; calculation means (a calculation unit) for generating CFD candidates and determining whether each CFD candidate matches the contents of the database; and saving means (a saving unit) for saving, in a storage device, the CFDs determined to match the contents. The storage means stores the CFDs obtained by the rule discovery algorithm. The calculation means generates the CFD candidates to be checked, examines whether each candidate matches the contents of the database, and outputs it as a valid CFD if it does. The saving means saves the valid CFDs thus obtained in the storage device.
Techniques for discovering database rules include, for example, as described in Non-Patent Document 1:
(1) a technique that generates Constant CFD candidates from a free itemset and its corresponding closed itemset;
(2) a technique that generates CFD candidates by generating a list of attribute-value pairs by breadth-first search, placing one of its terms as the dependent term (A) and the rest as the condition part (X), and thereby obtaining the expression X → A; and
(3) a technique that generates CFD candidates by placing a free itemset as the condition terms and one attribute not included in the free itemset as the dependent term (consequent part), and performing a depth-first search over the other attributes to be added to the condition terms.
As described above, confidence is an index of how closely a CFD matches the contents of the database.
As a technique for discovering rules (CFDs) that do not match the contents of the database completely but have high confidence, Non-Patent Document 2 discloses a technique that uses breadth-first search to discover CFDs whose confidence is equal to or greater than a threshold (hereinafter referred to as “approximate CFDs”, that is, CFDs that “approximately hold”).
As for checking the validity of rules, Patent Document 1, for example, discloses a rule base management apparatus comprising: a rule base that stores rules each consisting of a condition part and a conclusion part; a case information database that stores case information on the results of applying the rules; an association unit that associates each rule with the case information satisfying it; and a validity check unit that causes a case search unit to retrieve a case information set from the case information database using the condition part of the rule to be checked as a key, calculates the proportion of case information in the set that satisfies the conclusion part of the rule, and checks the validity of the rule based on that proportion. Patent Document 2 discloses a configuration that finds functional dependencies (FDs) between the attributes of a relation and performs normalization by decomposing the relation.
The following describes an analysis of the related art made by the present inventor.
The first problem is that the CFDs obtained by the CFD discovery algorithm disclosed in Non-Patent Document 1 are only those that hold completely on the database, that is, those with a confidence of 1; CFDs that “approximately hold” on the database cannot be enumerated.
The second problem is that the approximate CFD discovery algorithm disclosed in Non-Patent Document 2 requires an extremely long computation time. This is because, for a large database, particularly one with many attributes, the number of generated CFD candidates undergoes a combinatorial explosion.
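The growth in the number of candidates can be made concrete with a rough counting sketch. It rests on simplifying assumptions that are not taken from the source: every attribute has the same number of distinct values, and each attribute placed on the left-hand side carries either ‘_’ or one of those values. The function names are hypothetical.

```python
from math import comb


def lhs_patterns(num_attrs, domain_size, lhs_size):
    """Number of LHS patterns of a given size: choose the attributes,
    then give each one either '_' or one of domain_size constants."""
    return comb(num_attrs, lhs_size) * (domain_size + 1) ** lhs_size


def total_lhs_patterns(num_attrs, domain_size):
    """Total over all LHS sizes; by the binomial theorem this equals (domain_size + 2) ** num_attrs."""
    return sum(lhs_patterns(num_attrs, domain_size, k) for k in range(num_attrs + 1))


for m in (5, 10, 20):
    # The count grows exponentially in the number of attributes m.
    print(m, total_lhs_patterns(m, domain_size=10))
```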
The present invention has been devised in view of the above problems, and its object is to provide a system, apparatus, method, and program capable of efficiently obtaining a set of rules useful for understanding or correcting the contents of a database.
According to the present invention, there is provided a rule discovery system comprising:
a storage device that stores a database;
a data processing device; and
an output device,
wherein the data processing device comprises:
rule candidate generation means for generating rule candidates from the database; and
rule validity determination means for determining whether each rule candidate is valid with respect to the contents of the database,
wherein the rule candidate generation means
generates a set of items, each item being an attribute-value pair in the database whose frequency in the database is equal to or greater than a predetermined threshold, and
generates, as initial rule candidates, a rule set in which the condition part / premise part (LHS) is empty and the consequent part (RHS) is one of the items, and stores the rule set in a storage unit,
wherein the rule validity determination means,
for each rule of the rule candidates generated by the rule candidate generation means,
determines the rule to be valid and outputs it to the output device if, among the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches at a rate equal to or greater than a predetermined confidence threshold,
wherein, when the rule candidates subject to validity determination are exhausted, the rule candidate generation means generates new rule candidates that use the items as a new condition part / premise part, the size of which is one greater than that of the previously generated rule candidates, and stores them in the storage unit, and
wherein
the determination by the rule validity determination means of the validity of the new rule candidates generated by the rule candidate generation means, and
the generation by the rule candidate generation means of new rule candidates whose condition part / premise part is one greater in size than that of the previously generated rule candidates
are repeated until the rule candidate generation means can no longer generate such new rule candidates and the new rule candidates become empty.
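The first task of the rule candidate generation means, namely generating the frequent items and the initial candidates with an empty left-hand side, can be sketched as follows. This is a minimal illustration under assumptions: the table, the threshold value, the tuple representation of a rule, and the names `frequent_items` and `initial_candidates` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy database: each row is a tuple, each key an attribute.
rows = [
    {"City": "Tokyo", "Country": "JP"},
    {"City": "Tokyo", "Country": "JP"},
    {"City": "Osaka", "Country": "JP"},
    {"City": "Paris", "Country": "FR"},
]


def frequent_items(rows, min_freq):
    """Items are (attribute, value) pairs; keep those occurring at least min_freq times."""
    counts = Counter((attr, value) for row in rows for attr, value in row.items())
    return sorted(item for item, n in counts.items() if n >= min_freq)


def initial_candidates(items):
    """Initial rule candidates: empty LHS, one frequent item as the RHS."""
    return [(frozenset(), item) for item in items]


items = frequent_items(rows, min_freq=2)
candidates = initial_candidates(items)
print(candidates)
```

Each candidate is a pair (LHS, RHS), where the LHS is a set of items; the initial candidates therefore state only that a given attribute value is common across the whole database.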
According to the present invention, there is provided a rule discovery method for discovering rules from a database by a data processing device comprising rule candidate generation means and rule validity determination means, the method comprising the steps of:
(a) the rule candidate generation means reading the database, generating items each consisting of an attribute-value pair of the database whose frequency is equal to or greater than a predetermined threshold, generating, as initial rule candidates, a rule set in which the condition part / premise part (LHS) is empty and the consequent part (RHS) is one of the items, and storing the rule set in a storage unit;
(b) the rule validity determination means, for a rule of the generated rule candidates, determining the rule to be valid and outputting it from an output device if, among the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches at a rate equal to or greater than a predetermined confidence threshold;
(c) checking whether the rule candidates subject to validity determination are empty and, if they are not empty, the rule validity determination means returning to step (b);
(d) if the rule candidates subject to validity determination are empty, the rule candidate generation means generating new rule candidates that use the items as a new condition part / premise part, the size of which is one greater than that of the previously generated rule candidates, and storing them in the storage unit; and
(e) checking whether the new rule candidates generated in step (d) are empty, terminating rule discovery if they are empty, and returning to step (b) if they are not empty.
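Steps (a) to (e) amount to a level-wise loop over the size of the left-hand side. The following is a simplified sketch under stated assumptions, not the implementation of the disclosure: items are constants only, a candidate with zero support is given confidence 0, no redundancy pruning is performed, and all names (`frequent_items`, `confidence`, `candidates_of_size`, `discover`) and threshold values are hypothetical.

```python
from collections import Counter
from itertools import combinations

rows = [
    {"City": "Tokyo", "Country": "JP"},
    {"City": "Tokyo", "Country": "JP"},
    {"City": "Osaka", "Country": "JP"},
    {"City": "Paris", "Country": "FR"},
]


def frequent_items(rows, min_freq):
    counts = Counter((a, v) for row in rows for a, v in row.items())
    return sorted(item for item, n in counts.items() if n >= min_freq)


def confidence(rows, lhs, rhs):
    """Fraction of tuples matching the LHS whose RHS attribute has the RHS value."""
    matched = [r for r in rows if all(r[a] == v for a, v in lhs)]
    if not matched:
        return 0.0  # zero support treated as confidence 0 (assumption)
    attr, value = rhs
    return sum(r[attr] == value for r in matched) / len(matched)


def candidates_of_size(items, size):
    """All rules whose LHS has `size` items on distinct attributes
    and whose RHS item is on an attribute not used in the LHS."""
    out = []
    for lhs in combinations(items, size):
        lhs_attrs = [a for a, _ in lhs]
        if len(set(lhs_attrs)) < size:
            continue  # at most one value per attribute
        out.extend((frozenset(lhs), rhs) for rhs in items if rhs[0] not in lhs_attrs)
    return out


def discover(rows, min_freq, min_conf):
    items = frequent_items(rows, min_freq)            # step (a): frequent items
    size = 0
    candidates = candidates_of_size(items, size)      # step (a): empty-LHS candidates
    valid = []
    while candidates:                                 # step (e): stop when nothing new
        for lhs, rhs in candidates:                   # steps (b) and (c): test each candidate
            if confidence(rows, lhs, rhs) >= min_conf:
                valid.append((lhs, rhs))
        size += 1                                     # step (d): LHS size grows by one
        candidates = candidates_of_size(items, size)
    return valid


valid = discover(rows, min_freq=2, min_conf=0.75)
for lhs, rhs in valid:
    print(sorted(lhs), "->", rhs)
```

On this toy table the sketch reports the rule with an empty left-hand side and consequent Country = JP, and also the rule City = Tokyo → Country = JP. The second rule is redundant with respect to the first in the left-reduced sense, which is the kind of candidate a pruning step would remove.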
According to the present invention, there is provided a program causing a computer to execute:
(a) a process of reading a database, generating items each consisting of an attribute-value pair of the database whose frequency is equal to or greater than a predetermined threshold, generating, as initial rule candidates, rules in which the condition part / premise part (LHS) is empty and the consequent part (RHS) is one of the items, and storing the rules in a storage unit;
(b) a process of, for a rule of the generated rule candidates, determining the rule to be valid and outputting it from an output device if, among the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches at a rate equal to or greater than a predetermined confidence threshold;
(c) a process of checking whether the rule candidates subject to validity determination are empty and, if they are not empty, returning to (b);
(d) a process of, if the rule candidates subject to validity determination are empty, generating new rule candidates that use the items as a new condition part / premise part, the size of which is one greater than that of the previously generated rule candidates, and storing them in the storage unit; and
(e) a process of checking whether the new rule candidates generated in process (d) are empty, terminating the rule discovery process if they are empty, and returning to (b) if they are not empty.
According to the present invention, there is also provided a computer-readable memory device, or a magnetic or optical disk medium or device, storing the program.
According to the present invention, there is provided a rule discovery apparatus comprising:
rule candidate generation means for generating new rule candidates based on the contents of a database and on either a frequency threshold set from an input device or already generated rule candidates; and
rule validity determination means for determining whether each rule candidate is valid with respect to the contents of the database,
wherein the rule candidate generation means reads the database, generates items each being an attribute-value pair of the database whose frequency is equal to or greater than the threshold, generates, as initial rule candidates, rules in which the condition part / premise part (LHS) is empty and the consequent part (RHS) is one of the items, and
generates rule candidates in ascending order of the size of the condition part / premise part, based on breadth-first search,
wherein, for each rule of the generated rule candidates, the rule validity determination means determines the rule to be valid and outputs it to an output device if, among the tuples of the database that match the condition part of the rule, the consequent part of the rule matches at a rate equal to or greater than a confidence threshold input from the input device,
wherein, when the rule candidates subject to validity determination are exhausted, the rule candidate generation means excludes rules that are redundant with respect to the valid rules from the rule candidate search, and generates new rule candidates that use the items as the condition part / premise part, the size of which is one greater than that of the previously generated rule candidates, and
wherein the determination by the rule validity determination means of the validity of the new rule candidates generated by the rule candidate generation means, and
the generation by the rule candidate generation means of new rule candidates whose condition part / premise part is one greater in size than that of the previously generated rule candidates,
are repeated until the rule candidate generation means can no longer generate such new rule candidates and the new rule candidates become empty.
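The exclusion of redundant rules from the candidate search can be sketched as a filter applied to each new level of candidates. The notion of redundancy used here is an assumption made for illustration: a candidate is treated as redundant when an already valid rule has the same consequent part and a left-hand side contained in the candidate's left-hand side. The names `is_redundant` and `prune` and the sample rules are hypothetical.

```python
def is_redundant(candidate, valid_rules):
    """Redundant (assumed sense): some valid rule has the same RHS
    and an LHS that is a subset of the candidate's LHS."""
    lhs, rhs = candidate
    return any(v_rhs == rhs and v_lhs <= lhs for v_lhs, v_rhs in valid_rules)


def prune(candidates, valid_rules):
    """Drop redundant candidates before they are passed to validity determination."""
    return [c for c in candidates if not is_redundant(c, valid_rules)]


# A rule with an empty LHS has already been judged valid.
valid_rules = [(frozenset(), ("Country", "JP"))]

# Next-level candidates whose LHS contains one item.
candidates = [
    (frozenset({("City", "Tokyo")}), ("Country", "JP")),  # redundant: same RHS, larger LHS
    (frozenset({("Country", "JP")}), ("City", "Tokyo")),  # different RHS, kept
]

remaining = prune(candidates, valid_rules)
print(remaining)
```

Because candidates are generated in ascending order of left-hand-side size, every valid rule that could make a candidate redundant in this sense has already been found by the time the candidate is generated.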
According to the present invention, a set of rules useful for understanding or correcting the contents of a database can be obtained efficiently.
Next, embodiments of the present invention will be described in detail with reference to the drawings. According to the present invention, rule (CFD) candidates are generated by breadth-first search in increasing order of the size of the condition part / premise part. When a valid rule is found, rules that are redundant with respect to it are pruned from the subsequent search for rule candidates, so that approximately holding rules (approximate CFDs) can be enumerated efficiently.
Based on the database and the setting parameters, a set of approximately holding rules (approximate CFDs) that match the contents of the database is computed. More specifically, the system has rule candidate generation means (device) (21 in FIG. 1), which generates new rule candidates based on the contents of the database and on the setting parameters or already generated rule candidates, and rule validity determination means (device) (22 in FIG. 1), which checks whether a rule candidate is valid with respect to the contents of the database.
The rule candidate generation means (device) (21 in FIG. 1) generates rule (CFD) candidates from the contents of the database or from the itemsets obtained in the previous step and, when a rule is valid, prunes the search so that redundant rules are not output. When the rule (CFD) candidates generated by the rule candidate generation means (device) (21 in FIG. 1) finally become empty (that is, when no further rule (CFD) candidates can be generated), the rule discovery computation ends.
According to an embodiment of the system in one aspect, in a system comprising a storage device (3 in FIG. 1) that stores a database, a data processing device (2 in FIG. 1), and an output device (4 in FIG. 1), the data processing device (2) comprises rule candidate generation means (21 in FIG. 1) that generates rule candidates from the database, and rule validity determination means (22 in FIG. 1) that determines whether each rule candidate is valid with respect to the contents of the database.
The rule candidate generation means (21 in FIG. 1) generates a set of items, each consisting of an attribute-value pair in the database, whose frequency in the database is equal to or higher than a predetermined threshold. As initial rule candidates, it then generates a set of rules whose condition part / premise part is empty and whose consequent part is one of the items, and stores the set in a storage unit.
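As a minimal sketch of this first stage, the following Python fragment counts attribute-value items and builds the initial candidates with an empty condition part. The function names, the dictionary-per-tuple representation, and the four-row sample table are assumptions made for illustration only; they are not taken from the specification, and wildcard items such as “attribute = _” are omitted.

```python
from collections import Counter


def frequent_items(rows, k):
    """Return {(attribute, value): frequency} for items occurring in at least k rows."""
    counts = Counter((attr, val) for row in rows for attr, val in row.items())
    return {item: n for item, n in counts.items() if n >= k}


def initial_candidates(items):
    """Initial rule candidates: empty LHS, one frequent item as RHS."""
    return [(frozenset(), item) for item in sorted(items)]


# Hypothetical four-row table (not Table 1 of the specification).
rows = [
    {"a1": "1", "a2": "P", "a3": "S"},
    {"a1": "1", "a2": "P", "a3": "S"},
    {"a1": "2", "a2": "P", "a3": "T"},
    {"a1": "3", "a2": "Q", "a3": "T"},
]
items = frequent_items(rows, k=2)
candidates = initial_candidates(items)
```

Each candidate is a pair of a frozen set (the condition part) and a single item (the consequent part), which makes it straightforward to grow the condition part by one item in later rounds.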
For each rule among the rule candidates generated by the rule candidate generation means, the rule validity determination means (22 in FIG. 1) determines that the rule is valid and outputs it to the output device (4 in FIG. 1) if, for the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches at or above a predetermined certainty threshold. When the rule candidates whose validity is to be determined become empty, the rule candidate generation means (21 in FIG. 1) uses the items as new condition parts / premise parts, generates new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates, and stores them in the storage unit. The following two operations are then repeated (A3 to A7 in FIG. 2):
the validity check (A3, A4, A5 in FIG. 2) by the rule validity determination means (22 in FIG. 1) of the new rule candidates generated by the rule candidate generation means (21 in FIG. 1); and
the generation (A6 in FIG. 2) by the rule candidate generation means (21 in FIG. 1) of new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates.
The repetition continues until the rule candidate generation means (21 in FIG. 1) can no longer generate such new rule candidates, so that the new rule candidates become empty.
According to an embodiment of the rule discovery method in another aspect, the method includes the following steps.
(a) The rule candidate generation means (21 in FIG. 1) reads the database and generates items, each consisting of an attribute-value pair of the database, whose frequency is equal to or higher than a predetermined threshold (step A1 in FIG. 2). As initial rule candidates, it generates a set of rules whose condition part / premise part is empty and whose consequent part is one of the items, and stores the set in a storage unit (step A2 in FIG. 2).
(b) For a rule among the generated rule candidates, the rule validity determination means (22 in FIG. 1) determines that the rule is valid and outputs it from the output device if, for the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches at or above a predetermined certainty threshold (steps A3 and A4 in FIG. 2).
(c) It is checked whether the rule candidates to be checked for validity are empty (that is, whether any remain). If they are not empty, the rule validity determination means returns to step (b) (step A5 in FIG. 2).
(d) If the rule candidates to be checked for validity are empty (that is, validity has been determined for all rule candidates), the rule candidate generation means uses the items as new condition parts / premise parts, generates new rule candidates in which the size of the condition part / premise part is one larger than in the previously generated rule candidates, and stores them in the storage unit (step A6 in FIG. 2).
(e) It is determined whether no new rule candidate was generated in step (d), that is, whether the new rule candidates are empty (step A7 in FIG. 2). If they are empty, rule discovery ends; if they are not empty, the process returns to step (b).
According to the present invention, the discovery of approximately holding rules (approximate CFDs) over the data of a database is accelerated, which makes the invention well suited to discovering rules useful for understanding and correcting the contents of the database. A detailed description based on embodiments follows.
<Embodiment 1>
Referring to FIG. 1, an exemplary first embodiment of the present invention includes an input device 1 such as a keyboard, a data processing device 2 that operates under program control, a storage device 3, and an output device 4 such as a display device or a printing device.
The storage device 3 includes a database storage unit 31 composed of a magnetic disk device or the like. The database storage unit 31 stores a database. The data processing device 2 reads the data of this database and extracts CFD rules.
The data processing device 2 includes rule candidate generation means 21 and rule validity determination means 22.
The rule candidate generation means 21 generates rule candidates for the database stored in the database storage unit 31. To do so, it uses the parameters given from the input device 1 and the rule candidates generated in the previous step (already generated rule candidates), and stores the generated rule candidates in a storage unit. The storage unit may be a storage unit (memory device, not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation means 21, or a predetermined storage area in the storage device 3.
The rule validity determination means 22 checks whether each rule generated by the rule candidate generation means 21 is a valid rule and, if it is, outputs the rule to the output device 4.
Here, “valid” means that:
- the number of tuples in the database that match the rule is equal to or greater than a predetermined frequency threshold, and
- the consequent part of the rule agrees with the contents of the tuples at or above the certainty threshold.
The parameters given from the input device 1 include the frequency threshold and the certainty threshold, and are referred to by the rule candidate generation means 21.
FIG. 2 is a flowchart for explaining the operation of the present embodiment. The operation of the present embodiment will be described in detail with reference to FIGS. 1 and 2.
The parameters given from the input device 1 and the contents of the database given from the database storage unit 31 are supplied to the rule candidate generation means 21. The rule candidate generation means 21 generates the attribute-value pairs that appear in the database (each such pair is called an “item”) (step A1). The rule candidate generation means 21 stores the generated item set in a storage unit (not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation means 21, or a predetermined storage area of the storage device 3.
The rule candidate generation means 21
- extracts, from the set of generated items, all items whose frequency is equal to or greater than the parameter (frequency threshold) k,
- generates rules in which the condition part (condition part / premise part) is empty and an extracted item is the consequent part, and
- takes these as the initial rule candidates (CFD candidates) (step A2).
The rule candidate generation means 21 stores the generated initial rule candidates (CFD candidates) in a storage unit (not shown) in the data processing device 2, a storage unit (not shown) in the rule candidate generation means 21, or a predetermined storage area of the storage device 3. In FIG. 2, the condition part / premise part (LHS) of a rule is shown simply as the condition part.
The rule validity determination means 22 collates each rule candidate (CFD candidate) generated by the rule candidate generation means 21 against the database stored in the database storage unit 31 and checks whether the rule is valid. Specifically, the rule validity determination means 22 determines that a rule is valid (Yes in step A3) if, for the tuples of the database that match the condition part of the rule, the consequent part of the rule agrees with the contents of the tuples at or above the parameter (certainty threshold) p.
When a valid rule is obtained from the rule candidates, the rule validity determination means 22 outputs the valid rule (CFD) to the output device 4 (step A4). A rule that is not valid is not output to the output device 4.
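The check in steps A3 and A4 can be sketched as follows, assuming that “the number of tuples matching the rule” counts tuples that satisfy both the condition part and the consequent part, and that certainty is that count divided by the number of tuples matching the condition part. Both readings, the helper names, and the sample rows are illustrative assumptions rather than details confirmed by the specification; wildcard consequents are not handled.

```python
def matches(row, items):
    """True if the row has the given value for every (attribute, value) item."""
    return all(row.get(attr) == val for attr, val in items)


def support_and_certainty(rows, lhs, rhs):
    """Count rows matching LHS and RHS, and the fraction of LHS rows that also match RHS."""
    lhs_rows = [row for row in rows if matches(row, lhs)]
    hits = sum(1 for row in lhs_rows if matches(row, [rhs]))
    certainty = hits / len(lhs_rows) if lhs_rows else 0.0
    return hits, certainty


def is_valid(rows, lhs, rhs, k, p):
    """A rule is treated as valid when frequency >= k and certainty >= p (assumed reading)."""
    hits, certainty = support_and_certainty(rows, lhs, rhs)
    return hits >= k and certainty >= p


# Hypothetical four-row table (not Table 1 of the specification).
rows = [
    {"a1": "1", "a2": "P", "a3": "S"},
    {"a1": "1", "a2": "P", "a3": "S"},
    {"a1": "2", "a2": "P", "a3": "T"},
    {"a1": "3", "a2": "Q", "a3": "T"},
]
valid_p = is_valid(rows, frozenset(), ("a2", "P"), k=2, p=0.66)
valid_s = is_valid(rows, frozenset(), ("a3", "S"), k=2, p=0.66)
```

With an empty condition part every row matches, so the certainty reduces to the relative frequency of the consequent item.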
When the rule validity determination means 22 determines that a rule is valid, the rule candidate generation means 21, when generating larger rule candidates, does not generate as a candidate any rule whose consequent part contains the item that forms the consequent part of the rule determined to be valid. That is, the larger rule candidates are pruned.
Steps A3 and A4 are repeated. When no rule candidates (CFD candidates) remain to be checked for validity (that is, when the rule candidates become empty) (Yes in step A5), the rule candidate generation means 21 uses the items as new condition parts and generates new rule candidates (CFD candidates) in which the size of the condition part is one larger than in the previous round (step A6).
The rule candidate generation means 21 determines whether the new rule candidates generated in step A6 are empty (step A7).
If the new rule candidates generated in step A6 are empty, the rule discovery computation ends (Yes in step A7).
If the new rule candidates (CFD candidates) generated in step A6 are not empty, the process returns to step A3, where the rule validity determination means 22 determines whether each rule is valid and outputs the valid rules.
The series of steps A3 to A7 is repeated until the determination in step A7 shows that the rule candidate generation means 21 can generate no further rule candidates (CFD candidates).
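Putting steps A1 to A7 together, the breadth-first loop might look like the following sketch. It is not the patented implementation: the representation of rules as (LHS, RHS) pairs, the restriction to constant (non-wildcard) items, the convention that an attribute appears at most once in a rule, and the redundancy test (same consequent, condition part containing that of an already valid rule) are all assumptions chosen to keep the example short.

```python
from collections import Counter


def matches(row, items):
    """True if the row has the given value for every (attribute, value) item."""
    return all(row.get(attr) == val for attr, val in items)


def discover(rows, k, p):
    """Breadth-first search for rules with frequency >= k and certainty >= p."""
    # Step A1: items (attribute-value pairs) that occur in at least k rows.
    counts = Counter((attr, val) for row in rows for attr, val in row.items())
    items = sorted(item for item, n in counts.items() if n >= k)
    # Step A2: initial candidates have an empty LHS and one item as RHS.
    candidates = {(frozenset(), item) for item in items}
    valid = []
    while candidates:  # Step A7: stop once no candidate could be generated.
        # Steps A3 to A5: check every candidate of the current size.
        for lhs, rhs in sorted(candidates, key=lambda r: (sorted(r[0]), r[1])):
            lhs_rows = [row for row in rows if matches(row, lhs)]
            hits = sum(1 for row in lhs_rows if matches(row, [rhs]))
            if lhs_rows and hits >= k and hits / len(lhs_rows) >= p:
                valid.append((lhs, rhs))  # Step A4: report the valid rule.
        # Step A6: grow each LHS by one item, skipping redundant rules.
        grown = set()
        for lhs in {lhs for lhs, _ in candidates}:
            for extra in items:
                if any(extra[0] == attr for attr, _ in lhs):
                    continue  # assumed: an attribute appears only once per rule
                bigger = lhs | {extra}
                attrs = {attr for attr, _ in bigger}
                for rhs in items:
                    if rhs[0] in attrs:
                        continue
                    # Assumed redundancy test: same RHS, LHS contains a valid rule's LHS.
                    if any(v_rhs == rhs and v_lhs <= bigger for v_lhs, v_rhs in valid):
                        continue
                    grown.add((bigger, rhs))
        candidates = grown
    return valid


# Hypothetical four-row table (not Table 1 of the specification).
rows = [
    {"a1": "1", "a2": "P", "a3": "S"},
    {"a1": "1", "a2": "P", "a3": "S"},
    {"a1": "2", "a2": "P", "a3": "T"},
    {"a1": "3", "a2": "Q", "a3": "T"},
]
rules = discover(rows, k=2, p=0.66)
```

Because the rule with an empty condition part and consequent ("a2", "P") is found in the first round, no later candidate with that consequent is generated, which is the effect the pruning is meant to achieve. The loop terminates because the condition part cannot grow beyond the number of attributes.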
Thus, in the present embodiment, processing starts by enumerating the items whose frequency is equal to or greater than a predetermined value. By combining these items, rule candidates (CFD candidates) are generated starting from small ones and gradually growing in size. As soon as a valid rule (a rule whose certainty is equal to or greater than the threshold) is obtained, it is output, and the generation of rules that are redundant with respect to it is suppressed, so that rules can be discovered efficiently.
Next, the operation of the present embodiment will be described using a specific example. As shown in FIG. 3(B) (Table 1 below), suppose for example that a data set consisting of the attributes and tuples of Table 1 is registered in the database storage unit 31. This database is a simplified example for the purpose of explanation. FIG. 3(A) illustrates, for the steps of FIG. 2, specific examples of the item set, the initial rule (CFD) candidates, and the new rule (CFD) candidates. FIG. 3(C) shows an example of the rules (approximate CFDs) output from the output device 4 as a result of the computation of FIG. 2. In FIG. 3, the symbol ‘:’ in “attribute 1: _” and the like is synonymous with the symbol ‘=’ in “attribute 1 = _”.
The rule candidate generation means 21 receives Table 1 above and, as parameters,
k = 2, p = 0.66.
Here,
k is the frequency threshold (lower limit) for determining that a rule is valid, and
p is the certainty threshold (lower limit).
The rule candidate generation means 21 extracts the list (set) of all items whose frequency of appearance in the database is k = 2 or more: {
“attribute 1 = _” (4),
“attribute 2 = _” (4),
“attribute 3 = _” (4),
“attribute 1 = 1” (2),
“attribute 2 = P” (3),
“attribute 3 = S” (2),
“attribute 3 = T” (2)}.
Here, the symbol _ is a variable that matches any value, and the number in parentheses is the frequency of the item. The ‘_’ in “attribute 1 = _” is an unnamed variable, that is, a wildcard that matches any attribute value.
For each item in the extracted list, the rule candidate generation means 21 generates a CFD rule whose condition part / premise part (LHS) is empty and whose consequent part (RHS) is that item, and temporarily stores these rules in a storage unit (not shown) as the initial rule (CFD) candidates (step A1 in FIG. 2 and FIG. 3(A)).
In this case, the initial rule (CFD) candidates are as follows.
ψ1: empty → “attribute 1 = _”,
ψ2: empty → “attribute 2 = _”,
ψ3: empty → “attribute 3 = _”,
ψ4: empty → “attribute 1 = 1”,
ψ5: empty → “attribute 2 = P”,
ψ6: empty → “attribute 3 = S”,
ψ7: empty → “attribute 3 = T”
In rule ψ5, the frequency of the empty condition part is 4 (equal to the total number of tuples), and the frequency of “attribute 2 = P” is 3.
Therefore, the ratio (certainty) of the frequency of “attribute 2 = P” (= 3) to the frequency of empty (= 4) is 3/4 = 0.75, which exceeds the certainty threshold (lower limit) p = 0.66.
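Written as a formula, the same computation reads as follows. The notation freq(X), for the number of tuples matching an itemset X, and conf, for the certainty, is introduced here only for illustration, and the general form for a non-empty condition part X is an assumption that extends the ratio stated for ψ5.

```latex
\[
\mathrm{conf}(X \rightarrow A) = \frac{\mathrm{freq}(X \cup \{A\})}{\mathrm{freq}(X)},
\qquad
\mathrm{conf}(\psi_5)
  = \frac{\mathrm{freq}(\{\text{attribute 2} = P\})}{\mathrm{freq}(\emptyset)}
  = \frac{3}{4} = 0.75 \;\ge\; p = 0.66 .
\]
```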
Therefore, the rule validity determination means 22 finds that
ψ5: empty → “attribute 2 = P”
is valid (step A3 in FIGS. 2 and 3A) and outputs it to the output device 4 (step A4 in FIGS. 2 and 3A).
At the same time, rules that are redundant with respect to ψ5 are excluded from the subsequent search for rule candidates. Specifically, in a later step (step A6 in FIGS. 2 and 3A), when the rule candidate generation means 21 generates an itemset that includes the item “attribute 2 = P”, it removes “attribute 2 = P” from the candidates for the consequent part (RHS) of that itemset.
The remaining six rules (ψ1 to ψ4, ψ6, ψ7) are not valid for the contents of the database. As a result, the set of rule (CFD) candidates becomes empty (Yes in step A5 in FIG. 2).
Of the rule candidates ({ψ1, ψ2, ..., ψ7}), the rule (CFD) ψ5, which the rule validity determination means 22 has determined to be valid for the contents of the database, is output to the output device 4. A rule whose validity has been determined is deleted from the rule candidates. That is, the rule ψ5 determined to be valid and the rules determined to be invalid (ψ1 to ψ4, ψ6, ψ7) are all deleted from the rule candidates, so that the set of rule candidates still to be checked for validity becomes empty (none remain). Instead of deleting a rule from memory, a deletion flag of, for example, one bit may be provided for each rule in the rule candidates. In this case, the rule candidates are determined to be empty when the deletion flags of all the rules in the rule candidates are on.
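The deletion-flag variant described above might be sketched as follows. This is only an illustration under assumptions: the class name `CandidatePool` and its methods are hypothetical and do not appear in the specification.

```python
from typing import List


class CandidatePool:
    """Rule candidates with one deletion flag per rule (hypothetical helper)."""

    def __init__(self, rules: List[str]) -> None:
        self.rules = list(rules)
        # One flag per rule; True means "validity already determined".
        self.deleted = [False] * len(self.rules)

    def mark_checked(self, index: int) -> None:
        """Set the deletion flag instead of removing the rule from memory."""
        self.deleted[index] = True

    def pending(self) -> List[str]:
        """Rules whose validity has not been determined yet."""
        return [r for r, d in zip(self.rules, self.deleted) if not d]

    def is_empty(self) -> bool:
        """The pool counts as empty when every deletion flag is on."""
        return all(self.deleted)


pool = CandidatePool([f"psi{i}" for i in range(1, 8)])
for i in range(len(pool.rules)):
    pool.mark_checked(i)  # validity of psi1..psi7 has been determined
print(pool.is_empty())
```

Keeping the rules in place and flipping a flag avoids restructuring the stored candidate list while the validity check is in progress.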
Since the set of rule candidates (CFD candidates) subject to validity determination is empty (Yes in step A5 in FIG. 2), the rule candidate generation means 21 generates new rules (step A6 in FIGS. 2 and 3A). Specifically, from the above set of items, it selects two elements that do not contradict each other (the same attribute does not take different values), whose combination is exactly one larger than each element before combination, and whose frequency is equal to or higher than the threshold k, and combines them (the combined set is called an itemset). It then generates rules in which one element of the itemset is the consequent part and the remaining elements are the condition part / premise part (step A6 in FIGS. 2 and 3A).
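The candidate generation step can be sketched in Python as below. This is a simplified illustration, not the specification's implementation: the table `ROWS` is hypothetical, wildcard items such as “attribute 1 = _” are omitted, the frequency test is applied to the combined itemset (one reading of the description above), and the names `generate_candidates` and `excluded_rhs` are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Item = Tuple[str, str]  # (attribute, value)
Rule = Tuple[FrozenSet[Item], Item]  # (condition part, consequent part)

# Hypothetical table consistent with the frequencies quoted in the text.
ROWS: List[Dict[str, str]] = [
    {"attribute 1": "1", "attribute 2": "P", "attribute 3": "S"},
    {"attribute 1": "1", "attribute 2": "Q", "attribute 3": "S"},
    {"attribute 1": "2", "attribute 2": "P", "attribute 3": "T"},
    {"attribute 1": "3", "attribute 2": "P", "attribute 3": "T"},
]


def frequency(rows, itemset) -> int:
    return sum(1 for r in rows if all(r.get(a) == v for a, v in itemset))


def consistent(itemset) -> bool:
    """True if no attribute is assigned two different values."""
    attrs = [a for a, _ in itemset]
    return len(attrs) == len(set(attrs))


def generate_candidates(rows, itemsets: List[FrozenSet[Item]], k: int,
                        excluded_rhs: Set[Item]) -> List[Rule]:
    """Combine same-size itemsets into itemsets one larger and emit rules."""
    candidates: List[Rule] = []
    seen: Set[FrozenSet[Item]] = set()
    for x, y in combinations(itemsets, 2):
        merged = x | y
        if len(merged) != len(x) + 1 or merged in seen:
            continue
        seen.add(merged)
        if not consistent(merged) or frequency(rows, merged) < k:
            continue
        for rhs in sorted(merged):
            if rhs in excluded_rhs:  # pruning: consequent already covered
                continue
            candidates.append((merged - {rhs}, rhs))
    return candidates


items = [("attribute 1", "1"), ("attribute 2", "P"),
         ("attribute 3", "S"), ("attribute 3", "T")]
singletons = [frozenset([i]) for i in items]
pruned = generate_candidates(ROWS, singletons, 2, {("attribute 2", "P")})
unpruned = generate_candidates(ROWS, singletons, 2, set())
print(len(pruned), len(unpruned))
```

Under these assumptions, excluding “attribute 2 = P” as a consequent yields three candidates, whereas an empty exclusion set additionally yields “attribute 3 = T” → “attribute 2 = P”.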
As described above, for the itemset
{“attribute 2 = P”, “attribute 3 = T”}
obtained by combining the two items
“attribute 2 = P” and
“attribute 3 = T”,
“attribute 2 = P” is excluded from the candidates for the consequent part.
Therefore, the rule candidate generation means 21 does not generate the rule
“attribute 3 = T” → “attribute 2 = P”.
Of the rule candidates newly generated by the rule candidate generation means 21 (step A6 in FIGS. 2 and 3A), the rule validity determination means 22 determines that
CFD ψ′1: “attribute 1 = _” → “attribute 3 = _”
(the value of attribute 1 uniquely determines the value of attribute 3)
is valid (frequency = 4, confidence = 1) (step A3 in FIGS. 2 and 3A), and ψ′1 is output from the output device 4 (step A4). At the same time, rules that are redundant with respect to ψ′1 are excluded from the subsequent search for rule candidates.
Similarly, the rule validity determination means 22 finds the following valid rules whose condition part / premise part has size 1 (frequency of k = 2 or more and confidence of p = 0.66 or more):
ψ′2: “attribute 3 = _” → “attribute 1 = _”,
ψ′3: “attribute 2 = _” → “attribute 3 = _”,
ψ′4: “attribute 3 = S” → “attribute 1 = 1”,
ψ′5: “attribute 1 = 1” → “attribute 3 = S”,
ψ′6: “attribute 2 = P” → “attribute 3 = T”
(steps A3 and A4 in FIGS. 2 and 3A). These rules are output to the output device 4 (step A4 in FIG. 2), and the rules that are redundant with respect to each of them are excluded.
Since the rule validity determination means 22 has determined the validity of all six rule (CFD) candidates whose condition part / premise part has size 1 (all were determined to be valid), the set of rule (CFD) candidates to be checked for validity becomes empty (Yes in step A5 in FIG. 2).
The rule candidate generation means 21 then attempts to generate rule (CFD) candidates in which the size of the condition part / premise part is increased by one. However, no rule candidate with a condition part / premise part of size 2 is generated. Since the set of such rule (CFD) candidates is empty, the rule discovery algorithm ends (Yes in step A7).
The CFDs obtained are as shown in FIG. 3C:
empty → “attribute 2 = P”: frequency = 3, confidence = 0.75
“attribute 1 = _” → “attribute 3 = _”: frequency = 4, confidence = 1
“attribute 3 = _” → “attribute 1 = _”: frequency = 4, confidence = 0.75
“attribute 2 = _” → “attribute 3 = _”: frequency = 4, confidence = 0.75
“attribute 3 = S” → “attribute 1 = 1”: frequency = 2, confidence = 1
“attribute 1 = 1” → “attribute 3 = S”: frequency = 2, confidence = 1
“attribute 2 = P” → “attribute 3 = T”: frequency = 2, confidence = 0.66
In the output example shown in FIG. 3C, the frequency and the confidence of each rule determined to be valid are output. However, only one of the two may be output, or neither. When a large number of rules are determined to be valid, the rules may, for example, be sorted by confidence before being output. The output form of the rules determined to be valid is arbitrary.
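As one possible output form, the rules listed above could be sorted by confidence before being output. The following Python sketch uses the frequency and confidence values quoted in the text; breaking ties by frequency is an assumption added for illustration, as is the tuple layout.

```python
# (condition part, consequent part, frequency, confidence), as listed in the text.
rules = [
    ("empty", "attribute 2 = P", 3, 0.75),
    ("attribute 1 = _", "attribute 3 = _", 4, 1.0),
    ("attribute 3 = _", "attribute 1 = _", 4, 0.75),
    ("attribute 2 = _", "attribute 3 = _", 4, 0.75),
    ("attribute 3 = S", "attribute 1 = 1", 2, 1.0),
    ("attribute 1 = 1", "attribute 3 = S", 2, 1.0),
    ("attribute 2 = P", "attribute 3 = T", 2, 0.66),
]

# Highest confidence first; ties broken by higher frequency (an assumption).
ranked = sorted(rules, key=lambda r: (-r[3], -r[2]))

for lhs, rhs, freq, conf in ranked:
    print(f"{lhs} -> {rhs}: frequency = {freq}, confidence = {conf}")
```

Because Python's sort is stable, rules with equal confidence and frequency keep their original relative order.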
If pruning is not performed, redundant rules (CFDs) such as
CFD ψ: “attribute 3 = T” → “attribute 2 = P”
are generated from the itemset
{“attribute 2 = P”, “attribute 3 = T”}
in the course of the calculation process of FIG. 2.
In contrast, according to the present embodiment, pruning prevents the generation of redundant rules such as the one above.
<Embodiment 2>
Next, a second embodiment of the present invention will be described in detail with reference to the drawings. Referring to FIG. 4, the second embodiment of the present invention includes a rule discovery program 5.
The rule discovery program 5 is read into the data processing device 6 and controls the operation of the data processing device 6. Under the control of the rule discovery program 5, the data processing device 6 executes the following processing, which is the same as the processing performed by the data processing device 2 in the first embodiment. The rule discovery program 5 includes:
(a) a process of reading the database, generating items that each consist of an attribute-value pair of the database and have a frequency equal to or higher than a predetermined threshold, generating, as initial rule candidates, rules (CFDs) whose condition part is empty and whose consequent part is one of the items, and storing them in a storage unit;
(b) a process of determining, for each rule (CFD) of the generated rule candidates, that the rule is valid (an approximate CFD) and outputting it to the output device when, among the tuples of the database that match the condition part of the rule, the consequent part of the rule matches at a ratio equal to or higher than a predetermined confidence threshold;
(c) a process of returning to (b) when the set of rule candidates (CFD candidates) whose validity is to be determined is not empty;
(d) a process of generating, when the set of rule candidates (CFD candidates) whose validity is to be determined is empty, new rule candidates (new CFD candidates) in which the items form a new condition part whose size is one larger than that of the previously generated rule candidates, and storing them in the storage unit; and
(e) a process of determining whether or not the set of new rule candidates generated in (d) is empty, ending the rule discovery process when it is empty, and returning to (b) when it is not empty.
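Processes (a) to (e) can be sketched as a level-wise loop in Python. This is a simplified illustration under stated assumptions, not the program of the specification: it handles only constant items (wildcard items such as “attribute 1 = _” are omitted), it builds each level from the frequent itemsets of the previous level, and the table `ROWS` and the function name `discover` are hypothetical.

```python
from itertools import combinations

# Hypothetical table consistent with the frequencies quoted in the text.
ROWS = [
    {"attribute 1": "1", "attribute 2": "P", "attribute 3": "S"},
    {"attribute 1": "1", "attribute 2": "Q", "attribute 3": "S"},
    {"attribute 1": "2", "attribute 2": "P", "attribute 3": "T"},
    {"attribute 1": "3", "attribute 2": "P", "attribute 3": "T"},
]


def freq(rows, items):
    """Number of tuples matching every (attribute, value) pair in `items`."""
    return sum(1 for r in rows if all(r.get(a) == v for a, v in items))


def discover(rows, k, p):
    """Return valid rules as (condition part, consequent part, frequency, confidence)."""
    # (a) frequent items and initial candidates with an empty condition part
    values = sorted({(a, v) for r in rows for a, v in r.items()})
    items = [i for i in values if freq(rows, [i]) >= k]
    level = [frozenset([i]) for i in items]
    candidates = [(frozenset(), i) for i in items]
    found, excluded = [], set()
    while candidates:  # (e) stop once no new candidate can be generated
        # (b), (c) check every pending candidate of the current size
        for lhs, rhs in candidates:
            support = freq(rows, lhs | {rhs})
            conf = support / freq(rows, lhs)
            if support >= k and conf >= p:
                found.append((lhs, rhs, support, conf))
                excluded.add(rhs)  # prune this consequent from later levels
        # (d) grow itemsets by one item and derive the next candidates
        merged = {x | y for x, y in combinations(level, 2)
                  if len(x | y) == len(x) + 1}
        level = [s for s in merged
                 if len({a for a, _ in s}) == len(s) and freq(rows, s) >= k]
        candidates = [(s - {rhs}, rhs)
                      for s in sorted(level, key=sorted)
                      for rhs in sorted(s) if rhs not in excluded]
    return found


found = discover(ROWS, k=2, p=0.66)
for lhs, rhs, support, conf in found:
    print(sorted(lhs), "->", rhs, support, round(conf, 2))
```

On this hypothetical table the sketch reports four constant rules, whose frequencies and confidences agree with the corresponding constant CFDs listed for FIG. 3C.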
In the process (b) of outputting a rule determined to be valid (an approximate CFD) to the output device, the rule need not be output immediately. Instead, the rule may first be added to a list (a linear list or the like), and the list may be output when the set of rule candidates (CFD candidates) whose validity is to be determined becomes empty. The list is stored in a storage device, a buffer, or the like. Any method may be used for the output control of the rules determined to be valid.
Commands (such as an execution command for the rule discovery program) and setting parameters are input from the input device 1, and initial rule candidates are generated using the database stored in the database storage unit 31 in the storage device 3. Next, it is determined whether or not each generated rule candidate is valid, and a valid rule is added to the list. When the coverage of the database by the set of rules stored in the list satisfies the termination condition, the set of rules in the list is displayed on the output device 4.
The disclosures of the above patent documents and non-patent documents are incorporated herein by reference. Within the scope of the entire disclosure of the present invention (including the claims), the embodiments and examples can be changed and adjusted based on its basic technical concept. Various combinations and selections of the disclosed elements (including the elements of each claim, of each example, and of each drawing) are possible within the scope of the claims of the present invention. That is, the present invention naturally includes the various variations and modifications that those skilled in the art could make in accordance with the entire disclosure, including the claims, and the technical concept. In particular, for the numerical ranges described in this document, any numerical value or sub-range included in a range should be construed as being specifically described even where it is not explicitly stated.
DESCRIPTION OF SYMBOLS
1 Input device
2, 6 Data processing device
3 Storage device
4 Output device
5 Rule discovery program
21 Rule candidate generation means
22 Rule validity determination means
31 Database storage unit
Claims (10)
- データベースを記憶する記憶装置と、
データ処理装置と、
出力装置と、
を備え、
前記データ処理装置は、
前記データベースからルール候補を生成するルール候補生成手段と、
前記ルール候補が前記データベースの内容に対して妥当であるか否か判定するルールの妥当性判定手段と、
を備え、
前記ルール候補生成手段は、
前記データベースにおける属性と値のペアからなるアイテムであって、前記データベースでの頻度が、予め定められた所定の閾値以上のアイテムの集合を生成し、
ルール候補の初期値として、条件部・前提部を空、帰結部を前記アイテムとするルール集合を生成して記憶部に記憶し、
前記ルールの妥当性判定手段は、
前記ルール候補生成手段で生成された前記ルール候補の各ルールに対して、
前記ルールの条件部・前提部とマッチする前記データベースのタプルに対して、前記ルールの帰結部が、予め定められた所定の確信度の閾値以上でマッチしている場合、前記ルールを妥当と判定して前記出力装置に出力し、
妥当性判定対象の前記ルール候補が空となると、前記ルール候補生成手段では、前記アイテムを新たな条件部・前提部とし、前記条件部・前提部のサイズを、前回生成した前記ルール候補よりも1つ増やした新たなルール候補を生成して記憶部に記憶し、
前記ルール候補生成手段で生成された前記新たなルール候補に対する、前記ルールの妥当性判定手段による、妥当性の判定と、
前記ルール候補生成手段による、条件部・前提部のサイズを前回生成した前記ルール候補よりも1つ増やした、新たなルール候補の生成と、
を、前記ルール候補生成手段にて、条件部・前提部のサイズを、前回生成した前記ルール候補よりも1つ増やした新たなルール候補を生成することができず前記新たなルール候補が空となるまで、繰り返す、ことを特徴とするルール発見システム。 A storage device for storing the database;
A data processing device;
An output device;
With
The data processing device includes:
Rule candidate generation means for generating rule candidates from the database;
Rule validity determination means for determining whether or not the rule candidate is valid for the contents of the database;
With
The rule candidate generation means includes:
An item consisting of attribute-value pairs in the database, wherein a set of items having a frequency in the database equal to or higher than a predetermined threshold value is generated;
As an initial value of the rule candidate, a rule set having a condition part / premise part as empty and a result part as the item is generated and stored in the storage unit,
The rule validity judging means is:
For each rule of the rule candidate generated by the rule candidate generation means,
The rule is determined to be valid if the rule part of the rule matches the condition part / premise part of the rule and the rule result part matches with a predetermined certainty threshold or more. And output to the output device,
When the rule candidate to be validated becomes empty, the rule candidate generation means sets the item as a new condition part / premise part, and sets the size of the condition part / premise part to the rule candidate generated last time. Generate a new rule candidate increased by one and store it in the storage unit,
With respect to the new rule candidate generated by the rule candidate generation means, determination of validity by the validity determination means of the rule,
Generation of a new rule candidate by the rule candidate generation means, in which the size of the condition part / premise part is increased by one from the rule candidate generated previously;
The rule candidate generation means cannot generate a new rule candidate in which the size of the condition part / premise part is increased by one from the rule candidate generated previously, and the new rule candidate is empty. A rule discovery system characterized by repeating until it becomes. - 前記ルールの妥当性判定手段で妥当と判断されたルールの前記帰結部を、前記ルール候補生成手段によるルール候補の探索において、帰結部の候補から除外し、
前記妥当と判断されたルールの前記帰結部を帰結部に含む冗長なルールは、前記ルール候補生成手段において前記新たなルール候補として生成されないようにした、ことを特徴とする請求項1記載のルール発見システム。 In the search for rule candidates by the rule candidate generation means, exclude the result part of the rule determined to be valid by the rule validity determination means from the candidate of the result part,
The rule according to claim 1, wherein a redundant rule including the consequent part of the rule determined to be valid as a consequent part is not generated as the new rule candidate by the rule candidate generation unit. Discovery system. - 前記頻度の閾値と前記確信度の閾値を設定パラメータとして入力する入力装置を備えている、ことを特徴とする請求項1記載のルール発見システム。 The rule discovery system according to claim 1, further comprising an input device for inputting the frequency threshold and the certainty threshold as setting parameters.
- 前記ルールは、CFD(Conditional Functional Dependency)で表現されたルールである、ことを特徴とする請求項1乃至3のいずれか1項に記載のルール発見システム。 4. The rule discovery system according to claim 1, wherein the rule is a rule expressed by CFD (Conditional Functional Dependency).
- ルール候補生成手段とルールの妥当性判定手段を備えたデータ処理装置によりデータベースからルールを発見するにあたり、
(a)前記ルール候補生成手段が、前記データベースを読み出し、前記データベースの属性と値のペアからなり頻度が予め定められた所定の閾値以上のアイテムを生成し、
ルール候補の初期値として、条件部・前提部を空、帰結部を前記アイテムとするルール集合を生成して記憶部に記憶し、
(b)前記ルールの妥当性判定手段は、前記生成されたルール候補のルールに対して、前記ルールの条件部・前提部とマッチする前記データベースのタプルに対して、前記ルールの帰結部が、予め定められた所定の確信度の閾値以上でマッチしている場合には、前記ルールを妥当と判定して出力装置より出力し、
(c)妥当性判定対象の前記ルール候補が空か否かチェックし、妥当性判定対象の前記ルール候補が空でない場合、前記ルールの妥当性判定手段は、前記ステップ(b)に戻り、
(d)妥当性判定対象の前記ルール候補が空の場合、前記ルール候補生成手段は、前記アイテムを新たな条件部・前提部とし、前記条件部・前提部のサイズを、前回生成した前記ルール候補よりも1つ増やした、新たなルール候補を生成して記憶部に記憶し、
(e)前記ステップ(d)で生成された前記新たなルール候補が空か否かチェックし、前記新たなルール候補が空の場合、ルールの発見を終了し、
前記新たなルール候補が空でない場合、前記ステップ(b)に戻る、ことを特徴とするルール発見方法。 In discovering a rule from a database by a data processing device equipped with a rule candidate generation unit and a rule validity determination unit,
(A) The rule candidate generation means reads the database, generates an item having a frequency equal to or higher than a predetermined threshold that includes a pair of attribute and value of the database,
As an initial value of the rule candidate, a rule set having a condition part / premise part as empty and a result part as the item is generated and stored in the storage unit,
(B) The rule validity determination means, for the generated rule candidate rule, for the tuple of the database that matches the condition part / premise part of the rule, If there is a match with a predetermined certainty threshold or more, it is determined that the rule is valid and output from the output device,
(C) Check whether the rule candidate to be validated is empty, and if the rule candidate to be validated is not empty, the rule validity judging means returns to the step (b),
(D) When the rule candidate to be validated is empty, the rule candidate generation means sets the item as a new condition part / premise part, and sets the size of the condition part / premise part to the previously generated rule. Generate a new rule candidate, one more than the candidate, and store it in the storage unit,
(E) Check whether or not the new rule candidate generated in the step (d) is empty. If the new rule candidate is empty, the rule discovery is terminated.
If the new rule candidate is not empty, the method returns to step (b), and the rule finding method is characterized in that: - 前記ルールの妥当性判定手段で前記ルールが妥当と判断された場合、前記妥当と判断されたルールの前記帰結部を、ルール候補の探索において、帰結部の候補から除外し、前記妥当と判断されたルールの前記帰結部を、帰結部に含む冗長なルールは、前記ルール候補生成手段により前記新たなルール候補として生成されないようにした、ことを特徴とする請求項5記載のルール発見方法。 When the rule is determined to be valid by the rule validity determination means, the result part of the rule determined to be valid is excluded from the result part candidates in the rule candidate search, and is determined to be valid. 6. The rule finding method according to claim 5, wherein a redundant rule including the consequent part of the rule included in the consequent part is not generated as the new rule candidate by the rule candidate generating unit.
- 前記ルールは、CFD(Conditional Functional Dependency)で表現されたルールである、ことを特徴とする請求項5又は6記載のルール発見方法。 The rule discovery method according to claim 5 or 6, wherein the rule is a rule expressed in CFD (Conditional Functional Dependency).
- (a)データベースを読み出し、前記データベースの属性と値のペアからなり頻度が予め定められた所定の閾値以上のアイテムを生成し、ルール候補の初期値として、条件部・前提部が空、帰結部が、前記アイテムであるルールを生成して記憶部に記憶する処理と、
(b)前記生成されたルール候補のルールに対して、前記ルールの条件部・前提部とマッチする前記データベースのタプルに対して、前記ルールの帰結部が、予め定められた所定の確信度の閾値以上でマッチしている場合には、前記ルールを妥当と判断して出力装置より出力する処理と、
(c)妥当性判定対象の前記ルール候補が空であるか否かチェックし、妥当性を判定する前記ルール候補が空でない場合、前記(b)に戻る処理と、
(d)妥当性判定対象の前記ルール候補が空の場合、前記アイテムを新たな条件部・前提部とし、前記条件部・前提部のサイズを、前回生成した前記ルール候補よりも1つ増やした新たなルール候補を生成して記憶部に記憶する処理と、
(e)前記(d)の処理で生成された前記新たなルール候補が空か否かチェックし、前記新たなルール候補が空の場合、ルール発見処理を終了し、
前記新たなルール候補が空でない場合、前記(b)に戻る処理と、
をコンピュータに実行させるプログラム。 (A) Reading out the database, generating an item having a frequency equal to or higher than a predetermined threshold, which is composed of attribute-value pairs of the database, and the condition part / premise part is empty as the initial value of the rule candidate. Is a process of generating a rule that is the item and storing it in a storage unit;
(B) For the generated rule candidate rule, for a tuple of the database that matches the condition part / premise part of the rule, the rule consequent part has a predetermined certainty factor. If there is a match at a threshold value or higher, processing to determine that the rule is valid and output from the output device; and
(C) Checking whether or not the rule candidate for validity determination is empty, and if the rule candidate for determining validity is not empty, the process of returning to (b);
(D) When the rule candidate to be validated is empty, the item is set as a new condition part / premise part, and the size of the condition part / premise part is increased by one from the previously generated rule candidate. Processing to generate a new rule candidate and store it in the storage unit;
(E) Check whether or not the new rule candidate generated in the process of (d) is empty. If the new rule candidate is empty, the rule discovery process is terminated.
If the new rule candidate is not empty, the process returns to (b);
A program that causes a computer to execute. - 前記(b)の処理で、前記ルール候補が妥当と判断された場合、前記(d)の処理では、前記妥当と判断されたルールの前記帰結部を、ルール候補の探索において、帰結部の候補から除外し、前記妥当と判断されたルールの前記帰結部を、帰結部に含む冗長なルールは、前記(d)の処理により、新たに生成されるルール候補として生成されないようにした、ことを特徴とする請求項8記載のプログラム。 If the rule candidate is determined to be valid in the process (b), the result part of the rule determined to be valid is used as the result candidate in the rule candidate search in the process (d). The redundant rule that includes the consequent part of the rule determined to be valid and the consequent part is not generated as a newly generated rule candidate by the process (d). 9. The program according to claim 8, wherein
- A rule discovery device comprising:
rule candidate generation means for generating new rule candidates based on the contents of a database and a frequency threshold set from an input device, or based on previously generated rule candidates; and
rule validity determination means for determining whether each rule candidate is valid with respect to the contents of the database,
wherein the rule candidate generation means reads the database, generates items, each consisting of an attribute-value pair of the database, whose frequency is equal to or greater than the threshold, and generates, as initial rule candidates, rules whose condition part / premise part is empty and whose consequent part is one of the items,
rule candidates are generated by breadth-first search, in ascending order of the size of the condition part / premise part,
for each rule among the generated rule candidates, the rule validity determination means judges the rule to be valid and outputs it to an output device when, for the tuples of the database that match the condition part / premise part of the rule, the consequent part of the rule matches with a confidence equal to or greater than a confidence threshold input from the input device,
when no rule candidates remain to be checked for validity, the rule candidate generation means excludes rules that are redundant with respect to the valid rules from the rule candidate search and generates new rule candidates that use the items as the condition part / premise part, with the size of the condition part / premise part one larger than that of the previously generated rule candidates, and
the determination of validity, by the rule validity determination means, of the new rule candidates generated by the rule candidate generation means, and the generation, by the rule candidate generation means, of new rule candidates whose condition part / premise part is one larger than that of the previously generated rule candidates, are repeated until the rule candidate generation means can no longer generate such new rule candidates and the set of new rule candidates becomes empty.
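The level-wise procedure recited in the device claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `discover_rules`, the representation of tuples as dictionaries, the merging of the condition part and premise part into a single left-hand side, and the redundancy test (a candidate is skipped when an already valid rule has the same consequent and a subset of its left-hand side) are all assumptions made for illustration.

```python
from itertools import combinations

def discover_rules(rows, min_freq, min_conf):
    """Hypothetical sketch: rows is a list of dicts (attribute -> string value)."""
    # Frequent items: (attribute, value) pairs seen in at least min_freq rows.
    counts = {}
    for row in rows:
        for item in row.items():
            counts[item] = counts.get(item, 0) + 1
    items = sorted(i for i, c in counts.items() if c >= min_freq)

    def matches(row, pairs):
        return all(row.get(a) == v for a, v in pairs)

    def confidence(lhs, rhs):
        # Share of rows matching the left-hand side that also match rhs.
        matched = [r for r in rows if matches(r, lhs)]
        if not matched:
            return 0.0
        return sum(matches(r, [rhs]) for r in matched) / len(matched)

    valid = []
    # Initial candidates: empty left-hand side, each frequent item as consequent.
    candidates = [(frozenset(), rhs) for rhs in items]
    size = 0
    while candidates:
        # Validity check for every candidate of the current size.
        for lhs, rhs in candidates:
            if confidence(lhs, rhs) >= min_conf:
                valid.append((lhs, rhs))
        # Breadth-first step: left-hand sides one item larger than before.
        size += 1
        candidates = []
        for combo in combinations(items, size):
            lhs = frozenset(combo)
            attrs = {a for a, _ in lhs}
            if len(attrs) < size:
                continue  # at most one value per attribute
            for rhs in items:
                if rhs[0] in attrs:
                    continue
                # Assumed redundancy test: skip supersets of a valid rule.
                if any(v_rhs == rhs and v_lhs <= lhs for v_lhs, v_rhs in valid):
                    continue
                candidates.append((lhs, rhs))
    return valid
```

The loop stops on its own once no candidate of the next size can be built, which mirrors the termination condition in the claim. Enumerating every combination of frequent items is exponential in the worst case, so the sketch is only suitable for small tables.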
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014515618A JPWO2013172308A1 (en) | 2012-05-14 | 2013-05-13 | Rule discovery system, method, apparatus and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2012110921 | 2012-05-14 | | |
JP2012-110921 | 2012-05-14 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013172308A1 (en) | 2013-11-21 |
Family
ID=49583713
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2013/063316 WO2013172308A1 (en) | 2012-05-14 | 2013-05-13 | Rule discovery system, method, device, and program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2013172308A1 (en) |
WO (1) | WO2013172308A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006058974A (en) * | 2004-08-17 | 2006-03-02 | Fujitsu Ltd | Work management system |
US20100250596A1 (en) * | 2009-03-26 | 2010-09-30 | Wenfei Fan | Methods and Apparatus for Identifying Conditional Functional Dependencies |
- 2013-05-13, WO, application PCT/JP2013/063316 (WO2013172308A1, en): active, Application Filing
- 2013-05-13, JP, application JP2014515618A (JPWO2013172308A1, en): active, Pending
Also Published As
Publication number | Publication date |
---|---|
JPWO2013172308A1 (en) | 2016-01-12 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
EP2570936A1 (en) | Information retrieval device, information retrieval method, computer program, and data structure | |
US9626434B2 (en) | Systems and methods for generating and using aggregated search indices and non-aggregated value storage | |
US11204707B2 (en) | Scalable binning for big data deduplication | |
US8655921B2 (en) | True/false decision method for deciding whether search query containing logical expression is true or false | |
JP5532189B2 (en) | Rule discovery system, method, apparatus and program | |
US20140222870A1 (en) | System, Method, Software, and Data Structure for Key-Value Mapping and Keys Sorting | |
US9298757B1 (en) | Determining similarity of linguistic objects | |
Lin et al. | High-utility sequential pattern mining with multiple minimum utility thresholds | |
JP4237813B2 (en) | Structured document management system | |
JP2019109782A (en) | Query generating program, query generating method and query generating device | |
Khan et al. | Set-based unified approach for attributed graph summarization | |
JPWO2013111287A1 (en) | SPARQL query optimization method | |
JP5964781B2 (en) | SEARCH DEVICE, SEARCH METHOD, AND SEARCH PROGRAM | |
US9542502B2 (en) | System and method for XML subdocument selection | |
WO2013172309A1 (en) | Rule discovery system, method, device, and program | |
KR20120136677A (en) | Method and tree structure of database for extracting data steams frequent pattern based on weighted support and structure of database | |
JPWO2011016281A1 (en) | Information processing apparatus and program for Bayesian network structure learning | |
WO2013172308A1 (en) | Rule discovery system, method, device, and program | |
Shana et al. | An improved method for counting frequent itemsets using bloom filter | |
JP6005583B2 (en) | SEARCH DEVICE, SEARCH METHOD, AND SEARCH PROGRAM | |
WO2014208728A1 (en) | Rule discovery method, information processing device, and program | |
US12061637B2 (en) | Heuristic identification of shared substrings between text documents | |
US11928154B2 (en) | System and method for efficient creation and incremental updating of representations of email conversations | |
CN102567431A (en) | Document processing method and device | |
Zhang et al. | Discovering frequent induced subgraphs from directed networks |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 13790871; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2014515618; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 13790871; Country of ref document: EP; Kind code of ref document: A1 |