Summary of the invention
To address the shortcomings described in the background art, the invention provides a new KDD* system based on a double-library synergistic mechanism.
Technical scheme 1: a new KDD* system based on a double-library synergistic mechanism, comprising a digital computer formed of a central processing unit and a memory, the memory of the digital computer storing a real database and a primary knowledge base, characterized in that: the primary knowledge base is divided into several related knowledge sub-bases according to each specific domain; each knowledge sub-base is attribute-based and represents its knowledge with a linguistic field and a language value structure; and the digital computer performs the following steps:
1) Data preprocessing: the data in the real database are reprocessed to form a mining database, and a correspondence with the primary knowledge base is established under the attribute-based construction of the two bases;
2) Focusing: the direction of data mining is guided by content entered through human-computer interaction;
3) Directed mining: a heuristic coordinator searches the primary knowledge base to discover knowledge shortages, forms a new focus from them, and selects data from the mining database in a directed manner;
4) Extracting hypothesis rules: using the selected knowledge mining method, the knowledge required by the user and by the new focus is extracted from the mining database and expressed in a specific pattern;
5) Evaluation: the value of the rules obtained in step 4) is evaluated, and the accepted rules are stored in a derived knowledge base.
Technical scheme 2: a new KDD* system based on a double-library synergistic mechanism, comprising a digital computer formed of a central processing unit and a memory, the memory of the digital computer storing a real database and a primary knowledge base, characterized in that: the primary knowledge base is divided into several related knowledge sub-bases according to each specific domain; each knowledge sub-base is attribute-based and represents its knowledge with a linguistic field and a language value structure; and the digital computer performs the following steps:
1) Data preprocessing: the data in the real database are reprocessed to form a mining database, and a correspondence with the primary knowledge base is established under the attribute-based construction of the two bases;
2) Focusing: the direction of data mining is guided by content entered through human-computer interaction;
3) Extracting hypothesis rules: using the selected knowledge discovery method, the knowledge required by the user is extracted from the mining database and expressed in a specific pattern;
4) Real-time maintenance: an interrupt-type coordinator performs a directed search of the primary knowledge base to determine whether each hypothesis rule obtained in step 3) repeats, is redundant with, or contradicts the knowledge already in the primary knowledge base, and handles the rule according to the result;
5) Evaluation: the value of the rules that remain selected after the processing of step 4) is evaluated, and the accepted rules are stored in a derived knowledge base.
Technical scheme 3: a new KDD* system based on a double-library synergistic mechanism, comprising a digital computer formed of a central processing unit and a memory, the memory of the digital computer storing a real database and a primary knowledge base, characterized in that: the primary knowledge base is divided into several related knowledge sub-bases according to each specific domain; each knowledge sub-base is attribute-based and represents its knowledge with a linguistic field and a language value structure; and the digital computer performs the following steps:
1) Data preprocessing: the data in the real database are reprocessed to form a mining database, and a correspondence with the primary knowledge base is established under the attribute-based construction of the two bases;
2) Focusing: the direction of data mining is guided by content entered through human-computer interaction;
3) Directed mining: a heuristic coordinator searches the primary knowledge base to discover knowledge shortages, forms a new focus from them, and selects data from the mining database in a directed manner;
4) Extracting hypothesis rules: using the selected knowledge discovery method, the knowledge required by the user and by the new focus is extracted from the mining database and expressed in a specific pattern;
5) Real-time maintenance: an interrupt-type coordinator performs a directed search of the primary knowledge base to determine whether each hypothesis rule obtained in step 4) repeats, is redundant with, or contradicts the knowledge already in the primary knowledge base, and handles the rule according to the result;
6) Evaluation: the value of the rules that remain selected after the processing of step 5) is evaluated, and the accepted rules are stored in a derived knowledge base.
The memory described in technical schemes 1, 2 and 3 is a mass storage device, or may be a very-large-capacity storage system composed of several mass storage devices.
The digital computer described in technical schemes 1, 2 and 3 may be a digital computing system composed of several computers.
The data reprocessing in step 1) of technical schemes 1, 2 and 3 includes checking the integrity and consistency of the data, handling noisy data, and filling in missing data by statistical methods.
The correspondence in step 1) of technical schemes 1, 2 and 3 is a one-to-one correspondence established between the knowledge nodes of a knowledge sub-base and the layers of the data subclass structures of the data sub-base.
The heuristic coordinator in step 3) of technical schemes 1 and 3 performs the following steps:
1) Search for the linguistic variables whose rule intensity is greater than a given threshold, forming a node set;
2) Combine the nodes in the node set to form a set of tuples;
3) Search the primary knowledge base and remove from the tuple set the tuples that already exist in the primary knowledge base;
4) Sort the remaining tuples by association strength, which gives the priority of the directed search;
5) Scan the tuples one by one in priority order, focusing on the corresponding entry in the database for directed mining.
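The five steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `heuristic_focus`, the restriction to pairs of nodes, the sample node names, and the use of summed intensities as the association strength are all assumptions made for the example.

```python
from itertools import combinations

def heuristic_focus(node_intensity, known_tuples, tuple_strength, threshold):
    """Return the missing tuples ("knowledge shortages") in search-priority order."""
    # Step 1: keep the nodes whose rule intensity exceeds the threshold.
    nodes = sorted(n for n, s in node_intensity.items() if s > threshold)
    # Step 2: combine the nodes into tuples (pairs only, to keep the sketch short).
    candidates = set(combinations(nodes, 2))
    # Step 3: remove tuples that already exist in the primary knowledge base.
    known = {tuple(sorted(t)) for t in known_tuples}
    shortage = candidates - known
    # Step 4: sort the rest by association strength, strongest first.
    return sorted(shortage, key=lambda t: (-tuple_strength(t), t))

# Hypothetical data for illustration only.
intensity = {"temp_high": 0.8, "pressure_high": 0.7, "flow_low": 0.2, "valve_open": 0.6}
known = [("temp_high", "pressure_high")]
strength = lambda t: sum(intensity[n] for n in t)  # assumed strength measure
queue = heuristic_focus(intensity, known, strength, threshold=0.5)
# Step 5 would scan `queue` in order and mine the tables matching each tuple.
```

In this example `flow_low` is dropped by the threshold, the pair already in the knowledge base is removed, and the two remaining pairs are queued with the stronger one first.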
The interrupt-type coordinator in step 4) of technical scheme 2 and step 5) of technical scheme 3 performs the following steps:
1) Read a rule;
2) Look up this rule in the knowledge base; if its rule intensity is greater than the given value, go on to the next step; otherwise go to step 4);
3) Judge whether the rule is repeated, redundant or contradictory; if any one of these holds, go to step 4); otherwise store the rule in the knowledge base and then go on to the next step;
4) Judge whether all rules have been read; if so, end the process; otherwise read the next rule and go to step 2).
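The loop above can be sketched as a simple filter. The text does not define how repetition, redundancy and contradiction are tested, so the three checks below are assumptions chosen for illustration: a rule is repeated if an identical rule exists, redundant if a rule with the same consequent and a strictly smaller antecedent exists, and contradictory if a rule with the same antecedent and the opposite consequent exists. The names `interrupt_coordinator` and `opposite` are hypothetical.

```python
def interrupt_coordinator(hypotheses, kb, min_intensity, opposite):
    """Filter hypothesis rules against the knowledge base `kb` (a set of rules).

    Each hypothesis is (antecedent: frozenset, consequent: str, intensity: float).
    """
    accepted = []
    for antecedent, consequent, intensity in hypotheses:   # steps 1 and 4: read each rule
        if intensity <= min_intensity:                      # step 2: intensity check
            continue
        # Step 3: assumed tests for repetition, redundancy and contradiction.
        repeated = (antecedent, consequent) in kb
        redundant = any(c == consequent and a < antecedent for a, c in kb)
        contradictory = (antecedent, opposite.get(consequent)) in kb
        if repeated or redundant or contradictory:
            continue
        kb.add((antecedent, consequent))                    # store the surviving rule
        accepted.append((antecedent, consequent))
    return accepted

# Hypothetical knowledge base and hypothesis rules.
kb = {(frozenset({"temp_high"}), "pressure_high")}
opposite = {"pressure_high": "pressure_low", "pressure_low": "pressure_high"}
hypotheses = [
    (frozenset({"temp_high"}), "pressure_high", 0.9),                  # repeated
    (frozenset({"temp_high", "valve_closed"}), "pressure_high", 0.8),  # redundant
    (frozenset({"temp_high"}), "pressure_low", 0.7),                   # contradictory
    (frozenset({"flow_low"}), "temp_high", 0.6),                       # new knowledge
    (frozenset({"flow_low"}), "valve_closed", 0.1),                    # too weak
]
accepted = interrupt_coordinator(hypotheses, kb, min_intensity=0.3, opposite=opposite)
```

Only the fourth hypothesis survives and is added to the knowledge base; the others are discarded before any value evaluation takes place.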
The rule value evaluation in step 5) of technical schemes 1 and 2 and step 6) of technical scheme 3 is performed by the user through the human-computer interaction interface, using the various graphs and data analyses provided by visualization tools.
The rule value evaluation in step 5) of technical schemes 1 and 2 and step 6) of technical scheme 3 adopts an automatic evaluation method for causal association rules based on autoepistemic logic, that is, the evaluation is carried out automatically by the digital computer according to the association strength of the rule and a preset threshold. The automatic evaluation method is:
Take the data of cause A and result S to form a set of ordered pairs $P = \{\langle t_w, s_w \rangle\}$ ($w = 1, 2, \ldots, N$), where $t_w$ is a datum in the cause state (variation) space (the cause sample value), $s_w$ is the datum in the result state (variation) space corresponding to that cause datum (the effect sample value), and $N$ is the number of samples in the set. SUP is the support of the rule, CR is the association strength of the rule, and SUP1 is the support count accumulated at each trial, with initial value 0. The following steps are carried out:
1) Take a cause sample value $t_w$ ($w = 1, 2, \ldots, N$) belonging to the general sample space, and obtain the cause state (variation) input vector $a_{t_w}$;
2) Determine the cause state (variation) type $A_k$ ($k = 1, 2, 3, 4, 5$) to which the input vector $a_{t_w}$ belongs: compute by formula (2) the measure $d_H$ between $a_{t_w}$ and each cause state (variation) standard vector $A_i$, and take the type with the smallest measure as the one to which $a_{t_w}$ belongs. A randomly drawn sample set can be regarded as the set of ordered pairs $P = \{\langle t_w, s_w \rangle\}$;
3) Taking the rule as the local major premise, and the cause state (variation) standard vector $A_k$ to which the input vector belongs as the minor premise, the unique matching knowledge matrix $M_{ijk}$ can be found in the evaluation knowledge base by self-organization, and the result state (variation) vector $S_{w1}$ is obtained according to automated reasoning pattern (3);
4) Clustering: compute the effect state (variation) standard vector $\beta$ to which $S_{w1}$ belongs, by computing the measure between $S_{w1}$ and each effect state (variation) standard vector $S_j$ and taking the smallest, where $\mu_{S_{w1}}(i)$ and $\mu_{S_j}(i)$ are their respective coordinates;
5) For the set of ordered pairs $P = \{\langle t_w, s_w \rangle\}$, take the corresponding effect sample value $s_w$ and obtain, by fuzzy clustering, the effect state (variation) standard vector $\gamma$ of the interval to which it belongs; if $\beta = \gamma$, then SUP1 = SUP1 + 1; otherwise SUP1 is unchanged;
6) Repeat the above process N times to obtain SUP = SUP1/N, and compare it with the causal association strength CR of the rule:
if SUP > CR, the rule is accepted;
if SUP ≤ CR, the rule is rejected.
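The counting logic of steps 1) to 6) can be illustrated with the short Python sketch below. It is only a skeleton under stated assumptions: formula (2), the knowledge matrix and reasoning pattern (3) are not reproduced in this text, so the inference step and the fuzzy clustering step are passed in as hypothetical callables (`infer_effect`, `effect_vector`), and the measure is assumed to be a sum of absolute coordinate differences.

```python
def nearest_standard(vec, standards):
    """Index of the standard vector closest to `vec` (assumed measure: sum of |a - b|)."""
    def dist(s):
        return sum(abs(a - b) for a, b in zip(vec, s))
    return min(range(len(standards)), key=lambda k: dist(standards[k]))

def evaluate_rule(pairs, infer_effect, effect_vector, effect_standards, cr):
    """Return (SUP, accepted) for a causal rule over ordered pairs (t_w, s_w)."""
    sup1 = 0
    for t, s in pairs:
        # Steps 1-4: infer the effect vector from the cause and cluster it (beta).
        beta = nearest_standard(infer_effect(t), effect_standards)
        # Step 5: cluster the observed effect sample value (gamma).
        gamma = nearest_standard(effect_vector(s), effect_standards)
        if beta == gamma:
            sup1 += 1
    sup = sup1 / len(pairs)          # step 6: SUP = SUP1 / N
    return sup, sup > cr             # accept only if SUP > CR

# Hypothetical two-state example: "high" = (1, 0), "low" = (0, 1).
effect_standards = [(1.0, 0.0), (0.0, 1.0)]
infer_effect = lambda t: (0.9, 0.1) if t > 50 else (0.2, 0.8)   # stand-in for pattern (3)
effect_vector = lambda s: (1.0, 0.0) if s > 5 else (0.0, 1.0)   # stand-in for fuzzy clustering
pairs = [(60, 7), (70, 8), (40, 2), (55, 3)]
sup, accepted = evaluate_rule(pairs, infer_effect, effect_vector, effect_standards, cr=0.6)
```

Here three of the four pairs agree, so SUP is 0.75 and the rule is accepted against CR = 0.6.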
A new method for mining association rules based on the double-library synergistic mechanism, characterized by comprising the following steps:
1) Data preprocessing: the user selects the real database, and the continuous attributes in the real database are discretized to form a mining database consisting of several data tables;
2) The primary knowledge base is searched to find "knowledge shortages", producing a knowledge shortage set;
3) The rule intensity of each rule in the knowledge shortage set is computed, the rules are accepted or rejected according to a threshold, and the remaining rules are sorted by rule intensity;
4) Directed mining is carried out on the mining database to form hypothesis rules;
5) The qualifying rules are processed by the interrupt-type coordinator;
6) The rules that pass the interrupt-type coordinator are evaluated; a rule that passes the evaluation is stored in the knowledge base, and a rule that fails is deleted.
The synergistic mechanism between the two bases (database and knowledge base) proposed by the new KDD* system fundamentally addresses the deficiencies of KDD. At the same time, introducing the double-library synergistic mechanism further improves the functions of KDD, mainly in two respects: (1) in data mining, the mechanism lets the knowledge base take part dynamically in the mining of the database; the user's prior knowledge and the knowledge already in the knowledge base can produce "directed mining" through this mechanism, which improves cognitive autonomy and avoids massive searches; (2) in knowledge base maintenance, the mechanism allows the content of the knowledge base to be revised and maintained in real time during data mining, including checks for repetition and redundancy and the handling of contradictions.
The significance of the invention is as follows: 1) in addition to mining knowledge according to user requirements and interests, it proposes a way of automatically and heuristically directing mining according to the "knowledge shortages" in the primary knowledge base, that is, it improves "cognitive autonomy" and more effectively overcomes the limitations of domain users; 2) it significantly reduces the amount of evaluation needed after hypothesis rules are mined; 3) through the "structure correspondence" between the two bases, it greatly reduces the search space and improves mining efficiency; 4) it more effectively solves problems such as redundancy and consistency of the knowledge base after new and old knowledge are combined, ensuring real-time maintenance of the knowledge base; 5) overall, it treats KDD as an open system and, through the extensive connection between the KDD process and the primary knowledge base, improves and optimizes the structure, process and operating mechanism of KDD.
The invention embeds two coordinators into KDD, thereby fundamentally changing the inherent operating mechanism of KDD and forming, in structure and function, an open and optimized extension of KDD; on this basis new structural models of knowledge discovery can also be derived.
Embodiment
1. Theoretical foundation of the new KDD* system:
According to the relations shown in Fig. 2, the following definitions are given:
1.1 Knowledge representation: linguistic field and language value structure:
Definition 1: $C = \langle D, I, N, \le_N \rangle$ is called a linguistic field if the following conditions are satisfied:
(1) $D$ is a set of intersecting closed intervals on the domain $R$ of the basic variable, and $D^+$ is its corresponding open set;
(2) $N \neq \emptyset$ is a finite set of language values;
(3) $\le_N$ is an order relation on $N$;
(4) $I: N \to D$ is the standard value mapping and is isotone, that is: $\forall n_1, n_2 \in N\ (n_1 \neq n_2 \wedge n_1 \le_N n_2 \to I(n_1) \le I(n_2))$, where $\le$ is a partial order.
Definition 2: For a linguistic field $C = \langle D, I, N, \le_N \rangle$, $F = \langle D, W, K \rangle$ is called a language value structure of $C$ if:
(1) $C$ satisfies Definition 1;
(2) $K$ is a natural number;
(3) $W: N \to R^K$ satisfies:
$\forall n_1, n_2 \in N\ (n_1 \le_N n_2 \to W(n_1) \le_{dic} W(n_2))$,
$\forall n_1, n_2 \in N\ (n_1 \neq n_2 \to W(n_1) \neq W(n_2))$.
Here $\le_{dic}$ is the lexicographic order on $[0,1]^K$, that is, $(a_1, \ldots, a_K) \le_{dic} (b_1, \ldots, b_K)$ if and only if there exists $h$ such that $a_j = b_j$ for $0 \le j < h$ and $a_h \le b_h$.
1.2 Establishing the general homotopy relation between the mining database and the knowledge base:
1) Knowledge nodes:
Definition 3: In a knowledge sub-base relating to domain $X$, knowledge expressed in the following form is called uncertain rule-type knowledge:
(1) $P(X) \Rightarrow Q(X)$
where $P(X)$, $P_i(X)$, $Q(X)$ and $Q_j(X)$ each take the form "attribute word" (or "descriptive word") + "degree word".
Definition 4: In Definition 3, $P(X)$ and $P_i(X)$ are called knowledge start nodes, and $Q(X)$ and $Q_j(X)$ are called knowledge end nodes; both kinds are called elemental knowledge nodes. Conjunctions of elemental knowledge nodes are called composite knowledge nodes. Elemental and composite nodes are collectively called knowledge nodes.
Clearly, the attribute named by each knowledge node constitutes a linguistic field, for example a temperature field or a pressure field, and each state or degree of variation constitutes a language value structure, for example, in the temperature field: very high, high, medium, low, very low.
Theorem 1: In a knowledge sub-base relating to domain $X$ (containing several linguistic fields), let $E$ (a finite set) denote the set of all knowledge nodes and $\rho(E)$ its power set; then $\langle E, \rho(E) \rangle$ constitutes a maximized manifold.
2) Data subclass structure:
Definition 5: For domain $X$, in the data sub-base corresponding to the knowledge sub-base, the structure $S = \langle U, N, I, W \rangle$ corresponding to each elemental knowledge node is called a data subclass structure, where:
$U \neq \emptyset$, $U = \{u_1, u_2, \ldots\}$ (each $u_i$ is a data set, formed by the mapping $I$ below), is, under a specific linguistic field and language value structure, the class of data sets (called the data subclass) corresponding to the "attribute word" or "descriptive word" of the elemental knowledge node;
$N \neq \emptyset$ is a finite set of language values, namely the set of language values describing the "degree word" of the elemental knowledge node;
$I: N \to U$ is the mapping that partitions the class $U$ of data sets by language value. When the data are continuously distributed, $U$ is usually divided into overlapping intervals (that is, $\exists i, j\ (u_i \cap u_j \neq \emptyset)$);
$W: N \to [0,1]^K$ ($K$ a positive integer) satisfies:
$\forall n_1, n_2 \in N\ (n_1 \le_N n_2 \to W(n_1) \le_{dic} W(n_2))$,
$\forall n_1, n_2 \in N\ (n_1 \neq n_2 \to W(n_1) \neq W(n_2))$.
Here $\le_N$ is the order relation on $N$, $\le_{dic}$ is the lexicographic order on $[0,1]^K$, and $W(n)$ ($n \in N$) is the standard vector of the language value, taken at the midpoint of the corresponding language value interval and its neighborhood (that is, the vector corresponding to the standard sample).
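The two conditions on $W$ can be checked mechanically. The sketch below is illustrative only: the function name `is_valid_standard_mapping` and the sample vectors are assumptions. It relies on the fact that Python compares tuples lexicographically, which matches $\le_{dic}$, and that checking adjacent pairs of an ordered list is enough because the lexicographic order is transitive.

```python
def is_valid_standard_mapping(ordered_values, W):
    """Check that W is order-preserving under the lexicographic order and injective.

    `ordered_values` lists the language values in increasing order of <=_N;
    `W` maps each language value to a tuple in [0, 1]^K.
    """
    vectors = [W[n] for n in ordered_values]
    in_unit_cube = all(0.0 <= x <= 1.0 for v in vectors for x in v)
    # Python tuple comparison is lexicographic, i.e. it implements <=_dic.
    monotone = all(a <= b for a, b in zip(vectors, vectors[1:]))
    injective = len(set(vectors)) == len(vectors)
    return in_unit_cube and monotone and injective

# Hypothetical language values of a temperature field with K = 2.
values = ["very low", "low", "medium", "high", "very high"]
W_good = {"very low": (0.0, 0.0), "low": (0.0, 0.5), "medium": (0.5, 0.0),
          "high": (0.5, 0.5), "very high": (1.0, 0.0)}
W_bad = dict(W_good, high=(0.0, 0.5))   # duplicates "low" and breaks the order

ok_good = is_valid_standard_mapping(values, W_good)
ok_bad = is_valid_standard_mapping(values, W_bad)
```

`W_good` satisfies both conditions, while `W_bad` fails because two distinct language values share a vector and the order is no longer preserved.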
Definition 6: In a data subclass structure $S = \langle U, N, I, W \rangle$, a triple $\langle u_i, n_i, r_i \rangle$ satisfying the following conditions is called a layer of $S$:
(1) $u_i \in U$, where $u_i$ ($i = 1, 2, 3, \ldots, v$) is the sample data set in the $i$-th preliminarily delimited interval;
(2) $n_i \in N$, where $n_i$ ($i = 1, 2, 3, \ldots, v$) is the language value determined by the interval to which the sample data set belongs;
(3) $r_i$ ($i = 1, 2, 3, \ldots, v$) is determined as follows: (i) when the sample data in $u_i$ fall in a non-overlapping interval, $r_i$ is taken as the standard vector, so that $r_i \in W(n)$; (ii) when the sample data in $u_i$ fall in an overlapping interval, $r_i^*$ is obtained by an interpolation formula that uses the standard sample data of the $i$-th interval, the length $l_i$ of the $i$-th interval, the standard vector $A_i$ of the $i$-th interval, and the standard vector of the adjacent interval determined by where the data of $u_i$ fall.
Then, according to the measure between $r_i^*$ and $r_i$, $r_{i+1}$, or between $r_i^*$ and $r_i$, $r_{i-1}$, it is decided whether to take $r_i$, $r_{i+1}$ or $r_{i-1}$, and this part of the data is accordingly kept in layer $i$ or moved to layer $i+1$ or layer $i-1$.
Clearly, data subclasses and data subclass structures are in one-to-one correspondence.
Theorem 2: For domain $X$, in the data sub-base corresponding to the knowledge sub-base, let $F$ (a finite set) denote the set of all data subclasses (structures) and $\rho(F)$ its power set; then $\langle F, \rho(F) \rangle$ constitutes a maximized manifold.
3) Relation between "knowledge nodes" and "data subclasses (structures)":
Definition 7: Let $X$ and $Y$ be arbitrary manifolds. A continuous mapping $F: X \times [0,1]^n \to Y$ is called a general homotopy of mappings from $X$ to $Y$ (an extension of the ordinary notion of homotopy).
Definition 8: Let $f$ and $g$ be continuous mappings from the topological space $X$ to $Y$. If there exists a general homotopy $F(x, t) = f_t(x)$ such that for every point $x \in X$, $f(x) = F(x, (0, \ldots, 0))$ and $g(x) = F(x, (1, \ldots, 1))$, then $g$ is said to be general-homotopic to $f$, and $F$ is called a general homotopy between the continuous mappings $f$ and $g$, written $f \sim g$.
Definition 9: A continuous mapping $f$ from the topological space $X$ to the manifold $Y$ is called a general homotopy equivalence if there exists a continuous mapping $g$ from $Y$ to $X$ such that the composite mappings $g \circ f$ and $f \circ g$, from $X$ and $Y$ respectively to themselves, are general-homotopic to the identity mappings $I_X$ and $I_Y$ of the corresponding spaces, written $g \circ f \sim I_X$ and $f \circ g \sim I_Y$. The mapping $g$ is then also a general homotopy equivalence and is called the inverse equivalence of $f$.
Definition 10: Given two manifolds, if there exists at least one general homotopy equivalence from one space to the other, the two spaces are said to be of the same general homotopy type.
Theorem 3 (structure correspondence theorem): For domain $X$, in the corresponding knowledge sub-base and data sub-base, the manifold $\langle E, \rho(E) \rangle$ of knowledge nodes and the manifold $\langle F, \rho(F) \rangle$ of data subclasses (structures) are spaces of the same general homotopy type.
From the above analysis: when a space is transformed into a space of the same general homotopy type, the structure of the set of general homotopy classes does not change, so in general homotopy theory spaces of the same general homotopy type can be regarded as identical. Theorem 3 therefore provides a one-to-one correspondence between the "knowledge nodes" in a knowledge sub-base and the layers of the "data subclass structures" in the corresponding data sub-base, as shown in Fig. 5.
2. Realization of the double-library synergistic mechanism:
2.1 Fig. 3A shows the first scheme of the invention, whose main steps are:
1) Data preprocessing: the data in the real database are reprocessed to form a mining database, and a correspondence with the primary knowledge base is established under the attribute-based construction of the two bases;
2) Focusing: the direction of data mining is guided by content entered through human-computer interaction;
3) Directed mining: the heuristic coordinator searches the primary knowledge base to discover knowledge shortages, and on that basis selects data from the mining database in a directed manner;
4) Extracting hypothesis rules: using the selected knowledge mining method, the knowledge required by the user is extracted from the mining database and expressed in a specific pattern;
5) Evaluation: the value of the rules obtained in step 4) is evaluated, and the accepted rules are stored in the derived knowledge base.
Fig. 3B shows the second scheme of the invention, whose main steps are:
1) Data preprocessing: the data in the real database are reprocessed to form a mining database, and a correspondence with the primary knowledge base is established under the attribute-based construction of the two bases;
2) Focusing: the direction of data mining is guided by content entered through human-computer interaction;
3) Extracting hypothesis rules: using the selected knowledge discovery method, the knowledge required by the user is extracted from the mining database and expressed in a specific pattern;
4) Real-time maintenance: the interrupt-type coordinator performs a directed search of the primary knowledge base to determine whether each hypothesis rule obtained in step 3) repeats, is redundant with, or contradicts the knowledge already in the primary knowledge base, and handles the rule according to the result;
5) Evaluation: the value of the rules that remain selected after the processing of step 4) is evaluated, and the accepted rules are stored in the derived knowledge base.
Fig. 3C shows the third scheme of the invention, whose main steps are:
1) Data preprocessing: the data in the real database are reprocessed to form a mining database, and a correspondence with the primary knowledge base is established under the attribute-based construction of the two bases;
2) Focusing: the direction of data mining is guided by content entered through human-computer interaction;
3) Directed mining: the heuristic coordinator searches the primary knowledge base to discover knowledge shortages, and on that basis selects data from the mining database in a directed manner;
4) Extracting hypothesis rules: using the selected knowledge discovery method, the knowledge required by the user is extracted from the mining database and expressed in a specific pattern;
5) Real-time maintenance: the interrupt-type coordinator performs a directed search of the primary knowledge base to determine whether each hypothesis rule obtained in step 4) repeats, is redundant with, or contradicts the knowledge already in the primary knowledge base, and handles the rule according to the result;
6) Evaluation: the value of the rules that remain selected after the processing of step 5) is evaluated, and the accepted rules are stored in the derived knowledge base.
Technical scheme 1, shown in Fig. 3A, has no real-time maintenance step; technical scheme 2, shown in Fig. 3B, has no directed mining step; and technical scheme 3, shown in Fig. 3C, includes both the directed mining and the real-time maintenance steps. This embodiment therefore describes in detail the technical scheme of Fig. 3C; the other two schemes are realized on the same basic principles.
Fig. 4 further illustrates the structure of the invention. According to the theoretical foundation and the structure correspondence theorem above, each elemental knowledge node in the knowledge base corresponds to a layer in the database, that is, to the attribute degree word of that elemental node. For this purpose, preprocessing divides the real database into n tables, table1, table2, ..., tablen, where n is the number of attribute degree words and the index k of tablek is the ID of the corresponding attribute degree word. Each table has a single field, which stores the IDs of those records in the real database that are in the state described by attribute degree word k. The mining database consists of these n tables, so for each piece of missing knowledge there is no need to search the entire database; only the few tables corresponding to the knowledge nodes involved need to be scanned. This is particularly important for large databases: these small tables can be loaded into memory for computation, whereas the entire database cannot (a limitation that affects the Apriori method).
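The table layout described above can be sketched in Python, with each table held as a set of record IDs. This is an illustrative sketch only: the function names, the two-level discretization into `_high` and `_low`, and the sample records are assumptions, not part of the original text.

```python
from collections import defaultdict

def build_mining_tables(records, degree_word_of):
    """Build one ID table per attribute degree word.

    `records` maps a record ID to its attribute values; `degree_word_of`
    maps (attribute, value) to the degree word describing that value.
    """
    tables = defaultdict(set)
    for rec_id, rec in records.items():
        for attr, value in rec.items():
            tables[degree_word_of(attr, value)].add(rec_id)
    return tables

def support(tables, nodes, n_records):
    """Support of a tuple of knowledge nodes, scanning only their tables."""
    ids = set.intersection(*(tables[n] for n in nodes))
    return len(ids) / n_records

# Hypothetical real database and discretization thresholds.
records = {
    1: {"temp": 95, "pressure": 8},
    2: {"temp": 90, "pressure": 3},
    3: {"temp": 40, "pressure": 2},
    4: {"temp": 92, "pressure": 9},
}
thresholds = {"temp": 80, "pressure": 5}
degree_word_of = lambda attr, v: f"{attr}_high" if v > thresholds[attr] else f"{attr}_low"

tables = build_mining_tables(records, degree_word_of)
sup = support(tables, ("temp_high", "pressure_high"), len(records))
```

Computing the support of "temp_high and pressure_high" touches only two small ID sets, which is the point of the layout: the full records are never rescanned once the tables exist.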
The knowledge sub-base is attribute-based, which makes it easy to form the correspondence between knowledge nodes and data subclasses and so lays the foundation for directed data mining. Logical structure: within the corresponding domain, the rule base is partitioned by attribute into several rule sub-bases, each of which corresponds to the mining database.
2.2 The double-library synergistic mechanism is realized mainly by the heuristic coordinator and the interrupt-type coordinator.
The function of the heuristic coordinator is to search the knowledge base for unassociated states among the "knowledge nodes" in order to find "knowledge shortages" and produce an "original idea image", thereby prompting and activating the corresponding "data classes" in the real database and producing a "directed mining process"; that is, the computer performs the focusing automatically.
The function of the interrupt-type coordinator is as follows: once rules (knowledge) have been generated by focusing on the mass of data in the real database, it makes the KDD process "interrupt" and searches the corresponding position in the knowledge base for repetition, redundancy, contradiction, subordination, circularity and the like with respect to the generated rule. If any is found, the generated rule is cancelled, or handled accordingly, and control returns to the "top" of the KDD process; if none is found, the KDD process continues with knowledge evaluation.
2.3 The software implementation of KDD* mainly comprises the implementation of the heuristic coordinator, the KDD process and the interrupt-type coordinator.
The heuristic coordinator finds "knowledge shortages" mainly by computing the reachability matrix of a directed hypergraph, and then prunes them against a rule intensity threshold to form the focus. The KDD process is driven mainly by a confidence threshold (taking association rule mining as the example). The interrupt-type coordinator uses SQL statements or the reachability matrix of the directed hypergraph to detect repetition, contradiction, redundancy, circularity and subordination of knowledge, and handles each case accordingly.
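As a rough illustration of the reachability idea, the sketch below computes the transitive closure of an ordinary directed graph with Warshall's algorithm and lists the ordered node pairs that are not yet connected. This is a deliberate simplification: the text uses a directed hypergraph, whose construction is not given here, so the plain-graph model and the function names are assumptions.

```python
def reachability(n, edges):
    """Reachability matrix of a directed graph on nodes 0..n-1 (Warshall's algorithm)."""
    reach = [[False] * n for _ in range(n)]
    for i, j in edges:
        reach[i][j] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def knowledge_shortage(n, edges):
    """Ordered pairs (i, j), i != j, for which no rule chain leads from i to j."""
    reach = reachability(n, edges)
    return [(i, j) for i in range(n) for j in range(n) if i != j and not reach[i][j]]

# Hypothetical knowledge base: node 0 -> node 1 and node 1 -> node 2.
edges = [(0, 1), (1, 2)]
missing = knowledge_shortage(3, edges)
```

Node 2 is reachable from node 0 through the chain of two rules, so the pair (0, 2) is not reported as a shortage; only the reverse directions remain as candidates, which would then be pruned by the rule intensity threshold.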
Several related concepts
Support and confidence of a rule: defined as for ordinary association rules.
Definition 1 (degree of interest, Interest): the degree of the user's interest in each linguistic variable or language value in the database, that is, in each elemental knowledge node in the knowledge base. During preprocessing the user first assigns a degree of interest to each language value, i.e., to the corresponding elemental knowledge node. It is written $Interest(e_k)$ and takes values in [0,1]; the larger the value, the more interested the user is in that elemental knowledge node. For a composite knowledge node $F = e_1 \wedge e_2 \wedge \cdots \wedge e_m$, it is defined as the mean of the degrees of interest of its elemental knowledge nodes, that is: $Interest(F) = \frac{1}{m} \sum_{k=1}^{m} Interest(e_k)$.
Define the rule length as the number of elemental knowledge nodes contained in a rule, written $Len(r_i)$; then a rule $r_i = F \to h$ has a degree of interest $Interest(r_i)$,
where $Len(r_i) = m + 1$. The degree of interest of a rule is a combined measure of the number of elemental knowledge nodes appearing in the rule and of their degrees of interest. In general, the more elemental knowledge nodes of high interest and the fewer of low interest a rule contains, the more interested the user is considered to be in the rule.
Definition 2 (rule intensity, Intensity): combines the objective degree of support for a rule with the subjective degree of interest in it. The objective degree of support is called the support, and the subjective degree of interest is called the degree of interest (see Definition 1 above). For a rule $r_i = F \to h$, the rule intensity is $Intensity(r_i) = (Interest(r_i) + sup(r_i)) / 2$.
This definition of rule intensity is made from a practical standpoint: it is a convention adopted for ease of measurement that preserves the essential characteristics.
Earlier mining methods, such as the Apriori algorithm, mine rules using objective measures only. This makes it hard to obtain the rules the user is really interested in, and a great deal of manual screening is needed. Rule intensity takes both the objective and the subjective side into account. With rule intensity as the index that drives mining, the two sides balance each other: on the one hand, even if the support is small, the rule intensity will not be too small as long as the user is very interested in the missing knowledge, so the hypothesis rule can still be focused on and trigger the mining process; on the other hand, if the user is not very interested in a piece of missing knowledge, it is focused on and triggers the mining process only when it has very high support. The definition of rule intensity also uses the notion of support, but here the support threshold can be set lower than in the Apriori algorithm, which means that missing knowledge is pruned cautiously.
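The trade-off described above can be seen in a few lines of Python. The sketch uses the two formulas stated in the text (the mean interest of a composite node and the rule intensity as the average of interest and support); the rule-level degree of interest is passed in directly as a number, since its formula is not reproduced here, and the threshold of 0.3 and the sample values are hypothetical.

```python
def composite_interest(node_interests):
    """Interest of a composite node: mean interest of its elemental nodes."""
    return sum(node_interests) / len(node_interests)

def rule_intensity(interest, sup):
    """Intensity(r) = (Interest(r) + sup(r)) / 2."""
    return (interest + sup) / 2

threshold = 0.3  # hypothetical rule intensity threshold

# A composite antecedent built from two elemental nodes of high interest.
antecedent_interest = composite_interest([0.9, 0.7])

# Same low support, different (assumed) rule-level interest values.
liked = rule_intensity(interest=0.8, sup=0.1)      # user cares about this knowledge
ignored = rule_intensity(interest=0.2, sup=0.1)    # user does not care
popular = rule_intensity(interest=0.2, sup=0.9)    # low interest, very high support

focused = [name for name, value in
           [("liked", liked), ("ignored", ignored), ("popular", popular)]
           if value > threshold]
```

The low-support rule the user cares about and the high-support rule the user does not care about both pass the threshold, while the rule that is both unsupported and uninteresting is pruned.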
By above to KDD
*The introduction of new system global structure illustraton of model and theoretical foundation, we as can be seen the technology of double-library synergistic mechanism realize it being to construct interruption (R) type telegon and inspiration (S) type telegon.The major function of interrupt-type telegon is: generate hypothesis rule (knowledge) when line focus from the mass data in True Data storehouse after, make the KDD process produce " interruption ", and correspondence position have or not the repetition of this create-rule, redundant and contradiction (beam search process) in the removal search knowledge base.If have, then cancel " top " that returns KDD after this create-rule or the respective handling; If do not have, then continue the KDD process, promptly estimate warehouse-in with the result.The major function of inspiration type telegon is: building under the principle of storehouse based on the knowledge base of attribute, not related attitude by " knowledge node " in the search knowledge base, to find " knowledge shortage ", produce " original idea image ", thereby inspire and activate corresponding " data class " in the True Data storehouse, to produce " directed excavation process ".
The key to the present invention is the double-base cooperating mechanism: the interrupt-type coordinator and the heuristic coordinator are applied to the hypothesis rules obtained so that the knowledge base is maintained in real time, and rule intensity is used to trigger data focusing for data mining.
Therefore, the most critical problem in realizing the double-base cooperating mechanism is to realize the "beam search process" (which reduces the search space) and the "directed mining process" (which reduces the mining space). The necessary condition for this is to construct the correspondence between the "knowledge nodes" in the knowledge base and the "data subclasses (structures)" in the real database.
Heuristic coordinator:
The main purpose of the heuristic coordinator is to give the system a second route to focusing. In the classical KDD process, focusing is normally driven by a direction of interest supplied by the user, and KDD mines along that direction. If mining follows only that direction, information that is latent in the mass of data and potentially useful to the user tends to be overlooked. The heuristic coordinator helps KDD find as much user-relevant information as possible, compensating for the user's own limitations and improving the cognitive autonomy of the machine.
The steps of the heuristic coordinator are shown in Figure 6:
When the heuristic coordinator is called, the program goes to step 101, which searches for linguistic variables whose rule intensity exceeds a given threshold and forms a node set. Step 102 combines the nodes of the node set to form a set of tuples. Step 103 searches the primary knowledge base and removes from the tuple set the tuples that already exist there. Step 104 sorts the remaining tuples by association strength to set the priority of the beam search. Step 105 scans the tuples one by one in priority order and focuses on the corresponding entry in the database for directed mining. Step 106 passes control to the KDD process.
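Steps 101 to 104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `heuristic_focus`, the dictionary of intensities, and the `association` scoring function are all hypothetical, and tuples are limited to pairs for brevity.

```python
from itertools import combinations

def heuristic_focus(intensity, known_tuples, association, threshold):
    """Return candidate 'knowledge shortage' tuples in priority order.

    intensity:    dict mapping linguistic variable -> rule intensity (assumed given)
    known_tuples: set of frozensets already present in the primary knowledge base
    association:  function scoring a tuple's association strength (assumed given)
    """
    # Step 101: keep the variables whose intensity exceeds the threshold.
    nodes = sorted(v for v, s in intensity.items() if s > threshold)
    # Step 102: combine nodes into tuples (pairs only, an assumption).
    tuples = [frozenset(p) for p in combinations(nodes, 2)]
    # Step 103: drop tuples that already exist in the knowledge base.
    missing = [t for t in tuples if t not in known_tuples]
    # Step 104: order by association strength, strongest first.
    return sorted(missing, key=association, reverse=True)
```

The caller would then scan the returned list in order (step 105) and hand each tuple to the KDD process (step 106).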
Interrupt-type coordinator:
In a traditional knowledge discovery system, the hypotheses produced by the KDD process are evaluated directly. When accepted knowledge is merged into the knowledge base, the knowledge base management system checks the consistency and redundancy of the knowledge base and handles contradictory and redundant knowledge to form a new knowledge base. The drawback of this approach is that many meaningless hypotheses are evaluated, and the accumulated problems increase the burden of the consistency and redundancy checks.
Because the interrupt-type coordinator intervenes in the KDD process, repeated, contradictory, and redundant knowledge can be eliminated in real time and as early as possible, so that only hypotheses that may become new knowledge are evaluated and the evaluation workload is greatly reduced. The steps are shown in Figure 7:
When the interrupt-type coordinator is called, the program goes to step 201, which initializes the rule counter I so that it points to the first rule. Step 202 judges whether the end of the knowledge base has been reached; if so, step 203 closes the knowledge base and ends the call; if not, step 204 is executed. Step 204 looks up the I-th rule in the knowledge base, and step 205 is executed next. Step 205 judges whether the rule intensity is greater than 0.5; if not, step 206 adds 1 to I and returns to step 202; if so, step 207 is executed. Step 207 judges whether the generated rule repeats a rule in the knowledge base; if so, step 208 adds 1 to I and returns to step 202; if not, step 209 is executed. Step 209 judges whether the generated rule is redundant with respect to a rule in the knowledge base; if so, step 210 adds 1 to I and returns to step 202; if not, step 211 is executed. Step 211 judges whether the generated rule contradicts a rule in the knowledge base; if so, step 212 adds 1 to I and returns to step 202; if not, step 213 is executed. Step 213 stores the I-th rule in the knowledge base, and step 214 then adds 1 to I and returns to step 202.
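The three checks in steps 207, 209, and 211 can be illustrated with a small sketch. The text does not define repetition, redundancy, or contradiction formally, so the definitions used below are assumptions made for illustration, as are the `Rule` type and the `check_rule` function.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

class Rule(NamedTuple):
    antecedent: FrozenSet[str]   # e.g. {"odor=foul"}
    consequent: Tuple[str, str]  # (attribute, value), e.g. ("class", "poisonous")

def check_rule(new: Rule, knowledge_base: List[Rule]) -> str:
    """Classify a hypothesis rule as 'repeated', 'redundant', 'contradictory' or 'new'."""
    for old in knowledge_base:
        same_attribute = old.consequent[0] == new.consequent[0]
        same_consequent = old.consequent == new.consequent
        # Repetition: identical antecedent and consequent.
        if old.antecedent == new.antecedent and same_consequent:
            return "repeated"
        # Assumed redundancy: a more general stored rule already
        # yields the same consequent.
        if old.antecedent < new.antecedent and same_consequent:
            return "redundant"
        # Assumed contradiction: same antecedent, same attribute,
        # different value.
        if old.antecedent == new.antecedent and same_attribute:
            return "contradictory"
    return "new"
```

Only rules classified as "new" would go on to evaluation and storage; the others are cancelled or handled before any evaluation effort is spent on them.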
3. A new association rule mining method based on the double-base cooperating mechanism: the Maradbcm method:
Current international research on KDD mainly follows the lines of task description, knowledge evaluation, and knowledge representation, and centers on effective knowledge discovery algorithms; KDD is seldom studied as a cognitive complex system in order to find its inherent regularities. Most current KDD algorithms do not take the knowledge base into account, so many of the hypothesis rules they mine repeat, are redundant with, or even contradict knowledge already in the knowledge base, and the novelty required by the definition of KDD cannot be achieved. Nor do they post-process the rules produced: repetition, redundancy, contradiction, and so on among these rules, or between them and the primary knowledge base, are not handled.
The new association rule mining method based on the double-base cooperating mechanism, abbreviated as the Maradbcm method (mining association rules algorithm based on double-bases cooperating mechanism), can effectively solve the problems mentioned above.
The specific techniques of the association rule mining method based on the double-base cooperating mechanism are mainly the classical discretization methods of data mining, together with the heuristic coordinator and the interrupt-type coordinator of the double-base cooperating mechanism in the KDD* system, which are used to mine association rules.
Let the rule intensity threshold be MinIntensity, the support threshold MinSup, the confidence threshold MinCon, and the sufficiency factor threshold MinLS. Let m be the number of knowledge element nodes pi with support(pi) > MinSup, let n be the number of knowledge compound nodes in the reachability matrix plus m, and let attr(pi) be the attribute corresponding to the knowledge element node pi.
1) Data preprocessing: the user selects the real database, and the continuous attributes in it are discretized to form the mining database (n tables: table1, table2, ..., tablen);
2) Finding "knowledge shortage": the knowledge in the knowledge base is represented by a directed hypergraph H with adjacency matrix A(H). On this basis, a new algorithm computes the reachability matrix P(H) of the directed hypergraph; the 0 elements of P(H) are the short knowledge;
3) Producing K2: let K denote the set of short knowledge and Km the set of short knowledge whose rule length is m, i.e. Km = {r | Len(r) = m}. Because K contains very many elements, K2 is pruned using the rule intensity Intensity(ri) introduced above, and only rules ri with Intensity(ri) > MinIntensity are focused on. That is, a piece of short knowledge ri: ep → eq (ri ∈ K2) must satisfy sup(ep), sup(eq) > MinSup, where the support used in Intensity(ri) is sup(ri) = min(sup(ep), sup(eq));
4)m=2;
5) Producing hypothesis rules from Km: for each piece of short knowledge ri: e1 ∧ e2 ∧ ... ∧ ep → eq (ri ∈ Km), carry out directed mining, i.e. mine the data tables table1, table2, ..., tablep, tableq, and compute Con(ri) and Intensity(ri). If Con(ri) > MinCon and Intensity(ri) > MinIntensity, go to 6); otherwise, delete the rule;
6) Process rule ri with the interrupt-type coordinator: search the corresponding position in the primary knowledge base for repetition, redundancy, contradiction, subordination, circularity, and so on with respect to the generated rule. If any is found, cancel the generated rule or handle it accordingly, and go to 8); if none is found, go to 7);
7) Evaluate rule ri. If it passes the evaluation, store it, recompute the reachability matrix of the directed hypergraph, and adjust Km; if it does not pass, delete the rule;
8) Check whether Km has been exhausted. If so, go to 9); if not, go to 5) to process the next rule;
9) m = m + 1; if Km = φ, go to 10); otherwise, go to 5);
10) Display the newly produced rules;
11) End.
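The idea behind step 2), reading knowledge shortages off the zero entries of a reachability matrix, can be sketched on an ordinary directed graph. The algorithm for directed hypergraphs referred to in step 2) is not given in this text, so the sketch below substitutes Warshall's transitive closure algorithm on a plain 0/1 adjacency matrix; the function names are hypothetical.

```python
def reachability(adj):
    """Transitive closure of a 0/1 adjacency matrix (Warshall's algorithm)."""
    n = len(adj)
    reach = [row[:] for row in adj]  # copy so the input is not modified
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = 1
    return reach

def knowledge_shortages(adj):
    """Return the (i, j) pairs, i != j, where node j is not reachable from node i."""
    reach = reachability(adj)
    n = len(adj)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and reach[i][j] == 0]
```

For the chain 0 → 1 → 2, for example, the closure adds the entry 0 → 2, and the remaining zero entries (1, 0), (2, 0), and (2, 1) are the candidate shortages that steps 3) and 5) would then prune by rule intensity.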
Fig. 8 shows the program flow chart:
Step 302 preprocesses the real database to form the mining database. Step 303 sets the counting pointer to 1. Step 304 produces from the mining database the set of all items whose support exceeds the minimum support, i.e. the large item set Li. Step 305 produces the candidate set Ci+1 from the knowledge base. Step 306 judges whether the candidate set is empty; if so, go to step 314; otherwise execute step 307. Step 307 computes the rule intensity Intensity(cm). Step 308 judges whether the rule intensity is less than the rule intensity threshold MinIntensity; if so, step 309 deletes cm; if not, step 310 is executed. Step 310 produces the knowledge shortage set Ki+1. Step 311 judges whether the knowledge shortage set Ki+1 is empty; if so, go to step 314; otherwise execute step 312. Step 312 calls the KDD process to mine the data. Step 313 adds 1 to the counting pointer and returns to step 305. Step 314 displays the newly produced rules. Step 315 ends the run.
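The loop of steps 305 to 313 can be sketched as below. Candidate generation, the intensity measure, and the KDD process itself are passed in as functions, because this part of the text does not specify them; `mine_levelwise` and all of its parameters are hypothetical names, and `max_level` is a safety bound added for the sketch only.

```python
def mine_levelwise(candidates_at, intensity, kdd_process, min_intensity, max_level=10):
    """Level-wise loop: generate candidates, prune by intensity, mine the rest."""
    new_rules = []
    level = 1                                    # step 303
    while level <= max_level:
        candidates = candidates_at(level + 1)    # step 305: candidate set C(i+1)
        if not candidates:                       # step 306
            break
        # Steps 307-310: discard candidates whose intensity is below the threshold.
        shortage = [c for c in candidates if intensity(c) >= min_intensity]
        if not shortage:                         # step 311
            break
        new_rules.extend(kdd_process(shortage))  # step 312
        level += 1                               # step 313
    return new_rules                             # step 314 would display these
```

Passing the three stages as callables keeps the control flow of Fig. 8 visible without committing to any particular candidate generator or intensity formula.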
The program flow chart of the KDD process is shown in Figure 9:
Step 401 performs directed mining on the mining database. Step 402 computes the support, confidence, and sufficiency factor of the rule. Step 403 compares the values obtained in step 402 with their respective thresholds; if the support exceeds the support threshold, the confidence exceeds the confidence threshold, and the sufficiency factor exceeds the sufficiency factor threshold, step 404 is executed; otherwise step 405 is executed. Step 404 calls the interrupt-type coordinator to process the resulting rule. Step 405 ends this procedure.
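Steps 402 and 403 can be sketched for a rule with antecedent E and consequent H. The sketch assumes that records are sets of items, that the record list is non-empty, and that the sufficiency factor is the likelihood ratio LS = P(E/H)/P(E/¬H) of the subjective Bayes method; the function names are hypothetical.

```python
def rule_measures(records, antecedent, consequent):
    """Return (support, confidence, LS) for the rule antecedent -> consequent.

    records: non-empty list of sets of items; antecedent: set of items;
    consequent: a single item.
    """
    n = len(records)
    n_e = sum(1 for r in records if antecedent <= r)
    n_h = sum(1 for r in records if consequent in r)
    n_eh = sum(1 for r in records if antecedent <= r and consequent in r)
    support = n_eh / n
    confidence = n_eh / n_e if n_e else 0.0
    # LS = P(E/H) / P(E/not H), estimated from counts.
    p_e_given_h = n_eh / n_h if n_h else 0.0
    p_e_given_not_h = (n_e - n_eh) / (n - n_h) if n > n_h else 0.0
    if p_e_given_not_h > 0:
        ls = p_e_given_h / p_e_given_not_h
    else:
        ls = float("inf") if p_e_given_h > 0 else 0.0
    return support, confidence, ls

def passes_thresholds(measures, min_sup, min_con, min_ls):
    """Step 403: all three measures must exceed their thresholds."""
    support, confidence, ls = measures
    return support > min_sup and confidence > min_con and ls > min_ls
```

A rule that passes would be handed to the interrupt-type coordinator (step 404); any other rule ends the procedure (step 405).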
Example run and comparison:
Mushroom database:
For comparability, the algorithm was tested on the mushroom database, a classic database available on the network for experimental use. The algorithm was implemented in Delphi 5.0; the database system was Microsoft SQL Server 7.0, in a client-server architecture.
Because the knowledge sub-base corresponding to the mushroom database contains no domain expert knowledge, the QAR_SQL algorithm for mining multi-valued association rules and the Famer algorithm for mining unexpected association rules are run first, and the rules they mine serve as the knowledge in the primary knowledge base. We are interested only in whether a mushroom is poisonous, so the consequent of every rule is the attribute indicating whether it is poisonous (with values 'edible' and 'poisonous'). The QAR_SQL algorithm is run first with support threshold MinSup = 0.4, confidence threshold MinCon = 0.6, and sufficiency factor threshold MinLS = 1.2; it produces 19 rules, shown in Figure 10.
The Famer algorithm is then run with support threshold MinSup = 0.14, confidence threshold MinCon = 0.8, and sufficiency factor threshold MinLS = 1.2; it produces a further 10 unexpected rules corresponding to the conventional rules above, namely rules 20 to 29 in Figure 11.
With these 29 rules as the primary knowledge base, the heuristic coordinator is run with support threshold MinSup = 0.14, confidence threshold MinCon = 0.6, and rule intensity threshold MinIntensity = 0.45. It produces a further 45 rules, shown in Figure 12 (only 12 of them are displayed).
This example shows that the Maradbcm method is effective: it finds new association rules beyond those found by the QAR_SQL and Famer methods.
4. Knowledge evaluation method: automatic evaluation of causal association rules based on autoepistemic logic
4.1 Principle 1 (consistency principle): In the objective world, under an uncertain reasoning mechanism and large-sample statistics, the inferential characterization of a causal association rule is consistent with its statistical characterization.
Principle 2 (applicability principle): The confirmation reasoning pattern is applicable to reasoning that involves causal association rules. That is:
HE
Here H is the hypothesis being verified, which can be regarded as the mined causal association rule R that needs to be evaluated. E is an assertion that can be derived from H, which can be regarded as the result obtained by testing. In the evaluation process, the test follows uncertain causal induction and checks whether the cause-effect data satisfy the consistency principle: if the state (change) of the data equals the result obtained from the data by reasoning, the consistency principle is satisfied; otherwise it is not.
4.2 According to the positive relevance criterion:
E confirms H if and only if Pr(H/E) > Pr(H)
where Pr(H) is the prior confidence and Pr(H/E) is the posterior confidence. In other words, E confirms H if and only if the posterior confidence of H given E is greater than its prior confidence.
4.3 Basis of the evaluation method:
Denote the discovered causal association rule by R( ). Evaluating the rule means deciding whether to accept it, so it belongs to the domain of confirmation logic.
Definition 10: For a causal association rule R( ), the ratio of the probability that Ai and Sj occur together to the probability that at least one of them occurs, Pr(Ai ∧ Sj)/Pr(Ai ∨ Sj), is called the causal association strength, denoted CR. (It corresponds to Pr(H) and serves as the prior confidence.)
Note: The causal association strength expresses not only the degree of association of the rule but, more importantly, the cause-effect relationship between antecedent and consequent; it emphasizes the degree of causal association between the two. It differs clearly from confidence and support in the usual sense, and from confidence in the more general sense.
Definition 11: Pr(E2)/(Pr(E1)+Pr(E2)) is called the support strength, denoted SUP. (It corresponds to Pr(H/E) and serves as the posterior confidence.)
Note: In the evaluation process, the test checks whether the rule satisfies the consistency principle of Principle 1, and E is the test result obtained. The data are thus split into two parts: the part that satisfies the consistency principle (denoted E1) and the part that does not (denoted E2). The satisfying part represents the degree to which the causal association rule holds, that is, a degree of support for the rule that is built on the reasoning mechanism; this differs from the usual support, which is built purely on statistics.
4.4 From Principle 2 and the definitions above, the following conclusion is obtained:
For a causal association rule R( ), if SUP > CR, the rule is confirmed; if SUP ≤ CR, the rule is falsified.
4.5 Evaluating association rules with the sufficiency factor LS:
In the subjective Bayes method, each rule is represented as
IF E THEN (LS, LN) H (P(H))
where P(H) is the prior probability of H; LS ∈ [0, +∞) is called the sufficiency factor and reflects how strongly the truth of evidence E affects conclusion H; LN ∈ [0, +∞) is called the necessity factor and reflects how strongly ¬E affects H, i.e. how necessary E is for H to be true.
The relation between LS and P(H/E) is given by the following formula:
where P(H/E) is the conditional probability and P(H) is the prior probability of H. LS can be derived from it:
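In the standard subjective Bayes method, assuming that LS is the likelihood ratio P(E/H)/P(E/¬H) (this text does not state its definition), the relation between LS and P(H/E), and the expression for LS obtained by solving it, take the following form:

```latex
P(H/E) = \frac{LS \cdot P(H)}{(LS - 1)\,P(H) + 1}
\qquad\Longrightarrow\qquad
LS = \frac{P(H/E)\,\bigl(1 - P(H)\bigr)}{P(H)\,\bigl(1 - P(H/E)\bigr)}
```

Setting LS = 1 gives P(H/E) = P(H), letting LS → ∞ gives P(H/E) → 1, and LS = 0 gives P(H/E) = 0, which agrees with the four cases discussed for LS.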
The value of LS is usually supplied by a domain expert, but in association rule mining it can be computed. The meaning of LS can be seen from formula (4-3):
(1) When LS = 1, formula (1) gives P(H/E) = P(H), which shows that E is irrelevant to H;
(2) When LS > 1, formula (1) gives P(H/E) > P(H), which shows that the presence of evidence E increases the likelihood that H is true; the larger LS is, the larger P(H/E) is, i.e. the more strongly E supports H being true. As LS → ∞, P(H/E) → 1, which shows that the presence of E makes H true. The presence of E is thus sufficient for H to be true, which is why LS is called the sufficiency factor;
(3) When LS < 1, formula (1) gives P(H/E) < P(H), which shows that the presence of evidence E lowers the likelihood that H is true;
(4) When LS = 0, formula (1) gives P(H/E) = 0, which shows that the presence of evidence E makes H false.
4.6 Automatic evaluation method for causal association rules based on autoepistemic logic:
The automatic evaluation method (for evaluating the rule R( )) is as follows:
Take the data of cause A and result S to form a set of ordered pairs P = {<tw, sw>} (w = 1, 2, ..., N), where tw is a datum in the cause state (change) space (the cause sample value), sw is the corresponding datum in the result state (change) space (the effect sample value), and N is the number of samples in the set. Let SUP1 = 0.
STEP1: Take the cause sample value tw (w = 1, 2, ..., N), which belongs to the general sample space, and obtain the cause state (change) input vector atw by formula (1).
STEP2: Determine the cause state (change) type Ak (k = 1, 2, 3, 4, 5) to which the input vector atw belongs: compute by formula (2) the measure dH between atw and each cause state (change) standard vector Ai, and take the type with the smallest measure as the type of atw. Randomly draw a sample set to obtain the set of ordered pairs P = {<tw, sw>}.
STEP3: Taking the rule R( ) as the local major premise and the cause state (change) standard vector Ak to which the input vector atw belongs as the minor premise, find in the evaluation knowledge base, by self-organization, the unique knowledge matrix Mijk that matches them, and obtain the result state (change) vector Sw1 by the automated reasoning pattern (3).
STEP4: Clustering. Compute the effect state (change) standard vector β to which Sw1 belongs: compute the measure between Sw1 and each effect state (change) standard vector (by the formula below) and take the one with the smallest measure as the cluster.
where μSw1(i) and μSj(i) are their respective coordinates.
STEP5: For the set of ordered pairs P = {<tw, sw>}, take the corresponding result sample value sw and obtain, by fuzzy clustering, the effect state (change) standard vector γ of the interval to which it belongs. If β = γ, set SUP1 = SUP1 + 1; otherwise leave SUP1 unchanged.
STEP6: Repeat the above process N times to obtain SUP, where
SUP = SUP1/N
Take the causal association strength CR of the rule and compare it with SUP.
If SUP > CR, the rule is accepted;
if SUP ≤ CR, the rule is rejected.
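The final comparison can be sketched as below. The state vectors, the measure dH, and the knowledge matrix of STEP1 to STEP5 depend on formulas not reproduced in this text, so the sketch hides them behind two caller-supplied functions, `infer` (the reasoned effect state β) and `classify` (the observed effect state γ); these and the other names are hypothetical.

```python
def causal_association_strength(samples):
    """CR = Pr(A and S) / Pr(A or S), estimated from (a, s) boolean pairs."""
    both = sum(1 for a, s in samples if a and s)
    either = sum(1 for a, s in samples if a or s)
    return both / either if either else 0.0

def support_strength(pairs, infer, classify):
    """SUP = SUP1 / N over a non-empty list of (cause value, effect value) pairs."""
    sup1 = 0
    for cause_value, effect_value in pairs:
        beta = infer(cause_value)        # STEP1-STEP4: effect state obtained by reasoning
        gamma = classify(effect_value)   # STEP5: effect state of the observed sample
        if beta == gamma:
            sup1 += 1
    return sup1 / len(pairs)             # STEP6

def evaluate(sup, cr):
    """Accept the rule only if SUP is strictly greater than CR."""
    return "accepted" if sup > cr else "rejected"
```

A rule with SUP equal to CR is rejected, since acceptance requires the strict inequality.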
The preferred embodiment of the present invention has been described above. Changes made by those of ordinary skill in the art without departing from its spirit shall all fall within the scope of protection of the present invention.