Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing an all-weighted pattern mining method for discovering association rules between text terms. The method builds on existing results in item-weight-based association rule mining and solves the technical difficulties of mining positive and negative association rules over all-weighted items. It has theoretical value and broad application prospects in fields such as text mining and document information retrieval.
To achieve this object, the invention adopts the following technical solution: an all-weighted pattern mining method for discovering association rules between text terms, comprising the following steps:
(1) All-weighted data preprocessing stage:
Massive amounts of all-weighted data, such as text data, exist in the real world. The preprocessing method depends on the concrete data object. For Chinese text, preprocessing consists of word segmentation, stop-word removal, feature-word extraction, and weight computation. For English text, it consists of stemming, stop-word removal, lexical analysis, feature-word extraction, and weight computation. The result of preprocessing is an all-weighted database and an item library;
The feature-word weight for text data is computed as: w_ij = (0.5 + 0.5 × tf_ij / max_j(tf_ij)) × idf_i,
where w_ij is the weight of the i-th feature word in the j-th document, tf_ij is the term frequency of the i-th feature word in the j-th document, and idf_i is the inverse document frequency of the i-th feature word, idf_i = log(N / df_i), where N is the total number of documents in the document collection and df_i is the number of documents containing the i-th feature word.
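The weight formula above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `feature_weights` is hypothetical, the logarithm is assumed to be natural (the text does not state the base), max_j(tf_ij) is read literally as the maximum frequency of word i over all documents j, and a word that does not occur in a document is given weight 0, in line with the all-weighted data model.

```python
import math

def feature_weights(tf):
    """Compute w_ij = (0.5 + 0.5 * tf_ij / max_j(tf_ij)) * idf_i.

    tf[i][j] is the frequency of feature word i in document j.
    Assumptions: natural logarithm; weight 0 when the word is absent.
    """
    n_docs = len(tf[0])
    weights = []
    for row in tf:
        df = sum(1 for f in row if f > 0)          # documents containing word i
        max_tf = max(row)                          # max_j(tf_ij)
        idf = math.log(n_docs / df) if df else 0.0  # idf_i = log(N / df_i)
        weights.append([
            (0.5 + 0.5 * f / max_tf) * idf if f > 0 else 0.0
            for f in row
        ])
    return weights

# Two feature words, three documents.
w = feature_weights([[2, 0, 1], [1, 1, 1]])
```

Note that a word occurring in every document gets idf = log(1) = 0 and therefore weight 0 everywhere, which is one reason the experiments discard feature words with very high document frequency.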
(2) All-weighted frequent itemset and negative itemset mining stage, comprising steps 2.1 and 2.2:
2.1 Extract the all-weighted candidate 1-itemsets awC_1 from the item library and mine the all-weighted frequent 1-itemsets awL_1, following steps 2.1.1 to 2.1.3:
2.1.1 Extract the all-weighted candidate 1-itemsets awC_1 from the item library;
2.1.2 Accumulate the weight sum of each candidate 1-itemset awC_1 in the all-weighted database (All-Weighted Database, AWD for short) and compute its support;
The support of awC_1 is computed as follows:
awsup(awC_1) = ( Σ_{T_i ∈ AWD} Σ_{i_j ∈ awC_1} w[T_i][i_j] ) / (n × k)
where the double sum is the weight sum of the items i_j of awC_1 over the transaction records T_i, n is the total number of transaction records in AWD, and k is the length of awC_1 (the number of items in awC_1).
2.1.3 Add the all-weighted frequent 1-itemsets awL_1, that is, the candidates in awC_1 whose support is greater than or equal to the minimum support threshold minsup, to the frequent itemset collection awPIS;
2.2 Starting from the all-weighted candidate 2-itemsets, proceed according to steps 2.2.1 to 2.2.4:
2.2.1 Apply the Apriori join to the all-weighted frequent (i-1)-itemsets awL_(i-1) to generate the all-weighted candidate i-itemsets awC_i, where i ≥ 2;
2.2.2 Accumulate the weight sum of each candidate i-itemset awC_i in AWD and compute its support awsup(awC_i) as follows:
awsup(awC_i) = ( Σ_{T_i ∈ AWD} Σ_{i_j ∈ awC_i} w[T_i][i_j] ) / (n × k)
where the double sum is the weight sum of the items i_j of awC_i over the transaction records T_i, n is the total number of transaction records in AWD, and k is the length of awC_i.
2.2.3 From the candidate i-itemsets awC_i, take the frequent i-itemsets awL_i whose support is not less than the support threshold minsup and store them in the all-weighted frequent itemset collection awPIS; at the same time, store the all-weighted negative i-itemsets awN_i, whose support is less than the support threshold, in the all-weighted negative itemset collection awNIS.
2.2.4 Increase i by 1. If the frequent (i-1)-itemset collection awL_(i-1) is empty (its length is 0), go to step (3); otherwise, repeat steps 2.2.1 to 2.2.3;
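The level-wise loop of stage (2) can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `awsup` and `mine_itemsets` are hypothetical, the database is assumed to be a list of dicts mapping item to weight, an itemset's weight is assumed to be accumulated only over records that contain all of its items, and the Apriori join is simplified to the union of any two frequent (i-1)-itemsets that differ in exactly one item.

```python
from itertools import combinations

def awsup(itemset, db):
    """All-weighted support: weight sum of the itemset divided by n * k."""
    items = list(itemset)
    total = sum(
        sum(rec[i] for i in items)
        for rec in db
        if all(rec.get(i, 0.0) > 0 for i in items)  # record contains the whole itemset
    )
    return total / (len(db) * len(items))

def mine_itemsets(db, minsup):
    """Return (awPIS, awNIS): frequent itemsets and negative itemsets."""
    items = sorted({i for rec in db for i, w in rec.items() if w > 0})
    level = [frozenset([i]) for i in items if awsup([i], db) >= minsup]  # awL_1
    aw_pis, aw_nis = list(level), []
    while level:  # stop when the previous level of frequent itemsets is empty
        size = len(level[0]) + 1
        # Simplified Apriori join of the frequent (i-1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = []
        for cand in sorted(candidates, key=sorted):
            if awsup(cand, db) >= minsup:
                level.append(cand)   # frequent i-itemset
            else:
                aw_nis.append(cand)  # negative i-itemset
        aw_pis.extend(level)
    return aw_pis, aw_nis

db = [{"a": 0.9}, {"b": 0.8}, {"a": 0.5, "b": 0.4}]
pis, nis = mine_itemsets(db, minsup=0.25)
```

In this toy database both single items are frequent, while their union has support 0.9 / (3 × 2) = 0.15 and therefore lands in the negative itemset collection.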
(3) Pruning stage: obtain the interesting all-weighted frequent itemsets and negative itemsets by pruning:
3.1 For each frequent i-itemset awL_i in the frequent itemset collection awPIS, compute the value of IAWFI(awL_i) and prune the frequent itemsets whose IAWFI(awL_i) value is false; the result is the pruned collection awPIS of interesting all-weighted frequent itemsets. IAWFI(awL_i) is computed as follows:
where awItemsetInt(I_1 ∪ I_2) = awsup(I_1) × awsup(I_1 ∪ I_2) × (1 - awsup(I_2)), awItemsetInt(¬I_1 ∪ ¬I_2) = awsup(I_2) × (1 - awsup(I_1)) × (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 ∪ I_2)), minInt is the minimum interest threshold, and minsup is the minimum support threshold.
3.2 For each negative i-itemset awN_i in the negative itemset collection awNIS, compute the value of IAWNI(awN_i) and prune the negative itemsets whose IAWNI(awN_i) value is false; the result is the pruned collection awNIS of interesting all-weighted negative itemsets. IAWNI(awN_i) is computed as follows:
where
awItemsetInt(I_1 ∪ I_2) = awsup(I_1) × awsup(I_1 ∪ I_2) × (1 - awsup(I_2))
awItemsetInt(I_1 ∪ ¬I_2) = awsup(I_1) × awsup(I_2) × (awsup(I_1) - awsup(I_1 ∪ I_2))
awItemsetInt(¬I_1 ∪ I_2) = (1 - awsup(I_1)) × (1 - awsup(I_2)) × (awsup(I_2) - awsup(I_1 ∪ I_2))
awItemsetInt(¬I_1 ∪ ¬I_2) = awsup(I_2) × (1 - awsup(I_1)) × (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 ∪ I_2))
(4) Mine the valid all-weighted positive and negative association rules from the interesting all-weighted frequent itemset collection awPIS, comprising the following steps:
4.1 Take a frequent itemset awL_i from the interesting all-weighted frequent itemset collection awPIS, find all proper subsets of awL_i, build the proper-subset collection of awL_i, and then carry out the following operations:
4.2.1 Take any two proper subsets I_1 and I_2 from the proper-subset collection of awL_i. When the intersection of I_1 and I_2 is empty (I_1 ∩ I_2 = φ), the item counts of I_1 and I_2 sum to the item count of the original frequent itemset (I_1 ∪ I_2 = awL_i), and the supports of I_1 and I_2 are both not less than the support threshold (awsup(I_1) ≥ minsup, awsup(I_2) ≥ minsup), compute the in-itemset weight ratio awIWR(I_1, I_2) and the dimension ratio awIDR(I_1, I_2) of the frequent itemset (I_1 ∪ I_2) as follows:
awIWR(I_1, I_2) = w_12 / (w_1 × w_2)
awIDR(I_1, I_2) = k_12 / (k_1 × k_2)
where w_12, w_1 and w_2 are the weight sums in the all-weighted database AWD of the all-weighted itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2, respectively; k_12, k_1 and k_2 are the item counts of the itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2, respectively; and n is the total number of transaction records in the database.
4.2.2 When the product of the total number of transaction records n in the database and the in-itemset weight ratio awIWR(I_1, I_2) from step 4.2.1 is greater than the dimension ratio awIDR(I_1, I_2), that is, n × awIWR(I_1, I_2) > awIDR(I_1, I_2), proceed as follows:
4.2.2.1 If the awCPIR value of I_1 → I_2 is not less than the confidence threshold (awCPIR(I_1 → I_2) ≥ minconf), mine the all-weighted association rule I_1 → I_2; if the awCPIR value of I_2 → I_1 is not less than the confidence threshold (awCPIR(I_2 → I_1) ≥ minconf), mine the all-weighted association rule I_2 → I_1. awCPIR(I_1 → I_2) and awCPIR(I_2 → I_1) are computed as follows:
4.2.2.2 If the support of (¬I_1 ∪ ¬I_2) is not less than the support threshold (awsup(¬I_1 ∪ ¬I_2) ≥ minsup), then: ① if the awCPIR value of ¬I_1 → ¬I_2 is not less than the confidence threshold (awCPIR(¬I_1 → ¬I_2) ≥ minconf), mine the all-weighted negative association rule ¬I_1 → ¬I_2; ② if the awCPIR value of ¬I_2 → ¬I_1 is not less than the confidence threshold (awCPIR(¬I_2 → ¬I_1) ≥ minconf), mine the all-weighted negative association rule ¬I_2 → ¬I_1. awsup(¬I_1 ∪ ¬I_2), awCPIR(¬I_1 → ¬I_2) and awCPIR(¬I_2 → ¬I_1) are computed as follows:
awsup(¬I_1 → ¬I_2) = awsup(¬I_1 ∪ ¬I_2) = 1 - awsup(I_1) - awsup(I_2) + awsup(I_1 ∪ I_2)
4.2.3 When the product of the total number of transaction records n in the database and the in-itemset weight ratio awIWR(I_1, I_2) from step 4.2.1 is less than the dimension ratio awIDR(I_1, I_2), that is, n × awIWR(I_1, I_2) < awIDR(I_1, I_2), proceed as follows:
4.2.3.1 If the support of (I_1 ∪ ¬I_2) is not less than the support threshold (awsup(I_1 ∪ ¬I_2) ≥ minsup), then: ① if the awCPIR value of I_1 → ¬I_2 is not less than the confidence threshold (awCPIR(I_1 → ¬I_2) ≥ minconf), mine the all-weighted negative association rule I_1 → ¬I_2; ② if the awCPIR value of ¬I_2 → I_1 is not less than the confidence threshold (awCPIR(¬I_2 → I_1) ≥ minconf), mine the all-weighted negative association rule ¬I_2 → I_1. awsup(I_1 ∪ ¬I_2), awCPIR(I_1 → ¬I_2) and awCPIR(¬I_2 → I_1) are computed as follows:
awsup(I_1 → ¬I_2) = awsup(I_1 ∪ ¬I_2) = awsup(I_1) - awsup(I_1 ∪ I_2)
4.2.3.2 If the support of (¬I_1 ∪ I_2) is not less than the support threshold (awsup(¬I_1 ∪ I_2) ≥ minsup), then: ① if the awCPIR value of ¬I_1 → I_2 is not less than the confidence threshold (awCPIR(¬I_1 → I_2) ≥ minconf), mine the all-weighted negative association rule ¬I_1 → I_2; ② if the awCPIR value of I_2 → ¬I_1 is not less than the confidence threshold (awCPIR(I_2 → ¬I_1) ≥ minconf), mine the all-weighted negative association rule I_2 → ¬I_1. awsup(¬I_1 ∪ I_2), awCPIR(¬I_1 → I_2) and awCPIR(I_2 → ¬I_1) are computed as follows:
awsup(¬I_1 → I_2) = awsup(¬I_1 ∪ I_2) = awsup(I_2) - awsup(I_1 ∪ I_2)
4.2.4 Repeat steps 4.2.1 to 4.2.3; once every proper subset in the proper-subset collection of awL_i has been taken out exactly once, go to step 4.2.5;
4.2.5 Repeat step 4.1; once every frequent itemset awL_i in the interesting all-weighted frequent itemset collection awPIS has been taken out exactly once, go to step (5);
(5) Mine the valid all-weighted negative association rules from the interesting all-weighted negative itemset collection awNIS, comprising the following steps:
5.1 Take a negative itemset awN_i from the interesting all-weighted negative itemset collection awNIS, find all proper subsets of awN_i, build the proper-subset collection of awN_i, and then carry out the following operations:
5.2.1 Take any two proper subsets I_1 and I_2 from the proper-subset collection of awN_i. When the intersection of I_1 and I_2 is empty (I_1 ∩ I_2 = φ), the item counts of I_1 and I_2 sum to the item count of the original negative itemset (I_1 ∪ I_2 = awN_i), and the supports of I_1 and I_2 are both greater than or equal to the support threshold (awsup(I_1) ≥ minsup, awsup(I_2) ≥ minsup), compute the in-itemset weight ratio awIWR(I_1, I_2) and the dimension ratio awIDR(I_1, I_2) of the negative itemset (I_1 ∪ I_2); awIWR(I_1, I_2) and awIDR(I_1, I_2) are computed with the formulas of step 4.2.1.
5.2.2 When the product of the total number of transaction records n in the database and the in-itemset weight ratio awIWR(I_1, I_2) from step 5.2.1 is greater than the dimension ratio awIDR(I_1, I_2), that is, n × awIWR(I_1, I_2) > awIDR(I_1, I_2), proceed as follows:
5.2.2.1 If the support of (¬I_1 ∪ ¬I_2) is greater than or equal to the support threshold (awsup(¬I_1 ∪ ¬I_2) ≥ minsup), then: ① if the awCPIR value of ¬I_1 → ¬I_2 is greater than or equal to the confidence threshold (awCPIR(¬I_1 → ¬I_2) ≥ minconf), mine the all-weighted negative association rule ¬I_1 → ¬I_2; ② if the awCPIR value of ¬I_2 → ¬I_1 is greater than or equal to the confidence threshold (awCPIR(¬I_2 → ¬I_1) ≥ minconf), mine the all-weighted negative association rule ¬I_2 → ¬I_1. awsup(¬I_1 ∪ ¬I_2), awCPIR(¬I_1 → ¬I_2) and awCPIR(¬I_2 → ¬I_1) are computed with the formulas of step 4.2.2.2.
5.2.3 When the product of the total number of transaction records n in the database and the in-itemset weight ratio awIWR(I_1, I_2) from step 5.2.1 is less than the dimension ratio awIDR(I_1, I_2), that is, n × awIWR(I_1, I_2) < awIDR(I_1, I_2), proceed as follows:
5.2.3.1 If the support of (I_1 ∪ ¬I_2) is greater than or equal to the support threshold (awsup(I_1 ∪ ¬I_2) ≥ minsup), then: ① if the awCPIR value of I_1 → ¬I_2 is greater than or equal to the confidence threshold (awCPIR(I_1 → ¬I_2) ≥ minconf), mine the all-weighted negative association rule I_1 → ¬I_2; ② if the awCPIR value of ¬I_2 → I_1 is greater than or equal to the confidence threshold (awCPIR(¬I_2 → I_1) ≥ minconf), mine the all-weighted negative association rule ¬I_2 → I_1. awsup(I_1 ∪ ¬I_2), awCPIR(I_1 → ¬I_2) and awCPIR(¬I_2 → I_1) are computed with the formulas of step 4.2.3.1;
5.2.3.2 If the support of (¬I_1 ∪ I_2) is greater than or equal to the support threshold (awsup(¬I_1 ∪ I_2) ≥ minsup), then: ① if the awCPIR value of ¬I_1 → I_2 is greater than or equal to the confidence threshold (awCPIR(¬I_1 → I_2) ≥ minconf), mine the all-weighted negative association rule ¬I_1 → I_2; ② if the awCPIR value of I_2 → ¬I_1 is greater than or equal to the confidence threshold (awCPIR(I_2 → ¬I_1) ≥ minconf), mine the all-weighted negative association rule I_2 → ¬I_1. awsup(¬I_1 ∪ I_2), awCPIR(¬I_1 → I_2) and awCPIR(I_2 → ¬I_1) are computed with the formulas of step 4.2.3.2;
5.2.4 Repeat steps 5.2.1 to 5.2.3; once every proper subset in the proper-subset collection of awN_i has been taken out exactly once, go to step 5.2.5;
5.2.5 Repeat step 5.1 until every negative itemset awN_i in the interesting all-weighted negative itemset collection awNIS has been taken out exactly once.
At this point, the mining of all-weighted positive and negative association rules is complete.
Compared with the prior art, the present invention has the following beneficial effects:
(1) To address the shortcomings of existing weighted positive and negative association rule mining, the invention constructs an evaluation framework for all-weighted positive and negative association patterns, namely support, CPIR model (Conditional Probability Increment Ratio), correlation, and interest (SCPIRCI), together with pruning strategies for frequent itemsets and negative itemsets, and proposes a new all-weighted positive and negative association rule mining method based on the SCPIRCI evaluation framework, which effectively solves the problem of mining all-weighted positive and negative association rules. The invention takes into account the defining feature of all-weighted data, namely that item weights vary from one database record to another, and adopts a new itemset pruning strategy, so that mining time is significantly reduced and mining efficiency is greatly improved.
(2) The invention proposes the concepts of the in-itemset weight ratio and the dimension ratio for all-weighted itemsets, enriching the theory of all-weighted data mining.
(3) Through extensive and rigorous experiments, the invention was compared with a traditional unweighted positive and negative association rule mining method. Using the Chinese Web test collection CWT200g as the experimental document set, the mining performance of the invention was analyzed with respect to changes in the support threshold, the confidence threshold, the number of items, and the size of the document set. The results show that, compared with the baseline method, the invention achieves good mining performance and greatly improved mining efficiency. Whether the support threshold or the confidence threshold is varied, the numbers of candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rules mined by the invention are all far smaller than those mined by the baseline, and the invention also scales well as the number of items and the number of transaction documents change. The main reasons are as follows. The baseline is an unweighted method that mines on item frequency; it does not consider itemset weights and does not fully reflect the inherent features of all-weighted data, so it produces many invalid and spurious itemsets and association rule patterns, which inflates the numbers of itemsets and rules and greatly lowers its mining efficiency. The invention is a weight-based all-weighted method that overcomes these inherent defects: the feature of the all-weighted data model (item weights are objectively distributed across the transaction records and vary from record to record) is incorporated into the whole mining process, so that the mined association rules are more reasonable and closer to reality. At the same time, the new pruning strategy significantly reduces the number of invalid and uninteresting frequent itemsets and negative itemsets, effectively reduces the occurrence of uninteresting rules, and greatly improves mining efficiency.
Specific embodiments
To better explain the technical solution of the invention, the all-weighted data model and the related concepts involved in the invention are described below:
1. The difference between weighted association rule mining and all-weighted association rule mining
Weighted association rule mining and all-weighted association rule mining differ mainly in the source of their item weights and in the data model being mined. In the former, item weights are set subjectively by the user, are independent of the transaction database, and, once set, remain constant throughout the mining process. Consider copy paper and fax machines in a shop: copy paper is cheaper than a fax machine and its per-unit profit is lower, so to reflect the different contributions of these goods to profit, the user assigns a higher weight to the fax machine, which has the higher per-unit profit, and a relatively lower weight to copy paper. Once set, these weights stay fixed and are independent of the transaction database. In the latter, item weights are not set by the user but derive from each transaction record of the transaction database and vary from record to record. For example, in a massive text database, the weight of each feature-word item derives from each document in the database and changes with the document; the same feature word has different weights in different documents.
The item-weighted data model and the all-weighted item data model are the data models for weighted association rule mining and all-weighted association rule mining, respectively, and they are two entirely different kinds of data model, as shown in Table 1 and Table 2, where {i_1, i_2, ..., i_m} is the item set and {T_1, T_2, ..., T_n} is the transaction set. In the item-weighted data model, {w_1, w_2, ..., w_m} are the item weights, and in the entry "1/0", "1" means that the item occurs in the transaction record and "0" means that it does not. In the all-weighted data model, "w[T_i][i_j]/0 (1 ≤ i ≤ n, 1 ≤ j ≤ m)" denotes the item weight: if the item occurs in the transaction record, its weight is "w[T_i][i_j]"; otherwise it is "0".
Table 1: Item-weighted data model. Table 2: All-weighted item data model.
Example: Table 3 contains 5 items and 5 transaction records, with item set {i_1, i_2, i_3, i_4, i_5} = {Apple, Orange, Banana, Milk, Coca-cola}; Table 3 shows that i_1 does not appear in transaction record T_3. Table 4 is an all-weighted item data instance with the same numbers of items and transaction records as Table 3, in which the weights of item i_1 in transaction records T_1, T_2, T_3 and T_5 are 0.85, 0.93, 0.65 and 0.75, respectively; i_1 does not appear in transaction record T_4, so its weight there is 0.
Table 3: Item-weighted data instance. Table 4: All-weighted item data instance.
2. Basic concepts of all-weighted data mining
Let the all-weighted database be AWD = {T_1, T_2, ..., T_n} with n transactions, where T_i (1 ≤ i ≤ n) denotes the i-th transaction in AWD; let the itemset I = {i_1, i_2, ..., i_m} denote the set of all items in AWD, with m items, where i_j (1 ≤ j ≤ m) denotes the j-th item in AWD; and let w[T_i][i_j] (1 ≤ i ≤ n, 1 ≤ j ≤ m) denote the weight of item i_j in transaction record T_i, as in the all-weighted item data model of Table 2. Let I_1 and I_2 be sub-itemsets of the itemset I. The following basic definitions are given:
Definition 1 (all-weighted support, awsup for short): the all-weighted support awsup(I) is computed as shown in formula (1).
awsup(I) = ( Σ_{T_i ∈ AWD} Σ_{i_j ∈ I} w[T_i][i_j] ) / (n × k)    (1)
where the double sum is the weight sum of the items of I over the transaction records of AWD, n is the total number of transaction records in the all-weighted database AWD, and k is the length of the itemset I (the number of items in I).
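As a worked illustration of Definition 1, the sketch below computes awsup over a small hypothetical database. The i_1 weights are those quoted for Table 4; Table 4 itself is not reproduced here, so the placement of the i_2 weights (0.21, 0.35, 0.05) in records T_2, T_3 and T_5 is an assumption chosen to be consistent with the supports quoted in the examples, and only items i_1 and i_2 are included. The sketch also assumes that an itemset's weight is accumulated only over records containing all of its items.

```python
def awsup(itemset, db):
    """All-weighted support: weight sum of the itemset in AWD divided by n * k.

    db is a list of dicts mapping item name to weight (absent item = weight 0).
    Assumption: only records containing every item of the itemset contribute.
    """
    items = list(itemset)
    total = 0.0
    for record in db:
        if all(record.get(i, 0.0) > 0 for i in items):
            total += sum(record[i] for i in items)
    return total / (len(db) * len(items))

# Hypothetical excerpt of an all-weighted database with n = 5 records.
db = [
    {"i1": 0.85},              # T1
    {"i1": 0.93, "i2": 0.21},  # T2 (i2 placement assumed)
    {"i1": 0.65, "i2": 0.35},  # T3 (i2 placement assumed)
    {},                        # T4: i1 does not occur
    {"i1": 0.75, "i2": 0.05},  # T5 (i2 placement assumed)
]

s1 = awsup(["i1"], db)         # 3.18 / (5 * 1)
s2 = awsup(["i2"], db)         # 0.61 / (5 * 1)
s12 = awsup(["i1", "i2"], db)  # 2.94 / (5 * 2)
```

Under these assumptions the three values come out at 0.636, 0.122 and 0.294, matching the figures used in the examples of this section.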
The supports of all-weighted negative itemsets and negative association rules are given by formulas (2) to (5).
awsup(¬I) = 1 - awsup(I)    (2)
awsup(I_1 → ¬I_2) = awsup(I_1 ∪ ¬I_2) = awsup(I_1) - awsup(I_1 ∪ I_2)    (3)
awsup(¬I_1 → I_2) = awsup(¬I_1 ∪ I_2) = awsup(I_2) - awsup(I_1 ∪ I_2)    (4)
awsup(¬I_1 → ¬I_2) = awsup(¬I_1 ∪ ¬I_2) = 1 - awsup(I_1) - awsup(I_2) + awsup(I_1 ∪ I_2)    (5)
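Formulas (3) to (5) need only three positive supports as input, which the following sketch makes explicit. The function name `negative_supports` and the dictionary keys are hypothetical labels chosen for illustration.

```python
def negative_supports(s1, s2, s12):
    """Supports of the three negative patterns, per formulas (3) to (5).

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 U I2).
    """
    return {
        "I1 -> not I2": s1 - s12,               # formula (3)
        "not I1 -> I2": s2 - s12,               # formula (4)
        "not I1 -> not I2": 1 - s1 - s2 + s12,  # formula (5)
    }

# Values quoted in the text for items i2 and i4.
supports = negative_supports(s1=0.122, s2=0.192, s12=0.06)
```

With the quoted values for i_2 and i_4, formulas (3) and (4) give 0.062 and 0.132, the same figures that appear in the example following Definition 8.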
Definition 2 (all-weighted frequent itemset and negative itemset): let the minimum support threshold be minsup. For an all-weighted itemset I, if awsup(I) ≥ minsup, the itemset I is called an all-weighted frequent itemset. For an all-weighted itemset (I_1 ∪ I_2) in which I_1 and I_2 are both frequent itemsets, if awsup(I_1 ∪ I_2) < minsup, the itemset (I_1 ∪ I_2) is called an all-weighted negative itemset.
Example: let minsup = 0.1. In the data of Table 4, awsup(i_2) = (0.21 + 0.35 + 0.05) / (5 × 1) = 0.122 > minsup, awsup(i_4) = 0.192 > minsup, and awsup(i_2 ∪ i_4) = 0.06 < minsup, so the itemset (i_2 ∪ i_4) is an all-weighted negative itemset.
Definition 3 (all-weighted itemset interest, awItemsetInt for short): interest measures the degree of attention a user pays to a mined association pattern; the higher its value, the more novel the pattern and the more attention the user pays to it. Based on the interest model defined for unweighted data mining (Cheng Jihua, Guo Jiansheng, Shi Pengfei. Research on multi-strategy methods for mining rules of interest [J]. Chinese Journal of Computers, 2000, 23(1): 47-51.), the all-weighted itemset interest (awItemsetInt) is computed as shown in formulas (6) to (9):
awItemsetInt(I_1 ∪ I_2) = awsup(I_1) × awsup(I_1 ∪ I_2) × (1 - awsup(I_2))    (6)
awItemsetInt(I_1 ∪ ¬I_2) = awsup(I_1) × awsup(I_2) × (awsup(I_1) - awsup(I_1 ∪ I_2))    (7)
awItemsetInt(¬I_1 ∪ I_2) = (1 - awsup(I_1)) × (1 - awsup(I_2)) × (awsup(I_2) - awsup(I_1 ∪ I_2))    (8)
awItemsetInt(¬I_1 ∪ ¬I_2) = awsup(I_2) × (1 - awsup(I_1)) × (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 ∪ I_2))    (9)
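The four interest measures of formulas (6) to (9) can be written as one small function. This is a sketch: the name `itemset_interest` and the dictionary keys are hypothetical, and formula (8) is read with the factor (1 - awsup(I_2)) as a complete parenthesized term, which is an assumption about the intended grouping. The input values in the usage line are arbitrary illustrative numbers, not data from the tables.

```python
def itemset_interest(s1, s2, s12):
    """All-weighted itemset interest, per formulas (6) to (9).

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 U I2).
    """
    return {
        "I1, I2": s1 * s12 * (1 - s2),                              # (6)
        "I1, not I2": s1 * s2 * (s1 - s12),                         # (7)
        "not I1, I2": (1 - s1) * (1 - s2) * (s2 - s12),             # (8), grouping assumed
        "not I1, not I2": s2 * (1 - s1) * (1 - s1 - s2 + s12),      # (9)
    }

# Arbitrary illustrative supports.
interest = itemset_interest(s1=0.5, s2=0.4, s12=0.3)
```

An itemset would then be kept or pruned by comparing the relevant entry against the minimum interest threshold minInt.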
Definition 4 (all-weighted CPIR value: All-weighted Conditional Probability Increment Ratio, awCPIR for short): the CPIR model uses the ratio of conditional probability to prior probability to express the degree to which p(I_2 / I_1) increases relative to p(I_2). Its formula is given in the literature as CPIR(I_2 / I_1) = (p(I_2 / I_1) - p(I_2)) / (1 - p(I_2)). Based on the CPIR formula and the needs of all-weighted data mining, the awCPIR of all-weighted positive and negative association rules is computed as shown in formulas (10) to (13):
The awCPIR value is used as the confidence of an all-weighted association rule: the larger its value, the more credible the rule and the more attention it receives from the user.
Example: in the all-weighted data of Table 4, awsup(i_1) = 0.636, awsup(¬i_1) = 1 - 0.636 = 0.364, awsup(i_2) = 0.122, awsup(i_1 ∪ i_2) = 0.294, awCPIR(i_1 → i_2) = |0.294 - 0.636 × 0.122| / (0.636 × (1 - 0.122)) = 0.39, awCPIR(i_1 → ¬i_2) = 2.79, awCPIR(¬i_1 → i_2) = 0.68, and awCPIR(¬i_1 → ¬i_2) = 4.86.
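The arithmetic of the positive-rule case in the example can be checked with a short sketch. The form used below, |awsup(I_1 ∪ I_2) - awsup(I_1) × awsup(I_2)| / (awsup(I_1) × (1 - awsup(I_2))), is read off the worked example for awCPIR(i_1 → i_2) and is an assumption about formula (10); the forms for the three negative rules are not inferred here. The function name `aw_cpir_positive` is hypothetical.

```python
def aw_cpir_positive(s1, s2, s12):
    """awCPIR(I1 -> I2) in the form used by the worked example (assumed form).

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 U I2).
    """
    return abs(s12 - s1 * s2) / (s1 * (1 - s2))

# awCPIR(i1 -> i2) with the supports quoted for Table 4.
value = aw_cpir_positive(s1=0.636, s2=0.122, s12=0.294)

def is_confident(cpir, minconf):
    """A rule passes the confidence test when awCPIR is not less than minconf."""
    return cpir >= minconf
```

With the quoted supports the value rounds to 0.39, in agreement with the example, so under minconf = 0.3 the rule i_1 → i_2 would pass the confidence test.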
Definition 5 (in-itemset weight ratio of all-weighted items: All-weighted Weight Ratio from Itemset, awIWR for short): let w_12, w_1 and w_2 be the weight sums in the all-weighted database AWD of the all-weighted itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2, respectively. The ratio of w_12 to (w_1 × w_2) is called the in-itemset weight ratio of the all-weighted itemset, or weight ratio for short (awIWR(I_1, I_2)), as shown in formula (14).
awIWR(I_1, I_2) = w_12 / (w_1 × w_2)    (14)
Definition 6 (in-itemset dimension ratio of all-weighted items: All-weighted Dimension Ratio from Itemset, awIDR for short): let k_12, k_1 and k_2 be the item counts of the itemset (I_1, I_2) and of its sub-itemsets I_1 and I_2, respectively. The ratio of k_12 to (k_1 × k_2) is called the in-itemset dimension ratio of the all-weighted itemset, or dimension ratio for short (awIDR(I_1, I_2)), as shown in formula (15).
awIDR(I_1, I_2) = k_12 / (k_1 × k_2)    (15)
Definition 7 (all-weighted itemset correlation, awISCorr for short): based on the traditional definition of itemset correlation (Chengqi Zhang, Shichao Zhang. Association Rule Mining: Models and Algorithms [M]. Springer-Verlag, Berlin, Heidelberg, 2002: 47-84, ISBN: 3-540-43533-6.), the correlation awISCorr(I_1, I_2) of the all-weighted itemset (I_1, I_2) is computed as shown in formula (16).
awISCorr(I_1, I_2) = awsup(I_1 ∪ I_2) / (awsup(I_1) × awsup(I_2))    (16)
According to the nature of correlation, in an all-weighted data mining environment the correlation of the itemset (I_1, I_2) has the following properties:
Property 1:
Property 4:
② awISCorr(¬I_1, I_2) < 1; ③ awISCorr(¬I_1, ¬I_2) > 1.
Property 5:
② awISCorr(¬I_1, I_2) > 1; ③ awISCorr(¬I_1, ¬I_2) < 1.
Corollary: in an all-weighted data mining environment, for a given itemset (I_1, I_2): ① if n × awIWR(I_1, I_2) > awIDR(I_1, I_2), the all-weighted sub-itemsets I_1 and I_2 are positively correlated, and the all-weighted positive association rule I_1 → I_2 and the negative association rule pattern ¬I_1 → ¬I_2 can be mined; ② if n × awIWR(I_1, I_2) < awIDR(I_1, I_2), the all-weighted sub-itemsets I_1 and I_2 are negatively correlated, and the all-weighted negative association rule patterns I_1 → ¬I_2 and ¬I_1 → I_2 can be mined.
According to this corollary, when mining all-weighted association rules it is only necessary to compute the in-itemset weight ratio awIWR(I_1, I_2) and the dimension ratio awIDR(I_1, I_2); the itemset correlation itself need not be computed, and the all-weighted positive and negative association rules can be mined directly from the frequent itemsets and negative itemsets. The reason is that, by formulas (1) and (14) to (16), awISCorr(I_1, I_2) = n × awIWR(I_1, I_2) / awIDR(I_1, I_2), so comparing n × awIWR with awIDR is equivalent to comparing the correlation with 1.
Example: for (i_1, i_2, i_3), let I_1 = (i_1, i_2) and I_2 = (i_3). Then awIWR(I_1, I_2) = 3.34 / (2.94 × 2.85) = 0.399, awIDR(I_1, I_2) = 3 / (2 × 1) = 1.5, and n × awIWR(I_1, I_2) = 5 × 0.399 = 1.995 > 1.5 = awIDR(I_1, I_2). By the corollary above, I_1 and I_2 are positively correlated, and the association rule I_1 → I_2 and the negative association rule pattern ¬I_1 → ¬I_2 can be mined. Verifying with formula (16): awsup(i_1 ∪ i_2) = 0.294, awsup(i_3) = 0.57, awsup(i_1 ∪ i_2 ∪ i_3) = 0.223, and awISCorr(I_1, I_2) = 0.223 / (0.294 × 0.57) = 1.33 > 1, so by Property 1 and Property 4, I_1 and I_2 are positively correlated and the association rule I_1 → I_2 and the negative association rule pattern ¬I_1 → ¬I_2 can be mined; the conclusions agree.
Similarly, for the all-weighted itemset (i_2, i_4), awIWR(i_2, i_4) = 0.102, awIDR(i_2, i_4) = 2, and n × awIWR(i_2, i_4) = 0.51 < 2 = awIDR(i_2, i_4); by the corollary, i_2 and i_4 are negatively correlated, and the patterns i_2 → ¬i_4 and ¬i_2 → i_4 can be mined.
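The correlation test of the corollary can be sketched as below, using the definitions of awIWR (formula (14)) and awIDR (formula (15)). The function name `correlation_sign` and its return labels are hypothetical; the text does not say what to do when the two sides are equal, so the sketch returns `None` in that case as an assumption.

```python
def correlation_sign(w12, w1, w2, k12, k1, k2, n):
    """Classify (I1, I2) by comparing n * awIWR with awIDR.

    w12, w1, w2: weight sums of (I1, I2), I1 and I2 in the database.
    k12, k1, k2: item counts of (I1, I2), I1 and I2.
    n: total number of transaction records.
    """
    aw_iwr = w12 / (w1 * w2)   # formula (14)
    aw_idr = k12 / (k1 * k2)   # formula (15)
    lhs = n * aw_iwr
    if lhs > aw_idr:
        return "positive"      # candidates: I1 -> I2 and not I1 -> not I2
    if lhs < aw_idr:
        return "negative"      # candidates: I1 -> not I2 and not I1 -> I2
    return None                # equality is not covered by the text (assumption)

# Weight sums quoted in the example for I1 = (i1, i2), I2 = (i3), n = 5.
sign = correlation_sign(w12=3.34, w1=2.94, w2=2.85, k12=3, k1=2, k2=1, n=5)
```

For the quoted figures, n × awIWR is about 1.99 against awIDR = 1.5, so the pair is classified as positively correlated, as in the example.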
Definition 8 (valid all-weighted positive and negative association rules): let minconf be the minimum confidence threshold. When the all-weighted itemsets I_1 and I_2 satisfy the following 3 conditions, the association rules I_1 → I_2, ¬I_1 → ¬I_2, I_1 → ¬I_2 and ¬I_1 → I_2 are called valid all-weighted positive and negative association rules: ① I_1 and I_2 are all-weighted frequent itemsets and I_1 ∩ I_2 = φ; ② the supports of I_1 → I_2, ¬I_1 → ¬I_2, I_1 → ¬I_2 and ¬I_1 → I_2 are greater than or equal to minsup; ③ the awCPIR values of I_1 → I_2, ¬I_1 → ¬I_2, I_1 → ¬I_2 and ¬I_1 → I_2 are not less than minconf.
Example: let minsup = 0.1 and minconf = 0.3. From the examples above, the supports of the all-weighted itemsets (i_1, i_2), (i_3) and (i_1, i_2, i_3) are all greater than minsup, and (i_1, i_2) and (i_3) are positively correlated. Moreover, awCPIR((i_1, i_2) → (i_3)) = |0.223 - 0.294 × 0.57| / (0.294 × (1 - 0.57)) = 0.438 > minconf and awCPIR(¬(i_1, i_2) → ¬(i_3)) = 0.138 < minconf, so by Property 4 and Definition 8, (i_1, i_2) → (i_3) is a valid all-weighted positive association rule, whereas the negative rule ¬(i_1, i_2) → ¬(i_3) is not valid. Similarly, for the all-weighted itemset (i_2, i_4), since awsup(i_2) = 0.122 > minsup, awsup(i_4) = 0.192 > minsup, awsup(i_2 ∪ ¬i_4) = 0.062 < minsup, awsup(¬i_2 ∪ i_4) = 0.132 > minsup, and awCPIR(¬i_2 → i_4) = 0.052 < minconf, by Definition 8 the negative association rules i_2 → ¬i_4 and ¬i_2 → i_4 are not valid all-weighted negative association rules.
The technical solution of the invention is further described below through a specific embodiment.
The process of mining all-weighted association rules from the all-weighted data instance of Table 4 is as follows (where minsup = 0.1, minInt = 0.1, minconf = 0.4, w denotes the itemset weight, and s denotes the itemset support):
Step1:awPIS={φ};awNIS={φ};
Step2:
Step3:①
②
③
Step4: Pruning: prune the itemsets in the frequent itemset collection awPIS. The pruned frequent itemsets are (i_2, i_3), (i_3, i_4), (i_1, i_2, i_5) and (i_1, i_3, i_5); after pruning, awPIS = {(i_1, i_2), (i_1, i_3), (i_1, i_5), (i_1, i_2, i_3)}.
Step5: Similarly, in the negative itemset collection awNIS, the pruned negative itemset is (i_3, i_5); after pruning, awNIS = {(i_1, i_4), (i_2, i_4), (i_2, i_5), (i_4, i_5)}.
Step6: Mine the all-weighted positive and negative association rules from the frequent itemset collection awPIS and the negative itemset collection awNIS. Taking the frequent itemset (i_1, i_2, i_3) and the negative itemset (i_4, i_5) as examples, the mining process is as follows:
For the frequent itemset (i_1, i_2, i_3), take its subsets I_1 = (i_1) and I_2 = (i_2, i_3) as an example. From the examples above, awsup(i_1) and awsup(i_2, i_3) are both greater than minsup, awIDR(I_1, I_2) = 1.5, n × awIWR(I_1, I_2) = 2.98 > awIDR(I_1, I_2), awsup(I_1 ∪ I_2) = 0.223 > minsup, awCPIR(I_1 → I_2) = 0.212 < minconf, and awCPIR(I_2 → I_1) = 1.73 > minconf; awsup(¬I_1 ∪ ¬I_2) = 0.411 > minsup, awCPIR(¬I_1 → ¬I_2) = 1.73 > minconf, and awCPIR(¬I_2 → ¬I_1) = 0.212 < minconf. Therefore, I_2 → I_1 and ¬I_1 → ¬I_2, that is, (i_2, i_3) → (i_1) and ¬(i_1) → ¬(i_2, i_3), are valid all-weighted positive and negative association rules.
For the negative itemset (i_4, i_5), take its subsets I_1 = (i_4) and I_2 = (i_5). From the examples above, awsup(i_4) and awsup(i_5) are both greater than minsup, awIDR(I_1, I_2) = 2, n × awIWR(I_1, I_2) = 1.03 < awIDR(I_1, I_2), awsup(I_1 ∪ ¬I_2) = 0.101 > minsup, awsup(¬I_1 ∪ I_2) = 0.093 < minsup, awCPIR(I_1 → ¬I_2) = 1.577 > minconf, and awCPIR(¬I_2 → I_1) = 0.084 < minconf. Therefore, I_1 → ¬I_2, that is, (i_4) → ¬(i_5), is a valid all-weighted negative association rule.
Below by experiment, beneficial effect of the present invention is described further.
To verify the validity, correctness and scalability of the present invention, part of the corpus of the Chinese Web test collection CWT200g (Chinese Web Test Collection with 200GB web pages), provided by the Network Laboratory of Peking University, was selected as the experimental test data set. The experiments were run on an Intel(R) Core(TM) i7-3770 CPU @ 3.4GHz with 4.0 GB of memory under Windows 7; the program was implemented in Delphi 2006, and the database system was SQL Server 2008. A typical unweighted positive and negative association rule mining method (Xindong Wu, Chengqi Zhang, and Shichao Zhang, Efficient Mining of Both Positive and Negative Association Rules, ACM Transactions on Information Systems, 22 (2004), 3:381-405), denoted the PNAR-Mining method, was selected as the comparison method.
The Chinese Web test collection CWT200g has a size of 197 GB and comprises 37,482,913 web pages; each page is compressed and arranged according to the Tianwang storage format. From the CWT200g test collection, 12024 plain-text documents were extracted as the experimental document test set. The Chinese lexical analysis system ICTCLAS (developed by the Institute of Computing Technology, Chinese Academy of Sciences) was used to segment the test documents into words. The feature word weight w_ij is computed as w_ij = (0.5 + 0.5 × tf_ij / max_j(tf_ij)) × idf_i. The preprocessing of the test documents consists of word segmentation, stop word removal, feature word extraction and weight calculation, followed by construction of a text database and a feature dictionary based on the vector space model. After preprocessing, 8751 feature words were obtained, whose document frequency df (the number of documents containing the feature word) ranges from 51 to 11258. To suit the mining task, feature words with low or high df values were removed in the experiments, and the feature words with df values from 1500 to 5838 (400 feature words in total) were extracted to build the feature word item library. The feature words occur 1019494 times in total in the 12024 test documents, an average of about 85 occurrences per document. The experiment parameters are shown in Table 5.
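The weighting formula and the document-frequency filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function names `feature_word_weights` and `select_by_df` are hypothetical, the logarithm in idf_i = log(N/df_i) is assumed to be natural, the df range is assumed to be inclusive, and a feature word that does not occur in a document is assumed to receive weight 0 there.

```python
import math
from typing import Dict, List


def feature_word_weights(tf: List[List[int]]) -> List[List[float]]:
    """Compute w_ij = (0.5 + 0.5 * tf_ij / max_j(tf_ij)) * idf_i.

    tf[i][j] is the frequency of feature word i in document j.
    Assumptions: natural logarithm; weight 0 where the word is absent.
    """
    n_docs = len(tf[0]) if tf else 0
    weights = []
    for row in tf:
        df = sum(1 for f in row if f > 0)  # documents containing word i
        if df == 0:
            weights.append([0.0] * n_docs)
            continue
        idf = math.log(n_docs / df)        # idf_i = log(N / df_i)
        max_tf = max(row)                  # max over documents j of tf_ij
        weights.append(
            [(0.5 + 0.5 * f / max_tf) * idf if f > 0 else 0.0 for f in row]
        )
    return weights


def select_by_df(df_by_word: Dict[str, int], low: int, high: int) -> List[str]:
    """Keep feature words whose document frequency lies in [low, high] (assumed inclusive)."""
    return sorted(w for w, df in df_by_word.items() if low <= df <= high)


# Toy data: 2 feature words, 3 documents
w = feature_word_weights([[2, 0, 1], [1, 1, 1]])
kept = select_by_df({"a": 51, "b": 1500, "c": 5838, "d": 11258}, 1500, 5838)
```

In the toy data, the second word occurs in every document, so its idf and therefore all its weights are 0, which illustrates why very frequent feature words carry little information.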
Table 5. Experiment parameters
Experiment 1: Comparison of mining performance under varying support thresholds
Under different support thresholds, the numbers of itemsets (candidate itemsets (CI), frequent itemsets (FI) and negative itemsets (NI)) and of positive and negative association rules (PNAR) mined from the experimental document test set by the AWPNAR-Mining method of the present invention and by the comparison method PNAR-Mining are compared in Figures 3 to 8 (ItemNum=50, minconf=0.0002, minInt=0.0002, TRecordNum=12024).
Experiment 2: Comparison of mining performance under varying confidence thresholds
Under varying confidence thresholds, the numbers of positive and negative association rules (A → B, A → ¬B, ¬A → B and ¬A → ¬B) mined from the experimental document test set by the AWPNAR-Mining method of the present invention and by the comparison method PNAR-Mining are compared in Table 6 (minsup=0.03, minInt=0.0002, ItemNum=50, TRecordNum=12024).
Table 6. Comparison of the numbers of positive and negative association rules mined under different confidence thresholds
Experiment 3: Comparison of mining time efficiency
To compare the mining time efficiency of the two methods, the mining times of the AWPNAR-Mining method of the present invention and of the comparison method PNAR-Mining were measured under varying support thresholds and under varying confidence thresholds; the results are shown in Table 7 and Table 8 (minInt=0.0002, ItemNum=50, TRecordNum=12024). Table 7 compares the time the two methods take to mine itemsets and association rules from the experimental document test set under varying support thresholds (minconf=0.0002), and Table 8 compares the time they take to mine positive and negative association rules under varying confidence thresholds (minsup=0.03).
Table 7. Comparison of the time (unit: seconds) to mine itemsets and association rules under different support thresholds
Table 8. Comparison of the time (unit: seconds) to mine positive and negative association rules under different confidence thresholds
Experiment 4: Scalability analysis
The scalability of the method of the present invention was tested and analyzed in two cases: a varying number of items and a varying size of the test data set.
To test the scalability of the present invention, the experiment parameters were set to ItemNum=50, TRecordNum=12024, minsup=0.05, minconf=0.07 and minInt=0.001. With the number of items and the size of the test data set varied separately, the changes in the numbers of frequent itemsets (FI), negative itemsets (NI) and positive and negative association rules (PNAR) mined by the AWPNAR-Mining method of the present invention from test data set 1 are shown in Figures 9 to 14.
In summary, the above experimental results show that, compared with the comparison method PNAR-Mining, the AWPNAR-Mining method of the present invention achieves good mining performance and greatly improved mining efficiency: under both varying support thresholds and varying confidence thresholds, the numbers of candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rules mined by the present invention are all considerably smaller than those of the comparison method.