CN103838854A

CN103838854A - Completely-weighted mode mining method for discovering association rules among texts

Info

Publication number: CN103838854A
Application number: CN201410096985.2A
Authority: CN
Inventors: 黄名选; 元昌安
Original assignee: GUANGXI COLLEGE OF EDUCATION
Current assignee: Guangxi University of Finance and Economics
Priority date: 2014-03-14
Filing date: 2014-03-14
Publication date: 2014-06-04
Anticipated expiration: 2034-03-14
Also published as: CN103838854B

Abstract

The invention discloses a completely weighted (all-weighted) pattern mining method for discovering association rules between text terms. The all-weighted data to be processed are preprocessed, and an all-weighted database and an item library are built. All-weighted frequent itemsets and negative itemsets are mined, and the interesting all-weighted frequent itemsets and negative itemsets are obtained through pruning. Valid all-weighted positive and negative association rules are then mined through a support-CPIR model-correlation-interest evaluation framework. The method overcomes defects of existing weighted mining techniques: it incorporates into the mining process the defining characteristic of all-weighted data, namely that item weights are objectively distributed in the database and vary from transaction record to transaction record, so that more realistic and reasonable all-weighted positive and negative association patterns are obtained and invalid or uninteresting patterns are avoided. The numbers of candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rule patterns mined are smaller than those mined by the prior art, the mining efficiency is greatly improved, and the method has good scalability.

Description

All-weighted pattern mining method for discovering association rules between text terms

Technical field

The invention belongs to the field of data mining. Specifically, it is an all-weighted positive and negative pattern mining method for discovering association rules between text terms, applicable to fields such as feature-term association pattern discovery in text mining and query expansion in text information retrieval.

Background art

Over the past twenty years, association rule mining has attracted great interest from many researchers and has become one of the focal points of data mining research. The research has concentrated on two directions: mining based on item frequency and mining based on item weights.

The main characteristic of positive and negative association pattern mining based on item frequency is that all items in the database are treated equally and uniformly, and the probability with which an itemset occurs in the database serves as the support for mining association patterns. Its defect is that it considers only item frequency and ignores item weights, which often increases the number of invalid, redundant and uninteresting association rules.

To overcome this defect, positive and negative association rule mining based on item weights has received attention and study. It introduces weights to reflect that different items have different importance and carry different weights in the database. Positive and negative association rule mining based on item weights is divided into weighted positive and negative association rule mining and all-weighted positive and negative association rule mining. The main characteristic of the former is that its item weights reflect the different importance of the items. As research has deepened, the role of weighted negative association rules has become increasingly prominent: besides favorable factors, one also hopes to discover unfavorable factors, and the analysis of negative association rules serves this purpose. The defect of weighted association rule mining is that it ignores the case in which an item has a different weight in each transaction record of the database. Data whose item weights are objectively distributed in the transaction records and vary with the record are called all-weighted data. Existing weighted association rule mining methods are not suitable for mining all-weighted data. For this reason, all-weighted association rule mining has received attention and study since 2003, and all-weighted positive and negative association rule mining now has important theoretical and practical value in fields such as text mining and information retrieval. All-weighted association rule mining methods effectively overcome the defect of weighted association rule mining, but they do not yet solve the problem of mining all-weighted negative association rules. To address these problems, the present invention studies all-weighted positive and negative association rule mining in depth and proposes a new all-weighted positive and negative association rule mining method based on the intra-itemset weight ratio and dimension ratio. Applied to query expansion in text information retrieval, it can improve retrieval performance; applied to text mining, it can discover realistic and reasonable positive and negative feature-term association patterns.

Summary of the invention

The object of the invention is to address the deficiencies of the prior art by providing an all-weighted pattern mining method for discovering association rules between text terms, which enriches the results of association rule mining based on item weights and solves the technical problem of mining all-weighted positive and negative association rules. The method has theoretical value and broad application prospects in fields such as text mining and text information retrieval.

The technical scheme adopted by the invention to achieve the above object is an all-weighted pattern mining method for discovering association rules between text terms, comprising the following steps:

(1) All-weighted data preprocessing stage:

Massive amounts of all-weighted data, such as text data, exist in the real world. The preprocessing method depends on the specific data. For Chinese text, preprocessing comprises word segmentation, stop-word removal, feature-term extraction and weight calculation. For English text, it comprises stemming, stop-word removal, lexical analysis, feature-term extraction and weight calculation. The result of preprocessing is an all-weighted database and an item library.

The feature-term weight for text data is computed as:

w_{ij} = \left(0.5 + 0.5 \times \frac{tf_{ij}}{\max_{j}(tf_{ij})}\right) \times idf_{i}

where w_ij is the weight of the i-th feature term in the j-th document, tf_ij is the term frequency of the i-th feature term in the j-th document, and idf_i is the inverse document frequency of the i-th feature term, idf_i = log(N/df_i), with N the total number of documents in the document collection and df_i the number of documents containing the i-th feature term.
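The weight formula above can be illustrated with a minimal Python sketch. It makes two assumptions that the text does not settle: max_j(tf_ij) is read as the largest term frequency inside document j (the usual augmented term-frequency normalization), and the logarithm is taken as the natural logarithm. The function name `feature_weights` and the token-list input format are hypothetical.

```python
import math
from collections import Counter

def feature_weights(docs):
    """docs: list of token lists (already segmented, stop words removed).

    Returns one dict per document mapping feature term -> weight w_ij.
    Assumptions: the max is taken over the terms of the same document,
    and log is the natural logarithm.
    """
    n_docs = len(docs)
    # df_i: number of documents that contain term i
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        max_tf = max(tf.values()) if tf else 1
        weights.append({
            term: (0.5 + 0.5 * count / max_tf) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

Each returned dictionary corresponds to one transaction record of the all-weighted database, with absent terms implicitly weighted 0.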

(2) Mining stage for all-weighted frequent itemsets and negative itemsets, comprising steps 2.1 and 2.2:

2.1 Extract the all-weighted candidate 1-itemsets awC₁ from the item library and mine the all-weighted frequent 1-itemsets awL₁, according to steps 2.1.1 to 2.1.3:

2.1.1 Extract the all-weighted candidate 1-itemsets awC₁ from the item library.

2.1.2 Accumulate the weight sum of each all-weighted candidate 1-itemset awC₁ in the all-weighted database (All-Weighted Database, AWD for short) and compute its support.

The support of awC₁ is computed as:

awsup(awC_{1}) = \frac{w_{awC_{1}}}{n \times k}

where w_{awC_1} is the sum of the weights of the items i_j of awC₁ in the transaction records T_i of AWD, n is the total number of transaction records in AWD, and k is the length of the itemset awC₁ (the number of items in awC₁).

2.1.3 Add each all-weighted frequent 1-itemset awL₁, that is, each candidate 1-itemset in awC₁ whose support is greater than or equal to the minimum support threshold minsup, to the frequent itemset collection awPIS.

2.2 Starting from the all-weighted candidate 2-itemsets, perform steps 2.2.1 to 2.2.4:

2.2.1 Apply the Apriori join to the all-weighted frequent (i-1)-itemsets awL_{i-1} to generate the all-weighted candidate i-itemsets awC_i, where i ≥ 2.

2.2.2 Accumulate the weight sum of each all-weighted candidate i-itemset awC_i in the all-weighted database AWD and compute its support awsup(awC_i):

awsup(awC_{i}) = \frac{w_{awC_{i}}}{n \times k}

where w_{awC_i} is the sum of the weights of the items i_j of awC_i in the transaction records T_i of AWD, n is the total number of transaction records in AWD, and k is the length of the itemset awC_i.

2.2.3 From the all-weighted candidate i-itemsets awC_i, take the frequent i-itemsets awL_i whose support is not less than the support threshold minsup and store them in the all-weighted frequent itemset collection awPIS; at the same time, store the all-weighted negative i-itemsets awN_i, whose support is less than the support threshold, in the all-weighted negative itemset collection awNIS.

2.2.4 Increase i by 1. If the frequent (i-1)-itemset collection awL_{i-1} is empty (its size is 0), go to step (3); otherwise repeat steps 2.2.1 to 2.2.3.
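Steps 2.1 and 2.2 can be sketched in Python as a level-wise loop. This is an illustration under stated assumptions, not the patented implementation: the database is a list of dictionaries mapping item to weight, the weight sum of an itemset is accumulated only over records that contain every item of the itemset (an assumption about how the weight sum is formed), and the join is the plain prefix-based Apriori join without subset pruning. The names `awsup` and `mine_itemsets` are hypothetical.

```python
from itertools import combinations

def awsup(itemset, awd):
    """All-weighted support: weight sum of the itemset / (n * k).

    Assumption: weights are summed only over records containing all items.
    """
    items = tuple(itemset)
    total = 0.0
    for record in awd:
        if all(record.get(i, 0.0) > 0.0 for i in items):
            total += sum(record[i] for i in items)
    return total / (len(awd) * len(items))

def mine_itemsets(awd, minsup):
    """Return (awPIS, awNIS) as dicts mapping sorted item tuples to support."""
    aw_pis, aw_nis = {}, {}
    items = sorted({i for record in awd for i in record})
    # Step 2.1: frequent 1-itemsets
    level = []
    for item in items:
        support = awsup((item,), awd)
        if support >= minsup:
            aw_pis[(item,)] = support
            level.append((item,))
    # Step 2.2: join frequent (i-1)-itemsets that share their first i-2 items
    while level:
        candidates = [a + (b[-1],) for a, b in combinations(level, 2)
                      if a[:-1] == b[:-1]]
        level = []
        for cand in candidates:
            support = awsup(cand, awd)
            if support >= minsup:
                aw_pis[cand] = support   # frequent i-itemset
                level.append(cand)
            else:
                aw_nis[cand] = support   # negative i-itemset
    return aw_pis, aw_nis
```

The loop stops when a level yields no frequent itemsets, which corresponds to the empty-collection test of step 2.2.4.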

(3) Pruning stage: obtain the interesting all-weighted frequent itemsets and negative itemsets by pruning.

3.1 For each frequent i-itemset awL_i in the frequent itemset collection awPIS, compute IAWFI(awL_i) and prune the frequent itemsets whose IAWFI(awL_i) value is false; after pruning, awPIS contains the interesting all-weighted frequent itemsets. IAWFI(awL_i) is computed as follows:

where

awItemsetInt(I_1 \cup I_2) = awsup(I_1) \times awsup(I_1 \cup I_2) \times (1 - awsup(I_2))

awItemsetInt(\neg I_1 \cup \neg I_2) = awsup(I_2) \times (1 - awsup(I_1)) \times (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2))

minInt is the minimum interest threshold and minsup is the minimum support threshold.

3.2 For each negative i-itemset awN_i in the negative itemset collection awNIS, compute IAWNI(awN_i) and prune the negative itemsets whose IAWNI(awN_i) value is false; after pruning, awNIS contains the interesting all-weighted negative itemsets. IAWNI(awN_i) is computed as follows:

where

awItemsetInt(I_1 \cup I_2) = awsup(I_1) \times awsup(I_1 \cup I_2) \times (1 - awsup(I_2))

awItemsetInt(I_1 \cup \neg I_2) = awsup(I_1) \times awsup(I_2) \times (awsup(I_1) - awsup(I_1 \cup I_2))

awItemsetInt(\neg I_1 \cup I_2) = (1 - awsup(I_1)) \times (1 - awsup(I_2)) \times (awsup(I_2) - awsup(I_1 \cup I_2))

awItemsetInt(\neg I_1 \cup \neg I_2) = awsup(I_2) \times (1 - awsup(I_1)) \times (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2))
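The four interest values used in the pruning stage depend only on three supports, which a small Python helper can make concrete. This is a sketch: the function name and the dictionary keys are hypothetical, and the third expression is taken as the product of the three factors (1 − awsup(I₁)), (1 − awsup(I₂)) and (awsup(I₂) − awsup(I₁ ∪ I₂)), which is an assumed reading. The IAWFI and IAWNI predicates themselves are not sketched, only the interest values they rely on.

```python
def itemset_interest(s1, s2, s12):
    """Return the four awItemsetInt values.

    s1 = awsup(I1), s2 = awsup(I2), s12 = awsup(I1 ∪ I2).
    Keys use '~' for negation (hypothetical naming).
    """
    return {
        "I1,I2": s1 * s12 * (1 - s2),
        "I1,~I2": s1 * s2 * (s1 - s12),
        "~I1,I2": (1 - s1) * (1 - s2) * (s2 - s12),
        "~I1,~I2": s2 * (1 - s1) * (1 - s1 - s2 + s12),
    }
```

A pruning routine would compare these values with minInt; how the comparisons are combined is defined by the IAWFI and IAWNI formulas.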

(4) Mine the valid all-weighted positive and negative association rules from the interesting all-weighted frequent itemset collection awPIS, comprising the following steps:

4.1 Take a frequent itemset awL_i from the interesting all-weighted frequent itemset collection awPIS, find all proper subsets of awL_i to build its proper-subset collection, and then perform the following operations:

4.2.1 Take any two proper subsets I₁ and I₂ from the proper-subset collection of awL_i. When I₁ and I₂ are disjoint (I₁ ∩ I₂ = ∅), their item counts sum to the item count of the original frequent itemset (I₁ ∪ I₂ = awL_i), and the supports of both are not less than the support threshold (awsup(I₁) ≥ minsup, awsup(I₂) ≥ minsup), compute the intra-itemset weight ratio awIWR(I₁, I₂) and the dimension ratio awIDR(I₁, I₂) of the frequent itemset (I₁ ∪ I₂):

awIWR(I_{1}, I_{2}) = \frac{w_{12}}{w_{1} \times w_{2}}, \qquad awIDR(I_{1}, I_{2}) = \frac{k_{12}}{k_{1} \times k_{2}}

where w₁₂, w₁ and w₂ are the weight sums in the all-weighted database AWD of the all-weighted itemset (I₁, I₂) and of its sub-itemsets I₁ and I₂ respectively, k₁₂, k₁ and k₂ are the item counts of (I₁, I₂), I₁ and I₂ respectively, and n is the total number of transaction records in the database.

4.2.2 When the product of the total number of transaction records n and the intra-itemset weight ratio from step 4.2.1 is greater than the dimension ratio (n × awIWR(I₁, I₂) > awIDR(I₁, I₂)), proceed as follows:

4.2.2.1 If the awCPIR value of I₁ → I₂ is not less than the confidence threshold (awCPIR(I₁ → I₂) ≥ minconf), mine the all-weighted association rule I₁ → I₂; if the awCPIR value of I₂ → I₁ is not less than the confidence threshold (awCPIR(I₂ → I₁) ≥ minconf), mine the all-weighted association rule I₂ → I₁. awCPIR(I₁ → I₂) and awCPIR(I₂ → I₁) are computed as:

awCPIR(I_{1} \rightarrow I_{2}) = \frac{awsup(I_{2} \cup I_{1}) - awsup(I_{1}) \times awsup(I_{2})}{awsup(I_{1}) \times (1 - awsup(I_{2}))}

awCPIR(I_{2} \rightarrow I_{1}) = \frac{awsup(I_{2} \cup I_{1}) - awsup(I_{1}) \times awsup(I_{2})}{awsup(I_{2}) \times (1 - awsup(I_{1}))}

4.2.2.2 If the support of (¬I₁ ∪ ¬I₂) is not less than the support threshold (awsup(¬I₁ ∪ ¬I₂) ≥ minsup), then: ① if the awCPIR value of ¬I₁ → ¬I₂ is not less than the confidence threshold (awCPIR(¬I₁ → ¬I₂) ≥ minconf), mine the all-weighted negative association rule ¬I₁ → ¬I₂; ② if the awCPIR value of ¬I₂ → ¬I₁ is not less than the confidence threshold (awCPIR(¬I₂ → ¬I₁) ≥ minconf), mine the all-weighted negative association rule ¬I₂ → ¬I₁. awsup(¬I₁ ∪ ¬I₂), awCPIR(¬I₁ → ¬I₂) and awCPIR(¬I₂ → ¬I₁) are computed as follows:

awsup(\neg I_1 \rightarrow \neg I_2) = awsup(\neg I_1 \cup \neg I_2) = 1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2)

4.2.3 When the product of the total number of transaction records n and the intra-itemset weight ratio from step 4.2.1 is less than the dimension ratio (n × awIWR(I₁, I₂) < awIDR(I₁, I₂)), proceed as follows:

4.2.3.1 If the support of (I₁ ∪ ¬I₂) is not less than the support threshold (awsup(I₁ ∪ ¬I₂) ≥ minsup), then: ① if the awCPIR value of I₁ → ¬I₂ is not less than the confidence threshold (awCPIR(I₁ → ¬I₂) ≥ minconf), mine the all-weighted negative association rule I₁ → ¬I₂; ② if the awCPIR value of ¬I₂ → I₁ is not less than the confidence threshold (awCPIR(¬I₂ → I₁) ≥ minconf), mine the all-weighted negative association rule ¬I₂ → I₁. awsup(I₁ ∪ ¬I₂), awCPIR(I₁ → ¬I₂) and awCPIR(¬I₂ → I₁) are computed as follows:

awsup(I_1 \rightarrow \neg I_2) = awsup(I_1 \cup \neg I_2) = awsup(I_1) - awsup(I_1 \cup I_2)

4.2.3.2 If the support of (¬I₁ ∪ I₂) is not less than the support threshold (awsup(¬I₁ ∪ I₂) ≥ minsup), then: ① if the awCPIR value of ¬I₁ → I₂ is not less than the confidence threshold (awCPIR(¬I₁ → I₂) ≥ minconf), mine the all-weighted negative association rule ¬I₁ → I₂; ② if the awCPIR value of I₂ → ¬I₁ is not less than the confidence threshold (awCPIR(I₂ → ¬I₁) ≥ minconf), mine the all-weighted negative association rule I₂ → ¬I₁. awsup(¬I₁ ∪ I₂), awCPIR(¬I₁ → I₂) and awCPIR(I₂ → ¬I₁) are computed as follows:

awsup(\neg I_1 \rightarrow I_2) = awsup(\neg I_1 \cup I_2) = awsup(I_2) - awsup(I_1 \cup I_2)

4.2.4 Repeat steps 4.2.1 to 4.2.3 until every proper subset in the proper-subset collection of awL_i has been taken out exactly once, then go to step 4.2.5.

4.2.5 Repeat step 4.1 until every frequent itemset awL_i in the interesting all-weighted frequent itemset collection awPIS has been taken out exactly once, then go to step (5).
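The positive-rule branch of step (4), steps 4.2.1 and 4.2.2.1, can be sketched in Python as follows. This is an illustrative sketch only: the function names and the `sup` dictionary (mapping sorted item tuples to their awsup values) are hypothetical, weight sums are recovered from supports as w = awsup × n × k, and the negative-rule branches are left out because their awCPIR formulas are not reproduced in the text. The code assumes supports strictly between 0 and 1 so that no division by zero occurs.

```python
from itertools import combinations

def awcpir(s_ante, s_cons, s_both):
    """awCPIR of a positive rule antecedent -> consequent."""
    return (s_both - s_ante * s_cons) / (s_ante * (1 - s_cons))

def positive_rules(itemset, sup, n, minsup, minconf):
    """Return (antecedent, consequent, awCPIR) triples for one frequent itemset."""
    rules = []
    k12 = len(itemset)
    w12 = sup[itemset] * n * k12
    for r in range(1, k12 // 2 + 1):
        for i1 in combinations(itemset, r):
            i2 = tuple(x for x in itemset if x not in i1)
            if 2 * r == k12 and i1 > i2:
                continue  # each split into two halves is visited only once
            if sup.get(i1, 0.0) < minsup or sup.get(i2, 0.0) < minsup:
                continue
            w1 = sup[i1] * n * len(i1)
            w2 = sup[i2] * n * len(i2)
            awiwr = w12 / (w1 * w2)            # intra-itemset weight ratio
            awidr = k12 / (len(i1) * len(i2))  # dimension ratio
            if n * awiwr > awidr:              # condition of step 4.2.2
                for ante, cons in ((i1, i2), (i2, i1)):
                    conf = awcpir(sup[ante], sup[cons], sup[itemset])
                    if conf >= minconf:
                        rules.append((ante, cons, conf))
    return rules
```

With supports 0.636, 0.122 and 0.294 for i₁, i₂ and (i₁ ∪ i₂) and n = 5, the sketch yields an awCPIR of about 0.39 for i₁ → i₂.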

(5) Mine the valid all-weighted negative association rules from the interesting all-weighted negative itemset collection awNIS, comprising the following steps:

5.1 Take a negative itemset awN_i from the interesting all-weighted negative itemset collection awNIS, find all proper subsets of awN_i to build its proper-subset collection, and then perform the following operations:

5.2.1 Take any two proper subsets I₁ and I₂ from the proper-subset collection of awN_i. When I₁ and I₂ are disjoint (I₁ ∩ I₂ = ∅), their item counts sum to the item count of the original itemset (I₁ ∪ I₂ = awN_i), and the supports of both are greater than or equal to the support threshold (awsup(I₁) ≥ minsup, awsup(I₂) ≥ minsup), compute the intra-itemset weight ratio awIWR(I₁, I₂) and the dimension ratio awIDR(I₁, I₂) of the negative itemset (I₁ ∪ I₂). The formulas for awIWR(I₁, I₂) and awIDR(I₁, I₂) are the same as in step 4.2.1.

5.2.2 When the product of the total number of transaction records n and the intra-itemset weight ratio from step 5.2.1 is greater than the dimension ratio (n × awIWR(I₁, I₂) > awIDR(I₁, I₂)), proceed as follows:

5.2.2.1 If the support of (¬I₁ ∪ ¬I₂) is greater than or equal to the support threshold (awsup(¬I₁ ∪ ¬I₂) ≥ minsup), then: ① if the awCPIR value of ¬I₁ → ¬I₂ is greater than or equal to the confidence threshold (awCPIR(¬I₁ → ¬I₂) ≥ minconf), mine the all-weighted negative association rule ¬I₁ → ¬I₂; ② if the awCPIR value of ¬I₂ → ¬I₁ is greater than or equal to the confidence threshold (awCPIR(¬I₂ → ¬I₁) ≥ minconf), mine the all-weighted negative association rule ¬I₂ → ¬I₁. The formulas for awsup(¬I₁ ∪ ¬I₂), awCPIR(¬I₁ → ¬I₂) and awCPIR(¬I₂ → ¬I₁) are the same as in step 4.2.2.2.

5.2.3 When the product of the total number of transaction records n and the intra-itemset weight ratio from step 5.2.1 is less than the dimension ratio (n × awIWR(I₁, I₂) < awIDR(I₁, I₂)), proceed as follows:

5.2.3.1 If the support of (I₁ ∪ ¬I₂) is greater than or equal to the support threshold (awsup(I₁ ∪ ¬I₂) ≥ minsup), then: ① if the awCPIR value of I₁ → ¬I₂ is greater than or equal to the confidence threshold (awCPIR(I₁ → ¬I₂) ≥ minconf), mine the all-weighted negative association rule I₁ → ¬I₂; ② if the awCPIR value of ¬I₂ → I₁ is greater than or equal to the confidence threshold (awCPIR(¬I₂ → I₁) ≥ minconf), mine the all-weighted negative association rule ¬I₂ → I₁. The formulas for awsup(I₁ ∪ ¬I₂), awCPIR(I₁ → ¬I₂) and awCPIR(¬I₂ → I₁) are the same as in step 4.2.3.1.

5.2.3.2 If the support of (¬I₁ ∪ I₂) is greater than or equal to the support threshold (awsup(¬I₁ ∪ I₂) ≥ minsup), then: ① if the awCPIR value of ¬I₁ → I₂ is greater than or equal to the confidence threshold (awCPIR(¬I₁ → I₂) ≥ minconf), mine the all-weighted negative association rule ¬I₁ → I₂; ② if the awCPIR value of I₂ → ¬I₁ is greater than or equal to the confidence threshold (awCPIR(I₂ → ¬I₁) ≥ minconf), mine the all-weighted negative association rule I₂ → ¬I₁. The formulas for awsup(¬I₁ ∪ I₂), awCPIR(¬I₁ → I₂) and awCPIR(I₂ → ¬I₁) are the same as in step 4.2.3.2.

5.2.4 Repeat steps 5.2.1 to 5.2.3 until every proper subset in the proper-subset collection of awN_i has been taken out exactly once, then go to step 5.2.5.

5.2.5 Repeat step 5.1 until every negative itemset awN_i in the interesting all-weighted negative itemset collection awNIS has been taken out exactly once.

At this point, the mining of all-weighted positive and negative association rules is complete.

Compared with the prior art, the present invention has the following beneficial effects:

(1) To address the defects of existing weighted positive and negative association rule mining, the invention builds an evaluation framework for all-weighted positive and negative association patterns, namely support-CPIR model (Conditional Probability Increment Ratio)-correlation-interest (SCPIRCI), together with pruning strategies for frequent itemsets and negative itemsets, and proposes a new all-weighted positive and negative association rule mining method based on the SCPIRCI evaluation framework, which effectively solves the problem of mining all-weighted positive and negative association rules. The invention takes into account the characteristic of all-weighted data that item weights vary with the database record and adopts a new itemset pruning strategy, so that the mining time is significantly reduced and the mining efficiency greatly improved.

(2) It introduces the concepts of the intra-itemset weight ratio and the dimension ratio of all-weighted itemsets, enriching the theory of all-weighted data mining.

(3) Through extensive experiments, the invention was compared with a traditional unweighted positive and negative association rule mining method. With the Chinese Web test collection CWT200g as the experimental document set, the mining performance of the invention was analyzed with respect to changes in support, confidence, number of items and document collection size. The experimental results show that, compared with the baseline method, the invention achieves good mining performance and greatly improved mining efficiency. Whether the support threshold or the confidence threshold is varied, the numbers of candidate itemsets, frequent itemsets, negative itemsets and positive and negative association rules mined by the invention are all much smaller than those mined by the baseline method. The invention also shows good scalability as the number of items and the size of the transaction document collection vary. The main reason is as follows. The baseline is an unweighted positive and negative association rule mining method based on item frequency. It does not consider itemset weights and does not fully reflect the inherent characteristics of all-weighted data, so it produces many invalid and spurious itemsets and positive and negative association rule patterns; the numbers of itemsets and rules are much larger and its mining efficiency is much lower. The invention is an all-weighted positive and negative association rule mining method based on weights. It overcomes the inherent defects of the baseline by incorporating the characteristic of the all-weighted data model (item weights are objectively distributed in the transaction records and vary with the record) into the whole mining process, so that the mined association rules are more reasonable and closer to reality. At the same time, the new pruning strategy significantly reduces the number of invalid and uninteresting frequent itemsets and negative itemsets, effectively reduces the occurrence of uninteresting rules, and greatly improves mining efficiency.

Brief description of the drawings

Fig. 1 is a block diagram of the all-weighted pattern mining method for discovering association rules between text terms according to the invention.

Fig. 2 is an overall flow chart of the all-weighted pattern mining method for discovering association rules between text terms according to the invention.

Fig. 3 compares the numbers of candidate itemsets mined under different support thresholds in Experiment 1.

Fig. 4 compares the numbers of frequent itemsets mined under different support thresholds in Experiment 1.

Fig. 5 compares the numbers of rules (A → B) mined under different support thresholds in Experiment 1.

Fig. 6 compares the numbers of negative rules (A → ¬B) mined under different support thresholds in Experiment 1.

Fig. 7 compares the numbers of negative rules (¬A → B) mined under different support thresholds in Experiment 1.

Fig. 8 compares the numbers of negative rules (¬A → ¬B) mined under different support thresholds in Experiment 1.

Fig. 9 shows how the numbers of candidate, frequent and negative itemsets vary with the number of items in Experiment 2.

Fig. 10 shows how the number of positive and negative association rules varies with the number of items in Experiment 2.

Fig. 11 shows how the number of negative association rules varies with the number of items in Experiment 2.

Fig. 12 shows how the numbers of candidate, frequent and negative itemsets vary with the document collection size in Experiment 2.

Fig. 13 shows how the number of negative association rules varies with the document collection size in Experiment 2.

Fig. 14 shows how the number of positive and negative association rules varies with the document collection size in Experiment 2.

Detailed description of the embodiments

To better explain the technical scheme of the invention, the all-weighted data model and the related concepts involved are described below.

1. Difference between weighted association rule mining and all-weighted association rule mining

Weighted association rule mining and all-weighted association rule mining differ mainly in the source of their item weights and in the data model that is mined. In the former, item weights are set subjectively by the user and are independent of the transaction database; once set, they remain constant throughout the mining process. Consider, for example, copy paper and fax machines in a shop. Copy paper is cheaper than a fax machine and earns a lower profit per unit. To reflect how differently the goods contribute to profit, the user gives the fax machine, with its higher unit profit, a higher weight and gives copy paper a relatively low weight. Once set, these weights stay fixed and are independent of the transaction database. In the latter, item weights are not set by the user but come from each transaction record of the transaction database and vary from record to record. In a massive text database, for example, the weight of each feature-term item comes from each document in the database and varies with the document, so the same feature term has different weights in different documents.

The item-weighted data model and the all-weighted item data model are the data models of weighted association rule mining and all-weighted association rule mining respectively. They are two entirely different classes of data model, as shown in Table 1 and Table 2, where {i₁, i₂, ..., i_m} is the item set and {T₁, T₂, ..., T_n} is the transaction set. In the item-weighted data model, {w₁, w₂, ..., w_m} are the item weights, and in "1/0" the value "1" means that the item occurs in the transaction record and "0" means that it does not. In the all-weighted data model, "w[T_i][i_j]/0 (1 ≤ i ≤ n, 1 ≤ j ≤ m)" denotes the weight of an item: if the item occurs in the transaction record its weight is w[T_i][i_j], otherwise it is 0.

Table 1: Item-weighted data model. Table 2: All-weighted item data model.

Example: Table 3 has 5 items and 5 transaction records, with item set {i₁, i₂, i₃, i₄, i₅} = {Apple, Orange, Banana, Milk, Coca-cola}. As Table 3 shows, i₁ does not appear in transaction record T₃. Table 4 is an example of all-weighted item data with the same items and number of transaction records as Table 3. In Table 4, the weights of item i₁ in transaction records T₁, T₂, T₃ and T₅ are 0.85, 0.93, 0.65 and 0.75 respectively; since i₁ does not appear in T₄, its weight there is 0.

Table 3: Item-weighted data example. Table 4: All-weighted item data example.
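The contrast between the two data models can be made concrete with a small Python sketch. The all-weighted values for i₁ are the ones quoted for Table 4; everything in the item-weighted half (the weight 0.6 and the presence flag of i₁ in T₁) is a hypothetical placeholder, since only the absence of i₁ from T₃ is stated for Table 3. The function names are likewise hypothetical.

```python
# Item-weighted model (Table 1 layout): one fixed, user-chosen weight per item
# and a 1/0 presence flag per transaction record.
item_weight = {"i1": 0.6}                      # hypothetical user-set weight
presence = {"T1": {"i1": 1}, "T3": {"i1": 0}}  # T1 flag hypothetical; i1 absent from T3

# All-weighted model (Table 2 layout): the weight is stored in the record itself.
# Values for i1 as quoted for Table 4; absent items are simply omitted.
awd = {
    "T1": {"i1": 0.85},
    "T2": {"i1": 0.93},
    "T3": {"i1": 0.65},
    "T4": {},
    "T5": {"i1": 0.75},
}

def item_weighted_value(record, item):
    """Weight of an item in a record: the same fixed weight wherever it occurs."""
    return item_weight[item] * presence[record].get(item, 0)

def all_weighted_value(record, item):
    """w[T_i][i_j]: varies with the record, 0 when the item is absent."""
    return awd[record].get(item, 0.0)
```

In the first model the weight of i₁ is identical in every record that contains it, while in the second it changes from record to record.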

2. Basic concepts of all-weighted data mining

Let the all-weighted database be AWD = {T₁, T₂, ..., T_n} with n transactions, where T_i (1 ≤ i ≤ n) denotes the i-th transaction in AWD. Let the itemset I = {i₁, i₂, ..., i_m} denote the set of all items in AWD, with m items, where i_j (1 ≤ j ≤ m) denotes the j-th item in AWD, and let w[T_i][i_j] (1 ≤ i ≤ n, 1 ≤ j ≤ m) denote the weight of item i_j in transaction record T_i; see the all-weighted item data model of Table 2. Let I₁ and I₂ be sub-itemsets of the itemset I. The following basic definitions are given:

Definition 1 (all-weighted support, awsup for short): the all-weighted support awsup(I) is computed as shown in formula (1).

awsup(I) = \frac{w_{I}}{n \times k} \qquad (1)

where w_I is the sum of the weights of the items of I in the transaction records of AWD, n is the total number of transaction records in the all-weighted database AWD, and k is the length of the itemset I (the number of items in I).

The supports of all-weighted negative itemsets and negative association rules are given by formulas (2) to (5).

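awsup(\neg I) = 1 - awsup(I) \qquad (2)

awsup(I_1 \rightarrow \neg I_2) = awsup(I_1 \cup \neg I_2) = awsup(I_1) - awsup(I_1 \cup I_2) \qquad (3)

awsup(\neg I_1 \rightarrow I_2) = awsup(\neg I_1 \cup I_2) = awsup(I_2) - awsup(I_1 \cup I_2) \qquad (4)

awsup(\neg I_1 \rightarrow \neg I_2) = awsup(\neg I_1 \cup \neg I_2) = 1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2) \qquad (5)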

Definition 2 (all-weighted frequent itemset and negative itemset): let the minimum support threshold be minsup. An all-weighted itemset I is called an all-weighted frequent itemset if awsup(I) ≥ minsup. For an all-weighted itemset (I₁ ∪ I₂) in which I₁ and I₂ are both frequent itemsets, (I₁ ∪ I₂) is called an all-weighted negative itemset if awsup(I₁ ∪ I₂) < minsup.

Example: let minsup = 0.1. In the data of Table 4, awsup(i₂) = (0.21 + 0.35 + 0.05)/(5 × 1) = 0.122 > minsup, awsup(i₄) = 0.192 > minsup, and awsup(i₂ ∪ i₄) = 0.06 < minsup, so the itemset (i₂ ∪ i₄) is an all-weighted negative itemset.
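The example can be checked with a few lines of Python. This is a minimal sketch of Definitions 1 and 2 in which the weight sum is passed in directly; the function names are hypothetical.

```python
def awsup(weight_sum, n, k):
    """All-weighted support: weight sum of the itemset divided by n * k."""
    return weight_sum / (n * k)

def is_negative_itemset(sup1, sup2, sup12, minsup):
    """(I1 ∪ I2) is negative when I1 and I2 are frequent but their union is not."""
    return sup1 >= minsup and sup2 >= minsup and sup12 < minsup

# Weights of i2 in the three records where it occurs, n = 5 records, k = 1 item
sup_i2 = awsup(0.21 + 0.35 + 0.05, n=5, k=1)   # about 0.122
print(is_negative_itemset(sup_i2, 0.192, 0.06, minsup=0.1))
```

The supports 0.192 for i₄ and 0.06 for (i₂ ∪ i₄) are taken from the example as stated rather than recomputed.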

Definition 3 (all-weighted itemset interest, awItemsetInt for short): interest measures how much attention the user pays to a mined association pattern. The higher its value, the more novel the pattern and the more attention the user pays to it. Based on the interest model defined for unweighted data mining (Cheng Jihua, Guo Jiansheng, Shi Pengfei. Research on multi-strategy methods for mining rules of interest [J]. Chinese Journal of Computers, 2000, 23(1): 47-51.), the all-weighted itemset interest (awItemsetInt) is computed as shown in formulas (6) to (9):

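awItemsetInt(I_1 \cup I_2) = awsup(I_1) \times awsup(I_1 \cup I_2) \times (1 - awsup(I_2)) \qquad (6)

awItemsetInt(I_1 \cup \neg I_2) = awsup(I_1) \times awsup(I_2) \times (awsup(I_1) - awsup(I_1 \cup I_2)) \qquad (7)

awItemsetInt(\neg I_1 \cup I_2) = (1 - awsup(I_1)) \times (1 - awsup(I_2)) \times (awsup(I_2) - awsup(I_1 \cup I_2)) \qquad (8)

awItemsetInt(\neg I_1 \cup \neg I_2) = awsup(I_2) \times (1 - awsup(I_1)) \times (1 - awsup(I_1) - awsup(I_2) + awsup(I_1 \cup I_2)) \qquad (9)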

Definition 4 (completely weighted CPIR value: All-weighted Conditional Probability Increment Ratio, abbreviated awCPIR): The CPIR model uses the ratio of conditional probability to prior probability to express how much p(I₂|I₁) increases relative to p(I₂). Its formula is given in the literature: CPIR(I₂|I₁) = (p(I₂|I₁) − p(I₂)) / (1 − p(I₂)). Based on the CPIR formula and the needs of completely weighted data mining, the awCPIR of completely weighted positive and negative association rules is computed as shown in formulas (10) to (13):

awCPIR(I_{1} \rightarrow I_{2}) = \frac{awsup(I_{2} \cup I_{1}) - awsup(I_{1}) \, awsup(I_{2})}{awsup(I_{1}) \, (1 - awsup(I_{2}))} \qquad (10)

The awCPIR value serves as the confidence of a completely weighted association rule: the larger its value, the higher the confidence of the rule and the more attention it receives from users.

Example: In the completely weighted data of Table 4, awsup(i₁) = 0.636, awsup(﹁i₁) = 1 − 0.636 = 0.364, awsup(i₂) = 0.122 and awsup(i₁∪i₂) = 0.294, so awCPIR(i₁→i₂) = |0.294 − 0.636 × 0.122| / (0.636 × (1 − 0.122)) = 0.39, awCPIR(i₁→﹁i₂) = 2.79, awCPIR(﹁i₁→i₂) = 0.68 and awCPIR(﹁i₁→﹁i₂) = 4.86.
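The first value in this example can be reproduced with a short function. This is a sketch under stated assumptions: the name `awcpir` is hypothetical, and the absolute value in the numerator follows the worked example rather than formula (10) as printed. Only the positive rule is covered, since formulas (11) to (13) are not reproduced in this text.

```python
def awcpir(s_antecedent, s_consequent, s_union):
    """awCPIR(I1 -> I2) following formula (10).

    Assumption: the numerator is taken in absolute value, as in the
    worked example; the printed formula shows no absolute value.
    """
    numerator = abs(s_union - s_antecedent * s_consequent)
    return numerator / (s_antecedent * (1 - s_consequent))


# awsup(i1) = 0.636, awsup(i2) = 0.122, awsup(i1 ∪ i2) = 0.294
value = awcpir(0.636, 0.122, 0.294)
print(round(value, 2))
```

Because the denominator depends on which side is the antecedent, swapping the two arguments generally gives a different value, which is why the method evaluates both I₁→I₂ and I₂→I₁.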

Definition 5 (in-itemset weight ratio of a completely weighted itemset: All-weighted Weight Ratio from Itemset, abbreviated awIWR): Let w₁₂, w₁ and w₂ be the weight sums in the completely weighted database AWD of the completely weighted itemset (I₁, I₂) and of its sub-itemsets I₁ and I₂, respectively. The ratio of w₁₂ to (w₁ × w₂) is called the in-itemset weight ratio of the completely weighted itemset, or weight ratio for short (awIWR(I₁, I₂)), as shown in formula (14).

awIWR(I_{1}, I_{2}) = \frac{w_{12}}{w_{1} \times w_{2}} \qquad (14)

Definition 6 (in-itemset dimension ratio of a completely weighted itemset: All-weighted Dimension Ratio from Itemset, abbreviated awIDR): Let k₁₂, k₁ and k₂ be the numbers of items in the itemset (I₁, I₂) and in its sub-itemsets I₁ and I₂, respectively. The ratio of k₁₂ to (k₁ × k₂) is called the in-itemset dimension ratio of the completely weighted itemset, or dimension ratio for short (awIDR(I₁, I₂)), as shown in formula (15).

awIDR(I_{1}, I_{2}) = \frac{k_{12}}{k_{1} \times k_{2}} \qquad (15)

Definition 7 (completely weighted itemset correlation: All-weighted Itemset Correlation, abbreviated awISCorr): Based on the traditional definition of itemset correlation (Chengqi Zhang, Shichao Zhang. Association Rule Mining: Models and Algorithms [M]. Springer-Verlag, Berlin, Heidelberg, 2002: 47-84, ISBN: 3-540-43533-6.), the correlation awISCorr(I₁, I₂) of the completely weighted itemset (I₁, I₂) is computed as shown in formula (16).

awISCorr(I_{1}, I_{2}) = \frac{awsup(I_{1} \cup I_{2})}{awsup(I_{1}) \times awsup(I_{2})} \qquad (16)

According to the nature of correlation, in a completely weighted data mining environment the correlation of the itemset (I₁, I₂) has the following properties:

Property 1:

Property 2:

Property 3:

Property 4: ② awISCorr(﹁I₁, I₂) < 1; ③ awISCorr(﹁I₁, ﹁I₂) > 1.

Property 5: ② awISCorr(﹁I₁, I₂) > 1; ③ awISCorr(﹁I₁, ﹁I₂) < 1.

Corollary: In a completely weighted data mining environment, for a known itemset (I₁, I₂): ① if n × awIWR(I₁, I₂) > awIDR(I₁, I₂), the completely weighted sub-itemsets I₁ and I₂ are positively correlated, and the completely weighted positive association rule I₁→I₂ and the negative association rule ﹁I₁→﹁I₂ can be mined; ② if n × awIWR(I₁, I₂) < awIDR(I₁, I₂), the completely weighted sub-itemsets I₁ and I₂ are negatively correlated, and the completely weighted negative association rules I₁→﹁I₂ and ﹁I₁→I₂ can be mined.

By this corollary, mining completely weighted association rules only requires computing the in-itemset weight ratio awIWR(I₁, I₂) and the dimension ratio awIDR(I₁, I₂). The itemset correlation need not be computed, and the completely weighted positive and negative association rules can be mined directly from the frequent itemsets and the negative itemsets.
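The link between the corollary and formula (16) can be made explicit with a short derivation. It assumes that the support of an itemset with weight sum w and k items over n transaction records is awsup = w / (n k), which is the form used in the worked examples of this description; under that assumption formulas (14) to (16) combine as follows.

```latex
\begin{aligned}
awISCorr(I_1, I_2)
  &= \frac{awsup(I_1 \cup I_2)}{awsup(I_1)\, awsup(I_2)}
   = \frac{w_{12} / (n k_{12})}{\bigl(w_1 / (n k_1)\bigr)\bigl(w_2 / (n k_2)\bigr)} \\
  &= n \cdot \frac{w_{12}}{w_1 w_2} \cdot \frac{k_1 k_2}{k_{12}}
   = \frac{n \times awIWR(I_1, I_2)}{awIDR(I_1, I_2)}
\end{aligned}
```

Hence awISCorr(I₁, I₂) > 1 exactly when n × awIWR(I₁, I₂) > awIDR(I₁, I₂), and awISCorr(I₁, I₂) < 1 exactly when n × awIWR(I₁, I₂) < awIDR(I₁, I₂), so the sign of the correlation can be read from the weight sums and item counts alone.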

Example: For (i₁, i₂, i₃), let I₁ = (i₁, i₂) and I₂ = (i₃). Then awIWR(I₁, I₂) = 3.34 / (2.94 × 2.85) = 0.399, awIDR(I₁, I₂) = 3 / (2 × 1) = 1.5, and n × awIWR(I₁, I₂) = 5 × 0.399 = 1.995 > 1.5 = awIDR(I₁, I₂). By the corollary above, I₁ and I₂ are positively correlated, and the association rule I₁→I₂ and the negative association rule ﹁I₁→﹁I₂ can be mined. Verification with formula (16): awsup(i₁∪i₂) = 0.294, awsup(i₃) = 0.57, awsup(i₁∪i₂∪i₃) = 0.223, so awISCorr(I₁, I₂) = 0.223 / (0.294 × 0.57) = 1.33 > 1. By Property 1 and Property 4, I₁ and I₂ are positively correlated, and the association rule I₁→I₂ and the negative association rule ﹁I₁→﹁I₂ can be mined; the conclusions agree.

Similarly, for the completely weighted itemset (i₂, i₄), awIWR(i₂, i₄) = 0.102, awIDR(i₂, i₄) = 2, and n × awIWR(i₂, i₄) = 0.51 < 2 = awIDR(i₂, i₄). By the corollary, i₂ and i₄ are negatively correlated, and the rules i₂→﹁i₄ and ﹁i₂→i₄ can be mined.
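Both examples can be checked with a small script. The sketch below is illustrative only: the function names are hypothetical, and the handling of the equality case (neither inequality of the corollary holds) is an assumption, since the corollary does not address it.

```python
def aw_iwr(w12, w1, w2):
    """Formula (14): in-itemset weight ratio."""
    return w12 / (w1 * w2)


def aw_idr(k12, k1, k2):
    """Formula (15): in-itemset dimension ratio."""
    return k12 / (k1 * k2)


def correlation_sign(n, w12, w1, w2, k12, k1, k2):
    """Apply the corollary: compare n x awIWR with awIDR."""
    lhs = n * aw_iwr(w12, w1, w2)
    rhs = aw_idr(k12, k1, k2)
    if lhs > rhs:
        return "positive"
    if lhs < rhs:
        return "negative"
    return "uncorrelated"  # assumption: equality is not covered by the corollary


# I1 = (i1, i2), I2 = (i3): weight sums 2.94, 2.85 and 3.34 for the union
print(correlation_sign(5, 3.34, 2.94, 2.85, k12=3, k1=2, k2=1))
# I1 = (i2), I2 = (i4): weight sums 0.61, 0.96 and 0.06 for the union
print(correlation_sign(5, 0.06, 0.61, 0.96, k12=2, k1=1, k2=1))
```

The check needs only weight sums and item counts, which is the point of the corollary: no support or correlation value has to be computed first.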

Definition 8 (valid completely weighted positive and negative association rules): Let minconf be the minimum confidence threshold. When the completely weighted itemsets I₁ and I₂ meet the following three conditions, the association rules I₁→I₂, ﹁I₁→﹁I₂, I₁→﹁I₂ and ﹁I₁→I₂ are called valid completely weighted positive and negative association rules: ① I₁ and I₂ are completely weighted frequent itemsets and I₁ ∩ I₂ = ∅; ② the supports of I₁→I₂, ﹁I₁→﹁I₂, I₁→﹁I₂ and ﹁I₁→I₂ are greater than or equal to minsup; ③ the awCPIR values of I₁→I₂, ﹁I₁→﹁I₂, I₁→﹁I₂ and ﹁I₁→I₂ are not less than minconf.

Example: Suppose minsup = 0.1 and minconf = 0.3. From the example above, the supports of the completely weighted itemsets (i₁, i₂), (i₃) and (i₁, i₂, i₃) are all greater than minsup, and (i₁, i₂) and (i₃) are positively correlated. Furthermore, awCPIR((i₁, i₂)→(i₃)) = |0.223 − 0.294 × 0.57| / (0.294 × (1 − 0.57)) = 0.438 > minconf, while awCPIR(﹁(i₁, i₂)→﹁(i₃)) = 0.138 < minconf. By Property 4 and Definition 8, (i₁, i₂)→(i₃) is a valid completely weighted positive association rule, whereas the negative rule ﹁(i₁, i₂)→﹁(i₃) is not valid. Similarly, for the completely weighted itemset (i₂, i₄), awsup(i₂) = 0.122 > minsup, awsup(i₄) = 0.192 > minsup, awsup(i₂∪﹁i₄) = 0.062 < minsup, awsup(﹁i₂∪i₄) = 0.132 > minsup, and awCPIR(﹁i₂→i₄) = 0.052 < minconf. By Definition 8, the negative association rules i₂→﹁i₄ and ﹁i₂→i₄ are not valid completely weighted negative association rules.

The technical solution of the invention is further described below through a specific embodiment.

The process by which the invention mines completely weighted association rules from the completely weighted data instance of Table 4 is as follows (where minsup = 0.1, minInt = 0.1, minconf = 0.4, w denotes the weight of an itemset and s denotes the support of an itemset):

Step 1: awPIS = ∅; awNIS = ∅.

Step 2: C₁ = {(i₁): w = 3.18, s = 0.636; (i₂): w = 0.61, s = 0.122; (i₃): w = 2.85, s = 0.57; (i₄): w = 0.96, s = 0.192; (i₅): w = 0.92, s = 0.184} ⇒ L₁ = {(i₁), (i₂), (i₃), (i₄), (i₅)} ⇒ awPIS = {L₁}.

Step 3: ① C₂ = {(i₁, i₂): w = 2.94, s = 0.294; (i₁, i₃): w = 4.43, s = 0.443; (i₁, i₄): w = 0.76, s = 0.076; (i₁, i₅): w = 2.52, s = 0.192; (i₂, i₃): w = 1.76, s = 0.176; (i₂, i₄): w = 0.06, s = 0.006; (i₂, i₅): w = 0.95, s = 0.095; (i₃, i₄): w = 1.8, s = 0.18; (i₃, i₅): w = 0.82, s = 0.082; (i₄, i₅): w = 0.91, s = 0.091} ⇒ L₂ = {(i₁, i₂), (i₁, i₃), (i₁, i₅), (i₂, i₃), (i₃, i₄)}, N₂ = {(i₁, i₄), (i₂, i₄), (i₂, i₅), (i₃, i₅), (i₄, i₅)} ⇒ awPIS = {L₁ ∪ L₂}, awNIS = {N₂};

② C₃ = {(i₁, i₂, i₃): w = 3.34, s = 0.223; (i₁, i₂, i₅): w = 1.7, s = 0.113; (i₁, i₃, i₅): w = 1.67, s = 0.111} ⇒ L₃ = {(i₁, i₂, i₃), (i₁, i₂, i₅), (i₁, i₃, i₅)}, N₃ = ∅ ⇒ awPIS = {L₁ ∪ L₂ ∪ L₃}, awNIS = {N₂ ∪ N₃};

③ C₄ = {(i₁, i₂, i₃, i₅): w = 0, s = 0} ⇒ L₄ = ∅.
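The level-wise generation in Steps 2 and 3 can be sketched as below. This is a rough illustration under assumptions, not the patented implementation: the function names are hypothetical, items are represented by their index numbers, and the join performs no subset-based pruning, because the example keeps the candidate (i₁, i₂, i₅) even though (i₂, i₅) is not frequent. The supports are the s values listed in Step 3 ①.

```python
def apriori_join(frequent_prev):
    """Join (i-1)-itemsets sharing their first i-2 items into candidate i-itemsets."""
    prev = sorted(tuple(sorted(itemset)) for itemset in frequent_prev)
    candidates = []
    for a in range(len(prev)):
        for b in range(a + 1, len(prev)):
            if prev[a][:-1] == prev[b][:-1]:  # same prefix, different last item
                candidates.append(prev[a] + (prev[b][-1],))
    return candidates


def split_by_support(support, minsup):
    """Split candidates into frequent (s >= minsup) and negative (s < minsup)."""
    frequent = [c for c, s in support.items() if s >= minsup]
    negative = [c for c, s in support.items() if s < minsup]
    return frequent, negative


# Candidate 2-itemsets and their supports as listed in Step 3
c2 = {(1, 2): 0.294, (1, 3): 0.443, (1, 4): 0.076, (1, 5): 0.192,
      (2, 3): 0.176, (2, 4): 0.006, (2, 5): 0.095, (3, 4): 0.18,
      (3, 5): 0.082, (4, 5): 0.091}
l2, n2 = split_by_support(c2, minsup=0.1)
c3 = apriori_join(l2)
print(l2, n2, c3, sep="\n")
```

Joining the three frequent 3-itemsets in the same way yields the single candidate (i₁, i₂, i₃, i₅), which matches C₄ in Step 3 ③.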

Step 4: Pruning: prune the itemsets in the frequent itemset set awPIS. The pruned frequent itemsets are (i₂, i₃), (i₃, i₄), (i₁, i₂, i₅) and (i₁, i₃, i₅); after pruning, awPIS = {(i₁, i₂), (i₁, i₃), (i₁, i₅), (i₁, i₂, i₃)}.

Step 5: Similarly, in the negative itemset set awNIS, the pruned negative itemset is (i₃, i₅); after pruning, awNIS = {(i₁, i₄), (i₂, i₄), (i₂, i₅), (i₄, i₅)}.

Step 6: Mine the completely weighted positive and negative association rules from the frequent itemset set awPIS and the negative itemset set awNIS. Taking the frequent itemset (i₁, i₂, i₃) and the negative itemset (i₄, i₅) as examples, the mining process is as follows:

For the frequent itemset (i₁, i₂, i₃), take its subsets I₁ = (i₁) and I₂ = (i₂, i₃) as an example. From the example above, awsup(i₁) and awsup(i₂, i₃) are both greater than minsup, awIDR(I₁, I₂) = 1.5, n × awIWR(I₁, I₂) = 2.98 > awIDR(I₁, I₂), awsup(I₁∪I₂) = 0.223 > minsup, awCPIR(I₁→I₂) = 0.212 < minconf, and awCPIR(I₂→I₁) = 1.73 > minconf; awsup(﹁I₁∪﹁I₂) = 0.411 > minsup, awCPIR(﹁I₁→﹁I₂) = 1.73 > minconf, and awCPIR(﹁I₂→﹁I₁) = 0.212 < minconf. Therefore, I₂→I₁ and ﹁I₁→﹁I₂ (that is, (i₂, i₃)→(i₁) and ﹁(i₁)→﹁(i₂, i₃)) are valid completely weighted positive and negative association rules.

For the negative itemset (i₄, i₅), take its subsets I₁ = (i₄) and I₂ = (i₅). From the example above, awsup(i₄) and awsup(i₅) are both greater than minsup, awIDR(I₁, I₂) = 2, n × awIWR(I₁, I₂) = 1.03 < awIDR(I₁, I₂), awsup(I₁∪﹁I₂) = 0.101 > minsup, awsup(﹁I₁∪I₂) = 0.093 < minsup, awCPIR(I₁→﹁I₂) = 1.577 > minconf, and awCPIR(﹁I₂→I₁) = 0.084 < minconf. Therefore, I₁→﹁I₂ (that is, (i₄)→﹁(i₅)) is a valid completely weighted negative association rule.
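The positive-rule part of Step 6 can be sketched as an enumeration of all ways to split a frequent itemset into two disjoint sub-itemsets. The sketch below is hedged in several ways: the function names are hypothetical, the positive-correlation test uses awISCorr > 1 from formula (16) instead of the n × awIWR > awIDR shortcut, the absolute value in awCPIR follows the worked examples, and negative rules are left out because formulas (11) to (13) are not reproduced in this text. Supports are the s values from Steps 2 and 3.

```python
from itertools import combinations


def awcpir(s_antecedent, s_consequent, s_union):
    """awCPIR of a positive rule, formula (10) with the example's absolute value."""
    return abs(s_union - s_antecedent * s_consequent) / (
        s_antecedent * (1 - s_consequent))


def positive_rules(itemset, support, minsup, minconf):
    """Return (antecedent, consequent, awCPIR) for each valid positive rule."""
    items = tuple(sorted(itemset))
    s_union = support[items]
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            s_a, s_c = support[antecedent], support[consequent]
            if s_a < minsup or s_c < minsup:
                continue  # both sub-itemsets must be frequent
            if s_union <= s_a * s_c:
                continue  # awISCorr <= 1: not positively correlated
            confidence = awcpir(s_a, s_c, s_union)
            if confidence >= minconf:
                rules.append((antecedent, consequent, round(confidence, 3)))
    return rules


support = {(1,): 0.636, (2,): 0.122, (3,): 0.57,
           (1, 2): 0.294, (1, 3): 0.443, (2, 3): 0.176,
           (1, 2, 3): 0.223}
rules = positive_rules((1, 2, 3), support, minsup=0.1, minconf=0.4)
for antecedent, consequent, confidence in rules:
    print(antecedent, "->", consequent, confidence)
```

With minconf = 0.4 the rule (i₂, i₃)→(i₁) is kept and (i₁)→(i₂, i₃) is rejected, in line with the example above.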

The beneficial effects of the invention are further described below through experiments.

To verify the validity, correctness and scalability of the invention, we selected part of the corpus of the Chinese Web test collection CWT200g (Chinese Web Test Collection with 200 GB of web pages), provided by the Network Laboratory of Peking University, as the experimental data test set. The experiments ran on an Intel(R) Core(TM) i7-3770 CPU @ 3.4 GHz with 4.0 GB of memory under Windows 7; the programs were implemented in Delphi 2006, and the database system was SQL Server 2008. A typical unweighted positive and negative association rule mining method (Xindong Wu, Chengqi Zhang, Shichao Zhang. Efficient Mining of Both Positive and Negative Association Rules. ACM Transactions on Information Systems, 2004, 22(3): 381-405.), denoted the PNAR-Mining method, was chosen as the comparison method.

The Chinese Web test collection CWT200g has a size of 197 GB and contains 37,482,913 web pages, each compressed and organized in the Tianwang storage format. From the CWT200g test collection, 12024 plain text documents were extracted as the experimental document test set. The Chinese lexical analysis system ICTCLAS (developed by the Institute of Computing Technology, Chinese Academy of Sciences) was used to segment the test documents into words. The feature word weight w_ij is computed as w_ij = (0.5 + 0.5 × tf_ij / max_j(tf_ij)) × idf_i. The test documents are preprocessed as follows: word segmentation, stop word removal, feature word extraction and weight computation, and construction of the text database and the feature dictionary based on the vector space model. After preprocessing, the experimental document test set yields 8751 feature words, whose document frequency df (the number of documents containing the feature word) ranges from 51 to 11258. As required for mining, feature words with very low or very high df values are removed in the experiments, and the feature words with df values from 1500 to 5838 (400 feature words in total) are extracted to build the feature word item library. The feature words occur 1019494 times in total in the 12024 test documents, an average of 85 times per document. The experiment parameters are shown in Table 5.
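The feature word weight formula can be written as a one-line function. This is a minimal sketch: the name `feature_weight` and its parameter names are hypothetical, and the inverse document frequency is passed in as a precomputed number because the text does not state how idf_i is calculated.

```python
def feature_weight(tf, max_tf, idf):
    """w_ij = (0.5 + 0.5 * tf_ij / max_j(tf_ij)) * idf_i.

    tf     -- term frequency tf_ij of the feature word
    max_tf -- the normalising maximum term frequency max_j(tf_ij)
    idf    -- precomputed idf_i (its formula is an assumption left to the caller)
    """
    return (0.5 + 0.5 * tf / max_tf) * idf


# Illustrative numbers only, not taken from the experiments
print(feature_weight(tf=3, max_tf=6, idf=2.0))
```

The augmented term frequency in the first factor stays between 0.5 and 1, so every feature word that occurs keeps at least half of its idf value as weight.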

Table 5. Experiment parameters

Experiment 1: Comparison of mining performance under varying support thresholds

Under different support thresholds, Figures 3 to 8 compare the numbers of itemsets, namely candidate itemsets (Candidate Itemset, CI), frequent itemsets (Frequent Itemset, FI) and negative itemsets (Negative Itemset, NI), and of positive and negative association rules (Positive and Negative Association Rule, PNAR) mined from the experimental document test set by the AWPNAR-Mining method of the invention and by the comparison method PNAR-Mining (ItemNum = 50, minconf = 0.0002, minInt = 0.0002, TRecordNum = 12024).

Experiment 2: Comparison of mining performance under varying confidence thresholds

Under varying confidence thresholds, Table 6 compares the numbers of positive and negative association rules (A→B, A→﹁B, ﹁A→B and ﹁A→﹁B) mined from the experimental document test set by the AWPNAR-Mining method of the invention and by the comparison method PNAR-Mining (minsup = 0.03, minInt = 0.0002, ItemNum = 50, TRecordNum = 12024).

Table 6. Comparison of the numbers of positive and negative association rules mined under different confidence thresholds

Experiment 3: Comparison of mining time efficiency

To compare the mining time efficiency of the two methods, we measured the mining time of the AWPNAR-Mining method of the invention and of the comparison method PNAR-Mining under varying support thresholds and under varying confidence thresholds; the results are shown in Tables 7 and 8 (minInt = 0.0002, ItemNum = 50, TRecordNum = 12024). Table 7 compares the time the two methods take to mine itemsets and association rules from the experimental document test set under varying support thresholds (minconf = 0.0002); Table 8 compares the time they take to mine positive and negative association rules under varying confidence thresholds (minsup = 0.03).

Table 7. Comparison of the time (unit: seconds) to mine itemsets and association rules under different support thresholds

Table 8. Comparison of the time (unit: seconds) to mine positive and negative association rules under different confidence thresholds

Experiment 4: Scalability analysis

We tested and analyzed the scalability of the method of the invention under two conditions: a varying number of items and a varying size of the data test set.

To test the scalability of the invention, the experiment parameters were set to ItemNum = 50, TRecordNum = 12024, minsup = 0.05, minconf = 0.07 and minInt = 0.001. Figures 9 to 14 show how the numbers of frequent itemsets (FI), negative itemsets (NI), positive and negative association rules (PNAR) and other patterns mined from data test set 1 by the AWPNAR-Mining method of the invention change as the number of items and the size of the data test set vary, respectively.

In summary, the experimental results show that, compared with the comparison method PNAR-Mining, the AWPNAR-Mining method of the invention achieves good mining performance and greatly improved mining efficiency. Under both varying support thresholds and varying confidence thresholds, the numbers of candidate itemsets, frequent itemsets, negative itemsets, and positive and negative association rules mined by the invention are all much smaller than those of the comparison method.

Claims

1. A completely weighted pattern mining method for discovering association rules among text words, characterized by comprising the following steps:

(1) Completely weighted data preprocessing stage: preprocess the completely weighted data to be processed, and build the completely weighted database and the item library;

2.1. Extract the completely weighted candidate 1-itemsets from the item library and mine the completely weighted frequent 1-itemsets; the specific steps follow 2.1.1 to 2.1.3:

2.1.1. Extract the completely weighted candidate 1-itemsets from the item library;

2.1.2. Accumulate the weight sum of each completely weighted candidate 1-itemset in the completely weighted database and compute its support;

2.1.3. Add the completely weighted frequent 1-itemsets, that is, the completely weighted candidate 1-itemsets whose support is greater than or equal to the minimum support threshold, to the completely weighted frequent itemset set;

2.2.1. Perform the Apriori join on the completely weighted frequent (i-1)-itemsets to generate the completely weighted candidate i-itemsets, where i ≥ 2;

2.2.2. Accumulate the weight sum of each completely weighted candidate i-itemset in the completely weighted database and compute its support;

2.2.3. From the completely weighted candidate i-itemsets, take out the frequent i-itemsets whose support is not less than the support threshold and store them in the completely weighted frequent itemset set; at the same time, store the completely weighted negative i-itemsets whose support is less than the support threshold in the completely weighted negative itemset set;

2.2.4. Increase the value of i by 1; if the set of frequent (i-1)-itemsets is empty, go to step (3); otherwise, repeat steps 2.2.1 to 2.2.3;

(3) Pruning stage: obtain the interesting completely weighted frequent itemsets and negative itemsets through pruning:

3.1. For each frequent i-itemset awL_i in the frequent itemset set, compute the value of IAWFI(awL_i) and prune the frequent itemsets whose IAWFI(awL_i) value is false; the interesting completely weighted frequent itemset set is obtained after pruning;

3.2. For each negative i-itemset awN_i in the completely weighted negative itemset set, compute the value of IAWNI(awN_i) and prune the negative itemsets whose IAWNI(awN_i) value is false; the interesting completely weighted negative itemset set is obtained after pruning;

(4) Mine the valid completely weighted positive and negative association rules from the interesting completely weighted frequent itemset set, comprising the following steps:

4.1. Take a frequent itemset awL_i from the interesting completely weighted frequent itemset set, find all proper subsets of awL_i, build the proper subset set of awL_i, and then perform the following operations:

4.2.1. Take any two proper subsets I₁ and I₂ from the proper subset set of awL_i; when the intersection of I₁ and I₂ is empty, the sum of the numbers of items of I₁ and I₂ equals the number of items of the original frequent itemset, and the supports of I₁ and I₂ are both not less than the support threshold, compute the in-itemset weight ratio awIWR(I₁, I₂) and the dimension ratio awIDR(I₁, I₂) of the frequent itemset (I₁∪I₂);

4.2.2.1. If the awCPIR value of I₁→I₂ (awCPIR(I₁→I₂)) is not less than the confidence threshold minconf, mine the completely weighted association rule I₁→I₂; if the awCPIR value of I₂→I₁ (awCPIR(I₂→I₁)) is not less than the confidence threshold minconf, mine the completely weighted association rule I₂→I₁;

4.2.2.2. If the support of (﹁I₁∪﹁I₂) is not less than the support threshold minsup, then: ① if the awCPIR value of ﹁I₁→﹁I₂ (awCPIR(﹁I₁→﹁I₂)) is not less than the confidence threshold minconf, mine the completely weighted negative association rule ﹁I₁→﹁I₂; ② if the awCPIR value of ﹁I₂→﹁I₁ (awCPIR(﹁I₂→﹁I₁)) is not less than the confidence threshold minconf, mine the completely weighted negative association rule ﹁I₂→﹁I₁;

4.2.3.1. If the support of (I₁∪﹁I₂) is not less than the support threshold minsup, then: ① if the awCPIR value of I₁→﹁I₂ (awCPIR(I₁→﹁I₂)) is not less than the confidence threshold minconf, mine the completely weighted negative association rule I₁→﹁I₂; ② if the awCPIR value of ﹁I₂→I₁ (awCPIR(﹁I₂→I₁)) is not less than the confidence threshold minconf, mine the completely weighted negative association rule ﹁I₂→I₁;

4.2.3.2. If the support of (﹁I₁∪I₂) is not less than the support threshold minsup, then: ① if the awCPIR value of ﹁I₁→I₂ (awCPIR(﹁I₁→I₂)) is not less than the confidence threshold minconf, mine the completely weighted negative association rule ﹁I₁→I₂; ② if the awCPIR value of I₂→﹁I₁ (awCPIR(I₂→﹁I₁)) is not less than the confidence threshold minconf, mine the completely weighted negative association rule I₂→﹁I₁;

4.2.5. Return to step 4.1; once every frequent itemset awL_i in the interesting completely weighted frequent itemset set has been taken out exactly once, go to step (5);

(5) Mine the valid completely weighted negative association rules from the interesting completely weighted negative itemset set, comprising the following steps:

5.1. Take a negative itemset awN_i from the interesting completely weighted negative itemset set, find all proper subsets of awN_i, build the proper subset set of awN_i, and then perform the following operations:

5.2.1. Take any two proper subsets I₁ and I₂ from the proper subset set of awN_i; when the intersection of I₁ and I₂ is empty, the sum of the numbers of items of I₁ and I₂ equals the number of items of the original frequent itemset, and the supports of I₁ and I₂ are both greater than or equal to the support threshold, compute the in-itemset weight ratio awIWR(I₁, I₂) and the dimension ratio awIDR(I₁, I₂) of the negative itemset (I₁∪I₂);

5.2.2.1. If the support of (﹁I₁∪﹁I₂) is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of ﹁I₁→﹁I₂ (awCPIR(﹁I₁→﹁I₂)) is greater than or equal to the confidence threshold minconf, mine the completely weighted negative association rule ﹁I₁→﹁I₂; ② if the awCPIR value of ﹁I₂→﹁I₁ (awCPIR(﹁I₂→﹁I₁)) is greater than or equal to the confidence threshold minconf, mine the completely weighted negative association rule ﹁I₂→﹁I₁;

5.2.3. When the product of the total number of transaction records in the database (n) and the in-itemset weight ratio awIWR(I₁, I₂) from step 5.2.1 is less than the dimension ratio awIDR(I₁, I₂), that is, n × awIWR(I₁, I₂) < awIDR(I₁, I₂), perform the following operations:

5.2.3.1. If the support of (I₁∪﹁I₂) is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of I₁→﹁I₂ (awCPIR(I₁→﹁I₂)) is greater than or equal to the confidence threshold minconf, mine the completely weighted negative association rule I₁→﹁I₂; ② if the awCPIR value of ﹁I₂→I₁ (awCPIR(﹁I₂→I₁)) is greater than or equal to the confidence threshold minconf, mine the completely weighted negative association rule ﹁I₂→I₁;

5.2.3.2. If the support of (﹁I₁∪I₂) is greater than or equal to the support threshold minsup, then: ① if the awCPIR value of ﹁I₁→I₂ (awCPIR(﹁I₁→I₂)) is greater than or equal to the confidence threshold minconf, mine the completely weighted negative association rule ﹁I₁→I₂; ② if the awCPIR value of I₂→﹁I₁ (awCPIR(I₂→﹁I₁)) is greater than or equal to the confidence threshold minconf, mine the completely weighted negative association rule I₂→﹁I₁;

5.2.5. Return to step 5.1; once every negative itemset awN_i in the interesting completely weighted negative itemset set has been taken out exactly once, the mining of completely weighted positive and negative association rules ends;

In the symbols "﹁I₁", "﹁I₂", "I₁∪﹁I₂", "I₁→﹁I₂" and the like, "﹁" is the negation symbol: ﹁I₁ denotes the event that I₁ does not occur in a transaction and is called the negative itemset of I₁; (I₁∪﹁I₂) denotes an itemset that contains the sub-itemset I₁ and the negative sub-itemset ﹁I₂; the association rule I₁→﹁I₂ means that if the sub-itemset I₁ occurs, then the sub-itemset I₂ does not occur.

2. The completely weighted pattern mining method for discovering association rules among text words according to claim 1, characterized in that the preprocessing of the completely weighted data to be processed comprises the following specific steps: when the data to be processed are Chinese text data, perform word segmentation, remove stop words, extract feature words and compute their weights; when the data to be processed are English text data, perform stemming, remove stop words, perform lexical analysis, extract feature words and compute their weights.