CN104182527A - Partial-order itemset based Chinese-English text word association rule mining method and system - Google Patents

Partial-order itemset based Chinese-English text word association rule mining method and system

Info

Publication number
CN104182527A
CN104182527A (application CN201410427491.8A)
Authority
CN
China
Prior art keywords
collection
item
partial order
candidate
feature words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410427491.8A
Other languages
Chinese (zh)
Other versions
CN104182527B (en)
Inventor
Huang Mingxuan (黄名选)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
GUANGXI COLLEGE OF EDUCATION
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGXI COLLEGE OF EDUCATION filed Critical GUANGXI COLLEGE OF EDUCATION
Priority to CN201410427491.8A priority Critical patent/CN104182527B/en
Publication of CN104182527A publication Critical patent/CN104182527A/en
Application granted granted Critical
Publication of CN104182527B publication Critical patent/CN104182527B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/316 - Indexing structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a partial-order itemset based method and system for mining association rules between Chinese and English text words. A text information preprocessing module performs preprocessing to build a text information database and a feature word item library. A feature word frequent partial-order itemset module mines feature word candidate itemsets and derives their partial-order itemsets; the candidate partial-order itemsets are pruned by a new itemset pruning method, the weights of the candidate partial-order itemsets are calculated, and their supports are computed by a new calculation method so as to obtain the frequent partial-order itemsets.

Description

Method and system for mining association rules between Chinese and English text words based on partial-order itemsets
Technical field
The invention belongs to the field of data mining, and specifically relates to a method and system for mining association rules between Chinese and English text words based on partial-order itemsets. It is applicable to discovering feature word association patterns in Chinese and English text mining, and to fields such as query expansion in Chinese and English document information retrieval and Chinese-English cross-language information retrieval.
Background technology
Over the past twenty-odd years, research on association rule mining has produced significant technical achievements, concentrated mainly in two areas: mining based on item frequency and mining based on item weights.
Mining based on item frequency, also called unweighted association rule mining, treats all items as equally important and uses the probability of an itemset occurring in the transactions, and the corresponding conditional probability, as the itemset support and rule confidence respectively. The most representative classical approach is the Apriori method (R. Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in large databases [C] // Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., 1993, (5): 207-216). On this basis, scholars have adopted various approaches and improved the Apriori method from different angles.
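The frequency-based scheme described above can be illustrated with a minimal Apriori-style sketch (an illustration of the classical unweighted approach only, not the method of the invention; the transactions and threshold are invented for the example):

```python
def apriori(transactions, min_support):
    """Minimal frequency-based (unweighted) itemset mining: an itemset's
    support is simply the fraction of transactions containing it."""
    n = len(transactions)
    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) / n >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidate k-itemsets by joining frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) / n >= min_support}
        result |= frequent
        k += 1
    return result

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
freq = apriori(transactions, min_support=0.5)
```

Here every item counts equally, which is exactly the limitation the weighted approaches below address.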
Although frequency-based mining methods have been widely studied, they have the following drawback: they consider only item frequency and ignore item weights, which causes many barren, redundant, and invalid association patterns to be produced. To address this problem, weighted association pattern mining based on item weights has received extensive discussion and research; it introduces weights to reflect that items have different importance and carry different weights in transaction records. Depending on the source of the item weights, weight-based mining divides into two classes: weighted pattern mining with fixed item weights, and all-weighted pattern mining with item weights that vary across transactions.
Weighted pattern mining with fixed item weights is an early weight-based mining approach that has attracted the attention and deep study of many scholars since 1998. Its characteristic is that item weights are set by users or domain experts and remain fixed throughout the mining process. Typical algorithms are the weighted association rule mining algorithms MINWAL(O) and MINWAL(W) proposed by Cai et al. (C. H. Cai, A. W. C. Fu, et al. Mining Association Rules with Weighted Items [C] // Proceedings of the IEEE International Database Engineering and Application Symposium, 1998: 68-77). Improved weighted pattern mining methods have since appeared, achieving good results in both mining efficiency and mining performance.
The limitation of fixed-weight association rule mining is that it does not consider item weights that change with the transaction record; it ignores weight variation and cannot solve mining problems over data whose item weights vary. Data with varying item weights are conventionally called all-weighted data, also known as matrix-weighted data. Text information is typical all-weighted data: in massive text collections, feature word weights depend on each document and change from document to document. All-weighted association mining overcomes the defect of fixed-weight pattern mining and can mine various association patterns from data with varying item weights; it belongs to the class of mining with varying item weights, and its principal feature is that item weights depend on the transaction and change dynamically. Typical all-weighted association mining methods are the algorithm KWEstimate for mining all-weighted association rules from the vector space model proposed by Tan Yihong et al. in 2003 (Tan Yihong, Lin Yaping. Mining all-weighted association rules from the vector space model [J]. Computer Engineering and Applications, 2003 (13): 208-211) and the query-expansion-oriented matrix-weighted association rule mining algorithm MWARM (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo relevance feedback query expansion based on matrix-weighted association rule mining [J]. Journal of Software, 2009, 20 (7): 1854-1865). These methods achieve good results in mining all-weighted association patterns and have been successfully applied to query expansion in information retrieval (the above, and Huang Mingxuan, Yan Xiaowei, Zhang Shichao. All-weighted association rule mining and its application in query expansion [J]. Application Research of Computers, 2008, 25 (6): 1724-1727), with notable effect. The defect of existing methods based on varying weights is that the number of mined association patterns is still very large, which increases the difficulty for users of selecting the patterns they need; many barren, false, and invalid patterns are also produced, making the techniques difficult to put into practical application.
With the development of network and information technology, all-weighted data (such as network text information) is growing rapidly into massive data. How to mine useful association patterns that better reflect reality from such massive all-weighted data is an urgent problem. Mining algorithms based on fixed item weights are not suitable for processing all-weighted data, and at present most work still processes such data with frequency-based mining methods, producing large numbers of barren, redundant, and invalid association patterns. For the above problems, and according to the characteristics of Chinese and English document data, the present invention proposes a new partial-order itemset based method and system for mining Chinese and English feature word association rules. The invention adopts new partial-order itemset support calculation methods and pruning techniques, avoids generating many invalid, false, and barren association patterns, greatly improves Chinese and English text mining efficiency, and mines feature word association rule patterns closer to reality. Experimental results show that both the number of feature word association patterns mined by the proposed text mining method and the mining time are clearly reduced; its mining performance is better than existing all-weighted pattern mining and frequency-based pattern mining methods, and the mined feature word association patterns can provide a reliable source of query expansion words for information retrieval. The inventive method therefore has important application value and broad prospects in fields such as text mining and information retrieval.
Summary of the invention
The technical problem to be solved by the invention is, through deep study of Chinese and English text feature word association pattern mining, to propose a method and system for mining association rules between Chinese and English text words based on partial-order itemsets, improving Chinese and English text mining efficiency. Applied to query expansion in Chinese and English document information retrieval, it can improve retrieval performance; applied to Chinese and English text mining, it can find more realistic and reasonable Chinese and English feature word association patterns, thereby improving the precision of text clustering and classification. For example, in a search engine (Baidu, Google, etc.), the inventive method can obtain high-quality expansion words to realize user query expansion and improve recall and precision.
The technical scheme adopted by the invention to solve the above technical problem is a method for mining association rules between Chinese and English text words based on partial-order itemsets, comprising the following steps:
(1) Chinese and English text information preprocessing: the Chinese and English text information to be processed is preprocessed by Chinese word segmentation, English stemming, stop word removal, and feature word extraction with weight calculation, to build a text information database and a feature word item library based on the vector space model.
The Porter stemmer (see http://tartarus.org/~martin/PorterStemmer) is adopted as the English stemming program, and the Chinese word segmentation program is the ICTCLAS Chinese word segmentation system developed by the Institute of Computing Technology, Chinese Academy of Sciences (see http://www.ictclas.org/).
The text feature word weight is computed as: w_ij = (1 + ln(tf_ij)) × idf_i,

where w_ij is the weight of the i-th feature word in the j-th document, idf_i is the inverse document frequency of the i-th feature word, with idf_i = log(n/df_i), n is the total number of documents in the document set, df_i is the number of documents containing the i-th feature word, and tf_ij is the term frequency of the i-th feature word in the j-th document;
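The weighting formula above can be sketched directly in code (the logarithm base of idf is not stated in the text, so base 10 is assumed here; the counts are invented for the example):

```python
import math

def feature_word_weight(tf_ij, df_i, n_docs):
    """w_ij = (1 + ln(tf_ij)) * idf_i with idf_i = log(n/df_i),
    as in the preprocessing step above (idf log base 10 assumed)."""
    idf_i = math.log10(n_docs / df_i)
    return (1 + math.log(tf_ij)) * idf_i

# A feature word occurring 3 times in document j, present in 10 of 1000 documents:
w = feature_word_weight(tf_ij=3, df_i=10, n_docs=1000)
```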
(2) Mine the all-weighted feature word frequent partial-order itemsets, comprising the following steps 2.1 and 2.2:
2.1. Mine the all-weighted feature word frequent 1-itemsets L1; the concrete steps are carried out according to 2.1.1 to 2.1.3:

2.1.1. Extract the feature word candidate 1-itemsets C1 from the feature word item library; accumulate the weights of all items in the text information database to obtain the total item weight sum W; accumulate the weight total w(C1) of C1 in the text information database; and calculate the support poisup(C1) of C1. The formula for poisup(C1) is:

poisup(C1) = w(C1) / W

2.1.2. Join each feature word candidate 1-itemset C1 with poisup(C1) ≥ ms to the set FIS of feature word frequent itemsets as a frequent 1-itemset L1, where ms is the minimum support threshold.

2.1.3. Accumulate the occurrence frequency n_C1 of the candidate 1-itemset C1 in the text information database, extract w_r(C1), and calculate the partial-order itemset weight expectation POIWB(C1, 2) of C1. The formula for POIWB(C1, 2) is:

POIWB(C1, 2) = 2 × W × ms − n_C1 × w_r(C1),

where w_r(C1) is the largest weight among the items not belonging to the item set of C1.
2.2. Mine the all-weighted feature word frequent k-itemsets L_k, k ≥ 2, operating according to steps 2.2.1 to 2.2.12:

2.2.1. For the candidate (k−1)-itemsets C_{k−1}, prune every C_{k−1} with w(C_{k−1}) < POIWB(C_{k−1}, k), which cannot become a frequent k-itemset, obtaining a new set of candidates C_{k−1}. (Pruning 1)

Here w(C_{k−1}) is the accumulated weight of C_{k−1} in the text information database, and POIWB(C_{k−1}, k) is the k-itemset weight expectation of the all-weighted candidate (k−1)-itemset C_{k−1}, computed as follows:

POIWB(C_{k−1}, k) = k × W × ms − n_{k−1} × w_r,

where n_{k−1} is the occurrence frequency of the candidate C_{k−1} in the text information database, and w_r is the largest weight among the items not belonging to the item set of C_{k−1}.
2.2.2. Join the feature word candidate (k−1)-itemsets C_{k−1} whose itemset frequency is not 0 by Apriori joining, generating the feature word candidate k-itemsets C_k.

2.2.3. If C_k is empty, exit step 2.2 and proceed to step (3); otherwise, if C_k is not empty, proceed to step 2.2.4.

2.2.4. For each candidate k-itemset C_k, examine every (k−1)-subset of C_k. If there exists a (k−1)-subset whose itemset weight is less than the corresponding partial-order itemset weight expectation (w_{k−1} < POIWB(C_{k−1}, k)), then C_k must be a non-frequent itemset; delete it from the candidate set, obtaining a new set of candidate partial-order itemsets poC_k. (Pruning 2)

2.2.5. Accumulate the occurrence frequency n_Ck of the candidate k-itemset C_k in the text information database and its item weights w_1(C_k), w_2(C_k), …, w_k(C_k); extract w_r(C_k); and calculate the weight expectation POIWB(C_k, k+1) of C_k. The formula for POIWB(C_k, k+1) is:

POIWB(C_k, k+1) = (k+1) × W × ms − n_Ck × w_r(C_k)

2.2.6. Delete the candidate k-itemsets C_k whose itemset frequency is 0, obtaining a new set of C_k. (Pruning 3)
2.2.7. Obtain the partial-order itemset poC_k of each C_k.

2.2.8. Examine the high-order proper subsets of each partial-order itemset poC_k. If a high-order proper subset of poC_k is non-frequent, poC_k is certainly non-frequent; delete it from the set, obtaining a new set of candidate partial-order itemsets poC_k. (Pruning 4)

2.2.9. Examine the weight of the high-order item of each partial-order itemset poC_k. If the weight of the high-order item of poC_k is less than the minimum weight threshold of 1-itemsets minw, poC_k is certainly non-frequent; delete it from the set, obtaining a new set of candidate partial-order itemsets poC_k. The formula for minw is: minw = W × ms. (Pruning 5)

2.2.10. Examine the low-order item of each partial-order itemset poC_k. If the weight of the low-order item of poC_k is not less than minw, poC_k must be frequent; join it to the set FIS of feature word frequent itemsets.

2.2.11. For each remaining partial-order itemset poC_k, calculate its support poisup(poC_k). If poisup(poC_k) ≥ ms, the partial-order itemset poC_k is frequent; join it to the set FIS of feature word frequent itemsets. The formula for poisup(poC_k) is:

poisup(poC_k) = w(poC_k) / (k × W),

where w(poC_k) is the accumulated weight of the partial-order itemset poC_k in the text information database and k is the number of items of the feature word partial-order itemset poC_k.

2.2.12. Add 1 to the value of k and loop over steps 2.2.1 to 2.2.12 until C_k is empty; then exit step 2.2 and proceed to step (3) below.
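Steps 2.2.9 to 2.2.11 reduce to simple checks on a candidate's item weights; a minimal sketch, assuming the support formula poisup = w(poC_k)/(k × W) of step 2.2.11 (the numeric values are invented for the example):

```python
def classify_partial_order_itemset(item_weights, W, ms):
    """Given the per-item weights of a candidate partial-order k-itemset,
    apply Pruning 5 (2.2.9), the low-order shortcut (2.2.10), and the
    support test of step 2.2.11."""
    k = len(item_weights)
    minw = W * ms                # minimum weight threshold for 1-itemsets
    high = max(item_weights)     # weight of the high-order item
    low = min(item_weights)      # weight of the low-order item
    if high < minw:              # 2.2.9: certainly non-frequent
        return "pruned"
    if low >= minw:              # 2.2.10: certainly frequent
        return "frequent"
    poisup = sum(item_weights) / (k * W)
    return "frequent" if poisup >= ms else "non-frequent"

W, ms = 10.0, 0.1                # total weight sum and support threshold
status = classify_partial_order_itemset([1.29, 1.2, 0.86], W, ms)
```

Note that the two shortcuts are consistent with the support formula: if even the largest item weight is below minw = W × ms the sum cannot reach k × W × ms, and if even the smallest reaches minw the sum must.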
(3) Mine effective all-weighted feature word strong association rule patterns from the set FIS of feature word frequent itemsets, comprising the following steps:

3.1. Take a feature word frequent itemset L_i from the set FIS and generate all proper subsets of L_i.

3.2. Take any two proper subsets I1 and I2 from the proper subset set of L_i such that I1 ∩ I2 = ∅ and I1 ∪ I2 = L_i. If w12 ≥ (k12/k1) × w1 × mc, mine the feature word strong association rule I1 → I2; if w12 ≥ (k12/k2) × w2 × mc, mine the feature word strong association rule I2 → I1. Here k1, k2, and k12 are the numbers of items of the itemsets I1, I2, and (I1, I2), respectively; w1, w2, and w12 are the itemset weights of I1, I2, and (I1, I2), respectively; and mc is the minimum confidence threshold.

3.3. Continue step 3.2 until each proper subset in the proper subset set of the feature word frequent itemset L_i has been taken out once and only once, then proceed to step 3.4.

3.4. Continue step 3.1 until each frequent itemset L_i in the set of feature word frequent itemsets has been taken out once and only once; step (3) then ends.

At this point, the mining of all-weighted feature word association rule patterns ends.
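The confidence test of step 3.2 can be sketched as follows (the itemsets and weights are invented for the example; w1, w2, and w12 denote the itemset weights of I1, I2, and (I1, I2) as defined above):

```python
def strong_rules(I1, I2, w1, w2, w12, mc):
    """Step 3.2: emit I1 -> I2 when w12 >= (k12/k1) * w1 * mc, and
    I2 -> I1 when w12 >= (k12/k2) * w2 * mc."""
    k1, k2 = len(I1), len(I2)
    k12 = k1 + k2                # I1 and I2 are disjoint and I1 ∪ I2 = L_i
    rules = []
    if w12 >= (k12 / k1) * w1 * mc:
        rules.append((I1, I2))   # rule I1 -> I2
    if w12 >= (k12 / k2) * w2 * mc:
        rules.append((I2, I1))   # rule I2 -> I1
    return rules

# A hypothetical frequent 3-itemset split into I1 = (i2,) and I2 = (i3, i4):
rules = strong_rules(("i2",), ("i3", "i4"), w1=4.0, w2=2.5, w12=3.35, mc=0.25)
```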
A mining system suitable for the above method for mining association rules between Chinese and English text words based on partial-order itemsets, characterized in that it comprises the following 4 modules:

Text information preprocessing module: preprocesses the pending Chinese and English text data, i.e., Chinese word segmentation, English stemming, stop word removal, and feature word extraction with weight calculation, and builds the text information database and feature word item library based on the vector space model.

Feature word frequent partial-order itemset generation module: mines all-weighted feature word candidate partial-order itemsets from the text information database, prunes the candidate partial-order itemsets with the new pruning method to obtain the final candidate partial-order itemsets, and derives the all-weighted feature word frequent partial-order itemset patterns from the candidate partial-order itemsets by the new partial-order itemset support calculation method.

All-weighted feature word association rule generation module: by simple calculation and comparison of itemset weights and dimensions, mines effective all-weighted feature word association rule patterns I1 → I2 from the all-weighted feature word frequent partial-order itemsets (I1, I2).

Association rule pattern result display module: displays the effective all-weighted feature word association rule patterns to the user in the form the user prefers, for the user's analysis, selection, and use.
The text information preprocessing module comprises the following 2 modules:

Chinese and English text preprocessing module: responsible for segmenting Chinese text information and removing Chinese stop words, and for stemming English text information and removing English stop words, among other Chinese and English corpus preprocessing.

Text database and item library building module: mainly performs Chinese and English feature word extraction and weight calculation, and builds the text information database and the Chinese and English feature word item library based on the vector space model.
The feature word frequent partial-order itemset generation module comprises the following 3 modules:

Feature word candidate partial-order itemset generation module: mainly mines feature word candidate partial-order itemsets from the text information database. The detailed process is as follows: extract the candidate 1-itemsets from the feature word item library, accumulate the weight sums of the candidate 1-itemsets in the text information database, calculate their supports, and derive the all-weighted feature word frequent 1-itemsets; then generate the feature word candidate k-itemsets from the all-weighted feature word frequent (k−1)-itemsets by Apriori joining, with k ≥ 2; accumulate the item weights of each item of the feature word candidate k-itemsets in the text information database, and derive the all-weighted feature word candidate partial-order k-itemsets.

Feature word candidate partial-order itemset pruning module: uses the pruning method of the invention to prune the all-weighted feature word candidate partial-order k-itemsets, deleting the candidate partial-order k-itemsets that cannot be frequent and obtaining the final set of possibly frequent candidate partial-order k-itemsets.

Feature word frequent partial-order itemset generation module: mainly mines the final candidate partial-order k-itemsets obtained after the above pruning module, calculates the supports of the candidate partial-order k-itemsets with the support calculation method of the invention, compares them with the minimum support threshold, and derives the all-weighted feature word frequent partial-order k-itemsets.
The all-weighted feature word association rule generation module comprises the following 2 modules:

Proper subset generation module of the feature word frequent partial-order itemsets: mainly finds all proper subsets of the feature word frequent partial-order itemsets and obtains the itemset weight and dimension of each proper subset.

All-weighted feature word association rule generation module: by simple calculation and comparison of itemset weights, mines effective all-weighted feature word strong association rule patterns from the feature word frequent partial-order itemsets.

The minimum support threshold ms and minimum confidence threshold mc in the mining system are input by the user.
Compared with the prior art, the invention has the following beneficial effects:

(1) The invention first proposes the concept of Chinese and English all-weighted feature word partial-order itemsets, together with a new all-weighted feature word partial-order itemset support calculation method and a partial-order itemset pruning method, and on this basis proposes a method and system for mining association rules between Chinese and English text words based on partial-order itemsets. The invention adopts the new partial-order itemset support calculation method and pruning techniques, avoids generating many invalid, false, and barren association patterns, greatly improves mining efficiency, and mines association patterns closer to reality. Compared with existing mining methods, the invention has a good pruning effect: both the number of association patterns and the mining time are clearly reduced, and its mining performance is better than existing all-weighted pattern mining and frequency-based pattern mining methods. It improves the efficiency of mining Chinese and English feature word association rule patterns, obtains realistic association patterns between text words, and has high application value and broad prospects in fields such as text mining and information retrieval. For example, in a search engine (Baidu, Google, etc.), the inventive method can obtain high-quality expansion words to realize user query expansion and improve recall and precision.
(2) Using the Chinese standard data set CWT200g and the corpus of the international standard English data set NTCIR-5 as experimental data, the invention was tested, compared, and analyzed against the traditional frequency-based pattern mining method and the all-weighted pattern mining method. The experimental results show that, whether the support threshold or the confidence threshold varies, the number of candidate itemsets mined by the invention is smaller than that of the comparison methods, and the mining time of the invention is shorter than that of the comparison methods, with a large reduction; mining efficiency is greatly improved.
Brief description of the drawings
Fig. 1 is a block diagram of the method of the invention for mining association rules between Chinese and English text words based on partial-order itemsets.
Fig. 2 is an overall flow chart of the method of the invention for mining association rules between Chinese and English text words based on partial-order itemsets.
Fig. 3 is a structural block diagram of the system of the invention for mining association rules between Chinese and English text words based on partial-order itemsets.
Fig. 4 is a structural block diagram of the text information preprocessing module of the invention.
Fig. 5 is a structural block diagram of the feature word frequent partial-order itemset generation module of the invention.
Fig. 6 is a structural block diagram of the all-weighted feature word association rule generation module of the invention.
Embodiment
To better explain the technical scheme of the invention, the Chinese and English text data model and the related concepts involved in the invention are described below:
1. Basic concepts
Definition 1 (Chinese and English text information data model):

Chinese and English text information data are all-weighted data with item weights that vary across transactions. Their data model DWDM (Dynamic Weighted Data Model) consists of a set of transaction records TR (Transaction Record), a set of feature word items IS (Item Set), and a set IW (Item Weight) of correspondences between feature word items, transaction records, and weights, formalized as in formula (1):

DWDM = (TR, IS, IW)   (1)

where TR = {r1, r2, …, rn}, and ri (1 ≤ i ≤ n) is the i-th transaction record of DWDM;

IS = {i1, i2, …, im}, and ij (1 ≤ j ≤ m) is the j-th feature word item of DWDM;

IW = {<i1, r1, w[r1][i1]>, <i2, r1, w[r1][i2]>, …, <im, r1, w[r1][im]>, <i1, r2, w[r2][i1]>, <i2, r2, w[r2][i2]>, …, <im, r2, w[r2][im]>, …, <i1, rn, w[rn][i1]>, <i2, rn, w[rn][i2]>, …, <im, rn, w[rn][im]>}.

In the set IW, w[ri][ij] (1 ≤ i ≤ n, 1 ≤ j ≤ m) is the weight of item ij in transaction record ri; if feature word item ij does not appear in transaction record ri, then w[ri][ij] = 0.
Example: an all-weighted data example of Chinese text data is as follows: Text = (TR, IS, IW), where TR = {r1, r2, r3, r4, r5} are 5 transaction records, IS = {i1, i2, i3, i4, i5} are 5 feature word items, and IW = {<i1, r1, 0>, <i2, r1, 0.83>, <i3, r1, 0.81>, <i4, r1, 0>, <i5, r1, 0.01>, <i1, r2, 0>, <i2, r2, 0.94>, <i3, r2, 0.7>, <i4, r2, 0.23>, <i5, r2, 0>, <i1, r3, 0>, <i2, r3, 0.35>, <i3, r3, 0.5>, <i4, r3, 0.63>, <i5, r3, 0>, <i1, r4, 0.95>, <i2, r4, 0>, <i3, r4, 0.85>, <i4, r4, 0>, <i5, r4, 0>, <i1, r5, 0.73>, <i2, r5, 0.02>, <i3, r5, 0>, <i4, r5, 0.06>, <i5, r5, 0.9>}. The set IW can be represented as the table of Fig. 1 below.

Fig. 1. The all-weighted data example
Definition 2 (itemset weight and item weight): An all-weighted itemset I is a set of distinct items i1, i2, …, ip, written I = (i1, i2, …, ip) (1 ≤ p ≤ m), I ⊆ IS. The itemset weight of I is the accumulated weight of i1, i2, …, ip over every transaction record in which all items of I appear simultaneously, denoted w(I), i.e. w(I) = w1 + w2 + … + wp, where w1, w2, …, wp are the weights corresponding to the items i1, i2, …, ip of I, called the item weights of I; the item weight of an item is the accumulated sum of that item's weights over the different transaction records of the set TR in which all items (i1, i2, …, ip) of I occur simultaneously.
In particular, for a subset of the itemset I, the accumulated sum of the weights of the subset's items over the transaction records in which all items (i1, i2, …, ip) of I occur simultaneously is called the subset item weight, denoted w_sub; when this subset is taken as an itemset in its own right, its itemset weight over the transaction record set TR is denoted w(sub). For example, for the subset (i1, i3) of itemset I, the subset item weight is w_sub(i1, i3) = w1 + w3, while w(i1, i3) denotes the itemset weight of (i1, i3) taken as an itemset on its own.
Example: In the Text example of Fig. 1, the itemset weight of the 3-itemset (i2, i3, i4) is the accumulated weight of i2, i3, i4 over every transaction record in which all three items appear simultaneously (the qualifying records are r2 and r3): w(i2, i3, i4) = (0.94+0.7+0.23) + (0.35+0.5+0.63) = 3.35. The item weights of the 3-itemset (i2, i3, i4) are the sums of each single item's weights over the qualifying records (r2 and r3): w_i2 = 0.94+0.35 = 1.29, w_i3 = 0.7+0.5 = 1.2, w_i4 = 0.23+0.63 = 0.86. For the subset (i2, i3) of (i2, i3, i4), the subset item weight is w_sub(i2, i3) = w_i2 + w_i3 = 1.29+1.2 = 2.49, while the itemset weight of (i2, i3) on its own is w(i2, i3) = (0.83+0.81) + (0.94+0.7) + (0.35+0.5) = 4.13.
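The itemset-weight computation of Definition 2 can be sketched as follows (a minimal illustration over the Fig. 1 data; function and variable names are our own):

```python
# Fig. 1 data: record -> {feature term: weight}; 0 = term absent.
weights = {
    "r1": {"i1": 0,    "i2": 0.83, "i3": 0.81, "i4": 0,    "i5": 0.01},
    "r2": {"i1": 0,    "i2": 0.94, "i3": 0.7,  "i4": 0.23, "i5": 0},
    "r3": {"i1": 0,    "i2": 0.35, "i3": 0.5,  "i4": 0.63, "i5": 0},
    "r4": {"i1": 0.95, "i2": 0,    "i3": 0.85, "i4": 0,    "i5": 0},
    "r5": {"i1": 0.73, "i2": 0.02, "i3": 0,    "i4": 0.06, "i5": 0.9},
}

def itemset_weight(items, db):
    """w(I) of Definition 2: accumulate the weights of I's items over the
    records in which every item of I appears; also return the per-item
    weights of I over those same records."""
    recs = [r for r, row in db.items() if all(row[i] > 0 for i in items)]
    w = round(sum(db[r][i] for r in recs for i in items), 2)
    per_item = {i: round(sum(db[r][i] for r in recs), 2) for i in items}
    return w, per_item

w, item_w = itemset_weight(("i2", "i3", "i4"), weights)
print(w, item_w)  # 3.35 {'i2': 1.29, 'i3': 1.2, 'i4': 0.86}
```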
Definition 3 (all-weighted partial-order itemset): For an all-weighted itemset I = (i1, i2, …, ip) (1 ≤ p ≤ m) with item weights w1, w2, …, wp, sort the items by item weight. If w1 ≤ w2 ≤ … ≤ wp, the corresponding item arrangement is written i1 ≤ i2 ≤ … ≤ ip, and the itemset (i1, i2, …, ip) so arranged is called an all-weighted partial-order itemset (Partial Order Itemset, POI), where i1, the item with the smallest weight, is called the lowest-weight item, and ip, the item with the largest weight, is called the highest-weight item.
Example: In the Text example of Fig. 1, the item weights of the 3-itemset (i2, i3, i4) are 1.29, 1.2 and 0.86 respectively, so its all-weighted partial-order itemset is (i4, i3, i2), with i4 the lowest-weight item and i2 the highest-weight item.
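Definition 3 amounts to sorting the items of an itemset by ascending item weight; a one-line sketch (names are our own):

```python
def partial_order(items, item_weights):
    """Arrange an itemset by ascending item weight (Definition 3)."""
    return tuple(sorted(items, key=lambda i: item_weights[i]))

# Item weights of the 3-itemset (i2, i3, i4) from the example above.
poi = partial_order(("i2", "i3", "i4"), {"i2": 1.29, "i3": 1.2, "i4": 0.86})
print(poi)  # ('i4', 'i3', 'i2'): i4 is the lowest-weight item, i2 the highest
```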
Definition 4 (support of an all-weighted partial-order itemset): Taking item weight as the measure and the total weight of the all-weighted transaction database as the sample space, by analogy with the geometric probability model of probability theory, a new support of an all-weighted partial-order itemset I = (i1, i2, …, ip) (1 ≤ p ≤ m) (all-weighted partial order itemset support, poisup), written poisup(I), is given by formula (7):
poisup(I) = w(I) / (p × W)    (7)
Here w(I) is the itemset weight of the all-weighted partial-order itemset I, W is the sum of all item weights over the all-weighted transaction record set TR, and 1/p is called the support normalisation coefficient of all-weighted partial-order itemsets. The normalisation coefficient is introduced because, in all-weighted data mining, the weight of a partial-order itemset grows with the length of the itemset, which inflates itemset support and rule confidence; the coefficient 1/p keeps support and confidence values within a reasonable range without affecting the mining of all-weighted association patterns.
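Formula (7) and its 1/p normalisation can be checked directly; a sketch under our own naming, with W = 8.51 the total weight of the Fig. 1 data:

```python
def poisup(w_itemset, p, w_total):
    """Formula (7): poisup(I) = w(I) / (p * W), where 1/p is the
    support normalisation coefficient for a p-itemset."""
    return w_itemset / (p * w_total)

W = 8.51
print(round(poisup(3.35, 3, W), 3))  # 0.131 for the 3-itemset (i4, i3, i2)
print(round(poisup(2.14, 1, W), 2))  # 0.25 for the 1-itemset (i2)
```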
Definition 5 (all-weighted frequent partial-order itemset): Let ms be the minimum support threshold. An all-weighted partial-order itemset I is called an all-weighted frequent partial-order itemset if poisup(I) ≥ ms, i.e. w(I) ≥ W × p × ms.
In particular, when I is a 1-itemset, p = 1, which yields the minimum weight threshold of 1-itemsets, minw = W × ms; clearly, a 1-itemset whose weight is not less than minw is frequent.
Definition 6 (partial-order itemset weight bound): The partial-order itemset weight bound (Partial Order Itemset Weight Bound, POIWB) is the predicted critical weight of the k-itemsets containing an all-weighted (k−1)-partial-order itemset I(k−1), written POIWB(I(k−1), k). The weight bound has important theoretical significance: from the weight of an all-weighted (k−1)-partial-order itemset one can predict whether the k-itemsets derived from it can be frequent.
Let the weight of an all-weighted (k−1)-partial-order itemset I(k−1) (k < m, I(k−1) ⊆ IS) be w(k−1). Among the items not belonging to I(k−1), denote the item with the largest weight by ir (ir ∈ IS, ir ∉ I(k−1), 1 ≤ r ≤ m) and its weight by wr, and let n(k−1) be the occurrence frequency of I(k−1) in the transaction record set TR. Then the maximum possible weight of a k-itemset containing I(k−1) is w(k−1) + n(k−1) × wr.
If a k-itemset containing I(k−1) is frequent, then by Definitions 4 and 5,
w(k−1) + n(k−1) × wr ≥ k × W × ms, i.e. w(k−1) ≥ k × W × ms − n(k−1) × wr    (8)
The right-hand side of formula (8) is called the partial-order k-itemset weight bound of the all-weighted (k−1)-partial-order itemset I(k−1), written POIWB(I(k−1), k), that is,
POIWB(I(k−1), k) = k × W × ms − n(k−1) × wr    (9)
Formula (9) shows that only when w(k−1) ≥ POIWB(I(k−1), k) may an all-weighted k-partial-order itemset containing I(k−1) be a frequent itemset.
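Formula (9) is simple enough to check numerically; the following sketch (names are our own) reproduces the POIWB(C1, 2) value of the 1-itemset (i1) in the worked example of Table 1:

```python
def poiwb(k, w_total, ms, n_prev, w_r):
    """Formula (9): POIWB(I(k-1), k) = k * W * ms - n(k-1) * w_r."""
    return k * w_total * ms - n_prev * w_r

# For the 1-itemset (i1): W = 8.51, ms = 0.1, n(1) = 2 occurrences,
# w_r = 0.94 (largest weight of any item outside the itemset).
print(round(poiwb(2, 8.51, 0.1, 2, 0.94), 3))  # -0.178
```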
Definition 7 (low-order proper subset and high-order proper subset): Let Z = (X, Y) be an all-weighted partial-order itemset, and let X and Y be two partial-order sub-itemsets of Z, where X = (i1, i2, …, ir) (1 ≤ r < m) and Y = (i(r+1), i(r+2), …, i(r+q)) (1 ≤ q < m, 2 ≤ r + q ≤ m), with corresponding item weights w1, w2, …, wr (w1 ≤ w2 ≤ … ≤ wr) and w(r+1), w(r+2), …, w(r+q) (w(r+1) ≤ w(r+2) ≤ … ≤ w(r+q)). If the weight of the highest-weight item of X is not greater than the weight of the lowest-weight item of Y, i.e. wr ≤ w(r+1), then the sub-itemset X is called a low-order proper subset of the partial-order itemset Z, and the sub-itemset Y a high-order proper subset of Z.
The pruning method of the present invention for all-weighted feature-term itemsets is as follows:
1. Before the feature-term candidate (i−1)-itemsets C(i−1) generate the feature-term candidate i-itemsets Ci (i ≥ 2), compute the feature-term itemset weight bound POIWB(C(i−1), i) of C(i−1). If the itemset weight of an all-weighted feature-term candidate (i−1)-itemset C(i−1) satisfies w(i−1) < POIWB(C(i−1), i), then every feature-term i-itemset Ci derived from this C(i−1) must be non-frequent, so this feature-term (i−1)-itemset is pruned from the C(i−1) set.
2. After the feature-term candidate itemsets Ci are generated, for every (i−1)-sub-itemset of a candidate Ci, compute the feature-term itemset weight bound of each candidate subset; if there exists an (i−1)-subset whose itemset weight is less than its corresponding feature-term itemset weight bound (w(i−1) < POIWB(C(i−1), i)), then this feature-term candidate i-itemset Ci must be non-frequent and is pruned from the Ci set.
3. For a feature-term candidate Ci viewed as a partial-order itemset, if one of its high-order proper subsets is non-frequent, then this candidate Ci is a non-frequent partial-order itemset and is pruned from the Ci set.
4. For a feature-term candidate Ci viewed as a partial-order itemset, if the weight of its highest-weight item is less than the 1-itemset minimum weight threshold minw, then this candidate must be non-frequent and is pruned from the Ci set.
5. If the feature-term itemset frequency of a feature-term (i−1)-itemset C(i−1) is 0, i.e. n(i−1) = 0, then every feature-term i-itemset derived from this (i−1)-itemset must be non-frequent, so this feature-term (i−1)-itemset is pruned from the C(i−1) set.
6. For a candidate Ci viewed as a partial-order itemset, if the weight of its lowest-weight item is not less than the 1-itemset minimum weight threshold minw, then this candidate Ci is frequent, and Ci is added to the frequent itemset set.
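Pruning rules 4 and 6 depend only on the extreme item weights of a partial-order itemset. A sketch over the partial-order 2-itemsets of the worked example below, with weight pairs in ascending order as in Table 2 (the data layout is our own illustration):

```python
# Partial-order 2-itemsets -> (lowest-weight, highest-weight) item weights.
minw = 0.851
po_c2 = {
    ("i2", "i1"): (0.02, 0.73), ("i3", "i1"): (0.85, 0.95),
    ("i4", "i1"): (0.06, 0.73), ("i1", "i5"): (0.73, 0.9),
    ("i3", "i2"): (2.01, 2.12), ("i4", "i2"): (0.92, 1.31),
    ("i2", "i5"): (0.85, 0.91), ("i4", "i3"): (0.86, 1.2),
    ("i5", "i3"): (0.01, 0.81), ("i4", "i5"): (0.06, 0.9),
}
# Rule 4: highest-weight item below minw -> certainly non-frequent.
pruned = {s for s, w in po_c2.items() if w[-1] < minw}
# Rule 6: lowest-weight item at least minw -> certainly frequent.
frequent = {s for s, w in po_c2.items() if w[0] >= minw}
print(sorted(pruned))    # (i2,i1), (i4,i1), (i5,i3)
print(sorted(frequent))  # (i3,i2), (i4,i2), (i4,i3)
```

The remaining itemsets fall between the two rules and must be decided by the support computation of formula (7).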
The technical solution of the present invention is further described below through specific embodiments.
The mining method and system adopted by the present invention in the specific embodiments are shown in Fig. 1 to Fig. 6.
Taking the data of Fig. 1 as an example, the process by which the present invention mines all-weighted feature-term association rules is as follows (ms = 0.1, mc = 0.6):
1. Obtain the sum W of all item weights in the database: W = 8.51, hence minw = W × ms = 0.851.
2. Mine the all-weighted frequent feature-term 1-itemsets L1, as shown in Table 1.
Table 1:
C1    w(C1)  poisup(C1)  n_C1  w_r(C1)  POIWB(C1, 2)
(i1)  1.68   0.197       2     0.94     2×8.51×0.1 - 2×0.94 = -0.178
(i2)  2.14   0.25        4     0.95     2×8.51×0.1 - 4×0.95 = -2.098
(i3)  2.86   0.33        4     0.95     2×8.51×0.1 - 4×0.95 = -2.098
(i4)  0.92   0.108       3     0.95     2×8.51×0.1 - 3×0.95 = -1.148
(i5)  0.91   0.107       2     0.95     2×8.51×0.1 - 2×0.95 = -0.198
As shown in Table 1, L1 = {(i1), (i2), (i3), (i4), (i5)}, and the frequent feature-term itemset set FIS = {(i1), (i2), (i3), (i4), (i5)}.
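Steps 1-2 of the walk-through can be sketched as follows (variable names are our own); every 1-itemset weight exceeds minw = 0.851, so all five 1-itemsets are frequent:

```python
# Fig. 1 data: record -> {feature term: weight}; 0 = term absent.
weights = {
    "r1": {"i1": 0,    "i2": 0.83, "i3": 0.81, "i4": 0,    "i5": 0.01},
    "r2": {"i1": 0,    "i2": 0.94, "i3": 0.7,  "i4": 0.23, "i5": 0},
    "r3": {"i1": 0,    "i2": 0.35, "i3": 0.5,  "i4": 0.63, "i5": 0},
    "r4": {"i1": 0.95, "i2": 0,    "i3": 0.85, "i4": 0,    "i5": 0},
    "r5": {"i1": 0.73, "i2": 0.02, "i3": 0,    "i4": 0.06, "i5": 0.9},
}
ms = 0.1
W = round(sum(w for rec in weights.values() for w in rec.values()), 2)  # 8.51
minw = W * ms  # 0.851

items = ["i1", "i2", "i3", "i4", "i5"]
w1 = {i: round(sum(rec[i] for rec in weights.values()), 2) for i in items}
L1 = [i for i in items if w1[i] >= minw]
print(w1)  # {'i1': 1.68, 'i2': 2.14, 'i3': 2.86, 'i4': 0.92, 'i5': 0.91}
print(L1 == items)  # True: all five 1-itemsets are frequent
```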
3. Mine the all-weighted frequent feature-term k-itemsets Lk, k ≥ 2.
k = 2:
(1) (Pruning 1) For the candidate 1-itemsets C1, no case of w(C1) < POIWB(C1, 2) occurs, so the candidate C1 set is unchanged.
(2) Apply the Apriori join to the feature-term candidate 1-itemsets C1 whose itemset frequency is not 0, generating the feature-term candidate 2-itemsets C2, and compute w1(C2), w2(C2), poC2, w(poC2), n_C2, w_r(C2) and POIWB(C2, 3), as shown in Table 2.
Table 2:
C2        w1(C2)  w2(C2)  poC2      w(poC2)       n_C2  w_r(C2)  POIWB(C2, 3)
(i1, i2)  0.73    0.02    (i2, i1)  (0.02, 0.73)  1     0.9      3×8.51×0.1 - 1×0.9 = 1.653
(i1, i3)  0.95    0.85    (i3, i1)  (0.85, 0.95)  1     0.94     3×8.51×0.1 - 1×0.94 = 1.613
(i1, i4)  0.73    0.06    (i4, i1)  (0.06, 0.73)  1     0.94     3×8.51×0.1 - 1×0.94 = 1.613
(i1, i5)  0.73    0.9     (i1, i5)  (0.73, 0.9)   1     0.94     3×8.51×0.1 - 1×0.94 = 1.613
(i2, i3)  2.12    2.01    (i3, i2)  (2.01, 2.12)  3     0.95     3×8.51×0.1 - 3×0.95 = -0.297
(i2, i4)  1.31    0.92    (i4, i2)  (0.92, 1.31)  3     0.95     3×8.51×0.1 - 3×0.95 = -0.297
(i2, i5)  0.85    0.91    (i2, i5)  (0.85, 0.91)  2     0.95     3×8.51×0.1 - 2×0.95 = 0.653
(i3, i4)  1.2     0.86    (i4, i3)  (0.86, 1.2)   2     0.95     3×8.51×0.1 - 2×0.95 = 0.653
(i3, i5)  0.81    0.01    (i5, i3)  (0.01, 0.81)  1     0.95     3×8.51×0.1 - 1×0.95 = 1.603
(i4, i5)  0.06    0.9     (i4, i5)  (0.06, 0.9)   1     0.95     3×8.51×0.1 - 1×0.95 = 1.603
For Table 2, proceed as follows:
* Examine the high-order proper subsets of the partial-order itemsets poC2: (i1), (i2), (i3), (i5). These proper subsets are all frequent and no non-frequent proper subset itemset exists, so the poC2 set is unchanged.
* Examine the weight of the highest-weight item of each poC2. The itemsets whose highest-weight item has weight < minw = 0.851 are (i1, i2), (i1, i4), (i3, i5); they are non-frequent and are deleted from the poC2 set.
* Examine the lowest-weight item of each poC2. The itemsets whose lowest-weight item has weight ≥ minw are (i2, i3), (i2, i4), (i3, i4); they are frequent, and these itemsets are added to the frequent feature-term itemset set FIS, i.e. FIS = {(i1), (i2), (i3), (i4), (i5), (i2, i3), (i2, i4), (i3, i4)}.
* For the remaining partial-order itemsets poC2, namely (i3, i1), (i1, i5), (i2, i5), (i4, i5), compute their support: poisup(i3, i1) = (0.85+0.95)/(8.51×2) = 0.106 > ms, poisup(i1, i5) = 0.096 < ms, poisup(i2, i5) = 0.103 > ms, poisup(i4, i5) = 0.056 < ms. Therefore (i3, i1) and (i2, i5) are frequent partial-order itemsets and are added to the frequent feature-term itemset set FIS, i.e. FIS = {(i1), (i2), (i3), (i4), (i5), (i2, i3), (i2, i4), (i3, i4), (i3, i1), (i2, i5)}.
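The support computation for the remaining partial-order 2-itemsets can be sketched as follows (function and data names are our own):

```python
# Fig. 1 data: record -> {feature term: weight}; 0 = term absent.
weights = {
    "r1": {"i1": 0,    "i2": 0.83, "i3": 0.81, "i4": 0,    "i5": 0.01},
    "r2": {"i1": 0,    "i2": 0.94, "i3": 0.7,  "i4": 0.23, "i5": 0},
    "r3": {"i1": 0,    "i2": 0.35, "i3": 0.5,  "i4": 0.63, "i5": 0},
    "r4": {"i1": 0.95, "i2": 0,    "i3": 0.85, "i4": 0,    "i5": 0},
    "r5": {"i1": 0.73, "i2": 0.02, "i3": 0,    "i4": 0.06, "i5": 0.9},
}
ms, W = 0.1, 8.51

def poisup(items, db, w_total):
    """Formula (7) evaluated from the raw data: accumulate the weights of
    the itemset's members over their co-occurrence records, then divide
    by p * W."""
    recs = [r for r, row in db.items() if all(row[i] > 0 for i in items)]
    w = sum(db[r][i] for r in recs for i in items)
    return w / (len(items) * w_total)

checks = {p: round(poisup(p, weights, W), 3)
          for p in [("i3", "i1"), ("i1", "i5"), ("i2", "i5"), ("i4", "i5")]}
print(checks)
# {('i3','i1'): 0.106, ('i1','i5'): 0.096, ('i2','i5'): 0.103, ('i4','i5'): 0.056}
# Only (i3, i1) and (i2, i5) reach ms = 0.1.
```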
k = 3:
* As seen from Table 2, for the candidate 2-itemsets C2 with w(C2) = w1(C2) + w2(C2), the partial-order itemsets with w(C2) < POIWB(C2, 3) are (i2, i1), (i4, i1), (i5, i3) and (i4, i5). These partial-order itemsets cannot yield frequent 3-itemsets and are pruned from the C2 set, giving the new candidate set C2 = {(i1, i3), (i1, i5), (i2, i3), (i2, i4), (i2, i5), (i3, i4)}.
* Apply the Apriori join to the candidate 2-itemsets C2 whose itemset frequency is not 0, generating the feature-term candidate 3-itemsets C3 = {(i1, i3, i5), (i2, i3, i4), (i2, i3, i5), (i2, i4, i5)}.
* For the candidate 3-itemsets C3, examine every (3−1)-sub-itemset, i.e. the 2-sub-itemsets of C3:
For (i1, i3, i5) and (i2, i3, i5): the sub-itemset (i5, i3) satisfies w(i5, i3) < POIWB((i5, i3), 3); for (i2, i4, i5): the sub-itemset (i4, i5) satisfies w(i4, i5) < POIWB((i4, i5), 3). Therefore the feature-term candidate 3-itemsets (i1, i3, i5), (i2, i3, i5) and (i2, i4, i5) are non-frequent and are deleted from C3; the new C3 = {(i2, i3, i4)}.
* Compute w1(C3), w2(C3), w3(C3), poC3, w(poC3), n_C3, w_r(C3) and POIWB(C3, 4), as shown in Table 3.
Table 3:
C3            w1(C3)  w2(C3)  w3(C3)  poC3          w(poC3)            n_C3  w_r(C3)  POIWB(C3, 4)
(i2, i3, i4)  1.29    1.2     0.86    (i4, i3, i2)  (0.86, 1.2, 1.29)  2     0.95     4×8.51×0.1 - 2×0.95 = 1.504
For Table 3, proceed as follows:
* Examine the high-order proper subsets of the partial-order itemset poC3: (i2), (i3, i2). These proper subsets are all frequent and no non-frequent proper subset itemset exists, so the poC3 set is unchanged.
* Examine the weight of the highest-weight item of poC3; the highest-weight item weights of poC3 are all greater than minw, so the poC3 set is unchanged.
* Examine the lowest-weight item of poC3. The itemset whose lowest-weight item has weight ≥ minw is (i4, i3, i2); this itemset is frequent and is added to the frequent feature-term itemset set FIS, i.e. FIS = {(i1), (i2), (i3), (i4), (i5), (i2, i3), (i2, i4), (i3, i4), (i3, i1), (i2, i5), (i4, i3, i2)}.
* Apply the Apriori join to the candidate 3-itemsets C3 whose itemset frequency is not 0, generating the feature-term candidate 4-itemsets C4; C4 = ∅. Since C4 is empty, the mining of step 3 ends, and the process proceeds to step 4 below.
4. Mine effective all-weighted feature-term association rule patterns from the frequent feature-term itemset set FIS.
Taking the frequent feature-term itemset (i4, i3, i2) in FIS as an example, the mining process of effective all-weighted feature-term association rule patterns is as follows:
The proper subset set of the frequent itemset (i4, i3, i2) is {(i4), (i3), (i2), (i4, i3), (i4, i2), (i3, i2)}.
(1) For (i4) and (i3, i2): I1 = (i4), I2 = (i3, i2), I1 ∪ I2 = (i4, i3, i2), hence k1 = 1, k2 = 2, k12 = 3.
From Table 1, w1 = 0.92; from Table 2, w2 = 2.01 + 2.12 = 4.13.
From Table 3, w12 = 0.86 + 1.2 + 1.29 = 3.35.
(k12/k1) × w1 × mc = (3/1) × 0.92 × 0.6 = 1.656; since w12 = 3.35 ≥ (k12/k1) × w1 × mc = 1.656, the feature-term association rule I1 → I2, i.e. (i4) → (i3, i2), is mined.
(k12/k2) × w2 × mc = (3/2) × 4.13 × 0.6 = 3.717; since w12 = 3.35 < (k12/k2) × w2 × mc = 3.717, no rule is mined.
(2) For (i3) and (i4, i2): I1 = (i3), I2 = (i4, i2), I1 ∪ I2 = (i4, i3, i2), hence k1 = 1, k2 = 2, k12 = 3.
From Table 1, w1 = 2.86; from Table 2, w2 = 0.92 + 1.31 = 2.23.
From Table 3, w12 = 0.86 + 1.2 + 1.29 = 3.35.
(k12/k1) × w1 × mc = (3/1) × 2.86 × 0.6 = 5.148; since w12 = 3.35 < (k12/k1) × w1 × mc = 5.148, no rule is mined.
(k12/k2) × w2 × mc = (3/2) × 2.23 × 0.6 = 2.007; since w12 = 3.35 ≥ (k12/k2) × w2 × mc = 2.007, the feature-term association rule I2 → I1, i.e. (i4, i2) → (i3), is mined.
(3) For (i2) and (i4, i3):
I1 = (i2), I2 = (i4, i3), I1 ∪ I2 = (i4, i3, i2), hence k1 = 1, k2 = 2, k12 = 3.
From Table 1, w1 = 2.14; from Table 2, w2 = 0.86 + 1.2 = 2.06.
From Table 3, w12 = 0.86 + 1.2 + 1.29 = 3.35.
(k12/k1) × w1 × mc = (3/1) × 2.14 × 0.6 = 3.852; since w12 = 3.35 < (k12/k1) × w1 × mc = 3.852, no rule is mined.
(k12/k2) × w2 × mc = (3/2) × 2.06 × 0.6 = 1.854; since w12 = 3.35 ≥ (k12/k2) × w2 × mc = 1.854, the feature-term association rule I2 → I1, i.e. (i4, i3) → (i2), is mined.
In summary, for the frequent feature-term itemset (i4, i3, i2), the effective all-weighted feature-term association rule patterns that can be mined (ms = 0.1, mc = 0.6) are: (i4) → (i3, i2), (i4, i2) → (i3) and (i4, i3) → (i2).
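The rule-extraction test of step 4 — mine I1 → I2 when w12 ≥ (k12/k1) × w1 × mc — can be sketched for the frequent itemset (i4, i3, i2), with itemset weights taken from Tables 1-3 (the names here are our own):

```python
mc = 0.6
W12 = 3.35  # itemset weight of (i4, i3, i2), from Table 3
w = {("i4",): 0.92, ("i3",): 2.86, ("i2",): 2.14,
     ("i3", "i2"): 4.13, ("i4", "i2"): 2.23, ("i4", "i3"): 2.06}

def mined(ante, cons):
    """I1 -> I2 is mined iff w12 >= (k12 / k1) * w1 * mc; here the union
    of antecedent and consequent is always (i4, i3, i2)."""
    k1, k12 = len(ante), len(ante) + len(cons)
    return W12 >= (k12 / k1) * w[ante] * mc

pairs = [(("i4",), ("i3", "i2")), (("i3", "i2"), ("i4",)),
         (("i3",), ("i4", "i2")), (("i4", "i2"), ("i3",)),
         (("i2",), ("i4", "i3")), (("i4", "i3"), ("i2",))]
rules = [(a, c) for a, c in pairs if mined(a, c)]
print(rules)  # (i4)->(i3,i2), (i4,i2)->(i3), (i4,i3)->(i2)
```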
The beneficial effects of the present invention are further illustrated below through experiments.
To verify the validity and correctness of the present invention, the classical unweighted association rule mining method Apriori (R. Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in large database[C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., 1993: 207-216.) and the query-expansion-oriented matrix-weighted association rule mining method MWARM (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining[J]. Journal of Software, 2009, 20(7): 1854-1865; in the experiments the number of expansion terms is set to 0) were chosen as comparison methods. Experimental source programs were written, and the mining performance of the present invention and the comparison methods was tested, compared and analysed under variation of the support threshold and variation of the confidence threshold. Besides ms and mc, the experimental parameters are: IN, the number of items mined, and N, the total number of document records. The experiments mine up to 4-itemsets.
The experimental data were extracted from the Korea_Times2001 English document corpus of the NTCIR-5 CLIR test collection of the Japanese national scientific information system's information retrieval test sets, and from part of the Chinese corpus of the Chinese Web test collection CWT200g provided by the network laboratory of Peking University: 4936 English documents (serial number range KT2001_00000--KT2001_05066) were extracted from Korea_Times2001, and 12024 Chinese text documents were extracted from the CWT200g corpus, as the experimental document test sets. After document preprocessing such as word segmentation (Chinese documents), stemming (English documents), stop-word removal, and extraction of feature terms and their weights, a text database and a feature-term item library based on the vector space model were built. After preprocessing, English feature terms whose document frequency df (the number of document records containing the term) lies in the range 1028 to 2593 (50 in total) and Chinese feature terms whose df lies in the range [1500, 5838] were extracted into the feature-term dictionary (the number of Chinese feature terms being 400).
Experiment 1: comparison of mining performance under variation of the support threshold.
The numbers of candidate itemsets (Candidate Itemset, CI), frequent itemsets (Frequent Itemset, FI) and association rules (Association Rule, AR) mined by the present invention and the 2 comparison methods (the Apriori and MWARM methods) in the 2 Chinese and English document test sets as the support threshold varies are shown in Tables 1 to 4.
Experiment 2: comparison of mining performance under variation of the confidence threshold.
The numbers of association rules mined by the present invention and the 2 comparison methods in the 2 Chinese and English document test sets as the confidence threshold varies are shown in Tables 5 and 6.
Experiment 3: comparison of mining time efficiency.
The times (in seconds) taken by the present invention and the comparison methods to mine candidate itemsets, frequent itemsets and association rules as the support threshold varies are shown in Tables 7 and 8; the times (in seconds) taken by the 3 algorithms to mine association rules as the confidence threshold varies are shown in Tables 9 and 10.
Experiment 4: case analysis of experimental results.
In the Chinese text test set CWT200g, 28 feature-term items were selected as the item set to be mined, as shown in Table 11. The present invention and the 2 comparison methods mine the Chinese test set (mining up to 4-itemsets) under the conditions mc = 0.1 and ms = 0.1; the association rule examples extracted from the results with the feature term "participation" as antecedent are analysed, and the results are shown in Table 12.
Table 11: feature-term examples in CWT200g
Table 12: association rule examples with "participation" as antecedent mined by the three methods
Table 12 shows that, among the association rule examples with "participation" as antecedent, the number of association rules mined by the present invention is smaller than that mined by the 2 comparison methods, and its association rule patterns are closer to reality, avoiding the generation of invalid and spurious association patterns. For example, "participation" and "taking part" are near-synonyms that should rarely occur together in a single sentence or passage, so the association rule "participation → taking part" should not be a strong association rule. The mining results of the algorithm of this paper, MAWAR-POI, contain no invalid and spurious patterns of this kind, whereas the comparison algorithms not only mine more association rule patterns, but also mine the strong association rule "participation → taking part", which is a spurious, uninformative and invalid pattern.
The above experimental results show that, compared with the comparison methods, the present invention has good mining performance, specifically as follows:
(1) Whether the support threshold or the confidence threshold varies, the numbers of candidate itemsets, frequent itemsets and association rules mined by the present invention are far smaller than those of the existing unweighted and all-weighted mining algorithms. For example, the number of candidate itemsets mined by the invention on the English NTCIR-5 dataset is 90.60% smaller than with the Apriori method and 90.49% smaller than with the MWARM method (Table 1), and on the Chinese dataset CWT200g it is 94.37% smaller than with the Apriori method and 87.29% smaller than with the MWARM method (Table 2), showing that the present invention can avoid and reduce the occurrence of many invalid association patterns.
(2) The mining time of the present invention is much shorter than that of the comparison algorithms, with a large reduction. For example, the average time for mining itemsets and association rules on the English NTCIR-5 dataset is 87.58% less than with the Apriori method and 83.56% less than with the MWARM method (Table 7), and on the Chinese dataset CWT200g it is 85.98% less than with the Apriori method and 67.60% less than with the MWARM method (Table 8), showing that the mining efficiency of the present invention is greatly improved.
(3) The experimental results of Table 12 show that the feature-term association rule patterns mined by the present invention are closer to reality.

Claims (6)

1. A method for mining association rules between Chinese and English text words based on partial-order itemsets, characterized by comprising the following steps:
(1) Chinese and English text information data preprocessing: preprocess the pending Chinese and English text information data, namely Chinese text word segmentation, English text stemming, stop-word removal, and feature term extraction and weight computation, and build a text information database and a feature-term item library based on the vector space model;
(2) mine the all-weighted frequent feature-term partial-order itemsets, comprising the following steps 2.1 and 2.2:
(2.1) mine the all-weighted frequent feature-term 1-itemsets L1; the concrete steps proceed according to 2.1.1 to 2.1.3:
(2.1.1) extract the feature-term candidate 1-itemsets C1 from the feature-term item library, accumulate the weights of all items in the text information database to obtain the total item weight W, accumulate the weight of C1 over the text information database, and compute the support poisup(C1) of C1;
(2.1.2) add to the frequent feature-term itemset set FIS those frequent 1-itemsets L1 among the feature-term candidate 1-itemsets C1 whose support satisfies poisup(C1) ≥ ms, ms being the minimum support threshold;
(2.1.3) accumulate the occurrence frequency n_C1 of the candidate 1-itemsets C1 in the text information database, extract w_r(C1), and compute the partial-order itemset weight bound POIWB(C1, 2) of C1;
(2.2) mine the all-weighted frequent feature-term k-itemsets Lk, k ≥ 2, operating according to steps 2.2.1 to 2.2.12:
(2.2.1) for the candidate (k−1)-itemsets C(k−1), prune those with w(C(k−1)) < POIWB(C(k−1), k), which cannot yield frequent k-itemsets, obtaining a new candidate C(k−1) set;
wherein w(C(k−1)) is the accumulated weight of C(k−1) in the text information database, and POIWB(C(k−1), k) is the k-itemset weight bound of the all-weighted candidate (k−1)-itemset C(k−1);
(2.2.2) apply the Apriori join to the feature-term candidate (k−1)-itemsets C(k−1) whose itemset frequency is not 0, generating the feature-term candidate k-itemsets Ck;
(2.2.3) if Ck is empty, exit step 2.2 and proceed to step (3); otherwise, if Ck is not empty, proceed to step 2.2.4;
(2.2.4) for the candidate k-itemsets Ck, examine every (k−1)-sub-itemset of Ck; if there exists a (k−1)-subset whose itemset weight is less than its corresponding partial-order itemset weight bound (w(k−1) < POIWB(C(k−1), k)), this itemset Ck must be non-frequent and is deleted from its set, obtaining a new candidate partial-order itemset poCk set;
(2.2.5) accumulate the occurrence frequency n_Ck of the candidate k-itemsets Ck in the text information database and the item weights w1(Ck), w2(Ck), …, wk(Ck), extract w_r(Ck), and compute the weight bound POIWB(Ck, k+1) of Ck;
(2.2.6) delete the candidate k-itemsets Ck whose itemset frequency is 0, obtaining a new Ck set;
(2.2.7) obtain the partial-order itemset poCk of each Ck;
(2.2.8) examine the high-order proper subsets of each partial-order itemset poCk; if a high-order proper subset of poCk is non-frequent, the partial-order itemset poCk must be non-frequent and is deleted from its set, obtaining a new candidate partial-order itemset poCk set;
(2.2.9) examine the weight of the highest-weight item of each poCk; if the weight of the highest-weight item of poCk is less than the 1-itemset minimum weight threshold minw, the partial-order itemset poCk must be non-frequent and is deleted from its set, obtaining a new candidate partial-order itemset poCk set, the formula for minw being minw = W × ms;
(2.2.10) examine the lowest-weight item of each poCk; if the weight of the lowest-weight item of poCk is not less than minw, the partial-order itemset poCk must be frequent and is added to the frequent feature-term itemset set FIS;
(2.2.11) for each remaining partial-order itemset poCk, compute its support poisup(poCk); if poisup(poCk) ≥ ms, this partial-order itemset poCk is frequent and is added to the frequent feature-term itemset set FIS;
(2.2.12) add 1 to the value of k and repeat steps 2.2.1 to 2.2.12 until Ck is empty, then exit step 2.2 and proceed to step (3) below;
(3) mine effective all-weighted feature-term strong association rule patterns from the frequent feature-term itemset set FIS, comprising the following steps:
(3.1) take a frequent feature-term itemset Li from the frequent feature-term itemset set FIS and find all proper subsets of Li;
(3.2) take any two proper subsets I1 and I2 from the proper subset set of Li such that I1 ∩ I2 = ∅ and I1 ∪ I2 = Li; if w12 ≥ (k12/k1) × w1 × mc, mine the feature-term strong association rule I1 → I2; if w12 ≥ (k12/k2) × w2 × mc, mine the feature-term strong association rule I2 → I1; where k1, k2 and k12 are the numbers of items of the itemsets I1, I2 and (I1, I2) respectively, w1, w2 and w12 are the itemset weights of I1, I2 and (I1, I2) respectively, and mc is the minimum confidence threshold;
(3.3) repeat step 3.2 until every proper subset in the proper subset set of the frequent feature-term itemset Li has been taken out once, each being taken out only once; then proceed to step 3.4;
(3.4) repeat step 3.1 until every frequent itemset Li in the frequent feature-term itemset set has been taken out once, each being taken out only once; then step (3) ends.
At this point, the mining of all-weighted feature-term association rule patterns ends.
2. A mining system for association rules between Chinese and English text words based on partial-order itemsets, adapted to the method of claim 1, characterized by comprising the following 4 modules:
a text information preprocessing module, for preprocessing the pending Chinese and English text data, namely Chinese text word segmentation, English text stemming, stop-word removal, and feature term extraction and weight computation, and building a text information database and a feature-term item library based on the vector space model;
a frequent feature-term partial-order itemset generation module, for mining candidate all-weighted feature-term partial-order itemsets from the text information database, pruning the candidate partial-order itemsets with the new pruning method to obtain the final candidate partial-order itemsets, and deriving the all-weighted frequent feature-term partial-order itemset patterns from the candidate partial-order itemsets by the new partial-order itemset support computation method;
an all-weighted feature-term association rule generation module, which, through simple computation and comparison of itemset weights and dimensions, mines effective all-weighted feature-term strong association rule patterns I1 → I2 from the all-weighted frequent feature-term partial-order itemsets (I1, I2);
an association rule pattern result display module, for presenting the effective all-weighted feature-term strong association rule patterns to the user in the form the user prefers, for the user to analyse, select and use.
3. The mining system according to claim 2, characterized in that the text information preprocessing module comprises the following 2 modules:
Chinese and English text preprocessing module: performs word segmentation on Chinese text and removes Chinese stop words, and performs stemming on English text and removes English stop words, among other Chinese and English corpus preprocessing tasks;
Text database and item library building module: mainly performs Chinese and English feature-word extraction and weight calculation, and builds the text information database and the Chinese and English feature-word item library based on the vector space model.
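The preprocessing and weighting pipeline of the two modules above can be sketched roughly as follows. The stop-word list, the tf-idf weighting formula, and all names here are illustrative assumptions (the claim does not fix a particular weighting scheme), and Chinese word segmentation and English stemming are stubbed out.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}   # placeholder stop-word list

def preprocess(text):
    """English branch of the text preprocessing module: lowercase,
    tokenize, drop stop words. (A full system would also segment
    Chinese text and stem English tokens here.)"""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_weighted_db(docs):
    """Text database and item library building module: one weight
    vector per document over the feature-word item library, using a
    common tf-idf weighting as a stand-in for the vector space model
    weights."""
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    database = []
    for doc in tokenized:
        tf = Counter(doc)
        database.append({t: tf[t] * math.log((n + 1) / (df[t] + 1))
                         for t in tf})
    return database, sorted(df)   # weighted text database, item library
```

A term occurring in every document (such as a near-stop word) gets weight zero under this smoothing, which is the intended behavior for feature-word weighting.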
4. The mining system according to claim 2, characterized in that the feature-word frequent partial-order itemset generation module comprises the following 3 modules:
Feature-word candidate partial-order itemset generation module: mainly mines feature-word candidate partial-order itemsets from the text information database, as follows: extract the candidate 1-itemsets from the feature-word item library, accumulate the weight sums of the candidate 1-itemsets over the text information database, calculate their supports, and obtain the completely weighted feature-word frequent 1-itemsets; then generate the feature-word candidate k-itemsets from the completely weighted frequent (k-1)-itemsets by the Apriori join, where k ≥ 2; accumulate the item weights of each item of the feature-word candidate k-itemsets over the text information database to obtain the completely weighted feature-word candidate partial-order k-itemsets;
Feature-word candidate partial-order itemset pruning module: prunes the completely weighted feature-word candidate partial-order k-itemsets with the pruning method of the present invention, deleting the candidate partial-order k-itemsets that cannot be frequent, to obtain the final set of candidate partial-order k-itemsets that may be frequent;
Feature-word frequent partial-order itemset generation module: mainly mines the final candidate partial-order k-itemsets obtained after pruning by the above module: it calculates the supports of the candidate partial-order k-itemsets with the support calculation method of the present invention, compares them with the minimum support threshold, and obtains the completely weighted feature-word frequent partial-order k-itemsets.
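The three modules above can be sketched together as follows. The patent's own weight-based pruning rule and support formula are not reproduced here; this sketch uses the classic Apriori join with subset pruning and one plausible reading of a "completely weighted" support (accumulated weight of the itemset's items over documents containing all of them, normalized by the total weight), with all names being assumptions.

```python
from itertools import combinations

def apriori_join(prev_frequent, k):
    """Apriori join: candidate k-itemsets whose (k-1)-subsets are all
    frequent (the classic form of the pruning step above)."""
    candidates = set()
    prev = set(prev_frequent)
    for a in prev:
        for b in prev:
            u = a | b
            if len(u) == k and all(frozenset(s) in prev
                                   for s in combinations(u, k - 1)):
                candidates.add(u)
    return candidates

def weighted_support(itemset, db, total_weight):
    """Accumulated weight of the itemset's items over the documents
    that contain all of them, normalized by the database's total
    weight. One plausible reading of the claim's support method, not
    the patented formula itself. `db` is a list of dicts mapping a
    feature word to its weight in that document."""
    w = sum(doc[t] for doc in db for t in itemset if itemset <= doc.keys())
    return w / total_weight if total_weight else 0.0

def frequent_k_itemsets(prev_frequent, k, db, ms):
    """Keep the candidates whose weighted support reaches the minimum
    support threshold ms."""
    total = sum(sum(doc.values()) for doc in db)
    return {c for c in apriori_join(prev_frequent, k)
            if weighted_support(c, db, total) >= ms}
```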
5. The mining system according to claim 2, characterized in that the completely weighted feature-word association rule generation module comprises the following 2 modules:
Proper-subset generation module of feature-word frequent partial-order itemsets: mainly generates all proper subsets of the feature-word frequent partial-order itemsets, and obtains the itemset weight and the dimension of each proper subset;
Completely weighted feature-word association rule generation module: through simple calculation and comparison of itemset weights, mines the valid completely weighted feature-word strong association rule patterns from the feature-word frequent partial-order itemsets.
6. The mining system according to any one of claims 2-5, characterized in that the minimum support threshold ms and the minimum confidence threshold mc in the mining system are input by the user.
CN201410427491.8A 2014-08-27 2014-08-27 Association rule mining method and its system between Sino-British text word based on partial order item collection Expired - Fee Related CN104182527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410427491.8A CN104182527B (en) 2014-08-27 2014-08-27 Association rule mining method and its system between Sino-British text word based on partial order item collection


Publications (2)

Publication Number Publication Date
CN104182527A true CN104182527A (en) 2014-12-03
CN104182527B CN104182527B (en) 2017-07-18

Family

ID=51963566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410427491.8A Expired - Fee Related CN104182527B (en) 2014-08-27 2014-08-27 Association rule mining method and its system between Sino-British text word based on partial order item collection

Country Status (1)

Country Link
CN (1) CN104182527B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147688A1 (en) * 2001-09-04 2008-06-19 Frank Beekmann Sampling approach for data mining of association rules
CN103279570A (en) * 2013-06-19 2013-09-04 广西教育学院 Text database oriented matrix weighting negative pattern mining method
CN103838854A (en) * 2014-03-14 2014-06-04 广西教育学院 Completely-weighted mode mining method for discovering association rules among texts
CN103955542A (en) * 2014-05-20 2014-07-30 广西教育学院 Method of item-all-weighted positive or negative association model mining between text terms and mining system applied to method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Huang Mingxuan et al.: "A completely weighted association rule mining algorithm based on double pruning", Information Studies: Theory & Application *
Huang Mingxuan et al.: "A completely weighted inter-word association rule mining algorithm based on text databases", Journal of Guangxi Normal University (Natural Science Edition) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715073B (en) * 2015-04-03 2017-11-24 江苏物联网研究发展中心 Based on the association rule mining system for improving Apriori algorithm
CN104715073A (en) * 2015-04-03 2015-06-17 江苏物联网研究发展中心 Association rule mining system based on improved Apriori algorithm
CN106383883A (en) * 2016-09-18 2017-02-08 广西财经学院 Matrix weighted association mode-based Indonesian and Chinese cross-language retrieval method and system
CN106484781A (en) * 2016-09-18 2017-03-08 广西财经学院 Indonesia's Chinese cross-language retrieval method of fusion association mode and user feedback and system
CN106484781B (en) * 2016-09-18 2019-03-15 广西财经学院 Merge the Indonesia's Chinese cross-language retrieval method and system of association mode and user feedback
CN106383883B (en) * 2016-09-18 2019-04-16 广西财经学院 Indonesia's Chinese cross-language retrieval method and system based on matrix weights association mode
CN107562904B (en) * 2017-09-08 2019-07-09 广西财经学院 Positive and negative association mode method for digging is weighted between fusion item weight and the English words of frequency
CN107562904A (en) * 2017-09-08 2018-01-09 广西财经学院 Positive and negative association mode method for digging is weighted between the English words of fusion item weights and frequency
CN108563735A (en) * 2018-04-10 2018-09-21 国网浙江省电力有限公司 One kind being based on the associated data sectioning search method of word
CN109684464A (en) * 2018-12-30 2019-04-26 广西财经学院 Compare across the language inquiry extended method of implementation rule consequent excavation by weight
CN109783628A (en) * 2019-01-16 2019-05-21 福州大学 The keyword search KSAARM algorithm of binding time window and association rule mining
CN109783628B (en) * 2019-01-16 2022-06-21 福州大学 Method for searching KSAARM by combining time window and association rule mining
CN110619073A (en) * 2019-08-30 2019-12-27 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN110619073B (en) * 2019-08-30 2022-04-22 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN112527953A (en) * 2020-11-20 2021-03-19 出门问问(武汉)信息科技有限公司 Rule matching method and device
CN112527953B (en) * 2020-11-20 2023-06-20 出门问问创新科技有限公司 Rule matching method and device
CN113254755A (en) * 2021-07-19 2021-08-13 南京烽火星空通信发展有限公司 Public opinion parallel association mining method based on distributed framework
CN113254755B (en) * 2021-07-19 2021-10-08 南京烽火星空通信发展有限公司 Public opinion parallel association mining method based on distributed framework

Also Published As

Publication number Publication date
CN104182527B (en) 2017-07-18

Similar Documents

Publication Publication Date Title
CN104182527A (en) Partial-sequence itemset based Chinese-English test word association rule mining method and system
CN103514183B (en) Information search method and system based on interactive document clustering
CN103955542B (en) Method of item-all-weighted positive or negative association model mining between text terms and mining system applied to method
CN104216874B (en) Positive and negative mode excavation method and system are weighted between the Chinese word based on coefficient correlation
Luo et al. A parallel dbscan algorithm based on spark
CN104317794A (en) Chinese feature word association pattern mining method based on dynamic project weight and system thereof
CN107832467A (en) A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN103440308B (en) A kind of digital thesis search method based on form concept analysis
Wenli Application research on latent semantic analysis for information retrieval
CN111897926A (en) Chinese query expansion method integrating deep learning and expansion word mining intersection
CN103678642A (en) Concept semantic similarity measurement method based on search engine
CN109739952A (en) Merge the mode excavation of the degree of association and chi-square value and the cross-language retrieval method of extension
CN104239430A (en) Item weight change based method and system for mining education data association rules
Du et al. An overview of dynamic data mining
CN111259117B (en) Short text batch matching method and device
Lu et al. Research on text classification based on TextRank
CN109684465B (en) Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison
Duan et al. Error correction for search engine by mining bad case
CN111897928A (en) Chinese query expansion method for embedding expansion words into query words and counting expansion word union
Xu An Apriori algorithm to improve teaching effectiveness
CN108170778A (en) Rear extended method is translated across language inquiry by China and Britain based on complete Weighted Rule consequent
Gui et al. Topic modeling of news based on spark Mllib
Xiaohu et al. A Fast Search Algorithm Based on Agent Association Rules
Hu et al. Graphsdh: a general graph sampling framework with distribution and hierarchy
He et al. Enterprise human resources information mining based on improved Apriori algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160325

Address after: No. 100 Mingxiu West Road, Nanning 530003, the Guangxi Zhuang Autonomous Region

Applicant after: Guangxi University of Finance and Economics

Address before: No. 37 Jianzheng Road, Nanning 530023, the Guangxi Zhuang Autonomous Region

Applicant before: Guangxi College of Education

GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170718

Termination date: 20180827