CN104182527A - Partial-order itemset based Chinese-English text word association rule mining method and system - Google Patents

Partial-order itemset based Chinese-English text word association rule mining method and system

Info

Publication number
CN104182527A
CN104182527A (application CN201410427491.8A)
Authority
CN
China
Prior art keywords
collection
item
partial order
candidate
feature words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410427491.8A
Other languages
Chinese (zh)
Other versions
CN104182527B (en)
Inventor
Huang Mingxuan (黄名选)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
GUANGXI COLLEGE OF EDUCATION
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGXI COLLEGE OF EDUCATION filed Critical GUANGXI COLLEGE OF EDUCATION
Priority to CN201410427491.8A priority Critical patent/CN104182527B/en
Publication of CN104182527A publication Critical patent/CN104182527A/en
Application granted granted Critical
Publication of CN104182527B publication Critical patent/CN104182527B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/316 - Indexing structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a partial-order itemset based method and system for mining association rules between Chinese and English text words. A text information preprocessing module performs preprocessing to build a text information database and a feature word item library. A feature word frequent partial-order itemset module mines feature word candidate itemsets and derives their partial-order itemsets; the candidate partial-order itemsets are pruned by a new itemset pruning method, the weights of the candidate partial-order itemsets are calculated, and their supports are computed by a new calculation method so as to obtain the frequent partial-order itemsets.

Description

Method and system for mining association rules between Chinese and English text words based on partial-order itemsets
Technical field
The invention belongs to the field of data mining, and specifically relates to a method and system for mining association rules between Chinese and English text words based on partial-order itemsets. It is applicable to discovering feature word association patterns in Chinese and English text mining, and to fields such as query expansion in Chinese and English document information retrieval and Chinese-English cross-language information retrieval.
Background technology
Over the past twenty-odd years, research on association rule mining has produced significant technical achievements, concentrated mainly in two areas: mining based on item frequency and mining based on item weights.
Mining based on item frequency, also called unweighted association rule mining, treats all items as equally important and uses the probability of an itemset occurring in the transactions, and the corresponding conditional probability, as the itemset support and rule confidence respectively. The most representative classical approach is the Apriori method (R. Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in large databases [C] // Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., 1993, (5): 207-216). On this basis, scholars have adopted various approaches and improved the Apriori method from different angles.
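The frequency-based scheme described above can be illustrated with a minimal Apriori-style sketch (an illustration of the classical unweighted approach only, not the method of the invention; the transactions and threshold are invented for the example):

```python
def apriori(transactions, min_support):
    """Minimal frequency-based (unweighted) itemset mining: an itemset's
    support is simply the fraction of transactions containing it."""
    n = len(transactions)
    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) / n >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidate k-itemsets by joining frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) / n >= min_support}
        result |= frequent
        k += 1
    return result

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
freq = apriori(transactions, min_support=0.5)
```

Here every item counts equally, which is exactly the limitation the weighted approaches below address.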
Although frequency-based mining methods have been widely studied, they have the following drawback: they consider only item frequency and ignore item weights, which causes many barren, redundant, and invalid association patterns to be produced. To address this problem, weighted association pattern mining based on item weights has received extensive discussion and research; it introduces weights to reflect that items have different importance and carry different weights in transaction records. Depending on the source of the item weights, weight-based mining divides into two classes: weighted pattern mining with fixed item weights, and all-weighted pattern mining with item weights that vary across transactions.
Weighted pattern mining with fixed item weights is an early weight-based mining approach that has attracted the attention and deep study of many scholars since 1998. Its characteristic is that item weights are set by users or domain experts and remain fixed throughout the mining process. Typical algorithms are the weighted association rule mining algorithms MINWAL(O) and MINWAL(W) proposed by Cai et al. (C. H. Cai, A. W. C. Fu, et al. Mining Association Rules with Weighted Items [C] // Proceedings of the IEEE International Database Engineering and Application Symposium, 1998: 68-77). Improved weighted pattern mining methods have since appeared, achieving good results in both mining efficiency and mining performance.
The limitation of fixed-weight association rule mining is that it does not consider item weights that change with the transaction record; it ignores weight variation and cannot solve mining problems over data whose item weights vary. Data with varying item weights are conventionally called all-weighted data, also known as matrix-weighted data. Text information is typical all-weighted data: in massive text collections, feature word weights depend on each document and change from document to document. All-weighted association mining overcomes the defect of fixed-weight pattern mining and can mine various association patterns from data with varying item weights; it belongs to the class of mining with varying item weights, and its principal feature is that item weights depend on the transaction and change dynamically. Typical all-weighted association mining methods are the algorithm KWEstimate for mining all-weighted association rules from the vector space model proposed by Tan Yihong et al. in 2003 (Tan Yihong, Lin Yaping. Mining all-weighted association rules from the vector space model [J]. Computer Engineering and Applications, 2003 (13): 208-211) and the query-expansion-oriented matrix-weighted association rule mining algorithm MWARM (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo relevance feedback query expansion based on matrix-weighted association rule mining [J]. Journal of Software, 2009, 20 (7): 1854-1865). These methods achieve good results in mining all-weighted association patterns and have been successfully applied to query expansion in information retrieval (the above, and Huang Mingxuan, Yan Xiaowei, Zhang Shichao. All-weighted association rule mining and its application in query expansion [J]. Application Research of Computers, 2008, 25 (6): 1724-1727), with notable effect. The defect of existing methods based on varying weights is that the number of mined association patterns is still very large, which increases the difficulty for users of selecting the patterns they need; many barren, false, and invalid patterns are also produced, making the techniques difficult to put into practical application.
With the development of network and information technology, all-weighted data (such as network text information) is growing rapidly into massive data. How to mine useful association patterns that better reflect reality from such massive all-weighted data is an urgent problem. Mining algorithms based on fixed item weights are not suitable for processing all-weighted data, and at present most work still processes such data with frequency-based mining methods, producing large numbers of barren, redundant, and invalid association patterns. For the above problems, and according to the characteristics of Chinese and English document data, the present invention proposes a new partial-order itemset based method and system for mining Chinese and English feature word association rules. The invention adopts new partial-order itemset support calculation methods and pruning techniques, avoids generating many invalid, false, and barren association patterns, greatly improves Chinese and English text mining efficiency, and mines feature word association rule patterns closer to reality. Experimental results show that both the number of feature word association patterns mined by the proposed text mining method and the mining time are clearly reduced; its mining performance is better than existing all-weighted pattern mining and frequency-based pattern mining methods, and the mined feature word association patterns can provide a reliable source of query expansion words for information retrieval. The inventive method therefore has important application value and broad prospects in fields such as text mining and information retrieval.
Summary of the invention
The technical problem to be solved by the invention is, through deep study of Chinese and English text feature word association pattern mining, to propose a method and system for mining association rules between Chinese and English text words based on partial-order itemsets, improving Chinese and English text mining efficiency. Applied to query expansion in Chinese and English document information retrieval, it can improve retrieval performance; applied to Chinese and English text mining, it can find more realistic and reasonable Chinese and English feature word association patterns, thereby improving the precision of text clustering and classification. For example, in a search engine (Baidu, Google, etc.), the inventive method can obtain high-quality expansion words to realize user query expansion and improve recall and precision.
The technical scheme adopted by the invention to solve the above technical problem is a method for mining association rules between Chinese and English text words based on partial-order itemsets, comprising the following steps:
(1) Chinese and English text information preprocessing: the Chinese and English text information to be processed is preprocessed by Chinese word segmentation, English stemming, stop word removal, and feature word extraction with weight calculation, to build a text information database and a feature word item library based on the vector space model.
The Porter stemmer (see http://tartarus.org/~martin/PorterStemmer) is adopted as the English stemming program, and the Chinese word segmentation program is the ICTCLAS Chinese word segmentation system developed by the Institute of Computing Technology, Chinese Academy of Sciences (see http://www.ictclas.org/).
The text feature word weight is computed as: w_ij = (1 + ln(tf_ij)) × idf_i,

where w_ij is the weight of the i-th feature word in the j-th document, idf_i is the inverse document frequency of the i-th feature word, with idf_i = log(n/df_i), n is the total number of documents in the document set, df_i is the number of documents containing the i-th feature word, and tf_ij is the term frequency of the i-th feature word in the j-th document;
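The weighting formula above can be sketched directly in code (the logarithm base of idf is not stated in the text, so base 10 is assumed here; the counts are invented for the example):

```python
import math

def feature_word_weight(tf_ij, df_i, n_docs):
    """w_ij = (1 + ln(tf_ij)) * idf_i with idf_i = log(n/df_i),
    as in the preprocessing step above (idf log base 10 assumed)."""
    idf_i = math.log10(n_docs / df_i)
    return (1 + math.log(tf_ij)) * idf_i

# A feature word occurring 3 times in document j, present in 10 of 1000 documents:
w = feature_word_weight(tf_ij=3, df_i=10, n_docs=1000)
```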
(2) Mine the all-weighted feature word frequent partial-order itemsets, comprising the following steps 2.1 and 2.2:
2.1. Mine the all-weighted feature word frequent 1-itemsets L1; the concrete steps are carried out according to 2.1.1 to 2.1.3:

2.1.1. Extract the feature word candidate 1-itemsets C1 from the feature word item library; accumulate the weights of all items in the text information database to obtain the total item weight sum W; accumulate the weight total w(C1) of C1 in the text information database; and calculate the support poisup(C1) of C1. The formula for poisup(C1) is:

poisup(C1) = w(C1) / W

2.1.2. Join each feature word candidate 1-itemset C1 with poisup(C1) ≥ ms to the set FIS of feature word frequent itemsets as a frequent 1-itemset L1, where ms is the minimum support threshold.

2.1.3. Accumulate the occurrence frequency n_C1 of the candidate 1-itemset C1 in the text information database, extract w_r(C1), and calculate the partial-order itemset weight expectation POIWB(C1, 2) of C1. The formula for POIWB(C1, 2) is:

POIWB(C1, 2) = 2 × W × ms − n_C1 × w_r(C1),

where w_r(C1) is the largest weight among the items not belonging to the item set of C1.
2.2. Mine the all-weighted feature word frequent k-itemsets L_k, k ≥ 2, operating according to steps 2.2.1 to 2.2.12:

2.2.1. For the candidate (k−1)-itemsets C_{k−1}, prune every C_{k−1} with w(C_{k−1}) < POIWB(C_{k−1}, k), which cannot become a frequent k-itemset, obtaining a new set of candidates C_{k−1}. (Pruning 1)

Here w(C_{k−1}) is the accumulated weight of C_{k−1} in the text information database, and POIWB(C_{k−1}, k) is the k-itemset weight expectation of the all-weighted candidate (k−1)-itemset C_{k−1}, computed as follows:

POIWB(C_{k−1}, k) = k × W × ms − n_{k−1} × w_r,

where n_{k−1} is the occurrence frequency of the candidate C_{k−1} in the text information database, and w_r is the largest weight among the items not belonging to the item set of C_{k−1}.
2.2.2. Join the feature word candidate (k−1)-itemsets C_{k−1} whose itemset frequency is not 0 by Apriori joining, generating the feature word candidate k-itemsets C_k.

2.2.3. If C_k is empty, exit step 2.2 and proceed to step (3); otherwise, if C_k is not empty, proceed to step 2.2.4.

2.2.4. For each candidate k-itemset C_k, examine every (k−1)-subset of C_k. If there exists a (k−1)-subset whose itemset weight is less than the corresponding partial-order itemset weight expectation (w_{k−1} < POIWB(C_{k−1}, k)), then C_k must be a non-frequent itemset; delete it from the candidate set, obtaining a new set of candidate partial-order itemsets poC_k. (Pruning 2)

2.2.5. Accumulate the occurrence frequency n_Ck of the candidate k-itemset C_k in the text information database and its item weights w_1(C_k), w_2(C_k), …, w_k(C_k); extract w_r(C_k); and calculate the weight expectation POIWB(C_k, k+1) of C_k. The formula for POIWB(C_k, k+1) is:

POIWB(C_k, k+1) = (k+1) × W × ms − n_Ck × w_r(C_k)

2.2.6. Delete the candidate k-itemsets C_k whose itemset frequency is 0, obtaining a new set of C_k. (Pruning 3)
2.2.7. Obtain the partial-order itemset poC_k of each C_k.

2.2.8. Examine the high-order proper subsets of each partial-order itemset poC_k. If a high-order proper subset of poC_k is non-frequent, poC_k is certainly non-frequent; delete it from the set, obtaining a new set of candidate partial-order itemsets poC_k. (Pruning 4)

2.2.9. Examine the weight of the high-order item of each partial-order itemset poC_k. If the weight of the high-order item of poC_k is less than the minimum weight threshold of 1-itemsets minw, poC_k is certainly non-frequent; delete it from the set, obtaining a new set of candidate partial-order itemsets poC_k. The formula for minw is: minw = W × ms. (Pruning 5)

2.2.10. Examine the low-order item of each partial-order itemset poC_k. If the weight of the low-order item of poC_k is not less than minw, poC_k must be frequent; join it to the set FIS of feature word frequent itemsets.

2.2.11. For each remaining partial-order itemset poC_k, calculate its support poisup(poC_k). If poisup(poC_k) ≥ ms, the partial-order itemset poC_k is frequent; join it to the set FIS of feature word frequent itemsets. The formula for poisup(poC_k) is:

poisup(poC_k) = w(poC_k) / (k × W),

where w(poC_k) is the accumulated weight of the partial-order itemset poC_k in the text information database and k is the number of items of the feature word partial-order itemset poC_k.

2.2.12. Add 1 to the value of k and loop over steps 2.2.1 to 2.2.12 until C_k is empty; then exit step 2.2 and proceed to step (3) below.
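Steps 2.2.9 to 2.2.11 reduce to simple checks on a candidate's item weights; a minimal sketch, assuming the support formula poisup = w(poC_k)/(k × W) of step 2.2.11 (the numeric values are invented for the example):

```python
def classify_partial_order_itemset(item_weights, W, ms):
    """Given the per-item weights of a candidate partial-order k-itemset,
    apply Pruning 5 (2.2.9), the low-order shortcut (2.2.10), and the
    support test of step 2.2.11."""
    k = len(item_weights)
    minw = W * ms                # minimum weight threshold for 1-itemsets
    high = max(item_weights)     # weight of the high-order item
    low = min(item_weights)      # weight of the low-order item
    if high < minw:              # 2.2.9: certainly non-frequent
        return "pruned"
    if low >= minw:              # 2.2.10: certainly frequent
        return "frequent"
    poisup = sum(item_weights) / (k * W)
    return "frequent" if poisup >= ms else "non-frequent"

W, ms = 10.0, 0.1                # total weight sum and support threshold
status = classify_partial_order_itemset([1.29, 1.2, 0.86], W, ms)
```

Note that the two shortcuts are consistent with the support formula: if even the largest item weight is below minw = W × ms the sum cannot reach k × W × ms, and if even the smallest reaches minw the sum must.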
(3) Mine effective all-weighted feature word strong association rule patterns from the set FIS of feature word frequent itemsets, comprising the following steps:

3.1. Take a feature word frequent itemset L_i from the set FIS and generate all proper subsets of L_i.

3.2. Take any two proper subsets I1 and I2 from the proper subset set of L_i such that I1 ∩ I2 = ∅ and I1 ∪ I2 = L_i. If w12 ≥ (k12/k1) × w1 × mc, mine the feature word strong association rule I1 → I2; if w12 ≥ (k12/k2) × w2 × mc, mine the feature word strong association rule I2 → I1. Here k1, k2, and k12 are the numbers of items of the itemsets I1, I2, and (I1, I2), respectively; w1, w2, and w12 are the itemset weights of I1, I2, and (I1, I2), respectively; and mc is the minimum confidence threshold.

3.3. Continue step 3.2 until each proper subset in the proper subset set of the feature word frequent itemset L_i has been taken out once and only once, then proceed to step 3.4.

3.4. Continue step 3.1 until each frequent itemset L_i in the set of feature word frequent itemsets has been taken out once and only once; step (3) then ends.

At this point, the mining of all-weighted feature word association rule patterns ends.
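The confidence test of step 3.2 can be sketched as follows (the itemsets and weights are invented for the example; w1, w2, and w12 denote the itemset weights of I1, I2, and (I1, I2) as defined above):

```python
def strong_rules(I1, I2, w1, w2, w12, mc):
    """Step 3.2: emit I1 -> I2 when w12 >= (k12/k1) * w1 * mc, and
    I2 -> I1 when w12 >= (k12/k2) * w2 * mc."""
    k1, k2 = len(I1), len(I2)
    k12 = k1 + k2                # I1 and I2 are disjoint and I1 ∪ I2 = L_i
    rules = []
    if w12 >= (k12 / k1) * w1 * mc:
        rules.append((I1, I2))   # rule I1 -> I2
    if w12 >= (k12 / k2) * w2 * mc:
        rules.append((I2, I1))   # rule I2 -> I1
    return rules

# A hypothetical frequent 3-itemset split into I1 = (i2,) and I2 = (i3, i4):
rules = strong_rules(("i2",), ("i3", "i4"), w1=4.0, w2=2.5, w12=3.35, mc=0.25)
```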
A mining system suitable for the above method for mining association rules between Chinese and English text words based on partial-order itemsets, characterized in that it comprises the following 4 modules:

Text information preprocessing module: preprocesses the pending Chinese and English text data, i.e., Chinese word segmentation, English stemming, stop word removal, and feature word extraction with weight calculation, and builds the text information database and feature word item library based on the vector space model.

Feature word frequent partial-order itemset generation module: mines all-weighted feature word candidate partial-order itemsets from the text information database, prunes the candidate partial-order itemsets with the new pruning method to obtain the final candidate partial-order itemsets, and derives the all-weighted feature word frequent partial-order itemset patterns from the candidate partial-order itemsets by the new partial-order itemset support calculation method.

All-weighted feature word association rule generation module: by simple calculation and comparison of itemset weights and dimensions, mines effective all-weighted feature word association rule patterns I1 → I2 from the all-weighted feature word frequent partial-order itemsets (I1, I2).

Association rule pattern result display module: displays the effective all-weighted feature word association rule patterns to the user in the form the user prefers, for the user's analysis, selection, and use.
The text information preprocessing module comprises the following 2 modules:

Chinese and English text preprocessing module: responsible for segmenting Chinese text information and removing Chinese stop words, and for stemming English text information and removing English stop words, among other Chinese and English corpus preprocessing.

Text database and item library building module: mainly performs Chinese and English feature word extraction and weight calculation, and builds the text information database and the Chinese and English feature word item library based on the vector space model.
The feature word frequent partial-order itemset generation module comprises the following 3 modules:

Feature word candidate partial-order itemset generation module: mainly mines feature word candidate partial-order itemsets from the text information database. The detailed process is as follows: extract the candidate 1-itemsets from the feature word item library, accumulate the weight sums of the candidate 1-itemsets in the text information database, calculate their supports, and derive the all-weighted feature word frequent 1-itemsets; then generate the feature word candidate k-itemsets from the all-weighted feature word frequent (k−1)-itemsets by Apriori joining, with k ≥ 2; accumulate the item weights of each item of the feature word candidate k-itemsets in the text information database, and derive the all-weighted feature word candidate partial-order k-itemsets.

Feature word candidate partial-order itemset pruning module: uses the pruning method of the invention to prune the all-weighted feature word candidate partial-order k-itemsets, deleting the candidate partial-order k-itemsets that cannot be frequent and obtaining the final set of possibly frequent candidate partial-order k-itemsets.

Feature word frequent partial-order itemset generation module: mainly mines the final candidate partial-order k-itemsets obtained after the above pruning module, calculates the supports of the candidate partial-order k-itemsets with the support calculation method of the invention, compares them with the minimum support threshold, and derives the all-weighted feature word frequent partial-order k-itemsets.
The all-weighted feature word association rule generation module comprises the following 2 modules:

Proper subset generation module of the feature word frequent partial-order itemsets: mainly finds all proper subsets of the feature word frequent partial-order itemsets and obtains the itemset weight and dimension of each proper subset.

All-weighted feature word association rule generation module: by simple calculation and comparison of itemset weights, mines effective all-weighted feature word strong association rule patterns from the feature word frequent partial-order itemsets.

The minimum support threshold ms and minimum confidence threshold mc in the mining system are input by the user.
Compared with the prior art, the invention has the following beneficial effects:

(1) The invention first proposes the concept of Chinese and English all-weighted feature word partial-order itemsets, together with a new all-weighted feature word partial-order itemset support calculation method and a partial-order itemset pruning method, and on this basis proposes a method and system for mining association rules between Chinese and English text words based on partial-order itemsets. The invention adopts the new partial-order itemset support calculation method and pruning techniques, avoids generating many invalid, false, and barren association patterns, greatly improves mining efficiency, and mines association patterns closer to reality. Compared with existing mining methods, the invention has a good pruning effect: both the number of association patterns and the mining time are clearly reduced, and its mining performance is better than existing all-weighted pattern mining and frequency-based pattern mining methods. It improves the efficiency of mining Chinese and English feature word association rule patterns, obtains realistic association patterns between text words, and has high application value and broad prospects in fields such as text mining and information retrieval. For example, in a search engine (Baidu, Google, etc.), the inventive method can obtain high-quality expansion words to realize user query expansion and improve recall and precision.
(2) Using the Chinese standard data set CWT200g and the corpus of the international standard English data set NTCIR-5 as experimental data, the invention was tested, compared, and analyzed against the traditional frequency-based pattern mining method and the all-weighted pattern mining method. The experimental results show that, whether the support threshold or the confidence threshold varies, the number of candidate itemsets mined by the invention is smaller than that of the comparison methods, and the mining time of the invention is shorter than that of the comparison methods, with a large reduction; mining efficiency is greatly improved.
Brief description of the drawings
Fig. 1 is a block diagram of the method of the invention for mining association rules between Chinese and English text words based on partial-order itemsets.
Fig. 2 is an overall flow chart of the method of the invention for mining association rules between Chinese and English text words based on partial-order itemsets.
Fig. 3 is a structural block diagram of the system of the invention for mining association rules between Chinese and English text words based on partial-order itemsets.
Fig. 4 is a structural block diagram of the text information preprocessing module of the invention.
Fig. 5 is a structural block diagram of the feature word frequent partial-order itemset generation module of the invention.
Fig. 6 is a structural block diagram of the all-weighted feature word association rule generation module of the invention.
Embodiment
To better explain the technical scheme of the invention, the Chinese and English text data model and the related concepts involved in the invention are described below:
1. Basic concepts
Definition 1 (Chinese and English text information data model):

Chinese and English text information data are all-weighted data with item weights that vary across transactions. Their data model DWDM (Dynamic Weighted Data Model) consists of a set of transaction records TR (Transaction Record), a set of feature word items IS (Item Set), and a set IW (Item Weight) of correspondences between feature word items, transaction records, and weights, formalized as in formula (1):

DWDM = (TR, IS, IW)   (1)

where TR = {r1, r2, …, rn}, and ri (1 ≤ i ≤ n) is the i-th transaction record of DWDM;

IS = {i1, i2, …, im}, and ij (1 ≤ j ≤ m) is the j-th feature word item of DWDM;

IW = {<i1, r1, w[r1][i1]>, <i2, r1, w[r1][i2]>, …, <im, r1, w[r1][im]>, <i1, r2, w[r2][i1]>, <i2, r2, w[r2][i2]>, …, <im, r2, w[r2][im]>, …, <i1, rn, w[rn][i1]>, <i2, rn, w[rn][i2]>, …, <im, rn, w[rn][im]>}.

In the set IW, w[ri][ij] (1 ≤ i ≤ n, 1 ≤ j ≤ m) is the weight of item ij in transaction record ri; if feature word item ij does not appear in transaction record ri, then w[ri][ij] = 0.
Example: an all-weighted data example of Chinese text data is as follows: Text = (TR, IS, IW), where TR = {r1, r2, r3, r4, r5} are 5 transaction records, IS = {i1, i2, i3, i4, i5} are 5 feature word items, and IW = {<i1, r1, 0>, <i2, r1, 0.83>, <i3, r1, 0.81>, <i4, r1, 0>, <i5, r1, 0.01>, <i1, r2, 0>, <i2, r2, 0.94>, <i3, r2, 0.7>, <i4, r2, 0.23>, <i5, r2, 0>, <i1, r3, 0>, <i2, r3, 0.35>, <i3, r3, 0.5>, <i4, r3, 0.63>, <i5, r3, 0>, <i1, r4, 0.95>, <i2, r4, 0>, <i3, r4, 0.85>, <i4, r4, 0>, <i5, r4, 0>, <i1, r5, 0.73>, <i2, r5, 0.02>, <i3, r5, 0>, <i4, r5, 0.06>, <i5, r5, 0.9>}. The set IW can be represented as the table of Fig. 1 below.

Fig. 1. The all-weighted data example
Definition 2 (itemset weight and item weight): An all-weighted itemset I is a set of distinct items i1, i2, …, ip, written I = (i1, i2, …, ip) (1 ≤ p ≤ m), I ⊆ IS. The itemset weight of I is the accumulated weight of i1, i2, …, ip over every transaction record in which all items of I appear simultaneously, denoted w(I), i.e. w(I) = w1 + w2 + … + wp, where w1, w2, …, wp are the weights corresponding to the items i1, i2, …, ip of I, called the item weights of I; the item weight of an item is the accumulated sum of that item's weights over the different transaction records of the set TR in which all items (i1, i2, …, ip) of I occur simultaneously.
In particular, for a subset of the itemset I, the accumulated sum of the weights of the subset's items over the transaction records in which all items (i1, i2, …, ip) of I occur simultaneously is called the subset item weight, denoted w_sub; when this subset is taken as an itemset in its own right, its itemset weight over the transaction record set TR is denoted w(sub). For example, for the subset (i1, i3) of itemset I, the subset item weight is w_sub(i1, i3) = w1 + w3, while w(i1, i3) denotes the itemset weight of (i1, i3) taken as an itemset on its own.
Example: In the Text example of Fig. 1, the itemset weight of the 3-itemset (i2, i3, i4) is the accumulated weight of i2, i3, i4 over every transaction record in which all three items appear simultaneously (the qualifying records are r2 and r3): w(i2, i3, i4) = (0.94+0.7+0.23) + (0.35+0.5+0.63) = 3.35. The item weights of the 3-itemset (i2, i3, i4) are the sums of each single item's weights over the qualifying records (r2 and r3): w_i2 = 0.94+0.35 = 1.29, w_i3 = 0.7+0.5 = 1.2, w_i4 = 0.23+0.63 = 0.86. For the subset (i2, i3) of (i2, i3, i4), the subset item weight is w_sub(i2, i3) = w_i2 + w_i3 = 1.29+1.2 = 2.49, while the itemset weight of (i2, i3) on its own is w(i2, i3) = (0.83+0.81) + (0.94+0.7) + (0.35+0.5) = 4.13.
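The itemset-weight computation of Definition 2 can be sketched as follows (a minimal illustration over the Fig. 1 data; function and variable names are our own):

```python
# Fig. 1 data: record -> {feature term: weight}; 0 = term absent.
weights = {
    "r1": {"i1": 0,    "i2": 0.83, "i3": 0.81, "i4": 0,    "i5": 0.01},
    "r2": {"i1": 0,    "i2": 0.94, "i3": 0.7,  "i4": 0.23, "i5": 0},
    "r3": {"i1": 0,    "i2": 0.35, "i3": 0.5,  "i4": 0.63, "i5": 0},
    "r4": {"i1": 0.95, "i2": 0,    "i3": 0.85, "i4": 0,    "i5": 0},
    "r5": {"i1": 0.73, "i2": 0.02, "i3": 0,    "i4": 0.06, "i5": 0.9},
}

def itemset_weight(items, db):
    """w(I) of Definition 2: accumulate the weights of I's items over the
    records in which every item of I appears; also return the per-item
    weights of I over those same records."""
    recs = [r for r, row in db.items() if all(row[i] > 0 for i in items)]
    w = round(sum(db[r][i] for r in recs for i in items), 2)
    per_item = {i: round(sum(db[r][i] for r in recs), 2) for i in items}
    return w, per_item

w, item_w = itemset_weight(("i2", "i3", "i4"), weights)
print(w, item_w)  # 3.35 {'i2': 1.29, 'i3': 1.2, 'i4': 0.86}
```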
Definition 3 (all-weighted partial-order itemset): For an all-weighted itemset I = (i1, i2, …, ip) (1 ≤ p ≤ m) with item weights w1, w2, …, wp, sort the items by item weight. If w1 ≤ w2 ≤ … ≤ wp, the corresponding item arrangement is written i1 ≤ i2 ≤ … ≤ ip, and the itemset (i1, i2, …, ip) so arranged is called an all-weighted partial-order itemset (Partial Order Itemset, POI), where i1, the item with the smallest weight, is called the lowest-weight item, and ip, the item with the largest weight, is called the highest-weight item.
Example: In the Text example of Fig. 1, the item weights of the 3-itemset (i2, i3, i4) are 1.29, 1.2 and 0.86 respectively, so its all-weighted partial-order itemset is (i4, i3, i2), with i4 the lowest-weight item and i2 the highest-weight item.
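Definition 3 amounts to sorting the items of an itemset by ascending item weight; a one-line sketch (names are our own):

```python
def partial_order(items, item_weights):
    """Arrange an itemset by ascending item weight (Definition 3)."""
    return tuple(sorted(items, key=lambda i: item_weights[i]))

# Item weights of the 3-itemset (i2, i3, i4) from the example above.
poi = partial_order(("i2", "i3", "i4"), {"i2": 1.29, "i3": 1.2, "i4": 0.86})
print(poi)  # ('i4', 'i3', 'i2'): i4 is the lowest-weight item, i2 the highest
```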
Definition 4 (support of an all-weighted partial-order itemset): Taking item weight as the measure and the total weight of the all-weighted transaction database as the sample space, by analogy with the geometric probability model of probability theory, a new support of an all-weighted partial-order itemset I = (i1, i2, …, ip) (1 ≤ p ≤ m) (all-weighted partial order itemset support, poisup), written poisup(I), is given by formula (7):
poisup(I) = w(I) / (p × W)    (7)
Here w(I) is the itemset weight of the all-weighted partial-order itemset I, W is the sum of all item weights over the all-weighted transaction record set TR, and 1/p is called the support normalisation coefficient of all-weighted partial-order itemsets. The normalisation coefficient is introduced because, in all-weighted data mining, the weight of a partial-order itemset grows with the length of the itemset, which inflates itemset support and rule confidence; the coefficient 1/p keeps support and confidence values within a reasonable range without affecting the mining of all-weighted association patterns.
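Formula (7) and its 1/p normalisation can be checked directly; a sketch under our own naming, with W = 8.51 the total weight of the Fig. 1 data:

```python
def poisup(w_itemset, p, w_total):
    """Formula (7): poisup(I) = w(I) / (p * W), where 1/p is the
    support normalisation coefficient for a p-itemset."""
    return w_itemset / (p * w_total)

W = 8.51
print(round(poisup(3.35, 3, W), 3))  # 0.131 for the 3-itemset (i4, i3, i2)
print(round(poisup(2.14, 1, W), 2))  # 0.25 for the 1-itemset (i2)
```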
Definition 5 (all-weighted frequent partial-order itemset): Let ms be the minimum support threshold. An all-weighted partial-order itemset I is called an all-weighted frequent partial-order itemset if poisup(I) ≥ ms, i.e. w(I) ≥ W × p × ms.
In particular, when I is a 1-itemset, p = 1, which yields the minimum weight threshold of 1-itemsets, minw = W × ms; clearly, a 1-itemset whose weight is not less than minw is frequent.
Definition 6 (partial-order itemset weight bound): The partial-order itemset weight bound (Partial Order Itemset Weight Bound, POIWB) is the predicted critical weight of the k-itemsets containing an all-weighted (k−1)-partial-order itemset I(k−1), written POIWB(I(k−1), k). The weight bound has important theoretical significance: from the weight of an all-weighted (k−1)-partial-order itemset one can predict whether the k-itemsets derived from it can be frequent.
Let the weight of an all-weighted (k−1)-partial-order itemset I(k−1) (k < m, I(k−1) ⊆ IS) be w(k−1). Among the items not belonging to I(k−1), denote the item with the largest weight by ir (ir ∈ IS, ir ∉ I(k−1), 1 ≤ r ≤ m) and its weight by wr, and let n(k−1) be the occurrence frequency of I(k−1) in the transaction record set TR. Then the maximum possible weight of a k-itemset containing I(k−1) is w(k−1) + n(k−1) × wr.
If a k-itemset containing I(k−1) is frequent, then by Definitions 4 and 5,
w(k−1) + n(k−1) × wr ≥ k × W × ms, i.e. w(k−1) ≥ k × W × ms − n(k−1) × wr    (8)
The right-hand side of formula (8) is called the partial-order k-itemset weight bound of the all-weighted (k−1)-partial-order itemset I(k−1), written POIWB(I(k−1), k), that is,
POIWB(I(k−1), k) = k × W × ms − n(k−1) × wr    (9)
Formula (9) shows that only when w(k−1) ≥ POIWB(I(k−1), k) may an all-weighted k-partial-order itemset containing I(k−1) be a frequent itemset.
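Formula (9) is simple enough to check numerically; the following sketch (names are our own) reproduces the POIWB(C1, 2) value of the 1-itemset (i1) in the worked example of Table 1:

```python
def poiwb(k, w_total, ms, n_prev, w_r):
    """Formula (9): POIWB(I(k-1), k) = k * W * ms - n(k-1) * w_r."""
    return k * w_total * ms - n_prev * w_r

# For the 1-itemset (i1): W = 8.51, ms = 0.1, n(1) = 2 occurrences,
# w_r = 0.94 (largest weight of any item outside the itemset).
print(round(poiwb(2, 8.51, 0.1, 2, 0.94), 3))  # -0.178
```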
Definition 7 (low-order proper subset and high-order proper subset): Let Z = (X, Y) be an all-weighted partial-order itemset, and let X and Y be two partial-order sub-itemsets of Z, where X = (i1, i2, …, ir) (1 ≤ r < m) and Y = (i(r+1), i(r+2), …, i(r+q)) (1 ≤ q < m, 2 ≤ r + q ≤ m), with corresponding item weights w1, w2, …, wr (w1 ≤ w2 ≤ … ≤ wr) and w(r+1), w(r+2), …, w(r+q) (w(r+1) ≤ w(r+2) ≤ … ≤ w(r+q)). If the weight of the highest-weight item of X is not greater than the weight of the lowest-weight item of Y, i.e. wr ≤ w(r+1), then the sub-itemset X is called a low-order proper subset of the partial-order itemset Z, and the sub-itemset Y a high-order proper subset of Z.
The pruning method of the present invention for all-weighted feature-term itemsets is as follows:
1. Before the feature-term candidate (i−1)-itemsets C(i−1) generate the feature-term candidate i-itemsets Ci (i ≥ 2), compute the feature-term itemset weight bound POIWB(C(i−1), i) of C(i−1). If the itemset weight of an all-weighted feature-term candidate (i−1)-itemset C(i−1) satisfies w(i−1) < POIWB(C(i−1), i), then every feature-term i-itemset Ci derived from this C(i−1) must be non-frequent, so this feature-term (i−1)-itemset is pruned from the C(i−1) set.
2. After the feature-term candidate itemsets Ci are generated, for every (i−1)-sub-itemset of a candidate Ci, compute the feature-term itemset weight bound of each candidate subset; if there exists an (i−1)-subset whose itemset weight is less than its corresponding feature-term itemset weight bound (w(i−1) < POIWB(C(i−1), i)), then this feature-term candidate i-itemset Ci must be non-frequent and is pruned from the Ci set.
3. For a feature-term candidate Ci viewed as a partial-order itemset, if one of its high-order proper subsets is non-frequent, then this candidate Ci is a non-frequent partial-order itemset and is pruned from the Ci set.
4. For a feature-term candidate Ci viewed as a partial-order itemset, if the weight of its highest-weight item is less than the 1-itemset minimum weight threshold minw, then this candidate must be non-frequent and is pruned from the Ci set.
5. If the feature-term itemset frequency of a feature-term (i−1)-itemset C(i−1) is 0, i.e. n(i−1) = 0, then every feature-term i-itemset derived from this (i−1)-itemset must be non-frequent, so this feature-term (i−1)-itemset is pruned from the C(i−1) set.
6. For a candidate Ci viewed as a partial-order itemset, if the weight of its lowest-weight item is not less than the 1-itemset minimum weight threshold minw, then this candidate Ci is frequent, and Ci is added to the frequent itemset set.
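Pruning rules 4 and 6 depend only on the extreme item weights of a partial-order itemset. A sketch over the partial-order 2-itemsets of the worked example below, with weight pairs in ascending order as in Table 2 (the data layout is our own illustration):

```python
# Partial-order 2-itemsets -> (lowest-weight, highest-weight) item weights.
minw = 0.851
po_c2 = {
    ("i2", "i1"): (0.02, 0.73), ("i3", "i1"): (0.85, 0.95),
    ("i4", "i1"): (0.06, 0.73), ("i1", "i5"): (0.73, 0.9),
    ("i3", "i2"): (2.01, 2.12), ("i4", "i2"): (0.92, 1.31),
    ("i2", "i5"): (0.85, 0.91), ("i4", "i3"): (0.86, 1.2),
    ("i5", "i3"): (0.01, 0.81), ("i4", "i5"): (0.06, 0.9),
}
# Rule 4: highest-weight item below minw -> certainly non-frequent.
pruned = {s for s, w in po_c2.items() if w[-1] < minw}
# Rule 6: lowest-weight item at least minw -> certainly frequent.
frequent = {s for s, w in po_c2.items() if w[0] >= minw}
print(sorted(pruned))    # (i2,i1), (i4,i1), (i5,i3)
print(sorted(frequent))  # (i3,i2), (i4,i2), (i4,i3)
```

The remaining itemsets fall between the two rules and must be decided by the support computation of formula (7).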
The technical solution of the present invention is further described below through specific embodiments.
The mining method and system adopted by the present invention in the specific embodiments are shown in Fig. 1 to Fig. 6.
Taking the data of Fig. 1 as an example, the process by which the present invention mines all-weighted feature-term association rules is as follows (ms = 0.1, mc = 0.6):
1. Obtain the sum W of all item weights in the database: W = 8.51, hence minw = W × ms = 0.851.
2. Mine the all-weighted frequent feature-term 1-itemsets L1, as shown in Table 1.
Table 1:
C1    w(C1)  poisup(C1)  n_C1  w_r(C1)  POIWB(C1, 2)
(i1)  1.68   0.197       2     0.94     2×8.51×0.1 - 2×0.94 = -0.178
(i2)  2.14   0.25        4     0.95     2×8.51×0.1 - 4×0.95 = -2.098
(i3)  2.86   0.33        4     0.95     2×8.51×0.1 - 4×0.95 = -2.098
(i4)  0.92   0.108       3     0.95     2×8.51×0.1 - 3×0.95 = -1.148
(i5)  0.91   0.107       2     0.95     2×8.51×0.1 - 2×0.95 = -0.198
As shown in Table 1, L1 = {(i1), (i2), (i3), (i4), (i5)}, and the frequent feature-term itemset set FIS = {(i1), (i2), (i3), (i4), (i5)}.
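Steps 1-2 of the walk-through can be sketched as follows (variable names are our own); every 1-itemset weight exceeds minw = 0.851, so all five 1-itemsets are frequent:

```python
# Fig. 1 data: record -> {feature term: weight}; 0 = term absent.
weights = {
    "r1": {"i1": 0,    "i2": 0.83, "i3": 0.81, "i4": 0,    "i5": 0.01},
    "r2": {"i1": 0,    "i2": 0.94, "i3": 0.7,  "i4": 0.23, "i5": 0},
    "r3": {"i1": 0,    "i2": 0.35, "i3": 0.5,  "i4": 0.63, "i5": 0},
    "r4": {"i1": 0.95, "i2": 0,    "i3": 0.85, "i4": 0,    "i5": 0},
    "r5": {"i1": 0.73, "i2": 0.02, "i3": 0,    "i4": 0.06, "i5": 0.9},
}
ms = 0.1
W = round(sum(w for rec in weights.values() for w in rec.values()), 2)  # 8.51
minw = W * ms  # 0.851

items = ["i1", "i2", "i3", "i4", "i5"]
w1 = {i: round(sum(rec[i] for rec in weights.values()), 2) for i in items}
L1 = [i for i in items if w1[i] >= minw]
print(w1)  # {'i1': 1.68, 'i2': 2.14, 'i3': 2.86, 'i4': 0.92, 'i5': 0.91}
print(L1 == items)  # True: all five 1-itemsets are frequent
```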
3. Mine the all-weighted frequent feature-term k-itemsets Lk, k ≥ 2.
k = 2:
(1) (Pruning 1) For the candidate 1-itemsets C1, no case of w(C1) < POIWB(C1, 2) occurs, so the candidate C1 set is unchanged.
(2) Apply the Apriori join to the feature-term candidate 1-itemsets C1 whose itemset frequency is not 0, generating the feature-term candidate 2-itemsets C2, and compute w1(C2), w2(C2), poC2, w(poC2), n_C2, w_r(C2) and POIWB(C2, 3), as shown in Table 2.
Table 2:
C2        w1(C2)  w2(C2)  poC2      w(poC2)       n_C2  w_r(C2)  POIWB(C2, 3)
(i1, i2)  0.73    0.02    (i2, i1)  (0.02, 0.73)  1     0.9      3×8.51×0.1 - 1×0.9 = 1.653
(i1, i3)  0.95    0.85    (i3, i1)  (0.85, 0.95)  1     0.94     3×8.51×0.1 - 1×0.94 = 1.613
(i1, i4)  0.73    0.06    (i4, i1)  (0.06, 0.73)  1     0.94     3×8.51×0.1 - 1×0.94 = 1.613
(i1, i5)  0.73    0.9     (i1, i5)  (0.73, 0.9)   1     0.94     3×8.51×0.1 - 1×0.94 = 1.613
(i2, i3)  2.12    2.01    (i3, i2)  (2.01, 2.12)  3     0.95     3×8.51×0.1 - 3×0.95 = -0.297
(i2, i4)  1.31    0.92    (i4, i2)  (0.92, 1.31)  3     0.95     3×8.51×0.1 - 3×0.95 = -0.297
(i2, i5)  0.85    0.91    (i2, i5)  (0.85, 0.91)  2     0.95     3×8.51×0.1 - 2×0.95 = 0.653
(i3, i4)  1.2     0.86    (i4, i3)  (0.86, 1.2)   2     0.95     3×8.51×0.1 - 2×0.95 = 0.653
(i3, i5)  0.81    0.01    (i5, i3)  (0.01, 0.81)  1     0.95     3×8.51×0.1 - 1×0.95 = 1.603
(i4, i5)  0.06    0.9     (i4, i5)  (0.06, 0.9)   1     0.95     3×8.51×0.1 - 1×0.95 = 1.603
For Table 2, proceed as follows:
* Examine the high-order proper subsets of the partial-order itemsets poC2: (i1), (i2), (i3), (i5). These proper subsets are all frequent and no non-frequent proper subset itemset exists, so the poC2 set is unchanged.
* Examine the weight of the highest-weight item of each poC2. The itemsets whose highest-weight item has weight < minw = 0.851 are (i1, i2), (i1, i4), (i3, i5); they are non-frequent and are deleted from the poC2 set.
* Examine the lowest-weight item of each poC2. The itemsets whose lowest-weight item has weight ≥ minw are (i2, i3), (i2, i4), (i3, i4); they are frequent, and these itemsets are added to the frequent feature-term itemset set FIS, i.e. FIS = {(i1), (i2), (i3), (i4), (i5), (i2, i3), (i2, i4), (i3, i4)}.
* For the remaining partial-order itemsets poC2, namely (i3, i1), (i1, i5), (i2, i5), (i4, i5), compute their support: poisup(i3, i1) = (0.85+0.95)/(8.51×2) = 0.106 > ms, poisup(i1, i5) = 0.096 < ms, poisup(i2, i5) = 0.103 > ms, poisup(i4, i5) = 0.056 < ms. Therefore (i3, i1) and (i2, i5) are frequent partial-order itemsets and are added to the frequent feature-term itemset set FIS, i.e. FIS = {(i1), (i2), (i3), (i4), (i5), (i2, i3), (i2, i4), (i3, i4), (i3, i1), (i2, i5)}.
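The support computation for the remaining partial-order 2-itemsets can be sketched as follows (function and data names are our own):

```python
# Fig. 1 data: record -> {feature term: weight}; 0 = term absent.
weights = {
    "r1": {"i1": 0,    "i2": 0.83, "i3": 0.81, "i4": 0,    "i5": 0.01},
    "r2": {"i1": 0,    "i2": 0.94, "i3": 0.7,  "i4": 0.23, "i5": 0},
    "r3": {"i1": 0,    "i2": 0.35, "i3": 0.5,  "i4": 0.63, "i5": 0},
    "r4": {"i1": 0.95, "i2": 0,    "i3": 0.85, "i4": 0,    "i5": 0},
    "r5": {"i1": 0.73, "i2": 0.02, "i3": 0,    "i4": 0.06, "i5": 0.9},
}
ms, W = 0.1, 8.51

def poisup(items, db, w_total):
    """Formula (7) evaluated from the raw data: accumulate the weights of
    the itemset's members over their co-occurrence records, then divide
    by p * W."""
    recs = [r for r, row in db.items() if all(row[i] > 0 for i in items)]
    w = sum(db[r][i] for r in recs for i in items)
    return w / (len(items) * w_total)

checks = {p: round(poisup(p, weights, W), 3)
          for p in [("i3", "i1"), ("i1", "i5"), ("i2", "i5"), ("i4", "i5")]}
print(checks)
# {('i3','i1'): 0.106, ('i1','i5'): 0.096, ('i2','i5'): 0.103, ('i4','i5'): 0.056}
# Only (i3, i1) and (i2, i5) reach ms = 0.1.
```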
k = 3:
* As seen from Table 2, for the candidate 2-itemsets C2 with w(C2) = w1(C2) + w2(C2), the partial-order itemsets with w(C2) < POIWB(C2, 3) are (i2, i1), (i4, i1), (i5, i3) and (i4, i5). These partial-order itemsets cannot yield frequent 3-itemsets and are pruned from the C2 set, giving the new candidate set C2 = {(i1, i3), (i1, i5), (i2, i3), (i2, i4), (i2, i5), (i3, i4)}.
* Apply the Apriori join to the candidate 2-itemsets C2 whose itemset frequency is not 0, generating the feature-term candidate 3-itemsets C3 = {(i1, i3, i5), (i2, i3, i4), (i2, i3, i5), (i2, i4, i5)}.
* For the candidate 3-itemsets C3, examine every (3−1)-sub-itemset, i.e. the 2-sub-itemsets of C3:
For (i1, i3, i5) and (i2, i3, i5): the sub-itemset (i5, i3) satisfies w(i5, i3) < POIWB((i5, i3), 3); for (i2, i4, i5): the sub-itemset (i4, i5) satisfies w(i4, i5) < POIWB((i4, i5), 3). Therefore the feature-term candidate 3-itemsets (i1, i3, i5), (i2, i3, i5) and (i2, i4, i5) are non-frequent and are deleted from C3; the new C3 = {(i2, i3, i4)}.
* Compute w1(C3), w2(C3), w3(C3), poC3, w(poC3), n_C3, w_r(C3) and POIWB(C3, 4), as shown in Table 3.
Table 3:
C3            w1(C3)  w2(C3)  w3(C3)  poC3          w(poC3)            n_C3  w_r(C3)  POIWB(C3, 4)
(i2, i3, i4)  1.29    1.2     0.86    (i4, i3, i2)  (0.86, 1.2, 1.29)  2     0.95     4×8.51×0.1 - 2×0.95 = 1.504
For Table 3, proceed as follows:
* Examine the high-order proper subsets of the partial-order itemset poC3: (i2), (i3, i2). These proper subsets are all frequent and no non-frequent proper subset itemset exists, so the poC3 set is unchanged.
* Examine the weight of the highest-weight item of poC3; the highest-weight item weights of poC3 are all greater than minw, so the poC3 set is unchanged.
* Examine the lowest-weight item of poC3. The itemset whose lowest-weight item has weight ≥ minw is (i4, i3, i2); this itemset is frequent and is added to the frequent feature-term itemset set FIS, i.e. FIS = {(i1), (i2), (i3), (i4), (i5), (i2, i3), (i2, i4), (i3, i4), (i3, i1), (i2, i5), (i4, i3, i2)}.
* Apply the Apriori join to the candidate 3-itemsets C3 whose itemset frequency is not 0, generating the feature-term candidate 4-itemsets C4; C4 = ∅. Since C4 is empty, the mining of step 3 ends, and the process proceeds to step 4 below.
4. Mine effective all-weighted feature-term association rule patterns from the frequent feature-term itemset set FIS.
Taking the frequent feature-term itemset (i4, i3, i2) in FIS as an example, the mining process of effective all-weighted feature-term association rule patterns is as follows:
The proper subset set of the frequent itemset (i4, i3, i2) is {(i4), (i3), (i2), (i4, i3), (i4, i2), (i3, i2)}.
(1) For (i4) and (i3, i2): I1 = (i4), I2 = (i3, i2), I1 ∪ I2 = (i4, i3, i2), hence k1 = 1, k2 = 2, k12 = 3.
From Table 1, w1 = 0.92; from Table 2, w2 = 2.01 + 2.12 = 4.13.
From Table 3, w12 = 0.86 + 1.2 + 1.29 = 3.35.
(k12/k1) × w1 × mc = (3/1) × 0.92 × 0.6 = 1.656; since w12 = 3.35 ≥ (k12/k1) × w1 × mc = 1.656, the feature-term association rule I1 → I2, i.e. (i4) → (i3, i2), is mined.
(k12/k2) × w2 × mc = (3/2) × 4.13 × 0.6 = 3.717; since w12 = 3.35 < (k12/k2) × w2 × mc = 3.717, no rule is mined.
(2) For (i3) and (i4, i2): I1 = (i3), I2 = (i4, i2), I1 ∪ I2 = (i4, i3, i2), hence k1 = 1, k2 = 2, k12 = 3.
From Table 1, w1 = 2.86; from Table 2, w2 = 0.92 + 1.31 = 2.23.
From Table 3, w12 = 0.86 + 1.2 + 1.29 = 3.35.
(k12/k1) × w1 × mc = (3/1) × 2.86 × 0.6 = 5.148; since w12 = 3.35 < (k12/k1) × w1 × mc = 5.148, no rule is mined.
(k12/k2) × w2 × mc = (3/2) × 2.23 × 0.6 = 2.007; since w12 = 3.35 ≥ (k12/k2) × w2 × mc = 2.007, the feature-term association rule I2 → I1, i.e. (i4, i2) → (i3), is mined.
(3) For (i2) and (i4, i3):
I1 = (i2), I2 = (i4, i3), I1 ∪ I2 = (i4, i3, i2), hence k1 = 1, k2 = 2, k12 = 3.
From Table 1, w1 = 2.14; from Table 2, w2 = 0.86 + 1.2 = 2.06.
From Table 3, w12 = 0.86 + 1.2 + 1.29 = 3.35.
(k12/k1) × w1 × mc = (3/1) × 2.14 × 0.6 = 3.852; since w12 = 3.35 < (k12/k1) × w1 × mc = 3.852, no rule is mined.
(k12/k2) × w2 × mc = (3/2) × 2.06 × 0.6 = 1.854; since w12 = 3.35 ≥ (k12/k2) × w2 × mc = 1.854, the feature-term association rule I2 → I1, i.e. (i4, i3) → (i2), is mined.
In summary, for the frequent feature-term itemset (i4, i3, i2), the effective all-weighted feature-term association rule patterns that can be mined (ms = 0.1, mc = 0.6) are: (i4) → (i3, i2), (i4, i2) → (i3) and (i4, i3) → (i2).
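The rule-extraction test of step 4 — mine I1 → I2 when w12 ≥ (k12/k1) × w1 × mc — can be sketched for the frequent itemset (i4, i3, i2), with itemset weights taken from Tables 1-3 (the names here are our own):

```python
mc = 0.6
W12 = 3.35  # itemset weight of (i4, i3, i2), from Table 3
w = {("i4",): 0.92, ("i3",): 2.86, ("i2",): 2.14,
     ("i3", "i2"): 4.13, ("i4", "i2"): 2.23, ("i4", "i3"): 2.06}

def mined(ante, cons):
    """I1 -> I2 is mined iff w12 >= (k12 / k1) * w1 * mc; here the union
    of antecedent and consequent is always (i4, i3, i2)."""
    k1, k12 = len(ante), len(ante) + len(cons)
    return W12 >= (k12 / k1) * w[ante] * mc

pairs = [(("i4",), ("i3", "i2")), (("i3", "i2"), ("i4",)),
         (("i3",), ("i4", "i2")), (("i4", "i2"), ("i3",)),
         (("i2",), ("i4", "i3")), (("i4", "i3"), ("i2",))]
rules = [(a, c) for a, c in pairs if mined(a, c)]
print(rules)  # (i4)->(i3,i2), (i4,i2)->(i3), (i4,i3)->(i2)
```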
The beneficial effects of the present invention are further illustrated below through experiments.
To verify the validity and correctness of the present invention, the classical unweighted association rule mining method Apriori (R. Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in large database[C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., 1993: 207-216.) and the query-expansion-oriented matrix-weighted association rule mining method MWARM (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining[J]. Journal of Software, 2009, 20(7): 1854-1865; in the experiments the number of expansion terms is set to 0) were chosen as comparison methods. Experimental source programs were written, and the mining performance of the present invention and the comparison methods was tested, compared and analysed under variation of the support threshold and variation of the confidence threshold. Besides ms and mc, the experimental parameters are: IN, the number of items mined, and N, the total number of document records. The experiments mine up to 4-itemsets.
The experimental data were extracted from the Korea_Times2001 English document corpus of the NTCIR-5 CLIR test collection of the Japanese national scientific information system's information retrieval test sets, and from part of the Chinese corpus of the Chinese Web test collection CWT200g provided by the network laboratory of Peking University: 4936 English documents (serial number range KT2001_00000--KT2001_05066) were extracted from Korea_Times2001, and 12024 Chinese text documents were extracted from the CWT200g corpus, as the experimental document test sets. After document preprocessing such as word segmentation (Chinese documents), stemming (English documents), stop-word removal, and extraction of feature terms and their weights, a text database and a feature-term item library based on the vector space model were built. After preprocessing, English feature terms whose document frequency df (the number of document records containing the term) lies in the range 1028 to 2593 (50 in total) and Chinese feature terms whose df lies in the range [1500, 5838] were extracted into the feature-term dictionary (the number of Chinese feature terms being 400).
Experiment 1: comparison of mining performance under variation of the support threshold.
The numbers of candidate itemsets (Candidate Itemset, CI), frequent itemsets (Frequent Itemset, FI) and association rules (Association Rule, AR) mined by the present invention and the 2 comparison methods (the Apriori and MWARM methods) in the 2 Chinese and English document test sets as the support threshold varies are shown in Tables 1 to 4.
Experiment 2: comparison of mining performance under variation of the confidence threshold.
The numbers of association rules mined by the present invention and the 2 comparison methods in the 2 Chinese and English document test sets as the confidence threshold varies are shown in Tables 5 and 6.
Experiment 3: comparison of mining time efficiency.
The times (in seconds) taken by the present invention and the comparison methods to mine candidate itemsets, frequent itemsets and association rules as the support threshold varies are shown in Tables 7 and 8; the times (in seconds) taken by the 3 algorithms to mine association rules as the confidence threshold varies are shown in Tables 9 and 10.
Experiment 4: case analysis of experimental results.
In the Chinese text test set CWT200g, 28 feature-term items were selected as the item set to be mined, as shown in Table 11. The present invention and the 2 comparison methods mine the Chinese test set (mining up to 4-itemsets) under the conditions mc = 0.1 and ms = 0.1; the association rule examples extracted from the results with the feature term "participation" as antecedent are analysed, and the results are shown in Table 12.
Table 11: feature-term examples in CWT200g
Table 12: association rule examples with "participation" as antecedent mined by the three methods
Table 12 shows that, among the association rule examples with "participation" as antecedent, the number of association rules mined by the present invention is smaller than that mined by the 2 comparison methods, and its association rule patterns are closer to reality, avoiding the generation of invalid and spurious association patterns. For example, "participation" and "taking part" are near-synonyms that should rarely occur together in a single sentence or passage, so the association rule "participation → taking part" should not be a strong association rule. The mining results of the algorithm of this paper, MAWAR-POI, contain no invalid and spurious patterns of this kind, whereas the comparison algorithms not only mine more association rule patterns, but also mine the strong association rule "participation → taking part", which is a spurious, uninformative and invalid pattern.
The above experimental results show that, compared with the comparison methods, the present invention has good mining performance, specifically as follows:
(1) Whether the support threshold or the confidence threshold varies, the numbers of candidate itemsets, frequent itemsets and association rules mined by the present invention are far smaller than those of the existing unweighted and all-weighted mining algorithms. For example, the number of candidate itemsets mined by the invention on the English NTCIR-5 dataset is 90.60% smaller than with the Apriori method and 90.49% smaller than with the MWARM method (Table 1), and on the Chinese dataset CWT200g it is 94.37% smaller than with the Apriori method and 87.29% smaller than with the MWARM method (Table 2), showing that the present invention can avoid and reduce the occurrence of many invalid association patterns.
(2) The mining time of the present invention is much shorter than that of the comparison algorithms, with a large reduction. For example, the average time for mining itemsets and association rules on the English NTCIR-5 dataset is 87.58% less than with the Apriori method and 83.56% less than with the MWARM method (Table 7), and on the Chinese dataset CWT200g it is 85.98% less than with the Apriori method and 67.60% less than with the MWARM method (Table 8), showing that the mining efficiency of the present invention is greatly improved.
(3) The experimental results of Table 12 show that the feature-term association rule patterns mined by the present invention are closer to reality.

Claims (6)

1. A method for mining association rules between Chinese and English text words based on partial-order itemsets, characterized by comprising the following steps:
(1) Chinese and English text information data preprocessing: preprocess the pending Chinese and English text information data, namely Chinese text word segmentation, English text stemming, stop-word removal, and feature term extraction and weight computation, and build a text information database and a feature-term item library based on the vector space model;
(2) mine the all-weighted frequent feature-term partial-order itemsets, comprising the following steps 2.1 and 2.2:
(2.1) mine the all-weighted frequent feature-term 1-itemsets L1; the concrete steps proceed according to 2.1.1 to 2.1.3:
(2.1.1) extract the feature-term candidate 1-itemsets C1 from the feature-term item library, accumulate the weights of all items in the text information database to obtain the total item weight W, accumulate the weight of C1 over the text information database, and compute the support poisup(C1) of C1;
(2.1.2) add to the frequent feature-term itemset set FIS those frequent 1-itemsets L1 among the feature-term candidate 1-itemsets C1 whose support satisfies poisup(C1) ≥ ms, ms being the minimum support threshold;
(2.1.3) accumulate the occurrence frequency n_C1 of the candidate 1-itemsets C1 in the text information database, extract w_r(C1), and compute the partial-order itemset weight bound POIWB(C1, 2) of C1;
(2.2) mine the all-weighted frequent feature-term k-itemsets Lk, k ≥ 2, operating according to steps 2.2.1 to 2.2.12:
(2.2.1) for the candidate (k−1)-itemsets C(k−1), prune those with w(C(k−1)) < POIWB(C(k−1), k), which cannot yield frequent k-itemsets, obtaining a new candidate C(k−1) set;
wherein w(C(k−1)) is the accumulated weight of C(k−1) in the text information database, and POIWB(C(k−1), k) is the k-itemset weight bound of the all-weighted candidate (k−1)-itemset C(k−1);
(2.2.2) apply the Apriori join to the feature-term candidate (k−1)-itemsets C(k−1) whose itemset frequency is not 0, generating the feature-term candidate k-itemsets Ck;
(2.2.3) if Ck is empty, exit step 2.2 and proceed to step (3); otherwise, if Ck is not empty, proceed to step 2.2.4;
(2.2.4) for the candidate k-itemsets Ck, examine every (k−1)-sub-itemset of Ck; if there exists a (k−1)-subset whose itemset weight is less than its corresponding partial-order itemset weight bound (w(k−1) < POIWB(C(k−1), k)), this itemset Ck must be non-frequent and is deleted from its set, obtaining a new candidate partial-order itemset poCk set;
(2.2.5) accumulate the occurrence frequency n_Ck of the candidate k-itemsets Ck in the text information database and the item weights w1(Ck), w2(Ck), …, wk(Ck), extract w_r(Ck), and compute the weight bound POIWB(Ck, k+1) of Ck;
(2.2.6) delete the candidate k-itemsets Ck whose itemset frequency is 0, obtaining a new Ck set;
(2.2.7) obtain the partial-order itemset poCk of each Ck;
(2.2.8) examine the high-order proper subsets of each partial-order itemset poCk; if a high-order proper subset of poCk is non-frequent, the partial-order itemset poCk must be non-frequent and is deleted from its set, obtaining a new candidate partial-order itemset poCk set;
(2.2.9) examine the weight of the highest-weight item of each poCk; if the weight of the highest-weight item of poCk is less than the 1-itemset minimum weight threshold minw, the partial-order itemset poCk must be non-frequent and is deleted from its set, obtaining a new candidate partial-order itemset poCk set, the formula for minw being minw = W × ms;
(2.2.10) examine the lowest-weight item of each poCk; if the weight of the lowest-weight item of poCk is not less than minw, the partial-order itemset poCk must be frequent and is added to the frequent feature-term itemset set FIS;
(2.2.11) for each remaining partial-order itemset poCk, compute its support poisup(poCk); if poisup(poCk) ≥ ms, this partial-order itemset poCk is frequent and is added to the frequent feature-term itemset set FIS;
(2.2.12) add 1 to the value of k and repeat steps 2.2.1 to 2.2.12 until Ck is empty, then exit step 2.2 and proceed to step (3) below;
(3) mine effective all-weighted feature-term strong association rule patterns from the frequent feature-term itemset set FIS, comprising the following steps:
(3.1) take a frequent feature-term itemset Li from the frequent feature-term itemset set FIS and find all proper subsets of Li;
(3.2) take any two proper subsets I1 and I2 from the proper subset set of Li such that I1 ∩ I2 = ∅ and I1 ∪ I2 = Li; if w12 ≥ (k12/k1) × w1 × mc, mine the feature-term strong association rule I1 → I2; if w12 ≥ (k12/k2) × w2 × mc, mine the feature-term strong association rule I2 → I1; where k1, k2 and k12 are the numbers of items of the itemsets I1, I2 and (I1, I2) respectively, w1, w2 and w12 are the itemset weights of I1, I2 and (I1, I2) respectively, and mc is the minimum confidence threshold;
(3.3) repeat step 3.2 until every proper subset in the proper subset set of the frequent feature-term itemset Li has been taken out once, each being taken out only once; then proceed to step 3.4;
(3.4) repeat step 3.1 until every frequent itemset Li in the frequent feature-term itemset set has been taken out once, each being taken out only once; then step (3) ends.
At this point, the mining of all-weighted feature-term association rule patterns ends.
2. A mining system for association rules between Chinese and English text words based on partial-order itemsets, adapted to the method of claim 1, characterized by comprising the following 4 modules:
a text information preprocessing module, for preprocessing the pending Chinese and English text data, namely Chinese text word segmentation, English text stemming, stop-word removal, and feature term extraction and weight computation, and building a text information database and a feature-term item library based on the vector space model;
a frequent feature-term partial-order itemset generation module, for mining candidate all-weighted feature-term partial-order itemsets from the text information database, pruning the candidate partial-order itemsets with the new pruning method to obtain the final candidate partial-order itemsets, and deriving the all-weighted frequent feature-term partial-order itemset patterns from the candidate partial-order itemsets by the new partial-order itemset support computation method;
an all-weighted feature-term association rule generation module, which, through simple computation and comparison of itemset weights and dimensions, mines effective all-weighted feature-term strong association rule patterns I1 → I2 from the all-weighted frequent feature-term partial-order itemsets (I1, I2);
an association rule pattern result display module, for presenting the effective all-weighted feature-term strong association rule patterns to the user in the form the user prefers, for the user to analyse, select and use.
3. The mining system according to claim 2, characterized in that the text information preprocessing module comprises the following 2 modules:
Chinese and English text preprocessing module: performs word segmentation on Chinese text and removes Chinese stop words, and performs stemming on English text and removes English stop words, among other Chinese and English corpus preprocessing tasks;
Text database and item library building module: mainly performs Chinese and English feature-word extraction and weight calculation, and builds the text information database and the Chinese and English feature-word item library based on the vector space model.
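The preprocessing and weighting pipeline of the two modules above can be sketched roughly as follows. The stop-word list, the tf-idf weighting formula, and all names here are illustrative assumptions (the claim does not fix a particular weighting scheme), and Chinese word segmentation and English stemming are stubbed out.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}   # placeholder stop-word list

def preprocess(text):
    """English branch of the text preprocessing module: lowercase,
    tokenize, drop stop words. (A full system would also segment
    Chinese text and stem English tokens here.)"""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_weighted_db(docs):
    """Text database and item library building module: one weight
    vector per document over the feature-word item library, using a
    common tf-idf weighting as a stand-in for the vector space model
    weights."""
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    database = []
    for doc in tokenized:
        tf = Counter(doc)
        database.append({t: tf[t] * math.log((n + 1) / (df[t] + 1))
                         for t in tf})
    return database, sorted(df)   # weighted text database, item library
```

A term occurring in every document (such as a near-stop word) gets weight zero under this smoothing, which is the intended behavior for feature-word weighting.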
4. The mining system according to claim 2, characterized in that the feature-word frequent partial-order itemset generation module comprises the following 3 modules:
Feature-word candidate partial-order itemset generation module: mainly mines feature-word candidate partial-order itemsets from the text information database, as follows: extract the candidate 1-itemsets from the feature-word item library, accumulate the weight sums of the candidate 1-itemsets over the text information database, calculate their supports, and obtain the completely weighted feature-word frequent 1-itemsets; then generate the feature-word candidate k-itemsets from the completely weighted frequent (k-1)-itemsets by the Apriori join, where k ≥ 2; accumulate the item weights of each item of the feature-word candidate k-itemsets over the text information database to obtain the completely weighted feature-word candidate partial-order k-itemsets;
Feature-word candidate partial-order itemset pruning module: prunes the completely weighted feature-word candidate partial-order k-itemsets with the pruning method of the present invention, deleting the candidate partial-order k-itemsets that cannot be frequent, to obtain the final set of candidate partial-order k-itemsets that may be frequent;
Feature-word frequent partial-order itemset generation module: mainly mines the final candidate partial-order k-itemsets obtained after pruning by the above module: it calculates the supports of the candidate partial-order k-itemsets with the support calculation method of the present invention, compares them with the minimum support threshold, and obtains the completely weighted feature-word frequent partial-order k-itemsets.
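The three modules above can be sketched together as follows. The patent's own weight-based pruning rule and support formula are not reproduced here; this sketch uses the classic Apriori join with subset pruning and one plausible reading of a "completely weighted" support (accumulated weight of the itemset's items over documents containing all of them, normalized by the total weight), with all names being assumptions.

```python
from itertools import combinations

def apriori_join(prev_frequent, k):
    """Apriori join: candidate k-itemsets whose (k-1)-subsets are all
    frequent (the classic form of the pruning step above)."""
    candidates = set()
    prev = set(prev_frequent)
    for a in prev:
        for b in prev:
            u = a | b
            if len(u) == k and all(frozenset(s) in prev
                                   for s in combinations(u, k - 1)):
                candidates.add(u)
    return candidates

def weighted_support(itemset, db, total_weight):
    """Accumulated weight of the itemset's items over the documents
    that contain all of them, normalized by the database's total
    weight. One plausible reading of the claim's support method, not
    the patented formula itself. `db` is a list of dicts mapping a
    feature word to its weight in that document."""
    w = sum(doc[t] for doc in db for t in itemset if itemset <= doc.keys())
    return w / total_weight if total_weight else 0.0

def frequent_k_itemsets(prev_frequent, k, db, ms):
    """Keep the candidates whose weighted support reaches the minimum
    support threshold ms."""
    total = sum(sum(doc.values()) for doc in db)
    return {c for c in apriori_join(prev_frequent, k)
            if weighted_support(c, db, total) >= ms}
```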
5. The mining system according to claim 2, characterized in that the completely weighted feature-word association rule generation module comprises the following 2 modules:
Proper-subset generation module of feature-word frequent partial-order itemsets: mainly generates all proper subsets of the feature-word frequent partial-order itemsets, and obtains the itemset weight and the dimension of each proper subset;
Completely weighted feature-word association rule generation module: through simple calculation and comparison of itemset weights, mines the valid completely weighted feature-word strong association rule patterns from the feature-word frequent partial-order itemsets.
6. The mining system according to any one of claims 2-5, characterized in that the minimum support threshold ms and the minimum confidence threshold mc in the mining system are input by the user.
CN201410427491.8A 2014-08-27 2014-08-27 Association rule mining method and its system between Sino-British text word based on partial order item collection Expired - Fee Related CN104182527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410427491.8A CN104182527B (en) 2014-08-27 2014-08-27 Association rule mining method and its system between Sino-British text word based on partial order item collection


Publications (2)

Publication Number Publication Date
CN104182527A true CN104182527A (en) 2014-12-03
CN104182527B CN104182527B (en) 2017-07-18

Family

ID=51963566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410427491.8A Expired - Fee Related CN104182527B (en) 2014-08-27 2014-08-27 Association rule mining method and its system between Sino-British text word based on partial order item collection

Country Status (1)

Country Link
CN (1) CN104182527B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147688A1 (en) * 2001-09-04 2008-06-19 Frank Beekmann Sampling approach for data mining of association rules
CN103279570A (en) * 2013-06-19 2013-09-04 广西教育学院 Text database oriented matrix weighting negative pattern mining method
CN103838854A (en) * 2014-03-14 2014-06-04 广西教育学院 Completely-weighted mode mining method for discovering association rules among texts
CN103955542A (en) * 2014-05-20 2014-07-30 广西教育学院 Method of item-all-weighted positive or negative association model mining between text terms and mining system applied to method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Huang Mingxuan et al.: "A completely weighted association rule mining algorithm based on double pruning", Information Studies: Theory & Application *
Huang Mingxuan et al.: "A completely weighted inter-word association rule mining algorithm based on text databases", Journal of Guangxi Normal University (Natural Science Edition) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715073B (en) * 2015-04-03 2017-11-24 江苏物联网研究发展中心 Based on the association rule mining system for improving Apriori algorithm
CN104715073A (en) * 2015-04-03 2015-06-17 江苏物联网研究发展中心 Association rule mining system based on improved Apriori algorithm
CN106383883A (en) * 2016-09-18 2017-02-08 广西财经学院 Matrix weighted association mode-based Indonesian and Chinese cross-language retrieval method and system
CN106484781A (en) * 2016-09-18 2017-03-08 广西财经学院 Indonesia's Chinese cross-language retrieval method of fusion association mode and user feedback and system
CN106484781B (en) * 2016-09-18 2019-03-15 广西财经学院 Merge the Indonesia's Chinese cross-language retrieval method and system of association mode and user feedback
CN106383883B (en) * 2016-09-18 2019-04-16 广西财经学院 Indonesia's Chinese cross-language retrieval method and system based on matrix weights association mode
CN107562904B (en) * 2017-09-08 2019-07-09 广西财经学院 Positive and negative association mode method for digging is weighted between fusion item weight and the English words of frequency
CN107562904A (en) * 2017-09-08 2018-01-09 广西财经学院 Positive and negative association mode method for digging is weighted between the English words of fusion item weights and frequency
CN108563735A (en) * 2018-04-10 2018-09-21 国网浙江省电力有限公司 One kind being based on the associated data sectioning search method of word
CN109684464A (en) * 2018-12-30 2019-04-26 广西财经学院 Compare across the language inquiry extended method of implementation rule consequent excavation by weight
CN109783628A (en) * 2019-01-16 2019-05-21 福州大学 The keyword search KSAARM algorithm of binding time window and association rule mining
CN109783628B (en) * 2019-01-16 2022-06-21 福州大学 Method for searching KSAARM by combining time window and association rule mining
CN110619073A (en) * 2019-08-30 2019-12-27 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN110619073B (en) * 2019-08-30 2022-04-22 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN112527953A (en) * 2020-11-20 2021-03-19 出门问问(武汉)信息科技有限公司 Rule matching method and device
CN112527953B (en) * 2020-11-20 2023-06-20 出门问问创新科技有限公司 Rule matching method and device
CN113254755A (en) * 2021-07-19 2021-08-13 南京烽火星空通信发展有限公司 Public opinion parallel association mining method based on distributed framework
CN113254755B (en) * 2021-07-19 2021-10-08 南京烽火星空通信发展有限公司 Public opinion parallel association mining method based on distributed framework

Also Published As

Publication number Publication date
CN104182527B (en) 2017-07-18

Similar Documents

Publication Publication Date Title
CN104182527A (en) Partial-sequence itemset based Chinese-English test word association rule mining method and system
CN103514183B (en) Information search method and system based on interactive document clustering
CN103955542B (en) Method of item-all-weighted positive or negative association model mining between text terms and mining system applied to method
CN104216874B (en) Positive and negative mode excavation method and system are weighted between the Chinese word based on coefficient correlation
Luo et al. A parallel dbscan algorithm based on spark
CN104317794A (en) Chinese feature word association pattern mining method based on dynamic project weight and system thereof
CN107832467A (en) A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN103440308B (en) A kind of digital thesis search method based on form concept analysis
Wenli Application research on latent semantic analysis for information retrieval
CN111897926A (en) Chinese query expansion method integrating deep learning and expansion word mining intersection
CN103678642A (en) Concept semantic similarity measurement method based on search engine
CN109739952A (en) Merge the mode excavation of the degree of association and chi-square value and the cross-language retrieval method of extension
CN104239430A (en) Item weight change based method and system for mining education data association rules
Du et al. An overview of dynamic data mining
CN111259117B (en) Short text batch matching method and device
Lu et al. Research on text classification based on TextRank
CN109684465B (en) Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison
Duan et al. Error correction for search engine by mining bad case
CN111897928A (en) Chinese query expansion method for embedding expansion words into query words and counting expansion word union
Xu An Apriori algorithm to improve teaching effectiveness
CN108170778A (en) Rear extended method is translated across language inquiry by China and Britain based on complete Weighted Rule consequent
Gui et al. Topic modeling of news based on spark Mllib
Xiaohu et al. A Fast Search Algorithm Based on Agent Association Rules
Hu et al. Graphsdh: a general graph sampling framework with distribution and hierarchy
He et al. Enterprise human resources information mining based on improved Apriori algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160325

Address after: No. 100 Mingxiu West Road, Nanning 530003, the Guangxi Zhuang Autonomous Region

Applicant after: Guangxi University of Finance and Economics

Address before: No. 37 Jianzheng Road, Nanning 530023, the Guangxi Zhuang Autonomous Region

Applicant before: Guangxi College of Education

GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170718

Termination date: 20180827