CN102662948A - Data mining method for quickly finding utility pattern - Google Patents


Info

Publication number
CN102662948A
Authority
CN
China
Prior art keywords
pattern
node
prefix
item
utility
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100425708A
Other languages
Chinese (zh)
Inventor
刘君强
蒋晓宁
甘志刚
余斌霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN2012100425708A priority Critical patent/CN102662948A/en
Publication of CN102662948A publication Critical patent/CN102662948A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data mining method for quickly finding utility patterns discovers, from massive data, patterns that not only have significant statistical characteristics but also meet user expectations and goals, and it has wide application in network information search and knowledge discovery. Existing methods adopt a two-phase approach that generates candidate patterns, which incurs high time and space overhead. To solve this problem, the present invention provides three innovative techniques. The first is a data representation based on a sparse matrix and virtual projection; the second is a prefix growth strategy, a prefix growth tree and a pruning method for that tree; the third is a depth-first dynamic search method. Combining the three techniques yields a novel single-phase mining method that mines utility patterns without generating candidate patterns. The time efficiency of the method is one to three orders of magnitude higher than that of three reference mining methods, and memory usage is reduced by 40% to 90%. The method has high performance and supports applications such as massive Web mining, multimedia mining and text mining.

Description

A data mining method for quickly finding utility patterns
Technical field
The present invention relates to the field of intelligent information processing. It provides a utility pattern mining method that discovers, from massive data, patterns that both have significant statistical characteristics and meet user expectations and goals. The method has broad application prospects in mining massive data, in particular in network information search and knowledge discovery, including Web mining, text mining and multimedia mining.
Background art
Traditional data mining techniques, in particular frequent pattern mining [1] [2], analyze data mainly by statistical significance, for example by mining frequently purchased product combinations from supermarket sales data. They do not consider the user's expectations or goals; a user may, for instance, be interested in product combinations that yield high profit. In other words, data mining should consider not only the statistical significance of the data but also the user's interests or goals [3]. Utility pattern mining emerged as a recent development of frequent pattern mining [4] [5] [6] [7] [8].
However, utility pattern mining is not yet mature. Only a few results exist, and all of them adopt a two-phase method. The two-phase method TP was proposed by Liu et al. [4]. The first phase relies on the downward closure property of the transaction-weighted utility (TWU): it finds the patterns with high TWU and so generates a set of candidate patterns. The second phase scans the database again, computes the actual utility of each candidate pattern, and outputs the patterns whose utility is higher than the given threshold. Li et al. [5] proposed an isolated items discarding strategy for the level-wise mining of candidate patterns in the first phase. It reduces unnecessary candidate patterns and also improves efficiency, because the candidates of each level can be computed on a progressively smaller data set.
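The two-phase idea can be illustrated with a minimal sketch. The toy database, the external utilities and the helper names (`tu`, `twu`, `u`, `two_phase`) are hypothetical and are not taken from the patent or from [4]; the sketch also enumerates all itemsets by brute force rather than level by level, so it only shows why TWU-based candidate generation tends to produce more candidates than final results.

```python
from itertools import combinations

# Hypothetical toy data, not the patent's Fig. 2 / Fig. 3.
EU = {"a": 2, "b": 3, "c": 1, "d": 5}    # external utility eu(i)
D = {                                     # share iu(i, t) per transaction
    "t1": {"a": 1, "b": 2},
    "t2": {"a": 2, "c": 4, "d": 1},
    "t3": {"b": 1, "c": 2, "d": 2},
    "t4": {"a": 1, "b": 1, "d": 1},
}

def supports(t, X):
    """True if transaction t contains every item of pattern X."""
    return all(i in D[t] for i in X)

def tu(t):
    """Transaction utility: total utility of all items in t."""
    return sum(q * EU[i] for i, q in D[t].items())

def twu(X):
    """Transaction-weighted utility: sum of tu(t) over transactions supporting X."""
    return sum(tu(t) for t in D if supports(t, X))

def u(X):
    """Actual utility of X over the transactions that support it."""
    return sum(D[t][i] * EU[i] for t in D if supports(t, X) for i in X)

def two_phase(minutil):
    items = sorted(EU)
    patterns = [c for r in range(1, len(items) + 1)
                for c in combinations(items, r)]
    # Phase 1: keep every pattern whose TWU (an overestimate of u) reaches minutil.
    candidates = [X for X in patterns if twu(X) >= minutil]
    # Phase 2: rescan to compute the exact utility of each candidate.
    high = [X for X in candidates if u(X) >= minutil]
    return candidates, high
```

With `minutil = 14` this toy database yields ten candidates in phase one but only five patterns survive phase two, which is the overhead the single-phase method described later aims to avoid.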
Recently, to avoid the multiple database scans needed for level-wise candidate generation [4] [5] and to generate candidates efficiently in the first phase, several research groups proposed tree-based utility pattern mining methods [6] [7] [8]. Erwin et al. [6] proposed the CTU-PROL mining method, which uses the TWU downward closure property [4] and mines on the basis of the utility pattern tree CUP-tree and FP-Growth [2]. Ahmed et al. [7] proposed the IHUP mining method, which stores the TWU information of each transaction in an IHUP-tree and adapts FP-Growth [2] to mine the candidate set of utility patterns. CTU-PROL [6] and IHUP [7] generate the same number of candidate patterns in the first phase as TP [4]. Tseng et al. [8] designed another tree-based method, UPG, which compresses the utility information of transactions into a UP-tree and proposes strategies for discarding and decreasing tree node utilities to tighten the TWU downward closure property, thereby generating fewer candidate patterns.
However, none of the existing results leaves the two-phase framework, although some work [5] [8] tries to reduce the number of candidate patterns generated in the first phase. When the database contains long transactions or the given utility threshold is small, the number of candidate patterns is still huge. This causes excessive storage overhead and a scalability bottleneck in the first phase, and likewise in the second phase, and ultimately results in low running-time efficiency.
To overcome the defects of previous mining methods, the present invention proposes the following three innovative techniques, which break away from the two-phase framework, and designs "a data mining method for quickly finding utility patterns" that resolves the bottlenecks in scalability and efficiency.
The first is a data representation based on a sparse matrix and virtual projection. Specifically, a sparse matrix expresses the complete utility information of every transaction, which makes single-phase mining possible. This sparse matrix representation is more compact than the representations based on FP-tree [2] used in [6] [7] [8], and it avoids multiple database scans [4] [5]. With virtual projection, the utility value of an arbitrary pattern is computed without any additional storage overhead.
The second is the prefix growth strategy, the prefix growth tree and its pruning method. The prefix growth strategy and the corresponding prefix growth tree guide the mining process and are supported by a technique for pruning the utility pattern search space: by estimating an upper bound on the utility values in any subspace, the prefix growth tree can be pruned effectively.
The third is the depth-first dynamic search method. While the prefix growth tree is searched for utility patterns, the branch currently being searched is constructed depth-first. Neither the complete prefix growth tree nor the utility patterns found need to be kept in memory, which further reduces storage overhead.
The time efficiency of the mining method of the present invention is 1 to 3 orders of magnitude higher than that of three reference mining methods [4] [7] [8], and its memory usage is 40% to 90% lower. The method has high performance and can be widely used in applications such as massive Web mining, multimedia mining and text mining.
References:
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules [A]. In Proc. of VLDB 1994 [C]. 1994, 487-499.
[2] J. Han, J. Pei, Y. Yin. Mining frequent patterns without candidate generation [A]. In Proc. of ACM SIGMOD 2000 [C]. Dallas, USA, 2000, 1-12.
[3] H. Yao, H. J. Hamilton, L. Geng. A unified framework for utility-based measures for mining itemsets [A]. In Proc. of ACM SIGKDD 2nd Workshop on Utility-Based Data Mining [C]. 2006, 28-37.
[4] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm [A]. In Proc. of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD [C]. 2005, 253-262.
[5] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets [J]. Data & Knowledge Engineering, 2008, 64(1): 198-217.
[6] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets [A]. In Proc. of PAKDD 2008 [C]. 2008, 554-561.
[7] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee. Efficient tree structures for high utility pattern mining in incremental databases [J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(12): 1708-1721.
[8] V. S. Tseng, C.-W. Wu, B.-E. Shie, P. S. Yu. UP-Growth: an efficient algorithm for high utility itemset mining [A]. In Proc. of the 16th ACM SIGKDD [C]. 2010, 253-262.
Summary of the invention
The object of the present invention is to design a mining method that finds (high) utility patterns in a transaction database with minimal memory and maximal speed, so as to enable knowledge discovery in massive data.
The present invention, "a data mining method for quickly finding utility patterns", comprises three core techniques: A, a data representation based on a sparse matrix and virtual projection; B, a prefix growth strategy, a prefix growth tree and its pruning method; C, a depth-first dynamic search method.
Given a transaction database D, a utility information table UT and a utility threshold minutil, the mining method finds the patterns whose utility value is not less than minutil.
Part one of the invention:
Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of all items and $D = \{t_1, t_2, \ldots, t_n\}$ be the database, i.e. the set of transactions. Each transaction $t$ is an itemset, i.e. $t \subseteq I$.
$u(i, t) = iu(i, t) \cdot eu(i)$ is the utility value of item $i$ in transaction $t$, where $iu(i, t)$ is the share of item $i$ in transaction $t$ and $eu(i)$ is the external utility of item $i$, which is independent of any transaction. A pattern $X$ is a subset of $I$. If every item $i$ in $X$ has a nonzero share in transaction $t$, i.e. $iu(i, t) \neq 0$, then $X$ is supported by $t$, i.e. $X \subseteq t$.
$u(X, t)$ is the utility value of pattern $X$ in transaction $t$, with value $u(X, t) = \sum_{i \in X} u(i, t)$. $TS(X)$ is the set of all transactions that support pattern $X$.
$u(X)$ is the total utility of pattern $X$, the sum of the utilities of $X$ over all transactions that support $X$, computed by formula (A):
$$u(X) = \sum_{t \in TS(X)} u(X, t) = \sum_{t \in TS(X)} \sum_{i \in X} u(i, t) \qquad \text{Formula (A)}$$
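The definitions above translate almost directly into code. The following is a minimal sketch on a hypothetical toy database (the values of `EU` and `D` and the function names are illustrative assumptions, not the data of Fig. 2 and Fig. 3); it computes the item utility, the supporting transaction set and the total utility of formula (A).

```python
# Hypothetical toy data, not the patent's Fig. 2 / Fig. 3.
EU = {"a": 2, "b": 3, "c": 1, "d": 5}    # external utility eu(i)
D = {                                     # share iu(i, t); absent items have share 0
    "t1": {"a": 1, "b": 2},
    "t2": {"a": 2, "c": 4, "d": 1},
    "t3": {"b": 1, "c": 2, "d": 2},
    "t4": {"a": 1, "b": 1, "d": 1},
}

def u_item(i, t):
    """u(i, t) = iu(i, t) * eu(i)."""
    return D[t].get(i, 0) * EU[i]

def supporting(X):
    """TS(X): transactions in which every item of X has a nonzero share."""
    return [t for t in D if all(D[t].get(i, 0) != 0 for i in X)]

def u(X):
    """Formula (A): sum of u(i, t) over t in TS(X) and i in X."""
    return sum(u_item(i, t) for t in supporting(X) for i in X)
```

For this toy data, `supporting({"a", "d"})` is `["t2", "t4"]` and `u({"a", "d"})` is 16, that is (4 + 5) from t2 plus (2 + 5) from t4.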
A. Data representation based on a sparse matrix and virtual projection
A1. A sparse matrix implemented with linear linked lists merges and expresses the database D and the utility information table UT, that is, the transaction set TS({ }) supporting the empty pattern { } together with its complete utility information. In this matrix, rows are laid out by item, in the order Ω given by the descending total utility upper bound ubound_1(i, { }) of each item i; columns are laid out by transaction, in the natural order of the transactions in the database. The element in row i and column t is the utility of item i in transaction t, i.e. u(i, t) = iu(i, t) · eu(i). The concrete steps are as follows:
A1.1 Scan the database D for the first time and, using the utility information table UT, compute the utility upper bound ubound_1(i, { }) of each item i by formula (B1.1) of "Summary of the invention B1"; sorting the items by descending bound yields the item order Ω.
A1.2 Scan the database D for the second time. For each transaction t read in, build a linear linked list in which each element stores the utility value u(i, t) of one item i in t, with the elements arranged in Ω order. This list holds all nonzero elements of column t of the sparse matrix; it is called the column linked list and is denoted Φ(t).
A1.3 Link the elements of the column linked lists (that is, the nonzero elements of the sparse matrix) by row, with Ω doubling as the row header table of the sparse matrix. The row header entry Ω(i) points to the first nonzero element of row i. The elements linked from Ω(i) form a row linked list. If the row linked list of Ω(i) and a column linked list Φ(t) share an element, Φ(t) is said to be a column linked list threaded by Ω(i).
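Steps A1.1 to A1.3 can be sketched as a two-scan construction. The sketch below makes several assumptions that are not in the source: the toy data, the `Cell` class and all function names are hypothetical; for the empty pattern, ubound_1(i, { }) is taken to be the summed total utility of the transactions containing i; ties in Ω are broken by item name; and each column list Φ(t) is a Python list standing in for a linked list, while the row lists are real links.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy data, not the patent's Fig. 2 / Fig. 3.
EU = {"a": 2, "b": 3, "c": 1, "d": 5}
D = {
    "t1": {"a": 1, "b": 2},
    "t2": {"a": 2, "c": 4, "d": 1},
    "t3": {"b": 1, "c": 2, "d": 2},
    "t4": {"a": 1, "b": 1, "d": 1},
}

@dataclass
class Cell:
    """One nonzero matrix element u(i, t)."""
    item: str
    tid: str
    utility: int
    next_in_row: Optional["Cell"] = None   # link along the row of `item`

def build_matrix(db, eu):
    # Scan 1 (A1.1): upper bound per item, then the order Omega.
    bound = {}
    for shares in db.values():
        total = sum(q * eu[i] for i, q in shares.items())
        for i in shares:
            bound[i] = bound.get(i, 0) + total
    omega = sorted(bound, key=lambda i: (-bound[i], i))
    rank = {i: r for r, i in enumerate(omega)}
    # Scan 2 (A1.2, A1.3): column lists in Omega order, threaded by row.
    phi, heads, tails = {}, {i: None for i in omega}, {}
    for t, shares in db.items():
        column = [Cell(i, t, q * eu[i])
                  for i, q in sorted(shares.items(), key=lambda kv: rank[kv[0]])]
        phi[t] = column
        for cell in column:
            if heads[cell.item] is None:
                heads[cell.item] = cell            # row header Omega(i)
            else:
                tails[cell.item].next_in_row = cell
            tails[cell.item] = cell
    return omega, phi, heads

def row_tids(head):
    """Walk a row linked list and return the transactions it threads."""
    tids = []
    while head is not None:
        tids.append(head.tid)
        head = head.next_in_row
    return tids
```

Calling `build_matrix(D, EU)` on the toy data gives Ω = d, b, a, c, and walking the row list of d with `row_tids` visits t2, t3 and t4, the columns threaded by Ω(d).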
A2. The transaction set TS(X) supporting an arbitrary pattern X is obtained by virtual projection. In the reverse of the item order Ω, for each item i in X, the columns that hold a nonzero element in row i of the current submatrix are selected to form a new submatrix; the submatrix finally obtained is TS(X). Because this submatrix is embedded in the original matrix that represents the whole database, it needs no separate storage. The concrete steps are as follows:
A2.1 Take each item i in X in reverse Ω order.
A2.1.1 Empty the row header entries that precede i, i.e. set Ω(k) to empty for every k < i.
A2.1.2 For each column linked list Φ(t) threaded by Ω(i), add every element k of Φ(t) that precedes i to the row linked list of Ω(k).
A2.2 Let i_0 be the first element of X in Ω order. The submatrix formed by all column linked lists threaded by Ω(i_0) is TS(X), the set of transactions supporting pattern X.
The key point of part one of the invention is that TS(X) is embedded in TS({ }). It is therefore a virtual projection that needs no separate storage, which greatly improves the space scalability of the mining method.
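The re-threading of steps A2.1 to A2.2 can be sketched as follows. This is a simplified illustration under stated assumptions: `OMEGA`, `PHI` and the function names are hypothetical, the row lists are plain Python lists of transaction ids, and they are rebuilt on every call, whereas the method described above rewrites the links in place inside the original matrix so that no extra storage is used.

```python
# Hypothetical item order and column lists (utilities u(i, t), items in Omega order).
OMEGA = ["d", "b", "a", "c"]
PHI = {
    "t1": {"b": 6, "a": 2},
    "t2": {"d": 5, "a": 4, "c": 4},
    "t3": {"d": 10, "b": 3, "c": 2},
    "t4": {"d": 5, "b": 3, "a": 2},
}

def project(X):
    """Return TS(X) as the transactions threaded by Omega(i_0) after projection."""
    if not X:
        return list(PHI)                      # empty pattern: every transaction
    rank = {i: r for r, i in enumerate(OMEGA)}
    # Row lists of the unprojected matrix: Omega(i) threads every column holding i.
    rows = {i: [t for t in PHI if i in PHI[t]] for i in OMEGA}
    for i in sorted(X, key=rank.get, reverse=True):    # A2.1: reverse Omega order
        threaded = rows[i]
        for k in OMEGA[:rank[i]]:                      # A2.1.1: empty earlier headers
            rows[k] = []
        for t in threaded:                             # A2.1.2: re-thread earlier items
            for k in PHI[t]:
                if rank[k] < rank[i]:
                    rows[k].append(t)
    return rows[min(X, key=rank.get)]                  # A2.2: columns under Omega(i_0)

def utility(X):
    """u(X) by formula (A), read off the projected columns."""
    return sum(PHI[t][i] for t in project(X) for i in X)
```

For example, `project({"b", "d"})` first restricts the row of d to the columns containing b and then returns what Ω(d) threads, namely t3 and t4, so `utility({"b", "d"})` is 13 + 8 = 21.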
Part two of the invention:
A pattern X is called a (high) utility pattern if its utility value is not less than the given threshold minutil, i.e. u(X) ≥ minutil. Utility pattern mining finds all (high) utility patterns, that is, it solves for the set
$\{X \mid X \subseteq I,\ u(X) \ge minutil\}$
The basic idea of mining utility patterns is to enumerate each pattern, compute its utility value and check whether it reaches the threshold minutil. The present invention proposes a prefix growth strategy for pattern enumeration, which amounts to constructing a prefix growth tree, and prunes the prefix growth tree with a method based on utility upper bounds.
B. Prefix growth strategy, prefix growth tree and pruning method
Under the item order Ω used to build the sparse matrix, a pattern can also be expressed as an ordered sequence. For example, {a, b, c} can be written <a, b, c> if Ω happens to coincide with lexicographic order. The set notation and the sequence notation can therefore be used interchangeably, and the set union ∪ can also denote the concatenation of two sequences, e.g. <a> ∪ <b, c, d> = <a, b, c, d>. The present invention enumerates patterns by prepending a prefix to one pattern to obtain another; for example, prepending the prefix <a> to <b, c, d> gives <a, b, c, d>. Specifically, the prefix growth strategy prepends a prefix to the empty pattern to obtain patterns of length 1, prepends a prefix to patterns of length 1 to obtain patterns of length 2, and so on.
Enumerating patterns with the prefix growth strategy amounts to constructing a prefix growth tree (PGT). Each PGT node, written node, represents an item, denoted node.item. The root represents the "blank" item and corresponds to the empty pattern { }. The set of all items on the path from node to the root represents a pattern, denoted node.pattern. For each item i that precedes node.item in Ω order, node may have a child node child with child.item = i.
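A short sketch may help to see that prefix growth visits every pattern exactly once. The function name `prefix_growth` and the three-item order used in the usage note are hypothetical; the sketch performs no pruning and simply lists the patterns of the full prefix growth tree, using an explicit stack.

```python
def prefix_growth(omega):
    """Enumerate all patterns over `omega` by repeatedly prepending a prefix item.

    A pattern is a tuple in Omega order; its children are obtained by
    prepending each item that precedes its first item in Omega.
    """
    stack = [()]                 # start from the empty pattern (the PGT root)
    visited = []
    while stack:
        pattern = stack.pop()
        visited.append(pattern)
        # Items allowed as a new prefix: those before the current first item.
        limit = omega.index(pattern[0]) if pattern else len(omega)
        for i in reversed(omega[:limit]):
            stack.append((i,) + pattern)
    return visited
```

With `omega = ["a", "b", "c"]` the call returns eight patterns, the empty pattern plus the seven nonempty subsets, each written in Ω order; the node for c has children a and b, while the node for a has none.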
B1. The PGT node node is pruned as follows.
B1.1 Compute the utility upper bound of each item in the prefix growth subtree rooted at node. That is, for each item i that precedes node.item, compute by formula (B1.1) the utility upper bound of i in the prefix growth subtree of the pattern node.pattern:
$$ubound_1(i, X) = \sum_{t \in TS(X),\ i \in t} u(preEXT(X, t), t) \qquad \text{Formula (B1.1)}$$
where X = node.pattern, TS(X) is determined by "Summary of the invention A2", and preEXT(X, t) is the pattern obtained by merging all items of transaction t that precede X in Ω order into a prefix and prepending it to X. In detail:
B1.1.1 For each transaction t in TS(X) determined by "Summary of the invention A2", i.e. each column linked list Φ(t): if Φ(t) has an element storing item i, add up the utility values in t of all items in X and of all items that precede X in Ω order, that is, the utility values of the corresponding elements of Φ(t). The total over all such t is ubound_1(i, X).
B1.1.2 If ubound_1(i, X) < minutil, item i cannot appear in any (high) utility pattern as part of a prefix of X = node.pattern and is therefore marked as an "irrelevant" item; otherwise it is marked as a "useful" item.
B1.2 Compute by formula (B1.2) the utility upper bound of the prefix patterns that may become (high) utility patterns when grown in the PGT subtree rooted at node.
$$ubound_2(X) = \sum_{t \in TS(X)} u(ppEXT(X, t), t) \qquad \text{Formula (B1.2)}$$
where X = node.pattern, TS(X) is determined by "Summary of the invention A2", and ppEXT(X, t) is obtained from preEXT(X, t) by discarding the "irrelevant" items marked in "Summary of the invention B1.1.2". In detail:
B1.2.1 For each transaction t in TS(X) determined by "Summary of the invention A2", i.e. each column linked list Φ(t), add up the corresponding elements (utility values) of Φ(t) for all "useful" items that precede X in Ω order and for all items in X; the total is ubound_2(X).
B1.3 Perform the pruning. If the prefix pattern utility upper bound computed in "Summary of the invention B1.2.1" is below the threshold, i.e.
$$ubound_2(node.pattern) < minutil \qquad \text{Formula (B1.3)}$$
then the prefix growth subtree rooted at node is pruned, because it cannot contain any (high) utility pattern.
The key point of part two of the invention is that computing utility upper bounds and pruning the prefix growth subtrees that cannot contain (high) utility patterns effectively shrinks the search space and greatly improves the time efficiency of the mining method.
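The two bounds of B1 can be sketched on a small example. Everything named here is a hypothetical illustration: `UTIL` holds utility values u(i, t) directly, `OMEGA` is an assumed item order, and `prune_bounds` follows the textual description of B1.1.1 and B1.2.1 (the source formula image for (B1.1) is not available, so the summation set is an interpretation of that text). TS(X) is found by a plain scan rather than by virtual projection.

```python
# Hypothetical utilities u(i, t) and item order, not the patent's figures.
OMEGA = ["d", "b", "a", "c"]
UTIL = {
    "t1": {"b": 6, "a": 2},
    "t2": {"d": 5, "a": 4, "c": 4},
    "t3": {"d": 10, "b": 3, "c": 2},
    "t4": {"d": 5, "b": 3, "a": 2},
}

def prune_bounds(X, minutil):
    """Return (ubound_1 per item, useful items, ubound_2) for pattern X."""
    rank = {i: r for r, i in enumerate(OMEGA)}
    first = min((rank[i] for i in X), default=len(OMEGA))
    before = OMEGA[:first]                         # items that precede X in Omega
    ts = [t for t in UTIL if all(i in UTIL[t] for i in X)]
    # B1.1.1: for transactions holding i, add up X and everything before X.
    ub1 = {
        i: sum(sum(v for k, v in UTIL[t].items() if k in X or rank[k] < first)
               for t in ts if i in UTIL[t])
        for i in before
    }
    # B1.1.2: items whose bound reaches minutil are "useful".
    useful = [i for i in before if ub1[i] >= minutil]
    # B1.2.1: add up X and the useful items only.
    ub2 = sum(sum(v for k, v in UTIL[t].items() if k in X or k in useful)
              for t in ts)
    return ub1, useful, ub2
```

For X = {c} and `minutil = 14`, the bounds are 28 for d, 15 for b and 13 for a, so a is marked irrelevant; ubound_2({c}) is then 9 + 15 = 24, which is not below the threshold, so the subtree of c would be kept.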
Part three of the invention:
C. Depth-first dynamic search method
Searching the prefix growth tree is the same process as constructing it. If it is carried out in depth-first order, the pruning of "Summary of the invention B1" removes the prefix growth subtrees that cannot contain (high) utility patterns, and subtrees whose search is finished can be discarded promptly, so that memory holds only the branch currently being searched rather than the whole tree. This realizes the dynamic search. The concrete steps are as follows:
C1 Construct the root node root of the prefix growth tree PGT and set root.item to empty. Create a stack traversal for the depth-first dynamic search and push root onto traversal.
C2 While traversal is not empty, i.e. while it still holds a node, do:
C2.1 Pop the top node from traversal into node.
C2.2 If node has a right sibling, push this sibling onto traversal.
C2.3 Perform the virtual projection of "Summary of the invention A2" to obtain the transaction set TS(node.pattern) supporting node.pattern.
C2.4 If u(node.pattern) ≥ minutil, node.pattern is a (high) utility pattern.
C2.5 Perform the pruning of "Summary of the invention B1".
C2.6 If node is not pruned, create a child node child of node for each "useful" item i marked in "Summary of the invention B1.1.2", with child.item = i. Push the first child (in left-to-right order) onto traversal.
C2.7 If node has no child, remove node and its ancestors one by one along the path from node to root, up to the first ancestor that has other children.
The key point of part three of the invention is that depth-first search further improves the space scalability of the mining method on top of "Summary of the invention A", making it possible to mine (high) utility patterns directly without generating candidate patterns.
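The overall single-phase search can be sketched end to end. This is a simplified illustration, not the patented implementation: the toy utilities and the function `mine` are hypothetical; TS(X) is recomputed by scanning instead of by virtual projection; all children of a node are pushed at once rather than using the first-child and right-sibling scheme of C2.2 and C2.6; and results are collected in a list for testing, whereas the description notes that found patterns need not be kept in memory.

```python
# Hypothetical utilities u(i, t), not the patent's Fig. 2 / Fig. 3.
UTIL = {
    "t1": {"a": 2, "b": 6},
    "t2": {"a": 4, "c": 4, "d": 5},
    "t3": {"b": 3, "c": 2, "d": 10},
    "t4": {"a": 2, "b": 3, "d": 5},
}

def mine(util, minutil):
    # Item order Omega: descending summed transaction utility (assumed bound).
    bound = {}
    for row in util.values():
        total = sum(row.values())
        for i in row:
            bound[i] = bound.get(i, 0) + total
    omega = sorted(bound, key=lambda i: (-bound[i], i))
    rank = {i: r for r, i in enumerate(omega)}

    found = []
    stack = [()]                                   # C1: root = empty pattern
    while stack:                                   # C2
        X = stack.pop()                            # C2.1
        ts = [t for t in util if all(i in util[t] for i in X)]   # C2.3
        if X and sum(util[t][i] for t in ts for i in X) >= minutil:
            found.append(X)                        # C2.4
        first = rank[X[0]] if X else len(omega)
        before = omega[:first]
        ub1 = {i: sum(sum(v for k, v in util[t].items()
                          if k in X or rank[k] < first)
                      for t in ts if i in util[t])
               for i in before}                    # B1.1
        useful = [i for i in before if ub1[i] >= minutil]
        ub2 = sum(sum(v for k, v in util[t].items() if k in X or k in useful)
                  for t in ts)                     # B1.2
        if ub2 < minutil:                          # B1.3: prune the subtree
            continue
        for i in reversed(useful):                 # C2.6 (all children at once)
            stack.append((i,) + X)
    return found
```

On the toy data, `mine(UTIL, 14)` reports the five patterns {d}, {b, d}, {a, d}, {c, d} and {b, c, d}; the subtree under {a, b} is cut because its ubound_2 is 13, so {a, b, d} is never visited.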
Description of drawings
Fig. 1: Workflow integrating A, the data representation based on a sparse matrix and virtual projection; B, the prefix growth strategy, prefix growth tree and pruning method; and C, the depth-first dynamic search method
Fig. 2: Transaction database D
Fig. 3: Utility information table UT
Fig. 4: Column linked list Φ(t) of each transaction t
Fig. 5: A sparse matrix implemented with linear linked lists
Fig. 6: Prefix growth tree after pruning (partial)
Fig. 7: Sparse matrix obtained after the virtual projection TS({d})
Fig. 8: Snapshot of the traversal stack that controls the depth-first dynamic search
Fig. 9: Snapshot of the prefix growth tree PGT
Embodiment
The present invention, "a data mining method for quickly finding utility patterns", proposes three innovative techniques. Fig. 1 outlines how the three techniques are integrated.
The technical solution is described further below as two processes, with reference to the drawings and an example (the transaction database D shown in Fig. 2, the utility information table UT shown in Fig. 3, and the utility threshold minutil = 30).
Process one: a sparse matrix implemented with linear linked lists merges and expresses the database D and the utility information table UT, that is, the transaction set TS({ }) supporting the empty pattern { } together with its complete utility information ("Summary of the invention A1").
The concrete steps of process one are as follows:
1.1 ScanDatabaseOnceforOmega: perform "Summary of the invention step A1.1".
Scan the database D for the first time and, using the utility information table UT, compute the utility upper bound ubound_1(i, { }) of each item i by formula (B1.1) of "Summary of the invention B1".
For example, for the transaction database D of Fig. 2 and the utility information table UT of Fig. 3, the results are: ubound_1(a, { }) = 96, ubound_1(b, { }) = 88, ubound_1(c, { }) = 65, ubound_1(d, { }) = 61, ubound_1(e, { }) = 58, ubound_1(f, { }) = 38 and ubound_1(g, { }) = 30.
Sorting the items by descending utility upper bound ubound_1(i, { }) yields the item order Ω.
For example, Ω = <a, b, c, d, e, f, g>.
1.2 CreateSparseMatrix: perform "Summary of the invention steps A1.2 to A1.3".
Scan the database D for the second time. For each transaction t read in, build a linear linked list in which each element stores the utility value u(i, t) of one item i in t, with the elements arranged in Ω order. This list holds all nonzero elements of column t of the sparse matrix; it is called the column linked list and is denoted Φ(t).
For example, the column linked lists Φ(t_1) to Φ(t_5) of transactions t_1 to t_5 are shown in Fig. 4.
Link the elements of the column linked lists by row, with Ω doubling as the row header table. The row header entry Ω(i) points to the first nonzero element of row i. The elements linked from Ω(i) form a row linked list.
For example, Fig. 5 shows the sparse matrix obtained after linking by row; it completely expresses the utility information of TS({ }).
Process two: using the depth-first dynamic search method ("Summary of the invention C"), traverse the prefix growth tree to enumerate patterns ("Summary of the invention B"), obtain the supporting transaction set of each enumerated pattern by virtual projection ("Summary of the invention A2"), compute utility values and prune the search space by utility upper bounds ("Summary of the invention B1"), and in this process find all (high) utility patterns.
For example, the partial prefix growth tree PGT after pruning shown in Fig. 6 records the execution of process two. The PGT nodes in the figure carry two kinds of labels. Nodes of the first kind are labeled with the item only; they are not actually grown and are therefore struck through. Nodes of the second kind are labeled with both the item and the supporting transaction set; they are actually grown, and virtual projection is performed for them.
The concrete steps of process two are as follows:
2.1 CreatePGTnullroot: construct the root node root of the prefix growth tree PGT and set root.item to empty. Create the stack traversal and push root onto it.
For example, the "empty, TS({ })" node in Fig. 6 is root.
2.2 While traversal is not empty, do:
2.2.1 PopTopElementoutofStack: pop the top node from traversal into node.
2.2.2 PushRightSiblingintoStack: if node has a right sibling, push this sibling onto traversal.
For example, when the "d, TS({d})" node of Fig. 6 is visited, the "e, TS({e})" node is pushed onto traversal, because the latter is the right sibling of the former.
2.2.3 GetTSforCurrentNode: obtain the transaction set TS(node.pattern) supporting node.pattern, i.e. all column linked lists reachable through the row header entry Ω(node.item) of the sparse matrix. Note that if node.item is empty, i.e. node is the root of the prefix growth tree PGT, all column linked lists of the TS({ }) built in process one are accessed through a special row header entry "Ω(empty)".
For example, when the "d, TS({d})" node of Fig. 6 is visited, the row header entry Ω(d) of the sparse matrix in Fig. 5 links to the column linked lists Φ(t_3), Φ(t_4) and Φ(t_5), which are all the transactions of TS({d}).
2.2.4 ComputeUtilitiesandUpperBounds: for each item i that precedes node.item, compute by formula (A) the utility value of the pattern Y = {i} ∪ node.pattern obtained by prepending i to node.pattern; if u(Y) ≥ minutil, Y is a (high) utility pattern. Also compute ubound_1(i, node.pattern) by formula (B1.1); if ubound_1(i, node.pattern) < minutil, mark i as an "irrelevant" item, otherwise mark it as a "useful" item.
For example, when the "d, TS({d})" node of Fig. 6 is visited, the computation gives u({a, d}) = 22 < minutil, u({b, d}) = 25 < minutil and u({c, d}) = 9 < minutil, so none of {a, d}, {b, d} and {c, d} is a (high) utility pattern. It also gives ubound_1(a, {d}) = 36 ≥ minutil, ubound_1(b, {d}) = 36 ≥ minutil and ubound_1(c, {d}) = 13 < minutil, so a and b are "useful" items and c is an "irrelevant" item.
2.2.5 PseudoProjection: perform the virtual projection of "Summary of the invention A2" to prepare the supporting transaction sets for the child nodes of node that are about to be created. First, empty the row header entries that precede node.item, i.e. set Ω(k) to empty for every k < node.item. Then visit each column linked list Φ(t) linked from Ω(node.item) and add every element k of Φ(t) that precedes node.item to the row linked list of Ω(k).
For example, when the "d, TS({d})" node of Fig. 6 is visited, the sparse matrix obtained after the virtual projection TS({d}) is shown in Fig. 7. Note that the projected matrix shares the same memory as the original matrix: Φ(t_1), which supports e, and Φ(t_2), which supports f, can still be reached through Ω(e) and Ω(f).
2.2.6 PruneChildrenwithSmallUppersandGrowOthers: perform the pruning of "Summary of the invention B1". That is, compute by formula (B1.2) the utility upper bound of the pattern Y = {i} ∪ node.pattern obtained by prepending each "useful" item i to node.pattern. If ubound_2(Y) < minutil, the prefix growth subtree corresponding to Y cannot contain any (high) utility pattern, so the corresponding child node need not be grown and is pruned; otherwise create a child node child of node with child.item = i.
For example, when the "d, TS({d})" node of Fig. 6 is visited, c is an "irrelevant" item and is pruned directly. Formula (B1.2) gives ubound_2({a, d}) = 22 < minutil and ubound_2({b, d}) = 31 ≥ minutil, so no child is grown for item a and a child is grown only for b, namely the "b, TS({b, d})" node of Fig. 6.
2.2.7 PushFirstChildintoStackorPurgePath: if node has children, push the first child of node onto traversal; otherwise remove node and its ancestors one by one along the path from node to root, up to the first ancestor that has other children.
For example, when the "a, TS({a, b, d})" node of Fig. 6 is visited, {a, b, d} has already been confirmed as a (high) utility pattern in the preceding step (its utility value is 31 ≥ minutil). Because this node has no children, the whole path from it to root can be deleted. At this point the only element in the traversal stack is the "e, TS({e})" node (in practice, a pointer to this node); see Fig. 8. The prefix growth tree branch actually stored in memory consists of the root and its three children not yet visited, the "e, TS({e})", "f, TS({f})" and "g, TS({g})" nodes; see Fig. 9.
Performance measuring and evaluating: the experiment of the present invention's's " a kind of data digging method of quick discovery effectiveness pattern " performance measuring and evaluating shows, the time efficiency of method for digging of the present invention than three with reference to high 1 to 3 one magnitude of method for digging, and internal memory use amount less 40% to 90%.
Brief summary: the present invention has designed " a kind of data digging method of quick discovery effectiveness pattern ".Employing is minimum based on the storage expense of the transaction journal collection use of the feasible support of the data representation effectiveness pattern of sparse matrix and virtual projection, guarantees the spatial scalability of method for digging.Adopt prefix growth strategy and prefix growth tree and cut-out method thereof, guarantee the time efficiency of method for digging.Adopt the News Search method of depth-first, further improve the spatial scalability of method for digging, thereby do not generate the candidate pattern set and directly excavate (height) effectiveness pattern in single phase.Method for digging of the present invention excavates particularly in mass data and has wide application prospects in the network information search and Knowledge Discovery.

Claims (4)

1. A data mining method for quickly finding utility patterns, which, given a transaction database D, a utility information table UT and a utility threshold minutil, finds the patterns whose utility value is not less than minutil using minimal storage space and at maximal speed, and which comprises the specific method flows of the following three core technologies:
A Data representation based on a sparse matrix and virtual projection.
B Prefix growth strategy, prefix growth tree and its pruning method.
C Depth-first dynamic search method.
2. The data mining method for quickly finding utility patterns according to claim 1, wherein the specific method flow of core technology A is as follows:
A1 A sparse matrix implemented with linear linked lists is used to jointly represent database D and utility information table UT, that is, the transaction set TS({}) supporting the empty pattern {} together with its complete utility information. In this matrix, rows are laid out by item, in the order Ω given by descending total utility upper bound ubound_1(i, {}) of each item i; columns are laid out by transaction, in the natural order of the transactions in the database; the element in row i, column t is the utility of item i in transaction t, namely u(i, t) = iu(i, t) · eu(i). The specific steps are as follows:
A1.1 Scan database D a first time and, according to utility information table UT, compute the utility upper bound ubound_1(i, {}) of each item i by formula (B1.1) of "Summary of the Invention B1"; sorting in descending order yields the item order Ω.
A1.2 Scan database D a second time. For each transaction t read in, build a linear linked list in which each element stores the utility value u(i, t) of one item i in t, with the elements arranged in Ω order. This linked list holds all non-zero elements of column t of the sparse matrix; it is called a column linked list and is denoted Φ(t).
A1.3 Link the elements of the column linked lists (that is, the non-zero elements of the sparse matrix) together by row, with Ω doubling as the row header table of the sparse matrix. Row header entry Ω(i) points to the first non-zero element of row i of the matrix. The elements linked from Ω(i) form a row linked list. If the row linked list indicated by Ω(i) and a column linked list Φ(t) have a common element, Φ(t) is said to be a column linked list threaded by Ω(i).
A2 Obtain the supporting transaction set TS(X) of an arbitrary pattern X through virtual projection. In the reverse of item order Ω, for each item i in X, select the columns containing the non-zero elements of row i of the current submatrix to form a new submatrix; the submatrix finally obtained is TS(X). Because this submatrix is embedded in the original matrix that represents the entire database, it needs no separate storage space. The specific steps are as follows:
A2.1 In the reverse order of Ω, take each item i in X.
A2.1.1 Clear the row header entries ranked before i, that is, for k < i, set Ω(k) to empty.
A2.1.2 For each element k located before i in each column linked list Φ(t) threaded by Ω(i), add k to the row linked list indicated by Ω(k).
A2.2 Let i0 be the first item of X in Ω order; then the submatrix formed by all column linked lists threaded by Ω(i0) is TS(X), that is, the set of transactions supporting pattern X.
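The following Python sketch illustrates steps A1 and A2 under simplifying assumptions; it is not the claimed linked-list implementation. Python lists stand in for the linked lists, ties in Ω are broken by item name, the product u(i, t) = iu(i, t) · eu(i) is assumed, and projection is done by intersecting row lists and returning transaction indices rather than by relinking row headers. The names `build_matrix` and `project` and the toy data are hypothetical.

```python
from collections import defaultdict


def build_matrix(db, eu):
    """db: list of {item: iu(i, t)}; eu: {item: eu(i)}."""
    # u(i, t) = iu(i, t) * eu(i)
    util = [{i: q * eu[i] for i, q in t.items()} for t in db]
    # A1.1, first scan: for the empty pattern, ubound_1(i, {}) is the
    # total utility of all transactions that contain item i.
    ub1 = defaultdict(int)
    for t in util:
        total = sum(t.values())
        for i in t:
            ub1[i] += total
    # Omega: descending ubound_1 (tie-break by name is an assumption).
    omega = sorted(ub1, key=lambda i: (-ub1[i], i))
    rank = {i: r for r, i in enumerate(omega)}
    # A1.2, second scan: column list Phi(t), elements in Omega order.
    phi = [sorted(t.items(), key=lambda e: rank[e[0]]) for t in util]
    # A1.3: row list of item i = indices of the columns that contain i.
    rows = defaultdict(list)
    for tid, col in enumerate(phi):
        for i, _ in col:
            rows[i].append(tid)
    return omega, phi, dict(rows)


def project(pattern, rows, n):
    """TS(X) as column indices into the shared matrix (no copying)."""
    ids = set(range(n))
    for i in pattern:
        ids &= set(rows.get(i, ()))
    return sorted(ids)


eu = {"a": 1, "b": 2, "c": 3}
db = [{"a": 2, "b": 1}, {"b": 2, "c": 1}, {"a": 1, "c": 2}]
omega, phi, rows = build_matrix(db, eu)
print(omega, project({"b", "c"}, rows, len(db)))
```

Because `project` returns only indices into `phi`, the projected transaction set reuses the storage of the original matrix, which is the property that virtual projection is meant to provide.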
3. The single-phase method for fast mining of utility patterns according to claim 1, wherein the specific method flow of core technology B is as follows:
According to the item order Ω used to construct the sparse matrix, a pattern can also be expressed as an ordered sequence. For example, {a, b, c} can also be written <a, b, c> if Ω happens to coincide with lexicographic order. The set notation and the sequence notation can therefore be used interchangeably, and the set union operation ∪ can also be used to splice two sequences, for example <a> ∪ <b, c, d> = <a, b, c, d>. The idea by which the present invention enumerates patterns is to obtain one pattern from another by splicing on a prefix. For example, splicing the prefix <a> onto <b, c, d> yields <a, b, c, d>. Specifically, the prefix growth strategy splices a prefix onto the empty pattern to obtain patterns of length 1, splices a prefix onto patterns of length 1 to obtain patterns of length 2, and so on.
Enumerating patterns by the prefix growth strategy is equivalent to constructing a prefix growth tree (Prefix Growth Tree, PGT for short). Each PGT node represents one item, denoted node.item. The root node represents the "blank" item and corresponds to the "empty" pattern {}. The set of all items on the path from node to the root represents a pattern, denoted node.pattern. Each item i ranked before node.item in Ω order can be represented by a child node child of node, that is, child.item is set to i.
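As an illustration of the enumeration order only (no utilities and no pruning), the prefix growth tree over an item order Ω can be walked with an explicit stack. The function name `prefix_growth` is hypothetical, and the sketch assumes that Ω is given as a Python list.

```python
def prefix_growth(omega):
    """Yield every non-empty pattern over omega, each as a tuple in omega order."""
    # Each stack entry is (pattern, rank of the pattern's first item);
    # the empty pattern uses len(omega) so that every item may prefix it.
    stack = [((), len(omega))]
    while stack:
        pattern, first = stack.pop()
        if pattern:
            yield pattern
        # Children: splice any item ranked before the current first item.
        for r in range(first):
            stack.append(((omega[r],) + pattern, r))


for p in prefix_growth(["a", "b", "c"]):
    print(p)
```

Each of the 2^n - 1 non-empty patterns is produced exactly once, because a pattern can only be reached by repeatedly splicing on the item that precedes its current first item in Ω.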
B1 The specific operations for pruning a node of the prefix growth tree PGT are as follows.
B1.1 Compute the utility upper bound of each item in the prefix growth subtree rooted at node. That is, for each item i ranked before node.item, compute by formula (B1.1) the utility upper bound of i in the prefix growth subtree of pattern node.pattern,
$$ubound_1(i, X) = \sum_{t \in TS(X),\; i \in t} u(preEXT(X, t), t)$$ formula (B1.1)
where X = node.pattern, TS(X) is determined by "Summary of the Invention A2", and preEXT(X, t) is obtained by merging all items of transaction t that are ranked before X in Ω order into a prefix and splicing that prefix onto X. This can be refined as:
B1.1.1 For each transaction t in the TS(X) determined by "Summary of the Invention A2", that is, each column linked list Φ(t): if Φ(t) has an element storing item i, add up the utility values in t of all items in X and of all items ranked before X in Ω order, that is, the sum of the utility values of the corresponding elements of column linked list Φ(t), and accumulate the result into ubound_1(i, X).
B1.1.2 If ubound_1(i, X) < minutil, item i cannot appear as a prefix of X = node.pattern in any (high) utility pattern, so it is marked as an "irrelevant" item; otherwise it is marked as a "useful" item.
B1.2 Compute by formula (B1.2) the utility upper bound of the prefix patterns that may become (high) utility patterns when grown in the PGT subtree rooted at node.
$$ubound_2(X) = \sum_{t \in TS(X)} u(ppEXT(X, t), t)$$ formula (B1.2)
where X = node.pattern, TS(X) is determined by "Summary of the Invention A2", and ppEXT(X, t) is obtained from preEXT(X, t) by removing the "irrelevant" items marked in "Summary of the Invention B1.1.2". This can be refined as:
B1.2.1 For each transaction t in the TS(X) determined by "Summary of the Invention A2", that is, each column linked list Φ(t), add up the corresponding elements (utility values) of Φ(t) for all "useful" items ranked before X in Ω order and for all items in X, which yields ubound_2(X).
B1.3 Perform the pruning operation. If the prefix pattern utility upper bound computed in "Summary of the Invention B1.2.1" is lower than the threshold, that is,
$$ubound_2(node.pattern) < minutil$$ formula (B1.3)
then the prefix growth subtree rooted at node can be pruned, because it cannot contain any (high) utility pattern.
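The two bounds and the pruning test of B1 can be sketched as follows. This is a simplified illustration under stated assumptions: each column list Φ(t) is a Python list of (item, utility) pairs in Ω order, `rank` maps each item to its position in Ω, TS(X) is passed in as a list of column indices, and all function names and the toy data are hypothetical.

```python
def pre_ext(pattern, col, rank):
    """preEXT(X, t): elements of Phi(t) in X or ranked before all of X."""
    first = min((rank[i] for i in pattern), default=len(rank))
    return [(i, u) for i, u in col if i in pattern or rank[i] < first]


def ubound1(item, pattern, ts, phi, rank):
    """Formula (B1.1): sum u(preEXT(X, t), t) over t in TS(X) containing item."""
    total = 0
    for t in ts:
        if any(i == item for i, _ in phi[t]):
            total += sum(u for _, u in pre_ext(pattern, phi[t], rank))
    return total


def useful_items(pattern, ts, phi, rank, minutil):
    """B1.1.2: candidate prefix items whose ubound_1 reaches minutil."""
    cands = {i for t in ts for i, _ in pre_ext(pattern, phi[t], rank)
             if i not in pattern}
    return {i for i in cands
            if ubound1(i, pattern, ts, phi, rank) >= minutil}


def ubound2(pattern, ts, phi, rank, useful):
    """Formula (B1.2): like preEXT, but "irrelevant" items are dropped."""
    return sum(u for t in ts for i, u in pre_ext(pattern, phi[t], rank)
               if i in pattern or i in useful)


# Toy matrix: Omega = <c, a, b>, three transactions (columns).
rank = {"c": 0, "a": 1, "b": 2}
phi = [[("a", 2), ("b", 2)], [("c", 3), ("b", 4)], [("c", 6), ("a", 1)]]
X, ts, minutil = {"b"}, [0, 1], 5
useful = useful_items(X, ts, phi, rank, minutil)
pruned = ubound2(X, ts, phi, rank, useful) < minutil  # formula (B1.3)
print(useful, pruned)
```

In this toy case item a is "irrelevant" for X = {b} because ubound_1(a, {b}) = 4 < 5, so its utility is excluded from ubound_2({b}), which tightens the bound from 11 to 9.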
4. The single-phase method for fast mining of utility patterns according to claim 1, wherein the specific method flow of core technology C is as follows:
The process of searching the prefix growth tree is also the process of constructing it. If it is carried out in depth-first order, it is possible not only to apply "Summary of the Invention B1" and prune the prefix growth subtrees that cannot contain (high) utility patterns, but also to remove promptly the subtrees that have been fully searched, so that only the branch currently being searched, rather than the whole tree, is stored in memory, thereby realizing dynamic search. The specific steps are as follows:
C1 Construct the root node root of the prefix growth tree PGT, with root.item set to empty. Create a stack traversal, used to realize the depth-first dynamic search. Push the root of the prefix growth tree PGT onto traversal.
C2 While traversal is non-empty, that is, while nodes remain pushed in it, perform:
C2.1 Pop the top node from traversal and store it in node.
C2.2 If node has a right sibling, push that sibling onto traversal.
C2.3 Perform the virtual projection of "Summary of the Invention A2" to obtain the transaction set TS(node.pattern) supporting node.pattern.
C2.4 If u(node.pattern) >= minutil, then node.pattern is a (high) utility pattern.
C2.5 Perform the pruning of "Summary of the Invention B1".
C2.6 If node has not been pruned, create a child node child of node for each "useful" item i marked in "Summary of the Invention B1.1.2", with child.item set to i. Push the 1st child (in left-to-right order) onto traversal.
C2.7 If node has no child nodes, remove node and its ancestor nodes one by one along the path from node to the root, until an ancestor node that has other children is reached.
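Steps C1 to C2.7 can be put together in a compact Python sketch. It is a simplified illustration, not the claimed implementation: transactions are plain dicts mapping each item to its utility u(i, t), TS(X) is recomputed by filtering instead of by virtual projection, ties in Ω are broken by item name, and the names `Node` and `mine` are hypothetical.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.children = item, parent, []

    def pattern(self):
        items, n = [], self
        while n.parent is not None:
            items.append(n.item)
            n = n.parent
        return frozenset(items)


def mine(db, minutil):
    """db: list of {item: u(i, t)}. Returns {pattern: utility}."""
    twu = {}
    for t in db:
        for i in t:
            twu[i] = twu.get(i, 0) + sum(t.values())
    order = sorted(twu, key=lambda i: (-twu[i], i))       # Omega
    rank = {i: r for r, i in enumerate(order)}
    found = {}
    root = Node(None, None)
    stack = [root]                                        # C1
    while stack:                                          # C2
        node = stack.pop()                                # C2.1
        par = node.parent
        if par is not None:                               # C2.2
            k = par.children.index(node)
            if k + 1 < len(par.children):
                stack.append(par.children[k + 1])
        x = node.pattern()
        ts = [t for t in db if all(i in t for i in x)]    # C2.3 (simplified)
        ux = sum(t[i] for t in ts for i in x)
        if x and ux >= minutil:                           # C2.4
            found[x] = ux
        first = min((rank[i] for i in x), default=len(rank))
        # preEXT(X, t) for every supporting transaction.
        exts = [{i: u for i, u in t.items() if i in x or rank[i] < first}
                for t in ts]
        ub1 = {}
        for e in exts:                                    # B1.1
            s = sum(e.values())
            for i in e:
                if i not in x:
                    ub1[i] = ub1.get(i, 0) + s
        useful = [i for i in ub1 if ub1[i] >= minutil]    # B1.1.2
        ub2 = sum(u for e in exts for i, u in e.items()
                  if i in x or i in useful)               # B1.2
        if ub2 >= minutil:                                # C2.5, C2.6
            node.children = [Node(i, node)
                             for i in sorted(useful, key=rank.get)]
            if node.children:
                stack.append(node.children[0])
        n = node                                          # C2.7
        while not n.children and n.parent is not None:
            n.parent.children.remove(n)
            n = n.parent
    return found


db = [{"a": 2, "b": 2}, {"b": 4, "c": 3}, {"a": 1, "c": 6}]
print(mine(db, 7))
```

At any moment only the nodes on the current branch and their unvisited siblings are linked into the tree, since finished leaves and exhausted ancestors are unlinked in the final loop of each iteration.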
CN2012100425708A 2012-02-23 2012-02-23 Data mining method for quickly finding utility pattern Pending CN102662948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100425708A CN102662948A (en) 2012-02-23 2012-02-23 Data mining method for quickly finding utility pattern

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012100425708A CN102662948A (en) 2012-02-23 2012-02-23 Data mining method for quickly finding utility pattern

Publications (1)

Publication Number Publication Date
CN102662948A true CN102662948A (en) 2012-09-12

Family

ID=46772439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100425708A Pending CN102662948A (en) 2012-02-23 2012-02-23 Data mining method for quickly finding utility pattern

Country Status (1)

Country Link
CN (1) CN102662948A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103607412A (en) * 2013-12-04 2014-02-26 西安电子科技大学 Content center multiple-interest-packet processing method based on tree
CN103607412B (en) * 2013-12-04 2016-07-06 西安电子科技大学 The multiple interest packet processing method of content center network based on tree
CN105868296A (en) * 2016-03-24 2016-08-17 银江股份有限公司 Fast pruning policy based method for drug DDD value data analysis in efficient sequence modes
CN105868296B (en) * 2016-03-24 2019-02-05 银江股份有限公司 A kind of medication DDD Value Data analysis method of the effective sequence pattern based on fast pruning strategy
CN106250549A (en) * 2016-08-14 2016-12-21 重庆大学 A kind of Frequent Pattern Mining method based on internal memory
CN106250549B (en) * 2016-08-14 2019-09-20 重庆大学 A kind of Frequent Pattern Mining method memory-based
CN107870939A (en) * 2016-09-27 2018-04-03 腾讯科技(深圳)有限公司 A kind of mode excavation method and device
WO2018059298A1 (en) * 2016-09-27 2018-04-05 腾讯科技(深圳)有限公司 Pattern mining method, high-utility item-set mining method and relevant device
CN108733705A (en) * 2017-04-20 2018-11-02 哈尔滨工业大学深圳研究生院 A kind of effective sequential mode mining method and device
CN108153859A (en) * 2017-12-24 2018-06-12 浙江工商大学 A kind of effectiveness order based on Hadoop and Spark determines method parallel

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120912