CN102662948A - Data mining method for quickly finding utility pattern

Info

Publication number: CN102662948A
Application number: CN2012100425708A
Authority: CN
Inventors: 刘君强; 蒋晓宁; 甘志刚; 余斌霄
Original assignee: Zhejiang Gongshang University
Current assignee: Zhejiang Gongshang University
Priority date: 2012-02-23
Filing date: 2012-02-23
Publication date: 2012-09-12

Abstract

A data mining method for quickly finding utility patterns. From massive data, the method finds utility patterns that are statistically significant and also meet the user's expectations and goals, and it has wide application in network information search and knowledge discovery. Existing methods adopt a two-phase approach that generates candidate patterns, which incurs high time and space overhead. To solve this problem, the invention provides three innovative techniques: first, a data representation based on a sparse matrix and virtual projection; second, a prefix growth strategy, a prefix growth tree and its pruning method; third, a depth-first dynamic search method. Building on these three techniques, a novel single-phase mining method is designed that mines utility patterns directly without generating candidate patterns. Its time efficiency is one to three orders of magnitude higher than that of three reference mining methods, and its memory usage is 40% to 90% lower. The method has high performance and supports applications such as massive Web mining, multimedia mining and text mining.

Description

A data mining method for quickly finding utility patterns

Technical field

The present invention relates to the field of intelligent information processing. It provides a utility pattern mining method that can find, from massive data, patterns that are statistically significant and also meet the user's expectations and goals. The method has broad application prospects in mining massive data, particularly in network information search and knowledge discovery, including Web mining, text mining and multimedia mining.

Background technology

Traditional data mining techniques, particularly frequent pattern mining [1] [2], analyze data mainly by statistical significance, for example by mining frequently purchased product combinations from supermarket sales data. They do not consider the user's expectations or goals; a user may instead be interested in product combinations that yield high profit. In other words, data mining should consider not only the statistical significance of the data but also the user's interests or goals [3]. Utility pattern mining emerged to meet this need as the latest development of frequent pattern mining [4] [5] [6] [7] [8].

However, utility pattern mining is not yet mature. Only a few results exist, and all of them adopt a two-phase method. The two-phase method TP was proposed by Liu et al. [4]. The first phase uses the downward closure property of the transaction-weighted utility (TWU) to find the patterns with high TWU, which form the candidate pattern set. The second phase scans the database again to compute the actual utility of each candidate pattern and so finds the patterns whose utility is higher than the given threshold. Li et al. [5] proposed the isolated items discarding strategy for the level-wise mining of candidate patterns in the first phase. It reduces unnecessary candidate patterns and also improves efficiency, because the candidates of each level can be computed on a progressively shrinking data set.

More recently, to avoid the multiple database scans that level-wise candidate generation requires [4] [5], and so generate candidate patterns efficiently in the first phase, several research groups have proposed tree-based utility pattern mining methods [6] [7] [8]. Erwin et al. [6] proposed the CTU-PROL mining method, which uses the TWU downward closure property [4] and mines on the basis of the utility pattern tree CUP-tree and FP-Growth [2]. Ahmed et al. [7] proposed the IHUP mining method, which stores the TWU information of each transaction in an IHUP-tree and adapts FP-Growth [2] to mine the candidate set of utility patterns. CTU-PROL [6] and IHUP [7] generate the same number of candidate patterns in the first phase as TP [4]. Tseng et al. [8] designed another tree-based mining method, UPG, which uses a UP-tree to express the utility information of transactions in compressed form and proposes strategies for discarding and decreasing tree-node utilities that tighten the TWU downward closure property, thereby generating fewer candidate patterns.

However, none of these results escapes the two-phase framework, although some work [5] [8] tries to reduce the number of candidate patterns generated in the first phase. When the database contains long transactions or the given utility threshold is small, the number of candidate patterns is still huge. This causes excessive storage overhead and a scalability bottleneck in the first phase, and likewise in the second phase, and ultimately results in low run-time efficiency.

To overcome the defects of previous mining methods, the present invention proposes the following three innovative techniques, which break away from the two-phase framework, and designs "a data mining method for quickly finding utility patterns", thereby solving the bottleneck problems of scalability and efficiency.

The first is a data representation based on a sparse matrix and virtual projection. Specifically, a sparse matrix is proposed to express the complete utility information of every transaction, which makes single-phase mining possible. This sparse matrix representation is more compact than the FP-tree-based [2] representations [6] [7] [8] and avoids multiple database scans [4] [5]. With virtual projection, the utility of any pattern can be computed without any additional storage overhead.

The second is a prefix growth strategy, a prefix growth tree and its pruning method. The prefix growth strategy and the corresponding prefix growth tree guide the mining process and are supported by a search-space pruning technique: by estimating an upper bound on the utility in any subspace, the prefix growth tree can be pruned effectively.

The third is a depth-first dynamic search method. While the prefix growth tree is searched for utility patterns, only the branch currently being searched is constructed, in depth-first order. Neither the complete prefix growth tree nor the utility patterns found need to be kept in memory, which further reduces storage overhead.

The time efficiency of the mining method of the present invention is 1 to 3 orders of magnitude higher than that of three reference mining methods [4] [7] [8], and its memory usage is 40% to 90% lower. The method has high performance and can be widely used in applications such as massive Web mining, multimedia mining and text mining.

References:

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules [A]. In Proc. of VLDB 1994 [C]. 1994, 487-499.

[2] J. Han, J. Pei, Y. Yin. Mining frequent patterns without candidate generation [A]. In Proc. of ACM SIGMOD 2000 [C]. Dallas, USA, 2000, 1-12.

[3] H. Yao, H. J. Hamilton, L. Geng. A unified framework for utility-based measures for mining itemsets [A]. In Proc. of ACM SIGKDD 2nd Workshop on Utility-Based Data Mining [C]. 2006, 28-37.

[4] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm [A]. In Proc. of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD [C]. 2005, 253-262.

[5] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets [J]. Data & Knowledge Engineering, 2008, 64(1): 198-217.

[6] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets [A]. In Proc. of PAKDD 2008 [C]. 2008, 554-561.

[7] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee. Efficient tree structures for high utility pattern mining in incremental databases [J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(12): 1708-1721.

[8] V. S. Tseng, C.-W. Wu, B.-E. Shie, P. S. Yu. UP-Growth: an efficient algorithm for high utility itemset mining [A]. In Proc. of the 16th ACM SIGKDD [C]. 2010, 253-262.

Summary of the invention

The object of the invention is to design a mining method that finds (high) utility patterns in a transaction database with minimal memory and maximal speed, so as to enable knowledge discovery in massive data.

The present invention, "a data mining method for quickly finding utility patterns", comprises three core techniques: A, a data representation based on a sparse matrix and virtual projection; B, a prefix growth strategy, a prefix growth tree and its pruning method; C, a depth-first dynamic search method.

Given a transaction database D, a utility information table UT and a utility threshold minutil, the mining method of the present invention finds the patterns whose utility is not less than minutil.

Part one of the invention:

Let I = {i₁, i₂, ..., i_m} be the set of all items and D = {t₁, t₂, ..., t_n} be the database, i.e. the set of transactions. Each transaction t is an itemset, i.e. t ⊆ I.

u(i, t) = iu(i, t) · eu(i) is the utility of item i in transaction t, where iu(i, t) is the share of item i in transaction t and eu(i) is the external utility of item i, independent of any transaction. A pattern X is a subset of I. If the share in t of every item i of X is non-zero, i.e. iu(i, t) ≠ 0, then X is supported by transaction t, i.e. X ⊆ t. u(X, t) is the utility of pattern X in transaction t, with value u(X, t) = Σ_{i ∈ X} u(i, t). TS(X) = {t ∈ D | X ⊆ t} is the set of all transactions that support pattern X. u(X) is the total utility of pattern X, i.e. the sum of the utilities of X over all transactions that support X, computed by formula (A).

u(X) = \sum_{t \in TS(X)} u(X, t) = \sum_{t \in TS(X)} \sum_{i \in X} u(i, t)

Formula (A)
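As a minimal sketch of the definitions above and of formula (A), the following Python computes u(X) on a small made-up database. The data in `D` and `UT` and the function names are illustrative assumptions; they are not the data of Fig. 2 and Fig. 3.

```python
# Hypothetical toy data (not the patent's Fig. 2 / Fig. 3).
# UT maps each item i to its external utility eu(i).
UT = {"a": 2, "b": 3, "c": 1}
# D is a list of transactions; each maps an item i to its share iu(i, t).
D = [
    {"a": 1, "b": 2},
    {"b": 1, "c": 4},
    {"a": 3, "b": 1, "c": 2},
]

def item_utility(i, t):
    """u(i, t) = iu(i, t) * eu(i)."""
    return t[i] * UT[i]

def supporting_transactions(X):
    """TS(X): transactions in which every item of X has a non-zero share."""
    return [t for t in D if all(t.get(i, 0) != 0 for i in X)]

def utility(X):
    """Formula (A): sum of u(i, t) over all t in TS(X) and all i in X."""
    return sum(item_utility(i, t)
               for t in supporting_transactions(X)
               for i in X)
```

With this toy data, `utility({"a", "b"})` adds the utilities of a and b over the first and third transactions only, because the second transaction does not contain a.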

A. Data representation based on a sparse matrix and virtual projection

A1. A sparse matrix implemented with linear linked lists is used to express the database D and the utility information table UT together, i.e. the transaction set TS({}) that supports the empty pattern {} and its complete utility information. In this matrix, rows correspond to items, ordered by descending total utility upper bound ubound₁(i, {}) of each item i; columns correspond to transactions, in their natural order in the database; the element in row i, column t is the utility of item i in transaction t, i.e. u(i, t) = iu(i, t) · eu(i). The concrete steps are as follows:

A1.1 Scan the database D a first time and, using the utility information table UT, compute the utility upper bound ubound₁(i, {}) of each item i by formula (B1.1) of "Summary of the invention, B1"; sorting in descending order yields the item order Ω.

A1.2 Scan the database D a second time. For each transaction t read in, build a linear linked list in which each element stores the utility u(i, t) of one item i in t, with the elements arranged in Ω order. This list holds all non-zero elements of column t of the sparse matrix; it is called a column list and denoted Φ(t).

A1.3 Link the elements of the column lists (i.e. the non-zero elements of the sparse matrix) by row, with Ω doubling as the row header table of the sparse matrix. The row header entry Ω(i) points to the first non-zero element of row i of the matrix. The elements linked from Ω(i) form a row list. If the row list headed by Ω(i) and a column list Φ(t) share an element, Φ(t) is said to be a column list threaded by Ω(i).
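Steps A1.1 to A1.3 can be sketched in Python as follows. This is an illustration only, under several assumptions: Python lists stand in for the linked lists, the toy data `D` and `UT` are made up, and ubound₁(i, {}) is taken to be the sum of the total utilities of the transactions containing i, which is how this sketch reads formula (B1.1) for the empty pattern.

```python
from collections import defaultdict

UT = {"a": 2, "b": 3, "c": 1}            # hypothetical eu(i)
D = [{"a": 1, "b": 2},                   # hypothetical iu(i, t)
     {"b": 1, "c": 4},
     {"a": 3, "b": 1, "c": 2}]

def build_sparse_matrix(D, UT):
    # A1.1, first scan: utility upper bound of each item for the empty pattern.
    ubound1 = defaultdict(int)
    for t in D:
        total = sum(q * UT[i] for i, q in t.items())
        for i in t:
            ubound1[i] += total
    omega = sorted(ubound1, key=lambda i: -ubound1[i])   # descending order
    rank = {i: r for r, i in enumerate(omega)}

    # A1.2, second scan: one list Phi(t) per transaction, in Omega order,
    # whose elements are (item, u(i, t)) pairs.
    phi = [sorted(((i, q * UT[i]) for i, q in t.items()),
                  key=lambda e: rank[e[0]])
           for t in D]

    # A1.3: rows[i] plays the role of the list headed by Omega(i); it holds
    # the indices of the transactions whose Phi(t) it threads.
    rows = {i: [] for i in omega}
    for tid, column in enumerate(phi):
        for i, _ in column:
            rows[i].append(tid)
    return omega, phi, rows

omega, phi, rows = build_sparse_matrix(D, UT)
```

Each utility value is stored exactly once, in `phi`; `rows` only records which transactions each item occurs in, which mirrors the idea that the row links reuse the elements of the Φ(t) lists.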

A2. The supporting transaction set TS(X) of an arbitrary pattern X is obtained by virtual projection. Taking the items i of X in reverse Ω order, the columns that contain the non-zero elements of row i of the current submatrix are selected to form a new submatrix; the submatrix finally obtained is TS(X). Because this submatrix is embedded in the original matrix that represents the whole database, it needs no separate storage. The concrete steps are as follows:

A2.1 Take each item i of X in reverse Ω order.

A2.1.1 Clear the row header entries that rank before i, i.e. for k < i, set Ω(k) to empty.

A2.1.2 For each column list Φ(t) threaded by Ω(i), add each element k of Φ(t) that ranks before i to the row list headed by Ω(k).

A2.2 Let i₀ be the first element of X in Ω order. The submatrix formed by all column lists threaded by Ω(i₀) is TS(X), i.e. the set of transactions that support pattern X.

The key point of part one of the invention is that TS(X) is embedded in TS({}) and is therefore a virtual projection that needs no separate storage, which greatly improves the space scalability of the mining method.
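The re-threading of steps A2.1 to A2.2 can be sketched as below. The literal values of `omega`, `phi` and `rows` are a made-up toy matrix, and the function name `project` is hypothetical. Unlike the in-place relinking described above, the sketch works on a copy of the header lists so that its input stays intact; no utility value is copied.

```python
# Hypothetical toy matrix: Omega order, lists Phi(t), and lists headed by Omega(i).
omega = ["b", "a", "c"]
phi = [[("b", 6), ("a", 2)],
       [("b", 3), ("c", 4)],
       [("b", 3), ("a", 6), ("c", 2)]]
rows = {"b": [0, 1, 2], "a": [0, 2], "c": [1, 2]}

def project(X, omega, phi, rows):
    """Return the transaction indices of TS(X) for a non-empty pattern X."""
    rank = {i: r for r, i in enumerate(omega)}
    heads = {i: list(tids) for i, tids in rows.items()}   # working headers
    # A2.1: take the items of X in reverse Omega order.
    for i in sorted(X, key=lambda item: rank[item], reverse=True):
        threaded = heads[i]
        # A2.1.1: clear every header that ranks before i.
        for k in omega[:rank[i]]:
            heads[k] = []
        # A2.1.2: re-thread, using only the transactions threaded by Omega(i).
        for tid in threaded:
            for k, _ in phi[tid]:
                if rank[k] < rank[i]:
                    heads[k].append(tid)
    # A2.2: the header of the first item of X now threads exactly TS(X).
    first = min(X, key=lambda item: rank[item])
    return heads[first]
```

For instance, `project({"a", "c"}, omega, phi, rows)` first restricts the headers of b and a to the transactions containing c, and then reads TS({a, c}) off the header of a.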

Part two of the invention:

A pattern X is called a (high) utility pattern if its utility is not less than the given threshold minutil, i.e. u(X) ≥ minutil. Utility pattern mining finds all (high) utility patterns, i.e. it solves for the set {X | X ⊆ I, u(X) ≥ minutil}.

The basic idea of mining utility patterns is to enumerate each pattern, compute its utility and judge whether it reaches the threshold minutil. The present invention proposes a prefix growth strategy for enumerating patterns, which is equivalent to constructing a prefix growth tree, and prunes the prefix growth tree using upper bounds on utility.

B. Prefix growth strategy, prefix growth tree and pruning method

Given the item order Ω used to build the sparse matrix, a pattern can also be expressed as an ordered sequence. For example, {a, b, c} can also be written < a, b, c > if Ω happens to coincide with lexicographic order. The set notation and the sequence notation can therefore be mixed, and the set union ∪ can also denote the concatenation of two sequences, e.g. < a > ∪ < b, c, d > = < a, b, c, d >. The invention enumerates patterns by prepending a prefix to one pattern to obtain another. For example, prepending the prefix < a > to < b, c, d > gives < a, b, c, d >. Specifically, the prefix growth strategy prepends a prefix to the empty pattern to obtain patterns of length 1, prepends a prefix to patterns of length 1 to obtain patterns of length 2, and so on.

Enumerating patterns by the prefix growth strategy is equivalent to constructing a prefix growth tree (Prefix Growth Tree, PGT for short). Each PGT node node represents one item, denoted node.item. The root represents the "blank" item and corresponds to the empty pattern {}. The set of all items on the path from node to the root represents a pattern, denoted node.pattern. Each item i that ranks before node.item in Ω order can be represented by a child node child of node, i.e. child.item = i.
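To illustrate the enumeration order only (without any pruning), the following Python sketch walks the full prefix growth tree for a three-item order. The function name `prefix_growth` and the example order are assumptions made for this illustration.

```python
def prefix_growth(omega):
    """Yield every pattern as a tuple in Omega order, grown by prepending."""
    stack = [()]                       # the root corresponds to the empty pattern
    while stack:
        pattern = stack.pop()
        yield pattern
        # Only items ranked before the first item of the pattern may be
        # prepended, so each pattern is generated exactly once.
        limit = omega.index(pattern[0]) if pattern else len(omega)
        for i in omega[:limit]:
            stack.append((i,) + pattern)

patterns = list(prefix_growth(["a", "b", "c"]))
```

Because a child always prepends an item that ranks before the current first item, every subset of the items appears once, written in Ω order, and no pattern is reached along two different paths.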

It is following that B1 cuts out the concrete operations of prefix growth tree PGT node n ode.

B1.1 calculates the utility value upper bound of each project in the prefix growth subtree that node is a root.Just, for the project i that is arranged in before the node.item, by formula (B1.1) calculates i utility value upper bound in the prefix growth subtree of pattern node.pattern,

\mathrm{ubound}_{1}(i, X) = \sum_{t \in TS(X),\ i \in t} u(\mathrm{preEXT}(X, t), t)

Formula (B1.1)

where X = node.pattern, TS(X) is determined by "Summary of the invention, A2", and preEXT(X, t) is the pattern obtained by merging all items of transaction t that rank before X in Ω order into a prefix and prepending it to X. This can be refined as:

B1.1.1 For each transaction t in the TS(X) determined by "Summary of the invention, A2", i.e. each column list Φ(t): if Φ(t) has an element storing item i, accumulate into ubound₁(i, X) the utilities in t of all items of X and of all items that rank before X in Ω order, i.e. the sum of the corresponding elements of the column list Φ(t).

B1.1.2 If ubound₁(i, X) < minutil, item i cannot appear in any (high) utility pattern as a prefix of X = node.pattern, so it is marked as an "irrelevant" item; otherwise it is marked as a "useful" item.

B1.2 Compute by formula (B1.2) the utility upper bound of the prefix patterns that may grow into (high) utility patterns in the PGT subtree rooted at node.

\mathrm{ubound}_{2}(X) = \sum_{t \in TS(X)} u(\mathrm{ppEXT}(X, t), t)

Formula (B1.2)

where X = node.pattern, TS(X) is determined by "Summary of the invention, A2", and ppEXT(X, t) is obtained from preEXT(X, t) by discarding the "irrelevant" items marked in "Summary of the invention, B1.1.2". This can be refined as:

B1.2.1 For each transaction t in the TS(X) determined by "Summary of the invention, A2", i.e. each column list Φ(t): accumulate the corresponding elements (utilities) of Φ(t) for all "useful" items that rank before X in Ω order and for all items of X, giving ubound₂(X).

B1.3 Perform pruning. If the prefix pattern utility upper bound computed in "Summary of the invention, B1.2.1" is lower than the threshold, i.e.

ubound₂(node.pattern) < minutil    Formula (B1.3)

then the prefix growth subtree rooted at node is pruned, because it cannot contain any (high) utility pattern.

The key point of part two of the invention is that, by computing utility upper bounds and pruning the prefix growth subtrees that cannot contain (high) utility patterns, the search space is reduced effectively and the time efficiency of the mining method is greatly improved.
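The bound computations of B1.1 to B1.3 can be sketched in Python as follows. This is an illustration under assumptions: the function name `bounds`, the toy transactions and the threshold are made up, and ubound₁ is computed as described in step B1.1.1 (summing, over the transactions of TS(X) that contain i, the utilities of the items of X and of the items ranked before X).

```python
def bounds(X, ts, omega, minutil):
    """Return (ubound_1 per item, "useful" items, ubound_2) for pattern X.

    ts is TS(X), given as lists Phi(t) of (item, utility) pairs in Omega order.
    """
    rank = {i: r for r, i in enumerate(omega)}
    first = min((rank[i] for i in X), default=len(omega))
    ubound1 = {i: 0 for i in omega[:first]}
    for column in ts:
        # u(preEXT(X, t), t): items of X plus all items of t ranked before X.
        pre = sum(u for i, u in column if i in X or rank[i] < first)
        for i, _ in column:
            if rank[i] < first:
                ubound1[i] += pre                              # B1.1.1
    useful = [i for i in ubound1 if ubound1[i] >= minutil]     # B1.1.2
    # u(ppEXT(X, t), t): the same sum with the "irrelevant" items left out.
    ubound2 = sum(u for column in ts for i, u in column
                  if i in X or i in useful)                    # B1.2.1
    return ubound1, useful, ubound2

# Hypothetical example: Omega = <b, a, c>, X = {c}, TS({c}) has two transactions.
omega = ["b", "a", "c"]
ts_c = [[("b", 3), ("c", 4)],
        [("b", 3), ("a", 6), ("c", 2)]]
ubound1, useful, ubound2 = bounds({"c"}, ts_c, omega, minutil=12)
prune = ubound2 < 12                                           # B1.3
```

In this toy case a is marked "irrelevant" because its bound stays below the threshold, so its utility is left out of ubound₂; the subtree is kept because ubound₂ still reaches the threshold.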

Part three of the invention:

C. Depth-first dynamic search method

Searching the prefix growth tree is the same process as constructing it. When done in depth-first order, it not only allows "Summary of the invention, B1" to prune the prefix growth subtrees that cannot contain (high) utility patterns, but also allows subtrees whose search is finished to be removed promptly, so that only the branch currently being searched, rather than the whole tree, is stored in memory; this realizes dynamic search. The concrete steps are as follows:

C1 Construct the root node root of the prefix growth tree PGT and set root.item to empty. Create a stack traversal, used to realize the depth-first dynamic search. Push root onto traversal.

C2 While traversal is not empty, i.e. while some node is on it, perform:

C2.1 Pop the top node from traversal and store it in node.

C2.2 If node has a right sibling, push that sibling onto traversal.

C2.3 Perform the virtual projection of "Summary of the invention, A2" to obtain the transaction set TS(node.pattern) that supports node.pattern.

C2.4 If u(node.pattern) ≥ minutil, then node.pattern is a (high) utility pattern.

C2.5 Perform the pruning of "Summary of the invention, B1".

C2.6 If node is not pruned, create a child node child of node for each "useful" item i marked in "Summary of the invention, B1.1.2", with child.item = i. Push the first child (in left-to-right order) onto traversal.

C2.7 If node has no child nodes, remove node and its ancestors one by one along the path from node to root, until an ancestor that has other children is reached.

The key point of part three of the invention is that depth-first search further improves the space scalability of the mining method on top of "Summary of the invention, A", and makes it possible to mine (high) utility patterns directly without generating candidate patterns.
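Putting the three parts together, a compact single-phase miner might look like the Python sketch below. It is a simplified illustration, not the exact procedure of steps C1 to C2.7: all surviving children are pushed onto the stack at once instead of using the first-child and right-sibling scheme, TS(X) is carried as a list of transaction indices instead of being obtained by virtual projection, and the bounds follow steps B1.1.1 and B1.2.1. The function name `mine` and the toy data are assumptions, and minutil is assumed to be positive.

```python
UT = {"a": 2, "b": 3, "c": 1}                                # hypothetical eu(i)
D = [{"a": 1, "b": 2}, {"b": 1, "c": 4}, {"a": 3, "b": 1, "c": 2}]

def mine(D, UT, minutil):
    """Return {pattern tuple in Omega order: utility} for all u(X) >= minutil."""
    cols = [{i: q * UT[i] for i, q in t.items()} for t in D]   # u(i, t)
    twu = {}
    for col in cols:
        for i in col:
            twu[i] = twu.get(i, 0) + sum(col.values())
    omega = sorted(twu, key=lambda i: -twu[i])
    rank = {i: r for r, i in enumerate(omega)}

    found = {}
    stack = [((), list(range(len(cols))))]       # (pattern, TS(pattern))
    while stack:
        pattern, ts = stack.pop()
        first = rank[pattern[0]] if pattern else len(omega)
        util, ub1 = {}, {}
        for tid in ts:
            col = cols[tid]
            base = sum(col[i] for i in pattern)
            before = [i for i in col if rank[i] < first]
            pre = base + sum(col[i] for i in before)
            for i in before:
                util[i] = util.get(i, 0) + base + col[i]   # u of {i} + pattern
                ub1[i] = ub1.get(i, 0) + pre               # ubound_1
        useful = [i for i in omega[:first] if ub1.get(i, 0) >= minutil]
        for i in useful:
            child = (i,) + pattern
            if util.get(i, 0) >= minutil:
                found[child] = util[i]
            child_ts = [tid for tid in ts if i in cols[tid]]
            ub2 = sum(u for tid in child_ts for k, u in cols[tid].items()
                      if k in child or (k in useful and rank[k] < rank[i]))
            if ub2 >= minutil:                   # otherwise the subtree is pruned
                stack.append((child, child_ts))
    return found

result = mine(D, UT, minutil=12)
```

Patterns are reported as soon as their utility is known, so no candidate set is kept for a second pass; only the stack of pending nodes is held in memory.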

Description of drawings

Fig. 1: Workflow integrating A, the data representation based on a sparse matrix and virtual projection; B, the prefix growth strategy, prefix growth tree and pruning method; and C, the depth-first dynamic search method

Fig. 2: Transaction database D

Fig. 3: Utility information table UT

Fig. 4: Column list Φ(t) corresponding to each transaction t

Fig. 5: A sparse matrix implemented with linear linked lists

Fig. 6: Prefix growth tree after pruning (partial)

Fig. 7: Sparse matrix obtained after the virtual projection TS({d})

Fig. 8: Instantaneous state of the traversal stack that controls the depth-first dynamic search

Fig. 9: Instantaneous state of the prefix growth tree PGT

Embodiment

The present invention, "a data mining method for quickly finding utility patterns", proposes three innovative techniques. Fig. 1 summarizes how the three are integrated.

The technical scheme is divided into two processes and further described below with reference to the drawings and an example (the transaction database D of Fig. 2, the utility information table UT of Fig. 3, and the utility threshold minutil = 30).

Process one: a sparse matrix implemented with linear linked lists is used to express the database D and the utility information table UT together, i.e. the transaction set TS({}) that supports the empty pattern {} and its complete utility information ("Summary of the invention, A1").

The concrete steps of process one are as follows:

1.1 ScanDatabaseOnceforOmega: perform step A1.1 of the Summary of the invention.

Scan the database D a first time and, using the utility information table UT, compute the utility upper bound ubound₁(i, {}) of each item i by formula (B1.1) of "Summary of the invention, B1".

For example, for the transaction database D of Fig. 2 and the utility information table UT of Fig. 3, the result is: ubound₁(a, {}) = 96, ubound₁(b, {}) = 88, ubound₁(c, {}) = 65, ubound₁(d, {}) = 61, ubound₁(e, {}) = 58, ubound₁(f, {}) = 38, and ubound₁(g, {}) = 30.

Sorting by the utility upper bound ubound₁(i, {}) in descending order yields the item order Ω.

For example, Ω = < a, b, c, d, e, f, g >.
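The ordering step of this example can be reproduced with a few lines of Python; the upper-bound values are the ones quoted in the example, while the variable names are illustrative.

```python
# ubound_1(i, {}) values quoted in the example for the database of Fig. 2.
ubound1 = {"a": 96, "b": 88, "c": 65, "d": 61, "e": 58, "f": 38, "g": 30}

# Descending sort by upper bound gives the item order Omega.
omega = sorted(ubound1, key=ubound1.get, reverse=True)
```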

1.2 CreateSparseMatrix: perform steps A1.2 to A1.3 of the Summary of the invention.

Scan the database D a second time. For each transaction t read in, build a linear linked list in which each element stores the utility u(i, t) of one item i in t, with the elements arranged in Ω order. This list holds all non-zero elements of column t of the sparse matrix; it is called a column list and denoted Φ(t).

For example, the column lists Φ(t₁) to Φ(t₅) corresponding to transactions t₁ to t₅ are shown in Fig. 4.

Link the elements of the column lists by row, with Ω doubling as the row header table. The row header entry Ω(i) points to the first non-zero element of row i of the matrix. The elements linked from Ω(i) form a row list.

For example, Fig. 5 shows the sparse matrix obtained after linking by row, which completely expresses the utility information of TS({}).

Process two: the depth-first dynamic search method ("Summary of the invention, C") is used to traverse the prefix growth tree and enumerate patterns ("Summary of the invention, B"); the supporting transaction set of each enumerated pattern is obtained by virtual projection ("Summary of the invention, A2"); utilities are computed and the search space is pruned by utility upper bounds ("Summary of the invention, B1"). All (high) utility patterns are found in this process.

For example, the partial prefix growth tree PGT after pruning shown in Fig. 6 records the execution of process two. The PGT nodes in the figure are labeled in two ways. Nodes of the first type are labeled with an item only; they are never actually grown and are therefore also struck through. Nodes of the second type are labeled with both an item and its supporting transaction set; they are actually grown and undergo virtual projection.

The concrete steps of process two are as follows:

2.1 CreatePGTnullroot: construct the root node root of the prefix growth tree PGT and set root.item to empty. Create a stack traversal and push root onto it.

For example, the "empty, TS({})" node in Fig. 6 is root.

2.2 While traversal is not empty, perform:

2.2.1 PopTopElementoutofStack: pop the top node from traversal and store it in node.

2.2.2 PushRightSiblingintoStack: if node has a right sibling, push that sibling onto traversal.

For example, when the "d, TS({d})" node of Fig. 6 is visited, the "e, TS({e})" node is pushed onto traversal, because it is the right sibling of the former.

2.2.3 GetTSforCurrentNode: obtain the transaction set TS(node.pattern) that supports node.pattern, i.e. all column lists reachable through the row header entry Ω(node.item) of the sparse matrix. Note in particular that if node.item is empty, i.e. node is the root of the prefix growth tree PGT, all column lists of the TS({}) built in process one are accessed through the special row header entry "Ω(empty)".

For example, when the "d, TS({d})" node of Fig. 6 is visited, the row header entry Ω(d) of the sparse matrix of Fig. 5 links to the column lists Φ(t₃), Φ(t₄) and Φ(t₅), which are all the transactions of TS({d}).

2.2.4 ComputeUtilitiesandUpperBounds: for each item i that ranks before node.item, compute by formula (A) the utility of the pattern Y = {i} ∪ node.pattern obtained by prepending i to node.pattern; if u(Y) ≥ minutil, then Y is a (high) utility pattern. Compute ubound₁(i, node.pattern) by formula (B1.1); if ubound₁(i, node.pattern) < minutil, mark i as an "irrelevant" item, otherwise mark it as a "useful" item.

For example, when the "d, TS({d})" node of Fig. 6 is visited, we obtain u({a, d}) = 22 < minutil, u({b, d}) = 25 < minutil and u({c, d}) = 9 < minutil, so none of {a, d}, {b, d}, {c, d} is a (high) utility pattern. We also obtain ubound₁(a, {d}) = 36 ≥ minutil, ubound₁(b, {d}) = 36 ≥ minutil and ubound₁(c, {d}) = 13 < minutil, so a and b are "useful" items and c is an "irrelevant" item.
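The two decisions made at the "d, TS({d})" node in this example can be replayed in Python using the values quoted above; the dictionary and list names are illustrative.

```python
minutil = 30
# Values quoted in the example for node "d": u({i, d}) and ubound_1(i, {d}).
utility = {"a": 22, "b": 25, "c": 9}
ubound1 = {"a": 36, "b": 36, "c": 13}

# Items whose extension {i, d} is itself a (high) utility pattern.
high_utility = [i for i in utility if utility[i] >= minutil]
# Items kept as "useful" versus marked "irrelevant" (step B1.1.2).
useful = [i for i in ubound1 if ubound1[i] >= minutil]
irrelevant = [i for i in ubound1 if ubound1[i] < minutil]
```

An item can be "useful" even though its own extension is not a (high) utility pattern: the upper bound only says that some longer pattern in the subtree might still reach the threshold.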

2.2.5 PseudoProjection: perform the virtual projection of "Summary of the invention, A2" to prepare the supporting transaction sets for the child nodes of node that are about to be created. First, clear the row header entries that rank before i = node.item, i.e. for k < i, set Ω(k) to empty. Then, starting from Ω(node.item), visit each linked column list Φ(t) in turn and add each element k of Φ(t) that ranks before i to the row list headed by Ω(k).

For example, when the "d, TS({d})" node of Fig. 6 is visited, the sparse matrix obtained after the virtual projection TS({d}) is shown in Fig. 7. Note that the matrix after virtual projection shares the same memory as the original matrix; Φ(t₁), which supports e, and Φ(t₂), which supports f, can still be reached through Ω(e) and Ω(f).

2.2.6 PruneChildrenwithSmallUppersandGrowOthers: perform the pruning of "Summary of the invention, B1". That is, compute by formula (B1.2) the utility upper bound of the pattern Y = {i} ∪ node.pattern obtained by prepending each "useful" item i to node.pattern. If ubound₂(Y) < minutil, the prefix growth subtree corresponding to Y cannot contain any (high) utility pattern, so the corresponding child node need not be grown, i.e. it is pruned; otherwise create a child node child of node with child.item = i.

For example, when the "d, TS({d})" node of Fig. 6 is visited, c is an "irrelevant" item and is pruned directly. Formula (B1.2) gives ubound₂({a, d}) = 22 < minutil and ubound₂({b, d}) = 31 ≥ minutil, so no child node is grown for item a and a child node is grown only for b, namely the "b, TS({b, d})" node of Fig. 6.

2.2.7 PushFirstChildintoStackorPurgePath: if node has child nodes, push its first child onto traversal; otherwise remove node and its ancestors one by one along the path from node to root, until an ancestor that has other children is reached.

For example, when the "a, TS({a, b, d})" node of Fig. 6 is visited, {a, b, d} has already been confirmed as a (high) utility pattern in the previous step (its utility is 31 ≥ minutil). Since the node has no child nodes, the path from this node up toward the root can be deleted. At this point the only element on the traversal stack is the "e, TS({e})" node (in fact an address pointer to that node), see Fig. 8. The prefix growth tree branch actually stored in memory consists of the root and its three child nodes that have not yet been visited, the "e, TS({e})", "f, TS({f})" and "g, TS({g})" nodes, see Fig. 9.

Performance evaluation: performance experiments on the present invention, "a data mining method for quickly finding utility patterns", show that its time efficiency is 1 to 3 orders of magnitude higher than that of three reference mining methods, and its memory usage is 40% to 90% lower.

Summary: the present invention designs "a data mining method for quickly finding utility patterns". The data representation based on a sparse matrix and virtual projection minimizes the storage used for the transaction sets that support utility patterns and guarantees the space scalability of the mining method. The prefix growth strategy, the prefix growth tree and its pruning method guarantee the time efficiency of the mining method. The depth-first dynamic search method further improves space scalability, so that (high) utility patterns are mined directly in a single phase without generating a candidate pattern set. The mining method has broad application prospects in mining massive data, particularly in network information search and knowledge discovery.

Claims

1. A data mining method for quickly finding utility patterns, which, given a transaction database D, a utility information table UT and a utility threshold minutil, finds the patterns whose utility is not less than minutil with minimal memory and maximal speed, comprising the concrete method flows of the following three core techniques:

A. Data representation based on a sparse matrix and virtual projection.

B. Prefix growth strategy, prefix growth tree and pruning method.

C. Depth-first dynamic search method.

2. The data mining method for quickly finding utility patterns according to claim 1, wherein the concrete method flow of core technique A is as follows:

A1.1 Scan the database D a first time and, using the utility information table UT, compute the utility upper bound ubound₁(i, {}) of each item i by formula (B1.1) of "Summary of the invention, B1"; sorting in descending order yields the item order Ω.

A2.1 Take each item i of X in reverse Ω order.

A2.2 Let i₀ be the first element of X in Ω order. The submatrix formed by all column lists threaded by Ω(i₀) is TS(X), i.e. the set of transactions that support pattern X.

3. The single-phase method for fast mining of utility patterns according to claim 1, wherein the concrete method flow of core technique B is as follows:

\mathrm{ubound}_{1}(i, X) = \sum_{t \in TS(X),\ i \in t} u(\mathrm{preEXT}(X, t), t)

Formula (B1.1)

\mathrm{ubound}_{2}(X) = \sum_{t \in TS(X)} u(\mathrm{ppEXT}(X, t), t)

Formula (B1.2)

ubound₂(node.pattern) < minutil    Formula (B1.3)

4. The single-phase method for fast mining of utility patterns according to claim 1, wherein the concrete method flow of core technique C is as follows:

C2.1 Pop the top node from traversal and store it in node.

C2.2 If node has a right sibling, push that sibling onto traversal.

C2.3 Perform the virtual projection of "Summary of the invention, A2" to obtain the transaction set TS(node.pattern) that supports node.pattern.

C2.5 Perform the pruning of "Summary of the invention, B1".