CN106250549B

CN106250549B - A memory-based frequent pattern mining method

Info

Publication number: CN106250549B
Application number: CN201610662641.2A
Authority: CN
Inventors: 刘铎; 林怡; 黄柏钧; 朱潇
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2016-08-14
Filing date: 2016-08-14
Publication date: 2019-09-20
Anticipated expiration: 2036-08-14
Also published as: CN106250549A

Abstract

The invention discloses a memory-based frequent pattern mining method comprising the following steps. Step 1, construct the initial frequent pattern tree: create the root node T of the frequent pattern tree and label it "null"; scan the database again, select the frequent items in each transaction read and sort them according to the order in the list L; after sorting, construct a path of the frequent pattern tree with null as the root node, incrementing by 1 only the count of the last node on the path while the counts of the other nodes on the path remain unchanged; the initial frequent pattern tree is obtained after all transactions in the database have been scanned in turn. Step 2, traverse the initial frequent pattern tree with a depth-first search algorithm; the counter value of each traversed node is the node's own value plus the values of all its child nodes. The technical effects of the invention are that it reduces write operations to NVM and builds the frequent pattern tree quickly, and that it reduces the large number of intensive writes to the count fields of nodes near the root, thereby extending NVM lifetime.

Description

A memory-based frequent pattern mining method

Technical field

The invention belongs to the field of memory technology, and in particular relates to a memory-based frequent pattern mining method.

Background art

As computer technology has matured, data analysis has developed greatly since it was established in the 20th century. Data analysis can find and extract items of interest from massive data and thereby provide guidance to decision-making bodies. Machine learning and data mining can reveal the information hidden behind data and have become key technologies for data analysis.

In the field of data mining, finding frequent items or frequent patterns in a data set is an important research topic. It is the basis of many important data mining tasks such as association analysis, sequential patterns, causality and emerging patterns. Techniques such as Apriori and FP-tree are currently used to handle the frequent pattern mining problem.

Memory-based frequent pattern mining requires that the data to be mined and the data elements be stored in byte-addressable memory, and DRAM needs continuous power to retain data. Energy efficiency and persistence are therefore likely to become key design issues in data mining systems. To address such issues in memory-based data analysis, nonvolatile memories (NVM) such as phase-change memory (PCM) are generally considered an excellent substitute for DRAM because of their non-volatility and energy efficiency. Using NVM as main memory, however, raises the following problems. First, read and write costs on NVM differ considerably: a write operation usually costs more time and energy than a read operation. Second, the number of writes NVM can endure is limited, and non-uniform writes usually accelerate the failure of the whole NVM device. Because these essential characteristics of NVM are not taken into account, data mining and machine learning algorithms currently run on NVM seriously degrade the performance and lifetime of the storage system.

The prior art uses a technical solution called the FP-tree algorithm. It improves on the Apriori algorithm by compressing the key information about frequent patterns into a frequent pattern tree (FP-tree) structure, which eliminates the costly candidate itemsets of Apriori and thus removes its performance bottleneck. In short, the FP-tree algorithm accomplishes the function of the Apriori algorithm without generating candidate itemsets.

J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation", ACM SIGMOD International Conference on Management of Data (SIGMOD '00), 29(2): 1-12, May 2000, describes the steps of the FP-tree algorithm as follows:

(1) Scan the entire transaction database D once to obtain the support count of every item contained in D. Exclude the items whose support count is less than the threshold; the remaining items are the frequent items. Arrange the frequent items in descending order of support count to obtain a list L;

(2) Create the root node T of the FP-tree and label it "null". Scan the transaction database again. For each transaction in D, select its frequent items and sort them according to the order in L. Let the sorted frequent-item list be [p | P], where p is the first frequent item and P is the list of remaining frequent items. Call insert_tree([p | P], T). The procedure insert_tree([p | P], T) executes as follows: if T has a child N such that N.item_name = p.item_name, increase the count of N by 1; otherwise create a new node N, set its count to 1, and link it to its parent node T. If P is non-empty, recursively call insert_tree(P, N).
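
By way of illustration only, the two scans above may be sketched in Python as follows. Only insert_tree is named in the text; the names Node, frequent_list and build_fp_tree, the toy database, and the alphabetical tie-break between items of equal support are assumptions introduced here.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item label, a count field, and child links."""

    def __init__(self, item, count=0):
        self.item = item
        self.count = count
        self.children = {}  # item -> Node


def frequent_list(db, min_support):
    """Scan 1: support counts, drop infrequent items, sort by descending count."""
    support = Counter(item for tx in db for item in set(tx))
    frequent = [i for i in support if support[i] >= min_support]
    # Assumption: ties are broken alphabetically; the text gives no tie rule.
    return sorted(frequent, key=lambda i: (-support[i], i))


def insert_tree(items, node):
    """insert_tree([p | P], T): bump or create the child for p, then recurse on P."""
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is None:
        child = Node(p, 1)        # new node N with count 1, linked to its parent
        node.children[p] = child
    else:
        child.count += 1          # existing child N: increase its count by 1
    insert_tree(rest, child)


def build_fp_tree(db, min_support):
    order = frequent_list(db, min_support)
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None)             # root T, labelled "null"
    for tx in db:                 # scan 2
        items = sorted((i for i in set(tx) if i in rank), key=rank.get)
        insert_tree(items, root)
    return root, order


# Hypothetical toy database, not the one in Fig. 1.
db = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["b", "d"]]
root, order = build_fp_tree(db, min_support=2)
print(order)                       # ['b', 'a', 'c', 'd']
print(root.children["b"].count)    # 4
```

Note that every transaction writes the count field of each node on its path, so node b, which lies next to the root, is written four times for four transactions.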

With the above steps, a complete FP-tree has been built. Mining the FP-tree in bottom-up order then yields the required frequent patterns. In brief, the algorithm uses the information in the transaction database to construct an FP-tree and then mines frequent patterns from the FP-tree. Its core idea is to compress the database directly into a frequent pattern tree and then generate association rules from that tree.

Fig. 1 gives an example of the FP-tree construction process. Fig. 1(a) is the database, where "transaction ID" is the serial number of each transaction record, "items" are all the items in each transaction record, and "sorted items" are the items arranged in descending order of their number of occurrences. First, a node labeled null is created as the root node of the entire frequent pattern tree. After the first transaction record is scanned, node a is created and the value of its count field is set to 1, indicating that item a has occurred once, as shown in Fig. 1(b). After the second transaction record is scanned, nodes b, c and d are created in turn, each with a count field value of 1, indicating that items b, c and d have each occurred once, as shown in Fig. 1(c). After all transaction records in the database have been scanned in turn, the complete FP-tree is as shown in Fig. 1(d), where each letter denotes an item in the database and the number after the letter is the value stored in the count field, that is, the number of times the item occurs in the database.

However, the FP-tree algorithm has the following problems. During construction of the frequent pattern tree, every item scanned in a transaction triggers an update of the FP-tree, that is, a write to the count field of the corresponding node. This causes a large number of repeated write operations and a huge memory overhead. Moreover, the closer a node is to the root, the more writes it receives, and such intensive writes shorten the lifetime of NVM.

Summary of the invention

In view of the problems of the prior art, the technical problem to be solved by the invention is to provide a memory-based frequent pattern mining method that reduces write operations to NVM during construction of the frequent pattern tree and avoids intensive writes, thereby extending NVM lifetime.

The technical problem is solved by the following technical solution, which comprises the following steps:

Step 1: construct the initial frequent pattern tree

1) Scan each transaction record in the database in turn to obtain the support count of every item contained in the database. Exclude the items whose support count is less than the threshold; the remaining items are the frequent items. Arrange the frequent items in descending order of support count to obtain a list L;

2) Create the root node T of the frequent pattern tree and label it "null";

3) Scan the database again, select the frequent items in each transaction read, and sort them according to the order in L. After sorting, construct a path of the frequent pattern tree with null as the root node, incrementing by 1 only the count of the last node on the path; the counts of the other nodes on the path remain unchanged. The initial frequent pattern tree is obtained after all transactions in the database have been scanned in turn;

Step 2: traverse the initial frequent pattern tree with a depth-first search algorithm; the counter value of each traversed node is the node's own value plus the values of all its child nodes.

In the frequent pattern tree of the invention, the value of the count field of every element is the number of times that element occurs in the entire database, the same as in the tree constructed by the prior-art frequent pattern mining algorithm.
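
Why the two trees agree can be sketched with a short derivation. The notation is introduced here for illustration and does not appear in the source: e(v) is the count stored at node v in the initial tree (the number of transactions whose sorted frequent-item list ends at v), c(v) is the count of v in the prior-art tree (the number of transactions whose path passes through v), ch(v) is the set of children of v, and sub(v) is the subtree rooted at v, including v itself. Every transaction whose path passes through v ends at exactly one node of sub(v), so, provided the children's values are final before they are added to the parent:

```latex
c(v) \;=\; \sum_{u \in \mathrm{sub}(v)} e(u)
     \;=\; e(v) + \sum_{w \in \mathrm{ch}(v)} \; \sum_{u \in \mathrm{sub}(w)} e(u)
     \;=\; e(v) + \sum_{w \in \mathrm{ch}(v)} c(w)
```

The last expression is the update applied in Step 2: the node's own value plus the values of all its child nodes.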

Compared with the prior art, the technical effects of the invention are as follows:

The invention no longer updates the count fields of all nodes touched by the current transaction, which avoids the large number of repeated writes during construction of the frequent pattern tree, reduces write operations to NVM, and builds the frequent pattern tree quickly. It also reduces the large number of intensive writes to the count fields of nodes near the root, extending NVM lifetime.
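
The effect can be illustrated with a rough write-count model. The sketch below is not the test setup of the invention; the function name count_field_writes, the toy transactions and the cost model (one write when a node is created, one write for each later change of its count field, reads and pointer writes ignored) are all assumptions made here.

```python
from collections import Counter


def count_field_writes(sorted_txs):
    """Count count-field writes per node for both schemes.

    Nodes are identified by their path from the root (a tuple of items).
    Assumed cost model: creating a node writes its count field once, and
    every later change to the field is one more write.
    """
    prior_art = Counter()
    proposed = Counter()
    for tx in sorted_txs:
        for k in range(1, len(tx) + 1):
            path = tuple(tx[:k])
            is_last = k == len(tx)
            # Prior art: every node on the path is created (count 1) or incremented.
            prior_art[path] += 1
            # Proposed: write on creation, or when an existing node ends the transaction.
            if path not in proposed or is_last:
                proposed[path] += 1
    # Step 2: the depth-first pass rewrites each node that has children once.
    internal = {path[:-1] for path in proposed if len(path) > 1}
    for path in internal:
        proposed[path] += 1
    return prior_art, proposed


# Hypothetical transactions, already filtered and sorted by the list L.
txs = [("b", "a"), ("b", "c", "d"), ("b", "a", "c"), ("b", "d")] * 25
prior_art, proposed = count_field_writes(txs)
print(sum(prior_art.values()), sum(proposed.values()))  # 250 105
print(prior_art[("b",)], proposed[("b",)])              # 100 2
```

Under this model the node next to the root receives 2 writes instead of 100, which is the kind of concentrated wear the invention aims to avoid.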

Description of the drawings

The drawings of the invention are described as follows:

Fig. 1 is an example of constructing the frequent pattern tree in the background art;

Fig. 2 is a flowchart of constructing the initial frequent pattern tree according to the invention;

Fig. 3 is an example of constructing the frequent pattern tree of the invention;

Fig. 4 is a comparison chart for the read-operation test;

Fig. 5 is a comparison chart for the write-operation test;

Fig. 6 is a comparison chart for the tree-construction time test;

Fig. 7 is a comparison chart for the PCM lifetime test.

Specific embodiment

The invention is further described below with reference to the drawings and embodiments:

The inputs of the invention are a database and a minimum support threshold σ; the output is an FP-tree.

The invention comprises the following steps:

Step 1: construct the initial frequent pattern tree

1) Scan each transaction record in the database in turn to obtain the support count of every item contained in the database. Exclude the items whose support count is less than the threshold; the remaining items are the frequent items. Arrange the frequent items in descending order of support count to obtain a list L;

2) Create the root node T of the frequent pattern tree and label it "null";

3) Scan the database again, select the frequent items in each transaction read, and sort them according to the order in L. After sorting, construct a path of the frequent pattern tree with null as the root node, incrementing by 1 only the count of the last node on the path; the counts of the other nodes on the path remain unchanged. The initial frequent pattern tree is obtained after all transactions in the database have been scanned in turn.

Fig. 2 is a flowchart of constructing the initial frequent pattern tree according to the invention. The process is as follows:

In step S21, delete from each transaction the items that do not reach the minimum support, and sort the remaining items in descending order of their number of occurrences;

In step S22, scan the transactions in the database one by one;

In step S23, scan the items of the transaction in turn, from front to back, traversing down the tree from the root node;

In step S24, determine whether the current item is the last item in the transaction; if so, go to step S25; if not, go to step S27;

In step S25, determine whether a corresponding node exists in the tree; if it exists, go to step S26; if not, go to step S29;

In step S26, increment the value of the count field of this item's node by 1; then go to step S210;

In step S27, determine whether a corresponding node exists in the tree; if it exists, return to step S23; if not, go to step S28;

In step S28, create a new node and set the value of its count field to 0; then return to step S23;

In step S29, create a new node and set the value of its count field to 1; then go to step S210;

In step S210, determine whether all transactions have been scanned; if not, return to step S22; if so, go to step S211;

In step S211, the procedure ends;
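
The flow of steps S22 to S211 may be sketched as follows, assuming step S21 (filtering and sorting) has already been applied to each transaction. The names Node and build_initial_tree and the toy transactions are assumptions introduced here; the comments map each line to the step it is meant to mirror.

```python
class Node:
    """Tree node with an item label, a count field and child links."""

    def __init__(self, item, count=0):
        self.item = item
        self.count = count
        self.children = {}  # item -> Node


def build_initial_tree(sorted_txs):
    """Build the initial tree: only the last node of each path is counted."""
    root = Node(None)                         # root labelled "null"
    for tx in sorted_txs:                     # S22 / S210: loop over transactions
        node = root
        for pos, item in enumerate(tx):       # S23: walk items front to back
            is_last = pos == len(tx) - 1      # S24: last item of the transaction?
            child = node.children.get(item)   # S25 / S27: does the node exist?
            if child is None:
                # S29: last item -> count 1; S28: inner item -> count 0
                child = Node(item, 1 if is_last else 0)
                node.children[item] = child
            elif is_last:
                child.count += 1              # S26: count only the last node
            node = child
    return root                               # S211: done


# Hypothetical transactions, already filtered and sorted (not those of Fig. 3).
txs = [["b", "a"], ["b", "c", "d"], ["b", "a", "c"], ["b", "d"]]
root = build_initial_tree(txs)
b = root.children["b"]
print(b.count, b.children["a"].count, b.children["c"].count)  # 0 1 0
```

In this sketch the counts stored in the initial tree always sum to the number of non-empty transactions, since each transaction contributes exactly one increment.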

Step 2: construct the complete frequent pattern tree

Traverse the initial frequent pattern tree with a depth-first search algorithm; the counter value of each traversed node is the node's own value plus the values of all its child nodes.

Embodiment

Fig. 3 is an example of constructing a frequent pattern tree according to the invention. This embodiment comprises the following steps:

Step 1: construct the initial frequent pattern tree from the database of Fig. 3(a). The specific process is as follows:

As shown in Fig. 3(b), a node labeled null is created as the root node of the entire frequent pattern tree. After the first transaction record is scanned, node a is created and its count field value is set to 1, indicating that item a has occurred once;

As shown in Fig. 3(c), after the second transaction record is scanned, nodes b, c and d are created; the count field values of b and c are set to 0 and that of d is set to 1, indicating that item d has occurred once. (To reduce redundant writes while building the frequent pattern tree, the numbers of occurrences of b and c are not recorded at this point; only the number of occurrences of item d, which is at the end of this transaction record, is recorded, because the numbers of occurrences of b and c can later be obtained from the count field values of their child nodes.);

As shown in Fig. 3(d), the initial tree is obtained after all transaction records in the database have been scanned in turn;

Step 2: construct the complete frequent pattern tree

As shown in Fig. 3(e), the initial frequent pattern tree is traversed with a depth-first search algorithm; the counter value of each traversed node is the node's own value plus the values of all its child nodes. For example, the value of the count field of c is the sum of c's original value 0 and the value 5 of d's count field, so c occurs 5 times; the value of the count field of f is the sum of the values of f's child nodes e and g and f's original value 3, so f occurs 6 times. Once the whole tree has been traversed, the complete frequent pattern tree has been constructed.
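
A minimal sketch of this accumulation step is given below. It uses a post-order depth-first traversal so that each child's value is final before it is added to its parent. The names Node and accumulate are assumptions, and the small tree is hypothetical: it reuses the values quoted above for c, d and f, while the individual values 1 and 2 for e and g are made up (the text only implies that they sum to 3) and the tree is not the one of Fig. 3.

```python
class Node:
    """Tree node with an item label, a count field and a list of children."""

    def __init__(self, item, count=0, children=()):
        self.item = item
        self.count = count
        self.children = list(children)


def accumulate(node):
    """Post-order DFS: finalise the children, then add their counts to the node."""
    total = node.count
    for child in node.children:
        total += accumulate(child)
    if node.children:
        node.count = total   # one write per node that has children
    return total


# Hypothetical initial tree: c(0) -> d(5), and f(3) -> e(1), g(2).
d = Node("d", 5)
c = Node("c", 0, [d])
e = Node("e", 1)
g = Node("g", 2)
f = Node("f", 3, [e, g])
root = Node(None, 0, [c, f])

for child in root.children:   # the "null" root itself carries no count
    accumulate(child)
print(c.count, f.count)       # 5 6
```

Leaf nodes are left untouched in this sketch, so the second pass writes each count field at most once.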

Experimental tests

Different types of data sets were selected for testing, and the number of read and write operations, the total tree construction time and the PCM lifetime were measured for each data set. The data sets are T10I4D100K, T40I10D100K, chess, mushroom, pumsb*, connect, pumsb, accidents, C73D10 and C20D10.

The experimental results are shown in Figs. 4 to 7:

In Fig. 4, the ordinate is the number of reads and the abscissa lists the data sets. As Fig. 4 shows, the invention eliminates a large number of read operations;

In Fig. 5, the ordinate is the number of writes and the abscissa lists the data sets. As Fig. 5 shows, the invention eliminates a large number of write operations;

In Fig. 6, the ordinate is the total tree construction time and the abscissa lists the data sets. As Fig. 6 shows, the invention reduces the tree construction time;

In Fig. 7, the ordinate is the total number of transactions processed before the PCM wears out, and the abscissa lists the data sets. As Fig. 7 shows, the invention extends PCM lifetime by at least 16.67% (on data set T40I10D100K) and by up to 99.05% (on data set connect), greatly extending the lifetime of the PCM.

Claims

1. A memory-based frequent pattern mining method, characterized by comprising the following steps:

Step 1: construct the initial frequent pattern tree

1) Scan each transaction record in the database in turn to obtain the support count of every item contained in the database. Exclude the items whose support count is less than the threshold; the remaining items are the frequent items. Arrange the frequent items in descending order of support count to obtain a list L;

2) Create the root node T of the frequent pattern tree and label it "null";

3) Scan the database again, select the frequent items in each transaction read, and sort them according to the order in L. After sorting, construct a path of the frequent pattern tree with null as the root node, incrementing by 1 only the count of the last node on the path; the counts of the other nodes on the path remain unchanged. The initial frequent pattern tree is obtained after all transactions in the database have been scanned in turn;

Step 2: traverse the initial frequent pattern tree with a depth-first search algorithm; the counter value of each traversed node is the node's own value plus the values of all its child nodes.

2. The memory-based frequent pattern mining method according to claim 1, characterized in that the specific process of sub-step 3) of step 1 is as follows:

In step S21, delete from each transaction the items that do not reach the minimum support, and sort the remaining items in descending order of their number of occurrences;

In step S22, scan the transactions in the database one by one;

In step S23, scan the items of the transaction in turn, from front to back, traversing down the tree from the root node;

In step S24, determine whether the current item is the last item in the transaction; if so, go to step S25; if not, go to step S27;

In step S25, determine whether a corresponding node exists in the tree; if it exists, go to step S26; if not, go to step S29;

In step S26, increment the value of the count field of this item's node by 1; then go to step S210;

In step S27, determine whether a corresponding node exists in the tree; if it exists, return to step S23; if not, go to step S28;

In step S28, create a new node and set the value of its count field to 0; then return to step S23;

In step S29, create a new node and set the value of its count field to 1; then go to step S210;

In step S210, determine whether all transactions have been scanned; if not, return to step S22; if so, go to step S211;

In step S211, the procedure ends.