CN106815302A

CN106815302A - A frequent itemset mining method applied to game item recommendation

Info

Publication number: CN106815302A
Application number: CN201611144649.6A
Authority: CN
Inventors: 金海; 张舫; 张宇; 廖小飞
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2016-12-13
Filing date: 2016-12-13
Publication date: 2017-06-09

Abstract

The present invention provides a frequent itemset mining method and belongs to the technical field of data mining. The method first counts the occurrences of each item on MapReduce, sorts the items, filters them against a threshold to reject items that do not qualify, and obtains the F-List. The F-List is then partitioned to obtain the G-List. According to the G-List partition, the data are passed to the Mappers, which process them and pass them on to the Reducers, where the mining is carried out. Mining first builds the PPC-tree on each Reducer, then derives the N-Lists from it and the G-Subsume of the corresponding items on each Reducer, and finally obtains the frequent itemsets recursively from the N-Lists and G-Subsume. The invention partitions the data according to estimated load, which ensures load balancing, and optimizes the recursive mining flow, which greatly reduces the mining time on dense data.

Description

A frequent itemset mining method applied to game item recommendation

Technical field

The invention belongs to the field of data mining and, more particularly, relates to a frequent itemset mining method.

Background technology

Since its inception, data mining has aimed to discover valuable information hidden in data. It distinguishes six kinds of patterns: classification, clustering, regression, association, sequence and deviation patterns. Association pattern analysis is one of its important research directions, and frequent itemset mining is an important component of association rule mining algorithms. Frequent itemset mining algorithms can find useful rules in big data and apply to many fields, such as web log mining, retail, recommending financial services of likely interest to different customer groups in the financial industry, and recommending game items. However, in the big data setting, traditional single-machine mining can no longer meet demand. Simply raising CPU speed and memory size is both too costly and unrealistic, because hardware development cannot keep up with the demand for computing speed. Parallel computing therefore becomes particularly important: improving or inventing data mining algorithms and combining them with a distributed computing model is currently a good alternative.

With the arrival of the networked information era, the online game industry has emerged. Online games fuse culture, art and high technology and offer a new form of entertainment. The industry is flourishing, its market keeps expanding, and online games are becoming a bellwether of the network economy. As the choice of games grows, players become more demanding, and only games that suit their players last in the market. Data mining has attracted great attention in the game industry, mainly because large amounts of data are available and there is an urgent need to convert them into useful information and knowledge, so as to improve game quality and operating efficiency and to win more users for game operators. Data mining is used extensively in many industries, but it has not been fully exploited in the online game market, and effective methods for processing game data are not yet clear.

Existing frequent itemset mining algorithms mainly have the following shortcomings:

1) the algorithms are too inefficient to produce results within a limited time;

2) parallel algorithms cannot divide the load evenly.

Summary of the invention

To address the defects of the prior art or its urgent technical needs, the invention discloses a frequent itemset mining method that runs in parallel on the MapReduce platform. It partitions the data according to estimated load, which ensures load balancing, and optimizes the recursive mining flow, which greatly reduces the mining time on dense data, thereby solving the problems of low algorithm efficiency and load imbalance.

To achieve the above object, the invention adopts the following technical scheme:

A frequent itemset mining method, comprising the following steps:

(1) count the occurrences of each item in the original data with MapReduce;

(2) select the frequent 1-items according to each item's occurrence count and sort them by occurrence count in descending order to form the F-List;

(3) group the items in the F-List according to the load balancing principle to obtain the G-List, which holds each item and the number of the group it belongs to;

(4) the Mapper distributes the original data:

(4-1) reorder the items of each record of the original data according to the item order in the F-List;

(4-2) read the items of each record starting from the last one; for each item, look up its group number gid in the G-List, use gid as the key and all the items ranked before the item in the record as the value, and emit the key-value pair <key=gid, value=items> as Mapper output; if the group number gid has already occurred for this record, skip the item; continue with the preceding item in the same way until the record has been fully processed;

(5) the Reducer mines frequent itemsets from the key-value pairs output by the Mapper:

(5-1) according to the key=gid output by the Mapper, value=items is distributed to the corresponding Reducer, which builds the PPC-tree; the PPC-tree is a tree structure in which each node has five attributes: name, support frequency, child nodes, pre-order traversal number pre and post-order traversal number post;

(5-2) for each node N_i in the PPC-tree, call <N_i.pre, N_i.post, N_i.frequency> its PP-code; sorting the PP-codes of each item in ascending order of pre yields the N-List of each frequent 1-item in the F-List;

(5-3) build the G-Subsume of the Reducer: G-Subsume(A) = {B ∈ I₁ | A ∈ I₁, A.gid ∈ Reducer.gid, g(A) ⊆ g(B)}, where A and B denote two different frequent 1-items, A.gid denotes the group number of item A, Reducer.gid denotes the group numbers assigned to the Reducer, g(X) denotes the set of IDs of the records that contain the frequent 1-item X (X = A or B), and I₁ denotes the set of frequent 1-items;

(5-4) recursive mining, with the following sub-steps:

a) take the F-List as the initial recursion data of the first round and take its last item L; combine L with its G-Subsume(L) to generate frequent 2-itemsets and write them to the result array Result;

b) take the items X of the initial recursion data one by one from front to back and compare the PP-codes of N_X, the N-List of X, with the PP-codes of N_Last, the N-List of L. If X is in G-Subsume(L), move on to the next item. Otherwise: when the pre of the current PP-code of N_X is less than the pre of the current PP-code of N_Last and its post is greater than the post of the PP-code of N_Last, generate the 2-itemset XL, add <N_X.PP-code.pre, N_X.PP-code.post, N_Last.PP-code.frequency> to N_XL, the N-List of XL, and advance to the next PP-code of N_Last; when the pre of the PP-code of N_X is less than the pre of the PP-code of N_Last and its post is also less, advance to the next PP-code of N_X; when the pre of the PP-code of N_X is greater than the pre of the PP-code of N_Last, advance to the next PP-code of N_Last; continue until the PP-codes of N_Last and N_X have all been traversed;

after the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the resulting N-List of XL does not reach the threshold, delete XL; otherwise XL is a frequent 2-itemset;

c) take the next item from the initial recursion data and repeat step b) until all items before the last item L have been compared; this yields the frequent 2-itemsets with suffix L and their N-Lists, which are written to the result array Result, and the N-Lists serve as the initial data for mining frequent 3-itemsets; the frequent 2-itemsets are merged directly with G-Subsume(L) to obtain some of the frequent 3-itemsets with suffix L, which are added to Result;

d) take the second-to-last item of the initial recursion data and repeat steps a), b) and c), and so on until all items of the initial recursion data have been processed; this yields all frequent 2-itemsets and some of the frequent 3-itemsets;

e) extract the frequent 2-itemsets that differ only in their prefix as the initial recursion data of the second round and, starting from the last one, process them in the same way as steps b) to d) to obtain all frequent 3-itemsets; combine each frequent 3-itemset whose suffix item has a G-Subsume with that G-Subsume to obtain frequent 4-itemsets;

f) and so on, until the N-List comparison yields only a single frequent K-itemset, at which point the recursion ends;

(5-5) the Reducer outputs <key=item ∈ gid, value=Result>, which completes the whole frequent itemset mining process.

Further, step (1) is implemented as follows:

(1-1) partition the original database horizontally; each resulting sub-file is called a Block, and the Blocks are assigned to the nodes of the cluster;

(1-2) each Block is the input of one Map function; for every item a_j of each record T_i in the Block, the Mapper outputs the key-value pair <key=a_j, value=1>;

(1-3) all key-value pairs with key=a_j are assigned to the same Reducer, so the Reducer's input is <key=a_j, value={1, 1, ..., 1}>; the Reducer sums once and outputs <key=a_j, value=sum{1, 1, ..., 1}>.

Further, the load balancing principle in step (3) is: take the sequence number of each item in the F-List as its load value and group the items in the F-List according to the load values.

Further, the G-List is stored in a hash table.

With the above scheme, the invention outperforms other parallel algorithm schemes and greatly improves mining performance on game data, specifically:

1) The method uses N-Lists, which reduces complexity. General frequent itemset mining methods recurse on sets, which takes up space and is far more complex than the recursion in this method. The method also uses a distinctive comparison that does not compare every pair of PP-codes in two N-Lists: comparing every pair has complexity O(mn), where m and n are the lengths of the two N-Lists, whereas this comparison has complexity O(m+n), which greatly reduces the recursion complexity;

2) The new concept G-Subsume is used for parallelization on MapReduce. During frequent itemset mining, G-Subsume reduces the number of N-List merge comparisons, because itemsets are merged directly with G-Subsume instead, which substantially improves mining efficiency;

3) In general, the G-List could be formed by taking remainders, but some items recurse for a long time and others for a short time, so the final result has to wait for the slowest item and the load becomes unbalanced. To balance the load, the invention estimates the load of each item in advance. Under depth-first traversal, the depth of the PPC-tree affects the time of its pre-order and post-order traversals: the deeper the tree, the more time they take. The maximum path length in the PPC-tree for each item equals its sequence number in the F-List, and the maximum length of the item's N-List equals the minimum of its support count and 2ⁿ - 1, where n is the item's sequence number in the F-List. From these two rules the load of each item is easy to estimate, which achieves the load balancing of the invention.

Brief description of the drawings

Fig. 1 is a flow chart of the frequent itemset mining method of the invention;

Fig. 2 is a flow chart of frequent itemset mining by the Mapper and the Reducer;

Fig. 3 shows the construction of the PPC-tree in the invention;

Fig. 4 is a flow chart of obtaining frequent 2-itemsets from frequent 1-itemsets during recursive mining;

Fig. 5 is a schematic diagram of the load balancing of the invention;

Fig. 6 is a schematic diagram of the MapReduce process of the invention.

Specific embodiment

To make the purpose, technical scheme and advantages of the invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

The terms used in the invention are explained first:

Frequent itemset: a collection of items is called an itemset; an itemset whose occurrence ratio reaches a given constant s is a frequent itemset.

Frequent K-itemset: an itemset that contains K items and is frequent is called a frequent K-itemset.

Support: the occurrence frequency of an itemset is the number of transactions that contain it, called the support of the itemset.

MapReduce: a programming model for parallel computation over large data sets (larger than 1 TB). Its main ideas, "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers without distributed parallel programming experience to run their programs on a distributed system. In current implementations the programmer specifies a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures that all mapped key-value pairs sharing the same key are grouped together.

Fig. 1 shows the flow chart of the frequent itemset mining method of the invention. The method runs on the MapReduce platform. It first counts the occurrences of each item on MapReduce, sorts the items, filters them against the threshold to reject items that do not qualify, and obtains the F-List. The F-List is then partitioned to obtain the G-List. According to the G-List partition, the records are passed to the Mappers, which process them and pass each transaction on to the Reducers, where the mining part of MapReduce is carried out. Mining first builds the PPC-tree on each Reducer, then derives the N-Lists from it and the G-Subsume of the corresponding items on each Reducer, and finally obtains the frequent itemsets recursively from the N-Lists and G-Subsume.

More specifically, the detailed flow of the frequent itemset mining method of the invention is as follows:

To achieve the above object, the invention comprises the following steps:

(1) count the occurrences of each item in the original data with MapReduce. The sub-steps are:

(1-1) partition the original database horizontally; each resulting sub-file is called a Block, and the Blocks are assigned to the nodes of the cluster; this step is performed automatically by the Hadoop platform;

(1-2) each Block is the input of one Map function; the Mapper's input key-value pair is <key, value=T_i>, where T_i denotes a record in the Block. For every item a_j in T_i, the Mapper outputs the key-value pair <key=a_j, value=1>;

(1-3) Reduce merges the key-value pairs from all Mappers. Specifically, all key-value pairs with key=a_j are assigned to the same Reducer, so the Reducer's input is <key=a_j, value={1, 1, ..., 1}>. The Reducer only needs to sum once and then outputs <key=a_j, value=sum{1, 1, ..., 1}>;
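The counting pass of step (1) can be sketched in plain Python as a single-process simulation of the map, shuffle and reduce phases. This is only an illustration: the function names map_count, reduce_count and count_items are hypothetical, and no Hadoop API is used.

```python
from collections import defaultdict


def map_count(transaction):
    """Mapper: emit <key=item, value=1> for every item of one record."""
    return [(item, 1) for item in transaction]


def reduce_count(pairs):
    """Shuffle and reduce: group the pairs by key, then sum the 1s once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


def count_items(transactions):
    """Run the simulated map phase over all records, then reduce."""
    pairs = [pair for t in transactions for pair in map_count(t)]
    return reduce_count(pairs)


# Hypothetical input: four small records.
transactions = [["A", "B", "C"], ["B", "C"], ["A", "B", "D"], ["B", "D"]]
counts = count_items(transactions)
print(counts)
```

On a real cluster the grouping by key is done by the framework's shuffle phase rather than by a dictionary in memory.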

(2) select the frequent 1-items according to each item's occurrence count and sort them by occurrence count in descending order to build the F-List, which holds the frequent 1-items and their occurrence counts. The sub-steps are:

(2-1) after the above operations, the key-value pairs output by the Reducer are stored on HDFS; read the result file from HDFS;

(2-2) sort and reject non-qualifying items: sort the key-value pairs by value in descending order and, according to the given threshold, reject the items below the threshold to obtain the F-List;
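A minimal sketch of step (2), assuming the item counts are already available as a dictionary. The name build_f_list, the parameter min_support and the alphabetical tie-break between items with equal counts are assumptions of this sketch; the text only requires descending order of occurrence count.

```python
def build_f_list(counts, min_support):
    """Keep the items whose count reaches the threshold, sorted by count descending."""
    frequent = [(item, n) for item, n in counts.items() if n >= min_support]
    # Ties are broken by item name so that the order is deterministic (assumption).
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


# Hypothetical counts, e.g. the output of the counting pass.
counts = {"A": 2, "B": 4, "C": 2, "D": 2, "E": 1}
f_list = build_f_list(counts, min_support=2)
print(f_list)
```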

(3) group the items in the F-List according to the load balancing principle to obtain the G-List, which holds each item and the number of the group it belongs to. The sub-steps are:

(3-1) estimate the load of each item in the F-List in advance and partition the F-List according to the load balancing principle;

(3-2) build the G-List from the partition of the F-List. The G-List has two fields: the item and the number gid of the group it belongs to. It is stored in a hash table;

(4) the Mapper distributes the original data:

(4-1) reorder the items of each record according to the item order in the F-List;

(4-2) read the items of each record starting from the last one; for each item, look up its group number gid in the G-List, use gid as the key and all the items ranked before the item as the value, and emit the key-value pair <key=gid, value=items> as Mapper output; if the group number gid has already occurred for this record, skip the item; continue with the preceding item in the same way until the record has been fully processed;
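The Mapper logic of step (4) can be sketched as follows. The function name map_transaction and the sample F-List order and G-List are hypothetical. The sketch also assumes that the emitted value contains the item itself together with the items ranked before it, and that items missing from the F-List are dropped; the text does not spell out either detail.

```python
def map_transaction(transaction, f_order, g_list):
    """Emit one <gid, items> pair per group that occurs in the record."""
    rank = {item: i for i, item in enumerate(f_order)}
    # (4-1) keep only frequent items and reorder them by F-List order.
    ordered = sorted((i for i in set(transaction) if i in rank), key=rank.__getitem__)
    seen_groups = set()
    output = []
    # (4-2) scan from the last item towards the first.
    for pos in range(len(ordered) - 1, -1, -1):
        gid = g_list[ordered[pos]]
        if gid in seen_groups:
            continue  # this group already received a longer prefix of the record
        seen_groups.add(gid)
        output.append((gid, ordered[:pos + 1]))
    return output


# Hypothetical F-List order and G-List (item -> group number).
f_order = ["B", "A", "C", "D"]
g_list = {"B": 0, "A": 1, "C": 1, "D": 0}
pairs = map_transaction(["D", "A", "B"], f_order, g_list)
print(pairs)
```

Because the scan runs from back to front, each group receives the longest prefix of the record that ends in one of its items, so no group gets the same record twice.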

(5-3) build the G-Subsume of the Reducer. G-Subsume is a new concept proposed by the invention: G-Subsume(A) = {B ∈ I₁ | A ∈ I₁, A.gid ∈ Reducer.gid, g(A) ⊆ g(B)}, where A and B denote two different frequent 1-items, A.gid denotes the group number of item A, and Reducer.gid denotes the group numbers assigned to the Reducer. G-Subsume is defined only for frequent 1-itemsets: each Reducer finds the G-Subsume of the frequent 1-itemsets that belong to its gid, which is what A.gid ∈ Reducer.gid expresses. Every record has an ID, and g(X) denotes the set of IDs of the records that contain item X; g(A) ⊆ g(B) means that every record containing item A also contains item B, while a record containing item B does not necessarily contain item A. G-Subsume therefore amounts to finding, for each frequent 1-itemset of Reducer.gid, the set of its ancestors. For the subsequent mining it follows that if the G-Subsume of A is {A₁, A₂, …, A_m}, then combining A with any of the 2^m - 1 non-empty subsets of that set gives an itemset whose support equals the support of A. This property is used in the subsequent frequent itemset mining: if G-Subsume(A) = {B} and XA is a frequent itemset, then XBA must also be frequent.
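The definition of G-Subsume can be checked with a small sketch that computes g(X) as a set of record IDs and tests the inclusion g(A) ⊆ g(B). The name build_g_subsume and the sample data are hypothetical, and for simplicity the sketch works on the full list of records rather than on the group-dependent records a Reducer would actually receive.

```python
def build_g_subsume(transactions, frequent_items, g_list, reducer_gid):
    """Return {A: [B, ...]} for the items A of this reducer's group with g(A) a subset of g(B)."""
    # g(X): IDs of the records that contain X.
    tidsets = {item: set() for item in frequent_items}
    for tid, record in enumerate(transactions):
        for item in record:
            if item in tidsets:
                tidsets[item].add(tid)
    result = {}
    for a in frequent_items:
        if g_list[a] != reducer_gid:
            continue  # only items assigned to this reducer's group
        result[a] = [b for b in frequent_items
                     if b != a and tidsets[a] <= tidsets[b]]
    return result


# Hypothetical data.
transactions = [["A", "B", "C"], ["B", "C"], ["A", "B", "D"], ["B", "D"]]
frequent_items = ["B", "A", "C", "D"]
g_list = {"B": 0, "A": 1, "C": 1, "D": 0}
g_subsume = build_g_subsume(transactions, frequent_items, g_list, reducer_gid=1)
print(g_subsume)
```

In this data every record that contains A or C also contains B, so B subsumes both, and any itemset ending in A or C can be extended with B without changing its support.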

(5-4) recursive mining, with the following sub-steps:

a) take the F-List as the initial recursion data of the first round and start the recursion from N_Last, the N-List of the last item L; combine L with its G-Subsume to generate frequent 2-itemsets and write them to the result array Result; these itemsets are not used as initial data for recursively mining frequent 3-itemsets and are only added to Result as results;

b) from front to back in the initial recursion data, compare the PP-codes of N_X, the N-List of item X, with the PP-codes of N_Last. If X is in the G-Subsume of L, move on to the next item. Otherwise: when the pre of the PP-code of N_X is less than the pre of the PP-code of N_Last and its post is greater than the post of the PP-code of N_Last, add <N_X.PP-code.pre, N_X.PP-code.post, N_Last.PP-code.frequency> to a new N-List named XL and advance to the next PP-code of N_Last; when the pre of the PP-code of N_X is less than the pre of the PP-code of N_Last and its post is also less, advance to the next PP-code of N_X; when the pre of the PP-code of N_X is greater than the pre of the PP-code of N_Last, advance to the next PP-code of N_Last; continue until the PP-codes of N_Last and N_X have all been traversed. This approach reduces complexity: comparing every PP-code of N_Last with every PP-code of N_X has complexity O(mn), where m and n are the lengths of N_Last and N_X, whereas this method has complexity O(m+n). After the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the resulting N-List of XL does not reach the threshold, delete XL; otherwise XL is a frequent 2-itemset;

c) continue by comparing the PP-codes of the N-List of the next item with the PP-codes of N_Last, that is, repeat step b), until all items have been compared; this yields the set {AL, BL, ...} of frequent 2-itemsets with suffix L and the N-List of each, which are written to the result array Result, and the N-Lists serve as the initial data for mining frequent 3-itemsets; by the property introduced in (5-3), the frequent 2-itemsets are merged directly with the G-Subsume of L to obtain some of the frequent 3-itemsets with suffix L, which are added to Result;

d) take the preceding item and perform operations a), b) and c) again, until all items have been processed; this yields all frequent 2-itemsets and some of the frequent 3-itemsets. As the above steps show, the frequent itemsets obtained by merging with G-Subsume are not used as initial data for recursive mining, that is, the next step of mining frequent 3-itemsets does not use them;

e) once the frequent 2-itemsets have been obtained, frequent 3-itemsets are mined next. Only two itemsets that differ only in their prefix can produce a frequent 3-itemset, that is, only AX and BX are tested for producing a frequent 3-itemset. Extract the frequent 2-itemsets that differ only in their prefix as the initial recursion data of the second round; starting from the last one, compare it with the itemsets before it, from front to back, in the same way as steps b) to d), comparing the N-Lists of AX and BX; the loop finally yields all frequent 3-itemsets. Combine each frequent 3-itemset whose suffix item has a G-Subsume with that G-Subsume to obtain frequent 4-itemsets;

f) and so on, until the N-List comparison yields only one frequent K-itemset (not counting the frequent K-itemsets obtained by merging with G-Subsume), at which point the recursion ends;

(5-5) the Reducer outputs <key=item ∈ gid, value=Result>, which completes the whole frequent itemset mining process.
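The linear N-List comparison of sub-step b) can be sketched as a two-pointer merge. PP-codes are modelled as (pre, post, frequency) tuples; the function names and the sample PP-codes are hypothetical. Folding consecutive entries that come from the same PP-code of N_X into one entry is a convenience of this sketch and does not change the support sum.

```python
def intersect_n_lists(n_x, n_last):
    """Merge two N-Lists in O(m+n): keep (x.pre, x.post, l.freq) when x is an ancestor of l."""
    result = []
    i = j = 0
    while i < len(n_x) and j < len(n_last):
        x_pre, x_post, _ = n_x[i]
        l_pre, l_post, l_freq = n_last[j]
        if x_pre < l_pre:
            if x_post > l_post:
                # x is an ancestor of l: record it and advance in N_Last.
                if result and result[-1][0] == x_pre:
                    pre, post, freq = result[-1]
                    result[-1] = (pre, post, freq + l_freq)
                else:
                    result.append((x_pre, x_post, l_freq))
                j += 1
            else:
                i += 1  # x lies entirely before l: advance in N_X
        else:
            j += 1  # l starts before x: advance in N_Last
    return result


def support(n_list):
    """Support of an itemset is the sum of the frequencies in its N-List."""
    return sum(freq for _, _, freq in n_list)


# Hypothetical PP-codes for a small tree with paths A-B-C, A-B-D, B-C, B-D.
n_a = [(1, 3, 2)]
n_b = [(2, 2, 2), (5, 6, 2)]
n_d = [(4, 1, 1), (7, 5, 1)]
n_bd = intersect_n_lists(n_b, n_d)
n_ad = intersect_n_lists(n_a, n_d)
print(n_bd, support(n_bd), n_ad, support(n_ad))
```

With a minimum support of 2, BD would be kept as a frequent 2-itemset and AD would be deleted.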

This completes all steps of frequent itemset mining. The application of the invention is explained below, taking a game application as an example:

1) Group the game users. The invention draws a heat map from the number of heroes each user uses and the length of each user's hero-usage sequence, and then splits the user population accordingly.

2) Denoise the data. The original data contain much useless interfering data, such as the hero that must be used in the first game; such data carry no meaning and must be removed to obtain meaningful hero-usage sequences for the different user groups.

3) Apply the algorithm to these sequences. The mining result is the frequent patterns of users' hero-usage sequences, which are used to gradually guide users from the group that uses few heroes and has short hero-usage sequences into the group that uses many heroes and has long hero-usage sequences.

Fig. 2 shows the flow chart of frequent itemset mining by the Mapper and the Reducer. In the Mapper, each record is first sorted in F-List order and then, according to the G-List partition, processed in a loop, and the results are passed to the Reducers. On each Reducer, the PPC-tree is built first, then the N-Lists and G-Subsume are derived from it, and finally recursive mining yields the frequent itemsets.

Fig. 3 shows the construction of the PPC-tree in the invention, using the input in the figure as an example. The first record is inserted into the empty tree under the root node in the order A, B, C. The second record is B, C: B is first looked up under the root node and not found, so a B node is created and inserted; C is then looked up under this B node and not found, so a C node is created and inserted. The third record is A, B, D: the A and B nodes are found, but no D node is found among the children of the B node, so a D node is created and inserted under B. The last record is B, D: the B node is found, but no D node is found among its children, so a D node is created and inserted under B. This completes the construction of the PPC-tree.
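The construction walked through above can be sketched as follows, using the same four records. The class and function names are hypothetical, and numbering the root as pre = 0 is an assumption of this sketch. The second function derives the N-Lists by collecting each node's PP-code <pre, post, frequency> in pre-order, which gives ascending pre within every N-List.

```python
class Node:
    """PPC-tree node: name, support frequency, children, pre and post numbers."""

    def __init__(self, name):
        self.name = name
        self.frequency = 0
        self.children = {}
        self.pre = -1
        self.post = -1


def build_ppc_tree(transactions):
    """Insert each (already ordered) record as a path, then number the nodes."""
    root = Node(None)
    for record in transactions:
        node = root
        for item in record:
            if item not in node.children:
                node.children[item] = Node(item)  # create and insert a new node
            node = node.children[item]
            node.frequency += 1
    counters = {"pre": 0, "post": 0}

    def number(node):
        node.pre = counters["pre"]
        counters["pre"] += 1
        for child in node.children.values():
            number(child)
        node.post = counters["post"]
        counters["post"] += 1

    number(root)
    return root


def build_n_lists(root):
    """Collect the PP-codes of every item in pre-order (ascending pre)."""
    n_lists = {}

    def visit(node):
        if node.name is not None:
            n_lists.setdefault(node.name, []).append(
                (node.pre, node.post, node.frequency))
        for child in node.children.values():
            visit(child)

    visit(root)
    return n_lists


records = [["A", "B", "C"], ["B", "C"], ["A", "B", "D"], ["B", "D"]]
n_lists = build_n_lists(build_ppc_tree(records))
print(n_lists)
```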

Fig. 4 shows the flow chart of obtaining frequent 2-itemsets from frequent 1-itemsets during recursive mining. First the last item L_n is merged with its G-Subsume to obtain some of the frequent 2-itemsets. Then, from front to back, each item is checked against the G-Subsume of L_n: if the item is contained in the G-Subsume, move on to the next item; if not, compare the N-List of the item with the N-List of L_n, PP-code by PP-code, according to the following rules. When the pre of the PP-code of N_x is less than the pre of the PP-code of N_n and its post is greater than the post of the PP-code of N_n, add <N_x.PP-code.pre, N_x.PP-code.post, N_n.PP-code.frequency> to a new N-List named L_xL_n and advance to the next PP-code of N_n. When the pre of the PP-code of N_x is less than the pre of the PP-code of N_n and its post is also less, advance to the next PP-code of N_x. When the pre of the PP-code of N_x is greater than the pre of the PP-code of N_n, advance to the next PP-code of N_n. Continue until the PP-codes of N_n and N_x have all been traversed. If the sum of the supports of the PP-codes in the resulting N-List of L_xL_n does not reach the threshold, delete L_xL_n; otherwise it is a frequent 2-itemset. Continue by comparing the PP-codes of the N-List of the next item with those of N_n until all items have been compared, which yields the frequent 2-itemsets with suffix L_n. Then take the item before L_n and perform the same operations, ending once the second item has been taken, which yields all frequent 2-itemsets. Mining of the subsequent frequent k-itemsets is similar to mining frequent 2-itemsets and is not elaborated further. Note that subsequent G-Subsume merging uses the G-Subsume of the suffix item of the frequent itemset, and that the frequent k-itemsets generated by merging items with G-Subsume are not used as initial data for frequent (k+1)-itemsets but only as results. In subsequent frequent k-itemset mining, N-Lists are compared only between itemsets that differ only in their prefix, for example AX and BX.
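The merging with G-Subsume mentioned here needs no N-List comparison, because every combination of a frequent itemset with a non-empty subset of the G-Subsume of its suffix item has the same support as the itemset itself. A minimal sketch, assuming the G-Subsume items are not already part of the itemset; the function name and the sample items are hypothetical.

```python
from itertools import combinations


def expand_with_subsume(itemset, subsume_items, support):
    """Combine an itemset with every non-empty subset of its suffix item's G-Subsume."""
    prefix, suffix = tuple(itemset[:-1]), tuple(itemset[-1:])
    results = []
    for size in range(1, len(subsume_items) + 1):
        for extra in combinations(subsume_items, size):
            # The added items sit before the suffix item; the support is unchanged.
            results.append((prefix + extra + suffix, support))
    return results


# Hypothetical case: XA is frequent with support 3 and G-Subsume(A) = {B1, B2}.
expanded = expand_with_subsume(("X", "A"), ["B1", "B2"], support=3)
print(expanded)
```

For m items in the G-Subsume this yields 2^m - 1 results, all written directly to Result and none used as initial data for the next round.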

Fig. 5 shows a schematic diagram of the load balancing of the invention. Each item must be assigned to a group in the G-List, identified by the group number gid. In general, the groups could be formed by taking remainders, but some items recurse for a long time and others for a short time, so the final result has to wait for the slowest item and the load becomes unbalanced. To balance the load, the invention adopts a load balancing strategy and estimates the load of each item in advance on the following grounds:

1) under depth-first traversal, the depth of the PPC-tree affects the time of its pre-order and post-order traversals: the deeper the tree, the more time they take;

2) when the N-Lists of two frequent itemsets are merged, the time complexity is the sum of the lengths of the two N-Lists;

3) the maximum path length in the PPC-tree for each item equals its sequence number in the F-List, and the maximum length of the item's N-List equals the minimum of its support count and 2ⁿ - 1, where n is the item's sequence number in the F-List;

Hence the load of each item is estimated by its sequence number in the F-List. After the loads have been estimated, the invention uses a greedy algorithm to balance them: each item is added to the group whose current load sum is smallest, until all items have been assigned.
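The greedy grouping can be sketched as follows, with the load of an item taken as its 1-based sequence number in the F-List. The function name build_g_list is hypothetical, and assigning the heaviest items first is an assumption of this sketch; the text only states that each item joins the group with the smallest current load sum.

```python
def build_g_list(f_order, num_groups):
    """Assign each item to the group with the smallest load sum so far."""
    loads = [0] * num_groups
    g_list = {}
    # Load of an item = its 1-based position in the F-List.
    indexed = list(enumerate(f_order, start=1))
    for load, item in reversed(indexed):  # heaviest first (assumption)
        gid = min(range(num_groups), key=lambda g: loads[g])
        g_list[item] = gid
        loads[gid] += load
    return g_list, loads


# Hypothetical F-List order with five frequent items and two groups.
g_list, loads = build_g_list(["B", "A", "C", "D", "E"], num_groups=2)
print(g_list, loads)
```

The resulting dictionary plays the role of the hash table that stores the G-List.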

Fig. 6 shows a schematic diagram of the MapReduce process of the invention. The invention goes through two MapReduce passes. In the first pass, Map outputs <key=item, value=1>, and the Reducer adds up the values and outputs <key=item, value=sum{1, 1, ..., 1}>. The second pass performs the data mining: after the F-List and the G-List have been obtained, mining based on N-Lists and G-Subsume produces the final result.

Those skilled in the art will readily understand that the above are only preferred embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A frequent itemset mining method, characterized by comprising the following steps:

(1) count the occurrences of each item in the original data with MapReduce;

(2) select the frequent 1-items according to each item's occurrence count and sort them by occurrence count in descending order to form the F-List;

(3) group the items in the F-List according to the load balancing principle to obtain the G-List, which holds each item and the number of the group it belongs to;

(4) the Mapper distributes the original data:

(4-2) read the items of each record of the original data starting from the last one; for each item, look up its group number gid in the G-List, use gid as the key and all the items ranked before the item in the record as the value, and emit the key-value pair <key=gid, value=items> as Mapper output; if the group number gid has already occurred for this record, skip the item; continue with the preceding item in the same way until the record has been fully processed;

(5-1) according to the key=gid output by the Mapper, value=items is distributed to the corresponding Reducer, which builds the PPC-tree; the PPC-tree is a tree structure in which each node has five attributes: name, support frequency, child nodes, pre-order traversal number pre and post-order traversal number post;

(5-2) for each node N_i in the PPC-tree, call <N_i.pre, N_i.post, N_i.frequency> its PP-code; sorting the PP-codes of each item in ascending order of pre yields the N-List of each frequent 1-item in the F-List;

(5-3) build the G-Subsume of the Reducer: G-Subsume(A) = {B ∈ I₁ | A ∈ I₁, A.gid ∈ Reducer.gid, g(A) ⊆ g(B)}, where A and B denote two different frequent 1-items, A.gid denotes the group number of item A, Reducer.gid denotes the group numbers assigned to the Reducer, g(X) denotes the set of IDs of the records that contain the frequent 1-item X (X = A or B), and I₁ denotes the set of frequent 1-items;

(5-4) recursive mining, with the following sub-steps:

a) take the F-List as the initial recursion data of the first round and take its last item L; combine L with its G-Subsume(L) to generate frequent 2-itemsets and write them to the result array Result;

b) take the items X of the initial recursion data one by one from front to back and compare the PP-codes of N_X, the N-List of X, with the PP-codes of N_Last, the N-List of L; if X is in G-Subsume(L), move on to the next item; otherwise:

when the pre of the PP-code of N_X is less than the pre of the PP-code of N_Last and its post is greater than the post of the PP-code of N_Last, generate the 2-itemset XL, add <N_X.PP-code.pre, N_X.PP-code.post, N_Last.PP-code.frequency> to N_XL, the N-List of XL, and advance to the next PP-code of N_Last;

when the pre of the PP-code of N_X is less than the pre of the PP-code of N_Last and its post is less than the post of the PP-code of N_Last, advance to the next PP-code of N_X;

when the pre of the PP-code of N_X is greater than the pre of the PP-code of N_Last, advance to the next PP-code of N_Last; continue until the PP-codes of N_Last and N_X have all been traversed;

after the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the resulting N-List of XL does not reach the threshold, delete XL; otherwise XL is a frequent 2-itemset;

c) take the next item from the initial recursion data and repeat step b) until all items before the last item L have been compared; this yields the frequent 2-itemsets with suffix L and their N-Lists, which are written to the result array Result, and the N-Lists serve as the initial data for mining frequent 3-itemsets; the frequent 2-itemsets are merged directly with G-Subsume(L) to obtain some of the frequent 3-itemsets with suffix L, which are added to Result;

d) take the second-to-last item of the initial recursion data and repeat steps a), b) and c), and so on until all items of the initial recursion data have been processed; this yields all frequent 2-itemsets and some of the frequent 3-itemsets;

e) extract the frequent 2-itemsets that differ only in their prefix as the initial recursion data of the second round and, starting from the last one, process them in the same way as steps b) to d) to obtain all frequent 3-itemsets; combine each frequent 3-itemset whose suffix item has a G-Subsume with that G-Subsume to obtain frequent 4-itemsets;

2. The frequent itemset mining method according to claim 1, characterized in that step (1) is implemented as follows:

(1-1) partition the original database horizontally; each resulting sub-file is called a Block, and the Blocks are assigned to the nodes of the cluster;

(1-2) each Block is the input of one Map function; for every item a_j of each record T_i in the Block, the Mapper outputs the key-value pair <key=a_j, value=1>;

(1-3) all key-value pairs with key=a_j are assigned to the same Reducer, so the Reducer's input is <key=a_j, value={1, 1, ..., 1}>; the Reducer sums once and outputs <key=a_j, value=sum{1, 1, ..., 1}>.

3. The frequent itemset mining method according to claim 1 or 2, characterized in that the load balancing principle in step (3) is: take the sequence number of each item in the F-List as its load value and group the items in the F-List according to the load values.

4. The frequent itemset mining method according to claim 1 or 2, characterized in that the G-List is stored in a hash table.