CN104731925A

CN104731925A - MapReduce-based FP-Growth load balance parallel computing method

Info

Publication number: CN104731925A
Application number: CN201510138318.0A
Authority: CN
Inventors: 杨勇; 陈曙东
Original assignee: Jiangsu IoT Research and Development Center
Current assignee: Jiangsu IoT Research and Development Center
Priority date: 2015-03-26
Filing date: 2015-03-26
Publication date: 2015-06-24

Abstract

The invention relates to a MapReduce-based FP-Growth load-balancing parallel computing method. The method comprises the following steps: 1, a database transaction set D is divided into different contiguous partitions, and the resulting sub-transaction sets are stored on multiple nodes; 2, support counts are computed in parallel to obtain the list of all frequent 1-itemsets, FList; 3, the items of FList are divided into M groups according to a load balancing method to obtain a new list GList; 4, the database transaction set D is likewise divided into M groups according to GList; once the division is finished, a local FP-Tree is built for each transaction set DB, and the corresponding group GList[gid_i] is mined from each local FP-Tree to obtain the frequent patterns of all items in the frequent 1-itemsets; 5, the frequent patterns obtained on each node are aggregated and output. The method has good load balancing capability and execution efficiency.

Description

MapReduce-based FP-Growth load-balancing parallel computing method

Technical field

The present invention relates to a load-balancing parallel computing method, in particular to a MapReduce-based FP-Growth load-balancing parallel computing method, and belongs to the technical field of data mining.

Background art

Association rule mining reveals the interdependence and correlation between one thing and another, and is an important topic in data mining. It proceeds in two stages, the generation of frequent itemsets and the generation of association rules, and its overall performance is determined mainly by the first stage. The classical association rule mining algorithms are Apriori, FP-Growth and Eclat; the first two mine data in the horizontal format, while the last mines data in the vertical format. Compared with Apriori, FP-Growth mines the database with a divide-and-conquer strategy and generates no candidate itemsets. It stores the essential information of the database in an FP-Tree: the database is scanned only twice, after which the key information is kept in memory as the FP-Tree, avoiding the heavy cost of repeated database scans.

Hadoop is an open-source distributed computing platform for processing large-scale data in parallel. MapReduce, one of Hadoop's core components, is a high-performance distributed programming model and computing framework for analyzing and processing massive data in parallel. MapReduce handles all tasks uniformly, that is, it decomposes tasks and merges results, through two core operations, Map and Reduce. The Map function splits a large data set into many small data sets and sends them to multiple machines (nodes) for parallel computation; the Reduce function then merges the results of the Map functions on each machine (node) into the final result.

With social progress and the development of science and technology, data is growing explosively, and FP-Growth running on a single machine falls far short of the storage and mining demands of massive data. Some existing parallel FP-Growth algorithms solve the two problems of partitioning the database and of the subsequent parallel computation, but they show obvious differences and shortcomings in parallel computing efficiency, memory consumption, communication cost, and the performance differences caused by the varying sparsity of the FP-Trees. All of these are closely related to the lack of load balancing considerations when the database transaction set is partitioned.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide a MapReduce-based FP-Growth load-balancing parallel computing method with good load balancing capability and execution efficiency.

According to the technical solution provided by the present invention, the MapReduce-based FP-Growth load-balancing parallel computing method comprises the following steps:

Step 1: input the required database transaction set D and the minimum support count, divide D into different contiguous partitions, and store the sub-transaction sets of D on multiple nodes;

Step 2: scan the database transaction set D for the first time, compute in parallel the support counts of the items on each node, and merge the support counts computed on all nodes to obtain the list of all frequent 1-itemsets, FList;

Step 3: divide the items of FList into M groups according to the load balancing method to obtain a new list GList of length M, in which the group number of each group is gid_i (1 ≤ i ≤ M);

Step 4: scan the database transaction set D for the second time and likewise divide D into M groups according to GList, the group numbers of the divided D corresponding to the group numbers in GList; if a transaction contains an item of GList[gid_i], the corresponding part of the transaction is sent to the transaction set DB whose group number is gid_i; after the division of D is finished, build a local FP-Tree for each transaction set DB and mine the corresponding GList[gid_i] from the local FP-Tree to obtain the frequent patterns of all items in the frequent 1-itemsets;

Step 5: aggregate and output the frequent patterns of all items in the frequent 1-itemsets obtained on each node.

Step 3 comprises the following steps:

Step 3.1: compute the load of each item in FList and sort the items in descending order of load to obtain the sorted list SList;

Step 3.2: according to the specified number of groups M, initialize the M groups of GList with the first M items of SList, so that each group in GList corresponds one-to-one with one of those items;

Step 3.3: add the first item of SList that has not yet been assigned to a group to the group in GList with the smallest load, add the item's load to that group, and update the group's load in GList;

Step 3.4: repeat step 3.3 until all items in SList have been grouped;

Step 3.5: save the resulting GList in an HDFS file so that it can be shared by multiple nodes.
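
The grouping in steps 3.1 to 3.4 amounts to repeatedly assigning the heaviest remaining item to the currently lightest group. The following Python sketch illustrates this under stated assumptions: the function name balance_groups and the dictionary loads (item to estimated load) are hypothetical, a heap is used only as a convenient way to find the lightest group, ties are broken by item name and group index, group numbers are 0-based, and saving GList to HDFS (step 3.5) is left out.

```python
import heapq


def balance_groups(loads, m):
    """Greedy grouping. loads maps item -> estimated load; m is the group count M."""
    # Step 3.1: sort items by load, descending (item name breaks ties; an assumption).
    slist = sorted(loads, key=lambda item: (-loads[item], item))
    m = min(m, len(slist))
    # Step 3.2: seed each of the M groups with one of the first M items of SList.
    groups = [[item] for item in slist[:m]]
    heap = [(loads[item], gid) for gid, item in enumerate(slist[:m])]
    heapq.heapify(heap)
    # Steps 3.3 and 3.4: give each remaining item to the group with the smallest load.
    for item in slist[m:]:
        group_load, gid = heapq.heappop(heap)
        groups[gid].append(item)
        heapq.heappush(heap, (group_load + loads[item], gid))
    return groups


# Illustrative loads only; not data from the patent.
glist = balance_groups({"a": 10, "b": 6, "c": 3, "d": 1, "e": 0}, 2)
```

With these illustrative loads both groups end up with a total load of 10, which is the "substantially equal" outcome the method aims for.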

Compared with the prior art, the present invention has the following advantage: it takes the total length of the prefix paths of each item of FList in its conditional pattern tree as the load of that item, sorts the items in descending order of load, and then, for the specified number of groups M, makes the sums of the item loads in the groups substantially equal. The balanced division of FList thus balances the load among the computing nodes, resolves the uneven load among them, and gives better load balancing capability and execution efficiency.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the present invention.

Detailed description of the embodiments

The invention is further described below with reference to the drawing and specific embodiments.

As shown in Fig. 1, in order to achieve good load balancing capability and execution efficiency, the load-balancing parallel computing method of the present invention comprises the following steps:

The database transaction set D is divided into several contiguous parts, which are stored on different computing nodes. Each resulting sub-transaction set is called a data shard. This process is carried out directly by Hadoop: the user only needs to copy the database transaction set to HDFS, and the Hadoop framework splits the input file into multiple data shards (Blocks), stores them on different nodes, and keeps replicas of each shard, so the sharding is completed automatically.

In the embodiment of the present invention, the first pair of MapReduce functions counts the support of each item in the whole database transaction set D, thereby obtaining FList. The input of each Map function corresponds to one data shard (Shard). The input key-value pair of the Map function has the format <key=lineNo, value=T>, where lineNo is the current line number and T is the transaction on that line. For each transaction T, the Map function outputs <key=item, value=1>, where item is each item that occurs in T. Hadoop combines all Map output key-value pairs that share the same key and passes them to Reduce, so the input format of the Reduce function is <key=item, value={1, 1, 1, ...}>. The output format of Reduce is <key=item, value=itemCount>, where itemCount is the number of times the item occurs, that is, its support count.
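
The first MapReduce pass can be imitated in plain Python to make the key-value flow concrete. This is only a single-process sketch, not Hadoop code: the names map_count, reduce_count and frequent_one_itemsets are hypothetical, the shuffle is simulated with a dictionary, each item is assumed to occur at most once per transaction, and filtering by the minimum support count and ordering FList by descending support (the usual FP-Growth convention) are assumptions about how FList is finalized.

```python
from collections import defaultdict


def map_count(line_no, transaction):
    """Map: for <lineNo, T>, emit <item, 1> for every item in T."""
    for item in transaction:
        yield item, 1


def reduce_count(item, ones):
    """Reduce: for <item, {1, 1, ...}>, emit <item, itemCount>."""
    return item, sum(ones)


def frequent_one_itemsets(shards, min_support):
    # Simulated shuffle: collect all Map outputs that share the same key.
    shuffled = defaultdict(list)
    for shard in shards:
        for line_no, transaction in enumerate(shard):
            for item, one in map_count(line_no, transaction):
                shuffled[item].append(one)
    counts = dict(reduce_count(item, ones) for item, ones in shuffled.items())
    # Assumption: keep items meeting the minimum support, sorted by descending count.
    frequent = {item: c for item, c in counts.items() if c >= min_support}
    return sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0]))


# Two illustrative shards, as if stored on two nodes.
shards = [[["a", "b"], ["b", "c"]], [["a", "b", "c"], ["b"]]]
flist = frequent_one_itemsets(shards, min_support=2)
```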

In the embodiment of the present invention, FList is divided so that the database transaction set D can be grouped according to the new list GList. How FList is divided directly determines whether the loads of the transaction sets produced in the next step are balanced, and therefore affects the execution efficiency of the whole parallel algorithm. The present invention divides FList with the aim of balancing the load among the divided transaction sets, breaking the original large database into parts distributed across the nodes so that they can be processed in parallel. The load of each transaction set must therefore be estimated before FList is divided.

For a transaction set DB(gid_i), the load of the group is taken to be the total number of recursions over the conditional pattern trees of all items of the corresponding GList[gid_i] that are mined. Therefore the load of each item in FList is estimated first, and FList is divided afterwards.

The maximum length of the prefix path in the conditional pattern tree of an item equals the position n of that item in FList. If the maximum prefix path length of the conditional pattern tree of an item is n, the maximum number of recursions needed to mine the frequent patterns of that item is (n-1) + (n-2) + ... + 1 = n × (n-1) / 2, so the mining load of each item can be estimated as n × (n-1) / 2.
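
The load estimate depends only on an item's position in FList, so it can be computed in one pass. In the small sketch below the function name estimate_loads is hypothetical and positions are assumed to be 1-based, matching the n in the formula above.

```python
def estimate_loads(flist):
    """Map each item to n * (n - 1) / 2, where n is its 1-based position in FList."""
    return {item: n * (n - 1) // 2 for n, item in enumerate(flist, start=1)}


# Illustrative FList order only.
loads = estimate_loads(["b", "a", "c", "d"])
```

Items near the end of FList therefore carry a quadratically larger estimated load than items near the front, which is why grouping by item count alone would be unbalanced.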

Based on the above, FList is then divided to obtain the new list GList by the procedure of steps 3.1 to 3.5 described above.

In the embodiment of the present invention, the group corresponding to gid_i is denoted GList[gid_i], and each item in GList[gid_i] is denoted α_j, where α_j ∈ GList[gid_i] and 1 ≤ j ≤ GList[gid_i].length.

This step is performed by the second pair of MapReduce functions. The Map function groups the database transaction set D according to the division of FList, producing a set of mutually independent transaction sets DB; the Reduce function performs FP-Growth mining on the independent transaction sets on its node.

Map function: generate M mutually independent transaction sets DB by sending every transaction on the local node to the appropriate groups. The input key-value pair of the Map function is again <key=lineNo, value=T>. The Map function operates as follows:

1) Load the new list GList onto the local node and build a hashMap from it, whose keys are the items in GList and whose values are the group numbers gid_i of those items.

2) For each transaction T read in, sort its items in the order of FList and delete from T the items that are not in FList.

3) Let the sorted transaction be T = {item_1, item_2, ..., item_n}. Traverse the items item_j of T from back to front, with j running from n down to 1. If item_j is the key of some key-value pair in hashMap, delete from hashMap all key-value pairs that have the same value as that pair, and then send the first j items of T to the group given by that value.

The output key-value pair of the Map function is <key=gid_i, value={item_1, ..., item_j}>, where gid_i is the group number of the transaction set to which the transaction is distributed, and {item_1, ..., item_j} indicates that not the whole transaction but only the part up to and including item_j is sent to the group. The principle is that the corresponding part of T is sent to every group of GList to which an item of T belongs. Deleting entries from the hash table ensures that the same transaction is not sent to the same group more than once. In this way, for every transaction containing an item of group GList[gid_i], the corresponding part is sent to the transaction set DB(gid_i), so mining DB(gid_i) with FP-Growth yields all patterns of the items in GList[gid_i]. Because different groups GList[gid_i] contain different items, the frequent patterns obtained from each group are different; each transaction set DB is therefore independent, and the groups do not depend on one another.
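
The Map operations described above can be sketched as a single Python function. This is an illustration under assumptions, not the patented implementation: the name map_group is hypothetical, group numbers are taken to be 0-based list indices, and instead of deleting entries from the hash table the sketch records which groups have already received the transaction, which has the same effect of sending at most one prefix per group.

```python
def map_group(transaction, flist, glist):
    """Return the <gid, prefix> pairs emitted for one transaction T."""
    rank = {item: pos for pos, item in enumerate(flist)}
    # Operation 1: the hashMap from item to group number.
    item_to_gid = {item: gid for gid, group in enumerate(glist) for item in group}
    # Operation 2: drop items not in FList and sort the rest in FList order.
    t = sorted((i for i in set(transaction) if i in rank), key=rank.__getitem__)
    emitted = set()  # stands in for deleting hashMap entries with the same value
    out = []
    # Operation 3: scan from the last item to the first.
    for j in range(len(t) - 1, -1, -1):
        gid = item_to_gid.get(t[j])
        if gid is not None and gid not in emitted:
            emitted.add(gid)
            out.append((gid, t[: j + 1]))  # the first j items, item_j included
    return out


# Illustrative FList and GList only.
pairs = map_group(["a", "d", "x", "b"], ["b", "a", "c", "d"], [["d", "b"], ["c", "a"]])
```

In this example the sorted transaction is b, a, d. Group 0 (which owns d and b) receives the prefix b, a, d once, and group 1 (which owns a) receives b, a; the infrequent item x is discarded.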

Reduce function: mine frequent patterns from the local transaction set. After all Map tasks have finished, Hadoop automatically merges the Map results that share the same key, so the input of Reduce is <key=gid_i, value=DB(gid_i)>, where DB(gid_i) is the independent transaction set of the group numbered gid_i, made up of all the transactions distributed to that group. Each Reduce task processes the transaction sets that Hadoop assigns to it one by one. The Reduce function operates as follows:

1) Load the new list GList and use it to build groupMap, in which the key is the group number gid_i and the value is the set of items GList[gid_i] of that group.

2) Scan every record in DB(gid_i) and build the local FP-Tree, localFP-Tree.

3) Call the Growth algorithm recursively. Unlike the traditional Growth algorithm, the first call Growth(FP-Tree, null) traverses only the items in group GList[gid_i] rather than the whole header table, because each transaction set only needs to mine the frequent patterns of the items in its own group GList[gid_i].

The output of Reduce is <key=pattern, value=sup(pattern)>, where pattern is a frequent pattern and sup(pattern) is the number of times it occurs.

Merging the results of all computing nodes once yields the final result of the parallel FP-Growth algorithm.
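
The restriction of the first Growth call to the items of the group can be illustrated with a simplified pattern-growth sketch. It deliberately swaps the FP-Tree for plain lists of prefix paths with counts, which yields the same patterns but none of the tree's compression; the names growth and reduce_mine, the 0-based group numbers and the explicit min_support parameter are all assumptions made for illustration.

```python
from collections import Counter


def growth(paths, suffix, min_support, allowed=None):
    """Yield (pattern, support). paths is a list of (items in FList order, count)."""
    support = Counter()
    for items, count in paths:
        for item in items:
            support[item] += count
    for item, sup in support.items():
        # Only the first call passes `allowed`, the items of this group.
        if sup < min_support or (allowed is not None and item not in allowed):
            continue
        pattern = (item,) + suffix
        yield pattern, sup
        # Conditional pattern base: what precedes `item` on each path containing it.
        cond = [(items[: items.index(item)], count)
                for items, count in paths if item in items]
        cond = [(prefix, count) for prefix, count in cond if prefix]
        yield from growth(cond, pattern, min_support)


def reduce_mine(gid, db, glist, min_support):
    """Reduce for <gid, DB(gid)>: mine only the items of GList[gid] at the top level."""
    paths = [(tuple(t), 1) for t in db]
    return dict(growth(paths, (), min_support, allowed=set(glist[gid])))


# Illustrative DB(0) for group 0 = {d, b}; prefixes are already in FList order.
patterns = reduce_mine(0, [["b", "a", "d"], ["b", "d"], ["b"]], [["d", "b"], ["c", "a"]], 2)
```

Item a also occurs in this transaction set, but it belongs to another group and its local support here may be incomplete, so it is never used as a starting item; it can still appear inside patterns that end in d or b when it is frequent enough.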

The present invention addresses the limited computing power and storage capacity of the traditional FP-Growth algorithm on a single computing node by proposing a MapReduce-based parallel computing method. It also addresses the imprecise data partitioning in the parallelization process, where the varying sparsity of the FP-Trees of the data blocks leads to obvious differences among the computing nodes in computing efficiency, memory consumption and communication cost, by proposing a MapReduce-based FP-Growth load-balancing parallel algorithm.

Compared with conventional single-machine algorithms and common parallel algorithms, the present invention takes the total length of the prefix paths of each item of FList in its conditional pattern tree as the load of that item, sorts the items in descending order of load, and then, for the specified number of groups M, makes the sums of the item loads in the groups substantially equal. The balanced division of FList thus balances the load among the computing nodes, resolves the uneven load among them, and gives better load balancing capability and execution efficiency.

Claims

1. A MapReduce-based FP-Growth load-balancing parallel computing method, characterized in that the method comprises the following steps:

2. The MapReduce-based FP-Growth load-balancing parallel computing method according to claim 1, characterized in that step 3 comprises the following steps: