CN107291734A

CN107291734A - Frequent itemset mining method, apparatus, and system

Info

Publication number: CN107291734A
Application number: CN201610200506.6A
Authority: CN
Inventors: 胡辉; 谢黎文; 杨军; 刘义
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2016-03-31
Filing date: 2016-03-31
Publication date: 2017-10-24

Abstract

This application discloses a frequent itemset mining method that addresses the prior-art problem that mining frequent itemsets in big data takes a considerable amount of time. The method includes: after receiving a frequent itemset mining task for total data assigned by a client, a master node partitions the total data to obtain pieces of sub-data, and distributes the pieces of sub-data to at least two slave nodes that execute a first-stage task in parallel. The first-stage task specifically includes: mining the allocated sub-data with a frequent itemset mining algorithm to obtain local frequent itemsets. The master node then distributes the frequent itemsets of the sub-data to the slave nodes that execute a second-stage task in parallel. The second-stage task includes: each slave node executing the second-stage task obtains the frequent itemsets of the total data. This application also discloses a frequent itemset mining apparatus and a frequent itemset mining system.

Description

Frequent itemset mining method, apparatus, and system

Technical field

This application relates to the field of big data, and in particular to a frequent itemset mining method, apparatus, and system.

Background

With the development of Internet technology, the data produced on the Internet may contain a great deal of value. As more and more data is produced on the network, how to mine the value of that data quickly and effectively is a major problem of the big data era.

Data mining generally refers to the process of using algorithms to search large amounts of data for the information hidden in it. At present, mining association rules from data is a widely used method in the data mining field. An association rule states that a valuable association exists between different items in the data. If an association rule meets a preset minimum support threshold and a preset minimum confidence threshold, the rule is considered valuable; both thresholds can be set manually according to the mining requirements.

For big data, the set formed by the unique identifying feature values of the different items in the data can be expressed as I = {i1, i2, ..., im}, where the unique identifying feature value of an item is the feature value that uniquely identifies the item within the itemset. For example, for commodities with mutually different names, the commodity name can serve as the unique identifying feature value of an item; alternatively, different numbers can be assigned to different commodities, and the commodity number can then serve as the unique identifying feature value. In the sets and transaction data sets described in the embodiments of this application, the unique identifying feature value of an item can be used to represent that item. A transaction Ti consists of at least one item, that is, Ti is a non-empty subset of I. A transaction data set is the set of transactions Ti and can be expressed as D = {T1, T2, ..., Ti, ..., Tn}, i ∈ [1, n].

Mining association rules from data mainly takes two steps: 1. mining the frequent itemsets; 2. generating association rules from the frequent itemsets. Mining the frequent itemsets is the key to association rule mining. Let S be a set composed of unique identifying feature values of items, S = {i | i ∈ I}, called an itemset for short; an itemset containing k items is called a k-itemset. A frequent itemset is then an itemset whose support is not less than the preset minimum support threshold. The support of an itemset in D is the percentage of the transactions in D that contain all items of the itemset; this percentage can also be understood as the probability that the itemset appears in D. For example, take the shopping records of 1000 customers as the transaction data set, with the commodities in each customer's purchase record as the items. If 200 customers bought both bread and ham and the preset minimum support threshold is 15%, then the support of the 2-itemset {bread, ham} is 20%, so it is a frequent itemset.
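
The bread-and-ham example above can be reproduced with a minimal Python sketch. The function name `support` and the way the 1000 records are made up (only the 200 bread-and-ham purchases come from the text) are assumptions for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# 1000 shopping records; 200 of them contain both bread and ham.
# The remaining 800 records are illustrative filler.
transactions = (
    [frozenset({"bread", "ham"})] * 200
    + [frozenset({"bread"})] * 300
    + [frozenset({"milk"})] * 500
)
MIN_SUPPORT = 0.15  # preset minimum support threshold (15%)

s = support({"bread", "ham"}, transactions)
is_frequent = s >= MIN_SUPPORT
print(s, is_frequent)
```

Representing each transaction as a `frozenset` makes the "contains all items" check a single subset comparison.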

Frequent itemsets can be mined with a level-wise iterative search, that is, frequent k-itemsets are used to search for frequent (k+1)-itemsets. To do so, candidate (k+1)-itemsets are first generated from the frequent k-itemsets, and the candidates that meet the minimum support threshold are then selected as the final frequent (k+1)-itemsets.

The more frequent k-itemsets there are, the more candidate (k+1)-itemsets there are. For example, when there are 1000 frequent 1-itemsets, the number of candidate 2-itemsets is C(1000, 2) = 499,500. In big data mining, the number of frequent k-itemsets is often very large, so mining the frequent itemsets takes a great deal of time and the mining efficiency is low.
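
The growth in candidates can be checked directly. The sketch below assumes that candidate 2-itemsets are formed by pairing every two frequent 1-itemsets, so n frequent items yield n(n-1)/2 candidates; the helper name `candidate_2_itemsets` is hypothetical.

```python
from itertools import combinations
from math import comb

def candidate_2_itemsets(frequent_items):
    """Pair up frequent 1-itemsets to form candidate 2-itemsets."""
    return [frozenset(pair) for pair in combinations(sorted(frequent_items), 2)]

frequent_items = range(1000)  # integer stand-ins for 1000 frequent items
candidates = candidate_2_itemsets(frequent_items)
print(len(candidates), comb(1000, 2))
```

Each of these candidates still has to be counted against the transaction data set, which is where the time is spent.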

Summary of the invention

The embodiments of this application provide a frequent itemset mining method to solve the prior-art problem that mining frequent itemsets in big data takes a considerable amount of time.

The embodiments of this application also provide a frequent itemset mining apparatus to solve the prior-art problem that mining frequent itemsets in big data takes a considerable amount of time.

The embodiments of this application also provide a frequent itemset mining system to solve the prior-art problem that mining frequent itemsets in big data takes a considerable amount of time.

The embodiments of this application adopt the following technical solutions:

A frequent itemset mining method, including:

After receiving a frequent itemset mining task for total data assigned by a client, a master node partitions the total data according to a predetermined data partitioning rule to obtain pieces of sub-data;

The master node distributes the pieces of sub-data to at least two slave nodes used to execute, in parallel, the first-stage task of the frequent itemset mining task. The first-stage task specifically includes: each slave node, according to a preset minimum support threshold, mines the sub-data allocated to it with a frequent itemset mining algorithm to obtain the frequent itemsets of the sub-data;

The master node distributes the frequent itemsets of the sub-data to the slave nodes used to execute, in parallel, the second-stage task of the frequent itemset mining task. The second-stage task includes: each slave node executing the second-stage task obtains the frequent itemsets of the total data according to the preset minimum support threshold and the frequent itemsets of the sub-data.

A frequent itemset mining apparatus, including:

A slave node determining unit, configured to, after a frequent itemset mining task for total data assigned by a client is received, partition the total data according to a predetermined data partitioning rule to obtain pieces of sub-data;

A sub-data frequent itemset acquiring unit, configured to distribute the pieces of sub-data to at least two slave nodes used to execute, in parallel, the first-stage task of the frequent itemset mining task. The first-stage task specifically includes: each slave node, according to a preset minimum support threshold, mines the sub-data allocated to it with a frequent itemset mining algorithm to obtain the frequent itemsets of the sub-data;

A total frequent itemset acquiring unit, configured to distribute the frequent itemsets of the sub-data to the slave nodes used to execute, in parallel, the second-stage task of the frequent itemset mining task. The second-stage task includes: each slave node executing the second-stage task obtains the frequent itemsets of the total data according to the preset minimum support threshold and the frequent itemsets of the sub-data.

A frequent itemset mining system, including a master node and at least two slave nodes, wherein:

The master node is configured to, after receiving a frequent itemset mining task for total data assigned by a client, partition the total data according to a predetermined data partitioning rule to obtain pieces of sub-data;

The master node is further configured to distribute the pieces of sub-data to at least two slave nodes used to execute, in parallel, the first-stage task of the frequent itemset mining task. The first-stage task specifically includes: each slave node, according to a preset minimum support threshold, mines the sub-data allocated to it with a frequent itemset mining algorithm to obtain the frequent itemsets of the sub-data;

The master node is further configured to distribute the frequent itemsets of the sub-data to the slave nodes used to execute, in parallel, the second-stage task of the frequent itemset mining task. The second-stage task includes: each slave node executing the second-stage task obtains the frequent itemsets of the total data according to the preset minimum support threshold and the frequent itemsets of the sub-data.

The slave nodes are configured to execute the tasks distributed by the master node.

At least one of the above technical solutions adopted by the embodiments of this application can achieve the following beneficial effects:

The total data is divided into at least two pieces of sub-data, the slave nodes of a distributed computing system mine the frequent itemsets of each piece of sub-data in parallel, and the frequent itemsets of the sub-data are then used to obtain the frequent itemsets of the total data. Compared with the prior art, in which mining frequent itemsets in big data takes a considerable amount of time, this improves the efficiency of frequent itemset mining in big data.

Brief description of the drawings

The accompanying drawings described here are provided for further understanding of this application and constitute a part of it. The illustrative embodiments of this application and their descriptions are used to explain this application and do not unduly limit it. In the drawings:

Fig. 1 is a schematic structural diagram of a distributed computing system provided by the embodiments of this application;

Fig. 2 is a schematic flow diagram of a frequent itemset mining method provided by Embodiment 1 of this application;

Fig. 3 is a schematic flow diagram of a frequent itemset mining method based on the Map Reduce algorithm provided by Embodiment 1 of this application;

Fig. 4 is a schematic flow diagram of a frequent itemset mining method provided by Embodiment 2 of this application;

Fig. 5 is a schematic diagram of a process for mining the frequent itemsets of sub-transaction data sets provided by Embodiment 2 of this application;

Fig. 6 is a schematic flow diagram of obtaining frequent itemsets from the frequent itemsets of the sub-data and the transaction data set, provided by Embodiment 2 of this application;

Fig. 7 is a schematic structural diagram of a frequent itemset mining apparatus provided by Embodiment 3 of this application;

Fig. 8 is a schematic structural diagram of a frequent itemset mining system provided by Embodiment 4 of this application.

Detailed description

In the embodiments of this application, frequent itemsets can be mined with a distributed computing system, that is, a system that runs on a server cluster and performs distributed computation on data. Fig. 1 is a schematic structural diagram of such a distributed computing system. The distributed computing system includes a master node and at least two slave nodes. The master node is mainly used to distribute the tasks received from the client to the slave nodes and to schedule the slave nodes so that the tasks are executed efficiently; the slave nodes are mainly used to execute the tasks distributed by the master node.

The distributed computing system may also include a storage system. For data safety, the storage system can back data up to multiple nodes by replication.

In the embodiments of this application, the client can receive an operation instruction entered by a user and, according to that instruction, send the corresponding task to the distributed computing system. The client can also feed the computation results of the distributed computing system back to the user.

Because distributed computing systems are a relatively mature related technology, this specification does not describe them further. A frequent itemset mining method based on a distributed computing system is described in detail below with reference to the embodiments of this application.

To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions of this application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application, without creative effort, fall within the scope of protection of this application.

The technical solutions provided by the embodiments of this application are described in detail below with reference to the drawings.

Embodiment 1

To solve the prior-art problem that mining frequent itemsets in big data takes a considerable amount of time, Embodiment 1 of this application provides a frequent itemset mining method. The method can be executed by a server, for example, the server that acts as the master node of a distributed computing system in a server cluster.

For ease of description, the implementation of the method is introduced below using, as an example, the case in which the method is executed by the server that acts as the master node of the distributed computing system in a server cluster. It should be understood that this choice of executing entity is only an illustrative example and is not to be construed as a limitation on the method.

The implementation flow of the method is shown in Fig. 2 and includes the following steps:

Step 11: After receiving a frequent itemset mining task for total data assigned by a client, the master node partitions the total data according to a predetermined data partitioning rule to obtain pieces of sub-data;

In the embodiments of this application, the total data is data that contains a transaction data set. That is, frequent itemset mining is performed on the transaction data set contained in the total data.

In the embodiments of this application, when frequent itemset mining is performed on the total data, the frequent itemset mining task can be divided into two stages to improve mining efficiency. The first-stage task and the second-stage task of the frequent itemset mining task are described in detail below.

In the embodiments of this application, the frequent itemsets of the total data can be mined with the distributed computing system. After the frequent itemset mining task for the total data assigned by the client is received, the slave nodes used to execute the first-stage task of the frequent itemset mining task can be determined.

In the embodiments of this application, when big data is mined, the transaction data set contained in the total data often holds a very large number of transactions, and the number of frequent 1-itemsets is often very large as well, so mining the frequent itemsets takes a great deal of time. To improve mining efficiency, the total data can be divided into several pieces of sub-data according to the predetermined data partitioning rule, and frequent itemset mining is then performed on each piece of sub-data. The transaction data set contained in a piece of sub-data is called a sub-transaction data set. That is, according to the predetermined data partitioning rule, the transaction data set contained in the total data can be divided into several sub-transaction data sets.

In the embodiments of this application, the predetermined data partitioning rule is a rule that determines the number of transactions each sub-transaction data set contains after partitioning and guarantees the integrity of each transaction in the sub-transaction data sets. Integrity of a transaction means that the items the transaction contains after partitioning are identical to the items it contained before partitioning.

In practice, the number of transactions that each sub-transaction data set contains under the data partitioning rule can be configured according to the computing capability of the distributed computing system.
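
One simple partitioning rule that satisfies both requirements can be sketched as follows. Cutting the transaction list into consecutive fixed-size chunks is an assumption of this sketch, not a rule stated in the text, and `split_transactions` and `max_per_subset` are hypothetical names.

```python
def split_transactions(transactions, max_per_subset):
    """Cut the transaction list into consecutive chunks.

    Each transaction is moved as a whole, so the items it contains are
    identical before and after the split (transaction integrity).
    """
    if max_per_subset < 1:
        raise ValueError("max_per_subset must be at least 1")
    return [
        transactions[i:i + max_per_subset]
        for i in range(0, len(transactions), max_per_subset)
    ]

transactions = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
subsets = split_transactions(transactions, max_per_subset=2)
print([len(s) for s in subsets])
```

In a real cluster, `max_per_subset` would be tuned to the computing capability of the nodes, as the paragraph above describes.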

Step 12: The master node distributes the pieces of sub-data to at least two slave nodes used to execute, in parallel, the first-stage task of the frequent itemset mining task;

The first-stage task specifically includes: each slave node, according to a preset minimum support threshold, mines the sub-data allocated to it with a frequent itemset mining algorithm to obtain the frequent itemsets of the sub-data (hereinafter called local frequent itemsets).

In the embodiments of this application, the frequent itemsets of each sub-transaction data set can be mined first. For ease of description, the task of mining the frequent itemsets of each sub-transaction data set is called the first-stage task of the frequent itemset mining task. In practice, the slave nodes of the distributed computing system can be used to execute the first-stage task. When the slave nodes are used to mine the frequent itemsets of the sub-transaction data sets, the number of pieces of sub-data distributed to each slave node can be determined according to the computing capability of that node; the embodiments of this application do not limit this.

In the embodiments of this application, after the slave nodes used to execute the first-stage task of the frequent itemset mining task are determined, the first-stage task can be distributed to those slave nodes so that they execute the first-stage task in parallel.

In the embodiments of this application, the first-stage task specifically includes: according to the preset minimum support threshold, mining the sub-data allocated to the slave node with a frequent itemset mining algorithm, and taking the resulting frequent itemsets of the sub-data as local frequent itemsets.

In practice, the preset minimum support threshold can be set according to business requirements; the embodiments of this application do not limit this.

In practice, when a slave node mines the sub-data allocated to it with a frequent itemset mining algorithm, it can use a frequent itemset mining algorithm based on the Map Reduce algorithm. The Map Reduce algorithm consists of a mapping (Map) algorithm and a reduction (Reduce) algorithm, and its computation is divided into a Map stage and a Reduce stage. In distributed computation, a program that executes the Map algorithm can be called a Map node, and a program that executes the Reduce algorithm can be called a Reduce node. Because the Map Reduce algorithm is a relatively mature related technology, this specification does not describe it further. The process of mining the sub-data allocated to a slave node with a frequent itemset mining algorithm based on the Map Reduce algorithm is described in detail below.

In practice, if frequent itemsets are determined by comparing the support of each itemset with the preset minimum support threshold, the support of each itemset in the transaction data set must be calculated, that is, the number of transactions that contain all items of the itemset must be divided by the total number of transactions. This consumes computing resources and reduces the efficiency of frequent itemset mining. Therefore, to improve mining efficiency, frequent itemsets can be mined according to the support count of each itemset, without calculating the support of each itemset in the transaction data set. The support count of an itemset is the total number of transactions in the transaction data set that contain all items of the itemset, also called the frequency of the itemset.

In practice, the minimum support count threshold of frequent itemsets in the transaction data set can be obtained from the preset minimum support threshold and the total number of transactions in the transaction data set, and serves as the global minimum support count. The global minimum support count can then be divided by the number of transactions in the sub-transaction data set to obtain the local minimum support count threshold of frequent itemsets in the sub-transaction data set.
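
Converting a support threshold into a count threshold can be sketched as below. Rounding up to the next integer and using exact fractions are choices of this sketch. The local threshold shown here simply applies the same support ratio to the size of one sub-transaction data set; that is an assumption for illustration and not necessarily the formula intended by the text. The name `min_support_count` is hypothetical.

```python
from fractions import Fraction
from math import ceil

def min_support_count(min_support, n_transactions):
    """Smallest integer count c such that c / n_transactions >= min_support."""
    return ceil(Fraction(min_support) * n_transactions)

MIN_SUPPORT = Fraction(15, 100)  # 15%, kept exact to avoid float rounding

# Global threshold for a transaction data set of 1000 transactions.
global_min_count = min_support_count(MIN_SUPPORT, 1000)

# Assumption of this sketch: reuse the same ratio on a sub-transaction
# data set of 250 transactions to get a local threshold.
local_min_count = min_support_count(MIN_SUPPORT, 250)

print(global_min_count, local_min_count)
```

With count thresholds in hand, an itemset can be tested with a single integer comparison instead of a division per itemset.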

In the embodiments of this application, the process of mining local frequent itemsets from the sub-data allocated to a slave node with a frequent itemset mining algorithm based on the Map Reduce algorithm is shown by the solid arrows in Fig. 3. First, the sub-transaction data set in a piece of sub-data obtained by partitioning the total data is used as the input of the Map algorithm. Then, according to the preset local minimum support count threshold corresponding to the input sub-transaction data set, the frequent itemsets of the sub-transaction data set are mined with the frequent itemset mining algorithm.

In practice, the frequent itemset mining algorithm includes at least one of the following: the a priori frequent itemset mining (Apriori) algorithm and the frequent pattern tree (FP-Tree) algorithm. Specifically, the frequent itemset mining algorithm can mine frequent itemsets with a level-wise iterative search, that is, frequent k-itemsets are used to search for frequent (k+1)-itemsets. To do so, candidate (k+1)-itemsets are first generated from the frequent k-itemsets, and the candidates that meet the minimum support threshold are then selected as the final frequent (k+1)-itemsets. A level-wise iterative frequent itemset mining algorithm can be, for example, the Apriori algorithm. Because frequent itemset mining algorithms are a relatively mature related technology, the embodiments of this application do not describe them further.
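
The level-wise search can be sketched as a compact Apriori-style routine. This is a textbook-style illustration that uses a support count threshold, not the patent's implementation; the function name `apriori` and the sample data are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: frequent k-itemsets generate candidate (k+1)-itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        # Support count: number of transactions containing every item.
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset({i}) for i in items if count(frozenset({i})) >= min_count}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = count(s)
        # Join step: unions of two frequent k-itemsets that have k+1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        # Keep only the candidates that meet the count threshold.
        level = {c for c in candidates if count(c) >= min_count}
        k += 1
    return frequent

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(data, min_count=3)
```

The routine recounts supports by rescanning the transactions, which keeps the sketch short but is slower than the counting structures a production miner would use.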

In the embodiments of this application, after the local frequent itemsets are obtained with the frequent itemset mining algorithm, the Map algorithm can output the mining results. The Map algorithm can output the local frequent itemsets in the format <key, value>, where key is a local frequent itemset and value is the support count of that local frequent itemset.

In the embodiments of this application, after the local frequent itemsets of each sub-transaction data set are obtained by the frequent itemset mining algorithm based on the Map algorithm, the Reduce algorithm can collect and organize the local frequent itemsets output by all Map nodes. The output of the Map algorithm can be used as the input of the Reduce algorithm, and the Reduce algorithm can then collect and save the local frequent itemsets output by all Map nodes for later use. The output format of the Reduce algorithm can also be <key, value>, where key is a local frequent itemset and value is 1.
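
The Map and Reduce outputs described above can be sketched in plain Python, without a real Map Reduce framework. Brute-force subset counting stands in for Apriori or FP-Tree here, which is only practical for tiny examples; the function names, the sample sub-transaction data sets, and the local threshold of 2 are assumptions.

```python
from collections import Counter
from itertools import combinations

def map_local_frequent(sub_transactions, local_min_count):
    """Map step: emit <itemset, local support count> for each local frequent itemset.

    Brute-force counting of every subset stands in for a real mining algorithm.
    """
    counts = Counter()
    for t in sub_transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return [(s, c) for s, c in counts.items() if c >= local_min_count]

def reduce_collect(map_outputs):
    """Reduce step: merge the outputs of all Map nodes into <itemset, 1> pairs."""
    merged = {}
    for pairs in map_outputs:
        for itemset, _count in pairs:
            merged[itemset] = 1  # duplicates across Map nodes collapse to one key
    return merged

sub1 = [{"bread", "ham"}, {"bread", "ham"}, {"milk"}]
sub2 = [{"bread"}, {"milk"}, {"milk", "ham"}]
local = reduce_collect([map_local_frequent(sub1, 2), map_local_frequent(sub2, 2)])
```

The Reduce output carries no counts because the local counts are not reused; the second stage recounts each collected itemset against the whole transaction data set.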

That is, in the embodiments of this application, the local frequent itemsets of the sub-transaction data set contained in each piece of sub-data can be obtained by the frequent itemset mining algorithm based on the Map algorithm, and the local frequent itemsets output by the Map nodes are then collected and organized by the Reduce algorithm.

It should be noted that obtaining the local frequent itemsets of each sub-transaction data set by the frequent itemset mining algorithm based on the Map algorithm, and then collecting and organizing the local frequent itemsets output by the Map nodes by the Reduce algorithm to obtain the final local frequent itemsets, is only one method provided by the embodiments of this application for mining the sub-data allocated to a slave node with a frequent itemset mining algorithm based on the Map Reduce algorithm.

In practice, the Map algorithm can instead first traverse every transaction in each sub-transaction data set; during the traversal, each time a transaction is found to contain some itemset, that itemset is output in the format <key, value>, where key is the itemset and value is 1. The Reduce algorithm then accumulates the number of times each itemset output by the Map algorithm occurs, which gives the support count of each itemset, and the local frequent itemsets are then obtained according to the local minimum support count threshold. This is not described further here.
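
This alternative, in which the Map step emits <itemset, 1> and the Reduce step sums the values, can be sketched as follows. Emitting every non-empty subset of each transaction is an assumption of this sketch (the text does not say which itemsets are checked), and the function names and the threshold of 2 are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def map_emit(transaction):
    """Map step: emit <itemset, 1> for every non-empty itemset in one transaction."""
    items = sorted(transaction)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo), 1

def reduce_sum(pairs):
    """Reduce step: add up the values per key to get each itemset's support count."""
    totals = defaultdict(int)
    for itemset, one in pairs:
        totals[itemset] += one
    return dict(totals)

sub_transactions = [{"bread", "ham"}, {"bread", "ham"}, {"milk"}]
pairs = [p for t in sub_transactions for p in map_emit(t)]
counts = reduce_sum(pairs)

LOCAL_MIN_COUNT = 2  # illustrative local minimum support count threshold
local_frequent = {s for s, c in counts.items() if c >= LOCAL_MIN_COUNT}
```

This is the classic word-count pattern applied to itemsets; the number of emitted pairs grows exponentially with transaction length, so it suits short transactions only.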

Step 13: The master node distributes the frequent itemsets of the sub-data to the slave nodes used to execute, in parallel, the second-stage task of the frequent itemset mining task.

The second-stage task includes: each slave node executing the second-stage task obtains the frequent itemsets of the total data according to the preset minimum support threshold and the frequent itemsets of the sub-data.

In the embodiments of this application, after the slave nodes obtain the local frequent itemsets by executing the first-stage task, the master node can determine the slave nodes used to execute the second-stage task of the frequent itemset mining task, so that those slave nodes execute the second-stage task in parallel.

In the embodiments of this application, the second-stage task specifically includes: calculating the support of each local frequent itemset in the total data, and taking the local frequent itemsets whose support in the total data is not less than the preset minimum support threshold as the frequent itemsets of the total data.

In practice, the Map Reduce algorithm can be used to calculate the support of each local frequent itemset in the total data and thereby obtain the frequent itemsets of the total data. The flow is shown by the dashed arrows in Fig. 3.

Specifically, the local frequent itemsets and the transaction data set contained in the total data are first used as the input of the Map Reduce algorithm. The number of times each local frequent itemset occurs in the transaction data set is then counted, which gives the support count of the local frequent itemset in the transaction data set, referred to as the global support count.

In practice, when the Map Reduce algorithm is used to count the support of each local frequent itemset in the transaction data set, counting efficiency can be improved by having multiple Map nodes count the local frequent itemsets in the transaction data set in parallel; alternatively, each Map node can count, in parallel, the local frequent itemsets that occur in the sub-transaction data set allocated to it. The Reduce function then accumulates the counts that the Map nodes output for the same local frequent itemset across the sub-transaction data sets, which gives the global support count of that itemset in the transaction data set.

In the embodiments of this application, once the global support counts are obtained, the local frequent itemsets whose global support count is not less than the minimum support count threshold are taken as the frequent itemsets of the total data.
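
The second stage can be sketched in the same plain-Python style: each Map call counts the collected local frequent itemsets inside one sub-transaction data set, and the Reduce call sums the partial counts and applies the global threshold. The function names, sample data, and the global threshold of 3 are assumptions for illustration.

```python
from collections import Counter

def map_count(sub_transactions, local_frequent):
    """Map step: within one partition, count the transactions containing each itemset."""
    counts = Counter()
    for t in sub_transactions:
        for itemset in local_frequent:
            if itemset <= t:
                counts[itemset] += 1
    return counts

def reduce_global(partial_counts, global_min_count):
    """Reduce step: sum per-partition counts, keep itemsets meeting the global threshold."""
    total = Counter()
    for c in partial_counts:
        total.update(c)  # Counter.update adds counts rather than replacing them
    return {s: n for s, n in total.items() if n >= global_min_count}

sub1 = [{"bread", "ham"}, {"bread", "ham"}, {"milk"}]
sub2 = [{"bread"}, {"milk"}, {"milk", "ham"}]
local_frequent = [
    frozenset({"bread"}),
    frozenset({"ham"}),
    frozenset({"bread", "ham"}),
    frozenset({"milk"}),
]

partials = [map_count(sub1, local_frequent), map_count(sub2, local_frequent)]
global_frequent = reduce_global(partials, global_min_count=3)
```

In this data, {bread, ham} is frequent in the first partition but reaches a global count of only 2, so the second stage filters it out.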

In actual applications, if it is desired to obtain frequent k- item collections, then can be from the frequent episode of the total data Concentrate and obtain frequent k- item collections.

In the embodiment of the present application, the excavation of rule can also be associated using the frequent k- item collections.

Specifically, after the association rule mining task for the total data of client appointment is received, The association rule mining task is distributed to each slave node for performing association rule mining task, with So that each slave node for being used to perform association rule mining task performs the correlation rule digging parallel Pick task.

In the embodiment of the present application, the association rule mining task, including：Obtaining the frequency of the total data After numerous item collection, according to the frequent item set of default minimal confidence threshold and the total data, obtain described total Correlation rule in data.

Specifically, when the association rules in the total data are obtained according to the preset minimum confidence threshold and the frequent item sets of the total data, the frequent k-itemsets among the frequent item sets of the total data are obtained first, candidate association rules are derived from the frequent k-itemsets, and the confidence of each candidate association rule is calculated. The candidate rules whose confidence is not less than the preset minimum confidence threshold are then taken as the association rules in the total data.

In practical applications, the confidence of a candidate association rule is obtained from the support count of the frequent k-itemset and the support counts of the items in that frequent k-itemset. That is, the support count of the frequent k-itemset is divided by the support count of the items that form the antecedent of the rule, so that the confidence does not exceed 1.
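As a non-limiting illustration, the confidence calculation can be sketched as follows, assuming the usual definition confidence(A ⇒ B) = support_count(A ∪ B) / support_count(A). The function name `rules_from_itemset`, the layout of the `support_count` dictionary, and the sample counts are assumptions made for this sketch only.

```python
from itertools import combinations


def rules_from_itemset(itemset, support_count, min_confidence):
    """Derive candidate rules A => (itemset - A) from one frequent k-itemset and
    keep those whose confidence is not less than min_confidence.

    support_count maps frozensets to support counts and is assumed to contain
    the item set itself and every non-empty proper subset of it.
    """
    whole = frozenset(itemset)
    items = sorted(whole)
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            a = frozenset(antecedent)
            # Assumed definition: count of the whole item set over count of A.
            confidence = support_count[whole] / support_count[a]
            if confidence >= min_confidence:
                rules.append((a, whole - a, confidence))
    return rules


# Made-up support counts for the items x and y.
support_count = {
    frozenset({"x"}): 4,
    frozenset({"y"}): 2,
    frozenset({"x", "y"}): 2,
}
rules = rules_from_itemset({"x", "y"}, support_count, min_confidence=0.8)
```

With these counts, the rule y ⇒ x has confidence 2/2 = 1.0 and is kept, while x ⇒ y has confidence 2/4 = 0.5 and is rejected.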

In practical applications, the preset minimum confidence threshold can be set according to business requirements; the embodiment of the present application does not limit it.

Because obtaining association rules from frequent k-itemsets and a minimum confidence threshold is a relatively mature technique in the related art, it is not described further in this specification.

The method for mining frequent item sets provided in Embodiment 1 of the present application divides the total data into at least two pieces of subdata, uses slave nodes in a distributed computing system to mine the local frequent item sets of each piece of subdata in parallel, and then uses the local frequent item sets to obtain the frequent item sets of the total data. Compared with the prior art, in which mining frequent item sets in big data takes a considerable amount of time, this improves the efficiency of frequent item set mining in big data.

Embodiment 2

The inventive concept of the present application has been described in detail in Embodiment 1 above. To make the technical features, means and effects of the present application easier to understand, the method for mining frequent item sets is further illustrated below, forming another embodiment of the present application.

The frequent item set mining process in Embodiment 2 of the present application is similar to that described in Embodiment 1. For steps not introduced in Embodiment 2, refer to the related description in Embodiment 1; they are not repeated here.

Before the implementation of the scheme is described in detail, its implementation scenario is briefly introduced.

In this implementation scenario, the frequent item sets in data d are to be mined, and the preset minimum support threshold is 40%. The transaction data set D = {T1, T2, T3, T4, T5} in data d contains 5 transaction records, expressed as:

where TID denotes the ID of a transaction, and the set of items is I = {I1, I2, I3, I4, I5, I6}.

It should be noted that the transaction data set D provided in Embodiment 2 is only an example used to describe the inventive concept clearly; in practical applications, the data processed by the method for mining frequent item sets provided by the present application is big data.

Based on the above scenario, the process of mining frequent item sets in Embodiment 2 is shown in Fig. 4 and includes the following steps:

Step 21: The transaction data set D is divided into 2 pieces of subdata according to a predetermined data segmentation rule, and the task of mining the frequent item sets in the transaction data set D is assigned to the host node;

where subtransaction data set S1 = {T1, T4} and subtransaction data set S2 = {T2, T3, T5};

The task of mining the frequent item sets in the transaction data set D includes: the task of mining the local frequent item sets in the subtransaction data sets S1 and S2; and the task of obtaining the frequent item sets of the total transaction data set according to the local frequent item sets.

Step 22: The host node distributes the task of mining the frequent item sets in the subtransaction data sets S1 and S2 to the slave nodes determined for performing the subtransaction data set mining task;

Step 23: Each slave node performs the frequent item set mining task and obtains the frequent item sets of its subtransaction data set as local frequent item sets;

The preset minimum support count is the minimum support threshold multiplied by the number of transactions in the transaction data set, that is, 40% × 5 = 2. The minimum support count used when mining frequent item sets in subtransaction data set S1 is then 2/2 = 1, and the minimum support count used when mining frequent item sets in subtransaction data set S2 is 2/3.

As shown in Fig. 5, in each slave node the subtransaction data sets S1 and S2 are used as the input of the Map nodes. The local frequent item sets of S1 and S2 are obtained in the Map stage, and the local frequent item sets obtained by the slave nodes are collected and stored together in the Reduce stage.
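As a non-limiting illustration, the local mining performed on one subtransaction data set, followed by the collection of the results, can be sketched with a level-wise (Apriori-style) search. The function names, the choice of Apriori rather than FP-tree, and the sample transactions are assumptions made for this sketch; the minimum support count for each subtransaction data set is supplied by the caller.

```python
from itertools import combinations


def local_frequent_itemsets(transactions, min_support_count):
    """Map-stage task (hypothetical name): level-wise, Apriori-style mining of
    one subtransaction data set."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Number of transactions that contain each candidate item set.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    singles = {frozenset({i}) for t in transactions for i in t}
    current = {c for c, n in count(singles).items() if n >= min_support_count}
    frequent = set(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-item sets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c, n in count(candidates).items() if n >= min_support_count}
        frequent |= current
        k += 1
    return frequent


def collect(local_results):
    """Reduce-stage task (hypothetical name): merge the local frequent item
    sets reported by the slave nodes into one stored collection."""
    return set().union(*local_results)


# Made-up subtransaction data sets, not the S1 and S2 of the embodiment.
s_a = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
s_b = [{"a"}, {"a", "c"}]
merged = collect([
    local_frequent_itemsets(s_a, min_support_count=2),
    local_frequent_itemsets(s_b, min_support_count=2),
])
```

The merged collection serves only as the candidate list for the later global count; an item set that is frequent in one subtransaction data set may still fall below the global threshold.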

Step 24: The host node distributes the task of obtaining the frequent item sets of the total transaction data set according to the local frequent item sets to the slave nodes;

Step 25: The slave nodes count the number of occurrences of each local frequent item set in the transaction data set; this number is the global support count of that local frequent item set.

As shown in Fig. 6, the local frequent item sets and the transaction data set are used as the input of the Map nodes. The count of each local frequent item set in the transaction data set is obtained in the Map stage, and in the Reduce stage the counts obtained in the Map stage for the same local frequent item set are accumulated, which gives the global support count of that frequent item set in the transaction data set.

Step 26: The slave nodes compare the global support count of each local frequent item set with the preset minimum global support count threshold of 2 to obtain the frequent item sets.

Two frequent item sets are obtained for the full data set, namely {I3} and {I1, I3, I4}. If only frequent k-itemsets are needed, the item set {I3} can be discarded, which leaves the frequent k-itemset {I1, I3, I4}.

The method for mining frequent item sets provided in Embodiment 2 of the present application divides the total data into at least two pieces of subdata, uses slave nodes in a distributed computing system to mine the local frequent item sets of each piece of subdata in parallel, and then uses the local frequent item sets to obtain the frequent item sets of the total data. Compared with the prior art, in which mining frequent item sets in big data takes a considerable amount of time, this improves the efficiency of frequent item set mining in big data.

Embodiment 3

To solve the prior-art problem that mining frequent item sets in big data takes a considerable amount of time, Embodiment 3 of the present application provides a frequent item set mining device. The structure of the device is shown schematically in Fig. 7, and it mainly includes the following functional units:

Slave node determining unit 31, configured to, after a frequent item set mining task for total data assigned by a client is received, perform data segmentation on the total data according to a predetermined data segmentation rule to obtain the pieces of subdata;

Subdata frequent item set acquiring unit 32, configured to distribute the pieces of subdata to at least two slave nodes used for performing the first-stage task of the frequent item set mining task in parallel; the first-stage task specifically includes: the slave node performs frequent item set mining on its allocated subdata using a frequent item set mining algorithm according to a preset minimum support threshold, and obtains the frequent item sets of the subdata;

Total frequent item set acquiring unit 33, configured to distribute the frequent item sets of the subdata to the slave nodes used for performing the second-stage task of the frequent item set mining task in parallel; the second-stage task includes: each slave node used for performing the second-stage task obtains the frequent item sets of the total data according to the preset minimum support threshold and the frequent item sets of the subdata.

The frequent item set mining algorithm includes at least one of the following:

the Apriori frequent item set mining algorithm;

the FP-tree algorithm.

Association rule mining unit 34, configured to, after an association rule mining task for the total data assigned by the client is received, distribute the association rule mining task to the slave nodes used for performing association rule mining tasks, so that these slave nodes perform the association rule mining task in parallel; the association rule mining task includes: after the frequent item sets of the total data are obtained, obtaining the association rules in the total data according to a preset minimum confidence threshold and the frequent item sets of the total data.

The frequent item set mining device provided in Embodiment 3 of the present application divides the total data into at least two pieces of subdata, uses slave nodes in a distributed computing system to mine the local frequent item sets of each piece of subdata in parallel, and then uses the local frequent item sets to obtain the frequent item sets of the total data. Compared with the prior art, in which mining frequent item sets in big data takes a considerable amount of time, this improves the efficiency of frequent item set mining in big data.

Embodiment 4

To solve the prior-art problem that mining frequent item sets in big data takes a considerable amount of time, Embodiment 4 of the present application provides a frequent item set mining system. The structure of the system is shown schematically in Fig. 8, and it includes a host node and at least two slave nodes. The functions of the components of the system are introduced below:

Each piece of subdata is distributed to at least two slave nodes used for performing the first-stage task of the frequent item set mining task in parallel; the first-stage task specifically includes: the slave node performs frequent item set mining on its allocated subdata using a frequent item set mining algorithm according to a preset minimum support threshold, and obtains the frequent item sets of the subdata;

The slave node is configured to perform the tasks distributed by the host node.

The frequent item set mining system provided in Embodiment 4 of the present application divides the total data into at least two pieces of subdata, uses slave nodes in a distributed computing system to mine the local frequent item sets of each piece of subdata in parallel, and then uses the local frequent item sets to obtain the frequent item sets of the total data. Compared with the prior art, in which mining frequent item sets in big data takes a considerable amount of time, this improves the efficiency of frequent item set mining in big data.
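As a non-limiting illustration, the division of work between the host node and the slave nodes can be sketched on a single machine, with a thread pool standing in for the slave nodes. The round-robin data segmentation rule, the brute-force local mining, the use of the same support ratio for every piece of subdata, and all function names are assumptions made for this sketch, not features stated in the embodiments.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def mine_local(transactions, min_ratio):
    """First-stage task for one slave node: brute-force enumeration of the
    item sets that are frequent within one piece of subdata."""
    need = min_ratio * len(transactions)
    items = sorted({i for t in transactions for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = frozenset(combo)
            if sum(1 for t in transactions if c <= t) >= need:
                found.add(c)
    return found


def count_in(candidates, transactions):
    """Second-stage task for one slave node: count every candidate item set in
    one piece of subdata."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def host_node(total_data, n_parts, min_ratio):
    """Host node: segment the data, dispatch both stages, filter the result."""
    # Assumed data segmentation rule: round-robin split into n_parts pieces.
    parts = [total_data[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        # Stage 1: the slave nodes mine local frequent item sets in parallel.
        local_sets = pool.map(mine_local, parts, [min_ratio] * n_parts)
        candidates = set().union(*local_sets)
        # Stage 2: the slave nodes count all candidates in parallel.
        partial = list(pool.map(count_in, [candidates] * n_parts, parts))
    need = min_ratio * len(total_data)
    totals = {c: sum(p[c] for p in partial) for c in candidates}
    return {c: n for c, n in totals.items() if n >= need}


# Made-up total data of four transactions over the items a, b and c.
total_data = [frozenset("ab"), frozenset("bc"), frozenset("ab"), frozenset("c")]
result = host_node(total_data, n_parts=2, min_ratio=0.5)
```

In this sample, `{b, c}` is frequent in one piece of subdata but occurs only once in the total data, so the second stage removes it from the final result.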

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The foregoing is merely embodiments of the present application and is not intended to limit the present application. Those skilled in the art may make various modifications and variations to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims

1. A method for mining frequent item sets, characterized by comprising:

2. The method according to claim 1, characterized in that obtaining the frequent item sets of the total data according to the preset minimum support threshold and the frequent item sets of the subdata comprises:

calculating the support, in the total data, of the frequent item sets of each piece of subdata;

taking the frequent item sets of the subdata whose support in the total data is not less than the preset minimum support threshold as the frequent item sets of the total data.

3. The method according to claim 1, characterized in that obtaining the frequent item sets of the total data according to the preset minimum support threshold and the frequent item sets of the subdata comprises:

obtaining a minimum support count threshold according to the minimum support threshold and the number of transactions in a transaction data set, the transaction data set being the set of transactions contained in the total data;

calculating, using a MapReduce algorithm, the support count in the total data of the frequent item sets of each piece of subdata;

taking the frequent item sets of the subdata whose support count in the total data is not less than the minimum support count threshold as the frequent item sets of the total data.

4. The method according to claim 1, characterized in that performing frequent item set mining on the subdata allocated to a slave node using a frequent item set mining algorithm according to the preset minimum support threshold comprises:

performing, according to the preset minimum support threshold, frequent item set mining on the subdata allocated to the slave node using a frequent item set mining algorithm based on a MapReduce algorithm.

5. The method according to claim 1, characterized in that the frequent item set mining algorithm includes at least one of the following:

the Apriori frequent item set mining algorithm;

the FP-tree algorithm.

6. The method according to claim 1, characterized in that the method further comprises:

after an association rule mining task for the total data assigned by the client is received, distributing the association rule mining task to the slave nodes used for performing association rule mining tasks, so that these slave nodes perform the association rule mining task in parallel;

the association rule mining task including: after the frequent item sets of the total data are obtained, obtaining the association rules in the total data according to a preset minimum confidence threshold and the frequent item sets of the total data.

7. The method according to claim 5, characterized in that obtaining the association rules in the total data according to the preset minimum confidence threshold and the frequent item sets of the total data comprises:

obtaining the frequent k-itemsets among the frequent item sets of the total data;

obtaining candidate association rules according to the frequent k-itemsets;

calculating the confidence of the candidate association rules;

taking the candidate association rules whose confidence is not less than the preset minimum confidence threshold as the association rules in the total data.

8. A frequent item set mining device, characterized by comprising:

9. The device according to claim 8, characterized in that the frequent item set mining algorithm includes at least one of the following:

the Apriori frequent item set mining algorithm;

the FP-tree algorithm.

10. The device according to claim 8, characterized in that the device further comprises:

an association rule mining unit, configured to, after an association rule mining task for the total data assigned by the client is received, distribute the association rule mining task to the slave nodes used for performing association rule mining tasks, so that these slave nodes perform the association rule mining task in parallel;

the association rule mining task including: after the frequent item sets of the total data are obtained, obtaining the association rules in the total data according to a preset minimum confidence threshold and the frequent item sets of the total data.

11. A frequent item set mining system, characterized by comprising a host node and at least two slave nodes, wherein:

the slave node is configured to perform the tasks distributed by the host node.