Method, apparatus, and system for mining frequent itemsets
Technical field
The present application relates to the field of big data, and in particular to a method, apparatus, and system for mining frequent itemsets.
Background
With the development of Internet technology, the data generated on the Internet may contain a great deal of value. As the volume of data generated on networks keeps growing, extracting the value of that data quickly and effectively is a major problem of the big data era.
Data mining generally refers to the process of using algorithms to search large amounts of data for the information hidden in it.
At present, mining association rules in data is a widely used data mining method. An association rule indicates that a valuable association exists between different items in the data. If an association rule satisfies a preset minimum support threshold and a preset minimum confidence threshold, the rule is considered valuable; both thresholds can be set manually according to the mining requirements.
For big data, the set formed by the unique identifying feature values of the different items in the data can be expressed as I = {i1, i2, ..., im}, where the unique identifying feature value of an item is the feature value that uniquely identifies the item within the itemset. For example, for commodities with mutually distinct names, the commodity name can serve as the unique identifying feature value of an item; alternatively, different numbers can be assigned to different commodities, and the commodity number can then serve as the unique identifying feature value. In the sets and transaction datasets described in the embodiments of the present application, each item can be represented by its unique identifying feature value. A transaction Ti consists of at least one item, that is, Ti is a non-empty subset of I. A transaction dataset is the set formed by the transactions Ti and can be expressed as D = {T1, T2, ..., Ti, ..., Tn}, i ∈ [1, n].
Mining association rules in data mainly takes two steps: 1. mining frequent itemsets; 2. generating association rules from the frequent itemsets. Mining the frequent itemsets is the key step. Let S be a set of unique identifying feature values of items, S = {i | i ∈ I}, called an itemset for short; an itemset containing k items is called a k-itemset. A frequent itemset is then an itemset whose support is not less than the preset minimum support threshold. The support of an itemset in D is the percentage of the transactions in D that contain all items of the itemset simultaneously, relative to the total number of transactions in D; this percentage can also be understood as the probability that the itemset occurs in D. For example, take the shopping records of 1000 customers as the transaction dataset and the commodities in each purchase record as items. If 200 of the customers bought both bread and ham and the preset minimum support threshold is 15%, then the support of the 2-itemset {bread, ham} is 20%, so it is a frequent itemset.
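The support calculation in the bread-and-ham example can be sketched as follows. This is a minimal illustration, not part of the application: the function name `support` and the contents of the other 800 shopping records are invented for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# Synthetic data mirroring the example: 1000 shopping records, 200 of which
# contain both bread and ham. The remaining records are made up.
transactions = (
    [frozenset({"bread", "ham"})] * 200
    + [frozenset({"bread"})] * 300
    + [frozenset({"milk"})] * 500
)

min_support = 0.15
s = support({"bread", "ham"}, transactions)
is_frequent = s >= min_support
print(s, is_frequent)
```

An itemset is reported as frequent exactly when its support reaches the threshold, which is the case here (20% against 15%).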
Frequent itemsets can be mined with an iterative level-wise search, in which frequent k-itemsets are used to find frequent (k+1)-itemsets. To do so, candidate (k+1)-itemsets are first generated from the frequent k-itemsets, and the candidates that satisfy the minimum support threshold are then selected as the final frequent (k+1)-itemsets.
When there are many frequent k-itemsets, there are even more candidate (k+1)-itemsets; for example, when the number of frequent 1-itemsets is 1000, the number of candidate 2-itemsets is already very large. In big data mining the number of frequent k-itemsets is often very large, so mining frequent itemsets takes a great deal of time and the mining efficiency is low.
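To give a feel for the growth in candidates, the sketch below assumes that candidate 2-itemsets are formed by pairing every two frequent 1-itemsets, as level-wise algorithms such as Apriori do. Under that assumption the number of candidates is the number of unordered pairs. The helper name `candidate_2_itemsets` is hypothetical.

```python
from itertools import combinations
from math import comb


def candidate_2_itemsets(frequent_items):
    """Pair every two frequent items into a candidate 2-itemset."""
    return [frozenset(pair) for pair in combinations(sorted(frequent_items), 2)]


# Number of unordered pairs among 1000 frequent 1-itemsets.
pairs_for_1000 = comb(1000, 2)
print(pairs_for_1000)

# Small check that the generator agrees with the closed form.
small = candidate_2_itemsets(range(5))
print(len(small) == comb(5, 2))
```

Each candidate still has to be counted against the transaction dataset, which is where most of the time goes.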
Summary of the invention
The embodiments of the present application provide a method for mining frequent itemsets, to solve the prior-art problem that mining frequent itemsets in big data takes a great deal of time.
The embodiments of the present application also provide an apparatus for mining frequent itemsets, to solve the same prior-art problem.
The embodiments of the present application also provide a system for mining frequent itemsets, to solve the same prior-art problem.
The embodiments of the present application adopt the following technical solutions:
A method for mining frequent itemsets, comprising:
after receiving a frequent itemset mining task for total data specified by a client, a master node splits the total data according to a predetermined data splitting rule to obtain pieces of sub-data;
the master node distributes the pieces of sub-data to at least two slave nodes that execute, in parallel, a first-stage task of the frequent itemset mining task; the first-stage task specifically comprises: each slave node mines frequent itemsets from its allocated sub-data using a frequent itemset mining algorithm according to a preset minimum support threshold, obtaining the frequent itemsets of the sub-data;
the master node distributes the frequent itemsets of the sub-data to the slave nodes that execute, in parallel, a second-stage task of the frequent itemset mining task; the second-stage task comprises: each slave node executing the second-stage task obtains the frequent itemsets of the total data according to the preset minimum support threshold and the frequent itemsets of the sub-data.
An apparatus for mining frequent itemsets, comprising:
a slave node determining unit, configured to split the total data according to a predetermined data splitting rule to obtain pieces of sub-data after a frequent itemset mining task for total data specified by a client is received;
a sub-data frequent itemset acquiring unit, configured to distribute the pieces of sub-data to at least two slave nodes that execute, in parallel, a first-stage task of the frequent itemset mining task; the first-stage task specifically comprises: each slave node mines frequent itemsets from its allocated sub-data using a frequent itemset mining algorithm according to a preset minimum support threshold, obtaining the frequent itemsets of the sub-data;
a total frequent itemset acquiring unit, configured to distribute the frequent itemsets of the sub-data to the slave nodes that execute, in parallel, a second-stage task of the frequent itemset mining task; the second-stage task comprises: each slave node executing the second-stage task obtains the frequent itemsets of the total data according to the preset minimum support threshold and the frequent itemsets of the sub-data.
A system for mining frequent itemsets, comprising a master node and at least two slave nodes, wherein:
the master node is configured to split the total data according to a predetermined data splitting rule to obtain pieces of sub-data after receiving a frequent itemset mining task for total data specified by a client;
to distribute the pieces of sub-data to at least two slave nodes that execute, in parallel, a first-stage task of the frequent itemset mining task; the first-stage task specifically comprises: each slave node mines frequent itemsets from its allocated sub-data using a frequent itemset mining algorithm according to a preset minimum support threshold, obtaining the frequent itemsets of the sub-data;
and to distribute the frequent itemsets of the sub-data to the slave nodes that execute, in parallel, a second-stage task of the frequent itemset mining task; the second-stage task comprises: each slave node executing the second-stage task obtains the frequent itemsets of the total data according to the preset minimum support threshold and the frequent itemsets of the sub-data;
the slave nodes are configured to execute the tasks distributed by the master node.
At least one of the above technical solutions adopted by the embodiments of the present application can achieve the following beneficial effect:
The total data is split into at least two pieces of sub-data, the slave nodes of a distributed computing system mine the frequent itemsets of each piece of sub-data in parallel, and the frequent itemsets of the sub-data are then used to obtain the frequent itemsets of the total data. Compared with the prior art, in which mining frequent itemsets in big data takes a great deal of time, this improves the efficiency of frequent itemset mining in big data.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the present application and form a part of it. The illustrative embodiments of the application and their descriptions serve to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic structural diagram of a distributed computing system provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of a method for mining frequent itemsets provided by Embodiment 1 of the present application;
Fig. 3 is a schematic flowchart of a frequent itemset mining method based on the MapReduce algorithm provided by Embodiment 1 of the present application;
Fig. 4 is a schematic flowchart of a method for mining frequent itemsets provided by Embodiment 2 of the present application;
Fig. 5 is a schematic diagram of a process of mining the frequent itemsets of sub-transaction datasets provided by Embodiment 2 of the present application;
Fig. 6 is a schematic flowchart of obtaining frequent itemsets from the frequent itemsets of the sub-data and the transaction dataset, provided by Embodiment 2 of the present application;
Fig. 7 is a schematic structural diagram of an apparatus for mining frequent itemsets provided by Embodiment 3 of the present application;
Fig. 8 is a schematic structural diagram of a system for mining frequent itemsets provided by Embodiment 4 of the present application.
Detailed description
In the embodiments of the present application, frequent itemsets can be mined using a distributed computing system, that is, a system that runs on a server cluster and performs distributed computation on data.
Fig. 1 is a schematic structural diagram of a distributed computing system. The distributed computing system includes a master node and at least two slave nodes. The master node mainly distributes the tasks received from the client to the slave nodes and schedules the slave nodes so that they execute the tasks effectively; the slave nodes mainly execute the tasks distributed by the master node.
The distributed computing system can also include a storage system. For data security, the storage system can back up data to multiple nodes by replication.
In the embodiments of the present application, the client can receive an operation instruction entered by a user and, according to the operation instruction, send the corresponding task to the distributed computing system. The client can also feed the computation results of the distributed computing system back to the user.
Because distributed computing systems are a relatively mature technology, they are not described further in this specification. A method for mining frequent itemsets based on a distributed computing system is described in detail below with reference to embodiments of the application.
To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are obviously only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the application without creative effort fall within the scope of protection of the application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the drawings.
Embodiment 1
To solve the prior-art problem that mining frequent itemsets in big data takes a great deal of time, Embodiment 1 of the present application provides a method for mining frequent itemsets. The method can be executed by a server, for example a server that acts as the master node of a distributed computing system in a server cluster.
For ease of description, the implementation of the method is introduced below taking as the executing entity a server that acts as the master node of a distributed computing system in a server cluster. This choice of executing entity is only an illustrative example and should not be construed as a limitation of the method.
The flow of the method is shown in Fig. 2 and comprises the following steps:
Step 11: After receiving a frequent itemset mining task for total data specified by a client, the master node splits the total data according to a predetermined data splitting rule to obtain pieces of sub-data.
In the embodiments of the present application, the total data is data that contains a transaction dataset. That is, frequent itemset mining is performed on the transaction dataset contained in the total data.
In the embodiments of the present application, when frequent itemsets are mined from the total data, the frequent itemset mining task can be divided into two stages to improve mining efficiency. The first-stage task and the second-stage task of the frequent itemset mining task are described in detail below.
In the embodiments of the present application, the frequent itemsets of the total data can be mined using a distributed computing system. After the frequent itemset mining task for the total data specified by the client is received, the slave nodes for executing the first-stage task of the frequent itemset mining task can be determined.
In the embodiments of the present application, when big data is mined, the transaction dataset contained in the total data often holds a very large number of transactions, and the number of frequent 1-itemsets is often very large as well, so mining the frequent itemsets takes a great deal of time. To improve mining efficiency, the total data can be split into several pieces of sub-data according to the predetermined data splitting rule, and frequent itemsets are then mined from each piece of sub-data. The transaction dataset contained in a piece of sub-data is referred to here as a sub-transaction dataset. That is, according to the predetermined data splitting rule, the transaction dataset contained in the total data can be split into several sub-transaction datasets.
In the embodiments of the present application, the predetermined data splitting rule determines the number of transactions each sub-transaction dataset contains and guarantees that every transaction in a sub-transaction dataset remains complete after splitting. A transaction is complete if the items it contains after splitting are the same as the items it contained before splitting.
In practice, the number of transactions per sub-transaction dataset specified in the data splitting rule can be set according to the computing capability of the distributed computing system.
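One way to realize such a splitting rule is fixed-size chunking at transaction boundaries, sketched below. The function name `split_transactions` and the parameter `max_per_subset` are hypothetical; the application does not prescribe a particular rule beyond keeping every transaction intact.

```python
def split_transactions(transactions, max_per_subset):
    """Split a transaction list into consecutive sub-datasets.

    Splitting happens only between transactions, so every transaction
    keeps exactly the items it had before the split.
    """
    if max_per_subset < 1:
        raise ValueError("max_per_subset must be at least 1")
    return [
        transactions[i:i + max_per_subset]
        for i in range(0, len(transactions), max_per_subset)
    ]


# Five made-up transactions split into sub-datasets of at most two each.
dataset = [
    frozenset({"I1", "I2"}),
    frozenset({"I2", "I3"}),
    frozenset({"I1", "I3", "I4"}),
    frozenset({"I5"}),
    frozenset({"I2", "I6"}),
]
subsets = split_transactions(dataset, 2)
print([len(s) for s in subsets])
```

In a real deployment `max_per_subset` would be tuned to the capacity of the cluster, as the text notes.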
Step 12: The master node distributes the pieces of sub-data to at least two slave nodes that execute, in parallel, the first-stage task of the frequent itemset mining task.
The first-stage task specifically comprises: each slave node mines frequent itemsets from its allocated sub-data using a frequent itemset mining algorithm according to the preset minimum support threshold, obtaining the frequent itemsets of the sub-data (hereinafter called local frequent itemsets).
In the embodiments of the present application, the frequent itemsets of each sub-transaction dataset can be mined first. For ease of description, the task of mining the frequent itemsets of each sub-transaction dataset is called the first-stage task of the frequent itemset mining task. In practice, the slave nodes of the distributed computing system can be used to execute the first-stage task. When the slave nodes mine the frequent itemsets of the sub-transaction datasets, the number of pieces of sub-data allocated to each slave node can be determined by that node's computing capability; the embodiments of the present application place no limitation on this.
In the embodiments of the present application, after the slave nodes for executing the first-stage task have been determined, the first-stage task can be distributed to them so that they execute it in parallel.
In the embodiments of the present application, the first-stage task specifically comprises: mining frequent itemsets from the sub-data allocated to a slave node, using a frequent itemset mining algorithm and according to the preset minimum support threshold, and taking the resulting frequent itemsets of the sub-data as local frequent itemsets.
In practice, the preset minimum support threshold can be chosen according to business requirements; the embodiments of the present application place no limitation on this.
In practice, when a slave node mines frequent itemsets from its allocated sub-data, it can use a frequent itemset mining algorithm based on the MapReduce algorithm. The MapReduce algorithm consists of a mapping (Map) algorithm and a reduction (Reduce) algorithm, and its computation is divided into a Map stage and a Reduce stage. In distributed computation, a program that executes the Map algorithm can be called a Map node, and a program that executes the Reduce algorithm can be called a Reduce node. Because the MapReduce algorithm is a relatively mature technology, it is not described further in this specification. The process of mining frequent itemsets from the sub-data allocated to a slave node with a MapReduce-based frequent itemset mining algorithm is described in detail below.
In practice, when frequent itemsets are determined by comparing the support of each itemset with the preset minimum support threshold, the support of every itemset in the transaction dataset has to be calculated, that is, the number of transactions containing all items of the itemset divided by the total number of transactions. This consumes computing resources and lowers mining efficiency. To improve efficiency, frequent itemsets can therefore be mined according to the support count of an itemset, without calculating the support of each itemset in the transaction dataset. The support count of an itemset is the total number of transactions in the transaction dataset that contain all items of the itemset, also called the frequency of the itemset.
In practice, the minimum support count threshold of frequent itemsets in the transaction dataset can be obtained from the preset minimum support threshold and the total number of transactions in the transaction dataset, and serves as the global minimum support count. The local minimum support count threshold of frequent itemsets in a sub-transaction dataset can then be obtained by dividing the global minimum support count by the number of transactions in the sub-transaction dataset.
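The conversion from a percentage threshold to a count threshold can be sketched as below. The global count follows directly from the text. For the local threshold, the sketch assumes a common convention from partition-based mining, namely applying the same percentage to the size of each sub-transaction dataset; this is an assumption made for illustration and may not match the exact arithmetic intended by the application. The function name is hypothetical.

```python
def min_support_count(min_support_percent, n_transactions):
    """Smallest integer c with c / n_transactions >= min_support_percent / 100.

    Integer ceiling division is used to avoid floating-point rounding.
    """
    return -(-min_support_percent * n_transactions // 100)


# Global threshold: 40% of a transaction dataset with 5 transactions.
global_min_count = min_support_count(40, 5)

# Assumed convention: the same percentage applied to each partition size.
partition_sizes = [3, 2]
local_min_counts = [min_support_count(40, n) for n in partition_sizes]

print(global_min_count, local_min_counts)
```

Working with counts means each comparison is an integer test, with no division per itemset.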
In the embodiments of the present application, the flow of mining local frequent itemsets from the sub-data allocated to a slave node with a MapReduce-based frequent itemset mining algorithm is shown by the solid arrows in Fig. 3. First, the sub-transaction dataset contained in a piece of sub-data obtained by splitting the total data is used as the input of the Map algorithm. Then the frequent itemsets of that sub-transaction dataset are mined with a frequent itemset mining algorithm, according to the preset local minimum support count threshold that corresponds to the input sub-transaction dataset.
In practice, the frequent itemset mining algorithm includes at least one of the following: the a priori frequent itemset mining (Apriori) algorithm and the frequent pattern tree (FP-Tree) algorithm. Specifically, the frequent itemset mining algorithm can use an iterative level-wise search, in which frequent k-itemsets are used to find frequent (k+1)-itemsets. To do so, candidate (k+1)-itemsets are first generated from the frequent k-itemsets, and the candidates that satisfy the minimum support threshold are then selected as the final frequent (k+1)-itemsets. The Apriori algorithm is one example of such a level-wise iterative algorithm. Because these frequent itemset mining algorithms are a relatively mature technology, they are not described further in the embodiments of the present application.
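The level-wise search described above can be sketched in a few lines. This is a compact, unoptimized illustration in the spirit of Apriori, not the implementation used in the application; the function name `apriori` and the sample data are invented.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        # Prune step: keep a candidate only if all its k-subsets are frequent.
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in prev for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Screening step: count each candidate against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = apriori(data, min_count=3)
print(sorted((sorted(s), c) for s, c in found.items()))
```

With a count threshold of 3, every single item and every pair qualifies, while {a, b, c} appears in only two transactions and is screened out at level 3.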
In the embodiments of the present application, after the local frequent itemsets have been obtained with the frequent itemset mining algorithm, the mining result can be output by the Map algorithm. The Map algorithm can output the local frequent itemsets in the format <key, value>, where key is a local frequent itemset and value is the support count of that local frequent itemset.
In the embodiments of the present application, after the local frequent itemsets of each sub-transaction dataset have been obtained by the Map-based frequent itemset mining algorithm, the local frequent itemsets output by all Map nodes can be collected and consolidated by the Reduce algorithm. The output of the Map algorithm is used as the input of the Reduce algorithm, which collects the local frequent itemsets output by all Map nodes and saves them for later use. The output format of the Reduce algorithm can also be <key, value>, where key is a local frequent itemset and value is 1.
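The collection step performed by the Reduce algorithm can be sketched as a plain function over the Map outputs. The Map outputs below are hand-written stand-ins for what two Map nodes might emit; the function name `reduce_collect` is hypothetical, and no actual MapReduce framework is used.

```python
def reduce_collect(all_map_outputs):
    """Merge <itemset, local count> pairs from all Map nodes.

    Each distinct local frequent itemset is emitted once as <itemset, 1>;
    the local counts are dropped because they are recomputed globally later.
    """
    merged = {}
    for pairs in all_map_outputs:
        for itemset, _local_count in pairs:
            merged[itemset] = 1
    return merged


# Invented Map outputs from two sub-transaction datasets.
map_output_1 = [
    (frozenset({"a"}), 3),
    (frozenset({"b"}), 2),
    (frozenset({"a", "b"}), 2),
]
map_output_2 = [
    (frozenset({"b"}), 1),
    (frozenset({"c"}), 2),
    (frozenset({"b", "c"}), 1),
]

collected = reduce_collect([map_output_1, map_output_2])
print(len(collected))
```

The itemset {b} is reported by both Map nodes but appears only once in the collected result, which is the deduplication the second stage relies on.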
That is, in the embodiments of the present application, the local frequent itemsets of the sub-transaction dataset contained in each piece of sub-data can be obtained by the Map-based frequent itemset mining algorithm, and the local frequent itemsets output by the Map nodes are then collected and consolidated by the Reduce algorithm.
It should be noted that this approach, in which the local frequent itemsets of each sub-transaction dataset are obtained by the Map-based frequent itemset mining algorithm and then collected and consolidated by the Reduce algorithm, is only one of the ways provided by the embodiments of the present application to mine frequent itemsets from the sub-data allocated to a slave node with a MapReduce-based frequent itemset mining algorithm.
In practice, the Map algorithm can alternatively traverse every transaction in each sub-transaction dataset. During the traversal, whenever a transaction is found to contain a certain itemset, that itemset is output in the format <key, value>, where key is the itemset and value is 1. The Reduce algorithm then sums the occurrences of each itemset output by the Map algorithm, which yields the support count of each itemset, and the local frequent itemsets are obtained according to the local minimum support count threshold. This is not described further here.
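This counting variant resembles the classic word-count pattern and can be sketched as follows. The sketch enumerates every sub-itemset of a transaction up to a size cap, which is an assumption made to keep the example small; the names `map_transaction`, `reduce_counts`, `local_frequent`, and `max_size` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_transaction(transaction, max_size):
    """Emit <itemset, 1> for every itemset of up to max_size items in the transaction."""
    items = sorted(transaction)
    for k in range(1, min(max_size, len(items)) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo), 1


def reduce_counts(pairs):
    """Sum the emitted 1s per itemset to obtain support counts."""
    totals = Counter()
    for itemset, one in pairs:
        totals[itemset] += one
    return totals


def local_frequent(sub_transactions, local_min_count, max_size=3):
    """Map every transaction, reduce the pairs, then apply the local threshold."""
    pairs = (p for t in sub_transactions for p in map_transaction(t, max_size))
    totals = reduce_counts(pairs)
    return {s: c for s, c in totals.items() if c >= local_min_count}


sub_dataset = [{"a", "b"}, {"a", "b"}, {"a"}]
local = local_frequent(sub_dataset, local_min_count=2)
print(sorted((sorted(s), c) for s, c in local.items()))
```

Enumerating all sub-itemsets grows exponentially with transaction length, which is one reason a level-wise algorithm is often preferred for long transactions.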
Step 13: The master node distributes the frequent itemsets of the sub-data to the slave nodes that execute, in parallel, the second-stage task of the frequent itemset mining task.
The second-stage task comprises: each slave node executing the second-stage task obtains the frequent itemsets of the total data according to the preset minimum support threshold and the frequent itemsets of the sub-data.
In the embodiments of the present application, after the slave nodes have obtained the local frequent itemsets by executing the first-stage task, the master node can determine the slave nodes for executing the second-stage task of the frequent itemset mining task, so that these slave nodes execute the second-stage task in parallel.
In the embodiments of the present application, the second-stage task specifically comprises: calculating the support of each local frequent itemset in the total data, and taking the local frequent itemsets whose support in the total data is not less than the preset minimum support threshold as the frequent itemsets of the total data.
In practice, the MapReduce algorithm can be used to calculate the support of each local frequent itemset in the total data and thereby obtain the frequent itemsets of the total data; the flow is shown by the dashed arrows in Fig. 3.
Specifically, the local frequent itemsets and the transaction dataset contained in the total data are first used as the input of the MapReduce algorithm. The number of times each local frequent itemset occurs in the transaction dataset is then counted, which gives the support count of that local frequent itemset in the transaction dataset, used as its global support count.
In practice, when the MapReduce algorithm is used to count the support of each local frequent itemset in the transaction dataset, multiple Map nodes can count the local frequent itemsets in the transaction dataset in parallel to improve counting efficiency; alternatively, each Map node can count, in parallel, the occurrences of the local frequent itemsets in the sub-transaction dataset allocated to it. The Reduce function then sums, for each local frequent itemset, the counts output by the Map nodes for the individual sub-transaction datasets, which gives the global support count of that itemset in the transaction dataset.
In the embodiments of the present application, once the global support counts have been obtained, the local frequent itemsets whose global support count is not less than the global minimum support count threshold are taken as the frequent itemsets of the total data.
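The second stage can be sketched as a per-partition count followed by a sum and a threshold test. Plain Python functions stand in for Map and Reduce nodes here, and all names and data are invented for illustration.

```python
from collections import Counter


def map_count(candidates, sub_transactions):
    """Count each candidate itemset within one sub-transaction dataset."""
    return [
        (c, sum(1 for t in sub_transactions if c <= t))
        for c in candidates
    ]


def reduce_sum(all_map_outputs):
    """Add up the per-partition counts of each itemset."""
    totals = Counter()
    for pairs in all_map_outputs:
        for itemset, n in pairs:
            totals[itemset] += n
    return totals


def global_frequent(candidates, partitions, global_min_count):
    """Keep the candidates whose summed count reaches the global threshold."""
    totals = reduce_sum(map_count(candidates, p) for p in partitions)
    return {s: n for s, n in totals.items() if n >= global_min_count}


partitions = [
    [frozenset({"a", "b"}), frozenset({"a", "b"}), frozenset({"a"})],
    [frozenset({"b", "c"}), frozenset({"c"})],
]
# Local frequent itemsets collected from the first stage (invented).
candidates = [
    frozenset({"a"}),
    frozenset({"b"}),
    frozenset({"c"}),
    frozenset({"a", "b"}),
    frozenset({"b", "c"}),
]
result = global_frequent(candidates, partitions, global_min_count=2)
print(sorted((sorted(s), n) for s, n in result.items()))
```

Here {b, c} was locally frequent in the second partition but occurs only once overall, so it is discarded, while {b} reaches a global count of 3 by combining both partitions.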
In practice, if frequent k-itemsets are wanted, they can be taken from the frequent itemsets of the total data.
In the embodiments of the present application, the frequent k-itemsets can also be used to mine association rules.
Specifically, after an association rule mining task for the total data specified by the client is received, the association rule mining task is distributed to the slave nodes for executing it, so that these slave nodes execute the association rule mining task in parallel.
In the embodiments of the present application, the association rule mining task comprises: after the frequent itemsets of the total data have been obtained, obtaining the association rules in the total data according to a preset minimum confidence threshold and the frequent itemsets of the total data.
Specifically, the frequent k-itemsets among the frequent itemsets of the total data are obtained first, and candidate association rules are derived from them. The confidence of each candidate association rule is then calculated, and the rules whose confidence is not less than the preset minimum confidence threshold are taken as the association rules in the total data.
In practice, the confidence of a candidate association rule is obtained from the support count of the frequent k-itemset and the support count of each item in that frequent k-itemset, that is, the support count of the frequent k-itemset divided by the support count of the respective item.
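Using the standard definition confidence(A → B) = count(A ∪ B) / count(A), rule generation from one frequent itemset can be sketched as below. The sketch only considers rules with a single item on the left-hand side, matching the per-item calculation described in the text; the function name and the support counts are invented.

```python
def rules_from_itemset(itemset, support_counts, min_confidence):
    """Derive rules {item} -> (itemset minus item) that meet min_confidence.

    support_counts must contain the count of the itemset itself and of
    every single item in it.
    """
    rules = []
    for item in sorted(itemset):
        antecedent = frozenset([item])
        confidence = support_counts[itemset] / support_counts[antecedent]
        if confidence >= min_confidence:
            rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Invented global support counts.
counts = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 2,
    frozenset({"a", "b"}): 2,
}
rules = rules_from_itemset(frozenset({"a", "b"}), counts, min_confidence=0.6)
for left, right, conf in rules:
    print(sorted(left), "->", sorted(right), conf)
```

Because an itemset can never occur more often than any of its items, the confidence always lies between 0 and 1.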
In practice, the preset minimum confidence threshold can be chosen according to business requirements; the embodiments of the present application place no limitation on this.
Because obtaining association rules from frequent k-itemsets and a minimum confidence threshold is a relatively mature technology, it is not described further in this specification.
In the method for mining frequent itemsets provided by Embodiment 1 of the present application, the total data is split into at least two pieces of sub-data, the slave nodes of a distributed computing system mine the local frequent itemsets of each piece of sub-data in parallel, and the local frequent itemsets are then used to obtain the frequent itemsets of the total data. Compared with the prior art, in which mining frequent itemsets in big data takes a great deal of time, this improves the efficiency of frequent itemset mining in big data.
Embodiment 2
The inventive concept of the present application has been described in detail in Embodiment 1 above. To make the technical features, means, and effects of the application easier to understand, the frequent item set mining method of the application is further described below, forming another embodiment of the application.
The frequent item set mining process in Embodiment 2 of the present application is similar to that described in Embodiment 1. For steps not introduced in Embodiment 2, refer to the related description in Embodiment 1; they are not repeated here.
Before the implementation of the scheme is described in detail, its implementation scenario is briefly introduced.
In this implementation scenario, the frequent item sets in data d are to be mined, and the preset minimum support threshold is 40%. The transaction data set D = {T1, T2, T3, T4, T5} in data d has 5 transaction records, expressed as:
where TID denotes the ID of a transaction, and the set of items is I = {I1, I2, I3, I4, I5, I6}.
It should be noted that the transaction data set D provided in Embodiment 2 is only an example given to describe the inventive concept clearly. In practical applications, the data processed by the frequent item set mining method provided by the application is big data.
Based on the above scenario, the process of mining the frequent item sets in Embodiment 2 is shown in Fig. 4 and comprises the following steps:
Step 21: The transaction data set D is divided into 2 pieces of sub-data according to a predetermined data segmentation rule, and the task of mining the frequent item sets in the transaction data set D is assigned to the host node;
The sub-transaction data set S1 = {T1, T4}, and the sub-transaction data set S2 = {T2, T3, T5};
The task of mining the frequent item sets in the transaction data set D includes: the task of mining the local frequent item sets in the sub-transaction data sets S1 and S2, and the task of obtaining the frequent item sets of the total transaction data set from the local frequent item sets.
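The division in Step 21 can be sketched in Python as follows. The embodiment does not disclose the predetermined data segmentation rule, so the lookup table rule below is a hypothetical stand-in that simply reproduces S1 = {T1, T4} and S2 = {T2, T3, T5}. The function name split_transactions and the item contents of each transaction are also placeholders.

```python
def split_transactions(transactions, assign):
    """Group transactions into sub-data sets keyed by the label `assign` returns."""
    parts = {}
    for tid, items in transactions.items():
        parts.setdefault(assign(tid), {})[tid] = items
    return parts


# Only the transaction IDs follow the scenario; the item contents are placeholders.
D = {"T1": {"a"}, "T2": {"b"}, "T3": {"c"}, "T4": {"d"}, "T5": {"e"}}

# Hypothetical segmentation rule: an explicit table from transaction ID to sub-data set.
rule = {"T1": "S1", "T4": "S1", "T2": "S2", "T3": "S2", "T5": "S2"}

parts = split_transactions(D, rule.get)
```

Any function from a transaction ID to a label (for example a hash or a range check) could be passed in place of rule.get.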
Step 22: The host node distributes the task of mining the frequent item sets in the sub-transaction data sets S1 and S2 to each slave node determined for performing the sub-transaction data set mining task;
Step 23: Each slave node performs the frequent item set mining task and obtains the frequent item sets of each sub-transaction data set as local frequent item sets;
The preset minimum support count is the minimum support threshold multiplied by the number of transactions in the transaction data set, i.e. 40% * 5 = 2. The minimum support count used when mining frequent item sets in sub-transaction data set S1 is then 2/2 = 1, and the minimum support count used when mining frequent item sets in sub-transaction data set S2 is 2/3.
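The global figure of 40% * 5 = 2 can be checked with a short Python sketch. The function name min_support_count is hypothetical, and rounding up to a whole number of transactions is an assumption of this sketch; the embodiment does not state how fractional results are handled.

```python
def min_support_count(threshold_percent, num_transactions):
    """Smallest whole number of transactions that reaches the percentage threshold."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-threshold_percent * num_transactions // 100)


# 40% of the 5 transactions in data set D.
global_min_count = min_support_count(40, 5)
```

Passing the threshold as an integer percentage keeps the arithmetic exact.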
As shown in Fig. 5, the sub-transaction data sets S1 and S2 are used as the input of the Map nodes in the slave nodes. The local frequent item sets of the sub-transaction data sets S1 and S2 are obtained in the Map stage, and the local frequent item sets obtained by the slave nodes are collected and stored together in the Reduce stage.
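The Map and Reduce stages just described can be sketched in plain Python, without a MapReduce framework. The map step here uses brute-force enumeration of the subsets of each transaction as a stand-in for a real mining algorithm, which is only practical for tiny transactions. The names map_local_frequent and reduce_union, the local minimum counts, and the sub-data contents are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_local_frequent(sub_data, local_min_count):
    """Map step: return the item sets that are frequent within one sub-data set."""
    counts = Counter()
    for items in sub_data:
        ordered = sorted(items)
        # Brute force: count every non-empty subset of the transaction.
        for size in range(1, len(ordered) + 1):
            for subset in combinations(ordered, size):
                counts[frozenset(subset)] += 1
    return {s for s, c in counts.items() if c >= local_min_count}


def reduce_union(local_results):
    """Reduce step: collect the local frequent item sets of all sub-data sets."""
    merged = set()
    for result in local_results:
        merged |= result
    return merged


# Hypothetical sub-data sets, not the transactions of the embodiment.
s1 = [{"a", "b"}, {"a", "c"}]
s2 = [{"b", "c"}, {"b"}, {"c"}]
candidates = reduce_union([map_local_frequent(s1, 2), map_local_frequent(s2, 2)])
```

The merged set contains every item set that is frequent in at least one sub-data set; whether each one is frequent in the whole data set is decided only after global counting.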
Step 24: The host node distributes to the slave nodes the task of obtaining the frequent item sets of the total transaction data set from the local frequent item sets;
Step 25: The slave nodes count the number of times each local frequent item set occurs in the transaction data set; this number is the global support count of that local frequent item set.
As shown in Fig. 6, each local frequent item set and the transaction data set are used as the input of the Map nodes. The count of each local frequent item set in the transaction data set is obtained in the Map stage. In the Reduce stage, the counts obtained in the Map stage for the same local frequent item set are added together, giving the global support count of that frequent item set in the transaction data set.
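The global counting stage can be sketched in the same framework-free style. Each call to map_count stands for one Map node working on one split of the transaction data set, and reduce_sum adds the partial counts for the same item set. The function names, the candidate item sets, and the splits are hypothetical.

```python
from collections import Counter


def map_count(candidates, transactions):
    """Map step: within one split, count the transactions containing each candidate."""
    counts = Counter()
    for items in transactions:
        for candidate in candidates:
            if candidate <= items:  # the candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def reduce_sum(partial_counts):
    """Reduce step: add up the counts reported for the same item set."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total


# Hypothetical candidates and splits, not the data of the embodiment.
candidates = [frozenset({"a"}), frozenset({"a", "b"})]
split_1 = [{"a", "b"}, {"a"}]
split_2 = [{"a", "b", "c"}, {"c"}]
global_counts = reduce_sum([map_count(candidates, split_1),
                            map_count(candidates, split_2)])
```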
Step 26: The slave nodes compare the global support count of each local frequent item set with the preset minimum global support count of 2 to obtain the frequent item sets.
Two frequent item sets are obtained for the whole data set, namely {I3} and {I1, I3, I4}. If only frequent k-item sets are needed, the item set {I3} can be discarded, which leaves the frequent k-item set {I1, I3, I4}.
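The comparison against the minimum global support count and the optional restriction to k-item sets can be sketched as follows. The transaction table is not reproduced here, so the counts below are hypothetical, chosen only so that the outcome matches the two item sets named above. The function name select_frequent and the choice k = 3 are likewise assumptions.

```python
def select_frequent(global_counts, min_count, k=None):
    """Keep item sets whose global count reaches min_count; optionally keep only size k."""
    frequent = {s for s, c in global_counts.items() if c >= min_count}
    if k is not None:
        frequent = {s for s in frequent if len(s) == k}
    return frequent


# Hypothetical global support counts.
global_counts = {
    frozenset({"I3"}): 3,
    frozenset({"I1", "I3", "I4"}): 2,
    frozenset({"I2"}): 1,
}
all_frequent = select_frequent(global_counts, min_count=2)
only_3_item_sets = select_frequent(global_counts, min_count=2, k=3)
```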
The frequent item set mining method provided in Embodiment 2 of the present application divides the total data into at least two pieces of sub-data, uses the slave nodes in a distributed computing system to mine the local frequent item sets of each piece of sub-data in parallel, and then uses the local frequent item sets to obtain the frequent item sets of the total data. Compared with the prior art, in which mining frequent item sets in big data takes a considerable amount of time, this improves the efficiency of frequent item set mining in big data.
Embodiment 3
To solve the prior-art problem that mining frequent item sets in big data takes a considerable amount of time, Embodiment 3 of the present application provides a frequent item set mining device. The structure of the device is shown schematically in Fig. 7, and it mainly includes the following functional units:
Slave node determining unit 31, configured to, after receiving a frequent item set mining task for the total data assigned by a client, segment the total data according to a predetermined data segmentation rule to obtain the pieces of sub-data;
Sub-data frequent item set acquiring unit 32, configured to distribute the pieces of sub-data to at least two slave nodes used to perform the first-stage task of the frequent item set mining task in parallel. The first-stage task specifically includes: the slave node performs frequent item set mining on its allocated sub-data according to a preset minimum support threshold, using a frequent item set mining algorithm, to obtain the frequent item sets of the sub-data;
Total frequent item set acquiring unit 33, configured to distribute the frequent item sets of the sub-data to each slave node used to perform the second-stage task of the frequent item set mining task in parallel. The second-stage task includes: each slave node used to perform the second-stage task obtains the frequent item sets of the total data according to the preset minimum support threshold and the frequent item sets of the sub-data.
The frequent item set mining algorithm includes at least one of the following:
the Apriori frequent item set mining algorithm;
the FP-tree algorithm.
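Of the two algorithms listed, Apriori is the simpler to illustrate. The following is a minimal, unoptimized Python sketch of its textbook form: item sets are grown level by level, candidates are formed by joining frequent item sets of the previous level, and any candidate with an infrequent subset is pruned before counting. It is not presented as the implementation used by the device; the function name apriori and the sample transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {item set: support count} for every item set with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unite frequent (k-1)-item sets whose union has exactly k items.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: one pass over the transactions for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions, chosen only for illustration.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(transactions, min_count=2)
```

In this example the candidate {a, b, c} is pruned without being counted, because its subset {b, c} is not frequent.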
Association rule mining unit 34, configured to, after receiving an association rule mining task for the total data assigned by a client, distribute the association rule mining task to each slave node used to perform association rule mining, so that these slave nodes perform the association rule mining task in parallel. The association rule mining task includes: after the frequent item sets of the total data have been obtained, obtaining the association rules in the total data according to a preset minimum confidence threshold and the frequent item sets of the total data.
The frequent item set mining device provided in Embodiment 3 of the present application divides the total data into at least two pieces of sub-data, uses the slave nodes in a distributed computing system to mine the local frequent item sets of each piece of sub-data in parallel, and then uses the local frequent item sets to obtain the frequent item sets of the total data. Compared with the prior art, in which mining frequent item sets in big data takes a considerable amount of time, this improves the efficiency of frequent item set mining in big data.
Embodiment 4
To solve the prior-art problem that mining frequent item sets in big data takes a considerable amount of time, Embodiment 4 of the present application provides a frequent item set mining system. The structure of the system is shown schematically in Fig. 8, and it includes a host node and at least two slave nodes. The functions of the components of the system are introduced below:
The host node is configured to, after receiving a frequent item set mining task for the total data assigned by a client, segment the total data according to a predetermined data segmentation rule to obtain the pieces of sub-data;
to distribute the pieces of sub-data to at least two slave nodes used to perform the first-stage task of the frequent item set mining task in parallel, where the first-stage task specifically includes: the slave node performs frequent item set mining on its allocated sub-data according to a preset minimum support threshold, using a frequent item set mining algorithm, to obtain the frequent item sets of the sub-data;
and to distribute the frequent item sets of the sub-data to each slave node used to perform the second-stage task of the frequent item set mining task in parallel, where the second-stage task includes: each slave node used to perform the second-stage task obtains the frequent item sets of the total data according to the preset minimum support threshold and the frequent item sets of the sub-data.
The slave node is configured to perform the tasks distributed by the host node.
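The division of work between the host node and the slave nodes can be sketched on a single machine, with a thread pool standing in for the slave nodes. This is an illustration of the two-stage flow only, not a distributed implementation. To keep it short, the first-stage miner finds frequent single items only, and the local and global minimum counts are passed in directly. The names first_stage, second_stage, and host_node and the sample data are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def first_stage(sub_data, min_count):
    """Slave task, stage 1: frequent single items of one sub-data set (simplified miner)."""
    counts = Counter(item for items in sub_data for item in items)
    return {frozenset([i]) for i, c in counts.items() if c >= min_count}


def second_stage(candidate, all_data):
    """Slave task, stage 2: global support count of one candidate item set."""
    return candidate, sum(1 for items in all_data if candidate <= items)


def host_node(sub_data_sets, local_min_count, global_min_count, workers=2):
    """Host role: distribute both stages to the pool and collect the results."""
    all_data = [items for part in sub_data_sets for items in part]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stage 1: each worker mines one sub-data set.
        local = pool.map(lambda part: first_stage(part, local_min_count), sub_data_sets)
        candidates = set().union(*local)
        # Stage 2: each worker counts one candidate over the total data.
        counted = pool.map(lambda c: second_stage(c, all_data), candidates)
        return {c for c, n in counted if n >= global_min_count}


# Hypothetical sub-data sets, not the data of the embodiments.
parts = [[{"a", "b"}, {"a"}], [{"b", "c"}, {"c"}, {"b"}]]
frequent = host_node(parts, local_min_count=2, global_min_count=3)
```

In this example item b is not locally frequent in the first sub-data set, but it becomes a candidate through the second one and is the only item that reaches the global count of 3.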
The frequent item set mining system provided in Embodiment 4 of the present application divides the total data into at least two pieces of sub-data, uses the slave nodes in a distributed computing system to mine the local frequent item sets of each piece of sub-data in parallel, and then uses the local frequent item sets to obtain the frequent item sets of the total data. Compared with the prior art, in which mining frequent item sets in big data takes a considerable amount of time, this improves the efficiency of frequent item set mining in big data.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The foregoing is merely embodiments of the present application and is not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the application shall fall within the scope of the claims of the application.