CN101799810A

CN101799810A - Association rule mining method and system thereof

Info

Publication number: CN101799810A
Application number: CN200910077996A
Authority: CN
Inventors: 高丹; 邓超; 徐萌; 罗治国; 周文辉; 何清; 曾立; 郑诗豪; 沈亚飞; 陈磊
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2009-02-06
Filing date: 2009-02-06
Publication date: 2010-08-11
Anticipated expiration: 2029-02-06
Also published as: CN101799810B

Abstract

The invention discloses an association rule mining method and a system thereof. The method comprises: generating (K+1)-itemsets from the frequent K-itemsets; executing a plurality of parallel processing tasks, wherein each processing task obtains its corresponding portion of a transaction data set and counts the occurrences of the (K+1)-itemsets in that portion; aggregating the results of all processing tasks to obtain the frequency counts of the (K+1)-itemsets over the whole transaction data set; generating, from those frequency counts, the frequent (K+1)-itemsets that meet the support requirement; and, when it is determined from the frequent (K+1)-itemsets that an association rule meeting the confidence requirement exists, outputting that association rule. The invention can improve the processing efficiency of association rule mining.

Description

Association rule mining method and system thereof

Technical field

The present invention relates to data mining technology in the communications field, and in particular to an association rule mining method and a system thereof.

Background technology

In data mining, the purpose of association rule mining is to discover noteworthy associations or correlations among large numbers of data items. A typical application is market basket analysis in the retail trade. Market basket analysis means studying association rules over the data to discover relationships between different commodities (items) in a transaction database and to identify patterns in customer purchasing behavior. For example, if bread and milk are often bought together, placing them close together helps increase the sales of both. To measure the importance of a rule, association rules usually use support and confidence as metrics. Support indicates how significant the commodities are among a supermarket's sales, and confidence reflects the degree of association between the commodities. If 60% of the transactions that include bread also include milk, the confidence of the association rule "bread ⇒ milk" (if bread is bought, then milk is bought) is 60%.

The support of the association rule A ⇒ B (A and B occur together) in a transaction database D can be expressed as the probability P(A ∪ B), where A ∪ B denotes the itemset containing every item of A and every item of B.

The confidence of the association rule A ⇒ B in transaction database D is the probability that B also appears in those transactions of D that contain A, i.e. the conditional probability P(B|A).

The support of an itemset X in transaction database D is the percentage of the total number of transactions N accounted for by count(X), the number of transactions in D that contain X, i.e. the probability P(X). An itemset X whose support is greater than or equal to a preset support threshold min_sup is called a frequent itemset (FI) or frequent pattern.
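The definitions of support and confidence above can be illustrated with a minimal sketch. The function names and the six example transactions are assumptions made for illustration only; they are not part of the described method.

```python
# Illustrative only: function names and transactions are made-up assumptions.
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(B|A) as support(A and B together) / support(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


# Bread appears in 5 transactions; 3 of those also contain milk.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread"}),
    frozenset({"bread", "eggs"}),
    frozenset({"milk"}),
]

rule_support = support({"bread", "milk"}, transactions)          # 3 of 6 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 3 of the 5 bread transactions
```

With this data the rule "bread ⇒ milk" has a confidence of about 60%, matching the example given in the background.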

In the prior art, association rule mining generally comprises two parts:

Part one: find all frequent itemsets, i.e. itemsets whose support is greater than or equal to the minimum support threshold;

Part two: generate, from the frequent itemsets, the association rules that satisfy the confidence threshold.

Part one is quite time-consuming, whereas part two is comparatively easy to carry out once part one is done, so the overall performance of an association rule mining algorithm is mainly determined by part one.

The Apriori algorithm of the prior art is a classic algorithm for mining the frequent itemsets of Boolean association rules. When performing part one, i.e. finding the frequent itemsets, Apriori must scan the database repeatedly. When mining massive data, memory capacity limits mean that the data cannot all be loaded into memory for computation, and may not even fit in the storage of a single machine (single node). Moreover, being a serial algorithm, Apriori limits mining efficiency to some extent.

Summary of the invention

Embodiments of the invention provide an association rule mining method and a system thereof, to solve the problem of low processing efficiency in existing association rule mining.

The association rule mining method provided by an embodiment of the invention comprises:

generating (K+1)-itemsets from the frequent K-itemsets;

executing a plurality of parallel processing tasks, wherein each processing task obtains its corresponding portion of the transaction data set and counts the occurrences of the (K+1)-itemsets in that portion;

aggregating the results of all processing tasks to obtain the frequency counts of the (K+1)-itemsets over the transaction data set, generating from those frequency counts the frequent (K+1)-itemsets that meet the support requirement, and, when it is determined from the frequent (K+1)-itemsets that an association rule meeting the confidence requirement exists, outputting that association rule.

The association rule mining system provided by an embodiment of the invention comprises:

a calling module, configured to, after (K+1)-itemsets have been generated from the frequent K-itemsets, invoke a plurality of parallel processing tasks, and to invoke an aggregation task after the plurality of parallel processing tasks have finished;

processing task execution modules in one-to-one correspondence with the plurality of parallel processing tasks, configured to execute the processing tasks, which comprises: obtaining the corresponding portion of the transaction data set and counting the occurrences of the (K+1)-itemsets in that portion;

an aggregation task execution module, configured to execute the aggregation task, which comprises: aggregating the results of all processing tasks to obtain the frequency counts of the (K+1)-itemsets over the transaction data set, generating from those frequency counts the frequent (K+1)-itemsets that meet the support requirement, and, when it is determined from the frequent (K+1)-itemsets that an association rule meeting the confidence requirement exists, outputting that association rule.

In the above embodiments of the invention, while the frequent (K+1)-itemsets are being generated from the frequent K-itemsets, a plurality of processing tasks executing in parallel each obtain a portion of the transaction data set and count the occurrences of the (K+1)-itemsets in that portion. The partial counts are then aggregated to obtain the frequency counts of the (K+1)-itemsets over the whole transaction data set, from which the frequent (K+1)-itemsets that meet the support requirement are generated and the association rules that meet the confidence requirement are output. Because multiple processing tasks execute in parallel, the processing efficiency of association rule mining is improved compared with the prior art.

Description of drawings

Fig. 1 is a schematic flowchart of parallel association rule mining in an embodiment of the invention;

Fig. 2 is a schematic diagram of implementing the parallel association rule mining flow with the Map/Reduce mechanism in an embodiment of the invention;

Fig. 3 is a schematic structural diagram of the data mining system in an embodiment of the invention.

Detailed description

Embodiments of the invention are described in detail below with reference to the accompanying drawings.

In association rule mining, each set of frequent itemsets is generated from the frequent itemsets produced in the previous pass.

Referring to Fig. 1, the association rule mining flow provided by an embodiment of the invention comprises:

Step 101: generate the frequent k-itemsets;

Step 102: use the frequent k-itemsets to generate the frequent (k+1)-itemsets that meet the support requirement, and, when it is determined from the frequent (k+1)-itemsets that an association rule meeting the confidence requirement exists, output that association rule; preferably, the results may be output to a distributed file system for storage;

Step 103: determine whether a termination condition is satisfied; if so, end the flow; otherwise, increment k and return to step 102 for the next iteration.

In step 103 of the above flow, the termination condition may comprise: the set maximum number of iterations is reached, or the number of output association rules reaches a set threshold, or the generated set of frequent (k+1)-itemsets is empty.
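The iteration of steps 101 to 103 can be sketched as a simple driver loop. This is a minimal sketch under stated assumptions: `next_level` is a hypothetical callable standing in for one pass of step 102, and the parameter names are illustrative rather than taken from the embodiment.

```python
def should_stop(iteration, max_iterations, rules_emitted, max_rules, frequent_next):
    """Return True if any of the three termination conditions of step 103 holds."""
    return (
        iteration >= max_iterations      # maximum number of iterations reached
        or rules_emitted >= max_rules    # rule-count threshold reached
        or not frequent_next             # no frequent (k+1)-itemsets were generated
    )


def run_iterations(frequent_1, next_level, max_iterations, max_rules):
    """Drive steps 101-103.

    `next_level(frequent_k)` is a hypothetical callable that performs step 102
    and returns a pair (frequent_k_plus_1, rules).
    """
    frequent_k, all_rules, iteration = frequent_1, [], 0
    while True:
        frequent_next, rules = next_level(frequent_k)
        all_rules.extend(rules)
        iteration += 1
        if should_stop(iteration, max_iterations, len(all_rules), max_rules, frequent_next):
            return all_rules
        frequent_k = frequent_next  # step 103: increment k and repeat step 102
```

Keeping the termination test in its own function makes it straightforward to add or drop conditions without touching the loop.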

The process in step 102 of using the frequent k-itemsets to generate the frequent (k+1)-itemsets that meet the support requirement can be implemented with the Map/Reduce mechanism. Map/Reduce is a programming model for distributed processing of massive data sets; with it, a program can be automatically distributed to a very large cluster of commodity machines and executed concurrently. The process of generating the frequent (k+1)-itemsets with Map/Reduce may be as shown in Fig. 2.

Fig. 2 is a schematic flowchart of parallel association rule mining implemented with the Map/Reduce mechanism in an embodiment of the invention. Taking a shopping basket application as an example, let I = {i1, i2, ...} be the set of commodities, D = {T1, T2, ...} the set of shopping lists, min_sup the minimum support and min_conf the minimum confidence. As shown in the figure, the flow for mining association rules with a maximum of k iterations comprises:

Generate, from set D, the frequent 1-itemsets whose support is greater than or equal to min_sup. In this step, the frequent 1-itemsets can be generated by scanning set D. An itemset is a set of commodities, i.e. a subset of I. A 1-itemset contains exactly one commodity (e.g. i1). The support of an itemset is the number of transactions in D in which the itemset appears divided by the total number of transactions in D (e.g. if the itemset {i1} appears in 30 of the 100 transactions in D, its support is 30%). If the support threshold is 20%, this 1-itemset is output as a frequent 1-itemset.
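The scan that produces the frequent 1-itemsets might look like the following minimal sketch. The function name and the representation of transactions as Python sets are assumptions for illustration.

```python
from collections import Counter


def frequent_1_itemsets(transactions, min_sup):
    """Scan D once and return {1-itemset: support} for supports >= min_sup."""
    # set(t) ensures an item is counted at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {
        frozenset([item]): count / n
        for item, count in counts.items()
        if count / n >= min_sup
    }


# Made-up data: i1 appears in 30 of 100 shopping lists, as in the example above.
D = [{"i1", "i2"}] * 30 + [{"i2"}] * 60 + [{"i3"}] * 10
frequent_1 = frequent_1_itemsets(D, min_sup=0.2)
```

Here i1 (support 30%) and i2 (support 90%) pass the 20% threshold, while i3 (support 10%) does not.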

Generate 2-itemsets from the frequent 1-itemsets; a 2-itemset contains two commodities (e.g. {i2, i3}). Because not every possible 2-itemset needs to be counted, pruning may be applied.
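The embodiment does not specify how pruning is done. As one possibility, the sketch below uses the classical Apriori rule that a (k+1)-itemset can be frequent only if all of its k-item subsets are frequent; both the rule's use here and the function name are assumptions, not the patent's stated procedure.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Join frequent k-itemsets into (k+1)-itemsets, pruning by the subset rule.

    Assumes every element of `frequent_k` is a frozenset of the same size k.
    """
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != len(a) + 1:
                continue  # a and b do not differ by exactly one item
            # Prune: every k-item subset of the candidate must itself be frequent.
            if all(frozenset(s) in frequent_k for s in combinations(union, len(a))):
                candidates.add(union)
    return candidates
```

For example, from the frequent 2-itemsets {i1, i2}, {i1, i3}, {i2, i3} and {i2, i4}, only {i1, i2, i3} survives, because {i1, i4} and {i3, i4} are not frequent.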

Generate a plurality of parallel Map tasks and a Reduce task. Each Map task is responsible for obtaining its corresponding portion of set D and counting the occurrences of the 2-itemsets in that portion. The Reduce task is responsible for aggregating the results of all Map tasks to obtain the frequency counts of the 2-itemsets in set D (for example, the number of shopping lists in D in which i1 and i2 both appear is the frequency count of the 2-itemset {i1, i2} in D), generating from those frequency counts the frequent 2-itemsets that meet the support requirement, and outputting an association rule when it is determined from the frequent 2-itemsets that a rule meeting the confidence requirement exists.

The Map tasks are executed in parallel, and each Map task performs the following:

Obtain the data in the corresponding range of set D according to the row-offset range assigned to the task. Specifically: read the data of set D within the preset range of the row offset key, and convert the data read into <key, value> pairs, where key is the identifier assigned to the data read by the Map task and value is the content of the data read. Then, from the <key, value> pairs read, count the occurrences of the 2-itemsets and output the result as new <key, value> pairs, where key is a 2-itemset and value is its counted frequency.
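A single Map task as described above can be sketched in plain Python, without any Map/Reduce framework. The sketch assumes that the data set is a list of text lines with space-separated items and that the row offset is simply the line index; a real framework may use a different key (for example a byte offset). All names are illustrative.

```python
from collections import Counter


def read_split(dataset, start, end):
    """Yield <key, value> pairs: key is the line offset, value is the line content."""
    for offset in range(start, end):
        yield offset, dataset[offset]


def map_task(dataset, start, end, candidates):
    """Count, within lines [start, end), how often each candidate itemset occurs.

    Returns a list of new <key, value> pairs: (itemset, local count).
    """
    counts = Counter()
    for _offset, line in read_split(dataset, start, end):
        items = frozenset(line.split())
        for candidate in candidates:
            if candidate <= items:  # every item of the candidate is in the line
                counts[candidate] += 1
    return list(counts.items())
```

Each Map task only sees its own offset range, so the counts it emits are partial and must still be summed per itemset.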

Execute the Reduce task. The Reduce task adds up the values of all <key, value> pairs output by the Map tasks that share the same key, obtaining the frequency count of each 2-itemset in set D. It then computes the support of each 2-itemset from its frequency count; for example, the support of {i2, i3} is the ratio of the number of shopping lists in which i2 and i3 both appear to the total number of shopping lists. 2-itemsets whose support is less than min_sup are deleted and those whose support is greater than or equal to min_sup are kept, which yields the frequent 2-itemsets. The Reduce task may also determine, from the frequent 2-itemsets obtained, whether there is an association rule whose confidence is greater than or equal to min_conf, and if so, output that rule. For example, when the probability P(i3|i2) that i3 also appears in a shopping list containing i2 is greater than or equal to min_conf, the association rule i2 ⇒ i3 is output.
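The Reduce step for 2-itemsets can be sketched as follows. The sketch assumes that the per-item counts from the 1-itemset pass are available as `item_counts`, which the embodiment does not state explicitly; the function names are likewise illustrative.

```python
from collections import defaultdict


def reduce_task(map_outputs, n_transactions, min_sup):
    """Sum values that share a key, then keep itemsets with support >= min_sup."""
    totals = defaultdict(int)
    for output in map_outputs:
        for itemset, count in output:
            totals[itemset] += count
    return {s: c for s, c in totals.items() if c / n_transactions >= min_sup}


def rules_from_pairs(frequent_2, item_counts, min_conf):
    """For each frequent 2-itemset {x, y}, emit x => y when P(y|x) >= min_conf.

    `item_counts` maps each single item to its count in D (assumed to be
    available from the earlier 1-itemset pass).
    """
    rules = []
    for itemset, pair_count in frequent_2.items():
        for x in itemset:
            (y,) = itemset - {x}
            conf = pair_count / item_counts[x]
            if conf >= min_conf:
                rules.append((x, y, conf))
    return sorted(rules)
```

Confidence is computed as count({x, y}) / count(x), which equals P(y|x) because both counts are taken over the same set D.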

Determine whether the current number of iterations has reached k. If so, end the flow; if not, perform the next iteration, i.e. use the frequent 2-itemsets to generate the frequent 3-itemsets, and so on until the termination condition is satisfied.

There may be one Reduce task or several in the above flow. If there are several, they can be executed in parallel, and each Reduce task can look up, among the results of all Map tasks, the <key, value> pairs that share the same key and aggregate them.
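With several Reduce tasks, all pairs that share a key must reach the same Reduce task. The embodiment does not say how keys are routed; the sketch below assumes a hash partition over a canonical string form of the itemset, using a CRC32 checksum so the routing is stable across runs. All names are illustrative.

```python
import zlib
from collections import defaultdict


def partition(itemset, n_reducers):
    """Route an itemset to a reducer index via a stable hash of its sorted items."""
    key = ",".join(sorted(itemset)).encode("utf-8")
    return zlib.crc32(key) % n_reducers


def shuffle(map_outputs, n_reducers):
    """Group Map outputs by key into one bucket per Reduce task."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for output in map_outputs:
        for itemset, count in output:
            buckets[partition(itemset, n_reducers)][itemset].append(count)
    return buckets


def reduce_bucket(bucket):
    """One Reduce task: sum the values of the pairs that share the same key."""
    return {itemset: sum(counts) for itemset, counts in bucket.items()}
```

Because each key lands in exactly one bucket, the buckets can be reduced independently and in parallel, and their results merged without conflicts.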

In the above flow, a plurality of Map tasks executing in parallel count the occurrences of the (K+1)-itemsets in their portions of the transaction database, and the results of all Map tasks are then aggregated to obtain the frequency counts of the (K+1)-itemsets, so that the data processing is carried out by multiple processing tasks in parallel.

In the above flow, preferably, the Map tasks may be distributed to a plurality of execution nodes, and one or more Map tasks may be assigned to a node according to the node's load. During processing, each execution node executes its assigned Map tasks in parallel with the other nodes; if an execution node has been assigned several Map tasks, those Map tasks also execute in parallel on that node.
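The embodiment leaves the load-based assignment policy open. One simple possibility, shown here purely as an assumption, is to give each Map task to the currently least-loaded node, where load is measured as a hypothetical count of tasks already on the node.

```python
import heapq


def assign_map_tasks(task_ids, node_loads):
    """Assign each task to the least-loaded node (greedy; illustrative policy).

    `node_loads` maps node name -> current load, here a hypothetical count of
    tasks already running on that node.
    """
    heap = [(load, node) for node, load in node_loads.items()]
    heapq.heapify(heap)
    assignment = {node: [] for node in node_loads}
    for task in task_ids:
        load, node = heapq.heappop(heap)   # node with the smallest load
        assignment[node].append(task)
        heapq.heappush(heap, (load + 1, node))
    return assignment
```

A node that starts out idle may therefore receive several Map tasks before a busy node receives any, which matches the case where one execution node runs several Map tasks.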

Based on the same technical concept, an embodiment of the invention provides an association rule mining system.

Referring to Fig. 3, a schematic structural diagram of the association rule mining system provided by an embodiment of the invention, the system comprises: a calling module 31, a plurality of processing task execution modules 32 (only three are shown in the figure) and an aggregation task execution module 33, and may further comprise a judging module 34, wherein:

The calling module 31 is configured to, after (K+1)-itemsets have been generated from the frequent K-itemsets, invoke a plurality of parallel processing tasks, and to invoke an aggregation task after the plurality of parallel processing tasks have finished;

The processing task execution modules 32 correspond one-to-one to the plurality of parallel processing tasks and are configured to execute the processing tasks, which comprises: obtaining the corresponding portion of the transaction data set and counting the occurrences of the (K+1)-itemsets in that portion;

The aggregation task execution module 33 is configured to execute the aggregation task, which comprises: aggregating the results of all processing tasks to obtain the frequency counts of the (K+1)-itemsets over the transaction data set, computing the support of each itemset therein from those frequency counts, forming the frequent (K+1)-itemsets from the itemsets whose support is greater than or equal to the support threshold, and, when it is determined from the frequent (K+1)-itemsets that an association rule meeting the confidence requirement exists, outputting that association rule.
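The interaction of modules 31, 32 and 33 can be sketched on a single machine, with a thread pool standing in for the distributed execution nodes. This is an illustrative assumption: the function names mirror the module names but are not taken from the embodiment, and rule output is omitted for brevity.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def processing_task(partition, candidates):
    """Module 32: count candidate itemsets within one portion of the data set."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    return counts


def aggregation_task(partial_counts, n_transactions, min_sup):
    """Module 33: sum the partial counts and keep itemsets meeting min_sup."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts key by key
    return {s: n for s, n in total.items() if n / n_transactions >= min_sup}


def calling_module(partitions, candidates, min_sup):
    """Module 31: run the processing tasks in parallel, then the aggregation task.

    Assumes `partitions` is a non-empty list of lists of frozensets.
    """
    n = sum(len(p) for p in partitions)
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = list(pool.map(lambda p: processing_task(p, candidates), partitions))
    # Leaving the `with` block guarantees all processing tasks have finished.
    return aggregation_task(partial, n, min_sup)
```

The aggregation task is only invoked after every processing task has returned, mirroring the ordering required of the calling module.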

The system may use the Map/Reduce mechanism. In that case, the processing task execution module 32 may be a Map task execution module which, when processing, reads the data in the corresponding range of the transaction data set according to the row-offset range assigned to it and converts the data read into <key, value> pairs, where key is the identifier assigned to the data read by the Map task and value is the content of the data read; it then counts the occurrences of the (K+1)-itemsets from these <key, value> pairs and outputs the result as new <key, value> pairs, where key is a (K+1)-itemset and value is its counted frequency. The aggregation task execution module 33 may be a Reduce task execution module which, during processing, obtains the <key, value> pairs output by all Map tasks and adds up the values of the pairs that share the same key, obtaining the frequency counts of the (K+1)-itemsets.

The judging module 34 is configured to end the association rule mining flow if, after the frequent (K+1)-itemsets have been generated, it determines that a termination condition is satisfied. For example, the judging module 34 ends the flow when it determines that the maximum number of iterations has been reached, or that the number of output association rules exceeds the rule-count threshold, or that the generated set of frequent (K+1)-itemsets is empty.

It should be noted that embodiments of the invention can be applied to implementations of the Apriori algorithm and of other similar algorithms.

As can be seen from the above description, embodiments of the invention implement a parallel association rule mining method based on Map/Reduce. Compared with the prior art, the technical effects include:

(1) Improved algorithm efficiency. To address the serial nature of the classic Apriori algorithm, the main work of the algorithm (computing the frequent itemsets) is done with the Map/Reduce mechanism, which effectively improves performance in association rule mining over massive data. Parallelizing the frequent-itemset computation, which is both costly and highly parallelizable, yields high parallel efficiency and speed-up.

(2) Improved storage capacity. A distributed file system is used to solve technical problems such as efficient storage, redundant backup, load balancing and concurrent access in association rule mining over massive data.

(3) Larger computing scale and better scalability. A cluster environment combining Map/Reduce with a distributed file system (DFS) provides a solid computing platform for large-scale parallel data mining and scales well; the number of deployed nodes is expected to reach about 256. The large increase in computing scale helps resolve many bottlenecks in massive data mining, and can further improve mining results and practicality.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to cover them as well.

Claims

1. An association rule mining method, characterized by comprising:

generating (K+1)-itemsets from the frequent K-itemsets;

2. The method of claim 1, characterized in that the processing task is a Map task;

each processing task obtaining its corresponding portion of the transaction data set and counting the occurrences of the (K+1)-itemsets in that portion is specifically:

each Map task reads the data in the corresponding range of the transaction data set according to the row-offset range assigned to it, and converts the data read into <key1, value1> pairs, where key1 is the identifier assigned to the data read by the Map task and value1 is the content of the data read; and counts the occurrences of the (K+1)-itemsets in the <key1, value1> pairs and outputs the result as <key2, value2> pairs, where key2 is a (K+1)-itemset and value2 is its counted frequency.

3. The method of claim 2, characterized in that aggregating the results of all Map tasks to obtain the frequency counts of the (K+1)-itemsets over the transaction data set is specifically:

by executing a Reduce task, obtaining the <key2, value2> pairs output by all Map tasks and adding up the value2 values of the <key2, value2> pairs that share the same key2, to obtain the frequency counts of the (K+1)-itemsets over the transaction data set.

4. The method of claim 1, characterized in that generating, from the frequency counts of the (K+1)-itemsets, the frequent (K+1)-itemsets that meet the support requirement is specifically:

computing the support of each itemset therein from the frequency counts of the (K+1)-itemsets, and forming the frequent (K+1)-itemsets from the itemsets whose support is greater than or equal to the support threshold.

5. The method of claim 1, characterized in that, after the frequent (K+1)-itemsets are generated, the method further comprises: ending the association rule mining flow if a termination condition is satisfied.

6. The method of claim 5, characterized in that the termination condition being satisfied comprises:

the maximum number of iterations is reached; or the number of output association rules exceeds the rule-count threshold; or the generated set of frequent (K+1)-itemsets is empty.

7. The method of claim 1, characterized in that the processing tasks are assigned to a plurality of execution nodes for execution, wherein one execution node executes one or more processing tasks.

8. An association rule mining system, characterized by comprising:

9. The system of claim 8, characterized in that the processing task execution module is a Map task execution module, and the Map task execution module is further configured to read the data in the corresponding range of the transaction data set according to the row-offset range assigned to it, and to convert the data read into <key1, value1> pairs, where key1 is the identifier assigned to the data read by the Map task and value1 is the content of the data read; and to count the occurrences of the (K+1)-itemsets in the <key1, value1> pairs and output the result as <key2, value2> pairs, where key2 is a (K+1)-itemset and value2 is its counted frequency.

10. The system of claim 9, characterized in that the aggregation task execution module is a Reduce task execution module, and the Reduce task execution module is further configured to obtain the <key2, value2> pairs output by all Map tasks and to add up the value2 values of the <key2, value2> pairs that share the same key2, to obtain the frequency counts of the (K+1)-itemsets over the transaction data set.

11. The system of claim 8, characterized in that the aggregation task execution module is further configured to compute the support of each itemset therein from the frequency counts of the (K+1)-itemsets, and to form the frequent (K+1)-itemsets from the itemsets whose support is greater than or equal to the support threshold.

12. The system of claim 8, characterized by further comprising:

a judging module, configured to end the association rule mining flow if, after the frequent (K+1)-itemsets have been generated, it determines that a termination condition is satisfied.

13. The system of claim 12, characterized in that the judging module is further configured to end the association rule mining flow when it determines that the maximum number of iterations has been reached, or that the number of output association rules exceeds the rule-count threshold, or that the generated set of frequent (K+1)-itemsets is empty.