CN105447134A - Optimization method of a frequent item set mining algorithm

Info

Publication number: CN105447134A
Application number: CN201510806032.5A
Authority: CN
Inventors: 李磊
Original assignee: CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Current assignee: CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Priority date: 2015-11-20
Filing date: 2015-11-20
Publication date: 2016-03-30
Anticipated expiration: 2035-11-20
Also published as: CN105447134B

Abstract

The invention discloses an optimization method for a frequent item set mining algorithm. The method comprises the following steps: for received data, traversing an item set tree in preorder so that the item sets are arranged; then comparing adjacent item sets in the arranged sequence for a superset/subset relationship, and merging the item sets whose comparison result is a proper subset and its superset. Compared with existing frequent item set mining algorithms, the method adds the extraction of proper subsets. Its main advantages are that extracting proper subsets reduces the data volume, the amount of data computation and the data storage size, and that cutting down invalid item sets prevents repeated calculation on duplicated data.

Description

Optimization method of a frequent item set mining algorithm

Technical field

The present invention relates to the field of data processing, and in particular to an optimization method for a frequent item set mining algorithm.

Background

A frequent item set mining algorithm mines sets of items that often occur together (called frequent item sets). Once these frequent item sets have been mined, when one item of a frequent item set appears in a transaction, the other items of that frequent item set can be offered as recommendations.
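The recommendation use described above can be illustrated with a minimal sketch. The function name `recommend` and the sample item sets are hypothetical and serve only as an illustration; the source does not specify how recommendations are produced.

```python
def recommend(transaction, frequent_item_sets):
    """Suggest the remaining items of every frequent item set that
    shares at least one item with the transaction (illustrative only)."""
    basket = set(transaction)
    suggestions = set()
    for item_set in frequent_item_sets:
        if basket & item_set:                 # the transaction hits this item set
            suggestions |= item_set - basket  # offer the items not yet present
    return sorted(suggestions)


# Hypothetical mined result, used only for demonstration.
frequent_item_sets = [
    frozenset({"bread", "butter"}),
    frozenset({"tea", "sugar", "milk"}),
]

print(recommend(["bread"], frequent_item_sets))  # ['butter']
```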

There are two common classes of frequent item set mining algorithms: one is the Apriori algorithm, and the other is FPGrowth. FPGrowth was developed as an optimization of Apriori. Its biggest breakthrough relative to Apriori is that it reduces the number of passes over the data. Apriori needs K-1 rounds of calculation to compute the frequent item sets, where K is the size of the frequent item set, whereas FPGrowth builds an FP-tree and needs only 2 traversals of the data to complete the calculation of the frequent item sets.

With the development of informatization, data has grown explosively and its complexity has increased greatly. Although technologies such as Hadoop, Spark and FPGrowth can shorten the computing time of frequent item sets and reduce the number of passes over the data, data from different sources increases both the number of frequent item sets and the number of invalid frequent item sets. In real projects the results are inaccurate, and wrong recommendations are often made. Moreover, the invalid data inflates the frequent item set results, so that the performance and cost of the project cannot meet requirements.

Summary of the invention

The object of the invention is, in view of the above problems, to propose an optimization method for a frequent item set mining algorithm that reduces the data volume, the amount of data computation and the data storage.

To achieve the above object, the technical solution adopted by the invention is:

An optimization method for a frequent item set mining algorithm, comprising:

receiving data;

for the received data, traversing the item set tree in preorder, so that the item sets are arranged;

comparing adjacent item sets in the arranged sequence for a superset/subset relationship, and merging the item sets whose comparison result is a proper subset and its superset;

wherein "item set" is shorthand for "frequent item set".

Preferably, the superset/subset comparison covers the membership relation between the item sets and the support of the item sets.

Preferably, the membership relation between the item sets is specifically:

Given two item sets A and B, if every item in A is also contained in B, then A is considered to belong to B, and A is a subset of B.
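The membership test just described can be written directly as a minimal sketch. The function name `is_subset` is an assumption made for illustration; Python's built-in `set(a) <= set(b)` expresses the same check.

```python
def is_subset(a, b):
    """Return True when every item of item set A is also contained in item set B."""
    return all(item in b for item in a)


A = {"tea", "sugar"}
B = {"tea", "sugar", "milk"}

print(is_subset(A, B))  # True
print(is_subset(B, A))  # False
```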

Preferably, the support of the item sets is specifically:

Given two item sets A and B: the support of an item set is derived from the data; put simply, it is the number of times the items in the set occur together in the data. If the support of A equals the support of B and A is a subset of B, then A is a proper subset of B; if A is a subset of B but the supports differ, then A is a subset of B but not a proper subset.
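A minimal sketch of the support rule above, assuming the data is a list of transactions and each transaction is a set of items. The names `support` and `is_equal_support_subset` and the sample transactions are hypothetical; the second function implements the two-part condition stated in the text (A is a subset of B, and both have the same support).

```python
def support(item_set, transactions):
    """Count the transactions in which all items of the item set occur together."""
    items = set(item_set)
    return sum(1 for t in transactions if items <= set(t))


def is_equal_support_subset(a, b, transactions):
    """True when A is a subset of B and both item sets have the same support."""
    a, b = set(a), set(b)
    return a <= b and support(a, transactions) == support(b, transactions)


# Hypothetical transactions, used only for demonstration.
transactions = [
    {"tea", "sugar", "milk"},
    {"tea", "sugar", "milk"},
    {"tea", "bread"},
]

full = {"tea", "sugar", "milk"}
print(is_equal_support_subset({"sugar"}, full, transactions))  # True: both occur twice
print(is_equal_support_subset({"tea"}, full, transactions))    # False: supports differ
```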

The technical solution of the invention has the following beneficial effects:

Compared with existing frequent item set mining algorithms, the technical solution of the invention extracts proper subsets. The main advantages are that extracting proper subsets reduces the data volume, the amount of data computation and the data storage size, and that cutting down invalid item sets prevents repeated calculation on duplicated data. Consequently, when the algorithm is used for recommendation, invalid products are not recommended, which effectively improves the user experience. Using proper subsets both saves cost and improves performance and user experience.

The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.

Brief description of the drawings

Fig. 1 is a flowchart of the optimization method of the frequent item set mining algorithm according to an embodiment of the invention;

Fig. 2 is a schematic diagram of the data structure of the frequent item sets according to an embodiment of the invention;

Fig. 3 is a schematic diagram of the data structure of frequent item sets that can be merged according to an embodiment of the invention;

Fig. 4 is a schematic diagram of the data structure of frequent item sets that can be partially merged according to an embodiment of the invention.

Detailed description of the embodiments

The preferred embodiments of the invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.

As shown in Fig. 1, an optimization method for a frequent item set mining algorithm comprises:

receiving data;

wherein "item set" is shorthand for "frequent item set".

Preferably, the superset/subset comparison covers the membership relation between the item sets and the support of the item sets.

Preferably, the support of the item sets is specifically:

As shown in Fig. 2, the frequent item set result contains three columns of data. The second and third columns are subsets of the first column, and the support of all three columns is 10. In this case the first, second and third columns come from the same data source, which shows that the second and third columns are proper subsets of the first column. They need not be computed as three separate columns and can be merged into a single column, as shown in Fig. 3.

As shown in Fig. 3, in the frequent item set result, the next column is selected and compared with the current column for support and superset/subset relationship. If the next column is a proper subset of the current column, the 2 columns are merged into one, and the third column is compared next. If the third column is also in a superset/subset relationship with the first column, the three columns are merged into one, and the comparison continues in the same way. If the third column and the first column are not in a superset/subset relationship, as shown in Fig. 4, the first and second columns are merged and the comparison continues onward from the third column.
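The column-by-column merging described above can be sketched as a single pass over the arranged result. This is a minimal sketch under stated assumptions, not the patented implementation: each column is assumed to be an `(item_set, support)` pair, the function name `merge_adjacent` and the sample columns are hypothetical, and the sketch accepts containment in either direction so that it works whether the superset or the subset comes first in the arrangement.

```python
def merge_adjacent(columns):
    """Merge each column into the current group when it is nested with the
    group's representative item set and has the same support.

    columns: list of (item_set, support) pairs in arranged order.
    Returns a list of (representative_item_set, support, merged_members).
    """
    groups = []
    for items, sup in columns:
        items = frozenset(items)
        if groups:
            head_items, head_sup, members = groups[-1]
            nested = items <= head_items or head_items <= items
            if nested and sup == head_sup:
                # Keep the larger item set as the representative of the group.
                larger = max(head_items, items, key=len)
                groups[-1] = (larger, head_sup, members + [items])
                continue
        # Not mergeable: start a new group and continue comparing from here.
        groups.append((items, sup, [items]))
    return groups


# Hypothetical columns: the first three can be merged, the fourth cannot.
columns = [
    ({"a", "b", "c"}, 10),
    ({"a", "b"}, 10),
    ({"a"}, 10),
    ({"a", "d"}, 10),
]

for representative, sup, members in merge_adjacent(columns):
    print(sorted(representative), sup, len(members))
```

In this example the first three columns collapse into one group represented by the largest item set, while the fourth column starts a new group because it is not contained in, and does not contain, that representative.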

Merging a proper subset and its superset into the same column reduces duplicated data in the data volume and reduces the number of calculations when the frequent item sets are used. It also reduces the repeated use of frequent item sets that come from the same data source, which improves data accuracy.

Superset/subset comparison: the comparison has 2 parts. The first is the membership relation between frequent item sets. An item set is a set of frequent items; if every item in frequent item set A is also contained in frequent item set B, then A is considered to belong to B, and A is a subset of B. The second is the comparison of support. The support of a frequent item set is derived from the data; put simply, it is the number of times the items in the set occur together in the data. If the support of A equals the support of B and A is a subset of B, then A is a proper subset of B; if A is a subset of B but the supports differ, then A is a subset of B but not a proper subset.

Selecting item sets for the superset/subset comparison: when frequent item sets are selected for the comparison, only 2 adjacent sets need to be compared, because the preorder traversal of the item set tree places item sets that may be in a superset/subset relationship next to each other.
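The adjacency produced by the preorder traversal can be illustrated with a minimal sketch. The source does not define the layout of the item set tree, so the structure here is an assumption: a prefix tree in which each node carries one item and a support, and the path from the root to a node spells out one item set. The names `Node` and `preorder` and the sample tree are hypothetical.

```python
class Node:
    """One node of an assumed item set tree: an item, its support, its children."""

    def __init__(self, item, support, children=None):
        self.item = item
        self.support = support
        self.children = children or []


def preorder(node, prefix=()):
    """Yield (item_set, support) for the node first, then for each subtree."""
    path = prefix + (node.item,)
    yield frozenset(path), node.support
    for child in node.children:
        yield from preorder(child, path)


# Hypothetical tree: a -> (b -> c), a -> d
tree = Node("a", 10, [Node("b", 10, [Node("c", 10)]), Node("d", 4)])

arranged = list(preorder(tree))
for item_set, sup in arranged:
    print(sorted(item_set), sup)
```

Because a node is visited immediately before its first child, an item set and its one-item extension end up next to each other in `arranged`, which is why comparing only adjacent entries can find nested item sets.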

Finally, it should be noted that the foregoing is only a preferred embodiment of the invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, a person skilled in the art can still modify the technical solutions described in them or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. An optimization method for a frequent item set mining algorithm, characterized by comprising:

receiving data;

wherein "item set" is shorthand for "frequent item set".

2. The optimization method for a frequent item set mining algorithm according to claim 1, characterized in that the superset/subset comparison covers the membership relation between the item sets and the support of the item sets.

3. The optimization method for a frequent item set mining algorithm according to claim 2, characterized in that the membership relation between the item sets is specifically:

4. The optimization method for a frequent item set mining algorithm according to claim 3, characterized in that the support of the item sets is specifically: