CN105447134B

CN105447134B - Optimization method for a frequent itemset mining algorithm

Info

Publication number: CN105447134B
Application number: CN201510806032.5A
Authority: CN
Inventors: 李磊
Original assignee: CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Current assignee: CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Priority date: 2015-11-20
Filing date: 2015-11-20
Publication date: 2019-03-08
Anticipated expiration: 2035-11-20
Also published as: CN105447134A

Abstract

The invention discloses an optimization method for a frequent itemset mining algorithm, comprising: receiving data; for the received data, traversing the itemset tree using preorder traversal so as to arrange the itemsets; and performing a parent-child set comparison on adjacent itemsets in the arranged itemsets, and merging the itemsets whose comparison result is a proper-subset and superset relationship. Compared with existing frequent itemset mining algorithms, the method extracts proper subsets. Its main advantages are that extracting proper subsets reduces the data volume, reduces the computation performed on the data and the size of data storage, and, by effectively reducing computation on invalid itemsets, prevents repeated computation on duplicate data.

Description

Optimization method for a frequent itemset mining algorithm

Technical field

The present invention relates to the field of data processing, and in particular to an optimization method for a frequent itemset mining algorithm.

Background art

A frequent itemset mining algorithm is used to mine sets of items that often occur together (called frequent itemsets). Once these frequent itemsets have been mined, whenever a transaction contains one item of a frequent itemset, the other items of that frequent itemset can be offered as recommendations.
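The recommendation use described above can be illustrated with a minimal Python sketch. The function name `recommend`, the item names and the sample itemsets are hypothetical and chosen only for illustration; the patent does not prescribe a data layout.

```python
def recommend(transaction, frequent_itemsets):
    """Suggest items that share a frequent itemset with the transaction.

    Hypothetical helper: itemsets are modelled as Python frozensets.
    """
    basket = set(transaction)
    suggestions = set()
    for itemset in frequent_itemsets:
        # The itemset is relevant if the transaction already holds one of its items.
        if basket & itemset:
            # Recommend the remaining items of that itemset.
            suggestions |= itemset - basket
    return suggestions


# Illustrative data only.
frequent_itemsets = [
    frozenset({"milk", "bread"}),
    frozenset({"beer", "chips", "salsa"}),
]
print(sorted(recommend(["chips"], frequent_itemsets)))
```

A transaction that shares no item with any frequent itemset yields an empty set of suggestions in this sketch.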

There are two common classes of frequent itemset mining algorithms: one is Apriori, the other is FP-Growth. FP-Growth is an optimization based on Apriori. Its biggest breakthrough relative to Apriori is that it reduces the number of passes over the data. Apriori needs K-1 computation passes to compute the frequent itemsets, where K is the number of items in a frequent itemset, whereas FP-Growth, by constructing an FP-tree, needs to traverse the data only twice to complete the computation of the frequent itemsets.

With the development of informatization, data has grown explosively and its complexity has increased greatly. Although technologies such as Hadoop, Spark and FP-Growth can shorten the computation time of frequent itemsets and the number of passes over the data, data from different sources increases both the number of frequent itemsets and the number of invalid frequent itemsets. In real projects the results are inaccurate, and wrong recommendations are often produced. Invalid data also increases the size of the frequent itemset result, so that the performance and cost of the project cannot meet requirements.

Summary of the invention

In view of the above problems, the object of the present invention is to propose an optimization method for a frequent itemset mining algorithm, so as to achieve the advantages of reducing the data volume and reducing the data computation process and data storage.

To achieve the above object, the present invention adopts the following technical solution:

An optimization method for a frequent itemset mining algorithm, comprising:

receiving data;

for the received data, traversing the itemset tree using preorder traversal so as to arrange the itemsets;

performing a parent-child set comparison on adjacent itemsets in the arranged itemsets, and merging the itemsets whose comparison result is a proper-subset and superset relationship;

wherein "itemset" is short for "frequent itemset".

Preferably, the content compared in the parent-child set comparison includes the subset relationship between the itemsets and the support of the itemsets.

Preferably, the subset relationship between the itemsets is, more specifically:

Assume two itemsets, itemset A and itemset B. If every item in itemset A is contained in itemset B, then itemset A is considered to belong to itemset B, and itemset A is a subset of itemset B.
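As a small illustration of this containment test, the sketch below uses Python's built-in set comparison. The function name `is_subset` and the item labels are assumptions made for the example, not part of the patent.

```python
def is_subset(a, b):
    """Return True if every item of itemset a is also contained in itemset b."""
    # Converting to sets lets the inputs be lists, tuples or sets.
    return set(a) <= set(b)


# Illustrative itemsets only.
print(is_subset(["x"], ["x", "y"]))
print(is_subset(["x", "z"], ["x", "y"]))
```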

Preferably, the support of the itemsets is, more specifically:

Assume two itemsets, itemset A and itemset B. Put simply, the support of an itemset is the number of times the items in that itemset occur together in the data. If the support (frequency) of itemset A equals the support of itemset B and itemset A is a subset of itemset B, then itemset A is a proper subset of itemset B; if itemset A is a subset of itemset B but their supports differ, then itemset A is a subset of itemset B but not a proper subset.
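The two-part test can be sketched in Python as follows. Note that the text uses "proper subset" in its own sense (a subset with equal support), which differs from the usual set-theoretic meaning. The function names `support` and `relation`, the returned labels and the sample transactions are all hypothetical.

```python
def support(itemset, transactions):
    """Count the transactions in which all items of the itemset occur together."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def relation(a, b, transactions):
    """Classify itemset a relative to itemset b using the definitions in the text."""
    if not set(a) <= set(b):
        return "unrelated"
    if support(a, transactions) == support(b, transactions):
        # Subset with equal support: a "proper subset" in the text's sense.
        return "proper subset"
    return "subset only"


# Illustrative transactions only.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "d"},
]
print(relation({"b", "c"}, {"a", "b", "c"}, transactions))
print(relation({"a"}, {"a", "b", "c"}, transactions))
```

In this sample, `{"b", "c"}` occurs in exactly the same two transactions as `{"a", "b", "c"}`, whereas `{"a"}` occurs in a third transaction as well, so only the former qualifies for merging.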

The technical solution of the present invention has the following advantages:

Compared with existing frequent itemset mining algorithms, the technical solution of the present invention extracts proper subsets. Its main advantages are that extracting proper subsets reduces the data volume, reduces the computation performed on the data and the size of data storage, and, by effectively reducing computation on invalid itemsets, prevents repeated computation on duplicate data. When the algorithm is used for recommendation, it avoids recommending invalid products and can effectively improve the user experience. Using proper subsets both saves cost and improves performance and user experience.

The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.

Brief description of the drawings

Fig. 1 is a flow chart of the optimization method for a frequent itemset mining algorithm according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the data structure of the frequent itemsets according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the data structure of frequent itemsets that can be merged according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the data structure of frequent itemsets that can be partially merged according to an embodiment of the present invention.

Detailed description of the embodiments

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein serve only to illustrate and explain the present invention and are not intended to limit it.

As shown in Fig. 1, an optimization method for a frequent itemset mining algorithm comprises:

receiving data;

wherein "itemset" is short for "frequent itemset".

Preferably, the support of the itemsets is, more specifically:

As shown in Fig. 2, the frequent itemset result set contains three columns of data. The second and third columns are subsets of the first column, and the support of all three columns of frequent itemsets is 10. In this case the first, second and third columns come from the same data source, which means that the second and third columns are proper subsets of the first column. There is no need to compute them as three separate columns; they can be merged into a single column, as shown in Fig. 3.

As shown in Fig. 3, in the computed frequent itemset result, the next column of data is selected and compared with the current column in terms of support and parent-child subset relationship. If the next column is a proper subset of the current column, the two columns are merged into one, and the third column is then compared. If the third column still has a parent-child subset relationship with the first column, the three columns are merged into one, and the comparison continues in order. If the third column and the first column are not in a parent-child relationship, as shown in Fig. 4, the first and second columns are merged and the comparison continues from the third column.
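The column-by-column merge described for Figs. 2 to 4 can be sketched in Python as a single pass over the ordered result. This is a minimal sketch, assuming that each column is an (itemset, support) pair and that the superset column comes first, as in the figures; the function name `merge_adjacent`, the item labels and the support value 7 are hypothetical, while the support value 10 is taken from the text.

```python
def merge_adjacent(columns):
    """Merge each column into the current column when it is a subset with equal support.

    columns: list of (itemset, support) pairs, already ordered so that
    related itemsets are adjacent (assumption of this sketch).
    """
    merged = []
    for itemset, sup in columns:
        items = frozenset(itemset)
        if merged:
            current_items, current_sup = merged[-1]
            # "Proper subset" in the text's sense: subset and equal support.
            if items <= current_items and sup == current_sup:
                continue  # absorbed into the current column
        # Otherwise this column starts a new current column.
        merged.append((items, sup))
    return merged


# Fig. 2 -> Fig. 3 style case: all three columns collapse into one.
full_merge = merge_adjacent([
    ({"a", "b", "c"}, 10),
    ({"a", "b"}, 10),
    ({"a", "c"}, 10),
])

# Fig. 4 style case: the third column is unrelated, so comparison restarts from it.
partial_merge = merge_adjacent([
    ({"a", "b", "c"}, 10),
    ({"a", "b"}, 10),
    ({"d", "e"}, 7),
])
print(len(full_merge), len(partial_merge))
```

Because only the most recent surviving column is compared with the next one, the pass is linear in the number of columns, which relies on related itemsets having been placed next to each other beforehand.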

Merging a proper subset and its superset into one column reduces duplicate data in terms of data volume, and reduces the number of computations when the frequent itemsets are used. In terms of data accuracy, it reduces the repeated use of frequent itemsets that come from the same data source, which improves data accuracy.

Parent-child set comparison: the comparison has two parts. The first is the subset relationship between frequent itemsets. An itemset is a set of frequent items; if every item in frequent itemset A is contained in frequent itemset B, then itemset A is considered to belong to itemset B, and itemset A is a subset of itemset B. The second is the comparison of support. Put simply, the support of a frequent itemset is the number of times the items in that itemset occur together in the data. If the support (frequency) of itemset A equals the support of itemset B and itemset A is a subset of itemset B, then itemset A is a proper subset of itemset B; if itemset A is a subset of itemset B but their supports differ, then itemset A is a subset of itemset B but not a proper subset.

Selecting itemsets for the parent-child set comparison: when frequent itemsets are selected for the parent-child set comparison, only two adjacent itemsets need to be compared. Because the itemset tree is traversed in preorder, itemsets that may have a parent-child relationship are arranged next to each other.
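Why preorder traversal places related itemsets side by side can be seen in a small Python sketch. The text does not specify the layout of the itemset tree, so this sketch assumes a prefix tree in which each node adds one item to its parent's itemset; the `Node` class, the `preorder` generator and the sample tree are hypothetical. In this assumed layout the smaller itemset is visited before the larger one along a branch.

```python
class Node:
    """One node of a hypothetical itemset tree: an item, a support, and children."""

    def __init__(self, item, support, children=None):
        self.item = item
        self.support = support
        self.children = children or []


def preorder(node, prefix=()):
    """Yield (itemset, support) for each node, visiting a parent before its children."""
    path = prefix + (node.item,)
    yield frozenset(path), node.support
    for child in node.children:
        yield from preorder(child, path)


# Illustrative tree: a -> (b -> c), d
tree = Node("a", 10, [Node("b", 10, [Node("c", 10)]), Node("d", 4)])
ordered = list(preorder(tree))
for itemset, sup in ordered:
    print(sorted(itemset), sup)
```

Along the branch a, b, c each itemset is immediately followed by an itemset that contains it, so a comparison restricted to neighbours still encounters those parent-child pairs.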

Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. An optimization method for a frequent itemset mining algorithm, characterized by comprising:

receiving data;

performing a parent-child set comparison on adjacent itemsets in the arranged itemsets, and merging the itemsets whose comparison result is a proper-subset and superset relationship;

wherein "itemset" is short for "frequent itemset";

the parent-child set comparison has two parts; the first is the subset relationship between frequent itemsets, an itemset being a set of frequent items: if every item in frequent itemset A is contained in frequent itemset B, then itemset A is considered to belong to itemset B, and itemset A is a subset of itemset B;

the second is the comparison of support; put simply, the support of a frequent itemset is the number of times the items in that itemset occur together in the data; if the support (frequency) of itemset A equals the support of itemset B and itemset A is a subset of itemset B, then itemset A is a proper subset of itemset B; if itemset A is a subset of itemset B but their supports differ, then itemset A is a subset of itemset B but not a proper subset.