CN105447134A - Optimization method of a frequent item set mining algorithm - Google Patents

Publication number
CN105447134A
Authority
CN
China
Prior art keywords
item
item collection
collection
frequent
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510806032.5A
Other languages
Chinese (zh)
Other versions
CN105447134B (en)
Inventor
李磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Original Assignee
CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Priority to CN201510806032.5A
Publication of CN105447134A
Application granted
Publication of CN105447134B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2453: Query optimisation

Abstract

The invention discloses an optimization method for a frequent item set mining algorithm. The method comprises the following steps: for received data, traversing an item set tree using preorder traversal, thereby arranging the item sets; performing a parent-child set comparison on adjacent item sets among the arranged item sets; and merging item sets whose comparison result is a proper subset and superset relationship. Compared with existing frequent item set mining algorithms, the method adds the extraction of proper subsets. Its main advantages are that extracting proper subsets reduces the data volume, the computation performed on the data and the data storage, and that reducing invalid item sets prevents repeated computation on duplicate data.

Description

Optimization method for a frequent item set mining algorithm
Technical field
The present invention relates to the field of data processing and, in particular, to an optimization method for a frequent item set mining algorithm.
Background
A frequent item set mining algorithm mines sets of items that often occur together (called frequent item sets). Once these frequent item sets have been mined, when one item of a frequent item set appears in a transaction, the other items of that frequent item set can be offered as recommendations.
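The recommendation use described above can be illustrated with a minimal Python sketch. The function name `recommend` and the sample item names are hypothetical and chosen only for illustration; the patent does not prescribe any particular implementation.

```python
def recommend(basket, frequent_itemsets):
    """Suggest the remaining items of every frequent item set the basket touches."""
    basket = set(basket)
    suggestions = set()
    for itemset in frequent_itemsets:
        # The item set is relevant when the basket already holds one of its items.
        if basket & itemset:
            suggestions |= itemset - basket
    return suggestions


# Hypothetical mined result: two frequent item sets.
frequent_itemsets = [
    frozenset({"bread", "butter", "jam"}),
    frozenset({"tea", "milk"}),
]
result = recommend({"bread"}, frequent_itemsets)
```

Here a transaction containing only `bread` would be offered the other items of the item set it belongs to, while a transaction with no item from any frequent item set receives no suggestion.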
Common frequent item set mining algorithms fall into two classes: one is the Apriori algorithm, the other is FP-Growth. FP-Growth was developed as an optimization of the Apriori algorithm. Its biggest breakthrough relative to Apriori is that it reduces the number of passes over the data. To compute frequent item sets, Apriori needs K-1 rounds of computation, where K is the size of the frequent set, whereas FP-Growth, by building an FP-tree, needs to traverse the data only twice to complete the computation of the frequent item sets.
With the development of informatization, data grows explosively and its complexity increases greatly. Although technologies such as Hadoop, Spark and FP-Growth can shorten the computing time of frequent item sets and reduce the number of passes over the data, data from different sources increases both the number of frequent item sets and the number of invalid frequent item sets. In real projects the practical results are inaccurate, and wrong recommendations are often produced. Moreover, the invalid data enlarges the frequent item set results, so that the performance and cost of the project cannot meet requirements.
Summary of the invention
In view of the above problems, the object of the present invention is to propose an optimization method for a frequent item set mining algorithm that reduces the data volume, the amount of data computation and the data storage.
To achieve the above object, the present invention adopts the following technical solution:
An optimization method for a frequent item set mining algorithm, comprising:
receiving data;
for the received data, traversing the item set tree using preorder traversal, thereby arranging the item sets;
performing a parent-child set comparison on adjacent item sets among the arranged item sets, and merging item sets whose comparison result is a proper subset and superset relationship;
wherein "item set" is short for "frequent item set".
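The arranging step can be sketched in Python as follows. The patent does not specify the layout of the item set tree, so this sketch assumes a prefix-tree shape in which each node carries one item and a support count, and the item set of a node is the set of items on the path from the root. The names `Node` and `preorder_itemsets` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Assumed node layout: one item, its support count, and child nodes.
    item: str
    support: int
    children: list = field(default_factory=list)


def preorder_itemsets(node, prefix=()):
    """Yield (item set, support) pairs, visiting each node before its children."""
    path = prefix + (node.item,)
    yield frozenset(path), node.support
    for child in node.children:
        yield from preorder_itemsets(child, path)


# Hypothetical tree: a -> (b -> c), d
tree = Node("a", 10, [Node("b", 10, [Node("c", 10)]), Node("d", 4)])
ordered = list(preorder_itemsets(tree))
```

Under this assumed layout, item sets that lie on one root-to-leaf path come out next to each other, which is what allows the later comparison to look only at adjacent entries.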
Preferably, the content compared in the parent-child set comparison comprises the membership relationship of the item sets and the support of the item sets.
Preferably, the comparison of the membership relationship of the item sets is specifically:
Suppose the two item sets are item set A and item set B. If every item in item set A is contained in item set B, then item set A is considered to belong to item set B, and item set A is a subset of item set B.
Preferably, the comparison of the support of the item sets is specifically:
Suppose the two item sets are item set A and item set B. The support of an item set is derived from the data; put simply, it is the number of times the items in the item set occur together in the data. If the support of item set A equals the support of item set B and item set A is a subset of item set B, then item set A is a proper subset of item set B. If item set A is a subset of item set B but the supports differ, then item set A is a subset of item set B but not a proper subset.
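The two-part comparison can be written as a small Python predicate. This is a sketch with hypothetical function names; note that the condition used here (containment plus equal support) is the one defined in the text above, which is narrower than the usual mathematical meaning of a proper subset.

```python
def is_subset(a, b):
    """Membership check: every item of A is contained in B."""
    return a <= b


def is_equal_support_subset(a, support_a, b, support_b):
    """The mergeable relation defined in the text: A is contained in B and supports match."""
    return is_subset(a, b) and support_a == support_b


A = frozenset({"x", "y"})
B = frozenset({"x", "y", "z"})

same_support = is_equal_support_subset(A, 10, B, 10)   # subset, equal support
diff_support = is_equal_support_subset(A, 12, B, 10)   # subset, different support
not_contained = is_equal_support_subset(B, 10, A, 10)  # B is not contained in A
```

Only the first case satisfies both conditions; in the second, A is still a subset of B but would not be merged.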
The technical solution of the present invention has the following beneficial effects:
Compared with existing frequent item set mining algorithms, the technical solution of the present invention adds the extraction of proper subsets. Its main advantages are that extracting proper subsets reduces the data volume, the computation performed on the data and the data storage, and that reducing invalid item sets prevents repeated computation on duplicate data. As a result, when the algorithm is used for recommendation, recommending invalid commodities is avoided and the user experience is effectively improved. The use of proper subsets thus both saves cost and improves performance and user experience.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Brief description of the drawings
Fig. 1 is a flowchart of the optimization method for a frequent item set mining algorithm according to the embodiment of the present invention;
Fig. 2 is a schematic diagram of the data structure of the frequent item sets according to the embodiment of the present invention;
Fig. 3 is a schematic diagram of the data structure in which the frequent item sets can be merged according to the embodiment of the present invention;
Fig. 4 is a schematic diagram of the data structure in which the frequent item sets can be partly merged according to the embodiment of the present invention.
Detailed description of the embodiments
The preferred embodiments of the present invention are described below with reference to the drawings. It should be understood that the preferred embodiments described herein serve only to illustrate and explain the present invention and are not intended to limit it.
As shown in Fig. 1, an optimization method for a frequent item set mining algorithm comprises:
receiving data;
for the received data, traversing the item set tree using preorder traversal, thereby arranging the item sets;
performing a parent-child set comparison on adjacent item sets among the arranged item sets, and merging item sets whose comparison result is a proper subset and superset relationship;
wherein "item set" is short for "frequent item set".
Preferably, the content compared in the parent-child set comparison comprises the membership relationship of the item sets and the support of the item sets.
Preferably, the comparison of the membership relationship of the item sets is specifically:
Suppose the two item sets are item set A and item set B. If every item in item set A is contained in item set B, then item set A is considered to belong to item set B, and item set A is a subset of item set B.
Preferably, the comparison of the support of the item sets is specifically:
Suppose the two item sets are item set A and item set B. The support of an item set is derived from the data; put simply, it is the number of times the items in the item set occur together in the data. If the support of item set A equals the support of item set B and item set A is a subset of item set B, then item set A is a proper subset of item set B. If item set A is a subset of item set B but the supports differ, then item set A is a subset of item set B but not a proper subset.
As shown in Fig. 2, the frequent item set result set contains three columns of data. The second and third columns are subsets of the first column, and the support of all three columns of frequent item sets is 10. In this case the first, second and third columns come from the same data source, which means that the second and third columns are proper subsets of the first column; there is no need to compute them as three columns, and they can be merged into a single column, as shown in Fig. 3.
As shown in Fig. 3, in the frequent item set results, the next column of data is selected and its support and parent-child set relationship are compared with those of the current column. If the next column is a proper subset of the current column, the two columns are merged into one column and the third column is then compared. If the third column also has a parent-child set relationship with the first column, the three columns are merged into one column, and the comparison continues downward in turn. If the third column and the first column do not have a parent-child relationship, as shown in Fig. 4, only the first and second columns are merged, and the comparison continues downward starting from the third column.
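The column-by-column merging walked through above can be sketched in Python as a single pass over the ordered results. Several points are assumptions rather than details from the patent: each column is represented as an (item set, support) pair, the function name `merge_adjacent` is hypothetical, and the comparison is made symmetric so that the larger set is kept whichever of the two adjacent columns comes first.

```python
def merge_adjacent(columns):
    """Merge each column into the current one when the two are in a
    subset/superset relation and have equal support; otherwise start a new column."""
    merged = []
    for items, support in columns:
        if merged:
            cur_items, cur_support = merged[-1]
            related = items <= cur_items or cur_items <= items
            if related and support == cur_support:
                # One set contains the other, so the union is simply the larger set.
                merged[-1] = (cur_items | items, cur_support)
                continue
        # No relation or different support: comparison restarts from this column.
        merged.append((items, support))
    return merged


# Hypothetical data: three columns with support 10, then two unrelated ones.
columns = [
    (frozenset({"a", "b", "c"}), 10),
    (frozenset({"a", "b"}), 10),
    (frozenset({"a"}), 10),
    (frozenset({"d", "e"}), 7),
    (frozenset({"d"}), 5),
]
result = merge_adjacent(columns)
```

The first three columns collapse into one, as in the Fig. 3 case, while the last two stay separate because their supports differ, which corresponds to the partial merge of Fig. 4.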
Merging proper subsets and supersets into a single column reduces duplicate data in the data volume and reduces the number of computations when the frequent item sets are used. It also reduces the repeated use of frequent item sets that come from the same data source, which improves data accuracy.
Parent-child set comparison: the comparison has two parts. The first is the membership relationship of the frequent item sets. An item set is a set of frequent items; if every item in frequent item set A is contained in frequent item set B, then item set A is considered to belong to item set B, and item set A is a subset of item set B. The second is the comparison of support. The support of a frequent item set is derived from the data; put simply, it is the number of times the items in the item set occur together in the data. If the support of item set A equals the support of item set B and item set A is a subset of item set B, then item set A is a proper subset of item set B. If item set A is a subset of item set B but the supports differ, then item set A is a subset of item set B but not a proper subset.
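Since support is described as the number of times the items of a set occur together in the data, it can be counted directly from the transactions. The following Python sketch uses a hypothetical `support` function and made-up transactions to show how two nested item sets can end up with equal or with different support.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical transaction data.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "d"},
]

s_a = support({"a"}, transactions)             # "a" never appears without "b"
s_ab = support({"a", "b"}, transactions)
s_abc = support({"a", "b", "c"}, transactions)
```

In this data `{a}` and `{a, b}` have the same support, so they meet both conditions of the comparison, whereas `{a, b}` and `{a, b, c}` are nested but have different support.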
Selection of parent-child set candidates: when selecting frequent item sets for parent-child set comparison, only two adjacent sets need to be compared, because the preorder traversal of the item set tree places item sets that may have a parent-child relationship next to each other.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiment, a person skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (4)

1. An optimization method for a frequent item set mining algorithm, characterized by comprising:
receiving data;
for the received data, traversing the item set tree using preorder traversal, thereby arranging the item sets;
performing a parent-child set comparison on adjacent item sets among the arranged item sets, and merging item sets whose comparison result is a proper subset and superset relationship;
wherein "item set" is short for "frequent item set".
2. The optimization method for a frequent item set mining algorithm according to claim 1, characterized in that the content compared in the parent-child set comparison comprises the membership relationship of the item sets and the support of the item sets.
3. The optimization method for a frequent item set mining algorithm according to claim 2, characterized in that the comparison of the membership relationship of the item sets is specifically:
Suppose the two item sets are item set A and item set B. If every item in item set A is contained in item set B, then item set A is considered to belong to item set B, and item set A is a subset of item set B.
4. The optimization method for a frequent item set mining algorithm according to claim 3, characterized in that the comparison of the support of the item sets is specifically:
Suppose the two item sets are item set A and item set B. The support of an item set is derived from the data; put simply, it is the number of times the items in the item set occur together in the data. If the support of item set A equals the support of item set B and item set A is a subset of item set B, then item set A is a proper subset of item set B. If item set A is a subset of item set B but the supports differ, then item set A is a subset of item set B but not a proper subset.
CN201510806032.5A 2015-11-20 2015-11-20 The optimization method of Frequent Itemsets Mining Algorithm Active CN105447134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510806032.5A CN105447134B (en) 2015-11-20 2015-11-20 The optimization method of Frequent Itemsets Mining Algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510806032.5A CN105447134B (en) 2015-11-20 2015-11-20 The optimization method of Frequent Itemsets Mining Algorithm

Publications (2)

Publication Number Publication Date
CN105447134A (en) 2016-03-30
CN105447134B (en) 2019-03-08

Family

ID=55557311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510806032.5A Active CN105447134B (en) 2015-11-20 2015-11-20 The optimization method of Frequent Itemsets Mining Algorithm

Country Status (1)

Country Link
CN (1) CN105447134B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937447A (en) * 2010-06-07 2011-01-05 华为技术有限公司 Alarm association rule mining method, and rule mining engine and system
US20130332431A1 (en) * 2012-06-12 2013-12-12 International Business Machines Corporation Closed itemset mining using difference update
CN103678530A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Rapid detection method of frequent item sets
CN104850577A (en) * 2015-03-19 2015-08-19 浙江工商大学 Data flow maximal frequent item set mining method based on ordered composite tree structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
林森媚: "基于合并FP树的频繁模式挖掘算法", 《广西师范大学学报:自然科学版》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021412A (en) * 2016-05-13 2016-10-12 上海市计算技术研究所 Large-scale vehicle-passing data oriented accompanying vehicle identification method
CN109300014A (en) * 2018-10-24 2019-02-01 中南民族大学 Method of Commodity Recommendation, device, server and storage medium based on Web log mining
CN109300014B (en) * 2018-10-24 2020-09-08 中南民族大学 Commodity recommendation method and device based on log mining, server and storage medium

Also Published As

Publication number Publication date
CN105447134B (en) 2019-03-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant