CN105447134B - The optimization method of Frequent Itemsets Mining Algorithm - Google Patents

Publication number
CN105447134B
Authority
CN
China
Legal status
Active
Application number
CN201510806032.5A
Other languages
Chinese (zh)
Other versions
CN105447134A (en)
Inventor
李磊
Current Assignee
CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Original Assignee
CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Priority date
2015-11-20
Filing date
2015-11-20
Publication date
2019-03-08
Application filed by CCTV INTERNATIONAL NETWORKS WUXI Co Ltd
Priority to CN201510806032.5A
Publication of CN105447134A
Application granted
Publication of CN105447134B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an optimization method for a frequent itemset mining algorithm, comprising: receiving data; traversing the itemset tree of the received data in preorder so as to arrange the itemsets; and performing a parent-child set comparison on adjacent itemsets in the arranged result, merging the itemsets whose comparison shows a proper-subset and superset relationship. Compared with existing frequent itemset mining algorithms, the method extracts proper subsets. Its main advantage is that extracting proper subsets reduces the data volume and, with it, the computation and storage the data requires; by cutting down the computation of invalid itemsets, it also prevents repeated computation over duplicate data.

Description

Optimization method for a frequent itemset mining algorithm
Technical field
The present invention relates to the field of data processing, and in particular to an optimization method for a frequent itemset mining algorithm.
Background
A frequent itemset mining algorithm is used to mine sets of items that often occur together (called frequent itemsets). Once these frequent itemsets have been mined, if a transaction contains one of the items of a frequent itemset, the other items of that itemset can be offered as recommendations.
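By way of illustration only, the recommendation use described above might be sketched as follows. The function name `recommend` and the sample items are assumptions made for this example and do not come from the original disclosure.

```python
def recommend(transaction, frequent_itemsets):
    """Suggest items that co-occur with the items already in `transaction`."""
    basket = set(transaction)
    suggestions = set()
    for itemset in frequent_itemsets:
        # An itemset is relevant when the basket already holds at least one of its items.
        if basket & itemset:
            # Recommend the remaining items of that itemset.
            suggestions |= itemset - basket
    return suggestions


# Hypothetical frequent itemsets, as if already mined.
frequent_itemsets = [
    frozenset({"bread", "milk", "butter"}),
    frozenset({"tea", "sugar"}),
]

print(sorted(recommend(["milk"], frequent_itemsets)))  # ['bread', 'butter']
```

A transaction that shares no item with any frequent itemset yields an empty recommendation set.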
Common frequent itemset mining algorithms fall into two classes: Apriori and FP-growth. FP-growth is an optimization based on Apriori. Its main breakthrough relative to Apriori is that it reduces the number of passes over the data. To compute frequent itemsets, Apriori needs K-1 rounds of computation, where K is the number of items in a frequent itemset, whereas FP-growth builds an FP-tree and needs to traverse the data only twice to complete the computation of the frequent itemsets.
With the development of information technology, data volumes have grown explosively and the complexity of data has increased greatly. Technologies such as Hadoop, Spark, and FP-growth can shorten the computation time for frequent itemsets and reduce the number of passes over the data, but data from different sources increases both the number of frequent itemsets and the number of invalid frequent itemsets. In real projects the results are then inaccurate, and wrong recommendations are often produced. The invalid data also inflates the frequent itemset results, so that the performance and cost of the project cannot meet requirements.
Summary of the invention
In view of the above problems, an object of the present invention is to provide an optimization method for a frequent itemset mining algorithm that reduces the data volume and, with it, the computation and storage the data requires.
To achieve this object, the technical solution adopted by the present invention is as follows:
An optimization method for a frequent itemset mining algorithm, comprising:
receiving data;
for the received data, traversing the itemset tree using preorder traversal so as to arrange the itemsets;
performing a parent-child set comparison on adjacent itemsets in the arranged itemsets, and merging the itemsets whose comparison result is a proper-subset and superset relationship;
wherein "itemset" is short for "frequent itemset".
Preferably, the parent-child set comparison covers the subset relationship between the itemsets and the support of the itemsets.
Preferably, the subset relationship between the itemsets is determined as follows:
Given two itemsets A and B, if every item in A is also contained in B, then A is considered to belong to B, and A is a subset of B.
Preferably, the support of the itemsets is compared as follows:
Given two itemsets A and B, the support of an itemset is, put simply, the number of times the items in that itemset occur together in the data. If the supports of A and B are equal and A is a subset of B, then A is a proper subset of B in the sense used here; if A is a subset of B but their supports differ, then A is a subset of B but not a proper subset.
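The two checks described above, subset relationship and equal support, might be sketched as follows. This is a minimal illustration under stated assumptions: the names `support` and `is_mergeable` and the sample transactions are hypothetical, and support is counted by a plain scan rather than by any method the source prescribes.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def is_mergeable(a, b, transactions):
    """True when `a` is a subset of `b` and both have the same support.

    This is the relationship the text calls a "proper subset".
    """
    return a <= b and support(a, transactions) == support(b, transactions)


# Hypothetical transaction data.
transactions = [
    frozenset({"x", "y", "z"}),
    frozenset({"x", "y", "z"}),
    frozenset({"x", "w"}),
]
x = frozenset({"x"})
xy = frozenset({"x", "y"})
xyz = frozenset({"x", "y", "z"})

print(is_mergeable(xy, xyz, transactions))  # True: subset, support 2 == 2
print(is_mergeable(x, xyz, transactions))   # False: subset, but support 3 != 2
```

In the second call the subset relationship holds but the supports differ, so `x` is a subset of `xyz` without being mergeable into it.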
The technical solution of the present invention has the following advantages:
Compared with existing frequent itemset mining algorithms, the technical solution of the present invention extracts proper subsets. Its main advantage is that extracting proper subsets reduces the data volume and, with it, the computation and storage the data requires; by cutting down the computation of invalid itemsets, it also prevents repeated computation over duplicate data. When the algorithm is used for recommendation, recommending invalid goods is avoided, which effectively improves the user experience. Using proper subsets thus saves cost while improving performance and user experience.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Brief description of the drawings
Fig. 1 is a flowchart of the optimization method for a frequent itemset mining algorithm according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the data structure of frequent itemsets according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the data structure of frequent itemsets that can be merged according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the data structure of frequent itemsets that can be partially merged according to an embodiment of the present invention.
Detailed description of the embodiments
Preferred embodiments of the present invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.
As shown in Fig. 1, an optimization method for a frequent itemset mining algorithm comprises:
receiving data;
for the received data, traversing the itemset tree using preorder traversal so as to arrange the itemsets;
performing a parent-child set comparison on adjacent itemsets in the arranged itemsets, and merging the itemsets whose comparison result is a proper-subset and superset relationship;
wherein "itemset" is short for "frequent itemset".
Preferably, the parent-child set comparison covers the subset relationship between the itemsets and the support of the itemsets.
Preferably, the subset relationship between the itemsets is determined as follows:
Given two itemsets A and B, if every item in A is also contained in B, then A is considered to belong to B, and A is a subset of B.
Preferably, the support of the itemsets is compared as follows:
Given two itemsets A and B, the support of an itemset is, put simply, the number of times the items in that itemset occur together in the data. If the supports of A and B are equal and A is a subset of B, then A is a proper subset of B in the sense used here; if A is a subset of B but their supports differ, then A is a subset of B but not a proper subset.
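The two conditions can also be stated compactly. The notation is assumed for illustration and does not appear in the original: D denotes the set of transactions and supp(X) the support of an itemset X.

```latex
\[
\operatorname{supp}(X) = \bigl|\{\, t \in D : X \subseteq t \,\}\bigr|
\]
\[
A \subseteq B \;\Longrightarrow\; \operatorname{supp}(A) \ge \operatorname{supp}(B)
\]
\[
A \text{ is merged into } B
\iff A \subseteq B \ \text{and}\ \operatorname{supp}(A) = \operatorname{supp}(B)
\]
```

Since every transaction that contains B also contains its subset A, the support of A can never be smaller than that of B. Equal support therefore means that every transaction containing A also contains all of B, which is why A carries no information beyond B and can be merged into it.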
As shown in Fig. 2, the frequent itemset result set contains three columns of data. The second and third columns are subsets of the first column, and the support of all three frequent itemsets is 10. In this case the first, second, and third columns come from the same data source, which means that the second and third columns are proper subsets of the first column. There is no need to compute them as three separate columns; they can be merged into a single column, as shown in Fig. 3.
As shown in Fig. 3, in the frequent itemset results, the next column is compared with the current column in terms of support and parent-child subset relationship. If the next column is a proper subset of the current column, the two columns are merged into one, and the third column is then compared. If the third column also has a parent-child subset relationship with the first column, the three columns are merged into one, and the comparison continues down the list. If the third column and the first column are not in a parent-child relationship, as shown in Fig. 4, the first and second columns are merged and the comparison continues from the third column.
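The column-by-column comparison described for Figs. 2 to 4 might be sketched as a single pass over the arranged columns. This is an illustration under stated assumptions: `merge_adjacent` is a hypothetical name, each column is modelled as an (itemset, support) pair, and the item names are invented, since the figures' contents are not reproduced in the text (only the support value 10 is given).

```python
def merge_adjacent(columns):
    """Merge each run of adjacent columns into the column that starts the run.

    `columns` is a list of (itemset, support) pairs in arranged order.
    A following column is absorbed when it is a subset of the run's first
    column and has the same support; otherwise it starts a new run.
    """
    merged = []
    for itemset, count in columns:
        if merged:
            head_itemset, head_count = merged[-1]
            if itemset <= head_itemset and count == head_count:
                continue  # absorbed into the current head column
        merged.append((itemset, count))
    return merged


# Fully mergeable case (as in Fig. 3): columns 2 and 3 are subsets of column 1.
fig3_columns = [
    (frozenset({"a", "b", "c"}), 10),
    (frozenset({"a", "b"}), 10),
    (frozenset({"a", "c"}), 10),
]
print(len(merge_adjacent(fig3_columns)))  # 1

# Partially mergeable case (as in Fig. 4): column 3 is unrelated to column 1.
fig4_columns = [
    (frozenset({"a", "b", "c"}), 10),
    (frozenset({"a", "b"}), 10),
    (frozenset({"d", "e"}), 10),
]
print(len(merge_adjacent(fig4_columns)))  # 2
```

Each column is compared only with the head of the current run, so the pass is linear in the number of columns.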
Merging proper subsets and their supersets into a single column reduces duplicate data in terms of data volume and reduces the number of computations when the frequent itemsets are used. In terms of data accuracy, it reduces the repeated use of frequent itemsets that come from the same data source, and so improves data accuracy.
Parent-child set comparison: the comparison has two parts. The first is the subset relationship between frequent itemsets. An itemset is a set of frequent items; if every item in frequent itemset A is contained in frequent itemset B, then A is considered to belong to B, and A is a subset of B. The second is the support comparison. The support of a frequent itemset is, put simply, the number of times the items in that itemset occur together in the data. If the supports of A and B are equal and A is a subset of B, then A is a proper subset of B; if A is a subset of B but their supports differ, then A is a subset of B but not a proper subset.
Selecting itemsets for the parent-child set comparison: when choosing frequent itemsets to compare, only two adjacent sets need to be selected. Because the itemset tree is traversed in preorder, itemsets that may be in a parent-child relationship are placed next to each other.
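How a preorder traversal places related itemsets next to each other might be illustrated as follows. The source does not specify how the itemset tree is built, so this sketch assumes a simple prefix tree in which each node extends its parent's itemset by one item; the `Node` class, `preorder_itemsets`, and the sample tree are all hypothetical.

```python
class Node:
    """One node of a hypothetical itemset prefix tree."""

    def __init__(self, item, support):
        self.item = item
        self.support = support
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


def preorder_itemsets(node, prefix=()):
    """Yield (itemset, support) for each node: the node first, then its subtrees."""
    itemset = prefix + (node.item,)
    yield itemset, node.support
    for child in node.children:
        yield from preorder_itemsets(child, itemset)


# Sample tree: a -> b -> c, and a -> d.
root = Node("a", 10)
ab = root.add(Node("b", 10))
ab.add(Node("c", 10))
root.add(Node("d", 4))

for itemset, count in preorder_itemsets(root):
    print(itemset, count)
# ('a',) 10
# ('a', 'b') 10
# ('a', 'b', 'c') 10
# ('a', 'd') 4
```

In this output each itemset along the path a, ab, abc sits directly beside an itemset it is nested with, so comparing only neighbouring entries is enough to find those pairs.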
Finally, it should be noted that the above is only a preferred embodiment of the present invention and is not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiment, those skilled in the art may still modify the technical solutions described in it or make equivalent replacements of some of its technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (1)

1. An optimization method for a frequent itemset mining algorithm, characterized by comprising:
receiving data;
for the received data, traversing the itemset tree using preorder traversal so as to arrange the itemsets;
performing a parent-child set comparison on adjacent itemsets in the arranged itemsets, and merging the itemsets whose comparison result is a proper-subset and superset relationship;
wherein "itemset" is short for "frequent itemset";
wherein the parent-child set comparison has two parts; the first is the subset relationship between frequent itemsets, an itemset being a set of frequent items: if every item in frequent itemset A is contained in frequent itemset B, then A is considered to belong to B, and A is a subset of B;
the second is the support comparison, the support of a frequent itemset being, put simply, the number of times the items in that itemset occur together in the data: if the supports of A and B are equal and A is a subset of B, then A is a proper subset of B; if A is a subset of B but their supports differ, then A is a subset of B but not a proper subset.
CN201510806032.5A 2015-11-20 2015-11-20 The optimization method of Frequent Itemsets Mining Algorithm Active CN105447134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510806032.5A CN105447134B (en) 2015-11-20 2015-11-20 The optimization method of Frequent Itemsets Mining Algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510806032.5A CN105447134B (en) 2015-11-20 2015-11-20 The optimization method of Frequent Itemsets Mining Algorithm

Publications (2)

Publication Number Publication Date
CN105447134A CN105447134A (en) 2016-03-30
CN105447134B true CN105447134B (en) 2019-03-08

Family

ID=55557311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510806032.5A Active CN105447134B (en) 2015-11-20 2015-11-20 The optimization method of Frequent Itemsets Mining Algorithm

Country Status (1)

Country Link
CN (1) CN105447134B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021412A (en) * 2016-05-13 2016-10-12 上海市计算技术研究所 Large-scale vehicle-passing data oriented accompanying vehicle identification method
CN109300014B (en) * 2018-10-24 2020-09-08 中南民族大学 Commodity recommendation method and device based on log mining, server and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937447A (en) * 2010-06-07 2011-01-05 华为技术有限公司 Alarm association rule mining method, and rule mining engine and system
CN103678530A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Rapid detection method of frequent item sets
CN104850577A (en) * 2015-03-19 2015-08-19 浙江工商大学 Data flow maximal frequent item set mining method based on ordered composite tree structure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9563669B2 (en) * 2012-06-12 2017-02-07 International Business Machines Corporation Closed itemset mining using difference update

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
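Frequent pattern mining algorithm based on merged FP-trees (基于合并FP树的频繁模式挖掘算法); 林森媚; Journal of Guangxi Normal University: Natural Science Edition; 2007-12-31; Vol. 25, No. 4; pp. 252-256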

Also Published As

Publication number Publication date
CN105447134A (en) 2016-03-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant