CN110222090A - Method for mining frequent itemsets from massive data
- Publication number: CN110222090A
- Application number: CN201910477465.9A
- Authority
- CN
- China
- Prior art keywords
- transaction data
- data set
- frequent
- local
- item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The invention provides a method for mining frequent itemsets from massive data, comprising: mining an original transaction data set T_O with a frequent itemset mining algorithm to obtain all local frequent itemsets of T_O; scanning T_O to compute the support count on T_O of each local frequent itemset obtained, filtering the local frequent itemsets to retain those whose support is not less than ω, and writing the retained itemsets together with their support counts to a file F_qf; and reading the newly added transaction data set T_Δ, determining whether T_Δ is empty, and mining frequent itemsets according to the result. The file F_qf, the set ST_CAD and the array cnt_Δ are reused throughout the mining process, which reduces computational overhead to some extent and thereby increases the speed of frequent itemset mining.
Description
Technical Field
The invention relates to the technical field of data mining, and in particular to a method for mining frequent itemsets from massive data.
Background
Frequent itemset mining has long been one of the most active areas of data mining. It has a wide range of practical applications, for example in research fields such as data mining, software bug detection, spatiotemporal data analysis and biological analysis. Because of this practical significance, frequent itemset mining has attracted extensive attention.
In data storage, data is usually stored in read-only/append-only mode, and the whole transaction data set can be divided into two parts: an original transaction data set and a newly added transaction data set. After a certain time, or when a certain condition is met, the data in the newly added transaction data set is merged into the original transaction data set: the original data set grows and the newly added data set is emptied. New data written afterwards goes into the newly added data set; when the time or condition is met again, that data is merged into the original data set once more, and the newly added data set goes back to waiting for new data, and so on. Under read-only/append-only storage, the total transaction data set therefore always consists of the original transaction data set and the newly added transaction data set.
Over the years, researchers in China and abroad have proposed many related algorithms. Existing algorithms fall into two categories: candidate-generation algorithms and pattern-growth algorithms. A candidate-generation algorithm first generates candidate itemsets, then scans the database to verify them and identify the frequent ones; it also uses anti-monotonicity to prune the search space. However, such algorithms need multiple passes over the database, which incurs high I/O overhead on massive data. A pattern-growth algorithm does not generate candidate itemsets directly; instead it builds a special tree-based data structure that holds the necessary information about the frequent itemsets in the database. With this structure frequent itemsets can be computed efficiently, but building it is complex, and on massive data the memory requirement usually exceeds the available memory, so the structure cannot be built correctly in memory.
The invention therefore provides a method for mining frequent itemsets from massive data stored in read-only/append-only mode.
Summary of the Invention
In view of the above shortcomings of the prior art, the invention provides a method for mining frequent itemsets from massive data stored in read-only/append-only mode, so as to increase the speed of mining frequent itemsets from massive data.
The invention provides a method for mining frequent itemsets from massive data. The method mines the frequent itemsets in a total transaction data set T that satisfy a global minimum support minsup, where minsup is a preset minimum support on T;
the total transaction data set T comprises an original transaction data set T_O and a newly added transaction data set T_Δ;
the method comprises the steps of:
mining the original transaction data set T_O with a frequent itemset mining algorithm to obtain all local frequent itemsets of T_O;
scanning T_O to compute the support count on T_O of each local frequent itemset obtained above; filtering the obtained local frequent itemsets according to a local minimum support ω to retain those whose support is not less than ω; and writing the retained local frequent itemsets, together with their computed support counts, to a file F_qf;
reading the newly added transaction data set T_Δ and determining whether T_Δ is empty:
if so, filtering the local frequent itemsets in the file F_qf according to the number n of transactions in T and the global minimum support minsup, and outputting the local frequent itemsets whose support count is not less than the global minimum support count n×minsup; the output itemsets are all the frequent itemsets on T that satisfy minsup;
if not, mining the frequent itemsets on T with an incremental update method;
where the local minimum support ω is a preset minimum support on T_O, and ω < minsup.
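The initial pass and the empty-T_Δ branch described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function names `mine_local` and `filter_global` are hypothetical, an in-memory dict stands in for the file F_qf, exhaustive enumeration stands in for the unspecified frequent itemset mining algorithm (workable only on toy data), and ω and minsup are assumed to be fractions of the transaction count.

```python
from itertools import combinations


def mine_local(t_o, omega):
    """Count every itemset in t_o and keep those with count >= omega * |t_o|.

    Exhaustive enumeration is an assumption made for brevity; the text does
    not say which mining algorithm is used for this step.
    """
    counts = {}
    for txn in t_o:
        items = sorted(txn)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    min_count = omega * len(t_o)
    return {s: c for s, c in counts.items() if c >= min_count}


def filter_global(f_qf, n, minsup):
    """Empty-T_delta branch: keep itemsets whose count reaches n * minsup."""
    return {s for s, c in f_qf.items() if c >= n * minsup}


# Toy data: four transactions, local threshold 0.5, global threshold 0.75.
t_o = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
f_qf = mine_local(t_o, omega=0.5)      # stands in for the file F_qf
frequent = filter_global(f_qf, n=len(t_o), minsup=0.75)
```

Because ω < minsup, F_qf keeps more itemsets than the global threshold requires; that surplus is what later lets the incremental step decide many itemsets without rescanning T_O.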
Further, the incremental update method comprises the steps of:
scanning T_Δ, computing the support count on T_Δ of each itemset in T_Δ, storing each itemset together with its support count in an array cnt_Δ, and denoting the largest support count in cnt_Δ as mas_Δ;
scanning the current file F_qf and, for each local frequent itemset read from it, determining whether it is a frequent itemset on T that satisfies the preset global minimum support minsup; if so, outputting the itemset and recording it as a first frequent itemset; if not, determining, based on mas_Δ, whether the itemset certainly cannot be a frequent itemset on T that satisfies minsup and, if that cannot be established, writing the itemset and its support count to a set ST_CAD;
based on the array cnt_Δ, computing and updating the support count on T of each local frequent itemset in ST_CAD that occurs in T_Δ, to obtain an updated set ST_CAD;
traversing the updated set ST_CAD and determining, for each local frequent itemset in it, whether the itemset is a frequent itemset on T that satisfies minsup, and outputting it if so;
then determining whether the expression (n_O×ω−1)+mas_Δ < (n×minsup) holds:
if so, frequent itemset mining ends;
if not, continuing to mine new frequent itemsets in T_Δ and outputting the new frequent itemsets found;
where a new frequent itemset has a support count on T_O greater than zero, satisfies the global minimum support minsup on T, and differs from all the frequent itemsets output above;
n_O is the number of transactions in the original transaction data set T_O.
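The incremental update steps above can be sketched in Python as follows. This is an illustration under assumptions rather than the patented implementation: `incremental_update` and `count_itemsets` are hypothetical names, dicts stand in for F_qf, cnt_Δ and ST_CAD, cnt_Δ is built by exhaustive enumeration (feasible only for toy data), ω and minsup are fractions, and n_O is taken to be the number of transactions in T_O.

```python
from itertools import combinations


def count_itemsets(transactions):
    """Support count of every itemset occurring in the transactions."""
    counts = {}
    for txn in transactions:
        items = sorted(txn)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def incremental_update(f_qf, t_delta, n_o, omega, minsup):
    n = n_o + len(t_delta)
    threshold = n * minsup                      # global minimum support count
    cnt_delta = count_itemsets(t_delta)
    mas_delta = max(cnt_delta.values(), default=0)

    output, st_cad = set(), {}
    for itemset, count in f_qf.items():
        if count >= threshold:                  # frequent on T_O alone
            output.add(itemset)
        elif count + mas_delta >= threshold:    # cannot be ruled out yet
            st_cad[itemset] = count
        # otherwise the itemset certainly is not frequent on T

    for itemset, count in st_cad.items():       # add the T_delta counts
        if count + cnt_delta.get(itemset, 0) >= threshold:
            output.add(itemset)

    # Termination test: an itemset absent from F_qf has at most
    # n_o*omega - 1 occurrences in T_O and at most mas_delta in T_delta.
    done = (n_o * omega - 1) + mas_delta < threshold
    return output, done


f_qf = {frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 3,
        frozenset("ab"): 2, frozenset("ac"): 2, frozenset("bc"): 2}
t_delta = [{"a", "b"}, {"a", "b", "d"}]
output, done = incremental_update(f_qf, t_delta, n_o=4, omega=0.5, minsup=0.75)
```

On this toy input the threshold is 6 × 0.75 = 4.5, so {a} and {b} reach it after the T_Δ counts are added, {c} does not, and the termination test holds because (4 × 0.5 − 1) + 2 = 3 < 4.5.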
Further, mining new frequent itemsets in T_Δ and outputting the new frequent itemsets found comprises:
computing the minimum support minsup_Δ of the newly added transaction data set T_Δ by a formula in which n_Δ is the number of transactions in T_Δ;
splitting the transactions in T_Δ into multiple target transaction data sets;
applying the Eclat algorithm to each target transaction data set to mine local frequent itemsets, obtaining for each target transaction data set all local frequent itemsets that satisfy the minimum support minsup_Δ computed above;
denoting the set of all local frequent itemsets obtained for the target transaction data sets as LF_Δ; traversing LF_Δ and deleting the local frequent itemsets that already appear in the file F_qf, to obtain a candidate set GF_Δ; and, based on the array cnt_Δ, storing in GF_Δ the support count on T_Δ of each local frequent itemset in GF_Δ;
scanning the current original transaction data set T_O and adding to the support counts of the local frequent itemsets in GF_Δ, to obtain a new candidate set GF_Δ;
scanning the new candidate set GF_Δ, determining for each local frequent itemset in it whether the itemset satisfies the preset global minimum support minsup, and outputting it if so.
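The candidate flow LF_Δ → GF_Δ → output described above can be sketched as follows. Several things here are assumptions for illustration only: the names `mine_new_itemsets`, `local_frequent` and `support_count` are hypothetical, minsup_Δ is passed in as a parameter because its formula is not reproduced in this text, exhaustive enumeration stands in for Eclat, the split of T_Δ is a simple round-robin, and T_Δ counts are recomputed directly instead of being read from cnt_Δ.

```python
from itertools import combinations


def local_frequent(partition, min_ratio):
    """Itemsets whose count in the partition reaches min_ratio * |partition|."""
    counts = {}
    for txn in partition:
        items = sorted(txn)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= min_ratio * len(partition)}


def support_count(itemset, transactions):
    return sum(1 for txn in transactions if itemset <= txn)


def mine_new_itemsets(t_o, t_delta, f_qf, minsup, minsup_delta, n_parts=2):
    n = len(t_o) + len(t_delta)
    parts = [t_delta[i::n_parts] for i in range(n_parts)]   # target data sets
    lf_delta = set()
    for part in parts:
        if part:
            lf_delta |= local_frequent(part, minsup_delta)
    # GF_delta: drop what F_qf already covers, attach counts on T_delta.
    gf_delta = {s: support_count(s, t_delta) for s in lf_delta if s not in f_qf}
    for s in gf_delta:                                      # one scan of T_O
        gf_delta[s] += support_count(s, t_o)
    return {s for s, c in gf_delta.items() if c >= n * minsup}


t_o = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"d"}]
t_delta = [{"d"}, {"d"}, {"d", "a"}, {"d"}]
f_qf = {frozenset("a"): 3, frozenset("b"): 2, frozenset("ab"): 2}
new_frequent = mine_new_itemsets(t_o, t_delta, f_qf, minsup=0.5, minsup_delta=0.5)
```

Here {d} is below the local threshold on T_O, so it is absent from F_qf, yet its five occurrences across T reach the global count of 8 × 0.5 = 4; {a, d} is a candidate but stays at one occurrence and is not output.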
Further, applying the Eclat algorithm to each target transaction data set to mine local frequent itemsets, obtaining for each target transaction data set all local frequent itemsets that satisfy the minimum support minsup_Δ computed above, comprises:
P0. traversing the target transaction data sets;
P1. for the target transaction data set currently being traversed:
P11. using the Eclat algorithm to compute the candidate frequent k-itemsets of the current target transaction data set and, when a generated candidate frequent k-itemset satisfies minsup_Δ, recording it as a frequent k-itemset and storing it in a set LF_{k,Δ}, k ≥ 1;
P12. generating candidate frequent (k+1)-itemsets by merging two frequent k-itemsets in LF_{k,Δ} whose first k−1 items are the same and whose last items differ and, when a generated candidate frequent (k+1)-itemset satisfies minsup_Δ, recording it as a frequent (k+1)-itemset and storing it in a set LF_{k+1,Δ};
P13. repeating steps P11-P12, increasing k by 1 each time, until no new candidate frequent itemset can be generated for the current target transaction data set; then performing step P14;
P14. moving on to the next target transaction data set and repeating steps P11-P13 until all target transaction data sets have been traversed, thereby obtaining for each target transaction data set all local frequent itemsets that satisfy minsup_Δ.
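Steps P11 to P13 for a single target transaction data set can be sketched as a level-wise Eclat over transaction-id sets. This is a generic textbook-style sketch, not the patent's code: the function name `eclat` is hypothetical, itemsets are represented as sorted tuples, and the threshold is given as an absolute count `min_count` (an assumption; the text states it as the support minsup_Δ).

```python
def eclat(transactions, min_count):
    """Return {itemset tuple: support count} for one target data set."""
    # Vertical layout: each 1-itemset maps to the ids of its transactions.
    tidsets = {}
    for tid, txn in enumerate(transactions):
        for item in txn:
            tidsets.setdefault((item,), set()).add(tid)
    level = {s: t for s, t in tidsets.items() if len(t) >= min_count}

    frequent = {}
    while level:                                  # one pass per value of k
        frequent.update({s: len(t) for s, t in level.items()})
        keys = sorted(level)
        next_level = {}
        for i, left in enumerate(keys):
            for right in keys[i + 1:]:
                if left[:-1] != right[:-1]:
                    break    # keys are sorted: no later key shares the prefix
                tids = level[left] & level[right]  # support by intersection
                if len(tids) >= min_count:
                    next_level[left + (right[-1],)] = tids
        level = next_level
    return frequent


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = eclat(transactions, min_count=2)
```

Joining only itemsets that share their first k−1 items means each (k+1)-itemset is generated once, and intersecting the transaction-id sets gives its support without rescanning the data.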
Further, between step P11 and step P12 the method also comprises step S: a pruning step for the set LF_{k,Δ};
where the pruning step for LF_{k,Δ} comprises:
dividing the frequent k-itemsets in LF_{k,Δ} into itemset groups according to whether their first (k−1) items are the same, so that the frequent k-itemsets in the same group share the same first (k−1) items;
counting the frequent k-itemsets in each itemset group and, for each group whose count equals 1, deleting the group and deleting from LF_{k,Δ} the frequent k-itemset it contains;
determining, for each remaining itemset group, whether the union of every pair of frequent k-itemsets in the group is contained in the file F_qf; if so, deleting the group and deleting from LF_{k,Δ} the frequent k-itemsets it contains.
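The pruning step S can be sketched as below. The function name `prune_level` is hypothetical, itemsets in LF_{k,Δ} are assumed to be sorted tuples, and F_qf is assumed to be a dict keyed by frozenset; the function returns what remains of LF_{k,Δ} after both deletions.

```python
from itertools import combinations


def prune_level(lf_k, f_qf):
    """Drop prefix groups that cannot yield a useful (k+1)-candidate."""
    groups = {}
    for itemset in lf_k:
        groups.setdefault(itemset[:-1], []).append(itemset)  # first k-1 items

    kept = []
    for members in groups.values():
        if len(members) == 1:
            continue            # nothing in the group to merge with
        unions = (frozenset(a) | frozenset(b)
                  for a, b in combinations(members, 2))
        if all(u in f_qf for u in unions):
            continue            # every merge result is already in F_qf
        kept.extend(members)
    return sorted(kept)


lf_2 = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e"), ("d", "f")]
f_qf = {frozenset("def"): 2}
pruned = prune_level(lf_2, f_qf)
```

In this example the group with prefix `b` has a single member and is dropped, the group with prefix `d` is dropped because its only merge result {d, e, f} is already recorded in F_qf, and the group with prefix `a` survives.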
Further, before the step of computing and updating, based on the array cnt_Δ, the support count on T of each local frequent itemset in ST_CAD that occurs in T_Δ to obtain the updated set ST_CAD, the method also comprises a pruning step for the set ST_CAD;
the pruning step for ST_CAD comprises a first-stage pruning step;
the first-stage pruning step comprises:
traversing the file F_qf and the array cnt_Δ and, for each 1-itemset read from F_qf, computing the sum of its support count in F_qf and the support count of the same 1-itemset in cnt_Δ; if the sum is less than the global minimum support count n×minsup, removing that 1-itemset from the set ST_CAD.
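A minimal sketch of the first-stage pruning follows, with the hypothetical name `prune_stage_one` and dicts keyed by frozenset standing in for F_qf, cnt_Δ and ST_CAD. It does only what the text states, namely removing the 1-itemsets whose combined count falls short of n×minsup.

```python
def prune_stage_one(st_cad, f_qf, cnt_delta, n, minsup):
    """Remove from ST_CAD the 1-itemsets that cannot reach n * minsup."""
    threshold = n * minsup
    pruned = dict(st_cad)
    for itemset, count in f_qf.items():
        if len(itemset) != 1:
            continue
        total = count + cnt_delta.get(itemset, 0)   # count on T_O plus T_delta
        if total < threshold:
            pruned.pop(itemset, None)
    return pruned


f_qf = {frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 3,
        frozenset("ab"): 2}
cnt_delta = {frozenset("a"): 2, frozenset("b"): 2, frozenset("ab"): 2}
st_cad = {frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 3}
after_stage_one = prune_stage_one(st_cad, f_qf, cnt_delta, n=6, minsup=0.75)
```

By anti-monotonicity, any itemset containing an item removed this way is also below the threshold on T; the text does not say whether the method uses that fact, so the sketch leaves such supersets alone.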
Further, the pruning step for ST_CAD also comprises a second-stage pruning step;
the second-stage pruning step comprises:
constructing a PIP array;
traversing the set ST_CAD as pruned by the first stage and, for each local frequent itemset in it, selecting the two items of that itemset with the smallest support counts to form the item pair of the itemset, and storing the item pairs in the PIP array;
computing the support count on T_Δ of each item pair in the PIP array;
comparing, for each item pair in the PIP array, the sum of the pair's support count on T_Δ and the corresponding itemset's support count on T_Δ with the global minimum support count n×minsup, and deleting from ST_CAD the local frequent itemset of every item pair for which the sum is less than n×minsup.
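One possible reading of the second-stage pruning is sketched below. It rests on several assumptions that the text does not settle: item counts for choosing the two rarest items are taken on T_Δ, itemsets with fewer than two items are left untouched, and the second term of the sum is taken to be the count stored with the itemset in ST_CAD (its count carried over from F_qf), so that the sum is an upper bound on the itemset's count on T. The name `prune_stage_two` and the dict used for the PIP array are hypothetical.

```python
def prune_stage_two(st_cad, t_delta, n, minsup):
    """Drop itemsets whose rarest item pair rules them out."""
    threshold = n * minsup
    item_count = {}
    for txn in t_delta:
        for item in txn:
            item_count[item] = item_count.get(item, 0) + 1

    # PIP: itemset -> pair of its two items with the smallest counts.
    pip = {}
    for itemset in st_cad:
        if len(itemset) >= 2:
            rarest = sorted(itemset, key=lambda i: (item_count.get(i, 0), i))[:2]
            pip[itemset] = frozenset(rarest)

    pair_count = {pair: sum(1 for txn in t_delta if pair <= txn)
                  for pair in set(pip.values())}

    # An itemset occurs in T_delta at most as often as any pair inside it.
    return {s: c for s, c in st_cad.items()
            if s not in pip or c + pair_count[pip[s]] >= threshold}


st_cad = {frozenset("abc"): 3, frozenset("ab"): 4}
t_delta = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
after_stage_two = prune_stage_two(st_cad, t_delta, n=8, minsup=0.75)
```

With a threshold of 8 × 0.75 = 6, {a, b, c} is bounded by 3 + 1 = 4 through its pair {b, c} and is removed, while {a, b} is bounded by 4 + 2 = 6 and stays.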
Further, when the newly added transaction data set T_Δ is updated, the original transaction data set T_O is updated to the combination of the previous T_O and the previous T_Δ, and the updated T_Δ is non-empty, the method also comprises an update mining step;
the update mining step comprises:
updating the number n of transactions in the total transaction data set T to the sum of the numbers of transactions in the previous T_O and the previous T_Δ;
taking all the local frequent itemsets obtained above for the previous T_O, computing the support count of each on the previous T_O, and writing the itemsets and counts to a file F_{qf,O};
based on the array cnt_Δ of the previous T_Δ, adding to the support count of each local frequent itemset in F_{qf,O} its support count on the previous T_Δ; then filtering the local frequent itemsets in F_{qf,O} according to the local minimum support ω to obtain the local frequent itemsets of the updated total transaction data set T whose support is not less than ω;
writing those local frequent itemsets, together with their support counts in F_{qf,O}, to a new file F_qf;
taking the set LF_Δ of the previous T_Δ and deleting from it the itemsets that exist in F_{qf,O}, to obtain a new set LF_Δ;
based on the array cnt_Δ of the previous T_Δ, writing into the new set LF_Δ the support count on the previous T_Δ of each itemset in it;
filtering the itemsets in the new set LF_Δ according to the local minimum support ω to retain those whose support is not less than ω, and writing the retained itemsets, together with the support counts written into the new set LF_Δ, to the new file F_qf;
then replacing the previous file F_qf with the new file F_qf, replacing the previous T_O with the combination of the previous T_O and the previous T_Δ, and replacing the previous T_Δ with the updated T_Δ; and, based on the number n of transactions in the updated total transaction data set T, mining the frequent itemsets on the updated T with the incremental update method.
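The rebuild of F_qf in the update mining step can be sketched as follows. The name `roll_forward` is hypothetical, dicts and sets stand in for F_{qf,O}, cnt_Δ and LF_Δ, and the ω filter is assumed to use the count threshold ω × n with n the size of the merged data set; the text does not spell out that threshold. As in the text, itemsets taken from LF_Δ carry only their count on the previous T_Δ.

```python
def roll_forward(f_qf_o, lf_delta, cnt_delta, n_o, n_delta, omega):
    """Build the new F_qf after the previous T_delta is merged into T_O."""
    n = n_o + n_delta                  # size of the merged data set
    threshold = omega * n              # assumed count threshold for omega

    # Known itemsets: add their counts on the previous T_delta, then filter.
    merged = {s: c + cnt_delta.get(s, 0) for s, c in f_qf_o.items()}
    new_f_qf = {s: c for s, c in merged.items() if c >= threshold}

    # Itemsets found only in T_delta: reuse LF_delta and cnt_delta.
    for s in lf_delta:
        if s not in f_qf_o:
            count = cnt_delta.get(s, 0)
            if count >= threshold:
                new_f_qf[s] = count
    return new_f_qf, n


f_qf_o = {frozenset("a"): 3, frozenset("b"): 2, frozenset("ab"): 2}
lf_delta = {frozenset("d"), frozenset("a"), frozenset("ad")}
cnt_delta = {frozenset("d"): 4, frozenset("a"): 1, frozenset("ad"): 1}
new_f_qf, n = roll_forward(f_qf_o, lf_delta, cnt_delta,
                           n_o=4, n_delta=4, omega=0.5)
```

Nothing in this step rescans the transactions themselves: the new F_qf is assembled entirely from F_{qf,O}, cnt_Δ and LF_Δ, which is the reuse the method relies on.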
Further, filtering the local frequent itemsets in the file F_qf according to the number n of transactions in T and the global minimum support minsup, to obtain the local frequent itemsets whose support count is not less than the global minimum support count n×minsup, specifically comprises:
scanning the file F_qf sequentially;
determining, for each local frequent itemset read from F_qf, whether its support count is greater than or equal to the global minimum support count n×minsup:
if so, the itemset is one of the local frequent itemsets whose support count is not less than n×minsup.
Further, the step of scanning the current file F_qf, outputting as first frequent itemsets the local frequent itemsets that satisfy the preset global minimum support minsup on T and, based on mas_Δ, writing to the set ST_CAD, together with their support counts, those that cannot be ruled out, specifically comprises:
scanning the file F_qf sequentially;
determining, for each local frequent itemset read from F_qf, whether its support count is greater than or equal to the global minimum support count n×minsup:
if so, outputting the itemset, which is a frequent itemset on T that satisfies minsup;
if not, determining whether the sum of the itemset's support count and mas_Δ is less than n×minsup; if it is, the itemset certainly is not a frequent itemset on T that satisfies minsup; otherwise, writing the itemset and its support count to the set ST_CAD.
The beneficial effects of the invention are:
(1) The method uses the file F_qf, the set ST_CAD and the array cnt_Δ and reuses them throughout the mining process. This avoids, to some extent, traversing the original transaction data set T_O and the newly added transaction data set T_Δ, reduces computational overhead, and can therefore increase the speed of frequent itemset mining.
(2) The method includes a pruning step for the set ST_CAD, so that ST_CAD is reduced further before it is used in subsequent computation, which lowers I/O overhead and computational overhead.
(3) The method sets out a specific incremental update strategy that uses information already computed, such as the array cnt_Δ and the set LF_Δ, to speed up the update operation, which helps improve the performance and practicality of frequent itemset mining on massive data.
In addition, the design principle of the invention is reliable and its structure is simple, giving it very broad application prospects.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a method according to an embodiment of the invention.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本发明中的技术方案,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本发明保护的范围。In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described The embodiments are only some of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
实施例1:Example 1:
图1是本发明一个实施例的方法的示意性流程图。其中,图1执行主体可以为计算节点、服务器,也可以为普通PC机。该海量数据频繁项集挖掘方法用于挖掘总事务数据集T中满足全局最小支持度minsup的频繁项集,所述的全局最小支持度minsup为预先设定的总事务数据集T上的最小支持度,所述的总事务数据集T包括原始事务数据集TO和新增事务数据集TΔ。FIG. 1 is a schematic flowchart of a method according to an embodiment of the present invention. Wherein, the execution body of FIG. 1 may be a computing node, a server, or a common PC. The massive data frequent itemset mining method is used to mine frequent itemsets that satisfy the global minimum support minsup in the total transaction data set T, where the global minimum support minsup is the preset minimum support on the total transaction data set T The total transaction data set T includes the original transaction data set T O and the newly added transaction data set T Δ .
参见图1,该海量数据频繁项集挖掘方法包括:Referring to Figure 1, the massive data frequent itemset mining method includes:
步骤110,采用频繁项集挖掘算法对原始事务数据集TO进行挖掘,获得原始事务数据集TO对应的所有的局部频繁项集。Step 110 , using the frequent itemset mining algorithm to mine the original transaction data set T O to obtain all local frequent itemsets corresponding to the original transaction data set T O.
具体地,可顺序的读取原始事务数据集TO中的事务,并将取回的事务放在内存缓冲区中,然后使用现有的频繁项集挖掘算法在缓冲区数据集上计算局部频繁项集,计算出的局部频繁项集被保存在一个文件F中;之后清空缓冲区,继续顺序读取原始事务数据集TO中的事务进行下一次迭代,计算出的局部频繁项集继续被保存在上述文件F中。这个过程一直被重复执行直到原始事务数据集TO中的所有事务读取完毕,至此,原始事务数据集TO对应的所有的局部频繁项集已全部生成并得到,且均被存储在所述的文件F中。Specifically, the transactions in the original transaction data set TO can be read sequentially, and the retrieved transactions are placed in the memory buffer, and then the local frequent itemset mining algorithm is used to calculate the local frequent items on the buffer data set. Itemset, the calculated local frequent itemset is saved in a file F; after that, the buffer is emptied, and the transactions in the original transaction data set TO continue to be read sequentially for the next iteration, and the calculated local frequent itemsets continue to be Saved in file F above. This process is repeated until all transactions in the original transaction data set TO are read. So far, all local frequent itemsets corresponding to the original transaction data set TO have been generated and obtained, and are stored in the in file F.
为便于描述,将步骤110对应的步骤记为预计算阶段。For the convenience of description, the steps corresponding to step 110 are denoted as the pre-calculation stage.
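The precomputation stage can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `buffer_size` and `min_ratio` parameters, and the brute-force miner standing in for "an existing frequent itemset mining algorithm" are all hypothetical, and the text does not say which threshold the local mining uses.

```python
from itertools import combinations

def mine_local(transactions, min_ratio):
    """Brute-force stand-in for an existing miner; only suitable for short transactions."""
    threshold = min_ratio * len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {itemset for itemset, c in counts.items() if c >= threshold}

def precompute(transaction_stream, buffer_size, min_ratio):
    """Union of the local frequent itemsets of every buffer-sized chunk (the file F)."""
    f, buffer = set(), []
    for t in transaction_stream:
        buffer.append(t)
        if len(buffer) == buffer_size:
            f |= mine_local(buffer, min_ratio)
            buffer = []              # empty the buffer before the next iteration
    if buffer:                       # last, possibly shorter, chunk
        f |= mine_local(buffer, min_ratio)
    return f
```

Because only one buffer of transactions is held in memory at a time, the data set can be larger than memory; the price is that F may contain itemsets that are frequent in one chunk only, which is what the next stage filters out.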
Step 120: scan the original transaction data set T_O and compute the support count on T_O of each local frequent itemset obtained in step 110; filter the local frequent itemsets according to the local minimum support ω, keeping those whose support is not less than ω; and write each kept itemset together with its computed support count into the file F_qf.
In a specific implementation, all local frequent itemsets in file F are first read into memory. T_O is then scanned sequentially once to compute the support count of each itemset in memory. Finally, the itemsets in memory are filtered according to the local minimum support ω, and those whose support is not less than ω are stored in the file F_qf.
The storage schema of the file F_qf is F_qf(IS, SUP), where IS denotes an itemset and SUP denotes the support count of itemset IS. The local frequent itemsets in F_qf are sorted in descending order of support count.
For ease of description, step 120 is referred to as the purification stage.
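A minimal sketch of the purification stage, assuming the candidates from file F fit in memory and that F_qf is represented as a Python list of (IS, SUP) pairs; the function name `purify` and the tie-breaking order among equal counts are illustrative choices, not part of the method as described.

```python
def purify(candidates, t_o, omega):
    """Return F_qf as a list of (itemset, support_count), sorted by count descending."""
    counts = {frozenset(c): 0 for c in candidates}
    for transaction in t_o:              # one sequential scan of T_O
        items = set(transaction)
        for c in counts:
            if c <= items:               # itemset is contained in the transaction
                counts[c] += 1
    threshold = omega * len(t_o)         # support >= omega  <=>  count >= omega * n_O
    kept = [(c, n) for c, n in counts.items() if n >= threshold]
    kept.sort(key=lambda e: (-e[1], sorted(e[0])))
    return kept
```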
Step 130: read the newly added transaction data set T_Δ and judge whether T_Δ is empty:
If so, filter the local frequent itemsets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, and output the local frequent itemsets whose support count is not less than the global minimum support count n×minsup; the output itemsets are all the frequent itemsets on T that satisfy minsup;
If not, mine the frequent itemsets on the total transaction data set T with the incremental update method;
Here the local minimum support ω is a preset minimum support on the original transaction data set T_O, and ω < minsup.
Filtering the local frequent itemsets in F_qf according to n and minsup, to obtain those whose support count is not less than the global minimum support count n×minsup, is specifically:
Sequentially scan the file F_qf;
For each local frequent itemset scanned from F_qf, judge whether its support count is greater than or equal to the global minimum support count n×minsup:
If so, the scanned itemset is a local frequent itemset whose support count is not less than n×minsup, that is, a frequent itemset on the total transaction data set T.
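The filter for the empty-T_Δ case might look like the sketch below. The early `break` is an addition that relies on F_qf being sorted by SUP in descending order, as stated for the storage schema; the text itself only describes a sequential scan with a test per entry. The function name is hypothetical.

```python
def global_frequent(f_qf, n, minsup):
    """f_qf: list of (itemset, support_count), sorted by support_count descending."""
    threshold = n * minsup               # global minimum support count
    result = []
    for itemset, sup in f_qf:
        if sup < threshold:
            break                        # every later entry has a count that is no larger
        result.append(itemset)
    return result
```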
When the newly added transaction data set T_Δ is read, one transaction of T_Δ is read first, then the items in that transaction are read; after all items of the transaction have been read, reading moves on to the next transaction in T_Δ. Here t_Δ denotes the transaction of T_Δ currently being read and i denotes an item in t_Δ; each time an item i of T_Δ is read, the count of i (initially 0) is increased by 1.
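The per-item counting described above can be sketched with a `Counter`. The sketch assumes that an item appears at most once per transaction, so that the count of an item equals its support count on T_Δ; the name `count_items` is illustrative, and the maximum count is returned as well because the incremental update method uses it as mas_Δ.

```python
from collections import Counter

def count_items(t_delta):
    """Return (cnt_delta, mas_delta) for the newly added transactions."""
    cnt_delta = Counter()
    for transaction in t_delta:          # read one transaction ...
        for item in transaction:         # ... then each of its items
            cnt_delta[item] += 1         # counts start at 0
    mas_delta = max(cnt_delta.values(), default=0)
    return cnt_delta, mas_delta
```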
Preferably, the incremental update method (that is, mining the frequent itemsets on the total transaction data set T with the incremental update method) includes the following steps:
Scan the newly added transaction data set T_Δ, compute the support count on T_Δ of each itemset in T_Δ, store each itemset together with its computed support count in the array cnt_Δ, and denote the largest support count in cnt_Δ as mas_Δ;
Sequentially scan the current file F_qf. For each local frequent itemset scanned from F_qf, judge whether it is a frequent itemset on T that satisfies the preset global minimum support minsup. If so, output it and record it as a first frequent itemset. If not, determine, based on mas_Δ, whether it is certain that the itemset is not a frequent itemset on T satisfying minsup; if that is not certain, write the itemset and its support count into the set ST_CAD;
Based on the array cnt_Δ, compute and update the support counts on T of those local frequent itemsets in ST_CAD that occur in T_Δ, obtaining the updated set ST_CAD;
Traverse the updated set ST_CAD and judge, for each local frequent itemset in it, whether it is a frequent itemset on T that satisfies minsup; if so, output that itemset;
Then judge whether the expression (n_O×ω-1)+mas_Δ<(n×minsup) holds:
If so, frequent itemset mining ends;
If not, continue to mine new frequent itemsets in the newly added transaction data set T_Δ and output the new frequent itemsets that are found;
Here a new frequent itemset has a support count greater than zero on the original transaction data set T_O, satisfies the global minimum support minsup on T, and differs from all frequent itemsets output above;
n_O is the number of transactions in the original transaction data set T_O.
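One way to read the termination test: an itemset that is absent from F_qf has a support count on T_O below n_O×ω, so at most n_O×ω-1 when n_O×ω is an integer, and no itemset can be supported by more than mas_Δ transactions of T_Δ. If even the sum of these two upper bounds stays below n×minsup, no itemset outside F_qf can be frequent on T. The sketch below merely evaluates the expression; the function name and the example numbers are illustrative.

```python
def no_new_itemsets_possible(n_o, omega, mas_delta, n, minsup):
    """True when an itemset missing from F_qf cannot reach the global threshold."""
    best_case = (n_o * omega - 1) + mas_delta   # most support such an itemset could have on T
    return best_case < n * minsup

# 1000 old and 100 new transactions: (100 - 1) + 100 = 199 < 220, so mining can stop.
stop = no_new_itemsets_possible(1000, 0.1, 100, 1100, 0.2)
# 100 old and 100 new transactions: (10 - 1) + 50 = 59 >= 40, so mining must continue.
go_on = not no_new_itemsets_possible(100, 0.1, 50, 200, 0.2)
```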
In this embodiment, the above step of sequentially scanning the current file F_qf, outputting the first frequent itemsets and writing the remaining candidates and their support counts into the set ST_CAD specifically includes:
Sequentially scan the file F_qf;
For each local frequent itemset scanned from F_qf, judge whether its support count is greater than or equal to the global minimum support count n×minsup:
If so, output the scanned local frequent itemset; the output itemset is a frequent itemset on the total transaction data set T that satisfies the preset global minimum support minsup;
If not, judge whether the sum of the support count of the scanned itemset and mas_Δ is less than the global minimum support count n×minsup. If it is, the scanned itemset is certainly not a frequent itemset on T satisfying minsup; otherwise, write the scanned itemset and its support count into the set ST_CAD.
It should be noted that the local frequent itemsets in the file F_qf fall into three parts: (1) those that are certainly frequent itemsets of the total transaction data set T, (2) those that are certainly not frequent itemsets of T, and (3) those that may be frequent itemsets of T. Let t denote the local frequent itemset currently read from F_qf. If t already satisfies the global minimum support minsup, then t is a frequent itemset of T. If t still could not satisfy minsup even under the assumption that every transaction in T_Δ contains t, then t is certainly not a frequent itemset of T. In all other cases t may be a frequent itemset of T and needs further verification; the present invention stores such itemsets t in the set ST_CAD. The present invention thus classifies the itemsets and mines the frequent itemsets of T class by class, which helps to improve mining efficiency to a certain extent.
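The three-way split can be sketched as a single pass over F_qf. The names `classify`, `first_frequent` and `st_cad` are illustrative, and F_qf is assumed to be a list of (IS, SUP) pairs.

```python
def classify(f_qf, mas_delta, n, minsup):
    """Split F_qf into itemsets that are surely frequent on T and candidates (ST_CAD)."""
    threshold = n * minsup
    first_frequent, st_cad = [], []
    for itemset, sup in f_qf:
        if sup >= threshold:
            first_frequent.append(itemset)      # (1) already frequent on T
        elif sup + mas_delta < threshold:
            continue                            # (2) cannot reach the threshold
        else:
            st_cad.append((itemset, sup))       # (3) must be verified against T_delta
    return first_frequent, st_cad
```

With n = 200 and minsup = 0.2 the threshold is 40; for mas_Δ = 15, an entry with SUP 50 is output directly, an entry with SUP 30 goes to ST_CAD (30 + 15 ≥ 40), and an entry with SUP 10 is dropped (10 + 15 < 40).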
Preferably, in this embodiment, mining new frequent itemsets in the newly added transaction data set T_Δ and outputting them includes:
Calculate the minimum support minsup_Δ of the newly added transaction data set T_Δ by a formula [formula not reproduced in this text], in which n_Δ is the number of transactions in T_Δ, n is the number of transactions in the total transaction data set T, and n_O is the number of transactions in the original transaction data set T_O;
Partition the transactions of T_Δ into multiple target transaction data sets;
Using the Eclat algorithm, mine local frequent itemsets in each target transaction data set to obtain, for each target transaction data set, all local frequent itemsets that satisfy the minimum support minsup_Δ calculated above;
Denote the set of all local frequent itemsets obtained above for the target transaction data sets as LF_Δ; traverse LF_Δ and delete the local frequent itemsets that already appear in the file F_qf to obtain the candidate set GF_Δ; and, based on the array cnt_Δ, store in GF_Δ the support count on T_Δ of each local frequent itemset in GF_Δ;
Scan the current original transaction data set T_O and increase the support counts of the local frequent itemsets in GF_Δ accordingly, obtaining a new candidate set GF_Δ;
Scan the new candidate set GF_Δ and judge, for each local frequent itemset in it, whether it is a frequent itemset satisfying the preset global minimum support minsup; if so, output that itemset.
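The last three steps above (building GF_Δ, adding the counts from T_O, and the final filter) can be sketched as follows. Itemsets are represented as frozensets; `support_delta`, a mapping from each itemset of LF_Δ to its support count on T_Δ, is a hypothetical input standing in for whatever is derived from cnt_Δ, and the function name is illustrative.

```python
def mine_new_itemsets(lf_delta, f_qf_itemsets, support_delta, t_o, n, minsup):
    """Return {itemset: support count on T} for the new frequent itemsets."""
    # GF_delta: local frequent itemsets of T_delta that are not already in F_qf
    gf_delta = {s: support_delta[s] for s in lf_delta if s not in f_qf_itemsets}
    for transaction in t_o:              # one scan of T_O adds the missing counts
        items = set(transaction)
        for s in gf_delta:
            if s <= items:
                gf_delta[s] += 1
    threshold = n * minsup
    return {s: c for s, c in gf_delta.items() if c >= threshold}
```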
In this embodiment, using the Eclat algorithm to mine local frequent itemsets in each target transaction data set, so as to obtain for each target transaction data set all local frequent itemsets that satisfy the minimum support minsup_Δ calculated above, includes:
P0. Traverse the target transaction data sets;
P1. For the target transaction data set currently traversed:
P11. Using the Eclat algorithm, compute the candidate frequent k-itemsets of the current target transaction data set; when a generated candidate frequent k-itemset satisfies the minimum support minsup_Δ calculated above, record it as a frequent k-itemset and store it in the set LF_{k,Δ}, k≥1;
P12. Generate candidate frequent (k+1)-itemsets by merging two frequent k-itemsets of LF_{k,Δ} whose first k-1 items are the same and whose last items differ; when a generated candidate frequent (k+1)-itemset satisfies minsup_Δ, record it as a frequent (k+1)-itemset and store it in the set LF_{k+1,Δ};
P13. Repeat steps P11-P12, increasing k by 1 each time, until no new candidate frequent itemset can be generated for the current target transaction data set; then perform step P14;
P14. Move on to the next target transaction data set and repeat steps P11-P13 until all target transaction data sets have been traversed, thereby obtaining for each target transaction data set all local frequent itemsets that satisfy the minimum support minsup_Δ calculated above.
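Steps P11-P13 for a single target transaction data set can be sketched as a level-wise Eclat over a vertical (item to transaction-id set) layout. The sketch takes an absolute count `min_count` instead of the ratio minsup_Δ and assumes items are mutually comparable (for example, strings); both are simplifications made for the example.

```python
def eclat(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for one target data set."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in set(t):
            tidsets.setdefault((item,), set()).add(tid)
    # LF_1: frequent 1-itemsets with their transaction-id sets
    level = {s: tids for s, tids in tidsets.items() if len(tids) >= min_count}
    frequent = {}
    while level:
        frequent.update({s: len(tids) for s, tids in level.items()})
        keys = sorted(level)
        next_level = {}
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a[:-1] != b[:-1]:
                    break                        # sorted: no later key shares a's prefix
                tids = level[a] & level[b]       # support by tidset intersection
                if len(tids) >= min_count:
                    next_level[a + (b[-1],)] = tids
        level = next_level                       # k increases by 1
    return frequent
```

Joining only itemsets that share their first k-1 items generates each (k+1)-candidate once, and the intersection of two tidsets gives the candidate's support without rescanning the data.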
It should be noted that, referring to FIG. 1:
Each t in the figure is the local frequent itemset of the file F_qf currently scanned in the corresponding step; each t.SUP is the support count recorded in F_qf for that itemset t;
The "s" shown in the figure is the itemset of the updated set ST_CAD currently scanned in the corresponding step, and "s.SUP" is the support count of itemset s in the updated set ST_CAD;
The "r" shown in the figure is the itemset of the new candidate set GF_Δ currently scanned in the corresponding step, and "r.SUP" is the support count of itemset r in the new candidate set GF_Δ.
It should further be noted that, as can be seen from FIG. 1:
When the newly added transaction data set T_Δ is determined to be empty, the itemsets t output in FIG. 1 are all the frequent itemsets on the total transaction data set T mined by the method of the present invention;
When T_Δ is determined to be non-empty, the itemsets t, s and r output in FIG. 1 are together all the frequent itemsets on the total transaction data set T mined by the method of the present invention.
It should further be noted that, in a specific implementation, the values of minsup and ω may be chosen by those skilled in the art based on experience; for example, minsup may be 0.2 and ω may be 0.1.
In summary, the method for mining frequent itemsets from massive data provided by the present invention uses the file F_qf, the set ST_CAD and the array cnt_Δ, and reuses all three throughout the mining process. This avoids, to a certain extent, traversing the original transaction data set T_O and the newly added transaction data set T_Δ, reduces the computational overhead, and can thereby increase the speed of frequent itemset mining.
Embodiment 2:
Compared with Embodiment 1, the difference is that, to further increase the mining speed, the method of Embodiment 2 further includes, between steps P11 and P12 above, a step S: a fine-pruning step for the set LF_{k,Δ};
The fine-pruning step for the set LF_{k,Δ} includes:
Divide the frequent k-itemsets of LF_{k,Δ} into groups according to whether their first (k-1) items are the same, obtaining a corresponding number of itemset groups, where the frequent k-itemsets within one group share the same first (k-1) items;
Count the frequent k-itemsets in each group and judge whether the count equals 1; if so, delete that group and delete from LF_{k,Δ} the itemset that is the same as the frequent k-itemset in that group;
For each group that still exists, judge whether the union of any two frequent k-itemsets in the group is contained in the file F_qf; if so, delete the group and delete from LF_{k,Δ} the itemsets that are the same as the frequent k-itemsets in that group.
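The fine-pruning of LF_{k,Δ} can be sketched as below, assuming each k-itemset is a sorted tuple and that the itemsets of F_qf are available as a set of frozensets; `fine_prune` is an illustrative name. A group is dropped when it has a single member (there is nothing to join it with) or when every (k+1)-itemset it could generate is already recorded in F_qf.

```python
from itertools import combinations

def fine_prune(lf_k, f_qf_itemsets):
    """Return the k-itemsets of LF_{k,delta} that survive fine-pruning."""
    groups = {}
    for s in lf_k:                           # group by the first k-1 items
        groups.setdefault(s[:-1], []).append(s)
    kept = []
    for members in groups.values():
        if len(members) == 1:
            continue                         # singleton group: no join partner
        unions = [frozenset(a) | frozenset(b) for a, b in combinations(members, 2)]
        if all(u in f_qf_itemsets for u in unions):
            continue                         # all joins are already in F_qf
        kept.extend(members)
    return sorted(kept)
```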
In addition, to further increase the mining speed, this embodiment further includes a fine-pruning step for the set ST_CAD, performed before the step of computing and updating, based on the array cnt_Δ, the support counts on the total transaction data set T of the local frequent itemsets of ST_CAD that occur in T_Δ to obtain the updated set ST_CAD;
The fine-pruning step for the set ST_CAD includes a first-stage reduction step;
The first-stage reduction step includes:
Traverse the file F_qf and the array cnt_Δ and, for each 1-itemset traversed in F_qf, compute the sum of its support count and the support count recorded in cnt_Δ for the same 1-itemset; if the computed sum is less than the global minimum support count n×minsup, remove the corresponding 1-itemset from the set ST_CAD.
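A sketch of the first-stage reduction, with F_qf and ST_CAD represented as dictionaries from frozenset to support count (an assumption made for the example; `first_stage` is an illustrative name). By the anti-monotone property of support, any candidate containing such an item could not be frequent either; the text only mentions removing the 1-itemset itself, so the sketch does the same.

```python
def first_stage(st_cad, f_qf, cnt_delta, n, minsup):
    """Remove from ST_CAD the 1-itemsets whose total count stays below n * minsup."""
    threshold = n * minsup
    hopeless = {
        itemset
        for itemset, sup in f_qf.items()
        if len(itemset) == 1
        and sup + cnt_delta.get(next(iter(itemset)), 0) < threshold
    }
    return {s: c for s, c in st_cad.items() if s not in hopeless}
```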
In addition, to further increase the mining speed, the fine-pruning step for the set ST_CAD also includes a second-stage reduction step;
The second-stage reduction step includes:
Construct a PIP array;
Traverse the set ST_CAD reduced by the first-stage reduction step and, for each local frequent itemset in it, select the two items of that itemset with the smallest support counts to form the item pair of the itemset; all item pairs are saved in the PIP array constructed above;
Compute the support count on the newly added transaction data set T_Δ of each item pair in the PIP array;
For each item pair in the PIP array, compare the sum of the support count of the pair on T_Δ and the support count of the corresponding itemset on T_Δ with the global minimum support count n×minsup, and delete from the set ST_CAD the local frequent itemsets corresponding to the item pairs for which this sum is less than n×minsup.
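The second-stage reduction can be sketched as follows, under two assumptions that the text leaves open: the "smallest support counts" used to pick the pair are taken from cnt_Δ, and the count added to the pair's support on T_Δ is the count stored with the itemset in ST_CAD. The idea is that the support of an item pair on T_Δ is an upper bound for the support on T_Δ of any itemset containing that pair. All names are illustrative, and items are assumed to be comparable (for example, strings).

```python
def second_stage(st_cad, cnt_delta, t_delta, n, minsup):
    """st_cad: {frozenset: stored count}. Returns the candidates that survive."""
    # PIP: candidate itemset -> its two rarest items (1-itemsets have no pair)
    pip = {
        s: frozenset(sorted(s, key=lambda i: (cnt_delta.get(i, 0), i))[:2])
        for s in st_cad if len(s) >= 2
    }
    pair_count = {pair: 0 for pair in pip.values()}
    for transaction in t_delta:              # one scan of T_delta
        items = set(transaction)
        for pair in pair_count:
            if pair <= items:
                pair_count[pair] += 1
    threshold = n * minsup
    return {
        s: c for s, c in st_cad.items()
        if s not in pip or c + pair_count[pip[s]] >= threshold
    }
```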
综上可见,本发明提供的海量数据频繁项集挖掘方法,还包括对所述集合STCAD的精剪步骤,使集合STCAD在被用于后续的计算之前被进一步减小,从而减小了I/O开销和计算开销。To sum up, it can be seen that the method for mining frequent itemsets of massive data provided by the present invention further includes a step of fine-cutting the set ST CAD , so that the set ST CAD is further reduced before being used for subsequent calculations, thereby reducing the size of the set ST CAD. I/O overhead and computational overhead.
实施例3:Example 3:
与实施例2相比,不同之处在于,该实施例3中所述的海量数据频繁项集挖掘方法,在更新所述的新增事务数据集TΔ,并在更新所述的原始事务数据集TO为原有的原始事务数据集TO和原有的新增事务数据集TΔ之和、以及更新后的新增事务数据集TΔ非空时,还包括更新挖掘的步骤。Compared with Embodiment 2, the difference is that the method for mining frequent itemsets of massive data described in Embodiment 3 is updating the newly added transaction data set T Δ and updating the original transaction data. When the set T O is the sum of the original original transaction data set T O and the original new transaction data set T Δ , and the updated new transaction data set T Δ is not empty, the step of updating and mining is also included.
具体地,本实施例中所述的更新挖掘的步骤,包括:Specifically, the steps of updating mining described in this embodiment include:
更新总事务数据集T中的事务的数目n,为原有的原始事务数据集TO和原有的新增事务数据集TΔ的事务的数目之和;Update the number n of transactions in the total transaction data set T, which is the sum of the number of transactions in the original original transaction data set T O and the original new transaction data set T Δ ;
获取上述获得的原有原始事务数据集TO对应的所有的局部频繁项集,计算所获取的原有原始事务数据集TO对应的各局部频繁项集在原有的原始事务数据集TO上的支持度计数,对应写入文件Fqf,O中;Obtain all the local frequent item sets corresponding to the original original transaction data set T O obtained above, and calculate the local frequent itemsets corresponding to the original original transaction data set T O obtained on the original original transaction data set T O The support count of , corresponding to the write file F qf,O ;
基于原有的新增事务数据集TΔ对应的数组cntΔ,增加并更新文件Fqf,O中各局部频繁项集在原有的新增事务数据集TΔ上的支持度计数;之后,依据所述的局部最小支持度ω,对文件Fqf,O中的局部频繁项集进行过滤,获取所述更新总事务数据集T对应的支持度不小于ω的各局部频繁项集;Based on the array cnt Δ corresponding to the original newly added transaction data set T Δ , add and update the support counts of each local frequent itemset in the file F qf,O on the original newly added transaction data set T Δ ; For the local minimum support degree ω, filter the local frequent itemsets in the file F qf,O , and obtain each local frequent item set whose support degree corresponding to the update total transaction data set T is not less than ω;
之后将所获取的所述更新总事务数据集T对应的支持度不小于ω的各局部频繁项集及其在文件Fqf,O中各自对应的支持度计数,对应写入一个新的文件Fqf;Afterwards, each local frequent item set whose support degree is not less than ω and its corresponding support degree in the file F qf,O is written into a new file F correspondingly. qf ;
获取原有的新增事务数据集TΔ对应的集合LFΔ,并删除该获取到的集合LFΔ中存在于所述文件Fqf,O中的项集,对应得到一个新的集合LFΔ;Acquire the set LF Δ corresponding to the original newly added transaction data set T Δ , and delete the itemsets in the obtained set LF Δ that exist in the file F qf,O , correspondingly to obtain a new set LF Δ ;
基于原有的新增事务数据集TΔ对应的数组cntΔ,在所述新的集合LFΔ中,对应写入该新的集合LFΔ中各项集在原有的新增事务数据集TΔ上的支持度计数;Based on the array cnt Δ corresponding to the original newly added transaction data set T Δ , in the new set LF Δ , correspondingly write the item sets in the new set LF Δ into the original newly added transaction data set T Δ support count on ;
之后依据所述的局部最小支持度ω,对所述新的集合LFΔ中的项集进行过滤,获取过滤后的支持度不小于ω的各项集,并将该获取到的支持度不小于ω的各项集及所述新的集合LFΔ中写入的支持度计数,对应写入所述的新的文件Fqf;Then, according to the local minimum support ω, filter the itemsets in the new set LF Δ , obtain the itemsets with the filtered support not less than ω, and use the obtained support not less than ω. the item set of ω and the support count written in the new set LF Δ , corresponding to the new file F qf written;
之后用上述新的文件Fqf替换原有的文件Fqf、用原有的原始事务数据集TO和原有的新增事务数据集TΔ之和替换原有的原始事务数据集TO、以及用更新后的新增事务数据集TΔ替换原有的新增事务数据集TΔ,基于更新后的总事务数据集T中的事务的数目n,采用所述的增量更新方法挖掘上述更新后的总事务数据集T上的频繁项集。Then replace the original file F qf with the above-mentioned new file F qf , and replace the original original transaction data set T O with the sum of the original original transaction data set T O and the original newly added transaction data set T Δ , And replace the original new transaction data set T Δ with the updated new transaction data set T Δ , based on the number n of transactions in the updated total transaction data set T, use the incremental update method to mine the above Frequent itemsets on the updated total transaction dataset T.
As can be seen, the massive-data frequent itemset mining method provided by the present invention sets out a concrete incremental update strategy that reuses previously computed information, such as the array cnt_Δ and the set LF_Δ, to speed up the update operation, which helps improve the performance and practicality of frequent itemset mining on massive data.
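The bookkeeping swap that closes each update round (the new file replaces the old one, T_O absorbs the old T_Δ, and the newly arrived batch becomes T_Δ) might be modelled as below. The class name `MiningState`, its fields, and the `rotate` method are hypothetical names introduced for illustration, and the data sets are held as in-memory lists purely to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]
Itemset = FrozenSet[str]


@dataclass
class MiningState:
    """Hypothetical container for the data kept between update rounds."""

    t_o: List[Transaction]                                 # original transaction data set T_O
    t_delta: List[Transaction]                             # newly added transaction data set T_Δ
    f_qf: Dict[Itemset, int] = field(default_factory=dict) # stand-in for file F_qf

    @property
    def n(self) -> int:
        # Number of transactions in the total data set T (T_O together with T_Δ).
        return len(self.t_o) + len(self.t_delta)

    def rotate(self, new_f_qf: Dict[Itemset, int], new_batch: Iterable[Transaction]) -> None:
        # Replace the old file with the rebuilt one.
        self.f_qf = dict(new_f_qf)
        # T_O becomes the old T_O combined with the old T_Δ.
        self.t_o = self.t_o + self.t_delta
        # The freshly arrived batch becomes the new T_Δ.
        self.t_delta = list(new_batch)


# Usage example with made-up transactions.
state = MiningState(
    t_o=[frozenset({"a", "b"}), frozenset({"a"})],
    t_delta=[frozenset({"b"})],
)
state.rotate({frozenset({"a"}): 2}, [frozenset({"c"}), frozenset({"a", "c"})])
```

After `rotate`, `state.n` reflects the size of the updated total data set, which is the value n that the next incremental mining pass would use.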
It should be noted that identical or similar parts of the embodiments in this specification may be understood by cross-reference to one another.
Although the present invention has been described in detail with reference to the accompanying drawings and the preferred embodiments, it is not limited thereto. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various equivalent modifications or substitutions to the embodiments, and all such modifications or substitutions shall fall within the scope of the invention; likewise, any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of the claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910477465.9A CN110222090A (en) | 2019-06-03 | 2019-06-03 | A kind of mass data Mining Frequent Itemsets |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110222090A (en) | 2019-09-10 |
Family
ID=67819051
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910477465.9A Pending CN110222090A (en) | 2019-06-03 | 2019-06-03 | A kind of mass data Mining Frequent Itemsets |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110222090A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761236A (en) * | 2013-11-20 | 2014-04-30 | 同济大学 | Incremental frequent pattern increase data mining method |
CN107229751A (en) * | 2017-06-28 | 2017-10-03 | 济南大学 | A kind of concurrent incremental formula association rule mining method towards stream data |
- 2019-06-03 CN CN201910477465.9A patent/CN110222090A/en active Pending
Non-Patent Citations (1)
Title |
---|
韩希先: "Efficiently Mining Frequent Itemsets on Massive Data", 《IEEE ACCESS》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113064934A (en) * | 2021-03-26 | 2021-07-02 | 安徽继远软件有限公司 | Fault association rule mining method and system for sensing layer of power sensor network |
CN113064934B (en) * | 2021-03-26 | 2023-12-08 | 安徽继远软件有限公司 | Power sensing network perception layer fault association rule mining method and system |
CN114004286A (en) * | 2021-10-19 | 2022-02-01 | 河海大学 | A Multidimensional Time Series Synchronous Motif Discovery Method Based on Frequent Item Mining |
CN114004286B (en) * | 2021-10-19 | 2024-04-26 | 河海大学 | Multi-dimensional time sequence synchronization motif discovery method based on frequent item mining |
CN114691749A (en) * | 2022-05-11 | 2022-07-01 | 江苏大学 | Sliding window-based frequent item set parallel incremental mining method |
CN114691749B (en) * | 2022-05-11 | 2024-03-19 | 江苏大学 | A method for parallel incremental mining of frequent itemsets based on sliding windows |
CN115525695A (en) * | 2022-10-08 | 2022-12-27 | 广东工业大学 | An Incremental Frequent Itemset Mining Method for Internet Financial Real-time Streaming Data |
CN115473933A (en) * | 2022-10-10 | 2022-12-13 | 国网江苏省电力有限公司南通供电分公司 | A Discovery Method of Association Service in Network System Based on Frequent Subgraph Mining |
CN115473933B (en) * | 2022-10-10 | 2023-05-23 | 国网江苏省电力有限公司南通供电分公司 | A Discovery Method of Association Service in Network System Based on Frequent Subgraph Mining |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110222090A (en) | A kind of mass data Mining Frequent Itemsets | |
Lee et al. | Sliding window based weighted maximal frequent pattern mining over data streams | |
US9405790B2 (en) | System, method and data structure for fast loading, storing and access to huge data sets in real time | |
US10579661B2 (en) | System and method for machine learning and classifying data | |
US9805080B2 (en) | Data driven relational algorithm formation for execution against big data | |
CA2723731C (en) | Managing storage of individually accessible data units | |
CN106126543B (en) | A method of model conversion and data migration from relational database to MongoDB | |
Ramadan et al. | Dynamic sorted neighborhood indexing for real-time entity resolution | |
US9063947B2 (en) | Detecting duplicative hierarchical sets of files | |
AU2018211280B2 (en) | Managing memory and storage space for a data operation | |
CN101599079A (en) | A Management Method for Centralized Storage of Backup Data | |
US10133761B2 (en) | Metadump spatial database system | |
US20070088912A1 (en) | Method and system for log structured relational database objects | |
CN110389950A (en) | A fast-running method for cleaning big data | |
US10545960B1 (en) | System and method for set overlap searching of data lakes | |
Tseng | Mining frequent itemsets in large databases: The hierarchical partitioning approach | |
Pandey et al. | VariantStore: an index for large-scale genomic variant search | |
KR20220099745A (en) | A spatial decomposition-based tree indexing and query processing methods and apparatus for geospatial blockchain data retrieval | |
Rossignolo et al. | USTAR: Improved compression of k-mer sets with counters using de Bruijn graphs | |
CN109241058A (en) | A kind of method and apparatus from key-value pair to B+ tree batch that being inserted into | |
CN110413602B (en) | A layered cleaning method for big data cleaning | |
CN106874396B (en) | A method of frequent pattern mining based on non-volatile memory | |
Rajendran et al. | Incremental MapReduce for K-medoids clustering of big time-series data | |
JP5354606B2 (en) | Data storage device and method and program, and data search device and method and program | |
CN114840577B (en) | Frequent closed-term set mining algorithm based on adjacent bit compression table | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190910 |