CN110222090A

CN110222090A - A method for mining frequent itemsets in massive data

Info

Publication number: CN110222090A
Application number: CN201910477465.9A
Authority: CN
Inventors: 韩希先; 陈剑; 赖国骏
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Harbin Institute of Technology Weihai
Priority date: 2019-06-03
Filing date: 2019-06-03
Publication date: 2019-09-10

Abstract

The present invention provides a method for mining frequent itemsets in massive data, comprising: mining an original transaction data set T_O with a frequent itemset mining algorithm to obtain all local frequent itemsets corresponding to T_O; scanning T_O to calculate the support count of each obtained local frequent itemset on T_O, filtering the local frequent itemsets to keep those whose support is not less than ω, and writing the retained local frequent itemsets and their calculated support counts into a file F_qf; and reading the newly added transaction data set T_Δ, judging whether T_Δ is empty, and performing frequent itemset mining according to the result. The present invention reuses the file F_qf, the set ST_CAD and the array cnt_Δ throughout the mining process, which reduces the computational cost to a certain extent and thereby improves the mining rate of frequent itemsets.

Description

A method for mining frequent itemsets in massive data

Technical Field

The invention relates to the technical field of data mining, and in particular to a method for mining frequent itemsets in massive data.

Background Art

Frequent itemset mining has long been one of the most active areas of data mining. It has a very wide range of real-world applications; for example, it is widely used in research fields such as data mining, software error detection, spatiotemporal data analysis, and biological analysis. Because of its practical significance, frequent itemset mining has attracted extensive attention.

In the field of data storage, data is usually stored in read-only/append-only mode, and the entire transaction data set can be divided into two parts: the original transaction data set and the newly added transaction data set. At a certain time or under a certain condition, the data in the newly added transaction data set is merged into the original transaction data set; the original transaction data set then grows, while the newly added transaction data set is emptied because its data has been merged. When new data is written, it is written into the newly added transaction data set; when the time or condition is met again, the newly written data is again merged into the original transaction data set, and the newly added transaction data set continues to wait for new data, and so on. It can be seen that, under read-only/append-only storage, the transaction data set as a whole always consists of the original transaction data set and the newly added transaction data set.

Over the years, researchers at home and abroad have proposed many related algorithms. Existing algorithms fall into two categories: those based on candidate generation and those based on pattern growth. Candidate-generation algorithms first generate candidate itemsets, then scan the database to verify the candidates and identify the frequent itemsets among them. They also exploit anti-monotonicity to prune the search space. However, such algorithms need multiple passes over the database, which incurs high I/O overhead when processing massive data. Pattern-growth algorithms do not generate candidate itemsets directly; instead, they build a special tree-based data structure that stores the necessary information about the frequent itemsets in the database. With this data structure, frequent itemsets can be computed efficiently. However, building the data structure is complicated, and when processing massive data the memory requirement usually exceeds the available memory, so the data structure cannot be built correctly in memory.

To this end, the present invention provides a method for mining frequent itemsets in massive data, which is used to mine frequent itemsets in massive data stored in read-only/append-only mode.

Summary of the Invention

In view of the above shortcomings of the prior art, the present invention provides a method for mining frequent itemsets in massive data, which is used to mine frequent itemsets in massive data stored in read-only/append-only mode, so as to improve the rate at which frequent itemsets are mined from massive data.

The present invention provides a method for mining frequent itemsets in massive data. The method is used to mine the frequent itemsets in a total transaction data set T that satisfy a global minimum support minsup, where the global minimum support minsup is a preset minimum support on the total transaction data set T;

The total transaction data set T includes the original transaction data set T_O and the newly added transaction data set T_Δ;

The method for mining frequent itemsets in massive data includes the following steps:

Mine the original transaction data set T_O with a frequent itemset mining algorithm to obtain all the local frequent itemsets corresponding to T_O;

Scan the original transaction data set T_O and calculate the support count on T_O of each local frequent itemset obtained above; filter the obtained local frequent itemsets according to the local minimum support ω to obtain the local frequent itemsets whose support is not less than ω, and write these local frequent itemsets and their calculated support counts into the file F_qf;

Read the newly added transaction data set T_Δ and judge whether T_Δ is empty:

If yes, filter the local frequent itemsets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, and obtain and output the local frequent itemsets whose support count is not less than the global minimum support count n×minsup; the output local frequent itemsets are all the frequent itemsets on T that satisfy the global minimum support minsup;

If no, mine the frequent itemsets on the total transaction data set T using the incremental update method;

Here, the local minimum support ω is a preset minimum support on the original transaction data set T_O, and the local minimum support ω < the global minimum support minsup.

Further, the incremental update method includes the steps:

Scan the newly added transaction data set T_Δ, calculate the support count on T_Δ of each itemset in T_Δ, store each itemset together with its calculated support count in the array cnt_Δ, and denote the largest support count in cnt_Δ as mas_Δ;

Scan the current file F_qf and, for each local frequent itemset scanned from it, judge whether it is a frequent itemset on the total transaction data set T that satisfies the preset global minimum support minsup. If so, output the scanned local frequent itemset and record it as a first frequent itemset. If not, determine, based on mas_Δ, whether the scanned local frequent itemset is certainly not a frequent itemset on T that satisfies minsup; if that cannot be established, write the scanned local frequent itemset and its support count into the set ST_CAD;

Based on the array cnt_Δ, calculate and update the support counts on the total transaction data set T of those local frequent itemsets in ST_CAD that occur in T_Δ, obtaining the updated set ST_CAD;

Traverse the updated set ST_CAD and judge, for each local frequent itemset traversed, whether it is a frequent itemset on the total transaction data set T that satisfies the preset global minimum support minsup; if so, output that local frequent itemset;

Then judge whether the expression (n_O×ω-1)+mas_Δ < (n×minsup) holds:

If yes, frequent itemset mining ends;

If no, continue to mine new frequent itemsets in the newly added transaction data set T_Δ and output the new frequent itemsets that are mined;

Here, each new frequent itemset has a support count greater than zero on the original transaction data set T_O, satisfies the global minimum support minsup on the total transaction data set T, and differs from all the frequent itemsets output above;

n_O is the number of transactions in the newly added transaction data set T_Δ.
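
The termination test above can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes that n_O denotes the transaction count of the original data set T_O, as the subscript suggests, and that ω and minsup are given as ratios; the function name is hypothetical.

```python
def needs_further_mining(n_o, omega, mas_delta, n, minsup):
    """Return True when T_delta may still hide new frequent itemsets.

    Assumption: an itemset absent from F_qf has a support count on T_O
    of at most n_o * omega - 1, and a count on T_delta of at most
    mas_delta. If even that upper bound stays below the global
    threshold n * minsup, no new frequent itemset can exist.
    """
    upper_bound = (n_o * omega - 1) + mas_delta
    mining_ends = upper_bound < n * minsup
    return not mining_ends
```

When the function returns False, the expression holds and mining can stop without touching T_Δ again.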

Further, mining new frequent itemsets in the newly added transaction data set T_Δ and outputting the new frequent itemsets that are mined includes:

Calculate, by formula, the minimum support minsup_Δ of the newly added transaction data set T_Δ, where n_Δ is the number of transactions in T_Δ;

Divide the transactions in the newly added transaction data set T_Δ into multiple target transaction data sets;

Using the Eclat algorithm, perform local frequent itemset mining on each target transaction data set to obtain, for each target transaction data set, all the local frequent itemsets that satisfy the minimum support minsup_Δ calculated above;

Denote the set of all local frequent itemsets obtained above for the target transaction data sets as LF_Δ; traverse LF_Δ and delete the local frequent itemsets that already appear in the file F_qf, obtaining the candidate set GF_Δ; and, based on the array cnt_Δ, store in GF_Δ the support count on T_Δ of each local frequent itemset in GF_Δ;

Scan the current original transaction data set T_O and increase the support counts of the local frequent itemsets in the candidate set GF_Δ accordingly, obtaining a new candidate set GF_Δ;

Scan the new candidate set GF_Δ and judge, for each local frequent itemset scanned, whether it is a frequent itemset that satisfies the preset global minimum support minsup; if so, output that local frequent itemset.

Further, using the Eclat algorithm to perform local frequent itemset mining on each target transaction data set, so as to obtain for each target transaction data set all the local frequent itemsets that satisfy the minimum support minsup_Δ calculated above, includes:

P0. Traverse the target transaction data sets;

P1. For the currently traversed target transaction data set:

P11. Using the Eclat algorithm, compute the candidate frequent k-itemsets of the currently traversed target transaction data set; when a generated candidate frequent k-itemset satisfies the minimum support minsup_Δ calculated above, record it as a frequent k-itemset and store it in the set LF_k,Δ, where k≥1;

P12. Generate candidate frequent (k+1)-itemsets by merging pairs of frequent k-itemsets in LF_k,Δ whose first k-1 items are the same and whose last items differ; when a generated candidate frequent (k+1)-itemset satisfies the minimum support minsup_Δ calculated above, record it as a frequent (k+1)-itemset and store it in the set LF_k+1,Δ;

P13. Repeat steps P11-P12, increasing k by 1 each time, until no new candidate frequent itemset can be generated for the currently traversed target transaction data set; then perform step P14;

P14. Continue with the next target transaction data set and repeat steps P11-P13 until all target transaction data sets have been traversed, thereby obtaining, for each target transaction data set, all the local frequent itemsets that satisfy the minimum support minsup_Δ calculated above.
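
Steps P11-P13 for a single target transaction data set can be sketched as below. This is an illustrative sketch of level-wise Eclat over a vertical (item to transaction-id set) layout, not the patent's own code. The function name is hypothetical, itemsets are represented as sorted tuples, and `min_count` is an absolute support count standing in for minsup_Δ, whose formula is not reproduced in the text.

```python
from itertools import combinations


def eclat_levelwise(transactions, min_count):
    """Mine frequent itemsets of one partition; returns {itemset: count}."""
    # Vertical layout: each 1-itemset maps to the ids of its transactions.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault((item,), set()).add(tid)

    # P11 for k = 1: keep the frequent 1-itemsets.
    level = {s: t for s, t in tidsets.items() if len(t) >= min_count}
    frequent = {}
    while level:
        frequent.update({s: len(t) for s, t in level.items()})
        next_level = {}
        # P12: join two k-itemsets sharing the first k-1 items.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:
                tids = level[a] & level[b]  # support via tidset intersection
                if len(tids) >= min_count:
                    next_level[a + (b[-1],)] = tids
        # P13: move to k + 1 until no new itemset is produced.
        level = next_level
    return frequent
```

Step P14 would call this function once per target transaction data set and collect the results into LF_Δ.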

Further, between step P11 and step P12, the method also includes step S: a fine-pruning step for the set LF_k,Δ;

The fine-pruning step for the set LF_k,Δ includes:

Divide the frequent k-itemsets in the set LF_k,Δ into groups according to whether their first (k-1) items are the same, obtaining a corresponding number of itemset groups, where the frequent k-itemsets in the same group share the same first (k-1) items;

Count the number of frequent k-itemsets in each itemset group and judge whether the count equals 1; if so, delete that itemset group (a group containing exactly one frequent k-itemset) and delete from LF_k,Δ the itemset identical to the frequent k-itemset in that group;

For each remaining itemset group, determine whether the union of every two frequent k-itemsets in the group is contained in the file F_qf; if so, delete that itemset group and delete from LF_k,Δ the itemsets identical to the frequent k-itemsets in that group.
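
The fine-pruning of LF_k,Δ can be sketched as follows, as a hedged illustration only. Itemsets are assumed to be sorted tuples, `known_in_fqf` is a hypothetical in-memory set standing in for the contents of the file F_qf, and the function name is invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def prune_level(lf_k, known_in_fqf):
    """Return the frequent k-itemsets that still need to be joined."""
    # Group the k-itemsets by their first k-1 items.
    groups = defaultdict(list)
    for itemset in lf_k:
        groups[itemset[:-1]].append(itemset)

    kept = set()
    for members in groups.values():
        # A group with a single member cannot produce a (k+1)-candidate.
        if len(members) == 1:
            continue
        # If every pairwise union is already recorded in F_qf,
        # joining this group would yield nothing new.
        unions = (tuple(sorted(set(a) | set(b)))
                  for a, b in combinations(members, 2))
        if all(u in known_in_fqf for u in unions):
            continue
        kept.update(members)
    return kept
```

Both rules only discard itemsets whose joins are either impossible or already known, so the later join step P12 has fewer pairs to examine.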

Further, before calculating and updating, based on the array cnt_Δ, the support counts on the total transaction data set T of the local frequent itemsets in ST_CAD that occur in T_Δ to obtain the updated set ST_CAD, the method also includes a fine-pruning step for the set ST_CAD;

The fine-pruning step for the set ST_CAD includes a first-stage reduction step;

The first-stage reduction step includes:

Traverse the file F_qf and the array cnt_Δ and, for each 1-itemset traversed in F_qf, calculate the sum of its support count in F_qf and the support count of the same 1-itemset in cnt_Δ; if the sum is less than the global minimum support count n×minsup, remove that 1-itemset from the set ST_CAD.
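
A minimal sketch of the first-stage reduction is given below. It assumes that F_qf, cnt_Δ and ST_CAD are held as dictionaries mapping sorted item tuples to support counts; these in-memory structures and the function name are illustrative assumptions, since the text stores F_qf in a file.

```python
def first_stage_prune(fqf, cnt_delta, st_cad, n, minsup):
    """Drop from ST_CAD the 1-itemsets that cannot reach n * minsup."""
    threshold = n * minsup
    pruned = dict(st_cad)  # leave the caller's ST_CAD untouched
    for itemset, count_o in fqf.items():
        if len(itemset) != 1:
            continue
        # Exact count on T = count on T_O (from F_qf) + count on T_delta.
        total = count_o + cnt_delta.get(itemset, 0)
        if total < threshold:
            pruned.pop(itemset, None)
    return pruned
```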

Further, the fine-pruning step for the set ST_CAD also includes a second-stage reduction step;

The second-stage reduction step includes:

Construct a PIP array;

Traverse the set ST_CAD as reduced by the first-stage reduction step and, for each local frequent itemset in it, select the two items of that itemset with the smallest support counts to form the item pair of that itemset; store all item pairs in the PIP array constructed above;

Calculate the support count on the newly added transaction data set T_Δ of each item pair in the PIP array;

For each item pair in the PIP array, compare the sum of the item pair's support count on T_Δ and the corresponding itemset's support count on T_Δ with the global minimum support count n×minsup, and delete from the set ST_CAD the local frequent itemsets corresponding to the item pairs for which this sum is less than n×minsup.

Further, when the newly added transaction data set T_Δ is updated, the original transaction data set T_O is updated to the sum of the previous T_O and the previous T_Δ, and the updated T_Δ is not empty, the method for mining frequent itemsets in massive data further includes an update-mining step;

The update-mining step includes:

Update the number n of transactions in the total transaction data set T to the sum of the numbers of transactions in the previous original transaction data set T_O and the previous newly added transaction data set T_Δ;

Obtain all the local frequent itemsets of the previous T_O obtained above, calculate the support count of each of them on the previous T_O, and write the results into a file F_qf,O;

Based on the array cnt_Δ of the previous T_Δ, add to the support count of each local frequent itemset in F_qf,O its support count on the previous T_Δ; then filter the local frequent itemsets in F_qf,O according to the local minimum support ω, obtaining the local frequent itemsets whose support on the updated total transaction data set T is not less than ω;

Then write the obtained local frequent itemsets whose support is not less than ω, together with their support counts in F_qf,O, into a new file F_qf;

Obtain the set LF_Δ of the previous T_Δ and delete from it the itemsets that exist in the file F_qf,O, obtaining a new set LF_Δ;

Based on the array cnt_Δ of the previous T_Δ, write into the new set LF_Δ the support count on the previous T_Δ of each itemset in the new set LF_Δ;

Then filter the itemsets in the new set LF_Δ according to the local minimum support ω, obtain the itemsets whose support is not less than ω, and write these itemsets and the support counts recorded for them in the new set LF_Δ into the new file F_qf;

Then replace the previous file F_qf with the new file F_qf, replace the previous T_O with the sum of the previous T_O and the previous T_Δ, and replace the previous T_Δ with the updated T_Δ; based on the number n of transactions in the updated total transaction data set T, mine the frequent itemsets on the updated T using the incremental update method.

Further, filtering the local frequent itemsets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, to obtain the local frequent itemsets whose support count is not less than the global minimum support count n×minsup, is specifically:

Scan the file F_qf sequentially;

For each local frequent itemset scanned from F_qf, judge whether its support count is greater than or equal to the global minimum support count n×minsup:

If yes, the scanned local frequent itemset is one of the filtered local frequent itemsets whose support count is not less than the global minimum support count n×minsup.
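
The filter for the empty-T_Δ case can be sketched in a few lines. Here F_qf is assumed to be available as a dictionary of itemset to support count (an in-memory stand-in for the sequentially scanned file), and the function name is hypothetical.

```python
def filter_global(fqf, n, minsup):
    """Keep the itemsets whose count reaches the global count n * minsup."""
    threshold = n * minsup
    return {itemset: count for itemset, count in fqf.items()
            if count >= threshold}
```

Because T_Δ is empty, the counts in F_qf are already the counts on T, so no further scan is needed.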

Further, the step of scanning the current file F_qf, judging for each scanned local frequent itemset whether it is a frequent itemset on the total transaction data set T that satisfies the preset global minimum support minsup, outputting it as a first frequent itemset if so, and otherwise determining based on mas_Δ whether it is certainly not such a frequent itemset and, if that cannot be established, writing it and its support count into the set ST_CAD, specifically includes:

Scan the file F_qf sequentially;

For each local frequent itemset scanned from F_qf, judge whether its support count is greater than or equal to the global minimum support count n×minsup:

If yes, output the scanned local frequent itemset; the output local frequent itemset is a frequent itemset on the total transaction data set T that satisfies the preset global minimum support minsup;

If no, judge whether the sum of the support count of the scanned local frequent itemset and mas_Δ is less than the global minimum support count n×minsup. If so, the scanned local frequent itemset is certainly not a frequent itemset on T that satisfies minsup; otherwise, write the scanned local frequent itemset and its scanned support count into the set ST_CAD.
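
The three-way decision described above can be sketched as follows. This is an illustration under the assumption that F_qf is available as a dictionary of itemset to support count on T_O; the function name and return shape are hypothetical.

```python
def classify_fqf(fqf, mas_delta, n, minsup):
    """Split F_qf into first frequent itemsets and the candidates ST_CAD."""
    threshold = n * minsup
    first_frequent, st_cad = {}, {}
    for itemset, count in fqf.items():
        if count >= threshold:
            # Already frequent on T from its T_O count alone.
            first_frequent[itemset] = count
        elif count + mas_delta < threshold:
            # Even the largest possible gain from T_delta is not enough.
            continue
        else:
            # Undecided: keep as a candidate for the cnt_delta update.
            st_cad[itemset] = count
    return first_frequent, st_cad
```

Only the undecided itemsets in ST_CAD need their counts on T_Δ looked up afterwards.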

The beneficial effects of the present invention are:

(1) The method for mining frequent itemsets in massive data provided by the present invention uses the file F_qf, the set ST_CAD and the array cnt_Δ and reuses them throughout the mining process. This avoids, to a certain extent, traversing the original transaction data set T_O and the newly added transaction data set T_Δ, reduces the computational overhead, and thus improves the mining rate of frequent itemsets to a certain extent.

(2) The method for mining frequent itemsets in massive data provided by the present invention includes a fine-pruning step for the set ST_CAD, so that ST_CAD is further reduced before it is used in subsequent calculations, thereby reducing I/O overhead and computational overhead.

(3) The method for mining frequent itemsets in massive data provided by the present invention proposes a concrete incremental update strategy that uses existing computed information, such as the array cnt_Δ and the set LF_Δ, to accelerate the update operation, which helps improve the performance and practicality of frequent itemset mining on massive data.

In addition, the present invention is reliable in design principle and simple in structure, and has very broad application prospects.

Brief Description of the Drawings

In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a method according to an embodiment of the present invention.

Detailed Description of the Embodiments

In order to enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

Embodiment 1:

FIG. 1 is a schematic flowchart of a method according to an embodiment of the present invention. The method of FIG. 1 may be executed by a computing node, a server, or an ordinary PC. The method for mining frequent itemsets in massive data is used to mine the frequent itemsets in the total transaction data set T that satisfy the global minimum support minsup, where the global minimum support minsup is a preset minimum support on the total transaction data set T, and the total transaction data set T includes the original transaction data set T_O and the newly added transaction data set T_Δ.

Referring to FIG. 1, the method for mining frequent itemsets of massive data includes:

Step 110: mine the original transaction data set T_O using a frequent itemset mining algorithm to obtain all local frequent itemsets corresponding to the original transaction data set T_O.

Specifically, the transactions in the original transaction data set T_O may be read sequentially and placed in a memory buffer; an existing frequent itemset mining algorithm is then run on the buffered data to compute local frequent itemsets, which are saved in a file F. The buffer is then emptied, the next transactions of T_O are read sequentially for the next iteration, and the local frequent itemsets computed in that iteration are likewise saved in file F. This process is repeated until all transactions in T_O have been read, at which point all local frequent itemsets corresponding to T_O have been generated and stored in file F.
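The buffered pre-computation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `mine_buffer`, `precompute`, `buffer_size` and `local_ratio` are hypothetical, the per-buffer threshold is passed in as a parameter because the text does not state it here, file F is modelled as an in-memory set, and a brute-force enumerator stands in for the "existing frequent itemset mining algorithm".

```python
from itertools import combinations


def mine_buffer(buffer, min_count, max_len=3):
    """Brute-force stand-in for an existing frequent itemset miner.

    Enumerates every itemset of up to max_len items (an assumed cap)
    and keeps those occurring in at least min_count buffered transactions.
    """
    counts = {}
    for txn in buffer:
        items = sorted(set(txn))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {iset for iset, c in counts.items() if c >= min_count}


def precompute(transactions, buffer_size, local_ratio):
    """Read T_O buffer by buffer and collect local frequent itemsets in F."""
    F = set()          # stands in for file F
    buffer = []
    for txn in transactions:
        buffer.append(txn)
        if len(buffer) == buffer_size:
            F |= mine_buffer(buffer, local_ratio * len(buffer))
            buffer = []  # empty the buffer before the next iteration
    if buffer:           # last, possibly partial, buffer
        F |= mine_buffer(buffer, local_ratio * len(buffer))
    return F
```

Because each buffer is mined independently, an itemset enters F as soon as it is frequent in at least one buffer; its support over the whole of T_O is only computed in the following step.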

For ease of description, step 110 is referred to as the pre-computation phase.

Step 120: scan the original transaction data set T_O and compute the support count on T_O of each local frequent itemset obtained in step 110; filter the obtained local frequent itemsets according to the local minimum support ω, keeping those whose support is not less than ω; and write each retained local frequent itemset, together with its computed support count, into a file F_qf.

In a specific implementation, all local frequent itemsets in file F are first read into memory; the original transaction data set T_O is then scanned sequentially once, and the support count of each local frequent itemset held in memory is computed; finally, the local frequent itemsets held in memory are filtered according to the local minimum support ω, and those whose support is not less than ω are stored in file F_qf.

The storage schema of file F_qf is F_qf(IS, SUP), where IS denotes an itemset and SUP denotes the support count of itemset IS; the local frequent itemsets in file F_qf are sorted in descending order of support count.
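A minimal sketch of this refinement step is given below, assuming that ω is a ratio so that the count threshold is ω × n_O, and modelling file F_qf as a list of (IS, SUP) rows. The function name `refine` and the tie-breaking order among equal counts are illustrative assumptions.

```python
def refine(candidates, T_O, omega):
    """Count each candidate on T_O, keep those with support >= omega,
    and return F_qf rows (IS, SUP) in descending order of support count."""
    txns = [frozenset(t) for t in T_O]
    threshold = omega * len(txns)      # assumed: omega is a ratio of n_O
    rows = []
    for iset in candidates:
        s = frozenset(iset)
        sup = sum(1 for t in txns if s <= t)   # support count on T_O
        if sup >= threshold:
            rows.append((tuple(sorted(iset)), sup))
    # descending SUP; ties broken by itemset only to make the order stable
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows
```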

For ease of description, step 120 is referred to as the refinement phase.

Step 130: read the newly added transaction data set T_Δ and determine whether T_Δ is empty:

Here, filtering the local frequent itemsets in file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, to obtain the local frequent itemsets whose support count is not less than the global minimum support count n × minsup, specifically includes:

sequentially scanning file F_qf;

if so, the local frequent itemset scanned from file F_qf is a local frequent itemset whose support count is not less than the global minimum support count n × minsup (that is, a frequent itemset on the total transaction data set T).
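For the case where T_Δ is empty, the filtering of file F_qf can be sketched as below. The function name `filter_global` is hypothetical. The early `break` is an optional shortcut that is not stated in the text; it is only valid because the rows of F_qf are kept in descending order of support count.

```python
def filter_global(f_qf_rows, n, minsup):
    """Return the F_qf rows whose support count is >= n * minsup.

    f_qf_rows must be sorted by descending support count, as F_qf is.
    """
    threshold = n * minsup
    frequent = []
    for iset, sup in f_qf_rows:        # sequential scan of F_qf
        if sup < threshold:
            break                      # assumed shortcut: later rows are smaller
        frequent.append((iset, sup))
    return frequent
```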

When the newly added transaction data set T_Δ is read, one transaction of T_Δ is read first, then the items in that transaction are read, and once all items of that transaction have been read, the next transaction of T_Δ is read. Here, t_Δ denotes the transaction of T_Δ currently being read and i denotes an item in t_Δ; each time an item i of T_Δ is read, the count of i (initially 0) is increased by 1.
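The item counting described above can be sketched as follows. The result corresponds to the array cnt_Δ, and the largest count corresponds to mas_Δ as defined in the claims. The function name `count_items` is hypothetical, and the sketch assumes that a transaction does not list the same item twice.

```python
from collections import Counter


def count_items(T_delta):
    """Scan T_delta once; return (cnt_delta, mas_delta)."""
    cnt_delta = Counter()
    for t_delta in T_delta:        # one transaction at a time
        for i in t_delta:          # then each item of that transaction
            cnt_delta[i] += 1      # counts start at 0
    mas_delta = max(cnt_delta.values(), default=0)
    return cnt_delta, mas_delta
```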

Preferably, the incremental update method (that is, the method used to mine the frequent itemsets on the total transaction data set T) includes the following steps:

sequentially scanning the current file F_qf and, for each local frequent itemset scanned from it, judging whether it is a frequent itemset that satisfies the preset global minimum support minsup on the total transaction data set T; if so, outputting the currently scanned local frequent itemset and recording it as a first frequent itemset; if not, determining, based on mas_Δ, whether the currently scanned local frequent itemset is certainly not a frequent itemset satisfying minsup on T, and, if it cannot be ruled out in this way, writing the currently scanned local frequent itemset and its support count into the set ST_CAD;

if so, frequent itemset mining ends;

In this embodiment, the above step of sequentially scanning the current file F_qf, outputting as first frequent itemsets the scanned local frequent itemsets that satisfy the preset global minimum support minsup on the total transaction data set T, and writing into the set ST_CAD, together with their support counts, the scanned local frequent itemsets that fail this test but cannot be ruled out on the basis of mas_Δ, specifically includes:

sequentially scanning file F_qf;

It should be noted that, in the present invention, the local frequent itemsets in file F_qf fall into three parts: (1) those that are definitely frequent itemsets of the total transaction data set T; (2) those that are definitely not frequent itemsets of T; and (3) those that may be frequent itemsets of T. Let t denote the local frequent itemset currently read. If t already satisfies the global minimum support minsup, then t is a frequent itemset of T. If t still could not satisfy minsup even on the assumption that every transaction in T_Δ contains t, then t is definitely not a frequent itemset of T. In all other cases t may be a frequent itemset of T and requires further verification; the present invention stores such a t in the set ST_CAD. Thus the present invention classifies the frequent itemsets over the whole total transaction data set T and mines each class separately, which helps to improve mining efficiency to a certain extent.
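The three-way split can be sketched as follows. This sketch assumes that mas_Δ (the largest item count in cnt_Δ) is used as the upper bound on how much support any itemset can gain from T_Δ, which is how the scanning step refers to it; the function name `classify` and the in-memory representation of ST_CAD are hypothetical.

```python
def classify(f_qf_rows, n, minsup, mas_delta):
    """Split F_qf rows into (frequent, st_cad); the rest are discarded.

    frequent: support on T_O alone already reaches n * minsup.
    st_cad:   could still reach n * minsup with help from T_delta.
    """
    threshold = n * minsup
    frequent, st_cad = [], []
    for t, sup in f_qf_rows:
        if sup >= threshold:
            frequent.append((t, sup))          # part (1): definitely frequent
        elif sup + mas_delta >= threshold:
            st_cad.append((t, sup))            # part (3): needs verification
        # otherwise part (2): cannot be frequent even in the best case
    return frequent, st_cad
```

Only the itemsets placed in `st_cad` need their support in T_Δ to be counted, which is where the saving over a full rescan comes from.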

Preferably, in this embodiment, mining new frequent itemsets in the newly added transaction data set T_Δ and outputting the mined new frequent itemsets includes:

calculating, by a formula, the minimum support minsup_Δ of the newly added transaction data set T_Δ, where n_Δ is the number of transactions in T_Δ, n is the number of transactions in the total transaction data set T, and n_O is the number of transactions in the original transaction data set T_O;

In this embodiment, using the Eclat algorithm to mine local frequent itemsets on each target transaction data set, so as to obtain, for each target transaction data set, all local frequent itemsets that satisfy the minimum support minsup_Δ calculated above, includes:
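As a rough illustration of Eclat-style mining on one target transaction data set, the sketch below uses the vertical layout typical of Eclat: each item is mapped to the set of transaction ids containing it, and itemsets sharing a prefix are extended by intersecting those sets. It is a generic textbook-style sketch, not the step-by-step procedure of this embodiment; the function name `eclat` and the absolute threshold `min_count` are assumptions.

```python
def eclat(transactions, min_count):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    # vertical layout: item -> set of ids of transactions containing it
    tidlists = {}
    for tid, txn in enumerate(transactions):
        for item in set(txn):
            tidlists.setdefault(item, set()).add(tid)

    result = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs that are frequent together with prefix
        for idx, (item, tids) in enumerate(candidates):
            iset = prefix + (item,)
            result[iset] = len(tids)
            nxt = []
            for other, other_tids in candidates[idx + 1:]:
                common = tids & other_tids      # support of iset + (other,)
                if len(common) >= min_count:
                    nxt.append((other, common))
            if nxt:
                extend(iset, nxt)

    start = sorted(
        ((item, tids) for item, tids in tidlists.items() if len(tids) >= min_count),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return result
```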

It should be noted that, referring to FIG. 1:

each t in the figure is the local frequent itemset in file F_qf currently scanned by the corresponding step, and each t.SUP is the support count recorded in file F_qf for that itemset t;

"s" in the figure is the itemset in the updated set ST_CAD currently scanned by the corresponding step, and "s.SUP" is the support count of itemset s in the updated set ST_CAD;

"r" in the figure is the itemset in the new candidate set GF_Δ currently scanned by the corresponding step, and "r.SUP" is the support count of itemset r in the new candidate set GF_Δ.

It should further be noted that, as can be seen from FIG. 1:

when the newly added transaction data set T_Δ is determined to be empty, the itemsets t output in FIG. 1 are all the frequent itemsets on the total transaction data set T mined by the method of the present invention;

when the newly added transaction data set T_Δ is determined to be non-empty, the itemsets t, s and r output in FIG. 1 are together all the frequent itemsets on the total transaction data set T mined by the method of the present invention.

It should also be noted that, when the present invention is implemented, the values of minsup and ω may be chosen empirically by those skilled in the art; for example, minsup may be 0.2 and ω may be 0.1.

In summary, the method for mining frequent itemsets of massive data provided by the present invention uses the file F_qf, the set ST_CAD and the array cnt_Δ, and reuses them throughout the mining process. This avoids, to a certain extent, traversing the original transaction data set T_O and the newly added transaction data set T_Δ, reduces the computational overhead, and can therefore improve the mining rate of frequent itemsets to a certain extent.

Embodiment 2:

Compared with Embodiment 1, the difference is that, to further improve the mining rate of the present invention, the method for mining frequent itemsets of massive data in Embodiment 2 further includes, between the above step P11 and step P12, a step S: a pruning step for the set LF_k,Δ;

In addition, to further improve the mining rate of the present invention, this embodiment further includes a pruning step for the set ST_CAD, performed before the support counts on the total transaction data set T of the local frequent itemsets in ST_CAD that exist in the newly added transaction data set T_Δ are calculated and updated on the basis of the array cnt_Δ to obtain the updated set ST_CAD;
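The first-stage reduction set out in the claims (comparing, for each 1-itemset in F_qf, the sum of its support count in F_qf and its count in cnt_Δ with n × minsup) can be sketched as below. The function name `first_stage_prune` and the dictionary form of ST_CAD are hypothetical. Removing every itemset that contains a failing item, rather than only the 1-itemset itself, is an assumption of this sketch; it relies on the fact that a superset cannot be more frequent than any of its items.

```python
def first_stage_prune(st_cad, f_qf_rows, cnt_delta, n, minsup):
    """Drop from st_cad the itemsets containing an item that cannot be frequent.

    st_cad:    {itemset tuple: support count on T_O}
    f_qf_rows: iterable of (itemset tuple, support count on T_O)
    cnt_delta: {item: count in T_delta}
    """
    threshold = n * minsup
    weak_items = set()
    for iset, sup in f_qf_rows:
        # only 1-itemsets are checked; their exact support on T is known
        if len(iset) == 1 and sup + cnt_delta.get(iset[0], 0) < threshold:
            weak_items.add(iset[0])
    return {
        iset: sup
        for iset, sup in st_cad.items()
        if not weak_items.intersection(iset)
    }
```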

In addition, to further improve the mining rate of the present invention, the pruning step for the set ST_CAD further includes a second-stage reduction step;

constructing a PIP array;
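A possible reading of the PIP-based second-stage reduction, following the steps listed in the claims, is sketched below. Several points are assumptions of this sketch rather than statements of the text: the "two items with the smallest support counts" are chosen by their counts in cnt_Δ, the itemset's own support count is taken from ST_CAD, itemsets with fewer than two items are left untouched, and the names `second_stage_prune` and `pip` are hypothetical.

```python
def second_stage_prune(st_cad, T_delta, cnt_delta, n, minsup):
    """Prune st_cad using one representative item pair per itemset."""
    threshold = n * minsup
    txns = [frozenset(t) for t in T_delta]

    # build the PIP mapping: itemset -> pair of its two rarest items
    pip = {}
    for iset in st_cad:
        if len(iset) >= 2:
            a, b = sorted(iset, key=lambda it: (cnt_delta.get(it, 0), it))[:2]
            pip[iset] = frozenset((a, b))

    # support count of every distinct item pair in T_delta
    pair_sup = {
        pair: sum(1 for t in txns if pair <= t)
        for pair in set(pip.values())
    }

    kept = {}
    for iset, sup in st_cad.items():
        pair = pip.get(iset)
        # the pair's support bounds the itemset's support in T_delta from above
        if pair is not None and sup + pair_sup[pair] < threshold:
            continue
        kept[iset] = sup
    return kept
```

Counting pairs is cheaper than counting every candidate itemset, yet an itemset can never occur more often in T_Δ than its rarest pair, so the bound is safe.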

In summary, the method for mining frequent itemsets of massive data provided by the present invention further includes a pruning step for the set ST_CAD, so that ST_CAD is further reduced before it is used in subsequent computation, thereby reducing I/O overhead and computational overhead.

Embodiment 3:

Compared with Embodiment 2, the difference is that the method for mining frequent itemsets of massive data in Embodiment 3 further includes an update mining step, performed when the newly added transaction data set T_Δ is updated, the original transaction data set T_O is updated to the sum of the previous original transaction data set T_O and the previous newly added transaction data set T_Δ, and the updated newly added transaction data set T_Δ is non-empty.

Specifically, the update mining step in this embodiment includes:

It can be seen that the method for mining frequent itemsets of massive data provided by the present invention proposes a concrete incremental update strategy that uses existing computed information, such as the array cnt_Δ and the set LF_Δ, to speed up the update operation, which helps to improve the performance and practicality of mining frequent itemsets of massive data.

It should be noted that identical or similar parts of the embodiments in this specification may be cross-referenced.

Although the present invention has been described in detail with reference to the accompanying drawings and the preferred embodiments, the present invention is not limited thereto. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art may make various equivalent modifications or substitutions to the embodiments, and any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for mining frequent itemsets of massive data, used to mine the frequent itemsets in a total transaction data set T that satisfy a global minimum support minsup, wherein the global minimum support minsup is a preset minimum support on the total transaction data set T; characterized in that,

the total transaction data set T comprises an original transaction data set T_O and a newly added transaction data set T_Δ;

The method for mining frequent itemsets of massive data comprises the following steps:

mining the original transaction data set T_O using a frequent itemset mining algorithm to obtain all local frequent itemsets corresponding to the original transaction data set T_O;

scanning the original transaction data set T_O and calculating the support count on T_O of each local frequent itemset obtained in the preceding step; filtering the obtained local frequent itemsets according to a local minimum support ω to obtain each local frequent itemset whose support is not less than ω; and writing each such local frequent itemset, together with its calculated support count, into a file F_qf;

reading the newly added transaction data set T_Δ and judging whether the newly added transaction data set T_Δ is empty:

if so, filtering the local frequent itemsets in file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, to obtain and output the local frequent itemsets whose support count is not less than the global minimum support count n × minsup, the output local frequent itemsets being all the frequent itemsets on the total transaction data set T that satisfy the global minimum support minsup;

if not, mining the frequent itemsets on the total transaction data set T using an incremental update method;

wherein the local minimum support ω is a preset minimum support on the original transaction data set T_O, and the local minimum support ω is less than the global minimum support minsup.

2. The method for mining frequent itemsets of massive data according to claim 1, wherein the incremental update method comprises the following steps:

scanning the newly added transaction data set T_Δ, calculating the support count in T_Δ of each item of T_Δ, storing each item of T_Δ together with its calculated support count in T_Δ into an array cnt_Δ, and recording the maximum support count in the array cnt_Δ as mas_Δ;

scanning the current file F_qf and, for each local frequent itemset scanned from the current file F_qf, judging whether it is a frequent itemset satisfying the preset global minimum support minsup on the total transaction data set T; if so, outputting the currently scanned local frequent itemset and recording it as a first frequent itemset; if not, judging, based on mas_Δ, whether the currently scanned local frequent itemset is certainly not a frequent itemset satisfying the preset global minimum support minsup on the total transaction data set T, and if it is not, writing the currently scanned local frequent itemset and its support count into a set ST_CAD;

based on the array cnt_Δ, calculating and updating the support count on the total transaction data set T of each local frequent itemset in the set ST_CAD that exists in the newly added transaction data set T_Δ, to obtain an updated set ST_CAD;

traversing the updated set ST_CAD and judging, for each local frequent itemset traversed, whether it is a frequent itemset satisfying the preset global minimum support minsup on the total transaction data set T; if so, outputting that local frequent itemset of the updated set ST_CAD;

then judging whether the expression relating (n_O × ω - 1) + mas_Δ to (n × minsup) holds:

if so, ending the frequent itemset mining;

otherwise, continuing to mine new frequent itemsets in the newly added transaction data set T_Δ and outputting the mined new frequent itemsets;

wherein a new frequent itemset has a support greater than zero on the original transaction data set T_O, satisfies the global minimum support minsup on the total transaction data set T, and differs from all frequent itemsets already output;

n_O is the number of transactions in the original transaction data set T_O.

3. The method for mining frequent itemsets of massive data according to claim 2, wherein mining new frequent itemsets in the newly added transaction data set T_Δ and outputting the mined new frequent itemsets comprises:

calculating, by a formula, the minimum support minsup_Δ of the newly added transaction data set T_Δ, where n_Δ is the number of transactions in the newly added transaction data set T_Δ;

splitting the transactions in the newly added transaction data set T_Δ into a plurality of target transaction data sets;

using the Eclat algorithm to mine local frequent itemsets on each target transaction data set, to obtain, for each target transaction data set, all local frequent itemsets that satisfy the minimum support minsup_Δ calculated above;

denoting the set of all local frequent itemsets corresponding to the target transaction data sets as LF_Δ, traversing LF_Δ and deleting from it the itemsets that already appear in file F_qf, to obtain a candidate set GF_Δ; and, based on the array cnt_Δ, storing in the candidate set GF_Δ the support count in the newly added transaction data set T_Δ of each local frequent itemset in GF_Δ;

scanning the current original transaction data set T_O, and adding to and updating the support counts of the local frequent itemsets in the candidate set GF_Δ, to obtain a new candidate set GF_Δ;

scanning the new candidate set GF_Δ and judging, for each local frequent itemset scanned, whether it is a frequent itemset satisfying the preset global minimum support minsup; if so, outputting that local frequent itemset of the new candidate set GF_Δ.

4. The method for mining frequent itemsets of massive data according to claim 3, wherein using the Eclat algorithm to mine local frequent itemsets on each target transaction data set, to obtain, for each target transaction data set, all local frequent itemsets that satisfy the minimum support minsup_Δ calculated above, comprises:

P0, traversing each target transaction data set;

P1, for the currently traversed target transaction data set:

P11, using the Eclat algorithm to compute the candidate frequent k-itemsets corresponding to the currently traversed target transaction data set, and, when a generated candidate frequent k-itemset satisfies the minimum support minsup_Δ calculated above, recording it as a frequent k-itemset and storing it in a set LF_k,Δ, where k ≥ 1;

P12, generating candidate frequent (k+1)-itemsets by merging pairs of frequent k-itemsets in LF_k,Δ, and, when a generated candidate frequent (k+1)-itemset satisfies the minimum support minsup_Δ calculated above, recording it as a frequent (k+1)-itemset and storing it in a set LF_k+1,Δ, wherein the two merged frequent k-itemsets have the same first k-1 items and different last items;

P13, repeating steps P11 to P12, increasing k by 1 each time, until no new candidate frequent itemset can be generated for the currently traversed target transaction data set; then executing step P14;

P14, continuing to traverse the next target transaction data set and repeating steps P11 to P13 until all target transaction data sets have been traversed, thereby obtaining, for each target transaction data set, all local frequent itemsets that satisfy the minimum support minsup_Δ calculated above.

5. The method for mining frequent itemsets of massive data according to claim 4, further comprising, between step P11 and step P12, a step S: a pruning step for the set LF_k,Δ;

wherein the pruning step for the set LF_k,Δ comprises:

grouping the frequent k-itemsets in the set LF_k,Δ according to whether their first (k-1) items are the same, to obtain a corresponding number of itemset groups, wherein the frequent k-itemsets in the same itemset group have the same first (k-1) items;

counting the number of frequent k-itemsets in each itemset group and judging whether the counted number equals 1: if so, deleting the corresponding itemset group (that is, the group whose number of frequent k-itemsets equals 1) and deleting from the set LF_k,Δ the itemset identical to the frequent k-itemset in that group;

judging, for each itemset group that currently remains, whether the union of any two frequent k-itemsets in the group is contained in file F_qf: if so, deleting the corresponding itemset group and deleting from the set LF_k,Δ the itemsets identical to the frequent k-itemsets in that group.

6. The method for mining frequent itemsets of massive data according to claim 2, further comprising, before calculating and updating, based on the array cnt_Δ, the support count on the total transaction data set T of each local frequent itemset in the set ST_CAD that exists in the newly added transaction data set T_Δ to obtain the updated set ST_CAD, a pruning step for the set ST_CAD;

the pruning step for the set ST_CAD comprises a first-stage reduction step;

the first-stage reduction step comprises:

traversing file F_qf and the array cnt_Δ and, for each 1-itemset traversed in file F_qf, calculating the sum of its support count in file F_qf and the support count of the same 1-itemset in the array cnt_Δ; if the calculated sum is less than the global minimum support count n × minsup, removing the traversed 1-itemset of file F_qf from the set ST_CAD.

7. The method for mining frequent itemsets of massive data according to claim 6, wherein the pruning step for the set ST_CAD further comprises a second-stage reduction step;

the second-stage reduction step comprises:

building a PIP array;

traversing the set ST_CAD as reduced by the first-stage reduction step and, for each local frequent itemset in it, selecting the two items with the smallest support counts in that itemset to form the item pair of that itemset, and storing the item pair in the constructed PIP array;

calculating the support count of each item pair in the PIP array in the newly added transaction data set T_Δ;

comparing, for each item pair in the PIP array, the sum of the support count of the item pair in the newly added transaction data set T_Δ and the support count of the corresponding itemset in the newly added transaction data set T_Δ with the global minimum support count n × minsup, and deleting from the set ST_CAD the local frequent itemset corresponding to each item pair for which the sum is determined to be less than the global minimum support count n × minsup.

8. The method for mining frequent itemsets of massive data according to claim 1, 2, 3, 4, 5, 6 or 7, wherein, when the newly added transaction data set T_Δ is updated, the original transaction data set T_O is updated to the sum of the previous original transaction data set T_O and the previous newly added transaction data set T_Δ, and the updated newly added transaction data set T_Δ is non-empty, the method further comprises an update mining step;

the update mining step comprises:

updating the number n of transactions in the total transaction data set T to the sum of the numbers of transactions in the original transaction data set T_O and in the previous newly added transaction data set T_Δ;

obtaining all the local frequent itemsets corresponding to the original transaction data set T_O obtained above, calculating the support count on the original transaction data set T_O of each of these local frequent itemsets, and writing them into a file F_qf,O;

based on the array cnt_Δ corresponding to the previous newly added transaction data set T_Δ, adding to and updating the support counts, in the previous newly added transaction data set T_Δ, of the local frequent itemsets in file F_qf,O; then filtering the local frequent itemsets in file F_qf,O according to the local minimum support ω, to obtain each local frequent itemset whose support is not less than ω for the updated total transaction data set T;

then writing each local frequent itemset so obtained, together with its support count in file F_qf,O, into a new file F_qf;

obtaining the set LF_Δ corresponding to the previous newly added transaction data set T_Δ, and deleting from it the itemsets present in file F_qf,O, to obtain a new set LF_Δ;

based on the array cnt_Δ corresponding to the previous newly added transaction data set T_Δ, writing into the new set LF_Δ the support count, in the previous newly added transaction data set T_Δ, of every itemset in the new set LF_Δ;

then filtering the itemsets in the new set LF_Δ according to the local minimum support ω, obtaining the itemsets whose support is not less than ω, and writing these itemsets, together with their support counts in the new set LF_Δ, into the new file F_qf;

thereafter replacing the original file F_qf with the new file F_qf, replacing the original transaction data set T_O with the union of the original transaction data set T_O and the original newly added transaction data set T_Δ, replacing the original newly added transaction data set T_Δ with the updated newly added transaction data set T_Δ, and mining frequent itemsets on the updated total transaction data set T by the incremental updating method, based on the number n of transactions in the updated total transaction data set T.
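The update of the file F_qf recited in the steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the files F_qf,O and F_qf and the array cnt_Δ are modelled as in-memory dictionaries keyed by itemset, the function name `update_f_qf` and its parameters are hypothetical, the local minimum support ω is assumed to be a relative threshold compared as count ≥ ω × n, and itemsets remaining in LF_Δ are assumed to be filtered using their T_Δ support count alone.

```python
from typing import Dict, FrozenSet, Set, Tuple

Itemset = FrozenSet[str]


def update_f_qf(
    f_qf_o: Dict[Itemset, int],     # F_qf,O: itemset -> support count on T_O
    cnt_delta: Dict[Itemset, int],  # cnt_Δ: itemset -> support count on T_Δ
    lf_delta: Set[Itemset],         # LF_Δ: local frequent itemsets of T_Δ
    n_o: int,                       # number of transactions in T_O
    n_delta: int,                   # number of transactions in T_Δ
    omega: float,                   # local minimum support ω (assumed relative)
) -> Tuple[int, Dict[Itemset, int]]:
    """Return the updated transaction count n and the new file F_qf."""
    n = n_o + n_delta
    min_count = omega * n  # assumption: support >= ω means count >= ω × n

    new_f_qf: Dict[Itemset, int] = {}

    # Add the T_Δ support count to every itemset of F_qf,O, then filter by ω.
    for itemset, count_o in f_qf_o.items():
        total = count_o + cnt_delta.get(itemset, 0)
        if total >= min_count:
            new_f_qf[itemset] = total

    # Delete from LF_Δ the itemsets already present in F_qf,O.
    new_lf_delta = {s for s in lf_delta if s not in f_qf_o}

    # Attach T_Δ support counts to the remaining itemsets and filter by ω.
    for itemset in new_lf_delta:
        count = cnt_delta.get(itemset, 0)
        if count >= min_count:
            new_f_qf[itemset] = count

    return n, new_f_qf
```

In this sketch the returned dictionary plays the role of the new file F_qf that replaces the original one before the next incremental round.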

9. The massive data frequent itemset mining method according to claim 1, 2, 3, 4, 5, 6 or 7, characterized in that filtering the local frequent itemsets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, to obtain the local frequent itemsets whose support count is not less than the global minimum support count n × minsup, specifically comprises the following steps:

sequentially scanning the file F_qf;

judging, for each local frequent itemset scanned from the file F_qf, whether its support count is greater than or equal to the global minimum support count n × minsup;

if so, the scanned local frequent itemset in the file F_qf is a local frequent itemset whose support count is not less than the global minimum support count n × minsup.
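The filtering step of this claim amounts to a single sequential pass over F_qf. A minimal Python sketch follows, assuming the file is modelled as an iterable of (itemset, support count) pairs and that minsup is a relative threshold; the function name `filter_globally_frequent` is hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Itemset = FrozenSet[str]


def filter_globally_frequent(
    f_qf: Iterable[Tuple[Itemset, int]],  # records of F_qf, read in order
    n: int,                               # number of transactions in T
    minsup: float,                        # global minimum support (relative)
) -> Dict[Itemset, int]:
    """Keep the itemsets whose support count is at least n × minsup."""
    min_count = n * minsup  # global minimum support count
    kept: Dict[Itemset, int] = {}
    for itemset, count in f_qf:  # sequential scan of F_qf
        if count >= min_count:
            kept[itemset] = count
    return kept
```

Because the threshold n × minsup is computed once before the scan, each record of F_qf is examined exactly once.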

10. The massive data frequent itemset mining method according to claim 2, 3, 4, 5, 6 or 7, characterized in that scanning the current file F_qf; judging, for each local frequent itemset scanned from the current file F_qf, whether it satisfies the preset global minimum support minsup on the total transaction data set T; if so, outputting the currently scanned local frequent itemset and recording the output local frequent itemset as a first frequent itemset; if not, judging, based on mas_Δ, whether the currently scanned local frequent itemset is certainly not a frequent itemset satisfying the preset global minimum support minsup on the total transaction data set T; and if not, writing the currently scanned local frequent itemset and its corresponding support count into the set ST_CAD, specifically comprises the following steps:

sequentially scanning the file F_qf;

judging, for each local frequent itemset scanned from the file F_qf, whether its support count is greater than or equal to the global minimum support count n × minsup;

if so, outputting the scanned local frequent itemset, the output local frequent itemset being a frequent itemset satisfying the preset global minimum support minsup on the total transaction data set T;

if not, judging whether the support count of the scanned local frequent itemset in the file F_qf, combined with mas_Δ, is less than the global minimum support minsup; if the result of the judgment is yes, the currently scanned local frequent itemset is certainly not a frequent itemset satisfying the preset global minimum support minsup on the total transaction data set T; and if the result is no, writing the currently scanned local frequent itemset and its correspondingly scanned support count into the set ST_CAD.
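The three-way decision recited in this claim (output, discard, or keep as a candidate in ST_CAD) can be sketched in Python as follows. The sketch rests on assumptions that the claim text does not confirm: mas_Δ is treated as a non-negative integer bound that is added to the recorded support count, the comparison is made against the count threshold n × minsup, and the function name `classify_f_qf` and the in-memory data layout are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Itemset = FrozenSet[str]


def classify_f_qf(
    f_qf: Iterable[Tuple[Itemset, int]],  # records of F_qf, read in order
    n: int,                               # number of transactions in T
    minsup: float,                        # global minimum support (relative)
    mas_delta: int,                       # mas_Δ, assumed to be an additive bound
) -> Tuple[Dict[Itemset, int], Dict[Itemset, int]]:
    """Split F_qf into output frequent itemsets and the candidate set ST_CAD."""
    min_count = n * minsup
    frequent: Dict[Itemset, int] = {}  # itemsets output as frequent on T
    st_cad: Dict[Itemset, int] = {}    # undecided candidates (ST_CAD)
    for itemset, count in f_qf:        # sequential scan of F_qf
        if count >= min_count:
            frequent[itemset] = count
        elif count + mas_delta < min_count:
            # Assumed pruning test: even with mas_Δ added, the threshold
            # cannot be reached, so the itemset is discarded.
            continue
        else:
            st_cad[itemset] = count
    return frequent, st_cad
```

Under these assumptions, only the itemsets left in ST_CAD would need their exact support on T verified later.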