WO2019041628A1 - Method for mining multivariate time series association rule based on eclat - Google Patents


Info

Publication number
WO2019041628A1
Authority
WO
WIPO (PCT)
Prior art keywords
minhash
mining
matrix
data
algorithm
Application number
PCT/CN2017/115843
Other languages
French (fr)
Chinese (zh)
Inventor
张春慨
Original Assignee
哈尔滨工业大学深圳研究生院
Application filed by 哈尔滨工业大学深圳研究生院
Publication of WO2019041628A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2477 Temporal data queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data

Definitions

  • The invention belongs to the field of data mining technology and particularly relates to a method for mining association rules on large-scale data.
  • Approximate association rule mining generally proceeds in two stages: first, the massive raw data is pre-processed (compression, smoothing, denoising, linear approximation, time series segmentation, clustering, and so on); then the approximate association rule mining algorithm is run on the processed data set.
  • Traditional association rule mining algorithms target discrete data, and the rules they mine cannot reflect temporal order.
  • The first algorithm for mining association rules on time series was proposed by Das in 1998.
  • Research began with association rules mined from a single time series and was later extended to multiple time series.
  • The time series is divided into subsequences of equal length, and a symbol is then assigned to each subsequence according to its trend.
  • Later scholars applied the FP-growth algorithm to the mining of time series association rules.
  • The FP-growth algorithm is efficient and scalable. It works by pattern growth and uses an extended prefix tree, the FP-tree, as a summary structure that stores compressed key information about frequent patterns; in many cases it performs better than Apriori. Many improved algorithms were later built on it.
  • The CFP-mine algorithm is based on a compressed FP-tree and a constrained-subtree method; it reduces memory use and uses arrays to reduce the number of traversals.
  • The most classic association rule mining algorithm is the Apriori algorithm proposed by Agrawal in 1993.
  • The Apriori algorithm mines the frequent item sets from which association rules are derived. It is an iterative, level-wise search: each time candidate frequent item sets are generated, it goes through scanning, counting, comparing, joining, and pruning. However, verifying candidate frequent k-item sets requires scanning the entire data set more than once, so its time efficiency is very low.
  • The EH-Apriori mining algorithm makes two improvements to Apriori: the mining process is preceded by pre-processing, and the data of the data set is hashed into a large table. Later, Han et al. proposed the FP-growth algorithm in 2000.
  • The FP-growth algorithm builds an FP-tree with the prefix property, so that frequent patterns can be mined with only a single traversal of the database, which improves mining efficiency.
  • Experiments show that the FP-growth algorithm is an order of magnitude faster than Apriori.
  • Both Apriori and FP-growth mine data in the horizontal representation.
  • Zaki proposed the Eclat algorithm in 2000, which uses a vertical data representation to mine association rules.
  • In the vertical representation, the data set consists of items, each paired with the set of identifiers of all transactions that contain it.
  • The algorithm counts by intersection, so candidate generation and support counting are completed at the same time.
  • Practice has shown that algorithms using the vertical data representation generally outperform algorithms using the horizontal data representation.
  • LSH: locality-sensitive hashing
  • The present invention provides an Eclat-based association rule mining method that significantly speeds up association rule mining and delivers time series analysis results in a timely manner; although some mining accuracy is sacrificed, mining efficiency is greatly improved and machine memory is saved.
  • An association rule mining method based on Eclat, characterized in that the method comprises: (1) generating a vertical data set; (2) generating a MINHASH matrix, for which a parameter k must be specified, meaning that the matrix has at most k rows; (3) using the MINHASH matrix to estimate the candidate item sets in the original data set; (4) pruning the candidate sets according to the minimum support to obtain the frequent 1-item sets; (5) merging the hashed frequent 1-item sets pairwise to generate new frequent 2-item sets.
  • ∩kmin(S_i) denotes the intersection of the sets S_i in the hash matrix formed by sampling with the MinHash method.
  • The vertical data set is obtained by inverting the original transaction set.
  • Step (2) further includes releasing the vertical data set to save memory.
  • The minimum support is estimated using MinHash.
  • The method is applied to association rule mining on multivariate time series.
  • Figure 1 is a schematic diagram of the inversion process;
  • Figure 2 is a schematic diagram of generating the frequent 1-item sets;
  • Figure 3 is a schematic diagram of the sampling process;
  • Figure 4 is a schematic diagram of generating the frequent 2-item sets;
  • Figure 5 is a schematic diagram of computing a set intersection with MinHash;
  • Figure 6 is a schematic diagram of the error when computing a set intersection with MinHash;
  • Figure 7 shows the speed and accuracy of HashEclat with the minimum element number K fixed and the error E adjusted;
  • Figure 8 shows the speed and accuracy of HashEclat with the error E fixed and the minimum element number K adjusted;
  • Figure 9 compares the speed and memory of HashEclat and Eclat on T10I4D100K;
  • Figure 10 compares the speed and memory of HashEclat and Eclat on T40I10D100K;
  • Figure 11 compares the speed and memory of HashEclat and Eclat on Online Retail.
  • The feature representation of a time series extracts features from the data and transforms its dimensionality, which achieves feature dimension reduction.
  • The data in the low-dimensional space retains as much of the information in the original time series as possible.
  • The present invention studies the TEO feature representation.
  • Looking at the characteristics of time series data, the two sides of a segmentation point often show different trends, analogous to the change in gray level at the edge of an image in image processing.
  • At the edge of an image, the rate of change of the gray level changes. If the data before a certain point in the time series tends to increase and the data after that point tends to decrease, the point can be regarded as a segmentation point, that is, an edge point of the time series.
  • The TEO representation of a time series is a piecewise linear representation that combines the edge detection operator from image processing with the characteristics of time series data.
  • It is computed by convolving the designed time series edge operator with the original time series data.
  • Segmentation points are then selected from the computed edge degrees according to a fixed selection rule, and the segmentation points are connected to represent the time series.
  • $\mathrm{TEO}(t,u) = \{\, w(i)\,(x_{t+i} - x_t) \mid i = -1, -2, \ldots, -u, 0, u, \ldots, 2, 1 \,\}$ (1)
  • w(i) is the weight function.
  • The weight function is chosen according to the characteristics of the data.
  • In the experiments of the present invention, the weight is set higher the closer a position is to the center of the detection window.
  • A transaction in a database consists of a transaction identifier (TID) and items (Item).
  • TID: transaction identifier
  • Item: item
  • A transaction is uniquely identified by its TID and can contain one or more items.
  • The HashEclat algorithm uses a vertical data set as its basic data structure. The vertical data set is obtained by "inverting" the original transaction set; the inversion process is shown in Figure 1.
  • Each record in the database consists of an item and the list of all transactions in which it occurs (its Tidset). The support count of any frequent item set can therefore be obtained by intersecting Tidsets.
  • After forming the vertical data set, the algorithm first prunes according to the minimum support, producing the candidate 1-item sets of the frequent item sets. At this point the algorithm must save the size of each item's transaction set for use in later steps.
  • The HashEclat algorithm samples each Tidset using the MinHash method, so that the whole inverted table becomes a hash matrix. The sampling process is shown in Figure 3.
  • x is the row number; applying the hash function amounts to randomly permuting the rows of the matrix.
  • The MinHash method requires a parameter K, meaning that the hash matrix keeps at most K rows.
  • In the example, K equals 3. Because all subsequent steps operate on this hash matrix, the original inverted table can be released to save memory.
  • The algorithm uses the hashed frequent 1-item sets to generate the frequent 2-item sets.
  • The hashed frequent 1-item sets are merged pairwise to generate new frequent 2-item sets.
  • The generation process is shown in Figure 4: (1) generate a vertical data set; (2) prune the candidate sets according to the minimum support to obtain the frequent 1-item sets, and merge the hashed frequent 1-item sets pairwise to generate new frequent 2-item sets; (3) repeat steps (1) and (2) until no further merging is possible.
  • Because intersections are computed on the hash matrix produced by MinHash, the size of the intersection of the original sets has to be estimated.
  • The principle of MinHash estimation is given in Definition 1.
  • , and the set intersection size is t
  • n max /k, and then ⁇
  • Because the HashEclat algorithm uses intersections estimated by MinHash when computing frequent item sets, two kinds of error arise.
  • The first kind of error is that an item set that is actually frequent is estimated to be infrequent; the second is that an item set that is actually infrequent is estimated to be frequent.
  • X is an infrequent item set (Fig. 6: X is smaller than A)
  • the first type of error is Zone2 of Fig. 6,
  • the second type of error is 0, and the total error is Zone2.
  • The approximate association rule mining algorithm of the present invention is general-purpose and is not limited to time series. The experiments use three non-sequential data sets from the UCI website, as shown in Table 1.
  • HashEclat requires two parameters to be set: the upper error limit E and the MinHash minimum element number K. Both affect the computational efficiency and accuracy of the algorithm.
  • The present invention therefore first runs a set of experiments on the T10I4D100K data set: one HashEclat parameter is fixed while the other is adjusted, and the speed and accuracy of the algorithm are observed. Accuracy is measured by the F1 value. With the HashEclat parameters tuned, the computation speed on the three data sets is then compared with that of the original Eclat algorithm.
  • the minimum support threshold is 350
  • the fixed minimum element number K is 100
  • the adjustment error E is 100
  • the F1 and time values are as shown in Figure 7.
  • the minimum support threshold is 350
  • the fixed error E is 0.8
  • the minimum element number K is adjusted
  • the F1 and time values are as shown in Figure 8.
  • The present invention compares computation speed and running memory with the original Eclat algorithm on the three data sets, as shown in Figures 9-11.
  • The HashEclat algorithm is better suited to massive data and to real-time data such as time series streams.
  • The algorithm significantly speeds up association rule mining and delivers time series analysis results in a timely manner. Although HashEclat sacrifices some mining accuracy, it greatly improves mining efficiency and saves machine memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for mining multivariate time series association rules based on Eclat, comprising: (1) generating a vertical data set; (2) generating a MINHASH matrix, for which a parameter k must be specified; (3) using the MINHASH matrix to estimate the candidate item sets in the original data set; (4) pruning the candidate item sets according to the minimum support to obtain the frequent 1-item sets; (5) merging the hashed frequent 1-item sets pairwise to generate new frequent 2-item sets; and (6) repeating step (5) until no further merging is possible, at which point the algorithm ends. The method markedly increases the speed of association rule mining and makes it possible to obtain time series analysis results in time. Although some mining precision is sacrificed, mining efficiency is greatly improved and machine memory is saved.

Description

Method for mining multivariate time series association rules based on Eclat
Technical field
The invention belongs to the field of data mining technology and particularly relates to a method for mining association rules on large-scale data.
Background
There is currently research, both in China and abroad, on approximate association rule mining. Because these studies differ in focus and use different mining algorithms, the characteristics of the rules they mine also differ. Approximate association rule mining generally proceeds in two stages: first, the massive raw data is pre-processed (compression, smoothing, denoising, linear approximation, time series segmentation, clustering, and so on); then the approximate association rule mining algorithm is run on the processed data set.
Traditional association rule mining algorithms target discrete data, and the rules they mine cannot reflect temporal order. The first algorithm for mining association rules on time series was proposed by Das in 1998. Research began with association rules mined from a single time series and was later extended to multiple time series. When processing time series data, the series is divided into subsequences of equal length, and a symbol is then assigned to each subsequence according to its trend. This algorithm considers three main trends: rising, falling, and level. Subsequences that share a trend but differ in duration therefore cannot be distinguished. Later scholars applied the FP-growth algorithm to time series association rule mining. FP-growth is efficient and scalable. It works by pattern growth and uses an extended prefix tree, the FP-tree, as a summary structure that stores compressed key information about frequent patterns; in many cases it performs better than Apriori. Many improved algorithms were later built on it. The CFP-mine algorithm is based on a compressed FP-tree and a constrained-subtree method; it reduces memory use and uses arrays to reduce the number of traversals.
The most classic association rule mining algorithm is the Apriori algorithm proposed by Agrawal in 1993. Apriori mines the frequent item sets from which association rules are derived. It is an iterative, level-wise search: each time candidate frequent item sets are generated, it goes through scanning, counting, comparing, joining, and pruning. However, verifying candidate frequent k-item sets requires scanning the entire data set more than once, so its time efficiency is very low. The EH-Apriori algorithm makes two improvements to Apriori: the mining process is preceded by pre-processing, and the data of the data set is hashed into a large table. In 2000, Han et al. studied the properties of association rules and proposed the FP-growth algorithm. FP-growth builds an FP-tree with the prefix property, so that frequent patterns can be mined with only a single traversal of the database, which improves mining efficiency. Experiments show that FP-growth is an order of magnitude faster than Apriori. Both Apriori and FP-growth mine data in the horizontal representation. Zaki proposed the Eclat algorithm in 2000, which uses a vertical data representation to mine association rules. In the vertical representation, the data set consists of items, each paired with the set of identifiers of all transactions that contain it. The algorithm counts by intersection, so candidate generation and support counting are completed at the same time. Practice has shown that algorithms using the vertical data representation generally outperform algorithms using the horizontal data representation.
Because time series data is large in volume and generated in real time, traditional data mining algorithms cannot extract the required knowledge in a timely and effective way. Sampling is an effective means of obtaining approximate rules with ordinary resources. It has been studied widely and in depth because of its good performance on large-scale data sets, and it is a simple, effective way to improve the efficiency and scalability of association rule algorithms. Common synopsis construction methods include histogram methods, sampling methods, and wavelet methods. The scalability and flexibility of sampling make it a very important way to build data stream synopses. The ultimate goal of all these studies is to approximate the information in the original data set as well as possible with the smallest possible sample (finding a suitable sample size and an optimal sample set), but this depends on an effective measure of sampling error, that is, of the difference between data sets. Systematic research and a unified, effective model are currently lacking. For sampling-based association rule mining, and for data mining algorithms in general, computing the difference in the information of interest between a sample and the original data set, and between one sample and another, is a central problem of the whole sampling process.
In recent years, using locality-sensitive hashing (LSH) to assist association rule mining has gradually become popular. This approach borrows fast similarity computation from the field of information retrieval to optimize steps in association rule mining and thereby achieve fast mining. It compresses the data with hash functions and handles massive data comparatively well. Theory and practice have both shown that the information loss caused by data compression can be kept within a certain range, so the accuracy of the mined rules can be guaranteed. While maintaining a certain level of accuracy, sampling significantly reduces the size of the data set to be processed, which allows many data mining algorithms to be applied to large data sets and data streams.
Summary of the invention
To solve the problems in the prior art, the present invention provides an Eclat-based association rule mining method that significantly speeds up association rule mining and delivers time series analysis results in a timely manner; although some mining accuracy is sacrificed, mining efficiency is greatly improved and machine memory is saved.
The invention is implemented by the following technical solution:
An association rule mining method based on Eclat, characterized in that the method comprises: (1) generating a vertical data set; (2) generating a MINHASH matrix, for which a parameter k must be specified, meaning that the matrix has at most k rows; (3) using the MINHASH matrix to estimate the candidate item sets in the original data set; (4) pruning the candidate sets according to the minimum support to obtain the frequent 1-item sets; (5) merging the hashed frequent 1-item sets pairwise to generate new frequent 2-item sets; (6) repeating steps (4) and (5) until no further merging is possible, at which point the algorithm ends. In step (3), MinHash is used to estimate the size of a set intersection: for multiple sets S_1, S_2, …, S_i, …, S_m, the size of the largest set is n_max = max_i |S_i|, and the estimated intersection size is
Figure PCTCN2017115843-appb-000001
where ∩kmin(S_i) denotes the intersection of the sets S_i in the hash matrix formed by sampling with the MinHash method.
Further, in step (1), the vertical data set is obtained by inverting the original transaction set.
Further, step (2) also includes releasing the vertical data set to save memory.
Further, the minimum support is estimated using MinHash.
Further, the method is applied to association rule mining on multivariate time series.
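The formula for the estimated intersection size appears only as an image and is not reproduced in this text, so the following is not the patent's estimator. It is a minimal sketch of the standard bottom-k MinHash estimate for two sets, given only to illustrate how an intersection size can be recovered from k sampled hash values plus the saved set sizes. The hash function `h` (based on `hashlib.blake2b`) and all function names are assumptions made for this example.

```python
import hashlib
from typing import Iterable, Set


def h(x: int) -> int:
    """Hypothetical 64-bit hash of a transaction id (assumed collision-free here)."""
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")


def bottom_k(values: Iterable[int], k: int) -> Set[int]:
    """Keep the k smallest hash values of a set (its bottom-k sketch)."""
    return set(sorted(h(v) for v in values)[:k])


def estimate_intersection(a: Set[int], b: Set[int], k: int) -> float:
    """Standard bottom-k estimate of |a & b|, not the formula from the patent image.

    The Jaccard similarity J is estimated on the k smallest hashes of the union,
    then converted with |A & B| = J / (1 + J) * (|A| + |B|).
    """
    sa, sb = bottom_k(a, k), bottom_k(b, k)
    union_k = sorted(sa | sb)[:k]          # bottom-k sketch of the union
    if not union_k:
        return 0.0
    shared = sum(1 for v in union_k if v in sa and v in sb)
    j = shared / len(union_k)              # estimated Jaccard similarity
    return j / (1 + j) * (len(a) + len(b))


if __name__ == "__main__":
    a, b = set(range(1, 7)), set(range(4, 10))   # toy Tidsets, true overlap is 3
    # k covers the whole union here, so the estimate is essentially exact.
    print(estimate_intersection(a, b, k=16))
```

With a smaller k the result becomes an approximation whose error shrinks as k grows, which is the speed versus accuracy trade-off the method describes.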
Description of the drawings
Figure 1 is a schematic diagram of the inversion process;
Figure 2 is a schematic diagram of generating the frequent 1-item sets;
Figure 3 is a schematic diagram of the sampling process;
Figure 4 is a schematic diagram of generating the frequent 2-item sets;
Figure 5 is a schematic diagram of computing a set intersection with MinHash;
Figure 6 is a schematic diagram of the error when computing a set intersection with MinHash;
Figure 7 shows the speed and accuracy of HashEclat with the minimum element number K fixed and the error E adjusted;
Figure 8 shows the speed and accuracy of HashEclat with the error E fixed and the minimum element number K adjusted;
Figure 9 compares the speed and memory of HashEclat and Eclat on T10I4D100K;
Figure 10 compares the speed and memory of HashEclat and Eclat on T40I10D100K;
Figure 11 compares the speed and memory of HashEclat and Eclat on Online Retail.
Detailed description
The invention is further described below with reference to the drawings and specific embodiments.
Because time series data is large in volume and generated in real time, the data must be compressed, that is, given a feature representation, before association rules are mined. The feature representation of a time series extracts features from the data and transforms its dimensionality, which achieves feature dimension reduction. At the same time, the data in the low-dimensional space retains as much of the information in the original time series as possible.
First, the present invention studies the TEO feature representation. Looking at the characteristics of time series data, the two sides of a segmentation point often show different trends, analogous to the change in gray level at the edge of an image in image processing. At the edge of an image, the rate of change of the gray level changes. If the data before a certain point in the time series tends to increase and the data after that point tends to decrease, the point can, to a certain extent, be regarded as a segmentation point, that is, an edge point of the time series. The TEO representation of a time series is a piecewise linear representation that combines the edge detection operator from image processing with the characteristics of time series data; it is computed by convolving the designed time series edge operator with the original time series data. Segmentation points are then selected from the computed edge degrees according to a fixed selection rule, and the segmentation points are connected to represent the time series. The time series is written X = <x_1, x_2, …, x_n>, and TEO is defined as in equation (1):
$$\mathrm{TEO}(t,u) = \{\, w(i)\,(x_{t+i} - x_t) \mid i = -1, -2, \ldots, -u, 0, u, \ldots, 2, 1 \,\} \tag{1}$$
where 2u+1 is the length of the detection window and w(i) is the weight function, which is chosen according to the characteristics of the data. In the experiments of the present invention, the weight is set higher the closer a position is to the center of the detection window.
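As a rough illustration of equation (1), the sketch below computes the weighted differences over a window of length 2u+1. The triangular weight function and the final summation of the terms into a single edge score per point are assumptions made for this example; the text only states that weights grow toward the window center and does not give the exact weights or the selection rule.

```python
from typing import Dict, List, Sequence


def triangular_weight(i: int, u: int) -> float:
    """Hypothetical weight: largest at the window center (i = 0), smaller toward the ends."""
    return (u + 1 - abs(i)) / (u + 1)


def teo_terms(x: Sequence[float], t: int, u: int) -> List[float]:
    """Terms w(i) * (x[t+i] - x[t]) for every offset i in the window -u..u."""
    return [triangular_weight(i, u) * (x[t + i] - x[t]) for i in range(-u, u + 1)]


def edge_score(x: Sequence[float], t: int, u: int) -> float:
    """Assumption: the terms are summed to give one edge degree per point."""
    return sum(teo_terms(x, t, u))


if __name__ == "__main__":
    series = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]   # toy series with a peak at index 3
    u = 2
    scores: Dict[int, float] = {
        t: edge_score(series, t, u) for t in range(u, len(series) - u)
    }
    # The point with the largest absolute score is the strongest edge candidate.
    print(max(scores, key=lambda t: abs(scores[t])))
```

On a straight line the positive and negative terms cancel, so the score is zero; a change of trend produces a score far from zero, which is what makes the point a segmentation candidate.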
Traditional data mining algorithms mostly use the horizontal data representation, in which a database transaction consists of a transaction identifier (TID) and items (Item). A transaction is uniquely identified by its TID and can contain one or more items. The HashEclat algorithm uses a vertical data set as its basic data structure. The vertical data set is obtained by "inverting" the original transaction set; the inversion process is shown in Figure 1. Each record in the database consists of an item and the list of all transactions in which it occurs (its Tidset). The support count of any frequent item set can therefore be obtained by intersecting Tidsets.
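The inversion from horizontal to vertical form can be sketched as follows. The transaction data and the function names `invert` and `support` are made up for illustration and do not come from Figure 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def invert(transactions: Dict[int, Iterable[str]]) -> Dict[str, Set[int]]:
    """Turn {TID: items} (horizontal) into {item: Tidset} (vertical)."""
    tidsets: Dict[str, Set[int]] = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)


def support(tidsets: Dict[str, Set[int]], itemset: Iterable[str]) -> int:
    """Support count of an item set is the size of the intersection of its Tidsets."""
    return len(set.intersection(*(tidsets[item] for item in itemset)))


if __name__ == "__main__":
    # Toy horizontal data set (hypothetical).
    horizontal = {
        1: ["I1", "I2", "I5"],
        2: ["I2", "I4"],
        3: ["I2", "I3"],
        4: ["I1", "I2", "I4"],
    }
    vertical = invert(horizontal)
    print(vertical["I1"])                    # Tidset of I1
    print(support(vertical, ["I1", "I2"]))   # transactions containing both items
```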
After forming the vertical data set, the algorithm first prunes according to the minimum support, producing the candidate 1-item sets of the frequent item sets. At this point the algorithm must save the size of each item's transaction set for use in later steps. Taking a minimum support of 3 as an example, the pruning process that generates the frequent 1-item sets is shown in Figure 2.
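A minimal sketch of this pruning step, assuming the vertical data set is held as a dictionary from item to Tidset. The data is invented and does not reproduce Figure 2; the function name is hypothetical.

```python
from typing import Dict, Set, Tuple


def frequent_1_itemsets(
    tidsets: Dict[str, Set[int]], min_support: int
) -> Tuple[Dict[str, Set[int]], Dict[str, int]]:
    """Drop items below the minimum support and record each survivor's Tidset size."""
    kept = {item: tids for item, tids in tidsets.items() if len(tids) >= min_support}
    sizes = {item: len(tids) for item, tids in kept.items()}  # saved for later estimation
    return kept, sizes


if __name__ == "__main__":
    vertical = {                      # toy vertical data set (hypothetical)
        "I1": {1, 4, 5, 7},
        "I2": {1, 2, 3, 4, 6},
        "I3": {3, 6},
        "I4": {2, 4, 6},
    }
    kept, sizes = frequent_1_itemsets(vertical, min_support=3)
    print(sorted(kept))   # items that survive pruning
    print(sizes)          # their saved Tidset sizes
```

The sizes are kept separately because the full Tidsets are later replaced by samples, while the original sizes remain useful for scaling estimates.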
If a Tidset contains too many transactions, the subsequent intersection computations become significantly less efficient and consume a large amount of memory. The HashEclat algorithm therefore samples each Tidset with the MinHash method, so that the whole inverted table forms a hash matrix; the sampling process is shown in Figure 3.
Figure 3 uses the hash function h(x) = (x + 2) mod 6, where x is the row number; applying it is equivalent to randomly permuting the rows of the matrix. The smallest row number at which a 1 appears is called the minimum hash value; for example, the minimum hash value of I5 is h_min(I5) = 3. The MinHash method requires a parameter K, which means that the hash matrix has at most K rows. In the example in the figure, K equals 3. Because all subsequent steps are computed on this hash matrix, the original inverted table can be released at this point to save memory.
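The sampling can be sketched as a bottom-K selection, assuming that "at most K rows" means each Tidset keeps the K TIDs with the smallest hash values. That reading, the helper names and the toy Tidsets are assumptions; only h(x) = (x + 2) mod 6 and K = 3 come from the text. The toy Tidset for I5 was chosen so that its minimum hash value is 3, as in the example above.

```python
from typing import Callable, Dict, Set


def h(x: int) -> int:
    """Hash function of the Figure 3 example: h(x) = (x + 2) mod 6."""
    return (x + 2) % 6


def min_hash_value(tids: Set[int], hash_fn: Callable[[int], int] = h) -> int:
    """Smallest hash value over all TIDs in which the item appears."""
    return min(hash_fn(t) for t in tids)


def kmin(tids: Set[int], k: int, hash_fn: Callable[[int], int] = h) -> Set[int]:
    """Keep the k TIDs with the smallest hash values (ties broken by TID)."""
    ranked = sorted(tids, key=lambda t: (hash_fn(t), t))
    return set(ranked[:k])


# Hypothetical Tidsets over TIDs 1..6.
tidsets = {"I1": {1, 2, 4, 5}, "I2": {2, 3, 6}, "I5": {1, 3}}
hash_matrix: Dict[str, Set[int]] = {item: kmin(tids, k=3)
                                    for item, tids in tidsets.items()}
```

Once `hash_matrix` exists, later steps only touch at most K TIDs per item, which is where the memory saving described above comes from.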
Next, the algorithm uses the hashed frequent 1-itemsets to generate frequent 2-itemsets by merging them pairwise; the generation process is shown in Figure 4. The steps are: (1) generate the vertical data set; (2) prune the candidate set by the minimum support to obtain the frequent 1-itemsets, and merge the hashed frequent 1-itemsets pairwise to generate new frequent 2-itemsets; (3) repeat steps (1) and (2) until no further merging is possible.
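The prune-and-merge loop can be sketched on exact Tidsets, that is, as plain Eclat without the MinHash sampling. The prefix-based join used to go beyond 2-itemsets is a common Eclat convention and is an assumption here, as are the function name and the toy data.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def eclat(tidsets: Dict[str, Set[int]], min_support: int) -> Dict[Tuple[str, ...], int]:
    """Level-wise Eclat on exact Tidsets; returns {itemset: support count}."""
    # Frequent 1-itemsets after pruning by the minimum support.
    level = {(item,): tids for item, tids in tidsets.items()
             if len(tids) >= min_support}
    result: Dict[Tuple[str, ...], int] = {}
    while level:
        result.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level = {}
        # Merge pairwise: two k-itemsets join if they share their first k-1 items.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] != b[:-1]:
                continue
            tids = level[a] & level[b]
            if len(tids) >= min_support:
                next_level[a + (b[-1],)] = tids
        level = next_level  # stop when nothing can be merged any more
    return result


# Hypothetical vertical data set.
toy = {
    "I1": {1, 4, 5, 7, 8, 9},
    "I2": {1, 2, 3, 4, 6, 8, 9},
    "I3": {3, 5, 6, 7, 8, 9},
    "I4": {2, 4},
    "I5": {1, 8},
}
itemsets = eclat(toy, min_support=3)
```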
Because the intersections are computed on the hash matrix produced by MinHash, the size of the intersection of the original sets has to be estimated. The principle of MinHash estimation is given in Definition 1 below.
Definition 1: Estimating the intersection size with MinHash. Given multiple sets S_1, S_2, …, S_i, …, S_m, let the size of the largest set be n_max = max_i |S_i|, let the size of the intersection be t = |S_1 ∩ S_2 ∩ … ∩ S_m|, and let k be the MinHash parameter. When 0 < ε < 1 and
Figure PCTCN2017115843-appb-000002
the estimate of the set intersection size is
Figure PCTCN2017115843-appb-000003
where ∩kmin(S_i) denotes the intersection of the sets S_i in the hash matrix formed by MinHash sampling. With probability at least
Figure PCTCN2017115843-appb-000004
the estimate satisfies
Figure PCTCN2017115843-appb-000005
This approach allows us, with probability at least
Figure PCTCN2017115843-appb-000006
either to obtain an (ε, δ)-estimate of the set intersection or to obtain an upper bound on the size of the set intersection. The present invention first estimates the intersection size as X = |∩kmin(S_i)| · n_max / k and then obtains ε = |X − A|, where A is the minimum support, k is the MinHash parameter, and n_max is the number of elements in the larger of the two sets. If the estimate X is greater than
Figure PCTCN2017115843-appb-000007
the estimation error is guaranteed; otherwise the intersection size can only be computed from the original sets.
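The estimator X = |∩kmin(S_i)| · n_max / k and the quantity ε = |X − A| can be sketched as follows. The sketch assumes kmin(S) keeps the k elements of S with the smallest hash values and that k does not exceed n_max; the hash function `toy_hash` and all names are hypothetical. The threshold that decides whether X can be trusted is given only as an equation image in the source, so it is not implemented here.

```python
from typing import Callable, List, Set


def toy_hash(x: int) -> int:
    """Hypothetical hash function used only for this illustration."""
    return (7 * x + 3) % 1009


def kmin(s: Set[int], k: int, hash_fn: Callable[[int], int]) -> Set[int]:
    """The k elements of s with the smallest hash values (ties broken by value)."""
    return set(sorted(s, key=lambda e: (hash_fn(e), e))[:k])


def estimate_intersection(sets: List[Set[int]], k: int,
                          hash_fn: Callable[[int], int] = toy_hash) -> float:
    """X = |intersection of kmin(S_i)| * n_max / k, assuming k <= n_max."""
    n_max = max(len(s) for s in sets)
    common = set.intersection(*(kmin(s, k, hash_fn) for s in sets))
    return len(common) * n_max / k


def epsilon(x_estimate: float, min_support: float) -> float:
    """epsilon = |X - A|, where A is the minimum support."""
    return abs(x_estimate - min_support)


identical = [set(range(100)), set(range(100))]
disjoint = [set(range(50)), set(range(50, 100))]
x_same = estimate_intersection(identical, k=10)
x_none = estimate_intersection(disjoint, k=10)
```

For two identical sets the samples coincide, so the estimate equals the true intersection size; for sets that overlap only partly, the estimate depends on the hash function and can deviate from the true value.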
We can use the results to continue iterating until all frequent itemsets have been computed. Finally, the overall error that has been incurred must also be calculated.
The HashEclat algorithm thus proceeds as follows: (1) generate the vertical data set; (2) generate the MinHash matrix, which requires a parameter k meaning that the matrix has at most k rows; (3) use the MinHash matrix to estimate the candidate itemsets in the original data set; (4) prune the candidate set by the minimum support to obtain the frequent 1-itemsets; (5) merge the hashed frequent 1-itemsets pairwise to generate new frequent 2-itemsets; (6) repeat steps (4) and (5) until no further merging is possible, then stop the algorithm.
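The six steps can be put together in a simplified, estimate-only sketch. It assumes bottom-k sampling for the MinHash matrix, a prefix-based join for itemsets longer than two, and n_max taken over the original Tidset sizes of the items in a candidate. It omits the fallback to the original sets described earlier, and every name in it is hypothetical, so it should be read as an illustration rather than as the patented implementation.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Set, Tuple


def hash_eclat_sketch(transactions: Dict[int, Iterable[str]], min_support: int,
                      k: int, hash_fn: Callable[[int], int]) -> Dict[Tuple[str, ...], float]:
    """Return {itemset: estimated support} using MinHash-sampled Tidsets."""
    # (1) Generate the vertical data set.
    tidsets: Dict[str, Set[int]] = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    # (2) Generate the MinHash matrix: at most k TIDs per item, sizes saved.
    sizes = {item: len(tids) for item, tids in tidsets.items()}
    sample = {item: set(sorted(tids, key=lambda t: (hash_fn(t), t))[:k])
              for item, tids in tidsets.items()}
    # (4) Frequent 1-itemsets are pruned with the exact saved sizes.
    level = {(item,): (sample[item], sizes[item])
             for item in tidsets if sizes[item] >= min_support}
    found = {itemset: float(n) for itemset, (_, n) in level.items()}
    # (5)-(6) Merge pairwise and repeat until nothing can be merged.
    while level:
        next_level = {}
        for a, b in combinations(sorted(level), 2):
            if a[:-1] != b[:-1]:
                continue
            common = level[a][0] & level[b][0]
            n_max = max(level[a][1], level[b][1])
            estimate = len(common) * n_max / k  # (3) X = |∩kmin| * n_max / k
            if estimate >= min_support:
                next_level[a + (b[-1],)] = (common, n_max)
                found[a + (b[-1],)] = estimate
        level = next_level
    return found


# Hypothetical data: A and B co-occur in 20 transactions, C is rare.
data = {tid: ["A", "B"] for tid in range(1, 21)}
data.update({21: ["C"], 22: ["C"]})
approx = hash_eclat_sketch(data, min_support=3, k=5,
                           hash_fn=lambda x: (7 * x + 3) % 101)
```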
Because the HashEclat algorithm computes frequent itemsets from intersections estimated by MinHash, two kinds of error arise. The first is that an itemset that is actually frequent is estimated to be infrequent; the second is that an itemset that is actually infrequent is estimated to be frequent. Suppose, without loss of generality, that the computed X indicates an infrequent itemset (X is smaller than A, as in Figure 6). Then the first kind of error corresponds to Zone2 of Figure 6, the second kind of error is 0, and the total error is Zone2. By Definition 1, the probability that our estimate lies in Zone3 of Figure 6 is at least
Figure PCTCN2017115843-appb-000008
so the probability that it lies in Zone1 (the error as we define it) is at most
Figure PCTCN2017115843-appb-000009
Figure 6 shows that Zone1 > Zone2, so our estimate is conservative. The upper bound on the estimation error is therefore guaranteed to be at most
Figure PCTCN2017115843-appb-000010
When X indicates a frequent itemset, the same reasoning gives an upper bound on the error of at most
Figure PCTCN2017115843-appb-000011
Because the approximate association rule mining algorithm designed in the present invention is a general-purpose algorithm that is not limited to time series, the experiments use three non-sequential data sets from the UCI website, as shown in Table 1.
Table 1. Experimental data sets
Figure PCTCN2017115843-appb-000012
HashEclat requires two parameters, the error upper limit E and the MinHash minimum element count K, and both affect the computational efficiency and accuracy of the algorithm. The present invention therefore first runs a set of experiments on the T10I4D100K data set in which one of the two parameters is fixed and the other is varied, and observes the speed and accuracy of the algorithm. Accuracy is measured by the F1 value. After the HashEclat parameters have been tuned, the computation speed is compared with that of the original Eclat algorithm on the three data sets.
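One plausible way to compute the F1 value is sketched here, assuming that precision and recall are taken over the set of frequent itemsets returned by the approximate algorithm against the set returned by exact Eclat. The source does not spell out this definition, and the function name and toy itemsets are hypothetical.

```python
from typing import FrozenSet, Set


def f1_score(exact: Set[FrozenSet[str]], approx: Set[FrozenSet[str]]) -> float:
    """F1 of the approximate frequent itemsets against the exact ones."""
    true_positives = len(exact & approx)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(approx)  # share of reported itemsets that are truly frequent
    recall = true_positives / len(exact)      # share of truly frequent itemsets that were reported
    return 2 * precision * recall / (precision + recall)


# Hypothetical results: one itemset missed, one reported in error.
exact_sets = {frozenset({"A"}), frozenset({"B"}), frozenset({"A", "B"}), frozenset({"C"})}
approx_sets = {frozenset({"A"}), frozenset({"B"}), frozenset({"A", "B"}), frozenset({"D"})}
score = f1_score(exact_sets, approx_sets)
```

Under this definition a missed itemset lowers recall and a wrongly reported itemset lowers precision, which matches the two kinds of error distinguished in the description.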
On the T10I4D100K data set, with a minimum support threshold of 350 and the minimum element count K fixed at 100, the F1 value and running time for different values of the error E are shown in Figure 7.
On the T10I4D100K data set, with a minimum support threshold of 350 and the error E fixed at 0.8, the F1 value and running time for different values of the minimum element count K are shown in Figure 8.
The experiments show that a smaller K means a higher compression ratio of the matrix and a smaller amount of data to compute, so the error increases (the F1 value decreases). Normally a smaller K speeds up the computation, but when K is too small HashEclat misses too often and has to merge on the original data many times, which slows it down. E represents the maximum error tolerated in a single merge: the smaller E is, the higher the chance of a hit, and after a hit the estimation algorithm is used, so the error is higher and the speed is faster.
The present invention then compares HashEclat with the original Eclat algorithm in computation speed and memory usage on the three data sets, as shown in Figures 9 to 11.
The experiments verify that the HashEclat algorithm is better suited to massive data and to data generated in real time, such as time series stream data. The algorithm significantly speeds up association rule mining, so that the results of time series analysis can be obtained in a timely manner. Thus, although the HashEclat algorithm sacrifices some mining accuracy, it greatly improves mining efficiency and saves machine memory.
The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make a number of simple deductions or substitutions without departing from the concept of the invention, and all of these shall be regarded as falling within the scope of protection of the invention.

Claims (5)

  1. An Eclat-based association rule mining method, characterized in that the method comprises: (1) generating a vertical data set; (2) generating a MinHash matrix, which requires a parameter k meaning that the matrix has at most k rows; (3) using the MinHash matrix to estimate the candidate itemsets in the original data set; (4) pruning the candidate set by the minimum support to obtain the frequent 1-itemsets; (5) merging the hashed frequent 1-itemsets pairwise to generate new frequent 2-itemsets; (6) repeating steps (4) and (5) until no further merging is possible, then ending the algorithm; wherein step (3) uses MinHash to estimate the set intersection size: for multiple sets S_1, S_2, …, S_i, …, S_m, the size of the largest set is n_max = max_i |S_i|, and the estimate of the set intersection size is
    Figure PCTCN2017115843-appb-100001
    where ∩kmin(S_i) denotes the intersection of the sets S_i in the hash matrix formed by MinHash sampling.
  2. The method according to claim 1, characterized in that in step (1) the vertical data set is obtained by inverting the original transaction set.
  3. The method according to claim 1, characterized in that step (2) further comprises releasing the vertical data set to save memory.
  4. The method according to claim 1, characterized in that the minimum support is estimated using MinHash.
  5. The method according to claim 1, characterized in that the method is applied to association rule mining of multivariate time series.
PCT/CN2017/115843 2017-08-30 2017-12-13 Method for mining multivariate time series association rule based on eclat WO2019041628A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710763342.2A CN107562865A (en) 2017-08-30 2017-08-30 Multivariate time series association rule mining method based on Eclat
CN201710763342.2 2017-08-30

Publications (1)

Publication Number Publication Date
WO2019041628A1 true WO2019041628A1 (en) 2019-03-07

Family

ID=60978111

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/115843 WO2019041628A1 (en) 2017-08-30 2017-12-13 Method for mining multivariate time series association rule based on eclat

Country Status (2)

Country Link
CN (1) CN107562865A (en)
WO (1) WO2019041628A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108470068A (en) * 2018-03-29 2018-08-31 重庆大学 A kind of abstract index generation method of sequential key assignments type industrial process data
CN108809628B (en) * 2018-06-13 2021-07-13 哈尔滨工业大学深圳研究生院 Time series abnormity detection method and system based on safety multiple parties
CN109858507B (en) * 2018-09-17 2021-03-23 北京工业大学 Rare subsequence mining method of multidimensional time sequence data applied to atmospheric pollution control
CN110874413B (en) * 2019-11-14 2023-04-07 哈尔滨工业大学 Association rule mining-based method for establishing efficacy evaluation index system of air defense multi-weapon system
CN111324638B (en) * 2020-02-10 2023-03-28 上海海洋大学 AR _ TSM-based time sequence motif association rule mining method
CN111666519A (en) * 2020-05-13 2020-09-15 中国科学院软件研究所 Dynamic mining method and system for network access behavior feature group under enhanced condition
CN111986036A (en) * 2020-08-31 2020-11-24 平安医疗健康管理股份有限公司 Medical wind control rule generation method, device, equipment and storage medium
CN112732771A (en) * 2020-11-06 2021-04-30 河北上晟医疗科技发展有限公司 Application of association rule mining technology based on PACS system
CN113282645A (en) * 2021-07-23 2021-08-20 广东粤港澳大湾区硬科技创新研究院 Satellite time sequence parameter analysis method, system, terminal and storage medium
CN114936581B (en) * 2022-06-01 2024-04-26 中国人民解放军63796部队 Multi-parameter association mining method based on time sequence data segmentation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073732A (en) * 2011-01-18 2011-05-25 东北大学 Method for mining frequency episode from event sequence by using same node chains and Hash chains
US20130332432A1 (en) * 2012-06-12 2013-12-12 International Business Machines Corporation Closed itemset mining using difference update
CN105653672A (en) * 2015-12-29 2016-06-08 郑州轻工业学院 Time sequence based computer data mining method
CN106384128A (en) * 2016-09-09 2017-02-08 西安交通大学 Method for mining time series data state correlation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, CHUNKAI: "An approximate approach to frequent itemset mining", 2017 IEEE SECOND INTERNATIONAL CONFERENCE ON DATA SCIENCE IN CYBERSPACE, 29 June 2017 (2017-06-29), XP033139593, DOI: 10.1109/DSC.2017.60 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489411A (en) * 2019-07-11 2019-11-22 齐鲁工业大学 A kind of association rule mining method based on virtual value storage and operation mode
CN110489411B (en) * 2019-07-11 2023-08-22 齐鲁工业大学 Association rule mining method based on effective value storage and operation mode
CN110866047A (en) * 2019-11-13 2020-03-06 辽宁工程技术大学 Community discovery algorithm based on improved association rule
CN113407986A (en) * 2021-05-21 2021-09-17 南京逸智网络空间技术创新研究院有限公司 Singular value decomposition-based frequent item set mining method for local differential privacy protection
CN113407986B (en) * 2021-05-21 2024-02-23 南京逸智网络空间技术创新研究院有限公司 Frequent item set mining method for local differential privacy protection based on singular value decomposition
CN113411235B (en) * 2021-06-21 2023-11-07 大连大学 Unknown protocol data frame feature extraction method based on PSO
CN113411235A (en) * 2021-06-21 2021-09-17 大连大学 Unknown protocol data frame feature extraction method based on PSO
CN113722374A (en) * 2021-07-30 2021-11-30 河海大学 Suffix tree-based time sequence variable-length motif mining method
CN113722374B (en) * 2021-07-30 2023-12-01 河海大学 Time sequence variable length motif mining method based on suffix tree
CN114170796A (en) * 2021-11-20 2022-03-11 无锡数据湖信息技术有限公司 Algorithm improved congestion propagation analysis method
CN114238491A (en) * 2021-12-02 2022-03-25 西北工业大学 Multi-mode traffic operation situation association rule mining method based on heterogeneous graph
CN114238491B (en) * 2021-12-02 2024-02-13 西北工业大学 Heterogeneous graph-based multi-mode traffic operation situation association rule mining method
CN116523351B (en) * 2023-07-03 2023-09-22 广东电网有限责任公司湛江供电局 Source-load combined typical scene set generation method, system and equipment
CN116523351A (en) * 2023-07-03 2023-08-01 广东电网有限责任公司湛江供电局 Source-load combined typical scene set generation method, system and equipment

Also Published As

Publication number Publication date
CN107562865A (en) 2018-01-09

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.02.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 17923606

Country of ref document: EP

Kind code of ref document: A1