WO2019041628A1

WO2019041628A1 - Method for mining multivariate time series association rule based on eclat

Info

Publication number: WO2019041628A1
Application number: PCT/CN2017/115843
Authority: WO
Inventors: 张春慨
Original assignee: 哈尔滨工业大学深圳研究生院
Priority date: 2017-08-30
Filing date: 2017-12-13
Publication date: 2019-03-07
Also published as: CN107562865A

Abstract

A method for mining multivariate time series association rules based on Eclat, comprising: (1) generating a vertical dataset; (2) generating a MinHash matrix, for which a parameter k must be specified; (3) using the MinHash matrix to estimate the candidate itemsets of the original dataset; (4) pruning the candidate itemsets according to the minimum support to obtain the frequent 1-itemsets; (5) merging pairs of hashed frequent 1-itemsets to generate new frequent 2-itemsets; and (6) repeating step (5) until no further merging is possible, at which point the algorithm ends. The method markedly increases the speed of association rule mining, so that time series analysis results can be obtained in a timely manner. Some mining precision is sacrificed, but mining efficiency is greatly improved and machine memory is saved.

Description

Method for mining multivariate time series association rules based on Eclat

Technical field

The invention belongs to the field of data mining technology, and particularly relates to a method for mining association rules under large-scale data.

Background art

At present, approximate association rule mining has been studied both domestically and internationally. Because these studies differ in focus, the mining procedures they use and the characteristics of the rules they produce also differ. Approximate association rule mining generally proceeds in two stages: first, the massive raw data is preprocessed (compression, smoothing, denoising, piecewise linear approximation, time series segmentation, clustering, and so on); then the approximate association rule mining algorithm is run on the processed data set.

Traditional association rule mining algorithms target discrete data, and the rules they mine cannot reflect temporal order. The first algorithm to apply association rules to time series was proposed by Das in 1998. That work started from mining association rules within a single time series and was then extended to multiple time series. To process time series data, the series is divided into subsequences of equal length, and each subsequence is assigned a symbol according to its trend. The algorithm distinguishes three main trends: rising, falling, and flat. As a consequence, subsequences of different durations that share the same trend cannot be distinguished. Later researchers applied the FP-growth algorithm to time series association rule mining. FP-growth is an efficient and scalable algorithm based on pattern growth. It uses an extended prefix tree, the FP-tree, as a summary structure that stores compressed, essential information about frequent patterns, and in many cases it performs better than Apriori. Many improved algorithms followed. The CFP-mine algorithm, which is based on a compressed FP-tree and a constrained-subtree method, reduces memory usage and uses arrays to reduce the number of traversals.

The most classic association rule mining algorithm is Apriori, proposed by Agrawal in 1993. Apriori is a frequent itemset algorithm for mining association rules. It searches level by level, and each iteration that generates candidate frequent itemsets must go through scanning, counting, comparing, joining, and pruning. However, Apriori has to scan the entire data set more than once to verify the candidate frequent k-itemsets, so its time efficiency is very low. The EH-Apriori algorithm improves on Apriori in two ways: it preprocesses the data before mining, and it hashes the data set into a large table. In 2000, Han et al. studied the properties of association rules and proposed the FP-growth algorithm. FP-growth compresses the database into an FP-tree with the prefix property and mines frequent patterns from that tree, which improves mining efficiency. Experiments show that FP-growth is about an order of magnitude faster than Apriori. Both Apriori and FP-growth mine data in the horizontal representation. Zaki proposed the Eclat algorithm in 2000, which mines association rules using a vertical data representation. In the vertical representation, the data set consists of items, each paired with the identifiers of all transactions that contain it. The algorithm counts support by intersecting these identifier sets, so candidate generation and support counting are completed at the same time. Practice has shown that algorithms using the vertical representation generally outperform those using the horizontal representation.

Because time series data is large in volume and generated in real time, traditional data mining algorithms cannot extract the required knowledge in a timely and effective manner. Sampling is an effective means of obtaining approximate rules with ordinary computing resources, and it has been studied extensively because of its good performance on large-scale data sets. It is a simple and effective way to improve the efficiency and scalability of association rule algorithms. Commonly used summary construction methods include the histogram method, the sampling method, and the wavelet method. The scalability and flexibility of sampling make it a very important way to build summaries of data streams. The ultimate goal of all these studies is to use the smallest possible sample to best approximate the information in the original data set, that is, to find an appropriate sample size and an optimal sample set. Achieving this depends on the sampling error, which requires a valid measure of the difference between data sets. There is currently no systematic study of such a measure, nor a unified and effective model. For sampling-based association rule mining algorithms, and for sampling-based data mining in general, measuring how the information of interest differs between a sample and the original data set, and between one sample and another, is the central problem of the whole sampling process.

In recent years, using locality-sensitive hashing (LSH) to assist association rule mining has gradually become popular. This approach borrows fast similarity computation techniques from information retrieval to optimize the steps of association rule mining and thereby speed it up. It uses hash functions to compress the data, which makes massive data easier to handle. Theory and practice have shown that the information loss caused by this compression can be kept within a certain range, so the accuracy of the mined rules can also be guaranteed. While preserving a certain degree of accuracy, the sampling method significantly reduces the size of the data set to be processed, enabling many data mining algorithms to be applied to large data sets and data streams.

Summary of the invention

To solve the problems in the prior art, the present invention provides an Eclat-based association rule mining method that significantly speeds up association rule mining, so that time series analysis results are obtained in a timely manner. Although some mining accuracy is sacrificed, mining efficiency is greatly improved and machine memory is saved.

The invention is specifically implemented by the following technical solutions:

An association rule mining method based on Eclat, characterized in that the method comprises: (1) generating a vertical data set; (2) generating a MinHash matrix, for which a parameter k must be specified, meaning that the matrix has at most k rows; (3) using the MinHash matrix to estimate the candidate sets in the original data set; (4) pruning the candidate sets according to the minimum support to obtain the frequent 1-itemsets; (5) merging the hashed frequent 1-itemsets to generate new frequent 2-itemsets; (6) repeating steps (4) and (5) until no further merging is possible, then ending the algorithm; wherein, in step (3), MinHash is used to estimate the set intersection size: for multiple sets $S_1, S_2, \ldots, S_i, \ldots, S_m$, the size of the largest set is $n_{max} = \max_i |S_i|$, and the set intersection size is estimated as

$$X = \frac{\left|\bigcap_i \mathrm{kmin}(S_i)\right| \cdot n_{max}}{k}$$

where $\bigcap_i \mathrm{kmin}(S_i)$ denotes the intersection of the sets $S_i$ in the hash matrix formed by sampling with the MinHash method.

Further, in step (1), the vertical data set is obtained by inverting the original transaction set.

Further, step (2) further includes releasing the vertical data set to save memory.

Further, the minimum support is estimated using MinHash.

Further, the method is applied to association rule mining of multiple time series.

DRAWINGS

Figure 1 is a schematic diagram of the inversion process;

Figure 2 is a schematic diagram of generating the frequent 1-itemsets;

Figure 3 is a schematic diagram of the sampling process;

Figure 4 is a schematic diagram of generating the frequent 2-itemsets;

Figure 5 is a schematic diagram of computing a set intersection with MinHash;

Figure 6 is a schematic diagram of the error in computing a set intersection with MinHash;

Figure 7 shows the speed and accuracy of HashEclat when the minimum element number K is fixed and the error E is varied;

Figure 8 shows the speed and accuracy of HashEclat when the error E is fixed and the minimum element number K is varied;

Figure 9 compares the speed and memory usage of HashEclat and Eclat on T10I4D100K;

Figure 10 compares the speed and memory usage of HashEclat and Eclat on T40I10D100K;

Figure 11 compares the speed and memory usage of HashEclat and Eclat on Online Retail.

Detailed description

The invention will now be further described with reference to the drawings and specific embodiments.

Because time series data is large in volume and generated in real time, the data must be compressed, that is, given a feature representation, before association rules are mined. A feature representation of a time series extracts features from the data and transforms its dimensions, which achieves dimensionality reduction while keeping as much of the original time series information as possible in the low-dimensional space.

First, the present invention examines the TEO feature representation. In a time series, the data on the two sides of a segmentation point often follow different trends, which is analogous to the change in gray level at an image edge in image processing: at an edge, the rate of change of the gray level of the image points changes. If the data before a certain point in the time series tends to increase and the data after it tends to decrease, the point can be regarded as a segmentation point, that is, an edge point of the time series. The TEO representation is a piecewise linear representation that combines the edge detection operator from image processing with the characteristics of time series data. An edge degree is computed by convolving the designed time series edge operator with the original time series. Segmentation points are then selected from the computed edge degrees according to a fixed selection rule, and the segmentation points are connected to represent the time series. The time series is written as $X = \langle x_1, x_2, \ldots, x_n \rangle$, and TEO is defined by equation (1):

$$\mathrm{TEO}(t, u) = \{\, w(i) \cdot (x_{t+i} - x_t) \mid i = -u, \ldots, -2, -1, 0, 1, 2, \ldots, u \,\} \quad (1)$$

where $2u+1$ is the length of the detection window and $w(i)$ is a weight function chosen according to the characteristics of the data. In the experiments of the present invention, the weight is set higher the closer a point is to the center of the detection window.
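As a rough illustration of equation (1), the following sketch computes the weighted differences inside a window of half-width u and adds them up into one edge score per point. The summation, the triangular weight function, and the names `teo_terms`, `edge_score` and `triangular_weight` are assumptions made for illustration; the text only states that the weight grows toward the center of the window.

```python
def teo_terms(x, t, u, w):
    """Weighted differences w(i) * (x[t+i] - x[t]) for i = -u..u, as in equation (1)."""
    return [w(i) * (x[t + i] - x[t]) for i in range(-u, u + 1)]


def edge_score(x, t, u, w):
    """Assumed aggregation: sum the terms of equation (1) into a single edge degree."""
    return sum(teo_terms(x, t, u, w))


def triangular_weight(u):
    """Hypothetical weight function: largest at the window center, smaller at the ends."""
    return lambda i: (u + 1 - abs(i)) / (u + 1)


series = [1, 2, 3, 4, 3, 2, 1]   # rises, then falls; the turning point is index 3
u = 2
w = triangular_weight(u)

# Score every point that has a full window on both sides.
scores = {t: edge_score(series, t, u, w) for t in range(u, len(series) - u)}
peak = max(scores, key=lambda t: abs(scores[t]))
```

With symmetric weights the score is zero on a straight line and largest in magnitude where the trend reverses, so `peak` picks out the turning point of the toy series.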

Traditional data mining algorithms mostly use a horizontal data representation, in which each database transaction consists of a transaction identifier (TID) and its items. A transaction is uniquely identified by its TID and can contain one or more items. The HashEclat algorithm instead uses a vertical data set as its basic data structure. The vertical data set is obtained by inverting the original transaction set; the inversion process is shown in Figure 1. Each record in the vertical data set consists of an item and the set of all transactions in which it occurs (its Tidset). The support count of any itemset can then be obtained by intersecting Tidsets.
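The inversion and the Tidset intersection can be sketched as follows. The transactions are made up for illustration and are not the ones in Figure 1; the function names `invert` and `support` are likewise hypothetical.

```python
from collections import defaultdict


def invert(transactions):
    """Turn a horizontal {tid: items} mapping into a vertical {item: tidset} mapping."""
    vertical = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def support(vertical, itemset):
    """Support count of an itemset: size of the intersection of its items' tidsets."""
    tidsets = [vertical[item] for item in itemset]
    return len(set.intersection(*tidsets))


# Hypothetical horizontal data set.
transactions = {
    1: {"I1", "I2", "I5"},
    2: {"I2", "I4"},
    3: {"I2", "I3"},
    4: {"I1", "I2", "I4"},
}
vertical = invert(transactions)
```

For example, `support(vertical, ["I1", "I2"])` intersects the tidsets of I1 and I2 without rescanning the transactions.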

After forming the vertical data set, the algorithm first prunes the items according to the minimum support, producing the frequent 1-itemsets. At this point the algorithm must save the size of the transaction set of each item I for use in later calculation steps. Taking a minimum support of 3 as an example, the pruning process that generates the frequent 1-itemsets is shown in Figure 2.
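A minimal sketch of this pruning step, assuming the vertical data set is held as a dictionary from item to Tidset. The data and the name `frequent_1_itemsets` are illustrative and not taken from Figure 2.

```python
def frequent_1_itemsets(vertical, min_support):
    """Keep items whose tidset holds at least min_support transactions.

    Also returns the original tidset sizes, which are needed later
    once the tidsets themselves have been replaced by samples.
    """
    kept = {item: tids for item, tids in vertical.items() if len(tids) >= min_support}
    sizes = {item: len(tids) for item, tids in kept.items()}
    return kept, sizes


# Hypothetical vertical data set.
vertical = {
    "I1": {1, 4, 5, 7},
    "I2": {1, 2, 3, 4, 6},
    "I3": {3, 5},
    "I4": {2, 4, 6},
}
kept, sizes = frequent_1_itemsets(vertical, min_support=3)
```

Here I3 is dropped because its Tidset has only two transactions, while the sizes of the surviving Tidsets are recorded for the later estimate.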

At this point, if a Tidset contains too many transactions, the efficiency of the subsequent intersection calculations drops significantly and a lot of memory is consumed. The HashEclat algorithm therefore samples each Tidset using the MinHash method, so that the whole inverted table becomes a hash matrix. The sampling process is shown in Figure 3.

Figure 3 uses the hash function h(x) = (x + 2) mod 6, where x is the row number; applying it is equivalent to a random permutation of the matrix rows. For each item, the smallest permuted row number at which a 1 appears is called its minimum hash value; for example, the minimum hash value of I5 is hmin(I5) = 3. The MinHash method requires a parameter K, meaning that the hash matrix keeps at most K rows. In the example below, K equals 3. Because all subsequent steps are computed on this hash matrix, the original inverted table can be released to save memory.
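The sampling can be sketched with the example hash function h(x) = (x + 2) mod 6. The Tidsets below are hypothetical (they are not read from Figure 3), and keeping the K smallest hash values per Tidset is one assumed reading of "at most K rows".

```python
import heapq


def h(x, shift=2, modulus=6):
    """Hash function of the example: h(x) = (x + 2) mod 6."""
    return (x + shift) % modulus


def min_hash(tidset):
    """Minimum hash value of a tidset."""
    return min(h(x) for x in tidset)


def k_min(tidset, k):
    """Assumed sampling rule: keep the k smallest hash values of the tidset."""
    return set(heapq.nsmallest(k, (h(x) for x in tidset)))


# Hypothetical tidsets over row numbers 0..5.
tidset_i5 = {1, 2}
tidset_i2 = {0, 1, 3, 4, 5}

hmin_i5 = min_hash(tidset_i5)      # h(1) = 3, h(2) = 4
sample_i2 = k_min(tidset_i2, k=3)  # hash values are {2, 3, 5, 0, 1}
```

Only the samples and the original Tidset sizes need to be kept after this step, which is where the memory saving comes from.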

Next, the algorithm uses the hashed frequent 1-itemsets to generate the frequent 2-itemsets: pairs of hashed frequent 1-itemsets are merged to produce new frequent 2-itemsets. The generation process is shown in Figure 4. In outline: (1) generate the vertical data set; (2) prune the candidate sets according to the minimum support to obtain the frequent 1-itemsets, and merge the hashed frequent 1-itemsets to generate new frequent 2-itemsets; (3) repeat steps (1) and (2) until no further merging is possible.
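The merge can be pictured as a pairwise intersection of the sampled Tidsets. In this sketch the sampled hash values are invented for illustration, and `merge_pairs` is a hypothetical helper name; deciding which merged pairs are frequent still requires the size estimate discussed in the text.

```python
from itertools import combinations


def merge_pairs(sketches):
    """Intersect the sampled tidsets of every pair of hashed frequent 1-itemsets."""
    merged = {}
    for a, b in combinations(sorted(sketches), 2):
        merged[(a, b)] = sketches[a] & sketches[b]
    return merged


# Hypothetical samples (sets of hash values), one per frequent item.
sketches = {"I1": {0, 1, 2}, "I2": {0, 2, 3}, "I4": {1, 4, 5}}
pairs = merge_pairs(sketches)
```

Each intersection now touches at most K values per item instead of the full Tidsets.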

Because the intersections are computed on the hash matrix generated by MinHash, the intersection size of the original sets has to be estimated from them. The principle of the MinHash estimate is given in Definition 1.

Definition 1: Estimating the intersection size using MinHash. Given multiple sets $S_1, S_2, \ldots, S_i, \ldots, S_m$, let the size of the largest set be $n_{max} = \max_i |S_i|$, let the set intersection size be $t = |S_1 \cap S_2 \cap \cdots \cap S_m|$, and let $k$ be the MinHash parameter. For $0 < \varepsilon < 1$, the set intersection size is estimated as

$$X = \frac{\left|\bigcap_i \mathrm{kmin}(S_i)\right| \cdot n_{max}}{k}$$

where $\bigcap_i \mathrm{kmin}(S_i)$ denotes the intersection of the sets $S_i$ in the hash matrix formed using the MinHash method.

With probability at least [expression missing], the estimate satisfies [expression missing].

With at least probability [expression missing], this method therefore yields either an (ε, δ) estimate of the set intersection or an upper bound on the set intersection size. The present invention first estimates the intersection size as X=|∩kmin(S_i)|n_max/k and then takes ε=|XA|, where A is the minimum support, k is the MinHash parameter, and n_max is the number of elements in the larger of the two sets. If the estimate X is greater than [expression missing],

then the estimation error can be guaranteed; otherwise, the intersection size can only be computed from the original sets.
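The estimate X = |∩kmin(S_i)| · n_max / k and the fall-back to the original sets can be sketched as below. The bound that X is compared against is not available in the text, so it appears here as a caller-supplied `threshold`, which is purely a placeholder; the data and function names are also illustrative.

```python
def estimate_intersection(sketches, sizes, k):
    """X = |intersection of the k-min samples| * n_max / k."""
    n_max = max(sizes)
    common = set.intersection(*sketches)
    return len(common) * n_max / k


def intersection_size(sketches, tidsets, k, threshold):
    """Return (size, how): the estimate if it exceeds `threshold`, else the exact count.

    `threshold` is a hypothetical stand-in for the bound that X is compared against.
    """
    x = estimate_intersection(sketches, [len(t) for t in tidsets], k)
    if x > threshold:
        return x, "estimated"
    return len(set.intersection(*tidsets)), "exact"


# Hypothetical data with an identity hash, so each sample is the k smallest ids.
tidsets = [set(range(0, 40)), set(range(0, 60, 2))]
k = 10
sketches = [set(sorted(t)[:k]) for t in tidsets]
x = estimate_intersection(sketches, [len(t) for t in tidsets], k)
```

For these toy sets the samples share five values, so X = 5 · 40 / 10 = 20.0, which happens to equal the true intersection size; in general the estimate only approximates it.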

All frequent itemsets can then be computed by repeatedly applying this result. Finally, the total error needs to be calculated.

The complete procedure is: (1) generate a vertical data set; (2) generate a MinHash matrix, for which the parameter k must be specified, meaning that the matrix has at most k rows; (3) use the MinHash matrix to estimate the candidate sets in the original data set; (4) prune the candidate sets according to the minimum support to obtain the frequent 1-itemsets; (5) merge the hashed frequent 1-itemsets to generate new frequent 2-itemsets; (6) repeat steps (4) and (5) until no further merging is possible, then stop the algorithm.
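Steps (1) to (6) can be put together in a compact sketch. This is one possible reading, not the patented implementation: it assumes an injective hash function, keeps the k smallest hash values per Tidset, merges itemsets that share a prefix (as Eclat does), always uses the estimate rather than falling back to the original sets, and carries the largest original Tidset size forward as n_max. The function name, the hash function and the data are hypothetical.

```python
import heapq
from itertools import combinations


def hash_eclat(transactions, min_support, k, hash_fn):
    """Sketch of steps (1)-(6); returns {itemset: estimated support}."""
    # (1) Vertical data set: item -> set of transaction ids.
    vertical = {}
    for tid, items in transactions.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)

    # (2) k-min sample of each frequent item's tidset, plus its true size.
    level = {}
    for item, tids in vertical.items():
        if len(tids) >= min_support:
            sample = set(heapq.nsmallest(k, map(hash_fn, tids)))
            level[(item,)] = (sample, len(tids))
    vertical.clear()  # the inverted table is no longer needed

    frequent = {itemset: float(n) for itemset, (_, n) in level.items()}

    # (5)-(6) Merge itemsets sharing a prefix until nothing new survives.
    while level:
        merged = {}
        for a, b in combinations(sorted(level), 2):
            if a[:-1] != b[:-1]:
                continue
            common = level[a][0] & level[b][0]
            n_max = max(level[a][1], level[b][1])
            estimate = len(common) * n_max / k      # (3) estimated support
            if estimate >= min_support:             # (4) prune
                merged[a + (b[-1],)] = (common, n_max)
                frequent[a + (b[-1],)] = estimate
        level = merged
    return frequent


# Hypothetical transactions and hash function (a permutation of 0..10).
transactions = {
    0: {"a", "b", "c"}, 1: {"a", "b"}, 2: {"a", "c"},
    3: {"b", "c"}, 4: {"a", "b", "c"}, 5: {"d"},
}
result = hash_eclat(transactions, min_support=2, k=4, hash_fn=lambda t: (7 * t + 3) % 11)
```

In this toy run k equals the size of every frequent Tidset, so nothing is actually sampled away and the estimates coincide with the exact supports; with k smaller than the Tidset sizes the results become approximate, which is the trade-off the method accepts.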

Because the HashEclat algorithm uses intersections estimated by MinHash when computing frequent itemsets, two kinds of error arise. The first kind is that an itemset that is actually frequent is estimated to be infrequent; the second is that an itemset that is actually infrequent is estimated to be frequent. Suppose X is estimated to be an infrequent itemset (in Figure 6, X is smaller than A). Then the first kind of error corresponds to Zone2 of Figure 6, the second kind of error is 0, and the total error is Zone2. From Definition 1, the probability that the value falls in Zone3 of Figure 6 is at least [expression missing],

so the probability of falling in Zone1 (the error as defined here) is at most [expression missing].

Figure 6 shows that Zone1 > Zone2, so this is a conservative estimate: the upper bound on the estimation error is guaranteed to be at most [expression missing].

When X is estimated to be a frequent itemset, the same reasoning gives an error upper bound of [expression missing].

The approximate association rule mining algorithm of the present invention is a general-purpose algorithm and is not limited to time series. The experiments therefore use three non-sequential data sets from the UCI website, as shown in Table 1.

Table 1 Experimental data set

HashEclat requires two parameters, the upper error limit E and the MinHash minimum element number K, and both affect the computational efficiency and accuracy of the algorithm. The present invention therefore first runs a set of experiments on the T10I4D100K data set in which one parameter of HashEclat is fixed and the other is varied, and observes the speed and accuracy of the algorithm. Accuracy is measured by the F1 value. After the HashEclat parameters are tuned, the algorithm is compared with the original Eclat algorithm in computational speed on the three data sets.
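The text does not spell out how the F1 value is computed. One plausible reading, sketched here as an assumption, compares the set of frequent itemsets found by the approximate algorithm with the set found by exact Eclat; the itemsets it misses and the itemsets it adds correspond to the two kinds of error described earlier. The function name and data are illustrative.

```python
def compare(exact, approx):
    """Return (missed, spurious, f1) for an approximate set of frequent itemsets."""
    exact, approx = set(exact), set(approx)
    missed = exact - approx      # frequent itemsets reported as infrequent
    spurious = approx - exact    # infrequent itemsets reported as frequent
    hits = len(exact & approx)
    if hits == 0:
        return missed, spurious, 0.0
    precision = hits / len(approx)
    recall = hits / len(exact)
    return missed, spurious, 2 * precision * recall / (precision + recall)


# Hypothetical results of an exact and an approximate run.
exact = {("a",), ("b",), ("c",), ("a", "b")}
approx = {("a",), ("b",), ("a", "b"), ("b", "c")}
missed, spurious, f1 = compare(exact, approx)
```

Here three of the four itemsets agree in each direction, so precision and recall are both 0.75 and so is F1.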

On the data set T10I4D100K, with a minimum support threshold of 350 and the minimum element number K fixed at 100, the error E is varied; the resulting F1 values and running times are shown in Figure 7.

On the data set T10I4D100K, with a minimum support threshold of 350 and the error E fixed at 0.8, the minimum element number K is varied; the resulting F1 values and running times are shown in Figure 8.

The experiments show that the smaller K is, the higher the compression ratio of the matrix and the smaller the amount of data to be computed, so the error increases (the F1 value decreases). Normally, a smaller K also means faster computation. However, if K is too small, HashEclat rarely hits, that is, the estimate can rarely be used, and merges fall back to the original data more often, which slows the algorithm down. E is the maximum error tolerated for a merge, so the smaller E is, the higher the chance of a hit. After a hit the estimate is used, so the error is higher and the speed is faster.

The present invention then compares HashEclat with the original Eclat algorithm in computational speed and running memory on the three data sets, as shown in Figures 9-11.

The experiments show that the HashEclat algorithm is well suited to massive data and to real-time data such as time series streams. The algorithm significantly speeds up association rule mining, so that time series analysis results are obtained in a timely manner. Although HashEclat sacrifices some mining accuracy, it greatly improves mining efficiency and saves machine memory.

The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific embodiments of the present invention are not limited to this description. It will be apparent to those skilled in the art that modifications may be made without departing from the spirit and scope of the invention.

Claims

1. An association rule mining method based on Eclat, characterized in that the method comprises: (1) generating a vertical data set; (2) generating a MinHash matrix, for which a parameter k must be specified, meaning that the matrix has at most k rows; (3) using the MinHash matrix to estimate the candidate sets in the original data set; (4) pruning the candidate sets according to the minimum support to obtain the frequent 1-itemsets; (5) merging the hashed frequent 1-itemsets to generate new frequent 2-itemsets; (6) repeating steps (4) and (5) until no further merging is possible, then ending the algorithm; wherein, in step (3), MinHash is used to estimate the set intersection size: for multiple sets $S_1, S_2, \ldots, S_i, \ldots, S_m$, the size of the largest set is $n_{max} = \max_i |S_i|$, and the set intersection size is estimated as

$$X = \frac{\left|\bigcap_i \mathrm{kmin}(S_i)\right| \cdot n_{max}}{k}$$

where $\bigcap_i \mathrm{kmin}(S_i)$ denotes the intersection of the sets $S_i$ in the hash matrix formed by sampling with the MinHash method.
2. The method according to claim 1, characterized in that in step (1) the vertical data set is obtained by inverting the original transaction set.
3. The method according to claim 1, wherein step (2) further comprises releasing the vertical data set to save memory.
4. The method according to claim 1, wherein the minimum support is estimated using MinHash.
5. The method according to claim 1, wherein the method is applied to association rule mining of multiple time series.