CN112732798A - Sequence data association rule mining method based on fragment clustering

Info

Publication number: CN112732798A
Application number: CN202110186382.1A
Authority: CN
Inventors: 陈红倩; 孙丽萍
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2021-02-18
Filing date: 2021-02-18
Publication date: 2021-04-30

Abstract

The invention relates to a sequence data association rule mining method based on fragment clustering, and belongs to the field of data mining in computer science. The method comprises the following steps: setting parameters; dividing the original sequence data into a set of subsequences using a sliding window algorithm, and normalizing each subsequence; clustering the subsequences using the k-means algorithm, with the DTW algorithm used during clustering to calculate the distance between each subsequence and each center point; merging the clustering results to form an ordered transaction set T; generating frequent item sets based on the transaction set T, and generating association rules; and screening and applying the association rules according to the confidence of each rule.

Description

Sequence data association rule mining method based on fragment clustering

Technical Field

The invention belongs to the field of data mining in computer science, and particularly relates to a mining method for association rules among sequence segments with the same variation trend in sequence data.

Background

Association rule analysis mines correlations within large transaction sets and reveals potential associations among them. Association rule mining can discover interesting associations or interrelationships between item sets in a large amount of data, but it generally mines frequent item sets over the transaction data itself. Time series prediction generally uses regression methods to analyze a time series and find the rule by which the sequence changes over time, but the correlations and association relations among subsequences are seldom mined.

The trend of a sequence is a further abstraction of the sequence and a higher-level aggregation: several transactions with the same content can show different trends in different contexts, and, similarly, different transaction sequences may express similar trends.

In a time series, different trends appear as different data curve shapes. However, multiple series often have very similar shapes without being aligned on the x-axis; that is, two similar time series may differ in length, in which case the conventional Euclidean distance cannot effectively measure the distance between them. Therefore, before the similarity of two subsequences is compared, one or both sequences need to be warped along the time axis to achieve better alignment. The DTW (Dynamic Time Warping) algorithm calculates the similarity between two time series by stretching and shortening them to find the points at which the two waveforms align.
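The warping idea described above can be illustrated with a minimal sketch of the classic dynamic-programming form of DTW. The function name `dtw_distance`, the use of the absolute difference as the pointwise cost, and the sample series are assumptions made for illustration; the source does not specify them.

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Assumption: the pointwise cost is the absolute difference.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # a point may be matched after a step in a, in b, or in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Illustrative series only: same shape, different lengths.
short = [0.0, 1.0, 0.0]
stretched = [0.0, 0.0, 1.0, 1.0, 0.0]
shifted = [1.0, 0.0, 0.0]

same_shape = dtw_distance(short, stretched)   # lengths differ, shape matches
other_shape = dtw_distance(short, shifted)    # equal length, different shape
```

Here the two series of unequal length are at distance zero because one is a stretched copy of the other, a comparison that a pointwise Euclidean distance cannot make at all.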

The invention mainly targets different sequence fragments that have the same or similar change trends in sequence data, and analyzes whether such fragments are correlated and whether one or more change trends in a sequence can lead to another trend. In the existing literature, however, few works perform fragment clustering on sequence data and then mine association rules on that basis.

Disclosure of Invention

In view of the above, the present invention provides a sequence data association rule mining method based on fragment clustering, which mines association patterns among the trends of sequences or subsequences in sequence data: the subsequences are clustered and merged into several sequence trend classes, association rules are then mined over these trend classes, and finally the associations among the variation trends of different sequences are found.

The technical scheme for realizing the sequence data association rule mining method based on the fragment clustering is as follows:

Step one: set the sliding window size w, the number of clusters clusterNum, the number of iterations num_iter, and the threshold mDtw;

Step two: divide the original sequence data S into a subsequence set subS using a sliding window algorithm with window size w;

Step three: normalize each subsequence according to formula (1):

$$v' = \frac{v - \mathrm{vMin}}{\mathrm{vMax} - \mathrm{vMin}} \tag{1}$$

where, for each subsequence, vMax and vMin are the maximum and minimum values of the subsequence, v is a value in the subsequence, and v' is its normalized value;

Step four: cluster the subsequences using the k-means algorithm, as follows:

Step 4.1: randomly select clusterNum center points, and denote the center point of the i-th class as O_i;

Step 4.2: set the iteration counter k to 0;

Step 4.3: for each subsequence, calculate the DTW distance between the subsequence and each center point O_i using the DTW algorithm;

Step 4.4: assign each subsequence to the class of the center point with the minimum DTW distance;

Step 4.5: for each class, calculate the mean O_i′ of all subsequences in the class;

Step 4.6: for each class, calculate the DTW distance between the subsequence mean O_i′ and the current cluster center point O_i of the class, and take the mean O_i′ as the new center point of the class;

Step 4.7: let k = k + 1; if k is greater than or equal to num_iter, go to step five; if k is less than num_iter, go to step 4.8;

Step 4.8: if the DTW distance between O_i′ and O_i calculated in step 4.6 is less than or equal to the set threshold mDtw, go to step five; if it is greater than mDtw, return to step 4.3;

Step five: merge the clustering results to form an ordered transaction set T;

Step six: generate frequent item sets based on the transaction set T, and generate association rules;

Step seven: screen and apply the association rules according to the confidence of each rule;

This ends the sequence data association rule mining method based on fragment clustering.
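The clustering loop of step four (steps 4.1 to 4.8) can be sketched as follows. This is a minimal illustration, not the inventors' implementation. Several points are assumptions: the helper names `dtw` and `kmeans_dtw`, the `seed` parameter, the absolute-difference cost inside DTW, keeping the old center for an empty class, and comparing the largest per-class center shift against mDtw (the source does not say how the per-class distances are combined).

```python
import math
import random

def dtw(a, b):
    """DTW distance with absolute-difference cost (an assumption)."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def kmeans_dtw(subS, clusterNum, num_iter, mDtw, seed=0):
    """k-means over equal-length subsequences, using DTW as the distance."""
    rng = random.Random(seed)  # hypothetical parameter, for reproducibility
    # Step 4.1: pick clusterNum initial center points at random
    centers = [list(c) for c in rng.sample(subS, clusterNum)]
    labels = []
    for _ in range(num_iter):  # steps 4.2 and 4.7: at most num_iter rounds
        # Steps 4.3-4.4: assign each subsequence to its nearest center
        labels = [min(range(clusterNum), key=lambda i: dtw(s, centers[i]))
                  for s in subS]
        shift = 0.0
        for i in range(clusterNum):
            members = [s for s, lab in zip(subS, labels) if lab == i]
            if not members:
                continue  # assumption: an empty class keeps its old center
            # Step 4.5: element-wise mean of the members
            mean = [sum(col) / len(members) for col in zip(*members)]
            # Step 4.6: measure how far the center moves, then update it
            shift = max(shift, dtw(mean, centers[i]))
            centers[i] = mean
        # Step 4.8: stop once the centers move no more than mDtw
        if shift <= mDtw:
            break
    return labels, centers

# Illustrative subsequences only: two flat-low and two flat-high windows.
subS = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.1], [1.0, 1.0, 1.0], [1.0, 0.9, 1.0]]
labels, centers = kmeans_dtw(subS, clusterNum=2, num_iter=100, mDtw=0.1)
```

Because every subsequence produced by a fixed-size window has the same length w, the class mean in step 4.5 can be taken element by element; DTW is used only for the distance computations.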

Beneficial effects:

The method provided by the invention can mine association rules describing the relations between variation trends in sequence data, and can be applied to fragment association mining of time series data.

Drawings

FIG. 1 is a flow chart of the present invention

FIG. 2 is a schematic diagram of sequence warping based on the DTW algorithm

FIG. 3 shows the clustering effect with sliding window w = 5 and clusterNum = 9

FIG. 4 is a trend analysis of the first row of Table 2

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and an embodiment. As shown in FIG. 1, the technical scheme for implementing the sequence data association rule mining method based on fragment clustering is as follows:

In this embodiment, the data set is air quality data. Six features, namely PM2.5, PM10, NO2, CO, O3 and SO2, are selected to form multivariate sequence data; the sliding window size w is set to 5, the number of clusters clusterNum to 9, the number of iterations num_iter to 100, and the threshold mDtw to 0.1;
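With the window size w = 5 of this embodiment, the segmentation and normalization of steps two and three might look like the sketch below. The stride parameter `step`, the function names, the handling of a constant window, and the sample readings are all assumptions made for illustration. The normalization assumes that formula (1) is min-max scaling, as the definitions of vMax and vMin suggest.

```python
def sliding_windows(S, w, step=1):
    """Split S into length-w subsequences.

    `step` is a hypothetical stride; the source does not state whether
    consecutive windows overlap.
    """
    return [S[i:i + w] for i in range(0, len(S) - w + 1, step)]

def normalize(sub):
    """Min-max scale one subsequence to [0, 1] (assumed form of formula (1))."""
    vMin, vMax = min(sub), max(sub)
    if vMax == vMin:
        return [0.0] * len(sub)  # assumption: a flat window maps to zeros
    return [(v - vMin) / (vMax - vMin) for v in sub]

# Made-up readings for one feature, for illustration only.
S = [35.0, 40.0, 55.0, 70.0, 60.0, 45.0, 30.0]
subS = [normalize(sub) for sub in sliding_windows(S, w=5)]
```

Normalizing each window separately removes the absolute level of the readings, so that windows are compared by shape, which is what the trend clustering needs.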

Step 4.2: set the iteration counter k to 0;

Step 4.3: for each subsequence, calculate the DTW distance between the subsequence and each center point O_i using the DTW algorithm; FIG. 2 is a schematic diagram of sequence warping based on the DTW algorithm;

and take the mean O_i′ as the new center point of the class;

step 4.8, if the distance is

In this embodiment, the clustering effect with sliding window w = 5 and clusterNum = 9 is shown in FIG. 3;

Step five: merge the clustering results to form an ordered transaction set T;

In this embodiment, the clustering results of the six sequences are merged to form a transaction set; the specific format is shown in Table 1.

Table 1. Merged transaction set format

Transaction ID	Transaction set
1	5, 0, 2, 0, 7, 4
2	4, 7, 6, 3, 8, 3
3	8, 7, 8, 5, 8, 6
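One way to read Table 1 is that each transaction collects, for one window position, the cluster label assigned to that window in each of the six sequences. The sketch below rebuilds the three rows of Table 1 under that reading. The mapping of columns to features and the function name `merge_labels` are assumptions; the source does not state the column order.

```python
# Cluster labels per window position for each sequence. The values are the
# columns of Table 1; assigning them to these feature names is an assumption.
labels_by_feature = {
    "PM2.5": [5, 4, 8],
    "PM10": [0, 7, 7],
    "NO2": [2, 6, 8],
    "CO": [0, 3, 5],
    "O3": [7, 8, 8],
    "SO2": [4, 3, 6],
}

def merge_labels(labels_by_feature):
    """Zip the per-sequence label lists into one transaction per window."""
    return [list(row) for row in zip(*labels_by_feature.values())]

T = merge_labels(labels_by_feature)
for transaction_id, items in enumerate(T, start=1):
    print(transaction_id, items)
```

Since the dictionary keeps insertion order, the position of a label inside a transaction identifies the sequence it came from.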

Merging the clustering results of the respective sequences in this way yields a transaction set of similar form: dataSet = [[4, 4, 1, 8, 0, 7], [4, 4, 1, 8, 0], [4, 4, 2], [4, 4, 8, 2, 3, 2], [8, 6, 7, 1, 2, 4], [8, 6, 7, 5, 1, 4], [8, 6, 1], [5, 6, 0, 5, 0], [0, 5, 6, 0, 5, 0], [0, 5, 0, 3, 0], [0, 5, 6, 0, 3, 0], [0, 5, 6, 6, 3, 0], [0, 3, 5, 0, 4, 0], [0, 5, 5, 0, 7, 0]];

In this embodiment, association rule analysis is performed with minimum support min_support = 0.07, minimum confidence min_confidence = 0.6, and sliding window w = 5. Because many frequent item sets are generated, only the 15 rules with the highest confidence are shown in Table 2.

Table 2. Partial association rule table with sliding window w = 5
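How frequent item sets and rules can be derived from a transaction set such as dataSet is sketched below with the thresholds of this embodiment. The source does not name the mining algorithm, so this sketch swaps in a plain exhaustive enumeration of item subsets in place of Apriori or a similar method. It also assumes that duplicate labels within a transaction are collapsed and that item order is ignored. The function name `mine_rules` is hypothetical, and the output is not a reproduction of Table 2.

```python
from collections import Counter
from itertools import combinations

dataSet = [[4, 4, 1, 8, 0, 7], [4, 4, 1, 8, 0], [4, 4, 2], [4, 4, 8, 2, 3, 2],
           [8, 6, 7, 1, 2, 4], [8, 6, 7, 5, 1, 4], [8, 6, 1], [5, 6, 0, 5, 0],
           [0, 5, 6, 0, 5, 0], [0, 5, 0, 3, 0], [0, 5, 6, 0, 3, 0],
           [0, 5, 6, 6, 3, 0], [0, 3, 5, 0, 4, 0], [0, 5, 5, 0, 7, 0]]

def mine_rules(transactions, min_support, min_confidence):
    """Exhaustive frequent-item-set and rule mining for small transactions."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # assumption: duplicates and order are ignored
        for size in range(1, len(items) + 1):
            counts.update(frozenset(c) for c in combinations(items, size))
    # Keep the item sets whose relative support reaches the threshold
    frequent = {s: c for s, c in counts.items() if c / n >= min_support}
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # confidence(lhs -> rhs) = support(itemset) / support(lhs)
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), count / n, confidence))
    rules.sort(key=lambda r: r[3], reverse=True)  # highest confidence first
    return frequent, rules

frequent, rules = mine_rules(dataSet, min_support=0.07, min_confidence=0.6)
for lhs, rhs, support, confidence in rules[:15]:
    print(lhs, "->", rhs, round(support, 3), round(confidence, 3))
```

With 14 transactions, a minimum support of 0.07 admits any item set that occurs at least once, which is consistent with the remark that many frequent item sets are generated.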

In this embodiment, trend analysis is performed on the first row of Table 2, as shown in FIG. 4; the confidence of the occurrence of category 1 in the case of simultaneous occurrence of

Claims

1. A sequence data association rule mining method based on fragment clustering, characterized by comprising the following steps:

Step 4.2: set the iteration counter k to 0;

and take the mean O_i′ as the new center point of the class;

step 4.8, if the distance is

Step five: merge the clustering results to form an ordered transaction set T;