CN112732798A - Sequence data association rule mining method based on fragment clustering - Google Patents
Sequence data association rule mining method based on fragment clustering
- Publication number
- CN112732798A (application CN202110186382.1A)
- Authority
- CN
- China
- Prior art keywords
- subsequence
- clustering
- class
- association rule
- sequence data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a sequence data association rule mining method based on fragment clustering, and belongs to the field of data mining in computer science. The method comprises the following steps: setting parameters; dividing the original sequence data into a set of subsequences using a sliding window algorithm, and normalizing each subsequence; clustering the subsequences using the k-means algorithm, with the distance between each subsequence and each center point calculated by the DTW algorithm during clustering; merging the clustering results to form an ordered transaction set T; generating frequent item sets from the transaction set T, and generating association rules; and screening and applying the association rules according to the confidence of each rule.
Description
Technical Field
The invention belongs to the field of data mining in computer science, and particularly relates to a mining method for association rules among sequence segments with the same variation trend in sequence data.
Background
Association rule analysis can mine the correlations within a large collection of transactions and reveal potential associations among them. Association rule mining discovers interesting associations or interrelationships between item sets in large amounts of data, but it generally mines frequent item sets of the transaction data itself. Time series prediction generally uses regression methods to analyze a time series and find how the sequence changes over time, but the correlations and association relations among subsequences are seldom mined.
The trend of a sequence is a further abstraction of the sequence, a higher-level aggregation: several transactions with the same content can show different trends in different contexts, and, similarly, different transaction sequences may express similar trends.
In a time series, different trends appear as different data curve shapes, but there are often several series with very similar shapes that are not aligned on the x-axis; that is, two similar time series may differ in length, in which case the conventional Euclidean distance cannot effectively measure the distance between them. Therefore, before the similarity of two subsequences is compared, one or both of them need to be warped along the time axis to achieve better alignment. The DTW (Dynamic Time Warping) algorithm calculates the similarity between two time series by stretching and shortening them to find the points at which the two waveforms align.
The invention is mainly directed at different sequence fragments in sequence data that have the same or similar variation trends; it analyzes whether these fragments are correlated and whether one or several variation trends in a sequence lead to another trend. In the existing literature, however, little work performs fragment clustering on sequence data and then mines association rules on that basis.
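As an illustration of the warping idea, the following is a minimal sketch of the textbook DTW recurrence for one-dimensional series, not the patent's own implementation. The function name `dtw_distance` and the use of the absolute difference as the point-wise cost are assumptions made for this example.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D numeric sequences.

    Assumption: the point-wise cost is the absolute difference |a_i - b_j|.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments:
            # advance in a only, advance in b only, or advance in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because one point may be matched to several points of the other series, two series of different lengths with the same shape, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have a DTW distance of zero, whereas a Euclidean distance is not even defined for them.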
Disclosure of Invention
In view of the above, the present invention provides a sequence data association rule mining method based on fragment clustering, which mines association patterns among the trends of subsequences in sequence data: the subsequences are clustered and merged into several sequence trend classes, association rules are then mined over these trend classes, and finally the associations among the variation trends of different sequences are found.
The technical scheme for realizing the sequence data association rule mining method based on the fragment clustering is as follows:
Step one, setting the sliding window size w, the number of clusters clusterNum, the number of iterations num_iter, and the threshold mDtw;
Step two, dividing the original sequence data S into a subsequence set subS using a sliding window algorithm with window size w;
Step three, normalizing each subsequence, the normalization method being shown in formula (1):
$$v' = \frac{v - vMin}{vMax - vMin} \quad (1)$$
where, for each subsequence, vMax and vMin are the maximum and minimum values of the subsequence, v is a value in the subsequence, and v' is its normalized value;
Step four, clustering the subsequences using the k-means algorithm, the clustering method being as follows:
Step 4.1, randomly selecting clusterNum center points, the center point of the i-th class being denoted O_i;
Step 4.2, setting the iteration count k to 0;
Step 4.3, for each subsequence, calculating the DTW distance between the subsequence and each center point O_i using the DTW algorithm;
Step 4.4, assigning each subsequence to the class of the center point with the smallest DTW distance;
Step 4.5, for each class, calculating the mean O_i' of all subsequences in the class;
Step 4.6, for each class, calculating the DTW distance between the subsequence mean O_i' and the current cluster center point O_i of the class, and taking the mean O_i' as the new center point of the class;
Step 4.7, setting k = k + 1; if k is greater than or equal to num_iter, executing step five; if k is less than num_iter, executing step 4.8;
Step 4.8, if the distance calculated in step 4.6 is less than or equal to the set threshold mDtw, executing step five; if the distance is greater than the set threshold mDtw, executing step 4.3;
Step five, merging the clustering results to form an ordered transaction set T;
Step six, generating frequent item sets based on the transaction set T, and generating association rules;
Step seven, screening and applying the association rules according to the confidence of each association rule;
at this point, the sequence data association rule mining method based on fragment clustering is complete.
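Steps two to four above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the window stride (`step`, default 1), the handling of constant windows and empty classes, the random seed, and the use of the largest per-class center shift as "the distance" compared with mDtw are all choices made for the example, since the text does not fix them. The normalization is taken to be min-max scaling, as the definitions of vMax and vMin suggest.

```python
import math
import random


def sliding_windows(S, w, step=1):
    """Step two: cut S into subsequences of length w (stride is an assumption)."""
    return [S[i:i + w] for i in range(0, len(S) - w + 1, step)]


def min_max_normalize(sub):
    """Step three: scale a subsequence into [0, 1] using its own min and max."""
    v_min, v_max = min(sub), max(sub)
    if v_max == v_min:
        # Assumption: a constant window maps to all zeros to avoid dividing by zero.
        return [0.0] * len(sub)
    return [(v - v_min) / (v_max - v_min) for v in sub]


def dtw(a, b):
    """Textbook DTW distance with absolute difference as point-wise cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def kmeans_dtw(subs, cluster_num, num_iter, m_dtw, seed=0):
    """Step four: k-means over equal-length subsequences with DTW assignment."""
    rng = random.Random(seed)
    # Step 4.1: pick cluster_num subsequences at random as initial centers.
    centers = [list(c) for c in rng.sample(subs, cluster_num)]
    labels = [0] * len(subs)
    for _ in range(num_iter):  # steps 4.2 and 4.7: at most num_iter rounds
        # Steps 4.3 and 4.4: assign each subsequence to its nearest center.
        labels = [min(range(cluster_num), key=lambda i: dtw(s, centers[i]))
                  for s in subs]
        shift = 0.0
        for i in range(cluster_num):
            members = [s for s, lab in zip(subs, labels) if lab == i]
            if not members:
                continue  # Assumption: an empty class keeps its old center.
            # Step 4.5: point-wise mean of the class members.
            mean = [sum(col) / len(members) for col in zip(*members)]
            # Step 4.6: measure how far the center moves, then update it.
            shift = max(shift, dtw(mean, centers[i]))
            centers[i] = mean
        if shift <= m_dtw:  # step 4.8: stop once the centers are stable
            break
    return labels, centers
```

A call such as `kmeans_dtw([min_max_normalize(x) for x in sliding_windows(S, 5)], 9, 100, 0.1)` mirrors the parameter names w, clusterNum, num_iter, and mDtw; the returned labels are the class numbers that step five merges into transactions.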
Advantageous Effects
The method provided by the invention mines association rules describing the relations between variation trends in sequence data, and can be applied to fragment association mining of time series data.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of sequence warping based on the DTW algorithm;
FIG. 3 shows the clustering effect with sliding window w = 5 and clusterNum = 9.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings and an embodiment. As shown in FIG. 1, the technical scheme for implementing the sequence data association rule mining method based on fragment clustering is as follows:
Step one, setting the sliding window size w, the number of clusters clusterNum, the number of iterations num_iter, and the threshold mDtw;
In this embodiment, the data set is air quality data; six features, namely PM2.5, PM10, NO2, CO, O3, and SO2, are selected to form multivariate sequence data; the sliding window size w is set to 5, the number of clusters clusterNum to 9, the number of iterations num_iter to 100, and the threshold mDtw to 0.1;
Step two, dividing the original sequence data S into a subsequence set subS using a sliding window algorithm with window size w;
Step three, normalizing each subsequence, the normalization method being shown in formula (1):
$$v' = \frac{v - vMin}{vMax - vMin} \quad (1)$$
where, for each subsequence, vMax and vMin are the maximum and minimum values of the subsequence, v is a value in the subsequence, and v' is its normalized value;
Step four, clustering the subsequences using the k-means algorithm, the clustering method being as follows:
Step 4.1, randomly selecting clusterNum center points, the center point of the i-th class being denoted O_i;
Step 4.2, setting the iteration count k to 0;
Step 4.3, for each subsequence, calculating the DTW distance between the subsequence and each center point O_i using the DTW algorithm; FIG. 2 is a schematic diagram of sequence warping based on the DTW algorithm;
Step 4.4, assigning each subsequence to the class of the center point with the smallest DTW distance;
Step 4.5, for each class, calculating the mean O_i' of all subsequences in the class;
Step 4.6, for each class, calculating the DTW distance between the subsequence mean O_i' and the current cluster center point O_i of the class, and taking the mean O_i' as the new center point of the class;
Step 4.7, setting k = k + 1; if k is greater than or equal to num_iter, executing step five; if k is less than num_iter, executing step 4.8;
Step 4.8, if the distance calculated in step 4.6 is less than or equal to the set threshold mDtw, executing step five; if the distance is greater than the set threshold mDtw, executing step 4.3;
In this embodiment, the clustering effect with sliding window w = 5 and clusterNum = 9 is shown in FIG. 3;
Step five, merging the clustering results to form an ordered transaction set T;
In this embodiment, the clustering results of the six sequences are merged to form a transaction set, whose format is shown in Table 1:
Table 1: Merged transaction set format
Transaction ID | Transaction set |
---|---|
1 | 5,0,2,0,7,4 |
2 | 4,7,6,3,8,3 |
3 | 8,7,8,5,8,6 |
The clustering results of the respective sequences are merged into a transaction set of the form dataSet = [[4, 4, 1, 8, 0, 7], [4, 4, 1, 8, 0], [4, 4, 2], [4, 4, 8, 2, 3, 2], [8, 6, 7, 1, 2, 4], [8, 6, 7, 5, 1, 4], [8, 6, 1], [5, 6, 0, 5, 0], [0, 5, 6, 0, 5, 0], [0, 5, 0, 3, 0], [0, 5, 6, 0, 3, 0], [0, 5, 6, 6, 3, 0], [0, 3, 5, 0, 4, 0], [0, 5, 5, 0, 7, 0]];
Step six, generating frequent item sets based on the transaction set T, and generating association rules;
In this embodiment, association rule analysis is performed with minimum support min_support = 0.07, minimum confidence min_confidence = 0.6, and sliding window w = 5; because many frequent item sets are generated, only the 15 rules with the highest confidence are shown in Table 2;
Table 2: Partial association rules with sliding window w = 5
Step seven, screening and applying the association rules according to the confidence of each association rule;
In this embodiment, trend analysis is performed on the first row of Table 2, as shown in FIG. 4: when classes 2, 3, and 5 occur simultaneously, the confidence that class 1 occurs is 1, which indicates that this subsequence trend is certain to occur; trend analysis is likewise performed on the third row: when classes 4 and 5 occur simultaneously, the confidence that class 7 occurs is 0.818, which indicates that this sequence trend occurs with high probability;
at this point, the sequence data association rule mining method based on fragment clustering is complete.
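Steps six and seven can be illustrated with a small level-wise (Apriori-style) search. This is a hedged sketch rather than the embodiment's actual code: the function names `frequent_itemsets` and `association_rules` are hypothetical, and each transaction is treated as a set of class labels, so repeated labels within one transaction count once. The text does not state whether the embodiment does the same. No attempt is made to reproduce Table 2.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    tx = [frozenset(t) for t in transactions]  # assumption: duplicates collapse
    n = len(tx)
    current = [frozenset([item]) for item in sorted(set().union(*tx))]
    freq = {}
    size = 1
    while current:
        # Support = fraction of transactions that contain the candidate.
        level = {}
        for cand in current:
            support = sum(1 for t in tx if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        freq.update(level)
        # Join frequent sets of this size into candidates one item larger.
        size += 1
        current = list({a | b for a in level for b in level if len(a | b) == size})
    return freq


def association_rules(freq, min_confidence):
    """Return (antecedent, consequent, confidence) sorted by confidence."""
    rules = []
    for itemset, support in freq.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
                confidence = support / freq[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return sorted(rules, key=lambda rule: -rule[2])


# Transaction set as listed in the embodiment, with its stated thresholds.
dataSet = [[4, 4, 1, 8, 0, 7], [4, 4, 1, 8, 0], [4, 4, 2], [4, 4, 8, 2, 3, 2],
           [8, 6, 7, 1, 2, 4], [8, 6, 7, 5, 1, 4], [8, 6, 1], [5, 6, 0, 5, 0],
           [0, 5, 6, 0, 5, 0], [0, 5, 0, 3, 0], [0, 5, 6, 0, 3, 0],
           [0, 5, 6, 6, 3, 0], [0, 3, 5, 0, 4, 0], [0, 5, 5, 0, 7, 0]]
rules = association_rules(frequent_itemsets(dataSet, 0.07), 0.6)
for lhs, rhs, confidence in rules[:5]:
    print(sorted(lhs), "->", sorted(rhs), round(confidence, 3))
```

Screening in step seven then amounts to keeping the rules at the head of this sorted list; a rule with confidence 1 means that, in the mined data, the consequent classes appeared in every transaction that contained the antecedent classes.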
Claims (1)
1. A sequence data association rule mining method based on fragment clustering, characterized by comprising the following steps:
Step one, setting the sliding window size w, the number of clusters clusterNum, the number of iterations num_iter, and the threshold mDtw;
Step two, dividing the original sequence data S into a subsequence set subS using a sliding window algorithm with window size w;
Step three, normalizing each subsequence, the normalization method being shown in formula (1):
$$v' = \frac{v - vMin}{vMax - vMin} \quad (1)$$
where, for each subsequence, vMax and vMin are the maximum and minimum values of the subsequence, v is a value in the subsequence, and v' is its normalized value;
Step four, clustering the subsequences using the k-means algorithm, the clustering method being as follows:
Step 4.1, randomly selecting clusterNum center points, the center point of the i-th class being denoted O_i;
Step 4.2, setting the iteration count k to 0;
Step 4.3, for each subsequence, calculating the DTW distance between the subsequence and each center point O_i using the DTW algorithm;
Step 4.4, assigning each subsequence to the class of the center point with the smallest DTW distance;
Step 4.5, for each class, calculating the mean O_i' of all subsequences in the class;
Step 4.6, for each class, calculating the DTW distance between the subsequence mean O_i' and the current cluster center point O_i of the class, and taking the mean O_i' as the new center point of the class;
Step 4.7, setting k = k + 1; if k is greater than or equal to num_iter, executing step five; if k is less than num_iter, executing step 4.8;
Step 4.8, if the distance calculated in step 4.6 is less than or equal to the set threshold mDtw, executing step five; if the distance is greater than the set threshold mDtw, executing step 4.3;
Step five, merging the clustering results to form an ordered transaction set T;
Step six, generating frequent item sets based on the transaction set T, and generating association rules;
Step seven, screening and applying the association rules according to the confidence of each association rule;
at this point, the sequence data association rule mining method based on fragment clustering is complete.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110186382.1A CN112732798A (en) | 2021-02-18 | 2021-02-18 | Sequence data association rule mining method based on fragment clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110186382.1A CN112732798A (en) | 2021-02-18 | 2021-02-18 | Sequence data association rule mining method based on fragment clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112732798A (en) | 2021-04-30 |
Family
ID=75596700
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110186382.1A Pending CN112732798A (en) | 2021-02-18 | 2021-02-18 | Sequence data association rule mining method based on fragment clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112732798A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113822570A (en) * | 2021-09-20 | 2021-12-21 | 河南惠誉网络科技有限公司 | Enterprise production data storage method and system based on big data analysis |
-
2021
- 2021-02-18 CN CN202110186382.1A patent/CN112732798A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113822570A (en) * | 2021-09-20 | 2021-12-21 | 河南惠誉网络科技有限公司 | Enterprise production data storage method and system based on big data analysis |
CN113822570B (en) * | 2021-09-20 | 2023-09-26 | 北京瀚博网络科技有限公司 | Enterprise production data storage method and system based on big data analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210430 |