CN112732798A - Sequence data association rule mining method based on fragment clustering - Google Patents

Sequence data association rule mining method based on fragment clustering

Info

Publication number
CN112732798A
Authority
CN
China
Prior art keywords
subsequence
clustering
class
association rule
sequence data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110186382.1A
Other languages
Chinese (zh)
Inventor
陈红倩
孙丽萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University
Priority to CN202110186382.1A
Publication of CN112732798A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a sequence data association rule mining method based on fragment clustering, and belongs to the field of data mining in computer science. The method comprises the following steps: setting parameters; dividing the original sequence data into a set of subsequences using a sliding-window algorithm, and normalizing each subsequence; clustering the subsequences using the k-means algorithm, with the distance between each subsequence and each center point computed by the DTW algorithm during clustering; merging the clustering results to form an ordered transaction set T; generating frequent itemsets from the transaction set T, and generating association rules; and screening and applying the association rules according to the confidence of each rule.

Description

Sequence data association rule mining method based on fragment clustering
Technical Field
The invention belongs to the field of data mining in computer science, and particularly relates to a mining method for association rules among sequence segments with the same variation trend in sequence data.
Background
Association rule analysis can mine correlations within large collections of transactions and reveal latent associations among them. Association rule mining discovers interesting associations or correlations between itemsets in large amounts of data, but it generally mines frequent itemsets only from the transaction data itself. Time series prediction generally uses regression methods to analyze a time series and find how the sequence changes over time, but it rarely mines the correlations and associations among subsequences.
The trend of a sequence is a further abstraction of the sequence, a higher-level aggregation: transactions with the same content can show different trends in different contexts, and, likewise, different transaction sequences may express similar trends.
In a time series, different trends appear as different data curve shapes. However, series with very similar shapes are often not aligned on the x-axis, and two similar time series may differ in length; in such cases the conventional Euclidean distance cannot effectively measure the distance between them. Therefore, before the similarity of two subsequences is compared, one or both sequences need to be warped along the time axis to achieve better alignment. The DTW (Dynamic Time Warping) algorithm computes the similarity between two time series by stretching and compressing them along the time axis to find the points at which the two waveforms align.
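As an illustration of the warping idea described above, the following is a minimal sketch of the classic dynamic-programming form of DTW. It is not taken from the patent: the function name dtw_distance and the use of the absolute difference as the local cost are assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two 1-D sequences (illustrative sketch only)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost: assumed absolute difference
            cost[i][j] = d + min(
                cost[i - 1][j],      # b[j-1] is matched to one more point of a
                cost[i][j - 1],      # a[i-1] is matched to one more point of b
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Two curves with the same shape but different lengths.
x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
print(dtw_distance(x, y))  # 0.0: y is x with some points repeated
```

A point-by-point Euclidean distance is not even defined for x and y here because their lengths differ, whereas DTW aligns the repeated points and reports that the shapes coincide.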
The invention is mainly concerned with different sequence fragments in sequence data that have the same or similar change trends. It analyzes whether such fragments are correlated, and whether one or more change trends in a sequence can lead to another trend. However, the existing literature contains few works that perform fragment clustering on sequence data and then mine association rules on that basis.
Disclosure of Invention
In view of the above, the present invention provides a sequence data association rule mining method based on fragment clustering for mining association patterns among the trends of sequences or subsequences in sequence data. The method clusters the subsequences of the sequences, merges them into several sequence-trend classes, then mines association rules over these trends, and finally finds the associations among the variation trends of different sequences.
The technical scheme for realizing the sequence data association rule mining method based on the fragment clustering is as follows:
Step one: set the sliding-window size w, the number of clusters clusterNum, the maximum number of iterations num_iter, and the threshold mDtw.
Step two: divide the original sequence data S into a set of subsequences subS using a sliding-window algorithm with window size w.
Step three: normalize each subsequence according to formula (1):
$$v' = \frac{v - vMin}{vMax - vMin} \qquad (1)$$
where, for each subsequence, vMax and vMin are the maximum and minimum values of the subsequence, v is a value in the subsequence, and v' is the normalized value.
Step four: cluster the subsequences using the k-means algorithm, as follows:
Step 4.1: randomly select clusterNum center points, and denote the center point of the i-th class as $O_i$.
Step 4.2: set the iteration counter k = 0.
Step 4.3: for each subsequence, compute the DTW distance between the subsequence and each center point $O_i$ using the DTW algorithm.
Step 4.4: assign each subsequence to the class whose center point has the smallest DTW distance to it.
Step 4.5: for each class, compute the mean $O_i'$ of all subsequences in the class.
Step 4.6: for each class, compute the DTW distance $DTW(O_i', O_i)$ between the subsequence mean $O_i'$ and the current cluster center point $O_i$ of the class, and then take the mean $O_i'$ as the new center point of the class.
Step 4.7: set k = k + 1; if k is greater than or equal to num_iter, go to step five; if k is less than num_iter, go to step 4.8.
Step 4.8: if the distance $DTW(O_i', O_i)$ is less than or equal to the threshold mDtw, go to step five; if it is greater than mDtw, return to step 4.3.
Step five: merge the clustering results to form an ordered transaction set T.
Step six: generate frequent itemsets from the transaction set T, and generate association rules.
Step seven: screen and apply the association rules according to the confidence of each rule.
This completes the sequence data association rule mining method based on fragment clustering.
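Steps two to four can be sketched in Python roughly as follows. This is an illustrative reading, not the patent's implementation, and it rests on several assumptions: formula (1) is taken to be min-max normalization (suggested by the vMax/vMin definitions), the window stride defaults to w, initial centers are sampled from the subsequences themselves, an empty class keeps its old center, and convergence is tested on the largest center shift over all classes. All function names are hypothetical.

```python
import random


def sliding_windows(seq, w, stride=None):
    """Step two: cut seq into length-w subsequences (stride is an assumption)."""
    stride = stride or w
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, stride)]


def min_max(sub):
    """Step three: min-max normalization; a constant window maps to zeros (assumption)."""
    v_min, v_max = min(sub), max(sub)
    if v_max == v_min:
        return [0.0] * len(sub)
    return [(v - v_min) / (v_max - v_min) for v in sub]


def dtw(a, b):
    """DTW distance with absolute-difference local cost, keeping one row at a time."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, 1):
            cur.append(abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


def kmeans_dtw(subs, cluster_num, num_iter, m_dtw, seed=0):
    """Step four: k-means with DTW used for assignment and for the stopping test."""
    rng = random.Random(seed)
    centers = rng.sample(subs, cluster_num)                  # step 4.1
    labels = []
    for _ in range(num_iter):                                # steps 4.2 and 4.7
        # Steps 4.3-4.4: assign each subsequence to its nearest center under DTW.
        labels = [min(range(cluster_num), key=lambda c: dtw(s, centers[c]))
                  for s in subs]
        # Step 4.5: element-wise mean of the members of each class.
        new_centers = []
        for c in range(cluster_num):
            members = [s for s, lab in zip(subs, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])               # empty class: keep old center
        # Step 4.6: DTW shift between new and old centers, then update the centers.
        shift = max(dtw(n, o) for n, o in zip(new_centers, centers))
        centers = new_centers
        if shift <= m_dtw:                                   # step 4.8
            break
    return labels, centers


series = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
subs = [min_max(s) for s in sliding_windows(series, w=5)]
labels, centers = kmeans_dtw(subs, cluster_num=2, num_iter=100, m_dtw=0.1)
print(labels)
```

The returned labels give one class index per window, which is the raw material that step five merges into transactions.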
Advantageous Effects:
The method provided by the invention can mine association rules describing the relationships between variation trends in sequence data, and can be applied to fragment association mining of time-series data.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a sequence warping diagram based on the DTW algorithm.
FIG. 3 shows the clustering result for sliding window w = 5 and clusterNum = 9.
FIG. 4 is a trend analysis of the first row of Table 2.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings and an embodiment. Taking FIG. 1 as an example, the technical scheme for implementing the fragment-clustering-based sequence data association rule mining method is as follows:
Step one: set the sliding-window size w, the number of clusters clusterNum, the maximum number of iterations num_iter, and the threshold mDtw.
In this embodiment, the data set is air quality data. Six features, namely PM2.5, PM10, NO2, CO, O3 and SO2, are selected to form multivariate sequence data; the sliding-window size w is set to 5, the number of clusters clusterNum to 9, the number of iterations num_iter to 100, and the threshold mDtw to 0.1.
Step two: divide the original sequence data S into a set of subsequences subS using a sliding-window algorithm with window size w.
Step three: normalize each subsequence according to formula (1):
$$v' = \frac{v - vMin}{vMax - vMin} \qquad (1)$$
where, for each subsequence, vMax and vMin are the maximum and minimum values of the subsequence, v is a value in the subsequence, and v' is the normalized value.
Step four: cluster the subsequences using the k-means algorithm, as follows:
Step 4.1: randomly select clusterNum center points, and denote the center point of the i-th class as $O_i$.
Step 4.2: set the iteration counter k = 0.
Step 4.3: for each subsequence, compute the DTW distance between the subsequence and each center point $O_i$ using the DTW algorithm. FIG. 2 is a schematic diagram of sequence warping based on the DTW algorithm.
Step 4.4: assign each subsequence to the class whose center point has the smallest DTW distance to it.
Step 4.5: for each class, compute the mean $O_i'$ of all subsequences in the class.
Step 4.6: for each class, compute the DTW distance $DTW(O_i', O_i)$ between the subsequence mean $O_i'$ and the current cluster center point $O_i$ of the class, and then take the mean $O_i'$ as the new center point of the class.
Step 4.7: set k = k + 1; if k is greater than or equal to num_iter, go to step five; if k is less than num_iter, go to step 4.8.
Step 4.8: if the distance $DTW(O_i', O_i)$ is less than or equal to the threshold mDtw, go to step five; if it is greater than mDtw, return to step 4.3.
In this embodiment, the clustering result for sliding window w = 5 and clusterNum = 9 is shown in FIG. 3.
Step five: merge the clustering results to form an ordered transaction set T.
In this embodiment, the clustering results of the six sequences are merged to form a transaction set; the format is shown in Table 1.
Table 1. Format of the merged transaction set
Transaction ID    Transaction set
1                 5, 0, 2, 0, 7, 4
2                 4, 7, 6, 3, 8, 3
3                 8, 7, 8, 5, 8, 6
The clustering results of the individual sequences are merged into a similar transaction set: dataSet = [[4, 4, 1, 8, 0, 7], [4, 4, 1, 8, 0], [4, 4, 2], [4, 4, 8, 2, 3, 2], [8, 6, 7, 1, 2, 4], [8, 6, 7, 5, 1, 4], [8, 6, 1], [5, 6, 0, 5, 0], [0, 5, 6, 0, 5, 0], [0, 5, 0, 3, 0], [0, 5, 6, 0, 3, 0], [0, 5, 6, 6, 3, 0], [0, 3, 5, 0, 4, 0], [0, 5, 5, 0, 7, 0]].
Step six: generate frequent itemsets from the transaction set T, and generate association rules.
In this embodiment, association rule analysis is performed with minimum support min_support = 0.07, minimum confidence min_confidence = 0.6, and sliding window w = 5. Because many frequent itemsets are generated, only the 15 rules with the highest confidence are shown in Table 2.
Table 2. Partial association rules for sliding window w = 5
[Table 2 is provided as images in the original publication: Figure BSA0000233562090000051, Figure BSA0000233562090000061]
Step seven: screen and apply the association rules according to the confidence of each rule.
In this embodiment, trend analysis is performed on the first row of Table 2, as shown in FIG. 4: the confidence that class 1 occurs when classes 2, 3 and 5 occur together is 1, indicating that this subsequence trend is certain to occur. Trend analysis is likewise performed on the third row: the confidence that class 7 occurs when classes 4 and 5 occur together is 0.818, indicating that this sequence trend is highly likely to occur.
This completes the sequence data association rule mining method based on fragment clustering.
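Steps six and seven can be sketched as a small level-wise (Apriori-style) search over the dataSet listed in the embodiment, using the stated thresholds min_support = 0.07 and min_confidence = 0.6. The patent does not name the frequent-itemset algorithm it uses, so the Apriori-style search, the function names, and the choice to treat each transaction as a set (so repeated class labels within a transaction collapse) are all assumptions of this sketch.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise frequent-itemset search; returns {itemset: support}."""
    sets = [frozenset(t) for t in transactions]  # repeated labels collapse (assumption)
    n = len(sets)

    def support(items):
        return sum(items <= s for s in sets) / n

    level = {frozenset([i]) for s in sets for i in s}
    result = {}
    while level:
        kept = {c: support(c) for c in level}
        kept = {c: s for c, s in kept.items() if s >= min_support}
        result.update(kept)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        level = {a | b for a in kept for b in kept if len(a | b) == len(a) + 1}
    return result


def association_rules(freq, min_confidence):
    """Return (antecedent, consequent, confidence) triples, highest confidence first."""
    rules = []
    for items, sup in freq.items():
        for r in range(1, len(items)):
            for lhs in map(frozenset, combinations(items, r)):
                conf = sup / freq[lhs]  # every subset of a frequent itemset is frequent
                if conf >= min_confidence:
                    rules.append((set(lhs), set(items - lhs), conf))
    return sorted(rules, key=lambda rule: -rule[2])


# dataSet as listed in the embodiment.
data_set = [
    [4, 4, 1, 8, 0, 7], [4, 4, 1, 8, 0], [4, 4, 2], [4, 4, 8, 2, 3, 2],
    [8, 6, 7, 1, 2, 4], [8, 6, 7, 5, 1, 4], [8, 6, 1], [5, 6, 0, 5, 0],
    [0, 5, 6, 0, 5, 0], [0, 5, 0, 3, 0], [0, 5, 6, 0, 3, 0], [0, 5, 6, 6, 3, 0],
    [0, 3, 5, 0, 4, 0], [0, 5, 5, 0, 7, 0],
]
freq = frequent_itemsets(data_set, min_support=0.07)
rules = association_rules(freq, min_confidence=0.6)
for lhs, rhs, conf in rules[:5]:
    print(lhs, "->", rhs, round(conf, 3))
```

With only 14 transactions, a support threshold of 0.07 admits any itemset that occurs once, so this toy run produces many rules; it should not be expected to reproduce Table 2, which was derived from the full air quality data.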

Claims (1)

1. A sequence data association rule mining method based on fragment clustering, characterized by comprising the following steps:
Step one: set the sliding-window size w, the number of clusters clusterNum, the maximum number of iterations num_iter, and the threshold mDtw.
Step two: divide the original sequence data S into a set of subsequences subS using a sliding-window algorithm with window size w.
Step three: normalize each subsequence according to formula (1):
$$v' = \frac{v - vMin}{vMax - vMin} \qquad (1)$$
where, for each subsequence, vMax and vMin are the maximum and minimum values of the subsequence, v is a value in the subsequence, and v' is the normalized value.
Step four: cluster the subsequences using the k-means algorithm, as follows:
Step 4.1: randomly select clusterNum center points, and denote the center point of the i-th class as $O_i$.
Step 4.2: set the iteration counter k = 0.
Step 4.3: for each subsequence, compute the DTW distance between the subsequence and each center point $O_i$ using the DTW algorithm.
Step 4.4: assign each subsequence to the class whose center point has the smallest DTW distance to it.
Step 4.5: for each class, compute the mean $O_i'$ of all subsequences in the class.
Step 4.6: for each class, compute the DTW distance $DTW(O_i', O_i)$ between the subsequence mean $O_i'$ and the current cluster center point $O_i$ of the class, and then take the mean $O_i'$ as the new center point of the class.
Step 4.7: set k = k + 1; if k is greater than or equal to num_iter, go to step five; if k is less than num_iter, go to step 4.8.
Step 4.8: if the distance $DTW(O_i', O_i)$ is less than or equal to the threshold mDtw, go to step five; if it is greater than mDtw, return to step 4.3.
Step five: merge the clustering results to form an ordered transaction set T.
Step six: generate frequent itemsets from the transaction set T, and generate association rules.
Step seven: screen and apply the association rules according to the confidence of each rule.
This completes the sequence data association rule mining method based on fragment clustering.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110186382.1A CN112732798A (en) 2021-02-18 2021-02-18 Sequence data association rule mining method based on fragment clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110186382.1A CN112732798A (en) 2021-02-18 2021-02-18 Sequence data association rule mining method based on fragment clustering

Publications (1)

Publication Number Publication Date
CN112732798A true CN112732798A (en) 2021-04-30

Family

ID=75596700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110186382.1A Pending CN112732798A (en) 2021-02-18 2021-02-18 Sequence data association rule mining method based on fragment clustering

Country Status (1)

Country Link
CN (1) CN112732798A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822570A (en) * 2021-09-20 2021-12-21 河南惠誉网络科技有限公司 Enterprise production data storage method and system based on big data analysis
CN113822570B (en) * 2021-09-20 2023-09-26 北京瀚博网络科技有限公司 Enterprise production data storage method and system based on big data analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210430