CN113722374B - Time sequence variable length motif mining method based on suffix tree - Google Patents

Publication number
CN113722374B
Authority
CN
China
Prior art keywords
point
edge
edge point
length
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110870995.7A
Other languages
Chinese (zh)
Other versions
CN113722374A (en
Inventor
王继民
保宏程
崔明星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110870995.7A priority Critical patent/CN113722374B/en
Publication of CN113722374A publication Critical patent/CN113722374A/en
Publication of CN113722374B publication Critical patent/CN113722374B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • G06F16/2474Sequence data queries, e.g. querying versioned data
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A10/00TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A10/40Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a suffix tree-based method for mining variable-length motifs in time series. The method comprises: performing slope-based pattern representation, setting a change rate threshold, and extracting all edge points to obtain an edge point set; constructing a suffix tree from the edge points of the edge point set and using the suffix tree to count the frequency of edge point subsequences, the edge point subsequences with the highest frequency being the frequent patterns; mapping the frequent patterns back to the original time series and recording the positions of the variable-length motifs; and calculating Matrix Profile values among the variable-length motifs according to their positions, the motifs with the smallest Matrix Profile value being the valid motif. Extracting the valid motif addresses the low motif discovery precision caused by symbolization hiding extreme point information, and improves the precision of variable-length motif mining in time series.

Description

Time sequence variable length motif mining method based on suffix tree
Technical Field
The application relates to the technical field of information processing, in particular to a suffix tree-based time sequence variable length motif mining method.
Background
Time series data mining is a branch of data mining. Its main objective is to find meaningful information in time series data, and it covers tasks such as clustering, classification, similarity search, anomaly detection, and motif mining. Time series motif mining finds repeatedly occurring, previously unknown patterns in a time series without any prior information about their position or shape. It is applicable not only to one-dimensional and multidimensional data, but also to different types of sequence data, such as spatial sequence data, temporal sequence data, and stream data. Time series motif mining is also applied in fields such as genetics, medicine, mathematics, and music.
A motif is defined as a repeated pattern, a frequent trend, or an approximately repeated sequence, shape, segment, or subsequence. Mueen defines a motif as the pair of time series subsequences that are most similar to each other in a long time series. Definitions of motifs now fall broadly into two categories: K-motifs and nearest neighbor motifs.
K-motif: given a time series T, a subsequence length n, and a range R, the most significant motif in T (also called the 1-motif) is the subsequence C_1 that has the highest count of non-trivial matches. The K-th most significant motif in T (also called the K-motif) is the subsequence C_K that has the highest count of non-trivial matches and satisfies D(C_K, C_i) > 2R for all 1 ≤ i < K.
Nearest neighbor motif: in a time series S of length n, the nearest neighbor motif of length m is the subsequence S_i (1 ≤ i ≤ n-m+1) together with its non-trivial nearest neighbor S_j (1 ≤ j ≤ n-m+1), such that the distance between them is the smallest.
The main difference between the two definitions is that a nearest neighbor motif is the pair of subsequences with the smallest distance, that is, the most similar pair, whereas the 1-motif is the subsequence with the largest number of non-trivial matches.
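As an illustration of the nearest neighbor motif definition, the following is a minimal brute-force sketch rather than any method from the application. It assumes plain Euclidean distance and treats any two overlapping subsequences as trivial matches; the function name `nearest_neighbor_motif` and the sample data are hypothetical.

```python
import math

def nearest_neighbor_motif(series, m):
    """Return (distance, i, j) for the closest pair of length-m subsequences.

    Assumption: two subsequences are a trivial match when they overlap,
    so only pairs with j >= i + m are compared.
    """
    n = len(series)
    best = (math.inf, -1, -1)
    for i in range(n - m + 1):
        for j in range(i + m, n - m + 1):
            # Euclidean distance between the two subsequences
            d = math.sqrt(sum((series[i + k] - series[j + k]) ** 2
                              for k in range(m)))
            if d < best[0]:
                best = (d, i, j)
    return best

series = [0, 1, 3, 1, 0, 5, 5, 0, 1, 3, 1, 0]
print(nearest_neighbor_motif(series, 4))  # (0.0, 0, 7)
```

The double loop compares O(n^2) pairs at O(m) cost each, which is the kind of cost that approximate and index-based methods try to avoid.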
However, existing motif discovery algorithms still have a number of deficiencies. Approximate motif discovery algorithms discretize the time series according to the characteristics of the data set and search for frequent patterns in the symbolized string to reduce computation and execution time, but symbolization hides extreme point information, so the longer the time series, the lower the motif mining precision. For example, motif discovery by random projection computes the sequence mean, symbolizes the sequence relative to that mean, and then searches for frequent patterns. This only guarantees that the discovered motifs share the same overall change trend and cannot guarantee similarity between the results, so the precision of variable-length motif mining in time series is low.
Disclosure of Invention
In view of the above, it is desirable to provide a suffix tree-based time-series variable-length motif mining method that can improve the accuracy of time-series variable-length motif mining.
A suffix tree based time series variable length motif mining method, the method comprising:
performing slope-based pattern representation, setting a change rate threshold, and extracting all edge points to obtain an edge point set;
constructing a suffix tree from the edge points of the edge point set, and counting the frequency of edge point subsequences by using the suffix tree, wherein the edge point subsequences with the highest frequency are the frequent patterns;
mapping the frequent patterns back to the original time series, and recording the positions of the variable-length motifs;
and calculating Matrix Profile values among the variable-length motifs according to the variable-length motif positions, wherein the motifs with the smallest Matrix Profile value are the valid motif.
In one embodiment, the step of performing mode representation based on the slope, setting a change rate threshold, extracting all edge points, and obtaining an edge point set includes:
the slope-based time series pattern representation describes the change trend of the time series, which is divided into {steep rise, slow rise, hold constant, slow fall, steep fall} and expressed as M = {2, 1, 0, -1, -2};
setting a change rate threshold, extracting edge points from the time sequence, and extracting all edge points in the time sequence to obtain an edge point set.
In one embodiment, the step of setting a change rate threshold, extracting edge points from the time sequence, extracting all edge points in the time sequence, and obtaining an edge point set includes:
setting a change rate threshold d; for a point to be analyzed x_i and the preceding point x_{i-1}, the slope of the line segment determined by the two points is slope1;
for the point to be analyzed x_i and the next point x_{i+1}, the slope of the line segment determined by the two points is slope2; the value of slope1 - slope2 is analyzed, and if the value is greater than or equal to the change rate threshold d, the point x_i is an edge point; otherwise the point x_i is not an edge point.
In one embodiment, the step of constructing a suffix tree by using edge points of the edge point set, counting the frequency of the edge point subsequence by using the suffix tree, wherein the edge point subsequence with the largest frequency is a frequent pattern, includes:
constructing a suffix tree by utilizing the edge points of the edge point set;
setting the window length, and dividing the edge point set into edge point subsequences by utilizing a sliding window;
counting the occurrence frequency of each edge point sub-sequence in the edge point set by using the suffix tree, and determining the frequency of each edge point sub-sequence;
storing the counted frequency of each edge point subsequence in a frequency array, finding the maximum frequency in the array, and finding the edge point subsequences whose frequency equals that maximum, wherein these edge point subsequences are the frequent patterns, and all the frequent patterns form a frequent pattern set.
In one embodiment, the step of mapping the frequent pattern back to the original time sequence and recording the position of the variable length motif comprises:
mapping the frequent patterns back to the original time series, dividing motifs of the same length into a group, eliminating trivial matches to obtain variable-length motifs of different lengths, and recording the positions of the variable-length motifs.
According to the above suffix tree-based method for mining variable-length motifs in time series, slope-based pattern representation is performed, a change rate threshold is set, and all edge points are extracted to obtain an edge point set; a suffix tree is constructed from the edge points of the edge point set, and the frequency of edge point subsequences is counted by using the suffix tree, the edge point subsequences with the highest frequency being the frequent patterns; the frequent patterns are mapped back to the original time series, and the positions of the variable-length motifs are recorded; and Matrix Profile values among the variable-length motifs are calculated according to their positions, the motifs with the smallest Matrix Profile value being the valid motif. Extracting the valid motif addresses the low motif discovery precision caused by symbolization hiding extreme point information, and improves the precision of variable-length motif mining in time series.
Drawings
FIG. 1 is a flow diagram of a suffix tree based time series variable length motif mining method in one embodiment;
FIG. 2 is a flowchart of a suffix tree based method for mining a time series variable length motif according to another embodiment;
FIG. 3 is a diagram of experimental results of the stability analysis of the suffix tree based time series variable length motif mining method on East Sand Island station measurement data in one embodiment;
FIG. 4 is a diagram of experimental results of the validity analysis of the suffix tree based time series variable length motif mining method on East Sand Island station measurement data in one embodiment;
FIG. 5 is a diagram of experimental results of the efficiency analysis of the suffix tree based time series variable length motif mining method on East Sand Island station measurement data in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a suffix tree-based time series variable length motif mining method is provided, which includes the following steps:
Step S220, performing slope-based pattern representation, setting a change rate threshold, and extracting all edge points to obtain an edge point set.
In one embodiment, the step of performing slope-based pattern representation, setting a change rate threshold, and extracting all edge points to obtain an edge point set includes: the slope-based time series pattern representation describes the change trend of the time series, which is divided into {steep rise, slow rise, hold constant, slow fall, steep fall} and expressed as M = {2, 1, 0, -1, -2}; a change rate threshold is set, and all edge points in the time series are extracted to obtain the edge point set.
In one embodiment, the step of setting a change rate threshold and extracting all edge points in the time series to obtain an edge point set includes: setting a change rate threshold d; for a point to be analyzed x_i and the preceding point x_{i-1}, the slope of the line segment determined by the two points is slope1; for the point to be analyzed x_i and the next point x_{i+1}, the slope of the line segment determined by the two points is slope2; the value of slope1 - slope2 is analyzed, and if the value is greater than or equal to the change rate threshold d, the point x_i is an edge point; otherwise the point x_i is not an edge point.
Step S240, constructing a suffix tree from the edge points of the edge point set, and counting the frequency of edge point subsequences by using the suffix tree, wherein the edge point subsequences with the highest frequency are the frequent patterns.
The suffix tree is constructed from the edge points following the standard suffix tree construction method, so that the suffix tree can subsequently be used to count frequencies.
In one embodiment, the step of constructing a suffix tree from the edge points of the edge point set and counting the frequency of edge point subsequences by using the suffix tree, wherein the edge point subsequences with the highest frequency are the frequent patterns, includes: constructing a suffix tree from the edge points of the edge point set; setting a window length, and dividing the edge point set into edge point subsequences by using a sliding window; counting the number of occurrences of each edge point subsequence in the edge point set by using the suffix tree to determine the frequency of each edge point subsequence; storing the counted frequency of each edge point subsequence in a frequency array, finding the maximum frequency in the array, and finding the edge point subsequences whose frequency equals that maximum, wherein these edge point subsequences are the frequent patterns, and all the frequent patterns form a frequent pattern set.
The window length is the number of edge points that a motif to be discovered is expected to contain. Edge point subsequences are non-overlapping subsequences.
If the frequency of an edge point subsequence is 3, the whole edge point set contains two other edge point subsequences identical to it; that is, the three edge point subsequences have the same edge points and represent extreme point information with the same change trend. When they are mapped back to the original time series, the three corresponding segments of the original time series carry extreme point information with the same change trend as well as information of non-edge points; because the change rate at non-edge points is small, the overall change trend is not affected.
Step S260, mapping the frequent patterns back to the original time series, and recording the positions of the variable-length motifs.
In one embodiment, the step of mapping the frequent patterns back to the original time series and recording the positions of the variable-length motifs includes: mapping the frequent patterns back to the original time series, dividing motifs of the same length into a group, eliminating trivial matches to obtain variable-length motifs of different lengths, and recording the positions of the variable-length motifs.
The number of non-edge points lying between the edge points of a frequent pattern in the original time series may differ from one occurrence to another, so the motifs obtained by mapping a frequent pattern back to the original time series may have different lengths.
Step S280, calculating Matrix Profile values among the variable-length motifs according to the variable-length motif positions, wherein the motifs with the smallest Matrix Profile value are the valid motif.
The Distance Profile among the variable-length motifs of different lengths is calculated, and the Matrix Profile is calculated from it. According to the definition of the valid motif, the pair of variable-length motifs with the smallest Matrix Profile value is the valid motif. This completes the valid motif extraction.
According to the above suffix tree-based method for mining variable-length motifs in time series, slope-based pattern representation is performed, a change rate threshold is set, and all edge points are extracted to obtain an edge point set; a suffix tree is constructed from the edge points of the edge point set, and the frequency of edge point subsequences is counted by using the suffix tree, the edge point subsequences with the highest frequency being the frequent patterns; the frequent patterns are mapped back to the original time series, and the positions of the variable-length motifs are recorded; Matrix Profile values among the variable-length motifs are calculated according to their positions, and the motifs with the smallest Matrix Profile value are the valid motif. Adding the extraction of the valid motif addresses the low motif discovery precision caused by symbolization hiding extreme point information, and improves the precision of variable-length motif mining in time series. Further, the time series is discretized according to the characteristics of the data set and frequent patterns are searched for in the symbolized string, which reduces computation and execution time and gives higher efficiency; edge points are extracted according to slope change and motif discovery is performed on that basis, so the results contain extreme point information with abrupt changes and therefore have greater research value.
As shown in fig. 2, in one embodiment, a suffix tree-based time series variable length motif mining method is provided, which specifically includes the following steps:
Using piecewise linear representation of the time series, slope change is combined with the characteristics of the time series to produce a pattern representation of the time series. The purpose of the pattern representation is to simplify the change rate at extreme points with different change rates, so that edge points can later be extracted by setting a change rate threshold. A change rate threshold is then set on the result of the pattern representation to extract edge points, where edge points are extreme points with relatively large slope change. After the edge points are extracted, a Suffix Tree is constructed; for any leaf node in the Suffix Tree, concatenating the node information along the path from the root node to that leaf node yields the suffix starting at a certain position. The number of extreme points that a motif is expected to contain is specified, the Suffix Tree is used to find the extreme point sequences with the largest number of occurrences, namely the frequent patterns, the frequent patterns are mapped back to the original time series, and the positions of the variable-length motifs are recorded. Because the segments between these extreme points also contain unimportant extreme points, the motif lengths are not fixed, which completes the mining of variable-length motifs that contain important extreme point information. The Distance Profile between the variable-length motifs is calculated from the recorded positions; because the stored variable-length motifs with the same frequency differ in length, DTW is used as the distance measure. The Matrix Profile is then calculated, and the pair of motifs with the smallest Matrix Profile value is the valid motif.
The specific process is as follows:
1. piecewise linear representation.
(1) Slope-based mode representation:
The slope-based time series pattern representation describes the change trend of the time series, which is divided into {steep rise, slow rise, hold constant, slow fall, steep fall}, with inclination angles as follows:
steep rise: inclination ∈ (45°, 90°);
slow rise: inclination ∈ (0°, 45°];
hold constant: inclination = 0°;
slow fall: inclination ∈ (135°, 180°);
steep fall: inclination ∈ (90°, 135°).
The trends are expressed as M = {2, 1, 0, -1, -2}. Pseudocode for the pattern representation algorithm is shown in Table 1:
Table 1 Pattern representation algorithm
Lines 1-11 traverse the time series, calculate the slope, and convert the slope to its pattern representation according to the pattern representation rule. The loop range set in line 1 excludes the first and last points from the pattern representation, because the subsequent edge point extraction takes the first and last points as edge points directly, without checking whether their slope change exceeds the change rate threshold.
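The pattern representation rule can be sketched as follows. This is a minimal illustration, not the pseudocode of Table 1: it assumes evenly spaced samples (time step `dt`), encodes each segment between consecutive points rather than each interior point, and assigns a slope of exactly -1 (inclination 135°, which the ranges above leave open) to steep fall. The function names are hypothetical.

```python
def slope_to_pattern(slope):
    """Map a slope to a pattern code in M = {2, 1, 0, -1, -2}."""
    if slope > 1:       # inclination in (45°, 90°): steep rise
        return 2
    if slope > 0:       # inclination in (0°, 45°]: slow rise
        return 1
    if slope == 0:      # inclination 0°: hold constant
        return 0
    if slope > -1:      # inclination in (135°, 180°): slow fall
        return -1
    return -2           # steep fall; slope == -1 is placed here by assumption

def pattern_representation(series, dt=1.0):
    """Return one pattern code per segment between consecutive points."""
    return [slope_to_pattern((b - a) / dt) for a, b in zip(series, series[1:])]

series = [0.0, 0.5, 3.0, 3.0, 2.5, 0.0]
print(pattern_representation(series))  # [1, 2, 0, -1, -2]
```

Because the slope thresholds depend on the units of the time and value axes, real data would likely need rescaling before the 45° boundary is meaningful.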
(2) Edge point extraction:
A change rate threshold is set, and edge points are extracted according to it. To judge whether a point of the time series is an edge point, for example the point to be analyzed x_i, let slope1 be the slope of the line segment determined by x_i and the preceding point x_{i-1}, and let slope2 be the slope of the line segment determined by x_i and the next point x_{i+1}. The value of slope1 - slope2 is analyzed; if the value is greater than or equal to the change rate threshold d, the point x_i is an edge point; otherwise the point x_i is not an edge point. Pseudocode for the edge point extraction algorithm is shown in Table 2:
Table 2 Edge point extraction algorithm
Lines 2-5 extract the first and last points and store them in the edge point set. Lines 6-8 calculate the slope change using the slopes given by the pattern representation above; if the slope change exceeds the change rate threshold, the point is extracted and stored in the edge point set.
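The edge point rule can be illustrated with the short sketch below. It is not the pseudocode of Table 2. It assumes the input is a list of pattern codes, one per segment between consecutive points, and it compares the absolute difference of the two codes with d; the use of the absolute value is an assumption, since the text only refers to the value of slope1 - slope2. The function name `extract_edge_points` is hypothetical.

```python
def extract_edge_points(patterns, d):
    """Return indices of edge points in the original series.

    patterns[i] is the pattern code of the segment from point i to point i+1,
    so a series of n points has n-1 codes.
    """
    if not patterns:
        return [0]
    n_points = len(patterns) + 1
    edges = [0]                          # first point is always an edge point
    for i in range(1, n_points - 1):
        slope1 = patterns[i - 1]         # segment x_{i-1} -> x_i
        slope2 = patterns[i]             # segment x_i -> x_{i+1}
        # Assumption: the change is measured as an absolute difference.
        if abs(slope1 - slope2) >= d:
            edges.append(i)
    edges.append(n_points - 1)           # last point is always an edge point
    return edges

patterns = [1, 2, 0, -1, -2]
print(extract_edge_points(patterns, 2))  # [0, 2, 5]
print(extract_edge_points(patterns, 0))  # [0, 1, 2, 3, 4, 5]
```

With d = 0 every point is kept, and larger thresholds keep fewer points, which is how the threshold controls the compression rate discussed in the experiments.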
2. Frequent pattern discovery.
(1) Constructing the Suffix Tree from the edge points.
The piecewise linear representation has produced the edge point set. The Suffix Tree is constructed from the edge points following the standard Suffix Tree construction method, so that it can subsequently be used to count frequencies.
(2) Counting the frequency of each edge point subsequence in the whole edge point set by using the Suffix Tree.
A window length is set, where the window length is the number of edge points that a motif to be discovered is expected to contain. The edge point set is divided into edge point subsequences by using a sliding window, and the frequency of each edge point subsequence in the whole edge point set is counted by using the Suffix Tree. If the frequency of an edge point subsequence is 3, the whole edge point set contains two other edge point subsequences identical to it; that is, the three edge point subsequences have the same edge points and represent extreme point information with the same change trend. When they are mapped back to the original time series, the three corresponding segments of the original time series carry extreme point information with the same change trend as well as information of non-edge points; because the change rate at non-edge points is small, the overall change trend is not affected.
(3) Obtaining the frequent patterns with the highest frequency.
The counted frequency of each edge point subsequence is stored in a frequency array, the maximum frequency in the array is found, and the edge point subsequences whose frequency equals that maximum are found. These edge point subsequences are the frequent patterns, and all the frequent patterns form the frequent pattern set.
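The three steps above can be sketched as follows. For brevity the sketch uses a naive suffix trie built in O(n^2) time and space instead of a compact suffix tree. It assumes the edge point set has been encoded as a sequence of symbols (here, illustrative pattern codes), and it counts overlapping occurrences. All names and data are hypothetical.

```python
def build_suffix_trie(symbols):
    """Insert every suffix; each node counts the suffixes passing through it."""
    root = {"count": 0, "children": {}}
    for start in range(len(symbols)):
        node = root
        for sym in symbols[start:]:
            node = node["children"].setdefault(
                sym, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def count_occurrences(trie, pattern):
    """Occurrences of pattern = count stored at the node its path reaches."""
    node = trie
    for sym in pattern:
        node = node["children"].get(sym)
        if node is None:
            return 0
    return node["count"]

def frequent_patterns(symbols, window):
    """Return (patterns with the maximum frequency, that frequency)."""
    trie = build_suffix_trie(symbols)
    freq = {}
    for start in range(len(symbols) - window + 1):
        sub = tuple(symbols[start:start + window])
        freq[sub] = count_occurrences(trie, sub)
    if not freq:
        return [], 0
    best = max(freq.values())
    return [p for p, f in freq.items() if f == best], best

edge_symbols = [2, -2, 1, 2, -2, 1, 0, 2, -2, 1]
print(frequent_patterns(edge_symbols, 3))  # ([(2, -2, 1)], 3)
```

A linear-time construction such as Ukkonen's algorithm would be the natural replacement for long edge point sequences; the trie is used here only to keep the example short.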
3. Motif extraction.
(1) Extracting the variable-length motifs.
The obtained frequent patterns are mapped back to the original time series. The number of non-edge points lying between the edge points of a frequent pattern in the original time series may differ from one occurrence to another, so the motifs obtained by mapping a frequent pattern back to the time series may have different lengths. Motifs of the same length are divided into a group, and trivial matches are eliminated to obtain variable-length motifs of different lengths, whose positions are recorded.
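A minimal sketch of the mapping step is given below. It assumes that each edge point is stored with its index in the original series and a symbol, that an occurrence of a frequent pattern spans from its first to its last edge point, and that a trivial match is an occurrence overlapping one already kept in the same length group. These choices, the function name `map_back`, and the sample data are assumptions for illustration.

```python
from collections import defaultdict

def map_back(edge_indices, edge_symbols, pattern):
    """Map occurrences of a frequent pattern to (start, end) spans in the
    original series, grouped by span length, with overlaps removed."""
    w = len(pattern)
    groups = defaultdict(list)
    for k in range(len(edge_symbols) - w + 1):
        if tuple(edge_symbols[k:k + w]) == tuple(pattern):
            start, end = edge_indices[k], edge_indices[k + w - 1]
            groups[end - start + 1].append((start, end))
    result = {}
    for length, spans in groups.items():
        kept = []
        for start, end in sorted(spans):
            # Assumption: a span overlapping the last kept span is trivial.
            if not kept or start > kept[-1][1]:
                kept.append((start, end))
        result[length] = kept
    return result

edge_indices = [0, 2, 5, 9, 11, 14, 20, 23, 26, 30]
edge_symbols = [2, -2, 1, 2, -2, 1, 0, 2, -2, 1]
print(map_back(edge_indices, edge_symbols, (2, -2, 1)))
# {6: [(0, 5), (9, 14)], 8: [(23, 30)]}
```

The third occurrence is longer than the first two because more non-edge points lie between its edge points, which is the source of the variable lengths.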
(2) Extracting the valid motif.
The Distance Profile among the variable-length motifs of different lengths is calculated, and the Matrix Profile is calculated from it. According to the definition of the valid motif, the pair of variable-length motifs with the smallest Matrix Profile value is the valid motif. This completes the valid motif extraction.
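The valid motif selection can be illustrated with the sketch below. It assumes a textbook DTW with absolute-difference cost and no warping window, treats the Distance Profile of a motif as its DTW distances to all other candidate motifs, and takes the Matrix Profile value of a motif as the minimum of that profile. The function names and data are hypothetical, and at least two candidate spans are required.

```python
import math

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def valid_motif_pair(series, spans):
    """Return (span_i, span_j, distance) for the pair of candidate motifs
    whose matrix profile value (nearest-neighbor DTW distance) is smallest."""
    subs = [series[s:e + 1] for s, e in spans]
    profile = []
    for i, a in enumerate(subs):
        # Distance profile of motif i against every other candidate motif
        distance_profile = [(dtw(a, b), j)
                            for j, b in enumerate(subs) if j != i]
        profile.append(min(distance_profile))
    best_i = min(range(len(profile)), key=lambda i: profile[i][0])
    dist, best_j = profile[best_i]
    return spans[best_i], spans[best_j], dist

series = [0, 1, 3, 1, 0, 9, 0, 1, 1, 3, 1, 0, 9, 5, 5, 5, 5]
spans = [(0, 4), (6, 11), (13, 16)]
print(valid_motif_pair(series, spans))  # ((0, 4), (6, 11), 0.0)
```

DTW is what allows the first two spans, which differ in length by one repeated value, to be matched at zero cost where a Euclidean distance would not apply.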
Example: to verify the effect of the application, hydrologic data observed by the East Sand Island station in China are used as experimental data, and experiments are carried out from three aspects: (1) the stability of the algorithm of the application is analyzed in detail on the data set; (2) the validity of the algorithm of the application is analyzed in comparison with not performing piecewise linear representation; (3) the time performance of the algorithm of the application is analyzed on the existing data set.
The stability, validity, and efficiency of the algorithm of the application are analyzed on the East Sand Island station data set.
1) Stability analysis: the change rate threshold d is varied, which in turn changes the compression rate, and the motif discovery results under different compression rates are compared to see whether they are related.
2) Validity analysis: the motifs discovered without piecewise linear representation are compared with the motifs discovered with piecewise linear representation, checking whether both carry peak and valley information and whether the motifs share the same trend.
3) Efficiency analysis: the algorithm consists mainly of two parts, piecewise linear representation and motif extraction, so the dependent variable, run time, is the sum of the time consumed by the two parts. The analysis mainly examines how the time is distributed between piecewise linear representation and motif extraction as the independent variable, the compression rate, changes.
1. Data preparation
This section takes hydrologic data observed by the East Sand Island buoy station as the experimental object; the data set information is shown in Table 3.
Table 3 Station data set information
Sequence number    Data set                    Length    Remarks
1                  East sand island station    1379      East sand island buoy station
The east sand island station records the relative information of the water basin from 1 st 8 th 2017 to 30 th 2019, including station longitude, station latitude, year, month, day, air temperature, air pressure, surface water temperature, wind speed, wind direction, instantaneous wind speed, effective wave height and flow rate.
The effective wave height is an important parameter, which is important for the prediction of wave and ocean dynamics, and the distribution characteristic of the effective wave height is also closely related to the front and rear energy distribution characteristics of typhoons. The relation between wind speed and effective wave height depends on whether the effective wave height is positioned to the left or right of the typhoon center, and if the distance from the typhoon center is short, the effective wave height falling rate is faster, and if the distance from the typhoon center is long, the effective wave height falling rate is more gentle; to the right of the typhoon center, whether valid; the effective wave height is close to linear attenuation when the wave height is close to the typhoon center.
2. Experimental analysis
1) Stability analysis
The change rate threshold d is increased gradually, the motif information found under each threshold is recorded, and the motifs are compared for intersection to verify the stability of the algorithm. The experimental results are presented in Table 4 and Fig. 3, where Fig. 3 comprises Fig. 3(a) for d=0, Fig. 3(b) for d=1, and Fig. 3(c) for d=1.
Table 4. Motifs found at the East Sand Island station under different thresholds
Table 4 shows that, with a window value of 10, the variable-length motifs found differ across change rate thresholds. In addition, trivial matches arise when the frequent patterns are mapped back to the original time series, so the result after eliminating trivial matches may be empty. This happens, for example, at change rate thresholds d=3 and d=4: because the frequent patterns themselves are trivial matches, they remain trivial matches when mapped back to the original time series. Table 4 also shows that the motif at position 427 for a change rate threshold of 2 and the motif at position 454 for a threshold of 0 overlap, so the variable-length motif of length 9 is stable according to the definition of algorithm stability given above.
2) Validity analysis
Based on the East Sand Island station measurement data, the motifs of the same length found without piecewise linear representation and with piecewise linear representation are compared to check whether they contain sharply changing peaks or peak-valley features, which verifies the effectiveness of the algorithm. Following the above experiment, and weighing the fitting effect against the compression ratio, the change rate threshold was set to 3. Because the algorithm does not require the motif length to be specified, it is first used to find the effective motifs, and their lengths and positions are recorded. The results are then compared with those of the reference algorithm BF to observe whether the motifs found at the same motif length contain sharply changing peaks or peak-valley features. Table 5 and Fig. 4 list the experimental results, where Fig. 4 comprises Fig. 4(a) with piecewise linear representation and Fig. 4(b) without piecewise linear representation.
Table 5. Motifs found by the present algorithm on the dataset
Analysis of the above results reveals that, when the motif length is relatively long, both the motifs found with piecewise linear representation and those found without it contain sharply changing peak or valley information, and the motifs found by the two methods intersect. When the motif length is shorter, however, the motifs found with piecewise linear representation do not necessarily contain peak or peak-valley information. Regardless of the motif length, the slope changes at the extreme points of the effective motifs found with piecewise linear representation are the same, that is, the motifs share the same trend, which verifies the effectiveness of motif discovery after piecewise linear representation.
3) Efficiency analysis of the present algorithm
The algorithm mainly comprises two parts, piecewise linear representation and motif extraction. On the dataset, the change rate threshold d is therefore taken as the independent variable and the total time consumed by piecewise linear representation and motif extraction as the dependent variable. The algorithm is run 8 times at each change rate threshold, and the average run time is calculated. The results are shown in Table 6 and Fig. 5.
Table 6. Run time of the present algorithm under different thresholds, based on East Sand Island station measurements
Analysis of the above experimental results shows that, as the change rate threshold d increases, the compression rate gradually increases and the run times of both piecewise linear representation and motif extraction decrease, so the overall run time of the algorithm decreases. The piecewise linear representation accounts for only a small, almost negligible share of the total time. The motif extraction run time decreases sharply at first and then slowly; because most of the time saved comes from motif extraction, the total run time of the algorithm likewise drops quickly at first and then more slowly as the compression rate increases.
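The timing procedure described above (8 runs per change rate threshold, averaged) can be sketched as follows. This is a minimal illustration only: the function name average_runtime and its parameters are hypothetical and do not come from the patent, and the callable passed in stands for whichever implementation of piecewise linear representation plus motif extraction is being measured.

```python
import time
from statistics import mean
from typing import Any, Callable


def average_runtime(fn: Callable[..., Any], *args: Any, runs: int = 8) -> float:
    """Call fn(*args) `runs` times and return the mean wall-clock time in seconds.

    The default of 8 runs follows the experiment description; everything
    else (names, use of perf_counter) is an assumption of this sketch.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Calling average_runtime once per threshold value d, with the same input series each time, would give one averaged data point per threshold of the kind reported in Table 6.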
It should be understood that, although the steps in the flowchart of Fig. 1 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times. Nor is the order of these sub-steps or stages necessarily sequential; they may be performed in turn or alternately with at least a portion of other steps, or of the sub-steps or stages of other steps.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application. They are described in specific detail but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (3)

1. A suffix tree based time series variable length motif mining method, the method comprising:
representing the change trend of time series data with a slope-based time series pattern, wherein, according to the slope change characteristics, the time series patterns are divided into {steep rise, slow rise, unchanged, slow fall, steep fall}, expressed as M = {2, 1, 0, -1, -2}; setting a change rate threshold; extracting the first point and the last point of the observed hydrological data and storing them in an edge point set; and calculating the slope change from the slopes represented by the patterns, wherein, if the slope change at a point x_i to be analyzed exceeds the change rate threshold, the point x_i to be analyzed is extracted and stored in the edge point set;
constructing a suffix tree from the edge points of the edge point set according to a suffix tree construction method, and counting the frequencies of edge point subsequences using the suffix tree, wherein the edge point subsequence with the highest frequency is a frequent pattern;
mapping the frequent pattern back to the original time series, and recording the positions of the variable-length motifs;
calculating Matrix Profile values between the variable-length motifs according to the variable-length motif positions, wherein the pair of variable-length motifs with the minimum Matrix Profile value is the effective motif of the hydrological data;
wherein the step of calculating the slope change from the slopes represented by the patterns and, if the slope change at the point x_i to be analyzed exceeds the change rate threshold, extracting the point x_i to be analyzed and storing it in the edge point set comprises:
setting a change rate threshold d and analyzing the value of |slope1 - slope2|, wherein, if the value is greater than or equal to the change rate threshold d, the point x_i to be analyzed is taken as an edge point, and otherwise the point x_i to be analyzed is not an edge point; slope1 is the slope of the line segment through the point x_i to be analyzed and its preceding point x_{i-1} in the hydrological data, and slope2 is the slope of the line segment through the point x_i to be analyzed and its following point x_{i+1} in the hydrological data.
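The edge point test in claim 1 can be illustrated with a short sketch. It assumes unit spacing between consecutive samples, so that each slope is simply the difference between neighbouring values. The function names and the `steep` cutoff used to separate "slow" from "steep" patterns are hypothetical, since the claim does not state how the five pattern classes are delimited.

```python
from typing import List, Sequence


def extract_edge_points(series: Sequence[float], d: float) -> List[int]:
    """Return indices of edge points in `series`.

    The first and last points are always kept. An interior point x_i is
    kept when |slope1 - slope2| >= d, where slope1 is the slope from
    x_{i-1} to x_i and slope2 the slope from x_i to x_{i+1}.
    Assumption of this sketch: samples are one time unit apart.
    """
    n = len(series)
    if n == 0:
        return []
    if n == 1:
        return [0]
    edges = [0]
    for i in range(1, n - 1):
        slope1 = series[i] - series[i - 1]
        slope2 = series[i + 1] - series[i]
        if abs(slope1 - slope2) >= d:
            edges.append(i)
    edges.append(n - 1)
    return edges


def slope_symbol(slope: float, steep: float) -> int:
    """Map a slope to a symbol in M = {2, 1, 0, -1, -2}.

    `steep` is a hypothetical cutoff between slow and steep changes;
    the claim does not specify one.
    """
    if slope >= steep:
        return 2
    if slope > 0:
        return 1
    if slope == 0:
        return 0
    if slope > -steep:
        return -1
    return -2
```

For the series [0, 1, 2, 2, 0, 0] and d = 1, the interior points at indices 2, 3 and 4 have slope changes of 1, 2 and 2, so the edge point set is [0, 2, 3, 4, 5]; raising d to 2 drops index 2.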
2. The method of claim 1, wherein the step of constructing a suffix tree from the edge points of the edge point set according to a suffix tree construction method, counting the frequencies of edge point subsequences using the suffix tree, and taking the edge point subsequence with the highest frequency as a frequent pattern comprises:
constructing a suffix tree from the edge points of the edge point set according to a suffix tree construction method;
setting the window length, and dividing the edge point set into edge point subsequences using a sliding window;
counting the number of occurrences of each edge point subsequence in the edge point set using the suffix tree, and determining the frequency of each edge point subsequence;
storing the counted frequency of each edge point subsequence in a frequency array, finding the maximum frequency in the frequency array, and finding the edge point subsequences whose frequency equals the maximum, wherein each such edge point subsequence is a frequent pattern, and all the frequent patterns form a frequent pattern set.
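The counting steps of claim 2 can be sketched as below. For brevity the sketch substitutes a plain, depth-limited suffix trie for a compressed suffix tree: it answers the same "how often does this subsequence occur" query, but takes more memory and is not the construction method the patent refers to. All class and function names are hypothetical, and the input is assumed to be the edge point sequence already encoded as integer symbols.

```python
from typing import Dict, List, Optional, Sequence, Tuple


class TrieNode:
    """Node of an uncompressed suffix trie; count = occurrences of its path."""

    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.count = 0


def build_suffix_trie(symbols: Sequence[int], max_depth: int) -> TrieNode:
    """Insert every suffix, truncated to max_depth symbols, into a trie."""
    root = TrieNode()
    for start in range(len(symbols)):
        node = root
        for sym in symbols[start:start + max_depth]:
            node = node.children.setdefault(sym, TrieNode())
            node.count += 1
    return root


def count_occurrences(root: TrieNode, pattern: Sequence[int]) -> int:
    """Number of (possibly overlapping) occurrences of pattern."""
    node: Optional[TrieNode] = root
    for sym in pattern:
        node = node.children.get(sym) if node else None
        if node is None:
            return 0
    return node.count if node else 0


def frequent_patterns(
    symbols: Sequence[int], window: int
) -> Tuple[List[Tuple[int, ...]], int]:
    """Slide a window over symbols and return the most frequent subsequences."""
    root = build_suffix_trie(symbols, window)
    freq: Dict[Tuple[int, ...], int] = {}
    for start in range(len(symbols) - window + 1):
        sub = tuple(symbols[start:start + window])
        freq[sub] = count_occurrences(root, sub)
    if not freq:
        return [], 0
    best = max(freq.values())
    return [p for p, c in freq.items() if c == best], best
```

With the symbol sequence [1, -1, 1, -1, 2, 1, -1] and a window of 2, the subsequence (1, -1) occurs three times and every other window occurs once, so it is the only frequent pattern.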
3. The method of claim 1, wherein the step of mapping the frequent pattern back to the original time series and recording the positions of the variable-length motifs comprises:
mapping the frequent patterns back to the original time series, dividing the patterns of the same length into groups, eliminating trivial matches to obtain variable-length motifs of different lengths, and recording the positions of the variable-length motifs.
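The post-processing in claims 1 and 3 (eliminating trivial matches, then picking the closest pair) can be illustrated as follows. Two assumptions are made that the claims do not spell out: a trivial match is taken to be an occurrence that overlaps a previously kept occurrence of the same length, and the pairwise score is the z-normalized Euclidean distance, which is the distance that Matrix Profile values are conventionally built on. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def remove_trivial_matches(positions: Sequence[int], length: int) -> List[int]:
    """Keep only occurrences that do not overlap the previously kept one."""
    kept: List[int] = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= length:
            kept.append(pos)
    return kept


def znorm(xs: Sequence[float]) -> List[float]:
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    if std == 0:
        return [0.0] * len(xs)
    return [(x - mean) / std for x in xs]


def znorm_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between the z-normalized subsequences a and b."""
    za, zb = znorm(a), znorm(b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(za, zb)))


def closest_pair(
    series: Sequence[float], positions: Sequence[int], length: int
) -> Optional[Tuple[float, int, int]]:
    """Return (distance, i, j) for the pair of candidates with minimum distance."""
    best: Optional[Tuple[float, int, int]] = None
    for i, j in combinations(positions, 2):
        dist = znorm_distance(series[i:i + length], series[j:j + length])
        if best is None or dist < best[0]:
            best = (dist, i, j)
    return best
```

For candidate positions [0, 3, 10, 12, 25] and a motif length of 9, the occurrences at 3 and 12 overlap earlier ones and are dropped, leaving [0, 10, 25]. The brute-force pair search is quadratic in the number of candidates, which is acceptable only because the candidate set has already been narrowed by the frequent pattern step.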
CN202110870995.7A 2021-07-30 2021-07-30 Time sequence variable length motif mining method based on suffix tree Active CN113722374B (en)


Publications (2)

Publication Number Publication Date
CN113722374A CN113722374A (en) 2021-11-30
CN113722374B true CN113722374B (en) 2023-12-01

Family

ID=78674412


Country Status (1)

Country Link
CN (1) CN113722374B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019041628A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for mining multivariate time series association rule based on eclat
CN113128582A (en) * 2021-04-14 2021-07-16 河海大学 Matrix Profile-based time sequence variable-length die body mining method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160070763A1 (en) * 2013-05-31 2016-03-10 Teradata Us, Inc. Parallel frequent sequential pattern detecting
US10409844B2 (en) * 2016-03-01 2019-09-10 Ching-Tu WANG Method for extracting maximal repeat patterns and computing frequency distribution tables

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
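NPLWAP: a new Web sequential pattern mining algorithm; 林维仲; 张东站; Journal of Xiamen University (Natural Science) (No. 01); full text *
An effective suffix tree construction method; 黄影; China Electronic Education (No. 03); full text *
A fast time series motif discovery algorithm; 朱晓晓; Foreign Electronic Measurement Technology (No. 09); full text *
A multi-scale anomaly detection method for time series; 陈波; 刘厚泉; 赵志凯; Computer Engineering and Applications (No. 20); full text *
k-mean-distance outlier factor detection of anomalous time series patterns; 詹艳艳; 徐荣聪; Computer Engineering and Applications (No. 09); full text *
Research on pattern representation of time series data for spacecraft testing; 周家杰; 余丹; 马世龙; 陈丽萍; Application Research of Computers (No. 01); full text *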

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant