CN113722374B - Time sequence variable length motif mining method based on suffix tree

Info

Publication number: CN113722374B
Application number: CN202110870995.7A
Authority: CN
Inventors: 王继民; 保宏程; 崔明星
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2023-12-01
Anticipated expiration: 2041-07-30
Also published as: CN113722374A

Abstract

The application relates to a suffix tree-based method for mining variable-length motifs in time series. The method comprises the following steps: performing pattern representation based on the slope, setting a change rate threshold, and extracting all edge points to obtain an edge point set; constructing a suffix tree from the edge points of the edge point set and using the suffix tree to count the frequency of edge point subsequences, wherein the edge point subsequences with the highest frequency are the frequent patterns; mapping the frequent patterns back to the original time series and recording the positions of the variable-length motifs; and calculating Matrix Profile values among the variable-length motifs according to their positions, wherein the pair of variable-length motifs with the smallest Matrix Profile value is the effective motif. By additionally extracting the effective motif, the method addresses the low motif discovery precision caused by symbolization hiding extreme point information, and improves the precision of variable-length time series motif mining.

Description

Time sequence variable length motif mining method based on suffix tree

Technical Field

The application relates to the technical field of information processing, in particular to a suffix tree-based time sequence variable length motif mining method.

Background

Time series data mining is a branch of data mining whose main objective is to find meaningful information in time series data; its tasks include clustering, classification, similarity search, anomaly detection, and motif mining. Time series motif mining finds repeatedly occurring, previously unknown patterns in a time series without any prior information about their position or shape. It is applicable not only to one-dimensional or multidimensional data, but also to different types of sequence data, such as spatial sequence data, temporal sequence data, and stream data. The technique has also been applied in fields such as genetics, medicine, mathematics, and music.

A motif is defined as a repeated pattern, a frequent trend, or an approximately repeated sequence, shape, segment, or subsequence. Mueen defines a motif as the pair of time series subsequences that are most similar to each other within a long time series. Current definitions of motifs fall broadly into two categories: K-motifs and nearest-neighbor motifs.

K-motif: given a time series $T$, a subsequence length $n$, and a range $R$, the most significant motif in $T$ (also called the 1-motif) is the subsequence $C_1$ that has the highest count of non-trivial matches. The $K$-th most significant motif in $T$ (also called the $K$-motif) is the subsequence $C_K$ that has the highest count of non-trivial matches and satisfies $D(C_K, C_i) > 2R$ for all $1 \le i < K$.

Nearest-neighbor motif: the nearest-neighbor motif of length $m$ in a time series $S$ of length $n$ is the subsequence $S_i$ ($1 \le i \le n-m+1$) together with its non-trivial nearest neighbor $S_j$ ($1 \le j \le n-m+1$), such that the distance between them is the smallest.

The main difference between the two definitions is that a nearest-neighbor motif is the pair of subsequences with the smallest distance, i.e., the most similar pair, whereas the 1-motif is the subsequence with the largest number of non-trivial matches.
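To make the nearest-neighbor definition concrete, the following is a minimal brute-force sketch. It assumes Euclidean distance and treats any two overlapping windows as a trivial match; both choices, as well as the function name `nearest_neighbor_motif`, are illustrative assumptions rather than part of the definition quoted above.

```python
import math

def nearest_neighbor_motif(series, m):
    """Return (distance, i, j) for the closest pair of length-m windows.

    Assumption: windows that overlap (j < i + m) count as trivial matches
    and are skipped; the distance is plain Euclidean distance.
    """
    n = len(series)
    best = (math.inf, None, None)
    for i in range(n - m + 1):
        # Start j at i + m so the two windows never overlap.
        for j in range(i + m, n - m + 1):
            d = math.sqrt(sum((series[i + k] - series[j + k]) ** 2
                              for k in range(m)))
            if d < best[0]:
                best = (d, i, j)
    return best

# The window [0, 1, 2] occurs at positions 0 and 5.
print(nearest_neighbor_motif([0, 1, 2, 9, 7, 0, 1, 2, 5], 3))
```

The double loop costs O(n² · m), which is why approximate approaches such as the one in this application try to shrink the search space before distances are computed.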

However, existing motif discovery algorithms still have a number of deficiencies. Approximate motif discovery algorithms discretize the time series according to the characteristics of the data set and find frequent patterns in the symbolized strings to reduce computation and execution time, but symbolization hides extreme point information, so the longer the time series, the lower the motif mining precision. For example, motif discovery by random projection computes the sequence mean, symbolizes the sequence relative to that mean, and then finds frequent patterns; this only guarantees that the discovered motifs share the same overall trend and cannot guarantee similarity among the results, so the precision of variable-length time series motif mining is low.

Disclosure of Invention

In view of the above, it is desirable to provide a suffix tree-based time-series variable-length motif mining method that can improve the accuracy of time-series variable-length motif mining.

A suffix tree based time series variable length motif mining method, the method comprising:

performing pattern representation based on the slope, setting a change rate threshold, and extracting all edge points to obtain an edge point set;

constructing a suffix tree from the edge points of the edge point set, and using the suffix tree to count the frequency of edge point subsequences, wherein the edge point subsequences with the highest frequency are the frequent patterns;

mapping the frequent patterns back to the original time series, and recording the positions of the variable-length motifs;

and calculating Matrix Profile values among the variable-length motifs according to their positions, wherein the pair of variable-length motifs with the smallest Matrix Profile value is the effective motif.

In one embodiment, the step of performing pattern representation based on the slope, setting a change rate threshold, and extracting all edge points to obtain an edge point set includes:

representing the trend of the time series with a slope-based time series pattern representation, in which the trend is divided into {steep rise, slow rise, hold constant, slow fall, steep fall}, denoted $M = \{2, 1, 0, -1, -2\}$;

setting a change rate threshold and extracting all edge points in the time series to obtain an edge point set.

In one embodiment, the step of setting a change rate threshold and extracting all edge points in the time series to obtain an edge point set includes:

setting a change rate threshold $d$; for a point to be analyzed $x_i$, the slope of the line segment through $x_i$ and the preceding point $x_{i-1}$ is slope1;

the slope of the line segment through $x_i$ and the next point $x_{i+1}$ is slope2; the value of slope1 - slope2 is analyzed, and if the value is greater than or equal to the change rate threshold $d$, the point $x_i$ is an edge point; otherwise $x_i$ is not an edge point.

In one embodiment, the step of constructing a suffix tree from the edge points of the edge point set and using the suffix tree to count the frequency of edge point subsequences, wherein the edge point subsequences with the highest frequency are the frequent patterns, includes:

constructing a suffix tree from the edge points of the edge point set;

setting a window length, and dividing the edge point set into edge point subsequences with a sliding window;

using the suffix tree to count the number of occurrences of each edge point subsequence in the edge point set, thereby determining the frequency of each edge point subsequence;

storing the counted frequency of each edge point subsequence in a frequency array, finding the maximum frequency in the frequency array, and finding the edge point subsequences whose frequency equals that maximum, wherein these edge point subsequences are the frequent patterns, and all the frequent patterns form a frequent pattern set.

In one embodiment, the step of mapping the frequent patterns back to the original time series and recording the positions of the variable-length motifs comprises:

mapping the frequent patterns back to the original time series, dividing motifs of the same length into a group, eliminating trivial matches to obtain variable-length motifs of different lengths, and recording the positions of the variable-length motifs.

In the above suffix tree-based time series variable-length motif mining method, pattern representation is performed based on the slope, a change rate threshold is set, and all edge points are extracted to obtain an edge point set; a suffix tree is constructed from the edge points of the edge point set and used to count the frequency of edge point subsequences, wherein the edge point subsequences with the highest frequency are the frequent patterns; the frequent patterns are mapped back to the original time series and the positions of the variable-length motifs are recorded; and Matrix Profile values among the variable-length motifs are calculated according to their positions, wherein the pair of variable-length motifs with the smallest Matrix Profile value is the effective motif. By additionally extracting the effective motif, the method addresses the low motif discovery precision caused by symbolization hiding extreme point information, and improves the precision of variable-length time series motif mining.

Drawings

FIG. 1 is a flow diagram of a suffix tree based time series variable length motif mining method in one embodiment;

FIG. 2 is a flowchart of a suffix tree based method for mining a time series variable length motif according to another embodiment;

FIG. 3 shows the experimental results of the stability analysis of the suffix tree-based time series variable-length motif mining method on the East Sand Island station measurements in one embodiment;

FIG. 4 shows the experimental results of the validity analysis of the suffix tree-based time series variable-length motif mining method on the East Sand Island station measurements in one embodiment;

FIG. 5 shows the experimental results of the efficiency analysis of the suffix tree-based time series variable-length motif mining method on the East Sand Island station measurements in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

In one embodiment, as shown in fig. 1, a suffix tree-based time series variable length motif mining method is provided, which includes the following steps:

Step S220: perform pattern representation based on the slope, set a change rate threshold, and extract all edge points to obtain an edge point set.

In one embodiment, the step of performing pattern representation based on the slope, setting a change rate threshold, and extracting all edge points to obtain an edge point set includes: representing the trend of the time series with a slope-based time series pattern representation, in which the trend is divided into {steep rise, slow rise, hold constant, slow fall, steep fall}, denoted $M = \{2, 1, 0, -1, -2\}$; and setting a change rate threshold and extracting all edge points in the time series to obtain an edge point set.

In one embodiment, setting a change rate threshold and extracting all edge points in the time series to obtain an edge point set includes: setting a change rate threshold $d$; for a point to be analyzed $x_i$, the slope of the line segment through $x_i$ and the preceding point $x_{i-1}$ is slope1, and the slope of the line segment through $x_i$ and the next point $x_{i+1}$ is slope2; the value of slope1 - slope2 is analyzed, and if the value is greater than or equal to the change rate threshold $d$, the point $x_i$ is an edge point; otherwise $x_i$ is not an edge point.

Step S240: construct a suffix tree from the edge points of the edge point set and use the suffix tree to count the frequency of edge point subsequences, wherein the edge point subsequences with the highest frequency are the frequent patterns.

The suffix tree is constructed from the edge points according to a suffix tree construction method, so that the suffix tree can subsequently be used to count frequencies.

In one embodiment, this step includes: constructing a suffix tree from the edge points of the edge point set; setting a window length, and dividing the edge point set into edge point subsequences with a sliding window; using the suffix tree to count the number of occurrences of each edge point subsequence in the edge point set, thereby determining the frequency of each edge point subsequence; and storing the counted frequency of each edge point subsequence in a frequency array, finding the maximum frequency in the frequency array, and finding the edge point subsequences whose frequency equals that maximum, wherein these edge point subsequences are the frequent patterns, and all the frequent patterns form a frequent pattern set.

Here the window length is the number of edge points that a discovered pattern is expected to contain. The edge point subsequences are non-overlapping subsequences.

If the frequency of an edge point subsequence is 3, two other identical edge point subsequences exist in the whole edge point set; that is, the three edge point subsequences consist of the same edge points and represent extreme point information with the same trend. When they are mapped back to the original time series, the three corresponding time series segments carry extreme point information with the same trend and also contain non-edge points; because the change rate at non-edge points is small, the overall trend is not affected.
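The frequency counting described above can be sketched as follows. For brevity this sketch uses a naive suffix trie truncated at the window length instead of a linear-time suffix tree, assumes the edge points are encoded as hashable symbols (for example trend codes), and counts every occurrence of a window regardless of overlap; these choices and the function names are assumptions for illustration only.

```python
def build_suffix_trie(symbols, depth):
    """Insert every suffix, truncated to `depth` symbols, into a trie.

    Each node stores how many suffixes pass through it, which equals the
    number of occurrences of the path label in `symbols`.
    """
    root = {"count": 0, "children": {}}
    for start in range(len(symbols)):
        node = root
        for sym in symbols[start:start + depth]:
            node = node["children"].setdefault(
                sym, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def count_occurrences(root, pattern):
    """Walk the trie along `pattern` and return its occurrence count."""
    node = root
    for sym in pattern:
        node = node["children"].get(sym)
        if node is None:
            return 0
    return node["count"]

def frequent_patterns(symbols, window):
    """Return (patterns with the maximum frequency, that frequency)."""
    root = build_suffix_trie(symbols, window)
    freq = {}  # plays the role of the frequency array
    for start in range(len(symbols) - window + 1):
        sub = tuple(symbols[start:start + window])
        if sub not in freq:
            freq[sub] = count_occurrences(root, sub)
    if not freq:
        return [], 0
    best = max(freq.values())
    return [p for p, f in freq.items() if f == best], best

edge_symbols = [2, -1, 1, 2, -1, 1, 0, 2, -1, 1]
print(frequent_patterns(edge_symbols, 3))
```

In this toy input the subsequence `(2, -1, 1)` occurs three times, so it is the only frequent pattern for a window length of 3.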

Step S260: map the frequent patterns back to the original time series and record the positions of the variable-length motifs.

In one embodiment, the step of mapping the frequent patterns back to the original time series and recording the positions of the variable-length motifs comprises: mapping the frequent patterns back to the original time series, dividing motifs of the same length into a group, eliminating trivial matches to obtain variable-length motifs of different lengths, and recording the positions of the variable-length motifs.

The number of non-edge points lying between the edge points of a frequent pattern in the original time series may differ from one occurrence to another, so the motifs obtained by mapping a frequent pattern back to the original time series may have different lengths. The motifs of the same length are therefore divided into a group, trivial matches are eliminated to obtain variable-length motifs of different lengths, and the positions of the variable-length motifs are recorded.
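A minimal sketch of this mapping step is given below. It assumes that each edge point carries its index in the original time series (`edge_idx`), that an occurrence is identified by its start position in the edge point sequence, and that a trivial match is an occurrence overlapping an already kept motif of the same length. The function name and these conventions are illustrative assumptions, not details given in the text.

```python
from collections import defaultdict

def map_back(edge_idx, occurrences, window):
    """Map pattern occurrences to (start, end) spans in the original series.

    edge_idx    -- original-series index of each edge point (increasing)
    occurrences -- start positions of the frequent pattern in the edge sequence
    window      -- number of edge points per pattern
    Returns {motif length: [(start, end), ...]} with inclusive ends.
    """
    groups = defaultdict(list)
    for s in sorted(occurrences):
        start = edge_idx[s]
        end = edge_idx[s + window - 1]
        length = end - start + 1
        kept = groups[length]
        # Assumption: overlap with the last kept span of equal length
        # is treated as a trivial match and dropped.
        if kept and start <= kept[-1][1]:
            continue
        kept.append((start, end))
    return dict(groups)

edge_idx = [0, 3, 5, 9, 12, 14, 20, 23, 25]
print(map_back(edge_idx, [0, 1, 3, 6], window=3))
```

Here the same three-edge-point pattern yields spans of length 6 and length 7, which shows how a fixed window over edge points produces motifs of variable length in the original series.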

Step S280: calculate Matrix Profile values among the variable-length motifs according to their positions, wherein the pair of variable-length motifs with the smallest Matrix Profile value is the effective motif.

Using the obtained variable-length motifs of different lengths, the Distance Profile among the variable-length motifs is calculated, and the Matrix Profile is calculated from it. According to the definition of the effective motif, the pair of variable-length motifs with the smallest Matrix Profile value is the effective motif. This completes the effective motif extraction.

In the above suffix tree-based time series variable-length motif mining method, pattern representation is performed based on the slope, a change rate threshold is set, and all edge points are extracted to obtain an edge point set; a suffix tree is constructed from the edge points of the edge point set and used to count the frequency of edge point subsequences, wherein the edge point subsequences with the highest frequency are the frequent patterns; the frequent patterns are mapped back to the original time series and the positions of the variable-length motifs are recorded; and Matrix Profile values among the variable-length motifs are calculated according to their positions, wherein the pair of variable-length motifs with the smallest Matrix Profile value is the effective motif. By additionally extracting the effective motif, the method addresses the low motif discovery precision caused by symbolization hiding extreme point information, and improves the precision of variable-length time series motif mining. Further, the time series is discretized according to the characteristics of the data set and frequent patterns are found in the symbolized string, which reduces computation and execution time and gives high efficiency; edge points are extracted according to slope changes and motif discovery is performed on that basis, so the results contain abruptly changing extreme point information and have greater research value.

As shown in FIG. 2, in one embodiment, a suffix tree-based time series variable-length motif mining method is provided, which specifically includes the following:

Using a piecewise linear representation of the time series, slope changes are combined with the characteristics of the time series to produce a pattern representation of the time series. The purpose of the pattern representation is to simplify the change rates of extreme points with different change rates, so that edge points can subsequently be extracted conveniently by setting a change rate threshold. A change rate threshold is then set on the result of the pattern representation to extract edge points, where an edge point is an extreme point with a relatively large slope change. After the edge points are extracted, a suffix tree is constructed; for any leaf node in the suffix tree, concatenating the node information along the path from the root node to that leaf node yields the suffix starting at some position. The number of extreme points that a discovered pattern is expected to contain is specified, the suffix tree is used to find the extreme point sequences with the largest number of occurrences, i.e., the frequent patterns, the frequent patterns are mapped back to the original time series, and the positions of the variable-length motifs are recorded. Because unimportant extreme points also lie between the extracted extreme points, the lengths of the motifs are not fixed; this completes the mining of variable-length time series motifs that contain important extreme point information. The Distance Profile between the variable-length motifs is then calculated from the recorded positions; because the stored variable-length motifs with the same frequency differ in length, DTW is used as the distance measure. The Matrix Profile is further calculated, and the pair of motifs with the smallest Matrix Profile value is the effective motif.
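As a rough illustration of the final step, the sketch below computes a DTW distance between every pair of recorded motifs, takes each motif's smallest distance to any other motif as its Matrix Profile value, and returns the pair with the smallest value. The function names `dtw` and `effective_motif`, the absolute-difference local cost, and the absence of normalization are assumptions made for illustration; the text does not specify them.

```python
import math

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    m = len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix against empty prefix costs nothing
    for i in range(1, len(a) + 1):
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]

def effective_motif(series, positions):
    """Return (span_a, span_b, distance) for the closest pair of motifs.

    positions -- list of at least two inclusive (start, end) spans
    """
    segs = [series[s:e + 1] for s, e in positions]
    profile = []  # (nearest distance, index of nearest neighbour)
    for i, a in enumerate(segs):
        best_d, best_j = math.inf, None
        for j, b in enumerate(segs):
            if i != j:
                d = dtw(a, b)
                if d < best_d:
                    best_d, best_j = d, j
        profile.append((best_d, best_j))
    i = min(range(len(profile)), key=lambda k: profile[k][0])
    return positions[i], positions[profile[i][1]], profile[i][0]

series = [0, 1, 3, 1, 0, 5, 5, 0, 1, 1, 3, 1, 0, 9, 9, 9, 0, 4, 0]
print(effective_motif(series, [(0, 4), (7, 12), (16, 18)]))
```

DTW lets the spans of length 5 and length 6 be compared directly, which a point-by-point Euclidean distance could not do without resampling.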

The specific process is as follows:

1. Piecewise linear representation.

(1) Slope-based pattern representation:

The slope-based time series pattern representation characterizes the trend of the time series, which is divided into {steep rise, slow rise, hold constant, slow fall, steep fall}: steep rise, inclination ∈ (45°, 90°); slow rise, inclination ∈ (0°, 45°]; hold constant, inclination 0°; slow fall, inclination ∈ (135°, 180°); steep fall, inclination ∈ (90°, 135°). These are denoted $M = \{2, 1, 0, -1, -2\}$. The pseudocode of the pattern representation algorithm is shown in Table 1:

Table 1 Pattern representation algorithm

Lines 1-11 traverse the time series to calculate the slopes and convert each slope to its pattern representation according to the pattern representation rule. The loop range set in line 1 excludes the first and last points from the pattern representation, because the subsequent edge point extraction takes the first and last points as edge points without checking whether their slope changes exceed the change rate threshold.
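The pattern representation rule can be sketched in a few lines. This sketch assumes unit spacing between samples, so that an inclination of 45° corresponds to a slope of 1, and it assigns the 135° boundary to "slow fall"; it emits one code per segment and is not a transcription of the pseudocode in Table 1. The function name `pattern_codes` is hypothetical.

```python
def pattern_codes(series):
    """Return one trend code from M = {2, 1, 0, -1, -2} per segment."""
    codes = []
    for prev, cur in zip(series, series[1:]):
        slope = cur - prev  # assumption: unit spacing between samples
        if slope > 1:        # inclination in (45°, 90°): steep rise
            codes.append(2)
        elif slope > 0:      # inclination in (0°, 45°]: slow rise
            codes.append(1)
        elif slope == 0:     # inclination 0°: hold constant
            codes.append(0)
        elif slope >= -1:    # assumption: 135° boundary counts as slow fall
            codes.append(-1)
        else:                # inclination in (90°, 135°): steep fall
            codes.append(-2)
    return codes

print(pattern_codes([0, 2, 2.5, 2.5, 2, 0]))
```

Comparing slopes against 1 and -1 avoids computing angles explicitly, since tan 45° = 1 under the unit-spacing assumption.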

(2) Edge point extraction:

A change rate threshold is set, and edge points are extracted according to it. To judge whether a point of the time series, for example the point to be analyzed $x_i$, is an edge point, let slope1 be the slope of the line segment through $x_i$ and the preceding point $x_{i-1}$, and let slope2 be the slope of the line segment through $x_i$ and the next point $x_{i+1}$. The value of slope1 - slope2 is analyzed: if it is greater than or equal to the change rate threshold $d$, the point $x_i$ is an edge point; otherwise $x_i$ is not an edge point. The pseudocode of the edge point extraction algorithm is shown in Table 2:

Table 2 Edge point extraction algorithm

Lines 2-5 store the first and last points in the edge point set. Lines 6-8 calculate the slope change from the slopes given by the pattern representation described above; if the slope change is greater than or equal to the change rate threshold, the point is stored in the edge point set.
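A hedged sketch of the edge point rule follows. It takes as input one trend code from $M$ per segment, always keeps the first and last points, and compares the absolute difference of adjacent codes with the threshold $d$. Using the absolute value is an assumption (the text only speaks of "the value of slope1 - slope2"), and the function name `extract_edge_points` is hypothetical.

```python
def extract_edge_points(codes, d):
    """Return indices of edge points in the original series.

    codes -- trend code of each segment; codes[i] describes the segment
             from point i to point i + 1, so there are len(codes) + 1 points
    d     -- change rate threshold
    """
    n = len(codes) + 1
    if n == 1:
        return [0]
    edges = [0]  # the first point is always an edge point
    for i in range(1, n - 1):
        slope1 = codes[i - 1]  # segment arriving at point i
        slope2 = codes[i]      # segment leaving point i
        # Assumption: the magnitude of the change is what matters.
        if abs(slope1 - slope2) >= d:
            edges.append(i)
    edges.append(n - 1)  # the last point is always an edge point
    return edges

codes = [1, 1, -2, -2, 1]
print(extract_edge_points(codes, 3))
```

With $d = 0$ every point qualifies as an edge point, so no compression takes place; larger thresholds keep fewer points.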

2. Frequent pattern discovery.

(1) Construct the suffix tree from the edge points.

The piecewise linear representation has produced the edge point set. The suffix tree is constructed from the edge points according to a suffix tree construction method, so that the suffix tree can subsequently be used to count frequencies.

(2) Use the suffix tree to count the frequency of each edge point subsequence in the whole edge point set.

A window length is set, where the window length is the number of edge points that a discovered pattern is expected to contain; the edge point set is divided into edge point subsequences with a sliding window, and the suffix tree is used to count the frequency of each edge point subsequence in the whole edge point set. If the frequency of an edge point subsequence is 3, two other identical edge point subsequences exist in the whole edge point set; that is, the three edge point subsequences consist of the same edge points and represent extreme point information with the same trend. When they are mapped back to the original time series, the three corresponding time series segments carry extreme point information with the same trend and also contain non-edge points; because the change rate at non-edge points is small, the overall trend is not affected.

(3) Obtain the frequent patterns with the highest frequency.

The counted frequencies of the edge point subsequences are stored in a frequency array, the maximum frequency in the frequency array is found, and the edge point subsequences whose frequency equals that maximum are found; these edge point subsequences are the frequent patterns, and all the frequent patterns form a frequent pattern set.

3. Motif extraction.

(1) Variable-length motif extraction.

The obtained frequent patterns are mapped back to the original time series. The number of non-edge points lying between the edge points of a frequent pattern in the original time series may differ from one occurrence to another, so the motifs obtained by mapping a frequent pattern back to the time series may have different lengths. Motifs of the same length are divided into a group and trivial matches are eliminated, which yields variable-length motifs of different lengths; the positions of the variable-length motifs are recorded.

(2) Effective motif extraction.

Examples: To verify the effect of the application, hydrological data observed at the East Sand Island station in China are used as experimental data, and experiments are carried out in three respects: (1) the stability of the algorithm of the application is analyzed in detail on the data set; (2) the effectiveness of the algorithm is analyzed in comparison with motif discovery without piecewise linear representation; (3) the time performance of the algorithm is analyzed on the existing data set.

The stability, effectiveness, and efficiency of the algorithm of the application are analyzed on the East Sand Island station data set.

1) Stability analysis: the change rate threshold d is varied, which in turn changes the compression rate, and the motifs discovered at different compression rates are compared to see whether they are related.

2) Validity analysis: the motifs discovered without piecewise linear representation and those discovered with piecewise linear representation are compared to see whether both carry peak and valley information and whether the motifs share the same trend.

3) Efficiency analysis: the algorithm consists essentially of two parts, piecewise linear representation and motif extraction, so the dependent variable, the running time, is the sum of the time consumed by the two parts. The analysis mainly examines how the time spent in piecewise linear representation and in motif extraction changes as the independent variable, the compression rate, changes.

1. Data preparation

This section uses hydrological data observed at the East Sand Island buoy station as the experimental object; the data set information is shown in Table 3.

Table 3 Station data set information

Sequence number	Data set	Length	Remarks
1	East Sand Island station	1379	East Sand Island buoy station

The East Sand Island station records relevant information about the water basin from 1 st 8 th 2017 to 30 th 2019, including station longitude, station latitude, year, month, day, air temperature, air pressure, surface water temperature, wind speed, wind direction, instantaneous wind speed, effective wave height, and flow rate.

The effective wave height is an important parameter for predicting waves and ocean dynamics, and its distribution is closely related to the energy distribution ahead of and behind a typhoon. The relation between wind speed and effective wave height depends on whether the effective wave height is located to the left or right of the typhoon center: if the distance from the typhoon center is short, the effective wave height falls faster, and if the distance is long, it falls more gently; to the right of the typhoon center, the effective wave height attenuates almost linearly when it is close to the typhoon center.

2. Experimental analysis

1) Stability analysis

The change rate threshold d is gradually increased, the motif information found at each change rate threshold is recorded, and the motifs are compared to see whether they intersect, thereby verifying the stability of the algorithm. The experimental results are shown in Table 4 and FIG. 3, where FIG. 3 includes FIG. 3(a) for d=0, FIG. 3(b) for d=1, and FIG. 3(c) for d=1.

Table 4 Motif information found at different thresholds for the East Sand Island station

Table 4 shows that, with a window value of 10, the variable-length motifs found differ across change rate thresholds. In addition, trivial matches can arise when the frequent patterns are mapped back to the original time series, so the result after eliminating trivial matches may be empty. For example, at change rate thresholds d=3 and d=4 the result is empty after trivial matches are eliminated: because the frequent patterns themselves are trivial matches, they remain trivial matches after being mapped back to the original time series. Table 4 also shows that the motif at position 427 with a change rate threshold of 2 and the motif at position 454 with a threshold of 0 overlap, so that, according to the stability criterion described above, the variable-length motif of length 9 is stable.

2) Validity analysis

Based on the East Sand Island station measurement data, motifs of the same length found without and with piecewise linear representation are compared to see whether they contain sharply changing peaks or valleys, thereby verifying the effectiveness of the algorithm. Based on the above experiment and considering both the fitting effect and the compression rate, the change rate threshold is set to 3. Because the algorithm does not require the motif length to be specified, the algorithm is first used to find the effective motifs, and the lengths and positions of the effective motifs found are recorded; the results are then compared with the reference algorithm BF to observe whether the motifs found at the same motif length contain sharply changing peaks or valleys. The experimental results are listed in Table 5 and FIG. 4, where FIG. 4 includes FIG. 4(a) with piecewise linear representation and FIG. 4(b) without piecewise linear representation.

Table 5 Motif information found by the present algorithm on the data set

Analysis of the above results shows that, when the motif length is relatively long, both the motifs found with piecewise linear representation and those found without it contain sharply changing peak or valley information, and the motifs found by the two methods intersect. When the motif length is shorter, however, the motifs found with piecewise linear representation do not necessarily contain peak or valley information. Regardless of the motif length, the slope changes at the extreme points of the effective motifs found with piecewise linear representation are the same, i.e., the trend is the same, which verifies the effectiveness of motif discovery after piecewise linear representation.

3) Efficiency analysis of the present algorithm

The algorithm mainly comprises two parts, piecewise linear representation and motif extraction. Therefore, on the data set, the change rate threshold d is taken as the independent variable and the total time consumed by piecewise linear representation and motif extraction as the dependent variable; the algorithm is run 8 times at each change rate threshold, and the average running time is calculated. The results are shown in Table 6 and FIG. 5.
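The measurement protocol above (8 runs per threshold, averaged) can be sketched with a small timing helper. The helper name `average_runtime` and the stand-in workload are hypothetical; the sketch measures wall-clock time with `time.perf_counter` and does not reproduce the experiment or its results.

```python
import time
from statistics import mean

def average_runtime(func, *args, repeats=8):
    """Run func(*args) `repeats` times and return the mean elapsed seconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - t0)
    return mean(times)

def workload(d):
    """Stand-in for one full run of the pipeline at threshold d."""
    return sum(i * d for i in range(10_000))

for d in range(5):
    print(d, average_runtime(workload, d))
```

Timing the piecewise linear representation and the motif extraction separately with the same helper would give the per-part breakdown discussed in the analysis.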

Table 6. Running time of the present algorithm under different thresholds, based on East Sand Island station measurement data

Analysis of the above experimental results shows that, as the change rate threshold d increases, the compression ratio gradually increases and the running times of both piecewise linear representation and motif extraction decrease, so the overall running time of the algorithm decreases. The piecewise linear representation accounts for only a small share of the overall time and is almost negligible. The motif extraction running time drops sharply at first and then decreases slowly. It follows that, as the compression ratio increases, the running time of the algorithm first falls quickly and then more slowly, because most of the time saved comes from motif extraction.
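The timing procedure (8 runs per change rate threshold, averaged) could be reproduced with a small harness such as the one below. It is a sketch only: `pipeline` is a hypothetical callable standing in for piecewise linear representation followed by motif extraction, and wall-clock time from `time.perf_counter` is an assumed measurement choice.

```python
import time

def mean_runtime(fn, runs=8):
    """Average wall-clock time of fn() over `runs` executions, in seconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / runs

def time_by_threshold(pipeline, series, thresholds, runs=8):
    """Map each change rate threshold d to the mean run time of
    pipeline(series, d). `pipeline` is a hypothetical placeholder."""
    return {d: mean_runtime(lambda: pipeline(series, d), runs)
            for d in thresholds}
```

To separate the two parts as in the analysis above, the same harness can be applied to each stage on its own.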

It should be understood that, although the steps in the flowchart of FIG. 1 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of the steps is not strictly limited to this order, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed in sequence, as they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this description.

The above examples illustrate only a few embodiments of the application. They are described in specific detail but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A suffix tree based time series variable length motif mining method, the method comprising:

performing a slope-based time series pattern representation, which represents the change trend of the time series data, wherein, according to the slope change characteristics, the time series patterns are divided into {steep rise, slow rise, unchanged, slow fall, steep fall}, expressed as M = {2, 1, 0, -1, -2}; setting a change rate threshold; extracting the first point and the last point of the observed hydrologic data and storing them in an edge point set; and calculating the slope change according to the slope-based pattern representation, wherein, if the slope change at a point to be analyzed x_i exceeds the change rate threshold, the point x_i is extracted and stored in the edge point set;

constructing a suffix tree from the edge points of the edge point set according to a suffix tree construction method, and counting the frequencies of edge point subsequences using the suffix tree, wherein the edge point subsequence with the highest frequency is the frequent pattern;

mapping the frequent pattern back to the text data, and recording the positions of the variable-length motifs;

calculating Matrix Profile values between the variable-length motifs according to the variable-length motif positions, wherein the pair of variable-length motifs with the minimum Matrix Profile value are the effective motifs of the hydrologic data;

wherein the step of calculating the slope change according to the slope-based pattern representation and, if the slope change at the point to be analyzed x_i exceeds the change rate threshold, extracting the point x_i and storing it in the edge point set, comprises:

setting a change rate threshold d, and analyzing the value of |slope1 - slope2|, wherein, if the value is greater than or equal to the change rate threshold d, the point to be analyzed x_i is an edge point, and otherwise the point x_i is not an edge point; slope1 is the slope of the line segment defined by the point to be analyzed x_i and its preceding point x_{i-1} in the hydrologic data, and slope2 is the slope of the line segment defined by the point to be analyzed x_i and its following point x_{i+1} in the hydrologic data.
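For illustration only, the edge point test recited above can be sketched as follows. The sketch assumes equally spaced samples with a unit time step, so that each slope reduces to a first difference; the function name `extract_edge_points` is hypothetical, and the sketch returns indices rather than values so that positions in the original series are preserved.

```python
def extract_edge_points(series, d):
    """Return the indices of the edge points: the first and last points,
    plus every interior point x_i with |slope1 - slope2| >= d."""
    n = len(series)
    if n <= 2:
        return list(range(n))
    edges = [0]  # the first point is always kept
    for i in range(1, n - 1):
        slope1 = series[i] - series[i - 1]  # segment (x_{i-1}, x_i), unit step assumed
        slope2 = series[i + 1] - series[i]  # segment (x_i, x_{i+1})
        if abs(slope1 - slope2) >= d:
            edges.append(i)
    edges.append(n - 1)  # the last point is always kept
    return edges
```

For unevenly spaced data, each difference would need to be divided by the corresponding time interval.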

2. The method of claim 1, wherein the step of constructing a suffix tree from the edge points of the edge point set according to a suffix tree construction method, counting the frequencies of edge point subsequences using the suffix tree, and taking the edge point subsequence with the highest frequency as the frequent pattern comprises:

constructing a suffix tree from the edge points of the edge point set according to a suffix tree construction method;
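As a hedged illustration of frequency counting over a suffix structure, the sketch below uses a naive, uncompressed suffix trie rather than a compact suffix tree, which costs quadratic memory but keeps the idea visible. It assumes the edge points have already been encoded as a sequence of pattern symbols from M = {2, 1, 0, -1, -2}. The `min_len` parameter and the tie-breaking rule (prefer the longer pattern) are assumptions introduced here so that single symbols do not trivially win; they are not taken from the claims.

```python
class Node:
    """Trie node; `starts` holds the start index of every occurrence of
    the subsequence spelled by the path from the root to this node."""
    def __init__(self):
        self.children = {}
        self.starts = []

def build_suffix_trie(symbols):
    """Insert every suffix of `symbols` into an uncompressed trie."""
    root = Node()
    for start in range(len(symbols)):
        node = root
        for sym in symbols[start:]:
            node = node.children.setdefault(sym, Node())
            node.starts.append(start)
    return root

def most_frequent(root, min_len=2):
    """Return (pattern, starts) for the most frequent subsequence with at
    least min_len symbols; ties are broken in favor of longer patterns."""
    best = ((), [])
    stack = [((), root)]
    while stack:
        path, node = stack.pop()
        if len(path) >= min_len:
            if (len(node.starts), len(path)) > (len(best[1]), len(best[0])):
                best = (path, node.starts)
        for sym, child in node.children.items():
            stack.append((path + (sym,), child))
    return best
```

The returned start indices refer to positions in the symbol sequence, so they can be translated back to positions in the original series through the stored edge point indices.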

3. The method of claim 1, wherein the step of mapping the frequent pattern back to an original time series, recording a variable length motif position, comprises: