CN114911846A

CN114911846A - FAD and DTW-based hydrological time sequence similarity searching method

Info

Publication number: CN114911846A
Application number: CN202210531963.9A
Authority: CN
Inventors: 杨佳琦; 万定生; 余宇峰
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2022-05-17
Filing date: 2022-05-17
Publication date: 2022-08-16

Abstract

The invention discloses a hydrological time series similarity search method based on FAD and DTW, comprising the following steps: first, a pre-acquired time series is smoothed using a wavelet transform; second, the start point, the end point and the local extreme points of the time series are selected as feature points, semantics are assigned to the data segment between adjacent feature points, and the series is given a semantic symbolic representation; next, derivative estimates are calculated at each point of the subsequences in the preliminary candidate set and of the sequence to be queried to obtain derivative estimation sequences, which are converted into symbolic representation sequences and finally into the feature sequences corresponding to the subsequences in the preliminary candidate set and to the sequence to be queried; once the data representation stage is complete, FAD is first used to find the subsequences with an approximately matching trend, and DTW is then used for exact matching to obtain the final similar subsequences. By combining the characteristics of FAD and DTW to search historical time series for similarity, the disclosed method substantially improves search efficiency.

Description

FAD and DTW-based hydrological time sequence similarity searching method

Technical Field

The invention belongs to the technical field of hydrologic data mining, and particularly relates to a method for searching similarity of hydrologic time sequences based on FAD and DTW.

Background

Hydrological time series similarity search aims to find, for a given time series, similar subsequences in historical time series. Discovering similarity among the data in a time series database makes it possible to grasp the patterns and trends of data change and provides a basis for effective prediction. Research on hydrological time series similarity search is therefore of practical importance for flood forecasting and flood control scheduling.

The problems involved in hydrological time series similarity search mainly include time series feature representation, similarity measurement and subsequence matching. Researchers have studied time series similarity using a variety of methods, and some of these results have been applied in hydrology. Similarity measures for hydrological time series mainly comprise the Euclidean distance, the dynamic time warping (DTW) distance and related improved algorithms (such as DTW-SS and FastDTW). The Euclidean distance is simple and easy to understand, but it is only suitable for comparing time series of equal length. DTW achieves high measurement accuracy by warping the time axis, but it matches the series point by point and therefore has high time complexity. A similarity search method that greatly reduces time complexity while preserving query accuracy is therefore needed.

Disclosure of Invention

Purpose of the invention: to overcome the deficiencies of the prior art, the invention provides a hydrological time series similarity search method based on FAD and DTW which, drawing on data mining techniques, improves query efficiency while ensuring query accuracy.

The technical scheme is as follows: the invention discloses a hydrological time series similarity searching method based on FAD and DTW, which comprises the following steps:

Step S1: to eliminate noise in the original time series, smooth the historical time series and the sequence to be queried using a wavelet transform;

Step S2: select the start point, the end point and the local extreme points meeting certain conditions in the smoothed time series as feature points, assign the semantics rise (U), hold (B) or decline (D) to the data segment between adjacent feature points, and give the historical time series and the sequence to be queried a semantic symbolic representation;

Step S3: screen out from the historical series the subsequences that have the same semantics as the sequence to be queried, as a preliminary candidate set;

Step S4: calculate the derivative estimate at each point of the subsequences in the preliminary candidate set and of the sequence to be queried to obtain derivative estimation sequences, convert these into symbolic representation sequences, and finally obtain the feature sequences corresponding to the subsequences in the preliminary candidate set and to the sequence to be queried;

Step S5: using the FAD similarity measure, approximately match the feature sequence of the sequence to be queried against the feature subsequences in the preliminary candidate set in turn, and select the top M subsequences with an approximately matching trend according to FAD distance;

Step S6: perform exact DTW matching between the query sequence and the M approximate subsequences to obtain the N subsequences with the smallest DTW distance, i.e., the best similar subsequences;
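As a rough illustration of the smoothing in step S1, the sketch below denoises a series with a multi-level Haar wavelet transform and hard thresholding of the detail coefficients. The text does not state which wavelet, decomposition depth or threshold is used, so the Haar basis, the parameters `levels` and `threshold`, and all function names are assumptions made only for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(x):
    """One Haar analysis step on an even-length list: (approximation, detail)."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward: rebuild the even-length signal."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def haar_smooth(series, levels=2, threshold=0.5):
    """Smooth a series by zeroing small Haar detail coefficients (assumed scheme)."""
    n = len(series)
    if n == 0:
        return []
    block = 2 ** levels
    # Pad by repeating the last value so every level sees an even length.
    padded = list(series) + [series[-1]] * ((-n) % block)
    approx, details = padded, []
    for _ in range(levels):
        approx, d = haar_forward(approx)
        # Hard thresholding: keep only detail coefficients larger than the threshold.
        details.append([c if abs(c) > threshold else 0.0 for c in d])
    for d in reversed(details):
        approx = haar_inverse(approx, d)
    return approx[:n]
```

With a very large threshold every detail coefficient is dropped, so each block of `2 ** levels` points collapses to its mean; with `threshold=0.0` the series is reconstructed unchanged. A production system would more likely rely on an established wavelet library than on this hand-written transform.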

Step S2 gives the time series a semantic symbolic representation and further comprises:

Step S2.1: given a time series $T = (x_1, x_2, \ldots, x_n)$, a data point $x_m$ ($1 \le m \le n$) is called an extreme point if it satisfies one of the following conditions:

(1) $m = 1$ or $m = n$;

(2) $x_m \ge x_{m-1}$ and $x_m \ge x_{m+1}$, where $1 < m < n$;

(3) $x_m \le x_{m-1}$ and $x_m \le x_{m+1}$, where $1 < m < n$;

Step S2.2: assign the semantics rise (U), hold (B) or decline (D) to the data segment between adjacent extreme points, and give the historical time series and the sequence to be queried a semantic symbolic representation;
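The extreme-point conditions of step S2.1, the U/B/D labelling of step S2.2 and the semantic screening of step S3 can be sketched as follows. This is a minimal illustration, not the patented implementation: indices are 0-based, the tolerance `tol` that separates "hold" from "rise" and "decline" is an assumed parameter (the text only says the points meet "certain conditions"), and the function names are hypothetical.

```python
def extreme_points(x):
    """0-based indices of the end points and of local maxima/minima (step S2.1)."""
    n = len(x)
    idx = []
    for m in range(n):
        if m == 0 or m == n - 1:
            idx.append(m)                      # condition (1): start or end point
        elif x[m] >= x[m - 1] and x[m] >= x[m + 1]:
            idx.append(m)                      # condition (2): local maximum
        elif x[m] <= x[m - 1] and x[m] <= x[m + 1]:
            idx.append(m)                      # condition (3): local minimum
    return idx

def semantic_pattern(x, idx, tol=0.0):
    """Label each segment between adjacent feature points as U, B or D (step S2.2)."""
    labels = []
    for a, b in zip(idx, idx[1:]):
        diff = x[b] - x[a]
        labels.append("U" if diff > tol else "D" if diff < -tol else "B")
    return "".join(labels)

def candidate_ranges(history, query, tol=0.0):
    """Index ranges of history whose semantic pattern equals the query's (step S3)."""
    h_idx = extreme_points(history)
    h_pat = semantic_pattern(history, h_idx, tol)
    q_pat = semantic_pattern(query, extreme_points(query), tol)
    width = len(q_pat)
    if width == 0:
        return []
    ranges = []
    for s in range(len(h_pat) - width + 1):
        if h_pat[s:s + width] == q_pat:
            # The match spans feature points s .. s + width of the history.
            ranges.append((h_idx[s], h_idx[s + width]))
    return ranges
```

For example, the history `[1, 3, 2, 2, 5, 4]` has the pattern `UDBUD`, and a query `[0, 2, 1]` with pattern `UD` matches the index ranges `(0, 2)` and `(3, 5)`.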

Step S4 converts a time series into a feature sequence and further comprises:

Step S4.1: given a time series $T = (x_1, x_2, \ldots, x_n)$, convert the original time series into a derivative estimation sequence, where the derivative estimate is calculated according to the following formula:

where $x_h$ is a data point of the time series $T = (x_1, x_2, \ldots, x_n)$;

Step S4.2: after the derivative estimation sequence is obtained, the derivative values are divided into different symbol values according to their distribution; these symbol values reflect the trend information of the time series. The conversion formula for the symbolic representation sequence is as follows:

where $R_h$ is the symbolic representation of the corresponding derivative estimate; the parameter $\varepsilon$ ($\varepsilon \ge 0$) is the trend-change threshold, used to judge the magnitude of change in the data; and the parameter $\lambda$ is the number of symbols used to represent the original sequence;

Step S4.3: transform the resulting symbolic representation sequence to obtain the feature sequence, whose elements are $S_j = (R_j, k_j)$, where $R_j$ is a symbol in the feature sequence and $k_j$ is the number of adjacent points that share that symbol.

Step S4.4: obtain the feature sequences corresponding to the subsequences in the preliminary candidate set and to the sequence to be queried according to the above steps.
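The conversion in step S4 can be sketched end to end as below. Only the run-length step that produces the pairs $(R_j, k_j)$ follows directly from the text; the derivative and symbolization formulas are not reproduced in this text, so the sketch substitutes assumptions: the derivative estimate commonly used in derivative DTW (average of the left slope and the two-point central slope), and a simple binning in which values within `eps` of zero map to symbol 0 and larger magnitudes map to levels 1 to `lam` in bins of an assumed width `step`. All function and parameter names are hypothetical.

```python
def derivative_estimates(x):
    """Assumed estimate: mean of the left slope and the central slope at each point."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least three points")
    d = [0.0] * n
    for h in range(1, n - 1):
        d[h] = ((x[h] - x[h - 1]) + (x[h + 1] - x[h - 1]) / 2.0) / 2.0
    d[0], d[-1] = d[1], d[-2]          # copy the neighbours at the two ends
    return d

def symbolize(d, eps=0.1, lam=3, step=1.0):
    """Assumed binning: 0 inside the threshold, otherwise a signed level in 1..lam."""
    symbols = []
    for v in d:
        if abs(v) <= eps:
            symbols.append(0)
        else:
            level = min(lam, 1 + int((abs(v) - eps) // step))
            symbols.append(level if v > 0 else -level)
    return symbols

def to_feature_sequence(symbols):
    """Run-length encode the symbols into (R_j, k_j) pairs as in step S4.3."""
    features = []
    for r in symbols:
        if features and features[-1][0] == r:
            features[-1] = (r, features[-1][1] + 1)
        else:
            features.append((r, 1))
    return features
```

For instance, the symbol sequence `[1, 1, 0, 0, 0, -2]` becomes the feature sequence `[(1, 2), (0, 3), (-2, 1)]`, which is much shorter than the original series when trends persist over many points.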

Step S5 applies the FAD similarity measure to each subsequence in the preliminary candidate set and the sequence to be queried in turn, and further comprises:

Step S5.1: consider the feature sequence of a subsequence in the preliminary candidate set and the feature sequence of the sequence to be queried. If corresponding segments of the two feature sequences are represented by different symbols, i.e., the two segments have different trends, the distance between the two segments is:

$D(S1_i, S2_j) = 1, \quad R1_i \ne R2_j$

where $S1_i$ and $S2_j$ are segments of the two feature sequences, respectively, and $R1_i$ and $R2_j$ are the symbols of $S1_i$ and $S2_j$;

Step S5.2: if corresponding segments of the two feature sequences are represented by the same symbol, i.e., the two segments have a similar trend, the distance between them depends mainly on their difference in length and is calculated as follows:

where $k1_i$ and $k2_j$ are the numbers of points in $S1_i$ and $S2_j$, respectively, and $\gamma$ is an adjustable parameter that sets the ratio between same-symbol and different-symbol distances. In principle, the distance between same-symbol segments must be smaller than the distance between different-symbol segments; therefore $0 \le D(S1_i, S2_j) < 1$ and $\gamma \in [0, 1]$.

Step S5.3: because the time series may differ in length and FAD warps the time axis, some segments of one sequence may have no segment to map to. Such segments are regarded as dissimilar to every segment of the other sequence, and their distance is:

$D(-, S_i) = 1$

Step S5.4: combining steps S5.1 to S5.3 gives the distance formula for two segments, from which the FAD distance between the two time series is calculated as follows:

Step S5.5: select the 50 subsequences with the smallest FAD distance to form the candidate set to be matched.
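A possible reading of steps S5.1 to S5.4 is sketched below. The distances of 1 for different symbols and for unmapped segments come from the text; the same-symbol formula and the overall FAD formula are not reproduced here, so two assumptions are made and should be treated as placeholders: the same-symbol distance is taken as $\gamma \cdot |k1_i - k2_j| / \max(k1_i, k2_j)$, which stays in $[0, 1)$ for $\gamma \in [0, 1]$, and the total distance is computed by an edit-distance style dynamic program in which skipping a segment costs 1. The function names are hypothetical.

```python
def segment_distance(s1, s2, gamma=0.5):
    """Distance between two (symbol, length) segments."""
    (r1, k1), (r2, k2) = s1, s2
    if r1 != r2:
        return 1.0                                  # step S5.1: different trends
    return gamma * abs(k1 - k2) / max(k1, k2)       # assumed same-symbol form

def fad_distance(f1, f2, gamma=0.5):
    """Assumed alignment: minimum total cost of matching or skipping segments."""
    n, m = len(f1), len(f2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)                         # all of f1's segments unmapped
    for j in range(1, m + 1):
        dp[0][j] = float(j)                         # all of f2's segments unmapped
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + segment_distance(f1[i - 1], f2[j - 1], gamma),
                dp[i - 1][j] + 1.0,                 # step S5.3: skip a segment of f1
                dp[i][j - 1] + 1.0,                 # step S5.3: skip a segment of f2
            )
    return dp[n][m]
```

Because the dynamic program runs over run-length segments rather than raw points, its table is usually far smaller than a point-level DTW table for the same pair of series.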

Step S6 applies the DTW similarity measure to each subsequence in the candidate set to be matched and the sequence to be queried in turn, and further comprises:

Step S6.1: calculate the DTW distance between the sequence to be queried and each subsequence in the candidate set to be matched, and take the 4 subsequences with the smallest DTW distance as the best similar subsequences. The DTW distance is calculated as follows:

where $Q$ is the sequence to be queried, $Y$ is a subsequence in the candidate set to be matched, and $D_{base}(q_i, y_j)$ is the base distance between the $i$-th time point vector of $Q$ and the $j$-th time point vector of $Y$, expressed as the Euclidean distance.

Step S6.2: output the final result set of similar sequences.
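For reference, a minimal sketch of the DTW distance of step S6.1 is given below. The formula itself is not reproduced in this text, so the sketch assumes the textbook recurrence, in which each cell adds the base distance to the smallest of its three predecessor cells, and uses the absolute difference as the Euclidean base distance for scalar observations. The function name is hypothetical.

```python
def dtw_distance(q, y):
    """Textbook DTW between two scalar sequences (assumed recurrence)."""
    n, m = len(q), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            base = abs(q[i - 1] - y[j - 1])         # Euclidean distance in 1-D
            d[i][j] = base + min(d[i - 1][j],       # q advances alone
                                 d[i][j - 1],       # y advances alone
                                 d[i - 1][j - 1])   # both advance
    return d[n][m]
```

Warping lets sequences of different lengths match exactly: `[1, 2, 3]` and `[1, 2, 2, 3]` have distance 0. The cost is a table of `len(q) * len(y)` cells per comparison, which is why the method restricts DTW to a short list of candidates.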

Beneficial effects: compared with the prior art, the invention has the following advantages:

Building on existing similarity measures, the invention considers both the morphological and the numerical characteristics of hydrological time series, and searches for similarity by combining trend-based FAD approximate matching with DTW exact matching, so that similar sequences within a river basin can be mined effectively.

Compared with conventional DTW, FAD_DTW addresses the high computational complexity that DTW incurs through point-to-point matching: by first screening out subsequences with an approximately matching morphological trend, it greatly reduces the candidate set for subsequent similarity matching and effectively improves query efficiency, which is of practical importance for flood forecasting and flood control scheduling.
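The efficiency argument can be made concrete with a rough operation count. This is only a back-of-the-envelope estimate under stated assumptions: DTW fills a quadratic table over points, FAD is assumed to fill a quadratic table over run-length segments, and the symbols $C$, $n$ and $p$ as well as the example values for them are hypothetical (only the shortlist size of 50 comes from the text).

```latex
% Assumed symbols: C = size of the preliminary candidate set, n = points per sequence,
% p = run-length segments per feature sequence (p <= n), M = size of the FAD shortlist.
\begin{align*}
T_{\text{DTW only}} &\approx C \cdot n^{2} \\
T_{\text{FAD\_DTW}} &\approx C \cdot p^{2} + M \cdot n^{2} \\
\text{e.g. } C = 1000,\ n = 100,\ p = 10,\ M = 50:\quad
T_{\text{DTW only}} &\approx 10^{7}, \qquad
T_{\text{FAD\_DTW}} \approx 10^{5} + 5 \times 10^{5} = 6 \times 10^{5}
\end{align*}
```

Under these illustrative numbers the two-stage search fills roughly one seventeenth as many table cells; the actual gain depends on how strongly the feature sequences compress the data and on how many candidates survive the semantic screening.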

Drawings

FIG. 1 is a diagram of the overall steps in an embodiment of the present invention;

FIG. 2 is a diagram illustrating the conversion of a symbolic representation sequence into a feature sequence in the embodiment;

FIGS. 3 and 4 show the similar subsequences obtained by the FAD_DTW method in the two experiments of the embodiment;

FIGS. 5 and 6 show the similar subsequences obtained by the DTW-SS method in the two experiments of the embodiment;

FIG. 7 compares the query times of FAD_DTW and DTW-SS in the embodiment as the number of years covered by the historical series increases;

FIG. 8 compares the query times of FAD_DTW and DTW-SS in the embodiment as the length of the sequence to be queried increases;

Detailed Description

The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

As shown in FIG. 1, the FAD and DTW-based hydrological time series similarity search method of this embodiment includes the following steps:

Step S1: select hydrological data from the Tunxi station in the Tunxi basin as the data set, obtain from it a sequence to be queried Q and a historical time series S, and smooth both using a wavelet transform to obtain the smoothed sequence to be queried Q' and the smoothed historical time series S'.

Step S2: select the start point, the end point and the local extreme points meeting certain conditions in the sequence to be queried Q' as feature points, assign the semantics rise (U), hold (B) or decline (D) to the data segment between adjacent feature points, and give the historical time series and the sequence to be queried a semantic symbolic representation;

Step S2.1: for the sequence to be queried $Q' = \{x_1, x_2, \ldots, x_n\}$, a data point $x_m$ ($m \le n$) is called an extreme point if it satisfies one of the following conditions:

(1) $m = 1$ or $m = n$;

(2) $x_m \ge x_{m-1}$ and $x_m \ge x_{m+1}$, where $1 < m < n$;

(3) $x_m \le x_{m-1}$ and $x_m \le x_{m+1}$, where $1 < m < n$;

Step S2.2: extract the extreme points of the sequence Q' according to the conditions in step S2.1 to obtain the extreme point sequence Q'', and symbolize Q''. For the extreme point sequence Q'', the pattern between every two adjacent data points is used to form a new sequence $Q''' = \{q_1, q_2, \ldots, q_n\}$, where $q_i \in \{U, B, D\}$ denotes a rising, holding or falling trend, respectively; Q''' is the semantic pattern representation of Q';

Step S3: extract extreme points from the historical time series S in the same way as in step S2, and obtain the semantic pattern representation S' of S;

Step S4: screen out from the historical series S' the subsequences that have the same semantics as the sequence to be queried Q', as the preliminary candidate set Z;

Step S5: calculate the derivative estimate at each point of the sequence to be queried Q' to obtain a derivative estimation sequence, then convert it into a symbolic representation sequence, and finally obtain the feature sequence corresponding to Q;

Step S5.1: obtain the derivative estimation sequence, where the derivative estimate is calculated as follows:

Step S5.2: the derivative values are divided into different symbol values according to their distribution; these symbol values reflect the trend information of the time series. The conversion formula for the symbolic representation sequence is as follows:

where $R_h$ is the symbolic representation of the corresponding derivative estimate; the parameter $\varepsilon$ ($\varepsilon \ge 0$) is the trend-change threshold, used to judge the magnitude of change in the data; and the parameter $\lambda$ ($\lambda \ge 1$, $\lambda$ an integer) is the number of symbols used to represent the original sequence. For example, the original sequence can be converted into a sequence consisting of symbols such as -3, -2, -1, 0, 1, 2, 3;

Step S5.3: transform the resulting symbolic representation sequence to obtain the feature sequence of the sequence to be queried, whose elements are $S_j = (R_j, k_j)$, where $R_j$ is a symbol in the feature sequence and $k_j$ is the number of adjacent points that share that symbol; FIG. 2 shows the whole transformation process;

Step S6: following the feature sequence procedure of step S5, calculate the derivative estimates of all subsequences in the candidate set Z in the same way to obtain the corresponding set of feature subsequences;

Step S7: calculate in turn the FAD distance between the feature sequence of the sequence to be queried and each feature subsequence in the set obtained in step S6, and, according to the FAD distance values of all subsequences, select the 50 subsequences whose trend is most similar to that of the sequence to be queried to form the data set S' to be matched.

Step S7.1: take a feature subsequence from the set of feature subsequences. If corresponding segments of this feature subsequence and of the feature sequence of the sequence to be queried are represented by different symbols, i.e., the two segments have different trends, the distance between the two segments is:

$D(S1_i, S2_j) = 1, \quad R1_i \ne R2_j$

Step S7.2: if corresponding segments of the two sequences are represented by the same symbol, i.e., the two segments have a similar trend, the distance between them depends mainly on their difference in length and is calculated as follows:

where $k1_i$ and $k2_j$ are the numbers of points in $S1_i$ and $S2_j$, respectively, and $\gamma$ is an adjustable parameter that sets the ratio between same-symbol and different-symbol distances. In principle, the distance between same-symbol segments must be smaller than the distance between different-symbol segments; therefore $0 \le D(S1_i, S2_j) < 1$ and $\gamma \in [0, 1]$.

Step S7.3: because the time series may differ in length and FAD warps the time axis, some segments of one sequence often have no segment to map to. Such segments are regarded as dissimilar to every segment of the other sequence, and their distance is:

$D(-, S_i) = 1$

Step S7.4: combining steps S7.1 to S7.3 gives the distance formula for two segments, from which the FAD distance between the two time series is calculated as follows:

Step S8: calculate the DTW distance between the sequence to be queried and each subsequence in the candidate set S', and take the 4 subsequences with the smallest DTW distance as the best similar subsequences. The DTW distance is calculated as follows:
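The coarse-then-fine selection of steps S7 and S8 (keep the 50 best candidates by the cheap trend distance, then rank those by the expensive distance and keep the best 4) can be sketched generically. The function name and signature are hypothetical, and the two distance functions are passed in as parameters; in the method described here they would be the FAD distance on feature sequences and the DTW distance on the smoothed series.

```python
def two_stage_search(query, candidates, coarse_dist, fine_dist, m=50, n_best=4):
    """Return indices of the n_best candidates after a coarse filter of size m."""
    # Stage 1: rank every candidate with the cheap distance and keep the top m.
    order = sorted(range(len(candidates)),
                   key=lambda i: coarse_dist(query, candidates[i]))
    shortlist = order[:m]
    # Stage 2: re-rank only the shortlist with the expensive distance.
    ranked = sorted(shortlist, key=lambda i: fine_dist(query, candidates[i]))
    return ranked[:n_best]

# Toy stand-in distances, used only to exercise the function.
def mean_gap(a, b):
    return abs(sum(a) / len(a) - sum(b) / len(b))

def pointwise_gap(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

result = two_stage_search(
    [1, 2, 3],
    [[1, 2, 3], [10, 11, 12], [1, 2, 4], [2, 2, 2]],
    coarse_dist=mean_gap, fine_dist=pointwise_gap, m=3, n_best=2,
)
```

The expensive distance is evaluated at most `m` times regardless of how large the preliminary candidate set is, which is the source of the reduction in query time. A candidate that the coarse stage ranks below position `m` can never be returned, so the shortlist size trades speed against the risk of discarding a true match.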

To verify the speed and accuracy of the invention, two sets of experimental data were taken from the Tunxi station in the Tunxi basin, and the method was compared with the DTW-SS method. The similar subsequences found by the two methods are shown in Table 1 and Table 2. FIG. 3 and FIG. 4 show the top 4 matching results obtained by the FAD_DTW method for the two query sequences, and FIG. 5 and FIG. 6 show the top 4 matching results obtained by the DTW-SS method for the same two query sequences. The query times of the two methods are shown in FIG. 7 and FIG. 8. The figures show that the FAD_DTW algorithm of this embodiment maintains query accuracy while achieving query efficiency clearly superior to that of the DTW-SS method.

TABLE 1 FAD_DTW similarity matching results

TABLE 2 DTW-SS similarity matching results

Claims

1. A hydrological time series similarity search method based on FAD and DTW, characterized by comprising the following steps:

the data preparation stage specifically comprises:

the similarity searching stage specifically comprises the following steps:

and step S6, carrying out DTW accurate matching on the query sequence and the M approximate subsequences, and obtaining the first N subsequences with the minimum DTW distance, namely the best similar subsequence.

2. The FAD and DTW-based hydrological time series similarity search method according to claim 1, wherein the step S2 is implemented as follows:

(1) $m = 1$ or $m = n$;

(2) $x_m \ge x_{m-1}$ and $x_m \ge x_{m+1}$, where $1 < m < n$;

(3) $x_m \le x_{m-1}$ and $x_m \le x_{m+1}$, where $1 < m < n$;

Step S2.2: assign the semantics rise (U), hold (B) or decline (D) to the data segments between adjacent extreme points, and give the historical time series and the sequence to be queried a semantic symbolic representation.

3. The FAD and DTW-based hydrological time series similarity search method according to claim 1, wherein the step S4 is implemented as follows:

Step S4.1: given a time series $T = (x_1, x_2, \ldots, x_n)$, convert the original time series into a derivative estimation sequence by equation (1),

where $x_h$ is a data point of the time series $T = (x_1, x_2, \ldots, x_n)$;

Step S4.2: after obtaining the derivative estimation sequence

where $R_h$ is

where $S_j = (R_j, k_j)$, $R_j$ is a symbol in the feature sequence, and $k_j$ is the number of adjacent points that share that symbol;

4. The FAD and DTW based hydrological time series similarity search method of claim 1, wherein the FAD similarity measure of step S5 is implemented by the following process:

Step S5.1: assume there are two feature sequences. If corresponding segments of the two feature sequences are represented by different symbols, indicating that the two segments have different trends, the distance between the two segments is:

$D(S1_i, S2_j) = 1, \quad R1_i \ne R2_j$

where $S1_i$ and $S2_j$ are segments of the two feature sequences, respectively;

Step S5.2: if corresponding segments of the two feature sequences are represented by the same symbol, indicating that the two segments have a similar trend, the distance between them depends mainly on their difference in length and is calculated as follows:

where $k1_i$ and $k2_j$ are the numbers of points in $S1_i$ and $S2_j$, respectively, and $\gamma$ is an adjustable parameter that sets the ratio between same-symbol and different-symbol distances. In principle, the distance between same-symbol segments must be smaller than the distance between different-symbol segments; therefore $0 \le D(S1_i, S2_j) < 1$ and $\gamma \in [0, 1]$.

Step S5.3: because the time series may differ in length and FAD warps the time axis, some segments of one sequence may have no segment to map to. Such segments are regarded as dissimilar to every segment of the other sequence, and their distance is:

$D(-, S_i) = 1$

The FAD distance between the two time series is thus calculated as follows:

5. The FAD and DTW-based hydrological time series similarity search method according to claim 1, wherein the DTW similarity measure of step S6 is implemented by the following process:

where $X$ and $Y$ are the two time series for which the DTW similarity is measured, and $D_{base}(x_i, y_j)$ is the base distance between the $i$-th time point vector of $X$ and the $j$-th time point vector of $Y$, expressed as the Euclidean distance.

Step S6.2: output the final result set of similar sequences.