CN109241118A

CN109241118A - Time series motif discovery method based on subsequence full join and maximum clique

Info

Publication number: CN109241118A
Application number: CN201810895890.5A
Authority: CN
Inventors: 王继民; 朱跃龙; 朱晓晓; 张鹏程
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2018-08-08
Filing date: 2018-08-08
Publication date: 2019-01-18

Abstract

The present invention discloses a time series motif discovery method based on subsequence full join and maximum clique. The steps are: 1. Subsequence full join: a sliding window of length m and nested loops are used to compute the distances between all subsequences of the time series T. 2. Constructing the subsequence similarity graph: a similarity threshold is defined; distances below the threshold are represented by 1 and all other distances by 0, which converts the distance matrix into the corresponding adjacency matrix. 3. Finding the maximum clique: a maximum clique search algorithm searches the adjacency matrix of the graph to find the maximum clique of the subsequence similarity graph. The time series subsequences corresponding to the vertices of the maximum clique form the motif. By discovering time series motifs with subsequence full join and a maximum clique algorithm, the invention improves the efficiency of time series motif discovery and solves the problem that existing time series motif discovery algorithms cannot find multiple motifs.

Description

Time series motif discovery method based on subsequence full join and maximum clique

Technical field

The present invention relates to a time series motif discovery method based on subsequence full join and maximum clique, and belongs to the field of information processing technology.

Background art

A time series is a set of data points arranged in chronological order at equal time intervals. Time series are ubiquitous and are widely used across industries: securities trading data in finance, temperature and air pressure data in meteorology, electricity consumption data in industry, and electroencephalogram and electrocardiogram data in medicine. Among the many problems of time series data mining, pattern discovery is a fundamental one. Frequent patterns, abnormal patterns and periodic patterns in a time series play an important role in association rule discovery, anomaly detection, prediction and similar tasks. Time series pattern discovery covers both searching for a pattern specified in advance and searching for patterns that are not known in advance. Searching for a pre-specified pattern (that is, query by content) has many solutions. Searching for unknown, repeated patterns, that is, time series motif discovery (also called sequence motif discovery for time series), faces more challenges. Motif discovery is significant for time series mining: it can be used for data partitioning, visualization and classification of massive time series databases, and for problems such as clustering, classification and association rule discovery. Existing motif discovery algorithms are computationally complex and cannot find multiple motifs, so improving the efficiency of motif discovery and finding more motifs is an important research direction.

Summary of the invention

Purpose of the invention: in view of the problems in the prior art, the present invention provides a time series motif discovery method based on subsequence full join and maximum clique, which efficiently discovers multiple motifs in a time series through three steps: subsequence join, construction of the subsequence similarity graph, and maximum clique search.

Technical solution: a time series motif discovery method based on subsequence full join and maximum clique, comprising the following steps:

(1) Subsequence full join

A sliding window of length m and nested loops are used to compute the distances between all subsequences of the time series T (that is, the self-join of T). This process uses the "ultra-fast" MASS algorithm to obtain the distance matrix. The algorithm is "ultra-fast" because it first applies the fast Fourier transform (FFT) to the data, then performs an element-wise product, and then applies the inverse Fourier transform to that product. These operations replace the convolution, which has higher computational complexity. Finally, the result of the inverse Fourier transform is used to compute the z-normalized Euclidean distance, which yields the distance matrix. The basic steps of the MASS algorithm are: first compute the sliding dot product between the query subsequence Q and T, then compute the mean and variance of Q and of the subsequences of T, and finally compute and return the z-normalized Euclidean distance between Q and each subsequence of T.
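As a rough illustration of the FFT step described above, the following Python sketch computes the sliding dot products between a query and every window of a series. It assumes NumPy; the function name `sliding_dot_product` and the choice of FFT length are illustrative assumptions and are not taken from the patent's pseudocode.

```python
import numpy as np

def sliding_dot_product(q, t):
    """Dot product of q with every length-m window of t, via FFT (illustrative)."""
    q = np.asarray(q, dtype=float)
    t = np.asarray(t, dtype=float)
    m, n = len(q), len(t)
    size = n + m  # zero-padded FFT length; >= n + m - 1 avoids wrap-around
    # Reversing q turns the linear convolution into a sliding dot product.
    spectrum = np.fft.rfft(t, size) * np.fft.rfft(q[::-1], size)
    conv = np.fft.irfft(spectrum, size)
    # Entries m-1 .. n-1 of the convolution are the n-m+1 window dot products.
    return conv[m - 1:n]
```

For short inputs the result should agree, up to rounding, with the direct computation `np.correlate(t, q, mode="valid")`; the FFT route pays off only when the series is long.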

The MASS algorithm first calls the SlidingDotProducts algorithm, whose main function is to compute the values QT[i]. SlidingDotProducts contains the classical convolution of two vectors; the algorithm computes this costly convolution with the fast Fourier transform and the inverse fast Fourier transform, which speeds up MASS. Because MASS uses the z-normalized Euclidean distance Dist[i] as the distance measure between the subsequences Q and T_{i,m}, it needs the dot product QT[i] between Q and T_{i,m}. The Euclidean distance Dist[i] is given by:

$$\mathrm{Dist}[i] = \sqrt{2m\left(1 - \frac{QT[i] - m\,\mu_Q\,M_T[i]}{m\,\sigma_Q\,\Sigma_T[i]}\right)}$$

where m is the subsequence length, μ_Q is the mean of the subsequence Q, σ_Q is the standard deviation of Q, M_T[i] is the mean of the subsequence T_{i,m}, and Σ_T[i] is the standard deviation of T_{i,m}. In the MASS algorithm, Q and T_{i,m} refer to two subsequences of the same time series that are not trivial matches; Q serves as the query sequence, and the distances between Q and the other subsequences of the time series are computed.
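The distance computation can be sketched in Python as follows. This is a minimal sketch, assuming the distance takes the standard z-normalized form Dist[i] = sqrt(2m(1 - (QT[i] - m·μ_Q·M_T[i]) / (m·σ_Q·Σ_T[i]))) and that no window of the series is constant (otherwise the standard deviation is zero). The function name `distance_profile` is hypothetical, and QT[i] is obtained here with a direct correlation rather than the FFT, purely for brevity.

```python
import numpy as np

def distance_profile(q, t):
    """z-normalized Euclidean distance between q and every window of t."""
    q = np.asarray(q, dtype=float)
    t = np.asarray(t, dtype=float)
    m = len(q)
    # QT[i]: plain sliding dot product (an FFT version would give the same values).
    qt = np.correlate(t, q, mode="valid")
    windows = np.lib.stride_tricks.sliding_window_view(t, m)
    mu_t, sigma_t = windows.mean(axis=1), windows.std(axis=1)
    mu_q, sigma_q = q.mean(), q.std()
    # Pearson correlation between q and each window, then Dist[i].
    corr = (qt - m * mu_q * mu_t) / (m * sigma_q * sigma_t)
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(2.0 * m * (1.0 - corr), 0.0))
```

A window identical in shape to the query yields a distance near 0, and a window that is the exact negative of the query yields the maximum value 2·sqrt(m).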

Normally, computing the mean and standard deviation of each subsequence of a long time series takes O(m) time per subsequence. The algorithm instead caches the cumulative sum and the cumulative sum of squares of the time series values; at any stage, these two cumulative-sum vectors suffice to compute the mean and variance of a subsequence of any length. Unlike KNN similarity search, the algorithm computes the distances between the query sequence and all subsequences of the time series, that is, the distance profile of the time series T.
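The cumulative-sum idea can be illustrated with a short Python sketch. The function name `rolling_mean_std` is an assumption made for illustration; the sketch uses the population standard deviation, which is what z-normalization conventionally uses.

```python
import numpy as np

def rolling_mean_std(t, m):
    """Mean and standard deviation of every length-m window of t in O(n) total."""
    t = np.asarray(t, dtype=float)
    # Prefix sums with a leading zero so that window i is csum[i+m] - csum[i].
    csum = np.concatenate(([0.0], np.cumsum(t)))
    csum2 = np.concatenate(([0.0], np.cumsum(t * t)))
    window_sum = csum[m:] - csum[:-m]
    window_sum2 = csum2[m:] - csum2[:-m]
    mean = window_sum / m
    # Var = E[x^2] - E[x]^2; clip rounding noise below zero.
    var = np.maximum(window_sum2 / m - mean ** 2, 0.0)
    return mean, np.sqrt(var)
```

Once the two prefix-sum vectors are cached, each window's statistics cost O(1) instead of O(m).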

(2) Constructing the subsequence similarity graph

A similarity threshold is defined; distance values below the threshold are represented by 1 and all other distance values by 0, which converts the distance matrix into an adjacency matrix. The graph obtained from this adjacency matrix is called the subsequence similarity graph. In the distance matrix, the distance values of trivially matching subsequences are set to inf. With the similarity threshold eps = 6, an element is set to 1 when its distance value is less than eps and to 0 otherwise, including when it is inf, which finally yields the similarity adjacency matrix.
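The thresholding step is simple enough to sketch directly. The function name `to_adjacency` is hypothetical; the explicit zeroing of the diagonal is a defensive assumption, since the text already sets trivial matches (which include each subsequence with itself) to inf.

```python
import numpy as np

def to_adjacency(dist, eps):
    """Convert a distance matrix into a 0/1 adjacency matrix using threshold eps."""
    dist = np.asarray(dist, dtype=float)
    # inf < eps is False, so trivial matches marked inf become 0 automatically.
    adj = (dist < eps).astype(int)
    # Guard: a vertex is never adjacent to itself.
    np.fill_diagonal(adj, 0)
    return adj
```

With eps = 6, a distance of exactly 6 is not below the threshold and therefore maps to 0.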

(3) Finding the maximum clique

A maximum clique algorithm solves the maximum clique problem of the subsequence similarity graph obtained in step (2). The vertices of the resulting maximum clique correspond to the instances of the time series motif. Finding the maximum clique of a graph is a hard combinatorial optimization problem. The maximum clique algorithm used by the present invention proposes a new objective function (R1NdM):

$$\min_{u \ge 0} \; \lVert M_d - u u^{T} \rVert_F^2$$

where the parameter d ≥ 0, A is the adjacency matrix of the graph G, and u is a local minimizer of the function. The associated modified adjacency matrix is B = A + I_n, and M_d = (1+d)B - d·1_{n×n}, where I_n is the identity matrix of order n and 1_{n×n} is the n×n all-ones matrix. Local minima of the objective function correspond to maximal cliques of G, and the global minimum corresponds to the maximum clique of G. Gradient descent is used as the iterative algorithm, with the step size adjusted by the Armijo rule, to solve for the optimal solution of the objective function; the optimal solution corresponds to the maximum clique of G. This algorithm has high accuracy and is very fast.
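A minimal sketch of this optimization is given below, under a loudly stated assumption: the objective is taken to be the rank-one form f(u) = ||M_d - u u^T||_F^2 minimized over u ≥ 0, which matches the definitions of B and M_d above but is not printed in this text. The function names, the projection onto u ≥ 0, and the parameters `sigma`, `beta`, `max_iter` and `tol` are all illustrative choices, not the patent's.

```python
import numpy as np

def build_md(adj, d):
    """M_d = (1 + d) * B - d * ones, with B = A + I (as defined in the text)."""
    n = len(adj)
    b = np.asarray(adj, dtype=float) + np.eye(n)
    return (1.0 + d) * b - d * np.ones((n, n))

def objective(m_d, u):
    """Assumed objective: squared Frobenius norm of M_d - u u^T."""
    return np.sum((m_d - np.outer(u, u)) ** 2)

def gradient(m_d, u):
    """Gradient of the assumed objective for symmetric M_d."""
    return 4.0 * ((u @ u) * u - m_d @ u)

def projected_gradient(m_d, u0, sigma=1e-4, beta=0.5, max_iter=500, tol=1e-9):
    """Projected gradient descent on u >= 0 with Armijo backtracking (sketch)."""
    u = np.asarray(u0, dtype=float).copy()
    for _ in range(max_iter):
        g = gradient(m_d, u)
        alpha = 1.0
        while True:
            u_new = np.maximum(u - alpha * g, 0.0)  # project onto u >= 0
            # Armijo sufficient-decrease test along the projected step.
            if objective(m_d, u_new) <= objective(m_d, u) + sigma * g @ (u_new - u):
                break
            alpha *= beta
            if alpha < 1e-12:  # give up shrinking; keep the current point
                u_new = u
                break
        if np.linalg.norm(u_new - u) < tol:
            return u_new
        u = u_new
    return u
```

Under this assumed form, the nonzero entries of a returned vector are read as the candidate clique. In the small test graph used to check the sketch (a triangle with one pendant vertex, d = 4), the indicator vector of the triangle is a stationary point and has a lower objective value than the indicator of the two-vertex clique.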

By adopting the above technical solution, the present invention has the following beneficial effects:

Compared with time series motif discovery algorithms that use approximate discretization, the proposed method based on subsequence full join and maximum clique works on the raw data and so retains the important information in the time series. Compared with motif discovery algorithms that use clustering, the proposed algorithm removes meaningless matches, so the motifs it finds are more valuable. Compared with probability-based motif discovery algorithms, the proposed algorithm has fewer parameters and is simple to compute and easy to understand. Compared with other motif discovery algorithms based on subsequence join, the proposed algorithm is more efficient. In addition, the proposed algorithm can find multiple motifs in a relatively short time, and offers efficiency, accuracy, and strong scalability and robustness.

Brief description of the drawings

Fig. 1 shows the conversion of the distance matrix into the adjacency matrix;

Fig. 2 compares the motifs found by each algorithm on the EEG dataset;

Fig. 3 compares the motifs found by each algorithm on the EOG dataset: (a) BF algorithm (eps=3.4, m=145), (b) TSSJMC algorithm (eps=3.4, m=145);

Fig. 4 compares the motifs found by each algorithm on the ECG dataset: (a) BF algorithm (eps=16.5, m=150), (b) TSSJMC algorithm (eps=16.5, m=150);

Fig. 5 compares the motifs found by each algorithm on the ECG dataset: (a) BF algorithm (eps=16, m=250), (b) TSSJMC algorithm (eps=16, m=250);

Fig. 6 compares the execution times of the three algorithms (EEG: PAA_size=30, α=6, repeat=60; EOG: PAA_size=25, α=5, repeat=60; ECG: PAA_size=20, α=3, repeat=60; Insect Behavior: PAA_size=30, α=5, repeat=60);

Fig. 7 compares the running times of the algorithms on the EEG dataset;

Fig. 8 compares the running times of the algorithms on the EOG dataset;

Fig. 9 compares the running times of the algorithms on the ECG dataset;

Fig. 10 compares the running times of the algorithms on the Insect Behavior dataset;

Fig. 11 compares the running times of the algorithms on EEG datasets of different lengths;

Fig. 12 compares the running times of the algorithms on EOG datasets of different lengths;

Fig. 13 compares the numbers of motifs found at different noise levels;

Fig. 14 compares the running times of the algorithms at different noise levels;

Fig. 15 is the flow chart of the method.

Detailed description of the embodiments

The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading the present invention, modifications of its various equivalent forms by those skilled in the art all fall within the scope defined by the appended claims of this application.

A time series motif discovery method based on subsequence full join and maximum clique, comprising the following steps:

(1) Subsequence full join

A sliding window of length m and nested loops are used to compute the distances between all subsequences of the time series T (that is, the self-join of T). This process uses the "ultra-fast" MASS algorithm to obtain the distance matrix. The algorithm is "ultra-fast" because it first applies the fast Fourier transform (FFT) to the data, then performs an element-wise product, and then applies the inverse Fourier transform to that product. These operations replace the convolution, which has higher computational complexity. Finally, the result of the inverse Fourier transform is used to compute the z-normalized Euclidean distance, which yields the distance matrix. The basic steps of the MASS algorithm are: first compute the sliding dot product between the query subsequence Q and T, then compute the mean and variance of Q and of the subsequences of T, and finally compute and return the z-normalized Euclidean distance between Q and each subsequence of T. The pseudocode of the MASS algorithm is shown in Table 1.

Table 1 MASS algorithm

The first line of Table 1 calls the SlidingDotProducts algorithm. Its basic steps are: first obtain the lengths of the subsequences Q and T; if the length of T is not equal to the length of Q, pad with zeros so that Q and T have equal length, which simplifies the subsequent product; reverse Q; apply the fast Fourier transform to the equal-length Q and T separately; perform the element-wise product; and finally apply the inverse Fourier transform to that product. Its pseudocode is shown in Table 2, and its main function is to compute the values QT[i]. Lines 5-6 of Table 2 are the classical convolution of two vectors; the algorithm computes this costly convolution with the fast Fourier transform and the inverse fast Fourier transform, which speeds up MASS. Because MASS uses the z-normalized Euclidean distance Dist[i] as the distance measure between the subsequences Q and T_{i,m}, it needs the dot product QT[i] between Q and T_{i,m}. The Euclidean distance Dist[i] is given by:

$$\mathrm{Dist}[i] = \sqrt{2m\left(1 - \frac{QT[i] - m\,\mu_Q\,M_T[i]}{m\,\sigma_Q\,\Sigma_T[i]}\right)}$$

where m is the subsequence length, μ_Q is the mean of the subsequence Q, σ_Q is the standard deviation of Q, M_T[i] is the mean of the subsequence T_{i,m}, and Σ_T[i] is the standard deviation of T_{i,m}.
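Applying the distance to every pair of subsequences gives the self-join distance matrix. The sketch below computes it directly from z-normalized windows rather than with MASS, which is simpler to read but slower on long series. The function name and the parameter `excl`, the half-width of the band of trivial matches set to inf, are hypothetical: the text states that trivial matches are set to inf but does not give the width of that band. The sketch also assumes no window is constant.

```python
import numpy as np

def self_join_distance_matrix(t, m, excl):
    """Pairwise z-normalized distances between all length-m windows of t."""
    t = np.asarray(t, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(t, m)
    # z-normalize each window (assumes nonzero standard deviation).
    z = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(axis=1, keepdims=True)
    corr = z @ z.T / m  # Pearson correlation of every window pair
    dist = np.sqrt(np.maximum(2.0 * m * (1.0 - corr), 0.0))
    # Mark trivial matches (windows starting fewer than excl positions apart) as inf.
    i, j = np.indices(dist.shape)
    dist[np.abs(i - j) < excl] = np.inf
    return dist
```

The resulting matrix can be thresholded with eps to obtain the adjacency matrix of the subsequence similarity graph.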

Table 2 SlidingDotProducts algorithm

(2) Constructing the subsequence similarity graph

A similarity threshold is defined; distance values below the threshold are represented by 1 and all other distance values by 0, which converts the distance matrix into an adjacency matrix. The graph obtained from this adjacency matrix is called the subsequence similarity graph. The structure of the adjacency matrix and the conversion process are shown in Fig. 1. Suppose the left side of Fig. 1 is the generated distance matrix, in which the distance values of trivially matching subsequences are set to inf. With the similarity threshold eps = 6, an element is set to 1 when its distance value is less than eps and to 0 otherwise, including when it is inf, which finally yields the similarity adjacency matrix shown on the right side of Fig. 1.

(3) Finding the maximum clique
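The gradient-based maximum clique solver is described in the summary of the invention. As an independent cross-check on small similarity graphs, the cliques can also be enumerated exactly. The sketch below uses the classical Bron-Kerbosch algorithm, which is a different technique from the solver the invention uses and is shown only for illustration; the function name `maximal_cliques` is an assumption.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of a 0/1 adjacency matrix (Bron-Kerbosch)."""
    n = len(adj)
    nbrs = [{j for j in range(n) if adj[i][j] and j != i} for i in range(n)]
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & nbrs[v], x & nbrs[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(n)), set())
    return cliques
```

The largest clique returned, for example `max(maximal_cliques(adj), key=len)`, gives the vertex set whose subsequences would form the motif. Exact enumeration is exponential in the worst case, so it is only practical as a check on small graphs.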

(4) Evaluation of the time series motif discovery method based on subsequence full join and maximum clique.

In 2002, Lin et al. proposed the Brute Force algorithm. It uses a sliding window of length m to extract the subsequences of the time series and then takes each subsequence in turn as the query sequence for an ε range query. Each subsequence and its similar subsequences form a similar subset, and the similar subset containing the most subsequences is regarded as the motif of the time series. The algorithm has high time complexity, but it is widely used as the baseline for evaluating the accuracy of other algorithms. The pseudocode of the Brute Force algorithm is shown in Table 3.
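The range-query idea behind the baseline can be sketched as follows. This is an illustrative reading of the description above, not the pseudocode of Table 3: it assumes a precomputed distance matrix in which trivial matches are already inf, uses a strict "less than" comparison to match the thresholding rule of step (2), and uses the hypothetical name `brute_force_motif`.

```python
import numpy as np

def brute_force_motif(dist, eps):
    """Return the query index with the most matches within eps, and those matches."""
    dist = np.asarray(dist, dtype=float)
    best_center, best_members = -1, []
    for i in range(dist.shape[0]):
        # Range query: every other subsequence closer than eps to subsequence i.
        members = [j for j in range(dist.shape[1]) if j != i and dist[i, j] < eps]
        if len(members) > len(best_members):
            best_center, best_members = i, members
    return best_center, best_members
```

The returned center together with its members forms the largest similar subset, which the baseline treats as the motif.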

Table 3 Brute Force algorithm

Example:

To verify the effect of the invention, the experiments use the public datasets EEG, EOG, ECG and Insect Behavior as experimental data and compare the invention with existing algorithms in four respects: accuracy, efficiency, scalability and robustness. The algorithms compared are the Brute Force (BF) algorithm and the Random Projection (RP) algorithm.

1) The motifs found by the time series motif discovery algorithm based on subsequence full join and maximum clique are compared, in content and in number, with those found by the Brute Force algorithm to verify the accuracy of the proposed algorithm.

2) To assess the efficiency of the TSSJMC algorithm, the experiments compare its running time (in seconds) for motif discovery on the four datasets with that of the Brute Force and Random Projection algorithms.

3) For scalability, the comparison with the Brute Force and Random Projection algorithms has two parts. The first part uses the EEG, EOG, ECG and Insect Behavior datasets with motif lengths of 64, 128, 256, 512 and 1024 and compares the running time the three algorithms need to find motifs. The second part uses the EEG and EOG datasets with a fixed motif length, increases the dataset length in steps of 1000, and compares the running time the three algorithms need to find motifs.

4) For robustness, the experiments use the Insect Behavior dataset with a fixed motif length and add additive Gaussian noise to the dataset with the awgn function in MATLAB: x = awgn(data, db), where db is the signal-to-noise ratio after the noise is added; the larger db is, the less noise the data contains. Over the signal-to-noise range of 0-80 dB, the number of motifs each algorithm can find and the running time it needs are compared.
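For readers without MATLAB, the noise step can be approximated in Python. The sketch below measures the signal power before scaling the noise, which corresponds to calling awgn with its 'measured' option; by default awgn assumes a signal power of 0 dBW, and the text does not say which behavior the experiments relied on. The function name `add_gaussian_noise` and the optional `rng` argument are assumptions.

```python
import numpy as np

def add_gaussian_noise(x, snr_db, rng=None):
    """Add white Gaussian noise so the result has roughly the given SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    signal_power = np.mean(x ** 2)  # measured from the data itself
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
```

Sweeping `snr_db` from 0 to 80 in steps of 10 reproduces the shape of the robustness experiment: low values bury the signal in noise, high values leave it almost unchanged.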

1. Data preparation

This section uses the public datasets EEG, EOG, ECG and Insect Behavior as experimental data; the dataset information is shown in Table 4.

Table 4 Dataset information

No.	Dataset name	Length	Remarks
1	EEG	4000	Electroencephalogram dataset
2	EOG	4000	Electrooculogram dataset
3	ECG	3001	Electrocardiogram dataset
4	Insect Behavior	4000	Insect behavior dataset

2. Experimental analysis

1) Accuracy analysis

On the EEG, EOG, ECG and Insect Behavior datasets, the time series motif discovery algorithm based on subsequence full join and maximum clique (Time Series Subsequence Join and Maximum Clique, TSSJMC) is compared with the Brute Force algorithm to verify the accuracy of the proposed algorithm. Table 5 and Figs. 2-5 list the experimental results.

Table 5 Number of motif instances found by the two algorithms

The results in Table 5 show that for the EOG and ECG datasets, the number of motif instances found by the TSSJMC algorithm equals that of the Brute Force algorithm, while for the EEG and Insect Behavior datasets it is slightly lower. Analysis attributes this to the maximum clique algorithm: the algorithm used is very efficient, but the clique it finds is sometimes smaller. Figs. 2-5 show the specific motif instances found by each algorithm on each dataset. Inspection confirms that the motif instances found by TSSJMC are the same as those found by Brute Force, which shows that TSSJMC has high accuracy and can effectively find multiple motifs.

2) Efficiency analysis

To assess the efficiency of the TSSJMC algorithm, this section uses the four datasets described above and measures the running time (in seconds) that TSSJMC, the Brute Force algorithm and the more efficient Random Projection algorithm need to find motifs. The similarity threshold eps and the motif length m of the three algorithms are set as in the accuracy experiments. The parameters of the Random Projection algorithm are the number of PAA segments PAA_size, the number of SAX symbols α, and the number of iterations repeat. Fig. 6 shows the experimental results.

The comparison shows that the running time of TSSJMC is far below that of Brute Force and also slightly below that of Random Projection, which shows that TSSJMC is efficient. The results also expose the low efficiency of Brute Force, which is an exhaustive search. Combined with the conclusion of the accuracy experiments, although TSSJMC finds slightly fewer motif instances than Brute Force, it is far more efficient. TSSJMC is therefore an algorithm that balances quality and efficiency.

3) Scalability analysis

The experiments in this section have two parts. The first part uses the four datasets above and keeps increasing the subsequence length (motif length), taking motif lengths of 64, 128, 256, 512 and 1024 and running motif discovery for each value; the running times of the three algorithms are shown in Figs. 7-10. The second part uses the EEG (electroencephalogram dataset, 4000 data points) and EOG (electrooculogram dataset, 4000 data points) datasets with a fixed motif length, increases the dataset length in steps of 1000, and compares the running times of the three algorithms; the results are shown in Figs. 11 and 12.

Figs. 7-10 show that as the motif length increases, the running time of the Brute Force algorithm grows exponentially, while the running times of TSSJMC and Random Projection both stay far below it. In terms of growth rate, Random Projection grows more slowly than Brute Force, but as the motif length increases its growth rate gradually exceeds that of TSSJMC. The running time of TSSJMC stays at the level of the other two algorithms' best cases.

Figs. 11-12 show that as the dataset length increases, the running times of all three algorithms rise; Brute Force grows fastest, while Random Projection and TSSJMC grow more slowly. Comparing running times, TSSJMC outperforms Random Projection. Overall, TSSJMC therefore has strong scalability.

4) Robustness analysis

Based on the Insect Behavior dataset, the motif length is fixed at 250. Gaussian noise is added to the dataset with the awgn function in MATLAB, with the signal-to-noise ratio increased in steps of 10. The eps of the TSSJMC algorithm is set to 16; for the Random Projection algorithm, PAA_size is 30, α is 5 and repeat is 60. The robustness results are shown in Figs. 13 and 14.

Fig. 13 shows that adding noise does not affect the number of motifs found by the TSSJMC and Random Projection algorithms, which shows that both algorithms are robust. The comparison in Fig. 14, however, shows that after different levels of noise are added, the performance of TSSJMC remains stable and at a good level, while the performance of Random Projection changes markedly and is more seriously affected by noise. TSSJMC is therefore more robust than Random Projection. The reason is that Random Projection symbolizes the data with the SAX algorithm, and added noise degrades the performance of SAX.

Claims

1. A time series motif discovery method based on subsequence full join and maximum clique, characterized by comprising the following steps:

(1) Subsequence full join

A sliding window of length m and nested loops are used to compute the distances between all subsequences of the time series T;

(2) Constructing the subsequence similarity graph

A similarity threshold is defined; distance values below the threshold are represented by 1 and all other distance values by 0, which converts the distance matrix into an adjacency matrix; the graph obtained from this adjacency matrix is called the subsequence similarity graph;

(3) Finding the maximum clique

A maximum clique algorithm solves the maximum clique problem of the subsequence similarity graph; the vertices of the resulting maximum clique correspond to the instances of the time series motif.

2. The time series motif discovery method based on subsequence full join and maximum clique according to claim 1, characterized in that in step (1) the "ultra-fast" MASS algorithm is used to obtain the distance matrix; the algorithm is "ultra-fast" because it first applies the fast Fourier transform to the data, then performs an element-wise product, then applies the inverse Fourier transform to that product, and finally uses the result of the inverse Fourier transform to compute the z-normalized Euclidean distance, which yields the distance matrix.

3. The time series motif discovery method based on subsequence full join and maximum clique according to claim 2, characterized in that the Euclidean distance Dist[i] in step (1) is given by:

$$\mathrm{Dist}[i] = \sqrt{2m\left(1 - \frac{QT[i] - m\,\mu_Q\,M_T[i]}{m\,\sigma_Q\,\Sigma_T[i]}\right)}$$

where m is the subsequence length, μ_Q is the mean of the subsequence Q, σ_Q is the standard deviation of Q, M_T[i] is the mean of the subsequence T_{i,m}, and Σ_T[i] is the standard deviation of T_{i,m}.

4. The time series motif discovery method based on subsequence full join and maximum clique according to claim 1, characterized in that in the distance matrix the distance values of trivially matching subsequences are set to inf and the similarity threshold is eps = 6; an element is set to 1 when its distance value is less than eps and to 0 otherwise, including when it is inf, which finally yields the similarity adjacency matrix.

5. The time series motif discovery method based on subsequence full join and maximum clique according to claim 1, characterized in that in step (3) the maximum clique algorithm used proposes a new objective function:

$$\min_{u \ge 0} \; \lVert M_d - u u^{T} \rVert_F^2$$

where the parameter d ≥ 0, A is the adjacency matrix of the graph G, and u is a local minimizer of the function; the associated modified adjacency matrix is B = A + I_n, and M_d = (1+d)B - d·1_{n×n}, where I_n is the identity matrix of order n and 1_{n×n} is the n×n all-ones matrix; local minima of the objective function correspond to maximal cliques of G, and the global minimum corresponds to the maximum clique of G; gradient descent is used as the iterative algorithm, with the step size adjusted by the Armijo rule, to solve for the optimal solution of the objective function, and the optimal solution corresponds to the maximum clique of G.