CN106446081B

CN106446081B - Method for mining association relationships in time-series data based on variation consistency

Info

Publication number: CN106446081B
Application number: CN201610814069.7A
Authority: CN
Inventors: 王文青; 杨天社; 鲍军鹏; 张海龙; 吴冠; 李方正; 王超; 齐勇
Original assignee: Xian Jiaotong University; China Xian Satellite Control Center
Current assignee: Xian Jiaotong University; China Xian Satellite Control Center
Priority date: 2016-09-09
Filing date: 2016-09-09
Publication date: 2019-08-13
Anticipated expiration: 2036-09-09
Also published as: CN106446081A

Abstract

A method for mining association relationships in time-series data based on variation consistency. The time-series variables are first preprocessed. A wavelet transform is then applied to each single variable: the original time series is divided into several windows with a sliding window, a discrete wavelet transform is applied to each window, and the maximum wavelet detail coefficient is extracted. Next, WDC clustering is performed on the maximum wavelet detail coefficients of all windows of a single variable, with the aim of distinguishing the windows whose wavelet features differ from those of most windows; these windows correspond to the change points of the variable. Finally, CCP clustering is performed on the change points of all variables. Variables within the same cluster of the clustering result have similar change points, so they exhibit variation consistency and are considered to have a potential association relationship. By approaching the problem from the perspective of variation consistency between variables, the invention can find not only variables with a linear correlation but also variables with complex nonlinear association relationships, which is valuable for association analysis among the variables of large-scale complex systems.

Description

Method for mining association relationships in time-series data based on variation consistency

Technical field

The invention belongs to the fields of intelligent information processing and computer technology, and in particular relates to a method for mining association relationships in time-series data based on variation consistency.

Background technique

In large-scale complex systems, it is often necessary to detect the association relationships among multiple variables, which is of great significance for summarizing the operating rules of the system and for early warning. Complex association relationships may exist between the variables of a system, and they are usually governed by the system's internal rules. An association may manifest in space and time as a co-occurrence relationship, a causal relationship, a trend relationship, and so on. When one variable changes, it causes other variables to change correspondingly.

Summary of the invention

The purpose of the present invention is to provide a method for mining association relationships in time-series data based on variation consistency. The method combines wavelet transform theory, used to detect the change points of a single variable, with clustering theory, used to examine the similarity between the change point vectors of multiple variables, and thereby discovers potential association relationships between time-series variables.

To achieve the above purpose, the technical solution of the present invention is as follows:

A method for mining association relationships in time-series data based on variation consistency, in which the system implementing the method comprises a data preprocessing module, a feature extraction module, a WDC clustering module and a CCP clustering module. The specific steps are:

1) First, the data preprocessing module 1-1 performs outlier removal, equal-interval interpolation and normalization on the raw time-series data to obtain a valid data form of the time-series variables;

2) Second, the feature extraction module 1-2 applies a discrete wavelet transform to the data of each window of the valid data form of the time-series variables and extracts the maximum wavelet detail coefficient;

3) Then, the WDC clustering module 1-3 performs WDC clustering on the maximum wavelet detail coefficients of all windows of a single variable; the windows in clusters smaller than a threshold in the clustering result are change points;

4) Finally, the CCP clustering module 1-4 performs CCP clustering on the change point vectors of all variables; variables in the same cluster of the clustering result are associated, and the association relationships and their strengths for the variables within each cluster are output.

The data preprocessing module performs outlier removal, equal-interval interpolation and normalization on the raw time-series data through the following steps:

First, the mean and standard deviation of each observation window are calculated, and for each data point it is determined whether the difference between the point and the mean of its observation window is greater than 5 times the standard deviation of that window; if so, the data point is an outlier and is removed;

Then, equal-interval interpolation is applied to the time series after outlier removal. If the sampling interval is △t and the start time is T, the set of equally spaced times after interpolation is {T+n*△t | n=0,1,2,3,…}. The value at time T+i*△t is the value of the original observation nearest to and earlier than T+i*△t, that is, the observation at the moment immediately preceding the first moment in the original series that is greater than T+i*△t;

Finally, linear normalization is applied to the data after equal-interval interpolation. The time series is first scanned once to obtain the maximum value (max) and minimum value (min) of the observations; the normalized value of each observation is then calculated according to the formula x_i' = (x_i − min)/△, which maps the value range of the original time series onto the interval [0,1], where x_i denotes the value of the i-th observation and △ = max − min.
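The three preprocessing operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the non-overlapping observation windows of 50 points, and the use of the population standard deviation are hypothetical choices that the text does not specify.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def remove_outliers(times, values, window=50, k=5.0):
    """Drop points farther than k standard deviations from their window mean.

    Assumption: observation windows are consecutive, non-overlapping chunks.
    """
    kept_t, kept_v = [], []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        mu, sigma = mean(chunk), pstdev(chunk)
        for t, v in zip(times[start:start + window], chunk):
            if abs(v - mu) <= k * sigma:
                kept_t.append(t)
                kept_v.append(v)
    return kept_t, kept_v


def resample_previous(times, values, dt):
    """Resample onto the grid T + n*dt using the previous original observation.

    Each grid time takes the observation just before the first original
    timestamp that is greater than the grid time.
    """
    t0, t_end = times[0], times[-1]
    out, n = [], 0
    while t0 + n * dt <= t_end:
        idx = bisect_right(times, t0 + n * dt) - 1
        out.append(values[idx])
        n += 1
    return out


def min_max_normalize(values):
    """Map values onto [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    delta = hi - lo
    if delta == 0:  # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / delta for v in values]
```

With a window of N points, a single point can deviate from the window mean by at most sqrt(N − 1) population standard deviations, so the 5-sigma rule can only fire when the observation window holds at least 27 points.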

The feature extraction performed by the feature extraction module comprises the following steps. First, the data of a single variable is cut with a sliding window. If the sampling start point of the raw data is time t, the sampling interval is n seconds, the window size is m and the sliding distance is l, then the time span of the first window is [t, t+n*m]; the start time of the second window is the start time of the first window shifted backward by l, so the time span of the second window is [t+l, t+l+n*m]; and so on, giving N windows;

Second, a discrete wavelet transform is applied to the data in each window. The number of wavelet decomposition levels L is set according to the window size, and the maximum wavelet detail coefficient cD_i in the window is selected as the feature of the window; [i, cD_i] denotes the wavelet feature of the i-th window of the raw data.
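A minimal sketch of this feature extraction follows. Several points are assumptions made only for illustration: the Haar wavelet is used because the text does not name a wavelet basis, the window size and sliding distance are expressed in samples rather than seconds, and the "maximum" detail coefficient is taken as the largest absolute value over all decomposition levels. All function names are hypothetical.

```python
import math


def sliding_windows(series, size, step):
    """Cut a series into windows of `size` samples, advancing `step` samples."""
    return [series[s:s + size] for s in range(0, len(series) - size + 1, step)]


def haar_details(window, levels):
    """Return the Haar detail coefficients of each decomposition level."""
    approx = list(window)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break
        pairs = range(0, len(approx) - 1, 2)  # a trailing odd sample is dropped
        details.append([(approx[k] - approx[k + 1]) / math.sqrt(2) for k in pairs])
        approx = [(approx[k] + approx[k + 1]) / math.sqrt(2) for k in pairs]
    return details


def max_detail_features(series, size, step, levels):
    """Return [(i, cD_i)]: the largest |detail coefficient| of each window i.

    Assumes size >= 2 so that every window yields at least one coefficient.
    """
    feats = []
    for i, w in enumerate(sliding_windows(series, size, step)):
        coeffs = [abs(d) for level in haar_details(w, levels) for d in level]
        feats.append((i, max(coeffs)))
    return feats
```

For a series that is flat except for one step, only the window containing the step yields a nonzero feature, which is the property the subsequent WDC clustering relies on.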

The WDC clustering steps of the WDC clustering module are:

1) Cluster initialization: each window forms its own cluster, and the cluster center is the wavelet feature [i, cD_i] of the window itself. The number of windows is denoted m and the number of clusters is denoted n; at this point n = m;

2) The sum of squared errors SSE_n of the clustering result is calculated according to the following formula:

SSE_n = Σ_{i=1..n} Σ_{j=1..w} (cD_j − c_i)²

where n denotes the number of clusters; w denotes the number of windows in a cluster; j denotes the index of a window in cluster i; cD_j denotes the maximum wavelet detail coefficient of window j; c_i denotes the cluster center of cluster i;

3) The distance between the cluster centers of any two clusters is calculated according to the following formula:

dist(c_i, c_j) = |c_i − c_j|, i ≠ j

where dist(c_i, c_j) denotes the Manhattan distance between cluster i and cluster j; c_i and c_j denote the cluster centers of the two clusters;

4) The two closest clusters are merged and the cluster center is updated according to the following formula:

c = (1/w) Σ_{i=1..w} cD_i

where c denotes the cluster center; w denotes the number of windows in the cluster; cD_i denotes the maximum wavelet detail coefficient of window i;

5) n is decremented by 1;

6) Steps 2) to 5) are repeated until n = 1;

7) The clustering result at which SSE declines fastest is selected according to the following formula and denoted result = {c₁, c₂, …, c_k}, where k denotes the number of clusters in that layer of the clustering result;

where i denotes the clustering layer; m is the number of windows, i.e. the maximum number of clustering layers;

8) The distance between any two clusters in result is calculated, and the two closest clusters are selected and denoted c_i, c_j;

9) If dist(c_i, c_j) ≤ d, with d = 0.2, the two clusters are merged, the cluster center of the new cluster is calculated, and step 8) is repeated;

10) If dist(c_i, c_j) > d, the clustering process exits;

11) The windows contained in the smaller clusters of the clustering result are the change points of the parameter. A smaller cluster is a cluster for which the ratio of its number of windows to the total number of windows is less than the given threshold 0.2. The labels of all windows in the smaller clusters constitute the change point set of the parameter, i.e. cpv = {cp₁, cp₂, …, cp_m}, where cp_i is a window label.
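The eleven steps above can be sketched as a small bottom-up clustering routine over the per-window coefficients cD_i. Two points are assumptions rather than statements of the text: the clustering operates on the scalar cD_i only, and "the layer at which SSE declines fastest" is interpreted as the cluster count k that maximizes SSE_{k−1} − SSE_k. The function names are hypothetical.

```python
def centroid(cluster, feats):
    """Cluster center: mean of the cD values of the windows in the cluster."""
    return sum(feats[i] for i in cluster) / len(cluster)


def sse(clusters, feats):
    """Sum of squared errors of all windows to their cluster centers."""
    total = 0.0
    for c in clusters:
        mu = centroid(c, feats)
        total += sum((feats[i] - mu) ** 2 for i in c)
    return total


def closest_pair(clusters, feats):
    """Return (distance, a, b) for the two clusters with the nearest centers."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = abs(centroid(clusters[a], feats) - centroid(clusters[b], feats))
            if best is None or d < best[0]:
                best = (d, a, b)
    return best


def merge(clusters, a, b):
    """Return a new cluster list with clusters a and b merged."""
    merged = clusters[a] + clusters[b]
    return [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]


def wdc_change_points(feats, d=0.2, small_ratio=0.2):
    """Return the window labels that fall into the small clusters."""
    m = len(feats)
    clusters = [[i] for i in range(m)]            # step 1: one cluster per window
    layers = {m: (sse(clusters, feats), clusters)}
    while len(clusters) > 1:                      # steps 2-6: merge down to n = 1
        _, a, b = closest_pair(clusters, feats)
        clusters = merge(clusters, a, b)
        layers[len(clusters)] = (sse(clusters, feats), clusters)
    # Step 7 (assumed criterion): largest SSE drop when going from k-1 to k clusters.
    k = max(range(2, m + 1), key=lambda n: layers[n - 1][0] - layers[n][0]) if m > 1 else 1
    result = layers[k][1]
    while len(result) > 1:                        # steps 8-10: merge centers within d
        dist, a, b = closest_pair(result, feats)
        if dist > d:
            break
        result = merge(result, a, b)
    # Step 11: windows in clusters holding less than small_ratio of all windows.
    return sorted(i for c in result if len(c) / m < small_ratio for i in c)
```

Recomputing the centers inside `closest_pair` keeps the sketch short at the cost of cubic running time; an implementation for long series would cache the centers or sort the one-dimensional features first.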

The CCP clustering steps of the CCP clustering module comprise:

1) Each single variable forms its own cluster. Given n variables, the number of clusters is denoted k, so k = n;

2) The variation consistency coefficient CoC of any two clusters is calculated according to the following formula:

CoC(c) = (2/(z(z−1))) Σ CoC(x, y), summed over all unordered pairs {x, y} of variables in c

where CoC(c) denotes the variation consistency coefficient of cluster c (the new cluster obtained by merging c_i and c_j); x and y are any two variables in cluster c; z is the number of variables in the cluster. There are z(z−1)/2 combinations of any two variables, and the variation consistency coefficient of a cluster equals the average of the variation consistency coefficients of all pairs of variables in the cluster.

where CoC(x, y) denotes the variation consistency coefficient of two variables x and y; |cpv_x| denotes the number of change points of variable x, i.e. the size of its change point set; |cpv_y| denotes the number of change points of variable y; |cpv_xy| denotes the number of change points common to variables x and y;

cpv_xy = cpv_x ∩ cpv_y

where cpv_x and cpv_y denote the change point sets of variables x and y respectively;

3) The two clusters c_i, c_j with the strongest variation consistency are selected, and the variation consistency coefficient between them is denoted max_CoC;

4) If max_CoC is greater than or equal to the given threshold 0.8, clusters c_i and c_j are merged, k is decremented by 1, and the process returns to step 2);

5) If max_CoC is less than the given threshold, the clustering process exits. In the final clustering result, variables in the same cluster have an association relationship, and the strength of the association between them is the variation consistency coefficient CoC of the corresponding cluster.
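The CCP steps can be illustrated with the following sketch. The cluster-level coefficient follows the description above (the average over all variable pairs in the merged cluster). The pairwise coefficient used here, 2|cpv_xy| / (|cpv_x| + |cpv_y|), is an assumption chosen only because it uses the three quantities named above; it should not be read as the patented formula. Function and variable names are hypothetical.

```python
from itertools import combinations


def coc_pair(cpv_x, cpv_y):
    """Assumed pairwise coefficient: 2*|common| / (|cpv_x| + |cpv_y|)."""
    if not cpv_x and not cpv_y:
        return 0.0
    common = len(cpv_x & cpv_y)  # |cpv_xy| = |cpv_x ∩ cpv_y|
    return 2.0 * common / (len(cpv_x) + len(cpv_y))


def coc_cluster(names, cpvs):
    """Average pairwise coefficient over all z(z-1)/2 variable pairs (z >= 2)."""
    pairs = list(combinations(names, 2))
    return sum(coc_pair(cpvs[x], cpvs[y]) for x, y in pairs) / len(pairs)


def ccp_cluster(cpvs, threshold=0.8):
    """Greedily merge variable clusters while the best merged CoC >= threshold.

    cpvs maps a variable name to its set of change point window labels.
    Returns a list of (cluster, CoC) pairs; CoC is None for singletons.
    """
    clusters = [[name] for name in cpvs]          # step 1: one cluster per variable
    while len(clusters) > 1:
        # Steps 2-3: score every candidate merge and keep the strongest one.
        max_coc, a, b = max(
            ((coc_cluster(a + b, cpvs), a, b) for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if max_coc < threshold:                   # step 5: stop merging
            break
        # Step 4: merge the two clusters, reducing the cluster count by one.
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return [(c, coc_cluster(c, cpvs) if len(c) > 1 else None) for c in clusters]
```

For example, with `{"a": {3, 7, 12}, "b": {3, 7, 12}, "c": {3, 7, 12, 20}, "d": {1, 5}}` the variables a, b and c end up in one cluster and d stays alone, because d shares no change point with the others.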

Variation consistency means that several time-series variables always change at nearly the same moments. That is, if over a long period multiple variables almost always either change together or remain unchanged together, these variables have a potential association relationship. Based on the variation consistency of variables, the present invention mines subsets of associated variables from a large set of variables. Compared with the prior art, the invention has the following advantages. The invention examines the association relationships among multiple variables from the perspective of variation consistency, and these relationships may be nonlinear, such as exponential, logarithmic or polynomial functional relationships. The invention focuses on the association that variables exhibit when they change, whereas general association rule mining methods only mine frequent patterns under normal conditions. Compared with the traditional association rule mining methods Apriori and FP-Tree, the invention is suitable for association analysis over a large number of variables and for discovering potential associations between parameters.

Brief description of the drawings

Fig. 1 is a block diagram of the modules of the system of the present invention.

Fig. 2 is a flowchart of the WDC clustering module of the present invention.

Fig. 3 is a flowchart of the CCP clustering module of the present invention.

Table 1 lists the data simulation functions of the example time-series variables of the present invention.

Fig. 4 shows segments of the simulated data plots of some of the example time-series variables of the present invention.

Table 2 shows the association mining results of the CCP clustering module for the example time-series variables.

Specific embodiment

The invention is described in further detail below with reference to the accompanying drawings and embodiments.

Referring to Fig. 1, the system implementing the present invention comprises a data preprocessing module 1-1, a feature extraction module 1-2, a WDC clustering module 1-3 and a CCP clustering module 1-4. The specific technical solution of the present invention is:

Step 1: The data preprocessing module 1-1 performs outlier removal, equal-interval interpolation and normalization on the raw time-series data to obtain a valid data form of the time-series variables;

Finally, linear normalization is applied to the data after equal-interval interpolation. The time series is first scanned once to obtain the maximum value (max) and minimum value (min) of the observations; the normalized value of each observation is then calculated according to the formula x_i' = (x_i − min)/△, which maps the value range of the original time series onto the interval [0,1], where x_i denotes the value of the i-th observation and △ = max − min;

Step 2: Second, the feature extraction module 1-2 applies a discrete wavelet transform to the data of each window of the valid data form of the time-series variables and extracts the maximum wavelet detail coefficient;

First, the data of a single variable is cut with a sliding window. If the sampling start point of the raw data is time t, the sampling interval is n seconds, the window size is m and the sliding distance is l, then the time span of the first window is [t, t+n*m]; the start time of the second window is the start time of the first window shifted backward by l, so the time span of the second window is [t+l, t+l+n*m]; and so on, giving N windows;

Second, a discrete wavelet transform is applied to the data in each window. The number of wavelet decomposition levels L is set according to the window size, and the maximum wavelet detail coefficient cD_i in the window is selected as the feature of the window; [i, cD_i] denotes the wavelet feature of the i-th window of the raw data;

Step 3: Referring to Fig. 2, the WDC (Wavelet Detail Coefficient) clustering module 1-3 then performs WDC clustering on the maximum wavelet detail coefficients of all windows of a single variable; the windows in clusters smaller than a threshold in the clustering result are change points;

1) Step 2-1 is performed first: cluster initialization. Each window forms its own cluster, and the cluster center is the wavelet feature cD_i of the window. The number of windows is denoted m and the number of clusters is denoted n; at this point n = m;

2) Step 2-2 is then performed: the sum of squared errors SSE_n (Sum of Squared Error) of the clustering result is calculated according to the following formula;

3) Step 2-3 is performed: the distance between the cluster centers of any two clusters is calculated according to the following formula;

dist(c_i, c_j) = |c_i − c_j|, i ≠ j

4) Step 2-4 is performed: the two closest clusters are merged and the cluster center is updated according to the following formula;

5) Step 2-5 is performed: n is decremented by 1;

6) Step 2-6 is performed: steps 2) to 5) are repeated until n = 1;

7) Step 2-7 is performed: the clustering result at which SSE declines fastest is selected according to the following formula and denoted result = {c₁, c₂, …, c_k}, where k denotes the number of clusters in that layer of the clustering result;

8) Step 2-8 is performed: the distance between any two clusters in result is calculated, and the two closest clusters are selected and denoted c_i, c_j;

9) Step 2-9 is performed: if dist(c_i, c_j) ≤ d, with d = 0.2, the two clusters are merged, the cluster center of the new cluster is calculated, and step 8) is repeated;

10) Step 2-10 is performed: if dist(c_i, c_j) > d, the clustering process exits;

Step 4: Referring to Fig. 3, the CCP (Clustering based on Change Point) clustering module 1-4 finally performs CCP clustering on the change point vectors of all variables; variables in the same cluster of the clustering result are associated, and the association relationships and their strengths for the variables within each cluster are output;

1) Step 3-1 is performed first: each single variable forms its own cluster. Given n variables, the number of clusters is denoted k, so k = n;

2) Step 3-2 is performed: the variation consistency coefficient CoC of any two clusters is calculated according to the following formula:

where CoC(x, y) denotes the variation consistency coefficient of two variables x and y; |cpv_x| denotes the number of change points of variable x (i.e. the size of its change point set); |cpv_y| denotes the number of change points of variable y; |cpv_xy| denotes the number of change points common to variables x and y;

cpv_xy = cpv_x ∩ cpv_y

3) Step 3-3 is performed: the two clusters c_i, c_j with the strongest variation consistency are selected, and the variation consistency coefficient between them is denoted max_CoC;

4) Step 3-4 is performed: if max_CoC is greater than or equal to the given threshold 0.8, clusters c_i and c_j are merged, k is decremented by 1, and the process returns to step 2);

5) Step 3-5 is performed: if max_CoC is less than the given threshold, the clustering process exits. In the final clustering result, variables in the same cluster have an association relationship, and the strength of the association between them is the variation consistency coefficient CoC of the corresponding cluster.

Referring to Table 1, which lists the simulation functions of the example time-series variables, 20 days of data were simulated for each variable according to its simulation function, with a sampling interval of 20 minutes. There are three groups of correlated variables, each containing 11 variables: the group A variables are related to g₁(x), the group B variables to g₂(x), and the group C variables to g₃(x). The formulas are as follows:

Table 1
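As a quick check on the size of the simulated data set, the figures stated above (20 days, one sample every 20 minutes, three groups of 11 variables) give the following counts. The variable names in this sketch are illustrative only.

```python
DAYS = 20
SAMPLING_INTERVAL_MIN = 20
GROUPS = 3
VARIABLES_PER_GROUP = 11

# Samples per variable: minutes in 20 days divided by the sampling interval.
samples_per_variable = DAYS * 24 * 60 // SAMPLING_INTERVAL_MIN
variable_count = GROUPS * VARIABLES_PER_GROUP
total_samples = samples_per_variable * variable_count

print(samples_per_variable, variable_count, total_samples)  # 1440 33 47520
```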

Referring to Fig. 4, which shows segments of the simulated data plots of some of the example time-series variables: the parts marked with yellow and white bars in the figure indicate windows, and "cDi" denotes the maximum wavelet detail coefficient of the i-th window.

Referring to Table 2, which shows the association mining results of the CCP clustering module for the example time-series variables: variables in the same cluster are considered to have an association relationship, and the strength of the association between them is the variation consistency coefficient CoC of the corresponding cluster.

Table 2

Claims

1. A method for mining association relationships in time-series data based on variation consistency, characterized in that the system implementing the method comprises a data preprocessing module (1-1), a feature extraction module (1-2), a WDC clustering module (1-3) and a CCP clustering module (1-4), and the specific steps are:

1) First, the data preprocessing module (1-1) performs outlier removal, equal-interval interpolation and normalization on the raw time-series data to obtain a valid data form of the time-series variables;

2) Second, the feature extraction module (1-2) applies a discrete wavelet transform to the data of each window of the valid data form of the time-series variables and extracts the maximum wavelet detail coefficient;

3) Then, the WDC clustering module (1-3) performs WDC clustering on the maximum wavelet detail coefficients of all windows of a single variable; the windows in clusters smaller than a threshold in the clustering result are change points;

4) Finally, the CCP clustering module (1-4) performs CCP clustering on the change point vectors of all variables; variables in the same cluster of the clustering result are associated, and the association relationships and their strengths for the variables within each cluster are output.

2. The method for mining association relationships in time-series data based on variation consistency according to claim 1, characterized in that the data preprocessing module (1-1) performs outlier removal, equal-interval interpolation and normalization on the raw time-series data through the following steps:

First, the mean and standard deviation of each observation window are calculated, and for each data point it is determined whether the difference between the point and the mean of its observation window is greater than 5 times the standard deviation of that window; if so, the data point is an outlier and is removed;

Then, equal-interval interpolation is applied to the time series after outlier removal. If the sampling interval is Δt and the start time is T, the set of equally spaced times after interpolation is {T+n*Δt | n=0,1,2,3,…}. The value at time T+i*Δt is the value of the original observation nearest to and earlier than T+i*Δt, that is, the observation at the moment immediately preceding the first moment in the original series that is greater than T+i*Δt;

Finally, linear normalization is applied to the data after equal-interval interpolation. The time series is first scanned once to obtain the maximum value (max) and minimum value (min) of the observations; the normalized value of each observation is then calculated according to the formula x_i' = (x_i − min)/Δ, which maps the value range of the original time series onto the interval [0,1], where x_i denotes the value of the i-th observation and Δ = max − min.

3. The method for mining association relationships in time-series data based on variation consistency according to claim 1, characterized in that the feature extraction performed by the feature extraction module (1-2) comprises: first, the data of a single variable is cut with a sliding window; if the sampling start point of the raw data is time t, the sampling interval is n seconds, the window size is m and the sliding distance is l, then the time span of the first window is [t, t+n*m]; the start time of the second window is the start time of the first window shifted backward by l, so the time span of the second window is [t+l, t+l+n*m]; and so on, giving N windows;

Second, a discrete wavelet transform is applied to the data in each window. The number of wavelet decomposition levels L is set according to the window size, and the maximum wavelet detail coefficient cD_i in the window is selected as the feature of the window; [i, cD_i] denotes the wavelet feature of the i-th window of the raw data.

4. The method for mining association relationships in time-series data based on variation consistency according to claim 3, characterized in that the WDC clustering steps of the WDC clustering module (1-3) comprise:

where n denotes the number of clusters; w denotes the number of windows in a cluster; j denotes the index of a window in cluster i; c_i denotes the cluster center of cluster i;

dist(c_i, c_j) = |c_i − c_j|, i ≠ j

5) n is decremented by 1;

6) Steps 2) to 5) are repeated until n = 1;

7) The clustering result at which SSE declines fastest is selected according to the following formula and denoted result = {c₁, c₂, …, c_k}, where k denotes the number of clusters of the clustering result;

9) If dist(c_i, c_j) ≤ d, with d = 0.2, the two clusters are merged, the cluster center of the new cluster is calculated, and step 8) is repeated;

10) If dist(c_i, c_j) > d, the clustering process exits;

11) The windows contained in the smaller clusters of the clustering result are the change points of the parameter. A smaller cluster is a cluster for which the ratio of its number of windows to the total number of windows is less than the given threshold 0.2. The labels of all windows in the smaller clusters constitute the change point set of the parameter, i.e. cpv = {cp₁, cp₂, …, cp_m}, where cp_i is a window label.

5. The method for mining association relationships in time-series data based on variation consistency according to claim 1, characterized in that the CCP clustering steps of the CCP clustering module (1-4) comprise:

where CoC(c) denotes the variation consistency coefficient of cluster c (the new cluster obtained by merging c_i and c_j); x and y are any two variables in cluster c; z is the number of variables in the cluster; there are z(z−1)/2 combinations of any two variables, and the variation consistency coefficient of a cluster equals the average of the variation consistency coefficients of all pairs of variables in the cluster:

where CoC(x, y) denotes the variation consistency coefficient of two variables x and y; |cpv_x| denotes the number of change points of variable x, i.e. the size of its change point set; |cpv_y| denotes the number of change points of variable y; |cpv_xy| denotes the number of change points common to variables x and y;

cpv_xy = cpv_x ∩ cpv_y

3) The two clusters c_i, c_j with the strongest variation consistency are selected, and the variation consistency coefficient between them is denoted max_CoC;

5) If max_CoC is less than the given threshold, the clustering process exits. In the final clustering result, variables in the same cluster have an association relationship, and the strength of the association between them is the variation consistency coefficient CoC of the corresponding cluster.