CN103886195B

CN103886195B - Time series similarity measurement method for missing data

Info

Publication number: CN103886195B
Application number: CN201410095671.0A
Authority: CN
Inventors: 祁宏生; 王殿海; 许骏; 叶盈; 韦薇; 郑正非; 蔡正义
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-03-14
Filing date: 2014-03-14
Publication date: 2015-08-26
Anticipated expiration: 2034-03-14
Also published as: CN103886195A

Abstract

The invention discloses a time series similarity measurement method that tolerates missing data. The method extracts data pairs from the two original time series, two points at a time, classifies each pair into one of five cases according to which values are missing, and computes a first-order similarity interval for each case. Pairs of first-order similarity intervals are then compared to compute second-order similarities, which form a second-order similarity vector. Finally, the second-order similarity vector is averaged to give the similarity of the two time series. The method applies to a variety of scenarios, is simple, and places no requirement on data completeness.

Description

Time series similarity measurement method for missing data

Technical field

The present invention relates to a time series similarity calculation method in computer information processing, and specifically to a method for calculating the similarity between two time series when one or more data values are missing and the data are physically constrained to the range [0, upper limit].

Background art

Time series are abundant in human society and in nature, for example financial time series, traffic time series and temperature time series. Time series similarity measures can identify similar time series within a field and thereby provide valuable data for analysing physical and social phenomena. Existing time series similarity methods mainly address the case without missing data. When data are missing, the gaps are filled by mean substitution, trend extrapolation, exponential smoothing and the like, but such imputation requires prior knowledge, so the accuracy of the similarity computed after imputation is hard to guarantee. Moreover, in some cases missing data should not be read simply as a lack of information; the gaps themselves can reflect characteristics of the data. It is therefore necessary to establish a time series similarity measurement method for the missing-data case.

Summary of the invention

To overcome the inability of existing time series measures to handle missing data, the present invention proposes a method that can calculate time series similarity under any pattern of missing values. The method places no requirement on the completeness of the data.

The technical solution adopted by the present invention is as follows. For two time series:

1) Extract data pairs from the two time series, two points at a time.

2) Classify each data pair into one of five cases according to which values are missing, and calculate its first-order similarity interval for that case.

3) Calculate the similarity between each pair of the resulting similarity intervals to obtain the second-order similarity vector.

4) Average the second-order similarity vector to obtain the final similarity of the two time series.

Beneficial effects of the present invention: because most time series in nature are subject to some constraint (for example, a speed is greater than 0 and below the speed limit of the road section), the method applies to a variety of scenarios; it is simple and places no requirement on data completeness.

Brief description of the drawings

Fig. 1 is a schematic diagram of the similarity calculation for two two-dimensional vectors containing a missing value.

Detailed description

The present invention is described in further detail below.

Consider two time series X_{i} = (x_{i1}, x_{i2}, \ldots) and X_{j} = (x_{j1}, x_{j2}, \ldots), both of length N. Each value of a time series has upper limit \bar{x} and lower limit 0. The similarity is calculated as follows:

1) Extract data pairs from the two time series, two points at a time. Taking the m-th and n-th data points of each series gives x_{jm}, x_{jn} and x_{im}, x_{in}; there are \frac{N (N - 1)}{2} such pairs in total. Each data value is constrained to 0 \le x \le \bar{x}.
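The pair extraction in step 1) can be sketched as follows. This is a minimal illustration, assuming the pairs are unordered index pairs with m < n (which is what the later count of second-order similarities implies) and using zero-based indices; the function name `index_pairs` is hypothetical and does not come from the patent.

```python
from itertools import combinations


def index_pairs(length):
    """Return every unordered index pair (m, n) with m < n.

    Assumption: indices are zero-based and each pair is counted once.
    """
    return list(combinations(range(length), 2))


pairs = index_pairs(4)
print(pairs)       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(pairs))  # 6, i.e. 4 * 3 / 2
```

For each index pair (m, n), the two-dimensional vectors (x_{im}, x_{in}) and (x_{jm}, x_{jn}) are then read from the two series.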

2) For each pair \{x_{im}, x_{in}\} and \{x_{jm}, x_{jn}\}, calculate a similarity interval, referred to as the first-order similarity, according to the following five cases:

(1) If no data are missing, calculate:

s_{mn}^{'} (\{x_{im}, x_{in}\}, \{x_{jm}, x_{jn}\}) = \frac{x_{im} x_{jm} + x_{in} x_{jn}}{\sqrt{{(x_{jm})}^{2} + {(x_{jn})}^{2}} \sqrt{{(x_{im})}^{2} + {(x_{in})}^{2}}}

The final similarity interval of the data pair is:

s_{mn} \in [s_{mn}^{'} (\{x_{im}, x_{in}\}, \{x_{jm}, x_{jn}\}), s_{mn}^{'} (\{x_{im}, x_{in}\}, \{x_{jm}, x_{jn}\})]

(2) If all data are missing, i.e. \{x_{im}, x_{in}\} = \{NaN, NaN\} and \{x_{jm}, x_{jn}\} = \{NaN, NaN\}, then:

s_{mn} \in [1, 1]

(3) If only one value is missing, assume without loss of generality that x_{jn} = NaN. By the cosine similarity formula, the similarity of two two-dimensional vectors equals the cosine of the angle between them in the plane, as shown in Fig. 1. When x_{jn} is missing it is still bounded above and below, so the angle between the two vectors has a maximum and a minimum, and the similarity is again an interval:

s_{mn} \in [\min (1, \cos (\Theta_{1}), \cos (\Theta_{2})), \max (1, \cos (\Theta_{1}), \cos (\Theta_{2}))]

where \Theta_{1} is the angle obtained with x_{jn} at its lower limit 0 and \Theta_{2} the angle obtained with x_{jn} at its upper limit \bar{x}:

\cos (\Theta_{1}) = \frac{x_{im}}{\sqrt{{(x_{im})}^{2} + {(x_{in})}^{2}}}, \cos (\Theta_{2}) = \frac{x_{im} x_{jm} + x_{in} \bar{x}}{\sqrt{{(x_{jm})}^{2} + {(\bar{x})}^{2}} \sqrt{{(x_{im})}^{2} + {(x_{in})}^{2}}};

(4) If each of the two data pairs has one missing value in the same position, in the form \{x_{im}, x_{in}\} = \{x_{im}, NaN\} and \{x_{jm}, x_{jn}\} = \{x_{jm}, NaN\}, then similarly the similarity is an interval:

s_{mn} \in [0, \max (\frac{x_{im}}{\sqrt{{(x_{im})}^{2} + {(\bar{x})}^{2}}}, \frac{x_{jm}}{\sqrt{{(x_{jm})}^{2} + {(\bar{x})}^{2}}})]

(5) If each of the two data pairs has one missing value in different positions, in the form \{x_{im}, x_{in}\} = \{x_{im}, NaN\} and \{x_{jm}, x_{jn}\} = \{NaN, x_{jn}\}; or if the two data pairs have three missing values in total, in the form \{x_{im}, x_{in}\} = \{x_{im}, NaN\} and \{x_{jm}, x_{jn}\} = \{NaN, NaN\}, then similarly the similarity interval is:

s_{mn} \in [0, 1]
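The five cases can be collected into a single function, sketched below under several stated assumptions. Missing values are represented as `float("nan")`; the "without loss of generality" wording is implemented by swapping the two pairs or the two components; and the one pattern the five cases do not list (one pair complete, the other entirely missing) is mapped to [0, 1], which is an assumption of this sketch rather than something the text states. Zero-length vectors are not handled. The name `first_order_interval` is hypothetical.

```python
import math


def first_order_interval(pi, pj, x_max):
    """First-order similarity interval for pi = (x_im, x_in), pj = (x_jm, x_jn).

    x_max is the upper limit of the data; missing values are float('nan').
    """
    miss_i = [math.isnan(v) for v in pi]
    miss_j = [math.isnan(v) for v in pj]
    n_missing = sum(miss_i) + sum(miss_j)

    if n_missing == 0:  # case (1): plain cosine similarity, degenerate interval
        dot = pi[0] * pj[0] + pi[1] * pj[1]
        s = dot / (math.hypot(*pi) * math.hypot(*pj))
        return (s, s)

    if n_missing == 4:  # case (2): everything missing
        return (1.0, 1.0)

    if n_missing == 1:  # case (3): exactly one value missing
        # Arrange the data so the complete pair is `full` and the missing
        # value is the second component of `part`.
        full, part = (pi, pj) if any(miss_j) else (pj, pi)
        part_missing = miss_j if any(miss_j) else miss_i
        if part_missing[0]:
            full, part = full[::-1], part[::-1]
        a, b = full
        c = part[0]
        norm = math.hypot(a, b)
        cos1 = a / norm  # missing value at its lower limit 0
        cos2 = (a * c + b * x_max) / (math.hypot(c, x_max) * norm)  # at x_max
        candidates = (1.0, cos1, cos2)
        return (min(candidates), max(candidates))

    if n_missing == 2 and miss_i == miss_j:  # case (4): same position missing
        k = 1 if miss_i[0] else 0  # index of the known component
        upper = max(pi[k] / math.hypot(pi[k], x_max),
                    pj[k] / math.hypot(pj[k], x_max))
        return (0.0, upper)

    # Case (5), plus the unlisted pattern (assumption of this sketch).
    return (0.0, 1.0)


nan = float("nan")
print(first_order_interval((3.0, 4.0), (3.0, nan), x_max=4.0))
```

Because the candidate set in case (3) contains 1, the upper end of that interval as written is always 1; only the lower end depends on the data.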

3) For the \frac{N (N - 1)}{2} similarity intervals (each interval s_{mn} is written uniformly as [s_{mn}^{1}, s_{mn}^{2}], where s_{mn}^{1} is the start of the interval and s_{mn}^{2} is its end), calculate the similarity of each pair of intervals in turn. Because both endpoints of every interval are known at this stage, each such similarity is a scalar; it is referred to as the second-order similarity. Suppose a pair of similarity intervals is [s_{mn}^{1}, s_{mn}^{2}] and [s_{kj}^{1}, s_{kj}^{2}]; their similarity s_{mnkj} is calculated as:

s_{mnkj} = \frac{s_{mn}^{1} s_{kj}^{1} + s_{mn}^{2} s_{kj}^{2}}{\sqrt{{(s_{mn}^{1})}^{2} + {(s_{mn}^{2})}^{2}} \sqrt{{(s_{kj}^{1})}^{2} + {(s_{kj}^{2})}^{2}}}, \forall m \neq n, k \neq j
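The second-order similarity of two intervals can be sketched as below, treating each interval as a two-dimensional vector (start, end). The sketch assumes the standard cosine normalisation, in which each interval is divided by its own norm, and it does not handle an interval equal to (0, 0), which would make the denominator zero. The name `second_order` is hypothetical.

```python
import math


def second_order(interval_a, interval_b):
    """Cosine similarity of two intervals viewed as vectors (start, end).

    Assumption: neither interval is (0, 0); that case is left unhandled.
    """
    dot = interval_a[0] * interval_b[0] + interval_a[1] * interval_b[1]
    return dot / (math.hypot(*interval_a) * math.hypot(*interval_b))


# A fully known pair (degenerate interval) against a fully uncertain one.
print(second_order((1.0, 1.0), (0.0, 1.0)))
```

Since all interval endpoints are non-negative, the result lies between 0 and 1.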

The number of second-order similarities s_{mnkj} is

C_{\frac{N (N - 1)}{2}}^{2} = \frac{\frac{N (N - 1)}{2} (\frac{N (N - 1)}{2} - 1)}{2} = \frac{N^{4} - 2 N^{3} - N^{2} + 2 N}{8}

4) Average the second-order similarity vector. The final similarity s (X_{i}, X_{j}) of the two time series is:

s (X_{i}, X_{j}) = \frac{\sum_{m \neq n, k \neq j} s_{mnkj}}{\frac{N^{4} - 2 N^{3} - N^{2} + 2 N}{8}} = \frac{8 \sum_{m \neq n, k \neq j} s_{mnkj}}{N^{4} - 2 N^{3} - N^{2} + 2 N}

This yields the similarity of two time series with missing data.
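The final averaging step can be sketched as follows, starting from a list of first-order intervals that is assumed to have been computed already. The interval values in the example are hypothetical, chosen only for illustration, and the function names are not from the patent. At least two intervals (a series length of at least 3) are needed, otherwise there is nothing to average.

```python
import math
from itertools import combinations


def second_order(interval_a, interval_b):
    """Cosine similarity of two intervals viewed as vectors (start, end)."""
    dot = interval_a[0] * interval_b[0] + interval_a[1] * interval_b[1]
    return dot / (math.hypot(*interval_a) * math.hypot(*interval_b))


def average_second_order(intervals):
    """Average the second-order similarity over every pair of intervals.

    Returns (similarity, number_of_second_order_terms).
    """
    scores = [second_order(a, b) for a, b in combinations(intervals, 2)]
    return sum(scores) / len(scores), len(scores)


# Hypothetical first-order intervals for a series of length N = 3,
# which has 3 index pairs and therefore 3 intervals.
intervals = [(1.0, 1.0), (0.6, 1.0), (0.0, 1.0)]
similarity, count = average_second_order(intervals)
print(count)       # 3
print(similarity)
```

The count returned here equals the number of ways to choose two intervals out of N(N-1)/2, which is the denominator of the averaging formula.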

Claims

1. A time series similarity measurement method for missing data, characterized in that:

Consider two time series X_{i} = (x_{i1}, x_{i2}, \ldots) and X_{j} = (x_{j1}, x_{j2}, \ldots), both of length N; missing data are expressed as NaN; each value of a time series has upper limit \bar{x} and lower limit 0; and the similarity is calculated as follows:

1) Extract data pairs from the two time series, two points at a time; taking the m-th and n-th data points of each series gives x_{jm}, x_{jn} and x_{im}, x_{in}, with \frac{N (N - 1)}{2} such pairs in total; each data value is constrained to 0 \le x \le \bar{x};

2) For each pair \{x_{im}, x_{in}\} and \{x_{jm}, x_{jn}\}, calculate a similarity interval, referred to as the first-order similarity, according to the following five cases:

(1) If no data are missing, calculate:

s_{mn}^{'} (\{x_{im}, x_{in}\}, \{x_{jm}, x_{jn}\}) = \frac{x_{im} x_{jm} + x_{in} x_{jn}}{\sqrt{{(x_{jm})}^{2} + {(x_{jn})}^{2}} \sqrt{{(x_{im})}^{2} + {(x_{in})}^{2}}};

The final similarity interval of the data pair is:

s_{mn} \in [s_{mn}^{'} (\{x_{im}, x_{in}\}, \{x_{jm}, x_{jn}\}), s_{mn}^{'} (\{x_{im}, x_{in}\}, \{x_{jm}, x_{jn}\})];

(2) If all data are missing, i.e. \{x_{im}, x_{in}\} = \{NaN, NaN\} and \{x_{jm}, x_{jn}\} = \{NaN, NaN\}, then:

s_{mn} \in [1, 1];

(3) If only one value is missing, assume without loss of generality that x_{jn} = NaN; by the cosine similarity formula, the similarity of two two-dimensional vectors equals the cosine of the angle between them in the plane; when x_{jn} is missing it is still bounded above and below, so the angle between the two vectors has a maximum and a minimum, and the similarity is again an interval:

s_{mn} \in [\min (1, \cos (\Theta_{1}), \cos (\Theta_{2})), \max (1, \cos (\Theta_{1}), \cos (\Theta_{2}))];

where

\cos (\Theta_{1}) = \frac{x_{im}}{\sqrt{{(x_{im})}^{2} + {(x_{in})}^{2}}}, \cos (\Theta_{2}) = \frac{x_{im} x_{jm} + x_{in} \bar{x}}{\sqrt{{(x_{jm})}^{2} + {(\bar{x})}^{2}} \sqrt{{(x_{im})}^{2} + {(x_{in})}^{2}}};

(4) If each of the two data pairs has one missing value in the same position, in the form \{x_{im}, x_{in}\} = \{x_{im}, NaN\} and \{x_{jm}, x_{jn}\} = \{x_{jm}, NaN\}, then the similarity is an interval:

s_{mn} \in [0, \max (\frac{x_{im}}{\sqrt{{(x_{im})}^{2} + {(\bar{x})}^{2}}}, \frac{x_{jm}}{\sqrt{{(x_{jm})}^{2} + {(\bar{x})}^{2}}})];

(5) If each of the two data pairs has one missing value in different positions, in the form \{x_{im}, x_{in}\} = \{x_{im}, NaN\} and \{x_{jm}, x_{jn}\} = \{NaN, x_{jn}\}; or if the two data pairs have three missing values in total, in the form \{x_{im}, x_{in}\} = \{x_{im}, NaN\} and \{x_{jm}, x_{jn}\} = \{NaN, NaN\}, then the similarity interval is:

s_{mn} \in [0, 1];

3) Write each interval s_{mn} uniformly as [s_{mn}^{1}, s_{mn}^{2}], where s_{mn}^{1} is the start of the interval and s_{mn}^{2} is its end; for the \frac{N (N - 1)}{2} similarity intervals, calculate the similarity of each pair of intervals in turn, referred to as the second-order similarity; suppose a pair of similarity intervals is [s_{mn}^{1}, s_{mn}^{2}] and [s_{kj}^{1}, s_{kj}^{2}]; then their similarity s_{mnkj} is:

s_{mnkj} = \frac{s_{mn}^{1} s_{kj}^{1} + s_{mn}^{2} s_{kj}^{2}}{\sqrt{{(s_{mn}^{1})}^{2} + {(s_{mn}^{2})}^{2}} \sqrt{{(s_{kj}^{1})}^{2} + {(s_{kj}^{2})}^{2}}}, \forall m \neq n, k \neq j;

The number of second-order similarities s_{mnkj} is

C_{\frac{N (N - 1)}{2}}^{2} = \frac{\frac{N (N - 1)}{2} (\frac{N (N - 1)}{2} - 1)}{2} = \frac{N^{4} - 2 N^{3} - N^{2} + 2 N}{8};

4) Average the second-order similarity vector; the final similarity s (X_{i}, X_{j}) of the two time series is:

s (X_{i}, X_{j}) = \frac{\sum_{m \neq n, k \neq j} s_{mnkj}}{\frac{N^{4} - 2 N^{3} - N^{2} + 2 N}{8}} = \frac{8 \sum_{m \neq n, k \neq j} s_{mnkj}}{N^{4} - 2 N^{3} - N^{2} + 2 N};

This yields the similarity of two time series with missing data.