CN107463604A

CN107463604A - A fixed-segment-count time series segmentation algorithm based on important points

Info

Publication number: CN107463604A
Application number: CN201710462992.3A
Authority: CN
Inventors: 孙志伟; 董亮亮; 马永军
Original assignee: Tianjin University of Science and Technology
Current assignee: Tianjin University of Science and Technology
Priority date: 2017-06-19
Filing date: 2017-06-19
Publication date: 2017-12-12

Abstract

The present invention relates to a time series segmentation algorithm with a fixed number of segments, based on important points. Its main technical features are: normalize the time series data; compute the fitting error of each segment and order the segments by fitting error from largest to smallest; take the k segments with the largest fitting error and sort them by segment length from shortest to longest; take the first two segments in that length ordering and pre-fit them; compare the signs and magnitudes of the maximum and minimum fitting errors within the segment to decide how to split at important points, increase the important-point count, and iterate until the fixed number of segments is reached. The invention is reasonably designed, substantially reduces the fitting error, improves segmentation efficiency, offers good fitting quality and adaptability, and is widely applicable to time series prediction, clustering, anomaly detection, and related fields.

Description

A fixed-segment-count time series segmentation algorithm based on important points

Technical field

The invention belongs to the field of intelligent information processing, and in particular relates to a fixed-segment-count time series segmentation algorithm based on important points.

Background technology

A time series is an ordered set of observations arranged in chronological order. Time series are common in commerce, economics, science and engineering, and the social sciences. In recent years, data mining on time series data has attracted widespread attention, including association rule mining, similarity search, pattern discovery, and anomaly detection. Time series typically accumulate large volumes of data over time, and how to analyze these data statistically and extract valuable information and knowledge from them has long been of interest to users. Because time series data are massive and complex, mining the raw series directly is costly in storage and computation and may also degrade the accuracy and reliability of the algorithm. To improve the accuracy and effectiveness of data mining algorithms, the time series must first be preprocessed so that the original data are replaced by a simplified approximate representation.

Much research has been done on approximating time series data, and many pattern representation methods have been proposed in China and abroad, for example frequency-domain methods, methods based on singular value decomposition, symbolic representation, and piecewise linear representation. Piecewise Linear Representation extracts the principal feature points that reflect the trend of the original series and approximates the series with continuous straight-line segments joined end to end. It is simple and intuitive, supports multiple resolutions in time, and achieves a high data compression ratio, which makes it an effective method for compressing data and removing noise. Piecewise linear representation is regarded as one of the more advanced time series representations, so research on it is of considerable significance.

According to how the segments are chosen, segmentation-based representations can be divided into the following kinds:

The first is PAA (Piecewise Aggregate Approximation). It divides the time series into equal-width intervals and approximates the whole series by the mean value of each interval, converting a given time series into an approximate series of only K segments. However, it cannot control the error of individual segments or of the series as a whole. Because PAA divides the series evenly without regard to its actual shape, it does not preserve the trend of the original series well.
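The equal-width averaging performed by PAA can be sketched in a few lines of Python. This is only an illustration and not code from the cited work; the function name `paa` and the integer rule used to place interval boundaries are assumptions.

```python
def paa(series, k):
    """Approximate `series` by the mean of each of k near-equal-width intervals.

    Illustrative sketch only: the boundary rule (integer division) is an
    assumption; the text only says the series is divided at equal intervals.
    """
    n = len(series)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(series)")
    means = []
    for j in range(k):
        start = j * n // k          # first index of interval j
        end = (j + 1) * n // k      # one past the last index of interval j
        window = series[start:end]
        means.append(sum(window) / len(window))
    return means


print(paa([1, 2, 3, 4, 5, 6], 3))
```

Because every interval has the same width regardless of where the series rises or falls, a sharp peak inside one interval is averaged away, which is the loss of trend information noted above.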

The second is PLR (Piecewise Linear Representation), which represents the time series as a set of adjacent line segments: the original series is approximated by several straight-line segments joined end to end, whose intervals need not be equal. Within each segment the data are usually fitted by linear interpolation or linear regression. PLR methods can in turn be subdivided. One group segments according to the fitting error, with the work of Keogh as its representative. In Keogh's segmented representation, the goal of the piecewise approximation is to minimize the residual sum of squares between the original time series and its linear approximation. This approach has two variants: one uses a local threshold to control each segment, so that the error of the current segment does not exceed the local threshold; the other uses a global threshold, so that the total error of all segments does not exceed the threshold. The global-threshold variant includes three representative linear segmentation algorithms: sliding window (SW), top-down (TD), and bottom-up (BU). SW supports online segmentation of time series, but its segmentation quality is mediocre, and it does not support retaining segment history or refitting. TD and BU give better segmentation, but they do not support online segmentation and have higher space complexity.

In addition, Xiao Hui proposed a time series segmentation algorithm based on a temporal edge operator, and Zhan Yanyan proposed a time series segmentation algorithm that extracts edge points based on slope. The slope-based method judges a point by the difference between the slopes of the lines joining it to its left and right neighbors; when the slope difference exceeds a threshold, the point is added to the set of edge points. The piecewise linear representation based on trend turning points classifies extreme points, and points whose amplitude of change exceeds a threshold, as turning points. The common drawback of these methods is that they require parameters that are hard to determine in advance, such as the slope threshold or the amplitude threshold, and that they consider only local behavior and give too little weight to the series as a whole.

Other piecewise linear representations search for important points, storing mainly the points that strongly influence the shape of the series. Methods based on important points match human visual perception well and retain the important trends of the whole series, but they require a precise definition of an important point. Zhou Dazhuo et al. demonstrated the equivalence of orthogonal distance and vertical distance and proposed the important-point segmentation algorithm PLR_SIP (Piecewise Linear Representation Series Importance Point). The drawback of this method is that the degree of compression cannot be chosen according to the user's needs: the method is recursive and keeps decomposing the leftmost subsequence until the fitting error falls below a user-specified value, so it cannot return a specified number of the most important points.

The fixed-segment-count time series segmentation algorithm based on important points proposed by Chen Ran uses the fitting error of each segment as the priority criterion and also takes an error threshold as an input parameter, but a suitable value for the error threshold is hard to estimate.

In summary, existing important-point segmentation algorithms for time series leave considerable room for improvement in both fitting error and time efficiency.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and to provide a fixed-segment-count time series segmentation algorithm based on important points that is reasonably designed, has a small fitting error, and segments efficiently.

The invention solves its technical problem with the following technical scheme:

A fixed-segment-count time series segmentation algorithm based on important points comprises the following steps:

Step 1: normalize the time series data, initialize the set of segmentation points, add the start point and end point of the time series to that set, and add the segment formed by the start point and end point to the priority queue;

Step 2: compute the fitting error of each segment newly added to the priority queue, and order the priority queue by segment fitting error from largest to smallest;

Step 3: take the first k segments of the priority queue in fitting-error order, and sort these k segments by segment length from shortest to longest;

Step 4: take the first two segments in the length ordering and pre-fit them: simulate splitting each at its own important point, compute how much the fitting error of each segment decreases from before to after the split, and select the segment whose fitting error decreases more;

Step 5: compute the maximum and minimum of the fitting error within the segment and their signs, and compare their signs and magnitudes. When the condition is satisfied, split the segment at two important points; otherwise split it at one important point. Increase the important-point count accordingly. While the important-point count has not reached the fixed number of segments, return to step 2 and iterate until the fixed number of segments is reached.

The fitting error of a segment in step 2 is computed as:

$$E=\sqrt{\sum_{i=1}^{n}\left(X_i-X_i^{c}\right)^{2}}$$

where $E$ is the fitting error of the segment, $X_i$ is the actual value of point $i$, and $X_i^{c}$ is the predicted value obtained by linear interpolation after splitting at the important points.
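As an illustration of the fitting error defined in claim 2 (the square root of the summed squared differences between actual and interpolated values), a minimal Python sketch follows. The helper names `interpolate` and `fitting_error`, and the choice of list positions as the x coordinates, are assumptions made for this example.

```python
import math


def interpolate(series, a, b, i):
    """Value at index i on the straight line joining (a, series[a]) and (b, series[b])."""
    return series[a] + (series[b] - series[a]) * (i - a) / (b - a)


def fitting_error(series, a, b):
    """Fitting error E of the segment [a, b] (requires a < b).

    Each X_i is compared with its linearly interpolated value X_i^c,
    and E is the square root of the sum of the squared differences.
    """
    return math.sqrt(sum((series[i] - interpolate(series, a, b, i)) ** 2
                         for i in range(a, b + 1)))


print(fitting_error([0, 3, 0], 0, 2))
```

A segment whose points already lie on the line joining its endpoints has an error of zero, so such a segment sinks to the bottom of the priority queue and is not selected for splitting.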

The default value of k in step 3 is 5.

The important point in step 4 is computed using the following formula:

where height is the distance associated with a point, $(x_a, y_a)$ are the coordinates of the start point of the segment, and $(x_b, y_b)$ are the coordinates of its end point. The important point is the point with the largest height, and its coordinate is the x index of that point.
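The formula for height is not reproduced in this text. The sketch below assumes that height is the vertical distance from a point to the straight line joining the segment's start point and end point, an assumption suggested by the background discussion of vertical and orthogonal distance; the function name `important_point` is likewise hypothetical.

```python
def important_point(series, a, b):
    """Return (index, height) of the interior point of [a, b] with the largest height.

    Assumption: height is the vertical distance between the point and the
    line through (x_a, y_a) = (a, series[a]) and (x_b, y_b) = (b, series[b]).
    Returns (None, -1.0) when the segment has no interior point.
    """
    xa, ya, xb, yb = a, series[a], b, series[b]
    best_index, best_height = None, -1.0
    for i in range(a + 1, b):
        y_line = ya + (yb - ya) * (i - xa) / (xb - xa)
        height = abs(series[i] - y_line)
        if height > best_height:
            best_index, best_height = i, height
    return best_index, best_height


print(important_point([0, 1, 5, 1, 0], 0, 4))
```

Ties are resolved in favor of the earliest index here; the source does not say how ties should be handled.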

The maximum and minimum of the segment fitting error in step 5 are computed as follows:

where $X_i$ is the actual value of point $i$, $X_i^{c}$ is the predicted value obtained by linear interpolation after splitting at the important points, their difference is the fitting error at the point, and the maximum and minimum are taken over the points of the segment.

The signs of the maximum and minimum of the segment fitting error in step 5 are evaluated as follows:

The maximum and the minimum here are those of the difference between the original curve and the fitted curve. When the maximum and minimum have opposite signs and their magnitudes satisfy the magnitude condition, two important points are split off at once; otherwise one important point is split off.

The advantages and positive effects of the invention are:

1. The invention takes into account both the overall fitting error and the segment length. Segments with higher priority are pre-split in simulation to find the best segment to split, and the split itself takes into account whether the maximum point and minimum point within the segment deviate in the same or in opposite directions, so that several important points can be split off in one pass. This avoids asking the user for parameters that are hard to determine, such as a fitting error threshold. Experiments show that the method preserves the overall shape of the original time series while keeping the fitting error to a minimum and improving time efficiency.

2. The invention is reasonably designed. Experimental comparison on multiple data sets shows that it reduces the fitting error and achieves a better fit. Compared with existing important-point segmentation algorithms, it improves the fit while also substantially improving segmentation efficiency. It has good fitting quality and adaptability and is widely applicable to time series prediction, clustering, anomaly detection, and related fields.

Brief description of the drawings

Fig. 1 is the overall processing flow chart of the invention;

Fig. 2 is the original curve of the stock data (255 points) used in the embodiment of the invention;

Fig. 3 is the fitted curve when the stock data are fitted with 50 segments;

Fig. 4 compares the fitting errors of the invention, PAA, and PLR_PF on the same data set (Ocean) at different compression ratios;

Fig. 5 compares the fitting errors of the invention and PLR_FPIP on the same data set (Ocean) at different compression ratios;

Fig. 6 shows the running times of the invention and PLR_FPIP on the same data set (Ocean) at different compression ratios.

Embodiment

Embodiments of the invention are described further below with reference to the accompanying drawings.

The invention proposes a fixed-segment-count time series segmentation algorithm based on important points (PLR_TSIP, Piecewise Linear Representation Time Series Important Points). The input of the algorithm is a time series $X=(x_1,x_2,x_3,\ldots,x_i,\ldots,x_n)$ and a fixed segment count N or a compression ratio; the output is the set of important points IPs. As shown in Fig. 1, the invention comprises the following steps:

Step 1: normalize the time series data X, initialize the set of segmentation points IPs, add the start point and end point of X to IPs, and add the segment formed by the start point and end point of X to the priority queue Q.

Step 2: compute the fitting error of each segment newly added to the priority queue Q; Q is ordered by fitting error.

Step 3: take the first k segments (default 5) of the priority queue Q in fitting-error order, and sort these k segments by segment length (the number of points the segment contains) from shortest to longest.

Step 4: select the two shortest segments from step 3 and preprocess them: simulate splitting each at its important point and compare the fitting error after the split with the fitting error before it. The segment whose fitting error decreases more is split first.

Step 5: compute the maximum and minimum of the difference between the original curve and the fitted curve over the segment obtained in step 4. If they have opposite signs and satisfy the magnitude condition, the maximum point and the minimum point are both used as segmentation points at once, dividing the segment into three: the three consecutive segments formed by the segment start point, the two extreme points, and the segment end point are put into the priority queue Q, and the remaining number of segmentation points N is reduced by two. Otherwise the segment is split at its important point: the two consecutive segments formed by the segment start point, the important point, and the segment end point are added to the priority queue in descending order of fitting error, and N is reduced by one. If N is greater than zero, return to step 2; when N reaches zero, stop and output the important point set IPs.

The invention is further illustrated below using the experimental data shown in Fig. 2:

The experimental data are the original time series "IBM common stock closing prices: daily, 29th June 1959 to 30th June 1960", of length 255, together with the public data sets from various domains provided by Keogh et al. (referred to as the KData data sets). Fig. 2 shows the original curve of the stock data of length 255.

In this embodiment, the input of the algorithm is the original stock time series of length 255 and a fixed segment count N or a compression ratio; N is 50 in this example. The output of the algorithm is the important point array IPs. The processing steps are as follows:

Step 1: because the time series in the data sets come from different domains and their values differ greatly, each time series must first be normalized to the interval [0,1] before the important-point segmentation algorithm is applied, to simplify computation and comparison. The normalization is as follows:

The start node 1 and end node 255 of the time series are stored in the array IPs as important points, the segment formed by start point 1 and end point 255 is added to the priority queue Q, and the number of remaining segmentation points N is reduced by two.
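The normalization formula itself is not reproduced in this text. A common way to map a series onto the interval [0,1] is min-max scaling, and the sketch below assumes that choice; the function name `normalize` and the handling of a constant series are assumptions as well.

```python
def normalize(series):
    """Scale `series` linearly so that its minimum maps to 0 and its maximum to 1.

    Assumption: min-max scaling; the source only states that the data are
    normalized to the interval [0, 1]. A constant series is mapped to all
    zeros to avoid dividing by zero.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    span = hi - lo
    return [(x - lo) / span for x in series]


print(normalize([2, 4, 6]))
```

Scaling every series to the same range makes fitting errors from data sets with very different magnitudes comparable, which is the stated reason for this step.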

Step 2: compute the fitting error of each segment newly added to the priority queue Q; Q is ordered by fitting error. The fitting error of a segment is computed with formula (2), the maximum difference with formula (3), the minimum difference with formula (4), and the important-point distance and important-point coordinate with formula (5).

$$E=\sqrt{\sum_{i=1}^{n}\left(X_i-X_i^{c}\right)^{2}}\tag{2}$$

where $E$ is the fitting error of the segment, $X_i$ is the actual value of point $i$, and $X_i^{c}$ is the predicted value obtained by linear interpolation after splitting at the important points.

In formula (3), $X_i$ is the actual value of the point, $X_i^{c}$ is the predicted value obtained by linear interpolation after splitting at the important points, their difference is the fitting error at the point, and the result is the maximum fitting error of the segment.

In formula (4), $X_i$ is the actual value of the point, $X_i^{c}$ is the predicted value obtained by linear interpolation after splitting at the important points, their difference is the fitting error at the point, and the result is the minimum fitting error of the segment.

In formula (5), height is the distance associated with a point, $(x_a, y_a)$ are the coordinates of the start point of the segment, and $(x_b, y_b)$ are the coordinates of its end point. The important point seg is the point with the largest height, and its coordinate is the x index of that point.
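One way to realize a priority queue ordered by fitting error from largest to smallest is a binary heap keyed on the negated error. The sketch below uses Python's standard `heapq` module; the class name `SegmentQueue` and its methods are hypothetical and are not part of the patent.

```python
import heapq


class SegmentQueue:
    """Priority queue of segments, largest fitting error first (illustrative sketch)."""

    def __init__(self):
        self._heap = []

    def push(self, error, start, end):
        # heapq is a min-heap, so the error is negated to pop the largest first.
        heapq.heappush(self._heap, (-error, start, end))

    def pop(self):
        neg_error, start, end = heapq.heappop(self._heap)
        return -neg_error, start, end

    def top_k(self, k):
        """Return up to k segments with the largest errors without removing them."""
        return [(-neg, s, e) for neg, s, e in heapq.nsmallest(k, self._heap)]

    def __len__(self):
        return len(self._heap)


q = SegmentQueue()
q.push(0.2, 0, 10)
q.push(0.9, 10, 50)
q.push(0.5, 50, 60)
print(q.top_k(2))
```

Only newly added segments need their error computed, since the error of a segment already in the queue does not change until that segment is split.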

Step 3: take the first k segments (default 5) of the priority queue Q in fitting-error order and sort them by segment length from shortest to longest. At this stage the queue holds only one segment, of length 255.
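The selection in step 3 can be sketched as two passes: pick the k segments with the largest fitting error, then order those by length. The tuple layout `(error, start, end)` and the function name `candidate_segments` are assumptions made for this illustration.

```python
import heapq


def candidate_segments(segments, k=5):
    """Return up to k highest-error segments, shortest first.

    `segments` is a list of (error, start, end) tuples; the layout is an
    assumption of this sketch. Length is measured as end - start.
    """
    top = heapq.nlargest(k, segments, key=lambda seg: seg[0])
    return sorted(top, key=lambda seg: seg[2] - seg[1])


segments = [(0.9, 0, 100), (0.5, 100, 110), (0.1, 110, 120), (0.7, 120, 180)]
print(candidate_segments(segments, k=3))
```

When the queue holds fewer than k segments, as in the first iteration described above, all of them are returned.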

Step 4: select the two shortest segments from step 3 and preprocess them: simulate splitting each at its important point and compare the fitting error after the split with the fitting error before it. The segment whose fitting error decreases more is split first. In the first iteration there is only one segment, so it is selected directly.
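The simulated split of step 4 can be sketched as follows. Several details are assumptions of this example rather than statements of the patent: the important point is taken as the interior point with the largest vertical distance to the line joining the segment's endpoints, and the error after a split is taken as the sum of the fitting errors of the two sub-segments. All function names are hypothetical.

```python
import math


def interp(series, a, b, i):
    """Value at index i on the line joining (a, series[a]) and (b, series[b])."""
    return series[a] + (series[b] - series[a]) * (i - a) / (b - a)


def fit_error(series, a, b):
    """Square root of the summed squared residuals on [a, b]."""
    return math.sqrt(sum((series[i] - interp(series, a, b, i)) ** 2
                         for i in range(a, b + 1)))


def important_point(series, a, b):
    """Interior index with the largest vertical distance to the chord (assumed definition)."""
    return max(range(a + 1, b), key=lambda i: abs(series[i] - interp(series, a, b, i)))


def error_reduction(series, a, b):
    """Return (reduction, split_index) for a simulated split of [a, b]."""
    p = important_point(series, a, b)
    after = fit_error(series, a, p) + fit_error(series, p, b)  # assumed combination
    return fit_error(series, a, b) - after, p


def choose_segment(series, candidates):
    """Among the first two (start, end) candidates, pick the one whose split helps most."""
    best = None
    for a, b in candidates[:2]:
        if b - a < 2:  # no interior point to split at
            continue
        gain, p = error_reduction(series, a, b)
        if best is None or gain > best[0]:
            best = (gain, (a, b), p)
    if best is None:
        raise ValueError("no candidate segment can be split")
    return best[1], best[2]


series = [0, 0, 0, 0, 0, 0, 4, 0]
print(choose_segment(series, [(0, 4), (4, 7)]))
```

In this toy series the first candidate is already flat, so splitting it gains nothing, while splitting the second at its spike removes half of its error; the second segment is therefore chosen.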

Step 5: compute the maximum and minimum of the difference between the original curve and the fitted curve over the segment obtained in step 4. If they have opposite signs and satisfy the magnitude condition, the maximum point and the minimum point are both used as segmentation points at once, dividing the segment into three: the three consecutive segments formed by the segment start point, the two extreme points, and the segment end point are put into the priority queue Q, and the remaining number of segmentation points N is reduced by two. Otherwise the segment is split at its important point: the two consecutive segments formed by the segment start point, the important point, and the segment end point are added to the priority queue in descending order of fitting error, and N is reduced by one. In the first iteration, in the segment with start point 1 and end point 255, the minimum point is 174 and the maximum point is 239; the signs are opposite and the magnitude condition holds, so the two important points 174 and 239 are added to IPs, N is reduced by two, and the three segments formed by the start point 1, the points 174 and 239, and the end point 255 are added to the priority queue Q.

If N is greater than zero, return to step 2; when N reaches zero, stop and output the important point set IPs. After the first iteration N is 46, which is greater than zero, so execution returns to step 2. The other iterations proceed likewise until N equals zero and all important points are output.
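The decision between splitting at two points and splitting at one can be sketched as below. The magnitude condition is not reproduced in this text, so it is passed in as a caller-supplied predicate `magnitude_ok` that defaults to always true; that predicate, the fallback rule for the single-point case (largest absolute difference), and all function names are assumptions of this sketch.

```python
def residuals(series, a, b):
    """Signed differences between actual values and the chord from a to b (interior points)."""
    slope = (series[b] - series[a]) / (b - a)
    return {i: series[i] - (series[a] + slope * (i - a)) for i in range(a + 1, b)}


def choose_split_points(series, a, b, magnitude_ok=lambda e_max, e_min: True):
    """Return the sorted split indices for [a, b]; requires at least one interior point.

    Two indices are returned when the largest and smallest differences have
    opposite signs and the placeholder predicate `magnitude_ok` accepts them.
    """
    res = residuals(series, a, b)
    i_max = max(res, key=res.get)
    i_min = min(res, key=res.get)
    e_max, e_min = res[i_max], res[i_min]
    if e_max > 0 > e_min and magnitude_ok(e_max, e_min):
        return sorted((i_max, i_min))
    # Assumed fallback: a single split at the largest absolute difference.
    return [i_max if abs(e_max) >= abs(e_min) else i_min]


def sub_segments(a, b, points):
    """Consecutive (start, end) pairs produced by splitting [a, b] at `points`."""
    bounds = [a, *points, b]
    return list(zip(bounds, bounds[1:]))


print(choose_split_points([0, 2, 0, -2, 0], 0, 4))
print(sub_segments(1, 255, [174, 239]))
```

The second call reproduces the bookkeeping of the first iteration described above: splitting the segment from 1 to 255 at 174 and 239 yields the three segments (1, 174), (174, 239), and (239, 255).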

When the algorithm finishes, IPs contains 50 points. Fig. 3 shows the fitted curve when the stock data are fitted with 50 segments; it can be seen that the invention fits the original stock curve well.

The invention (PLR_TSIP) was compared experimentally with several other methods: PAA (Piecewise Aggregate Approximation), PLR_PF (Piecewise Linear Representation PRATT FINK), and PLR_FPIP (Piecewise Linear Representation Fixed Important Points). Table 1 gives the fitting errors on the same data sets at the same compression ratio of 80%. As Table 1 shows, as the number of time series segments increases, the fitting error of PLR_TSIP is significantly smaller than that of PAA and PLR_PF; compared with PLR_FPIP, the fitting error of PLR_TSIP is reduced by at least 7%, at most 22%, and 13% on average.

Table 1

The fitting errors of the three algorithms PLR_TSIP, PLR_PF, and PAA were compared on the Ocean data set at different compression ratios. As shown in Fig. 4, the fitting error of every algorithm decreases as the number of time series segments increases, but at the same compression ratio the fitting error of PLR_TSIP is significantly smaller than that of PAA and PLR_PF. The fitting errors of the two algorithms PLR_TSIP and PLR_FPIP were also compared on the Ocean data set at different compression ratios. As shown in Fig. 5, the fitting errors of both PLR_FPIP and PLR_TSIP decrease as the fixed number of segments increases, but the fitting error of PLR_TSIP is smaller than that of PLR_FPIP. Compared with PLR_FPIP, PLR_TSIP reduces the fitting error by 15% on average, and by more than 30% when the compression ratio is smaller, which is a clear improvement.

Compared with PAA and PLR_PF, PLR_TSIP takes longer to execute because of its heavier computation, but its fitting error is much lower. Compared with PLR_FPIP, which is also an important-point algorithm, Fig. 5 above shows a considerably better fit; time efficiency is compared with PLR_FPIP here. As shown in Table 1, for the same number of segments PLR_TSIP is more efficient than PLR_FPIP, with performance improved by more than 20% on average.

It should be emphasized that the embodiments described here are illustrative rather than limiting. The invention therefore includes, but is not limited to, the embodiments described in the detailed description; any other embodiment derived by those skilled in the art from the technical scheme of the invention also falls within the scope of protection of the invention.

Claims

1. A fixed-segment-count time series segmentation algorithm based on important points, characterized in that it comprises the following steps:

Step 1: normalize the time series data, initialize the set of segmentation points, add the start point and end point of the time series to that set, and add the segment formed by the start point and end point to the priority queue;

Step 2: compute the fitting error of each segment newly added to the priority queue, and order the priority queue by segment fitting error from largest to smallest;

Step 3: take the first k segments of the priority queue in fitting-error order, and sort these k segments by segment length from shortest to longest;

Step 4: take the first two segments in the length ordering and pre-fit them: simulate splitting each at its own important point, compute how much the fitting error of each segment decreases from before to after the split, and select the segment whose fitting error decreases more;

Step 5: compute the maximum and minimum of the fitting error within the segment and their signs, and compare their signs and magnitudes; when the condition is satisfied, split the segment at two important points, otherwise split it at one important point; increase the important-point count accordingly; while the important-point count has not reached the fixed number of segments, return to step 2 and iterate until the fixed number of segments is reached.

2. The fixed-segment-count time series segmentation algorithm based on important points according to claim 1, characterized in that the fitting error of a segment in step 2 is computed as:

$$E=\sqrt{\sum_{i=1}^{n}\left(X_i-X_i^{c}\right)^{2}}$$

where $E$ is the fitting error of the segment, $X_i$ is the actual value of point $i$, and $X_i^{c}$ is the predicted value obtained by linear interpolation after splitting at the important points.

3. The fixed-segment-count time series segmentation algorithm based on important points according to claim 1, characterized in that the default value of k in step 3 is 5.

4. The fixed-segment-count time series segmentation algorithm based on important points according to claim 1, characterized in that the important point in step 4 is computed using the following formula:

where height is the distance associated with a point, $(x_a, y_a)$ are the coordinates of the start point of the segment, and $(x_b, y_b)$ are the coordinates of its end point; the important point is the point with the largest height, and its coordinate is the x index of that point.

5. The fixed-segment-count time series segmentation algorithm based on important points according to claim 1, characterized in that the maximum and minimum of the segment fitting error in step 5 are computed as follows:

6. The fixed-segment-count time series segmentation algorithm based on important points according to claim 1, characterized in that the signs of the maximum and minimum of the segment fitting error in step 5 are evaluated as follows:

The maximum and the minimum are those of the difference between the original curve and the fitted curve; when the maximum and minimum have opposite signs and their magnitudes satisfy either of the two magnitude conditions, two important points are split off at once; otherwise one important point is split off.