CN103577562A

CN103577562A - Multi-measurement time series similarity analysis method

Info

Publication number: CN103577562A
Application number: CN201310508432.9A
Authority: CN
Inventors: 王继民; 朱跃龙; 李士进; 万定生; 冯钧
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2013-10-24
Filing date: 2013-10-24
Publication date: 2014-02-12
Anticipated expiration: 2033-10-24
Also published as: CN103577562B

Abstract

The invention discloses a multi-measurement time series similarity analysis method applicable to k-nearest-neighbor queries of time series. Several single similarity measures are chosen according to the analysis requirements; each single similarity measure is used to find the m-nearest-neighbor sequences or subsequences of the query sequence; the m-nearest-neighbor sequences or subsequences under each similarity measure are trimmed to obtain candidate similar sequences or subsequences; and the candidates are combined by a multiple-classifier combination method with advantage weight to obtain the k-nearest-neighbor sequences of the query sequence. Compared with a single similarity measure, similarity analysis that combines multiple measures yields a more comprehensive result. Drawing on the BORDA counting method, the multiple-classifier combination method with advantage weight adjusts the ranking scores according to the differences between the similarity distances of adjacent candidate similar sequences or subsequences to the query sequence, so that the scores reflect the specific differences in similarity among the candidates.

Description

A multi-measurement time series similarity analysis method

Technical field

The present invention relates to a multi-measurement time series similarity analysis method, and in particular to a method for k-nearest-neighbor similarity analysis of time series that combines multiple similarity measures. It belongs to the technical field of data mining.

Background art

Time series similarity search means finding, in a time series database, the time series that are similar to a given pattern. Searching for similar subsequences arises frequently in practice. For example, in the Human Genome Project, sub-fragments similar to a given gene fragment are located in DNA sequences so that genetic similarity can be studied; from the sales records of various goods, products with similar sales patterns are identified so that similar sales strategies can be formulated; common precursors of natural disasters are identified to support research on disaster forecasting; and in hydrology, historical flood processes similar to the current flood process are retrieved to answer a question frequently raised in flood control command: "Which period in history had a hydrological process similar to the current one?"

Similarity search was first proposed by R. Agrawal in 1993. It is an important foundation for time series prediction, classification, clustering and sequential pattern mining. Time series similarity search differs from traditional exact queries: because time series are numerically continuous and affected by different kinds of noise, an exact match is not required in most cases. Moreover, a similarity query does not target specific values in a time series; given a query sequence, it looks for time series that show similar shape and trend over a period of time. The problems to be solved in time series similarity search include feature extraction, indexing and similarity measurement. For similarity measurement, researchers have proposed various measures, such as the Euclidean distance and its variants based on the Lp norm, the dynamic time warping distance (Dynamic Time Warping, DTW), the edit distance (Edit Distance, ED), the pattern distance (Pattern Distance, PD) and the longest common subsequence (Longest Common Subsequence, LCSS).

Current time series similarity search mostly evaluates the similarity between sequences with a single similarity measure, and each measure assesses similarity from only one angle. For example, the pattern distance and the slope distance consider similarity of shape, the Euclidean distance considers similarity of the actual values, and dynamic time warping (DTW) tolerates distortion of the sequence along the time axis. In practice, a single similarity model cannot evaluate time series similarity from several angles at once, so the returned results are often inaccurate and cannot satisfy the user's overall assessment of similarity. In pattern recognition and machine learning, an important problem is the combination of multiple classifiers. Numerous studies show that combining multiple classifiers achieves better results than a single base classifier: the combined decision is often more convincing than that of a single classifier because it pools the strengths of the base classifiers and thus guides decision-making better. In time series similarity analysis, there is little literature on combining multiple measures. Fabris F. et al. proposed a weight-based multi-measure time series similarity analysis [Fabris F, Drago I, F M. A multi-measure nearest neighbor algorithm for time series classification. Advances in Artificial Intelligence – IBERAMIA 2008. Springer Berlin Heidelberg, 2008: 153-162.], in which the weight of each measure is determined by heuristic search and the similarity distance is the weighted sum of the distances under each measure. That method spends a great deal of time finding the optimal weight vector and requires a predefined training set. The present invention draws on the BORDA counting method and improves it, proposing a multiple-classifier combination method with advantage weight that combines the rankings of the candidate similar sequences (subsequences) produced by multiple similarity measures to obtain the final similar sequences (subsequences).

Summary of the invention

Object of the invention: The present invention provides a multi-measurement time series similarity analysis method that improves the efficiency of time series similarity analysis.

To achieve the above object, the present invention draws on and improves the BORDA counting method, proposes a multiple-classifier combination method with advantage weight suited to combining the rankings of the similar sequences (subsequences) produced by each single similarity measure, and on this basis provides a multi-measurement time series k-nearest-neighbor query method. By the object being queried, time series similarity analysis is divided into whole-sequence queries (Whole Match) and subsequence queries (Subsequence Match). In a whole-sequence query, the time series to be queried comprise multiple time series of equal or different lengths; given a query sequence, the sequences similar to it are retrieved from them. In a subsequence query, subsequences similar to a given query sequence are retrieved from one long time series to be queried, and the result includes the offset of each similar subsequence within that series. The multi-measurement time series similarity analysis method of the present invention applies to k-nearest-neighbor queries of both whole sequences and subsequences.

Technical solution: A multi-measurement time series similarity analysis method comprises the following steps:

Draw on and improve the BORDA counting method to propose a multiple-classifier combination method with advantage weight: when the candidate similar sequences (subsequences) produced by multiple single similarity measures are combined, their ranking scores are weighted by the quantitative gaps between the candidates, so that the ranking scores reflect the specific differences between candidates. The accumulated ranking score of a candidate similar sequence (subsequence) is called its similarity score, and sorting the candidates by similarity score from high to low gives their final ranking. According to the specific similarity analysis requirements (for example, similarity of shape, or tolerance of distortion along the time axis), select several single similarity measures from existing time series similarity measures as base classifiers. Perform similarity analysis on the time series to be queried with each selected similarity measure to obtain m-nearest-neighbor sequences (subsequences), where m is larger than the final k. Because the similar sequences (subsequences) produced by different single similarity measures generally do not share the same start time, trim them: among the similar sequences (subsequences) produced by the single similarity measures, align those that overlap in time by more than half the sequence length, and delete the similar sequences (subsequences) in time periods whose number of occurrences is less than half the number of similarity measures, yielding the candidate similar sequences (subsequences). Trimming comprises sequence grouping preprocessing, aligning overlapping sequences, deleting isolated sequences and re-ranking the sequences. Finally, combine the rankings of the candidate similar sequences (subsequences) with the multiple-classifier combination method with advantage weight, sort the candidates by similarity score from high to low, and take the top k to obtain the final k-nearest-neighbor sequences (subsequences).

Beneficial effects: Compared with a traditional single similarity measure, the present invention considers multiple similarity factors at the same time, so that the analysis result reflects the user's overall assessment. Compared with the method of Fabris F., the present invention can combine multi-measure results without a training data set. Compared with the traditional BORDA counting method: the traditional method gives the first-ranked candidate n points, the second n-1 points, and so on down to 1 point for the last. Such ranking scores do not reflect the size of the gap between adjacently ranked candidate similar sequences (subsequences), so in some cases the candidates cannot be ranked well. The multiple-classifier combination method with advantage weight weights the ranking scores of the candidate sequences (subsequences) according to their similarity distances to the query sequence under each single similarity measure, so that the ranking scores of successively ranked sequences reflect more specifically their differences in similarity to the query sequence, and the resulting similar sequences (subsequences) are more accurate.

Brief description of the drawings

Fig. 1 is a model diagram of the multi-measurement time series similarity analysis method according to an embodiment of the present invention;

Fig. 2 is a flow chart of a similarity query using the multi-measurement time series similarity analysis method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of similar subsequence trimming for k-nearest-neighbor subsequence queries in the multi-measurement time series similarity analysis method according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of similar sequence trimming for k-nearest-neighbor whole-sequence queries in the multi-measurement time series similarity analysis method according to an embodiment of the present invention;

Fig. 5 shows the similarity query results for the single-peak flood process in the experiment, where (a) compares the Euclidean-distance similar subsequences with the query sequence, (b) compares the DTW-distance similar subsequences with the query sequence, (c) compares the slope-distance similar subsequences with the query sequence, (d) compares the multi-measure similar subsequences obtained by the multiple-classifier combination method with advantage weight with the query sequence, and (e) compares the multi-measure similar subsequences obtained by the BORDA counting method with the query sequence;

Fig. 6 shows the similarity query results for the double-peak flood process in the experiment, where (a) compares the Euclidean-distance similar subsequences with the query sequence, (b) compares the DTW-distance similar subsequences with the query sequence, (c) compares the slope-distance similar subsequences with the query sequence, (d) compares the multi-measure similar subsequences obtained by the multiple-classifier combination method with advantage weight with the query sequence, and (e) compares the multi-measure similar subsequences obtained by the BORDA counting method with the query sequence.

Detailed description

The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope. After reading the present invention, modifications of its various equivalent forms by those skilled in the art all fall within the scope defined by the claims appended to this application.

The present invention addresses the k-nearest-neighbor search problem, i.e. finding the k sequences (subsequences) most similar to a specified sequence. From a classification point of view, k-nearest-neighbor similarity analysis can be regarded as using a similarity measure to divide time series into the 1st most similar sequence (subsequence), the 2nd most similar sequence (subsequence), ..., the k-th most similar sequence (subsequence) and the dissimilar sequences (subsequences). Performing similarity analysis with multiple single similarity measures is therefore equivalent to classifying the time series with multiple classifiers. Numerous studies show that combining multiple classifiers achieves better results than a single base classifier: the combined decision is often more convincing than that of a single classifier because it pools the strengths of the base classifiers.

In the model of the multi-measurement time series analysis method shown in Fig. 1, similarity queries are run on the time series with each of several similarity measures, and the query results of the measures are then combined with the multiple-classifier combination method with advantage weight to obtain the final similar time series. The model has three parts. The first part takes the time series to be queried and the query time series as input and selects the single similarity measures to be used. The second part applies each single similarity measure algorithm (equivalent to a base classifier) to the input time series to obtain the m-nearest-neighbor similar sequences (subsequences) of the query sequence. The third part trims the similar sequences (subsequences) output by the second part to produce candidate similar sequences (subsequences), combines the rankings of the candidates with the multiple-classifier combination method with advantage weight, and selects the top k sequences (subsequences) to obtain the final k-nearest-neighbor sequences (subsequences).

Each single similarity measure serving as a base classifier is selected by the user from existing similarity measures according to the analysis requirements (for example, similarity of shape, or tolerance of distortion along the time axis). The steps of time series similarity analysis with a single similarity measure are: according to the requirements of the measure, extract time series features, build a time series index and, using the measure, find the m-nearest-neighbor sequences (subsequences) of the query sequence. The value of m must be greater than k so that more than k candidate similar sequences (subsequences) can be obtained after the m-nearest-neighbor time series are trimmed.
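The m-nearest-neighbor step for one base classifier can be sketched as follows. This is a minimal illustration only, assuming a brute-force sliding window with the Euclidean distance and no feature extraction or index; the function names `euclidean` and `m_nearest_subsequences` are hypothetical and do not come from the patent.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def m_nearest_subsequences(series, query, m, distance=euclidean):
    """Return the m (start, distance) pairs closest to the query.

    Assumption: every window of len(query) points is compared with the
    query; a real system would use features and an index instead.
    """
    l = len(query)
    scored = [(start, distance(series[start:start + l], query))
              for start in range(len(series) - l + 1)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:m]

series = [0, 1, 3, 1, 0, 0, 2, 6, 2, 0]
query = [1, 3, 1]
top = m_nearest_subsequences(series, query, m=3)
```

Any other measure (DTW, slope distance) could be passed as `distance`. Note that the brute-force scan returns heavily overlapping windows as separate neighbors, which is one reason m is chosen larger than k.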

Because the similar sequences (subsequences) produced by different single similarity measures generally do not share the same start time, they must be trimmed. Among the m-nearest-neighbor sequences (subsequences) of the single similarity measures, those that overlap in time by more than half the sequence length are aligned, and the similar sequences (subsequences) in time periods whose number of occurrences is less than half the number of similarity measures are deleted, yielding the candidate similar sequences (subsequences). The specific steps are sequence grouping preprocessing, aligning overlapping sequences, deleting isolated sequences and re-ranking the sequences. Assume that the number of single similarity measures involved is d and that the length of a similar sequence (subsequence) is l.

(1) Sequence grouping preprocessing: Group all similar sequences (subsequences) so that, within a group, every sequence (subsequence) overlaps in time by more than half the sequence length with at least one other sequence in the same group, and with no sequence in any other group. A similar sequence (subsequence) that does not overlap any other sequence by more than half the sequence length forms a group by itself.

(2) Aligning overlapping sequences: For each group produced in (1), if the number of sequences in the group exceeds half of the number of similarity measures d (that is, more than half of the single similarity measures consider this stretch similar to the query sequence), align the group. Alignment differs between subsequence queries and whole-sequence queries. In a subsequence query, compute the mean t of the start times of all sequences in the group and cut from the time series to be queried the subsequence of length l starting at t; this is the candidate similar subsequence. In whole-sequence similarity analysis, the similar sequences produced by the single similarity measures either overlap completely or do not overlap at all. From the viewpoint of each single similarity measure, the candidate similar sequence (subsequence) is assigned the same similarity distance to the query sequence as the similar sequence (subsequence) that was aligned. If the number of overlapping similar sequences in a group is more than half the number of similarity measures but less than the number of similarity measures, the aligned candidate is also added as a similar sequence (subsequence) of each remaining single similarity measure, and its similarity distance to the query sequence is computed with that measure.

(3) Deleting isolated sequences: For each group produced in (1), if the number of sequences in the group is less than half the number of similarity measures, delete all similar sequences (subsequences) in the group; they are not considered in the subsequent ranking.

(4) Re-ranking the sequences: Because similar subsequences have been added and isolated similar subsequences deleted, re-rank the candidate similar sequences (subsequences) for each single similarity measure.
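The grouping, alignment and deletion steps for a subsequence query can be sketched as below. This is an illustrative reading, not the patent's implementation: it assumes each similar subsequence is identified by its start time, that two subsequences of length l overlap by more than l/2 exactly when their start times differ by less than l/2, and that a group holding exactly half of the measures is dropped (the text leaves that case open). The name `trim_subsequences` and the sample start times are hypothetical.

```python
def trim_subsequences(starts_by_measure, length):
    """Return the aligned start times of the candidate subsequences.

    starts_by_measure: one list of start times per similarity measure.
    length: the common subsequence length l.
    """
    d = len(starts_by_measure)
    all_starts = sorted(t for per_measure in starts_by_measure
                        for t in per_measure)

    # Grouping: on a time line, chains of pairwise overlaps are exactly
    # the runs of sorted start times whose consecutive gaps are < l/2.
    groups = []
    for t in all_starts:
        if groups and t - groups[-1][-1] < length / 2:
            groups[-1].append(t)
        else:
            groups.append([t])

    # Alignment keeps groups backed by more than half of the measures
    # and replaces them by the mean start time; other groups are deleted.
    return [sum(g) / len(g) for g in groups if len(g) > d / 2]

# Three measures, 3 neighbors each, l = 10 (layout similar to Fig. 3).
starts = [[2, 40, 70], [4, 42, 90], [6, 20, 72]]
candidates = trim_subsequences(starts, length=10)
```

Here the start times 20 and 90 are isolated and dropped, while the three remaining groups are aligned to their mean start times. Recomputing distances for measures that lacked a candidate, and re-ranking, would follow as separate steps.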

The multiple-classifier combination method with advantage weight draws on the traditional weighted-voting BORDA counting method and improves on its shortcomings. It is simple to compute and, more importantly, requires no training set. In the traditional BORDA counting method, let k be the number of final similar sequences (subsequences) and m the number of candidate similar sequences (subsequences), and let each of the n similarity measures rank all candidates from most to least similar to express its preference. For the ranking of each similarity measure, every candidate is given a ranking score: the last-ranked candidate receives 1 point, the second to last 2 points, and so on, up to m points for the first-ranked candidate. The accumulated ranking score of a candidate is called its similarity score, and the k candidates with the highest similarity scores are the k-nearest-neighbor sequences. However, the traditional ranking score considers only the order of the candidates and ignores the specific differences in their degree of similarity. Consequently, when the single similarity measures rank the candidates very differently, the differences in similarity among the candidates may not be reflected accurately. The complete ranking information must therefore be considered, namely both the order of the candidate similar subsequences and the size of the difference in similarity to the query sequence between adjacent candidates.
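For reference, the traditional BORDA counting scheme described above can be sketched in a few lines. The helper names `borda_scores` and `borda_combine` and the sample rankings are illustrative assumptions.

```python
def borda_scores(ranking):
    """Ranking scores for one measure; ranking lists candidates from
    most to least similar. First place gets m points, last place 1."""
    m = len(ranking)
    return {cand: m - pos for pos, cand in enumerate(ranking)}

def borda_combine(rankings, k):
    """Sum the ranking scores over all measures and return the top k."""
    totals = {}
    for ranking in rankings:
        for cand, score in borda_scores(ranking).items():
            totals[cand] = totals.get(cand, 0) + score
    ordered = sorted(totals, key=lambda cand: -totals[cand])
    return ordered[:k], totals

rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
top2, totals = borda_combine(rankings, k=2)
```

Only positions enter the score, so two candidates that are almost equally close to the query are separated by the same one point as two candidates that are far apart, which is the limitation the advantage weight is meant to address.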

In the combination process, the multiple-classifier combination method with advantage weight of the present invention assigns a corresponding weight, called the advantage weight and denoted ω, to the query results of each single similarity measure (base classifier). The weight adjusts the ranking scores so that they reflect the difference in similarity between adjacently ranked candidate similar sequences (subsequences).

The advantage weight reflects the gap in similarity ranking between two adjacent candidate similar sequences (subsequences) in a base classifier. Suppose a query sequence Q is given and a certain similarity measure (such as the Euclidean distance, DTW or the slope distance) yields the m most similar time series, numbered S_i (i = 1, 2, ..., m), whose similarity distances to the query sequence are denoted d_i (i = 1, 2, ..., m) and satisfy d_i > d_j whenever i > j; that is, d_i (i = 1, 2, ..., m) is monotonically increasing. Let Δd_i = d_{i+1} - d_i > 0 (i = 1, 2, ..., m-1). The larger Δd_i is, the larger the difference in similarity between the similar sequences (subsequences) S_{i+1} and S_i relative to the same query sequence Q; conversely, the smaller Δd_i is, the smaller the difference. The advantage weight, denoted ω, is calculated by formula (2).

$$ \omega_i^k = \frac{\Delta d_i}{\sum_{j=1}^{m-1} \Delta d_j}, \quad i = 1, 2, \ldots, m-1 \tag{2} $$

where ω_i^k denotes the similarity advantage weight of the similar sequence (subsequence) S_i relative to S_{i+1} in the k-th similarity measure. In the query result of the k-th similarity measure, the ranking score with advantage weight of the i-th similar sequence (subsequence) S_i is given by formula (3):

$$ \begin{cases} r_0^k = m - 1 \\ r_i^k = r_{i-1}^k - (m-1)\,\omega_i^k, & i = 1, 2, \ldots, m-1 \end{cases} \tag{3} $$
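Formulas (2) and (3) can be illustrated with a short sketch for a single similarity measure. It assumes the distances are already sorted and strictly increasing, and it reads the index offset in formula (3) as r_0 belonging to the closest sequence S_1 and r_{m-1} to S_m; the function names are hypothetical.

```python
def advantage_weights(distances):
    """Formula (2): normalized gaps between consecutive distances.

    distances: d_1 < d_2 < ... < d_m for one similarity measure.
    Returns [w_1, ..., w_{m-1}], which sum to 1.
    """
    gaps = [b - a for a, b in zip(distances, distances[1:])]
    total = sum(gaps)  # equals d_m - d_1, positive by assumption
    return [gap / total for gap in gaps]

def ranking_scores(distances):
    """Formula (3): start at m - 1 and subtract (m - 1) * w_i per step."""
    m = len(distances)
    scores = [float(m - 1)]
    for w in advantage_weights(distances):
        scores.append(scores[-1] - (m - 1) * w)
    return scores

weights = advantage_weights([1.0, 2.0, 4.0, 7.0])
scores = ranking_scores([1.0, 2.0, 4.0, 7.0])
```

With distances 1, 2, 4, 7 the gaps are 1, 2, 3, so the weights are 1/6, 1/3, 1/2 and the scores fall from 3 to 2.5, 1.5 and 0: a small gap costs little score, a large gap costs more.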

The similarity score of a candidate similar sequence (subsequence) is the sum of its ranking scores over all similarity measures. That is, if a time series appears among the candidate similar sequences (subsequences) of n similarity measures with ranking scores r₁, r₂, …, r_n respectively, then its similarity score is

$$ \sum_{j=1}^{n} r_j $$

The candidates are ranked by similarity score, and the time series (subsequence) with the highest final similarity score is the one most similar to the query sequence.
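Summing the advantage-weighted ranking scores over several measures can be sketched as follows. The sketch assumes every candidate has a distance under every measure (as after trimming) and that distances within a measure are distinct; `advantage_scores`, `combine` and the sample distances are illustrative, not taken from the patent.

```python
def advantage_scores(dist_by_candidate):
    """Advantage-weighted ranking scores for one measure.

    dist_by_candidate: {candidate: distance to the query sequence}.
    """
    ordered = sorted(dist_by_candidate.items(), key=lambda kv: kv[1])
    m = len(ordered)
    gaps = [ordered[i + 1][1] - ordered[i][1] for i in range(m - 1)]
    total = sum(gaps)
    running = float(m - 1)
    scores = {ordered[0][0]: running}
    for (cand, _), gap in zip(ordered[1:], gaps):
        running -= (m - 1) * gap / total
        scores[cand] = running
    return scores

def combine(measures, k):
    """Similarity score = sum of ranking scores over all measures."""
    totals = {}
    for dist_by_candidate in measures:
        for cand, score in advantage_scores(dist_by_candidate).items():
            totals[cand] = totals.get(cand, 0.0) + score
    ordered = sorted(totals, key=totals.get, reverse=True)
    return ordered[:k], totals

measures = [
    {"a": 1.0, "b": 2.0, "c": 10.0},  # a barely ahead of b
    {"b": 1.0, "a": 9.0, "c": 10.0},  # b far ahead of a
]
top2, totals = combine(measures, k=2)
```

Under plain rank positions a and b would tie here (one first place and one second place each); with advantage weights b comes out ahead because it loses narrowly in the first measure and wins clearly in the second.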

In particular, among the candidate similar time series of the k-th similarity measure, when Δd₁ = Δd₂ = ... = Δd_{m-1}, i.e. ω₁ = ω₂ = ... = ω_{m-1} = 1/(m-1), the ranking score of the i-th candidate similar sequence (subsequence) is:

$$ r_i^k = r_{i-1}^k - (m-1)\,\omega_i^k = r_{i-1}^k - 1 \tag{4} $$

This is the traditional BORDA counting method. It can be seen that the traditional BORDA scoring method is the special case of the multiple-classifier combination method with advantage weight in which the advantage weight takes the value ω_i = 1/(m-1) (i = 1, 2, ..., m-1).
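The equal-gap case can be unrolled explicitly. The following short derivation only restates formulas (3) and (4) in closed form, using the same symbols:

```latex
\begin{aligned}
\omega_i^k &= \frac{1}{m-1}, \qquad i = 1, 2, \ldots, m-1 \\
r_i^k &= r_0^k - (m-1)\sum_{j=1}^{i} \omega_j^k = (m-1) - i
\end{aligned}
```

The scores are therefore m-1, m-2, ..., 0, which is the traditional sequence m, m-1, ..., 1 shifted down by one point. Provided every candidate is scored under every measure, this constant shift changes each similarity score by the same amount and so should leave the final ordering unchanged.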

Fig. 2 shows the flow chart of the multi-measurement time series similarity analysis method of the present invention. The steps are as follows:

Step 101: The time series to be queried is the time series being searched; in a subsequence query it is usually a sequence of long duration.

Step 102: Select multiple single similarity measures from existing similarity measures. The selection should allow the similarity of sequences to be evaluated from several angles, for example similarity of shape, or tolerance of offsets along the time axis.

Step 103: The query time series may be extracted from the time series to be queried or may be a new time series.

Step 104: According to the analysis requirements of each selected single similarity measure, extract time series features from the time series to be queried and the query time series, and build an index.

Step 105: Perform similarity analysis with a selected single similarity measure to produce the m-nearest-neighbor sequences of that measure.

Step 106: Determine whether any similarity measure has not yet been used for similarity analysis. If so, return to step 105 and perform similarity analysis with the next similarity measure; otherwise, go to step 107.

Step 107: Trim the similar sequences (subsequences) according to how the m-nearest-neighbor sequences of the single similarity measures overlap in time. Trimming comprises sequence grouping preprocessing, aligning overlapping sequences, deleting isolated sequences and ranking the candidate similar sequences. The trimming process is described here for a subsequence query; the process for a whole-sequence query is described afterwards. In the example, each single similarity measure performs a 3-nearest-neighbor search, with the results shown in Fig. 3. The similar subsequences of the 1st similarity measure are s₁₁ (the subsequence from t₁₁ to t₁₁+l), s₁₂ (the subsequence from t₁₂ to t₁₂+l) and s₁₃ (the subsequence from t₁₃ to t₁₃+l). Note that the similar subsequences of each single similarity measure are labeled here in temporal order only, which does not indicate their order of similarity to the query sequence. Likewise, the similar subsequences of the 2nd similarity measure are s₂₁ (the subsequence from t₂₁ to t₂₁+l), s₂₂ (the subsequence from t₂₂ to t₂₂+l) and s₂₃ (the subsequence from t₂₃ to t₂₃+l), and the similar subsequences of the 3rd similarity measure are s₃₁ (the subsequence from t₃₁ to t₃₁+l), s₃₂ (the subsequence from t₃₂ to t₃₂+l) and s₃₃ (the subsequence from t₃₃ to t₃₃+l).

(1) Sequence grouping preprocessing

Group all similar subsequences so that every sequence in a group overlaps in time by more than half the sequence length with at least one other sequence in the same group. After preprocessing, the similar sequences in Fig. 3 are divided into five groups: ① s₁₁, s₂₁, s₃₁ (s₁₁ and s₂₁ overlap by more than half, and s₂₁ and s₃₁ overlap by more than half); ② s₃₂; ③ s₁₂, s₂₂; ④ s₁₃, s₃₃; ⑤ s₂₃.

(2) Aligning overlapping sequences

Groups ①, ③ and ④ each contain more sequences than half the number of similarity measures (3), so each of them must be aligned. For group ①, take the mean t_c1 of the three start times t₁₁, t₂₁, t₃₁; the subsequence of length l starting at t_c1 in the time series to be queried is the candidate similar subsequence s_c1 (start time t_c1, length l). For each single similarity measure, the distance between s_c1 and the query sequence is taken to be the distance between the aligned sequence of that measure and the query sequence. That is, under the 1st similarity measure the distance between s₁₁ and the query sequence is used, under the 2nd similarity measure the distance between s₂₁ and the query sequence, and under the 3rd similarity measure the distance between s₃₁ and the query sequence. For group ③, compute the mean t_c2 of the two start times t₁₂, t₂₂; the subsequence of length l starting at t_c2 in the time series to be queried is the candidate similar subsequence s_c2 (start time t_c2, length l). Because s_c2 does not appear among the similar subsequences of the 3rd similarity measure, its distance to the query sequence must be computed with the 3rd similarity measure before it takes part in the subsequent ranking. Group ④ is aligned in the same way as group ③.

(3) Deleting isolated sequences

Groups ② and ⑤ each contain fewer sequences than half the number of single similarity measures (3), so they are deleted and not considered further.

(4) Ranking the candidate similar sequences

For each single similarity measure, the candidate similar sequences are ranked again. In the example, the three steps above yield three candidate similar subsequences s_c1, s_c2 and s_c3 of length l with start times t_c1, t_c2 and t_c3 respectively. Because s_c2 and s_c3 appear as new similar sequences for some single similarity measures, their similarity distances to the query sequence must be recomputed for those measures; the candidate similar sequences are then ranked separately from the viewpoint of each single similarity measure.

Step 108: Combine the rankings of the candidate similar subsequences with the multiple-classifier combination method with advantage weight and compute the final similarity scores.

Step 109: Sort all final candidate similar subsequences by final similarity score from high to low.

Step 110: Take the top k candidate similar subsequences as the k-nearest-neighbor similar sequences of the query sequence.

The processing flow of the present invention for whole-sequence queries is the same as for subsequence queries, except for some details of the "similar sequence (subsequence) trimming" in step 107. The trimming for whole-sequence similarity search is as follows:

In whole-sequence similarity search, the similar sequences of the single similarity measures either overlap completely in time or do not overlap at all, so aligning overlapping sequences is relatively easy. Fig. 4 shows the result of a similarity query for a query sequence under 3 single similarity measures, each of which returns 5-nearest-neighbor sequences. The similar sequences of the measures come from the sequences t₀, t₁, …, t₆ in the time series to be queried (each sequence is identified here by its start time). Under the 1st similarity measure, the 5 nearest neighbors of the query sequence are t₀, t₁, t₃, t₄, t₅; under the 2nd similarity measure, they are t₁, t₂, t₄, t₅, t₆; under the 3rd similarity measure, they are t₀, t₂, t₃, t₄, t₅. The order shown in the figure does not represent the order of similarity of the 5 similar sequences of each single similarity measure. For example, the 5 similar sequences of the 1st similarity measure might be ranked by similarity to the query sequence as t₁, t₀, t₄, t₃, t₅.

(1) Sequence grouping preprocessing

All similar sequences are grouped. Because their overlap relations in time are only complete overlap or no overlap, they are finally divided into the groups t₀,t₁,…,t₆.

(2) Align overlapping sequences

In whole-sequence queries, all time series in the same group have the same start time, so no alignment is needed. However, t₀ must be added as a new similar sequence of the 2nd similarity measure, t₁ as a new similar sequence of the 3rd, t₂ as a new similar sequence of the 1st, and t₃ as a new similar sequence of the 2nd.

(3) Delete isolated sequences

t₆ appears in the similar sequences of only one similarity measure, which is fewer than half the number of similarity measures, so t₆ is deleted.

(4) Sort candidate similar sequences

Steps (1), (2), and (3) yield the candidate similar sequences t₀,t₁,t₂,t₃,t₄,t₅. For each single similarity measure, the similarity distance between each newly added similar sequence and the query sequence is recalculated, and the candidate similar sequences of that measure are sorted.
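The whole-sequence pruning example above can be reproduced with a short Python sketch. The function name `prune_whole_sequence` and the string labels for t₀ to t₆ are hypothetical. The sketch removes isolated sequences before determining which sequences each measure is missing, which gives the same candidates as the order used in the text; the recalculation of similarity distances for newly added sequences is not shown.

```python
def prune_whole_sequence(neighbors_per_measure):
    """Group, filter, and complete the neighbor lists of several measures.

    neighbors_per_measure: list of lists of sequence identifiers, one list
    per single similarity measure.
    Returns (candidates, added), where `added` maps each measure index to
    the candidates that must be newly added to that measure.
    """
    n_measures = len(neighbors_per_measure)
    # Group: identical identifiers (same start time) fall into the same group
    counts = {}
    for neighbors in neighbors_per_measure:
        for seq in set(neighbors):
            counts[seq] = counts.get(seq, 0) + 1
    # Delete isolated sequences: fewer occurrences than half the measures
    candidates = sorted(s for s, c in counts.items() if c >= n_measures / 2)
    # Record which surviving candidates each measure is missing
    added = {
        i: [s for s in candidates if s not in neighbors]
        for i, neighbors in enumerate(neighbors_per_measure)
    }
    return candidates, added


# The 5-nearest neighbors under the three measures described in the text
measures = [
    ["t0", "t1", "t3", "t4", "t5"],
    ["t1", "t2", "t4", "t5", "t6"],
    ["t0", "t2", "t3", "t4", "t5"],
]
candidates, added = prune_whole_sequence(measures)
```

Here `added[1]` lists the sequences to be added to the 2nd measure (index 1), and so on for the other measures.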

The effect of the multi-measure time series similarity analysis method of the present invention is illustrated below by experiment. The data are the daily flow records of a large sluice gate from June 1 to September 30 of each year, over the period from June 1, 1998 to July 12, 2009, with 4 monitoring times per day: 2:00, 8:00, 14:00, and 20:00. Euclidean distance, slope distance, and DTW distance are selected as the similarity measures, and the features of the flood time series are extracted based on feature points. Flood processes of two forms, "single-peak inverted-V" and "double-peak M-shaped", are chosen as query sequences; each query sequence is a subsequence of the sequence to be searched. Similarity queries are carried out using the sliding-window subsequence matching method, and the multi-measure combination is performed with both the traditional BORDA count method and the multiple-classifier combination method with advantage weight.

(1) Similarity analysis of the "single-peak inverted-V" flood process

The "single-peak inverted-V" flood process time series from 2000.7.31 2:00 to 2000.8.29 20:00 is chosen as the query sequence for similarity analysis. The results of each similarity measure and of the multi-measure combinations are shown in Table 1, and Fig. 4 compares the similar subsequences with the query sequence.

Table 1: Similar subsequences of the single-peak flood process

In Table 1, the similar subsequences in the result of the multiple-classifier combination with advantage weight are those that appear repeatedly and rank near the top in the results of the single similarity measures. The subsequence starting at 2004.7.1 2:00 appears only in the DTW distance result and has a large rank gap, so its similarity score is low and it is eliminated. The subsequences starting at 2007.6.16 2:00 and 2008.8.1 8:00 each appear in the results of two single similarity measures, but the other repeatedly occurring subsequences have smaller gaps in their respective rankings and therefore higher similarity scores, so these two sequences fall toward the end of the final ranking and are eliminated. As Fig. 4(d) shows, the multiple-classifier combination with advantage weight integrates the advantages of the three similarity measures: the flood peak variation of the retrieved similar subsequences is almost entirely consistent with that of the query sequence, and the three filtered-out subsequences are less similar than the other similar sequences in the Fig. 4(d) result. The BORDA count method and the multiple-classifier combination with advantage weight return the same subsequences, but the combination with advantage weight additionally provides an ordering for some of the sequences.

(2) Similarity analysis of the "double-peak M-shaped" flood process

The "double-peak M-shaped" flood discharge process time series from 2000.8.15 2:00 to 2000.9.13 20:00 is chosen as the query sequence for similarity analysis. The results of each similarity measure and of the multi-measure combinations are shown in Table 2, and Fig. 5 compares the similar subsequences with the query sequence.

Table 2: Similar subsequences of the double-peak flood process

In Table 2, all subsequences in the query result of the multiple-classifier combination with advantage weight, except the one starting at 2007.8.15 2:00, appear repeatedly in the results of the single similarity measures. The subsequences starting at 2007.7.16 2:00, 2003.7.1 2:00, and 2005.7.1 2:00 appear repeatedly under different single similarity measures, but their rank gaps are large, so their similarity scores are low and they are eliminated. As Fig. 5(d) shows, the flood peak variation of the similar subsequences retrieved by the multiple-classifier combination with advantage weight is almost entirely consistent with that of the query sequence: all are double-peak flood processes. The three eliminated subsequences are less similar than the other similar subsequences in the Fig. 5(d) result.

Compared with the multi-measure combination by the BORDA count method, the multiple-classifier combination with advantage weight eliminates the triple-peak flood discharge process subsequence starting at 2003.7.1 2:00 and retains the double-peak flood discharge process subsequence starting at 2007.8.15 2:00. The subsequence starting at 2003.7.1 2:00 appears twice, but its rank gaps are large, so its similarity score is much lower than its traditional BORDA score. The subsequence starting at 2007.8.15 2:00 appears only once, but its rank gap is small, so its similarity score is higher than its traditional BORDA score. Fig. 5(d) and Fig. 5(e) show that the subsequence starting at 2007.8.15 2:00 is more similar to the query sequence than the one starting at 2003.7.1 2:00. In addition, the multi-measure combination by the BORDA count method does not rank the final similar subsequences well.
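For contrast, the traditional BORDA count baseline can be sketched in Python as follows. This is a generic Borda count, not code from the experiment; the function name `borda_scores`, the rankings, and the rule that a candidate absent from a list earns no points are assumptions of the sketch.

```python
def borda_scores(ranked_lists):
    """Traditional Borda count over several ranked lists of length m.

    The top entry of each list earns m points, the next m - 1, and so on
    down to 1. A candidate absent from a list earns nothing from it
    (an assumption of this sketch).
    """
    totals = {}
    for ranking in ranked_lists:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            totals[candidate] = totals.get(candidate, 0) + (m - position)
    return totals


# Hypothetical rankings under two similarity measures
rankings = [["a", "b", "c"], ["b", "a", "c"]]
totals = borda_scores(rankings)
```

Because Borda scores depend only on rank positions, candidates `a` and `b` tie in this made-up example no matter how different their similarity distances are. This is one plausible reason a plain BORDA combination may leave the final candidates poorly ordered.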

Claims

1. A multi-measure time series similarity analysis method, applicable to k-nearest-neighbor queries of time series, characterized in that the method comprises the following steps:

selecting a plurality of single similarity measures as base classifiers according to the analysis requirements;

extracting features from the time series to be queried as required by the selected single similarity measures, and building an index;

performing similarity analysis on the sequences to be queried using each single similarity measure to obtain the m-nearest-neighbor time series of the query sequence;

pruning the m-nearest-neighbor time series under each single similarity measure to obtain candidate similar sequences or subsequences;

combining the candidate similar sequences or subsequences using the multiple-classifier combination method with advantage weight to obtain the final k-nearest-neighbor time series.

2. The multi-measure time series similarity analysis method according to claim 1, characterized in that each single similarity measure serving as a base classifier is selected by the user from existing similarity measures according to the analysis requirements; each single similarity measure divides the sequences to be queried into m+1 classes: the 1st similar sequence, the 2nd similar sequence, ..., the m-th similar sequence, and the dissimilar sequences.

3. The multi-measure time series similarity analysis method according to claim 1, characterized in that the analysis step for each single similarity measure is specifically: extracting time series features, building a time series index, and retrieving the m-nearest-neighbor time series using a time series similarity search algorithm in combination with the similarity measure, where m is slightly larger than k.

4. The multi-measure time series similarity analysis method according to claim 1, characterized in that the step of pruning the m-nearest-neighbor sequences under each single similarity measure is specifically: arranging the m-nearest-neighbor sequences of each single similarity measure in time order, and pruning those similar sequences of the single similarity measures that overlap one another by more than half the sequence length, the pruning method being to select a new time series to replace the overlapping sequences, the start of the new sequence being the mean of the start times of the overlapping sequences; if the new sequence does not appear in the m-nearest-neighbor sequences of a single similarity measure, adding the sequence as a similar sequence of that measure and recalculating its similarity distance to the query sequence using that similarity measure; and deleting any similar sequence whose number of occurrences in the m-nearest-neighbor sequences of all single similarity measures is less than half the number of measures.

5. The multi-measure time series similarity analysis method according to claim 1, characterized in that the specific steps of combining the candidate similar sequences or subsequences using the multiple-classifier combination method with advantage weight are: first, for each single similarity measure, calculating with the combination method with advantage weight the rank score of each sequence among the similar sequences or subsequences it produced; accumulating the rank scores of each candidate similar sequence or subsequence to obtain its similarity score; sorting all candidate similar sequences or subsequences by similarity score from high to low; the top k ranked candidate similar sequences or subsequences being the k-nearest-neighbor sequences of the query sequence.

6. The multi-measure time series similarity analysis method according to claim 1, characterized in that the multiple-classifier combination method with advantage weight draws on the BORDA count method and improves it, the improvement being specifically: the rank scores of the similar sequences or subsequences are weighted according to the similarity distance between the candidate similar sequences or subsequences and the query sequence, so that the rank scores of adjacently ranked similar sequences or subsequences reflect the degree of difference in their similarity to the query sequence; the rank scores of a candidate similar sequence or subsequence are accumulated to obtain the similarity score of that sequence.

7. The multiple-classifier combination method with advantage weight according to claim 6, characterized in that: for each single similarity measure, the candidate similar sequences or subsequences of that similarity measure are first arranged by similarity distance from low to high (that is, by degree of similarity from high to low); the first-ranked sequence scores m points and the last-ranked sequence scores 1 point; the sequence ranked i-th scores

$$
\begin{cases}
r_{0} = m - 1 \\
r_{i} = r_{i-1} - (m - 1)\,\omega_{i}, & i = 1, 2, \ldots, m - 1
\end{cases}
\tag{1}
$$

where

$$
\omega_{i} = \frac{\Delta d_{i}}{\sum_{j=1}^{m-1} \Delta d_{j}}, \quad i = 1, 2, \ldots, m - 1
$$

ω_i is the advantage weight of two adjacent candidate similar sequences or subsequences. The rank scores of a candidate similar sequence or subsequence under each single similarity measure are accumulated to obtain its similarity score; the magnitude of the similarity score reflects the degree of similarity between the candidate similar sequence or subsequence and the query sequence.
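The rank score calculation of formula (1) for a single similarity measure can be sketched in Python as follows. The sketch assumes that Δd_i is the difference in similarity distance between the candidates ranked i and i−1, which is not spelled out here. The function name `advantage_scores`, the equal-weight fallback for identical distances, and the sample distances are hypothetical. The starting score r₀ is passed as the parameter `top`: formula (1) as written starts at m − 1 and ends at 0, while passing m gives the range from m down to 1.

```python
def advantage_scores(distance_by_candidate, top):
    """Rank scores following formula (1) for one similarity measure.

    distance_by_candidate: dict mapping a candidate identifier to its
    similarity distance from the query sequence.
    top: the score r_0 given to the best-ranked candidate.
    """
    # Rank candidates by similarity distance, most similar first
    ranked = sorted(distance_by_candidate.items(), key=lambda kv: kv[1])
    m = len(ranked)
    # Assumed: delta d_i is the distance gap between adjacent ranked candidates
    gaps = [ranked[i][1] - ranked[i - 1][1] for i in range(1, m)]
    total_gap = sum(gaps)
    scores = {ranked[0][0]: float(top)} if ranked else {}
    previous = float(top)
    for (candidate, _), gap in zip(ranked[1:], gaps):
        # If all distances are equal, fall back to equal weights (assumption)
        weight = gap / total_gap if total_gap > 0 else 1.0 / (m - 1)
        previous -= (m - 1) * weight  # r_i = r_{i-1} - (m - 1) * w_i
        scores[candidate] = previous
    return scores


# Made-up distances: "b" is close to "a", while "c" is far behind
distances = {"a": 1.0, "b": 2.0, "c": 5.0}
scores = advantage_scores(distances, top=len(distances) - 1)
```

In this made-up example the small gap between `a` and `b` costs `b` only half a point, while the large gap to `c` costs it a further 1.5 points, so the scores reflect distance gaps and not only rank positions.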