CN103577562A - Multi-measurement time series similarity analysis method - Google Patents

Publication number
CN103577562A
Authority
CN
China
Prior art keywords
sequence
subsequence
similarity
similar
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310508432.9A
Other languages
Chinese (zh)
Other versions
CN103577562B (en)
Inventor
王继民
朱跃龙
李士进
万定生
冯钧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Application filed by Hohai University HHU
Priority to CN201310508432.9A
Publication of CN103577562A
Application granted
Publication of CN103577562B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90348 Query processing by searching ordered data, e.g. alpha-numerically ordered data

Abstract

The invention discloses a multi-measurement time series similarity analysis method suitable for k-nearest-neighbor queries on time series. Several single similarity measures are chosen according to the analysis requirements. Each single similarity measure is used to query the m-nearest-neighbor sequences or subsequences of the query sequence; the m-nearest-neighbor sequences or subsequences under each measure are trimmed to obtain candidate similar sequences or subsequences; and the candidates are combined using a multi-classifier combination method with advantage weights to obtain the k-nearest-neighbor sequences of the query sequence. Compared with a single similarity measure, similarity analysis that combines multiple measures yields a more comprehensive result. The multi-classifier combination method with advantage weights draws on the BORDA count, but adjusts each ranking score according to the difference in similarity distance to the query sequence between adjacent candidate similar sequences or subsequences, so that the scores reflect the actual differences in similarity among the candidates.

Description

A multi-measurement time series similarity analysis method
Technical Field
The present invention relates to a multi-measurement time series similarity analysis method, and in particular to a method for k-nearest-neighbor similar time series analysis that combines multiple similarity measures. It belongs to the field of data mining technology.
Background
Time series similarity search finds, in a time series database, the time series that are similar to a given pattern. Searching for similar subsequences arises frequently in practice. For example, in the Human Genome Project, fragments similar to a given gene fragment are located in DNA sequences so that research can proceed from genetic similarity. From the sales records of many commodities, products with similar sales patterns are identified so that similar sales strategies can be formulated. Common precursors of natural disasters are identified so that disaster forecasting can be studied. In hydrology, historical flood processes similar to the current flood process are retrieved to answer a question often raised in flood control command: "Which period in history had a hydrological process similar to the current one?"
Similarity search was first proposed by R. Agrawal in 1993 and is an important foundation for time series prediction, classification, clustering, and sequential pattern mining. It differs from traditional exact queries: because time series values are continuous and subject to varying noise, an exact match is unnecessary in most cases. In addition, a similarity query does not target particular numeric values in a series; it looks for series that, over some period, share the morphological features and trend of the given query sequence. The problems to be solved in time series similarity search include feature extraction, indexing, and similarity measurement. For similarity measurement, researchers have proposed a variety of measures, such as the Euclidean distance and its variants based on the Lp norm, dynamic time warping distance (Dynamic Time Warping, DTW), edit distance (Edit Distance, ED), pattern distance (Pattern Distance, PD), and longest common subsequence (Longest Common Subsequence, LCSS).
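To make two of the measures named above concrete, the following is a minimal sketch of Euclidean distance and a textbook dynamic-programming DTW. The function names and the choice of |x - y| as the local DTW cost are assumptions made for illustration; the patent does not prescribe a particular implementation.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs sequences of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook O(len(a) * len(b)) DTW; local cost |x - y| is an assumption."""
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched again with an earlier b point
                cost[i][j - 1],      # b[j-1] matched again with an earlier a point
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[len(a)][len(b)]


query = [0.0, 1.0, 2.0, 1.0, 0.0]
shifted = [0.0, 0.0, 1.0, 2.0, 1.0]  # same peak, delayed by one step
print(euclidean_distance(query, shifted), dtw_distance(query, shifted))
```

On the shifted peak, DTW reports a smaller distance than the Euclidean measure, which illustrates why the two measures can rank the same candidates differently.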
Current time series similarity search mostly evaluates the similarity between sequences with a single similarity measure, and each measure assesses similarity from only one angle. For example, pattern distance and slope distance consider similarity of shape, Euclidean distance considers similarity of the actual values, and dynamic time warping (DTW) tolerates distortion of the sequence along the time axis. In practice, a single similarity model cannot evaluate time series similarity from several angles at once, so the results returned are often inaccurate and fail to reflect the user's overall judgment of similarity. In pattern recognition and machine learning, an important problem is the combination of multiple classifiers. Numerous studies show that a combination of multiple classifiers can achieve better results than a single base classifier: the combined decision is usually more convincing than that of a single classifier because it pools the strengths of each base classifier and therefore guides decision-making better. In time series similarity analysis, there is little literature on combining multiple measures. Fabris F et al. proposed a weight-based multi-measure time series similarity analysis [Fabris F, Drago I, Varejão F M. A multi-measure nearest neighbor algorithm for time series classification. Advances in Artificial Intelligence – IBERAMIA 2008. Springer Berlin Heidelberg, 2008: 153-162.], in which the weight of each measure is determined by heuristic search and the similarity distance is the weighted sum of the individual measure distances. That method takes considerable time to find the optimal weight vector and requires a predefined training set. The present invention draws on and improves the BORDA count, proposes a multi-classifier combination method with advantage weights, and uses it to combine and rank the candidate similar sequences (subsequences) produced by multiple similarity measures, obtaining the final similar sequences (subsequences).
Summary of the Invention
Object of the invention: The present invention provides a multi-measurement time series similarity analysis method that improves the efficiency of time series similarity analysis.
To achieve this object, the present invention draws on and improves the BORDA count, proposing a multi-classifier combination method with advantage weights that meets the need to combine and rank the similar sequences (subsequences) of each single similarity measure, and on that basis provides a multi-measurement time series k-nearest-neighbor query method. By the object being queried, time series similarity analysis divides into whole-sequence queries (Whole Match) and subsequence queries (Subsequence Match). In a whole-sequence query, the time series to be queried comprise multiple time series of equal or different lengths, and, given a query sequence, the sequences similar to it are retrieved from among them. In a subsequence query, subsequences similar to the given query sequence are retrieved from one long time series to be queried, and the result includes the offset of each similar subsequence within that series. The multi-measurement time series similarity analysis method of the present invention applies to k-nearest-neighbor queries for both whole sequences and subsequences.
Technical solution: A multi-measurement time series similarity analysis method, comprising the following steps:
Drawing on and improving the BORDA count, a multi-classifier combination method with advantage weights is proposed. When the candidate similar sequences (subsequences) produced by multiple single similarity measures are combined, the ranking score of each candidate is weighted by the quantitative gaps between the candidates, so that the ranking scores reflect the actual differences among them. The sum of a candidate's ranking scores is called its similar score; the candidates are sorted by similar score from high to low to obtain their final ranking.
According to the specific similarity analysis requirements (for example, similarity of shape, or tolerance of distortion along the time axis), several single similarity measures are selected from the existing time series similarity measures to serve as base classifiers.
Each selected similarity measure is used to perform similarity analysis on the time series to be queried, yielding m-nearest-neighbor sequences (subsequences), where m is larger than the final k.
Because the similar sequences (subsequences) produced by the individual measures generally do not share the same start times, they are trimmed: among the similar sequences (subsequences) produced by the single similarity measures, those that overlap in time by more than half the sequence length are aligned, and similar sequences (subsequences) in a period whose number of occurrences is less than half the number of similarity measures are deleted, yielding the candidate similar sequences (subsequences). Trimming comprises sequence grouping preprocessing, aligning overlapping sequences, deleting isolated sequences, and re-ranking the sequences.
The candidate similar sequences (subsequences) are combined and ranked using the multi-classifier combination method with advantage weights; they are sorted by similar score from high to low and the top k are taken as the final k-nearest-neighbor sequences (subsequences).
Beneficial effects: Compared with a traditional single similarity measure, the present invention considers several similarity factors at once, so the similarity result reflects the user's overall evaluation. Compared with the method of Fabris F, the present invention can combine the results of multiple measures without a training data set. Compared with the traditional BORDA count: the traditional method gives the first-ranked candidate n points, the second n-1 points, and so on down to 1 point for the last. These ranking scores do not reflect how large the gap is between candidate similar sequences (subsequences) at adjacent ranks, so in some cases the candidates cannot be ranked well. The multi-classifier combination method with advantage weights weights each candidate's ranking score by the similarity distances between the query sequence and the candidates produced by each single similarity measure, so that the ranking scores of successively ranked sequences reflect more specifically their differences in similarity to the query sequence, and the final similar sequences (subsequences) are more accurate.
Brief description of the drawings
Fig. 1 is a model diagram of the multi-measurement time series similarity analysis method of an embodiment of the present invention;
Fig. 2 is a flow chart of a similarity query using the multi-measurement time series similarity analysis method of an embodiment of the present invention;
Fig. 3 is a schematic diagram of similar-subsequence trimming for k-nearest-neighbor subsequence queries in the multi-measurement time series similarity analysis method of an embodiment of the present invention;
Fig. 4 is a schematic diagram of similar-sequence trimming for k-nearest-neighbor whole-sequence queries in the multi-measurement time series similarity analysis method of an embodiment of the present invention;
Fig. 5 shows the similarity query results for a single-peak flood process in the experiment, where (a) compares the Euclidean distance similar subsequences with the query sequence, (b) compares the DTW distance similar subsequences with the query sequence, (c) compares the slope distance similar subsequences with the query sequence, (d) compares the multi-measure similar subsequences of the multi-classifier combination method with advantage weights with the query sequence, and (e) compares the multi-measure similar subsequences of the BORDA count with the query sequence;
Fig. 6 shows the similarity query results for a double-peak flood process in the experiment, where (a) compares the Euclidean distance similar subsequences with the query sequence, (b) compares the DTW distance similar subsequences with the query sequence, (c) compares the slope distance similar subsequences with the query sequence, (d) compares the multi-measure similar subsequences of the multi-classifier combination method with advantage weights with the query sequence, and (e) compares the multi-measure similar subsequences of the BORDA count with the query sequence.
Detailed Description
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading the present invention, modifications of its various equivalent forms by those skilled in the art all fall within the scope defined by the claims appended to this application.
The present invention addresses the k-nearest-neighbor search problem, that is, querying the k sequences (subsequences) most similar to a specified sequence. From a classification point of view, k-nearest-neighbor similarity can be regarded as using a similarity measure to divide the time series into the 1st similar sequence (subsequence), the 2nd similar sequence (subsequence), ..., the k-th similar sequence (subsequence), and dissimilar sequences (subsequences). Classifying time series by similarity with multiple single similarity measures is therefore equivalent to using multiple classifiers. Numerous studies show that a combination of multiple classifiers can achieve better results than a single base classifier: the combined decision is usually more convincing than that of a single classifier because it pools the strengths of each base classifier.
In the model of the multi-measurement time series analysis method shown in Fig. 1, similarity queries are run on the time series with each of several similarity measures, and the query results of the individual measures are then combined using the multi-classifier combination method with advantage weights to obtain the final similar time series. The model has three parts. The first part takes as input the time series to be queried and the query time series, and selects the single similarity measures to be used. The second part applies each single similarity measure algorithm (equivalent to a base classifier) to the input time series to obtain the m-nearest-neighbor similar sequences (subsequences) of the query sequence. The third part trims the similar sequences (subsequences) output by the second part to produce candidate similar sequences (subsequences), combines and ranks the candidates using the multi-classifier combination method with advantage weights, and selects the top k sequences (subsequences) as the final k-nearest-neighbor sequences (subsequences).
Each single similarity measure serving as a base classifier is selected by the user from the existing similarity measures according to the analysis requirements (for example, similarity of shape, or tolerance of distortion along the time axis). Time series similarity analysis with a single similarity measure proceeds as follows: according to the requirements of the measure, extract the time series features, build the time series index, and, using the similarity measure, find the m-nearest-neighbor sequences (subsequences) of the query sequence. The value of m must be greater than k, so that after the m-nearest-neighbor time series are trimmed, more than k candidate similar sequences (subsequences) remain.
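As a rough illustration of the single-measure step, the sketch below runs a brute-force m-nearest-neighbor subsequence query with a pluggable distance function. It skips the feature extraction and indexing that the text mentions, and its rule for discarding windows that overlap an already accepted window by more than half their length is an assumption added here to avoid near-duplicate matches; all names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Distance = Callable[[Sequence[float], Sequence[float]], float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def m_nearest_subsequences(
    series: Sequence[float], query: Sequence[float], distance: Distance, m: int
) -> List[Tuple[int, float]]:
    """Return up to m (start index, distance) pairs, closest first."""
    length = len(query)
    # Score every window of the series that has the same length as the query.
    scored = sorted(
        (distance(series[s:s + length], query), s)
        for s in range(len(series) - length + 1)
    )
    chosen: List[Tuple[int, float]] = []
    for dist, start in scored:
        # Assumption: two windows overlap by more than length/2 exactly when
        # their start indices differ by less than length/2; skip such windows.
        if all(abs(start - s) >= length / 2 for s, _ in chosen):
            chosen.append((start, dist))
        if len(chosen) == m:
            break
    return chosen


series = [0, 0, 1, 3, 1, 0, 0, 0, 1, 3, 1, 0, 0, 5, 5, 5]
print(m_nearest_subsequences(series, [1, 3, 1], euclidean, m=2))
```

Running the same function once per selected measure, with m larger than the final k, produces the per-measure neighbor lists that the trimming step consumes.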
Because the similar sequences (subsequences) produced by the individual single similarity measures generally do not share the same start times, they must be trimmed. Among the m-nearest-neighbor sequences (subsequences) of each single similarity measure, those that overlap in time by more than half the sequence length are aligned, and similar sequences (subsequences) in a period whose number of occurrences is less than half the number of similarity measures are deleted, yielding the candidate similar sequences (subsequences). The specific steps are sequence grouping preprocessing, aligning overlapping sequences, deleting isolated sequences, and re-ranking the sequences. Assume that d single similarity measures are involved and that the similar sequences (subsequences) have length l.
1. Sequence grouping preprocessing: All similar sequences (subsequences) are grouped so that, within a group, every sequence (subsequence) overlaps at least one other sequence (subsequence) of the same group by more than half the sequence length, and overlaps no sequence (subsequence) of any other group by more than half the sequence length. A similar sequence (subsequence) that overlaps no other sequence (subsequence) by more than half the sequence length forms a group by itself.
2. Aligning overlapping sequences: For each group generated in step 1, if the number of sequences in the group exceeds half of the number of similarity measures d (that is, more than half of the single similarity measures consider this segment similar to the query sequence), the sequences of the group are aligned. The alignment operation differs between subsequence queries and whole-sequence queries. In a subsequence query, the mean t of the start times of all sequences in the group is computed, and the subsequence of length l starting at t in the time series to be queried is taken as the candidate similar subsequence. In whole-sequence similarity analysis, the similar sequences produced by the individual measures either overlap completely or do not overlap at all. From the point of view of each single similarity measure, the candidate similar sequence (subsequence) is assigned the same similarity distance to the query sequence as the similar sequence (subsequence) that was aligned. When overlapping similar sequences (subsequences) are aligned, if the number of overlapping similar subsequences in a group is more than half the number of similarity measures but less than the number of similarity measures, the candidate similar sequence (subsequence) obtained by alignment is also added as a similar sequence (subsequence) of each remaining single similarity measure, and its similarity distance to the query sequence is computed with that measure.
3. Deleting isolated sequences: For each group generated in step 1, if the number of sequences in the group is less than half the number of similarity measures, all similar sequences (subsequences) in the group are deleted and excluded from the subsequent ranking.
4. Re-ranking the sequences: Because similar subsequences have been added and isolated similar subsequences deleted, the candidate similar sequences (subsequences) are re-ranked under each single similarity measure.
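The grouping, alignment, and deletion steps for a subsequence query can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: start times are plain numbers, groups are formed by chaining windows whose start times differ by less than l/2 (equivalent to an overlap of more than half the length for equal-length windows), and a group whose size equals exactly half of d is dropped, a case the text leaves open. The function name and the start times in the usage example are hypothetical.

```python
from typing import List, Tuple


def trim_subsequence_results(
    starts_per_measure: List[List[float]], length: float
) -> List[Tuple[float, List[int]]]:
    """Return (aligned start time, contributing measure indices) per candidate."""
    d = len(starts_per_measure)
    # Flatten to (start time, measure index) pairs ordered by start time.
    tagged = sorted(
        (start, k) for k, starts in enumerate(starts_per_measure) for start in starts
    )
    # Step 1: chain neighbours whose start times differ by less than length / 2.
    groups: List[List[Tuple[float, int]]] = []
    for item in tagged:
        if groups and item[0] - groups[-1][-1][0] < length / 2:
            groups[-1].append(item)
        else:
            groups.append([item])
    candidates = []
    for group in groups:
        # Step 3: groups that do not exceed d / 2 members are dropped.
        if len(group) > d / 2:
            # Step 2: align the group at the mean of its start times.
            mean_start = sum(start for start, _ in group) / len(group)
            measures = sorted({k for _, k in group})
            candidates.append((mean_start, measures))
    return candidates


# Three measures, 3 neighbours each, subsequence length 10 (hypothetical times).
found = [[0, 40, 80], [2, 42, 100], [5, 20, 83]]
for start, measures in trim_subsequence_results(found, length=10):
    print(start, measures)
```

Each returned candidate lists the measures that contributed to it, so any measure missing from that list is one whose distance to the candidate still has to be computed before re-ranking (step 4).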
The multi-classifier combination method with advantage weights draws on the traditional weighted-voting BORDA count and improves on its shortcomings. It is simple to compute and, more importantly, needs no training set. Under the traditional BORDA count, let k be the number of final similar sequences (subsequences) and m the number of candidate similar sequences (subsequences). Each of the n similarity measures ranks all candidate similar sequences (subsequences) from most to least similar to express its preference. For each measure's ranking, every candidate similar sequence (subsequence) is assigned a ranking score: the last-ranked candidate scores 1 point, the second-to-last 2 points, and so on, up to m points for the first-ranked candidate. The sum of a candidate's ranking scores is called its similar score, and the k candidates with the highest similar scores are the k-nearest-neighbor sequences. However, the traditional ranking score considers only the rank order of the candidates and ignores the actual differences in their degree of similarity. Consequently, when the individual measures rank the candidates very differently, the result may not accurately reflect the differences in similarity among the candidates. The complete ranking information therefore needs to be considered: the ranking of the candidate similar subsequences comprises both their order and the size of the difference in similarity to the query sequence between consecutive candidates.
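For reference, the traditional BORDA count described above can be sketched in a few lines. The function names are hypothetical, and each ranking is assumed to list the same m candidates from most to least similar.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def borda_scores(rankings: List[List[Hashable]]) -> Dict[Hashable, int]:
    """Sum the BORDA ranking scores of each candidate over all measures."""
    totals: Dict[Hashable, int] = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            # First place earns m points, last place earns 1 point.
            totals[candidate] += m - position
    return dict(totals)


def borda_top_k(rankings: List[List[Hashable]], k: int) -> List[Hashable]:
    """Return the k candidates with the highest accumulated scores."""
    totals = borda_scores(rankings)
    return sorted(totals, key=lambda c: -totals[c])[:k]


rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
print(borda_scores(rankings), borda_top_k(rankings, 2))
```

Only rank positions enter the score, which is the limitation discussed above: a candidate that is barely ahead and one that is far ahead receive the same one-point advantage.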
In the combination process, the multi-classifier combination method with advantage weights of the present invention assigns a corresponding weight, called the advantage weight and denoted ω, to the query results of each single similarity measure (base classifier). The weight adjusts the ranking scores so that they reflect the difference in similarity between consecutively ranked candidate similar sequences (subsequences).
The advantage weight reflects the gap in similarity ranking between two adjacent candidate similar sequences (subsequences) in a base classifier. Given a query sequence Q, suppose a similarity measure (such as Euclidean distance, DTW, or slope distance) returns the m most similar time series, numbered $S_i$ (i = 1, 2, ..., m), whose similarity distances to the query sequence are denoted $d_i$ (i = 1, 2, ..., m). When i > j (i, j = 1, 2, ..., m), $d_i > d_j$; that is, $d_i$ (i = 1, 2, ..., m) is monotonically increasing. Write $\Delta d_i = d_{i+1} - d_i > 0$ (i = 1, 2, ..., m-1). The larger $\Delta d_i$ is, the larger the difference in similarity between the similar sequences (subsequences) $S_{i+1}$ and $S_i$ relative to the same query sequence Q; the smaller $\Delta d_i$ is, the smaller that difference. The advantage weight, denoted ω, is computed by formula (2).
$$\omega_i^k = \frac{\Delta d_i}{\sum_{i=1}^{m-1} \Delta d_i}, \quad i = 1, 2, \ldots, m-1 \qquad (2)$$
where $\omega_i^k$ denotes the advantage weight of similar sequence (subsequence) $S_i$ relative to $S_{i+1}$ in the k-th similarity measure. In the query result of the k-th similarity measure, the ranking score with advantage weight of the i-th similar sequence (subsequence) $S_i$ is given by formula (3):
$$r_0^k = m - 1, \qquad r_i^k = r_{i-1}^k - (m-1)\,\omega_i^k, \quad i = 1, 2, \ldots, m-1 \qquad (3)$$
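Formulas (2) and (3) can be computed directly from one measure's sorted distances, as in the sketch below. The function names are hypothetical. The code assumes that the returned list is ordered by rank, so its first entry ($r_0$) belongs to the closest sequence and its last entry ($r_{m-1}$) to the farthest; it also assumes the distances are sorted in ascending order and not all equal, since formula (2) is undefined when every gap is zero.

```python
from typing import List, Sequence


def advantage_weights(sorted_distances: Sequence[float]) -> List[float]:
    """Formula (2): each gap between adjacent distances over the sum of gaps."""
    gaps = [b - a for a, b in zip(sorted_distances, sorted_distances[1:])]
    if any(g < 0 for g in gaps):
        raise ValueError("distances must be sorted in ascending order")
    total = sum(gaps)
    if total == 0:
        raise ValueError("all distances are equal, so formula (2) is undefined")
    return [g / total for g in gaps]


def weighted_rank_scores(sorted_distances: Sequence[float]) -> List[float]:
    """Formula (3): r_0 = m - 1 and r_i = r_{i-1} - (m - 1) * omega_i."""
    m = len(sorted_distances)
    scores = [float(m - 1)]
    for weight in advantage_weights(sorted_distances):
        scores.append(scores[-1] - (m - 1) * weight)
    return scores


# Four candidates with widening gaps between consecutive distances.
print(advantage_weights([1.0, 2.0, 4.0, 7.0]))
print(weighted_rank_scores([1.0, 2.0, 4.0, 7.0]))
```

For the distances 1, 2, 4, 7 the gaps are 1, 2, 3, so the weights are 1/6, 1/3, 1/2 and the scores fall from 3 to 2.5, 1.5 and 0 (up to floating-point rounding): the larger the gap to the next candidate, the larger the drop in score.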
The similar score of a candidate similar sequence (subsequence) is the sum of its ranking scores over all similarity measures. That is, if a time series appears among the candidate similar sequences (subsequences) of m similarity measures, with ranking scores $r_1, r_2, \ldots, r_m$ in the respective measures, then the similar score of that sequence (subsequence) is $\sum_{i=1}^{m} r_i$. Ranked by similar score, the time series (subsequence) with the highest final similar score is the one most similar to the query sequence.
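Putting the pieces together, the sketch below sums the advantage-weighted ranking scores over all measures and returns the top k candidates. It assumes every candidate has a distance under every measure (as is the case after trimming), uses hypothetical names, and falls back to equal weights when all distances under a measure tie, a case the text does not cover.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def weighted_rank_scores(sorted_distances: Sequence[float]) -> List[float]:
    """Ranking scores per formulas (2) and (3) for ascending distances."""
    m = len(sorted_distances)
    gaps = [b - a for a, b in zip(sorted_distances, sorted_distances[1:])]
    total = sum(gaps)
    scores = [float(m - 1)]
    for gap in gaps:
        # Assumption: if every distance ties, use equal weights 1 / (m - 1).
        weight = gap / total if total > 0 else 1.0 / (m - 1)
        scores.append(scores[-1] - (m - 1) * weight)
    return scores


def combine_top_k(
    distances_per_measure: List[Dict[Hashable, float]], k: int
) -> Tuple[List[Hashable], Dict[Hashable, float]]:
    """Return the k best candidates and every candidate's similar score."""
    totals: Dict[Hashable, float] = {}
    for distances in distances_per_measure:
        ranked = sorted(distances, key=distances.get)  # closest first
        scores = weighted_rank_scores([distances[c] for c in ranked])
        for candidate, score in zip(ranked, scores):
            totals[candidate] = totals.get(candidate, 0.0) + score
    ordered = sorted(totals, key=totals.get, reverse=True)
    return ordered[:k], totals


measures = [
    {"A": 1.0, "B": 1.1, "C": 5.0},  # A and B almost tie under measure 1
    {"B": 1.0, "A": 4.0, "C": 5.0},  # B is clearly ahead under measure 2
]
print(combine_top_k(measures, k=2))
```

In this made-up example the plain BORDA count would tie A and B at 5 points each, while the advantage-weighted scores (about 3.95 for B against 2.5 for A) separate them, because A's lead under the first measure is marginal and B's lead under the second is large.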
In particular, among the candidate similar time series of the k-th similarity measure, when $\Delta d_1 = \Delta d_2 = \cdots = \Delta d_{m-1}$, that is, when $\omega_1 = \omega_2 = \cdots = \omega_{m-1} = 1/(m-1)$, the ranking score of the i-th candidate similar sequence (subsequence) is:
$$r_i^k = r_{i-1}^k - (m-1)\,\omega_i = r_{i-1}^k - 1 \qquad (4)$$
This is the traditional BORDA count. The traditional BORDA scoring method is thus the special case of the multi-classifier combination method with advantage weights in which $\omega_i = 1/(m-1)$ (i = 1, 2, ..., m-1).
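A small worked example, with distances chosen purely for illustration, shows the special case next to the general one for m = 4. The block assumes the `amsmath` package for `aligned` and `\text`.

```latex
\[
\begin{aligned}
&\text{Equal gaps, } d = (1, 2, 3, 4):
  && \omega = \left(\tfrac13, \tfrac13, \tfrac13\right),
  && r = (3,\ 2,\ 1,\ 0) \\
&\text{Unequal gaps, } d = (1, 2, 4, 7):
  && \omega = \left(\tfrac16, \tfrac13, \tfrac12\right),
  && r = (3,\ 2.5,\ 1.5,\ 0)
\end{aligned}
\]
```

With equal gaps, each step of formula (3) subtracts $(m-1) \cdot \tfrac{1}{m-1} = 1$, giving the scores 3, 2, 1, 0. These are the BORDA scores 4, 3, 2, 1 shifted down by one; when every candidate is ranked by every measure, that constant offset does not change the final ordering.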
Fig. 2 shows the flow chart of the multi-measurement time series similarity analysis method of the present invention. The steps are as follows:
Step 101: The time series to be queried is the time series that is searched; in a subsequence query it is usually a sequence of long duration.
Step 102: Select multiple single similarity measures from the existing similarity measures. The selection should allow the similarity of sequences to be evaluated from several angles, for example similarity of shape, or tolerance of shifts along the time axis.
Step 103: The query time series may be extracted from the time series to be queried or may be a new time series.
Step 104: According to the analysis requirements of each selected single similarity measure, extract the time series features of the time series to be queried and of the query time series, and build the index.
Step 105: Perform similarity analysis with each selected single similarity measure to produce that measure's m-nearest-neighbor sequences.
Step 106: Determine whether any similarity measure has not yet been used for similarity analysis. If yes, return to step 105 and perform similarity analysis with the next similarity measure; otherwise, go to step 107.
Step 107: Trim the similar sequences (subsequences) according to how the m-nearest-neighbor sequences of the individual single similarity measures overlap in time. Trimming comprises sequence grouping preprocessing, aligning overlapping sequences, deleting isolated sequences, and ranking the candidate similar sequences. This example first describes the trimming process for a subsequence query; the trimming process for a whole-sequence query is described afterwards. In the example, each single similarity measure performs a 3-nearest-neighbor search, with the results shown in Fig. 3. The similar subsequences of the 1st similarity measure are $s_{11}$ (the subsequence from $t_{11}$ to $t_{11}+l$), $s_{12}$ (from $t_{12}$ to $t_{12}+l$), and $s_{13}$ (from $t_{13}$ to $t_{13}+l$). Note that the subsequences are labeled here only in their temporal order under each single similarity measure, which does not indicate their order of similarity to the query sequence. Likewise, the similar subsequences of the 2nd similarity measure are $s_{21}$ (from $t_{21}$ to $t_{21}+l$), $s_{22}$ (from $t_{22}$ to $t_{22}+l$), and $s_{23}$ (from $t_{23}$ to $t_{23}+l$), and those of the 3rd similarity measure are $s_{31}$ (from $t_{31}$ to $t_{31}+l$), $s_{32}$ (from $t_{32}$ to $t_{32}+l$), and $s_{33}$ (from $t_{33}$ to $t_{33}+l$).
(1) Sequence grouping preprocessing
All similar subsequences are grouped so that every sequence in a group overlaps at least one other sequence of the same group in time by more than half the sequence length. Preprocessing the similar sequences in Fig. 3 yields 5 groups: ① $s_{11}, s_{21}, s_{31}$ ($s_{11}$ and $s_{21}$ overlap by more than half, and $s_{21}$ and $s_{31}$ overlap by more than half); ② $s_{32}$; ③ $s_{12}, s_{22}$; ④ $s_{13}, s_{33}$; ⑤ $s_{23}$.
(2) Aligning overlapping sequences
Groups ①, ③, and ④ each contain more sequences than half of the number of similarity measures (3), so each must be aligned. For group ①, take the mean $t_{c1}$ of the three times $t_{11}, t_{21}, t_{31}$; the subsequence of length l starting at $t_{c1}$ in the time series to be queried is the candidate similar subsequence $s_{c1}$ (start time $t_{c1}$, length l). For each single similarity measure, the distance between $s_{c1}$ and the query sequence is taken to be the distance between the query sequence and the aligned sequence from that measure. That is, under the 1st similarity measure the distance between $s_{c1}$ and the query sequence is that of $s_{11}$; under the 2nd, that of $s_{21}$; and under the 3rd, that of $s_{31}$. For group ③, compute the mean $t_{c2}$ of the two times $t_{12}, t_{22}$; the subsequence of length l starting at $t_{c2}$ in the time series to be queried is the candidate similar subsequence $s_{c2}$ (start time $t_{c2}$, length l). However, $s_{c2}$ does not appear among the similar subsequences of the 3rd similarity measure, so its distance to the query sequence must be computed with the 3rd similarity measure function before it takes part in the subsequent ranking. The alignment of group ④ is handled like that of group ③.
(3) Deleting isolated sequences
Groups ② and ⑤ each contain fewer sequences than half of the number of single similarity measures (3), so they are deleted and not considered further.
(4) Ranking the candidate similar sequences
The candidate similar sequences are ranked again for each single similarity measure. After the 3 steps above, the example yields three candidate similar subsequences $s_{c1}$, $s_{c2}$, and $s_{c3}$ of length l with start times $t_{c1}, t_{c2}, t_{c3}$. Because $s_{c2}$ and $s_{c3}$ appear as new similar sequences for some single similarity measures, their similarity distances to the query sequence must be computed under those measures; the candidate similar sequences are then ranked separately from the point of view of each single similarity measure.
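The re-ranking in step (4) can be sketched as follows: each measure keeps the distances it inherited from its own aligned sequences and computes a distance only for candidates that are new to it. The function name, the integer start indices, and the two toy distance functions are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Distance = Callable[[Sequence[float], Sequence[float]], float]


def rerank_candidates(
    series: Sequence[float],
    query: Sequence[float],
    candidate_starts: List[int],
    measures: List[Distance],
    inherited: List[Dict[int, float]],
) -> List[List[Tuple[int, float]]]:
    """Per measure, return (start, distance) pairs sorted closest first."""
    length = len(query)
    rankings = []
    for k, measure in enumerate(measures):
        scored = []
        for start in candidate_starts:
            dist = inherited[k].get(start)
            if dist is None:
                # The candidate is new to measure k: compute its distance.
                dist = measure(series[start:start + length], query)
            scored.append((start, dist))
        rankings.append(sorted(scored, key=lambda pair: pair[1]))
    return rankings


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


series = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
inherited = [{0: 9.0, 2: 0.5}, {2: 0.0, 5: 3.5}]  # hypothetical inherited distances
print(rerank_candidates(series, [2.0, 3.0], [0, 2, 5],
                        [manhattan, chebyshev], inherited))
```

The per-measure sorted distances produced here are the input from which the advantage-weighted ranking scores are then computed.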
Step 108: Combine the rankings of the candidate similar subsequences using the multi-classifier combination method with advantage weights, and calculate the final similarity score of each candidate.
Step 109: Sort all final candidate similar subsequences by final similarity score, from high to low.
Step 110: Take the top k candidate similar subsequences as the k-nearest-neighbour similar sequences of the query sequence.
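Steps 108 to 110 amount to summing the per-measure ranking scores of each candidate and keeping the k highest totals. A minimal sketch, assuming the per-measure scores have already been computed and are passed in as dictionaries (the function name `top_k_candidates` and the candidate labels are hypothetical):

```python
from collections import defaultdict


def top_k_candidates(scores_per_measure, k):
    """Sum each candidate's ranking scores over all measures and
    return the k candidates with the highest final similarity score."""
    totals = defaultdict(float)
    for scores in scores_per_measure:
        for candidate, score in scores.items():
            totals[candidate] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


example_scores = [
    {"sc1": 5.0, "sc2": 3.0, "sc3": 1.0},  # ranking scores under measure 1
    {"sc1": 4.0, "sc2": 5.0, "sc3": 1.0},  # ranking scores under measure 2
]
example_top = top_k_candidates(example_scores, k=2)
```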
The processing flow of the present invention for whole-sequence queries is the same as for subsequence queries, but some processing details differ. The difference lies in the "similar sequence (subsequence) trimming" of step 107. For whole-sequence similarity queries, this trimming proceeds as follows:
In a whole-sequence similarity query, the similar sequences of the single similarity measures either overlap completely in time or do not overlap at all, so aligning overlapping sequences is comparatively easy. Fig. 4 shows the result of querying one query sequence under each of 3 single similarity measures, each of which returns 5 nearest-neighbour sequences. The similar sequences of the single similarity measures are drawn from the sequences to be queried t0, t1, …, t6 (each sequence is identified here by its start time). Under the 1st similarity measure, the 5 nearest neighbours of the query sequence are t0, t1, t3, t4, t5; under the 2nd similarity measure they are t1, t2, t4, t5, t6; under the 3rd similarity measure they are t0, t2, t3, t4, t5. The order shown in the figure does not represent the similarity ranking of the 5 similar sequences under each single similarity measure. For example, the 5 similar sequences of the 1st similarity measure, ranked by similarity to the query sequence, might be t1, t0, t4, t3, t5.
(1) Group the sequences (preprocessing)
All similar sequences are grouped. Because the similar sequences either overlap completely in time or do not overlap at all, they are finally divided into the groups t0, t1, …, t6.
(2) Align overlapping sequences
In a whole-sequence query, all time series in the same group have the same start time, so no alignment is needed. However, t0 must be added as a new similar sequence of the 2nd similarity measure, t1 as a new similar sequence of the 3rd similarity measure, t2 as a new similar sequence of the 1st similarity measure, and t3 as a new similar sequence of the 2nd similarity measure.
(3) Delete isolated sequences
t6 appears among the similar sequences of only one similarity measure, which is less than half the number of similarity measures, so t6 is deleted.
(4) Rank the candidate similar sequences
Steps (1), (2) and (3) yield the candidate similar sequences t0, t1, t2, t3, t4, t5. For each single similarity measure, the similarity distance between each newly added similar sequence and the query sequence is recalculated, and the candidate similar sequences of that measure are ranked.
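The whole-sequence example above can be reproduced with a short counting sketch. The function name `trim_full_sequences` is hypothetical, and the sequence labels are simply the start-time identifiers used in the text; a sequence is kept when at least half of the measures return it.

```python
from collections import Counter


def trim_full_sequences(neighbours_per_measure):
    """Return (kept candidates, sequences to add per measure)."""
    n_measures = len(neighbours_per_measure)
    counts = Counter(
        seq for neighbours in neighbours_per_measure for seq in set(neighbours)
    )
    # Delete isolated sequences: returned by fewer than half of the measures.
    kept = sorted(seq for seq, c in counts.items() if c >= n_measures / 2)
    # Each measure must recompute the distance for the candidates it lacks.
    to_add = [sorted(set(kept) - set(nb)) for nb in neighbours_per_measure]
    return kept, to_add


neighbours = [
    ["t0", "t1", "t3", "t4", "t5"],  # 1st similarity measure
    ["t1", "t2", "t4", "t5", "t6"],  # 2nd similarity measure
    ["t0", "t2", "t3", "t4", "t5"],  # 3rd similarity measure
]
kept, to_add = trim_full_sequences(neighbours)
```

Here `kept` contains t0 to t5 (t6 is dropped), and `to_add` lists t2 for the 1st measure, t0 and t3 for the 2nd, and t1 for the 3rd, matching the description.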
The effect of the multi-measure time series similarity analysis method of the present invention is described below on the basis of experiments. The data are the daily flow records of a large sluice gate from June 1 to September 30 of each year, covering the period from June 1, 1998 to July 12, 2009, with 4 monitoring time points per day: 2:00, 8:00, 14:00 and 20:00. Euclidean distance, slope distance and DTW distance are selected as the similarity measures, and the features of the flood time series are extracted on the basis of feature points. Flood processes of two forms, "single-peak inverted-V" and "double-peak M-shaped", are chosen as query sequences; each query sequence is a subsequence of the sequence to be queried. Similarity queries are performed with the sliding-window subsequence matching method, and the multi-measure combination is performed both with the traditional BORDA counting method and with the multi-classifier combination method with advantage weights.
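As an illustration of sliding-window subsequence matching with one of the measures named above (Euclidean distance), the sketch below scans every window of the query's length and returns the m closest ones. It is a brute-force illustration under stated assumptions: the function name is hypothetical, no feature extraction or index is used, and overlapping windows are not filtered out.

```python
import math


def sliding_window_neighbours(series, query, m):
    """Return the m windows of `series` closest to `query` as
    (euclidean_distance, start_index) pairs, nearest first."""
    length = len(query)
    distances = [
        (math.dist(series[start:start + length], query), start)
        for start in range(len(series) - length + 1)
    ]
    return sorted(distances)[:m]


flow = [0, 1, 5, 1, 0, 0, 2, 5, 2, 0]   # made-up values, not the patent's data
peak = [1, 5, 1]                        # made-up single-peak query
nearest = sliding_window_neighbours(flow, peak, m=2)
```

The same loop could be run with a slope-based or DTW distance function in place of `math.dist` to obtain the m nearest neighbours under the other measures.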
(1) Similarity analysis of the "single-peak inverted-V" flood process
The "single-peak inverted-V" flood process time series from 2000.7.31 2:00 to 2000.8.29 20:00 is chosen as the query sequence for similarity analysis. The results of each similarity measure and of the multi-measure combinations are shown in Table 1, and Fig. 4 compares the similar subsequences with the query sequence.
Table 1 Similar subsequences of the single-peak flood process
Figure BDA0000401230270000111
In Table 1, the similar subsequences in the result of the multi-classifier combination with advantage weights are subsequences that appear repeatedly, and rank near the top, in the results of the single similarity measures. The subsequence starting at 2004.7.1 2:00 is eliminated because it appears only in the DTW distance result and its ranking gap is large, which gives it a low similarity score. The subsequences starting at 2007.6.16 2:00 and 2008.8.1 8:00 each appear in the results of two single similarity measures, but the other repeatedly occurring subsequences have smaller gaps in their respective rankings, which raises their similarity scores; these two subsequences therefore fall towards the bottom of the final ranking and are eliminated. Fig. 4(d) shows that the multi-classifier combination with advantage weights integrates the strengths of the three similarity measures: the flood-peak evolution of the retrieved similar subsequences is almost identical to that of the query sequence, and the three filtered-out subsequences are less similar than the other similar subsequences in Fig. 4(d). The BORDA counting method returns the same subsequences as the multi-classifier combination with advantage weights, but the latter also provides a ranking for some of the sequences.
(2) Similarity analysis of the "double-peak M-shaped" flood process
The "double-peak M-shaped" flood discharge process time series from 2000.8.15 2:00 to 2000.9.13 20:00 is chosen as the query sequence for similarity analysis. The results of each similarity measure and of the multi-measure combinations are shown in Table 2, and Fig. 5 compares the similar subsequences with the query sequence.
Table 2 Similar subsequences of the double-peak flood process
Figure BDA0000401230270000112
Figure BDA0000401230270000121
In Table 2, every subsequence in the query result of the multi-classifier combination with advantage weights, except the one starting at 2007.8.15 2:00, appears repeatedly in the results of the single similarity measures. The subsequences starting at 2007.7.16 2:00, 2003.7.1 2:00 and 2005.7.1 2:00 appear repeatedly under different single similarity measures, but their ranking gaps are large, so their similarity scores are low and they are eliminated. Fig. 5(d) shows that the flood-peak evolution of the similar subsequences retrieved by the multi-classifier combination with advantage weights is almost identical to that of the query sequence; all of them are double-peak flood processes. The three eliminated subsequences are less similar than the other similar subsequences in Fig. 5(d).
Compared with multi-measure combination by the BORDA counting method, the multi-classifier combination with advantage weights eliminates the three-peak flood discharge process subsequence starting at 2003.7.1 2:00 and retains the double-peak flood discharge process subsequence starting at 2007.8.15 2:00. The subsequence starting at 2003.7.1 2:00 appears twice, but its ranking gap is large, so its similarity score is considerably lower than its traditional BORDA score. The subsequence starting at 2007.8.15 2:00 appears only once, but its ranking gap is small, so its similarity score is higher than its traditional BORDA score. Fig. 5(d) and Fig. 5(e) show that the subsequence starting at 2007.8.15 2:00 is more similar to the query sequence than the subsequence starting at 2003.7.1 2:00. In addition, multi-measure combination by the BORDA counting method does not rank the final similar subsequences well.
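For reference, the traditional BORDA baseline used in this comparison can be sketched as a rank-only count: in a list of m candidates the first place earns m points and the last place 1 point, and points are summed over all measures. The function name and the alphabetical tie-breaking are assumptions made for illustration. Because only positions matter, two candidates ranked next to each other always differ by exactly one point, however far apart their similarity distances are.

```python
def borda_count(rankings):
    """rankings: one best-first list of candidates per similarity measure.
    Returns (candidate, total_points) pairs, highest total first."""
    totals = {}
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            totals[candidate] = totals.get(candidate, 0) + (m - position)
    # Assumption: ties are broken alphabetically by candidate label.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


example_rankings = [
    ["a", "b", "c"],  # ranking under measure 1
    ["b", "a", "c"],  # ranking under measure 2
    ["a", "c", "b"],  # ranking under measure 3
]
borda_result = borda_count(example_rankings)
```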

Claims (7)

1. A multi-measure time series similarity analysis method, applicable to k-nearest-neighbour queries on time series, characterized in that the method comprises the following steps:
selecting multiple single similarity measures as base classifiers according to the analysis requirements;
extracting features from the time series to be queried as required by the selected single similarity measures, and building an index;
performing similarity analysis on the sequence to be queried with each single similarity measure to obtain the m-nearest-neighbour time series of the query sequence;
trimming the m-nearest-neighbour time series under each single similarity measure to obtain candidate similar sequences or subsequences;
combining the candidate similar sequences or subsequences using the multi-classifier combination method with advantage weights to obtain the final k-nearest-neighbour time series.
2. The multi-measure time series similarity analysis method according to claim 1, characterized in that each single similarity measure serving as a base classifier is selected by the user from existing similarity measures according to the analysis requirements; each single similarity measure divides the sequences to be queried into m+1 classes: the 1st similar sequence, the 2nd similar sequence, …, the m-th similar sequence, and the dissimilar sequences.
3. The multi-measure time series similarity analysis method according to claim 1, characterized in that the analysis step of each single similarity measure is specifically: extracting time series features, building a time series index, and retrieving the m-nearest-neighbour time series with a time series similarity search algorithm in combination with the similarity measure, where m is slightly larger than k.
4. The multi-measure time series similarity analysis method according to claim 1, characterized in that the step of trimming the m-nearest-neighbour sequences under each single similarity measure is specifically: arranging the m-nearest-neighbour sequences of each single similarity measure in time order, and trimming the similar sequences of different single similarity measures that overlap by more than half the sequence length; the trimming method is to select a new time series to replace the overlapping sequences, the start of the new sequence being the mean of the start times of the overlapping sequences; if the new sequence does not appear among the m-nearest-neighbour sequences of a single similarity measure, adding it as a similar sequence of that measure and recalculating its similarity distance to the query sequence with that similarity measure; and deleting the similar sequences whose number of occurrences among the m-nearest-neighbour sequences of all single similarity measures is less than half the number of measures.
5. The multi-measure time series similarity analysis method according to claim 1, characterized in that the specific steps of combining the candidate similar sequences or subsequences using the multi-classifier combination method with advantage weights are: first, for each single similarity measure, calculating with the combination method with advantage weights the ranking score of each sequence among the similar sequences or subsequences it produces; accumulating the ranking scores of each candidate similar sequence or subsequence to obtain its similarity score; sorting all candidate similar sequences or subsequences by similarity score from high to low; the top k candidate similar sequences or subsequences are the k-nearest-neighbour sequences of the query sequence.
6. The multi-measure time series similarity analysis method according to claim 1, characterized in that the multi-classifier combination method with advantage weights draws on the BORDA counting method and improves it, the specific improvement being: the ranking scores of the similar sequences or subsequences are weighted according to the similarity distances between the candidate similar sequences or subsequences and the query sequence, so that the ranking scores of adjacently ranked similar sequences or subsequences reflect the gap in their similarity to the query sequence; the ranking scores of a candidate similar sequence or subsequence are accumulated to obtain the similarity score of that sequence.
7. The multi-classifier combination method with advantage weights according to claim 6, characterized in that: for each single similarity measure, the candidate similar sequences or subsequences of that similarity measure are first arranged by similarity distance from low to high (that is, by similarity from high to low); the first-ranked sequence scores m points and the last-ranked sequence scores 1 point; the ranking score of the sequence in the i-th position is
$$ r_0 = m - 1, \qquad r_i = r_{i-1} - (m - 1) \times \omega_i, \quad i = 1, 2, \cdots, m - 1 \qquad (1) $$
where
$$ \omega_i = \Delta d_i \Big/ \sum_{j=1}^{m-1} \Delta d_j, \quad i = 1, 2, \ldots, m - 1 $$
ω is the advantage weight between two adjacent candidate similar sequences or subsequences. The ranking scores of a candidate similar sequence or subsequence under each single similarity measure are accumulated to obtain the similarity score of that candidate; the size of the similarity score reflects the degree of similarity between the candidate similar sequence or subsequence and the query sequence.
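The recurrence in formula (1) can be sketched in a few lines. Several points are assumptions rather than statements of the claim: Δd_i is read here as the difference between the similarity distances of the candidates ranked i and i-1, the function name is hypothetical, and equal weights are used when all distances coincide. The starting score r_0 is exposed as a parameter because formula (1) as printed starts from m - 1, while the accompanying prose gives m points to first place and 1 point to last place, which corresponds to starting from m.

```python
def advantage_weight_scores(distances, r0=None):
    """Ranking scores for one similarity measure.

    distances: similarity distances of the m candidates, sorted ascending
               (most similar first).
    r0: score of the first-ranked candidate; defaults to m - 1 as in
        formula (1). Pass r0=m to obtain scores running from m down to 1.
    """
    m = len(distances)
    if r0 is None:
        r0 = m - 1
    # Assumption: delta d_i is the distance gap between adjacent ranks.
    gaps = [distances[i] - distances[i - 1] for i in range(1, m)]
    total_gap = sum(gaps)
    scores = [float(r0)]
    for gap in gaps:
        # Assumption: fall back to equal weights if all distances are equal.
        weight = gap / total_gap if total_gap > 0 else 1 / (m - 1)
        scores.append(scores[-1] - (m - 1) * weight)
    return scores


example_distances = [1.0, 2.0, 4.0, 5.0]
scores_from_formula = advantage_weight_scores(example_distances)
scores_m_to_1 = advantage_weight_scores(example_distances, r0=4)
```

With these made-up distances the middle gap is twice as large as the others, so the score drops by 1.5 between ranks 2 and 3 and by 0.75 elsewhere, whereas a plain rank count would drop by 1 at every step.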
CN201310508432.9A 2013-10-24 2013-10-24 Multi-measurement time series similarity analysis method Active CN103577562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310508432.9A CN103577562B (en) 2013-10-24 2013-10-24 A kind of many measuring periods sequence similarity analyzes method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310508432.9A CN103577562B (en) 2013-10-24 2013-10-24 Multi-measurement time series similarity analysis method

Publications (2)

Publication Number Publication Date
CN103577562A true CN103577562A (en) 2014-02-12
CN103577562B CN103577562B (en) 2016-08-31

Family

ID=50049338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310508432.9A Active CN103577562B (en) 2013-10-24 2013-10-24 Multi-measurement time series similarity analysis method

Country Status (1)

Country Link
CN (1) CN103577562B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114331A1 (en) * 2003-11-26 2005-05-26 International Business Machines Corporation Near-neighbor search in pattern distance spaces
CN102880621A (en) * 2011-07-14 2013-01-16 富士通株式会社 Method and device for extracting similar sub time sequences

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
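Fabio Fabris et al.: "A Multi-measure Nearest Neighbor Algorithm", 《IBERAMIA 2008: Proceedings of the 11th Ibero-American Conference on AI: Advances in Artificial Intelligence》 *
Li Shijin (李士进) et al.: "Similarity analysis of multivariate hydrological time series based on the BORDA counting method" (基于BORDA计数法的多元水文时间序列相似性分析), 《水利学报》 *
Ouyang Rulin (欧阳如琳) et al.: "Research on similarity search of hydrological time series" (水文时间序列的相似性搜索研究), 《河海大学学报(自然科学版)》 *
Wang Yongmei (王咏梅): "Research and application of similarity mining of multivariate hydrological time series" (多元水文时间序列相似性挖掘的研究与应用), 《企业技术开发》 *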

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886195A (en) * 2014-03-14 2014-06-25 浙江大学 Time-series similarity measurement method under data missing
CN104063467A (en) * 2014-06-26 2014-09-24 北京工商大学 Intra-domain traffic flow pattern discovery method based on improved similarity search technology
CN104063467B (en) * 2014-06-26 2017-04-26 北京工商大学 Intra-domain traffic flow pattern discovery method based on improved similarity search technology
CN104090940A (en) * 2014-06-27 2014-10-08 华中科技大学 Sequential network and sequential data polymorphic clustering method
CN104090940B (en) * 2014-06-27 2018-04-27 华中科技大学 Sequential network and sequential data polymorphic clustering method
CN104182460A (en) * 2014-07-18 2014-12-03 浙江大学 Time sequence similarity query method based on inverted indexes
CN104182460B (en) * 2014-07-18 2017-06-13 浙江大学 Time sequence similarity query method based on inverted indexes
CN104794153A (en) * 2015-03-06 2015-07-22 河海大学 Similar hydrologic process searching method using user interaction
CN104794153B (en) * 2015-03-06 2017-11-24 河海大学 Similar hydrologic process searching method using user interaction
CN105046203B (en) * 2015-06-24 2018-03-30 哈尔滨工业大学 Satellite telemeasuring data self-adaptive hierarchical clustering method based on intersection angle DTW distance
CN105046203A (en) * 2015-06-24 2015-11-11 哈尔滨工业大学 Satellite telemeasuring data self-adaptive hierarchical clustering method based on intersection angle DTW distance
US10997176B2 (en) 2015-06-25 2021-05-04 International Business Machines Corporation Massive time series correlation similarity computation
CN105069093A (en) * 2015-08-05 2015-11-18 河海大学 Embedded index based hydrological time series similarity searching method
CN105069093B (en) * 2015-08-05 2018-07-24 河海大学 Embedded index based hydrological time series similarity searching method
CN107516114A (en) * 2017-08-28 2017-12-26 湖南大学 Time series processing method and device
CN107491903A (en) * 2017-09-27 2017-12-19 河海大学 Flood forecasting method based on data mining similarity theory
CN108710623A (en) * 2018-03-13 2018-10-26 南京航空航天大学 Airport departure delay time prediction method based on time series similarity measurement
CN108710623B (en) * 2018-03-13 2021-01-05 南京航空航天大学 Airport departure delay time prediction method based on time series similarity measurement
CN110287977B (en) * 2018-03-19 2021-09-21 阿里巴巴(中国)有限公司 Content clustering method and device
CN108573059A (en) * 2018-04-26 2018-09-25 哈尔滨工业大学 Time sequence classification method and device based on feature sampling
CN108573059B (en) * 2018-04-26 2021-02-19 哈尔滨工业大学 Time sequence classification method and device based on feature sampling
CN110414726A (en) * 2019-07-15 2019-11-05 南京灿能电力自动化股份有限公司 Power quality early-warning method based on monitoring data analysis
CN113017628A (en) * 2021-02-04 2021-06-25 山东师范大学 Consciousness and emotion recognition method and system integrating ERP components and nonlinear features
CN113017628B (en) * 2021-02-04 2022-06-10 山东师范大学 Consciousness and emotion recognition method and system integrating ERP components and nonlinear features

Also Published As

Publication number Publication date
CN103577562B (en) 2016-08-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant