JP2009199507A

JP2009199507A - Similar partial sequence detecting method, similar partial sequence detection program, and similar partial sequence detecting device

Info

Publication number: JP2009199507A
Application number: JP2008042633A
Authority: JP
Inventors: Machiko Toyoda; 真智子豊田; Yasushi Sakurai; 保志櫻井; Shunichi Ichikawa; 俊一市川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-02-25
Filing date: 2008-02-25
Publication date: 2009-09-03
Anticipated expiration: 2028-02-25
Also published as: JP5060340B2

Abstract

PROBLEM TO BE SOLVED: To suppress increases in computational complexity and memory usage when detecting a pair of similar partial sequences from two data streams. SOLUTION: The similar partial sequence detecting device 1 detects a pair of similar partial sequences from two data streams using a time warping matrix that holds the similarity scores between partial sequences. Because a single time warping matrix suffices, increases in computational complexity and memory usage are suppressed. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a technique for detecting similar partial sequence pairs in stream mining.

A data stream (hereinafter also simply called a “stream”) is a large volume of data arriving from a network at high speed. Stream mining is a technique for quickly finding useful information in a data stream represented as a time series. It does not merely analyze large-scale data already stored in a database; it analyzes and monitors a continuously growing flow of data in real time. To analyze ever-growing large-scale data and to deliver information to users in real time, stream mining techniques must be both fast and memory-efficient.

Stream monitoring requires sequence matching. In sequence matching, the similarity between two data sequences is expressed as a distance value, and similarity is judged from that value. Because data streams may have different sampling rates and the period of data transmission and reception may vary, it is important to take time warping into account so that such cases are handled flexibly. Dynamic time warping (DTW) is widely used as a distance function that takes time warping into account.

DTW is a distance function applied to stored sequences. It stretches the sequences along the time axis so as to minimize the distance between them, computes a distance value from the resulting element-to-element matching, and judges similarity by comparing that value with a threshold ε. This value, called the DTW distance, is the total distance after the sequence lengths have been optimally adjusted, and it is computed with a matrix based on dynamic programming (the time warping matrix). The smaller the DTW distance, the more similar the two sequences; a value of 0 means that they match exactly.

FIG. 14 is an explanatory diagram of DTW. As shown in FIG. 14(a), when the DTW distance between two sequences $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$ is computed, their elements are associated so that the DTW distance is minimized. DTW can associate the elements appropriately whether or not the two sequences have the same length.

As shown in FIG. 14(b), in the matrix used to compute the DTW distance (the time warping matrix), the set of element pairs associated between the two sequences is called the time warping path. It is shown here as the shaded (black) cells.

DTW is further described with reference to FIG. 15, which illustrates a time warping matrix computed by DTW. In FIG. 15, sequence Y is data of fixed length m (here, 4) and serves as the reference for the similarity judgment. Sequence X, a data stream, grows from moment to moment (its amount of data keeps increasing) and is the data judged for similarity to sequence Y.

The DTW distance can be calculated from a time warping matrix. For a sequence $X = (x_1, x_2, \ldots, x_n)$ of length $n$ and a sequence $Y = (y_1, y_2, \ldots, y_m)$ of length $m$, the DTW distance $D(X, Y)$ is defined as follows, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$.

$$D(X, Y) = f(n, m) \tag{9}$$
$$f(i, j) = \|x_i - y_j\| + \min\{f(i, j-1),\ f(i-1, j),\ f(i-1, j-1)\} \tag{10}$$
$$f(0, 0) = 0 \tag{11}$$
$$f(i, 0) = f(0, j) = \infty \tag{12}$$
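The recurrence in equations (9) to (12) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `dtw_distance` and the `dist` parameter are assumptions introduced here, and the default element distance is the squared difference for scalar values.

```python
import math


def dtw_distance(x, y, dist=lambda a, b: (a - b) ** 2):
    """Return D(X, Y) = f(n, m) as defined by equations (9)-(12).

    `dist` is a hypothetical parameter standing in for ||x_i - y_j||;
    the default is the squared difference of two scalars.
    """
    n, m = len(x), len(y)
    # Row 0 and column 0 hold the boundary conditions of equation (12).
    f = [[math.inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0  # equation (11)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Equation (10): local distance plus the cheapest predecessor.
            f[i][j] = dist(x[i - 1], y[j - 1]) + min(
                f[i][j - 1], f[i - 1][j], f[i - 1][j - 1]
            )
    return f[n][m]  # equation (9)
```

The table has (n + 1) x (m + 1) cells and each cell is filled in constant time, which is where the O(nm) cost of DTW comes from.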

Equation (9) defines the DTW distance, and equation (10) is the concrete recurrence. In equation (10), $\|x_i - y_j\|$ denotes the distance between the two values $x_i$ and $y_j$; examples include the Euclidean distance and the Manhattan distance (L1 distance). For two points $a = (a_1, a_2, \ldots, a_n)$ and $b = (b_1, b_2, \ldots, b_n)$ in n-dimensional space, with $1 \le k \le n$, the Euclidean distance is $\sqrt{\sum_k (a_k - b_k)^2}$ and the Manhattan distance is $\sum_k |a_k - b_k|$. The concrete examples below use the squared Euclidean distance as $\|x_i - y_j\|$ to keep the calculations simple. The present invention is not limited to the squared Euclidean distance; other distances may be used.
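The three element distances mentioned above can be written directly from their formulas. The function names below are illustrative choices made here, and points are assumed to be equal-length tuples of numbers.

```python
import math


def euclidean(a, b):
    """Euclidean distance: sqrt(sum_k (a_k - b_k)^2)."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance: sum_k |a_k - b_k|."""
    return sum(abs(ak - bk) for ak, bk in zip(a, b))


def squared_euclidean(a, b):
    """Squared Euclidean distance, which avoids the square root."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))
```

For one-dimensional values the squared Euclidean distance reduces to `(x_i - y_j) ** 2`, which keeps hand calculations on integer data in integers.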

In equation (10), $\min\{f(i, j-1),\ f(i-1, j),\ f(i-1, j-1)\}$ means taking the smallest of the three values in braces. Equations (11) and (12) are the boundary conditions of the time warping matrix used when computing these three values. With a time warping matrix based on this DTW distance, a partial sequence of sequence X similar to sequence Y can be detected.

For example, computing the DTW distances between sequence Y = (11, 6, 9, 4) and sequence X = (12, 6, 10, 6, 5, 1, ...) yields the values shown in the time warping matrix of FIG. 15. The hatched cells in FIG. 15 are the route (time warping path) traversed to compute the DTW distance “6”; following this route back reveals the position at which the DTW distance calculation started. In other words, by finding a time warping path along which the DTW distance values remain relatively small, a partial sequence similar to sequence Y = (11, 6, 9, 4), here the partial sequence (12, 6, 10, 6), can be detected in sequence X = (12, 6, 10, 6, 5, 1, ...).
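The worked example can be reproduced with a short sketch that fills the time warping matrix and then walks back along the cheapest predecessors to recover the time warping path. It uses the squared difference as the element distance and only the first four elements of X; the function name `dtw_with_path` and the tie-breaking rule (diagonal first) are assumptions made here, since FIG. 15 is not available in this text.

```python
import math


def dtw_with_path(x, y):
    """Return (DTW distance, time warping path) for scalar sequences.

    The path is a list of 1-based (i, j) cells from the start to (n, m).
    """
    n, m = len(x), len(y)
    f = [[math.inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            f[i][j] = cost + min(f[i][j - 1], f[i - 1][j], f[i - 1][j - 1])

    # Walk back from (n, m); on ties the diagonal predecessor is preferred
    # (an assumption, the source does not specify tie-breaking).
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i, j))
        predecessors = [
            (f[i - 1][j - 1], i - 1, j - 1),
            (f[i - 1][j], i - 1, j),
            (f[i][j - 1], i, j - 1),
        ]
        _, i, j = min(predecessors, key=lambda c: c[0])
    path.reverse()
    return f[n][m], path


distance, path = dtw_with_path([12, 6, 10, 6], [11, 6, 9, 4])
print(distance, path)  # 6 [(1, 1), (2, 2), (3, 3), (4, 4)]
```

The first cell of the recovered path gives the start position of the matched partial sequence, which is the information the hatched route in FIG. 15 is described as providing.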

Thus the time warping matrix holds the values of the DTW function, that is, the values of $f(i, j)$ in equation (10), and it forms the basis of DTW. Computing the distance between a sequence X of length n and a sequence Y of length m takes O(nm) time with DTW. This is because DTW performs its calculation over all pairings of elements of the two sequences, so the computational cost becomes very large, particularly for long sequences.

In other words, when this conventional technique is used to detect similar partial sequences in a data stream, comparisons against partial sequences of every pattern are required, so a time warping matrix must be added each time new data arrives in the stream. That is, although FIG. 15 shows only one time warping matrix, similar matrices must be added one after another as time passes. The amount of computation and memory usage therefore grows as the data stream grows.

Many algorithms for sequence matching with DTW have been proposed, but most of them detect sequences similar to a query sequence prepared in advance. Non-Patent Documents 1 and 2 propose techniques for reducing the computational cost of sequence matching with DTW. However, these techniques are designed for stored data sets and are not suited to stream processing.

For sequence matching on streams with DTW, Non-Patent Document 3 proposes a technique for detecting partial sequences similar to a query sequence. However, this conventional technique merely detects partial sequences similar to a prepared query sequence.

On the other hand, techniques that continuously detect similar partial sequence pairs in ever-growing streams, without preparing a specific query sequence, are also considered important. Non-Patent Documents 4 and 5 focus on real-time stream monitoring and propose techniques for detecting correlations between streams. However, they use distance measures that do not adjust along the time axis and therefore do not support time warping.

The present invention addresses the problem of detecting similar partial sequence pairs in data streams. Specifically: “given two data streams, detect similar partial sequence pairs using a technique that conforms to DTW (a technique equivalent to DTW)”. This problem is described with reference to FIG. 13, which shows an example of data streams (sequences) used for detecting similar partial sequence pairs.

As shown in FIG. 13(a) and (b), sequence #1 has small spikes (protrusions) at #11, #12, and #14, and sequence #2 has small spikes at #22 and #23. The spikes have almost the same amplitude but different periods (time widths). The sequences also contain three large spikes (#13, #21, and #24), whose periods likewise differ.

The problem to be solved by the present invention is to find partial similarities between two sequences. For example, the partial sequence pairs #11 and #22, #11 and #23, #13 and #21, and #13 and #24 are similar partial sequence pairs of sequences #1 and #2. Because the periods within these pairs differ, they are difficult to detect accurately unless time warping is taken into account.

Here, the data stream X can be written as a semi-infinite sequence of values $x_1, x_2, \ldots, x_n, \ldots$ collected at times $I = i_1, i_2, \ldots, i_n, \ldots$. $x_n$ is the most recent value, collected at time $i_n$, and $n$ grows as time passes. Let $X[i_s : i_e]$ denote the partial sequence from $i_s$ to $i_e$. Likewise, Y is a sequence of values $y_1, y_2, \ldots, y_m$, and $Y[j_s : j_e]$ denotes the partial sequence from $j_s$ to $j_e$. For example, if sequence #1 is data stream X and sequence #2 is data stream Y, partial sequence #11 can be written as X[1155:2712] and #22 as Y[6111:8361]. The similar partial sequence pair detection problem is then defined as follows.
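The notation $X[i_s : i_e]$ is 1-based and includes both end positions, which differs from Python's 0-based, end-exclusive slices. A small helper makes the mapping explicit; the name `subsequence` is a hypothetical choice made here for illustration.

```python
def subsequence(stream, start, end):
    """Return stream[start:end] in the document's 1-based, inclusive notation.

    X[i_s : i_e] therefore has length i_e - i_s + 1.
    """
    if not 1 <= start <= end <= len(stream):
        raise ValueError("expected 1 <= start <= end <= len(stream)")
    # Shift the start by one; Python's exclusive end already matches `end`.
    return stream[start - 1:end]
```

With this convention, `subsequence(X, 1155, 2712)` would hold the 1558 values of partial sequence #11 in the example above.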

[Similar partial sequence pair detection]
Given two data streams X and Y, a threshold ε for the similarity judgment, and a lower limit ζ on the length of similar partial sequences, detect the similar partial sequence pairs $X[i_s : i_e]$ and $Y[j_s : j_e]$ that satisfy the following conditions.
1. The average distance value between $X[i_s : i_e]$ and $Y[j_s : j_e]$ is ε or less.
2. The sequence lengths of $X[i_s : i_e]$ and $Y[j_s : j_e]$ are both ζ or more.
Here, the average distance value means the distance value per element of the partial sequences $X[i_s : i_e]$ and $Y[j_s : j_e]$, and is obtained as (distance value between $X[i_s : i_e]$ and $Y[j_s : j_e]$) / (length of the time warping path between $X[i_s : i_e]$ and $Y[j_s : j_e]$).

The threshold ε and the lower limit ζ on the similar partial sequence length are specified (set) by the user, and similar partial sequence pairs are detected according to these conditions.

Consider solving this problem with conventional DTW. Detecting similar partial sequence pairs $X[i_s : i_e]$ and $Y[j_s : j_e]$ in data streams with DTW requires O(nm) matrices, because a matrix must be created for every start point in the ranges $1 \le i_s \le n - \zeta + 1$ and $1 \le j_s \le m - \zeta + 1$. This is the conventional, general method, referred to in this specification as the naive method.

In the (i, j)-th matrix, that is, the matrix starting at time i and time j, let $d_{i,j}(k, l)$ be the distance at element (k, l). In the naive method, the partial sequence matching distance between X and Y is obtained as follows, where $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, n - i + 1$, $j = 1, 2, \ldots, m$, and $l = 1, 2, \ldots, m - j + 1$.

$$D(X[i_s : i_e], Y[j_s : j_e]) = d_{i_s, j_s}(i_e - i_s + 1,\ j_e - j_s + 1) \tag{13}$$
$$d_{i,j}(k, l) = \|x_{i+k-1} - y_{j+l-1}\| + \min\{d_{i,j}(k, l-1),\ d_{i,j}(k-1, l),\ d_{i,j}(k-1, l-1)\} \tag{14}$$
$$d_{i,j}(0, 0) = 0 \tag{15}$$
$$d_{i,j}(k, 0) = d_{i,j}(0, l) = \infty \tag{16}$$

The average distance $d'$ for evaluating the similarity between $X[i_s : i_e]$ and $Y[j_s : j_e]$ is obtained as follows, where W is the length of the time warping path between partial sequence $X[i_s : i_e]$ and partial sequence $Y[j_s : j_e]$.
$$d' = d_{i,j}(k, l) / W \tag{17}$$
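The naive method of equations (13) to (17) can be sketched as follows, under several assumptions made here: finite lists stand in for the streams, the element distance is the squared difference, the path length W is counted along the predecessor chosen by the minimum (ties go to the shorter path), and the function name `naive_similar_pairs` is hypothetical. The sketch is meant to show why the approach is expensive, not to be used on real streams.

```python
import math


def naive_similar_pairs(x, y, eps, zeta):
    """Enumerate (i_s, i_e, j_s, j_e) with average distance <= eps.

    One DTW matrix d_{i,j} is built for every start pair (i, j),
    mirroring equations (13)-(16); positions are 1-based and inclusive.
    """
    n, m = len(x), len(y)
    found = []
    for i in range(1, n - zeta + 2):          # 1 <= i_s <= n - zeta + 1
        for j in range(1, m - zeta + 2):      # 1 <= j_s <= m - zeta + 1
            rows, cols = n - i + 1, m - j + 1
            d = [[math.inf] * (cols + 1) for _ in range(rows + 1)]
            w = [[0] * (cols + 1) for _ in range(rows + 1)]  # path lengths
            d[0][0] = 0.0                      # equation (15)
            for k in range(1, rows + 1):
                for l in range(1, cols + 1):
                    cost = (x[i + k - 2] - y[j + l - 2]) ** 2
                    # Equation (14); tuples compare by distance first,
                    # then by path length (assumed tie-breaking).
                    best_d, best_w = min(
                        (d[k - 1][l - 1], w[k - 1][l - 1]),
                        (d[k][l - 1], w[k][l - 1]),
                        (d[k - 1][l], w[k - 1][l]),
                    )
                    d[k][l] = cost + best_d
                    w[k][l] = best_w + 1
                    # Equation (17) plus the length condition on both sides.
                    if k >= zeta and l >= zeta and d[k][l] / w[k][l] <= eps:
                        found.append((i, i + k - 1, j, j + l - 1))
    return found
```

Because a separate matrix is built for each of the O(nm) start pairs, the work grows far faster than for a single matrix, which is the cost the invention aims to avoid.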

Non-Patent Document 1: E. J. Keogh: “Exact Indexing of Dynamic Time Warping,” In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), pp. 406-417, 2002.
Non-Patent Document 2: S. W. Kim, S. Park, W. W. Chu: “An Index-based Approach for Similarity Search Supporting Time Warping in Large Sequence Database,” In Proceedings of IEEE 17th International Conference on Data Engineering (ICDE 2001), pp. 607-614, 2001.
Non-Patent Document 3: Y. Sakurai, C. Faloutsos, and M. Yamamuro: “Stream Monitoring under the Time Warping Distance,” In Proceedings of IEEE 23rd International Conference on Data Engineering (ICDE 2007), pp. 1046-1055, 2007.
Non-Patent Document 4: S. Papadimitriou, J. Sun, and C. Faloutsos: “Streaming Pattern Discovery in Multiple Time-Series,” In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB 2005), pp. 697-708, 2005.
Non-Patent Document 5: Y. Zhu, D. Shasha: “StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time,” In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), pp. 358-369, 2002.

However, DTW is a sequence matching technique for detecting sequences similar to a fixed-length query sequence, so when it is used for this problem the number of matrices required for the calculation grows as time passes. Because the number of matrices is O(nm), the number of values that must be updated at each time tick is O(nm²) or O(n²m), and the amount of computation and memory usage increases greatly.

The present invention has been made in view of the above problem, and its object is to suppress increases in computation and memory usage when detecting pairs of similar partial sequences from two data streams.

To solve the above problem, the present invention provides a similar partial sequence detection method performed by a similar partial sequence detection device that detects pairs of similar partial sequences from two data streams using a time warping matrix indicating similarity scores between two partial sequences. The similar partial sequence detection device includes a storage unit that stores the time warping matrix, and a processing unit. When the processing unit receives one element of data of either of the two data streams, it calculates a similarity score between a partial sequence of that data stream containing the element and a partial sequence of the other data stream; stores the calculated similarity score in the time warping matrix in the storage unit in association with the start and end positions of the two partial sequences used to calculate it; and detects and outputs pairs of similar partial sequences using the similarity scores stored in the time warping matrix.

According to this invention, a single time warping matrix suffices when detecting pairs of similar partial sequences from two data streams, so increases in computation and memory usage can be suppressed.

Further, in the present invention, when the processing unit calculates the similarity score between partial sequences of the two data streams, it is desirable that the similarity score $S(X[i_s : i_e], Y[j_s : j_e])$ between two partial sequences $X[i_s : i_e]$ and $Y[j_s : j_e]$ be calculated by equations (1) to (5):
$$S(X[i_s : i_e], Y[j_s : j_e]) = s(i_e, j_e) \tag{1}$$
$$s(i, j) = \max\{0,\ 2\varepsilon - \|x_i - y_j\| + s_{best}\} \tag{2}$$
$$s_{best} = \max\{s(i, j-1),\ s(i-1, j),\ s(i-1, j-1)\} \tag{3}$$
$$s(i, 0) = 0 \tag{4}$$
$$s(0, j) = 0 \tag{5}$$
that $p(i, j)$, which is stored in the time warping matrix and indicates the start position $i_s$ of the partial sequence $X[i_s : i_e]$ and the start position $j_s$ of the partial sequence $Y[j_s : j_e]$, be calculated by equation (7):
$$p(i, j) = \begin{cases} p(i, j-1) & \text{if } s_{best} = s(i, j-1) \\ p(i-1, j) & \text{if } s_{best} = s(i-1, j) \\ p(i-1, j-1) & \text{if } s_{best} = s(i-1, j-1) \\ (i, j) & \text{if } s_{best} = 0 \end{cases} \tag{7}$$
and that the start positions $i_s$ and $j_s$ be calculated by equation (8):
$$(i_s, j_s) = p(i_e, j_e) \tag{8}$$
Here, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, and $\|x_i - y_j\|$ denotes the distance between $x_i$ and $y_j$.
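Equations (1) to (5), (7), and (8) can be sketched as a single pass over one matrix. The sketch below makes several assumptions that are not stated in the source: it works on finite lists rather than live streams, uses the squared difference as the element distance, tests the `s_best = 0` case of equation (7) before the other three cases, and breaks remaining ties in the order the cases are listed. The name `score_matrix` is hypothetical.

```python
def score_matrix(x, y, eps, dist=lambda a, b: (a - b) ** 2):
    """Return (s, p): cumulative scores and start positions, 1-based.

    s[i][j] follows equations (2)-(5); p[i][j] follows equation (7), so
    that (i_s, j_s) = p[i_e][j_e] as in equation (8).
    """
    n, m = len(x), len(y)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]   # equations (4) and (5)
    p = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (s[i][j - 1], p[i][j - 1]),
                (s[i - 1][j], p[i - 1][j]),
                (s[i - 1][j - 1], p[i - 1][j - 1]),
            ]
            # Equation (3); max() keeps the first maximum on ties.
            s_best, p_best = max(candidates, key=lambda c: c[0])
            # Equation (2): reward close elements, never drop below zero.
            s[i][j] = max(0.0, 2 * eps - dist(x[i - 1], y[j - 1]) + s_best)
            # Equation (7): start a new partial sequence when nothing
            # useful precedes this cell, otherwise inherit the start.
            p[i][j] = (i, j) if s_best == 0 else p_best
    return s, p
```

Because each cell carries its own start position, a single matrix is enough to read off both ends of a candidate pair: the cell index gives $(i_e, j_e)$ and `p[i_e][j_e]` gives $(i_s, j_s)$, with no backtracking and no per-start-point matrices.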

According to this invention, the similarity score can be calculated concretely and appropriately, and it can be stored in the time warping matrix in association with the start and end positions of the two partial sequences used to calculate it.

Further, in the present invention, when the processing unit detects and outputs pairs of similar partial sequences using the similarity scores stored in the time warping matrix in the storage unit, it is desirable that the average similarity score $s'$ be calculated by equation (6):
$$s' = s(i, j) / W \tag{6}$$
and that, when the calculated average $s'$ is equal to or greater than a predetermined threshold ε and the lengths of the two partial sequences used to calculate the similarity score are both equal to or greater than a predetermined length ζ, those two partial sequences be detected and output as a pair of similar partial sequences.
Here, W denotes the length of the time warping path between the partial sequence $X[i_s : i_e]$ and the partial sequence $Y[j_s : j_e]$.
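The detection condition built on equation (6) is a simple predicate once the score, the path length W, and the two position pairs are known. The function below is an illustrative sketch; its name and argument order are assumptions made here, and how W is obtained is left to the caller.

```python
def is_similar_pair(score, path_length, i_s, i_e, j_s, j_e, eps, zeta):
    """Check the two detection conditions for one candidate pair.

    score        -- cumulative similarity score s(i_e, j_e)
    path_length  -- time warping path length W
    positions    -- 1-based, inclusive start and end positions
    """
    if path_length <= 0:
        return False
    average = score / path_length                     # equation (6)
    long_enough = (i_e - i_s + 1) >= zeta and (j_e - j_s + 1) >= zeta
    return average >= eps and long_enough
```

For instance, a candidate with score 8 over a path of length 5 has an average of 1.6, so it passes with ε = 1 and ζ = 3 when both partial sequences span three positions, and fails if ε is raised to 2.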

According to this invention, using the average obtained by dividing the similarity score by the length of the corresponding time warping path allows pairs of similar partial sequences to be detected more accurately.

Further, in the present invention, it is desirable that, when the calculated average similarity score $s'$ is equal to or greater than the predetermined threshold ε and the lengths of the two partial sequences used in the calculation are both equal to or greater than the predetermined length ζ, the processing unit store the two partial sequences in the storage unit as a similar partial sequence pair candidate; and that, when some of the stored candidates overlap in at least part of their time warping paths, the processing unit select, from the overlapping candidates, the candidate with the longest time warping path and output its two partial sequences as a pair of similar partial sequences.

According to this invention, when several similar partial sequence pairs overlap in at least part of their time warping paths, the pair with the longest time warping path is output, so more useful information can be provided to the user without redundant information.
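One possible reading of this selection rule is a greedy pass over the candidates, longest path first. This is an interpretation made here, not a procedure given in the source: each candidate is assumed to be represented by its time warping path as a list of (i, j) cells, two candidates are taken to overlap when they share at least one cell, and the function name is hypothetical.

```python
def select_longest_non_overlapping(candidates):
    """Keep, among overlapping candidates, the one with the longest path.

    candidates -- list of time warping paths, each a list of (i, j) cells.
    Returns the kept paths, longest first (input order on equal length).
    """
    kept = []
    used_cells = set()
    # sorted() is stable, so equally long paths keep their input order.
    for path in sorted(candidates, key=len, reverse=True):
        cells = set(path)
        if cells.isdisjoint(used_cells):
            kept.append(path)
            used_cells |= cells
    return kept
```

Under this reading a short candidate that overlaps only a discarded candidate is still reported; grouping overlaps transitively instead would drop it, and the source text does not say which behavior is intended.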

The present invention is also a similar partial sequence detection program that causes a computer to execute the similar partial sequence detection method. Such a program allows a general-purpose computer to execute the method.

According to the present invention, increases in computation and memory usage when detecting pairs of similar partial sequences from two data streams can be suppressed.

The best mode for carrying out the present invention (hereinafter, “embodiment”) is described below as a first embodiment and a second embodiment. Three experimental results are described afterwards.

≪First Embodiment≫
FIG. 1 is a configuration diagram of the similar partial sequence detection device according to the first embodiment. The similar partial sequence detection device 1 is a computer device and includes an input unit 11, a processing unit 12, a storage unit 13, and an output unit 14.

The input unit 11 accepts data stream input from an external device (not shown) or a sensor (not shown) via the Internet or a LAN (Local Area Network), and accepts, from an input device (not shown) such as a keyboard or mouse, similar partial sequence detection conditions for detecting similar partial sequence pairs (hereinafter also simply “similar partial sequences”). These conditions include, for example, the lower limit ζ on the partial sequence length when detecting similar partial sequences in the data streams and the threshold ε on the average similarity score (details are given later). The input unit 11 is realized by a communication interface for sending and receiving data via the Internet or a LAN and by an input/output interface for exchanging data with external devices such as the input device.

The input unit 11 includes a detection condition input unit 111, which accepts the similar partial sequence detection conditions used when detecting similar partial sequences from two data streams, and a data stream input unit 112, which accepts the data streams.

The processing unit 12 performs the various computations for detecting similar partial sequence pairs from two data streams and is realized, for example, by a CPU (Central Processing Unit) executing a program in the storage unit 13. The processing unit 12 includes a data stream processing unit 121. For the data streams received by the data stream input unit 112, it detects similar partial sequence pairs using the time warping matrix in the time warping data storage unit 132 of the storage unit 13 and outputs them externally via the similar partial sequence output unit 141 (described later) of the output unit 14. The data stream processing unit 121 uses a scoring function when calculating the average similarity score between partial sequences of the two data streams.

スコアリング関数は、類似部分シーケンスペアを検出するための類似判定手段である。部分シーケンス間の類似度は、２つのシーケンス間の各要素同士をマッチングするために時間軸方向に最適に伸長を行った後、スコアとして算出される。スコアリング関数はＤＴＷと同様、動的計画法に基づくアプローチであるが、次の２つの点でＤＴＷと異なる。 The scoring function is similarity determination means for detecting similar partial sequence pairs. The degree of similarity between partial sequences is calculated as a score after optimal expansion in the time axis direction in order to match each element between two sequences. Similar to DTW, the scoring function is an approach based on dynamic programming, but differs from DTW in the following two points.

１つ目は、累積スコアの最大値を用いて類似部分シーケンスペアを求めることである。一方、ＤＴＷは、累積距離の最小値を用いて類似部分シーケンスペアを求める。 The first is to obtain a similar partial sequence pair using the maximum value of the cumulative score. On the other hand, the DTW obtains a similar partial sequence pair using the minimum cumulative distance.

The second is the introduction of "zero-resetting" into the scoring function: if the cumulative score s(i, j) in the matrix becomes negative, it is replaced with 0 (zero). This approach was proposed in the field of bioinformatics and is implemented in, for example, the Smith-Waterman algorithm. Whereas bioinformatics deals with symbol sequences, the scoring function of this embodiment differs in that it deals with numerical sequences.

Replacing the score with 0 means that the partial sequences of X and Y ending at (i, j) no longer satisfy the definition of similar partial sequence detection, that is, they are not similar at all. A run of consecutive 0s therefore indicates that the partial sequence pairs in that interval are not similar at all. With zero-resetting, the similarity between the partial sequences X[i_s : i_e] (of length L_x) and Y[j_s : j_e] (of length L_y) can be evaluated.

The scoring function and related quantities are now described in more detail. The scoring function is used to calculate the average similarity score between one data stream X = (x_1, x_2, …, x_n, …) and the other data stream Y = (y_1, y_2, …, y_m, …). Each time one data item of either stream arrives, the data stream processing unit 121 uses this scoring function to calculate the similarity score between the partial sequence X[i_s : i_e] of data stream X and the partial sequence Y[j_s : j_e] of data stream Y.

As shown in equations (1) to (5), the similarity score calculated by this scoring function increases when the similarity is high and decreases when it is low; partial sequences whose similarity score is at least the threshold ε are regarded as similar to each other. The data stream processing unit 121 calculates the similarity score by matching the elements of partial sequence X and partial sequence Y against each other using a time warping matrix (see FIGS. 3 to 6).

The similarity score S(X[i_s : i_e], Y[j_s : j_e]) between the two partial sequences X[i_s : i_e] and Y[j_s : j_e] is calculated by the following equations (1) to (5), where i = 1, 2, …, n and j = 1, 2, …, m.

\begin{align}
S(X[i_s:i_e],\, Y[j_s:j_e]) &= s(i_e, j_e) \tag{1}\\
s(i,j) &= \max\{0,\ 2\varepsilon - \lVert x_i - y_j \rVert + s_{best}\} \tag{2}\\
s_{best} &= \max\{s(i,j-1),\ s(i-1,j),\ s(i-1,j-1)\} \tag{3}\\
s(i,0) &= 0 \tag{4}\\
s(0,j) &= 0 \tag{5}
\end{align}

The average value s′ of the similarity score is calculated by the following equation (6).

\[ s' = \frac{s(i,j)}{W} \tag{6} \]

Here, ε is the threshold for similarity determination. Equation (1) defines the similarity score, and equation (2) is the concrete formula for computing it. In equation (2), ‖x_i − y_j‖ denotes the distance between the two values x_i and y_j. The Euclidean distance or the Manhattan distance (L1 distance) could be used, for example, but here the squared Euclidean distance is used.

In equations (2) and (3), max{} means taking the largest of the values inside the braces. Equations (4) and (5) are the boundary conditions of the time warping matrix. W is the length of the time warping path between the partial sequences X[i_s : i_e] and Y[j_s : j_e]. Using the time warping matrix built from this similarity score together with the average similarity score s′ allows pairs of similar partial sequences in sequence X and sequence Y to be detected more accurately (details are given later).

As equation (2) shows, when the value of 2ε − ‖x_i − y_j‖ + s_best falls below 0 in calculating s(i, j), s(i, j) is set to 0. In this way, a similarity score that has dropped below 0 does not affect the similarity scores of subsequent partial sequences. In other words, the similarity score reflects only the chaining of partial sequences with higher similarity scores.
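By way of illustration only, the recurrence of equations (2) to (5) can be sketched in Python as follows. The function name `score_matrix` and the choice to build the whole table at once are assumptions made for readability; they are not part of the embodiment, which updates the matrix incrementally as data arrive. The squared Euclidean distance is used, as stated above.

```python
def score_matrix(xs, ys, eps):
    """Return s as an (n+1) x (m+1) table following equations (2) to (5)."""
    n, m = len(xs), len(ys)
    # Row 0 and column 0 hold the boundary conditions s(i, 0) = s(0, j) = 0.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = (xs[i - 1] - ys[j - 1]) ** 2  # squared Euclidean distance
            s_best = max(s[i][j - 1], s[i - 1][j], s[i - 1][j - 1])  # eq. (3)
            s[i][j] = max(0, 2 * eps - dist + s_best)  # eq. (2), zero-resetting
    return s
```

Each cell gains 2ε when the two values coincide and loses score as they diverge, so a run of well-matched elements accumulates score, and a poor match resets the run to 0.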

The start positions stored in the time warping matrix, namely the start position i_s of the partial sequence X[i_s : i_e] and the start position j_s of the partial sequence Y[j_s : j_e], are denoted p(i, j) and can be calculated by the following equation (7).

\[
p(i,j) =
\begin{cases}
p(i,j-1) & (\text{if } s_{best} = s(i,j-1)) \\
p(i-1,j) & (\text{if } s_{best} = s(i-1,j)) \\
p(i-1,j-1) & (\text{if } s_{best} = s(i-1,j-1)) \\
(i,j) & (\text{if } s_{best} = 0)
\end{cases}
\tag{7}
\]

The start positions i_s and j_s can then be calculated by the following equation (8).

\[ (i_s, j_s) = p(i_e, j_e) \tag{8} \]

つまり、このタイムワーピング行列は、各要素に類似度スコアと、その類似度スコアの算出に用いた部分シーケンスＸ，Ｙの開始位置を保持することで、該当するタイムワーピングパスの開始位置を、過去に遡ることなく（過去のデータを保持することなく）認識することができる。 That is, this time warping matrix holds the similarity score for each element and the start positions of the partial sequences X and Y used to calculate the similarity score, so that the start position of the corresponding time warping path is stored in the past. Can be recognized without going back to the past (without holding past data).
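The propagation of the start position in equation (7) can be sketched, for a single cell, as follows. The names `Cell`, `BOUNDARY`, and `update_cell` are hypothetical, and the tie-breaking rule (the case s_best = 0 is checked first, then the neighbours in the order listed in equation (7)) is an assumption, since the text does not say how ties are resolved.

```python
from typing import NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    score: float                      # cumulative score s(i, j)
    start: Optional[Tuple[int, int]]  # start position p(i, j)


BOUNDARY = Cell(0, None)  # s(i, 0) and s(0, j) from equations (4) and (5)


def update_cell(x, y, eps, left, up, diag, pos):
    """Compute the cell at `pos` from its neighbours (i, j-1), (i-1, j), (i-1, j-1)."""
    # max() returns the first maximal item, so ties follow the order of eq. (7).
    best = max((left, up, diag), key=lambda c: c.score)
    score = max(0, 2 * eps - (x - y) ** 2 + best.score)
    # A new path starts here when no neighbour carries a positive score.
    start = pos if best.score == 0 else best.start
    return Cell(score, start)
```

Because every cell carries the start position of its own path, reading `start` at the end cell gives (i_s, j_s) directly, which is the content of equation (8).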

記憶部１３は、２つのデータストリームから類似部分シーケンス検出を行うための各種データを記憶する。この記憶部１３は、例えば、ＲＡＭ（Random Access Memory）と、ＨＤＤ（Hard Disk Drive）とにより実現される。この記憶部１３は、検出条件記憶部１３１と、タイムワーピングデータ記憶部１３２とを含んで構成される。なお、破線で示した類似部分シーケンス候補記憶部１３３については、別途、第２実施形態で説明する。 The storage unit 13 stores various data for performing similar partial sequence detection from two data streams. The storage unit 13 is realized by, for example, a RAM (Random Access Memory) and an HDD (Hard Disk Drive). The storage unit 13 includes a detection condition storage unit 131 and a time warping data storage unit 132. The similar partial sequence candidate storage unit 133 indicated by a broken line will be described separately in the second embodiment.

検出条件記憶部１３１は、類似部分シーケンス検出条件を記憶する。この類似部分シーケンス検出条件は、２つのデータストリームから類似部分シーケンスのペアを検出するときの部分シーケンス長の下限値ζ、類似度スコアの平均値の閾値ε等を示す情報である。 The detection condition storage unit 131 stores similar partial sequence detection conditions. The similar partial sequence detection condition is information indicating the lower limit value ζ of the partial sequence length when detecting a pair of similar partial sequences from two data streams, the threshold value ε of the average value of similarity scores, and the like.

タイムワーピングデータ記憶部１３２は、前記したタイムワーピング行列（図３〜図６参照）を記憶する。このタイムワーピング行列が単一であることが、類似部分シーケンス検出装置１の特徴の1つである。すなわち、１つのタイムワーピング行列の使用で類似部分シーケンスペアを検出することができるので、多数のタイムワーピング行列を使用する場合に比較して、計算量やメモリ使用量の増加を大幅に抑制することができる。また、オリジナルのデータストリームを保持する必要がなく、メモリ使用量の増加をさらに抑制することができる。 The time warping data storage unit 132 stores the above-described time warping matrix (see FIGS. 3 to 6). One feature of the similar partial sequence detection apparatus 1 is that this time warping matrix is single. In other words, since a similar partial sequence pair can be detected by using one time warping matrix, an increase in the amount of calculation and memory usage is greatly suppressed as compared to the case of using a large number of time warping matrices. Can do. Further, it is not necessary to hold the original data stream, and an increase in memory usage can be further suppressed.

Although FIGS. 3 to 6 show the time warping matrix for all elements of the received data streams, the time warping matrix held in the time warping data storage unit 132 needs only the outermost (newest) two rows and two columns of data (in the matrix of FIG. 3, the data for i = 3, 4 and j = 4, 5).

出力部１４は、データストリーム処理部１２１で検出された類似部分シーケンスを出力する類似部分シーケンス出力部１４１を含んで構成される。この出力部１４は、前記した通信インタフェースや、出力装置等の外部装置との各種データの入出力を行うための入出力インタフェースにより実現される。 The output unit 14 includes a similar partial sequence output unit 141 that outputs a similar partial sequence detected by the data stream processing unit 121. The output unit 14 is realized by the communication interface described above and an input / output interface for inputting / outputting various data to / from an external device such as an output device.

Next, the processing of the similar partial sequence detection device 1 is described. FIG. 2 is a flowchart showing the processing of the similar partial sequence detection device 1 of the first embodiment. Here, the elements of the two data streams X (= x_1, …, x_i, …, x_n, …) and Y (= y_1, …, y_j, …, y_m, …) are assumed to be supplied one after another.

First, the data stream processing unit 121 receives data x_i at time i via the data stream input unit 112 (step S11).
Next, the data stream processing unit 121 resets the value of j to 0 (step S12).

The data stream processing unit 121 then repeats the processing of steps S13 to S16, incrementing j by 1 each time, until j reaches m.

In step S13, the data stream processing unit 121 calculates the similarity score s(i, j) between x_i and y_j, the average similarity score s′, and the start position p(i, j) using equations (1) to (7).

In step S14, the data stream processing unit 121 determines whether the conditions s′ ≥ ε, L_x ≥ ζ, and L_y ≥ ζ are all satisfied, that is, whether the average similarity score s′ is at least the predetermined threshold ε and the lengths of the two partial sequences used to calculate the similarity score s(i, j) are both at least the predetermined length ζ.

If the result of step S14 is Yes, then in step S15 the data stream processing unit 121 assigns p(i_e, j_e) to (i_s, j_s), i to i_e, and j to j_e.

In step S16, the data stream processing unit 121 outputs the values of i_s, i_e, j_s, j_e, and s′ to the user via the similar partial sequence output unit 141.

ステップＳ１４でＮｏのとき、ステップＳ１５およびステップＳ１６をスキップする。 When No in step S14, step S15 and step S16 are skipped.

データストリーム処理部１２１は、データストリーム入力部１１２経由で、時刻ｊにおいてデータｙ_ｊを受信した場合（ステップＳ２１）、ステップＳ１２〜ステップＳ１６の処理と同様にして、ステップＳ２２〜ステップＳ２６の処理を行う。 When the data stream processing unit 121 receives the data y _j at time j via the data stream input unit 112 (step S21), the data stream processing unit 121 performs the processing of step S22 to step S26 in the same manner as the processing of step S12 to step S16. Do.

ステップＳ１３〜Ｓ１６の処理の後、あるいは、ステップＳ２３〜Ｓ２６の処理の後、データストリーム処理部１２１は、２つのデータストリームの受信が終了したか否か判断する（ステップＳ３１）。データストリーム処理部１２１は、終了していなければ（ステップＳ３１でＮｏ）ステップＳ１１あるいはステップＳ２１に戻り、終了していれば（ステップＳ３１でＹｅｓ）処理を終了する。 After the processes of steps S13 to S16 or after the processes of steps S23 to S26, the data stream processing unit 121 determines whether or not the reception of the two data streams has been completed (step S31). If not completed (No in step S31), the data stream processing unit 121 returns to step S11 or step S21, and if completed (Yes in step S31), ends the process.
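The flow of steps S11 to S16 and S21 to S26 can be sketched as a small streaming detector. This is a minimal sketch under stated assumptions, not the claimed implementation: the class `StreamDetector` and its methods are hypothetical names; the path length W is assumed to be carried in each cell and incremented along the chosen predecessor (the text defines W but not how it is maintained); and the raw values of both streams are kept so that distances can be computed. Only the newest row and the newest column of the matrix are stored.

```python
from typing import List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    score: float                      # cumulative score s(i, j)
    start: Optional[Tuple[int, int]]  # start position p(i, j)
    length: int                       # assumed path-length counter W


ZERO = Cell(0, None, 0)  # boundary cells s(i, 0) and s(0, j)


class StreamDetector:
    def __init__(self, eps: float, zeta: int):
        self.eps, self.zeta = eps, zeta
        self.xs: List[float] = []  # raw values, kept to compute distances
        self.ys: List[float] = []
        self.row: List[Cell] = []  # cells (n, 1..m): newest row
        self.col: List[Cell] = []  # cells (1..n, m): newest column

    def _cell(self, x, y, left, up, diag, pos):
        # Step S13: equations (2), (3) and (7), plus the assumed W counter.
        best = max((left, up, diag), key=lambda c: c.score)
        score = max(0, 2 * self.eps - (x - y) ** 2 + best.score)
        if best.score == 0:
            return Cell(score, pos, 1)
        return Cell(score, best.start, best.length + 1)

    def _check(self, cell, i, j, found):
        # Step S14: s' >= eps, L_x >= zeta and L_y >= zeta.
        i_s, j_s = cell.start
        avg = cell.score / cell.length
        if avg >= self.eps and i - i_s + 1 >= self.zeta and j - j_s + 1 >= self.zeta:
            found.append((i_s, i, j_s, j, avg))  # steps S15 and S16

    def add_x(self, x):
        """Steps S11 to S16: a new element of X adds one row of cells."""
        self.xs.append(x)
        i, new_row, found = len(self.xs), [], []
        for j, y in enumerate(self.ys, start=1):
            left = new_row[j - 2] if j > 1 else ZERO
            diag = self.row[j - 2] if j > 1 else ZERO
            cell = self._cell(x, y, left, self.row[j - 1], diag, (i, j))
            new_row.append(cell)
            self._check(cell, i, j, found)
        self.row = new_row
        self.col.append(new_row[-1] if new_row else ZERO)
        return found

    def add_y(self, y):
        """Steps S21 to S26: a new element of Y adds one column of cells."""
        self.ys.append(y)
        j, new_col, found = len(self.ys), [], []
        for i, x in enumerate(self.xs, start=1):
            up = new_col[i - 2] if i > 1 else ZERO
            diag = self.col[i - 2] if i > 1 else ZERO
            cell = self._cell(x, y, self.col[i - 1], up, diag, (i, j))
            new_col.append(cell)
            self._check(cell, i, j, found)
        self.col = new_col
        self.row.append(new_col[-1] if new_col else ZERO)
        return found
```

Each call returns the tuples (i_s, i_e, j_s, j_e, s′) detected for the newly arrived element. The two streams may be fed in any interleaving, since every cell (i, j) is computed exactly once, when the later of x_i and y_j arrives.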

Next, a specific example of the processing in the flowchart of FIG. 2 is described. FIGS. 3 to 6 illustrate this example. It assumes that similar partial sequence pairs are to be detected from the two data streams X = (5, 12, 6, 10, 6, 5, 1) and Y = (11, 6, 9, 4, 13, 8, 5) under the conditions ε = 3 and ζ = 4. As stated above, each element (cell) of the matrix (time warping matrix) holds a similarity score and the start positions of the partial sequences X and Y used to calculate it; in the following, however, each cell is described as holding the average similarity score and the start position, to simplify the explanation.

Processing by the similar partial sequence detection device 1 is executed each time one element of data stream X or Y arrives. First, FIG. 3 shows the matrix at the point i = 4, j = 5, that is, when X = (5, 12, 6, 10) and Y = (11, 6, 9, 4, 13) have already been received. Each cell holds the average similarity score s′ (upper value) and the start point p(i, j) from which it was calculated (lower value). For example, the average similarity score s′ of cell (4, 3) is 4, which is the average similarity score of the partial sequences X[2:4] and Y[1:3].

When x_5 = 6 arrives at time i = 5, steps S11 to S16 of FIG. 2 are executed and the values of the shaded cells in FIG. 4 are calculated. FIG. 4 shows the matrix at the point i = 5, j = 5. At the circled cell (5, 4), the pair of partial sequences X[2:5] and Y[1:4], whose average score is at least ε (= 3) and whose lengths are both at least ζ (= 4), is detected as a similar partial sequence pair.

データストリームＹの要素が到着した場合には、図２のステップＳ２１〜Ｓ２６の処理が実行される。データストリームＸおよびＹの各要素が到着した時点でこれらの処理が繰り返され、最終的なマトリックスは図５のようになる。図５はｉ＝７，ｊ＝７の時点におけるマトリックスである（各数値の記載は省略）。図５に示すように、最終的に、丸印で囲った５つのセルに対応する部分シーケンスペアが類似部分シーケンスペアとして検出される。 When the element of the data stream Y arrives, the processing of steps S21 to S26 in FIG. 2 is executed. These processes are repeated when the elements of the data streams X and Y arrive, and the final matrix is as shown in FIG. FIG. 5 is a matrix at the time point of i = 7 and j = 7 (the description of each numerical value is omitted). As shown in FIG. 5, finally, partial sequence pairs corresponding to the five cells surrounded by circles are detected as similar partial sequence pairs.

The naive method uses O(nm² + n²m) memory to detect similar partial sequence pairs and must update O(nm²) distance values per time unit when an element of X arrives, or O(n²m) when an element of Y arrives. In this embodiment, by contrast, similar partial sequence pairs can be detected with a single matrix, so only O(m + n) memory is used and only O(m) values (when an element of X arrives) or O(n) values (when an element of Y arrives) need to be updated per time unit. The amount of computation (computation time) and memory usage are therefore greatly reduced.
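As a rough illustration of this gap, take n = m = 1000 (an assumed size chosen only for the arithmetic, not a figure from the experiments) and ignore constant factors:

\[
nm^2 + n^2 m = 10^3 \cdot 10^6 + 10^6 \cdot 10^3 = 2 \times 10^9,
\qquad
m + n = 2 \times 10^3 .
\]

Under this assumption the naive memory bound is about 10^6 times larger than that of the single-matrix method.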

≪Second Embodiment≫
Next, a second embodiment is described. With the similar partial sequence detection device 1 of the first embodiment, the similar partial sequence pairs that are output include pairs whose time warping paths overlap at least in part. For example, five similar partial sequence pairs are detected in FIG. 5: X[1:5] and Y[4:7], X[1:6] and Y[4:7], X[2:5] and Y[1:4], X[2:6] and Y[1:4], and X[2:7] and Y[1:4]. Their time warping paths are shown in FIG. 6, which depicts the time warping paths of the similar partial sequence pairs detected in FIG. 5.

前記したように、タイムワーピングパスとは、２つの部分シーケンスのどの要素同士がマッチングしているのかを示すものである。図６から確認できるように、５つのシーケンスペアは大きく２つのグループに分類できる、すなわち、５つのタイムワーピングパスは異なる２つのグループに分類することができる。そこで、第２実施形態の類似部分シーケンス検出装置１では、タイムワーピングパスが重複する複数の類似部分シーケンスペアから、タイムワーピングパスが最長の類似部分シーケンスペアを１つ選択して検出する。ユーザヘの報知（データ出力）はその選択されたタイムワーピングパスが最長の類似部分シーケンスペアのみとすることにより、ユーザが重複する情報を受け取る事態を回避できる。図５の例では、Ｘ［１：６］，Ｙ［４：７］のペア、および、Ｘ［２：７］，Ｙ［１：４］のペアが、出力されるタイムワーピングパスが最長の類似部分シーケンスペアとなる。 As described above, the time warping path indicates which elements of the two partial sequences are matched with each other. As can be seen from FIG. 6, the five sequence pairs can be broadly classified into two groups, that is, the five time warping paths can be classified into two different groups. Therefore, the similar partial sequence detection apparatus 1 according to the second embodiment selects and detects one similar partial sequence pair with the longest time warping path from a plurality of similar partial sequence pairs with overlapping time warping paths. By notifying the user (data output) only by using the similar partial sequence pair having the longest selected time warping path, it is possible to avoid a situation where the user receives duplicate information. In the example of FIG. 5, the pair of X [1: 6], Y [4: 7] and the pair of X [2: 7], Y [1: 4] has the longest output time warping path. Similar partial sequence pairs.

第２実施形態の類似部分シーケンス検出装置１について、第１実施形態の場合との相違点を中心に説明する。第２実施形態の類似部分シーケンス検出装置１は、タイムワーピングパスの少なくとも一部が重複する複数の類似部分シーケンスペアから、最長の類似部分シーケンスペアを１つ選択して検出することを特徴とする。以下の説明において、前記した第１実施形態と同様の構成要素には同じ符号を付して、説明を省略する。 The similar partial sequence detection device 1 of the second embodiment will be described focusing on the differences from the case of the first embodiment. The similar partial sequence detection apparatus 1 according to the second embodiment is characterized by selecting and detecting one longest similar partial sequence pair from a plurality of similar partial sequence pairs in which at least a part of the time warping path overlaps. . In the following description, the same components as those in the first embodiment described above are denoted by the same reference numerals and description thereof is omitted.

類似部分シーケンス検出装置１は、図１に示すように、記憶部１３に類似部分シーケンス候補記憶部１３３を備える。この類似部分シーケンス候補記憶部１３３は、ｓ’≧εかつＬ_ｘ≧ζかつＬ_ｙ≧ζの条件を満たす類似部分シーケンスペアの情報を１つ以上記録（蓄積）する。 As illustrated in FIG. 1, the similar partial sequence detection device 1 includes a similar partial sequence candidate storage unit 133 in the storage unit 13. The similar partial sequence candidate storage unit 133 records (accumulates) one or more pieces of information of similar partial sequence pairs that satisfy the conditions of s ′ ≧ ε, L _x ≧ ζ, and L _y ≧ ζ.

このような類似部分シーケンス検出装置１の処理手順を、図７を用いて説明する。図７は、第２実施形態の類似部分シーケンス検出装置の処理を示すフローチャートである。 The processing procedure of such a similar partial sequence detection apparatus 1 will be described with reference to FIG. FIG. 7 is a flowchart showing processing of the similar partial sequence detection device of the second embodiment.

データストリーム処理部１２１は、ステップＳ１５の後、ｉ_ｓ,ｉ_ｅ,ｊ_ｓ, ｊ_ｅ,ｓ’の値を類似部分シーケンスペアの候補として、記憶部１３の類似部分シーケンス候補記憶部１３３に記憶させる（ステップＳ１７）。ステップＳ２７についても同様である。 After step S15, the data stream processing unit 121 stores the values of i _s , i _e , j _s , j _e , and s ′ as similar partial sequence pair candidates in the similar partial sequence candidate storage unit 133 of the storage unit 13. (Step S17). The same applies to step S27.

データストリーム処理部１２１は、ステップＳ１３〜Ｓ１７のループの後、あるいは、ステップＳ２３〜Ｓ２７のループの後、タイムワーピングパスの少なくとも一部が重複する複数の類似部分シーケンスペアのうち、タイムワーピングパスが最長となる類似部分シーケンスペアが決定したか否か判断する（ステップＳ４１）。この判断は、タイムワーピング行列（マトリックス）の各セルの開始位置の情報を使用することで実行できる。つまり、例えば、タイムワーピングパスの少なくとも一部が重複する類似部分シーケンスペアが複数あった場合、マトリックスにおいて逐次更新するすべてのセルの開始位置の情報がそれらの複数の類似部分シーケンスペアのタイムワーピングパスの終了位置よりも後ろになっていれば、重複する類似部分シーケンスペアはもうそれ以上ないことになる。 After the loop of steps S13 to S17 or the loop of steps S23 to S27, the data stream processing unit 121 has a time warping path among a plurality of similar partial sequence pairs in which at least a part of the time warping path overlaps. It is determined whether the longest similar partial sequence pair has been determined (step S41). This determination can be performed by using the information on the start position of each cell in the time warping matrix. That is, for example, when there are a plurality of similar partial sequence pairs in which at least a part of the time warping path overlaps, the information on the start positions of all cells sequentially updated in the matrix is the time warping path of the plurality of similar partial sequence pairs. If it is after the end position, there are no more similar partial sequence pairs that overlap.

ステップＳ４１でＹｅｓのとき、データストリーム処理部１２１は、タイムワーピングパスが最長となる類似部分シーケンスペアを類似部分シーケンス出力部１４１経由でユーザに対して出力（ステップＳ４２）し、ステップＳ３１に移る。 When Yes in step S41, the data stream processing unit 121 outputs the similar partial sequence pair having the longest time warping path to the user via the similar partial sequence output unit 141 (step S42), and proceeds to step S31.
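The selection made in steps S41 and S42 can be sketched as follows. This is a simplified sketch: it treats candidates that share the same start position (i_s, j_s) as having overlapping time warping paths, which is an assumption made for brevity, and it does not implement the test of step S41 for deciding when a group is complete. The names `Candidate` and `longest_per_group` are hypothetical, and `path_length` stands for W, assumed to be stored with each candidate.

```python
from typing import Dict, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    i_s: int
    i_e: int
    j_s: int
    j_e: int
    path_length: int  # W, assumed to be stored with each candidate


def longest_per_group(candidates: List[Candidate]) -> List[Candidate]:
    """Keep, for each group of overlapping candidates, the longest path."""
    best: Dict[Tuple[int, int], Candidate] = {}
    for c in candidates:
        key = (c.i_s, c.j_s)  # simplification: same start = overlapping path
        if key not in best or c.path_length > best[key].path_length:
            best[key] = c
    return list(best.values())
```

Applied to the five pairs of FIG. 5, with any path lengths that grow as the end position advances, this grouping leaves one pair per start position.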

このようにして、第２実施形態の類似部分シーケンス検出装置１によれば、タイムワーピングパスの少なくとも一部が重複する類似部分シーケンスペアが複数あった場合、その中でタイムワーピングパスが最長の類似部分シーケンスを出力するので、ユーザに対して、冗長な情報を与えることなく、より有益な情報（タイムワーピングパスが最長の類似部分シーケンス）を提供できる。 In this way, according to the similar partial sequence detection device 1 of the second embodiment, when there are a plurality of similar partial sequence pairs in which at least a part of the time warping path overlaps, the similarity with the longest time warping path is among them. Since the partial sequence is output, more useful information (similar partial sequence with the longest time warping path) can be provided to the user without giving redundant information.

また、第１実施形態および第２実施形態の類似部分シーケンス検出方法を実行させるための類似部分シーケンス検出プログラムを作成すれば、そのプログラムを実行する一般的なコンピュータが類似部分シーケンス検出装置１として動作できる。 If a similar partial sequence detection program for executing the similar partial sequence detection method of the first embodiment and the second embodiment is created, a general computer that executes the program operates as the similar partial sequence detection device 1. it can.

≪Experimental Results≫
Next, experimental results of processing with the similar partial sequence detection device 1 of the first embodiment are described. Scatter plots are used here so that the results can be grasped visually. For example, FIG. 8 is a scatter plot showing the result of detecting similar partial sequence pairs from sequences #1 and #2 of FIG. 13. The scatter plot of FIG. 8 reflects the end elements i_e and j_e of the similar partial sequence pairs X[i_s : i_e] and Y[j_s : j_e] in sequences #1 and #2. That is, the horizontal axis of FIG. 8 represents the elements of X and the vertical axis the elements of Y, and when the partial sequences X[i_s : i_e] and Y[j_s : j_e] are similar, a point is plotted at position (i_e, j_e).

In FIG. 8, each of the two regions enclosed by solid lines contains the detected small spikes, and the region enclosed by a broken line contains the detected large spikes. FIG. 8 makes it possible to confirm the periodicity (interval and duration of occurrence) of similar partial sequence pairs between the two sequences.

For example, the small spikes #11 and #22 are similar, and this relationship appears at the lower-left corner of the left solid-line region of the scatter plot. Within the solid-line regions, six positions representing correspondences between small spikes are plotted at regular intervals, showing that small spikes similar to #11 and #22 appear periodically. Although the intervals between the large spikes and between the small spikes differ in sequences #1 and #2, the presence of similar partial sequence pairs and their periodicity can be confirmed.
The scatter plots of FIGS. 9 to 11 below are created in the same way.

次に、実データを用いた実験結果を示す。各実験は、２ＧＢのメモリ、３ＧＨｚのＣＰＵを搭載したコンピュータ上で実施した。 Next, experimental results using actual data are shown. Each experiment was carried out on a computer equipped with a 2 GB memory and a 3 GHz CPU.

In FIG. 9, (a) and (b) show artificial data (Sines) composed of multiple sine waves with white noise (vertical axis: value; horizontal axis: time), and (c) is a scatter plot of the experimental results.
As shown in FIGS. 9(a) and 9(b), Sines1 and Sines2 differ both in the period of the sine waves they contain and in the interval at which the sine waves appear. As FIG. 9(c) shows, the similar partial sequence detection device 1 of this embodiment identifies all of the sine waves and their time-varying periodicity. Sines1 contains six sine waves and Sines2 contains five, and the scatter plot contains 30 (= 6 × 5) plot groups. In the scatter plot of FIG. 9(c), differences in the period of the sine waves appear as differences in slope.

In FIG. 10, (a) and (b) show temperature sensor readings (Temperature) in degrees Celsius (vertical axis: temperature; horizontal axis: time), and (c) is a scatter plot of the experimental results.

The data in FIGS. 10(a) and 10(b) were acquired at intervals of about one minute, and readings are missing at many time points. Temperature1 and Temperature2 each contain two occurrences of a pattern in which two weather-driven temperature swings, from about 18 degrees to 32 degrees, appear in succession (see the broken-line portions). As FIG. 10(c) shows, the similar partial sequence detection device 1 of this embodiment succeeds in finding these two patterns in each stream: the scatter plot of FIG. 10(c) contains four (= 2 × 2) plot groups.

図１１において、（ａ）と（ｂ）はモーションキャプチャデータ（Mocap）を示す図であり、（ｃ）は実験結果の散布図である。この２つのデータ（Mocap１とMocap２）は、被験者の各部位にマーカーを取り付け、取り付けた部位の角速度を１秒間に１２０回というサンプリング周期で測定したモーションキャプチャデータである。表の“Sec.”が実際のモーション時間に対応しており、被験者があるモーションを連続して行っていることを示す。 In FIG. 11, (a) and (b) are diagrams showing motion capture data (Mocap), and (c) is a scatter diagram of experimental results. These two data (Mocap1 and Mocap2) are motion capture data obtained by attaching a marker to each part of the subject and measuring the angular velocity of the attached part at a sampling period of 120 times per second. “Sec.” In the table corresponds to the actual motion time, and indicates that the subject is performing a certain motion continuously.

例えば、Mocap１では、被験者がwalking-running-jumping-…の順にモーションを行ったことを意味する。本実験では、二の腕・肘下・太もも・ふくらはぎの各部位に左右対称に取り付けられたマーカーから取得された８個の角速度データを選択し、８次元データをして使用している。これらのデータにはwalkingモーションが含まれており（表の色付けした部分）、各walkingモーションのデータ長は異なっている。図１１（ｃ）に示すように、本実施形態の類似部分シーケンス検出装置１では、walkingモーションがすべて検出されている様子が散布図から確認できる。つまり、Mocap１にはwalkingが３つ、Mocap２にもwalkingが３つあり、散布図には９個（＝３×３）のプロット群が存在している。 For example, in Mocap1, it means that the subject performed a motion in the order of walking-running-jumping-. In this experiment, eight pieces of angular velocity data acquired from markers attached symmetrically to the respective parts of the upper arm, the elbow, the thigh, and the calf are selected and used as eight-dimensional data. These data include walking motion (colored portion of the table), and the data length of each walking motion is different. As shown in FIG. 11C, in the similar partial sequence detection apparatus 1 of the present embodiment, it can be confirmed from the scatter diagram that all walking motions are detected. That is, there are three walkings in Mocap1, three walkings in Mocap2, and there are nine (= 3 × 3) plot groups in the scatter diagram.

Note that in some places the "Time" values in Mocap1 and Mocap2 appear to be slightly offset from the positions of the plot groups in the scatter diagram. This occurs because the subject's motion changes are a mixture of continuous transitions (no pause) and discontinuous transitions (with a pause), and detection continues until the next motion begins. In other words, the similarity score decreases at a different rate when the next motion starts immediately after a motion ends than when the subject's movement stops briefly (for one or two seconds). The offset is therefore unrelated to the accuracy of the present method.

FIG. 12 compares the calculation time of the method according to the present embodiment with that of the naive method. Artificial data is used, and the sequence lengths n and m (horizontal axis) are varied. In FIG. 12, the vertical axis shows the calculation time, which is the average of the total time for updating the matrix and detecting similar partial sequence pairs. The experimental results show that the method according to the present embodiment performs far better than the naive method, that is, it achieves faster processing. The computational complexity of the naive method is O(nm^2 + n^2 m), because the number of matrices is O(nm), whereas that of the method according to the present embodiment is O(m + n), because only a single matrix is used. This is a substantial reduction.
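To make the gap between the two growth rates concrete, the following Python sketch evaluates the two complexity expressions quoted above for a few sequence lengths. It is only an illustration: constant factors are ignored, the function names are hypothetical, and the values are operation-count estimates, not the measured times of FIG. 12.

```python
def naive_cost(n: int, m: int) -> int:
    """Growth term of the naive method, O(n*m^2 + n^2*m), constants ignored."""
    return n * m**2 + n**2 * m


def single_matrix_cost(n: int, m: int) -> int:
    """Growth term of the single-matrix method, O(m + n), constants ignored."""
    return m + n


if __name__ == "__main__":
    # Compare the two growth terms for equal sequence lengths n = m.
    for length in (10, 100, 1000):
        naive = naive_cost(length, length)
        single = single_matrix_cost(length, length)
        print(f"n = m = {length}: naive ~ {naive}, single matrix ~ {single}")
```

For n = m = 1000, the naive term is 2 × 10^9 while the single-matrix term is 2000, which is consistent with the large difference described for FIG. 12.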

The embodiments of the present invention have been described above; however, the present invention is not limited to them and can be practiced within a scope that does not depart from its gist.

In addition, the specific hardware and software configurations may be modified as appropriate without departing from the gist of the present invention.

Data streams arise in a variety of fields such as video, sensor networks, and finance. The present invention is applicable to all of these fields.

FIG. 1 is a block diagram of the similar partial sequence detection device of the first embodiment.
FIG. 2 is a flowchart showing the processing of the similar partial sequence detection device of the first embodiment.
FIG. 3 is a diagram showing an example of a time warping matrix.
FIG. 4 is a diagram showing an example of a time warping matrix.
FIG. 5 is a diagram showing an example of a time warping matrix.
FIG. 6 is a diagram showing an example of a time warping matrix.
FIG. 7 is a flowchart showing the processing of the similar partial sequence detection device of the second embodiment.
FIG. 8 is a scatter diagram showing the result of detecting similar partial sequence pairs from the sequences of FIG. 13.
FIG. 9: (a) and (b) show artificial data composed of a plurality of sine waves with white noise, and (c) is a scatter diagram of the experimental results.
FIG. 10: (a) and (b) show the measured values of temperature sensors, and (c) is a scatter diagram of the experimental results.
FIG. 11: (a) and (b) show motion capture data, and (c) is a scatter diagram of the experimental results.
FIG. 12 is a diagram comparing the calculation time of the method according to the present embodiment with that of the naive method.
FIG. 13 is an example of data streams used for detecting similar partial sequence pairs.
FIG. 14 is an explanatory diagram of DTW.
FIG. 15 is a diagram illustrating a time warping matrix obtained by DTW.

Explanation of symbols

1 Similar partial sequence detection device
11 Input unit
12 Processing unit
13 Storage unit
14 Output unit
111 Detection condition input unit
112 Data stream input unit
121 Data stream processing unit
131 Detection condition storage unit
132 Time warping data storage unit
133 Similar partial sequence candidate storage unit
141 Similar partial sequence output unit

Claims

A similar partial sequence detection method performed by a similar partial sequence detection device that detects a pair of similar partial sequences from two data streams by using a time warping matrix indicating a similarity score between two partial sequences, wherein
the similar partial sequence detection device includes a storage unit that stores the time warping matrix, and a processing unit, and
the processing unit:
upon receiving one element of data of either of the two data streams, calculates a similarity score between a partial sequence of the data stream that includes the element and a partial sequence of the other data stream;
stores, in the time warping matrix of the storage unit, the calculated similarity score in association with the start positions and end positions of the two partial sequences used to calculate the similarity score; and
detects and outputs a pair of similar partial sequences by using the similarity scores stored in the time warping matrix of the storage unit.

The similar partial sequence detection method according to claim 1, wherein
the processing unit, when calculating the similarity score between the partial sequences of the two data streams,
calculates the similarity score $S(X[i_s : i_e], Y[j_s : j_e])$ between two partial sequences $X[i_s : i_e]$ and $Y[j_s : j_e]$ by the following equations (1) to (5),
$$S(X[i_s : i_e], Y[j_s : j_e]) = s(i_e, j_e) \tag{1}$$
$$s(i, j) = \max\{0,\; 2\varepsilon - \|x_i - y_j\| + s_{best}\} \tag{2}$$
$$s_{best} = \max\{s(i, j-1),\; s(i-1, j),\; s(i-1, j-1)\} \tag{3}$$
$$s(i, 0) = 0 \tag{4}$$
$$s(0, j) = 0 \tag{5}$$
calculates the start position $p(i, j)$, which is stored in the time warping matrix and indicates the start position $i_s$ of the partial sequence $X[i_s : i_e]$ and the start position $j_s$ of the partial sequence $Y[j_s : j_e]$, by the following equation (7),
$$p(i, j) = \begin{cases} p(i, j-1) & (s_{best} = s(i, j-1)) \\ p(i-1, j) & (s_{best} = s(i-1, j)) \\ p(i-1, j-1) & (s_{best} = s(i-1, j-1)) \\ (i, j) & (s_{best} = 0) \end{cases} \tag{7}$$
and calculates the start positions $i_s$, $j_s$ by the following equation (8),
$$(i_s, j_s) = p(i_e, j_e) \tag{8}$$
where $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, and $\|x_i - y_j\|$ denotes the distance between $x_i$ and $y_j$.
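The recurrence of equations (2) to (5) and the start-position rule of equation (7) can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not part of the claim: stream elements are scalars, the distance is the absolute difference, the whole matrix is filled in one batch instead of being updated as each element arrives, and the function name `update_scores` is hypothetical.

```python
def update_scores(X, Y, eps):
    """Fill score matrix s (eqs. 2-5) and start-position matrix p (eq. 7).

    Indices are 1-based as in the claim; row 0 and column 0 hold the
    boundary values s(i, 0) = s(0, j) = 0.
    """
    n, m = len(X), len(Y)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    p = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Neighbours in the order listed in equation (3).
            neighbours = [
                (s[i][j - 1], p[i][j - 1]),
                (s[i - 1][j], p[i - 1][j]),
                (s[i - 1][j - 1], p[i - 1][j - 1]),
            ]
            # max() returns the first maximal entry, so ties follow that order.
            s_best, p_best = max(neighbours, key=lambda c: c[0])
            # Assumed distance: absolute difference of scalar elements.
            dist = abs(X[i - 1] - Y[j - 1])
            s[i][j] = max(0.0, 2 * eps - dist + s_best)  # equation (2)
            # Equation (7): start a new path here if no neighbour has a score.
            p[i][j] = (i, j) if s_best == 0 else p_best
    return s, p


if __name__ == "__main__":
    s, p = update_scores([5, 1, 2, 3, 9], [0, 1, 2, 3, 7], eps=0.5)
    # Equations (1) and (8): score and start position for end position (4, 4).
    print(s[4][4], p[4][4])
```

In this toy input the shared run 1, 2, 3 occupies positions 2 to 4 of both sequences, so the cell (4, 4) accumulates a score of 3.0 and its stored start position is (2, 2).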

The similar partial sequence detection method according to claim 2, wherein
the processing unit, when detecting and outputting a pair of similar partial sequences by using the similarity scores stored in the time warping matrix of the storage unit,
calculates the average value $s'$ of the similarity score by the following equation (6),
$$s' = s(i, j) / W \tag{6}$$
and, when the calculated average value $s'$ is equal to or greater than a predetermined threshold $\varepsilon$ and the lengths of the two partial sequences used to calculate the similarity score are both equal to or greater than a predetermined length $\zeta$, detects and outputs the two partial sequences as a pair of similar partial sequences,
where $W$ denotes the length of the time warping path between the partial sequence $X[i_s : i_e]$ and the partial sequence $Y[j_s : j_e]$.
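The detection condition built on equation (6) can be sketched in Python as follows. This is an illustration only: the function name `is_similar_pair` and its parameters are hypothetical, and the time warping path length W is assumed to be supplied by the caller, since the claim does not say how it is tracked.

```python
def is_similar_pair(score, path_length, start, end, eps, zeta):
    """Return True when the pair meets the threshold and length conditions.

    score       -- similarity score s(i_e, j_e)
    path_length -- W, length of the time warping path (assumed to be given)
    start, end  -- (i_s, j_s) and (i_e, j_e), 1-based inclusive positions
    eps, zeta   -- score threshold and minimum partial sequence length
    """
    i_s, j_s = start
    i_e, j_e = end
    average = score / path_length  # equation (6)
    length_x = i_e - i_s + 1       # length of X[i_s : i_e]
    length_y = j_e - j_s + 1       # length of Y[j_s : j_e]
    return average >= eps and length_x >= zeta and length_y >= zeta


if __name__ == "__main__":
    # Score 3.0 over a path of 3 cells, both partial sequences of length 3.
    print(is_similar_pair(3.0, 3, (2, 2), (4, 4), eps=0.5, zeta=3))
```

Both conditions have to hold: a high average score over a very short path is rejected by the length test, and a long path with a low average score is rejected by the threshold test.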

The similar partial sequence detection method, wherein
the processing unit:
when the calculated average value $s'$ of the similarity score is equal to or greater than the predetermined threshold $\varepsilon$ and the lengths of the two partial sequences used for the calculation are both equal to or greater than the predetermined length $\zeta$,
stores the two partial sequences in the storage unit as a similar partial sequence pair candidate; and
when the time warping paths of similar partial sequence pair candidates stored in the storage unit overlap at least in part, selects, from among the overlapping candidates, the candidate whose time warping path is the longest, and outputs the two partial sequences of the selected candidate as a pair of similar partial sequences.
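One possible reading of the overlap rule is sketched below in Python. It is an assumption, not the procedure stated in the claim: candidates are visited from the longest time warping path to the shortest, and a candidate is kept only if its path shares no matrix cell with a candidate already kept. The dictionary layout and the function name `select_longest_non_overlapping` are hypothetical.

```python
def select_longest_non_overlapping(candidates):
    """Keep, among overlapping candidates, the one with the longest path.

    Each candidate is a dict with a "path" entry: a list of (i, j) cells of
    the time warping matrix visited by its time warping path.
    """
    kept = []
    # Stable sort: longer paths first, ties keep their original order.
    ordered = sorted(candidates, key=lambda c: len(c["path"]), reverse=True)
    for cand in ordered:
        cells = set(cand["path"])
        # Keep only if no cell is shared with an already selected path.
        if all(cells.isdisjoint(k["path"]) for k in kept):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    a = {"name": "a", "path": [(1, 1), (2, 2), (3, 3)]}
    b = {"name": "b", "path": [(2, 2), (3, 3)]}  # overlaps a, shorter
    c = {"name": "c", "path": [(7, 8), (8, 9)]}  # overlaps nothing
    print([cand["name"] for cand in select_longest_non_overlapping([a, b, c])])
```

With these inputs, candidate b is dropped because its path overlaps the longer path of a, while c is kept because it overlaps nothing.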

A similar partial sequence detection program for causing a computer to execute the similar partial sequence detection method according to any one of claims 1 to 4.

A similar partial sequence detection device that detects a pair of similar partial sequences from two data streams by using a time warping matrix indicating a similarity score between two partial sequences, the device comprising:
a storage unit that stores the time warping matrix; and
a processing unit that,
upon receiving one element of data of either of the two data streams, calculates a similarity score between a partial sequence of the data stream that includes the element and a partial sequence of the other data stream,
stores, in the time warping matrix of the storage unit, the calculated similarity score in association with the start positions and end positions of the two partial sequences used to calculate the similarity score, and
detects and outputs a pair of similar partial sequences by using the similarity scores stored in the time warping matrix of the storage unit.

The similar partial sequence detection device according to claim 6, wherein
the processing unit, when calculating the similarity score between the partial sequences of the two data streams,
calculates the similarity score $S(X[i_s : i_e], Y[j_s : j_e])$ between two partial sequences $X[i_s : i_e]$ and $Y[j_s : j_e]$ by the following equations (1) to (5),
$$S(X[i_s : i_e], Y[j_s : j_e]) = s(i_e, j_e) \tag{1}$$
$$s(i, j) = \max\{0,\; 2\varepsilon - \|x_i - y_j\| + s_{best}\} \tag{2}$$
$$s_{best} = \max\{s(i, j-1),\; s(i-1, j),\; s(i-1, j-1)\} \tag{3}$$
$$s(i, 0) = 0 \tag{4}$$
$$s(0, j) = 0 \tag{5}$$
calculates the start position $p(i, j)$, which is stored in the time warping matrix and indicates the start position $i_s$ of the partial sequence $X[i_s : i_e]$ and the start position $j_s$ of the partial sequence $Y[j_s : j_e]$, by the following equation (7),
$$p(i, j) = \begin{cases} p(i, j-1) & (s_{best} = s(i, j-1)) \\ p(i-1, j) & (s_{best} = s(i-1, j)) \\ p(i-1, j-1) & (s_{best} = s(i-1, j-1)) \\ (i, j) & (s_{best} = 0) \end{cases} \tag{7}$$
and calculates the start positions $i_s$, $j_s$ by the following equation (8),
$$(i_s, j_s) = p(i_e, j_e) \tag{8}$$
where $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, and $\|x_i - y_j\|$ denotes the distance between $x_i$ and $y_j$.

The similar partial sequence detection device according to claim 7, wherein
the processing unit, when detecting and outputting a pair of similar partial sequences by using the similarity scores stored in the time warping matrix of the storage unit,
calculates the average value $s'$ of the similarity score by the following equation (6),
$$s' = s(i, j) / W \tag{6}$$
and, when the calculated average value $s'$ is equal to or greater than a predetermined threshold $\varepsilon$ and the lengths of the two partial sequences used to calculate the similarity score are both equal to or greater than a predetermined length $\zeta$, detects and outputs the two partial sequences as a pair of similar partial sequences,
where $W$ denotes the length of the time warping path between the partial sequence $X[i_s : i_e]$ and the partial sequence $Y[j_s : j_e]$.

The similar partial sequence detection device, wherein
the processing unit:
when the calculated average value $s'$ of the similarity score is equal to or greater than the predetermined threshold $\varepsilon$ and the lengths of the two partial sequences used for the calculation are both equal to or greater than the predetermined length $\zeta$,
stores the two partial sequences in the storage unit as a similar partial sequence pair candidate; and
when the time warping paths of similar partial sequence pair candidates stored in the storage unit overlap at least in part, selects, from among the overlapping candidates, the candidate whose time warping path is the longest, and outputs the two partial sequences of the selected candidate as a pair of similar partial sequences.