JP2013164811A - Data retrieval device, data retrieval method, and data retrieval program

Info

Publication number: JP2013164811A
Application number: JP2012028791A
Authority: JP
Inventors: Hiroshi Kurasawa; 央倉沢; Hiroshi Sato; 浩史佐藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-02-13
Filing date: 2012-02-13
Publication date: 2013-08-22
Anticipated expiration: 2032-02-13
Also published as: JP5632862B2

Abstract

PROBLEM TO BE SOLVED: To provide a data retrieval device, a data retrieval method, and a data retrieval program capable of reducing the distance calculation cost required to retrieve pairs of similar partial time-series data from multiple series of time-series data that have flowed in within a predetermined time from the past to the present. SOLUTION: The device employs a data structure in which partial time-series data with a new reception time are easy to insert and partial time-series data with an old reception time are easy to delete, together with an algorithm that prunes the partial time-series data while distinguishing them by reception time.

Description

The present invention relates to a data search device, a data search method, and a data search program for finding pairs of similar partial time-series data among multiple series of time-series data that have flowed in within a fixed period from the past up to the present.

Time-series data is a series of data observed continuously over time. It is used in environmental observation, equipment monitoring, stock prices, statistics, media processing, and other fields.

Time-series data search falls into two types: search over data accumulated in advance, and search over data that flows in sequentially. An example of the former is searching past precipitation data, observed at regular intervals, for similar patterns. An example of the latter is monitoring stock price movements in real time and issuing a notification the moment they match a target pattern. The present invention deals with the latter, real-time kind of processing.

Prior art in real-time time-series data search includes the following. Some search processes find pairs of similar time-series data among multiple series of time-series data at the same time (see, for example, Patent Document 1 and Non-Patent Document 1). Patent Document 1 addresses finding pairs whose similarity exceeds a threshold, and Non-Patent Document 1 addresses finding pairs whose correlation coefficient is at or above a threshold. In this prior art, the computational cost of similarity determination grows with the number of series, so the technical problems addressed were dimensionality reduction, to reduce the cost of each similarity computation, and indexing, to reduce the number of similarity computations. However, this prior art is limited to time-series data whose end point is the current time. It therefore cannot be applied to searching for similar partial time-series data that are shifted in time relative to each other.

Patent Document 1: JP 2006-244389 A

Non-Patent Document 1: Abdullah Mueen, Suman Nath and Jie Liu, "Fast Approximate Correlation for Massive Time-series Data", In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.
Non-Patent Document 2: Yianilos, Peter N., "Data structures and algorithms for nearest neighbor search in general metric spaces", In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, pp. 311-321, 1993.

As described above, the prior art cannot handle a search that finds pairs of similar partial time-series data among multiple series of time-series data that have flowed in within a fixed period from the past up to the present. In other words, it cannot search multiple series of time-series data for similar partial time-series data that are shifted in time. One example of such a search is finding similar partial time-series data in one month of stock price time-series data for multiple stocks, in order to discover characteristic price movements that occur with a time lag. Another example is finding similar partial time-series data in the time-series data of multiple sensor values, in order to discover phenomena that occur with a time lag, such as solar radiation followed by ground heat and then air temperature. As noted above, there is prior art that targets multiple series of time-series data, and prior art that targets similar partial time-series data with a time lag, but no time-series data search technique that satisfies both has been invented.

A time-series data search that satisfies both conditions must find pairs of partial time-series data that are more similar than a threshold among multiple series of time-series data that have flowed in within a fixed period from the past up to the present. Such a search needs to keep the distance calculation cost low, so reducing the number of similarity computations is the technical problem to be solved.

Specifically, the number of similarity computations grows as the number of series increases, as the time span to be searched becomes longer, or as the partial time-series data become shorter relative to the length of the time-series data. To reduce the number of similarity computations, a search index is needed that narrows the candidates down to highly similar pairs of time-series data. The search indexes of the prior art targeted only time-series data whose end point is the current time. In a search that satisfies both conditions above, partial time-series data with different reception times are also search targets, so the computational cost of similarity determination becomes large.

The present invention has been made to solve the above problem. Its object is to provide a data search device, a data search method, and a data search program that can keep down the distance calculation cost required to find pairs of similar partial time-series data among multiple series of time-series data that have flowed in within a fixed period from the past up to the present.

The present invention adopts a data structure in which partial time-series data with a new reception time are easy to insert and partial time-series data with an old reception time are easy to delete, together with an algorithm that prunes partial time-series data while distinguishing them by reception time. Here, "pruning" means removing in advance data that need not be searched.

One data search device according to the present invention includes:
a receiving unit that receives multiple series of time-series data that flow in sequentially;
a first memory that stores the time-series data received by the receiving unit;
a section dividing unit that reads the time-series data from the first memory and divides it into partial time-series data of a predetermined duration;
a similarity determination unit that obtains the similarity between the partial time-series data produced by the section dividing unit;
an index unit that creates, based on the similarity obtained by the similarity determination unit, a similarity search index of the partial time-series data generated by the section dividing unit;
a second memory that stores the similarity search indexes created by the index unit in a queue structure, one per reception time; and
a similar pair search unit that, using the partial time-series data generated by the section dividing unit as a query, reads from the second memory, by means of the similarity search index, partial time-series data that are more similar to the query than a specified threshold.

One data search method according to the present invention divides multiple series of sequentially input time-series data into partial time-series data of a predetermined duration, obtains the similarity between the partial time-series data, creates a similarity search index based on that similarity and stores it in a queue structure for each reception time, and, at search time, uses the partial time-series data as a query to read out, by means of the similarity search index, partial time-series data that are more similar to the query than a specified threshold.

By creating a similarity search index for each reception time, the present invention speeds up both the deletion from the index of partial time-series data whose reception time falls outside the specified time range and the addition of new partial time-series data to the index, and it enables pruning that distinguishes partial time-series data by reception time. As a result, when the number of series is large, the invention achieves faster updating of the partial time-series data and fewer similarity computations.

Accordingly, the present invention can provide a data search device and a data search method that keep down the distance calculation cost required to find pairs of similar partial time-series data among multiple series of time-series data that have flowed in within a fixed period from the past up to the present.

Another data search device according to the present invention includes:
a receiving unit that receives multiple series of time-series data that flow in sequentially;
a first memory that stores the time-series data received by the receiving unit;
a section dividing unit that reads the time-series data from the first memory and divides it into partial time-series data of a predetermined duration;
a similarity determination unit that obtains the similarity between the partial time-series data produced by the section dividing unit;
an index unit that creates a similarity search index of the partial time-series data generated by the section dividing unit, the index being based on the similarity obtained by the similarity determination unit and having a structure in which the partial time-series data can be distinguished by reception time;
a second memory that stores the similarity search index created by the index unit and appends subsequent partial time-series data to the similarity search index as a queue while preserving the structure of the index; and
a similar pair search unit that, using the partial time-series data generated by the section dividing unit as a query, reads from the second memory, by means of the similarity search index, partial time-series data that are more similar to the query than a specified threshold.

Another data search method according to the present invention divides multiple series of sequentially input time-series data into partial time-series data of a predetermined duration, obtains the similarity between the partial time-series data, creates and stores a single similarity search index that is based on that similarity and has a structure in which the partial time-series data can be distinguished by reception time, appends subsequent partial time-series data to the similarity search index as a queue while preserving the structure of the index, and, at search time, uses the partial time-series data as a query to read out, by means of the similarity search index, partial time-series data that are more similar to the query than a specified threshold.

By giving the similarity search index a structure in which each piece of time-series data can be distinguished by reception time and then consolidating the indexes into one, the present invention removes the need to traverse, on every search, the multiple similarity search indexes created at the reception times within the specified time range. As a result, when the time span to be searched is long, or when the partial time-series data are short relative to the length of the time-series data, the invention reduces the number of similarity computations.

A data search program according to the present invention causes a computer to execute the data search method.

The present invention can provide a data search device and a data search method that keep down the distance calculation cost required to find pairs of similar partial time-series data among multiple series of time-series data that have flowed in within a fixed period from the past up to the present.

FIG. 1 is a conceptual diagram illustrating a similar partial time-series pair.
FIG. 2 is a diagram illustrating a data search device according to the present invention.
FIG. 3 is a diagram illustrating a data search method according to the present invention.
FIG. 4 is a conceptual diagram of how the data search device according to the present invention divides time-series data into partial time-series data.
FIG. 5 is a diagram illustrating the queue of similarity search indexes in the data search method according to the present invention.
FIG. 6 is a diagram illustrating the similarity search index structure in the data search method according to the present invention.
FIG. 7 is a conceptual diagram of pruning with the similarity search index in the data search method according to the present invention.
FIG. 8 is a diagram illustrating a data search device according to the present invention.
FIG. 9 is a diagram illustrating a data search method according to the present invention.
FIG. 10 is a diagram illustrating the queue of the similarity search index in the data search method according to the present invention.

Embodiments of the present invention are described below with reference to the accompanying drawings. The embodiments described below are examples of the present invention, and the present invention is not limited to them. In this specification and the drawings, components with the same reference numerals are identical to one another.

FIG. 1 is a conceptual diagram of a similar partial time-series pair. The devices of Embodiment 1 and Embodiment 2 search for pairs of partial time-series data that are more similar than a threshold.

(Embodiment 1)
FIG. 2 is a diagram illustrating the data search device 301 of this embodiment. The data search device 301 includes:
a receiving unit 11 that receives multiple series of time-series data that flow in sequentially;
a first memory 12 that stores the time-series data received by the receiving unit 11;
a section dividing unit 13 that reads the time-series data from the first memory 12 and divides it into partial time-series data of a predetermined duration;
a similarity determination unit 14 that obtains the similarity between the partial time-series data produced by the section dividing unit 13;
an index unit 15 that creates, based on the similarity obtained by the similarity determination unit 14, a similarity search index of the partial time-series data generated by the section dividing unit 13;
a second memory 16 that stores the similarity search indexes created by the index unit 15 in a queue structure, one per reception time; and
a similar pair search unit 17 that, using the partial time-series data generated by the section dividing unit 13 as a query, reads from the second memory 16, by means of the similarity search index, partial time-series data that are more similar to the query than a specified threshold.

FIG. 3 illustrates the data search method performed by the data search device 301. In this method, multiple series of sequentially input time-series data are divided into partial time-series data of a predetermined duration (steps S101 and S102); the similarity between the partial time-series data is obtained, and a similarity search index based on that similarity is created and stored in a queue structure for each reception time (step S103); and, at search time, the partial time-series data are used as a query to read out, by means of the similarity search index, partial time-series data that are more similar to the query than a specified threshold (step S104).

In step S101, the receiving unit 11 receives multiple series of time-series data that flow in sequentially and stores them in the first memory 12. If the capacity of the first memory 12 is limited, time-series data whose reception time falls outside the specified time range are deleted from the first memory 12. The minimum storage capacity required of the first memory 12 is enough to hold $n(m + w - 1)$ elements, where $m$ is the length of the partial time-series data to be searched, $w$ is the length of the time span to be searched, and $n$ is the number of series.
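The capacity bound can be illustrated with a small Python sketch. It assumes one fixed-size buffer per series and one new sample per series at each reception time; the class name `SeriesBuffer` and its methods are hypothetical and are not taken from the specification.

```python
from collections import deque


class SeriesBuffer:
    """Keeps only the most recent m + w - 1 samples of each of n series."""

    def __init__(self, n_series, m, w):
        # m: subsequence length, w: length of the searched time span (assumed in samples)
        self.capacity = m + w - 1
        self.buffers = [deque(maxlen=self.capacity) for _ in range(n_series)]

    def receive(self, samples):
        """Append one new sample per series; the oldest sample is dropped when full."""
        for buf, x in zip(self.buffers, samples):
            buf.append(x)

    def total_elements(self):
        """Number of stored elements over all series, at most n * (m + w - 1)."""
        return sum(len(buf) for buf in self.buffers)
```

With `m + w - 1` samples per series, each series holds exactly `w` overlapping subsequences of length `m`, which is why this is the smallest buffer that still covers the searched time span.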

In step S102, the section dividing unit 13 reads the time-series data from the first memory 12 and divides it into partial time-series data of the specified section duration. In this embodiment, partial time-series data means a contiguous run of data, of the specified section duration, taken from a temporally continuous data series that flows in sequentially. Each time new time-series data are received, the section dividing unit 13 cuts out the most recent partial time-series data of length $m$ from each series and inputs them to the index unit 15 and the similar pair search unit 17. It also cuts out and returns the partial time-series data of length $m$ for a time specified by the similar pair search unit 17. FIG. 4 illustrates how the section dividing unit 13 divides time-series data into partial time-series data. In the figure, $S^{i}_{t_x}$ denotes the partial time-series data of the series with ID $i$ at reception time $t_x$.
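The two operations of the section dividing unit can be sketched as follows. This is a minimal illustration that assumes each series is held as a plain Python list indexed by sample position; the function names are hypothetical.

```python
def latest_subsequences(series_by_id, m):
    """Return {series_id: most recent length-m subsequence}.

    Series that do not yet contain m samples are skipped.
    """
    return {
        sid: list(values[-m:])
        for sid, values in series_by_id.items()
        if len(values) >= m
    }


def subsequence_ending_at(values, end, m):
    """Return the length-m subsequence whose last sample is values[end]."""
    start = end - m + 1
    if start < 0 or end >= len(values):
        raise IndexError("subsequence does not fit inside the stored data")
    return list(values[start:end + 1])
```

The first function corresponds to cutting out the newest subsequence of every series on arrival, and the second to returning a subsequence for a time requested by the search side.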

In step S103, the similarity determination unit 14 calculates the similarity between partial time-series data. In this embodiment, similarity is a numerical index of how similar two partial time-series data are. Specifically, this embodiment adopts the Euclidean distance as the similarity Dist of two partial time-series data $S^{i}_{t_x}$ and $S^{j}_{t_y}$, defined as follows.

$$\mathrm{Dist}\left(S^{i}_{t_x}, S^{j}_{t_y}\right) = \sqrt{\sum_{k=1}^{m} \left(S^{i}_{t_x}[k] - S^{j}_{t_y}[k]\right)^{2}}$$

Here $S[k]$ denotes the $k$-th element of a partial time-series $S$ of length $m$.
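As a minimal sketch, the Euclidean distance between two equal-length subsequences can be written in Python as below. The function name `dist` mirrors the Dist of the text; the length check is an added assumption.

```python
import math


def dist(a, b):
    """Euclidean distance between two subsequences of the same length."""
    if len(a) != len(b):
        raise ValueError("subsequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A smaller value means the two subsequences are more alike, so "more similar than a threshold" corresponds to a distance at or below that threshold.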

Note that the similarity referred to in the present invention covers various distances having the above characteristics and is not limited to the Euclidean distance. Concrete examples include distances that express the similarity between time-series data, such as dynamic time warping, Earth Mover's Distance, edit distance, Hamming distance, Jaccard coefficient distance, cosine similarity, Manhattan distance, and Mahalanobis distance. Furthermore, the distance may be computed on partial time series whose offset or amplitude has been adjusted.

Also in step S103, the index unit 15 manages, in a queue structure in the second memory 16, the multiple similarity search indexes generated at the reception times within the specified time range. In this embodiment, a similarity search index is a data structure for finding the stored partial time-series data whose similarity to the query is within a threshold, without evaluating the similarity between the query and every stored partial time-series. The index unit 15 creates one new similarity search index from the partial time-series data of the latest reception time produced by the section dividing unit 13, and adds it to the queue of indexes in the second memory 16. At the same time, it deletes from that queue the similarity search indexes whose reception time falls outside the specified time range. FIG. 5 illustrates the queue of similarity search indexes.
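The queue management described here can be sketched in Python as follows. This is an illustration under assumptions: reception times are integers, the retained range is the last `w` reception times, and `build_index` is a hypothetical callable that turns the subsequences of one reception time into an index of any kind.

```python
from collections import deque


class IndexQueue:
    """Queue of per-reception-time indexes, oldest first."""

    def __init__(self, w, build_index):
        self.w = w                      # number of reception times to retain (assumed)
        self.build_index = build_index  # hypothetical index constructor
        self.queue = deque()            # entries are (reception_time, index)

    def push(self, t, subsequences):
        """Add the index for reception time t and drop indexes outside the range."""
        self.queue.append((t, self.build_index(subsequences)))
        while self.queue and self.queue[0][0] <= t - self.w:
            self.queue.popleft()

    def times(self):
        """Reception times currently covered by the queue."""
        return [t for t, _ in self.queue]
```

Because each index belongs to a single reception time, expiring old data is a constant-time removal from the front of the queue and never requires rebuilding the remaining indexes.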

[How to create the similarity search index]
The similarity search index in this embodiment adopts the Vantage Point Tree (VPT), a similarity search index for metric spaces (see Non-Patent Document 2), and is created by the following procedure. One reference partial time-series and one split distance are chosen, and the partial time-series data to be indexed are classified into two groups according to their distance from the reference partial time-series and the split distance; the index is built by repeating this step. As an example, consider classifying the set $S_t$ of partial time-series data at reception time $t$ into two groups. Choose from $S_t$ one reference partial time-series $S^{p}_{t}$, belonging to series $p$, and let $r^{p}_{t}$ be the distance that divides $S_t$ into two equal halves. Then $S_t$ is classified into two sets of partial time-series data, $S_{in}$ and $S_{out}$:

$$S_{in} = \left\{ S \in S_t \mid \mathrm{Dist}(S^{p}_{t}, S) \le r^{p}_{t} \right\}, \qquad S_{out} = \left\{ S \in S_t \mid \mathrm{Dist}(S^{p}_{t}, S) > r^{p}_{t} \right\}$$

For $S_{in}$ and $S_{out}$ in turn, a reference partial time-series and a split distance are chosen and the space is again classified into two; repeating this yields a tree-structured similarity search index. The index consists of two kinds of nodes, leaf nodes and internal nodes. A leaf node is a node at the end of the tree, and an internal node is any other node. The $S^{p}_{t}$ and $r^{p}_{t}$ needed at each classification are managed by creating an internal node. Once classification has been repeated down to sufficiently small sets of partial time-series data and the necessary internal nodes have been created, leaf nodes are created to manage the data of $S_t$. In short, internal nodes manage the information needed for classification ($S^{p}_{t}$, $r^{p}_{t}$), and leaf nodes manage sets of partial time-series data. FIG. 6 illustrates the similarity search index structure of this embodiment.
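A single classification step of this procedure can be sketched in Python as below. The sketch assumes the Euclidean distance, takes the median distance from the reference subsequence as the split distance, and places subsequences exactly at the split distance on the inner side; these choices and the function names are assumptions for illustration.

```python
import math
import statistics


def dist(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split(subsequences, pivot_index=0):
    """Split a set of subsequences into an inner and an outer half.

    Returns (pivot, radius, s_in, s_out), where radius is the median
    distance from the pivot, so the two halves are roughly equal in size.
    """
    pivot = subsequences[pivot_index]
    dists = [dist(pivot, s) for s in subsequences]
    radius = statistics.median(dists)
    s_in = [s for s, d in zip(subsequences, dists) if d <= radius]
    s_out = [s for s, d in zip(subsequences, dists) if d > radius]
    return pivot, radius, s_in, s_out
```

An internal node would store `pivot` and `radius`, and the same step would then be applied to `s_in` and `s_out` until the sets are small enough to be kept in leaf nodes.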

Note that the similarity search index referred to in the present invention covers various data structures having the above characteristics and is not limited to the VPT. Concrete examples include other metric-space similarity search indexes such as M-tree, PM-tree, MVP-tree, sa-tree, iDistance, List of Clusters, AESA, LAESA, GNAT, and D-index. Beyond these metric-space indexes, the index may be a similarity search index for multidimensional spaces, such as R-tree or Priority R-tree, a similarity search index for string data, or a similarity search index for set data. The scope of the invention is not limited to these either, and the indexes can be combined with dimensionality reduction techniques; concrete examples include DFT, SVD, PAA, and SAX. When the partial time-series data are long and pruning becomes insufficient because the amount of data grows with the number of dimensions (the "curse of dimensionality"), adopting these dimensionality reduction techniques makes upper and lower bounds on the distance available for indexing.
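As one illustration of how a dimensionality reduction technique yields a usable distance bound, the following Python sketch computes the well-known PAA (piecewise aggregate approximation) lower bound on the Euclidean distance. It assumes equal-length segments, so the subsequence length must be a multiple of the number of segments; the function names are hypothetical, and the specification does not prescribe this particular bound.

```python
import math


def paa(x, n_seg):
    """Reduce x to n_seg values by averaging equal-length segments."""
    if len(x) % n_seg != 0:
        raise ValueError("length must be a multiple of n_seg in this sketch")
    seg = len(x) // n_seg
    return [sum(x[i * seg:(i + 1) * seg]) / seg for i in range(n_seg)]


def euclid(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def paa_lower_bound(a, b, n_seg):
    """Lower bound on euclid(a, b) computed from the reduced representations."""
    scale = math.sqrt(len(a) / n_seg)
    return scale * euclid(paa(a, n_seg), paa(b, n_seg))
```

If the lower bound already exceeds the search threshold, the exact distance on the full-length subsequences does not need to be computed for that pair.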

[Similar data search method]
In step S104, the similar pair search unit 17 uses the partial time-series data of the latest reception time produced by the section dividing unit 13 as queries and, by means of the per-reception-time similarity search indexes managed in a queue structure in the second memory 16, searches for partial time-series data that are more similar to each query than the threshold. The similar pair search unit 17 then outputs similar partial time-series data pairs. A data pair here means a pair consisting of a partial time-series at the latest time and a partial time-series similar to it. The similar pair search unit 17 outputs the similar partial time-series data pairs by the following procedure.

Here, a search example is described using the structure in which the above-described S_t is classified by S^p_t and r^p_t into the two partial time-series data sets S_in and S_out. Suppose that we want to find the time-series data lying within a distance r^q of the query time-series data (Equation 5). Consider the distance (Equation 7) between the query of Equation 5 and the partial time-series data (Equation 6) used as the reference for the space division.
1) When the inequality of Equation 8 holds, it is sufficient to search the partial time-series data contained in S_in, and the time-series data contained in S_out, that is, the internal node or leaf node that manages S_out, can be pruned.
2) When the inequality of Equation 9 holds, it is sufficient to search the time-series data contained in S_out, and the time-series data contained in S_in, that is, the internal node or leaf node that manages S_in, can be pruned.
3) When neither of the inequalities in 1) and 2) is satisfied, the time-series data of both S_in and S_out must be searched.
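For orientation, the three cases above match the usual triangle-inequality pruning rule of a vantage-point tree. The equation images are not reproduced in this text, so the notation below is a reconstruction under assumptions: S^q denotes the query, d(·,·) the distance function, and S_in is assumed to contain exactly the data whose distance from S^p_t is at most r^p_t. The patent's own expressions may differ in notation or in the strictness of the inequalities.

```latex
\begin{align*}
\text{1)}\quad & d(S^{q}, S^{p}_{t}) + r^{q} \le r^{p}_{t}
  && \Rightarrow\ \text{search only } S_{\mathrm{in}},\\
\text{2)}\quad & d(S^{q}, S^{p}_{t}) - r^{q} > r^{p}_{t}
  && \Rightarrow\ \text{search only } S_{\mathrm{out}},\\
\text{3)}\quad & \text{otherwise}
  && \Rightarrow\ \text{search both } S_{\mathrm{in}} \text{ and } S_{\mathrm{out}}.
\end{align*}
```

Under these assumptions, case 1) follows because any x with d(S^q, x) ≤ r^q satisfies d(S^p_t, x) ≤ d(S^q, S^p_t) + r^q ≤ r^p_t, so x lies in S_in. Case 2) follows in the same way from d(S^p_t, x) ≥ d(S^q, S^p_t) − r^q > r^p_t.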

This pruning is performed recursively. That is, the internal nodes of FIG. 6 are traversed partway, and the search is carried out at the internal nodes and leaf nodes that cannot be pruned. FIG. 7 illustrates pruning with the similarity search index. It shows the case where condition 1) above applies and the time-series data contained in S_out can be pruned.
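The recursive pruning can be sketched as a range search over a small vantage-point tree. This is a minimal illustration under assumptions, not the embodiment's code: the nested-dictionary node layout, the Euclidean distance, the choice of the first point as the reference, and the convention that S_in includes points exactly at the dividing distance are all hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance (assumed metric for this sketch)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build(points, leaf_size=2):
    """Build a small vantage-point tree as nested dicts."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    vantage = points[0]  # assumption: the first point serves as the reference
    distances = sorted(dist(vantage, p) for p in points[1:])
    radius = distances[len(distances) // 2]  # roughly the median distance
    inside = [p for p in points if dist(vantage, p) <= radius]
    outside = [p for p in points if dist(vantage, p) > radius]
    if not outside:  # cannot split further
        return {"leaf": points}
    return {"vantage": vantage, "radius": radius,
            "in": build(inside, leaf_size), "out": build(outside, leaf_size)}

def range_search(node, query, r_q, stats):
    """Return all points within r_q of query, pruning subtrees where possible."""
    if "leaf" in node:
        stats["examined"] += len(node["leaf"])
        return [p for p in node["leaf"] if dist(query, p) <= r_q]
    d = dist(query, node["vantage"])
    if d + r_q <= node["radius"]:  # case 1: only S_in can contain answers
        return range_search(node["in"], query, r_q, stats)
    if d - r_q > node["radius"]:   # case 2: only S_out can contain answers
        return range_search(node["out"], query, r_q, stats)
    # case 3: both sides must be searched
    return (range_search(node["in"], query, r_q, stats)
            + range_search(node["out"], query, r_q, stats))

points = [(float(i), 0.0) for i in range(16)]
tree = build(points)
stats = {"examined": 0}
hits = range_search(tree, (2.2, 0.0), 1.0, stats)
```

In this run the distance is computed against only a fraction of the stored points, because whole subtrees are skipped at the internal nodes where case 1) applies.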

By the above procedure, the data search device 301 searches the plural series of time-series data that have flowed in within a fixed period from the past up to the present for pairs of partial time-series data that are more similar than a threshold.

(Embodiment 2)
FIG. 8 is a diagram illustrating the data search device 302 of this embodiment. The data search device 302 includes:
a receiving unit 11 that receives plural series of time-series data that flow in sequentially;
a first memory 12 that stores the time-series data received by the receiving unit 11;
a section division unit 13 that reads the time-series data from the first memory 12 and divides them into partial time-series data of a predetermined time;
a similarity determination unit 14 that obtains the mutual similarity of the partial time-series data divided by the section division unit 13;
an index unit 15 that creates, based on the similarity obtained by the similarity determination unit 14, a similarity search index for the partial time-series data generated by the section division unit 13, the index having a structure in which the partial time-series data can be distinguished by reception time;
a second memory 16 that stores the similarity search index created by the index unit 15 and appends subsequent partial time-series data to the similarity search index as a queue while maintaining the structure of the similarity search index; and
a similar pair search unit 17 that, using the partial time-series data generated by the section division unit 13 as a query, reads partial time-series data that are more similar than a specified threshold from the second memory 16 by means of the similarity search index.

FIG. 9 is a diagram illustrating the data search method performed by the data search device 302. In this data search method, plural series of sequentially input time-series data are divided into partial time-series data of a predetermined time, the mutual similarity of the partial time-series data is obtained, and one similarity search index that is based on the similarity and has a structure in which the partial time-series data can be distinguished by reception time is created and stored (steps S201 and S202);
subsequent partial time-series data are appended to the similarity search index as a queue while the structure of the similarity search index is maintained (steps S203 to S205); and
at search time, partial time-series data that are more similar than a specified threshold are read out by means of the similarity search index, using the partial time-series data as a query (step S206).

In the data search device 302, the receiving unit 11, the section division unit 13, the similarity determination unit 14, the similar pair search unit 17, and the first memory 12 operate in the same way as in the data search device 301 of FIG. 2.

The index unit 15 builds the similarity search index once in step S202 and thereafter updates only the leaf nodes, without changing the internal nodes (steps S203 to S205). As in Embodiment 1, the similarity search index of this embodiment adopts the data structure of the VPT, a similarity search index for metric spaces.

In steps S201 and S202, one similarity search index is built after n(m+w-1) elements have been accumulated in the receiving unit 11. Let "S" be the set of partial time-series data that the section division unit 13 can cut out. One piece of partial time-series data with series ID p and reception time t_x (Equation 10) is selected from "S" as the reference, and the distance that divides "S" into two equal parts is denoted as in Equation 11. "S" is then classified into the two partial time-series data sets S_in and S_out given by Equation 12.
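The split described above can be written out as follows. Because the equation images are missing from this text, this is a reconstruction under assumptions: d(·,·) stands for the distance obtained by the similarity determination unit 14, S' ranges over the members of "S", the equal split is taken to be at the median distance, and data lying exactly at the dividing distance are assigned to S_in.

```latex
\begin{align*}
r^{p}_{t_x} &= \operatorname*{median}_{S' \in S}\; d\bigl(S^{p}_{t_x}, S'\bigr),\\
S_{\mathrm{in}} &= \bigl\{\, S' \in S \;\bigm|\; d(S^{p}_{t_x}, S') \le r^{p}_{t_x} \,\bigr\},\\
S_{\mathrm{out}} &= \bigl\{\, S' \in S \;\bigm|\; d(S^{p}_{t_x}, S') > r^{p}_{t_x} \,\bigr\}.
\end{align*}
```

With the median as the dividing distance, S_in and S_out hold roughly the same number of elements, which keeps the resulting tree balanced.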

For S_in and S_out in turn, a reference partial time-series data item and a dividing distance are likewise determined and the space is again split in two; repeating this produces a tree-structured index. Once the data have been classified down to sets containing a sufficiently small number of partial time-series data, each such set "S" is managed as a leaf node. In the leaf nodes, the partial time-series data are managed in a queue structure that makes them easy to access by reception time.
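The construction can be sketched as below: a recursive median split whose leaves keep their data in a queue ordered by reception time. This is an illustration under assumptions, not the embodiment's implementation; the tuple layout `(series_id, reception_time, subsequence)`, the Euclidean distance, the choice of the first item as the reference, and the `leaf_size` parameter are all hypothetical.

```python
import math
from collections import deque

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_leaf(items):
    """Leaf node: a queue of (reception_time, [items]) buckets, oldest first."""
    queue = deque()
    for item in sorted(items, key=lambda it: it[1]):
        if queue and queue[-1][0] == item[1]:
            queue[-1][1].append(item)
        else:
            queue.append((item[1], [item]))
    return {"queue": queue}

def build(items, leaf_size=2):
    """items: list of (series_id, reception_time, subsequence) tuples."""
    if len(items) <= leaf_size:
        return make_leaf(items)
    reference = items[0][2]  # assumption: the first item is taken as the reference
    distances = sorted(dist(reference, it[2]) for it in items)
    radius = distances[(len(items) - 1) // 2]  # median, so the split is about even
    inside = [it for it in items if dist(reference, it[2]) <= radius]
    outside = [it for it in items if dist(reference, it[2]) > radius]
    if not outside:  # ties prevent a split; keep everything in one leaf
        return make_leaf(items)
    return {"reference": reference, "radius": radius,
            "in": build(inside, leaf_size), "out": build(outside, leaf_size)}

def leaves(node):
    """Yield every leaf node of the tree."""
    if "queue" in node:
        yield node
    else:
        yield from leaves(node["in"])
        yield from leaves(node["out"])

items = ([("a", t, (float(t), 0.0)) for t in range(4)]
         + [("b", t, (float(t), 10.0)) for t in range(4)])
tree = build(items)
```

Internal nodes hold only a reference subsequence and a dividing distance, so they can stay fixed while the leaf queues change over time.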

In this way, the internal nodes of the tree structure manage the data of Equation 13, and the leaf nodes manage the partial time-series data sets. FIG. 10 is a diagram illustrating the queue in a leaf node of the similarity search index.

Note that the similarity search index referred to in the present invention covers a variety of data structures that satisfy the above characteristics; it is not limited to the VPT and may be any of the similarity search indexes listed in Embodiment 1. These indexes can also be combined with the dimensionality reduction techniques listed in Embodiment 1.

After the index has been built, in steps S203 to S205 the data search device 302 adds each newly arriving partial time series S^i_t to the corresponding leaf node. That is, at an internal node that manages Equation 13, the search descends to the node responsible for S_in if the condition of Equation 14 holds and to the node responsible for S_out if the condition of Equation 15 holds; this is repeated until the corresponding leaf node is reached. Once the leaf node is found, S^i_t is added to the queue that stores its partial time-series data. Partial time-series data for which the predetermined time has elapsed are deleted from the queue in the leaf node.
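The leaf-only update can be sketched as follows. This is a minimal illustration under assumptions, not the embodiment's implementation: the dictionary node layout, the Euclidean distance, the rule that a distance no greater than the stored dividing distance leads to S_in, and the helper names `find_leaf`, `insert`, and `expire` are all hypothetical.

```python
import math
from collections import deque

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def new_leaf():
    # queue entries: (reception_time, series_id, subsequence), oldest first
    return {"queue": deque()}

def find_leaf(node, subsequence):
    """Descend to a leaf; internal nodes (reference, radius) are never modified."""
    while "queue" not in node:
        if dist(node["reference"], subsequence) <= node["radius"]:
            node = node["in"]
        else:
            node = node["out"]
    return node

def insert(root, reception_time, series_id, subsequence):
    """Append a new subsequence to the queue of the leaf it belongs to."""
    find_leaf(root, subsequence)["queue"].append(
        (reception_time, series_id, subsequence))

def expire(node, oldest_kept_time):
    """Remove entries older than oldest_kept_time from the front of each queue."""
    if "queue" in node:
        while node["queue"] and node["queue"][0][0] < oldest_kept_time:
            node["queue"].popleft()
    else:
        expire(node["in"], oldest_kept_time)
        expire(node["out"], oldest_kept_time)

# A one-level tree built by hand for the example
root = {"reference": (0.0, 0.0), "radius": 5.0,
        "in": new_leaf(), "out": new_leaf()}
insert(root, 1, "a", (1.0, 1.0))
insert(root, 1, "b", (9.0, 9.0))
insert(root, 2, "a", (2.0, 2.0))
insert(root, 3, "b", (1.0, 0.0))
expire(root, oldest_kept_time=2)
```

Because data arrive in order of reception time, each leaf queue stays sorted by time, so expired entries can always be removed from the front.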

Step S206 is the same as step S104 described with reference to FIG. 3. The similar pair search unit 17 takes the partial time-series data with the latest reception time, as divided by the section division unit 13, as a query, uses the similarity search index managed in the second memory 16 to traverse the internal nodes of FIG. 6 partway, and searches the internal nodes and leaf nodes that cannot be pruned for partial time-series data that are more similar to the query than the threshold. The similar pair search unit 17 then outputs the similar partial time-series data pairs.

By the above procedure, the data search device 302 searches the plural series of time-series data that have flowed in within a fixed period from the past up to the present for pairs of partial time-series data that are more similar than a threshold.

11: receiving unit
12: first memory
13: section division unit
14: similarity determination unit
15: index unit
16: second memory
17: similar pair determination unit
18: third memory
301, 302: data search device

Claims

1. A data search device comprising:
a receiving unit that receives plural series of time-series data that flow in sequentially;
a first memory that stores the time-series data received by the receiving unit;
a section division unit that reads the time-series data from the first memory and divides them into partial time-series data of a predetermined time;
a similarity determination unit that obtains the mutual similarity of the partial time-series data divided by the section division unit;
an index unit that creates, based on the similarity obtained by the similarity determination unit, a similarity search index for the partial time-series data generated by the section division unit;
a second memory that stores the similarity search index created by the index unit in a queue structure for each reception time; and
a similar pair search unit that, using the partial time-series data generated by the section division unit as a query, reads partial time-series data that are more similar than a specified threshold from the second memory by means of the similarity search index.

2. A data search device comprising:
a receiving unit that receives plural series of time-series data that flow in sequentially;
a first memory that stores the time-series data received by the receiving unit;
a section division unit that reads the time-series data from the first memory and divides them into partial time-series data of a predetermined time;
a similarity determination unit that obtains the mutual similarity of the partial time-series data divided by the section division unit;
an index unit that creates, based on the similarity obtained by the similarity determination unit, a similarity search index for the partial time-series data generated by the section division unit, the index having a structure in which the partial time-series data can be distinguished by reception time;
a second memory that stores the similarity search index created by the index unit and appends subsequent partial time-series data to the similarity search index as a queue while maintaining the structure of the similarity search index; and
a similar pair search unit that, using the partial time-series data generated by the section division unit as a query, reads partial time-series data that are more similar than a specified threshold from the second memory by means of the similarity search index.

3. A data search method in which plural series of sequentially input time-series data are divided into partial time-series data of a predetermined time, the mutual similarity of the partial time-series data is obtained, and a similarity search index based on the similarity is created for each reception time and stored in a queue structure, and
at search time, partial time-series data that are more similar than a specified threshold are read out by means of the similarity search index, using the partial time-series data as a query.

4. A data search method in which plural series of sequentially input time-series data are divided into partial time-series data of a predetermined time, the mutual similarity of the partial time-series data is obtained, and one similarity search index that is based on the similarity and has a structure in which the partial time-series data can be distinguished by reception time is created and stored,
subsequent partial time-series data are appended to the similarity search index as a queue while the structure of the similarity search index is maintained, and
at search time, partial time-series data that are more similar than a specified threshold are read out by means of the similarity search index, using the partial time-series data as a query.

5. A data search program for causing a computer to execute the data search method according to claim 3 or 4.