JP2013196665A

JP2013196665A - Data retriever, data retrieval method, and data retrieval program

Info

Publication number: JP2013196665A
Application number: JP2012066643A
Authority: JP
Inventors: Hiroshi Kurasawa; 央倉沢; Hiroshi Sato; 浩史佐藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-03-23
Filing date: 2012-03-23
Publication date: 2013-09-30
Anticipated expiration: 2032-03-23
Also published as: JP5711171B2

Abstract

PROBLEM TO BE SOLVED: To provide a data retriever, data retrieval method, and data retrieval program capable of limiting memory size when retrieving pairs of similar partial time-series data from multiple series of time-series data received within a fixed period extending from the present into the past.

SOLUTION: Multiple series of sequentially input time-series data are divided into partial time-series data of a prescribed duration, and the mutual similarity of the partial time-series data is calculated. Based on the similarity, candidate pairs of similar partial time-series data and a similarity threshold are stored for each reception time within a designated time range. Among the generated pairs of partial time-series data, those more similar than the stored similarity threshold are stored as provisional similar pairs, and a given number of the most similar partial time-series data pairs are read from the stored provisional similar pairs.

Description

The present invention relates to a data search device, a data search method, and a data search program for searching for pairs of similar partial time-series data in multiple series of time-series data received within a fixed period extending from the present into the past.

Time-series data is a series of data observed continuously over time. It is used in environmental observation, equipment monitoring, stock prices, statistics, media processing, and other fields.

Time-series data search falls into two types: searching data accumulated in advance, and searching data that arrives sequentially. An example of the former is searching past precipitation data, observed at regular intervals, for similar patterns. An example of the latter is monitoring stock price movements in real time and issuing a notification the moment they match a target pattern. The present invention deals with the latter, real-time processing.

Prior art in real-time time-series data search includes the following. There are search processes that find pairs of similar time-series data among multiple series of time-series data at the same time (see, for example, Patent Document 1 and Non-Patent Document 1). Patent Document 1 addresses the problem of finding pairs whose similarity exceeds a threshold, and Non-Patent Document 1 addresses the problem of finding pairs whose correlation coefficient is at or above a threshold. In these prior arts, the computational cost of similarity determination grows with the number of series, so the technical problems addressed were dimensionality reduction to lower the cost of each similarity calculation and indexing techniques to reduce the number of similarity calculations. However, these prior arts are limited to time-series data whose end point is the current time. That is, they cannot be applied to a search for similar partial time-series data that are offset in time.

On the other hand, there are search processes that find pairs of similar partial time-series data within a single series of time-series data received within a fixed period extending from the present into the past (see, for example, Non-Patent Documents 2 and 3). Both Non-Patent Document 2 and Non-Patent Document 3 address the problem of searching for a specified number of the most similar pairs of partial time-series data. In these prior arts, the longer the time span to be searched, or the shorter the partial time-series data relative to the length of the time-series data, the larger the memory needed to store provisional similar pairs, so the technical problem addressed was how to narrow down provisional similar pair candidates. However, these prior arts are limited to a single series of time-series data. That is, they cannot be applied to a search over multiple series of time-series data.

Patent Document 1: JP 2006-244389 A

Non-Patent Document 1: Abdullah Mueen, Suman Nath and Jie Liu, "Fast Approximate Correlation for Massive Time-series Data", In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.
Non-Patent Document 2: Abdullah Mueen and Eamonn Keogh, "Online Discovery and Maintenance of Time Series Motifs", In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
Non-Patent Document 3: Hoang Thanh Lam, Tu Eindhoven, Toon Calders and Ninh Pham, "Online Discovery of Top-k Similar Motifs in Time Series Data", In Proceedings of the 2011 SIAM International Conference on Data Mining, 2011.
Non-Patent Document 4: Yianilos, Peter N, "Data structures and algorithms for nearest neighbor search in general metric spaces", In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, Philadelphia, pp. 311-321, 1993.

As described above, the prior art cannot handle a search process that finds pairs of similar partial time-series data in multiple series of time-series data received within a fixed period extending from the present into the past. That is, it cannot search multiple series of time-series data for similar partial time-series data that are offset in time. One example of such a search is to search one month of stock price time-series data for multiple stocks for similar partial time-series data, in order to find characteristic price movements that occur with a time lag. Another example is to search the time-series data of multiple sensor values for similar partial time-series data, in order to find phenomena that propagate with a time lag, such as from solar radiation to ground heat to air temperature. As noted, there is prior art that targets multiple series of time-series data, and prior art that targets similar partial time-series data offset in time, but no time-series data search technique that satisfies both at once has been found.

A time-series data search process that satisfies both requirements handles a search for a given number of the most similar pairs of partial time-series data in multiple series of time-series data received within a fixed period extending from the present into the past. This search process is required to limit memory size, and narrowing down the provisional similar pair candidates is the technical problem to be solved. Here, "a search process for a given number of the most similar pairs of partial time-series data" means "a process that, when the requested number of outputs k (the number of pairs) is input, searches for and outputs the pairs of partial time-series data from the most similar to the k-th most similar."

When handling a search request for a given number of the most similar pairs of partial time-series data, the number of provisional similar pairs increases as the time span of the time-series data to be searched becomes longer, or as the partial time-series data becomes shorter relative to the length of the time-series data. The more provisional similar pairs there are, the more memory is needed to store them. Moreover, if the provisional similar pairs cannot be narrowed down, the similarity threshold for provisional similar pairs is overestimated, which makes pruning with a search index difficult and increases the number of distance calculations. For these reasons, narrowing down the provisional similar pairs is necessary in order to limit memory size. Here, "pruning" means removing in advance data that cannot be a search result.

The present invention has been made to solve the above problem. Its object is to provide a data search device, a data search method, and a data search program that can limit the memory size required when searching for pairs of similar partial time-series data in multiple series of time-series data received within a fixed period extending from the present into the past.

The present invention adopts a data structure that makes it easy to insert partial time-series data with a new reception time and to delete partial time-series data with an old reception time, together with an algorithm that prunes partial time-series data separately for each reception time.

A data search device according to the present invention includes: a reception unit that receives multiple series of time-series data arriving sequentially; a first memory that stores the time-series data received by the reception unit; a section division unit that reads the time-series data from the first memory and divides it into partial time-series data of a predetermined duration; a similarity determination unit that obtains the mutual similarity of the partial time-series data divided by the section division unit; a provisional similar pair candidate search unit that, based on the similarity obtained by the similarity determination unit, searches the pairs of partial time-series data generated by the section division unit for partial time-series data more similar than a predetermined similarity threshold and stores them as provisional similar pair candidates; a third memory that stores the provisional similar pair candidates found by the provisional similar pair candidate search unit and the predetermined similarity threshold; and a similar pair determination unit that performs similarity threshold determination and narrows down the similar pairs from the provisional similar pair candidates read from the third memory.

A data search method according to the present invention divides multiple series of sequentially input time-series data into partial time-series data of a predetermined duration; obtains the mutual similarity of the partial time-series data; based on the similarity, stores candidate pairs of similar partial time-series data and a similarity threshold for each reception time within a designated time range; stores, as provisional similar pairs, those pairs of generated partial time-series data that are more similar than the stored similarity threshold; and reads a given number of the most similar partial time-series data pairs from the stored provisional similar pairs.

By managing provisional similar pairs for each reception time within the designated time range, the present invention can quickly identify and eliminate pairs of partial time-series data that are included in the provisional similar pair candidates at an old reception time but do not satisfy the similar pair threshold. The present invention can therefore reduce memory size when the time span of the time-series data to be searched is long, or when the partial time-series data is short relative to the length of the time-series data.

Accordingly, the present invention can provide a data search device and a data search method that can limit memory size when searching for pairs of similar partial time-series data in multiple series of time-series data received within a fixed period extending from the present into the past.

In the data search device according to the present invention, the provisional similar pair candidate search unit includes: an index unit that, based on the similarity obtained by the similarity determination unit for the pairs of partial time-series data generated by the section division unit, creates a similarity search index structured so that partial time-series data can be distinguished by reception time; and a second memory that stores the similarity search index created by the index unit.

In the data search method according to the present invention, when the mutual similarity of the partial time-series data is obtained, a similarity search index structured so that the partial time-series data can be distinguished by reception time is used to narrow down the data with the similarity threshold for each reception time.

Accordingly, the present invention can provide a data search device and a data search method that can limit the distance calculation cost required when searching for pairs of similar partial time-series data in multiple series of time-series data received within a fixed period extending from the present into the past.

A data search program according to the present invention causes a computer to execute the data search method.

The present invention can provide a data search device and a data search method that can limit memory size when searching for pairs of similar partial time-series data in multiple series of time-series data received within a fixed period extending from the present into the past.

FIG. 1 is a conceptual diagram illustrating similar partial time-series pairs.
FIG. 2 is a diagram illustrating the data search device according to the present invention.
FIG. 3 is a diagram illustrating the data search method according to the present invention.
FIG. 4 is a conceptual diagram of the data search device according to the present invention dividing time-series data into partial time-series data.
FIG. 5 is a conceptual diagram of the narrowing down of similar pairs.

Embodiments of the present invention will be described with reference to the accompanying drawings. The embodiments described below are examples of the present invention, and the present invention is not limited to them. In this specification and the drawings, components with the same reference numerals are identical to one another.

FIG. 1 is a conceptual diagram of similar partial time-series pairs. The device of the embodiment searches for a given number of the most similar partial time-series data pairs.

FIG. 2 is a diagram illustrating the data search device 303 of this embodiment. The data search device 303 includes: a reception unit 11 that receives multiple series of time-series data arriving sequentially; a first memory 12 that stores the time-series data received by the reception unit 11; a section division unit 13 that reads the time-series data from the first memory 12 and divides it into partial time-series data of a predetermined duration; a similarity determination unit 14 that obtains the mutual similarity of the partial time-series data divided by the section division unit 13; a provisional similar pair candidate search unit 19 that, based on the similarity obtained by the similarity determination unit 14, searches the pairs of partial time-series data generated by the section division unit 13 for partial time-series data more similar than a predetermined similarity threshold and stores them as provisional similar pair candidates; a third memory 18 that stores the provisional similar pair candidates found by the provisional similar pair candidate search unit 19 and the predetermined similarity threshold; and a similar pair determination unit 17 that performs similarity threshold determination and narrows down the similar pairs from the provisional similar pair candidates read from the third memory 18.

FIG. 3 is a diagram illustrating the data search method performed by the data search device 303. The method divides multiple series of sequentially input time-series data into partial time-series data of a predetermined duration (steps S301 and S302); obtains the mutual similarity of the partial time-series data and, based on the similarity, stores candidate pairs of similar partial time-series data and a similarity threshold for each reception time within a designated time range (step S303); stores, as provisional similar pairs, those pairs of generated partial time-series data that are more similar than the stored similarity threshold (step S304); and reads a given number of the most similar partial time-series data pairs from the stored provisional similar pairs (step S305).

An example in which the k most similar partial time-series data pairs are searched for is described below.

In step S301, the reception unit 11 receives multiple series of time-series data arriving sequentially and stores them in the first memory 12. If the capacity of the first memory 12 is limited, time-series data whose reception time falls outside the designated time range is deleted from the first memory 12. The minimum storage capacity required of the first memory 12 is enough to hold n(m+w-1) elements, where m is the length of the partial time-series data to be searched, w is the length of time to be searched, and n is the number of series.
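As a rough illustration of the capacity bound n(m+w-1), the first memory can be pictured as one fixed-length ring buffer per series. The sketch below is only an assumption about one possible layout; the class name `SeriesBuffer` and its methods are hypothetical and do not appear in the patent.

```python
from collections import deque


class SeriesBuffer:
    """Hypothetical stand-in for the first memory: keeps the last
    m + w - 1 values of each of n series."""

    def __init__(self, n, m, w):
        # w reception times, each needing a window of length m, overlap so
        # that m + w - 1 values per series are enough.
        self.capacity = m + w - 1
        self.buffers = [deque(maxlen=self.capacity) for _ in range(n)]

    def receive(self, values):
        """Append one new value per series; a full deque drops its oldest value."""
        for buf, v in zip(self.buffers, values):
            buf.append(v)

    def total_elements(self):
        return sum(len(buf) for buf in self.buffers)


# Example: n = 3 series, window length m = 4, search span w = 5 reception times.
buf = SeriesBuffer(n=3, m=4, w=5)
for t in range(20):
    buf.receive([t, 2 * t, 3 * t])
```

With these example values each series retains 8 values, so the buffer holds 3 x 8 = 24 elements in total once it is full.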

In step S302, the section division unit 13 reads the time-series data from the first memory 12 and divides it into partial time-series data of the designated section duration. In this embodiment, partial time-series data means a contiguous data series of the designated section duration taken from the temporally continuous data series that arrive sequentially. Each time new time-series data is received, the section division unit 13 extracts the most recent partial time-series data of length m from each time series and inputs it to the provisional similar pair candidate search unit 19. It also extracts and returns the partial time-series data of length m for a time specified by the similar pair determination unit 17. FIG. 4 is a conceptual diagram of the section division unit 13 dividing time-series data into partial time-series data. In the figure, $S^{i}_{t_x}$ denotes the partial time-series data of the series with ID $i$ at reception time $t_x$.

In step S303, the similarity determination unit 14 calculates the similarity between partial time-series data. In this embodiment, similarity is a numerical index of how closely two partial time-series data resemble each other. Specifically, this embodiment adopts the Euclidean distance as the similarity Dist between two partial time-series data $S^{i}_{t_x}$ and $S^{j}_{t_y}$, defined as follows, where $S[l]$ denotes the $l$-th of the $m$ values of partial time-series data $S$.

$$Dist(S^{i}_{t_x}, S^{j}_{t_y}) = \sqrt{\sum_{l=1}^{m} \left( S^{i}_{t_x}[l] - S^{j}_{t_y}[l] \right)^{2}}$$
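The Euclidean distance between two equal-length partial time series can be written in a few lines. This is a minimal sketch; the function name `dist` is chosen here for illustration, and a smaller return value means the two subsequences are more similar.

```python
import math


def dist(a, b):
    """Euclidean distance between two partial time series of equal length m."""
    if len(a) != len(b):
        raise ValueError("partial time series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Because this is a metric (it satisfies the triangle inequality), it can be used with metric-space indexes such as the Vantage Point Tree.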

The similarity referred to in the present invention covers various distances that satisfy the above property and is not limited to the Euclidean distance. Specific examples of distances expressing the similarity between time-series data include Dynamic Time Warping, Earth Mover's Distance, edit distance, Hamming distance, Jaccard distance, cosine similarity, Manhattan distance, and Mahalanobis distance. In addition, the distance may be computed on partial time series whose offset or amplitude has been adjusted.

The third memory 18 stores, for each reception time, the candidate pairs of similar partial time-series data and the similarity threshold.

The candidate pairs of similar partial time-series data for each reception time are rewritten by the provisional similar pair candidate search unit 19, and the similarity threshold for each reception time is rewritten by the provisional similar pair candidate search unit 19 and the similar pair determination unit 17.

In step S303, the provisional similar pair candidate search unit 19 manages, in the third memory, the provisional similar pairs that may be included in the given number of most similar partial time-series data pairs, together with the similarity threshold for each reception time. The provisional similar pair candidate search unit 19 accumulates provisional similar pairs in a heap for each reception time, and manages those heaps in a queue ordered by reception time. Each heap holds at most k pairs of partial time-series data together with the similarity between the data in each pair. Each pair in the provisional similar pairs is filed under the older of the reception times of its two partial time-series data. For example, a pair of the partial time-series data of Equation 2 (with $t_x < t_y$) is filed under reception time $t_x$. This makes it easy to delete the provisional similar pairs of the oldest reception time, and also to delete the pair with the largest distance at each reception time.
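The queue of per-reception-time heaps described above could be organized as in the following sketch. It is an illustration under stated assumptions rather than the patented implementation: the class name `CandidateStore`, its method names, the tuple layout, and the use of negated distances to obtain a max-heap from Python's `heapq` are all choices made here.

```python
import heapq
from collections import deque


class CandidateStore:
    """Hypothetical third-memory layout: a queue of (reception_time, heap)."""

    def __init__(self, k, w):
        self.k = k            # maximum number of pairs per heap
        self.w = w            # number of reception times kept in the queue
        self.queue = deque()  # oldest reception time on the left

    def open_time(self, t):
        """Start a heap for reception time t and drop heaps outside the range."""
        self.queue.append((t, []))
        while self.queue and self.queue[0][0] <= t - self.w:
            self.queue.popleft()

    def _heap(self, t):
        for time, heap in self.queue:
            if time == t:
                return heap
        return None

    def threshold(self, t):
        """Largest stored distance once the heap holds k pairs, else infinity."""
        heap = self._heap(t)
        if heap is None or len(heap) < self.k:
            return float("inf")
        return -heap[0][0]

    def add(self, t_x, i, t_y, j, d):
        """File pair (series i at t_x, series j at t_y), t_x <= t_y, under t_x."""
        heap = self._heap(t_x)
        if heap is None or d >= self.threshold(t_x):
            return False
        item = (-d, t_x, i, t_y, j)  # negated distance: heap[0] is the worst pair
        if len(heap) < self.k:
            heapq.heappush(heap, item)
        else:
            heapq.heapreplace(heap, item)  # evict the pair with the largest distance
        return True
```

Dropping the oldest reception time is a single `popleft`, and evicting the least similar pair of a given reception time is a single heap operation, which matches the two deletions the text says this layout is meant to make easy.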

In step S304, the provisional similar pair candidate search unit 19 uses the partial time-series data of the latest reception time produced by the section division unit 13 as a query, searches the partial time-series data readable through the section division unit 13 for data whose distance from the query is smaller than the similarity threshold of the corresponding reception time, and adds the resulting pairs to the queue of heaps of provisional similar pairs.

To reduce the cost of searching for similar partial time-series data, the data search device 303 may build a similarity search index by providing the provisional similar pair candidate search unit 19 with an index unit 15 and a second memory 16. In that case, when the mutual similarity of the partial time-series data is obtained, the index unit 15 and the second memory 16 use a similarity search index structured so that partial time-series data can be distinguished by reception time, and narrow down the data with the similarity threshold for each reception time. Specifically, the index unit 15 and the second memory 16 create one new similarity search index from the partial time-series data of the latest reception time produced by the section division unit 13, and add it to the queue of indexes held in the second memory 16. At the same time, the similarity search index of any reception time outside the designated time range is deleted from the queue of indexes held in the second memory 16.

[Method of creating the similarity search index]
The similarity search index in this embodiment adopts the Vantage Point Tree (VPT), a similarity search index for metric spaces (see Non-Patent Document 4), and is created by the following procedure. The index is built by choosing one reference partial time-series data and one division distance, and repeatedly classifying the partial time-series data to be indexed into two groups according to their distance from the reference partial time-series data and the division distance. An example of classifying the set $S_t$ of partial time-series data at reception time $t$ into two groups is as follows. When one reference partial time-series data $S^{p}_{t}$ of series $p$ is chosen from $S_t$, and $r^{p}_{t}$ is the distance that divides $S_t$ evenly in two, $S_t$ is classified into the two sets

$$S_{in} = \{\, S^{i}_{t} \in S_t \mid Dist(S^{p}_{t}, S^{i}_{t}) \le r^{p}_{t} \,\}, \qquad S_{out} = \{\, S^{i}_{t} \in S_t \mid Dist(S^{p}_{t}, S^{i}_{t}) > r^{p}_{t} \,\}.$$

For $S_{in}$ and $S_{out}$ in turn, a reference partial time-series data and a division distance are chosen and the space is again classified in two; repeating this produces a tree-structured similarity search index. The index consists of two kinds of nodes: leaf nodes, which are the terminal nodes of the tree, and internal nodes, which are all other nodes. The $S^{p}_{t}$ and $r^{p}_{t}$ needed at each classification are managed by generating an internal node. Once the classification has been repeated until the sets of partial time-series data are sufficiently small and the required internal nodes have been generated, leaf nodes are generated to manage the remaining partial time-series data of $S_t$. In other words, internal nodes manage the information needed for classification ($S^{p}_{t}$, $r^{p}_{t}$), and leaf nodes manage sets of partial time-series data.
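The construction above, and the pruning it enables, can be sketched as follows. This is an illustrative sketch, not the patent's implementation: choosing the first item as the reference, using the median as the division distance, the `leaf_size` parameter, the dictionary node layout, and the function names are all assumptions made here.

```python
import math
import statistics


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_vpt(items, leaf_size=2):
    """Build a vantage point tree over items = [(series_id, subsequence), ...]."""
    if len(items) <= leaf_size:
        return {"leaf": items}
    vantage = items[0]                       # assumption: first item is the reference
    dists = [dist(vantage[1], s) for _, s in items]
    radius = statistics.median(dists)        # division distance: splits roughly in half
    inside = [it for it, d in zip(items, dists) if d <= radius]
    outside = [it for it, d in zip(items, dists) if d > radius]
    if not outside:                          # many ties: stop splitting
        return {"leaf": items}
    return {"vantage": vantage, "radius": radius,
            "inside": build_vpt(inside, leaf_size),
            "outside": build_vpt(outside, leaf_size)}


def range_search(node, query, threshold, out):
    """Collect (series_id, distance) for items closer to query than threshold."""
    if "leaf" in node:
        for sid, s in node["leaf"]:
            d = dist(query, s)
            if d < threshold:
                out.append((sid, d))
        return
    d = dist(query, node["vantage"][1])
    # Triangle inequality: skip a branch that cannot contain a close enough item.
    if d - threshold <= node["radius"]:
        range_search(node["inside"], query, threshold, out)
    if d + threshold > node["radius"]:
        range_search(node["outside"], query, threshold, out)


# Example with ten two-point subsequences from one reception time.
items = [(i, [float(i), float(i * i % 7)]) for i in range(10)]
tree = build_vpt(items)
hits = []
range_search(tree, [3.0, 2.0], 2.5, hits)
```

The smaller the threshold passed to `range_search`, the more branches are skipped, which is why a tight per-reception-time threshold reduces the number of distance calculations.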

The method of creating the similarity search index in the present invention is not limited to this. Any structure in which partial time-series data can be distinguished by reception time may be used, and the multiple similarity search indexes may be consolidated into a single similarity search index instead of being managed in a queue.

Further, the scope of the present invention is not limited to this, and it is also possible to search for provisional similar pair candidates without a similarity search index. The cost of searching for similar partial time-series data can also be reduced with a dimensionality reduction technique. Specific examples include the Discrete Fourier Transform (DFT), Singular Value Decomposition (SVD), Piecewise Aggregate Approximation (PAA), and Symbolic Aggregate Approximation (SAX).
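As an illustration of one of the dimensionality reduction techniques named above, PAA replaces a series with the means of equal-width segments. The sketch below is not taken from the embodiment; the function name `paa` and the requirement that the series length be a multiple of the segment count are assumptions made for simplicity.

```python
def paa(series, segments):
    # Assumption: len(series) is a positive multiple of `segments`.
    if not series or segments <= 0 or len(series) % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    width = len(series) // segments
    # Each output value is the mean of one equal-width segment.
    return [sum(series[i:i + width]) / width
            for i in range(0, len(series), width)]
```

For example, `paa([1, 2, 3, 4, 5, 6, 7, 8], 4)` reduces eight samples to four segment means, so distances can be computed on the shorter representation first.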

The similarity threshold for each reception time is updated, when the number of provisional similar pairs accumulated in the heap for that reception time reaches the upper limit k, to the maximum distance among the provisional similar pairs accumulated in the heap.
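A bounded heap with this threshold rule might look like the sketch below. It is an assumption-laden illustration: the class name `ProvisionalHeap`, the use of Python's `heapq` with negated distances, and the choice to leave the threshold unbounded until k pairs are stored are not stated in the source.

```python
import heapq
import math


class ProvisionalHeap:
    """Holds at most k provisional similar pairs for one reception time."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        # Entries are (-distance, pair), so the root has the largest distance.
        # Pairs must be mutually comparable (e.g. tuples of strings) for ties.
        self._heap = []
        self.threshold = math.inf  # assumption: unbounded until k pairs exist

    def add(self, distance, pair):
        if distance >= self.threshold:
            return False  # not more similar than the current threshold
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-distance, pair))
        else:
            # Replace the least similar stored pair with the new one.
            heapq.heapreplace(self._heap, (-distance, pair))
        if len(self._heap) == self.k:
            # Threshold becomes the maximum distance held in the heap.
            self.threshold = -self._heap[0][0]
        return True
```

Once the heap is full, each accepted pair can only lower the threshold, which is what lets later candidates be rejected cheaply.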

In step 305, the similar pair determination unit 17 performs similarity threshold determination and narrows down the similar pair candidates from the provisional similar pair candidates of similar partial time-series data read from the third memory 18. The similar pair determination unit 17 generates one heap that can manage up to k similar pair candidates. It then adds the provisional similar pair candidates to this heap in order, starting from the newest reception time. After the candidates of each reception time have been added to the heap, the similarity threshold for that reception time is set to the maximum distance among the pairs managed by the heap. Provisional similar pairs that are less similar than this new threshold are deleted from the heap of provisional similar pairs for that reception time. With this procedure, the newer the reception time, the larger the distance threshold set for its provisional similar pairs. The procedure also narrows the provisional similar pairs accumulated in the heap for each reception time down to fewer than the upper limit k, which reduces the memory size required of the third memory that stores the heaps.
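The narrowing procedure of step 305 can be sketched roughly as follows. This is a simplified reading of the text rather than the embodiment's code: the function name `narrow_down`, the dictionary-based input, and the choice to keep a threshold unbounded while the heap holds fewer than k pairs are assumptions.

```python
import heapq
import math


def narrow_down(candidates_by_time, k):
    """candidates_by_time maps reception_time -> [(distance, pair), ...]."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = []  # max-heap via negated distance; holds at most k pairs
    thresholds = {}
    pruned = {}
    for t in sorted(candidates_by_time, reverse=True):  # newest first
        for distance, pair in candidates_by_time[t]:
            if len(top) < k:
                heapq.heappush(top, (-distance, pair))
            elif distance < -top[0][0]:
                heapq.heapreplace(top, (-distance, pair))
        # Assumption: the threshold stays unbounded until k pairs are known.
        thresholds[t] = -top[0][0] if len(top) == k else math.inf
        # Drop pairs of this reception time that exceed its new threshold.
        pruned[t] = [(d, p) for d, p in candidates_by_time[t]
                     if d <= thresholds[t]]
    result = sorted((-nd, p) for nd, p in top)  # most similar first
    return result, thresholds, pruned
```

Because the heap only ever tightens as older reception times are processed, the thresholds returned are non-increasing from the newest reception time to the oldest, matching the behaviour described above.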

FIG. 5 is a conceptual diagram of the narrowing down of similar pairs. It shows how the top three similar partial time-series pairs are searched for among reception times in the range from t-4 to t. The hatched portion in the upper part of the figure represents the pairs newly added to the provisional similar pair candidates, which include partial time-series data of reception time t. The middle part of the figure shows how provisional similar pairs are added, newest reception time first, to the heap that stores similar pair candidates, and how the heap is updated. The lower part of the figure shows the distance thresholds for provisional similar pairs generated in the course of updating the heap of similar pair candidates.

With the above procedure, the data search device 303 retrieves a given number of the most similar partial time-series data pairs from the time-series data of a plurality of series that has flowed in within a fixed period from the present back into the past.

The data search program that causes a computer to execute the data search method of the present invention causes the computer to execute the procedure described above.

11: Receiving unit
12: First memory
13: Section dividing unit
14: Similarity determination unit
15: Index unit
16: Second memory
17: Similar pair determination unit
18: Third memory
19: Provisional similar pair candidate search unit
303: Data search device

Claims

A data search apparatus comprising:
a receiving unit that receives time-series data of a plurality of series flowing in sequentially;
a first memory that stores the time-series data received by the receiving unit;
a section dividing unit that reads the time-series data from the first memory and divides it into partial time-series data of a predetermined time;
a similarity determination unit that obtains the similarity between the partial time-series data divided by the section dividing unit;
a provisional similar pair candidate search unit that, based on the similarity obtained by the similarity determination unit, searches the pairs of partial time-series data generated by the section dividing unit for partial time-series data more similar than a predetermined similarity threshold and stores it as provisional similar pair candidates;
a third memory that stores the provisional similar pair candidates found by the provisional similar pair candidate search unit and the predetermined similarity threshold; and
a similar pair determination unit that performs similarity threshold determination and narrows down similar pairs from the provisional similar pair candidates read from the third memory.

The data search apparatus according to claim 1, wherein the provisional similar pair candidate search unit comprises:
an index unit that creates, based on the similarity obtained by the similarity determination unit from the pairs of partial time-series data generated by the section dividing unit, a similarity search index having a structure in which partial time-series data can be distinguished by reception time; and
a second memory that stores the similarity search index created by the index unit.

A data search method comprising: dividing sequentially input time-series data of a plurality of series into partial time-series data of a predetermined time; obtaining the similarity between the partial time-series data; storing, based on the similarity, candidates for pairs of similar partial time-series data and a similarity threshold for each reception time within a designated time range; storing, as provisional similar pairs, partial time-series data that, among the generated pairs of partial time-series data, is more similar than the stored similarity threshold; and reading a given number of the most similar partial time-series data pairs from the stored provisional similar pairs.

The data search method according to claim 3, wherein, when the similarity between the partial time-series data is obtained, a similarity search index having a structure in which partial time-series data can be distinguished by reception time is used to narrow down the partial time-series data by the similarity threshold for each reception time.

A data search program for causing a computer to execute the data search method according to claim 3 or 4.