JP2006244389A

JP2006244389A - Similar time series data calculating device, and its method and program

Info

Publication number: JP2006244389A
Application number: JP2005062688A
Authority: JP
Inventors: Yasuhiro Fujiwara; 靖宏藤原; Kazuyoshi Mitsui; 一能三井; Masashi Yamamuro; 雅司山室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-03-07
Filing date: 2005-03-07
Publication date: 2006-09-14
Anticipated expiration: 2025-03-07
Also published as: JP4451332B2

Abstract

PROBLEM TO BE SOLVED: To more efficiently divide a multi-dimensional space when finding similar candidates of time series data flowing in sequentially.

SOLUTION: A center point determining part 33 in a space index part 3 determines the center point of a concentric hypersphere in the multi-dimensional space. A candidate searching part 34 divides the multi-dimensional space into concentric hypersphere structures with the center point as a center. Thus, the multi-dimensional space is divided using all axes, which is more efficient.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a similar time-series data calculation apparatus, a similar time-series data calculation method, and a similar time-series data calculation program for finding similar items among a plurality of time-series data.

Time-series data streams have been applied to services in various fields such as stock trading, climate observation, media processing, and medical care. One of the most widely studied problems in time-series data processing is finding similar items among accumulated time-series data. Applications of this technique include the following.

- Find stocks whose past price movements are the same.

- Find years that showed similar temperature patterns in the past.

- Find similar videos among stored videos.

In recent years, sensors that detect time-series data have become smaller and cheaper, making services that use large volumes of time-series sensor data streams feasible. Processing techniques for time-series data output from sensors are therefore expected to grow in importance. One characteristic of sensor data streams is that a large amount of time-series data flows in sequentially.

Applications of a technique for finding similar items among sequentially incoming time-series data include the following.

- Find products with the same sales pattern in real time.

- Find locations that shake in the same way when an earthquake occurs.

- Find people who are making the same movement at the current time.

In the present application, the problem of finding similar time-series data is defined as follows: "Among m time-series data, find the combinations of time-series data whose distance over a fixed past period (n data points) ending at the current time T is within a threshold ε."

The time-series data handled here include discrete values, such as stock prices, and continuous values, such as movement trajectories. For discrete values, similarity can be determined by checking whether the distance between the time-series data, computed from the discrete values, is within the threshold. Continuous values are sampled into discrete values when they are processed, so the discrete-value method applies in the end.

The following Euclidean distance is used as the distance function between two time-series data $S^u$ and $S^v$.

Defining the distance between time-series data as the Euclidean distance allows the problem of finding similar time-series data to be recast, as shown in FIG. 15, as the problem of finding nearby points in a multidimensional space.

With the distance between time-series data defined as the Euclidean distance, the problem in the present application becomes that of finding the set of combinations of time-series data that satisfy the following condition, where ε is the threshold.
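As a point of reference, the problem statement above can be sketched as a brute-force computation that compares every pair. This is only a minimal illustration and not the method of the invention: the function names `euclidean` and `similar_pairs` and the sample data are hypothetical, and the distance is assumed to be the standard Euclidean distance over the last n data points.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_pairs(series, n, eps):
    """Return index pairs (u, v) whose last n data points are within eps."""
    windows = [s[-n:] for s in series]  # the n most recent points of each series
    return [
        (u, v)
        for u, v in combinations(range(len(windows)), 2)
        if euclidean(windows[u], windows[v]) <= eps
    ]


# Hypothetical sample data: three short series.
streams = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [9.0, 8.0, 7.0, 6.0],
]
pairs = similar_pairs(streams, n=3, eps=0.5)
```

This baseline evaluates all m(m-1)/2 pairs, which is the cost that a spatial index is meant to avoid.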

GEMINI (GEneric Multimedia INdexIng method) is a framework for finding similar time-series data at high speed (see Non-Patent Document 1). GEMINI first applies dimension compression to the time-series data, builds a spatial index such as an R*-tree (see Non-Patent Document 2) from the features of the compressed data, and uses the spatial index to narrow down the candidates for similar time-series data. For the narrowed-down candidates, it then computes the distance between the time-series data before dimension compression to determine the similar time-series data exactly.
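The filter-and-refine flow described for GEMINI can be sketched roughly as follows. This is a simplified illustration under stated assumptions: a linear scan over the compressed data stands in for a spatial index such as an R*-tree, and the hypothetical `keep_first_two` function (plain truncation) stands in for a real dimension compression method. All names and data are illustrative.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_and_refine(series, eps, compress):
    """Return (candidates, results) for the pairs within eps."""
    digests = [compress(s) for s in series]
    # Filter step: cheap distance on the compressed data.
    # (A linear scan is used here in place of a spatial index.)
    candidates = [
        (u, v)
        for u, v in combinations(range(len(series)), 2)
        if dist(digests[u], digests[v]) <= eps
    ]
    # Refine step: exact distance on the uncompressed data removes
    # any false positives that passed the filter.
    results = [(u, v) for u, v in candidates if dist(series[u], series[v]) <= eps]
    return candidates, results


def keep_first_two(s):
    """Hypothetical stand-in for dimension compression: keep two values."""
    return s[:2]


data = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [0.0, 0.1, 5.0, 5.0],
]
candidates, results = filter_and_refine(data, eps=0.5, compress=keep_first_two)
```

In this toy data all three pairs pass the filter, but only the pair (0, 1) survives the refine step. Truncation never increases the Euclidean distance, so the filter cannot discard a truly similar pair.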

Conventionally, GEMINI has mostly been applied to accumulated time-series data. In that case, time can be spent on building the spatial index. When the target is sequentially incoming time-series data, however, the spatial index must be built at high speed.

When similar time-series data are found using a spatial index, attention must also be paid to errors. Two kinds of error are possible: false negatives and false positives. A false negative is a judgment that similar time-series data are not similar, and a false positive is a judgment that dissimilar time-series data are similar. Of the two, avoiding false negatives is the more important: false positives can be removed by examining the results of the similarity judgment more closely, whereas false negatives affect the correctness of the results.

The grid structure is a spatial index method that produces no false negatives and targets sequentially incoming time-series data (see Non-Patent Document 3). As shown in FIG. 16, the grid structure divides the multidimensional space in which the time-series data exist into a plurality of regions using a grid, and narrows down the combinations of similar time-series data without missing any. It takes the multidimensional space that has been compressed to N dimensions by the discrete Fourier transform and, using the first few coefficients of the discrete Fourier transform, divides it into a grid of cells with a fixed width. Candidates for combinations similar to a given time-series datum are then limited to the region to which that datum belongs and the regions adjacent to it. Because not every combination of time-series data has to be computed, the number of calculations is reduced. FIG. 16 shows an example in which a three-dimensional space is divided into a two-dimensional grid structure.

Non-Patent Document 1: R. Agrawal, C. Faloutsos, and A. N. Swami, "Efficient Similarity Search In Sequence Databases", In Proc. FODO, 1993
Non-Patent Document 2: N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", In Proc. SIGMOD, 1990
Non-Patent Document 3: Y. Zhu and D. Shasha, "StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time", In Proc. VLDB, 2002

One problem with the grid structure is that it cannot divide the multidimensional space effectively, because it divides the space on the basis of a single axis. Since the distance between time-series data is a distance in the multidimensional space, dividing the space on the basis of a single axis yields low precision. In addition, the grid structure has little effect unless it is used with a dimension compression method, such as the discrete Fourier transform, that concentrates most of the energy in the first few coefficients. Furthermore, the grid structure must compute the post-compression distance for every time-series datum in the candidate regions to determine whether it is a similar candidate.

The present invention has been made in view of the above, and its object is to divide a multidimensional space more efficiently when finding similar candidates among sequentially incoming time-series data.

The similar time-series data calculation apparatus according to the first aspect of the present invention comprises: data receiving means for receiving sequentially incoming time-series data and storing it in a first memory; dimension compression means for dimensionally compressing the time-series data read from the first memory and storing the result in a second memory; center point determination means for determining the center point of concentric hyperspheres in a multidimensional space and storing it in a third memory; and candidate search means for dividing the multidimensional space into a concentric hypersphere structure centered on the center point read from the third memory, narrowing down the candidates for similar combinations of the time-series data read from the second memory, and storing them in a fourth memory.

In the present invention, the center point of concentric hyperspheres in the multidimensional space is determined and the multidimensional space is divided into a concentric hypersphere structure centered on that point. The multidimensional space can thus be divided using all axes, which is more efficient.

In the above similar time-series data calculation apparatus, the center point determination means comprises: clustering means for clustering the time-series data after dimension compression and storing the result in a fifth memory; and center point position calculating means for determining the position of the center point in the multidimensional space using the clustering result read from the fifth memory and storing it in the third memory.

In the present invention, clustering is performed on all the time-series data and the center point is determined from the clustering result, so the center point can be placed appropriately.

In the above similar time-series data calculation apparatus, the center point determination means comprises sample means for randomly selecting samples from the time-series data after dimension compression before the clustering.

In the present invention, clustering is performed on randomly sampled time-series data and the center point is determined from the clustering result, so the center point can be calculated at high speed.

In the above similar time-series data calculation apparatus, the center point determination means comprises center point position calculating means for randomly determining the position of the center point in the multidimensional space for the time-series data after dimension compression and storing it in the third memory.

In the present invention, the center point is determined randomly, so it can be calculated even faster.

In the above similar time-series data calculation apparatus, the candidate search means comprises: space dividing means for dividing the multidimensional space into a concentric hypersphere structure, measured from the center point, in increments of the threshold used to judge the similarity of time-series data; candidate area determining means for determining, from the result of the division, the candidate regions in which similar time-series data exist; and candidate determination means for narrowing down the similar candidates of the time-series data after dimension compression from the determined candidate regions and storing the combinations of similar candidates in the fourth memory.

In the present invention, the multidimensional space is divided into a concentric hypersphere structure, measured from the center point, in increments of the threshold used to judge the similarity of time-series data. The candidate regions in which similar time-series data exist can therefore be limited to the region to which the time-series data belongs and the regions directly adjacent to it, which reduces the number of candidate regions.

In the above similar time-series data calculation apparatus, the space dividing means divides the multidimensional space in increments of a value smaller than the threshold, instead of the threshold.

In the present invention, the multidimensional space is divided in increments of a value ε/c (c > 1) smaller than the threshold ε. The regions are divided more finely, so the candidate regions can be made smaller. As a result, time-series data that lie in the $R^{O_i}_u$-th and $R^{O_i}_v$-th regions from a center point $O_i$, respectively, can be treated as candidates for similar time-series data without computing the distance after dimension compression, provided that $R^{O_i}_u + R^{O_i}_v \le c$.
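The condition on the region counts can be illustrated with a short sketch. The reasoning assumed here is the triangle inequality: a point in the R-th shell of width ε/c lies within R·ε/c of the center, so two points whose shell numbers sum to at most c are within ε of each other by way of the center. The function names, the 1-based shell numbering, and the sample points are hypothetical choices for illustration.

```python
import math


def ring_index(point, center, eps, c):
    """1-based number of the shell of width eps/c around center containing point."""
    width = eps / c
    return max(1, math.ceil(math.dist(point, center) / width))


def surely_within_eps(r_u, r_v, c):
    """True if the shell numbers alone guarantee a compressed distance <= eps.

    Each point is at most r * (eps / c) from the center, so the two points
    are at most (r_u + r_v) * (eps / c) apart; this is <= eps when
    r_u + r_v <= c.
    """
    return r_u + r_v <= c


eps, c = 1.0, 4                      # shells of width 0.25
center = (0.0, 0.0)
p, q, r = (0.3, 0.0), (0.0, -0.4), (0.9, 0.0)

rp = ring_index(p, center, eps, c)   # distance 0.3 from the center
rq = ring_index(q, center, eps, c)   # distance 0.4 from the center
rr = ring_index(r, center, eps, c)   # distance 0.9 from the center
```

Here p and q both fall in shell 2, so their sum equals c and they can be accepted as candidates without a distance calculation. The pair p and r is not covered by the rule and would still need its compressed distance checked.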

The similar time-series data calculation method according to the second aspect of the present invention comprises the steps of: receiving sequentially incoming time-series data and storing it in a first memory; dimensionally compressing the time-series data read from the first memory and storing the result in a second memory; determining the center point of concentric hyperspheres in a multidimensional space and storing it in a third memory; and dividing the multidimensional space into a concentric hypersphere structure centered on the center point read from the third memory, narrowing down the candidates for similar combinations of the time-series data read from the second memory, and storing them in a fourth memory.

The similar time-series data calculation program according to the third aspect of the present invention causes a computer to execute: a process of receiving sequentially incoming time-series data and storing it in a first memory; a process of dimensionally compressing the time-series data read from the first memory and storing the result in a second memory; a process of determining the center point of concentric hyperspheres in a multidimensional space and storing it in a third memory; and a process of dividing the multidimensional space into a concentric hypersphere structure centered on the center point read from the third memory, narrowing down the candidates for similar combinations of the time-series data read from the second memory, and storing them in a fourth memory.

According to the present invention, the multidimensional space can be divided more efficiently when finding similar time-series data.

The best mode for carrying out the present invention is described below with reference to the drawings.

As shown in the block diagram of FIG. 1, the similar time-series data calculation apparatus 1 of the present embodiment includes a data reception unit 2, a spatial index unit 3, a similarity determination unit 4, and a calculation result transmission unit 5. The processing in each unit is executed by a computer program installed in the similar time-series data calculation apparatus 1. The processing of each unit is described in detail later; an outline follows.

The data reception unit 2 receives sequentially incoming time-series data and stores it in an internal memory.

The spatial index unit 3 receives the time-series data read from the memory of the data reception unit 2, uses a spatial index to narrow down the candidates for similar time-series data, and stores them in an internal memory.

The similarity determination unit 4 reads from the memory of the data reception unit 2 the uncompressed time-series data corresponding to the similar time-series data candidates read from the memory of the spatial index unit 3, computes the distances between these time-series data, stores the distances in an internal memory, and judges time-series data whose distance is within a given range to be similar.

The calculation result transmission unit 5 transmits the combinations of time-series data judged to be similar to the outside.

The spatial index unit 3 employs the CS structure (Concentric hyper Sphere structure) method. This method divides the multidimensional space into a concentric hypersphere structure to narrow down the combinations of similar time-series data. Specifically, as shown in FIG. 2, a plurality of center points are set in the multidimensional space, and the space is divided, according to the distance from each center point, with a width no greater than a predetermined threshold ε. This threshold ε is the threshold used to judge the similarity between time-series data. When the space is divided with a width no greater than ε in this way, the candidate regions for data similar to a given time-series datum are limited to the region to which that datum belongs and a number of regions adjacent to it within a certain range. FIG. 2 shows an example in which two sets of concentric spheres are overlaid on a three-dimensional space and the space is divided with a width of ε; there, the candidate regions are limited to the region to which the time-series datum belongs and the regions adjacent to it within a range of up to two. The configuration of the spatial index unit 3 is described below.
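The idea of pruning with several center points can be sketched as follows. This is a minimal illustration, not the apparatus itself: it assumes shells of width exactly ε, numbers them from 1, and keeps a pair only if its shell numbers differ by at most one for every center. That criterion is derived here from the triangle inequality (a difference of two or more shells implies a distance greater than ε) and may differ in detail from the criterion used in the embodiment. The function names and sample points are hypothetical.

```python
import math
from itertools import combinations


def shell(point, center, width):
    """1-based number of the concentric shell of the given width containing point."""
    return max(1, math.ceil(math.dist(point, center) / width))


def cs_candidates(points, centers, eps):
    """Pairs that no center rules out, using shells of width eps."""
    keys = [[shell(p, o, eps) for o in centers] for p in points]
    return [
        (u, v)
        for u, v in combinations(range(len(points)), 2)
        # A pair survives only if it is in the same or adjacent shells
        # with respect to every center point.
        if all(abs(a - b) <= 1 for a, b in zip(keys[u], keys[v]))
    ]


# Hypothetical 2-D points standing in for dimension-compressed series.
points = [(0.2, 0.1), (0.4, 0.3), (3.0, 0.2), (0.3, 3.1)]
centers = [(0.0, 0.0), (4.0, 0.0)]
cands = cs_candidates(points, centers, eps=1.0)
```

With only the first center, points 2 and 3 fall in the same shell and would remain candidates; the second center separates them. This is one way to see why well-spread center points help.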

As shown in the block diagram of FIG. 3, the spatial index unit 3 includes a data reception unit 31, a dimension compression unit 32, a center point determination unit 33, a candidate search unit 34, and a data transmission unit 35. The processing in each unit is executed by a computer program.

The data reception unit 31 receives the time-series data read from the memory of the data reception unit 2.

The dimension compression unit 32 dimensionally compresses the received time-series data and stores the result in an internal memory. For example, the discrete Fourier transform is used as the dimension compression method, and its coefficients are taken as the result of the compression. Other dimension compression methods may of course be used.

The center point determination unit 33 determines the center points of the concentric hyperspheres in the multidimensional space and stores them in an internal memory. Details of this processing are described later.

The candidate search unit 34 divides the multidimensional space into a concentric hypersphere structure centered on the center points read from the memory of the center point determination unit 33, narrows down the candidates for combinations of similar time-series data, and stores them in an internal memory. Details of this processing are also described later.

The data transmission unit 35 outputs the candidates for combinations of similar time-series data to the similarity determination unit 4.

As shown in the block diagram of FIG. 4, the center point determination unit 33 includes a data reception unit 331, a sample unit 332, a clustering unit 333, a center point calculation unit 334, and a data transmission unit 335. The processing in each unit is executed by a computer program.

The data reception unit 331 receives the dimension-compressed time-series data read from the memory of the dimension compression unit 32 and stores it in an internal memory.

The sample unit 332 randomly selects samples from the dimension-compressed time-series data stored in the memory of the data reception unit 331 and stores them in an internal memory.

The clustering unit 333 reads the dimension-compressed time-series data selected as samples from the memory of the sample unit 332, clusters them, and stores the result in an internal memory.

The center point calculation unit 334 determines the positions of the center points in the multidimensional space using the clustering result read from the memory of the clustering unit 333 and stores them in an internal memory.

The data transmission unit 335 outputs the center point positions read from the memory, together with the dimension-compressed time-series data selected as samples, to the candidate search unit 34 of the spatial index unit 3.

As shown in the block diagram of FIG. 5, the candidate search unit 34 includes a data reception unit 341, a space division unit 342, a candidate area determination unit 343, a candidate determination unit 344, and a data transmission unit 345. The processing in each unit is executed by a computer program.

The data reception unit 341 receives the transmitted center point positions and dimension-compressed time-series data and stores them in an internal memory.

The space division unit 342 divides the multidimensional space into a concentric hypersphere structure, centered on the center points read from the memory of the data reception unit 341, in increments of a predetermined value. This value is ε/c (c > 1), which is smaller than the threshold ε used to judge whether time-series data are similar.

The candidate area determination unit 343 determines the candidate regions from which candidates for similar time-series data are selected in the divided multidimensional space. This determination uses the result of the division or the regions to which the dimension-compressed time-series data belong.

The candidate determination unit 344 narrows down the similar candidates of the dimension-compressed time-series data from the determined candidate regions and stores the combinations of similar candidates in an internal memory.

The data transmission unit 345 outputs the combinations of similar candidates read from the memory of the candidate determination unit 344 to the data transmission unit 35 of the spatial index unit 3.

Next, the conditions that the dimension compression method used by the dimension compression unit 32 of the spatial index unit 3 must satisfy are described.

The dimension compression performed by the dimension compression unit 32 must not produce false negatives. The lower bounding lemma is known as a theorem that guarantees this. It states that no false negatives occur if the following expression (the lower bounding condition) holds.

$D_{digest}(S'^u, S'^v) \le D_{real}(S^u, S^v)$ (3)

Here, the functions are defined as follows.

$D_{real}(S^u, S^v)$: distance between the time-series data $S^u$ and $S^v$ before dimension compression
$D_{digest}(S'^u, S'^v)$: distance between the time-series data $S'^u$ and $S'^v$ after dimension compression

The dimension compression unit 32 therefore uses a method for which this lower bounding condition holds, such as the discrete Fourier transform, the discrete wavelet transform, or singular value decomposition. Here, the distance between time-series data before dimension compression is the Euclidean distance, but it is not limited to this.
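The lower bounding condition can be checked numerically for one of the named methods. The sketch below assumes an orthonormal discrete Fourier transform (scaled by 1/√n) truncated to its first few coefficients; with that scaling, Parseval's theorem makes the full-coefficient distance equal to the original Euclidean distance, so dropping coefficients can only shrink it. The function names and the sample series are hypothetical.

```python
import cmath
import math


def dft_digest(series, n_coeffs):
    """First n_coeffs coefficients of the orthonormal DFT of series."""
    n = len(series)
    scale = 1.0 / math.sqrt(n)  # orthonormal scaling preserves energy
    return [
        scale * sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(series))
        for f in range(n_coeffs)
    ]


def d_real(a, b):
    """Euclidean distance between the uncompressed series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def d_digest(a, b):
    """Euclidean distance between two lists of complex DFT coefficients."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


su = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
sv = [2.0, 2.0, 4.0, 4.0, 6.0, 6.0, 8.0, 8.0]

compressed = d_digest(dft_digest(su, 2), dft_digest(sv, 2))
original = d_real(su, sv)
# Lower bounding condition (3): compressed <= original
```

Because the compressed distance never exceeds the original one, any pair whose compressed distance is greater than ε can be discarded without risking a false negative.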

Next, the method by which the center point determination unit 33 determines the center points of the concentric hyperspheres is described.

In the CS structure, if several center points are gathered close together in the multidimensional space, the candidate regions of those center points overlap to a large extent, as shown in FIG. 6. For efficient narrowing down, the center points should instead be distributed uniformly in the multidimensional space. To distribute them uniformly, a clustering unit 333 is provided here and the K-means clustering method is used. The K-means algorithm is as follows.

Step 1 (initialization): Set an initial partition into K clusters at random (K is an arbitrary positive integer).

Step 2 (assignment of individuals): Assign each individual to its nearest representative point.

Step 3 (update of representative points): If no cluster center has changed, terminate. Otherwise, compute the representative point of each cluster and return to Step 2.

Here, the time-series data play the role of the individuals, and the center of each cluster (the mean of its elements) is used as the center point of a hypersphere.
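The three steps above can be sketched as follows. This is a minimal pure-Python illustration, not the implementation of the clustering unit 333: the function names, the `max_iter` safeguard, the shuffled round-robin initial partition, and the rule that an empty cluster keeps its previous center are all assumptions introduced for the example.

```python
import math
import random


def cluster_means(points, labels, k, previous):
    """Mean of each cluster; an empty cluster keeps its previous center."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, j in zip(points, labels):
        counts[j] += 1
        for a in range(dims):
            sums[j][a] += p[a]
    return [
        tuple(v / counts[j] for v in sums[j]) if counts[j] else previous[j]
        for j in range(k)
    ]


def kmeans(points, k, max_iter=100, seed=0):
    """Return k cluster centers, usable as hypersphere center points."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Step 1: random initial partition (round-robin labels, then shuffled,
    # so that every cluster starts non-empty).
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centers = cluster_means(points, labels, k, previous=None)
    for _ in range(max_iter):
        # Step 2: assign each individual to its nearest representative point.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Step 3: recompute the centers and stop once none of them has moved.
        new_centers = cluster_means(points, labels, k, previous=centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

For example, `kmeans(points, 2)` on two well-separated groups of points returns one center near the mean of each group.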

This embodiment targets time-series data that flow in sequentially, so the center points should be computable quickly. The computational cost of the K-means method must therefore be kept low. This cost is known to be $S \times M \times r \times d$, where $S$ is the number of clusters (center points), $M$ is the number of individuals (time-series data), $r$ is the number of iterations until the K-means computation converges, and $d$ is the number of dimensions used when computing distances. In this embodiment, the number of individuals $M$ is reduced as follows.

$M = m \times k$ (4)
Here, $m$ is the total number of time-series data and $k$ is the sample rate (0 to 1) that determines how many of the time-series data are used for the K-means method. The smaller $k$ is, the fewer data points are used to determine the center points, so the multidimensional space can be divided faster. On the other hand, a center point may then end up at an unsuitable position.
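A small sketch of expression (4) and the cost formula follows. The helper names `sample_for_clustering` and `kmeans_cost`, the rounding of $m \times k$ to an integer, and the restriction to $0 < k \le 1$ (so that at least one series is kept) are assumptions for illustration only.

```python
import random


def sample_for_clustering(series, k, seed=None):
    """Randomly keep M = m * k of the m series (rounded, at least one)."""
    if not 0.0 < k <= 1.0:
        raise ValueError("sample rate k must satisfy 0 < k <= 1")
    m = len(series)
    big_m = max(1, round(m * k))
    return random.Random(seed).sample(series, big_m)


def kmeans_cost(s, big_m, r, d):
    """Cost model S x M x r x d quoted in the text."""
    return s * big_m * r * d


all_series = list(range(1000))          # stand-ins for m = 1000 time series
subset = sample_for_clustering(all_series, 0.1, seed=1)
# Compare the modelled cost with and without sampling.
print(kmeans_cost(10, len(all_series), 20, 8), kmeans_cost(10, len(subset), 20, 8))
```

Under this cost model, a sample rate of 0.1 reduces the modelled cost by the same factor of ten, since the cost is linear in $M$.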

Next, the method by which the candidate search unit 34 determines candidates for similar time-series data is described. In this embodiment the multidimensional space is divided into regions of width $\varepsilon/c$ ($c > 1$), and whether two time-series data are dissimilar is judged from the regions to which they belong. As illustrated in FIG. 7, suppose that, counted from the center point $O_i$, the dimension-compressed time-series data $S'^{u}$ lies in the $R^{O_i}_u$-th region ($R^{O_i}_u > 0$) and $S'^{v}$ lies in the $R^{O_i}_v$-th region ($R^{O_i}_v > 0$). Time-series data in regions satisfying expression (5) are judged to be possibly similar, and time-series data in regions satisfying expression (6) are judged to be dissimilar.

$|R^{O_i}_u - R^{O_i}_v| < c + 1$ (5)
$|R^{O_i}_u - R^{O_i}_v| \ge c + 1$ (6)
For time-series data that may be similar, the distance after dimension compression is calculated, and whether the pair is a candidate for similar time-series data is judged by whether this distance is within a certain range. In the example of FIG. 7, $c = 1.5$, $R^{O_i}_u = 2$, and $R^{O_i}_v = 4$ or $R^{O_i}_v = 5$. If $R^{O_i}_v = 4$, then $|2 - 4| = 2 < 2.5$ and the time-series data may be similar. If $R^{O_i}_v = 5$, then $|2 - 5| = 3 \ge 2.5$ and the time-series data are dissimilar.
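The region test of expressions (5) and (6) can be sketched as below. The function names are hypothetical, and the numbering convention (the $R$-th region covers distances from $(R-1)\varepsilon/c$ up to but not including $R\,\varepsilon/c$) is an assumption chosen so that region numbers start at 1, as the text requires.

```python
import math


def region_index(dist_to_center, eps, c):
    """1-based index of the concentric shell of width eps / c containing a point."""
    width = eps / c
    return math.floor(dist_to_center / width) + 1


def may_be_similar(r_u, r_v, c):
    """Expression (5): True if the pair cannot be ruled out as dissimilar."""
    return abs(r_u - r_v) < c + 1


# Values from the FIG. 7 example: c = 1.5 and R_u = 2.
print(may_be_similar(2, 4, 1.5))  # R_v = 4
print(may_be_similar(2, 5, 1.5))  # R_v = 5
```

Only pairs for which `may_be_similar` returns `True` need a distance computation on the compressed data.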

When a spatial index is used, false negatives must not occur. The candidate search unit 34 therefore introduces a new lemma, the contractive transform lemma, which guarantees this.

The contractive transform lemma states that no false negative occurs if the following expression (the contractive transform condition) holds.

$D_{\mathrm{contractive}}(S'^{u}, S'^{v}) < D_{\mathrm{digest}}(S'^{u}, S'^{v}) \le D_{\mathrm{real}}(S^{u}, S^{v})$ (7)
Here, the functions are defined as follows.

$D_{\mathrm{real}}(S^{u}, S^{v})$: distance between the time-series data $S^{u}$ and $S^{v}$ before dimension compression
$D_{\mathrm{digest}}(S'^{u}, S'^{v})$: distance between the time-series data $S'^{u}$ and $S'^{v}$ after dimension compression
$D_{\mathrm{contractive}}(S'^{u}, S'^{v})$: distance between the time-series data $S'^{u}$ and $S'^{v}$ after the multidimensional space has been divided and a new metric space has been set
The contractive transform lemma can be proved as follows. To guarantee that no false negative occurs, $D_{\mathrm{contractive}}(S'^{u}, S'^{v}) \le \varepsilon$ must hold whenever $D_{\mathrm{real}}(S^{u}, S^{v}) \le \varepsilon$. By the contractive transform condition, the following expression holds in that case, which proves the lemma.

$D_{\mathrm{contractive}}(S'^{u}, S'^{v}) < D_{\mathrm{digest}}(S'^{u}, S'^{v}) \le D_{\mathrm{real}}(S^{u}, S^{v}) \le \varepsilon$ (8)
Here, with $O_i$ as the center point, $D_{\mathrm{contractive}}(S'^{u}, S'^{v})$ is defined as $(|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c)$.

Next, it is shown that $D_{\mathrm{contractive}}(S'^{u}, S'^{v})$ satisfies the contractive transform condition. First, the following expression holds by the lower bounding condition.

$D_{\mathrm{digest}}(S'^{u}, S'^{v}) \le D_{\mathrm{real}}(S^{u}, S^{v})$ (9)
Next, it is shown that the following expression holds.

$D_{\mathrm{contractive}}(S'^{u}, S'^{v}) < D_{\mathrm{digest}}(S'^{u}, S'^{v})$ (10)
[i] When $S'^{u}$, $S'^{v}$, and $O_i$ are not on the same straight line
The points $S'^{u}$, $S'^{v}$, and $O_i$ form a triangle in their common plane, as shown in FIG. 8. Writing $\overline{AB}$ for the length of segment $AB$, the conditions for the triangle $S'^{u}S'^{v}O_i$ to exist give the following.

$\overline{S'^{u}O_i} + \overline{S'^{v}O_i} > \overline{S'^{u}S'^{v}}$ and
$\overline{S'^{v}O_i} + \overline{S'^{u}S'^{v}} > \overline{S'^{u}O_i}$ and
$\overline{S'^{u}S'^{v}} + \overline{S'^{u}O_i} > \overline{S'^{v}O_i}$ (11)
From this, $\overline{S'^{u}S'^{v}} > |\overline{S'^{u}O_i} - \overline{S'^{v}O_i}|$. Moreover, $|\overline{S'^{u}O_i} - \overline{S'^{v}O_i}| > (|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c)$, because each point lies inside a region of width $\varepsilon/c$. Since $\overline{S'^{u}S'^{v}} = D_{\mathrm{digest}}(S'^{u}, S'^{v})$ and $(|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c) = D_{\mathrm{contractive}}(S'^{u}, S'^{v})$, expression (10) holds.

[ii-1] When $S'^{u}$, $S'^{v}$, and $O_i$ are on the same straight line and $O_i$ is inside the segment $S'^{u}S'^{v}$
As shown in FIG. 9, in this case $|\overline{S'^{u}O_i} - \overline{S'^{v}O_i}| < \overline{S'^{u}S'^{v}}$. Moreover, $|\overline{S'^{u}O_i} - \overline{S'^{v}O_i}| > (|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c)$. Since $\overline{S'^{u}S'^{v}} = D_{\mathrm{digest}}(S'^{u}, S'^{v})$ and $(|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c) = D_{\mathrm{contractive}}(S'^{u}, S'^{v})$, expression (10) holds.

[ii-2] When $S'^{u}$, $S'^{v}$, and $O_i$ are on the same straight line and $O_i$ is not inside the segment $S'^{u}S'^{v}$
As shown in FIG. 10, in this case $|\overline{S'^{u}O_i} - \overline{S'^{v}O_i}| = \overline{S'^{u}S'^{v}}$. Moreover, $|\overline{S'^{u}O_i} - \overline{S'^{v}O_i}| > (|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c)$. Since $\overline{S'^{u}S'^{v}} = D_{\mathrm{digest}}(S'^{u}, S'^{v})$ and $(|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c) = D_{\mathrm{contractive}}(S'^{u}, S'^{v})$, expression (10) holds.

Therefore, $D_{\mathrm{contractive}}(S'^{u}, S'^{v}) < D_{\mathrm{digest}}(S'^{u}, S'^{v})$ holds in each of the cases [i], [ii-1], and [ii-2].

Here, the condition $D_{\mathrm{contractive}}(S'^{u}, S'^{v}) = (|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c) < \varepsilon$ is equivalent to $|R^{O_i}_u - R^{O_i}_v| < c + 1$. Time-series data satisfying $|R^{O_i}_u - R^{O_i}_v| < c + 1$ may therefore be similar. Conversely, if $\varepsilon \le D_{\mathrm{contractive}}(S'^{u}, S'^{v}) = (|R^{O_i}_u - R^{O_i}_v| - 1) \cdot (\varepsilon/c)$ (which is less than $D_{\mathrm{real}}(S^{u}, S^{v})$), the time-series data are dissimilar, and this condition is equivalent to $|R^{O_i}_u - R^{O_i}_v| \ge c + 1$. Time-series data with $|R^{O_i}_u - R^{O_i}_v| \ge c + 1$ are therefore dissimilar.
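Expression (10) can also be spot-checked numerically. The sketch below draws random points around a random center and confirms that the contractive distance never exceeds the Euclidean distance between the compressed points. All names, the parameter values, the half-open region convention, and the small floating-point tolerance are assumptions of this example, and a random check of this kind illustrates the inequality without replacing the proof.

```python
import math
import random


def region_index(dist_to_center, eps, c):
    """1-based shell index, assuming shell R covers [(R-1)*eps/c, R*eps/c)."""
    return math.floor(dist_to_center / (eps / c)) + 1


def d_contractive(p, q, center, eps, c):
    """(|R_u - R_v| - 1) * (eps / c) for two compressed points p and q."""
    r_p = region_index(math.dist(p, center), eps, c)
    r_q = region_index(math.dist(q, center), eps, c)
    return (abs(r_p - r_q) - 1) * (eps / c)


def check_random_pairs(trials=1000, dims=4, eps=1.0, c=2.0, seed=0):
    """Return True if D_contractive <= D_digest (up to rounding) for every pair."""
    rng = random.Random(seed)
    center = tuple(rng.uniform(-5, 5) for _ in range(dims))
    for _ in range(trials):
        p = tuple(rng.uniform(-5, 5) for _ in range(dims))
        q = tuple(rng.uniform(-5, 5) for _ in range(dims))
        if d_contractive(p, q, center, eps, c) > math.dist(p, q) + 1e-12:
            return False
    return True


print(check_random_pairs())
```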

The candidate determination unit 344 of the candidate search unit 34 calculates the distance after dimension compression for time-series data that may be similar, and narrows the similar candidates down to those whose distance is within a certain range. In some cases, however, a pair can be accepted as a similar candidate without calculating the distance after dimension compression. For example, as shown in FIG. 11, when the time-series data $S'^{u}$ lies in the $R^{O_i}_u$-th region from the center point $O_i$ and $S'^{v}$ lies in the $R^{O_i}_v$-th region from $O_i$, time-series data in regions with $R^{O_i}_u + R^{O_i}_v \le c$ can be narrowed down as similar candidates without calculating the distance. For time-series data in regions with $R^{O_i}_u + R^{O_i}_v > c$, the distance after dimension compression is calculated and similarity is judged from this distance. In the example of FIG. 11, $c = 5.5$, $R^{O_i}_u = 2$, and $R^{O_i}_v = 3$ or $R^{O_i}_v = 4$. If $R^{O_i}_v = 3$, the pair is a similar candidate; if $R^{O_i}_v = 4$, the distance after dimension compression is calculated. This processing is described in detail below.

The candidate determination unit 344 obtains, as similar candidates, time-series data for which the following expression holds.

$D_{\mathrm{digest}}(S'^{u}, S'^{v}) < D_{\mathrm{expanded}}(S'^{u}, S'^{v}) \le \varepsilon$ (12)
Here, $D_{\mathrm{expanded}}(S'^{u}, S'^{v})$ is defined as $(R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c)$. In the following, it is shown that the following expression holds.

$D_{\mathrm{digest}}(S'^{u}, S'^{v}) < D_{\mathrm{expanded}}(S'^{u}, S'^{v})$ (13)
[i] When $S'^{u}$, $S'^{v}$, and $O_i$ are not on the same straight line
The points $S'^{u}$, $S'^{v}$, and $O_i$ form a triangle in their common plane, as shown in FIG. 12. From the conditions for the triangle $S'^{u}S'^{v}O_i$ to exist, $\overline{S'^{u}O_i} + \overline{S'^{v}O_i} > \overline{S'^{u}S'^{v}}$. Moreover, $\overline{S'^{u}O_i} + \overline{S'^{v}O_i} < (R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c)$. Since $\overline{S'^{u}S'^{v}} = D_{\mathrm{digest}}(S'^{u}, S'^{v})$ and $(R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c) = D_{\mathrm{expanded}}(S'^{u}, S'^{v})$, expression (13) holds.

[ii-1] When $S'^{u}$, $S'^{v}$, and $O_i$ are on the same straight line and $O_i$ is inside the segment $S'^{u}S'^{v}$
As shown in FIG. 13, in this case $\overline{S'^{u}S'^{v}} = \overline{S'^{u}O_i} + \overline{S'^{v}O_i}$. Moreover, $\overline{S'^{u}O_i} + \overline{S'^{v}O_i} < (R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c)$. Since $\overline{S'^{u}S'^{v}} = D_{\mathrm{digest}}(S'^{u}, S'^{v})$ and $(R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c) = D_{\mathrm{expanded}}(S'^{u}, S'^{v})$, expression (13) holds.

[ii-2] When $S'^{u}$, $S'^{v}$, and $O_i$ are on the same straight line and $O_i$ is not inside the segment $S'^{u}S'^{v}$
As shown in FIG. 14, in this case $\overline{S'^{u}O_i} + \overline{S'^{v}O_i} > \overline{S'^{u}S'^{v}}$. Moreover, $\overline{S'^{u}O_i} + \overline{S'^{v}O_i} < (R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c)$. Since $\overline{S'^{u}S'^{v}} = D_{\mathrm{digest}}(S'^{u}, S'^{v})$ and $(R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c) = D_{\mathrm{expanded}}(S'^{u}, S'^{v})$, expression (13) holds.

Therefore, $D_{\mathrm{digest}}(S'^{u}, S'^{v}) < D_{\mathrm{expanded}}(S'^{u}, S'^{v})$ holds in each of the cases [i], [ii-1], and [ii-2]. If $D_{\mathrm{digest}}(S'^{u}, S'^{v}) < D_{\mathrm{expanded}}(S'^{u}, S'^{v}) = (R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c) \le \varepsilon$, the pair is a similar candidate. This condition is equivalent to $R^{O_i}_u + R^{O_i}_v \le c$, so time-series data satisfying $R^{O_i}_u + R^{O_i}_v \le c$ are judged to be similar candidates.

On the other hand, if $\varepsilon < D_{\mathrm{expanded}}(S'^{u}, S'^{v}) = (R^{O_i}_u + R^{O_i}_v) \cdot (\varepsilon/c)$, the pair is not immediately a similar candidate. This condition is equivalent to $R^{O_i}_u + R^{O_i}_v > c$, so for time-series data satisfying $R^{O_i}_u + R^{O_i}_v > c$ the distance is calculated, and pairs whose distance is within a certain range are judged to be similar candidates. The Euclidean distance, for example, is used for this distance.
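Taken together, expressions (6) and (12) give a three-way decision per pair of region numbers. The sketch below is one possible way to express it; the function name `classify_pair` and the string labels are hypothetical and do not appear in the source.

```python
def classify_pair(r_u, r_v, c):
    """Decide how a pair is handled from its region numbers alone."""
    if abs(r_u - r_v) >= c + 1:
        # Expression (6): the pair is dissimilar and is discarded.
        return "dissimilar"
    if r_u + r_v <= c:
        # Expression (12): a similar candidate without any distance computation.
        return "candidate"
    # Otherwise the distance after dimension compression must be computed.
    return "compute_distance"


# Values from the FIG. 11 example: c = 5.5 and R_u = 2.
print(classify_pair(2, 3, 5.5))  # R_v = 3
print(classify_pair(2, 4, 5.5))  # R_v = 4
```

The two tests cannot conflict: since region numbers are positive, $R_u + R_v \le c$ implies $|R_u - R_v| < c + 1$.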

According to this embodiment, therefore, the center point determination unit 33 in the space index unit 3 determines the center points of the concentric hyperspheres in the multidimensional space, and the candidate search unit 34 divides the multidimensional space into concentric hypersphere structures centered on these center points. Because the division uses all axes of the multidimensional space, a more efficient division is achieved.

According to this embodiment, the clustering unit 333 clusters the time-series data and the center point calculation unit 334 determines the center points from the clustering result, so the center points can be placed appropriately.

According to this embodiment, the clustering unit 333 clusters only the time-series data sampled at random by the sample unit 332, and the center point calculation unit 334 determines the center points from the clustering result, so the center points can be calculated quickly.

According to this embodiment, the space dividing unit 342 divides the multidimensional space in increments of $\varepsilon/c$ ($c > 1$), a value smaller than the threshold $\varepsilon$ used to judge the similarity of time-series data. The regions are thus divided finely, and the candidate regions can be made small. As a result, time-series data belonging to the $R^{O_i}_u$-th and $R^{O_i}_v$-th regions from the center point $O_i$ can be treated as candidates for similar time-series data without calculating the distance after dimension compression, provided that $R^{O_i}_u + R^{O_i}_v \le c$.

In this embodiment, the Euclidean distance is used as the distance function. The invention is not limited to this, however, and any distance function satisfying the following four properties can be used.

1. $d(x, y) \ge 0$ (non-negativity)
2. $d(x, y) = 0 \Leftrightarrow x = y$ (identity of indiscernibles)
3. $d(x, y) = d(y, x)$ (symmetry)
4. $d(x, y) + d(y, z) \ge d(x, z)$ (triangle inequality)
Various modifications of this embodiment are conceivable; they are described below.

[Example 1]
The center point determination unit 33 determines the center points of the concentric hyperspheres in the multidimensional space at random and stores them in its internal memory. Because the center points are chosen at random, they can be computed quickly. The drawback is that a center point may be placed at an unsuitable position.
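One way to pick center points at random, as in Example 1, is sketched below. The source does not say from which distribution or range the points are drawn; sampling uniformly inside the bounding box of the compressed data, and the function name `random_centers`, are assumptions made only for this illustration.

```python
import random


def random_centers(points, s, seed=None):
    """Draw s center points uniformly from the bounding box of `points`."""
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[a] for p in points) for a in range(dims)]
    highs = [max(p[a] for p in points) for a in range(dims)]
    return [
        tuple(rng.uniform(lows[a], highs[a]) for a in range(dims))
        for _ in range(s)
    ]


compressed = [(0.0, 0.0), (2.0, 4.0), (1.0, -1.0)]  # toy compressed series
print(random_centers(compressed, 2, seed=0))
```

No clustering pass is needed, which is the speed advantage the example describes, but nothing prevents two centers from landing close together.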

[Example 2]
The sample unit 332 is not used, and the clustering unit 333 clusters all time-series data in the multidimensional space. This allows the center point calculation unit 334 to place the center points even more appropriately. The drawback is that clustering all time-series data takes more computation time.

[Example 3]
The space dividing unit 342 divides the multidimensional space from the center point in increments of the threshold $\varepsilon$. Compared with division in increments of $\varepsilon/c$, the candidate regions that can contain similar time-series data are limited to the region to which the time-series data belongs and the regions directly adjacent to it, so the number of candidate regions is smaller. The drawback is that the distance after dimension compression must be calculated for all time-series data in the candidate regions.
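The difference between the two division widths can be seen by listing the regions that may hold similar data for a given region number. The helper below is a hypothetical sketch based on expression (5); division in increments of $\varepsilon$ is treated as the case $c = 1$, which is an interpretation rather than something the source states.

```python
import math


def candidate_regions(r, c=1.0):
    """Region numbers q (q >= 1) with |r - q| < c + 1, as in expression (5)."""
    max_gap = math.ceil(c + 1) - 1  # largest integer strictly below c + 1
    return [q for q in range(r - max_gap, r + max_gap + 1) if q >= 1]


print(candidate_regions(3, 1.0))  # division in increments of eps
print(candidate_regions(3, 1.5))  # division in increments of eps / 1.5
```

With $c = 1$ only the region itself and its two neighbours remain, matching the description in Example 3; with larger $c$ more (but thinner) regions are candidates.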

FIG. 1 is a block diagram showing the configuration of a similar time-series data calculation apparatus according to one embodiment.
FIG. 2 is a diagram for explaining the CS structure.
FIG. 3 is a block diagram showing the configuration of the space index unit in the similar time-series data calculation apparatus.
FIG. 4 is a block diagram showing the configuration of the center point determination unit in the space index unit.
FIG. 5 is a block diagram showing the configuration of the candidate search unit in the space index unit.
FIG. 6 is a diagram showing a state in which several center points are gathered close together in the multidimensional space of the CS structure.
FIG. 7 is a diagram for explaining the regions used to judge dissimilar time-series data.
FIG. 8 is a diagram for explaining, for $D_{\mathrm{contractive}}(S'^{u}, S'^{v})$, the case where $S'^{u}$, $S'^{v}$, and $O_i$ are not on the same straight line.
FIG. 9 is a diagram for explaining, for $D_{\mathrm{contractive}}(S'^{u}, S'^{v})$, the case where $S'^{u}$, $S'^{v}$, and $O_i$ are on the same straight line and $O_i$ is inside the segment $S'^{u}S'^{v}$.
FIG. 10 is a diagram for explaining, for $D_{\mathrm{contractive}}(S'^{u}, S'^{v})$, the case where $S'^{u}$, $S'^{v}$, and $O_i$ are on the same straight line and $O_i$ is not inside the segment $S'^{u}S'^{v}$.
FIG. 11 is a diagram for explaining the regions used to judge similar time-series data.
FIG. 12 is a diagram for explaining, for $D_{\mathrm{expanded}}(S'^{u}, S'^{v})$, the case where $S'^{u}$, $S'^{v}$, and $O_i$ are not on the same straight line.
FIG. 13 is a diagram for explaining, for $D_{\mathrm{expanded}}(S'^{u}, S'^{v})$, the case where $S'^{u}$, $S'^{v}$, and $O_i$ are on the same straight line and $O_i$ is inside the segment $S'^{u}S'^{v}$.
FIG. 14 is a diagram for explaining, for $D_{\mathrm{expanded}}(S'^{u}, S'^{v})$, the case where $S'^{u}$, $S'^{v}$, and $O_i$ are on the same straight line and $O_i$ is not inside the segment $S'^{u}S'^{v}$.
FIG. 15 shows two time-series data on the left and, on the right, the distance between them when the Euclidean distance is used.
FIG. 16 is a diagram for explaining the grid structure.

Explanation of symbols

1 ... Similar time-series data calculation apparatus
2 ... Data receiving unit
3 ... Space index unit
4 ... Similarity determination unit
5 ... Calculation result transmission unit
31 ... Data receiving unit
32 ... Dimension compression unit
33 ... Center point determination unit
34 ... Candidate search unit
35 ... Data transmission unit
331 ... Data receiving unit
332 ... Sample unit
333 ... Clustering unit
334 ... Center point calculation unit
335 ... Data transmission unit
341 ... Data receiving unit
342 ... Space dividing unit
343 ... Candidate region determination unit
344 ... Candidate determination unit
345 ... Data transmission unit

Claims

1. A similar time-series data calculation apparatus comprising:
data receiving means for receiving time-series data that flow in sequentially and storing them in a first memory;
dimension compression means for dimensionally compressing the time-series data read from the first memory and storing the result in a second memory;
center point determination means for determining a center point of concentric hyperspheres in a multidimensional space and storing it in a third memory; and
candidate search means for dividing the multidimensional space into a concentric hypersphere structure centered on the center point read from the third memory, narrowing down candidates for similar combinations of the time-series data read from the second memory, and storing the candidates.

2. The similar time-series data calculation apparatus according to claim 1, wherein the center point determination means includes:
clustering means for clustering the time-series data after dimension compression and storing the result in a fifth memory; and
center point position calculation means for determining the position of the center point in the multidimensional space using the clustering result read from the fifth memory and storing it in the third memory.

3. The similar time-series data calculation apparatus according to claim 2, wherein the center point determination means further includes sample means for randomly selecting samples from the time-series data after dimension compression before the clustering.

4. The similar time-series data calculation apparatus according to claim 1, wherein the center point determination means includes center point position calculation means for randomly determining the position of the center point in the multidimensional space and storing it in the third memory.

5. The similar time-series data calculation apparatus according to claim 1, wherein the candidate search means includes:
space dividing means for dividing the multidimensional space from the center point into a concentric hypersphere structure in increments of a threshold for judging the similarity of time-series data;
candidate region determination means for determining, from the result of the division, candidate regions in which similar time-series data exist; and
candidate determination means for narrowing down similar candidates among the time-series data after dimension compression in the determined candidate regions and storing combinations of the similar candidates in a fourth memory.

6. The similar time-series data calculation apparatus according to claim 5, wherein the space dividing means divides the multidimensional space in increments of a value smaller than the threshold instead of the threshold.

7. A similar time-series data calculation method comprising the steps of:
receiving time-series data that flow in sequentially and storing them in a first memory;
dimensionally compressing the time-series data read from the first memory and storing the result in a second memory;
determining a center point of concentric hyperspheres in a multidimensional space and storing it in a third memory; and
dividing the multidimensional space into a concentric hypersphere structure centered on the center point read from the third memory, narrowing down candidates for similar combinations of the time-series data read from the second memory, and storing the candidates.

8. The similar time-series data calculation method according to claim 7, wherein the step of determining the center point includes the steps of:
clustering the time-series data after dimension compression and storing the result in a fifth memory; and
determining the position of the center point in the multidimensional space using the clustering result read from the fifth memory and storing it in the third memory.

9. The similar time-series data calculation method according to claim 8, wherein the step of determining the center point further includes a step of randomly selecting samples from the time-series data after dimension compression before the clustering.

10. The similar time-series data calculation method according to claim 7, wherein the step of determining the center point includes a step of randomly determining the position of the center point in the multidimensional space and storing it in the third memory.

11. The similar time-series data calculation method according to claim 7, wherein the step of narrowing down the candidates includes the steps of:
dividing the multidimensional space from the center point into a concentric hypersphere structure in increments of a threshold for judging the similarity of time-series data;
determining, from the result of the division, candidate regions in which similar time-series data exist; and
narrowing down similar candidates among the time-series data after dimension compression in the determined candidate regions and storing combinations of the similar candidates in a fourth memory.

12. A similar time-series data calculation program causing a computer to execute:
a process of receiving time-series data that flow in sequentially and storing them in a first memory;
a process of dimensionally compressing the time-series data read from the first memory and storing the result in a second memory;
a process of determining a center point of concentric hyperspheres in a multidimensional space and storing it in a third memory; and
a process of dividing the multidimensional space into a concentric hypersphere structure centered on the center point read from the third memory, narrowing down candidates for similar combinations of the time-series data read from the second memory, and storing the candidates.

13. The similar time-series data calculation program according to claim 12, wherein the process of determining the center point includes:
a process of clustering the time-series data after dimension compression and storing the result in a fifth memory; and
a process of determining the position of the center point in the multidimensional space using the clustering result read from the fifth memory and storing it in the third memory.

14. The similar time-series data calculation program according to claim 13, wherein the process of determining the center point further includes a process of randomly selecting samples from the time-series data after dimension compression, executed before the clustering.

15. The similar time-series data calculation program, wherein the process of determining the center point includes a process of randomly determining, for the time-series data after dimension compression, the position of the center point in the multidimensional space and storing it in the third memory.