JP2004046601A

JP2004046601A - Data processing apparatus and method for extracting a complex order pattern

Info

Publication number: JP2004046601A
Application number: JP2002204185A
Authority: JP
Inventors: Ririan Harada; 原田　リリアン
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-07-12
Filing date: 2002-07-12
Publication date: 2004-02-12

Abstract

[Problem] To efficiently extract, from given data, an order pattern specified by a complex query.
[Solution] When searching for a record sequence that matches the predicate sequence p[0], ..., p[3], the compatibility between predicates is analyzed in advance to determine the optimal window shift amounts, and within each window the predicates are checked in the direction opposite to the shift direction. First, when the value 42 is found not to satisfy p[1] at the position of window 21, window 21 is shifted by one record and checking resumes in window 22. Next, when the value 36 is found not to satisfy p[2], window 22 is shifted by three records, and at the position of window 23 all values are confirmed to satisfy their corresponding predicates.
[Selected Drawing] Fig. 2

Description

【０００１】
【発明の属する技術分野】
本発明は、与えられたデータから複雑な条件により指定された順序パターンを抽出するデータ処理装置および方法に関する。
【０００２】
【従来の技術および発明が解決しようとする課題】
今日の情報化社会では、大量の順序データをモニタしなければならない多くのアプリケーションが存在する。このようなデータおよびアプリケーションとしては、例えば、センサアプリケーション、ネットワークモニタリングやトラフィック管理におけるパフォーマンス測定、医療モニタリングのためのバイタルサイン（生命徴候）および処置、電気通信における呼の詳細に関するレコード、ウェブアプリケーションにおけるログレコードやクリックストリームが挙げられる。
【０００３】
ここで、センサデータに対する問い合わせと解析を行うことで物理的な世界をモニタするアプリケーションについて考えてみる。このようなアプリケーションの１つに、工場の倉庫に格納された物品を管理するアプリケーションがある。倉庫内の物品および壁には温度センサが貼り付けられ、床や天井にも温度センサが埋め込まれている。各センサは一定周期で測定温度を出力し、倉庫管理者はそのセンサデータを用いて物品が過熱状態になっていないことを確認する。この場合、アプリケーションにとっては以下のようなデータ検索のクエリが重要となる。
【０００４】
クエリ１：３つの連続する測定温度が３５度、３６度、および３７度であるようなセンサを検出せよ。
クエリ２：３回続けて２度を越える温度上昇があり、３５度未満の温度から４０度より高い温度に達したセンサを検出せよ。
クエリ３：３８度未満の温度から４０度と５０度の間の温度に上昇し、引き続いて２回の下落を経て３８度未満の温度に戻った時間的な変化パターンを検出せよ。
【０００５】
この温度モニタリングアプリケーションにおける対象パターンは、３つの連続する測定温度を見つければよいクエリ１のような非常に単純なものから、引き続く温度上昇や、倉庫管理者が注意すべきではあるが正確には指定できない異常なスパイクを見つけるクエリ２および３のようなより複雑なものまで変化する。絶対的な単一の測定値の代わりに、可能な測定値の範囲や増減する測定値の関係が指定されることもある。
【０００６】
より具体的に言えば、クエリ１、２、および３は、データストリームの中からそれぞれ以下のような述語（ｐｒｅｄｉｃａｔｅ　）ｐ［ｉ］で表されるパターンにマッチする温度属性値を有する複数のレコードｒを見つけるものである。
【０００７】

ここで、述語ｐ［ｉ］（ｒ）はパターン内のｉ番目のレコードが持つ属性値に関する条件を表し、ｒ．ｔｅｍｐｅｒａｔｕｒｅはレコードｒの温度属性値を表し、ｒ．ｐｒｅｖｉｏｕｓはデータストリームにおけるレコードｒの直前のレコードを表す。与えられるデータを左から右へ順に並べた場合、クエリ２および３のｒ．ｐｒｅｖｉｏｕｓはレコードｒの左側のレコードに対応し、そのレコードに対する条件を指定することでパターンを記述することができる。
【０００８】
クエリ１のパターン述語は常に定数を用いた等式である。したがって、Ｂｏｙｅｒ−ＭｏｏｒｅアルゴリズムやＫｎｕｔｈ−Ｍｏｒｒｉｓ−Ｐｒａｔｔアルゴリズムのような周知の文字列マッチングアルゴリズムを効率良く適用することができる。しかしながら、クエリ２およびクエリ３はより複雑であり、それらの述部は隣接レコードの属性値と定数を用いた不等式の組み合わせ（ｃｏｎｊｕｎｃｔｉｏｎ）である。文字列マッチングアルゴリズムは、クエリ１のように定数を用いた等式で表されるクエリの場合にのみ適用可能であるので、より複雑なこれらの２つのクエリには適用できない。
【０００９】
このようなクエリに適用できるマッチングアルゴリズムとしては、Ｓａｄｒｉ　ｅｔ　ａｌ．により提案されたＯＰＳ（Ｏｐｔｉｍｉｚｅｄ　Ｐａｔｔｅｒｎ　Ｓｅａｒｃｈ）アルゴリズムがある（Ｓａｄｒｉ，　Ｒ．，　Ｚａｉｏｌｏ，　Ｃ．，　Ｚａｒｋｅｓｈ，　Ａ．，　ａｎｄ　Ａｄｉｂｉ，　Ｊ．，　Ｏｐｔｉｍｉｚａｔｉｏｎ　ｏｆ　Ｓｅｑｕｅｎｃｅ　Ｑｕｅｒｉｅｓ　ｉｎ　Ｄａｔａｂａｓｅ　Ｓｙｓｔｅｍｓ，　Ｐｒｏｃｅｅｄｉｎｇｓ　ｏｆ　ｔｈｅ　Ｔｗｅｎｔｉｅｔｈ　ＡＣＭＳＩＧＭＯＤ−ＳＩＧＡＣＴ−ＳＩＧＡＲＴ　Ｓｙｍｐｏｓｉｕｍ　ｏｎ　Ｐｒｉｎｃｉｐｌｅｓ　ｏｆ　Ｄａｔａｂａｓｅ　Ｓｙｓｔｅｍｓ，　ｐｐ．７１−８１，　Ｍａｙ２００１）。このアルゴリズムでは、データストリーム上に設定されたウィンドウ内で、ｒ．ｐｒｅｖｉｏｕｓを用いて記述された述語を左から右に向かって（つまり、ｐ［０］、ｐ［１］、ｐ［２］、およびｐ［３］の順に）チェックしている。そして、ある位置で述語をチェックし終えるとウィンドウを別の位置にシフトして再びチェックを行うという操作を繰り返す。
【００１０】
しかしながら、この方法では、述語とマッチしなかったレコードを後続するウィンドウ位置において何回も再チェックしなければならないことが多く、必ずしも効率的なアルゴリズムとはいえない。
【００１１】
本発明の課題は、与えられたデータから複雑なクエリにより指定された順序パターンを効率良く抽出するデータ処理装置および方法を提供することである。
【００１２】
【課題を解決するための手段】
図１は、本発明の第１および第２のデータ処理装置の構成図である。図１のデータ処理装置は、入力手段１１、前処理手段１２、検索手段１３、および出力手段１４を備える。
【００１３】
本発明の第１のデータ処理装置は、複数の順序付けられたデータから、順序付けられたｍ個のデータからなるデータ列を抽出する。入力手段１１は、データ間の関係を用いてｍ個のデータをそれぞれ指定する順序付けられたｍ個の条件からなる条件列を入力する。前処理手段１２は、ｍ個の条件の間の両立性を解析して、データ抽出に用いる補助情報を生成する。検索手段１３は、順序付けられたデータ上で先頭のデータから末尾のデータに向かってデータ列を検索するとき、データが条件列の対応する条件を満たすか否かを検索方向とは逆の向きにチェックし、チェックされたデータが条件を満たさないとき上記補助情報を用いて次のチェック開始位置を決定する。そして、出力手段１４は、条件列のすべての条件を満たしたデータ列の情報を出力する。
【００１４】
入力手段１１により入力される条件列の各条件は、順序付けられたデータのうち任意の２つのデータ間の関係（値の大小関係等）を用いて記述され、ユーザは任意の関係を用いて複雑な順序パターンを指定することができる。この条件は、例えば、レコード間の関係を記述する述語に対応する。
【００１５】
前処理手段１２はあらかじめ複数の条件の間の両立性を解析することで補助情報を生成し、検索手段１３は、この補助情報を用いて次のチェック開始位置を決定することで、指定されたパターンとデータの間の不要なチェックをスキップしながら、パターンにマッチするデータ列を検索する。
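[0016]
Simulation confirmed that, in many cases, checking the conditions in the direction opposite to the search direction over the data requires fewer checks than checking in the same direction. The simulation results are described later.
[0017]
The second data processing apparatus of the present invention extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], where p[j] (j = {0, 1, ..., m-1}) is a predicate described by a conditional expression on the attribute values of a record r[i] (i = {0, 1, ..., n-1}). The input unit 11 inputs the pattern, and the preprocessing unit 12 analyzes the compatibility between predicate p[j] and predicate p[j1] (j1 = {0, 1, ..., j-1}) to generate auxiliary information used for pattern extraction. When searching the ordered data from record r[0] toward r[n-1] for a record sequence corresponding to the pattern, the search unit 13 checks whether the records satisfy the corresponding predicates in the pattern, proceeding from the last predicate p[m-1] toward the first predicate p[0]; when a checked record does not satisfy its predicate, it determines the next check start position using the auxiliary information. The output unit 14 then outputs information on the record sequences that satisfy all predicates of the pattern.
[0018]
Each predicate in the pattern input by the input unit 11 is described by a conditional expression involving the attribute values of records, so the user can specify a complex order pattern using arbitrary conditional expressions.
[0019]
The preprocessing unit 12 generates the auxiliary information by analyzing the compatibility among the predicates in advance, and the search unit 13 uses this auxiliary information to determine the next check start position, so that it searches for record sequences matching the pattern while skipping unnecessary checks between the specified pattern and the records.
[0020]
As in the first data processing apparatus, the search direction over the ordered data and the direction of predicate checking are opposite, which allows a more efficient search than when the two directions are the same.
[0021]
The input unit 11 of Fig. 1 corresponds, for example, to the input device 83 of Fig. 14 described later; the preprocessing unit 12 and the search unit 13 correspond, for example, to the combination of the CPU (central processing unit) 81 and the memory 82 of Fig. 14; and the output unit 14 corresponds, for example, to the output device 84 of Fig. 14.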
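[0022]
[Embodiments of the Invention]
Embodiments of the present invention are described in detail below with reference to the drawings.
The data processing apparatus of this embodiment is configured using, for example, a computer, and detects complex patterns of temporal change in data given either as a data stream that arrives continuously and without end or as finite stored data.
[0023]
To skip unnecessary checks of pattern predicates against the stream data, the data processing apparatus first preprocesses the complex pattern at compile time and generates auxiliary information. At run time, it slides a window from left to right (from the head of the stream toward its tail) and uses the auxiliary information to search the stream efficiently. Within the window, the data is checked against the pattern predicates from right to left.
[0024]
In this method, the logical relationships between the pattern predicates (the auxiliary information) are extracted as part of query compilation, so that when a partial match between the stream and the pattern predicates is obtained, the apparatus can infer how the window should be shifted. This checking method lengthens the window shifts and minimizes the number of times the same data is rechecked.
[0025]
When predicates are checked from right to left within the window, the information in r.previous cannot be used, so the query must be rewritten in another form. The pattern predicates of Queries 2 and 3 above are rewritten, for example, as follows.
[0026]

r.next denotes the record immediately after record r in the data stream (the record to the right of r), and a pattern can be described by specifying conditions on that record. With a pattern rewritten using r.next, the predicates can be checked from right to left (that is, in the order p[3], p[2], p[1], p[0]).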
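The rewritten form can be sketched in Python as follows. The four Query 3 predicates are the ones quoted in the walk-through that follows in this description; representing each record by the pair (its temperature, the temperature of the record to its right) and the helper name `first_mismatch` are assumptions made for illustration.

```python
# Query 3 rewritten with r.next. Each predicate receives the record's
# temperature t and the temperature nt of the record to its right.
p = [
    lambda t, nt: t < 38 and t < nt,        # p[0]
    lambda t, nt: 40 < t < 50 and t > nt,   # p[1]
    lambda t, nt: t > nt,                   # p[2]
    lambda t, nt: t < 38,                   # p[3]
]

def first_mismatch(temps, i, preds=p):
    """Check the window that starts at position i from right to left.

    Returns (j, checks): j is the index of the first failing predicate,
    or -1 if every predicate holds; checks counts predicate evaluations.
    """
    m = len(preds)
    checks = 0
    for j in range(m - 1, -1, -1):
        t = temps[i + j]
        # The last stream record has no right neighbour. In this pattern it
        # can only meet p[3], which ignores nt, so None is passed for it.
        nt = temps[i + j + 1] if i + j + 1 < len(temps) else None
        checks += 1
        if not preds[j](t, nt):
            return j, checks
    return -1, checks

stream = [36, 42, 47, 36, 37, 42, 40, 37]
print(first_mismatch(stream, 0))  # window 36, 42, 47, 36 -> (1, 3)
print(first_mismatch(stream, 4))  # window 37, 42, 40, 37 -> (-1, 4)
```

The first call stops at p[1] after three checks, and the second confirms a full match after four.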
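[0027]
The processing is described below using Query 3 as an example. For simplicity, the data stream consists of eight records whose temperature attribute values are 36, 42, 47, 36, 37, 42, 40, and 37.
[0028]
As shown in Fig. 2, in the processing of the first window 21, the value 36 satisfies p[3](r): r.temperature < 38 because 36 < 38, and the value 47 satisfies p[2](r): r.temperature > r.next.temperature because 47 > 36. However, the next value 42 does not satisfy p[1](r): 40 < r.temperature < 50 AND r.temperature > r.next.temperature. Although 40 < 42 < 50 holds, 42 < 47, so 42 > 47 does not hold.
[0029]
Thus, after p[3] and p[2] match the data, the first mismatch occurs at p[1], and window 21 must be shifted to resume checking the pattern predicates against the stream data. At this point it is known that the values 36 and 47 satisfy p[3] and p[2], respectively, and that the value 42 does not satisfy p[1].
[0030]
Analyzing the pattern predicates shows that p[3] and p[2] do not contradict each other, so a value that satisfies p[3] may also satisfy p[2]. Likewise, a value that satisfies p[2] may also satisfy p[1], and a value that does not satisfy p[1] may satisfy p[0]. Window 21 is therefore shifted by one record so that, in the new window 22, the values 36, 47, and 42 line up with p[2], p[1], and p[0], respectively.
[0031]
In the processing of this second window 22, the value 37 satisfies p[3] because 37 < 38, but the value 36 does not satisfy p[2] because 36 < 37, so 36 > 37 does not hold. Since the mismatch occurred at p[2], window 22 must be shifted by some amount to resume checking.
[0032]
As noted above, a value that satisfies p[3] may also satisfy p[2]. However, a value that does not satisfy p[2](r): r.temperature > r.next.temperature cannot satisfy p[1](r): 40 < r.temperature < 50 AND r.temperature > r.next.temperature either. The value 36 is already known not to satisfy p[2], and hence not p[1], so shifting window 22 by one record to line the value 36 up with p[1] clearly cannot succeed.
[0033]
If window 22 were shifted by two records, a value that satisfied p[3](r): r.temperature < 38 would also have to satisfy p[1](r): 40 < r.temperature < 50 AND r.temperature > r.next.temperature. This is impossible because p[3] and p[1] contradict each other, so the two-record shift is also discarded.
[0034]
Next, a three-record shift is considered. It is viable if a value that satisfied p[3] may also satisfy p[0], and this is the case. Window 22 is therefore shifted by three records as shown in Fig. 2, and in the processing of the third window 23 the values 37, 40, 42, and 37 are confirmed to satisfy p[3], p[2], p[1], and p[0], respectively.
[0035]
In this example, knowledge of the relationships between the pattern predicates allowed the given pattern to be detected with three window operations. The predicates were checked against the stream data nine times.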
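The walk-through above can be replayed with a short Python sketch. The two shift amounts (one record after a mismatch at p[1], three records after a mismatch at p[2]) are taken from the text; the function name `walk_through`, the record representation, and stopping at the first match are assumptions for illustration.

```python
# Query 3 predicates in r.next form: t is the record's temperature,
# nt the temperature of the record to its right.
preds = [
    lambda t, nt: t < 38 and t < nt,        # p[0]
    lambda t, nt: 40 < t < 50 and t > nt,   # p[1]
    lambda t, nt: t > nt,                   # p[2]
    lambda t, nt: t < 38,                   # p[3]
]

# Shift amounts stated in the walk-through; other entries are not needed
# for this particular stream and are deliberately left out.
shift_from_text = {1: 1, 2: 3}

def walk_through(temps, preds, shift):
    """Return (match position, visited window positions, number of checks)."""
    m, n = len(preds), len(temps)
    i, checks, windows = 0, 0, []
    while i + m <= n:
        windows.append(i)
        j = m - 1
        while j >= 0:                       # right-to-left check
            checks += 1
            t = temps[i + j]
            nt = temps[i + j + 1] if i + j + 1 < n else None
            if not preds[j](t, nt):
                break
            j -= 1
        if j < 0:                           # every predicate held
            return i, windows, checks
        i += shift[j]                       # slide by the precomputed amount
    return None, windows, checks

result = walk_through([36, 42, 47, 36, 37, 42, 40, 37], preds, shift_from_text)
print(result)  # (4, [0, 1, 4], 9)
```

The three visited positions correspond to windows 21, 22, and 23, and the count of nine checks agrees with the figure given in the text.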
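[0036]
In the basic processing described above, each temperature attribute value is checked against the inequalities of a conjunctive predicate as a whole, but the number of checks can be reduced further by introducing extended processing that handles each inequality separately. In this example, inequalities involving the value of the next record and inequalities involving constants are handled separately. The former express the correlation between the attribute values of two records (the shape of the change) and are called "relation predicates"; the latter express a range of attribute values and are called "range predicates".
[0037]
In the processing of the first window 21, as described above, the value 42 does not satisfy p[1](r): 40 < r.temperature < 50 AND r.temperature > r.next.temperature. Looked at more closely, this predicate is the logical AND of a range predicate (40 < r.temperature < 50) and a relation predicate (r.temperature > r.next.temperature). The value 42 satisfies the range predicate (40 < 42 < 50) but not the relation predicate (42 < 47, so 42 > 47 does not hold).
[0038]
Because the value 42 satisfies the range predicate of p[1], it cannot satisfy p[0](r): r.temperature < 38 AND r.temperature < r.next.temperature. A window shift larger than the one-record shift used in the basic processing, which does not split the conjunctive predicate p[1], is therefore possible. In addition, p[3] and p[0] do not contradict each other, so, as shown in Fig. 3, window 21 is shifted by three records.
[0039]
In the processing of the second window 31, the value 40 is found not to satisfy p[3] because 40 > 38, so 40 < 38 does not hold, and window 31 is shifted by one record. Then, in the processing of the third window 32, the four values are confirmed to satisfy their corresponding predicates. In this example, analyzing the inequalities of the predicates separately reduced the number of checks to eight.
[0040]
In the processing of Figs. 2 and 3, when the stream and the pattern predicates match partially, the pattern shifts for which matching may succeed are inferred by exploiting the interdependencies between the predicates. This reduces the number of comparisons between predicates and stream data within the window and, as a result, speeds up the search.
[0041]
For ease of understanding, this example showed the window slide length being inferred as part of the processing when a mismatch occurs. However, the interdependencies between predicates are independent of the stream data, so they can be computed in advance at query compile time.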
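[0042]
The following describes in detail a method in which auxiliary information is generated once, as part of query compilation, by exploiting the logical relationships between pattern predicates, and is then used repeatedly to slide the window efficiently when searching the stream data for the pattern. First, the necessary terms are defined.
[0043]
Data stream: a stream is a sequence of n records r[0], r[1], ..., r[n-1], and each r[i] has k attributes a[0], a[1], ..., a[k-1], such as a sensor ID, the temperature produced by a temperature sensor, and the time at which it was produced.
[0044]
Pattern and predicates: a pattern is a sequence of m predicates p[0](r), ..., p[m-1](r), and each p[j](r) (j = {0, 1, ..., m-1}) contains conditional expressions (inequalities or equalities) between an attribute of stream record r and a constant, or between an attribute of record r and attributes of other records in the data stream.
[0045]
Such a conditional expression is generally written in the form (r[i].a[q] OP F). r[i].a[q] denotes the value of the q-th attribute (q = {0, 1, ..., k-1}) of the i-th record (i = {0, 1, ..., n-1}), and F denotes an arbitrary function of constants and attributes r[i1].a[q1], where i1 is any of 0, 1, ..., n-1 other than i, and q1 = {0, 1, ..., k-1}. OP is any operator (OP ∈ {<, ≤, =, >, ≥, ≠}).
[0046]
For example, let r.temperature be the temperature attribute of record r, r.next.temperature the temperature attribute of the adjacent record r.next, and C a real constant. The following conditional expressions can then be specified.
[0047]
(r.temperature OP C)
(r.temperature OP r.next.temperature + C)
A conjunctive predicate consists of multiple conditional expressions connected by the logical AND.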
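[0048]
Relationships between pattern predicates: all logical relationships between the pattern predicates are expressed using a positive precondition logic matrix θ and a negative precondition logic matrix φ. The size of these matrices equals the number m of pattern predicates, and their elements θ[j,k] and φ[j,k] are defined as follows.
[0049]
[Equation 1]

[0050]
Here, p[j] with an overbar (written "p[j] bar" below) denotes the negation of p[j], and
[0051]
[External character 1]

[0052]
U indicates that it is unknown whether p[j] ⇒ p[k] or p[j] ⇒ p[k] bar.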
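The defining expressions (Equation 1) are not reproduced in this text. Judging from the surrounding description, θ[j,k] records what p[j] being true implies about p[k], and φ[j,k] records what p[j] being false implies about p[k], each taking the value 1, 0, or U. The sketch below fills both matrices for the Query 3 predicates by brute force over a finite grid of integer temperatures. The grid, the helper names, and the brute-force approach itself are assumptions for illustration, not the method of the patent.

```python
from itertools import product

U = "U"  # "unknown" truth value

def implication(premise, conclusion, domain):
    """1 if premise => conclusion, 0 if premise => not conclusion, else U.

    The judgement is made over every (t, nt) pair in the finite domain.
    """
    outcomes = {conclusion(t, nt) for t, nt in domain if premise(t, nt)}
    if outcomes == {True}:
        return 1
    if outcomes <= {False}:      # also covers a premise nothing satisfies
        return 0
    return U

def precondition_matrices(preds, domain):
    """Return (theta, phi) as m x m nested lists."""
    m = len(preds)
    theta = [[implication(preds[j], preds[k], domain) for k in range(m)]
             for j in range(m)]
    phi = [[implication(lambda t, nt, pj=preds[j]: not pj(t, nt),
                        preds[k], domain) for k in range(m)]
           for j in range(m)]
    return theta, phi

# Query 3 in r.next form: t is the record's temperature, nt its right neighbour's.
p = [
    lambda t, nt: t < 38 and t < nt,        # p[0]
    lambda t, nt: 40 < t < 50 and t > nt,   # p[1]
    lambda t, nt: t > nt,                   # p[2]
    lambda t, nt: t < 38,                   # p[3]
]
domain = list(product(range(0, 61), repeat=2))  # assumed temperature grid
theta, phi = precondition_matrices(p, domain)
```

With these inputs, theta[3][1] and phi[2][1] come out as 0, which matches the two contradictions used in the Query 3 walk-through (p[3] excludes p[1], and a record failing p[2] also fails p[1]).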
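In the pattern detection method of this embodiment, the stream is scanned using a window whose size equals the number m of pattern predicates. As shown in Fig. 4, when window 41 covers records r[i], ..., r[i+m-1], the position of window 41 is denoted by i.
[0053]
The stream records in window 41 are checked against the pattern predicates from right to left. If predicate p[m-1] holds for record r[i+m-1], record r[i+m-2] is checked next against predicate p[m-2]. If predicate p[m-2] holds for record r[i+m-2], record r[i+m-3] is then checked against predicate p[m-3], and so on.
[0054]
When predicate p[j] is found not to hold for record r[i+j], matching fails for the current window. The current window 41 is then shifted to the right by the length given by shift[j], and the stream records in window 42 at position i + shift[j] are checked in the same way as in window 41. shift[j] is computed in advance as auxiliary information.
[0055]
In this method, to find all stream records that satisfy the sequence of m pattern predicates, the window position is first initialized to i = 0, which aligns the left edge of the window with the first stream record r[0]. The window is slid repeatedly until its right edge passes the right end of the data stream. Such a pattern detection algorithm can be written, for example, as follows.
[0056]

Here, p[j](r[i]) denotes testing whether p[j] holds for the i-th record r[i] of the stream: p[j](r[i]) = 1 if r[i] satisfies p[j], and p[j](r[i]) = 0 otherwise.
[0057]
j < 0 occurs when the search pattern has been found in the window at position i. In that case find(i) outputs the stream records r[i], ..., r[i+m-1] that satisfy the whole sequence of pattern predicates p[0], ..., p[m-1]. The method of computing the distances stored in the shift array shift[j] is described later.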
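The algorithm listing itself is not reproduced in this text. The following Python sketch follows the prose description (window position i, right-to-left index j, slide by shift[j] on a mismatch, report the window when j < 0). The function name `r2l_search`, the one-record slide after a full match, and the shift table used in the demonstration are assumptions; only shift[1] = 1 and shift[2] = 3 are stated in the Query 3 walk-through.

```python
def r2l_search(records, preds, shift, match_shift=1):
    """Slide a window of size m left to right, checking predicates right to left.

    preds[j](records, x) tests predicate p[j] on the record at position x.
    Returns (list of match positions, number of predicate checks).
    """
    n, m = len(records), len(preds)
    matches, checks, i = [], 0, 0
    while i + m <= n:                 # stop once the window leaves the stream
        j = m - 1
        while j >= 0:
            checks += 1
            if not preds[j](records, i + j):
                break
            j -= 1
        if j < 0:
            matches.append(i)         # plays the role of find(i)
            i += match_shift          # assumed: slide by one after a match
        else:
            i += shift[j]
    return matches, checks

# Query 3 in r.next form, written against list positions.
query3 = [
    lambda r, x: r[x] < 38 and r[x] < r[x + 1],        # p[0]
    lambda r, x: 40 < r[x] < 50 and r[x] > r[x + 1],   # p[1]
    lambda r, x: r[x] > r[x + 1],                      # p[2]
    lambda r, x: r[x] < 38,                            # p[3]
]
stream = [36, 42, 47, 36, 37, 42, 40, 37]

one_by_one = r2l_search(stream, query3, shift=[1, 1, 1, 1])
with_table = r2l_search(stream, query3, shift=[3, 1, 3, 1])  # assumed table
print(one_by_one)  # ([4], 11)
print(with_table)  # ([4], 9)
```

Both runs report the match at position 4; sliding by one record every time costs 11 checks on this stream, while the shift table needs 9.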
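[0058]
If shift[j] = 1, so that the window slides by one position every time regardless of which predicate p[j] fails (and regardless of where a full match of the pattern is found), detecting all stream records that satisfy the sequence of m pattern predicates takes O(nm) time, where n is the total number of records.
[0059]
To reduce this large number of checks, the window is slid by more than one position. In other words, when a mismatch occurs at predicate p[j], shift[j] is chosen so that the window slides as far as possible without missing any matching candidate.
[0060]
From Fig. 4, when a mismatch occurs at predicate p[j] in the window at position i, predicates p[j+1] through p[m-1] of the search pattern are known to hold for stream records r[i+j+1] through r[i+m-1], respectively, while predicate p[j] does not hold for stream record r[i+j].
[0061]
When the window at position i is shifted right by k positions to the new position (i+k), there are three possible cases for how the new window overlaps the stream records r[i+j], ..., r[i+m-1] that were checked by predicates p[j], ..., p[m-1] in the previous window: full overlap, partial overlap, and no overlap.
[0062]
Fig. 5 shows the case of full overlap. When 0 < k ≤ j, window 51 is shifted to a position (i+k) that is to the left of, or exactly at, record r[i+j], where the mismatch with predicate p[j] occurred. The new window 51 therefore contains the whole sequence of records r[i+j], ..., r[i+m-1] already checked in the previous window 41.
[0063]
In window 51 at position (i+k), records r[i+j], ..., r[i+m-1] line up with predicates p[j-k], ..., p[m-k-1]. To predict whether these predicates can hold for the corresponding records, the results of checking predicates p[j], ..., p[m-1] against these records in the previous window 41 can be used.
[0064]
A k-position shift for which the pattern search may succeed is therefore computed by analyzing the logical relationship between two subsequences: the predicate subsequence p[j], ..., p[m-1] that lined up with records r[i+j], ..., r[i+m-1] in window 41, and the predicate subsequence p[j-k], ..., p[m-k-1] that lines up with those records in window 51. If the following relationship holds between these predicate sequences, a k-position shift may produce a window containing stream records that satisfy the pattern search.
[0065]

Using θ and φ introduced above, this relationship can be written as follows. First, α[j,k] is defined by the following expression.
[0066]

Here, U∧1 = U, U∧0 = 0, and U∧U = U. If α[j,k] = 0, the window shifted by k positions cannot contain stream records matching the search pattern; if α[j,k] = 1, it contains matching records; and if α[j,k] = U, it may contain matching records.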
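The expressions for the relationship and for α[j,k] are not reproduced in this text. One reading that is consistent with the surrounding description (the failed predicate p[j] contributes through φ, and the satisfied predicates p[j+1], ..., p[m-1] contribute through θ, each paired with the predicate it meets after a shift of k) would be the following. It is a reconstruction offered as an assumption, not the original formula.

```latex
\alpha[j,k] \;=\; \varphi[j,\,j-k] \;\wedge\; \bigwedge_{q=j+1}^{m-1} \theta[q,\,q-k],
\qquad 0 < k \le j
```

Under this reading, α[j,k] is 0 as soon as one factor is 0, is 1 only when every factor is 1, and is U otherwise, in line with the rules U∧1 = U, U∧0 = 0, and U∧U = U.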
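[0067]
Next, Fig. 6 shows the case of partial overlap. When j < k < m, window 41 is shifted to a position (i+k) to the right of record r[i+j], where the mismatch with predicate p[j] occurred. The new window 61 therefore contains the subsequence r[i+k], ..., r[i+m-1] of the records already checked in the previous window 41.
[0068]
In window 61 at position (i+k), records r[i+k], ..., r[i+m-1] line up with predicates p[0], ..., p[m-k-1]. To predict whether these predicates can hold for the corresponding records, the results of checking predicates p[k], ..., p[m-1] against these records in the previous window 41 can be used.
[0069]
A k-position shift for which the pattern search may succeed is therefore computed by analyzing the logical relationship between two subsequences: the predicate subsequence p[k], ..., p[m-1] that lined up with records r[i+k], ..., r[i+m-1] in window 41, and the predicate subsequence p[0], ..., p[m-k-1] that lines up with those records in window 61. If the following relationship holds between these predicate sequences, a k-position shift may produce a window containing stream records that satisfy the pattern search.
[0070]

Here, β[j,k] is defined using θ and φ by the following expression.
[0071]

If β[j,k] = 0, the window shifted by k positions cannot contain stream records matching the search pattern; if β[j,k] = 1, it contains matching records; and if β[j,k] = U, it may contain matching records.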
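The expression for β[j,k] is likewise not reproduced in this text. In the partial-overlap case only records r[i+k], ..., r[i+m-1] remain in the new window, and each of them is known to satisfy its earlier predicate, so one plausible reconstruction (an assumption, not the original formula) uses θ alone:

```latex
\beta[j,k] \;=\; \bigwedge_{q=k}^{m-1} \theta[q,\,q-k],
\qquad j < k < m
```

Here the record that satisfied p[q] in the old window meets p[q-k] in the new one. The text says β is defined "using θ and φ", so the original may contain a φ term that this reading omits.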
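[0072]
Next, Fig. 7 shows the case of no overlap. When k ≥ m, window 41 is shifted to a position (i+k) to the right of its rightmost record r[i+m-1]. In particular, when k = m, the first record position in the new window 71 coincides with the record position immediately after the previous window 41. The new window 71 therefore contains none of the records r[i+j], ..., r[i+m-1] already checked in the previous window 41.
[0073]
Since none of the stream records in the new window 71 has been checked yet, there is no basis for ruling them out as candidates for matching the search pattern.
[0074]
From these three cases, a window shift of k positions may slide the window to a position containing stream records that satisfy the pattern predicates when any of the following conditions holds.
(a) α[j,k] ≠ 0 for 0 < k ≤ j
(b) β[j,k] ≠ 0 for j < k < m
(c) k ≥ m
Because the search must not miss any matching candidate, shift[j] is computed as the smallest rightward slide distance for which a match with the pattern is possible.
[0075]
First, to find the rightmost subsequence p[j-k], p[j-k+1], ..., p[m-k-1], the smallest k with α[j,k] ≠ 0 is sought. If no such k exists, then, to find the longest subsequence p[0], ..., p[m-k-1], the smallest k with β[j,k] ≠ 0 is sought. If no such k exists either, every shift with k < m is known to fail. In that case shift[j] = m is set, and the window is shifted right so that the first record position in the new window coincides with the record position immediately after the previous window. This way of determining shift[j] is summarized as follows.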
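The selection rule just described (smallest k satisfying (a), otherwise smallest k satisfying (b), otherwise m) can be sketched directly in Python. The sketch treats two predicates as compatible when at least one point of a finite grid of integer temperatures satisfies both, which corresponds to a matrix entry other than 0. The grid, the helper names, and this way of deciding compatibility are assumptions for illustration; the O(m) method of the patent is different.

```python
from itertools import product

def compatible(known, candidate, domain):
    """False only if nothing that satisfies `known` can satisfy `candidate`."""
    return any(candidate(*v) for v in domain if known(*v))

def shift_table(preds, domain):
    """shift[j] = smallest slide k that no known check result rules out."""
    m = len(preds)
    table = []
    for j in range(m):
        # What is known about r[i+j]: it failed p[j].
        failed = lambda *v, pj=preds[j]: not pj(*v)
        chosen = m                                   # case (c): no overlap
        for k in range(1, m):
            ok = True
            if k <= j:                               # case (a): r[i+j] meets p[j-k]
                ok = compatible(failed, preds[j - k], domain)
            # Records that satisfied p[q] meet p[q-k] after the slide.
            ok = ok and all(compatible(preds[q], preds[q - k], domain)
                            for q in range(max(j + 1, k), m))
            if ok:
                chosen = k
                break
        table.append(chosen)
    return table

# Query 3 in r.next form: t is the record's temperature, nt its right neighbour's.
p = [
    lambda t, nt: t < 38 and t < nt,        # p[0]
    lambda t, nt: 40 < t < 50 and t > nt,   # p[1]
    lambda t, nt: t > nt,                   # p[2]
    lambda t, nt: t < 38,                   # p[3]
]
domain = list(product(range(0, 61), repeat=2))  # assumed temperature grid
shifts = shift_table(p, domain)
print(shifts)  # [3, 1, 3, 1]
```

The entries for j = 1 and j = 2 agree with the one-record and three-record shifts used in the Query 3 walk-through.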
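[0076]
[Equation 2]
$$
\mathrm{shift}[j] =
\begin{cases}
\min\{\,k \mid \alpha[j,k] \neq 0,\ 0 < k \le j\,\} & \text{if such a } k \text{ exists} \\
\min\{\,k \mid \beta[j,k] \neq 0,\ j < k < m\,\} & \text{otherwise, if such a } k \text{ exists} \\
m & \text{otherwise}
\end{cases}
$$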
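[0077]
A straightforward way of computing the distances stored in the shift array shift[j] requires α[j,k] and β[j,k] to be computed first for all j and k, which is computationally expensive. The following introduces simple program code that obtains the shift array efficiently with O(m) computation.
[0078]
First, the elements of the auxiliary array compati[i], defined by the following expression, are computed.

As shown in Fig. 8, compati[i] at position i is the maximum length len of a pattern subsequence (predicate sequence) p[i-len+1], ..., p[i] whose right end is at position i and which is compatible with the rightmost subsequence of length len, p[m-len], ..., p[m-1].
[0079]
The values of compati[i] are computed from right to left (from i = m-1 toward i = 0). The basic idea is to reuse each computed compati[i] as far as possible when computing the values to its left (for smaller i). This idea is explained with a very simple pattern consisting of the following four pattern predicates.
[0080]
p[3]: r.level > 10
p[2]: r.level > 5
p[1]: r.level > 3
p[0]: r.level > 1
In this case, p[i] ⇒ p[i] for every p[i], so clearly compati[3] = 4 (corresponding to line 1 of the code shown below). Computing compati[2] requires checking the relationships between p[3] and p[2], p[2] and p[1], and p[1] and p[0]. As shown in Fig. 9, p[3] ⇒ p[2], p[2] ⇒ p[1], and p[1] ⇒ p[0], so compati[2] = 3 (corresponding to lines 7-12 of the code shown below).
[0081]
Since p[3] ⇒ p[2] and p[2] ⇒ p[1], it follows that p[3] ⇒ p[1]. Likewise, since p[2] ⇒ p[1] and p[1] ⇒ p[0], it follows that p[2] ⇒ p[0]. The conclusion compati[1] = 2 shown in Fig. 10 can therefore be reached without directly checking the relationships between p[3] and p[1] and between p[2] and p[0] (corresponding to lines 4-5 of the code shown below).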
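To make the definition of compati[i] concrete, the sketch below evaluates it directly for the four-predicate example, trying every length from longest to shortest. This is a plain quadratic evaluation of the definition, not the efficient listing the text refers to, and the finite grid of level values, the helper names, and the "at least one common point" test for compatibility are assumptions for illustration.

```python
def compati_array(preds, domain):
    """compati[i] = largest len such that p[i-len+1..i] is compatible,
    position by position, with the rightmost subsequence p[m-len..m-1]."""
    m = len(preds)

    def compatible(known, candidate):
        # False only if no value satisfying `known` also satisfies `candidate`.
        return any(candidate(v) for v in domain if known(v))

    result = []
    for i in range(m):
        best = 0
        for length in range(i + 1, 0, -1):           # longest candidate first
            if all(compatible(preds[m - length + t], preds[i - length + 1 + t])
                   for t in range(length)):
                best = length
                break
        result.append(best)
    return result

# The example pattern: p[0]: level > 1, p[1]: level > 3,
# p[2]: level > 5, p[3]: level > 10.
levels = [
    lambda v: v > 1,
    lambda v: v > 3,
    lambda v: v > 5,
    lambda v: v > 10,
]
compati = compati_array(levels, range(0, 21))  # assumed grid of level values
print(compati)  # [1, 2, 3, 4]
```

The values for i = 3, 2, and 1 reproduce compati[3] = 4, compati[2] = 3, and compati[1] = 2 from the explanation above.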
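[0082]
Simple program code for computing the values of compati[i] efficiently could, for example, be as follows.

Once the array compati[i] has been determined, the shift array is computed. As the following program code shows, the shift array is first filled with the value for the no-overlap case; then, where possible, the values for the partial-overlap case are stored in the shift array using compati[i]; and finally the values for the full-overlap case are stored.
[0083]

In lines 1-2 of this shift array code, m is entered as the shift value corresponding to the no-overlap case. As described above, this shift value is used when no subsequence is compatible with the rightmost subsequence of the pattern predicates, so, as shown in Fig. 11, the window is shifted so that the new window contains no pattern predicate overlapping the previous window.
[0084]
In lines 3-8, the shift values for the partial-overlap case are set. In this case, as shown in Fig. 12, the smallest shift value is computed such that the longest subsequence p[0], ..., p[i] at the left end of the pattern predicates is compatible with the subsequence p[m-1-i], ..., p[m-1] at the right end.
[0085]
The last pattern predicate of this leftmost subsequence p[0], ..., p[i] is p[i], and compati[i] = i+1 at that position i. Therefore, given the compatibility information (compati[i] = i+1) stored in the array compati[i], the window is shifted so that the leftmost subsequence p[0], ..., p[i] overlaps the compatible rightmost subsequence p[m-1-i], ..., p[m-1].
[0086]
Finally, in lines 9-11, the shift values for the full-overlap case are set. In this case, as shown in Fig. 13, the smallest shift value is computed such that the rightmost subsequence p[m-compati[i]], ..., p[m-1] of the pattern predicates is compatible with the subsequence p[i-compati[i]+1], ..., p[i] of length compati[i], and p[m-compati[i]-1] bar is compatible with the pattern predicate p[i-compati[i]].
[0087]
As already checked when compati[i] was computed, p[m-compati[i]-1] ⇒ p[i-compati[i]] bar (θ[m-compati[i]-1, i-compati[i]] = 0) is known to hold. If, in addition, p[m-compati[i]-1] bar ⇒ p[i-compati[i]] (φ[m-compati[i]-1, i-compati[i]] ≠ 0) holds, the two conditions above are satisfied. Therefore shift[m-compati[i]-1] = m-1-i can be set.
[0088]
When several values of i share the same compati[i], the requirement to find the smallest shift value, so that the search misses no matching candidate, means that the rightmost i (the largest i) must be the one finally entered in the shift array as shift[m-compati[i]-1] = m-1-i. The array compati[i] is therefore scanned from left to right (from i = 0 toward i = m-2).
[0089]
The method of computing shift[j] in the basic processing above does not consider the multiple inequalities of a conjunctive predicate separately when analyzing the compatibility between pattern predicates. Handling them separately can, as described earlier, yield larger shifts.
[0090]
In the extended processing, a conjunctive predicate p[j] is therefore expressed as the combination of a "range predicate p1[j]", containing conditional expressions with constants, and a "relation predicate p2[j]", containing conditional expressions with the values of adjacent records, and is written p[j] = p1[j] ∧ p2[j]. In this case too, shift[j] can be obtained by a straightforward extension of the basic computation. The main differences from the basic processing are as follows.
[0091]
The basic processing considers only p[j] bar, whereas the extended processing considers three cases: p1[j] bar ∧ p2[j], p1[j] ∧ p2[j] bar, and p1[j] bar ∧ p2[j] bar. According to these three cases, φ[j, j-k] above is decomposed into φ1[j, j-k], φ2[j, j-k], and φ12[j, j-k], and α[j, j-k] is decomposed into α1[j, j-k], α2[j, j-k], and α12[j, j-k].
[0092]
Accordingly, shift[j] is also precomputed separately as shift1[j], shift2[j], and shift12[j]. When a mismatch occurs at predicate p[j], the window slide distance is selected from the three shift values shift1[j], shift2[j], and shift12[j] according to whether the mismatch was caused by the range predicate only, the relation predicate only, or both.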
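A minimal sketch of the extended processing for Query 3 follows. Each predicate is split into a range part and a relation part, three shift tables are precomputed for the three failure cases, and the search picks a table according to which part failed. The brute-force compatibility test over a grid of integer temperatures, all helper names, and stopping at the first match are assumptions for illustration.

```python
from itertools import product

TRUE = lambda t, nt: True          # placeholder for a missing part

# Query 3 split into (range predicate p1[j], relation predicate p2[j]);
# t is the record's temperature, nt the temperature of its right neighbour.
split = [
    (lambda t, nt: t < 38,      lambda t, nt: t < nt),   # p[0]
    (lambda t, nt: 40 < t < 50, lambda t, nt: t > nt),   # p[1]
    (TRUE,                      lambda t, nt: t > nt),   # p[2]
    (lambda t, nt: t < 38,      TRUE),                   # p[3]
]
m = len(split)
domain = list(product(range(0, 61), repeat=2))           # assumed grid

def whole(j):
    rng, rel = split[j]
    return lambda t, nt: rng(t, nt) and rel(t, nt)

def failure(j, range_ok, relation_ok):
    """What is known about a record that failed p[j] in the given way."""
    rng, rel = split[j]
    return lambda t, nt: rng(t, nt) == range_ok and rel(t, nt) == relation_ok

def compatible(known, candidate):
    return any(candidate(*v) for v in domain if known(*v))

def shift_for(j, known_failure):
    """Smallest slide k that the known check results do not rule out."""
    for k in range(1, m):
        ok = k > j or compatible(known_failure, whole(j - k))
        ok = ok and all(compatible(whole(q), whole(q - k))
                        for q in range(max(j + 1, k), m))
        if ok:
            return k
    return m

shift1 = [shift_for(j, failure(j, False, True)) for j in range(m)]    # range part fails
shift2 = [shift_for(j, failure(j, True, False)) for j in range(m)]    # relation part fails
shift12 = [shift_for(j, failure(j, False, False)) for j in range(m)]  # both fail

def extended_search(temps):
    """Return (position of the first match or None, number of checks)."""
    n, i, checks = len(temps), 0, 0
    while i + m <= n:
        for j in range(m - 1, -1, -1):
            t = temps[i + j]
            nt = temps[i + j + 1] if i + j + 1 < n else None
            checks += 1
            range_ok, relation_ok = split[j][0](t, nt), split[j][1](t, nt)
            if range_ok and relation_ok:
                continue
            table = shift2 if range_ok else (shift1 if relation_ok else shift12)
            i += table[j]
            break
        else:
            return i, checks          # no break: every predicate held
    return None, checks

outcome = extended_search([36, 42, 47, 36, 37, 42, 40, 37])
print(outcome)  # (4, 8)
```

On the example stream the relation-only failure at p[1] gives a three-record slide and the range failure at p[3] a one-record slide, so the match is found after eight checks, as in the extended walk-through of Fig. 3.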
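[0093]
The data processing apparatus of this embodiment is configured using, for example, an information processing apparatus (computer) such as that shown in Fig. 14. The information processing apparatus of Fig. 14 comprises a CPU (central processing unit) 81, a memory 82, an input device 83, an output device 84, an external storage device 85, a medium drive device 86, and a network connection device 87, which are connected to one another by a bus 88.
[0094]
The memory 82 includes, for example, a ROM (read only memory) and a RAM (random access memory), and stores the programs and data used for processing. The CPU 81 performs the necessary processing by executing the programs using the memory 82.
[0095]
The input device 83 is, for example, a keyboard, a pointing device, or a touch panel, and is used to input instructions and information from the user. The output device 84 is, for example, a display device, a printer, or a speaker, and is used to output inquiries to the user and processing results.
[0096]
The external storage device 85 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, or a tape device. The information processing apparatus stores the above programs and data in the external storage device 85 and loads them into the memory 82 for use as needed.
[0097]
The medium drive device 86 drives a portable recording medium 89 and accesses its recorded contents. The portable recording medium 89 may be any computer-readable recording medium, such as a memory card, a flexible disk, a CD-ROM (compact disk read only memory), an optical disk, or a magneto-optical disk. The user stores the above programs and data on the portable recording medium 89 and loads them into the memory 82 for use as needed.
[0098]
The network connection device 87 is connected to an arbitrary communication network such as a LAN (local area network) and performs the data conversion associated with communication. The information processing apparatus receives the above programs and data from other apparatuses via the network connection device 87 and loads them into the memory 82 for use as needed.
[0099]
Fig. 15 shows computer-readable recording media that can supply programs and data to the information processing apparatus of Fig. 14. Programs and data stored on the portable recording medium 89 or in the database 91 of the server 90 are loaded into the memory 82. In the latter case, the server 90 generates a carrier signal that carries the programs and data and transmits it to the information processing apparatus via an arbitrary transmission medium on the network. The CPU 81 then executes the programs using the data and performs the necessary processing.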
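[0100]
Next, the results of a simulation are described that compares the naive approach, which always shifts the window by one record, the search processing of the OPS algorithm described above, and the search processing of the present invention. The following query and stream data were used.
[0101]
Query
"Extract all patterns in which a company's stock price falls twice in a row and then rises twice in a row, where the falls bring the price to a value between 40 and 50 and the price does not exceed 52 at the first rise."
Stream data
55 50 45 51 54 50 47 49 45 42 55 57 59 60 57
Figs. 16 through 19 show how the position i on the stream data and the position j on the search pattern change when the pattern search is performed with this query and stream data. Fig. 16 shows the result for the naive approach, Fig. 17 for the OPS algorithm, Fig. 18 for the basic processing of the present invention (which does not handle the conditional expressions of conjunctive predicates separately), and Fig. 19 for the extended processing of the present invention (which handles them separately). The square marks on the graphs indicate the points at which a pattern predicate was checked against a stream record.
[0102]
In the naive approach of Fig. 16, nine windows w1 to w9 are processed, and pattern predicates are checked against stream records 20 times. With the OPS algorithm of Fig. 17, the search finishes after seven windows w1 to w7 and 16 checks.
[0103]
In contrast, with the basic processing of Fig. 18 the search finishes after eight windows w1 to w8 and 11 checks, and with the extended processing of Fig. 19 it finishes after seven windows w1 to w7 and 10 checks. In this example, the extended processing therefore performs best.
[0104]
In the naive approach, backtracking occurs, in which the value of i decreases between consecutive points as processing proceeds; no such backtracking occurs in the OPS algorithm. However, in the OPS algorithm, mismatched records are rechecked repeatedly at consecutive points with the same value of i.
[0105]
In the basic processing of the present invention, by contrast, the number of windows processed is not smaller than for the OPS algorithm, but records are rechecked only in the last window w8, in which no mismatch occurred (that is, in which the pattern was fully detected). Furthermore, in the extended processing, which analyzes relation predicates and range predicates separately, after the mismatch at record r[3] (i = 3) in the first window w1 the window is shifted farther than in the basic processing, and the check of record r[4] (i = 4) is skipped.
[0106]
Thus the search processing of the present invention is a more efficient algorithm than the naive approach and the OPS algorithm.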
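Next, the results of a simulation using other queries are described. Each of the ten patterns P1 to P10 shown below is described by seven predicates, and these predicates were used to search for consecutive records whose water level (and temperature) attributes satisfy the respective conditions. The stream data consisted of 100,000 randomly generated records holding sensor measurements of water level and temperature.
[0107]
P1: Over seven consecutive measurements, the water level stays within -2 to +2 of the adjacent measurement.
P2: Over seven consecutive measurements, the water level keeps rising.
P3: Over seven consecutive measurements, the water level keeps falling.
P4: In the first half of seven consecutive measurements the water level keeps rising, and in the second half it stays stable or keeps falling.
P5: Three water level spikes occur in a row; that is, the water level suddenly rises by more than 10 and then suddenly falls by more than 10, three times in succession.
P6: Over seven consecutive measurements, the water level stays within -2 to +2 of the adjacent measurement, and the temperature keeps rising.
P7: Over seven consecutive measurements, the water level keeps rising and the temperature keeps falling.
P8: Over seven consecutive measurements, the water level stays within -2 to +2 of the adjacent measurement, and the temperature keeps rising in the first half and stays stable or keeps falling in the second half.
P9: Over seven consecutive measurements, the water level stays within -2 to +2 of the adjacent measurement, and three temperature spikes occur in a row; that is, the temperature suddenly rises by more than 10 and then suddenly falls by more than 10, three times in succession.
P10: Over seven consecutive measurements, the water level keeps rising and three temperature spikes occur in a row.
[0108]
Fig. 20 shows the number of predicate checks performed by each of the naive approach (Naive), the OPS algorithm, the basic processing of the present invention (R2L), the extended OPS algorithm (OPS-extended), and the extended processing of the present invention (R2L-extended).
[0109]
The extended OPS algorithm is the OPS algorithm modified to handle the conditional expressions of conjunctive predicates separately, as in the extended processing of the present invention. For patterns P1 to P5, each predicate contains only one conditional expression, so the extended OPS algorithm and the extended processing of the present invention were not simulated.
[0110]
Fig. 21 shows a bar graph of the simulation results (numbers of checks) of Fig. 20; 101, 102, 103, 104, and 105 correspond to the numbers of checks for the naive approach, OPS, the basic processing of the present invention, extended OPS, and the extended processing of the present invention, respectively.
[0111]
The graph shows that OPS, extended OPS, the basic processing of the present invention, and the extended processing of the present invention all perform fewer checks than the naive approach for every pattern. The numbers of checks for OPS and extended OPS are equal to or only slightly larger than the total number of records (100,000), which means that these methods incur no backtracking for most patterns.
[0112]
Except for OPS on patterns P4 and P9, the basic and extended processing of the present invention perform fewer checks than OPS and extended OPS, respectively. For patterns P1, P2, P3, P6, and P7 in particular, the numbers of checks for the basic and extended processing are no more than half those of OPS and extended OPS, respectively. For patterns P8, P9, and P10, using the extended processing instead of the basic processing greatly reduces the number of checks.
[0113]
As described above, simulations searching for ten different patterns in a large amount of randomly generated data showed that, in many cases, the search processing of the present invention performs far better than the naive approach and the OPS algorithm, and that the extended processing is more efficient still than the basic processing.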
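(Supplementary note 1) A data processing apparatus for extracting a data sequence consisting of m ordered data items from a plurality of ordered data items, comprising:
an input unit that inputs a condition sequence consisting of m ordered conditions, each of which specifies one of the m data items using relationships between data items;
a preprocessing unit that analyzes the compatibility among the m conditions and generates auxiliary information used for data extraction;
a search unit that, when searching the ordered data for the data sequence from the first data item toward the last, checks whether data items satisfy the corresponding conditions of the condition sequence in the direction opposite to the search direction and, when a checked data item does not satisfy its condition, determines the next check start position using the auxiliary information; and
an output unit that outputs information on a data sequence that satisfies all conditions of the condition sequence.
(Supplementary note 2) The data processing apparatus according to supplementary note 1, wherein the preprocessing unit generates an array that holds, for each position on the condition sequence consisting of the m conditions, the maximum length of a partial condition sequence whose right end is at that position and which is compatible with the rightmost partial condition sequence of the condition sequence, and generates the auxiliary information using the array.
(Supplementary note 3) A data processing apparatus for extracting a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], where p[j] (j = {0, 1, ..., m-1}) is a predicate described by a conditional expression on the attribute values of a record r[i] (i = {0, 1, ..., n-1}), comprising:
an input unit that inputs the pattern;
a preprocessing unit that analyzes the compatibility between predicate p[j] and predicate p[j1] (j1 = {0, 1, ..., j-1}) and generates auxiliary information used for pattern extraction;
a search unit that, when searching the ordered data from record r[0] toward r[n-1] for a record sequence corresponding to the pattern, checks whether records satisfy the corresponding predicates in the pattern from the last predicate p[m-1] toward the first predicate p[0] and, when a checked record does not satisfy its predicate, determines the next check start position using the auxiliary information; and
an output unit that outputs information on a record sequence that satisfies all predicates of the pattern.
(Supplementary note 4) The data processing apparatus according to supplementary note 3, wherein, when a predicate contains a plurality of conditional expressions, the preprocessing unit analyzes the plurality of conditional expressions separately.
(Supplementary note 5) The data processing apparatus according to supplementary note 3, wherein the preprocessing unit generates an array that holds the maximum length len of a predicate sequence p[j-len+1], ..., p[j] whose right end is at position j and which is compatible with the predicate sequence p[m-len], ..., p[m-1], and generates the auxiliary information using the array.
(Supplementary note 6) A data processing apparatus for extracting a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], where r[i] (i = {0, 1, ..., n-1}) is a record having k attributes a[0], a[1], ..., a[k-1], r[i].a[q] is the value of attribute a[q] (q = {0, 1, ..., k-1}) of r[i], F is an arbitrary function of constants and attributes r[i1].a[q1] (i1 = {0, 1, ..., n-1 other than i}, q1 = {0, 1, ..., k-1}), OP is any one of the operators {<, ≤, =, >, ≥, ≠}, and p[j] (j = {0, 1, ..., m-1}) is a predicate described by a combination of conditional expressions of the form (r[i].a[q] OP F), comprising:
an input unit that inputs the pattern;
a preprocessing unit that analyzes the compatibility between predicate p[j] and predicate p[j1] (j1 = {0, 1, ..., j-1}) and generates auxiliary information used for pattern extraction;
a search unit that, when searching the ordered data from record r[0] toward r[n-1] for a record sequence corresponding to the pattern, checks whether records satisfy the corresponding predicates in the pattern from the last predicate p[m-1] toward the first predicate p[0] and, when a checked record does not satisfy its predicate, determines the next check start position using the auxiliary information; and
an output unit that outputs information on a record sequence that satisfies all predicates of the pattern.
(Supplementary note 7) A program for a computer that extracts a data sequence consisting of m ordered data items from a plurality of ordered data items, the program causing the computer to execute processing of:
inputting a condition sequence consisting of m ordered conditions, each of which specifies one of the m data items using relationships between data items;
analyzing the compatibility among the m conditions and generating auxiliary information used for data extraction;
when searching the ordered data for the data sequence from the first data item toward the last, checking whether data items satisfy the corresponding conditions of the condition sequence in the direction opposite to the search direction and, when a checked data item does not satisfy its condition, determining the next check start position using the auxiliary information; and
outputting information on a data sequence that satisfies all conditions of the condition sequence.
(Supplementary note 8) A program for a computer that extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], where p[j] (j = {0, 1, ..., m-1}) is a predicate described by a conditional expression on the attribute values of a record r[i] (i = {0, 1, ..., n-1}), the program causing the computer to execute processing of:
inputting the pattern;
analyzing the compatibility between predicate p[j] and predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
when searching the ordered data from record r[0] toward r[n-1] for a record sequence corresponding to the pattern, checking whether records satisfy the corresponding predicates in the pattern from the last predicate p[m-1] toward the first predicate p[0] and, when a checked record does not satisfy its predicate, determining the next check start position using the auxiliary information; and
outputting information on a record sequence that satisfies all predicates of the pattern.
(Supplementary note 9) The program according to supplementary note 8, wherein, when generating the auxiliary information, the computer analyzes a plurality of conditional expressions separately if they are contained in a predicate.
(Supplementary note 10) A computer-readable recording medium on which is recorded a program for a computer that extracts a data sequence consisting of m ordered data items from a plurality of ordered data items, the program causing the computer to execute processing of:
inputting a condition sequence consisting of m ordered conditions, each of which specifies one of the m data items using relationships between data items;
analyzing the compatibility among the m conditions and generating auxiliary information used for data extraction;
when searching the ordered data for the data sequence from the first data item toward the last, checking whether data items satisfy the corresponding conditions of the condition sequence in the direction opposite to the search direction and, when a checked data item does not satisfy its condition, determining the next check start position using the auxiliary information; and
outputting information on a data sequence that satisfies all conditions of the condition sequence.
(Supplementary note 11) A computer-readable recording medium on which is recorded a program for a computer that extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], where p[j] (j = {0, 1, ..., m-1}) is a predicate described by a conditional expression on the attribute values of a record r[i] (i = {0, 1, ..., n-1}), the program causing the computer to execute processing of:
inputting the pattern;
analyzing the compatibility between predicate p[j] and predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
when searching the ordered data from record r[0] toward r[n-1] for a record sequence corresponding to the pattern, checking whether records satisfy the corresponding predicates in the pattern from the last predicate p[m-1] toward the first predicate p[0] and, when a checked record does not satisfy its predicate, determining the next check start position using the auxiliary information; and
outputting information on a record sequence that satisfies all predicates of the pattern.
(Supplementary note 12) The recording medium according to supplementary note 11, wherein, when generating the auxiliary information, the computer analyzes a plurality of conditional expressions separately if they are contained in a predicate.
(Supplementary note 13) A carrier signal that carries, to a computer, a program for the computer that extracts a data sequence consisting of m ordered data items from a plurality of ordered data items, the program causing the computer to execute processing of:
inputting a condition sequence consisting of m ordered conditions, each of which specifies one of the m data items using relationships between data items;
analyzing the compatibility among the m conditions and generating auxiliary information used for data extraction;
when searching the ordered data for the data sequence from the first data item toward the last, checking whether data items satisfy the corresponding conditions of the condition sequence in the direction opposite to the search direction and, when a checked data item does not satisfy its condition, determining the next check start position using the auxiliary information; and
outputting information on a data sequence that satisfies all conditions of the condition sequence.
(Supplementary note 14) A carrier signal that carries, to a computer, a program for the computer that extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], where p[j] (j = {0, 1, ..., m-1}) is a predicate described by a conditional expression on the attribute values of a record r[i] (i = {0, 1, ..., n-1}), the program causing the computer to execute processing of:
inputting the pattern;
analyzing the compatibility between predicate p[j] and predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
when searching the ordered data from record r[0] toward r[n-1] for a record sequence corresponding to the pattern, checking whether records satisfy the corresponding predicates in the pattern from the last predicate p[m-1] toward the first predicate p[0] and, when a checked record does not satisfy its predicate, determining the next check start position using the auxiliary information; and
outputting information on a record sequence that satisfies all predicates of the pattern.
(Supplementary note 15) The carrier signal according to supplementary note 14, wherein, when generating the auxiliary information, the computer analyzes a plurality of conditional expressions separately if they are contained in a predicate.
(Supplementary note 16) A data processing method for extracting a data sequence consisting of m ordered data items from a plurality of ordered data items, wherein:
an input unit inputs a condition sequence consisting of m ordered conditions, each of which specifies one of the m data items using relationships between data items;
a preprocessing unit analyzes the compatibility among the m conditions and generates auxiliary information used for data extraction;
a search unit, when searching the ordered data for the data sequence from the first data item toward the last, checks whether data items satisfy the corresponding conditions of the condition sequence in the direction opposite to the search direction and, when a checked data item does not satisfy its condition, determines the next check start position using the auxiliary information; and
an output unit outputs information on a data sequence that satisfies all conditions of the condition sequence.
(Supplementary note 17) A data processing method for extracting a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], where p[j] (j = {0, 1, ..., m-1}) is a predicate described by a conditional expression on the attribute values of a record r[i] (i = {0, 1, ..., n-1}), wherein:
an input unit inputs the pattern;
a preprocessing unit analyzes the compatibility between predicate p[j] and predicate p[j1] (j1 = {0, 1, ..., j-1}) and generates auxiliary information used for pattern extraction;
a search unit, when searching the ordered data from record r[0] toward r[n-1] for a record sequence corresponding to the pattern, checks whether records satisfy the corresponding predicates in the pattern from the last predicate p[m-1] toward the first predicate p[0] and, when a checked record does not satisfy its predicate, determines the next check start position using the auxiliary information; and
an output unit outputs information on a record sequence that satisfies all predicates of the pattern.
(Supplementary note 18) The data processing method according to supplementary note 17, wherein, when a predicate contains a plurality of conditional expressions, the preprocessing unit analyzes the plurality of conditional expressions separately.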
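[0114]
[Effects of the Invention]
According to the present invention, in processing that extracts from given data an order pattern specified by a complex query, unnecessary checks between the specified pattern and the data can be skipped, and data sequences matching the pattern can be detected efficiently.
[Brief Description of the Drawings]
[Fig. 1] Configuration diagram of the data processing apparatus of the present invention.
[Fig. 2] Diagram showing basic window processing.
[Fig. 3] Diagram showing extended window processing.
[Fig. 4] Diagram showing a window shift by shift[j].
[Fig. 5] Diagram showing full overlap.
[Fig. 6] Diagram showing partial overlap.
[Fig. 7] Diagram showing a window shift with no overlap.
[Fig. 8] Diagram showing compati[i].
[Fig. 9] Diagram showing compati[2].
[Fig. 10] Diagram showing compati[1].
[Fig. 11] Diagram showing the pattern predicates in the no-overlap case.
[Fig. 12] Diagram showing the pattern predicates in the partial-overlap case.
[Fig. 13] Diagram showing the pattern predicates in the full-overlap case.
[Fig. 14] Configuration diagram of the information processing apparatus.
[Fig. 15] Diagram showing recording media.
[Fig. 16] Diagram showing the processing result of the naive approach.
[Fig. 17] Diagram showing the processing result of the OPS algorithm.
[Fig. 18] Diagram showing the processing result of the basic processing of the present invention.
[Fig. 19] Diagram showing the processing result of the extended processing of the present invention.
[Fig. 20] Diagram showing the numbers of checks in the five search methods.
[Fig. 21] Diagram showing a graph of the numbers of checks.
[Description of Reference Numerals]
11 Input unit
12 Preprocessing unit
13 Search unit
14 Output unit
21, 22, 23, 31, 32, 41, 42, 51, 61, 71 Window
81 CPU
82 Memory
83 Input device
84 Output device
85 External storage device
86 Medium drive device
87 Network connection device
88 Bus
89 Portable recording medium
90 Server
91 Database
101, 102, 103, 104, 105 Number of checks
[0001]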
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a data processing apparatus and method for extracting an order pattern specified by complex conditions from given data.
[0002]
[Prior Art and Problems to Be Solved by the Invention]
In today's information society, many applications must monitor large volumes of ordered data. Examples of such data and applications include sensor applications, performance measurements in network monitoring and traffic management, vital signs and treatments in medical monitoring, call detail records in telecommunications, and log records and clickstreams in web applications.
[0003]
Consider an application that monitors the physical world by querying and analyzing sensor data. One such application manages the articles stored in a factory warehouse. Temperature sensors are attached to the articles and walls in the warehouse and are also embedded in the floor and ceiling. Each sensor outputs its measured temperature at a fixed interval, and the warehouse manager uses the sensor data to confirm that the articles are not overheating. For this application, data retrieval queries such as the following are important.
[0004]
Query 1: Detect sensors whose three consecutive measured temperatures are 35, 36, and 37 degrees.
Query 2: Detect sensors whose temperature rose by more than 2 degrees three times in a row, going from a temperature below 35 degrees to a temperature above 40 degrees.
Query 3: Detect a temporal pattern in which the temperature rises from below 38 degrees to between 40 and 50 degrees, then falls twice in succession and returns to below 38 degrees.
[0005]
The target patterns in this temperature monitoring application range from very simple ones, such as Query 1, which only needs to find three consecutive measured temperatures, to more complex ones, such as Queries 2 and 3, which find successive temperature rises or abnormal spikes that the warehouse manager should watch for but cannot specify exactly. Instead of a single absolute measurement, a range of possible measurements or a relationship between rising and falling measurements may be specified.
[0006]
More specifically, Queries 1, 2, and 3 each find, in the data stream, multiple records r whose temperature attribute values match a pattern expressed by predicates p[i] such as the following.
[0007]

Here, the predicate p[i](r) expresses a condition on the attribute value of the i-th record in the pattern, r.temperature denotes the temperature attribute value of record r, and r.previous denotes the record immediately before record r in the data stream. When the given data is laid out from left to right, r.previous in Queries 2 and 3 corresponds to the record to the left of record r, and a pattern can be described by specifying conditions on that record.
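As an illustration of this notation, the sketch below encodes Query 1 and one possible r.previous formulation of Query 3 as Python predicates. The predicate list itself is not reproduced in this text, so the Query 3 encoding, the `Record` class, and the helpers `make_stream` and `matches_at` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Record:
    temperature: float
    previous: Optional["Record"] = None   # record immediately to the left

def make_stream(values) -> List[Record]:
    """Build records linked through .previous, in stream order."""
    stream, prev = [], None
    for v in values:
        prev = Record(v, prev)
        stream.append(prev)
    return stream

Predicate = Callable[[Record], bool]

# Query 1: three consecutive readings of 35, 36 and 37 degrees.
query1: List[Predicate] = [
    lambda r: r.temperature == 35,
    lambda r: r.temperature == 36,
    lambda r: r.temperature == 37,
]

# Query 3 (assumed encoding): below 38, a rise into (40, 50),
# then two drops, ending below 38.
query3: List[Predicate] = [
    lambda r: r.temperature < 38,
    lambda r: 40 < r.temperature < 50 and r.temperature > r.previous.temperature,
    lambda r: r.temperature < r.previous.temperature,
    lambda r: r.temperature < 38 and r.temperature < r.previous.temperature,
]

def matches_at(stream: List[Record], pattern: List[Predicate], i: int) -> bool:
    """True if pattern[j] holds for stream[i + j] for every j, checked left to right."""
    return all(p(stream[i + j]) for j, p in enumerate(pattern))

readings = make_stream([36, 42, 47, 36, 37, 42, 40, 37])
print(matches_at(readings, query3, 0))  # False: 47 is not a drop from 42
print(matches_at(readings, query3, 4))  # True: 37, 42, 40, 37
```

Query 1 uses only equalities with constants, while the Query 3 predicates also compare each record with its left neighbour, which is the kind of condition the following paragraphs discuss.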
[0008]
The pattern predicate of Query 1 is always an equation using constants. Therefore, a well-known character string matching algorithm such as a Boyer-Moore algorithm or a Knth-Morris-Pratt algorithm can be efficiently applied. However, Query 2 and Query 3 are more complex and their predicates are inequalities using adjacent record attribute values and constants. Since the string matching algorithm is applicable only to a query represented by an equation using a constant, such as Query 1, it cannot be applied to these two more complex queries.
[0009]
For a matching algorithm applicable to such a query, see Sadri et al. There is an Optimized Pattern Search (OPS) algorithm proposed by S.A. (Sadri, R., Zaiolo, C., Zarkesh, A., and Adibi, J., Optimization of the System of Authenticating the Dimensions of the Dimensions of the Requirement from the Dimensions of the Dimensions of the Requirement of the Authenticating System of the Dimensional Dimensions of the Requirement of the Authenticating Dimensions of the Authenticating Dimensional Dimensions of the Authenticating Technology Dimensions of the Authenticating Dimensional Dimensions of the Quarterly Requirement to the Dimensional Quarterly). -SIGART Symposium on Principles of Database Systems, pp. 71-81, May 2001). In this algorithm, within a window set on the data stream, r. The predicates described using previous are checked from left to right (that is, in the order of p [0], p [1], p [2], and p [3]). When the predicate is checked at a certain position, the operation of shifting the window to another position and checking again is repeated.
[0010]
However, with this method a record that fails to match a predicate often has to be rechecked many times at subsequent window positions, so the algorithm is not necessarily efficient.
[0011]
An object of the present invention is to provide a data processing apparatus and method for efficiently extracting an order pattern specified by a complex query from given data.
[0012]
[Means for Solving the Problems]
FIG. 1 is a configuration diagram of the first and second data processing devices of the present invention. The data processing device of FIG. 1 includes an input unit 11, a preprocessing unit 12, a search unit 13, and an output unit 14.
[0013]
A first data processing device of the present invention extracts a data string consisting of m ordered data items from a plurality of ordered data. The input unit 11 inputs a condition sequence consisting of m ordered conditions, each of which specifies one of the m data items using relationships between the data. The preprocessing unit 12 analyzes the compatibility between the m conditions and generates auxiliary information used for data extraction. When searching the ordered data from the first data to the last data for the data string, the search unit 13 checks whether data satisfy the corresponding conditions of the condition sequence in the direction opposite to the search direction, and, if checked data does not satisfy its condition, determines the next check start position using the auxiliary information. The output unit 14 then outputs information on the data strings that satisfy all the conditions of the condition sequence.
[0014]
Each condition of the condition sequence input by the input unit 11 is described using a relationship (between their values, for example) between any two of the ordered data items, so the user can specify a complex order pattern using arbitrary relationships. Such a condition corresponds to, for example, a predicate that describes a relationship between records.
[0015]
The preprocessing unit 12 generates the auxiliary information by analyzing the compatibility between the conditions in advance, and the search unit 13 uses the auxiliary information to determine the next check start position. This makes it possible to search for data strings that match the pattern while skipping unnecessary checks between the pattern and the data.
[0016]
It has been confirmed by simulation that in many cases, the number of checks is reduced by reversing the search direction on the data and the direction of the condition check as compared to the case where the directions are the same. The result of this simulation will be described later.
[0017]
The second data processing device of the present invention extracts, from ordered data consisting of n records r[0], r[1], ..., r[n-1], a record sequence corresponding to a pattern consisting of m predicates p[0], p[1], ..., p[m-1], where each predicate p[j] (j = 0, 1, ..., m-1) is described by a conditional expression on attribute values of a record r[i] (i = 0, 1, ..., n-1). The input unit 11 inputs the pattern, and the preprocessing unit 12 analyzes the compatibility between each predicate p[j] and the predicates p[j1] (j1 = 0, 1, ..., j-1) and generates auxiliary information used for pattern extraction. When searching the ordered data from record r[0] to record r[n-1] for a record sequence corresponding to the pattern, the search unit 13 checks whether records satisfy the corresponding predicates in order from the last predicate p[m-1] to the leading predicate p[0], and, when a checked record does not satisfy its predicate, determines the next check start position using the auxiliary information. The output unit 14 then outputs information on the record sequences that satisfy all the predicates of the pattern.
[0018]
Each predicate in the pattern input by the input unit 11 is described by a conditional expression including an attribute value of a record, and the user can specify a complex order pattern using an arbitrary conditional expression.
[0019]
The preprocessing unit 12 generates the auxiliary information by analyzing the compatibility between the predicates in advance, and the search unit 13 uses the auxiliary information to determine the next check start position. This makes it possible to search for record sequences that match the pattern while skipping unnecessary checks between the pattern and the records.
[0020]
As in the first data processing apparatus, the search direction on the order data and the predicate check direction are opposite, so that more efficient search processing can be performed than when these are set to the same direction.
[0021]
The input unit 11 in FIG. 1 corresponds to, for example, the input device 83 in FIG. 14 described later, and the preprocessing unit 12 and the search unit 13 correspond to, for example, a combination of the CPU (central processing unit) 81 and the memory 82 in FIG. 14. The output unit 14 corresponds to, for example, the output device 84 in FIG. 14.
[0022]
[Best Mode for Carrying Out the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
The data processing device of the present embodiment is configured using, for example, a computer, and detects complex temporal change patterns in a data stream that is input continuously and without end, or in data provided in the form of finite stored data.
[0023]
First, the data processing apparatus performs preprocessing of a complicated pattern at the time of compilation to generate auxiliary information in order to skip unnecessary check of a pattern predicate for stream data. Then, while the window is slid from left to right (from the beginning to the end of the stream) at the time of execution, the stream search is efficiently performed using the auxiliary information. At this time, it is checked whether the data in the window satisfies the pattern predicate from right to left.
[0024]
This method extracts the logical relationships (auxiliary information) between pattern predicates as part of the query compilation process and infers how the window should be shifted when a partial match between the stream and the pattern predicates is obtained. With this check method, the length of each window shift can be increased and the number of times the same data is rechecked can be minimized.
[0025]
When predicates in a window are checked from right to left, the information of r.previous cannot be used, so the query must be rewritten into another form. For example, the pattern predicates of Queries 2 and 3 described above are rewritten as follows.
[0026]

Here, r.next denotes the record immediately after record r in the data stream (the record to the right of record r), and a pattern can be described by specifying conditions on that record. With a pattern rewritten using r.next, the predicates can be checked from right to left (that is, in the order p[3], p[2], p[1], p[0]).
[0027]
Hereinafter, the processing will be described using Query 3 as an example. For the sake of simplicity, eight records whose temperature attribute values are 36, 42, 47, 36, 37, 42, 40, and 37 are used as the data stream.
[0028]
As shown in FIG. 2, in the processing of the first window 21, the value 36 satisfies p[3](r): r.temperature < 38 because 36 < 38, and the value 47 satisfies p[2](r): r.temperature > r.next.temperature because 47 > 36. However, the next value 42 does not satisfy p[1](r): 40 < r.temperature < 50 AND r.temperature > r.next.temperature. Although 40 < 42 < 50 holds, 42 > 47 does not.
[0029]
Thus, after p[3] and p[2] have matched the data, the first mismatch occurs at p[1], so window 21 must be shifted before the check of the pattern predicates against the stream data resumes. At this point it is known that the values 36 and 47 satisfy p[3] and p[2], respectively, and that the value 42 does not satisfy p[1].
[0030]
Analysis of the pattern predicates shows that p[3] and p[2] do not contradict each other, so a value satisfying p[3] may also satisfy p[2]. Likewise, a value satisfying p[2] may also satisfy p[1], and a value not satisfying p[1] may satisfy p[0]. Therefore, window 21 is shifted by one record so that the values 36, 47, and 42 are aligned with p[2], p[1], and p[0], respectively, in the new window 22.
[0031]
In the processing of the second window 22, the value 37 satisfies p[3] because 37 < 38, but the value 36 does not satisfy p[2] because 36 > 37 does not hold. Since the mismatch occurred at p[2], window 22 must be shifted by some amount before the check resumes.
[0032]
As described above, a value that satisfies p[3] may also satisfy p[2]. However, a value that does not satisfy p[2](r): r.temperature > r.next.temperature cannot satisfy p[1](r): 40 < r.temperature < 50 AND r.temperature > r.next.temperature. Since the value 36 is already known not to satisfy p[2], it does not satisfy p[1] either, so shifting window 22 by one record, which would align the value 36 with p[1], clearly yields no match.
[0033]
If window 22 is shifted by two records, the value that satisfied p[3](r): r.temperature < 38 would also have to satisfy p[1](r): 40 < r.temperature < 50 AND r.temperature > r.next.temperature, but this is impossible because p[3] and p[1] contradict each other. Therefore, the two-record shift is also rejected.
[0034]
Next, a three-record shift is considered. It is viable if a value satisfying p[3] may also satisfy p[0], which is the case. Therefore, as shown in FIG. 2, window 22 is shifted by three records, and in the processing of the third window 23 the values 37, 40, 42, and 37 are confirmed to satisfy p[3], p[2], p[1], and p[0], respectively.
[0035]
In this example, a given pattern could be detected in three window processes by using knowledge about the relationship between pattern predicates. The number of predicate checks on the stream data was nine.
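The three window checks of FIG. 2 can be replayed with a small Python sketch. The predicates are the r.next forms of Query 3 quoted in the walkthrough; the names `TEMPS`, `PREDICATES`, and `first_mismatch` are illustrative and not part of the original. The window positions 0, 1, and 4 are taken from the walkthrough rather than computed, so this only checks one window at a time from right to left and counts the predicate checks.

```python
# Temperature stream from the example (eight records).
TEMPS = [36, 42, 47, 36, 37, 42, 40, 37]

# Query 3 predicates in r.next form. Each takes the stream s and the
# position i of record r; s[i + 1] plays the role of r.next.temperature.
PREDICATES = [
    lambda s, i: s[i] < 38 and s[i] < s[i + 1],        # p[0]
    lambda s, i: 40 < s[i] < 50 and s[i] > s[i + 1],   # p[1]
    lambda s, i: s[i] > s[i + 1],                      # p[2]
    lambda s, i: s[i] < 38,                            # p[3]
]


def first_mismatch(s, start):
    """Check one window right to left.

    Returns (j, checks): j is the index of the first failing predicate,
    or -1 if every predicate holds; checks is the number of tests made.
    """
    checks = 0
    for j in range(len(PREDICATES) - 1, -1, -1):
        checks += 1
        if not PREDICATES[j](s, start + j):
            return j, checks
    return -1, checks


# Positions 0, 1 and 4 correspond to windows 21, 22 and 23 in FIG. 2.
results = [first_mismatch(TEMPS, pos) for pos in (0, 1, 4)]
total_checks = sum(checks for _, checks in results)
print(results, total_checks)
```

The mismatches fall at p[1] and p[2] as described, and the per-window counts add up to the nine checks stated above.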
[0036]
In the basic processing described above, the temperature attribute value is checked against all the inequalities of a conjunctive predicate at once. However, the number of checks can be reduced further by introducing extended processing that handles each inequality separately. In this example, an inequality that uses the value of the next record and an inequality that uses a constant are treated separately. The former indicates the correlation (shape of change) between the attribute values of two records and is called a "relation predicate". The latter indicates the range of the attribute value and is called a "range predicate".
[0037]
In the processing of the first window 21, as described above, the value 42 does not satisfy p[1](r): 40 < r.temperature < 50 AND r.temperature > r.next.temperature. In more detail, this predicate is the logical product of a range predicate (40 < r.temperature < 50) and a relation predicate (r.temperature > r.next.temperature). The value 42 satisfies the range predicate (40 < 42 < 50) but not the relation predicate (42 > 47 does not hold).
[0038]
Since the value 42 satisfies the range predicate of p[1], it cannot satisfy p[0](r): r.temperature < 38 AND r.temperature < r.next.temperature. Therefore, a window shift larger than the one-record shift used in the basic processing, which does not split the conjunctive predicate p[1], is possible. A two-record shift is ruled out because p[3] and p[1] contradict each other, whereas p[3] and p[0] do not contradict each other, so it is determined that window 21 is shifted by three records as shown in FIG. 3.
[0039]
In the processing of the second window 31, the value 40 is found not to satisfy p[3] because 40 < 38 does not hold, so window 31 is shifted by one record. Then, in the processing of the third window 32, the four values are confirmed to satisfy the corresponding predicates. In this example, analyzing the inequalities of the predicates separately reduced the number of checks to eight.
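To make the range/relation split concrete, the following Python sketch classifies each mismatch by which part of the conjunctive predicate failed. It is an illustration under assumptions: the names `SPLIT` and `classify` are hypothetical, the window positions 0, 3, and 4 (windows 21, 31, and 32 in FIG. 3) are taken from the text rather than derived, and one check is counted per predicate regardless of how many parts it has.

```python
TEMPS = [36, 42, 47, 36, 37, 42, 40, 37]

# Each predicate of Query 3 as (range part, relation part); None = no such part.
# Both parts take the stream s and the position i of record r.
SPLIT = [
    (lambda s, i: s[i] < 38,      lambda s, i: s[i] < s[i + 1]),  # p[0]
    (lambda s, i: 40 < s[i] < 50, lambda s, i: s[i] > s[i + 1]),  # p[1]
    (None,                        lambda s, i: s[i] > s[i + 1]),  # p[2]
    (lambda s, i: s[i] < 38,      None),                          # p[3]
]


def classify(s, start):
    """Check a window right to left; return (j, kind, checks).

    kind is "range", "relation" or "both" for a mismatch at p[j],
    and "match" (with j == -1) when the whole pattern holds.
    """
    checks = 0
    for j in range(len(SPLIT) - 1, -1, -1):
        checks += 1
        rng, rel = SPLIT[j]
        rng_ok = rng is None or rng(s, start + j)
        rel_ok = rel is None or rel(s, start + j)
        if rng_ok and rel_ok:
            continue
        if not rng_ok and not rel_ok:
            return j, "both", checks
        return j, ("range" if not rng_ok else "relation"), checks
    return -1, "match", checks


trace = [classify(TEMPS, pos) for pos in (0, 3, 4)]
total_checks = sum(t[2] for t in trace)
print(trace, total_checks)
```

The first window fails at p[1] on the relation part only, which is the case that permits the larger shift, and the three windows together need eight checks.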
[0040]
In the processing in FIGS. 2 and 3, a pattern shift that may succeed in matching when a stream and a pattern predicate partially match is estimated using the interdependency between the predicates. Thereby, the number of comparisons between the predicate in the window and the stream data is reduced, and as a result, the search speed is improved.
[0041]
In this example, in order to facilitate understanding, it has been shown that the slide length of the window is inferred as a part of the processing when a mismatch occurs. However, interdependencies between predicates are independent of stream data, and can be calculated in advance at query compilation.
[0042]
The following describes in detail a method in which the auxiliary information is generated once, as part of query compilation, from the logical relationships between pattern predicates, and is then used repeatedly to slide the window efficiently while patterns are searched for in the stream data. First, the meanings of some necessary terms are summarized.
[0043]
Data stream: The stream consists of n records r[0], r[1], ..., r[n-1], where each r[i] has k attributes a[0], a[1], ..., a[k-1], such as a sensor ID, a temperature generated by the temperature sensor, and a generation time.
[0044]
Patterns and predicates: A pattern consists of m predicates p[0](r), ..., p[m-1](r). Each p[j](r) (j = 0, 1, ..., m-1) contains conditional expressions (inequalities and equalities) between attributes of stream record r and constants, or between attributes of record r and attributes of other records in the data stream.
[0045]
This conditional expression is generally written in the form (r[i].a[q] OP F). Here, r[i].a[q] is the value of the q-th attribute (q = 0, 1, ..., k-1) of the i-th record (i = 0, 1, ..., n-1), and F represents an arbitrary function of constants and of attributes r[i1].a[q1] of other records, where i1 ranges over 0, 1, ..., n-1 except i, and q1 = 0, 1, ..., k-1. OP is an arbitrary comparison operator (OP ∈ {<, ≦, =, >, ≧, ≠}).
[0046]
For example, if the temperature attribute of record r is written r.temperature, the temperature attribute of r.next is written r.next.temperature, and C is a real constant, the following conditional expressions can be specified.
[0047]
(r.temperature OP C)
(r.temperature OP r.next.temperature + C)
A conjunctive predicate consists of a plurality of conditional expressions connected by logical AND.
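As a rough illustration of the form (r[i].a[q] OP F), the sketch below builds conditional expressions and conjunctive predicates as Python closures. The helper names `condition` and `conjunction`, the dictionary-based record layout, and the sample values are assumptions made for this example only.

```python
import operator

# The six comparison operators allowed for OP.
OPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">": operator.gt, ">=": operator.ge, "!=": operator.ne,
}


def condition(attr, op, f):
    """Build (r[i].attr OP F); f(records, i) computes the right-hand side F."""
    compare = OPS[op]
    return lambda records, i: compare(records[i][attr], f(records, i))


def conjunction(*conds):
    """A conjunctive predicate: every conditional expression must hold."""
    return lambda records, i: all(c(records, i) for c in conds)


records = [{"temperature": 42}, {"temperature": 40}]
C = 1.0

# (r.temperature < 50): F is a constant.
below_50 = condition("temperature", "<", lambda rs, i: 50)
# (r.temperature > r.next.temperature + C): F uses the next record.
falling = condition("temperature", ">",
                    lambda rs, i: rs[i + 1]["temperature"] + C)

p = conjunction(below_50, falling)
result = p(records, 0)   # evaluates 42 < 50 and 42 > 40 + 1.0
print(result)
```

Passing F as a function of the record list and the position keeps constants and references to neighbouring records in one uniform shape, which is one simple way to mirror the general form in the text.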
[0048]
Relationships between pattern predicates: All logical relationships between pattern predicates are expressed using a positive preconditional logical matrix θ and a negative preconditional logical matrix φ. The size of these matrices corresponds to the number m of pattern predicates, and their elements θ [j, k] and φ [j, k] are defined as follows.
[0049]
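(Equation 1)
\[
\theta[j,k] =
\begin{cases}
1 & \text{if } p[j] \Rightarrow p[k] \\
0 & \text{if } p[j] \Rightarrow \overline{p[k]} \\
U & \text{otherwise}
\end{cases}
\qquad
\phi[j,k] =
\begin{cases}
1 & \text{if } \overline{p[j]} \Rightarrow p[k] \\
0 & \text{if } \overline{p[j]} \Rightarrow \overline{p[k]} \\
U & \text{otherwise}
\end{cases}
\]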
[0050]
Here, p[j] written with an overbar (hereinafter referred to as p[j] bar) represents the negation of p[j],
[0051]
[0052]
and U indicates that it is unknown whether p[j] ⇒ p[k] or p[j] ⇒ p[k] bar holds.
In the pattern detection method of the present embodiment, the stream is scanned using a window whose size equals the number m of pattern predicates. As shown in FIG. 4, when window 41 contains records r[i], ..., r[i+m-1], the position of window 41 is represented by i.
[0053]
The stream records in window 41 are checked against the pattern predicates from right to left. Thus, if predicate p[m-1] holds for record r[i+m-1], record r[i+m-2] is checked against predicate p[m-2]. If predicate p[m-2] holds for record r[i+m-2], record r[i+m-3] is then checked against predicate p[m-3], and the same processing continues.
[0054]
If predicate p[j] is found not to hold for record r[i+j], matching in the current window fails. The current window 41 is then shifted to the right by the length given by shift[j], and the stream records in window 42 at position i + shift[j] are checked in the same way as in the previous window 41. shift[j] is calculated in advance as auxiliary information.
[0055]
In this method, to retrieve all stream record sequences that satisfy the sequence of m pattern predicates, the window position is first initialized to i = 0, which aligns the left end of the window with the first stream record r[0]. The window is then slid repeatedly until its right end passes the right end of the data stream. Such a pattern detection algorithm can be described, for example, as follows.
[0056]

Here, p[j](r[i]) denotes testing whether p[j] holds for the i-th record r[i] of the stream: p[j](r[i]) = 1 if r[i] satisfies p[j], and p[j](r[i]) = 0 otherwise.
[0057]
When the search pattern is found in the window at position i, j < 0. In that case, find(i) indicates that the stream records r[i], ..., r[i+m-1] satisfy the entire pattern p[0], ..., p[m-1]. The method of calculating the distance values stored in the shift array shift[j] is described later.
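The search loop described in the preceding paragraphs can be sketched in Python as follows. This is not the original listing, which is not reproduced in this text. The function name `search`, the check counter, and the `match_shift` parameter (how far to slide after a full match) are assumptions of this sketch. In the usage part, the shift entries for p[1] and p[2] are the shifts used in the FIG. 2 walkthrough, and the other two entries are set conservatively to 1.

```python
def search(records, predicates, shift, match_shift=1):
    """Slide a window left to right, checking predicates right to left.

    predicates[j](records, pos) tests p[j] on the record at position pos.
    shift[j] is the precomputed slide distance for a mismatch at p[j].
    match_shift is an assumption of this sketch: the slide after a match.
    Returns (list of matching window positions, number of predicate checks).
    """
    n, m = len(records), len(predicates)
    found, checks = [], 0
    i = 0
    while i + m <= n:                  # the window must fit in the stream
        j = m - 1
        while j >= 0:                  # right-to-left check inside the window
            checks += 1
            if not predicates[j](records, i + j):
                break
            j -= 1
        if j < 0:                      # whole pattern matched at position i
            found.append(i)            # plays the role of find(i)
            i += match_shift
        else:
            i += shift[j]
    return found, checks


TEMPS = [36, 42, 47, 36, 37, 42, 40, 37]
QUERY3 = [
    lambda s, i: s[i] < 38 and s[i] < s[i + 1],        # p[0]
    lambda s, i: 40 < s[i] < 50 and s[i] > s[i + 1],   # p[1]
    lambda s, i: s[i] > s[i + 1],                      # p[2]
    lambda s, i: s[i] < 38,                            # p[3]
]

naive = search(TEMPS, QUERY3, [1, 1, 1, 1])      # always slide by one record
skipping = search(TEMPS, QUERY3, [1, 1, 3, 1])   # three-record shift at p[2]
print(naive, skipping)
```

Both runs report the match at position 4. The all-ones shift array needs more predicate checks than the array with the three-record shift, which needs the nine checks counted in the FIG. 2 example.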
[0058]
If shift[j] = 1 for every j, so that the window slides by one position regardless of which predicate p[j] fails (and regardless of whether the entire pattern matched), detecting all stream record sequences that satisfy the sequence of m pattern predicates takes O(nm) time, where n is the total number of records.
[0059]
To reduce this large number of checks, the window is slid by more than one position where possible. In other words, when a mismatch occurs at predicate p[j], shift[j] is determined so that the window slides as far as possible without missing any matching candidate.
[0060]
As FIG. 4 shows, when a mismatch occurs at predicate p[j] in the window at position i, the stream records r[i+j+1] to r[i+m-1] satisfy the pattern predicates p[j+1] to p[m-1], respectively, whereas predicate p[j] does not hold for stream record r[i+j].
[0061]
When the window at position i is shifted right by k positions to a new position (i+k), there are three cases for the overlap between the new window and the stream records r[i+j], ..., r[i+m-1] that were checked against predicates p[j], ..., p[m-1]: full overlap, partial overlap, and no overlap.
[0062]
FIG. 5 shows the case of full overlap. When 0 < k ≦ j, window 51 shifts to a position (i+k) that is to the left of, or the same as, the position of record r[i+j], where the mismatch with predicate p[j] occurred. Therefore, the new window 51 contains all of the records r[i+j], ..., r[i+m-1].
[0063]
In window 51 at position (i+k), records r[i+j], ..., r[i+m-1] are checked against predicates p[j-k], ..., p[m-k-1]. To predict whether these predicates may hold for the corresponding records, the results of checking predicates p[j], ..., p[m-1] against these records can be used.
[0064]
Therefore, the shift k for which the pattern search may succeed is computed by analyzing the logical relationship between the subsequence p[j], ..., p[m-1], which was already checked against records r[i+j], ..., r[i+m-1], and the subsequence p[j-k], ..., p[m-k-1]. If the following relationship holds between these predicate subsequences, a shift of k positions may produce a window containing stream records that satisfy the pattern.
[0065]

Using θ and φ described above, this relationship can be described as follows. First, α [j, k] is defined by the following equation.
[0066]

Here, U ∧ 1 = U, U ∧ 0 = 0, and U ∧ U = U. If α[j,k] = 0, the window shifted by k positions cannot contain stream records that match the search pattern; if α[j,k] = 1, it contains matching records; and if α[j,k] = U, it may contain matching records.
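Consistent with the description above, one way to write α[j,k] in terms of θ and φ is given below. This is a reconstruction from the surrounding text, not a formula quoted from the source: the predicate p[j] that failed contributes a φ term, and each predicate p[t] (t = j+1, ..., m-1) that already held contributes a θ term.

```latex
\[
\alpha[j,k] \;=\; \phi[j,\,j-k] \;\wedge\; \bigwedge_{t=j+1}^{m-1} \theta[t,\,t-k],
\qquad 0 < k \le j
\]
```

For m = 4, j = 1, and k = 1 this gives α[1,1] = φ[1,0] ∧ θ[2,1] ∧ θ[3,2], which corresponds to the three compatibility questions asked in the FIG. 2 example before window 21 is shifted by one record.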
[0067]
Next, FIG. 6 shows the case of partial overlap. When j < k < m, window 41 shifts to a position (i+k) to the right of record r[i+j], where the mismatch with predicate p[j] occurred. Therefore, the new window 61 contains only the subsequence r[i+k], ..., r[i+m-1] of the records already checked.
[0068]
In window 61 at position (i+k), records r[i+k], ..., r[i+m-1] are checked against predicates p[0], ..., p[m-k-1]. To predict whether these predicates may hold for the corresponding records, the results of checking predicates p[k], ..., p[m-1] against these records can be used.
[0069]
Therefore, the shift k for which the pattern search may succeed is computed by analyzing the logical relationship between the subsequence p[k], ..., p[m-1], which was already checked against records r[i+k], ..., r[i+m-1] in window 41, and the subsequence p[0], ..., p[m-k-1]. If the following relationship holds between these predicate subsequences, a shift of k positions may produce a window containing stream records that satisfy the pattern.
[0070]

Here, β [j, k] is defined by the following equation using θ and φ.
[0071]

If β[j,k] = 0, the window shifted by k positions cannot contain stream records that match the search pattern; if β[j,k] = 1, it contains matching records; and if β[j,k] = U, it may contain matching records.
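In the same spirit, β[j,k] can be written as a conjunction of θ terms only, because in the partial-overlap case every record that remains in the new window had satisfied its predicate. As with α, this is a reconstruction consistent with the description and not a formula quoted from the source.

```latex
\[
\beta[j,k] \;=\; \bigwedge_{t=k}^{m-1} \theta[t,\,t-k],
\qquad j < k < m
\]
```

For m = 4, j = 2, and k = 3 this reduces to β[2,3] = θ[3,0], which is the question of whether a value satisfying p[3] may also satisfy p[0], as asked in the FIG. 2 example before window 22 is shifted by three records.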
[0072]
Next, FIG. 7 shows the case of no overlap. When k ≧ m, window 41 shifts to a position (i+k) to the right of the rightmost record r[i+m-1]. In particular, when k = m, the first record position in the new window 71 coincides with the record position immediately after the previous window 41. Therefore, the new window 71 contains none of the records r[i+j], ..., r[i+m-1].
[0073]
Since none of the stream records in the new window 71 has been checked yet, there is no reason to deny that they are candidates for stream records that match the search pattern.
[0074]
From the above three cases, a window shift of k positions may slide the window to a position containing stream records that satisfy the pattern predicates when any of the following conditions holds.
(A) α[j,k] ≠ 0 for 0 < k ≦ j
(B) β[j,k] ≠ 0 for j < k < m
(C) k ≧ m
Since no matching candidate may be missed in the search, shift[j] is calculated as the minimum rightward sliding distance for which a match with the pattern is still possible.
[0075]
First, for the subsequence p[j-k], p[j-k+1], ..., p[m-k-1], the minimum k satisfying α[j,k] ≠ 0 is sought. If no such k exists, then for the longest leftmost subsequence p[0], ..., p[m-k-1], the minimum k satisfying β[j,k] ≠ 0 is sought. If no such k exists either, every shift with k < m fails. In that case shift[j] = m, and the window is shifted to the right so that the first record position in the new window coincides with the record position immediately after the previous window. The method of obtaining shift[j] is summarized as follows.
[0076]
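(Equation 2)
\[
shift[j] =
\begin{cases}
\min\{\,k \mid 0 < k \le j,\ \alpha[j,k] \ne 0\,\} & \text{if such a } k \text{ exists} \\
\min\{\,k \mid j < k < m,\ \beta[j,k] \ne 0\,\} & \text{otherwise, if such a } k \text{ exists} \\
m & \text{otherwise}
\end{cases}
\]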
[0077]
A straightforward way of calculating the distances stored in the shift array shift[j] first computes α[j,k] and β[j,k] for all j and k, which requires a very large amount of computation. The following introduces simple program code that obtains the shift array efficiently, with O(m) computation.
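The straightforward calculation mentioned above can be sketched as follows. This is the direct method, not the O(m) code the text goes on to describe. The `alpha` and `beta` functions assume the reading of the description in which α[j,k] = φ[j, j-k] ∧ θ[j+1, j+1-k] ∧ ... ∧ θ[m-1, m-1-k] and β[j,k] = θ[k, 0] ∧ ... ∧ θ[m-1, m-1-k]. The θ and φ tables for Query 3 were filled in by hand for this illustration, with only the entries j > k populated.

```python
U = "U"  # the "unknown" truth value


def and3(a, b):
    """Three-valued AND: 0 dominates, then U, otherwise 1."""
    if a == 0 or b == 0:
        return 0
    if a == U or b == U:
        return U
    return 1


def alpha(theta, phi, j, k, m):
    """Full overlap (0 < k <= j): the failed p[j] plus predicates that held."""
    value = phi[j][j - k]
    for t in range(j + 1, m):
        value = and3(value, theta[t][t - k])
    return value


def beta(theta, k, m):
    """Partial overlap (j < k < m): only predicates that held are involved."""
    value = 1
    for t in range(k, m):
        value = and3(value, theta[t][t - k])
    return value


def shift_table(theta, phi):
    """Smallest k per j that may still lead to a match, else m."""
    m = len(theta)
    shift = []
    for j in range(m):
        ks = [k for k in range(1, j + 1) if alpha(theta, phi, j, k, m) != 0]
        if not ks:
            ks = [k for k in range(j + 1, m) if beta(theta, k, m) != 0]
        shift.append(min(ks) if ks else m)
    return shift


# Hand-filled tables for Query 3 (rows j, columns k; None = unused entry).
N = None
THETA = [[N, N, N, N],
         [0, N, N, N],
         [0, U, N, N],
         [U, 0, U, N]]
PHI = [[N, N, N, N],
       [U, N, N, N],
       [U, 0, N, N],
       [0, U, U, N]]

SHIFT = shift_table(THETA, PHI)
print(SHIFT)
```

With these tables the sketch gives a one-record shift for a mismatch at p[1] and a three-record shift for a mismatch at p[2], in line with the basic processing of FIG. 2.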
[0078]
First, the elements of the auxiliary array compati [i] defined by the following equation are calculated.

As shown in FIG. 8, compati[i] at position i is the maximum length len for which the pattern subsequence p[i-len+1], ..., p[i] of length len is compatible with the rightmost subsequence p[m-len], ..., p[m-1] of the same length.
[0079]
The value of compati [i] is calculated from right to left (from i = m-1 to i = 0). The basic idea is to use the value of a certain compati [i] as much as possible to calculate the value to the left (the value for the smaller i). This idea is illustrated using a very simple pattern example consisting of the following four pattern predicates:
[0080]
p[3]: r.level > 10
p[2]: r.level > 5
p[1]: r.level > 3
p[0]: r.level > 1
In this case, p[i] ⇒ p[i] holds for any p[i], and thus compati[3] = 4 (corresponding to the first line of the code shown below). To compute compati[2], the relationships between p[3] and p[2], between p[2] and p[1], and between p[1] and p[0] must be checked. As shown in FIG. 9, since p[3] ⇒ p[2], p[2] ⇒ p[1], and p[1] ⇒ p[0], compati[2] = 3 (corresponding to lines 7-12 of the code shown below).
[0081]
Here, since p[3] ⇒ p[2] and p[2] ⇒ p[1], it follows that p[3] ⇒ p[1]. Furthermore, since p[2] ⇒ p[1] and p[1] ⇒ p[0], it follows that p[2] ⇒ p[0]. Therefore, without directly checking the relationship between p[3] and p[1] or the relationship between p[2] and p[0], it can be concluded that compati[1] = 2, as shown in FIG. 10 (corresponding to lines 4-5 of the code shown below).
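The compati values of this four-predicate example can be reproduced with a direct computation from the definition. The sketch below is not the O(m) code referred to in the text, since it tests every pair explicitly instead of reusing earlier results. It assumes that "compatible" means θ ≠ 0 for each aligned pair, and the `theta` helper covers only threshold predicates of the form r.level > c.

```python
U = "U"
THRESHOLDS = [1, 3, 5, 10]   # p[i]: r.level > THRESHOLDS[i]


def theta(j, k):
    """theta[j, k] for threshold predicates of the form 'level > c'.

    level > a implies level > b exactly when a >= b. It never implies the
    negation of another such predicate, so the remaining case is U.
    """
    return 1 if THRESHOLDS[j] >= THRESHOLDS[k] else U


def compati(i, m):
    """Largest len such that p[i-len+1..i] is compatible with p[m-len..m-1]."""
    length = 0
    # Extend leftwards while the next aligned pair is not contradictory.
    while length <= i and theta(m - 1 - length, i - length) != 0:
        length += 1
    return length


m = len(THRESHOLDS)
COMPATI = [compati(i, m) for i in range(m)]
print(COMPATI)
```

The result agrees with the values stated above: compati[3] = 4, compati[2] = 3, and compati[1] = 2.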
[0082]
For example, the following is conceivable as a simple program code for efficiently calculating the value of the compati [i].

Once the array compati[i] has been determined, the shift array is calculated. As shown in the following program code, the shift array is first filled with the value for the no-overlap case; then, where possible, values for the partial-overlap case are stored using compati[i]; and finally values for the full-overlap case are stored.
[0083]

Lines 1 and 2 of the shift array calculation code store m as the shift value for the no-overlap case. As described above, this shift value is used when no subsequence is compatible with a rightmost subsequence of the pattern predicates. Therefore, as shown in FIG. 11, the window is shifted so that it contains none of the records already matched against the pattern predicates.
[0084]
Lines 3 to 8 set the shift values for the partial-overlap case. In this case, as shown in FIG. 12, the minimum shift value is calculated such that the longest leftmost subsequence p[0], ..., p[i] is compatible with the rightmost subsequence p[m-1-i], ..., p[m-1].
[0085]
The right end of the leftmost subsequence p[0], ..., p[i] is p[i], and at that position compati[i] = i+1. Therefore, when the compatibility information stored in the array compati[i] gives compati[i] = i+1, the window is shifted so that the leftmost subsequence p[0], ..., p[i] is aligned with the compatible rightmost subsequence p[m-1-i], ..., p[m-1].
[0086]
Finally, lines 9 to 11 set the shift values for the full-overlap case. In this case, as shown in FIG. 13, the minimum shift value is calculated such that the subsequence p[m-compati[i]], ..., p[m-1] is compatible with the subsequence p[i-compati[i]+1], ..., p[i], and p[m-compati[i]-1] bar is compatible with the pattern predicate p[i-compati[i]].
[0087]
As already checked when compati[i] was calculated, p[m-compati[i]-1] ⇒ p[i-compati[i]] bar holds (θ[m-compati[i]-1, i-compati[i]] = 0). If, in addition, p[m-compati[i]-1] bar may imply p[i-compati[i]] (φ[m-compati[i]-1, i-compati[i]] ≠ 0), the two conditions above are satisfied. Therefore, shift[m-compati[i]-1] = m-1-i can be set.
[0088]
When several values of i have the same compati[i], the requirement of finding the minimum shift value, so that no matching candidate is missed in the search, means that the value for the rightmost (largest) i must be the one finally stored in the shift array as shift[m-compati[i]-1] = m-1-i. For this reason, the array compati[i] is scanned from left to right (from i = 0 to i = m-2).
[0089]
In the calculation of shift[j] in the basic processing described above, the individual inequalities of a conjunctive predicate are not considered separately when the compatibility between pattern predicates is analyzed. If they are handled separately, a larger shift may be obtained, as described above.
[0090]
Therefore, in the extended processing, a conjunctive predicate p[j] is split into a "range predicate p1[j]", consisting of conditional expressions that use constants, and a "relation predicate p2[j]", consisting of conditional expressions that use values of adjacent records, and is written as p[j] = p1[j] ∧ p2[j]. In this case too, shift[j] can be obtained by a direct extension of the calculation method of the basic processing. The main differences from the basic processing are as follows.
[0091]
In the basic processing only p[j] bar is considered, whereas the extended processing considers three cases: p1[j] bar ∧ p2[j], p1[j] ∧ p2[j] bar, and p1[j] bar ∧ p2[j] bar. Corresponding to these three cases, φ[j, j-k] above is decomposed into φ1[j, j-k], φ2[j, j-k], and φ12[j, j-k], and α[j, j-k] is decomposed into α1[j, j-k], α2[j, j-k], and α12[j, j-k].
[0092]
Accordingly, shift[j] is also precomputed as three values, shift1[j], shift2[j], and shift12[j]. When a mismatch occurs at predicate p[j], one of the three shift values shift1[j], shift2[j], and shift12[j] is selected depending on whether only the range predicate, only the relation predicate, or both failed.
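The selection step can be sketched as a small Python function. The function name `select_shift` and its argument layout are assumptions. In the sample tables only shift2[1] = 3 reflects the text (the three-record shift of FIG. 3 after the relation predicate of p[1] fails), and every other entry is a placeholder of 1 chosen for illustration.

```python
def select_shift(j, range_ok, relation_ok, shift1, shift2, shift12):
    """Pick the slide distance for a mismatch at p[j] = p1[j] AND p2[j].

    shift1 applies when only the range predicate failed, shift2 when only
    the relation predicate failed, and shift12 when both failed.
    """
    if range_ok and relation_ok:
        raise ValueError("p[j] holds, so there is no mismatch to handle")
    if not range_ok and not relation_ok:
        return shift12[j]
    return shift1[j] if not range_ok else shift2[j]


# Placeholder tables; only SHIFT2[1] = 3 is taken from the FIG. 3 example.
SHIFT1 = [1, 1, 1, 1]
SHIFT2 = [1, 3, 1, 1]
SHIFT12 = [1, 1, 1, 1]

# Window 21: the value 42 passes the range part of p[1] but fails the relation part.
chosen = select_shift(1, True, False, SHIFT1, SHIFT2, SHIFT12)
print(chosen)
```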
[0093]
The data processing device of the present embodiment is configured using, for example, an information processing device (computer) as shown in FIG. 14. The information processing device of FIG. 14 includes a CPU (central processing unit) 81, a memory 82, an input device 83, an output device 84, an external storage device 85, a medium drive device 86, and a network connection device 87, which are connected to each other.
[0094]
The memory 82 includes, for example, a ROM (read only memory), a RAM (random access memory), and the like, and stores programs and data used for processing. The CPU 81 performs necessary processing by executing a program using the memory 82.
[0095]
The input device 83 is, for example, a keyboard, a pointing device, a touch panel, or the like, and is used for inputting an instruction or information from a user. The output device 84 is, for example, a display device, a printer, a speaker, or the like, and is used for inquiring a user or outputting a processing result.
[0096]
The external storage device 85 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, a tape device, or the like. The information processing device stores the above-described program and data in the external storage device 85, and uses them by loading them into the memory 82 as necessary.
[0097]
The medium driving device 86 drives the portable recording medium 89 and accesses the recorded contents. As the portable recording medium 89, any computer-readable recording medium such as a memory card, a flexible disk, a compact disk read only memory (CD-ROM), an optical disk, and a magneto-optical disk is used. The user stores the above-described program and data in the portable recording medium 89, and uses them by loading them into the memory 82 as necessary.
[0098]
The network connection device 87 is connected to an arbitrary communication network such as a LAN (local area network) and performs data conversion accompanying communication. The information processing device receives the above-described program and data from another device via the network connection device 87, and uses them by loading them into the memory 82 as necessary.
[0099]
FIG. 15 shows computer-readable recording media capable of supplying programs and data to the information processing device of FIG. 14. The programs and data stored in the portable recording medium 89 or in the database 91 of the server 90 are loaded into the memory 82. At this time, the server 90 generates a carrier signal for carrying the programs and data and transmits it to the information processing device via an arbitrary transmission medium on the network. The CPU 81 then executes the programs using the data and performs the necessary processing.
[0100]
Next, a description will be given of the result of a simulation comparing the naive approach in which the window is always shifted one record at a time, the search processing by the OPS algorithm described above, and the search processing of the present invention. The following were used as queries and stream data.
[0101]
Query
"Extract all patterns in which a company's stock price falls twice in a row and then rises twice in a row, where the declines bring the stock price to between 40 and 50 and the stock price after the first rise does not exceed 52."
Stream data
55 50 45 51 51 54 50 47 49 45 42 42 55 57 59 60 57
FIGS. 16 to 19 show changes in the values of the position i on the stream data and the position j on the search pattern when a pattern search is performed using such a query and stream data. Among them, FIG. 16 shows the result by the naive approach, FIG. 17 shows the result by the OPS algorithm, and FIG. 18 shows the result by the basic processing of the present invention (processing in which the conditional expression of the connecting predicate is not separately handled). FIG. 19 shows a result of the extended processing (processing for separately treating a conditional expression of a connecting predicate) according to the present invention. A square mark on the graph indicates a point at which a stream record was checked for a pattern predicate.
[0102]
In the naive approach of FIG. 16, the processing of nine windows w1 to w9 is performed, and the check of the pattern predicate on the stream record is performed 20 times. In the OPS algorithm shown in FIG. 17, the search is completed by processing seven windows w1 to w7 and checking 16 times.
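The naive approach referred to above can be sketched as follows. This is only a minimal illustration and not the simulation behind FIG. 16: the function name `naive_search`, the three toy predicates, and the sample values are all assumptions introduced here. The sketch slides the window one record at a time, checks the predicates from left to right, and counts every predicate check.

```python
from typing import Callable, List, Sequence, Tuple

# A predicate receives the whole record sequence and the index of the record
# being checked, so that it may refer to neighbouring records.
Predicate = Callable[[Sequence[float], int], bool]


def naive_search(records: Sequence[float],
                 pattern: List[Predicate]) -> Tuple[List[int], int]:
    """Slide a window one record at a time, checking predicates left to right.

    Returns the start positions of all matching windows and the number of
    predicate checks performed.
    """
    m = len(pattern)
    matches: List[int] = []
    checks = 0
    for start in range(len(records) - m + 1):
        for j in range(m):
            checks += 1
            if not pattern[j](records, start + j):
                break  # mismatch: abandon this window, shift by one record
        else:
            matches.append(start)  # every predicate held
    return matches, checks


# Hypothetical pattern: a value below 10, then a rise, then a fall.
toy_pattern: List[Predicate] = [
    lambda r, i: r[i] < 10,
    lambda r, i: r[i] > r[i - 1],
    lambda r, i: r[i] < r[i - 1],
]

toy_records = [12, 5, 8, 6, 9, 11, 7]
found, n_checks = naive_search(toy_records, toy_pattern)
print(found, n_checks)
```

Because the window always advances by exactly one record, a record that has already been checked in one window is checked again in the following windows, which is the backtracking on i visible in the naive result.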
[0103]
On the other hand, in the basic processing of FIG. 18, the search is completed by processing eight windows w1 to w8 and performing eleven checks, and in the extended processing of FIG. 19, the search is completed by processing seven windows w1 to w7 and performing ten checks. Therefore, in this example, the extended processing shows the best performance.
[0104]
In the naive approach, backtracking occurs, in which the value of i decreases between successive points as the processing proceeds, but such backtracking does not occur in the OPS algorithm. However, in the OPS algorithm, mismatched records are rechecked repeatedly, as shown by the successive points having the same value of i.
[0105]
In contrast, in the basic processing of the present invention, the number of windows processed is not smaller than that of the OPS algorithm, but records are rechecked only in the window w8, where no mismatch occurs (that is, where the pattern is completely detected). Further, in the extended processing, which separates and analyzes the relational predicates and the range predicates, after a mismatch occurs at the record r[3] (i = 3) in the first window w1, the window is shifted farther than in the basic processing, and the check of r[4] (i = 4) is skipped.
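The idea of checking each window from the tail predicate toward the head predicate and then shifting by a precomputed amount can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the preprocessing of the present invention: every predicate is assumed to be a closed range on a single value, so that "compatible" reduces to interval overlap and "implies" reduces to interval containment. The names `overlaps`, `contained`, `safe_shift`, and `r2l_search` and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # closed range [lo, hi]; assumes lo <= hi


def overlaps(a: Interval, b: Interval) -> bool:
    """True if some value can satisfy both range predicates (compatible)."""
    return a[0] <= b[1] and b[0] <= a[1]


def contained(a: Interval, b: Interval) -> bool:
    """True if every value satisfying a also satisfies b (a implies b)."""
    return b[0] <= a[0] and a[1] <= b[1]


def safe_shift(pattern: Sequence[Interval], j: int) -> int:
    """Smallest shift that cannot skip a match after a mismatch at p[j].

    p[j+1..m-1] are known to hold; j = -1 means the whole window matched.
    """
    m = len(pattern)
    for d in range(1, m):
        # Records that satisfied p[k] must be able to satisfy p[k-d].
        if not all(overlaps(pattern[k], pattern[k - d])
                   for k in range(max(j + 1, d), m)):
            continue
        # The record that failed p[j] cannot satisfy a p[j-d] implying p[j].
        if j - d >= 0 and contained(pattern[j - d], pattern[j]):
            continue
        return d
    return m  # a window that shares no record with the old one is always safe


def r2l_search(records: Sequence[float],
               pattern: Sequence[Interval]) -> Tuple[List[int], int]:
    """Check each window right to left; return (match starts, check count)."""
    m = len(pattern)
    if m == 0:
        return [], 0
    shift = [safe_shift(pattern, j) for j in range(m)]
    match_shift = safe_shift(pattern, -1)
    matches: List[int] = []
    checks, start = 0, 0
    while start + m <= len(records):
        j = m - 1
        while j >= 0:
            checks += 1
            lo, hi = pattern[j]
            if not lo <= records[start + j] <= hi:
                break
            j -= 1
        if j < 0:
            matches.append(start)
            start += match_shift
        else:
            start += shift[j]
    return matches, checks


toy_pattern = [(0, 10), (20, 30), (0, 10), (40, 50)]
toy_records = [5, 25, 33, 45, 5, 25, 5, 45, 9]
found, n_checks = r2l_search(toy_records, toy_pattern)
print(found, n_checks)
```

With these toy values the first window fails at p[2] after p[3] has matched. Since no value in the range of p[3] can lie in the range of p[0], p[1], or p[2], the window is moved four records at once, and the records in between are never checked.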
[0106]
Thus, it can be seen that the search processing of the present invention is an algorithm that is more efficient than the naive approach and the OPS algorithm.
Next, the result of a simulation using other queries will be described. Each of the ten patterns P1 to P10 listed below is described by seven predicates, and these predicates were used to search for consecutive records whose water level (and temperature) attributes satisfy each condition. As the stream data, 100,000 randomly generated records, each holding a water level and a temperature measured by a sensor, were used.
[0107]
P1: In seven consecutive measurements, the water level is in the range between -2 and +2 with respect to the adjacent measurements.
P2: In seven consecutive measurements, the water level continues to rise.
P3: The water level continues to drop in seven consecutive measurements.
P4: The water level continues to rise in the first half of seven consecutive measurements, and stabilizes or decreases in the second half.
P5: Three consecutive water level spikes occur. That is, a phenomenon in which the water level suddenly rises above 10 and then suddenly falls below 10 is repeated three times.
P6: In seven consecutive measurements, the water level is in the range between -2 and +2 relative to the adjacent measurement, and the temperature continues to rise.
P7: In seven consecutive measurements, the water level continues to rise and the temperature continues to decrease.
P8: In seven consecutive measurements, the water level is in the range between -2 and +2 with respect to the adjacent measurements, the temperature continues to rise in the first half and either stabilizes or falls in the second half.
P9: In seven consecutive measurements, the water level is in the range between -2 and +2 with respect to the adjacent measurement value, and three consecutive temperature spikes occur. That is, a phenomenon in which the temperature suddenly rises more than 10 and then suddenly drops more than 10 is repeated three times.
P10: In seven consecutive measurements, the water level continues to rise, and three consecutive temperature spikes occur.
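As an illustration of how such patterns might be written as seven predicates, the following sketch encodes P1, P2, and P6. It is an assumption-laden example rather than the encoding used in the simulation: the `Record` type, the helper names, the inclusive reading of "between -2 and +2", and the choice that each predicate compares a record with its immediate predecessor are all introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Record:
    level: float  # water level
    temp: float   # temperature


Predicate = Callable[[Sequence[Record], int], bool]


def level_stable(r: Sequence[Record], i: int) -> bool:
    """Water level within -2..+2 of the previous measurement (assumed inclusive)."""
    return abs(r[i].level - r[i - 1].level) <= 2


def level_rises(r: Sequence[Record], i: int) -> bool:
    return r[i].level > r[i - 1].level


def temp_rises(r: Sequence[Record], i: int) -> bool:
    return r[i].temp > r[i - 1].temp


def all_of(*conds: Predicate) -> Predicate:
    """Connect several conditional expressions into one predicate."""
    return lambda r, i: all(c(r, i) for c in conds)


# Seven predicates per pattern, one per measurement.
P1: List[Predicate] = [level_stable] * 7
P2: List[Predicate] = [level_rises] * 7
P6: List[Predicate] = [all_of(level_stable, temp_rises)] * 7


def matches_at(records: Sequence[Record], pattern: List[Predicate],
               start: int) -> bool:
    """True if pattern[j] holds for records[start + j] for every j.

    start must be at least 1 because each predicate looks one record back.
    """
    if start < 1 or start + len(pattern) > len(records):
        return False
    return all(p(records, start + j) for j, p in enumerate(pattern))


steady = [Record(level=float(v), temp=float(v)) for v in range(8)]
steep = [Record(level=3.0 * v, temp=20.0) for v in range(8)]
print(matches_at(steady, P6, 1), matches_at(steep, P1, 1))
```

P1 to P5 use a single conditional expression per predicate, whereas P6 to P10 connect a water level condition and a temperature condition, which is what `all_of` stands for in this sketch.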
[0108]
FIG. 20 shows the number of predicate checks performed by each of the naive approach (Naive), the OPS algorithm (OPS), the basic processing of the present invention (R2L), the extended OPS algorithm (OPS-extended), and the extended processing of the present invention (R2L-extended).
[0109]
The extended OPS algorithm is a variant of the OPS algorithm in which the conditional expressions of a connected predicate are handled separately, in the same manner as in the extended processing of the present invention. For the patterns P1 to P5, each predicate includes only one conditional expression, so the simulations of the extended OPS algorithm and of the extended processing of the present invention were not performed.
[0110]
FIG. 21 shows a bar graph of the simulation result (the number of checks) of FIG. 20. Reference numerals 101, 102, 103, 104, and 105 correspond to the number of checks for the naive approach, OPS, the basic processing of the present invention, extended OPS, and the extended processing of the present invention, respectively.
[0111]
From this graph, it can be seen that, for every pattern, the number of checks for OPS, extended OPS, the basic processing of the present invention, and the extended processing of the present invention is smaller than that of the naive approach. The number of checks for OPS and extended OPS is equal to or slightly larger than the total number of records (100,000), which means that backtracking does not occur for most patterns in these processes.
[0112]
Except for OPS on the patterns P4 and P9, the number of checks of the basic processing and the extended processing of the present invention is smaller than that of OPS and extended OPS, respectively. In particular, for the patterns P1, P2, P3, P6, and P7, the number of checks of the basic processing and the extended processing of the present invention is less than half that of OPS and extended OPS, respectively. For the patterns P8, P9, and P10, the number of checks is significantly reduced by using the extended processing instead of the basic processing.
[0113]
As described above, the simulation of searching for ten different patterns in a large amount of randomly generated data shows that the search processing of the present invention often performs much better than the naive approach or the OPS algorithm. The extended processing was also found to be more efficient than the basic processing.
(Supplementary Note 1) A data processing device for extracting a data sequence consisting of m ordered data from a plurality of ordered data, comprising:
input means for inputting a condition sequence consisting of m ordered conditions that respectively specify the m data using relationships between data;
preprocessing means for analyzing compatibility between the m conditions and generating auxiliary information used for data extraction;
search means for, when searching the ordered data for the data sequence from the first data toward the last data, checking whether the data satisfy the corresponding conditions of the condition sequence in a direction opposite to the search direction, and determining the next check start position using the auxiliary information when checked data do not satisfy the corresponding condition; and
output means for outputting information on a data sequence that satisfies all the conditions of the condition sequence.
(Supplementary Note 2) The data processing device according to Supplementary Note 1, wherein the preprocessing means generates an array representing, for each position on the condition sequence consisting of the m conditions, the maximum length of a partial condition sequence that has that position as its right end and is compatible with the partial condition sequence at the right end of the condition sequence, and generates the auxiliary information using the array.
(Supplementary Note 3) A data processing device that, where p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a conditional expression on an attribute value of a record r[i] (i = {0, 1, ..., n-1}), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], the device comprising:
input means for inputting the pattern;
preprocessing means for analyzing compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
search means for, when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checking whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determining the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
output means for outputting information on a record sequence that satisfies all the predicates of the pattern.
(Supplementary Note 4) The data processing device according to Supplementary Note 3, wherein, when a plurality of conditional expressions are included in the predicate, the preprocessing means separates and analyzes the plurality of conditional expressions.
(Supplementary Note 5) The data processing device according to Supplementary Note 3, wherein the preprocessing means generates an array representing, for each position j, the maximum length len for which the predicate sequence p[j-len+1], ..., p[j] having the position j as its right end is compatible with the predicate sequence p[m-len], ..., p[m-1], and generates the auxiliary information using the array.
(Supplementary Note 6) A data processing device that, where r[i] (i = {0, 1, ..., n-1}) denotes a record having k attributes a[0], a[1], ..., a[k-1], r[i].a[q] denotes the attribute a[q] (q = {0, 1, ..., k-1}) of r[i], F denotes an arbitrary function including constants and attributes r[i1].a[q1] (i1 = {0, 1, ..., n-1} excluding i, q1 = {0, 1, ..., k-1}), OP denotes any operator among {<, ≦, =, >, ≧, ≠}, and p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a combination of conditional expressions of the form (r[i].a[q] OP F), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], the device comprising:
input means for inputting the pattern;
preprocessing means for analyzing compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
search means for, when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checking whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determining the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
output means for outputting information on a record sequence that satisfies all the predicates of the pattern.
(Supplementary Note 7) A program for a computer that extracts a data sequence consisting of m ordered data from a plurality of ordered data, the program causing the computer to execute processing comprising:
inputting a condition sequence consisting of m ordered conditions that respectively specify the m data using relationships between data;
analyzing compatibility between the m conditions and generating auxiliary information used for data extraction;
when searching the ordered data for the data sequence from the first data toward the last data, checking whether the data satisfy the corresponding conditions of the condition sequence in a direction opposite to the search direction, and determining the next check start position using the auxiliary information when checked data do not satisfy the corresponding condition; and
outputting information on a data sequence that satisfies all the conditions of the condition sequence.
(Supplementary Note 8) A program for a computer that, where p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a conditional expression on an attribute value of a record r[i] (i = {0, 1, ..., n-1}), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], the program causing the computer to execute processing comprising:
inputting the pattern;
analyzing compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checking whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determining the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
outputting information on a record sequence that satisfies all the predicates of the pattern.
(Supplementary Note 9) The program according to Supplementary Note 8, wherein, when generating the auxiliary information, the computer separates and analyzes a plurality of conditional expressions if the predicate includes the plurality of conditional expressions.
(Supplementary Note 10) A computer-readable recording medium recording a program for a computer that extracts a data sequence consisting of m ordered data from a plurality of ordered data, the program causing the computer to execute processing comprising:
inputting a condition sequence consisting of m ordered conditions that respectively specify the m data using relationships between data;
analyzing compatibility between the m conditions and generating auxiliary information used for data extraction;
when searching the ordered data for the data sequence from the first data toward the last data, checking whether the data satisfy the corresponding conditions of the condition sequence in a direction opposite to the search direction, and determining the next check start position using the auxiliary information when checked data do not satisfy the corresponding condition; and
outputting information on a data sequence that satisfies all the conditions of the condition sequence.
(Supplementary Note 11) A computer-readable recording medium recording a program for a computer that, where p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a conditional expression on an attribute value of a record r[i] (i = {0, 1, ..., n-1}), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], the program causing the computer to execute processing comprising:
inputting the pattern;
analyzing compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checking whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determining the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
outputting information on a record sequence that satisfies all the predicates of the pattern.
(Supplementary Note 12) The recording medium according to Supplementary Note 11, wherein, when the auxiliary information is generated, a plurality of conditional expressions are separated and analyzed if the predicate includes the plurality of conditional expressions.
(Supplementary Note 13) A carrier signal for carrying, to a computer that extracts a data sequence consisting of m ordered data from a plurality of ordered data, a program that causes the computer to execute processing comprising:
inputting a condition sequence consisting of m ordered conditions that respectively specify the m data using relationships between data;
analyzing compatibility between the m conditions and generating auxiliary information used for data extraction;
when searching the ordered data for the data sequence from the first data toward the last data, checking whether the data satisfy the corresponding conditions of the condition sequence in a direction opposite to the search direction, and determining the next check start position using the auxiliary information when checked data do not satisfy the corresponding condition; and
outputting information on a data sequence that satisfies all the conditions of the condition sequence.
(Supplementary Note 14) A carrier signal for carrying a program to a computer that, where p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a conditional expression on an attribute value of a record r[i] (i = {0, 1, ..., n-1}), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], the program causing the computer to execute processing comprising:
inputting the pattern;
analyzing compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checking whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determining the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
outputting information on a record sequence that satisfies all the predicates of the pattern.
(Supplementary Note 15) The carrier signal according to Supplementary Note 14, wherein, when the auxiliary information is generated, a plurality of conditional expressions are separated and analyzed if the predicate includes the plurality of conditional expressions.
(Supplementary Note 16) A data processing method for extracting a data sequence consisting of m ordered data from a plurality of ordered data, wherein:
input means inputs a condition sequence consisting of m ordered conditions that respectively specify the m data using relationships between data;
preprocessing means analyzes compatibility between the m conditions and generates auxiliary information used for data extraction;
search means, when searching the ordered data for the data sequence from the first data toward the last data, checks whether the data satisfy the corresponding conditions of the condition sequence in a direction opposite to the search direction, and determines the next check start position using the auxiliary information when checked data do not satisfy the corresponding condition; and
output means outputs information on a data sequence that satisfies all the conditions of the condition sequence.
(Supplementary Note 17) A data processing method that, where p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a conditional expression on an attribute value of a record r[i] (i = {0, 1, ..., n-1}), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], wherein:
input means inputs the pattern;
preprocessing means analyzes compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generates auxiliary information used for pattern extraction;
search means, when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checks whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determines the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
output means outputs information on a record sequence that satisfies all the predicates of the pattern.
(Supplementary note 18) The data processing method according to supplementary note 17, wherein, when a plurality of conditional expressions are included in the predicate, the preprocessing unit separates and analyzes the plurality of conditional expressions.
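The array described in Supplementary Notes 2 and 5 can be illustrated with the following sketch. It is offered only as one possible reading of those notes: the function name `compati_array` is hypothetical, the pairwise compatibility test is passed in as a parameter, and the toy predicates are assumed to be closed ranges on a single value so that compatibility can be decided by interval overlap. For each position j the sketch computes the maximum length len for which p[j-len+1], ..., p[j] is compatible, element by element, with p[m-len], ..., p[m-1].

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

P = TypeVar("P")


def compati_array(pattern: Sequence[P],
                  compatible: Callable[[P, P], bool]) -> List[int]:
    """For each right end j, the longest run ending at j that is pairwise
    compatible with the suffix of the same length of the whole pattern."""
    m = len(pattern)
    result: List[int] = []
    for j in range(m):
        length = 0
        # Extend leftwards while p[j - length] and p[m - 1 - length] are
        # compatible; the run cannot extend past the start of the pattern.
        while length <= j and compatible(pattern[j - length],
                                         pattern[m - 1 - length]):
            length += 1
        result.append(length)
    return result


Interval = Tuple[float, float]  # closed range [lo, hi], an assumption


def overlaps(a: Interval, b: Interval) -> bool:
    """Two range predicates are compatible if one value can satisfy both."""
    return a[0] <= b[1] and b[0] <= a[1]


toy_pattern = [(0, 10), (20, 30), (5, 15), (25, 35)]
compati = compati_array(toy_pattern, overlaps)
print(compati)
```

A position with value 0 cannot be aligned with the tail predicate at all, so a window shift that would place the pattern tail there can be ruled out during preprocessing.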
[0114]
[Effect of the Invention]
According to the present invention, in processing that extracts an order pattern specified by a complicated query from given data, unnecessary checks between the specified pattern and the data can be skipped, and data sequences matching the pattern can be detected efficiently.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a data processing device of the present invention.
FIG. 2 is a diagram showing basic window processing.
FIG. 3 is a diagram illustrating extended window processing.
FIG. 4 is a diagram showing a window shift of shift [j].
FIG. 5 is a diagram showing an overall overlap.
FIG. 6 is a diagram showing a partial overlap.
FIG. 7 illustrates a window shift without overlap.
FIG. 8 is a diagram showing compati [i].
FIG. 9 is a diagram showing compati [2].
FIG. 10 is a diagram showing compati [1].
FIG. 11 is a diagram showing a pattern predicate when there is no overlap.
FIG. 12 is a diagram showing a pattern predicate in the case of partial overlap.
FIG. 13 is a diagram showing a pattern predicate in the case of overall overlap.
FIG. 14 is a configuration diagram of an information processing apparatus.
FIG. 15 is a diagram showing a recording medium.
FIG. 16 is a diagram showing a processing result of a naive approach.
FIG. 17 is a diagram showing a processing result of the OPS algorithm.
FIG. 18 is a diagram showing a processing result of the basic processing of the present invention.
FIG. 19 is a diagram showing a processing result of the extension processing of the present invention.
FIG. 20 is a diagram showing the number of checks in five search processes.
FIG. 21 is a diagram showing a graph of the number of checks.
[Explanation of symbols]
11 Input means
12 Preprocessing means
13 Search means
14 Output means
21, 22, 23, 31, 32, 41, 42, 51, 61, 71 Window
81 CPU
82 Memory
83 Input device
84 Output device
85 External storage device
86 Medium driving device
87 Network connection device
88 Bus
89 Portable recording medium
90 Server
91 Database
101, 102, 103, 104, 105 Number of checks

Claims

A data processing device for extracting a data sequence consisting of m ordered data from a plurality of ordered data, comprising:
input means for inputting a condition sequence consisting of m ordered conditions that respectively specify the m data using relationships between data;
preprocessing means for analyzing compatibility between the m conditions and generating auxiliary information used for data extraction;
search means for, when searching the ordered data for the data sequence from the first data toward the last data, checking whether the data satisfy the corresponding conditions of the condition sequence in a direction opposite to the search direction, and determining the next check start position using the auxiliary information when checked data do not satisfy the corresponding condition; and
output means for outputting information on a data sequence that satisfies all the conditions of the condition sequence.

The data processing device according to claim 1, wherein the preprocessing means generates an array representing, for each position on the condition sequence consisting of the m conditions, the maximum length of a partial condition sequence that has that position as its right end and is compatible with the partial condition sequence at the right end of the condition sequence, and generates the auxiliary information using the array.

A data processing device that, where p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a conditional expression on an attribute value of a record r[i] (i = {0, 1, ..., n-1}), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], the device comprising:
input means for inputting the pattern;
preprocessing means for analyzing compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
search means for, when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checking whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determining the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
output means for outputting information on a record sequence that satisfies all the predicates of the pattern.

4. The data processing apparatus according to claim 3, wherein when a plurality of conditional expressions are included in the predicate, the preprocessing unit separates and analyzes the plurality of conditional expressions.

The data processing device according to claim 3, wherein the preprocessing means generates an array representing, for each position j, the maximum length len for which the predicate sequence p[j-len+1], ..., p[j] having the position j as its right end is compatible with the predicate sequence p[m-len], ..., p[m-1], and generates the auxiliary information using the array.

A data processing device that, where r[i] (i = {0, 1, ..., n-1}) denotes a record having k attributes a[0], a[1], ..., a[k-1], r[i].a[q] denotes the attribute a[q] (q = {0, 1, ..., k-1}) of r[i], F denotes an arbitrary function including constants and attributes r[i1].a[q1] (i1 = {0, 1, ..., n-1} excluding i, q1 = {0, 1, ..., k-1}), OP denotes any operator among {<, ≦, =, >, ≧, ≠}, and p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a combination of conditional expressions of the form (r[i].a[q] OP F), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], the device comprising:
input means for inputting the pattern;
preprocessing means for analyzing compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
search means for, when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checking whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determining the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
output means for outputting information on a record sequence that satisfies all the predicates of the pattern.

A program for a computer that extracts a data sequence consisting of m ordered data from a plurality of ordered data, the program causing the computer to execute processing comprising:
inputting a condition sequence consisting of m ordered conditions that respectively specify the m data using relationships between data;
analyzing compatibility between the m conditions and generating auxiliary information used for data extraction;
when searching the ordered data for the data sequence from the first data toward the last data, checking whether the data satisfy the corresponding conditions of the condition sequence in a direction opposite to the search direction, and determining the next check start position using the auxiliary information when checked data do not satisfy the corresponding condition; and
outputting information on a data sequence that satisfies all the conditions of the condition sequence.

A program for a computer that, where p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a conditional expression on an attribute value of a record r[i] (i = {0, 1, ..., n-1}), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], the program causing the computer to execute processing comprising:
inputting the pattern;
analyzing compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generating auxiliary information used for pattern extraction;
when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checking whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determining the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
outputting information on a record sequence that satisfies all the predicates of the pattern.

A data processing method for extracting a data sequence consisting of m ordered data from a plurality of ordered data, wherein:
input means inputs a condition sequence consisting of m ordered conditions that respectively specify the m data using relationships between data;
preprocessing means analyzes compatibility between the m conditions and generates auxiliary information used for data extraction;
search means, when searching the ordered data for the data sequence from the first data toward the last data, checks whether the data satisfy the corresponding conditions of the condition sequence in a direction opposite to the search direction, and determines the next check start position using the auxiliary information when checked data do not satisfy the corresponding condition; and
output means outputs information on a data sequence that satisfies all the conditions of the condition sequence.

A data processing method that, where p[j] (j = {0, 1, ..., m-1}) denotes a predicate described by a conditional expression on an attribute value of a record r[i] (i = {0, 1, ..., n-1}), extracts a pattern consisting of m predicates p[0], p[1], ..., p[m-1] from ordered data consisting of n records r[0], r[1], ..., r[n-1], wherein:
input means inputs the pattern;
preprocessing means analyzes compatibility between a predicate p[j] and a predicate p[j1] (j1 = {0, 1, ..., j-1}) and generates auxiliary information used for pattern extraction;
search means, when searching the ordered data for a record sequence corresponding to the pattern from the record r[0] toward the record r[n-1], checks whether the records satisfy the corresponding predicates in the pattern from the tail predicate p[m-1] toward the head predicate p[0], and determines the next check start position using the auxiliary information when a checked record does not satisfy the predicate; and
output means outputs information on a record sequence that satisfies all the predicates of the pattern.