JP2004110327A

JP2004110327A - Time series correlation extracting device

Info

Publication number: JP2004110327A
Application number: JP2002270950A
Authority: JP
Inventors: Yuji Hotta; 堀田　勇次; Naoteru Akaboshi; 赤星　直輝
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-09-18
Filing date: 2002-09-18
Publication date: 2004-04-08

Abstract

PROBLEM TO BE SOLVED: To limit the increase in processing time in a time series correlation extracting device even when the number of transactions grows. SOLUTION: The time series correlation extracting device includes a time series filter unit 4 comprising: a specifying means for specifying a retrieval pattern, when retrieving combinations of records from a set of records having a plurality of attributes, by using a plurality of events, each defining that a predetermined attribute in a record takes a specified value, together with the order relation among those events defined on the basis of the order of the attribute values; a retrieving means for retrieving, from the set of records, the combinations of records that correspond to the specified retrieval pattern; and an output means for outputting the retrieval result. The device also has a time series correlation engine unit 5 for extracting time series correlation rules. COPYRIGHT: (C)2004,JPO

Description

【０００１】
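[Technical field of the invention] The present invention relates to a time series correlation extracting device that extracts time series correlation rules from a large volume of transaction data.
【０００２】
More specifically, it relates to database mining, which discovers association rules among data recorded in a database, and to a technique for counting the occurrences of combinations of correlated data among the huge amount of data in a database. From the counts obtained with this technique, the combinations that satisfy given conditions and their occurrence counts are used to generate (extract) correlation rules, one of the data mining methods. Correlation analysis using correlation rules has attracted wide attention in recent years, mainly in the United States.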
【０００３】
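[Prior art] Conventional examples are described below.
【０００４】
§1: Conventional example 1
FIG. 43 illustrates conventional example 1; part A shows the time series correlation extracting device and part B is the processing flowchart. In FIG. 43, S1 to S6 denote the processing steps.
【０００５】
As shown in part A of FIG. 43, the conventional device contains a time series correlation engine unit 5, which takes in the input data, processes it, and extracts correlation rules. In general, time series correlation rules were extracted either from the raw transaction data or from transaction data restricted to some subset, such as a period or a store.
【０００６】
The engine unit 5 follows the flowchart in part B of FIG. 43. It first reads transaction data (S1) and checks whether it is empty (S2). If it is empty, processing ends; otherwise the unit generates item combinations (S3) and counts them (S4).
【０００７】
The engine unit 5 then generates (or extracts) time series correlation rules (S5), outputs them (S6), and returns to S1.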
【０００８】
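§2: Conventional example 2
Also known is a device that, from sequence data of items contained in temporally consecutive transactions, obtains the single items or combinations of two or more items that satisfy given conditions together with their occurrence counts, preserving the order in the sequence for combinations of i = 2 or more items, and extracts time series correlation rules (see Patent Document 1). This corresponds to the premise technology recited in claim 1 of the present application, from "temporally" up to "wherein". The parts of Patent Document 1 that are particularly relevant to the present invention are described below in detail as conventional example 2.
【０００９】
(1): Configuration
Conventional example 2 discloses information including the following configurations.
【００１０】
①: A method of counting correlated data combinations, for obtaining, from a large number of transactions each containing one or more items as data, the single items or combinations of two or more items whose occurrence count in the transactions satisfies a given condition, together with those occurrence counts, the method comprising: counting the items contained in each transaction one at a time to obtain each item's occurrence count over all transactions; selecting the items whose count satisfies the given condition and outputting each item with its count as a counting result; creating a bitmap in which the bits corresponding to the selected items are set to "1"; setting i, the number of items in a combination, to 2; generating the combinations of i items contained in each transaction from the items whose bitmap positions are "1"; counting the occurrences of the generated combinations over all transactions; selecting the combinations whose count satisfies the given condition and outputting each combination with its count as a counting result; creating a bitmap in which "1" is set at the bit positions corresponding to the selected combinations, to their partial combinations, or to the individual items they contain; and incrementing i and repeating the processing from the generation of i-item combinations onward.
【００１１】
②: The counting method of ① above, wherein, from sequence data of items contained in temporally consecutive transactions among the large number of transactions, the single items or combinations of two or more items that satisfy the given condition are obtained together with their occurrence counts, preserving the order in the sequence for combinations of i = 2 or more items.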
【００１２】
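(2): Technical field
The field is data mining, which discovers association rules among data recorded in a database; more specifically, a method of counting the occurrences of combinations of correlated data among the huge amount of data in a database. From the counts obtained, the combinations that satisfy given conditions and their counts are used for correlation rule generation (extraction), one of the data mining methods.
【００１３】
(3): Prior art
①: Prior art on counting data combinations
Because counting data combinations is part of correlation rule generation in database mining, correlation rules are explained first. As described later, the data combination counting of the invention uses the invention's group-by processing method as part of its processing.
【００１４】
As an example, suppose that, among the receipts of 100 customers collected by POS (point of sale) in a retail business, 20 customers bought product A and 12 customers bought both product A and product B. A single product is called an item, and a single receipt is called a transaction.
【００１５】
A transaction usually contains several items. From the definition
support of an item = number of transactions containing the item / total number of transactions
the support of product A is 20% and the support of products A and B together is 12%. A simple conditional probability calculation then gives "60% (12% / 20%) of the customers who buy A also buy B". This is written "A → B, confidence 60%, support 12%" and is defined as a correlation rule. That is, the confidence of the correlation rule "A → B" is
confidence of "A → B" = support of A∧B (both A and B bought) / support of A. Besides simple rules such as A → B, complex rules such as "A∧B → C∧D∧E" ("customers who buy A and B buy C, D and E") are also used. In that case the confidence is
confidence of "A∧B → C∧D∧E" = support of A∧B∧C∧D∧E / support of A∧B
as above.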
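The support and confidence definitions above can be checked with a short Python sketch. The receipt data below is made up to match the 100-customer example (20 receipts containing A, 12 of them also containing B); the function names `support_count` and `confidence` are illustrative and not taken from the patent.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= set(t))


def support(transactions, items):
    """Support = transactions containing the items / all transactions."""
    return support_count(transactions, items) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Confidence of lhs -> rhs = support(lhs and rhs) / support(lhs)."""
    both = support_count(transactions, set(lhs) | set(rhs))
    return both / support_count(transactions, lhs)


# Hypothetical receipts: 12 with A and B, 8 with A only, 80 with neither.
receipts = [{"A", "B"}] * 12 + [{"A"}] * 8 + [{"C"}] * 80

print(support(receipts, {"A"}))            # support of A
print(support(receipts, {"A", "B"}))       # support of A and B
print(confidence(receipts, {"A"}, {"B"}))  # confidence of A -> B
```

Working from counts rather than percentages avoids rounding noise when the two supports are divided.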
【００１６】
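Correlation rules are useful in many situations: evaluating which product groups a loss-leader item helped to sell, optimizing shelf allocation (which products should be placed near each other), raising the hit rate of direct mail from credit card data, and so on.
【００１７】
Correlation rule generation consists of two stages: (1) counting, among the transactions, the occurrences of the item combinations that satisfy the given support condition; and (2) computing the rules and their support and confidence from the combinations and counts obtained in (1).
【００１８】
In (1), the set of item combinations satisfying the given support condition is called the "large itemset". The support condition is given as a range from a minimum value (0% [count every combination] to 100% [count only items bought in every transaction]) to a maximum value (minimum ≤ maximum ≤ 100%). Conventionally the maximum is in most cases fixed at 100%.
【００１９】
Because counting the large itemsets in (1) is very time-consuming, various speed-up techniques have been proposed. The best known are the SQL-based SETM algorithm and APRIORI, one of several algorithms proposed by IBM (registered trademark).
【００２０】
Correlation rule generation based on SETM builds on SQL, the relational database query language, and is therefore easy to implement. It uses the SQL join operation and group-by operation. A self-join on the table of transactions containing the length k−1 item combinations that satisfy the minimum support generates the candidate item combinations of length k.
【００２１】
A group-by operation then counts the large itemsets of length k. A further join produces the set of transactions satisfying the minimum support, which is used to generate the item combinations of length k+1.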
【００２２】
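FIG. 44 illustrates the concrete processing flow of the SETM algorithm, and FIG. 45 shows what each functional block in that processing does. SETM as prior art is described in detail with these figures.
【００２３】
In FIG. 44, table R1′ shows the items contained in each transaction tx. For example, transaction 1 contains items 1, 2 and 3. GB(1) counts the occurrences of the individual items (group-by processing), and table L1 shows the result for the items whose count is 2 or more.
【００２４】
Table R1 is the result of the join processing J(1), which extracts from table R1′ only the items present in table L1.
【００２５】
SJ(1) is a self-join on table R1; it produces table R2′, which contains every possible combination of two items for each transaction.
【００２６】
Group-by processing GB(2) counts the occurrences of the two-item combinations in table R2′, and those with a count of 2 or more form table L2.
【００２７】
In the same way, table L3 is built from the three-item combinations with a count of 2 or more, and table L4 from the four-item combinations with a count of 2 or more; table L4 turns out to be empty.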
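The SETM steps GB(1), J(1), SJ(1) and GB(2) map naturally onto SQL. The following is a minimal sketch using Python's built-in `sqlite3` module. The four transactions are illustrative data (the table of FIG. 44 is not reproduced in the text), and the table names `r1p`, `l1`, `r1` and `r2p` are stand-ins for R1′, L1, R1 and R2′.

```python
import sqlite3

# Illustrative transactions: tid -> items (not the data of FIG. 44).
TRANSACTIONS = {1: [1, 2, 4], 2: [2, 3, 6], 3: [1, 4, 5, 6], 4: [1, 2, 4, 5]}


def setm_large_pairs(transactions, min_count=2):
    con = sqlite3.connect(":memory:")
    # R1': one row per (transaction, item).
    con.execute("CREATE TABLE r1p (tid INTEGER, item INTEGER)")
    con.executemany(
        "INSERT INTO r1p VALUES (?, ?)",
        [(tid, item) for tid, items in transactions.items() for item in items],
    )
    # GB(1): count single items and keep those meeting the minimum count -> L1.
    con.execute("CREATE TABLE l1 (item INTEGER, cnt INTEGER)")
    con.execute(
        "INSERT INTO l1 SELECT item, COUNT(*) FROM r1p "
        "GROUP BY item HAVING COUNT(*) >= ?",
        (min_count,),
    )
    # J(1): keep only the rows of R1' whose item appears in L1 -> R1.
    con.execute(
        "CREATE TABLE r1 AS SELECT r.tid AS tid, r.item AS item "
        "FROM r1p AS r JOIN l1 ON r.item = l1.item"
    )
    # SJ(1): self-join R1 within each transaction -> R2' (pairs, item1 < item2).
    con.execute(
        "CREATE TABLE r2p AS SELECT a.tid AS tid, a.item AS item1, b.item AS item2 "
        "FROM r1 AS a JOIN r1 AS b ON a.tid = b.tid AND a.item < b.item"
    )
    # GB(2): count pairs and keep those meeting the minimum count -> L2.
    rows = con.execute(
        "SELECT item1, item2, COUNT(*) FROM r2p "
        "GROUP BY item1, item2 HAVING COUNT(*) >= ? ORDER BY item1, item2",
        (min_count,),
    ).fetchall()
    con.close()
    return rows


print(setm_large_pairs(TRANSACTIONS))
```

The self-join is where the cost concentrates: R2′ holds one row per pair per transaction, which is the join volume the later bitmap approach aims to avoid.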
【００２８】
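The Apriori algorithm uses the large itemsets of length k−1 that satisfy the minimum support to generate candidate item combinations of length k. When all the length k−1 large itemsets fit in memory, it checks whether every length k−1 sub-combination of a length k combination is contained in the large itemsets, and treats the length k combination as a candidate only if they all are.
【００２９】
Registering all the length k−1 combinations in an in-memory hash table lets unnecessary candidates be pruned. The candidate combinations are held in a hash tree, and for each transaction the count of every combination that is contained in the transaction and registered in the hash tree is incremented, which counts the occurrences of the length k candidates. Restricting attention to the combinations registered in the hash tree avoids counting unnecessary combinations.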
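The candidate generation and pruning just described can be sketched as follows. This is a simplified illustration: a Python `set` and `dict` stand in for the hash table and hash tree, and the transactions and the length 2 large itemsets are illustrative values, not data from the figures.

```python
from itertools import combinations


def apriori_candidates(large_prev, k):
    """Length-k candidates whose every (k-1)-subset is in the previous large itemsets."""
    prev = set(large_prev)  # plays the role of the in-memory hash table
    items = sorted({item for combo in prev for item in combo})
    return [
        combo
        for combo in combinations(items, k)
        if all(sub in prev for sub in combinations(combo, k - 1))
    ]


def count_candidates(transactions, candidates):
    """Count, per transaction, only the combinations registered as candidates."""
    counts = dict.fromkeys(candidates, 0)  # stands in for the hash tree
    k = len(candidates[0]) if candidates else 0
    for t in transactions:
        for combo in combinations(sorted(t), k):
            if combo in counts:
                counts[combo] += 1
    return counts


transactions = [[1, 2, 4], [2, 3, 6], [1, 4, 5, 6], [1, 2, 4, 5]]
large_2 = [(1, 2), (1, 4), (2, 4), (1, 5), (4, 5)]

candidates_3 = apriori_candidates(large_2, 3)
print(candidates_3)
print(count_candidates(transactions, candidates_3))
```

Here (1, 2, 5) and (2, 4, 5) are pruned before any counting because (2, 5) is not a large itemset.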
【００３０】
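FIG. 46 illustrates the concrete processing flow of the Apriori algorithm, and FIG. 47 explains each of its functional blocks. A concrete example of counting large itemsets with the Apriori algorithm follows.
【００３１】
In FIG. 46, the list TL of eight transactions is essentially the same as in FIG. 44. First the items in these transactions are fed one at a time to Subset(1), and the occurrences of each item are counted as C1. The counts are passed to F, which keeps by filtering the items that occur at least twice; the filtered result is L1.
【００３２】
Combinations of two items are chosen from the items in L1 and registered in the hash tree as C2. When a transaction contains a two-item combination registered in the hash tree, Subset(2) counts it; the resulting counts are filtered by F, giving L2, the two-item combinations that occur at least twice.
【００３３】
Repeating the same processing gives L3, the three-item combinations that occur at least twice, and processing ends when it is found that, as in FIG. 44, no four-item combination occurs at least twice.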
【００３４】
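(4): The combination counting method
In FIG. 96, the counting method consists of: a combination generation unit C(i) 1, which generates candidate item combinations considered to be correlated; an occurrence counting unit G(i) 2, which counts by group-by processing how often each candidate appears in the transactions; a combination selection unit F 3, which selects a combination as correlated, that is, as an element of the large itemset L(i), when its count lies, for example, within a specified range; and a bitmap generation unit B(i) 4, which, from the large itemset output by the selection unit F 3, generates the bitmaps b1 5 to b(i−1) 6 used by the generation unit C(i) 1 to prune combinations and adds the bitmap bi 7 used by C(i+1).
【００３５】
Each unit in FIG. 96 performs the following processing.
C(i)
・When i = 1: sends the items contained in the same transaction to G(i) one at a time.
・When i ≥ 2: of the combinations of i items contained in the same transaction, sends to G(i) those not excluded by the bitmap filters b1, b2, ..., b(i−1). These filters correspond to b1, b2, ..., b(i−1) in FIG. 96.
G(i)
・Receives records of i items, performs the group-by processing described later with the whole record as the key, computes the number of records in each group, and outputs each record with that number appended.
F
Receives the output of G(i) and outputs the records whose number satisfies the given condition as the large itemset L(i) of length i.
B(i)
Receives the output of F, [item 1 ... item i, number], and for each bitmap filter bj (1 ≤ j ≤ i) takes every combination of j items out of [item 1 ... item i] and sets to "1" the bit position computed by Hj (combination of j items). At this point b1, b2, ..., b(i−1) already exist and are updated; bi does not yet exist, so it is newly created and then updated.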
【００３６】
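FIG. 97 is the overall flowchart of the combination counting method. When processing starts, step S501 sets i, the number of items in a combination, to 1, and step S502 generates L(1), the large itemset consisting of single items only.
【００３７】
Step S503 then determines whether the number of elements of L(i), that is, the number of combinations, is i + 1 or more. Here it checks whether L(1) has two or more elements; if so, step S504 increments i and the processing from step S502 onward is repeated.
【００３８】
That is, S502 now generates the large itemset L(2), the set of two-item combinations whose occurrence count is within the specified range, and the processing from S503 onward follows. Processing ends when step S503 finds that the large itemset L(i) does not contain i + 1 or more elements.
【００３９】
FIG. 98 is the flowchart of large itemset generation. When processing starts, step S510 reads the first transaction of the transaction list TL, and step S511 generates the candidate combinations of i items and sends them, that is, the combinations of item 1 to item i, to the occurrence counting unit G(i). Step S512 checks whether the transaction list is empty; while it is not, the processing from step S510 onward is repeated, generating candidates from the transaction list and sending them to G(i).
【００４０】
When step S512 finds the transaction list empty, step S513 sends a candidate combination and its occurrence count from the counting unit G(i) to the combination selection unit F, which performs the selection. In step S514 the selection result is stored as the large itemset L(i), and at the same time the item-combination part of the large itemset is sent to the bitmap generation unit B(i). Step S515 generates the bitmaps, and step S516 checks whether records remain in G(i); if so, the processing from step S513 onward is repeated, and if not, processing ends.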
【００４１】
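Next, the counting method is described in more detail with a concrete example. The example uses a transaction list TL consisting of the following four transactions T1 to T4.
【００４２】
T1 = [1, 2, 4], T2 = [2, 3, 6]
T3 = [1, 4, 5, 6], T4 = [1, 2, 4, 5]
The items in each transaction are assumed to be sorted by number. The selection condition in the combination selection unit F is a minimum support of 50%, that is, a count of 2 or more out of the 4 transactions.
【００４３】
First, the generation of the length 1 itemset L(1). FIGS. 99 to 102 illustrate the generation of length 1 candidates and the counting of their occurrences.
【００４４】
First, in FIG. 99, transaction T1 = [1, 2, 4] is read and input to C(1). C(1) sends the items in the transaction one at a time, that is, [1], [2] and [4], to the occurrence counting unit G(1). G(1) counts the occurrences of the items it receives and holds each item paired with its count. Since [1], [2] and [4] were each input once, it holds [1, 1], [2, 1], [4, 1] in the form [item, count]. This completes the processing of T1.
【００４５】
When the input to G(1) for one transaction is finished, control returns to C(1) for the next transaction. In FIG. 100, T2 = [2, 3, 6] is input to G(1) as the three items [2], [3] and [6]. G(1) adds the incoming items and counts to the counts accumulated from the transactions processed so far.
【００４６】
Item [2] is input in both T1 and T2, so when T2 has been processed it is held as [2, 2] in the form [item, count].
【００４７】
In the same way, for every transaction C(1) sends the items to G(1) one at a time and G(1) counts them. After all transactions up to T4 have been processed, the [item, count] pairs are [1, 3], [2, 3], [4, 3], [3, 1], [6, 2], [5, 2]. FIG. 103 shows this result.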
【００４８】
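When the input to G(1) has finished for all transactions, the [item, count] pairs are taken out of G(1) and F selects the items that satisfy the minimum support. FIGS. 104 to 109 illustrate this. In FIG. 104, the count 3 of [1, 3] satisfies the minimum support of 50%, so it is registered in the large itemset L(1). At the same time, B(1) registers item [1] in bitmap b1. The bit position (0 to 5) for an item is given by the following hash function H1.
H1(item 1) = (number of item 1) mod 6.
Because the transactions contain six unique items in total, the bitmap is given six bits to match. If the bitmap does not fit in memory, or memory must be reserved for other processing, a value smaller than 6 can be used. All bits of the bitmap are initially "0". Applying the hash function to item [1] gives 1 mod 6 = 1, so the second bit from the top of b1, which corresponds to 1, is set to "1".
【００４９】
Next, in FIG. 105, [2, 3] is taken from G(1); the count 3 of item [2] satisfies the minimum support of 50%, so it is registered in the large itemset. Since 2 mod 6 = 2, B(1) sets the third bit of b1 to "1".
【００５０】
In FIGS. 106, 108 and 109, the pairs [4, 3], [6, 2] and [5, 2] satisfy the minimum support, so [4], [6] and [5] are registered with their counts in L(1), and the bit of b1 at each hash value is set to "1".
【００５１】
In FIG. 107, however, the count 1 of [3, 1] does not satisfy the minimum support, so it is registered in neither L(1) nor b1. The result is bitmap b1 = {1, 1, 1, 0, 1, 1} and the length 1 large itemset L(1) = {[1, 3], [2, 3], [4, 3], [6, 2], [5, 2]}.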
【００５２】
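Next, the generation of the length 2 large itemset L(2). FIGS. 110 to 113 illustrate the processing up to the counting of candidates. Length 2 combinations are generated using the bitmap b1 already created. First, in FIG. 110, transaction T1 = [1, 2, 4] is read and length 2 candidates are generated using only the items set in b1. All three items are registered in b1, so the three candidates [12], [14] and [24] are generated from [1], [2] and [4]. The counting unit G(2) counts the occurrences of the combinations [12], [14] and [24] received from C(2) and holds each with its count. Since each was input once, it holds [12, 1], [14, 1], [24, 1] in the form [item 1 item 2, count]. This completes the processing of T1.
【００５３】
Filtering the items of the next transaction T2 = [2, 3, 6] with b1 drops item [3], so only the single candidate [26] is input to G(2) in FIG. 111. T3 and T4 are processed in the same way, but none of their items is screened out by b1, so every length 2 combination that can be generated from them is input to G(2) in FIGS. 112 and 113.
【００５４】
When the input to G(2) has finished for all transactions, the [item 1 item 2, count] records are taken out of G(2) and F selects those whose count satisfies the minimum support. FIGS. 114 to 123 show this. The count 2 of [12, 2] satisfies the minimum support of 50%, so in FIG. 114 the candidate [12] and its count are registered in L(2). At the same time, B(2) registers the items [1] and [2] of the combination [12] in b1, setting bit positions 1 and 2 of b1 to "1". In addition, for [12], bit position 3 of bitmap b2 (bit positions 0 to 4), computed by the following hash function H2, is set to "1".
【００５５】
H2(item 1, item 2) = (number of item 1 + number of item 2) mod 5.
Because H2 takes the sum of the two item numbers mod 5, the bitmap is given five bits. Initially all five bits of b2 are "0". The hash function here is only an example; in practice it can be chosen freely with regard to its efficiency and to the memory available for the bitmap. In b1 the bit position to set is always decided by one item, whereas in b2 it is decided by a hash function taking two items as arguments.
【００５６】
As shown in FIGS. 115, 116, 118 and 120, [14, 3], [24, 2], [15, 2] and [45, 2] satisfy the minimum support, so they are registered in L(2) and b1 and b2 are updated. In FIGS. 117, 119 and 121 to 123 the minimum support is not satisfied, so nothing is registered and the bitmaps are not updated. The result is L(2) = {[12, 2], [14, 3], [24, 2], [15, 2], [45, 2]}, b1 = {0, 1, 1, 0, 1, 1} and b2 = {1, 1, 0, 1, 1}, as shown in FIG. 123.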
【００５７】
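Next, the generation of L(3). FIGS. 124 to 127 illustrate the processing up to the counting of candidates. Length 3 combinations are generated using the bitmaps b1 and b2 created together with the length 2 large itemset. First, T1 = [1, 2, 4] is read and length 3 combinations are generated using only the items set in b1; here [1], [2] and [4] are all set in b1. Next, the length 2 combinations [12], [14] and [24] are checked against b2. Since b2[H2(1, 2)] = b2[H2(1, 4)] = b2[H2(2, 4)] = 1, all are set, and [124] is input to G(3) as a candidate. FIG. 124 shows this.
【００５８】
As shown in FIG. 125, for T2 = [2, 3, 6], [3] is not set in b1 and is screened out, leaving a transaction of length 2, so no length 3 candidate can be generated and nothing is input to G(3).
【００５９】
For T3 = [1, 4, 5, 6], the bits of b1 for [1], [4], [5] and [6] are set, so four length 3 combinations are possible: [145], [146], [156] and [456]. Each is first checked against b2. For [145], the b2 bits for [14], [15] and [45] are all set, so it is input to G(3) as a candidate in FIG. 126.
【００６０】
For [146], the b2 bits for [14] and [46] are set but the bit for [16] is not, so [146] is not input to G(3). For [456], the b2 bits for all of its two-item combinations [45], [46] and [56] are set, so [456] is input to G(3) as a candidate, whereas [156] is not input because the b2 bit for [16] is not set.
【００６１】
The same b1 and b2 checks are made for T4 = [1, 2, 4, 5], and [124] and [145] are input to G(3) as candidates. FIG. 127 shows the result.
【００６２】
In this example, to avoid generating unnecessary candidates, the b2 bits for every length 2 sub-combination were checked when generating length 3 candidates. This is effective as pruning that prevents unnecessary counting, but when, for example, the memory available for processing is insufficient, the check of individual items or of length 2 combinations can be skipped, that is, b1 or b2 need not be used. It is also possible to use b1 and b2 but check fewer length 2 combinations.
【００６３】
For example, when generating a length 3 combination, only its leading length 2 combination might be checked.
【００６４】
When the input to G(3) is finished, the [item 1 item 2 item 3, count] records are taken out of G(3) and F selects those whose count satisfies the minimum support, as shown in FIGS. 128 to 130. In FIG. 128, the count 2 of [124, 2] satisfies the minimum support of 50%, so it is registered in the length 3 large itemset L(3). Likewise, B(3) sets to "1" the bit positions of b1 for the items [1], [2] and [4] of [124]. Using the hash function H3, it also sets to "1" the bit position of bitmap b3 that corresponds to the three-item combination [124].
【００６５】
H3(item 1, item 2, item 3) = (number of item 1 + number of item 2 + number of item 3) mod 5.
The two-item combinations could also be taken out of [124] to generate bitmap b2, but here only b1 and b3 are generated.
【００６６】
The processing in FIGS. 129 and 130 that follows produces L(3) = {[124, 2], [145, 2]}, b1 = {0, 1, 1, 0, 1, 1} and b3 = {1, 0, 1, 0, 0}.
【００６７】
In FIG. 130 the length 3 large itemset L(3) has two elements, fewer than the four that step S503 of FIG. 97 requires in order to continue, so processing ends. This completes the counting of large itemsets, and the result can be used to generate correlation rules.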
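The whole worked example (units C(i), G(i), F and B(i), the hash functions H1 to H3, and the S503 termination test) can be reproduced with a compact Python sketch. It makes two assumptions that the text does not fix: each bitmap is rebuilt from the most recent large itemset, which reproduces the bitmap values quoted above, and lengths beyond 3 would reuse a 5-bit bitmap. Because of the first assumption, the intermediate candidate lists may differ slightly from the narrative, while L(1) to L(3) and the bitmaps come out the same. All function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

TRANSACTIONS = [[1, 2, 4], [2, 3, 6], [1, 4, 5, 6], [1, 2, 4, 5]]
MIN_COUNT = 2                      # 50% of 4 transactions
BITMAP_BITS = {1: 6, 2: 5, 3: 5}   # bitmap sizes used in the worked example


def bit_position(items):
    """H_j: sum of the item numbers modulo the size of bitmap b_j."""
    return sum(items) % BITMAP_BITS.get(len(items), 5)


def passes(combo, bitmaps):
    """C(i) filter: every shorter sub-combination with a bitmap must have its bit set."""
    for j, bitmap in bitmaps.items():
        if j < len(combo):
            for sub in combinations(combo, j):
                if not bitmap[bit_position(sub)]:
                    return False
    return True


def count_large_itemsets(transactions, min_count):
    bitmaps, large, i = {}, {}, 1
    while True:
        counts = Counter()
        for t in transactions:
            for combo in combinations(t, i):   # C(i): generate candidates
                if passes(combo, bitmaps):
                    counts[combo] += 1         # G(i): count occurrences
        # F: keep the combinations that meet the minimum count.
        selected = {c: n for c, n in counts.items() if n >= min_count}
        large[i] = selected
        # B(i): rebuild b1 and create b_i from the selected combinations.
        for j in {1, i}:
            bitmap = [0] * BITMAP_BITS.get(j, 5)
            for combo in selected:
                for sub in combinations(combo, j):
                    bitmap[bit_position(sub)] = 1
            bitmaps[j] = bitmap
        if len(selected) < i + 1:              # S503: stop when fewer than i + 1
            return large, bitmaps
        i += 1


large, bitmaps = count_large_itemsets(TRANSACTIONS, MIN_COUNT)
for length, itemsets in large.items():
    print(length, itemsets)
print(bitmaps)
```

Because a bitmap is only a filter, hash collisions can let an infrequent combination through (it is then discarded by F), but they never remove a frequent one.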
【００６８】
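In the example above, C(i) for i > 1 generates, for each transaction, the length i candidates not excluded by the bitmaps b1, ..., b(i−1). At the same time, for each transaction, only the items not excluded by those bitmaps can be stored as a new transaction. Then, for i > 2, candidates can be generated from the reduced transactions produced by C(i−1) instead of from the original transactions TL. This concludes the detailed description of the data combination counting method used for correlation rule generation.
【００６９】
(5): Application to time series analysis as a correlation analysis technique
Time series analysis is used to analyze customers' purchase patterns over long periods. If the probability that a customer who buys a first item will buy a second item within a given period is known as a correlation rule, a retailer can, for example, manage inventory more effectively.
【００７０】
FIG. 48 illustrates a sequence list representing customers' long-term purchase patterns. The first sequence in the figure shows a time series pattern in which a customer buys just one unit of item 3, then, say, the next day buys only item 8, and a week later buys one each of items 3 and 8.
【００７１】
In FIG. 48 one element corresponds to a transaction as described above, that is, to one receipt; items 3 and 8 belonging to the same element means that they are recorded on the same receipt.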
【００７２】
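The time series analysis described below uses a minimum support of 40%. Because the sequence list of FIG. 48 consists of five sequences, 40% support means an occurrence count of 2 or more. The operation of the combination generation unit C(i), the counting unit G(i), the combination selection unit F and the bitmap generation unit B(i) in time series analysis is as follows.
C(i)
・When i = 1: sends the items contained in the same sequence to G(i) one at a time.
・When i ≥ 2: of the permutations of i items contained in the same sequence, sends to G(i) those not excluded by the bitmap filters b1, b2, ..., b(i−1).
G(i)
Receives permutations of i items, performs the group-by processing described later, and outputs each with its count appended.
F
Receives the output of G(i) and outputs those whose count satisfies the given condition as the large sequence L(i) of length i.
B(i)
Receives the output of F and, for each bitmap filter bj (1 ≤ j ≤ i), takes every permutation of j items out of the permutation of i items and sets to "1" the bit position computed by Hj (permutation of j items).
【００７３】
The internal representation of a permutation of j items is [item 1, ..., item j, separator 1, ..., separator (j−1)], where separator k (1 ≤ k ≤ j−1) is 0 if item k and item (k+1) belong to the same element and 1 if they belong to different elements.
【００７４】
The hash function Hj is Hj(item 1, ..., item j, separator 1, ..., separator (j−1)) = (number of item 1 + ... + number of item j + separator 1 + ... + separator (j−1)) mod N, where N is the number of bits in the bitmap.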
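The internal representation and the hash function Hj can be illustrated with a small Python sketch. The helper names `encode` and `bit_position` are made up for this illustration, and the 8-bit bitmap size is just an example value.

```python
def encode(elements):
    """Flatten a sequence such as [[3], [8]] into (items, separators).

    Separator k is 0 when item k and item k+1 share an element
    and 1 when they belong to different elements.
    """
    items, separators = [], []
    for element in elements:
        for pos, item in enumerate(element):
            if items:  # every item after the first gets a separator
                separators.append(0 if pos > 0 else 1)
            items.append(item)
    return items, separators


def bit_position(items, separators, n_bits):
    """H_j: (sum of item numbers + sum of separators) mod N."""
    return (sum(items) + sum(separators)) % n_bits


for seq in ([[3], [8]], [[8], [3]], [[3, 8]]):
    items, seps = encode(seq)
    print(seq, items, seps, bit_position(items, seps, 8))
```

Note that <(3)(8)> and <(8)(3)> are distinct records but hash to the same bit, because Hj sums the item numbers. This is harmless for a filter: a shared bit can only let an extra candidate through to G(i), where the records are still counted separately.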
【００７５】
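FIGS. 49 to 53 illustrate the processing up to G(1), which counts the occurrences of the individual items in the sequence list. The items in each sequence are counted one at a time by G(1), and FIG. 53 is the final result.
【００７６】
FIGS. 54 to 61 illustrate how, from the per-item counts obtained in FIGS. 49 to 53, the items with support of 40% or more are selected to form the large sequence L(1), while bitmap b1 is created at the same time.
【００７７】
In FIG. 54, for example, the incoming item 3 has a count of 4, which satisfies the minimum support, so it becomes an element of the large sequence L(1); at the same time item 3 is given to the bitmap generation unit B(1), which sets its bit in bitmap b1. Here b1 has 8 bits.
【００７８】
In FIGS. 55 to 58 the items input to the combination selection unit F satisfy the support value and are added to L(1), whereas the items input in FIGS. 59 to 61 do not reach the support value and are not registered in L(1); FIG. 61 is the final result of this processing.
【００７９】
FIGS. 62 to 66 illustrate the counting of two-item candidates. In FIG. 62, C(2) generates two-item candidates from the items not dropped by filtering with b1, and G(2) counts their occurrences. Because order within the sequence matters, and because it also matters whether the items belong to the same element, the following three candidates are counted separately: <(3)(8)>, <(8)(3)>, <(3, 8)>
Two-item candidates are counted in the same way in FIGS. 63 to 66. In FIG. 63, for example, items 1, 2 and 6 are filtered out by b1 and are not used to generate candidates. In FIG. 66 the sequence contains only one item, so no two-item candidate is generated, and FIG. 65 is the final result.
【００８０】
FIGS. 67 to 77 illustrate the creation of the large sequence L(2) from the two-item candidates of FIGS. 62 to 66 that satisfy the support value, and the generation of bitmaps b1 and b2 by the bitmap generation unit B(2). In FIG. 67 the two-item candidate input to the selection unit F satisfies the support value, so it becomes an element of L(2), and the bits of b1 and b2 are set for the two items 3 and 8 using the hash functions H1 and H2.
【００８１】
The candidates input in FIGS. 68 to 71 and 75 to 77 do not satisfy the support value, so they are not registered in the large sequence and no bitmap bits are set.
【００８２】
In contrast, the candidates input to the combination selection unit F in FIGS. 72 to 74 satisfy the support value, so they are registered in the large sequence L(2) and the bitmap bits are set; FIG. 77 is the final result of this processing.
【００８３】
FIGS. 78 to 82 illustrate, among other things, the counting of three-item candidates by G(3). In FIG. 78 the first sequence is input to the combination generation unit C(3), but every three-item candidate generated from it is filtered out by b2, and none is input to G(3).
【００８４】
In FIGS. 79 and 81, the single kind of three-item candidate generated by C(3) is counted by G(3) in each case. In FIGS. 80 and 82 no candidate is input to G(3); FIG. 82 is the final result of this processing.
【００８５】
FIG. 83 illustrates registering in the large sequence L(3) the three-item candidates from FIGS. 78 to 82 that satisfy the support value, and at the same time setting bits in b1 and b3. There is only one kind of three-item candidate; the combination selection unit F makes it the large sequence L(3), and b1 and b3 are set from its items. Since L(3) has only one element, the combination counting ends here.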
【００８６】
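FIG. 84 shows the flow of item combination counting for basic correlation analysis in the invention, using the same transaction list as the prior art of FIGS. 44 and 46. The basic feature of the invention is that, for example, the bitmap b1 created from the large itemset L1 is used by the combination generation unit C(2) to generate from the transaction list only the two-item candidates that are not filtered out, which are then given to the counting unit G(2).
【００８７】
Generating the bitmap is simple, and its size can always be made to fit the available memory. This solves problems such as the heavy join processing in FIG. 44 and the hash tree in FIG. 46 sometimes growing larger than the available memory.
【００８８】
(6): Group-by processing executed in the occurrence counting unit G(i)
This completes the overall description of the data combination counting method of the invention; the group-by processing executed in the occurrence counting unit G(i) is described next.
【００８９】
Candidate item combinations are counted with the group-by processing method of the invention, which is explained here with a concrete example. When counting the occurrences of single items, the item number can be regarded as the record value and the group-by processing method described earlier applies directly, so the example here counts two-item candidates.
【００９０】
As the group-by method, as in the first embodiment described earlier, hash processing first creates a hashed list and an auxiliary information list, and the group-by function processing then counts the records that have the same key value.
【００９１】
There are the following 16 input records. Each record holds two item numbers, which determine the key used in hashing and sorting.
【００９２】
[12] [14] [24] [26] [14] [15] [16] [45] [46] [56] [12] [14] [15] [24] [25] [45]
Keys can be compared in various ways; lexicographic order is used here as one example. To compare two records, the first item numbers are compared, and if they differ, that decides the order of the keys; if they are equal, the second item numbers are compared, and so on through the third and later item numbers; if all item numbers are equal, the two keys are equal.
【００９３】
For example, [12] and [14] have the same first item number, but comparing the second item numbers shows that the key of [14] is larger than that of [12].
【００９４】
Any suitable hash function can be used; as one example, the remainder of the sum of the item numbers divided by 3 is used here, as in the following expression. Thus the hash value of [12] is (1 + 2) mod 3 = 0, and that of [14] is (1 + 4) mod 3 = 2.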
【００９５】
[Equation 1]
hash value = (number of item 1 + number of item 2 + ... + number of item i) mod 3
【００９６】
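As in the first embodiment of the group-by processing method described earlier, a hashed list is first generated from the 16 input records. Here the hash table 15 of FIG. 92 has size 3, the record buffer 14 has room for four records, and the hashed-list output buffer 18 can hold three records. FIG. 85 illustrates the progress of hashed list generation, showing the input of records from the input buffer 13 of FIG. 92 to the record buffer 14 and their output from the record buffer 14 to the hashed-list output buffer 18. As in FIG. 93, when a record is input and a record with the same hash value is already pointed to from that hash value's slot of the hash table, the newly input record becomes the one pointed to from the hash table, and the earlier record is managed through the link management table.
【００９７】
FIG. 86 is the hashed list produced by the processing of FIG. 85. Comparison with FIG. 85 shows that two runs are obtained: one of block numbers 0 to 2 and one of block numbers 3 to 5.
【００９８】
If the auxiliary information list takes the form [block number, hash value of first record, hash value of last record], the following list is obtained. The form of this list does not depend on the value of i.
【００９９】
[0, 0, 0] [1, 1, 2] [2, 2, 2] [3, 0, 0] [4, 0, 1] [5, 1, 1]
Sorting the auxiliary information list gives, for example, the following sorted list. The sorting method likewise does not depend on the value of i.
【０１００】
[0, 0, 0] [3, 0, 0] [4, 0, 1] [5, 1, 1] [1, 1, 2] [2, 2, 2]
The group-by function processing of FIG. 94 is carried out on the basis of this sorted auxiliary information list. FIG. 87 first shows the processing by the minimum-hash-value record extraction device 33 of FIG. 95. Here there are two hashed-list input buffers, one per run; from the records read into the two buffers in the order of the sorted auxiliary information list 30, the record with the smallest hash value is taken out one at a time, with the following final result.
【０１０１】
[24] [12] [15] [45] [45] [15] [24] [12] [25] [16] [46] [14] [26] [56] [14] [14]
Splitting this output of the minimum-hash-value record extraction device 33 by hash value gives the following.
【０１０２】
Hash value 0: [24] [12] [15] [45] [45] [15] [24] [12]
Hash value 1: [25] [16] [46]
Hash value 2: [14] [26] [56] [14] [14]
Records with equal hash values are sorted by the sort device 34 of FIG. 95, and the result is sent to the group-by function processing device 35, which does the counting. The lexicographic order described above is used for this sort. In FIG. 94, the hash value 0 records are sorted once no more hash value 0 records are extracted, and the hash value 1 records are sorted once no more hash value 1 records are extracted.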
【０１０３】
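This is repeated to the end, and the output records have the form [key value, count]. Here a key value consists of two item numbers.
【０１０４】
First the hash value 0 records are sorted, giving [12] [12] [15] [15] [24] [24] [45] [45]
as the sorted result; this is sent to the group-by function processing device 35, giving
[12, 2] [15, 2] [24, 2] [45, 2]. Next the hash value 1 records are sorted, giving
[16] [25] [46]
as the sorted result; this is sent to the group-by function processing device 35, giving
[16, 1] [25, 1] [46, 1]
as the result. Finally the hash value 2 records are sorted, giving
[14] [14] [14] [26] [56]
as the sorted result; this is sent to the group-by function processing device 35, giving
[14, 3] [26, 1] [56, 1]
as the result.
【０１０５】
Overall,
[12, 2] [15, 2] [24, 2] [45, 2] [16, 1] [25, 1] [46, 1] [14, 3] [26, 1] [56, 1]
is obtained.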
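The hash-then-sort-then-count flow above can be sketched in a few lines of Python. This is an in-memory simplification: it omits the block buffers, runs and auxiliary information list that the described method uses to keep secondary-storage access sequential, and keeps only the three logical steps (partition by hash value, sort each partition lexicographically, count equal keys). The function name is illustrative.

```python
from collections import defaultdict
from itertools import groupby

# The 16 input records, each a pair of item numbers.
RECORDS = [
    (1, 2), (1, 4), (2, 4), (2, 6), (1, 4), (1, 5), (1, 6), (4, 5),
    (4, 6), (5, 6), (1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5),
]


def hash_group_by(records, n_buckets=3):
    """Partition by (sum of item numbers) mod n_buckets, then sort and count."""
    buckets = defaultdict(list)
    for record in records:
        buckets[sum(record) % n_buckets].append(record)
    result = []
    for hash_value in sorted(buckets):
        # Tuples compare lexicographically, matching the key order in the text.
        for key, group in groupby(sorted(buckets[hash_value])):
            result.append((key, len(list(group))))
    return result


for key, count in hash_group_by(RECORDS):
    print(key, count)
```

Records with equal keys always share a hash value, so each partition can be sorted and counted independently of the others.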
【０１０６】
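As described in detail above, the group-by processing method of the invention speeds up hashing by reading and writing data sequentially in relatively large blocks, so that access to secondary storage is as continuous as possible, and performing group-by processing on the hashed result speeds up the processing as a whole.
【０１０７】
Further, the data combination counting method of the invention prunes unnecessary candidate item combinations with bitmaps, which keep memory use small, so pruning remains efficient even when little memory is available; combined with faster group-by processing for counting itemsets, this makes combination counting efficient. Applying the method to correlation rule generation as a database mining technique contributes greatly to more efficient data mining.
【０１０８】
[Patent Document 1]
JP H11-3342 A (claims 47 to 50; paragraphs [0001] to [0004], [0020] to [0037], [0170] to [0209], [0239] to [0279]; FIGS. 3, 8, 9, 18, 19, 46, 47 to 79, and 124 to 171)
[Patent Document 2]
Japanese Patent Application No. 2001-340817 (pages 2 to 19, FIGS. 1 to 34)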
【０１０９】
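[Problems to be solved by the invention] The conventional techniques described above have the following problem.
【０１１０】
In general, conventional extraction of time series correlation rules uses either the raw transaction data or transaction data restricted to some subset, such as a period or a store. With these methods, the number of combinations grows as the number of transactions grows, and processing takes a great deal of time.
【０１１１】
The object of the present invention is to solve this problem and to keep processing from taking a great deal of time even when the number of transactions increases.
【０１１２】
[Means for solving the problems] To achieve this object, the present invention is configured as follows.
【０１１３】
(1): A time series correlation extracting device that, from sequence data of items contained in temporally consecutive transactions, obtains the single items or combinations of two or more items that satisfy given conditions together with their occurrence counts, preserving the order in the sequence for combinations of i = 2 or more items, and extracts time series correlation rules, the device comprising a time series filter unit that filters input data and keeps the necessary data, and a time series correlation engine unit that extracts time series correlation rules from input data, wherein the time series filter unit comprises: specifying means for specifying a search pattern, when searching a set of records each consisting of a plurality of attributes for combinations of records, by using a plurality of events, each defining that a predetermined attribute in a record takes a specific value, and an order relation among the events defined on the basis of the order of the attribute values; search means for searching the set of records for the combinations of records corresponding to the specified search pattern; and output means for outputting the search result.
【０１１４】
(2): The device of (1), in which the time series filter unit is placed before the time series correlation engine unit to provide the function of extracting time series correlation rules.
【０１１５】
(3): The device of (1), in which the time series filter unit is placed after the time series correlation engine unit to provide the function of extracting time series correlation rules.
【０１１６】
(4): The device of (1), in which time series filter units are placed both before and after the time series correlation engine unit to provide the function of extracting time series correlation rules.
【０１１７】
(5): The device of any of (1) to (4), in which the time series filter unit is placed outside the time series correlation engine unit to provide the function of extracting time series correlation rules.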
【０１１８】
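(Operation)
The operation of the invention based on the above configuration is described with reference to FIG. 1.
【０１１９】
(a): The device of (1), from sequence data of items contained in temporally consecutive transactions, obtains the single items or combinations of two or more items that satisfy the given conditions together with their occurrence counts, preserving the order in the sequence for combinations of i = 2 or more items, and extracts time series correlation rules.
【０１２０】
Here the time series filter unit 4 filters the input data and keeps the necessary data, and the time series correlation engine unit 5 generates and extracts time series correlation rules from the input data. In particular, the filter unit 4 operates as follows.
【０１２１】
The filter unit 4 has the specifying means, search means and output means, and searches a set of records each consisting of a plurality of attributes for combinations of records. The specifying means specifies a search pattern (event pattern) using a plurality of events, each defining that a predetermined attribute in a record takes a specific value, and the order relation among those events defined on the basis of the order of the attribute.
【０１２２】
The search means searches the set of records for the combinations of records that match the specified search pattern, and the output means outputs the search result. An event is defined as the state in which a predetermined attribute in a record takes a specific value, and the order relation among events is defined, between the records corresponding to those events, on the basis of the order of the values of one or more attributes.
【０１２３】
The user uses the specifying means to specify the search pattern determined by these events and the order relation among them. The specifying means passes the search pattern to the search means, which interprets it and extracts the matching combinations of records. The output means then outputs information such as the extracted records as the search result.
【０１２４】
Such a device makes it easy to specify a wide variety of search patterns, including cases where two or more events occupy the same position in the order and cases where the order relation among events is described with arbitrary gaps, so general-purpose, order-aware time series filtering becomes possible. In this way, even when the number of transactions increases, processing is kept from taking a great deal of time, and time series correlation rules can be extracted quickly.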
【０１２５】
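(b): In (2), the filter unit 4 is placed before the engine unit 5, and the two are built as one program. The input data is filtered by the filter unit 4, which reduces the number of combinations and shortens processing time.
【０１２６】
The engine unit 5 then generates, extracts and outputs the time series correlation rules. This keeps processing from taking a great deal of time even when the number of transactions increases, so time series correlation rules can be extracted quickly.
【０１２７】
(c): In (3), the engine unit 5 is placed before the filter unit 4, and the two are built as one program. The engine unit 5 generates and extracts time series correlation rules from the input data, after which the filter unit 4 filters them and the resulting rules are output. This keeps processing from taking a great deal of time even when the number of transactions increases, so time series correlation rules can be extracted quickly.
【０１２８】
(d): In (4), filter units 4 are placed both before and after the engine unit 5, and the two filter units 4 and the one engine unit 5 are built as one program. The input data is filtered by a filter unit 4, the engine unit 5 then generates and extracts time series correlation rules, and another filter unit 4 filters them to output the correlation rules.
【０１２９】
In this example the upstream filter unit 4 reduces the number of combinations and the downstream filter unit 4 removes useless results. This keeps processing from taking a great deal of time even when the number of transactions increases, so time series correlation rules can be extracted quickly.
【０１３０】
(e): In (5), the filter unit 4 and the engine unit 5 are each built as a separate program unit: the filter unit 4 as one program and the engine unit 5 as another.
【０１３１】
This keeps processing from taking a great deal of time even when the number of transactions increases, so time series correlation rules can be extracted quickly. In addition, because the filter unit 4 and the engine unit 5 are separate programs, they can run in parallel, and the filter unit 4 can be replaced.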
【０１３２】
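[Embodiments of the invention] Embodiments of the present invention are described in detail below with reference to the drawings.
【０１３３】
§1: The time series correlation extracting device
(1): Configuration of the device
FIG. 2 illustrates the time series correlation extracting device, whose configuration is described below.
【０１３４】
As FIG. 2 shows, an input device 2 and an output device 3 are connected to the time series correlation extracting device 1, which contains a time series filter unit 4, a time series correlation engine unit 5, and so on.
【０１３５】
The input device 2 inputs the data for which correlation rules are to be found, and other data. The filter unit 4 filters the input data and keeps the necessary data. The engine unit 5 generates and extracts time series correlation rules from the input data. The output device 3 outputs the extracted correlation rules.
【０１３６】
(2): Programs of the device
FIG. 3 is part 1 of the illustration of the device's programs: part A shows example 1, part B example 2, part C example 3 and part D example 4. FIG. 4 is part 2: part E shows example 5, part F example 6, part G example 7 and part H example 8.
【０１３７】
The filter unit 4 and the engine unit 5 are each implemented as programs, which are related as follows. In FIGS. 3 and 4, "P" indicates a single program unit (the same as one program). Likewise, P1 and P2 each indicate one program unit (one program).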
【０１３８】
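①: Example 1 (part A of FIG. 3) places the filter unit 4 before the engine unit 5 and builds both as unit P (one program). The input data is filtered by the filter unit 4, after which the engine unit 5 extracts and outputs the time series correlation rules.
【０１３９】
This keeps processing from taking a great deal of time even when the number of transactions increases, so time series correlation rules can be extracted quickly.
【０１４０】
②: Example 2 (part B of FIG. 3) is the reverse of example 1: the engine unit 5 is placed before the filter unit 4, and both are built as unit P (one program). The engine unit 5 generates and extracts time series correlation rules from the input data, after which the filter unit 4 filters them and the resulting rules are output.
【０１４１】
This keeps processing from taking a great deal of time even when the number of transactions increases, so time series correlation rules can be extracted quickly.
【０１４２】
③: Example 3 (part C of FIG. 3) places filter units 4 both before and after the engine unit 5 and builds the two filter units 4 and the one engine unit 5 as unit P (one program). The input data is filtered by a filter unit 4, the engine unit 5 then generates and extracts time series correlation rules, and another filter unit 4 filters them to output the correlation rules.
【０１４３】
In this example the upstream filter unit 4 reduces the number of combinations and the downstream filter unit 4 removes useless results. This keeps processing from taking a great deal of time even when the number of transactions increases, so time series correlation rules can be extracted quickly.
【０１４４】
④: Example 4 (part D of FIG. 3) builds the filter unit 4 and the engine unit 5 each as a separate program unit: the filter unit 4 as unit P1 (one program) and the engine unit 5 as a separate unit P2 (one program).
【０１４５】
This keeps processing from taking a great deal of time even when the number of transactions increases, so time series correlation rules can be extracted quickly. In addition, because the filter unit 4 and the engine unit 5 are separate programs, they can run in parallel, and the filter unit 4 can be replaced.
【０１４６】
⑤: Example 5 (part E of FIG. 4) also builds the filter unit 4 and the engine unit 5 as separate program units, but in the reverse order of example 4: the engine unit 5 comes first as unit P1 (one program), and the filter unit 4 follows as a separate unit P2 (one program).
【０１４７】
⑥: Example 6 (part F of FIG. 4) has the filter unit 4 first as unit P1 (one program), followed by the engine unit 5 and a filter unit 4 together as unit P2 (one program). P1 and P2 are separate programs.
【０１４８】
⑦: Example 7 (part G of FIG. 4) has the filter unit 4 and the engine unit 5 first as unit P1 (one program), followed by a filter unit 4 as unit P2 (one program). P1 and P2 are separate programs.
【０１４９】
⑧: Example 8 (part H of FIG. 4) places filter units 4 both before and after the engine unit 5, with the two filter units 4 and the one engine unit 5 each built as a separate program (that is, example 3 with its two filter units 4 and one engine unit 5 each built as separate programs).
【０１５０】
That is, the filter unit 4 comes first as unit P1 (one program), followed by the engine unit 5 as unit P2 (one program), followed in turn by a filter unit 4 as unit P3 (one program).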
【０１５１】
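(3): Input data format
FIG. 5 shows the input data format: part A is the input data in text form, and part B explains the format. FIG. 6 illustrates the processing: part A shows the combinations (length 2), part B the combinations after CID = 1 and 4 are deleted, and part C an input for which a simple filter has no effect.
【０１５２】
In FIG. 5, "CID" is an identification number such as a customer ID. "Count" is the number of items, counting the delimiter "|" as well. "Item code" is the identification number of an item. The input is one time series per CID. Item codes are arranged from left to right in the figure, from oldest to newest. The items between delimiters "|" (items separated by ",") have no order among themselves.
【０１５３】
Suppose we want to find the correlation rules in which the event "item 6" occurs last (rules of the form "X → 6"). A conventional correlation engine generates combinations like those in part A of FIG. 6 from the input data in part A of FIG. 5. For length 2, there are 29 unique combinations, as shown in part A of FIG. 6.
【０１５４】
Now the input data is filtered by the time series filter unit 4 to delete the data of the CIDs in which the event "6" does not occur last (CID = 1, 4). The combinations are then reduced to 4, as in part B of FIG. 6.
【０１５５】
In general, most of the time in extracting time series correlation rules is spent generating combinations, and the number of combinations explodes as the period (length) of the data grows. Deleting unnecessary data in advance is therefore a very effective way to shorten processing time. However, when looking for correlation rules in which the event "6" occurs last, simply deleting the data of the CIDs whose input contains no 6 is not enough. What is needed here is a time series filter unit 4 that can delete everything except the data that has two or more events in sequence with event "6" occurring as the second or a later event. For an input like part C of FIG. 6, a simple filter that deletes the data not containing event "6" has no effect at all.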
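The difference between the order-aware filter and a simple containment filter can be sketched as follows. The input strings are hypothetical (the data of FIG. 5 is not reproduced in the text) and use "|" between elements and "," within an element; the function names are illustrative, and the real filter unit accepts general event patterns rather than this one hard-coded condition.

```python
def parse_sequence(text):
    """'1,2|6' -> [[1, 2], [6]]; elements are ordered oldest first."""
    return [[int(code) for code in element.split(",")] for element in text.split("|")]


def occurs_after_first(sequence, target):
    """True if there are two or more elements and `target` is in the second or a later one."""
    return len(sequence) >= 2 and any(target in element for element in sequence[1:])


def time_series_filter(data, target):
    """Keep only the CIDs that can yield a rule of the form X -> target."""
    return {cid: seq for cid, seq in data.items() if occurs_after_first(seq, target)}


def simple_filter(data, target):
    """Order-blind filter: keep every CID whose data contains `target` anywhere."""
    return {cid: seq for cid, seq in data.items() if any(target in e for e in seq)}


# Hypothetical input, one time series per CID.
raw = {1: "1|2|3", 2: "1,2|6", 3: "6|1|2", 4: "6"}
data = {cid: parse_sequence(text) for cid, text in raw.items()}

print(sorted(time_series_filter(data, 6)))  # CIDs kept by the order-aware filter
print(sorted(simple_filter(data, 6)))       # CIDs kept by the simple filter
```

CIDs 3 and 4 contain a 6 but nothing can precede it, so only the order-aware filter removes them before the combinations are generated.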
【０１５６】
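(4): Processing described by flowcharts
①: Example 1
FIG. 7 is the processing flowchart of example 1, which is described below. S11 to S15 denote the processing steps.
【０１５７】
This is the processing of example 1 shown in part A of FIG. 3; S11 to S14 are performed by the filter unit 4 and S15 by the engine unit 5. The filter unit 4 first checks whether there is data (S11). If there is none, processing ends; if there is, it reads the data (S12) and performs the filter processing (S13).
【０１５８】
The filter unit 4 then checks whether the filter condition is satisfied (S14); if not, control returns to S11. If the condition is satisfied, control passes to the processing by the engine unit 5 (S15). This completes the processing of the filter unit 4 and the engine unit 5 in unit P.
【０１５９】
②: Example 2
FIG. 8 is the processing flowchart of example 2, which is described below. S21 to S26 denote the processing steps.
【０１６０】
This is the processing of example 2 shown in part B of FIG. 3; S21 to S23 are performed by the engine unit 5 and S24 to S26 by the filter unit 4. The engine unit 5 checks whether there is data (S21). If there is none, processing ends; if there is, it reads the data (S22) and performs the engine processing (S23).
【０１６１】
Next the filter unit 4 performs the filter processing (S24) and checks whether the filter condition is satisfied (S25); if not, control returns to S21. If the condition is satisfied, the data is output (S26).
【０１６２】
③: Example 3
FIG. 9 is the processing flowchart of example 3, which is described below. S31 to S39 denote the processing steps.
【０１６３】
This is the processing of example 3 shown in part C of FIG. 3; S31 to S34 are performed by a filter unit 4, S35 by the engine unit 5, and S36 to S39 by a filter unit 4.
【０１６４】
The filter unit 4 first checks whether there is data (S31). If there is none, processing ends; if there is, it reads the data (S32) and performs the filter processing (S33). It then checks whether the filter condition is satisfied (S34); if not, control returns to S31.
【０１６５】
If the filter condition is satisfied, the engine unit 5 performs its processing (S35). The filter unit 4 then again checks whether there is data (S36). If there is none, processing ends; if there is, it performs the filter processing (S37). It then checks whether the filter condition is satisfied (S38); if not, control returns to S36. If the condition is satisfied, the data is output (S39) and control returns to S31.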
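The arrangement of example 3 (filter, engine, filter) can be sketched as three composed functions. This is a batch-style simplification of the record-by-record loop in the flowchart, and it rests on several assumptions: each sequence is a flat list of item codes with one item per element, the engine only counts ordered pairs "a before b" once per sequence, and the names `pre_filter`, `engine` and `post_filter` are made up for this illustration.

```python
from collections import Counter


def pre_filter(sequences, target):
    """Upstream filter: keep sequences where `target` occurs second or later."""
    return [seq for seq in sequences if target in seq[1:]]


def engine(sequences, min_count):
    """Toy engine: count ordered pairs (a before b), once per sequence."""
    counts = Counter()
    for seq in sequences:
        pairs = {(seq[i], seq[j]) for i in range(len(seq)) for j in range(i + 1, len(seq))}
        counts.update(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_count}


def post_filter(rules, target):
    """Downstream filter: keep only rules of the form X -> target."""
    return {pair: n for pair, n in rules.items() if pair[1] == target}


def extract(sequences, target, min_count):
    """Example 3 arrangement: filter, then engine, then filter again."""
    return post_filter(engine(pre_filter(sequences, target), min_count), target)


sequences = [[1, 6], [2, 1, 6], [6, 1], [1, 2]]
print(extract(sequences, target=6, min_count=2))
```

Dropping `pre_filter` gives the arrangement of example 2 and dropping `post_filter` gives example 1; the upstream filter is the one that shrinks the number of pairs the engine has to generate.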
【０１６６】
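(5): A concrete device example
FIG. 10 shows the configuration of a concrete device. The time series correlation extracting device can be realized on any computer, such as a workstation or a personal computer. The device consists of a computer main unit 10 and, connected to it, a display device 11, an input device (keyboard, mouse, etc.) 12, a removable disk drive ("RDD") 13, a hard disk drive ("HDD") 14, and so on.
【０１６７】
The main unit 10 contains a CPU 15 that performs internal control and processing, a ROM 16 (non-volatile memory) for storing programs and various data, a memory 17, an interface control unit ("I/F control unit") 18, a communication control unit 19, and so on. The RDD 13 includes flexible disk drives, optical disk drives and the like.
【０１６８】
In this device, for example, a program that implements the processing of the time series correlation extracting device is stored on the magnetic disk (recording medium) of the HDD 14, and the CPU 15 reads and executes it to carry out that processing.
【０１６９】
The invention is not limited to this example, however; the program may also be stored on the magnetic disk of the HDD 14 in the following ways and executed by the CPU 15.
【０１７０】
①: A program stored on a removable disk created on another device (program data created on another device) is read by the RDD 13 and stored on the recording medium of the HDD 14.
【０１７１】
②: Data such as a program transmitted from another device over a communication line is received via the communication control unit 19 and stored on the recording medium (magnetic disk) of the HDD 14.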
【０１７２】
§２：時系列フィルタ部の説明
（１）　：概要
前記時系列フィルタ部４（図１〜図４参照）は、特許文献２の明細書及び図面に記載されており、以下、詳細に説明する。この時系列フィルタ部は、「順序を考慮したパターンを用いた検索装置および方法」と題する発明であり、以下、これを、本発明の「時系列フィルタ部」と呼んで説明する。
【０１７３】
Ａ：順序に基づいてデータを扱うために従来から利用されてきた技術には、関係データベースや、文字列の出現順序に対する正規表現によるパターンマッチングがあり、株価のような時系列データに対しては、専用のアプリケーションが用いられてきた。以下では、従来の技術について、その特徴及び利用分野について述べる。
【０１７４】
Ｂ：関係データベース
大量のデータを格納するために、関係データベースが広く利用されている。関係データベースでは、考案者であるＥ．Ｆ．Ｃｏｄｄによる論文（１９７０年）に明記されているように、データ集合に順序の概念がない。但し、日付型や時間型といったデータ型が用意されていることや、レコードを値にしたがって並び替える（ソーティング）機能が提供されており、順序を持つデータを格納するのに利用される。
【０１７５】
このデータベースでは、データ型として日付を扱うことはできるが、順序の概念がないため、時系列のように順序を意識したデータのパターンを探し出す問い合わせは、ＳＱＬ（Ｓｔｒｕｃｔｕｒｅｄ　Ｑｕｅｒｙ　Ｌａｎｇｕａｇｅ　）で全て処理することはできない。
【０１７６】
そこで、データベースと外部プログラムを組み合わせて処理を行っており、順序を考慮したパターンを取り出す毎にプログラムが必要になる。
【０１７７】
Ｃ：正規表現によるパターンマッチング
文字列検索の分野では、文字列が出現する順序に着目して、正規表現によるパターンマッチングが利用されてきた。先ず、文字列検索とパターンマッチングの相違について明確にしておく。文字列検索とは、「文章中からパターンａｂｃを探せ」のように、探索すべきパターンが完全に確定しているものを言う。一方、パターンマッチングとは、不確定のパターンを探す操作のことで、パターン照合とも呼ばれている。パターンマッチングでは、パターンを指定するのに正規表現が用いられる。
【０１７８】
図１１は、文字列データと正規表現ａ（ａ｜ｂ）＊ａのパターンマッチングの例を示している。ここで、（ａ｜ｂ）＊は、ａまたはｂが０回以上繰り返し出現することを意味している。文字列検索とパターンマッチングは、一見似たように見えるが、異なるカテゴリに属する問題で、それぞれに対して別のアルゴリズムを適用しなければならない。
【０１７９】
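Pattern matching with regular expressions is implemented using a finite automaton. Converting a regular expression into an automaton takes two steps. First, the regular expression is converted into a non-deterministic finite automaton (NFA); this conversion is straightforward. Pattern matching can be performed with the NFA alone, but the more common approach is to convert the resulting NFA into an equivalent deterministic finite automaton (DFA) and perform pattern matching with the DFA.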
【０１８０】
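In a DFA, as the word "deterministic" suggests, the destination state is uniquely determined once the input in a given state is known. In an NFA, as the word "non-deterministic" suggests, a given state may have more than one destination state for the same input.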
【０１８１】
図１２は、正規表現ａ（ａ｜ｂ）＊ａに対応するＮＦＡを示している。このＮＦＡにａａａという文字列を与えた場合を考える。１文字目のａを入力すると、初期状態０から１に遷移する。２文字目もａであるが、状態１には文字ａに対する遷移先として、状態１と状態２の２種類の状態がある。結論から言えば、２文字目のａでは状態１から状態１へと遷移し、３文字目のａでは状態２に遷移するのが正解なのであるが、２文字目のａを読み込んだ時点では、どちらへ進めばよいかわからない。
【０１８２】
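Solving this problem requires backtracking: the matcher tentatively moves to one of the candidate states and continues processing, and if that attempt fails it moves to the other state. Backtracking, however, costs processing time for each step back.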
【０１８３】
そこで、正規表現を変換して得られたＮＦＡをそのまま使うのではなく、ＮＦＡをさらにＤＦＡに変換してから、パターンマッチング処理を行う。ＤＦＡの場合には、ＮＦＡと違って、状態と入力が決まれば遷移先が必ず１つだけに決まるので、ＮＦＡのようなバックトラックは必要なく、処理を高速に実行することが可能になる。
【０１８４】
例えば、図１２のＮＦＡは、図１３に示すようなＤＦＡに変換される。もちろん、ＮＦＡをＤＦＡに変換するのに時間がかかるが、大量のデータに対してパターンマッチングを行う場合、バックトラックがないＤＦＡの高速性により、全体としては十分に処理が高速化される。
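The NFA-to-DFA conversion described above can be sketched with the standard subset construction. The transition table below follows the NFA described for FIG. 12 (state 0 initial, state 1 with two destinations on `a`, state 2 accepting); the function names and the `frozenset` representation of DFA states are assumptions made for illustration, and the resulting DFA is not claimed to be identical to FIG. 13.

```python
# NFA for a(a|b)*a as described for FIG. 12: state 0 is initial, state 2 is accepting.
NFA = {(0, "a"): {1}, (1, "a"): {1, 2}, (1, "b"): {1}}
START, ACCEPTING = 0, {2}


def build_dfa(alphabet="ab"):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset({START})
    table, todo = {}, [start]
    while todo:
        current = todo.pop()
        if current in table:
            continue
        table[current] = {}
        for ch in alphabet:
            target = frozenset(t for s in current for t in NFA.get((s, ch), ()))
            if target:
                table[current][ch] = target
                todo.append(target)
    return start, table


def dfa_accepts(text, dfa=None):
    """Run the DFA over `text`; every step has at most one destination."""
    start, table = dfa or build_dfa()
    state = start
    for ch in text:
        state = table[state].get(ch)
        if state is None:       # no transition: reject without backtracking
            return False
    return bool(state & ACCEPTING)
```

Building the table costs time up front, but each input character is then handled by a single dictionary lookup, which is the trade-off the text describes.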
【０１８５】
ところで、正規表現は、図１４に示すような連結（ｃｏｎｃａｔｅｎａｔｉｏｎ　）、選択（ｕｎｉｏｎ　）、および繰り返し（閉包：ｃｌｏｓｕｒｅ　）の３つの基本操作（演算子）によって、再帰的に定義される。これらの操作の間には、普通の数式と同様に、優先順位があり、最も強く結合するのが繰り返し“＊”であり、次に強く結合するのが連結であり、最後が選択“｜”という順番になる。ただし、文字や記号を括弧で囲むことによって、優先順位を変えることもできる。
【０１８６】
POSIX (Portable Operating System Interface for UNIX (registered trademark)) 1003.2 defines two kinds of regular expressions: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). UNIX software utilities that use BRE include ed, ex, vi, more, sed, and grep; utilities that use ERE include awk and grep with the -E option.
【０１８７】
Ｄ：時系列アプリケーション
順序を持つデータに対しては、株価の相場予想や、データマイニングにおけるシーケンシャルパターンのように、専用のアプリケーションを用いて処理が行われる場合がある。専用のアプリケーションを利用すれば、時系列のパターン検索を高速に処理することが可能であるが、専用のアプリケーションは、汎用的なさまざまな問い合わせにおいて必ず利用できるとは限らない。
【０１８８】
Ｅ：課題
ところで、前記の処理には、次のような課題がある。
【０１８９】
▲１▼：文字列における正規表現と、それによるパターンマッチングは、あらゆるクラスの文字列に対する検索方法を提供する一般的なフレームワークである。しかし、順序を持つデータは、以下のＡ、Ｂ、Ｃのように、データの特性が文字列データとは異なるため、正規表現とそのパターンマッチングを適用することができない。
【０１９０】
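A: In a character string, characters exist only side by side at equal intervals, whereas in ordered data several events may exist at the same position. Making several purchases on the same day is one such case, and it calls for an expression such as "customer 10001 purchased two products, milk and bread, on March 21". In a character string, however, two characters never appear at the same position, so a regular expression cannot express events that occur simultaneously.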
【０１９１】
Ｂ：文字列では、値と記号（リテラル）とが等しい。つまり、文字列として“Ａ”が与えられた時、“Ａ”は値であり、記号としてのＡでもある。しかし、複数の属性からなるデータでは、「２００円以上のパンを購入した顧客を‘顧客グループＡ’と呼ぶ」というように、複数フィールド（この場合、商品と価格）の条件の組み合わせを１つの記号として扱う必要がある。このような複数フィールドの条件の組み合わせを、正規表現で表すことはできない。
【０１９２】
Ｃ：順序を持ったデータの場合、「パンを購入してから２日以内にチーズを購入」というように、順序に間隔の概念が必要となる。しかし、文字列の正規表現では、間隔を指定することはできない。
【０１９３】
図１５は、順序を持つデータの例として、小売り業売上データを示している。この例では、顧客毎に商品を購入した日付が順序にあたる。３月２１日（０３．２１）に顧客１０００１は、牛乳とパンを同時に購入しているが、同時に出現する事象を正規表現で記述することはできない。また、小売店の売上データでは、休業日にあたる３月２２日のデータは存在しないが、文字列では特定の順序の文字が存在しないという状況はあり得ない。さらに、ある商品を購入して次の来店までの間隔が３日以内であるといったように、順序間にまたがる関係を考慮した記述は、正規表現では困難である。
【０１９４】
以上説明したように、順序を持つレコード群の中から、順序を考慮したパターンを一般的に指定することは、従来の正規表現とオートマトン理論では不可能である。
【０１９５】
また、関係データベースでは、順序関係については限定された形でしかサポートされていない。したがって、順序を持つデータから、指定したパターンを発見しようとする場合、データベースと外部プログラムの組み合わせで処理する必要がある。
【０１９６】
しかし、問い合わせ毎にプログラムを作成する方法では、正規表現によるパターンマッチングのように、指定するパターンを変更するだけでパターンマッチングを実行することはできない。
【０１９７】
更に、株価の予測のように、順序を持ったデータを解析するために、専用のアプリケーションを用いることも考えられる。専用のアプリケーションは、ある目的のためにデータを準備すれば特定の結果を返すので、その目的に限定すれば、高速な処理を行うことは可能である。しかし、専用のアプリケーションでは、専用であるがゆえに、正規表現によるパターンマッチングのように幅広い問題に対処することは困難である。そこで、以下では、データ間の順序関係のための汎用的な表現を用いて、順序に基づくデータのパターンを検索する装置および方法について説明する。
【０１９８】
図１６は、時系列フィルタ部の原理的な説明図である。図１６の検索装置は、指定手段１０１、検索手段１０２、および出力手段１０３を備え、複数の属性からなるレコードの集合１０４からレコードの組み合わせを検索する。
【０１９９】
指定手段１０１は、レコード内の所定属性が特定の値をとることをそれぞれ定義する複数イベントと、属性値の順序に基づいて定義されたそれらのイベントの間の順序関係とを用いて、検索パターン（イベントパターン）１０５を指定する。検索手段１０２は、レコードの集合１０４から、指定された検索パターン１０５に対応するレコードの組み合わせを検索し、出力手段１０３は、検索結果を出力する。
【０２００】
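An event is defined as the state in which a given attribute in a record takes a specific value, and the order relation between events is defined, between the records corresponding to those events, on the basis of the order relation of the values of one or more attributes. Using the specifying means 101, the user specifies the search pattern 105 determined by these events and the order relation between them. The specifying means 101 passes the search pattern 105 to the search means 102, which interprets the received search pattern 105 and extracts the combinations of records that correspond to it. The output means 103 then outputs information such as the extracted records as the search result.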
【０２０１】
このような検索装置によれば、２つ以上のイベントが同じ順序に存在する場合や、複数のイベントの順序関係が任意の間隔で記述される場合を含めて、様々な検索パターンを容易に指定することが可能となり、順序を考慮した汎用的なデータ検索が実現される。
【０２０２】
Ｆ：検索装置の説明
以下、図面を参照しながら、詳細な装置について説明する。対象とするデータは、図１５に示したような、複数のフィールド（属性）を持つレコードの集合からなるデータである。各レコードは決まった数のフィールドを持ち、あるレコードのみが異なった数のフィールドを持つことはないものとする。更に、データ中には、順序を持ったフィールドが１つ以上含まれているものとする。
【０２０３】
順序を持ったフィールドとは、日付や時刻といった順序関係をあらかじめ持っているフィールドか、顧客ＩＤ（顧客識別子）フィールドのように、データを並べ替えることで順序が生じるものである。また、日付フィールドと時刻フィールドを合わせて１つの順序フィールドと考えることができるように、複数のフィールドを複合して順序を持たせる場合もある。
【０２０４】
このような対象データの中から順序を持つパターンを検索する際に、イベント定義とイベント間定義による汎用的なパターン指定を行い、指定されたパターンを解釈して検索を実行する、汎用的な処理系を用いる。イベント定義とは、１つあるいは複数のフィールドについて指定した条件に、一意に名前を付けたものである。１つのフィールドについて条件を指定する場合は、例えば、「商品としてＰＣを購入した顧客を‘顧客グループＡ’と呼ぶ」というように定義できる。
【０２０５】
また、複数のフィールドについて条件を指定する場合には、例えば、「商品＝ＰＣで価格＝２５万円の商品を購入した顧客を‘顧客グループＡ’と呼ぶ」というように、複数フィールドの条件の組み合わせを１つの記号（リテラル）として扱う。つまり、イベントとは、１つあるいは複数のフィールドについて条件を満たしたレコードに関するラベルであると定義する。さらに、正規表現におけるワイルドカードのように、何でもマッチするような条件を指定することも可能とする。
【０２０６】
図１７は、「商品＝ＰＣで価格＝２５万円の商品を購入した顧客を‘顧客グループＡ’と呼ぶ」で定義されるイベント定義例を示している。既に説明したように、複数フィールドの条件の組み合わせを文字列の正規表現で表すことは不可能である。
【０２０７】
次に、イベント間定義とは、イベント定義を利用して、イベントとイベントの間にまたがる関係を記述したものである。イベント間定義においては、同じ順序に複数のイベントが存在する場合や、順序の間隔が一定でない場合（任意の間隔で順序が記述される場合）も考慮される。
【０２０８】
具体的な例としては、２５万円のＰＣを購入した顧客のイベントを‘Ａ’と呼び、１０万円のＴＶを購入した顧客のイベントを‘Ｂ’と呼ぶことにすると、「‘Ａ’から‘Ｂ’までの間隔が３来店以内」というようなイベント間定義が考えられる。また、「‘Ａ’から‘Ｂ’までの間隔が３日以内（‘Ａ’の日付フィールドと‘Ｂ’の日付フィールドの差が３日以内）」といった定義も考えられる。
【０２０９】
また、順序を持つフィールド以外でも、イベントとイベントにまたがる制約を記述することができる。例えば、「‘Ａ’における価格が‘Ｂ’における価格より大きい」といった定義が可能である。さらに、イベント定義がどのようなパターンにもマッチするワイルドカードで指定されている場合でも、イベント間定義によって条件を指定することが可能である。
【０２１０】
図１８は、イベント間定義の例を示している。この例においては、イベント‘Ａ’とイベント‘Ｂ’の間隔が３日以内であり、イベント‘Ｂ’とイベント‘Ｃ’の間隔が２日以内であり、イベント‘Ａ’とイベント‘Ｃ’の間隔が５日以内であることが定義されている。
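One way to picture event definitions and inter-event definitions is as two small record types plus a check over a candidate binding. The sketch below is illustrative only: the class names, the `binding` dictionary, and the contents of events A, B, and C are assumptions (A and B reuse the PC and TV example from the text, C is a hypothetical wildcard), and it assumes the later event may not precede the earlier one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EventDef:
    name: str
    conditions: dict            # field -> required value; {} acts as a wildcard

    def matches(self, record):
        return all(record.get(f) == v for f, v in self.conditions.items())


@dataclass
class MaxGapDays:
    earlier: str
    later: str
    days: int

    def holds(self, binding):
        gap = (binding[self.later]["date"] - binding[self.earlier]["date"]).days
        return 0 <= gap <= self.days    # assumption: `later` must not precede `earlier`


EVENTS = [
    EventDef("A", {"item": "PC", "price": 250000}),
    EventDef("B", {"item": "TV", "price": 100000}),
    EventDef("C", {}),                  # hypothetical third event (wildcard)
]
# The three interval constraints stated for FIG. 18.
CONSTRAINTS = [MaxGapDays("A", "B", 3), MaxGapDays("B", "C", 2), MaxGapDays("A", "C", 5)]


def pattern_holds(binding):
    """binding maps each event name to the record proposed for it."""
    return (all(e.matches(binding[e.name]) for e in EVENTS)
            and all(c.holds(binding) for c in CONSTRAINTS))
```

Because the constraints connect A to B, B to C, and A to C, they form a small graph over the events rather than a linear sequence.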
【０２１１】
正規表現において、全ての文字にマッチする‘・’を利用してａ・・ｂと表記するのは、あくまでもリテラル‘ａ’の３文字後に‘ｂ’が出現するという意味であり、１つ以上のフィールドについて条件が指定されたイベント定義とは異なる。
【０２１２】
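Moreover, being able to define a relation between any pair of events means that the matching pattern to be searched for is expressed as a graph structure. In the example of FIG. 18, too, inter-event definitions exist between event definition 1 and event definition 2, between event definition 2 and event definition 3, and between event definition 1 and event definition 3.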
【０２１３】
本発明で扱われる順序パターンは、このようなグラフ構造を持ち得ることや、イベント定義とイベント間定義の組み合わせによって指定されるという点で、正規表現におけるパターンとは明確に異なる。イベント定義とイベント間定義を用いれば、従来の正規表現では記述が不可能な検索パターンを指定することが可能となる。
【０２１４】
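This search device allows an order-based pattern given by such event definitions and inter-event definitions to be specified in a general-purpose way, and it interprets and executes that pattern. An order-based pattern can be interpreted either dynamically at run time, as an interpreter does, or by replacing the pattern before execution with instructions the computer can execute, as a compiler does.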
【０２１５】
図１９は、本検索装置の基本構成を示している。図１９の検索装置は、検索対象となるデータ１１１と、イベント定義とイベント間定義によって指定された検索パターン１１２と、検索パターン１１２を解釈して実行する検索処理部１１３とからなり、検索処理部１１３は、検索結果１１４を出力する。検索パターン１１２の定義を変えるだけで、様々な種類の検索に対応することができる。
【０２１６】
図２０は、検索処理部１１３による全体処理のフローチャートである。検索処理部１１３は、まず、データ１１１（ステップＳ１）と検索パターン１１２（ステップＳ２）を入力として受け取る。そして、検索パターン１１２を解釈して（ステップＳ３）、指定されたパターンを検索し（ステップＳ４）、検索結果１１４を出力する（ステップＳ５）。以下では、検索パターン１１２をイベントパターンと呼ぶことにする。
【０２１７】
図２１は、図２０のステップＳ４において検索処理部１１３が保持しているデータ構造を示している。検索処理部１１３は、データを指すポインタＰ１と、イベント定義とイベント間定義を指すポインタＰ２を備える。通常の文字列データとは違って、例えば、同じ日に発生したデータは同じ順序を持つ。したがって、ポインタＰ１が指すデータは、１つのレコードとは限らず、複数のレコードである場合もある。
【０２１８】
イベント定義については、先に出現するイベントから順番に並べて記述され、イベント間定義については、１つのイベント間定義に含まれるイベントのうち最も最後のイベントの欄に記述される。
【０２１９】
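FIG. 22 is a flowchart of the search processing in step S4 of FIG. 20. The search processing unit 113 first reads the data corresponding to pointer P1 (step S11) and checks whether pointer P1 points to the end (step S12). If pointer P1 does not point to the end, the unit checks whether the data satisfies the event definition corresponding to pointer P2 (step S13). If the data does not satisfy the event definition, the unit adds 1 to pointer P1 (step S17) and repeats the processing from step S11.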
【０２２０】
データがイベント定義を満たしている場合は、さらに、そのデータがＰ２に対応するイベント間定義を満たしているかどうかを調べる（ステップＳ１４）。データがイベント間定義を満たしていなければ、ステップＳ１７以降の処理を繰り返す。
【０２２１】
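If the data satisfies the inter-event definition, the unit next checks whether pointer P2 points to the end (step S15). If pointer P2 does not point to the end, the unit adds 1 to P2 (step S18) and repeats the processing from step S17.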
【０２２２】
ポインタＰ２が最後を指していれば、すべてのイベント定義及びイベント間定義を満たすデータの組み合わせが見つかったことになるので、それらのデータを検索結果として登録し（ステップＳ１６）、ステップＳ１７以降の処理を繰り返す。
【０２２３】
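If pointer P1 points to the end in step S12, the search processing has finished for all of the data, so the registered search results are output. In step S16, the search result may also be output immediately instead of being registered. The flowchart shown in FIG. 22 is only one example of the search processing; any pattern matching method can be used for the search.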
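The loop of FIG. 22 can be sketched as a single pass with two pointers. This is a simplified reading, not the definitive procedure: it assumes one record per position of P1 (the text allows several records at the same order), and it assumes that P2 returns to the first event definition after a result is registered, which the flowchart description does not state. The callables in `event_defs` and `inter_defs` are hypothetical.

```python
def search(records, event_defs, inter_defs):
    """event_defs[i](record) -> bool; inter_defs[i](matched_so_far, record) -> bool."""
    results, matched = [], []
    p1, p2 = 0, 0
    while p1 < len(records):                            # S11, S12
        rec = records[p1]
        if event_defs[p2](rec) and inter_defs[p2](matched, rec):   # S13, S14
            matched.append(rec)
            if p2 == len(event_defs) - 1:               # S15: last event definition
                results.append(tuple(matched))          # S16: register the result
                matched, p2 = [], 0                     # assumption: restart the pattern
            else:
                p2 += 1                                 # S18
        p1 += 1                                         # S17
    return results
```

A greedy single pass like this never revisits a record, so it can miss combinations that a different matching method would find; the text itself notes that any pattern matching method may be substituted.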
【０２２４】
次に、図２３から図３６までを参照しながら、検索装置の付加的な機能について説明する。
【０２２５】
検索装置は、多数のファイルに記録された複数の入力データ（ログ）をまとめることにより、検索対象である複数の属性からなる入力データを生成することができる。例えば、関係データベースの結合（ＪＯＩＮ）演算や、外部プログラムを利用することで、複数のファイル（データ）から検索対象データを生成する。
【０２２６】
図２３に示すテーブルと図２４に示すテーブルから検索対象データを生成する場合、検索装置は、これら２つのテーブルに対して図２５のようなＳＱＬ文を適用し、結合演算処理（Ｊｏｉｎ　Ｏｐｅｒａｔｉｏｎ）を実行する。これにより、図１５に示したデータが生成される。この場合、２つのテーブルには、それぞれ異なるフィールドが含まれているが、これらを結合することにより、両方のテーブルのフィールドを含むテーブルが生成される。
【０２２７】
また、複数の属性からなるレコードの集合に関して、各属性を整数化等の処理によって圧縮することで、検索対象データを生成することも可能である。例えば、図１５における顧客ＩＤを４バイト整数化により整数化した後、２ビットに圧縮すると、次のようになる。
【０２２８】
Integer        2-bit code
"10001"   -    00
"10002"   -    01
"10003"   -    10
圧縮された顧客ＩＤを用いれば、図１５のデータは図２６のように書き換えられる。圧縮されたデータは、出力時に元の文字列に戻して表示すればよいので、内部的には２ビットのままで処理することが可能である。これにより、処理に必要となるメモリ量を削減することができ、処理が高速化される。
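The compression described here amounts to dictionary encoding. A minimal sketch, assuming codes are assigned in order of first appearance (the function names are hypothetical and the patent does not prescribe this particular assignment):

```python
def build_codebook(values):
    """Assign consecutive integer codes in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    bits = max(1, (len(codes) - 1).bit_length())    # bits needed per code
    return codes, bits


def encode(values, codes):
    return [codes[v] for v in values]


def decode(encoded, codes):
    """Restore the original strings for display at output time."""
    reverse = {code: value for value, code in codes.items()}
    return [reverse[c] for c in encoded]
```

With the three customer IDs of the example, the code book needs 2 bits per value, and `decode` recovers the original strings when results are displayed.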
【０２２９】
また、検索対象である複数の属性からなるレコードの集合に関して、発見すべきパターンで不要なレコードを使わないことで、必要となるメモリ量を削減し、入出力のコストを減らすことができる。つまり、発見すべきパターンとして必要のないレコードについては、始めから処理の対象にしない。例えば、次のようなイベントパターンが定義されているとする。
【０２３０】
イベント定義
イベント１：商品＝牛乳
イベント２：商品＝パン
イベント間定義
イベント１．日付＝イベント２．日付
この例では、同じ日に牛乳とパンが購入されたイベントパターンが指定されており、牛乳とパン以外の商品についてのデータは必要がないことがわかる。したがって、データをファイルから読み込む際に、商品のフィールドが牛乳とパン以外の必要のないレコードについては、メモリに読み込まないようにする。
【０２３１】
図２７は、このようなレコードの削減処理を示している。ファイル１２１に記録されたデータは、一旦読み込みバッファ１２２に入力され、データから不要なレコードが削除される。そして、必要なレコードのみが、検索対象データとしてメモリ１２３上に読み込まれ、検索処理が行われる。これにより、メモリ量を削減することができるとともに、処理の高速化も期待できる。
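The reduction step of FIG. 27 can be sketched as a filter applied while rows are being read, so that unneeded records never reach memory. The CSV layout, the field names, and the third sample row are assumptions for illustration; only the milk and bread rows follow the figures quoted in the text.

```python
import csv
import io


def needed_values(event_defs, field):
    """Collect the values of `field` that any event definition asks for."""
    return {conds[field] for conds in event_defs.values() if field in conds}


def load_needed(lines, field, needed):
    """Read CSV rows one at a time and keep only those the pattern can use."""
    return [row for row in csv.DictReader(lines) if row[field] in needed]


EVENT_DEFS = {"event1": {"item": "milk"}, "event2": {"item": "bread"}}
SAMPLE = (
    "cid,date,item,price\n"
    "10001,03/21,milk,189\n"
    "10001,03/21,bread,300\n"
    "10002,03/23,cheese,400\n"      # hypothetical row that the pattern never needs
)
rows = load_needed(io.StringIO(SAMPLE), "item", needed_values(EVENT_DEFS, "item"))
```

Here `rows` holds only the milk and bread records; the cheese record is discarded during the read.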
【０２３２】
ところで、商品に大分類と小分類の２種類のクラス階層がある場合のように、データが階層構造を持っている場合、階層に基づいた値の書き換えを行うことで、階層を意識した処理が可能となる。以下に、商品の階層に基づく値の書き換えの例を示す。
【０２３３】
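Major category   Major code   Minor category     Minor code   Major code + minor code
Fresh produce    10000        Cucumber           1            10001
Fresh produce    10000        Chinese cabbage    2            10002
Seafood          20000        Horse mackerel     1            20001
Seafood          20000        Flounder           2            20002
Meat             30000        Beef               1            30001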
この例では、大分類に５桁コード（１００００，２００００，・・・）を割り当て、小分類に４桁コード（０００１〜９９９９）を割り当て、大分類コードと小分類コードを足すことで、全体として商品を一意に識別可能なコードを割り当てる。このようにして値を書き換えれば、処理の途中で、コードに基づいてそれがどの分類項目に対応するかが容易に判定でき、ある特定の分類項目のみを検索対象とすることが容易になる。
【０２３４】
例えば、イベント定義で、２００００＜＝商品コード＜＝２９９９９と指定すれば、それは魚介類という分類を指定したことになる。また、イベント定義で、商品コード＝１０００２のように、大分類と小分類の両方で指定すれば、特定の商品を指定することが可能である。
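The hierarchical code scheme reduces a category test to a range comparison. A short sketch using the codes from the table above (the English category names and the function names are assumptions):

```python
MAJOR_CODES = {"fresh produce": 10000, "seafood": 20000, "meat": 30000}


def product_code(major, minor_code):
    """Combine a major-category code with a minor code in the range 1..9999."""
    if not 1 <= minor_code <= 9999:
        raise ValueError("minor code must fit in four digits")
    return MAJOR_CODES[major] + minor_code


def in_major(code, major):
    """True if `code` falls in the range reserved for the major category."""
    base = MAJOR_CODES[major]
    return base <= code <= base + 9999
```

An event definition such as 20000 <= product code <= 29999 then corresponds to `in_major(code, "seafood")`, while an equality test on the full code selects a single product.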
【０２３５】
次に、検索処理におけるパターンマッチングの指定方法について説明する。正規表現におけるパターンマッチングでは、与えられた文字列パターンにマッチする最も長い文字列を返す「最長マッチ」が基本である。しかし、ｐｅｒｌ（Ｐｒａｃｔｉｃａｌ　Ｅｘｔｒａｃｔｉｏｎ　ａｎｄ　Ｒｅｐｏｒｔ　　Ｌａｎｇｕａｇｅ　）処理系では、与えられた文字パターンにマッチする最も短い文字列を返す「最短マッチ」を指定することができる。
【０２３６】
本例の順序に基づいたパターンマッチングについては、正規表現で用いられるマッチング指定に加えて、さらに別のマッチング指定が可能である。これらのマッチング指定について、図２８の検索対象データを用いて説明する。説明のため、図２８のデータには、各レコードを一意に識別するためのレコード番号が付加されている。イベントパターンとしては、以下のようなパターンを用いることにする。
【０２３７】
イベント定義
イベント１：商品＝牛乳
イベント２：商品＝パン
イベント間定義
イベント１．日付＜イベント２．日付
ここで、“イベント１．日付＜イベント２．日付”という条件は、イベント１の日付がイベント２の日付より前であることを示している。まず、与えられたイベントパターンにマッチする最初のパターンを返す「最初マッチ」においては、レコード番号１と４のレコードが抽出される。
【０２３８】
このレコードは、出現する最初の牛乳レコードと、そのレコードより日付が後のレコードの中で最初に出現したパンのレコードである。この場合、抽出されたパターンは、レコード番号の組み合わせを用いて、（１，４）のように表される。
【０２３９】
次に、与えられたイベントパターンにマッチするパターンの中で間隔が最短のものを返す「最短マッチ」においても、レコード番号（１，４）が抽出される。図２８のデータにおいては、イベントパターンにマッチするレコードの組み合わせのうち、レコード１とレコード４の間で間隔が最も短くなるからである。
【０２４０】
また、データの先頭から「最短マッチ」を繰り返すことによって、イベントパターンにマッチする複数のレコードの組み合わせを抽出することも可能である。この場合、検索対象データから、１回の最短マッチによりイベントパターンにマッチした部分を取り除き、残りのデータを新たに検索対象として最短マッチに適用する処理が繰り返される。
【０２４１】
次に、与えられたイベントパターンにマッチするパターンの中で間隔が最長のものを返す「最長マッチ」では、レコード番号（１，７）が抽出される。図２８のデータにおいては、イベントパターンにマッチするレコードの組み合わせのうち、レコード１とレコード７の間で間隔が最も長くなるからである。また、与えられたイベントパターンにマッチする全てのパターンを返す「全部マッチ」では、レコード番号（１，４）、（１，７）、および（３，７）という３つの組み合わせが抽出される。
【０２４２】
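Furthermore, matching can be specified not only in the forward direction from the head of the data but also in the reverse direction from the tail. To explain this kind of matching specification, the following event pattern is used, obtained by changing the inter-event definition of the event pattern above.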
【０２４３】
イベント定義
イベント１：商品＝牛乳
イベント２：商品＝パン
イベント間定義
イベント１．日付＞イベント２．日付
まず、「データの末尾から逆向きの最初マッチ」では、与えられたイベントパターンに逆向きにマッチする最初のパターンが返される。図２８のデータでは、レコード番号（６，４）がこれに該当する。
【０２４４】
次に、「データの末尾から逆向きの最短マッチ」では、与えられたイベントパターンに逆向きにマッチするパターンの中で間隔が最短のものが返される。図２８のデータでは、レコード番号（３，２）がこれに該当する。同様に、「データの末尾から逆向きの最長マッチ」では、与えられたイベントパターンに逆向きにマッチするパターンの中で間隔が最長のものが返される。図２８のデータでは、レコード番号（６，２）がこれに該当する。
【０２４５】
このように、イベントパターンを用いれば、様々なマッチング指定が可能であり、これらの指定方法を目的に合わせて使い分けることができる。
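The matching modes can be compared side by side on a small data set. FIG. 28 is not reproduced in this text, so the records below are a hypothetical stand-in, chosen only so that the results agree with the record-number pairs quoted above; the dates, the item in record 5, and all names in the code are assumptions.

```python
from datetime import date

# Hypothetical stand-in for FIG. 28: record number -> (item, date).
RECORDS = {
    1: ("milk", date(2002, 3, 21)),
    2: ("bread", date(2002, 3, 21)),
    3: ("milk", date(2002, 3, 22)),
    4: ("bread", date(2002, 3, 22)),
    5: ("cheese", date(2002, 3, 23)),
    6: ("milk", date(2002, 3, 25)),
    7: ("bread", date(2002, 3, 25)),
}


def matches(event1_first):
    """All (event1, event2) record pairs with event1 = milk and event2 = bread."""
    found = []
    for i, (item_i, day_i) in RECORDS.items():
        for j, (item_j, day_j) in RECORDS.items():
            if item_i == "milk" and item_j == "bread":
                if (day_i < day_j) if event1_first else (day_i > day_j):
                    found.append((i, j))
    return sorted(found)


def gap(pair):
    """Interval in days between the two records of a pair."""
    return abs((RECORDS[pair[1]][1] - RECORDS[pair[0]][1]).days)


forward = matches(True)             # Event1.date < Event2.date
backward = matches(False)           # Event1.date > Event2.date
first, shortest, longest = forward[0], min(forward, key=gap), max(forward, key=gap)
# Scanning from the tail: the largest record numbers are met first.
rev_first, rev_shortest, rev_longest = (
    max(backward), min(backward, key=gap), max(backward, key=gap))
```

Enumerating every pair is quadratic and is done here only for clarity; a real matcher would stop as soon as the requested mode is satisfied.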
【０２４６】
検索装置がグラフィック・ユーザ・インタフェース（ＧＵＩ）を備えている場合、ユーザは、ＧＵＩを用いてイベントパターンを指定することができる。この場合、イベント定義とイベント間定義の２つをそれぞれＧＵＩで指定すればよい。
【０２４７】
図２９は、このようなＧＵＩの画面例を示している。イベント定義のボックス１３１にはイベント名が入力され、ボックス１３２にはイベントの条件が入力される。そして、ユーザがＯＫボタンをクリックすると、イベント定義の指定が完了する。ここでは、商品としてＰＣを購入したイベントが‘Ａ’と定義されている。同様にして、商品としてＴＶを購入したイベントは‘Ｂ’と定義されている。
【０２４８】
また、イベント間定義のボックス１３４および１３６には、定義されたイベント名が入力され、ボックス１３５には、ボックス１３４および１３６に入力されたイベント間の関係が入力される。そして、ユーザがＯＫボタン１３７をクリックすると、イベント間定義の指定が完了する。ここでは、イベントＡとイベントＢの間の関係について、イベントＡの後でイベントＢが起こるという条件が指定されている。これ以外にも、例えば、２つのイベントの間隔が所定日数以内であるといった指定が可能である。
【０２４９】
また、イベントパターンは、正規表現の拡張によって指定することもできる。正規表現では、複数イベントの同時発生とイベント間の間隔の記述がサポートされていないが、これらの記述を追加することによって、順序を考慮したイベントパターンが指定できる。例えば、同時発生を‘＝’で表現するとすれば、イベントＡ：“商品＝ＰＣ，価格＝１０万円”とイベントＢ：“商品＝ＴＶ，価格＝５万円”が同時に発生することを、以下のように表現することができる。
【０２５０】
イベントパターン
イベント定義
イベントＡ：商品＝ＰＣ，価格＝１０万円
イベントＢ：商品＝ＴＶ，価格＝５万円
イベント間定義
イベントＡ．日付＝イベントＢ．日付
また、イベントＡとイベントＢが同時に発生した後で、イベントＣ：“商品＝ＶＴＲ，価格＝３万円”が発生するパターンについては、以下のように記述することができる。
【０２５１】
イベントパターン
イベント定義
イベントＡ：商品＝ＰＣ，価格＝１０万円
イベントＢ：商品＝ＴＶ，価格＝５万円
イベントＣ：商品＝ＶＴＲ，価格＝３万円
イベント間定義
イベントＡ．日付＝イベントＢ．日付＜イベントＣ．日付
このように、同時発生と間隔の記述を導入することで、正規表現では表すことのできないイベントパターンを容易に指定することができる。
【０２５２】
前述のように、検索処理部が与えられたイベントパターンを解釈して処理を行うとき、インタープリタのように動的に解釈する方法と、コンパイラのように、パターンを計算機で実行可能な命令に置き換えてしまう方法とがある。検索処理部は、与えられたデータの中からイベントパターンにマッチするパターンを発見すると、発見したパターンについてあらかじめ決められた情報を出力する。例えば、図１５のデータについて、以下のような問い合わせを行った場合を説明する。
【０２５３】
イベントパターン
イベント定義
イベントＡ：商品＝牛乳
イベントＢ：商品＝パン
イベント間定義
イベントＡ．日付＝イベントＢ．日付
検索装置がパターンを発見する度に、発見されたパターンを構成するレコード群の情報をそのまま出力する場合、以下の２つのレコードの情報が出力される。
【０２５４】
１０００１　０３／２１　牛乳　１８９　（イベントＡに対応するレコード）
１０００１　０３／２１　パン　３００　（イベントＢに対応するレコード）
また、発見されたパターンの出力フォーマットをユーザが指定可能にすることもできる。この場合、発見されたパターンを構成するレコード群の情報が、指定されたフォーマットにコンパイルされて出力される。例えば、上記の問い合わせにおいて、出力の形式を、
イベントＡ．商品，　イベントＢ．商品
と指定すれば、
牛乳，パン
のような検索結果が出力される。さらに、“イベントＢ．日付−イベントＡ．日付”のように、特定のフィールドの値を用いた演算を指定することも可能である。この場合、発見されたパターンに含まれるレコードの指定されたフィールドの値に関して、指定された演算の結果が出力される。
【０２５５】
また、パターンをすべて発見した後に、レコード群の集約演算（ａｇｇｒｅｇａｔｅ　ｆｕｎｃｔｉｏｎ）を実行することも可能である。集約演算としては、最小値（ＭＩＮ）、最大値（ＭＡＸ）、平均（ＡＶＧ）、総和（ＳＵＭ）といった一般的な関数が用いられる。これらの集約演算を処理する際には、１つ１つのパターンを発見したら、それらをバッファに保存しておき、処理の区切りがついたところで、保存されたパターンすべてを対象として演算を実行する。例えば、上記の問い合わせに対して、ユーザは、ＡＶＧ（イベントＢ．日付−イベントＡ．日付）のような集約演算を指定することができる。
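Output formatting and aggregation can both be expressed as functions over the matched patterns. In the sketch below each matched pattern is a dictionary from event name to record; that representation, the field names, the year in the dates, and the helper names are assumptions made for illustration.

```python
from datetime import date
from statistics import mean


def field(binding, ref):
    """Resolve a reference such as "B.date" against one matched pattern."""
    event, name = ref.split(".")
    return binding[event][name]


def project(binding, refs):
    """Format one matched pattern according to a user-specified field list."""
    return [field(binding, ref) for ref in refs]


def avg_gap_days(bindings, later, earlier):
    """AVG(later - earlier) over all buffered patterns, in days."""
    return mean((field(b, later) - field(b, earlier)).days for b in bindings)


# The two records the text lists for the milk/bread query; the year is hypothetical.
MATCHES = [{
    "A": {"cid": "10001", "date": date(2002, 3, 21), "item": "milk", "price": 189},
    "B": {"cid": "10001", "date": date(2002, 3, 21), "item": "bread", "price": 300},
}]
```

Calling `project(MATCHES[0], ["A.item", "B.item"])` yields the milk and bread pair from the example, and `avg_gap_days(MATCHES, "B.date", "A.date")` plays the role of the aggregate AVG(Event B.date - Event A.date).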
【０２５６】
イベントパターンにマッチするパターンが見つからない場合には、なんらかの形でユーザにその旨を通知する必要がある。例えば、マッチするパターンがないことを示すレコードをあらかじめ用意しておき、それを用いてメッセージを画面上に表示することで、ユーザにこれを通知することができる。図１５のデータの場合、以下の問い合わせを行うと、マッチするパターンは存在しない。
【０２５７】
イベントパターン
イベント定義
イベントＡ：商品＝牛乳
イベントＢ：商品＝パン
イベント間定義
イベントＡ．日付＝イベントＢ．日付
複数のフィールドからなるレコードの集合がデータとして与えられた際に、レコードのグループバイ（グループ化）を行って、処理を高速化することもできる。このとき、どのフィールドでグループ化し、各グループのレコード群をどのフィールドで整列（ソート）するかを指定して、あらかじめレコードを並べ替えておく。グループ化に用いるフィールドとしては、複数のフィールドを指定することができる。
【０２５８】
例えば、図３０のようなデータを検索対象として、顧客ＩＤフィールドでグループ化し、各グループのレコードを購入日で整列するものとする。この場合、グループ化と整列により、検索対象データは図３１のように並べ替えられる。顧客ＩＤが１１０００１００１の顧客は、２００１／０１／１３に１回目の来店を行い、商品ＡとＮの２点を購入した。また、同じ顧客が２００１／０１／２８に２回目の来店を行い、別の商品Ｂを購入している。こうして、並べ替えられたデータを対象として、イベントパターンが指定される。
【０２５９】
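As described above, an event definition specifies, for one or more fields, what value each field takes, and gives a single name to those conditions as a whole. An inter-event definition is a constraint that spans several events.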
【０２６０】
図３１のデータについて、図３２に示すように、商品のフィールドが“Ａ”であるレコードをＥｖｅｎｔ１と定義し、商品のフィールドが“Ｂ”であるレコードをＥｖｅｎｔ２と定義したとする。さらに、Ｅｖｅｎｔ１とＥｖｅｎｔ２についてＥｖｅｎｔ１とＥｖｅｎｔ２が発生した日（購入日）の間隔が３０日以内であり、Ｅｖｅｎｔ１とＥｖｅｎｔ２の価格が等しいと指定すると、図３３に示すようなイベントパターンが得られる。
【０２６１】
この例では、データがグループ化して整列されているので、検索処理部は、各グループについて整列されたデータを順番に取り出すだけで、マッチングを行うことができる。もし、グループ化と整列が行われておらず、データがメモリ上に乗り切らない場合には、何度もデータを読み込む必要が生じるため、処理の効率が悪くなる。したがって、グループ化と整列による高速化のメリットは明らかである。
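Grouping and ordering can be sketched with one sort followed by a split into groups. The field names and the helper name are assumptions; dates are written as ISO strings so that plain string comparison orders them correctly.

```python
from itertools import groupby
from operator import itemgetter


def group_and_sort(records, group_field, order_field):
    """Sort by (group, order) once, then split into per-group lists."""
    ordered = sorted(records, key=itemgetter(group_field, order_field))
    return {key: list(rows)
            for key, rows in groupby(ordered, key=itemgetter(group_field))}
```

After this step the matcher can walk each group's list from front to back, which is what allows the search to proceed without re-reading the data.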
【０２６２】
上述の例では、データをグループ化して、各グループ内で整列を行っているが、検索対象データ全体を１つのグループと見なした場合も、同様にして、データを順序に従って整列することで、処理を高速化できる。図３４は、ある店舗の売り上げデータを日付順に整列した例を示している。このようにデータ全体を日付順に整列することで、ｍ個のデータ項目からｎ個のイベントを検索する処理が、Ｏ（ｍｎ）のオーダからＯ（ｍ＋ｎ）のオーダに効率化される。
【０２６３】
また、インデクスを用いれば、データのグループ化と整列を行うことなく、同等の高速化を達成することができる。この方法では、レコードの集合は、グループ軸と順序軸でインデクス付けして処理され、レコードへのアクセスをグループ毎に順番に行うことができる。
【０２６４】
図３５は、このようなインデクスによるデータアクセスを示している。この例では、グループ情報保持部１４１と複数のインデクス１４２を介して、データ１４３がアクセスされる。グループ情報保持部１４１は、複数のグループに対応する複数の顧客識別子（ＣＩＤ１〜ＣＩＤ４）と、各顧客識別子に対応するインデクス１４２へのポインタを保持している。また、各インデクス１４２は、複数の日付データと各日付データに対応するレコードへのポインタを保持している。検索処理部は、グループ情報保持部１４１とインデクス１４２を用いることで、グループ毎に日付順にレコードにアクセスすることができる。
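The two-level access path of FIG. 35 can be sketched as nested dictionaries: the outer level plays the role of the group information holding unit and the inner level the per-group index. Record positions stand in for pointers; the names and the in-memory representation are assumptions.

```python
from collections import defaultdict


def build_index(records):
    """Map customer ID -> purchase date -> positions of the records in `records`."""
    index = defaultdict(lambda: defaultdict(list))
    for pos, rec in enumerate(records):
        index[rec["cid"]][rec["date"]].append(pos)
    return index


def iter_group(records, index, cid):
    """Yield one customer's records in date order without reordering `records`."""
    for day in sorted(index[cid]):
        for pos in index[cid][day]:
            yield records[pos]
```

The underlying data stays in its original order; only the index is consulted to visit records group by group and date by date.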
【０２６５】
また、データ全体を１つのグループとみなした場合でも、インデクスを利用すれば、整列を行うことなく、同等の高速化を達成することができる。この場合、図３５のグループ情報保持部１４１が不要となるため、図３６のような構成が用いられる。検索処理部は、インデクス１４２を用いることで、日付順にレコードにアクセスすることができる。
【０２６６】
次に、図３７から図４２までを参照しながら、検索処理の具体例として、例１、例２、および例３について説明する。
【０２６７】
図３７は、検索対象となる売上データを示している。図３７において、各レコードは、ＲＩＤ（レコード識別子）、顧客ＩＤ、購入日、商品、および価格の５つのフィールドからなる。データは、顧客ＩＤフィールドでグループ化され、さらに、グループ内でのレコードの順序は、購入日フィールドに従って並び替えられているものとする。
【０２６８】
図３７では、スペースの都合上、顧客ＩＤが１１０００１００１のグループに関するレコードのみが示されている。また、以下では、主として、このグループに関する処理についてのみ説明するが、実際には、他の顧客ＩＤのグループについても同様の処理が行われる。
【０２６９】
（例１）
以下のように指定されるイベントパターンを考える。
【０２７０】
イベントパターン
イベント定義
Ｅｖｅｎｔ１：商品＝Ａ
Ｅｖｅｎｔ２：価格＜＝８３８
イベント間定義
Ｅｖｅｎｔ２：Ｅｖｅｎｔ１から３日以内
（Ｅｖｅｎｔ２．購入日＜＝Ｅｖｅｎｔ１．購入日＋３日）
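When such a query is given, the search processing unit interprets it and internally generates a pattern structure (query pattern) such as the one shown in FIG. 38. Pattern matching can be carried out in various ways; one possibility is to use the pointer P2 shown in FIG. 21. At the start of processing, this pointer is initialized so that it points to the head of the query pattern.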
【０２７１】
パターンマッチングは、グループ毎に行われる。検索処理部は、各グループについて、レコードを先頭から順番に取り出し、あらかじめ解釈しておいたイベント定義にマッチするかどうかを調べていく。レコードがイベント定義にマッチすれば、さらに、レコードがイベント間定義にマッチするかどうかをしらべる。レコードがイベント間定義にマッチすれば、ポインタＰ２を１つ進める。
【０２７２】
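In the query pattern of FIG. 38, the first record R1 of FIG. 37 matches the event definition of Event1, to which pointer P2 points. Next, the inter-event definition is checked. Because no inter-event definition is defined for Event1, record R1 matches Event1 as soon as its event definition matches. Pointer P2 is therefore changed to point to Event2, as shown in FIG. 39.
Next, record R2 is read and checked against the event definition of Event2, "price <= 838". The price of R2 is 1800 yen, which does not satisfy the event definition of Event2, so processing moves on to the next record, R3. The price of R3 is 838 yen, so the event-definition condition of Event2 is satisfied. R3 is then checked against the inter-event definition of Event2.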
【０２７３】
この場合、イベント間定義は、Ｅｖｅｎｔ２すなわちＲ３が生起したのがＥｖｅｎｔ１から３日以内であるという条件を表す。Ｒ３とＲ１の間隔は１５日であるので、イベント間定義の条件を満たしていない。また、レコードは購入日の早い順に並んでいるため、Ｒ３以降のレコードについては、Ｅｖｅｎｔ１を満たしているレコードＲ１との間隔が１５日以上になることがわかる。したがって、現時点でＥｖｅｎｔ２のイベント間定義を満たすレコードが存在しないことがわかるので、このグループの処理は中止される。
【０２７４】
このようにして、各グループのデータについてマッチング処理が行われ、問い合わせパターンのすべての条件が満たされた場合に、マッチングが成功する。上述した例では、顧客ＩＤが１１０００１００１のグループでは、問い合わせパターンにマッチするパターンがデータ中に存在しない。そこで、「指定されたパターンは存在しない」という結果が出力される。
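Example 1 can be traced with a few lines of code. FIG. 37 is not reproduced in this text, so the records below are a hypothetical stand-in that keeps only what the description states: R1 is item A, R2 costs 1800 yen, R3 costs 838 yen and lies 15 days after R1. The concrete dates, the remaining items and prices, and the function name are assumptions.

```python
from datetime import date, timedelta

# Hypothetical stand-in for FIG. 37: (record ID, purchase date, item, price).
RECORDS = [
    ("R1", date(2001, 1, 13), "A", 838),
    ("R2", date(2001, 1, 13), "N", 1800),
    ("R3", date(2001, 1, 28), "B", 838),
    ("R4", date(2001, 1, 28), "C", 500),
]


def first_match(records, max_days=3, max_price=838):
    """Find (Event1, Event2) with Event2 within max_days after Event1."""
    for i, (rid1, day1, item1, _price1) in enumerate(records):
        if item1 != "A":                        # Event1: item = A
            continue
        for rid2, day2, _item2, price2 in records[i + 1:]:
            if day2 - day1 > timedelta(days=max_days):
                break       # records are date-ordered, so later ones are further away
            if price2 <= max_price:             # Event2: price <= 838
                return (rid1, rid2)
    return None
```

With the three-day limit `first_match(RECORDS)` finds nothing, mirroring the early stop at R3 described above; the `break` is what the date ordering makes possible.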
【０２７５】
（例２）
以下のように指定されるイベントパターンを考える。マッチング方法としては、最初マッチと全部マッチの２通りを指定するものとする。
【０２７６】
イベントパターン
イベント定義
Ｅｖｅｎｔ１：商品＝Ａ
Ｅｖｅｎｔ２：価格＜＝８３８
イベント間定義
Ｅｖｅｎｔ２：Ｅｖｅｎｔ１から３来店以内
（Ｅｖｅｎｔ２．順序＜＝Ｅｖｅｎｔ１．順序＋３）
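The event definitions in this example are the same as in Example 1. As the inter-event definition, however, the constraint used in Example 1 is replaced by the constraint that Event2 occurs within three store visits of Event1. In this case, each record is given a sequence number starting from 1 according to the order of the purchase-date field, and the search target data of FIG. 37 is converted into the internal format shown in FIG. 40.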
【０２７７】
図４０において、レコードＲ１とＲ２には、同じ順序１が割り当てられている。この順序は、顧客ＩＤが１１０００１００１の顧客が最初に来店した日を意味している。同様に、レコードＲ３とＲ４に順序２が割り当てられており、２回目の来店を意味している。以下、レコードＲ５、Ｒ６、Ｒ７には、それぞれ順序３、４、５が割り当てられており、それぞれ３回目、４回目、５回目の来店を意味している。なお、ここで示した順序は論理的な概念であり、図４０のような順序フィールドを物理的に設けるかどうかは、システムの実装に依存する。
【０２７８】
検索時には、例１と同様に、問い合わせを解釈して、図４１に示すような問い合わせパターンが内部的に生成される。まず、例１と同様に、最初のレコードＲ１がＥｖｅｎｔ１のイベント定義にマッチし、ポインタＰ２は、Ｅｖｅｎｔ２を指すように変更される。
【０２７９】
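Next, record R2 is read, but its price is higher than 838 yen, so the next record, R3, is checked against the event definition of Event2. The price of R3 satisfies the event definition of Event2, so R3 is then checked against the inter-event definition of Event2.
In this example, the inter-event definition expresses the condition that Event2 occurs within three store visits of Event1. In other words, the order of Event2 must be at most 4, the value obtained by adding 3 to 1, the order of Event1 (that is, of R1). The order of R3 is 2, which is no greater than 4 and corresponds to the visit following that of R1, so the constraint of the inter-event definition is satisfied.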
【０２８０】
Ｒ３まで検索した時点で、Ｅｖｅｎｔ１とＥｖｅｎｔ２からなるパターン指定を満たしているので、マッチした最初のパターンとして（Ｒ１，Ｒ３）が出力される。次に、Ｒ３と同じ来店日のレコードであるＲ４についても、Ｅｖｅｎｔ１をＲ１とした場合に、Ｅｖｅｎｔ２の条件を満たしているかどうかがチェックされる。その結果、Ｒ４についても、Ｅｖｅｎｔ２のイベント定義とイベント間定義の双方が満たされていることがわかり、パターン（Ｒ１，Ｒ４）が出力される。
【０２８１】
２つのパターン（Ｒ１，Ｒ３）と（Ｒ１，Ｒ４）は、まったく同じ来店日で問い合わせパターンを満たすものであり、どちらかが先であるとは言えない。このように、同じ位置（日付や時刻）に複数のデータが存在する場合には、従来の正規表現のデータと異なり、答えも複数存在する場合がある。
【０２８２】
次のレコードＲ５は、Ｒ３、Ｒ４と同じ来店日ではないので、最初マッチが指定された場合には、ここで処理が中止される。これに対して、全部マッチが指定された場合には、Ｒ５以降のレコードの処理が同様に続けられる。その結果、（Ｒ１，Ｒ３）、（Ｒ１，Ｒ４）に加えて、（Ｒ１，Ｒ６）も検索結果のパターンとして出力される。Ｒ６の価格は５８１円で８３８円より低く、また順序は４であるため、（Ｒ１，Ｒ６）のパターンの間隔は３来店となる。したがって、（Ｒ１，Ｒ６）は与えられたパターン指定を満たしている。
【０２８３】
検索結果を（Ｅｖｅｎｔ１．ＲＩＤ，Ｅｖｅｎｔ２．ＲＩＤ）のように表すことにすると、最初マッチ及び全部マッチの場合の出力は、それぞれ以下のようになる。
【０２８４】
最初マッチ：（Ｒ１，Ｒ３）、（Ｒ１，Ｒ４）
全部マッチ：（Ｒ１，Ｒ３）、（Ｒ１，Ｒ４）、（Ｒ１，Ｒ６）
（例３）
以下のように指定されるイベントパターンを考える。マッチング方法としては、最初マッチと全部マッチの２通りを指定するものとする。
【０２８５】

この例のイベント定義は、例１および例２のものとは異なり、Ｅｖｅｎｔ２はワイルドカードとして何にでもマッチできると定義されている。イベント間定義において、例２で用いたＥｖｅｎｔ１とＥｖｅｎｔ２が３来店以内という制約に加えて、Ｅｖｅｎｔ２の価格がＥｖｅｎｔ１の価格と同じかそれより低いという制約が与えられている。例２の場合と同様に、各レコードには、購入日フィールドの順番に従って、１から始まる数字が順序番号として付加され、図４０の内部形式が検索対象データとして用いられる。
【０２８６】
検索時には、例１および例２と同様に、問い合わせを解釈して、図４２に示すような問い合わせパターンが内部的に生成される。まず、最初のレコードＲ１が、Ｅｖｅｎｔ１のイベント定義にマッチし、ポインタＰ２は、Ｅｖｅｎｔ２を指すように変更される。
【０２８７】
次に、レコードＲ２が読み込まれるが、Ｅｖｅｎｔ２のイベント定義は、ワイルドカードであり、何にでもマッチする。そこで、Ｅｖｅｎｔ２のイベント間定義が満たされるかどうかがチェックされる。イベント間定義は、Ｅｖｅｎｔ２が生起したのがＥｖｅｎｔ１から３来店以内であるという条件と、価格についての条件である。Ｒ２は、Ｅｖｅｎｔ１から３来店以内という１番目の条件は満たしているが、Ｒ２の価格はＲ１の価格より高価であるため２番目の条件が成立しない。
【０２８８】
次に、レコードＲ３が読み込まれる。Ｒ３についても、Ｅｖｅｎｔ２のイベント定義は無条件に満たしている。そこで、Ｒ３がＥｖｅｎｔ２のイベント間定義を満たしているかどうかがチェックされる。Ｒ３の順序は４以下の２であり、Ｒ１の次の来店である。また、Ｒ３の価格８３８円はＥｖｅｎｔ１の価格より低い。したがって、２つの条件の両方を満たしているので、イベント間定義の制約は満たされる。
【０２８９】
Ｒ３まで検索した時点で、Ｅｖｅｎｔ１とＥｖｅｎｔ２からなるパターン指定を満たしているので、マッチした最初のパターンとして（Ｒ１，Ｒ３）が出力される。次に、レコードＲ４についても、Ｅｖｅｎｔ２の条件を満たしているかどうかがチェックされ、イベント定義とイベント間定義の双方が満たされていることがわかるので、パターン（Ｒ１，Ｒ４）が出力される。
【０２９０】
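The next record, R5, is not from the same visit date as R3 and R4, so if first match is specified, processing stops here. If all match is specified, the records from R5 onward are processed in the same way. As a result, (R1, R6) is output as a search result pattern in addition to (R1, R3) and (R1, R4). The order of R6 is 4, so the interval of the pattern (R1, R6) is three visits. In addition, the price of R6 is 581 yen, lower than the 838-yen price of R1, so (R1, R6) satisfies the given pattern specification. The outputs for first match and all match are therefore as follows.
First match: (R1, R3), (R1, R4)
All match: (R1, R3), (R1, R4), (R1, R6)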
以上説明した３つの例のイベントパターンでは、簡単のため、Ｅｖｅｎｔ１とＥｖｅｎｔ２の２つのイベントのみが用いられている。また、イベント定義とイベント間定義としては、１つあるいは２つの条件のみが用いられている。しかし、実際には、イベントパターンをより多くの数のイベントによって指定することが可能であり、イベント定義およびイベント間定義においても、より多くの条件を指定することが可能である。
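Examples 2 and 3 can be reproduced with a visit-order ranking and a small matcher. FIGS. 37 and 40 are not reproduced in this text, so the data below is a hypothetical stand-in consistent with the description (visit orders 1, 1, 2, 2, 3, 4, 5; R2 at 1800 yen, R3 at 838 yen, R6 at 581 yen, R1 at 838 yen); the dates, the other items and prices, and all function names are assumptions.

```python
# Hypothetical stand-in for FIG. 37: (record ID, purchase date, item, price).
RECORDS = [
    ("R1", "2001-01-13", "A", 838), ("R2", "2001-01-13", "N", 1800),
    ("R3", "2001-01-28", "B", 838), ("R4", "2001-01-28", "C", 500),
    ("R5", "2001-02-03", "D", 1200), ("R6", "2001-02-10", "E", 581),
    ("R7", "2001-02-20", "F", 300),
]


def visit_orders(records):
    """Number distinct purchase dates from 1; same-day records share an order."""
    rank = {day: n + 1 for n, day in enumerate(sorted({r[1] for r in records}))}
    return {r[0]: rank[r[1]] for r in records}


def search(records, event2_ok, inter_ok, mode="all"):
    order, results = visit_orders(records), []
    for i, e1 in enumerate(records):
        if e1[2] != "A":                        # Event1: item = A
            continue
        first_visit = None
        for e2 in records[i + 1:]:
            if not (event2_ok(e2) and inter_ok(e1, e2, order)):
                continue
            if mode == "first":                 # keep only matches from one visit
                if first_visit is None:
                    first_visit = order[e2[0]]
                elif order[e2[0]] != first_visit:
                    break
            results.append((e1[0], e2[0]))
        if mode == "first" and results:
            break
    return results


def within_3_visits(e1, e2, order):
    return order[e2[0]] <= order[e1[0]] + 3


EXAMPLE2 = dict(event2_ok=lambda e2: e2[3] <= 838, inter_ok=within_3_visits)
EXAMPLE3 = dict(event2_ok=lambda e2: True,      # wildcard event definition
                inter_ok=lambda e1, e2, order:
                    within_3_visits(e1, e2, order) and e2[3] <= e1[3])
```

Calling `search(RECORDS, mode="first", **EXAMPLE2)` or `search(RECORDS, mode="all", **EXAMPLE3)` gives the pairs listed in the text; both same-visit records R3 and R4 are returned in first-match mode because neither precedes the other.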
【０２９１】
以上の検索装置及び検索方法によれば、データ間の順序を考慮したパターンを容易に指定して、指定されたパターンを検索することが可能になる。また、パターンの定義を変えるだけで様々な種類の検索に対応することができる。
【０２９２】
（２）　：時系列フィルタ部の構成の説明
前記時系列フィルタ部は次のように構成されている。
【０２９３】
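A: A search device that searches for combinations of records in a set of records each having a plurality of attributes, comprising: specifying means for specifying a search pattern using a plurality of events, each defining that a given attribute in a record takes a specific value, and an order relation between the plurality of events defined on the basis of the order of attribute values; search means for searching the set of records for combinations of records corresponding to the specified search pattern; and output means for outputting the search result (corresponding to the time-series filter unit 4 of the present invention).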
【０２９４】
Ｂ：前記Ａの検索装置において、前記指定手段は、２つ以上のイベントが同じ順序に存在するような検索パターンを指定することを特徴とした検索装置（本発明の時系列フィルタ部４に相当）。
【０２９５】
Ｃ：前記Ａの検索装置において、前記指定手段は、前記複数のイベントの順序関係を任意の間隔で記述した検索パターンを指定することを特徴とする検索装置（本発明の時系列フィルタ部４に相当）。
【０２９６】
Ｄ：前記Ａの検索装置において、前記複数の属性のうち少なくとも１つの値のデータを圧縮することで、前記レコードの集合のデータを圧縮する手段をさらに備え、前記検索手段は、圧縮されたデータを検索することを特徴とする検索装置（本発明の時系列フィルタ部４に相当）。
【０２９７】
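E: The search device of A above, wherein the specifying means specifies one of the following methods: longest match, which returns, among the combinations of events that match the search pattern, the combination with the longest interval between events; shortest match, which returns, among the combinations of events that match the search pattern, the combination with the shortest interval between events; repetition of the shortest match; all match, which returns every combination of events that matches the search pattern; reverse longest match, which returns, among the combinations of events that match the search pattern in the reverse direction, the combination with the longest interval; and reverse shortest match, which returns, among the combinations of events that match the search pattern in the reverse direction, the combination with the shortest interval; and the search means performs pattern matching by the specified method.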
【０２９８】
Ｆ：前記Ａの検索装置において、前記出力手段は、前記検索パターンに対応するレコードの組み合わせに関する集約演算を行って、演算結果を出力することを特徴とする検索装置。
【０２９９】
Ｇ：前記Ａの検索装置において、前記指定手段は、レコードの整列に用いる属性を指定し、前記検索手段は、指定された属性の値に基づいて前記レコードの集合を整列し、整列したレコードの集合を検索することを特徴とする検索装置。
【０３００】
Ｈ：前記Ａの検索装置において、前記レコードの集合を前記属性値の順序に従ってアクセスするためのインデクス手段を更に備えることを特徴とする検索装置。
前記の説明に対し、次の構成を付記する。
【０３０１】
（付記１）
時間的に連続するトランザクションに含まれるアイテムのシーケンスデータから、ｉ＝２個以上のアイテムの組み合わせに対しては該シーケンスにおける順序を維持した形式で、与えられた条件に適合するアイテム１個ずつまたは２個以上の組み合わせとその出現回数とを求めて時系列相関ルールを抽出する時系列相関抽出装置において、
入力データをフィルタリングして必要なデータを残す時系列フィルタ部と、入力データから時系列相関ルールを抽出する時系列相関エンジン部とを備えると共に、
前記時系列フィルタ部は、
複数の属性からなるレコードの集合からレコードの組み合わせを検索する際、レコード内の所定属性が特定の値をとることをそれぞれ定義する複数のイベントと、
その属性値の順序に基づいて定義された該複数のイベントの間の順序関係とを用いて、検索パターンを指定する指定手段と、
前記レコードの集合から、指定された検索パターンに対応するレコードの組み合わせを検索する検索手段と、
検索結果を出力する出力手段とを備えていることを特徴とする時系列相関抽出装置。
【０３０２】
（付記２）
前記時系列相関エンジン部の前段に前記時系列フィルタ部を配置することで、時系列相関ルールを抽出する機能を備えていることを特徴とする（付記１）記載の時系列相関抽出装置。
【０３０３】
（付記３）
前記時系列相関エンジン部の後段に前記時系列フィルタ部を配置することで、時系列相関ルールを抽出する機能を備えていることを特徴とする（付記１）記載の時系列相関抽出装置。
【０３０４】
（付記４）
前記時系列相関エンジンの前段と後段の両方に前記時系列フィルタ部を配置することで、時系列相関ルールを抽出する機能を備えていることを特徴とする（付記１）記載の時系列相関抽出装置。
【０３０５】
（付記５）
前記時系列フィルタ部を前記時系列相関エンジン部の外部に配置することで、時系列相関ルールを抽出する機能を備えていることを特徴とする（付記１）乃至（付記４）の何れかに記載の時系列相関抽出装置。
【０３０６】
（付記６）
前記時系列相関エンジンの前段と後段の両方に前記時系列フィルタ部を配置すると共に、
前段の時系列フィルタ部と時系列相関エンジン部を１つのプログラムで構成し、後段の時系列フィルタ部を別の１つのプログラムで構成したことを特徴とする（付記１）記載の時系列相関抽出装置。
【０３０７】
（付記７）
前記時系列相関エンジンの前段と後段の両方に前記時系列フィルタ部を配置すると共に、
前段の時系列フィルタ部を１つのプログラムで構成し、時系列相関エンジン部と後段の時系列フィルタ部を別の１つのプログラムで構成したことを特徴とする（付記１）記載の時系列相関抽出装置。
【０３０８】
（付記８）
前記時系列相関エンジンの前段と後段の両方に前記時系列フィルタ部を配置すると共に、
前段の時系列フィルタ部と、時系列相関エンジン部と、後段の時系列フィルタ部を、それぞれ別々の１つのプログラムで構成したことを特徴とする（付記１）記載の時系列相関抽出装置。
【０３０９】
【発明の効果】以上説明したように、本発明によれば次のような効果がある。
【０３１０】
（１）　：請求項１では、時系列相関抽出装置は、時間的に連続するトランザクションに含まれるアイテムのシーケンスデータから、ｉ＝２個以上のアイテムの組み合わせに対しては該シーケンスにおける順序を維持した形式で、前記与えられた条件に適合するアイテム１個ずつまたは２個以上の組み合わせとその出現回数とを求めて、時系列相関データを抽出する。
【０３１１】
この場合、時系列フィルタ部は、入力データをフィルタリングして必要なデータを残す処理を行い、時系列相関エンジン部は、入力データから時系列相関ルールを生成して抽出する処理を行うが、特に、前記時系列フィルタ部は、次のような処理を行う。
【０３１２】
すなわち、時系列フィルタ部は、前記指定手段、検索手段、出力手段を備え、複数の属性からなるレコードの集合からレコードの組み合わせを検索する。この時、前記指定手段は、レコード内の所定属性が特定の値をとることをそれぞれ定義する複数イベントと、属性の順序に基づいて定義されたそれらのイベントの間の順序関係とを用いて、検索パターン（イベントパターン）を指定する。
【０３１３】
前記検索手段は、レコードの集合から、指定された検索パターンに対応するレコードの組み合わせを検索し、出力手段は、検索結果を出力する。イベントは、レコード内の所定属性が特定の値をとる状態として定義され、複数のイベントの間の順序関係は、それらのイベントに対応するレコードの間において、１つ以上の属性の値の順序関係に基づいて定義される。
【０３１４】
ユーザは、指定手段を用いて、これらのイベントとイベント間の順序関係により決められる検索パターンを指定する。指定手段は、検索パターンを検索手段に渡し、検索手段は、受け取った検索パターンを解釈して、検索パターンに対応するレコードの組み合わせを抽出する。そして、出力手段は、抽出されたレコード等の情報を検索結果として出力する。
【０３１５】
このような装置によれば、２つ以上のイベントが同じ順序に存在する場合や、複数のイベントの順序関係が任意の間隔で記述される場合を含めて、様々な検索パターンを容易に指定することが可能となり、順序を考慮した汎用的な時系列フィルタリング処理を行うことができる。このようにして、前記のようなトランザクションの数が増えても、処理に多大の時間が掛かるのを防止し、高速に時系列相関ルールを抽出することが可能になる。
【０３１６】
（２）　：請求項２では、時系列フィルタ部を時系列相関エンジン部の前段に置いた例であり、時系列フィルタ部と時系列相関エンジン部を１つのプログラムで構成した例である。この場合、入力データは時系列フィルタ部でフィルタリング処理を行うことで、組み合わせを減少させ、処理の時間短縮を行う。
【０３１７】
次に、時系列相関エンジン部で時系列相関ルールを抽出して出力する。このようにすれば、トランザクションの数が増えても、処理に多大の時間が掛かるのを防止し、高速に時系列相関ルールを抽出することが可能になる。
【０３１８】
（３）　：請求項３では、時系列相関エンジン部を時系列フィルタ部の前段に置いた例であり、時系列相関エンジン部と時系列フィルタ部を１つのプログラムとして構成した例である。この場合、入力データは時系列相関エンジン部で時系列相関ルールを抽出し、その後、時系列フィルタ部でフィルタリング処理を行い、時系列相関ルールを抽出して出力する。このようにすれば、トランザクションの数が増えても、処理に多大の時間が掛かるのを防止し、高速に時系列相関ルールを抽出することが可能になる。
【０３１９】
（４）　：請求項４では、時系列相関エンジン部の前後両方に時系列フィルタ部を設け、２つの時系列フィルタ部と１つの時系列相関エンジン部を１つのプログラムで構成した例である。この場合、入力データは時系列フィルタ部でフィルタリング処理を行い、その後、時系列相関エンジン部で時系列相関ルールを抽出し、更に、別の時系列フィルタ部でフィルタリングを行うことで相関ルールを出力する。
【０３２０】
この例では、前段の時系列フィルタ部で組み合わせを減少させ、後段の時系列フィルタ部で無駄なものを除外する機能がある。このようにすれば、トランザクションの数が増えても、処理に多大の時間が掛かるのを防止し、高速に時系列相関ルールを抽出することが可能になる。
【０３２１】
（５）　：請求項５では、時系列フィルタ部と時系列相関エンジン部とをそれぞれ別の１つのプログラム部で構成した例である。すなわち、時系列フィルタ部を１つのプログラムで構成し、時系列相関エンジン部５を別の１つのプログラムで構成する。
【０３２２】
このようにすれば、トランザクションの数が増えても、処理に多大の時間が掛かるのを防止し、高速に時系列相関ルールを抽出することが可能になる。また、時系列フィルタ部と時系列相関エンジン部が別のプログラムで構成されているので、並列処理が可能であり、更に、時系列フィルタ部４の取り換えが可能である。
【図面の簡単な説明】
【図１】本発明の原理説明図である。
【図２】本発明の実施の形態における時系列相関抽出装置の説明図である。
【図３】本発明の実施の形態における時系列相関抽出装置のプログラムの説明図（その１）であり、Ａ図は例１、Ｂ図は例２、Ｃ図は例３、Ｄ図は例４である。
【図４】本発明の実施の形態における時系列相関抽出装置のプログラムの説明図（その２）であり、Ｅ図は例５、Ｆ図は例６、Ｇ図は例７、Ｈ図は例８である。
【図５】本発明の実施の形態における入力データ形式であり、Ａ図は入力データ（テキスト表現）、Ｂ図は形式の説明図である。
【図６】本発明の実施の形態における処理の説明図であり、Ａ図は組み合わせ（長さ２）、Ｂ図はＣＩＤ＝１，４を削除した組み合わせ、Ｃ図は単純なフィルタでは効果がない入力である。
【図７】本発明の実施の形態における例１の処理フローチャートである。
【図８】本発明の実施の形態における例２の処理フローチャートである。
【図９】本発明の実施の形態における例３の処理フローチャートである。
【図１０】本発明の実施の形態における具体的な装置の構成図である。
【図１１】本発明の実施の形態における正規表現によるパターンマッチングを示す図である。
【図１２】本発明の実施の形態におけるＮＦＡを示す図である。
【図１３】本発明の実施の形態におけるＤＦＡを示す図である。
【図１４】本発明の実施の形態における正規表現の演算子を示す図である。
【図１５】本発明の実施の形態における小売り業売上データを示す図である。
【図１６】本発明の実施の形態における時系列フィルタ部の原理的な説明図である。
【図１７】本発明の実施の形態におけるイベント定義を示す図である。
【図１８】本発明の実施の形態におけるイベント間定義を示す図である。
【図１９】本発明の実施の形態における検索装置の構成図である。
【図２０】本発明の実施の形態における全体処理のフローチャートである。
【図２１】本発明の実施の形態における検索時のデータ構造を示す図である。
【図２２】本発明の実施の形態における検索処理のフローチャートである。
【図２３】本発明の実施の形態における第１のテーブルを示す図である。
【図２４】本発明の実施の形態における第２のテーブルを示す図である。
【図２５】本発明の実施の形態におけるＳＱＬ文を示す図である。
【図２６】本発明の実施の形態における圧縮されたデータを示す図である。
【図２７】本発明の実施の形態におけるレコードの削減処理を示す図である。
【図２８】本発明の実施の形態における第１の検索対象データを示す図である。
【図２９】本発明の実施の形態におけるＧＵＩの画面を示す図である。
【図３０】本発明の実施の形態における第２の検索対象データを示す図である。
【図３１】本発明の実施の形態における並べ替えられたデータを示す図である。
【図３２】本発明の実施の形態におけるイベント定義に対応するレコードを示す図である。
【図３３】本発明の実施の形態におけるイベントパターンに対応するレコードを示す図である。
【図３４】本発明の実施の形態における整列されたデータを示す図である。
【図３５】本発明の実施の形態における第１のインデクスを示す図である。
【図３６】本発明の実施の形態における第２のインデクスを示す図である。
【図３７】本発明の実施の形態における第３の検索対象データを示す図である。
【図３８】本発明の実施の形態における第１の問い合わせパターンを示す図である。
【図３９】本発明の実施の形態におけるポインタの移動を示す図である。
【図４０】本発明の実施の形態における検索対象データの内部形式を示す図である。
【図４１】本発明の実施の形態における第２の問い合わせパターンを示す図である。
【図４２】本発明の実施の形態における第３の問い合わせパターンを示す図である。
【図４３】従来例１の説明図であり、Ａ図は時系列相関抽出装置の説明図、Ｂ図は処理フローチャートである。
【図４４】従来例におけるＳＥＴＭアルゴリズムにおける具体的な処理の流れを説明する図である。
【図４５】従来例におけるＳＥＴＭアルゴリズムの処理における各機能ブロックの処理内容を示す図である。
【図４６】従来例におけるアプリオリ・アルゴリズムにおける具体的な処理の流れを説明する図である。
【図４７】従来例におけるアプリオリ・アルゴリズムにおける各機能ブロックの内容を説明する図である。
【図４８】従来例における時系列分析におけるシーケンスリストの例を説明する図である。
【図４９】従来例における時系列分析におけるＧ（１）　までの処理を説明する図（その１）である。
【図５０】従来例における時系列分析におけるＧ（１）　までの処理を説明する図（その２）である。
【図５１】従来例における時系列分析におけるＧ（１）　までの処理を説明する図（その３）である。
【図５２】従来例における時系列分析におけるＧ（１）　までの処理を説明する図（その４）である。
【図５３】従来例における時系列分析におけるＧ（１）　までの処理を説明する図（その５）である。
【図５４】従来例における時系列分析におけるＬ（１）　の選択を説明する図（その１）である。
【図５５】従来例における時系列分析におけるＬ（１）　の選択を説明する図（その２）である。
【図５６】従来例における時系列分析におけるＬ（１）　の選択を説明する図（その３）である。
【図５７】従来例における時系列分析におけるＬ（１）　の選択を説明する図（その４）である。
【図５８】従来例における時系列分析におけるＬ（１）　の選択を説明する図（その５）である。
【図５９】従来例における時系列分析におけるＬ（１）　の選択を説明する図（その６）である。
【図６０】従来例における時系列分析におけるＬ（１）　の選択を説明する図（その７）である。
【図６１】従来例における時系列分析におけるＬ（１）　の選択を説明する図（その８）である。
【図６２】従来例における時系列分析におけるＧ（２）　までの処理を説明する図（その１）である。
【図６３】従来例における時系列分析におけるＧ（２）　までの処理を説明する図（その２）である。
【図６４】従来例における時系列分析におけるＧ（２）　までの処理を説明する図（その３）である。
【図６５】従来例における時系列分析におけるＧ（２）　までの処理を説明する図（その４）である。
【図６６】従来例における時系列分析におけるＧ（２）　までの処理を説明する図（その５）である。
【図６７】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その１）である。
【図６８】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その２）である。
【図６９】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その３）である。
【図７０】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その４）である。
【図７１】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その５）である。
【図７２】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その６）である。
【図７３】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その７）である。
【図７４】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その８）である。
【図７５】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その９）である。
【図７６】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その１０）である。
【図７７】従来例における時系列分析におけるＬ（２）　の選択を説明する図（その１１）である。
【図７８】従来例における時系列分析におけるＧ（３）　までの処理を説明する図（その１）である。
【図７９】従来例における時系列分析におけるＧ（３）　までの処理を説明する図（その２）である。
【図８０】従来例における時系列分析におけるＧ（３）　までの処理を説明する図（その３）である。
【図８１】従来例における時系列分析におけるＧ（３）　までの処理を説明する図（その４）である。
【図８２】従来例における時系列分析におけるＧ（３）　までの処理を説明する図（その５）である。
【図８３】従来例における時系列分析におけるＬ（３）　の選択を説明する図である。
【図８４】従来例における基本相関分析における処理の流れを説明する図である。
【図８５】従来例におけるアイテム組み合わせ数数え上げ処理におけるハッシュ済みリスト生成処理の経過を説明する図である。
【図８６】図８５の処理結果としてのハッシュ済みリストを示す図である。
【図８７】図８６のハッシュ済みリストに対する最小ハッシュ値レコード取り出し処理の経過を示す図である。
【図８８】ソート処理に基づくグループバイ処理の従来例のフローチャートである。
【図８９】図８８のフローチャートを用いた具体的な処理経過の説明図である。
【図９０】ハッシュ処理に基づくグループバイ処理の従来例のフローチャートである。
【図９１】図９０のフローチャートを用いた具体的な処理経過の説明図である。
【図９２】グループバイ処理方式の第１の実施例におけるハッシュ処理を説明する図である。
【図９３】グループバイ処理方式の第１の実施例におけるハッシュ処理の経過を示す図である。
【図９４】グループバイ処理方式の第１および第２の実施例におけるグループバイ関数処理の全体フローチャートである。
【図９５】グループバイ処理方式の第１および第２の実施例におけるグループバイ関数処理の全体説明図である。
【図９６】本発明のデータ組み合わせ数え上げ方式の実施例の構成を説明する図である。
【図９７】本発明のデータ組み合わせ数え上げ方式の全体処理フローチャートである。
【図９８】ラージアイテムセット生成処理のフローチャートである。
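[FIG. 99] A diagram (part 1) explaining the generation C(1) of combination candidates of items of length 1 and their counting G(1).
[FIG. 100] A diagram (part 2) explaining the generation C(1) of combination candidates of items of length 1 and their counting G(1).
[FIG. 101] A diagram (part 3) explaining the generation C(1) of combination candidates of items of length 1 and their counting G(1).
[FIG. 102] A diagram (part 4) explaining the generation C(1) of combination candidates of items of length 1 and their counting G(1).
[FIG. 103] A diagram showing the hashed list obtained as the result of the hash processing in the first embodiment of group-by processing.
[FIG. 104] A diagram (part 1) explaining the selection of the large item set L(1) of length 1.
[FIG. 105] A diagram (part 2) explaining the selection of the large item set L(1) of length 1.
[FIG. 106] A diagram (part 3) explaining the selection of the large item set L(1) of length 1.
[FIG. 107] A diagram (part 4) explaining the selection of the large item set L(1) of length 1.
[FIG. 108] A diagram (part 5) explaining the selection of the large item set L(1) of length 1.
[FIG. 109] A diagram (part 6) explaining the selection of the large item set L(1) of length 1.
[FIG. 110] A diagram (part 1) explaining the processing up to the counting G(2) of combination candidates of items of length 2.
[FIG. 111] A diagram (part 2) explaining the processing up to the counting G(2) of combination candidates of items of length 2.
[FIG. 112] A diagram (part 3) explaining the processing up to the counting G(2) of combination candidates of items of length 2.
[FIG. 113] A diagram (part 4) explaining the processing up to the counting G(2) of combination candidates of items of length 2.
[FIG. 114] A diagram (part 1) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 115] A diagram (part 2) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 116] A diagram (part 3) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 117] A diagram (part 4) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 118] A diagram (part 5) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 119] A diagram (part 6) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 120] A diagram (part 7) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 121] A diagram (part 8) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 122] A diagram (part 9) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 123] A diagram (part 10) explaining the selection of the large item set L(2) of items of length 2.
[FIG. 124] A diagram (part 1) explaining the processing up to the counting G(3) of combination candidates of items of length 3.
[FIG. 125] A diagram (part 2) explaining the processing up to the counting G(3) of combination candidates of items of length 3.
[FIG. 126] A diagram (part 3) explaining the processing up to the counting G(3) of combination candidates of items of length 3.
[FIG. 127] A diagram (part 4) explaining the processing up to the counting G(3) of combination candidates of items of length 3.
[FIG. 128] A diagram (part 1) explaining the selection of the large item set L(3) of length 3.
[FIG. 129] A diagram (part 2) explaining the selection of the large item set L(3) of length 3.
[FIG. 130] A diagram (part 3) explaining the selection of the large item set L(3) of length 3.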
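[Explanation of Reference Numerals]
1 Time-series correlation extraction device
2 Input device
3 Output device
4 Time-series filter unit
5 Time-series correlation engine unit
10 Computer main body
11 Display device
12 Input device
13 Removable disk drive (RDD)
14 Hard disk drive (HDD)
15 CPU (central processing unit)
16 ROM (read-only memory)
17 Memory
18 Interface control unit (I/F control unit)
19 Communication control unit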
[0001]
[Field of the Invention] The present invention relates to a time-series correlation extraction device that extracts time-series correlation rules from a large amount of transaction data.
[0002]
More specifically, the present invention relates to database mining, which discovers association rules among data recorded in a database, and to a technique for counting the occurrences of correlated data combinations within the huge amount of data in a database. From the results counted with this technique, association rule generation (association rule extraction), one of the data mining methods, is performed using the combinations that meet given conditions and their numbers of occurrences. In recent years, correlation analysis using association rules has attracted wide attention, mainly in the United States.
[0003]
[Description of the Related Art] Conventional examples are described below.
[0004]
§1: Description of Conventional Example 1
FIG. 43 is an explanatory diagram of Conventional Example 1: part A illustrates the time-series correlation extraction device, and part B is a processing flowchart. In FIG. 43, S1 to S6 denote the processing steps.
[0005]
As shown in part A of FIG. 43, the conventional time-series correlation extraction device contains a time-series correlation engine unit 5, which takes in the input data, processes it, and extracts the correlation rules. In general, the extraction of time-series correlation rules was performed either on the raw transaction rules or on transaction data limited to some portion, such as a period or a store.
[0006]
When the time-series correlation engine unit 5 performs the processing, the processing is performed according to the processing flowchart shown in FIG. 43B. In this process, first, the time-series correlation engine unit 5 reads out transaction data (S1) and determines whether or not it is empty (S2). As a result, if it is empty, the process is terminated. If it is not empty, a combination of items is generated (S3), and the number of combinations is counted (S4).
[0007]
Then, the time-series correlation engine unit 5 performs a time-series correlation rule generation process (or an extraction process) (S5), outputs the generated time-series correlation rule (S6), and proceeds to the process of S1.
[0008]
§2: Description of Conventional Example 2
Conventionally, a device has been known that obtains time-series correlation rules from sequence data of the items contained in temporally consecutive transactions: for combinations of i = 2 or more items, and in a form that preserves the order within the sequence, it determines the single items or combinations of two or more items that satisfy a given condition, together with their numbers of occurrences (see Patent Document 1). This corresponds to the technology up to the prerequisite "in time ..." recited in claim 1 of the present application. The part of Patent Document 1 that is particularly relevant to the present invention is described in detail below as Conventional Example 2.
[0009]
(1): Description of configuration
In Conventional Example 2, information including the following configuration is disclosed.
[0010]
① A method of counting correlated data combinations, which determines, from a large number of transactions each containing one or more items as data, the single items and combinations of two or more items whose number of occurrences in the transactions meets a given condition, together with those numbers of occurrences, the method comprising: counting, one item at a time, the items contained in each transaction to obtain the count of each item over all transactions; selecting the items whose count meets the given condition and outputting each such item with its count as a counting result; creating a bitmap in which "1" is set at the bit position corresponding to each selected item; setting i, the number of items to be combined, to i = 2; generating the combinations of i items contained in each transaction using the items corresponding to the positions where "1" is set in the bitmap; counting the occurrences of each combination over all transactions; selecting the combinations whose count meets the given condition and outputting each such combination with its count as a counting result; creating a bitmap in which "1" is set at the bit positions corresponding to each selected combination, to its partial combinations, or to the individual items it contains; and incrementing i and repeating the processing from the generation of the combinations of i items.
[0011]
② The method of counting correlated data combinations according to ① above, in which, from sequence data of the items contained in temporally consecutive transactions among the large number of transactions, the single items or combinations of two or more items that meet the given condition, and their numbers of occurrences, are determined for combinations of i = 2 or more items in a form that preserves the order within the sequence.
[0012]
(2): Technical field
This relates to data mining, which discovers association rules among data recorded in a database, and more specifically to a method of counting the occurrences of correlated data combinations within the huge amount of data in a database. From the results counted by this method, association rule generation (extraction), one of the data mining methods, is performed using the combinations that meet a given condition and their numbers of occurrences.
[0013]
(3): Conventional technology
①: Conventional technology related to the data combination counting method
Since this data combination counting method forms part of the association rule generation process in database mining, association rules are described first. As described later, the counting of data combinations in the present invention uses the group-by processing method of the present invention as part of its processing.
[0014]
As an example, suppose that among 100 customer receipts collected at a retail POS (point of sale), 20 customers purchased product A and 12 customers purchased both product A and product B. One product is called an item, and one receipt is called a transaction.
[0015]
One transaction usually contains multiple items. Using the formula
Support of an item = number of transactions containing the item / total number of transactions,
the support of product A is 20%, and the support of product A and product B together is 12%. A simple conditional probability calculation then gives "60% (12% / 20%) of the customers who purchase A also purchase B". This is written "A → B, confidence 60%, support 12%" and is defined as an association rule. That is, the confidence of the association rule "A → B" is
Confidence of "A → B" = support of A∧B (both A and B purchased) / support of A.
Rules are not limited to simple ones such as A → B; more complex rules such as A∧B → C∧D∧E ("customers who purchase A and B also purchase C, D, and E") are also possible, in which case the confidence is
Confidence of "A∧B → C∧D∧E" = support of A∧B∧C∧D∧E / support of A∧B.
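The support and confidence definitions above can be checked with a small sketch. The receipt list below is hypothetical: it is constructed only so that 20 of 100 receipts contain A and 12 of those also contain B, as in the example, and the item "C" and the function names are placeholders that do not appear in the source.

```python
from fractions import Fraction

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return Fraction(hits, len(transactions))

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical receipts matching the counts in the text: 100 receipts,
# 20 containing A, of which 12 also contain B. "C" is a filler item.
receipts = [{"A", "B"}] * 12 + [{"A"}] * 8 + [{"C"}] * 80

print(support(receipts, {"A"}))            # 1/5  (20%)
print(support(receipts, {"A", "B"}))       # 3/25 (12%)
print(confidence(receipts, {"A"}, {"B"}))  # 3/5  (60%)
```

Exact fractions are used here only to avoid floating-point rounding in the comparison with the 20%, 12%, and 60% figures.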
[0016]
Association rules are useful in many situations, such as evaluating which product groups contributed to sales, optimizing shelf layout (which products should be placed near each other), and raising the hit rate of direct mail based on credit card data.
[0017]
Association rule generation consists of two stages: ① counting, from the transactions, the occurrences of the item combinations that satisfy a given support condition, and ② computing the rules, together with their support and confidence, from the combinations obtained in ① and their numbers of occurrences.
[0018]
In ① above, the group of item combinations that satisfies the given support condition is called a "large item set". The support condition is a range from a minimum value (0% [count all combinations] to 100% [count only the items purchased in every transaction]) to a maximum value (minimum value ≤ maximum value ≤ 100%). Conventionally, the maximum value is often fixed at 100%.
[0019]
Because the large item set counting process ① is very time-consuming, various methods of speeding it up have been proposed. Typical examples are the SQL-based SETM algorithm and Apriori, which are among the several algorithms proposed by IBM (registered trademark).
[0020]
Association rule generation based on SETM uses SQL, the query language of relational databases, and has the advantage of being easy to implement. The processing uses the SQL join operation and group-by operation. A self-join is performed on the table of transactions containing item combinations of length k−1 that satisfy the minimum support, generating the candidate item combinations of length k.
[0021]
Next, a large item set having a length k is counted using a group-by operation. Further, a transaction group that satisfies the minimum value of the support is generated by using the join operation, and is used for generating the next combination of items of length k + 1.
[0022]
FIG. 44 is a diagram illustrating a specific processing flow in the SETM algorithm, and FIG. 45 is a diagram illustrating processing contents of each functional block in the processing of the SETM algorithm. With reference to these figures, the processing of the SETM as a conventional technique will be described in detail.
[0023]
In FIG. 44, table R1′ shows the items x contained in each transaction t. For example, transaction 1 contains items 1, 2, and 3. GB(1) counts the number of occurrences of each single item (group-by processing), and table L1 shows the counting results for the items whose count is 2 or more.
[0024]
Table R1 shows the result of extracting, by the join processing J(1), only the items present in table L1 from the data in table R1′.
[0025]
SJ(1) denotes a self-join of table R1; it yields, as table R2′, the possible combinations of two items within each transaction.
[0026]
The group-by processing GB(2) counts the occurrences of the two-item combinations in table R2′, and the combinations whose count is 2 or more are stored as table L2.
[0027]
Similarly, table L3 is created from the combinations of three items whose count is 2 or more. Table L4 is created in the same way from the combinations of four items, but its contents are empty.
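The alternation of join and group-by steps described for SETM can be sketched as follows. This is a simplified illustration rather than the SQL formulation: Python lists stand in for the tables, each round joins the current table with R1 on the transaction id, and the four transactions are not the data of FIG. 44 but an assumed small example (the same list that the later worked example of this description uses). All function and variable names are hypothetical.

```python
from collections import Counter

MIN_COUNT = 2  # "count of 2 or more", as used for L1, L2, ... in the text

def group_by_count(rows):
    """GB(k): count how many rows carry each item combination."""
    return Counter(combo for _tid, combo in rows)

def keep_large(counts):
    """Keep only the combinations whose count reaches MIN_COUNT."""
    return {combo: n for combo, n in counts.items() if n >= MIN_COUNT}

def setm(transactions):
    # R1': one (transaction id, 1-item tuple) row per item occurrence.
    r1_prime = [(tid, (item,)) for tid, items in transactions.items()
                for item in sorted(items)]
    large = keep_large(group_by_count(r1_prime))          # L1
    r1 = [row for row in r1_prime if row[1] in large]     # J(1) -> R1
    result, rk = [], r1
    while large:
        result.append(large)
        # Join R(k) with R1 on the transaction id, extending each
        # combination by a later item of the same transaction: R(k+1)'.
        rk_prime = [(tid, combo + item)
                    for tid, combo in rk
                    for tid1, item in r1
                    if tid == tid1 and item[0] > combo[-1]]
        large = keep_large(group_by_count(rk_prime))       # L(k+1)
        rk = [row for row in rk_prime if row[1] in large]  # J(k+1)
    return result

# Assumed example data (not the contents of FIG. 44).
TL = {1: [1, 2, 4], 2: [2, 3, 6], 3: [1, 4, 5, 6], 4: [1, 2, 4, 5]}
tables = setm(TL)
print(len(tables))        # 3: the table for length 4 is empty
print(sorted(tables[2]))  # [(1, 2, 4), (1, 4, 5)]
```

The loop stops as soon as a table L(k) comes out empty, which mirrors the empty table L4 in the description.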
[0028]
In the Apriori algorithm, candidate item combinations of length k are generated from the large item set of length k-1 that satisfies the minimum support. If the entire large item set of length k-1 fits in memory, each combination of length k is checked to see whether all of its sub-combinations of length k-1 are contained in the large item set, and only if they all are is it treated as a candidate combination of length k.
[0029]
Unnecessary candidates are pruned by registering all combinations of length k-1 in a hash table in memory. The candidate item combinations are held in a hash tree; for each transaction, whenever a combination of items contained in the transaction is registered in the hash tree, its count is incremented. This counts the occurrences of the candidate combinations of length k. Restricting the counting to the combinations registered in the hash tree avoids counting unnecessary combinations.
[0030]
FIG. 46 illustrates a specific processing flow of the Apriori algorithm, and FIG. 47 illustrates the contents of each functional block in the Apriori algorithm. A specific example of large item set counting by the Apriori algorithm is described with reference to these figures.
[0031]
In FIG. 46, the contents of the list TL of eight transactions are substantially the same as in FIG. First, the items contained in these transactions are input one at a time to Subset(1), and the number of occurrences of each item is counted as C1. The counts are input to F, which keeps by filtering the items with 2 or more occurrences, and the filtering result is created as L1.
[0032]
Combinations of two items are selected from the items in L1 and registered in the hash tree as C2. Subset(2) then counts, for each transaction, the registered two-item combinations that the transaction contains, which gives the number of occurrences of each two-item combination. The result is filtered by F, and the two-item combinations with 2 or more occurrences are obtained as L2.
[0033]
By the same processing, the combinations of three items with 2 or more occurrences are obtained as L3. The process ends when it is determined that, as in FIG. 44, no combination of four items has 2 or more occurrences.
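The candidate generation and pruning rule of Apriori described above can be illustrated with a short sketch. It is a simplification under stated assumptions: a Python set and dict stand in for the hash table and hash tree, candidates are enumerated by brute force rather than by a join, `large_prev` must be non-empty, and the input data is an assumed example rather than the eight transactions of FIG. 46. The function names are hypothetical.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Candidates of length k whose every (k-1)-subset is in `large_prev`.

    `large_prev` is a non-empty set of sorted tuples of length k-1.
    """
    k = len(next(iter(large_prev))) + 1
    items = sorted({item for combo in large_prev for item in combo})
    return [cand for cand in combinations(items, k)
            if all(sub in large_prev for sub in combinations(cand, k - 1))]

def count_candidates(transactions, candidates):
    """Count, per candidate, the transactions that contain it.

    A plain dict is used here in place of the hash tree in the text.
    """
    counts = {cand: 0 for cand in candidates}
    for t in transactions:
        present = set(t)
        for cand in candidates:
            if present.issuperset(cand):
                counts[cand] += 1
    return counts

# Assumed example: a large item set of length 2 and four transactions.
L2 = {(1, 2), (1, 4), (1, 5), (2, 4), (4, 5)}
TL = [[1, 2, 4], [2, 3, 6], [1, 4, 5, 6], [1, 2, 4, 5]]

C3 = apriori_gen(L2)
print(C3)                        # [(1, 2, 4), (1, 4, 5)]
print(count_candidates(TL, C3))  # {(1, 2, 4): 2, (1, 4, 5): 2}
```

Here (1, 2, 5) and (2, 4, 5) are pruned because (2, 5) is not in the length-2 large item set, so they are never counted.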
[0034]
(4): Explanation of combination counting system
In FIG. 96, the counting method comprises: a combination generation unit C(i) 1, which generates candidate combinations of items considered to be correlated; an occurrence counting unit G(i) 2, which counts the occurrences of the candidate combinations within the transactions by group-by processing; a combination selection unit F 3, which selects a combination as an element of the large item set L(i) when its count falls within a specified range, that is, when the combination is regarded as correlated; bitmaps b1 5 and b(i-1) 6, which the combination generation unit C(i) 1 uses to prune combinations; and a bitmap generation unit B(i) 4, which generates, from the large item set output by the combination selection unit F 3, the bitmaps used by C(i+1).
[0035]
Each unit in FIG. 96 performs the following processing.
C(i)
When i = 1: Sends the items contained in the same transaction to G(i) one at a time.
When i ≥ 2: Among the combinations of i items contained in the same transaction, sends to G(i) those that are not excluded by the bitmap filters b1, b2, ..., b(i-1). Here, the filters b1, b2, ..., b(i-1) correspond to the bitmaps b1 through b(i-1) of FIG. 96.
G(i)
Receives records in which i items are arranged, performs the group-by processing described later using the entire record as the key, calculates the number of records in each group, and outputs each record with that count appended.
F
Receives the output of G(i) and outputs, as the large item set L(i) of length i, the records whose count satisfies the given condition.
B(i)
Receives the output [item 1, ..., item i, count] of F; for each bitmap filter bj (1 ≦ j ≦ i), extracts all combinations of j items from [item 1, ..., item i] and sets "1" at the bit position computed by Hj(combination of j items) for each combination. Because b1, b2, ..., b(i-1) already exist at this point, they are updated. Because bi does not yet exist, it is newly created and then updated.
[0036]
FIG. 97 is an overall processing flowchart of the combination counting method. When the processing starts, i, the number of items in a combination, is first set to 1 in step S501, and L(1) is then generated in step S502. L(1) is the large item set of combinations containing only one item.
[0037]
Thereafter, in step S503, it is determined whether the number of elements of L(i), that is, the number of combinations, is i + 1 or more. At this point the check is whether the large item set L(1) has two or more elements. If the result is Yes, i is incremented in step S504 and the processing from step S502 is repeated.
[0038]
That is, in S502 the large item set L(2) is now generated as the set of two-item combinations whose number of occurrences is within the specified range, and the processing from S503 onward is then executed. When it is determined in step S503 that the number of elements in the large item set L(i) is less than i + 1, the processing ends.
[0039]
FIG. 98 is a flowchart of the large item set generation process. When the processing starts, the first transaction in the transaction list TL is read in step S510. In step S511, candidate combinations of i items are generated and the candidates, that is, item 1 to item i, are sent to the occurrence counting unit G(i). In step S512 it is determined whether the transaction list is empty. If it is not empty, the processing from step S510 is repeated, so that candidates are generated from each transaction in the list and sent to G(i).
[0040]
If it is determined in step S512 that the transaction list is empty, then in step S513 each candidate item combination and its number of occurrences in the transactions, that is, its count, are sent from the counting unit G(i) to the combination selection unit F, and the combination selection process is performed. In step S514 the selection result is stored as the large item set L(i), and at the same time the item combinations in the large item set are sent to the bitmap generation unit B(i). In step S515 the bitmap generation process is performed. In step S516 it is determined whether any record remains in the counting unit G(i). If so, the processing from step S513 is repeated; if not, the processing ends.
[0041]
Next, using a specific example, the processing of the combination counting method will be described in more detail. Here, as a specific example, a transaction list TL including the following four transactions T1 to T4 is targeted.
[0042]
T1 = [1,2,4], T2 = [2,3,6]
T3 = [1,4,5,6], T4 = [1,2,4,5]
It is assumed that the items inside each transaction are sorted in order of item number. The selection condition set in the combination selection unit F is a minimum support of 50%: a combination candidate is selected into the large item set when its number of occurrences is 2 or more out of the 4 transactions in total.
[0043]
First, generation of the large item set L(1) of length 1 will be described. FIG. 99 to FIG. 102 illustrate the generation of combination candidates of length 1 and the counting of their occurrences.
[0044]
First, in FIG. 99, the transaction T1 = [1,2,4] is read and input to C(1). C(1) passes on the items contained in the same transaction one at a time, so the three items [1], [2], and [4] are input to the occurrence counting unit G(1). G(1) counts the occurrences of the items [1], [2], and [4] received from C(1) and holds each item paired with its count. Since each of [1], [2], and [4] has been input once, [1,1], [2,1], and [4,1] are held in the form [item, count]. This completes the processing of T1.
[0045]
When the input to G(1) for one transaction is complete, control returns to C(1) to process the next transaction. In FIG. 100, T2 = [2,3,6] is input to G(1) as the three items [2], [3], and [6]. G(1) adds the newly input items to the counts accumulated from the transactions processed so far.
[0046]
Since item [2] is input from both T1 and T2, [2,2] is held in the form [item, count] when the processing of T2 is complete.
[0047]
In the same manner, for all transactions, C(1) inputs the items one at a time to G(1), and G(1) counts them. After all transactions up to T4 have been processed, the [item, count] pairs are [1,3], [2,3], [4,3], [3,1], [6,2], and [5,2]. FIG. 103 shows the result.
[0048]
When the input to G(1) is complete for all transactions, the [item, count] pairs are taken out of G(1), and F selects the items that satisfy the minimum support. FIG. 104 to FIG. 109 illustrate this processing. In FIG. 104, the count 3 of [1,3] satisfies the minimum support of 50%, so [1,3] is registered in the large item set L(1). At the same time, B(1) registers item [1] in bitmap b1. The bit position (0 to 5) corresponding to an item is obtained by the following hash function H1.
H1(item 1) = (number of item 1) mod 6.
The bitmap is given 6 bits because the transactions contain 6 unique items in total. If the bitmap does not fit in memory, or if memory is to be reserved for other processing, a value smaller than the number of unique items, 6, can be used. All bits of the bitmap are initially "0". Applying the hash function to item [1] gives 1 mod 6 = 1, so "1" is set at bit position 1 of b1, the second bit from the top.
[0049]
Next, in FIG. 105, [2,3] is taken out of G(1). The count 3 of item [2] satisfies the minimum support of 50% or more, so it is registered in the large item set. At the same time, since 2 mod 6 = 2, B(1) sets "1" in the third bit of bitmap b1.
[0050]
In FIG. 106, FIG. 108, and FIG. 109, the pairs [4,3], [6,2], and [5,2] satisfy the minimum support condition, so the items [4], [6], and [5] are registered with their counts in the large item set L(1), and "1" is set at the bit positions of bitmap b1 given by the hash function.
[0051]
In FIG. 107, however, the count 1 of [3,1] does not satisfy the minimum support condition, so nothing is registered in the large item set L(1) or in bitmap b1. As a result of this processing, bitmap b1 = {1,1,1,0,1,1} and the large item set of length 1, L(1) = {[1,3], [2,3], [4,3], [6,2], [5,2]}, are generated.
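The first pass just described (C(1) feeding items, G(1) counting, F selecting, B(1) setting bits through H1) can be reproduced with a minimal sketch. The function names are hypothetical, and a Python dict stands in for the group-by processing of G(1), which is an assumption made for brevity.

```python
def h1(item, nbits=6):
    """H1 from the text: bit position = item number mod 6."""
    return item % nbits

def first_pass(transactions, min_count, nbits=6):
    counts = {}                      # G(1): dict keeps first-seen order
    for t in transactions:           # C(1): feed items one at a time
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    large, b1 = [], [0] * nbits
    for item, n in counts.items():   # F selects, B(1) sets the bit
        if n >= min_count:
            large.append((item, n))
            b1[h1(item, nbits)] = 1
    return counts, large, b1

TL = [[1, 2, 4], [2, 3, 6], [1, 4, 5, 6], [1, 2, 4, 5]]
counts, l1, b1 = first_pass(TL, min_count=2)
print(list(counts.items()))  # [(1, 3), (2, 3), (4, 3), (3, 1), (6, 2), (5, 2)]
print(l1)                    # [(1, 3), (2, 3), (4, 3), (6, 2), (5, 2)]
print(b1)                    # [1, 1, 1, 0, 1, 1]
```

Item 6 hashes to bit position 0 (6 mod 6 = 0), which is why the leftmost bit of b1 is set even though no item is numbered 0.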
[0052]
Next, generation of the large item set L(2) of length 2 will be described. FIG. 110 to FIG. 113 illustrate the processing up to the counting of the item combination candidates. Combinations of length 2 are generated using the bitmap b1 that has already been created. First, in FIG. 110, the transaction T1 = [1,2,4] is read, and candidate combinations of length 2 are generated using only the items that are set in bitmap b1. All three items [1], [2], and [4] are registered in b1, so the three candidates [12], [14], and [24] are generated from them. The counting unit G(2) counts the occurrences of the combinations [12], [14], and [24] received from C(2) and holds each combination paired with its count. Since [12], [14], and [24] have each been input once, [12,1], [14,1], and [24,1] are held in the form [item 1 item 2, count]. This completes the processing of T1.
[0053]
When the items of the next transaction T2 = [2,3,6] are filtered by bitmap b1, item [3] is dropped. As a result, only the single candidate [26] is input to G(2) in FIG. 111. The same processing is performed for T3 and T4. Because no item of T3 or T4 is filtered out by bitmap b1, all combinations of length 2 generated from these transactions are input to G(2) in FIG. 112 and FIG. 113.
[0054]
When the input to G(2) is complete for all transactions, the [item 1 item 2, count] records are taken out of G(2), and F selects those whose count satisfies the minimum support condition. This process is shown in FIGS. 114 to 123. The count 2 of [12,2] satisfies the minimum support of 50% or more, so in FIG. 114 the candidate [12] and its count are registered in the large item set L(2). At the same time, B(2) registers each item, [1] and [2], of the combination [12] in bitmap b1; that is, "1" is set at bit positions 1 and 2 of b1. In addition, for [12], "1" is set at bit position 3 of bitmap b2 (bit positions 0 to 4), as calculated by the following hash function H2.
[0055]
H2(item 1, item 2) = (number of item 1 + number of item 2) mod 5.
Because H2 takes the sum of the numbers of item 1 and item 2 modulo 5, the bitmap is given 5 bits. Initially, all five bits of b2 are "0". This hash function is only an example; it can be chosen freely with regard to hashing efficiency and to the memory available for the bitmap. In b1, the bit position set for an item is determined by that single item, whereas in b2 the bit position is determined by a hash function that takes two items as arguments.
[0056]
As shown in FIG. 115, FIG. 116, FIG. 118, and FIG. 120, [14,3], [24,2], [15,2], and [45,2] satisfy the minimum support, so these item combinations are registered in the large item set L(2) and bitmaps b1 and b2 are updated at the same time. In FIGS. 117, 119, and 121 to 123, by contrast, the minimum support is not satisfied, so nothing is registered in the large item set and the bitmaps are not updated. As a result, the large item set of length 2 is L(2) = {[12,2], [14,3], [24,2], [15,2], [45,2]}, with b1 = {0,1,1,0,1,1} and b2 = {1,1,0,1,1}. The result is shown in FIG.
[0057]
Next, generation of L(3) will be described. FIG. 124 to FIG. 127 illustrate the processing up to the counting of the combination candidates. Combinations of length 3 are generated using the bitmaps b1 and b2 created when the large item set of length 2 was generated. First, the transaction T1 = [1,2,4] is read, and combinations of length 3 are generated using only the items that are set in bitmap b1. Here the three items [1], [2], and [4] are set in b1. Next, it is checked whether the length-2 combinations [12], [14], and [24] are set in bitmap b2. Since b2[H2(1,2)] = b2[H2(1,4)] = b2[H2(2,4)] = 1, [124] is selected as a combination candidate and input to G(3). This process is shown in FIG. 124.
[0058]
As shown in FIG. 125, for T2 = [2,3,6], item [3] is not set in b1 and is filtered out, leaving a transaction of length 2. No combination of length 3 can therefore be generated, and nothing is input to G(3).
[0059]
For T3 = [1,4,5,6], the bits of b1 corresponding to [1], [4], [5], and [6] are set, so the combinations [145], [146], [156], and [456] are possible. Bitmap b2 is then checked. For [145], the corresponding bits of b2 are set for all of [14], [15], and [45], so it is input to G(3) in FIG. 126 as a combination candidate.
[0060]
For [146], however, the bits of b2 corresponding to [14] and [46] are set but the bit for [16] is not, so [146] is not input to G(3) as a combination candidate. Similarly, [456] is input to G(3) because the bits of b2 corresponding to all of its two-item combinations [45], [46], and [56] are set, whereas [156] is not input to G(3) because the bit of b2 for [16] is not set.
[0061]
For T4 = [1,2,4,5], b1 and b2 are checked in the same way, and [124] and [145] are input to G(3) as combination candidates. FIG. 127 shows the result.
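The b2 check applied to the length-3 candidates can be traced with a small sketch. It assumes the value b2 = {1,1,0,1,1} stated above for the end of the length-2 pass and takes as input the items that the walkthrough treats as having passed the b1 check; the function names are hypothetical.

```python
from itertools import combinations

B2 = [1, 1, 0, 1, 1]  # b2 after the length-2 pass, as given in the text

def h2(a, b):
    """H2 from the text: (item 1 number + item 2 number) mod 5."""
    return (a + b) % 5

def length3_candidates(items, b2=B2):
    """Keep the 3-item combinations whose 2-item sub-combinations
    all map to a set bit in b2."""
    kept = []
    for cand in combinations(items, 3):
        if all(b2[h2(a, b)] for a, b in combinations(cand, 2)):
            kept.append(cand)
    return kept

print(length3_candidates([1, 2, 4]))     # [(1, 2, 4)]
print(length3_candidates([1, 4, 5, 6]))  # [(1, 4, 5), (4, 5, 6)]
print(length3_candidates([1, 2, 4, 5]))  # [(1, 2, 4), (1, 4, 5)]
```

The candidate (4, 5, 6) passes even though [46] and [56] are not in L(2): H2(4,6) = 0 and H2(5,6) = 1 collide with the positions set by [14] and by [24] and [15]. A bitmap filter of this kind can let through extra candidates but never rejects a combination whose sub-combinations were all registered, and the extra candidates are removed later by the count check in F.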
[0062]
In this example, to prevent the generation of unnecessary candidates, the bits of bitmap b2 corresponding to all length-2 combinations were checked when generating the combinations of length 3. This is effective as pruning that prevents unnecessary counting. If, for example, the memory available for processing is insufficient, a method that checks only the individual items or only the combinations of length 2, or a method that uses neither b1 nor b2, may be used instead. It is also possible to use b1 and b2 but reduce the number of length-2 combinations that are checked.
[0063]
For example, when generating a combination of length 3, only the single length-2 combination at its head might be checked.
[0064]
When the input to G(3) is complete, as shown in FIGS. 128 to 130, the [item 1 item 2 item 3, count] records are taken out of G(3), and F selects those whose count satisfies the minimum support condition. In FIG. 128, the count 2 of [124,2] satisfies the minimum support of 50% or more, so it is registered in the large item set L(3) of length 3. At the same time, B(3) sets "1" at the bit positions of bitmap b1 corresponding to the items [1], [2], and [4] of [124]. In addition, using the hash function H3, "1" is set at the bit position of bitmap b3 corresponding to the three-item combination [124].
[0065]
H3(item 1, item 2, item 3) = (number of item 1 + number of item 2 + number of item 3) mod 5.
Combinations of two items could also be extracted from [124] at this point to generate bitmap b2, but here only b1 and b3 are generated.
[0066]
As a result of the subsequent processing in FIGS. 129 and 130, the large item set of length 3 is L(3) = {[124,2], [145,2]}, with bitmaps b1 = {0,1,1,0,1,1} and b3 = {1,0,1,0,0}.
[0067]
In FIG. 130, the large item set L(3) of length 3 has two elements, which is fewer than the i + 1 = 4 required to continue at step S503 in FIG. 97, so the processing ends. This completes the large item set counting process, and association rules can be generated from its results.
[0068]
In the example described above, C(i) for i > 1 generates, for each transaction, the candidate combinations of length i that are not excluded by bitmaps b1 to b(i-1). At the same time, the items of each transaction that are not excluded by the bitmaps can be stored as a new, reduced transaction. Then, for i > 2, candidate item combinations can be generated from the reduced transaction group produced by C(i-1) instead of from the original transaction group TL. This concludes the detailed description of the data combination counting method used for association rule generation.
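The whole loop of FIG. 97, including the termination test of step S503, can be sketched end to end as below. Several points are assumptions made for illustration and are not taken from the source: the function and parameter names, the use of a dict for G(i), a single sum-modulo hash covering H1, H2, and H3, and the choice to rebuild b1 from each new large item set (which reproduces the b1 values listed in the text). With that last choice item 6 is dropped before the length-3 pass, so the candidate [456] considered in the walkthrough is never generated here, but the resulting L(3) is the same.

```python
from itertools import combinations

def hash_pos(combo, nbits):
    """Sum of item numbers modulo the bitmap size (H1, H2, H3 in the text)."""
    return sum(combo) % nbits

def count_large_item_sets(transactions, min_count, nbits):
    """Return ([L(1), L(2), ...], bitmaps); nbits[j] is the size of bj."""
    bitmaps, results, i = {}, [], 1
    while True:
        counts = {}                                     # G(i)
        for t in transactions:                          # C(i)
            items = [x for x in t
                     if i == 1 or bitmaps[1][hash_pos((x,), nbits[1])]]
            for cand in combinations(items, i):
                if i > 2 and not all(
                        bitmaps[i - 1][hash_pos(sub, nbits[i - 1])]
                        for sub in combinations(cand, i - 1)):
                    continue                            # pruned by b(i-1)
                counts[cand] = counts.get(cand, 0) + 1
        large = {c: n for c, n in counts.items() if n >= min_count}  # F
        results.append(large)
        # B(i): rebuild b1 and create bi from the selected combinations.
        bitmaps[1] = [0] * nbits[1]
        bitmaps[i] = [0] * nbits[i]
        for combo in large:
            for x in combo:
                bitmaps[1][hash_pos((x,), nbits[1])] = 1
            bitmaps[i][hash_pos(combo, nbits[i])] = 1
        if len(large) < i + 1:                          # step S503
            return results, bitmaps
        i += 1

TL = [[1, 2, 4], [2, 3, 6], [1, 4, 5, 6], [1, 2, 4, 5]]
results, bitmaps = count_large_item_sets(TL, 2, {1: 6, 2: 5, 3: 5})
print(results[2])  # {(1, 2, 4): 2, (1, 4, 5): 2}
print(bitmaps[1])  # [0, 1, 1, 0, 1, 1]
print(bitmaps[2])  # [1, 1, 0, 1, 1]
print(bitmaps[3])  # [1, 0, 1, 0, 0]
```

The `nbits` mapping only covers lengths 1 to 3 because the example stops there; a longer run would need a bitmap size for each further length.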
[0069]
(5): Application to time series analysis as a method of correlation analysis
This time series analysis is used for analyzing a customer's product purchase pattern over a long period of time. If the probability of purchasing the second item within a predetermined period after the customer purchases the first item can be known as a correlation rule, for example, a retailer can more effectively perform inventory management.
[0070]
FIG. 48 is an explanatory diagram of a sequence list as a product purchase pattern of customers over a long period of time. In this figure, the first sequence shows, for example, that a customer purchases only item 3, purchases only item 8 the next day, and then one week later purchases one each of items 3 and 8. The figure shows such time-series patterns of product purchases.
[0071]
In the description of FIG. 48, one element corresponds to the above-described transaction, that is, one receipt. For example, that items 3 and 8 belong to the same element means that items 3 and 8 are recorded on the same receipt.
[0072]
In the time series analysis described below, the minimum value of the support is set to 40%. Since the sequence list in FIG. 48 is composed of five sequences, a support of 40% means that the number of appearances of the combination of items is 2 or more. The operation of the combination generation unit C(i), the counting unit G(i), the combination selection unit F, and the bitmap generation unit B(i) in time series analysis is described below.
C(i)
When i = 1: Sends the items included in the same sequence to G(i) one by one.
When i ≥ 2: Among the permutations of i items included in the same sequence, sends to G(i) those not excluded by the bitmap filters b1, b2, ..., b(i-1).
G(i)
Receives the permutations of i items, performs the group-by processing described later, and outputs each permutation with its number of appearances added.
F
Receives the output of G(i) and outputs those permutations whose number satisfies the given condition as the large sequence L(i) of length i.
B(i)
Receives the output of F, extracts all permutations of j items from each permutation of i items for the bitmap filter bj (1 ≦ j ≦ i), and, for each such permutation, sets "1" at the bit position calculated by Hj (permutation of j items).
[0073]
The internal representation of a permutation of j items is [item 1, ..., item j, separator 1, ..., separator (j-1)], where separator k (1 ≦ k ≦ j−1) is 0 if item k and item (k + 1) belong to the same element and 1 if they belong to different elements.
[0074]
The hash function Hj is expressed as Hj (item 1, ..., item j, separator 1, ..., separator (j-1)) = (item 1 number + ... + item j number + separator 1 + ... + separator (j-1)) mod N, where N is the number of bits in the bitmap.
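As an illustration of the hash function Hj, the following sketch computes the bit position for a permutation given as a list of item numbers plus a list of separators. The function name `hj` and the choice N = 8 in the examples are assumptions made for illustration.

```python
def hj(items, separators, n_bits):
    """Bit position for a permutation of j items.

    items      -- item numbers in sequence order
    separators -- j-1 values: 0 if neighbouring items share an element, 1 otherwise
    n_bits     -- number of bits N in the bitmap
    """
    if len(separators) != len(items) - 1:
        raise ValueError("a permutation of j items needs j-1 separators")
    return (sum(items) + sum(separators)) % n_bits


print(hj([3, 8], [1], 8))  # <(3)(8)>: (3 + 8 + 1) mod 8 = 4
print(hj([3, 8], [0], 8))  # <(3,8)>:  (3 + 8 + 0) mod 8 = 3
print(hj([8, 3], [1], 8))  # <(8)(3)>: also 4
```

<(3)(8)> and <(8)(3)> map to the same bit, so a set bit only means that a candidate may be large. The bitmap can exclude candidates but cannot confirm them.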
[0075]
FIGS. 49 to 53 are explanatory diagrams of the processing up to G(1), which counts the number of appearances of each single item belonging to the sequence list. In FIGS. 49 to 53, the items included in each sequence are counted one by one by G(1), and FIG. 53 is obtained as the final result.
[0076]
FIGS. 54 to 61 are explanatory diagrams of a process of selecting, from the counts of the number of appearances of each item obtained in FIGS. 49 to 53, the items having a support value of 40% or more to create a large sequence L(1), and of simultaneously creating a bitmap b1.
[0077]
For example, the number of appearances of item 3 input in FIG. 54 is four. Since this satisfies the minimum value of support, item 3 becomes an element of the large sequence L(1). At the same time, item 3 is given to the bitmap generation unit B(1), and a bit is set in the bitmap b1. Here, the number of bits of the bitmap b1 is eight.
[0078]
In FIGS. 55 to 58, since the number of appearances of the items input to the combination selection unit F satisfies the support value, those items are added to the large sequence L(1). In the other figures, the number of appearances of the input item does not satisfy the support value, so the item is not registered in the large sequence L(1), and the final result of this processing is shown in FIG.
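The creation of L(1) and b1 can be sketched as follows. The five sequences are invented for illustration and are not the contents of FIG. 48. The sketch assumes that an item is counted at most once per sequence, and that H1 is the general hash Hj with j = 1, that is, the item number mod N with N = 8.

```python
from collections import Counter


def count_items(sequences):
    """G(1): count, for each item, the number of sequences that contain it."""
    counts = Counter()
    for sequence in sequences:
        counts.update({item for element in sequence for item in element})
    return counts


def select_large(counts, n_sequences, min_support_percent):
    """F: keep the items whose count reaches the minimum support."""
    min_count = -((-min_support_percent * n_sequences) // 100)  # ceiling division
    return {item: n for item, n in counts.items() if n >= min_count}


def build_b1(large_items, n_bits=8):
    """B(1): set the bit H1(item) = item mod n_bits for every large item."""
    bitmap = [0] * n_bits
    for item in large_items:
        bitmap[item % n_bits] = 1
    return bitmap


# Hypothetical sequence list: each sequence is a list of elements (receipts).
sequences = [
    [[3], [8], [3, 8]],
    [[1], [3], [4]],
    [[3, 4], [5]],
    [[3], [5, 7]],
    [[7]],
]
L1 = select_large(count_items(sequences), len(sequences), 40)
b1 = build_b1(L1)
print(L1)  # items 3, 4, 5 and 7 reach the minimum count of 2
print(b1)
```

With five sequences and a minimum support of 40%, the minimum count is 2, so items 1 and 8 (one appearance each in this invented data) are dropped.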
[0079]
FIGS. 62 to 66 are explanatory diagrams of the counting process for combination candidates of two items. In FIG. 62, C(2) generates combination candidates of two items from the items that are not dropped by filtering with the bitmap b1, and G(2) counts their numbers of appearances. Because order within the sequence matters, and because it also matters whether the two items belong to the same element, the following three combination candidates are counted as separate candidates: <(3) (8)>, <(8) (3)>, <(3,8)>.
In FIGS. 63 to 66, the combination candidates of two items are counted in the same way. For example, in FIG. 63, items 1, 2, and 6 are filtered out by the bitmap b1 and are not used for generating combination candidates. Also, in FIG. 66, only one item is included in the sequence, so no combination candidate of two items is generated, and FIG. 65 is the final result.
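The generation of two-item candidates by C(2) can be sketched as below. Each candidate is written as (item 1, item 2, separator), following the internal representation with separator 0 for the same element and 1 for different elements. The sample sequences and the bitmap are invented for illustration, each candidate is counted once per sequence, and pairs that repeat the same item in different elements are kept. Whether the patent's procedure does the latter is an assumption.

```python
from collections import Counter
from itertools import combinations


def passes_b1(item, b1):
    """An item survives the filter if its bit (item mod N) is set in b1."""
    return b1[item % len(b1)] == 1


def two_item_candidates(sequence, b1):
    """C(2): candidates (item1, item2, separator) for one sequence."""
    elements = [sorted({x for x in element if passes_b1(x, b1)})
                for element in sequence]
    candidates = set()
    for idx, element in enumerate(elements):
        # Items on the same receipt have no order: one candidate, separator 0.
        for x, y in combinations(element, 2):
            candidates.add((x, y, 0))
        # Items on different receipts keep their order: separator 1.
        for later in elements[idx + 1:]:
            for x in element:
                for y in later:
                    candidates.add((x, y, 1))
    return candidates


b1 = [1, 0, 0, 1, 0, 0, 0, 0]  # hypothetical: bits for items 8 (8 mod 8 = 0) and 3
sequences = [[[3], [8], [3, 8]], [[1, 3], [2], [8]]]
counts = Counter()
for sequence in sequences:
    counts.update(two_item_candidates(sequence, b1))
print(counts[(3, 8, 1)], counts[(8, 3, 1)], counts[(3, 8, 0)])  # 2 1 1
```

Here <(3)(8)>, <(8)(3)> and <(3,8)> are three distinct keys, and items 1 and 2 never reach the counting step because their bits in b1 are 0.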
[0080]
FIGS. 67 to 77 are explanatory diagrams of a process of creating a large sequence L(2) from those combination candidates of two items created in FIGS. 62 to 66 that satisfy the support value, and of generating bitmaps b1 and b2 by using the bitmap generation unit B(2). In FIG. 67, since the combination candidate of two items input to the combination selection unit F satisfies the support value, it becomes an element of the large sequence L(2), and the bits of the bitmaps b1 and b2 are set using the hash functions H1 and H2.
[0081]
Since the combination candidates input in FIGS. 68 to 71 and FIGS. 75 to 77 do not satisfy the support value, registration to the large sequence and bit setting of the bitmap are not performed.
[0082]
On the other hand, since the combination candidates input to the combination selection unit F in FIGS. 72 to 74 satisfy the support value, registration in the large sequence L(2) and bit setting of the bitmaps are performed. FIG. 77 shows the final result of this processing.
[0083]
FIGS. 78 to 82 are explanatory diagrams of the counting process and the like by G(3) for the combination candidates of three items. In FIG. 78, the first sequence is input to the combination generation unit C(3), but all the combination candidates of three items generated from this sequence are filtered out by the bitmap b2, so nothing is input to G(3).
[0084]
In FIG. 79 and FIG. 81, only one combination candidate of three items generated by the combination generation unit C(3) is counted by G(3). In FIGS. 80 and 82, there is no combination candidate input to G(3), and the final result of this processing is shown in FIG.
[0085]
FIG. 83 is an explanatory diagram of a process in which the candidates satisfying the support value among the three-item combination candidates created in FIGS. 78 to 82 are registered in the large sequence L(3) and, at the same time, bits are set in the bitmaps b1 and b3. There is only one type of three-item combination candidate. The combination selection unit F makes it an element of the large sequence L(3), and the bits of the bitmaps b1 and b3 are set using the items in the combination candidate. Since the large sequence L(3) has only one element, the process of counting combinations ends here.
[0086]
FIG. 84 illustrates the flow of the process of counting the number of item combinations for the basic correlation analysis in the present invention, using the same transaction list as in FIGS. 44 and 46 of the related art. A basic feature of the present invention lies in the point that, for example, using the bitmap b1 created from the large item set L1, the combination generation unit C(2) generates from the transaction list only those combination candidates of two items that are not filtered out, and gives them to the counting unit G(2).
[0087]
The process of generating this bitmap is simple, and the capacity of the bitmap can always be set to a size that fits the amount of available memory. In contrast, the capacity of the hash tree in FIG. 46 may exceed the available memory capacity.
[0088]
(6): Group-by processing executed in the combination appearance frequency counting unit G(i)
The description of the overall processing of the data combination enumeration method of the present invention has been completed above, and the group-by processing executed in the combination appearance frequency counting unit G(i) is described next.
[0089]
The group-by processing method of the present invention is used to count the item combination candidates, and the processing is described here using a specific example. When counting the number of appearances of single items, the above-mentioned group-by processing method can be used as it is by treating the item number as the record value. Counting of combinations of two items is described here as the specific example.
[0090]
As the group-by processing method, a case is described in which a hashed list and an auxiliary information list are created by hash processing in the same manner as in the first embodiment, and records having the same key value are then counted as the group-by function processing.
[0091]
The input consists of the following 16 records. Each record has two item numbers, and the key used in hash processing and sort processing is determined by the two item numbers.
[0092]
[12] [14] [24] [26] [14] [15] [16] [45] [46] [56] [12] [14] [15] [24] [25] [45]
Various key comparison methods are possible in such a case. Here, dictionary (lexicographic) order is used as an example. That is, when two records are compared, the first item numbers are compared, and if they differ, their ordering is taken as the ordering of the record keys. If the first item numbers are equal, the second item numbers are compared, and if they differ, their ordering is taken as the ordering of the keys. If the second item numbers are also equal, the comparison continues with the third item number, and so on. If all item numbers up to the last are equal, the key values of the two records are regarded as equal.
[0093]
Therefore, for example, [12] and [14] have the same first item number, but comparing the second item numbers shows that [14] has a larger key value than [12].
[0094]
An appropriate hash function can be selected and used. Here, as an example of the hash function, the remainder obtained by dividing the sum of the item numbers by 3 is taken, as shown in the following equation. That is, the hash value of [12] is 0 because (1 + 2) mod 3 = 0, and the hash value of [14] is 2 because (1 + 4) mod 3 = 2.
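The key comparison and the example hash function can be tried out with a few lines of Python. Records are represented as tuples of item numbers, whose built-in comparison is lexicographic and therefore matches the dictionary order described above. The name `hash_value` is hypothetical.

```python
def hash_value(record, table_size=3):
    """Example hash: remainder of the sum of the item numbers divided by 3."""
    return sum(record) % table_size


r12, r14 = (1, 2), (1, 4)

# Tuples compare item by item: equal first items, then 2 < 4.
print(r12 < r14)        # True
print(hash_value(r12))  # (1 + 2) mod 3 = 0
print(hash_value(r14))  # (1 + 4) mod 3 = 2
```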
[0095]
(Equation 1)
hash value = (item 1 number + item 2 number) mod 3
[0096]
As in the first embodiment of the group-by processing method described above, a hashed list is first generated from the 16 input records. Here, the size of the hash table 15 in FIG. 92 is 3, the record buffer 14 has a storage area for four records, and the hashed list output buffer 18 can store three records. FIG. 85 is an explanatory diagram of the progress of the hashed list generation process. This figure shows the progress of record input from the input buffer 13 to the record buffer 14 and of record output from the record buffer 14 to the hashed list output buffer 18 in FIG. As in FIG. 93, when a record is input and a record having the same hash value is already pointed to from the area for that hash value in the hash table, the newly input record is pointed to from the hash table, and the records already entered are managed by the link management table.
[0097]
FIG. 86 shows the hashed list obtained as the result of this processing. Comparison with FIG. 85 shows that two runs are obtained: one with block numbers 0 to 2 and one with block numbers 3 to 5.
[0098]
Next, if the form of [block number, hash value of the first record, hash value of the last record] is used as the auxiliary information list, the following auxiliary information list is obtained. The format of this auxiliary list does not change depending on the value of i.
[0099]
[0,0,0] [1,1,2] [2,2,2] [3,0,0] [4,0,1] [5,1,1]
By sorting the auxiliary information list, for example, an auxiliary information list sorted as follows is obtained. It is assumed that this sorting method does not change depending on the value of i.
[0100]
[0,0,0] [3,0,0] [4,0,1] [5,1,1] [1,1,2] [2,2,2]
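The sorted list above can be reproduced with an ordinary sort. The text gives the sorted result only as an example and does not state the sort key, so the key used here (hash value of the first record, then hash value of the last record, then block number) is an assumption that happens to yield the same order.

```python
# Auxiliary information list: (block number, first hash value, last hash value).
aux = [(0, 0, 0), (1, 1, 2), (2, 2, 2), (3, 0, 0), (4, 0, 1), (5, 1, 1)]

# Assumed key: first-record hash, then last-record hash, then block number.
sorted_aux = sorted(aux, key=lambda e: (e[1], e[2], e[0]))
print(sorted_aux)
# [(0, 0, 0), (3, 0, 0), (4, 0, 1), (5, 1, 1), (1, 1, 2), (2, 2, 2)]
```

Reading blocks in this order brings in all blocks that can contain hash value 0 before any block that starts at hash value 1, which lets the merge step emit records in hash order.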
The group-by function processing of FIG. 94 is performed based on the sorted auxiliary information list. First, the processing by the minimum hash value record retrieval device 33 in FIG. 95 is shown in FIG. Here, there are two input buffers for the hashed list, one for each run, and the record with the minimum hash value is extracted one at a time from the records read into the two buffers in the order of the sorted auxiliary information list 30. The final result is as follows.
[0101]
[24] [12] [15] [45] [45] [15] [24] [12] [25] [16] [46] [14] [26] [56] [14] [14]
The processing result of the minimum hash value record retrieval device 33 is divided as follows for each hash value.
[0102]
Hash value 0: [24] [12] [15] [45] [45] [15] [24] [12]
Hash value 1: [25] [16] [46]
Hash value 2: [14] [26] [56] [14] [14]
Such records having the same hash value are sorted by the sorting device 34 shown in FIG. 95, and the result is sent to the group-by function operation processing device 35, where counting is performed. In this sorting, the lexicographic order described above is used. In FIG. 94, the records of the hash value 0 are sorted when the record of the hash value 0 is no longer extracted, and the records of the hash value 1 are sorted when the record of the hash value 1 is no longer extracted.
[0103]
Such an operation is repeated to the end, and the output record is represented in the form of [key value, number]. The key value here consists of two item numbers.
[0104]
First, the records with hash value 0 are sorted, and
[12] [12] [15] [15] [24] [24] [45] [45]
is obtained. The sort result is then sent to the group-by function operation processing device 35, and
[12,2] [15,2] [24,2] [45,2]
is obtained. Next, the records with hash value 1 are sorted, and
[16] [25] [46]
is obtained. The sort result is then sent to the group-by function operation processing device 35, and
[16,1] [25,1] [46,1]
is obtained. Finally, the records with hash value 2 are sorted, and
[14] [14] [14] [26] [56]
is obtained. The sort result is then sent to the group-by function operation processing device 35, and
[14,3] [26,1] [56,1]
is obtained.
[0105]
Overall, the following result is obtained.
[12,2] [15,2] [24,2] [45,2] [16,1] [25,1] [46,1] [14,3] [26,1] [56,1]
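The counting result can be reproduced with a simplified in-memory sketch: partition the 16 records by hash value, sort each partition in dictionary order, and count runs of equal keys. This deliberately omits the record buffer, the blocks of the hashed list, and the auxiliary information list, which exist to keep secondary-storage access sequential. The function name `group_by_count` is hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def group_by_count(records, table_size=3):
    """Count equal records, emitting results partition by partition."""
    partitions = defaultdict(list)
    for record in records:
        partitions[sum(record) % table_size].append(record)

    result = []
    for hash_value in sorted(partitions):
        # Sorting within one partition puts equal keys next to each other.
        for key, run in groupby(sorted(partitions[hash_value])):
            result.append((key, sum(1 for _ in run)))
    return result


records = [(1, 2), (1, 4), (2, 4), (2, 6), (1, 4), (1, 5), (1, 6), (4, 5),
           (4, 6), (5, 6), (1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5)]
for key, number in group_by_count(records):
    print(key, number)
```

The output order is [12,2] [15,2] [24,2] [45,2] [16,1] [25,1] [46,1] [14,3] [26,1] [56,1], the same as the overall result in the text.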
[0106]
As described in detail above, in the group-by processing method of the present invention, in order to make the access to the secondary storage device as continuous as possible, data is sequentially read and written in relatively large block units. As a result, the speed of the hash processing is increased, and the group-by processing using the result of the hash processing is executed, whereby the overall speed can be increased.
[0107]
Next, in the data combination enumeration method of the present invention, a bitmap is used to prune unnecessary item combination candidates. This reduces the amount of memory needed for pruning, so that pruning is possible even when little memory is available. In addition, by performing the group-by processing for counting item sets efficiently and at high speed, the combinations can be counted efficiently. Applying the data combination enumeration method of the present invention to the association rule generation processing as a database mining technique greatly contributes to the efficiency of data mining.
[0108]
[Patent Document 1]
JP-A-11-3342 (claims 47 to 50, paragraph numbers [0001] to [0004], [0020] to [0037], [0170] to [0209], [0239] to [0279], FIGS. 3, 8, 9, 18, 19, 46, FIGS. 47 to 79, FIGS. 124 to 171)
[Patent Document 2]
Japanese Patent Application No. 2001-340817 (pages 2 to 19, FIGS. 1 to 34)
[0109]
The above-mentioned prior art has the following problems.
[0110]
In general, in conventional extraction of a time-series correlation rule, processing is performed using raw transaction data or using transaction data limited to a certain portion such as a period or a store. In such a method, when the number of transactions increases, the number of combinations increases, and a large amount of time is required for processing.
[0111]
An object of the present invention is to solve such a conventional problem and to prevent the processing from taking a lot of time even if the number of transactions as described above increases.
[0112]
The present invention has the following configuration to achieve the above object.
[0113]
(1): A time-series correlation extraction device that extracts time-series correlation rules by obtaining, from sequence data of items included in temporally continuous transactions, single items or combinations of two or more items (i = 2 or more) that satisfy a given condition, in a form that maintains the order in the sequence, together with their numbers of appearances. The device comprises a time-series filter unit that filters input data and leaves necessary data, and a time-series correlation engine unit that extracts time-series correlation rules from the input data. The time-series filter unit comprises: specifying means for specifying, when a combination of records is searched for in a set of records having a plurality of attributes, a search pattern using a plurality of events, each defining that a predetermined attribute in a record takes a specific value, and an order relationship among the plurality of events defined based on the order of the attribute values; searching means for searching the set of records for a combination of records corresponding to the specified search pattern; and output means for outputting search results.
[0114]
(2): In the time-series correlation extraction device of (1), the time-series filter unit is arranged before the time-series correlation engine unit, and the device has a function of extracting a time-series correlation rule in this arrangement.
[0115]
(3): In the time-series correlation extraction device of (1), the time-series filter unit is arranged after the time-series correlation engine unit, and the device has a function of extracting a time-series correlation rule in this arrangement.
[0116]
(4): In the time-series correlation extraction device of (1), the time-series filter unit is arranged both before and after the time-series correlation engine unit, and the device has a function of extracting a time-series correlation rule in this arrangement.
[0117]
(5): In the time-series correlation extraction device of any one of (1) to (4), the time-series filter unit is arranged outside the time-series correlation engine unit, and the device has a function of extracting a time-series correlation rule in this arrangement.
[0118]
(Action)
The operation of the present invention based on the above configuration will be described with reference to FIG.
[0119]
(A): In the time-series correlation extraction device of (1), from sequence data of items included in temporally continuous transactions, single items or combinations of two or more items (i = 2 or more) that satisfy a given condition are obtained, together with their numbers of appearances, in a form that maintains the order in the sequence, and time-series correlation rules are thereby extracted.
[0120]
In this case, the time-series filter unit 4 performs a process of filtering the input data to leave necessary data, and the time-series correlation engine unit 5 performs a process of generating and extracting a time-series correlation rule from the input data. In particular, the time series filter unit 4 performs the following processing.
[0121]
That is, the time-series filter unit 4 includes the designation unit, the search unit, and the output unit, and searches for a combination of records from a set of records having a plurality of attributes. At this time, the specifying means uses a plurality of events each defining that a predetermined attribute in the record takes a specific value, and an order relationship between those events defined based on the order of the attributes, Specify a search pattern (event pattern).
[0122]
The search means searches a set of records for a combination of records corresponding to a specified search pattern, and the output means outputs a search result. An event is defined as a state in which a predetermined attribute in a record takes a specific value, and the order relation between a plurality of events is defined based on the order relation between the values of one or more attributes among the records corresponding to those events.
[0123]
The user specifies a search pattern determined by these events and the order relation between the events by using a specifying unit. The specifying means passes the search pattern to the search means, and the search means interprets the received search pattern and extracts a combination of records corresponding to the search pattern. Then, the output unit outputs information such as the extracted records as a search result.
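A minimal sketch of this specify, search, and output flow is given below. It is not the implementation of the time-series filter unit 4. Records are plain dictionaries, an event is an (attribute, value) pair, a search pattern is a list of events in the required order, and the order is taken on a single attribute. The function name `search`, the attribute names `time` and `item`, and the sample records are all hypothetical. Only strictly increasing order is handled, not events in the same order position or interval constraints.

```python
def search(records, pattern, order_attr):
    """Return every combination of records that matches the events in order.

    pattern    -- list of (attribute, value) events, earliest first
    order_attr -- attribute whose values define the order between events
    """
    ordered = sorted(records, key=lambda r: r[order_attr])
    results = []

    def extend(start, depth, chosen):
        if depth == len(pattern):
            results.append(tuple(chosen))
            return
        attr, value = pattern[depth]
        for i in range(start, len(ordered)):
            rec = ordered[i]
            if rec.get(attr) != value:
                continue  # record does not correspond to this event
            if chosen and rec[order_attr] <= chosen[-1][order_attr]:
                continue  # must come strictly after the previous event
            extend(i + 1, depth + 1, chosen + [rec])

    extend(0, 0, [])
    return results


records = [
    {"time": 1, "item": 3},
    {"time": 2, "item": 8},
    {"time": 3, "item": 6},
    {"time": 4, "item": 8},
]
for combo in search(records, [("item", 8), ("item", 6)], "time"):
    print([r["time"] for r in combo])  # [2, 3]
```

The pattern "item 8, then item 6" matches only the records at times 2 and 3, because the second purchase of item 8 at time 4 is not followed by item 6.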
[0124]
According to such an apparatus, various search patterns can be easily specified, including a case where two or more events exist in the same order, and a case where the order relation of a plurality of events is described at an arbitrary interval. This makes it possible to perform general-purpose time-series filtering processing in consideration of the order. In this way, even if the number of transactions as described above increases, it does not take much time for processing, and it is possible to extract time-series correlation rules at high speed.
[0125]
(B): The above (2) is an example in which the time-series filter unit 4 is arranged before the time-series correlation engine unit 5 and the time-series filter unit 4 and the time-series correlation engine unit 5 are configured by one program. In this case, the input data is filtered by the time-series filter unit 4 to reduce the number of combinations and reduce the processing time.
[0126]
Next, the time-series correlation engine unit 5 generates, extracts, and outputs a time-series correlation rule. In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
[0127]
(C): The above (3) is an example in which the time-series correlation engine unit 5 is arranged before the time-series filter unit 4 and the time-series correlation engine unit 5 and the time-series filter unit 4 are configured as one program. In this case, the time-series correlation engine unit 5 generates and extracts the time-series correlation rule from the input data, and then the time-series filter unit 4 performs the filtering process to extract and output the time-series correlation rule. In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
[0128]
(D): The above (4) is a configuration example in which the time-series filter units 4 are arranged both before and after the time-series correlation engine unit 5, and the two time-series filter units 4 and the one time-series correlation engine unit 5 are implemented by one program. In this case, the input data is subjected to a filtering process in the time-series filter unit 4, then the time-series correlation engine unit 5 generates and extracts a time-series correlation rule, and the other time-series filter unit 4 performs further filtering and outputs the correlation rules.
[0129]
In this example, the time series filter unit 4 at the preceding stage has a function of reducing the number of combinations, and the time series filter unit 4 at the subsequent stage has a function of eliminating useless ones. In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
[0130]
(E): The above (5) is an example in which the time-series filter unit 4 and the time-series correlation engine unit 5 are each configured by one separate program unit. That is, the time-series filter unit 4 is configured by one program, and the time-series correlation engine unit 5 is configured by another program.
[0131]
In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed. Further, since the time-series filter unit 4 and the time-series correlation engine unit 5 are configured by different programs, parallel processing is possible, and the time-series filter unit 4 can be replaced.
[0132]
Embodiments of the present invention will be described below in detail with reference to the drawings.
[0133]
§1: Explanation of time series correlation extraction device
(1): Description of the configuration of the time-series correlation extraction device
FIG. 2 is an explanatory diagram of the time-series correlation extraction device. Hereinafter, the configuration of the time-series correlation extraction device will be described with reference to FIG.
[0134]
As shown in FIG. 2, an input device 2 and an output device 3 are connected to the time-series correlation extraction device 1. The time-series correlation extraction device 1 includes a time-series filter unit 4, a time-series correlation engine unit 5, and the like.
[0135]
The input device 2 is for inputting input data for which a correlation rule is to be obtained and other data. The time-series filter unit 4 filters input data and leaves necessary data. The time-series correlation engine unit 5 generates and extracts a time-series correlation rule from input data. The output device 3 outputs the extracted association rule.
[0136]
(2): Explanation of the program of the time-series correlation extraction device
FIG. 3 is an explanatory diagram (part 1) of the programs of the time-series correlation extraction device, in which A shows Example 1, B shows Example 2, C shows Example 3, and D shows Example 4. FIG. 4 is an explanatory diagram (part 2) of the programs of the time-series correlation extraction device, in which E shows Example 5, F shows Example 6, G shows Example 7, and H shows Example 8.
[0137]
The time-series filter unit 4 and the time-series correlation engine unit 5 are each implemented by a program, and the relationship between these programs is as follows. In FIGS. 3 and 4, a P unit indicates that the enclosed units are configured as one program unit (one program). Likewise, P1 and P2 each indicate one program unit (one program).
[0138]
① Example 1 (see FIG. 3A) is an example in which the time-series filter unit 4 is placed before the time-series correlation engine unit 5, and the time-series filter unit 4 and the time-series correlation engine unit 5 are configured as a P unit (one program). In this case, the input data is subjected to filtering processing by the time-series filter unit 4, and then the time-series correlation engine unit 5 extracts and outputs the time-series correlation rule.
[0139]
In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
[0140]
② Example 2 (see FIG. 3B) is an example in which, contrary to Example 1, the time-series correlation engine unit 5 is placed before the time-series filter unit 4, and the time-series correlation engine unit 5 and the time-series filter unit 4 are configured as a P unit (one program). In this case, the time-series correlation engine unit 5 generates and extracts a time-series correlation rule from the input data, and then the time-series filter unit 4 performs a filtering process to extract and output the time-series correlation rule.
[0141]
In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
[0142]
③ Example 3 (see FIG. 3C) is an example in which the time-series filter units 4 are provided both before and after the time-series correlation engine unit 5, and the two time-series filter units 4 and the one time-series correlation engine unit 5 are configured as a P unit (one program). In this case, the input data is subjected to a filtering process in the time-series filter unit 4, then the time-series correlation engine unit 5 generates and extracts a time-series correlation rule, and the other time-series filter unit 4 performs further filtering and outputs the correlation rules.
[0143]
In this example, the time series filter unit 4 at the preceding stage has a function of reducing the number of combinations, and the time series filter unit 4 at the subsequent stage has a function of eliminating useless ones. In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
[0144]
④ Example 4 (see FIG. 3D) is an example in which the time-series filter unit 4 and the time-series correlation engine unit 5 are each configured by a separate program unit. That is, the time-series filter unit 4 is configured by a P1 unit (one program), and the time-series correlation engine unit 5 is configured by another P2 unit (one program).
[0145]
In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed. Further, since the time-series filter unit 4 and the time-series correlation engine unit 5 are configured by different programs, parallel processing is possible, and the time-series filter unit 4 can be replaced.
[0146]
⑤ Example 5 (see FIG. 4E) is an example in which the time-series filter unit 4 and the time-series correlation engine unit 5 are each configured by one separate program unit. The arrangement of the time-series filter unit 4 is opposite to that of Example 4. That is, the time-series correlation engine unit 5 is configured by a P1 unit (one program) at the preceding stage, and the time-series filter unit 4 is configured by another P2 unit (one program) at the subsequent stage.
[0147]
⑥ Example 6 (see FIG. 4F) is an example in which the time-series filter unit 4 is configured by a P1 unit (one program) at the preceding stage, and the time-series correlation engine unit 5 and the time-series filter unit 4 are configured by a P2 unit (one program) at the subsequent stage. In this case, the P1 unit and the P2 unit are different programs.
[0148]
⑦ Example 7 (see FIG. 4G) is an example in which the time-series filter unit 4 and the time-series correlation engine unit 5 are configured by a P1 unit (one program) at the preceding stage, and the time-series filter unit 4 is configured by a P2 unit (one program) at the subsequent stage. In this case, the P1 unit and the P2 unit are different programs.
[0149]
⑧ Example 8 (see FIG. 4H) is an example in which the time-series filter units 4 are provided both before and after the time-series correlation engine unit 5, and the two time-series filter units 4 and the one time-series correlation engine unit 5 of Example 3 are each configured by a separate program.
[0150]
That is, the time-series filter unit 4 is configured by the P1 unit (one program) at the first stage, the time-series correlation engine unit 5 is configured by the P2 unit (one program) at the next stage, and the time-series filter unit 4 is configured by a P3 unit (one program) at the final stage.
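The arrangements of Examples 1 to 3 can be pictured as different orderings of the same stages. The sketch below is a toy illustration under strong assumptions: `pre_filter`, `engine`, and `post_filter` are hypothetical stand-ins (the engine merely counts ordered item pairs once per sequence), the input is a list of flat item sequences, and the target event is fixed to item 6. It does not model the one-program versus separate-program distinction of Examples 4 to 8.

```python
from itertools import combinations


def compose(*stages):
    """Chain stages so that the output of one is the input of the next."""
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run


def pre_filter(sequences):
    """Toy front filter: keep sequences where item 6 occurs second or later."""
    return [s for s in sequences if 6 in s[1:]]


def engine(sequences):
    """Toy engine: count ordered pairs (a before b), once per sequence."""
    counts = {}
    for s in sequences:
        for pair in set(combinations(s, 2)):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def post_filter(counts):
    """Toy rear filter: keep only pairs that end in item 6."""
    return {pair: n for pair, n in counts.items() if pair[1] == 6}


example1 = compose(pre_filter, engine)               # filter, then engine
example2 = compose(engine, post_filter)              # engine, then filter
example3 = compose(pre_filter, engine, post_filter)  # filter on both sides

data = [[3, 8, 6], [3, 8], [6, 3], [8, 6]]
print(example1(data))
print(example2(data))
print(example3(data))
```

In this toy data, Examples 2 and 3 give the same pairs ending in 6, but in Example 3 the engine only sees two of the four sequences, which is the reduction in combinations that the front filter is meant to provide.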
[0151]
(3): Explanation of input data format
FIG. 5 is a diagram showing the input data format: FIG. 5A illustrates input data expressed as text, and FIG. 5B illustrates the format. FIG. 6 is an explanatory diagram of the processing: FIG. 6A shows the combinations (length 2), FIG. 6B shows the combinations after CID = 1 and 4 are deleted, and FIG. 6C shows an input example.
[0152]
In FIG. 5, "CID" represents an identification number such as a customer ID. "Number" represents the number of items, counting the delimiters "|". "Item code #" represents the identification number of an item. The input is a series of time-series data for each CID. The item codes are arranged from left to right in the figure, from the oldest to the newest. Items between delimiters "|" (items separated by ",") have no order relation among themselves.
[0153]
Here, it is assumed that the user wants to find correlation rules (expressed as "X → 6") in which the event of item "6" occurs last. The conventional correlation engine generates combinations as shown in FIG. 6A from the input data shown in FIG. 5A. In the case of length 2, there are 29 unique combinations, as shown in FIG. 6A.
[0154]
Here, the input data is filtered by the time-series filter unit 4, and the data of the CIDs (CID = 1, 4) in which the event "6" does not occur at the end is deleted. In this case, the number of combinations is reduced to four, as shown in FIG. 6B.
[0155]
Generally, in the extraction processing of time-series correlation rules, most of the time is spent generating combinations. The number of combinations increases explosively as the data period (length) increases. For this reason, deleting unnecessary data in advance is an extremely effective means of reducing the processing time. However, when it is desired to find correlation rules in which the event "6" occurs at the end, simply deleting the data of CIDs whose input data does not include 6 is not sufficient. It is necessary to use a time-series filter unit 4 capable of deleting all data except data that has two or more events and in which the event "6" occurs at the second position or later. For an input such as that shown in FIG. 6C, a simple filter that deletes data not including the event "6" has no effect.
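The effect of this kind of pre-filtering can be illustrated with a minimal Python sketch. The data below is hypothetical (it is not the data of FIG. 5A), the function names `keep` and `length2_patterns` are illustrative, and "two or more events" is read here as "two or more time steps", which is an assumption.

```python
from itertools import combinations

# Hypothetical input (NOT the data of FIG. 5A): CID -> time-ordered list of
# item sets; items inside one set share a time step and are unordered.
sequences = {
    1: [{1, 2}, {3}],
    2: [{1}, {6}],
    3: [{6}, {2}],          # contains 6, but only at the first time step
    4: [{6}],               # contains 6, but has a single time step
    5: [{2, 3}, {4}, {6}],
}

def keep(seq, target=6):
    """True if seq has >= 2 time steps and `target` occurs at step 2 or later."""
    return len(seq) >= 2 and any(target in step for step in seq[1:])

def length2_patterns(seqs):
    """Unique ordered pairs (x, y) where x occurs in an earlier step than y."""
    found = set()
    for seq in seqs.values():
        for earlier, later in combinations(seq, 2):
            found.update((x, y) for x in earlier for y in later)
    return found

filtered = {cid: s for cid, s in sequences.items() if keep(s)}
before = len(length2_patterns(sequences))
after = len(length2_patterns(filtered))
```

In this sketch a filter that only tests whether 6 is present would keep CID 3 and CID 4, whereas the order-aware condition removes them, so fewer length-2 combinations have to be generated.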
[0156]
(4): Explanation of processing by flowchart
(1): Explanation of Example 1
FIG. 7 is a processing flowchart of Example 1. Hereinafter, the processing of Example 1 will be described with reference to FIG. 7. S11 to S15 indicate the processing steps.
[0157]
Example 1 is the processing of Example 1 shown in FIG. 3A, where S11 to S14 are the processing of the time-series filter unit 4, and S15 is the processing of the time-series correlation engine unit 5. In this processing, the time-series filter unit 4 first checks whether or not there is data (S11). If there is no data, the processing is terminated. If there is data, the data is input (S12), and the processing of the time-series filter unit 4 is performed (S13).
[0158]
Then, the time-series filter unit 4 checks whether or not the filter condition is satisfied (S14). If the filter condition is not satisfied, the processing returns to S11. If the filter condition is satisfied, the processing proceeds to the processing by the time-series correlation engine unit 5 (S15). Thus, the processing of the time-series filter unit 4 and the time-series correlation engine unit 5 in the P unit is completed.
[0159]
(2): Explanation of Example 2
FIG. 8 is a processing flowchart of Example 2. Hereinafter, the processing of Example 2 will be described with reference to FIG. 8. S21 to S26 indicate the processing steps.
[0160]
Example 2 is the process of Example 2 shown in FIG. 3B, in which S21 to S23 are processes of the time-series correlation engine unit 5, and S24 to S26 are processes of the time-series filter unit 4. In this process, the time-series correlation engine unit 5 checks whether or not there is data (S21). If there is no data, the process is terminated as it is, but if there is data, the data is input (S22). Then, the processing by the time-series correlation engine unit 5 is performed (S23).
[0161]
Next, the time series filter unit 4 performs its processing (S24) and checks whether or not the filter condition is satisfied (S25). If the filter condition is not satisfied, the processing returns to S21. If the filter condition is satisfied, the data is output (S26).
[0162]
(3): Explanation of Example 3
FIG. 9 is a processing flowchart of Example 3. Hereinafter, the processing of Example 3 will be described with reference to FIG. 9. S31 to S39 indicate the processing steps.
[0163]
Example 3 is the processing of Example 3 shown in FIG. 3C, in which S31 to S34 are the processing of the time-series filter unit 4, S35 is the processing of the time-series correlation engine unit 5, and S36 to S39 are the processing of the time-series filter unit 4.
[0164]
In this processing, the time series filter unit 4 first checks whether or not there is data (S31). If there is no data, the processing is terminated. If there is data, the data is input (S32), and the processing of the time-series filter unit 4 is performed (S33). Then, the time-series filter unit 4 checks whether or not the filter condition is satisfied (S34). If the filter condition is not satisfied, the processing returns to S31.
[0165]
If the filter condition is satisfied, the processing by the time-series correlation engine unit 5 is performed (S35). Thereafter, the time-series filter unit 4 checks again whether or not there is data (S36). If there is no data, the processing is terminated. If there is data, the processing of the time-series filter unit 4 is performed (S37). Then, the time-series filter unit 4 checks whether or not the filter condition is satisfied (S38). If the filter condition is not satisfied, the processing returns to S36. If the filter condition is satisfied, the data is output (S39), and the processing returns to S31.
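The three arrangements of Examples 1 to 3 can be sketched as compositions of two functions. This is only an illustration under stated assumptions: `correlation_engine` is a toy stand-in that counts ordered item pairs (it is not the engine of the specification), and the two filter conditions and the data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def time_series_filter(items, condition):
    """Keep only the items (input sequences or output rules) meeting `condition`."""
    return [x for x in items if condition(x)]

def correlation_engine(sequences):
    """Toy stand-in for the engine: count ordered item pairs, once per sequence."""
    support = Counter()
    for seq in sequences:
        support.update(set(combinations(seq, 2)))
    return sorted(support.items())

def ends_with_6(seq):
    """Hypothetical input-side condition: at least 2 events, the last one is 6."""
    return len(seq) >= 2 and seq[-1] == 6

def rule_targets_6(rule):
    """Hypothetical output-side condition: the rule has the form X -> 6."""
    (_, rhs), _count = rule
    return rhs == 6

data = [[1, 2, 6], [2, 6], [6, 1], [3, 4]]   # hypothetical sequences

# Example 1: filter -> engine
example1 = correlation_engine(time_series_filter(data, ends_with_6))
# Example 2: engine -> filter
example2 = time_series_filter(correlation_engine(data), rule_targets_6)
# Example 3: filter -> engine -> filter
example3 = time_series_filter(example1, rule_targets_6)
```

With these toy conditions, Examples 2 and 3 end with the same rules, but in Example 3 the engine only processes the sequences that pass the first filter, which is where the saving in combination generation comes from.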
[0166]
(5): Description of a specific device example
FIG. 10 is a configuration diagram of a specific device. The time-series correlation extraction device can be realized by any computer, such as a workstation or a personal computer. This device includes a computer main body 10 and, connected to the computer main body 10, a display device 11, an input device (keyboard, mouse, or the like) 12, a removable disk drive (referred to as "RDD") 13, a hard disk device (referred to as "HDD") 14, and the like.
[0167]
The computer main body 10 includes a CPU 15 that performs various internal controls and processes, a ROM 16 (non-volatile memory) for storing programs and various data, a memory 17, an interface control unit (referred to as "I/F control unit") 18, a communication control unit 19, and the like. Note that the RDD 13 includes a flexible disk drive, an optical disk drive, and the like.
[0168]
In the device having the above configuration, for example, a program for realizing the processing of the time-series correlation extraction device is stored on a magnetic disk (recording medium) of the HDD 14, and the CPU 15 reads out and executes the program, whereby the processing of the time-series correlation extraction device is performed.
[0169]
However, the present invention is not limited to such an example. For example, a program may be stored on the magnetic disk of the HDD 14 in either of the following ways and then executed by the CPU 15 to perform the processing.
[0170]
(1): A program (program data created by another device) stored on a removable disk created by another device is read by the RDD 13 and stored on the recording medium of the HDD 14.
[0171]
(2): Data such as a program transmitted from another device over a communication line is received via the communication control unit 19 and stored on the recording medium (magnetic disk) of the HDD 14.
[0172]
§2: Explanation of the time series filter unit
(1): Overview
The time-series filter unit 4 (see FIGS. 1 to 4) is described in the specification and drawings of Patent Document 2, and is described in detail below. This time-series filter unit corresponds to the invention entitled "Searching apparatus and method using a pattern in consideration of order", and is described below as the "time-series filter unit" of the present invention.
[0173]
A: Techniques that have conventionally been used to handle order-based data include relational databases and pattern matching using regular expressions over the order of appearance of characters in strings. For time-series data such as stock prices, dedicated applications have been used. The following describes the features and fields of application of these conventional technologies.
[0174]
B: Relational database
Relational databases are widely used to store large amounts of data. As specified in the 1970 paper by its inventor, E. F. Codd, the relational database has no concept of order within a data set. However, data types such as a date type and a time type are provided, together with a function for sorting records according to their values, and these are used to store ordered data.
[0175]
In this database, dates can be handled as a data type. However, since there is no concept of order, it is not possible to process in SQL (Structured Query Language) every query that searches for an order-aware data pattern, such as a time series.
[0176]
Therefore, processing is performed by combining a database with an external program, and a program has to be written each time an order-aware pattern is to be extracted.
[0177]
C: Pattern matching by regular expression
In the field of character string search, pattern matching using regular expressions, which focuses on the order in which characters appear, has been used. First, the difference between character string search and pattern matching is clarified. A character string search is a search in which the pattern to be found is completely determined, such as "search for the pattern abc in a sentence". Pattern matching, on the other hand, is an operation of searching for a pattern that is not completely determined. In pattern matching, a regular expression is used to specify the pattern.
[0178]
FIG. 11 shows an example of pattern matching between character string data and the regular expression a(a|b)*a. Here, (a|b)* means that a or b appears repeatedly zero or more times. Character string search and pattern matching seem similar, but they belong to different categories and require different algorithms.
[0179]
To realize pattern matching by a regular expression, a finite automaton is used. Converting a regular expression into an automaton takes a two-step approach. First, the regular expression is converted into a non-deterministic finite automaton (NFA). Conversion from a regular expression to an NFA is easy. Although it is possible to perform pattern matching with the NFA alone, a method that converts the obtained NFA into an equivalent deterministic finite automaton (DFA) and performs pattern matching using the DFA is often used.
[0180]
In the DFA, as the word deterministic indicates, once the input is given in a certain state, exactly one transition destination is determined. In the NFA, as the word non-deterministic indicates, a plurality of transition destinations may exist for one input in a certain state.
[0181]
FIG. 12 shows an NFA corresponding to the regular expression a(a|b)*a. Consider a case where the character string "aaa" is given to this NFA. When the first character "a" is input, a transition is made from the initial state 0 to state 1. The second character is also a, but state 1 has two transition destinations for the character a: state 1 and state 2. The correct choice is to make a transition from state 1 to state 1 for the second character a and to state 2 for the third character a, but at the time the second character a is read, it is not known which state to proceed to.
[0182]
To solve this problem, backtracking is necessary: a transition is made to one of the states for the time being, processing proceeds, and if it fails, a transition is made to another state. However, when backtracking is used, processing time for the backtracking is required.
[0183]
Therefore, instead of using the NFA obtained from the regular expression as it is, the NFA is further converted into a DFA, and the pattern matching is then performed. In the DFA, unlike the NFA, once the state and the input are determined, there is always exactly one transition destination. Backtracking, which the NFA requires, is therefore unnecessary, and the processing can be executed at high speed.
[0184]
For example, the NFA in FIG. 12 is converted into a DFA as shown in FIG. Of course, it takes time to convert the NFA into the DFA, but when pattern matching is performed on a large amount of data, using the DFA, which needs no backtracking, makes the processing sufficiently faster as a whole.
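The conversion from NFA to DFA can be sketched with the standard subset construction. The following is a minimal Python illustration, not the procedure of the specification: the transition table encodes the NFA as described for FIG. 12, the b-transition from state 1 to itself is assumed from the regular expression, and the names `to_dfa` and `dfa_match` are illustrative.

```python
# NFA for a(a|b)*a: (state, input character) -> set of possible next states.
# State 1 has two destinations for "a", which is the source of backtracking.
NFA = {(0, "a"): {1}, (1, "a"): {1, 2}, (1, "b"): {1}}
START, ACCEPT = 0, {2}

def to_dfa(nfa, start, alphabet="ab"):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    first = frozenset({start})
    table, seen, todo = {}, {first}, [first]
    while todo:
        state = todo.pop()
        for ch in alphabet:
            target = frozenset(t for s in state for t in nfa.get((s, ch), ()))
            table[(state, ch)] = target      # exactly one destination per input
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return first, table

def dfa_match(text, first, table, accept):
    """Run the DFA over the whole text; no backtracking is needed."""
    state = first
    for ch in text:
        state = table.get((state, ch), frozenset())   # empty set = dead state
    return bool(state & accept)

first, table = to_dfa(NFA, START)
dfa_states = {state for state, _ in table}
```

Each DFA state stands for the set of NFA states that could be active at that point, so the ambiguity at the second "a" of "aaa" is resolved by tracking states 1 and 2 together.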
[0185]
A regular expression is defined recursively by three basic operations (operators): concatenation, selection (union), and repetition (closure), as shown in FIG. As in ordinary arithmetic expressions, there is an order of precedence among these operations: the repetition "*" binds most strongly, concatenation binds next most strongly, and the selection "|" binds most weakly. However, the precedence can be changed by enclosing characters and symbols in parentheses.
[0186]
POSIX (Portable Operating System Interface for UNIX (registered trademark)) 1003.2 defines two types of regular expressions: Basic Regular Expression (BRE) and Extended Regular Expression (ERE). Software utilities on UNIX that use the BRE include ed, ex, vi, more, sed, and grep. Software utilities that use the ERE include awk, and grep when the -E option is specified.
[0187]
D: Time series application
For data having an order, processing may be performed using a dedicated application, such as a stock price forecast or a sequential pattern in data mining. If a dedicated application is used, a time-series pattern search can be processed at high speed, but the dedicated application is not always available for various general-purpose inquiries.
[0188]
E: Challenge
By the way, the above-mentioned processing has the following problems.
[0189]
(1) Regular expressions over character strings, and pattern matching using them, form a general framework that provides search methods for all classes of character strings. However, ordered data differs from character string data in the characteristics described in A, B, and C below, so regular expressions and their pattern matching cannot be applied to it.
[0190]
A: In a character string, characters exist only adjacent to one another at equal intervals, one per position. In ordered data, however, a plurality of events may exist at one position. For example, this is the case when several purchases are made on the same day, and an expression such as "customer 10001 purchases two products, milk and bread, on March 21" is required. However, since two characters never appear at the same position in a character string, a regular expression cannot express events that occur simultaneously.
[0191]
B: In a character string, the value is equal to the symbol (literal). That is, when "A" is given as a character string, "A" is the value and is also the symbol A. In data having a plurality of attributes, however, a combination of conditions on a plurality of fields (in this case, product and price) must be treated as one symbol, as in "a customer who has purchased bread for 200 yen or more is called 'customer group A'". Such a combination of conditions on a plurality of fields cannot be represented by a regular expression.
[0192]
C: In the case of data having an order, the concept of an interval is necessary for the order, such as "buy cheese within two days after purchasing bread". However, a regular expression of a character string cannot specify an interval.
[0193]
FIG. 15 shows retail sales data as an example of ordered data. In this example, the date on which each customer purchased a product corresponds to the order. On March 21 (03.21), the customer 10001 purchases milk and bread at the same time, but events that appear at the same time cannot be described in a regular expression. Further, the sales data of the retail store contains no data for March 22, which is a holiday, whereas in a character string there is never a position at which no character exists. Furthermore, it is difficult to use regular expressions to write a description that takes the relationship between positions in the order into account, such as "the next visit occurs within three days of the purchase of a certain product".
[0194]
As described above, it is impossible with a conventional regular expression and automaton theory to generally designate a pattern in consideration of an order from a group of records having an order.
[0195]
In the relational database, the order relation is supported only in a limited form. Therefore, to find a specified pattern in ordered data, the database must be combined with an external program.
[0196]
However, with the method of creating a program for each query, a search cannot be performed merely by changing the specified pattern, as is possible in pattern matching by regular expressions.
[0197]
Further, it is conceivable to use a dedicated application to analyze ordered data, such as stock price prediction. If a dedicated application prepares data for a certain purpose, it returns a specific result, so if it is limited to that purpose, it is possible to perform high-speed processing. However, in a dedicated application, it is difficult to deal with a wide range of problems, such as pattern matching using a regular expression, because the application is dedicated. Therefore, an apparatus and a method for searching for an order-based data pattern using a general-purpose expression for an order relation between data will be described below.
[0198]
FIG. 16 is a principle explanatory diagram of the time-series filter unit. The search device in FIG. 16 includes a designation unit 101, a search unit 102, and an output unit 103, and searches for a combination of records from a set 104 of records having a plurality of attributes.
[0199]
The designating means 101 specifies a search pattern (event pattern) 105 by using a plurality of events, each defining that a predetermined attribute in a record takes a specific value, and an order relation between those events defined based on the order of attribute values. The search unit 102 searches the set of records 104 for a combination of records corresponding to the specified search pattern 105, and the output unit 103 outputs the search result.
[0200]
An event is defined as a state in which a predetermined attribute in a record takes a specific value, and the order relation between a plurality of events is defined based on the order relation between the values of one or more attributes of the records corresponding to those events. The user uses the designating means 101 to specify the search pattern 105 determined by these events and the order relation between them. The designating means 101 passes the search pattern 105 to the search unit 102, and the search unit 102 interprets the received search pattern 105 and extracts the combinations of records corresponding to it. Then, the output unit 103 outputs information such as the extracted records as the search result.
[0201]
According to such a search device, various search patterns can be specified easily, including a case where two or more events exist at the same position in the order and a case where the order relation of a plurality of events is described with arbitrary intervals, and a general-purpose, order-aware data search is realized.
[0202]
F: Description of search device
Hereinafter, a detailed apparatus will be described with reference to the drawings. The target data is data composed of a set of records having a plurality of fields (attributes) as shown in FIG. Each record has a fixed number of fields, and no single record has a different number of fields. Further, it is assumed that one or more fields having an order are included in the data.
[0203]
A field having an order is a field with an order relation, such as a date and time field or a customer ID (customer identifier) field, and the order is produced by sorting the data. In addition, a plurality of fields may be combined to form an order, so that, for example, a date field and a time field can be treated as one ordered field.
[0204]
When an ordered pattern is searched for in such target data, a general-purpose processing system is used that specifies the pattern by means of event definitions and inter-event definitions, interprets the specified pattern, and executes the search. An event definition gives a unique name to a condition specified on one or more fields. When a condition is specified on one field, it can be defined as, for example, "a customer who has purchased product = PC is called 'customer group A'".
[0205]
When conditions are specified on a plurality of fields, the combination of conditions is treated as one symbol (literal), as in "a customer who purchased product = PC at price = 250,000 yen is called 'customer group A'". In other words, an event is defined as a label attached to records that satisfy conditions on one or more fields. Furthermore, it is possible to specify a condition that matches anything, like a wild card in a regular expression.
[0206]
FIG. 17 shows an example of the event definition "a customer who purchased product = PC at price = 250,000 yen is called 'customer group A'". As described above, a combination of conditions on a plurality of fields cannot be represented by a regular expression over character strings.
[0207]
Next, an inter-event definition describes a relationship between events that have been given event definitions. Inter-event definitions also cover the case where a plurality of events exist at the same position in the order and the case where the spacing in the order is not constant (the case where the order is described with arbitrary intervals).
[0208]
As a specific example, let the event of a customer purchasing a PC for 250,000 yen be called "A", and the event of a customer purchasing a TV for 100,000 yen be called "B". An inter-event definition such as "the interval from 'A' to 'B' is within 3 visits" is then conceivable. A definition such as "the interval from 'A' to 'B' is within 3 days (the difference between the date field of 'A' and the date field of 'B' is within 3 days)" is also conceivable.
[0209]
In addition to the fields having the order, it is possible to describe an event and a constraint extending over the event. For example, a definition such as "the price at 'A' is greater than the price at 'B'" is possible. Further, even when an event definition is specified by a wild card that matches any pattern, it is possible to specify a condition by defining an event.
[0210]
FIG. 18 shows an example of inter-event definitions. In this example, the interval between event 'A' and event 'B' is defined to be less than three days, the interval between event 'B' and event 'C' to be less than two days, and the interval between event 'A' and event 'C' to be within five days.
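Event definitions and inter-event definitions of this kind can be represented as plain data plus predicates. The following Python sketch is only an illustration: the record field names, the wildcard definition of event C, and the reading of each interval as "0 to N days inclusive" are assumptions, and the actual definitions of FIG. 17 and FIG. 18 are not reproduced here.

```python
from datetime import date

# Event definitions: a label attached to a condition on one or more fields.
# A and B follow the PC / TV example in the text; C is an assumed wildcard.
EVENTS = {
    "A": lambda r: r["product"] == "PC" and r["price"] == 250_000,
    "B": lambda r: r["product"] == "TV" and r["price"] == 100_000,
    "C": lambda r: True,  # matches any record
}

# Inter-event definitions: (earlier event, later event, maximum gap in days).
# Reading each limit as 0 <= gap <= N days is an assumption of this sketch.
BETWEEN = [("A", "B", 3), ("B", "C", 2), ("A", "C", 5)]

def satisfies(assignment):
    """assignment maps each event label to the record proposed for it."""
    if not all(EVENTS[name](rec) for name, rec in assignment.items()):
        return False
    return all(
        0 <= (assignment[b]["date"] - assignment[a]["date"]).days <= limit
        for a, b, limit in BETWEEN
    )

# Hypothetical records for one customer.
rec_a = {"product": "PC", "price": 250_000, "date": date(2002, 3, 20)}
rec_b = {"product": "TV", "price": 100_000, "date": date(2002, 3, 22)}
rec_c = {"product": "milk", "price": 200, "date": date(2002, 3, 24)}
ok = satisfies({"A": rec_a, "B": rec_b, "C": rec_c})
```

Because a constraint exists between every pair of events (A and B, B and C, A and C), the pattern forms a graph rather than a simple chain, which a regular expression over characters cannot express.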
[0211]
In a regular expression, the expression "a..b", which uses "." to match any character, means that "b" appears three characters after the literal "a". This differs from an event definition, in which conditions are specified on one or more fields.
[0212]
In addition, the fact that a relationship can be defined between arbitrary events means that the matching pattern to be searched for is represented by a graph structure. In the example of FIG. 18 as well, an inter-event definition exists between event definition 1 and event definition 2, between event definition 2 and event definition 3, and between event definition 1 and event definition 3.
[0213]
The order pattern handled in the present invention is clearly different from the pattern in the regular expression in that it can have such a graph structure and is specified by a combination of an event definition and an inter-event definition. By using the event definition and the definition between events, it is possible to specify a search pattern that cannot be described by a conventional regular expression.
[0214]
In the present search device, an order-based pattern given by such event definitions and inter-event definitions is specified in a general-purpose manner, interpreted, and executed. The order-based pattern can be interpreted either dynamically at execution time, as an interpreter does, or by replacing it before execution with instructions executable by a computer, as a compiler does.
[0215]
FIG. 19 shows the basic configuration of the search device. The search device in FIG. 19 includes data 111 to be searched, a search pattern 112 specified by event definitions and inter-event definitions, and a search processing unit 113 that interprets and executes the search pattern 112. The search processing unit 113 outputs a search result 114. By changing the definition of the search pattern 112, various types of searches can be supported.
[0216]
FIG. 20 is a flowchart of the entire process performed by the search processing unit 113. First, the search processing unit 113 receives the data 111 (step S1) and the search pattern 112 (step S2) as inputs. Then, it interprets the search pattern 112 (step S3), searches for the specified pattern (step S4), and outputs the search result 114 (step S5). Hereinafter, the search pattern 112 is referred to as an event pattern.
[0217]
FIG. 21 shows a data structure held by the search processing unit 113 in step S4 of FIG. The search processing unit 113 includes a pointer P1 that points to data, and a pointer P2 that points to an event definition and an inter-event definition. Unlike ordinary character string data, for example, data that occurs on the same day has the same order. Therefore, the data pointed to by the pointer P1 is not limited to one record, but may be a plurality of records.
[0218]
The event definitions are described in order, starting from the event that appears first, and each inter-event definition is described in the column of the last event among the events it involves.
[0219]
FIG. 22 is a flowchart of the search process in step S4 of FIG. 20. First, the search processing unit 113 reads the data corresponding to the pointer P1 (step S11) and checks whether or not the pointer P1 points to the end (step S12). If the pointer P1 does not point to the end, it checks whether the data satisfies the event definition corresponding to the pointer P2 (step S13). If the data does not satisfy the event definition, 1 is added to the pointer P1 (step S17), and the processing from step S11 is repeated.
[0220]
If the data satisfies the event definition, it is further checked whether the data satisfies the inter-event definition corresponding to the pointer P2 (step S14). If the data does not satisfy the inter-event definition, the processing from step S17 is repeated.
[0221]
If the data satisfies the inter-event definition, it is next checked whether or not the pointer P2 points to the end (step S15). If the pointer P2 does not point to the end, 1 is added to P2 (step S18), and the processing from step S17 is repeated.
[0222]
If the pointer P2 points to the end, a combination of data that satisfies all the event definitions and inter-event definitions has been found. Therefore, those data are registered as a search result (step S16), and the processing from step S17 is repeated.
[0223]
If the pointer P1 points to the end in step S12, it means that the search processing for all data has been completed, and the registered search result is output. In step S16, instead of registering the search result, the search result can be output immediately. The flowchart shown in FIG. 22 is only an example of the search processing, and any pattern matching method can be used for the search processing.
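Since any pattern matching method may be used for the search, one possible realization is a simple exhaustive backtracking search. The Python sketch below is an assumption-laden illustration, not the flowchart of FIG. 22: the function `search`, the way constraints are attached to the index of the last event they involve, and the sample records are all hypothetical.

```python
def search(records, events, constraints):
    """Return every combination of records, one per event, that satisfies
    all event definitions and all inter-event definitions.

    events      -- list of predicates, in order of appearance
    constraints -- list of (index of the last event involved, predicate over
                   the list of records chosen so far)
    """
    results = []

    def extend(chosen):
        k = len(chosen)                       # plays the role of pointer P2
        if k == len(events):
            results.append(tuple(records[i] for i in chosen))
            return
        for i, rec in enumerate(records):     # plays the role of pointer P1
            if i in chosen or not events[k](rec):
                continue
            trial = [records[j] for j in chosen] + [rec]
            if all(check(trial) for last, check in constraints if last == k):
                extend(chosen + [i])

    extend([])
    return results

# Hypothetical purchase records of a single customer.
records = [
    {"date": "03.21", "product": "milk"},
    {"date": "03.21", "product": "bread"},
    {"date": "03.23", "product": "bread"},
    {"date": "03.24", "product": "milk"},
]
events = [lambda r: r["product"] == "milk", lambda r: r["product"] == "bread"]
earlier = [(1, lambda c: c[0]["date"] < c[1]["date"])]
same_day = [(1, lambda c: c[0]["date"] == c[1]["date"])]

found_earlier = search(records, events, earlier)
found_same_day = search(records, events, same_day)
```

Checking each inter-event constraint as soon as its last event is bound mirrors the idea of describing an inter-event definition in the column of the last event it involves, and it prunes partial combinations early.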
[0224]
Next, additional functions of the search device will be described with reference to FIGS.
[0225]
The search device can generate the input data to be searched, which includes a plurality of attributes, by collecting a plurality of input data (logs) recorded in multiple files. For example, search target data is generated from a plurality of files (data) by using a join (JOIN) operation of a relational database or by using an external program.
[0226]
When generating search target data from the table shown in FIG. 23 and the table shown in FIG. 24, the search device applies an SQL statement as shown in FIG. 25 to these two tables and executes a join operation. As a result, the data shown in FIG. 15 is generated. In this case, the two tables each contain different fields, and combining them produces a table that contains the fields of both.
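A join of this kind can be tried out with Python's standard `sqlite3` module. The table names, columns, rows, and SQL statement below are hypothetical stand-ins, since the contents of FIG. 23 to FIG. 25 are not reproduced in this text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical tables; not the tables of FIG. 23 and FIG. 24.
    CREATE TABLE sales  (customer_id TEXT, date TEXT, product TEXT);
    CREATE TABLE prices (product TEXT, price INTEGER);
    INSERT INTO sales  VALUES ('10001', '03.21', 'milk'),
                              ('10001', '03.21', 'bread'),
                              ('10002', '03.23', 'milk');
    INSERT INTO prices VALUES ('milk', 180), ('bread', 200);
""")

# Join the two tables on the shared product field so that each output
# record carries the fields of both tables.
rows = con.execute("""
    SELECT s.customer_id, s.date, s.product, p.price
    FROM sales AS s JOIN prices AS p ON s.product = p.product
    ORDER BY s.customer_id, s.date, s.product
""").fetchall()
con.close()
```

Sorting by customer and date in the same statement produces records already arranged by the ordered fields, which is the form the search processing expects.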
[0227]
Further, with respect to a set of records including a plurality of attributes, it is possible to generate search target data by compressing each attribute by a process such as integer conversion. For example, if the customer ID in FIG. 15 is converted to an integer by 4-byte integer conversion and then compressed to 2 bits, the following is obtained.
[0228]
Customer ID    2 bits
"10001"        00
"10002"        01
"10003"        10
If the compressed customer ID is used, the data in FIG. 15 is rewritten as shown in FIG. The compressed data may be returned to the original character string at the time of output and displayed, so that the processing can be internally performed with 2 bits. As a result, the amount of memory required for processing can be reduced, and the processing speeds up.
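This kind of compression amounts to dictionary encoding. The following Python sketch shows one way to do it; the helper name `build_codes` and the sample list of customer IDs are illustrative assumptions.

```python
def build_codes(values):
    """Assign consecutive integer codes in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes

customer_ids = ["10001", "10001", "10002", "10003", "10002"]  # sample column
codes = build_codes(customer_ids)

# Number of bits needed to distinguish all codes (at least 1).
bits = max(1, (len(codes) - 1).bit_length())

encoded = [codes[v] for v in customer_ids]       # used during processing
decode = {c: v for v, c in codes.items()}        # used only at output time
restored = [decode[c] for c in encoded]
as_bits = {v: format(c, f"0{bits}b") for v, c in codes.items()}
```

Only the small integer codes are held during the search, and the original strings are looked up again when results are displayed.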
[0229]
In addition, for a set of records including a plurality of attributes to be searched, records that are unnecessary for the pattern to be found need not be handled, so the required amount of memory and the input/output cost can be reduced. That is, records that are not needed for the pattern to be found are excluded from processing from the beginning. For example, assume that the following event pattern is defined.
[0230]
Event definition
Event 1: product = milk
Event 2: product = bread
Inter-event definition
Event 1.date = Event 2.date
In this example, an event pattern in which milk and bread were purchased on the same day is specified, so data on products other than milk and bread is not needed. Therefore, when data is read from a file, unnecessary records whose product field is something other than milk or bread are not read into the memory.
[0231]
FIG. 27 shows such a record reduction process. The data recorded in the file 121 is temporarily input to the read buffer 122, and unnecessary records are deleted from the data. Then, only necessary records are read into the memory 123 as search target data, and a search process is performed. As a result, the amount of memory can be reduced, and the processing can be expected to be faster.
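The record reduction step can be sketched as a streaming filter applied while the file is read. In the Python sketch below, the CSV layout, the field names, and the helper names `needed_products` and `read_reduced` are assumptions made for illustration; an in-memory `io.StringIO` stands in for the file 121.

```python
import csv
import io

def needed_products(event_definitions):
    """Collect the product values that some event definition can match."""
    return {value for field, value in event_definitions if field == "product"}

def read_reduced(file, wanted):
    """Stream records and keep only those whose product can match an event."""
    for row in csv.DictReader(file):
        if row["product"] in wanted:
            yield row

# Hypothetical file contents.
source = io.StringIO(
    "customer_id,date,product\n"
    "10001,03.21,milk\n"
    "10001,03.21,bread\n"
    "10002,03.21,cheese\n"
    "10003,03.23,milk\n"
)
event_definitions = [("product", "milk"), ("product", "bread")]
in_memory = list(read_reduced(source, needed_products(event_definitions)))
```

Because the rows are filtered one at a time as they are read, the unnecessary records never accumulate in memory.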
[0232]
When data has a hierarchical structure, such as when products have a two-level class hierarchy of large classification and small classification, rewriting values based on the hierarchy enables hierarchy-aware processing. An example of rewriting values based on a product hierarchy is described below.
[0233]
Large classification   Large classification code   Small classification   Small classification code   Large classification code + small classification code
Fresh                  10000                       Cucumber               1                           10001
Fresh                  10000                       Chinese cabbage        2                           10002
Seafood                20000                       Horse mackerel         1                           20001
Seafood                20000                       Flounder               2                           20002
Meat                   30000                       Beef                   1                           30001
In this example, a five-digit code (10000, 20000, ...) is assigned to each large classification, a four-digit code (0001 to 9999) is assigned to each small classification, and the sum of the large classification code and the small classification code is assigned as a code that uniquely identifies the product. When values are rewritten in this way, the classification to which a record belongs can easily be determined from the code during processing, and it becomes easy to search only a specific classification.
[0234]
For example, if the event definition specifies 20000 <= product code <= 29999, the seafood classification is specified. A specific product can also be specified in the event definition by giving both the large classification and the small classification, as in product code = 10002.
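The code scheme can be illustrated with a short sketch. The dictionary `LARGE` and the function names are hypothetical; only the code values (10000, 20000, 30000 and small codes 0001 to 9999) come from the description above.

```python
# Large classification codes as described in the text.
LARGE = {"Fresh": 10000, "Seafood": 20000, "Meat": 30000}

def product_code(large_name, small_code):
    """Combine a large classification code and a small classification code."""
    if not 1 <= small_code <= 9999:
        raise ValueError("small classification code must be in 0001..9999")
    return LARGE[large_name] + small_code

def in_large_class(code, large_name):
    """Range test equivalent to e.g. 20000 <= product code <= 29999."""
    base = LARGE[large_name]
    return base <= code <= base + 9999

horse_mackerel = product_code("Seafood", 1)
print(horse_mackerel, in_large_class(horse_mackerel, "Seafood"))
```

Because the large classification occupies the leading digit, a whole classification can be selected with a single range comparison, without consulting a separate hierarchy table.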
[0235]
Next, a method of designating pattern matching in the search processing will be described. In pattern matching with regular expressions, the basic behavior is the "longest match", which returns the longest character string that matches a given character string pattern. However, in a Perl (Practical Extraction and Report Language) processing system, it is also possible to specify a "shortest match", which returns the shortest character string that matches a given character string pattern.
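The difference between the two regular-expression matching modes can be shown with a small example. Python's `re` module is used here in place of Perl; its greedy `*` and non-greedy `*?` quantifiers follow the same convention. The sample string is made up for illustration.

```python
import re

text = "milk eggs bread jam bread"

# Greedy quantifier: longest match.
longest = re.search(r"milk.*bread", text).group()

# Non-greedy quantifier: shortest match.
shortest = re.search(r"milk.*?bread", text).group()

print(repr(longest))
print(repr(shortest))
```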
[0236]
Regarding the pattern matching based on the order of the present example, another matching specification can be made in addition to the matching specification used in the regular expression. These matching designations will be described using the search target data in FIG. For the sake of explanation, a record number for uniquely identifying each record is added to the data in FIG. The following patterns are used as event patterns.
[0237]
Event definition
Event 1: Product = milk
Event 2: Product = bread
Definition between events
Event 1.Date < Event 2.Date
Here, the condition “Event 1.Date < Event 2.Date” indicates that the date of Event 1 is earlier than the date of Event 2. First, in the “first match”, which returns the first pattern that matches the given event pattern, the records with record numbers 1 and 4 are extracted.
[0238]
These are the first milk record to appear and the first bread record that appears after it and satisfies the condition. In this case, the extracted pattern is represented as (1, 4) using the combination of record numbers.
[0239]
Next, the record number (1, 4) is also extracted in the “shortest match” that returns the pattern having the shortest interval among the patterns matching the given event pattern. This is because, in the data of FIG. 28, the interval between record 1 and record 4 is the shortest among the combinations of records that match the event pattern.
[0240]
Also, by repeating the “shortest match” from the beginning of the data, it is also possible to extract a combination of a plurality of records that match the event pattern. In this case, the process of removing the portion that matches the event pattern by one shortest match from the search target data, and applying the remaining data as a new search target to the shortest match is repeated.
[0241]
Next, in the “longest match” that returns the pattern having the longest interval among the patterns matching the given event pattern, the record number (1, 7) is extracted. This is because, in the data of FIG. 28, the interval between record 1 and record 7 is the longest among the combinations of records that match the event pattern. In “all match” that returns all patterns matching the given event pattern, three combinations of record numbers (1, 4), (1, 7), and (3, 7) are extracted.
[0242]
Furthermore, not only forward matching from the beginning of the data but also backward matching from the end of the data can be designated. To explain this kind of matching designation, the definition between events in the event pattern described above is changed as follows.
[0243]
Event definition
Event 1: Product = milk
Event 2: Product = bread
Definition between events
Event 1.Date > Event 2.Date
First, "first match backward from the end of data" returns the first pattern that matches the given event pattern in the reverse direction. In the data of FIG. 28, the record numbers (6, 4) correspond to this.
[0244]
Next, in “the shortest match in the reverse direction from the end of the data”, the pattern having the shortest interval among patterns that match the given event pattern in the opposite direction is returned. In the data of FIG. 28, the record number (3, 2) corresponds to this. Similarly, in the “longest match in the reverse direction from the end of the data”, the pattern having the longest interval among the patterns that match the given event pattern in the opposite direction is returned. In the data of FIG. 28, the record number (6, 2) corresponds to this.
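The forward and backward matching designations can be sketched as follows. The record list below is a hypothetical stand-in for the FIG. 28 data, chosen only so that it reproduces the record-number pairs quoted in the text. Measuring the interval as the distance between record numbers is likewise an assumption of this sketch.

```python
from datetime import date

# Hypothetical stand-in for the FIG. 28 data: (record number, date, product).
RECORDS = [
    (1, date(2001, 3, 21), "milk"),
    (2, date(2001, 3, 21), "bread"),
    (3, date(2001, 3, 22), "milk"),
    (4, date(2001, 3, 22), "bread"),
    (5, date(2001, 3, 23), "eggs"),
    (6, date(2001, 3, 24), "milk"),
    (7, date(2001, 3, 24), "bread"),
]

def matches(records, relation, backward=False):
    """Return every (Event 1, Event 2) record-number pair, in scan order."""
    scan = list(reversed(records)) if backward else list(records)
    found = []
    for n1, d1, p1 in scan:
        if p1 != "milk":                 # Event 1: Product = milk
            continue
        for n2, d2, p2 in scan:
            if p2 == "bread" and relation(d1, d2):  # Event 2 plus the inter-event relation
                found.append((n1, n2))
    return found

def interval(pair):
    """Interval of a pattern, taken here as the record-number distance."""
    return abs(pair[1] - pair[0])

forward = matches(RECORDS, lambda d1, d2: d1 < d2)
backward = matches(RECORDS, lambda d1, d2: d1 > d2, backward=True)

print("forward  first/shortest/longest:",
      forward[0], min(forward, key=interval), max(forward, key=interval))
print("forward  all:", forward)
print("backward first/shortest/longest:",
      backward[0], min(backward, key=interval), max(backward, key=interval))
```

The first match is simply the first pair met in scan order, while the shortest and longest matches are selected from the full set of matching pairs by their interval.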
[0245]
As described above, if the event pattern is used, various matching designations are possible, and these designation methods can be properly used according to the purpose.
[0246]
If the search device has a graphic user interface (GUI), the user can specify an event pattern using the GUI. In this case, the event definition and the event-to-event definition may be respectively specified in the GUI.
[0247]
FIG. 29 shows an example of such a GUI screen. An event name is entered in an event definition box 131, and an event condition is entered in a box 132. Then, when the user clicks the OK button, the specification of the event definition is completed. Here, the event of purchasing a PC as a product is defined as “A”. Similarly, the event of purchasing a TV as a product is defined as “B”.
[0248]
In the boxes 134 and 136 of the definition between events, the defined event names are input, and in the box 135, the relationship between the events input in the boxes 134 and 136 is input. Then, when the user clicks the OK button 137, the specification of the definition between events is completed. Here, the condition that event B occurs after event A is specified as the relationship between event A and event B. In addition, it is possible to specify, for example, that the interval between two events is within a predetermined number of days.
[0249]
The event pattern can also be specified by extending regular expressions. A regular expression cannot describe the simultaneous occurrence of a plurality of events or the interval between events, but by adding these descriptions, an event pattern that takes order into account can be specified. For example, if simultaneous occurrence is expressed by '=', the simultaneous occurrence of event A: “product = PC, price = 100,000 yen” and event B: “product = TV, price = 50,000 yen” can be expressed as follows.
[0250]
Event pattern
Event definition
Event A: Product = PC, Price = 100,000 yen
Event B: Product = TV, Price = 50,000 yen
Definition between events
Event A.Date = Event B.Date
Further, a pattern in which event C: “product = VTR, price = 30,000 yen” occurs after event A and event B occur simultaneously can be described as follows.
[0251]
Event pattern
Event definition
Event A: Product = PC, Price = 100,000 yen
Event B: Product = TV, Price = 50,000 yen
Event C: Product = VTR, Price = 30,000 yen
Definition between events
Event A.Date = Event B.Date < Event C.Date
As described above, by introducing descriptions of co-occurrence and interval, it is possible to easily specify an event pattern that cannot be expressed by a regular expression.
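The chained relation "Event A.Date = Event B.Date < Event C.Date" can be sketched as below. The record values, the `EVENTS` dictionary, and the function names are hypothetical. The brute-force enumeration is chosen for clarity and is not claimed to be how the search processing unit works.

```python
from datetime import date
from itertools import product as combinations_of

# Hypothetical sales records.
RECORDS = [
    {"date": date(2001, 4, 1), "product": "PC", "price": 100000},
    {"date": date(2001, 4, 1), "product": "TV", "price": 50000},
    {"date": date(2001, 4, 9), "product": "VTR", "price": 30000},
    {"date": date(2001, 4, 9), "product": "TV", "price": 50000},
]

# Event definitions: every listed field must take the listed value.
EVENTS = {
    "A": {"product": "PC", "price": 100000},
    "B": {"product": "TV", "price": 50000},
    "C": {"product": "VTR", "price": 30000},
}

def candidates(records, definition):
    """Indexes of the records that satisfy one event definition."""
    return [i for i, r in enumerate(records)
            if all(r[field] == value for field, value in definition.items())]

def find_patterns(records):
    """Return index triples (A, B, C) with A.Date = B.Date < C.Date."""
    found = []
    for a, b, c in combinations_of(candidates(records, EVENTS["A"]),
                                   candidates(records, EVENTS["B"]),
                                   candidates(records, EVENTS["C"])):
        # Python's chained comparison mirrors the notation of the pattern.
        if records[a]["date"] == records[b]["date"] < records[c]["date"]:
            found.append((a, b, c))
    return found

print(find_patterns(RECORDS))
```

The second TV record is rejected because it was not bought on the same date as the PC, which is exactly the co-occurrence constraint that a plain regular expression cannot state.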
[0252]
The search processing unit can process a given event pattern in two ways: by interpreting it dynamically, like an interpreter, or by replacing the pattern with computer-executable instructions, like a compiler. When the search processing unit finds a pattern matching the event pattern in the given data, it outputs predetermined information on the found pattern. For example, a case will be described in which the following inquiry is made for the data in FIG.
[0253]
Event pattern
Event definition
Event A: Product = milk
Event B: Product = bread
Definition between events
Event A.Date = Event B.Date
When the information of the record group constituting the found pattern is output as it is every time the search device finds the pattern, the information of the following two records is output.
[0254]
10001  03/21  Milk   189  (record corresponding to Event A)
10001  03/21  Bread  300  (record corresponding to Event B)
The user can also specify the output format of the found pattern. In this case, the information of the record group constituting the found pattern is arranged into the specified format and output. For example, if the output format
Event A.Product, Event B.Product
is specified for the above query, then
Milk, bread
is output. It is also possible to specify an operation that uses the values of specific fields, such as "Event B.Date - Event A.Date". In this case, the result of the specified operation on the values of the specified fields of the records included in the found pattern is output.
[0255]
After all the patterns have been found, an aggregation operation (aggregate function) can also be executed on the record groups. General functions such as minimum (MIN), maximum (MAX), average (AVG), and sum (SUM) are used as aggregation operations. To process these operations, each pattern is stored in a buffer when it is found, and when processing reaches a delimiting point, the operation is applied to all the stored patterns. For example, for the above inquiry, the user can specify an aggregation operation such as AVG(Event B.Date - Event A.Date).
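A minimal sketch of the output formatting and buffered aggregation is given below. The buffer contents are hypothetical, and they assume an event pattern that allows Event B to fall on a later date than Event A so that the average is not trivially zero. The function names are introduced only for this illustration.

```python
from datetime import date

# Hypothetical buffer of found patterns, one entry per match.
buffer = [
    {"A": {"product": "Milk", "date": date(2001, 3, 21)},
     "B": {"product": "bread", "date": date(2001, 3, 21)}},
    {"A": {"product": "Milk", "date": date(2001, 3, 22)},
     "B": {"product": "bread", "date": date(2001, 3, 26)}},
]

def format_pattern(pattern):
    """Output format 'Event A.Product, Event B.Product'."""
    return f'{pattern["A"]["product"]}, {pattern["B"]["product"]}'

def avg_date_difference(patterns):
    """AVG(Event B.Date - Event A.Date) in days over all buffered patterns."""
    diffs = [(p["B"]["date"] - p["A"]["date"]).days for p in patterns]
    return sum(diffs) / len(diffs)

for pattern in buffer:
    print(format_pattern(pattern))
print(avg_date_difference(buffer))
```

Per-pattern output can be produced as soon as each pattern is found, whereas the aggregate is only computed once the buffer is complete.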
[0256]
If no pattern matching the event pattern is found, it is necessary to notify the user in some way. For example, a record indicating that there is no matching pattern is prepared in advance, and a message is displayed on the screen using the record, so that the user can be notified of this. In the case of the data shown in FIG. 15, when the following inquiry is made, there is no matching pattern.
[0257]
Event pattern
Event definition
Event A: Product = milk
Event B: Product = bread
Definition between events
Event A.Date = Event B.Date
When a set of records composed of a plurality of fields is given as data, the records can be grouped (group by) to speed up the processing. In this case, the records are rearranged in advance by specifying which field is used for grouping and which field is used to sort the records within each group. A plurality of fields can be specified as the fields used for grouping.
[0258]
For example, it is assumed that data as shown in FIG. 30 is searched and grouped by a customer ID field, and records of each group are arranged by purchase date. In this case, the search target data is rearranged as shown in FIG. 31 by grouping and sorting. The customer with the customer ID of 110001001 visited the store for the first time on January 13, 2001, and purchased two items A and N. The same customer visits the store for the second time on January 28, 2001 and purchases another product B. In this way, an event pattern is specified for the rearranged data.
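The grouping and sorting step can be sketched as follows. The three records of customer 110001001 follow the description above; the second customer and the tuple layout are assumptions added for illustration.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# (customer ID, purchase date, product); input order is arbitrary.
records = [
    ("110001002", date(2001, 1, 20), "C"),   # hypothetical second customer
    ("110001001", date(2001, 1, 28), "B"),
    ("110001001", date(2001, 1, 13), "A"),
    ("110001001", date(2001, 1, 13), "N"),
]

# Group by customer ID, then order each group by purchase date.
ordered = sorted(records, key=itemgetter(0, 1))
groups = {cid: list(rows) for cid, rows in groupby(ordered, key=itemgetter(0))}

for cid, rows in groups.items():
    print(cid, [(day.isoformat(), product) for _, day, product in rows])
```

After this rearrangement, each group can be matched in a single sequential pass.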
[0259]
As described above, an event definition specifies, for one or more fields, what value each field takes, and gives a single name to the whole set of conditions. The definition between events specifies a constraint that spans a plurality of events.
[0260]
Regarding the data in FIG. 31, as shown in FIG. 32, it is assumed that a record in which the product field is “A” is defined as Event1, and a record in which the product field is “B” is defined as Event2. Furthermore, if it is specified that the interval between the dates (purchase dates) on which Event1 and Event2 occur is within 30 days and that the prices of Event1 and Event2 are equal, the event pattern shown in FIG. 33 is obtained.
[0261]
In this example, since the data is grouped and sorted, the search processing unit can perform matching simply by taking out the sorted data of each group in sequence. If grouping and sorting are not performed and the data does not fit in memory, the data must be read many times, and processing efficiency suffers. The speedup obtained from grouping and sorting is therefore clear.
[0262]
In the above example, the data is grouped and sorted within each group. However, even when the entire search target data is regarded as one group, processing can similarly be accelerated by sorting the data in order. FIG. 34 shows an example in which sales data of a certain store is arranged in date order. By arranging the entire data in date order in this way, the process of searching for n events in m data items is improved from the order of O(mn) to the order of O(m + n).
[0263]
In addition, if an index is used, equivalent high-speed processing can be achieved without performing data grouping and alignment. In this method, a set of records is indexed and processed on a group axis and an order axis, and access to records can be performed in order for each group.
[0264]
FIG. 35 shows data access using such an index. In this example, the data 143 is accessed via the group information holding unit 141 and the plurality of indexes 142. The group information holding unit 141 holds a plurality of customer identifiers (CID1 to CID4) corresponding to a plurality of groups, and a pointer to an index 142 corresponding to each customer identifier. Each index 142 holds a plurality of date data and pointers to records corresponding to each date data. The search processing unit can use the group information holding unit 141 and the index 142 to access the records in order of date for each group.
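The two-level access path of FIG. 35 can be approximated with ordinary containers. In this sketch the group information holding unit 141 is modeled as a dictionary keyed by customer identifier, and each index 142 as a date-sorted list of (date, position) pairs. The record values and function names are hypothetical.

```python
from bisect import insort
from collections import defaultdict
from datetime import date

# Data 143: stored in arrival order and never rearranged.
data = [
    {"cid": "CID2", "date": date(2001, 1, 28), "product": "B"},
    {"cid": "CID1", "date": date(2001, 1, 20), "product": "C"},
    {"cid": "CID2", "date": date(2001, 1, 13), "product": "A"},
]

def build_index(records):
    """Build customer ID -> date-ordered list of (date, position in data)."""
    group_info = defaultdict(list)            # models holding unit 141
    for position, record in enumerate(records):
        # insort keeps each per-group index (142) ordered by date.
        insort(group_info[record["cid"]], (record["date"], position))
    return group_info

def records_in_order(records, group_info, cid):
    """Yield one group's records in date order by following the index."""
    for _, position in group_info[cid]:
        yield records[position]

index = build_index(data)
print([r["product"] for r in records_in_order(data, index, "CID2")])
```

The underlying data is left untouched; only the small index structures are kept in order, which is what allows ordered access without grouping and sorting the records themselves.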
[0265]
Further, even when the entire data is regarded as one group, the use of the index makes it possible to achieve the same speedup without performing the alignment. In this case, since the group information holding unit 141 in FIG. 35 is not necessary, a configuration as shown in FIG. 36 is used. The search processing unit can access the records in chronological order by using the index 142.
[0266]
Next, examples 1, 2, and 3 will be described as specific examples of the search processing with reference to FIGS.
[0267]
FIG. 37 shows sales data to be searched. In FIG. 37, each record is composed of five fields: RID (record identifier), customer ID, purchase date, product, and price. The data is grouped by the customer ID field, and the order of the records in the group is assumed to be rearranged according to the purchase date field.
[0268]
In FIG. 37, only records related to the group with the customer ID of 110001001 are shown due to space limitations. In addition, hereinafter, only the processing related to this group will be mainly described. However, actually, the same processing is performed also for the group having another customer ID.
[0269]
(Example 1)
Consider an event pattern specified as follows:
[0270]
Event pattern
Event definition
Event1: Product = A
Event2: Price <= 838
Definition between events
Event2: Within 3 days from Event1
(Event2.Purchase date <= Event1.Purchase date + 3 days)
When the query is given in this way, the search processing unit interprets the query and internally generates a pattern structure (query pattern) as shown in FIG. Various methods are conceivable for executing the pattern matching; for example, the pointer P2 shown in FIG. 21 can be used. At the start of the process, the pointer is initialized so as to point to the head of the query pattern.
[0271]
Pattern matching is performed for each group. The search processing unit takes records from each group in order from the beginning and checks whether each record matches the event definition that has been interpreted in advance. If a record matches the event definition, it is then checked against the inter-event definition. If the record also matches the definition between events, the pointer P2 is advanced by one.
[0272]
In the inquiry pattern of FIG. 38, the first record R1 of FIG. 37 matches the event definition of Event1 pointed to by the pointer P2. Then, next, it is checked whether the condition of the definition between events is satisfied. However, since an inter-event definition is not defined for Event1, the record R1 matches Event1 only by matching the event definition. Therefore, as shown in FIG. 39, the pointer P2 is changed to point to Event2. Next, the record R2 is read, and it is checked whether the record R2 satisfies “Price <= 838” which is the event definition of Event2. However, since the price of R2 is 1800 yen and does not satisfy the event definition of Event2, the processing moves to the next record R3. Since the price of R3 is 838 yen, the condition for the event definition of Event2 is satisfied. Therefore, it is checked whether or not R3 satisfies the definition of Event2 between events.
[0273]
In this case, the definition between events represents the condition that Event2, that is, R3, occurs within three days from Event1. Since the interval between R3 and R1 is 15 days, the condition of the definition between events is not satisfied. In addition, since the records are arranged in ascending order of purchase date, the interval between R1, which satisfies Event1, and any record after R3 is also 15 days or more. It is therefore known at this point that no record satisfies the inter-event definition of Event2, and the processing of this group is stopped.
[0274]
In this way, the matching process is performed on the data of each group, and if all the conditions of the inquiry pattern are satisfied, the matching is successful. In the example described above, in the group having the customer ID of 110001001, there is no pattern matching the inquiry pattern in the data. Therefore, a result that "the specified pattern does not exist" is output.
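The walk-through of Example 1 can be sketched as below. The record list is a hypothetical stand-in for the FIG. 37 group: only the values quoted in the text (R1 is product A, R2 costs 1800 yen, R3 costs 838 yen and is 15 days later) are taken from the description, and the rest are assumed. The function name is also introduced here.

```python
from datetime import date, timedelta

# (RID, purchase date, product, price), sorted by purchase date.
GROUP = [
    ("R1", date(2001, 1, 13), "A", 838),
    ("R2", date(2001, 1, 13), "N", 1800),
    ("R3", date(2001, 1, 28), "B", 838),
    ("R4", date(2001, 1, 28), "C", 500),
]

def match_example1(group, max_days=3):
    """Return (Event1 RID, Event2 RID), or None if the group has no match."""
    event1 = None
    for rid, day, product, price in group:
        if event1 is None:
            if product == "A":           # event definition of Event1
                event1 = (rid, day)      # the pointer now points to Event2
            continue
        if price > 838:                  # event definition of Event2 not met
            continue
        if day <= event1[1] + timedelta(days=max_days):
            return (event1[0], rid)      # inter-event definition satisfied
        # Records are sorted by date, so no later record can be within range.
        return None
    return None

print(match_example1(GROUP))
```

The early return on the first out-of-range record corresponds to stopping the processing of the group at R3, and it relies entirely on the records being sorted by purchase date.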
[0275]
(Example 2)
Consider an event pattern specified as follows: As the matching method, it is assumed that two types, ie, a first match and an all match are specified.
[0276]
Event pattern
Event definition
Event1: Product = A
Event2: Price <= 838
Definition between events
Event2: Within 3 visits from Event1
(Event2.Order <= Event1.Order + 3)
The event definition in this example is the same as that in Example 1. However, as the definition between events, the constraint that Event2 is within 3 visits from Event1 is used instead of the constraint used in Example 1. In this case, a number starting from 1 is added as a sequence number to each record according to the order of the purchase date field, and the search target data in FIG. 37 is converted into an internal format as shown in FIG.
[0277]
In FIG. 40, the same order 1 is assigned to records R1 and R2. This order corresponds to the date when the customer with the customer ID of 110001001 first came to the store. Similarly, order 2 is assigned to records R3 and R4, which corresponds to the second visit to the store. Records R5, R6, and R7 are assigned orders 3, 4, and 5, respectively, which correspond to the third, fourth, and fifth visits. The order shown here is a logical concept, and whether or not an order field as shown in FIG. 40 is physically provided depends on the implementation of the system.
[0278]
At the time of retrieval, as in the case of Example 1, the query is interpreted and a query pattern as shown in FIG. 41 is internally generated. First, as in Example 1, the first record R1 matches the event definition of Event1, and the pointer P2 is changed to point to Event2.
[0279]
Next, the record R2 is read. Since the price of R2 is more than 838 yen, R2 does not satisfy the event definition of Event2, and the next record R3 is checked. Since the price of R3 satisfies the event definition of Event2, it is next checked whether R3 satisfies the inter-event definition of Event2. In this example, the definition between events is the condition that Event2 occurs within three visits from Event1. In other words, the order of Event2 must be no greater than 4, which is obtained by adding 3 to the order of Event1 (that is, R1), which is 1. The order of R3 is 2, which is 4 or less and corresponds to the visit following that of R1, so the constraint of the definition between events is satisfied.
[0280]
At the point when the search up to R3 has been completed, the pattern specification consisting of Event1 and Event2 is satisfied, so (R1, R3) is output as the first matched pattern. Next, R4, which is a record from the same visit date as R3, is also checked against the condition of Event2, with R1 as Event1. As a result, R4 is found to satisfy both the event definition and the inter-event definition of Event2, and the pattern (R1, R4) is output.
[0281]
The two patterns (R1, R3) and (R1, R4) satisfy the inquiry pattern on the same visit date, and it cannot be said that either is the first. As described above, when a plurality of data exist at the same position (date or time), there may be a plurality of answers unlike the data of the conventional regular expression.
[0282]
Since the next record R5 does not have the same visit date as R3 and R4, if the first match is specified, the processing is stopped here. On the other hand, if all matches are specified, the processing of the records after R5 is similarly continued. As a result, in addition to (R1, R3) and (R1, R4), (R1, R6) is also output as a search result pattern. Since the price of R6 is 581 yen, which is lower than 838 yen, and the order is 4, the interval of the pattern of (R1, R6) is 3 visits. Therefore, (R1, R6) satisfies the given pattern specification.
[0283]
If the search result is expressed as (Event1.RID, Event2.RID), the output in the case of the first match and all matches is as follows.
[0284]
First match: (R1, R3), (R1, R4)
All matches: (R1, R3), (R1, R4), (R1, R6)
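Example 2 can be sketched as follows. The record list is again a hypothetical stand-in for FIG. 37: the prices of R2, R3, and R6 and the visit structure follow the text, while the remaining values are assumed. As a simplification, only the first record that matches Event1 is used as the starting point.

```python
from datetime import date

# (RID, purchase date, product, price), sorted by purchase date.
GROUP = [
    ("R1", date(2001, 1, 13), "A", 838),
    ("R2", date(2001, 1, 13), "N", 1800),
    ("R3", date(2001, 1, 28), "B", 838),
    ("R4", date(2001, 1, 28), "C", 500),
    ("R5", date(2001, 2, 5), "D", 1200),
    ("R6", date(2001, 2, 20), "E", 581),
    ("R7", date(2001, 3, 3), "F", 700),
]

def with_visit_order(group):
    """Replace each date by a visit number; equal dates share one number."""
    rows, order, last_day = [], 0, None
    for rid, day, product, price in group:
        if day != last_day:
            order, last_day = order + 1, day
        rows.append((rid, order, product, price))
    return rows

def search(group, mode):
    """mode is 'first' or 'all'; returns a list of (Event1 RID, Event2 RID)."""
    results, event1, hit_order = [], None, None
    for rid, order, product, price in with_visit_order(group):
        if event1 is None:
            if product == "A":
                event1 = (rid, order)
            continue
        if order > event1[1] + 3:
            break                        # later visits are out of range too
        if mode == "first" and hit_order is not None and order != hit_order:
            break                        # first match: stay on the same visit
        if price <= 838:
            results.append((event1[0], rid))
            hit_order = order
    return results

print("First match:", search(GROUP, "first"))
print("All matches:", search(GROUP, "all"))
```

Because R3 and R4 share the same visit order, the first match yields two pairs, which is the situation that cannot arise with conventional regular expressions over a single character sequence.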
(Example 3)
Consider an event pattern specified as follows: As the matching method, it is assumed that two types, ie, a first match and an all match are specified.
[0285]

The event definition in this example is different from those in Examples 1 and 2, and it is defined that Event2 can match anything as a wild card. In the definition between events, in addition to the constraint that Event1 and Event2 used in Example 2 are within three visits, the constraint that the price of Event2 is equal to or lower than the price of Event1 is given. As in the case of Example 2, a number starting from 1 is added as a sequence number to each record according to the order of the purchase date field, and the internal format in FIG. 40 is used as search target data.
[0286]
At the time of retrieval, as in the case of Example 1 and Example 2, the query is interpreted and a query pattern as shown in FIG. 42 is internally generated. First, the first record R1 matches the event definition of Event1, and the pointer P2 is changed to point to Event2.
[0287]
Next, the record R2 is read, and the event definition of Event2 is a wild card and matches anything. Then, it is checked whether or not the event-to-event definition of Event2 is satisfied. The definition between events is a condition that Event2 has occurred within three visits from Event1 and a condition regarding price. R2 satisfies the first condition of within three visits from Event1, but the second condition is not satisfied because the price of R2 is more expensive than the price of R1.
[0288]
Next, the record R3 is read. For R3 as well, the event definition of Event2 is unconditionally satisfied. Therefore, it is checked whether or not R3 satisfies the inter-event definition of Event2. The order of R3 is 2, which is 4 or less and corresponds to the visit following that of R1. The price of R3, 838 yen, does not exceed the price of Event1. Since both conditions are satisfied, the constraint of the definition between events is satisfied.
[0289]
At the point when the search up to R3 has been completed, the pattern specification consisting of Event1 and Event2 is satisfied, so that (R1, R3) is output as the first matched pattern. Next, also for the record R4, it is checked whether or not the condition of Event2 is satisfied. Since it is found that both the event definition and the definition between events are satisfied, the pattern (R1, R4) is output.
[0290]
Since the next record R5 does not have the same visit date as R3 and R4, if the first match is designated, the process is stopped here. On the other hand, if all matches are specified, the processing of the records after R5 is similarly continued. As a result, in addition to (R1, R3) and (R1, R4), (R1, R6) is also output as a search result pattern. Since the order of R6 is 4, the interval of the pattern (R1, R6) is 3 visits. Further, since the price of R6 is 581 yen, which is lower than the price of R1, 838 yen, (R1, R6) satisfies the given pattern specification. Thus, the output in the case of the first match and all matches is as follows.
First match: (R1, R3), (R1, R4)
All matches: (R1, R3), (R1, R4), (R1, R6)
In the three example event patterns described above, only two events, Event1 and Event2, are used for simplicity. Further, only one or two conditions are used as the event definition and the definition between the events. However, in practice, an event pattern can be specified by a larger number of events, and more conditions can be specified in the event definition and the definition between events.
[0291]
According to the above-described search device and search method, it is possible to easily specify a pattern in consideration of the order between data and search for the specified pattern. Further, it is possible to cope with various types of searches simply by changing the definition of the pattern.
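The point that different searches follow from changing only the pattern definition can be illustrated with a generic matcher that takes the event definitions and the inter-event definition as functions. This is a sketch under assumptions: the rows are a hypothetical stand-in for FIG. 40, the function `all_matches` is introduced here, and only the all-match designation is implemented.

```python
# Hypothetical rows in the style of FIG. 40 (visit order already assigned).
ROWS = [
    {"rid": "R1", "order": 1, "product": "A", "price": 838},
    {"rid": "R2", "order": 1, "product": "N", "price": 1800},
    {"rid": "R3", "order": 2, "product": "B", "price": 838},
    {"rid": "R4", "order": 2, "product": "C", "price": 500},
    {"rid": "R5", "order": 3, "product": "D", "price": 1200},
    {"rid": "R6", "order": 4, "product": "E", "price": 581},
    {"rid": "R7", "order": 5, "product": "F", "price": 700},
]

def all_matches(rows, event1, event2, between):
    """Return every (Event1 RID, Event2 RID) pair that satisfies the pattern."""
    results = []
    for i, first in enumerate(rows):
        if not event1(first):
            continue
        for second in rows[i + 1:]:
            if event2(second) and between(first, second):
                results.append((first["rid"], second["rid"]))
    return results

# Example 2: Event2 is "Price <= 838", within 3 visits of Event1.
example2 = all_matches(
    ROWS,
    lambda r: r["product"] == "A",
    lambda r: r["price"] <= 838,
    lambda a, b: b["order"] <= a["order"] + 3,
)

# Example 3: Event2 is a wild card; the price constraint moves between events.
example3 = all_matches(
    ROWS,
    lambda r: r["product"] == "A",
    lambda r: True,
    lambda a, b: b["order"] <= a["order"] + 3 and b["price"] <= a["price"],
)

print(example2)
print(example3)
```

The matcher itself is unchanged between the two examples; only the three functions that encode the pattern differ.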
[0292]
(2): Description of the configuration of the time series filter unit
The time series filter section is configured as follows.
[0293]
A: A search device that searches for a combination of records from a set of records composed of a plurality of attributes, comprising: specifying means for specifying a search pattern using a plurality of events, each defining that a predetermined attribute in a record takes a specific value, and an order relationship between the plurality of events defined based on the order of attribute values; search means for searching the set of records for a combination of records corresponding to the specified search pattern; and output means for outputting a search result (corresponding to the time-series filter unit 4 of the present invention).
[0294]
B: In the search device of A, the specifying means specifies a search pattern in which two or more events exist in the same order (corresponding to the time-series filter unit 4 of the present invention).
[0295]
C: In the search device of A, the specifying means specifies a search pattern in which the order relation of the plurality of events is described with an arbitrary interval (corresponding to the time-series filter unit 4 of the present invention).
[0296]
D: The search device of A, further comprising a unit for compressing data of the set of records by compressing data of at least one value of the plurality of attributes, wherein the search unit includes (Equivalent to the time-series filter unit 4 of the present invention).
[0297]
E: In the search device of A, the specifying means specifies one of the following methods, and the search means performs pattern matching by the specified method: a longest match that returns, among the combinations of records matching the search pattern, the combination having the longest interval between events; a shortest match that returns the combination having the shortest interval between events; repetition of the shortest match; an all match that returns all combinations of events matching the search pattern; a reverse longest match that returns, among the combinations of events matching the search pattern in the reverse direction, the combination having the longest interval; and a reverse shortest match that returns, among the combinations of events matching the search pattern in the reverse direction, the combination having the shortest interval.
[0298]
F: In the search device of A, the output means performs an aggregation operation on a combination of records corresponding to the search pattern, and outputs a calculation result.
[0299]
G: In the search device of A, the specifying means specifies an attribute to be used for sorting the records, and the search means sorts the set of records based on the value of the specified attribute and searches the sorted set.
[0300]
H: The search device of A, further comprising index means for accessing the set of records in the order of the attribute values.
The following configuration is added to the above description.
[0301]
(Appendix 1)
A time-series correlation extraction device that obtains, from sequence data of items included in temporally consecutive transactions, single items and combinations of i = 2 or more items together with their numbers of appearances, and extracts a time-series correlation rule, the device comprising:
a time-series filter unit that filters input data and leaves necessary data; and a time-series correlation engine unit that extracts a time-series correlation rule from the input data,
wherein the time-series filter unit comprises:
specifying means for specifying, when a combination of records is retrieved from a set of records composed of a plurality of attributes, a search pattern using a plurality of events, each defining that a predetermined attribute in a record takes a specific value, and an order relationship between the plurality of events defined based on the order of the attribute values;
search means for searching the set of records for a combination of records corresponding to the specified search pattern; and
output means for outputting a search result.
[0302]
(Appendix 2)
The time-series correlation extraction device according to (Appendix 1), further comprising a function of extracting a time-series correlation rule by arranging the time-series filter unit at a stage preceding the time-series correlation engine unit.
[0303]
(Appendix 3)
The time-series correlation extraction device according to (Appendix 1), further comprising a function of extracting a time-series correlation rule by arranging the time-series filter unit at a stage subsequent to the time-series correlation engine unit.
[0304]
(Appendix 4)
The time-series correlation extraction device according to (Appendix 1), having a function of extracting a time-series correlation rule with the time-series filter unit arranged at both the preceding stage and the subsequent stage of the time-series correlation engine unit.
[0305]
(Appendix 5)
The time-series correlation extraction device according to any one of (Appendix 1) to (Appendix 4), having a function of extracting a time-series correlation rule with the time-series filter unit provided outside the time-series correlation engine unit.
[0306]
(Appendix 6)
The time-series correlation extraction device according to (Appendix 1), wherein the time-series filter unit is arranged at both the preceding and subsequent stages of the time-series correlation engine unit, the preceding-stage time-series filter unit and the time-series correlation engine unit are configured as one program, and the subsequent-stage time-series filter unit is configured as another single program.
[0307]
(Appendix 7)
The time-series correlation extraction device according to (Appendix 1), wherein the time-series filter unit is arranged at both the preceding and subsequent stages of the time-series correlation engine unit, the preceding-stage time-series filter unit is configured as one program, and the time-series correlation engine unit and the subsequent-stage time-series filter unit are configured as another single program.
[0308]
(Appendix 8)
The time-series correlation extraction device according to (Appendix 1), wherein the time-series filter unit is arranged at both the preceding and subsequent stages of the time-series correlation engine unit,
and the preceding-stage time-series filter unit, the time-series correlation engine unit, and the subsequent-stage time-series filter unit are each configured by one separate program.
[0309]
As described above, according to the present invention, the following effects can be obtained.
[0310]
(1): In claim 1, the time-series correlation extraction device obtains, from the sequence data of the items included in temporally consecutive transactions, each single item or each combination of i = 2 or more items that meets a given condition, in a form that maintains the order within the sequence, together with its number of appearances, and thereby extracts a time-series correlation rule.
[0311]
In this case, the time-series filter unit performs a process of filtering the input data to leave necessary data, and the time-series correlation engine unit performs a process of generating and extracting a time-series correlation rule from the input data. The time series filter unit performs the following processing.
[0312]
That is, the time-series filter unit includes the specifying means, the search means, and the output means, and searches for a combination of records from a set of records having a plurality of attributes. At this time, the specifying means specifies a search pattern (event pattern) using a plurality of events, each defining that a predetermined attribute in a record takes a specific value, and an order relationship between those events defined based on the order of the attribute values.
[0313]
The search means searches the set of records for a combination of records corresponding to the specified search pattern, and the output means outputs the search result. An event is defined as a state in which a predetermined attribute in a record takes a specific value, and the order relation between a plurality of events is defined based on the order relation between the values of one or more attributes of the records corresponding to those events.
[0314]
The user specifies a search pattern determined by these events and the order relation between the events by using the specifying means. The specifying means passes the search pattern to the search means, and the search means interprets the received search pattern and extracts the combinations of records corresponding to it. The output means then outputs information such as the extracted records as the search result.
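For illustration only, the cooperation of the specifying means, the search means, and the output means described above might be sketched as follows in Python. The record layout, the names Event and search_pattern, and the sample sales data are assumptions introduced for this sketch and do not appear in the specification. The sketch also handles only the simple case in which each event must occur strictly later than the previous one, at an arbitrary interval.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    """An event: a given attribute of a record takes a specific value."""
    attribute: str
    value: Any

    def matches(self, record: dict) -> bool:
        return record.get(self.attribute) == self.value


def search_pattern(records, events, order_attribute):
    """Search means (sketch): find every combination of records that matches
    the events one by one, with strictly increasing order_attribute values."""
    ordered = sorted(records, key=lambda r: r[order_attribute])
    results = []

    def extend(partial, start):
        depth = len(partial)
        if depth == len(events):
            results.append(tuple(partial))
            return
        for i in range(start, len(ordered)):
            rec = ordered[i]
            in_order = (not partial
                        or partial[-1][order_attribute] < rec[order_attribute])
            if in_order and events[depth].matches(rec):
                extend(partial + [rec], i + 1)

    extend([], 0)
    return results


# Specifying means (sketch): the events are listed in their required order.
pattern = [Event("item", "bread"), Event("item", "butter")]

# Hypothetical set of records with the attributes "time" and "item".
sales = [
    {"time": 1, "item": "bread"},
    {"time": 2, "item": "milk"},
    {"time": 3, "item": "bread"},
    {"time": 4, "item": "butter"},
]

# Output means (sketch): print each matching combination of records.
for combo in search_pattern(sales, pattern, "time"):
    print([(r["time"], r["item"]) for r in combo])
```

In this sketch the order relation is taken from a single attribute ("time"); a pattern in which two events share the same position in the order would need a relaxed comparison.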
[0315]
According to such an apparatus, various search patterns can be easily specified, including a case where two or more events exist in the same order, and a case where the order relation of a plurality of events is described at an arbitrary interval. This makes it possible to perform general-purpose time-series filtering processing in consideration of the order. In this way, even if the number of transactions as described above increases, it does not take much time for processing, and it is possible to extract time-series correlation rules at high speed.
[0316]
(2): The second embodiment is an example in which the time-series filter section is provided at the preceding stage of the time-series correlation engine section, and is an example in which the time-series filter section and the time-series correlation engine section are configured by one program. In this case, the input data is subjected to a filtering process in the time-series filter unit, so that the number of combinations is reduced and the processing time is shortened.
[0317]
Next, a time-series correlation engine extracts and outputs a time-series correlation rule. In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
[0318]
(3): The third embodiment is an example in which the time-series correlation engine unit is placed before the time-series filter unit, and is an example in which the time-series correlation engine unit and the time-series filter unit are configured as one program. In this case, the time series correlation rule is extracted from the input data by the time series correlation engine unit, and then the time series filter unit performs filtering processing to extract and output the time series correlation rule. In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
[0319]
(4): In claim 4, time-series filter units are provided both before and after the time-series correlation engine unit, and the two time-series filter units and the one time-series correlation engine unit are configured by one program. In this case, the input data is filtered by the preceding time-series filter unit, a time-series correlation rule is then extracted by the time-series correlation engine unit, and the correlation rule is output after being filtered by the subsequent time-series filter unit.
[0320]
In this example, there is a function of reducing the number of combinations in the preceding time series filter unit and excluding useless ones in the subsequent time series filter unit. In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed.
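As a rough sketch of this arrangement, the three stages might be chained as shown below in Python. The tuple layout (customer ID, time, item), the function names, and the use of ordered item pairs as the engine's output are assumptions made for this illustration; the engine here is a simplified stand-in, not the algorithm of the specification.

```python
from collections import Counter
from itertools import combinations


def pre_filter(transactions, keep_record):
    """Preceding filter (sketch): leave only the records that are needed,
    which reduces the number of combinations the engine must examine."""
    return [t for t in transactions if keep_record(t)]


def engine(transactions):
    """Engine stand-in: for each customer, count ordered pairs (a, b) in
    which item a appears strictly before item b, once per customer."""
    sequences = {}
    for cid, time, item in sorted(transactions):
        sequences.setdefault(cid, []).append((time, item))
    counts = Counter()
    for seq in sequences.values():
        counts.update({(a, b)
                       for (ta, a), (tb, b) in combinations(seq, 2)
                       if ta < tb})
    return counts


def post_filter(counts, keep_rule):
    """Subsequent filter (sketch): exclude results that are not useful."""
    return {pair: n for pair, n in counts.items() if keep_rule(pair, n)}


def extract(transactions, keep_record, keep_rule):
    return post_filter(engine(pre_filter(transactions, keep_record)), keep_rule)


# Hypothetical input: (customer ID, time, item).
transactions = [
    (1, 1, "A"), (1, 2, "D"), (1, 3, "C"),
    (2, 1, "A"), (2, 2, "C"),
    (3, 1, "C"), (3, 2, "A"),
]

rules = extract(
    transactions,
    keep_record=lambda t: t[2] != "D",   # drop item D before counting
    keep_rule=lambda pair, n: n >= 2,    # keep pairs seen for 2+ customers
)
print(rules)
```

Because the stages only exchange plain lists and dictionaries, either filter can be omitted to obtain the preceding-stage-only or subsequent-stage-only arrangement.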
[0321]
(5): The fifth embodiment is an example in which the time series filter section and the time series correlation engine section are each configured by one separate program section. That is, the time series filter unit is configured by one program, and the time series correlation engine unit 5 is configured by another program.
[0322]
In this way, even if the number of transactions increases, it does not take a long time for processing, and it is possible to extract time-series correlation rules at high speed. Further, since the time series filter unit and the time series correlation engine unit are configured by different programs, parallel processing is possible, and the time series filter unit 4 can be replaced.
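The benefit of separating the two units can be pictured with the following Python sketch, in which two threads connected by a queue stand in for the two separate programs. The names filter_program, engine_program, and run, the tuple layout (customer ID, time, item), and the trivial item-counting engine are all assumptions made for this illustration.

```python
import queue
import threading

_END = object()  # marks the end of the record stream


def filter_program(records, keep_record, out_queue):
    """Filter stand-in: pass on only the records that are needed."""
    for record in records:
        if keep_record(record):
            out_queue.put(record)
    out_queue.put(_END)


def engine_program(in_queue, counts):
    """Engine stand-in: count item appearances as records arrive."""
    while True:
        record = in_queue.get()
        if record is _END:
            break
        item = record[2]
        counts[item] = counts.get(item, 0) + 1


def run(records, keep_record):
    """Run both stages concurrently and return the engine's counts."""
    channel = queue.Queue()
    counts = {}
    producer = threading.Thread(
        target=filter_program, args=(records, keep_record, channel))
    consumer = threading.Thread(
        target=engine_program, args=(channel, counts))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return counts


# Hypothetical input: (customer ID, time, item).
records = [(1, 1, "A"), (1, 2, "D"), (2, 1, "A"), (2, 2, "C")]
print(run(records, keep_record=lambda r: r[2] != "D"))
```

The engine starts consuming records while the filter is still producing them, and a different filter can be substituted by passing another keep_record function without touching the engine.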
[Brief description of the drawings]
FIG. 1 is a diagram illustrating the principle of the present invention.
FIG. 2 is an explanatory diagram of a time-series correlation extraction device according to an embodiment of the present invention.
FIG. 3 is an explanatory diagram (part 1) of a program of the time-series correlation extraction device according to the embodiment of the present invention; FIG. A is Example 1, FIG. B is Example 2, FIG. C is Example 3, and FIG. D is Example 4.
FIG. 4 is an explanatory diagram (part 2) of a program of the time-series correlation extraction device according to the embodiment of the present invention; FIG. E is Example 5, FIG. F is Example 6, FIG. G is Example 7, and FIG. H is Example 8.
FIG. 5 is an input data format according to the embodiment of the present invention. FIG. 5A is an explanatory diagram of input data (text representation), and FIG.
FIG. 6 is an explanatory diagram of processing in the embodiment of the present invention. FIG. 6A shows a combination (length 2), FIG. 6B shows a combination in which CID = 1 and 4 are deleted, and FIG. There is no input.
FIG. 7 is a processing flowchart of Example 1 in the embodiment of the present invention.
FIG. 8 is a processing flowchart of Example 2 in the embodiment of the present invention.
FIG. 9 is a processing flowchart of Example 3 in the embodiment of the present invention.
FIG. 10 is a configuration diagram of a specific device according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating pattern matching by a regular expression according to the embodiment of the present invention.
FIG. 12 is a diagram showing an NFA according to the embodiment of the present invention.
FIG. 13 is a diagram illustrating a DFA according to the embodiment of the present invention.
FIG. 14 is a diagram showing a regular expression operator according to the embodiment of the present invention.
FIG. 15 is a diagram showing retail sales data according to the embodiment of the present invention.
FIG. 16 is a principle explanatory diagram of a time-series filter unit according to the embodiment of the present invention.
FIG. 17 is a diagram showing an event definition in the embodiment of the present invention.
FIG. 18 is a diagram showing a definition between events in the embodiment of the present invention.
FIG. 19 is a configuration diagram of a search device according to an embodiment of the present invention.
FIG. 20 is a flowchart of an overall process according to the embodiment of the present invention.
FIG. 21 is a diagram showing a data structure at the time of retrieval according to the embodiment of the present invention.
FIG. 22 is a flowchart of a search process according to the embodiment of the present invention.
FIG. 23 is a diagram showing a first table in the embodiment of the present invention.
FIG. 24 is a diagram showing a second table in the embodiment of the present invention.
FIG. 25 is a diagram showing an SQL sentence according to the embodiment of the present invention.
FIG. 26 is a diagram showing compressed data according to the embodiment of the present invention.
FIG. 27 is a diagram illustrating a record reduction process according to the embodiment of the present invention.
FIG. 28 is a diagram showing first search target data according to the embodiment of the present invention.
FIG. 29 is a diagram showing a GUI screen according to the embodiment of the present invention.
FIG. 30 is a diagram showing second search target data according to the embodiment of the present invention.
FIG. 31 is a diagram showing rearranged data in the embodiment of the present invention.
FIG. 32 is a diagram showing a record corresponding to an event definition in the embodiment of the present invention.
FIG. 33 is a diagram showing a record corresponding to an event pattern in the embodiment of the present invention.
FIG. 34 is a diagram showing sorted data in the embodiment of the present invention.
FIG. 35 is a diagram showing a first index in the embodiment of the present invention.
FIG. 36 is a diagram showing a second index in the embodiment of the present invention.
FIG. 37 is a diagram showing third search target data according to the embodiment of the present invention.
FIG. 38 is a diagram showing a first inquiry pattern in the embodiment of the present invention.
FIG. 39 is a diagram illustrating movement of a pointer according to the embodiment of the present invention.
FIG. 40 is a diagram showing an internal format of search target data according to the embodiment of the present invention.
FIG. 41 is a diagram showing a second inquiry pattern in the embodiment of the present invention.
FIG. 42 is a diagram showing a third inquiry pattern in the embodiment of the present invention.
FIG. 43 is an explanatory diagram of Conventional Example 1, wherein FIG. 43A is an explanatory diagram of a time-series correlation extraction device, and FIG. 43B is a processing flowchart.
FIG. 44 is a diagram illustrating a specific processing flow in the SETM algorithm in the conventional example.
FIG. 45 is a diagram showing processing content of each functional block in processing of a SETM algorithm in a conventional example.
FIG. 46 is a diagram illustrating a specific processing flow in an a priori algorithm in a conventional example.
FIG. 47 is a diagram illustrating the contents of each functional block in an a priori algorithm in a conventional example.
FIG. 48 is a diagram illustrating an example of a sequence list in a time-series analysis in a conventional example.
FIG. 49 is a view (No. 1) explaining a process up to G (1) in time-series analysis in a conventional example.
FIG. 50 is a diagram (part 2) illustrating an example of a sequence list in time-series analysis in a conventional example.
FIG. 51 is a view (No. 3) explaining a process up to G (1) in time-series analysis in a conventional example.
FIG. 52 is a view (No. 4) explaining a process up to G (1) in time-series analysis in a conventional example.
FIG. 53 is a diagram (No. 5) explaining a process up to G (1) in time-series analysis in a conventional example.
FIG. 54 is a view (part 1) for explaining selection of L (1) in time-series analysis in a conventional example.
FIG. 55 is a diagram (part 2) for explaining selection of L (1) in time-series analysis in a conventional example.
FIG. 56 is a diagram (part 3) for explaining selection of L (1) in time-series analysis in a conventional example.
FIG. 57 is a diagram (part 4) for explaining selection of L (1) in time-series analysis in a conventional example.
FIG. 58 is a diagram (part 5) for explaining selection of L (1) in time-series analysis in a conventional example.
FIG. 59 is a diagram (part 6) for explaining selection of L (1) in time-series analysis in a conventional example.
FIG. 60 is a diagram (part 7) for explaining selection of L (1) in time-series analysis in a conventional example.
FIG. 61 is a diagram (part 8) for explaining selection of L (1) in time-series analysis in a conventional example.
FIG. 62 is a view (No. 1) explaining a process up to G (2) in time-series analysis in a conventional example.
FIG. 63 is a diagram (part 2) for explaining the processing up to G (2) in the time-series analysis in the conventional example.
FIG. 64 is a view (No. 3) explaining a process up to G (2) in the time-series analysis in the conventional example.
FIG. 65 is a view (No. 4) explaining a process up to G (2) in the time-series analysis in the conventional example.
FIG. 66 is a view (No. 5) explaining a process up to G (2) in the time-series analysis in the conventional example.
FIG. 67 is a view (part 1) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 68 is a diagram (part 2) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 69 is a diagram (part 3) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 70 is a diagram (part 4) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 71 is a diagram (part 5) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 72 is a view (No. 6) explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 73 is a diagram (part 7) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 74 is a view (part 8) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 75 is a view (No. 9) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 76 is a diagram (part 10) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 77 is a diagram (part 11) for explaining selection of L (2) in time-series analysis in a conventional example.
FIG. 78 is a view (No. 1) explaining processing up to G (3) in time-series analysis in a conventional example.
FIG. 79 is a diagram (part 2) for explaining the processing up to G (3) in the time-series analysis in the conventional example.
FIG. 80 is a view (No. 3) explaining a process up to G (3) in time-series analysis in a conventional example.
FIG. 81 is a view (No. 4) explaining a process up to G (3) in the time-series analysis in the conventional example.
FIG. 82 is a view (No. 5) explaining a process up to G (3) in the time-series analysis in the conventional example.
FIG. 83 is a diagram illustrating selection of L (3) in time-series analysis in a conventional example.
FIG. 84 is a view for explaining the flow of processing in basic correlation analysis in a conventional example.
FIG. 85 is a diagram illustrating the progress of the hashed list generation process in the item combination number counting process in the conventional example.
FIG. 86 is a diagram showing a hashed list as a processing result of FIG. 85.
FIG. 87 is a diagram illustrating the progress of the minimum hash value record retrieval process for the hashed list in FIG. 86.
FIG. 88 is a flowchart of a conventional example of a group-by process based on a sort process.
FIG. 89 is an explanatory diagram of a specific process progress using the flowchart of FIG. 88;
FIG. 90 is a flowchart of a conventional example of group-by processing based on hash processing.
FIG. 91 is an explanatory diagram of a specific process flow using the flowchart of FIG. 90;
FIG. 92 is a diagram illustrating hash processing in the first example of the group-by processing method.
FIG. 93 is a diagram showing the progress of hash processing in the first example of the group-by processing method.
FIG. 94 is an overall flowchart of group-by function processing in the first and second embodiments of the group-by processing method;
FIG. 95 is an overall explanatory diagram of group-by function processing in the first and second embodiments of the group-by processing method.
FIG. 96 is a diagram illustrating a configuration of an embodiment of a data combination counting system according to the present invention.
FIG. 97 is an overall processing flowchart of a data combination counting method according to the present invention.
FIG. 98 is a flowchart of a large item set generation process.
FIG. 99 is a diagram (No. 1) for explaining generation of a combination candidate of an item having a length of 1 (C (1)) and counting (G (1)) thereof;
FIG. 100 is a diagram (No. 2) for explaining generation of a combination candidate of an item having a length of 1 (C (1)) and counting (G (1)) thereof.
FIG. 101 is a diagram (No. 3) for explaining generation of a combination candidate of an item having a length of 1 (C (1)) and counting (G (1)) thereof.
FIG. 102 is a diagram (No. 4) for explaining generation of a combination candidate of an item having a length of 1 (C (1)) and counting (G (1)) thereof.
FIG. 103 is a diagram illustrating a hashed list obtained as a result of the hash processing in the first example of the group-by processing.
FIG. 104 is a view (part 1) for explaining selection of a large item set L (1) having a length of 1.
FIG. 105 is a view (part 2) for explaining selection of a large item set L (1) having a length of 1.
FIG. 106 is a diagram (part 3) for explaining selection of a large item set L (1) having a length of 1.
FIG. 107 is a diagram (part 4) for explaining selection of a large item set L (1) having a length of 1.
FIG. 108 is a view (No. 5) for explaining selection of a large item set L (1) having a length of 1.
FIG. 109 is a view (No. 6) for explaining selection of a large item set L (1) having a length of 1.
FIG. 110 is a view (No. 1) for explaining the processing up to the enumeration (G (2)) of the combination candidates of items of length 2.
FIG. 111 is a view (No. 2) for explaining the processing up to the enumeration (G (2)) of the combination candidates of items of length 2.
FIG. 112 is a view (No. 3) for explaining the processing up to the enumeration (G (2)) of the combination candidates of items of length 2.
FIG. 113 is a diagram (No. 4) for explaining the processing up to the enumeration (G (2)) of the combination candidates of items of length 2.
FIG. 114 is a diagram (part 1) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 115 is a view (part 2) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 116 is a diagram (part 3) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 117 is a diagram (part 4) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 118 is a view (No. 5) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 119 is a view (No. 6) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 120 is a view (No. 7) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 121 is a view (No. 8) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 122 is a view (No. 9) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 123 is a view (No. 10) for explaining selection of a large item set L (2) of items having a length of 2.
FIG. 124 is a view (No. 1) for explaining the processing up to the enumeration (G (3)) of the combination candidates of items of length 3.
FIG. 125 is a view (No. 2) for explaining the processing up to the enumeration (G (3)) of the combination candidates of items of length 3.
FIG. 126 is a diagram (No. 3) for explaining the processing up to the enumeration (G (3)) of the combination candidates of items of length 3.
FIG. 127 is a view (No. 4) for explaining the processing up to the enumeration (G (3)) of the combination candidates of items of length 3.
FIG. 128 is a view (part 1) for explaining selection of a large item set L (3) having a length of 3.
FIG. 129 is a diagram (part 2) for explaining selection of a large item set L (3) having a length of 3.
FIG. 130 is a view (No. 3) for explaining selection of a large item set L (3) having a length of 3.
[Explanation of symbols]
1 time-series correlation extraction device
2 Input device
3 Output device
4 Time series filter unit
5 Time Series Correlation Engine
10 Computer body
11 Display device (display device)
12 input device
13 Removable Disk Drive (RDD)
14 Hard disk drive (HDD)
15 CPU (central processing unit)
16 ROM (read only memory)
17 memory
18 Interface control unit (I / F control unit)
19 Communication control unit

Claims

A time-series correlation extraction device that obtains, from the sequence data of the items included in temporally consecutive transactions, single items or combinations of i = 2 or more items together with the number of appearances thereof, and extracts a time-series correlation rule, the device comprising:
a time-series filter unit that filters input data and leaves necessary data; and a time-series correlation engine unit that extracts a time-series correlation rule from the input data,
wherein the time-series filter unit comprises:
specifying means for specifying, when a combination of records is retrieved from a set of records composed of a plurality of attributes, a search pattern using a plurality of events, each defining that a predetermined attribute in a record takes a specific value, and an order relationship between the plurality of events defined based on the order of the attribute values;
search means for searching the set of records for a combination of records corresponding to the specified search pattern; and
output means for outputting a search result.

The time-series correlation extraction device according to claim 1, further comprising a function of extracting a time-series correlation rule by arranging the time-series filter unit before the time-series correlation engine unit.

The time-series correlation extraction device according to claim 1, further comprising a function of extracting a time-series correlation rule by arranging the time-series filter unit after the time-series correlation engine unit.

The time-series correlation extraction device according to claim 1, further comprising a function of extracting a time-series correlation rule by arranging the time-series filter unit at both a preceding stage and a subsequent stage of the time-series correlation engine unit.

The time-series correlation extraction device according to any one of claims 1 to 4, further comprising a function of extracting a time-series correlation rule by disposing the time-series filter unit outside the time-series correlation engine unit.