JP4118363B2

JP4118363B2 - Apparatus and method for collating multiple symbol strings based on sparse state transition table

Info

Publication number: JP4118363B2
Application number: JP16724297A
Authority: JP
Inventors: 功難波; 伸之井形
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1996-06-27
Filing date: 1997-06-24
Publication date: 2008-07-16
Anticipated expiration: 2017-06-24
Also published as: JPH10105576A

Description

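【０００１】
【Technical Field of the Invention】
The present invention relates to a collation apparatus and method, used in character string search devices and the like, that determine in a single pass whether one or more symbol strings occur in given data such as text.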
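【０００２】
【Prior Art】
Document processing devices such as word processors are now required to determine in a single pass whether a set of symbol strings given as search terms occurs in a text. Here, a symbol string is a sequence of characters or other symbols; a character string is one kind of symbol string. This capability is often called multiple string matching or multiple string search.
【０００３】
Efficient conventional multiple string search methods include the AC (Aho-Corasick) method proposed by Aho et al. (A. V. Aho and M. J. Corasick: "Efficient String Matching: An Aid to Bibliographic Search", CACM Vol.18 No.6, 1975), a method that converts the AC machine into a deterministic finite automaton (DFA), and the FAST (Flying Algorithm for Searching Terms) method proposed by Uratani ("FAST: A High-Speed String Matching Algorithm", Transactions of the Information Processing Society of Japan, Vol.30 No.9, 1989; Japanese Laid-Open Patent Publication No. Sho 64-74619; Japanese Patent Application No. Sho 62-231091).
【０００４】
The following first describes the multiple string matching algorithms of the AC method and of its DFA version, and then that of the FAST method.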
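【０００５】
The AC method matches strings by constructing a finite state machine called a PMM (Pattern Matching Machine) for the set of input keys.
Matching in the AC method proceeds as follows. First, the state number is initialized to "1". Next, symbols are read from the input text one at a time, and each input symbol determines the state to which the machine moves from the current state. If no transition is defined for the input symbol in the current state, the match is regarded as having failed, and the machine moves to the fail destination of the current state. If the fail destination also has no transition for this input symbol, the machine repeatedly moves to the next fail destination.
【０００６】
Because the initial state "1" has a transition defined for every symbol, a chain of fail transitions stops at the initial state in the worst case. Transitions are repeated in this way for each input symbol of the text. When a state has an accepted symbol string defined for it, that string and its position in the text are output.
【０００７】
FIG. 64 shows the PMM of the AC method for the three search keys {ab, bc, bd}. The PMM of FIG. 64 consists of six states "1" to "6"; solid arrows point to normal transition destinations and broken arrows point to fail destinations. The notation "^a,b" denotes any input symbol other than a and b. States "4", "5", and "6" (s4, s5, s6) have the symbol strings "ab", "bc", and "bd", respectively, defined as output keywords.
【０００８】
FIG. 65 shows the operation of this PMM on the input string 'cabcz'. The initial state is "1". When the symbol "c" is input first, it counts as a symbol other than a and b, so the next state is again "1" and no output is produced. When "a" is input, the machine moves to state "2", and when "b" is input, it moves to state "4". Here the symbol string "ab" defined for state "4" is output.
【０００９】
State "4", however, has no transition defined, so when "c" is input next, the machine first moves to the fail destination, state "3", and looks for a transition there. State "5" is defined as the destination for "c", so the machine moves to it and outputs the symbol string "bc". When "z" is then input, the machine moves to state "1" and stops.
【００１０】
Thus, in the AC method, each failure transition caused by an input symbol with no defined transition adds one extra transition. For n input symbols, the finite state machine therefore makes fewer than 2n transitions at most. In general, as the number of keys grows, the probability that the first character of some key is hit increases, and the number of failure transitions increases with it, so the matching speed of the AC method gradually falls as keys are added.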
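The goto/fail behaviour described above can be sketched as follows. This is a minimal illustration, not the implementation of FIG. 64: the function names `build_ac` and `ac_scan` are hypothetical, states are numbered from 0 rather than from "1", and a step counter is added only to make the failure-transition overhead visible.

```python
from collections import deque

def build_ac(keys):
    """Build goto/fail/output tables; state 0 is the initial state."""
    goto, out = [{}], [[]]
    for key in keys:
        s = 0
        for ch in key:
            if ch not in goto[s]:
                goto.append({})
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(key)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())      # depth-1 states fail to the root
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            t = fail[r]
            while t and ch not in goto[t]:
                t = fail[t]
            fail[s] = goto[t].get(ch, 0)
            out[s] = out[s] + out[fail[s]]   # inherit outputs of the fail target
    return goto, fail, out

def ac_scan(text, goto, fail, out):
    """Return (hits, steps); steps counts normal plus failure transitions."""
    s, steps, hits = 0, 0, []
    for pos, ch in enumerate(text):
        while s and ch not in goto[s]:   # failure transitions
            s = fail[s]
            steps += 1
        s = goto[s].get(ch, 0)           # normal transition (root absorbs misses)
        steps += 1
        hits += [(pos, key) for key in out[s]]
    return hits, steps

hits, steps = ac_scan("cabcz", *build_ac(["ab", "bc", "bd"]))
print(hits, steps)
```

Each hit is reported as the 0-based position of the last matched symbol together with the key. The step count stays below twice the text length, in line with the bound stated above.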
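【００１１】
What slows the AC method down is the failure transition taken when no destination is defined. In a DFA, by contrast, the destination state is uniquely determined for every input symbol. For n input symbols the finite state machine therefore always makes exactly n transitions, and matching is fast. Aho et al. show how to convert the state transition machine of the AC method into a DFA.
【００１２】
FIG. 66 shows the finite state machine corresponding to the AC state transition machine for the symbol strings {ab, bc, bd}. In FIG. 66, "state" is the current state and "next" is the destination state when the symbol listed under "input" is received. States s1 to s6 correspond to states "1" to "6". A notation such as "¬a,b" denotes any symbol other than a and b.
【００１３】
FIG. 67 shows the operation of this finite state machine on the input string 'cabcz'. The initial state is 1. The transitions in FIG. 67 include no failure transitions of the kind seen in FIG. 65, and the number of state transitions equals 5, the number of symbols in 'cabcz'.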
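【００１４】
The FAST method, known as a high-speed matching method, also matches strings by constructing a PMM for the set of input keys, as the AC method does.
Matching in the FAST method proceeds as follows. First, the state number is initialized to "0". The length of the shortest key in the input key set is taken as the minimum key length, and the matching start position in the input text is set to the position that lies the minimum key length from the head of the text.
【００１５】
Next, symbols are read one at a time from the matching start position toward the left of the text, and each input symbol determines the state to which the machine moves from the current state. If no transition is defined, the matching start position is moved to the right by a prescribed amount that depends on the input symbol, and matching resumes.
【００１６】
In this way, as long as a state transition is possible for the input symbol, the text is scanned from right to left and string patterns are extracted. When no transition is possible, the matching start position is shifted to the right in the text by the shift amount defined for the input symbol.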
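【００１７】
FIG. 68 shows the PMM of the FAST method for the three search keys {state, east, smart}. The PMM of FIG. 68 consists of 14 states "0" to "13"; solid arrows point to transition destinations and broken arrows point to shift destinations.
【００１８】
State transitions are defined in the reverse order of the symbols in each search key, and "Depth" is the depth of each state within the PMM. States "5", "9", and "13" (s5, s9, s13) have the symbol strings "state", "east", and "smart", respectively, defined as output keywords.
【００１９】
FIG. 69 tabulates, for each state of this PMM, the transition destination or shift amount for each input symbol. In FIG. 69, the numbers in the first row are state numbers and the symbols in the first column are input symbols. "(Other)" denotes any input symbol other than "a", "e", "m", "r", "s", and "t". In this table, a positive element is the number of the destination state for the corresponding input symbol, and a negative element is the shift amount for the corresponding input symbol.
【００２０】
FIG. 70 shows the operation of this PMM on the input string 'aaseastate'. The initial state is "0". Of the strings "state", "east", and "smart", the shortest is "east" with length 4, so the minimum key length is 4. The position "t", which lies the minimum key length 4 from the right end of the input string, is taken as the matching start position, and matching proceeds from right to left.
【００２１】
When a match fails, the shift amount defined for that input symbol is multiplied by -1 to obtain its magnitude, and the matching start position is moved to the right by that amount. The state number is then reset to "0" and matching resumes.
【００２２】
In the initial state "0" marked at the position of the symbol "t" in the input string of FIG. 70, the input "t" causes a transition to state "6" according to the table of FIG. 69. The input "s" then leads to state "7", and the input "a" to state "8". The input "e" leads to state "9", and the symbol string "east" defined for state "9" is output.
【００２３】
When "s" is input next, state "9" defines for this symbol not a transition but a shift amount of -7, so the matching start position is moved to the right by its magnitude, 7. The machine returns to the initial state "0" and resumes matching with the position of the symbol "e" at the shift destination as the new matching start position. Matching continues in the same way, and the symbol string "state" is output when the machine reaches state "5".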
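【００２４】
Multiple string search of this kind is used in devices such as databases, word processors, and full-text search devices.
A full-text search device here means a device that performs string search to verify whether the results of a search through a full-text index are correct. A full-text index means a search index that does not necessarily return only correct answers for an input keyword, such as a signature file or an inverted file that does not hold the positions at which words occur in a document.
【００２５】
Consider, for example, searching an English index for the keyword 'John Smith'. The unit of indexing is normally the word between spaces, so 'John Smith' becomes equivalent to 'John AND Smith'. Searching documents with the condition 'John AND Smith', however, also returns documents in which 'John' and 'Smith' occur apart, so the result contains excess hits. In such cases, string search is used to verify whether each result is correct.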
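【００２６】
【Problems to Be Solved by the Invention】
The problem with the conventional string matching methods described above is the trade-off between speed and storage capacity in the part that corresponds to the state transitions of the PMM.
【００２７】
In the AC method, storage can be reduced by representing the state transition part with a list structure. With a list structure, however, pointers must be followed in sequence and access is slow, so matching becomes even slower.
【００２８】
The DFA version of the AC method matches quickly, but representing every state transition defined for every input symbol forces the use of a table structure like that of FIG. 66. This places a heavy burden on storage capacity.
【００２９】
For example, let there be 256 kinds of input symbols (8-bit codes), N states, and 4-byte pointers. In table form, each state needs 256 pointers to next states, one pointer to the fail destination, and one pointer to the output symbol string. The required storage is therefore N * (256 + 1 + 1) * 4 bytes.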
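The storage estimate above can be evaluated for concrete state counts with a short sketch. The function name `dense_table_bytes` and the sample state counts are illustrative assumptions; only the formula N * (256 + 1 + 1) * 4 comes from the text.

```python
def dense_table_bytes(n_states, n_symbols=256, pointer_bytes=4):
    """Bytes needed by a dense table: per state, one next-state pointer per
    symbol plus one fail pointer and one output pointer."""
    return n_states * (n_symbols + 1 + 1) * pointer_bytes

for n in (1_000, 100_000):   # hypothetical state counts
    print(n, dense_table_bytes(n))
```

With 100,000 states the dense table already needs roughly 100 MB, which illustrates why the table form becomes impractical as the key set grows.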
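【００３０】
In general, the number of states N grows with the number of search keys, so the required storage becomes enormous when there are many keys. Building a string matching device on the DFA version of the AC method is therefore not realistic.
【００３１】
Likewise, in the FAST method a state transition or shift is defined for every input symbol, which forces the use of a table structure like that of FIG. 69. A string matching device based on the FAST method therefore also requires an enormous amount of storage.
【００３２】
An object of the present invention is to provide an apparatus and method for collating multiple symbol strings that can perform high-speed matching with a realistic amount of storage.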
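【００３３】
【Means for Solving the Problems】
FIG. 1 shows the principle of the collation apparatus of the present invention. The collation apparatus of FIG. 1 is a collation apparatus in an information processing apparatus that takes given symbol strings as keys and uses a finite state machine to determine whether the keys occur in a file, and it comprises state transition storage means 1 and collation means 2.
【００３４】
The state transition storage means 1 stores, in compressed array form, a sparse state transition table, that is, a state transition table defining the matching operations for one or more keys from which the data representing a predetermined operation has been removed.
【００３５】
The collation means 2 refers to the sparse state transition table, performs the operation corresponding to each symbol in the file, and collates the symbol strings in the file with the one or more keys. In doing so, the collation means 2 checks whether the sparse state transition table defines an operation for the input symbol read from the file, and performs the predetermined operation when no operation is defined for that symbol.
【００３６】
For example, when creating a state transition table based on the DFA version of the AC method, at least one of the following is removed from the conventional table to reduce its information content: the data representing transitions from the current state to the initial state, and the data representing transitions from the current state to a next state of the initial state.
【００３７】
When creating a state transition table based on the FAST method, the data representing shift operations, which move the matching position in the file back in the direction opposite to the matching direction, is removed from the conventional table as far as possible to reduce its information content.
【００３８】
As a result, the parts of the table from which data was removed become empty elements, producing a sparse state transition table in which elements are thinly scattered. Compressing such a sparse table and storing it in an array makes it possible to build a compact finite state machine and greatly reduces storage.
【００３９】
Because the compressed array-form table is basically derived from the state transition table of a DFA, it is necessary to check whether a transition or other operation is defined, but only one transition operation is needed per symbol read from the file, so the speed of the DFA is preserved.
【００４０】
The file to be collated may be any file containing symbol strings, such as a document file written as text or a file obtained by converting audio data into digital codes.
【００４１】
For example, the state transition storage means 1 of FIG. 1 corresponds to the state transition unit 122 in FIG. 5 of the embodiment, and the collation means 2 corresponds to the state transition determination unit 121.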
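【００４２】
【Embodiments of the Invention】
Embodiments of the present invention are described in detail below with reference to the drawings.
FIG. 2 shows the configuration of a collation system according to the present invention. The collation system of FIG. 2 consists of a compression device 101 and a collation device 102. The compression device 101 comprises a keyword input unit 103, a sparse-array finite state machine creation unit 104, and a state transition machine compression unit 105; the collation device 102 comprises a collation state transition machine unit 106 and a text input unit 107.
【００４３】
The keyword input unit 103 accepts the input keyword group to be searched for. The sparse-array finite state machine creation unit 104 builds, on a binary tree as an intermediate structure, a finite state machine for string matching that would be sparse in array form. Here, a sparse array means an array that contains almost no elements. The state transition machine compression unit 105 converts the intermediate structure created by the creation unit 104 into a compressed array form that allows fast matching.
【００４４】
The collation state transition machine unit 106 uses the finite state machine in compressed array form to collate the text supplied by the text input unit 107 with the keyword group.
【００４５】
FIG. 3 is a flowchart of the operation of the collation system of FIG. 2. When processing starts, the keyword input unit 103 first accepts the keyword group to be searched for (step ST1).
【００４６】
Next, the sparse-array finite state machine creation unit 104 builds on a binary tree, for the input keywords, a string matching machine in an intermediate structure whose matching speed is of the same order as a DFA and which is sparse in array form (step ST2). The string matching machine on the binary tree is slow to match, but its structure makes adding and inserting elements easy.
【００４７】
Next, the state transition machine compression unit 105 converts this string matching machine on the binary tree into a compressed array form that allows fast matching (step ST3). In doing so, it attaches a character label for confirmation to each occupied part of an array, and overlays the arrays so that their occupied parts do not collide, thereby creating a single array for matching.
【００４８】
The collation state transition machine unit 106 then collates the input text using the finite state machine in array form.
By building, for the input keyword group, a finite state machine whose matching speed is of the same order as the DFA version of the AC method or an array implementation of the FAST method, but which is sparse in array form and can be compressed, a string matching device that is fast and economical in storage is obtained.
【００４９】
The construction method is summarized as follows.
A string matching machine is built as an intermediate structure on a binary tree, on which elements are easy to insert and add and which poses no storage problem, and this intermediate structure is converted into a compact compressed array form that allows fast matching. In the conversion, a confirmation label is attached to each element, and the arrays are overlaid so that no elements collide, which compresses the array.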
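【００５０】
FIG. 4 shows the configuration of an information processing apparatus (computer) that implements the collation system of FIG. 2. The apparatus of FIG. 4 comprises a CPU (central processing unit) 111, a memory 112, an input device 113, an output device 114, an external storage device 115, a medium drive device 116, and a network connection device 117, which are connected to one another by a bus 118.
【００５１】
The CPU 111 executes a program stored in the memory 112 to carry out the processing of the compression device 101 and the collation device 102. In addition to the program, the memory 112 stores the data used in the processing. The memory 112 is, for example, a ROM (read only memory) or a RAM (random access memory).
【００５２】
The input device 113 is, for example, a keyboard or a pointing device and is used to enter requests and instructions from the user. The output device 114 is a display device, a printer, or the like and is used to output state transition machines, matching results, and so on.
【００５３】
The external storage device 115 is, for example, a magnetic disk device, an optical disk device, or a magneto-optical disk device. The program and data can be kept in the external storage device 115 and loaded into the memory 112 as needed. The external storage device 115 is also used as a database that holds keywords and text.
【００５４】
The medium drive device 116 drives a portable recording medium 119 and accesses its contents. The portable recording medium 119 may be any computer-readable recording medium such as a memory card, floppy disk, CD-ROM (compact disk read only memory), optical disk, or magneto-optical disk. The program and data can be stored on the portable recording medium 119 and loaded into the memory 112 as needed.
【００５５】
The network connection device 117 is connected to an arbitrary communication network such as a LAN (local area network) and performs the data conversion that communication requires. Through the network connection device 117, the collation system communicates with a device 120 (a database or the like) of an external information provider. The program and data can thus be received from the device 120 over the network as needed and loaded into the memory 112.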
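【００５６】
Next, a first embodiment, based on the DFA version of the AC method, is described with reference to FIGS. 5 to 37.
In the first embodiment, a state transition machine is constructed in which, relative to the table structure of the DFA version of the AC method, transitions to the initial state and transitions to the next states of the initial state are left undefined; when a lookup for one of them fails, the transition is redefined as a transition from the initial state.
【００５７】
If the machine automatically moves to the initial state whenever no transition is defined, transitions to the initial state need not be stored. The next states of the initial state can always be reached by way of the initial state, so those transitions need not be stored either. Omitting these definitions removes a large share of the elements of the DFA state transition table and yields a sparse table.
【００５８】
Next, this state transition machine is put into array form, and the arrays are overlaid so that the parts in which transitions are defined do not collide. At the same time, so that it can be checked whether an element belongs to a given array, each character for which a transition is defined is attached to the element as a label.
【００５９】
When the compressed state transition table is created, going through an intermediate binary-tree form reduces the storage actually used.
FIG. 5 shows the configuration of the collation device 102 of FIG. 2. In FIG. 5, the state transition determination unit 121 and the state transition unit 122 correspond to the collation state transition machine unit 106. The text input unit 107 extracts symbols from the target text one at a time, and the state transition determination unit 121 decides to which state to move for each input symbol.
【００６０】
The state transition unit 122 corresponds, for example, to the memory 112 and includes a compressed state transition unit 123, a confirmation label unit 124, and an output symbol unit 125. The compressed state transition unit 123 stores the state transition table in compressed array form; the confirmation label unit 124 stores the labels used to confirm, after compression, whether a transition is defined; and the output symbol unit 125 defines the symbol string to be output when a given state is reached.
【００６１】
FIG. 6 shows the first state transition table, before compression, for the three input keywords {ab, bc, bd}. Compared with the table of FIG. 66, the transitions in states s2 to s6 to the initial state s1 and to the next states s2 and s3 of the initial state have been removed.
【００６２】
FIG. 7 illustrates the transitions for each input symbol defined in the state transition table of FIG. 6. In FIG. 7, "~a,b" denotes any input symbol other than a and b.
FIG. 8 shows the array for matching obtained by compressing the state transition table of FIG. 6. In FIG. 8, index is the array subscript; GOTO is the overlaid state transition table stored in the compressed state transition unit 123; CHECK holds the confirmation labels stored in the confirmation label unit 124; and OUTPUT is the array of pointers stored in the output symbol unit 125. These pointers point to the output strings stored in the state transition unit 122.
【００６３】
The element "#" of the array CHECK denotes a terminator. The states s1 to s6 written below the array in FIG. 8 show how the states of FIG. 6 overlap, and the symbols a, b, and so on show where the transition destination for each input symbol of FIG. 6 is stored.
【００６４】
FIG. 9 shows the first character code conversion table used to explain the first embodiment. Character codes are converted into internal codes according to this table. For simplicity the character codes here cover only the letters a to z, but any other symbols, such as digits, can also be used.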
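【００６５】
Next, the matching process of the first embodiment is described with reference to FIGS. 10 and 11.
FIG. 10 is a flowchart of the string matching process based on the AC method. When processing starts, the state transition determination unit 121 first sets the text pointer, which points into the input text, to the head of the text, and sets the transition pointer, which points into the state transition array in the state transition unit 122, to the initial state (step ST11). It then checks whether the text pointer points to the end of the text (step ST12). If it does, matching ends.
【００６６】
If the text pointer does not point to the end of the text, the character it points to is fetched (step ST13), the internal code of that character is added to the value of the transition pointer, and a check is made whether the character label stored at the position indexed by the sum is the same as this character (step ST14).
【００６７】
If the characters do not match, no transition is defined for the input character. The internal code of the input character is therefore added to the pointer to the initial state, and the transition destination defined at the position indexed by the sum becomes the new transition pointer (step ST15).
【００６８】
If the label and the input character are the same in step ST14, a check is made whether an output string is defined for the state (step ST16).
If no output string is defined, the internal code of the input character is added to the current transition pointer, and the transition destination defined at the position indexed by the sum becomes the new transition pointer (step ST18). The text pointer is then advanced by one character (step ST19), and the processing from step ST12 is repeated.
【００６９】
If an output string is defined, it is output as a matching result (step ST17), and the processing from step ST18 is performed. In step ST17, the current value of the text pointer may also be output as the position of the matched string.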
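【００７０】
FIG. 11 shows the matching operation on the input text 'cabcz' using the state transition array of FIG. 8. In FIG. 11, 'c and the like denote the internal code of the character c and the like. GOTO[x] is the number of the transition destination defined at index x, CHECK[x] is the label stored at index x, and OUTPUT[x] is the output pointer stored at index x. Matching proceeds as follows.
【００７１】
In FIG. 8, the initial state corresponds to index = 1. The first input symbol is "c" (step ST13). Because the initial state defines a destination for every input symbol, the check of step ST14 is skipped, as marked by "*1" in FIG. 11, and processing moves to step ST16. No output is defined here, so step ST17 is not performed.
Next, looking up the input symbol "c" in the conversion table of FIG. 9 gives 'c = 3. The result of GOTO[1 + 3] = GOTO[4] is therefore taken as the next destination. In FIG. 8 the destination defined at index 4 is 1, so GOTO[4] = 1. The destination is thus again the initial state at index 1 (step ST18).
【００７２】
Matching the input symbol "a" in the same way gives GOTO[1 + 'a] = 26, and the machine moves to the state that starts at index 26 (step ST18).
【００７３】
For the input symbol "b", the array CHECK must be accessed to find out whether a transition is defined (step ST14). Adding 'b = 2 to the current transition pointer value 26 gives index 28. The label stored at index 28 is "b", so CHECK[26 + 'b] = CHECK[28] = b, which shows that a transition is defined for this input symbol.
【００７４】
Here the output symbol string "ab" is defined in OUTPUT[26 + 'b], so it is output (step ST17). The next destination is GOTO[26 + 'b] = 29.
【００７５】
The input symbol "c" is processed in the same way: the symbol string "bc" is output and the machine moves to the position GOTO[29 + 'c] = 5.
When the last input symbol "z" is processed in the same way, CHECK[5 + 'z] is not z, so the transition from the state at index 5 fails and the transition for "z" is defined as a transition from the initial state. As a result the transition pointer becomes GOTO[1 + 'z] = 1 (step ST15), and since the text has ended, matching ends.
【００７６】
The symbol strings "ab" and "bc" contained in the input text are thus output as the matching result.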
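The idea of the first embodiment (drop transitions to the initial state and to its next states, overlay the remaining rows into GOTO/CHECK/OUTPUT arrays, and fall back to the initial state on a CHECK mismatch) can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of FIGS. 13 to 21: it builds the rows from an ordinary trie rather than the binary-tree intermediate structure, uses no terminator nodes, packs rows by plain first fit, and the names `build_sparse`, `pack`, and `match` are hypothetical. The internal codes a = 1 to z = 26 follow the convention of FIG. 9 as described in the text.

```python
from collections import deque

def code(ch):
    """Internal code: a=1 ... z=26 (lowercase letters only)."""
    return ord(ch) - ord("a") + 1

def build_sparse(keys):
    """Trie + failure links; sparse[s] keeps only DFA transitions whose target
    is neither the root nor a child of the root."""
    goto, out = [{}], [[]]
    for key in keys:
        s = 0
        for ch in key:
            if ch not in goto[s]:
                goto.append({})
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(key)
    fail = [0] * len(goto)
    sparse = [{} for _ in goto]
    order, queue = [], deque(goto[0].values())
    while queue:
        r = queue.popleft()
        order.append(r)
        sparse[r] = {**sparse[fail[r]], **goto[r]}   # inherited + own transitions
        for ch, s in goto[r].items():
            queue.append(s)
            t = fail[r]
            while t and ch not in goto[t]:
                t = fail[t]
            fail[s] = goto[t].get(ch, 0)
            out[s] = out[s] + out[fail[s]]
    return goto, sparse, out, order

def pack(goto, sparse, out, order):
    """Overlay all rows into shared arrays; every state gets a distinct base."""
    alphabet = [chr(c) for c in range(ord("a"), ord("z") + 1)]
    rows = {0: {ch: goto[0].get(ch, 0) for ch in alphabet}}  # root row is full
    rows.update({s: sparse[s] for s in order})
    base, taken, used = {}, set(), set()
    for s in [0] + order:
        b = 1
        while b in used or any(b + code(ch) in taken for ch in rows[s]):
            b += 1                       # first fit: slide until nothing collides
        base[s] = b
        used.add(b)
        taken.update(b + code(ch) for ch in rows[s])
    size = max(base.values()) + 27
    GOTO, CHECK, OUTPUT = [0] * size, [None] * size, [()] * size
    for s, row in rows.items():
        for ch, t in row.items():
            i = base[s] + code(ch)
            GOTO[i], CHECK[i], OUTPUT[i] = base[t], ch, tuple(out[t])
    return GOTO, CHECK, OUTPUT

def match(text, GOTO, CHECK, OUTPUT, root=1):
    p, hits = root, []
    for pos, ch in enumerate(text):
        i = p + code(ch)
        if CHECK[i] != ch:               # undefined here: use the root's row
            i = root + code(ch)
        hits += [(pos, key) for key in OUTPUT[i]]
        p = GOTO[i]
    return hits

tables = pack(*build_sparse(["ab", "bc", "bd"]))
print(match("cabcz", *tables))
```

For the keys {ab, bc, bd} this first-fit packing happens to place the row reached by "a" at base 26 and the row reached by "ab" at base 29, the same values the walkthrough above uses. The matcher performs exactly one GOTO lookup per input symbol, at the cost of one CHECK comparison.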
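Next, the method of creating the state transition array used for matching is described. FIG. 12 shows a first configuration of the compression device of FIG. 2. In FIG. 12, the binary tree conversion unit 131 and the transition addition unit 132 correspond to the sparse-array finite state machine creation unit 104, and the conversion unit 133 corresponds to the state transition machine compression unit 105.
【００７７】
The keyword input unit 103 accepts the specified keyword group. The binary tree conversion unit 131 converts the accepted keyword group into a binary tree structure, following each key from left to right. The transition addition unit 132 adds to the resulting binary tree the transitions, taken on a failed match, to states other than the initial state and its next states. The conversion unit 133 converts the binary tree output by the transition addition unit 132 into compressed array form.
【００７８】
The procedure for creating the array for matching consists of the following steps: creating a binary tree from the input keyword group; adding fail destination nodes to the nodes of the resulting binary tree; adding gotoin/gotoout destination node lists, which define those failure transitions that can be taken directly; and converting the final binary tree into array form.
【００７９】
FIG. 13 is a flowchart of the binary tree creation process. When processing starts, the binary tree conversion unit 131 first creates a binary tree for the input keys (step ST21). It then attaches to each node that accepts a key the corresponding input key as an output symbol string (step ST22), and processing ends.
【００８０】
For example, creating a binary tree for the keyword group {ab, bc, bd} of the state transition diagram of FIG. 7 gives FIG. 14. In FIG. 14, each rectangular box is one node, and the character label attached to each node corresponds to a symbol appearing in the keywords. A horizontal pointer points to a node at the same depth in the binary tree, and a downward pointer points to a deeper node. Here the output symbol strings "ab", "bc", and "bd" are attached as outputs to nodes "4", "5", and "6", respectively.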
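【００８１】
FIG. 15 is a flowchart of the process that assigns failure transition destinations to the nodes of the created binary tree. When processing starts, the transition addition unit 132 first initializes a node queue Q that holds the nodes to be processed (step ST31).
【００８２】
Next, the nodes reachable from the root node of the binary tree (the node that is the next node of the root, and the nodes at the same depth as that node) are put into the queue Q (step ST32).
【００８３】
Next, the fail destination of each node in the queue is set to the root node (step ST33), and a check is made whether Q is empty (step ST34). If Q is empty, processing ends. If not, one node is taken from Q and assigned to r (step ST35), and the node is removed from Q (step ST36).
【００８４】
Next, a node queue J is initialized (step ST37), and the nodes reachable from r (the node that is the next node of r, and the nodes at the same depth as that node) are put into J (step ST38). A check is then made whether J is empty (step ST39).
【００８５】
If J is empty, the processing from step ST34 is repeated. If J is not empty, one node is taken from J and assigned to s (step ST40), and the node is removed from J (step ST41). Next, s is put into Q (step ST42), the fail destination of node r is assigned to t (step ST43), and a check is made whether a transition from node t is defined for the character label attached to node s (step ST44).
【００８６】
If no such transition is defined, t is set to the fail destination of t (step ST45) and the check of step ST44 is made again, looping until the result is YES. In the end the loop can always be left at the initial state.
【００８７】
If a transition is defined in step ST44, the destination reached from t by the label of s becomes the fail destination of s (step ST46). The output strings of the fail destination of s are then added to the output of s (step ST47), and the processing from step ST39 is repeated.
For example, computing the failure transitions of the binary tree of FIG. 14 and attaching them to its nodes gives FIG. 16. The failure computation in this case is described following the flow of FIG. 15.
【００８８】
First, the nodes "2" and "3" reachable from the root node "1" are put into the queue Q (step ST32), and their fail destinations are set to the root node. The subsequent processing is as shown in FIG. 17. In FIG. 17, a notation such as goto(1, b) denotes the transition from node "1" defined for the symbol "b". This procedure is the same as the construction of the failure function in the ordinary AC method.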
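【００８９】
FIG. 18 is a flowchart of the process that, for the failure transitions just added, adds transitions to states other than the initial state and its next states to the nodes of the binary tree as gotoin/gotoout node lists. When processing starts, the transition addition unit 132 first initializes the node queue Q (step ST51).
【００９０】
Next, the nodes reachable from the root node of the binary tree are put into the queue Q (step ST52), and a check is made whether Q is empty (step ST53). If Q is empty, processing ends.
【００９１】
If Q is not empty, one node is taken from Q and assigned to r (step ST54), and the node is removed from Q (step ST55). Next, all possible input symbols are put into a queue X (step ST56), and a check is made whether X is empty (step ST57). If X is empty, the processing from step ST53 is repeated.
【００９２】
When 8-bit codes are used, for example, the symbol codes put into X in step ST56 are the 256 codes from 0 to 255.
If X is not empty, one character is taken from X and assigned to s (step ST58), and the character is removed from X (step ST59). A check is then made whether a transition from r to a next state is possible with the symbol s (step ST60). If it is, the destination of r for the symbol s is added to Q (step ST65), and the processing from step ST57 is repeated.
【００９３】
If no transition is possible in step ST60, a check is made whether the fail destination of node r is the root node (step ST61). If it is the root node, the processing from step ST57 is simply repeated.
【００９４】
If the fail destination of node r is not the root node in step ST61, a check is made whether a transition by the symbol s is possible from the fail destination of r (step ST62). If not, the processing from step ST57 is repeated. If so, the node reachable by the symbol s from the fail destination of r is added to the gotoout of r (step ST63), the node r is added to the gotoin of that node (step ST64), and the processing from step ST57 is repeated.
【００９５】
Now let the node r being processed be A, let B be the fail destination of node A for the symbol s, and let C be the set of nodes reachable from node B. The loop of steps ST57 to ST65 in FIG. 18 first removes from C the nodes directly reachable from A by the symbol s. It then attaches the list of nodes in C to node A as gotoout, and attaches node A to each node in C as gotoin.
【００９６】
Computing gotoout and gotoin for the binary tree of FIG. 16 gives FIG. 19. The binary tree of FIG. 19 is the finite state machine in intermediate structure that the transition addition unit 132 finally outputs.
【００９７】
In the computation of the gotoout and gotoin lists above, a transition can actually be defined only when the fail destination is a node other than the initial state. In FIG. 16 the only node that clearly satisfies this condition is node "4", so the processing for this node is described.
【００９８】
In FIG. 16, the fail destination of node "4" is node "3". From node "3", the character "c" leads to node "5" and the character "d" leads to node "6". The gotoout of node "4" is therefore nodes "5" and "6", and the gotoin of nodes "5" and "6" is in both cases node "4". In FIG. 19 these gotoout and gotoin entries are drawn as labeled nodes.
【００９９】
In a conventional DFA, gotoout is defined for every node of the binary tree, but in the present invention gotoout is defined only for nodes that have a failure transition to a node other than the root. This requires gotoin to be defined in addition, but because the failure transitions of the many nodes whose fail destination is the root are eliminated, it contributes to reducing storage.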
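The gotoout/gotoin computation of FIG. 18 can be illustrated with a compact sketch. As an assumption made purely for brevity, each node is identified here by the key prefix it represents (the root is the empty string) instead of by a node number, and the failure destination is computed as the longest proper suffix that is itself a prefix of some key. The function name `goto_lists` is hypothetical.

```python
from collections import defaultdict

def goto_lists(keys):
    """Return (gotoout, gotoin) for a keyword trie whose nodes are prefixes."""
    nodes = {key[:i] for key in keys for i in range(len(key) + 1)}

    def fail(node):
        # Longest proper suffix of `node` that is also a trie node ("" = root).
        return next(node[i:] for i in range(1, len(node) + 1) if node[i:] in nodes)

    gotoout, gotoin = defaultdict(dict), defaultdict(list)
    for r in sorted(nodes):
        if not r:
            continue                     # the root itself needs no list
        f = fail(r)
        if not f:
            continue                     # fail destination is the root: store nothing
        for n in sorted(nodes):
            is_child_of_f = len(n) == len(f) + 1 and n.startswith(f)
            if is_child_of_f and r + n[-1] not in nodes:
                gotoout[r][n[-1]] = n    # r reaches n directly on symbol n[-1]
                gotoin[n].append(r)      # n remembers who jumps into it
    return dict(gotoout), dict(gotoin)

out_lists, in_lists = goto_lists(["ab", "bc", "bd"])
print(out_lists)
print(in_lists)
```

For {ab, bc, bd} only the node for "ab" (node "4" in FIG. 16) receives a gotoout list, pointing to the nodes for "bc" and "bd", which matches the description above.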
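【０１００】
FIGS. 20 and 21 are flowcharts of the process that converts the binary tree output by the transition addition unit 132 into a state transition machine in compressed array form. When processing starts in FIG. 20, the conversion unit 133 first initializes the arrays GOTO, CHECK, and OUTPUT to 0 (step ST71), and initializes the member index of every binary tree node to 0 (step ST72).
【０１０１】
This index is kept separately from the index of the state transition array in order to record the correspondence between each node of the binary tree and an index of a state transition array like that of FIG. 8.
【０１０２】
Next, those possible input symbols for which no transition from the root node to another node exists are put into a queue R (step ST73), and a check is made whether R is empty (step ST74). If R is not empty, one character is taken from R and assigned to s (step ST75), and it is removed from R (step ST76).
【０１０３】
Next, GOTO[1 + 's] is set to 1 (step ST77) and CHECK[1 + 's] is set to the character label of s (step ST78), and the processing from step ST74 is repeated. 's is the internal code of s within the array, but the character code itself may be used instead.
【０１０４】
When R becomes empty, the queue Q is initialized (step ST79), and Pn = the root node, Cn = the next node of the root node, and Pp = 1 are set (step ST80).
【０１０５】
Here, the next node of the root node means the node with the smallest character label among the nodes (node sequence) reachable from the root node. For a binary tree like that of FIG. 19, the node assigned to Cn is the node pointed to by the downward pointer of the root node.
【０１０６】
Next, the triple [Pn, Cn, Pp] is added to Q (step ST81), and a check is made whether Q is empty (step ST82). If Q is empty, processing ends.
【０１０７】
If Q is not empty, the triple at the head of Q is taken out and assigned to s (step ST83), and the triple is removed from Q (step ST84). Next, a position on the arrays GOTO, CHECK, and OUTPUT is sought at which the nodes linked to the gotoout of node Pn in s and the nodes that follow node Cn in s at the same depth as Cn can all be inserted, and this position is assigned to point (step ST85).
【０１０８】
The point of a newly inserted node must not coincide with the point of a node already inserted. If the new insertable position is already in use as a point, it is shifted, for example by one, before being assigned to point.
【０１０９】
Next, GOTO[Pp of s] = point is set (step ST86), the nodes linked to the gotoout of Pn of s are put into a queue tmp (step ST87), and a check is made whether tmp is empty (step ST88).
【０１１０】
If tmp is not empty, one node is taken from tmp and assigned to i (step ST89), and the node is removed from tmp (step ST90). A check is then made whether GOTO[index of i] is 0 (step ST91).
【０１１１】
The index of binary tree node i holds either the index of the position on the state transition array at which node i is stored, or 0. If it is 0, node i has not yet been moved onto the state transition array.
【０１１２】
If GOTO[index of i] is 0, the processing of step ST88 is repeated. If it is not 0, the transition to the gotoout destination has already been moved onto the array, so the gotoout transition is mapped onto the array (step ST92), and the processing from step ST88 is repeated.
【０１１３】
In step ST92 the transition is mapped by setting GOTO[point + 'label of i] = GOTO[index of i] and OUTPUT[point + 'label of i] = output of i. Here 'label of i is the internal code of the character label of node i, and output of i is the output symbol string defined for node i. The output of node i is thus copied onto the array.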
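【０１１４】
When tmp becomes empty in step ST88, the nodes linked to the gotoin of Pn in s are put into tmp (FIG. 21, step ST93), and a check is made whether tmp is empty (step ST94).
【０１１５】
If tmp is not empty, one node is taken from tmp and assigned to i (step ST95), and i is removed from the queue (step ST96). A check is then made whether GOTO[index of i] is 0 (step ST97).
【０１１６】
If GOTO[index of i] is 0, the processing of step ST94 is repeated. If it is not 0, the gotoout transition is mapped onto the array by way of gotoin (step ST98), and the processing from step ST94 is repeated.
【０１１７】
In step ST98 the mapping is done by setting GOTO[GOTO[index of i] + 'label of i] = point. If an output is defined for node Pn in s, the output of Pn is also copied onto the array by setting OUTPUT[GOTO[index of i] + 'label of i] = output of Pn.
【０１１８】
When tmp becomes empty in step ST94, the character labels of the nodes linked to the gotoout of Pn and the character labels of the nodes linked to Cn at the same depth are put into a queue chtmp (step ST99). A check is then made whether chtmp is empty (step ST100).
【０１１９】
If chtmp is not empty, one character label is taken from chtmp and assigned to j (step ST101), and the label is removed from chtmp (step ST102). The label at the node's insertion position is then set by CHECK[point + 'j] = j (step ST103), and the processing from step ST100 is repeated.
【０１２０】
When chtmp becomes empty in step ST100, Cn in s and the nodes at the same depth are put into tmp (step ST104), and a check is made whether tmp is empty (step ST105).
【０１２１】
If tmp is not empty, a node is taken from tmp and assigned to i (step ST106), and the node is removed from the queue (step ST107). Next, the index of node i is set to the value (point + 'label of i) (step ST108), and a check is made whether an output is defined for node i (step ST109).
【０１２２】
If node i has no output, the processing from step ST105 is repeated. If it has an output, the output of node i is copied into the array OUTPUT (step ST110), and the processing from step ST105 is repeated. In step ST110, OUTPUT[point + 'label of i] = output of i is set.
【０１２３】
When tmp becomes empty in step ST105, the nodes at the same depth as Cn are put into tmp (step ST111), and a check is made whether tmp is empty (step ST112).
【０１２４】
If tmp is not empty, one node is taken from tmp and assigned to i (step ST113), and the node is removed from tmp (step ST114). A check is then made whether a transition from node i to a next state is possible for some symbol (step ST115).
【０１２５】
If no transition is possible, the processing from step ST112 is repeated. If a transition is possible, the triple with Pn = i, Cn = the first node among the destinations of i, and Pp = index of i is added to Q (step ST116), and the processing from step ST112 is repeated. Here, the first node among the destinations of i means the node with the smallest character label in the node sequence reachable from node i.
【０１２６】
When tmp becomes empty in step ST112, processing returns to step ST82 of FIG. 20 and continues from there.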
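Next, the procedure for converting the binary tree of FIG. 19 into an array form like that of FIG. 8 is described following the flow of FIGS. 20 and 21.
【０１２７】
First, because every transition is defined for the initial state, these are defined. This corresponds to the loop of steps ST74 to ST78 in FIG. 20, and yields an array form like that of FIG. 22.
【０１２８】
After this, the nodes of the binary tree are inserted into the array in order starting from the root node; this corresponds to the loop from step ST79 of FIG. 20 to step ST116 of FIG. 21. In this insertion process the conversion unit 133 pushes triples [Pn, Cn, Pp] for the binary tree nodes onto a queue and inserts them into the array one at a time.
【０１２９】
Insertion starts from the root node. In FIG. 20, a triple is pushed onto Q in steps ST79, ST80, and ST81 and taken out in steps ST83 and ST84. Here Pn = root node "1", Cn = node "2", and Pp = 1. Step ST85 searches for an insertable position; for the root node "1" this proceeds as follows.
【０１３０】
In this case the gotoout of Pn is empty. The nodes at the same depth as Cn are nodes "2" and "3", whose character labels are "a" and "b", respectively. Searching for a position in the array where this node sequence can be inserted amounts to searching the arrays GOTO, CHECK, and OUTPUT for a place matching a pattern like that of FIG. 23.
【０１３１】
In the pattern of FIG. 23, the upper row represents the pattern currently on the array CHECK and the lower row represents the pattern to be inserted. "0" indicates that the area is empty, and "*" indicates that the area may be either empty or occupied.
【０１３２】
Searching the array of FIG. 22 for a position where such a pattern can be inserted shows that the pattern of FIG. 23 can be overlaid as in FIG. 24, so the value of point, which represents the insertion position of the nodes, is 1. This position is the one immediately before the insertion position of the label "a" of node "2".
【０１３３】
In step ST86, GOTO[Pp] = point, so GOTO[1] = 1. This moves the transitions from node "1" of the binary tree to nodes "2" and "3" onto the array. Because GOTO[1] = 1 was already set for index = 1, which corresponds to the root node "1", nothing changes.
【０１３４】
The processing from step ST87 of FIG. 20 to step ST98 of FIG. 21 applies when node Pn has gotoout or gotoin lists, so it does not concern the root node.
【０１３５】
The processing of steps ST99 to ST103 in FIG. 21 inserts character labels into the array. It sets the character label of each node in the CHECK part of the array position reserved earlier. This yields an array form like that of FIG. 25, in which the underlined labels "a" and "b" are the inserted labels.
【０１３６】
The processing of steps ST104 to ST110 in FIG. 21 sets the array position in each binary tree node and copies the node outputs onto the array. As a result, the index of node "2" is set to 2 and the index of node "3" is set to 3.
【０１３７】
The loop of steps ST111 to ST116 pushes onto Q the nodes reachable from Pn. Here the nodes reachable from the root node are nodes "2" and "3", so the two triples [Pn = node "2", Cn = node "4", Pp = 2] and [Pn = node "3", Cn = node "5", Pp = 3] are pushed onto the queue. Processing then returns to step ST82 of FIG. 20.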
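【０１３８】
This time the same processing is performed with Pn = node "2", Cn = node "4", and Pp = 2, and the search for an insertable position finds point = 26. Performing steps ST85 of FIG. 20 to ST111 of FIG. 21 in the same way yields an array form like that of FIG. 26.
【０１３９】
In FIG. 26, the label "b" of node "4" is inserted at index = 28; with the conversion table of FIG. 9, the label "b" is converted to internal code 2. The value 26, obtained by subtracting 2 from 28, is therefore used as point. In step ST18 of the matching process of FIG. 10, this makes it possible to move from index = 26 to index = 28 on the input symbol "b" and to move on to the destination that element represents.
【０１４０】
The output "ab" of node "4" is copied into OUTPUT[28].
Next, the node following node "4" is node "7", so [Pn = node "4", Cn = node "7", Pp = 28] is pushed onto Q. Processing then returns to step ST82 of FIG. 20.
【０１４１】
The next triple taken from Q is [Pn = node "3", Cn = node "5", Pp = 3]. For this, an insertion place satisfying a pattern like that of FIG. 27 must be found. Aligning the label "c" of node "5" with index = 29 would give point = 26, since its internal code is 3. This value has already been used once as point, so to avoid duplication it is shifted by one and point = 27 is used.
【０１４２】
Performing steps ST85 of FIG. 20 to ST111 of FIG. 21 in the same way yields an array form like that of FIG. 28. In this case the output "bc" of node "5" is copied into OUTPUT[30] and the output "bd" of node "6" is copied into OUTPUT[31].
【０１４３】
Next, nodes "5" and "6" are followed by nodes "8" and "9", respectively. [Pn = node "5", Cn = node "8", Pp = 30] and [Pn = node "6", Cn = node "9", Pp = 31] are therefore pushed onto Q. Processing then returns to step ST82 of FIG. 20.
【０１４４】
This time the same processing is performed for [Pn = node "4", Cn = node "7", Pp = 28]. An insertable position is sought, but because node "4" has transitions by the characters "c" and "d" defined as gotoout, the insertable place is one that satisfies a pattern like that of FIG. 29. The value of point for this place is 29.
【０１４５】
Steps ST85 of FIG. 20 to ST111 of FIG. 21 are performed in the same way, but because gotoout is defined for node "4", which is Pn, the loop of steps ST87 to ST92 in FIG. 20 is entered. However, GOTO[index of node "5"] and GOTO[index of node "6"] are both still undefined, so step ST92 is not performed. The resulting array is as shown in FIG. 30.
【０１４６】
No node is reachable from node "7", so no further nodes are pushed onto Q and processing returns to step ST82 of FIG. 20.
Next, [Pn = node "5", Cn = node "8", Pp = 30] is taken from Q. The insertable place is one that satisfies a pattern like that of FIG. 31, and the value of point for this place is 5.
【０１４７】
Steps ST85 of FIG. 20 to ST111 of FIG. 21 are performed in the same way, but because node "4" is defined as the gotoin of node "5", which is Pn, the loop of steps ST93 to ST98 in FIG. 21 is entered.
【０１４８】
In the condition check of step ST97, GOTO[index of node "4"] = 29, so step ST98 sets GOTO[29 + 'c] = GOTO[32] = 5. Because "bc" is defined as the output of node "5", which is Pn, it is copied as well, giving OUTPUT[32] = bc. The result is an array like that of FIG. 32.
【０１４９】
Finally, processing node "9" in the same way as node "8" gives the result of FIG. 8.
In this example the smallest permissible value is used as point in order to raise the compression ratio of the state transition array, but a larger index may also be used as point.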
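【０１５０】
Next, an embodiment of a string replacement device, obtained by adding a string replacement function to the collation device of the first embodiment, is described. The example replaces the input keywords {ab, bc, bd} with {aaa, bbb, ccc}, respectively.
【０１５１】
Instead of outputting the keywords detected in the input text, this replacement function records the places where keywords were detected and replaces the keywords after the text has been processed.
【０１５２】
FIG. 33 shows the state transition array for string replacement for the input keys. In FIG. 33, the array GOTO stores the overlaid state transition table, the array CHECK stores the confirmation labels, the array SUBST stores pointers to the replacement strings, and the array LENGTH stores the length of the matched string before replacement.
【０１５３】
FIG. 34 shows the initial state of the text offset storage array used in the replacement process. This array stores the text offset giving the position in the text of each string to be replaced, the SUBST pointer to the corresponding replacement string, and the length of the string to be replaced.
【０１５４】
The text offsets of the symbols in the input text 'cabcz' are as shown in FIG. 35. After pattern matching on this input text, the text offset storage array is as shown in FIG. 36. The matching process is the same as in FIG. 10. As a result of matching, the positions in the text of the keywords "ab" and "bc" are stored as text offsets "1" and "2", respectively.
【０１５５】
In the replacement process, the string replacement device sets one pointer into the text and one into the text offset storage array. As long as the text pointer does not lie within the interval that starts at a stored text offset and has the length of the string to be replaced, the text character is output. When the text pointer is at the position of a text offset stored in the array, the corresponding replacement string is output.
【０１５６】
FIG. 37 is a flowchart of this replacement process. When processing starts, the string replacement device first sets the text pointer t to the head of the text (step ST121) and sets the replacement pointer p to the head of the text offset storage array (step ST122). It then checks whether t points to the end of the text (step ST123). If it does, processing ends.
【０１５７】
If t does not point to the end of the text, the text offset stored at the position pointed to by p is compared with t (step ST124). If they differ, the text character pointed to by t is output (step ST125), t is advanced by one character (step ST126), and the processing from step ST123 is repeated.
【０１５８】
If the two are equal in step ST124, the SUBST pointer stored at the position pointed to by p is fetched and the replacement string it points to is output (step ST127). Next, t is advanced by the length of the string to be replaced that is stored at the position pointed to by p (step ST128), and p is advanced by one (step ST129).
【０１５９】
A check is then made whether the value of t is greater than the text offset at the position pointed to by p and the position pointed to by p is not the end of the text offset storage array (step ST130). If the result is YES, the processing from step ST129 is repeated; if NO, the processing from step ST123 is repeated.
【０１６０】
As a result of this replacement process, the text 'cabcz' is converted into 'caaacz'.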
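The replacement pass of FIG. 37 can be sketched as follows. This is a slightly simplified illustration: the offset table is assumed to be a Python list of (text offset, replacement string, matched length) tuples sorted by offset, entries that overlap a replacement already made are skipped with a bounds-checked loop, and the name `apply_replacements` is hypothetical.

```python
def apply_replacements(text, table):
    """Rewrite `text` using (offset, replacement, length) entries sorted by offset.
    An entry that starts inside an already replaced span is skipped."""
    out, t, p = [], 0, 0                 # t: text pointer, p: table pointer
    while t < len(text):
        if p < len(table) and table[p][0] == t:
            _, subst, length = table[p]
            out.append(subst)            # emit the replacement string
            t += length                  # jump over the matched keyword
            p += 1
            while p < len(table) and table[p][0] < t:
                p += 1                   # drop entries overlapped by this match
        else:
            out.append(text[t])          # copy an ordinary character
            t += 1
    return "".join(out)

# "ab" found at offset 1 and "bc" at offset 2, as in the example above.
print(apply_replacements("cabcz", [(1, "aaa", 2), (2, "bbb", 2)]))
```

Because "bc" starts inside the span already consumed by "ab", only the first replacement is applied, which reproduces the result 'caaacz' stated above.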
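Next, a second embodiment, based on the FAST method, is described with reference to FIGS. 38 to 60.
【０１６１】
In the state transition table of the FAST method, it might seem that the data defining the operation of each state for each input symbol could easily be separated into default shifts and transitions, but this is not always so. In the table of FIG. 69, for example, the shift amount for the input symbol "a" in state "6" is -2 (magnitude 2), whereas the default shift amount for the other input symbols is -5 (magnitude 5). The former, smaller shift amount is determined during the computation of the FAST shift amounts. As it stands, therefore, the table cannot be separated into default shift amounts and transitions to next states and made compact.
【０１６２】
The second embodiment therefore first divides the operations of the FAST method on an input symbol into three kinds. The first is the default shift of a state for ordinary input symbols. The second is a shift for a specific input symbol. The third is a transition to a next state on an input symbol.
【０１６３】
In matching, it is first determined whether a transition or a shift is defined for the input symbol. If neither is defined, the pointer is shifted to the right according to the default shift. If the input symbol is accepted but a shift rather than a transition is defined, the pointer is shifted to the right according to that shift. If a transition is defined, the machine moves to the next state.
【０１６４】
These operations are represented in array form for each state, and the arrays are overlaid so that the parts in which a transition or shift is defined do not collide. At the same time, so that it can be checked whether an element belongs to a given array, each character for which a transition or shift is defined is attached as a label.
【０１６５】
When the compressed table form is created, going through an intermediate binary-tree structure reduces the storage actually used.
The configuration of the collation device in the second embodiment is the same as in FIG. 5. In this case the state transition determination unit 121 decides, for each input symbol, to which state to move or by how much to shift within the text.
【０１６６】
FIG. 38 shows the second state transition table, before compression, for the three input keywords {smart, east, state}. Compared with the table of FIG. 69, the typical shift amount used most often in each state is gathered into a single default shift (defaultshift), and only the specific shift amounts or transition destinations for specific input symbols are defined per input symbol.
【０１６７】
FIG. 39 illustrates the transitions for each input symbol defined in the state transition table of FIG. 38. The state transition diagram of FIG. 39 consists of 14 nodes, "0" to "13". The data a1, a2, a3, and a4 attached to each node are, respectively, the state number (node number), the magnitude of the default shift, a specific character with the magnitude of its specific shift, and the output string.
【０１６８】
FIG. 40 shows the array for matching obtained by compressing the state transition table of FIG. 38. In FIG. 40, index is the array subscript; GOTO is the overlaid state transition table stored in the compressed state transition unit 123; CHECK holds the confirmation labels stored in the confirmation label unit 124; and OUTPUT is the array of pointers stored in the output symbol unit 125. These pointers point to the output strings stored in the state transition unit 122.
【０１６９】
An element of the array GOTO that is less than 0 represents a shift amount; otherwise it represents a transition to a next state.
The states s0 to s13 written below the array in FIG. 40 show how states "0" to "13" of FIG. 38 overlap. The symbols a, e, and so on show where the data accessed for each input symbol of FIG. 38 is stored, and the symbol D shows where the data accessed for the default shift is stored.
【０１７０】
FIG. 41 shows the second character code conversion table used to explain the second embodiment. Character codes are converted into internal codes according to this table. In the table of FIG. 41, the internal code "1" is assigned in the first column so that the default shift code can be placed at the head of each overlaid array. The internal codes of the input symbols are therefore 2 or greater and are assigned from the second column onward.
【０１７１】
Alternatively, the character code value of the input symbol plus 1 may be used as the internal code.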
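The three kinds of operation (default shift, specific shift, transition) and the sign convention of GOTO can be illustrated with a single lookup step. This is only a sketch under assumptions: the internal codes (a = 2 to z = 27, with code 1 reserved for the default shift), the array contents, and the name `fast_step` are all hypothetical and are not the values of FIGS. 40 and 41.

```python
DEFAULT = 1                              # internal code reserved for the default shift

def code(ch):
    """Hypothetical internal code: a=2 ... z=27, leaving 1 for the default."""
    return ord(ch) - ord("a") + 2

def fast_step(state, ch, GOTO, CHECK):
    """Decide the operation for symbol `ch` in the state whose row starts at `state`."""
    i = state + code(ch)
    if CHECK[i] != ch:                   # nothing defined: use the default shift
        return ("shift", -GOTO[state + DEFAULT])
    value = GOTO[i]
    if value < 0:                        # negative element: specific shift amount
        return ("shift", -value)
    return ("goto", value)               # otherwise: base of the next state

# Toy arrays for a single state at base 1 (values chosen for illustration only).
SIZE = 40
GOTO, CHECK = [0] * SIZE, [None] * SIZE
GOTO[1 + DEFAULT] = -4                   # default shift of magnitude 4
GOTO[1 + code("e")], CHECK[1 + code("e")] = 3, "e"    # transition on "e"
GOTO[1 + code("r")], CHECK[1 + code("r")] = -1, "r"   # specific shift on "r"

for ch in "erx":
    print(ch, fast_step(1, ch, GOTO, CHECK))
```

A single CHECK comparison decides between the default shift and the stored element, and the sign of the stored element then decides between a specific shift and a transition.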
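Next, the matching process of the second embodiment is described with reference to FIGS. 42 and 43.
【０１７２】
FIG. 42 is a flowchart of the string matching process based on the FAST method. When processing starts, the state transition determination unit 121 first sets the text pointer, which points into the input text, to the head of the text plus the minimum key length, and sets the transition pointer, which points into the state transition array in the state transition unit 122, to the initial state (step ST131). It then checks whether the text pointer points to the end of the text (step ST132). If it does, matching ends.
【０１７３】
If the text pointer does not point to the end of the text, the character it points to is fetched (step ST133), the internal code of that character is added to the value of the transition pointer, and a check is made whether the character label stored at the position indexed by the sum is the same as this character (step ST134).
【０１７４】
If the characters do not match, neither a transition nor a shift is defined for the input character. The text pointer is therefore advanced by the default shift of that state (step ST135), the transition pointer is set to the initial state (step ST136), and the processing from step ST132 is repeated.
【０１７５】
If the label and the input character are the same in step ST134, a check is made whether a shift is defined for the input character (step ST137). If the input character calls for a shift, the text pointer is advanced by the magnitude of the shift amount defined for that character in the current state (step ST138), the transition pointer is set to the initial state (step ST139), and the processing from step ST132 is repeated.
【０１７６】
If a transition rather than a shift is defined for the input character in step ST137, a check is made whether an output string is defined for the state (step ST140).
【０１７７】
If no output string is defined, the internal code of the input character is added to the current transition pointer, and the transition destination defined at the position indexed by the sum becomes the new transition pointer (step ST142). The text pointer is then moved back by one character (step ST143), and the processing from step ST132 is repeated.
【０１７８】
If an output string is defined, it is output as a matching result (step ST141), and the processing from step ST142 is performed. In step ST141, the current value of the text pointer may also be output as the position of the matched string.
【０１７９】
FIG. 43 shows the matching operation on the input text 'aaseastaterrr' using the state transition array of FIG. 40. First, the text pointer is set to the position that lies the minimum keyword length 4 from the head of the text, and the transition pointer is set to the index value "1", which corresponds to the initial state.
【０１８０】
For the first input symbol "e", CHECK[1 + 'e] = e and GOTO[1 + 'e] = 3 > 0, so the transition pointer is set to 3 (step ST142) and the machine moves to the next state.
【０１８１】
Matching then proceeds in the same way, as shown in FIG. 43. In each step the array CHECK is first accessed to confirm whether a transition destination or shift amount is defined for the input symbol, and if no transition is defined, the pointer is shifted by the default shift.
【０１８２】
If a shift smaller than the default shift is defined for a specific input symbol, the pointer is shifted by that amount. In FIG. 43 such a specific shift occurs when the symbol "r" is input in the last initial state "1". For the symbol "r", the text pointer is shifted to the right by the shift amount 1, which is smaller than the default shift amount 4; this moves it to the end of the text, so processing ends.
【０１８３】
The symbol strings "east" and "state" contained in the input text are thus output as the matching result.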
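Next, the method of creating the state transition array used for matching is described. FIG. 44 shows a second configuration of the compression device of FIG. 2. In FIG. 44, the binary tree conversion unit 141, the preprocessing unit 142, and the shift amount calculation unit 143 correspond to the sparse-array finite state machine creation unit 104, and the conversion unit 144 corresponds to the state transition machine compression unit 105.
【０１８４】
The keyword input unit 103 accepts the specified keyword group. The binary tree conversion unit 141 converts the accepted keyword group into a binary tree structure, following each key from right to left, that is, in the reverse of its symbol order. The preprocessing unit 142 sets the length of the shortest input keyword, the depth of each node, and the terminal node for each keyword. The shift amount calculation unit 143 calculates the shift amounts for each state. The conversion unit 144 converts the binary tree structure output by the shift amount calculation unit 143 into compressed array form.
【０１８５】
The procedure for creating the array for matching consists of the following steps: creating a binary tree from the input keyword group; adding node depths to the nodes of the resulting binary tree; setting the last node for each keyword; calculating shift amounts and creating the binary tree in intermediate structure; and converting the intermediate structure into array form.
【０１８６】
FIG. 45 is a flowchart of the binary tree creation process. When processing starts, the binary tree conversion unit 141 first creates a binary tree for the input keys in the reverse order of the symbols of each key (step ST151). It then attaches to each node that accepts a key the corresponding input key as an output symbol string (step ST152), and processing ends.
【０１８７】
FIG. 46 is a flowchart of the preprocessing for calculating shift amounts on the binary tree. When processing starts, the preprocessing unit 142 first sets for each node its distance from the root node (its depth) (step ST161). It then finds the shortest input keyword length (step ST162) and the last node for each keyword (step ST163), and processing ends.
【０１８８】
For example, creating a binary tree for the keyword group {smart, east, state} of the state transition diagram of FIG. 39 and preprocessing it yields a binary tree like that of FIG. 47. In FIG. 47 the downward and horizontal pointers have the same meaning as in FIG. 14.
【０１８９】
The data d attached to each node is the depth of the node, and output is the output string. Here the output symbol strings "state", "east", and "smart" are attached as outputs to nodes "5", "9", and "13", respectively.
【０１９０】
The shortest pattern length of these output symbol strings (keywords) is set to 4, and nodes "5", "9", and "13" are set as the terminal nodes for the keywords "state", "east", and "smart", respectively.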
【０１９１】
図４８、図４９、および図５０は、２進木に対してシフト量を求める計算処理のフローチャートである。
図４８は、各ノードに、特定文字によるシフトとｆａｉｌｕｒｅ遷移を設定する処理を示している。図４８において処理が開始されると、シフト量計算部１４３は、まず処理対象となるノードを格納するノードキューＱを初期化する（ステップＳＴ１７１）。次に、２進木のルートノードから遷移可能なノードをキューＱに入れる（ステップＳＴ１７２）。
【０１９２】
次に、キューに入っているノードのｆａｉｌ先をルートノードに設定し（ステップＳＴ１７３）、Ｑが空かどうかの判定を行う（ステップＳＴ１７４）。Ｑが空ならば処理は終了する。Ｑが空でなければ、次にキューＱから１つノードを取り出し、これをｒに設定し（ステップＳＴ１７５）、取り出したノードをキューＱから除く（ステップＳＴ１７６）。
【０１９３】
次に、ノードキューＪを初期化し（ステップＳＴ１７７）、ｒから遷移可能なノードをキューＪに入れる（ステップＳＴ１７８）。そして、Ｊが空かどうかを判定する（ステップＳＴ１７９）。
【０１９４】
Ｊが空であるならばステップＳＴ１７４以降の処理を繰り返す。Ｊが空でないならば、次にノードキューＪより１つノードを取り出し、これをｓにセットして（ステップＳＴ１８０）、取り出したノードをキューＪより除く（ステップＳＴ１８１）。次に、ｓをキューＱに入れて（ステップＳＴ１８２）、ノードｒのｆａｉｌ先をｔにセットし（ステップＳＴ１８３）、ノードｓについている文字ラベルによるノードｔからの遷移が定義されているかどうかの判定を行う（ステップＳＴ１８４）。
【０１９５】
そのような遷移が定義されていれば、次に、ｔからｓのラベルで遷移する先をｓのｆａｉｌ先とし（ステップＳＴ１８８）、ステップＳＴ１７９以降の処理を繰り返す。
【０１９６】
ステップＳＴ１８４で遷移が定義されていなければ、次に、ノードｔにおいてｓのラベルでのシフト量が定義されていないか、もしくは、既に定義されているシフト量が現在のｓの深さより大きいかどうかを判定する（ステップＳＴ１８５）。
【０１９７】
シフト量が未定義、もしくは定義されている値が現在のｓの深さより大きければ、次に、ノードｔに対するノードｓのラベルでのシフト量をｓの深さとする（ステップＳＴ１８６）。このような特定のラベルに対するシフト量は、２進木の各ノードに付加されるｓｐｅｃｉｆｉｃリストとして表される。次に、ｔにｔのｆａｉｌ先をセットし（ステップＳＴ１８７）、ステップＳＴ１８４以降の処理を繰り返す。
【０１９８】
ステップＳＴ１８５で、現在のｓの深さ以下のシフト量がｓのラベルに対して定義されていれば、ステップＳＴ１８７以降の処理を行う。
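FIG. 49 is a flowchart of the process of attaching to each node the default shift amount, which is the maximum shift amount. When the process starts in FIG. 49, the shift amount calculation unit 143 sets, as the default shift amount of each node, the value obtained by adding the shortest key length to the node depth (step ST191), and the process ends.
[0199]
For example, in the case of the binary tree in FIG. 47, the shift amount calculation unit 143 first puts nodes “1” and “6”, which are reachable from the root node “0”, into queue Q by the processing of step ST172 in FIG. 48. Next, by the processing of step ST173, it sets the failure transition destinations of nodes “1” and “6” to the root node.
[0200]
Next, node “1” is taken from queue Q and set as r, and node “2” is placed in node queue J by the processing of steps ST175 to ST178. Since node “2” is the only node placed in J, the process passes through the loop of steps ST179 to ST183 only once.
[0201]
Here, s is set to node “2” by step ST180, and t is set to the fail destination of node r (node “1”), that is, the root node, by step ST183.
[0202]
Since a transition from the root node by the label of s (the label “t” of node “2”) is possible, the process proceeds to step ST188, and the fail destination of s becomes node “6”, which is reachable from the root node by the label “t” of node “2”.
[0203]
The next node taken from queue Q is node “6”. The same processing is performed for node “7”, which is reachable from node “6”. In this case, r = node “6”, s = node “7”, and t = node “1” (the fail destination of node “6”). However, no transition from t by the label “s” of node “7” is possible.
[0204]
Therefore, at node “1”, the depth of node “6”, that is, 1, is set as the specific shift amount for the input symbol “s”. The symbol “s” and its corresponding shift amount “1” are represented in a list structure as a specific list.
[0205]
Repeating this processing in the same way assigns failure transitions and provisional shift amounts to all nodes. After that, the default shift amount, which is the maximum shift amount at each node, is assigned to each node according to the process of FIG. 49. The binary tree shown in FIG. 51 is thus obtained.
In FIG. 51, the broken-line arrows represent failure transitions, and the fail destination of any node for which no failure transition is shown is the root node “0”. The data D attached to each node represents the default shift amount. Furthermore, the specific lists connected to nodes “0” and “6” hold specific character labels and their corresponding specific shift amounts.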
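[0206]
FIG. 50 is a flowchart of the process of reducing the shift amount of each node as necessary and assigning the final shift amounts. When the process starts in FIG. 50, the shift amount calculation unit 143 first pushes the input keywords onto queue Q (step ST201) and determines whether queue Q is empty (step ST202).
[0207]
If queue Q is not empty, the keyword popped from queue Q is put into j (step ST203). Next, the last node corresponding to keyword j is set as jst (step ST204), the fail destination of jst is set as bst (step ST205), and jlen is set to the depth of jst (step ST206). It is then determined whether bst is the root node (step ST207). If bst is the root node, the processing from step ST202 onward is repeated.
[0208]
If bst is not the root node, the depth of bst is subtracted from jlen and the result is set as slen (step ST208). Next, bst is put into queue N (step ST209), and it is determined whether queue N is empty (step ST210). If queue N is empty, bst is set to the fail destination of bst (step ST211), and the processing from step ST207 onward is repeated.
[0209]
If queue N is not empty, the node popped from queue N is put into r (step ST212), and it is determined whether the default shift of r is greater than (slen + depth of r) (step ST213). If the default shift of r is less than or equal to (slen + depth of r), the processing from step ST210 onward is repeated.
[0210]
If the default shift of r > (slen + depth of r), the default shift of r is set to slen + depth of r (step ST214), the nodes reachable from r are pushed onto queue N (step ST215), and the processing from step ST210 onward is repeated.
[0211]
If queue Q is empty in step ST202, the loop from step ST216 to ST220 is performed. When the shift for a specific character defined at a node is larger than the default shift, this processing trims it to the same size as the default shift.
[0212]
Here, all nodes are first put into queue Q (step ST216), and it is determined whether queue Q is empty (step ST217). If queue Q is empty, the process ends.
[0213]
If queue Q is not empty, the node popped from queue Q is put into j (step ST218), and it is determined whether the shift for a specific character defined at j is larger than the default shift (step ST219).
[0214]
If the result is NO, the processing from step ST217 onward is repeated. If the result is YES, the shift for that specific character is set equal to the default shift (step ST220), and the processing from step ST217 onward is repeated.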
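[0215]
Shaping the shift amounts set in the binary tree of FIG. 51 according to the process of FIG. 50 proceeds as follows. First, by steps ST201 to ST203, the collation keyword set {smart, east, state} is pushed onto queue Q, and the keyword popped from the queue is set as j.
[0216]
Suppose this keyword is “state”. Then, by steps ST204 to ST206, jst = node “5”, bst = node “7”, and jlen = 5. Since bst is not the root node, the process moves to step ST208, where slen = jlen − depth of bst = 5 − 2 = 3. In step ST209, node “7” is put into queue N, and in step ST212 it is popped from queue N and set as r.
[0217]
In the condition check of step ST213, the default shift of r (the default shift of node “7”) is 6, which is greater than (slen + depth of r) = 5. The process therefore moves to step ST214, and the default shift of r becomes slen + depth of r = 3 + 2 = 5.
[0218]
After that, node “8” is pushed onto queue N by step ST215 and the same processing is performed. The branches below node “8” are thereby processed in order, and the default shift of each node is changed.
[0219]
In this way, for a node that is the fail destination of a keyword's terminal node, and for the deeper nodes connected to it, steps ST213 and ST214 keep the default shift at or below (slen + depth of r).
[0220]
When the processing has finally finished for all nodes, the binary tree shown in FIG. 52 is obtained. In FIG. 52, the default shifts of nodes “7” and “1”, which are the fail destinations of the keyword terminal nodes “5” and “9”, are each smaller by 1 than in FIG. 51. The default shifts of the nodes connected below nodes “7” and “1” are also each smaller by 1.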
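The default-shift calculation of FIG. 49 and the reduction loop of FIG. 50 (steps ST203 to ST215) can be sketched as follows. This is a hedged illustration, not the embodiment itself: it assumes a dictionary-based trie rather than the binary tree, computes failure links with the usual breadth-first procedure, and leaves out the specific lists of FIG. 48 and the trimming loop of steps ST216 to ST220. The function names `build` and `default_shifts` are hypothetical.

```python
from collections import deque


def build(keys):
    """Reversed-key trie with depths, terminal nodes and failure links."""
    children, depth, terminal = [{}], [0], {}
    for key in keys:
        node = 0
        for symbol in reversed(key):
            if symbol not in children[node]:
                children.append({})
                depth.append(depth[node] + 1)
                children[node][symbol] = len(children) - 1
            node = children[node][symbol]
        terminal[key] = node
    fail = [0] * len(children)
    queue = deque(children[0].values())       # depth-1 nodes fail to the root
    while queue:
        r = queue.popleft()
        for label, s in children[r].items():
            queue.append(s)
            t = fail[r]
            while t != 0 and label not in children[t]:
                t = fail[t]
            fail[s] = children[t].get(label, 0)
    return children, depth, terminal, fail


def default_shifts(keys):
    children, depth, terminal, fail = build(keys)
    shortest = min(len(key) for key in keys)
    shift = [d + shortest for d in depth]      # FIG. 49: depth + shortest key length
    for key in keys:                           # FIG. 50, steps ST203 to ST215
        jst = terminal[key]
        jlen, bst = depth[jst], fail[jst]
        while bst != 0:
            slen = jlen - depth[bst]
            pending = deque([bst])
            while pending:
                r = pending.popleft()
                if shift[r] > slen + depth[r]:
                    shift[r] = slen + depth[r]
                    pending.extend(children[r].values())
            bst = fail[bst]
    return shift, fail


shift, fail = default_shifts(["state", "east", "smart"])
print(fail[5], fail[9], shift[7], shift[1])
```

For the keyword set {state, east, smart}, the sketch gives nodes “7” and “1” as the fail destinations of nodes “5” and “9”, and lowers their default shifts from 6 and 5 to 5 and 4, which agrees with the change described between FIG. 51 and FIG. 52.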
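[0221]
FIGS. 53 and 54 are flowcharts of the process of converting the binary tree output by the shift amount calculation unit 143 into a state transition machine in compressed array format. When the process starts in FIG. 53, the conversion unit 144 first initializes the arrays GOTO, CHECK, and OUTPUT to 0 (step ST221) and initializes the member index of each binary tree node to 0 (step ST222).
[0222]
This index is provided separately from the index of the state transition array in order to record the correspondence between each node of the binary tree and the index of the state transition array shown in FIG. 40.
[0223]
Next, queue Q is initialized (step ST223), and Pn = root node, Cn = next node of the root node, and Pp = 1 are set (step ST224).
Here, the next node of the root node means the node with the smallest character label among the nodes (node sequence) reachable from the root node. In a binary tree such as that of FIG. 52, the node put into Cn is the node pointed to by the downward pointer from the root node.
[0224]
Next, the triple [Pn, Cn, Pp] is added to queue Q (step ST225), and it is determined whether queue Q is empty (step ST226). If queue Q is empty, the process ends.
[0225]
If queue Q is not empty, the triple at the head of queue Q is popped and set as s (step ST227). Then a position on the arrays GOTO, CHECK, and OUTPUT is found at which the nodes connected to the specific list of node Pn in s, the nodes of the same depth as Cn that are linked to node Cn in s, and the default shift amount for s can all be inserted, and this position is set as point (step ST228).
[0226]
At this time, the point position of a newly inserted node must not coincide with the point position of any node already inserted. If the new insertable position is already in use as a point, it is, for example, shifted by one and then set as point.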
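The search for an insertable position in step ST228, together with the rule that no two nodes may share the same point, can be illustrated with a first-fit sketch. The code below is an assumption-laden simplification: rows are plain dictionaries from an internal symbol code (1 or greater) to a stored value, GOTO and CHECK are modelled as dictionaries instead of fixed-size arrays, CHECK holds the internal code as the confirmation label, and the names `pack_rows` and `lookup` are hypothetical.

```python
def pack_rows(rows):
    """Overlay sparse rows into shared GOTO/CHECK storage (first fit)."""
    goto, check, bases, used_points = {}, {}, [], set()
    for row in rows:
        point = 1
        # Advance until the point is unused and every slot the row needs is free.
        while point in used_points or any(point + code in check for code in row):
            point += 1
        used_points.add(point)
        bases.append(point)
        for code, value in row.items():
            goto[point + code] = value
            check[point + code] = code   # confirmation label for this slot
    return bases, goto, check


def lookup(bases, goto, check, state, code):
    """Return the stored value, or None when the slot belongs to another row."""
    slot = bases[state] + code
    return goto[slot] if check.get(slot) == code else None


rows = [{1: "a", 3: "b"}, {1: "c"}, {2: "d", 3: "e"}]
bases, goto, check = pack_rows(rows)
print(bases, lookup(bases, goto, check, 1, 1), lookup(bases, goto, check, 1, 3))
```

Keeping the points distinct matters because the confirmation label alone cannot tell two rows apart if they start at the same position; a failed label check is what lets the collation step fall back to the default action.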
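[0227]
Next, GOTO[Pp of s] = point is set (step ST229), and CHECK[point + 1] = 1 is set (step ST230). The default shift of node Pn in s is multiplied by −1 to make it negative and stored in GOTO[point + 1] (step ST231). The default shift value is thereby set at the position index = point + 1 in the array.
[0228]
Next, the nodes linked to the specific list of Pn in s are put into queue tmp (step ST232), and it is determined whether queue tmp is empty (step ST233). If queue tmp is not empty, the node popped from queue tmp is set as j (step ST234).
[0229]
Then, as the shift amount for the character label of node j, the corresponding shift multiplied by −1 is set in GOTO[point + ’label of j] (step ST235). Here, “’label” is the internal code of that character label on the array, and ’label ≥ 2 when the conversion table of FIG. 41 is used.
[0230]
Next, the character label of j is set in CHECK[point + ’label of j] as the confirmation label (step ST236), and the processing from step ST233 onward is repeated.
[0231]
When queue tmp becomes empty in step ST233, Cn in s and the nodes of the same depth are put into queue tmp (FIG. 54, step ST237), and it is determined whether queue tmp is empty (step ST238).
[0232]
If queue tmp is not empty, the node popped from queue tmp is set as i (step ST239), the value (point + ’label of i) is set in the index of node i (step ST240), and it is checked whether an output is defined for node i (step ST241).
[0233]
If node i has an output, the output of node i is copied to the array OUTPUT (step ST242); that is, OUTPUT[point + ’label of i] = output of i. Next, CHECK[point + ’label of i] = label of i is set (step ST243), and the processing from step ST238 onward is repeated. If node i has no output, the processing from step ST243 onward is performed.
[0234]
When tmp becomes empty in step ST238, the nodes of the same depth as Cn are put into queue tmp (step ST244), and it is determined whether tmp is empty (step ST245).
[0235]
If queue tmp is not empty, the node popped from queue tmp is set as i (step ST246), and it is determined whether a transition from node i to a next state is possible by some symbol (step ST247). If no transition is possible, the processing from step ST245 onward is repeated.
[0236]
If a transition is possible, Pn = i, Cn = the head node of the transition destinations of i, and Pp = index of i are set, and this triple is added to queue Q (step ST248); the processing from step ST245 onward is then repeated. Here, the head node of the transition destinations of i means the node with the smallest character label in the sequence of nodes reachable from node i.
[0237]
When tmp becomes empty in step ST245, the process returns to step ST226 in FIG. 53, and the subsequent processing is repeated.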
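Next, the procedure for converting the binary tree of FIG. 52 into an array format like that of FIG. 40 is described following the flow of FIGS. 53 and 54. In this conversion, the relationships between a preceding node and the nodes reachable from it by transition or by shift are mapped from the binary tree into the array format.
[0238]
FIG. 55 summarizes the transitions by input symbol, the specific shifts by input symbol, and the default shifts for each node in FIG. 52. In FIG. 55, for example, “e:1” in the first row, second column indicates that the transition destination of node “0” for input symbol “e” is node “1”, and “a:2” in the first row, fourth column indicates that the shift for input symbol “a” at node “0” is 2.
[0239]
In the processing from step ST227 in FIG. 53 to step ST248 in FIG. 54, the conversion unit 144 searches the array for a free location where the next transition/shift symbols for the preceding state can be defined, inserts those data, and maps the transition relationships. This data insertion method is basically the same as the method described in the first embodiment.
[0240]
The mapping operation is explained here using the root node “0” as an example. The data pattern to be inserted for the root node is as shown in FIG. 56. FIG. 56 shows that, for the root node, there are six input symbols for which a transition or the like other than the default shift can be defined: {a, e, m, r, s, t}.
[0241]
Initially all elements of the state transition array are empty, so when the array is searched for a location where the pattern of FIG. 56 can be inserted, the smallest possible array index is “1”. Accordingly, by steps ST229 to ST236 in FIG. 53, point is set to “1”, the character labels are written, and the default shift and the specific shift amounts are set, so that the pattern shown in row s0 of FIG. 40 is mapped. At this stage, the portions corresponding to the labels “e” and “t” remain undefined.
[0242]
The portions for labels “e” and “t” correspond to the next nodes of the root node “0”, and are therefore set when those nodes are inserted into the array by steps ST237 to ST248 in FIG. 54. The next node processed after the root node is node “1”.
[0243]
Since “t” is the only character by which a transition from node “1” is possible, the insertion location of node “1” is a location matching a pattern like that of FIG. 57. In FIG. 57, the upper row represents the current pattern on the array CHECK, and the lower row represents the pattern to be inserted.
[0244]
When the array is searched for a location where such a pattern can be inserted, the smallest array index that can be set as point is “3”. First, to define node “1” as the transition destination from node “0”, this point value 3 is set in GOTO[7], which defines that the data for the destination reached from node “0” by symbol “e” begins at the position index = 3.
[0245]
Node “1” is then inserted at the position index = 3, and the character labels, default shift amount, and so on are set by steps ST229 to ST236. As a result, the pattern shown in row s1 of FIG. 40 is mapped.
[0246]
By performing this processing in the order in which nodes are placed in the queue, that is, in the order of nodes “0”, “1”, “6”, “2”, “7”, “10”, “3”, “8”, “11”, “4”, “9”, “12”, “5”, “13”, the state transition array shown in FIG. 40 is finally obtained.
[0247]
Nodes “14”, “15”, and “16” in FIG. 52 are also placed in the queue and processed in the same way, and, strictly speaking, transitions by the terminal symbol “#” are defined. However, since the terminal symbol is not included in the output character strings, the processing of these nodes is omitted here, and the transitions to these nodes are also omitted in FIG. 40. It is also possible to leave the transitions to the terminal nodes undefined by setting the label value of the terminal symbol in the array to 0.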
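[0248]
Next, an embodiment of a character string replacement device, obtained by adding a character string replacement function to the collation device of the second embodiment, is described. Here, an example is shown in which the input keywords {smart, east, state} are replaced with {SMART, EAST, STATE}, respectively.
[0249]
FIG. 58 shows the state transition array for character string replacement for the input keys. In FIG. 58, the array GOTO stores the overlaid state transition table, the array CHECK stores the confirmation labels, the array SUBST stores pointers to the replacement character strings, and the array LENGTH stores the lengths of the character strings before replacement. The initial state of the text offset storage array used for the replacement processing is the same as in FIG. 34.
[0250]
The text offsets for the symbols in the input text ‘aaseast’ are as shown in FIG. 59. After pattern matching has been performed on this input text, the text offset storage array is as shown in FIG. 60. The collation processing here is the same as in FIG. 42. As a result of the collation, the position of the keyword “east” in the text is stored as the text offset “3”.
[0251]
The replacement processing is performed in the same way as in FIG. 37. As a result, the text ‘aaseast’ is converted to ‘aasEAST’.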
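The replacement step can be pictured with the following small sketch. It is not the procedure of FIG. 37, which is not reproduced in this section: it assumes that each hit is available as a triple of text offset, length before replacement, and replacement string, that offsets count from 0, and that a hit overlapping an earlier one is simply skipped. The name `apply_substitutions` is hypothetical.

```python
def apply_substitutions(text, hits):
    """hits: (offset, length, replacement) triples with 0-based offsets."""
    pieces, cursor = [], 0
    for offset, length, replacement in sorted(hits):
        if offset < cursor:          # assumed policy: skip overlapping hits
            continue
        pieces.append(text[cursor:offset])
        pieces.append(replacement)
        cursor = offset + length
    pieces.append(text[cursor:])
    return "".join(pieces)


print(apply_substitutions("aaseast", [(3, 4, "EAST")]))
```

With the offset 3 and length 4 recorded for “east”, the sketch turns ‘aaseast’ into ‘aasEAST’, the result stated above.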
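In the first and second embodiments described above, a state transition array obtained by compressing a table-structured DFA is basically used, so collation can be completed with the same number of transition operations as the number of input symbols, and the high speed of the DFA is maintained. At the same time, the storage capacity required for the state transition array is far smaller than that of a conventional state transition table.
[0252]
The examples above take English text as input, but the present invention can be applied to text written in a 2-byte character code, such as Japanese, by the following two methods.
[0253]
In the first method, a state transition array is first created for the keyword group one byte at a time. When a pattern of one of the keywords is detected in the text, the process goes back to the line-feed symbol preceding that pattern. It then checks, within the sentence starting at the position after that line-feed symbol, whether the detected pattern is misaligned by one byte.
[0254]
In the second method, a state transition array is created with each input 2-byte character code treated as one unit, and the collation processing is performed.
In text that mixes 1-byte characters such as ASCII (American Standard Code for Information Interchange) with 2-byte characters such as Japanese EUC (Extended UNIX Code), a shift in the second embodiment may cause a 2-byte character to be collated with a 1-byte misalignment.
[0255]
For such text, therefore, when a pattern is accepted it is necessary to go back to the beginning of the sentence containing the pattern and, using that position as the reference, check whether the detected pattern is misaligned.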
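The misalignment check can be sketched as below. This is a simplified illustration under explicit assumptions: the text is EUC-JP, every byte of 0x80 or above is treated as the first byte of a 2-byte character (3-byte sequences introduced by 0x8F are ignored), and the function name `is_aligned` is hypothetical.

```python
def is_aligned(data: bytes, start: int, pos: int) -> bool:
    """Walk character by character from `start`; True if `pos` is a character boundary."""
    i = start
    while i < pos:
        i += 2 if data[i] >= 0x80 else 1   # assumed: high byte starts a 2-byte character
    return i == pos


data = "aあい".encode("euc_jp")   # one 1-byte character followed by two 2-byte characters
print([is_aligned(data, 0, p) for p in range(len(data))])
```

A byte-level hit whose start position fails this check would be discarded as a match that straddles two characters.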
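[0256]
[Examples]
The following describes the results of comparing the storage capacity and search speed during character string search between conventional collation devices and the collation device of the present invention.
[0257]
First, the case of keyword search based on the AC method is compared. FIG. 61 shows the change in memory usage for the conventional method and the method of the present invention. In FIG. 61, the vertical axis of the graph represents memory usage and the horizontal axis represents the number of keywords. “AC+ours” corresponds to the collation device of the first embodiment of the present invention, which is based on the DFA-converted AC method, and “DFAofAC” corresponds to a collation device using a conventional state transition table.
[0258]
FIG. 61 shows that, as the number of keywords increases, memory usage increases sharply with the conventional method but not nearly as much with the method of the first embodiment of the present invention.
[0259]
FIG. 62 shows the change in the search speed of each collation device when a keyword search is performed on 75 Mbytes of text. In FIG. 62, the vertical axis represents the time required for the search (seconds) and the horizontal axis represents the number of keywords. For comparison, the search speed of the conventional AC method is added as “AC”.
[0260]
FIG. 62 shows that, as the number of keywords increases, the search speed of the conventional AC method decreases, whereas the conventional DFA-converted AC method and the method of the first embodiment keep the search speed high. Moreover, the difference in search time between “AC+ours” and “DFAofAC” is slight.
[0261]
Next, the case of keyword search based on the FAST method is compared. The storage areas required by the collation device of the second embodiment of the present invention are the binary tree area, the list area for shift amounts for specific characters, and the area of the compressed state transition array. The storage area required by a conventional FAST method collation device is a pointer area whose size is determined by the number of possible input symbols and the number of states.
[0262]
FIG. 63 shows the change in memory usage for the conventional method and the method of the present invention. In FIG. 63, the vertical axis of the graph represents memory usage and the horizontal axis represents the number of keywords. “NORMAL” corresponds to a collation device using the conventional FAST method, and “COMPRESS” corresponds to the collation device of the second embodiment of the present invention.
[0263]
FIG. 63 shows that, as the number of keywords increases, memory usage increases sharply with the conventional method but hardly increases with the method of the second embodiment of the present invention.
[0264]
[Effects of the invention]
According to the present invention, a collation table that is efficient in both speed and storage capacity can be created for a given keyword group. Using that table, multiple specified keywords can be collated and searched for efficiently in electronic documents.
[0265]
Such character string collation also makes word processor functions such as character string search and character string replacement more efficient. It can likewise make more efficient functions such as character string search of a full-text search index in a full-text search device and character string replacement such as bracketing at the display stage, as well as character string search in a database and character string replacement at the display stage.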
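[Brief description of the drawings]
FIG. 1 is a principle diagram of the collation device of the present invention.
FIG. 2 is a configuration diagram of the collation system.
FIG. 3 is a flowchart of the operation of the collation system.
FIG. 4 is a configuration diagram of the information processing apparatus.
FIG. 5 is a configuration diagram of the collation device.
FIG. 6 is a diagram showing the first state transition table before compression.
FIG. 7 is a diagram showing the first state transitions.
FIG. 8 is a diagram showing the compressed first state transition table.
FIG. 9 is a diagram showing the first character code conversion table.
FIG. 10 is a flowchart of the first collation process.
FIG. 11 is a diagram showing the first collation operation.
FIG. 12 is a configuration diagram of the first compression device.
FIG. 13 is a flowchart of the first binary tree creation process.
FIG. 14 is a diagram showing the first binary tree.
FIG. 15 is a flowchart of the process of adding failure to the binary tree.
FIG. 16 is a diagram showing the first binary tree with failure attached.
FIG. 17 is a diagram showing an example of failure calculation.
FIG. 18 is a flowchart of the process of adding gotoin and gotoout to the binary tree.
FIG. 19 is a diagram showing the first binary tree with gotoin and gotoout attached.
FIG. 20 is a flowchart (part 1) of the first conversion process.
FIG. 21 is a flowchart (part 2) of the first conversion process.
FIG. 22 is a diagram showing the first array.
FIG. 23 is a diagram showing the first pattern.
FIG. 24 is a diagram showing the second array.
FIG. 25 is a diagram showing the third array.
FIG. 26 is a diagram showing the fourth array.
FIG. 27 is a diagram showing the second pattern.
FIG. 28 is a diagram showing the fifth array.
FIG. 29 is a diagram showing the third pattern.
FIG. 30 is a diagram showing the sixth array.
FIG. 31 is a diagram showing the fourth pattern.
FIG. 32 is a diagram showing the seventh array.
FIG. 33 is a diagram showing the first state transition table for character string replacement.
FIG. 34 is a diagram showing the text offset storage array.
FIG. 35 is a diagram showing the text offsets of the first text.
FIG. 36 is a diagram showing the first text offset storage array after pattern matching.
FIG. 37 is a flowchart of the replacement process.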
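FIG. 38 is a diagram showing the second state transition table before compression.
FIG. 39 is a diagram showing the second state transitions.
FIG. 40 is a diagram showing the compressed second state transition table.
FIG. 41 is a diagram showing the second character code conversion table.
FIG. 42 is a flowchart of the second collation process.
FIG. 43 is a diagram showing the second collation operation.
FIG. 44 is a configuration diagram of the second compression device.
FIG. 45 is a flowchart of the second binary tree creation process.
FIG. 46 is a flowchart of the preprocessing for shift calculation.
FIG. 47 is a diagram showing the second binary tree.
FIG. 48 is a flowchart of the first shift amount calculation process.
FIG. 49 is a flowchart of the second shift amount calculation process.
FIG. 50 is a flowchart of the third shift amount calculation process.
FIG. 51 is a diagram showing the second binary tree with failure and shifts attached.
FIG. 52 is a diagram showing the final second binary tree.
FIG. 53 is a flowchart (part 1) of the second conversion process.
FIG. 54 is a flowchart (part 2) of the second conversion process.
FIG. 55 is a diagram showing the information on the binary tree.
FIG. 56 is a diagram showing the fifth pattern.
FIG. 57 is a diagram showing the sixth pattern.
FIG. 58 is a diagram showing the second state transition table for character string replacement.
FIG. 59 is a diagram showing the text offsets of the second text.
FIG. 60 is a diagram showing the second text offset storage array after pattern matching.
FIG. 61 is a diagram showing the change in memory usage in the AC method.
FIG. 62 is a diagram showing the change in speed in the AC method.
FIG. 63 is a diagram showing the change in memory usage in the FAST method.
FIG. 64 is a diagram showing an example of the pattern matching machine of the AC method.
FIG. 65 is a diagram showing an operation example of the AC method.
FIG. 66 is a diagram showing the DFA of the AC method.
FIG. 67 is a diagram showing an operation example of the DFA of the AC method.
FIG. 68 is a diagram showing an example of the pattern matching machine of the FAST method.
FIG. 69 is a diagram showing the transitions and shifts of the FAST method.
FIG. 70 is a diagram showing an example of collation by the FAST method.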
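[Explanation of reference numerals]
1 State transition storage means
2 Collation means
101 Compression device
102 Collation device
103 Keyword input unit
104 Sparse array finite state machine creation unit
105 State transition machine compression unit
106 Collation state transition machine unit
107 Text input unit
111 CPU
112 Memory
113 Input device
114 Output device
115 External storage device
116 Medium drive device
117 Network connection device
118 Bus
119 Portable recording medium
120 Information provider's device
121 State transition determination unit
122 State transition unit
123 Compressed state transition unit
124 Confirmation label unit
125 Output symbol unit
131, 141 Binary tree conversion unit
132 Transition addition unit
133, 144 Conversion unit
142 Preprocessing unit
143 Shift amount calculation unit
[0001]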
BACKGROUND OF THE INVENTION
The present invention relates to a collation apparatus and method for collectively judging whether or not at least one symbol string exists in given text data in a character string search apparatus or the like.
[0002]
[Prior art]
Today, in a document processing apparatus such as a word processor, it is required to collectively determine whether or not a set of a plurality of symbol strings exists as a search term in a text. Here, the symbol string means a sequence of characters and other symbols, and the character string is also a kind of symbol string. Such a determination function is often called multiple character string matching or multiple character string search.
[0003]
Efficient conventional multi-string search methods include the AC (Aho-Corasick) method proposed by Aho et al. (A. V. Aho and M. J. Corasick: “Efficient String Matching: An Aid to Bibliographic Search”, CACM Vol. 18 No. 6, 1975), a method that constructs a deterministic finite automaton (DFA) from it, and the FAST (Flying Algorithm for Searching Terms) method proposed by Uratani (“A Fast String Matching Algorithm FAST”, IPSJ Journal Vol. 30 No. 9, 1989; Japanese Laid-open Patent Publication No. 64-74619; Japanese Patent Application No. 62-231091).
[0004]
In the following, the AC method and the multi-string collation algorithm obtained by converting the AC method into a DFA are described first, followed by the multi-string collation algorithm of the FAST method.
[0005]
The AC method is a method of matching character strings by configuring a finite state machine called a PMM (Pattern Matching Machine) for an input key set.
The collation operation in the AC method is as follows. First, the state number is set to “1” as the initial setting. Next, symbols are read from the input text one at a time, and each input symbol determines the state to which the machine moves from the current state. If no transition on the input symbol is defined for the current state, the collation is regarded as having failed and the machine moves to the fail destination of the current state. If no transition on this input symbol is defined for the fail destination either, the move to the fail destination is repeated.
[0006]
Since transitions are defined for all symbols in the initial state “1”, the transition due to the failure stops in the initial state at the worst. Thus, the transition is repeated for the input symbol of the text. If a symbol string to be accepted for the state is defined, the symbol string and its position in the text are output.
[0007]
FIG. 64 shows the PMM of the AC method for the three symbol strings {ab, bc, bd} used as search keys. The PMM in FIG. 64 consists of six states “1”, “2”, “3”, “4”, “5”, and “6”; the solid arrows indicate normal transition destinations and the broken arrows indicate fail destinations. “^a, b” represents an input symbol other than a and b. In states “4”, “5”, and “6” (s4, s5, s6), the symbol strings “ab”, “bc”, and “bd” are defined as output keywords, respectively.
[0008]
The operation of this PMM for the input symbol string ‘cabcz’ is as shown in FIG. 65. The initial state is “1”. First, when the symbol “c” is input, it counts as an input symbol other than a and b, so the next state is the same state “1” and no output is generated. Next, when the symbol “a” is input, the machine moves to state “2”, and when the symbol “b” is input, it moves to state “4”. Here, the symbol string “ab” defined in state “4” is output.
[0009]
However, since the transition destination is not defined in the state “4”, when the symbol “c” is input next, the state transitions to the state “3” of the fail destination, where the transition destination is searched. Then, since the state “5” is defined as the transition destination by the symbol “c”, the state transitions to that state, and the symbol string “bc” is output. Next, when the symbol “z” is input, the state transits to “1”, and the operation ends.
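The walk-through above can be reproduced with a short sketch of AC-style collation. It is an illustration only, under these assumptions: states are numbered from 0 rather than from “1” as in FIG. 64, the goto function is a list of dictionaries, the reported positions are 0-based start offsets, and the names `build_pmm` and `search` are hypothetical.

```python
from collections import deque


def build_pmm(keys):
    """Goto function, fail destinations and outputs for the given keys."""
    goto, out = [{}], [[]]
    for key in keys:
        state = 0
        for symbol in key:
            if symbol not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        out[state].append(key)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for symbol, s in goto[r].items():
            queue.append(s)
            t = fail[r]
            while t != 0 and symbol not in goto[t]:
                t = fail[t]
            fail[s] = goto[t].get(symbol, 0)
            out[s] = out[s] + out[fail[s]]
    return goto, fail, out


def search(text, keys):
    goto, fail, out = build_pmm(keys)
    state, hits = 0, []
    for pos, symbol in enumerate(text):
        while state != 0 and symbol not in goto[state]:
            state = fail[state]               # move to the fail destination
        state = goto[state].get(symbol, 0)    # the initial state accepts every symbol
        hits.extend((key, pos - len(key) + 1) for key in out[state])
    return hits


print(search("cabcz", ["ab", "bc", "bd"]))
```

For ‘cabcz’ the sketch reports “ab” and then “bc”, matching the order of outputs in the description; the inner `while` loop is where the extra fail transitions that slow the AC method occur.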
[0010]
Thus, in the AC method, each fail transition caused by an input symbol with no defined transition destination adds one to the number of transitions. As a result, up to just under 2n finite state machine transitions are performed for n input symbols. In general, as the number of keys increases, the probability that an input symbol matches the first character of some key increases, and fail transitions increase accordingly. The collation speed of the AC method therefore gradually decreases as the number of keys increases.
[0011]
The AC method slows down because of fail transitions taken where no transition destination is defined, whereas in a DFA the destination state is uniquely determined for every input symbol. Exactly n finite state machine transitions are therefore performed for n input symbols, and the collation speed is high. Aho et al. show a method for converting the state transition machine of the AC method into a DFA.
[0012]
FIG. 66 shows a finite state machine corresponding to the state transition machine of the AC method for the symbol string {ab, bc, bd}. In FIG. 66, “state” represents the current state, and “next” represents the transition destination state when the symbol described in “input” is input. The states s1, s2, s3, s4, s5, and s6 correspond to the states “1”, “2”, “3”, “4”, “5”, and “6”, respectively. For example, the notation “¬a, b” represents a symbol other than a and b.
[0013]
The operation of the finite state machine for the input symbol string ‘cabcz’ is as shown in FIG. 67. The initial state is “1”. In the state transitions shown in FIG. 67, there are no fail transitions like those in FIG. 65, and the number of state transitions equals 5, the number of symbols in the input symbol string ‘cabcz’.
[0014]
Also in the FAST method known as a high-speed collation method, as in the AC method, character strings are collated by configuring a PMM for the input key set.
The collating operation in the FAST method is as follows. First, the state number is set to “0” as an initial state. Also, the shortest key length in the input key set is set as the shortest key length, and the collation start position in the input text is set at a position separated from the head of the text by the shortest key length.
[0015]
Next, the symbols are read one character at a time from the collation start position toward the left of the text, and the state to be transitioned from the current state is determined by the input symbols. If the transition is not defined, the collation start position is shifted rightward by a specified amount corresponding to the input symbol, and collation is resumed.
[0016]
In this way, as long as state transition is possible with respect to the input symbol, the text is scanned from right to left to extract a character string pattern. If transition is impossible, the collation start position is shifted to the right in the text by the shift amount defined for the input symbol.
[0017]
FIG. 68 shows the PMM of the FAST method for the three symbol strings {state, east, smart} used as search keys. The PMM in FIG. 68 consists of the 14 states “0” to “13”; the solid arrows indicate transition destinations and the broken arrows indicate shift destinations.
[0018]
The state transition is defined in the reverse order of the sequence of symbols included in each search key, and “Depth” represents the depth of each state in the PMM. In the states “5”, “9”, and “13” (s5, s9, and s13), symbol strings “state”, “east”, and “smart” are defined as output keywords, respectively.
[0019]
FIG. 69 shows a table of transition destinations and shift amounts at the time of inputting symbols for the respective states of the PMM. In FIG. 69, the numbers in the first row represent state numbers, and the symbols in the first column represent input symbols. Here, “(Other)” represents an input symbol other than “a”, “e”, “m”, “r”, “s”, and “t”. In this table, the positive value element represents the state number of the transition destination by the corresponding input symbol, and the negative value element represents the shift amount by the corresponding input symbol.
[0020]
The operation of this PMM for the input symbol string ‘aaseastate’ is as shown in FIG. 70. The initial state is “0”. In this case, the shortest of the symbol strings “state”, “east”, and “smart” is “east”, whose length is 4, so the shortest key length is 4. Collation is therefore performed from right to left, starting from the position of the symbol “t” that lies the shortest key length 4 from the right end of the input symbol string.
[0021]
When collation fails, the shift amount defined for the input symbol is multiplied by −1 to obtain the magnitude of the shift amount, and the collation start position is shifted to the right by that amount. Then, the state number is set to “0” and the collation is resumed.
[0022]
When the symbol “t” is input in the initial state “0” indicated at the position of the symbol “t” in the input symbol string of FIG. 70, the state transitions to the state “6” according to the table of FIG. Next, when the symbol “s” is inputted, the state transits to the state “7”, and when the symbol “a” is inputted next, the state transits to the state “8”. Next, when the symbol “e” is input, the state transits to the state “9”, and the symbol string “east” defined in the state “9” is output.
[0023]
Next, when the symbol “s” is input, state “9” defines for this symbol not a transition destination but the shift amount −7, so the collation position is shifted 7 to the right. The state then returns to the initial state “0”, and collation is restarted with the shift destination, the position of the symbol “e”, as the new collation start position. Collation continues in the same manner, and the symbol string “state” is output when the machine reaches state “5”.
[0024]
The above-described multi-character string search process is used in each device such as a database, a word processor, and a full-text search device.
The full-text search device refers to a device that performs a character string search in order to confirm whether the results of a search using a full-text search index are correct. Here, a full-text search index means a search index that by itself does not necessarily return only correct answers for an input keyword, such as a signature file or an inverted file that does not hold the positions at which words appear in a document.
[0025]
For example, consider a case where a keyword 'John Smith' is searched for an English index. Since the index unit is usually a word between spaces, 'John Smith' becomes the same as 'John AND Smith'. However, when a document is searched under the search condition of 'John AND Smith', even if 'John' and 'Smith' appear apart from each other, they are included in the search result and an excessive result is obtained. In such a case, whether the result is correct can be confirmed by a character string search.
[0026]
[Problems to be solved by the invention]
The problem in the conventional character string matching method described above is the relationship between the speed of the portion corresponding to the state transition of the PMM and the storage capacity.
[0027]
In the AC method, it is possible to reduce the storage capacity by using a list structure to represent the state transition portion. However, in the list structure, pointers must be followed in order, and the access processing is slow, so the collating operation is even slower.
[0028]
The collation speed of the AC method converted to DFA is high, but a table structure as shown in FIG. 66 must be used to represent all state transitions defined for all input symbols. However, this places a great burden on the storage capacity.
[0029]
For example, the number of input symbols is 256 (8-bit code), the number of states is N, and the pointer is 4 bytes. In the tabular format, one state requires 256 pointers to the next state, one pointer to the fail destination, and one pointer to the output symbol string. For this reason, a storage capacity of N * (256 + 1 + 1) * 4 bytes is required.
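The storage estimate N * (256 + 1 + 1) * 4 bytes can be evaluated for a concrete number of states as follows. The function name `dense_table_bytes` and the sample value of 10,000 states are illustrative assumptions, not figures from the source.

```python
def dense_table_bytes(num_states, num_symbols=256, pointer_bytes=4):
    # Per state: one next-state pointer per input symbol,
    # plus one fail pointer and one output pointer.
    return num_states * (num_symbols + 1 + 1) * pointer_bytes


print(dense_table_bytes(10_000))
```

Under these assumptions, 10,000 states already require roughly 10 Mbytes for the table alone, which illustrates why the tabular format becomes impractical as the number of keys grows.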
[0030]
In general, as the number of search keys increases, the number of states N also increases. Therefore, when the number of keys is large, the necessary storage capacity becomes enormous. Therefore, it is not realistic to configure a character string matching device based on the DFA-converted AC method.
[0031]
Similarly, in the FAST method, since state transitions or shifts are defined for all input symbols, a table structure as shown in FIG. 69 must be used. Therefore, when a character string collation apparatus is configured based on the FAST method, a huge storage capacity is still required.
[0032]
An object of the present invention is to provide a multi-symbol string collation apparatus and method capable of performing high-speed collation with a realistic storage capacity.
[0033]
[Means for Solving the Problems]
FIG. 1 is a principle diagram of the collation device of the present invention. The collation device of FIG. 1 is a collation device in an information processing apparatus that takes given symbol strings as keys and uses a finite state machine to determine whether the keys exist in a file; it comprises state transition storage means 1 and collation means 2.
[0034]
The state transition storage means 1 stores, in a compressed array format, a sparse state transition table, that is, a state transition table that defines the collation operation for at least one key and from which data representing a predetermined operation have been removed.
[0035]
The collation means 2 performs the operation corresponding to each symbol in the file while referring to the sparse state transition table, and collates the symbol strings in the file with the one or more keys. In doing so, the collation means 2 checks whether an operation for the input symbol read from the file is defined in the sparse state transition table, and performs the predetermined operation when no operation is defined for that input symbol.
[0036]
For example, when a state transition table based on the DFA-converted AC method is created, at least one of the data representing a transition from the current state to the initial state and the data representing a transition from the current state to a state that follows the initial state is removed from the conventional state transition table, reducing the amount of information in the table.
[0037]
Further, when a state transition table based on the FAST method is created, data representing shift operations that move the collation position in the file back in the direction opposite to the collation direction are removed from the conventional state transition table as far as possible, reducing the amount of information in the table.
[0038]
As a result, the portions of the state transition table from which data have been removed become empty elements, producing a sparse state transition table in which elements are only sparsely present. By compressing such a sparse state transition table and storing it in an array, a compact finite state machine can be constructed and the storage capacity is greatly reduced.
[0039]
In addition, because the compressed state transition table is basically created from a DFA state transition table, it is necessary to check whether a transition is defined, but only one transition operation is needed for each symbol input from the file, so the high speed of the DFA is maintained.
[0040]
As a file to be collated, a file containing an arbitrary symbol string such as a document file described in text or a file obtained by converting voice data into a digital code can be used.
[0041]
For example, the state transition storage means 1 in FIG. 1 corresponds to the state transition unit 122 in FIG. 5 of the embodiment, and the collation means 2 corresponds to the state transition determination unit 121.
[0042]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 2 is a block diagram of a collation system based on the present invention. The collation system in FIG. 2 includes a compression device 101 and a collation device 102. The compression device 101 includes a keyword input unit 103, a sparse array finite state machine creation unit 104, and a state transition machine compression unit 105, and the collation device 102 includes a collation state transition machine unit 106 and a text input unit 107.
[0043]
The keyword input unit 103 accepts the input keyword group to be searched. The sparse array finite state machine creation unit 104 constructs, on a binary tree serving as an intermediate structure, a finite state machine for character string collation that is sparse in array format for the input keyword group. Here, a sparse array means an array in which most positions hold no element. The state transition machine compression unit 105 converts the intermediate structure created by the sparse array finite state machine creation unit 104 into a compressed array format with high collation speed.
[0044]
The collation state transition machine unit 106 collates the text input from the text input unit 107 against the keyword group using the finite state machine in compressed array format.
[0045]
FIG. 3 is a flowchart of the operation of the collation system of FIG. 2. When the process starts in FIG. 3, the keyword input unit 103 first receives the keyword group to be searched (step ST1).
[0046]
Next, the sparse array finite state machine creation unit 104 constructs, on a binary tree, a character string collating machine of intermediate structure that has collation speed of the same order as a DFA for the input keywords and is sparse in array format (step ST2). The character string collating machine on the binary tree has a structure in which elements are easy to add and insert, although its collation speed is low.
[0047]
Next, the state transition machine compression unit 105 converts the character string collating machine on the binary tree into a compressed array format with high collation speed (step ST3). In doing so, it attaches a confirmation character label to each position in an array where an element exists, and overlays the arrays so that the occupied positions of the different arrays do not coincide, producing a single array.
[0048]
The collation state transition machine unit 106 then collates the input text using the array-format finite state machine.
In this way, a finite state machine is constructed that has collation speed of the same order as the DFA-converted AC method or the array-implemented FAST method for the input keyword group to be searched, yet is sparse in array format and can be compressed. This yields a character string collating device that has high collation speed and a reduced storage capacity.
[0049]
The construction method is summarized as follows.
A character string collating machine is first constructed as an intermediate structure on a binary tree, where elements are easy to insert and add and storage capacity is not a problem, and this intermediate structure is then converted into a compressed array format that is fast to collate and compact. In doing so, each element is given a label for confirming it, and the arrays are compressed by overlaying them so that their elements do not coincide.
[0050]
FIG. 4 is a configuration diagram of an information processing apparatus (computer) that realizes the collation system. The information processing apparatus of FIG. 4 includes a CPU (central processing unit) 111, a memory 112, an input device 113, an output device 114, an external storage device 115, a medium drive device 116, and a network connection device 117, which are connected to one another by a bus 118.
[0051]
The CPU 111 executes a program stored in the memory 112 to realize each process of the compression apparatus 101 and the collation apparatus 102. In addition to the above-described program, the memory 112 stores data used for processing. As the memory 112, for example, a ROM (read only memory), a RAM (random access memory), or the like is used.
[0052]
The input device 113 corresponds to, for example, a keyboard, a pointing device, and the like, and is used for inputting requests and instructions from the user. The output device 114 corresponds to a display device, a printer, or the like, and is used to output a state transition machine, a collation result, and the like.
[0053]
The external storage device 115 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, or the like. The above-described program and data can be stored in the external storage device 115, and loaded into the memory 112 for use as necessary. The external storage device 115 is also used as a database for storing keywords and text.
[0054]
The medium driving device 116 drives the portable recording medium 119 and accesses the stored contents. As the portable recording medium 119, an arbitrary computer-readable recording medium such as a memory card, a floppy disk, a CD-ROM (compact disk read only memory), an optical disk, a magneto-optical disk, or the like is used. The above-described program and data are stored in the portable recording medium 119 and can be loaded into the memory 112 and used as necessary.
[0055]
The network connection device 117 is connected to an arbitrary communication network such as a local area network (LAN) and performs data conversion associated with communication. The collation system communicates with an external information provider device 120 (such as a database) via the network connection device 117. As a result, the above-described program and data can be received from the apparatus 120 via the network and used by loading them into the memory 112 as necessary.
[0056]
Next, a first embodiment based on the DFA AC method will be described with reference to FIGS.
In the first embodiment, in the table structure of the AC method converted to a DFA, the transitions to the initial state and the transitions to the next states of the initial state are left undefined, and the state transition machine is redefined so that, when collation against the remaining transitions fails, the transition is taken from the initial state.
[0057]
As described above, if the transition to the initial state is made automatically whenever no transition is defined, the transitions to the initial state need not be stored. In addition, since a transition to a next state of the initial state can be made via the initial state, these transitions need not be stored either. By omitting these transition definitions, a large number of elements can be deleted from the DFA state transition table, and a sparse state transition table is obtained.
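The effect of omitting these transitions can be illustrated with a minimal sketch in Python. It is not the implementation of the embodiment: the dictionary-based tables, the function names build_dfa, sparsify and step, the 0-based state numbering (state 0 plays the role of the initial state) and the 26-letter alphabet are all assumptions made for the example.

```python
from collections import deque
import string

def build_dfa(keywords, alphabet):
    """Build a complete AC-style DFA as a list of {symbol: next_state} rows (state 0 = initial)."""
    goto = [{}]                                   # keyword trie
    for kw in keywords:
        s = 0
        for ch in kw:
            if ch not in goto[s]:
                goto.append({})
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
    fail = [0] * len(goto)
    dfa = [dict() for _ in goto]
    for ch in alphabet:                           # the initial state defines every symbol
        dfa[0][ch] = goto[0].get(ch, 0)
    queue = deque(goto[0].values())
    while queue:                                  # breadth-first, so dfa[fail[r]] is already complete
        r = queue.popleft()
        for ch in alphabet:
            if ch in goto[r]:
                s = goto[r][ch]
                fail[s] = dfa[fail[r]][ch]
                dfa[r][ch] = s
                queue.append(s)
            else:
                dfa[r][ch] = dfa[fail[r]][ch]
    return dfa

def sparsify(dfa):
    """Drop, from every non-initial state, transitions to the initial state or to its next states."""
    near_initial = set(dfa[0].values())           # the initial state itself and its next states
    sparse = [dict(dfa[0])]
    for row in dfa[1:]:
        sparse.append({ch: t for ch, t in row.items() if t not in near_initial})
    return sparse

def step(sparse, state, ch):
    """Use the stored transition if there is one, otherwise transition from the initial state."""
    row = sparse[state]
    return row[ch] if ch in row else sparse[0][ch]
```

For the keywords {ab, bc, bd} over the 26 lowercase letters, the complete table built this way holds 6 × 26 = 156 entries, while the sparse table keeps 31 (26 for the initial state and 5 for the other states), and step returns the same next state as the complete table for every state and symbol.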
[0058]
Next, this state transition machine is put into array form, and the arrays of the individual states are overlaid so that the positions where transitions are defined do not coincide. At the same time, so that it can be checked whether an element belongs to a given array, the character for which the transition is defined is stored as a label at the position of that transition.
[0059]
Also, when creating a compressed state transition table, the storage capacity actually used is saved by going through the intermediate form of the binary tree.
FIG. 5 is a configuration diagram of the collation apparatus 102. In FIG. 5, the state transition determination unit 121 and the state transition unit 122 correspond to the collation state transition machine unit 106. The text input unit 107 extracts symbols one character at a time from the target text, and the state transition determination unit 121 determines which state to transition to for each input symbol.
[0060]
The state transition unit 122 corresponds to the memory 112, for example, and includes a compressed state transition unit 123, a confirmation label unit 124, and an output symbol unit 125. The compressed state transition unit 123 stores the state transition table in a compressed array format, and the confirmation label unit 124 stores the labels that the compression makes necessary for confirming whether or not a transition is defined. The output symbol unit 125 defines the symbol string that is output when a certain state is reached.
[0061]
FIG. 6 shows a first state transition table before compression when three symbol strings {ab, bc, bd} are used as input keywords. In comparison with the table of FIG. 66, the transitions to the initial state s1 and the transitions to the next states s2 and s3 of the initial state s1 have been removed from the transitions of the states s2, s3, s4, s5, and s6.
[0062]
FIG. 7 shows the transitions for each input symbol defined in this state transition table. In FIG. 7, “˜a, b” represents input symbols other than a and b.
FIG. 8 shows the collating array obtained by compressing this state transition table. In FIG. 8, index represents an array subscript, GOTO represents the overlaid state transition table stored in the compressed state transition unit 123, and CHECK represents the confirmation labels stored in the confirmation label unit 124. OUTPUT represents an array of pointers stored in the output symbol unit 125. These pointers point to output character strings stored in the state transition unit 122.
[0063]
The element “#” in the array CHECK represents a terminal symbol. In addition, the states s1 to s6 shown below the array in FIG. 8 indicate how the states of FIG. 6 are overlaid, and the symbols a, b, and so on mark the storage locations of the transition destinations for the corresponding input symbols.
[0064]
FIG. 9 shows a first character code conversion table used for explaining the first embodiment. According to this character code conversion table, the character code is converted into an internal code. Here, for the sake of simplicity, the range of the character code is from a to z of the alphabet, but any other symbols such as numbers can be used.
[0065]
Next, the matching process in the first embodiment will be described with reference to FIGS. 10 and 11.
FIG. 10 is a flowchart of character string matching processing based on the AC method. When processing is started in FIG. 10, the state transition determination unit 121 first sets the text pointer, which indicates a position in the input text, to the head of the text, and sets the transition pointer, which indicates a position in the state transition array in the state transition unit 122, to the initial state (step ST11). Next, it is checked whether or not the text pointer points to the end of the text (step ST12). The collation ends when the text pointer points to the end of the text.
[0066]
If the text pointer does not point to the end of the text, the character pointed to by the pointer is taken out (step ST13), the value of the internal code corresponding to the character is added to the value of the transition pointer, and it is checked whether the character label stored at the position whose index is the sum is the same as this character (step ST14).
[0067]
If these characters do not match, no transition is defined for the input character. Therefore, the internal code of the input character is added to the pointer to the initial state, and the transition destination defined at the position where the added value is index is set as a new transition pointer (step ST15).
[0068]
If the label and the input character are the same in step ST14, it is next checked whether an output character string is defined for the state (step ST16).
If the output character string is not defined, the internal code of the input character is added to the current transition pointer, and the transition destination defined at the position whose index is the sum is set as the new transition pointer (step ST18). Then, the text pointer is advanced by one character (step ST19), and the processes after step ST12 are repeated.
[0069]
If an output character string is defined, the character string is output as a collation result (step ST17), and the processes after step ST18 are performed. In step ST17, the current text pointer value can be output as the position of the collated character string.
[0070]
FIG. 11 shows the collation operation for the input text “cabcz” when the state transition array of FIG. 8 is used. In FIG. 11, 'c and the like represent the value of the internal code corresponding to the character c and the like. GOTO [x] represents the number of the transition destination defined at the position of index “x”, CHECK [x] represents the label stored at the position of index “x”, and OUTPUT [x] represents the output pointer stored at the position of index “x”. The collation in this case is performed as follows.
[0071]
In FIG. 8, the initial state corresponds to the position of index = 1. The first input symbol is “c” (step ST13). Since transition destinations for all input symbols are defined in the initial state, the check in step ST14 is omitted, as indicated by “*1” in FIG. 11, and the process proceeds to step ST16. Here, since no output is defined, the process of step ST17 is not performed.
Next, looking up the input symbol “c” in the conversion table of FIG. 9 gives 'c = 3. Therefore, the result of GOTO [1 + 3] = GOTO [4] is the transition destination to proceed to next. In FIG. 8, the transition destination defined at the position of index “4” is “1”, so GOTO [4] = 1. Therefore, the transition destination is again the initial state corresponding to index “1” (step ST18).
[0072]
Next, when the collation operation is performed on the input symbol “a” in the same manner, GOTO [1 + 'a] = 26, so the next state starts at index “26” (step ST18).
[0073]
Next, for the input symbol “b”, it is necessary to access the array CHECK to check whether or not a transition is defined (step ST14). When 'b = 2 is added to the value 26 of the current transition pointer, index “28” is obtained. Since the label stored at the position of index “28” is “b”, CHECK [26 + 'b] = b, and it can be seen that the transition for this input symbol is defined.
[0074]
At this time, since the output symbol string “ab” is defined in OUTPUT [26 + ’b], this is output (step ST17). The next transition destination is GOTO [26 + 'b] = 29.
[0075]
Next, the same processing is performed on the input symbol “c”, the symbol string “bc” is output, and a transition is made to the position of GOTO [29 + ’c] = 5.
Next, when the same processing is performed on the last input symbol “z”, CHECK [5 + 'z] is not z, so the transition from the state of index “5” fails, and the transition for the symbol “z” is taken from the initial state instead. As a result, the transition pointer becomes GOTO [1 + 'z] = 1 (step ST15), and since the text is finished, the collation operation ends.
[0076]
Thus, the symbol strings “ab” and “bc” included in the input text are output as the collation results.
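The walkthrough above can be replayed with a small Python sketch. It is an illustration under stated assumptions rather than the arrays of FIG. 8: the state names, the CODE table (a = 1 through z = 26, in line with 'b = 2 and 'c = 3 in the text), the array size of 64, the first-fit placement in compress, and the decision to test OUTPUT on the fallback path as well are all choices made for the example.

```python
import string

CODE = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}  # 'a' -> 1 ... 'z' -> 26
ROOT_BASE = 1

# Sparse transition rows for the keywords {ab, bc, bd}; the state names are illustrative.
ROWS = {
    "root": {ch: "root" for ch in string.ascii_lowercase},
    "a": {"b": "ab"},
    "b": {"c": "bc", "d": "bd"},
    "ab": {"c": "bc", "d": "bd"},   # copied from the failure destination "b"
    "bc": {},
    "bd": {},
}
ROWS["root"].update({"a": "a", "b": "b"})
OUT = {"ab": ["ab"], "bc": ["bc"], "bd": ["bd"]}

def compress(rows, out, size=64):
    """Overlay the rows in one array; CHECK records which symbol owns each cell."""
    goto, check, output = [0] * size, [None] * size, [None] * size
    base = {}
    for state, row in rows.items():      # pass 1: first-fit search for a distinct base per state
        b = ROOT_BASE
        while b in base.values() or any(check[b + CODE[c]] is not None for c in row):
            b += 1
        base[state] = b
        for c in row:
            check[b + CODE[c]] = c
    for state, row in rows.items():      # pass 2: store target bases and outputs
        for c, target in row.items():
            goto[base[state] + CODE[c]] = base[target]
            output[base[state] + CODE[c]] = out.get(target)
    return goto, check, output, base

def match(text, goto, check, output):
    """Return (position of last character, keyword) pairs found in text."""
    state, hits = ROOT_BASE, []
    for pos, ch in enumerate(text):
        idx = state + CODE[ch]
        if check[idx] != ch:             # no transition stored here: go via the initial state
            idx = ROOT_BASE + CODE[ch]
        if output[idx]:
            hits.extend((pos, kw) for kw in output[idx])
        state = goto[idx]
    return hits

GOTO, CHECK, OUTPUT, BASE = compress(ROWS, OUT)
print(match("cabcz", GOTO, CHECK, OUTPUT))
```

With this placement order the states reached by “a” and “ab” receive the bases 26 and 29, the same values as GOTO [1 + 'a] and GOTO [26 + 'b] in the walkthrough; the remaining positions are not claimed to reproduce FIG. 8. Giving every state a distinct base is what lets a single CHECK label decide whether a cell belongs to the current state.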
Next, a method for creating the state transition array used for collation will be described. FIG. 12 is a first block diagram of the compression apparatus 101. In FIG. 12, a binary tree conversion unit 131 and a transition addition unit 132 correspond to the sparse array finite state machine creation unit 104, and a conversion unit 133 corresponds to the state transition machine compression unit 105.
[0077]
The keyword input unit 103 receives the specified keyword group. The binary tree conversion unit 131 converts the accepted keyword group into a binary tree structure, reading each key from left to right. The transition adding unit 132 adds, to the created binary tree structure, the transitions that are taken when collation fails and that lead to states other than the initial state and its next states. The conversion unit 133 converts the binary tree structure output from the transition addition unit 132 into a compressed array format.
[0078]
The procedure for creating such a collating array consists of four steps: creating a binary tree from the input keyword group; adding a fail destination node to each node of the resulting binary tree; adding “gotoin / gotoout” destination node lists, which define the direct transitions contained in the failure transitions; and converting the finally obtained binary tree into an array format.
[0079]
FIG. 13 is a flowchart of the binary tree creation process. When the processing is started in FIG. 13, the binary tree conversion unit 131 first creates a binary tree for the input key (step ST21). Next, the corresponding input key is added as an output symbol string to the node that has received the key (step ST22), and the process ends.
[0080]
For example, when a binary tree is created for the keyword group {ab, bc, bd} corresponding to the state transition diagram of FIG. 7, the result is as shown in FIG. In FIG. 14, a rectangular box represents one node, and a character label added to each node corresponds to a symbol appearing in the keyword. The horizontal pointer represents a pointer to a node having the same depth on the binary tree, and the downward pointer represents a pointer to a deeper node. Here, output symbol strings “ab”, “bc”, and “bd” are added as outputs to the nodes “4”, “5”, and “6”, respectively.
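The binary tree described here, with a horizontal pointer to a node of the same depth and a downward pointer to a deeper node, can be sketched in Python as follows. The class name Node, its field names and the helper functions are assumptions made for the illustration; keeping siblings in label order is also an assumption, chosen so that the downward pointer always leads to the smallest label.

```python
class Node:
    def __init__(self, label):
        self.label = label    # symbol that leads to this node (None for the root)
        self.down = None      # first node one level deeper
        self.right = None     # next node at the same depth
        self.output = []      # keywords accepted at this node

def find_or_add_child(parent, label):
    """Return the child of parent carrying label, inserting it in label order if absent."""
    prev, cur = None, parent.down
    while cur is not None and cur.label < label:
        prev, cur = cur, cur.right
    if cur is not None and cur.label == label:
        return cur
    node = Node(label)
    node.right = cur
    if prev is None:
        parent.down = node
    else:
        prev.right = node
    return node

def build_tree(keywords):
    """Create the tree for the keys, then record each key at the node that accepts it."""
    root = Node(None)
    for kw in keywords:
        node = root
        for ch in kw:
            node = find_or_add_child(node, ch)
        node.output.append(kw)
    return root

def children(node):
    """Yield the nodes reachable from node: its down node and that node's siblings."""
    cur = node.down
    while cur is not None:
        yield cur
        cur = cur.right

root = build_tree(["ab", "bc", "bd"])
```

Inserting a node only touches one or two pointers, which is why adding elements to this intermediate structure is cheap, whereas following a transition requires walking the sibling chain.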
[0081]
FIG. 15 is a flowchart of processing for assigning a transition destination of a failure transition to a created binary tree node. When the process is started in FIG. 15, the transition adding unit 132 first initializes the node queue Q that stores the node to be processed (step ST31).
[0082]
Next, a node that can transition from the root node of the binary tree (a node that becomes the next node with respect to the root node and a node that has the same depth as that node) is placed in the queue Q (step ST32).
[0083]
Next, the failure destination of the node in the queue is set as the root node (step ST33), and it is determined whether Q is empty (step ST34). If Q is empty, the process ends. If Q is not empty, then one node is taken out from the queue Q, set to r (step ST35), and the taken out node is removed from the queue Q (step ST36).
[0084]
Next, the node queue J is initialized (step ST37), and the nodes to which a transition from r is possible (the next node of the node r and the nodes having the same depth as that node) are placed in the queue J (step ST38). Then, it is determined whether J is empty (step ST39).
[0085]
If J is empty, the processes after step ST34 are repeated. If J is not empty, one node is extracted from the node queue J, set to s (step ST40), and the extracted node is removed from the queue J (step ST41). Next, s is put in the queue Q (step ST42), the failure destination of the node r is set to t (step ST43), and it is determined whether or not a transition from the node t by the character label attached to the node s is defined (step ST44).
[0086]
If no such transition is defined, the failure destination of t is set to t (step ST45), the determination of step ST44 is performed again, and the loop is repeated until the determination result is YES. Since transitions for all symbols are defined in the initial state, the loop is exited at the initial state at the latest.
[0087]
If the transition is defined in step ST44, the destination of the transition from t by the label of s is set as the failure destination of s (step ST46). Then, the output character string of the failure destination of s is added to the output (output character string) of s (step ST47), and the processes after step ST39 are repeated.
For example, when the failure transition of the binary tree of FIG. 14 is calculated and assigned to each node of the binary tree, the result is as shown in FIG. The procedure of failure calculation in this case will be described along the flow of FIG.
[0088]
First, nodes “2” and “3” that can transition from the root node “1” are put in the queue Q (step ST32), and the fail destination of those nodes is set as the root node. The subsequent processing is as shown in FIG. In FIG. 17, for example, the notation of goto (1, b) represents a transition from the node “1” defined for the symbol “b”. Such a procedure is the same as the procedure for creating the failure function in the general AC method.
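The failure calculation of FIG. 15 can be sketched in Python as below. For brevity the sketch stores the tree as a list of dictionaries instead of the binary tree, numbers the states from 0 in insertion order (so the numbers differ from the node numbers in the figures), and merges the inner queue J into a plain loop; these choices and the function names are assumptions for the illustration.

```python
from collections import deque

def build_trie(keywords):
    """goto[state][symbol] -> state, with state 0 as the root; output[state] lists accepted keys."""
    goto, output = [{}], [[]]
    for kw in keywords:
        s = 0
        for ch in kw:
            if ch not in goto[s]:
                goto.append({})
                output.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].append(kw)
    return goto, output

def add_failure(goto, output):
    """Assign a failure destination to every node and merge the outputs along failure links."""
    fail = [0] * len(goto)
    Q = deque(goto[0].values())            # ST32, ST33: nodes next to the root fail to the root
    while Q:                               # ST34
        r = Q.popleft()                    # ST35, ST36
        for label, s in goto[r].items():   # ST37 to ST41: each node reachable from r
            Q.append(s)                    # ST42
            t = fail[r]                    # ST43
            while t != 0 and label not in goto[t]:    # ST44, ST45
                t = fail[t]
            # ST46: the root defines every symbol; undefined ones lead back to the root
            fail[s] = goto[t].get(label, 0)
            output[s] = output[s] + output[fail[s]]   # ST47
    return fail

goto, output = build_trie(["ab", "bc", "bd"])
fail = add_failure(goto, output)
```

With the insertion order ab, bc, bd the states are a = 1, ab = 2, b = 3, bc = 4 and bd = 5, and the only failure destination other than the root is that of ab, which is b.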
[0089]
FIG. 18 is a flowchart of the processing that, for the assigned failure transitions, adds the transitions to states other than the initial state and its next states to the nodes of the binary tree as gotoin / gotoout node lists. When the process is started in FIG. 18, the transition adding unit 132 first initializes the node queue Q (step ST51).
[0090]
Next, a node that can transition from the root node of the binary tree is put in the queue Q (step ST52), and it is determined whether or not the queue Q is empty (step ST53). If the queue Q is empty, the process ends.
[0091]
If the queue Q is not empty, one node is taken out from the queue Q, set to r (step ST54), and the node is removed from the queue Q (step ST55). Next, all possible input symbols are entered in the queue X (step ST56), and it is determined whether the queue X is empty (step ST57). If the queue X is empty, the processes after step ST53 are repeated.
[0092]
For example, when an 8-bit code is used, the number of symbol codes to be placed in the queue X in step ST56 is 256 from 0 to 255.
If the queue X is not empty, one character is extracted from the queue X, set to s (step ST58), and at the same time, this character is removed from the queue X (step ST59). Then, it is determined whether or not the transition from r to the next state is possible by the symbol s (step ST60). If transition is possible, the transition destination of r by the symbol s is added to the queue Q (step ST65), and the processing after step ST57 is repeated.
[0093]
If the transition is not possible in step ST60, it is next determined whether the failure destination of the node r is the root node (step ST61). If it is the root node, the processing from step ST57 is repeated as it is.
[0094]
If the failure destination of the node r is not the root node in step ST61, it is next determined whether or not a transition by the symbol s is possible from the failure destination of r (step ST62). If the transition is impossible, the processes after step ST57 are repeated. If the transition is possible, the node reachable by the symbol s from the fail destination of r is added to the gotoout of r (step ST63), and the node r is added to the gotoin of that node (step ST64). Then, the processes after step ST57 are repeated.
[0095]
Now, assume that the node being processed is r = A, that the failure destination node of the node A is B, and that C is the set of nodes to which a transition from the node B is possible. Then, in the loop processing from step ST57 to step ST65, the list of nodes in the set C is added to the node A as “gotoout”, and the node A is added as “gotoin” to each node in the set C.
[0096]
FIG. 19 shows the values of gotoout and gotoin for this binary tree. The binary tree in FIG. 19 corresponds to the intermediate structure finite state machine that the transition adding unit 132 finally outputs.
[0097]
In the above-mentioned calculation of the gotoout list and the gotoin list, a transition can actually be defined only when the fail destination is a node other than the initial state. In FIG. 16, only the node “4” satisfies this condition, so the processing for this node will be described.
[0098]
In FIG. 16, the failure destination of the node “4” is the node “3”. From the node “3”, a transition to the node “5” is possible by the character “c”, and a transition to the node “6” by the character “d”. Therefore, the gotoout of the node “4” consists of the nodes “5” and “6”, and the gotoin of each of the nodes “5” and “6” is the node “4”. In FIG. 19, these gotoout and gotoin are represented in the form of labeled nodes.
[0099]
In the case of the conventional DFA, “gotoout” is defined for all nodes of the binary tree. In the present invention, however, “gotoout” is defined only for nodes whose failure transition leads to a node other than the root node. In this case “gotoin” must be defined additionally, but the failure transitions of the many nodes that have the root node as their failure destination are deleted, which contributes to a reduction in storage capacity.
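The calculation of the gotoout and gotoin lists can be sketched in Python for the example tree. The tables goto and fail below are typed in by hand from the description (node numbers as in the figures, failure destination of node 4 is node 3, all others fail to the root); the dictionary layout and the function name are assumptions. The sketch follows the text literally in that only the direct transitions of the failure destination are considered.

```python
from collections import deque

ROOT = 1
# Example tree for {ab, bc, bd}, written out by hand with the node numbers of the figures.
goto = {1: {"a": 2, "b": 3}, 2: {"b": 4}, 3: {"c": 5, "d": 6}, 4: {}, 5: {}, 6: {}}
fail = {2: 1, 3: 1, 4: 3, 5: 1, 6: 1}

def add_gotoout_gotoin(goto, fail, alphabet):
    """For each non-root node, record transitions borrowed from a non-root failure destination."""
    gotoout = {n: {} for n in goto}      # node -> {symbol: borrowed destination}
    gotoin = {n: [] for n in goto}       # node -> nodes that borrow a transition into it
    Q = deque(goto[ROOT].values())       # ST52
    while Q:                             # ST53
        r = Q.popleft()                  # ST54, ST55
        for s in alphabet:               # ST56 to ST59
            if s in goto[r]:             # ST60: r has its own transition
                Q.append(goto[r][s])     # ST65
            elif fail[r] != ROOT and s in goto[fail[r]]:   # ST61, ST62
                target = goto[fail[r]][s]
                gotoout[r][s] = target   # ST63
                gotoin[target].append(r) # ST64
    return gotoout, gotoin

gotoout, gotoin = add_gotoout_gotoin(goto, fail, "abcdefghijklmnopqrstuvwxyz")
```

Only node 4 receives a non-empty gotoout here, which matches the observation that nodes failing to the root need no additional entries.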
[0100]
FIGS. 20 and 21 are flowcharts of processing for converting the binary tree output by the transition adding unit 132 into a state transition machine in a compressed array format. When the processing is started in FIG. 20, the conversion unit 133 first initializes the arrays GOTO, CHECK, and OUTPUT to 0 (step ST71), and initializes the member index of each binary tree node to 0 (step ST72).
[0101]
This index is provided independently of the index of the state transition array in order to store the correspondence between each node of the binary tree and the index of the state transition array as shown in FIG.
[0102]
Next, among the possible input symbols, those that cannot be transitioned from the root node to other nodes are put in the queue R (step ST73), and it is determined whether or not the queue R is empty (step ST74). If the queue R is not empty, one character is extracted from the queue R, set to s (step ST75), and removed from the queue R (step ST76).
[0103]
Next, GOTO [1 + 's] is set to 1 (step ST77), CHECK [1 + 's] is set to the character label of s (step ST78), and the processes after step ST74 are repeated. Here, 's represents the internal code in the array for s, but the character code may also be used as it is.
[0104]
When the queue R becomes empty, the queue Q is then initialized (step ST79), and Pn = root node, Cn = next node of the root node, and Pp = 1 (step ST80).
[0105]
Here, the next node of the root node means a node having a minimum character label among a plurality of nodes (node string) that can be transitioned from the root node. In the case of a binary tree as shown in FIG. 19, the node put in Cn matches the node pointed by the pointer down from the root node.
[0106]
Next, a triplet of [Pn, Cn, Pp] is added to the queue Q (step ST81), and it is determined whether the queue Q is empty (step ST82). If the queue Q is empty, the process ends.
[0107]
If the queue Q is not empty, a triplet is extracted from the head of the queue Q, set to s (step ST83), and the triplet is removed from the queue Q (step ST84). Next, a position on the arrays GOTO, CHECK, and OUTPUT is obtained at which both the nodes connected to the gotoout of the node Pn in s and the nodes connected to the node Cn in s at the same depth as Cn can be inserted, and this position is set to point (step ST85).
[0108]
At this time, the point of the newly inserted nodes must not coincide with the point of any nodes already inserted. If a candidate position has already been used as a point, it is, for example, shifted by one and then set to point.
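The search for an insertion position can be sketched in Python as a first-fit scan. The function name find_point, the convention that 0 marks an empty CHECK cell, the set of used points and the automatic growth of the array are assumptions made for the illustration.

```python
def find_point(check, codes, used_points):
    """Return the smallest point that is unused and leaves CHECK[point + code] empty for every code."""
    point = 1
    while True:
        need = point + max(codes, default=0) + 1
        if need > len(check):
            check.extend([0] * (need - len(check)))   # grow the array so the cells exist
        if point not in used_points and all(check[point + c] == 0 for c in codes):
            return point
        point += 1                                    # shift by one and try again

# CHECK after the initial state (point 1) has been stored: indices 2..27 hold a..z.
check = [0, 0] + list("abcdefghijklmnopqrstuvwxyz")
used = {1}
point_a = find_point(check, [2], used)   # row that holds only the label with internal code 2
```

In this example the first free cell for internal code 2 is index 28, so the returned point is 26; once that cell is filled and 26 is recorded as used, a row with the internal codes 3 and 4 is placed at point 27 rather than 26.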
[0109]
Next, GOTO [Pp of s] = point is set (step ST86), the nodes connected to the gotoout of Pn in s are placed in the queue tmp (step ST87), and it is determined whether or not the queue tmp is empty (step ST88).
[0110]
If the queue tmp is not empty, one node is extracted from the queue tmp and set to i (step ST89), and the node is removed from the queue tmp (step ST90). Then, it is determined whether GOTO [index of i] is 0 (step ST91).
[0111]
Here, the member index of the node i of the binary tree holds either the index of the position on the state transition array at which the node i is stored, or 0. If it is 0, the node i of the binary tree has not yet been moved onto the state transition array.
[0112]
If GOTO [index of i] is 0, the processes after step ST88 are repeated. If it is not 0, the transition to the gotoout destination has already been moved onto the array, so the transition based on the gotoout is mapped onto the array (step ST92), and the processes after step ST88 are repeated.
[0113]
In step ST92, the transition is mapped by setting GOTO [point + 'label of i] = GOTO [index of i] and OUTPUT [point + 'label of i] = output of i. Here, 'label of i represents the internal code of the character label of the node i, and output of i represents the output symbol string defined for the node i. Thus, the output of the node i is copied onto the array.
[0114]
If the queue tmp becomes empty in step ST88, next, the nodes connected to the gotoin of Pn in s are placed in the queue tmp (FIG. 21, step ST93), and it is determined whether or not the queue tmp is empty (step ST94).
[0115]
If the queue tmp is not empty, one node is taken out from the queue tmp, set to i (step ST95), and i is removed from the queue (step ST96). Then, it is determined whether GOTO [index of i] is 0 (step ST97).
[0116]
If GOTO [index of i] is 0, the processes after step ST94 are repeated. If it is not 0, the transition of gotoout is mapped onto the array via gotoin (step ST98), and the processes after step ST94 are repeated.
[0117]
In step ST98, the mapping is performed by setting GOTO [GOTO [index of i] + 'label of i] = point. If an output is defined for the node Pn in s, the output of Pn is copied onto the array as OUTPUT [GOTO [index of i] + 'label of i] = output of Pn.
[0118]
If the queue tmp becomes empty in step ST94, next, the character label of the node connected to goout of Pn and the character label of the node connected to Cn at the same depth are put in the queue chtmp (step ST99). Then, it is determined whether or not the queue chtmp is empty (step ST100).
[0119]
If chtmp is not empty, one character label is taken out from the queue chtmp, set to j (step ST101), and the label is removed from the queue chtmp (step ST102). Then, the label of the node insertion destination is set by CHECK [point + 'j] = j (step ST103), and the processes after step ST100 are repeated.
[0120]
When chtmp becomes empty in step ST100, next, Cn in s and a node having the same depth are placed in the queue tmp (step ST104), and it is determined whether the queue tmp is empty (step ST105).
[0121]
If the queue tmp is not empty, a node is taken out from the queue tmp, set to i (step ST106), and the node is removed from the queue (step ST107). Next, the value of (point + 'label of i) is set in the index of the node i (step ST108), and it is checked whether or not an output is defined for the node i (step ST109).
[0122]
If there is no output in the node i, the processes after step ST105 are repeated. If there is an output, the output of the node i is copied to the array OUTPUT (step ST110), and the processes after step ST105 are repeated. In step ST110, OUTPUT [point + 'label of i] = output of i is set.
[0123]
If tmp becomes empty in step ST105, next, a node having the same depth as Cn is placed in the queue tmp (step ST111), and it is determined whether tmp is empty (step ST112).
[0124]
If the queue tmp is not empty, one node is extracted from the queue tmp, set to i (step ST113), and the node is removed from the queue tmp (step ST114). Then, it is determined whether or not the transition to the next state can be made by any symbol from the node i (step ST115).
[0125]
If the transition is impossible, the processes after step ST112 are repeated. If the transition is possible, the triplet with Pn = i, Cn = the leading node of the transition destinations of i, and Pp = the index of i is added to the queue Q (step ST116), and the processes after step ST112 are repeated. Here, the leading node of the transition destinations of i means the node having the smallest character label in the node string to which a transition from the node i is possible.
[0126]
If tmp becomes empty in step ST112, the process returns to step ST82 in FIG. 20, and the subsequent processing is repeated.
Next, a procedure for converting the binary tree of FIG. 19 into the array format as shown in FIG. 8 will be described along the flow of FIGS.
[0127]
First, the transitions of the initial state are set up, since a transition is defined for every symbol in the initial state. This process corresponds to the loop of steps ST74 to ST78. As a result, an array format as shown in FIG. 22 is obtained.
[0128]
Thereafter, nodes on the binary tree are inserted into the array in order from the root node of the binary tree. This process corresponds to the loop from step ST79 in FIG. 20 to step ST116 in FIG. In this insertion process, the conversion unit 133 accumulates the triple data [Pn, Cn, Pp] for the nodes of the binary tree in a queue and inserts them into the array one by one.
[0129]
Insertion starts from the root node. In FIG. 20, the triplet is placed in the queue Q in steps ST79, ST80, and ST81, and is taken out in steps ST83 and ST84. Here, Pn = root node “1”, Cn = node “2”, and Pp = 1. In step ST85, a place where insertion is possible is searched for. For the root node “1”, this processing is as follows.
[0130]
In this case, Pn's goout is empty. Nodes having the same depth as Cn are node "2" and node "3". The character labels of these nodes are “a” and “b”, respectively. The operation of searching for a place on the array where the node string can be inserted corresponds to searching for a place corresponding to the pattern as shown in FIG. 23 on the arrays GOTO, CHECK, and OUTPUT.
[0131]
In the pattern of FIG. 23, the upper row represents a pattern on the current array CHECK, and the lower row represents a pattern to be inserted. “0” indicates that the area is empty, and “*” indicates that the area may be empty or an element may be included.
[0132]
When a position where such a pattern can be inserted is searched for on the array of FIG. 22, the pattern of FIG. 23 can be overlaid as shown in FIG. 24. Therefore, the value of point indicating the insertion position of the nodes is 1. This position is the one immediately before the insertion position of the label “a” of the node “2”.
[0133]
In step ST86, GOTO [Pp] = point, so that GOTO [1] = 1. Thereby, the transition from the node “1” of the binary tree to the nodes “2” and “3” is moved onto the array. However, since the position of index = 1 corresponding to the root node “1” is GOTO [1] = 1 in advance, there is no change.
[0134]
Since the processing from step ST87 in FIG. 20 to step ST98 in FIG. 21 applies only when the node Pn has “gotoout / gotoin”, it does not concern the root node.
[0135]
The process from step ST99 to ST103 in FIG. 21 is a process for inserting a character label into the array. This is equivalent to setting the character label possessed by each node to the CHECK portion at the location on the previously secured array. By this operation, an array format as shown in FIG. 25 is obtained. In FIG. 25, the labels “a” and “b” with underlining correspond to the inserted labels.
[0136]
The processes from steps ST104 to ST110 in FIG. 21 are a process for setting the position on the array to a node of the binary tree and a process for copying the output of the node onto the array. As a result, 2 is set to the index of the node “2”, and 3 is set to the index of the node “3”.
[0137]
In addition, the loop processing from step ST111 to ST116 is processing for loading a node that can transition from Pp into the queue Q. Here, since the node “2” and the node “3” can transition from the root node, [Pn = node “2”, Cn = node “4”, Pp = 2] and [Pn = node “3”, Cn = node “5”, Pp = 3] are put on the queue. Then, the process returns to step ST82 in FIG. 20.
[0138]
This time, when Pn = node “2”, Cn = node “4”, and Pp = 2, the same processing is performed to find a position where the node can be inserted, and the position of point = 26 is found. Therefore, when the processing from step ST85 in FIG. 20 to step ST111 in FIG. 21 is performed in the same manner, an array format as shown in FIG. 26 is obtained.
[0139]
In FIG. 26, the label “b” of the node “4” is inserted at the position of index = 28. Since the label “b” is converted into the internal code 2 when the character code conversion table is used, the value 26, obtained by subtracting 2 from 28, is used as point. As a result, in step ST18 of the collation process of FIG. 10, it is possible to move from the position of index = 26 to the position of index = 28 by the input symbol “b” and to transition to the transition destination represented by that element.
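The relationship between point, the internal code, and the CHECK label can be sketched as follows. This is a minimal illustration only: the sparse arrays are modeled as Python dictionaries, the function name `step` is hypothetical, and failure transitions are omitted. The index values follow the worked example in the text. The internal codes of “b” and “c” are stated in the text, while those of “a” and “d” are inferred from the stated indexes.

```python
# Sparse arrays modeled as dicts; elements not listed are undefined.
# Values follow the worked example: GOTO[1] = 1, point = 26 for node "2",
# point = 27 for node "3".
GOTO = {1: 1, 2: 26, 3: 27}
CHECK = {2: "a", 3: "b", 28: "b", 30: "c", 31: "d"}
OUTPUT = {28: "ab", 30: "bc", 31: "bd"}

# Internal codes: "b" = 2 and "c" = 3 are stated in the text;
# "a" = 1 and "d" = 4 are inferred from the stated indexes (assumption).
CODE = {"a": 1, "b": 2, "c": 3, "d": 4}


def step(index, symbol):
    """Return the array index reached from `index` by `symbol`, or None.

    Failure transitions are not modeled in this sketch.
    """
    point = GOTO.get(index)
    code = CODE.get(symbol)
    if point is None or code is None:
        return None
    target = point + code
    # The CHECK label confirms that the element belongs to this state.
    return target if CHECK.get(target) == symbol else None
```

For example, `step(2, "b")` moves from point 26 to index 28, where OUTPUT holds “ab”.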
[0140]
Also, “ab” that is the output of the node “4” is copied to OUTPUT [28].
Next, since the node subsequent to the node “4” is the node “7”, [Pn = node “4”, Cn = node “7”, Pp = 28] is loaded on the queue Q. Then, the process returns to step ST82 in FIG.
[0141]
Next, [Pn = node “3”, Cn = node “5”, Pp = 3] is taken out from the queue Q. For this, it is only necessary to find an insertion location that satisfies the pattern shown in FIG. 27. If the label “c” of the node “5” is aligned with the position of index = 29, then, since its internal code is 3, point = 26. However, since this value has already been used as a point, point is shifted by one to point = 27 in order to avoid duplication.
[0142]
In this way, when the processing from step ST85 in FIG. 20 to step ST111 in FIG. 21 is performed in the same manner, an array format as shown in FIG. 28 is obtained. In this case, “bc” that is the output of the node “5” is copied to OUTPUT [30], and “bd” that is the output of the node “6” is copied to OUTPUT [31].
[0143]
Next, nodes “8” and “9” follow the nodes “5” and “6”, respectively. Therefore, [Pn = node “5”, Cn = node “8”, Pp = 30], [Pn = node “6”, Cn = node “9”, Pp = 31] are accumulated in the queue Q. Then, the process returns to step ST82 in FIG.
[0144]
This time, the same processing is performed with [Pn = node “4”, Cn = node “7”, Pp = 28], and a search is made for a position where the node can be inserted. Since transitions by the characters “c” and “d” are defined as goout in the node “4”, the place where the node can be inserted is a place that satisfies the pattern shown in FIG. The value of point corresponding to this place is 29.
[0145]
At this time, the processing from step ST85 in FIG. 20 to step ST111 in FIG. 21 is performed in the same way. However, in this case, since “out” is defined for the node “4” serving as Pn, the loop processing from step ST87 to ST92 is entered. However, since GOTO [index of node “5”] and GOTO [index of node “6”] are both undefined, the process of step ST92 is not performed. Eventually, the resulting array is as shown in FIG.
[0146]
Since there is no node that can transition from the node “7”, no more nodes are loaded in the queue Q, and the process returns to step ST82 in FIG.
Next, [Pn = node “5”, Cn = node “8”, Pp = 30] is extracted from the queue Q. The place where insertion is possible is a place that satisfies the pattern as shown in FIG. The value of point corresponding to this place is 5.
[0147]
At this time, the processing from step ST85 in FIG. 20 to step ST111 in FIG. 21 is performed in the same way. In this case, however, the node “4” is defined as “gotoin” in the node “5” serving as Pn, so the process enters the loop processing from steps ST93 to ST98 in FIG. 21.
[0148]
In the condition determination in step ST97, GOTO [index of node “4”] = 29, so in step ST98, GOTO [29 + ’c] = GOTO [32] = 5. In addition, since “bc” is defined as output in the node “5” which is Pn, this is also copied, and OUTPUT [32] = bc. As a result, an arrangement as shown in FIG. 32 is obtained.
[0149]
Finally, when the node “9” is processed in the same manner as the node “8”, the result shown in FIG. 8 is obtained.
In this example, in order to increase the compression ratio of the state transition array, the minimum allowable value is used as the point. However, an index larger than that value may be used as the point.
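The search for the minimum allowable point can be sketched as a first-fit scan. The function name `find_point` and the set-based representation of occupied elements are assumptions made for illustration, not the procedure of the flowcharts themselves. The occupancy used in the example below is also hypothetical. It is chosen only so that index 28 is the first free element for the label “b”.

```python
def find_point(occupied, codes, used_points, start=1):
    """Return the smallest point >= start such that every index
    point + code is free and point has not been used before."""
    point = start
    while True:
        fits = all(point + code not in occupied for code in codes)
        if fits and point not in used_points:
            return point
        point += 1


# Hypothetical occupancy: indexes 2 to 27 are already in use.
occupied = set(range(2, 28))

# Node "4" has the label "b" (internal code 2).
first = find_point(occupied, [2], used_points={1})
occupied.add(first + 2)

# Nodes "5" and "6" have the labels "c" and "d" (internal codes 3 and 4).
second = find_point(occupied, [3, 4], used_points={1, first})
```

Under this assumed occupancy, `first` is 26 and `second` is 27, because 26 fits the pattern but is already used as a point. Passing a larger `start` corresponds to using an index larger than the minimum allowable value.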
[0150]
Next, an embodiment of a character string replacement device in which a character string replacement function is added to the collation device of the first embodiment will be described. Here, an example is described in which the input keywords {ab, bc, bd} are replaced with {aaa, bbb, ccc}, respectively.
[0151]
Instead of outputting each keyword detected in the input text, this character string replacement function stores the location where the keyword was detected and replaces the keywords after the text has been processed.
[0152]
FIG. 33 shows a state transition array for character string replacement for the input keys. In FIG. 33, the array GOTO stores the overlaid state transition table, the array CHECK stores the confirmation labels, the array SUBST stores pointers to the replacement character strings, and the array LENGTH stores the length of the matched character string before replacement.
[0153]
FIG. 34 shows the initial state of the text offset storage array used for the replacement process. This array stores a text offset indicating the position of the character string to be replaced in the text, a SUBST pointer indicating the corresponding replacement character string, and the length of the replacement target character string.
[0154]
The text offset for each symbol in the input text 'cabcz' is as shown in FIG. Also, after pattern matching is performed on this input text, the text offset storage array is as shown in FIG. The collation process at this time is the same as in FIG. As a result of the collation, the positions of the keywords “ab” and “bc” in the text are stored as text offsets “1” and “2”, respectively.
[0155]
In the replacement process, the character string replacement device sets a pointer in each of the text and the text offset storage array. Text characters are output as they are unless the text pointer lies within a section that starts at a text offset stored in the text offset storage array and has the length of the corresponding replacement target character string. When the text pointer is at a text offset position stored in the text offset storage array, the corresponding replacement character string is output.
[0156]
FIG. 37 is a flowchart of this replacement process. When the processing is started in FIG. 37, the character string replacing device first sets the text pointer t at the head of the text (step ST121), and sets the replacement pointer p at the head of the text offset storage array (step ST122). Then, it is determined whether or not the pointer t points to the end of the text (step ST123). If the pointer t points to the end of the text, the process ends.
[0157]
If the pointer t does not point to the end of the text, the value of the text offset stored at the position pointed to by the pointer p is compared with the pointer t (step ST124). If they do not match, the character in the text pointed to by the pointer t is output (step ST125), the pointer t is advanced by one character (step ST126), and the processing after step ST123 is repeated.
[0158]
If the two match in step ST124, then the SUBST pointer stored at the position pointed to by pointer p is taken out, and the replacement character string pointed to by it is output (step ST127). Next, the pointer t is advanced by the length of the replacement target character string stored at the position indicated by the pointer p (step ST128), and the pointer p is advanced by one (step ST129).
[0159]
Then, it is determined whether the value of the pointer t is larger than the text offset at the position pointed to by the pointer p and whether the position pointed to by the pointer p is not the end of the text offset storage array (step ST130). If the determination result is YES, the processing after step ST129 is repeated, and if it is NO, the processing after step ST123 is repeated.
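The replacement loop of steps ST121 to ST130 can be sketched as follows. This is a minimal illustration, assuming that the text offset storage array is given as a list of (text offset, SUBST index, length) tuples sorted by text offset. The function name `replace_keywords` and this tuple layout are assumptions for the sketch.

```python
def replace_keywords(text, matches, substitutions):
    """Apply stored matches to `text`.

    matches: list of (text_offset, subst_index, length), sorted by offset.
    substitutions: list of replacement strings indexed by subst_index.
    """
    out = []
    t = 0  # text pointer (step ST121)
    p = 0  # replacement pointer (step ST122)
    while t < len(text):  # step ST123
        if p < len(matches) and matches[p][0] == t:  # step ST124
            _, subst_index, length = matches[p]
            out.append(substitutions[subst_index])  # step ST127
            t += length  # step ST128
            p += 1  # step ST129
            # Step ST130: skip matches that start inside the section
            # that has already been replaced.
            while p < len(matches) and t > matches[p][0]:
                p += 1
        else:
            out.append(text[t])  # step ST125
            t += 1  # step ST126
    return "".join(out)


result = replace_keywords("cabcz", [(1, 0, 2), (2, 1, 2)], ["aaa", "bbb", "ccc"])
```

With the matches “ab” at text offset 1 and “bc” at text offset 2, `result` is “caaacz”. The match at offset 2 is skipped because the text pointer has already passed it after the first replacement.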
[0160]
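As a result of such replacement processing, the text “cabcz” is converted to “caaacz”: “ab” at text offset 1 is replaced with “aaa”, and “bc” at text offset 2 is skipped because it overlaps the section that has already been replaced.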
Next, a second embodiment based on the FAST method will be described with reference to FIGS.
[0161]
In the state transition table of the FAST method, the data defining the operation for each input symbol in each state may appear to separate easily into two parts, a default shift and a transition, but this is not necessarily the case. For example, in the state transition table of FIG. 69, the shift amount for the input symbol “a” in the state “6” is −2 (size 2), whereas the default shift amount for the other input symbols is −5 (size 5). The former, smaller shift amount is determined during the calculation of the FAST method shift amounts. For this reason, the table format cannot be made compact simply by separating it into the default shift amount and the transition to the next state.
[0162]
Therefore, in the second embodiment, the operations on input symbols in the FAST method are first divided into three types. The first is the default shift of a state for general input symbols. The second is a shift for a specific input symbol. The third is a transition to the next state by an input symbol.
[0163]
In the collation operation, it is first determined whether or not a transition or shift is defined for the input symbol. If neither shift nor transition is defined, the pointer is shifted to the right according to the default shift. If an input symbol is accepted but a shift is defined instead of a transition, the pointer is shifted to the right according to the shift. If a transition is defined, transition to the next state.
[0164]
Such operations are represented in an array format for each state, and the arrays are overlaid so that the portions where a transition or shift is defined do not overlap. At the same time, in order to confirm whether or not an element belongs to a given array, each element for which a transition or shift is defined is labeled with the corresponding character.
[0165]
Also, when creating a compressed table format, the storage capacity actually used is saved by going through the intermediate structure of the binary tree.
The configuration of the collation device in the second embodiment is the same as that in FIG. In this case, the state transition determination unit 121 determines which state is to be transitioned with respect to the input symbol, or how much is to be shifted in the text.
[0166]
FIG. 38 shows a second state transition table before compression when the three symbol strings {smart, east, state} are the input keywords. Compared with the table of FIG. 69, the typical shift amount frequently used in each state is consolidated into a single default shift, and only a specific shift amount or a transition destination for a specific input symbol is defined for each input symbol.
[0167]
FIG. 39 shows the transition for each input symbol defined in the state transition table of FIG. The state transition diagram of FIG. 39 is composed of 14 nodes, the nodes “0” to “13”. Data a1, a2, a3, and a4 assigned to each node represent a state number (node number), a default shift size, a specific character, a specific shift size, and an output character string, respectively.
[0168]
FIG. 40 shows a collation array obtained by compressing the state transition table of FIG. In FIG. 40, index represents an array subscript, GOTO represents a superimposed state transition table stored in the compressed state transition unit 123, and CHECK represents a confirmation label stored in the confirmation label unit 124. OUTPUT represents an array of pointers stored in the output symbol portion 125. These pointers point to output character strings stored in the state transition unit 122.
[0169]
If the element of the array GOTO is less than 0, it represents the shift amount, and otherwise represents a transition to the next state.
Further, states s0 to s13 shown below the array in FIG. 40 indicate how the states “0” to “13” in FIG. 38 overlap. Symbols a, e, and the like represent storage locations of access destination data corresponding to the respective input symbols in FIG. 38, and symbol D represents a storage location of access destination data corresponding to the default shift.
[0170]
FIG. 41 shows a second character code conversion table used for explaining the second embodiment. According to this character code conversion table, the character code is converted into an internal code. In the conversion table of FIG. 41, “1” is set as the internal code in the first column in order to set the default shift code at the top of each superimposed array. For this reason, the internal code for each input symbol is 2 or more, and is set in the second column and thereafter.
[0171]
The result obtained by adding 1 to the character code value of the input symbol may be used as the internal code.
Next, collation processing in the second embodiment will be described with reference to FIGS. 42 and 43.
[0172]
FIG. 42 is a flowchart of character string matching processing based on the FAST method. When the processing is started in FIG. 42, the state transition determination unit 121 first sets a text pointer, which indicates a position in the input text, at the position obtained by adding the shortest key length to the head of the text, and sets a transition pointer, which points to the state transition array in the state transition unit 122, to the initial state (step ST131). Next, it is checked whether or not the text pointer points to the end of the text (step ST132). The collation ends when the text pointer points to the end of the text.
[0173]
If the text pointer does not point to the end of the text, the character pointed to by the pointer is taken out (step ST133), and the value of the internal code corresponding to the character is added to the value of the transition pointer. It is then checked whether the character label stored at the position indexed by the sum is the same as this character (step ST134).
[0174]
If these characters do not match, then no transition or shift is defined for the input character. Therefore, the text pointer is advanced by the default shift for the state (step ST135), the transition pointer is set to the initial state (step ST136), and the processes after step ST132 are repeated.
[0175]
If the label and the input character are the same in step ST134, it is next checked whether or not a shift by the input character is defined (step ST137). If a shift is defined for the input character in the current state, the text pointer is advanced by the shift amount defined for the input character (step ST138), the transition pointer is set to the initial state (step ST139), and the processes after step ST132 are repeated.
[0176]
In step ST137, when the transition by the input character is defined instead of the shift, it is checked whether the output character string is defined for the state (step ST140).
[0177]
If the output character string is not defined, the internal code of the input character is added to the current transition pointer, and the transition destination defined at the position indexed by the sum is set as the new transition pointer (step ST142). Then, the text pointer is moved back by one character (step ST143), and the processes after step ST132 are repeated.
[0178]
If an output character string is defined, the character string is output as a collation result (step ST141), and the processes after step ST142 are performed. In step ST141, the current text pointer value can be output as the position of the collated character string.
[0179]
FIG. 43 shows a collation operation for the input text ‘aseastasterr’ using the state transition array of FIG. 40. First, the text pointer is set at a position that is the shortest keyword length 4 from the beginning of the text, and the transition pointer is set to the index value “1” corresponding to the initial state.
[0180]
For the first input symbol “e”, CHECK [1 + ′ e] = e and GOTO [1 + ′ e] = 3 > 0, so the transition pointer is set to 3 (step ST142) and a transition to the next state is made.
[0181]
Thereafter, the collating operation as shown in FIG. 43 is performed in the same manner. Here, first, whether or not the transition destination or the shift amount by the input symbol is defined is checked by accessing the array CHECK, and if the transition is not defined, the shift is performed by the default shift.
[0182]
When a shift smaller than the default shift is defined for a specific input symbol, the shift is performed according to that shift amount. In FIG. 43, such a specific shift occurs when the symbol “r” is input in the final initial state “1”. For the symbol “r”, the text pointer is shifted rightward by the shift amount 1, which is smaller than the default shift amount 4, but this moves the pointer to the end of the text, so the processing ends.
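The three-way decision made for each input symbol (default shift, specific shift, or transition) can be sketched as follows. This is only an illustration under stated assumptions: the arrays are modeled as dictionaries, the function name `classify` is hypothetical, and the internal code is taken to be the character code value plus 1, which the text mentions as one possible choice. The entries for “e” and “r” in the initial state follow the values given in the text, but their array positions here result from the assumed internal codes and are not those of FIG. 40.

```python
DEFAULT_CODE = 1  # internal code reserved for the default shift element


def internal_code(symbol):
    # Assumption: character code value plus 1, so every code is 2 or more.
    return ord(symbol) + 1


def classify(goto, check, pointer, symbol):
    """Decide the operation for `symbol` in the state at `pointer`."""
    index = pointer + internal_code(symbol)
    if check.get(index) != symbol:
        # Neither a shift nor a transition is defined: use the default shift.
        return ("default_shift", -goto[pointer + DEFAULT_CODE])
    if goto[index] < 0:
        # A negative element is a shift amount for this specific symbol.
        return ("shift", -goto[index])
    # Otherwise the element is the transition pointer of the next state.
    return ("transition", goto[index])


# Initial state at transition pointer 1: default shift 4,
# transition to 3 by "e", specific shift 1 for "r".
INITIAL = 1
goto = {
    INITIAL + DEFAULT_CODE: -4,
    INITIAL + internal_code("e"): 3,
    INITIAL + internal_code("r"): -1,
}
check = {
    INITIAL + DEFAULT_CODE: 1,
    INITIAL + internal_code("e"): "e",
    INITIAL + internal_code("r"): "r",
}
```

In this sketch, `classify(goto, check, INITIAL, "e")` reports a transition to 3, `"r"` reports a shift of 1, and any other symbol falls back to the default shift of 4.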
[0183]
Thus, the symbol strings “east” and “state” included in the input text are output as the collation result.
Next, a method for creating a state transition array used for collation will be described. FIG. 44 is a second block diagram of the compression apparatus. In FIG. 44, a binary tree conversion unit 141, a preprocessing unit 142, and a shift amount calculation unit 143 correspond to the sparse array finite state machine creation unit 104, and the conversion unit 144 corresponds to the state transition machine compression unit 105.
[0184]
The keyword input unit 103 receives the specified keyword group. The binary tree conversion unit 141 converts the accepted keyword group into a binary tree structure in the direction from the left to the right of each key. The preprocessing unit 142 sets the length of the input keyword having the shortest length, the depth of each node, and the terminal node for each keyword. The shift amount calculation unit 143 calculates the shift amount for each state. Further, the conversion unit 144 converts the binary tree structure output from the shift amount calculation unit 143 into a compressed array format.
[0185]
The procedure for creating such an array for collation includes the creation of a binary tree from the input keyword group, the addition of the node depth to each node of the resulting binary tree, the setting of the last node for each keyword, the calculation of the shift amounts, the creation of the binary tree of the intermediate structure, and the conversion from the intermediate structure to the array format.
[0186]
FIG. 45 is a flowchart of the binary tree creation process. When the processing is started in FIG. 45, the binary tree conversion unit 141 first creates a binary tree in the reverse order of the key sequence for the input key (step ST151). Next, the corresponding input key is added as an output symbol string to the node that has received the key (step ST152), and the process ends.
[0187]
FIG. 46 is a flowchart of the preprocessing for calculating the shift amount for the binary tree. When the processing is started in FIG. 46, the preprocessing unit 142 first sets the distance (depth) from the root node for each node (step ST161). Next, the shortest input keyword length is obtained (step ST162), the last node for each keyword is obtained (step ST163), and the process ends.
[0188]
For example, when a binary tree is created for the keyword group {smart, east, state} corresponding to the state transition diagram of FIG. 39 and pre-processing is performed, a binary tree as shown in FIG. 47 is obtained. In FIG. 47, the meanings of the downward pointer and the horizontal pointer are the same as those in FIG.
[0189]
Data d given to each node represents the depth of the node, and output represents an output character string. In this case, output symbol strings “state”, “east”, and “smart” are added as outputs to the nodes “5”, “9”, and “13”, respectively.
[0190]
Further, 4 is set as the shortest pattern length of these output symbol strings (keywords), and the nodes “5”, “9”, and “13” are set as the termination nodes for the keywords “state”, “east”, and “smart”, respectively.
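The tree creation and the preprocessing described above can be sketched as follows. As a simplification, each node keeps a dictionary of children rather than the downward and horizontal pointers of the binary tree, and the function name `build_reversed_tree` is hypothetical. The keywords are inserted in the order state, east, smart, which is an assumption chosen so that the node numbers agree with those quoted in the text.

```python
def build_reversed_tree(keywords):
    """Build a tree over the reversed keywords.

    Returns (children, depth, output, terminal, shortest).
    Node 0 is the root; other nodes are numbered in creation order.
    """
    children = [{}]  # children[n] maps a character label to a node number
    depth = [0]      # depth[n] is the distance of node n from the root
    output = {}      # output[n] is the keyword accepted at node n
    terminal = {}    # terminal[keyword] is the last node for the keyword
    for keyword in keywords:
        node = 0
        for symbol in reversed(keyword):
            if symbol not in children[node]:
                children.append({})
                depth.append(depth[node] + 1)
                children[node][symbol] = len(children) - 1
            node = children[node][symbol]
        output[node] = keyword
        terminal[keyword] = node
    shortest = min(len(keyword) for keyword in keywords)
    return children, depth, output, terminal, shortest


children, depth, output, terminal, shortest = build_reversed_tree(
    ["state", "east", "smart"]
)
```

With this insertion order, the sketch produces 14 nodes, the terminal nodes 5, 9, and 13, and a shortest pattern length of 4.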
[0191]
FIGS. 48, 49, and 50 are flowcharts of the calculation process for obtaining the shift amount for the binary tree.
FIG. 48 shows a process for setting a shift by a specific character and a failure transition to each node. When the processing is started in FIG. 48, the shift amount calculation unit 143 first initializes a node queue Q that stores nodes to be processed (step ST171). Next, a node that can transition from the root node of the binary tree is placed in the queue Q (step ST172).
[0192]
Next, the failure destination of the node in the queue is set as the root node (step ST173), and it is determined whether Q is empty (step ST174). If Q is empty, the process ends. If Q is not empty, then one node is taken out from queue Q, set to r (step ST175), and the taken out node is removed from queue Q (step ST176).
[0193]
Next, the node queue J is initialized (step ST177), and a node that can transition from r is put in the queue J (step ST178). Then, it is determined whether J is empty (step ST179).
[0194]
If J is empty, the processes after step ST174 are repeated. If J is not empty, then one node is extracted from the node queue J, set to s (step ST180), and the extracted node is removed from the queue J (step ST181). Next, s is put in the queue Q (step ST182), the failure destination of the node r is set to t (step ST183), and it is determined whether or not a transition from the node t by the character label attached to the node s is defined (step ST184).
[0195]
If such a transition is defined, the destination of the transition from t by the label of s is set as the failure destination of s (step ST188), and the processing after step ST179 is repeated.
[0196]
If no such transition is defined in step ST184, it is then determined whether the shift amount for the label of s is undefined at the node t, or whether the shift amount already defined is greater than the current depth of s (step ST185).
[0197]
If the shift amount is undefined or the defined value is larger than the current depth of s, then the shift amount at the label of node s relative to node t is set as the depth of s (step ST186). The shift amount for such a specific label is represented as a specific list added to each node of the binary tree. Next, the failure destination of t is set to t (step ST187), and the processes after step ST184 are repeated.
[0198]
In step ST185, if a shift amount equal to or smaller than the current depth of s is defined for the label of s, the processing after step ST187 is performed.
FIG. 49 is a flowchart of processing for adding a default shift amount, which is the maximum shift amount, to each node. When the processing is started in FIG. 49, the shift amount calculation unit 143 sets the value obtained by adding the shortest key length to the node depth as the default shift amount for each node (step ST191), and ends the processing.
[0199]
For example, in the case of the binary tree in FIG. 47, first, the shift amount calculation unit 143 places nodes “1” and “6” that can transition from the root node “0” into the queue Q by the processing in step ST172 in FIG. Next, the failure transition destinations of the nodes “1” and “6” are set to the root node by the process of step ST173.
[0200]
Next, the node “1” is taken out from the queue Q and set to r, and the node “2” is loaded on the node queue J by the processing of steps ST175 to ST178. Since only the node “2” is loaded on J, the process goes through the loop of steps ST179 to ST183 only once.
[0201]
Here, the node “2” is set to s by the process of step ST180, and the fail destination of the node r (node “1”), that is, the root node is set to t by the process of step ST183.
[0202]
Since a transition from the root node by the label of s (the label “t” of the node “2”) is possible, the process proceeds to step ST188, and the failure destination of s is set to the node “6”, which can be reached from the root node by the label “t” of the node “2”.
[0203]
Next, the node “6” is taken out from the queue Q. The same processing is performed on the node “7” that can be transitioned from the node “6”. In this case, r = node “6”, s = node “7”, and t = node “1” (fail destination of node “6”). However, transition from t to the label “s” of the node “7” is impossible.
[0204]
Therefore, in the node “1”, the depth of the node “6”, that is, 1 is set as the specific shift amount with respect to the input symbol “s”. The shift amount “1” corresponding to this symbol “s” is represented by a list structure as a specific list.
[0205]
If such a process is repeated in the same manner, a failure transition and a provisional shift amount are assigned to all nodes. Thereafter, according to the process of FIG. 49, a default shift amount that is the maximum shift amount in each node is assigned to each node. In this way, a binary tree as shown in FIG. 51 is obtained.
In FIG. 51, a broken-line arrow represents a failure transition, and a failure destination of a node for which no failure transition is displayed is a root node “0”. Data D added to each node represents a default shift amount. Furthermore, a specific character label and a specific shift amount corresponding thereto are set in the specific list connected to the nodes “0” and “6”.
[0206]
FIG. 50 is a flowchart of processing for reducing the shift amount of each node as necessary and assigning the final shift amount. When the processing is started in FIG. 50, the shift amount calculation unit 143 first loads the input keyword on the queue Q (step ST201), and determines whether the queue Q is empty (step ST202).
[0207]
If the queue Q is not empty, the keyword popped from the queue Q is put into j (step ST203). Next, the last node corresponding to the keyword j is set to jst (step ST204), the fail destination of jst is set to bst (step ST205), and jlen = jst depth (step ST206). Then, it is determined whether or not bst is a root node (step ST207). If bst is the root node, the processes after step ST202 are repeated.
[0208]
If bst is not the root node, the depth of bst is subtracted from jlen, and the result is set to slen (step ST208). Next, bst is put in the queue N (step ST209), and it is determined whether or not the queue N is empty (step ST210). If the queue N is empty, bst is replaced with the fail destination of bst (step ST211), and the processing after step ST207 is repeated.
[0209]
If the queue N is not empty, the node popped from the queue N is put in r (step ST212), and it is determined whether or not the default shift of r is larger than (slen + r depth) (step ST213). If the default shift of r is equal to or less than (slen + r depth), the processes after step ST210 are repeated.
[0210]
If the default shift of r> (slen + r depth), then the default shift of r = slen + r depth (step ST214), nodes that can transition from r are loaded in the queue N (step ST215), and after step ST210 Repeat the process.
[0211]
If the queue Q is empty in step ST202, loop processing from step ST216 to ST220 is performed. This process is a process in which, when the shift size by a specific character defined for the node is larger than the default shift size, the shift size is cut to the same size as the default shift.
[0212]
Here, first, all nodes are placed in the queue Q (step ST216), and it is determined whether or not the queue Q is empty (step ST217). If the queue Q is empty, the process ends.
[0213]
If the queue Q is not empty, the node popped from the queue Q is put in j (step ST218), and it is determined whether or not the shift size for the specific character defined in j is larger than the default shift size (step ST219).
[0214]
If the determination result is NO, the processes after step ST217 are repeated. If the determination result is YES, assuming that the shift size for the specific character = the default shift size (step ST220), the processing after step ST217 is repeated.
[0215]
When the shift amount set in the binary tree of FIG. 51 is shaped according to the processing of FIG. 50, the following is obtained. First, by the processing of steps ST201 to ST203, a keyword set for verification {smart, east, state} is loaded on the queue Q, and the keyword popped from the queue is set to j.
[0216]
Assuming that this is “state”, jst = node “5”, bst = node “7”, and jlen = 5 by the processing of steps ST204 to ST206. At this time, since bst is not a root node, the process proceeds to step ST208. In step ST208, slen = jlen-bst depth = 5-2 = 3. In step ST209, the node “7” is placed in the queue N, and in step ST212, it is popped from the queue N and set to r.
[0217]
In the condition determination in step ST213, the default shift of r (the default shift of the node “7”) is 6, which is larger than slen + r depth = 3 + 2 = 5. Therefore, the process moves to step ST214, where the default shift of r is set to slen + r depth = 5.
[0218]
Thereafter, the node “8” is loaded in the queue N by the processing in step ST215, and the same processing is performed. As a result, the branches below the node “8” are processed in order, and the size of the default shift of each node is changed.
[0219]
As described above, in the node that is the fail destination of the terminal node of a keyword and in the deeper nodes connected to that node, the size of the default shift is suppressed to (slen + r depth) or less by the processing in steps ST213 and ST214.
[0220]
When all the nodes are finally processed, a binary tree as shown in FIG. 52 is obtained. In FIG. 52, it can be seen that the default shifts of the nodes “7” and “1” which are the fail destinations of the keyword end nodes “5” and “9” are smaller by 1 than in FIG. Further, the default shift of each node connected below the nodes “7” and “1” is also decreased by one.
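The default shift assignment of step ST191 and the reduction of steps ST201 to ST215 can be sketched as follows. This is a minimal illustration under several assumptions: children are held in dictionaries instead of a binary tree, the failure destinations are computed with the standard Aho-Corasick construction as a stand-in for the process of FIG. 48, and the specific shift amounts and steps ST216 to ST220 are omitted. All function names are hypothetical.

```python
from collections import deque


def build(keywords):
    """Tree over reversed keywords; node 0 is the root."""
    children, depth, terminal = [{}], [0], {}
    for keyword in keywords:
        node = 0
        for symbol in reversed(keyword):
            if symbol not in children[node]:
                children.append({})
                depth.append(depth[node] + 1)
                children[node][symbol] = len(children) - 1
            node = children[node][symbol]
        terminal[keyword] = node
    return children, depth, terminal


def failure_links(children):
    # Assumption: standard Aho-Corasick failure construction.
    fail = [0] * len(children)
    queue = deque(children[0].values())
    while queue:
        r = queue.popleft()
        for symbol, s in children[r].items():
            queue.append(s)
            t = fail[r]
            while t != 0 and symbol not in children[t]:
                t = fail[t]
            fail[s] = children[t].get(symbol, 0)
    return fail


def reduced_default_shifts(keywords):
    children, depth, terminal = build(keywords)
    fail = failure_links(children)
    shortest = min(len(keyword) for keyword in keywords)
    default = [shortest + d for d in depth]       # step ST191
    for keyword in keywords:                      # steps ST201 to ST203
        jst = terminal[keyword]                   # step ST204
        bst = fail[jst]                           # step ST205
        jlen = depth[jst]                         # step ST206
        while bst != 0:                           # step ST207
            slen = jlen - depth[bst]              # step ST208
            pending = deque([bst])                # step ST209
            while pending:                        # step ST210
                r = pending.popleft()             # step ST212
                if default[r] > slen + depth[r]:  # step ST213
                    default[r] = slen + depth[r]  # step ST214
                    pending.extend(children[r].values())  # step ST215
            bst = fail[bst]                       # step ST211
    return default


shifts = reduced_default_shifts(["state", "east", "smart"])
```

For the keywords inserted in the order state, east, smart, the default shift of the node “7” drops from 6 to 5 and that of the node “1” from 5 to 4, and the nodes below them are each reduced by one as well.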
[0221]
FIGS. 53 and 54 are flowcharts of processing for converting the binary tree output from the shift amount calculation unit 143 into a compressed state transition machine in an array format. When the processing is started in FIG. 53, the conversion unit 144 first initializes the arrays GOTO, CHECK, and OUTPUT to 0 (step ST221), and initializes the member index of each binary tree node to 0 (step ST222).
[0222]
This index is provided independently of the index of the state transition array in order to store the correspondence between each node of the binary tree and the index of the state transition array as shown in FIG.
[0223]
Next, the queue Q is initialized (step ST223), and Pn = root node, Cn = next node of the root node, and Pp = 1 (step ST224).
Here, the next node of the root node means a node having a minimum character label among a plurality of nodes (node string) that can be transitioned from the root node. In the case of a binary tree as shown in FIG. 52, the node placed in Cn matches the node pointed by the pointer down from the root node.
[0224]
Next, the triplet [Pn, Cn, Pp] is added to the queue Q (step ST225), and it is determined whether or not the queue Q is empty (step ST226). If the queue Q is empty, the process ends.
[0225]
If the queue Q is not empty, then a triple is popped from the head of the queue Q and set to s (step ST227). Next, a position on the arrays GOTO, CHECK, and OUTPUT is obtained at which the nodes connected to the specific list of the node Pn in s, the nodes that follow the node Cn in s and have the same depth as Cn, and the shift amount of the default shift for s can all be inserted, and this position is set to point (step ST228).
[0226]
At this time, the position of the point of the already inserted node is not overlapped with the position of the point of the newly inserted node. If a new insertable position is already used as a point, for example, it is shifted by one and set to a point.
[0227]
Next, GOTO [s Pp] = point (step ST229), and CHECK [point + 1] = 1 (step ST230). Also, the magnitude of the default shift of the node Pn in s is multiplied by −1 to make a negative value, and this is put into GOTO [point + 1] (step ST231). As a result, a default shift value is set at a position of index = point + 1 in the array.
[0228]
Next, a node connected to the Pn specific list of s is put in the queue tmp (step ST232), and it is determined whether or not the queue tmp is empty (step ST233). If the queue tmp is not empty, the node popped from the queue tmp is set to j (step ST234).
[0229]
Then, as the shift amount for the character label of the node j, the value obtained by multiplying the corresponding shift magnitude by −1 is set in GOTO [point + 'label of j'] (step ST235). Here, “label” is the internal code on the array corresponding to the character label, and label ≧ 2 when the conversion table of FIG. 41 is used.
[0230]
Next, the character label of j is set as the confirmation label in CHECK [point + 'label of j'] (step ST236), and the processing from step ST233 onward is repeated.
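The way the default shift and the specific shifts are written in steps ST230 to ST236 can be sketched as follows. The function write_shifts follows the text above (negative values in GOTO, confirmation labels in CHECK, the default shift at index point + 1). The function lookup is an assumption about how a matcher might read these entries back; the function names and the use of dictionaries in place of the arrays are hypothetical.

```python
def write_shifts(goto, check, point, default_shift, specific_shifts):
    """Store shift amounts as negative values, as in steps ST230 to ST236.

    specific_shifts maps an internal label code (>= 2) to a shift magnitude.
    """
    check[point + 1] = 1                # slot reserved for the default shift
    goto[point + 1] = -default_shift    # magnitude multiplied by -1
    for label, shift in specific_shifts.items():
        goto[point + label] = -shift    # specific shift for this label
        check[point + label] = label    # confirmation label


def lookup(goto, check, point, label):
    """Assumed read-back: negative entry = shift, positive entry = next point."""
    index = point + label
    if check.get(index) == label:
        value = goto[index]
        return ("shift", -value) if value < 0 else ("goto", value)
    # No entry is defined for this label, so the default shift applies.
    return ("shift", -goto[point + 1])
```

With point = 1, a default shift of 5 and specific shifts {2: 2, 4: 1}, lookup returns a shift of 2 for label 2 and falls back to the default shift of 5 for an undefined label such as 3.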
[0231]
When the queue tmp becomes empty in step ST233, Cn in s and the nodes having the same depth as Cn are placed in the queue tmp (FIG. 54, step ST237), and it is determined whether or not the queue tmp is empty (step ST238).
[0232]
If the queue tmp is not empty, the node popped from the queue tmp is set to i (step ST239), and the value of (point + 'label of i') is set as the index of the node i (step ST240). It is then checked whether or not an output is defined for the node i (step ST241).
[0233]
If the node i has an output, the output of the node i is copied to the array OUTPUT (step ST242). Here, OUTPUT [point + 'label of i'] = output of i. Next, the label of i is set in CHECK [point + 'label of i'] (step ST243), and the processing from step ST238 onward is repeated. If the node i has no output, the processing from step ST243 onward is performed.
[0234]
If tmp becomes empty in step ST238, the nodes having the same depth as Cn are placed in the queue tmp (step ST244), and it is determined whether or not tmp is empty (step ST245).
[0235]
If the queue tmp is not empty, the node popped from the queue tmp is set to i (step ST246), and it is determined whether or not it is possible to make a transition to the next state with any symbol from the node i (step ST247). If the transition is impossible, the processes after step ST245 are repeated.
[0236]
If a transition is possible, the triplet with Pn = i, Cn = the leading node of the transition destinations of i, and Pp = the index of i is added to the queue Q (step ST248), and the processing from step ST245 onward is repeated. Here, the leading node of the transition destinations of i means the node having the smallest character label in the node string that can be reached by a transition from the node i.
[0237]
If tmp becomes empty in step ST245, the process returns to step ST226 in FIG. 53, and the subsequent processing is repeated.
Next, the procedure for converting the binary tree of FIG. 52 into the array format as shown in FIG. 40 will be described with reference to the flowcharts of FIGS. In this conversion process, the relationship between each preceding node and the nodes that can be reached from it by a transition or a shift is mapped from the binary tree to the array format.
[0238]
FIG. 55 shows a summary of transitions by input symbols, specific shifts by input symbols, and default shifts for each node in FIG. In FIG. 55, for example, “e: 1” in the first row and second column indicates that the transition destination by the input symbol “e” for the node “0” is the node “1”, and “a: 2” in the first row and fourth column indicates that the magnitude of the shift by the input symbol “a” for the node “0” is 2.
[0239]
In the process from step ST227 of FIG. 53 to step ST248 of FIG. 54, the conversion unit 144 searches the array for a vacant place where the symbols of the next transitions and shifts can be defined for the preceding state, and inserts the data there to map the transition relationships. This data insertion method is basically the same as the method described in the first embodiment.
[0240]
Here, the mapping operation will be described using the root node “0” as an example. The pattern of data to be inserted with respect to the root node is as shown in FIG. FIG. 56 shows that there are six input symbols {a, e, m, r, s, t} that can define transitions other than the default shift for the root node.
[0241]
Initially, all the elements of the state transition array are empty. Therefore, when a place where the pattern of FIG. 56 can be inserted is searched for on the array, the smallest possible array index is “1”. Accordingly, point is set to “1”, and the character labels are written and the shift amounts of the default shift and the specific shifts are set by the processing from step ST229 to step ST236 in FIG. 53, whereby the pattern shown in the row s0 of the state transition array is mapped. At this stage, the portions corresponding to the labels “e” and “t” remain undefined.
[0242]
Since the portions of the labels “e” and “t” correspond to the next nodes of the root node “0”, they are set when these nodes are inserted into the array by the processing from step ST237 to step ST248. Here, the node to be processed after the root node is the node “1”.
[0243]
Since the only character that can transition from the node “1” is “t”, the insertion location of the node “1” is a location corresponding to the pattern as shown in FIG. In FIG. 57, the upper row represents a pattern on the current array CHECK, and the lower row represents a pattern to be inserted.
[0244]
When a place where such a pattern can be inserted is searched for on the array, the minimum array index that can be set to point is “3”. Therefore, in order to define that the node “1” is a transition destination from the node “0”, the value 3 of this point is set in GOTO [7], which defines that the data of the transition destination from the node “0” by the symbol “e” starts at the position of index = 3.
[0245]
Then, this node “1” is inserted at the position of index = 3, and character labels, default shift amounts, etc. are set by the processing from step ST229 to ST236. As a result, the pattern shown in the row of s1 in FIG. 40 is mapped.
[0246]
Such processing is performed in the order in which the nodes are placed in the queue, that is, in the order of the nodes “0”, “1”, “6”, “2”, “7”, “10”, “3”, “8”, “11”, “4”, “9”, “12”, “5”, “13”, and a state transition array as shown in FIG. 40 is finally obtained.
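The queue-driven processing order can be illustrated by the following minimal sketch. It assumes, as the labels “e” and “t” at the root node suggest, that the tree is built over the reversed keywords, and it uses a nested dictionary rather than the binary tree of FIG. 52. The function names are hypothetical, and nodes are identified by the path of labels leading to them rather than by the node numbers of the figure.

```python
from collections import deque


def build_trie(keywords):
    """Build a nested-dict trie over the reversed keywords (an assumption)."""
    root = {}
    for word in keywords:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
    return root


def level_order(root):
    """Visit nodes breadth-first, children in increasing label order."""
    order = []
    queue = deque([("", root)])
    while queue:
        path, node = queue.popleft()
        order.append(path)
        for ch in sorted(node):  # smallest character label first
            queue.append((path + ch, node[ch]))
    return order
```

For the keywords {smart, east, state}, level_order visits 14 nodes, level by level: the root, then “e” and “t”, then “et”, “tr”, “ts”, and so on. This is the same breadth-first shape as the node order listed above, although the node numbers themselves are not reproduced by this sketch.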
[0247]
The nodes “14”, “15”, and “16” in FIG. 52 are also loaded in the queue and subjected to the same processing, and strictly speaking, transitions by the terminal symbol “#” are defined. However, since the terminal symbols are not included in the output character string, the processing of these nodes is omitted here, and the transition to these nodes is also omitted in FIG. It is also possible not to define a transition for the terminal node by setting the value of the label of the terminal symbol in the array to 0.
[0248]
Next, an embodiment of a character string replacement device in which a character string replacement function is added to the collation device of the second embodiment will be described. Here, an example is described in which the input keywords {smart, east, state} are replaced with {SMART, EAST, STATE}, respectively.
[0249]
FIG. 58 shows a state transition array for character string replacement for the input keys. In FIG. 58, the array GOTO stores the superimposed state transition table, the array CHECK stores the confirmation labels, the array SUBST stores pointers to the replacement character strings, and the array LENGTH stores the lengths of the character strings before replacement. The initial state of the text offset storage array used for the replacement process is the same as that shown in FIG.
[0250]
The text offset for each symbol in the input text 'aceast' is as shown in FIG. Also, after pattern matching is performed on this input text, the text offset storage array is as shown in FIG. The collation process at this time is the same as in FIG. As a result of the collation, the position of the keyword “east” in the text is stored as the text offset “3”.
[0251]
The replacement process is performed in the same manner as in FIG. As a result of this replacement processing, the text “aaseast” is converted to “aasEAST”.
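The effect of the replacement step can be illustrated by the following minimal sketch. It is not the flowchart of the replacement process itself: the function name replace_matches and the representation of each match as a (start offset, length, replacement) tuple are assumptions, standing in for the information held in the text offset storage array and the arrays LENGTH and SUBST. Offsets are assumed here to be counted from 0.

```python
def replace_matches(text, matches):
    """Rebuild text with each matched span replaced.

    matches -- list of (start_offset, length, replacement) tuples,
               sorted by start_offset and non-overlapping (assumed).
    """
    pieces = []
    pos = 0
    for start, length, replacement in matches:
        pieces.append(text[pos:start])   # unchanged text before the match
        pieces.append(replacement)       # replacement character string
        pos = start + length             # skip the string before replacement
    pieces.append(text[pos:])            # unchanged tail
    return "".join(pieces)
```

For example, with the keyword “east” found at offset 3 and length 4, replace_matches("aaseast", [(3, 4, "EAST")]) returns "aasEAST".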
In the first and second embodiments described above, a state transition array obtained by compressing a DFA having a table structure is basically used, so that collation can be completed with the same number of transition operations as the number of input symbols, and the high speed of the DFA is maintained. At the same time, the storage capacity required to store the state transition array is much smaller than that of the conventional state transition table.
[0252]
In the examples described above, English text is input, but the present invention can also be applied to text written in a 2-byte character code, such as Japanese, by the following two methods.
[0253]
In the first method, a state transition array is first created by treating the keyword group in units of 1 byte. When one of the keyword patterns is detected in the text, the process goes back to the line feed symbol preceding the pattern. Next, it is confirmed whether or not the detected pattern is displaced by 1 byte within the sentence starting from the position following the line feed symbol.
[0254]
In the second method, a state transition array is created using the input 2-byte character code as one unit, and collation processing is performed.
For text in which 1-byte characters such as ASCII (American Standard Code for Information Interchange) and 2-byte characters such as Japanese EUC (Extended UNIX Code) are mixed, when a shift is performed in the second embodiment, there is a possibility that collation is performed at a position displaced by 1 byte within a 2-byte character.
[0255]
Therefore, for such text, it is necessary, at the stage of accepting a pattern, to go back to the beginning of the sentence containing the pattern and to check, with reference to that beginning position, whether or not the detected pattern is misaligned.
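The misalignment check can be illustrated by the following minimal sketch. The function name is_aligned is hypothetical, and the sketch makes a simplifying assumption about the encoding: every byte of 0x80 or above is taken to begin a 2-byte character, and every other byte is a 1-byte character. Multi-byte forms introduced by special prefix bytes in Japanese EUC are not handled.

```python
def is_aligned(data: bytes, sentence_start: int, match_start: int) -> bool:
    """Return True if match_start falls on a character boundary.

    Walks forward from the beginning of the sentence, one character at a
    time, and checks whether the walk lands exactly on match_start.
    """
    pos = sentence_start
    while pos < match_start:
        # Assumption: a byte >= 0x80 starts a 2-byte character.
        pos += 2 if data[pos] >= 0x80 else 1
    return pos == match_start
```

For the byte string b"a\xa4\xa2\xa4\xa4" (one 1-byte character followed by two 2-byte characters), a pattern detected at offset 1 or 3 is aligned, whereas a pattern detected at offset 2 begins in the middle of a 2-byte character and would be rejected.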
[0256]
【Example】
Hereinafter, a comparison result of the storage capacity and search speed at the time of character string search between the conventional collation apparatus and the collation apparatus of the present invention will be described.
[0257]
First, the case of keyword search based on the AC method will be compared. FIG. 61 shows the change in memory usage in the conventional method and in the method of the present invention. In FIG. 61, the vertical axis of the graph represents the memory usage, and the horizontal axis represents the number of keywords. “AC + ours” corresponds to the collation apparatus according to the first embodiment of the present invention, which is based on the DFA version of the AC method, and “DFA of AC” corresponds to the collation apparatus using the conventional state transition table.
[0258]
As can be seen from FIG. 61, as the number of keywords increases, the memory usage greatly increases in the conventional method, but does not increase so much in the method of the first embodiment of the present invention.
[0259]
FIG. 62 shows a change in search speed of each collation device when a keyword search is performed on a 75 Mbyte text. In FIG. 62, the vertical axis represents the time (seconds) required for the search, and the horizontal axis represents the number of keywords. Here, for comparison, the search speed of the conventional AC method is added as “AC”.
[0260]
As shown in FIG. 62, as the number of keywords increases, the search speed decreases in the conventional AC method. However, it can be seen that the search speed is kept high in the DFA version of the conventional AC method and in the method of the first embodiment. Furthermore, the difference in search time between “AC + ours” and “DFA of AC” is slight.
[0261]
Next, a comparison will be made regarding the case of keyword search based on the FAST method. The storage areas required in the collation apparatus according to the second embodiment of the present invention are a binary tree area, a shift amount list area for specific characters, and a compressed state transition array area. In contrast, the storage area required in the conventional FAST method collation apparatus consists of pointer areas whose number is determined by the number of possible input symbols and the number of states.
[0262]
FIG. 63 shows changes in memory usage in the conventional method and the method of the present invention. In FIG. 63, the vertical axis of the graph represents the memory usage, and the horizontal axis represents the number of keywords. “NORMAL” corresponds to a collation device using the conventional FAST method, and “COMPRESS” corresponds to the collation device of the second embodiment of the present invention.
[0263]
As can be seen from FIG. 63, as the number of keywords increases, the memory usage greatly increases in the conventional method, but hardly increases in the method of the second embodiment of the present invention.
[0264]
【Effect of the Invention】
According to the present invention, it is possible to create a collation table that is efficient in terms of both speed and storage capacity for a given keyword group. Further, by using the table, it is possible to efficiently collate and search for a plurality of specified keywords in a digitized document.
[0265]
In addition, such character string collation processing makes functions of a word processor, such as character string search and character string replacement, more efficient. It also makes it possible to improve functions of a full-text search device, such as character string search of a full-text search index and character string replacement (for example, parenthesis processing) at the display stage, and functions of a database, such as character string search and character string replacement at the display stage.
[Brief description of the drawings]
FIG. 1 is a principle diagram of a collation device of the present invention.
FIG. 2 is a configuration diagram of a collation system.
FIG. 3 is a flowchart of the operation of the collation system.
FIG. 4 is a configuration diagram of an information processing apparatus.
FIG. 5 is a configuration diagram of a collation device.
FIG. 6 is a diagram showing a first state transition table before compression.
FIG. 7 is a diagram illustrating a first state transition.
FIG. 8 is a diagram illustrating a compressed first state transition table.
FIG. 9 is a diagram showing a first character code conversion table.
FIG. 10 is a flowchart of first collation processing.
FIG. 11 is a diagram illustrating a first collation operation.
FIG. 12 is a configuration diagram of a first compression device.
FIG. 13 is a flowchart of first binary tree creation processing.
FIG. 14 is a diagram illustrating a first binary tree.
FIG. 15 is a flowchart of processing for adding failure to a binary tree.
FIG. 16 is a diagram illustrating a first binary tree with a failure.
FIG. 17 is a diagram illustrating an example of failure calculation.
FIG. 18 is a flowchart of processing for adding “gotoin” and “goout” to a binary tree.
FIG. 19 is a diagram illustrating a first binary tree with “gotoin” and “goout”.
FIG. 20 is a flowchart (part 1) of the first conversion process.
FIG. 21 is a flowchart (part 2) of the first conversion process.
FIG. 22 is a diagram showing a first arrangement.
FIG. 23 is a diagram showing a first pattern.
FIG. 24 is a diagram showing a second arrangement.
FIG. 25 is a diagram showing a third arrangement.
FIG. 26 is a diagram showing a fourth arrangement.
FIG. 27 is a diagram showing a second pattern.
FIG. 28 is a diagram showing a fifth arrangement.
FIG. 29 is a diagram showing a third pattern.
FIG. 30 is a diagram showing a sixth arrangement.
FIG. 31 is a diagram showing a fourth pattern.
FIG. 32 is a diagram showing a seventh arrangement.
FIG. 33 is a diagram showing a first state transition table for character string replacement.
FIG. 34 is a diagram showing a text offset storage array.
FIG. 35 is a diagram showing a text offset of the first text.
FIG. 36 is a diagram showing a first text offset storage array after pattern matching.
FIG. 37 is a flowchart of a replacement process.
FIG. 38 is a diagram showing a second state transition table before compression.
FIG. 39 is a diagram illustrating a second state transition.
FIG. 40 is a diagram illustrating a compressed second state transition table.
FIG. 41 is a diagram illustrating a second character code conversion table.
FIG. 42 is a flowchart of second collation processing.
FIG. 43 is a diagram illustrating a second collation operation.
FIG. 44 is a configuration diagram of a second compression device.
FIG. 45 is a flowchart of second binary tree creation processing.
FIG. 46 is a flowchart of pre-processing for shift calculation.
FIG. 47 is a diagram illustrating a second binary tree.
FIG. 48 is a flowchart of first shift amount calculation processing.
FIG. 49 is a flowchart of second shift amount calculation processing.
FIG. 50 is a flowchart of third shift amount calculation processing.
FIG. 51 is a diagram illustrating a second binary tree that is shifted with failure.
FIG. 52 is a diagram illustrating a final second binary tree.
FIG. 53 is a flowchart (part 1) of a second conversion process.
FIG. 54 is a flowchart (part 2) of the second conversion process.
FIG. 55 is a diagram showing information on a binary tree.
FIG. 56 is a diagram showing a fifth pattern.
FIG. 57 is a diagram showing a sixth pattern.
FIG. 58 is a diagram showing a second state transition table for character string replacement.
FIG. 59 is a diagram showing a text offset of the second text.
FIG. 60 is a diagram showing a second text offset storage array after pattern matching.
FIG. 61 is a diagram showing changes in memory usage in the AC method.
FIG. 62 is a diagram showing a change in speed in the AC method.
FIG. 63 is a diagram showing changes in memory usage in the FAST method.
FIG. 64 is a diagram illustrating an example of an AC method pattern matching machine.
FIG. 65 is a diagram illustrating an operation example of an AC method.
FIG. 66 is a diagram showing an AC method DFA.
FIG. 67 is a diagram illustrating an operation example of AC method DFA.
FIG. 68 is a diagram illustrating an example of a FAST method pattern matching machine.
FIG. 69 is a diagram showing transitions and shifts in the FAST method.
FIG. 70 is a diagram illustrating an example of collation by the FAST method.
[Explanation of symbols]
1 State transition storage means
2 Collation means
101 Compressor
102 Collation device
103 Keyword input section
104 Sparse array finite state machine creation part
105 State transition machine compression unit
106 State transition machine for collation
107 Text input part
111 CPU
112 Memory
113 Input device
114 Output device
115 External storage device
116 Medium drive device
117 Network connection device
118 Bus
119 Portable recording medium
120 Information provider device
121 State transition determination unit
122 State transition part
123 Compression state transition part
124 Confirmation label
125 Output symbol part
131, 141 Binary tree conversion unit
132 Transition addition part
133, 144 Conversion unit
142 Pre-processing section
143 Shift amount calculation part

Claims

A collation device that uses a given symbol string as a key and determines, by using a finite state machine, whether or not the key exists in a file, comprising:
state transition storage means for storing, in a compressed array format, a sparse state transition table, which is a state transition table of the finite state machine defining a collation operation relating to at least one key and from which data representing a transition operation from a current state to an initial state is excluded; and
collation means for performing, with reference to the sparse state transition table, an operation corresponding to each symbol included in the file, and collating a symbol string in the file with the one or more keys,
wherein the collation means checks whether or not an operation for an input symbol input from the file is defined in the sparse state transition table and, when an operation for the input symbol is not defined, performs a transition operation to the initial state.

The collation apparatus according to claim 1, wherein the state transition storage means stores the sparse state transition table excluding data representing a transition operation from the current state to a next state of the initial state, and the collation means, when a transition for an input symbol is not defined in the sparse state transition table, performs a transition operation to the next state of the initial state via the initial state.

The collation apparatus according to claim 1, further comprising: storage means for storing an output symbol array that stores data specifying the key, among the one or more keys, that corresponds to an input symbol; and output means for outputting the key corresponding to the input symbol based on the data in the output symbol array.

The collation apparatus according to claim 1, further comprising: storage means for storing a replacement symbol array that stores data specifying a replacement symbol string to be output in place of the key, among the one or more keys, that corresponds to an input symbol; and output means for outputting the replacement symbol string corresponding to the input symbol based on the data in the replacement symbol array.

A word processor device comprising the collation device according to claim 1 and performing at least one of a symbol string search function and a symbol string replacement function using the collation device.

A database system comprising the collation device according to claim 1 and performing at least one of a symbol string search function and a symbol string replacement function using the collation device.

A full-text search device comprising the collation device according to claim 1, wherein the full-text search device performs at least one of a symbol string search function and a symbol string replacement function using the collation device.

A collation device that uses a given symbol string as a key and creates a finite state machine for determining whether or not the key exists in a file, comprising:
sparse finite state machine creation means for creating binary tree data representing at least one key and creating, based on the binary tree data, an intermediate-structure finite state machine corresponding to a sparse state transition table from which data representing a transition operation from a current state to an initial state is excluded; and
state transition machine compression means for converting the intermediate-structure finite state machine into a compressed array format,
wherein, with reference to the sparse state transition table converted into the array format, it is checked whether or not an operation for an input symbol input from the file is defined, and, when an operation for the input symbol is not defined, a transition operation to the initial state is performed.

The collation apparatus according to claim 8, wherein the sparse finite state machine creation means includes binary tree data generation means for expressing the one or more keys as the binary tree data, and transition adding means for adding, to the binary tree data, information on failure transition destinations other than the initial state and the next states of the initial state, and the state transition machine compression means creates, based on the binary tree data received from the transition adding means, a state transition array compressed so that a plurality of elements do not overlap each other.

A collation device that uses a given symbol string as a key and determines, by using a finite state machine, whether or not the key exists in a file, comprising:
sparse finite state machine creation means for creating binary tree data representing at least one key and creating, based on the binary tree data, an intermediate-structure finite state machine corresponding to a sparse state transition table from which data representing a transition operation from a current state to an initial state is excluded;
state transition machine compression means for converting the intermediate-structure finite state machine into a compressed array format;
state transition storage means for storing the sparse state transition table converted into the array format; and
collation means for performing, with reference to the sparse state transition table, an operation corresponding to each symbol included in the file, and collating a symbol string in the file with the one or more keys,
wherein the collation means checks whether or not an operation for an input symbol input from the file is defined in the sparse state transition table and, when an operation for the input symbol is not defined, performs a transition operation to the initial state.

A computer-readable recording medium recording a program for a computer that uses a given symbol string as a key and determines, by using a finite state machine, whether or not the key exists in a file,
the program causing the computer to realize a function of: referring to a sparse state transition table, which is a state transition table of the finite state machine defining a collation operation relating to at least one key and from which data representing a transition operation from a current state to an initial state is excluded; checking, when performing an operation corresponding to each symbol included in the file and collating a symbol string in the file with the one or more keys, whether or not an operation for an input symbol input from the file is defined in the sparse state transition table; and performing a transition operation to the initial state when an operation for the input symbol is not defined.

A computer-readable recording medium recording a program for a computer that creates a finite state machine for determining whether or not a key exists in a file, using a given symbol string as the key, the program causing the computer to realize:
a function of creating binary tree data representing at least one key;
a function of creating, based on the binary tree data, an intermediate-structure finite state machine corresponding to a sparse state transition table from which data representing a transition operation from a current state to an initial state is excluded; and
a function of converting the intermediate-structure finite state machine into a compressed array format so that, with reference to the sparse state transition table, it is checked whether or not an operation for an input symbol input from the file is defined and, when an operation for the input symbol is not defined, a transition operation to the initial state is performed.