JP3596696B2 - Information retrieval device

Info

Publication number: JP3596696B2
Application number: JP26042095A
Authority: JP
Inventors: 正和藤本
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1995-10-06
Filing date: 1995-10-06
Publication date: 2004-12-02
Anticipated expiration: 2015-10-06
Also published as: JPH09101965A

Description

【０００１】
【発明の属する技術分野】
本発明は、検索キーに基づいた検索を行なう情報検索装置に関し、特にワイルドカード検索、部分一致検索等の柔軟な検索の可能な情報検索装置に関する。
【０００２】
【従来の技術】
キーに基づいた完全一致検索における高速な検索方法としてハッシュ法が良く知られている。しかし、ハッシュ法では、データ構造上、ワイルドカード検索などの柔軟な検索が困難であるとされ、ハッシュ法に関しては、衝突による検索速度の低下を抑えるために提案されたもの（例えば特開昭６３−２７１５２５号公報、特開平０４−３５８２６６号公報など）、ハッシュ法を適用することによって装置の高速化を行うために提案されたもの（例えば特開平０５−０６１９１０号公報）などが見受けられるものの、部分一致検索に関する提案は全くなされておらず、これまでは、ハッシュ法を用いた検索装置で柔軟な検索を行うためには、全文検索などの他の方法を併用しなければならなかった。
【０００３】
【発明が解決しようとする課題】
ハッシュ法では実データの他に、少なくともハッシュ表を必要とする。ハッシュ表は、キーワードの異なり語数をＮとして、おおよそポインタ長＊Ｎの大きさになる。ここで、部分一致検索のために他の検索方法を併用すると、さらに別途インデックスデータを持つ必要があり、データサイズが非常に大きくなるという欠点がある。また、実データを直接参照する全文検索方法を併用すれば、余分のインデックスデータは必要ないものの、全データを参照するため、実データのサイズが大きくなると検索速度が非常に遅くなり、実用的な速度が出せないという欠点がある。
【０００４】
本発明は、これらの問題を解決するため、ハッシュ法において、データ構造の追加や全データの参照を行なわずに、ワイルドカード検索などの柔軟な検索が可能な検索装置を提供することを目的とする。
【０００５】
【課題を解決するための手段】
本発明に係わる情報登録方法および情報検索方法では、部分一致検索の検索キーを構成する各要素の情報によってハッシュ値の値域を限定するハッシュ関数を用いるようにし、限定された値域に属する全てのハッシュ値に対して検索を行う手段を設ける。複数のハッシュ値に対する検索は、繰り返し検索を行なうように構成してもよいし、並列計算によって検索を行なうように構成してもよい。
【０００６】
【作用】
データの登録の際に、ハッシュ値がキーを構成する要素に対応した値域内に入ることを保証するハッシュ関数を用いる。そして、部分一致検索において、部分キーが与えられると、ハッシュ関数が部分キーの構成要素に対応する値域を求める。続いて、この値域内の全てのハッシュ値を使用して検索する。この値域外には、与えられた部分キーを含むキーが登録されていないため、余分なデータに対して検索を行なう必要がなく、さらにデータ構造の追加も必要としない。このようにしてハッシュ法において、完全一致検索のみならず、部分一致検索をも実現することが可能になる。
【０００７】
すなわち、検索を行なう際にハッシュ表を参照するためのハッシュ値が、たとえば、０〜１５であるとき、０〜１５の全ての数について検索すれば全レコードが検索できるが、ここで、ハッシュ値の値域をたとえば奇数に限定すれば検索量は半分となる。このようにハッシュ関数を工夫することにより部分一致検索が可能となる。
【０００８】
【発明の実施の形態】
図１に、本発明のハッシュ検索方法の原理を説明するための簡略化したハッシュ関数の例を示す。また、図２に、図１のハッシュ関数を用いて検索を行なう例を示す。
【０００９】
図１の例では、検索キーを構成する要素がＡ，Ｂ，Ｃ，Ｄからなるものとする。ハッシュ関数として、Ａ，Ｂ，Ｃ，Ｄのそれぞれに対して、ハッシュ値の１ビットを割り当て、キーの要素に「Ａ」がある場合には、ビットフィールドＡを１にし、キーの要素に「Ａ」がない場合には、ビットフィールドＡを０にする関数を用いる。これは、「Ａ」が存在する場合のハッシュ値が８から１５の間の値になり、「Ａ」が存在しない場合のハッシュ値が０から７の間の値に限定されることを意味しており、０と１は逆転していても同様の効果がある。また、同様なハッシュ値の値域の限定ができればビット操作である必要はない。データの登録および完全一致検索では、ビットフィールドＡ，Ｂ，Ｃ，Ｄの全てにこの演算を行い、例えば「ＡＢＣ」に対するハッシュ値は、２進数で１１１０すなわち１４とする。続いて、図２に示すように、ハッシュ表を参照し、ハッシュ値１４に対応する「さしすせそ・・・」というデータが検索される。
【００１０】
図１の例で「ＡＢ＊」は部分キーを表し、キーの一部にＡが含まれ、かつ、キーの一部にＢが含まれるもの全てを検索することを示している。これは、「Ａ」が検索キーに存在し、かつ、「Ｂ」が検索キーに存在するという条件であり、図１のハッシュ関数の例では、「Ａ」が存在する場合のハッシュ値が、必ず８から１５の間の値になり、「Ｂ」が存在する場合のハッシュ値が、必ず４から７の間の値か１２から１５の間の値のどちらかになるという性質が利用できる。この２つの条件の論理積をとると、「ＡＢ＊」に対応するハッシュ値は、１２，１３，１４，１５の４つであることがわかる。
【００１１】
実際のハッシュ値を求める演算は、確定した「Ａ」および「Ｂ」に対するビットフィールド、すなわちビットフィールドＡとビットフィールドＢを１とし、確定しない要素に対するビットフィールドは０と１の全ての組み合わせをとればよい。この結果、１２，１３，１４，１５の４つが求められる。それぞれについて、図２に示すように、ハッシュ表を参照し、「あいうえお・・・」「かきくけこ・・・」「さしすせそ・・・」「たちつてと・・・」というデータが検索される。
【００１２】
以下に、本発明のハッシュ検索方法を電子辞書に適用する実施例を用いた動作を説明する。図３は、本発明のハッシュ検索方法を適用した電子辞書の構成例である。図４は、本発明のハッシュ検索方法を電子辞書に適用するためのデータ構造の構成例である。また図５は、動作の概略のフローを示す。
【００１３】
図３に示す電子辞書は、表示画面上に各種のデータなどを表示するＣＲＴ２１と、前記ＣＲＴ２１での表示を制御するＣＲＴドライバ２２と、コマンドや文字列、数値などの入力を行うキーボード２３と、ポインティングデバイスであるマウス２４と、ユーザーによるキーボード２３やマウス２４の操作によって、各種のデータを出力するキーボード／マウスドライバ２５と、ディスク装置２６、ディスク装置ドライバ２７、主記憶装置２８、ＣＰＵ（中央処理装置）２９とから構成されている。
【００１４】
ディスク装置２６は、大量のデータを格納するための二次記憶装置であり、後述するチェイン付きインデックスや実データファイルなどが格納されている。また、ディスク装置２６のデータの入出力は、ディスク装置ドライバ２７で制御されている。
【００１５】
主記憶装置２８は、アプリケーションプログラム、及びキーボード２３やマウス２４から入力された文字や数値などのデータのほか、後述するハッシュ表を格納している。また、後述するフローチャートを実現するプログラムも格納されている。
【００１６】
ＣＰＵ２９は、システム全体の制御を行うと共に、各種の命令に基づいて所定のデータに対する演算処理を行う回路であり、後述のフローチャートに基づいてデータの検索処理を実行する。実際には、主記憶装置２８に格納されているプログラムに従って、ＣＰＵ２９がフローチャートの処理を実行する。なお、この検索処理を実現するための基本的な構成及び動作は、特開平０５−０２８１９４号公報などに開示されているものと同様なものである。
【００１７】
上記電子辞書におけるデータ構造の概要を図４に示す。図４のデータ構造は、基本的には、ハッシュ表１１、チェイン付きインデックス１２、実データファイル１３の３つから構成されている。
【００１８】
本実施例の電子辞書においては、図４に示すように、通常の見出し語（「ａ」，「Ａ」など）からの検索に加え、転置キーの設定により単語の語義（「ひとつの」、「イ音」など）や発音のカタカナ表記（「エ」、「ア」など）からの検索を行えるようになっており、このため、キー／レコード数ともに非常に多く、またあるキーに対するレコードも一意ではない。
【００１９】
最初に上記電子辞書におけるチェイン付きハッシュ法による検索の基本的な動作について説明する。
【００２０】
まず、ハッシュ表１１は、検索キーｋのハッシュ値ｈ（ｋ）が指すアドレスよりチェイン付きインデックス１２へのポインタを、たとえば、３バイトで保持する。対応するキーが未登録の場合、ＦＦＦＦＦＦＨを保持する。但し、Ｈは１６進表示を意味する。つまり、ＦＦＦＦＦＦＨとは、２４ビット全てが“１”であることを表している。
【００２１】
次に、チェイン付きインデックス１２の詳細を説明する。チェイン付きインデックス１２は、データレコードに一対一で対応する情報を持つインデックスレコードの集合である。インデックスレコードの構造を図６に示す。インデックスレコードは、レコードに設定された全てのキーに関するチェインポインタｌ〜ｎ（ＣＰ_ｌ−ｎ）と、キー識別子１〜ｎ（ＫＤ_ｌ−ｎ）のペアの並び、及びデータファイルヘのポインタ（データポインタ（ＤＰ））を保持している。なお、キー識別子とは、登録語そのもののコピーである。
【００２２】
チェインポインタは、対応するキーに関する次のインデックスレコードへのポインタであり、衝突により同じハッシュ値を持つ登録キーのリスト（データのつながり）が構成される。
【００２３】
ここで衝突とは、異なる検索キーがハッシュ関数により同じハッシュ値を持つことを意味する。この場合の解決策として、本実施例ではチェインポインタを採用している。
【００２４】
各リストの先頭はハッシュ表１１から直接指されており、衝突がある場合、各ポインタは図７に示すように、次のチェインポインタのアドレスを保持し、リストの終端の場合にはｎｉｌとして００００００Ｈが格納される。チェインポインタは、たとえば、３バイトで表現され、００００００Ｈから７ＦＦＦＦＦＨの値を取り得る。
【００２５】
キー識別子は、チェインポインタの直後に存在し、入力キーが登録キーに対応するか否か、すなわちハッシュ値の衝突の検出に用いられる。この実施例におけるキー識別子の記述ルールを以下に示す。
【００２６】
（ｉ）通常の検索キーにおいては、キー識別子の開始を示す８１Ｈに続いて登録キーの文字列を格納し、続いてキー識別子の終了を示す８２Ｈを格納する。
【００２７】
（ｉｉ）登録キーがチェインの直前と同じである場合は省略する。省略された場合は、キー識別子のあるべき位置に、次のチェインポインタまたはデリミタが格納されている。チェインポインタの１バイト目は最大で７ＦＨであり、デリミタはＦＦＨであるため、省略された場合でも、キー識別子がある場合の８１Ｈとは明白に区別が可能である。
【００２８】
このようにキー識別子を導入することにより、衝突のチェックのために実データを参照する必要がなくなる。なお、実データにキーワードを格納するフィールドを設け、実データのキーワードフィールドを参照して、衝突のチェックを行うように構成しても構わない。
【００２９】
データポインタには、実データにおける実際のデータレコードの先頭を指すアドレスが、デリミタＦＦＨに続けて３バイトで格納されている。データポインタがＦＦＦＦＦＦＨである場合は、データレコードが削除されていることを示す。
【００３０】
実データファイル１３はデータレコードの集合であり、データレコードは次の形式を持つ。
【００３１】
（見出し語）（見出し区切り［ＮＵＬＬ］）（内容部）（レコード区切り［ＬＦ：ＬｉｎｅＦｅｅｄ］）
キー識別子に前述のように８０Ｈが用いられた場合、このデータレコードの見出し語を参照することでキーの識別を行う。チェイン付インデックスファイルのキー識別子８０Ｈが使用されている場合のみ実データファイルを参照する。
【００３２】
実データファイル１３（内容部）は、実際の辞書記述部分であるが、この実施例ではキー識別子により検索キーの情報を含まないため、この内部にフィールドなどの概念は不要であり、内容はフラットなテキストでよい。ここでフラットとは、フィールド区切りの記号を含まない文字コードだけからなるデータを意味する。データレコードは全体で一つのテキストファイルとなる。
【００３３】
次に、上述した電子辞書によるデータ検索のアルゴリズムを、図８のフローチャートを用いて詳細に説明する。
【００３４】
まず、初期化（ステップ１０１）の後、検索キーのハッシュ値ｈを求め、ハッシュ表の位置ｈの内容をインデックスポジションｉｐとして読み込む（ステップ１０２）。なお、ここで初期化とは、内部で使用するフラグ等を初期化し、また、キー識別子のバッファもクリアすることを意味する。次に、ｉｐ＝ＦＦＦＦＦＦＨであるかどうかを判定する（ステップ１０３）。ここで、ｉｐ＝ＦＦＦＦＦＦＨであれば未登録キーとわかるので処理を終了する。また、ｉｐ＝ＦＦＦＦＦＦＨでないときは、チェイン付きインデックス１２上の位置ｉｐから３バイトをチェインポジションｃｐとしてチェイン付インデックスのファイルから読み込み（ステップ１０４）、チェインポジションｃｐの次のバイトの値（キー識別子の格納位置）≧８０Ｈかどうかを判定する（ステップｌ０５）。
【００３５】
８０Ｈ未満またはＦＦＨの場合には、チェインの前のものと同じであるので前のものを使用する。また、８０Ｈ以上の場合には、キーワードを読み込む（ステップ１０６）。このようにして衝突の判別を行なう。次に、ｋｒに基づいて現在のインデックスレコードがキーに対応するか否かを判定し（ステップ１０７）、対応するときはチェイン付きインデックス１２上のＦＦＨまでスキップし、続く３バイトをデータポインタｄｐとして読み込む（ステップ１０８）。なお、ステップ１０７における判定は、ハッシュ値を求めた入力文字列の下位バイトと８０Ｈの論理和を求め、キー識別子と一致するか否かを判別することにより行なう。続いて、ｄｐ＝ＦＦＦＦＦＦＨであるかどうかを判定する（ステップｌ０９）。ここで、ｄｐ＝ＦＦＦＦＦＦＨでなければデータレコードは存在するので、データファイル上の位置ｄｐから、データレコードのフォーマットに従い、０ＡＨ（＝ＬＦ）までを結果のリストに追加する（ステップ１１０）。次に、ｃｐ＝０かどうかを判定し（ステップ１１ｌ）、ｃｐ＝０であるなら終了、そうでなければチェインが続いているので、ｉｐにｃｐを代入して（ステップ１１２）、ステップ１０４へ戻る。
【００３６】
次に上述したようなチェイン付きハッシュ法による検索を行なう電子辞書において、部分一致検索を行なう場合の動作について説明する。
【００３７】
本実施例では、キーワードを構成する文字の出現位置（１文字目，２文字目，．．．）を３で割ったときの剰余と、キーワードを構成する文字のコード（キャラクタコード）を５で割ったときの剰余からハッシュ値を求めるものとする。具体的には、１５ビットのハッシュ値を、ｎを０以上の整数とした場合の、（３ｎ＋１）文字目、（３ｎ＋２）文字目、（３ｎ＋３）文字目の５ビットづつのフィールドに区切り、文字コードをｘとして、各５ビットの値を、２を（ｘｍｏｄ５）乗したもの（２^{ｘｍｏｄ５}）の論理和とする。なお、ｘｍｏｄ５は、ｘの５で割ったときの剰余を表すものとする。
【００３８】
図９は、この方法によって、文字列「ａｂｃｄ」のハッシュ値を求める例を示す。完全一致検索およびデータ登録においては、この値をそのままハッシュ値として利用する。
【００３９】
図９に示す例においては、１文字目の’ａ’のキャラクタコード６１Ｈ（１６進数）（１０進数では９７）を５で割った余りが２であるので、２^２＝４である。従って、フィールド１は、００１００（２進数）（１０進数では４）と元のフィールド１の値０００００（２進数）との論理和をとり、００１００（２進数）となる。同様に、４文字目の’ｄ’のキャラクタコードが６４Ｈ（１０進数では１００）、これを５で割った余りが０であるので２^０＝１である。従って、００００１（２進数）とフィールド１（００１００）との論理和をとる。この結果、フィールド１は２進数で００１０１となる。
【００４０】
なお、本実施例では、先に説明したように、図４のチェイン付きインデックス内のチェインポインタに、キーに関する情報を示す識別子を付加し、このキー識別子によって衝突時のチェックを行うものとする。
【００４１】
部分一致検索においては、ビットが１になっている部分を１のまま固定し、ビットが０になっている部分を０と１の両方の場合の組合せとして、ハッシュ値の集合を求めて全てについて検索を行う。文字の出現位置を用いる際には、文字列の開始位置に注意する必要がある。例えば「＊ａｂｃｄ＊」という場合には、ａの出現位置が、「ａｂｃｄｅ」などのように（３ｎ＋１）文字目、「ｚａｂｃｄ」などのように（３ｎ＋２）文字目、「ｙｚａｂｃｄ」のように（３ｎ＋３）文字目の全ての場合が存在するので、１文字づつずらした３通りのパターンが必要である。
【００４２】
ハッシュ関数を用いて部分一致検索を行う場合の、概略フローを図５に示す。
【００４３】
先ず、与えられた検索キーのハッシュ値を求め（ステップ１５１）、ハッシュ表の該当する位置の内容を読み込む（ステップ１５２）。ここで、チェインポインタにキー識別子が付加されているときは、このキー識別子に基づいて衝突を検出し（ステップ１５３）、検索結果を格納する（ステップ１５４）。全ハッシュ値の検索が完了するまで、上記ステップ１５２〜１５４を繰り返し（ステップ１５５）、全ハッシュ値の検索が完了したら検索結果を出力表示する（ステップ１５６）。
【００４４】
次に、ハッシュ関数を用いて部分一致検索を行う場合の、詳細なフローを図１０に示す。部分キーが、図９に示すように、「ａｂｃｄ」である場合を例に取り、動作を説明する。なお、ここでは、ビットが１になっている部分を「固定ビット」、ビットが０になっている部分を「不定ビット」と呼んでいる。
【００４５】
Ｓ１：まず、部分キーが与えられると、部分キーの文字開始位置を初期化する。すなわち、文字開始位置をまず（３ｎ＋１）文字目とする。ハッシュ値はｎに無関係であるので、ｎ＝０，開始位置＝１とする。
【００４６】
Ｓ２：続いて、部分キーにより固定されるビットを算出する。開始位置が１である場合は、図９に示されるように、フィールド１のビット２、フィールド１のビット０、フィールド２のビット３、フィールド３のビット４が固定されるビットである。
【００４７】
Ｓ３：次に、この固定ビットの列を、固定ビット列リストとして保存する。これは、文字の開始位置をずらして検索する際に、固定ビットと不定ビットを組み合わせると、既に検索したハッシュ値と同じになる場合があるため、以前に検索した固定ビットの列と比較することによって２度検索する手間を省くためである。
【００４８】
Ｓ４：次に、不定ビットを初期値として全て１にする。すなわち、フィールド１のビット２、フィールド１のビット０、フィールド２のビット３、フィールド３のビット４を除く全てのビットを１にする。以下、不定ビットを、固定ビットのビット数を除いたビット数の一つの数として扱う。これによって、例えば、１５ビットから固定ビットである４ビットを除いた１１ビットの０と１の組合せは、０から２０４７（１０進数）までの２進数で網羅することができる。
【００４９】
Ｓ５：不定ビットと固定ビットを連結する。すなわち、不定ビットを表す２進数に、固定ビットを挟みこむ。図１１に、不定ビットと固定ビットを連結する例を示す。すなわち、Ｓ４または後述するＳ７ｂで作成された不定ビット数を、本来のビット位置に置き、固定ビットを挿入したものである。
【００５０】
Ｓ６：連結によって生成されたハッシュ値が、最小値以下になったらＳ１２に飛ぶ。ここで、検索キーが１文字もないということはあり得ないので、１文字目のフィールドに一つも１になるビットがない状況はあり得ない。したがって、大きな数値から検索を始めて、１文字目のフィールドが０になった時点を最小値、すなわち、ハッシュ値を求める終了条件としている。
【００５１】
Ｓ７：Ｓ３において保存されている一つ前の文字開始位置までの固定ビットリストとハッシュ値を比較し、固定ビットリストの固定ビットと同じビット位置にあるハッシュ値のビットが全て１の場合は、以前に検索したハッシュ値のパターンに含まれるので（Ｓ７ａ）、次の不定ビットを求め（Ｓ７ｂ）、Ｓ５に戻る。なお、Ｓ７ａの判定は、不定ビットの０，１の組合せにより、既に検索した組合せが出現する可能性が有るので、重複して検索しないようにするための判定である。またＳ７ｂでは、不定ビット列が表す数値から１を減じた値を新たに不定ビット列とする。
【００５２】
Ｓ８：求めたハッシュ値によりハッシュ表を読み出す。
【００５３】
Ｓ９：インデックス内に登録語がなければ（Ｓ９）、該当するデータレコードがないので、Ｓ７ｂと同じ処理で不定ビットを変更して次の不定ビットを求め（Ｓ７ｂ）、Ｓ５に戻る。
【００５４】
Ｓ１０：キー識別子により、部分キーが含まれるかどうかを比較し、部分文字列が含まれる場合は、検索結果のリストに登録する（Ｓ１０ａ）。
【００５５】
Ｓ１１：次のインデックス内の次のチェインポインタヘスキップし、Ｓ９へ戻る。
【００５６】
Ｓ１２：文字開始位置を１つずらす。
【００５７】
Ｓ１３：文字開始位置が３以下であれば、Ｓ２に戻る。これは、文字開始位置が１，２，３の場合の全てを検索するためである。
【００５８】
Ｓ１４：結果リストに登録された検索結果を表示する。
【００５９】
以上の手順で、ハッシュ法においても、部分一致検索が可能になる。さらに、前方一致検索の場合は、文字開始位置が１の場合のみを検索すればよい。また、後方一致検索の場合は、図１２に示すように、キー識別子による判別を行う際に、後方の文字から行うようにすればよい。また、前後方一致の場合は、前方一致の部分キーを文字開始位置を１とし、後方一致の部分キーをずらしてハッシュ値を求め、検索すればよい。同様にしてワイルドカードを＊として、「＊Ａ＊Ｂ＊」のような柔軟な検索も容易に実現できる。
【００６０】
なお、本実施例では、１つ１つのハッシュ値に対して、繰り返し検索を行うように構成しているが、各ハッシュ値を並列計算の可能な計算機を用いて、並列に検索しても構わない。
【００６１】
【発明の効果】
以上述べたように、本発明によれば、高速な検索方法でありながら、完全一致検索のみでワイルドカード検索などの柔軟な検索が不可能であったハッシュ法において、データ構造の追加も全データの参照をも行うことなく、部分一致検索が可能になる。さらに、本発明の検索方法では、前方一致、後方一致、前後方一致、中間一致等の全ての部分一致検索がデータ構造の追加なしに実現できる。
【図面の簡単な説明】
【図１】部分一致検索の可能なハッシュ関数の例を示す説明図である。
【図２】ハッシュ法における部分一致検索の原理を示す例である。
【図３】本発明のハッシュ検索方法を適用した電子辞書の構成例を示すブロック図である。
【図４】本発明のハッシュ検索方法を電子辞書に適用するためのデータ構造の構成例を示す説明図である。
【図５】本発明のハッシュ検索方法を適用した電子辞書の部分一致検索の概略のフローチャートである。
【図６】インデックスレコードの構造を示す図である。
【図７】インデックスレコードにおけるチェインポインタのリストを示す図である。
【図８】電子辞書によるデータ検索のアルゴリズムを示すフローチャートである。
【図９】部分一致検索の可能なハッシュ関数の例を用いてハッシュ値を求める例を示す説明図である。
【図１０】部分一致検索の実現例を示す詳細なフローチャートである。
【図１１】部分キーに対応する複数のハッシュ値の中の１つを求める例を示す説明図である。
【図１２】後方一致検索におけるキー識別方法の例を示す説明図である。
【符号の説明】
１１…ハッシュ表、１２…チェイン付きインデックス、１３…実データファイル、２１…ＣＲＴ、２２…ＣＲＴドライバ、２３…キーボード、２４…マウス、２５…キーボード／マウスドライバ、２６…ディスク装置、２７…ディスク装置ドライバ、２８…主記憶装置、２９…ＣＰＵ

[0001]
[Technical Field of the Invention]
The present invention relates to an information retrieval device that performs searches based on a search key, and in particular to an information retrieval device capable of flexible searches such as wildcard search and partial-match search.
[0002]
[Prior art]
The hash method is well known as a fast method for exact-match search based on a key. However, because of its data structure, the hash method is considered ill-suited to flexible searches such as wildcard search. Existing proposals concerning the hash method include techniques for limiting the drop in search speed caused by collisions (for example, Japanese Patent Laid-Open Nos. 63-271525 and 04-358266) and techniques for speeding up a device by applying the hash method (for example, Japanese Patent Laid-Open No. 05-061910), but no proposal addresses partial-match search. Until now, a search device using the hash method has had to be combined with another method, such as full-text search, in order to perform flexible searches.
[0003]
[Problems to be solved by the invention]
The hash method requires at least a hash table in addition to the actual data. The size of the hash table is roughly (pointer length) * N, where N is the number of distinct keywords. If another search method is used alongside it for partial-match search, separate index data is also needed, and the data size becomes very large. If instead a full-text search method that refers directly to the actual data is used, no extra index data is needed, but because all of the data is examined, the search becomes very slow as the actual data grows and a practical speed cannot be achieved.
[0004]
To solve these problems, an object of the present invention is to provide a search device that, within the hash method, can perform flexible searches such as wildcard search without adding data structures and without referring to all of the data.
[0005]
[Means for Solving the Problems]
In the information registration method and the information search method according to the present invention, a hash function is used that restricts the range of the hash value according to the information of each element making up the search key of a partial-match search, and a means is provided for searching all hash values that belong to the restricted range. The search over the multiple hash values may be performed iteratively or by parallel computation.
[0006]
[Action]
At the time of data registration, a hash function is used that guarantees that the hash value falls within the range corresponding to the elements making up the key. In a partial-match search, when a partial key is given, the hash function yields the range corresponding to the elements of the partial key, and the search is then performed using every hash value in this range. Because no key containing the given partial key is registered outside this range, there is no need to search any other data, and no additional data structure is required. In this way the hash method can support not only exact-match search but also partial-match search.
[0007]
That is, if the hash values used to reference the hash table are, for example, 0 to 15, searching all the numbers from 0 to 15 covers every record; if the range of hash values is restricted to, for example, the odd numbers, the amount of searching is halved. Devising the hash function in this way makes partial-match search possible.
[0008]
[Embodiments of the Invention]
FIG. 1 shows an example of a simplified hash function for explaining the principle of the hash search method of the present invention. FIG. 2 shows an example of performing a search using the hash function of FIG. 1.
[0009]
In the example of FIG. 1, the elements making up a search key are A, B, C, and D. The hash function assigns one bit of the hash value to each of A, B, C, and D: if "A" is among the elements of the key, bit field A is set to 1, and if "A" is not, bit field A is set to 0. This means that the hash value is between 8 and 15 when "A" is present and is confined to between 0 and 7 when "A" is absent. The same effect is obtained if 0 and 1 are reversed, and the operation need not be a bit operation as long as the range of the hash value can be restricted in a similar way. For data registration and exact-match search, this operation is applied to all of bit fields A, B, C, and D; for example, the hash value for "ABC" is 1110 in binary, that is, 14. Then, as shown in FIG. 2, the hash table is referenced and the data "sa-shi-su-se-so ..." corresponding to hash value 14 is retrieved.
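The one-bit-per-element hash of FIG. 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `hash4` and the constant `ELEMENTS` are hypothetical, and bit field A is taken to be the most significant bit, which is what the ranges 8 to 15 and 0 to 7 in the text imply.

```python
ELEMENTS = "ABCD"  # assumed order: bit field A is the most significant bit


def hash4(key: str) -> int:
    """Set one bit for each of A, B, C, D that occurs anywhere in the key."""
    value = 0
    for i, element in enumerate(ELEMENTS):
        if element in key:
            value |= 1 << (len(ELEMENTS) - 1 - i)
    return value


print(format(hash4("ABC"), "04b"))  # bit fields A, B, C set, D clear
```

Any key containing "A" sets the top bit, so its hash value necessarily lies in 8 to 15, which is the range restriction the paragraph describes.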
[0010]
In the example of FIG. 1, "AB*" denotes a partial key and specifies a search for every key that contains A somewhere and also contains B somewhere. The condition is that "A" is present in the search key and "B" is present in the search key. With the hash function of FIG. 1, the hash value is always between 8 and 15 when "A" is present, and is always either between 4 and 7 or between 12 and 15 when "B" is present. Taking the logical AND of these two conditions shows that the hash values corresponding to "AB*" are the four values 12, 13, 14, and 15.
[0011]
To compute the actual hash values, the bit fields for the determined elements "A" and "B", that is, bit field A and bit field B, are set to 1, and the bit fields for the undetermined elements take every combination of 0 and 1. This yields the four values 12, 13, 14, and 15. For each of them, as shown in FIG. 2, the hash table is referenced and the data "a-i-u-e-o ...", "ka-ki-ku-ke-ko ...", "sa-shi-su-se-so ...", and "ta-chi-tsu-te-to ..." are retrieved.
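The enumeration described above, fixing the bits of the known elements and trying every combination of the rest, might look like the sketch below. The name `candidate_hashes` is hypothetical, and bit field A is again assumed to be the most significant bit.

```python
from itertools import product

ELEMENTS = "ABCD"  # assumed order: bit field A is the most significant bit


def candidate_hashes(known: str) -> list[int]:
    """Hash values a key may have if it contains every element in `known`."""
    fixed = [element in known for element in ELEMENTS]
    free = [i for i, is_fixed in enumerate(fixed) if not is_fixed]
    results = []
    for combo in product((0, 1), repeat=len(free)):
        bits = [1 if is_fixed else 0 for is_fixed in fixed]
        for position, bit in zip(free, combo):
            bits[position] = bit          # undetermined fields take 0 and 1
        value = 0
        for bit in bits:                  # assemble, most significant first
            value = (value << 1) | bit
        results.append(value)
    return sorted(results)


print(candidate_hashes("AB"))
```

With two of the four bits fixed, only 4 of the 16 table slots need to be examined; each additional known element halves the candidate set.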
[0012]
Hereinafter, an operation using an embodiment in which the hash search method of the present invention is applied to an electronic dictionary will be described. FIG. 3 is a configuration example of an electronic dictionary to which the hash search method of the present invention is applied. FIG. 4 is a configuration example of a data structure for applying the hash search method of the present invention to an electronic dictionary. FIG. 5 shows a schematic flow of the operation.
[0013]
The electronic dictionary shown in FIG. 3 comprises a CRT 21 that displays various data on a display screen, a CRT driver 22 that controls display on the CRT 21, a keyboard 23 for entering commands, character strings, numerical values, and the like, a mouse 24 serving as a pointing device, a keyboard/mouse driver 25 that outputs various data in response to the user's operation of the keyboard 23 and the mouse 24, a disk device 26, a disk device driver 27, a main storage device 28, and a CPU (central processing unit) 29.
[0014]
The disk device 26 is a secondary storage device for storing a large amount of data, and stores a chained index, a real data file, and the like to be described later. The input and output of data to and from the disk device 26 is controlled by a disk device driver 27.
[0015]
The main storage device 28 stores an application program, data such as characters and numerical values input from the keyboard 23 and the mouse 24, and a hash table described later. Further, a program for realizing a flowchart described later is also stored.
[0016]
The CPU 29 is a circuit that controls the entire system and performs arithmetic processing on given data according to various instructions, and it executes the data search processing based on the flowcharts described later. In practice, the CPU 29 executes the flowchart processing in accordance with the program stored in the main storage device 28. The basic configuration and operation for realizing this search processing are the same as those disclosed in, for example, Japanese Patent Laid-Open No. 05-028194.
[0017]
FIG. 4 shows an outline of a data structure in the electronic dictionary. The data structure of FIG. 4 is basically composed of a hash table 11, a chained index 12, and a real data file 13.
[0018]
In the electronic dictionary of this embodiment, as shown in FIG. 4, in addition to searches from ordinary headwords ("a", "A", etc.), inverted keys make it possible to search from the meaning of a word ("ひとつの" [one], "イ音", etc.) and from the katakana rendering of its pronunciation ("エ", "ア", etc.). As a result, the numbers of keys and records are both very large, and the record for a given key is not unique.
[0019]
First, the basic operation of a search by the hash method with a chain in the electronic dictionary will be described.
[0020]
First, the hash table 11 holds, at the address indicated by the hash value h(k) of a search key k, a pointer into the chained index 12, stored in, for example, 3 bytes. If the corresponding key has not been registered, FFFFFFH is stored. Here H denotes hexadecimal notation; FFFFFFH means that all 24 bits are "1".
[0021]
Next, details of the chained index 12 will be described. The chained index 12 is a set of index records, each holding information that corresponds one-to-one to a data record. FIG. 6 shows the structure of an index record. An index record holds a sequence of pairs of chain pointers 1 to n (CP_1 to CP_n) and key identifiers 1 to n (KD_1 to KD_n), one pair for every key set on the record, followed by a pointer to the data file (the data pointer, DP). The key identifier is a copy of the registered word itself.
[0022]
The chain pointer is a pointer to the next index record for the corresponding key; it links the registered keys that share a hash value because of a collision into a list.
[0023]
Here, the collision means that different search keys have the same hash value by the hash function. As a solution in this case, the present embodiment employs a chain pointer.
[0024]
The head of each list is pointed to directly from the hash table 11. When there is a collision, each pointer holds the address of the next chain pointer, as shown in FIG. 7, and at the end of a list 000000H is stored as nil. The chain pointer is represented in, for example, 3 bytes and can take a value from 000000H to 7FFFFFH.
[0025]
The key identifier exists immediately after the chain pointer, and is used for detecting whether or not the input key corresponds to the registration key, that is, detecting collision of hash values. The description rule of the key identifier in this embodiment is shown below.
[0026]
(i) For a normal search key, 81H, marking the start of the key identifier, is followed by the character string of the registered key and then by 82H, marking the end of the key identifier.
[0027]
(ii) If the registered key is the same as that of the immediately preceding entry in the chain, the key identifier is omitted. In that case, the next chain pointer or the delimiter is stored where the key identifier would have been. Because the first byte of a chain pointer is at most 7FH and the delimiter is FFH, the omitted case can be clearly distinguished from the 81H that begins a key identifier.
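Rules (i) and (ii) can be illustrated with a small reader for one key identifier. This is a sketch under assumptions that the text does not state: the function name `read_key_identifier` is hypothetical, the key bytes are assumed not to contain 82H, and the caller is assumed to track the previously read key.

```python
KEY_START = 0x81  # start of a key identifier, rule (i)
KEY_END = 0x82    # end of a key identifier, rule (i)


def read_key_identifier(buf: bytes, pos: int, previous_key: bytes):
    """Return (key, next_pos) for the key identifier expected at buf[pos]."""
    if buf[pos] != KEY_START:
        # Rule (ii): identifier omitted. The byte here is either the first
        # byte of the next chain pointer (at most 7FH) or the delimiter FFH,
        # so the key of the preceding entry is reused and nothing is consumed.
        return previous_key, pos
    end = buf.index(KEY_END, pos + 1)
    return buf[pos + 1:end], end + 1


stored = bytes([KEY_START]) + b"cat" + bytes([KEY_END, 0xFF])
print(read_key_identifier(stored, 0, b""))
```

Because the three cases are distinguished by a single byte, the reader never needs to look at the actual data file to decide whether an identifier is present.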
[0028]
By introducing a key identifier in this way, it is not necessary to refer to actual data for checking a collision. A field for storing a keyword in the actual data may be provided, and a collision may be checked by referring to the keyword field of the actual data.
[0029]
In the data pointer, an address indicating the head of the actual data record in the actual data is stored in 3 bytes following the delimiter FFH. When the data pointer is FFFFFFH, it indicates that the data record has been deleted.
[0030]
The actual data file 13 is a set of data records, and the data records have the following format.
[0031]
(Headword) (Headword delimiter [NULL]) (Content) (Record delimiter [LF: Line Feed])
When 80H is used as the key identifier as described above, the key is identified by referring to the headword of this data record. The actual data file is referred to only when the key identifier 80H of the chained index file is used.
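As an illustration of the record format above (headword, NULL, content, LF), the following sketch reads one record starting at a given data pointer. The function name `read_record_at` is hypothetical, and UTF-8 text is assumed purely for the example; the source does not state the character encoding.

```python
def read_record_at(data: bytes, dp: int):
    """Read the record that starts at offset dp: headword NUL content LF."""
    end = data.index(b"\n", dp)                       # record delimiter (LF)
    headword, _, content = data[dp:end].partition(b"\x00")
    return headword.decode(), content.decode()


# Two records stored back to back in one flat text file.
data_file = b"a\x00one\nA\x00note\n"
print(read_record_at(data_file, 0))
print(read_record_at(data_file, 6))
```

Since the content part has no internal fields, a record is fully delimited by the first NULL and the terminating LF.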
[0032]
The actual data file 13 (content part) is the actual dictionary description. In this embodiment, because the key identifiers are used, it does not need to carry search-key information, so no notion of fields is required inside it and the content may be flat text. Here "flat" means data consisting only of character codes, with no field delimiter symbols. The data records as a whole form a single text file.
[0033]
Next, an algorithm of data search using the above-described electronic dictionary will be described in detail with reference to the flowchart of FIG.
[0034]
First, after initialization (step 101), the hash value h of the search key is computed, and the content at position h of the hash table is read as the index position ip (step 102). Initialization here means initializing the flags and other variables used internally and clearing the key identifier buffer. Next, it is determined whether ip = FFFFFFH (step 103). If ip = FFFFFFH, the key is known to be unregistered and the process ends. Otherwise, 3 bytes at position ip of the chained index 12 are read from the chained index file as the chain position cp (step 104), and it is determined whether the value of the byte following the chain position cp (where the key identifier is stored) is ≥ 80H (step 105).
[0035]
If the value is less than 80H or is FFH, the key is the same as for the previous entry in the chain, so the previous one is used. If it is 80H or more, the keyword is read (step 106). Collisions are distinguished in this way. Next, it is determined on the basis of kr whether the current index record corresponds to the key (step 107); if it does, the process skips ahead to FFH in the chained index 12 and reads the following 3 bytes as the data pointer dp (step 108). The determination in step 107 is made by ORing the low-order byte of the input character string whose hash value was computed with 80H and checking whether the result matches the key identifier. Next, it is determined whether dp = FFFFFFH (step 109). If dp is not FFFFFFH, the data record exists, so the data from position dp of the data file up to 0AH (= LF) is added to the result list in accordance with the data record format (step 110). Next, it is determined whether cp = 0 (step 111). If cp = 0, the process ends; otherwise the chain continues, so cp is assigned to ip (step 112) and the process returns to step 104.
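The chain walk of steps 101 to 112 can be sketched with in-memory structures in place of the byte-level files. This is a deliberately simplified model, not the on-disk format: the dictionaries `hash_table`, `index`, and `data`, the tuple layout `(cp, key_id, dp)`, and the function name `lookup` are all assumptions made for illustration, and the hash value `h` is passed in rather than computed.

```python
UNREGISTERED = 0xFFFFFF  # hash table entry for an unregistered key
NIL = 0x000000           # chain pointer value at the end of a list
DELETED = 0xFFFFFF       # data pointer value for a deleted record


def lookup(key, h, hash_table, index, data):
    """Follow the chain for hash value h and collect records matching key."""
    results = []
    ip = hash_table.get(h, UNREGISTERED)            # step 102
    if ip == UNREGISTERED:                          # step 103
        return results
    current_key = None
    while True:
        cp, key_id, dp = index[ip]                  # step 104
        if key_id is not None:                      # steps 105-106
            current_key = key_id                    # else reuse previous key
        if current_key == key and dp != DELETED:    # steps 107-109
            results.append(data[dp])                # step 110
        if cp == NIL:                               # step 111
            return results
        ip = cp                                     # step 112


hash_table = {5: 1}
index = {1: (2, "cat", 10), 2: (3, None, 11), 3: (NIL, "act", 12)}
data = {10: "feline", 11: "whip", 12: "deed"}
print(lookup("cat", 5, hash_table, index, data))
```

In this toy data, "cat" and "act" collide on hash value 5; the second index record omits its key identifier and therefore inherits "cat" from the entry before it.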
[0036]
Next, an operation in the case of performing a partial match search in an electronic dictionary performing a search by the hash method with a chain as described above will be described.
[0037]
In this embodiment, the hash value is derived from the remainder obtained when the position of each character in the keyword (first character, second character, ...) is divided by 3 and the remainder obtained when the character's code is divided by 5. Specifically, a 15-bit hash value is divided into three 5-bit fields, one each for the (3n+1)th, (3n+2)th, and (3n+3)th characters, where n is an integer of 0 or more. With x denoting the character code, the value of each 5-bit field is the logical OR of 2^(x mod 5) over the characters assigned to that field. Here x mod 5 denotes the remainder when x is divided by 5.
[0038]
FIG. 9 shows an example of calculating the hash value of the character string “abcd” by this method. In a perfect match search and data registration, this value is used as it is as a hash value.
[0039]
In the example shown in FIG. 9, the character code of the first character 'a' is 61H (97 in decimal), and the remainder when it is divided by 5 is 2, so 2^2 = 4. Field 1 is therefore the logical OR of 00100 (binary; 4 in decimal) and the initial value of field 1, 00000 (binary), which gives 00100 (binary). Similarly, the character code of the fourth character 'd' is 64H (100 in decimal), and the remainder when it is divided by 5 is 0, so 2^0 = 1. The logical OR of 00001 (binary) and field 1 (00100) is then taken, and field 1 becomes 00101 in binary.
[0040]
In this embodiment, as described above, an identifier carrying information about the key is attached to each chain pointer in the chained index of FIG. 4, and this key identifier is used to check for collisions.
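The position-and-code hash and the "abcd" example can be reproduced with the sketch below. The function name `hash15` is hypothetical. One layout detail is an assumption: field 1 is placed in the most significant 5 bits, which the text does not state explicitly (FIG. 9 is not reproduced here).

```python
FIELDS = 3       # fields for the (3n+1)th, (3n+2)th, (3n+3)th characters
FIELD_BITS = 5   # each field has 5 bits, one per value of (code mod 5)


def hash15(key: str, start: int = 1) -> int:
    """15-bit hash; `start` is the 1-based position of the first character."""
    value = 0
    for offset, ch in enumerate(key):
        position = start + offset
        field = (position - 1) % FIELDS               # 0 means field 1
        bit = ord(ch) % 5                             # x mod 5
        shift = (FIELDS - 1 - field) * FIELD_BITS     # field 1 on top (assumed)
        value |= (1 << bit) << shift
    return value


print(format(hash15("abcd"), "015b"))
```

For "abcd", characters 1 and 4 both land in field 1 (bits 2 and 0), character 2 in field 2 (bit 3), and character 3 in field 3 (bit 4), matching the values worked out in the text.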
[0041]
In a partial-match search, the bits that are 1 are held fixed at 1 and the bits that are 0 are allowed to take both 0 and 1; the resulting set of hash values is computed and every one of them is searched. When character positions are used, care is needed regarding the start position of the character string. For example, for "*abcd*", the character a may appear as a (3n+1)th character as in "abcde", as a (3n+2)th character as in "zabcd", or as a (3n+3)th character as in "yzabcd", so three patterns, each shifted by one character, are needed.
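The three shifted patterns for a partial key can be computed by hashing it at start positions 1, 2, and 3. The sketch below assumes, as an illustration only, that field 1 occupies the most significant 5 bits; `hash15` and `fixed_masks` are hypothetical names.

```python
def hash15(key: str, start: int = 1) -> int:
    """15-bit position/code hash; field 1 is assumed to be the top 5 bits."""
    value = 0
    for offset, ch in enumerate(key):
        field = (start + offset - 1) % 3
        value |= (1 << (ord(ch) % 5)) << ((2 - field) * 5)
    return value


def fixed_masks(partial: str) -> list[int]:
    """Fixed-bit pattern of the partial key for each of the 3 alignments."""
    return [hash15(partial, start) for start in (1, 2, 3)]


masks = fixed_masks("abcd")
for word, mask in zip(["abcde", "zabcd", "yzabcd"], masks):
    # Every fixed bit of the matching alignment is set in the word's hash.
    print(word, (hash15(word) & mask) == mask)
```

A registered word that contains the partial key at some offset always has all fixed bits of the corresponding alignment set, because extra characters can only add bits to the OR.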
[0042]
FIG. 5 shows a schematic flow in the case of performing a partial match search using a hash function.
[0043]
First, the hash value of the given search key is obtained (step 151), and the contents of the corresponding position in the hash table are read (step 152). If a key identifier is added to the chain pointer, a collision is detected based on the key identifier (step 153), and the search result is stored (step 154). Steps 152 to 154 are repeated until the search of all hash values is completed (step 155). When the search of all hash values is completed, the search result is output and displayed (step 156).
[0044]
Next, FIG. 10 shows a detailed flow when a partial match search is performed using a hash function. The operation will be described taking a case where the partial key is “abcd” as shown in FIG. 9 as an example. Here, the part where the bit is 1 is called a “fixed bit”, and the part where the bit is 0 is called an “undefined bit”.
[0045]
S1: First, when a partial key is given, the character start position of the partial key is initialized. That is, the character start position is set to the (3n + 1) th character first. Since the hash value is irrelevant to n, n = 0 and the start position = 1.
[0046]
S2: Subsequently, a bit fixed by the partial key is calculated. When the start position is 1, as shown in FIG. 9, bit 2 of field 1, bit 0 of field 1, bit 3 of field 2, and bit 4 of field 3 are fixed bits.
[0047]
S3: Next, this fixed bit string is saved in a list of fixed bit strings. When the search is repeated with the character start position shifted, a combination of fixed and undefined bits may equal a hash value that has already been searched; comparing against the previously searched fixed bit strings avoids searching the same value twice.
[0048]
S4: Next, all the undefined bits are set to 1 as their initial value. That is, every bit other than bit 2 of field 1, bit 0 of field 1, bit 3 of field 2, and bit 4 of field 3 is set to 1. From here on, the undefined bits are treated as a single number whose width is the total number of bits minus the number of fixed bits. For example, all combinations of 0 and 1 for the 11 bits that remain after removing the 4 fixed bits from the 15 bits are covered by the binary numbers from 0 to 2047 (decimal).
[0049]
S5: The undefined bits and the fixed bits are combined. That is, the fixed bits are interleaved into the binary number that represents the undefined bits. FIG. 11 shows an example: the undefined bits produced in S4 or in S7b (described later) are placed at their original bit positions and the fixed bits are inserted among them.
[0050]
S6: If the hash value produced by the combination has fallen to the minimum value or below, the process jumps to S12. A search key cannot be empty, so the field for the first character can never have no bit set to 1. The search therefore starts from large values, and the point at which the first-character field becomes 0 is taken as the minimum value, that is, as the termination condition for generating hash values.
[0051]
S7: The hash value is compared with the list of fixed bit strings saved in S3 for the character start positions processed so far. If every bit of the hash value at the positions of the fixed bits in a saved string is 1, the hash value belongs to a pattern that has already been searched (S7a), so the next undefined bits are computed (S7b) and the process returns to S5. The check in S7a exists because some combination of 0s and 1s in the undefined bits may reproduce a combination that was already searched; it prevents duplicate searches. In S7b, the value obtained by subtracting 1 from the number represented by the undefined bit string becomes the new undefined bit string.
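Step S5 can be sketched as spreading the bits of a counter over the non-fixed positions while keeping the fixed bits at 1. The function name `combine` and its parameters are hypothetical; the fixed-bit value 5392 used in the example corresponds to "abcd" at start position 1 under the assumption that field 1 is the most significant field.

```python
def combine(counter: int, fixed_mask: int, width: int = 15) -> int:
    """Place counter bits into the non-fixed positions; fixed bits stay 1."""
    value = fixed_mask
    for bit in range(width):              # from least significant bit upward
        if fixed_mask & (1 << bit):
            continue                      # fixed position: leave it at 1
        if counter & 1:
            value |= 1 << bit
        counter >>= 1
    return value


# 4-bit illustration: fixed bits 1100 and a 2-bit counter.
print([combine(c, 0b1100, width=4) for c in range(4)])
# 15-bit case: 4 fixed bits leave an 11-bit counter (0 to 2047).
print(combine(2047, 5392), combine(0, 5392))
```

Counting the undefined bits as one number means a simple decrement (as in S7b) visits every combination exactly once.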
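The duplicate check of S7a amounts to a subset test on bits: a candidate hash value was already visited if it has every fixed bit of some earlier alignment set. A minimal sketch, with the hypothetical name `already_covered`:

```python
def already_covered(hash_value: int, previous_masks: list[int]) -> bool:
    """True if hash_value contains all fixed bits of an earlier fixed-bit string."""
    return any((hash_value & mask) == mask for mask in previous_masks)


# 4-bit illustration: values with bits 1100 set were visited in an earlier pass.
print(already_covered(0b1111, [0b1100]))
print(already_covered(0b1011, [0b1100]))
```

The test is exact because an earlier pass enumerated every combination of the bits outside its fixed mask, so any value containing that mask was generated there.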
[0052]
S8: Read the hash table with the obtained hash value.
[0053]
S9: If no word is registered in the index, there is no corresponding data record, so the next undefined bit string is obtained by the same processing as S7b and the process returns to S5.
[0054]
S10: The key identifier is compared to determine whether it contains the partial key; if it does, the record is registered in the search result list (S10a).
[0055]
S11: Follow the chain pointer to the next index record, and return to S9.
[0056]
S12: The character start position is shifted by one.
[0057]
S13: If the character start position is 3 or less, return to S2. This ensures that all of the character start positions 1, 2, and 3 are searched.
[0058]
S14: The search result registered in the result list is displayed.
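The procedure of S2 to S13 can be sketched end to end in Python under several assumptions that are not taken from the embodiment: the hash value has 3 fields of 5 bits each, each character sets a single bit chosen by ord(ch) % 5 (a hypothetical partial hash), the hash table is a plain dictionary in place of the hash table and chained index, and the names key_hash, build_table, and partial_match are illustrative. The sketch enumerates every undefined bit string down to 0 rather than applying the early end condition of S6.

```python
NUM_FIELDS = 3    # the walkthrough uses character start positions 1 to 3
FIELD_BITS = 5    # 3 fields x 5 bits = 15-bit hash value (assumption)
TOTAL_BITS = NUM_FIELDS * FIELD_BITS


def key_hash(key, start=0):
    """OR one bit per character into the field chosen by its position."""
    value = 0
    for offset, ch in enumerate(key):
        field = (start + offset) % NUM_FIELDS
        bit = ord(ch) % FIELD_BITS             # hypothetical partial hash
        value |= 1 << (field * FIELD_BITS + bit)
    return value


def build_table(words):
    """Register each word under its hash value (stand-in for the index)."""
    table = {}
    for word in words:
        table.setdefault(key_hash(word), []).append(word)
    return table


def partial_match(table, part):
    results = set()
    earlier_fixed = []                          # fixed-bit patterns of earlier starts (S3)
    for start in range(NUM_FIELDS):             # S12, S13
        fixed = key_hash(part, start)           # bits the partial key forces to 1
        free = [b for b in range(TOTAL_BITS) if not (fixed >> b) & 1]
        undefined = (1 << len(free)) - 1        # S4: undefined bits start at all 1s
        while undefined >= 0:
            candidate = fixed                   # S5: merge undefined and fixed bits
            for i, b in enumerate(free):
                if (undefined >> i) & 1:
                    candidate |= 1 << b
            # S7a: skip values already covered by an earlier start position
            duplicate = any((candidate & m) == m for m in earlier_fixed)
            if not duplicate:
                for word in table.get(candidate, ()):   # S8, S9, S11
                    if part in word:                    # S10
                        results.add(word)
            undefined -= 1                      # S7b
        earlier_fixed.append(fixed)
    return results
```

For example, with `table = build_table(["cat", "concat", "dog", "catalog", "scatter"])`, the call `partial_match(table, "cat")` visits only hash values that contain the bits of "cat" at one of the three start positions, and the final substring check removes any words that merely collide.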
[0059]
With the above procedure, partial match search can be performed even with the hash method. For a forward match (prefix) search, only the case where the character start position is 1 needs to be searched. For a backward match (suffix) search, as shown in FIG. 12, the determination based on the key identifier may be performed starting from the last character. For a combined forward and backward match, the forward-match partial key is placed at the character start position, and the backward-match partial key is shifted, to obtain the hash values and search. Similarly, with * as the wildcard, a flexible search such as "*A*B*" can easily be realized.
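The final comparison against the key identifier for each of these match types could look like the following Python sketch. The function name matches and the mode names are assumptions made here; the wildcard case uses the standard library function fnmatch.fnmatchcase, which also gives special meaning to ? and [ ], beyond what the description requires.

```python
from fnmatch import fnmatchcase


def matches(word: str, pattern: str, mode: str) -> bool:
    """Decide whether a candidate key satisfies the requested match type."""
    if mode == "forward":       # forward match: key starts with the pattern
        return word.startswith(pattern)
    if mode == "backward":      # backward match: key ends with the pattern
        return word.endswith(pattern)
    if mode == "partial":       # intermediate match: pattern occurs anywhere
        return pattern in word
    if mode == "wildcard":      # pattern such as "*A*B*"
        return fnmatchcase(word, pattern)
    raise ValueError("unknown mode: " + mode)
```

This check is applied only to the candidates that the hash lookup returns, so it removes false positives caused by hash collisions without scanning the whole data file.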
[0060]
Although the present embodiment searches each hash value repeatedly in sequence, the hash values may instead be searched in parallel using a computer capable of parallel computation.
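As a rough illustration of the parallel variant, the following Python sketch submits one lookup per candidate hash value to a thread pool. The dictionary-based table and the names lookup and parallel_search are assumptions made here, and the thread pool only shows the structure; on a real parallel machine the lookups would be distributed across processors.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup(table, hash_value, part):
    """Read one hash table entry and keep the words containing the partial key."""
    return [word for word in table.get(hash_value, ()) if part in word]


def parallel_search(table, hash_values, part, workers=4):
    """Look up every candidate hash value concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(lambda h: lookup(table, h, part), hash_values))
    return sorted({word for chunk in chunks for word in chunk})
```

Because each lookup reads the table independently, the lookups need no coordination beyond merging the result lists at the end.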
[0061]
[Effects of the invention]
As described above, according to the present invention, the hash method, a high-speed search method that previously supported only exact match search and could not perform flexible searches such as wildcard search, can perform partial match search without adding a data structure and without referring to all the data. Further, with the search method of the present invention, all types of partial match search, such as forward match, backward match, forward and backward match, and intermediate match, can be realized without adding a data structure.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram showing an example of a hash function capable of performing a partial match search.
FIG. 2 is an example showing the principle of partial match search in the hash method.
FIG. 3 is a block diagram showing a configuration example of an electronic dictionary to which a hash search method according to the present invention is applied.
FIG. 4 is an explanatory diagram showing an example of a data structure for applying the hash search method of the present invention to an electronic dictionary.
FIG. 5 is a schematic flowchart of a partial match search of an electronic dictionary to which the hash search method of the present invention is applied.
FIG. 6 is a diagram showing a structure of an index record.
FIG. 7 is a diagram showing a list of chain pointers in an index record.
FIG. 8 is a flowchart illustrating an algorithm of data search using an electronic dictionary.
FIG. 9 is an explanatory diagram showing an example of obtaining a hash value using an example of a hash function capable of performing partial match search.
FIG. 10 is a detailed flowchart showing an example of implementing a partial match search.
FIG. 11 is an explanatory diagram showing an example of obtaining one of a plurality of hash values corresponding to a partial key.
FIG. 12 is an explanatory diagram illustrating an example of a key identification method in a backward match search.
[Explanation of symbols]
11: hash table, 12: index with chain, 13: actual data file, 21: CRT, 22: CRT driver, 23: keyboard, 24: mouse, 25: keyboard/mouse driver, 26: disk device, 27: disk device driver, 28: main storage device, 29: CPU

Claims

1. An information retrieval apparatus comprising:
input means for inputting a search key, which is a word to be searched;
calculating means for dividing a hash value into a plurality of fields, which are partial areas, receiving the search key input by the input means, calculating from the character code of each character constituting the search key a partial hash value of the size of a field, and storing each partial hash value in the corresponding field based on the appearance position of the character, thereby limiting the possible values of the hash value and obtaining a set of one or more hash values; and
search means for performing a search based on the input search key for the set of one or more hash values obtained by the calculating means.

2. The information retrieval apparatus according to claim 1, wherein the calculating means sequentially stores the partial hash values of the characters constituting the input search key in a predetermined number of fields, which are partial areas of the hash value, and, when the number of characters exceeds the number of fields, determines the field in which to store the partial hash value of each excess character as the remainder of the appearance position of that character divided by the number of fields, takes the logical sum of the partial hash value and the value already stored in the determined field, and stores the result in that field again.