JP2005004560A

JP2005004560A - Method for creating inverted file

Info

Publication number: JP2005004560A
Application number: JP2003168554A
Authority: JP
Inventors: Junichi Odagiri; 淳一小田切
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-06-13
Filing date: 2003-06-13
Publication date: 2005-01-06

Abstract

PROBLEM TO BE SOLVED: To provide an inverted file that reduces the amount of memory an index uses without greatly reducing search speed.
SOLUTION: A method for creating an inverted file registers a character string serving as an index together with its position information. The method provides an inverted file list 9, in which a registration count indicating the number of pieces of position information registered in the inverted file is written, and a table 8 having sections that correspond to character strings and point to the inverted file list 9. One or more kinds of character strings are registered in one inverted file, and their position information is encoded and stored according to its difference from other, previously registered position information.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】近年インターネットの普及により、ネットワーク上に存在する多量の文書情報を扱えることが可能になってきた。しかし文書情報が多くなるにつれて、どれが有用な情報か逐一判断することができなくなってきた。これに対してあるキーワードをもとに有用な情報とそうでない情報とを取捨選択する全文検索システムの重要性が増加してきた。本発明は、この全文検索システムに有用なインバーテッドファイル作成方法に関するものである。
【０００２】
【従来の技術】被検索文書中に出現する文字列に対するインデックスつまり索引は、通常高速検索を実現するためにはコンピュータのメモリ上に予め展開しておくことが必要とされる。インデックスは通常被検索文書と同等以上のデータ量となるために、インデックスに必要なメモリ量を削減することがインデックス作成における技術課題となる。
【０００３】
本発明はインデックスの１種であるインバーテッドファイル（ｉｎｖｅｒｔｅｄｆｉｌｅ）のメモリ使用量を削減するためのものである。
【０００４】
インバーテッドファイルには、検索対象となるキーに関する位置情報が記載されている。位置情報とは、そのキーを有する文書の番号（以下文書番号という）や、文書先頭位置からの出現位置のことを指す。文書位置情報として文書番号のみが登録されている場合もあるし、文書番号と出現位置の二つを登録する場合もある。登録されるものはシステム要件により異なる。
【０００５】
例えば、キー「雑誌」に関するインバーテッドファイルには、図１２に示す如く、キーと文書番号の組が記載されている場合を説明する。
【０００６】
「雑誌」というキーワードを有する文書を検出したい場合には、ユーザはこのインバーテッドファイルにアクセスして、図１２に示す文書番号１、２、３、２５、７８・・・１９２３を抽出する。
【０００７】
このインバーテッドファイルのメモリ使用量を圧縮するために下記のことが提案されている（例えば非特許文献１参照）。
【０００８】
ａ．キーを有する文書の文書番号つまり位置情報を昇順に並べる。
ｂ．この位置情報間の差分を取る。
ｃ．差分値に対し小さい正整数ほど少ない符号語に割り当てられる符号を適用する。
【０００９】
前記ｃの符号としては、図１３に示す如きγ符号がよく知られている。γ符号は、整数ｘを表すのに、ｘを２進数で表したときの桁数すなわちビット数から１を引いた数だけ０を続けた後にｘの２進数表示を続けたものである。すなわち０を数によりｘの２進桁数を表し、ｘの２進数表示（必ず１から始まる）を続けたものになる。
【００１０】
いま、図１４に示す如く、キーを「雑誌」としたとき、イに示す如く、文書番号１、２、３、２５・・・１９２３の１２個の文書にこのキーが存在するとき、上記の圧縮手法について説明する。
【００１１】
まずキー「雑誌」を有する文書の文書番号を、図１４のイに示す如く、１、２・・・１９２３と昇順に並べる。
【００１２】
次にこの昇順に並べた文書番号間の差分を求め、図１４のロに示す如く、１、１、１、２２・・・３１７を得る。
【００１３】
この図１４のロで得られた差分値をγ符号化すると、図１４のハに示すビット長のγ符号が得られるので、その合計は１３６ビット長となる。これを文書番号に対して３２ビットの固定長を適用した場合は３２×１２＝３８４ビット長のものとなるので、上記の如く、位置情報の差分を求めてγ符号化することにより３８４ビットから１３６ビットに圧縮されることがわかる。
【００１４】
【非特許文献１】
ＩａｎＨ．Ｗｉｔｔｅｎ，ＡｌｉｓｔａｉｒＭｏｆｆａｔ，ＴｉｍｏｔｈｙＣ．Ｂｅｌｌ著
「ＭａｎａｇｉｎｇＧｉｇａｂｙｔｅｓ：Ｃｏｍｐｒｅｓｓｉｎｇａｎｄｉｎｄｅｘｉｎｇｄｏｃｕｍｅｎｔｓａｎｄｉｍａｇｅｓ」ＶａｎＮｏｓｔｒａｎｄＲｅｉｎｈｏｌｄ（米国）出版、１９９４年，ｐ８２〜９０３．３「Ｉｎｖｅｒｔｅｄｆｉｌｅｃｏｍｐｒｅｓｓｉｏｎ」
【００１５】
【発明が解決しようとする課題】ところでインバーテッドファイルにおいて、出現頻度の少ない文字列が多数存在する場合は、その文字列毎にインバーテッドファイルを作成すると、前記の如く、差分を取ったとしても多くのビットを必要とする。
【００１６】
インデックスを作成するとき、日本語の文書を一文字ずらして連続したＮ文字列で区切ってキーを抽出するＮ−ｇｒａｍ手法が行われる。例えば、図１５図に示す如く、「日本の雑誌が・・・」という文書を連続した２文字列で区切って文字数２のキーを抽出する２−ｇｒａｍ手法の場合は、「日本」から始まり、以下１文字ずらして「本の」、「の雑」、「雑誌」、「読が」というキーが抽出され、これらのキーの位置は図１５に示す通り付加される。
【００１７】
日本語では、例えば「雑誌が」、「雑誌を」、「雑誌の」の如く、語尾が多数変動するので、日本語の文書に対して文字数２文字の文字列のインバーテッドファイルを作成する場合などは、インバーテッドファイル数が多くなる。図１６に、前記１２等で説明したキー「雑誌」以降の語尾が上記３種類に変化した場合について説明する。
【００１８】
いま、「誌が」が文書番号１、２５、３９５、１３８４に存在し、「誌を」が文書番号２、７８、４５８、１５８６に存在し、「誌の」が文書番号３、１００、１２０５、１９２３に存在した場合、差分値、γ符号のビット長は「誌が」の場合「１、１１、１９、２１」となってその合計は５２ビット、「誌を」の場合「３、１５、１９、２３」となってその合計は６０ビット、「誌の」の場合「３、１５、２３、１９」となってその合計は６０ビットとなり、これらの総計は１７２ビットとなる。
【００１９】
ところで、キー「雑誌」と、「誌が」、「誌を」、「誌の」は同じ文書番号の組が分かれて出たにもかかわらず、「雑誌」の場合が前記の如く、１３６ビットであるのに対し、１７２ビットと増加している。このように頻度が少ない場合は、その差分値も大きくなるため、γ符号に使用するビット数が多くなる。
【００２０】
またインバーテッドファイル毎に、インバーテッドファイルをどこの物理アドレスに記録するか、インバーテッドファイル中の登録数、などの管理情報が必要となるため、インバーテッドファイルの数が多くなると、その分メモリ使用量が多くなる。
【００２１】
したがって本発明の目的は、このような問題点を改善したインバーテッドファイル作成方法を提供することである。
【００２２】
【課題を解決するための手段】前記目的を達成するため、本発明では、例えば図１に示す如く、１つのインバーテッドファイルに複数の文字列を対応付けする。あるいは図２に示す如く、差分値を更に小さい値に変換する。
【００２３】
図１（Ａ）に示す如く、「雑誌」のように出現頻度の大きいものについては、１グループ１キーでインバーテッドファイル１を作成するが、図１（Ｂ）に示す如く、「誌が」、「誌を」、「誌の」の如く、出現頻度の小さいものはこれらをまとめる。例えば１グループ３キーでインバーテッドファイル２を作成する。
【００２４】
このように複数の文字列（キー）を１つのインバーテッドファイルに対応付けること、つまりまとめることにより、位置情報の差分値が小さくなるため、符号長が短くできること、後述するインバーテッドファイルの管理情報を少なくできること等の利点がある。
【００２５】
ちなみに、例えばこのインバーテッドファイルを使用して「誌の」をキーとして検索を行う場合は、位置情報をもとに文字列検索を行ってその部分が本当に「誌の」なのかを検討する必要がある。なぜならば「誌を」「誌が」である可能性があるためである。このためこの文字列検索分だけ処理速度が遅くなる。このため、複数文字列を対応付ける方法としては、それぞれのインバーテッドファイルへの登録数が極力等しくなるようにする。これはインバーテッドファイルの登録数が不均一になり、特定のキーの検索にかかる時間が極端に遅くなるのを防ぐためである。さらにメモリ使用量を少なくすることを目的とする場合、差分値を更に小さい値に変換するために差分値を規定数で割った余りを位置情報として登録する。
【００２６】
図２はキー「誌が」に対する例であり、規定数を５１２とした場合を示す。キー「誌が」の文書番号が１、２５、３９５、１３８４の場合、その差分値は１、２４、３７０、９８９となり、規定数が５１２のため、差分値の５１２の剰余は１、２４、３７０、４７７となり、差分値９８９の代わりにビット数の少ない４７７が登録される。
【００２７】
検索時には文書番号１、２５、３９５、１３８４以外にも８７２に対してもキー「誌が」が存在するかどうか文字列検索を行うことになる。これは、差分値上は１、２４、３７０、４７７となっているので、文書番号に直された３９５の次の文書番号８７２（＝３９５（前回の番号）＋４７７）にキーが存在するかどうかを文字列検索によって確かめる処理が必要になる。しかし、存在しないので次は文書番号１３８４（８７２＋５１２）にキーが存在するかどうかを文字列検索によって確かめる。これで存在することがわかったので検索終了ということになるためである。このように文字列検索処理が多くなるが、より少ないビット数で位置情報を表すことができる。
【００２８】
本発明の前記目的は、下記（１）〜（５）により達成することができる。
【００２９】
（１）インデックスとなる文字列とその位置情報を登録するインバーテッドファイル作成方法において、インバーテッドファイルに登録された位置情報数を示す登録数が記入されたインバーテッドファイルリストと、インバーテッドファイルリストをポイントする文字列に対応した区分を有するテーブルを設け、１種または複数種の文字列をある１つのインバーテッドファイルに登録し、その位置情報を、以前に登録された他の位置情報との差分にもとづき符号化して格納することを特徴とするインバーテッドファイル作成方法。
【００３０】
（２）前記（１）においてインデックスを作成する過程で出現した新しい文字列をインバーテッドファイルに登録するとき、登録数が最も少ないインバーテッドファイルに対応付けることを特徴とするインバーテッドファイル作成方法。
【００３１】
（３）前記（１）においてインデックスを作成する過程で出現した新しい文字列をインバーテッドファイル登録するとき、予め登録するインバーテッドファイルが１つあるいは複数決定されており、以後新しい文字列が出現した場合も、そのインバーテッドファイルの登録数が一定数を超えない限りはそのグループに登録することを特徴とするインバーテッドファイル作成方法。
【００３２】
（４）前記（１）において、位置情報を以前に登録された他の位置情報との差分にもとづき符号化するとき、そのインバーテッドファイルに前回登録された位置情報との差分にもとづき符号化を行うことを特徴とするインバーテッドファイル作成方法。
【００３３】
（５）前記（１）において、位置情報を以前に登録された他の位置情報との差分にもとづき符号化するとき、規定数を、使用する符号が、ある限定された桁数で表し得る最大値とし、位置情報との差分値をこの規定数で割った余りに対して符号化を行うことを特徴とするインバーテッドファイル作成方法。これにより下記の効果を奏することができる。
【００３４】
（１）１つのインバーテッドファイルに対して１種あるいは複数種の文字列を対応付けることにより、文字列間の出現位置情報の差分値を小さくし、インバーテッドファイルに対する管理情報を小さくできるため、メモリの消費量を抑えることができる。
【００３５】
（２）新しい文字列を登録数が最も少ないインバーテッドファイルに対応付けるので、極端に登録数が多いインバーテッドファイルを作らないことで検索時間が極端に遅い検索キーを作成しないようにしつつ、インバーテッドファイル管理情報を少なくし、更にインバーテッドファイル内の差分値を小さくすることで小さいビット数でインバーテッドファイルを構成、管理することができる。
【００３６】
（３）対応付けるインバーテッドファイルを予め決定しておき、そのインバーテッドファイルの登録数が規定数を超えるまでは、そのインバーテッドファイルに新規に出現した文字列を対応付けするので、新規な文字列が出現する度に登録数が最小のものを検出する必要がないので、処理を簡略化することができる。
【００３７】
（４）インバーテッドファイルに、前回登録された位置情報との差分値に対して符号化するので、差分値を取って小さな値は少ないビット数で、大きな値は多いビット数で表現する符号を使用して、インバーテッドファイルのメモリ使用量を削減することができる。
【００３８】
（５）差分値に対し規定数で割ってその余りを符号化したものを登録するので少ないビット数でインバーテッドファイルを表現できる。また規定数を、使用する符号がある限定された桁数で表し得る最大値に１加算したものを使用することにより、符号長を最大限有効利用できる。例えば図１３に示す如く、符号長１１ビットで６３まで表すことができるが、この場合規定値を６４にしてγ符号を使用すると１１ビットを最大限有効利用することができる。
【００３９】
【発明の実施の形態】本発明の一実施の形態を図３、図４、図５、図６、図７にもとづき説明する。図３は本発明の一実施の形態説明図、図４はインデックスの要部説明図、図５は動作説明図（その１）、図６は動作説明図（その２）、図７は動作説明図（その３）である。
【００４０】
図中、１はパソコン本体（以下ＰＣ本体という）、２は表示装置、３はキーボード、４は文書検索手段、５は文書データ、６はインデックス、７はインバーテッドファイル管理テーブル（以下管理テーブルという）、８はハッシュテーブル、９はインバーテッドファイルリスト（以下リストという）、１０はメモリ領域、１１は最大値保持部、１２はファイル数保持部である。
【００４１】
パソコン本体１は、図３に示すパソコンを動作するプロセッサや主記憶装置、磁気ディスク装置などが設けられているものであり、文書検索手段４、文書データ５等が用意されている。
【００４２】
表示装置２は、検索要求の入力を求めたり、検索結果を表示するものであって、ＣＲＴや液晶などで構成されている。
【００４３】
キーボード３はパソコンを操作するために必要な入力を行う入力操作キーが設けられており、検索キー入力端末として機能するものである。
【００４４】
文書検索手段４はインバーテッドファイルを作成したり、インバーテッドファイルを使用して検索処理を行うものであり、後述するインデックス６を有する。インデックス６には、例えば図２に示す如き、２文字列に対する複数のインバーテッドファイルを備える。また検索時には、インデックス６から検索キーに該当するインバーテッドファイルを抽出して、インバーテッドファイルに記載されている文書番号をユーザに返す。作成モードのときは、インデックスを作成し、検索モードのときはインデックスを使用して検索を行う。
【００４５】
文書データ５は、検索されるべき文書であって、この例では文書番号と題名により構成されている。文書の題名を検索対象とし、検索キーを題名に含む文書を検索により全て抽出する。
【００４６】
インデックス６は、前記検索時に使用されるインバーテッドファイルの外に、図４に示す如き、インバーテッドファイル管理テーブル（以下管理テーブルという）７を有する。
【００４７】
管理テーブル７は、ハッシュテーブル８とインバーテッドファイルリスト９、最大数保持部１１、ファイル数保持部１２を具備している。
【００４８】
ハッシュテーブル８は各文字列のハッシュ値に対応するインバーテッドファイルリスト９へのポインタが格納されている。ハッシュ値は文字列の文字コードを数値とみなし、ハッシュ関数にかけて得られた値、つまりハッシュ値を示す。「誌が」の場合、「誌」＋「を」を例えばＪＩＳの２進値で表された文字コードを数値化して、この数値を規定値で割算した余りをハッシュ値とし、このハッシュ値が取り得る数の区分を有するハッシュテーブル８を設ける。ハッシュテーブル８の区分０、１、・・・ｎがハッシュ値に対応している。
【００４９】
インバーテッドファイルリスト９は、登録数と最終登録数値と最後尾アドレスの項を有する。
登録数は、インバーテッドファイルに登録された文書番号数を示す。
【００５０】
最終登録数値は、そのインバーテッドファイルに最後に登録された文書番号の数値が記載されている。インバーテッドファイルに登録されるのは、前述の如く符号化された値であり、最初の値から順次復号しないと正確な数値が得られないので、インバーテッドファイルに新しく文書番号を登録するとき、それまでの最終登録数値をこれにより得ることにより登録時間が短縮できることになる。
【００５１】
最後尾アドレスは、メモリ領域のインバーテッドファイル記憶領域１０に登録されているインバーテッドファイルの最後尾アドレスを示すものであって、インバーテッドファイルに新しく文書データを追記するとき、この最後尾アドレスより記入することができるものである。
【００５２】
インバーテッドファイル記憶領域１０は、メモリ領域においてインバーテッドファイルが記憶されている領域を示す。
【００５３】
最大数保持部１１は、このシステムで作成可能なインバーテッドファイルの最大数Ｍａｘが記入されている区分である。図４に示す例では、インバーテッドファイルリスト９のリスト数Ｎが最大数Ｍａｘに相当する。
【００５４】
ファイル数保持部１２は、現在作成されているインバーテッドファイルの数が記入されている区分である。
【００５５】
次に図５、図６、図７を使用して、本発明の特徴的な文書検索手段５の機能であるインデックス作成処理について説明する。図５はインデックス作成処理の全体の動作説明図を示し、図６は図５における既存インバーテッドファイルへの対応付けモジュールの詳細説明図、図７は図５における文書番号登録モジュールの詳細説明図である。
【００５６】
インデックス作成処理を図５〜図７の動作説明図にもとづき説明する。この説明は２文字の文字列作成の例について行うが、本発明の文字列はこれに限定されるものではない。このインデックス作成処理は文書番号０から昇順で行われる。
【００５７】
Ｓ１．文書検索手段４は、文書中の題名から、２文字列を取得する。
Ｓ２．その文字列のハッシュ値を取得する。
【００５８】
Ｓ３．文書検索手段４は、このハッシュ値が示すハッシュテーブル８の区分の中の数値すなわちインバーテッドファイルＩＤを取得する。このハッシュテーブル８は、初期値が−１に初期化されているので、この数値が−１か否かを判断する。そしてこの数値が−１の場合は未登録、つまりその文字列がどんなインバーテッドファイルにも対応付けられていないことが認識され、ステップＳ４に進む。数値が−１でない場合は、対応済みであるため、後述する文書番号登録モジュールＳ１０に進む。
【００５９】
Ｓ４．前記Ｓ３において、数値が−１の場合は、ファイル数保持部１２の現在のインバーテッドファイルのファイル数と最大数保持部１１に記入された作成可能なインバーテッドファイルの最大数Ｍａｘ（この例ではＮ）を比較してインバーテッドファイルが新規登録可能かどうか判断する。可能な場合は後述するＳ６に進みインバーテッドファイルを新規作成し、不可能な場合は、登録数が最も少ないインバーテッドファイルに対応付けるためにＳ５に進む。これによりインバーテッドファイルの各登録数を極力小さくする。
【００６０】
Ｓ５．不可能な場合、後述する既存インバーテッドファイルへの対応付けモジュールを実行する。
【００６１】
Ｓ６．可能な場合、インバーテッドファイルを新規作成する。例えば空いているもっとも小さいリスト番号、例えばリスト番号１に作成する。
【００６２】
Ｓ７．前記Ｓ６で作成したインバーテッドファイルのリスト番号１を、前記Ｓ２で算出したハッシュ値が示す、ハッシュテーブル８の区分（この例ではｎ−１）に格納する。
【００６３】
Ｓ８．それから、このインバーテッドファイルのリスト番号１の区分を初期化する。そして登録数を０にし、最終登録数値すなわち最終文字列文書番号を０にする。また最後尾アドレスには未使用領域１０の先頭アドレスを入力する。なお初期化はシステムの立上りのとき行うこともできる。
【００６４】
Ｓ９．ファイル数保持部１２に、記入されているインバーテッドファイル数に＋１した、新インバーテッドファイル数を記入する。
【００６５】
Ｓ１０．次に後述する文書番号登録モジュールつまり「インバーテッドファイル登録モジュール」を実行する。
【００６６】
Ｓ１１．全文書の全文字列を登録したらインデックス作成処理を終了する。
【００６７】
次に図６により、前記Ｓ４において、インバーテッドファイルが新規登録不可能な場合について説明する。
【００６８】
Ｓ１０１．前記Ｓ４において、文書検索手段４が、作成可能なインバーテッドファイルの最大数を超えており新規作成が不可能と判断したとき、インバーテッドファイルリスト９の登録数の領域を検索して、登録数が最も少ないインバーテッドファイルのリスト番号（図４の例では２）を取得する。
【００６９】
Ｓ１０２．前記Ｓ１０１で取得したインバーテッドファイルリスト９のリスト番号２を、前記Ｓ２で算出したハッシュ値（例えば０）が示すハッシュテーブル８に格納する。
【００７０】
これによりハッシュ値２とハッシュ値０の文字列の、登録数、最終登録数値、最後尾アドレス等の管理データがインバーテッドファイルリスト９の同じリスト番号の区分に記入されることになる。勿論検索も前記図１（Ｂ）に示す如く、グループ化されて行われることになる。
【００７１】
図７により、前記Ｓ３、Ｓ５における既存インバーテッドファイルへの対応づけ、Ｓ６〜Ｓ９におけるインバーテッドファイルの新規作成による登録処理つまり文書番号登録モジュールについて説明する。
【００７２】
Ｓ２０１．前段の処理で特定されたインバーテッドファイルに、現在処理中の文字列の文書番号を登録する。まずインバーテッドファイルリスト９中の最終登録数値をもとにして新しく登録する文書の文書番号との差分値を求め、これを符号化する。
【００７３】
Ｓ２０２．次にインバーテッドファイルリスト９中の最終登録数値を、新しく登録する現在の文字列の文書番号に変更する。
【００７４】
Ｓ２０３．インバーテッドファイル記憶領域１０の最後尾アドレスに前記符号化したものを挿入する。
【００７５】
Ｓ２０４．この挿入により増加した分を最後尾アドレスとしてインバーテッドファイルリスト９に登録する。なおこのとき、インバーテッドファイルリスト９の登録数を＋１する。
【００７６】
なお、本発明は位置情報の差分値を小さくすることで圧縮符号による圧縮時に必要なビット数を削減することを目的としているため、上記説明で使用する圧縮符号はγ符号のみならず、図８に示す正整数の符号化説明図に示すα符号、δ符号、８ｂｉｔｂｌｏｃｋ符号などでもよい。これらはいずれも公知のものであり、ＣＱ出版社、植原智彦著、「文書データ圧縮アルゴリズム入門」に記載されている。
【００７７】
α符号は整数ｘを表すのにｘ−１個の０を続けた後に１を続けたものである。
【００７８】
δ符号は最初の桁数表示のかわりに符号γを利用したもの、つまりｘを２進数で表したときの桁数（ｂｉｔ数）を符号γで表現したあとに、ｘの２進数表示の先頭の１を除いたものである。
【００７９】
８ｂｉｔｂｌｏｃｋ符号は、８ｂｉｔのうちｔｏｐ１ｂｉｔを継続ｆｌａｇとし、そのフラグが存在したら次の８ｂｉｔも数が存在するとするものである。
【００８０】
本発明の第２の実施の形態を図９、図１０、図１１および図５を参照して説明する。図９は第２の実施の形態説明図、図１０は第２の実施の形態における動作説明図（その１）、図１１は第２の実施の形態における動作説明図（その２）である。
【００８１】
第１の実施の形態では、新しい文字列に対応してインバーテッドファイルを作成するとき、すでにインバーテッドファイルの登録数が許容登録数を超えていると、インバーテッドファイルの登録数を全部検索して最小値のものを求めることが必要のため、その検索に長い時間必要とした点があり、これを改善するため、第２の実施の形態では、このような場合あらかじめ登録数の少ないものの、全インバーテッドファイルの、例えば１／４を複数の文字列に対応付けするグループ化候補としてあらかじめ予定しておくものである。
【００８２】
第２の実施の形態では、パソコン内にある文書検索手段によるインデックス作成の実施例を示す。この方法は、文書（文章の外に題名等書誌的事項を含む）の文章中の文字列を検索キーとして、検索キーが出現する位置を全て抽出する。またこの方法は、文書データとして一つの文書を有する。
【００８３】
第２の実施の形態を実現する装置は、図９に示す如く構成され、他図と同符号は同一部を示し、３は検索キー入力端末として機能するキーボード、１４は文書検索手段、１５は文書データである。
【００８４】
キーボード３は、ユーザから入力された検索キーを文書検索手段１４に入力するものである。
【００８５】
文書検索手段１４は、２文字列に対するインバーテッドファイルからなるインデックス６を有しており、検索時には、検索キーのインバーテッドファイルを抽出して、インバーテッドファイルに記載されている、検索キーのその文章における出現位置（例えば先頭からの語数）を返す。
【００８６】
文書データ１５は、検索対象の文書で、１つのみであり、文章を有する。
【００８７】
なお、インデックス６の内部は、前記図３と同じである。
【００８８】
第２の実施の形態におけるインデックス作成の処理手順は、図５におけるＳ５の「既存インバーテッドファイルへの対応付けモジュール」と、Ｓ１０の「文書番号登録モジュール」以外は第一の実施の形態とまったく同一である。よってこの２つのモジュールのみを以下に説明する。
【００８９】
なお、第２の実施の形態では、検索出力は文書番号ではなく検索キーの文字列出現位置であるので、第２の実施の形態では、前記Ｓ１０の記載を「文字列出現位置登録モジュール」と読み替えるものとする。
【００９０】
第２の実施の形態における「既存インバーテッドファイルへの対応付けモジュール」では、新しく出現した文字列に対応付けるインバーテッドファイルを予め複数選択しておく。そして今回新しく出現した文字列をこの複数選択したインバーテッドファイルの内の１つに対応付け、前記選択された全てのインバーテッドファイルが次回以降新しく出現した文字列に順次対応付けされるように処理される。
【００９１】
図１０によりその動作を説明する。
【００９２】
Ｓ３０１．文書検索手段１４は、対応付け用インバーテッドファイルが未決定な場合、あるいはこれまでに選択された対応付け用インバーテッドファイルが全て新しく出現した文字列に対応付けられてもう存在しない場合はステップＳ３０２に進み、そうでない場合はステップＳ３０３に進む。
【００９３】
Ｓ３０２．文書検索手段１４は、全てのインバーテッドファイルから対応付け用インバーテッドファイルとして複数選択する。選択基準は、例えば登録数が少ない方から１／４とする。そして選択したインバーテッドファイルの中で新しく出現した文字列に対応付ける順番を予め決定しておく。決定方法は、例えば登録数の少ないインバーテッドファイルから対応付けられるような順番による。
【００９４】
Ｓ３０３．複数存在する対応付け用インバーテッドファイルの中から、現在の文字列に対応付けるインバーテッドファイルを選択する。次に選択したインバーテッドファイルのリスト番号を取得し、そのリスト番号を前記図５のＳ２で算出したハッシュ値が示すハッシュテーブル８に格納する。
【００９５】
Ｓ３０４．文書検索手段１４は前記Ｓ３０３で選択したインバーテッドファイルは、次回以降に新規出現した文字列への対応付けの対象外とする。但し次回以降でＳ３０２に通った場合は再度選択される場合もある。
【００９６】
図５におけるＳ１０の「文字列出現位置登録モジュール」では、前段の処理で特定されたインバーテッドファイルに、現在処理中の文字列の出現位置を登録する。はじめに、その出現位置を、最終登録数値をもとにして符号化処理を行い、次に最終登録数値を現在の文字列の出現位置に変更する。それから最後尾アドレスに前記符号化したものを挿入し、この挿入により増加した分を最後尾アドレスに登録する。
【００９７】
この動作を図１１により説明する。
【００９８】
Ｓ４０１．文書検索手段１４は、前段の処理で特定されたインバーテッドファイルに、現在処理中の文字列の出現位置を登録するため、以下の式から算出した整数値を符号化処理する。符号は例えばγ符号を使用する。なおＭｏｄ（Ａ、Ｂ）とはＡをＢで割った余りである。
【００９９】
Ｉｎｔ＝Ｍｏｄ（（ＮｏｗＯｆｆｓｅｔ − ＢｅｆｏｒｅＯｆｆｓｅｔ），規定値）＋１
この式に記載されている規定値は、使用する符号がある限定された桁数で表しうる最大値と同じものとする。そのようにしないと、例えば規定値が５１１で（ＮｏｗＯｆｆｓｅｔ − ＢｅｆｏｒｅＯｆｆｓｅｔ）が５１１であった場合、Ｉｎｔは１となり、符号結果をｂｉｔ表示すると“１”となり、メモリ使用量は１ｂｉｔと小さくなる。しかし検索時には、符号結果“１”を数式をもとに差分値に逆変換すると、０になる。よって、差分値０に該当する位置にはキーが存在しないのだが実際に文字列検索することになる。次に差分値５１１（＝０＋５１１×１）に該当する位置を実際に文字列検索して、キーが存在することが判明し、ヒットしたことになる。このように規定値以上の差分値は文字列検索を何度も行うことになる。このように、規定値が小さいと文字列検索の回数が増えることになる。しかし逆に規定値の数が大きいとメモリ使用量が多くなる傾向にある。このようなトレードオフの状況下で規定値を決定する必要がある。このときに、ビットを最大限有効利用するために、前述したように使用する符号がある限定された桁数で表しうる最大値と同じものとする。
【０１００】
例えば規定値が３００ならば、０〜５１０までの差分値は、使用ビット数は最大１７ｂｉｔと変化が無いにもかかわらず、３０１〜５１０は文字列検索回数が増えることになる。しかし、規定値が５１１ならば０〜５１０までの差分値は使用ビット数は最大１７ｂｉｔと維持され文字列検索回数も増えることは無い。このようにして検索時の文字列検索回数を極力減らしつつ、ビットを最大限有効利用する。
【０１０１】
Ｓ４０２．次に文書検索手段１４は、インバーテッドファイルリスト９の最終登録数値を今の文字列の出現位置に変更する。
【０１０２】
Ｓ４０３．それから、インバーテッドファイル記憶領域１０の最後尾アドレスに、前記符号化したものを挿入する。
【０１０３】
Ｓ４０４．この挿入により増加した分を、インバーテッドファイルリスト９の最後尾アドレスに加えて、これを変更する。
【０１０４】
なお、本発明は位置情報の差分値を小さくすることで圧縮符号による圧縮時に必要なビット数を削減することを目的としているため、ここで使用される符号はγ符号のみならずα符号、δ符号であってもよい。
【０１０５】
以上説明のように、本発明によれば、出現文字列をグループ化し、ある定数に対する出現位置情報の剰余をインバーテッドファイルに登録することで、検索速度を余り劣化させずにインデックスのメモリ使用量を削減することができる。
【０１０６】
【発明の効果】本発明により下記の効果を奏することができる。
【０１０７】
（１）１つのインバーテッドファイルに対して１種あるいは複数種の文字列を対応付けることにより、文字列間の出現位置情報の差分値を小さくし、インバーテッドファイルに対する管理情報を小さくできるため、メモリの消費量を抑えることができる。
【０１０８】
（２）新しい文字列を登録数が最も少ないインバーテッドファイルに対応付けるので、極端に登録数が多いインバーテッドファイルを作らないことで検索時間が極端に遅い検索キーを作成しないようにしつつ、インバーテッドファイル管理情報を少なくし、更にインバーテッドファイル内の差分値を小さくすることで小さいビット数でインバーテッドファイルを構成、管理することができる。
【０１０９】
（３）対応付けるインバーテッドファイルを予め決定しておき、そのインバーテッドファイルの登録数が規定数を超えるまでは、そのインバーテッドファイルに新規に出現した文字列を対応付けするので、新規な文字列が出現する度に登録数が最小のものを検出する必要がないので、処理を簡略化することができる。
【０１１０】
（４）インバーテッドファイルに、前回登録された位置情報との差分値に対して符号化するので、差分値を取って小さな値は少ないビット数で、大きな値は多いビット数で表現する符号を使用して、インバーテッドファイルのメモリ使用量を削減することができる。
【０１１１】
（５）差分値に対し規定数で割ってその余りを符号化したものを登録するので少ないビット数でインバーテッドファイルを表現できる。また規定数を、使用する符号がある限定された桁数で表し得る最大値を使用することにより、符号長を最大限有効利用できる。例えば図１３に示す如く、符号長１１ビットで６３まで表すことができるが、この場合規定値を６４にしてγ符号を使用すると１１ビットを最大限有効利用することができる。
【図面の簡単な説明】
【図１】複数の文字列の対応付け説明図である。
【図２】差分値を小さい値に変換する説明図である。
【図３】本発明の一実施の形態説明図である。
【図４】インデックスの要部説明図である。
【図５】本発明の第一の実施の形態の動作説明図（その１）である。
【図６】本発明の第一の実施の形態の動作説明図（その２）である。
【図７】本発明の第一の実施の形態の動作説明図（その３）である。
【図８】正整数の符号化説明図である。
【図９】本発明の第２の実施の形態説明図である。
【図１０】本発明の第２の実施の形態の動作説明図（その１）である。
【図１１】本発明の第２の実施の形態の動作説明図（その２）である。
【図１２】インバーテッドファイル説明図である。
【図１３】正整数の符号化説明図である。
【図１４】圧縮例説明図である。
【図１５】２文字列抽出説明図である。
【図１６】語尾が変化したときの位置情報説明図である。
【符号の説明】
１パソコン本体
２表示装置
３キーボード
４文書検索手段
５文書データ
６インデックス
７インバーテッドファイル管理テーブル
８ハッシュテーブル
９インバーテッドファイルリスト
１０インバーテッドファイル記憶領域
１１最大数保持部
１２ファイル数保持部

[0001]
With the recent spread of the Internet, it has become possible to handle the large amount of document information that exists on networks. As the volume of document information grows, however, it is no longer feasible to judge item by item which information is useful. Full-text search systems, which sort useful information from the rest on the basis of a keyword, have therefore become increasingly important. The present invention relates to an inverted file creation method useful for such a full-text search system.
[0002]
Description of the Related Art: An index of the character strings appearing in the documents to be searched usually has to be loaded into computer memory in advance to achieve high-speed search. Because an index usually holds at least as much data as the documents themselves, reducing the memory an index requires is a technical challenge in index creation.
[0003]
The present invention is for reducing the memory usage of an inverted file, which is a kind of index.
[0004]
In the inverted file, position information regarding a key to be searched is described. The position information refers to the number of the document having the key (hereinafter referred to as the document number) and the appearance position from the document head position. There are cases where only the document number is registered as the document position information, and there are cases where both the document number and the appearance position are registered. What is registered depends on the system requirements.
[0005]
For example, consider the case where the inverted file for the key “magazine” (「雑誌」 in the original) contains pairs of the key and a document number, as shown in FIG. 12.
[0006]
When it is desired to detect documents containing the keyword “magazine”, the user accesses this inverted file and extracts document numbers 1, 2, 3, 25, 78, ..., 1923 shown in FIG. 12.
[0007]
In order to compress the memory usage of the inverted file, the following has been proposed (for example, see Non-Patent Document 1).
[0008]
a. The document numbers of the documents containing the key, that is, the position information, are arranged in ascending order.
b. The differences between successive position values are taken.
c. A code that assigns shorter codewords to smaller positive integers is applied to the difference values.
[0009]
The γ code shown in FIG. 13 is a well-known choice for the code in step c. To represent an integer x, the γ code writes as many 0s as the number of binary digits (bits) of x minus one, followed by the binary representation of x. In other words, the count of leading 0s conveys the binary digit length of x, and the binary representation of x (which always begins with 1) follows.
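The γ code described above can be sketched in a few lines. This is a minimal illustration of the textbook definition, not code from the patent; the function names `gamma_encode` and `gamma_decode` are assumptions made for the example, and the codeword is held as a string of '0' and '1' characters for readability rather than packed into bits.

```python
def gamma_encode(x: int) -> str:
    """Return the gamma code of a positive integer as a bit string."""
    if x < 1:
        raise ValueError("the gamma code is defined for positive integers only")
    binary = bin(x)[2:]                      # binary digits of x, always starting with '1'
    return "0" * (len(binary) - 1) + binary  # (digit count - 1) zeros, then the digits


def gamma_decode(bits: str) -> int:
    """Decode a single gamma codeword back to its integer."""
    zeros = len(bits) - len(bits.lstrip("0"))  # leading zeros = digit count - 1
    return int(bits[zeros:2 * zeros + 1], 2)   # the next (zeros + 1) bits are x itself


print(gamma_encode(5))   # 00101
```

Under this definition a value with b binary digits costs 2b - 1 bits, which is why small difference values are cheap to store.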
[0010]
Now take the key to be “magazine”, as shown in FIG. 14, and suppose, as shown in (a) of that figure, that the key is present in the 12 documents with document numbers 1, 2, 3, 25, ..., 1923. The compression method above is explained for this case.
[0011]
First, the document numbers of the documents containing the key “magazine” are arranged in ascending order as 1, 2, ..., 1923, as shown in (a) of FIG. 14.
[0012]
Next, the differences between the document numbers arranged in ascending order are taken, giving 1, 1, 1, 22, ..., 317, as shown in (b) of FIG. 14.
[0013]
When the difference values obtained in (b) of FIG. 14 are γ-encoded, γ codes with the bit lengths shown in (c) of FIG. 14 are obtained, and their total is 136 bits. If a fixed length of 32 bits were used for each document number instead, the total would be 32 × 12 = 384 bits. Taking the differences of the position information and γ-encoding them therefore compresses the data from 384 bits to 136 bits.
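As a rough illustration of steps a to c, the sketch below computes the gaps and their γ code lengths for the five leading document numbers quoted in the text (1, 2, 3, 25, 78); the rest of the list appears only in FIG. 14 and is not reproduced here. The helper names are assumptions, the first gap is assumed to be measured from 0, and the bit counts follow the textbook γ code length (2b - 1 bits for a b-digit value), which has not been checked against the table in FIG. 13.

```python
def to_gaps(doc_numbers):
    """Sort the document numbers and return the differences between neighbours."""
    gaps, previous = [], 0            # assumption: the first gap is measured from 0
    for number in sorted(doc_numbers):
        gaps.append(number - previous)
        previous = number
    return gaps


def gamma_bits(x: int) -> int:
    """Length in bits of the gamma code of a positive integer."""
    return 2 * x.bit_length() - 1


docs = [1, 2, 3, 25, 78]              # leading document numbers quoted in the text
gaps = to_gaps(docs)                  # [1, 1, 1, 22, 53]
compressed_bits = sum(gamma_bits(g) for g in gaps)
fixed_bits = 32 * len(docs)           # fixed 32-bit document numbers for comparison
print(gaps, compressed_bits, fixed_bits)
```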
[0014]
[Non-Patent Document 1]
Ian H. Witten, Alistair Moffat, and Timothy C. Bell,
“Managing Gigabytes: Compressing and Indexing Documents and Images”, Van Nostrand Reinhold (USA), 1994, pp. 82-90, Section 3.3 “Inverted file compression”
[0015]
Problem to be solved by the invention: When many character strings with a low frequency of appearance exist and an inverted file is created for each character string, many bits are required even if differences are taken as described above.
[0016]
When an index is created, the N-gram method is used, in which keys are extracted from a Japanese document by taking consecutive N-character strings while shifting one character at a time. For example, as shown in FIG. 15, the 2-gram method extracts two-character keys: the document 「日本の雑誌が・・・」 (“Japanese magazines ...”) yields the key 「日本」 first and then, shifting one character at a time, 「本の」, 「の雑」, 「雑誌」, and 「誌が」. The positions of these keys are attached as shown in FIG. 15.
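The 2-gram extraction described above amounts to a sliding window over the text. The following is a minimal sketch; the function name `ngrams` and the use of 0-based character offsets as positions are assumptions, since the position numbering used in FIG. 15 is not reproduced in the text.

```python
def ngrams(text: str, n: int = 2):
    """Return (key, offset) pairs for every run of n consecutive characters."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


for key, offset in ngrams("日本の雑誌が"):
    print(offset, key)
# 0 日本
# 1 本の
# 2 の雑
# 3 雑誌
# 4 誌が
```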
[0017]
In Japanese, word endings vary widely, as in 「雑誌が」, 「雑誌を」, and 「雑誌の」 (the noun 雑誌, “magazine”, followed by different particles), so the number of inverted files becomes large when, for example, inverted files are created for the two-character strings of a Japanese document. FIG. 16 illustrates the case where the text following the key 「雑誌」, described with FIG. 12 and elsewhere, varies in the three ways above.
[0018]
Suppose 「誌が」 occurs in document numbers 1, 25, 395, and 1384, 「誌を」 in document numbers 2, 78, 458, and 1586, and 「誌の」 in document numbers 3, 100, 1205, and 1923. The γ code bit lengths of the difference values are then “1, 11, 19, 21” for 「誌が」, totaling 52 bits; “3, 15, 19, 23” for 「誌を」, totaling 60 bits; and “3, 15, 23, 19” for 「誌の」, totaling 60 bits, for a grand total of 172 bits.
[0019]
Thus, although the document numbers for 「誌が」, 「誌を」, and 「誌の」 are the same set of document numbers as for the key 「雑誌」 (“magazine”), merely split up, the total rises to 172 bits, compared with 136 bits for 「雑誌」 as described above. When the frequency is low in this way, the difference values become large, and so more bits are used for the γ code.
[0020]
In addition, each inverted file requires management information, such as the physical address at which the inverted file is recorded and the number of registrations in it, so memory usage grows as the number of inverted files increases.
[0021]
Accordingly, an object of the present invention is to provide an inverted file creation method that improves such problems.
[0022]
In order to achieve the above object, in the present invention, for example, as shown in FIG. 1, a plurality of character strings are associated with one inverted file. Alternatively, as shown in FIG. 2, the difference value is converted into a smaller value.
[0023]
As shown in FIG. 1(A), for a string with a high frequency of appearance such as 「雑誌」 (“magazine”), inverted file 1 is created with one key per group. As shown in FIG. 1(B), strings with a low frequency of appearance, such as 「誌が」, 「誌を」, and 「誌の」, are grouped together. For example, inverted file 2 is created with three keys in one group.
[0024]
Associating a plurality of character strings (keys) with one inverted file in this way, that is, grouping them, makes the difference values of the position information smaller. This has advantages such as shorter code lengths and less management information for the inverted files, which is described later.
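The effect of grouping can be illustrated with the document numbers given in paragraph [0018]. The sketch below compares three separate gap-coded lists with one merged list; it uses the textbook γ code length (2b - 1 bits for a b-digit value), so its totals are not expected to match the 172-bit and 136-bit figures quoted from the drawings, and the helper names are assumptions. It is meant only to show the direction of the effect.

```python
def gamma_bits(x: int) -> int:
    """Length in bits of the gamma code of a positive integer."""
    return 2 * x.bit_length() - 1


def list_bits(doc_numbers) -> int:
    """Total gamma-coded size of one inverted file storing gaps between documents."""
    total, previous = 0, 0
    for number in sorted(doc_numbers):
        total += gamma_bits(number - previous)
        previous = number
    return total


postings = {
    "誌が": [1, 25, 395, 1384],
    "誌を": [2, 78, 458, 1586],
    "誌の": [3, 100, 1205, 1923],
}

separate = sum(list_bits(docs) for docs in postings.values())       # one file per key
merged = list_bits([d for docs in postings.values() for d in docs])  # one file for all keys
print(separate, merged)
```

The merged list is smaller because interleaving the three keys shrinks the gaps. The price, as the following paragraph explains, is that a hit no longer identifies which of the three keys occurred.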
[0025]
Note that when this inverted file is used to search with 「誌の」 as the key, for example, a character string search must be performed at each position obtained from the position information to check whether that portion really is 「誌の」, because it may instead be 「誌を」 or 「誌が」. Processing is slower by the amount of this character string search. For this reason, when a plurality of character strings are associated, the number of registrations in each inverted file is kept as equal as possible. This prevents the registration counts from becoming uneven and the search for a particular key from becoming extremely slow. When the aim is to reduce memory usage further, the remainder obtained by dividing the difference value by a specified number is registered as the position information, so as to convert the difference value to a still smaller value.
[0026]
FIG. 2 is an example for the key 「誌が」 with the specified number set to 512. When the document numbers for the key 「誌が」 are 1, 25, 395, and 1384, the difference values are 1, 24, 370, and 989. Since the specified number is 512, the remainders of the difference values modulo 512 are 1, 24, 370, and 477, so 477, which needs fewer bits, is registered instead of the difference value 989.
[0027]
At search time, a character string search for the key 「誌が」 is performed on document number 872 in addition to document numbers 1, 25, 395, and 1384. Because the stored difference values are 1, 24, 370, and 477, the document number following 395 is first reconstructed as 872 (= 395 (the previous number) + 477), and a character string search is needed to check whether the key exists there. It does not, so document number 1384 (= 872 + 512) is checked next by character string search. The key is found there, and the search ends. Character string search processing thus increases, but the position information can be expressed in fewer bits.
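The remainder scheme and its search-time probing can be sketched as follows, using the numbers from the FIG. 2 example. This is an illustrative reading of the description, not the patented implementation: the function names are assumptions, `contains_key` is a hypothetical stand-in for the character string search on a document, `last_doc` is an assumed upper bound on document numbers, and the γ coding of the stored values is omitted.

```python
def to_residuals(doc_numbers, modulus):
    """Replace each gap between successive document numbers by gap mod modulus."""
    residuals, previous = [], 0
    for number in doc_numbers:                 # assumed already in ascending order
        residuals.append((number - previous) % modulus)
        previous = number
    return residuals


def search(residuals, modulus, contains_key, last_doc):
    """Rebuild document numbers, verifying every candidate with a string search."""
    hits, probed, previous = [], [], 0
    for residual in residuals:
        # A stored 0 is taken to mean the gap was a non-zero multiple of the modulus.
        candidate = previous + (residual or modulus)
        while candidate <= last_doc:
            probed.append(candidate)
            if contains_key(candidate):        # stand-in for the string search
                hits.append(candidate)
                previous = candidate
                break
            candidate += modulus               # try the next possible gap
        else:
            raise ValueError("no document matched the stored residual")
    return hits, probed


docs = [1, 25, 395, 1384]
stored = to_residuals(docs, 512)               # [1, 24, 370, 477]
hits, probed = search(stored, 512, set(docs).__contains__, last_doc=2000)
print(stored, probed)                          # 872 is probed and rejected before 1384
```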
[0028]
The object of the present invention can be achieved by the following (1) to (5).
[0029]
(1) An inverted file creation method for registering a character string serving as an index and its position information, characterized in that: an inverted file list, in which a registration count indicating the number of pieces of position information registered in the inverted file is entered, and a table having sections that correspond to character strings and point to the inverted file list are provided; one or more kinds of character strings are registered in a single inverted file; and the position information is encoded and stored on the basis of its difference from other, previously registered position information.
[0030]
(2) The inverted file creation method of (1), characterized in that, when a new character string that appears in the course of creating the index is registered in an inverted file, it is associated with the inverted file having the fewest registrations.
[0031]
(3) The inverted file creation method of (1), characterized in that, when a new character string that appears in the course of creating the index is registered in an inverted file, one or more inverted files for registration are determined in advance, and when further new character strings appear they are also registered in that group as long as the registration count of the inverted file does not exceed a fixed number.
[0032]
(4) The inverted file creation method of (1), characterized in that, when the position information is encoded on the basis of its difference from other, previously registered position information, the encoding is based on the difference from the position information most recently registered in that inverted file.
[0033]
(5) The inverted file creation method of (1), characterized in that, when the position information is encoded on the basis of its difference from other, previously registered position information, a specified number is set to the maximum value that the code in use can represent with a given limited number of digits, and the remainder obtained by dividing the difference value of the position information by this specified number is encoded. The following effects are thereby obtained.
[0034]
(1) Associating one or more kinds of character strings with one inverted file reduces the difference values of the appearance position information between character strings and reduces the management information for the inverted files, so memory consumption can be kept down.
[0035]
(2) Because a new character string is associated with the inverted file having the fewest registrations, no inverted file with an extremely large number of registrations is created, which avoids search keys with extremely slow search times. At the same time, the inverted file management information is reduced and the difference values within each inverted file become smaller, so the inverted files can be built and managed with fewer bits.
[0036]
(3) The inverted file to be associated is determined in advance, and newly appearing character strings are associated with it until its registration count exceeds the specified number. There is thus no need to find the inverted file with the fewest registrations each time a new character string appears, which simplifies processing.
[0037]
(4) Because what is encoded in the inverted file is the difference from the previously registered position information, a code that expresses small values in few bits and large values in more bits can be applied to the differences to reduce the memory usage of the inverted file.
[0038]
(5) Because what is registered is the encoded remainder of the difference value divided by the specified number, the inverted file can be expressed in fewer bits. Moreover, setting the specified number to one more than the maximum value that the code in use can represent with a given limited number of digits makes the fullest use of the code length. For example, as shown in FIG. 13, a code length of 11 bits can represent values up to 63; in that case, setting the specified value to 64 and using the γ code makes the fullest use of the 11 bits.
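The relationship between a γ code length and the specified number in effect (5) can be checked with a short calculation. The sketch assumes the textbook γ code, in which a codeword of 2b - 1 bits carries a b-digit binary value, and follows the “maximum plus one” wording of effect (5); the function names are hypothetical.

```python
def max_value_for_gamma_length(code_bits: int) -> int:
    """Largest integer whose gamma code fits in code_bits bits (code_bits must be odd)."""
    if code_bits < 1 or code_bits % 2 == 0:
        raise ValueError("gamma codewords always have an odd, positive length")
    digits = (code_bits + 1) // 2      # a (2b - 1)-bit codeword carries b binary digits
    return 2 ** digits - 1


def specified_number(code_bits: int) -> int:
    """Divisor chosen as one more than the largest value the code length can hold."""
    return max_value_for_gamma_length(code_bits) + 1


print(max_value_for_gamma_length(11), specified_number(11))   # 63 64
```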
[0039]
DESCRIPTION OF THE PREFERRED EMBODIMENTS: An embodiment of the present invention will be described with reference to FIGS. 3, 4, 5, 6, and 7. FIG. 3 is an explanatory diagram of an embodiment of the present invention, FIG. 4 is an explanatory diagram of the main part of the index, FIG. 5 is an operation explanatory diagram (part 1), FIG. 6 is an operation explanatory diagram (part 2), and FIG. 7 is an operation explanatory diagram (part 3).
[0040]
In the figures, 1 is a personal computer body (hereinafter referred to as the PC body), 2 is a display device, 3 is a keyboard, 4 is document search means, 5 is document data, 6 is an index, 7 is an inverted file management table (hereinafter referred to as the management table), 8 is a hash table, 9 is an inverted file list (hereinafter referred to as the list), 10 is a memory area, 11 is a maximum number holding unit, and 12 is a file number holding unit.
[0041]
The personal computer main body 1 contains the processor that operates the personal computer shown in FIG. 3, a main storage device, a magnetic disk device, and so on, and holds the document search means 4, the document data 5, and the like.
[0042]
The display device 2 requests input of a search request or displays search results, and is composed of a CRT, liquid crystal, or the like.
[0043]
The keyboard 3 is provided with input operation keys for performing input necessary for operating the personal computer, and functions as a search key input terminal.
[0044]
The document search means 4 creates inverted files and performs search processing using them, and has an index 6, which is described later. The index 6 includes a plurality of inverted files for two-character strings, for example as shown in FIG. 2. At search time, the inverted file corresponding to the search key is extracted from the index 6, and the document numbers recorded in that inverted file are returned to the user. In creation mode the index is created, and in search mode a search is performed using the index.
[0045]
The document data 5 is a document to be searched, and is composed of a document number and a title in this example. The title of the document is a search target, and all documents including the search key in the title are extracted by the search.
[0046]
The index 6 has an inverted file management table (hereinafter referred to as a management table) 7 as shown in FIG. 4 in addition to the inverted file used at the time of the search.
[0047]
The management table 7 includes a hash table 8, an inverted file list 9, a maximum number holding unit 11, and a file number holding unit 12.
[0048]
The hash table 8 stores, for the hash value of each character string, a pointer to the corresponding entry of the inverted file list 9. The hash value is the value obtained by applying a hash function to the character string, treating its character codes as a numerical value. For example, for the character string "journal is", the character codes of its characters, expressed for example as JIS binary values, are combined into a numerical value, and the remainder obtained by dividing this numerical value by a specified value is used as the hash value. The hash table 8 has as many sections as there are possible hash values; sections 0, 1, ... N of the hash table 8 correspond to the hash values.
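The hash computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `hash_value`, the table size `NUM_SECTIONS`, and the use of UTF-8 instead of JIS codes are assumptions made for the example. The initial value -1 for each section follows step S3 of the index creation process.

```python
def hash_value(text: str, num_sections: int, encoding: str = "utf-8") -> int:
    """Treat the encoded bytes of the string as one integer and take the remainder."""
    # Assumption: UTF-8 bytes stand in for the JIS character codes of the source.
    code_number = int.from_bytes(text.encode(encoding), byteorder="big")
    return code_number % num_sections


NUM_SECTIONS = 1024                 # hypothetical "specified value" (divisor)
hash_table = [-1] * NUM_SECTIONS    # -1 means "no inverted file associated yet"

slot = hash_value("ab", NUM_SECTIONS)   # section of the hash table for this string
```

Because the remainder is always smaller than the divisor, every character string maps to one of the sections of the table, and different strings may share a section.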
[0049]
The inverted file list 9 has items of registration number, final registration numerical value, and last address.
The number of registrations indicates the number of document numbers registered in the inverted file.
[0050]
The final registration numerical value is the document number most recently registered in the inverted file. The values stored in the inverted file are encoded as described above, so an exact document number can only be recovered by decoding sequentially from the first value. Keeping the final registered numerical value in the list therefore shortens the registration time when a new document number is added to the inverted file.
[0051]
The last address indicates the last address of the inverted file registered in the inverted file storage area 10 of the memory area. When new document data is added to the inverted file, it can be written at this last address.
[0052]
The inverted file storage area 10 indicates an area where an inverted file is stored in the memory area.
[0053]
The maximum number holding unit 11 is a section in which the maximum number Max of inverted files that can be created by this system is entered. In the example shown in FIG. 4, the list number N of the inverted file list 9 corresponds to the maximum number Max.
[0054]
The file number holding unit 12 is a section in which the number of inverted files currently created is entered.
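The management table described in the preceding paragraphs might be modelled as below. This is only a sketch under stated assumptions: the class and field names (`ListEntry`, `ManagementTable`, `registrations`, `last_value`, `last_address`, `max_files`) are hypothetical and chosen to mirror the items named in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ListEntry:
    """One entry of the inverted file list 9."""
    registrations: int = 0   # number of document numbers registered
    last_value: int = 0      # final registered numerical value
    last_address: int = 0    # last address in the inverted file storage area 10


@dataclass
class ManagementTable:
    """Management table 7: hash table 8, list 9, and holding units 11 and 12."""
    num_sections: int                       # number of hash table sections
    max_files: int                          # maximum number holding unit 11 (Max)
    hash_table: List[int] = field(init=False)
    file_list: List[ListEntry] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Every section starts at -1, meaning "not associated with any inverted file".
        self.hash_table = [-1] * self.num_sections

    @property
    def file_count(self) -> int:
        """File number holding unit 12: number of inverted files created so far."""
        return len(self.file_list)
```

In this sketch the file count is derived from the list length rather than stored separately; the text keeps it in its own holding unit.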
[0055]
Next, with reference to FIGS. 5, 6, and 7, the index creation process, which is a function of the document search means 4 and a characteristic of the present invention, will be described. FIG. 5 is a diagram for explaining the overall operation of the index creation process, FIG. 6 is a detailed diagram for explaining the module in FIG. 5 for associating with an existing inverted file, and FIG. 7 is a detailed diagram for explaining the document number registration module in FIG. 5.
[0056]
The index creation process will be described based on these operation explanatory diagrams. The description uses an index built on two-character strings as an example, but the character strings of the present invention are not limited to this. The index creation process is performed in ascending order of document number, starting from document number 0.
[0057]
S1. The document search means 4 acquires a two character string from the title in the document.
S2. Get the hash value of the string.
[0058]
S3. The document search means 4 acquires a numerical value in the section of the hash table 8 indicated by this hash value, that is, an inverted file ID. Since this hash table 8 has an initial value initialized to -1, it is determined whether or not this numerical value is -1. If the numerical value is -1, it is recognized that it is not registered, that is, the character string is not associated with any inverted file, and the process proceeds to step S4. If the numerical value is not -1, it has already been dealt with, and the process proceeds to a document number registration module S10 described later.
[0059]
S4. If the numerical value in S3 is -1, the number of inverted files currently created, held in the file number holding unit 12, is compared with the maximum number Max of inverted files that can be created (N in this example), held in the maximum number holding unit 11, to determine whether a new inverted file can be registered. If it can, the process proceeds to S6, described later, and a new inverted file is created. This keeps the number of registrations in each inverted file to a minimum.
[0060]
S5. If this is not possible, a module for associating with an existing inverted file described later is executed.
[0061]
S6. If registration is possible, a new inverted file is created using the smallest available list number, for example list number 1.
[0062]
S7. The inverted file list number 1 created in S6 is stored in the section of the hash table 8 (n-1 in this example) indicated by the hash value calculated in S2.
[0063]
S8. Then, the section of list number 1 of this inverted file is initialized. The registration number is set to 0, and the final registration value, that is, the final character string document number is set to 0. Further, the head address of the unused area 10 is input as the last address. Initialization can also be performed at the start of the system.
[0064]
S9. The number of inverted files entered in the file number holding unit 12 is incremented by 1 to account for the new inverted file.
[0065]
S10. Next, a document number registration module described later, that is, an “inverted file registration module” is executed.
[0066]
S11. When all the character strings of all documents are registered, the index creation process is terminated.
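Steps S1 to S11 can be sketched roughly as follows. This is a simplified illustration under several assumptions: the names `bigrams` and `build_index` are hypothetical, document numbers are stored as plain integers rather than encoded differences, the hash uses UTF-8 bytes, `max_files` is at least 1, and the fallback in S5 picks the inverted file with the fewest registrations.

```python
def bigrams(title):
    """S1: every two-character string in the title, in order of appearance."""
    return [title[i:i + 2] for i in range(len(title) - 1)]


def build_index(titles, num_sections, max_files):
    hash_table = [-1] * num_sections   # hash table 8, initialised to -1
    postings = []                      # one list of document numbers per inverted file
    for doc_no, title in enumerate(titles):          # ascending order from document 0
        for s in bigrams(title):                     # S1
            h = int.from_bytes(s.encode("utf-8"), "big") % num_sections   # S2
            if hash_table[h] == -1:                  # S3: not yet associated
                if len(postings) < max_files:        # S4: room for a new inverted file?
                    postings.append([])              # S6, S8: create and initialise
                    hash_table[h] = len(postings) - 1    # S7 (S9: count is len(postings))
                else:                                # S5: associate with an existing file
                    hash_table[h] = min(range(len(postings)),
                                        key=lambda i: len(postings[i]))
            postings[hash_table[h]].append(doc_no)   # S10: register the document number
    return hash_table, postings                      # S11: all strings processed


table, files = build_index(["abc", "bcd"], num_sections=64, max_files=2)
```

With only two inverted files allowed, the third distinct string ("cd") is attached to the file that has the fewest registrations at that moment, so two different strings end up sharing one inverted file.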
[0067]
Next, with reference to FIG. 6, the case where the inverted file cannot be newly registered in S4 will be described.
[0068]
S101. When the document search means 4 determines in S4 that the number of inverted files has reached the maximum number that can be created, it searches the registration number fields of the inverted file list 9 and acquires the list number of the inverted file with the smallest number of registrations (2 in the example of FIG. 4).
[0069]
S102. The list number 2 of the inverted file list 9 acquired in S101 is stored in the hash table 8 indicated by the hash value (for example, 0) calculated in S2.
[0070]
As a result, the management data such as the registered number, the final registered numerical value, and the last address of the character strings of the hash value 2 and the hash value 0 are entered in the same list number category of the inverted file list 9. Of course, the search is also performed in groups as shown in FIG.
[0071]
With reference to FIG. 7, the document number registration module will be described. This registration processing is executed after a new inverted file has been created in S6 to S9, or after association with an existing inverted file in S3 or S5.
[0072]
S201. The document number of the character string currently being processed is registered in the inverted file specified in the preceding process. First, the difference between the document number to be newly registered and the final registered numerical value in the inverted file list 9 is obtained and encoded.
[0073]
S202. Next, the last registered numerical value in the inverted file list 9 is changed to the document number of the current character string to be newly registered.
[0074]
S203. The encoded data is inserted into the last address of the inverted file storage area 10.
[0075]
S204. The last address in the inverted file list 9 is advanced by the amount added by this insertion. At this time, the number of registrations in the inverted file list 9 is incremented by one.
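Steps S201 to S204 can be illustrated with the sketch below. It is not the patented implementation: the class `InvertedFile`, the string of "0"/"1" characters standing in for the storage area, and the choice to encode the difference plus 1 (so that a difference of 0 remains encodable by the γ code) are assumptions made for this example.

```python
def gamma_encode(x: int) -> str:
    """Gamma code of a positive integer: (bit length - 1) zeros, then x in binary."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b


class InvertedFile:
    def __init__(self):
        self.registrations = 0   # number of registrations
        self.last_value = 0      # final registered numerical value
        self.bits = ""           # storage area; len(self.bits) acts as the last address

    def register(self, doc_no: int) -> None:
        code = gamma_encode(doc_no - self.last_value + 1)   # S201 (assumed +1 offset)
        self.last_value = doc_no                            # S202
        self.bits += code                                   # S203: append at last address
        self.registrations += 1                             # S204


def decode_all(bits: str):
    """Recover the document numbers by decoding sequentially from the first value."""
    out, last, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        value = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        last += value - 1
        out.append(last)
    return out


f = InvertedFile()
for d in [1, 2, 3, 25, 78]:
    f.register(d)
```

Because `last_value` is kept alongside the encoded data, a new document number can be appended without decoding the whole inverted file first.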
[0076]
Note that the present invention aims to reduce the number of bits required by the compression code by reducing the difference values of the position information. Therefore, the compression code is not limited to the γ code used in the above description; the α code, δ code, 8-bit block code, and the like shown in the explanatory diagram of the encoding of positive integers may also be used. All of these are known and described in “Introduction to Document Data Compression Algorithm” by CQ Publisher, Tomohiko Uehara.
[0077]
The α code represents an integer x as x-1 zeros followed by a single one.
[0078]
The δ code uses the γ code in place of the leading digit-count prefix: the number of bits in the binary representation of x is encoded with the γ code, and this is followed by the binary representation of x with its leading 1 removed.
[0079]
In the 8-bit block code, the top 1 bit of each 8-bit block is a continuation flag; if the flag is set, the next 8 bits also hold part of the number.
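The three alternative codes can be sketched as follows, with the γ code included because the δ code builds on it. This is an illustrative sketch: the function names are hypothetical, and for the 8-bit block code the ordering of the 7-bit groups (most significant group first) is an assumption, since the text specifies only the continuation flag.

```python
def alpha_encode(x: int) -> str:
    """Alpha (unary) code: x-1 zeros followed by a single one."""
    if x < 1:
        raise ValueError("positive integers only")
    return "0" * (x - 1) + "1"


def gamma_encode(x: int) -> str:
    """Gamma code: (bit length - 1) zeros, then x in binary."""
    if x < 1:
        raise ValueError("positive integers only")
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b


def delta_encode(x: int) -> str:
    """Delta code: gamma code of the bit length, then x in binary without its leading 1."""
    if x < 1:
        raise ValueError("positive integers only")
    b = bin(x)[2:]
    return gamma_encode(len(b)) + b[1:]


def block8_encode(x: int) -> bytes:
    """8-bit block code: 7 data bits per byte, top bit set when another byte follows."""
    if x < 0:
        raise ValueError("non-negative integers only")
    groups = []
    while True:
        groups.append(x & 0x7F)
        x >>= 7
        if x == 0:
            break
    groups.reverse()   # assumption: most significant 7-bit group first
    last = len(groups) - 1
    return bytes(g | 0x80 if i < last else g for i, g in enumerate(groups))
```

All four codes spend fewer bits on small values than on large ones, which is why shrinking the stored differences reduces the size of the inverted file.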
[0080]
A second embodiment of the present invention will be described with reference to FIGS. 9, 10, 11 and 5. FIG. 9 is a diagram for explaining the second embodiment, FIG. 10 is a diagram for explaining the operation of the second embodiment (part 1), and FIG. 11 is a diagram for explaining the operation of the second embodiment (part 2).
[0081]
In the first embodiment, when an inverted file is to be created for a new character string and the number of registered inverted files has already reached the allowable number, the registration counts of all inverted files are searched. To improve on this, the second embodiment selects in advance inverted files whose registration counts are small, so that the minimum value need not be obtained each time. For example, 1/4 of all inverted files are designated in advance as grouping candidates to be associated with a plurality of character strings.
[0082]
In the second embodiment, an example of index creation by document search means in a personal computer will be shown. In this method, all the positions where the search key appears are extracted by using the character string in the sentence of the document (including bibliographic items such as a title in addition to the sentence) as a search key. This method has one document as document data.
[0083]
The apparatus for realizing the second embodiment is configured as shown in FIG. 9, wherein the same reference numerals as those in the other figures denote the same parts, 3 is a keyboard functioning as a search key input terminal, 14 is a document search means, and 15 is Document data.
[0084]
The keyboard 3 is used by the user to input a search key to the document search means 14.
[0085]
The document search means 14 has an index 6 consisting of inverted files for two-character strings. At the time of search, the document search means 14 extracts the inverted file for the search key and returns the appearance positions of the search key in the sentence that are described in that inverted file (for example, the number of words from the beginning).
[0086]
The document data 15 is the document to be searched; there is only one document, and it contains a sentence.
[0087]
The inside of the index 6 is the same as in FIG.
[0088]
The index creation processing procedure in the second embodiment is exactly the same as that in the first embodiment except for the “association module with existing inverted file” in S5 and the “document number registration module” in S10 in FIG. 5. Therefore, only these two modules will be described below.
[0089]
In the second embodiment, the search output is not the document number but the character string appearance position of the search key. Therefore, in the second embodiment, the description of S10 is to be read as the “character string appearance position registration module”.
[0090]
In the “module for associating existing inverted files” in the second embodiment, a plurality of inverted files to be associated with newly appearing character strings are selected in advance. Each newly appearing character string is then associated with one of the selected inverted files, in turn, until all of the selected inverted files have been associated with newly appearing character strings.
[0091]
The operation will be described with reference to FIG.
[0092]
S301. If the inverted files for association have not yet been determined, or if all of the inverted files selected so far have already been associated with newly appearing character strings so that none remain, the document search means 14 proceeds to S302. Otherwise, the process proceeds to step S303.
[0093]
S302. The document search means 14 selects, from all the inverted files, a plurality of inverted files for association. The selection criterion is, for example, the 1/4 of the inverted files with the fewest registrations. The order in which the selected inverted files are associated with newly appearing character strings is also determined in advance, for example starting from the inverted file with the fewest registrations.
[0094]
S303. An inverted file to be associated with the current character string is selected from a plurality of correlated inverted files. Next, the list number of the selected inverted file is acquired, and the list number is stored in the hash table 8 indicated by the hash value calculated in S2 of FIG.
[0095]
S304. The document search means 14 excludes the inverted file selected in S303 from association with character strings that newly appear from the next time onward. However, it may be selected again the next time S302 is executed.
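Steps S301 to S304 can be sketched as a small candidate pool. This is an illustration under assumptions: the class name `AssociationPool`, the method `next_file`, and the representation of the inverted file list as a plain list of registration counts are hypothetical; the 1/4 fraction and the fewest-registrations-first order follow the examples given in the text.

```python
class AssociationPool:
    """Pre-selected inverted files handed out one by one to newly appearing strings."""

    def __init__(self, fraction: float = 0.25):
        self.fraction = fraction
        self.queue = []   # list numbers still available for association

    def next_file(self, registrations):
        """Return the list number to associate with the newly appearing string."""
        if not self.queue:                                    # S301: none left, reselect
            order = sorted(range(len(registrations)),
                           key=lambda i: registrations[i])    # fewest registrations first
            count = max(1, int(len(registrations) * self.fraction))
            self.queue = order[:count]                        # S302: select the candidates
        return self.queue.pop(0)                              # S303 select, S304 exclude


regs = [50, 3, 40, 7, 60, 1, 30, 20]   # hypothetical registration counts of 8 files
pool = AssociationPool()
first = pool.next_file(regs)
second = pool.next_file(regs)
```

The full scan of the registration counts happens only when the pool is empty, rather than every time a new character string appears.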
[0096]
In the “character string appearance position registration module” of S10 in FIG. 5, the appearance position of the character string currently being processed is registered in the inverted file specified in the preceding process. First, the appearance position is encoded based on the final registered numerical value, and then the final registered numerical value is changed to the current character string appearance position. Then, the encoded one is inserted into the last address, and the amount increased by this insertion is registered in the last address.
[0097]
This operation will be described with reference to FIG.
[0098]
S401. The document search means 14 encodes an integer value calculated from the following equation in order to register the appearance position of the character string currently being processed in the inverted file specified in the preceding process. For example, a γ code is used as the code. Mod (A, B) is a remainder obtained by dividing A by B.
[0099]

Int = Mod(Now Offset - Before Offset, specified value) + 1

The specified value in this equation is set equal to the maximum value that the code in use can represent with a limited number of digits. For example, when the specified value is 511 and (Now Offset - Before Offset) is 511, Int becomes 1; the code result displayed in bits is “1”, so the memory usage is as small as 1 bit. At the time of retrieval, however, the code result “1” is converted back to a difference value based on the equation and becomes 0. A character string search is therefore actually performed at the position corresponding to the difference value 0, even though no key exists there. Next, a character string search is performed at the position corresponding to the difference value 511 (= 0 + 511 × 1), where the key is found and a hit is made. In this way, the character string search is repeated for any difference value equal to or greater than the specified value. If the specified value is small, the number of character string searches increases; if the specified value is large, the memory usage tends to increase. The specified value must be determined under this trade-off. To make the most effective use of the bits, the specified value is set equal to the maximum value that the code in use can represent with a limited number of digits, as described above.
[0100]
For example, if the specified value is 300, the maximum number of bits used for difference values from 0 to 510 remains 17 bits, but difference values from 301 to 510 increase the number of character string searches. If the specified value is 511, the maximum number of bits used for difference values from 0 to 510 is still 17 bits, and the number of character string searches does not increase. In this way, the available bits are used as effectively as possible while the number of character string searches at search time is kept as small as possible.
[0101]
S402. Next, the document search means 14 changes the final registered numerical value of the inverted file list 9 to the current character string appearance position.
[0102]
S403. Then, the encoded one is inserted into the last address of the inverted file storage area 10.
[0103]
S404. The last address in the inverted file list 9 is updated by adding the amount increased by this insertion.
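The value encoded in S401 and the resulting trade-off can be checked with a short sketch. It assumes that the encoded integer is Mod(Now Offset - Before Offset, specified value) + 1, which is consistent with the worked example in the text (a difference of 511 with a specified value of 511 gives Int = 1 and decodes back to a difference of 0). The function names and the `limit` parameter are hypothetical.

```python
def gamma_length(x: int) -> int:
    """Number of bits the gamma code uses for a positive integer x."""
    return 2 * (x.bit_length() - 1) + 1


def encode_offset(now: int, before: int, specified: int) -> int:
    """Integer to be encoded in S401 (assumed form: remainder plus 1)."""
    return (now - before) % specified + 1


def candidate_offsets(before: int, code: int, specified: int, limit: int):
    """Positions the searcher must verify by string comparison for one decoded value."""
    pos = before + (code - 1)      # decoded difference is only known modulo `specified`
    while pos <= limit:
        yield pos
        pos += specified


# Worked example from the text: difference 511, specified value 511.
code = encode_offset(511, 0, 511)
to_check = list(candidate_offsets(0, code, 511, limit=600))

# Longest code needed for difference values 0..510 under two specified values.
max_bits_300 = max(gamma_length(encode_offset(d, 0, 300)) for d in range(511))
max_bits_511 = max(gamma_length(encode_offset(d, 0, 511)) for d in range(511))
```

Both specified values lead to the same longest code for differences up to 510, but with 300 every difference of 300 or more wraps around and forces extra string comparisons at search time, whereas with 511 none of the differences in that range wrap.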
[0104]
Since the present invention aims to reduce the number of bits required by the compression code by reducing the difference values of the position information, the code used here is not limited to the γ code and may also be the α code or the δ code.
[0105]
As described above, according to the present invention, the appearing character strings are grouped and the remainder of the appearance position information with respect to a certain constant is registered in the inverted file, so that the memory usage of the index can be reduced without degrading the search speed.
[0106]
According to the present invention, the following effects can be obtained.
[0107]
(1) Since one type or a plurality of types of character strings are associated with one inverted file, the difference values of the appearance position information between character strings can be reduced, and the management information for the inverted files can also be reduced.
[0108]
(2) Since a new character string is associated with the inverted file having the smallest number of registrations, no inverted file with an extremely large number of registrations is created, and therefore no search key with an extremely slow search time arises. In addition, by reducing the inverted file management information and further reducing the difference values in the inverted file, the inverted file can be configured and managed with a small number of bits.
[0109]
(3) Since the inverted files to be associated are determined in advance and newly appearing character strings are associated with them until the number of registered inverted files exceeds the specified number, it is not necessary to detect the smallest number of registrations every time a new character string appears, and the processing can be simplified.
[0110]
(4) Since the difference value from the previously registered position information is encoded in the inverted file, using a code that expresses a small value with a small number of bits and a large value with a large number of bits reduces the memory usage of the inverted file.
[0111]
(5) Since the difference value is divided by the specified number and the remainder is encoded, the inverted file can be expressed with a small number of bits. In addition, by setting the specified number to the maximum value that the code in use can represent with a limited number of digits, the code length can be used to the fullest. For example, as shown in FIG. 13, a code length of 11 bits can represent values up to 63. In this case, if the specified value is set to 64 and the γ code is used, the 11 bits can be used to the fullest.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram of association of a plurality of character strings.
FIG. 2 is an explanatory diagram for converting a difference value into a small value.
FIG. 3 is an explanatory diagram of an embodiment of the present invention.
FIG. 4 is an explanatory diagram of a main part of an index.
FIG. 5 is an operation explanatory view (No. 1) of the first embodiment of the present invention;
FIG. 6 is an operation explanatory diagram (No. 2) of the first embodiment of the present invention;
FIG. 7 is an operation explanatory diagram (No. 3) of the first embodiment of the present invention;
FIG. 8 is an explanatory diagram of encoding of a positive integer.
FIG. 9 is an explanatory diagram of a second embodiment of the present invention.
FIG. 10 is an operation explanatory diagram (part 1) according to the second embodiment of the present invention;
FIG. 11 is an operation explanatory diagram (No. 2) of the second embodiment of the present invention;
FIG. 12 is an explanatory diagram of an inverted file.
FIG. 13 is an explanatory diagram of encoding of a positive integer.
FIG. 14 is an explanatory diagram of a compression example.
FIG. 15 is an explanatory diagram of two character string extraction.
FIG. 16 is an explanatory diagram of position information when the ending is changed.
[Explanation of symbols]
1 PC body
2 Display device
3 Keyboard
4 Document search means
5 Document data
6 Index
7 Inverted file management table
8 Hash table
9 Inverted file list
10 Inverted file storage area
11 Maximum number holding part
12 File number holding part

Claims

1. An inverted file creation method for registering an index character string and its position information, wherein
an inverted file list, in which the number of registrations indicating the number of pieces of position information registered in an inverted file is entered, and
a table having sections that correspond to character strings and point to the inverted file list are provided,
one or more types of character strings are registered in one inverted file, and
the position information is encoded and stored based on its difference from other previously registered position information.

2. The inverted file creation method according to claim 1, wherein when a new character string appearing in the process of creating an index is registered in the inverted file, the new character string is associated with the inverted file having the smallest number of registrations.

3. The inverted file creation method according to claim 1, wherein, when a new character string appearing in the process of creating an index is registered in an inverted file,
one or a plurality of inverted files to be used for registration are determined in advance as a group, and even if new character strings appear thereafter, they are registered in the group as long as the number of registered inverted files does not exceed a certain number.

4. The inverted file creation method according to claim 1, wherein, when the position information is encoded based on a difference from other previously registered position information, the encoding is based on the difference from the position information previously registered in the same inverted file.

5. The inverted file creation method according to claim 1, wherein, when the position information is encoded based on a difference from other previously registered position information, a specified number is set to the maximum value that the code in use can represent with a limited number of digits, and the remainder obtained by dividing the difference value of the position information by the specified number is encoded.