JP3578501B2

JP3578501B2 - Document search method and apparatus

Info

Publication number: JP3578501B2
Application number: JP30557594A
Authority: JP
Inventors: 川口　　久光; 奈津子水谷; 敦畠山; 勝己多田; 寛次加藤; 悟志浅川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1994-11-15
Filing date: 1994-11-15
Publication date: 2004-10-20
Anticipated expiration: 2019-10-20
Also published as: JPH08147328A

Description

【０００１】
【産業上の利用分野】
本発明は、インデックスを使用した文書検索方法及び装置に係り、データベース、文書ファイリングシステムおよびＤＴＰ（ＤｅｓｋＴｏｐＰｕｂｌｉｓｈｉｎｇ）システムなどに適用されるものである。
【０００２】
【従来の技術】
情報処理システムでは、データベースに格納されている文字列データの集まりからなる文書の中から、検索者の探したいある特定の言葉、すなわち質問語、を含む全ての文書を探し出すことが一つの重要な処理となっている。
このような文書を検索するための方法として、インデックスを使用したインデックス検索方式が良く知られている。この方式は“情報検索”（中原著、電子通信情報学会出版、１９７４）ｐｐ．２０３−２０７（以下、公知例１と呼ぶ）や“ＤＯＣＵＭＥＮＴＤＡＴＡＢＡＳＥ”（Ｇ．Ｊａｍｅｓ著、ＶａｎＮｏｓｔｒａｎｄＲｅｉｎｈｏｌｄＣｏ．、１９８５）ｐｐ．８７−９４に具体的に記載されている。
ここで取り上げられているインデックスは、キーワードが出現する文書の文書番号で構成されている。これらのインデックス検索方式では、質問語と一致するキーワードのインデックスを参照するだけで、そのキーワードを含む文書が分かるため高速な検索が可能である。
【０００３】
上記公知例１に記載されているインデックスの例を図２に示す。インデックスは、文書から抽出されたキーワードに対応して、キーワード番号とキーワードが出現する文書の文書番号が格納される構成となっている。
本例では、キーワード“コア”、“ディスク”、“コンピュータ”および“ＩＲ”に対応するインデックスが作成され、磁気ディスク上のファイルに格納されていることを想定している。
検索の際には、質問語として“コア”、“ディスク”、“コンピュータ”および“ＩＲ”が指定されたときのみ、このインデックスの中のそれぞれのキーワードが出現する文書の文書番号がインデックスが格納されているファイル（以後、インデックスファイルと呼ぶ）から読み出される。
すなわち、質問語が“コア”の場合には文書番号１，４，質問語が“ディスク”の場合には文書番号４、質問語が“コンピュータ”の場合には文書番号１，２，４，質問語が“ＩＲ”の場合には文書番号２のように検索結果として文書番号が出力される。
新たな文書をデータベースへ登録する際には、その文書に出現したキーワードが抽出され、このキーワードが出現した文書の文書番号が、そのキーワードに対応するインデックスに追加登録される。
このように文書からキーワードを抽出する技術は、“自動索引付け研究の動向”（諸橋著、情報処理学会誌、Ｖｏｌ．２５、Ｎｏ．９、１９８４）や“ＤＯＣＵＭＥＮＴＤＡＴＡＢＡＳＥ”（Ｇ．Ｊａｍｅｓ、ＶａｎＮｏｓｔｒａｎｄＲｅｉｎｈｏｌｄＣｏ．、１９８５）ｐｐ．８７−９４に記載されている。
これらのキーワード抽出技術を用いて抽出されたキーワードに対応するインデックスの追加処理例を図３に示す。
本例では、登録対象の文書の文書番号は５であり、この文書から“コア”、“コンピュータ”および“ＩＲ”が抽出されたことを想定する。このインデックスでは“コア”、“コンピュータ”および“ＩＲ”に対応するインデックスにそれぞれ文書番号５が追加されている。このようにして、抽出されたキーワードに対応するインデックスの追加処理が行われることにより、文書の登録処理が実現される。
【０００４】
【発明が解決しようとする課題】
“情報検索”（中原著、電子通信情報学会出版、１９７４）ｐｐ．１２０−１２８によれば、インデックスが格納される記憶装置としては、ランダムアクセスを行うことができ、大容量で安価な磁気ディスクなどの二次記憶装置の使用が一般的とされている。
磁気ディスクに格納されているインデックスの追加処理では、追加の対象となったキーワードに対応するインデックスが独立にアクセスされる。すなわち、インデックスの追加処理は、磁気ディスク上の複数のインデックスが飛び飛びにアクセス（以下、ランダムアクセスと呼ぶ）されることになる。
ワークステーションの一般的なオペレーティングシステムにおいては、磁気ディスクへのアクセスが論理ブロックと呼ばれる単位で行われる。ここでは論理ブロックのサイズとして、上記オペレーティングシステムで使われている８、１９２バイト（以後、８ＫＢと省略する）を想定する。
ただし、磁気ディスクへの書き込みは、８ＫＢ単位で行われない場合もある。例えば、インデックスの追加処理で書き込む文書番号のサイズを一文書番号当たり４バイトと想定すると、一文書を登録する場合には論理ブロックより少ない８ＫＢ未満のデータの書き込みとなる。このような場合、上記オペレーティングシステムでは、書き込み対象となっている論理ブロックを磁気ディスクから上記オペレーティングシステムの主記憶上のバッファエリアに一旦読み込む。次に、書き込む対象のデータをバッファエリア内の論理ブロックの所定の場所に書き込むことにより論理ブロックを更新する。その後で、この論理ブロックを再び磁気ディスクに書き込む。このようにして、論理ブロックより少ないデータの磁気ディスクへの書き込みを実現している。
【０００５】
このときの磁気ディスクの動作の一例を図４を用いて説明する。
まず、磁気ディスクのヘッドを、読み出し対象の論理ブロックの先頭位置に位置決めするためのヘッドのシーク処理と回転待ち処理が行われ、次に読み出し対象となる論理ブロックの読み出し処理が行われる。
その後、前記論理ブロックの更新（文書番号の追加処理）が行われ、再び磁気ディスクのヘッドを読み出し対象となった論理ブロックの位置に位置決めするためのヘッドのシーク処理と回転待ち処理が行われ、その後、論理ブロックの磁気ディスクへの書き込み処理が行われる。
【０００６】
本図の各処理時間の値は一般的な３．５インチの磁気ディスクのものであり、平均的なシーク時間（以後、平均シーク時間と呼ぶ）としては約１４ｍｓ、平均的な回転待ち時間（以後、平均回転待ち時間と呼ぶ）としては約１７ｍｓ、一論理ブロック当たりの読み出し時間および書き込み時間としては約４ｍｓを想定している。また、バッファエリア上の論理ブロックに４バイトの文書番号を書き込む時間としては０．００１ｍｓを想定している。
本例では、論理ブロックより少ないデータの磁気ディスクへの書き込み処理に合計７０ｍｓ掛かる。本図のタイムチャートより磁気ディスクからの読み出し処理および書き込み処理に費やされる時間が８ｍｓであるのに対して、ヘッドを位置決めするためのシーク処理および回転待ち処理に費される時間は６２ｍｓと８倍程度長く掛かっていることが分かる。
すなわち、磁気ディスクへの読み出しや書き込みの処理速度は２ＭＢ／ｓなのに対し、シーク処理や回転待ち処理の時間を含めた全体の実効的な処理速度は約０．１３ＭＢ／ｓとなり、磁気ディスクの読み出しおよび書き込み性能が引き出せない状況となっている。
【０００７】
このため、文書の登録時に行われるインデックスの追加処理において、追加処理が発生したキーワードの個数分磁気ディスクのランダムアクセスが発生することにより磁気ディスクの実効的な処理速度が低下し、インデックスの追加に時間が掛かることになる。つまり、インデックスを用いた文書検索方式では、文書の登録に時間が掛かるという問題がある。
本発明の目的は、登録時に行われるインデックスの追加処理を高速化し、登録時間を短縮することにある。
【０００８】
【課題を解決するための手段】
上記課題を解決するため、本発明は、
文書からキーワードを抽出し、これに基づいてインデックスを作成し、検索時に質問語と一致したキーワードに対応するインデックスを参照して検索を行う文書検索方法において、前記インデックスを二次記憶装置上のファイルに格納するとともに、該ファイルを所定数のブロックに分割しておき、各ブロックに格納されるインデックスのサイズがほぼ均等になるようにキーワードを該ブロックに割り付け、文書の登録時に、前記ブロックに追加または更新が発生したインデックスの個数が所定数以上の場合には、該ブロックを二次記憶装置から主記憶上へ読み込むとともに、該ブロック内のインデックスに対応するキーワードについて、インデックスの追加処理を主記憶上で行い、追加処理された該ブロックを二次記憶装置へ格納し、所定数未満の場合には、該インデックスを二次記憶装置から主記憶上へ読み込むとともに、該インデックスに対応するキーワードについて、インデックスの追加処理を主記憶上で行い、追加処理された該インデックスを二次記憶装置へ格納するようにしている。
また、キーワードのブロック割り付けに際して、ブロックに割り付けられたキーワードについて、インデックスが作成される全ての文書の内、該キーワードの出現文書数を算出し、各キーワード毎に算出した出現文書数の総和をブロック毎に算出し、各ブロックにおける出現文書数の総和が各ブロックにおいてほぼ均等になるようにキーワードのブロック割り付けを行うようにしている。
さらに、文書からキーワードを抽出し、これに基づいてインデックスを作成し、検索時に質問語と一致したキーワードに対応するインデックスを参照して検索を行う文書検索装置において、前記インデックスを二次記憶装置上のファイルに格納するとともに、該ファイルを所定数のブロックに分割する手段と、各ブロックに格納されるインデックスのサイズがほぼ均等になるようにキーワードを該ブロックに割り付ける割り付け手段と、文書の登録時に、前記ブロックに追加または更新が発生したインデックスの個数が所定数以上か否か判定する手段と、判定結果が所定数以上のとき該ブロックを二次記憶装置から主記憶上へ読み込み、該ブロック内のインデックスに対応するキーワードについてインデックスの追加処理を主記憶上で行い、追加処理された該ブロックを二次記憶装置へ格納する手段と、所定数未満のとき該インデックスを二次記憶装置から主記憶上へ読み込み、該インデックスに対応するキーワードについて、インデックスの追加処理を主記憶上で行い、追加処理された該インデックスを二次記憶装置へ格納する手段を備えるようにしている。
また、前記割り付け手段は、ブロックに割り付けられたキーワードについて、インデックスが作成される全ての文書の内、該キーワードの出現文書数を算出し、各キーワード毎に算出した出現文書数の総和をブロック毎に算出し、各ブロックにおける出現文書数の総和が各ブロックにおいてほぼ均等になるようにキーワードのブロック割り付けを行うようにしている。
【０００９】
【作用】
上記手段により、全ての文書からキーワードを抽出し、二次記憶装置上の所定数のブロックの各ブロックのインデックスサイズがほぼ均等になるように、キーワードをブロックに割り付けることができ、文書の登録時に、インデックスを格納するブロック毎に、該当する抽出キーワードが所定数以上の場合には、そのブロックを磁気ディスクからメモリ上に読み込むとともに該当する抽出キーワードに対するインデックスの追加処理メモリ上で一括して行い、これを磁気ディスクへ書き込むことにより、磁気ディスクへのアクセス回数を低減し、所定数未満の場合には該当する抽出キーワードに対応するインデックスのみメモリ上に読み込むとともにそのインデックスの追加処理を行い、これを磁気ディスクへ書き込むことにより、磁気ディスクへのアクセスデータ量を最小化することができるため、非常に高速なインデックス追加処理が可能となり、文書データベースへの高速な登録処理を実現することができる。
【００１０】
【実施例】
まず、本発明の原理を以下に説明する。
初期設定として、インデックスを格納するインデックスファイルのエリアとして所定サイズ分を磁気ディスク上に確保するとともに、これを所定数のブロック（論理ブロックではない）に分割する。次に、各ブロックに格納されるインデックスの容量（インデックスのサイズ）の総和がほぼ均等になるように、インデックスに対応するキーワードを各ブロックに割り付ける。
文書の登録時には、まず、登録文書からキーワードを抽出し、そのキーワードとそれが出現した文書の文書番号を主記憶へ格納する。
次に上記ブロック毎に、そこに割り付けられているキーワードが、いくつ抽出されているかを調べる。
所定数未満の場合には、従来と同様に、各ブロック毎に抽出されたキーワードに対応するインデックスを磁気ディスクから主記憶へ読み込む。次に、主記憶に読み込まれているインデックスの末尾に、そのキーワードが出現した文書の文書番号を追加し、磁気ディスクに再び格納する。
所定数以上の場合には、まず、該当ブロックを磁気ディスクから主記憶へ読み込む。次に、抽出されたキーワードの中で該ブロックに割り付けられているものについてのみ、主記憶に読み込まれている該ブロック内の該キーワードに対応するインデックスの末尾に、そのキーワードが出現した文書の文書番号を追加する。その後、このブロックを磁気ディスクに再び格納する。
この一連の処理を、抽出されたキーワードが割り付けられている全てのブロックに対して行うことにより、登録文書から抽出されたキーワードのインデックスへの追加処理を行う。
【００１１】
以上のように、抽出されたキーワードが所定数以上含まれるブロックに関しては、磁気ディスク上のブロックを１度だけ主記憶に読み込み、新たなキーワードに対する追加処理を行った後に、これを磁気ディスクに書き込むだけで複数のインデックスの追加処理を実現できるため、従来のようにキーワード毎に磁気ディスクから該当するインデックスを読み込み、そのキーワードに対する追加処理を行った後に、これを磁気ディスクへ書き込む場合に比べ大幅にインデックスの追加処理に掛かる時間を削減することができる。
【００１２】
以上説明した原理を、さらに具体例を用いて説明する。
本例で用いるインデックスファイルの例を図５に示す。
本インデックスファイルは、図２に示すインデックスファイルをブロック１とブロック２の二つに分割し、磁気ディスクに格納したものである。さらに、ブロックに含まれるインデックスのサイズの総和がほぼ均等になるように、ブロック１にはキーワード“コア”、“ディスク”および“ＩＲ”を、ブロック２にはキーワード“コンピュータ”を割り付け、対応するブロックにインデックスを格納している。
【００１３】
このインデックスは、検索の際には、ブロックを意識することなく従来と同様に検索処理が行われる。質問語として“コア”、“ディスク”、“コンピュータ”および“ＩＲ”が指定されたときのみ、それぞれのキーワードが出現する文書の文書番号がインデックスファイルから読み出される。
すなわち、質問語が“コア”の場合には文書番号１，４，質問語が“ディスク”の場合には文書番号４、質問語が“コンピュータ”の場合には文書番号１，２，４，質問語が“ＩＲ”の場合には文書番号２のように検索結果として文書番号が出力される。
【００１４】
新たな文書をデータベースへ登録する際には、図６に示すように上記ブロックを意識した処理を行う。以下、詳細にその手順を説明する。
まず、登録対象の文書からキーワードとして、“コア”、“ディスク”および“ＩＲ”が抽出されたものとし、さらにこの文書の文書番号として文書番号５を想定する。
抽出されたキーワードのブロックの割り付けとしては、ブロック１にはキーワード“コア”、“ディスク”および“ＩＲ”が、ブロック２には“コンピュータ”が割り付けられている。これらのキーワードの中で文書５から抽出されたキーワードとしては、ブロック１にキーワード“コア”、“ディスク”および“ＩＲ”の三つが該当する。
このため、まず、ブロック１を主記憶上に読み込む。次にブロック１に格納されているキーワード“コア”、“ディスク”および“ＩＲ”に対応するインデックスの末尾にこれらのキーワードが出現する文書の文書番号である５をそれぞれ追加する。その後、主記憶上に格納されているブロック１を磁気ディスク上のインデックスファイルに書き込む。
ブロック２についてはこの中に割り付けられたキーワードに該当するものが文書５から抽出されていないためインデックスの追加処理は行わない。このようにして、文書の登録処理が行われる。
【００１５】
上記ブロック１のインデックス追加処理のタイムチャートを図７に示す。
本例では、磁気ディスクとして一般的な３．５インチの磁気ディスクを使用し、論理ブロックのサイズとしてワークステーションの一般的なオペレーティングシステムで使われている８ＫＢを使用し、インデックスを格納するブロックのサイズとして８論理ブロックを使用すること想定する。
また、磁気ディスクのシーク時間および回転待ち時間としては、平均シーク時間および平均回転待ち時間を想定する。さらに、磁気ディスクにおける平均シーク時間、平均回転待ち時間、一論理ブロックの読み出し時間および一論理ブロックの書き込み時間には、それぞれ、約１４ｍｓ、約１７ｍｓ、約４ｍｓおよび約４ｍｓを想定する。上記の値を用いて、本図のタイムチャートの流れを説明する。
本例では、まず、ブロック１をバッファエリアへ読み出すときに磁気ディスクへのアクセスが発生し、シーク処理と回転待ち処理により磁気ディスクのヘッドがブロック１の先頭に位置決めされる。この間、平均シーク時間１４ｍｓと平均回転待ち時間１７ｍｓが費やされる。
次に、ブロック１を構成する論理ブロック、この場合八つの論理ブロックとする、がバッファエリアに読み込まれる。この際、１論理ブロック分の読み出し時間の８倍の３２ｍｓが費やされる。
ここで、バッファエリアに読み出されたブロック１に格納されているキーワード“コア”、“ディスク”および“ＩＲ”に対応するインデックスの末尾にこれらのキーワードが出現する文書の文書番号である５がそれぞれ追加される。したがって、ブロック１に３回の文書番号の追加が発生する。このときのバッファエリア上の一つのインデックスに４バイトの文書番号を書き込む時間として０．００１ｍｓを想定する。ここでは、３回の文書番号の追加が発生するため３倍の０．００３ｍｓを要するが、他の処理時間に比べ無視できるほど小さい。
その後、インデックスの追加処理が行われたブロック１を、磁気ディスクに書き込む。このとき、シーク処理と回転待ち処理により磁気ディスクのヘッドが所定の位置に位置決めされる。その間、平均シーク時間１４ｍｓと平均回転待ち時間１７ｍｓが費やされる。この後に、ブロック１を構成する八つの論理ブロックが磁気ディスクに書き込まれ、１ブロック分の書き込み時間の８倍の３２ｍｓが費やされる。
以上の処理により、本例のインデックスの追加に合計１２６ｍｓが費やされることになる。これは、一キーワード当たり平均４２ｍｓとなる。
従来のように、キーワード毎に磁気ディスク上のインデックスへの追加処理を行ったときには、前述した図４のように、一キーワード当たり７０ｍｓ掛かったものが、本発明のようにブロック単位にまとめてインデックスの追加処理を行うことにより４０％程度文書の登録時間を短縮することが可能となる。
一般の文書を登録する際には、登録キーワードの個数が本例の数十倍にもなるため、本発明のブロック単位でのインデックスの追加処理の効果は更に大きくなる。
【００１６】
以上説明したように、本発明によれば、所定数以上の追加登録キーワードが含まれるブロックにおけるインデックスの追加処理を一括して行うことにより、磁気ディスク上のブロックを１度主記憶に読み込むだけで複数のインデックスの追加処理が実行できるため、文書データベースへの高速な登録処理を実現することができる。
【００１７】
以下、本発明の実施例を説明する。
本発明が適用された文書検索システムの構成について図１を用いて説明する。
本システムは、ディスプレイ１０１、キーボード１０２、ＣＰＵ１０３、メモリ１０４、磁気ディスク１０５およびフロッピーディスクドライブ（ＦＤＤ）１０６から構成される。
ディスプレイ１０１、キーボード１０２、メモリ１０４、磁気ディスク１０５およびＦＤＤ１０６は、ＣＰＵ１０３よりバスを介してアクセスされる。磁気ディスク１０５には、インデックスファイル８０００が格納される。
メモリ１０４には、システム制御プログラム５０００、検索インタフェースプログラム６０００、登録制御プログラム２０００、検索制御プログラム３０００、キーワード割り付けプログラム２１００、インデックス作成登録プログラム２２００およびインデックス検索プログラム３１００がロードされ、ワークエリア４０００が確保される。
本文書検索システムの文書データベースに登録される文書は、フロッピーディスク１０７に格納され、ＦＤＤ１０６を介してＣＰＵ１０３よりアクセスされる。
【００１８】
本システムでは、電源投入時ＣＰＵ１０３によりシステム制御プログラム５０００が起動され、システム制御プログラム５０００の制御のもとに登録制御プログラム２０００および検索制御プログラム３０００が起動される。
まず、このような構成の本システムにおける文書の登録処理の概略について説明する。
ユーザがキーボード１０２から入力した指示に従って、システム制御プログラム５０００が登録制御プログラム２０００を起動する。
登録制御プログラム２０００では、最初、文書を登録する前に、ユーザがキーボード１０２から入力した指示に従い、キーワード割り付けプログラム２１００を起動し、インデックスファイルの初期設定を行う。
まず、ユーザがキーボード１０２から入力した指示に従い、インデックスを格納するインデックスファイル８０００を所定ブロック数分磁気ディスク１０５上に確保するとともに、これを指定された数のブロックに分割する。
そして、各ブロックにキーワードを割り付けるための所定数の文書（以後、種文書と呼ぶ）がＦＤＤ１０６を介してフロッピーディスク１０７からメモリ１０４のワークエリア４０００に読み込まれる。種文書としては、例えば１０万件の新聞記事ＤＢを作成する場合には、同じ種類の文書、すなわち新聞記事を数百件〜数千件程度登録する。
次に、ワークエリア４０００に読み込まれた種文書から検索に必要な言葉をキーワードとして抽出し、そのキーワードの出現文書数を算出する。
このキーワードの出現文書数の総和が、各ブロック間でほぼ均等になるようにキーワードを各ブロックに割り付ける。その後、登録制御プログラム２０００では、インデックス作成登録プログラム２２００を起動する。
【００１９】
インデックス作成登録プログラム２２００では、ユーザがキーボード１０２から入力した指示に従い、フロッピーディスク１０７に格納された登録対象の文書を、ＦＤＤ１０６を介してメモリ１０４のワークエリア４０００に読み込む。
この登録文書から検索に必要な言葉がキーワードとして抽出され、インデックスファイル８０００の該当ブロックにキーワードと文書番号、あるいは文書番号が登録される。
【００２０】
次に、本システムにおける文書の検索動作の概略について説明する。ユーザがキーボード１０２から入力した指示に従い、システム制御プログラム５０００は検索制御プログラム３０００と検索インタフェースプログラム６０００を起動する。
その後、ユーザがキーボード１０２から入力した質問語は、検索インタフェースプログラム６０００に入力され、検索制御プログラム３０００に送られる。
検索制御プログラム３０００では、インデックス検索プログラム３１００を起動するとともに本プログラムへ前記質問語を送る。
インデックス検索プログラム３１００では、受け取った質問語に対応するインデックスから文書番号を読み出し、検索結果として検索制御プログラム３０００へ送出する。
本検索結果は、検索インタフェースプログラム６０００へと送られ、検索結果文書番号としてディスプレイ１０１に表示される。
【００２１】
次に、キーワード割り付けプログラム２１００の構成とキーワード割り付け処理について図８を用いて説明する。
キーワード割り付けプログラム２１００は、インデックス分割ステップ２１０５、種文書数分繰返しステップ２１１０、種文書読み込みステップ２１２０、キーワード抽出ステップ２１３０、出現文書数カウントステップ２１４０およびキーワード割り付けステップ２１５０から構成される。
まず、インデックス分割ステップ２１０５では、ユーザから指定されたブロック数をキーボード１０２から読み込む。次にインデックスファイル８０００として、指定のブロック数分のエリアを確保するとともに、これを指定ブロック数に均等分割する。
次に、種文書読み込みステップ２１２０では、ＦＤＤ１０６を介して種文書を１文書分読み込みワークエリア４０００に格納する。
さらに、キーワード抽出ステップ２１３０では、読み込まれた種文書からキーワードとなる言葉を抽出し、この抽出されたキーワードをワークエリア４０００に格納する。
この文書からキーワードを抽出する技術は、日本語文書については“自動索引付け研究の動向”（諸橋著、情報処理学会誌、Ｖｏｌ.２５、Ｎｏ.９、１９８４）に記載されており、英語文書については“ＤＯＣＵＭＥＮＴＤＡＴＡＢＡＳＥ”（Ｇ．Ｊａｍｅｓ、ＶａｎＮｏｓｔｒａｎｄＲｅｉｎｈｏｌｄＣｏ．、１９８５）ｐｐ．８７−９４に記載されている。本実施例では、これらのキーワード抽出技術をそのまま利用する。
出現文書数カウントステップ２１４０では、抽出されたキーワードが出現する文書数をカウントし、キーワードに対応させて、ワークエリア４０００に格納する。
種文書数分繰返しステップ２１１０では、全ての種文書についてステップ２１２０からステップ２１４０までの一連の処理を繰り返す。
全ての種文書が処理された後、キーワード割り付けステップ２１５０では、出現文書数の総和がほぼ均等になるように、抽出したキーワードを各ブロックに割り付け、そのブロック番号をキーワードに対応する形でワークエリア４０００に格納する。
【００２２】
このように、種文書から抽出したキーワードの出現文書数を算出することにより、文書データベースのインデックスサイズが予測できる。つまり、種文書から抽出したキーワードに対応するインデックスのサイズは、種文書から抽出したキーワードの出現文書数と文書番号サイズの積により算出でき、これに文書データベースの登録件数と種文書数（サンプリングされた種文書数である）の比を掛けることにより、文書データベースにおけるインデックスサイズが予測できるからである。
また、文書データベースに登録する文書数が増加しブロックに割り付けられた全てのインデックスのサイズの総和が、ブロック間でほぼ均等でなくなった場合には、この文書データベースに登録した全ての文書の中から種文書の候補を乱数抽出などの手法を使い、全文書数の数％程度抽出する。この種文書を基に、再度、ブロックへのキーワードの割り付けを行う。
このキーワードの割り付けに基づき、文書データベースに登録されている全ての文書を再登録することにより、ブロックに割り付けたインデックスのサイズの総和をブロック間でほぼ均等にすることが可能となる。
他の方法として、ブロックに割り付けられているインデックスのサイズの和が他のブロックに比べ多いブロックについて、これに割り付けられているキーワードを、サイズの小さいブロックに割り付け直すことにより、インデックスのサイズの和をブロック間でほぼ均等にすることも可能である。
ブロックへキーワードを割り付ける際の指標として、キーワードの文書出現数の他に、キーワードそのものの出現数を使用したり、キーワードの種類数を使用することも可能である。
以上の処理を行うことにより、全ての種文書からキーワードを抽出し、各ブロックのインデックスサイズがほぼ均等になるように、キーワードをブロックに割り付けることができる。
【００２３】
さらに、上記キーワード割り付けステップ２１５０におけるキーワード割り付け処理について、図９を用いて詳細に説明する。
キーワード割り付けステップ２１５０は、抽出キーワードソートステップ２１５２、ブロック番号初期設定ステップ２１５３、抽出キーワード繰返しステップ２１５４、ブロック繰返しステップ２１５５、ブロックサイズ判定ステップ２１５６、キーワード設定ステップ２１５７、ブロック番号カウントステップ２１５８およびジャンプステップ２１５９から構成されている。
まず、抽出キーワードソートステップ２１５２で、ワークエリア４０００に格納されている抽出キーワードを、抽出キーワードに対応して格納されている抽出キーワードの出現文書数を降順にソートする。
次に、ブロック番号初期設定ステップ２１５３では、最初に処理するブロック番号として最初のブロック番号である１を設定する。
さらに、ブロックサイズ判定ステップ２１５６では、抽出キーワードをブロックに割り付ける場合を想定し、本ブロックに割り付けられるインデックスのサイズの和を算出し、所定のブロックサイズを越えるかどうか判定する。
越えない場合のみ、まず、キーワード設定ステップ２１５７を実行し、本ブロックに抽出キーワードを割り付ける。
次に、ブロック番号カウントステップ２１５８を実行し、処理対象となっているブロックのブロック番号に１を加え、次に処理するブロックのブロック番号を設定する。
このカウントアップにおいて、ブロック番号が最終番号までカウントアップされた場合には、最初のブロック番号の１に戻ることにする。
さらに、ジャンプステップ２１５９によりブロック繰返しステップ２１５５における繰返し処理を打ち切り、Ｌ１以降のステップを実行する。ここでは、抽出キーワード繰返しステップ２１５４を実行する。
ブロック繰返しステップ２１５５では、ステップ２１５６からステップ２１５９までの処理をブロック番号から順に全てのブロックについて繰返し行う。
抽出キーワード繰返しステップ２１５４では、ステップ２１５５からステップ２１５９までの処理を、全ての抽出キーワードについて繰返し行う。
以上の一連の処理により、出現文書数の最も多い抽出キーワードから順に、各ブロックへ割り付けることができるため、各ブロックにインデックスサイズをほぼ均等に割り付けることが可能となる。
【００２４】
次に、インデックス作成登録プログラム２２００の構成と文書登録処理について図１０を用いて説明する。
インデックス作成登録プログラム２２００は、文書番号取得ステップ２２０５、文書数分繰返しステップ２２１０、登録文書数読み込みステップ２２２０、キーワード抽出ステップ２２３０、ブロック番号対応ステップ２２４０、文書番号カウントステップ２２４５、ブロック数分繰返しステップ２２５０、抽出キーワード数判定ステップ２２６０、ブロック単位インデックス追加ステップ２２７０、キーワード数分繰返しステップ２２８０およびキーワード単位インデックス追加ステップ２２９０から構成される。
まず、文書番号取得ステップ２２０５では、ユーザがキーボード１０２から入力した登録文書の最初の文書番号と登録文書数を読み込む。
次に、登録文書読み込みステップ２２２０で、登録対象の文書をＦＤＤ１０６を介して、１文書分読み込みワークエリア４０００に格納する。
その後、キーワード抽出ステップ２２３０で、キーワード割り付けプログラム２１００におけるキーワード抽出ステップ２１３０と同様に、読み込まれた登録文書からキーワードとなる言葉を抽出する。この抽出キーワードをワークエリア４０００に格納する。
ブロック番号対応ステップ２２４０では、キーワード割り付けステップ２１５０でキーワードに対応する形でワークエリア４０００に格納したブロック番号を調べることにより、抽出キーワードが割り付けられたブロックのブロック番号を取得し、そのブロック番号を抽出キーワードに対応させ、ワークエリア４０００に格納する。もし、抽出キーワードがどのブロックにも割り付けられていない場合には、格納されているインデックスのサイズの和が最も小さいブロックにそのキーワードを割り付け、このブロック番号を抽出キーワードに対応した形でワークエリア４０００に格納する。
その後、文書番号カウントステップ２２４５で、文書番号をインクリメントし、次の登録文書の処理に備える。
文書数分繰返しステップ２２１０では、登録文書数回分、ステップ２２２０からステップ２２４５のキーワード抽出処理を繰り返す。
その後、抽出キーワード数判定ステップ２２６０では、最初のブロック番号であるブロック１について、抽出されたキーワードのうちブロック１に割り付けられている数をカウントし、所定数Ｎ以上か否かを調べる。
この所定数Ｎは、使用磁気ディスク、使用計算機等の性能等を考慮して最適となる値をユーザが指定する。
カウント数が所定数Ｎ以上であれば、次のブロック単位インデックス追加処理２２７０を実行する。ブロック単位インデックス追加処理２２７０では、ブロック１をワークエリア４０００に読み込み、ブロック１に割り付けられている全ての抽出キーワードについて一括してインデックスの追加処理を行う。
所定数Ｎに達しない場合は、キーワード単位インデックス追加ステップ２２９０を実行する。キーワード単位インデックス追加ステップ２２９０では、インデックスファイル８０００に格納されているブロック１の中に存在する上記抽出キーワードに対応するインデックスのみをワークエリア４０００に読み込み、そのインデックスに対応する抽出キーワードが出現する文書の文書番号を上記インデックスに追加するとともに再びインデックスファイル８０００に書き込む。
さらに、キーワード数分繰返しステップ２２８０では、抽出キーワードの中でブロック１に割り付けられているもの全てについてインデックスの追加処理が終了するまで、キーワード単位インデックス追加ステップ２２９０を繰返し実行する。
ブロック数分繰返しステップ２２５０では、全てのブロックについてステップ２２６０からステップ２２９０を繰返し実行し、インデックスの追加処理を行う。
本実施例では、このようにして新たな文書の追加登録を実現する。
【００２５】
さらに、前述のブロック単位インデックス追加処理ステップ２２７０について、図１１を用いて詳細に説明する。
ブロック単位インデックス追加処理ステップ２２７０は、ブロック読み出しステップ２２７２、キーワード数繰返しステップ２２７４、ブロック内インデックス追加ステップ２２７６およびブロック格納ステップ２２７８から構成される。
まず、ブロック読み出しステップ２２７２で、前記ブロック数分繰返しステップ２２５０により指定されたブロックを、インデックスファイル８０００から読み出し、ワークエリア４０００に格納する。
次に、ブロック内インデックス追加ステップ２２７６では、抽出キーワードの中で前記ブロック読み出しステップ２２７２でワークエリア４０００に読み出された更新対象ブロックに割り付けられているキーワードに対応するインデックスの末尾に、該当キーワードが出現した文書の文書番号を追加する。
さらに、キーワード数繰返しステップ２２７４では、上記読み出されたブロックにおける該当キーワードの全てについてステップ２２７６のインデックス追加処理を繰返し行う。
その後、ブロック格納ステップ２２７８では、インデックス追加処理が終了した上記ブロックを再びインデックスファイル８０００に格納する。
以上のように、ブロック単位インデックス追加処理ステップ２２７０では、ブロック単位にインデックスの追加処理を行う。
このように、ブロック単位に一括してキーワードのインデックス追加処理を行うことにより、追加キーワード数が多い場合でも、短時間にインデックスの追加処理を行うことができる。
【００２６】
上述のキーワード割り付けプログラム２１００でワークエリアに格納するキーワードに関する情報の格納例について図１２を用いて説明する。
キーワード抽出ステップ２１３０では、種文書からキーワードとなる言葉が抽出されるとともにワークエリア４０００に格納される。本図に示すように、キーワードに対応してそのキーワード番号を一緒に格納している。
本例では、抽出されたキーワードとして“コア”、“ディスク”、“コンピュータ”および“ＩＲ”が種文書から抽出されたことを想定している。
本ステップでは、これらの抽出キーワードの抽出された順番にシリアルな番号をそのキーワード番号として割り振り、本例のような形でワークエリア４０００に格納する。本例では、キーワード“コア”、“ディスク”、“コンピュータ”および“ＩＲ”にはキーワード番号として、それぞれ１、２、３および４が割り振られている。
次に、出現文書数カウントステップ２１４０で、キーワード毎に種文書における文書出現数がカウントされる。このとき、本図に示すようにキーワードに対応した形で出現文書数を格納する。本例では、キーワード“コア”、“ディスク”、“コンピュータ”および“ＩＲ”における出現文書数は、それぞれ２、１、３および２となっている。
その後、キーワード割り付けステップ２１５０で、上記出現文書数をもとに、インデックスを格納するブロックへのキーワードの割り付けが行われる。ここで割り付けられたブロック番号を、本図に示すようにキーワードに対応付け、ワークエリア４０００に格納する。
本例では、キーワード“コア”、“ディスク”、“コンピュータ”および“ＩＲ”が割り付けられたブロックのブロック番号は、それぞれ１、１、２および１となっている。
このような形式で、キーワードに関する情報をワークエリア４０００に格納することにより、種文書から抽出されたキーワードに関する情報を管理することができるため、ブロックへのキーワード割り付け処理が実現できる。
【００２７】
上述のインデックス作成登録プログラム２２００でワークエリアに格納するキーワードに関する情報の格納例について図１３を用いて説明する。本例では、文書番号５の登録文書からキーワードとして“コア”、“ディスク”および“ＩＲ”が抽出されたことを想定している。
キーワード抽出ステップ２２３０では、登録文書からキーワードとなる言葉が抽出されるとともにワークエリア４０００に格納される。このとき、図１２に示すキーワード情報から、抽出キーワードに対応するキーワード番号と割り付けられたブロックのブロック番号を取得する。このとき、本図に示すように、キーワードに対応して、取得した割り付けブロックの番号とその出現文書である登録文書の番号を格納する。
本例では、キーワード“コア”、“ディスク”、“コンピュータ”および“ＩＲ”には、キーワード番号として、それぞれ１、２、３および４を格納し、割り付けブロック番号と登録文書番号としては、それぞれブロック番号１と文書番号５を格納する。
この情報を基に、ステップ２２５０からステップ２２９０でインデックスの追加処理を行う。
このような形式で、キーワードに関する情報をワークエリア４０００に格納することにより、登録文書から抽出されたキーワードに関する情報を管理することができるため、これに対応するインデックスの追加処理が実現できる。
【００２８】
以上説明したように、本発明によれば、インデックスを格納するブロック毎に、該当する抽出キーワードが所定数以上の場合には、そのブロックを磁気ディスクからメモリ上に読み込むとともに該当する抽出キーワードに対するインデックスの追加処理メモリ上で一括して行い、これを磁気ディスクへ書き込むことにより、磁気ディスクへのアクセス回数を低減し、所定数未満の場合には該当する抽出キーワードに対応するインデックスのみメモリ上に読み込むとともにそのインデックスの追加処理を行い、これを磁気ディスクへ書き込むことにより、磁気ディスクへのアクセスデータ量を最小化することができるため、非常に高速なインデックス追加処理が可能となり、文書データベースへの高速な登録処理を実現することができる。
【００２９】
【発明の効果】
本発明によれば、複数のインデックスの追加処理をメモリ上で一括して実行することで、磁気ディスクへのアクセス回数を低減することができるため高速なインデックス追加処理が可能となり、文書データベースへの登録処理を高速に行うことが可能となる。
【図面の簡単な説明】
【図１】本発明が適用された文書検索システムの構成を示す図である。
【図２】インデックスの構成例を示す図である。
【図３】新たな文書に含まれるキーワードを追加した場合のインデックスの構成例を示す図である。
【図４】インデックス追加処理時のタイムチャートを示す図である。
【図５】ブロックに分割されたインデックスの構成例を示す図である。
【図６】新たな文書に含まれるキーワードを追加した場合のブロックに分割されたインデックスの構成例を示す図である。
【図７】論理ブロックの８回連続読み出し処理と８回連続書き込み処理を行った場合のインデックス追加処理時のタイムチャートを示す図である。
【図８】キーワード割り付けプログラム２１００の処理手順を示す図である。
【図９】キーワード割り付けステップ２１５０の処理手順を示す図である。
【図１０】インデックス作成登録プログラム２２００の処理手順を示す図である。
【図１１】ブロック単位のインデックス追加処理ステップ２２７０の処理手順を示す図である。
【図１２】キーワード管理テーブルの構成例を示す図である。
【図１３】ワークテーブルの構成例を示す図である。
【符号の説明】
１０１ディスプレイ
１０２キーボード
１０３ＣＰＵ
１０４メモリ
１０５磁気ディスク
１０６ＦＤＤ
１０７フロッピーディスク
２０００登録制御プログラム
３０００検索制御プログラム
４０００ワークエリア
５０００システム制御プログラム
６０００検索インタフェースプログラム
８０００インデックスファイル
[0001]
[Industrial applications]
The present invention relates to a document search method and apparatus using an index, and is applicable to databases, document filing systems, desktop publishing (DTP) systems, and the like.
[0002]
[Prior art]
In an information processing system, one important task is to find, among documents composed of collections of character string data stored in a database, all documents that contain a specific word the searcher is looking for, that is, a query word.
As a method for searching for such documents, the index search method, which uses an index, is well known. This method is described in detail in "Information Search" (Nakahara, IEICE Publishing, 1974), pp. 203-207 (hereinafter referred to as Known Example 1) and "DOCUMENT DATABASE" (G. James, Van Nostrand Reinhold Co., 1985), pp. 87-94.
The index discussed there consists of the document numbers of the documents in which each keyword appears. With these index search methods, the documents containing a keyword can be found simply by referring to the index of the keyword that matches the query word, so high-speed search is possible.
[0003]
FIG. 2 shows an example of the index described in the above-mentioned known example 1. The index is configured to store the keyword number and the document number of the document in which the keyword appears, corresponding to the keyword extracted from the document.
In this example, it is assumed that indexes corresponding to the keywords “core”, “disk”, “computer”, and “IR” have been created and stored in files on the magnetic disk.
At search time, only when "core", "disk", "computer", or "IR" is specified as the query word are the document numbers of the documents in which that keyword appears read from the file in which the index is stored (hereinafter referred to as the index file).
That is, document numbers are output as the search result as follows: 1 and 4 when the query word is "core"; 4 when it is "disk"; 1, 2, and 4 when it is "computer"; and 2 when it is "IR".
When registering a new document in the database, a keyword that appears in the document is extracted, and the document number of the document in which this keyword appears is additionally registered in the index corresponding to the keyword.
Techniques for extracting keywords from documents in this way are described in “Trends in Automatic Indexing Research” (Morohashi, IPSJ Journal, Vol. 25, No. 9, 1984) and “DOCUMENT DATABASE” (G. James, Van Nostrand Reinhold Co., 1985), pp. 87-94.
FIG. 3 shows an example of processing for adding an index corresponding to a keyword extracted using these keyword extraction techniques.
In this example, it is assumed that the document number of the document to be registered is 5 and that “core”, “computer”, and “IR” have been extracted from it. Document number 5 is therefore appended to each of the indexes corresponding to “core”, “computer”, and “IR”. Document registration is realized by performing this index addition for each extracted keyword.
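The conventional index of FIG. 2 and the registration step of FIG. 3 can be illustrated with a small in-memory sketch. The names `index`, `search`, and `register` are hypothetical and chosen only for this illustration; the description itself stores the index in a file on a magnetic disk, not in a Python dictionary.

```python
from collections import defaultdict

# Index of FIG. 2: keyword -> list of document numbers in which it appears.
# (An in-memory stand-in; the description keeps this in an index file on disk.)
index = defaultdict(list, {
    "core": [1, 4],
    "disk": [4],
    "computer": [1, 2, 4],
    "IR": [2],
})


def search(query_word):
    """Return the document numbers for a query word (empty if not indexed)."""
    return list(index.get(query_word, []))


def register(doc_number, keywords):
    """Append doc_number to the index of every keyword extracted from the document."""
    for keyword in keywords:
        index[keyword].append(doc_number)


# FIG. 3 example: document 5 yields the keywords "core", "computer" and "IR".
register(5, ["core", "computer", "IR"])
print(search("computer"))
```

Each call to `register` touches one index per extracted keyword, which is the per-keyword access pattern that the following paragraphs identify as the bottleneck once the index lives on disk.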
[0004]
[Problems to be solved by the invention]
According to "Information Search" (Nakahara, IEICE Publishing, 1974), pp. 120-128, secondary storage devices such as magnetic disks, which allow random access and are large in capacity and inexpensive, are generally used as the storage device for an index.
In the process of adding to an index stored on a magnetic disk, the index corresponding to each keyword to be added is accessed independently. That is, index addition accesses a plurality of indexes at scattered locations on the magnetic disk (hereinafter referred to as random access).
In a general operating system of a workstation, access to a magnetic disk is performed in units called logical blocks. Here, it is assumed that the size of the logical block is 8,192 bytes (hereinafter, abbreviated as 8 KB) used in the operating system.
However, writing to the magnetic disk may not be performed in units of 8 KB. For example, assuming that the size of the document number to be written in the index adding process is 4 bytes per one document number, when one document is registered, data smaller than a logical block and less than 8 KB is written. In such a case, the operating system temporarily reads the logical block to be written from the magnetic disk into a buffer area on the main memory of the operating system. Next, the logical block is updated by writing the data to be written to a predetermined location of the logical block in the buffer area. Thereafter, the logical block is written to the magnetic disk again. In this way, writing of less data than the logical blocks to the magnetic disk is realized.
[0005]
An example of the operation of the magnetic disk at this time will be described with reference to FIG. 4.
First, seek processing and rotation waiting processing of the head for positioning the head of the magnetic disk at the head position of the logical block to be read are performed, and then read processing of the logical block to be read is performed.
Thereafter, the logical block is updated (document number adding process), and a seek process and a rotation waiting process of the head for positioning the head of the magnetic disk again at the position of the logical block to be read are performed. Thereafter, a process of writing the logical block to the magnetic disk is performed.
[0006]
The processing times shown in this figure are those of a typical 3.5-inch magnetic disk: the average seek time is assumed to be about 14 ms, the average rotation waiting time about 17 ms, and the read time and write time per logical block about 4 ms each. In addition, the time to write a 4-byte document number into a logical block in the buffer area is assumed to be 0.001 ms.
In this example, writing data smaller than a logical block to the magnetic disk takes a total of 70 ms. The time chart in this figure shows that, while the read and write processing on the magnetic disk takes 8 ms, the seek and rotation waiting processing for positioning the head takes 62 ms, roughly eight times as long.
That is, while the raw read and write speed of the magnetic disk is 2 MB/s, the overall effective processing speed including the seek and rotation waiting times is only about 0.13 MB/s, so the read and write performance of the magnetic disk cannot be exploited.
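The 70 ms total and its split into positioning and transfer time follow directly from the assumed disk parameters. The short calculation below reproduces that arithmetic; the constant and variable names are hypothetical, and the values are the ones assumed in the description.

```python
# Disk parameters assumed in the description (typical 3.5-inch magnetic disk).
SEEK_MS = 14.0             # average seek time
ROTATE_MS = 17.0           # average rotation waiting time
TRANSFER_MS = 4.0          # read or write time for one 8 KB logical block
UPDATE_MS = 0.001          # writing one 4-byte document number in the buffer
LOGICAL_BLOCK_BYTES = 8192

# Read-modify-write of a single logical block: position + read, update, position + write.
positioning_ms = 2 * (SEEK_MS + ROTATE_MS)
transfer_ms = 2 * TRANSFER_MS
total_ms = positioning_ms + transfer_ms + UPDATE_MS

# Raw transfer rate implied by 8,192 bytes per 4 ms, in MB/s.
raw_rate_mb_s = LOGICAL_BLOCK_BYTES / (TRANSFER_MS / 1000) / 1e6

print(f"positioning: {positioning_ms} ms, transfer: {transfer_ms} ms")
print(f"total: {total_ms:.3f} ms, ratio: {positioning_ms / transfer_ms:.2f}")
print(f"raw transfer rate: {raw_rate_mb_s:.2f} MB/s")
```

The ratio of positioning time to transfer time comes out at 7.75, which the text rounds to "roughly eight times".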
[0007]
For this reason, in the index addition processing performed at document registration time, random access to the magnetic disk occurs once for each keyword that requires an addition, which lowers the effective processing speed of the magnetic disk and makes index addition slow. In other words, a document search method that uses an index has the problem that registering a document takes time.
An object of the present invention is to speed up the process of adding an index performed at the time of registration and reduce the registration time.
[0008]
[Means for Solving the Problems]
In order to solve the above problems, the present invention provides:
In a document search method that extracts keywords from documents, creates an index based on them, and at search time performs a search by referring to the index corresponding to the keyword that matches the query word, the index is stored in a file on a secondary storage device, the file is divided into a predetermined number of blocks, and keywords are assigned to the blocks so that the size of the index stored in each block is substantially equal. At document registration time, if the number of indexes in a block that require addition or update is equal to or greater than a predetermined number, the block is read from the secondary storage device into main memory, the index addition processing for the keywords corresponding to the indexes in the block is performed in main memory, and the updated block is stored back in the secondary storage device. If the number is less than the predetermined number, each such index is read from the secondary storage device into main memory, the index addition processing for the keyword corresponding to that index is performed in main memory, and the updated index is stored back in the secondary storage device.
In addition, when assigning keywords to blocks, the number of documents in which each keyword assigned to a block appears, among all the documents to be indexed, is calculated; the sum of these per-keyword document counts is calculated for each block; and keywords are assigned to blocks so that this sum is substantially equal across blocks.
Further, a document search apparatus that extracts keywords from documents, creates an index based on them, and at search time performs a search by referring to the index corresponding to the keyword that matches the query word comprises: means for storing the index in a file on a secondary storage device and dividing the file into a predetermined number of blocks; assigning means for assigning keywords to the blocks so that the size of the index stored in each block is substantially equal; means for determining, at document registration time, whether the number of indexes in a block that require addition or update is equal to or greater than a predetermined number; means for, when the number is equal to or greater than the predetermined number, reading the block from the secondary storage device into main memory, performing the index addition processing for the keywords corresponding to the indexes in the block in main memory, and storing the updated block in the secondary storage device; and means for, when the number is less than the predetermined number, reading the index from the secondary storage device into main memory, performing the index addition processing for the keyword corresponding to the index in main memory, and storing the updated index in the secondary storage device.
The assigning means calculates, for the keywords assigned to a block, the number of documents in which each keyword appears among all the documents to be indexed, calculates the sum of these per-keyword document counts for each block, and assigns keywords to blocks so that this sum is substantially equal across blocks.
[0009]
[Action]
With the above means, keywords can be extracted from all documents and assigned to blocks so that the index size of each of the predetermined number of blocks on the secondary storage device is substantially equal. At document registration time, for each block that stores indexes, if the number of relevant extracted keywords is equal to or greater than the predetermined number, the block is read from the magnetic disk into memory, the index addition processing for the relevant extracted keywords is performed in memory in a single batch, and the block is written back to the magnetic disk, which reduces the number of accesses to the magnetic disk. If the number is less than the predetermined number, only the indexes corresponding to the relevant extracted keywords are read into memory, the addition processing is performed on them, and they are written back to the magnetic disk, which minimizes the amount of data transferred to and from the magnetic disk. As a result, index addition can be performed very quickly, and high-speed registration into the document database can be realized.
[0010]
[Example]
First, the principle of the present invention will be described below.
As an initial setting, an area of a predetermined size is reserved on the magnetic disk for the index file that stores the index, and this area is divided into a predetermined number of blocks (not logical blocks). Next, the keywords corresponding to the indexes are assigned to the blocks so that the total capacity (index size) of the indexes stored in each block is approximately equal.
When a document is registered, first, a keyword is extracted from the registered document, and the keyword and the document number of the document in which the keyword appears are stored in the main memory.
Next, for each block, it is checked how many keywords assigned to the block have been extracted.
If the number is less than the predetermined number, then, as in the conventional method, the index corresponding to each keyword extracted for the block is read from the magnetic disk into main memory. Next, the document number of the document in which the keyword appears is appended to the end of the index read into main memory, and the index is stored again on the magnetic disk.
If the number is equal to or greater than the predetermined number, the block is first read from the magnetic disk into main memory. Next, only for those extracted keywords that are assigned to the block, the document number of the document in which the keyword appears is appended to the end of the index corresponding to that keyword in the block read into main memory. Thereafter, the block is stored again on the magnetic disk.
This series of processes is performed on all blocks to which the extracted keywords are assigned, thereby performing a process of adding the keywords extracted from the registered document to the index.
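The two registration paths can be sketched as follows. This is only an illustration under stated assumptions: `SimulatedIndexFile`, `register_document`, and the `threshold` parameter are hypothetical names, the "disk" is an in-memory dictionary that merely counts accesses, and every extracted keyword is assumed to have been assigned to a block already.

```python
class SimulatedIndexFile:
    """In-memory stand-in for the index file; it only counts disk accesses."""

    def __init__(self, blocks):
        self.blocks = blocks  # {block number: {keyword: [document numbers]}}
        self.reads = 0
        self.writes = 0

    def read_block(self, block_no):
        self.reads += 1
        return {kw: list(docs) for kw, docs in self.blocks[block_no].items()}

    def write_block(self, block_no, block):
        self.writes += 1
        self.blocks[block_no] = block

    def read_index(self, block_no, keyword):
        self.reads += 1
        return list(self.blocks[block_no][keyword])

    def write_index(self, block_no, keyword, docs):
        self.writes += 1
        self.blocks[block_no][keyword] = docs


def register_document(index_file, assignment, doc_number, keywords, threshold):
    """Add doc_number to the index of each keyword, block by block."""
    # Group the extracted keywords by the block they are assigned to.
    by_block = {}
    for keyword in keywords:
        by_block.setdefault(assignment[keyword], []).append(keyword)

    for block_no, block_keywords in by_block.items():
        if len(block_keywords) >= threshold:
            # Block-wise path: one read, all updates in memory, one write.
            block = index_file.read_block(block_no)
            for keyword in block_keywords:
                block[keyword].append(doc_number)
            index_file.write_block(block_no, block)
        else:
            # Keyword-wise path: read, update and write each index separately.
            for keyword in block_keywords:
                docs = index_file.read_index(block_no, keyword)
                docs.append(doc_number)
                index_file.write_index(block_no, keyword, docs)


index_file = SimulatedIndexFile({
    1: {"core": [1, 4], "disk": [4], "IR": [2]},
    2: {"computer": [1, 2, 4]},
})
assignment = {"core": 1, "disk": 1, "IR": 1, "computer": 2}
register_document(index_file, assignment, 5, ["core", "disk", "IR"], threshold=2)
print(index_file.blocks[1], index_file.reads, index_file.writes)
```

With three keywords falling in block 1 and a threshold of 2, the block-wise path is taken and the simulated file is read once and written once; with a threshold above 3 the same registration would cost three reads and three writes.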
[0011]
As described above, for a block that contains the predetermined number or more of the extracted keywords, multiple index additions can be accomplished simply by reading the block on the magnetic disk into main memory only once, performing the addition processing for the new keywords, and then writing the block back to the magnetic disk. Compared with the conventional method, in which the corresponding index is read from the magnetic disk for each keyword, the addition processing for that keyword is performed, and the index is then written to the magnetic disk, the time required for index addition can be reduced significantly.
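A rough cost model makes the trade-off between the two paths concrete. The sketch below uses the disk parameters assumed earlier (14 ms seek, 17 ms rotation wait, 4 ms per logical block) and the block size of eight logical blocks used in the description's worked example for FIG. 7. The function names are hypothetical, the in-memory update time is neglected, and the keyword-wise path is assumed to touch exactly one logical block per keyword. Deriving a break-even count this way is an illustration only; the description leaves the predetermined number to the user.

```python
SEEK_MS, ROTATE_MS, TRANSFER_MS = 14.0, 17.0, 4.0


def per_keyword_cost_ms(num_keywords):
    """Keyword-wise path: one positioned read and one positioned write per keyword."""
    return num_keywords * 2 * (SEEK_MS + ROTATE_MS + TRANSFER_MS)


def block_cost_ms(logical_blocks_per_block):
    """Block-wise path: one positioned read and one positioned write of the whole block."""
    return 2 * (SEEK_MS + ROTATE_MS + logical_blocks_per_block * TRANSFER_MS)


def break_even_keywords(logical_blocks_per_block):
    """Smallest keyword count at which the block-wise path is no slower."""
    count = 1
    while per_keyword_cost_ms(count) < block_cost_ms(logical_blocks_per_block):
        count += 1
    return count


print(block_cost_ms(8))           # whole-block update with 8 logical blocks
print(per_keyword_cost_ms(3))     # three separate keyword updates
print(break_even_keywords(8))
```

Under these assumptions a whole-block update costs 126 ms regardless of how many keywords it covers, against 70 ms per keyword on the keyword-wise path, so the block-wise path already pays off from the second keyword in the same block.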
[0012]
The principle described above will be further described using a specific example.
FIG. 5 shows an example of an index file used in this example.
This index file is obtained by dividing the index file shown in FIG. 2 into two blocks, block 1 and block 2, and storing them on a magnetic disk. The keywords "core", "disk" and "IR" are allocated to block 1 and the keyword "computer" is allocated to block 2 so that the sum of the sizes of the indexes contained in each block is approximately equal, and each index is stored in its corresponding block.
[0013]
This index is subjected to a search process in the same manner as in the related art without being aware of blocks when searching. Only when “core”, “disk”, “computer”, and “IR” are specified as query words, the document number of the document in which each keyword appears is read from the index file.
That is, document numbers are output as search results: 1 and 4 when the query word is "core", 4 when the query word is "disk", 1, 2, and 4 when the query word is "computer", and 2 when the query word is "IR".
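As a small illustration of this lookup, the index of the example can be modeled as a mapping from keyword to document numbers. The names `index` and `search` are assumptions made for this sketch. The block layout plays no part in the search itself.

```python
# Index contents taken from the example in the description.
index = {
    "core": [1, 4],
    "disk": [4],
    "computer": [1, 2, 4],
    "IR": [2],
}

def search(query_word):
    """Return the document numbers for a query word, or an empty list
    when the query word matches no keyword in the index."""
    return index.get(query_word, [])

print(search("computer"))
print(search("IR"))
```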
[0014]
When registering a new document in the database, processing is performed in consideration of the block as shown in FIG. Hereinafter, the procedure will be described in detail.
First, it is assumed that “core”, “disk”, and “IR” have been extracted as keywords from a document to be registered, and a document number 5 is assumed as the document number of this document.
As for the block assignment of the keywords, “core”, “disk” and “IR” are assigned to block 1 and “computer” is assigned to block 2. The keywords extracted from document 5, namely “core”, “disk” and “IR”, all correspond to keywords assigned to block 1.
Therefore, first, block 1 is read into the main memory. Next, 5 which is the document number of the document in which these keywords appear is added to the end of the index corresponding to the keywords "core", "disk" and "IR" stored in block 1, respectively. Thereafter, block 1 stored in the main memory is written to an index file on the magnetic disk.
As for block 2, since no keyword corresponding to the keyword assigned in block 2 has been extracted from document 5, no index addition processing is performed. Thus, the document registration process is performed.
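The registration of document 5 can be traced with a toy model of the index file. This is only a sketch: `SimulatedDisk` and `add_document_blockwise` are hypothetical names, and the "disk" is an in-memory dictionary that merely counts block reads and writes.

```python
import copy

class SimulatedDisk:
    """Toy stand-in for the index file: holds blocks and counts accesses."""
    def __init__(self, blocks):
        self.blocks = blocks
        self.reads = 0
        self.writes = 0

    def read_block(self, block_no):
        self.reads += 1
        return copy.deepcopy(self.blocks[block_no])

    def write_block(self, block_no, data):
        self.writes += 1
        self.blocks[block_no] = data

def add_document_blockwise(disk, block_no, keywords, doc_no):
    """Read a block once, append doc_no to each keyword's index, write it back."""
    block = disk.read_block(block_no)
    for keyword in keywords:
        block[keyword].append(doc_no)
    disk.write_block(block_no, block)

disk = SimulatedDisk({
    1: {"core": [1, 4], "disk": [4], "IR": [2]},
    2: {"computer": [1, 2, 4]},
})
add_document_blockwise(disk, 1, ["core", "disk", "IR"], 5)
print(disk.blocks[1], disk.reads, disk.writes)
```

Three indexes are updated with one block read and one block write, and block 2 is left untouched.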
[0015]
FIG. 7 shows a time chart of the index addition processing of the block 1.
In this example, a general 3.5-inch magnetic disk is used as the magnetic disk, the logical block size is 8 KB, as used in general workstation operating systems, and the size of a block for storing an index is assumed to be 8 logical blocks.
As the seek time and the rotation waiting time of the magnetic disk, an average seek time and an average rotation waiting time are assumed. Further, the average seek time, the average rotation waiting time, the read time of one logical block, and the write time of one logical block in the magnetic disk are assumed to be about 14 ms, about 17 ms, about 4 ms, and about 4 ms, respectively. Using the above values, the flow of the time chart of this figure will be described.
In this example, first, when reading the block 1 to the buffer area, access to the magnetic disk occurs, and the head of the magnetic disk is positioned at the head of the block 1 by the seek processing and the rotation waiting processing. During this time, an average seek time of 14 ms and an average rotation waiting time of 17 ms are spent.
Next, the logical blocks constituting block 1, in this case eight logical blocks, are read into the buffer area. At this time, 32 ms, which is eight times the read time for one logical block, is consumed.
Next, the document number 5 of the document in which these keywords appear is added to the end of each of the indexes corresponding to the keywords “core”, “disk” and “IR” stored in block 1 read into the buffer area. The document number is therefore added to block 1 three times. The time for writing a 4-byte document number to one index in the buffer area is assumed to be 0.001 ms. Adding the document number three times thus takes 0.003 ms, which is negligibly small compared with the other processing times.
After that, block 1, on which the index addition processing has been performed, is written to the magnetic disk. The head of the magnetic disk is again positioned by the seek processing and the rotation waiting processing, which takes an average seek time of 14 ms and an average rotation waiting time of 17 ms. The eight logical blocks that make up block 1 are then written to the magnetic disk, consuming 32 ms, eight times the write time for one logical block.
With the above processing, a total of 126 ms is spent for adding the index in this example. This is an average of 42 ms per keyword.
When the addition processing is performed on the index on the magnetic disk keyword by keyword as in the prior art, it takes 70 ms per keyword, as shown in FIG. 4 described above. By performing the index addition processing collectively in block units as in the present invention, the time per keyword falls to 42 ms, so the registration time of the document can be reduced by about 40%.
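The arithmetic of this time chart can be checked with a short calculation. The sketch below uses the access times assumed in the description and, like the text, neglects the 0.001 ms in-memory additions. The function names and the break-even calculation at the end are additions for illustration and hold only under these particular figures.

```python
SEEK_MS = 14       # average seek time
ROTATION_MS = 17   # average rotation waiting time
TRANSFER_MS = 4    # read or write time for one logical block

def access_ms(logical_blocks):
    """One positioning of the head plus the transfer of logical_blocks."""
    return SEEK_MS + ROTATION_MS + logical_blocks * TRANSFER_MS

def block_update_ms(logical_blocks=8):
    """Read a whole block, then write it back."""
    return 2 * access_ms(logical_blocks)

def keyword_update_ms(num_keywords):
    """Read and write one logical block for every keyword."""
    return num_keywords * 2 * access_ms(1)

keywords = 3
block_total = block_update_ms()               # 126
per_keyword_block = block_total / keywords    # 42.0
per_keyword_prior = keyword_update_ms(1)      # 70
saving = 1 - per_keyword_block / per_keyword_prior

# Smallest number of keywords for which the block update is cheaper.
break_even = next(k for k in range(1, 100) if keyword_update_ms(k) > block_total)
print(block_total, per_keyword_block, per_keyword_prior, round(saving, 2), break_even)
```

With these figures the block update already wins at two keywords per block, which gives a feel for how the threshold might be chosen for a given disk.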
When a general document is registered, the number of registered keywords is several tens of times that of the present example, so that the effect of the process of adding an index in block units according to the present invention is further enhanced.
[0016]
As described above, according to the present invention, by performing an index addition process on a block including a predetermined number or more of additional registration keywords at a time, a block on a magnetic disk can be read only once into main memory. Since a process of adding a plurality of indexes can be executed, a high-speed registration process to the document database can be realized.
[0017]
Hereinafter, examples of the present invention will be described.
The configuration of a document search system to which the present invention is applied will be described with reference to FIG.
This system includes a display 101, a keyboard 102, a CPU 103, a memory 104, a magnetic disk 105, and a floppy disk drive (FDD) 106.
The display 101, keyboard 102, memory 104, magnetic disk 105, and FDD 106 are accessed by the CPU 103 via a bus. The magnetic disk 105 stores an index file 8000.
The memory 104 is loaded with a system control program 5000, a search interface program 6000, a registration control program 2000, a search control program 3000, a keyword assignment program 2100, an index creation registration program 2200, and an index search program 3100, and a work area 4000 is secured.
Documents registered in the document database of the document search system are stored on the floppy disk 107 and accessed by the CPU 103 via the FDD 106.
[0018]
In this system, the system control program 5000 is started by the CPU 103 when the power is turned on, and the registration control program 2000 and the search control program 3000 are started under the control of the system control program 5000.
First, an outline of a document registration process in the present system having such a configuration will be described.
The system control program 5000 activates the registration control program 2000 in accordance with an instruction input from the keyboard 102 by the user.
In the registration control program 2000, before registering a document, the keyword assignment program 2100 is started according to an instruction input by the user from the keyboard 102, and an index file is initialized.
First, in accordance with an instruction input from the keyboard 102 by the user, an area for the index file 8000 that stores the index is secured on the magnetic disk 105 and divided into the specified number of blocks.
Then, a predetermined number of documents (hereinafter referred to as seed documents) used for assigning keywords to the blocks are read from the floppy disk 107 into the work area 4000 of the memory 104 via the FDD 106. For example, when creating a database of 100,000 newspaper articles, several hundred to several thousand documents of the same type, i.e., newspaper articles, are registered as seed documents.
Next, words required for the search are extracted as keywords from the seed documents read into the work area 4000, and the number of appearing documents of the keywords is calculated.
Keywords are assigned to the respective blocks so that the sum of the number of documents in which the keywords appear is substantially equal between the respective blocks. Thereafter, the registration control program 2000 starts the index creation registration program 2200.
[0019]
The index creation registration program 2200 reads the document to be registered stored on the floppy disk 107 into the work area 4000 of the memory 104 via the FDD 106 in accordance with an instruction input by the user from the keyboard 102.
The words required for the search are extracted as keywords from the registered document, and each keyword together with the document number, or the document number alone, is registered in the corresponding block of the index file 8000.
[0020]
Next, an outline of a document search operation in the present system will be described. The system control program 5000 activates the search control program 3000 and the search interface program 6000 in accordance with an instruction input from the keyboard 102 by the user.
Thereafter, the query word input by the user from the keyboard 102 is input to the search interface program 6000 and sent to the search control program 3000.
The search control program 3000 activates the index search program 3100 and sends the query to the program.
The index search program 3100 reads out the document number from the index corresponding to the received query word and sends it to the search control program 3000 as a search result.
This search result is sent to the search interface program 6000 and displayed on the display 101 as a search result document number.
[0021]
Next, the configuration of the keyword assignment program 2100 and keyword assignment processing will be described with reference to FIG.
The keyword assignment program 2100 includes an index division step 2105, a repetition step 2110 for the number of seed documents, a seed document reading step 2120, a keyword extraction step 2130, an appearance document count step 2140, and a keyword assignment step 2150.
First, in the index division step 2105, the number of blocks specified by the user is read from the keyboard 102. Next, an area for the specified number of blocks is secured as the index file 8000, and this is equally divided into the specified number of blocks.
Next, in a seed document reading step 2120, one seed document is read via the FDD 106 and stored in the work area 4000.
Further, in a keyword extraction step 2130, a word to be a keyword is extracted from the read seed document, and the extracted keyword is stored in the work area 4000.
Techniques for extracting keywords from documents are described in “Trends in Automatic Indexing Research” (Morohashi, IPSJ Journal, Vol. 25, No. 9, 1984) for Japanese documents, and in “DOCUMENT DATABASE” (G. James, Van Nostrand Reinhold Co., 1985) pp. 87-94. In the present embodiment, these keyword extraction techniques are used as they are.
In the appearing document number counting step 2140, the number of documents in which the extracted keyword appears is counted, and stored in the work area 4000 in association with the keyword.
In the repetition step 2110 for the number of seed documents, a series of processing from step 2120 to step 2140 is repeated for all seed documents.
After all seed documents have been processed, in the keyword assignment step 2150 the extracted keywords are assigned to the blocks so that the sum of the numbers of appearing documents becomes substantially equal among the blocks, and the block number is stored in the work area 4000 in association with each keyword.
[0022]
As described above, by calculating the number of appearing documents of each keyword extracted from the seed documents, the index size of the document database can be predicted. This is because the size of the index corresponding to a keyword in the seed documents is the product of the keyword's number of appearing documents in the seed documents and the size of a document number, and the index size in the document database can be predicted by multiplying this product by the ratio of the number of documents registered in the document database to the number of seed documents (sampled documents).
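The prediction can be written out as a short calculation. This is a sketch under stated assumptions: the function name and the sample figures (a keyword appearing in 30 of 1,000 seed documents, 100,000 registered documents) are hypothetical, and the 4-byte document number is the size used in the time chart example.

```python
def predicted_index_size(seed_doc_count, doc_number_bytes, registered_docs, seed_docs):
    """Scale the index size observed in the seed documents up to the full database.

    seed_doc_count  : number of seed documents in which the keyword appears
    doc_number_bytes: size of one stored document number
    registered_docs : number of documents registered in the database
    seed_docs       : number of seed (sampled) documents
    """
    seed_index_bytes = seed_doc_count * doc_number_bytes
    return seed_index_bytes * registered_docs / seed_docs

# Hypothetical figures for illustration only.
size = predicted_index_size(seed_doc_count=30, doc_number_bytes=4,
                            registered_docs=100_000, seed_docs=1_000)
print(size)
```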
When the number of documents registered in the document database increases and the sum of the sizes of the indexes allocated to each block is no longer substantially equal among the blocks, seed document candidates amounting to several percent of the total number of documents are extracted from all the documents registered in the document database, using a method such as random sampling. Keywords are then assigned to blocks again based on these seed documents.
By re-registering all the documents registered in the document database based on this keyword assignment, the sum of the sizes of the indexes assigned to each block can be made substantially equal among the blocks.
As another method, for a block in which the sum of the sizes of the assigned indexes is larger than in the other blocks, the keywords assigned to that block are reassigned to smaller blocks, so that the sum of the index sizes can be made substantially equal among the blocks.
As an index for assigning a keyword to a block, it is also possible to use the number of appearances of the keyword itself or the number of types of keywords in addition to the number of document appearances of the keyword.
By performing the above processing, keywords can be extracted from all the seed documents, and the keywords can be assigned to the blocks so that the index sizes of the blocks are substantially equal.
[0023]
Further, the keyword assignment processing in the keyword assignment step 2150 will be described in detail with reference to FIG.
The keyword assignment step 2150 includes an extracted keyword sorting step 2152, a block number initial setting step 2153, an extracted keyword repetition step 2154, a block repetition step 2155, a block size determination step 2156, a keyword setting step 2157, a block number counting step 2158, and a jump step 2159.
First, in the extracted keyword sorting step 2152, the extracted keywords stored in the work area 4000 are sorted in descending order of the number of documents of the extracted keywords stored corresponding to the extracted keywords.
Next, in the block number initial setting step 2153, 1 which is the first block number is set as the block number to be processed first.
Further, in the block size determination step 2156, assuming that the extracted keyword is allocated to the block, the sum of the sizes of the indexes allocated to this block is calculated, and it is determined whether or not the size exceeds a predetermined block size.
Only when it does not exceed, first, a keyword setting step 2157 is executed, and an extracted keyword is assigned to this block.
Next, a block number counting step 2158 is executed, 1 is added to the block number of the block to be processed, and the block number of the block to be processed next is set.
In this counting up, when the block number has been counted up to the last number, it returns to the first block number of 1.
Further, the repetition processing in the block repetition step 2155 is terminated by the jump step 2159, and the steps after L1 are executed. Here, the extracted keyword repetition step 2154 is executed.
In block repetition step 2155, the processing from step 2156 to step 2159 is repeated for all blocks in order from the block number.
In the extracted keyword repetition step 2154, the processing from step 2155 to step 2159 is repeated for all the extracted keywords.
Through the above series of processing, the extracted keywords are allocated to the blocks one after another, starting with those that have the largest number of appearing documents, so the index size can be distributed almost equally among the blocks.
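One possible reading of steps 2152 to 2159 is sketched below: keywords are sorted by descending document count and offered to the blocks in rotation, skipping any block that would exceed its capacity. The function name, the keyword names k1 to k6, the counts, the 32-byte capacity, and the error raised when no block has room are all assumptions. The description does not say what happens in that last case.

```python
def assign_keywords(doc_counts, num_blocks, block_capacity, doc_number_bytes=4):
    """Assign keywords to blocks 1..num_blocks; return (assignment, bytes used)."""
    # Step 2152: sort keywords in descending order of appearing documents.
    ordered = sorted(doc_counts, key=doc_counts.get, reverse=True)
    used = [0] * num_blocks
    assignment = {}
    current = 0  # Step 2153: start with the first block (index 0 is block 1).
    for keyword in ordered:                          # Step 2154
        size = doc_counts[keyword] * doc_number_bytes
        for offset in range(num_blocks):             # Step 2155
            b = (current + offset) % num_blocks
            if used[b] + size <= block_capacity:     # Step 2156
                assignment[keyword] = b + 1          # Step 2157
                used[b] += size
                current = (b + 1) % num_blocks       # Step 2158, wrapping to block 1
                break                                # Step 2159
        else:
            raise ValueError(f"no block has room for {keyword!r}")
    return assignment, used

# Hypothetical document counts for six keywords.
doc_counts = {"k1": 5, "k2": 4, "k3": 3, "k4": 2, "k5": 1, "k6": 1}
assignment, used = assign_keywords(doc_counts, num_blocks=2, block_capacity=32)
print(assignment, used)
```

In this run the capacity check in step 2156 sends k5 and k6 to block 2 because block 1 is already full, and both blocks end up holding 32 bytes.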
[0024]
Next, the configuration of the index creation registration program 2200 and the document registration process will be described with reference to FIG.
The index creation registration program 2200 includes a document number acquisition step 2205, a document number repetition step 2210, a registered document reading step 2220, a keyword extraction step 2230, a block number correspondence step 2240, a document number counting step 2245, a block number repetition step 2250, an extracted keyword number determination step 2260, a block unit index addition step 2270, a repetition step 2280 for the number of keywords, and a keyword unit index addition step 2290.
First, in the document number acquisition step 2205, the first document number and the number of registered documents of the registered document input by the user from the keyboard 102 are read.
Next, in a registered document reading step 2220, one document to be registered is read via the FDD 106 and stored in the work area 4000.
After that, in a keyword extraction step 2230, a keyword word is extracted from the read registered document, as in the keyword extraction step 2130 in the keyword assignment program 2100. This extracted keyword is stored in work area 4000.
In the block number correspondence step 2240, the block number of the block to which each extracted keyword is assigned is obtained by looking up the block number stored in the work area 4000 in association with the keyword in the keyword assignment step 2150, and this block number is stored in the work area 4000 in association with the extracted keyword. If the extracted keyword is not assigned to any block, the keyword is assigned to the block whose stored indexes have the smallest total size, and that block number is stored in the work area 4000 in association with the extracted keyword.
Thereafter, in a document number counting step 2245, the document number is incremented to prepare for the processing of the next registered document.
In the repetition step 2210 for the number of documents, the keyword extraction processing of steps 2220 to 2245 is repeated for the number of registered documents.
After that, in the extracted keyword number determination step 2260, the number of extracted keywords that are assigned to block 1 is counted, and it is determined whether or not this number is equal to or greater than a predetermined number N.
The user designates an optimal value for the predetermined number N in consideration of the performance of the used magnetic disk, the used computer, and the like.
If the count number is equal to or greater than the predetermined number N, the next block-based index addition process 2270 is executed. In the block unit index addition processing 2270, the block 1 is read into the work area 4000, and the index addition processing is performed collectively for all the extracted keywords assigned to the block 1.
If the count is less than the predetermined number N, a keyword unit index addition step 2290 is executed. In the keyword unit index addition step 2290, only the index corresponding to an extracted keyword in block 1 stored in the index file 8000 is read into the work area 4000, the document number of the document in which that extracted keyword appears is added to the index, and the index is written back to the index file 8000.
In addition, in step 2280, which is repeated by the number of keywords, keyword-based index addition step 2290 is repeatedly executed until index addition processing is completed for all extracted keywords assigned to block 1.
In step 2250, which is repeated as many times as the number of blocks, steps 2260 to 2290 are repeatedly executed for all blocks to perform index addition processing.
In this embodiment, additional registration of a new document is realized in this manner.
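The block number correspondence step 2240, including its fallback for a keyword that has not yet been assigned to any block, and the grouping of extracted keywords by block before the per-block loop might look as follows. The names `block_for` and `group_by_block`, the keyword "tape", and the block sizes are hypothetical. For brevity the sketch does not update the block sizes after a fallback assignment.

```python
from collections import defaultdict

def block_for(keyword, assignment, block_sizes):
    """Return the keyword's block, assigning an unknown keyword to the block
    whose stored indexes currently have the smallest total size."""
    if keyword not in assignment:
        assignment[keyword] = min(block_sizes, key=block_sizes.get)
    return assignment[keyword]

def group_by_block(extracted, assignment, block_sizes):
    """extracted is a list of (keyword, document number) pairs gathered over
    all documents being registered; the result maps block number to pairs."""
    groups = defaultdict(list)
    for keyword, doc_no in extracted:
        groups[block_for(keyword, assignment, block_sizes)].append((keyword, doc_no))
    return dict(groups)

assignment = {"core": 1, "disk": 1, "IR": 1, "computer": 2}
block_sizes = {1: 16, 2: 12}  # hypothetical index bytes currently in each block
extracted = [("core", 5), ("IR", 5), ("tape", 6), ("core", 6)]
groups = group_by_block(extracted, assignment, block_sizes)
print(groups)
```

The length of each group is the count that would then be compared with the predetermined number N for that block.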
[0025]
Further, the above-described block unit index addition processing step 2270 will be described in detail with reference to FIG.
The block unit index addition processing step 2270 includes a block reading step 2272, a keyword number repetition step 2274, an intra-block index addition step 2276, and a block storage step 2278.
First, in the block reading step 2272, the blocks designated by the repetition step 2250 for the number of blocks are read from the index file 8000 and stored in the work area 4000.
Next, in the in-block index addition step 2276, for each extracted keyword assigned to the update target block read into the work area 4000 in the block reading step 2272, the document number of the document in which the keyword appeared is added to the end of the index corresponding to that keyword.
Further, in the keyword number repetition step 2274, the index addition processing of step 2276 is repeatedly performed for all the keywords in the read block.
After that, in the block storage step 2278, the above-mentioned block whose index addition processing has been completed is stored in the index file 8000 again.
As described above, in the block unit index addition processing step 2270, the index addition processing is performed in block units.
As described above, by performing the keyword index addition processing collectively in block units, even when the number of additional keywords is large, the index addition processing can be performed in a short time.
[0026]
An example of storing information on keywords stored in the work area by the above-described keyword assignment program 2100 will be described with reference to FIG.
In keyword extraction step 2130, words serving as keywords are extracted from the seed document and stored in work area 4000. As shown in the figure, the keyword number is stored together with the keyword.
In this example, it is assumed that "core", "disk", "computer", and "IR" are extracted from the seed document as the extracted keywords.
In this step, a serial number is assigned as the keyword number in the order in which these extracted keywords were extracted, and stored in the work area 4000 in the form as in this example. In this example, the keywords “core”, “disk”, “computer” and “IR” are assigned keyword numbers of 1, 2, 3 and 4, respectively.
Next, in an appearing document number counting step 2140, the number of appearing documents in the seed document is counted for each keyword. At this time, the number of appearing documents is stored in a form corresponding to the keyword as shown in FIG. In this example, the numbers of documents appearing in the keywords “core”, “disk”, “computer” and “IR” are 2, 1, 3 and 2, respectively.
Thereafter, in a keyword assignment step 2150, keywords are assigned to blocks storing indexes based on the number of appearing documents. The block number assigned here is associated with a keyword as shown in the figure and stored in the work area 4000.
In this example, the block numbers of the blocks to which the keywords “core”, “disk”, “computer”, and “IR” are assigned are 1, 1, 2, and 1, respectively.
By storing information on keywords in the work area 4000 in such a format, it is possible to manage information on keywords extracted from the seed document, so that keyword allocation processing to blocks can be realized.
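The keyword information described here can be pictured as a small table with one row per keyword. The sketch below uses the values given in the example. The class and field names are assumptions, since the description does not name the columns.

```python
from dataclasses import dataclass

@dataclass
class KeywordEntry:
    keyword_number: int  # serial number in order of extraction
    keyword: str
    doc_count: int       # number of seed documents in which the keyword appears
    block_number: int    # block to which the keyword's index is assigned

keyword_table = [
    KeywordEntry(1, "core", 2, 1),
    KeywordEntry(2, "disk", 1, 1),
    KeywordEntry(3, "computer", 3, 2),
    KeywordEntry(4, "IR", 2, 1),
]

# Lookup used during registration: keyword -> assigned block number.
block_of = {entry.keyword: entry.block_number for entry in keyword_table}

# Sum of appearing-document counts per block.
docs_per_block = {}
for entry in keyword_table:
    docs_per_block[entry.block_number] = (
        docs_per_block.get(entry.block_number, 0) + entry.doc_count
    )
print(block_of, docs_per_block)
```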
[0027]
An example of storing information on keywords stored in the work area by the above-described index creation registration program 2200 will be described with reference to FIG. In this example, it is assumed that "core", "disk", and "IR" are extracted as keywords from the registered document of document number 5.
In keyword extraction step 2230, words serving as keywords are extracted from the registered document and stored in work area 4000. At this time, the keyword number corresponding to each extracted keyword and the block number of its assigned block are obtained from the keyword information shown in FIG. As shown in this figure, the acquired block number and the document number of the registered document in which the keyword appears are stored in association with the keyword.
In this example, 1, 2, 3, and 4 are stored as the keyword numbers of the keywords “core”, “disk”, “computer”, and “IR”, respectively, and block number 1 and document number 5 are stored as the assigned block number and the registered document number.
Based on this information, an index addition process is performed in steps 2250 to 2290.
By storing the information on the keywords in the work area 4000 in such a format, the information on the keywords extracted from the registered document can be managed, so that a process of adding an index corresponding to the information can be realized.
[0028]
As described above, according to the present invention, for each block that stores indexes, if the number of relevant extracted keywords is equal to or greater than a predetermined number, the block is read from the magnetic disk into memory, the index addition processing for the relevant extracted keywords is performed collectively in memory, and the block is written back to the magnetic disk, which reduces the number of accesses to the magnetic disk. If the number is less than the predetermined number, only the indexes corresponding to the relevant extracted keywords are read into memory, the index addition processing is performed, and the indexes are written back to the magnetic disk, which minimizes the amount of data transferred to and from the magnetic disk. High-speed registration processing can thus be realized.
[0029]
[Effect of the Invention]
According to the present invention, by executing the addition processing for a plurality of indexes in memory at one time, the number of accesses to the magnetic disk can be reduced, so that the index addition processing, and hence the registration processing to the document database, can be performed at high speed.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a document search system to which the present invention has been applied.
FIG. 2 is a diagram illustrating a configuration example of an index.
FIG. 3 is a diagram illustrating a configuration example of an index when a keyword included in a new document is added.
FIG. 4 is a diagram showing a time chart at the time of index addition processing.
FIG. 5 is a diagram showing a configuration example of an index divided into blocks.
FIG. 6 is a diagram illustrating a configuration example of an index divided into blocks when a keyword included in a new document is added.
FIG. 7 is a diagram showing a time chart at the time of index addition processing when eight consecutive reading processing and eight consecutive writing processing of a logical block are performed.
FIG. 8 is a diagram showing a processing procedure of a keyword assignment program 2100.
FIG. 9 is a diagram showing a processing procedure of a keyword assignment step 2150.
FIG. 10 is a diagram showing a processing procedure of an index creation registration program 2200.
FIG. 11 is a diagram showing a processing procedure of an index addition processing step 2270 in block units.
FIG. 12 is a diagram illustrating a configuration example of a keyword management table.
FIG. 13 is a diagram illustrating a configuration example of a work table.
[Explanation of symbols]
101 Display
102 keyboard
103 CPU
104 memory
105 magnetic disk
106 FDD
107 floppy disk
2000 Registration Control Program
3000 search control program
4000 work area
5000 System control program
6000 search interface program
8000 index file

Claims

A document search method for extracting a keyword from a document, creating an index including document identification information of the document in which the extracted keyword appears, and performing a search by referring to an index corresponding to the keyword that matches the query word at the time of search.
The index storage area on the secondary storage device for storing the index is divided into a predetermined number of blocks, keywords are assigned to the blocks so that the sum of the capacities of the indexes stored in each block is substantially equal across the blocks, and each index is stored in its corresponding block. When a document is registered, if the number of indexes to be added or updated in a block is equal to or greater than a predetermined number, the block is read from the secondary storage device into main memory, index addition processing is performed in main memory for the keywords corresponding to the indexes in the block, and the block subjected to the addition processing is stored in the secondary storage device; if the number is less than the predetermined number, the index is read from the secondary storage device into main memory, index addition processing is performed in main memory for the keyword corresponding to that index, and the index subjected to the addition processing is stored in the secondary storage device.

The document search method according to claim 1,
When assigning keywords to blocks, the number of appearing documents of each keyword among all the documents to be indexed is calculated, the sum of the numbers of appearing documents of the keywords assigned to each block is calculated, and keywords are assigned to blocks so that this sum is substantially equal across the blocks.

A document search apparatus that extracts a keyword from a document, creates an index including document identification information of the document in which the extracted keyword appears, and performs a search by referring to an index corresponding to the keyword that matches the query word at the time of search.
Means for dividing the index storage area on the secondary storage device for storing the index into a predetermined number of blocks; allocating means for assigning keywords to the blocks so that the sum of the capacities of the indexes stored in each block is substantially equal across the blocks, and for storing each index in its corresponding block;
Means for determining, when a document is registered, whether the number of indexes to be added or updated in a block is equal to or greater than a predetermined number; and means for, when the number is equal to or greater than the predetermined number, reading the block from the secondary storage device into main memory, performing index addition processing in main memory for the keywords corresponding to the indexes in the block, and storing the block subjected to the addition processing in the secondary storage device, and, when the number is less than the predetermined number, reading the index from the secondary storage device into main memory, performing index addition processing in main memory for the keyword corresponding to that index, and storing the index subjected to the addition processing in the secondary storage device.

The document search device according to claim 3,
The assigning means calculates the number of appearing documents of each keyword among all the documents to be indexed, calculates the sum of the numbers of appearing documents of the keywords assigned to each block, and assigns keywords to blocks so that this sum is substantially equal across the blocks.