JPH08147328A

JPH08147328A - Method and device for retrieving document

Info

Publication number: JPH08147328A
Application number: JP6305575A
Authority: JP
Inventors: Hisamitsu Kawaguchi; 川口　　久光; Natsuko Mizutani; 奈津子水谷; Atsushi Hatakeyama; 敦畠山; Katsumi Tada; 勝己多田; Kanji Kato; 寛次加藤; Satoshi Asakawa; 悟志浅川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1994-11-15
Filing date: 1994-11-15
Publication date: 1996-06-07
Anticipated expiration: 2019-10-20
Also published as: JP3578501B2

Abstract

PURPOSE: To speed up registration in a document database by performing the addition processing for multiple indexes collectively in memory. CONSTITUTION: Keywords are extracted from all documents and assigned to blocks so that the index sizes of a prescribed number of blocks in an index file 8000 are equalized. When a document is registered, for each block storing indexes, if the number of extracted keywords belonging to the block is at least a prescribed number, the block is read from a magnetic disk 105 into a memory 104, the index addition for those keywords is performed collectively in the memory 4000, and the block is written back to the disk 105, which reduces the number of accesses to the disk 105. If the number is smaller than the prescribed number, only the indexes corresponding to those keywords are read into the memory 4000, the index addition is performed, and the indexes are written to the disk 105, which minimizes the amount of data accessed.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document retrieval method and apparatus that use an index, and is applicable to databases, document filing systems, desktop publishing (DTP) systems, and the like.

[0002]

[Prior Art] In information processing systems, one important task is to find, among documents consisting of character string data stored in a database, all documents that contain a specific word the searcher is looking for, that is, a query word. Index-based retrieval is a well-known method for this kind of document search. The method is described in detail in "Information Retrieval" (Nakahara, Institute of Electronics, Information and Communication Engineers, 1974), pp. 203-207 (hereinafter, known example 1) and in "DOCUMENT DATABASE" (G. James, Van Nostrand Reinhold Co., 1985), pp. 87-94. The indexes discussed there consist of the document numbers of the documents in which a keyword appears. With these index retrieval methods, simply referring to the index of the keyword that matches the query word identifies the documents containing that keyword, so high-speed retrieval is possible.

[0003] FIG. 2 shows an example of the index described in known example 1. For each keyword extracted from the documents, the index stores a keyword number and the document numbers of the documents in which the keyword appears. In this example, indexes for the keywords "core", "disk", "computer", and "IR" are assumed to have been created and stored in a file on a magnetic disk. During retrieval, the document numbers of the documents in which a keyword appears are read from the file storing the indexes (hereinafter, the index file) only when "core", "disk", "computer", or "IR" is specified as the query word. That is, the retrieval result is document numbers 1 and 4 for the query word "core", document number 4 for "disk", document numbers 1, 2, and 4 for "computer", and document number 2 for "IR". When a new document is registered in the database, the keywords that appear in the document are extracted, and the document number of that document is added to the index corresponding to each of those keywords. Techniques for extracting keywords from documents in this way are described in "Trends in Automatic Indexing Research" (Morohashi, Journal of the Information Processing Society of Japan, Vol. 25, No. 9, 1984) and in "DOCUMENT DATABASE" (G. James, Van Nostrand Reinhold Co., 1985), pp. 87-94. FIG. 3 shows an example of index addition for keywords extracted with these techniques. In this example, the document to be registered has document number 5, and "core", "computer", and "IR" are assumed to have been extracted from it. Document number 5 is added to each of the indexes corresponding to "core", "computer", and "IR". Document registration is carried out by adding to the indexes of the extracted keywords in this way.
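The index lookup and addition described above can be sketched in a few lines. This is only an illustrative in-memory model, assuming a plain dictionary in place of the index file on the magnetic disk; the function names `search` and `register` are hypothetical and do not come from the source. The data are the FIG. 2 index and the FIG. 3 registration of document 5.

```python
# Index of FIG. 2: keyword -> document numbers of the documents containing it.
# A dict stands in for the index file (an assumption made for illustration).
index = {
    "core": [1, 4],
    "disk": [4],
    "computer": [1, 2, 4],
    "IR": [2],
}


def search(index, query_word):
    """Return the document numbers for a query word, or [] if it is not a keyword."""
    return list(index.get(query_word, []))


def register(index, doc_number, keywords):
    """Append doc_number to the index of every keyword extracted from the document."""
    for keyword in keywords:
        index.setdefault(keyword, []).append(doc_number)


print(search(index, "computer"))

# FIG. 3: document 5 yields the keywords "core", "computer", and "IR".
register(index, 5, ["core", "computer", "IR"])
for keyword, doc_numbers in index.items():
    print(keyword, doc_numbers)
```

Each keyword's index is touched separately in `register`, which is the access pattern that becomes expensive once the indexes live on a magnetic disk.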

[0004]

[Problems to Be Solved by the Invention] According to "Information Retrieval" (Nakahara, Institute of Electronics, Information and Communication Engineers, 1974), pp. 120-128, indexes are generally stored on a secondary storage device such as a magnetic disk, which allows random access and is large in capacity and inexpensive. In index addition on a magnetic disk, the index corresponding to each keyword to be updated is accessed independently. In other words, index addition accesses multiple indexes scattered across the magnetic disk (hereinafter, random access). In a typical workstation operating system, the magnetic disk is accessed in units called logical blocks. Here the logical block size is assumed to be 8,192 bytes (hereinafter, 8 KB), the size used in such operating systems. However, writes to the magnetic disk do not always come in 8 KB units. For example, if each document number written during index addition is assumed to be 4 bytes, registering one document writes less than 8 KB, which is smaller than a logical block. In that case, the operating system first reads the logical block to be written from the magnetic disk into a buffer area in its main memory. It then updates the logical block by writing the new data at the appropriate location within the logical block in the buffer area. Finally, it writes the logical block back to the magnetic disk. This is how data smaller than a logical block is written to the magnetic disk.
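The read-modify-write sequence for a write smaller than a logical block might look like the following sketch. It is a simulation under stated assumptions: a `bytearray` stands in for the magnetic disk, the helper `write_small` and its `log` argument are hypothetical, and the little-endian 4-byte encoding of the document number is chosen only for illustration.

```python
import struct

LOGICAL_BLOCK = 8192  # bytes: the 8 KB logical block assumed in the text


def write_small(disk, offset, payload, log):
    """Write a payload smaller than one logical block via read-modify-write."""
    block_no = offset // LOGICAL_BLOCK
    start = block_no * LOGICAL_BLOCK
    within = offset - start
    if within + len(payload) > LOGICAL_BLOCK:
        raise ValueError("payload must fit inside one logical block")

    # 1. Read the whole logical block into a buffer in main memory.
    buffer = bytearray(disk[start:start + LOGICAL_BLOCK])
    log.append(("read", block_no))

    # 2. Update the buffer at the target location.
    buffer[within:within + len(payload)] = payload

    # 3. Write the whole logical block back.
    disk[start:start + LOGICAL_BLOCK] = buffer
    log.append(("write", block_no))


disk = bytearray(4 * LOGICAL_BLOCK)  # a tiny simulated disk of four blocks
log = []
# Append the 4-byte document number 5 somewhere inside logical block 1.
write_small(disk, LOGICAL_BLOCK + 100, struct.pack("<I", 5), log)
print(log)
```

Even though only 4 bytes change, the log records a full block read and a full block write, each of which requires positioning the head on a real disk.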

[0005] An example of the magnetic disk operation in this case is described with reference to FIG. 4. First, a head seek and a rotational wait are performed to position the magnetic disk head at the start of the logical block to be read, and then the logical block is read. Next, the logical block is updated (the document number is added). A head seek and a rotational wait are then performed again to position the head at the logical block that was read, after which the logical block is written to the magnetic disk.

[0006] The processing times in the figure are those of a typical 3.5-inch magnetic disk: an average seek time of about 14 ms, an average rotational wait time of about 17 ms, and a read time and a write time of about 4 ms each per logical block. The time to write a 4-byte document number into the logical block in the buffer area is assumed to be 0.001 ms. In this example, writing data smaller than a logical block to the magnetic disk takes 70 ms in total. The time chart shows that reading from and writing to the magnetic disk account for 8 ms, whereas the seeks and rotational waits needed to position the head account for 62 ms, roughly eight times as long. That is, although the read and write transfer rate of the magnetic disk is 2 MB/s, the overall effective rate including seek and rotational wait time is only about 0.13 MB/s, so the read and write performance of the magnetic disk cannot be fully exploited.
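The 70 ms total and the roughly eightfold ratio follow directly from the assumed figures. The short calculation below simply adds up the steps of the FIG. 4 time chart; the variable names are illustrative and the timing values are the ones assumed in the text, not measurements.

```python
# Figures assumed in the text for a typical 3.5-inch magnetic disk (milliseconds).
SEEK_MS = 14.0       # average seek time
ROTATE_MS = 17.0     # average rotational wait time
TRANSFER_MS = 4.0    # read or write of one 8 KB logical block
UPDATE_MS = 0.001    # write one 4-byte document number into the buffer

# One read-modify-write: position + read, update in memory, position + write.
positioning = 2 * (SEEK_MS + ROTATE_MS)
transfer = 2 * TRANSFER_MS
total = positioning + transfer + UPDATE_MS
ratio = positioning / transfer

print(f"positioning: {positioning:.0f} ms, transfer: {transfer:.0f} ms")
print(f"total: {total:.3f} ms, positioning/transfer ratio: {ratio:.2f}")
```

Positioning the head dominates the cost, which is why reducing the number of separate disk accesses matters more than reducing the number of bytes transferred.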

[0007] Consequently, in the index addition performed when a document is registered, one random access to the magnetic disk occurs for each keyword whose index is updated, which lowers the effective processing speed of the magnetic disk and makes index addition slow. In short, document retrieval methods that use an index have the problem that document registration takes a long time. An object of the present invention is to speed up the index addition performed at registration and thereby shorten the registration time.

[0008]

[Means for Solving the Problems] To solve the above problems, the present invention provides a document retrieval method in which keywords are extracted from documents, indexes are created from them, and retrieval is performed by referring to the index of the keyword that matches the query word. In this method, the indexes are stored in a file on a secondary storage device, the file is divided into a predetermined number of blocks, and keywords are assigned to the blocks so that the sizes of the indexes stored in the blocks are approximately equal. When a document is registered, if the number of indexes in a block that require addition or update is equal to or greater than a predetermined number, the block is read from the secondary storage device into main memory, index addition for the keywords corresponding to the indexes in the block is performed in main memory, and the updated block is stored back on the secondary storage device. If the number is less than the predetermined number, each such index is read from the secondary storage device into main memory, index addition for the keyword corresponding to the index is performed in main memory, and the updated index is stored back on the secondary storage device. In assigning keywords to blocks, for each keyword assigned to a block, the number of documents in which the keyword appears is computed over all documents to be indexed; these per-keyword document counts are summed for each block; and keywords are assigned so that the sums are approximately equal across the blocks. Furthermore, the invention provides a document retrieval apparatus in which keywords are extracted from documents, indexes are created from them, and retrieval is performed by referring to the index of the keyword that matches the query word, the apparatus comprising: means for storing the indexes in a file on a secondary storage device and dividing the file into a predetermined number of blocks; assignment means for assigning keywords to the blocks so that the sizes of the indexes stored in the blocks are approximately equal; means for determining, when a document is registered, whether the number of indexes in a block that require addition or update is equal to or greater than a predetermined number; means for, when it is, reading the block from the secondary storage device into main memory, performing index addition in main memory for the keywords corresponding to the indexes in the block, and storing the updated block on the secondary storage device; and means for, when it is less than the predetermined number, reading each such index from the secondary storage device into main memory, performing index addition in main memory for the keyword corresponding to the index, and storing the updated index on the secondary storage device. The assignment means computes, for each keyword assigned to a block, the number of documents in which the keyword appears among all documents to be indexed, sums these per-keyword document counts for each block, and assigns keywords to blocks so that the sums are approximately equal across the blocks.

[0009]

[Operation] With the above means, keywords are extracted from all documents and assigned to blocks so that the index sizes of the predetermined number of blocks on the secondary storage device are approximately equal. When a document is registered, for each block storing indexes, if the number of extracted keywords belonging to the block is equal to or greater than a predetermined number, the block is read from the magnetic disk into memory, index addition for those extracted keywords is performed together in memory, and the block is written back to the magnetic disk, which reduces the number of accesses to the magnetic disk. If the number is less than the predetermined number, only the indexes corresponding to those extracted keywords are read into memory, index addition is performed, and the indexes are written back to the magnetic disk, which minimizes the amount of data transferred to and from the magnetic disk. Index addition therefore becomes very fast, and high-speed registration in the document database can be achieved.

[0010]

[Embodiments] First, the principle of the present invention is described. As initial setup, an area of a predetermined size is reserved on the magnetic disk for the index file that stores the indexes, and this area is divided into a predetermined number of blocks (these are not logical blocks). Next, the keywords corresponding to the indexes are assigned to the blocks so that the capacity of the indexes (the index size) stored in each block is approximately equal. When a document is registered, keywords are first extracted from the document, and each keyword and the document number of the document in which it appeared are stored in main memory. Next, for each block, the number of extracted keywords that are assigned to that block is counted. If the count is less than the predetermined number, then, as in the conventional method, the index corresponding to each keyword extracted for that block is read from the magnetic disk into main memory, the document number of the document in which the keyword appeared is appended to the end of the index in main memory, and the index is stored back on the magnetic disk. If the count is equal to or greater than the predetermined number, the block is first read from the magnetic disk into main memory. Then, only for the extracted keywords that are assigned to the block, the document number of the document in which the keyword appeared is appended to the end of the corresponding index within the block in main memory. The block is then stored back on the magnetic disk. Performing this sequence for every block to which an extracted keyword is assigned adds the keywords extracted from the registered document to the indexes.

[0011] As described above, for a block that contains at least the predetermined number of extracted keywords, multiple index additions are accomplished by reading the block from the magnetic disk into main memory only once, performing the additions for the new keywords, and writing the block back to the magnetic disk. Compared with the conventional approach of reading the relevant index from the magnetic disk for each keyword, performing the addition for that keyword, and writing it back to the magnetic disk, the time required for index addition is greatly reduced.

[0012] The principle described above is now explained with a concrete example. FIG. 5 shows the index file used in this example. This index file is obtained by dividing the index file of FIG. 2 into two blocks, block 1 and block 2, and storing it on a magnetic disk. To make the sizes of the indexes in the blocks approximately equal, the keywords "core", "disk", and "IR" are assigned to block 1 and the keyword "computer" to block 2, and each index is stored in its corresponding block.

[0013] During retrieval, this index is searched in the same way as before, without regard to blocks. Only when "core", "disk", "computer", or "IR" is specified as the query word are the document numbers of the documents in which that keyword appears read from the index file. That is, the retrieval result is document numbers 1 and 4 for the query word "core", document number 4 for "disk", document numbers 1, 2, and 4 for "computer", and document number 2 for "IR".

[0014] When a new document is registered in the database, block-aware processing is performed as shown in FIG. 6. The procedure is described in detail below. First, assume that "core", "disk", and "IR" have been extracted as keywords from the document to be registered and that the document number of this document is 5. As for the assignment of keywords to blocks, "core", "disk", and "IR" are assigned to block 1 and "computer" to block 2. Of these, the three keywords "core", "disk", and "IR" in block 1 were extracted from document 5. Block 1 is therefore first read into main memory. Next, document number 5, the number of the document in which these keywords appear, is appended to the end of each of the indexes for "core", "disk", and "IR" stored in block 1. Block 1 in main memory is then written to the index file on the magnetic disk. For block 2, no index addition is performed because none of the keywords assigned to it was extracted from document 5. Document registration is performed in this way.
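The block-aware registration procedure can be sketched as follows. This is a simplified model, not the patented implementation: a nested dictionary stands in for the index file on the magnetic disk, the function name `register` and the transfer log are hypothetical, and the threshold value of 2 is an assumption, since the source speaks only of a "predetermined number". A second call with document number 6 is added purely to show the below-threshold path.

```python
from collections import defaultdict


def register(disk, assignment, doc_number, keywords, threshold):
    """Add doc_number to the index of each keyword, deciding per block.

    disk:       {block_id: {keyword: [document numbers]}} (stand-in for the index file)
    assignment: {keyword: block_id}
    Returns a log of the simulated disk transfers.
    """
    log = []
    per_block = defaultdict(list)
    for keyword in keywords:
        per_block[assignment[keyword]].append(keyword)

    for block_id, block_keywords in per_block.items():
        if len(block_keywords) >= threshold:
            # Batched path: read the whole block once, update, write it once.
            block = {k: list(v) for k, v in disk[block_id].items()}
            log.append(("read block", block_id))
            for keyword in block_keywords:
                block[keyword].append(doc_number)
            disk[block_id] = block
            log.append(("write block", block_id))
        else:
            # Per-index path: transfer only the indexes that change.
            for keyword in block_keywords:
                postings = list(disk[block_id][keyword])
                log.append(("read index", keyword))
                postings.append(doc_number)
                disk[block_id][keyword] = postings
                log.append(("write index", keyword))
    return log


# Index file of FIG. 5, split into two blocks.
disk = {
    1: {"core": [1, 4], "disk": [4], "IR": [2]},
    2: {"computer": [1, 2, 4]},
}
assignment = {kw: block_id for block_id, idx in disk.items() for kw in idx}

# FIG. 6: document 5 yields "core", "disk", and "IR", all in block 1.
log = register(disk, assignment, 5, ["core", "disk", "IR"], threshold=2)
print(log)

# Illustrative extra case: a single keyword in block 2 stays below the threshold.
log2 = register(disk, assignment, 6, ["computer"], threshold=2)
print(log2)
```

In the first call block 2 is never touched and block 1 is transferred exactly once in each direction, matching the walkthrough; in the second call only the single affected index is transferred.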

[0015] FIG. 7 shows a time chart of the index addition for block 1. In this example, a typical 3.5-inch magnetic disk is assumed, the logical block size is assumed to be 8 KB as used in typical workstation operating systems, and the size of a block storing indexes is assumed to be eight logical blocks. The seek time and rotational wait time of the magnetic disk are taken to be the average seek time and the average rotational wait time. The average seek time, the average rotational wait time, the read time for one logical block, and the write time for one logical block are assumed to be about 14 ms, about 17 ms, about 4 ms, and about 4 ms, respectively. The flow of the time chart is described using these values. First, the magnetic disk is accessed when block 1 is read into the buffer area, and a seek and a rotational wait position the magnetic disk head at the start of block 1. This takes the average seek time of 14 ms and the average rotational wait time of 17 ms. Next, the logical blocks that make up block 1, eight in this case, are read into the buffer area. This takes 32 ms, eight times the read time of one logical block. Document number 5, the number of the document in which the keywords appear, is then appended to the end of each of the indexes for "core", "disk", and "IR" stored in block 1 in the buffer area. Three document number additions therefore occur in block 1. The time to write a 4-byte document number into one index in the buffer area is assumed to be 0.001 ms. Because three additions occur, this takes three times as long, 0.003 ms, which is negligible compared with the other processing times. Block 1, with the index additions applied, is then written to the magnetic disk. A seek and a rotational wait position the magnetic disk head at the required location, which takes the average seek time of 14 ms and the average rotational wait time of 17 ms. The eight logical blocks that make up block 1 are then written to the magnetic disk, which takes 32 ms, eight times the write time of one logical block. In total, the index addition in this example takes 126 ms, an average of 42 ms per keyword. With the conventional approach of updating the index on the magnetic disk keyword by keyword, each keyword took 70 ms, as shown in FIG. 4. By performing index addition collectively in block units as in the present invention, the document registration time can be shortened by about 40%. When ordinary documents are registered, the number of keywords is several tens of times larger than in this example, so the benefit of block-unit index addition is even greater.
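The comparison between keyword-by-keyword updates and block-unit updates can be checked with a small cost model. The sketch below uses only the timing figures assumed in the text; the function names are hypothetical, and the break-even point it computes is an observation from this arithmetic rather than a threshold stated in the source.

```python
# Figures assumed in the text for a typical 3.5-inch magnetic disk (milliseconds).
SEEK_MS = 14.0
ROTATE_MS = 17.0
TRANSFER_MS = 4.0               # read or write of one 8 KB logical block
UPDATE_MS = 0.001               # append one 4-byte document number in memory
LOGICAL_BLOCKS_PER_BLOCK = 8    # index block size assumed for FIG. 7


def per_keyword_ms(n_keywords):
    """Conventional method: one read-modify-write of a logical block per keyword."""
    one = 2 * (SEEK_MS + ROTATE_MS + TRANSFER_MS) + UPDATE_MS
    return n_keywords * one


def batched_ms(n_keywords):
    """Block-unit method: read the whole block once, update, write it once."""
    io = 2 * (SEEK_MS + ROTATE_MS + LOGICAL_BLOCKS_PER_BLOCK * TRANSFER_MS)
    return io + n_keywords * UPDATE_MS


n = 3  # "core", "disk", "IR"
saving = 1 - batched_ms(n) / per_keyword_ms(n)
# Smallest keyword count at which batching is faster under these figures.
break_even = next(k for k in range(1, 100) if batched_ms(k) < per_keyword_ms(k))

print(f"batched: {batched_ms(n):.0f} ms total, {batched_ms(n) / n:.0f} ms per keyword")
print(f"conventional: {per_keyword_ms(n):.0f} ms total")
print(f"saving: {saving:.0%}, break-even at {break_even} keywords")
```

Because the batched cost is almost independent of the number of keywords while the conventional cost grows linearly, the saving increases as more of a document's keywords fall into the same block.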

[0016] As described above, according to the present invention, index addition is performed collectively for a block that contains at least the predetermined number of keywords to be added, so multiple index additions can be carried out by reading the block on the magnetic disk into main memory only once, and high-speed registration in the document database can be achieved.

[0017] An embodiment of the present invention is described below. The configuration of a document retrieval system to which the present invention is applied is described with reference to FIG. 1. The system comprises a display 101, a keyboard 102, a CPU 103, a memory 104, a magnetic disk 105, and a floppy disk drive (FDD) 106. The display 101, the keyboard 102, the memory 104, the magnetic disk 105, and the FDD 106 are accessed by the CPU 103 via a bus. An index file 8000 is stored on the magnetic disk 105. A system control program 5000, a search interface program 6000, a registration control program 2000, a search control program 3000, a keyword assignment program 2100, an index creation and registration program 2200, and an index search program 3100 are loaded into the memory 104, and a work area 4000 is reserved there. Documents to be registered in the document database of this document retrieval system are stored on a floppy disk 107 and accessed by the CPU 103 via the FDD 106.

【００１８】本システムでは、電源投入時ＣＰＵ１０３
によりシステム制御プログラム５０００が起動され、シ
ステム制御プログラム５０００の制御のもとに登録制御
プログラム２０００および検索制御プログラム３０００
が起動される。まず、このような構成の本システムにお
ける文書の登録処理の概略について説明する。ユーザが
キーボード１０２から入力した指示に従って、システム
制御プログラム５０００が登録制御プログラム２０００
を起動する。登録制御プログラム２０００では、最初、
文書を登録する前に、ユーザがキーボード１０２から入
力した指示に従い、キーワード割り付けプログラム２１
００を起動し、インデックスファイルの初期設定を行
う。まず、ユーザがキーボード１０２から入力した指示
に従い、インデックスを格納するインデックスファイル
８０００を所定ブロック数分磁気ディスク１０５上に確
保するとともに、これを指定された数のブロックに分割
する。そして、各ブロックにキーワードを割り付けるた
めの所定数の文書（以後、種文書と呼ぶ）がＦＤＤ１０
６を介してフロッピーディスク１０７からメモリ１０４
のワークエリア４０００に読み込まれる。種文書として
は、例えば１０万件の新聞記事ＤＢを作成する場合に
は、同じ種類の文書、すなわち新聞記事を数百件〜数千
件程度登録する。次に、ワークエリア４０００に読み込
まれた種文書から検索に必要な言葉をキーワードとして
抽出し、そのキーワードの出現文書数を算出する。この
キーワードの出現文書数の総和が、各ブロック間でほぼ
均等になるようにキーワードを各ブロックに割り付け
る。その後、登録制御プログラム２０００では、インデ
ックス作成登録プログラム２２００を起動する。In this system, when power is turned on, the CPU 103 starts the system control program 5000, and the registration control program 2000 and the search control program 3000 are started under the control of the system control program 5000. First, an outline of document registration in this system will be described. The system control program 5000 starts the registration control program 2000 in accordance with an instruction entered by the user from the keyboard 102. Before any document is registered, the registration control program 2000 first starts the keyword allocation program 2100 in accordance with an instruction entered by the user from the keyboard 102 and initializes the index file. In accordance with an instruction entered by the user from the keyboard 102, the index file 8000 for storing indexes is reserved on the magnetic disk 105 with a predetermined number of blocks and is divided into the specified number of blocks. Then a predetermined number of documents used for allocating keywords to the blocks (hereinafter called seed documents) are read from the floppy disk 107 via the FDD 106 into the work area 4000 of the memory 104. For example, when a newspaper article database of 100,000 articles is to be created, several hundred to several thousand documents of the same kind, that is, newspaper articles, are registered as seed documents. Next, the words needed for searching are extracted as keywords from the seed documents read into the work area 4000, and the number of documents in which each keyword appears is calculated. Keywords are allocated to the blocks so that the sum of these document counts is approximately equal across the blocks. The registration control program 2000 then starts the index creation and registration program 2200.

【００１９】インデックス作成登録プログラム２２００
では、ユーザがキーボード１０２から入力した指示に従
い、フロッピーディスク１０７に格納された登録対象の
文書を、ＦＤＤ１０６を介してメモリ１０４のワークエ
リア４０００に読み込む。この登録文書から検索に必要
な言葉がキーワードとして抽出され、インデックスファ
イル８０００の該当ブロックにキーワードと文書番号、
あるいは文書番号が登録される。The index creation and registration program 2200 reads the documents to be registered, which are stored on the floppy disk 107, into the work area 4000 of the memory 104 via the FDD 106 in accordance with an instruction entered by the user from the keyboard 102. The words needed for searching are extracted from each registered document as keywords, and either the keyword together with the document number, or the document number alone, is registered in the corresponding block of the index file 8000.

【００２０】次に、本システムにおける文書の検索動作
の概略について説明する。ユーザがキーボード１０２か
ら入力した指示に従い、システム制御プログラム５００
０は検索制御プログラム３０００と検索インタフェース
プログラム６０００を起動する。その後、ユーザがキー
ボード１０２から入力した質問語は、検索インタフェー
スプログラム６０００に入力され、検索制御プログラム
３０００に送られる。検索制御プログラム３０００で
は、インデックス検索プログラム３１００を起動すると
ともに本プログラムへ前記質問語を送る。インデックス
検索プログラム３１００では、受け取った質問語に対応
するインデックスから文書番号を読み出し、検索結果と
して検索制御プログラム３０００へ送出する。本検索結
果は、検索インタフェースプログラム６０００へと送ら
れ、検索結果文書番号としてディスプレイ１０１に表示
される。Next, an outline of the document search operation in this system will be described. In accordance with an instruction entered by the user from the keyboard 102, the system control program 5000 starts the search control program 3000 and the search interface program 6000. A query word subsequently entered by the user from the keyboard 102 is received by the search interface program 6000 and passed to the search control program 3000. The search control program 3000 starts the index search program 3100 and passes the query word to it. The index search program 3100 reads the document numbers from the index corresponding to the received query word and returns them to the search control program 3000 as the search result. The search result is passed to the search interface program 6000 and displayed on the display 101 as the resulting document numbers.
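
The lookup performed by the index search program 3100 can be pictured with a minimal Python sketch. It assumes, purely for illustration, that the index file is held as a dictionary of blocks and that a keyword-to-block map is available; the names `search`, `keyword_block`, and `blocks` are hypothetical and do not appear in the specification.

```python
def search(query, keyword_block, blocks):
    """Return the document numbers indexed under the query word.

    keyword_block: keyword -> block number (assumed lookup structure)
    blocks: block number -> {keyword: [document numbers]}
    """
    block = keyword_block.get(query)
    if block is None:
        return []  # the query word matches no indexed keyword
    return list(blocks[block].get(query, []))


keyword_block = {"core": 1, "computer": 2}
blocks = {1: {"core": [1, 3, 5]}, 2: {"computer": [2, 3, 4]}}
print(search("core", keyword_block, blocks))
```

Only the block that holds the matching keyword is consulted in this sketch; how the real index search program locates an index inside the index file 8000 is not detailed in this passage.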

【００２１】次に、キーワード割り付けプログラム２１
００の構成とキーワード割り付け処理について図８を用
いて説明する。キーワード割り付けプログラム２１００
は、インデックス分割ステップ２１０５、種文書数分繰
返しステップ２１１０、種文書読み込みステップ２１２
０、キーワード抽出ステップ２１３０、出現文書数カウ
ントステップ２１４０およびキーワード割り付けステッ
プ２１５０から構成される。まず、インデックス分割ス
テップ２１０５では、ユーザから指定されたブロック数
をキーボード１０２から読み込む。次にインデックスフ
ァイル８０００として、指定のブロック数分のエリアを
確保するとともに、これを指定ブロック数に均等分割す
る。次に、種文書読み込みステップ２１２０では、ＦＤ
Ｄ１０６を介して種文書を１文書分読み込みワークエリ
ア４０００に格納する。さらに、キーワード抽出ステッ
プ２１３０では、読み込まれた種文書からキーワードと
なる言葉を抽出し、この抽出されたキーワードをワーク
エリア４０００に格納する。この文書からキーワードを
抽出する技術は、日本語文書については“自動索引付け
研究の動向”（諸橋著、情報処理学会誌、Ｖｏｌ.２
５、Ｎｏ.９、１９８４）に記載されており、英語文書
については“ＤＯＣＵＭＥＮＴＤＡＴＡＢＡＳＥ”
（Ｇ．Ｊａｍｅｓ、ＶａｎＮｏｓｔｒａｎｄＲｅｉ
ｎｈｏｌｄＣｏ．、１９８５）ｐｐ．８７−９４に記
載されている。本実施例では、これらのキーワード抽出
技術をそのまま利用する。出現文書数カウントステップ
２１４０では、抽出されたキーワードが出現する文書数
をカウントし、キーワードに対応させて、ワークエリア
４０００に格納する。種文書数分繰返しステップ２１１
０では、全ての種文書についてステップ２１２０からス
テップ２１４０までの一連の処理を繰り返す。全ての種
文書が処理された後、キーワード割り付けステップ２１
５０では、出現文書数の和がほぼ均等になるように、抽
出したキーワードを各ブロックに割り付け、そのブロッ
ク番号をキーワードに対応する形でワークエリア４００
０に格納する。Next, the configuration of the keyword allocation program 2100 and the keyword allocation process will be described with reference to FIG. 8. The keyword allocation program 2100 consists of an index division step 2105, a seed-document loop step 2110, a seed document reading step 2120, a keyword extraction step 2130, a document count step 2140, and a keyword allocation step 2150. First, the index division step 2105 reads the number of blocks specified by the user from the keyboard 102. It then reserves an area of the specified number of blocks as the index file 8000 and divides it equally into that number of blocks. Next, the seed document reading step 2120 reads one seed document via the FDD 106 and stores it in the work area 4000. The keyword extraction step 2130 then extracts the words that serve as keywords from the seed document that was read and stores the extracted keywords in the work area 4000. Techniques for extracting keywords from a document are described, for Japanese documents, in "Trends in Automatic Indexing Research" (Morohashi, Journal of the Information Processing Society of Japan, Vol. 25, No. 9, 1984) and, for English documents, in "DOCUMENT DATABASE" (G. James, Van Nostrand Reinhold Co., 1985), pp. 87-94. This embodiment uses these keyword extraction techniques as they are. The document count step 2140 counts the number of documents in which each extracted keyword appears and stores the count in the work area 4000 in association with the keyword. The seed-document loop step 2110 repeats the series of steps 2120 through 2140 for all seed documents. After all seed documents have been processed, the keyword allocation step 2150 allocates the extracted keywords to the blocks so that the sums of the document counts are approximately equal, and stores each block number in the work area 4000 in association with its keyword.
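
The counting loop of steps 2110 through 2140 can be sketched in Python as follows. This is only an illustration: `extract_keywords` is a stand-in that splits on whitespace, not the extraction techniques cited above, and the four seed documents are invented so that the counts come out as 2, 1, 3, and 2.

```python
from collections import Counter


def extract_keywords(text):
    # Stand-in for keyword extraction step 2130 (assumption: whitespace tokens).
    return text.split()


def count_document_frequency(seed_documents):
    """Count, for each keyword, the number of seed documents it appears in."""
    counts = Counter()
    for text in seed_documents:                 # seed-document loop step 2110
        keywords = set(extract_keywords(text))  # a document counts once per keyword
        counts.update(keywords)                 # document count step 2140
    return dict(counts)


seeds = ["core disk", "core computer IR", "computer IR", "computer"]
print(count_document_frequency(seeds))
```

Using a set per document makes the result a document count rather than an occurrence count, which is the measure the text uses for allocation.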

【００２２】このように、種文書から抽出したキーワー
ドの出現文書数を算出することにより、文書データベー
スのインデックスサイズが予測できる。つまり、種文書
から抽出したキーワードに対応するインデックスのサイ
ズは、種文書から抽出したキーワードの出現文書数と文
書番号サイズの積により算出でき、これに文書データベ
ースの登録件数と種文書数（サンプリングされた種文書
数である）の比を掛けることにより、文書データベース
におけるインデックスサイズが予測できるからである。
また、文書データベースに登録する文書数が増加しブロ
ックに割り付けられた全てのインデックスのサイズの和
が、ブロック間でほぼ均等でなくなった場合には、この
文書データベースに登録した全ての文書の中から種文書
の候補を乱数抽出などの手法を使い、全文書数の数％程
度抽出する。この種文書を基に、再度、ブロックへのキ
ーワードの割り付けを行う。このキーワードの割り付け
に基づき、文書データベースに登録されている全ての文
書を再登録することにより、ブロックに割り付けたイン
デックスのサイズの和をブロック間でほぼ均等にするこ
とが可能となる。他の方法として、ブロックに割り付け
られているインデックスのサイズの和が他のブロックに
比べ多いブロックについて、これに割り付けられている
キーワードを、サイズの小さいブロックに割り付け直す
ことにより、インデックスのサイズの和をブロック間で
ほぼ均等にすることも可能である。ブロックへキーワー
ドを割り付ける際の指標として、キーワードの文書出現
数の他に、キーワードそのものの出現数を使用したり、
キーワードの種類数を使用することも可能である。以上
の処理を行うことにより、全ての種文書からキーワード
を抽出し、各ブロックのインデックスサイズがほぼ均等
になるように、キーワードをブロックに割り付けること
ができる。In this way, calculating the number of documents in which the keywords extracted from the seed documents appear makes it possible to predict the index size of the document database. The size of the index for a keyword extracted from the seed documents is the product of the number of seed documents in which the keyword appears and the size of a document number; multiplying this by the ratio of the number of documents to be registered in the document database to the number of seed documents (that is, the number of sampled documents) gives a prediction of the index size in the document database. If the number of documents registered in the document database grows and the sum of the sizes of the indexes allocated to each block is no longer approximately equal across blocks, seed document candidates amounting to a few percent of all documents are extracted from the documents registered in the document database by a method such as random sampling. Keywords are then allocated to the blocks again on the basis of these seed documents. Re-registering all documents in the document database according to this keyword allocation makes the sum of the index sizes allocated to each block approximately equal across blocks. Alternatively, for a block whose total index size is larger than that of other blocks, the keywords allocated to it can be reallocated to blocks with smaller totals, which likewise makes the sums of index sizes approximately equal across blocks. Besides the number of documents in which a keyword appears, the number of occurrences of the keyword itself or the number of distinct keywords can also be used as the measure for allocating keywords to blocks. Through the above processing, keywords can be extracted from all seed documents and allocated to the blocks so that the index size of each block is approximately equal.
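
The size prediction described above amounts to a single product, sketched below in Python. The 4-byte document number and the sample figures are assumptions chosen for illustration; the specification does not give concrete sizes.

```python
def predict_index_size(seed_doc_count, doc_number_bytes, total_docs, seed_docs):
    """Predicted index size of one keyword in the full document database.

    seed_doc_count: number of seed documents containing the keyword
    doc_number_bytes: size of one document number (assumed value)
    total_docs / seed_docs: scale-up ratio from the sample to the database
    """
    return seed_doc_count * doc_number_bytes * total_docs / seed_docs


# A keyword found in 30 of 1,000 seed documents, scaled to 100,000 documents.
print(predict_index_size(30, 4, 100_000, 1_000))  # 12000.0
```

Summing this value over the keywords allocated to a block gives the predicted size of that block, which is the quantity the allocation tries to equalize.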

【００２３】さらに、上記キーワード割り付けステップ
２１５０におけるキーワード割り付け処理について、図
９を用いて詳細に説明する。キーワード割り付けステッ
プ２１５０は、抽出キーワードソートステップ２１５
２、ブロック番号初期設定ステップ２１５３、抽出キー
ワード繰返しステップ２１５４、ブロック繰返しステッ
プ２１５５、ブロックサイズ判定ステップ２１５６、キ
ーワード設定ステップ２１５７、ブロック番号カウント
ステップ２１５８およびジャンプステップ２１５９から
構成されている。まず、抽出キーワードソートステップ
２１５２で、ワークエリア４０００に格納されている抽
出キーワードを、抽出キーワードに対応して格納されて
いる抽出キーワードの出現文書数を降順にソートする。
次に、ブロック番号初期設定ステップ２１５３では、最
初に処理するブロック番号として最初のブロック番号で
ある１を設定する。さらに、ブロックサイズ判定ステッ
プ２１５６では、抽出キーワードをブロックに割り付け
る場合を想定し、本ブロックに割り付けられるインデッ
クスのサイズの和を算出し、所定のブロックサイズを越
えるかどうか判定する。越えない場合のみ、まず、キー
ワード設定ステップ２１５７を実行し、本ブロックに抽
出キーワードを割り付ける。次に、ブロック番号カウン
トステップ２１５８を実行し、処理対象となっているブ
ロックのブロック番号に１を加え、次に処理するブロッ
クのブロック番号を設定する。このカウントアップにお
いて、ブロック番号が最終番号までカウントアップされ
た場合には、最初のブロック番号の１に戻ることにす
る。さらに、ジャンプステップ２１５９によりブロック
繰返しステップ２１５５における繰返し処理を打ち切
り、Ｌ１以降のステップを実行する。ここでは、抽出キ
ーワード繰返しステップ２１５４を実行する。ブロック
繰返しステップ２１５５では、ステップ２１５６からス
テップ２１５９までの処理をブロック番号から順に全て
のブロックについて繰返し行う。抽出キーワード繰返し
ステップ２１５４では、ステップ２１５５からステップ
２１５９までの処理を、全ての抽出キーワードについて
繰返し行う。以上の一連の処理により、出現文書数の最
も多い抽出キーワードから順に、各ブロックへ割り付け
ることができるため、各ブロックにインデックスサイズ
をほぼ均等に割り付けることが可能となる。Next, the keyword allocation process in the keyword allocation step 2150 will be described in detail with reference to FIG. 9. The keyword allocation step 2150 consists of an extracted keyword sort step 2152, a block number initialization step 2153, an extracted-keyword loop step 2154, a block loop step 2155, a block size check step 2156, a keyword setting step 2157, a block number increment step 2158, and a jump step 2159. First, the extracted keyword sort step 2152 sorts the extracted keywords stored in the work area 4000 in descending order of the document counts stored with them. Next, the block number initialization step 2153 sets 1, the first block number, as the number of the block to be processed first. The block size check step 2156 then calculates the sum of the index sizes that would be allocated to the current block if the extracted keyword were allocated to it, and determines whether that sum exceeds the predetermined block size. Only if it does not, the keyword setting step 2157 is executed first and allocates the extracted keyword to the block. Next, the block number increment step 2158 adds 1 to the number of the block being processed to set the number of the block to be processed next. If the block number has already reached the last number, it wraps around to the first block number, 1. The jump step 2159 then breaks out of the loop of the block loop step 2155, and execution continues from label L1 onward; here, that means the extracted-keyword loop step 2154. The block loop step 2155 repeats steps 2156 through 2159 for all blocks in order, starting from the current block number. The extracted-keyword loop step 2154 repeats steps 2155 through 2159 for all extracted keywords. Through this series of steps, the extracted keywords are allocated to the blocks in order starting from the one with the largest document count, so the index size can be distributed approximately evenly over the blocks.
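
The allocation loop of steps 2152 through 2159 can be sketched in Python as below. This is one reading of the flow, not the flowchart of FIG. 9 itself: it assumes that a keyword's index size is its document count times a 4-byte document number, that a block that would overflow is simply skipped in favor of the next one, and that a keyword fitting in no block is left unallocated. The function name and parameters are hypothetical.

```python
def allocate_keywords(doc_counts, num_blocks, block_capacity, doc_number_bytes=4):
    """Allocate keywords to blocks, largest document count first.

    Returns (keyword -> block number starting at 1, total index size per block).
    """
    # Step 2152: sort in descending order of document count.
    ordered = sorted(doc_counts.items(), key=lambda kv: kv[1], reverse=True)
    totals = [0] * num_blocks
    assignment = {}
    current = 0  # step 2153: start at the first block (block 1 in the text)
    for keyword, count in ordered:                       # step 2154
        size = count * doc_number_bytes
        for offset in range(num_blocks):                 # step 2155
            block = (current + offset) % num_blocks
            if totals[block] + size <= block_capacity:   # step 2156
                assignment[keyword] = block + 1          # step 2157
                totals[block] += size
                current = (block + 1) % num_blocks       # step 2158, with wrap-around
                break                                    # step 2159
    return assignment, totals


counts = {"core": 2, "disk": 1, "computer": 3, "IR": 2}
print(allocate_keywords(counts, num_blocks=2, block_capacity=16))
```

Advancing the starting block after every allocation spreads the large keywords over the blocks in turn, and the capacity check pushes later keywords toward blocks that still have room.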

【００２４】次に、インデックス作成登録プログラム２
２００の構成と文書登録処理について図１０を用いて説
明する。インデックス作成登録プログラム２２００は、
文書番号取得ステップ２２０５、文書数分繰返しステッ
プ２２１０、登録文書数読み込みステップ２２２０、キ
ーワード抽出ステップ２２３０、ブロック番号対応ステ
ップ２２４０、文書番号カウントステップ２２４５、ブ
ロック数分繰返しステップ２２５０、抽出キーワード数
判定ステップ２２６０、ブロック単位インデックス追加
ステップ２２７０、キーワード数分繰返しステップ２２
８０およびキーワード単位インデックス追加ステップ２
２９０から構成される。まず、文書番号取得ステップ２
２０５では、ユーザがキーボード１０２から入力した登
録文書の最初の文書番号と登録文書数を読み込む。次
に、登録文書読み込みステップ２２２０で、登録対象の
文書をＦＤＤ１０６を介して、１文書分読み込みワーク
エリア４０００に格納する。その後、キーワード抽出ス
テップ２２３０で、キーワード割り付けプログラム２１
００におけるキーワード抽出ステップ２１３０と同様
に、読み込まれた登録文書からキーワードとなる言葉を
抽出する。この抽出キーワードをワークエリア４０００
に格納する。ブロック番号対応ステップ２２４０では、
キーワード割り付けステップ２１５０でキーワードに対
応する形でワークエリア４０００に格納したブロック番
号を調べることにより、抽出キーワードが割り付けられ
たブロックのブロック番号を取得し、そのブロック番号
を抽出キーワードに対応させ、ワークエリア４０００に
格納する。もし、抽出キーワードがどのブロックにも割
り付けられていない場合には、格納されているインデッ
クスのサイズの和が最も小さいブロックにそのキーワー
ドを割り付け、このブロック番号を抽出キーワードに対
応した形でワークエリア４０００に格納する。その後、
文書番号カウントステップ２２４５で、文書番号をイン
クリメントし、次の登録文書の処理に備える。文書数分
繰返しステップ２２１０では、登録文書数回分、ステッ
プ２２２０からステップ２２４５のキーワード抽出処理
を繰り返す。その後、抽出キーワード数判定ステップ２
２６０では、最初のブロック番号であるブロック１につ
いて、抽出されたキーワードのうちブロック１に割り付
けられている数をカウントし、所定数Ｎ以上か否かを調
べる。この所定数Ｎは、使用磁気ディスク、使用計算機
等の性能等を考慮して最適となる値をユーザが指定す
る。カウント数が所定数Ｎ以上であれば、次のブロック
単位インデックス追加処理２２７０を実行する。ブロッ
ク単位インデックス追加処理２２７０では、ブロック１
をワークエリア４０００に読み込み、ブロック１に割り
付けられている全ての抽出キーワードについて一括して
インデックスの追加処理を行う。所定数Ｎに達しない場
合は、キーワード単位インデックス追加ステップ２２９
０を実行する。キーワード単位インデックス追加ステッ
プ２２９０では、インデックスファイル８０００に格納
されているブロック１の中に存在する上記抽出キーワー
ドに対応するインデックスのみをワークエリア４０００
に読み込み、そのインデックスに対応する抽出キーワー
ドが出現する文書の文書番号を上記インデックスに追加
するとともに再びインデックスファイル８０００に書き
込む。さらに、キーワード数分繰返しステップ２２８０
では、抽出キーワードの中でブロック１に割り付けられ
ているもの全てについてインデックスの追加処理が終了
するまで、キーワード単位インデックス追加ステップ２
２９０を繰返し実行する。ブロック数分繰返しステップ
２２５０では、全てのブロックについてステップ２２６
０からステップ２２９０を繰返し実行し、インデックス
の追加処理を行う。本実施例では、このようにして新た
な文書の追加登録を実現する。Next, the configuration of the index creation and registration program 2200 and the document registration process will be described with reference to FIG. 10. The index creation and registration program 2200 consists of a document number acquisition step 2205, a document loop step 2210, a registered document reading step 2220, a keyword extraction step 2230, a block number lookup step 2240, a document number increment step 2245, a block loop step 2250, an extracted keyword count check step 2260, a block-unit index addition step 2270, a keyword loop step 2280, and a keyword-unit index addition step 2290. First, the document number acquisition step 2205 reads the first document number of the documents to be registered and the number of documents to be registered, both entered by the user from the keyboard 102. Next, the registered document reading step 2220 reads one document to be registered via the FDD 106 and stores it in the work area 4000. The keyword extraction step 2230 then extracts the words that serve as keywords from the document that was read, in the same way as the keyword extraction step 2130 of the keyword allocation program 2100, and stores the extracted keywords in the work area 4000. The block number lookup step 2240 looks up the block numbers that the keyword allocation step 2150 stored in the work area 4000 in association with the keywords, obtains the number of the block to which each extracted keyword is allocated, and stores that block number in the work area 4000 in association with the extracted keyword. If an extracted keyword has not been allocated to any block, it is allocated to the block whose stored indexes have the smallest total size, and that block number is stored in the work area 4000 in association with the extracted keyword. The document number increment step 2245 then increments the document number in preparation for the next document. The document loop step 2210 repeats the keyword extraction processing of steps 2220 through 2245 once for each document to be registered. Next, for block 1, the first block, the extracted keyword count check step 2260 counts how many of the extracted keywords are allocated to block 1 and checks whether the count is at least a predetermined number N. The user specifies N as the value that is optimal in view of the performance of the magnetic disk, the computer, and so on. If the count is N or more, the block-unit index addition step 2270 is executed: block 1 is read into the work area 4000, and index addition is performed collectively for all extracted keywords allocated to block 1. If the count is less than N, the keyword-unit index addition step 2290 is executed: only the index corresponding to an extracted keyword within block 1 of the index file 8000 is read into the work area 4000, the document numbers of the documents in which that keyword appears are added to the index, and the index is written back to the index file 8000. The keyword loop step 2280 repeats the keyword-unit index addition step 2290 until index addition has finished for all extracted keywords allocated to block 1. The block loop step 2250 repeats steps 2260 through 2290 for all blocks to perform the index additions. In this embodiment, additional registration of new documents is realized in this way.
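
The per-block decision of steps 2250 and 2260 can be sketched in Python as follows. The sketch only plans which update mode each block receives; the function name `plan_updates`, the dictionaries, and the mode labels "block" and "keyword" are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def plan_updates(postings, keyword_block, threshold_n):
    """Choose an update mode for every block that has extracted keywords.

    postings: extracted keyword -> list of new document numbers
    keyword_block: keyword -> allocated block number
    threshold_n: the predetermined number N chosen by the user
    """
    by_block = defaultdict(list)
    for keyword in postings:
        by_block[keyword_block[keyword]].append(keyword)
    plan = {}
    for block, keywords in sorted(by_block.items()):      # block loop step 2250
        # Step 2260: N or more keywords -> update the whole block at once.
        mode = "block" if len(keywords) >= threshold_n else "keyword"
        plan[block] = (mode, keywords)
    return plan


postings = {"core": [5], "disk": [5], "IR": [5], "computer": [6]}
keyword_block = {"core": 1, "disk": 1, "computer": 2, "IR": 1}
print(plan_updates(postings, keyword_block, threshold_n=2))
```

With N set to 2, block 1 (three keywords) is updated as a whole block and block 2 (one keyword) is updated index by index.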

【００２５】さらに、前述のブロック単位インデックス
追加処理ステップ２２７０について、図１１を用いて詳
細に説明する。ブロック単位インデックス追加処理ステ
ップ２２７０は、ブロック読み出しステップ２２７２、
キーワード数繰返しステップ２２７４、ブロック内イン
デックス追加ステップ２２７６およびブロック格納ステ
ップ２２７８から構成される。まず、ブロック読み出し
ステップ２２７２で、前記ブロック数分繰返しステップ
２２５０により指定されたブロックを、インデックスフ
ァイル８０００から読み出し、ワークエリア４０００に
格納する。次に、ブロック内インデックス追加ステップ
２２７６では、抽出キーワードの中で前記ブロック読み
出しステップ２２７２でワークエリア４０００に読み出
された更新対象ブロックに割り付けられているキーワー
ドに対応するインデックスの末尾に、該当キーワードが
出現した文書の文書番号を追加する。さらに、キーワー
ド数繰返しステップ２２７４では、上記読み出されたブ
ロックにおける該当キーワードの全てについてステップ
２２７６のインデックス追加処理を繰返し行う。その
後、ブロック格納ステップ２２７８では、インデックス
追加処理が終了した上記ブロックを再びインデックスフ
ァイル８０００に格納する。以上のように、ブロック単
位インデックス追加処理ステップ２２７０では、ブロッ
ク単位にインデックスの追加処理を行う。このように、
ブロック単位に一括してキーワードのインデックス追加
処理を行うことにより、追加キーワード数が多い場合で
も、短時間にインデックスの追加処理を行うことができ
る。The block-unit index addition step 2270 mentioned above will now be described in detail with reference to FIG. 11. It consists of a block read step 2272, a keyword loop step 2274, an in-block index addition step 2276, and a block store step 2278. First, the block read step 2272 reads the block designated by the block loop step 2250 from the index file 8000 and stores it in the work area 4000. Next, for an extracted keyword that is allocated to the block read into the work area 4000 by the block read step 2272, the in-block index addition step 2276 appends the document number of the document in which the keyword appeared to the end of the index corresponding to that keyword. The keyword loop step 2274 repeats the index addition of step 2276 for all such keywords in the block. The block store step 2278 then stores the block, for which index addition has finished, back in the index file 8000. In this way, the block-unit index addition step 2270 adds indexes block by block. Because the index additions for the keywords are performed collectively per block, index addition can be completed in a short time even when many keywords are added.
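
The difference in disk traffic between block-unit addition (steps 2272 through 2278) and keyword-unit addition (step 2290) can be illustrated with a small Python simulation. The class `IndexFile` is an in-memory stand-in for the index file 8000 that merely counts accesses; it and the function names are assumptions made for this sketch, not part of the specification.

```python
import copy


class IndexFile:
    """In-memory stand-in for index file 8000 that counts simulated disk accesses."""

    def __init__(self, blocks):
        self.blocks = blocks  # block number -> {keyword: [document numbers]}
        self.accesses = 0

    def read_block(self, block):
        self.accesses += 1
        return copy.deepcopy(self.blocks[block])

    def write_block(self, block, data):
        self.accesses += 1
        self.blocks[block] = data

    def read_index(self, block, keyword):
        self.accesses += 1
        return list(self.blocks[block].get(keyword, []))

    def write_index(self, block, keyword, docs):
        self.accesses += 1
        self.blocks[block][keyword] = docs


def add_block_unit(index_file, block, postings):
    data = index_file.read_block(block)              # block read step 2272
    for keyword, docs in postings.items():           # keyword loop step 2274
        data.setdefault(keyword, []).extend(docs)    # in-block addition step 2276
    index_file.write_block(block, data)              # block store step 2278


def add_keyword_unit(index_file, block, postings):
    for keyword, docs in postings.items():           # keyword loop step 2280
        index = index_file.read_index(block, keyword)  # step 2290: one index only
        index.extend(docs)
        index_file.write_index(block, keyword, index)


def make_file():
    return IndexFile({1: {"core": [1, 3], "disk": [2], "IR": [1, 4]}})


new_postings = {"core": [5], "disk": [5], "IR": [5]}
batched = make_file()
add_block_unit(batched, 1, new_postings)
single = make_file()
add_keyword_unit(single, 1, new_postings)
print(batched.accesses, single.accesses)  # 2 6
```

Both paths leave the same indexes in block 1. The block-unit path needs one read and one write regardless of the number of keywords, while the keyword-unit path needs a read and a write per keyword but moves less data each time, which is the trade-off the predetermined number N is meant to balance.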

【００２６】上述のキーワード割り付けプログラム２１
００でワークエリアに格納するキーワードに関する情報
の格納例について図１２を用いて説明する。キーワード
抽出ステップ２１３０では、種文書からキーワードとな
る言葉が抽出されるとともにワークエリア４０００に格
納される。本図に示すように、キーワードに対応してそ
のキーワード番号を一緒に格納している。本例では、抽
出されたキーワードとして“コア”、“ディスク”、
“コンピュータ”および“ＩＲ”が種文書から抽出され
たことを想定している。本ステップでは、これらの抽出
キーワードの抽出された順番にシリアルな番号をそのキ
ーワード番号として割り振り、本例のような形でワーク
エリア４０００に格納する。本例では、キーワード“コ
ア”、“ディスク”、“コンピュータ”および“ＩＲ”
にはキーワード番号として、それぞれ１、２、３および
４が割り振られている。次に、出現文書数カウントステ
ップ２１４０で、キーワード毎に種文書における文書出
現数がカウントされる。このとき、本図に示すようにキ
ーワードに対応した形で出現文書数を格納する。本例で
は、キーワード“コア”、“ディスク”、“コンピュー
タ”および“ＩＲ”における出現文書数は、それぞれ
２、１、３および２となっている。その後、キーワード
割り付けステップ２１５０で、上記出現文書数をもと
に、インデックスを格納するブロックへのキーワードの
割り付けが行われる。ここで割り付けられたブロック番
号を、本図に示すようにキーワードに対応付け、ワーク
エリア４０００に格納する。本例では、キーワード“コ
ア”、“ディスク”、“コンピュータ”および“ＩＲ”
が割り付けられたブロックのブロック番号は、それぞれ
１、１、２および１となっている。このような形式で、
キーワードに関する情報をワークエリア４０００に格納
することにより、種文書から抽出されたキーワードに関
する情報を管理することができるため、ブロックへのキ
ーワード割り付け処理が実現できる。An example of how the keyword allocation program 2100 described above stores keyword information in the work area will be described with reference to FIG. 12. In the keyword extraction step 2130, the words that serve as keywords are extracted from the seed documents and stored in the work area 4000. As shown in the figure, each keyword is stored together with its keyword number. This example assumes that "core", "disk", "computer", and "IR" have been extracted from the seed documents. In this step, serial numbers are assigned to the extracted keywords as keyword numbers in the order of extraction, and the keywords are stored in the work area 4000 in the form shown. Here the keywords "core", "disk", "computer", and "IR" are assigned keyword numbers 1, 2, 3, and 4, respectively. Next, the document count step 2140 counts, for each keyword, the number of seed documents in which it appears, and stores the count in association with the keyword as shown in the figure. Here the document counts for "core", "disk", "computer", and "IR" are 2, 1, 3, and 2, respectively. The keyword allocation step 2150 then allocates the keywords to the blocks that store the indexes on the basis of these document counts. The allocated block numbers are associated with the keywords as shown in the figure and stored in the work area 4000. Here the block numbers allocated to "core", "disk", "computer", and "IR" are 1, 1, 2, and 1, respectively. Storing keyword information in the work area 4000 in this form makes it possible to manage the information on the keywords extracted from the seed documents, and thus to implement the allocation of keywords to blocks.
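
The keyword information of FIG. 12 can be modeled as a small table of records, sketched here in Python. The field names and the `KeywordEntry` class are assumptions for illustration; the values are the ones given in the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeywordEntry:
    number: int                   # serial keyword number, in order of extraction
    keyword: str
    doc_count: int = 0            # number of seed documents containing the keyword
    block: Optional[int] = None   # allocated block number, once known


table = [
    KeywordEntry(1, "core", 2, 1),
    KeywordEntry(2, "disk", 1, 1),
    KeywordEntry(3, "computer", 3, 2),
    KeywordEntry(4, "IR", 2, 1),
]

# Lookup by keyword, as needed later when documents are registered.
by_keyword = {entry.keyword: entry for entry in table}

# Sum of document counts per block, the quantity the allocation tries to balance.
block_totals = {}
for entry in table:
    block_totals[entry.block] = block_totals.get(entry.block, 0) + entry.doc_count

print(by_keyword["IR"].block, block_totals)
```

Keeping the count and the block number on the same record lets one table serve both the allocation step and the later block number lookup.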

【００２７】上述のインデックス作成登録プログラム２
２００でワークエリアに格納するキーワードに関する情
報の格納例について図１３を用いて説明する。本例で
は、文書番号５の登録文書からキーワードとして“コ
ア”、“ディスク”および“ＩＲ”が抽出されたことを
想定している。キーワード抽出ステップ２２３０では、
登録文書からキーワードとなる言葉が抽出されるととも
にワークエリア４０００に格納される。このとき、図１
２に示すキーワード情報から、抽出キーワードに対応す
るキーワード番号と割り付けられたブロックのブロック
番号を取得する。このとき、本図に示すように、キーワ
ードに対応して、取得した割り付けブロックの番号とそ
の出現文書である登録文書の番号を格納する。本例で
は、キーワード“コア”、“ディスク”、“コンピュー
タ”および“ＩＲ”には、キーワード番号として、それ
ぞれ１、２、３および４を格納し、割り付けブロック番
号と登録文書番号としては、それぞれブロック番号１と
文書番号５を格納する。この情報を基に、ステップ２２
５０からステップ２２９０でインデックスの追加処理を
行う。このような形式で、キーワードに関する情報をワ
ークエリア４０００に格納することにより、登録文書か
ら抽出されたキーワードに関する情報を管理することが
できるため、これに対応するインデックスの追加処理が
実現できる。An example of how the index creation and registration program 2200 described above stores keyword information in the work area will be described with reference to FIG. 13. This example assumes that "core", "disk", and "IR" have been extracted as keywords from the registered document with document number 5. In the keyword extraction step 2230, the words that serve as keywords are extracted from the registered document and stored in the work area 4000. At this time, the keyword number of each extracted keyword and the number of the block allocated to it are obtained from the keyword information shown in FIG. 12. As shown in FIG. 13, the obtained block number and the number of the registered document in which the keyword appears are stored in association with the keyword. In this example, keyword numbers 1, 2, 3, and 4 are stored for the keywords "core", "disk", "computer", and "IR", respectively, and block number 1 and document number 5 are stored as the allocated block number and the registered document number. Index addition is performed in steps 2250 through 2290 on the basis of this information. Storing keyword information in the work area 4000 in this form makes it possible to manage the information on the keywords extracted from the registered documents, and thus to implement the corresponding index addition.
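
The work table of FIG. 13 can be sketched in Python as a list of rows built from the keyword information of FIG. 12. The dictionary layout and the function name `build_work_table` are assumptions for illustration; the keyword numbers and block numbers are the ones from the FIG. 12 example.

```python
# keyword -> (keyword number, allocated block number), as in the FIG. 12 example
KEYWORD_TABLE = {"core": (1, 1), "disk": (2, 1), "computer": (3, 2), "IR": (4, 1)}


def build_work_table(doc_number, extracted_keywords, keyword_table):
    """Create one work-table row per keyword extracted from a registered document."""
    rows = []
    for keyword in extracted_keywords:
        number, block = keyword_table[keyword]  # block number lookup step 2240
        rows.append(
            {"keyword": keyword, "number": number, "block": block, "document": doc_number}
        )
    return rows


work_table = build_work_table(5, ["core", "disk", "IR"], KEYWORD_TABLE)
for row in work_table:
    print(row)
```

For document 5, all three extracted keywords map to block 1, so the later count check for block 1 sees three keywords. Handling of a keyword that is missing from the table (allocation to the block with the smallest total index size) is omitted from this sketch.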

【００２８】以上説明したように、本発明によれば、イ
ンデックスを格納するブロック毎に、該当する抽出キー
ワードが所定数以上の場合には、そのブロックを磁気デ
ィスクからメモリ上に読み込むとともに該当する抽出キ
ーワードに対するインデックスの追加処理メモリ上で一
括して行い、これを磁気ディスクへ書き込むことによ
り、磁気ディスクへのアクセス回数を低減し、所定数未
満の場合には該当する抽出キーワードに対応するインデ
ックスのみメモリ上に読み込むとともにそのインデック
スの追加処理を行い、これを磁気ディスクへ書き込むこ
とにより、磁気ディスクへのアクセスデータ量を最小化
することができるため、非常に高速なインデックス追加
処理が可能となり、文書データベースへの高速な登録処
理を実現することができる。As described above, according to the present invention, for each block that stores indexes, when the number of extracted keywords belonging to the block is equal to or greater than a predetermined number, the block is read from the magnetic disk into memory, index addition for those keywords is performed collectively in memory, and the block is written back to the magnetic disk, which reduces the number of accesses to the magnetic disk. When the number is less than the predetermined number, only the indexes corresponding to those keywords are read into memory, index addition is performed on them, and they are written back to the magnetic disk, which minimizes the amount of data transferred to and from the magnetic disk. Index addition is therefore very fast, and high-speed registration into the document database can be realized.

【００２９】[0029]

【発明の効果】本発明によれば、複数のインデックスの
追加処理をメモリ上で一括して実行することで、磁気デ
ィスクへのアクセス回数を低減することができるため高
速なインデックス追加処理が可能となり、文書データベ
ースへの登録処理を高速に行うことが可能となる。According to the present invention, performing the addition of a plurality of indexes collectively in memory reduces the number of accesses to the magnetic disk, which enables high-speed index addition and hence high-speed registration into the document database.

[Brief description of drawings]

【図１】本発明が適用された文書検索システムの構成を
示す図である。FIG. 1 is a diagram showing a configuration of a document search system to which the present invention is applied.

【図２】インデックスの構成例を示す図である。FIG. 2 is a diagram showing a configuration example of an index.

【図３】新たな文書に含まれるキーワードを追加した場
合のインデックスの構成例を示す図である。FIG. 3 is a diagram showing a configuration example of an index when a keyword included in a new document is added.

【図４】インデックス追加処理時のタイムチャートを示
す図である。FIG. 4 is a diagram showing a time chart during index addition processing.

【図５】ブロックに分割されたインデックスの構成例を
示す図である。FIG. 5 is a diagram showing a configuration example of an index divided into blocks.

【図６】新たな文書に含まれるキーワードを追加した場
合のブロックに分割されたインデックスの構成例を示す
図である。FIG. 6 is a diagram showing a configuration example of an index divided into blocks when a keyword included in a new document is added.

【図７】論理ブロックの８回連続読み出し処理と８回連
続書き込み処理を行った場合のインデックス追加処理時
のタイムチャートを示す図である。FIG. 7 is a diagram showing a time chart of index addition processing when eight consecutive reads and eight consecutive writes of logical blocks are performed.

【図８】キーワード割り付けプログラム２１００の処理
手順を示す図である。FIG. 8 is a diagram showing the processing procedure of the keyword allocation program 2100.

【図９】キーワード割り付けステップ２１５０の処理手
順を示す図である。FIG. 9 is a diagram showing the processing procedure of the keyword allocation step 2150.

【図１０】インデックス作成登録プログラム２２００の
処理手順を示す図である。FIG. 10 is a diagram showing a processing procedure of an index creation registration program 2200.

【図１１】ブロック単位のインデックス追加処理ステッ
プ２２７０の処理手順を示す図である。FIG. 11 is a diagram showing a processing procedure of an index addition processing step 2270 in block units.

【図１２】キーワード管理デーブルの構成例を示す図で
ある。FIG. 12 is a diagram showing a configuration example of a keyword management table.

【図１３】ワークテーブルの構成例を示す図である。FIG. 13 is a diagram showing a configuration example of a work table.

[Explanation of symbols]

１０１ディスプレイ１０２キーボード１０３ＣＰＵ１０４メモリ１０５磁気ディスク１０６ＦＤＤ１０７フロッピーディスク２０００登録制御プログラム３０００検索制御プログラム４０００ワークエリア５０００システム制御プログラム６０００検索インタフェースプログラム８０００インデックスファイル 101: display; 102: keyboard; 103: CPU; 104: memory; 105: magnetic disk; 106: FDD; 107: floppy disk; 2000: registration control program; 3000: search control program; 4000: work area; 5000: system control program; 6000: search interface program; 8000: index file

───────────────────────────────────────────────────── フロントページの続き (72)発明者多田勝己神奈川県川崎市麻生区王禅寺1099番地株式会社日立製作所システム開発研究所内 (72)発明者加藤寛次神奈川県川崎市麻生区王禅寺1099番地株式会社日立製作所システム開発研究所内 (72)発明者浅川悟志神奈川県横浜市戸塚区戸塚町5030番地株式会社日立製作所ソフトウェア開発本部内 Continuation of the front page: (72) Inventor Katsumi Tada, Systems Development Laboratory, Hitachi, Ltd., 1099 Ozenji, Asao-ku, Kawasaki-shi, Kanagawa; (72) Inventor Kanji Kato, Systems Development Laboratory, Hitachi, Ltd., 1099 Ozenji, Asao-ku, Kawasaki-shi, Kanagawa; (72) Inventor Satoshi Asakawa, Software Development Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa

Claims

[Claims]

1. A document search method in which keywords are extracted from documents, an index is created on the basis of the keywords, and a search is performed by referring to the index corresponding to a keyword that matches a query word at search time, wherein the index is stored in a file on a secondary storage device, the file is divided into a predetermined number of blocks, and keywords are allocated to the blocks so that the sizes of the indexes stored in the blocks are approximately equal; and wherein, when the number of indexes to be added to or updated in a block is equal to or greater than a predetermined number, the block is read from the secondary storage device into main memory, index addition for the keywords corresponding to the indexes in the block is performed in main memory, and the block after addition is stored in the secondary storage device, and when the number is less than the predetermined number, the index is read from the secondary storage device into main memory, index addition for the keyword corresponding to the index is performed in main memory, and the index after addition is stored in the secondary storage device.

2. The document search method according to claim 1, wherein, when keywords are allocated to blocks, the number of documents in which each keyword allocated to a block appears, among all documents for which the index is created, is calculated, the document counts calculated for the keywords are summed for each block, and the keywords are allocated to the blocks so that these sums are approximately equal across the blocks.

3. A document search device that extracts keywords from documents, creates an index on the basis of the keywords, and performs a search by referring to the index corresponding to a keyword that matches a query word at search time, the device comprising: means for storing the index in a file on a secondary storage device and dividing the file into a predetermined number of blocks; allocating means for allocating keywords to the blocks so that the sizes of the indexes stored in the blocks are approximately equal; means for determining, at document registration, whether the number of indexes to be added to or updated in a block is equal to or greater than a predetermined number; means for, when the number is equal to or greater than the predetermined number, reading the block from the secondary storage device into main memory, performing index addition for the keywords corresponding to the indexes in the block in main memory, and storing the block after addition in the secondary storage device; and means for, when the number is less than the predetermined number, reading the index from the secondary storage device into main memory, performing index addition for the keyword corresponding to the index in main memory, and storing the index after addition in the secondary storage device.

4. The document search device according to claim 3, wherein the allocating means calculates, for each keyword allocated to a block, the number of documents in which the keyword appears among all documents for which the index is created, sums the document counts calculated for the keywords for each block, and allocates the keywords to the blocks so that these sums are approximately equal across the blocks.