JPH08287105A

JPH08287105A - Document registration and retrieval device

Info

Publication number: JPH08287105A
Application number: JP7115256A
Authority: JP
Inventors: Hiroko Matsuo; 裕子松尾; Makoto Ando; 誠安藤; Akio Yamashita; 明男山下; Kazuo Aihara; 一雄相原; Tatsuomi Kita; 辰臣喜多; Hiroshi Yamaguchi; 浩山口; Shinji Kawamoto; 真司川本; Naomi Hiraoka; 直美平岡
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1995-04-18
Filing date: 1995-04-18
Publication date: 1996-11-01
Anticipated expiration: 2019-05-24
Also published as: JP3531281B2

Abstract

PURPOSE: To reduce file size and speed up retrieval by compressing the index in a format chosen by comparing the total number of pieces of information on registered documents with the total number of pieces of document information in which a key appears, and by decompressing the index for use at retrieval time. CONSTITUTION: When a document to be registered is input from a document input means 1, a key extraction means 2 extracts keys from the document by referring to a dictionary or the like, and an index creation means 11 creates an index by associating the extracted keys with information on the registered document. An index compression means 13 compares the total number of documents in which a key appears with the total number of registered documents, compresses the document information contained in the index into a format that reduces the amount of data, and stores it in file means 14 and 15. When a search key is input from a search key input means 16 at retrieval time, an index decompression means 13 decompresses the index stored in the file means before or simultaneously with the search, and a search means 17 performs the search.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document registration/retrieval device that registers, as an index, keys extracted from registered documents together with information about those documents (for example, identifiers of documents or of paragraphs within a document) and retrieves the matching documents from the index by means of a search key. More particularly, it relates to a document registration/retrieval device that compresses and stores the document information contained in the index.

[0002]

[Prior Art] Conventionally, in document registration and retrieval, as described for example in Japanese Patent Laid-Open No. 2-186476, a key extracted from a document to be registered is associated with key-occurrence position information, such as the document name or a paragraph within the document, and the pair is registered as an index. At retrieval time, the index is searched with an input search key to find the corresponding document name or the position at which the key occurs in the document. Such index-based document registration and retrieval schemes allow a given document to be found quickly and are in wide general use.

[0003] The index stores information about documents, such as the registered document names and the positions at which keys occur within documents. As the number of registrations grows, the index data volume grows and a large storage capacity is needed, so there is demand for a more compact index. Accordingly, as described in Japanese Patent Laid-Open No. 5-257774, a document registration/retrieval device has been proposed that compresses, on the basis of statistical information, the index record numbers stored in the index in association with keys, thereby reducing the size of the file storing the index, and that decompresses the compressed index record numbers at retrieval time. This device uses a 1-byte difference method and a 2-byte difference method, chosen from the statistical information, and stores in the index a 1-byte or 2-byte difference from reference data as the compressed index record number.

[0004]

[Problems to Be Solved by the Invention] However, because the conventional document registration/retrieval device compresses the index on the basis of statistical information, compression is complicated and time-consuming, and decompression is likewise complicated and time-consuming. Consequently, even if the conventional device reduces the size of the file storing the index, its registration and retrieval processing is complicated and cannot be performed quickly.

[0005] The present invention has been made in view of the above circumstances. Its object is to provide a document registration/retrieval device that uses a simple and rational compression scheme to reduce the size of the file storing the index while keeping the registration and retrieval processes from becoming complicated or slow. In particular, the invention of claim 1 aims to provide a document registration/retrieval device that compresses the index, and correspondingly decompresses it, in an optimal and simple format chosen by comparing the total number of pieces of information on registered documents with the total number of pieces of information on documents in which a key appears. The inventions of claims 2 to 4 aim, in particular, to provide a document registration/retrieval device that compresses and correspondingly decompresses the index in a rational format.

[0006]

[Means for Solving the Problems] To achieve the above object, the document registration/retrieval device of claim 1 comprises: document input means for reading a document to be registered; key extraction means for extracting keys from the input document; index creation means for creating an index that associates the extracted keys with document information identifying the registered documents; search key input means for entering a search key for retrieving registered documents; search means for searching the index with the search key; and output means for outputting the search result. To compress the index, the device further comprises: index compression means for compressing the document information contained in the index by changing its format on the basis of the total number of registered documents and the total number of documents in which a key appears; file means for storing the compressed index; index decompression means for decompressing, at retrieval time, the document information from the information stored in the file means according to the applicable compression format; and index holding means for holding the index whose document information has been decompressed. The index is compressed at registration and decompressed at retrieval.

[0007] In the document registration/retrieval device of claim 2, which is the device of claim 1, the index compression means compresses the index by deleting the document information when the total number of pieces of information on registered documents equals the total number of pieces of information on documents in which the key appears.

[0008] In the document registration/retrieval device of claim 3, which is the device of claim 1 or claim 2, the index compression means compares the total number of pieces of information on registered documents with the total number of pieces of information on documents in which the key appears. When the number of pieces of information on documents in which the key did not appear is smaller than the number on documents in which it did appear, the index compression means compresses the index by rewriting it with the information on the documents in which the key did not appear.

[0009] In the document registration/retrieval device of claim 4, which is the device of any one of claims 1 to 3, the document information in the index is represented in one of two formats. In the first, each piece of document information is assigned to one bit of a bit string whose length equals the total number of pieces of document information, and is indicated by inverting that bit. In the second, each piece of document information is indicated by its own unique binary-coded value. Whichever format yields the smaller amount of data, as determined by comparing the total number of pieces of information on registered documents with the total number on documents in which the key appears, is used.

[0010]

[Operation] In document registration by the device of claim 1, when a document to be registered is input from the document input means, the key extraction means extracts keys from the document by referring to, for example, a dictionary provided in the device, and the index creation means creates an index by associating the extracted keys with information on the registered document, such as a document identification number. The index compression means then compares the total number of registered documents with the total number of documents in which the key appears and compresses the document information contained in the index into whichever format requires less data. The file means stores the compressed index.
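The first half of this registration flow (key extraction and index creation, before any compression) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `extract_keys` and `register` are hypothetical, naive substring matching stands in for the dictionary-based key extraction, and documents are assumed to be numbered consecutively from 1.

```python
from collections import defaultdict


def extract_keys(text, dictionary):
    """Return the dictionary keys found in the text.

    Naive substring matching is an assumption standing in for the
    dictionary-based extraction described in the text.
    """
    return {key for key in dictionary if key in text}


def register(documents, dictionary):
    """Build an uncompressed index: key -> ascending list of document IDs."""
    index = defaultdict(list)
    # Document identification numbers are assumed to start at 1.
    for doc_id, text in enumerate(documents, start=1):
        for key in extract_keys(text, dictionary):
            index[key].append(doc_id)
    return dict(index)


docs = ["index compression for retrieval", "document retrieval", "index file"]
index = register(docs, {"index", "retrieval", "file"})
print(index["index"])      # IDs of the documents containing "index"
print(index["retrieval"])  # IDs of the documents containing "retrieval"
```

Because documents are processed in ascending ID order, each list comes out sorted without an explicit sort.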

[0011] In document retrieval by the device of claim 1, when a search key is input from the search key input means, the search means uses it to search the index for the corresponding document information. Before or simultaneously with the search, the index decompression means decompresses the index stored in the file means, and the search means performs the search on the decompressed index. The result is then provided to the user by the output means, for example as a printout or on a display.

[0012] In the device of claim 2, when the total number of pieces of information on documents registered in the device equals the total number of pieces of information on documents in which a given key appears, the index compression means deletes the document information for that key from the index. That is, when the information on every registered document corresponds to a given key, there is no need to distinguish individual documents for that key, so the index is compressed by not storing that document information.

[0013] In the device of claim 3, the index compression means compares the total number of pieces of information on documents registered in the device with the total number on documents in which a given key appears. When fewer documents lack the key than contain it, the index compression means rewrites the index with the information on the documents in which the key did not appear. That is, when the number obtained by subtracting the count of documents containing the key from the count of all registered documents (the size of the complement of the set of documents containing the key) is small, the index is compressed by using that smaller set of information.
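The complement idea can be illustrated with a short sketch. It assumes, purely for the example, that document identification numbers run consecutively from 1 to the number of registered documents; the function names and the boolean flag are hypothetical and do not come from the patent.

```python
def encode_with_complement(doc_ids, total_docs):
    """Store whichever is shorter: the IDs with the key, or the IDs without it.

    Returns (is_complement, stored_ids). Assumes IDs are 1..total_docs.
    """
    present = set(doc_ids)
    absent = [d for d in range(1, total_docs + 1) if d not in present]
    if len(absent) < len(present):
        return True, absent
    return False, sorted(present)


def decode_with_complement(is_complement, stored_ids, total_docs):
    """Recover the ascending list of IDs of the documents containing the key."""
    if not is_complement:
        return list(stored_ids)
    stored = set(stored_ids)
    return [d for d in range(1, total_docs + 1) if d not in stored]


# The key appears in 7 of 8 documents, so only the single missing ID is kept.
flag, stored = encode_with_complement([1, 2, 3, 5, 6, 7, 8], total_docs=8)
print(flag, stored)
print(decode_with_complement(flag, stored, total_docs=8))
```

The decoder needs only the flag and the total number of registered documents to restore the original list.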

[0014] In the device of claim 4, the index compression means compares the total number of pieces of information on documents registered in the device with the total number on documents in which a given key appears. When few documents contain the key, each piece of document information is identified by its own unique binary-coded value; when many documents contain the key, the document information is represented as a bit string.

[0015] For example, when 32 pieces of document information are each to be identified, binary-coded decimal notation needs 5 bits per piece, so representing all 32 pieces distinctly requires at least 5 × 32 = 160 bits. In the same case, bit-string notation can represent all of the pieces distinctly with 32 bits. Comparing the two formats: if 6 of the 32 pieces are to be represented, binary-coded decimal notation needs only 30 bits in total, which is less than the bit string; if 7 are to be represented, it needs at least 35 bits, so the 32-bit bit string is smaller. The index is thus compressed by representing the document information in whichever of the two formats is more advantageous.
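The arithmetic in this example can be checked with a small sketch. It assumes that each identifier is stored as a fixed-width binary number wide enough to distinguish all registered documents (5 bits for 32 documents, matching the figures in the text); the function names are hypothetical.

```python
def id_list_bits(num_ids, total_docs):
    """Bits needed to store num_ids identifiers at a fixed width per identifier."""
    # Width that distinguishes total_docs values, e.g. 5 bits for 32 documents.
    bits_per_id = max(1, (total_docs - 1).bit_length())
    return num_ids * bits_per_id


def bit_string_bits(total_docs):
    """Bits needed by the bit-string format: one bit per registered document."""
    return total_docs


def cheaper_format(num_ids, total_docs):
    """Name the format that needs fewer bits (ties go to the identifier list)."""
    if id_list_bits(num_ids, total_docs) <= bit_string_bits(total_docs):
        return "id_list"
    return "bit_string"


for n in (6, 7, 32):
    print(n, id_list_bits(n, 32), bit_string_bits(32), cheaper_format(n, 32))
```

With 32 registered documents the break-even point falls between 6 and 7 identifiers, as the text states.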

[0016]

[Embodiment] A document registration/retrieval device according to one embodiment of the present invention is described below with reference to the drawings. As shown in FIG. 1, the device comprises document input means 1, key extraction means 2, key list creation means 3, key list holding means 4, key list file access means 5, key list file 6, file list creation means 7, file list holding means 8, file list file access means 9, file list file 10, index creation means 11, index holding means 12, index file access means 13, index access file 14, index file 15, search key input means 16, search means 17, and search result display means 18.

[0017] The document input means 1 captures documents as electronic data and consists of, for example, an OCR device that reads paper documents, a device that reads electronic documents from storage, or a communication device that reads electronic documents from a network. The key extraction means 2 has a dictionary. It analyzes the document read by the document input means 1 against the dictionary and extracts from the document the keys defined in the dictionary, such as keywords and phrases. Key extraction uses known techniques such as morphological analysis or pattern matching.

[0018] The key list creation means 3 creates a key list, a duplicate-free list of the keys extracted by the key extraction means 2, and stores it in the key list holding means 4, which is a readable and writable memory. The key list holding means 4 serves as a work area for the other means during document registration and retrieval and holds the key list for that work. The key list file access means 5 reads and writes the key list between the key list holding means 4 and the key list file 6: during document registration it stores the key list held in the key list holding means 4 into the key list file 6, and during document retrieval it reads the key list stored in the key list file 6 into the key list holding means 4. The key list file 6 resides on a hard disk device or the like and stores the key list so that it can be read and written.

[0019] The file list creation means 7 creates a file list, a duplicate-free list of document names and document identification numbers for the documents read by the document input means 1, and stores it in the file list holding means 8, which is a readable and writable memory. The file list holding means 8 serves as a work area for the other means during document registration and retrieval and holds the file list for that work. The file list file access means 9 reads and writes the file list between the file list holding means 8 and the file list file 10: during document registration it stores the file list held in the file list holding means 8 into the file list file 10, and during document retrieval it reads the file list stored in the file list file 10 into the file list holding means 8. The file list file 10 resides on a hard disk device or the like and stores the file list so that it can be read and written.

[0020] The index creation means 11 creates an index that associates each key extracted by the key extraction means 2 with information on the documents from which the key was extracted, and stores it in the index holding means 12, which is a readable and writable memory. In this embodiment the document identification number is used as the document information, and, as described later, the index stores the corresponding document identification numbers under each key as a heading. The index holding means 12 serves as a work area for the other means during document registration and retrieval and holds the index for that work.

[0021] The index file access means 13 reads and writes the index between the index holding means 12 and the index access file 14 and index file 15, and also performs compression and decompression. That is, during document registration it compresses the index held in the index holding means 12 and stores it in the index access file 14 and the index file 15; during document retrieval it reads the index stored in the index access file 14 and the index file 15, decompresses it, and places it in the index holding means 12. The index access file 14 and the index file 15 reside on a hard disk device or the like and store the contents of the compressed index so that they can be read and written.

[0022] The search key input means 16 is the means by which the user enters the search key for the documents to be retrieved and consists of, for example, a keyboard. A search key can also be extracted by morphological analysis from a search request expressed as a sentence or phrase. In that case the search key input means consists of, for example, an OCR device that reads paper documents, a device that reads electronic documents from storage, or a communication device that reads electronic documents from a network, and is given an analysis function.

[0023] The search means 17 searches the index held in the index holding means 12 with the search key entered from the search key input means 16 and extracts the document identification numbers corresponding to the search key. It then uses those document identification numbers to search the file list held in the file list holding means 8 and extracts the corresponding document names. The search result display means 18 outputs the document names found by the search means 17 to the user and consists of, for example, a display device or a printer.
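The two-step lookup (search key to document identification numbers, then identification numbers to document names) might look like the following sketch. The dictionaries standing in for the decompressed index and the file list, the sample file names, and the function name `search` are assumptions for illustration only.

```python
def search(search_key, index, file_list):
    """Return the names of the documents that contain the search key.

    index:     key -> ascending list of document identification numbers
    file_list: document identification number -> document name
    """
    doc_ids = index.get(search_key, [])       # step 1: key -> document IDs
    return [file_list[d] for d in doc_ids]    # step 2: document IDs -> names


# Hypothetical sample data.
index = {"index": [1, 3], "retrieval": [1, 2]}
file_list = {1: "report.txt", 2: "memo.txt", 3: "spec.txt"}

print(search("index", index, file_list))
print(search("missing", index, file_list))
```

A key that is absent from the index simply yields an empty result here; how the device reports that case is not specified in the text.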

[0024] The index held in the index holding means 12 has the structure shown in FIG. 2. The index has a fixed part, shown in FIG. 2(a), and a variable-length part, shown in FIG. 2(b). The fixed part consists of as many entries as there are extracted keys, arranged in the same order as the keys in the key list holding means 4. Each entry therefore corresponds to one extracted key.

[0025] Each entry of the fixed part contains the number of document files (document identification numbers) in which the key appears and a pointer to a document identification number list; the pointer links the entry to the corresponding document identification number list in the variable-length part. The variable-length part holds, for each key, a list of the identification numbers of the documents in which the key appears, and each list stores all of those document identification numbers in ascending order. The pointer to the document identification number list is an absolute or relative address in the memory of the index holding means 12, although other representations may be used. Taken together, the fixed part and the variable-length part give the conceptual structure shown in FIG. 2(c): as many entries as there are keys, each with its corresponding document identification number list.
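One possible in-memory layout for this structure is sketched below. It assumes that the pointer is a relative offset into a single flat list of identifiers (the text allows absolute addresses, relative addresses, or other representations); the names `build_index` and `lookup` are hypothetical.

```python
def build_index(key_list, postings):
    """Lay out the index as a fixed part and a variable-length part.

    fixed_part:    one (count, pointer) entry per key, in key-list order
    variable_part: all document ID lists concatenated, each in ascending order
    """
    fixed_part = []
    variable_part = []
    for key in key_list:
        ids = sorted(postings.get(key, []))
        # The pointer is the offset at which this key's list starts (assumed).
        fixed_part.append((len(ids), len(variable_part)))
        variable_part.extend(ids)
    return fixed_part, variable_part


def lookup(key, key_list, fixed_part, variable_part):
    """Follow the pointer of the key's fixed-part entry to its ID list."""
    count, pointer = fixed_part[key_list.index(key)]
    return variable_part[pointer:pointer + count]


key_list = ["file", "index", "retrieval"]
postings = {"index": [3, 1], "retrieval": [1, 2], "file": [3]}
fixed_part, variable_part = build_index(key_list, postings)
print(fixed_part)
print(lookup("index", key_list, fixed_part, variable_part))
```

Because the fixed part follows the key-list order, the position of a key in the key list is enough to find its entry.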

[0026] The contents stored in the index access file 14 (hereinafter simply the index access file) have the structure shown in FIG. 3. As shown in FIG. 3(a), the index access file consists of as many entries as there are extracted keys, arranged in the same order as the keys in the key list holding means 4. Each entry therefore corresponds to an entry of the index and to one extracted key.

[0027] In addition to the number of document files (document identification numbers) in which the key appears, carried over from the index, each entry of the index access file 14 contains the index format and the address in the index file 15. As shown in FIG. 3(b), this embodiment uses three bits a, b, and c for the index format; the combination of these bits indicates the compression format of the document identification numbers stored in the index file 15.
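An index access file entry of this kind could be serialized as in the sketch below. The text does not give field widths or byte order, so the layout used here (a 4-byte count, one byte carrying bits a, b, and c, and a 4-byte address, little-endian) is purely an assumption, as are the function names.

```python
import struct

# Assumed layout: unsigned 32-bit count, one format byte, unsigned 32-bit address.
ENTRY = struct.Struct("<IBI")


def pack_entry(count, a, b, c, address):
    """Serialize one entry; bits a, b, c are packed in that order into one byte."""
    format_bits = (a << 2) | (b << 1) | c
    return ENTRY.pack(count, format_bits, address)


def unpack_entry(raw):
    """Return (count, a, b, c, address) from a serialized entry."""
    count, format_bits, address = ENTRY.unpack(raw)
    a = (format_bits >> 2) & 1
    b = (format_bits >> 1) & 1
    c = format_bits & 1
    return count, a, b, c, address


raw = pack_entry(count=7, a=1, b=0, c=0, address=128)
print(len(raw))
print(unpack_entry(raw))
```

Fixed-size entries let the entry for the n-th key be found by offset alone, which fits the statement that entries follow the key-list order.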

[0028] As shown in FIG. 3(c), this embodiment uses four compression formats, each indicated by a combination of bits a, b, and c. The bit pattern "001" (in the order a, b, c) means that the key is contained in every registered document (document identification number), so no document identification numbers are stored in the index file 15 for that key; that is, compression is achieved by not storing the document identification numbers. This format is adopted because, for such a key, the document identification numbers are not needed to distinguish registered documents by the presence or absence of the key.

[0029] The bit pattern "000" means that the identification numbers of the documents containing the key are stored in the index file 15 as their numeric values; that is, compression is achieved by storing, as binary-coded values, only the identification numbers of the documents that contain the key. This format is adopted when the documents containing the key are relatively few, fewer than half of all registered document identification numbers.

[0030] The bit pattern "010" means that, among all registered documents, the identification numbers of the documents that do not contain the key are stored in the index file 15 as their numeric values; that is, compression is achieved by storing, as binary-coded values, only the identification numbers of the documents that do not contain the key. This format is adopted when the documents containing the key are relatively many, more than half of all registered document identification numbers. The smaller set of document identification numbers, which is the complement of the set containing the key, is used to distinguish all registered documents by the presence or absence of the key.

[0031] The sequence "100" of bits a, b, and c indicates that "the identification numbers of those registered documents that contain the key are identified in bit-string notation and stored in the index file 15"; that is, instead of binary numbers, compression is achieved by using a string of as many bits as there are registered document identification numbers to mark only the documents that contain the key. This format is adopted when the number of identification numbers of the documents containing the key, compared with the total number of registered document identification numbers, satisfies the criteria described later.

[0032] The contents stored in the index file 15 described above (hereinafter simply called the index file) have the structure shown in FIG. 4. As shown in (a) of the figure, the index file consists of as many items as there are extracted keys, and the items are arranged in the same order as the keys in the key list holding means 4. Each item therefore corresponds to an item of the index and of the index access file, and to one of the extracted keys.

[0033] As shown in (a) of the figure, each item of the index file stores, in correspondence with the index, the identification numbers of the documents in which the key appears, gathered into a document identification number list. Each list is compressed in the compression format indicated by the index format described above, so the document identification numbers are stored in the format appropriate to each list. That is, when the sequence of bits a, b, and c is "001", no document identification number is stored for the key and the document identification number list is empty.

[0034] When the sequence of bits a, b, and c is "000", as shown in (b) of the figure, only the identification numbers of the hit documents, that is, the documents containing the key, are stored in the document identification number list as binary numbers. When the sequence is "010", as shown in (c) of the figure, only the identification numbers of the documents that are not hit, that is, the documents that do not contain the key, are stored in the document identification number list as binary numbers. When the sequence is "100", as shown in (d) of the figure, a bit string that marks only the identification numbers of the documents containing the key is stored.
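The four stored layouts can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `encode_list`, the use of Python lists in place of binary-coded values, and the convention that position j-1 of the bit list stands for document identification number j are all assumptions made for the example.

```python
def encode_list(fmt, hits, n):
    """Return what would be kept in the document identification number list.

    fmt  -- format bits a, b, c as a string: "001", "000", "010" or "100"
    hits -- identification numbers (1..n) of the documents containing the key
    n    -- total number of registered document identification numbers
    """
    hit_set = set(hits)
    if fmt == "001":   # key is in every document: nothing is stored
        return []
    if fmt == "000":   # only the documents that contain the key
        return sorted(hit_set)
    if fmt == "010":   # only the documents that do NOT contain the key
        return [doc for doc in range(1, n + 1) if doc not in hit_set]
    if fmt == "100":   # one bit per registered document, ON where the key appears
        return [1 if doc in hit_set else 0 for doc in range(1, n + 1)]
    raise ValueError(f"unknown format bits: {fmt}")
```

With four registered documents, `encode_list("010", [1, 2, 4], 4)` keeps only the single missing number 3, while `encode_list("100", [2, 4], 4)` keeps one flag per document.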

[0035] FIG. 5 (a), (b), and (c) show the index, the index access file, and the index file described above, respectively. The example assumes that the total number of document identification numbers is 20, that the identification numbers run from 1 to 20, and that the total number of keys is 5. Each item in the fixed part of the index in (a) of the figure stores the total number of identification numbers of the documents corresponding to the key and a pointer to the corresponding document identification number list stored in the variable-length part. For example, the first item of the fixed part has two document identification numbers for its key, and the document identification number list indicated by the pointer (ptr1) stores the document identification numbers "10" and "15".

[0036] Like the index, the index access file in (b) of the figure stores, for each key, the total number of identification numbers of the corresponding documents, together with the bit string indicating the compression format and the address in the index file. In the index file in (c) of the figure, the compressed document identification numbers are stored in the areas whose start addresses are the addresses stored in the index access file.

[0037] For example, in the first item of the index access file, the key has two document identification numbers, as in the index, and the area of the index file indicated by the address (addr1) stores, in accordance with compression format "000", only the identification numbers "10" and "15" of the documents containing the key, as binary numbers. The address addr1 is "0" because it is the start address of the whole area. In the second item of the index access file, the address field is empty because, in accordance with compression format "001", no document identification number is stored in the index file. In the third item of the index access file, the document identification numbers are stored in the index file as a bit string, in accordance with compression format "100".
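The first two items of this example can be modelled as a small data structure. The sketch below is only illustrative: the class name `AccessEntry`, its field names, and the choice of one byte per identification number are assumptions, and the remaining items of FIG. 5 are omitted because their values are not given in the text. The count of 20 for the second item follows from format "001" with 20 registered documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessEntry:
    count: int               # total number of document identification numbers for the key
    fmt: str                 # compression format bits a, b, c
    address: Optional[int]   # start address in the index file, None when nothing is stored


N = 20  # total number of registered document identification numbers in the example

entries = [
    AccessEntry(count=2, fmt="000", address=0),     # first item: documents 10 and 15
    AccessEntry(count=N, fmt="001", address=None),  # second item: key in every document
]

# Assumption: each identification number occupies one byte in the index file.
index_file = bytes([10, 15])

# Read back the list of the first item from its start address.
first = entries[0]
stored = list(index_file[first.address:first.address + first.count])
```

Here `stored` holds the two identification numbers of the first item, read from address 0.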

[0038] The criteria for selecting the compression format of the document identification numbers are as follows. Let Li be the total number of identification numbers of the documents containing the i-th key, let D bytes be the size required per document identification number when it is written as a numerical value, let K bytes be the size required when the identification numbers are written as a bit string, and let N be the total number of registered document identification numbers. With 8 bits per byte, K is N/8 bytes when N is divisible by 8, and one byte more than the integer quotient of N/8 when N is not divisible by 8.
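The size K of the bit string can be computed as in the short sketch below. The function name `bit_string_bytes` is hypothetical, and the non-divisible case is read here as "integer quotient of N/8 plus one byte", which is an interpretation of the text.

```python
def bit_string_bytes(n):
    """Bytes K needed for a bit string with one bit per registered document."""
    if n % 8 == 0:
        return n // 8       # N divisible by 8
    return n // 8 + 1       # otherwise one extra byte for the remaining bits
```

For the 20 registered documents of the FIG. 5 example this gives K = 3 bytes.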

[0039] First, when Li equals N, the compression format is "001", so no document identification number is stored in the index file. Next, when Li·D is less than or equal to K, the compression format is "000", so the identification numbers of the registered documents containing the key are stored in the index file as numerical values, as shown in FIG. 4 (b).

[0040] Next, when (N−Li)·D is less than or equal to K, the compression format is "010", so the identification numbers of the registered documents that do not contain the key are stored in the index file as numerical values, as shown in FIG. 4 (c). Finally, when Li·D exceeds K and (N−Li)·D also exceeds K, the compression format is "100", so, as shown in FIG. 4 (d), the identification numbers of all registered documents are represented as a bit string in which the bits corresponding to the documents containing the key are ON and all other bits are OFF, and this bit string is stored in the index file.
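The selection criteria can be summarised in a few lines of Python. This is a sketch under stated assumptions: `select_format` and `bit_string_bytes` are hypothetical names, and the tests are applied in the order in which the text lists them.

```python
def bit_string_bytes(n):
    """Bytes K for a bit string with one bit per registered document."""
    return (n + 7) // 8


def select_format(li, n, d):
    """Choose the format bits for a key found in li of n documents.

    d is the number of bytes used to store one identification number
    as a numerical value.
    """
    k = bit_string_bytes(n)
    if li == n:
        return "001"   # key in every document: store nothing
    if li * d <= k:
        return "000"   # the hit list is no larger than the bit string
    if (n - li) * d <= k:
        return "010"   # the complement list is no larger than the bit string
    return "100"       # the bit string is the smallest representation
```

With N = 20 and D = 1 (so K = 3), a key found in 2 documents is stored as a hit list, a key found in 18 documents as a complement list, and a key found in 10 documents as a bit string.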

[0041] The main registration procedure of the document registration/retrieval device configured as above is described with reference to FIG. 6. First, the document input means 1 accesses the documents to be registered and reads all of them (step S1). The file list creating means 7 then lists the names of the documents read and assigns a unique document identification number to each name to create a file list (step S2). The file list is stored in the file list file 10 via the file list holding means 8 and the file list file access means 9.

[0042] After or concurrently with the creation of the file list, the key extracting means 2 analyzes the documents read, one after another, and extracts keys from each document (step S3), and the key list creating means 3 creates a key list in which the extracted keys are listed without duplication across all documents (step S4). The key list is stored in the key list file 6 via the key list holding means 4 and the key list file access means 5.

[0043] After or concurrently with the creation of the key list, the index creating means 11 creates, in key-list order, an index that lists each key together with the document identification numbers of the documents in which the key appears (step S5). The index is stored in the index access file 14 and the index file 15 via the index holding means 12 and the index file access means 13. This series of steps is repeated until no documents remain to be registered, and the processing ends (step S6). In this embodiment, document identification numbers are assigned in registration order as 1, 2, 3, and so on, so the identification number of the last registered document equals the total number of registered documents.
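Steps S1 to S5 can be sketched as follows. The sketch is illustrative only: `build_index` and `split_words` are hypothetical names, and splitting on whitespace merely stands in for the dictionary-based key extraction of the embodiment.

```python
def build_index(documents, extract_keys):
    """Build the file list, key list and index from (name, text) pairs.

    Documents receive identification numbers 1, 2, 3, ... in registration order.
    """
    file_list = {}   # document identification number -> document name
    key_list = []    # keys without duplication, in order of first appearance
    index = {}       # key -> ascending list of document identification numbers
    for doc_id, (name, text) in enumerate(documents, start=1):
        file_list[doc_id] = name
        for key in extract_keys(text):
            if key not in index:
                key_list.append(key)
                index[key] = []
            # Record each document only once per key.
            if not index[key] or index[key][-1] != doc_id:
                index[key].append(doc_id)
    return file_list, key_list, index


def split_words(text):
    # Stand-in key extractor (assumption): lower-cased whitespace-separated words.
    return text.lower().split()


docs = [
    ("a.txt", "index compression"),
    ("b.txt", "index search"),
    ("c.txt", "search index index"),
]
file_list, key_list, index = build_index(docs, split_words)
```

Because the numbers are assigned in registration order, the largest identification number in `file_list` equals the number of registered documents.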

[0044] FIG. 7 shows the detailed procedure for creating the index file in step S5. In the following example, N is the total number of identification numbers of all registered documents, M is the total number of extracted keys, Li is the number of identification numbers of the registered documents having the i-th key, and D bytes is the size in the file of the numerical value representing one document identification number. The index file access means 13 compresses the index shown in FIG. 2 (a) and (b) and stores it as the index access file and the index file shown in FIG. 3 and FIG. 4.

[0045] First, the size K bytes of the area needed to represent the identification numbers of all registered documents as a bit string is obtained (step S11), and the index access file and the index file are then created key by key. That is, the write start address in the index file and the total number Li of identification numbers of the documents having the i-th key are written to the index access file (steps S12 and S13). The write start address in the index file is initially 0 and, after each write, points to the next address to be written.

[0046] Next, it is checked whether the number Li of identification numbers of the registered documents having the i-th key equals the total number N of identification numbers of all registered documents (step S14). If Li equals N, only the index access file is written and no document identification number is written to the index file (step S15); that is, "001" is set in the "index format" field of the index access file. If Li and N are not equal, it is checked whether Li·D is less than or equal to K (step S16), which determines whether the identification numbers of the documents having the i-th key take less space as numerical values or as a bit string.

[0047] If Li·D is less than or equal to K, the identification numbers of the documents having the i-th key are written to the index file as numerical values (step S17), and "000" is written to the index format field of the index access file (step S18) to indicate that the identification numbers of the documents having the i-th key were written to the index file as numerical values. If Li·D exceeds K, it is checked whether (N−Li)·D, where N−Li is the total number of identification numbers of the documents that do not have the i-th key, is less than or equal to K (step S19). This determines which takes less space: the identification numbers of the documents without the key written as numerical values, or the identification numbers of the documents with the key written as a bit string.

[0048] If (N−Li)·D is less than or equal to K, the identification numbers of the documents that do not have the i-th key are written to the index file as numerical values (step S20), and "010" is written to the index format field of the index access file (step S21) to indicate that the identification numbers of the documents without the key were written to the index file as numerical values. If (N−Li)·D exceeds K, the presence or absence of the i-th key for the identification numbers of all registered documents is written to the index file as a bit string (step S22), and "100" is written to the index format field of the index access file (step S23) to indicate that the identification numbers of the documents having the i-th key were written to the index file as a bit string. This series of steps is repeated M times, once for each key, and the processing ends (step S24).
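The compression loop of steps S11 to S24 can be sketched in Python as below. Several details are assumptions made for the example and are not stated in the text: the function name `compress_index`, the tuple layout of an access entry, big-endian storage of each identification number in D bytes, the bit order of the bit string (the most significant bit of the first byte stands for document 1), and leaving the address empty (`None`) for format "001" as in the FIG. 5 example.

```python
def compress_index(index_lists, n, d):
    """Compress an index into index access file entries and index file bytes.

    index_lists -- one list of hit identification numbers per key, in key-list order
    n           -- total number of registered document identification numbers
    d           -- bytes used to store one identification number as a numerical value
    Returns (access_entries, index_file); each access entry is
    (count, format_bits, start_address).
    """
    k = (n + 7) // 8                          # S11: bytes needed for the bit string
    access_entries = []
    index_file = bytearray()
    for hits in index_lists:                  # repeated once per key (S24)
        hit_set = set(hits)
        li = len(hit_set)
        address = len(index_file)             # S12: next free address, 0 at the start
        if li == n:                           # S14, S15: nothing is written
            fmt, payload, address = "001", b"", None
        elif li * d <= k:                     # S16 to S18: hit list as numbers
            fmt = "000"
            payload = b"".join(doc.to_bytes(d, "big") for doc in sorted(hit_set))
        elif (n - li) * d <= k:               # S19 to S21: complement list as numbers
            fmt = "010"
            misses = [doc for doc in range(1, n + 1) if doc not in hit_set]
            payload = b"".join(doc.to_bytes(d, "big") for doc in misses)
        else:                                 # S22, S23: bit string
            fmt = "100"
            bits = bytearray(k)
            for doc in hit_set:
                bits[(doc - 1) // 8] |= 0x80 >> ((doc - 1) % 8)
            payload = bytes(bits)
        access_entries.append((li, fmt, address))
        index_file += payload
    return access_entries, bytes(index_file)
```

With N = 20 and D = 1, a key found in documents 10 and 15 produces the entry `(2, "000", 0)` and the two bytes `0x0a 0x0f` at address 0, which matches the first item described for FIG. 5 under the one-byte assumption.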

[0049] The decisions in steps S14, S16, and S19 may be made in a different order, which can be chosen freely in implementation. Through this series of steps, for example, the index access file shown in FIG. 5 (b) and the index file shown in FIG. 5 (c) are created from the index shown in FIG. 5 (a), and the document identification numbers are stored in the index file compressed by a rational and simple method.

[0050] FIG. 8 shows the procedure, performed at search time, by which the index file access means 13 decompresses the index access file and the index file to reproduce the index. In this procedure, the fixed part of the index is written using information extracted from the index access file and the index file, and the document identification numbers stored in the index file are decompressed according to the index format and written to the variable-length part of the index. The area for the fixed part of the index is created in the index holding means 12 in advance, for example by obtaining the total number of keys M from the key list file 6.

[0051] First, the total number Li of identification numbers of the documents having the i-th key and the index format, which is the compression format used in the index file, are read from the index access file (steps S31 and S32). It is then checked whether the index format is "001" (step S33). If so, that is, if all registered documents have the i-th key, an area for a document identification number list covering the total number of identification numbers of all registered documents is secured in the index holding means 12. Numerical values are written to the list so that the first identification number is "1" and the last is "N", the total number of identification numbers of all registered documents, and the number of document identification numbers having the i-th key (the number of registered document files) and a pointer to the document identification number list are written to the fixed part of the index (step S34).

[0052] If the index format is not "001", that is, if not all registered documents have the i-th key, the address in the index file is read from the index access file to obtain the start address, in preparation for reading the document identification numbers from the index file (step S35). Next, it is checked whether the index format is "000", that is, whether the identification numbers of the documents having the i-th key were written as numerical values (step S36). If they were, an area for a document identification number list of Li entries, the total number of identification numbers of the documents having the i-th key, is secured in the index holding means 12; the numerical values of the document identification numbers are read from the index file and written to the list; and the total number Li of document identification numbers having the i-th key and a pointer to the document identification number list are written to the fixed part of the index (step S37).

[0053] If the identification numbers of the documents having the i-th key were not written as numerical values, it is checked whether the index format is "010", that is, whether the identification numbers of the documents that do not have the i-th key were written as numerical values (step S38). If they were, an area for a document identification number list of Li entries, the total number of document identification numbers having the i-th key, is secured in the index holding means 12. The numerical values of the document identification numbers are read from the index file and converted to a bit string, each bit is inverted between "0" and "1", and the bit string is converted back to numerical values and written to the document identification number list. The total number Li of document identification numbers having the i-th key and a pointer to the document identification number list are then written to the fixed part of the index (step S39).

[0054] In all other cases, an area for a document identification number list of Li entries, the total number of identification numbers of the documents having the i-th key, is secured in the index holding means 12; the bit string of document identification numbers is read from the index file, converted to numerical values, and written to the document identification number list; and the total number Li of document identification numbers having the i-th key and a pointer to the document identification number list are written to the fixed part of the index (step S40). This series of steps is repeated M times, once for each key, and the processing ends (step S41).
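The decompression loop of steps S31 to S41 can be sketched as follows. As before, this is an illustration under stated assumptions: `decompress_index` and `read_numbers` are hypothetical names, access entries are `(count, format_bits, start_address)` tuples, identification numbers are stored big-endian in D bytes, and the most significant bit of the first byte of a bit string stands for document 1.

```python
def read_numbers(index_file, address, count, d):
    """Read `count` identification numbers of d bytes each, starting at `address`."""
    return [
        int.from_bytes(index_file[address + i * d:address + (i + 1) * d], "big")
        for i in range(count)
    ]


def decompress_index(access_entries, index_file, n, d):
    """Rebuild one ascending list of hit identification numbers per key."""
    index_lists = []
    for li, fmt, address in access_entries:              # S31, S32
        if fmt == "001":                                 # S33, S34: every document
            hits = list(range(1, n + 1))
        elif fmt == "000":                               # S36, S37: stored hit list
            hits = read_numbers(index_file, address, li, d)
        elif fmt == "010":                               # S38, S39: stored complement
            misses = read_numbers(index_file, address, n - li, d)
            bits = [0] * n
            for doc in misses:                           # numbers -> bit string
                bits[doc - 1] = 1
            inverted = [1 - bit for bit in bits]         # invert every bit
            hits = [i + 1 for i, bit in enumerate(inverted) if bit]
        else:                                            # S40: stored bit string
            hits = [
                doc for doc in range(1, n + 1)
                if index_file[address + (doc - 1) // 8] & (0x80 >> ((doc - 1) % 8))
            ]
        index_lists.append(hits)
    return index_lists
```

For format "010" the stored list holds N−Li numbers, so the count read from the index file is derived from Li and N rather than stored separately.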

[0055] The decisions in steps S36 and S38 may be made in a different order, and the processing may instead branch on each bit of the index format field, testing whether it is "0" or "1"; the procedure can be chosen freely in implementation. Through this series of steps, for example, the index shown in FIG. 5 (a) is created from the index access file shown in FIG. 5 (b) and the index file shown in FIG. 5 (c), and the document identification numbers are stored in the index decompressed by a rational and simple method.

[0056] In the present invention, the index need not be divided into a fixed part and a variable-length part; the two may be combined into a single index. The document identification numbers may be stored in the index in descending rather than ascending order, and they may be represented in the index as bit strings rather than numerical values. Although the above embodiment uses document identification numbers as the information about documents, document names, physical addresses of document files, and the like may be used instead. The information about documents also need not identify documents file by file, as document identification numbers do; it may be numbers or symbols that identify units within a document such as chapters, paragraphs, or sentences. In addition to outputting document names and the like as information about the retrieved documents, the sentences or paragraphs in which the search key was hit may also be displayed.

[0057] The index access file need not hold the index format indicating the compression method; instead, each time the index is decompressed into memory, the number of documents having the key may be compared with the total number of registered documents to expand the document identification numbers. Document identification numbers may also be assigned according to the registration date and time of the documents and need not be consecutive. The index access file and the index file that store the index may be placed on separate storage devices or in partitioned areas of a single storage device. The index may be saved to these files at various times: when operation of the document registration/retrieval device ends, when its registration processing ends, once a fixed time after operation starts, or each time a fixed interval elapses after operation starts.

【００５８】[0058]

[Effects of the Invention] As described above, according to the document registration/retrieval device of claim 1, the index is compressed and stored in a file in a format that is changed according to a comparison between the total number of pieces of information about the registered documents and the total number of pieces of information about the documents in which a key appears, and the compressed index is decompressed for use at search time. Simple and appropriate compression therefore makes the index compact and greatly reduces the file capacity required to store it, and fast search is achieved using an index restored by simple and fast decompression.

[0059] According to the document registration/retrieval devices of claims 2 to 4, the index can be compressed, and correspondingly decompressed, in a rational format based on the relationship between the total number of pieces of information about the registered documents and the total number of pieces of information about the documents in which a key appears, so that registration and retrieval can be carried out simply and quickly. In particular, if the information about documents is unified into numerical values or bit strings when it is expanded into the index, search becomes faster still. When documents that all contain the same boilerplate phrases, such as patent publications, are registered and retrieved, the required capacity may be half or less of the uncompressed capacity, which is a substantial benefit.
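The effect on file capacity can be illustrated with a small size calculation. The numbers below are purely hypothetical and are not measurements from the embodiment; `stored_bytes` is an assumed helper that returns the index file bytes used for one key under the selection criteria, and the uncompressed size is taken to be Li·D bytes per key.

```python
def stored_bytes(li, n, d):
    """Index file bytes used for a key found in li of n documents."""
    k = (n + 7) // 8               # bytes for the bit string
    if li == n:
        return 0                   # format "001": nothing stored
    if li * d <= k:
        return li * d              # format "000": hit list
    if (n - li) * d <= k:
        return (n - li) * d        # format "010": complement list
    return k                       # format "100": bit string


n, d = 1000, 2                     # hypothetical: 1000 documents, 2-byte numbers
counts = [1000, 990, 500, 10]      # hypothetical hit counts for four keys

uncompressed = sum(li * d for li in counts)
compressed = sum(stored_bytes(li, n, d) for li in counts)
```

In this made-up case the keys that occur in all or nearly all documents, such as boilerplate phrases, account for most of the saving.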

[Brief description of drawings]

[FIG. 1] A block diagram showing the configuration of a document registration/retrieval device according to an embodiment of the present invention.

[FIG. 2] A conceptual diagram showing the structure of an index according to an embodiment of the present invention.

[FIG. 3] A conceptual diagram showing the structure of an index access file according to an embodiment of the present invention.

[FIG. 4] A conceptual diagram showing the structure of an index file according to an embodiment of the present invention.

[FIG. 5] A conceptual diagram showing the structures of an index, an index access file, and an index file according to an embodiment of the present invention.

[FIG. 6] A flowchart showing the procedure of index creation processing according to an embodiment of the present invention.

[FIG. 7] A flowchart showing the procedure of index compression processing according to an embodiment of the present invention.

[FIG. 8] A flowchart showing the procedure of index decompression processing according to an embodiment of the present invention.

[Explanation of symbols]

1: document input means; 2: key extracting means; 11: index creating means; 12: index holding means; 13: index file access means (index decompression means, index compression means); 14: index access file; 15: index file; 16: search key input means; 17: search means; 18: search result display means

Continued from front page

(72) Inventor: Akio Yamashita, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Kazuo Aihara, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Tatsuomi Kita, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Hiroshi Yamaguchi, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Shinji Kawamoto, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Naomi Hiraoka, c/o Fuji Xerox Co., Ltd., KSP R&D Business Park Building, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi, Kanagawa

Claims

[Claims]

1. A document registration/retrieval device comprising: document input means for reading a document to be registered; key extracting means for extracting a key from the input document; index creating means for creating an index in which the extracted key is associated with information about a document for specifying the registered document; search key input means for inputting a search key for searching the registered documents; search means for searching the index using the search key; and output means for outputting a search result, the device further comprising: index compression means for compressing the index by changing the format of the information about the documents included in the index based on the total number of registered documents and the total number of documents in which the key appears; file means for storing the compressed index; index decompression means for decompressing, at the time of search processing, the information about the documents from the information stored in the file means, based on the corresponding compression format; and index holding means for holding the index in which the information about the documents has been decompressed, wherein the index is compressed at the time of registration and decompressed at the time of retrieval.

2. The document registration/retrieval device according to claim 1, wherein, when the total number of items of information about the registered documents is equal to the total number of items of information about the documents in which the key appears, the index compression means compresses the index in a format that deletes the information about the documents.

3. The document registration/retrieval device according to claim 1 or 2, wherein the index compression means compares the total number of items of information about the registered documents with the total number of items of information about the documents in which the key appears, and, when the total number of items of information about documents in which the key does not appear is smaller than the total number of items of information about documents in which the key appears, compresses the index in a format that rewrites the index using the information about the documents in which the key does not appear.

4. The document registration/retrieval device according to claim 1, wherein two formats are available: a format in which the information about each document is assigned to one bit of a bit string whose length equals the total number of items of information about the documents, and the information about each document is indicated by inverting the corresponding bit; and a format in which the information about each document is indicated by a unique binary number, and wherein, based on the comparison between the total number of items of information about the registered documents and the total number of items of information about the documents in which the key appears, the information about the documents in the index is represented in whichever of the two formats has the smaller amount of data.
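As an illustration only, the choice in claim 4 between a bit string and a list of binary numbers can be sketched as follows. The sketch assumes documents are numbered 0 to N-1 and that each binary number occupies a fixed width of just enough bits to distinguish N documents. The cost model and all names are hypothetical and are not taken from the claims.

```python
from typing import List, Set, Tuple, Union

Encoded = Tuple[str, Union[int, List[int]]]


def id_width(total_docs: int) -> int:
    """Bits needed for one document number in 0..total_docs-1 (assumed fixed width)."""
    return max(1, (total_docs - 1).bit_length())


def encode(doc_ids: Set[int], total_docs: int) -> Encoded:
    """Pick the bit string or the list of binary numbers, whichever is smaller."""
    bitmap_bits = total_docs                         # one bit per registered document
    list_bits = len(doc_ids) * id_width(total_docs)  # one number per hit
    if list_bits < bitmap_bits:
        return ("ids", sorted(doc_ids))
    bits = 0
    for d in doc_ids:
        bits |= 1 << d  # invert the bit assigned to document d
    return ("bitmap", bits)


def decode(entry: Encoded, total_docs: int) -> Set[int]:
    """Expand either format back to the set of document numbers."""
    tag, payload = entry
    if tag == "ids":
        return set(payload)
    return {d for d in range(total_docs) if (payload >> d) & 1}
```

Under this cost model, a key that appears in only a few of many registered documents is kept as a short list of numbers, while a key that appears in a large share of them is kept as a bit string.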