JP2001022775A

JP2001022775A - Information retrieval device, information compressing method for information retrieval device, and recording medium

Info

Publication number: JP2001022775A
Application number: JP11194740A
Authority: JP
Inventors: Masao Ito; 正雄伊藤
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1999-07-08
Filing date: 1999-07-08
Publication date: 2001-01-26

Abstract

PROBLEM TO BE SOLVED: To reduce the size of an index file by compressing index information more efficiently, and to enable high-speed retrieval using the index file. SOLUTION: When identical index information entries occur consecutively, an index information compression unit 107 deletes all but one of them. For index information of the same character chain, n consecutive index information entries having the same document number are treated as one group, and compressed index information is generated in which the document number is compressed into one. When the byte data of the position numbers are not larger than a reference value (0x0f), the low-order four bits of each byte of the position numbers are concatenated to generate compressed index information in which the position number is converted into 1-byte data.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an information retrieval device, an information compression method for the information retrieval device, and a recording medium storing a program for executing the information compression method. In particular, it relates to an information retrieval device that uses an index to search for documents containing a specified character string, an information compression method for such a device, and a recording medium.

[0002]

[Prior Art] With the spread of word processors and personal computers in recent years, large volumes of electronic document data have accumulated, and document databases that retrieve document data as needed are increasingly being put to practical use. Full-text search, which searches the content of documents without attaching keywords to the document data, has also attracted attention recently and has begun to be used in offices and elsewhere. In this full-text search method, an index file based on character chains is created in advance, and high-speed search is achieved at search time by reading the index information for each character chain from the index file.

[0003] A conventional information retrieval device using such a full-text search method is shown, for example, in FIG. 18. FIG. 18 is a configuration diagram of a conventional information retrieval device. In the figure, the conventional information retrieval device comprises a terminal 1101, a document input unit 1102, a search condition input unit 1103, a result output unit 1104, an overall control unit 1105, an index creation unit 1106, a bibliographic/entity registration unit 1107, a search unit 1108, a document number storage unit 1109, a bibliography acquisition unit 1110, an entity acquisition unit 1111, an index position storage unit 1112, an index information storage unit 1113, a bibliographic information storage unit 1114, and an entity information storage unit 1115.

[0004] In the conventional information retrieval device 1100, when a document is registered, the document input from the terminal 1101 is first converted by the document input unit 1102 into a format that the index creation unit 1106 can handle, and is then sent to the overall control unit 1105. The format conversion here is a conversion of the document's format for the purpose of index creation, for example separating the bibliographic information from the entity information of the document by extracting the bibliographic information from the document.

[0005] Next, the overall control unit 1105 sends the format-converted document data to the index creation unit 1106. The index creation unit 1106 creates index information for each character chain from the format-converted document data, using two-character chains as described in, for example, Japanese Patent Application Laid-Open No. 6-52226. The index information is a set of position information that specifies the positions of a character chain in the document data. More specifically, the position information consists of a document number that identifies the document and a position number that identifies where in that document the character chain is located. The index information created by the index creation unit 1106 is stored in the index information storage unit 1113 and forms the index file. At this time, for each character chain, head position information indicating the starting position in the index file of the index information for that character chain and length information (number of bytes) indicating the data length of that index information are associated with each other and stored in the index position storage unit 1112.
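
The index layout described in this paragraph can be sketched roughly as follows. This is only an illustrative sketch, not the implementation in the cited publication: the function names `build_index` and `serialize`, the 0-based position numbering, and the big-endian 2-byte fields are all assumptions made for the example.

```python
import struct
from collections import defaultdict


def build_index(documents):
    """Build index information for every two-character chain.

    documents: dict mapping document number -> text (hypothetical input).
    Returns a dict mapping chain -> list of (document number, position number),
    in order of appearance. Positions are 0-based here (an assumption).
    """
    index = defaultdict(list)
    for doc_no in sorted(documents):
        text = documents[doc_no]
        for pos_no in range(len(text) - 1):
            index[text[pos_no:pos_no + 2]].append((doc_no, pos_no))
    return dict(index)


def serialize(index):
    """Write the index file and the per-chain (head position, length) table.

    Each entry is stored as a 2-byte document number followed by a
    2-byte position number (big-endian is an assumption).
    """
    index_file = bytearray()
    position_table = {}
    for chain in sorted(index):
        head = len(index_file)
        for doc_no, pos_no in index[chain]:
            index_file += struct.pack(">HH", doc_no, pos_no)
        position_table[chain] = (head, len(index_file) - head)
    return bytes(index_file), position_table


index = build_index({1: "abcab", 2: "cab"})
index_file, position_table = serialize(index)
```

Here `position_table` plays the role of the index position storage unit 1112 and `index_file` that of the index file held in the index information storage unit 1113.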

[0006] The overall control unit 1105 also sends the format-converted document data to the bibliographic/entity registration unit 1107, which registers the bibliographic information and entity information of the document. The bibliographic/entity registration unit 1107 stores only the bibliographic information in the document data in the bibliographic information storage unit 1114, and stores only the entity information in the document data in the entity information storage unit 1115.

[0007] When documents are searched, on the other hand, the search condition entered at the terminal 1101 through the search condition input unit 1103 is sent to the search unit 1108 via the overall control unit 1105. The search unit 1108 creates character chain information from the search string in the search condition, and then obtains from the index position storage unit 1112 the head position information and length information associated with that character chain information. Next, it accesses the index information storage unit 1113 using the obtained head position information and length information, and obtains the corresponding index information from the index file. The search unit 1108 then performs the full-text search by checking whether the index information entries are consecutive, stores the document numbers and search expression numbers that match the search condition in the document number storage unit 1109, and sends the number of hits to the overall control unit 1105. The overall control unit 1105 then sends the number of hits to the result output unit 1104, and the number of hits is displayed on the terminal 1101 as the search result.
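
The consecutiveness check used for the full-text search can be illustrated with a minimal sketch. It assumes the index is already in memory as a dictionary from two-character chain to a list of (document number, position number) pairs; the function name `search` is hypothetical, and search expression numbers are omitted.

```python
def search(index, query):
    """Return the document numbers whose text contains `query`.

    index: dict mapping two-character chain -> list of
           (document number, position number) pairs (assumed layout).
    A hit requires that the chains of the query occur at consecutive
    positions in the same document.
    """
    if len(query) < 2:
        raise ValueError("query must contain at least two characters")
    chains = [query[i:i + 2] for i in range(len(query) - 1)]
    # Start from every occurrence of the first chain.
    candidates = set(index.get(chains[0], []))
    for offset, chain in enumerate(chains[1:], start=1):
        entries = set(index.get(chain, []))
        # Keep a candidate only if the next chain follows it directly.
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in entries}
    return sorted({d for d, _ in candidates})


example_index = {
    "ab": [(1, 0), (1, 3), (2, 1)],
    "bc": [(1, 1)],
    "ca": [(1, 2), (2, 0)],
}
print(search(example_index, "cab"))  # documents 1 and 2 both contain "cab"
```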

[0008]

[Problems to Be Solved by the Invention] As described above, the conventional information retrieval device 1100 achieves high-speed search by creating an index file from the document data using character chains at document registration time and reading the index information for each character chain from the index file at search time. For Japanese documents, however, the data size of the index file is about twice that of the document data, so the index information storage unit 1113 that holds the index file requires a large-capacity storage medium. That is, the data size of the document data is "number of characters in the document data × 2 bytes". The index, in contrast, is created using two-character chains, and the number of index entries per document is almost equal to the number of characters in the document, so if each index information entry consists of a 2-byte document number and a 2-byte position number, the data size of the index file is "number of characters × 4 bytes".
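
As a rough check of the size estimate above, the following sketch computes both figures for a single document. The function name `estimated_sizes` is hypothetical, and the count of "characters minus one" chains per document is an assumption consistent with the statement that the number of index entries is almost equal to the number of characters.

```python
def estimated_sizes(num_chars, bytes_per_char=2, bytes_per_entry=4):
    """Estimate document size and uncompressed index size in bytes.

    Assumes 2-byte characters and one 4-byte index entry
    (2-byte document number + 2-byte position number) per two-character chain.
    """
    document_bytes = num_chars * bytes_per_char
    index_entries = max(num_chars - 1, 0)
    index_bytes = index_entries * bytes_per_entry
    return document_bytes, index_bytes


print(estimated_sizes(10000))  # (20000, 39996): the index is about twice the document
```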

[0009] Furthermore, because searches were performed using an index file with a large data size, reading took a relatively long time, particularly when the index file was held on a recording medium with a slow read speed such as a CD-ROM, and as a result the search speed was slow.

[0010] Attempts have been made to compress the index file to solve these problems. However, because index information is binary data with little continuity, compressing it with an existing compression scheme such as UNIX "compress" reduces its size by only about 10%, which is of little benefit. Moreover, when a complex compression scheme is used, decompressing the compressed file takes a relatively long time and the search speed drops as a result. For these reasons, index files have generally not been compressed.

[0011] The present invention has been made in view of the above problems. Its object is to provide an information retrieval device, an information compression method for the information retrieval device, and a recording medium that compress index information more efficiently to reduce the size of the index file and that enable high-speed search using that index file.

[0012]

[Means for Solving the Problems] To solve the above problems, the information retrieval device according to claim 1 of the present invention is an information retrieval device having an index that identifies specific character chains in document data by position information, and comprises: index information creation means for creating, for each character chain appearing in the document data, index information including the position information; and index information compression means for creating, for the index information of the same character chain, compressed index information in which the position information is compressed according to the value of the position information.

[0013] In the information retrieval device according to claim 2, in the information retrieval device of claim 1, the index information compression means treats, for the index information of the same character chain, a plurality of index information entries having the same first position information (included in the position information) as one group, and creates compressed index information in which the first position information of the group is compressed into one.

[0014] In the information retrieval device according to claim 3, in the information retrieval device of claim 1 or 2, the index information compression means creates, for the index information or compressed index information of the same character chain, compressed index information in which second position information included in the position information is compressed to a smaller number of bits or bytes when the second position information is equal to or less than a predetermined reference value.

[0015] In the information retrieval device according to claim 4, in the information retrieval device of claim 2, the index information compression means obtains, for the compressed index information in which the first position information has been compressed into one, the difference between pieces of second position information included in the position information, and, when the difference is equal to or less than a predetermined reference value, compresses the second position information into a difference represented by a smaller number of bits or bytes.

[0016] In the information retrieval device according to claim 5, in the information retrieval device of claim 1, 2, 3, or 4, the index information includes a plurality of hierarchical pieces of position information.

[0017] In the information retrieval device according to claim 6, in the information retrieval device of claim 2, 3, or 4, the first position information is identification information of a document, and the second position information is position information within the document identified by the first position information.

[0018] The information retrieval device according to claim 7 is the information retrieval device of claim 1, 2, 3, 4, 5, or 6, further comprising: compressed index information storage means for holding the compressed index information compressed by the index information compression means; index decompression means for reading, based on a given search condition, the compressed index information from the compressed index information storage means and decompressing it into the original index information; and search means for performing a search based on the index information decompressed by the index decompression means.

[0019] The information compression method for an information retrieval device according to claim 8 is an information compression method for an information retrieval device having an index that identifies specific character chains in document data by position information, and comprises: an index information creation step of creating, for each character chain appearing in the document data, index information including the position information; and an index information compression step of creating, for the index information of the same character chain, compressed position information in which the position information is compressed according to the value of the position information.

[0020] In the information compression method according to claim 9, in the information compression method of claim 8, the index information compression step treats, for the index information of the same character chain, a plurality of index information entries having the same first position information (included in the position information) as one group, and creates compressed index information in which the first position information of the group is compressed into one.

[0021] In the information compression method according to claim 10, in the information compression method of claim 8 or 9, the index information compression step creates, for the index information or compressed index information of the same character chain, compressed position information in which second position information included in the position information is compressed to a smaller number of bits or bytes when the second position information is equal to or less than a predetermined reference value.

[0022] In the information compression method according to claim 11, in the information compression method of claim 9, the index information compression step obtains, for the compressed index information in which the first position information has been compressed into one, the difference between pieces of second position information included in the position information, and, when the difference is equal to or less than a predetermined reference value, compresses the second position information into a difference represented by a smaller number of bits or bytes.

[0023] In the information compression method according to claim 12, in the information compression method of claim 8, 9, 10, or 11, the index information includes a plurality of hierarchical pieces of position information.

[0024] In the information compression method according to claim 13, in the information compression method of claim 9, 10, or 11, the first position information is identification information of a document, and the second position information is position information within the document identified by the first position information.

[0025] Furthermore, the computer-readable recording medium according to claim 14 records the information compression method for an information retrieval device of claim 8, 9, 10, 11, 12, or 13 as a program for causing a computer to execute the method.

[0026] In the information retrieval device, the information compression method for the information retrieval device, and the recording medium according to the present invention, the index information creation means (index information creation step) creates index information including position information for each character chain appearing in the document data, and the index information compression means (index information compression step) creates, for the index information of the same character chain, compressed index information in which the position information is compressed according to its value. The position information included in the index information is often hierarchical, as in the information retrieval device of claim 5, the information compression method of claim 12, and the recording medium of claim 14. Typically, as in the information retrieval device of claim 6, the information compression method of claim 13, and the recording medium of claim 14, the first position information is the identification information of a document and the second position information is the position within the document identified by the first position information. Other examples of hierarchical position information include the numbers of divisions of a document such as sections and chapters, the pages of a document, and the paragraph numbers of a document.

[0027] In particular, in the information retrieval device of claim 2, the information compression method of claim 9, and the recording medium of claim 14, the index information compression means (index information compression step) treats, for the index information of the same character chain, the index information entries having the same first position information as one group, and creates compressed index information in which the first position information of the group is compressed into one. As noted above, the position information included in the index information is often hierarchical, so it is easy to imagine cases in which n index information entries share the same higher-level first position information. Treating these entries as one group and creating compressed index information in which the first position information is compressed into one therefore eliminates n−1 pieces of first position information, which compresses the index information and reduces the size of the index file. If the index information creation means (index information creation step) creates the index information in the order in which the character chains appear, index information entries with the same first position information appear consecutively, so the compression by the index information compression means (index information compression step) can be performed easily.
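
The grouping of entries that share the same first position information can be sketched as below, taking the document number as the first position information. The function names `group_by_document` and `ungroup` are hypothetical, and the sketch keeps the result as Python tuples; how the group size n is recorded in the stored format is not covered here.

```python
from itertools import groupby


def group_by_document(entries):
    """Collapse consecutive entries with the same document number into one group.

    entries: list of (document number, position number) pairs for one
             character chain, in order of appearance.
    Returns a list of (document number, [position numbers]) groups, so each
    group of n entries keeps one document number instead of n.
    """
    return [
        (doc_no, [pos_no for _, pos_no in group])
        for doc_no, group in groupby(entries, key=lambda entry: entry[0])
    ]


def ungroup(groups):
    """Expand the groups back into the original list of entries."""
    return [(doc_no, pos_no) for doc_no, positions in groups for pos_no in positions]


entries = [(1, 0), (1, 3), (1, 9), (2, 1)]
groups = group_by_document(entries)  # [(1, [0, 3, 9]), (2, [1])]
assert ungroup(groups) == entries
```

Because the entries are created in order of appearance, a single pass with `groupby` over consecutive entries is enough; no sorting is needed.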

[0028] Also in particular, in the information retrieval device of claim 3, the information compression method of claim 10, and the recording medium of claim 14, the index information compression means (index information compression step) creates, for the index information or compressed index information of the same character chain, compressed index information in which the second position information included in the position information is compressed to a smaller number of bits or bytes when the second position information is equal to or less than a predetermined reference value. For example, when the second position information is the position (order of appearance) within the document identified by the first position information, the number of bits or bytes can be reduced for entries whose order of appearance is equal to or less than the predetermined reference value, which compresses the index information and reduces the size of the index file. If the index information creation means (index information creation step) creates the index information in the order in which the character chains appear, index information entries whose second position information is equal to or less than the predetermined reference value are more likely to appear consecutively, so the compression by the index information compression means (index information compression step) can be performed easily.
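
One possible reading of this compression, following the abstract's example of a reference value of 0x0f, is sketched below for a 2-byte position number: when both of its bytes are at most 0x0f, their low-order four bits are joined into a single byte. The function names are hypothetical, and the sketch does not show how a reader of the stored data tells a 1-byte value from a 2-byte value; that marker is outside this section.

```python
REFERENCE_VALUE = 0x0F  # reference value taken from the abstract


def pack_position(pos_no):
    """Return the position number as 1 byte when possible, else as 2 bytes."""
    high, low = (pos_no >> 8) & 0xFF, pos_no & 0xFF
    if high <= REFERENCE_VALUE and low <= REFERENCE_VALUE:
        # Join the low-order four bits of each byte into one byte.
        return bytes([(high << 4) | low])
    return bytes([high, low])


def unpack_packed_position(packed_byte):
    """Restore a 2-byte position number from its 1-byte packed form."""
    return ((packed_byte >> 4) << 8) | (packed_byte & 0x0F)


packed = pack_position(0x0305)  # both bytes are <= 0x0f, so this packs to b"\x35"
assert unpack_packed_position(packed[0]) == 0x0305
```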

[0029] Also in particular, in the information retrieval device of claim 4, the information compression method of claim 11, and the recording medium of claim 14, the index information compression means (index information compression step) obtains, for the compressed index information in which the first position information has been compressed into one, the difference between pieces of second position information included in the position information, and, when the difference is equal to or less than a predetermined reference value, compresses the second position information into a difference represented by a smaller number of bits or bytes. For example, when the second position information is the position (order of appearance) within the document identified by the first position information, the number of bits or bytes can be reduced for entries that appear close together and whose difference is equal to or less than the predetermined reference value, which compresses the index information and reduces the size of the index file. If the index information creation means (index information creation step) creates the index information in the order in which the character chains appear, index information entries whose second position information differs by no more than the predetermined reference value are more likely to appear consecutively, so the compression by the index information compression means (index information compression step) can be performed easily.
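
The difference-based compression can be sketched as follows for the position numbers of one group (one document). The tagged tuples `("abs", …)` and `("diff", …)`, the function names, and the default reference value of 0xFF (a difference that fits in one byte) are assumptions for illustration; the actual stored encoding is not shown in this section.

```python
def encode_positions(positions, reference_value=0xFF):
    """Replace a position number by its difference from the previous one
    when that difference is at most `reference_value`.

    positions: ascending position numbers within one document.
    Returns a list of ("abs", value) or ("diff", delta) items.
    """
    encoded = []
    previous = None
    for pos_no in positions:
        if previous is not None and 0 <= pos_no - previous <= reference_value:
            encoded.append(("diff", pos_no - previous))  # small: fewer bits or bytes
        else:
            encoded.append(("abs", pos_no))  # keep the full position number
        previous = pos_no
    return encoded


def decode_positions(encoded):
    """Restore the original position numbers."""
    positions = []
    for kind, value in encoded:
        if kind == "diff":
            value = positions[-1] + value
        positions.append(value)
    return positions


encoded = encode_positions([3, 10, 300, 1000])
assert decode_positions(encoded) == [3, 10, 300, 1000]
```

The first item of a group is always stored as an absolute value, so decoding never needs information from outside the group.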

[0030] Compressing the index information as described above makes the index information (compressed index information) smaller in data size than uncompressed index information, so the area on the recording medium needed to hold the index information, that is, the size of the index file, can be reduced. In addition, to obtain the same information, less data is loaded from the recording medium into memory with compressed index information than with uncompressed index information, so the time to load into memory is shortened. Therefore, even when the compressed index information is held on a recording medium with a slow read speed such as a CD-ROM, the read time is shortened and high-speed search can be achieved.

[0031] Furthermore, in the information retrieval device of claim 7, the compressed index information storage means holds the compressed index information compressed by the index information compression means; the index decompression means reads, based on a given search condition, the compressed index information from the compressed index information storage means and decompresses it into the original index information; and the search means performs a search based on the index information decompressed by the index decompression means. Because the compression schemes used in the information retrieval devices of claims 1, 2, 3, 4, 5, and 6, the information compression methods of claims 8, 9, 10, 11, 12, and 13, and the recording medium of claim 14 are all simple, the decompression into the original index information by the index decompression means is also simple. As a result, the index information is compressed more efficiently to reduce the size of the index file, and searching with that index file shortens the search time.

[0032]

[Embodiments of the Invention] Embodiments of the information retrieval device, the information compression method for the information retrieval device, and the recording medium according to the present invention are described in detail below with reference to the drawings, in the order [First Embodiment], [Second Embodiment]. The description of each embodiment details the information retrieval device and its information compression method. Because the recording medium according to the present invention is a recording medium storing a program for executing the information compression method, its description is covered by the following description of the information compression method.

【００３３】〔第１の実施形態〕図１は本発明の第１の
実施形態に係る情報検索装置の構成図である。本実施形
態の情報検索装置１００は、文字の連鎖を利用した索引
ファイルを予め作成し、検索時には索引ファイルから文
字連鎖毎の索引情報を読み出すことによって高速な検索
を実現する全文検索方式の情報検索装置であって、図１
８の従来例の構成に対して、索引情報圧縮部１０７およ
び索引伸長部１１０を付加した構成である。[First Embodiment] FIG. 1 is a block diagram of an information retrieval apparatus according to a first embodiment of the present invention. The information retrieval apparatus 100 of this embodiment is a full-text search apparatus that creates, in advance, an index file based on character chains and achieves high-speed search by reading the index information for each character chain from the index file at search time. It adds an index information compression unit 107 and an index decompression unit 110 to the configuration of the conventional example shown in FIG. 18.

【００３４】図１において、本実施形態の情報検索装置
１００は、端末１０１、文書入力部１０２、検索条件入
力部１０３、結果出力部１０４、全体制御部１０５、索
引作成部（索引情報作成手段）１０６、索引情報圧縮部
（索引情報圧縮手段）１０７、書誌実体登録部１０８、
検索部（検索手段）１０９、索引伸長部（索引伸長手
段）１１０、文書番号記憶部１１１、書誌取得部１１
２、実体取得部１１３、索引位置格納部１１４、索引情
報格納部（圧縮索引情報記憶手段）１１５、書誌情報格
納部１１６および実体情報格納部１１７を備えて構成さ
れている。In FIG. 1, the information retrieval apparatus 100 of this embodiment comprises a terminal 101, a document input unit 102, a search condition input unit 103, a result output unit 104, an overall control unit 105, an index creation unit (index information creation means) 106, an index information compression unit (index information compression means) 107, a bibliographic entity registration unit 108, a search unit (search means) 109, an index decompression unit (index decompression means) 110, a document number storage unit 111, a bibliography acquisition unit 112, an entity acquisition unit 113, an index position storage unit 114, an index information storage unit (compressed index information storage means) 115, a bibliographic information storage unit 116, and an entity information storage unit 117.

【００３５】索引作成部１０６は、大量の文書データか
ら高速な全文検索を行うための索引を作成する。索引情
報圧縮部１０７は、索引作成部１０６で作成した索引を
圧縮して圧縮索引情報を作成する。索引情報格納部１１
５は、索引情報圧縮部１０７が作成した圧縮索引情報を
格納する。索引位置格納部１１４は、文字連鎖毎に索引
情報格納部１１５内の圧縮索引情報の先頭位置と長さを
格納する。また、書誌実体登録部１０８は、文書の書誌
情報と実体情報を登録する。書誌情報格納部１１６は書
誌情報を格納し、実体情報格納部１１７は実体情報を格
納する。The index creation unit 106 creates an index for performing high-speed full-text search over a large volume of document data. The index information compression unit 107 compresses the index created by the index creation unit 106 to produce compressed index information. The index information storage unit 115 stores the compressed index information produced by the index information compression unit 107. The index position storage unit 114 stores, for each character chain, the head position and length of the corresponding compressed index information in the index information storage unit 115. The bibliographic entity registration unit 108 registers the bibliographic information and entity information of a document. The bibliographic information storage unit 116 stores the bibliographic information, and the entity information storage unit 117 stores the entity information.

【００３６】また、索引伸長部１１０は、入力された検
索条件によって索引情報格納部１１５から圧縮索引情報
を読み出して元の索引に伸長する。検索部１０９は、索
引伸長部１１０によって伸長した索引から全文検索を行
う。文書番号記憶部１１１は、検索部１０９で検索条件
に一致した文書を格納する。書誌取得部１１２は、文書
番号記憶部１１１の検索条件に対応する書誌一覧を書誌
情報格納部１１６から取得する。実体取得部１１３は、
文書番号記憶部１１１の文書番号に対応する実体情報を
実体情報格納部１１７から取得する。さらに、全体制御
部１０５は、端末１０１からの要求に基づき各部に命令
を発行して結果を端末１０１に返す。The index decompression unit 110 reads the compressed index information from the index information storage unit 115 according to the entered search condition and decompresses it into the original index. The search unit 109 performs a full-text search on the index decompressed by the index decompression unit 110. The document number storage unit 111 stores the documents that the search unit 109 found to match the search condition. The bibliography acquisition unit 112 acquires, from the bibliographic information storage unit 116, the bibliographic list corresponding to the search condition held in the document number storage unit 111. The entity acquisition unit 113 acquires, from the entity information storage unit 117, the entity information corresponding to a document number in the document number storage unit 111. Finally, the overall control unit 105 issues commands to each unit based on requests from the terminal 101 and returns the results to the terminal 101.

【００３７】次に、本実施形態の情報検索装置１００に
おける文書の登録および検索時の動作について説明す
る。先ず、文書登録の際には、端末１０１から入力され
た文書を、文書入力部１０２により索引作成部１０６が
扱える形式にフォーマット変換した後、全体制御部１０
５に送る。ここで、フォーマット変換とは、例えば、文
書から書誌情報を取り出すことによって文書の実体情報
から書誌情報を分離するなどの、索引を作成するための
文書の形式変換のことである。Next, the operation of the information retrieval apparatus 100 of this embodiment during document registration and search is described. First, at document registration, a document entered from the terminal 101 is converted by the document input unit 102 into a format that the index creation unit 106 can handle, and is then sent to the overall control unit 105. Here, format conversion means converting the form of a document so that an index can be created, for example separating the bibliographic information from the entity information of the document by extracting the bibliographic information from the document.

【００３８】次に、全体制御部１０５は、このフォーマ
ット変換された文書データのテキスト部分を索引作成部
１０６に送る。索引作成部１０６では、２文字の文字連
鎖を利用して、フォーマット変換された文書データから
文字連鎖毎の索引情報を作成する。例えば、「実体と全
体」という文書データに対しては、「実体」，「体
と」，「と全」および「全体」の４つの索引が作成され
ることになる。Next, the overall control unit 105 sends the text portion of the format-converted document data to the index creation unit 106. Using two-character character chains, the index creation unit 106 creates index information for each character chain from the format-converted document data. For example, for the document data "実体と全体", four indexes are created: "実体", "体と", "と全", and "全体".
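
The two-character chaining described above can be illustrated with a minimal Python sketch. The function name `character_chains` and its parameter `n` are hypothetical and do not appear in the specification; the sketch only reproduces the worked example of the four indexes.

```python
def character_chains(text: str, n: int = 2) -> list[str]:
    """Return every run of n consecutive characters in text, in order.

    Illustrative helper only; the specification fixes n at 2.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


chains = character_chains("実体と全体")
print(chains)  # ['実体', '体と', 'と全', '全体']
```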

【００３９】索引情報は、文書データ中の文字連鎖の位
置を特定する位置情報の集合である。より具体的には、
文書を識別するための文書番号、並びに、文字連鎖がそ
の文書中のどこに位置するかを判別するための位置番号
が位置情報に該当し、それぞれ２バイトデータである。
例えば、上記例の文書データ「実体と全体」が文書番号
１番に存在するとき、索引「全体」のエントリは、４バ
イトデータ「（０ｘ００）（０ｘ０１）（０ｘ０１）
（０ｘ０２）」（ここで、「０ｘ」は１６進数であるこ
とを示す）として作成される。すなわち、１バイト目お
よび２バイト目のデータによって文書番号が示され、３
バイト目および４バイト目のデータによって位置番号が
示される。なお、位置番号の１バイト目および２バイト
目のデータは、それぞれ「全」および「体」という文字
の文書番号１番の文書中における出現順位を示すもので
ある。The index information is a set of position information that identifies the positions of a character chain in the document data. More specifically, the position information consists of a document number for identifying the document and a position number for determining where in that document the character chain is located, each of which is 2-byte data. For example, when the document data "実体と全体" of the above example is held as document number 1, the entry of the index "全体" is created as the 4-byte data "(0x00)(0x01)(0x01)(0x02)" (where "0x" denotes hexadecimal). That is, the first and second bytes indicate the document number, and the third and fourth bytes indicate the position number. The first and second bytes of the position number indicate the order of appearance, within the document with document number 1, of the characters "全" and "体", respectively.
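
As a rough sketch of the 4-byte entry layout, the following Python builds entries for every two-character chain of one document. It assumes that the document number is stored big-endian, which is consistent with "(0x00)(0x01)" for document number 1, and that each character's order of appearance fits in one byte. The name `build_entries` is hypothetical.

```python
from collections import Counter


def build_entries(doc_number: int, text: str) -> dict[str, list[bytes]]:
    """Map each 2-character chain to its list of 4-byte index entries.

    Bytes 1-2: document number (big-endian, an assumption).
    Bytes 3-4: order of appearance of the first and second character.
    """
    seen = Counter()   # how many times each character has occurred so far
    ranks = []         # ranks[i] is the order of appearance of text[i]
    for ch in text:
        seen[ch] += 1
        ranks.append(seen[ch])

    index: dict[str, list[bytes]] = {}
    for i in range(len(text) - 1):
        # bytes([...]) raises ValueError if a rank exceeds 255 (sketch limit).
        entry = doc_number.to_bytes(2, "big") + bytes([ranks[i], ranks[i + 1]])
        index.setdefault(text[i:i + 2], []).append(entry)
    return index


index = build_entries(1, "実体と全体")
print(index["全体"][0].hex(" "))  # 00 01 01 02
```

Here "体" occurs for the second time at the end of the string, which is why the last byte of the "全体" entry is 0x02.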

【００４０】次に、索引作成部１０６によって作成され
た索引情報は索引情報圧縮部１０７に送られる。索引情
報圧縮部１０７では、索引情報を後述する圧縮方式によ
って圧縮する。該圧縮された圧縮索引情報は索引情報格
納部１１５に格納されて、索引ファイルを形成すること
となる。またこの時、文字連鎖毎に、該文字連鎖に対応
する圧縮索引情報の索引ファイルにおける先頭位置を示
す先頭位置情報と、該圧縮索引情報のデータ長を示す長
さ情報（バイト数）が対応付けられて、索引位置格納部
１１４に格納される。Next, the index information created by the index creation unit 106 is sent to the index information compression unit 107, which compresses it using a compression method described later. The resulting compressed index information is stored in the index information storage unit 115 and forms the index file. At this time, for each character chain, head position information indicating where the corresponding compressed index information begins in the index file and length information (number of bytes) indicating the data length of that compressed index information are associated with the character chain and stored in the index position storage unit 114.

【００４１】また、全体制御部１０５は、文書の書誌情
報および実体情報を登録する書誌実体登録部１０８に
も、フォーマット変換された文書データを送る。書誌実
体登録部１０８では、文書データ中の書誌情報だけを書
誌情報格納部１１６に格納し、実体情報格納部１１７に
は文書データ中の実体情報だけを格納する。The overall control unit 105 also sends the format-converted document data to the bibliographic entity registration unit 108, which registers the bibliographic information and entity information of the document. The bibliographic entity registration unit 108 stores only the bibliographic information of the document data in the bibliographic information storage unit 116 and only the entity information of the document data in the entity information storage unit 117.

【００４２】一方、全文検索を行う際には、端末１０１
を用いて検索条件入力部１０３から入力された検索条件
を、全体制御部１０５を介して検索部１０９に送る。検
索部１０９は、検索条件の検索文字列から２文字の文字
連鎖情報を作成して、索引伸長部１１０に該文字連鎖情
報を送る。索引伸長部１１０では、索引位置格納部１１
４から該文字連鎖情報と対応付けられている先頭位置情
報および長さ情報を取得し、該取得した先頭位置情報お
よび長さ情報を用いて索引情報格納部１１５にアクセス
し、索引情報格納部１１５の索引ファイルから該当する
圧縮索引情報を取得する。そして、取得した圧縮索引情
報を元の索引情報に伸長して、該伸長された索引情報を
検索部１０９に送る。On the other hand, when a full-text search is performed, the search condition entered from the search condition input unit 103 using the terminal 101 is sent to the search unit 109 via the overall control unit 105. The search unit 109 creates two-character character chain information from the search string of the search condition and sends it to the index decompression unit 110. The index decompression unit 110 obtains from the index position storage unit 114 the head position information and length information associated with the character chain information, uses them to access the index information storage unit 115, and obtains the corresponding compressed index information from the index file. It then decompresses the obtained compressed index information into the original index information and sends the decompressed index information to the search unit 109.
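
The lookup by head position and length can be sketched as follows. This is only an illustration: a `bytes` object stands in for the index file held by the index information storage unit 115, a `dict` stands in for the index position storage unit 114, and the function name and sample values are hypothetical.

```python
def read_compressed(index_file: bytes,
                    positions: dict[str, tuple[int, int]],
                    chain: str) -> bytes:
    """Return the compressed index bytes stored for one character chain.

    positions maps a character chain to (head position, length in bytes).
    """
    head, length = positions[chain]
    return index_file[head:head + length]


# Made-up sample data: two chains stored back to back in the index file.
index_file = b"\xaa\xbb\xcc\xdd\xee"
positions = {"実体": (0, 2), "全体": (2, 3)}
print(read_compressed(index_file, positions, "全体").hex(" "))  # cc dd ee
```

Because the length is stored alongside the head position, only the bytes for the requested chain need to be read before decompression.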

【００４３】次に、検索部１０９では、索引情報から文
字連鎖の連続状況を判定して全文検索を行い、検索条件
に一致した文書番号と検索式番号を文書番号記憶部１１
１に格納し、検索件数を全体制御部１０５に送る。そし
て、全体制御部１０５が検索件数を結果出力部１０４に
送ることによって、端末１０１に検索件数が表示される
ことになる。Next, the search unit 109 performs the full-text search by determining from the index information whether the character chains occur consecutively, stores the document numbers that match the search condition together with a search expression number in the document number storage unit 111, and sends the number of hits to the overall control unit 105. The overall control unit 105 then sends the number of hits to the result output unit 104, so that the number of hits is displayed on the terminal 101.

【００４４】さらに、書誌情報の一覧または実体情報を
取得する場合には、端末１０１を用いて検索条件入力部
１０３から入力された検索式番号が、全体制御部１０５
を介して書誌取得部１１２または実体取得部１１３に送
られるので、書誌取得部１１２または実体取得部１１３
は、文書番号記憶部１１１に保持された検索式番号に対
応する文書番号を取得する。次に、得られた文書番号か
ら書誌情報一覧を取得する場合には、書誌取得部１１２
により書誌情報管理部１１６から書誌情報が読み出され
て全体制御部１０５に送られる。また、得られた文書番
号から実体情報を取得する場合には、実体取得部１１３
により実体情報格納部１１７から実体情報が読み出され
て全体制御部１０５に送られる。そして、全体制御部１
０５が書誌情報または実体情報を結果出力部１０４に送
ることによって、端末１０１に書誌情報または実体情報
が表示されることになる。Further, when a list of bibliographic information or entity information is to be acquired, a search expression number entered from the search condition input unit 103 using the terminal 101 is sent via the overall control unit 105 to the bibliography acquisition unit 112 or the entity acquisition unit 113, which obtains the document numbers corresponding to that search expression number from the document number storage unit 111. When a bibliographic information list is to be acquired from the obtained document numbers, the bibliography acquisition unit 112 reads the bibliographic information from the bibliographic information storage unit 116 and sends it to the overall control unit 105. When entity information is to be acquired from the obtained document numbers, the entity acquisition unit 113 reads the entity information from the entity information storage unit 117 and sends it to the overall control unit 105. The overall control unit 105 then sends the bibliographic information or entity information to the result output unit 104, so that it is displayed on the terminal 101.

【００４５】次に、本実施形態の情報検索装置１００の
索引情報圧縮部１０７が行う情報圧縮方法の実施例につ
いて、（第１実施例）、（第２実施例）、（第３実施
例）の順に図面を参照して詳細に説明する。Next, examples of the information compression method performed by the index information compression unit 107 of the information retrieval apparatus 100 of this embodiment will be described in detail with reference to the drawings, in the order of (First Example), (Second Example), and (Third Example).

【００４６】ここで、図２を参照して圧縮索引情報のデ
ータ形式について説明しておく。図２は、本実施形態の
情報検索装置１００における圧縮索引情報の一形式を示
す説明図である。同図に示すように、圧縮索引情報２
００は、例えば、該圧縮索引情報がどのような圧縮方式
で圧縮されたかを示す３ビットの圧縮方式ブロック２０
１と、該圧縮索引情報が有するエントリの数を示す５ビ
ットのエントリ個数ブロック２０２と、該エントリ個数
分のエントリとによって１つの圧縮索引情報のグループ
を形成している。すなわち、圧縮索引情報は、圧縮方式
ブロック２０１、エントリ個数ブロック２０２およびｎ
個（ｎ≧１）のエントリを１グループとし、その後には
別のグループの圧縮索引情報が連続して格納されてい
る。Here, the data format of the compressed index information is described with reference to FIG. 2. FIG. 2 is an explanatory diagram showing one format of the compressed index information in the information retrieval apparatus 100 of this embodiment. As shown in the figure, compressed index information 200 forms one group consisting of, for example, a 3-bit compression method block 201 indicating which compression method was used to compress the compressed index information, a 5-bit entry count block 202 indicating the number of entries the compressed index information contains, and that number of entries. That is, one group of compressed index information consists of the compression method block 201, the entry count block 202, and n entries (n ≧ 1), and is followed contiguously by the compressed index information of further groups.
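
The 3-bit and 5-bit blocks together occupy one byte. A minimal sketch of packing and unpacking such a group header is shown below. The bit order (compression method in the upper three bits) and storing the entry count directly, which limits a group to 31 entries, are assumptions of this sketch; the passage above does not specify either. The function names are hypothetical.

```python
def pack_header(method: int, count: int) -> int:
    """Pack a 3-bit compression method and a 5-bit entry count into one byte.

    Assumed layout: method in bits 7-5, count in bits 4-0.
    """
    if not 0 <= method <= 7:
        raise ValueError("compression method must fit in 3 bits")
    if not 1 <= count <= 31:
        raise ValueError("entry count must be 1..31 in this sketch")
    return (method << 5) | count


def unpack_header(byte: int) -> tuple[int, int]:
    """Inverse of pack_header: return (method, count)."""
    return byte >> 5, byte & 0x1F


header = pack_header(3, 5)
print(hex(header), unpack_header(header))  # 0x65 (3, 5)
```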

【００４７】（第１実施例）先ず、図３〜図７を参照し
て、第１実施例に係る索引情報の圧縮方法について説明
する。なお、図３は第１実施例に係る索引情報の圧縮方
法のメインルーチンを説明するフローチャートであり、
図４、図５および図６はエントリをキューに入れる処理
のサブルーチン、図７はキュー出力処理のサブルーチン
をそれぞれ説明するフローチャートである。(First Example) First, the index information compression method according to the first example is described with reference to FIGS. 3 to 7. FIG. 3 is a flowchart of the main routine of the index information compression method according to the first example; FIGS. 4, 5, and 6 are flowcharts of the subroutine that places entries in queues; and FIG. 7 is a flowchart of the queue output subroutine.

【００４８】先ず、図３において、ステップＳ３０１で
は、索引作成部１０６が作成した索引から１エントリ
（４バイトデータ）を読み出す。次に、ステップＳ３０
２では、索引から次の１エントリ（４バイトデータ）を
読み出す。次に、ステップＳ３０３では、図４、図５お
よび図６に示したエントリをキューに入れる処理のサブ
ルーチンに移行する。このエントリをキューに入れる処
理（ステップＳ３０３）は、ステップＳ３０４で次の１
エントリ（４バイトデータ）を読み出して、ステップＳ
３０５において読み出すエントリが無くなるまで繰り返
し行われる。First, in FIG. 3, step S301 reads one entry (4-byte data) from the index created by the index creation unit 106. Next, step S302 reads the next entry (4-byte data) from the index. Step S303 then transfers to the subroutine, shown in FIGS. 4, 5, and 6, that places an entry in a queue. This queueing process (step S303) is repeated, with step S304 reading the next entry (4-byte data) each time, until step S305 finds no more entries to read.

【００４９】エントリをキューに入れる処理のサブルー
チンでは、前回および今回と連続して読み出された２つ
のエントリを比較して、該エントリの４バイト全てを削
除可能であれば４バイト用キューに入れ、文書番号の２
バイトを削除可能であれば２バイト用キューに入れ、圧
縮不可能であれば０バイト用キューに入れるという具合
に、これら３種のキューにエントリを分配する。以下、
図４、図５および図６を参照して詳細に説明する。In the subroutine that places entries in queues, two consecutively read entries (the previous one and the current one) are compared, and entries are distributed among three queues: an entry goes to the 4-byte queue if all 4 bytes of the entry can be deleted, to the 2-byte queue if the 2 bytes of the document number can be deleted, and to the 0-byte queue if it cannot be compressed. The details are described below with reference to FIGS. 4, 5, and 6.

【００５０】まず、図４におけるステップＳ４０１で
は、（ステップＳ３０１で読み出した）前回のエントリ
と、（ステップＳ３０２で読み出した）今回のエントリ
との４バイトデータを比較して、２つのエントリの４バ
イトデータ全てが同一であるか否かを判別する。ステッ
プＳ４０１において、２つのエントリの４バイトデータ
全てが同一であれば図６におけるステップＳ４２１に進
み、そうでなければステップＳ４０２に進む。First, in step S401 of FIG. 4, the 4-byte data of the previous entry (read in step S301) and of the current entry (read in step S302) are compared to determine whether all 4 bytes of the two entries are identical. If they are, the process proceeds to step S421 in FIG. 6; otherwise it proceeds to step S402.

【００５１】次に、ステップＳ４０２では、前回のエン
トリと今回のエントリの文書番号（２バイトデータ）を
比較して、文書番号が同一であるか否かを判別する。ス
テップＳ４０２において、文書番号が同一であれば図５
におけるステップＳ４１１に進み、そうでなければステ
ップＳ４０３に進む。Next, in step S402, the document numbers (2-byte data) of the previous entry and the current entry are compared to determine whether they are identical. If they are, the process proceeds to step S411 in FIG. 5; otherwise it proceeds to step S403.

【００５２】つまり、前回および今回と連続して読み出
された２つのエントリを比較して、エントリの４バイト
データ全てが同一であれば一方のエントリが削除可能で
あるので、４バイト用キューへの分配に伴う各種処理を
行うサブルーチン（図６）に分岐し、また、文書番号の
２バイトデータが同一であれば一方のエントリの文書番
号が削除可能であるので、２バイト用キューへの分配に
伴う各種処理を行うサブルーチン（図５）に分岐し、さ
らに、前の２つの条件を何れも満たさない場合には圧縮
不可能であるので、ステップＳ４０３以下の０バイト用
キューへの分配に伴う各種処理を行っていくことにな
る。In other words, two consecutively read entries (the previous one and the current one) are compared. If all 4 bytes of the entries are identical, one of the entries can be deleted, so the process branches to the subroutine (FIG. 6) that handles distribution to the 4-byte queue. If the 2-byte document numbers are identical, the document number of one entry can be deleted, so the process branches to the subroutine (FIG. 5) that handles distribution to the 2-byte queue. If neither condition holds, the entry cannot be compressed, so the processing from step S403 onward, which handles distribution to the 0-byte queue, is performed.
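
The two comparisons of steps S401 and S402 amount to a three-way classification of each pair of consecutive entries. The sketch below covers only that branch decision, not the queue flushing of FIGS. 4 to 6. The function name `target_queue` is hypothetical.

```python
def target_queue(previous: bytes, current: bytes) -> int:
    """Return 4, 2 or 0: which queue handling the pair of entries selects.

    Each entry is 4 bytes: a 2-byte document number followed by a
    2-byte position number.
    """
    if previous == current:            # step S401: all 4 bytes identical
        return 4
    if previous[:2] == current[:2]:    # step S402: same document number
        return 2
    return 0                           # neither holds: not compressible


a = bytes([0x00, 0x01, 0x01, 0x02])
print(target_queue(a, a))                                  # 4
print(target_queue(a, bytes([0x00, 0x01, 0x03, 0x04])))    # 2
print(target_queue(a, bytes([0x00, 0x02, 0x01, 0x02])))    # 0
```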

【００５３】先ず、ステップＳ４０３以下の０バイトキ
ューへの分配に伴う各種処理を説明する。ステップＳ４
０３では２バイト用キューにデータがあるか否かを判別
する。ステップＳ４０３において、データがなければス
テップＳ４０５に進み、データがあればステップＳ４０
４に進んで前回のエントリを２バイト用キューに入れる
と共に、後述のキュー出力処理サブルーチン（図７）に
より該２バイト用キューを出力する。First, the processing from step S403 onward, which handles distribution to the 0-byte queue, is described. Step S403 determines whether the 2-byte queue contains data. If it does not, the process proceeds to step S405. If it does, the process proceeds to step S404, where the previous entry is placed in the 2-byte queue and the 2-byte queue is output by the queue output subroutine (FIG. 7) described later.

【００５４】次に、ステップＳ４０５では４バイト用キ
ューにデータがあるか否かを判別する。ステップＳ４０
５において、データがなければステップＳ４０７に進ん
で前回のエントリを０バイト用キューに入れ、また、デ
ータがあればステップＳ４０６に進んで前回のエントリ
を４バイト用キューに入れると共に、該４バイト用キュ
ーを出力する。なお、４バイト用キューの内容は、４バ
イトデータ全てを削除可能なエントリであることから、
出力によってキューの内容がクリアされるだけで、キュ
ーの内容が索引ファイルに出力されることはない。さら
に、ステップＳ４０８では、ステップＳ４０３およびＳ
４０５の判別結果にかかわらず、今回のエントリが最後
ならば今回のエントリを０バイト用キューに入れて、サ
ブルーチンを終了する。Next, step S405 determines whether the 4-byte queue contains data. If it does not, the process proceeds to step S407, where the previous entry is placed in the 0-byte queue. If it does, the process proceeds to step S406, where the previous entry is placed in the 4-byte queue and the 4-byte queue is output. Because the 4-byte queue holds entries whose 4 bytes can all be deleted, outputting it merely clears the queue; its contents are not written to the index file. Then, in step S408, regardless of the results of steps S403 and S405, if the current entry is the last one, it is placed in the 0-byte queue and the subroutine ends.

【００５５】なお、図４、図５および図６のアルゴリズ
ムでは、前回および今回と連続して読み出された２つの
エントリを比較して、基本的に前回のエントリを順次分
配していく手法を採っている。したがって、ステップＳ
４０３およびＳ４０５の判断において、２バイト用キュ
ーまたは４バイト用キューにデータがある場合には、連
続して２バイト用キューまたは４バイト用キューへのエ
ントリの分配が発生した時の最後のエントリに該前回の
エントリが該当していることを示すことから、該前回の
エントリを２バイト用キューまたは４バイト用キューに
分配する必要があるのである。また、ステップＳ４０８
において、今回のエントリが最後ならば０バイト用キュ
ーに入れておくのも、基本的に前回のエントリを順次分
配していく手法を採っていることによるものである。Note that the algorithm of FIGS. 4, 5, and 6 compares two consecutively read entries (the previous one and the current one) and basically distributes the previous entry each time. Therefore, if step S403 or S405 finds data in the 2-byte queue or the 4-byte queue, the previous entry is the last entry of a run of consecutive distributions to that queue, and so it must be distributed to the 2-byte queue or the 4-byte queue. Likewise, the current entry is placed in the 0-byte queue in step S408 when it is the last entry because the approach basically distributes the previous entry each time.

【００５６】次に、図４のステップＳ４０２において、
文書番号の２バイトデータが同一である時に行われる、
２バイト用キューへの分配に伴う各種処理を、図５を参
照して説明する。ステップＳ４１１では０バイト用キュ
ーにデータがあるか否かを判別する。ステップＳ４１１
において、データがなければステップＳ４１３に進み、
データがあればステップＳ４１２に進んで前回のエント
リを０バイト用キューに入れると共に、該０バイト用キ
ューを出力する。なお、０バイト用キューの内容は、削
除不可能なエントリであることから、出力によってキュ
ーの内容がクリアされると共に、キューの内容が図８に
示すテーブルの圧縮条件に応じた圧縮内容で索引ファイ
ルに出力される。Next, the processing that handles distribution to the 2-byte queue, performed when step S402 of FIG. 4 finds the 2-byte document numbers identical, is described with reference to FIG. 5. Step S411 determines whether the 0-byte queue contains data. If it does not, the process proceeds to step S413. If it does, the process proceeds to step S412, where the previous entry is placed in the 0-byte queue and the 0-byte queue is output. Because the 0-byte queue holds entries that cannot be deleted, outputting it clears the queue and writes its contents to the index file, encoded as specified for the matching compression condition in the table shown in FIG. 8.

【００５７】次に、ステップＳ４１３では４バイト用キ
ューにデータがあるか否かを判別する。ステップＳ４１
３において、データがなければステップＳ４１５に進ん
で前回のエントリを２バイト用キューに入れ、また、デ
ータがあればステップＳ４１４に進んで前回のエントリ
を４バイト用キューに入れると共に、該４バイト用キュ
ーを出力する。さらに、ステップＳ４１６では、ステッ
プＳ４１１およびＳ４１３の判別結果にかかわらず、今
回のエントリが最後ならば今回のエントリを２バイト用
キューに入れて、サブルーチンを終了する。Next, step S413 determines whether the 4-byte queue contains data. If it does not, the process proceeds to step S415, where the previous entry is placed in the 2-byte queue. If it does, the process proceeds to step S414, where the previous entry is placed in the 4-byte queue and the 4-byte queue is output. Then, in step S416, regardless of the results of steps S411 and S413, if the current entry is the last one, it is placed in the 2-byte queue and the subroutine ends.

【００５８】なお、図５の２バイト用キューへの分配で
は、文書番号の２バイトデータが同一であれば、必ず前
回のエントリを２バイト用キューに分配しておく必要が
あるので、ステップＳ４１１〜Ｓ４１５の処理は、図４
のステップＳ４０３〜Ｓ４０７とは異なる処理内容とな
っている。また、ステップＳ４１６において、今回のエ
ントリが最後ならば２バイト用キューに入れておくの
は、基本的に前回のエントリを順次分配していく手法を
採っていることによるものである。Note that in the distribution to the 2-byte queue of FIG. 5, the previous entry must always be distributed to the 2-byte queue whenever the 2-byte document numbers are identical, so the processing of steps S411 to S415 differs from that of steps S403 to S407 of FIG. 4. The current entry is placed in the 2-byte queue in step S416 when it is the last entry because the approach basically distributes the previous entry each time.

【００５９】次に、図４のステップＳ４０１において、
エントリの４バイトデータ全てが同一である時に行われ
る、４バイト用キューへの分配に伴う各種処理を、図６
を参照して説明する。ステップＳ４２１では０バイト用
キューにデータがあるか否かを判別する。ステップＳ４
２１において、データがなければステップＳ４２３に進
み、データがあればステップＳ４２２に進んで前回のエ
ントリを０バイト用キューに入れると共に、該０バイト
用キューを出力する。Next, the processing that handles distribution to the 4-byte queue, performed when step S401 of FIG. 4 finds all 4 bytes of the entries identical, is described with reference to FIG. 6. Step S421 determines whether the 0-byte queue contains data. If it does not, the process proceeds to step S423. If it does, the process proceeds to step S422, where the previous entry is placed in the 0-byte queue and the 0-byte queue is output.

【００６０】ステップＳ４２３では２バイト用キューに
データがあるか否かを判別する。ステップＳ４２３にお
いて、データがなければステップＳ４２５に進み、デー
タがあればステップＳ４２４に進んで前回のエントリを
２バイト用キューに入れると共に、後述のキュー出力処
理サブルーチン（図７）により該２バイト用キューを出
力する。Step S423 determines whether the 2-byte queue contains data. If it does not, the process proceeds to step S425. If it does, the process proceeds to step S424, where the previous entry is placed in the 2-byte queue and the 2-byte queue is output by the queue output subroutine (FIG. 7) described later.

【００６１】次に、ステップＳ４２５では前回のエント
リを４バイト用キューに入れ、さらにステップＳ４２６
では、ステップＳ４２１，Ｓ４２３の判別結果にかかわ
らず、今回のエントリが最後ならば今回のエントリを４
バイト用キューに入れて、サブルーチンを終了する。Next, in step S425, the previous entry is placed in the 4-byte queue. Then, in step S426, regardless of the results of steps S421 and S423, if the current entry is the last one, it is placed in the 4-byte queue and the subroutine ends.

【００６２】なお、ステップＳ４２１，Ｓ４２３の判断
による場合分けを行って、前回のエントリを０バイト用
キュー、２バイト用キューまたは４バイト用キューに分
配しているのは、そして、ステップＳ４２６において、
今回のエントリが最後ならば４バイト用キューに入れて
おくのは、基本的に前回のエントリを順次分配していく
手法を採っていることによるものである。Note that the previous entry is distributed to the 0-byte, 2-byte, or 4-byte queue according to the case distinctions of steps S421 and S423, and the current entry is placed in the 4-byte queue in step S426 when it is the last entry, because the approach basically distributes the previous entry each time.

【００６３】以上のようにして、エントリを０バイト用
キュー、２バイト用キューまたは４バイト用キューに分
配する処理（ステップＳ３０３）が、ステップＳ３０５
において読み出すエントリが無くなるまで繰り返し行わ
れると、次に、ステップＳ３０６に進んで０バイト用キ
ュー、２バイト用キューまたは４バイト用キューの何れ
かのキューにデータがあるかを判別する。ステップＳ３
０６において、各キューにデータがなければ圧縮処理を
終了し、データがあればステップＳ３０７に進んで、０
バイト用キュー、２バイト用キューまたは４バイト用キ
ューを出力する。なお、２バイト用キューの出力につい
ては、後述のキュー出力処理サブルーチン（図７）によ
り行われる。When the process of distributing entries to the 0-byte, 2-byte, or 4-byte queue (step S303) has been repeated as described above until step S305 finds no more entries to read, the process proceeds to step S306, which determines whether any of the 0-byte, 2-byte, and 4-byte queues contains data. If none does, the compression process ends. If any does, the process proceeds to step S307, where the 0-byte queue, 2-byte queue, or 4-byte queue is output. The 2-byte queue is output by the queue output subroutine (FIG. 7) described later.

【００６４】次に、図７のフローチャートを参照して、
キュー出力処理について説明する。先ず、ステップＳ７
０１では、２バイト用キューをキューＡとして、該キュ
ーＡから１エントリを読み出す。次に、ステップＳ７０
２では、位置番号の２バイトについて各バイトデータが
共に「０ｘ０ｆ」以下であるか否かを判別する。ステッ
プＳ７０２において、各バイトデータが共に「０ｘ０
ｆ」以下であればステップＳ７０３に進み、「０ｘ０
ｆ」より大きければステップＳ７０８に進む。Next, the queue output process is described with reference to the flowchart of FIG. 7. First, in step S701, the 2-byte queue is taken as queue A, and one entry is read from queue A. Next, step S702 determines whether both bytes of the 2-byte position number are "0x0f" or less. If both are "0x0f" or less, the process proceeds to step S703; if either is greater than "0x0f", it proceeds to step S708.

【００６５】ステップＳ７０８では、キューＡから読み
出したエントリをテンポラリとしての別のキューＢに入
れる。次に、ステップＳ７０９では、キューＡからさら
に１エントリを読み出す。次に、ステップＳ７１０で
は、キューＡが空であるか否かを判別し、空であればス
テップＳ７１１に進み、空でなければステップＳ７０２
に移る。ステップＳ７１１では、キューＢに登録された
エントリの圧縮方式とエントリの個数を索引ファイルに
出力する。次に、ステップＳ７１２では、キューＢから
エントリを読み出して、図８に示すテーブルの圧縮条件
に応じた圧縮内容で、位置番号を圧縮せずに索引ファイ
ルに出力し、圧縮索引情報のグループを形成する。In step S708, the entry read from queue A is placed in a separate, temporary queue B. Next, in step S709, one more entry is read from queue A. Step S710 then determines whether queue A is empty; if it is, the process proceeds to step S711, and if not, it returns to step S702. In step S711, the compression method and the number of the entries held in queue B are output to the index file. Then, in step S712, the entries are read from queue B and output to the index file without compressing the position numbers, encoded as specified for the matching compression condition in the table shown in FIG. 8, thereby forming a group of compressed index information.

【００６６】また、上述したように、位置番号の各バイ
トデータが共に「０ｘ０ｆ」以下であればステップＳ７
０３に進み、ステップＳ７０３では、ステップＳ７０１
においてキューＡから読み出したエントリをキューＢに
入れる。次に、ステップＳ７０４では、キューＡからさ
らに１エントリを読み出す。次に、ステップＳ７０５で
はキューＡが空であるかを判別し、空であればステップ
Ｓ７０６に進み、空でなければステップＳ７０２に移
る。As described above, if both bytes of the position number are "0x0f" or less, the process proceeds to step S703, where the entry read from queue A in step S701 is placed in queue B. Next, in step S704, one more entry is read from queue A. Step S705 then determines whether queue A is empty; if it is, the process proceeds to step S706, and if not, it returns to step S702.

【００６７】ステップＳ７０６では、キューＢに登録さ
れたエントリの圧縮方式とエントリの個数を索引ファイ
ルに出力する。次に、ステップＳ７０７では、キューＢ
からエントリを読み出して、図８に示すテーブルの圧縮
条件に応じた圧縮内容で、位置番号を圧縮して索引ファ
イルに出力し、圧縮索引情報のグループを形成する。こ
のようにして、ステップＳ７０７またはＳ７１２を終え
てキュー出力処理のサブルーチンを終了した後は、図４
または図６のサブルーチンに戻って処理を継続するか、
或いは図３のメインルーチンに戻って圧縮処理を終了す
る。In step S706, the compression method and the number of the entries held in queue B are output to the index file. Then, in step S707, the entries are read from queue B and output to the index file with the position numbers compressed, encoded as specified for the matching compression condition in the table shown in FIG. 8, thereby forming a group of compressed index information. After step S707 or S712 completes and the queue output subroutine ends, the process either returns to the subroutine of FIG. 4 or FIG. 6 and continues, or returns to the main routine of FIG. 3 and ends the compression process.

【００６８】なお、上述の図８に示したテーブルは、エ
ントリの比較およびエントリの位置番号の値に応じて、
８種に場合分けされた圧縮方式を示すものである。以下
に、各圧縮方式の圧縮条件および圧縮内容について説明
する。同図において、８０１は圧縮条件を示し、８０２
は圧縮条件８０１に一致した圧縮内容を示す。The table shown in FIG. 8 lists eight compression methods, distinguished by the result of comparing entries and by the values of the entries' position numbers. The compression condition and compression content of each method are described below. In the figure, 801 denotes the compression condition, and 802 denotes the compression content applied when the compression condition 801 is met.

【００６９】まず、圧縮方式０では、前のエントリを比
較したとき、文書番号の２バイトデータが互いに異な
り、位置番号の２バイトデータのどちらかが「０ｘ１
０」以上であることが圧縮条件である。このケースは０
バイト用キューに分配されているエントリが該当し、全
てのエントリを圧縮を行わずに出力する。First, the compression condition for compression method 0 is that, compared with the previous entry, the 2-byte document numbers differ, and either byte of the 2-byte position number is "0x10" or more. This case applies to entries distributed to the 0-byte queue, and every entry is output without compression.

【００７０】次に、圧縮方式１では、前のエントリと比
較したとき、文書番号の２バイトデータが互いに異な
り、位置番号の各バイトデータが共に「０ｘ０ｆ」以下
であることが圧縮条件である。このケースは０バイト用
キューに分配されているエントリが該当するが、位置番
号についての圧縮が可能なので、文書番号の２バイトデ
ータを出力し、位置番号の２バイトのうち各バイトデー
タの下位４ビットをつなぎ合わせた１バイトデータに変
換して出力する。Next, the compression condition for compression method 1 is that, compared with the previous entry, the 2-byte document numbers differ, and both bytes of the position number are "0x0f" or less. This case also applies to entries distributed to the 0-byte queue, but the position number can be compressed: the 2 bytes of the document number are output as they are, and the 2 bytes of the position number are converted into 1 byte by joining the lower 4 bits of each byte, and that byte is output.
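
Joining the lower 4 bits of each position byte into a single byte can be sketched as below. The nibble order (first position byte in the upper 4 bits) is an assumption, since the text only says the two nibbles are joined. `unpack_position` is an illustrative inverse that is not taken from this passage, and all function names are hypothetical.

```python
LIMIT = 0x0F  # reference value: a byte at or below this fits in 4 bits


def fits_in_nibbles(position: bytes) -> bool:
    """True if both bytes of the 2-byte position number are 0x0f or less."""
    return all(b <= LIMIT for b in position)


def pack_position(position: bytes) -> int:
    """Join the lower 4 bits of each position byte into one byte."""
    if len(position) != 2 or not fits_in_nibbles(position):
        raise ValueError("position number is not compressible")
    first, second = position
    return (first << 4) | second   # assumed order: first byte on top


def unpack_position(packed: int) -> bytes:
    """Illustrative inverse of pack_position."""
    return bytes([packed >> 4, packed & 0x0F])


packed = pack_position(bytes([0x01, 0x02]))
print(hex(packed), unpack_position(packed).hex(" "))  # 0x12 01 02
```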

【００７１】次に、圧縮方式２では、前に出力した圧縮
索引情報のグループと文書番号が同じで、且つ、前のエ
ントリと比較したとき、文書番号の２バイトデータが同
一で、位置番号の各バイトデータのどちらかが「０ｘ１
０」以上であることが圧縮条件である。このケースは２
バイト用キューに分配されているエントリが該当し、文
書番号の２バイトデータは前に出力した圧縮索引情報の
グループの文書番号を使用することができるので省略
し、位置番号の２バイトデータのみを出力する。Next, the compression condition for compression method 2 is that the document number is the same as that of the previously output group of compressed index information, that, compared with the previous entry, the 2-byte document numbers are identical, and that either byte of the position number is "0x10" or more. This case applies to entries distributed to the 2-byte queue. The 2-byte document number is omitted, because the document number of the previously output group of compressed index information can be used in its place, and only the 2 bytes of the position number are output.

[0072] Next, in compression method 3, the compression condition is that the document number is the same as that of the previously output group of compressed index information and, when an entry is compared with the previous entry, the 2-byte document numbers are identical and both bytes of the position number are "0x0f" or less. This case applies to entries distributed to the 2-byte queue. The 2-byte document number is omitted, because the document number of the previously output group of compressed index information can be used, and the 2-byte position number is converted into 1-byte data by joining the lower 4 bits of each of its bytes before being output.

[0073] Next, in compression method 4, the compression condition is that the document number differs from that of the previously output group of compressed index information and, when an entry is compared with the previous entry, the 2-byte document numbers are identical and at least one byte of the position number is "0x10" or greater. This case applies to entries distributed to the 2-byte queue. The first entry is output unchanged as 4-byte data. For the second and subsequent entries, the 2-byte document number is omitted, because the document number of the first entry can be used, and only the 2-byte position number is output.

[0074] Next, in compression method 5, the compression condition is that the document number differs from that of the previously output group of compressed index information and, when an entry is compared with the previous entry, the 2-byte document numbers are identical and both bytes of the position number are "0x0f" or less. This case applies to entries distributed to the 2-byte queue. The 2-byte document number is output for the first entry only, and for every entry the 2-byte position number is converted into 1-byte data by joining the lower 4 bits of each of its bytes before being output.

[0075] Next, in compression method 6, the compression condition is that, when an entry is compared with the previous entry, the 2-byte document number and the 2-byte position number are each identical. In this case only the first entry has been distributed to the 0-byte queue, and that entry is output without compression.

[0076] Further, in compression method 7, the compression condition is that, when an entry is compared with the previous entry, the 2-byte document number and the 2-byte position number are each identical, and both bytes of the position number are "0x0f" or less. In this case only the first entry has been distributed to the 0-byte queue, but because the position number can be compressed, the 2-byte document number is output and the 2-byte position number is converted into 1-byte data by joining the lower 4 bits of each of its bytes before being output.
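The eight compression conditions can be summarized as a small decision function. This is only a sketch of the classification as described in the text: the function name, the boolean parameters, and the order in which the conditions are tested are assumptions, and the actual flowcharts (FIGS. 3 to 7) decide the method through the queues rather than through a single call.

```python
def select_method(same_doc_as_prev_entry: bool,
                  same_position_as_prev_entry: bool,
                  same_doc_as_prev_group: bool,
                  position_packable: bool) -> int:
    """Return the compression method number (0-7) for an entry.

    position_packable means both bytes of the position number are 0x0f or less.
    """
    # Methods 6 and 7: the entry is identical to the previous one.
    if same_doc_as_prev_entry and same_position_as_prev_entry:
        return 7 if position_packable else 6
    # Methods 0 and 1: the document number differs from the previous entry.
    if not same_doc_as_prev_entry:
        return 1 if position_packable else 0
    # Methods 2 and 3: the document number can be taken from the previous group.
    if same_doc_as_prev_group:
        return 3 if position_packable else 2
    # Methods 4 and 5: a new run of entries sharing one document number.
    return 5 if position_packable else 4
```

Odd-numbered methods are the ones that also pack the position number into one byte, which is why the last bit of the method number tracks `position_packable` in this sketch.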

[0077] Next, the index information compression method of this embodiment is described concretely with reference to the specific example of index information shown in FIG. 9. FIG. 9 is an explanatory diagram illustrating a specific compression process for the index information compression method according to the first embodiment. In the figure, reference numeral 901 denotes the entries (index information) to be processed, and 902 denotes the compressible entries among the index information shown in 901. In 902, the document numbers of the entries of index k to index k+4 are all "(0x27)(0x7b)", and every byte of the position numbers of those entries is "0x0f" or less. Reference numeral 903 denotes the entries after position-number compression, and 904 denotes the resulting compressed index information.

[0078] The following describes, with reference to FIG. 10, the processing that results when the index information compression method of this embodiment (the flowcharts shown in FIGS. 3 to 7) is applied to the index information 901 shown in FIG. 9. First, the entry of index k-1 is read in step S301 of FIG. 3, and the entry of index k is read in step S302. Because the 4-byte data of these two entries differ, the subroutine of step S303 follows steps S401, S402, S403, S405, S407, and S408 of FIG. 4 and then proceeds to step S304 of the main routine of FIG. 3. As a result, the entry of index k-1 is registered in the 0-byte queue, as shown in FIG. 10.

[0079] Next, the entry of index k+1 is read in step S304. When the entries of index k and index k+1 are compared, their 4-byte data differ but their document numbers are identical, and at this point the entry of index k-1 is registered in the 0-byte queue. The subroutine of step S303 therefore follows steps S401 and S402 of FIG. 4 and steps S411, S412, S415, and S416 of FIG. 5. As a result, the entry of index k-1 held in the 0-byte queue is output by compression method 0, and the entry of index k is registered in the 2-byte queue.

[0080] Next, the entry of index k+2 is read in step S304, and the process moves from step S305 back to the subroutine of step S303. When the entries of index k+1 and index k+2 are compared, their 4-byte data differ but their document numbers are identical, so the subroutine follows steps S401 and S402 of FIG. 4 and steps S411, S413, S415, and S416 of FIG. 5. As a result, the entry of index k+1 is registered in the 2-byte queue in the same way as the entry of index k.

[0081] Next, the entry of index k+3 is read in step S304, and by following the same steps as before, the entry of index k+2 is also registered in the 2-byte queue. Then the entry of index k+4 is read in step S304, and by following the same steps as before, the entry of index k+3 is also registered in the 2-byte queue.

[0082] Next, the entry of index k+5 is read in step S304. When the entries of index k+4 and index k+5 are compared, their 4-byte data differ, and at this point the entries of index k to index k+3 are registered in the 2-byte queue. The subroutine of step S303 therefore follows steps S401, S402, S403, and S408 of FIG. 4. As a result, after the entry of index k+4 is registered in the 2-byte queue, the entries of index k to index k+4 registered in the 2-byte queue are output by compression method 5.
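The net effect of this walkthrough, where an isolated entry ends up in the 0-byte queue and a run of entries sharing a document number ends up in the 2-byte queue, can be approximated in a few lines. This sketch swaps the step-by-step flowchart logic for `itertools.groupby`; the function name, the tuple layout, and the sample values other than the document number 0x277b are hypothetical, and the handling of fully identical entries (methods 6 and 7) is deliberately left out.

```python
from itertools import groupby


def group_runs(entries):
    """Group consecutive (doc_number, position) entries by document number.

    A run of one entry is labelled "0-byte" and a longer run "2-byte",
    mirroring where the walkthrough places the entries.
    """
    runs = []
    for _doc, run in groupby(entries, key=lambda entry: entry[0]):
        run = list(run)
        queue = "2-byte" if len(run) > 1 else "0-byte"
        runs.append((queue, run))
    return runs


# Hypothetical data: index k-1, then index k..k+4 sharing 0x277b, then index k+5.
entries = [
    (0x1234, 0x0102),
    (0x277B, 0x0001), (0x277B, 0x0005), (0x277B, 0x0102),
    (0x277B, 0x020A), (0x277B, 0x030F),
    (0x3000, 0x0001),
]
for queue, run in group_runs(entries):
    print(queue, len(run))
```

With this sample the middle run of five entries is the one that would be handed to the 2-byte queue output processing.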

[0083] The output processing of the entries of index k to index k+4 registered in the 2-byte queue is now described with reference to FIG. 7. First, in step S701, the entry of index k is read from the 2-byte queue (queue A). Because both bytes of the position number of the entry of index k are "0x0f" or less, the process proceeds to step S504 and the entry is output to queue B. The entries of index k+1 to index k+4 likewise have position numbers whose bytes are all "0x0f" or less, so they are output to queue B in the same way. Next, in step S705, the 2-byte queue (queue A) is empty, so the process proceeds to step S706.

[0084] Next, in step S706, the compression condition of compression method 5 in FIG. 8 applies, so "5" is output to the compression method block 904a of the index file, and the number of entries held in queue B, "5", is output to the entry count block 904b. Then, in step S707, the entries are read from queue B, and each entry is compressed by compression method 5 and output.
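Since the compression method block is 3 bits wide and the entry count block is 5 bits wide, the two fit in a single header byte. The sketch below shows one way to pack them. The function names are hypothetical, and putting the method number in the upper 3 bits is an assumption; the text does not state the bit order.

```python
from typing import Tuple


def pack_header(method: int, count: int) -> int:
    """Pack a 3-bit compression method and a 5-bit entry count into one byte.

    Assumption: the method occupies the upper 3 bits.
    """
    if not 0 <= method <= 7:
        raise ValueError("method must fit in 3 bits")
    if not 0 <= count <= 31:
        raise ValueError("count must fit in 5 bits")
    return (method << 5) | count


def unpack_header(header: int) -> Tuple[int, int]:
    """Split a header byte back into (method, count)."""
    return header >> 5, header & 0x1F
```

A 5-bit count limits one group to 31 entries, so a longer run would presumably have to be split into several groups; the text does not describe how that case is handled.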

[0085] When the document number and position number of each entry are compressed in this way and the 3-bit compression method block 904a and the 5-bit entry count block 904b are added, a group of compressed index information totaling 8 bytes is formed, as shown in 904. In other words, in this specific example the index information for the entries of index k to index k+4, which requires 20 bytes without compression, was reduced to 8 bytes of compressed index information by the index information compression method of the first embodiment.
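The 8-byte total can be checked with a small encoder for a method 5 group: 1 header byte, 2 bytes for the shared document number, and 1 packed byte per entry. This is a sketch under stated assumptions. The function name and the position values are hypothetical (the actual values appear only in FIG. 9), and the header and nibble bit orders are assumed, not taken from the source.

```python
def encode_method5_group(doc_number: bytes, positions: list) -> bytes:
    """Encode a run of entries that share one document number (method 5).

    positions is a list of (high_byte, low_byte) pairs, each byte <= 0x0f.
    """
    if len(doc_number) != 2:
        raise ValueError("document number must be 2 bytes")
    if not 0 < len(positions) <= 31:
        raise ValueError("entry count must fit in 5 bits")
    out = bytearray([(5 << 5) | len(positions)])  # assumed header layout
    out += doc_number                              # written once for the group
    for high, low in positions:
        if high > 0x0F or low > 0x0F:
            raise ValueError("position bytes must be 0x0f or less")
        out.append((high << 4) | low)              # two nibbles in one byte
    return bytes(out)


doc = bytes([0x27, 0x7B])
# Hypothetical position numbers for index k .. k+4.
positions = [(0x00, 0x01), (0x00, 0x05), (0x01, 0x02), (0x02, 0x0A), (0x03, 0x0F)]
group = encode_method5_group(doc, positions)
print(len(group), "bytes instead of", 4 * len(positions))
```

The same five entries stored uncompressed take 5 x 4 = 20 bytes, which matches the comparison in the text.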

[0086] As described above, in the index information compression method of this embodiment, when consecutive entries of index information are identical, the index information compression step (FIGS. 3, 4, 5, 6, and 7) keeps only one entry and deletes the others, so the index information is compressed and the capacity of the index file is kept down. In addition, for index information of the same character chain, n consecutive pieces of index information with the same document number are treated as one group, and compressed index information is created in which the document number is reduced to a single instance. The data for n-1 document numbers can therefore be eliminated, which compresses the index information and keeps the capacity of the index file down. Furthermore, when each byte of a position number is at or below the reference value (0x0f), compressed index information is created whose position number has been converted into 1-byte data by joining the lower 4 bits of each byte. The number of bytes can therefore be reduced for entries whose character appearance order is at or below the reference value, which compresses the index information and keeps the capacity of the index file down.

[0087] (Second Embodiment) Next, the index information compression method according to the second embodiment is described with reference to FIGS. 3 to 6, 11, and 12. FIGS. 11 and 12 are flowcharts illustrating a subroutine of the index information compression method according to the second embodiment. The main routine of the index information compression processing shown in FIG. 3 and the subroutine shown in step S303 (FIGS. 4, 5, and 6) are the same as in the first embodiment, so their description is omitted.

[0088] The following describes, with reference to the flowcharts of FIGS. 11 and 12, the output processing for the 2-byte queue, which is branched to and executed from the queue output of the main routine (step S307) and the subroutine (FIGS. 4 and 6). First, in step S1101 of FIG. 11, one entry is read from queue A, which is the 2-byte queue. Next, in step S1102, the process proceeds to step S1201 of FIG. 12, where it is determined whether both bytes of the position number are "0x0f" or less. If both bytes are "0x0f" or less in step S1201, the process proceeds to step S1202; if they are larger than "0x0f", the process proceeds to step S1212.

[0089] In step S1202, it is determined whether queue C, a temporary queue, is empty. If queue C is not empty, step S1203 is executed before the process proceeds to step S1204; if queue C is empty, the process proceeds directly to step S1204. In step S1203, the compression method (see FIG. 8) and the number of the entries registered in queue C are output to the index file, and the entries are read from queue C and output to the index file without compressing their position numbers.

[0090] Next, in step S1204, regardless of the determination result of step S1202, that is, regardless of whether queue C is empty, the entry read from queue A in step S1101 is placed in queue B. Next, in step S1205, one more entry is read from queue A. Then, in step S1206, it is determined whether queue A is empty. If queue A is empty, the process proceeds to step S1103 shown in FIG. 11; if queue A is not empty, the process returns to step S1201.

[0091] As noted above, if the bytes of the position number are larger than "0x0f" in step S1201, the process proceeds to step S1212. In step S1212, it is determined whether queue B is empty. If queue B is not empty, step S1213 is executed before the process proceeds to step S1214; if queue B is empty, the process proceeds directly to step S1214. In step S1213, the compression method and the number of the entries registered in queue B are output to the index file, and the entries are read from queue B and output to the index file without compressing their position numbers.

[0092] Next, in step S1214, regardless of the determination result of step S1212, that is, regardless of whether queue C is empty, the entry read from queue A in step S1101 is output to queue C. Next, in step S1215, one more entry is read from queue A. Then, in step S1216, it is determined whether queue A is empty. If queue A is empty, the process proceeds to step S1103 shown in FIG. 11; if queue A is not empty, the process returns to step S1211.

[0093] Next, in step S1103 of FIG. 11, it is determined whether queue B is empty. If queue B is empty, the process proceeds to step S1104; if queue B is not empty, the process proceeds to step S1105. In step S1105, the compression method and the number of the entries registered in queue B are output to the index file, and the entries are read from queue B and output to the index file with their position numbers compressed. The process then either returns to the subroutine of FIG. 4 or FIG. 6 and continues, or returns to the main routine of FIG. 3 and ends the compression processing.

[0094] In step S1104, it is determined whether queue C is empty. If queue C is not empty, step S1106 is executed first; if queue C is empty, nothing further is output. In either case the process then either returns to the subroutine of FIG. 4 or FIG. 6 and continues, or returns to the main routine of FIG. 3 and ends the compression processing. In step S1106, the compression method and the number of the entries registered in queue C are output to the index file, and the entries are read from queue C and output to the index file with their position numbers compressed.
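The effect of the queue B and queue C handling, splitting a run of entries that share a document number into consecutive sub-groups whose position numbers either can or cannot be packed, can be approximated as below. This sketch replaces the two temporary queues with `itertools.groupby`, so it illustrates the outcome, not the flowchart itself; the function names and the sample positions are hypothetical.

```python
from itertools import groupby


def is_packable(position) -> bool:
    """True when both bytes of a (high_byte, low_byte) position are 0x0f or less."""
    high, low = position
    return high <= 0x0F and low <= 0x0F


def split_by_packability(positions):
    """Split one same-document run into consecutive packable / unpackable sub-runs."""
    return [(packable, list(run)) for packable, run in groupby(positions, key=is_packable)]


# Hypothetical run: three small position numbers followed by two large ones.
run = [(0x00, 0x01), (0x00, 0x05), (0x01, 0x02), (0x12, 0x34), (0x20, 0x01)]
for packable, sub_run in split_by_packability(run):
    print("packed" if packable else "raw", len(sub_run))
```

Each sub-run would then be written as its own group of compressed index information, with a header of its own.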

[0095] Next, the index information compression method of this embodiment is described concretely with reference to the specific example of index information shown in FIG. 13. FIG. 13 is an explanatory diagram illustrating a specific compression process for the index information compression method according to the second embodiment. In the figure, as in FIG. 9, reference numeral 1301 denotes the entries (index information) to be processed, and 1302 denotes the compressible entries among the index information shown in 1301. Reference numeral 1303 denotes the entries after position-number compression, and 1304 denotes the resulting compressed index information.

[0096] When the index information compression method of this embodiment (the flowcharts shown in FIGS. 3 to 6, 11, and 12) is applied to the index information 1301 of FIG. 13, the document numbers of the entries of index k to index k+4 are all the same, "(0x27)(0x7b)", so compression is first achieved by omitting the document numbers of the entries of index k+1 to index k+4. In addition, each byte of the position numbers of the entries of index k to index k+2 is "0x0f" or less, so converting each position number into 1-byte data by joining the lower 4 bits of each of its bytes saves 1 byte per entry.

[0097] On the other hand, the bytes of the position numbers of the entries of index k+3 and index k+4 are larger than "0x0f", so their position numbers cannot be compressed. Because the document numbers of the entries of index k to index k+4 are all identical, however, two consecutive groups of compressed index information are formed. That is, as shown in FIG. 13, compressed index information 1304 is obtained, consisting of a 6-byte group of 3 entries in which the entries of index k to index k+2 are compressed by compression method 5, and a 5-byte group of 2 entries in which the entries of index k+3 and index k+4 are compressed by compression method 2. As a result, the 20 bytes of uncompressed index information were reduced to 11 bytes of compressed index information. This is also 2 bytes smaller than compressed index information in which the document number is compressed but the position number is not (13 bytes) for the five entries of index k to index k+4.
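The byte counts quoted here can be reproduced with a short calculation. The sketch assumes that each group carries a 1-byte header (the 3-bit compression method plus the 5-bit entry count), which is consistent with the stated totals; the function name is hypothetical and only methods 2 to 5 are covered.

```python
def group_size(method: int, entries: int) -> int:
    """Size in bytes of one group of compressed index information (methods 2-5)."""
    if method not in (2, 3, 4, 5):
        raise ValueError("this sketch only covers methods 2 to 5")
    header = 1                                  # assumed: 3-bit method + 5-bit count
    doc_bytes = 2 if method in (4, 5) else 0    # methods 2 and 3 reuse the previous group's number
    pos_bytes = 1 if method in (3, 5) else 2    # odd methods pack the position into 1 byte
    return header + doc_bytes + entries * pos_bytes


uncompressed = 5 * 4                             # five 4-byte entries
split_groups = group_size(5, 3) + group_size(2, 2)
doc_only = group_size(4, 5)
print(uncompressed, split_groups, doc_only)
```

Under these assumptions the three figures come out as 20, 11, and 13 bytes, in line with the comparison in the text.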

[0098] As described above, in the index information compression method of this embodiment, as in the first embodiment, when consecutive entries of index information are identical, the index information compression step (FIGS. 3, 4, 5, 6, 11, and 12) keeps only one entry and deletes the others. In addition, for index information of the same character chain, n consecutive pieces of index information with the same document number are treated as one group, and compressed index information is created in which the document number is reduced to a single instance. The data for n-1 document numbers can therefore be eliminated, which compresses the index information and keeps the capacity of the index file down. Furthermore, when, within a group of n consecutive pieces of index information with the same document number, the bytes of the position numbers of some of the index information are at or below the reference value (0x0f), compressed index information is created in which, for that part of the index information only, the position number is converted into 1-byte data by joining the lower 4 bits of each of its bytes. The number of bytes can therefore be reduced for entries whose character appearance order is at or below the reference value, which compresses the index information and keeps the capacity of the index file down.

[0099] (Third Embodiment) Next, the index information compression method according to the third embodiment is described with reference to FIGS. 3 to 6 and FIG. 14. FIG. 14 is a flowchart illustrating a subroutine of the index information compression method according to the third embodiment. The main routine of the index information compression processing shown in FIG. 3 and the subroutine shown in step S303 (FIGS. 4, 5, and 6) are the same as in the first embodiment, so their description is omitted.

[0100] The following describes, with reference to the flowchart of FIG. 14, the output processing for the 2-byte queue, which is branched to and executed from the queue output of the main routine (step S307) and the subroutine (FIGS. 4 and 6). As a comparison with the flowchart of FIG. 7 (first embodiment) makes clear, the flowchart of FIG. 14 is obtained by inserting, between steps S701 and S702 of FIG. 7, steps S1402 and S1403 before the repeated loop and steps S1404 and S1405 inside the repeated loop.

[0101] In this embodiment, a group of n consecutive pieces of index information with the same document number is registered in the 2-byte queue and enters the flow of FIG. 14 as queue A. For the entries registered in that 2-byte queue, the difference between position numbers is obtained, and when each byte of the difference is at or below the reference value (0x0f), the 2-byte position number is compressed by converting it into 1-byte data that joins the lower 4 bits of each byte of the difference.

[0102] First, in step S1402, it is determined whether the entry read from queue A in step S1401 is the first entry. If it is the first entry, the process proceeds to step S1403, reads one more entry from queue A, and then proceeds to step S1404. If it is not the first entry, the process proceeds directly to step S1404.

[0103] Next, in step S1404, the difference between the position information of the preceding entry and that of the following entry is obtained, and the process proceeds to step S1405, where the position information of the following entry is replaced with the difference. Then, from step S1406 onward, as in FIG. 7 (first embodiment), when each byte of the difference is "0x0f" or less, the 2-byte position number is compressed by converting it into 1-byte data that joins the lower 4 bits of each byte of the difference.
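The differencing step can be sketched as follows. This is an illustration under assumptions: position numbers are treated as 16-bit integers in ascending order, the first position number is kept as is, the function names are hypothetical, and the sample values are invented, since the text does not list the actual numbers.

```python
from typing import List


def delta_encode(positions: List[int]) -> List[int]:
    """Keep the first position and replace each later one with its difference
    from the preceding position."""
    return positions[:1] + [later - earlier for earlier, later in zip(positions, positions[1:])]


def is_packable(value: int) -> bool:
    """True when both bytes of a 16-bit value are 0x0f or less."""
    return 0 <= value <= 0xFFFF and (value >> 8) <= 0x0F and (value & 0xFF) <= 0x0F


def pack(value: int) -> int:
    """Join the low 4 bits of each byte (high byte first, an assumed order)."""
    if not is_packable(value):
        raise ValueError("each byte must be 0x0f or less")
    return ((value >> 8) << 4) | (value & 0xFF)


# Hypothetical position numbers: the last three are too large to pack directly.
positions = [0x0005, 0x000E, 0x0018, 0x0025, 0x0030]
deltas = delta_encode(positions)
packed = bytes(pack(d) for d in deltas)
print([hex(d) for d in deltas], len(packed))
```

In this sample the raw values 0x0018, 0x0025, and 0x0030 fail the 0x0f test, but every difference passes it, which is the situation the third embodiment is designed to exploit.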

[0104] Next, the index information compression method of this embodiment is described concretely with reference to the specific example of index information shown in FIG. 15. FIG. 15 is an explanatory diagram illustrating a specific compression process for the index information compression method according to the third embodiment. In the figure, as in FIG. 9, reference numeral 1501 denotes the entries (index information) to be processed, and 1502 denotes the compressible entries among the index information shown in 1501. Reference numeral 1503 denotes the entries whose position numbers have been compressed by differencing, and 1504 denotes the resulting compressed index information.

[0105] When the index information compression method of this embodiment (the flowcharts shown in FIGS. 3 to 6 and FIG. 14) is applied to the index information 1501 of FIG. 15, the document numbers of the entries of index k to index k+4 are all the same, "(0x27)(0x7b)", so compression is first achieved by omitting the document numbers of the entries of index k+1 to index k+4.

[0106] The byte data of the position numbers of the entries of index k+2 to index k+4 include values larger than "0x0f". However, when the 2-byte position number of each entry of index k+1 to index k+4 is differenced against the position number of the immediately preceding entry, every byte of each difference is "0x0f" or less. Joining the lower 4 bits of each byte of the position-number difference to form a 1-byte difference therefore saves 1 byte per entry. As a result, the 20 bytes of uncompressed index information were reduced to 8 bytes of compressed index information. This is 5 bytes smaller than compressed index information in which the document number is compressed but the position number is not (13 bytes) for the five entries of index k to index k+4, and 4 bytes smaller than the result of the compression method of the second embodiment (12 bytes).

[0107] As described above, in the index information compression method of this example, the index information compression step (FIGS. 3, 4, 5, 6 and 14) works as in the first example: when consecutive entries of index information are identical, all but one of them are deleted, and, for index information of the same character chain, n consecutive pieces of index information having the same document number are treated as one group and compressed index information is created in which the document number is stored only once. This removes the data for n-1 document numbers, so the index information is compressed and the size of the index file is reduced. Furthermore, for a group of n consecutive pieces of index information having the same document number, the differences between the position numbers are obtained, and when every byte of a difference is equal to or less than the reference value (0x0f), the 2-byte position number is converted into 1-byte data formed by joining the lower 4 bits of each byte of the difference. Character chains that appear close together, so that the difference between their position numbers is within the reference value, can therefore be stored with fewer bits or bytes, which compresses the index information and reduces the size of the index file.
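The difference-based packing summarized above can be sketched as follows. This is a minimal illustration for one group of entries sharing a document number. The function names and the sample position values are hypothetical, and the sketch assumes that the first position of the group is packed directly while later positions are stored as differences from their predecessor.

```python
REFERENCE = 0x0F  # reference value from the text: each byte must not exceed 0x0f


def pack_if_small(value: int):
    """Join the low nibbles of a 2-byte value into 1 byte, or return None if a byte exceeds 0x0f."""
    hi, lo = (value >> 8) & 0xFF, value & 0xFF
    if hi <= REFERENCE and lo <= REFERENCE:
        return (hi << 4) | lo
    return None


def unpack(byte: int) -> int:
    """Inverse of pack_if_small: spread the two nibbles back into a 2-byte value."""
    return ((byte >> 4) << 8) | (byte & 0x0F)


def compress_group(positions):
    """Compress ascending positions of one document; return None if any field does not fit."""
    if not positions:
        return b""
    # Consecutive identical entries are kept only once.
    deduped = [p for i, p in enumerate(positions) if i == 0 or p != positions[i - 1]]
    # First position as-is (assumption), then differences from the previous position.
    values = [deduped[0]] + [b - a for a, b in zip(deduped, deduped[1:])]
    packed = [pack_if_small(v) for v in values]
    if any(p is None for p in packed):
        return None
    return bytes(packed)


def decompress_group(data: bytes):
    """Rebuild the positions by accumulating the unpacked differences."""
    positions = []
    for i, b in enumerate(data):
        v = unpack(b)
        positions.append(v if i == 0 else positions[-1] + v)
    return positions


if __name__ == "__main__":
    sample = [0x0005, 0x000E, 0x0018, 0x0025, 0x0127]  # hypothetical positions
    packed = compress_group(sample)
    print(packed.hex(), decompress_group(packed) == sample)
```

In the sample, the last three positions contain bytes larger than 0x0f, yet every difference fits within the reference value, so each position field still shrinks from two bytes to one.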

[0108] As described above, according to the information retrieval apparatus of the first embodiment and the information compression methods of its first, second and third examples, compressing the index information makes the index information (compressed index information) smaller in data size than uncompressed index information, so the area on the recording medium for holding the index information, that is, the size of the index file, can be reduced. In addition, less data has to be loaded from the recording medium into memory to obtain the same information with compressed index information than with uncompressed index information, so the time to load into memory is shortened. Consequently, even when the compressed index information is held on a recording medium with a low read speed, such as a CD-ROM, the read time is shortened and high-speed retrieval can be achieved.

[0109] [Second Embodiment] FIG. 16 is a configuration diagram of an information retrieval apparatus 1600 according to a second embodiment of the present invention. Compared with the information retrieval apparatus 100 of the first embodiment (see FIG. 1), the information retrieval apparatus 1600 of this embodiment additionally includes a bibliographic information compression unit 1601 that compresses bibliographic information, an entity information compression unit 1602 that compresses entity information, a bibliographic decompression unit 1603 that decompresses compressed bibliographic information, and an entity decompression unit 1604 that decompresses compressed entity information.

[0110] In other words, the information retrieval apparatus 1600 of this embodiment not only compresses the index file (index information storage unit 115) to reduce the storage required for index information, but also compresses and stores the bibliographic information of the bibliographic information storage unit 116 and the entity information of the entity information storage unit 117, reducing the storage required for bibliographic information and entity information as well. Accordingly, in the bibliographic information storage unit 116, each piece of compressed bibliographic information is associated with a document number, length information, compression method information and head position information; likewise, in the entity information storage unit 117, each piece of compressed entity information is associated with a document number, length information, compression method information and head position information.

[0111] The basic operation of the information retrieval apparatus 1600 of this embodiment is the same as in the first embodiment, but it differs from the information retrieval apparatus 100 of the first embodiment in the following respects. When the bibliographic entity registration unit 108 registers bibliographic information and entity information, the bibliographic information is compressed by the bibliographic information compression unit 1601 before being stored in the bibliographic information storage unit 116, and the entity information is compressed by the entity information compression unit 1602 before being stored in the entity information storage unit 117. Also, when the bibliography acquisition unit 112 acquires bibliographic information, the bibliographic decompression unit 1603 reads the compressed bibliographic information in the bibliographic information storage unit 116 and decompresses it according to its compression method to obtain the original bibliographic information; and when the entity acquisition unit 113 acquires entity information, the entity decompression unit 1604 reads the compressed entity information in the entity information storage unit 117 and decompresses it according to its compression method to obtain the original entity information.
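The storage scheme of the second embodiment, in which each compressed record is associated with a document number, a length, compression method information and a head position, might be sketched as below. The text does not name a compression algorithm for bibliographic or entity information; zlib is used here purely as a stand-in, and the class name, method codes and fallback to uncompressed storage are all hypothetical.

```python
import zlib

# Hypothetical compression method codes; the text does not define specific values.
METHOD_RAW, METHOD_ZLIB = 0, 1


class CompressedStore:
    """Sketch of a storage unit holding compressed records plus a lookup table."""

    def __init__(self):
        self.data = bytearray()  # concatenated record payloads
        self.table = {}          # document number -> (head position, length, method)

    def register(self, doc_no: int, text: str) -> None:
        """Compress a record on registration and note where and how it was stored."""
        raw = text.encode("utf-8")
        packed = zlib.compress(raw)
        # Keep the raw bytes when compression would not make the record smaller.
        if len(packed) < len(raw):
            method, payload = METHOD_ZLIB, packed
        else:
            method, payload = METHOD_RAW, raw
        self.table[doc_no] = (len(self.data), len(payload), method)
        self.data += payload

    def fetch(self, doc_no: int) -> str:
        """Read the stored bytes and decompress them according to the recorded method."""
        head, length, method = self.table[doc_no]
        payload = bytes(self.data[head:head + length])
        raw = zlib.decompress(payload) if method == METHOD_ZLIB else payload
        return raw.decode("utf-8")


if __name__ == "__main__":
    store = CompressedStore()
    store.register(1, "x")
    store.register(2, "information retrieval " * 50)
    print(store.table[1], store.table[2], store.fetch(1))
```

Recording the method per record is what lets the decompression side choose the matching routine, as the description of the decompression units requires.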

[0112] As in the first embodiment, the information compression methods of the first, second and third examples can also be applied to the information retrieval apparatus 1600 of this embodiment. The index decompression unit 110 decompresses compressed index information by the procedure shown in the flowchart of FIG. 17. First, in step S1701, one byte of data containing the compression method and the number of entries is read from the compressed index information. Next, in step S1702, the compression method and the number of entries are extracted from that byte. Next, in step S1703, as many entries as the number of entries are read from the compressed index information and expanded according to the compression method. Then, in step S1704, it is determined whether the end of the compressed index information has been reached; if so, the decompression process ends, and if not, the process returns to step S1701 and repeats.
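The loop of steps S1701 to S1704 can be sketched as follows. The header split into a 3-bit method and a 5-bit count follows the format described for FIG. 2, but which bits hold which field, the big-endian byte order, and the three method codes used here are assumptions made for illustration; they are not the eight methods of FIG. 8.

```python
import struct

# Hypothetical method codes (illustrative only):
#   0: plain entries, each a 2-byte document number followed by a 2-byte position
#   1: one shared 2-byte document number, then 2-byte positions
#   2: one shared 2-byte document number, then 1-byte nibble-packed differences
RAW, SHARED_DOC, SHARED_DOC_DIFF = 0, 1, 2


def expand_index(data: bytes):
    """Expand compressed index information into (document number, position) pairs."""
    entries = []
    i = 0
    while i < len(data):                              # S1704: stop at end of data
        header = data[i]                              # S1701: read the 1-byte header
        i += 1
        method, count = header >> 5, header & 0x1F    # S1702: method and entry count
        # S1703: read `count` entries and expand them according to the method.
        if method == RAW:
            for _ in range(count):
                doc, pos = struct.unpack_from(">HH", data, i)
                i += 4
                entries.append((doc, pos))
        elif method == SHARED_DOC:
            (doc,) = struct.unpack_from(">H", data, i)
            i += 2
            for _ in range(count):
                (pos,) = struct.unpack_from(">H", data, i)
                i += 2
                entries.append((doc, pos))
        elif method == SHARED_DOC_DIFF:
            (doc,) = struct.unpack_from(">H", data, i)
            i += 2
            pos = 0  # the first field is treated as a difference from zero
            for _ in range(count):
                b = data[i]
                i += 1
                pos += ((b >> 4) << 8) | (b & 0x0F)   # spread nibbles, accumulate
                entries.append((doc, pos))
        else:
            raise ValueError(f"unknown compression method {method}")
    return entries


if __name__ == "__main__":
    blob = bytes([0x45, 0x00, 0x07, 0x05, 0x09, 0x0A, 0x0D, 0x12])
    print(expand_index(blob))
```

Because each group carries its own method in the header, groups compressed in different ways can follow one another in the same index file and still be expanded in a single pass.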

[0113] In the information retrieval apparatuses of the first and second embodiments described above, and in the information compression methods of their first, second and third examples, the index information consists of a document number and a position number, but other information may be used as long as it has similar properties. For example, the document number may be broken down hierarchically into finer units so as to include numbers of divisions such as chapters and sections, the page of the document, the paragraph number of the document, and so on. Also, although the data size of one entry of index information is 4 bytes, it is not limited to this and may be any number of bits or bytes. In the data format of the compressed index information shown in FIG. 2, a 3-bit compression method block 201 and a 5-bit entry count block form a header that separates groups of compressed information, but the number of bits in each block can be set arbitrarily, and blocks holding other information may be added to the header. Furthermore, the compression conditions and compression contents shown in FIG. 8 are only one example of compression methods, and the invention is not limited to the contents of FIG. 8. Likewise, the information compression methods of the first, second and third examples and their specific examples are merely illustrative, and the invention is not limited to their specific contents.
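Since the bit widths of the header blocks may be chosen freely, a header codec parameterized by those widths can be sketched as below. The function name is hypothetical, the sketch restricts itself to a one-byte header, and placing the compression method in the upper bits is an assumption.

```python
def make_header_codec(method_bits: int = 3, count_bits: int = 5):
    """Return (pack, unpack) functions for a 1-byte header with the given field widths."""
    if method_bits < 1 or count_bits < 1 or method_bits + count_bits != 8:
        raise ValueError("this sketch only handles two fields filling one byte")
    count_mask = (1 << count_bits) - 1

    def pack(method: int, count: int) -> int:
        # Method goes in the upper bits, entry count in the lower bits (assumption).
        if not 0 <= method < (1 << method_bits):
            raise ValueError("compression method does not fit in its block")
        if not 0 <= count <= count_mask:
            raise ValueError("entry count does not fit in its block")
        return (method << count_bits) | count

    def unpack(byte: int):
        return byte >> count_bits, byte & count_mask

    return pack, unpack


if __name__ == "__main__":
    pack, unpack = make_header_codec()        # 3-bit method, 5-bit count
    print(hex(pack(2, 5)), unpack(0x45))
    pack44, unpack44 = make_header_codec(4, 4)  # alternative split
    print(hex(pack44(9, 3)), unpack44(0x93))
```

With the 3-bit and 5-bit split, a group can hold at most 31 entries and at most eight methods can be distinguished; widening one block narrows the other, which is the trade-off behind leaving the widths configurable.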

【０１１４】[0114]

[Effects of the Invention] As described above, according to the information retrieval apparatus, the information compression method for the information retrieval apparatus, and the recording medium of the present invention, the index information creating means (index information creating step) creates index information including position information for each character chain appearing in the document data, and the index information compressing means (index information compressing step) creates, for index information of the same character chain, compressed index information in which the position information is compressed according to its value. The index information is therefore compressed more efficiently, the size of the index file is reduced, and high-speed retrieval using the index file becomes possible.

[0115] Further, according to the information retrieval apparatus, the information compression method for the information retrieval apparatus, and the recording medium of the present invention, the index information compressing means (index information compressing step) treats, for index information of the same character chain, a plurality of pieces of index information whose first position information (included in the position information) is identical as one group, and creates compressed index information in which the first position information of the group is compressed into one. Thus, for example, when there are n pieces of index information whose first position information, belonging to a higher level of the hierarchy, is identical, treating them as one group and creating compressed index information in which the first position information is compressed into one removes n-1 pieces of first position information, so the index information is compressed and the size of the index file is reduced.

[0116] Further, according to the information retrieval apparatus, the information compression method for the information retrieval apparatus, and the recording medium of the present invention, the index information compressing means (index information compressing step) creates, for the index information or compressed index information of the same character chain, compressed index information in which second position information included in the position information is compressed to a smaller number of bits or bytes when that second position information is equal to or less than a predetermined reference value. Thus, for example, when the second position information is position information (order of appearance) within the document specified by the first position information, entries whose order of appearance is within the predetermined reference value can be stored with fewer bits or bytes, so the index information is compressed and the size of the index file is reduced.

[0117] Further, according to the information retrieval apparatus, the information compression method for the information retrieval apparatus, and the recording medium of the present invention, the index information compressing means (index information compressing step) obtains, for compressed index information in which the first position information has been compressed into one, the differences between the second position information included in the position information, and when a difference is equal to or less than a predetermined reference value, compresses the second position information into a difference represented by a smaller number of bits or bytes. Thus, for example, when the second position information is position information (order of appearance) within the document specified by the first position information, entries that appear close together, so that the difference is within the predetermined reference value, can be stored with fewer bits or bytes, so the index information is compressed and the size of the index file is reduced.

[0118] Furthermore, according to the information retrieval apparatus of the present invention, the compressed index information storage means holds the compressed index information compressed by the index information compressing means, the index decompression means reads the compressed index information from the compressed index information storage means based on a given search condition and decompresses it into the original index information, and the search means performs a search based on the index information decompressed by the index decompression means. Because the compression methods used in the information retrieval apparatus, the information compression method for the information retrieval apparatus, and the recording medium of the present invention are all simple, the decompression into the original index information by the index decompression means is also simple. As a result, the index information is compressed more efficiently to reduce the size of the index file, and searching with that index file shortens the search time.

[Brief description of the drawings]

FIG. 1 is a configuration diagram showing an information retrieval apparatus according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing one format of compressed index information in the information retrieval apparatus of the embodiment.

FIG. 3 is a flowchart explaining the main routine of the index information compression method according to the first example.

FIG. 4 is a flowchart explaining a subroutine of the index information compression method according to the first example (process of putting an entry into a queue: part 1).

FIG. 5 is a flowchart explaining a subroutine of the index information compression method according to the first example (process of putting an entry into a queue: part 2).

FIG. 6 is a flowchart explaining a subroutine of the index information compression method according to the first example (process of putting an entry into a queue: part 3).

FIG. 7 is a flowchart explaining a subroutine of the index information compression method according to the first example (queue output process).

FIG. 8 is an explanatory diagram showing the compression methods, divided into eight cases according to the comparison between entries and the value of the position number of an entry.

FIG. 9 is an explanatory diagram illustrating a specific compression process for the index information compression method according to the first example.

FIG. 10 is an explanatory diagram showing the entries put into the 0-byte queue, the 2-byte queue and the 4-byte queue in the specific example.

FIG. 11 is a flowchart explaining a subroutine (part 1) of the index information compression method according to the second example.

FIG. 12 is a flowchart explaining a subroutine (part 2) of the index information compression method according to the second example.

FIG. 13 is an explanatory diagram illustrating a specific compression process for the index information compression method according to the second example.

FIG. 14 is a flowchart explaining a subroutine of the index information compression method according to the third example.

FIG. 15 is an explanatory diagram illustrating a specific compression process for the index information compression method according to the third example.

FIG. 16 is a configuration diagram showing the information retrieval apparatus of the second embodiment.

FIG. 17 is a flowchart explaining the method of decompressing compressed index information.

FIG. 18 is a configuration diagram showing a conventional information retrieval apparatus.

[Explanation of symbols]

101 terminal
102 document input unit
103 search condition input unit
104 result output unit
105 overall control unit
106 index creation unit
107 index information compression unit
108 bibliographic entity registration unit
109 search unit
110 index decompression unit
111 document number storage unit
112 bibliography acquisition unit
113 entity acquisition unit
114 index position storage unit
115 index information storage unit
116 bibliographic information storage unit
117 entity information storage unit
1601 bibliographic information compression unit
1602 entity information compression unit
1603 bibliographic decompression unit
1604 entity decompression unit

Claims

[Claims]

1. An information retrieval apparatus having an index that specifies a particular character chain in document data by position information, comprising: index information creating means for creating index information including the position information for each character chain appearing in the document data; and index information compressing means for creating, for index information of the same character chain, compressed index information in which the position information is compressed according to the value of the position information.

2. The information retrieval apparatus according to claim 1, wherein the index information compressing means, for index information of the same character chain, treats a plurality of pieces of index information having the same first position information included in the position information as one group, and creates compressed index information in which the first position information of the group is compressed into one.

3. The information retrieval apparatus according to claim 1, wherein the index information compressing means, for the index information or the compressed index information of the same character chain, creates compressed index information in which second position information included in the position information is compressed to a smaller number of bits or bytes when the second position information is equal to or less than a predetermined reference value.

4. The information retrieval apparatus according to claim 2, wherein the index information compressing means, for the compressed index information in which the first position information is compressed into one, obtains a difference between second position information included in the position information, and when the difference is equal to or less than a predetermined reference value, compresses the second position information into a difference represented by a smaller number of bits or bytes.

5. The information retrieval apparatus according to claim 1, wherein the index information includes hierarchical position information.

6. The information retrieval apparatus according to claim 2, 3 or 4, wherein the first position information is identification information of a document, and the second position information is position information within the document specified by the first position information.

7. The information retrieval apparatus according to claim 1, 2, 3, 4, 5 or 6, further comprising: compressed index information storage means for holding the compressed index information compressed by the index information compressing means; index decompression means for reading the compressed index information from the compressed index information storage means based on a given search condition and decompressing it into the original index information; and search means for performing a search based on the index information decompressed by the index decompression means.

8. An information compression method for an information retrieval apparatus having an index that specifies a particular character chain in document data by position information, comprising: an index information creating step of creating index information including the position information for each character chain appearing in the document data; and an index information compressing step of creating, for index information of the same character chain, compressed position information in which the position information is compressed according to the value of the position information.

9. The information compression method according to claim 8, wherein the index information compressing step, for index information of the same character chain, treats a plurality of pieces of index information having the same first position information included in the position information as one group, and creates compressed index information in which the first position information of the group is compressed into one.

10. The information compression method according to claim 8, wherein the index information compressing step, for the index information or the compressed index information of the same character chain, creates compressed position information in which second position information included in the position information is compressed to a smaller number of bits or bytes when the second position information is equal to or less than a predetermined reference value.

11. The information compression method according to claim 9, wherein the index information compressing step, for the compressed index information in which the first position information is compressed into one, obtains a difference between second position information included in the position information, and when the difference is equal to or less than a predetermined reference value, compresses the second position information into a difference represented by a smaller number of bits or bytes.

12. The information compression method according to claim 8, wherein the index information includes hierarchical position information.

13. The information compression method according to claim 9, 10 or 11, wherein the first position information is identification information of a document, and the second position information is position information within the document specified by the first position information.

14. A computer-readable recording medium on which the information compression method according to claim 8, 9, 10, 11, 12 or 13 is recorded as a program to be executed by a computer.