JP2005129071A - Data compression/decompression apparatus and method

Info

Publication number: JP2005129071A
Application number: JP2004320948A
Authority: JP
Inventors: Nobuyuki Igata; 伸之井形; Isao Nanba; 功難波; Kunio Matsui; くにお松井
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1997-02-28
Filing date: 2004-11-04
Publication date: 2005-05-19
Anticipated expiration: 2018-01-23
Also published as: JP3898717B2

Abstract

PROBLEM TO BE SOLVED: To speed up index creation and limit index size without slowing down the decoding of data.

SOLUTION: A compression means compresses numerical data used within an index for information retrieval by coarsening its granularity, and a storage means stores the compressed data. A decompression means restores the numerical data and returns the restored data to its original granularity.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a data compression apparatus and method for encoding and compressing arbitrary data, and to a data decompression apparatus and method for restoring the compressed data.

In an information retrieval apparatus such as a full-text search apparatus or a ranking search apparatus, it is important to increase the speed at which the search index is created and to keep the index small. Here, a full-text search apparatus is an apparatus that searches the full text of the documents in a document DB (database) for documents containing a character string (keyword) specified by the user, and a ranking search apparatus is an apparatus that retrieves documents highly relevant to the specified character string.

An index in such an information retrieval apparatus is a data structure in which information such as the document number, the in-document word occurrence frequency, and the in-document word occurrence positions is attached to each search key. For example, the pairs [document number, in-document word occurrence frequency] for the keyword “dog” and the documents containing it are represented as shown in FIG. 38.

The index of FIG. 38 indicates that the key “dog” appears once in document 1, once in document 2, twice in document 3, and three times in document 25.

In this example, if each numerical value is represented with 32 bits (4 bytes), the index for the key “dog” contains eight numerical values and therefore requires 256 bits (= 32 bits × 8). A trial calculation of the storage needed for document numbers alone under this scheme, for document collections on the order of gigabytes, shows that the index becomes very large relative to the size of the original text, as shown in FIG. 39. The index therefore needs to be compressed.

The basic idea of index compression is to represent each numerical value with as few bits as possible rather than with a fixed number of bits (usually 32). As described below, existing encoding methods represent small values with few bits and large values with many bits.

As a first step in index compression, it is therefore natural to make the numerical values in the index as small as possible. This is achieved by taking the difference between consecutive entries, for both document numbers and in-document word occurrence positions. Because document numbers and word occurrence positions are stored in ascending order, taking differences between adjacent values reduces the magnitude of the values to be represented.

For example, taking the differences between the document numbers in the index of FIG. 38 yields FIG. 40. In FIG. 40, the document number “1” of the first entry [1, 1] is the actual document number itself, and the document number “1” of the second entry [1, 1] is the difference between the actual document number “1” of the first entry and the actual document number “2” of the second entry. Here, the actual document number means the document number shown in FIG. 38, before differences are taken.

Likewise, the document number “1” of the third entry [1, 2] is the difference between the actual document number “2” of the second entry and the actual document number “3” of the third entry, and the document number “22” of the fourth entry [22, 3] is the difference between the actual document number “3” of the third entry and the actual document number “25” of the fourth entry.
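The difference step can be illustrated with a minimal Python sketch. The function names `to_gaps` and `from_gaps` are hypothetical and are not taken from the patent; the posting list is the “dog” example of FIG. 38 as described in the text.

```python
def to_gaps(doc_ids):
    """Replace each ascending document number by its difference from the previous one.

    The first entry keeps its actual value (difference from an implicit 0).
    """
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: a running sum restores the actual document numbers."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


# Document numbers of the "dog" index in FIG. 38: 1, 2, 3, 25
print(to_gaps([1, 2, 3, 25]))    # [1, 1, 1, 22]
print(from_gaps([1, 1, 1, 22]))  # [1, 2, 3, 25]
```

Only the document numbers are differenced here; the occurrence frequencies in each pair are left unchanged, as in FIG. 40.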

Known encoding methods for such numerical data include 8-bit block (8BB) encoding, 4-bit block (4BB) encoding, unary coding, γ-coding, and δ-coding. In all of these methods, small values are represented with few bits and large values with many bits.

First, 8-bit block encoding is a method in which the first bit (top bit) of each 8-bit (1-byte) block serves as a continuation flag; if the flag is set, a further block follows. Some example values are given below.

Number  Bits
1       00000001
2       00000010
3       00000011
128     10000001 00000000
129     10000001 00000001

For the values 1, 2, and 3, no block follows, so the leading bit is 0; for the values 128 and 129, a second block exists, so the leading bit of the first block is 1. With this method, the code for a value that fits in one word (32 bits) is at least 8 bits and at most 40 bits long.
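As an illustration, the following Python sketch reproduces the 8BB codes in the table above. The function name `encode_8bb` and the bit-string output format are assumptions made for readability; a real implementation would emit bytes rather than text.

```python
def encode_8bb(n):
    """Encode n as 8-bit blocks: 1 continuation flag bit + 7 value bits per block."""
    groups = []
    while True:
        groups.append(n & 0x7F)  # lowest 7 value bits
        n >>= 7
        if n == 0:
            break
    groups.reverse()             # most significant group first
    blocks = []
    for i, group in enumerate(groups):
        flag = 1 if i < len(groups) - 1 else 0  # set on every block except the last
        blocks.append(format((flag << 7) | group, "08b"))
    return " ".join(blocks)


print(encode_8bb(3))    # 00000011
print(encode_8bb(128))  # 10000001 00000000
print(encode_8bb(129))  # 10000001 00000001
```

A 32-bit value needs at most five 7-bit groups, which gives the 40-bit upper bound stated above.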

Similarly, 4-bit block encoding is a method in which the first bit of each 4-bit block serves as a continuation flag; if the flag is set, a further block follows. Some example values are given below.

Number  Bits
1       0001
2       0010
3       0011
4       0100
5       0101
6       0110
7       0111
8       1001 0000
9       1001 0001
128     1010 1000 0000
129     1010 1000 0001

For the values 1 to 7, no block follows, so the leading bit is 0; for the values 8 and 9, a second block exists, so the leading bit of the first block is 1. For the values 128 and 129, a third block exists, so the leading bits of both the first and second blocks are 1. With this method, the code for a value that fits in one word is at least 4 bits and at most 44 bits long.
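Decoding a 4BB code requires one flag test per block. The sketch below shows this loop on a bit string; the name `decode_4bb` and the string representation are assumptions for illustration only.

```python
def decode_4bb(bits):
    """Decode one 4BB code from the start of a bit string.

    Returns (value, number of bits consumed). Each 4-bit block holds
    1 continuation flag followed by 3 value bits.
    """
    value, pos = 0, 0
    while True:
        block = bits[pos:pos + 4]
        pos += 4
        value = (value << 3) | int(block[1:], 2)  # append the 3 value bits
        if block[0] == "0":                       # flag clear: last block
            return value, pos


print(decode_4bb("0111"))          # (7, 4)
print(decode_4bb("10010000"))      # (8, 8)
print(decode_4bb("101010000001"))  # (129, 12)
```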

Unary coding represents the number n as n − 1 consecutive 1s followed by a 0. It is used mainly in the explanation of γ-coding and δ-coding below. Some example values are given below.

Number  Bits
1       0
2       10
3       110
4       1110
5       11110
6       111110
128     11111 ... (127 consecutive 1s) ... 0
129     11111 ... (128 consecutive 1s) ... 0

With this method, the code for a value that fits in one word is at least 1 bit and at most 4294967295 (2³² − 1) bits long.
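A minimal Python sketch of unary coding as defined above; `encode_unary` and `decode_unary` are hypothetical names, and codes are shown as bit strings.

```python
def encode_unary(n):
    """Unary code of n >= 1: (n - 1) ones followed by a single zero."""
    if n < 1:
        raise ValueError("unary coding is defined here for n >= 1")
    return "1" * (n - 1) + "0"


def decode_unary(bits):
    """Decode one unary code from the start of a bit string.

    Returns (value, number of bits consumed); both equal the position
    of the first 0 plus one.
    """
    n = bits.index("0") + 1
    return n, n


print(encode_unary(1))         # 0
print(encode_unary(4))         # 1110
print(decode_unary("111100"))  # (5, 5)
```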

In γ-coding, the code for a number x is divided into a prefix part and a suffix part. Writing the largest integer not exceeding log₂ x as

I1(x) = ⌊log₂ x⌋

the prefix part is obtained by expressing the number (1 + I1(x)) in unary coding, and the suffix part is obtained by expressing the value (x − 2^I1(x)) as a binary number of I1(x) bits. Some example values are given below.

Number  Prefix part                    Suffix part
1       0                              none
        (1 + 0 bits representing 0)    (0 bits representing 1 − 2⁰)
2       10                             0
        (1 + 1 bits representing 1)    (1 bit representing 2 − 2¹)
3       10                             1
        (1 + 1 bits representing 1)    (1 bit representing 3 − 2¹)
4       110                            00
        (1 + 2 bits representing 2)    (2 bits representing 4 − 2²)
5       110                            01
        (1 + 2 bits representing 2)    (2 bits representing 5 − 2²)
6       110                            10
        (1 + 2 bits representing 2)    (2 bits representing 6 − 2²)
7       110                            11
        (1 + 2 bits representing 2)    (2 bits representing 7 − 2²)
8       1110                           000
        (1 + 3 bits representing 3)    (3 bits representing 8 − 2³)
9       1110                           001
        (1 + 3 bits representing 3)    (3 bits representing 9 − 2³)
10      1110                           010
        (1 + 3 bits representing 3)    (3 bits representing 10 − 2³)
128     11111110                       0000000
        (1 + 7 bits representing 7)    (7 bits representing 128 − 2⁷)
129     11111110                       0000001
        (1 + 7 bits representing 7)    (7 bits representing 129 − 2⁷)

For example, for the value 129, the prefix code '11111110' is 8 bits long and contains seven consecutive 1s. This indicates that I1(129) = 7, that is, that the suffix part of the value 129 is 7 bits long. The suffix part '0000001' represents 129 − 2⁷ in 7 bits. With this method, the code for a value that fits in one word is at least 1 bit and at most 63 bits (= 1 + 31 + 31 bits) long.
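The γ-coding rule can be checked with a short Python sketch. The name `encode_gamma` is hypothetical; `int.bit_length() - 1` is used here as a way to compute I1(x) = ⌊log₂ x⌋ without floating-point arithmetic.

```python
def encode_gamma(x):
    """Return the (prefix, suffix) bit strings of the gamma code of x >= 1."""
    i = x.bit_length() - 1                 # I1(x) = floor(log2 x)
    prefix = "1" * i + "0"                 # unary code of (1 + I1(x))
    # x - 2^I1(x) written in exactly I1(x) bits (empty when I1(x) = 0)
    suffix = format(x - (1 << i), "b").zfill(i) if i else ""
    return prefix, suffix


print(encode_gamma(1))    # ('0', '')
print(encode_gamma(5))    # ('110', '01')
print(encode_gamma(129))  # ('11111110', '0000001')
```

For x = 2³² − 1 the prefix has 32 bits and the suffix 31 bits, matching the 63-bit maximum given above.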

In δ-coding, as in γ-coding, the code for a number x is divided into a prefix part and a suffix part. The prefix part is obtained by expressing the number (1 + I1(x)) in γ-coding, and the suffix part is obtained, as in γ-coding, by expressing the value (x − 2^I1(x)) as a binary number of I1(x) bits. Some example values are given below.

Number  Prefix part        Suffix part
1       0                  none
        (γ-coding of 1)    (0 bits representing 1 − 2⁰)
2       100                0
        (γ-coding of 2)    (1 bit representing 2 − 2¹)
3       100                1
        (γ-coding of 2)    (1 bit representing 3 − 2¹)
4       101                00
        (γ-coding of 3)    (2 bits representing 4 − 2²)
5       101                01
        (γ-coding of 3)    (2 bits representing 5 − 2²)
6       101                10
        (γ-coding of 3)    (2 bits representing 6 − 2²)
7       101                11
        (γ-coding of 3)    (2 bits representing 7 − 2²)
8       11000              000
        (γ-coding of 4)    (3 bits representing 8 − 2³)
9       11000              001
        (γ-coding of 4)    (3 bits representing 9 − 2³)
10      11000              010
        (γ-coding of 4)    (3 bits representing 10 − 2³)
128     1110000            0000000
        (γ-coding of 8)    (7 bits representing 128 − 2⁷)
129     1110000            0000001
        (γ-coding of 8)    (7 bits representing 129 − 2⁷)

With this method, the code for a value that fits in one word is at least 1 bit and at most 42 bits (= (5 + 1 + 5) + 31 bits) long.
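The following Python sketch builds the δ-code from a γ-coded prefix, as described above. The names `gamma` and `encode_delta` are hypothetical, and codes are returned as bit strings for comparison with the table.

```python
def gamma(x):
    """Gamma code of x >= 1 as one bit string: unary(1 + I1(x)) then I1(x) value bits."""
    i = x.bit_length() - 1  # I1(x) = floor(log2 x)
    return "1" * i + "0" + (format(x - (1 << i), "b").zfill(i) if i else "")


def encode_delta(x):
    """Return the (prefix, suffix) bit strings of the delta code of x >= 1."""
    i = x.bit_length() - 1
    prefix = gamma(1 + i)   # the prefix is (1 + I1(x)) in gamma coding
    suffix = format(x - (1 << i), "b").zfill(i) if i else ""
    return prefix, suffix


print(encode_delta(1))    # ('0', '')
print(encode_delta(8))    # ('11000', '000')
print(encode_delta(129))  # ('1110000', '0000001')
```

For x = 2³² − 1, I1(x) = 31, the γ-code of 32 is 11 bits long (5 + 1 + 5), and the suffix is 31 bits, which gives the 42-bit maximum.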

However, the conventional encoding methods described above have the following problems.
For example, when the index structure of FIG. 40 is expressed using these encoding methods, the number of bits required is as shown in FIG. 41. In FIG. 41, only the original data before encoding is written as decimal values; each encoded index structure is described by the number of bits needed to represent the original data. FIG. 41 shows that uncompressed 32-bit encoding requires the most bits and δ-coding the fewest.

In general, block-based encoding methods such as 8-bit block and 4-bit block encoding always require at least one full block of bits, no matter how small the value. In an index structure that uses difference values, as in FIG. 40, small values such as 1 and 2 account for most of the data, so the index size does not shrink much. In addition, a continuation flag must be added at the head of every block, which makes index creation slow.

By contrast, bit-based encoding methods such as unary coding, γ-coding, and δ-coding can represent small values such as 1 and 2 with fewer bits than block-based methods. However, as values grow, the number of bits tends to become far larger than with block-based methods, so there is no guarantee that the index will be smaller. Moreover, because the algorithms are complex, both index creation and decoding take longer.

An object of the present invention is to provide a data compression apparatus and method, and a data decompression apparatus and method, that speed up index creation and limit index size without slowing down the decoding of numerical data.

FIG. 1 is a diagram showing the principle of the data compression apparatus and data decompression apparatus of the present invention. The data compression apparatus of FIG. 1 comprises compression means 1 and storage means 2, and the data decompression apparatus comprises storage means 2 and decompression means 3.

The compression means 1 compresses given data 4 in units of blocks and generates, at the head of the compressed data 5, continuation flag information representing the length of the data 5.
The storage means 2 stores the compressed data 5.

The decompression means 3 determines the length of the data 5 from the continuation flag information at the head of the block-compressed data 5 and restores the original data 4.
When the original data 4 is a binary bit pattern, in general the larger the value it represents, the lower the compression ratio, and the smaller the value, the higher the compression ratio. The compression means 1 determines the block length of the compressed data 5 according to the value of the original data 4 and generates the corresponding continuation flag information. It then stores the continuation flag information at the head of the data 5, followed by the data representing the data 4.

If the value is relatively small, the data 5 is a shorter bit pattern than the data 4. The continuation flag information may instead represent the block length of the remainder of the data 5, excluding the continuation flag information itself.

With this compression process, there is no need to add a continuation flag at the head of each block one at a time, as in conventional block-based encoding; the continuation flag information is generated in a single step. The data 5 is therefore created faster, which in turn speeds up index creation.

Furthermore, the length of the first block of the data 5 can be varied according to the value of the data 4; shortening it for small values improves the compression ratio of the data 5. Because many small values such as 1 and 2 appear in the numerical data used in an index, this reduces the index size.

The decompression means 3 extracts the continuation flag information from the head of the compressed data 5 and determines the block length of the data 5 from it. Next, it subtracts the block length of the continuation flag information from that length to obtain the block length of the remainder of the data 5, and extracts the remaining data. It then generates the original data 4 from the extracted data. If the continuation flag information represents the block length of the remaining data, that value can be used directly to extract the remaining data.

With this decompression process, there is no need to extract a continuation flag from the head of each block one at a time, as in conventional block-based decoding; the continuation flag information is extracted in a single step. For relatively large values, the data 4 is therefore reconstructed faster.

Furthermore, compared with conventional bit-based encoding, the compression and decompression processes are simpler and take less time. The compression ratio for large values is also expected to be higher.
In another data compression apparatus and data decompression apparatus of the present invention, the compression means 1 compresses numerical data used in an index for information retrieval by coarsening its granularity, and the storage means 2 stores the compressed data. The decompression means 3 restores the numerical data and returns the restored data to its original granularity.

For example, the compression means 1 and decompression means 3 of FIG. 1 correspond to the CPU (central processing unit) 16 and main memory 19 of FIG. 2, described later, and the storage means 2 corresponds to the main memory 19 or the magnetic disk device 11.

According to the present invention, index creation can be sped up without slowing down the decoding of numerical data, and the index compression ratio can be increased.
In particular, modified 4BB encoding shortens the time required for index creation, and B24 encoding and 84BB encoding increase the index compression ratio without greatly reducing encoding and decoding speed. Per encoding also increases the index compression ratio, at the cost of a slight loss of precision in the stored information.

Embodiments of the present invention are described in detail below with reference to the drawings.
The present invention proposes three new block-based encoding methods: modified 4-bit block (modified 4BB) encoding, 84-bit block (84BB) encoding, and B24 (block 24) encoding. An outline of each is given first.

Modified 4BB encoding is basically an improvement of the 4BB encoding described above. In ordinary 4BB encoding, the top bit of each 4-bit block is a continuation flag, and if it is set to '1', the next 4 bits are also taken to contain part of the number. In contrast, the modified 4BB code gathers all the continuation flags at the head of the code. The bit pattern after the first 0 is then regarded as the numeric part. Some example values are given below.

Number  Bits
1       0001
2       0010
3       0011
4       0100
5       0101
6       0110
7       0111
8       1000 1000
9       1000 1001
128     1100 1000 0000
129     1100 1000 0001

For the values 1 to 7, no block follows, so the leading bit is 0; for the values 8 and 9, a second block exists, so the leading bit of the first block is 1. For the values 128 and 129, a third block also exists, so the first and second bits of the first block are 1. In every case, the LSB (least significant bit) of the value's bit pattern, which follows the continuation flags, falls at the right end of the last block.
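A minimal Python sketch of modified 4BB encoding, consistent with the example values above. The name `encode_4bb_mod` is hypothetical, and the rule that k blocks carry k flag bits and 3k value bits is inferred from the examples rather than quoted from the patent.

```python
def encode_4bb_mod(n):
    """Encode n >= 0 in modified 4BB: all continuation flags first, then the value.

    With k blocks of 4 bits, the flags are (k - 1) ones followed by a zero
    (k bits in total), which leaves 3 * k bits for the value.
    """
    k = 1
    while n >> (3 * k):                   # value does not yet fit in 3k bits
        k += 1
    flags = "1" * (k - 1) + "0"           # generated once, not once per block
    code = flags + format(n, "b").zfill(3 * k)
    return " ".join(code[i:i + 4] for i in range(0, len(code), 4))


print(encode_4bb_mod(7))    # 0111
print(encode_4bb_mod(8))    # 1000 1000
print(encode_4bb_mod(129))  # 1100 1000 0001
```

Because the value bits are written as one contiguous field, the least significant bit always lands at the right end of the last block.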

With this method, the code for a value that fits in one word is at least 4 bits and at most 44 bits long, so the compression efficiency itself is equivalent to that of 4BB encoding. However, the encoding process takes fewer steps than 4BB encoding. A considerable speedup can therefore be expected when creating an index for a large-scale database (DB).

As for decoding, 4BB decoding repeats a loop that tests the continuation flags one by one, whereas modified 4BB decoding decodes the continuation flags first and then obtains the value in a single step. Despite this difference, decoding speed differs much less between the two methods than encoding speed does. For large values, however, the modified 4BB code is faster to decode than the 4BB code.
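The single-step decoding can be sketched in Python as follows. The name `decode_4bb_mod` and the bit-string input are assumptions for illustration; a production decoder would work on machine words, for example by counting leading 1 bits.

```python
def decode_4bb_mod(bits):
    """Decode one modified 4BB code from the start of a bit string.

    Returns (value, number of bits consumed). The position of the first 0
    gives the block count k directly, so no per-block flag test is needed.
    """
    k = bits.index("0") + 1        # (k - 1) ones followed by a zero
    total = 4 * k                  # overall code length in bits
    return int(bits[k:total], 2), total


print(decode_4bb_mod("0111"))          # (7, 4)
print(decode_4bb_mod("10001000"))      # (8, 8)
print(decode_4bb_mod("110010000001"))  # (129, 12)
```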

Next, 84BB encoding, which is similar to modified 4BB encoding, is described. Whereas the block length in modified 4BB encoding is fixed at 4 bits for every value, 84BB encoding uses 8 bits for the first block only and 4 bits for each subsequent block indicated by the continuation flags. With this method, values up to 127 are represented by a single 0 bit followed by a 7-bit pattern of the value, and values of 128 or more are represented by several continuation flags followed by the bit pattern of the value. Some example values are given below.

Number  Bits
1       00000001
2       00000010
3       00000011
4       00000100
5       00000101
6       00000110
7       00000111
8       00001000
9       00001001
127     01111111
128     10001000 0000
129     10001000 0001

For the values 1 to 9 and 127, no block follows, so the leading bit is 0; for the values 128 and 129, a second block exists, so the leading bit of the first block is 1. With this method, the code for a value that fits in one word is at least 8 bits and at most 44 bits long.
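The 84BB layout can be sketched in Python under the assumption, inferred from the examples and the stated 44-bit maximum, that a code of k blocks (one 8-bit block plus k − 1 four-bit blocks) carries k flag bits and 3k + 4 value bits. The name `encode_84bb` is hypothetical.

```python
def encode_84bb(n):
    """Encode n >= 0 in 84BB: an 8-bit first block, then 4-bit blocks.

    Assumed layout: (k - 1) ones and a zero as flags, followed by the value
    in 3 * k + 4 bits, for a total of 8 + 4 * (k - 1) bits.
    """
    k = 1
    while n >> (3 * k + 4):               # value does not yet fit
        k += 1
    code = "1" * (k - 1) + "0" + format(n, "b").zfill(3 * k + 4)
    blocks = [code[:8]] + [code[i:i + 4] for i in range(8, len(code), 4)]
    return " ".join(blocks)


print(encode_84bb(6))    # 00000110
print(encode_84bb(127))  # 01111111
print(encode_84bb(128))  # 10001000 0000
```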

For example, the values handled in an index as in-document word occurrence positions are considerably larger than document numbers (difference values) or in-document word occurrence frequencies. Consequently, even when difference values are used, ordinary numerical encoding methods cannot compress them well.

Compared with 4BB encoding and modified 4BB encoding, 84BB encoding can represent values one bit larger in its first 8 bits. Medium-sized values can thus be represented with fewer bits than with the other encodings, which makes 84BB encoding well suited to encoding in-document word occurrence positions.

Next, B24 encoding encodes the values 1 and 2 in 2 bits, the values 3 to 6 in 4 bits, and values of 7 or more using modified 4BB encoding. The values 1 and 2 are represented by a 2-bit code whose first bit is the continuation flag '0', and the values 3 to 6 by a 4-bit code whose first 2 bits are the continuation flag '10'. Larger values are encoded in the same way as in modified 4BB encoding.

However, whereas modified 4BB encoding stores the value itself after the continuation flags, B24 encoding stores the value minus 1 for the values 1 and 2, the value minus 3 for the values 3 to 6, and the value minus 7 for values of 7 or more. In addition, the continuation flags are 1 bit longer than in the modified 4BB code. Some example values are given below.

Number  Bits
1       00
2       01
3       1000
4       1001
5       1010
6       1011
7       1100 0000
8       1100 0001
9       1100 0010
39      1110 0010 0000
40      1110 0010 0001

For the values 1 and 2, the leading bit is 0, and the next bit represents the original value minus 1. For the values 3, 4, 5, and 6, the first 2 bits are 10, and the next 2 bits represent the original value minus 3.

For the values 7, 8, and 9, a second block exists, so the first 2 bits are 11, and the second block represents the original value minus 7. For the values 39 and 40, a third block also exists, so the first 3 bits are 111, and the second and third blocks represent the original value minus 7.
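The three cases of B24 encoding can be sketched in Python as follows. The name `encode_b24` is hypothetical; for values of 7 or more, the layout of k four-bit blocks with k + 1 flag bits and 3k − 1 value bits is inferred from the examples (7 to 9 in two blocks, 39 and 40 in three) rather than quoted from the patent.

```python
def encode_b24(n):
    """Encode n >= 1 in B24 as a bit string with 4-bit groups separated by spaces."""
    if n <= 2:                            # 2-bit code: flag '0' + (n - 1) in 1 bit
        return "0" + format(n - 1, "b")
    if n <= 6:                            # 4-bit code: flag '10' + (n - 3) in 2 bits
        return "10" + format(n - 3, "02b")
    v = n - 7                             # larger values store n - 7
    k = 2                                 # number of 4-bit blocks
    while v >> (3 * k - 1):               # value does not yet fit in 3k - 1 bits
        k += 1
    code = "1" * k + "0" + format(v, "b").zfill(3 * k - 1)
    return " ".join(code[i:i + 4] for i in range(0, len(code), 4))


print(encode_b24(2))   # 01
print(encode_b24(6))   # 1011
print(encode_b24(7))   # 1100 0000
print(encode_b24(39))  # 1110 0010 0000
```

Subtracting the offsets 1, 3, and 7 avoids wasting code words on values already covered by a shorter code.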

In this method, the code for a value that fits in one word has a minimum length of 2 bits and a maximum length of 44 bits, and the values 1 and 2 each take 2 bits fewer than in modified 4BB encoding. In a typical DB, most within-document word frequencies are 1 or 2, so representing these values in 2 bits is expected to give a higher actual index compression ratio than 4BB encoding or modified 4BB encoding.
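The length claims above can be checked with a short calculation. The following is a minimal sketch in C, not code from the figures: the function names bb4mod_bits and b24_bits are hypothetical, a word is assumed to be 32 bits, and the block count for values of 7 or more is assumed to be the smallest one whose data field can hold the value minus 7, which is what the example table shows.

```c
#include <assert.h>
#include <stdint.h>

/* Code length in bits under modified 4BB encoding:
 * one 4-bit block per octal digit of the value (at least one block). */
static unsigned bb4mod_bits(uint32_t value) {
    unsigned blocks = 1;
    while (((uint64_t)value >> (3 * blocks)) != 0)
        blocks++;
    return 4 * blocks;
}

/* Code length in bits under B24 encoding (value must be 1 or more). */
static unsigned b24_bits(uint32_t value) {
    unsigned flag = 3;                 /* shortest flag for values >= 7 is '110' */
    assert(value >= 1);
    if (value <= 2) return 2;          /* flag '0'  + 1 data bit  */
    if (value <= 6) return 4;          /* flag '10' + 2 data bits */
    /* A flag of `flag` bits leaves 3*flag - 4 data bits in 4*(flag - 1) bits. */
    while ((((uint64_t)value - 7) >> (3 * flag - 4)) != 0)
        flag++;
    return 4 * (flag - 1);
}
```

Under these assumptions the values 1 and 2 take 2 bits instead of 4, while the largest 32-bit value takes 44 bits in both schemes.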

The encoding process itself is also not much slower than modified 4BB encoding. Moreover, if most of the values to be encoded are 1 or 2, the number of steps is smaller than in modified 4BB encoding, so the process is faster. The same holds for the decoding speed.

FIG. 2 is a configuration diagram of an information retrieval apparatus that includes a data compression/decompression apparatus based on the encoding methods described above. The information retrieval apparatus of FIG. 2 is implemented by an information processing apparatus (computer) running software, and comprises a magnetic disk device 11, a floppy disk drive (FDD) 12, a printer 14, a display 15, a CPU (central processing unit) 16, a keyboard 17, a pointing device 18, a main memory 19, and a network connection device 31, which are connected to one another by a bus 20.

The magnetic disk device 11 stores a document DB 21 and an index 22. An optical disk device, a magneto-optical disk device, or the like may be used in place of the magnetic disk device 11.

The CPU 16 carries out the processing required for information retrieval by using programs stored in the main memory 19. The memory 19 includes, for example, a ROM (read only memory) and a RAM (random access memory). The memory 19 holds an index creation program 23, a search engine (search program) 24, a document display program 25, and the like, and also provides a work area 26.

The index creation program 23 creates the index 22 from the document DB 21 and stores it in the magnetic disk device 11. The program 23 includes data compression processing based on modified 4BB encoding, 84BB encoding, B24 encoding, or the like.

The search engine 24 searches the documents in the document DB 21 using the index 22. In a full-text search apparatus, it retrieves documents that contain the word string specified by the user; in a ranking search apparatus, it retrieves documents that are highly relevant to the word string specified by the user. The search engine 24 includes data decompression processing (decoding) based on modified 4BB encoding, 84BB encoding, B24 encoding, or the like. The document display program 25 extracts a specified document from the search results and displays it to the user. The work area 26 is the area that the programs 23, 24, and 25 use for their processing.

The keyboard 17 and the pointing device 18 are used to input requests and instructions from the user, and the printer 14 and the display 15 are used to output inquiries to the user, processing results, and the like.

The FDD 12 drives a floppy disk 13 and accesses its contents. The necessary data and the programs 23, 24, 25, and so on can be stored on the floppy disk 13 and loaded into the memory 19 for use as needed. Besides the floppy disk 13, any computer-readable recording medium can be used, such as a memory card, a CD-ROM (compact disk read only memory), an optical disk, or a magneto-optical disk.

The network connection device 31 is connected to an arbitrary communication network such as a LAN (local area network) and performs the data conversion and other processing that communication requires. The information retrieval apparatus communicates with an external information provider's device 32 (a database or the like) via the network connection device 31. The programs and data described above can thus be received from the device 32 over the network as needed and loaded into the memory 19 for use.

Next, the modified 4BB encoding, 84BB encoding, and B24 encoding used in the data compression/decompression apparatus of the present invention will be described more specifically with reference to FIGS. 3 to 20.

In FIGS. 3 to 20, the variable Value represents the bit pattern of the original data, and the variable Bitbuf represents the encoded bit pattern. The operator ':=' assigns the value of the right-hand side to the left-hand side, and 'bitcopy(first argument, second argument)' copies the bit pattern of the second argument to the head of the first argument.

Likewise, 'bitcat(first argument, second argument)' appends the bit pattern of the second argument after the first argument, and 'read(first argument, second argument)' reads from the first argument the number of bits given by the second argument and converts them into a numerical value. All other notation follows the C language or standard mathematical symbols.
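The bitcat and read operations can be pictured with a small bit buffer. The following is a minimal sketch in C and is not the code of the figures: the BitBuf type, its fixed 64-byte capacity, the explicit bit-count parameter of bitcat, and the name bitread (used here instead of read to avoid clashing with the POSIX function) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t data[64]; /* backing store, cleared by bitbuf_init */
    size_t  len;      /* number of bits written so far */
    size_t  pos;      /* read cursor, in bits */
} BitBuf;

static void bitbuf_init(BitBuf *b) { memset(b, 0, sizeof *b); }

/* bitcat: append the low nbits bits of pattern, most significant bit first. */
static void bitcat(BitBuf *b, uint64_t pattern, unsigned nbits) {
    assert(nbits <= 64 && b->len + nbits <= 8 * sizeof b->data);
    while (nbits--) {
        if ((pattern >> nbits) & 1u)
            b->data[b->len >> 3] |= (uint8_t)(0x80u >> (b->len & 7));
        b->len++;
    }
}

/* read: consume nbits bits at the read cursor and return them as a number. */
static uint64_t bitread(BitBuf *b, unsigned nbits) {
    uint64_t v = 0;
    assert(nbits <= 64 && b->pos + nbits <= b->len);
    while (nbits--) {
        v = (v << 1) | ((b->data[b->pos >> 3] >> (7 - (b->pos & 7))) & 1u);
        b->pos++;
    }
    return v;
}
```

Appending the 2-bit pattern '10' and then the 6-bit pattern '010101' leaves the byte '10010101' in the buffer, and two 4-bit reads return '1001' and '0101' in turn.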

For comparison, an implementation example of the 4BB encoding and decoding processes is described first. FIG. 3 is a flowchart of the 4BB encoding process. When the process starts, the information retrieval apparatus first sets i = 11 (step S1), prepares an 8-bit temporary buffer Code[i], and puts the lower 3 bits of the bit pattern of Value into the lower half of Code[i] (step S2). Here, 'Value & 0x7' denotes the bitwise AND of Value and 0x7 = '0111'.

Next, Value is shifted right by 3 bits (step S3) and compared with 0 (step S4). If Value is greater than 0, i is decremented by 1 (step S5), the flag value 1 is prepended to the lower 3 bits of the shifted Value, and the result is put into the lower half of Code[i] (step S6). Here, '0x8 | (Value & 0x7)' denotes the bitwise OR of 0x8 = '1000' and the lower 3 bits of Value. Value is then shifted right by 3 bits (step S7), and the processing from step S4 onward is repeated.

When Value becomes 0 in step S4, i is compared with 12 (step S8). If i is less than 12, the 4 bits of data stored in the lower half of Code[i] are copied to the head of the free area of Bitbuf (step S9), and i is incremented by 1 (step S10).

The processing from step S8 onward is then repeated, and the process ends when i reaches 12 in step S8. The program code for this encoding process (written in C) is, for example, as shown in FIG. 4.
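Steps S1 to S10 can be sketched as follows. This is an illustrative reading of the flowchart, not the listing of FIG. 4: the function name bb4_encode and the choice to return the blocks as an array of 4-bit values (one block per array element) instead of writing them into Bitbuf are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes value as 4BB blocks. Writes one 4-bit block per element of out,
 * first block first, and returns the number of blocks (1 to 11). */
static size_t bb4_encode(uint32_t value, uint8_t out[11]) {
    uint8_t code[12];
    size_t i = 11, n = 0;                          /* S1 */

    assert(out != NULL);
    code[i] = (uint8_t)(value & 0x7);              /* S2: last block, flag 0 */
    value >>= 3;                                   /* S3 */
    while (value > 0) {                            /* S4 */
        i--;                                       /* S5 */
        code[i] = (uint8_t)(0x8 | (value & 0x7));  /* S6: flag 1 + 3 data bits */
        value >>= 3;                               /* S7 */
    }
    while (i < 12)                                 /* S8 */
        out[n++] = code[i++];                      /* S9, S10 */
    return n;
}
```

With this sketch the value 21 yields the two blocks '1010' and '0101': every block except the last carries the flag 1, and each block carries 3 data bits.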

FIG. 5 is a flowchart of the 4BB decoding process. When the process starts, the information retrieval apparatus first reads the first 4 bits of Bitbuf into Value (step S11) and compares that value with 0x7 (step S12). If Value is 0x7 or less, the leading flag is 0, which means that no subsequent block exists, so the process ends at once.

If Value is greater than 0x7, the leading flag is 1, which means that a subsequent block exists. Only the lower 3 bits of the bit pattern of Value are therefore kept as the new Value (step S13), and the next 4 bits in Bitbuf are read into the variable temp (step S14).

Next, Value is shifted left by 3 bits and the lower 3 bits of temp are added to it (step S15). This appends to Value the part of temp that remains after the leading flag is removed. The value of temp is then compared with 0x7 (step S16).

If temp is greater than 0x7, the processing from step S14 onward is repeated; if temp is 0x7 or less, the process ends. The bit pattern of Value at the end represents the original data corresponding to Bitbuf. The program code for this decoding process (written in C) is, for example, as shown in FIG. 6. In FIG. 6, getxbits(Bitbuf) denotes a function that reads an x-bit pattern from Bitbuf.
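Steps S11 to S16 can be sketched in the same style. This is an illustration, not the listing of FIG. 6: the function name bb4_decode, the array-of-blocks input (one 4-bit block per element), and the consumed output parameter are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one 4BB code from an array holding one 4-bit block per element.
 * If consumed is not NULL, it receives the number of blocks read. */
static uint32_t bb4_decode(const uint8_t *blocks, size_t *consumed) {
    size_t i = 0;
    uint32_t value;

    assert(blocks != NULL);
    value = blocks[i++];                           /* S11 */
    if (value > 0x7) {                             /* S12: flag 1, more blocks */
        uint8_t temp;
        value &= 0x7;                              /* S13 */
        do {
            temp = blocks[i++];                    /* S14 */
            value = (value << 3) + (temp & 0x7u);  /* S15 */
        } while (temp > 0x7);                      /* S16 */
    }
    if (consumed != NULL)
        *consumed = i;
    return value;
}
```

Note that every loop iteration has to read a block, strip its flag, and merge its 3 data bits, so the work grows with the number of blocks.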

Next, the modified 4BB encoding and decoding processes will be described. FIG. 7 is a flowchart of the modified 4BB encoding process. When the process starts, the information retrieval apparatus first computes $I2(\mathrm{Value}) = \lfloor \log_8 \mathrm{Value} \rfloor$ and takes I2(Value) + 1 as the value of the continuation flag Flag (step S21). Here, $I2(x) = \lfloor \log_8 x \rfloor$ denotes the largest integer that does not exceed $\log_8 x$.

Next, Flag is converted into a unary code and put into Bitbuf (step S22), Value is put after it (step S23), and the process ends.

Comparing FIG. 7 with FIG. 3 shows that the modified 4BB encoding process has far fewer steps than the 4BB encoding process. Modified 4BB encoding only prepends the continuation flag to the bit pattern of the original data, so few steps are needed and high-speed processing is achieved. In addition, when the value of the continuation flag is small, its unary code is easy to obtain.

The program code for the modified 4BB encoding process (written in C) is, for example, as shown in FIG. 8. In FIG. 8, an if-else if chain is used instead of actually computing I2(Value). This is possible because the ranges of Value and the corresponding values of I2(Value) are known in advance. Also, setxbits(Bitbuf, X) denotes a function that reads an x-bit pattern from X and writes it to Bitbuf.

For example, the decimal value 21 is encoded by the following procedure, which yields the code '10010101'.
1. Since I2(21) + 1 = 2, the continuation flag is 2 (step S21).
2. The continuation flag 2 is put into Bitbuf as the unary code '10' (step S22).
3. Following the continuation flag, the bit pattern '010101' of Value = 21 is put into Bitbuf (step S23).

Likewise, the decimal value 300 is encoded by the following procedure, which yields the code '110100101100'.
1. Since I2(300) + 1 = 3, the continuation flag is 3 (step S21).
2. The continuation flag 3 is put into Bitbuf as the unary code '110' (step S22).
3. Following the continuation flag, the bit pattern '100101100' of Value = 300 is put into Bitbuf (step S23).
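The two worked examples can be reproduced with a short sketch of steps S21 to S23. It is not the listing of FIG. 8: the Code type, the function name bb4mod_encode, and the choice to return the code as an integer plus a bit length are assumptions, and the value 0 is simply treated as a one-block code.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t bits; /* code, right-aligned */
    unsigned len;  /* code length in bits */
} Code;

/* Modified 4BB encoding: unary continuation flag followed by the value. */
static Code bb4mod_encode(uint32_t value) {
    Code c;
    unsigned flag = 1;                 /* S21: flag = I2(value) + 1 = block count */
    while (((uint64_t)value >> (3 * flag)) != 0)
        flag++;
    assert(flag <= 11);                /* a 32-bit value needs at most 11 blocks */

    /* S22: unary code of flag is (flag - 1) ones followed by a zero,
     * i.e. the flag-bit number 2^flag - 2.
     * S23: the value follows in the remaining 3 * flag bits. */
    c.bits = ((((uint64_t)1 << flag) - 2) << (3 * flag)) | value;
    c.len = 4 * flag;
    return c;
}
```

For 21 this gives the 8-bit code 0x95 ('10010101'), and for 300 the 12-bit code 0xD2C ('110100101100'), matching the procedures above.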

Next, FIG. 9 is a flowchart of the modified 4BB decoding process. When the process starts, the information retrieval apparatus first sets the variable CFlag, which represents the number of bits of the continuation flag, to 0 (step S31), reads the first 4 bits of Bitbuf into the variable Flag (step S32), and compares that value with 0x8 (step S33).

If Flag is less than 0x8, the first bit is 0, which means that no subsequent block exists. Flag is therefore assigned to Value (step S34), and the process ends.

If Flag is 0x8 or more, the first bit is 1, which means that one or more subsequent blocks exist. Flag is therefore compared with 0xf = '1111' to check whether the first subsequent block belongs to the continuation flag (step S35).

If Flag is 0xf, a further block of the continuation flag follows. So 4 is added to CFlag, and the next 4 bits in Bitbuf are read into Flag (step S36). The processing from step S35 onward is then repeated.

If Flag is less than 0xf, its bit pattern contains a 0, which shows that the continuation flag ends there. The position at which the first 0 appears in Flag, counted from the top, is therefore added to CFlag. In addition, a mask is applied to extract the bits below that position, and the result is assigned to Mask (step S37).

Next, ReadBit is set to four times CFlag minus the number of bits read from Bitbuf so far (step S38). Multiplying CFlag by 4 gives the total number of bits of the code held in Bitbuf, and subtracting the number of bits already read gives the number of bits that remain.

Next, Mask is shifted left by ReadBit bits and assigned to Value, the remaining ReadBit bits of data are read from Bitbuf and added to Value, and the process ends. The bit pattern of the original data, with the continuation flag removed, is thus obtained as Value.

Comparing FIG. 9 with FIG. 5 shows that the modified 4BB decoding process has slightly more steps than the 4BB decoding process. Within the loops of FIG. 5 and FIG. 9, however, 4BB decoding reads the continuation flag and the data part together, whereas modified 4BB decoding only has to read the continuation flag part. As a result, for codes of about one or two blocks the speed of modified 4BB decoding is much the same as that of 4BB decoding, but for codes with many blocks modified 4BB decoding is far faster.

The program code for the modified 4BB decoding process (written in C) is, for example, as shown in FIG. 10. In FIG. 10, an if-else if chain is used instead of actually computing the values of CFlag and ReadBit. This is possible because the ranges of Value and the corresponding values of ReadBit are known in advance. Also, the variable Value is used in place of the variable Flag, and the bitwise AND of Value and a mask bit pattern is used in place of the variable Mask.

For example, the code '10010101' given above is decoded by the following procedure, which yields the value 21.
1. The first 4 bits '1001' are read (step S32).
2. The continuation flag is '10', which occupies the upper 2 bits, so the lower 2 bits are masked to obtain Mask = '1001' & 0x3 = '0001' (step S37). Since ReadBit = 2 × 4 - 4 = 4, '0001' is shifted left by 4 bits and assigned to Value (step S38). This gives Value = '10000' = 16.
3. The next 4 bits (ReadBit bits), '0101' = 5, are read and added to Value (step S38). Thus Value = 16 + 5 = 21.

Likewise, the code '110100101100' given above is decoded by the following procedure, which yields the value 300.
1. The first 4 bits '1101' are read (step S32).
2. The continuation flag is '110', which occupies the upper 3 bits, so the lower 1 bit is masked to obtain Mask = '1101' & 0x1 = '0001' (step S37). Since ReadBit = 3 × 4 - 4 = 8, '0001' is shifted left by 8 bits and assigned to Value (step S38). This gives Value = '100000000' = 256.
3. The next 8 bits (ReadBit bits), '00101100' = 44, are read and added to Value (step S38). Thus Value = 256 + 44 = 300.
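Steps S31 to S38 can be traced with the sketch below. It is an illustration, not the listing of FIG. 10: the names read_bits and bb4mod_decode are hypothetical, the input is assumed to be a byte array read most significant bit first, and the early exit of steps S33 and S34 is folded into the general path (for a one-block code the first 0 is at position 1 and ReadBit is 0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads n bits, most significant first, starting at bit offset *pos. */
static uint64_t read_bits(const uint8_t *buf, size_t *pos, unsigned n) {
    uint64_t v = 0;
    while (n--) {
        v = (v << 1) | ((buf[*pos >> 3] >> (7 - (*pos & 7))) & 1u);
        (*pos)++;
    }
    return v;
}

/* Decodes one modified 4BB code starting at bit offset *pos. */
static uint32_t bb4mod_decode(const uint8_t *buf, size_t *pos) {
    size_t start;
    unsigned flag, cflag = 0, p = 1, readbit;      /* S31 */
    uint64_t value;

    assert(buf != NULL && pos != NULL);
    start = *pos;
    flag = (unsigned)read_bits(buf, pos, 4);       /* S32 */
    while (flag == 0xF) {                          /* S35 */
        cflag += 4;                                /* S36 */
        flag = (unsigned)read_bits(buf, pos, 4);
    }
    while (flag & (0x8u >> (p - 1)))               /* position of first 0 */
        p++;
    cflag += p;                                    /* S37 */
    value = flag & ((1u << (4 - p)) - 1u);         /* Mask: bits below that 0 */
    readbit = 4 * cflag - (unsigned)(*pos - start);            /* S38 */
    value = (value << readbit) + read_bits(buf, pos, readbit);
    return (uint32_t)value;
}
```

Only the flag nibbles pass through the loop; the data part is fetched in a single read of ReadBit bits, which is the property the comparison with 4BB decoding relies on.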

Next, the 84BB encoding and decoding processes will be described. FIG. 11 is a flowchart of the 84BB encoding process. When the process starts, the information retrieval apparatus first compares Value with 128 (step S41). If Value is less than 128, it is copied to Bitbuf (step S42), and the process ends.

If Value is 128 or more, I2(Value) is obtained using the function I2(x) described above and is taken as the value of the continuation flag Flag (step S43). Next, Flag is converted into a unary code and put into Bitbuf (step S44), Value is put after it (step S45), and the process ends.

Comparing FIG. 11 with FIG. 7 shows that the 84BB encoding process has only one more conditional test than the modified 4BB encoding process, so its processing speed is almost the same as that of modified 4BB encoding.

The program code for the 84BB encoding process (written in C) is, for example, as shown in FIG. 12. In FIG. 12, as in modified 4BB encoding, an if-else if chain is used instead of actually computing I2(Value).

For example, the value 300 used above is encoded by the following procedure, which yields the code '100100101100'.
1. Since I2(300) = 2, the continuation flag is 2 (step S43).
2. The continuation flag 2 is put into Bitbuf as the unary code '10' (step S44).
3. Following the continuation flag, the bit pattern '0100101100' of Value = 300 is put into Bitbuf (step S45).
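The example can be reproduced with the following sketch of steps S41 to S45. It is not the listing of FIG. 12: the Code type and the name bb84_encode are hypothetical, the flag is computed literally as I2(Value), and the total length of 4 × (flag + 1) bits is an assumption inferred from the 12-bit example and from the ReadBit formula of the decoding process.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t bits; /* code, right-aligned */
    unsigned len;  /* code length in bits */
} Code;

/* 84BB encoding: one 8-bit block for values below 128, otherwise a unary
 * continuation flag followed by the value. */
static Code bb84_encode(uint32_t value) {
    Code c;
    unsigned flag = 0, databits;
    uint32_t v;

    if (value < 128) {                 /* S41, S42: flag bit 0 + 7 data bits */
        c.bits = value;
        c.len = 8;
        return c;
    }
    for (v = value; v >= 8; v >>= 3)   /* S43: flag = I2(value) */
        flag++;
    c.len = 4 * (flag + 1);            /* assumed total length */
    assert(c.len <= 44);
    databits = c.len - flag;
    /* S44: unary flag, (flag - 1) ones then a zero; S45: the value. */
    c.bits = ((((uint64_t)1 << flag) - 2) << databits) | value;
    return c;
}
```

For 300 the sketch returns the 12-bit code 0x92C ('100100101100'), and values below 128 are returned unchanged as a single 8-bit block.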

Next, FIG. 13 is a flowchart of the 84BB decoding process. When the process starts, the information retrieval apparatus first reads the first 8 bits of Bitbuf into the variable Flag (step S51) and compares that value with 128 (step S52). If Flag is less than 128, the first bit is 0, which means that no subsequent block exists. Flag is therefore assigned to Value (step S53), and the process ends.

If Flag is 128 or more, the first bit is 1, which means that one or more subsequent blocks exist. The variable CFlag, which represents the number of bits of the continuation flag, is therefore set to 0 (step S54), and Flag is compared with 0xff = '11111111' to check whether the first subsequent block belongs to the continuation flag (step S54a).

If Flag is 0xff, a further block of the continuation flag follows. So 8 is added to CFlag (step S54b), and the next 4 bits in Bitbuf are read into Flag (step S54c). Flag is then compared with 0xf = '1111' to check whether the block after the one just read belongs to the continuation flag (step S55).

If Flag is 0xf, a further block of the continuation flag follows. So 4 is added to CFlag, and the next 4 bits in Bitbuf are read into Flag (step S56). The processing from step S55 onward is then repeated.

When Flag is less than 0xff in step S54a, or less than 0xf in step S55, its bit pattern contains a 0, which shows that the continuation flag ends there. The position at which the first 0 appears in Flag, counted from the top, is therefore added to CFlag. In addition, a mask is applied to extract the bits below that position, and the result is assigned to Mask (step S57).

Next, ReadBit is set to four times (CFlag + 1) minus the number of bits read from Bitbuf so far (step S58). Mask is then shifted left by ReadBit bits and assigned to Value, the remaining ReadBit bits of data are read from Bitbuf and added to Value, and the process ends. The bit pattern of the original data, with the continuation flag removed, is thus obtained as Value.

The program code for this 84BB decoding process (written in C) is, for example, as shown in FIG. 14. In FIG. 14, as in modified 4BB decoding, an if-else if chain is used instead of actually computing the values of CFlag and ReadBit. Also, the variable Value is used in place of the variable Flag, and the bitwise AND of Value and a mask bit pattern is used in place of the variable Mask.

For example, the code '100100101100' given above is decoded by the following procedure, which yields the value 300.
1. The first 8 bits '10010010' are read (step S51).
2. The continuation flag is '10', which occupies the upper 2 bits, so the lower 6 bits are masked to obtain Mask = '10010010' & 0x3f = '00010010' (step S57). Since ReadBit = (2 + 1) × 4 - 8 = 4, '00010010' is shifted left by 4 bits and assigned to Value (step S58). This gives Value = '100100000' = 288.
3. The next 4 bits (ReadBit bits), '1100' = 12, are read and added to Value (step S58). Thus Value = 288 + 12 = 300.
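Steps S51 to S58 can be traced with the sketch below. It is an illustration, not the listing of FIG. 14: the names read_bits and bb84_decode are hypothetical, and the input is assumed to be a byte array read most significant bit first. The variable width records whether Flag currently holds the initial 8-bit block or a later 4-bit block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads n bits, most significant first, starting at bit offset *pos. */
static uint64_t read_bits(const uint8_t *buf, size_t *pos, unsigned n) {
    uint64_t v = 0;
    while (n--) {
        v = (v << 1) | ((buf[*pos >> 3] >> (7 - (*pos & 7))) & 1u);
        (*pos)++;
    }
    return v;
}

/* Decodes one 84BB code starting at bit offset *pos. */
static uint32_t bb84_decode(const uint8_t *buf, size_t *pos) {
    size_t start;
    unsigned flag, width = 8, cflag = 0, p = 1, readbit;
    uint64_t value;

    assert(buf != NULL && pos != NULL);
    start = *pos;
    flag = (unsigned)read_bits(buf, pos, 8);       /* S51 */
    if (flag < 128)                                /* S52 */
        return flag;                               /* S53 */
    while (flag == (1u << width) - 1u) {           /* S54a, S55: all ones */
        cflag += width;                            /* S54b, S56 */
        width = 4;
        flag = (unsigned)read_bits(buf, pos, 4);   /* S54c, S56 */
    }
    while (flag & (1u << (width - p)))             /* position of first 0 */
        p++;
    cflag += p;                                    /* S57 */
    value = flag & ((1u << (width - p)) - 1u);     /* Mask */
    readbit = 4 * (cflag + 1) - (unsigned)(*pos - start);      /* S58 */
    value = (value << readbit) + read_bits(buf, pos, readbit);
    return (uint32_t)value;
}
```

Decoding the bytes 0x92 0xC0 follows the worked example exactly: Mask is '010010', ReadBit is 4, and the result is 288 + 12 = 300.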

Next, the B24 encoding and decoding processes will be described. FIG. 15 is a flowchart of the B24 encoding process. When the process starts, the information retrieval apparatus first compares Value with 1 (step S61). If Value = 1, the bit pattern '00' is put into Bitbuf (step S62), and the process ends.

If Value is not 1, it is next compared with 2 (step S63). If Value = 2, the bit pattern '01' is put into Bitbuf (step S64), and the process ends.

If Value is not 2, it is next compared with 7 (step S65). If Value is less than 7, the bit pattern '10' is put into Bitbuf as the continuation flag, followed by the bit pattern of (Value - 3) (step S66), and the process ends.

If Value is 7 or more, I2(Value) is obtained using the function I2(x) described above, and I2(Value) + 2 is assigned to the variable Flag (step S67). Flag is then converted into a unary code and put into Bitbuf, followed by the bit pattern of (Value - 7), and the process ends.
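Steps S61 to S67 can be sketched as follows. This is not the listing of FIG. 16: the Code type and the name b24_encode are hypothetical. Taken literally, the expression I2(Value) + 2 gives a flag of 2 for Value = 7, whereas the example table lists '1100 0000' for 7 and a three-block code for 39. The sketch therefore makes a different, explicitly assumed choice: it picks the smallest flag (3 or more) whose data field can hold Value - 7, which reproduces the table entries and the code for 21.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t bits; /* code, right-aligned */
    unsigned len;  /* code length in bits */
} Code;

/* B24 encoding of a value of 1 or more. */
static Code b24_encode(uint32_t value) {
    Code c;
    unsigned flag = 3;
    uint64_t rest;

    assert(value >= 1);
    if (value <= 2) {                  /* S61 to S64: '00' or '01' */
        c.bits = value - 1;
        c.len = 2;
        return c;
    }
    if (value < 7) {                   /* S65, S66: '10' + (value - 3) */
        c.bits = 0x8u | (value - 3);
        c.len = 4;
        return c;
    }
    rest = (uint64_t)value - 7;        /* S67: the data part is value - 7 */
    /* Assumed rule: a flag of `flag` bits leaves 3*flag - 4 data bits. */
    while ((rest >> (3 * flag - 4)) != 0)
        flag++;
    c.len = 4 * (flag - 1);
    c.bits = ((((uint64_t)1 << flag) - 2) << (3 * flag - 4)) | rest;
    return c;
}
```

Under this assumption 7 encodes to 0xC0 ('1100 0000'), 21 to 0xCE ('11001110'), and 39 to 0xE20 ('1110 0010 0000').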

Comparing FIG. 15 with FIG. 7 shows that, when Value is 1 or 2, the B24 encoding process has fewer steps than the modified 4BB encoding process and is therefore faster. The processing of step S67 in FIG. 15 corresponds to that of steps S21, S22, and S23 in FIG. 7. Consequently, even when Value is 7 or more, only three conditional tests are added compared with the modified 4BB encoding process, so the processing speed does not drop much. The same applies when Value is 3 to 6.

The program code for the B24 encoding process (written in C) is, for example, as shown in FIG. 16. In FIG. 16, as in the modified 4BB encoding described above, an if-else if chain is used instead of actually computing I2(Value). For example, the value 21 used above is encoded by the following procedure, which yields the code '11001110'.
1. Since I2(21) + 2 = 3, the continuation flag is 3 (step S67).
2. The continuation flag 3 is put into Bitbuf as the unary code '110' (step S67).
3. Following the continuation flag, the bit pattern '01110' of Value - 7 = 21 - 7 = 14 is put into Bitbuf (step S67).

Next, FIG. 17 is a flowchart of the B24 decoding process. When the process starts, the information retrieval apparatus first reads the first 2 bits of Bitbuf into the variable Flag (step S71) and compares that value with 1 (step S72).

Ｆｌａｇが１以下の場合は、先頭のビットが０であり、後続するビットパターンが存在しないことを意味する。そこで、（Ｆｌａｇ＋１）のビットパターンをＶａｌｕｅに代入して（ステップＳ７３）、処理を終了する。 When Flag is 1 or less, it means that the first bit is 0 and there is no subsequent bit pattern. Therefore, the bit pattern of (Flag + 1) is substituted for Value (step S73), and the process ends.

If Flag is greater than 1, the leading bit is 1 and a further bit pattern follows. Flag is therefore next compared with 2 (step S74).
If Flag is 2, the following bit pattern is 2 bits long. The remaining 2 bits of Bitbuf are read into the variable Value, 3 is added (step S75), and the process ends.

If Flag is greater than 2, it is '11', which means that one or more blocks follow. Flag = '11' is therefore shifted left by 2 bits, and the next 2 bits of Bitbuf are read and added to Flag (step S76). The variable CFlag, which represents the number of bits in the continuation flag, is then set to 0, and Flag is compared with 0xf = '1111' to check whether the first following block corresponds to the continuation flag (step S78).

If Flag is 0xf, further continuation-flag blocks follow. 4 is therefore added to CFlag, and the next 4 bits in Bitbuf are read into Flag (step S79). The processing from step S78 onward is then repeated.

If Flag is less than 0xf, its bit pattern contains a 0, so the continuation flag ends within this block. The rank of the position at which 0 first appears in Flag, counted from the most significant bit, is added to CFlag. The bits below that position are then extracted with a mask and assigned to Mask (step S80).

Next, ReadBit is set to the value obtained by subtracting the number of bits read from Bitbuf so far from four times (CFlag - 1) (step S81). Mask is then shifted left by ReadBit bits and assigned to Value, the remaining ReadBit bits of data are read from Bitbuf, and they are added to Value. This yields, as Value, the bit pattern of the code with the continuation flag removed. To recover the original data, 7 is further added to Value, and the process ends.

Comparing FIG. 17 with FIG. 9 shows that the B24 decoding process has one more condition determination than the modified 4BB decoding process, but the processing speed does not drop significantly.
The program code for the B24 decoding process (written in C) is, for example, as shown in FIG. 18. As in the modified 4BB decoding described above, FIG. 18 uses an if-else if chain in place of actually computing the values of CFlag and ReadBit. In addition, the variable Value is used in place of the variable Flag, and the logical AND of Value and a mask bit string is used in place of the variable Mask.

For example, the code '11001110' above is decoded by the following procedure, yielding the value 21.
1. The first 2 bits '11' are read into Flag (step S71).

2. Since the value read is 3, Flag is shifted left by 2 bits, the next 2 bits '00' are read (step S76), and the continuation flag is checked (step S78). This shows that the continuation flag is '110'.

3. Since the continuation flag ends at the third bit, the one remaining bit after the continuation flag is masked, giving Mask = '1100' & 0x1 = '0000'. Since ReadBit = (3 - 1) × 4 - 4 = 4, '0000' is shifted left by 4 bits and assigned to Value (step S81).

4. The next 4 bits (ReadBit bits), '1110' = 14, are read and added to Value, and 7 is then added (step S81). Thus Value = 14 + 7 = 21.
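The decoding flow of FIG. 17 can be sketched in C as below. This follows the flowchart steps literally rather than the if-else if form of FIG. 18, and it is only an illustration: the helper `get_bits`, the function name `b24_decode`, and the '0'/'1' string standing in for Bitbuf are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read nbits bits (MSB first) from the '0'/'1' string code, advancing *pos. */
static uint64_t get_bits(const char *code, size_t *pos, int nbits) {
    uint64_t v = 0;
    for (int i = 0; i < nbits; i++) {
        assert(code[*pos] == '0' || code[*pos] == '1');
        v = (v << 1) | (uint64_t)(code[(*pos)++] - '0');
    }
    return v;
}

/* Decode one B24 code starting at *pos and return the original value. */
static uint32_t b24_decode(const char *code, size_t *pos) {
    size_t start = *pos;
    uint64_t flag = get_bits(code, pos, 2);               /* S71 */
    if (flag <= 1)                                         /* S72, S73 */
        return (uint32_t)(flag + 1);
    if (flag == 2)                                         /* S74, S75 */
        return (uint32_t)(get_bits(code, pos, 2) + 3);
    flag = (flag << 2) | get_bits(code, pos, 2);           /* S76 */
    int cflag = 0;
    while (flag == 0xf) {                                  /* S78, S79 */
        cflag += 4;
        flag = get_bits(code, pos, 4);
    }
    int rank = 1;                                          /* S80: first 0 from the MSB */
    while (flag & (1u << (4 - rank)))
        rank++;
    cflag += rank;
    uint64_t mask = flag & ((1u << (4 - rank)) - 1);       /* bits below the 0 */
    int read_bit = (cflag - 1) * 4 - (int)(*pos - start);  /* S81 */
    uint64_t value = (mask << read_bit) | get_bits(code, pos, read_bit);
    return (uint32_t)(value + 7);
}
```

Because `*pos` is advanced past the code that was read, successive calls can walk through a concatenation of B24 codes.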

FIG. 19 compares the numbers of bits required by the modified 4BB encoding, 84BB encoding, and B24 encoding described above with those required by 4BB encoding, γ-coding, and δ-coding.

FIG. 19 shows that most of the encoding methods represent small values with few bits and large values with many bits. Which encoding achieves the highest compression ratio for a given sequence of values depends heavily on which range of values occurs most often in that sequence.

For example, B24 encoding is expected to give the highest compression ratio for a sequence in which mostly the values 1 and 2 occur, and 84BB encoding can be expected to give the highest compression ratio for a sequence in which values near 512 occur frequently.

When 4294967295, the largest value that can be represented in 32 bits, is encoded by each method, the resulting bit patterns and bit counts are as shown in FIG. 20.
The word appearance positions in a document that are used in an index are not necessarily small values, and they are often quite large even when difference information is used. For such large values, none of the encoding methods achieves good compression efficiency.

In the present invention, therefore, numerical data such as the original word appearance positions in a document are divided by a suitable integer, which coarsens the granularity of the information and converts the data into smaller intermediate values. As FIG. 19 shows, small values compress well under any of the encoding methods. Encoding that uses this conversion is called Per encoding, and in particular the case where the divisor is n is called Per(n) encoding.

FIG. 21 is a flowchart of the Per encoding process. When the process starts, the information retrieval apparatus first reads the original data into the variable num (step S91) and divides it by a predetermined divisor Per (step S92).

The value of Per used for the division is preferably chosen from values for which a fast shift instruction can be used. For example, values such as 2, 4, 8, 16, 32, and 64 are used as Per. The fractional part of the resulting quotient is discarded, and the integer part is taken as the intermediate value and assigned back to num.

Next, the value of num is encoded (step S93), and the process ends. Any encoding method may be used in step S93. However, when an encoding method that cannot represent 0 is used, such as B24 encoding, and the num obtained in step S92 is 0, 1 is added to it before encoding.
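Steps S91 and S92, together with the special case for 0, might look like the following in C. This is a sketch under stated assumptions: the function name `per_quantize`, the `shift` parameter (Per = 2^shift, so that the division becomes a right shift), and the `zero_unrepresentable` switch are illustrative and do not appear in the patent.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Coarsen num by dividing it by Per = 2^shift and discarding the fraction.
 * If the following encoder cannot represent 0 (for example B24), a zero
 * quotient is replaced by 1 before it is passed on to step S93.
 */
static uint32_t per_quantize(uint32_t num, unsigned shift, int zero_unrepresentable) {
    assert(shift < 32);
    num >>= shift;                         /* S92: integer part of num / Per */
    if (num == 0 && zero_unrepresentable)
        num = 1;                           /* avoid handing 0 to the encoder */
    return num;
}
```

For Per(2) the shift is 1, so `per_quantize(21, 1, 1)` gives the intermediate value 10 that is then encoded.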

An example in which B24 encoding is used as the encoding method in step S93 is described here. With Per(2) encoding, in which the divisor is 2, the value 21 above is encoded by the following procedure, yielding the code '11000011'.

1. 21 is divided by 2, and the fractional part of the quotient is discarded. This gives num = 10 (step S92).
2. Since I2(10) + 2 = 3, the continuation flag is 3 (FIG. 15, step S67).
3. The continuation flag 3 is written to Bitbuf as the unary code '110' (step S67).
4. Following the continuation flag, the 5-bit pattern '00011' for Value - 7 = 10 - 7 = 3 is written to Bitbuf (step S67).

Next, FIG. 22 is a flowchart of the Per decoding process. When the process starts, the information retrieval apparatus first decodes the Per code and reads the resulting value into the variable num (step S102). In step S102, the decoding method corresponding to the encoding method used in step S93 of FIG. 21 is used. Next, num is multiplied by the divisor Per (step S103), the result is returned to the calling program (step S104), and the process ends.

In general, the numerical data obtained by Per decoding do not necessarily match the original data. For example, decoding the code '11000011' corresponding to the value 21 above with the B24 decoding process yields the value 10 (step S102). Multiplying this value by Per = 2, however, gives num = 20 (step S103), which is not the original value. Per encoding is therefore useful when only an approximate value of the original data needs to be reproduced, as with word appearance positions in a document.
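The restoring multiplication of step S103 can be sketched as a left shift, again assuming Per = 2^shift. The function name `per_restore` and its parameters are illustrative assumptions. When the quotient was non-zero, the restored value never exceeds the original and falls short of it by at most Per - 1.

```c
#include <assert.h>
#include <stdint.h>

/* Restore the granularity of a decoded intermediate value: num * 2^shift. */
static uint32_t per_restore(uint32_t num, unsigned shift) {
    assert(shift < 32);
    assert(num <= (UINT32_MAX >> shift));  /* the product must fit in 32 bits */
    return num << shift;                   /* S103: multiply by Per */
}
```

With the example above, `per_restore(10, 1)` returns 20, one less than the original 21.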

Every encoding method used for index compression represents small values with few bits and large values with many bits. Because Per encoding converts large values into small ones before encoding, a correspondingly greater compression effect can be expected.

An index structure that includes information on word appearance positions in a document has the form [document number, word appearance frequency in document, number of word appearance position areas in document, word appearance position in document, ...]. Here, the number of word appearance position areas in the document describes the size of the data area holding the word appearance positions that follow.

The document number can be expressed as the difference between that document number and the document number of the preceding entry, and a word appearance position can likewise be expressed as the difference between that position and the preceding word appearance position in the same document. However, the word appearance frequencies and the numbers of word appearance position areas are not arranged in ascending order, so differences cannot be taken for them.
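The difference representation described above can be illustrated with a short C sketch. The function names `delta_encode` and `delta_decode` are hypothetical, and the sketch assumes the input sequence (document numbers, or positions within one document) is strictly ascending, so that every gap is at least 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace an ascending sequence by its first value followed by the gaps. */
static void delta_encode(uint32_t *v, size_t n) {
    for (size_t i = n; i > 1; i--) {       /* back to front keeps predecessors intact */
        assert(v[i - 1] > v[i - 2]);       /* assumed: strictly ascending input */
        v[i - 1] -= v[i - 2];
    }
}

/* Undo delta_encode by accumulating the gaps. */
static void delta_decode(uint32_t *v, size_t n) {
    for (size_t i = 1; i < n; i++)
        v[i] += v[i - 1];
}
```

Frequencies and area counts are not ascending, so they would be passed to the encoder unchanged.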

In an ordinary information retrieval apparatus, the number of word appearance position areas in a document is expressed in units of bits. In the encoding methods of the present invention, however, the smallest block is 2 bits or 4 bits, so this number can be expressed in units of the smallest block.

For example, if 200 bits are needed to represent the appearance position information of a certain key, the area count is 200 in units of bits, but 100 in units of 2 bits and 50 in units of 4 bits. Since the number of word appearance position areas is encoded along with the other values, expressing it as a smaller value can be expected to improve the compression ratio of the index.
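The unit conversion in this example amounts to a single division. The sketch below assumes a hypothetical helper `area_count` and assumes the position area is always a whole number of blocks, which holds when every code in it is built from blocks of that size.

```c
#include <assert.h>
#include <stdint.h>

/* Express the size of a position area in units of the smallest block. */
static uint32_t area_count(uint32_t area_bits, uint32_t block_bits) {
    assert(block_bits == 1 || block_bits == 2 || block_bits == 4);
    assert(area_bits % block_bits == 0);   /* assumed: whole number of blocks */
    return area_bits / block_bits;
}
```

For a 200-bit area this gives 200, 100, and 50 for block sizes of 1, 2, and 4 bits respectively.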

When the word appearance frequency in the document is used in place of the number of word appearance position areas, the index structure has the form [document number, word appearance frequency in document, word appearance position in document, ...]. Here, the word appearance frequency represents the number of word appearance positions that follow.

In this case, however, all the word appearance positions that follow the word appearance frequency must be decoded in order to extract the document number of the next entry. When the number of word appearance position areas is included, by contrast, the word appearance positions need not be decoded; it suffices to access the location that lies ahead by the number of bits calculated from that area count.
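The skip described here reduces to one multiplication and one addition. In the sketch below, the function name `next_entry_offset` and its parameters are assumptions for illustration: `positions_start_bit` is the bit offset at which the position area begins, and `block_bits` is the unit (2 or 4 bits) in which the area count was recorded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit offset of the next entry, skipping the position area without decoding it. */
static size_t next_entry_offset(size_t positions_start_bit,
                                uint32_t area_count, uint32_t block_bits) {
    assert(block_bits == 2 || block_bits == 4);
    return positions_start_bit + (size_t)area_count * block_bits;
}
```

A search that needs only document numbers can therefore jump from entry to entry at a cost that does not depend on how many positions each entry holds.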

Next, examples of index structures that use combinations of the encoding methods described above are explained with reference to FIGS. 23 to 37. Depending on its use, the index takes one of the following five structures.

First index structure: [document number]
Second index structure: [document number, word appearance frequency in document]
Third index structure: [document number, word appearance frequency in document, word appearance position in document, ...]
Fourth index structure: [document number, number of word appearance position areas in document, word appearance position in document, ...]
Fifth index structure: [document number, word appearance frequency in document, number of word appearance position areas in document, word appearance position in document, ...]
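The five layouts differ only in which optional fields accompany the document number, which can be summarized in a small C sketch. The type and function names (`index_layout`, `layout_fields`, `layout_fields_of`) are hypothetical and are not taken from the patent.

```c
#include <assert.h>
#include <stdbool.h>

/* The five index structures, numbered as in the list above. */
enum index_layout { LAYOUT_1 = 1, LAYOUT_2, LAYOUT_3, LAYOUT_4, LAYOUT_5 };

/* Which optional fields follow the document number in each entry. */
struct layout_fields {
    bool frequency;    /* word appearance frequency in document */
    bool area_count;   /* number of word appearance position areas */
    bool positions;    /* list of word appearance positions */
};

static struct layout_fields layout_fields_of(enum index_layout layout) {
    assert(layout >= LAYOUT_1 && layout <= LAYOUT_5);
    struct layout_fields f;
    f.frequency  = (layout == LAYOUT_2 || layout == LAYOUT_3 || layout == LAYOUT_5);
    f.area_count = (layout == LAYOUT_4 || layout == LAYOUT_5);
    f.positions  = (layout >= LAYOUT_3);
    return f;
}
```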
For example, rewriting the index structure of FIG. 40 using the fifth index structure gives the representation shown in FIG. 23. The number of word appearance position areas in the document is written at each position marked '?'. The examples below use the values of FIG. 23 as the original data, expressed in decimal.

FIG. 24 shows the bit patterns of the modified 4BB code and the B24 code, together with the total number of bits of each, when the first index structure is used. Since the first index structure consists of document numbers only, Per encoding is not used.

FIG. 25 shows the bit patterns and total numbers of bits of the codes when the second index structure is used. Here, Per(2) encoding is applied only to the word appearance frequency in the document, and the integer part of (word appearance frequency in document / 2) is encoded. When that integer part is 0, the value 1 is encoded instead.

Various combinations of the encoding methods described above are possible for encoding the document number and the word appearance frequency in the document. Encoding results are shown here for the following six combinations.

1. Document number: modified 4BB, word appearance frequency in document: modified 4BB (FIG. 25, combination 1)
2. Document number: modified 4BB, word appearance frequency in document: modified 4BB + Per(2) (FIG. 25, combination 2)
3. Document number: modified 4BB, word appearance frequency in document: B24 + Per(2) (FIG. 25, combination 3)
4. Document number: B24, word appearance frequency in document: modified 4BB (FIG. 25, combination 4)
5. Document number: B24, word appearance frequency in document: modified 4BB + Per(2) (FIG. 25, combination 5)
6. Document number: B24, word appearance frequency in document: B24 + Per(2) (FIG. 25, combination 6)
FIGS. 26, 27, and 28 show the bit patterns and total numbers of bits of the codes when the third index structure is used. Here, Per(16) encoding is applied only to the word appearance position in the document, and the integer part of (word appearance position in document / 16) is encoded. When that integer part is 0, the value 1 is encoded instead.

In the third index structure, the number of word appearance positions in the document is the word appearance frequency in the document, so Per encoding cannot be applied to the word appearance frequency alone. Encoding results are therefore shown for the following 24 combinations.

1. Document number: modified 4BB, word appearance frequency in document: modified 4BB, word appearance position in document: modified 4BB (FIG. 26, combination 1)
2. Document number: modified 4BB, word appearance frequency in document: modified 4BB, word appearance position in document: modified 4BB + Per(16) (FIG. 26, combination 2)
3. Document number: modified 4BB, word appearance frequency in document: modified 4BB, word appearance position in document: B24 (FIG. 26, combination 3)
4. Document number: modified 4BB, word appearance frequency in document: modified 4BB, word appearance position in document: B24 + Per(16) (FIG. 26, combination 4)
5. Document number: modified 4BB, word appearance frequency in document: B24, word appearance position in document: modified 4BB (FIG. 26, combination 5)
6. Document number: modified 4BB, word appearance frequency in document: B24, word appearance position in document: modified 4BB + Per(16) (FIG. 26, combination 6)
7. Document number: modified 4BB, word appearance frequency in document: B24, word appearance position in document: B24 (FIG. 26, combination 7)
8. Document number: modified 4BB, word appearance frequency in document: B24, word appearance position in document: B24 + Per(16) (FIG. 26, combination 8)
9. Document number: B24, word appearance frequency in document: modified 4BB, word appearance position in document: modified 4BB (FIG. 27, combination 9)
10. Document number: B24, word appearance frequency in document: modified 4BB, word appearance position in document: modified 4BB + Per(16) (FIG. 27, combination 10)
11. Document number: B24, word appearance frequency in document: modified 4BB, word appearance position in document: B24 (FIG. 27, combination 11)
12. Document number: B24, word appearance frequency in document: modified 4BB, word appearance position in document: B24 + Per(16) (FIG. 27, combination 12)
13. Document number: B24, word appearance frequency in document: B24, word appearance position in document: modified 4BB (FIG. 27, combination 13)
14. Document number: B24, word appearance frequency in document: B24, word appearance position in document: modified 4BB + Per(16) (FIG. 27, combination 14)
15. Document number: B24, word appearance frequency in document: B24, word appearance position in document: B24 (FIG. 27, combination 15)
16. Document number: B24, word appearance frequency in document: B24, word appearance position in document: B24 + Per(16) (FIG. 27, combination 16)
17. Document number: modified 4BB, word appearance frequency in document: modified 4BB, word appearance position in document: 84BB (FIG. 28, combination 17)
18. Document number: modified 4BB, word appearance frequency in document: modified 4BB, word appearance position in document: 84BB + Per(16) (FIG. 28, combination 18)
19. Document number: modified 4BB, word appearance frequency in document: B24, word appearance position in document: 84BB (FIG. 28, combination 19)
20. Document number: modified 4BB, word appearance frequency in document: B24, word appearance position in document: 84BB + Per(16) (FIG. 28, combination 20)
21. Document number: B24, word appearance frequency in document: modified 4BB, word appearance position in document: 84BB (FIG. 28, combination 21)
22. Document number: B24, word appearance frequency in document: modified 4BB, word appearance position in document: 84BB + Per(16) (FIG. 28, combination 22)
23. Document number: B24, word appearance frequency in document: B24, word appearance position in document: 84BB (FIG. 28, combination 23)
24. Document number: B24, word appearance frequency in document: B24, word appearance position in document: 84BB + Per(16) (FIG. 28, combination 24)
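The count of 24 can be checked from the list itself: the document number and the word appearance frequency each use one of two codes (modified 4BB or B24), and the word appearance position uses one of three codes (modified 4BB, B24, or 84BB), each with or without Per(16).

```latex
\underbrace{2}_{\text{document number}}
\times \underbrace{2}_{\text{frequency}}
\times \underbrace{3 \times 2}_{\text{position, with or without Per(16)}}
= 24
```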
FIGS. 29, 30, and 31 show the bit patterns and total numbers of bits of the codes when the fourth index structure is used. Here, Per(16) encoding is applied only to the word appearance position in the document, and the integer part of (word appearance position in document / 16) is encoded. When that integer part is 0, the value 1 is encoded instead.

When the word appearance positions in the document are encoded with modified 4BB encoding, the unit of the number of word appearance position areas is 4 bits; when they are encoded with B24 encoding, the unit is 2 bits. The number of bits in parentheses in the code-system column indicates the size of this unit block, and the number in parentheses to the right of each bit pattern is the decimal value corresponding to that bit pattern. Encoding results are shown here for the following 24 combinations.

1. Document number: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB (FIG. 29, combination 1)
2. Document number: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB + Per(16) (FIG. 29, combination 2)
3. Document number: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 (FIG. 29, combination 3)
4. Document number: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 + Per(16) (FIG. 29, combination 4)
5. Document number: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB (FIG. 29, combination 5)
6. Document number: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB + Per(16) (FIG. 29, combination 6)
7. Document number: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: B24 (FIG. 29, combination 7)
8. Document number: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: B24 + Per(16) (FIG. 29, combination 8)
9. Document number: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB (FIG. 30, combination 9)
10. Document number: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB + Per(16) (FIG. 30, combination 10)
11. Document number: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 (FIG. 30, combination 11)
12. Document number: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 + Per(16) (FIG. 30, combination 12)
13. Document number: B24, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB (FIG. 30, combination 13)
14. Document number: B24, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB + Per(16) (FIG. 30, combination 14)
15. Document number: B24, number of word appearance position areas in document: B24, word appearance position in document: B24 (FIG. 30, combination 15)
16. Document number: B24, number of word appearance position areas in document: B24, word appearance position in document: B24 + Per(16) (FIG. 30, combination 16)
17. Document number: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB (FIG. 31, combination 17)
18. Document number: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB + Per(16) (FIG. 31, combination 18)
19. Document number: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: 84BB (FIG. 31, combination 19)
20. Document number: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: 84BB + Per(16) (FIG. 31, combination 20)
21. Document number: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB (FIG. 31, combination 21)
22. Document number: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB + Per(16) (FIG. 31, combination 22)
23. Document number: B24, number of word appearance position areas in document: B24, word appearance position in document: 84BB (FIG. 31, combination 23)
24. Document number: B24, number of word appearance position areas in document: B24, word appearance position in document: 84BB + Per(16) (FIG. 31, combination 24)
FIGS. 32, 33, 34, 35, 36, and 37 show the bit patterns and total numbers of bits of the codes when the fifth index structure is used. Here, Per(16) encoding is applied only to the word appearance position in the document, and the integer part of (word appearance position in document / 16) is encoded. When that integer part is 0, the value 1 is encoded instead.

The number of bits in parentheses in the code-system column indicates the size of the unit of the number of word appearance position areas in the document, and the number in parentheses to the right of each bit pattern is the decimal value corresponding to that bit pattern. Encoding results are shown here for the following 48 combinations.

1. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB (FIG. 32, combination 1)
2. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB + Per(16) (FIG. 32, combination 2)
3. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 (FIG. 32, combination 3)
4. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 + Per(16) (FIG. 32, combination 4)
5. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB (FIG. 32, combination 5)
6. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB + Per(16) (FIG. 32, combination 6)
7. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: B24 (FIG. 32, combination 7)
8. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: B24 + Per(16) (FIG. 32, combination 8)
9. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB (FIG. 33, combination 9)
10. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB + Per(16) (FIG. 33, combination 10)
11. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 (FIG. 33, combination 11)
12. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 + Per(16) (FIG. 33, combination 12)
13. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB (FIG. 33, combination 13)
14. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB + Per(16) (FIG. 33, combination 14)
15. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: B24 (FIG. 33, combination 15)
16. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: B24 + Per(16) (FIG. 33, combination 16)
17. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB (FIG. 34, combination 17)
18. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB + Per(16) (FIG. 34, combination 18)
19. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 (FIG. 34, combination 19)
20. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 + Per(16) (FIG. 34, combination 20)
21. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB (FIG. 34, combination 21)
22. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB + Per(16) (FIG. 34, combination 22)
23. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: B24 (FIG. 34, combination 23)
24. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: B24 + Per(16) (FIG. 34, combination 24)
25. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB (FIG. 35, combination 25)
26. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: modified 4BB + Per(16) (FIG. 35, combination 26)
27. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 (FIG. 35, combination 27)
28. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: B24 + Per(16) (FIG. 35, combination 28)
29. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB (FIG. 35, combination 29)
30. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: modified 4BB + Per(16) (FIG. 35, combination 30)
31. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: B24 (FIG. 35, combination 31)
32. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: B24 + Per(16) (FIG. 35, combination 32)
33. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB (FIG. 36, combination 33)
34. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB + Per(16) (FIG. 36, combination 34)
35. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: 84BB (FIG. 36, combination 35)
36. Document number: modified 4BB, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: 84BB + Per(16) (FIG. 36, combination 36)
37. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB (FIG. 36, combination 37)
38. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB + Per(16) (FIG. 36, combination 38)
39. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: 84BB (FIG. 36, combination 39)
40. Document number: modified 4BB, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: 84BB + Per(16) (FIG. 36, combination 40)
41. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB (FIG. 37, combination 41)
42. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB + Per(16) (FIG. 37, combination 42)
43. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: 84BB (FIG. 37, combination 43)
44. Document number: B24, word appearance frequency in document: modified 4BB, number of word appearance position areas in document: B24, word appearance position in document: 84BB + Per(16) (FIG. 37, combination 44)
45. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB (FIG. 37, combination 45)
46. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: modified 4BB, word appearance position in document: 84BB + Per(16) (FIG. 37, combination 46)
47. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: 84BB (FIG. 37, combination 47)
48. Document number: B24, word appearance frequency in document: B24, number of word appearance position areas in document: B24, word appearance position in document: 84BB + Per(16) (FIG. 37, combination 48)
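
The 48 combinations listed above form two Cartesian products: the first three fields each take one of two codes, and the position field takes one of four codes (combinations 1 to 32) or one of two 84BB variants (combinations 33 to 48). The sketch below regenerates the numbering and figure assignment under that reading. The function and constant names are hypothetical, and "modified 4BB" is the label assumed here for the 4BB variant.

```python
from itertools import product

# Codes available for document number, frequency, and number of areas.
BASE_CODES = ["modified 4BB", "B24"]
# Codes for the word appearance position: combinations 1-32, then 33-48.
POSITION_CODES_A = ["modified 4BB", "modified 4BB + Per(16)", "B24", "B24 + Per(16)"]
POSITION_CODES_B = ["84BB", "84BB + Per(16)"]


def fifth_index_combinations():
    """Return (number, figure, doc, frequency, areas, position) tuples."""
    rows = []
    for position_codes in (POSITION_CODES_A, POSITION_CODES_B):
        # product() varies the last field fastest, matching the listed order.
        for doc, freq, areas, pos in product(
            BASE_CODES, BASE_CODES, BASE_CODES, position_codes
        ):
            number = len(rows) + 1
            figure = 32 + (number - 1) // 8  # eight combinations per figure
            rows.append((number, figure, doc, freq, areas, pos))
    return rows


if __name__ == "__main__":
    for row in fifth_index_combinations():
        print(row)
```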
In the embodiments described above, the data compression/decompression apparatus of the present invention is applied to an information search apparatus. However, the apparatus is not limited to search indexes and can be used to compress and decompress arbitrary data.

FIG. 1 is a diagram showing the principle of the data compression/decompression apparatus of the present invention.
FIG. 2 is a block diagram of an information search apparatus.
FIG. 3 is a flowchart of 4BB encoding processing.
FIG. 4 is a diagram showing a program for 4BB encoding.
FIG. 5 is a flowchart of 4BB decoding processing.
FIG. 6 is a diagram showing a program for 4BB decoding.
FIG. 7 is a flowchart of modified 4BB encoding processing.
FIG. 8 is a diagram showing a program for modified 4BB encoding.
FIG. 9 is a flowchart of modified 4BB decoding processing.
FIG. 10 is a diagram showing a program for modified 4BB decoding.
FIG. 11 is a flowchart of 84BB encoding processing.
FIG. 12 is a diagram showing a program for 84BB encoding.
FIG. 13 is a flowchart of 84BB decoding processing.
FIG. 14 is a diagram showing a program for 84BB decoding.
FIG. 15 is a flowchart of B24 encoding processing.
FIG. 16 is a diagram showing a program for B24 encoding.
FIG. 17 is a flowchart of B24 decoding processing.
FIG. 18 is a diagram showing a program for B24 decoding.
FIG. 19 is a diagram showing the number of bits required to represent numerical values.
FIG. 20 is a diagram showing examples of encoding the maximum 32-bit number.
FIG. 21 is a flowchart of Per encoding processing.
FIG. 22 is a flowchart of Per decoding processing.
FIG. 23 is a diagram showing an example of the fifth index structure.
FIG. 24 is a diagram showing the bit patterns of the first index structure.
FIG. 25 is a diagram showing the bit patterns of the second index structure.
FIG. 26 is a diagram (part 1) showing the bit patterns of the third index structure.
FIG. 27 is a diagram (part 2) showing the bit patterns of the third index structure.
FIG. 28 is a diagram (part 3) showing the bit patterns of the third index structure.
FIG. 29 is a diagram (part 1) showing the bit patterns of the fourth index structure.
FIG. 30 is a diagram (part 2) showing the bit patterns of the fourth index structure.
FIG. 31 is a diagram (part 3) showing the bit patterns of the fourth index structure.
FIG. 32 is a diagram (part 1) showing the bit patterns of the fifth index structure.
FIG. 33 is a diagram (part 2) showing the bit patterns of the fifth index structure.
FIG. 34 is a diagram (part 3) showing the bit patterns of the fifth index structure.
FIG. 35 is a diagram (part 4) showing the bit patterns of the fifth index structure.
FIG. 36 is a diagram (part 5) showing the bit patterns of the fifth index structure.
FIG. 37 is a diagram (part 6) showing the bit patterns of the fifth index structure.
FIG. 38 is a diagram showing a key and an index structure.
FIG. 39 is a diagram showing the size of an uncompressed index.
FIG. 40 is a diagram showing an index structure using differences.
FIG. 41 is a diagram showing an example of encoding difference values.

Explanation of symbols

1 Compression means
2 Storage means
3 Restoration means
4 Original data
5 Compressed data
11 Magnetic disk device
12 Floppy disk drive
13 Floppy disk
14 Printer
15 Display
16 CPU
17 Keyboard
18 Pointing device
19 Main memory
20 Bus
21 Document database
22 Index
23 Index creation program
24 Search engine
25 Document display program
26 Work area
31 Network connection device
32 External device

Claims

1. A data compression apparatus comprising:
compression means for coarsening the granularity of numerical data used in an index for information retrieval and compressing the numerical data; and
storage means for storing the compressed data.

2. The data compression apparatus according to claim 1, wherein the compression means converts the numerical data into intermediate numerical data representing a smaller numerical value and compresses the intermediate numerical data.

3. The data compression apparatus, wherein the compression means compresses at least one of the in-document word appearance frequency data and the in-document word appearance position data used in the index for information retrieval.

4. The data compression apparatus according to claim 1, wherein the compression means compresses the coarse-grained intermediate numerical data in units of 4 bits and generates, at the head of the compressed data, continuation flag information indicating the block length of the compressed data.

5. The data compression apparatus according to claim 1, wherein the compression means represents the coarse-grained intermediate numerical data in a 2-bit block when the intermediate numerical data represents a numerical value of 2 or less, and compresses the intermediate numerical data in units of 4 bits when the intermediate numerical data represents a numerical value of 3 or more.

6. The data compression apparatus according to claim 5, wherein, when numerical data of 3 to 6 is given, the compression means represents the given numerical data with 2-bit continuation flag information and a 2-bit bit pattern.

7. A data restoration apparatus comprising:
storage means for storing data compressed by coarsening the granularity of numerical data used in an index for information retrieval; and
restoration means for restoring the numerical data and returning the restored numerical data to its original granularity.

8. An information search apparatus comprising:
storage means for storing data compressed by coarsening the granularity of numerical data used in an index for information retrieval;
restoration means for restoring the numerical data and returning the restored numerical data to its original granularity; and
search means for searching a database using the restored original data.

9. A computer-readable recording medium recording a program for a computer, the program causing the computer to realize a function of coarsening the granularity of numerical data used in an index for information retrieval and compressing the numerical data.

10. A computer-readable recording medium recording a program for a computer, the program causing the computer to realize a function of restoring data compressed by coarsening the granularity of numerical data used in an index for information retrieval, and returning the restored numerical data to its original granularity.

11. A data compression method comprising compressing numerical data used in an index for information retrieval by coarsening the granularity of the numerical data.

12. A data restoration method comprising restoring data compressed by coarsening the granularity of numerical data used in an index for information retrieval, and returning the restored numerical data to its original granularity.