JP3051501B2

JP3051501B2 - Data compression method

Info

Publication number: JP3051501B2
Application number: JP16554391A
Authority: JP
Inventors: 茂吉田; 佳之岡田; 泰彦中野; 広隆千葉
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1991-07-05
Filing date: 1991-07-05
Publication date: 2000-06-12
Anticipated expiration: 2015-06-12
Also published as: JPH0514206A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a data compression method that compresses and encodes an input character string, such as character information, using a universal-type algorithm. In recent years, computers have come to handle many kinds of data, such as character codes, vector information, and images, and the volume of data handled is growing rapidly. When large amounts of data are handled, removing the redundant parts of the data to compress it reduces the required storage capacity and allows faster transmission.

[0002] Universal coding has been proposed as a method that can compress such varied data with a single scheme. The field of the present invention is not limited to the compression of character codes and applies to many kinds of data. In the following, however, we follow the terminology of information theory: one word of data is called a character, and any number of connected words is called a character string.

[0003] A representative universal code is the Ziv-Lempel code (for details see, for example, Munakata, "Ziv-Lempel Data Compression Method", Information Processing, Vol. 26, No. 1, 1985). Two algorithms have been proposed for the Ziv-Lempel code: (1) the universal type and (2) the incremental parsing type.

[0004] Improvements of the universal-type algorithm include the LZSS code (see T. C. Bell, "Better OPM/L Text Compression", IEEE Trans. on Commun., Vol. COM-34, No. 12, Dec. 1986) and the QIC-122 code, which is the standard compression scheme for 1/4-inch cartridge magnetic tape. An improvement of the incremental parsing algorithm is the LZW (Lempel-Ziv-Welch) code (see T. A. Welch, "A Technique for High-Performance Data Compression", Computer, June 1984).

[0005] These improved codes have come to be used for file compression on auxiliary storage devices and for data transmission in personal computer communications.

[0006]

[Prior Art] First, the conventional universal-type algorithm and the QIC-122 code, which is one of its improvements, are described.

[Universal-type algorithm] The universal-type algorithm is a data compression scheme that requires a large amount of computation but achieves a high compression ratio.

[0007] Specifically, the universal-type algorithm divides the character string to be encoded into sequences (so-called substrings) that give the longest match starting from an arbitrary position in the already encoded character string, and encodes the input character string as copies of those longest-matching past substrings. FIG. 9 shows the encoding scheme of the universal-type Ziv-Lempel code. In FIG. 9, a P buffer 12, which functions as a dictionary, stores the character string that has already been input, and a Q buffer 10, which serves as the character input unit, holds the character string about to be encoded. A pattern matching unit 14 compares the character string in the Q buffer 10 with the sequence in the P buffer 12 and searches the P buffer 12 for the longest matching substring. To designate the longest-matching substring found in the P buffer 12, it encodes the pair of items shown in FIG. 10: the start position (start address) of the longest-matching sequence in the P buffer, and the match length. If there is no matching sequence, the raw data is output together with a no-match symbol.
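The search performed by the pattern matching unit can be sketched as a brute-force scan. This is only an illustration under stated assumptions: the function name `longest_match`, the 0-based start index, the tie-breaking toward the most recent start, and the choice to let a copy run past the end of the P buffer into the Q buffer are not taken from the text.

```python
def longest_match(p_buf: str, q_buf: str) -> tuple[int, int]:
    """Return (start, length) of the longest prefix of q_buf that begins in p_buf.

    start is a 0-based index into p_buf; (-1, 0) means nothing matched.
    The copy may continue past the end of p_buf into q_buf itself
    (an overlapping copy). On ties the most recent start wins.
    """
    window = p_buf + q_buf
    best_start, best_len = -1, 0
    for start in range(len(p_buf)):
        length = 0
        while length < len(q_buf) and window[start + length] == q_buf[length]:
            length += 1
        if length > 0 and length >= best_len:
            best_start, best_len = start, length
    return best_start, best_len


# Five "A"s are copied from the single "A" at the end of the P buffer.
print(longest_match("ABA", "AAAAACABA"))
```

A practical encoder would replace the quadratic scan with a hash table or similar index; the scan is kept here because it mirrors the description directly.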

[0008] Next, the encoded character string in the Q buffer 10 is moved to the P buffer 12 and registered as a new already-encoded character string. The same operation is then repeated, decomposing the input character string into substrings and encoding them. In this way the Ziv-Lempel code encodes the current character string as a copy of an already encoded past character string. With the Ziv-Lempel code, document information in character codes can be compressed to about 1/2.

[0009] [QIC-122 code] This is the code adopted as the standard compression scheme for 1/4-inch cartridge magnetic tape by QIC (Quarter Inch Cartridge Standard Inc.), a group of manufacturers led by 3M. The QIC-122 algorithm keeps a 2048-byte history as the P buffer and has two modes: a mode that represents the character string to be encoded in the Q buffer as a copy of a character string in the P buffer, and a mode that encodes raw data one byte at a time. When the longest-matching character string in the P buffer is two or more characters long, encoding is done in copy mode; otherwise it is done in raw-data mode.

[0010] FIG. 11 shows the codeword format of the QIC-122 code expressed in the BNF metalanguage. The metasymbols used in the BNF metalanguage have the meanings shown in FIG. 12. The QIC-122 codeword format of FIG. 11 is explained in detail as follows.
(1) A compressed stream (Compressed Stream) consists of compressed strings (Compressed String) and an end marker.

[0011] (2) A compressed string is represented, for raw data, by an identification bit 0 followed by an ASCII raw byte and, for compressed data, by an identification bit 1 followed by compressed bytes.
(3) An ASCII raw byte is represented as one byte of 8 bits.
(4) Compressed bytes consist of the pair of an offset (start position) and a length (match length).

[0012] (5) The offset (start position) is represented by 7 bits when its identification bit is 1, and by 11 bits when its identification bit is 0.
(6) The end marker is 110000000, in which the offset is 0.
(7) A bit b is 0 or 1.
(8) The length (match length) is represented by a variable-length code as shown in FIG. 12.

[0013] FIG. 13 shows a concrete example of QIC-122 encoding, taking the case where the character string "ABAAAAAACABA" is input. For the first three characters "ABA", at most one character matches in the P buffer, so the bit sequence of an ASCII raw byte is output for each. The five "A"s from the 4th to the 8th character match the immediately preceding character "A" in the P buffer, so the bit sequence "1 1 0000001 1100" is output, consisting of the compressed-byte identification bit, the 7-bit-offset identification bit, offset = 1, and length = 5 bytes.

[0014] Here the offset value, which indicates the start position of the longest-matching substring, gives the position counted backwards from the most recently registered position (address) in the P buffer. The 9th character "C" is not in the P buffer, so an ASCII raw byte is output. The 10th to 12th characters "ABA" are already registered as the first three characters of the P buffer, so the bit sequence "1 1 0001001 01" is output, consisting of the compressed-byte identification bit, the 7-bit-offset identification bit, offset = 9, and length = 3 bytes.

[0015] All input characters have now been encoded, so "1 1 0000000" is output as the end data and processing ends.
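The worked example of paragraphs [0013] to [0015] can be reproduced with a small sketch. The raw-byte, offset, and end-marker layouts follow the text; the match-length table is an assumption beyond the two values the text shows (3 → 01, 5 → 1100), and the function names, the brute-force search, and the space-separated string output are illustrative only.

```python
def length_code(length: int) -> str:
    """Variable-length match-length code (assumed table, see the note above)."""
    if length <= 4:                       # 2, 3, 4 -> 00, 01, 10
        return format(length - 2, "02b")
    if length <= 7:                       # 5, 6, 7 -> 1100, 1101, 1110
        return "11" + format(length - 5, "02b")
    out, rest = "1111", length - 8        # 8 and up: 4-bit groups, 1111 means "more"
    while rest >= 15:
        out, rest = out + "1111", rest - 15
    return out + format(rest, "04b")


def qic122_encode(data: bytes, max_offset: int = 2047) -> list[str]:
    """Return the codewords as strings, with fields separated by spaces."""
    codewords = []
    pos = 0
    while pos < len(data):
        best_len, best_off = 0, 0
        for start in range(max(0, pos - max_offset), pos):
            length = 0
            while pos + length < len(data) and data[start + length] == data[pos + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - start
        if best_len >= 2:                 # copy mode
            if best_off < 128:
                offset = "1 " + format(best_off, "07b")
            else:
                offset = "0 " + format(best_off, "011b")
            codewords.append(f"1 {offset} {length_code(best_len)}")
            pos += best_len
        else:                             # raw-data mode
            codewords.append("0 " + format(data[pos], "08b"))
            pos += 1
    codewords.append("1 1 0000000")       # end marker: 7-bit offset of 0
    return codewords


for word in qic122_encode(b"ABAAAAAACABA"):
    print(word)
```

For the input "ABAAAAAACABA" the two copy codewords come out as "1 1 0000001 1100" and "1 1 0001001 01", matching the bit sequences quoted in the text.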

[0016] [Data compression scheme proposed by the present inventors] However, conventional data compression schemes using universal codes such as the Ziv-Lempel code or the QIC-122 code encode the longest-matching character string to be copied as the pair of "match start position" and "match length". As the already-encoded character string held in the P buffer grows, the match start position, which is expressed as an address in the memory constituting the P buffer, must be expressed with a large number of bits. Redundancy therefore remains among the encoded character strings, and a high compression ratio cannot be obtained.

[0017] The present inventors have therefore proposed a data compression scheme using a universal code that achieves a high compression ratio by exploiting the correlation between characters to reduce the redundancy among encoded character strings. FIG. 14 shows the principle of the encoding process proposed by the present inventors. In FIG. 14, suppose the Q buffer 10 holds the character string "acba..." about to be encoded. The P buffer 12 is searched for character strings that contain the immediately preceding character "b" (the single character encoded just before) and that, following that "b", match the character string in the Q buffer 10. Suppose that, for example, character strings S1, S2, S3, and S4 are found to satisfy this condition.

[0018] Because registration into the P buffer 12 proceeds from the right, the character strings S1 to S4 appear in the order S1: 1st, S2: 2nd, S3: 3rd, S4: 4th, and are assigned occurrence numbers 1, 2, 3, and 4 in that order.

[0019] Next, the character string S4, which gives the longest match with the input character string, is selected from among S1 to S4; its start position is represented by occurrence number 4 and output as a codeword paired with the match length. For a character string in which the preceding character "b" is repeated consecutively, such as S2, an occurrence number is assigned to either the start point or the end point of the run S2 for encoding. Alternatively, two occurrence numbers may be assigned, one to the start point and one to the end point of the run, and the string encoded as a codeword consisting of the two occurrence numbers and the match length.
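The candidate search of paragraphs [0017] to [0019] can be sketched as follows. Several points are assumptions rather than statements from the text: the last slot of the P buffer (the preceding character itself) is not counted as a candidate, the reported match length counts only the characters of the Q buffer (the text's |S| also includes the preceding character), ties go to the smallest occurrence number, and runs of the preceding character receive no special handling. The function names are hypothetical.

```python
def occurrence_candidates(p_buf: str, q_buf: str) -> list[tuple[int, int]]:
    """List (occurrence_number, match_length) for each earlier occurrence of
    the preceding character, numbered 1, 2, 3, ... from the right."""
    prev = p_buf[-1]                       # character encoded just before q_buf
    window = p_buf + q_buf
    candidates = []
    number = 0
    # Scan right to left, skipping the last slot (the preceding character itself).
    for pos in range(len(p_buf) - 2, -1, -1):
        if p_buf[pos] != prev:
            continue
        number += 1
        length = 0
        while length < len(q_buf) and window[pos + 1 + length] == q_buf[length]:
            length += 1
        candidates.append((number, length))
    return candidates


def best_occurrence(p_buf: str, q_buf: str):
    """Pick the candidate with the longest match, or None if there is none."""
    candidates = occurrence_candidates(p_buf, q_buf)
    return max(candidates, key=lambda c: c[1], default=None)


# Three earlier "b"s; the oldest one is followed by the full string "acba".
print(occurrence_candidates("bacbaxbab", "acba"))
print(best_occurrence("bacbaxbab", "acba"))
```

The length of the candidate list is the occurrence count n that the later paragraphs use to size the code for the occurrence number.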

[0020] FIG. 15 is a flowchart showing one embodiment of the universal encoding algorithm using the occurrence numbers of FIG. 14, taking the QIC-122 code as an example. In FIG. 15, step S1 first empties the P buffer and fills the Q buffer with the input data to be encoded. Step S2 then searches the P buffer for the longest character string S that matches the character string starting from the position of the preceding character of the Q buffer. Step S3 then determines whether the longest character string S found is three or more characters long.

[0021] If the longest character string S is one or two characters long, the process proceeds to step S4 and enters raw-data mode, outputting a flag bit 0, which indicates raw-data mode, and one byte of raw data in ASCII code. That is, a codeword in the data format of FIG. 17(a) is output. If the longest character string S is three or more characters long, the process proceeds to step S5 and enters copy mode, encoding a flag bit 1, which indicates compressed data, followed by the pair of the occurrence order and the match length of the longest character string. That is, a codeword in the data format of FIG. 17(b) is output.

[0022] In step S6, the encoded character string or character in the Q buffer is moved to the P buffer, and the same number of new characters are input to the Q buffer. In addition, because the P buffer is fixed at 2048 bytes in the QIC-122 algorithm, the oldest characters, as many as were moved into the P buffer, are discarded from the P buffer. The same processing is then repeated.

[0023] FIG. 16 illustrates a format based on the QIC-122 codeword as encoded in the copy mode of FIG. 15. Compared with FIG. 11, the offset is a 3-bit fixed-length representation of the occurrence number, so occurrence orders up to the 7th can be encoded. Occurrence number 0 is used as the END mark. Because the matched character string, including the preceding character, is three or more characters long, the match length used is |S| − 2. The occurrence numbers are counted from the right here, but they may instead be counted from the left.

[0024] Thus, whereas the conventional offset indicating the start position takes 7 or 11 bits, an offset indicating the occurrence number can be expressed with as few as 3 bits, which improves the compression ratio.

[0025]

[Problems to Be Solved by the Invention] However, the data compression scheme proposed by the present inventors, which uses a universal code exploiting the correlation between characters, expresses the occurrence number within the P buffer with a fixed length. When the occurrence number of the longest-matching substring is small relative to the number of occurrences of the preceding character, unneeded bits are spent. Conversely, when the planned maximum number of occurrences is exceeded, not every occurrence number of a longest-matching substring can be represented. The problem therefore remains that the compressed code is wasteful.

[0026] The present invention was made in view of these problems. Its object is to provide a data compression scheme using a universal code that expresses the occurrence number efficiently in bits even when the number of occurrences is not known in advance, thereby further improving the compression ratio.

[0027]

[Means for Solving the Problems] FIG. 1 illustrates the principle of the present invention. The present invention is directed to the data compression scheme using a universal code already proposed by the present inventors. That scheme comprises an encoding unit 14 that holds already-encoded character strings in a dictionary 12; detects the number of occurrences n of the matching substrings S1, S2, S3, and S4 of the encoded character strings held in the dictionary 12 that begin with the same leading character as the preceding character 18 of the input character string in a character input unit 10; searches for the longest-matching substring S4; and encodes it as the pair of the occurrence order n at which the preceding character of the longest-matching substring S4 appears, used as its start position, and the match length L.

[0028] In such a data compression scheme using a universal code, which exploits the correlation between characters to reduce the redundancy among encoded character strings, the present invention is characterized in that the occurrence number i is variable-length encoded with a number of bits that depends on the number of occurrences n. As variable-length encoding by bit fraction compensation, the encoding unit 14 lets p be the number of bits given by ⌈log₂ n⌉, the smallest integer greater than or equal to log₂ n, where n is the number of occurrences; lets n* be the number of occurrences n expressed in (p − 1) bits by dropping its most significant bit; and lets i* be the occurrence number i to be encoded expressed in (p − 1) bits by dropping its most significant bit. If i* > n*, then i* is the variable-length code; if i* ≤ n*, then i* followed by the most significant bit of the occurrence number i is the variable-length code.

[0029] Further, as variable-length encoding by Phasing in Binary Codes, the encoding unit 14 lets p be the number of bits given by ⌈log₂ n⌉, the smallest integer greater than or equal to log₂ n, where n is the number of occurrences; lets n* be the number of occurrences n expressed in (p − 1) bits by dropping its most significant bit; and lets i* be the occurrence number i to be encoded expressed in (p − 1) bits by dropping its most significant bit. If i* < 2^p − n − 1, then i* is the variable-length code; if i* ≥ 2^p − n − 1, then the value obtained by adding (2^p − n − 1) to i*, expressed in p bits, is the variable-length code.

[0030] Furthermore, the encoding unit 14 is characterized in that it regards the n substrings matching the input character string that follows the preceding character 18 as occurring with equal probability and encodes the occurrence order i by multi-level arithmetic coding. The encoding unit 14 also encodes the input character string into the QIC-122 code, which is the standard compression scheme for 1/4-inch cartridge magnetic tape.

[0031]

[Operation] According to the present invention as described above, all matching substrings among the already-encoded character strings whose leading character is the already-encoded character immediately preceding the input character string in the Q buffer (the most recently registered character in the P buffer) are searched for to obtain the number of occurrences n. When the input is encoded as a copy of the longest-matching substring among them, the conventional scheme encoded the pair of the start position of the longest-matching substring and the match length. The present invention instead encodes the pair of the occurrence number i of the longest-matching substring in the P buffer and the match length, and moreover encodes the occurrence number i with a variable length rather than a fixed length.

[0032] Because the occurrence numbers i of the substrings are far fewer than the conventional start positions (start addresses), the occurrence number i can be expressed in few bits. Because the occurrence order is further expressed with a variable number of bits, the code representation is free of waste, uses few bits, and achieves a high compression ratio.

[0033]

[Embodiments] FIG. 2 is a block diagram showing one embodiment of the present invention. In FIG. 2, reference numeral 16 denotes a buffer memory, which is allocated to a Q buffer 10, serving as the character string input unit that stores the character string to be encoded, and to a P buffer 12, which functions as a dictionary in which already-encoded character strings are registered.

[0034] Reference numeral 14 denotes a pattern matching unit, an encoding unit implemented with a CPU. Following the universal encoding algorithm, it searches the P buffer 12 for the substring giving the longest match with the character string in the Q buffer 10 and outputs, as a copy of that substring, a codeword consisting of the pair of its start position and match length. Specifically, letting n be the number of occurrences of the preceding character in the P buffer 12 and i be the occurrence number, the pattern matching unit 14 performs the encoding process shown in the encoding flowchart of FIG. 3.

[0035] The flowchart of FIG. 3 is basically the same as the scheme already proposed by the present inventors and shown in FIG. 15. The differences are that step S2 obtains the number of occurrences n of the preceding character in the P buffer, and that step S5 expresses the occurrence number i of the longest-matching character string S with a number of bits based on n, thereby variable-length encoding it. FIG. 4 is a flowchart showing the universal-type decoding algorithm that restores the codewords obtained by the data compression scheme of the present invention.

[0036] In the decoding process of FIG. 4, when step S3 identifies copy mode from flag bit 1 and the process proceeds to step S5, the number of occurrences n of the preceding character in the P buffer is obtained in the same way as in step S2 of FIG. 3, a variable-length code with the number of bits determined by n is extracted, and the occurrence number i is decoded. Step S6 then restores the longest character string S from the occurrence number of the preceding character in the P buffer and the match length.

[0037] Concrete embodiments of the variable-length encoding of the occurrence number performed in the encoding process of the present invention are described next.

(1) Variable fixed-length encoding
Letting n be the number of occurrences of the preceding character in the P buffer, the occurrence number i is encoded using ⌈log₂ n⌉ bits, where ⌈log₂ n⌉ denotes the smallest integer greater than or equal to log₂ n.

[0038] For example, if the number of occurrences found by searching the P buffer when encoding some input character string is n = 12, the occurrence number i of the longest character string S at that time is a variable-length code expressed in ⌈log₂ n⌉ = ⌈log₂ 12⌉ = 4 bits. This is called variable fixed-length encoding.

(2) Variable fixed-length encoding with bit fraction compensation
In the variable fixed-length encoding of (1), expressing the occurrence numbers i up to the maximum value i = n, corresponding to the number of occurrences n, in ⌈log₂ n⌉ bits incurs a bit loss of ⌈log₂ n⌉ − log₂ n bits. Bit fraction compensation improves coding efficiency by expressing the occurrence number i with less of this fractional-bit loss (see, for example, "Improvement of Ziv-Lempel Codes and Performance Evaluation by Simulation (II)", Technical Research Report of the Institute of Electronics and Communication Engineers of Japan, C84-135, pp. 1-8, 1984).
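The bit count and the fractional loss in the n = 12 example can be checked with a few lines. This is a minimal sketch: the function names are hypothetical, and the occurrence numbers are assumed to run from 0 to n − 1, as in the i = 0 to 11 example used for FIG. 5.

```python
import math


def fixed_bits(n: int) -> int:
    """ceil(log2 n), computed with integers; n must be at least 1."""
    return (n - 1).bit_length()


def encode_fixed(i: int, n: int) -> str:
    """Write occurrence number i (0 <= i < n) in ceil(log2 n) bits."""
    p = fixed_bits(n)
    return format(i, f"0{p}b") if p else ""


n = 12
p = fixed_bits(n)
loss = p - math.log2(n)      # fractional bits lost per occurrence number
print(p, round(loss, 3), encode_fixed(11, n))
```

For n = 12 the code spends 4 bits where about 3.585 would suffice, a loss of roughly 0.415 bits per occurrence number, which is what methods (2) to (4) aim to recover.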

[0039] In bit fraction compensation, let p = ⌈log₂ n⌉ be the number of bits when the maximum occurrence number corresponding to the number of occurrences n is i = n, and let i* be the occurrence number i expressed in (p − 1) bits by dropping its most significant bit. Likewise, let n* be the maximum occurrence number n expressed in (p − 1) bits by dropping its most significant bit.

[0040] Under these conditions, the variable-length codeword for the occurrence number i under bit fraction compensation is i* alone when i* > n*, and i* followed by the most significant bit when i* ≤ n*. FIG. 5 shows an example in which the occurrence numbers i = 0 to 11, for n = 12 occurrences, are expressed with bit fraction compensation.

[0041] In FIG. 5, p = ⌈log₂ n⌉ = ⌈log₂ 12⌉ = 4 bits and p − 1 = 3 bits. The codeword is i* alone when i* > 3, and i* followed by the most significant bit when i* ≤ 3.
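Bit fraction compensation can be sketched as below, following the rule as stated in paragraph [0028]: a (p − 1)-bit code when i* > n*, otherwise i* followed by the most significant bit. One assumption is made explicit: n* is derived from the largest occurrence number `max_i` (11 in the i = 0 to 11 example), which gives the threshold of 3 quoted for FIG. 5. The function names are hypothetical.

```python
def bfc_encode(i: int, max_i: int) -> str:
    """Encode i (0 <= i <= max_i, max_i >= 1) with bit fraction compensation."""
    p = max_i.bit_length()                # bits needed for the largest number
    mask = (1 << (p - 1)) - 1
    i_star, n_star = i & mask, max_i & mask
    low = format(i_star, f"0{p - 1}b") if p > 1 else ""
    if i_star > n_star:
        return low                        # p-1 bits: the top bit can only be 0
    return low + str(i >> (p - 1))        # p bits: low bits, then the top bit


def bfc_decode(bits: str, max_i: int) -> tuple[int, int]:
    """Return (i, bits_consumed) decoded from the front of a bit string."""
    p = max_i.bit_length()
    mask = (1 << (p - 1)) - 1
    i_star = int(bits[:p - 1], 2) if p > 1 else 0
    if i_star > (max_i & mask):
        return i_star, p - 1
    msb = int(bits[p - 1])
    return (msb << (p - 1)) | i_star, p


codes = [bfc_encode(i, 11) for i in range(12)]
print(codes)
print(sum(len(c) for c in codes), "bits in total, against", 12 * 4)
```

With `max_i = 11`, the numbers 4 to 7 get three-bit codes and the remaining eight numbers get four-bit codes, so the twelve codewords total 44 bits instead of 48. The decoder reads p − 1 bits first and fetches the extra bit only when they do not exceed n*.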

[0042] That is, of the 4-bit binary representations of the reference numbers (occurrence numbers) i = 0 to 11, the four reference numbers i = 4 to 7, whose lower three bits exceed 3, are represented by the lower three bits i* alone, with the upper bit removed. The reference numbers i = 0 to 3 and i = 8 to 11, whose lower three bits are 3 or less, are represented by the lower three bits i* followed by the upper bit of the binary representation, which distinguishes them.

(3) Phasing in Binary Codes (PBC)
This PBC variable-length encoding is described, for example, in "Text Compression", Prentice-Hall Inc., 1990, pp. 293-294.

[0043] In PBC variable-length encoding, when i < 2^p − n − 1 the codeword is i*, and when i ≥ 2^p − n − 1 the value (i + 2^p − n − 1), obtained by adding (2^p − n − 1) to the occurrence number i, is expressed in p bits. FIG. 6 shows a concrete example of PBC encoding for the occurrence numbers (reference numbers) i = 0 to 11 when the number of occurrences is n = 12.
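A phased-in binary code can be sketched as follows. One point needs care: the example described for FIG. 6 (i = 0 to 3 in three bits, 4 added to the rest) corresponds to a threshold of 4, which equals 2^p − n with n = 12 occurrences, or equally the written 2^p − n − 1 if n is read as the largest occurrence number, 11. The sketch assumes the former reading, with occurrence numbers 0 to n − 1; the function names are hypothetical.

```python
def pbc_encode(i: int, n: int) -> str:
    """Phased-in binary code for i in 0..n-1, with n >= 2."""
    p = (n - 1).bit_length()              # ceil(log2 n)
    short = (1 << p) - n                  # how many values get only p-1 bits
    if i < short:
        return format(i, f"0{p - 1}b")
    return format(i + short, f"0{p}b")


def pbc_decode(bits: str, n: int) -> tuple[int, int]:
    """Return (i, bits_consumed) decoded from the front of a bit string."""
    p = (n - 1).bit_length()
    short = (1 << p) - n
    head = int(bits[:p - 1], 2) if p > 1 else 0
    if head < short:
        return head, p - 1
    return int(bits[:p], 2) - short, p


pbc_codes = [pbc_encode(i, 12) for i in range(12)]
print(pbc_codes)
```

For n = 12 this yields 000 to 011 for i = 0 to 3 and 1000 to 1111 for i = 4 to 11. No short code is a prefix of a long one, so the decoder can decide after p − 1 bits whether one more bit is needed.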

[0044] In FIG. 6, the reference numbers i = 0 to 3 satisfy the first condition. In this case the binary code of i = 0 to 3 expressed in p = 4 bits is PBC-encoded as the three bits that remain after the most significant bit is dropped. The reference numbers i = 4 to 11 satisfy the second condition. In this case the PBC code is the 4 bits obtained by adding the binary value "100" (4) to the binary representation of i = 4 to 11 expressed in p = 4 bits.

(4) Multi-level arithmetic coding
The variable-length encodings of (2) and (3) express the occurrence number i in either p bits or p − 1 bits depending on i. Although this reduces the fractional-bit loss for each individual occurrence number, redundancy still remains when the sequence of occurrence numbers is viewed as a whole.

[0045] Therefore, to reduce the bit loss further, the occurrence number (symbol) i is coded by multi-level arithmetic coding on the assumption that the n character strings appear with equal probability (for multi-level arithmetic coding see, for example, "Arithmetic Coding for Data Compression", Communications of the ACM, June 1987, Vol. 30, No. 6, pp. 520-540).
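The size of the remaining redundancy can be illustrated with a short calculation. The sketch below is illustrative only and the function name `average_bits` is hypothetical. It compares the average code length per occurrence number, assuming all n candidates are equally likely, for a fixed p-bit code, for a code that mixes p − 1 and p bits with 2^p − n short codes (an assumption matching the n = 12 examples), and for the log₂ n bits that an ideal arithmetic coder approaches.

```python
from math import log2


def average_bits(n: int) -> dict:
    """Average bits per occurrence number for n equally likely candidates."""
    p = (n - 1).bit_length()              # equals ceil(log2 n) for n >= 1
    short = (1 << p) - n                  # assumed number of (p-1)-bit codes
    mixed = (short * (p - 1) + (n - short) * p) / n
    return {
        "fixed": p,                       # always p bits
        "mixed_length": mixed,            # p-1 or p bits depending on i
        "arithmetic_ideal": log2(n),      # limit for equal probabilities
    }


if __name__ == "__main__":
    print(average_bits(12))
```

For n = 12 this gives 4 bits for the fixed code, 44/12 (about 3.67) bits for the mixed-length code, and about 3.58 bits as the arithmetic coding limit.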

[0046] FIGS. 7(a) and 7(b) show schematic flows of the encoding and decoding of the multi-level arithmetic coding used to encode a plurality of symbols. The multi-level arithmetic coding of FIG. 7(a) maps a data string to a single point on the number line [0, 1]: for each symbol, the interval [0, 1] is successively subdivided according to the cumulative occurrence probability obtained from the occurrence probabilities of the symbols.

[0047] FIG. 8 shows the processing performed in multi-level arithmetic coding. Suppose the first number of occurrences is n = 4 and the occurrence number of the longest character string is i = 2; then, of the four sections into which the range between the upper limit 1 and the lower limit 0 is divided, the section corresponding to i = 2 is selected. Next, suppose the second number of occurrences is likewise n = 2 and the occurrence number of the smallest character string in this case is i = 1; then the corresponding section among the four into which the selected section is further divided is selected.
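The successive subdivision described for FIG. 8 can be sketched with exact fractions. This is a minimal sketch, not the flow of FIG. 7 itself: the names `narrow` and `recover` are hypothetical, every step assumes n equally likely candidates, the decoder is assumed to know the number of occurrences n at each step, and the second step is taken to divide the section into four (the text speaks of a further four-way division).

```python
from fractions import Fraction


def narrow(symbols):
    """Narrow [0, 1) for a sequence of (i, n) pairs; return (low, high)."""
    low, width = Fraction(0), Fraction(1)
    for i, n in symbols:
        width /= n              # n equal sub-sections of the current section
        low += i * width        # keep the sub-section numbered i
    return low, low + width


def recover(point, counts):
    """Recover the occurrence numbers from a point inside the final section."""
    low, width = Fraction(0), Fraction(1)
    out = []
    for n in counts:
        width /= n
        i = int((point - low) / width)   # index of the sub-section holding point
        out.append(i)
        low += i * width
    return out


if __name__ == "__main__":
    low, high = narrow([(2, 4), (1, 4)])
    print(low, high)                         # section chosen after two steps
    print(recover((low + high) / 2, [4, 4])) # any interior point decodes the same
```

Selecting i = 2 of four and then i = 1 of four leaves the section from 9/16 to 5/8, and any point inside it identifies both occurrence numbers.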

[0048] Subdivision of the selected section proceeds in the same way. When the section based on the final character string is selected on the N-th pass, a pair including the value of an arbitrary point in the selected section (the value indicating the upper or lower limit of the section may be used) is output as the code word. With the algorithm of FIG. 7(a), no code word is obtained until coding of the entire symbol sequence is finished, and decoding cannot be performed until the entire code word is available. In practical multi-level arithmetic coding, however, the computation is carried out in fixed-length registers with a finite number of digits, and the code word can be obtained bit by bit.

[0049] That is, in the first encoding pass of FIG. 8, for example, the upper limit is "001" and the lower limit is "010"; since the most significant bit of both is "0", this most significant bit "0" is output immediately. The same applies to the second and subsequent passes. Furthermore, when multi-level arithmetic coding is used, the "match length" of the character string may also be handled by counting the number of occurrences of each match length and coding the occurrence probability of the match length, estimated from the counts, together with the occurrence number by multi-level arithmetic coding.
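The early output of shared leading bits can be sketched as follows. This is an illustrative sketch under assumptions, not the register logic of the embodiment: the name `emit_common_bits` is hypothetical, the two bounds are held as unsigned integers in registers of `width` bits, the smaller value is treated as the lower bound, and after each output bit a 0 is shifted into the lower bound and a 1 into the upper bound, which is one common convention.

```python
def emit_common_bits(low: int, high: int, width: int):
    """Output the leading bits shared by both bounds and renormalise.

    Returns (emitted bit string, new low, new high).
    """
    top = 1 << (width - 1)            # mask of the most significant bit
    mask = (1 << width) - 1           # keeps values inside the register
    out = []
    while (low & top) == (high & top):
        out.append("1" if low & top else "0")
        low = (low << 1) & mask               # shift a 0 into the lower bound
        high = ((high << 1) | 1) & mask       # shift a 1 into the upper bound
    return "".join(out), low, high


if __name__ == "__main__":
    # Bounds "001" and "010" share the leading bit "0".
    print(emit_common_bits(0b001, 0b010, 3))
```

For the bounds "001" and "010" a single "0" is emitted and the registers become "010" and "101", after which the top bits differ and coding of the next symbol can continue.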

[0050] Although the above embodiment takes the QIC-122 code as an example, the invention is not limited to it and can be applied as is to any suitable universal code, such as the Ziv-Lempel code.

【００５１】[0051]

[Effects of the Invention] As described above, according to the present invention, when a character string to be encoded is treated as a copy of an already encoded character string and the start position of the copy is represented by the pair of "appearance order" and "match length", the "appearance order" is expressed with a variable number of bits. The code representation is therefore free of waste, fewer bits are needed, and a high compression ratio can be obtained.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating the principle of the present invention.

FIG. 2 is a configuration diagram of an embodiment of the present invention.

FIG. 3 is a flowchart showing a QIC-122 encoding algorithm using the present invention.

FIG. 4 is a flowchart showing a decoding algorithm for code words encoded according to the present invention.

FIG. 5 is a diagram illustrating a specific example of variable-length coding by bit fraction compensation according to the present invention.

FIG. 6 is a diagram illustrating a specific example of PBC coding according to the present invention.

FIG. 7 is an explanatory diagram showing the encoding and decoding algorithms of the multi-level arithmetic coding of the present invention.

FIG. 8 is an explanatory diagram showing the processing performed in the multi-level arithmetic coding of the present invention.

FIG. 9 is an explanatory diagram of the encoding method of the universal type Ziv-Lempel code.

FIG. 10 is an explanatory diagram of the data format of a universal code word.

FIG. 11 is an explanatory diagram of the format of the QIC-122 code.

FIG. 12 is an explanatory diagram of the BNF meta-language used in FIG. 11.

FIG. 13 is an explanatory diagram showing a specific example of encoding with the QIC-122 code.

FIG. 14 is a diagram illustrating the principle of the encoding process proposed by the present inventors.

FIG. 15 is a flowchart showing the QIC-122 encoding algorithm proposed by the present inventors.

FIG. 16 is an explanatory diagram of the code word format in the copy mode of FIG. 15.

FIG. 17 is an explanatory diagram of the data format of the code word of FIG. 15.

[Explanation of symbols]

10: Character input unit (Q buffer)
12: Dictionary (P buffer)
14: Encoding unit (pattern matching unit)
16: Buffer memory
18: Immediately preceding character

Continuation of the front page
(72) Inventor: Hirotaka Chiba, c/o Fujitsu Limited, 1015 Kamiodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa
(56) References: JP-A-3-70214 (JP, A); JP-A-3-78322 (JP, A); JP-A-3-209922 (JP, A); JP-A-4-256192 (JP, A)
(58) Fields searched (Int. Cl.⁷, DB name): G06F 5/00; H03M 7/40

Claims

(57) [Claims]

1. A data compression method characterized in that: encoded character strings are stored in a dictionary; from among the encoded character strings stored in the dictionary, matching substrings that begin with the same character as the character immediately preceding the character string input from a character input unit and that match the input character string are searched for; the number of occurrences of the matching substrings is detected; the maximum-length matching substring, which matches the input character string over the greatest length, is searched for from among the matching substrings; the input character string is encoded by an occurrence number, indicating the order in which the character immediately preceding the maximum-length matching substring appears, and the match length between the maximum-length matching substring and the input character string; and the occurrence number is variable-length coded with a number of bits based on the number of occurrences.

2. The data compression method of claim 1, characterized in that the variable-length coding represents the occurrence number i by a variable-length code of ⌈log₂ n⌉ bits, where ⌈log₂ n⌉ denotes the smallest integer not less than log₂ n.

3. The data compression method of claim 1, characterized in that, in the variable-length coding, where the number of occurrences is n, p denotes the number of bits given by ⌈log₂ n⌉ (the smallest integer not less than log₂ n), n* denotes the number represented by the (p − 1) bits of the number of occurrences n excluding its most significant bit, and i* denotes the number represented by the (p − 1) bits of the occurrence number i excluding its most significant bit, the variable-length code is i* if i* > n*, and is i* followed by the most significant bit of the occurrence number i if i* ≤ n*.

4. The data compression method of claim 1, characterized in that, in the variable-length coding, where the number of occurrences is n, p denotes the number of bits given by ⌈log₂ n⌉ (the smallest integer not less than log₂ n), n* denotes the number represented by the (p − 1) bits of the number of occurrences n excluding its most significant bit, and i* denotes the number represented by the (p − 1) bits of the occurrence number i excluding its most significant bit, the variable-length code is i* if i* < 2^p − n − 1, and is the p-bit representation of the value obtained by adding 2^p − n − 1 to i* if i* ≥ 2^p − n − 1.

5. The data compression method of claim 1, characterized in that the variable-length coding assumes that the n substrings that follow the immediately preceding character and match the input character string appear with equal probability, and codes the occurrence number i by multi-level arithmetic coding.

6. The data compression method of any one of claims 1 to 5, characterized in that, in the variable-length coding, the input character string is encoded into code words of the QIC-122 code, a standard compression scheme for 1/4-inch cartridge magnetic tape.