JPH06266531A

JPH06266531A - Multistage data compression device using optimum code expression

Info

Publication number: JPH06266531A
Application number: JP7862193A
Authority: JP
Inventors: Takashi Takizuka; 孝志瀧塚; Keiko Miyatake; 圭子宮武
Original assignee: Kokusai Denshin Denwa KK
Current assignee: KDDI Corp
Priority date: 1993-03-15
Filing date: 1993-03-15
Publication date: 1994-09-22
Anticipated expiration: 2013-03-25
Also published as: JP2732188B2

Abstract

PURPOSE: To improve the compression rate and, when the output codes are restricted, to reduce the rate of data expansion by using a longest code representation, so that the output is compressed data conforming to that restriction. CONSTITUTION: The device comprises a special pattern encoder 2, which performs source-dependent compression by encoding special patterns such as runs of the same section code in kanji (Chinese character) data, runs of identical bytes, and numerical expressions; a general-purpose encoder 3, which, after the source-dependent compression, encodes the output into a source-independent universal code such as an LZW code using a shortest code representation; and a code converter 4, which, when the output codes are restricted, uses a longest code representation to reduce the rate of data expansion and outputs compressed data 5 conforming to the restriction. Thus, special pattern encoding is performed, such as omitting the section code within a run of kanji sharing the same section code and omitting the repeated bytes within a run of identical bytes.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a data compression device that can encode with a higher compression rate than plain LZW (Lempel-Ziv-Welch) coding, which is widely used for data compression.

[0002]

[Prior Art] The LZW code (Reference 1: Terry A. Welch, "A Technique for High-Performance Data Compression", Computer, 17, 6, pp. 8-19, 1984) is a universal code that does not depend on the information source and runs fast, so it is widely used for data compression. On the UNIX operating system, an LZW code with variable-length codes is available as the file compression command compress. When the code to be output is C (2^m < C ≤ 2^(m+1)) and the largest assigned code is J = 2^m + L (0 < L < 2^m), compress always encodes C with m + 1 bits and groups eight codes into a block. It is also known that, when few distinct character codes occur, it is better in LZW coding to assign a code to each character code when it first appears than to number all character codes in advance (see Reference 1).

[0003]

[Problems to Be Solved by the Invention] In the UNIX electronic mail system, a binary file must first be converted into ASCII printable characters before it is sent, and UNIX uses uuencode as the conversion program. uuencode cuts the data into 6-bit units and adds 48 to each to convert it to an ASCII code. Even for text files, a large file is often compressed before sending, and the resulting binary file is then converted into an ASCII file. Because Japanese text has a data structure based on 2-byte units, LZW coding, which encodes in 1-byte units, compresses it inefficiently. LZW coding also has the drawback of a low compression rate for binary data containing long runs of a value such as 0.

[0004] The present invention has been made in view of these conventional problems. Its object is to provide a multistage data compression device using optimum code representations that achieves a higher compression rate than conventional coding compression for Japanese-language data and for any output character code set.

[0005]

[Means for Solving the Problems] To achieve this object, the multistage data compression device using optimum code representations according to the present invention comprises, for data compression: a special pattern encoding device that performs source-dependent compression by encoding special patterns such as runs of the same section code in kanji data, runs of identical bytes, and numerical expressions; a general-purpose encoding device that, after the source-dependent compression, encodes the output into a source-independent universal code such as an LZW code using a shortest code representation; and a code conversion device that, when the output codes are restricted, uses a longest code representation to reduce the rate of data expansion and outputs compressed data conforming to the restriction on the output codes.

[0006]

[Operation] Before LZW coding is performed, the special pattern encoding device performs special pattern encoding, such as omitting the section code within a run of kanji sharing the same section code and omitting the repeated bytes within a run of identical bytes. This raises the compression rate for Japanese data and binary data.

[0007] The compression rate of the LZW code can be raised by assigning codes so that the output bit length varies with the range in which the output code C falls. If the occurrence probability of C is uniform, the code length is shortened on average by the number of bits given by Equation (1). Averaged over the interval 0 ≤ L ≤ 2^m, the saving is therefore approximated, when 2^m is sufficiently large, by log_e 4 − 1 = 0.386, as shown in Equation (2). That is, if the code length m + 1 is 12 bits, the improvement is 0.386 / 12 = 3.22%. This compression method is called logarithmic compression.

[0008]

[Equation 1]

[0009]

[Equation 2]
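The equation images are not reproduced in this text. One reading that is consistent with the surrounding description, offered here as an assumption rather than as the patent's own formulas, is that Equation (1) is the fraction of codes that save one bit when J = 2^m + L, and that Equation (2) is the average of that fraction over L:

```latex
\begin{align}
  \Delta(L) &\approx \frac{2^m - L}{2^m + L}
     \quad \text{(bits saved per code, assumed form of Equation 1)} \\
  \frac{1}{2^m}\int_0^{2^m} \frac{2^m - L}{2^m + L}\,dL
    &= \int_0^1 \frac{1 - x}{1 + x}\,dx
     = 2\ln 2 - 1 = \log_e 4 - 1 \approx 0.386
\end{align}
```

With m + 1 = 12 this gives 0.386 / 12 ≈ 3.22%, which agrees with the figure quoted in the text.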

[0010] When only n output codes (2^k < n < 2^(k+1)) can be used, output with an optimum code representation that assigns k-bit data to 2^(k+1) − n of the codes and (k + 1)-bit data to the remaining 2n − 2^(k+1) codes converts more efficiently than always cutting the data into k-bit units.
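The split described here can be checked numerically. The following is a minimal sketch in C; the function name `optimal_split` and its parameters are hypothetical and do not come from the patent.

```c
/* Hypothetical helper: for n usable output symbols, find the largest k
 * with 2^k <= n, then report how many symbols carry k bits and how many
 * carry k + 1 bits. Returns the average number of bits per symbol. */
double optimal_split(unsigned n, unsigned *k, unsigned *n_short, unsigned *n_long)
{
    unsigned bits = 0;
    while ((1u << (bits + 1)) <= n)          /* largest k with 2^k <= n */
        bits++;
    *k = bits;
    *n_short = (1u << (bits + 1)) - n;       /* 2^(k+1) - n symbols of k bits   */
    *n_long  = 2u * n - (1u << (bits + 1));  /* 2n - 2^(k+1) symbols of k+1 bits */
    return ((double)*n_short * bits + (double)*n_long * (bits + 1)) / n;
}
```

For n = 95 (the printable ASCII characters) this yields k = 6, 33 six-bit symbols, and 62 seven-bit symbols, for an average of about 6.65 bits per symbol.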

[0011] When the same byte occurs L times in a row, the LZW code outputs a number of codes roughly equal to the square root of 2L. It is therefore effective to add to LZW coding a compression function aimed at Japanese and a function that compresses runs of identical bytes. The present invention improves on a method that encodes the output code C with m bits if 0 ≤ C < 2^m + L and with m + 1 bits if 2^m ≤ C ≤ J (Hidetoshi Yokoo, "Modified Ziv-Lempel code for universal information coding"), so that the range given the shortest codes can be selected, and thereby strengthens the compression rate. In addition, assigning a character code in LZW coding each time a new character code appears makes it possible to raise the compression rate even for small data, and, when the output character codes are restricted, for example to the ASCII printable characters, using all of the available output character codes makes more efficient code conversion possible.
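The square-root estimate for runs follows from how LZW grows its dictionary: on a run of one byte value, successive phrases have lengths 1, 2, 3, and so on, so k codes cover k(k + 1)/2 bytes. The sketch below counts the codes under the assumptions that the single byte is already in the dictionary and that the dictionary never fills; `lzw_codes_for_run` is a hypothetical name.

```c
/* Number of codes LZW emits for a run of run_len identical bytes,
 * assuming phrase lengths 1, 2, 3, ... (dictionary never saturates). */
unsigned lzw_codes_for_run(unsigned long run_len)
{
    unsigned k = 0;
    unsigned long covered = 0;   /* bytes covered by the first k phrases */
    while (covered < run_len) {
        k++;
        covered += k;
    }
    return k;
}
```

For a run of 100 bytes this gives 14 codes, close to the square root of 200 (about 14.1).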

[0012]

[Embodiment] A multistage data compression device using optimum code representations according to one embodiment of the present invention is described below with reference to FIGS. 1 to 14.

[0013] FIG. 1 is a block diagram of the code assignment device. Here, 1 is the original data to be encoded; 2 is the special pattern encoding device; 3 is the general-purpose encoding device, which performs LZW coding on the file encoded by device 2; 4 is the code conversion device, which converts the binary data encoded and compressed by device 3 into data that uses only a given character set; and 5 is the compressed data produced by compression and conversion in devices 2, 3, and 4.

[0014] The special pattern encoding device 2 provides compression for sequences that the LZW code handles poorly by performing, before general-purpose encoding, special pattern encoding such as omitting the section code within a run of kanji sharing the same section code and omitting the repeated bytes within a run of identical bytes. The general-purpose encoding device 3 raises the compression rate of small data by encoding newly appearing character codes with 7 bits while the data is ASCII and, when data whose most significant bit is 1 appears, outputting a mode switching code and encoding with 8 bits from then on. This is called appearance code assignment. When the code output by LZW coding is C and the largest assigned code is J = 2^m + L (0 < L < 2^m), 2^m − L of the codes can be determined uniquely from m bits alone; using this fact, the device chooses between m bits and m + 1 bits according to the value of the code. This is called shortest code assignment. The device also has means for raising the compression rate by placing the range encoded with m bits where codes occur most frequently. This is called dynamic code assignment. The code conversion device 4 converts the data encoded by the general-purpose encoding device into the restricted output codes by applying the inverse of shortest code assignment. This is called longest code assignment. When only ASCII printable characters are used as output codes, there are 95 printable characters, codes 32 to 126. The code conversion device 4 assigns 128 − 95 = 33 of the printable characters as output codes for 6-bit input data and 2 × 95 − 128 = 62 of them as output codes for 7-bit input data. The average bit length is then (33 × 6 + 62 × 7) ÷ 95 = 6.65, an improvement of 10% or more over uuencode, which always carries 6 bits per output code.

[0015] The control procedure of this embodiment, configured as described above, is as follows. First, special patterns are extracted from the original data and the character strings of those patterns are encoded. FIG. 2 shows the code assignment. Here, character codes other than kanji are 8 bits long and kanji character codes are 16 bits long. Negative codes are used as character-type control codes and non-character-type control codes. Non-character-type control codes are rarely used codes, such as the mode switching code from ASCII characters to kanji, the file name designation, and the code marking the end of a file; they are output as they are, without encoding. Character codes and character-type control codes are subject to LZW coding. At encoding time, characters that often appear among hiragana and katakana, such as punctuation marks, brackets, the middle dot, and the long vowel mark, are also assigned to unused codes in sections 3 to 5, so that when the characters before and after them belong to sections 3 to 5, the consecutive character string falls within a single section. Processing is aborted if a user-defined character carrying one of the assigned section-point (kuten) codes, or a sequence that cannot be interpreted as kanji, appears.

[0016] FIG. 3 shows an example of special pattern encoding. For explanation, a number xx is written xxH in hexadecimal and xxB in binary. In the kanji codes of '「', '−', and '¬', the section code a8H repeats. For kanji codes, within a run of kanji sharing the same section code, setting the most significant bit of the point code to 0 indicates that the same section code continues, and setting the most significant bit to 1 marks the last point code of the run. Here the kanji with section code a8H are consecutive. With the most significant bit set to 0, the point codes of the kanji in section a8H become 23H and 21H; with the most significant bit set to 1, the last point code becomes a4H.
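The run encoding in this example can be sketched as follows. This is an illustration under stated assumptions, not the device's actual procedure: the input is taken to be 2-byte characters stored as (section, point) byte pairs with the top bit set in both bytes, the section code is emitted once at the start of the run, and the function name `encode_same_section_run` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: encode a run of 2-byte kanji that all share one section code.
 * in  : n_chars characters as (section, point) byte pairs.
 * out : the section code once, then each point code with its top bit
 *       cleared, except the last one, whose top bit is set.
 * Returns the number of bytes written (n_chars + 1), or 0 on bad input. */
size_t encode_same_section_run(const uint8_t *in, size_t n_chars, uint8_t *out)
{
    if (n_chars == 0)
        return 0;
    uint8_t section = in[0];
    size_t w = 0;
    out[w++] = section;
    for (size_t i = 0; i < n_chars; i++) {
        if (in[2 * i] != section)
            return 0;                    /* not a single-section run */
        uint8_t point = in[2 * i + 1];
        if (i + 1 < n_chars)
            out[w++] = point & 0x7f;     /* run continues: top bit 0 */
        else
            out[w++] = point | 0x80;     /* last of run: top bit 1   */
    }
    return w;
}
```

With the input bytes a8 a3, a8 a1, a8 a4, the sketch produces a8 23 21 a4: six bytes shrink to four, and the point codes match the 23H, 21H, a4H sequence of the example.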

[0017] For a character code repeated three or more times, the repetition count is encoded using character-type control codes and the character codes that have already appeared, so that the number of distinct characters does not increase. Repetitions of three and four use character-type control codes that denote those counts. For five or more repetitions, an appearance character code table is built so that the count can be expressed using only character codes that have already appeared. For example, at the point where the repetition of 21H is detected, the appearance table is as shown in FIG. 4 and the number of appeared characters R is 6. The first column is the number associated with each character code. When the repetition count X satisfies 5 ≤ X < 5 + R, the count is encoded as the pair of the character-type control code for multiple repetition and the (X − 5)th element of the appearance character code table; the code at position 0 thus corresponds to a count of 5, and subsequent positions correspond to subsequent counts. When X = R + 5, the count is expressed by three multiple-repetition control codes. When X > R + 5, the count is expressed as "multiple-repetition multiple-repetition C1 C2 ... Cn multiple-repetition", where each Ci is an element of the appearance character table. Let loc(Ci) be the function that returns the position of a character code in the appearance character table. The count is encoded so that X − (R + 5) is given by Equation (3). In Equation (3), Σ loc(Ci) A^(i−1) is a base-A numeric representation and Σ A^(i−1) is the offset of the values that need i digits to encode. This means that, as shown in FIG. 5, a two-digit binary representation can express values up to 6.

[0018]

[Equation 3]
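The image of Equation (3) is not reproduced in this text. Read from the description, it appears to have the form X − (R + 5) = Σ loc(Ci) A^(i−1) + Σ A^(i−1), with both sums running over i = 1 to n. The sketch below computes the digits loc(Ci) under that assumed reading. The base A is taken as a parameter because the text does not say how it is chosen, and the name `encode_repeat_count` is hypothetical.

```c
#include <stddef.h>

/* Sketch: split v = X - (R + 5), v >= 1, into n digits in the given base,
 * least significant first, such that
 *   v = sum(digits[i] * base^i) + sum(base^i),  i = 0 .. n-1.
 * Returns n, or 0 if v == 0, base < 2, or more than max_digits are needed. */
size_t encode_repeat_count(unsigned long v, unsigned long base,
                           unsigned *digits, size_t max_digits)
{
    if (v == 0 || base < 2)
        return 0;
    size_t n = 0;
    unsigned long power = 1;     /* base^n after the loop body runs   */
    unsigned long offset = 0;    /* sum of base^i for i < n           */
    /* Grow n until v falls inside the range covered by n digits. */
    do {
        offset += power;
        power *= base;
        n++;
    } while (v >= offset + power);
    if (n > max_digits)
        return 0;
    v -= offset;
    for (size_t i = 0; i < n; i++) {
        digits[i] = (unsigned)(v % base);
        v /= base;
    }
    return n;
}
```

With base 2, one digit covers the values 1 and 2 and two digits cover 3 to 6, which agrees with the remark that a two-digit binary representation reaches 6.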

[0019] The non-character-type control codes provided here are: non-kanji data, binary mode, EOD (End of Data), the non-character-type control code for 0000B, the non-character-type control code for 0001B, ..., and the non-character-type control code for 1111B. FIG. 6 shows the control codes.

[0020] Next, general-purpose encoding using LZW coding is performed. When the data encoded by the special pattern encoding device 2 is encoded and output by the general-purpose encoding device 3, ASCII data is encoded with 7 bits; when data whose most significant bit is "1" appears, a mode switching code is output and subsequent data is encoded with 8 bits. In addition, the lower E (= 0 to 7) bits of the character code are treated as a non-character-type control code, the remaining 7 − E or 8 − E bits are encoded as the information following that control code, and a code is thereby assigned to each character code. When ' ' is sent with E = 4, as shown in FIG. 7, the code representing ' ' is 00100000B; the lower 4 bits, 0000B, select the corresponding non-character-type control code, and the following information, 100B in ASCII mode or 0100B in 8-bit mode, is LZW-coded as a character code.

[0021] In LZW coding, the LZW codes of the control codes are fixed in advance, as shown in FIG. 6. FIG. 8 shows the general-purpose encoding of the output codes of FIG. 3.

[0022] In the LZW code, because of the special sequence known as KwKwK, the largest code the decoder can receive is J + 1. When this assignment is applied to the LZW code, the upper limit must therefore be J + 1. Immediately after the initial state, however, the special sequence cannot occur, so the upper limit may be J. By shifting coordinates in the residue system, the range encoded with m bits can be placed anywhere. The optimum range depends on the type of data and on the size encoded so far, so the range encoded with m bits is set according to those two factors. Which range is actually optimum is determined from the results of actual encoding. Letting H be 2^m, 0 ≤ L < H and J = H + L hold. The algorithm for shifting coordinates by d = L + 1 is shown in C in (Procedure 1) below.

[0023] (Procedure 1)
C: code to be output (2^m < C ≤ 2^(m+1))
d: coordinate shift, d = L + 1
J: largest assigned code, J = H + L (0 < L < H)

    C = C + d;
    if (C > J) C = C - (J + 1);

For decoding, d = J - (L + 1) is executed first.
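As a self-contained illustration of the shift, the sketch below rotates codes 0..J by d modulo J + 1 and undoes the rotation. The function names are hypothetical. The inverse uses a shift of (J + 1) − d, which is what arithmetic modulo J + 1 requires for an exact round trip; this is the sketch's own assumption and is not taken from the decoding constant stated in the procedure.

```c
/* Sketch of the residue-system shift: codes 0..J are rotated by d
 * modulo J + 1. Requires c <= j and d <= j + 1. */
unsigned shift_code(unsigned c, unsigned d, unsigned j)
{
    c += d;
    if (c > j)
        c -= j + 1;
    return c;
}

/* Inverse rotation, assuming the complementary shift (J + 1) - d. */
unsigned unshift_code(unsigned c, unsigned d, unsigned j)
{
    return shift_code(c, (j + 1) - d, j);
}
```

For example, with m = 3 and L = 2 (H = 8, J = 10, d = 3), code 8 is moved to 0, so the codes at the top of the range land in a different bit-length region after the shift.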

[0024] FIG. 9 shows the result of shifting the coordinates of the output codes of FIG. 8 by (Procedure 1) and outputting them by the algorithm of (Procedure 2).

[0025] (Procedure 2)
H: 2^m
bitPut(value, size) is a function that outputs value in size bits.

[0026] For decoding, codes are cut out by the algorithm of (Procedure 3).

[0027] (Procedure 3)
bitGet(size) is a function that cuts out size bits.

    C = bitGet(m);
    if (C <= L) C = (bitGet(1) << m) | C;
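The body of Procedure 2 does not appear in this text, so the encoder in the following sketch is inferred from Procedure 3 and should be read as an assumption: a code whose low m bits exceed L is written in m bits, and any other code gets one extra bit that carries bit m of the code. The names `BitBuf`, `bit_put`, `bit_get`, `put_code`, and `get_code` are hypothetical stand-ins for the text's bit output and bit extraction functions, and the bit order (least significant first) is chosen for the sketch only.

```c
#include <stdint.h>
#include <string.h>

/* Minimal bit buffer for the sketch; bits are stored least significant first. */
typedef struct {
    uint8_t bytes[256];
    unsigned wpos;   /* index of the next bit to write */
    unsigned rpos;   /* index of the next bit to read  */
} BitBuf;

void bit_init(BitBuf *b) { memset(b, 0, sizeof *b); }

/* Write the low `size` bits of value. */
void bit_put(BitBuf *b, unsigned value, unsigned size)
{
    for (unsigned i = 0; i < size; i++, b->wpos++)
        if ((value >> i) & 1u)
            b->bytes[b->wpos / 8] |= (uint8_t)(1u << (b->wpos % 8));
}

/* Read `size` bits. */
unsigned bit_get(BitBuf *b, unsigned size)
{
    unsigned value = 0;
    for (unsigned i = 0; i < size; i++, b->rpos++)
        value |= ((b->bytes[b->rpos / 8] >> (b->rpos % 8)) & 1u) << i;
    return value;
}

/* Inferred encoder (assumption): valid for 0 <= c <= J = 2^m + L. */
void put_code(BitBuf *b, unsigned c, unsigned m, unsigned l)
{
    unsigned low = c & ((1u << m) - 1u);
    bit_put(b, low, m);
    if (low <= l)                /* ambiguous in m bits: send bit m too */
        bit_put(b, c >> m, 1);
}

/* Decoder following Procedure 3. */
unsigned get_code(BitBuf *b, unsigned m, unsigned l)
{
    unsigned c = bit_get(b, m);
    if (c <= l)
        c = (bit_get(b, 1) << m) | c;
    return c;
}
```

With m = 3 and L = 2 (J = 10), codes 3 to 7 take 3 bits and the other six codes take 4 bits, so writing every code from 0 to 10 once uses 39 bits instead of the 44 that a fixed 4-bit code would need.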

[0028] The code conversion device 4 uses a conversion table to convert the binary data output by the general-purpose encoding device 3 into the restricted character code set and outputs the result.

[0029] FIG. 10 shows the results of encoding and decoding in the code conversion device 4. The code input from the general-purpose encoding device 3 and the bit length to use are looked up in the conversion table of FIG. 11. The table is constructed so that, when the bit length used is 6, the output code is always determined by the lower 6 bits of the input code; when the bit length used is 7, it is always determined by the lower 7 bits; the lower 6 bits of any input code with bit length 7 differ from the lower 6 bits of every input code with bit length 6; and all output codes are used. For example, the output code for the input code xy101101B (x and y are 0 or 1) is 'M', and the bit length used is the lower 6 bits. That is, 'M' is the only output code whose lower 6 bits are 101101B, so the lower 6 bits of the input code are discarded by a right shift and, if the remaining bits xy amount to 8 bits or more, the lookup is repeated. When the next input code C is received, it is inserted above the remaining bits (Cxy), and the lookup is performed if the total bit length is 8 bits or more. For decoding, the input code is looked up in the conversion table to obtain its code value and code bit length. The code value of the input character code 'M' is 3eH and its code bit length is 6, so 6 bits are cut out. Here, 8-bit input codes are encoded with 7 bits or 6 bits. For example, encoding the input code '14' with 7 bits gives 'a'. The next input code is appended to the remaining 1 bit and encoding continues in the same way. For decoding, the input code 'a' becomes the 7-bit code value '14'.
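The conversion table of FIG. 11 is not reproduced in this text. The sketch below therefore uses a different, purely hypothetical assignment that has the same 33/62 split and the same prefix-free property: 6-bit values 0 to 32 map to one printable character each, and the 6-bit prefixes 33 to 63 take one more bit and map to two characters each. The function names are illustrative.

```c
/* Hypothetical table for the 95 printable ASCII characters (32..126).
 * It is not the table of FIG. 11, only one assignment with 33 six-bit
 * and 62 seven-bit patterns. */

/* Map a 7-bit lookahead (the first 6 bits in the high positions) to a
 * character; *used receives the number of bits consumed (6 or 7). */
char bits_to_char(unsigned lookahead7, unsigned *used)
{
    unsigned prefix = (lookahead7 >> 1) & 0x3fu;
    if (prefix < 33u) {
        *used = 6;
        return (char)(32u + prefix);
    }
    *used = 7;
    return (char)(32u + 33u + 2u * (prefix - 33u) + (lookahead7 & 1u));
}

/* Inverse: recover the bit pattern from a printable character.
 * *len receives its width (6 or 7 bits). */
unsigned char_to_bits(char ch, unsigned *len)
{
    unsigned idx = (unsigned)ch - 32u;
    if (idx < 33u) {
        *len = 6;
        return idx;
    }
    *len = 7;
    return ((33u + (idx - 33u) / 2u) << 1) | ((idx - 33u) & 1u);
}
```

Because no 6-bit pattern is a prefix of a 7-bit pattern, a decoder can recover the bit length of each character from the character alone, which is the property the text requires of the conversion table.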

[0030]

[Effects of the Invention] FIGS. 12, 13, and 14 show examples of the improvement achieved by the present invention over compress. For this comparison with compress, the full character code set was used for output. The English text data was /usr/man/man1/* from SunOS Release 4.1.2 on a Sparc Station 2, the Japanese text data was /usr/man/japanese/man1/*, and the binary data was /bin/*; the improvement rates shown are for files that compress was able to compress. (a) is the value when the range assigned short codes is 0 ≤ C < 2^m − L, (b) when it is L ≤ C < 2^m, and (c) when it is 2L ≤ C ≤ J. E is set to 4 for text data and 6 for binary data; the best compression rate is usually obtained with E = 4 to 6, and larger values work better the more character types appear in the input. When the code table saturates, a scheme was adopted that keeps the most recently used codes, up to half of the maximum code.

[0031] The compression effect of dynamic code assignment is broken down as follows. The number of LZW codes is calculated from the size of the compress output file, and the result of evaluating Equation (4) exactly is taken as the logarithmic compression effect; the average of (a), (b), and (c) minus the logarithmic compression effect is shown as the other effect. In the figures, − indicates that there is no corresponding value, and * indicates that no comparison is possible because of the difference in processing when the code table saturates.

[0032]

[Equation 4]

[0033] Because the occurrence probabilities of the codes are not uniform, for text data range (a) gives the higher improvement rate at sizes of 30 Kbytes or less and range (c) at larger sizes. For binary data, range (c) gives the higher improvement rate at sizes of 1 Kbyte or less and range (a) at larger sizes.

[0034] Japanese text, which contains many 2-byte codes, uses many distinct characters, so appearance assignment has almost no effect on large files. The other effect for Japanese text is therefore due to compression of runs of kanji sharing the same section code, and it strengthens the compression rate by about 4 to 7%. The Japanese text used here has many roff commands inserted as ASCII characters, so the compression rate should improve further for purely Japanese text. For English text, which has almost no long runs of the same character, the other effect is due to appearance assignment and strengthens the compression rate by 0.2 to 35%. For binary data, in which almost every character occurs, appearance assignment worsens the compression rate. Binary data with file sizes of 1 Kbyte or less consists mostly of shell scripts written in ASCII. For binary data of 1 Kbyte or more, the other effect is due to compression of runs of identical bytes and strengthens the compression rate by 0.2 to 15% in the size ranges that contain many files.

[0035] When the output codes are restricted to ASCII printable characters, code assignment that is 10% or more efficient than uuencode is possible. Comparing the compression efficiency of the whole device with that of compress combined with uuencode, the compression rate improves by about 15 to 50%.

[0036] As described above, the present invention achieves a higher compression rate than conventional coding compression for Japanese-language data and any output character code set. Apart from the handling of code table saturation, the processing is simple, so it can be implemented with almost no loss of execution speed. It is therefore highly effective when applied to a wide range of Japanese data processing fields.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is a diagram showing the assignment of control codes used in the present invention.

FIG. 3 is a diagram showing an example of special pattern encoding in the present invention.

FIG. 4 is a diagram showing the appearance character code table in the present invention.

FIG. 5 is a diagram showing the two-digit binary representation in the present invention.

FIG. 6 is a diagram showing the LZW codes of the control codes in the present invention.

FIG. 7 is a diagram showing an example of code assignment in the present invention.

FIG. 8 is a diagram showing an example of general-purpose LZW encoding in the present invention.

FIG. 9 is a diagram showing an example of coordinate shifting in the present invention.

FIG. 10 is a diagram showing an example of encoding in the code conversion device of the present invention.

FIG. 11 is a diagram showing an example of the conversion table in the code conversion device of the present invention.

FIG. 12 is a diagram showing the improvement rate of the present invention over compress.

FIG. 13 is a diagram showing the improvement rate of the present invention over compress.

FIG. 14 is a diagram showing the improvement rate of the present invention over compress.

[Explanation of symbols]

1 Original data
2 Special pattern code assignment device
3 General-purpose code assignment device
4 Code conversion device
5 Compressed data

Claims

[Claims]

1. A multistage data compression device using optimum code representations, comprising, for data compression: a special pattern encoding device that performs source-dependent compression by encoding special patterns such as runs of the same section code in kanji data, runs of identical bytes, and numerical expressions; a general-purpose encoding device that, after the source-dependent compression, encodes the output into a source-independent universal code such as an LZW code using a shortest code representation; and a code conversion device that, when the output codes are restricted, uses a longest code representation to reduce the rate of data expansion and outputs compressed data conforming to the restriction on the output codes.

2. The multistage data compression device using optimum code representations according to claim 1, wherein the special pattern encoding device has a function of omitting the second and subsequent section codes in a run of the same section code in the kanji data and compressing using only point codes.

3. The multistage data compression device using optimum code representations according to claim 1, wherein the general-purpose encoding device uses an LZW code in which the shortest code range is changed dynamically according to the encoded data size or the character codes that appear.