JP2825960B2

JP2825960B2 - Data compression method and decompression method

Info

Publication number: JP2825960B2
Application number: JP2275835A
Authority: JP
Inventors: 茂吉田; 佳之岡田; 泰彦中野; 広隆千葉
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1990-10-15
Filing date: 1990-10-15
Publication date: 1998-11-18
Anticipated expiration: 2013-11-18
Also published as: JPH04149766A

Description

DETAILED DESCRIPTION OF THE INVENTION [Summary] The invention relates to a data compression method and a decompression method in which one of a plurality of dictionaries is designated by an index based on the dependency (history) on the last character of the immediately preceding, already encoded character string; the input character string is encoded into an LZW code by designating it with the reference number of the longest-matching substring among the already encoded substrings registered in the designated dictionary; and the character string is restored from the LZW code. The object is to simplify the initial registration of the multiple dictionaries and to improve coding efficiency. The dictionary is configured as a group of a predetermined number of dictionaries, fewer than the number of all character types to be processed, and in each dictionary all character types are initially registered one character at a time, each with a reference number.

[Field of Industrial Application] The present invention relates to a data compression method and a decompression method using the LZW code, which is known as an improvement of the incremental parsing type of universal code.

In recent years, computers have come to handle many kinds of data, such as character codes, vector information, and images, and the amount of data handled is increasing rapidly. When large amounts of data are handled, removing the redundant parts of the data to compress it reduces the required storage capacity and speeds up transmission.

Universal coding has been proposed as a method that can compress such varied data with a single scheme.

The field of the present invention is not limited to the compression of character codes and applies to various kinds of data. In the following, however, the terminology of information theory is adopted: one word of data is called a character, and a sequence of any number of such words is called a character string.

A representative universal code is the Ziv-Lempel code (for details see, for example, Munakata, "Ziv-Lempel Data Compression Method", Information Processing, Vol. 26, No. 1, 1985).

Two algorithms have been proposed for the Ziv-Lempel code: the universal type and the incremental parsing type.

As an improvement of the universal-type algorithm, there is the LZSS code (see T. C. Bell, "Better OPM/L Text Compression", IEEE Trans. on Commun., Vol. COM-34, No. 12, Dec. 1986).

As an improvement of the incremental parsing algorithm, there is the LZW (Lempel-Ziv-Welch) code (see T. A. Welch, "A Technique for High-Performance Data Compression", Computer, June 1984).

Of these codes, the LZW code has come to be used for file compression in storage devices and similar applications because it allows high-speed processing and its algorithm is simple.

[Prior Art] FIG. 5 shows a flowchart of the conventional LZW encoding algorithm, and FIG. 6 shows a flowchart of the decoding algorithm.

LZW encoding uses a rewritable dictionary. It divides the input character string into mutually distinct character strings, registers them in the dictionary with numbers assigned in order of appearance, and encodes the character string currently being input by representing it solely with the reference number of the longest matching character string registered in the dictionary.

FIG. 7 shows a concrete example of LZW encoding, FIG. 9 shows a concrete example of LZW decoding, and FIG. 8 shows the contents of the dictionary used in encoding and decoding. To keep the explanation simple, FIGS. 7, 8, and 9 take as an example the compression and restoration of a character string made up of combinations of the three characters a, b, and c.

First, the LZW encoding process of FIG. 5 is described as follows.

Step S1 (hereinafter "Step" is omitted): Before encoding starts, character strings consisting of a single character are registered in the dictionary as initial values for all characters. Encoding in S1 searches the dictionary with the first input character K to obtain its reference number ω, which becomes the prefix string.

S2: Read the next character K of the input data.

S3: Check whether character input has ended.

S4: Search the dictionary for the string (ωK) formed by appending the character K read in S2 to the prefix string ω obtained in S1.

S5: If the string (ωK) is found in the dictionary in S4, take the reference number of (ωK) as the new ω in S5 and return to S2, continuing the search for the longest match until (ωK) can no longer be found in the dictionary.

S6: If the string (ωK) is not found in the dictionary in S4, proceed to S6 and output the reference number ω of the prefix string as the codeword code(ω). Register the string (ωK) in the dictionary with a new reference number, replace ω with the reference number of the input character K of S2, increment the dictionary address n, and return to S2 to read the next character K.
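Steps S1 to S6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a transcription of the FIG. 5 flowchart: the function name `lzw_encode`, the 1-based reference numbers, and the handling of empty input are choices made for this example. Registered strings are keyed by the pair (ω, K), the same form in which the text writes dictionary entries.

```python
def lzw_encode(data, alphabet):
    """Encode `data` into a list of LZW reference numbers (illustrative sketch)."""
    # S1 (initial registration): every single character gets a reference
    # number, starting from 1 in alphabet order.
    single = {ch: i + 1 for i, ch in enumerate(alphabet)}
    pairs = {}                    # (omega, K) -> reference number
    n = len(alphabet) + 1         # next free reference number
    codes = []

    chars = iter(data)
    try:
        omega = single[next(chars)]   # S1: first character becomes the prefix
    except StopIteration:
        return codes                  # assumption: empty input gives no codes

    for k in chars:                   # S2/S3: read characters until input ends
        if (omega, k) in pairs:       # S4/S5: extend the match
            omega = pairs[(omega, k)]
        else:                         # S6: emit code, register (omega, K)
            codes.append(omega)
            pairs[(omega, k)] = n
            n += 1
            omega = single[k]
    codes.append(omega)               # flush the last prefix at end of input
    return codes
```

With the three-character alphabet used in the figures, `lzw_encode("ababc", "abc")` gives `[1, 2, 4, 3]`.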

This is explained concretely with reference to FIGS. 7 and 8 as follows.

The input data (INPUT) of FIG. 7 is read from left to right. When the first character a is input, the dictionary contains no matching character string other than the character a, so OUTPUT CODE 1 (reference number ω) is output as the codeword. The character a then becomes the prefix string ω.

Next, when the second character b is input, the extended string ωK = ab, formed by appending this input character to the prefix string ω, is not in the dictionary, so OUTPUT CODE 2 of the character b is output as the codeword. The extended string ωK = ab is then registered in the dictionary with reference number 4. In the actual dictionary it is registered as the string 1b, as shown on the right side of FIG. 8. The character b then becomes the prefix string ω.

When the third character a is then input, the extended string ωK = ba = 2a, formed by appending the character a to the prefix string ω, is not in the dictionary. After OUTPUT CODE 2 of the character b is output as the codeword, the extended string ωK = ba is expressed as 2a and registered in the dictionary with reference number 5. The character a then becomes the new prefix string ω.

For the fourth input character b, the extended string ωK = ab is already registered in the dictionary as 1b with codeword 4, so the string ωK becomes the new prefix string ω, and the fifth character c is input to form the extended string ωK = 4c = abc. Since this extended string ωK = abc is not registered in the dictionary, OUTPUT CODE 4 of the string ab = 1b is output as the codeword, and the extended string ωK = abc is registered in the dictionary in the form 4c with codeword 6. The process continues in the same way.
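The registrations described in this walkthrough can be reproduced with a short trace. The sketch below is illustrative only: the helper name `lzw_registrations` is hypothetical, it assumes non-empty input, and it records each newly registered string in the "reference number plus character" notation the text uses (1b, 2a, 4c).

```python
def lzw_registrations(data, alphabet):
    """List (reference number, 'prefix-code + character') for each new entry.

    Assumes `data` is non-empty and uses only characters from `alphabet`.
    """
    single = {ch: i + 1 for i, ch in enumerate(alphabet)}
    pairs = {}
    n = len(alphabet) + 1
    rows = []
    omega = single[data[0]]
    for k in data[1:]:
        if (omega, k) in pairs:
            omega = pairs[(omega, k)]        # longest match grows
        else:
            pairs[(omega, k)] = n            # register omega+K under number n
            rows.append((n, f"{omega}{k}"))  # e.g. (4, "1b")
            n += 1
            omega = single[k]
    return rows
```

For the first five input characters `ababc`, the trace is `[(4, "1b"), (5, "2a"), (6, "4c")]`, which agrees with the entries 4, 5, and 6 described above.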

Next, the decoding process of FIG. 6 is described. As in encoding, character strings consisting of a single character are registered in the dictionary in advance as initial values for all characters before decoding starts.

S1: Read the first code CODE and decode the reference number ω. The current reference number ω is saved as OLDω. Since the first code always corresponds to one of the single-character reference numbers already registered in the dictionary, the character D(K) matching the input reference number ω is looked up and the character K is output. The output character K is also set in FINchar for the later exception processing.

S2: Read the next code CODE.

S3: Check whether there is a new code, that is, whether code input has ended.

S4: Decode the reference number ω from the code CODE that was read and set it as INω.

S5: Check whether the code CODE input in S4 is registered in the dictionary or not (ω ≥ n).

S6: Normally the input codeword has been registered in the dictionary by the preceding processing, so the process proceeds to S6 and reads the string D(ω′K) corresponding to the reference number ω from the dictionary.

S7: Push the character K onto a stack temporarily, take the reference number ω′ as the new ω, and return to S6. The S6 procedure is repeated recursively until the reference number ω reaches a single character.

S8: Pop the characters stacked in S7 in LIFO (Last In First Out) order and output them. At the same time, the string expressed as the pair (OLDω, K), consisting of the previously used reference number OLDω and the first character K of the string restored this time, is registered in the dictionary with a new reference number n.
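The stack-based expansion of S6 to S8 can be sketched as follows. The table layout is an assumption made for this example: single characters are stored as plain strings, and longer strings as (ω′, K) pairs. `EXAMPLE_TABLE` simply hard-codes the first entries from the encoding walkthrough.

```python
# Hypothetical table: reference number -> character, or (prefix number, character).
EXAMPLE_TABLE = {
    1: "a", 2: "b", 3: "c",
    4: (1, "b"),   # ab
    5: (2, "a"),   # ba
    6: (4, "c"),   # abc
}

def expand(code, table):
    """Restore the string for `code` by following prefix links (S6/S7),
    then popping the stacked characters in last-in, first-out order (S8)."""
    stack = []
    while True:
        entry = table[code]
        if isinstance(entry, str):   # reached a single character: stop
            stack.append(entry)
            break
        code, k = entry              # omega' becomes the new omega
        stack.append(k)              # S7: push the trailing character K
    out = []
    while stack:
        out.append(stack.pop())      # S8: pop in LIFO order
    return "".join(out)
```

For example, `expand(6, EXAMPLE_TABLE)` follows 6 → 4 → 1 and returns `"abc"`.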

This LZW decoding process is explained concretely with reference to FIG. 9 as follows.

The first input code is 1. The single characters a, b, and c are already registered in the dictionary with reference numbers 1, 2, and 3, as shown in Table 1, so the code is replaced, by reference to the dictionary, with the character string a whose reference number matches code 1, and that string is output. The next code 2 is likewise replaced with the character b and output. At this point, (1b), the combination of the previously processed code and the first character b of the string decoded this time, is registered in the dictionary with the new reference number 4.

The third code 4 is replaced, by a dictionary lookup, first with 1b and then with ab, and the string ab is output. At the same time, the string 2a (= ba), the combination of the previously processed code 2 and the first character a of the string decoded this time, is registered in the dictionary with the new reference number 5.

The process is repeated in the same way.

The decoding of FIG. 9 involves the following exception processing.

This exception processing occurs when the sixth input code, 8, is decoded. Code 8 is not yet defined in the dictionary at the time of decoding and therefore cannot be decoded directly. In this case, the string 5b is formed by appending the first character b of the previously decoded string ba to the previously processed code 5; it is then replaced with 2ab and then bab, and output. After the string is output, the string 5b, formed by appending the character b of the string decoded this time to the previous codeword 5, is registered in the dictionary with reference number 8.

This exception processing is carried out through steps S5 and S9 of the decoding flow of FIG. 6, and finally in S8 the string is output and the new string is registered in the dictionary with a reference number.
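A compact decoder that includes this exception case can be sketched as follows. It is an illustration under assumptions, not the FIG. 6 flowchart itself: `lzw_decode` is a hypothetical name, dictionary entries are stored as whole strings for brevity rather than as (ω′, K) pairs, and the code sequence in the usage note is an assumed input whose sixth code is 8, as in the case described above.

```python
def lzw_decode(codes, alphabet):
    """Restore the original string from LZW reference numbers (sketch)."""
    table = {i + 1: ch for i, ch in enumerate(alphabet)}  # initial registration
    n = len(alphabet) + 1                                 # next reference number
    if not codes:
        return ""
    old = table[codes[0]]          # S1: the first code is a single character
    out = [old]
    for code in codes[1:]:
        if code < n:               # S5/S6: normal case, already registered
            cur = table[code]
        elif code == n:            # S9: exception, code not yet registered;
            cur = old + old[0]     # it must be the previous string + its first char
        else:
            raise ValueError(f"invalid code {code}")
        out.append(cur)
        table[n] = old + cur[0]    # S8: register (OLD omega, first char of cur)
        n += 1
        old = cur
    return "".join(out)
```

Decoding `[1, 2, 4, 3, 5, 8]` over the alphabet `abc` returns `"ababcbabab"`: when code 8 arrives, the decoder has registered entries only up to 7, so it rebuilds `bab` from the previous string `ba`.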

The encoding and decoding processes of FIGS. 5 and 6 are carried out while each builds up the same dictionary.

[Problems to Be Solved by the Invention] As described above, when the conventional LZW code divides the input character string into mutually distinct strings and encodes them, it treats each string currently being encoded as occurring independently of the strings before it.

This approach poses no problem for encoding a memoryless information source. However, much real data, such as actual text, is regarded as coming from an information source with memory. The conventional LZW code does not make sufficient use of the history of string occurrences, so redundancy in the dependency between string occurrences remains even after compression.

To address this drawback, the present inventors have proposed a data compression and decompression scheme that incorporates into the dictionary the dependency of the string being encoded on the last character of the immediately preceding string, that is, the history, thereby reducing the redundancy between strings and raising the compression ratio.

Specifically, as shown in FIG. 10, the dictionary 10 is divided into a plurality of indexed dictionaries 10-1, 10-2, 10-3, and 10-4, and a particular dictionary, for example 10-2, is selected by using the last character b of the immediately preceding string ab as the index. Each of the dictionaries 10-1 to 10-4 registers only the strings that follow its index character.

Whereas a string in the dictionary was conventionally designated by a reference number taken over the whole dictionary, with this method it can be designated by a reference number taken only over the strings that follow the index. Smaller reference numbers can therefore be used, the LZW code can be expressed more compactly, and coding efficiency improves.
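A rough bit count illustrates why smaller reference numbers help. The sketch below assumes, purely for illustration, that a codeword is written with the minimum fixed number of bits able to distinguish all entries of the dictionary in use; the text does not specify how codewords are packed, and the entry counts in the usage note are hypothetical.

```python
def code_width(entries):
    """Minimum number of bits needed to distinguish `entries` reference numbers."""
    if entries < 1:
        raise ValueError("a dictionary needs at least one entry")
    # (entries - 1).bit_length() equals ceil(log2(entries)) for entries >= 2.
    return max(1, (entries - 1).bit_length())
```

Under this assumption, a single dictionary holding 4096 strings needs `code_width(4096) == 12` bits per codeword, while one of four indexed dictionaries holding 1024 strings needs `code_width(1024) == 10` bits.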

With this method, however, the way the initial values are set becomes a problem, as follows.

When the LZW code handles data in byte units, all 256 character types are registered in the dictionary in advance as initial values so that the codewords stay simple and consist of reference numbers only. If this initial registration is applied to the method of FIG. 10, 256 dictionaries are used, indexed by all 256 character types, and the initial registration is performed for each dictionary, so 256 × 256 (64K) entries must be registered in advance. In practice many of these 64K entries are never used, so if the initial values are registered as they are, the reference numbers grow by the number of unused characters and coding efficiency falls.

If, instead, the initial values are not registered in advance and a single-character string is registered in the dictionary when it first appears, the inefficiency of registering initial values can be avoided. With that approach, however, the codewords can no longer all be expressed as reference numbers: they must be divided into those for single-character strings, which encode the raw data, and those for strings of two or more characters, which encode a reference number. This complicates the algorithm.

The present invention has been made in view of these problems. Its object is to provide a data compression method and a decompression method that, when one of a plurality of dictionaries is designated for encoding and restoration by an index based on the dependency on the last character of the immediately preceding encoded string, consolidate the dependencies so as to simplify the initial registration of the multiple dictionaries and improve coding efficiency.

[Means for Solving the Problems] FIG. 1 is a diagram explaining the principle of the present invention.

The present invention first relates to a data compression method that encodes an input character string into an LZW code by designating it with the reference number of the longest-matching substring among the already encoded substrings registered in the dictionary 10.

In such a data compression method, according to the present invention, the dictionary 10 is configured as a dictionary group consisting of a predetermined number of dictionaries 10-1 to 10-N, fewer than the number of all character types to be processed, and in each of the dictionaries 10-1 to 10-N all character types are initially registered one character at a time, each with a reference number.

When the input character string is encoded, a specific dictionary 10-i in the dictionary group is designated for encoding in accordance with index information indicating the dependency (history) on the previously encoded character string. If the input string is not found in the designated dictionary 10-i, the string formed by appending the next character to the reference number of the previously encoded string is registered with a new reference number.

Here, when the input string is encoded, the specific dictionary 10-i in the dictionary group is designated in accordance with index information obtained from a part of the last character code of the immediately preceding encoded string. More specifically, the specific dictionary 10-i is designated in accordance with index information given by the upper bits of the last character code of the immediately preceding encoded string.
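Selecting a dictionary from the upper bits of the last character code amounts to a shift. The sketch below assumes 8-bit character codes and 0-based dictionary indices; the function name `history_index` and the `bits` parameter are hypothetical.

```python
def history_index(last_char_code, bits=4):
    """Dictionary index taken from the upper `bits` bits of an 8-bit character code."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    return (last_char_code & 0xFF) >> (8 - bits)
```

For the character `a` (code 0x61), `history_index(0x61)` is 6, so with 4 upper bits the 256 possible last characters map onto 16 dictionaries.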

Alternatively, when the input string is encoded, the specific dictionary 10-i in the dictionary group may be designated in accordance with index information obtained by referring to a lookup table with the last character code of the immediately preceding encoded string. Specifically, the specific dictionary 10-i is designated in accordance with index information obtained by referring to the lookup table with the upper bits of the last character code of the immediately preceding encoded string.

The present invention also concerns a data decompression method that restores the original character string from codewords produced by designating the input character string with the reference number of the longest-matching substring among the already encoded substrings registered in the dictionary 10. The dictionary 10 is configured as a dictionary group consisting of a predetermined number of dictionaries 10-1 to 10-N, fewer than the number of all character types to be processed, and in each of the dictionaries 10-1 to 10-N all character types are initially registered one character at a time, each with a reference number. When an input codeword is restored, a specific dictionary 10-i in the dictionary group is designated for restoration in accordance with index information indicating the dependency on the previously restored character string, and at each restoration the string formed by appending the first character of the string restored this time to the reference number of the previously restored string is registered with a new reference number. The specific dictionary 10-i used in restoration is designated in the same way as in encoding.

[Operation] The data compression method and decompression method of the present invention configured in this way operate as follows.

The history indicating the dependency on the last character of the immediately preceding string has 256 possible states if used as it is. Character occurrences are unevenly distributed, however, and some of the 256 states never occur. The present invention therefore merges the histories of the last character to reduce them to a small number of significant states, for example 8 to 16, and so reduces the number of dictionaries.

Because the number of history states is small, the number of initial values registered across the dictionaries for all 256 character types is the number of histories, that is, the number of dictionaries, × 256, so no great waste arises.
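The saving in initial registrations is simple arithmetic, shown here as a small Python check. The constant and function names are hypothetical; the dictionary counts 256, 16, and 8 are the ones mentioned in the text.

```python
CHAR_TYPES = 256  # byte-oriented data: 256 single-character strings

def initial_entries(num_dicts):
    """Total single-character entries registered across all dictionaries."""
    return num_dicts * CHAR_TYPES
```

One dictionary per last character gives `initial_entries(256) == 65536` entries (64K), whereas 16 merged history states give `initial_entries(16) == 4096` and 8 states give `initial_entries(8) == 2048`.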

One way to consolidate the histories is, for example, to take the upper 4 bits of the last character of the immediately preceding encoded string, which groups the histories into 16 states. To use the dictionaries effectively, it is desirable to group the histories into states that occur with roughly equal frequency. It is not always necessary to use the raw data bits of the character, however. A lookup table (LUT) that maps the last character of the immediately preceding encoded string to a history state may be prepared to suit the general nature of the data and used to designate the history state of the preceding character, that is, the dictionary index.
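As an illustration of a LUT tuned to the nature of the data, the sketch below groups byte values into six classes for ASCII-like text. The grouping itself (control codes, digits, upper case, lower case, space and punctuation, high bytes) is an assumption invented for this example and does not come from the text; only the idea of mapping the last character to a history state does.

```python
def build_text_lut():
    """Hypothetical 256-entry LUT mapping a byte value to a history state 0..5."""
    lut = [0] * 256
    for c in range(256):
        ch = chr(c)
        if c >= 0x80:
            lut[c] = 5            # non-ASCII bytes
        elif ch.isdigit():
            lut[c] = 1            # 0-9
        elif ch.isupper():
            lut[c] = 2            # A-Z
        elif ch.islower():
            lut[c] = 3            # a-z
        elif 0x20 <= c <= 0x7E:
            lut[c] = 4            # space and punctuation
        else:
            lut[c] = 0            # control codes, including 0
    return lut
```

With this table the last character `a` selects state 3 and the digit `7` selects state 1, so six dictionaries would be needed instead of 16.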

[Embodiment] FIG. 2 is a configuration diagram showing one embodiment of the present invention.

In FIG. 2, reference numeral 12 denotes a CPU serving as control means, and a program memory 14 and a data memory 26 are connected to the CPU 12.

The program memory 14 holds control software 16; longest-match search software 18, which performs the longest-match search used for the LZW code; encoding software 20, which converts the input character string into LZW codes; decoding software 22, which restores the codes produced by the encoding software 20 to the original character string; and dictionary initial value creation software 24, which initially registers all character types to be processed, for example 256 character types.

The data memory 26 contains a data buffer 28, which stores the character string to be encoded or the code string to be decoded, and the dictionary 10, which is built up progressively and used during LZW encoding and decoding.

Taking as an example the case where the dictionary 10 is classified by index information that indicates the dependency and consists of the upper 4 bits of the last character code of the encoded string, the dictionary 10 consists of 16 dictionaries 10-1 to 10-16 for all 256 character types. The dictionary index may be designated directly from the upper 4 bits of the last character code of the encoded string, but the following description takes as an example the case where the dictionary index is read out and designated by referring to a lookup table (LUT).

The data compression and restoration of the present invention in the embodiment of FIG. 2 proceed in outline as follows.

Under the control of the control software 16, the CPU 12 starts the dictionary initial value creation software 24 and performs the dictionary initial value creation process. Specifically, the dictionary initial value creation software 24 registers all 256 character types, one character at a time and each with a reference number, in each of the 16 dictionaries 10-1 to 10-16 that make up the dictionary.

The data buffer 28 of the data memory 26 takes in a fixed-length block of several characters of the data to be encoded from outside at one time and passes the characters one by one to the encoding software 20 on request. Each time the data buffer 28 becomes empty, it takes in another block of characters from outside in the same way.

Next, the encoding algorithm of the present invention is described with reference to the flowchart of FIG. 3.

In FIG. 3, the following processing is first performed in S1.

In each of the N dictionaries D_i (i = 1, ..., N) that are selected by the last character of the immediately preceding string, all single-character strings are registered in advance as initial values. In the present invention, the total number of dictionaries is as small as N = 16 for all 256 character types.

The total number of reference numbers in each dictionary D_i is managed by a counter n_i, and at initialization each of the N counters n_i is set to n_i = (number of character types + 1).

The history from the immediately preceding string, that is, the upper 4 bits of the last character code of the immediately preceding string, is denoted PK, and PK is initialized to PK = 0.

The first character K is input and converted into a reference number (prefix string) ω.

The LUT that maps the last character K1 of the immediately preceding string to a history state is set up. Since there is no preceding string at the start, K1, which denotes the last character of the immediately preceding string, is set to K1 = 0, and the LUT is set so that the index PK obtained from the LUT for K1 = 0 is PK = 0.
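The S1 initialization can be sketched as follows. This is a minimal illustration assuming 16 dictionaries, 256 character types, 1-based reference numbers (so each counter starts at 257, that is, character types + 1), 0-based dictionary indices, and a LUT that simply returns the upper 4 bits; the names `init_state`, `NUM_DICTS`, and `CHAR_TYPES` are hypothetical.

```python
NUM_DICTS = 16     # N, assumed as in the embodiment
CHAR_TYPES = 256   # all byte values

def init_state():
    """Build the N dictionaries, their counters, the LUT, and the initial PK."""
    # Each dictionary D_i initially holds every single character,
    # with reference numbers 1..256.
    dicts = [{bytes([c]): c + 1 for c in range(CHAR_TYPES)}
             for _ in range(NUM_DICTS)]
    # n_i = (number of character types + 1): the next free reference number.
    counters = [CHAR_TYPES + 1] * NUM_DICTS
    # LUT from last character K1 to history state; lut[0] == 0 as required.
    lut = [c >> 4 for c in range(CHAR_TYPES)]
    k1 = 0             # no preceding string yet
    pk = lut[k1]       # hence PK = 0
    return dicts, counters, lut, pk
```

After `init_state()`, every dictionary contains the same 256 single-character entries, so any first character can be encoded in dictionary 0.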

When the processing of S1 is finished, encoding proceeds according to steps S4 to S7. These steps are basically the same as the conventional procedure shown in FIG. 5.

The difference is that conventional LZW encoding uses only one dictionary, whereas in the present invention a specific dictionary D_PK is selected from the plurality of dictionaries by the history state LUT(K1) = PK, obtained by looking up the LUT with the last character K1 of the already-encoded character string (set in S1 at the start and in S6 thereafter). The input is matched against the character strings registered in the selected dictionary D_PK to find the longest matching string, and the string ωK, which extends the longest match by one character, is registered in the same dictionary D_PK.

After registration in the dictionary D_PK in S8, the counter n_PK that manages the reference numbers of D_PK is incremented by one, n_PK = n_PK + 1. Then, as described above, a new history state PK is obtained from the last character K1 using the LUT in order to select the dictionary for the next character string.
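The encoding procedure described above can be sketched in Python as follows. This is an illustrative reading of the text, not the flowchart of FIG. 3 itself: the function name `encode` is hypothetical, single characters are assumed to carry reference numbers 1 to 256, the LUT is assumed to be the upper 4 bits of the character code, and the output is a plain list of reference numbers with no bit packing.

```python
def encode(data: bytes) -> list:
    """Multi-dictionary LZW encoding with a last-character history."""
    if not data:
        return []
    num_dicts = 16
    lut = [k >> 4 for k in range(256)]       # K1 -> history state PK
    # Each dictionary maps (omega, K) -> reference number. Single
    # characters are implicit: character K has reference number K + 1.
    dicts = [dict() for _ in range(num_dicts)]
    n = [257] * num_dicts                    # next free reference number
    pk = 0                                   # no preceding string yet
    omega = data[0] + 1                      # reference number of first K
    k1 = data[0]                             # last character consumed
    codes = []
    for k in data[1:]:
        ref = dicts[pk].get((omega, k))
        if ref is not None:
            omega = ref                      # omega K is known: extend match
        else:
            codes.append(omega)              # emit the longest match
            dicts[pk][(omega, k)] = n[pk]    # register omega K in D_PK
            n[pk] += 1
            pk = lut[k1]                     # history from last encoded char
            omega = k + 1                    # restart the match at K
        k1 = k
    codes.append(omega)                      # flush the final match
    return codes


codes = encode(b"abababab")
```

Under these assumptions `encode(b"abababab")` yields `[98, 99, 98, 257, 259]`: the first string is matched in dictionary 0, and every later one in dictionary 6, because both 'a' (0x61) and 'b' (0x62) have 6 as their upper 4 bits.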

Next, the decoding algorithm of the present invention is described with reference to FIG. 4.

Decoding is the reverse of encoding. First, the dictionary initialization shown in S100 is the same as for encoding. Steps S1 to S8 are basically the same as in the conventional method of FIG. 7.

The decoding of the present invention differs in that, after the reference number ω is decoded from the input code CODE in S4, the dictionary D_PK is selected using the history state PK obtained from the last character of the immediately preceding character string, and the character string corresponding to the reference number ω is looked up in the selected dictionary D_PK.

New character strings are registered in the dictionary as in LZW encoding, but one step later than during encoding. That is, the encoder registers the string ωK (the string of interest plus the next character) as soon as it finishes encoding the string of interest. The decoder, to extend the string of interest ω by one character, must combine it with the first character of the next string, and so it registers the entry only once the next string has been restored.

Specifically, as shown in S7, the pair consisting of the reference number OLDω of the immediately preceding character string and the first character K1 of the restored character string is registered in the dictionary D_PK1, which was selected by the history state PK1 derived from the last character of the string before the preceding one. For this purpose, the current history state PK is saved to PK1 for the next registration, when the restored string will be extended, and a new history state is obtained from the last character K2 of the restored string.
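The delayed registration described above can be sketched in Python as follows. As before this is only an illustration under stated assumptions: `decode` is a hypothetical name, single characters are assumed to have reference numbers 1 to 256, the LUT is assumed to be the upper 4 bits of the character code, and the tables store whole byte strings rather than (OLDω, K1) pairs for brevity. The branch that handles a code not yet present in the table is the usual LZW special case, which the text above does not describe explicitly.

```python
def decode(codes: list) -> bytes:
    """Multi-dictionary LZW decoding with one-step-late registration."""
    num_dicts = 16
    lut = [k >> 4 for k in range(256)]       # last character -> PK
    # Each table maps a reference number to the string it stands for.
    tables = [{c + 1: bytes([c]) for c in range(256)}
              for _ in range(num_dicts)]
    n = [257] * num_dicts                    # next free reference number
    pk = 0       # history state selecting the dictionary for this code
    pk1 = 0      # history state that selected the previous dictionary
    old = None   # previously restored string (OLD omega)
    out = bytearray()
    for code in codes:
        table = tables[pk]
        if code in table:
            s = table[code]
        elif old is not None and pk == pk1 and code == n[pk]:
            # The code refers to the entry that is about to be registered:
            # it must be the previous string plus its own first character.
            s = old + old[:1]
        else:
            raise ValueError("invalid code %d" % code)
        if old is not None:
            # Register OLD omega + first character K1 in D_PK1.
            tables[pk1][n[pk1]] = old + s[:1]
            n[pk1] += 1
        out += s
        old = s
        pk1 = pk                             # move PK to PK1
        pk = lut[s[-1]]                      # new history from K2
    return bytes(out)


restored = decode([98, 99, 98, 257, 259])
```

The code sequence `[98, 99, 98, 257, 259]` used here restores `b"abababab"`; its last code exercises the special case, since entry 259 of dictionary 6 is registered only while that code is being processed.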

The above embodiment takes as an example the case where, for 256 character types, 16 dictionaries are provided according to the history state. However, any suitable number of dictionaries may be used as needed, provided it does not exceed the total number of character types.

The number of character types may likewise be chosen as needed.

[Effects of the Invention]

As described above, according to the present invention, the initial values of a plurality of dictionaries selected by the history state of the character strings can be registered simply and without waste. Because the history of previously occurring character strings is also taken into account for the character string being encoded, redundancy between character strings is reduced and a higher compression ratio than with the conventional LZW code is obtained. In addition, since each code consists only of a reference number, the method can be carried out with a simple algorithm.

[Brief description of the drawings]

FIG. 1 is a diagram explaining the principle of the present invention;
FIG. 2 is a block diagram of an embodiment of the present invention;
FIG. 3 is a flowchart of the encoding algorithm of the present invention;
FIG. 4 is a flowchart of the decoding algorithm of the present invention;
FIG. 5 is a flowchart of the conventional LZW encoding algorithm;
FIG. 6 is a flowchart of the conventional LZW decoding algorithm;
FIG. 7 is a diagram explaining a specific example of conventional LZW encoding;
FIG. 8 is a diagram explaining an example of the dictionary structure;
FIG. 9 is a diagram explaining a specific example of conventional LZW decoding;
FIG. 10 is a diagram explaining an encoding, already proposed by the present inventors, that combines substring decomposition with capture of the history between character strings.

[Explanation of Reference Numerals]
10, 10-1 to 10-N: dictionary
12: CPU
14: program memory
16: control software
18: longest-match search software
20: encoding software
22: decoding software
24: dictionary initial value creation software
26: data memory
28: data buffer

Continuation of the front page
(72) Inventor: Hirotaka Chiba, c/o Fujitsu Limited, 1015 Kamiodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa
(56) References: JP-A-60-116228 (JP, A); JP-A-3-262331 (JP, A); JP-A-3-270417 (JP, A)
(58) Fields searched (Int. Cl.⁶, DB name): H03M 7/40

Claims

(57) [Claims]

1. A data compression method for encoding an input character string by designating it with the reference number of the longest matching substring among the already-encoded substrings registered in a dictionary (10), characterized in that: the dictionary (10) is configured as a dictionary group consisting of a predetermined number of dictionaries (10-1 to 10-N) smaller than the number of all character types to be processed; in each dictionary (10-1 to 10-N), at least all character types are initially registered one character at a time, each with a reference number; when the input character string is encoded, a specific dictionary (10-i) in the dictionary group is designated according to index information indicating a dependency relationship with a previously encoded character string, and encoding is performed with it; and when the input character string is not present in the designated dictionary (10-i), a character string formed by appending the next character to the reference number of the already-encoded character string is registered with a new reference number.

2. The data compression method according to claim 1, characterized in that, when the input character string is encoded, a specific dictionary (10-i) in the dictionary group is designated according to index information obtained from a part of the last character code of the immediately preceding encoded character string.

3. The data compression method according to claim 2, characterized in that, when the input character string is encoded, a specific dictionary (10-i) in the dictionary group is designated according to index information given by the upper bits of the last character code of the immediately preceding encoded character string.

4. The data compression method according to claim 1, characterized in that, when the input character string is encoded, a specific dictionary (10-i) in the dictionary group is designated according to index information obtained by referring to a look-up table with the last character code of the immediately preceding encoded character string.

5. The data compression method according to claim 4, characterized in that, when the input character string is encoded, a specific dictionary (10-i) in the dictionary group is designated according to index information created from the upper bits of the last character code of the immediately preceding encoded character string.

6. A data restoration method for restoring the original character string from a codeword produced by encoding an input character string as the reference number of the longest matching substring among the already-encoded substrings registered in a dictionary, characterized in that: the dictionary (10) is configured as a dictionary group consisting of a predetermined number of dictionaries (10-1 to 10-N) smaller than the number of all character types to be processed; in each dictionary (10-1 to 10-N), at least all character types are initially registered one character at a time, each with a reference number; when an input codeword is restored, a specific dictionary (10-i) in the dictionary group is designated according to index information indicating a dependency relationship with a previously restored character string, and restoration is performed with it; and each time a restoration is performed, a character string formed by appending the first character of the currently restored character string to the reference number of the previously restored character string is registered with a new reference number.

7. The data restoration method according to claim 6, characterized in that, when an input codeword is restored, a specific dictionary (10-i) in the dictionary group is designated according to index information obtained from a part of the last character code of the immediately preceding restored character string.

8. The data restoration method according to claim 7, characterized in that, when an input codeword is restored, a specific dictionary (10-i) in the dictionary group is designated according to index information given by the upper bits of the last character code of the immediately preceding restored character string.

9. The data restoration method according to claim 6, characterized in that, when an input codeword is restored, a specific dictionary (10-i) in the dictionary group is designated according to index information obtained by referring to a look-up table with the last character code of the immediately preceding restored character string.

10. The data restoration method according to claim 9, characterized in that, when an input codeword is restored, a specific dictionary (10-i) in the dictionary group is designated according to index information obtained by referring to a look-up table with the upper bits of the last character code of the immediately preceding restored character string.