JP3346626B2 - Data compression device

Info

Publication number: JP3346626B2
Application number: JP29937693A
Authority: JP
Inventors: 隆昭林
Original assignee: Kyocera Corp
Current assignee: Kyocera Corp
Priority date: 1993-11-30
Filing date: 1993-11-30
Publication date: 2002-11-18
Anticipated expiration: 2017-11-18
Also published as: JPH07152533A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of industrial application] The present invention relates to a data compression device and method that remove redundant components from data without any loss of information. In particular, it relates to a universal compression coding method that achieves good compression regardless of the data format, such as character codes, vector graphics, or image data.

[0002]

[Prior art] The Lempel-Ziv coding method (LZ code) has long been known as a method that can compress data of various formats with a single coding scheme.

[0003] Two algorithms have been proposed for the LZ code: a sliding dictionary method and a dynamic dictionary method (for details, see "Data Compression Algorithms and Their Implementation," Interface, 1992, Vol. 8, No. 183). The dynamic dictionary method is generally said to give a lower compression ratio than the sliding dictionary method, but its algorithm is simpler and it can run at high speed. As an improvement on the dynamic dictionary method, a Lempel-Ziv coding method known as the LZW code has been proposed (T. A. Welch, "A Technique for High-Performance Data Compression," Computer, June 1984). The LZW code is an algorithm that decomposes the input data sequence into partial data strings by a method called incremental parsing, registers those partial data strings in a dictionary, and encodes the input data while referring to that dictionary.

[0004] The LZW code is described in detail below with reference to the drawings. FIG. 4 shows the processing flow of the LZW coding method.

[0005] First, the overall flow of the LZW coding process is described with reference to FIG. 4(a).

[0006] First, the dictionary is initialized in S401. Every symbol that can appear in the input data is registered in the dictionary as an initial value, each as a one-symbol data string. In the simplest case, a binary data sequence whose only symbols are 0 and 1, the initial values 0 and 1 are registered at unregistered addresses of the dictionary, and each is given a uniquely identifiable index. In S402, the counter n used to read the input data sequence from its first symbol is set to 1. The n-th symbol of the input data sequence is written sn. In S403, the first symbol sn (n = 1) is read, the dictionary is searched for an entry matching sn, and its index i is obtained. The counter n is then incremented by 1. Next, in S404, the longest-match search is performed: the dictionary is searched for the symbol string that is the longest match with the partial data string of the input that starts at sn. The index of the longest-matching symbol string is returned as the new value of i, and the symbol that follows the longest-matching partial data string in the input is returned as the new sn. The flow of the longest-match search is described in detail later. Next, in S405, the index i of the longest-matching symbol string is converted to the code c(i) and output. In S406, a new symbol string is registered in the dictionary. First, sn is appended to the end of the symbol string indicated by index i to generate the symbol string isn (the string indexed by i followed by sn). Because the symbol string indicated by index i is the longest-matching partial data string, the string isn, which extends it by the next input symbol sn, does not exist in the dictionary. The string isn is therefore registered at an unregistered address of the dictionary and given an index that uniquely distinguishes it from the symbol strings registered so far. Processing then returns to S403, and encoding continues by the same procedure.
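Steps S401 to S406 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited reference: the function name `lzw_encode` and its parameters are hypothetical, the dictionary is a plain Python `dict` keyed by string, and the function returns the raw indexes i rather than bit-level codes c(i).

```python
def lzw_encode(data, alphabet):
    """Conventional LZW sketch: return the list of emitted dictionary indexes."""
    # S401: register every possible symbol as a one-symbol string.
    dictionary = {sym: idx for idx, sym in enumerate(alphabet)}
    codes = []
    if not data:
        return codes

    current = data[0]   # S403: string matched so far, starting with the first symbol
    n = 1               # position of the next unread symbol (0-based here)
    while True:
        # S404: extend the match while the extended string is still registered.
        while n < len(data) and current + data[n] in dictionary:
            current += data[n]
            n += 1
        codes.append(dictionary[current])                # S405: emit index i
        if n == len(data):
            break                                        # no symbols remain
        dictionary[current + data[n]] = len(dictionary)  # S406: register "isn"
        current = data[n]                                # restart from symbol sn
        n += 1
    return codes


print(lzw_encode("0010011", "01"))  # [0, 0, 1, 2, 1, 1]
```

For the binary input 0010011 the sketch emits the indexes 0, 0, 1, 2, 1, 1, which stand for the strings 0, 0, 1, 00, 1, 1. Each pass through S406 adds exactly one entry, one symbol longer than an existing entry.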

[0007] Next, the flow of the longest-match search is described with reference to FIG. 4(b). First, in S411, to decide whether processing has finished, it is determined whether symbols to be processed remain in the input data. If no symbol sn exists, the process advances to S412, where the previously obtained index i is encoded, the code c(i) is output, and processing ends. If a symbol sn exists, it is read and the process advances to S413. In S413, the newly read symbol sn is appended to the end of the symbol string indicated by the previously obtained index i, generating the symbol string isn. The dictionary is then searched for an entry matching isn (S414). If a matching entry exists, S415 obtains the dictionary index corresponding to isn, updates i to that index, and increments the input counter n by 1. Processing then returns to S411, and the same procedure is repeated until no entry matching isn can be found, which locates the dictionary entry that is the longest match with the partial data string of the input. When this iteration reaches a point at S414 where no entry matches isn, the registered symbol string that is the longest match with the partial data string of the input is the one indicated by index i.
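The loop of S411 to S415 can be isolated as a small function. The sketch below is illustrative only: `longest_match` and its arguments are hypothetical names, the dictionary is assumed to be a Python list whose positions are the indexes, and positions in the input are 0-based.

```python
def longest_match(data, n, i, entries):
    """Extend the match indexed by i, starting at input position n.

    entries: list of registered strings; the list position is the index.
    Returns (i, n): the index of the longest match and the position of the
    first symbol that is not part of it.
    """
    index_of = {s: k for k, s in enumerate(entries)}
    while n < len(data):                   # S411: do symbols remain?
        candidate = entries[i] + data[n]   # S413: build the string "isn"
        if candidate not in index_of:      # S414: not registered, so stop
            break
        i = index_of[candidate]            # S415: adopt the longer match
        n += 1
    return i, n


entries = ["0", "1", "00", "01"]
print(longest_match("0010", 1, 0, entries))  # (2, 2): the match is "00"
```

The caller is assumed to have already looked up the first symbol (S403), which is why the call above starts with i = 0 and n = 1.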

[0008] As described above, the LZW code is characterized in that the dictionary is updated inside the coding process: the data read so far is decomposed into partial data strings, and each partial data string is registered in the dictionary in turn. The dictionary built in this way can be represented schematically as the tree structure shown in FIG. 5. Each branch of the tree represents a symbol, and the symbol string read from the root to a node is a symbol string registered in the dictionary. The number attached to a node is the index that identifies that symbol string. By the nature of the LZW code, branches containing symbol strings that appear frequently in the input grow readily. Because the symbol string represented by such a long branch is longer than the code obtained by encoding the index that identifies the branch, the data can be compressed.
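The tree view of the dictionary can be made concrete with a small trie. This is a sketch under assumptions: the class `Node` and the helpers `build_trie` and `lookup` are hypothetical names, and the four entries are an arbitrary example rather than the contents of FIG. 5.

```python
class Node:
    """One node of the dictionary tree; each outgoing branch is one symbol."""

    def __init__(self):
        self.index = None    # index of the string spelled from the root, if registered
        self.children = {}   # symbol -> child Node


def build_trie(entries):
    """Build the tree from a list of strings whose positions are the indexes."""
    root = Node()
    for index, string in enumerate(entries):
        node = root
        for sym in string:
            node = node.children.setdefault(sym, Node())
        node.index = index
    return root


def lookup(root, string):
    """Follow the branches for `string`; return its index or None."""
    node = root
    for sym in string:
        node = node.children.get(sym)
        if node is None:
            return None
    return node.index


root = build_trie(["0", "1", "00", "01"])
print(lookup(root, "00"), lookup(root, "10"))  # 2 None
```

If indexes are encoded at a fixed length (an assumption; the text does not specify the form of c(i)), a dictionary of N entries needs about ceil(log2 N) bits per index, so a branch that spells more binary symbols than that yields a net saving.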

[0009]

[Problems to be solved by the invention] In the conventional data compression method described above, however, the branch added to the dictionary by each iteration of S403 to S406 in FIG. 4 is created by appending one symbol to the end of one of the symbol strings already registered in the dictionary, so it grows by only one symbol. For the LZW code to compress effectively, frequently occurring branches of the dictionary should grow as quickly as possible, but with the above configuration the rate at which branches grow is severely limited. As a result, compression performance suffers, particularly in the early stage when the dictionary is still being built.

[0010] In view of the above, the object of the present invention is to achieve higher-performance data compression by extending the branches registered in the dictionary as quickly as possible.

[0011]

[Means for solving the problems] To solve these problems, the present invention provides a data compression device that generates compressed data by removing redundant components contained in input data composed of discrete information comprising a plurality of types of symbols, the device comprising: longest-match search means for searching the symbol strings registered in a dictionary for the longest symbol string that matches a partial data string of the input data; encoding means for generating a code based on the index assigned to that symbol string; generating means for generating a plurality of new symbol strings from a first longest-match partial data string encoded by the encoding means and a second longest-match partial data string encoded immediately after it, by successively appending symbols of the second longest-match partial data string, starting from its first symbol, to the end of the first longest-match partial data string; and dictionary registration means for registering the symbol strings generated by the generating means in the dictionary. The invention further provides the data compression device according to claim 1 in which a maximum value is set in advance for the number of symbol strings that the dictionary registration means newly registers in one pass.

[0012]

[Operation] With the above means, the symbol strings registered in the dictionary lengthen as quickly as possible, and the dictionary grows faster than with the conventional LZW code. The dictionary therefore grows sufficiently even in the early stage of compression, which improves the compression effect.

[0013] In addition, limiting the maximum number of symbol strings registered in one dictionary registration pass lets the dictionary grow quickly while avoiding the registration of needlessly long symbol strings, so the configuration achieves a high compression effect.

[0014]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the configuration of a data compression device according to the embodiment.

[0015] When the input data sequence 101 is input to the data compression device 102, the longest-match search unit 104 searches for the longest partial data string of the input data sequence 101 that matches a symbol string registered in the dictionary 105. Each symbol string registered in the dictionary carries an index that uniquely distinguishes it from the others. The index of the symbol string that is the longest match with the partial data string of the input data 101 is sent to the encoding unit 106, encoded, converted into compressed data 103, and output from the data compression device 102. The symbol string found by the longest-match search unit 104 is also sent to the dictionary registration unit 107. There, new symbol strings are created from the longest-match symbol string just encoded and the longest-match symbol string encoded immediately before it, and they are newly registered in the dictionary 105. The number of symbol strings registered in one dictionary registration pass must not exceed the predetermined maximum registration count setting 108.

[0016] FIG. 2 shows the processing flow of the data compression method of the embodiment. In this embodiment too, for simplicity, the input data sequence is assumed to be a binary sequence composed of the two symbols 0 and 1.

[0017] First, the overall flow of the data compression process is described with reference to FIG. 2(a). In S201, the dictionary is initialized in the same way as for the conventional LZW code. Because this embodiment handles a binary sequence, the symbols 0 and 1 are registered as initial values, each with an index that distinguishes it uniquely from the other. In S202, the counter n used to read the input data sequence from its first symbol is set to 1. Next, in S203, the n-th symbol sn is input, the dictionary is searched for an entry matching sn, and its index is taken as iL. The counter n is then incremented by 1.

[0018] The index iL obtained in S203 is encoded in S204 and output as the code c(iL). Next, in S205, the dictionary is searched for an entry matching the symbol sn, and its index is taken as iP. The counter n is incremented by 1. In S206, the longest-match search is performed: the dictionary is searched for the symbol string that is the longest match with the partial data string of the input that starts at sn. The index of the longest-matching symbol string is returned as the new value of iP, and the symbol that follows the longest-matching partial data string in the input is returned as the new sn. The flow of the longest-match search is described in detail later. The index iP is encoded in S207 and output as the code c(iP).

[0019] Next, in S208, the dictionary registration process registers new symbol strings in the dictionary. Here, a plurality of new symbol strings are registered per registration pass. The maximum number of new symbol strings to be registered by the dictionary registration process S208 is fixed in advance and denoted MaxEnt. One registration pass therefore registers between 1 and MaxEnt new symbol strings. As a result, partial data strings that are both frequent and long in the input are registered as symbol strings in the dictionary, the dictionary learns more effectively, and the compression ratio improves. The details of the registration process S208 are given later. Finally, in S209, index iL and index iP are compared. If their values differ, the value of iP is assigned to iL and processing returns to S205. If their values are equal, processing returns to S205 without that assignment. This branch is explained together with the detailed description of the dictionary registration process S208.
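The overall flow of S201 to S209 might be sketched in Python as below. This is an illustration under several assumptions rather than the patented implementation: the name `encode` and its parameters are hypothetical, the function returns raw indexes instead of bit-level codes, the loop is assumed to stop when no symbol remains at S205, and S209 is assumed to compare iP with the value iL had before registration, falling back to the index of the most recently registered string when the two are equal.

```python
def encode(data, alphabet="01", max_ent=2):
    """Return the dictionary indexes emitted for `data` (hypothetical sketch)."""
    entries = list(alphabet)                       # S201: list position = index
    index_of = {s: k for k, s in enumerate(entries)}
    codes = []
    if not data:
        return codes

    i_l = index_of[data[0]]                        # S203: first symbol
    codes.append(i_l)                              # S204: emit c(iL)
    n = 1
    while n < len(data):                           # assumed stop condition at S205
        i_p = index_of[data[n]]                    # S205
        n += 1
        # S206: longest-match search (S221 to S224).
        while n < len(data) and entries[i_p] + data[n] in index_of:
            i_p = index_of[entries[i_p] + data[n]]
            n += 1
        codes.append(i_p)                          # S207, or S222 at end of input
        if n == len(data):
            break
        # S208: register up to max_ent strings, iL followed by a prefix of iP.
        new_string = entries[i_l]
        last_new = i_l
        for t in entries[i_p][:max_ent]:
            new_string += t
            last_new = len(entries)
            entries.append(new_string)
            index_of[new_string] = last_new
        # S209: normally iL <- iP; if equal, use the newest entry instead.
        i_l = last_new if i_l == i_p else i_p
    return codes


print(encode("0001001"))  # [0, 0, 0, 1, 2, 1]
```

For the input 0001001 the emitted indexes stand for the strings 0, 0, 0, 1, 00, 1. The fourth registration pass in that run adds two entries at once (10 and 100), which is the behaviour that distinguishes this flow from conventional LZW.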

[0020] Next, the flow of the longest-match search is described with reference to FIG. 2(b).

[0021] First, S221 checks whether a symbol sn exists in the input data. If no symbol sn exists, all of the input is judged to have been processed: the index iP is encoded in S222 and output as the code c(iP), and the data compression process ends. If a symbol sn exists, processing continues to S223.

[0022] In S223, the symbol string indicated by index iP and the symbol sn are concatenated by appending sn to the end of the string indicated by iP. The symbol string newly generated by this concatenation is denoted iPsn. In S224, the dictionary is searched for an entry matching iPsn. If there is no match, the symbol string indicated by index iP is the longest-match symbol string; the index iP and the value of the symbol sn that follows the longest-match symbol string are returned to S208 of the overall flow, and the longest-match search ends. If S224 finds an entry matching iPsn, processing returns to S221 and the longest-match search continues. In this way, by extending the partial data string one symbol at a time from the input and searching the dictionary at each step, the partial data string that is the longest match with a registered symbol string can be found.

[0023] Next, the flow of the dictionary registration process is described with reference to FIG. 2(c). As noted above, the maximum number MaxEnt of symbol strings that can be registered in one registration pass is fixed in advance. The indexes iL and iP are passed to the registration process from the overall flow. New symbol strings are generated by appending the first several symbols of the string indicated by iP to the string indicated by iL, and the generated strings are registered in the dictionary. First, in S231, the counter j used to iterate over the symbol strings to be registered (at most MaxEnt of them) is initialized to 0. Then, in S232, with iP[j] denoting the j-th symbol of the string indicated by index iP, it is determined whether the symbol iP[j] exists in that string. The first symbol of a string is taken to be its 0th symbol.

[0024] If the symbol iP[j] does not exist, the registration process ends and control returns to the overall flow. If iP[j] exists, processing advances to S233, where iP[j] is assigned to t. Next, in S234, the symbol string is registered in the dictionary. The symbol t is appended to the end of the string indicated by index iL, and the newly generated string is denoted iLt. The string indicated by iL is the registered symbol string that is the longest match with a partial data string of the input, and the string indicated by iP is the string that follows it in the input, so iLt is not registered in the current dictionary. The string iLt is therefore newly registered and given an index that uniquely distinguishes it from the other registered strings. Once the string has been registered, processing advances to S235, where the index of iLt becomes the new iL and the counter j is incremented by 1. Finally, j is compared with MaxEnt. If they are equal, the registration process ends and control returns to the overall flow. If they are not equal, processing returns to S232 and registration continues.
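The registration loop of S231 to S235 can be sketched as follows. The function name `register_entries` is hypothetical, and the dictionary is assumed to be a Python list whose positions are the indexes; neither is taken from the patent text.

```python
def register_entries(entries, i_l, i_p, max_ent):
    """Append up to max_ent new strings to `entries`.

    Each new string is the previous one extended by the next symbol of the
    string indexed by i_p, starting from the string indexed by i_l.
    Returns the index of the last string registered.
    """
    j = 0                                  # S231: initialise the counter
    while j < len(entries[i_p]):           # S232: does iP[j] exist?
        t = entries[i_p][j]                # S233: t <- iP[j]
        entries.append(entries[i_l] + t)   # S234: register the string "iLt"
        i_l = len(entries) - 1             # S235: iL <- index of "iLt"
        j += 1
        if j == max_ent:                   # stop after MaxEnt registrations
            break
    return i_l


entries = ["0", "1"]
print(register_entries(entries, 0, 1, 2), entries)  # 2 ['0', '1', '01']
```

When the string indexed by iP is shorter than MaxEnt, as in the call above, the S232 test ends the loop early and fewer than MaxEnt strings are registered.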

[0025] The case where the values of iL and iP are equal is now described. Suppose, for example, that a dictionary like (e), with the tree structure (a) in FIG. 3, has been built, and that the value of iL at this point is 2 (the symbol string indicated by iL is 00). If the input data that follows is 00011..., the longest-match partial data string is 00, so iP is also 2. With MaxEnt set to 2, the two symbol strings 000 and 0000 are newly registered, and the dictionary becomes (f), with the tree structure (b). If iL were now updated with the value of iP as in normal processing, iL would be 2 (indicating 00), the next longest-match partial data string would be 01, and iP would be 4. The symbol strings to be registered would then be 000 and 0001, and the dictionary would be updated to (g), with the tree structure (c). The symbol strings indicated by index 6 and index 8 would then both be 000, so the dictionary would contain a duplicate entry. Useless indexes would accumulate in the dictionary, the code length needed to encode the indexes so that they remain uniquely identifiable would grow, and coding efficiency would fall. Therefore, when iL and iP are equal, iL is not updated with the value of iP. Instead, iL is updated with the index of the most recently registered symbol string, which was generated by appending the first two symbols of the string indicated by iP to the string indicated by iL, and processing continues. In the example of FIG. 3, iL then becomes 7, indicating the string 0000; the next longest-match partial data string is 01, so iP becomes 4, and the dictionary is ultimately updated to (i), with the tree structure (d).
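The duplication described above can be reproduced with a few lines of Python. This is a worked illustration under assumptions: the helper `register` is a hypothetical name, and only indexes 2 (00) and 4 (01) of the starting dictionary come from the text, while the strings at indexes 3 and 5 are placeholders chosen for the example.

```python
def register(entries, i_l, i_p, max_ent):
    """Append iL extended by the first max_ent symbols of iP; return last index."""
    base = entries[i_l]
    for t in entries[i_p][:max_ent]:
        base += t
        entries.append(base)
    return len(entries) - 1


# Indexes 2 ("00") and 4 ("01") follow the text; 3 and 5 are assumed values.
start = ["0", "1", "00", "10", "01", "11"]

# Naive update: after the first pass, iL is simply overwritten with iP = 2.
naive = list(start)
register(naive, 2, 2, 2)      # adds 000 (index 6) and 0000 (index 7)
register(naive, 2, 4, 2)      # adds 000 (index 8) and 0001 (index 9)
print(naive[6], naive[8])     # 000 000

# Rule from the text: when iL == iP, continue from the newest entry (index 7).
fixed = list(start)
i_l = register(fixed, 2, 2, 2)   # i_l becomes 7, the string 0000
register(fixed, i_l, 4, 2)       # adds 00000 (index 8) and 000001 (index 9)
print(fixed[8], fixed[9])        # 00000 000001
```

Under the naive update, indexes 6 and 8 both hold 000; under the rule from the text, every entry stays distinct and the branch for the run of zeros keeps growing.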

[0026]

[Effects of the invention] As described above, when the present invention performs dictionary registration, it appends to the end of a longest-match partial data string up to a plurality of the symbols that follow it in the input data sequence, creating a plurality of new symbol strings that are registered in the dictionary. The dictionary therefore grows faster than in conventional data compression, and the compression effect improves.

[0027] In addition, limiting how many symbols are appended to the longest-match partial data string prevents the registration of long symbol strings that would occur only very rarely in the input data sequence, so compression efficiency does not fall.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the data compression device of the present invention.

FIG. 2 is a flowchart showing the control of the data compression device of the present invention.

FIG. 3 is an explanatory diagram of the dictionary registration process used in the data compression device of the present invention.

FIG. 4 is a flowchart showing the control of a conventional data compression method.

FIG. 5 is a diagram explaining the representation of the dictionary as a tree structure.

[Explanation of symbols]

101: input data
102: data compression device
103: compressed data
104: longest-match data search unit
105: dictionary
106: encoding unit
107: dictionary registration unit
108: maximum registration count setting

Claims

(57) [Claims]

1. A data compression device that generates compressed data by removing redundant components contained in input data composed of discrete information comprising a plurality of types of symbols, the device comprising: longest-match search means for searching the symbol strings registered in a dictionary for the longest symbol string that matches a partial data string of the input data; encoding means for generating a code based on the index assigned to that symbol string; generating means for generating a plurality of new symbol strings from a first longest-match partial data string encoded by the encoding means and a second longest-match partial data string encoded immediately after the first longest-match partial data string, by successively appending symbols of the second longest-match partial data string, starting from its first symbol, to the end of the first longest-match partial data string; and dictionary registration means for registering the symbol strings generated by the generating means in the dictionary.

2. The data compression device according to claim 1, wherein a maximum value is set in advance for the number of symbol strings that the dictionary registration means newly registers in one pass.