JP3384813B2 - Data compression method

Info

Publication number: JP3384813B2
Application number: JP28613691A
Authority: JP
Inventors: Yasuhiko Nakano; Shigeru Yoshida; Yoshiyuki Okada; Hirotaka Chiba
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1991-10-31
Filing date: 1991-10-31
Publication date: 2003-03-10
Anticipated expiration: 2018-03-10
Also published as: JPH05128103A

Description

Detailed Description of the Invention

【０００１】[0001]

[Industrial Field of Application] The present invention relates to a data compression method, and in particular to a data compression method that calculates the appearance frequencies of characters and uses them to limit the match search processing.

[0002] In recent years, remarkable technological progress has dramatically improved the processing speed and storage capacity of computers. However, since computers began handling data such as vector information and image information, the amount of data to be handled has grown more than ever. To cope with this large increase, methods that reduce the amount of data without impairing its content, namely data compression methods, have been proposed.

[0003] When a large amount of data is handled, a data compression method compresses the data by encoding it with its redundant portions omitted. Data compression reduces the amount of data and, as a result, the storage capacity required. In communications, transmitting compressed data allows the same information to be sent faster.

[0004] The definitions of "character" and "character string" follow JIS-C6230 and also the terminology used in information theory: data consisting of one word unit is called a "character", and data consisting of an arbitrary number of word units is called a "character string".

【０００５】[0005]

[Prior Art] Conventionally, universal coding has been proposed as a method for compressing such data. Representative examples of universal coding are the LZ (Lempel-Ziv) coding method and the arithmetic coding method. For LZ coding, two algorithms have been proposed: the universal type and the incremental parsing type. Improved versions of these algorithms include the LZSS coding method, which belongs to the universal type, and the LZW (Lempel-Ziv-Welch) coding method, which belongs to the incremental parsing type.

[0006] The LZ coding method is described in detail in, for example, Seiji Munakata, "Lempel-Ziv Data Compression Method", Information Processing, Vol. 26, No. 1, pp. 2-6, 1985. The LZSS coding method is described in detail in T.C. Bell, "Better OPM/L Text Compression", IEEE Trans. on Commun., Vol. COM-34, No. 12, Dec. 1986. The LZW coding method is described in detail in T.A. Welch, "A Technique for High-Performance Data Compression", Computer, Jun. 1984. The incremental parsing coding method and the LZW coding method are disclosed in Japanese Patent Laid-Open No. 59-231683 and U.S. Pat. No. 4,558,302.

[0007] Of these coding methods, LZW coding has been the one generally used because it is fast and its algorithm is simple. LZW coding uses a rewritable dictionary and encodes as follows. First, the newly input character string is divided into mutually distinct partial character strings; any partial string not yet registered in the dictionary is assigned an identification number in order of appearance and registered. At the same time, the longest dictionary string that matches the partial string currently being input is selected and encoded with the identification number assigned to it. When the compression ratio over some section fell below a predetermined value, the dictionary holding the partial strings accumulated through learning was discarded and a new dictionary was built.

[0008] FIG. 5 shows an example of the tree structure of a dictionary. It shows the internal structure of the dictionary used when encoding with the incremental parsing algorithm of the LZ coding method. In the figure, circled numbers are identification numbers, and each place bearing a circled number is called a "node".

[0009] The dictionary 50 starts from a root 51. No character is assigned to the root 51. First characters are registered one level below the root 51, that is, in the first level 52. These first-level entries are mutually distinct characters and are registered mainly when the dictionary 50 is initialized. The figure shows three registered characters, "a", "b", and "c", but in practice the 256 characters representable with 8-bit data are registered.

[0010] The levels from the second level 53 downward hold characters registered by learning the character strings input from the information source. A node that has a level below it is called a "branch", and a node that has no level below it is called a "leaf". In the figure, therefore, the nodes with circled numbers 25, 26, 13, 14, 27, 28, 16, 6, ..., 22, 23, and 24 are leaves, and the other nodes are branches.

[0011] A node that is currently a leaf may become a branch through learning. For example, when the character string "acd" is registered in the dictionary 50, the string "ac" is already registered as "a" (circled number 1) in the first level 52 and "c" (circled number 6) in the second level 53, so "d" is newly registered in the third level 54 below the "c" of the second level 53. At that point the node with circled number 6 changes from a leaf to a branch.
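The tree dictionary described for FIG. 5 can be sketched as a small trie. This is only an illustration under assumptions: the names `Node` and `register` are hypothetical, and the identification numbers are assigned in registration order here, so they do not reproduce the circled numbers of the figure.

```python
class Node:
    """One dictionary node: an identification number plus links to children."""

    def __init__(self, ident):
        self.ident = ident
        self.children = {}  # next character -> Node

    def is_leaf(self):
        # A leaf has no level below it; a branch has at least one child.
        return not self.children


def register(root, string, next_id):
    """Walk `string` from the root, creating any missing nodes.

    Returns the next unused identification number.
    """
    node = root
    for ch in string:
        if ch not in node.children:
            node.children[ch] = Node(next_id)
            next_id += 1
        node = node.children[ch]
    return next_id


root = Node(0)  # the root carries no character
next_id = 1
for s in ("a", "b", "c", "ac"):
    next_id = register(root, s, next_id)

node_ac = root.children["a"].children["c"]
was_leaf = node_ac.is_leaf()              # "ac" has nothing below it yet
next_id = register(root, "acd", next_id)  # adds "d" below "ac"
now_leaf = node_ac.is_leaf()              # "ac" has become a branch
```

Registering "acd" only creates the one missing node for "d", which is the leaf-to-branch transition the paragraph describes.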

[0012] Next, the compression algorithm that uses this dictionary 50 is described. FIG. 6 is a flowchart of the compression algorithm of the LZW coding method. In the figure, the number following S is the step number.

[0013] [S61] Initialization is performed: the dictionary D and the variable n are initialized. To initialize the dictionary D, every character string consisting of a single distinct character is registered in D, that is, D(i) = i (i = 1, 2, ..., A), where A is the size of the alphabet, usually 256. The variable n is set to the number of distinct characters registered during initialization, that is, the alphabet size A. In addition, the cursor is positioned at the head of the character string to be newly input.

[0014] [S62] Character string search is performed. A new character string is read from the input stream, and the dictionary D is searched for the longest string that matches the string beginning at the character at the cursor position. If there is no string left to input, compression ends.

[0015] [S63] Encoding is performed. The identification number assigned to the string found in step S62 is encoded and written to the output stream. For example, if the identification number of the string found is r, it is converted into a binary code of [log₂ r] bits and output. Here the symbol [x] denotes the smallest integer not less than x; the symbol is used in this sense below.
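The code length stated for step S63 can be checked with a short calculation. The function name `code_bits` is hypothetical; integer arithmetic is used instead of a floating-point logarithm so that exact powers of two are handled without rounding concerns.

```python
def code_bits(r):
    """Smallest integer not less than log2(r), for an identification number r >= 1."""
    if r < 1:
        raise ValueError("identification numbers start at 1")
    # For r >= 1, (r - 1).bit_length() equals the ceiling of log2(r).
    return (r - 1).bit_length()


bits_256 = code_bits(256)    # last single-character entry when A = 256
bits_257 = code_bits(257)    # first learned entry needs one more bit
bits_4096 = code_bits(4096)
```

The jump from 8 to 9 bits at r = 257 shows why larger identification numbers cost more output bits.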

[0016] [S64] Character string processing is performed. The first character at the cursor position is saved, and the cursor is positioned at the head of the string that follows the current input string read in step S62.

[0017] [S65] Dictionary registration is checked: it is determined whether the variable n exceeds NMAX, the maximum number of entries that can be registered in the dictionary D. If n does not exceed NMAX (YES), processing proceeds to step S66; if it does (NO), processing proceeds to step S67.

[0018] [S66] Dictionary registration is performed. The variable n is increased by 1 (hereinafter, increasing by 1 is called "incrementing"). The string formed by appending the character saved in step S64 to the current input string is then registered in the dictionary D with identification number n. Processing returns to step S62 to handle the next string.

[0019] [S67] Deterioration of the compression ratio is checked. The compression ratio = (total number of bits of the input character strings) / (total number of bits of the codes) is computed, and it is determined whether the ratio has fallen. If the ratio has deteriorated (fallen) (YES), processing returns to step S61; if it has not (NO), processing returns to step S62.
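Steps S61 to S66 can be sketched as a textbook LZW encoder. This is a minimal illustration rather than the flowchart itself: the function name `lzw_compress` and the parameter `nmax` are hypothetical, byte value b is assumed to map to identification number b + 1 so that numbers run from 1 to A, and the compression ratio check of step S67 is left out.

```python
def lzw_compress(data, nmax=4096):
    """Return the list of identification numbers emitted for `data` (bytes)."""
    # S61: register every single character; identification numbers run 1..A.
    dictionary = {bytes([i]): i + 1 for i in range(256)}
    n = len(dictionary)  # number of strings registered so far
    codes = []
    pos = 0              # cursor into the input
    while pos < len(data):
        # S62: find the longest registered string starting at the cursor.
        end = pos + 1
        while end < len(data) and data[pos:end + 1] in dictionary:
            end += 1
        # S63: emit the identification number of the match.
        codes.append(dictionary[data[pos:end]])
        # S64 to S66: register match + next character while room remains.
        if end < len(data) and n < nmax:
            n += 1
            dictionary[data[pos:end + 1]] = n
        pos = end
    return codes


codes = lzw_compress(b"abababab")
```

For the input `abababab` the encoder emits five identification numbers for eight characters, because "ab" and "aba" are learned along the way and reused.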

[0020] Thus, in conventional LZW coding, registration stopped once the dictionary was full, that is, once entries had been registered up to the maximum address of the dictionary. The compression ratio was then evaluated for each predetermined amount of input, for example every few hundred kilobytes, and the dictionary was initialized when the current ratio fell below the previous one. The reason is that the incoming data (character strings) was judged to differ greatly from the statistical properties of the accumulated dictionary, so that the compression ratio would deteriorate further.
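The block-by-block comparison of compression ratios can be sketched as follows. The function name `ratio_fell` and the example bit counts are hypothetical; the ratio is taken as input bits divided by code bits, as defined in step S67, so a smaller value means worse compression.

```python
def ratio_fell(prev_ratio, input_bits, code_bits):
    """Return (current_ratio, reset_flag) for one block of input.

    reset_flag is True when the current ratio is lower than the previous
    one, which is the condition for initializing the dictionary.
    """
    ratio = input_bits / code_bits
    reset = prev_ratio is not None and ratio < prev_ratio
    return ratio, reset


first, reset1 = ratio_fell(None, 800, 400)   # first block: nothing to compare
second, reset2 = ratio_fell(first, 600, 400)  # ratio drops from 2.0 to 1.5
```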

[0021] Arithmetic coding methods include, for example, multi-value arithmetic coding, which is used to encode multiple symbols. Multi-value arithmetic coding maps an input character string to a point on the number line [0, 1) and successively subdivides the [0, 1) interval according to the cumulative occurrence probabilities computed from the occurrence probabilities of the characters that appear in each input string. Practical multi-value arithmetic coding performs its computations in fixed-length registers with a finite number of digits, so encoding can proceed bit by bit. Multi-value arithmetic coding is described in detail in I.H. Witten et al., "Arithmetic Coding for Data Compression", Commun. of ACM, Vol. 30, No. 6, 1987. Here, an interval [x, y) is the interval of values greater than or equal to x and less than y (x is included, y is not).
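The successive subdivision of [0, 1) can be illustrated with exact fractions. This sketch makes simplifying assumptions: the probabilities are fixed rather than adapted from appearance counts, the symbols are ordered alphabetically, and the name `narrow` is hypothetical. Practical coders use fixed-length integer registers instead of fractions.

```python
from fractions import Fraction


def narrow(message, probs):
    """Return the [low, high) interval that the message is mapped to."""
    # Cumulative lower bound of each symbol's slice, in sorted symbol order.
    cum = {}
    total = Fraction(0)
    for sym in sorted(probs):
        cum[sym] = total
        total += probs[sym]
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        low += width * cum[sym]   # move to the start of the symbol's slice
        width *= probs[sym]       # shrink to the size of that slice
    return low, low + width


probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
low, high = narrow("ab", probs)
```

With these probabilities, "a" selects [0, 1/2) and the following "b" selects the second quarter-wide slice of it, giving [1/4, 3/8). Any point in that interval identifies the string "ab".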

[0022] FIG. 7 is a flowchart of the compression algorithm of the multi-value arithmetic coding method. In the figure, the number following S is the step number.

[S71] Initialization is performed: the dictionary D and the variables are initialized. To initialize the dictionary D, every character string consisting of a single distinct character is registered in D, that is,
  D(i) = i (i = 1, 2, ..., A)
where A is the size of the alphabet, usually 256. The variables initialized are the arithmetic one-dimensional array I, the appearance frequency one-dimensional array freq, and the cumulative appearance frequency one-dimensional array cum-freq, that is,
  I(i) = i (i = 1, 2, ..., A)
  freq(i) = 1 (i = 1, 2, ..., A)
  cum-freq(i) = A - i (i = 1, 2, ..., A)

[0023] [S72] Character input is performed: one new character k is read from the input stream.

[S73] It is determined whether a new character was read in step S72. If a character was read (YES), processing proceeds to step S74; if not (NO), the compression processing ends.

[0024] [S74] Multi-value arithmetic coding is performed. The arithmetic value j corresponding to the character k read in step S72 is obtained from the arithmetic one-dimensional array I; that is, j is computed by j = I(k), i = D(j). The computed value j is then encoded and written to the output stream. The number of bits output is [j].

[0025] [S75] Exchange is performed. The maximum appearance frequency is found in the appearance frequency one-dimensional array freq, and for the array index r corresponding to that frequency and the arithmetic value j obtained in step S74, the entries of the arithmetic one-dimensional array I and of the dictionary D are exchanged; that is, the values of I(r) and I(j) are swapped, and the values of D(r) and D(j) are swapped.

[0026] [S76] The cumulative appearance frequency one-dimensional array cum-freq is sorted. First, the entry at the array index r obtained in step S75 is incremented. Then, for each array index smaller than r, the cumulative appearance frequency held at the next larger array index is assigned to the index in question. That is,
  cum-freq(r) = cum-freq(r) + 1
  cum-freq(i) = cum-freq(i+1) (i = r-1, r-2, ..., 1)
Processing then returns to step S72 for the next character.
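An adaptive frequency table of the kind steps S71 to S76 maintain can be sketched as below. This is not a transcription of the flowchart: it follows the commonly used scheme in which the table is kept ordered by frequency and `cum[i]` holds the total count of all positions after i, which agrees with the initial values cum-freq(i) = A - i. The class name `AdaptiveModel` and its attribute names are hypothetical.

```python
class AdaptiveModel:
    """Frequency table kept in non-increasing order of frequency."""

    def __init__(self, size):
        # Positions are 1-based; index 0 of cum holds the grand total.
        self.sym_to_pos = {s: s + 1 for s in range(size)}         # plays the role of I
        self.pos_to_sym = {p: p - 1 for p in range(1, size + 1)}  # plays the role of D
        self.freq = [0] + [1] * size
        self.cum = [size - i for i in range(size + 1)]            # cum[i] = A - i

    def update(self, sym):
        """Count one occurrence of `sym`; return the position it was coded at."""
        j = self.sym_to_pos[sym]
        # Find the first position holding the same frequency as position j.
        r = j
        while r > 1 and self.freq[r - 1] == self.freq[j]:
            r -= 1
        if r != j:
            # Swap the two table entries so the ordering is preserved.
            other = self.pos_to_sym[r]
            self.pos_to_sym[r], self.pos_to_sym[j] = sym, other
            self.sym_to_pos[sym], self.sym_to_pos[other] = r, j
        self.freq[r] += 1
        for i in range(r):  # every position before r now has one more count after it
            self.cum[i] += 1
        return j


model = AdaptiveModel(4)
first_pos = model.update(2)   # symbol 2 starts at position 3
second_pos = model.update(2)  # after the swap it sits at position 1
```

Frequent symbols migrate toward small positions, so the values that are encoded tend to be small for data with a skewed distribution.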

[0027] As another multi-value arithmetic coding method, a method has been published that achieves a high compression ratio by encoding conditional probabilities derived from multiple histories. This method is described in detail in, for example, D.M. Abramson, "An Adaptive Dependency Source Model for Data Compression", Commun. of ACM, Vol. 30, No. 6, 1987, and J.G. Cleary et al., "Data Compression Using Adaptive Coding and Partial String Matching", Commun. of ACM, Vol. 30, No. 6, 1987.

[0028] FIG. 8 is a flowchart of the compression algorithm of the multi-value arithmetic coding method based on multiple histories. The flowchart shows compression by multi-value arithmetic coding based on a single history. In the figure, the number following S is the step number.

[0029] [S81] Initialization is performed: the dictionary D and the variables are initialized. To initialize the dictionary D, every character string consisting of a single distinct character is registered in D, that is,
  D(p, i) = i (p, i = 1, 2, ..., A)
where A is the size of the alphabet, usually 256. The variables initialized are the arithmetic two-dimensional array I, the appearance frequency two-dimensional array freq, the cumulative appearance frequency two-dimensional array cum-freq, and the preceding character p, that is,
  I(p, i) = i (p, i = 1, 2, ..., A)
  freq(p, i) = 1 (p, i = 1, 2, ..., A)
  cum-freq(i) = A - i (i = 1, 2, ..., A)
  p = 1

[0030] [S82] Character input is performed: one new character k is read from the input stream.

[S83] It is determined whether a new character was read in step S82. If a character was read (YES), processing proceeds to step S84; if not (NO), the compression processing ends.

[0031] [S84] Multi-value arithmetic coding is performed. The arithmetic value j corresponding to the character k read in step S82 is obtained from the arithmetic two-dimensional array I; that is, j is computed by j = I(p, k), i = D(p, j). The computed value j is then encoded and written to the output stream. The number of bits output is [j].

[0032] [S85] Exchange is performed. The maximum appearance frequency is found in the appearance frequency two-dimensional array freq, and for the array index r corresponding to that frequency and the arithmetic value j obtained in step S84, the entries of the arithmetic two-dimensional array I and of the dictionary D are exchanged; that is, the values of I(p, r) and I(p, j) are swapped, and the values of D(p, r) and D(p, j) are swapped.

[0033] [S86] The cumulative appearance frequency two-dimensional array cum-freq is sorted. The entry at the array index r obtained in step S85 is incremented. Then, for each array index smaller than r, the cumulative appearance frequency held at the next larger array index is assigned to the index in question. That is,
  cum-freq(r) = cum-freq(r) + 1
  cum-freq(i) = cum-freq(i+1) (i = r-1, r-2, ..., 1)

[0034] [S87] The preceding character is set: the character k newly read this time becomes the new preceding character p. Processing then returns to step S82 for the next character.
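The single-history idea behind steps S81 to S87, keeping one frequency table per preceding character p, can be sketched as follows. The class name `Order1Counts` and its methods are hypothetical, and unlike step S81 the counts here start at zero rather than one, purely to keep the arithmetic of the example easy to follow.

```python
from collections import defaultdict
from fractions import Fraction


class Order1Counts:
    """Counts of each character conditioned on the character before it."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None  # the preceding character p

    def observe(self, ch):
        if self.prev is not None:
            self.counts[self.prev][ch] += 1
        self.prev = ch  # as in S87: the current character becomes p

    def probability(self, prev, ch):
        """Estimated probability of `ch` given that `prev` came just before."""
        row = self.counts[prev]
        total = sum(row.values())
        return Fraction(row[ch], total) if total else Fraction(0)


model = Order1Counts()
for ch in "abacab":
    model.observe(ch)
p_b_after_a = model.probability("a", "b")
p_c_after_a = model.probability("a", "c")
```

In "abacab" the character "a" is followed by "b" twice and by "c" once, so the conditional estimates are 2/3 and 1/3, sharper than the unconditional frequencies of "b" and "c".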

【００３５】[0035]

[Problems to Be Solved by the Invention] Conventional LZW coding is fast because it compresses by matching the input string against strings in the dictionary. However, even strings that are referenced only rarely were registered in the dictionary, so the identification numbers assigned to registered strings grew large, the number of bits needed to encode those numbers grew with them, and compression efficiency fell.

[0036] Moreover, discarding the dictionary meant that the strings accumulated through learning could no longer be used effectively, which actually lowered compression efficiency. To solve this, the applicant, as disclosed in Japanese Patent Application No. 2-275836, attached to each string in the dictionary a flag indicating whether it had been referenced recently. The flag was used to pick out only the recently referenced strings and carry them over into the rebuilt dictionary, so that the strings learned and registered in the dictionary continued to be used.

[0037] However, rebuilding the dictionary took considerable time because each string had to be checked for recent reference. The entire encoding process therefore took a long time to complete.

[0038] Conventional multi-value arithmetic coding, on the other hand, achieves a high compression ratio because it encodes each character based on its occurrence probability, but the encoding requires complex arithmetic processing, which takes time.

[0039] The present invention has been made in view of these points, and its object is to provide a data compression method that limits the character strings registered in the dictionary and shortens the time required for encoding.

【００４０】[0040]

[Means for Solving the Problems] FIG. 1 illustrates the principle of the present invention. Appearance frequency calculation means 1 calculates appearance frequencies from the number of occurrences of the characters making up the input character strings. Occurrence probability estimation means 2 estimates, from these appearance frequencies, the occurrence probability of a newly input character string. Dictionary registration means 3 assigns an identification number to the newly input string and registers it in a dictionary 6 when this occurrence probability is equal to or greater than a predetermined first reference probability value. Character string search means 4 searches the dictionary 6 for a matching string that matches the newly input string. Encoding means 5 encodes and outputs the identification number assigned to the matching string.

[0041] The appearance frequency calculation means 1 stores every character making up the input strings together with the number of times it has appeared, and calculates the appearance frequencies from those counts. Alternatively, all character strings are input in advance and every character making them up is stored together with its count; each character of a newly input string is then added to the counts, and the appearance frequencies are calculated from the counts.
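The two ways of gathering counts, purely adaptive or preloaded from a prior pass over the data, can be sketched with one small table. The class name `FrequencyTable` and its `preload` parameter are hypothetical, and the appearance frequency is assumed here to be the count of a character divided by the total count.

```python
from collections import Counter
from fractions import Fraction


class FrequencyTable:
    """Character counts, optionally seeded from text seen in advance."""

    def __init__(self, preload=""):
        self.counts = Counter(preload)

    def add(self, string):
        # Add each character of a newly input string to the counts.
        self.counts.update(string)

    def frequency(self, ch):
        total = sum(self.counts.values())
        return Fraction(self.counts[ch], total) if total else Fraction(0)


adaptive = FrequencyTable()          # starts empty and learns from the input
adaptive.add("aab")
freq_adaptive = adaptive.frequency("a")

preloaded = FrequencyTable("abcd")   # counts gathered in advance
preloaded.add("aa")
freq_preloaded = preloaded.frequency("a")
```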

[0042] The occurrence probability estimation means 2 estimates a conditional probability, namely the occurrence probability of a character string that begins from specific data. The specific data is, in particular, the last character of the input string that was input immediately before the newly input string.

[0043] The character string search means 4 searches the strings registered in the dictionary 6 for a matching string that matches the newly input string. Alternatively, it divides the newly input string into mutually distinct partial strings, searches the strings registered in the dictionary 6 for matching strings that match a partial string, and selects the longest of the matching strings. In particular, the matching string is a string whose occurrence probability is equal to or greater than a predetermined second reference probability value.

[0044] The dictionary registration means 3 is further provided with means for rebuilding the dictionary 6 when the compression ratio achieved by encoding the newly input strings deteriorates.

【００４５】[0045]

[Operation] The appearance frequency calculation means 1 calculates appearance frequencies from the number of occurrences of the characters making up the input character strings. The occurrence probability estimation means 2 estimates, from the appearance frequencies, the occurrence probability of a newly input character string. The dictionary registration means 3 assigns an identification number to the newly input string and registers it in the dictionary 6 when this occurrence probability is equal to or greater than the predetermined first reference probability value. The character string search means 4 searches the dictionary 6 for a matching string that matches the newly input string. The encoding means 5 encodes and outputs the identification number assigned to the matching string. Because registration in the dictionary 6 is thereby limited, growth of the identification numbers is also limited, and coding efficiency can be improved.
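The effect of gating registration on an estimated occurrence probability can be sketched as below. The text above does not state how the probability of a string is derived from character frequencies, so this example assumes, purely for illustration, that it is the product of the relative frequencies of its characters. The names `estimate_probability`, `maybe_register`, and the threshold value are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def estimate_probability(string, counts):
    """Assumed estimate: product of the relative frequencies of the characters."""
    total = sum(counts.values())
    p = Fraction(1)
    for ch in string:
        p *= Fraction(counts[ch], total)
    return p


def maybe_register(string, counts, dictionary, threshold):
    """Register `string` only if its estimated probability reaches the threshold."""
    if string in dictionary:
        return False
    if estimate_probability(string, counts) >= threshold:
        dictionary[string] = len(dictionary) + 1  # next identification number
        return True
    return False


counts = Counter("aaaabbbc")   # a: 4, b: 3, c: 1 out of 8 characters
dictionary = {}
threshold = Fraction(1, 8)     # stands in for the first reference probability value
added_ab = maybe_register("ab", counts, dictionary, threshold)  # 3/16, registered
added_bc = maybe_register("bc", counts, dictionary, threshold)  # 3/64, rejected
```

Only "ab" receives an identification number, so the dictionary and the numbers to be encoded stay smaller than if every new string were registered.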

[0046] The appearance frequency calculation means 1 stores every character making up the input strings together with the number of times it has appeared, and calculates the appearance frequencies from those counts. Alternatively, all character strings are input in advance and every character making them up is stored together with its count; each character of a newly input string is then added to the counts, and the appearance frequencies are calculated from the counts. As a result, only strings with a high appearance frequency are registered in the dictionary 6, which keeps the dictionary 6 from growing excessively large.

[0047] Further, the occurrence probability estimation means 2 estimates a conditional probability, that is, the occurrence probability of a character string starting from specific data. In particular, when the specific data is the last character of the character string input immediately before the newly input character string, character strings whose characters are strongly linked are registered in the dictionary 6, so that codes suited to the input character strings (data) can be output.

[0048] The character string search means 4 searches the character strings registered in the dictionary 6 for a matching character string that matches the newly input character string. Alternatively, it divides the newly input character string into distinct partial character strings, searches the character strings registered in the dictionary 6 for matching character strings that match the partial character strings, and selects the longest of these matching character strings. In particular, the matching character string is a character string whose occurrence probability is equal to or higher than a predetermined second reference probability value. As a result, optimal encoding can be performed for given character strings, and the compression ratio can be increased.

[0049] Further, when the compression ratio obtained by encoding the newly input character strings deteriorates, the dictionary registration means 3 reconstructs the dictionary 6, which prevents the compression ratio from deteriorating further.

[0050]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS An embodiment of the present invention is described below with reference to the drawings. FIG. 2 is a flowchart showing an embodiment of the present invention. This encoding procedure is adapted to a so-called zero-order Markov model, in which the appearance frequencies do not take the history of previously input characters into account. In the figure, the number following S denotes a step number.

[0051] [S21] Initialization is performed. Specifically, as variable initialization, the one-dimensional appearance frequency array freq, which counts how often each character appears, is initialized:

$$\mathrm{freq}(i) = 1 \quad (i = 1, 2, \ldots, A)$$

Here, A denotes the size of the alphabet and is usually 256. The variable n, which indicates the number of dictionary entries, is initialized to the number of character types registered when the dictionary D is initialized, that is, the alphabet size A. In addition, the cursor is positioned at the beginning of the character string to be newly input.
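As a minimal sketch of step S21 (an illustration, not part of the patent text), the initial state could be set up as follows in Python. The function name `init_state`, the dict-based dictionary, and the use of byte values 0 to A-1 as identification numbers are assumptions made here; the patent itself indexes characters from 1 to A.

```python
A = 256  # alphabet size; the text states that this is usually 256


def init_state(alphabet_size=A):
    """Return (freq, dictionary, n, cursor) as they stand after step S21."""
    # freq[i] counts occurrences of character i; every count starts at 1
    freq = [1] * alphabet_size
    # Assumption: the dictionary D starts with one entry per single character,
    # keyed by the character and mapped to its identification number
    dictionary = {bytes([c]): c for c in range(alphabet_size)}
    # n is the number of dictionary entries, initially the alphabet size
    n = len(dictionary)
    # the cursor points at the start of the input
    cursor = 0
    return freq, dictionary, n, cursor
```

Starting every count at 1 means that no character has an estimated probability of zero before any input has been seen.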

[0052] [S22] The dictionary is constructed. First, while character strings are newly input from the information source, the total number of characters T and the one-dimensional appearance probability array p are obtained:

$$T = \sum_{i=1}^{A} \mathrm{freq}(i)$$

$$p(i) = \mathrm{freq}(i) / T \quad (i = 1, 2, \ldots, A)$$

Then, to construct the dictionary D, every character string that satisfies

$$p(1)\,p(2)\,p(3) \cdots p(A)\,T \ge C$$

is registered in the dictionary D together with an identification number. Here, the constant C is a predetermined value that prevents character strings from being registered in the dictionary D without limit. The cursor is then positioned again at the beginning of the character string to be input.
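The construction in step S22 can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: the product in the registration test is read as running over the characters of the candidate string, the alphabet is limited to the characters of a sample so that the example stays small, and `build_dictionary` and `max_len` are hypothetical names introduced here.

```python
def build_dictionary(sample, C, max_len=8):
    """Register every string whose expected count p(c1)...p(cn) * T is at least C."""
    alphabet = sorted(set(sample))
    # Zero-order counts, each starting at 1 as in step S21
    freq = {ch: 1 for ch in alphabet}
    for ch in sample:
        freq[ch] += 1
    T = sum(freq.values())
    p = {ch: freq[ch] / T for ch in alphabet}

    # Assumption: single characters are always present, as after initialization
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}

    def extend(prefix, prob):
        # The product can only shrink as the string grows, so a prefix that
        # fails the test cannot have a qualifying extension and is pruned.
        if len(prefix) >= max_len:
            return
        for ch in alphabet:
            q = prob * p[ch]
            if q * T >= C:
                s = prefix + ch
                dictionary[s] = len(dictionary) + 1
                extend(s, q)

    for ch in alphabet:
        extend(ch, p[ch])
    return dictionary, p, T
```

For the sample "aaaaaaab" and C = 5, the counts are a = 8 and b = 2, so T = 10, and only "aa" and "aaa" are registered in addition to the single characters. A larger C therefore gives a smaller dictionary.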

[0053] [S23] Character string input is checked. That is, it is determined whether a new character string is input from the input stream. If a character string has been input (YES), the process proceeds to step S24; if not (NO), the compression process ends.

[0054] [S24] Character string search is performed. Specifically, the dictionary D is searched for character strings that match the character string starting at the cursor position. Only character strings that satisfy

$$p(1)\,p(2)\,p(3) \cdots p(A)\,T \ge C + \alpha$$

are searched for in the dictionary D. Here, the constant α is a predetermined value that reserves room for newly building the dictionary D. Among the character strings found, the one with the most characters is taken as the maximum matching character string S. Conversely, if no character string satisfying the above expression is found, a new identification number is assigned to this character string.
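The search in step S24 amounts to a longest-match lookup with an additional probability filter. The following Python sketch assumes a dictionary that is prefix-closed (every prefix of an entry is also an entry) and the zero-order probabilities p from step S22; the function name `longest_match` and its parameters are hypothetical.

```python
def longest_match(text, cursor, dictionary, p, T, C, alpha):
    """Return the longest dictionary entry starting at cursor whose
    score p(c1)...p(cn) * T is at least C + alpha, or None if there is none."""
    best = None
    prob = 1.0
    for end in range(cursor + 1, len(text) + 1):
        candidate = text[cursor:end]
        if candidate not in dictionary:
            break  # relies on the prefix-closed assumption stated above
        prob *= p.get(text[end - 1], 0.0)
        if prob * T >= C + alpha:
            best = candidate
    return best
```

A return value of None corresponds to the case in which no qualifying character string is found and a new identification number has to be assigned.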

[0055] [S25] The code is output. That is, the identification number assigned to the maximum matching character string S, or to the character string that was not found in step S24, is encoded in [n] bits and output.
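The notation [n] is not spelled out at this point in the text. One plausible reading, assumed here and not confirmed by the source, is the smallest whole number of bits that can distinguish n identification numbers. Under that assumption, and with identification numbers taken as 0 to n-1, a fixed-width code could be produced as in this Python sketch (`code_width` and `encode_id` are hypothetical names):

```python
def code_width(n):
    """Smallest number of bits that can distinguish n identification numbers."""
    return max(1, (n - 1).bit_length())


def encode_id(ident, n):
    """Fixed-width binary string for ident, where 0 <= ident < n."""
    return format(ident, "0{}b".format(code_width(n)))
```

With n = 256 every code is 8 bits long, and the 257th entry pushes the width to 9 bits, which illustrates why limiting the number of dictionary entries keeps the codes short.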

[0056] [S26] The appearance frequency is incremented. That is, the entries of the one-dimensional appearance frequency array freq that correspond to the character string encoded in step S25 are incremented. The total number of characters T and the one-dimensional appearance probability array p are then recalculated:

$$T = \sum_{i=1}^{A} \mathrm{freq}(i)$$

$$p(i) = \mathrm{freq}(i) / T \quad (i = 1, 2, \ldots, A)$$

[0057] [S27] Dictionary registration is performed. That is, the leading character k stored in the previous execution of step S28 is appended to the character string that was not found in step S24 to form a character string for registration. This character string is registered in the dictionary D together with an identification number. Whenever a dictionary registration is performed, the variable n is incremented.

[0058] [S28] The cursor position is set. Specifically, the first character of the character string encoded in step S25 is stored as the leading character k. The cursor is then moved to the character following this character string so that the next input character string can be processed.

[0059] [S29] Deterioration of the compression ratio is checked. That is, for each predetermined amount of input character strings, for example every several hundred kilobytes, the compression ratio

$$\text{compression ratio} = \frac{\text{total number of bits of the predetermined input character strings}}{\text{total number of bits of the codes}}$$

is calculated to determine whether it has fallen. If the compression ratio has deteriorated (fallen) (YES), the process returns to step S22; if it has not (NO), the process returns to step S23. When the process returns to step S22, the dictionary D is reconstructed.
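The check in step S29 can be sketched as follows in Python. The text does not state what the current ratio is compared against, so this sketch assumes a comparison with the ratio of the previous block; `should_rebuild` and its parameters are hypothetical names.

```python
def compression_ratio(input_bits, code_bits):
    """Ratio as defined in S29: input bits divided by code bits (higher is better)."""
    return input_bits / code_bits


def should_rebuild(prev_ratio, input_bits, code_bits):
    """Return (rebuild, ratio) for one block of input.

    Assumption: the ratio has deteriorated when it is lower than the
    ratio measured for the previous block.
    """
    ratio = compression_ratio(input_bits, code_bits)
    rebuild = prev_ratio is not None and ratio < prev_ratio
    return rebuild, ratio
```

A caller would keep the returned ratio for the next block and go back to dictionary construction (step S22) whenever the flag is true.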

[0060] FIG. 3 is a diagram showing an example of the tree structure of a dictionary based on multiple histories. This tree structure is the internal structure of the dictionary used in encoding adapted to a so-called first-order Markov model, in which the appearance frequencies take into account the one character input immediately before. In the figure, the circled numbers are identification numbers, and each location marked with a circled number is called a node.

[0061] In the figure, the dictionary 30 consists of a partial dictionary 31 for the immediately preceding character 31a, a partial dictionary 32 for the immediately preceding character 32a, and a partial dictionary 33 for the immediately preceding character 33a. In practice, however, it consists of partial dictionaries for the 256 immediately preceding characters that can be represented by 8-bit data. Each partial dictionary, such as the partial dictionaries 31, 32 and 33, has a structure similar to the tree structure shown in FIG. 5.

[0062] Dictionary registration and search using this dictionary 30 proceed as follows. First, one of the partial dictionaries is selected according to the immediately preceding character. The character string to be registered or searched for is then registered in, or searched for in, the selected partial dictionary. For example, to search for the character string "bab" when the immediately preceding character is "a", the partial dictionary 31 is selected first because the immediately preceding character is "a". Within this partial dictionary 31, the character string "bab" is then found by following the nodes with circled numbers 2, 7 and 12.
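The two-stage lookup described above (select a partial dictionary by the immediately preceding character, then walk its tree) can be sketched in Python as below. Only the path 2, 7, 12 for "bab" comes from the text; the `Node` class, the helper names, and the rest of the tree layout are assumptions for illustration.

```python
class Node:
    """One tree node; ident is the identification number (the circled number)."""

    def __init__(self, ident=None):
        self.ident = ident
        self.children = {}


def insert(root, string, idents):
    """Add string below root, labelling the nodes with the given numbers."""
    node = root
    for ch, ident in zip(string, idents):
        node = node.children.setdefault(ch, Node(ident))
    return node


def lookup(dictionary, prev_char, string):
    """Return the identification numbers visited while matching string
    in the partial dictionary selected by prev_char."""
    node = dictionary.get(prev_char)
    path = []
    if node is None:
        return path
    for ch in string:
        node = node.children.get(ch)
        if node is None:
            break
        path.append(node.ident)
    return path
```

With a partial dictionary for the preceding character "a" that contains the path b, a, b labelled 2, 7, 12, the call `lookup(dictionary, "a", "bab")` visits exactly those three nodes.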

[0063] Next, the compression algorithm that uses this dictionary 30 is described. FIG. 4 is a flowchart showing another embodiment of the present invention. This encoding procedure is adapted to a so-called first-order Markov model, in which the appearance frequencies take into account the one character input immediately before. In the figure, the number following S denotes a step number.

[0064] [S40] Initialization is performed. Specifically, as variable initialization, the two-dimensional appearance frequency array freq, which counts how often character i appears after character j, is initialized first:

$$\mathrm{freq}(i, j) = 1 \quad (i, j = 1, 2, \ldots, A)$$

Here, A denotes the size of the alphabet and is usually 256. The variable n, which indicates the number of dictionary entries, is initialized to the number of character types registered when the dictionary D is initialized, that is, the alphabet size A. In addition, the cursor is positioned at the beginning of the character string to be newly input.

[0065] [S41] The dictionary is constructed. First, while character strings are newly input from the information source, the total number of characters T, the one-dimensional history appearance probability array p(i|j), and the one-dimensional specific-character appearance probability array p(k) are obtained:

$$T = \sum_{i=1}^{A} \sum_{j=1}^{A} \mathrm{freq}(i, j)$$

$$p(i \mid j) = \frac{\mathrm{freq}(i, j)}{p(j)\,T} \quad (i, j = 1, 2, \ldots, A)$$

$$p(k) = \sum_{j=1}^{A} p(k \mid j) \quad (k = 1, 2, \ldots, A)$$

Then, to construct the dictionary D, every character string that satisfies

$$p(k)\,p(1 \mid k)\,p(2 \mid 1)\,p(3 \mid 2) \cdots p(n \mid n-1)\,T \ge C$$

is registered in the dictionary D together with an identification number. Here, the constant C is a predetermined value that prevents character strings from being registered in the dictionary D without limit. The cursor is then positioned again at the beginning of the character string to be input. Here, p(i|j) denotes the conditional probability that character i appears after a specific character j has appeared.
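A first-order version of the statistics in step S41 can be sketched in Python as follows. The estimators are assumptions chosen so that the probabilities are properly normalized: p(j)·T is taken to be the number of counted pairs whose first character is j, and p(k) is taken to be the share of counted pairs whose second character is k. The function names are hypothetical, and the alphabet is limited to the characters of a sample.

```python
def first_order_stats(sample):
    """Return (freq, T, cond, marginal) estimated from adjacent character pairs."""
    alphabet = sorted(set(sample))
    # freq[j][i]: how often character i follows character j; counts start at 1
    freq = {j: {i: 1 for i in alphabet} for j in alphabet}
    for j, i in zip(sample, sample[1:]):
        freq[j][i] += 1
    T = sum(sum(row.values()) for row in freq.values())
    # cond[j][i] plays the role of p(i|j)
    cond = {j: {i: freq[j][i] / sum(freq[j].values()) for i in alphabet}
            for j in alphabet}
    # marginal[k] plays the role of p(k) (an assumption, see the lead-in)
    marginal = {k: sum(freq[j][k] for j in alphabet) / T for k in alphabet}
    return freq, T, cond, marginal


def string_score(k, string, cond, marginal, T):
    """Left-hand side of the registration test:
    p(k) p(c1|k) p(c2|c1) ... p(cn|cn-1) T."""
    score = marginal[k] * T
    prev = k
    for ch in string:
        score *= cond[prev][ch]
        prev = ch
    return score
```

For the sample "abab", the pair counts give p(b|a) = 3/4 and p(a|b) = 2/3, and the string "ba" preceded by "a" scores 1.5, so it would be registered only if C is at most 1.5.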

[0066] [S42] Character string input is checked. That is, it is determined whether a new character string is input from the input stream. If a character string has been input (YES), the process proceeds to step S43; if not (NO), the compression process ends.

[0067] [S43] Character string search is performed. Specifically, the dictionary D is searched for character strings that match the character string starting at the cursor position. Only character strings that satisfy

$$p(k)\,p(1 \mid k)\,p(2 \mid 1)\,p(3 \mid 2) \cdots p(n \mid n-1)\,T \ge C + \alpha$$

are searched for in the dictionary D. Here, the constant α is a predetermined value that reserves room for newly building the dictionary D. Among the character strings found, the one with the most characters is taken as the maximum matching character string S. Conversely, if no character string satisfying the above expression is found, a new identification number is assigned to this character string.

[0068] [S44] The code is output. That is, the identification number assigned to the maximum matching character string S, or to the character string that was not found in step S43, is encoded in [n] bits and output.

[0069] [S45] The appearance frequency is incremented. That is, for the character string encoded in step S44, the entries of the two-dimensional appearance frequency array freq that correspond to the character string including the immediately preceding character r are incremented. The one-dimensional history appearance probability array p(i|j) and the one-dimensional specific-character appearance probability array p(k) are then recalculated:

$$T = \sum_{i=1}^{A} \sum_{j=1}^{A} \mathrm{freq}(i, j)$$

$$p(i \mid j) = \frac{\mathrm{freq}(i, j)}{p(j)\,T} \quad (i, j = 1, 2, \ldots, A)$$

$$p(k) = \sum_{j=1}^{A} p(k \mid j) \quad (k = 1, 2, \ldots, A)$$

[0070] [S46] Dictionary registration is performed. That is, the leading character k stored in the previous execution of step S47 is appended to the character string that was not found in step S43 to form a character string for registration. This character string is registered in the dictionary D together with an identification number. Whenever a dictionary registration is performed, the variable n is incremented.

[0071] [S47] The immediately preceding character is set. Specifically, the last character of the character string encoded in step S44 is stored as the immediately preceding character r.

[S48] The cursor position is set. Specifically, the first character of the character string encoded in step S44 is stored as the leading character k. The cursor is then moved to the character following this character string so that the next input character string can be processed.

[0072] [S49] Deterioration of the compression ratio is checked. That is, for each predetermined amount of input character strings, for example every several hundred kilobytes, the compression ratio

$$\text{compression ratio} = \frac{\text{total number of bits of the predetermined input character strings}}{\text{total number of bits of the codes}}$$

is calculated to determine whether it has fallen. If the compression ratio has deteriorated (fallen) (YES), the process returns to step S41; if it has not (NO), the process returns to step S42.

[0073] In the other embodiment described above, the dictionary was constructed (step S41) without taking the immediately preceding character r into account. Alternatively, taking the immediately preceding character r into account, every character string that satisfies

$$p(1 \mid r)\,p(2 \mid 1)\,p(3 \mid 2) \cdots p(n \mid n-1)\,T \ge C_r$$

may be registered in a dictionary D_r together with an identification number, and the character string search in the dictionary D_r (step S43) may be restricted to character strings that satisfy

$$p(1 \mid r)\,p(2 \mid 1)\,p(3 \mid 2) \cdots p(n \mid n-1)\,T \ge C_r + \alpha$$

Here, the constant C_r is a predetermined value that prevents character strings from being registered in the dictionary D_r without limit. Because only more appropriate character strings are then selected for registration, growth in the number of identification numbers can also be restrained. In addition, the dictionary search time can be shortened further and the compression ratio can be raised further.
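The variant with one dictionary per immediately preceding character can be sketched in Python as follows. The conditional probabilities are passed in as a nested mapping `cond[j][i]` standing for p(i|j); the function name, the `max_len` cut-off, and the per-dictionary numbering are assumptions for illustration only.

```python
def build_context_dictionaries(cond, T, C_r, max_len=6):
    """For each preceding character r, register every string c1..cn with
    p(c1|r) p(c2|c1) ... p(cn|cn-1) * T >= C_r[r]."""
    dictionaries = {}
    for r in cond:
        entries = {}

        def extend(prefix, prev, prob):
            # A failing prefix is pruned because the product only shrinks
            if len(prefix) >= max_len:
                return
            for ch, q in cond[prev].items():
                score = prob * q
                if score * T >= C_r[r]:
                    s = prefix + ch
                    entries[s] = len(entries) + 1  # numbering local to D_r
                    extend(s, ch, score)

        extend("", r, 1.0)
        dictionaries[r] = entries
    return dictionaries
```

Because each D_r only holds strings that are likely after r, each partial dictionary stays small, which is consistent with the shorter search time claimed in the text.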

[0074] In the embodiments described above, the appearance frequency freq is initialized to 1 in the initialization step. Alternatively, it may be initialized to predetermined values estimated from statistics, according to the nature of the input character strings (data), for example character data or image data.

[0075] Also, when an identification number is encoded, it is output as a code consisting of [identification number] bits. Alternatively, as disclosed by the present applicant in Japanese Patent Application No. 3-130623, it may be output using bit-fraction compensation, Phasing in Binary Codes, or a multi-value arithmetic code.
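The cited application is not reproduced here, so the exact construction it uses is unknown. As an illustration only, the following Python sketch implements what the text-compression literature commonly calls a phased-in (truncated) binary code, which assigns codewords of k or k+1 bits to n symbols when n is not a power of two. The function name and the 0-based symbol numbering are assumptions.

```python
def phased_in_code(x, n):
    """Phased-in binary codeword for symbol x, where 0 <= x < n."""
    k = n.bit_length() - 1     # floor(log2(n))
    u = (1 << (k + 1)) - n     # number of symbols that get the short, k-bit code
    if x < u:
        # short codeword; empty when n == 1 because no bits are needed
        return format(x, "0{}b".format(k)) if k > 0 else ""
    # long codeword of k + 1 bits, shifted so that no short codeword is its prefix
    return format(x + u, "0{}b".format(k + 1))
```

For n = 5 the codewords are 00, 01, 10, 110 and 111, so three of the five symbols use 2 bits instead of the 3 bits a fixed-width code would need.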

[0076] Further, the dictionary is reconstructed upon detecting deterioration (a fall) in the compression ratio, but reconstruction may instead be triggered by detecting deterioration in a count value of the character appearance frequencies. One such count value is, for example, the sum of the appearance frequencies of all characters.

[0077] The embodiments described above can be applied to the compression of character codes, vector information, image data and the like on workstations and similar machines, and can greatly reduce the required storage capacity.

[0078] They can also be applied to data transmission and reception over communication lines, where they shorten communication time. For example, they can be applied to communication equipment such as modems and facsimile machines.

[0079]

[Effects of the Invention] As described above, in the present invention, when the occurrence probability of a newly input character string, calculated from the appearance frequencies of the characters that make it up, is equal to or higher than a predetermined reference probability value, the newly input character string is assigned an identification number and registered in the dictionary. The character strings registered in the dictionary are then searched for a matching character string that matches another newly input character string, and the identification number assigned to the matching character string found is encoded and output. Registration in the dictionary is therefore restricted. Consequently, growth in the number of identification numbers is also restrained, and encoding efficiency can be improved.

[0080] In addition, because the number of character strings registered in the dictionary is restricted, the search becomes faster and the encoding process as a whole becomes faster.

[Brief description of drawings]

FIG. 1 is a diagram illustrating the principle of the present invention.

FIG. 2 is a flowchart showing an embodiment of the present invention.

FIG. 3 is a diagram showing an example of the tree structure of a dictionary based on multiple histories.

FIG. 4 is a flowchart showing another embodiment of the present invention.

FIG. 5 is a diagram showing an example of the tree structure of a dictionary.

FIG. 6 is a flowchart showing the compression algorithm of the LZW encoding method.

FIG. 7 is a flowchart showing the compression algorithm of the multi-value arithmetic coding method.

FIG. 8 is a flowchart showing the compression algorithm of the multi-value arithmetic coding method based on multiple histories.

[Explanation of symbols]

1 First appearance frequency calculation means
2 Occurrence probability estimation means
3 Dictionary registration means
4 First character string search means
5 Encoding means
6 Dictionary

Continuation of the front page
(72) Inventor: Hirotaka Chiba, c/o Fujitsu Limited, 1015 Kamiodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa
(56) References: JP-A-4-213221 (JP, A); JP-A-3-270417 (JP, A); JP-A-3-17731 (JP, A); JP-A-63-209229 (JP, A); JP-A-59-231683 (JP, A)
(58) Fields searched (Int. Cl.7, DB name): G06F 5/00; G06F 17/22 520; H03M 7/30

Claims

(57) [Claims]

1. A data compression method in which character strings input from an information source are compressed by encoding and output, comprising: appearance frequency calculation means (1) for calculating appearance frequencies based on the number of occurrences of the characters constituting the input character strings; occurrence probability estimation means (2) for estimating, based on the appearance frequencies, the occurrence probability of a newly input character string; dictionary registration means (3) for assigning an identification number to the newly input character string and registering it in a dictionary (6) when the occurrence probability is equal to or higher than a predetermined first reference probability value; character string search means (4) for searching the dictionary (6) for a matching character string that matches the newly input character string; and encoding means (5) for encoding and outputting the identification number assigned to the matching character string.

2. The data compression method according to claim 1, wherein the appearance frequency calculation means (1) is configured to store every character constituting the input character strings together with the number of occurrences of that character, and to calculate the appearance frequencies based on the numbers of occurrences.

3. The data compression method according to claim 1, wherein the appearance frequency calculation means (1) is configured to input all character strings in advance, to store every character constituting all the character strings together with the number of occurrences of that character, to add to the numbers of occurrences for each character constituting a newly input character string, and to calculate the appearance frequencies based on the numbers of occurrences.

4. The data compression method according to claim 2 or 3, wherein the occurrence probability estimation means (2) is configured to estimate a conditional probability that is the occurrence probability of a character string starting from specific data.

5. The data compression method according to claim 4, wherein the specific data is the last character of the character string input immediately before the newly input character string.

6. The data compression method according to any one of claims 1 to 5, wherein the character string search means (4) is configured to search the character strings registered in the dictionary (6) for a matching character string that matches the newly input character string.

7. The data compression method according to any one of claims 1 to 5, wherein the character string search means (4) is configured to divide the newly input character string into distinct partial character strings, to search the character strings registered in the dictionary (6) for matching character strings that match the partial character strings, and to select the matching character string with the longest character string length.

8. The data compression method according to claim 6, wherein the matching character string is a character string whose occurrence probability is equal to or higher than a predetermined second reference probability value.

9. The data compression method according to claim 1, wherein the dictionary registration means (3) is configured to reconstruct the dictionary (6) when the compression ratio obtained by encoding the newly input character strings deteriorates.