JPH0969785A

JPH0969785A - Method and device for data compression

Info

Publication number: JPH0969785A
Application number: JP22215595A
Authority: JP
Inventors: Atsuko Toda (戸田 亜津子)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1995-08-30
Filing date: 1995-08-30
Publication date: 1997-03-11

Abstract

PROBLEM TO BE SOLVED: To achieve effective data compression when Japanese-language text data is the compression target. SOLUTION: A correspondence list that associates the code data of specific characters with their compressed code data is prepared, and the list is searched for the character code data of one character taken from the compression target data (S11 and S12). If the character code data is found in the list (S13), its compressed code data is read from the correspondence list and output (S14 and S15); if it is not found (S13), the data is output as is (S16 and S17).

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a data compression method and a data compression device used in data processing devices, such as computers, that handle digital data.

[0002]

[Prior Art] Conventionally, data processing devices such as computers that handle digital data have included devices that compress digital data by exploiting differences in the frequency of occurrence of data in 1-byte units, as in Huffman coding.

[0003] Various other methods exist for compressing data handled by computers, but the basic methods were devised outside Japan and do not exploit the characteristics of the Japanese language.

[0004]

[Problem to Be Solved by the Invention] Japanese text contains many characters such as hiragana, katakana, digits, alphabetic characters, and punctuation marks. When data compression is considered for such characters, the methods described above compress data solely on the basis of differences in frequency of occurrence, and therefore cannot compress the data effectively.

[0005] Moreover, because the minimum unit of the data to be compressed is 1 byte, these methods are not well suited to Japanese text data, which is based on 2-byte characters. The present invention has been made in view of these points, and its object is to provide a data compression method and a data compression device that can compress data effectively when Japanese text data is the compression target.

[0006]

[Means for Solving the Problem] In the present invention, a correspondence list that associates the code data of specific characters with their compressed code data is prepared in advance. When the character code data of the compression target data is present in the correspondence list, the compressed code data corresponding to that character code data is output; when it is not present in the correspondence list, the character code data is output as is.

[0007] With this configuration, if the character code data and compressed code data of characters that occur frequently in Japanese text data, such as hiragana, katakana, digits, alphabetic characters, and punctuation marks, are registered in the correspondence list, data compression that takes advantage of the characteristics of Japanese can be performed.

[0008]

[Embodiment of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a data compression device according to an embodiment of the present invention. The device is a system that compresses digitized Japanese text data so that the amount of output data is smaller than the amount of input data, and it consists mainly of a CPU 11, a memory 12, and an auxiliary storage device 13.

[0009] The CPU 11 controls the entire device and, here, executes the data compression processing. The memory 12 consists of, for example, ROM or RAM, and here stores a correspondence list 12a. The correspondence list 12a associates character code data with compressed code data for characters that occur frequently in Japanese text data, such as hiragana, katakana, digits, alphabetic characters, and punctuation marks.

[0010] The auxiliary storage device 13 consists of, for example, a floppy disk drive (FDD) or a hard disk drive (HDD), and here stores the compression target data (the data before compression) and the data after compression.

[0011] FIG. 2 shows the contents of the correspondence list 12a in this embodiment. The correspondence list 12a stores the code data of characters that occur frequently in Japanese text data, such as hiragana, katakana, digits, alphabetic characters, and punctuation marks, together with the corresponding compressed code data. In this example, the character code data is represented in 16 bits and the compressed code in 8 bits (written in hexadecimal).
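As a rough illustration only, the correspondence list 12a can be pictured as a lookup table from 16-bit character codes to 8-bit compressed codes. The Python sketch below is not taken from the patent: the name `CORRESPONDENCE_LIST`, the helper `lookup`, and the use of JIS X 0208 code values as keys are assumptions. Apart from the value 92 (hexadecimal) that the description gives for "ア" in paragraph [0018], the entries are made-up placeholders, because the contents of FIG. 2 are not reproduced in this text.

```python
# Sketch of correspondence list 12a: 16-bit character code -> 8-bit compressed code.
# ASSUMPTION: keys are JIS X 0208 code values; the patent text does not name the code set.
# ASSUMPTION: only the pairing "ア" -> 0x92 comes from the description; the other
# compressed codes are hypothetical placeholders.
CORRESPONDENCE_LIST = {
    0x2522: 0x92,  # katakana "ア" (compressed code 92 hex, per paragraph [0018])
    0x2422: 0x01,  # hiragana "あ" (hypothetical compressed code)
    0x2123: 0x02,  # full stop "。" (hypothetical compressed code)
}


def lookup(char_code):
    """Return the 8-bit compressed code for a 16-bit character code, or None if unlisted."""
    return CORRESPONDENCE_LIST.get(char_code)


print(lookup(0x2522))  # listed character
print(lookup(0x3021))  # kanji "亜", not listed
```

A hash-based dictionary is only one possible layout; a ROM implementation might equally use a sorted array or a direct-indexed table.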

[0012] Next, the operation of this embodiment is described. FIG. 3 is a flowchart showing the data compression processing in this embodiment. The CPU 11 first takes the character code data for one character (2 bytes) out of the compression target data held in the auxiliary storage device 13 or the like (step S11).

[0013] Next, the CPU 11 accesses the correspondence list 12a stored in the memory 12 and searches it to determine whether the character code data that was taken out is present in the correspondence list 12a (step S12).

[0014] If the character code data is present in the correspondence list 12a (Yes in step S13), its compressed code data is output. In doing so, the CPU 11 first outputs the 1-bit information "1", which indicates compressed code data (step S14), and then reads the compressed code data corresponding to the character code data from the correspondence list 12a and outputs it (step S15).

[0015] If, on the other hand, the character code data is not present in the correspondence list 12a (No in step S13), it is output as is, without compression. In doing so, the CPU 11 first outputs the 1-bit information "0", which indicates that the data is not compressed code data (step S16), and then outputs the character code data as is (step S17).

[0016] Thereafter, character code data is taken out of the compression target data one character at a time in the same way, checked against the correspondence list 12a, and compressed where applicable before being output (steps S18 and S19). In this way, character code data that is present in the correspondence list 12a can be output in compressed form.
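The flow of steps S11 to S19 can be sketched as a short loop. This is a minimal Python sketch under stated assumptions, not the patent's implementation: the function name `compress`, the representation of the input as a list of 16-bit integers, and the representation of the output as a string of "0" and "1" characters are all choices made for illustration.

```python
def compress(char_codes, correspondence_list):
    """Compress 16-bit character codes into a bit string, one flag bit per character.

    ASSUMPTION: `char_codes` is a list of 16-bit integers and `correspondence_list`
    is a dict mapping 16-bit character codes to 8-bit compressed codes.
    """
    out = []
    for code in char_codes:                           # S11 / S18: take one character
        compressed = correspondence_list.get(code)    # S12: search the list
        if compressed is not None:                    # S13: Yes
            out.append("1")                           # S14: flag "compressed"
            out.append(format(compressed, "08b"))     # S15: 8-bit compressed code
        else:                                         # S13: No
            out.append("0")                           # S16: flag "not compressed"
            out.append(format(code, "016b"))          # S17: 16-bit code as is
    return "".join(out)                               # S19: no more characters


# Usage with a one-entry table; the key 0x2522 for "ア" assumes JIS X 0208 codes.
table = {0x2522: 0x92}
print(compress([0x2522, 0x3021], table))
```

A listed character therefore costs 9 bits and an unlisted one 17 bits, so the scheme only saves space when listed characters dominate the input.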

[0017] The above processing is now described with a concrete example. Suppose that the character code data of the katakana character "ア" is taken out of the compression target data in step S11 or S18. The processing then proceeds S12 → S13 → S14 → S15.

[0018] With "1" as the 1-bit information indicating a compressed code, "1" is output in step S14 and "92" (hexadecimal) is output in step S15. Expressed in binary, the data output in steps S14 and S15 is "110010010", as shown in FIG. 4(a).

[0019] Likewise, suppose that the character code data of the kanji character "亜" is taken out of the compression target data in step S11 or S18. The processing then proceeds S12 → S13 → S16 → S17.

[0020] With "0" as the 1-bit information indicating that the data is not a compressed code, "0" is output in step S16 and "3021" (hexadecimal) is output in step S17. Expressed in binary, the data output in steps S16 and S17 is "00011000000100001", as shown in FIG. 4(b).
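The two bit patterns quoted for FIG. 4 can be checked by concatenating the 1-bit flag with the fixed-width binary form of the code that follows it. The Python sketch below does only that; the helper name `with_flag` is hypothetical and is not part of the patent.

```python
def with_flag(flag, value, width):
    """Return the flag bit followed by `value` as a zero-padded binary string of `width` bits."""
    return str(flag) + format(value, "0{}b".format(width))


# "ア": flag 1 followed by the 8-bit compressed code 92 (hex), as in FIG. 4(a).
compressed_form = with_flag(1, 0x92, 8)

# "亜": flag 0 followed by the 16-bit character code 3021 (hex), as in FIG. 4(b).
uncompressed_form = with_flag(0, 0x3021, 16)

print(compressed_form, len(compressed_form))
print(uncompressed_form, len(uncompressed_form))
```

The first string is 9 bits long and the second 17 bits long, which is the per-character cost of a listed and an unlisted character respectively.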

[0021] In this way, the code data of frequently occurring characters is registered in advance in the correspondence list 12a in units of Japanese characters (16 bits), and when the compression target data is present in the correspondence list 12a, it is replaced with 8 bits and output. As a result, for character string data such as "コンピュータで扱うデータ" ("data handled by a computer") shown in FIG. 5, the 192 bits before compression can be reduced to 116 bits after compression.
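The figures of 192 and 116 bits are consistent with 11 kana characters costing 9 bits each (flag plus 8-bit code) and one kanji costing 17 bits (flag plus 16-bit code). The sketch below reproduces that arithmetic in Python. It assumes that every hiragana and katakana character in the string, including the long-vowel mark "ー", is registered in the correspondence list and that the kanji "扱" is not, and it uses Unicode ranges in the hypothetical helper `is_listed` purely as a stand-in for a real table lookup.

```python
def is_listed(ch):
    """Stand-in for a correspondence-list lookup.

    ASSUMPTION: all hiragana (U+3041..U+3096) and katakana including the
    long-vowel mark (U+30A1..U+30FC) are registered; everything else is not.
    """
    return "\u3041" <= ch <= "\u3096" or "\u30a1" <= ch <= "\u30fc"


text = "コンピュータで扱うデータ"

bits_before = 16 * len(text)                                    # 2 bytes per character
bits_after = sum(9 if is_listed(ch) else 17 for ch in text)     # flag + 8 or flag + 16

print(bits_before, bits_after)
```

Under these assumptions the result is 12 × 16 = 192 bits before compression and 11 × 9 + 17 = 116 bits after, roughly 60% of the original size.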

[0022]

[Effects of the Invention] As described above, according to the present invention, character code data that is present in the correspondence list is replaced with compressed code data and output. Therefore, if the character code data and compressed code data of characters that occur frequently in Japanese text data, such as hiragana, katakana, digits, alphabetic characters, and punctuation marks, are registered in the correspondence list, data compression that takes advantage of the characteristics of Japanese can be performed.

[0023] In this case, unlike adaptive compression methods such as the LZ method, the invention basically does not use previously seen data, so even short target data can be compressed. In addition, unlike static Huffman coding, the correspondence list is not output, so the compression ratio can be improved. Furthermore, unlike Huffman coding, no processing to calculate the frequency of occurrence of the data is needed, so processing speed can be increased.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a data compression device according to an embodiment of the present invention.

FIG. 2 is a diagram showing the contents of the correspondence list in the embodiment.

FIG. 3 is a flowchart showing the data compression processing in the embodiment.

FIG. 4 is a diagram explaining the form of the compressed code in the embodiment.

FIG. 5 is a diagram showing a data compression result in the embodiment.

[Explanation of symbols]

11 ... CPU, 12 ... memory, 12a ... correspondence list, 13 ... auxiliary storage device.

Claims

1. A data compression method characterized by: preparing a correspondence list in which code data of specific characters and their compressed code data are associated with each other; determining whether compression target data is present in the correspondence list; when the compression target data is present in the correspondence list, retrieving the compressed code data corresponding to its character code from the correspondence list and outputting it; and when the compression target data is not present in the correspondence list, outputting the character code data as is.

2. The data compression method according to claim 1, wherein information identifying whether the output data is compressed data is output before the output data.

3. A data compression device comprising: storage means for storing a correspondence list in which code data of specific characters and their compressed code data are associated with each other; determination means for determining whether compression target data is present in the correspondence list stored in the storage means; and output means for, when the determination means determines that the compression target data is present in the correspondence list, retrieving the compressed code data corresponding to its character code from the correspondence list and outputting it, and, when the compression target data is not present in the correspondence list, outputting the character code data as is.

4. The data compression device according to claim 3, wherein the output means outputs information identifying whether the output data is compressed data before the output data.