JPH10326273A

JPH10326273A - Data compressing device and data restoring device and data compressing method and data restoring method and program recording medium

Info

Publication number: JPH10326273A
Application number: JP9348321A
Authority: JP
Inventors: Hironori Yahagi; 裕紀矢作; Takashi Morihara; 隆森原
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1997-03-26
Filing date: 1997-12-17
Publication date: 1998-12-08

Abstract

PROBLEM TO BE SOLVED: To provide a data compressing device and a data restoring device suited to compressing Japanese documents. SOLUTION: A phonetic conversion unit 12 has a homonym dictionary 17 that stores kanji compound words, their readings, and homonym identification information in association with one another. The unit outputs phonetic data, obtained by replacing each kanji compound word in the text data with information consisting of character-count identification information (indicating the number of characters in the reading), the reading itself, and the homonym identification information. A lossless compression unit 13 then compresses the phonetic data. The character-count identification information is one-byte information whose value differs both from the first byte of every two-byte character used in the text data and from every one-byte character.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an apparatus, a method, and a program recording medium for compressing and restoring document data containing characters represented by multiple bytes, for example Japanese document data.

[0002]

2. Description of the Related Art

In recent years, with the spread of electronic mail and the like, the amount of digitized documents processed or stored on personally owned computers has grown dramatically. For example, many people handle several hundred to about 1,000 e-mails a day, and it is no longer unusual for several hundred MB or more of document data to accumulate in a single year.

[0003] For this reason, redundant portions of the data are removed to reduce its volume, which shortens data transmission time and reduces the storage capacity required. Various data compression methods are known, including methods that can compress many kinds of data, such as character codes, vector information, and images. Such methods use what is called universal coding.

[0004] The following outlines several coding methods classified as universal coding. Following the terminology of information theory, one unit of data is called a "character", and a sequence of multiple "characters" is called a "character string".

[0005] First, arithmetic coding is outlined. Arithmetic coding comes in two forms: binary arithmetic coding and multi-valued arithmetic coding with three or more values. In multi-valued arithmetic coding, the number line from 0 (inclusive) to 1 (exclusive), written [0, 1) below, is narrowed successively according to the occurrence probability (frequency of appearance) of each character in the data to be encoded. When all characters have been processed, a numerical value representing one point in the narrowed interval is output as the code.

[0006] For example, suppose the characters to be encoded are a, b, c, d, and e, with occurrence probabilities 0.2, 0.1, 0.05, 0.15, and 0.5, respectively. As shown in FIG. 24, each character is assigned an interval whose width corresponds to its occurrence probability.

[0007] If the character string to be encoded is "abe", then, as shown schematically in FIG. 25, the interval [0, 1) is first narrowed to the interval [0, 0.2) for the character "a". Next, the interval [0, 0.2) is divided into sub-intervals according to the occurrence probability of each character, and the sub-interval [0.04, 0.06) corresponding to the next character, "b", is selected as the interval for the character string "ab". The interval [0.04, 0.06) is in turn divided in the same way, and the sub-interval [0.05, 0.06) corresponding to the next character, "e", is selected as the interval for the character string "abe". Finally, the position of an arbitrary point in that interval (for example, its lower bound) is written in binary, and the bit string after the binary point is output as the encoding result.
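The interval narrowing in paragraphs [0006] and [0007] can be reproduced with a few lines of Python. This is only a minimal sketch of the narrowing step, not a complete arithmetic coder: the function name `narrow` is hypothetical, the sub-intervals are assumed to be laid out in the order a, b, c, d, e (which is consistent with the intervals quoted in the text), and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Occurrence probabilities from the example: a=0.2, b=0.1, c=0.05, d=0.15, e=0.5.
# Dictionary order fixes the (assumed) layout of the sub-intervals.
PROBS = {
    "a": Fraction(1, 5),
    "b": Fraction(1, 10),
    "c": Fraction(1, 20),
    "d": Fraction(3, 20),
    "e": Fraction(1, 2),
}

def narrow(text, probs=PROBS):
    """Return the interval [low, high) that arithmetic coding assigns to text."""
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        width = high - low
        cumulative = Fraction(0)  # total probability of the symbols placed before ch
        for sym, p in probs.items():
            if sym == ch:
                high = low + width * (cumulative + p)
                low = low + width * cumulative
                break
            cumulative += p
        else:
            raise ValueError(f"symbol {ch!r} has no assigned probability")
    return low, high

low, high = narrow("abe")
print(float(low), float(high))
```

Calling `narrow("ab")` gives the intermediate interval [0.04, 0.06), and `narrow("abe")` gives [0.05, 0.06), matching the values in the text.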

[0008] Arithmetic coding is further classified by how the intervals are divided according to occurrence probability (frequency of appearance): static coding, which divides the interval according to preset frequencies regardless of the actual frequency of each character; semi-adaptive coding, which divides it according to frequencies obtained by first scanning the entire character string; and adaptive coding, which recalculates the frequencies and resets the intervals each time a character appears. A reference on arithmetic coding is Bell, T.C., Cleary, J.G., and Witten, I.H. (1990), "Text Compression", Prentice Hall.
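The difference between the semi-adaptive and adaptive schemes lies in when the frequencies are computed. The sketch below illustrates this under stated assumptions: the function names are hypothetical, and the adaptive model starts every symbol at a count of 1, which is one common convention rather than something the text specifies. A static model would simply use a fixed table such as the one in paragraph [0006].

```python
from collections import Counter
from fractions import Fraction

def semi_adaptive_probs(text):
    """Scan the whole text once, then use the resulting fixed probabilities."""
    counts = Counter(text)
    total = len(text)
    return {sym: Fraction(n, total) for sym, n in counts.items()}

def adaptive_probs(text, alphabet):
    """Yield (symbol, probability used to code it), updating counts as we go."""
    counts = {sym: 1 for sym in alphabet}  # assumed initial count of 1 per symbol
    for ch in text:
        total = sum(counts.values())
        yield ch, Fraction(counts[ch], total)
        counts[ch] += 1  # the model changes only after the symbol is coded

print(semi_adaptive_probs("aab"))
print(list(adaptive_probs("aab", "ab")))
```

Because the adaptive model is updated only after each symbol is coded, a decoder can rebuild the same counts without receiving a frequency table.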

[0009] A universal coding method called splay coding is also known. In splay coding, each time a character is encoded, the code tree (a code table with a tree structure) is rearranged so that characters with a higher frequency of appearance are assigned shorter codes. For details of splay coding, see, for example, Jones, Douglas W., "Application of Splay Trees to Data Compression", Commun. ACM, vol. 31, no. 8, pp. 996-1007, Aug. 1988.

[0010] A coding method called blended splay coding is also known. Blended splay coding is splay coding combined with a statistical model called a blending model.

[0011] In blended splay coding, a code tree is prepared for each context. Here, as shown schematically in FIG. 26, a context is the character string ("ab") immediately preceding the character to be encoded ("c"). In blended splay coding (the blending model), a context tree such as the one shown in FIG. 27 controls the number of characters used as context (the order) according to how readily each context occurs. In general, for data with strong correlation between characters, the higher the order of the context, the higher the compression ratio that can be obtained; for data with weak correlation between characters, however, a high-order context can actually lower the compression ratio. The blending model was devised to prevent this. In the blending model, the order of each context is adapted to the input data: when a context occurs readily its order is raised, and when it occurs rarely its order is left low.

[0012]

Problems to be Solved by the Invention

Because the coding methods described above were developed in cultures that use alphabets, compression with them treats one byte as one character. As a result, when text containing characters represented by two bytes, such as Japanese, is compressed with these techniques, the compression ratio is not as high as for English documents.

[0013] Specifically, a two-byte character has meaning only as a combination of its two bytes, and there is no correlation between the two bytes that make up a two-byte character. Conventional compression methods that process two-byte characters one byte at a time therefore, in information-theoretic terms, compress after reducing the information source (splitting two-byte data into single bytes), so a high compression ratio cannot be obtained.

[0014] A further problem is that it is difficult to raise the compression ratio by using contexts. There are several thousand kanji even among those in everyday use, so when texts of about the same size are compressed by coding that uses contexts of the same number of bytes, more contexts appear in a Japanese document than in an English one. In fact, when the 4-byte contexts appearing during compression of an 8 kB Japanese document and an 8 kB English document were counted, about 3,000 distinct contexts appeared in the English document versus about 5,000 in the Japanese document. Moreover, many of the Japanese documents that are compressed, such as e-mail, are relatively small (a few A4 pages). As a result, compression of a Japanese document often finishes before sufficient statistical information has been collected for each context, which is another factor that lowers the compression ratio for Japanese documents.
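A count of this kind can be approximated as follows. The sketch assumes that "a context" means a distinct run of `order` bytes immediately preceding some byte of the data; the function name `count_contexts` is hypothetical, and the patent does not state exactly how its measurement was made.

```python
def count_contexts(data: bytes, order: int = 4) -> int:
    """Count the distinct `order`-byte strings that precede some byte of data."""
    return len({data[i - order:i] for i in range(order, len(data))})

# Usage on byte strings; Japanese text is encoded as Shift JIS first.
english = b"abababab"
japanese = "あいあいあい".encode("shift_jis")
print(count_contexts(english, order=2))
print(count_contexts(japanese, order=2))
```

Running the function over same-sized English and Shift JIS Japanese files with `order=4` is one way to reproduce the comparison described in the text.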

[0015] An object of the present invention is therefore to provide a data compression apparatus, a data compression method, and a program recording medium suited to compressing document data in a language, such as Japanese, whose characters are represented by multiple bytes.

[0016] Another object of the present invention is to provide a data restoration apparatus, a data restoration method, and a program recording medium capable of restoring data compressed by such a data compression apparatus and data compression method.

[0017]

Means for Solving the Problems

To solve the above problems, in a first aspect of the present invention, a data compression apparatus for compressing original document data containing character information in which one character is represented by multiple bytes comprises: phonetic document data creating means for creating phonetic document data, which is data obtained by replacing each item of character information in the original document data to be compressed with phonetic character information representing the sound produced when the corresponding character is pronounced; and compression means for compressing the phonetic document data created by the phonetic document data creating means.

[0018] With the data compression apparatus of the first aspect, the original document data is converted into phonetic document data expressed in phonetic character information, which has fewer distinct types, and is then compressed. A higher compression ratio can therefore be achieved than when the original document data is compressed directly.

[0019] Data compressed by this data compression apparatus is restored using a data restoration apparatus comprising: restoration means for restoring the compressed document data; and original document data generating means for generating the original document data from which the compressed document data was derived, by converting the phonetic character information restored by the restoration means, which represents the sound produced when a character is pronounced, into the corresponding character information.

[0020] In a second aspect of the present invention, a data compression apparatus for compressing original document data containing character information in which one character is represented by multiple bytes comprises: phonetic character information storage means in which, for each of a plurality of items of conversion target word information each consisting of one or more items of character information, phonetic character information representing the sound produced when the word indicated by that conversion target word information is pronounced is stored; search/read means for searching the original document data for conversion target word information stored in the phonetic character information storage means and reading the phonetic character information corresponding to each item found from the phonetic character information storage means; phonetic document data creating means for creating phonetic document data by replacing the conversion target word information found in the original document data by the search/read means with conversion target word replacement information that includes the phonetic character information read by the search/read means; intermediate code table creating means for creating an intermediate code table that associates an intermediate code with each information element used in the phonetic document data created by the phonetic document data creating means; intermediate code document data creating means for creating intermediate code document data by converting each information element of the phonetic document data into the corresponding intermediate code using the intermediate code table created by the intermediate code table creating means; and compression means for compressing the intermediate code document data created by the intermediate code document data creating means.

[0021] That is, in the data compression apparatus according to the second aspect, the conversion target words in the original document data are first replaced with phonetic character information, producing phonetic document data that uses fewer types of characters than the original document data. A new code (intermediate code) is then assigned to each information element (character) in the phonetic document data, and the phonetic document data is converted into intermediate code document data using those intermediate codes. The intermediate code document data is then compressed. This data compression apparatus can therefore compress, at a high compression ratio, even document data containing characters that are difficult to express as phonetic character information, such as symbols.

[0022] Data compressed by the data compression apparatus of the second aspect is restored using a data restoration apparatus comprising: phonetic character information storage means in which, for conversion target word information consisting of one or more items of character information, phonetic character information representing the sound produced when the word indicated by the conversion target word information is pronounced is stored; restoration means for outputting intermediate code document data by restoring the compressed document data; phonetic document data generating means for generating phonetic document data by replacing each intermediate code in the intermediate code document data output by the restoration means with the information associated with that intermediate code in the intermediate code table for the compressed document data; and original document data generating means for generating the original document data from which the compressed document data was derived, by searching the phonetic document data generated by the phonetic document data generating means for conversion target word replacement information and replacing each item found with the conversion target word information stored in the phonetic character information storage means in association with the phonetic character information contained in that conversion target word replacement information.

[0023] In forming the data compression apparatus of the second aspect, the intermediate code table creating means may be means for creating an intermediate code table that assigns to each information element used in the phonetic document data an intermediate code having the minimum number of bits capable of representing those information elements.
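The idea of a minimum-width intermediate code can be sketched as follows. The function names are hypothetical, and assigning codes in sorted symbol order is an arbitrary choice made here for reproducibility; the text only requires that the code width be the smallest that can distinguish all information elements in use.

```python
def build_code_table(symbols):
    """Return (table, bits): a code for each distinct symbol and the code width."""
    distinct = sorted(set(symbols))
    # Smallest width that can hold the codes 0 .. len(distinct) - 1 (at least 1 bit).
    bits = max(1, (len(distinct) - 1).bit_length())
    table = {sym: code for code, sym in enumerate(distinct)}
    return table, bits

def to_intermediate(symbols, table):
    """Convert each information element into its intermediate code."""
    return [table[sym] for sym in symbols]

table, bits = build_code_table("かんじかな")
print(bits, to_intermediate("かんじかな", table))
```

Here the five-character input uses four distinct symbols, so two bits per symbol suffice instead of the sixteen bits of a two-byte character.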

[0024] Alternatively, the phonetic document data creating means may be means that creates the phonetic document data by replacing character information of a predetermined type with replacement information in which that character information is enclosed between start position identification information and end position identification information indicating that it is character information of the predetermined type; the intermediate code table creating means may be means that creates an intermediate code table in which no intermediate code is associated with the information following the start position identification information or with the end position identification information in the phonetic document data; and the intermediate code document data creating means may be means that does not convert into intermediate codes the information following the start position identification information or the end position identification information in the phonetic document data.

[0025] When a data compression apparatus is configured with these means, character information of a predetermined type, for example a single kanji read with its on-reading, can be exempted both from conversion into phonetic character information and from conversion into intermediate codes.

[0026] Further, the intermediate code table creating means may be means that, when the number of types of information elements used in the phonetic document data exceeds the number N of items representable with a predetermined number of bits, selects from those information elements "N-1" information elements to which intermediate codes are to be assigned, and creates an intermediate code table that associates mutually different intermediate codes of the predetermined number of bits with the selected "N-1" information elements and with the start position identification information. In that case, the intermediate code document data creating means may be means that creates the intermediate code document data as follows: information in the phonetic document data that has an intermediate code associated with it in the intermediate code table is converted into that intermediate code; unassigned information, that is, information to which no intermediate code is assigned, is replaced with unassigned replacement information, which begins with the intermediate code associated with the start position identification information, contains the unassigned information, and has a form from which its end position can be determined.
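The escape mechanism described above can be sketched as follows. Several details are assumptions made for illustration only: the "N-1" elements are chosen by frequency (the text does not say how they are selected), the last code value is reserved for the start position identification information, and an unassigned element is emitted as a single raw item right after the escape code so that its end position is trivially known. All function names are hypothetical.

```python
from collections import Counter

def build_escaped_table(symbols, bits):
    """Return (table, esc); esc is None when every symbol fits in `bits` bits."""
    n = 1 << bits  # N: number of values representable with `bits` bits
    ranked = [sym for sym, _ in Counter(symbols).most_common()]
    if len(ranked) <= n:
        return {sym: code for code, sym in enumerate(ranked)}, None
    # Assumed policy: the N-1 most frequent symbols get codes, code N-1 is the escape.
    table = {sym: code for code, sym in enumerate(ranked[:n - 1])}
    return table, n - 1

def encode_with_escape(symbols, table, esc):
    out = []
    for sym in symbols:
        if sym in table:
            out.append(table[sym])
        else:
            out.extend([esc, sym])  # escape code followed by the raw element
    return out

def decode_with_escape(items, table, esc):
    inverse = {code: sym for sym, code in table.items()}
    out = []
    it = iter(items)
    for item in it:
        if esc is not None and item == esc:
            out.append(next(it))  # the raw element follows the escape code
        else:
            out.append(inverse[item])
    return out

symbols = list("aaabbc")
table, esc = build_escaped_table(symbols, bits=1)
print(encode_with_escape(symbols, table, esc))
```

With `bits=1` only one symbol besides the escape can receive a code, so "b" and "c" are passed through behind the escape code and recovered unchanged on decoding.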

[0027] Further, the phonetic character information storage means may be means that stores, for each of a plurality of conversion target words, phonetic character information together with homonym identification information for distinguishing the word from other conversion target words associated with the same phonetic character information; the search/read means may be means that reads the phonetic character information and the homonym identification information corresponding to each conversion target word found; the phonetic document data creating means may be means that replaces the conversion target word information in the original document data with conversion target word replacement information containing the phonetic character information and homonym identification information read by the search/read means; and the intermediate code table creating means may be means that creates the intermediate code table for the information elements other than the homonym identification information.

[0028] In a third aspect of the present invention, a data compression apparatus for compressing original document data containing character information in which one character is represented by multiple bytes comprises: phonetic character information storage means in which, for conversion target word information consisting of one or more items of character information, phonetic character information representing the sound produced when the word indicated by the conversion target word information is pronounced is stored; search/read means for searching the original document data for conversion target word information stored in the phonetic character information storage means and reading the phonetic character information corresponding to each item found from the phonetic character information storage means; phonetic document data creating means for creating phonetic document data by replacing the conversion target word information found in the original document data by the search/read means with conversion target word replacement information that includes the phonetic character information read by the search/read means; and compression means for compressing the phonetic document data created by the phonetic document data creating means.

[0029] With the data compression apparatus of the third aspect, the original document data is converted into phonetic document data expressed in phonetic character information, which has fewer distinct types, and is then compressed. A higher compression ratio can therefore be achieved than when the original document data is compressed directly.

[0030] Data compressed by the data compression apparatus of the third aspect is restored using a data restoration apparatus comprising: phonetic character information storage means in which, for conversion target word information consisting of one or more items of character information, phonetic character information representing the sound produced when the word indicated by each item of conversion target word information is pronounced is stored; restoration means for outputting phonetic document data by restoring the compressed document data; and original document data generating means for generating the original document data from which the compressed document data was derived, by searching the phonetic document data generated by the restoration means for conversion target word replacement information and replacing each item found with the conversion target word information stored in the phonetic character information storage means in association with the phonetic character information contained in that conversion target word replacement information.

[0031] The data compression apparatus of each aspect can use any kind of phonetic character information. For example, information representing letters of the alphabet, or information representing the vowels and consonants of Hangul, can be used. Information representing the International Phonetic Alphabet or Zhuyin (Bopomofo) symbols can also be used.

[0032]

Embodiments of the Invention

Embodiments of the present invention are described in detail below with reference to the drawings.

<First Embodiment>

The data compression/restoration apparatus of the first embodiment compresses and restores Japanese documents, and is implemented by operating a computer according to a compression/restoration program.

[0033] First, an overview of the data compression/restoration apparatus of the first embodiment is given with reference to the functional block diagram in FIG. 1. As shown, the apparatus comprises a storage unit 11, a phonetic conversion unit 12, a lossless compression unit 13, a lossless restoration unit 14, and a phonetic inverse conversion unit 15.

[0034] The storage unit 11 stores plaintext data, which is the data to be compressed (or the restored data), and compressed data, which is the result of compressing the plaintext data. The plaintext data stored in the storage unit 11 is encoded in Shift JIS.

[0035] The phonetic conversion unit 12 and the lossless compression unit 13 operate when data is compressed. The phonetic conversion unit 12 has a Japanese analysis dictionary 16 and a homonym dictionary 17. The Japanese analysis dictionary 16 is a dictionary for extracting kanji compound words and single kanji (with on-readings or kun-readings) from plaintext data consisting of non-kanji characters (hiragana, katakana, symbols, and so on) and kanji. The homonym dictionary 17 stores homonym identification information, which is the information needed to identify a single kanji compound word (or kanji) from its reading (a hiragana character string). That is, as illustrated in FIG. 2, the homonym dictionary 17 stores mutually different homonym identification information (0, 1, ...) for kanji compound words that share the same reading. In this embodiment, the homonym identification information is one byte holding a numerical value in binary.

【００３６】表音変換部１２は、日本語解析辞書１６と
同音異義語辞書１７を用いて、記憶部１１内に記憶され
た平文データを変換し、変換結果である表音データを無
歪圧縮部１３に供給する。詳細については後述するが、
表音変換部１２は、平文データ中に含まれる漢字あるい
は漢字熟語を、その読み（平仮名文字列）の文字数を表
す文字数識別情報と、読みと、同音異義語識別情報とか
らなる情報に置き換えることによって、平文データの表
音データへの変換を行う。無歪圧縮部１３は、情報のロ
スがない(lossless)圧縮を表音データに対して施し、そ
の結果（圧縮データ）を記憶部１１に格納する。本実施
形態では、無歪圧縮部１３において、一例として、文脈
の最高次数が“２”であるブレンド・スプレー符号化を
行うものを用いている。The phonetic conversion unit 12 uses the Japanese analysis dictionary 16 and the homonym dictionary 17 to convert the plaintext data stored in the storage unit 11, and supplies the resulting phonetic data to the distortionless compression unit 13. As detailed later, the phonetic conversion unit 12 converts plaintext data into phonetic data by replacing each kanji or kanji compound in the plaintext data with information consisting of character count identification information, which indicates the number of characters in the reading (a hiragana character string), the reading itself, and homonym identification information. The distortionless compression unit 13 applies lossless compression to the phonetic data and stores the result (compressed data) in the storage unit 11. In this embodiment, as one example, the distortionless compression unit 13 performs blended splay coding with a maximum context order of 2.

【００３７】無歪復元部１４および表音逆変換部１５
は、データ復元時に機能する。無歪復元部１４は、無歪
圧縮部１３が出力する圧縮データを復元する機能を有す
る。無歪復元部１４は、記憶部１１内に記憶された圧縮
データの復元を行い、復元結果（表音データ）を、表音
逆変換部１５に供給する。表音逆変換部１５は、同音異
義語辞書１７と同じ内容の同音異義語辞書１８を有す
る。表音逆変換部１５は、当該辞書を用いて表音データ
の変換を行い、変換結果である平文データを、記憶部１
１内に記憶する。The distortionless restoration unit 14 and the phonetic inverse conversion unit 15 operate when data is restored. The distortionless restoration unit 14 can restore the compressed data output by the distortionless compression unit 13. It restores the compressed data stored in the storage unit 11 and supplies the result (phonetic data) to the phonetic inverse conversion unit 15. The phonetic inverse conversion unit 15 has a homonym dictionary 18 with the same contents as the homonym dictionary 17. Using this dictionary, it converts the phonetic data and stores the resulting plaintext data in the storage unit 11.

【００３８】以下、第１実施形態のデータ圧縮・復元装
置の動作を具体的に説明する。まず、データ圧縮時の動
作を説明する。図３に、データ圧縮時の表音変換部１２
の動作手順を示す。図示したように、データ圧縮時、表
音変換部１２は、まず、記憶部１１内の圧縮をおこなう
べき平文データから１バイト分のデータを取得する（ス
テップＳ１０１）。そして、そのデータが１バイト文字
（制御コード、アルファベット、数字、半角片仮名）で
あるか、２バイト文字の第１バイトであるかを判断（ス
テップＳ１０２）する。The operation of the data compression/decompression device of the first embodiment is now described in detail, beginning with data compression. FIG. 3 shows the operating procedure of the phonetic conversion unit 12 during data compression. As shown, the phonetic conversion unit 12 first obtains one byte of data from the plaintext data to be compressed in the storage unit 11 (step S101). It then determines whether that byte is a one-byte character (control code, alphabetic character, digit, or half-width katakana) or the first byte of a two-byte character (step S102).

【００３９】取得したデータが２バイト文字の第１バイ
トであった場合（ステップＳ１０２；2ハ゛イト文字）、表
音変換部１２は、平文データから次の１バイト分のデー
タを取得（ステップＳ１０３）する。次いで、表音変換
部１２は、取得した２バイトのデータが表す文字が漢字
であるか否かを判断する（ステップＳ１０４）。その文
字が漢字ではなかった場合（ステップＳ１０４；非漢
字）、すなわち、当該文字が、平仮名、片仮名、記号
（罫線素片etc.）等であった場合、表音変換部１２は、
取得したデータを無歪圧縮部１３に出力する（ステップ
Ｓ１０５）。また、ステップＳ１０１で取得したデータ
が１バイト文字であった場合（ステップＳ１０２；1ハ゛イ
ト文字）も、表音変換部１２は、取得したデータを無歪
圧縮部１４に出力する（ステップＳ１０５）。そして、
出力したデータが“ＥＯＦ”でなかった場合（ステップ
Ｓ１１４；Ｎ）には、ステップＳ１０１に戻り、平文デ
ータ内の残りのデータに対する処理を実行する。If the obtained data is the first byte of a two-byte character (step S102; two-byte character), the phonetic conversion unit 12 obtains the next byte from the plaintext data (step S103). It then determines whether the character represented by the two obtained bytes is a kanji (step S104). If the character is not a kanji (step S104; non-kanji), that is, if it is a hiragana, a katakana, a symbol (a ruled-line element, etc.), or the like, the phonetic conversion unit 12 outputs the obtained data to the distortionless compression unit 13 (step S105). Likewise, if the data obtained in step S101 is a one-byte character (step S102; one-byte character), the phonetic conversion unit 12 outputs the obtained data to the distortionless compression unit 13 (step S105). If the output data is not "EOF" (step S114; N), the process returns to step S101 and the remaining plaintext data is processed.

【００４０】以下、説明の便宜上、１バイト文字を表す
１バイトのデータと、２バイト文字を表す２バイトのデ
ータを、共に、単位情報と表記することにする。取得し
た２バイトのデータからなる単位情報の表す文字が漢字
であった場合（ステップＳ１０４；漢字）、表音変換部
１２は、平文データから次の単位情報を取得する（ステ
ップＳ１０６）。すなわち、平文データから次の１バイ
トのデータを取得し、そのデータが１バイト文字であっ
た場合には、そのデータを単位情報とする。また、その
データが２バイト文字の第１バイトであった場合、表音
変換部１２は、さらに、１バイトのデータを取得し、そ
れら２バイトのデータを単位情報とする。Hereinafter, for convenience of description, both 1-byte data representing a 1-byte character and 2-byte data representing a 2-byte character will be referred to as unit information. If the character represented by the obtained unit information consisting of two bytes of data is a kanji (step S104; kanji), the phonetic conversion unit 12 obtains the next unit information from the plaintext data (step S106). That is, the next one-byte data is obtained from the plaintext data, and if the data is a one-byte character, the data is used as unit information. If the data is the first byte of a two-byte character, the phonetic conversion unit 12 further acquires one-byte data and uses the two-byte data as unit information.

【００４１】その後、表音変換部１２は、新たに取得し
た単位情報が漢字を表すものであるか、非漢字あるいは
１バイト文字を表すものであるかを判断する（ステップ
Ｓ１０７）。その単位情報が漢字を表すものであった場
合（ステップＳ１０７；漢字）、表音変換部１２は、ス
テップＳ１０６に戻り、平文データから次の単位情報を
取得する。Thereafter, the phonetic conversion unit 12 determines whether the newly acquired unit information represents a kanji, a non-kanji, or a one-byte character (step S107). If the unit information represents a kanji (step S107: kanji), the phonetic conversion unit 12 returns to step S106 and acquires the next unit information from the plaintext data.

【００４２】ステップＳ１０６において新たに取得した
単位情報が非漢字あるいは１バイト文字を表すものであ
った場合（ステップＳ１０７；その他）、表音変換部１
２は、それまでに取得した漢字数が“１”であるか否か
を判断する（ステップＳ１０８）。取得した漢字数が
“１”であった場合（ステップＳ１０８；Ｙ）、その漢
字が音読み漢字であるか否かを判断する（ステップＳ１
０９）。なお、表音変換部１２は、ステップＳ１０６，
Ｓ１０７のループで最後に取得した単位情報（送りがな
等）を利用して、ステップ１０９の判断を行う。If the unit information newly obtained in step S106 represents a non-kanji character or a one-byte character (step S107; other), the phonetic conversion unit 12 determines whether the number of kanji obtained so far is one (step S108). If it is one (step S108; Y), the unit determines whether that kanji is an on-reading kanji (step S109). The phonetic conversion unit 12 makes the determination of step S109 using the unit information obtained last in the loop of steps S106 and S107 (okurigana and the like).

【００４３】表音変換部１２は、取得した漢字が１文字
であり、かつ、音読み漢字でなかった場合（ステップＳ
１０９；Ｎ）と、取得した漢字が２文字以上であった場
合（ステップＳ１０８；Ｎ）には、取得した漢字あるい
は漢字列と、その読み（平仮名文字列）を用い、同音異
義語辞書１３から同音異義語識別情報を検索する（ステ
ップＳ１１０）。次いで、読みの文字数に対応する文字
数識別情報と、読みと、検索した同音異義語識別情報と
を、この順で、無歪圧縮部１４に出力する（ステップＳ
１１１）。ここで、文字数識別情報とは、数値に対応づ
けられた１バイトのデータであり、各数値に対応する文
字数識別情報には、１バイト文字で使用されている00-7
F,A1-DF,２バイト文字の第１バイトとして使用されてい
る81-9F,E0-EF（１６進表記）を除くコードが割り当て
られている。If the obtained kanji is a single character that is not an on-reading kanji (step S109; N), or if two or more kanji were obtained (step S108; N), the phonetic conversion unit 12 uses the obtained kanji or kanji string and its reading (a hiragana character string) to look up the homonym identification information in the homonym dictionary 17 (step S110). It then outputs, in this order, the character count identification information corresponding to the number of characters in the reading, the reading, and the retrieved homonym identification information to the distortionless compression unit 13 (step S111). Here, character count identification information is one byte of data associated with a numerical value. The codes assigned to it exclude 00-7F and A1-DF, which are used for one-byte characters, and 81-9F and E0-EF, which are used as the first byte of two-byte characters (all in hexadecimal).
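The byte ranges named above are enough to sketch the classification performed in step S102 and to enumerate the values left over for character count identification information. The function name `classify` and the list `FREE_CODES` are illustrative assumptions; the description does not say which leftover value is assigned to which count.

```python
def classify(byte: int) -> str:
    """Classify one byte using the ranges given in the description."""
    if byte <= 0x7F or 0xA1 <= byte <= 0xDF:
        return "one_byte"    # control codes, ASCII, half-width katakana
    if 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xEF:
        return "lead_byte"   # first byte of a two-byte character
    return "free"            # usable as identification information


# Values that collide with neither one-byte characters nor lead bytes.
FREE_CODES = [b for b in range(0x100) if classify(b) == "free"]
```

Under these ranges the leftover values are 0x80, 0xA0, and 0xF0 through 0xFF, so at most 18 distinct one-byte markers are available.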

【００４４】例えば、取得した漢字列が“製品”であっ
た場合、表音変換部１２は、ステップＳ１１１におい
て、図４に模式的に示したように、“製品”の読みであ
る“せいひん”の文字数４を表す１バイトの文字数識別
情報と、“せいひん”を表す総計８バイトのデータと、
１バイトの同音異義語識別情報（同音異義語辞書１７の
内容が図３に示したものであった場合には、0x00）とを
出力する。For example, if the obtained kanji string is "製品" (product), then in step S111 the phonetic conversion unit 12 outputs, as shown schematically in FIG. 4, one byte of character count identification information representing 4, the number of characters in the reading "せいひん" (seihin); a total of 8 bytes of data representing "せいひん"; and one byte of homonym identification information (0x00 if the contents of the homonym dictionary 17 are as shown in FIG. 2).
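The "製品" example can be reproduced in a few lines of Python. The table `COUNT_CODES`, which maps a count n to the n-th value outside the one-byte and lead-byte ranges, is a hypothetical assignment chosen for this sketch; the function name `encode_record` is likewise an assumption.

```python
# Hypothetical assignment: a reading of n characters is marked by
# COUNT_CODES[n - 1]. All values lie outside 00-7F, A1-DF, 81-9F, E0-EF.
COUNT_CODES = bytes([0x80, 0xA0, *range(0xF0, 0x100)])


def encode_record(reading: str, ident: int) -> bytes:
    """Build count marker + Shift JIS reading + homonym identifier."""
    marker = COUNT_CODES[len(reading) - 1]
    return bytes([marker]) + reading.encode("shift_jis") + bytes([ident])


record = encode_record("せいひん", 0x00)
```

The resulting record is 10 bytes long: one marker byte, eight bytes for the four hiragana, and one identifier byte, which replaces the 4-byte Shift JIS encoding of "製品".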

【００４５】同音異義語識別情報の出力後、表音変換部
１２は、ステップＳ１０６，Ｓ１０７のループで最後に
取得した単位情報を無歪圧縮部１３に出力（ステップＳ
１１２）する。そして、出力した単位情報が“ＥＯＦ”
でなかった場合（ステップＳ１１４；Ｎ）には、ステッ
プＳ１０１からの処理を再度実行する。After outputting the homonym identification information, the phonetic conversion unit 12 outputs the unit information obtained last in the loop of steps S106 and S107 to the distortionless compression unit 13 (step S112). If the output unit information is not "EOF" (step S114; N), the processing from step S101 is executed again.

【００４６】取得した漢字が音読み漢字１文字であった
場合（ステップＳ１０９；Ｙ）、表音変換部１２は、取
得している２個の単位情報（漢字と、非漢字あるいは１
バイト文字）を無歪圧縮部１３に出力（ステップＳ１１
３）して、ステップＳ１１４に進む。なお、１文字の音
読み漢字に対して別処理を行わせているのは、以下の理
由による。音読み漢字には同音異義語が多数存在するも
のが多く、中には、１バイトで表現できない数の同音異
義語が存在するもの（例えば、「三省堂、新明解漢和辞
典第３版１９８７年」には、“こう”の同音異義語と
して、３６２語が掲載されている。）もある。このた
め、音読み漢字１字に対して他の漢字と同処理を行った
場合、同音異義語識別情報の種類が多くなる結果とし
て、読みに変換する意義が薄れるためである。If the obtained kanji is a single on-reading kanji (step S109; Y), the phonetic conversion unit 12 outputs the two pieces of unit information it holds (the kanji and the following non-kanji or one-byte character) to the distortionless compression unit 13 (step S113) and proceeds to step S114. Single on-reading kanji are handled separately for the following reason. Many on-readings have a large number of homonyms, and some have more than can be expressed in one byte (for example, the Sanseido Shin Meikai Kanwa Jiten, 3rd edition, 1987, lists 362 entries as homonyms of "こう" (kō)). If a single on-reading kanji were processed like other kanji, the number of distinct homonym identification values would grow, and converting to the reading would lose much of its benefit.

【００４７】表音変換部１２は、このような処理を繰り
返し、“ＥＯＦ”の出力を行ったときに（ステップＳ１
１４；Ｙ）、圧縮対象である平文データに対する処理を
終える。The phonetic conversion unit 12 repeats this processing and, once it has output "EOF" (step S114; Y), finishes processing the plaintext data to be compressed.

【００４８】なお、図示は省略したが、実際には、ステ
ップＳ１０８において“Ｎ”側への分岐を行う際、表音
変換部１２は、取得している漢字列を、同音異義語辞書
１７に同音異義語識別情報が定義されているサイズの漢
字熟語（漢字の場合もある）に分解し、分解した各漢字
熟語に対して、ステップＳ１１０、Ｓ１１１相当の処理
を行う。すなわち、同音異義語辞書１７に記憶されてい
る２つの漢字熟語からなる漢字列が平文データに含まれ
ていた場合、表音変換部１２は、その漢字列を構成する
第１番目の漢字熟語の前後に、文字数識別情報と同音異
義語識別情報とを付加したデータを出力した後、第２番
目の漢字熟語の前後に、文字数識別情報と同音異義語識
別情報とを付加したデータを出力する。その後、その漢
字列の次に存在していた漢字以外の文字を出力する。そ
して、表音変換部１２は、平文データから取得した単位
情報が漢字以外の文字を表すものであった場合には、そ
の単位情報をそのまま出力する。このため、例えば、
「半導体製品の売り上げは好調です。」という文章は、
表音変換部１２によって、図５に模式的に示したような
データ（表音データ）に変換されることになる。Although not shown in the figure, when branching to the "N" side in step S108, the phonetic conversion unit 12 actually decomposes the obtained kanji string into kanji compounds (sometimes single kanji) of sizes for which homonym identification information is defined in the homonym dictionary 17, and performs processing equivalent to steps S110 and S111 on each resulting compound. That is, if the plaintext data contains a kanji string made up of two kanji compounds stored in the homonym dictionary 17, the phonetic conversion unit 12 outputs the reading of the first compound with character count identification information before it and homonym identification information after it, and then outputs the reading of the second compound in the same form. It then outputs the non-kanji character that followed the kanji string. When unit information obtained from the plaintext data represents a non-kanji character, the unit outputs it unchanged. As a result, the sentence "半導体製品の売り上げは好調です。" (Sales of semiconductor products are strong.) is converted by the phonetic conversion unit 12 into data (phonetic data) such as that shown schematically in FIG. 5.
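The conversion of the example sentence can be approximated with the following Python sketch. It is a simplification under stated assumptions: the table `WORDS` stands in for both dictionaries 16 and 17 with made-up entries, kanji runs are split by greedy longest match rather than by real Japanese analysis, the on-reading test of step S109 is omitted, and `COUNT_CODES` is a hypothetical marker assignment.

```python
import re

# Hypothetical entries: kanji word -> (reading, homonym identifier).
WORDS = {
    "半導体": ("はんどうたい", 0),
    "製品": ("せいひん", 0),
    "売": ("う", 0),
    "上": ("あ", 0),
    "好調": ("こうちょう", 0),
}
COUNT_CODES = bytes([0x80, 0xA0, *range(0xF0, 0x100)])  # assumed markers
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_run(run: str) -> list:
    """Greedy longest-match split of a kanji run into dictionary words."""
    parts, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):
            if run[i:j] in WORDS:
                parts.append(run[i:j])
                i = j
                break
        else:
            parts.append(run[i])  # unknown kanji is kept as a literal
            i += 1
    return parts


def to_phonetic(text: str) -> bytes:
    """Replace dictionary words by marker + reading + identifier."""
    out, pos = bytearray(), 0
    for match in KANJI_RUN.finditer(text):
        out += text[pos:match.start()].encode("shift_jis")
        for part in split_run(match.group()):
            if part in WORDS:
                reading, ident = WORDS[part]
                out.append(COUNT_CODES[len(reading) - 1])
                out += reading.encode("shift_jis")
                out.append(ident)
            else:
                out += part.encode("shift_jis")
        pos = match.end()
    out += text[pos:].encode("shift_jis")
    return bytes(out)


data = to_phonetic("半導体製品の売り上げは好調です。")
```

In this sketch the phonetic data is longer than the 32-byte Shift JIS input, which is expected: the gain described in the text comes from the smaller set of distinct characters seen by the subsequent lossless compression stage, not from this stage alone.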

【００４９】第１実施形態のデータ圧縮・復元装置で
は、表音変換部１２の出力するこのような表音データ
が、無歪圧縮部１３によって圧縮される。そして、無歪
圧縮部１３による圧縮結果が、圧縮データとして記憶部
１１に格納され、データ圧縮が完了する。すなわち、本
データ圧縮・復元装置では、平分データが、元の状態に
復元でき（復元手順については後述）、かつ、使用され
ている文字の種類がより少ないデータである表音データ
に変換される。そして、その表音データが圧縮される。
このため、本データ圧縮・復元装置では、圧縮時の統計
情報の収集が、平文データを直接圧縮する場合に比して
効率的に行われることになり、その結果として、高い圧
縮率でのデータ圧縮が実現されることになる。In the data compression/decompression device of the first embodiment, the phonetic data output by the phonetic conversion unit 12 is compressed by the distortionless compression unit 13. The result is stored in the storage unit 11 as compressed data, which completes data compression. In other words, the device converts plaintext data into phonetic data, which can be restored to the original state (the restoration procedure is described later) and which uses fewer distinct characters, and then compresses that phonetic data. Statistical information is therefore gathered more efficiently during compression than when the plaintext data is compressed directly, and a higher compression ratio results.

【００５０】次に、第１実施形態のデータ圧縮・復元装
置のデータ復元時の動作を説明する。無歪復元部１４
は、復元すべきデータとして指示された圧縮データを記
憶部１１から読み出し、当該圧縮データの復元結果、す
なわち、表音変換部１２がある平文データに基づき作成
した表音データを、表音逆変換部１５に供給する。Next, the operation of the data compression/decompression device of the first embodiment during data restoration is described. The distortionless restoration unit 14 reads the compressed data designated for restoration from the storage unit 11 and supplies the restored result, that is, the phonetic data that the phonetic conversion unit 12 created from some plaintext data, to the phonetic inverse conversion unit 15.

【００５１】上述した表音変換部１２の動作から明らか
なように、表音データには、２バイト文字、１バイト文
字、文字数識別情報、同音異義語識別情報が含まれてい
る（図５参照）。このうち、同音異義語識別情報は、数
値を２進数で表した１バイトの情報であるため、１バイ
ト文字等との弁別が不可能な情報となっている。しか
し、同音異義語識別情報の前には、１バイト文字並びに
２バイト文字の第１バイトとの弁別が可能な文字数識別
情報が存在している。また、同音異義語識別情報は、文
字数識別情報が存在する位置並びに内容によって定まる
位置に存在している。As is apparent from the operation of the phonetic conversion unit 12 described above, phonetic data contains two-byte characters, one-byte characters, character count identification information, and homonym identification information (see FIG. 5). Of these, homonym identification information is one byte holding a numerical value in binary, so by itself it cannot be distinguished from a one-byte character or the like. However, it is always preceded by character count identification information, which can be distinguished from both one-byte characters and the first byte of two-byte characters, and its position is determined by the position and content of that character count identification information.

【００５２】このため、表音逆変換部１５は、図６に示
した流れに従って、表音データから平文データを復元す
る。表音逆変換部１５は、まず、無歪復元部１４から１
バイト分のデータを取得する（ステップＳ２０１）。そ
して、表音逆変換部１５は、そのデータが、１バイト文
字であるか、２バイト文字の第１バイトであるか、文字
数識別情報であるかを判断（ステップＳ２０２）する。The phonetic inverse conversion unit 15 therefore restores plaintext data from phonetic data according to the flow shown in FIG. 6. It first obtains one byte of data from the distortionless restoration unit 14 (step S201). It then determines whether that byte is a one-byte character, the first byte of a two-byte character, or character count identification information (step S202).

【００５３】取得したデータが１バイト文字を表すもの
であった場合（ステップＳ２０２；1ハ゛イト文字）、表音
逆変換部１５は、当該データをそのまま記憶部１１に出
力する（ステップＳ２０３）。そして、出力したデータ
が“ＥＯＦ”でなかった場合（ステップＳ２１０；Ｎ）
には、ステップＳ２０１に戻り、無歪復元部１４から次
の１バイト分のデータを取得する。また、取得したデー
タが２バイト文字の第１バイトであった場合（ステップ
Ｓ２０２；2ハ゛イト文字）、表音逆変換部１５は、さら
に、１バイト分のデータを取得し、それら２バイト分の
データを記憶部１１に出力する（ステップＳ２０４）。
そして、ステップＳ２０１に戻る。If the obtained data represents a one-byte character (step S202; one-byte character), the phonetic inverse conversion unit 15 outputs it unchanged to the storage unit 11 (step S203). If the output data is not "EOF" (step S210; N), the process returns to step S201 and the next byte is obtained from the distortionless restoration unit 14. If the obtained data is the first byte of a two-byte character (step S202; two-byte character), the phonetic inverse conversion unit 15 obtains one more byte and outputs the two bytes to the storage unit 11 (step S204). The process then returns to step S201.

【００５４】取得したデータが文字数識別情報であった
場合（ステップＳ２０２；文字数識別情報）、表音逆変
換部１５は、その文字数識別情報の内容に基づき、後続
する平仮名文字列の文字数を認識する（ステップＳ２０
５）。そして、表音逆変換部１５は、無歪復元部１４か
ら、認識した文字数分のデータ（文字数×２バイトのデ
ータ；読み）と、次の１バイト分のデータ（同音異義語
識別情報）を取得する（ステップＳ２０６、Ｓ２０
７）。次いで、表音逆変換部１５は、取得した読みと同
音異義語識別情報を用いて、同音異義語辞書１８から、
漢字あるいは漢字熟語を検索する（ステップＳ２０
８）。そして、検索した漢字あるいは漢字熟語を記憶部
１１に出力して、ステップＳ２０１に戻る。If the obtained data is character count identification information (step S202; character count identification information), the phonetic inverse conversion unit 15 determines from its content the number of characters in the hiragana string that follows (step S205). It then obtains from the distortionless restoration unit 14 the data for that number of characters (number of characters × 2 bytes; the reading) and the next byte (the homonym identification information) (steps S206 and S207). Next, using the obtained reading and homonym identification information, it looks up the kanji or kanji compound in the homonym dictionary 18 (step S208). It outputs the retrieved kanji or kanji compound to the storage unit 11 and returns to step S201.

【００５５】表音逆変換部１５は、このような動作を無
歪復元部１４からの各データに対して繰り返し、“ＥＯ
Ｆ”の出力を行ったときに（ステップＳ２１０；Ｙ）、
図示した処理を終了する。The phonetic inverse conversion unit 15 repeats this operation for each piece of data from the distortionless restoration unit 14 and, once it has output "EOF" (step S210; Y), ends the illustrated processing.
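The restoration flow of FIG. 6 can be sketched as a single pass over the phonetic data. The table `READINGS` is a hypothetical stand-in for the homonym dictionary 18, `COUNT_CODES` is the same assumed marker assignment used for illustration on the compression side, and end-of-file handling is reduced to reaching the end of the byte string.

```python
# Hypothetical stand-in for dictionary 18: (reading, identifier) -> word.
READINGS = {
    ("せいひん", 0): "製品",
    ("こうちょう", 0): "好調",
}
COUNT_CODES = bytes([0x80, 0xA0, *range(0xF0, 0x100)])  # assumed markers


def from_phonetic(data: bytes) -> str:
    """Restore plaintext from phonetic data (steps S201 to S208)."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b in COUNT_CODES:
            # Marker: read n two-byte hiragana, then the identifier byte.
            n = COUNT_CODES.index(b) + 1
            reading = data[i + 1:i + 1 + 2 * n].decode("shift_jis")
            ident = data[i + 1 + 2 * n]
            out.append(READINGS[(reading, ident)])
            i += 2 * n + 2
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
            # Lead byte: copy the two-byte character unchanged.
            out.append(data[i:i + 2].decode("shift_jis"))
            i += 2
        else:
            # One-byte character: copy unchanged.
            out.append(data[i:i + 1].decode("shift_jis"))
            i += 1
    return "".join(out)
```

Because the loop always consumes whole characters or whole records, the identifier byte and the second bytes of two-byte characters are never inspected as if they were markers, which is what makes the format unambiguous.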

【００５６】＜変形形態＞第１実施形態のデータ圧縮・
復元装置は、平文データ内の、漢字あるいは漢字熟語
を、文字数識別情報と読みと同音異義語識別情報とから
なるデータで置換した形態の表音データが作成される装
置であったが、文字数識別情報の後に、同音異義語識別
情報と読みがこの順で付加される形態の表音データが作
成されるように装置を構成しても良いことは当然であ
る。<Modification> In the data compression/decompression device of the first embodiment, phonetic data is created by replacing each kanji or kanji compound in the plaintext data with data consisting of character count identification information, the reading, and homonym identification information. The device may of course instead be configured to create phonetic data in which the character count identification information is followed by the homonym identification information and then the reading, in that order.

【００５７】また、複数の文字数識別情報を用意してお
く代わりに、開始位置識別情報と終了位置識別情報とを
用意しておき、データ圧縮時に漢字の読みを出力する
際、その前後に開始位置識別情報と終了位置識別情報を
付加するように装置を構成しても良い。この際、同音異
義語識別情報の位置は、開始位置識別情報の直後あるい
は終了位置識別情報の直後としておく。また、開始位置
識別情報と終了位置識別情報との間に存在している読み
を漢字に戻すことによってデータ復元が行われるように
装置を構成しておく。Instead of providing multiple pieces of character count identification information, start position identification information and end position identification information may be provided, and the device may be configured to add them before and after the reading of a kanji when outputting it during data compression. In this case, the homonym identification information is placed immediately after either the start position identification information or the end position identification information. The device is then configured so that data is restored by converting the reading found between the start position identification information and the end position identification information back into kanji.
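A minimal sketch of this delimiter-based variant follows, assuming the identifier is placed immediately after the end marker. The marker values `START` and `END` and both function names are hypothetical choices for illustration.

```python
START, END = 0xF0, 0xF1  # hypothetical marker values


def encode_delimited(reading: str, ident: int) -> bytes:
    """START + Shift JIS reading + END + homonym identifier."""
    return bytes([START]) + reading.encode("shift_jis") + bytes([END, ident])


def decode_delimited(record: bytes):
    """Return (reading, identifier) from one delimited record."""
    if record[0] != START:
        raise ValueError("record does not begin with the start marker")
    i = 1
    # Step over whole two-byte hiragana. Second bytes of Shift JIS
    # characters can take values in the marker range, so a plain byte
    # search for END would not be safe.
    while record[i] != END:
        i += 2
    return record[1:i].decode("shift_jis"), record[i + 1]
```

Compared with count markers, this variant needs only two marker values regardless of reading length, at the cost of one extra byte per record.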

【００５８】また、第１実施形態のデータ圧縮・復元装
置は、１文字の音読み漢字が処理対象となったとき、そ
の漢字コードをそのまま出力する装置であったが、読み
を本来のものとせず、当該漢字の訓読みを読みとするこ
とによって、本来、音読み漢字である漢字１文字もが読
みに変換されるよう装置を構成することも出来る。The data compression/decompression device of the first embodiment outputs the kanji code unchanged when a single on-reading kanji is processed. The device can instead be configured so that even a single kanji that would normally take its on-reading is converted into a reading, by using the kun-reading of that kanji as the reading rather than its actual reading.

【００５９】また、第１実施形態のデータ圧縮・復元装
置は、シフトＪＩＳを対象とするものであったが、ＪＩ
Ｓ、ＥＵＣ等の他のコード体系を対象とする装置として
も良いことは当然である。さらに、本実施形態の技術
は、適用対象が日本語に限られるものではなく、複数バ
イト文字の定義が必要とされる言語であればどのような
言語にも適用可能である。The data compression/decompression device of the first embodiment targets Shift JIS, but the device may of course target other code systems such as JIS or EUC. Furthermore, the technique of this embodiment is not limited to Japanese and can be applied to any language that requires multi-byte characters to be defined.

【００６０】＜第２実施形態＞第１実施形態のデータ圧
縮・復元装置は、平文データ内に含まれる漢字熟語を、
その漢字熟語の読みを含むデータに置換した形態の表音
データに変換した後、圧縮を行う装置であった。これに
対して、第２実施形態のデータ圧縮・復元装置は、表音
データを作成する際、漢字熟語以外のデータも、他形態
のデータに置換する。また、漢字熟語を置換するデータ
として、その漢字熟語の半角読み（読みを半角片仮名文
字列で表した情報）を含むデータを用いる。<Second Embodiment> The data compression/decompression device of the first embodiment converts plaintext data into phonetic data in which each kanji compound is replaced with data containing the reading of that compound, and then compresses it. In contrast, the device of the second embodiment also replaces data other than kanji compounds with data in another form when creating phonetic data. In addition, the data that replaces a kanji compound contains its half-width reading (the reading expressed as a half-width katakana character string).

【００６１】図７に、第２実施形態のデータ圧縮・復元
装置の機能ブロック図を示す。図から明らかなように、
第２実施形態のデータ圧縮・復元装置の基本的な構成
は、第１実施形態のデータ圧縮・復元装置と同じものと
なっている。ただし、第２実施形態のデータ圧縮・復元
装置が備える表音変換部１２′は、同音異義語辞書１
７′のみを有する。同音異義語辞書１７′は、同音異義
語辞書１７とは異なり、漢字あるいは漢字熟語と、半角
読み（読みを半角片仮名文字列で表した情報）及び同音
異義語識別情報との対応関係が記憶された辞書となって
いる。また、表音逆変換部１５′にも、同音異義語辞書
１７′と同じ内容の同音異義語辞書１８′が備えられて
いる。FIG. 7 shows a functional block diagram of the data compression/decompression device of the second embodiment. As the figure shows, its basic configuration is the same as that of the first embodiment. However, the phonetic conversion unit 12' of the second embodiment has only a homonym dictionary 17'. Unlike the homonym dictionary 17, the homonym dictionary 17' stores the correspondence between a kanji or kanji compound and its half-width reading (the reading expressed as a half-width katakana character string) and homonym identification information. The phonetic inverse conversion unit 15' likewise has a homonym dictionary 18' with the same contents as the homonym dictionary 17'.

【００６２】そして、表音変換部１２′は、データ圧縮
を行う際、図８に示した手順で、平文データから表音デ
ータを作成する。表音変換部１２′は、まず、圧縮の対
象である平文データから１単位情報分のデータを取得す
る（ステップＳ３０１）。そして、そのデータが１バイ
ト文字であった場合（ステップＳ３０２；1ハ゛イト文字）
には、そのデータが表す文字が半角片仮名であるか否か
を判断する（ステップＳ３０３）。当該文字が半角片仮
名でなかった場合（ステップＳ３０３；Ｎ）、すなわ
ち、制御文字、ローマ字であった場合、表音変換部１
２′は、そのデータをそのまま無歪圧縮部１３に出力す
る（ステップＳ３０９）。そして、そのデータが“ＥＯ
Ｆ”でなかったとき（ステップＳ３１０；Ｎ）には、ス
テップＳ３０１に戻り、平文データを構成する次のデー
タに対する処理を開始する。When compressing data, the phonetic conversion unit 12' creates phonetic data from plaintext data by the procedure shown in FIG. 8. It first obtains one piece of unit information from the plaintext data to be compressed (step S301). If that data is a one-byte character (step S302; one-byte character), it determines whether the character is a half-width katakana (step S303). If the character is not a half-width katakana (step S303; N), that is, if it is a control character or a Roman letter, the phonetic conversion unit 12' outputs the data unchanged to the distortionless compression unit 13 (step S309). If the data is not "EOF" (step S310; N), the process returns to step S301 and processing of the next data in the plaintext data begins.

【００６３】取得した１バイト文字が半角片仮名であっ
た場合（ステップＳ３０３；Ｙ）、表音変換部１２′
は、次の単位情報（１バイトあるいは２バイト文字）を
取得する（ステップＳ３０４）。そして、取得した単位
情報の表す文字が半角片仮名であった場合（ステップＳ
３０５；Ｙ）には、ステップＳ３０４に戻る。一方、当
該文字が半角片仮名でなかった場合（ステップＳ３０
５；Ｎ）には、ステップＳ３０６に進む。すなわち、表
音変換部１２′は、ステップＳ３０４，Ｓ３０５のルー
プを実行することによって、平文データから、半角片仮
名が連なっている文字列と次の１文字分の単位情報を取
得する。If the obtained one-byte character is a half-width katakana (step S303; Y), the phonetic conversion unit 12' obtains the next unit information (a one-byte or two-byte character) (step S304). If the character represented by that unit information is a half-width katakana (step S305; Y), the process returns to step S304; otherwise (step S305; N) it proceeds to step S306. In other words, by executing the loop of steps S304 and S305, the phonetic conversion unit 12' obtains from the plaintext data a run of consecutive half-width katakana and the unit information for the one character that follows it.

【００６４】その後、表音変換部１２′は、半角片仮名
開始位置識別情報と、半角片仮名文字列を、終了位置識
別情報とを、この順で出力する（ステップＳ３０６）。
なお、半角片仮名開始位置識別情報、終了位置識別情報
は、１バイト文字、２バイト文字、文字数識別情報との
弁別が可能なようにその内容が設定された１バイトの情
報であり、データ復元時（詳細は後述）、これらの位置
識別情報間に存在するデータが、半角片仮名として出力
すべき（そのまま出力すべき）データと判断される。The phonetic conversion unit 12' then outputs half-width katakana start position identification information, the half-width katakana character string, and end position identification information, in this order (step S306). The half-width katakana start position identification information and the end position identification information are each one byte whose value is chosen so that it can be distinguished from one-byte characters, two-byte characters, and character count identification information. During data restoration (described in detail later), the data between these two pieces of position identification information is judged to be data that should be output as half-width katakana, that is, output unchanged.

【００６５】次いで、表音変換部１２′は、最後に実行
したステップＳ３０４で取得した単位情報が表す文字が
１バイト文字であるか２バイト文字であるかを判断する
（ステップＳ３０８）。そして、当該文字が１バイト文
字であった場合（ステップＳ３０８；1ハ゛イト文字）に
は、ステップＳ３０９に進み、その半角片仮名文字では
ない１バイト文字を、無歪圧縮部１３に出力する。Next, the phonetic conversion unit 12' determines whether the character represented by the unit information obtained in the most recent execution of step S304 is a one-byte character or a two-byte character (step S308). If it is a one-byte character (step S308; one-byte character), the process proceeds to step S309, and that one-byte character, which is not a half-width katakana, is output to the distortionless compression unit 13.

【００６６】一方、最後に取得した単位情報が２バイト
文字であった場合（ステップＳ３０８；2ハ゛イト文字）
と、ステップＳ３０１で取得した単位情報の表す文字が
２バイト文字であった場合（ステップＳ３０２；２バイ
ト文字）、表音変換部１２′は、図９に示してあるよう
に、その２バイト文字が、平仮名であるか、漢字、片仮
名あるいは記号（罫線素片等）であるかを判断する（ス
テップＳ３２０）。より具体的には、対応する半角片仮
名文字が存在する文字（平仮名、句読点等）であるか否
かを判断する。そして、対応する半角片仮名文字が存在
する文字であった場合（ステップＳ３２０；平仮名）、
その２バイト文字（処理対象文字）に対応する半角片仮
名文字コードを出力（ステップＳ３２１）し、図８のス
テップＳ３０１に戻る。On the other hand, if the unit information obtained last is a two-byte character (step S308; two-byte character), or if the unit information obtained in step S301 represents a two-byte character (step S302; two-byte character), the phonetic conversion unit 12' determines, as shown in FIG. 9, whether that two-byte character is a hiragana or is a kanji, a katakana, or a symbol (a ruled-line element or the like) (step S320). More specifically, it determines whether the character is one for which a corresponding half-width katakana character exists (hiragana, punctuation marks, and so on). If such a half-width katakana character exists (step S320; hiragana), the unit outputs the half-width katakana character code corresponding to the two-byte character (the character being processed) (step S321) and returns to step S301 in FIG. 8.
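Step S321 can be sketched in Python as follows. The mapping table is derived by inverting Unicode NFKC normalization, which is a convenience of this sketch and not something the description prescribes; the names `TABLE`, `PUNCT`, and `to_halfwidth` are assumptions. Only hiragana and four punctuation marks are converted here, since katakana runs are handled separately with position markers.

```python
import unicodedata


def _build_table() -> dict:
    """Map full-width katakana and punctuation to half-width forms."""
    table = {}
    for code in range(0xFF61, 0xFFA0):          # plain half-width forms
        half = chr(code)
        table[unicodedata.normalize("NFKC", half)] = half
    for code in range(0xFF66, 0xFF9E):          # base + (semi-)voiced mark
        for mark in "\uff9e\uff9f":
            half = chr(code) + mark
            full = unicodedata.normalize("NFKC", half)
            if len(full) == 1:                  # keep only real compositions
                table[full] = half
    return table


TABLE = _build_table()
PUNCT = "。「」、"


def to_halfwidth(text: str) -> str:
    """Replace hiragana and basic punctuation by half-width katakana."""
    out = []
    for ch in text:
        if "\u3041" <= ch <= "\u3096":          # hiragana block
            kata = chr(ord(ch) + 0x60)          # same kana as katakana
            out.append(TABLE.get(kata, ch))
        elif ch in PUNCT:
            out.append(TABLE[ch])
        else:
            out.append(ch)                      # kanji etc. left untouched
    return "".join(out)
```

Each half-width katakana occupies one byte in Shift JIS, so an unvoiced hiragana shrinks from two bytes to one, while a voiced one such as "で" becomes two one-byte codes (base character plus voicing mark).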

【００６７】一方、対応する半角片仮名文字が存在しな
い文字であった場合（ステップＳ３２０；片仮名、漢字
等）、表音変換部１２′は、処理対象文字と異なる種類
の文字が現れるまで、平文データからの単位情報の取得
を繰り返す（ステップＳ３２２、Ｓ３２３）。すなわ
ち、処理対象文字が片仮名であった場合には、片仮名以
外の文字（漢字、平仮名、１バイト文字）が現れるまで
単位情報の取得を繰り返し、処理対象文字が漢字であっ
た場合には、漢字以外の文字が現れるまで単位情報の取
得を繰り返す。また、処理対象文字が記号に分類される
ものであった場合には、漢字、片仮名等が現れるまで単
位情報の取得を繰り返す。On the other hand, if no corresponding half-width katakana character exists (step S320; katakana, kanji, etc.), the phonetic conversion unit 12' repeatedly obtains unit information from the plaintext data until a character of a different type from the character being processed appears (steps S322 and S323). That is, if the character being processed is a katakana, it keeps obtaining unit information until a non-katakana character (kanji, hiragana, or one-byte character) appears; if it is a kanji, until a non-kanji character appears; and if it is classified as a symbol, until a kanji, katakana, or the like appears.

【００６８】そして、異なる種類の文字を表す単位情報
を取得できた際（ステップＳ３２３；Ｎ）に、表音変換
部１２′は、収集した文字の種類が漢字であった場合
（ステップＳ３２４；Ｎ）、取得した漢字列（あるいは
漢字）に対応づけられている半角読み並びに同音異義語
識別情報を、同音異義語辞書１７′から検索する（ステ
ップＳ３２５）。次いで、表音変換部１２′は、半角読
みの文字数（バイト数）に対応する文字数識別情報と、
半角読みと、検索した同音異義語識別情報とを、この順
で、無歪圧縮部１３に出力する（ステップＳ３２６）。
そして、図６のステップＳ３０２に戻って、最後に取得
した単位情報に対する処理を開始する。When unit information representing a character of a different type has been obtained (step S323; N) and the collected characters are kanji (step S324; N), the phonetic conversion unit 12' looks up in the homonym dictionary 17' the half-width reading and homonym identification information associated with the obtained kanji string (or kanji) (step S325). It then outputs, in this order, character count identification information corresponding to the number of characters (number of bytes) in the half-width reading, the half-width reading, and the retrieved homonym identification information to the distortionless compression unit 13 (step S326). It then returns to step S302 in FIG. 8 and begins processing the unit information obtained last.

【００６９】また、文字の種類が片仮名であった場合
（ステップＳ３２４；Ｙ）、取得した片仮名文字列情報
に対応する半角片仮名文字列情報を作成する（ステップ
Ｓ３２７）。そして、全角片仮名開始位置識別情報と半
角片仮名文字列情報と終了位置識別情報を、この順で、
無歪圧縮部１３に対して出力（ステップＳ３２８）し、
ステップＳ３０２に戻って、最後に取得した文字に対す
る処理を開始する。なお、全角片仮名開始位置識別情報
は、１バイト文字、２バイト文字、文字数識別情報、半
角片仮名開始位置識別情報、終了位置識別情報との弁別
が可能なようにその内容が設定された１バイトの情報で
ある。データ復元時には、全角片仮名開始位置識別情報
と終了位置識別情報間に存在するデータが、全角片仮名
に変換して出力すべきデータと判断される。If the characters are katakana (step S324; Y), the unit creates half-width katakana character string information corresponding to the obtained katakana character string information (step S327). It then outputs full-width katakana start position identification information, the half-width katakana character string information, and end position identification information, in this order, to the distortionless compression unit 13 (step S328), returns to step S302, and begins processing the character obtained last. The full-width katakana start position identification information is one byte whose value is chosen so that it can be distinguished from one-byte characters, two-byte characters, character count identification information, half-width katakana start position identification information, and end position identification information. During data restoration, the data between the full-width katakana start position identification information and the end position identification information is judged to be data that should be converted to full-width katakana before output.

【００７０】If the character type is a symbol (step S324; symbol), the phonetic conversion unit 12′ outputs the unit information (string) for the acquired symbol (string) unchanged to the distortionless compression unit 13, returns to step S302 in FIG. 8, and starts processing the most recently acquired unit information.

【００７１】The phonetic conversion unit 12′ repeats this processing and, once it has output "EOF" (step S310; Y), finishes processing the plaintext data to be compressed.

【００７２】As described above, the phonetic conversion unit 12′ replaces hiragana in the plaintext data with half-width katakana, and replaces each kanji compound with information consisting of character count identification information, the reading expressed as a half-width katakana string, and homonym identification information. It replaces each full-width katakana string with information consisting of full-width katakana start position identification information, the half-width katakana string equivalent to that full-width katakana string, and end position identification information. It likewise replaces each half-width katakana string with information consisting of half-width katakana start position identification information, that half-width katakana string itself, and end position identification information.

【００７３】Consequently, a sentence that contains no symbols, for example 「テ゛ータをメモリに保存すると、…」 ("When the data is saved to memory, ..."), is converted by the phonetic conversion unit 12′, as shown schematically in FIG. 10, into phonetic data containing only 1-byte characters and several kinds of identification information, each of which is one byte.
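The substitutions performed by the phonetic conversion unit 12′ can be illustrated with a minimal Python sketch. It is not the patent's implementation: it assumes Shift_JIS as the code system, and the marker byte values, the `COUNT_BASE` scheme, the function names, and the one-entry dictionary are all hypothetical choices made only for illustration.

```python
import unicodedata
from itertools import groupby

# Hypothetical one-byte identification values (not taken from the patent).
FULL_KATA_START, HALF_KATA_START, END = b"\x01", b"\x02", b"\x03"
COUNT_BASE = 0x10  # COUNT_BASE + n marks a reading of n bytes (1 <= n <= 15)

# Map each full-width form to its half-width counterpart (U+FF61..U+FF9F).
HALF = {unicodedata.normalize("NFKC", chr(c)): chr(c) for c in range(0xFF61, 0xFFA0)}


def to_half_kata(s: str) -> str:
    """Convert hiragana or full-width katakana to half-width katakana."""
    kata = "".join(chr(ord(c) + 0x60) if "ぁ" <= c <= "ゖ" else c for c in s)
    # NFD splits voiced kana into base + mark, matching the half-width layout.
    return "".join(HALF.get(c, c) for c in unicodedata.normalize("NFD", kata))


def kind(c: str) -> str:
    """Classify one character by script."""
    if "ぁ" <= c <= "ゖ":
        return "hira"
    if "ァ" <= c <= "ヺ" or c == "ー":
        return "full_kata"
    if "\uff66" <= c <= "\uff9f":
        return "half_kata"
    if "\u4e00" <= c <= "\u9fff":
        return "kanji"
    return "other"


def phonetic_convert(text: str, dictionary: dict) -> bytes:
    """dictionary maps a kanji string to (hiragana reading, homonym id)."""
    out = bytearray()
    for k, group in groupby(text, key=kind):
        run = "".join(group)
        if k == "hira":
            out += to_half_kata(run).encode("shift_jis")
        elif k == "full_kata":
            out += FULL_KATA_START + to_half_kata(run).encode("shift_jis") + END
        elif k == "half_kata":
            out += HALF_KATA_START + run.encode("shift_jis") + END
        elif k == "kanji":
            reading, homonym_id = dictionary[run]
            r = to_half_kata(reading).encode("shift_jis")
            assert 1 <= len(r) <= 15, "reading too long for this toy scheme"
            out += bytes([COUNT_BASE + len(r)]) + r + bytes([homonym_id])
        else:
            out += run.encode("shift_jis")  # symbols pass through unchanged
    return bytes(out)


converted = phonetic_convert("ﾃﾞｰﾀをメモリに保存すると", {"保存": ("ほぞん", 0)})
```

Every element of `converted` is a single byte: a half-width katakana code, a marker, a character count value, or a homonym id.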

【００７４】Because such phonetic data is then compressed by the distortionless compression unit 13, this data compression/decompression device compresses more efficiently than when plaintext data is compressed directly. Specifically, as shown schematically in FIG. 11, in a conventional device that compresses plaintext data directly, each character occupies two bytes, so when the first byte of a 2-byte character (“と” in the figure) is the character of interest, only the one character preceding that 2-byte character is treated as context. In this device, by contrast, each character occupies one byte, so when a character (“ト”) is the character of interest, the two characters preceding it are treated as context. In other words, in this device a compression procedure that takes a context of a given order into account effectively functions as a compression procedure that takes a context of twice that order into account. The data compression/decompression device of the second embodiment can therefore achieve a high compression ratio.
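The doubling of effective context can be checked with a small sketch. It assumes Shift_JIS, in which hiragana and full-width katakana take two bytes and half-width katakana take one; the function name `chars_covered` is hypothetical.

```python
def chars_covered(preceding: str, order: int, encoding: str = "shift_jis") -> int:
    """Count how many trailing characters of `preceding` fit in `order` bytes."""
    used = 0
    count = 0
    for ch in reversed(preceding):
        size = len(ch.encode(encoding))
        if used + size > order:
            break
        used += size
        count += 1
    return count


# A 2-byte context window holds one 2-byte character ...
plain_context = chars_covered("データを", order=2)
# ... but two 1-byte half-width katakana characters.
phonetic_context = chars_covered("ﾃﾞｰﾀｦ", order=2)
```

With the same byte-oriented context order, the phonetic form exposes twice as many characters to the compressor's model.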

【００７５】The operation of the data compression/decompression device of the second embodiment during data restoration is now described briefly. As shown in FIG. 12, during data restoration the phonetic inverse conversion unit 15′ first acquires one byte of data output by the distortionless restoration unit 14 (step S401). It then determines whether that data is character count identification information, full-width katakana start position identification information, half-width katakana start position identification information, data representing a half-width katakana character, or data representing a kanji (symbol) (step S402).

【００７６】If the data is character count identification information (step S402; character count identification information), the phonetic inverse conversion unit 15′ determines the number of characters (number of bytes) from that information (step S403) and acquires that many bytes of data (the half-width reading) from the distortionless restoration unit 14 (step S404). It then acquires the next byte of data, namely the homonym identification information (step S405). Next, using the acquired half-width reading and homonym identification information, it looks up the information representing the kanji or kanji string in the homonym dictionary 17′ (step S406) and outputs the retrieved kanji (string) information to the storage unit 11 (step S407). It then returns to step S401 and starts processing the remaining data.

【００７７】If, in step S402, the data acquired in step S401 is found to be full-width katakana start position identification information or half-width katakana start position identification information, the phonetic inverse conversion unit 15′ repeatedly acquires data from the distortionless restoration unit 14 until end position identification information is acquired (step S410). When the end position identification information has been acquired, it ends step S410. If the current processing was triggered by detection of full-width katakana start position identification information (step S411; Y), it converts the data for the half-width katakana character (string) acquired in step S410 (excluding the end position identification information) into data for the corresponding full-width katakana character (string) and outputs it to the storage unit 11 (step S413), then returns to step S401. If, on the other hand, the current processing was triggered by detection of half-width katakana start position identification information (step S411; N), it outputs the acquired data for the half-width katakana character (string) (excluding the end position identification information) unchanged to the storage unit 11 (step S412) and returns to step S401.

【００７８】If the data acquired in step S401 is a 1-byte character or the first byte of a 2-byte character (step S402; other), the phonetic inverse conversion unit 15′ performs the following processing in step S415. For half-width katakana data, it outputs the corresponding hiragana code. For the first byte of a 2-byte character, it acquires one more byte of data from the distortionless restoration unit 14 and outputs those two bytes. For any other 1-byte character, it outputs the acquired data unchanged to the storage unit 11.
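The restoration dispatch of steps S401 to S415 can be sketched as a single loop over the restored byte stream. This is an illustrative sketch only: Shift_JIS, the marker byte values, the `COUNT_BASE` scheme, and the dictionary layout are assumptions, and "EOF" handling is replaced by simply reaching the end of the input.

```python
import unicodedata

# Hypothetical one-byte identification values (not taken from the patent).
FULL_KATA_START, HALF_KATA_START, END = 0x01, 0x02, 0x03
COUNT_BASE = 0x10  # COUNT_BASE + n marks a reading of n bytes (1 <= n <= 15)


def kata_to_hira(s: str) -> str:
    """Shift full-width katakana down to the hiragana block."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in s)


def is_lead_byte(b: int) -> bool:
    """First byte of a 2-byte Shift_JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def restore(data: bytes, dictionary: dict) -> str:
    """dictionary maps (reading bytes, homonym id) to a kanji string."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        i += 1
        if COUNT_BASE < b <= COUNT_BASE + 15:            # S403 to S407
            n = b - COUNT_BASE
            reading = data[i:i + n]
            homonym_id = data[i + n]
            i += n + 1
            out.append(dictionary[(reading, homonym_id)])
        elif b in (FULL_KATA_START, HALF_KATA_START):     # S410 to S413
            end = data.index(END, i)
            kana = data[i:end].decode("shift_jis")
            i = end + 1
            if b == FULL_KATA_START:
                kana = unicodedata.normalize("NFKC", kana)  # to full width
            out.append(kana)
        elif is_lead_byte(b):                             # 2-byte character
            out.append(data[i - 1:i + 1].decode("shift_jis"))
            i += 1
        elif 0xA1 <= b <= 0xDF:                           # bare half-width kana
            j = i
            while j < len(data) and 0xA1 <= data[j] <= 0xDF:
                j += 1
            kana = data[i - 1:j].decode("shift_jis")
            out.append(kata_to_hira(unicodedata.normalize("NFKC", kana)))
            i = j
        else:                                             # other 1-byte character
            out.append(chr(b))
    return "".join(out)


sj = lambda s: s.encode("shift_jis")
stream = (bytes([HALF_KATA_START]) + sj("ﾃﾞｰﾀ") + bytes([END]) + sj("ｦ")
          + bytes([FULL_KATA_START]) + sj("ﾒﾓﾘ") + bytes([END]) + sj("ﾆ")
          + bytes([COUNT_BASE + 4]) + sj("ﾎｿﾞﾝ") + bytes([0]) + sj("ｽﾙﾄ"))
restored = restore(stream, {(sj("ﾎｿﾞﾝ"), 0): "保存"})
```

Consecutive bare half-width katakana bytes are decoded as a run so that a base character and its voiced sound mark are combined before conversion to hiragana.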

【００７９】The phonetic inverse conversion unit 15′ repeats this processing and, once it has output "EOF" (step S417; Y), ends the illustrated processing.
<Modifications> The data compression/decompression device of the second embodiment can be modified in the same ways as the device of the first embodiment. That is, instead of providing multiple kinds of character count identification information, start position identification information and end position identification information may be provided, and the device may be configured to add them before and after the half-width reading of a kanji when that reading is output during data compression. The device may also be adapted to documents in other code systems.

【００８０】The data compression/decompression device of the second embodiment converts 2-byte characters to half-width katakana characters, but the device can also be configured to convert 2-byte characters to Roman letters. That is, the device may be configured to output data representing "SEIZOU" instead of data representing “セイソ゛ウ”.
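As a rough illustration of the Roman-letter variant, the sketch below maps a half-width katakana reading to Roman letters. The `ROMAJI` table is a hypothetical four-entry subset covering only this example; a real device would need a complete table and a defined romanization scheme, neither of which the source specifies.

```python
import unicodedata

# Illustrative subset only; not a complete romanization table.
ROMAJI = {"セ": "SE", "イ": "I", "ゾ": "ZO", "ウ": "U"}


def to_romaji(half_kana: str) -> str:
    """Convert a half-width katakana reading to Roman letters."""
    # NFKC widens the characters and merges a base kana with its voiced mark.
    full = unicodedata.normalize("NFKC", half_kana)
    return "".join(ROMAJI[c] for c in full)


romanized = to_romaji("ｾｲｿﾞｳ")
```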

【００８１】When this technique is applied to a language consisting only of 2-byte ideographs, it suffices to assign 1-byte codes to, for example, International Phonetic Alphabet symbols, information representing the vowels and consonants of Hangul, or information representing Zhuyin (Bopomofo) symbols, and to convert each ideograph's data into a code that expresses the sound of that ideograph in such symbols.

【００８２】<Third Embodiment> FIG. 13 shows a functional block diagram of the data compression/decompression device of the third embodiment. An overview of the operation of this device is first given with reference to this figure.

【００８３】As illustrated, the data compression/decompression device of the third embodiment includes a storage unit 21 used to store plaintext data and compressed data. As blocks that operate during data compression, it includes a phonetic conversion unit 22, an intermediate code conversion unit 23, a temporary storage unit 24, and a distortionless compression unit 25. As blocks that operate during data restoration, it includes a distortionless restoration unit 26, an intermediate code inverse conversion unit 27, and a phonetic inverse conversion unit 28.

【００８４】The Japanese analysis dictionary 16 and the homonym dictionary 17 held by the phonetic conversion unit 22 are the same as the dictionaries held by the phonetic conversion unit 12 of the first embodiment. Using these dictionaries, the phonetic conversion unit 22 generates phonetic data by converting the kanji in the plaintext data into their readings, following almost the same procedure as the phonetic conversion unit 12 of the first embodiment. While generating the phonetic data, however, the phonetic conversion unit 22 also creates a character list, which is a list of the characters and identification information used in the phonetic data.

【００８５】Using the phonetic data and the character list that the phonetic conversion unit 22 generates in the temporary storage unit 24, the intermediate code conversion unit 23 creates an intermediate code correspondence table for assigning new codes (intermediate codes) to the characters and identification information used in the phonetic data. Using that intermediate code correspondence table, it then converts the phonetic data into intermediate code data, that is, data containing intermediate codes. The intermediate code conversion unit 23 provides two operation modes: a mode in which the number of bits of the intermediate codes is not specified (bit length non-designation mode) and a mode in which it is specified (bit length designation mode). When the unit is operated in the bit length designation mode, information specifying the bit length of the intermediate codes is given to the intermediate code conversion unit 23 before data compression begins.

【００８６】The distortionless compression unit 25 losslessly compresses the intermediate code correspondence table and the intermediate code data supplied by the intermediate code conversion unit 23, and stores the compression results in the storage unit 21 as compressed data.

【００８７】The distortionless restoration unit 26 has the function of restoring data compressed by the distortionless compression unit 25; from the compressed data it is instructed to restore, it restores and outputs the intermediate code correspondence table and the intermediate code data. Using the intermediate code correspondence table supplied by the distortionless restoration unit 26, the intermediate code inverse conversion unit 27 turns the intermediate codes contained in the intermediate code data supplied afterwards back into the original information (phonetic data). The phonetic inverse conversion unit 28 uses the homonym dictionary 18, a dictionary with the same contents as the homonym dictionary 17, to turn the readings contained in the phonetic data back into kanji, thereby generating in the storage unit 21 the plaintext data from which the compressed data processed by the distortionless restoration unit 26 originated.

【００８８】The operation of each unit is now described in detail, beginning with the phonetic conversion unit 22, with reference to FIG. 14. As illustrated, when the start of data compression is instructed, the phonetic conversion unit 22 first sets the variable K to "0" (step S500). The use of K is described later; it is the variable that holds the maximum value of the homonym identification information output by the phonetic conversion unit 22.

【００８９】Next, the phonetic conversion unit 22 acquires one character of data from the plaintext data designated for compression (step S501). It then determines the type of character that the data represents. If the character is not a kanji (step S502; non-kanji), it outputs the acquired data to the temporary storage unit 24 and, if that data is not yet registered in the character list, registers it there (step S503). If the output data was not "EOF" (step S511; N), it returns to step S501 and starts processing the next character of data in the plaintext data.

【００９０】If the data read in step S501 represents a kanji (step S502; kanji), the phonetic conversion unit 22 repeatedly acquires data from the plaintext data until non-kanji data is acquired (steps S504, S505). When non-kanji data is acquired (step S505; non-kanji), the phonetic conversion unit 22 proceeds to step S506.

【００９１】In step S506, the phonetic conversion unit 22 determines whether the data acquired before that non-kanji data is a single kanji read with its on-reading (on'yomi). If so (step S506; Y), it outputs start position identification information, the data for that kanji, and end position identification information to the temporary storage unit 24 (step S507). The start position identification information and the end position identification information are each one byte of information whose value is chosen so that it can be distinguished from 1-byte characters and from the first byte of 2-byte characters; in this embodiment, the same value is used for both.

【００９２】After step S507, the phonetic conversion unit 22 outputs the most recently acquired data (the non-kanji data) to the temporary storage unit 24 (step S508). Next, if the start position identification information output in step S507 is not yet registered in the character list, it registers that data (the start position identification information) in the character list. Likewise, if the data output in step S508 is not yet registered, it registers that data in the character list (step S509). In other words, the first time step S509 is executed, the phonetic conversion unit 22 registers the start position identification information in the character list and, if the data that followed the end position identification information is unregistered, registers that data as well. From the second execution of step S509 onward, the start position identification information is already registered, so data is added to the character list only if the data that followed the end position identification information is unregistered.

【００９３】If the data output in step S508 was not "EOF" (step S511; N), the phonetic conversion unit 22 returns to step S501. If, on the other hand, the data acquired before the non-kanji data is not a single on-reading kanji (step S506; N), the phonetic conversion unit 22 executes the phonetic conversion process (step S510).

【００９４】FIG. 15 shows the flow of the phonetic conversion process. As illustrated, in this process the phonetic conversion unit 22 first converts the kanji string or kanji represented by the data acquired before the non-kanji data into a kana string (its reading) (step S601). It then acquires from the homonym dictionary 17 the homonym identification information associated with that reading and kanji string (kanji) (step S602). It outputs character count identification information indicating the number of characters in the reading, the reading, and the homonym identification information to the temporary storage unit 24, in this order (step S603), and then outputs the most recently acquired data (the non-kanji data) to the temporary storage unit 24 (step S604).

【００９５】Next, if any of the data output in steps S603 and S604, other than the homonym identification information, is not yet registered, the phonetic conversion unit 22 registers that data in the character list (step S605). If the value of the homonym identification information exceeds K, it sets K to that value (step S606) and ends the phonetic conversion process (proceeding to step S511 in FIG. 14).

【００９６】When it detects that "EOF" has been output (step S511; Y), the phonetic conversion unit 22 ends the processing shown in FIG. 14. In this way, the phonetic conversion unit 22 creates, in the temporary storage unit 24, phonetic data in almost the same form as the phonetic data output by the phonetic conversion unit 12 of the first embodiment. It also creates a character list, which lists the character count identification information, start position identification information, 1-byte characters, and 2-byte characters contained in that phonetic data, and it finishes the illustrated processing with the maximum value of the homonym identification information contained in the phonetic data stored in the variable K.
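The flow of FIGS. 14 and 15 can be approximated at the token level with the following sketch. It is a simplification under stated assumptions: tokens are Python values rather than bytes, the marker `START_END`, the `("COUNT", n)` and `("HOMONYM", id)` tuples, the `readings` dictionary, and the `on_singles` set (standing in for the Japanese analysis dictionary's judgment of single on-reading kanji) are all hypothetical, and "EOF" handling is omitted.

```python
START_END = "<SE>"  # one shared marker for start and end, as in the embodiment


def is_kanji(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"


def phonetic_convert(text, readings, on_singles):
    """Return (phonetic tokens, character list, K)."""
    out, char_list, k = [], [], 0

    def register(token):
        if token not in char_list:
            char_list.append(token)

    i = 0
    while i < len(text):
        if not is_kanji(text[i]):                 # S503: pass through, register
            out.append(text[i])
            register(text[i])
            i += 1
            continue
        j = i
        while j < len(text) and is_kanji(text[j]):  # S504, S505: collect kanji
            j += 1
        run = text[i:j]
        if len(run) == 1 and run in on_singles:   # S506, S507: wrap, keep kanji
            out += [START_END, run, START_END]
            register(START_END)                   # the kanji itself is not listed
        else:                                     # S601 to S606
            reading, homonym_id = readings[run]
            count_id = ("COUNT", len(reading))
            out += [count_id, *reading, ("HOMONYM", homonym_id)]
            for token in (count_id, *reading):    # homonym ids are not listed
                register(token)
            k = max(k, homonym_id)
        i = j
    return out, char_list, k


tokens, char_list, K = phonetic_convert(
    "その間、製品", readings={"製品": ("せいひん", 2)}, on_singles={"間"}
)
```

Here `char_list` ends up with nine entries and `K` holds the largest homonym id seen, which are the two inputs the intermediate code conversion unit needs.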

【００９７】The intermediate code conversion unit 23 starts operating when the phonetic conversion unit 22 has completed its processing. As already described, the intermediate code conversion unit 23 has two operation modes. Its operation in the bit length non-designation mode is described first, with reference to FIGS. 16 and 17.

【００９８】When the phonetic conversion unit 22 has finished generating the phonetic data, the intermediate code conversion unit 23 first obtains, as shown in FIG. 16, the minimum number of bits M needed to express the value of K in binary and the number n of distinct characters stored in the character list (step S701). It then calculates the value L that satisfies 2^(L-1) < n ≤ 2^L (step S702). If the calculated L is less than M (step S703; Y), it sets L to the value of M (step S704) and proceeds to step S705. If L < M does not hold (step S703; N), it proceeds to step S705 without updating L.
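Steps S701 to S704 amount to a short calculation, sketched below. The function name is hypothetical, and treating K = 0 as needing one bit is an assumption; the source does not say how M is defined in that case.

```python
def intermediate_code_bits(k: int, n: int) -> int:
    """Return the intermediate code width L for max homonym id k and n list entries."""
    m = max(1, k.bit_length())   # bits needed to write K in binary (assumed >= 1)
    l = (n - 1).bit_length()     # smallest L with 2**(L-1) < n <= 2**L, for n >= 1
    return max(l, m)             # S703, S704: raise L to M when L < M


width_small_k = intermediate_code_bits(k=2, n=9)    # L from n dominates
width_large_k = intermediate_code_bits(k=200, n=9)  # M from K dominates
```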

【００９９】In step S705, the intermediate code conversion unit 23 creates the intermediate code correspondence table by assigning an L-digit binary number, that is, an L-bit intermediate code, to each of the n kinds of information in the character list (character count identification information, 2-byte characters, and so on). It then supplies data representing the contents of the created intermediate code correspondence table to the distortionless compression unit 25 (step S706).

【０１００】Next, the intermediate code conversion unit 23 acquires one unit of information from the phonetic data stored in the temporary storage unit 24 (step S711). That is, in this step it acquires one byte of data and, if that byte is the first byte of a 2-byte character, acquires one more byte. If the acquired data is a 1-byte character, character count identification information, or start position identification information, it ends step S711 without acquiring any further data. If the data acquired in step S711 is start position identification information (step S712; start position identification information), the intermediate code conversion unit 23 supplies the intermediate code stored in the intermediate code correspondence table for the start position identification information (hereinafter, the intermediate code for start position identification information) to the distortionless compression unit 25 (step S713). It then supplies the three bytes of data that follow the start position identification information in the phonetic data, namely the data representing the on-reading kanji (2 bytes) and the end position identification information (1 byte), unchanged to the distortionless compression unit 25 (step S714). The intermediate code conversion unit 23 then returns to step S711 and starts processing the next data in the phonetic data.

【０１０１】If the acquired data is character count identification information (step S712; character count identification information), the intermediate code conversion unit 23 determines the number of characters in the reading from that information and supplies the intermediate code stored in the intermediate code correspondence table for that character count identification information to the distortionless compression unit 25 (step S715). It then acquires data for that number of characters and supplies the intermediate code corresponding to each item of data to the distortionless compression unit 25 (step S716). Next, it acquires the following byte of the phonetic data, namely the homonym identification information, converts it into L-bit information, and supplies the result to the distortionless compression unit 25 (step S717). That is, if L > 8, the intermediate code conversion unit 23 supplies L-bit information formed by adding "L−8" zeros to the high-order side of the homonym identification information. If L < 8, it supplies L-bit information formed by removing "8−L" zeros from the high-order side of the homonym identification information. If L = 8, it naturally supplies the homonym identification information unchanged. The intermediate code conversion unit 23 then returns to step S711.
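The width adjustment of step S717 can be sketched with bit strings. The function name is hypothetical, and the result is represented as a string of "0" and "1" characters purely for readability.

```python
def homonym_to_l_bits(homonym_id: int, l: int) -> str:
    """Re-express a 1-byte homonym id as an L-bit string."""
    bits = format(homonym_id, "08b")          # the original 8-bit form
    if l >= 8:
        return bits.zfill(l)                  # add L-8 leading zeros
    # Stripping is lossless because L >= M guarantees the id fits in L bits.
    assert homonym_id < (1 << l), "id does not fit in L bits"
    return bits[8 - l:]                       # drop 8-L leading zeros


widened = homonym_to_l_bits(5, 10)
narrowed = homonym_to_l_bits(5, 4)
unchanged = homonym_to_l_bits(5, 8)
```

The earlier rule that L is raised to at least M is what makes the L < 8 case safe: every homonym id in the data is at most K, and K fits in M ≤ L bits.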

【０１０２】If the acquired data is a 1-byte character or a 2-byte character (step S712; other), the intermediate code conversion unit 23 supplies the intermediate code corresponding to that data to the distortionless compression unit 25 (step S718). If the supplied data was not "EOF" (step S719; N), it returns to step S711 and starts processing the next data; if it was "EOF" (step S719; Y), it ends the processing.

【０１０３】Thus, in the bit length non-designation mode, the intermediate code conversion unit 23 obtains the minimum number of bits L that can represent the 1-byte characters, the 2-byte characters other than single on-reading kanji, the character count identification information, and the start position identification information contained in the phonetic data. If the L so obtained is less than the number of bits M required for the homonym identification information, it updates L to the value of M. It then converts each item of information in the phonetic data, other than single on-reading kanji and homonym identification information, into an L-bit intermediate code, and converts the representation of the homonym identification information to L bits. Single on-reading kanji are not converted and become elements of the intermediate code data as they are.

[0104] For example, if the kanji compound "製品" (product) exists in the plaintext data, the phonetic data contains, as schematically shown in FIG. 18, 10 bytes of information consisting of the number-of-characters identification information, the reading "せいひん" (seihin), and the homonym identification information. As shown in the same figure, the intermediate code conversion unit 23 converts this information into information consisting of six L-bit data items (five intermediate codes and the homonym identification information in L-bit form).
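
The size change in this example can be checked with a little arithmetic. The sketch below assumes 1 byte for the number-of-characters identification information, 2 bytes per hiragana of the reading, and 1 byte for the homonym identification information, which matches the 10-byte total stated above; the function name `idiom_sizes` and the value L = 7 are hypothetical.

```python
def idiom_sizes(reading: str, code_width: int) -> tuple[int, int]:
    """Return (bits in the phonetic data, bits in the intermediate code data) for one idiom."""
    phonetic_bytes = 1 + 2 * len(reading) + 1      # count byte + 2-byte kana + homonym byte
    l_bit_items = 1 + len(reading) + 1             # count code + kana codes + homonym ID
    return phonetic_bytes * 8, l_bit_items * code_width


before_bits, after_bits = idiom_sizes("せいひん", 7)  # L = 7 is an assumed width
```

Under these assumptions the 80 bits of phonetic data become 42 bits before distortionless compression is applied.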

[0105] If the plaintext data contains the phrase "その間", which includes the on-reading kanji "間" (reading: かん, kan), the phonetic data contains, as schematically shown in FIG. 19, information consisting of 4 bytes representing "その", the start position identification information, 2 bytes representing "間", and the end position identification information. As shown in the same figure, the intermediate code conversion unit 23 converts this information into information consisting of three L-bit intermediate codes, the 2 bytes representing "間", and the 1-byte end position identification information.
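
The mixed layout of this example, with L-bit codes followed by byte-unit data, can be sketched as a bit string. Everything concrete here is an assumption made for illustration: L = 7, the code values in `CODE`, the value of `END_MARKER`, and the use of Shift JIS for the 2-byte kanji are not taken from the specification.

```python
def bit_layout(units):
    """Concatenate (value, width) units into one bit string, most significant bit first."""
    return "".join(format(value, f"0{width}b") for value, width in units)


L = 7                                              # hypothetical code width
CODE = {"そ": 0x10, "の": 0x11, "<start>": 0x7F}    # hypothetical L-bit intermediate codes
END_MARKER = 0xFE                                  # hypothetical 1-byte end position value

kanji = "間".encode("shift_jis")                    # 2 bytes, left unconverted
units = [(CODE["そ"], L), (CODE["の"], L), (CODE["<start>"], L)]
units += [(byte, 8) for byte in kanji]             # byte-unit data after the start marker
units.append((END_MARKER, 8))                      # end position stays a 1-byte value
stream = bit_layout(units)
```

The resulting stream is 3L + 24 bits long: three L-bit codes, two bytes for the kanji, and one byte for the end position identification information.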

[0106] Next, the operation of the intermediate code conversion unit 23 in the bit length designation mode will be described. In the bit length designation mode, the intermediate code conversion unit 23 converts phonetic data into intermediate code data according to the flowcharts shown in FIGS. 20 and 21.

[0107] As shown in FIG. 20, the intermediate code conversion unit 23 first obtains, as in the bit length non-designation mode, the minimum number of bits M required to represent K, the maximum value of the homonym identification information, in binary, and the number n of character types stored in the character list (step S801). Next, the intermediate code conversion unit 23 compares the designated bit length L′ with M. If L′ < M (step S802; Y), it sets L′ to the value of M (step S803) and proceeds to step S804. If L′ < M does not hold (step S802; N), it proceeds to step S804 without updating L′.

[0108] In step S804, the intermediate code conversion unit 23 determines whether 2^L′ ≥ n holds. If 2^L′ ≥ n holds (step S804; Y), that is, if L′-bit data can represent all n types of information stored in the character list, the intermediate code conversion unit 23 creates an intermediate code correspondence table by assigning an L′-bit intermediate code to each of the n types of information (number-of-characters identification information, 2-byte characters, and so on) in the character list (step S805).

[0109] If 2^L′ ≥ n does not hold (step S804; N), that is, if L′-bit data cannot represent the n types of information stored in the character list, the intermediate code conversion unit 23 deletes n_exc (= n - 2^L′) types of characters from the character list (step S806). In doing so, the intermediate code conversion unit 23 deletes characters (symbols) other than hiragana, so that the number-of-characters identification information, the start position identification information, and the hiragana characters remain in the character list. It then creates an intermediate code correspondence table by assigning an L′-bit intermediate code to each of the 2^L′ types of information in the character list (step S807).
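
Steps S801 to S807 can be sketched as below. The marker names, the function names, and the rule for which non-hiragana characters are dropped first (those latest in the list) are assumptions made for illustration; the specification only states that characters other than hiragana are deleted. The error raised when the protected items alone exceed 2^L′ is likewise an assumption, since that case is not discussed.

```python
START = "<start>"                                        # hypothetical marker name
COUNT_MARKERS = {f"<count:{k}>" for k in range(1, 9)}    # hypothetical marker names


def is_hiragana(item: str) -> bool:
    return len(item) == 1 and "\u3041" <= item <= "\u3096"


def build_code_table(char_list, designated_bits, max_homonym_id):
    """Return (code width, item -> code table, set of excluded characters)."""
    m = max(1, max_homonym_id.bit_length())      # S801: bits needed for K
    width = max(designated_bits, m)              # S802-S803: raise L' to M if needed
    capacity = 2 ** width
    kept = list(char_list)
    if len(kept) > capacity:                     # S804; N
        protected = [c for c in kept
                     if c == START or c in COUNT_MARKERS or is_hiragana(c)]
        if len(protected) > capacity:
            raise ValueError("designated bit length too small for this sketch")
        others = [c for c in kept if c not in protected]
        kept = protected + others[:capacity - len(protected)]   # S806: drop n - 2**L' items
    table = {item: code for code, item in enumerate(kept)}      # S805 / S807
    excluded = set(char_list) - set(kept)
    return width, table, excluded
```

With seven items in the character list and a designated length of 2 bits, three non-hiragana characters are excluded and the remaining four receive 2-bit codes.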

[0110] After creating the intermediate code correspondence table, the intermediate code conversion unit 23 supplies data representing the contents of the table to the distortionless compression unit 25 (step S808). It then starts the processing shown in FIG. 21 to convert the phonetic data into intermediate code data.

[0111] In steps S811 to S817 shown in FIG. 21, the intermediate code conversion unit 23 performs the same processing as in steps S711 to S717 (FIG. 17), respectively. A description of the operation in these steps is therefore omitted.

[0112] In the bit length non-designation mode, an intermediate code is assigned to every 2-byte character, other than single on-reading kanji, and to every 1-byte character contained in the phonetic data. Consequently, whenever the data acquired from the phonetic data was a 1-byte or 2-byte character, a corresponding intermediate code existed.

[0113] In the bit length designation mode, by contrast, no intermediate code is prepared for some of the characters contained in the phonetic data. Therefore, when the intermediate code conversion unit 23 branches to the "other" side in step S812, it determines whether the data acquired in step S811 is data to which no intermediate code is assigned (hereinafter referred to as excluded character data) (step S820). If the data is not excluded character data (step S820; N), the intermediate code corresponding to the data is supplied to the distortionless compression unit 13 (step S821). The intermediate code conversion unit 23 then determines whether the output data is "EOF"; if it is not "EOF" (step S825; N), the process returns to step S811 and starts processing the next data.

[0114] If the acquired data is excluded character data (step S820; Y), the intermediate code conversion unit 23 repeatedly acquires data from the phonetic data until data other than excluded character data appears (step S823). It then supplies the intermediate code for the start position identification information, the acquired data excluding the last item (the data that is not excluded character data), and the end position identification information to the distortionless compression unit 25 (step S824). The intermediate code conversion unit 23 then returns to step S812 to process the last acquired data, which is not excluded character data.

[0115] That is, in the bit length designation mode, if the phonetic data contains a character to which no intermediate code is assigned, the intermediate code conversion unit 23 supplies to the distortionless compression unit 25 information in which the intermediate code for the start position identification information and the end position identification information are added before and after that character. If several such characters occur consecutively, it supplies information in which the intermediate code for the start position identification information and the end position identification information are added before and after that character string.

[0116] For example, suppose that information representing the character string "αβγは" exists in the phonetic data and that the information on "α", "β", and "γ" was deleted from the character list in step S806. In this case, as schematically shown in FIG. 22, the intermediate code conversion unit 23 supplies to the distortionless compression unit 25 information consisting of the L-bit (L = L′) intermediate code for the start position identification information, "αβγ", the end position identification information, and the intermediate code corresponding to "は".
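
The bracketing of excluded characters in steps S820 to S824 can be sketched as follows. The token form, with `("code", n)` for an L-bit intermediate code and `("raw", x)` for byte-unit data, and the marker names are hypothetical; handling of number-of-characters and homonym identification information is left out to keep the sketch short.

```python
START, END = "<start>", "<end>"   # hypothetical marker names


def to_intermediate(items, table):
    """Convert a sequence of characters, wrapping runs that have no intermediate code."""
    out = []
    i = 0
    while i < len(items):
        if items[i] in table:                       # S820; N: a code exists
            out.append(("code", table[items[i]]))   # S821
            i += 1
            continue
        j = i
        while j < len(items) and items[j] not in table:   # S823: extend the excluded run
            j += 1
        out.append(("code", table[START]))          # start position code (L bits)
        out.extend(("raw", ch) for ch in items[i:j])     # excluded characters, unconverted
        out.append(("raw", END))                    # end position identification information
        i = j                                       # resume with the first coded item
    return out
```

A run of several excluded characters is wrapped once, not character by character, which matches the "αβγは" example.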

[0117] In the data compression/decompression device of the third embodiment, the intermediate code correspondence table and the intermediate code data generated from the phonetic data by this procedure are compressed by the distortionless compression unit 25, which completes the compression of the plaintext data.

[0118] Finally, the operation of the data compression/decompression device of the third embodiment during data restoration will be described. When instructed to restore certain compressed data, the distortionless restoration unit 26 first restores the compressed data for the intermediate code correspondence table that forms part of that compressed data. It then restores the compressed data for the intermediate code data.

[0119] In response, the intermediate code inverse conversion unit 27 executes the processing shown in FIG. 23. That is, it first acquires the intermediate code correspondence table from the distortionless restoration unit 26 and recognizes the number of bits L of the intermediate codes (step S901). It then acquires L bits of data (an intermediate code) from the distortionless restoration unit 26 (step S902) and reads the information associated with the acquired intermediate code from the intermediate code correspondence table (step S903).

[0120] If the information read is the start position identification information (step S904; start position identification information), the intermediate code inverse conversion unit 27 acquires the next data from the distortionless restoration unit 26 in units of 1 byte or 2 bytes (step S905). If the acquired data is not the end position identification information (step S906; N), it outputs the data as is to the phonetic inverse conversion unit 28 (step S907) and returns to step S905. If the acquired data is the end position identification information (step S906; Y), it returns to step S902 without outputting that data and resumes acquiring data in L-bit units.

[0121] If the data read from the intermediate code correspondence table is number-of-characters identification information (step S904; number-of-characters identification information), the intermediate code inverse conversion unit 27 recognizes the number of characters in the reading that follows, converts the read data into 1-byte number-of-characters identification information, and outputs it to the phonetic inverse conversion unit 28 (step S908). It then acquires as many intermediate codes as the recognized number of characters from the distortionless restoration unit 26 and outputs the data associated with each acquired intermediate code in the intermediate code correspondence table (step S909). After that, it acquires L bits of data, namely the homonym identification information, converts it back into 1-byte information, and outputs it to the phonetic inverse conversion unit 28 (step S910). The process then returns to step S902 and resumes acquiring data in L-bit units.

[0122] If the data read from the intermediate code correspondence table is neither start position identification information nor number-of-characters identification information (step S904; other), the intermediate code inverse conversion unit 27 outputs the data to the phonetic inverse conversion unit 28 (step S911). If the output data is not "EOF" (step S912; N), the process returns to step S902; if it is "EOF" (step S912; Y), the illustrated processing ends.

[0123] In this way, when the intermediate code for the start position identification information appears, the intermediate code inverse conversion unit 27 recognizes that the data from that point until the end position identification information appears is byte-unit data (data of a single on-reading kanji or excluded character data). The data between the intermediate code for the start position identification information and the end position identification information is output as is to the phonetic inverse conversion unit 28; the start position and end position identification information themselves are not output. The homonym identification information, which cannot be distinguished from an intermediate code on its own, is located from the position of the intermediate code for the number-of-characters identification information and is converted back into 1-byte data.
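
The inverse conversion of FIG. 23 can be sketched as a small decoder. The token form (`("code", n)` for L-bit data, `("raw", x)` for byte-unit data), the marker names, and the `("count", k)` and `("homonym", id)` tuples are hypothetical stand-ins for the actual bit-level data, and "EOF" handling is reduced to the end of the token list.

```python
START, END = "<start>", "<end>"   # hypothetical marker names


def from_intermediate(tokens, reverse_table):
    """Rebuild phonetic data items from intermediate code tokens."""
    out = []
    it = iter(tokens)
    for _, value in it:                       # S902: next L-bit unit
        item = reverse_table[value]           # S903: look up the table
        if item == START:                     # S904: start position
            for _, raw in it:                 # S905: read byte-unit data
                if raw == END:                # S906; Y: marker itself is not output
                    break
                out.append(raw)               # S907: output as is
        elif isinstance(item, tuple) and item[0] == "count":
            out.append(item)                  # S908: number-of-characters information
            for _ in range(item[1]):          # S909: that many codes form the reading
                _, code = next(it)
                out.append(reverse_table[code])
            _, homonym_id = next(it)          # S910: next L bits are the homonym ID
            out.append(("homonym", homonym_id))
        else:
            out.append(item)                  # S911: ordinary character
    return out
```

In the usage below, the L-bit value 2 appears twice: once as the code for "せ" and once as a homonym identification value. Only its position after the reading tells the decoder which one it is.

```python
reverse_table = {0: START, 1: ("count", 2), 2: "せ", 3: "い", 4: "は"}
tokens = [("code", 1), ("code", 2), ("code", 3), ("code", 2),
          ("code", 0), ("raw", "間"), ("raw", END), ("code", 4)]
decoded = from_intermediate(tokens, reverse_table)
```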

[0124] As a result, the phonetic inverse conversion unit 28 is supplied with phonetic data obtained by removing all start position identification information and end position identification information from the phonetic data output by the phonetic conversion unit 22. The phonetic inverse conversion unit 28 is therefore configured to convert phonetic data into plaintext data by exactly the same procedure as the phonetic inverse conversion unit 15 of the first embodiment.

[0125]

[Effects of the Invention] As described in detail above, according to the present invention, document data in a language in which one character is represented by a plurality of bytes can be compressed at a high compression ratio.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of a data compression/decompression device according to a first embodiment of the present invention.

FIG. 2 is an explanatory diagram of the homonym dictionary provided in the data compression/decompression device of the first embodiment.

FIG. 3 is a flowchart showing the operation procedure of the phonetic conversion unit provided in the data compression/decompression device of the first embodiment.

FIG. 4 is an explanatory diagram of the operation of the phonetic conversion unit provided in the data compression/decompression device of the first embodiment.

FIG. 5 is an explanatory diagram of the operation of the phonetic conversion unit provided in the data compression/decompression device of the first embodiment.

FIG. 6 is a flowchart showing the operation procedure of the phonetic inverse conversion unit provided in the data compression/decompression device of the first embodiment.

FIG. 7 is a functional block diagram of a data compression/decompression device according to a second embodiment of the present invention.

FIG. 8 is a flowchart showing the operation procedure of the phonetic conversion unit provided in the data compression/decompression device of the second embodiment.

FIG. 9 is a flowchart showing the operation procedure of the phonetic conversion unit provided in the data compression/decompression device of the second embodiment.

FIG. 10 is an explanatory diagram of the operation of the phonetic conversion unit provided in the data compression/decompression device of the second embodiment.

FIG. 11 is an explanatory diagram of why the data compression/decompression device of the second embodiment can compress data at a high compression ratio.

FIG. 12 is a flowchart showing the operation procedure of the phonetic inverse conversion unit provided in the data compression/decompression device of the second embodiment.

FIG. 13 is a functional block diagram of a data compression/decompression device according to a third embodiment of the present invention.

FIG. 14 is a flowchart showing the operation procedure of the phonetic conversion unit provided in the data compression/decompression device of the third embodiment.

FIG. 15 is a flowchart of the phonetic conversion processing executed by the phonetic conversion unit provided in the data compression/decompression device of the third embodiment.

FIG. 16 is a flowchart showing the operation procedure of the intermediate code conversion unit provided in the data compression/decompression device of the third embodiment when no bit length is designated.

FIG. 17 is a flowchart showing the operation procedure of the intermediate code conversion unit provided in the data compression/decompression device of the third embodiment when no bit length is designated.

FIG. 18 is an explanatory diagram of the operation of the intermediate code conversion unit provided in the data compression/decompression device of the third embodiment.

FIG. 19 is an explanatory diagram of the operation of the intermediate code conversion unit provided in the data compression/decompression device of the third embodiment.

FIG. 20 is a flowchart showing the operation procedure of the intermediate code conversion unit provided in the data compression/decompression device of the third embodiment when a bit length is designated.

FIG. 21 is a flowchart showing the operation procedure of the intermediate code conversion unit provided in the data compression/decompression device of the third embodiment when a bit length is designated.

FIG. 22 is an explanatory diagram of the operation of the intermediate code conversion unit provided in the data compression/decompression device of the third embodiment.

FIG. 23 is a flowchart showing the operation procedure of the intermediate code inverse conversion unit provided in the data compression/decompression device of the third embodiment.

FIG. 24 is a diagram showing the correspondence among characters, occurrence probabilities, and intervals in arithmetic coding.

FIG. 25 is an explanatory diagram showing the encoding procedure in arithmetic coding.

FIG. 26 is an explanatory diagram of a context.

FIG. 27 is a diagram showing an example of a tree structure used for context collection.

[Explanation of symbols]

11, 21: storage unit
12, 22: phonetic conversion unit
13, 25: distortionless compression unit
14, 26: distortionless restoration unit
15, 28: phonetic inverse conversion unit
23: intermediate code conversion unit
24: temporary storage unit
27: intermediate code inverse conversion unit

Claims

[Claims]

1. A data compression apparatus for compressing original document data including character information in which one character is represented by a plurality of bytes of information, comprising: phonetic document data creating means for creating phonetic document data, which is data in which each piece of character information included in the original document data to be compressed is replaced with phonetic character information representing the sound produced when the character associated with that character information is pronounced; and compression means for compressing the phonetic document data created by the phonetic document data creating means.

2. A data compression apparatus for compressing original document data including character information in which one character is represented by a plurality of bytes, comprising: phonetic character information storage means for storing, for each of a plurality of pieces of conversion target word information each consisting of one or more pieces of character information, phonetic character information, which is information representing the sound produced when the word indicated by that conversion target word information is pronounced; search/read means for searching the original document data for conversion target word information stored in the phonetic character information storage means and reading, from the phonetic character information storage means, the phonetic character information corresponding to the conversion target word information found; phonetic document data creating means for creating phonetic document data by replacing the conversion target word information found in the original document data by the search/read means with conversion target word replacement information including the phonetic character information read by the search/read means; intermediate code table creating means for creating an intermediate code table that associates an intermediate code with each information element used in the phonetic document data created by the phonetic document data creating means; intermediate code document data creating means for creating intermediate code document data by converting each information element constituting the phonetic document data into the corresponding intermediate code using the intermediate code table created by the intermediate code table creating means; and compression means for compressing the intermediate code document data created by the intermediate code document data creating means.

3. The data compression apparatus according to claim 2, wherein the intermediate code table creating means creates an intermediate code table that assigns, to each information element used in the phonetic document data, an intermediate code having the minimum number of bits capable of expressing those information elements.

4. The data compression apparatus according to claim 2, wherein the phonetic document data creating means creates the phonetic document data by replacing character information of a predetermined type with information that includes start position identification information and end position identification information indicating that the character information is of the predetermined type; the intermediate code table creating means creates an intermediate code table in which no intermediate code is associated with the information following the start position identification information, or with the end position identification information, in the phonetic document data; and the intermediate code document data creating means does not convert the information following the start position identification information, or the end position identification information, in the phonetic document data into intermediate codes.

5. The data compression apparatus as described, wherein, when the number of types of information elements used in the phonetic document data exceeds the number N of pieces of information that can be represented by a predetermined number of bits, the intermediate code table creating means selects, from among the information elements used, "N-1" information elements to which intermediate codes are to be assigned, and creates an intermediate code table that associates intermediate codes of the predetermined number of bits, each with different content, with the selected "N-1" information elements and with the start position identification information; and the intermediate code document data creating means creates the intermediate code document data by converting each piece of information in the phonetic document data that is associated with an intermediate code in the intermediate code table into the corresponding intermediate code, and by replacing each piece of unassigned information, that is, information to which no intermediate code is assigned, with unassigned replacement information, which has at its head the intermediate code associated with the start position identification information, includes the unassigned information, and has a form from which its end position can be recognized.

6. The data compression apparatus according to claim 2, wherein the phonetic character information storage means further stores, for each of the plurality of pieces of conversion target word information, homonym identification information for distinguishing it from other conversion target word information associated with the same phonetic character information; the search/read means reads the phonetic character information and the homonym identification information corresponding to the conversion target word information found; the phonetic document data creating means replaces the conversion target word information in the original document data with conversion target word replacement information including the phonetic character information and the homonym identification information read by the search/read means; and the intermediate code table creating means creates the intermediate code table for information elements other than the homonym identification information.

7. A data compression apparatus for compressing original document data including character information in which one character is represented by a plurality of bytes, comprising: phonetic character information storage means for storing, for conversion target word information consisting of one or more pieces of character information, phonetic character information, which is information representing the sound produced when the word indicated by that conversion target word information is pronounced; search/read means for searching the original document data for conversion target word information stored in the phonetic character information storage means and reading, from the phonetic character information storage means, the phonetic character information corresponding to the conversion target word information found; phonetic document data creating means for creating phonetic document data by replacing the conversion target word information found in the original document data by the search/read means with conversion target word replacement information including the phonetic character information read by the search/read means; and compression means for compressing the phonetic document data created by the phonetic document data creating means.

8. The data compression device according to claim 2, wherein the phonetic character information storage means stores, for each of a plurality of pieces of conversion target word information, phonetic character information and homonym identification information for distinguishing it from other conversion target word information associated with the same phonetic character information; the search/read means reads the phonetic character information and the homonym identification information corresponding to the conversion target word information found; and the phonetic document data creating means replaces the conversion target word information in the original document data with conversion target word replacement information that includes the phonetic character information and the homonym identification information read by the search/read means and in which the homonym identification information is placed at a specific position.

9. The data compression device according to claim 2, wherein the phonetic document data creating means adds, to the head of the conversion target word replacement information, numerical identification information indicating the length of the phonetic character information read by the search/read means.

10. The data compression apparatus according to claim 2, wherein the phonetic document data creating means creates the phonetic document data by replacing the conversion target word information with conversion target word replacement information sandwiched between phonetic character information start position identification information and phonetic character information end position identification information, which mark its start and end.

11. A data restoration device comprising: decompression means for decompressing compressed document data; and original document data generation means for generating original document data, which is the data from which the compressed document data was derived, by converting the phonetic character information decompressed by the decompression means, which represents the sound produced when a character is pronounced, into the corresponding character information.

12. A data restoration apparatus comprising: phonetic character information storage means storing, in association with conversion target word information consisting of one or more pieces of character information, phonetic character information, which is information representing the sound produced when the word indicated by that conversion target word information is pronounced; decompression means for outputting intermediate code document data by decompressing compressed document data; phonetic document data generating means for generating phonetic document data by replacing each intermediate code included in the intermediate code document data output by the decompression means with the information associated with that intermediate code in an intermediate code table for the compressed document data; and original document data generating means for generating the original document data from which the compressed document data was derived, by searching for conversion target word replacement information included in the phonetic document data generated by the phonetic document data generating means and replacing the conversion target word replacement information found with the conversion target word information stored in the phonetic character information storage means in association with the phonetic character information included in that conversion target word replacement information.

13. The data restoration apparatus according to claim 12, wherein, for information sandwiched between an intermediate code corresponding to start position identification information and end position identification information in the output of the decompression means, the phonetic document data generating means performs no replacement using the intermediate code table and uses the information as it is as an element of the phonetic document data.

14. The data restoration apparatus according to claim 13, wherein the intermediate code table is a code table in which "N-1" information elements and start position identification information are associated with mutually different intermediate codes of a predetermined number of bits; and, for unassigned replacement information that is included in the intermediate code document data output by the decompression means and begins with the intermediate code associated with the start position identification information, the phonetic document data generating means outputs the information obtained by removing from that unassigned replacement information the intermediate code and the information added to identify its end position.

15. The data restoration apparatus according to claim 12, wherein the phonetic character information storage means stores, for each of a plurality of pieces of conversion target word information, phonetic character information and homonym identification information for distinguishing that word from other conversion target words associated with the same phonetic character information; and the original document data generating means searches for conversion target word replacement information contained in the phonetic document data generated by the phonetic document data generating means and replaces the conversion target word replacement information found with the conversion target word information stored in the phonetic character information storage means in association with the phonetic character information and the homonym identification information included in that conversion target word replacement information.

16. A data restoration apparatus comprising: phonetic character information storage means storing, in association with conversion target word information consisting of one or more pieces of character information, phonetic character information, which is information representing the sound produced when the word indicated by that conversion target word information is pronounced; restoration means for outputting phonetic document data by restoring compressed document data; and original document data generation means for generating the original document data from which the compressed document data was derived, by searching for conversion target word replacement information contained in the phonetic document data output by the restoration means and replacing the conversion target word replacement information found with the conversion target word information stored in the phonetic character information storage means in association with the phonetic character information included in that conversion target word replacement information.

17. The data restoration device according to any one of claims 12 to 16, wherein the phonetic character information storage means stores, for each of a plurality of pieces of conversion target word information, phonetic character information and homonym identification information for distinguishing that word from other conversion target words having the same phonetic character information; and the original document data generating means generates the original document data from which the compressed document data was derived, by searching for conversion target word replacement information contained in the phonetic document data generated by the restoration means and replacing the conversion target word replacement information found with the conversion target word information stored in the phonetic character information storage means in association with the phonetic character information and the homonym identification information included in that conversion target word replacement information.

18. The data restoration device according to any one of claims 12 to 17, wherein the original document data generating means, when searching for the conversion target word replacement information, searches for information beginning with numerical identification information indicating the length of phonetic character information.

19. The data restoration device according to claim 12, wherein the original document data generating means, when searching for the conversion target word replacement information, searches for information sandwiched between phonetic character information start position identification information and phonetic character information end position identification information.

20. A data compression method for compressing original document data including character information in which one character is represented by a plurality of bytes of information, comprising: a phonetic document data creating step of creating, based on the original document data, phonetic document data, which is data in which each piece of character information included in the original document data is replaced with phonetic character information representing the sound produced when the character associated with that character information is pronounced; and a compression step of compressing the phonetic document data created in the phonetic document data creating step.

21. A data compression method for compressing original document data including character information in which one character is represented by a plurality of bytes, comprising: a search/read step of searching the original document data to be compressed for conversion target word information stored in a dictionary that stores, in association with conversion target word information consisting of one or more pieces of character information, phonetic character information representing the sound produced when the word indicated by that conversion target word information is pronounced, and reading from the dictionary the phonetic character information corresponding to the conversion target word information found; a phonetic document data creating step of creating phonetic document data by replacing the conversion target word information found in the original document data by the search/read step with conversion target word replacement information including the phonetic character information read by the search/read step; an intermediate code table creating step of creating an intermediate code table that associates an intermediate code with each information element used in the phonetic document data created by the phonetic document data creating step; an intermediate code document data creating step of creating intermediate code document data by converting each information element constituting the phonetic document data into the corresponding intermediate code using the intermediate code table created by the intermediate code table creating step; and a compression step of compressing the intermediate code document data created by the intermediate code document data creating step.

22. The data compression method according to claim 21, wherein the intermediate code table creating step is a step of creating an intermediate code table that assigns, to each information element used in the phonetic document data, an intermediate code having the minimum number of bits capable of representing those information elements.
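
The minimum-width intermediate code of claim 22 can be illustrated with a short sketch. It assumes that each character of the phonetic document data is one information element and that codes are assigned in sorted order; both choices, and the function name, are assumptions for the example. For k distinct elements the code width is the smallest number of bits b with 2^b >= k.

```python
def build_code_table(phonetic: str) -> tuple[dict[str, int], int]:
    """Assign each distinct element a code and return (table, bits per code)."""
    symbols = sorted(set(phonetic))
    # Smallest bit width that can number all symbols (at least 1 bit).
    bits = max(1, (len(symbols) - 1).bit_length())
    table = {symbol: code for code, symbol in enumerate(symbols)}
    return table, bits

def to_intermediate_codes(phonetic: str, table: dict[str, int]) -> list[int]:
    """Convert each information element to its intermediate code."""
    return [table[ch] for ch in phonetic]
```

For example, a document that uses only 80 distinct kana and symbols would need 7-bit codes under this scheme instead of 16-bit characters.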

23. The data compression method according to claim 22, wherein the intermediate code table creating step is a step of, when the number of kinds of information elements used in the phonetic document data exceeds the number N of items of information that can be represented by a predetermined number of bits, selecting from the information elements used "N-1" information elements to which intermediate codes are to be assigned, and creating an intermediate code table that associates the selected "N-1" information elements and start position identification information with mutually different intermediate codes of the predetermined number of bits; and the intermediate code document data creating step is a step of creating intermediate code document data by converting information in the phonetic document data that is associated with an intermediate code in the intermediate code table into the corresponding intermediate code, and replacing unassigned information, which is information to which no intermediate code has been assigned, with unassigned replacement information, which is information that has at its head the intermediate code associated with the start position identification information, includes the unassigned information, and is in a form in which its end position can be identified.
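
The escape mechanism of claim 23 can be sketched as below. The claim does not say how the "N-1" elements are selected or how the end position is made known; this sketch assumes the most frequent elements are selected and that a byte count follows the escape code. It also keeps codes and raw bytes as plain integers in one list rather than packing them into a bit stream. All names are hypothetical.

```python
from collections import Counter

def build_escaped_table(phonetic: str, bits: int) -> tuple[dict[str, int], int]:
    """Give codes to the N-1 most frequent elements; the last code is the escape."""
    n = 1 << bits
    common = [s for s, _ in Counter(phonetic).most_common(n - 1)]
    table = {symbol: code for code, symbol in enumerate(common)}
    return table, n - 1  # n - 1 plays the role of start position identification

def encode(phonetic: str, table: dict[str, int], escape: int) -> list[int]:
    out = []
    for ch in phonetic:
        if ch in table:
            out.append(table[ch])
        else:
            raw = ch.encode("utf-8")
            # Unassigned element: escape code, byte count, then the raw bytes.
            out += [escape, len(raw), *raw]
    return out

def decode(codes: list[int], table: dict[str, int], escape: int) -> str:
    inverse = {code: symbol for symbol, code in table.items()}
    out = []
    i = 0
    while i < len(codes):
        if codes[i] == escape:
            n = codes[i + 1]
            # Strip the escape code and the count, keep the raw element as is.
            out.append(bytes(codes[i + 2:i + 2 + n]).decode("utf-8"))
            i += 2 + n
        else:
            out.append(inverse[codes[i]])
            i += 1
    return "".join(out)
```

The `decode` function corresponds to the restoration side described in claims 14 and 29, where the escape code and the end-position information are removed and the remaining information is output unchanged.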

24. A data compression method for compressing original document data including character information in which one character is represented by a plurality of bytes, comprising: a search/read step of using a dictionary that stores, in association with conversion target word information consisting of one or more pieces of character information, phonetic character information, which is information representing the sound produced when the word indicated by that conversion target word information is pronounced, to search the original document data to be compressed for conversion target word information stored in the dictionary, and reading from the dictionary the phonetic character information corresponding to the conversion target word information found; a phonetic document data creating step of creating phonetic document data by replacing the conversion target word information found in the original document data by the search/read step with conversion target word replacement information including the phonetic character information read by the search/read step; and a compression step of compressing the phonetic document data created by the phonetic document data creating step.

25. The data compression method according to any one of claims 21 to 24, wherein the search/read step is a step of reading the phonetic character information and the homonym identification information corresponding to the conversion target word from a dictionary that stores, for each of a plurality of conversion target words, phonetic character information and homonym identification information for distinguishing that word from other conversion target words having the same phonetic character information; and the phonetic document data creating step is a step of replacing the conversion target word information with conversion target word replacement information including the phonetic character information and the homonym identification information read in the search/read step.

26. The data compression method according to claim 21, wherein the phonetic document data creating step adds numerical identification information indicating the length of the phonetic character information read in the search/read step to the head of the conversion target word replacement information.

27. A data restoring method comprising: a decompression step of decompressing compressed document data; and an original document data generating step of generating original document data, which is the data from which the compressed document data was derived, by converting the phonetic character information decompressed in the decompression step, which represents the sound produced when a character is pronounced, into the corresponding character information.

28. A data restoring method comprising: a decompression step of outputting intermediate code document data by decompressing compressed document data; a phonetic document data generating step of generating phonetic document data by replacing each intermediate code included in the intermediate code document data output by the decompression step with the information associated with that intermediate code in an intermediate code table for the compressed document data; and an original document data generating step of generating the original document data from which the compressed document data was derived, by using a dictionary that stores, in association with conversion target word information consisting of one or more pieces of character information, phonetic character information, which is information representing the sound produced when the word indicated by that conversion target word information is pronounced, to search for conversion target word replacement information included in the phonetic document data generated by the phonetic document data generating step, and replacing the conversion target word replacement information found with the conversion target word information corresponding to the phonetic character information included in that conversion target word replacement information.

29. The data restoration method according to claim 28, wherein the intermediate code table is a code table in which “N−1” information elements and start position identification information are associated with mutually different intermediate codes of a predetermined number of bits; and the phonetic document data generating step includes, for unassigned replacement information that is included in the intermediate code document data output by the decompression step and begins with the intermediate code associated with the start position identification information, outputting the information obtained by removing from that unassigned replacement information the intermediate code and the information added to identify its end position.

30. A data restoring method comprising: a restoring step of outputting phonetic document data by restoring compressed document data; and an original document data generating step of generating the original document data from which the compressed document data was derived, by using a dictionary that stores, in association with conversion target word information consisting of one or more pieces of character information, phonetic character information, which is information representing the sound produced when the word indicated by that conversion target word information is pronounced, to search for conversion target word replacement information included in the phonetic document data output by the restoring step, and replacing the conversion target word replacement information found with the conversion target word information corresponding to the phonetic character information included in that conversion target word replacement information.
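
The restoring method of claim 30 can be sketched as the inverse of the compression path. The sketch assumes `zlib` as the lossless codec, UTF-8 text, marker characters `START` and `END` around each reading, and a dictionary without homonyms (each reading maps to exactly one word); none of these details come from the claim, and all names are hypothetical.

```python
import zlib

# Assumptions: private-use marker characters and a toy homonym-free dictionary.
START, END = "\ue100", "\ue101"
DICTIONARY = {"圧縮": "あっしゅく", "復元": "ふくげん"}

def restore_document(blob: bytes, dictionary: dict[str, str] = DICTIONARY) -> str:
    """Decompress, then turn each marked reading back into its original word."""
    inverse = {reading: word for word, reading in dictionary.items()}
    phonetic = zlib.decompress(blob).decode("utf-8")
    parts = phonetic.split(START)
    out = [parts[0]]  # text before the first marked reading
    for part in parts[1:]:
        # Each later part is "reading END rest-of-text".
        reading, _, rest = part.partition(END)
        out.append(inverse[reading] + rest)
    return "".join(out)
```

With homonyms in the dictionary, the lookup key would need to include homonym identification information in addition to the reading, as claim 31 describes.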

31. The data restoration method according to claim 28, wherein the original document data generating step uses a dictionary that stores, for each of a plurality of pieces of conversion target word information, phonetic character information and homonym identification information for distinguishing that word from other conversion target words having the same phonetic character information, and replaces conversion target word replacement information that is in the phonetic document data generated by the restoring step and includes phonetic character information and homonym identification information with the corresponding conversion target word information.

32. The data restoration method according to claim 31, wherein the original document data generating step searches for the conversion target word replacement information by searching for information sandwiched between start position identification information and end position information.

33. The data restoration method according to claim 28, wherein the original document data generating step searches for the conversion target word replacement information by searching for information sandwiched between phonetic character information start position identification information and phonetic character information end position identification information.

34. A program recording medium on which is recorded a program for causing a computer to function as: phonetic document data creating means for creating phonetic document data, which is data obtained by replacing each piece of character information included in original document data to be compressed with phonetic character information representing the sound produced when the character corresponding to that character information is pronounced; and compressing means for compressing the phonetic document data created by the phonetic document data creating means.

35. A program recording medium on which is recorded a program for causing a computer to function as: restoring means for restoring compressed document data; and original document data generating means for generating original document data, which is the data from which the compressed document data was derived, by converting the phonetic character information restored by the restoring means, which represents the sound produced when a character is pronounced, into the corresponding character information.