JP2010102559A - Apparatus and method for data compression and program

Info

Publication number: JP2010102559A
Application number: JP2008274276A
Authority: JP
Inventors: Mitsugi Miura; 貢三浦
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-10-24
Filing date: 2008-10-24
Publication date: 2010-05-06
Anticipated expiration: 2028-10-24
Also published as: JP5344132B2

Abstract

PROBLEM TO BE SOLVED: To provide a data compression apparatus, a data compression method, and a program that improve the compression ratio when compressing text data including kanji (Chinese characters).

SOLUTION: A data compression apparatus 1 performs data compression on text data including kanji. The data compression apparatus 1 includes a kanji separation part 6 that separates the kanji included in the text data from the other characters included in the text data, and a kanji data generation part 7 that generates kanji list data specifying the separated kanji and kanji position data specifying the position of each kanji in the text data.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a data compression apparatus, a data compression method, and a program for compressing text data.

Conventionally, digitized data has been compressed to make data transfer more efficient and to use data storage effectively. Data compression is broadly classified into lossless compression and lossy compression. In lossy compression, as the name implies, part of the original data is lost, so the original data cannot be completely restored from the compressed data. Lossy compression is nevertheless superior in that it can achieve a much higher compression ratio than lossless compression.

For example, in audio data and image data, with some exceptions, little is substantially lost even when sounds outside the human audible range or indistinguishable color changes are smoothed out, and such data remains usable as information even if it cannot be completely restored. Therefore, combined with the fact that a high compression ratio is required because of the large data size, lossy compression is mainly used for image data, audio data, and the like (see, for example, Patent Documents 1 to 3).

In contrast, when lossy compression is applied to text data such as program data or document data, restoration becomes extremely difficult because text is discrete data in units of characters and is vulnerable to noise. For program data, the program no longer operates normally after restoration. For document data, the characters are garbled after restoration and the data can no longer be used as information.

As described above, lossy compression is difficult to apply to text data such as program data and document data, so lossless compression is mainly used for text data. Even so, a higher compression ratio is also demanded for text data, and various compression methods have been proposed (see, for example, Patent Documents 4 to 6).

For example, Patent Document 4 discloses a data compression technique that reduces the amount of data when document data is transmitted as e-mail. This technique compresses data by converting full-width hiragana to half-width katakana, converting full-width alphanumeric characters to half-width alphanumeric characters, converting specific character strings to specific symbols, and the like.

Patent Document 5 discloses a data compression technique that, when compressing document data to which an image is added, applies lossless compression such as LZH or ZIP to the text data and lossy compression to the image data. This technique makes it possible to improve the compression ratio for document data to which an image is added.

Furthermore, Patent Document 6 discloses a data compression technique for compressing print document data sent from a computer device to a printer. In this technique, text data is first converted into dot map data, and the resulting dot map data is then compressed, which improves the compression ratio.

Patent Document 1: JP-T-2001-507195
Patent Document 2: JP-T-2004-523933
Patent Document 3: JP-T-2005-500709
Patent Document 4: JP-A-2003-196268 (FIG. 5)
Patent Document 5: JP-A-2005-182735
Patent Document 6: JP-A-2006-120145

However, in the data compression technique disclosed in Patent Document 4, kanji contained in the document data are not compressed unless a corresponding symbol has been defined for them. The technique is therefore limited in how far it can improve the compression ratio of text data.

In the data compression technique disclosed in Patent Document 5, text data is merely compressed by a general-purpose lossless compression algorithm such as LZH or ZIP. It is therefore difficult to improve the compression ratio of text data, especially for document data that contains kanji.

Furthermore, in the data compression technique disclosed in Patent Document 6, text data is converted into dot map data and then compressed, but improving the compression ratio requires lowering the quality of the text. For this reason, improving the compression ratio is extremely difficult, particularly when the text data contains kanji.

An object of the present invention is to solve the above problems and to provide a data compression apparatus, a data compression method, and a program that can improve the compression ratio when compressing text data including kanji.

To achieve the above object, a data compression apparatus according to the present invention is a data compression apparatus that performs data compression on text data including kanji, the apparatus comprising:
a kanji separation unit that separates the kanji included in the text data from characters other than kanji included in the text data; and
a kanji data generation unit that generates kanji list data specifying the separated kanji and kanji position data specifying the position of the kanji in the text data.

To achieve the above object, a data compression method according to the present invention is a data compression method for performing data compression on text data including kanji, the method comprising:
(a) a step of separating the kanji included in the text data from characters other than kanji included in the text data; and
(b) a step of generating kanji list data specifying the separated kanji and kanji position data specifying the position of the kanji in the text data.

Furthermore, to achieve the above object, a program according to the present invention is a program for performing data compression on text data including kanji using a computer device, the program causing the computer device to execute:
(a) a step of separating the kanji included in the text data from characters other than kanji included in the text data; and
(b) a step of generating kanji list data specifying the separated kanji and kanji position data specifying the position of the kanji in the text data.

With the above features, the data compression apparatus, data compression method, and program according to the present invention improve the compression ratio when compressing text data including kanji.

(Embodiment)
Hereinafter, a data compression apparatus, a data compression method, and a program according to an embodiment of the present invention will be described with reference to FIGS. 1 to 4. First, the configuration of the data compression apparatus according to this embodiment will be described with reference to FIGS. 1 to 3.

FIG. 1 is a block diagram showing the configuration of a data compression apparatus according to an embodiment of the present invention. FIG. 2 is a diagram showing an example of the results of processing performed by the data compression apparatus according to the embodiment of the present invention. FIG. 3 is a diagram showing an example of data generated by the data compression apparatus according to this embodiment.

The data compression apparatus 1 shown in FIG. 1 performs data compression on text data 12 (see FIG. 2) including kanji. As shown in FIG. 1, the data compression apparatus 1 includes a kanji processing unit 5 having a kanji separation unit 6 and a kanji data generation unit 7.

The kanji separation unit 6 separates the kanji included in the text data 12 to be compressed from the characters other than kanji (kana and alphanumeric characters) that are also included in the text data. The kanji data generation unit 7 generates kanji list data specifying the separated kanji and kanji position data specifying the position of each kanji in the text data 12 (see FIG. 3).

In this embodiment, kanji, which are difficult to compress, are separated from kana and alphanumeric characters and converted into kanji list data and kanji position data. As shown in FIG. 3, the kanji list data can be composed of kanji codes formed from combinations of alphanumeric characters, so digital compression of the kanji list data is easy. Likewise, as shown in FIG. 3, the kanji position data only needs to indicate locations in the document and can be composed of alphanumeric characters, so digital compression of the kanji position data is also easy. For this reason, according to this embodiment, text data including kanji can be compressed at a high compression ratio.

Next, the configuration of the data compression apparatus 1 will be described more specifically. As shown in FIG. 1, in this embodiment the data compression apparatus 1 includes, in addition to the kanji processing unit 5, a data reading unit 2, a morphological analysis unit 3, a dictionary 4, an auxiliary compression unit 9, a data compression unit 10, and a data output unit 11. The kanji processing unit 5 further includes a kanji-kana conversion unit 8 in addition to the kanji separation unit 6 and the kanji data generation unit 7.

The data reading unit 2 reads text data input from the outside and outputs it to the morphological analysis unit 3. An example of the input text data is the Japanese text data 12 including kanji shown in FIG. 2. In practice, the text data 12 is input in an encoded state.

In this embodiment, the text data is input to the data reading unit 2, for example, via an input device such as a keyboard or mouse connected to the data compression apparatus 1, or from an external computer device via a network. In this case, an interface for connecting the data compression apparatus 1 to the outside is used as the data reading unit 2. The text data may also be provided in a state stored on a recording medium. In that case, a reading device is used as the data reading unit 2.

The morphological analysis unit 3 performs morphological analysis while referring to the dictionary 4 and decomposes the text data into words. The dictionary 4 stores a word list annotated with information such as part of speech, kanji notation, and reading. In this embodiment, the morphological analysis algorithm is not particularly limited: the morphological analysis unit 3 may perform rule-based morphological analysis or morphological analysis based on a probabilistic language model.

When the morphological analysis is completed, the morphological analysis unit 3 outputs data 13 (see FIG. 2) indicating the result of the morphological analysis to the kanji processing unit 5. In this embodiment, the morphological analysis unit 3 adds the reading of each kanji (not shown in FIG. 2) to the data 13 so that the kanji-kana conversion unit 8, described later, can perform its conversion processing.
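As a rough illustration of dictionary-based analysis that attaches a reading to each word, the following Python sketch uses greedy longest-match segmentation. This is a deliberately simplified stand-in for rule-based morphological analysis, not the algorithm of the embodiment; the `DICTIONARY` contents, the token format, and the function name `analyze` are assumptions made for illustration.

```python
# Hypothetical miniature of dictionary 4: surface form -> (part of speech, reading).
DICTIONARY = {
    "漢字": ("noun", "かんじ"),
    "含む": ("verb", "ふくむ"),
    "を": ("particle", "を"),
    "文書": ("noun", "ぶんしょ"),
}


def analyze(text, dictionary=DICTIONARY):
    """Split text into (surface, part_of_speech, reading) tokens.

    Greedy longest-match: at each position, take the longest dictionary
    entry that matches. Characters not covered by the dictionary become
    single-character tokens tagged "unknown".
    """
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in dictionary:
                part_of_speech, reading = dictionary[word]
                tokens.append((word, part_of_speech, reading))
                i += length
                break
        else:
            # No dictionary entry starts here; emit the character as-is.
            tokens.append((text[i], "unknown", text[i]))
            i += 1
    return tokens


tokens = analyze("漢字を含む文書")
```

A production analyzer would resolve ambiguous segmentations with grammar rules or a probabilistic model, as the description notes; the sketch only shows how readings can travel with each word.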

As described above, the kanji separation unit 6 separates the kanji included in the text data 12 from characters other than kanji. In this embodiment, the kanji separation unit 6 separates kanji on the basis of character codes. For each separated kanji, the kanji separation unit 6 also extracts the corresponding kanji code and information specifying the position of the kanji (kanji position information). In this embodiment, the JIS code, Shift JIS code, Unicode, or the like is used as the kanji code. As the kanji position information, for example, the number assigned to each character when the characters are numbered sequentially, with the first character of the text as "1", is used.

The kanji data generation unit 7 generates the kanji list data shown in FIG. 3 using the kanji codes extracted by the kanji separation unit 6. The kanji data generation unit 7 also generates the kanji position data shown in FIG. 3 using the kanji position information extracted by the kanji separation unit 6.
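The separation and data generation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: Unicode is used as the kanji code, a character counts as kanji only if it lies in the CJK Unified Ideographs block (U+4E00 to U+9FFF), and the names `is_kanji` and `separate_kanji` are hypothetical. The actual layout of the data in FIG. 3 is not reproduced here.

```python
def is_kanji(ch: str) -> bool:
    """Rough kanji test by code point (assumption: basic CJK block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def separate_kanji(text: str):
    """Return (kanji_list, kanji_positions) for the given text.

    kanji_list holds each kanji's code as an alphanumeric (hex) string;
    kanji_positions holds 1-based character positions, with the first
    character of the text numbered 1.
    """
    kanji_list = []
    kanji_positions = []
    for position, ch in enumerate(text, start=1):
        if is_kanji(ch):
            kanji_list.append(f"{ord(ch):04X}")
            kanji_positions.append(position)
    return kanji_list, kanji_positions


codes, positions = separate_kanji("漢字を含む")
# positions is [1, 2, 4]: the kanji are the 1st, 2nd, and 4th characters.
```

Both outputs consist only of digits and the letters A to F, which matches the description's point that the kanji list data and kanji position data can be composed of alphanumeric characters.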

The kanji-kana conversion unit 8 generates kana-converted text data 14 (see FIG. 2). The kana-converted text data 14 is data composed only of characters other than kanji, obtained by converting the kanji included in the text data 12 (see FIG. 2) into kana. In this embodiment, the kanji-kana conversion unit 8 determines, for each character included in the text data 12, whether its character code is a kanji code. For each character determined to have a kanji code, the kanji-kana conversion unit 8 uses the result of the morphological analysis to convert it into the corresponding kana. The kana-converted text data 14 is generated in this way.
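A minimal Python sketch of this conversion is shown below. It assumes that the morphological analysis result is available as a list of `(surface, reading)` pairs, which is a hypothetical format chosen for illustration, and it replaces a whole word by its reading whenever the word contains kanji rather than converting character by character. The kanji test again assumes the basic CJK Unified Ideographs block.

```python
def is_kanji(ch: str) -> bool:
    """Rough kanji test by code point (assumption: basic CJK block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def to_kana_text(tokens):
    """Build kana-converted text from (surface, reading) tokens.

    Words that contain at least one kanji are replaced by their reading;
    all other words (kana, alphanumerics, symbols) are kept unchanged.
    """
    parts = []
    for surface, reading in tokens:
        if any(is_kanji(ch) for ch in surface):
            parts.append(reading)
        else:
            parts.append(surface)
    return "".join(parts)


tokens = [("漢字", "かんじ"), ("を", "を"), ("含む", "ふくむ"), ("テキスト", "てきすと")]
kana_text = to_kana_text(tokens)
# kana_text is "かんじをふくむテキスト"; the katakana word is left as written.
```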

As shown in FIG. 2, the kana-converted text data 14 consists of characters other than kanji, so characters of the same type appear consecutively. Compared with data obtained by merely removing the kanji from the text data 12 (for example, 「が／る／まで／で／った／・・・」; see data 13), the kana-converted text data 14 therefore contains more common regions that compression can target. In this embodiment, creating the kana-converted text data 14 thus further improves the compression ratio.

The auxiliary compression unit 9 reduces the number of distinct character codes in the kana-converted text data 14 and thereby compresses its data amount. Specifically, the auxiliary compression unit 9 converts kana for plosive sounds, voiced sounds, and geminate consonants into unvoiced (plain) kana (for example, ぱ→は, が→か, っ→つ), converts katakana into hiragana, and further converts katakana and hiragana into romaji. In FIG. 2, data 15 is the data obtained by auxiliary compression. In this embodiment, compression by the auxiliary compression unit 9 need only be performed as necessary and may be omitted.
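The character-set reduction described above can be sketched in Python as follows. The sketch covers the conversion of katakana to hiragana, the removal of voicing marks, and the conversion of small っ to つ; conversion to romaji is left out. Using Unicode normalization to strip the voicing marks is an implementation choice made here, not something the description specifies, and the function names are hypothetical.

```python
import unicodedata


def katakana_to_hiragana(text: str) -> str:
    """Map katakana U+30A1..U+30F6 to hiragana by subtracting 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def strip_voicing(text: str) -> str:
    """Turn voiced and semi-voiced kana into plain kana (が→か, ぱ→は).

    NFD splits such kana into a base kana plus a combining mark
    (U+3099 dakuten or U+309A handakuten); the marks are then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in "\u3099\u309a")
    return unicodedata.normalize("NFC", kept)


def auxiliary_compress(text: str) -> str:
    """Reduce the variety of kana characters in kana-converted text."""
    text = katakana_to_hiragana(text)
    text = strip_voicing(text)
    return text.replace("っ", "つ")


reduced = auxiliary_compress("データがあった")
# reduced is "てーたかあつた"
```

Because several different inputs map to the same output, this step loses information, which is consistent with the description treating it as optional.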

The kanji list data, the kanji position data, and the kana-converted text data 14 shown in FIG. 3 (or the data 15 when auxiliary compression has been performed by the auxiliary compression unit 9) are input to the data compression unit 10. Specifically, the data compression unit 10 performs digital compression of the data using a lossless compression algorithm such as the LZH format, the LZ (Lempel-Ziv) format, or the ZIP format.

As described above, the kanji list data, the kanji position data, and the kana-converted text data are in practice composed of alphanumeric characters. The data compressed by the data compression unit 10 is therefore compressed at a higher compression ratio than when it is compressed by the techniques of Patent Documents 4 to 6 described in the background art section. According to this embodiment, the Japanese text data 12 including kanji (see FIG. 2) is compressed at a high compression ratio.

Furthermore, in this embodiment, the data compression unit 10 can, in response to an instruction from the outside (specifically, an instruction from the operator of the data compression apparatus 1), discard the kanji list data and the kanji position data and perform digital compression only on the kana-converted text data 14. In this case, the compressed data can no longer be restored to the original text data 12, but this mode is effective when lossless compression is not required and only a high compression ratio is sought.
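The behavior of the data compression unit, including the optional discard mode, can be sketched in Python with the standard `zlib` module, which implements the DEFLATE algorithm also used by ZIP. The newline-separated container layout, the `keep_kanji` flag, and the function name are assumptions made for illustration; the description does not define how the three pieces of data are packaged.

```python
import zlib


def compress_streams(kanji_codes, kanji_positions, kana_text, keep_kanji=True):
    """Losslessly compress the generated data with DEFLATE.

    With keep_kanji=True the payload has three lines: kanji list data,
    kanji position data, and kana-converted text. With keep_kanji=False
    the kanji list data and kanji position data are discarded, so the
    original kanji can no longer be recovered from the output.
    Assumption: kana_text itself contains no newline characters.
    """
    if keep_kanji:
        parts = [
            ",".join(kanji_codes),
            ",".join(str(p) for p in kanji_positions),
            kana_text,
        ]
    else:
        parts = [kana_text]
    payload = "\n".join(parts).encode("utf-8")
    return zlib.compress(payload, 9)


full = compress_streams(["6F22", "5B57"], [1, 2], "かんじをふくむ")
kana_only = compress_streams(["6F22", "5B57"], [1, 2], "かんじをふくむ", keep_kanji=False)
```

Calling `zlib.decompress` on either result returns the exact payload that was compressed; only the kana-only variant is irreversible with respect to the original text, because the kanji information was dropped before compression.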

The compressed data is output to the outside by the data output unit 11. Specifically, the compressed data is output to another computer device connected to the data compression apparatus 1 via a network, or to a recording medium.

Next, a data compression method according to the embodiment of the present invention will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the flow of processing performed in the data compression method according to the embodiment of the present invention. The data compression method according to this embodiment can be carried out by operating the data compression apparatus 1 shown in FIG. 1. The following description of the data compression method is therefore given together with a description of the operation of the data compression apparatus shown in FIG. 1, with reference to FIGS. 1 to 4 as appropriate.

First, as shown in FIG. 4, when the text data 12 (see FIG. 2) is input to the data compression apparatus 1, the data reading unit 2 reads the input text data 12 (step S1) and passes the read text data 12 to the morphological analysis unit 3.

Next, when the text data 12 is input, the morphological analysis unit 3 performs morphological analysis on it while referring to the dictionary 4 and generates the data 13, to which the reading of each kanji is added (step S2). The morphological analysis unit 3 passes the generated data 13 to the kanji processing unit 5.

Subsequently, the kanji processing unit 5 determines whether the text data 12 contains kanji (step S3). If it contains no kanji, the auxiliary compression unit 9 executes step S5. If it contains kanji, the kanji processing unit 5 performs kanji processing (step S4).

Specifically, in step S4, the kanji separation unit 6 first separates the kanji included in the text data 12 from characters other than kanji (kana and alphanumeric characters) and extracts the kanji code and kanji position information corresponding to each separated kanji. Next, the kanji data generation unit 7 generates the kanji list data and the kanji position data using the extracted kanji codes and kanji position information (see FIG. 3). The kanji-kana conversion unit 8 then generates the kana-converted text data 14.

Next, the auxiliary compression unit 9 performs auxiliary compression on the kana-converted text data 14 to reduce its data amount (step S5). If step S4 has not been executed, the auxiliary compression unit 9 performs auxiliary compression on the data 13 generated by the morphological analysis.

Subsequently, the data compression unit 10 performs data compression on the kanji list data, the kanji position data, and the auxiliary-compressed kana-converted text data 14 (step S6). If step S4 has not been executed, the data compression unit 10 performs data compression on the auxiliary-compressed data 13.

Thereafter, the data output unit 11 outputs the data compressed in step S6 to the outside (step S7). When step S7 ends, the data compression apparatus 1 ends the process. In this way, when the data compression method according to this embodiment is executed, kanji are compressed separately from characters other than kanji, and the text data 12 including kanji is compressed at a high compression ratio.
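The control flow of steps S3 to S7, in particular the branch taken when the text contains no kanji, can be sketched in Python as follows. The sketch assumes that steps S1 and S2 have already produced a list of `(surface, reading)` pairs (a hypothetical format), omits the optional auxiliary compression of step S5, and uses `zlib` and a newline-separated payload as stand-ins for the compression format. None of these choices comes from the description itself.

```python
import zlib


def is_kanji(ch: str) -> bool:
    """Rough kanji test by code point (assumption: basic CJK block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def compress_text(tokens) -> bytes:
    """Run S3 to S7 on (surface, reading) tokens and return compressed bytes."""
    text = "".join(surface for surface, _ in tokens)
    streams = []
    if any(is_kanji(ch) for ch in text):  # S3: does the text contain kanji?
        # S4: kanji list data, kanji position data, kana-converted text.
        hits = [(pos, ch) for pos, ch in enumerate(text, start=1) if is_kanji(ch)]
        streams.append(",".join(f"{ord(ch):04X}" for _, ch in hits))
        streams.append(",".join(str(pos) for pos, _ in hits))
        body = "".join(
            reading if any(is_kanji(ch) for ch in surface) else surface
            for surface, reading in tokens
        )
    else:
        # No kanji: S4 is skipped and the text is passed on unchanged.
        body = text
    # S5 (auxiliary compression) is optional and omitted in this sketch.
    streams.append(body)
    # S6: lossless compression; returning the bytes stands in for S7.
    return zlib.compress("\n".join(streams).encode("utf-8"))


with_kanji = compress_text([("漢字", "かんじ"), ("を", "を"), ("含む", "ふくむ")])
without_kanji = compress_text([("かな", "かな"), ("だけ", "だけ")])
```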

The program according to this embodiment is a program that causes a computer device to carry out steps S1 to S7 shown in FIG. 4. The data compression apparatus 1 can therefore be realized by installing this program on a computer and executing it. In this case, the CPU (central processing unit) of the computer functions as the morphological analysis unit 3, the kanji processing unit 5, the auxiliary compression unit 9, the data compression unit 10, and the data output unit 11, and performs the processing of steps S1 to S7.

In this embodiment, the dictionary 4 can be realized by storing the data files that constitute the dictionary 4 in a storage device, such as a hard disk, provided in the computer device. The dictionary 4 can also be realized by mounting a recording medium that stores the data files in a reading device connected to the computer device. Furthermore, the dictionary 4 may be provided by a computer device other than the one on which the program is installed.

FIG. 1 is a block diagram showing the configuration of a data compression apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the results of processing performed by the data compression apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of data generated by the data compression apparatus according to this embodiment.
FIG. 4 is a flowchart showing the flow of processing performed in the data compression method according to the embodiment of the present invention.

Explanation of symbols

1 Data compression apparatus
2 Data reading unit
3 Morphological analysis unit
4 Dictionary
5 Kanji processing unit
6 Kanji separation unit
7 Kanji data generation unit
8 Kanji-kana conversion unit
9 Auxiliary compression unit
10 Data compression unit
11 Data output unit
12 Text data
13 Data after morphological analysis
14 Kana-converted text data
15 Data after auxiliary compression

Claims

1. A data compression apparatus that performs data compression on text data including kanji, comprising:
a kanji separation unit that separates the kanji included in the text data from characters other than kanji included in the text data; and
a kanji data generation unit that generates kanji list data specifying the separated kanji and kanji position data specifying the position of the kanji in the text data.

2. The data compression apparatus according to claim 1, further comprising a kanji-kana conversion unit that converts the kanji included in the text data into kana and generates kana-converted text data composed only of characters other than kanji.

3. The data compression apparatus according to claim 1 or 2, further comprising a data compression unit that performs digital compression on the kanji list data, the kanji position data, and the kana-converted text data.

4. The data compression apparatus according to claim 3, wherein, when instructed from outside, the data compression unit discards the kanji list data and the kanji position data and performs digital compression only on the kana-converted text data.

5. The data compression apparatus according to claim 2, further comprising a morphological analysis unit that performs morphological analysis on the text data and decomposes the text data into words,
wherein the kanji-kana conversion unit converts the kanji included in the text data into kana using the result of the morphological analysis to generate the kana-converted text data.

6. A data compression method for performing data compression on text data including kanji, comprising:
(a) a step of separating the kanji included in the text data from characters other than kanji included in the text data; and
(b) a step of generating kanji list data specifying the separated kanji and kanji position data specifying the position of the kanji in the text data.

7. The data compression method according to claim 6, further comprising (c) a step of converting the kanji included in the text data into kana and generating kana-converted text data composed only of characters other than kanji.

8. The data compression method according to claim 6 or 7, further comprising (d) a step of performing digital compression on the kanji list data, the kanji position data, and the kana-converted text data.

9. The data compression method, wherein, in the step (d), when instructed from outside, the kanji list data and the kanji position data are discarded and digital compression is performed only on the kana-converted text data.

10. The data compression method according to claim 7, further comprising (e) a step of performing morphological analysis on the text data and decomposing the text data into words,
wherein the step (e) is executed before the step (a), and
in the step (a), the kanji included in the text data is converted into kana using the result of the morphological analysis obtained in the step (e), and the kana-converted text data is generated.

11. A program for performing data compression on text data including kanji using a computer device, the program causing the computer device to execute:
(a) a step of separating the kanji included in the text data from characters other than kanji included in the text data; and
(b) a step of generating kanji list data specifying the separated kanji and kanji position data specifying the position of the kanji in the text data.

12. The program according to claim 11, further causing the computer device to execute (c) a step of converting the kanji included in the text data into kana and generating kana-converted text data composed only of characters other than kanji.

13. The program according to claim 11 or 12, further causing the computer device to execute (d) a step of performing digital compression on the kanji list data, the kanji position data, and the kana-converted text data.

14. The program, wherein, in the step (d), when instructed from outside, the kanji list data and the kanji position data are discarded and digital compression is performed only on the kana-converted text data.

15. The program according to claim 12, further causing the computer device to execute, before the step (a), (e) a step of performing morphological analysis on the text data and decomposing the text data into words,
wherein, in the step (a), the kanji included in the text data is converted into kana using the result of the morphological analysis obtained in the step (e), and the kana-converted text data is generated.