JPH0338772A

JPH0338772A - Compression system for character code data

Info

Publication number: JPH0338772A
Application number: JP1175208A
Authority: JP
Inventors: Isao Kondo; 勲近藤
Original assignee: NEC Office Systems Ltd
Current assignee: NEC Office Systems Ltd
Priority date: 1989-07-05
Filing date: 1989-07-05
Publication date: 1991-02-19

Abstract

PURPOSE: To compress character code data by using a kanji phrase code dictionary and a processing scheme in which character codes are unified at 2 bytes and phrase codes at 3 bytes. CONSTITUTION: A code is a character code 9 when the most significant bit of its first byte is '0', and a kanji phrase code 10 when that bit is '1'. The character code 9 is 2 bytes long, and its remaining 15 bits are assigned to 32K characters. In the kanji phrase code 10, the remaining 15 bits of the first and second bytes are assigned to the 32K characters of the character code 9, and the third byte is assigned to the code of a kanji phrase that begins with that head character. Character strings of mixed kanji-kana (Japanese syllabary) text are converted into codes by a character code dictionary 3. A deciding means 4 searches the character string code part 8 of a kanji phrase code dictionary 6. When a matching code is found, the data in the code part 7 is read out; when no match is found, the character code read from the dictionary 3 is stored in a storage means 5 as is. The character code data is thereby compressed and converted.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a compression method for character code data, and in particular to a compression method for the character code data of kanji.

[Conventional technology]

In general, document-creation machines such as Japanese word processors handle text that mixes kanji and kana, so Japanese data is encoded using a standard such as JIS (C 6226, the kanji code system for information interchange). One Japanese character is represented by two bytes, and in most cases this character code is processed as is.

To improve the efficiency of this data processing, data compression is necessary. As prior art there is a character code data compression method (Japanese Patent Publication No. 61-232724) in which one or more bits of the first byte distinguish a 1-byte length from a 2-byte length, and the remaining bits are allocated to a word code dictionary. In addition, to increase the number of word codes, the first byte is used as an extension control code and the remaining two bytes are allocated to the word code dictionary. Character strings in the character code data are compressed and converted by means that match them against the word codes.

[Problem to be solved by the invention]

In the conventional character code data compression method described above, the dictionary is ordered by frequency of use, so searching for low-frequency words takes time. Furthermore, because both word codes and character codes are variable length, processing becomes complicated, words that cannot be compressed must also be taken into account, and the number of bytes can end up larger than the original.

The object of the present invention is to solve the above drawbacks and provide a compression method for character code data that can be searched easily, uses fixed lengths of 2 bytes per character and 3 bytes per kanji compound, and never exceeds the byte count of the original character code string.

[Means for Solving the Problems]

The character code data compression method of the present invention applies to systems, such as Japanese word processors, in which each character is generally represented by a fixed length of 2 bytes (16 bits) and a string of several characters forming one word is likewise stored using the 2-byte code data as is. Focusing on the kanji portions of Japanese text that mixes kanji and kana, the method provides a code dictionary of 3-byte kanji compound codes: for a kanji compound (a word formed by a combination of kanji characters), the first kanji is represented by 2 bytes, and the following 1 byte encodes the compound beginning with that first kanji. The on and kun readings of the first characters of the compounds are arranged in advance in gojūon (Japanese syllabary) order, and the dictionary is searched by the first kanji represented in 2 bytes. By means that match the 2-byte kanji character codes against the kanji compound codes, a string of kanji character codes is compressed and converted into kanji compound code data. Character strings that cannot be compressed because they are not registered as compounds are used as is, so that the byte count never exceeds the original.

〔Example〕

Next, the present invention will be explained in detail with reference to the drawings.

FIG. 1 is a functional block diagram of one embodiment of the present invention, in which 1 is an input means, 2 is a character code string temporary storage means, 3 is a character code dictionary, 4 is a discrimination means, 5 is a storage means, 6 is a kanji compound code dictionary, 7 is a kanji compound code section, and 8 is a character string code section.

FIG. 2 shows the structure of the character code and the kanji compound code. The two are distinguished by the most significant bit of the first byte: when it is "0" the code is a character code 9, and when it is "1" the code is a kanji compound code 10. The character code 9 is a 2-byte code whose remaining 15 bits are assigned to 32K characters. In the kanji compound code 10, the remaining 15 bits of the first and second bytes are assigned to the 32K characters of the character code 9, and the third byte is assigned to the code of a kanji compound that begins with that first character.
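The bit layout of FIG. 2 can be illustrated with a minimal Python sketch. The function names, the big-endian byte order, and the idea of treating the 15 bits as a plain integer index are assumptions made for illustration; the patent specifies only the flag bit, the 15-bit character field, and the 1-byte compound number. (JIS C 6226 two-byte codes use byte values 0x21 to 0x7E, so their most significant bit is already 0.)

```python
def pack_char(index: int) -> bytes:
    """2-byte character code: flag bit 0, then a 15-bit character index."""
    if not 0 <= index < 1 << 15:
        raise ValueError("character index must fit in 15 bits (32K characters)")
    return index.to_bytes(2, "big")


def pack_compound(head_index: int, number: int) -> bytes:
    """3-byte compound code: flag bit 1, 15-bit head character, 1-byte number."""
    if not 0 <= head_index < 1 << 15:
        raise ValueError("head character index must fit in 15 bits")
    if not 0 <= number < 256:
        raise ValueError("compound number must fit in one byte")
    return (0x8000 | head_index).to_bytes(2, "big") + bytes([number])


def unpack(data: bytes, pos: int = 0):
    """Decode one code at pos; returns (kind, index, number, length in bytes)."""
    index = ((data[pos] & 0x7F) << 8) | data[pos + 1]
    if data[pos] & 0x80:  # most significant bit set: kanji compound code
        return ("compound", index, data[pos + 2], 3)
    return ("char", index, None, 2)
```

Because the flag bit alone tells a decoder whether to consume 2 or 3 bytes, a stream of mixed codes can be walked from the start without any length table.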

FIG. 3 shows, as an example, kanji compounds that begin with the kanji 「日」. FIG. 4 shows part of the kanji compound code dictionary, including the compound code 12 and the corresponding character code string 13. The kanji compound code dictionary is arranged by kanji reading in gojūon order and is searched using the keyword 11.
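One way to picture a dictionary ordered by reading and searched by keyword is a sorted list with binary search, sketched below in Python. The entries, their readings, and the choice of a (reading, kanji) pair as the search key are hypothetical; the figures that show the actual dictionary contents are not reproduced in this text.

```python
import bisect

# Hypothetical entries: (reading of head kanji, head kanji, compounds starting with it).
ENTRIES = [
    ("にち", "日", ["日本", "日曜", "日時"]),
    ("かん", "漢", ["漢字"]),
    ("ぶん", "文", ["文字", "文章"]),
]
# Hiragana code-point order approximates gojūon order, which is enough for a sketch.
ENTRIES.sort(key=lambda e: (e[0], e[1]))
KEYS = [(reading, kanji) for reading, kanji, _ in ENTRIES]

# One byte of compound number allows at most 256 compounds per head kanji.
assert all(len(words) <= 256 for _, _, words in ENTRIES)


def find_compounds(reading: str, kanji: str) -> list:
    """Binary-search the sorted keys; return the compounds for this head kanji."""
    i = bisect.bisect_left(KEYS, (reading, kanji))
    if i < len(KEYS) and KEYS[i] == (reading, kanji):
        return ENTRIES[i][2]
    return []


def compound_number(reading: str, kanji: str, word: str):
    """Return the 1-byte compound number, or None if the word is not registered."""
    words = find_compounds(reading, kanji)
    return words.index(word) if word in words else None
```

Under this arrangement the cost of a lookup depends on the size of the dictionary and not on how often a compound is used, which is the property the description contrasts with a frequency-ordered dictionary.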

In use, referring to FIG. 1, a character string of mixed kanji and kana text entered from the input means 1 is encoded into the corresponding character codes by the character code dictionary 3 and read into the character code string temporary storage means 2. The discrimination means 4 then searches the character string code section 8 of the kanji compound code dictionary 6 for a string matching the character code string that was read. If a match is found, the data of the code section 7 for that kanji compound is read out and stored in the storage means 5. If there is no match, the character code read from the character code dictionary 3 is stored in the storage means 5 as is. The data is compressed and converted in this way. Thereafter, processing such as editing, output, storage, and transmission is performed in this code system.
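The match-or-pass-through flow described here can be sketched end to end in Python. This is an illustration under stated assumptions, not the patented implementation: the `COMPOUNDS` table is hypothetical, Python's `euc_jp` codec stands in for the character code dictionary (clearing the top bit of each EUC-JP byte yields the 2-byte JIS code), and preferring the longest matching compound is a choice the patent does not specify.

```python
# Hypothetical compound dictionary: head kanji -> registered compounds (max 256 each).
COMPOUNDS = {"日": ["日本", "日曜日", "日時"], "漢": ["漢字"]}


def jis_code(ch: str) -> bytes:
    """Stand-in for the character code dictionary: 2-byte JIS code of one character."""
    raw = ch.encode("euc_jp")
    if len(raw) != 2 or raw[0] < 0xA1:
        raise ValueError(f"{ch!r} is not a 2-byte JIS character")
    return bytes(b & 0x7F for b in raw)


def compress(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        head = jis_code(text[i])
        words = COMPOUNDS.get(text[i], [])
        # Assumed policy: take the longest registered compound starting here.
        match = max((w for w in words if text.startswith(w, i)), key=len, default=None)
        if match is None:
            out += head                      # no match: keep the 2-byte code as is
            i += 1
        else:
            out += bytes([head[0] | 0x80, head[1], words.index(match)])
            i += len(match)                  # match: emit one 3-byte compound code
    return bytes(out)


def decompress(data: bytes) -> str:
    heads = {jis_code(k): v for k, v in COMPOUNDS.items()}
    out = []
    i = 0
    while i < len(data):
        code = bytes([data[i] & 0x7F, data[i + 1]])
        if data[i] & 0x80:                   # flag bit set: 3-byte compound code
            out.append(heads[code][data[i + 2]])
            i += 3
        else:                                # flag bit clear: 2-byte character code
            out.append(bytes(b | 0x80 for b in code).decode("euc_jp"))
            i += 2
    return "".join(out)
```

For example, `compress("日本の漢字")` turns five characters (10 bytes in 2-byte code) into 8 bytes, and `decompress` restores the original string. Since every registered compound has at least two characters, a 3-byte compound code always replaces 4 or more bytes, so the output cannot be longer than the input.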

FIG. 5 is an example of compression conversion, showing a JIS 2-byte code character string 14 and the compression result 15 after processing. In this example a 28-byte code becomes 21 bytes, that is, it is compressed to 0.75 of its original size.
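The arithmetic behind such a figure follows directly from the fixed code lengths: an n-character compound shrinks from 2n bytes to 3 bytes, and every other character stays at 2 bytes. The short Python sketch below computes both sizes. The breakdowns used in the usage note are hypothetical, since the content of FIG. 5 is not reproduced in this text; they are only combinations that happen to yield the stated 28 and 21 bytes.

```python
def sizes(n_chars: int, compound_lengths: list) -> tuple:
    """Return (original bytes, compressed bytes) for a text of n_chars characters
    in which the listed compounds (given by their lengths in characters) matched."""
    in_compounds = sum(compound_lengths)
    if in_compounds > n_chars or any(n < 2 for n in compound_lengths):
        raise ValueError("compounds must have 2+ characters and fit in the text")
    original = 2 * n_chars
    compressed = 3 * len(compound_lengths) + 2 * (n_chars - in_compounds)
    return original, compressed
```

For instance, 14 characters made up of seven 2-character compounds give `sizes(14, [2] * 7)`, which is (28, 21), a ratio of 0.75; one 2-character and two 3-character compounds among 14 characters give the same totals.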

In this way, a character code that cannot be compressed still represents one character in 2 bytes, while a compressed kanji compound carries the kanji character code in its first and second bytes and, in its third byte, the code (number) of the compound beginning with that kanji. Because codes are unified as 2-byte character codes and 3-byte kanji compound codes, processing is simple. In addition, the data never becomes longer than the original. Because the kanji readings are sorted in gojūon order and each individual kanji is used as a keyword, even low-frequency compounds can be found quickly. Furthermore, since up to 256 compounds beginning with one kanji can be registered, the method is sufficient for practical use, and because the compressed data is smaller, processing and transmission become faster and less storage capacity is needed.

〔Effect of the invention〕

As explained above, the character code data compression method of the present invention provides a kanji compound code dictionary dedicated to kanji and sorted in gojūon order, and adopts a processing scheme in which character codes are unified at 2 bytes and compound codes at 3 bytes. As a result, the dictionary can be searched easily, codes have fixed lengths of 2 bytes per character and 3 bytes per compound, and character code data can be compressed without ever exceeding the byte count of the original character code string.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention, FIG. 2 shows the structure of the character code and the kanji compound code, FIG. 3 shows examples of kanji compounds beginning with the kanji 「日」, FIG. 4 shows (part of) the kanji compound code dictionary, and FIG. 5 shows an example of compression conversion.

1... input means, 2... character code string temporary storage means, 3... character code dictionary, 4... discrimination means, 5... storage means, 6... kanji compound code dictionary, 7... kanji compound code section, 8... character string code section, 9... character code, 10... kanji compound code, 11... keyword, 12... compound code, 13... character string code, 14... JIS 2-byte code, 15... compression result after processing.

Claims

[Claims]

A compression method for character code data, for use where Japanese is processed by a Japanese word processor or the like, each character is generally represented by a fixed length of 2 bytes (16 bits), and a string of several characters expressing one word likewise uses the 2-byte code data as is, the method being characterized in that: focusing on the kanji portions of Japanese text that mixes kanji and kana, a code dictionary of 3-byte kanji compound codes is provided in which, for a kanji compound (a word formed by a combination of kanji characters), the first kanji is represented by 2 bytes and the next 1 byte encodes the compound beginning with that first kanji; the on and kun readings of the first characters of the kanji compounds are arranged in advance in gojūon order, and the dictionary is searched by the first kanji represented in 2 bytes; a kanji character code data string is compressed and converted into kanji compound code data by means that match the 2-byte kanji character codes against the kanji compound codes; and character strings that cannot be compressed because they are not registered as compounds are used as is, so that the number of bytes does not exceed the original.