JPH0338772A - Compression system for character code data - Google Patents

Compression system for character code data

Info

Publication number
JPH0338772A
Authority
JP
Japan
Prior art keywords
kanji
character
code
bytes
codes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP1175208A
Other languages
Japanese (ja)
Inventor
Isao Kondo
勲 近藤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Office Systems Ltd
Original Assignee
NEC Office Systems Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Office Systems Ltd
Priority to JP1175208A
Publication of JPH0338772A

Abstract

PURPOSE: To compress character code data by using a kanji compound-word (phrase) code dictionary and a processing scheme in which character codes are unified at 2 bytes and phrase codes at 3 bytes. CONSTITUTION: A code is a character code 9 when the most significant bit of its first byte is '0' and a kanji phrase code 10 when that bit is '1'. A character code 9 is 2 bytes long, and its remaining 15 bits address up to 32K characters. In a kanji phrase code 10, the remaining 15 bits of the first and second bytes identify the head character of the phrase using the same 32K-character numbering as the character codes 9, and the third byte selects one of the kanji phrases that start with that head character. The characters of a mixed kanji-kana (Japanese syllabary) sentence are converted to character codes by a character code dictionary 3. A deciding means 4 then searches the character string code part 8 of a kanji phrase code dictionary 6. When a matching entry is found, the data in the code part 7 is read out; when no match is found, the character code read from the dictionary 3 is stored in a storage means 5 unchanged. The character code data is thereby compressed.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a compression method for character code data, and in particular to a compression method for the character code data of kanji.

[Prior Art]

In general, document preparation machines such as Japanese word processors handle sentences that mix kanji and kana. Japanese data is therefore represented using the JIS standard (C 6226, kanji code system for information interchange) or a similar standard, with each Japanese character expressed in 2 bytes, and in most cases these character codes are processed as they are.

To make this data processing more efficient, the data must be compressed. One conventional technique is a character code data compression method (Japanese Patent Publication No. 61-232724). In that method, one or more bits of the first byte distinguish 1-byte codes from 2-byte codes, and the remaining bits are assigned to a word code dictionary. To increase the number of word codes, the first byte can also be used as an extension control code, with the following 2 bytes assigned to the word code dictionary. Character strings in the character code data are compressed by a means that matches them against the word codes.

[Problem to Be Solved by the Invention]

In the conventional character code data compression method described above, the dictionary is ordered by frequency of use, so searching for low-frequency words takes time. In addition, because both word codes and character codes are variable in length, processing becomes complicated, words that cannot be compressed must be handled as a special case, and the result can even be larger than the original number of bytes.

The object of the present invention is to solve the above drawbacks and to provide a compression method for character code data that allows easy lookup, uses fixed lengths of 2 bytes per character and 3 bytes per kanji compound word, and never exceeds the byte count of the original character code string.

[Means for Solving the Problem]

The present invention concerns character code data such as that handled by a Japanese word processor, in which each character is generally represented by a fixed length of 2 bytes (16 bits) and a string of several characters forming one word is likewise stored as 2-byte code data without change. The compression method focuses on the kanji portions of mixed kanji-kana Japanese text. For kanji compound words, that is, words formed by combining kanji characters, it provides a code dictionary of 3-byte kanji compound word codes, in which the head kanji is represented by 2 bytes and the following 1 byte encodes a compound word beginning with that head kanji. The readings (on and kun) of the head characters of the compound words are arranged in advance in gojūon (Japanese syllabary) order, and the dictionary is searched by the 2-byte head kanji. A means for matching the 2-byte kanji character codes against the compound word codes compresses a kanji character code string into kanji compound word code data. Character strings that cannot be compressed because they are not registered as compound words are used as they are, so the number of bytes never exceeds the original.

[Embodiment]

Next, the present invention will be described in detail with reference to the drawings.

FIG. 1 is a functional block diagram of an embodiment of the present invention, in which 1 is an input means, 2 is a temporary storage means for character code strings, 3 is a character code dictionary, 4 is a discrimination means, 5 is a storage means, 6 is a kanji compound word code dictionary, 7 is a kanji compound word code section, and 8 is a character string code section. FIG. 2 shows the structure of the character code and the kanji compound word code. A code is a character code 9 when the most significant bit of its first byte is "0" and a kanji compound word code 10 when that bit is "1". The character code 9 is a 2-byte code whose remaining 15 bits are assigned to 32K characters. In the kanji compound word code 10, the remaining 15 bits of the first and second bytes are assigned to the same 32K characters as the character code 9, and the third byte is assigned to the code of a kanji compound word beginning with that head character.
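
The two code formats of FIG. 2 can be illustrated with a short sketch. The function names and the sample value 0x467C are assumptions made for illustration only; the source specifies just the bit layout (flag bit, 15-bit character number, 1-byte compound word number).

```python
# Sketch of the two code formats described for FIG. 2.
# All names here are hypothetical; only the bit layout comes from the text.

def pack_char(char_no: int) -> bytes:
    """2-byte character code: MSB of byte 1 is 0, the other 15 bits hold the character number."""
    if not 0 <= char_no < 0x8000:
        raise ValueError("character number must fit in 15 bits")
    return char_no.to_bytes(2, "big")

def pack_compound(head_char_no: int, index: int) -> bytes:
    """3-byte compound word code: MSB of byte 1 is 1, the other 15 bits identify the
    head character, and byte 3 selects a compound word starting with that character."""
    if not 0 <= head_char_no < 0x8000:
        raise ValueError("head character number must fit in 15 bits")
    if not 0 <= index < 256:
        raise ValueError("compound word number must fit in 1 byte")
    return (0x8000 | head_char_no).to_bytes(2, "big") + bytes([index])

def is_compound(first_byte: int) -> bool:
    """The most significant bit of the first byte tells the two formats apart."""
    return bool(first_byte & 0x80)

# Illustrative 15-bit character number (assumed sample value).
plain = pack_char(0x467C)
compound = pack_compound(0x467C, 3)
```

Because the flag sits in the first byte, a reader of the stream knows after one byte whether 2 or 3 bytes belong to the current code.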

FIG. 3 shows, as an example of kanji compound words, compound words that begin with the kanji "日". FIG. 4 shows part of the kanji compound word code dictionary, with compound word codes 12 and character string codes 13. The dictionary is arranged by the readings of the kanji in gojūon (Japanese syllabary) order and is searched using a keyword 11.
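
One possible way to picture a dictionary ordered by reading and searched by keyword is sketched below. The row layout, the sample entries, and the use of plain string comparison on hiragana as a stand-in for gojūon ordering are all assumptions of this sketch; the source does not give the dictionary's internal structure.

```python
import bisect

# Hypothetical dictionary rows: (reading of head kanji, head kanji, compound words).
# Rows are kept sorted by reading. Comparing hiragana strings by code point is used
# here only as an approximation of gojuon order.
ROWS = sorted([
    ("にち", "日", ["日本", "日時", "日常"]),
    ("ほん", "本", ["本日", "本文"]),
    ("かん", "漢", ["漢字"]),
])
READINGS = [row[0] for row in ROWS]

def find_compounds(reading: str, head: str) -> list:
    """Binary-search the sorted readings, then scan rows sharing that reading
    for the requested head kanji (the keyword)."""
    i = bisect.bisect_left(READINGS, reading)
    while i < len(ROWS) and ROWS[i][0] == reading:
        if ROWS[i][1] == head:
            return ROWS[i][2]
        i += 1
    return []

entries = find_compounds("にち", "日")
```

With a sorted table, lookup cost depends on the table size rather than on how frequently a word is used, which is the property the text contrasts with a frequency-ordered dictionary.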

In use, as shown in FIG. 1, a mixed kanji-kana character string entered through the input means 1 is converted into the corresponding character codes by the character code dictionary 3 and read out into the temporary storage means 2 for character code strings. The discrimination means 4 then searches the character string code section 8 of the kanji compound word code dictionary 6 for the character code string that was read out. If a match is found, the data in the code section 7 for that compound word is read out and stored in the storage means 5; if no match is found, the character codes read from the character code dictionary 3 are stored in the storage means 5 unchanged. Compression is performed in this way. Subsequent processing such as editing, output, storage, and transmission is carried out in this code system.
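
The procedure above can be sketched roughly as follows. This is not the patent's implementation: the dictionary contents are invented, the character numbers are derived from Python's `euc_jp` codec as a convenient stand-in for the character code dictionary 3, and preferring the longest registered compound word is a choice of this sketch that the source does not state.

```python
def jis_code(ch: str) -> int:
    """15-bit character number, taken here from the EUC-JP encoding with the
    high bits cleared (an assumption standing in for character code dictionary 3)."""
    b = ch.encode("euc_jp")
    if len(b) != 2 or b[0] < 0xA1:
        raise ValueError(f"not a 2-byte character: {ch!r}")
    return int.from_bytes(b, "big") & 0x7F7F

# Hypothetical compound word dictionary: head kanji -> compound words.
# The list position plays the role of the third-byte number.
COMPOUNDS = {"日": ["日本", "日本語", "日時"], "漢": ["漢字"]}

def compress(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        head = text[i]
        # Longest registered compound word starting here, if any (sketch's choice).
        candidates = [w for w in COMPOUNDS.get(head, []) if text.startswith(w, i)]
        best = max(candidates, key=len, default=None)
        if best is not None:
            index = COMPOUNDS[head].index(best)
            out += (0x8000 | jis_code(head)).to_bytes(2, "big") + bytes([index])
            i += len(best)
        else:
            # No match: keep the 2-byte character code unchanged.
            out += jis_code(head).to_bytes(2, "big")
            i += 1
    return bytes(out)

packed = compress("日本語の漢字")  # 6 characters, 12 bytes uncompressed
```

Since a compound word has at least two characters (4 bytes) and is replaced by 3 bytes, while unmatched characters stay at 2 bytes, the output of this sketch cannot be longer than the 2-byte-per-character input.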

FIG. 5 is an example of compression, in which a JIS 2-byte code character string 14 is processed to give the compressed result 15. In this example, 28 bytes of code become 21 bytes, a compression ratio of 0.75.
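
The arithmetic behind these figures can be checked with a small sketch. The breakdown of the 14 characters into compound words is hypothetical, since the content of FIG. 5 is not reproduced in the text; only the totals of 28 bytes, 21 bytes, and 0.75 come from the source.

```python
def compressed_size(total_chars: int, compound_lengths: list) -> int:
    """Bytes after compression when compound words of the given lengths
    (in characters) are each replaced by one 3-byte code and every other
    character keeps its 2-byte code."""
    chars_in_compounds = sum(compound_lengths)
    return 2 * (total_chars - chars_in_compounds) + 3 * len(compound_lengths)

original = 2 * 14  # 14 characters at 2 bytes each = 28 bytes

# Hypothetical breakdown: two 3-character and one 2-character compound word.
compressed = compressed_size(14, [3, 3, 2])
ratio = compressed / original
```

An n-character compound word saves 2n - 3 bytes, so several different breakdowns (for example, seven 2-character compound words) reach the same 21-byte total.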

As described above, a character code that cannot be compressed continues to represent one character in 2 bytes, while a compressed kanji compound word carries the kanji character code in its first and second bytes and, in its third byte, the code (number) of a compound word beginning with that kanji. Because the codes are unified at 2 bytes for character codes and 3 bytes for kanji compound word codes, processing is simple. In addition, the data never becomes longer than the original. Sorting by kanji reading in gojūon order and using each individual kanji as a keyword allows even low-frequency compound words to be found quickly. Up to 256 compound words can be registered for each head kanji, which is sufficient for practical use. Finally, because compression reduces the amount of data, processing and transmission become faster and the required storage capacity is reduced.
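
To illustrate why the unified 2-byte and 3-byte codes keep processing simple, here is a sketch of reading such a stream back. The source does not describe a decoder, so this is an assumption throughout: the dictionary contents are invented, and characters are recovered through Python's `euc_jp` codec as a stand-in for the character code dictionary.

```python
# Hypothetical compound word dictionary: head kanji -> compound words,
# where the list position is the third-byte number.
COMPOUNDS = {"日": ["日本", "日本語", "日時"], "漢": ["漢字"]}

def char_from_code(code: int) -> str:
    """Map a 15-bit character number back to a character via EUC-JP
    (an assumption of this sketch, not part of the source)."""
    return (code | 0x8080).to_bytes(2, "big").decode("euc_jp")

def decompress(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        code = int.from_bytes(data[i:i + 2], "big")
        if code & 0x8000:
            # Flag bit set: 3-byte compound word code.
            head = char_from_code(code & 0x7FFF)
            out.append(COMPOUNDS[head][data[i + 2]])
            i += 3
        else:
            # Flag bit clear: ordinary 2-byte character code.
            out.append(char_from_code(code))
            i += 2
    return "".join(out)

word = decompress(bytes([0xC6, 0x7C, 0x02]))
```

The loop needs no look-ahead or escape handling: the first bit of each code decides whether to advance by 2 or by 3 bytes.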

[Effect of the Invention]

As described above, the character code data compression method of the present invention provides a kanji compound word code dictionary dedicated to kanji characters and sorted in gojūon (Japanese syllabary) order, and adopts a processing scheme in which character codes are unified at 2 bytes and compound word codes at 3 bytes. As a result, lookup is easy, and character code data can be compressed with fixed lengths of 2 bytes per character and 3 bytes per compound word without ever exceeding the byte count of the original character code string.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention; FIG. 2 shows the structure of the character code and the kanji compound word code; FIG. 3 shows examples of kanji compound words beginning with the kanji "日"; FIG. 4 shows (part of) the kanji compound word code dictionary; and FIG. 5 shows an example of compression.

1: input means; 2: temporary storage means for character code strings; 3: character code dictionary; 4: discrimination means; 5: storage means; 6: kanji compound word code dictionary; 7: kanji compound word code section; 8: character string code section; 9: character code; 10: kanji compound word code; 11: keyword; 12: compound word code; 13: character string code; 14: JIS 2-byte code; 15: compressed result after processing.

Claims (1)

[Claims]

A character code data compression method for use where, in Japanese-language processing on a Japanese word processor or the like, each character is generally represented by a fixed length of 2 bytes (16 bits) and a string of several characters forming one word is likewise stored as 2-byte code data without change, the method being characterized in that: attention is paid to the kanji portions of mixed kanji-kana Japanese text; for kanji compound words, which are words formed by combining kanji characters, a code dictionary of 3-byte kanji compound word codes is provided in which the head kanji is represented by 2 bytes and the following 1 byte encodes a compound word beginning with that head kanji; the readings (on and kun) of the head characters of the compound words are arranged in advance in gojūon (Japanese syllabary) order and the dictionary is searched by the 2-byte head kanji; a means for matching the 2-byte kanji character codes against the compound word codes compresses a kanji character code string into kanji compound word code data; and character strings that cannot be compressed because they are not registered as compound words are used as they are, so that the number of bytes does not exceed the original.
JP1175208A 1989-07-05 1989-07-05 Compression system for character code data Pending JPH0338772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP1175208A JPH0338772A (en) 1989-07-05 1989-07-05 Compression system for character code data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP1175208A JPH0338772A (en) 1989-07-05 1989-07-05 Compression system for character code data

Publications (1)

Publication Number Publication Date
JPH0338772A true JPH0338772A (en) 1991-02-19

Family

ID=15992185

Family Applications (1)

Application Number Title Priority Date Filing Date
JP1175208A Pending JPH0338772A (en) 1989-07-05 1989-07-05 Compression system for character code data

Country Status (1)

Country Link
JP (1) JPH0338772A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07282040A (en) * 1994-04-13 1995-10-27 Nec Commun Syst Ltd Japanese information compression system
US5921792A (en) * 1994-03-10 1999-07-13 The Whitaker Corporation Card connector and card-ejecting mechanism

Similar Documents

Publication Publication Date Title
Lewis et al. Syntax-directed transduction
JP3300866B2 (en) Method and apparatus for preparing text for use by a text processing system
JPH08194719A (en) Retrieval device and dictionary and text retrieval method
EP3276507B1 (en) Encoding device, encoding method and search method
US20040225497A1 (en) Compressed yet quickly searchable digital textual data format
JPH0338772A (en) Compression system for character code data
KR100326634B1 (en) Device and method of storing text data, device and method of searching text data, recording medium containing a program for storing the text data and recording medium containing a program for searching text data
JPH056398A (en) Document register and document retrieving device
JPH0546358A (en) Compressing method for text data
EP0638187B1 (en) Categorizing strings in character recognition
CN112800722B (en) Text organization coding method based on semantic understanding
JP2785168B2 (en) Electronic dictionary compression method and apparatus for word search
JPH04223556A (en) Compression system for character code data
JPH0140370B2 (en)
JP6784084B2 (en) Coding program, coding device, coding method, and search method
JPS6057421A (en) Documentation device
JP2004013680A (en) Character code compression/decompression device and method
JPS61232724A (en) Compressing system for character code data
JP2005275880A (en) Device, method and program for converting word and phrase into data
JPS62214468A (en) Kana-kanji converter
JPH01194065A (en) Document processor
JPH03282961A (en) Mutual conversion dictionary system
JPH07319895A (en) Device and method for retrieving document
JPH05181900A (en) Proper noun processing device
JPH0721798B2 (en) Language processor