JPS6268325A

JPS6268325A - Sentence compression and expansion system

Info

Publication number: JPS6268325A
Application number: JP60206625A
Authority: JP
Inventors: Etsuaki Kurosaki; 黒崎　悦明
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1985-09-20
Filing date: 1985-09-20
Publication date: 1987-03-28

Abstract

PURPOSE: To compress coded text by sequentially extracting, from a stored character code string, character code strings of a fixed number of characters, searching a dictionary for each, and outputting the corresponding word code when a match is found. CONSTITUTION: The character code string of the mixed kanji-kana text to be compressed is fed sequentially from input port 1 to input register section 2. The register length can be set to 2, 3, or 4 characters according to the chosen length of the conversion word unit. The two-character code string held in input register section 2 is fed to word search section 3, which searches the words stored in electronic dictionary storage section 9; when a match is found, the word code assigned to that word is read out and output to output register section 5 via compression control section 4. For expansion, the code held in compressed code input register section 12 is fed to code search section 13, which searches the word codes stored in electronic dictionary storage section 9; the two-character headword corresponding to a matching code is read out and sent through path 16, via expansion control section 14, to original text output register section 15.

Description

[Detailed Description of the Invention]

(Industrial Application Field) The present invention relates to a method for compressing and expanding encoded text. In particular, it relates to a Japanese text compression/expansion method for use in kanji data output equipment or data transmission/reception terminals that handle Japanese, which compresses the character code string of text entered from an input device such as a keyboard, OCR, or file device, and restores an already compressed code string to the original input character code string.

(Prior Art) For compressing character data in character code strings, consecutive-character (run-length) compression has conventionally been the common approach (see, for example, Ikuma, Nishimura et al., "Coding for realizing network protocols," Proceedings of the Showa 55 (1980) Joint Convention of the Four Electrical Institutes [5], October 1980, pp. 5-101 to 5-102). This method uses compression control codes: when the same character occurs p times in succession, a single control character indicating a run of p characters is inserted immediately before one copy of that character. When the run ends and the text again varies from character to character, a different control character is inserted to mark the change.

(Problems to Be Solved by the Invention) However, as is readily apparent, this conventional method is effective only when the same character occurs four or more times in succession. Because that condition rarely arises in ordinary documents, the compression achieved is extremely small.
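
The break-even point of the conventional method can be illustrated with a minimal Python sketch. It assumes that a run costs three codes (a run-start control code carrying p, one copy of the character, and a run-end control code). The tuple markers `("REPEAT", p)` and `("END",)` and the name `rle_encode` are hypothetical stand-ins for illustration, not the format used in the cited paper.

```python
from itertools import groupby


def rle_encode(text, min_run=4):
    """Run-length encode text; runs shorter than min_run are copied verbatim."""
    out = []
    for ch, group in groupby(text):
        p = len(list(group))  # length of the run of identical characters
        if p >= min_run:
            # Assumed layout: start marker with count, the character, end marker.
            out.extend([("REPEAT", p), ch, ("END",)])
        else:
            out.extend(ch * p)  # too short to pay for the 3-code overhead
    return out


print(len(rle_encode("ああああ")))      # a run of four identical characters
print(len(rle_encode("御要望に答え")))  # ordinary text with no runs
```

Under this assumed layout a run of three characters would cost as many codes as it saves, which is why ordinary prose, where such runs are rare, passes through essentially uncompressed.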

The object of the present invention is to eliminate this drawback and to provide a text compression/expansion method that can efficiently compress and expand arbitrary text regardless of whether characters repeat.

(Means for Solving the Problems) The invention set forth in claim 1 (hereinafter, the first invention) comprises the following four means.

The first means stores in advance a plurality of character code strings of words, each consisting of a predetermined number of characters, each in correspondence with a word code unique to that word.

The second means temporarily stores the character code string obtained by encoding the text to be compressed character by character.

The third means sequentially extracts from the second means a character code string of the same number of characters as the predetermined number, and searches whether that character code string matches the character code string of a word stored in the first means.

The fourth means, when the search by the third means finds a match, outputs the word code unique to the word of that character code string, as obtained from the first means. When there is no match, it outputs the leading character code of that character code string unchanged from the second means.

The invention set forth in claim 2 (hereinafter, the second invention) likewise comprises the following four means.

The first means temporarily stores a compressed character code string. This string is obtained by taking the character code string produced by encoding text character by character and replacing any predetermined word of the predetermined number of characters found in it with the word code unique to that word.

The second means stores in advance a plurality of character code strings of words consisting of the predetermined number of characters, each in correspondence with a word code unique to that word.

The third means extracts the compressed character code string stored in the first means one code at a time and searches whether each code matches a word code stored in the second means.

The fourth means, when the search by the third means finds a match, outputs the character code string of the word corresponding to that code, as obtained from the second means. When there is no match, it outputs that code unchanged from the first means.

(Operation) The first invention operates as follows.

The character code string obtained by encoding the text is temporarily stored in the second means. The third means sequentially extracts, from the character code string stored in the second means, a character code string with the same number of characters as the words stored in the first means. The third means then searches whether the word represented by the extracted character code string matches a word stored in advance in the first means. If the search finds a match, the fourth means outputs the word code corresponding to that character code string, taken from the first means; if there is no match, it outputs the leading character code of that character code string unchanged. As a result, the encoded text is compressed.

The second invention operates as follows.

The compressed character code string produced by the first invention is temporarily stored in the first means. The third means extracts the compressed character code string stored in the first means one code at a time and searches whether each code matches a word code stored in the second means. If the search finds a match, the code is a word code, and the fourth means outputs the character code string of the corresponding word, as obtained from the second means. If there is no match, the code is a character code, and the fourth means outputs it unchanged from the first means. As a result, the compression-encoded text is expanded.

(Embodiment) Before the present invention is described in detail, the background to the invention is explained.

More than several thousand kinds of kanji are used in Japanese. Each character is therefore encoded with a code of 10 bits or more. For example, 15-bit encoding can represent up to 2^15 (= 32,768) characters, and 16-bit encoding up to 2^16 (= 65,536) characters. The present invention can be configured for any code width, but for ease of explanation the description below assumes 16-bit codes.

The kanji in everyday use, even including kana and special symbols, number at most about 10,000. If the 16-bit scheme is adopted, the remaining 50,000-plus of the 65,536 codes can therefore be put to other uses. Meanwhile, in ordinary Japanese documents, independent words and function words of two or more characters (e.g., 会議 "meeting", ロボット "robot", 美しい "beautiful", らしい, でした, ...) generally occur far more often than single-character independent words and function words (e.g., 家, 水, や, が, ...) and special symbols.

Exploiting this property, substantial text compression can be achieved by a mechanism that converts the input into a mixed character/word code string. Words of two or more characters are extracted from the character code string arriving sequentially from the input device, and the portion of the character code string forming each such word is replaced, using an electronic dictionary, with the 16-bit word code assigned to that word in advance. Input character codes not judged to be a constituent of any word are output unchanged.

On the expanding side, the mixed character/word code string is received. Single character codes are output unchanged, while each word code is replaced with a character code string of two or more characters using an electronic dictionary in which the same words as on the sending side are registered. This restores the original input character string as it was before compression.

FIG. 1 is a block diagram showing embodiments of the first and second inventions. In the figure, the portion enclosed by dash-dot line A is the compression mechanism, and the portion enclosed by dash-double-dot line B is the expansion mechanism.

The compression mechanism is described first. Reference numeral 2 denotes an input register section, to which the character code string of the mixed kanji-kana text to be compressed is supplied sequentially from input port 1.

The register length can be set to two, three, four, or some other number of characters according to the chosen length of the conversion word unit. For ease of explanation, a length of two characters is assumed below.

The two-character code string held in input register section 2 is supplied to word search section 3, which searches the words stored in electronic dictionary storage section 9. If a match is found, the word code assigned to that word is read out and output to output register section 5 via compression control section 4. If no match is found, input register section 2 is shifted left by one character and the next character is supplied from input port 1, while the leftmost character that was shifted out passes through path 7 and is sent, unconverted, via compression control section 4 to output register section 5.

Now consider compressing the sentence "御要望に答え..." ("In response to your request..."). Assume that electronic dictionary storage section 9 holds a dictionary of word-to-word-code correspondences as shown in FIG. 2. The first two characters, "御要", are placed in input register section 2. Word search section 3 reads the headword entries of electronic dictionary storage section 9 one after another and compares each with the "御要" in input register section 2. As FIG. 2 shows, no word "御要" is found, so a not-found signal is given to compression control section 4. The first character "御" of input register section 2 is output to output register section 5 via compression control section 4, and the next character "望" is supplied from input port 1, so that input register section 2 now holds "要望". This time, as FIG. 2 shows, "要望" is found, so a signal to that effect is given to compression control section 4, and the word code "03" attached to "要望" in the dictionary is output through path 6, via compression control section 4, to output register section 5.

Next, input register section 2 is shifted two characters to the left, and in place of the two characters "要望" held until now, the next two characters "に答" are supplied from input port 1. Conversion then proceeds in the same manner.
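
The two-character register procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented circuit: character codes are modelled with `ord()`, word codes are drawn from a hypothetical spare range starting at 0xE000, the dictionary contents other than "要望" are invented, and a hash lookup stands in for the sequential headword search performed by word search section 3.

```python
WORD_LEN = 2  # register length used in the embodiment

# Hypothetical dictionary: two-character headword -> 16-bit word code.
# 0xE003 stands in for the word code "03" of FIG. 2; the base 0xE000 is assumed.
DICTIONARY = {"要望": 0xE003, "会議": 0xE004}


def compress(text, dictionary=DICTIONARY, word_len=WORD_LEN):
    """Return a mixed list of character codes and word codes."""
    out = []
    i = 0
    while i < len(text):
        window = text[i:i + word_len]      # contents of the input register
        code = dictionary.get(window)      # lookup replaces the sequential search
        if len(window) == word_len and code is not None:
            out.append(code)               # match: emit the word code
            i += word_len                  # shift the register by a whole word
        else:
            out.append(ord(text[i]))       # no match: emit leftmost character as is
            i += 1                         # shift the register by one character
    return out


codes = compress("御要望に答え")
print([hex(c) for c in codes])
```

For the example sentence this yields five codes for six input characters: "御" passes through, "要望" collapses to one word code, and the remaining characters are unchanged.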

The expansion mechanism is described next. Reference numeral 12 denotes a 16-bit compressed code input register section, to which the compressed code string is supplied sequentially, one code (16 bits) at a time, from compressed-text supply port 11. The code held in compressed code input register section 12 is supplied to code search section 13, which searches the word codes stored in electronic dictionary storage section 9. If a matching word code is found, the two-character headword corresponding to that code is read out and sent through path 16, via expansion control section 14, to original text output register section 15.

Consider the case in which the compressed code string for the earlier example sentence "御要望に答え..." is supplied sequentially from port 11. The code for "御" is not among the word codes in electronic dictionary storage section 9, so it is output unchanged to original text output register section 15 via path 17. Next, the word code "03" is supplied to register 12. Code search section 13 searches the word codes stored in electronic dictionary storage section 9. In this case, as FIG. 2 shows, the word code "03" is found, so a signal to that effect is given to expansion control section 14, and the two characters "要望" in the headword portion of electronic dictionary storage section 9 are read out and output through path 16, via expansion control section 14, to original text output register section 15. Repeating the same procedure expands the code string to the original pre-compression text "御要望に答え...".
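
The expansion side can be sketched in the same spirit. As before, this is an assumption-laden illustration: character codes are modelled as Unicode code points, the word code value 0xE003 and the table name `CODE_TO_WORD` are hypothetical, and a hash lookup replaces the sequential search of code search section 13. The only property the sketch relies on is the one the description implies, namely that word codes occupy a range not used by any character code.

```python
# Hypothetical inverse dictionary: 16-bit word code -> headword.
# 0xE003 stands in for the word code "03" of FIG. 2.
CODE_TO_WORD = {0xE003: "要望", 0xE004: "会議"}


def expand(codes, code_to_word=CODE_TO_WORD):
    """Rebuild the original text from a mixed character/word code list."""
    out = []
    for code in codes:
        word = code_to_word.get(code)
        if word is not None:
            out.append(word)        # word code: emit the stored headword
        else:
            out.append(chr(code))   # character code: pass through unchanged
    return "".join(out)


compressed = [ord("御"), 0xE003, ord("に"), ord("答"), ord("え")]
print(expand(compressed))
```

Because a code is treated as a word code exactly when it appears in the table, both sides must hold identical dictionaries for the round trip to be lossless.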

The present invention has been described above on the basis of embodiments. The description assumed that the words stored in electronic dictionary storage section 9 are of fixed length. The same applies, however, if, for example, the maximum word length is set to four characters, words of two, three, and four characters are stored, and the widely known longest-match method is used to search preferentially for the longest word and replace it with a single 16-bit code.

With regard to text compression, words not held in electronic dictionary storage section 9 are simply output unconverted, so the vocabulary to be stored in electronic dictionary storage section 9 may be chosen freely. In particular, storing high-frequency words preferentially yields more effective compression.
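
The longest-match variant mentioned above might look like the following Python sketch. The function name `compress_longest`, the sample dictionary, and the word code values (taken from an assumed spare range starting at 0xE000) are all hypothetical. At each position the longest candidate is tried first, so a four-character entry wins over a two-character entry that is its prefix.

```python
def compress_longest(text, dictionary, max_len=4, min_len=2):
    """Greedy longest-match replacement of dictionary words by word codes."""
    out = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        # Try candidate lengths from longest to shortest.
        for n in range(longest, min_len - 1, -1):
            code = dictionary.get(text[i:i + n])
            if code is not None:
                out.append(code)      # replace the whole word with one code
                i += n
                break
        else:
            out.append(ord(text[i]))  # no word starts here: pass character through
            i += 1
    return out


# Hypothetical entries of mixed length; "ロボ" is a prefix of "ロボット".
sample = {"ロボ": 0xE010, "ロボット": 0xE011, "会議": 0xE012}
print([hex(c) for c in compress_longest("ロボット会議", sample)])
```

Here six input characters reduce to two codes, because "ロボット" is matched as a single four-character word rather than as "ロボ" followed by two literal characters.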

(Effects of the Invention) As described above, the present invention provides a text compression/expansion method that can efficiently compress and expand arbitrary text regardless of whether characters repeat.

[Brief explanation of drawings]

FIG. 1 is a block diagram of an embodiment of the present invention, and FIG. 2 is a diagram showing an example of the contents stored in the electronic dictionary storage section.

2: input register section; 3: word search section; 4: compression control section; 5: output register section; 9: electronic dictionary storage section; 12: compressed code input register section; 13: code search section; 14: expansion control section; 15: original text output register section.

Claims

[Claims]

(1) A first means for storing in advance a plurality of character code strings of words, each consisting of a predetermined number of characters, in correspondence with word codes unique to the respective words; a second means for temporarily storing a character code string obtained by encoding the text to be compressed character by character; a third means for sequentially extracting from the second means a character code string of the same number of characters as the predetermined number, and searching whether that character code string matches the character code string of a word stored in the first means; and a fourth means for outputting, when the search by the third means finds a match, the word code unique to the word of that character code string as obtained from the first means, and for outputting, when there is no match, the leading character code of that character code string unchanged from the second means;

(2) A first means for temporarily storing a compressed character code string obtained by replacing any predetermined word consisting of a predetermined number of characters, found in a character code string obtained by encoding text character by character, with a word code unique to that word; a second means for storing in advance a plurality of character code strings of words consisting of the predetermined number of characters, in correspondence with word codes unique to the respective words; a third means for extracting the compressed character code string stored in the first means code by code and searching whether each code matches a word code stored in the second means; and a fourth means for outputting, when the search by the third means finds a match, the character code string of the word corresponding to that code as obtained from the second means, and for outputting, when there is no match, that code unchanged from the first means; a compressed text expansion method characterized by comprising these means.