JPS61500345A

JPS61500345A - Data compression method and device

Info

Publication number: JPS61500345A
Application number: JP59503813A
Authority: JP
Inventors: タギユー，ルイ　ドン; コツブ，アレン　テイー
Original assignee: テキスト　サイエンセズ　コ−ポレ−シヨン
Priority date: 1983-10-19
Filing date: 1984-10-17
Publication date: 1986-02-27
Also published as: EP0160672A4; IT8468039A1; CA1226369A; WO1985001814A1; IT8468039A0; IT1180100B; EP0160672A1

Abstract

(57)【要約】本公報は電子出願前の出願データであるため要約のデータは記録されません。 (57) [Summary] This bulletin contains application data before electronic filing, so abstract data is not recorded.

Description

【発明の詳細な説明】データ圧縮方法および装置発明の背景本発明は英数字データを記憶または伝送する英数字データをコード化するに必要な信号数を減らす方法および装置に関する０本発明は書物などの大冊のテキストをコンピュータ装置に記憶し、またはこれをデータ通信装置で伝送する上で特に有益である。[Detailed description of the invention] Data compression method and device Background of the invention The present invention is necessary for encoding alphanumeric data to store or transmit alphanumeric data. The present invention relates to a method and apparatus for reducing the number of signals. in particular when storing it in a computer device or transmitting it by a data communication device. Beneficial.

英数字テキストをコード化するための先行技術は普通バイトと呼ばれる８ビツトの２進コードをテキストの各文字に代用する方法に依拠している。このようなコードの１つは情報交換用米国標準コード（ＡＳＣＩＩ）に基づいてその文字を定義する７ビツトとパリティピット（奇偶検査ビット）としてか、またはゼロに設定される８ビツトよりなる。これらコードの一覧表は。The prior art for encoding alphanumeric text is 8 bits, commonly called bytes. It relies on a method of substituting a binary code for each character of the text. This kind of One of the codes defines its characters based on the American Standard Code for Information Interchange (ASCII). as a parity bit (odd-even check bit) or set to zero. Consists of 8 bits determined by Here is a list of these codes.

例えばラルストン（Ｒａｌｓｔｏｎ）ほか著「コンピュータ科学およびエンジニアリング辞典」第２版（ＶａｎＮｏｓｔｒａｎｄ　Ｒｅ１ｎｈｏｌｄ、　１９８３）第１２５および１２６頁に列記されている。For example, Ralston et al., “Computer Science and Engineering” 2nd edition of "Alling's Dictionary" (Van Nostrand Re1nhold, 198 3) Listed on pages 125 and 126.

しかしながら、大冊の英数字テキスト中の各文字を表わすために８ビツトを使用することは現代のマイクロコンピュータおよび通信装置の限界に厳しい負担を課することになる０例えば、新約を書には１７０，０００以上の語および１，０３６，０００個の分離した文字が含まれている。したがって、新約を書を記憶するためには、１メガバイト以上のデータ格納が必要である。現代の記憶技術を以ってしても、この種の要件はテキストその他の書物全部を記憶するには比較的費用がかかる。同様に、書物に匹敵するコード化されたテキスト量を伝送することも比較的費用がかかり、また時間がかかる。However, 8 bits are used to represent each character in large volumes of alphanumeric text. This puts a severe strain on the limits of modern microcomputers and communication equipment. For example, the New Testament contains over 170,000 words and 1,03 Contains 6,000 separate characters. Therefore, remember the New Testament This requires data storage of 1 megabyte or more. With modern memory technology Even so, this type of requirement makes it relatively expensive to memorize entire texts and other writings. It takes. Similarly, it is also possible to transmit amounts of encoded text comparable to books. It is relatively expensive and time consuming.

データの記憶および伝送要件を少なくするため、２文字の組合わせを一層頻繁に表わす一定の８ビツトコードを使用するように標準コードがこれまで修正されてきた。こうして、二重音字“ｔｈ”は一つが“ｔ”を。Two-character combinations are used more frequently to reduce data storage and transmission requirements. The standard code has previously been modified to use a constant 8-bit code to represent came. In this way, the double letter "th" has one "t".

他の一つが“ｈ”を表わす２個の８ビツトコードではなく、１個の８ビツトコードで表わすことができる。The other one is one 8-bit code instead of two 8-bit codes representing “h”. It can be expressed as

しかしこの技術はそれが達成できるデータの圧縮上、比較的限定される。一般に、英数字テキストを表わすに必要な２進コードの長さを約４０％減らすことができる。対をなす文字が特定のテキストに出現する頻度に発明の概要デジタルコード形成で英数字データを記憶する場合に達成できるデータ圧縮量を著しく改良させる技術を我々は考案した。我々の発明によれば、英数字テキストの異なる各類およびそれに伴う句読点を独特の符号に割当てる辞書が作られる。However, this technique is relatively limited in the data compression it can achieve. in general , the length of the binary code required to represent alphanumeric text can be reduced by approximately 40%. Wear. Summary of the invention based on the frequency with which paired characters appear in a particular text The amount of data compression that can be achieved when storing alphanumeric data using digital code formation. We have devised a technique that significantly improves this. According to our invention, alphanumeric text A dictionary is created that assigns each different class of and its associated punctuation marks a unique code.

英数字テキスト中の各類は、ついで辞書中のその語を指す符号で置き換えられる０例えば、各符号は一連の２進数字でその１ｉ！を識別またはアドレスする１６ビツト（２ビツト）までを含む、従って、辞書は２”＝６５，５３６までの記憶語を含むことができ、これはほとんどの書物に関する語の記憶には十分過ぎる程の量である。これら６５，５３６語のいずれか１つを識別するのに僅か２バイトの情報が必要であるに過ぎないため、テキストの各類を２バイトの情報で置き換えれば、テキストの記憶に要する数字の平均数を約三分の−に減らすことができる。もし辞書が６５．５３６語以上を含んでいるとすると、少なくとも若干の符号中に必要とするビットの数は１６以上でなければならないであろう、逆に、もし辞書中の語の数が１６以下の２の乗数であれば、各符号中のビット数を１６以上にすることができる。好都合なことは、辞書を従来のマイクロコンピュータ装置を用いて極めて迅速に作ることが可能であり、かつ記憶されたテキストを、そのコンピュータ装置により人間が解読できる形式に再構成できることである。Each class in the alphanumeric text is then replaced by the symbol that points to that word in the dictionary. 0 For example, each code is a series of binary digits whose 1i! 16 to identify or address contains up to 2 bits, so the dictionary can store up to 2" = 65,536 This is more than enough for memorizing words related to most books. is the amount of It takes only 2 bytes to identify any one of these 65,536 words Replace each type of text with 2 bytes of information, since only the following information is needed: If you do this, you can reduce the average number of numbers required to memorize text by about a third. Ru. If the dictionary contains more than 65.536 words, at least some The number of bits required in the code would have to be greater than or equal to 16; If the number of words in the dictionary is a power of 2 less than or equal to 16, then the number of bits in each code is greater than or equal to 16. Can be on top. The advantage is that dictionaries can be stored on conventional microcomputer devices. The text can be created very quickly using the can be reconstructed into a human-readable format by computer equipment.

辞書を記憶するのに必要なバイト数は、アルファベット類に語を記憶し、そに伴う文字の冗長性を利用することにより実質的に減らすことができる。こうして、もし２つの記憶語のうち２番目の語が先の語と同じ５文字を含んでいる場合、その５文字を表わす１字を記憶することによってその語を意味づけることができる・句読点を除いては同一の複数形語、所有格、同族語を使用することによる辞書中の大量の語の冗長性の故に、この技術により辞書のサイズを約三分の−に小さくすることができる。The number of bytes required to memorize a dictionary is the number of bytes required to memorize words in alphabets and This can be substantially reduced by exploiting the redundancy of characters. thus, If the second of two mnemonic words contains the same five letters as the first word, then You can give meaning to a word by memorizing the one letter that represents the five letters of the word. ・Dictionary by using the same plural words, possessives, and cognates except for punctuation marks Due to the redundancy of a large number of words in the dictionary, this technique reduces the size of the dictionary by about a third. can be reduced.

圧縮されたテキストの長さをさらに短くするには、多くの場合、最も頻繁に使用される語を２バイトより短い符号で表わすことにより達成される１通常、テキスト中のすべての語の半分以上は最も頻繁に用いられる少数の語で構成するので、例えば、最も頻繁に使用される語に対して、２バイト符号の代りに１バイト符号を用いることにより、テキストの記憶要件を少なくとも更に２５％、そし多くの場合、５０％以上減らすことができる。To further reduce the length of compressed text, you can often use the most frequently used This is usually achieved by representing the word to be expressed in a code shorter than 2 bytes. Since more than half of all the words in the list are made up of the few most frequently used words, For example, 1-byte codes instead of 2-byte codes for the most frequently used words. By using In this case, it can be reduced by more than 50%.

前述の技術は語と語の間の境界を保ちながら重要なデータの圧縮を達成する。新約を書のキングジェームス訳でテストしたところ、これらの技術で新約を書の１．０３６，０００文字を、ある圧縮方法を用いて２２０，０００バイト中に、そして他の圧縮方法を用いて１８３，０００バイト中に、記憶させることが可能であった。また法律家の養成資料中の約９００　、０００文字で実施したテストでは、テキストを１５０，０００バイト以下に圧縮することができた。The techniques described above achieve compression of critical data while preserving boundaries between words. new When we tested the New Testament with the King James Version of the Book, we found that these techniques could be used to convert the New Testament into Book 1. ．． 036,000 characters into 220,000 bytes using some compression method. can be stored in 183,000 bytes using other compression methods. there were. In addition, in a test conducted with approximately 900,000 characters in training materials for lawyers, was able to compress text to less than 150,000 bytes.

辞書は英数字テキスト中の各類を含むため、特定の語または数個の語がテキスト中に用いられているかどうかを確認することができる。さらに、テキスト中の語の場所は辞書中の各類に、その語が出現するテキストの各セグメントを指す確認子を付加することによって指定することができる。この特徴により、テキストの同一のセグメ°ントに出現する語を見つけるために。Dictionaries contain each class in alphanumeric text, so a particular word or several words can be You can check whether it is used in Furthermore, the words in the text Confirm that the location in each category in the dictionary points to each segment of text in which that word appears. It can be specified by adding a child. This feature allows text to To find words that occur in the same segment.

異なる語に結合する確認子を比較することが可能である。It is possible to compare confirmators that bind to different words.

図面の簡単な説明我々の発明のこれらおよび他の目的、特徴および長所はその実施の態様に関する以下の詳細な説明から一層明らかになるであろう。Brief description of the drawing These and other objects, features and advantages of our invention relate to its modes of implementation. It will become clearer from the detailed description below.

第１図は我々の発明の好適な一実施の態様の−゛般概念を例示するフローチャート。FIG. 1 is a flowchart illustrating the general concept of one preferred embodiment of our invention. to.

第２図は第１図に示す実施の態様を更に詳細に例示するフローチャート、第３図は第２図の詳細を例示するフローチャート、第４図は我々の発明の好適な実施の態様の第２の特徴を例示するフローチャート、そして第５図は我々の発明の好適な実施の態様に使用される例示的装置を示す線図である。FIG. 2 is a flowchart illustrating the embodiment shown in FIG. 1 in more detail; FIG. 3 is a flowchart illustrating details of FIG. 2, and FIG. 4 is a flowchart illustrating the details of FIG. a flowchart illustrating a second feature of the embodiment, and FIG. 5 is a diagram illustrating an exemplary apparatus used in a preferred embodiment of our invention. Ru.

発明の好適な実施の態様の説明第１図に示すように、我々の発明における英数字テキストは先ず、該テキストの各類を１６ビツト（２バイト）までの独特の符号に結合する辞書を作ることによって圧縮される０周知のように、１６ビツト中の１またはＯのパターンを用いてゼロから６５，５３６までのどの数でも表わすことができる。圧縮されたテキストを作るため、各類は辞書中のその語を指す符号によって置き換えられる。辞書のサイズは、随時辞書中の語をアルファベット類に記憶し、かつその結果生ずる文字の冗長性を利用することによって縮少することができる。Description of preferred embodiments of the invention As shown in Figure 1, the alphanumeric text in our invention first consists of By creating a dictionary that combines each class into a unique code of up to 16 bits (2 bytes). As is well known, using a pattern of 1 or O in 16 bits, It can represent any number from zero to 65,536. compressed text To create a list, each class is replaced by the symbol that points to that word in the dictionary. dictionary The size of is determined by memorizing the words in the dictionary at any time in alphabetical order, and the resulting It can be reduced by exploiting character redundancy.

都合のよいことは、圧縮されたテキストの長さは、最も頻繁に用いられる語を２バイト以下の長さをもつ符号で表わすことによって更に短縮することができる。Conveniently, the length of the compressed text is 2 times the most frequently used words. It can be further shortened by representing it with a code whose length is less than a byte.

これらの手段は従来のマイクロコンピュータにより遂行されることが望ましい。Preferably, these means are performed by a conventional microcomputer.

マイクロコンピュータにおいて、第１図の技術を実施する特別な手段は第２図中に示されている。第１に。In a microcomputer, special means for implementing the technique shown in Figure 1 are shown in Figure 2. is shown. Firstly.

圧縮される書物のテキストまたはその他の資料は語の線状リストに変換される。The text of a book or other material to be compressed is converted into a linear list of words.

実際上、このことはキャリエツジリターン／ライン送りがテキストの各類のあとに挿入されることを要求する。この目的のため１便宜上、各類はテキスト中の連続するスペース間の１句読点を含む全ての英数字記号であると考えられる。このようにして、テキスト中にスペースが出てくるたびにキャリエツジリターン／ライン送りが挿入されるだけであり、英数字テキストの直前の１スペースの一部と考えられる１語と語の間に多数のスペースがみられる場合、テキストの直前のスペースはスペース文字よりなる単一語として取扱う。In practice, this means that a carrier return/line advance follows each type of text. request to be inserted into. For this purpose and for convenience, each category refers to the series in the text. All alphanumeric symbols are considered including one punctuation mark between consecutive spaces. this , a career return/line will occur every time a space appears in the text. The infeed is only inserted, and is part of the one space immediately before the alphanumeric text. If there are many spaces between possible words, select the previous space in the text. A pace is treated as a single word consisting of space characters.

線状リストがつくられると、従来の分類法を用いてそれを分類し、その結果、テキストの全単語がアルファベット類に配置される。Once the linear list is created, it can be classified using traditional classification methods, resulting in All words of the text are placed in alphabetical order.

アルファベット化されたリストは記憶の重複をさけ、かつ各記憶語の使用頻度計算をするためマイクロコンピュータによって処理される。こうして、アルファベット化された語のリスト全部が最初にアルファベット化されたリストからの各類を識別し、かつ最初の、アルファベット化されたリスト中に現われるその語の出現回数を指定する新しい圧縮リストに置き換えられる。The alphabetized list avoids memorization duplication and allows you to measure the frequency of use of each memorized word. Processed by a microcomputer to perform calculations. In this way, the alphabet The entire list of alphabetized words is added to each class from the first alphabetized list. and the first occurrence of that word in the alphabetized list. Replaced by a new compressed list specifying the current count.

例えば、この手続は第３図に示すように実行される。For example, this procedure is performed as shown in FIG.

アルファベット化されたリストの各類は、こんどはマイクロコンピュータによって取出される。この場合。Each type of alphabetized list is then processed by a microcomputer. is removed. in this case.

その語を前に取出した語と比較し、それが新しい語であるかどうかを確認する。Compare the word with previously retrieved words to see if it is a new word.

もし２つの語が同じであれば１問題の語は古い語であり、頻度カウンタが１つだ１ｔｉ１１分され、前記リストからつぎの語が取出される。If the two words are the same, then the word in question is an old word and has a frequency counter of 1. 1ti11, and the next word is extracted from the list.

２つの語が異なる場合、問題の語は新語であり、古い語と頻度カウンタの内容が新しいリストに書き込まれ。If the two words are different, the word in question is a new word and the old word and the content of the frequency counter are written to a new list.

頻度カウンタは１にリセットされ、新しい語はっぎの比較のために記憶される。The frequency counter is reset to 1 and stored for comparison of new words.

辞書をつくるために、圧縮されたアルファベット類の各類はそれぞれの符号を割当てられる。しかしながら、記憶要件を減らすためには、２バイト以下の長さをもつ符号を、幾つかの技術のうちのどれか１つを用いてより頻繁に用いられる語に割当てることが望ましい０例えば、１バイトの符号は最も頻繁に用いられる語に割当てることができる。そのためには、まず圧縮されたアルファベット類のリストのコピーをつくり、それを記憶する。ついで、語のリストと頻度計算値が頻度カウンタにより分別され使用頻度の減少順序に語が配置される新しいリストを得る。ある技術においては、あるバイトの８ビツトの１つを、２バイト符号でなく１バイト符号としてそのバイトを識別するのに用いることができる。このような場合、前記バイトの他の７つのビットを１２８の異なる符号をつくるために用いることができる。もしこのバイトが１バイト符号として識別されない場合、２バイト符号中の残りの１７ビツトはテキスト中の３２，７６８に及ぶ異なる語を識別するために用いることができる。To create a dictionary, each class of the compressed alphabet is divided into its respective codes. Can be guessed. However, to reduce storage requirements, the length should be less than or equal to 2 bytes. code with more frequently used words using one of several techniques. For example, a 1-byte code should be assigned to the most frequently used word. can be assigned to. To do this, we first need to download the compressed alphabet. Make a copy of the strike and memorize it. The list of words and frequency calculations are then Create a new list in which words are sorted by a frequency counter and placed in order of decreasing frequency of use. obtain. In some technologies, one of the 8 bits of a byte is encoded with a 2-byte code. It can be used as a 1-byte code to identify that byte. like this , the other 7 bits of the byte are used to create 128 different codes. I can be there. If this byte is not identified as a 1-byte code, then 2 The remaining 17 bits in the byte code represent 32,768 different words in the text. It can be used for identification.

従って、この技術においては、最も頻繁に用いられる１２８の語の各々は１２８の異なる１バイト符号の１つを割当てられ、残りの語は異なった２バイト符号を割当てられる。Therefore, in this technique, each of the 128 most frequently used words is 128 are assigned one of different 1-byte codes, and the remaining words have different 2-byte codes. Assigned.

別の方法として、１バイト符号の数をテキスト中に用いられている異なる語の数に応じて変えることができる。特に、１および２バイト符号の組合せによって表わすことができる異なる語の最大数はＸ＋２５６（２５６−Ｘ）　（但し又は使用される１バイト符号の数を示す）で与えられる。明らかに、Ｘは２５６以下またはそれと等しい正の整数である。このことから、使用できる１バイト符号の最大数はｘ＜（２５６”　−Ｙ）／２５５　（１）（ただし、Ｙはテキスト中の異なる語の数である）。Alternatively, the number of one-byte codes can be calculated as the number of different words used in the text. can be changed depending on. In particular, The maximum number of different words that can be used is X + 256 (256 - (indicates the number of 1-byte codes used). Obviously, X must be less than or equal to 256. or an equivalent positive integer. From this, the maximum number of 1-byte codes that can be used is The large number is x<(256”-Y)/255 (1) (where Y is a different word in the text ).

例えば、テキスト中に１２，０００の異なる語がある場合、Ｘ＝２０９となる。For example, if there are 12,000 different words in the text, then X=209.

このように、２０９の最も頻繁に用いられる語は２０９の１バイト符号によって表わされ、残りの１１，７９１語は２バイト符号で表わされる。Thus, the 209 most frequently used words are represented by 209 one-byte codes. The remaining 11,791 words are represented by 2-byte codes.

従って、この技術を用いる場合、方程式（１）は使用できる１バイト符号の最大するを計算するのに用いられる。この最も頻繁に用いられる語の数は、ついで。Therefore, when using this technique, equation (1) is the maximum number of one-byte codes that can be used. It is used to calculate . The number of this most frequently used word is then.

バイト符号を割当てられる。そしてテキスト中の残りの語は２バイト符号を割当てられる。Can be assigned a byte code. The remaining words in the text are then assigned 2-byte codes. Can be used.

１バイト符号の数を定めるためにどの方法が用いられるにしても、最初の語に続く連続した数的順序で各類に符号を割当てることにより、コンピュータを通じて辞書が作られる。これらの符号の数的順序は下位から上位、または上位から下位になり得るが、ここに述べる実施の態様においては単調増減でなければならない、つぎの説明においては、数的順序は上位に向っている。好都合なことは、１バイト符号で表わされる語は第１の辞書に割当てられ、残りの語は第２の辞書に割当てられる。記憶要件を最少にするには、以下詳述するように、各類と２バイト符号を結合する第２の辞書はせいぜい２５６記憶語をもつだけであるから１通常はこの辞書をアルファベット類にする必要はない、しかしながら、この辞書に記憶される語はテキスト中に非常に頻繁に用いられるため、その検索時間を最小にすることが望ましい、この目的で、最も頻繁に用いられる語を最初にしてテキスト中にそれらの語をその使用頻度順に記憶する。Whichever method is used to determine the number of one-byte codes, the through a computer by assigning a code to each class in a consecutive numerical order. A dictionary is created. The numerical order of these signs is from low to high, or from high to low. , but in the implementation described here it must be monotonically increasing or decreasing. , in the following description, the numerical order is upwards. The good thing is that 1 bar The words represented by the code are assigned to the first dictionary, and the remaining words are assigned to the second dictionary. Can be guessed. To minimize storage requirements, each class and 2 bytes, as detailed below, Since the second dictionary that combines the codes has at most 256 memory words, it is usually There is no need for this dictionary to be alphabetical, however, Since the words to be memorized are used very frequently in the text, it is important to minimize their search time. For this purpose, it is preferable to use text with the most frequently used words first. Store the words in order of frequency of use.

記憶される辞書はそれに含まれる語だけを含み符号は１つも含まないことが望ましい０例えば。It is desirable that the dictionary to be stored contains only the words contained in it and does not contain any codes. For example.

ＡＳＣＩＩによりコード化された記号の形式で、１バイトが各記号を表わすようにして、語が記憶される。A format of symbols encoded by ASCII, with one byte representing each symbol. Then, the words are memorized.

ＡＳＣＩＩコードは僅か９６に過ぎないから、各バイトの１ビツトは他の目的に用いられる。このビットは各類のはじめを識別するのに用いられる。特に、各類のはじめは、その最初のＡＳＣＩＩ文字の８ビツトを“１”に設定し、一方その語の１つ置きのＡＳＣＩＩ文字の８ビツトを“０”に設定することによって識別される。その結果、辞書中の特定の語と結合した符号は、辞書のはじめからその語までの語数を数え、その数とリスト中の最初の語と結合した符号の数値とを加えるだけで決まる。この計算は各バイトの８ビツトをマスクして、コンピュータがリスト中の最初の語から問題の語まで各バイトを走査するに応じてその位置における各“１”ビットの出現を数えることにより簡単に行われる。Since there are only 96 ASCII codes, one bit of each byte is used for other purposes. used. This bit is used to identify the beginning of each class. In particular, each category At the beginning of , set the 8 bits of its first ASCII character to “1”, while Identified by setting 8 bits of every other ASCII character in the word to “0” be done. As a result, the code associated with a particular word in the dictionary is Count the number of words up to the word and add that number to the number of the sign combined with the first word in the list. It's decided just by looking at it. This calculation is done by masking the 8 bits of each byte and scans each byte from the first word in the list to the word in question. This is simply done by counting the occurrences of each "1" bit in .

例えば、最初の辞書が２０９語を含む場合、ｏｏｏｏ　ｏｏｏ。For example, if the initial dictionary contains 209 words, oooo ooo.

から１１０１０００１までの２進数値をもつ符号はこれらの語に割当てられる。Codes with binary values from 11010001 to 11010001 are assigned to these words.

特定の語に割当てられた符号を確認するために、コンピュータは最初のバイトからはじまって、その符号の数値が計算される特定の語の直前のバイトで終る辞書中の各バイトの８ビツトの位置中の各１ビツトの出現を数えるだけである。この辞書中の最初の語に割り当てられた符号の数値はゼロであるから、計算値は符号の値となる０例えば、第２の辞書の語に割当てられた符号は２進数値１１０１００１０００００００００ではじまることになる。従って、その符号の値は第１の辞書と同じ方法で語を教え、第２の辞書の最初の語と結合した２進数値１１０１００１０００００００００を前記計算値に加えることによって決定される。To determine the code assigned to a particular word, the computer checks the first byte or a dictionary starting with the byte immediately preceding the particular word whose sign value is to be computed It simply counts the occurrences of each 1 bit in the 8 bit positions of each byte in the 8 bit position. this Since the numerical value of the sign assigned to the first word in the dictionary is zero, the calculated value is the sign For example, the code assigned to a word in the second dictionary is the binary value 11010. It will start with 01000000000. Therefore, the value of its sign is the first Teach the word in the same way as the dictionary and combine the binary value 1101 with the first word of the second dictionary It is determined by adding 001000000000 to the calculated value.

計算手続きを速くするには、一定の語を結合した符号を識別する調査表が役立つ０例えば、アルファベットの２６文字の各々ではじまる最初の語と結合した符号を記憶することができる。そして符号が計算される語と最初の文字を同じくする最初の語がら計算手続きを開始することができる。To speed up calculation procedures, it is useful to have a study table that identifies the symbols that combine certain words. 0 For example, the code combined with the first word starting with each of the 26 letters of the alphabet can be memorized. and have the same first letter as the word whose sign is being calculated The calculation procedure can be started with the first word.

辞書が作られたのち、マイクロコンピュータは最初に生じた線状リストから各類を読み取り、第１または第２の辞書中の語を調べ、その辞書から得た符号で線状リスト中の語を置き換えることによって英数字テキストを圧縮する。この工程において、第１の辞書の各類が調査がまず行われ、それらの語のＡＳＣＩＩコードが符号で置き換えられる語と一致しているかどうかを確認するテストが行われ、確認できなかった各テスト数を計算する。もし１両者が一致していれば、不確認テストの数はその符号の値を示すことになる。但し、最初の語と結合した符号の値はゼロである。もしも第１の辞書中に前記一致が得られなかった場合、コンピュータは第２の辞書に移る。ここで、調査表を用いてその辞書の調査開始点を見つける０例えば、符号が定められる語の最初の文字を、その文字ではじまる最初の語を調査表中で見つけるために用いることができる。After the dictionary was created, a microcomputer could select each category from the first linear list. , look up the word in the first or second dictionary, and use the code obtained from that dictionary to form a linear Compress alphanumeric text by replacing words in a list. to this process , each category of the first dictionary is first examined and the ASCII codes of those words are is tested to see if it matches the word to be replaced by the sign, Calculate the number of each test that could not be confirmed. 1 If both match, disconfirmation The number of tests will indicate the value of that sign. However, the code combined with the first word The value is zero. If said match is not found in the first dictionary, the compiler The computer moves to the second dictionary. Here, use the survey table to find the starting point for the dictionary's survey. Add 0 For example, change the first letter of the word for which the code is specified to the first letter starting with that letter. can be used to find the word in the questionnaire.

調査表はその語の符号の数値を与えることになる。ついで、前記の文字ではじまる異なる語を調べ、各類のＡＳＣＩＩコードが、各類と一致しているかどうかを確認する。テストできなかった話語にカウンタが１つだけ増分する。そしてその語が見つかった場合、その語の符号を、同じ最初の文字ではじまる最初の語と結合した符号の調査表から得た値にカウンタの計算値を加えて、計算する。この方法で語の線状リスト全体が符号化されたテキストをつくるために符号リストで置き検相を使って圧縮することが可能である。それは、この辞書の語がアルファベット類に配列されており、はとんどすべての語が、辞書の配列順で先行する語の頭文字または文字に共通する頭文字を、少なくとも１個は含んでいることに起因する。配列順で２番目の語が先行する第１の語も頭文字と同一の文字を少なくとも２個含んでいる場合は、その２番目の語を表現するためには、（１）第１の語の頭文字と比較して同じ文字がいくつあるかを示す数と、（２）２番目の語の第１の語と異なる残りの文字を示す文字の列を使うのが便利な方法である。したがって辞書の個々の語は、先行する見出し語の文字と同じ頭文字の数を指定する数と。The questionnaire will give you a numerical value for the sign of the word. Then, start with the above characters. Examine the different words in the class and see if the ASCII code for each class matches the class. confirm. The counter is incremented by one for each spoken word that could not be tested. And that If a word is found, combine the code of the word with the first word starting with the same first letter. Calculate by adding the calculated value of the counter to the value obtained from the search table of the matched code. This person In the method, the entire linear list of words is placed in a code list to create encoded text. It is possible to compress using phase detection. That is because the words in this dictionary are alphabetical. The words are arranged in blocks, and almost every word is the same as the word that precedes it in the dictionary order. Caused by containing at least one initial letter or common initial letter do. The first word preceded by the second word in sequence must also have at least the same letter as the initial letter. also contains two words, in order to express the second word, (1) the first word and (2) the number of the same letters compared to the first letter of the second word. A convenient method is to use a string of characters to indicate the remaining characters that are different from the first word. However, Each word in the dictionary is a number specifying the number of initial letters that are the same as the letters of the preceding entry word. and.

異なる残りの文字を表わしたＡＳＣＩＩコードを使って格・納される。処理を促進させるために、その数は１語の頭文字の検索を行う場合にすぐ使用できるように２進数で格納される。１列として、辞書に連続して出てくるｓｔｏｒａｇｅ” 、”５ｔｏｒｅ”、および“５ｔｏｒｅｄ”の３つの語をとってみる。この場合には、”　５ｔｏｒｅ”は、最初の４文字が先行する“ｓｔｏｒａｇｅ”にあり　＆ｅ”が異なるため、１４″に相当する２進数と“ｅ”に相当するＡＳＣＩＩ文字を使って表現される。また“５ｔｏｒｅｄ”は。It is stored using ASCII codes representing the different remaining characters. Prompt processing In order to speed up the search, the number can be easily used when searching for the first letter of a single word. is stored in binary format. "storage" that appears consecutively in the dictionary as one column , "5tore", and "5tored". in this case , "5tore" is in "storage" preceded by the first four characters. &e” are different, so the binary number equivalent to 14″ and the ASCII equivalent to “e” expressed using letters. Also, “5tored”.

最初の５文字が先行する“５ｔｏｒｅ”と同じで“ｄ”が異なるため、５＃に相当する２進数と“ｄ”に相当するＡＳＣＩＩ文字を使って表現される。The first five characters are the same as the preceding “5tore” but the “d” is different, so it is compatible with 5#. It is expressed using the corresponding binary number and the ASCII character corresponding to "d".

このあとに、符号化されたテキスト、辞書、Ｗ４査表。After this are encoded texts, dictionaries, and W4 examination forms.

および符号化されたテキストを読み取るコンピュータのプログラムが、テープ、ディスク、またはＲＯＭのような適当な媒体のいずれかに格納される。またこの同じ情報はデータ通信システムを通して、ある位置から別の位置へ伝送することもできる。我々の発明を利用してデータの圧縮を行うと、実物大のボリュームの　゛書物の完全なテキストを、１個か２個の５１／４インチ（１３ｍｍ）のブロッピー・ディスクに格納することが可能である。一般にテキストの長さは１語を符号に置き換えることによって、　６０〜７０％程度まで縮小することができる。またテキストの中で非常に頻繁に使用される語に対して１バイトの符号をあてることにより、さらに２５％の、場合によっては５０％までの縮小を達成することができる。したがって本発明を実際に行うことによってテキストの長さを、全体として７５％程度まで縮小させることが容易にできる。辞書がテキストの長さを増加させることは明白な事実ではあるが、第２の辞書の長さは、前に述べたように連続した語の同一の頭文字を数値コードを使って表現することにより縮小することができる。これは、３個程度の因数を使って辞書の長さを減らしている１本発明を実行して達成できる圧縮の量を、下の例１に図示しである。また我々の発明を利用して、この種類のテキストを伝送するために必要なチャネルの伝送容量の低減も実際に行うことができる。and a computer program that reads the encoded text on the tape, It is stored either on disk or on a suitable medium such as ROM. Also this the transmission of the same information from one location to another through a data communication system You can also do it. When data is compressed using our invention, it is possible to compress data using a full-scale volume. ゛The complete text of a book can be printed on one or two 5 1/4 inch (13 mm) blocks. It is possible to store it on a DVD disk. Generally, the length of the text is one word. By replacing it with a code, it can be reduced to about 60-70%. . It also assigns 1-byte codes to words that are used very frequently in the text. By doing so, it is possible to achieve a further reduction of 25%, and in some cases up to 50%. I can do it. Therefore, by actually implementing the present invention, the length of the text can be reduced completely. The body can easily be reduced to about 75%. Dictionary is the length of the text Although it is an obvious fact that the length of the second dictionary increases By representing the same initial letters of successive words using numerical codes, can be done. This reduces the length of the dictionary by using about 3 factors1 The amount of compression that can be achieved by practicing the invention is illustrated in Example 1 below. Also our The invention can be used to improve the transmission capacity of the channel required to transmit this type of text. A reduction in quantity can also be achieved in practice.

図４のフロー・チャートは、符号化されたテキストからコンピュータを使ってもとの英数字のテキストを復元する方法を図示したものである。これに示されているように、コンピュータは１個々の符号を順番に取り出してその符号に関連する語を見つけ出すために。The flow chart in Figure 4 can also be used to generate data from encoded text using a computer. This diagram illustrates how to restore alphanumeric text with . shown in this As shown in the figure, a computer sequentially extracts each individual code and calculates the information associated with that code. To find words.

辞書の１つを探索する。１バイトの符号の場合には。Search one of the dictionaries. In the case of a 1-byte code.

コンピュータは、その符号の２進数の値をカウンタにロードするだけでよい、そして第１の辞書の中の語を、最も頻繁に使用される語から始めて順に読み、８番目のビット位置が“１”ビットになっているバイトごとに１ずつカウントを減らし、カウンタの示す値がゼロになるまで続ける。この時点で２次に読まれる語が、始めにカウンタにロードされた符号で表現された語に相当する。第２の辞書を探索する場合には、コンピュータは符号をアルファベットの個々の文字で始まる最初の語に関連づけている調査表を有効に用いる。すなわち、コンピュータはテキストに変換される符号の値から表の符号の値を減算しながら、調査表を逆の順序で走査していけばよい、２つの値の差が負の値から正の値に変わる瞬間に、コンピュータは、符号によって表現された語の文字と同じ文字で始まる最初の語に到達する。これによってコンピュータはテキストに変換される符号の値からこの符号の値を差し引き、その文字で始まる異なる語のバイトを読んでいくという同一の処理を繰り返し行う、８番目のビット位置が“１”ビットであるバイトごとにカウントを１ずつ減らし。The computer only needs to load the binary value of that sign into a counter; read the words in the first dictionary in order, starting with the most frequently used word, and Decrement the count by 1 for each byte whose second bit position is “1” bit. and continue until the value indicated by the counter becomes zero. At this point, the second word to be read is , corresponds to the word represented by the code initially loaded into the counter. second dictionary When searching, the computer searches for codes starting with individual letters of the alphabet. Make effective use of the questionnaire relating to the first word. In other words, the computer Run the survey table in reverse order, subtracting the sign values in the table from the sign values to be converted to text. The moment the difference between the two values changes from a negative value to a positive value, the The computer searches for the first word that starts with the same letter as the word represented by the code. reach. This allows the computer to convert this sign value to text. The same thing as subtracting the sign value and reading the bytes of different words starting with that letter. Repeat the process 1 for each byte where the 8th bit position is a “1” bit. Decrease the count by 1.

カウントがゼロになるまで続ける。ゼロの時点で１次に調べられる語が符号によって識別される語に相当する。第１または第２のいずれの辞書から検索が行われても、このあとにその語は１表示装置、プリンタ、またはその種の他のコンピュータの出力装置に送られる。Continue until the count reaches zero. The word examined in the first order at the time of zero is determined by the code. It corresponds to the word identified as The search is performed from either the first or second dictionary. However, after this, the word 1 displays, printers, or other types of computers. data is sent to the data output device.

そしてコンピュータは次の符号に進む。The computer then advances to the next code.

以下余白我々の発明は、コンピュータによって実行されるシステムであればあらゆる方式のシステムに適用させることができる。テキストの符号化を行い、符号化されたテキストからもとの英数字のテキストを復元する処理に適した装置としては、適当なプログラムによって稼動するコンピュータであればどれでもよい０図５に示すように、一般にこのようなコンピュータは、プロセッサ（１０）、第１と第２メモリ（２０と３０）、キーボード（４０）および陰極線管ＣＲＴ　（５０）から構成されている。またこのような装置には、任意選択機能としてプリンタ（６０）や通信インタフェース（７０）をも含めることができる。これらの装置は１図に示すようにデータ・バス（９０）によって相互に接続され、マイクロプロセッサ（１０）から信号線（９０）を通して制御される。さらにメモリは、アドレス線（１００）によってアドレス指定することができる。図５に示す構成は、通常のマイクロコンピュータの編成として一般に認められているものである。辞書を作成し、英数字のテキストを符号化するプログラムは、便宜上読取専用メモリである第１のメモリに格納することができる。また同じ装置を符号化されたテキストから英数字のテキストを復元するためにも使用する場合は、そのプログラムもメモリ（２０）に格納することができる０作成された符号化されたテキストは、辞書と調査表とともに通常はメモリ（３０）に格納される。また復元のプログラムはメモリ（２０）が使用できない場合には、メモリ（３０）に格納することができる。これらの符号化されたテキスト、辞書、調査表、および復元のプログラムは、通信インタフェース（７０）を通して遠隔地の別のマイクロコンピュータに伝送することも可能である。Margin below Our invention can be applied to any system executed by a computer. It can be applied to the following systems. Performs text encoding and encoded Appropriate equipment is suitable for the process of recovering original alphanumeric text from text. Any computer can be used as long as it runs on a suitable program (as shown in Figure 5). Generally, such a computer includes a processor (10), a first and a second Memory (20 and 30), keyboard (40) and cathode ray tube CRT (50) It is composed of Such devices also include a printer (6) as an optional feature. 0) and a communication interface (70). These devices are 1 The microprocessors are interconnected by a data bus (90) as shown in the figure. It is controlled from the sensor (10) through a signal line (90). Furthermore, the memory can be addressed by the line (100). The configuration shown in Figure 5 is This is generally accepted as the organization of a regular microcomputer. dictionary A program that creates and encodes alphanumeric text is stored in read-only memory for convenience. can be stored in a first memory. The same device can also be used to If you also use the program to restore alphanumeric text from The encoded text created by 0 can also be stored in memory (20) , along with a dictionary and a lookup table, typically stored in memory (30). Also the restoration program RAM should be stored in memory (30) if memory (20) is not available. Can be done. These encoded texts, dictionaries, questionnaires, and restoration programs The RAM communicates with another microcomputer at a remote location through a communication interface (70). It is also possible to transmit data to a computer.

メモリ（３０）は、プログラム可能読取専用メモリ（ＦＲＯＭ）か、磁気テープ、またはフロッピィ・ディスク装置であることが望ましい、これは、これらの装置が十分な大きさの容量を有していて、ある書物のテキスト全体を１合理的なサイズのＦＲＯＭか少数のフロッピィ・ディスクに格納することができるからである。またＦＲＯＭを使用する場合には、ＦＲＯＭに符号化されたテキスト、辞書、ｍ査表、および復元のプログラムを記録するために適当な装置（図には示していない）を使用しなければならない、このような装置は一般によく知られている。また多数の書物を１つのレコードに格納するのが望ましい場合に、本発明を行うためには、非常に容量の大きい固定ディスク装置　。Memory (30) can be programmable read-only memory (FROM) or magnetic tape. , or a floppy disk device; this is because these devices If the storage has a large enough capacity to store the entire text of a book in one reasonable size. This is because it can be stored on a small number of floppy disks or a small number of floppy disks. Ru. Also, when using FROM, the text encoded in FROM, the dictionary A suitable device (not shown in the figure) for recording such devices are generally well known. . The present invention can also be used when it is desirable to store a large number of books in one record. In order to do this, a fixed disk device with extremely large capacity is required.

か大きいＲＯＭのボードを使うことができる８図５の装置が、ディスクに格納されたデータからもとの英数字のテキストを復元するために使用する場合には、ディスクの内容全体を半導体メモリに転送して処理するのが便利である。半導体メモリは非常な高速で処理を行うために、辞書の語の調査を促進させ、処理の時間を短縮することができる。またこの目的のために、通常のマイクロコンピュータのメモリの記憶容量に適合するような大きさまで辞書を圧縮してしまうと便利である。我々は、６４キロバイトの半導体メモリが使用できる状態のときにこれを実行して効果が上がることを確認した。8 The device in Figure 5, which can use large ROM boards, is stored on disk. When used to recover the original alphanumeric text from It is convenient to transfer the entire contents of the disk to a semiconductor memory for processing. Semiconductor metal Mori accelerates dictionary word lookups and reduces processing time in order to perform processing at very high speeds. can be shortened. Also for this purpose, a normal microcomputer It is convenient to compress the dictionary to a size that fits the memory storage capacity of be. We did this when we had 64 kilobytes of semiconductor memory available. I implemented it and confirmed that it was effective.

我々の発明は、広い分野にわたって応用させることが可能である。前にも述べたように本発明を利用して、データの記憶や伝送のため英数字のテキストを圧縮することができる。またもとのテキストの復元を迅速に実行することができるため、圧縮されたデータをもとのテキストを使って行われていた種々の応用分野に役立てることができる。また圧縮されたデータは辞書がないと全く役に立たないために、符号化されたテキストと辞書を生成して、記憶と伝送の目的のためにそれらを分離させることによって、英数字データの確実な記憶および／または伝送を行うことができる。Our invention can be applied in a wide range of fields. mentioned before The present invention can be used to compress alphanumeric text for data storage and transmission. can be done. It also allows for quick restoration of the original text. , compressed data can be used in various applications that were previously done using the original text. It can be erected. Also, compressed data is completely useless without a dictionary. generate encoded text and dictionaries and use them for storage and transmission purposes. This separation allows for reliable storage and/or transmission of alphanumeric data. It can be carried out.

辞書には英数字のテキストの個々の語が含まれてはいるが、比較的短いものであるため、情報の検索を行う場合に有効なツールとしてこの辞書を使うこともできる。とりわけ、単に辞書を走査するだけで、特定の語が英数字のテキストで使用されているかどうかを容易に確認することができる。また辞書の個々の語にその語が現われるテキストのそれぞれの区分を指定する識別子を追加して、さらに有効に使用することができる０例えば、この識別子が１バイトの長さで、そのバイトの８個のビット位置の１つ１つが、テキストの８個の区分に対応していると仮定してみる。この例ではそのバイトの８個のビット位置のいずれかに１のビットが入っている場合は、テキストの対応する区分に関連の語があることを示している。このように識別子を使用することによって、問題の語を取り囲んでいる英数字テキストの検索の速度を高めることができる。それは語が現われない区分を探索する必要がないからである。Dictionaries contain individual words of alphanumeric text, but they are relatively short. This dictionary can also be used as an effective tool when searching for information. Ru. Among other things, you can simply scan the dictionary to see if a particular word is used in alphanumeric text. You can easily check whether the Also, for individual words in the dictionary, Add identifiers to specify each segment of text in which the word appears, and then For example, if this identifier is 1 byte long and that byte is Assume that each of the eight bit positions in the text corresponds to one of the eight segments of the text. Let's try to set it. In this example, a 1 bit in any of the 8 bit positions of the byte. , it indicates that there is a related word in the corresponding segment of the text. Ru. By using identifiers in this way, the alphanumeric characters surrounding the word in question This can speed up searching for text. It searches for categories in which no words appear. This is because there is no need to search.

さらに、異なる言葉に関係する識別子のそれぞれのビットを比較することによって、それらの言葉がテキストの同じ部分に使用されているかどうかがわかる。Additionally, by comparing each bit of the identifier related to different words, to see if the words are used in the same part of the text.

明らかに、識別子は、言葉の使用をより正確に発見するために必要なので２その大きさを変えることができる。Obviously, identifiers are necessary to discover word usage more precisely, so 2. You can change the size.

我々の発明の実習では、多くのバリエーションがまた可能である。今まで、我々は英数字のテキスト、２進数字のトークンおよびアスキーコードが言葉で発明について解説してきたが、この発明は記号の全ての方法で実習することができ、また記号はトークン化できるし、いろいろの方法でコード化することもできる。In the practice of our invention, many variations are also possible. Until now, we Invents alphanumeric text, binary digit tokens and ASCII codes in words However, this invention can be practiced using all methods of symbols, and Symbols can be tokenized and encoded in various ways.

例えば、外国語、数学的記号、グラフ記号や句読点等が全てこの発明を実習するのに用意されており、またこれらの記号は、アスキー、拡張したアスキーまたは選ばれたどんなコードによっても表わすことができる。For example, foreign languages, mathematical symbols, graph symbols, punctuation marks, etc. can all be practiced with this invention. and these symbols can be ASCII, Extended ASCII or It can be represented by any code chosen.

我々の発明の実習では２進トークンの使用が好ましいが、そのようなトークンを１６進数のような他の基数で表わすのが好ましいかもしれないが、またこの発明は、どんな基数の桁をもつトークンを使用しても実習することができる。Although the use of binary tokens is preferred in our invention practice, we Although it may be preferable to represent it in other bases, such as hexadecimal, this invention also can be practiced using tokens with digits of any radix.

我々は、頻繁に使用される言葉を記憶するのに２バイト以下のコードを使用してトークン化されたテキストの大きさを小さくするための２つの例を揚げて説明してきた。しかし、数多くの他の技術が使用できるのである０例えば、大抵の本の中で使用される用語数は１６ビツトで表わせる６５，５３６語より明らかに少ないので。We use codes of less than 2 bytes to remember frequently used words. Here are two examples of how to reduce the size of tokenized text. It's here. However, many other techniques can be used. For example, most books The number of terms used in this is clearly smaller than the 65,536 words that can be expressed in 16 bits. Inode.

アルファベット化されたテキストの言葉のそれぞれを。Each of the words in the alphabetized text.

１６ビツト以下で表現することは大抵の場合可能である。In most cases, it is possible to express with 16 bits or less.

例えば、３２，７６８語は１５ビツトで表現できるし、１６，３８４語は１４ビツトで表わすことができる。したがって、ビットにトークンを割当てる他の方法は、それぞれの異なった言葉を、その最低のビット数を持つ異なるトークンで表わすことのできるその最低のビット数を計算し、次にその最低のビット数を持つ異なるトークンをそれぞれの言葉に割当てることである。もし使用されている用語数が６５，５３６語以上であれば、同様の原理で。For example, 32,768 words can be represented in 15 bits, and 16,384 words can be represented in 14 bits. It can be expressed as So other ways to assign tokens to bits represents each different word by a different token with its lowest number of bits. Calculate the lowest number of bits that can be ignored, then have that lowest number of bits. The idea is to assign a different token to each word. If used If the number of words is 65,536 or more, the same principle applies.

１７．１８またはそれ以上のビットのトークンをテキスト中のそれぞれの異なる言葉に割当てることができる。17. Tokens of 18 or more bits for each different token in the text can be assigned to words.

代替的アプローチとしては、２つのフィールドをもつトークンを小することである。この最初のフィールドは２番目のフィールドの長さを指定する固定長のフィールドである。この技術では、トークンは、それぞれの言葉の頻度計算に厳密に従って言葉に割当てられるので、最も短いトークンは、テキスト中に最も多く現われる言葉に割当てられ５次に短いトークンは、次に多く使用されている言葉に割当てられる、という具合になる。この方法では、辞書は、頻度計算の順番に、最も頻度の高い言葉を辞書の最初に記憶するように。An alternative approach is to reduce the token with two fields. Ru. This first field is a fixed-length field that specifies the length of the second field. It is old. In this technique, the tokens are strictly used to calculate the frequency of each word. Therefore, the shortest tokens that are assigned to words are the ones that occur most often in the text. The 5th shortest token assigned to the next most used word is It will be assigned. In this method, the dictionary, in order of frequency calculation, The most frequent words are memorized first in the dictionary.

記憶されるのである。It will be remembered.

この技術では、１つのトークンは１２ビツトの長さをもつことができる。しかし、よくあるように、言葉の頻度分布が非常に急な曲線を描くような場合は、テキスト中のそれぞれの言葉を表わすのに必要な平均ビット数は、下に示す例１の場合のように、大幅に減らすことができる。トークン化されたテキストが、２つのフィールドを持つトークンを使用して記憶される場合は、トークンを並列のリストに記憶し、そのリストの１つは最初のフィールドだけのリストであり、もう１つのリストは２番目のフィールドのリストである。というようにすると便利である。データは、２つのリストに同じ順序で記憶される。したがって、トークン化されたテキストを元の英数字テキストに変換するために、計算機は、最初のフィールドから４ビツトを読み取り、この４ビツトから２番目のフィールドリストから読み取るビット数を決定する、そしてこのビットを読み、そこで、言葉を頻度順に記憶している辞書の最初の所から言葉を数えて、そのビットに関連する英数字を発見するのである。このようにして、最も多く使用される言葉は、最初のリストでは００００で表わされ、２番目のリストではゼロビットで表わされる；次の２つの多く使用される言葉は、最初のリストでは０００１で。With this technique, one token can have a length of 12 bits. but , when the frequency distribution of words follows a very steep curve, as is often the case, the text The average number of bits required to represent each word in the text is can be significantly reduced, as in the case of The tokenized text is divided into two When stored using tokens with fields, the tokens are stored in a parallel list. one of the lists is a list of only the first field, and the other is a list of only the first field. The first list is the second field list. It is convenient to do as follows. Ru. The data is stored in the same order in the two lists. Therefore, tokenization To convert the text into the original alphanumeric text, the calculator uses the first file Read 4 bits from the field and read the second field list from these 4 bits. Determine the number of bits to read from Count the words from the beginning of the memorized dictionary and find the alphanumeric value associated with that bit. Discover the letters. In this way, the most used words are represented by 0000 in the first list and zero bits in the second list; The two most used words are 0001 in the first list.

また次のリストでは１ビツトで表わされる；次の４つの言葉は、最初のリストでは００１０で、２番目のリストでは２ビツトで表わされる。という具合になる。Also in the next list they are represented by 1 bit; the next four words are represented by 1 bit in the first list. is 0010 and is represented by 2 bits in the second list. That's how it goes.

計算機が最初のリストで００００を読み取ると、これらビットは、２番目のリストにはエントリーなしであることを示しており、したがって計算機は、辞書の中の最初の言葉である最も頻度数の高い言葉を検索するのである。If the computer reads 0000 in the first list, these bits will be read in the second list. indicates that there are no entries in the dictionary, so the calculator It searches for the most frequent word, which is the first word of the word.

計算機が最初のリストで０００１を読み取ると、２番目のリストで次のビットを読み取り、第２のビットのビットがゼロか１であるかによって、辞書中の第２または第３の言葉を検索するのである。When the calculator reads 0001 in the first list, it reads the next bit in the second list. Read, depending on whether the bit in the second bit is zero or one, the second or or a third word.

上に述べた。それぞれの言葉をトークンの形で記憶する技術は、また、言葉のグループ（即ち句）の記憶に拡張することができる。普通の句は、全てによって認知される。　ｒｏｆ　ｔｈｅ」、ｒａｎｄ　ｔｈｅＪやｒｔｏ　ｔｈｅＪは。mentioned above. The technology of memorizing each word in the form of a token also It can be extended to the storage of loops (i.e. phrases). Ordinary phrases are recognized by all. be known. "rof the", rand theJ and rto theJ.

殆どの英語の英数字テキスト中でがなりの頻度で使用されていると思われる。そういう句は、辞書の中では、自動的に１つの場所が割当てられるが、１つのトークンが、１つのそういう句の１つの出現のために用意されるのである。It seems to be used fairly frequently in most English alphanumeric texts. So The phrase ``u'' is automatically assigned one place in the dictionary, but has only one tone. A kun is reserved for one occurrence of one such phrase.

逆に１句は英数字のテキストを走査し、最も使用頻度の高い言葉のサブ・セットで言葉を比較し、簡単に発見することができる０例えば、最も使用頻度の高い１００語がこのサブ・セットを構成することもある。この手順では、最も使用頻度の高い句は、それが最も使用頻度の高い言葉の１つであるかどうか決めるのに続いてテキストのそれぞれの言葉を試験することによって簡単にアセンブルすることができる。もし、最も使用頻度の高い言葉でない場合には、次の言葉が取出される。もし、そうであわ、ば、その言葉は、最も使用頻度の高い言葉のリスト上にある直前の言葉とともに使用されるのである。最終的に、最も頻度数の高い言葉のリスト上にない所に来た場合、記憶されている言葉は４句のリストに加えられる。テキスト全体が走査されると、句の記憶リストがアルファベット類に分類され、重複しているものは除かれ１句の使用頻度計算が行われるのである１句を表わすのに使用できるトークンの数によって、トークンは、最もよく使用されるものから始まって、これらの句ｌ；割当てられるのであるが、これらの句は、他の句が割当てられる前に、テキスト中での句の代りをするのである。辞書とトークン化されたテキストから見て、そのトークンが１語を表わすのか言葉のグループを表わすのかで違いはないのである。したがって１元の英数字テキストは１図４のプロセスに従って簡単に再構成することができる。Conversely, one phrase scans alphanumeric text and identifies a subset of the most frequently used words. You can compare words and easily discover 0 for example, the most frequently used 1 00 words may constitute this sub-set. This step describes the most frequently used A phrase with a high easy to assemble by testing each word of the text. I can do it. If it is not the most frequently used word, the next word will be extracted. It will be done. If so, then the word is on the list of most frequently used words. It is used with the previous word in . Finally, the most frequent If you come to a place that is not on the leaf list, the memorized word will not be added to the list of 4 phrases. It will be done. Once the entire text is scanned, the memorized list of phrases is sorted into alphabets. The usage frequency of one phrase is calculated by removing duplicates and calculating the usage frequency of one phrase. Depending on the number of tokens that can be used to represent Starting from things, these phrases are assigned; these phrases are assigned to other It replaces the phrase in the text before the phrase is assigned. dictionary and toe Does the token represent a single word from the point of view of the converted text? It makes no difference whether it represents a group or not. Therefore, 1 original alphanumeric text is 1 figure It can be easily reconfigured by following the process in step 4.

例１我々の発明の実習で、我々は、トークンをそれぞれの言葉に関係づける辞書を作り、そのトークンで新約を書のそれぞれの言葉を置き換えることによって、新約を書全体を記憶した。辞書を記憶するのに必要なスペースを減らすために、辞書の殆ど全てを、アルファベット類に記憶し、辞書中の先行する言葉の最初の文字と同じ最初の文字を表わすために数字コードを使用して、圧縮した。Example 1 In our inventive exercise, we created a dictionary that related tokens to each word. the New Testament by replacing each word in the book with that token. I memorized the entire book. To reduce the space required to memorize dictionaries, memorize almost all of them alphabetically, starting with the first letter of the preceding word in the dictionary. compressed using a numeric code to represent the same first letter.

テキストをトークン化した形で記憶する最初の努力で、我々は、最も使用頻度の高い言葉を表わすのに１バイトのトークンを使用した。新約を書の中には、約１４　、０００語の異なる言葉があるので、最も使用頻度の高い言葉を約２００を１バイトのトークンで表わし、残りの１３，８００語を２バイトのトークンを表わした。この方法では、新約を書の１７０，０００語の約６５％が１バイトのトークンで表わされている。この１バイト・トークンを使用して、我々は、新約を書の全体の１，０３６，０００の文字を約２２０，０００バイトの記憶容量で記憶したのである。In our initial effort to memorize text in tokenized form, we One-byte tokens were used to represent high-level words. In the book of the New Testament, there are about 1 There are 4,000 different words, so we have selected about 200 of the most frequently used words. The remaining 13,800 words are represented by 2-byte tokens. I did. With this method, approximately 65% of the New Testament's 170,000 words are written in one-byte chunks. It is expressed in . Using this one-byte token, we create the New Testament The entire book's 1,036,000 characters can be stored with approximately 220,000 bytes of storage capacity. I remembered it.

記憶必要条件をさらに減少させるために、上に述べた型の２フイールド・トークンを使用することが有利であるということがわかった。特に、新約を書での最も使用頻度数の高い５つの言葉、それらが使用されている回数およびそれぞれの言葉を表わすのに使用されているトークン等、を表わしている表１から明らかなように、言葉の使用頻度曲線は非常に急である。０表１　■トークン　■言葉　■ 使用回数　２．フィールド・トークンを使用することにより、新約を書のテキスト全体を記憶するのに必要なバイト数を、約１８３，０００バイトまで減らすことができた。To further reduce memory requirements, two-field talk of the type described above It has been found to be advantageous to use In particular, the New Testament is the most The five most frequently used words, the number of times they are used, and the number of times each word is used. It is clear from Table 1 that the tokens used to represent leaves, etc. The frequency curve of words is very steep. 0Table 1 ■Token ■Words ■ Number of uses 2. By using field tokens, you can read the text of the New Testament. This reduces the number of bytes required to store the entire file to approximately 183,000 bytes. I was able to do it.

例２図１の一般的技術の操作を、ｍａｔｔｈｅｗ、　ｃｈａｐｔｅｒ　ＩＩからの２ −３の詩を使って、解説することができる：（略）発明に従って、それぞれの言葉が１つのトークンをもつ、１つの辞書を作る。Example 2 The operation of the general technique of Figure 1 is explained in Matthew 2 from Chapter II. -Can be explained using poem 3: (omitted) According to the invention, we create one dictionary in which each word has one token.

■表■ リストは、ここで、テキストの全ての言葉を整理するために１表■に示す如く、アルファベット類に分類される。■Table■ Here, a list is created to organize all the words in the text, as shown in Table 1 ■. It is classified as alphabetical.

■表■ アルファベット類のリストは、そこで、重複エントリーを取除く処理をされ１表 ■に示すように、それぞれのエントリーの頻度計算を出すようにされる。■Table■ The list of alphabets is then processed to remove duplicate entries and compiled into one table. As shown in ■, the frequency calculation for each entry is calculated.

■表■ 発明の好ましい形では、言葉と頻度計算のリストは。■Table■ In the preferred form of the invention, there is a list of words and frequency calculations.

そこで、言葉が使用頻度の減る順番に整理されている新しいリストを得るために、頻度計算により分類される。例２のテキストは非常に短いので、使用頻度に従ってリストを分類し、使用頻度の高い言葉を表わすのにより小さいトークンを使用する必要が殆どない、しかし、上でも強調したように、そういう分類は、テキストの大きさがかなり長い場合には、有益である。So to get a new list where the words are organized in order of decreasing frequency of use, , classified by frequency calculation. The text in Example 2 is very short, so it is Categorize the list by using smaller tokens to represent frequently used words. However, as emphasized above, such a classification This is useful if the strike size is fairly long.

個々の言葉は、そこで、言葉のアルファベット化されたリスト中で連続エントリーに割当てられているだんだん大きくなる数字をもつトークンを割当てられるのである。このようにして１例２での言葉へのトークンの割当ては１表■に示したようになる。Individual words are then sequential entries in an alphabetized list of words. tokens with increasing numbers assigned to It is. In this way, the assignment of tokens to words in Example 1 and 2 is shown in Table 1■ It becomes like this.

■表Ｖこの例では、それぞれの異なる言葉をユニークに識別するにはたった６ビツトが必要なだけであることは明らかである。明らかに、ビット数は、トークン化される異なる言葉の数によって変動する。■Table V In this example, it takes only 6 bits to uniquely identify each different word. It is clear that it is necessary. Obviously, the number of bits is tokenized varies depending on the number of different words used.

最後に、計算機は、表■に示すようなトークン化されたテキストを作るために、表■の直線リスト中のそれぞれの言葉を１表■に示す相当するトークンで置き換えるのである。Finally, the calculator uses Replace each word in the linear list in table ■ with the corresponding token shown in table ■ It's possible.

■表■ 倒２では、言葉の辞書を圧縮する意味はあまりない。■Table■ In Tora 2, there is not much point in compressing the word dictionary.

しかし、多くの言葉の最初の文字が同じである大きなテキストでは、辞典は、１つの言葉の全ての最初の文字が、先行する言葉の最初の文字と同じである場合。However, in large texts where many words have the same first letter, the dictionary When all the first letters of one word are the same as the first letters of the preceding word.

それらの文字を１つの数で置き換えることにより、圧縮することができる。It can be compressed by replacing those characters with a single number.

元のテキストの再構成は１図４に示すように、それぞれのトークンを１度に１つ読み取り、相当する言葉を発見、検索しまた適切な出力として準備されるまで、辞書を通して、数えるのに使用されることにより達成されるのである。The reconstruction of the original text consists of 1 each token at a time, as shown in Figure 4. until it is read, finds the equivalent word, searches for it, and prepares it as the appropriate output. This is achieved by being used for counting through the dictionary.

上に述べたように、辞書はまた、１つの言葉が英数字テキストで使用されていることを示すために情報検索に使用することもできる。この応用では、言葉が使用されているテキストの部分を示すために識別子を使用することは、その文脈でのその言葉の検索を速める。As mentioned above, dictionaries can also be used to determine where one word is used in alphanumeric text. It can also be used in information retrieval to indicate that In this application, the word Using an identifier to indicate a portion of text that is Speed up your search for that word.

新約を書の場合には、４つのＧｏｓｐｅｌｓ、　Ａｃｔ　ｏｆ　ｔｈｅＡｐｏｃｔｔｅｓ、　Ａｐｏｃａｌｙｐｓｅ、　Ｐａｕｌｉｎｅ　Ｅｐｉｓｔｌｅｓと　ｎｏｖ−Ｐａｕｌｉｎｇ　Ｅｐｉｓｔｌｅｓのそれぞれの別／Ｊ（１）識別を１バイトの識別子で行うことができる。In the case of the New Testament, there are four Gospels, Act of the Apoc. ttes, Apocalypse, Pauline Epistles and nov-Pauling Each of the Epistles / J (1) Identification 1 This can be done with byte identifiers.

この技術に熟達しているものには明らかなように、上に述べた発明には多くの変形が可能である。As will be apparent to those skilled in the art, there are many variations to the invention described above. shape is possible.

オ　１　図終Ｊｔ　２　図才　３ｒＸＪオ　４　図才　５　図手続補正書動幻昭和６０年１２月　６日特許庁長官　宇　賀　道　部　殿１、事件の表示　ＰＣＴ／ＵＳ８４１０１６６７２、発明の名称　データ圧縮方法および装置３、補正をする者事件との関係　特許出願人名　称　テキスト　サイエンセズ　コーポレーション４、代理人住　所　（〒１００）東京都千代田区丸の内−丁目５番１号３、補正の対象　特許法第１８４条の５第１項の規定による書面及び委任状並びに明細書及び請求の範囲の翻訳文の浄書フ、補正の内容　特許法第１８４条の５第１項の規定による書面の特許出願人の欄の代表者名を補充し。Figure 1 End J t2 diagram Sai 3rXJ E 4 Figure Sai 5 diagram Procedural amendment document December 6, 1985 Mr. Michibe Uga, Commissioner of the Patent Office 1. Indication of incident PCT/US841016672, title of invention Data compression method Law and Apparatus 3, Persons Making Amendments Relationship to the case: Patent applicant Name Text: Sciences Corporation 4, Agent Address (100) Marunouchi-5-1-3, Chiyoda-ku, Tokyo; Target of amendment: Special Documents and power of attorney as well as specifications and claims pursuant to the provisions of Article 184-5, Paragraph 1 of the Permission Act. Engraving a range of translations F. Contents of amendment: Patent applicant's document pursuant to the provisions of Article 184-5, Paragraph 1 of the Patent Law Fill in the name of the representative in the column.

国際調査報告international search report

Claims

[Claims]

1. In a machine-based system for storing or transmitting text, said text is Associating one different token with each different word or group of words in the text , the average number of digits needed to represent the token is Create a dictionary with fewer digits than the average number of digits needed to represent said word in step, and each word or group of words is identified by the dictionary as the word or group of words. Replace the tokens associated with a group of words, thereby representing the said text. The number of digits required to A method for compressing files.

2. The method of claim 1, wherein the text includes alphanumeric symbols and phrases. Consists of comma words.

3. The method of claim 1, wherein each word is a continuous space in the text. A string of alphanumeric characters and a single symbol, such as a punctuation mark, located between spaces. Ru.

4. In the method according to claim 1, the step of creating a dictionary includes A step that instructs the text words in block notation to form an alphabetical arrangement. and remove all duplicate words in the alphabetical list and simplify the the step of creating a simplified alphabetical list; Each step consists of assigning a different token in the list.

5. In the method of claim 4, each different token is one different token. have numeric values and represent different tokens in the reduced alphabetical list. The step of assigning words is to alphabetize different tokens in consecutive numerical order. It consists of assigning words in different orders.

6. In the method according to claim 4, the step of creating a dictionary further includes the following: It consists of two steps: the first is to identify the most frequently occurring words in the text. The other is to determine which words appear the most, and which words appear less frequently. This means assigning tokens that are shorter than the tokens assigned to the given word.

7. In the method according to claim 6, the step of allocating the token comprises the step of allocating 1-byte tokens to the first 128 most used words; and assigning tokens longer than one byte to the remaining words.

8. In the method according to claim 7, the tones assigned to each word are The first byte of the token indicates whether the token is 1 byte long or longer than 1 byte. It has one bit position containing the bit.

9. In the method set forth in claim 6, the step of allocating tokens includes the following: It consists of steps: first, the remaining words are represented by two-byte tokens. 1-byte talk that can be used to express the words most often used when being spoken to The second is to calculate the maximum number of words that are used in large quantities. The third is to allocate a 1-byte token and the third is to assign a 2-byte token to the remaining words. The first step is to allocate a token.

10. In the method according to claim 4, the step of creating a dictionary further includes the following: Consists of steps: in an alphabetized list to create a frequency number. Counting the duplicate entries of words, then following the frequency calculation for each word. Sorting the compressed alphabet list, the last, most appearing Tokens that are shorter than the tokens assigned to words that occur less frequently in words It is to allocate.

11. In the method of claim 10, the step of allocating the tokens comprises: It consists of the following steps: one token with two fields for each word. but its first field is of fixed length and the length of the second field is and talk according to the frequency calculation for each word as mentioned above. tokens can be assigned to words, so the shortest tokens appear most often in the text. The next shortest token is assigned to the next most occurring word. , and so on.

12. 12. The method of claim 11, wherein the first field contains four binary Numbers have equal length.

13. In the method according to claim 4, the step of creating a dictionary further includes the following: Consisting of steps: first, to a different token with its lowest number of bits. Therefore, we need to calculate the minimum number of bits required to represent each different word. and the second is to represent different tokens with the lowest number of bits in different words. It is to assign to leaves.

14. The method according to claim 1 further comprises the following steps: How many times in both words are the first letters of the word the same as the first letter of the preceding word? Compresses the dictionary by replacing it with a single number indicating whether the first character of That's true.

15. In the method according to claim 1, the text is divided into multiple segments. However, the means of creating a dictionary is based on the way the word appears in each different word. means for providing an indicator specifying the portion of the text to be displayed.

16. A dictionary produced by the method according to claim 16.

17. A dictionary produced by the method according to claim 1.

18. One dictionary for each different word or group of words in the text. In systems using machines that attach signals such as these, text can be reconstructed from these signals. The method to create consists of the following steps: extracting the next token from the above signal. , find a word with the above token in the dictionary, and put the above word into the machine The use is to prepare the output of the system.

19. Machine-based systems for storing or transmitting text compress or transmit text. The method consists of the following steps: A different token for a word or group of words to represent that word in that system. The average number of digits required to represent that token is less than the minimum average number of digits required to represent that token. Create a dictionary with a low number of digits, then find out how many digits are needed to represent the text. of the word or words in the dictionary to create a compressed text that can be reduced. Place each word or group of words with a token given to the group. and extracting the next token from the compressed text, or Find the word with that token in the dictionary, and then use the machine is to prepare that word for output.

20. The method of claim 19, wherein the text includes alphanumeric symbols and punctuation. Consists of dot words.

21. In the method according to claim 19, the step of creating a dictionary includes the following step. consisting of: in alphabetical order to create an alphabetized list. ordering the words of the text, then the compressed alphabet list Remove all duplicate words in the alphabet list to make Assigning different tokens to different words in the shortened alphabet list And so it is.

22. The tool for compressing text consists of the following means: Different tokens for different words or groups of words, that word in that system the average number of digits needed to represent that token less than the average number of digits needed to represent it One way is to create a dictionary that gives the number of digits, etc., and then A token with a word or group of words that reduces the number of digits required for The idea is to replace each word or group of words in that dictionary.

23. The device of claim 22, wherein the text includes alphanumeric symbols and punctuation. Consists of dot words.

24. In the device according to claim 22, each word may be an alphanumeric character or a phrase. A string of symbols between consecutive spaces in text, such as commas.

25. In the device according to claim 22, the means for creating a dictionary comprises: Sort words in text alphabetically to create an alphabet list Alphabet list to create a condensed alphabet list, a means of ordering means to remove all duplicate words in the A means of assigning different tokens to different words.

26. In the tool according to claim 22, the means for creating a dictionary further comprises: consists of: a means of determining which words are used most often in a text, and Tokens assigned to words that occur more often are shorter than those assigned to words that occur less frequently. A means of allocating new tokens.

27. In the device according to claim 22, the text is divided into the popularity of the parts. The means of creating a dictionary is to identify the parts in which the word appears for each different word. It consists of means for giving an indicator to specify where.