EP0160672A1

EP0160672A1 - Method and apparatus for data compression

Info

Publication number: EP0160672A1
Application number: EP84903871A
Authority: EP
Inventors: Louie Don Tague; Allen T. Cobb
Original assignee: TEXT SCIENCES Corp
Current assignee: TEXT SCIENCES Corp
Priority date: 1983-10-19
Filing date: 1984-10-17
Publication date: 1985-11-13
Also published as: IT8468039A1; CA1226369A; JPS61500345A; EP0160672A4; IT1180100B; WO1985001814A1; IT8468039A0

Abstract

Un procédé et un appareil permettent la compression de données alphanumériques qui sont stockées ou transmises sous forme de codes digitaux. Un dictionnaire est créé pour affecter chaque mot du texte alphanumérique et la ponctuation qui le suit à une adresse ou signe unique comprenant par exemple jusqu'à 16 bits (deux bytes). Chaque mot du texte alphanumérique est alors remplacé par l'adresse qui se réfère à ce mot du dictionnaire. Etant donné que le dictionnaire peut contenir jusqu'à 216 = 65.536 entrées, il est d'une taille plus que suffisante pour la mémorisation des mots associés à presque n'importe quel livre. Etant donné que seuls deux bytes d'informations sont nécessaires pour s'adresser à l'un quelconque de ces 65.000 mots, le remplacement de chaque mot de texte par deux bytes d'informations d'adresses réduit d'un facteur de trois environ le nombre moyen de chiffres requis pour stocker le texte. Des réductions supplémentaires de 25% ou davantage de la longueur du texte comprimé peuvent être obtenues dans la plupart des cas en représentant les mots utilisés le plus fréquemment par des signes qui sont plus courts que deux bytes en longueur. Le nombre de bytes requis pour mémoriser le dictionnaire peut être sensiblement réduit en mémorisant les mots dans l'ordre alphabétique et en tirant avantage de la redondance des caractères qui en résulte. Ainsi, si la seconde de deux entrées contient cinq lettres qui sont identiques à celles de l'entrée précédente, ceci peut être signifié en stockant un caractère représentant le nombre 5 et les autres caractères restants qui ne sont pas communs aux deux entrées.A method and apparatus allows the compression of alphanumeric data which is stored or transmitted in the form of digital codes. A dictionary is created to assign each word of the alphanumeric text and the punctuation that follows it to a unique address or sign comprising for example up to 16 bits (two bytes). Each word in the alphanumeric text is then replaced by the address that refers to that word in the dictionary. Since the dictionary can contain up to 216 = 65,536 entries, it is more than sufficient in size to store words associated with almost any book. Since only two bytes of information are required to address any one of these 65,000 words, replacing each text word with two bytes of address information reduces by a factor of about three the average number of digits required to store text. Additional reductions of 25% or more in the length of the compressed text can in most cases be achieved by representing the most frequently used words by signs which are shorter than two bytes in length. The number of bytes required to store the dictionary can be significantly reduced by storing the words in alphabetical order and taking advantage of the resulting redundancy of characters. Thus, if the second of two entries contains five letters which are identical to those of the previous entry, this can be signified by storing a character representing the number 5 and the other remaining characters which are not common to the two entries.

Description

METHOD AND APPARATUS FOR DATA COMPRESSION

BACKGROUND OF THE INVENTION

This relates to a method and apparatus for reducing the number of signals required to encode alphanumeric data for storage or transmission. It is particularly useful in storing large volumes of text such as a book in a computer system or transmitting such volumes on a data communication system.

Prior art techniques for encoding alphanumeric text usually rely on the substitution of an eight-bit binary code commonly called a byte for each character of the alphanumeric text. One such code comprises seven bits which define the character in accordance with the American Standard Code for Information Interchange (ASCII) and an eighth bit that is either used as a parity bit or is set to 0. Tables of these codes are set forth, for example, At pages 125 and 126 of Ralston, et al., Encyclopedia of Computer Science and Engineering, 2nd Ed. (Van Nostrand Reinhold, 1983).

However, the use of eight bits to represent each character in large volumes of alphanumeric text severely taxes the limits of present-day microcomputers and communications systems. For example, there are over 170,000 words in the New Testament and about 1,036,000 separate characters. Accordingly, over one megabyte of data storage is required to store the New Testament. Even with present-day storage technology, requirements of this sort make it relatively expensive to store full text books and the like in modern computer systems. Likewise, it is relatively expensive and time-consuming to transmit coded quantities of text as large as a book. In an effort to reduce data storage and transmission requirements, standard codes have been modified so as to use certain eight-bit codes to represent the more frequent combinations of two letters. Thus, the digraph "th" might be represented by a single eight-bit code rather than by two eight-bit codes, one of which represents a "t" and the other of which represents an "h". This technique, however, is relatively limited in the data compression it can achieve. Typically, a reduction of about 40% can be achieved in the length of the binary codes required to represent the alphanumeric text. Somewhat greater reductions can be achieved if careful attention is paid to the frequency with which letter pairs appear in the particular text; but the best that can be achieved is on the order of a 60% reduction_*.

SUMMARY OF THE INVENTION

We have devised a technique for significantly improving the amount of data compression that can be achieved when storing alphanumeric data in the form of digital codes. In accordance with our invention, a dictionary is created which assigns each different word of the alphanumeric text and the punctuation that follows it to a unique token. Each word in the alphanumeric text is then replaced by the token that refers to that word in the dictionary. Illustratively, each token is a sequence of binary digits and contains up to 16 bits (two bytes) which identify or address one such word. Accordingly, the dictionary can contain up to 2¹⁶ = 65,536 entries, which is more than adequate for the storage of the words associated with almost any book. Because only two bytes of information are needed to identify any one of these 65,536 words, replacement of each word of text with two bytes of information reduces the average number of digits rsquired to store the text by a factor of about three. If the dictionary contains more than 65,536 words, the number of bits needed in at least some tokens will have to be greater than 16. Conversely, if the number of words in the dictionary is some power of two less than the sixteenth power, the number of bits in each token can be less than 16. Advantageously, the dictionary can be created very rapidly using a conventional microcomputer system and the stored text can be recreated in human readable form by such a microcomputer system.

The number of bytes required to store the dictionary can be substantially reduced by storing the words in alphabetical order and taking advantage of the redundancy in characters that results. Thus, if the second of two entries contains five letters that are the same as that of the preceding entry, this can be signified by storing one character representing the number 5 and the remaining characters not common to both entries. Because of the large amount of redundancy that is present in such a dictionary because of the use of plurals, possessives, cognates and entries that are identical except for punctuation, the size of the dictionary can be reduced by such techniques by a factor of about three.

Further reductions in the length of the compressed text can be achieved in most cases by representing the most frequently used words with tokens that are shorter than two bytes in length. Because only a small number of the most frequently used words ordinarily account for more than half of ell the words in the text, the use of a one byte token, for example, instead of a two byte token, for the most frequently used words can reduce the storage requirements for the text by at least an additional 25% and in many cases by considerably more than 50%. The foregoing techniques achieve significant data compression while maintaining the boundaries between words. In a test performed on the Ring James Version of the New Testament, they made it possible to store the 1,036,000 characters of the New Testament in approximately 220,000 bytes using one compression method and 183,000 bytes using another. In a test performed on approximately 900,000 characters of training material for lawyers, they permitted the text to be compressed to less than 150,000 bytes.

Because the dictionary contains each word in the alphanumeric text, it can be used to determine if a particular word, or several words, is used in the text. Since the size of the dictionary is considerably smaller than the entire alphanumeric text, one can determine if a word is used in the text much faster by searching the dictionary than by searching the entire alphanumeric text. In addition, the location of the word in the text can be specified by adding to each word in the dictionary An identifier that indicates each segment of the text in which the word appears. With this feature, it is also possible to compare the identifiers associated with different words to locate those words that appear in the same segment of the text.

BRIEF DESCRIPTION OF DRAWINGS

These and other objects, features and advantages of our invention will be more readily apparent from the following detailed .description of preferred embodiments of the invention in which:

Fig. 1 is a flow chart illustrating the general concept of the preferred embodiment of our invention;

Fig. 2 is a flow chart illustrating the preferred embodiment of our invention in greater detail; Fig. 3 is a flow chart illustrating a detail of Fig. 2;

Fig. 4 is a flow chart illustrating a second feature of a preferred embodiment of our invention; and

Fig. 5 is a block diagram depicting illustrative apparatus used with a preferred embodiment of our invention.

DESCRIPTION OF PREFERRED EMBODIMENT OF THE INVENTION

As shown in Fig. 1, an alphanumeric text is compressed in our Invention by first creating a dictionary which associates each word of the alphanumeric text with a unique token of up to sixteen bits (two bytes). As is well known, the pattern of ones and zeroes in sixteen bits can be used to represent any number from 0 to 65,536. To form a compressed text, each word is replaced by the token that refers to that word in the dictionary. Optionally, the size of the dictionary can be reduced by storing the words of the dictionary in alphabetical order and taking advantage of the redundancy in characters that results. Advantageously, the length of the compressed text can be further reduced by representing the most frequently used words with tokens having a length that is less than two bytes. Preferably, these steps are performed by a computer such as a conventional microcomputer.

Specific steps for implementing the techniques of Fig. 1 in a microcomputer are set forth in Fig. 2. First, the text of the book or other material to be compressed is converted to a linear list of words. In effect this requires that a carriage return/line feed be inserted after each word of the text. Conveniently for this purpose, each word is considered to be all the alphanumeric symbols including punctuation between successive spaces in the text. Thus, the carriage return/line feed is simply inserted every time a space or spaces is encountered in the text; and the one space immediately in front of the alphanumeric text is considered to be part of that word. Where multiple spaces are found between words, all the spaces except the one space immediately in front of the alphanumeric text are treated as a single word of space characters. After the linear list is created, it is sorted alphabetically using a conventional sort so that all the words of the text are arranged in alphabetical order.

The alphabetized list is then processed by the microcomputer to eliminate duplicate entries and to generate a frequency count for each entry. Thus, the entire alphabetized list of words is replaced by a new condensed list which identifies each word from the original alphabetized list and specifies the number of times that word appears in the original alphabetized list. Illustratively, this procedure is implemented as shown in Fig. 3. Bach word of the alphabetized list is fetched in turn by the microcomputer. A determination is made if this is a new word by comparing this word with the previously fetched word. If the two words are the same, the word in question is an old word; and the frequency counter is incremented by one and the next word is fetched from the list. If the two words are different, the word in question is a new word; and the old word and the contents of the frequency counter are written on the new list, the frequency counter is reset to one and the new word is stored for subsequent comparison.

To create a dictionary, each of the words of the condensed alphabetized list is assigned an individual token. However, in order to reduce storage requirements, it is desirable to assign tokens having a length less than two bytes to the more frequently used words using any one of several techniques. For example, one byte tokens can be assigned to the most frequently used words. To do this, a copy is first made of the condensed alphabetized list and the list is stored. The list of words and frequency counts is then sorted by frequency count to obtain a new list in which the words are arranged in decreasing order of frequency of use. In one technique, one of the eight bits of a byte can be used to identify the byte as a one byte token instead of a two byte token. In such case, the other seven bits of the byte can be used to provide 128 different tokens. If the byte is not identified as a one byte token, then the remaining fifteen bits of the two byte token can be used to identify up to 32,768 different words in the text. Accordingly, in this technique each of the 128 most frequently used words is assigned one of the 128 different one byte tokens; and the remaining words are assigned different two byte tokens.

Alternatively, the number of one byte tokens can be varied depending on the number of different words used in the text. In particular, it can be shown that the maximum number of different words that can be represented by a combination of one and two byte tokens is given by x + 256 (256 - x) where x is the number of one byte tokens used. Obviously x must be a positive whole number less than or equal to 256. From this it follows that where y is number of different words in the text the largest number of one byte tokens that can be used is the largest whole number such that:

x ≤ (256² - y)/255 (1)

For example, if there are 12,000 different words in the text, x - 209. Thus, the 209 most frequently used words can be represented by 209 one byte tokens and the remaining 11,791 words are represented by two byte tokens. Accordingly, when using this technique, equation (1) is used to calculate the maximum number of one byte tokens that can be used. This number of the most frequently used words is then assigned one byte tokens, each word being assigned a different token. The remaining words in the text are then assigned two byte tokens.

Whichever method is used to determine the number of one byte tokens, a dictionary is created by the microcomputer by assigning the tokens to the words in successive numeric order beginning with the first word and continuing to the last. The numeric order of the tokens can be ascending or descending but must be monotonic in the preferred embodiments described herein. In subsequent descriptions, it is assumed that the numeric order is ascending. Advantageously, the words that are represented by one byte tokens are assigned to a first dictionary and the remaining words are assigned to a second. To minimize storage requirements, as detailed below, the second dictionary that associates words and two byte tokens is stored so that the words are in alphabetic order. Because the first dictionary has at most 256 entries, there is usually no need to alphabetize this dictionary. However, because the words stored in this dictionary are used so often in the text, it is desirable to minimize retrieval time from this dictionary. To this end, the words are stored in the order of their frequency of use in the text with the most frequently used word first. The dictionaries that are stored preferably contain only the words of the dictionary and none of the tokens. Illustratively, the words are stored in the form of ASCII-encoded symbols with one byte being used to represent each symbol. Since there are only 96 ASCII symbols, one bit of each byte is available for other purposes. This bit is used to identify the beginning of each word. In particular, the beginning of each word is identified by setting the eighth bit of the first ASCII character of each word to a "1" while the eighth bit of every other ASCII character in the word is set to "0". As a result, the token associated with a particular word in the dictionary can be determined simply by counting the number of words from the beginning of the dictionary to the particular word in question and adding that count and the numeric value of the token associated with the first word in the list. This counting can be done simply by masking all but the eighth bit of each byte and counting the appearance of each "1" bit in that position as the computer scans each byte from the first word in the list to the word in question.

For example, if the first dictionary contains 209 words, tokens, having binary values from 0000 0000 to 11010001 will be assigned to these words. To determine the token assigned to a particular word, 'the computer simply counts the appearance of each "1" bit in the eighth bit position of each byte in the dictionary beginning with the first byte and ending with the byte immediately before the particular word whose token is being calculated. Since the numeric value of the token assigned to the first word in this dictionary is zero, the count is the value of the token. For this example, the tokens assigned to the words of the second dictionary will commence with the binary value 1101 0010 0000 0000. Accordingly, the value of the token is determined by counting words in the same fashion as in the first dictionary and adding to the count the binary value, 11010010 0000 0000, associated with the first word of the second dictionary.

To speed up the counting process it is helpful to use a look-up table that identifies the token associated with certain words. For example, the look-up table could store the token associated with the first word beginning with each of the twenty-six letters of the alphabet, and the counting process could begin with the first word that had the same first letter »9 the word whose token is to be calculated.

After the dictionaries have been created, the microcomputer then compresses the alphanumeric text by reading each word from the linear list that was initially generated, looking up the word in the first or second dictionary and replacing the word in the linear list with the token obtained from the dictionary. In this process, a search is first made through the words of the first dictionary, testing the individual ASCII codes of the words of the first dictionary to determine if they are the same as the word that is to be replaced by a token and counting each test that fails. If a match is found, the count of failed tests is the value of the token provided the value of the token associated with the first word is zero. If no match is found in the first dictionary, the computer moves on to the second dictionary. Here the look-up table is used to provide a starting point for the search through the dictionary. For example, the first letter of the word whose token is to be determined can be used to locate in the look-up table the first word that begins with that letter. The table will supply the value of the token for that word. A search can then be made in alphabetic order through the different words that begin with that letter, testing the individual ASCII codes of each word to determine if they are the same as the word in question. For each word that fails the test, a counter is incremented by one. When the word is finally located, its token is calculated by adding the contents of the counter to the value obtained from the look-up table of the token associated with the first word that begins with the same first letter. In this way, the entire linear list of words is replaced by a list of tokens to form a tokenized text.

Finally, the second dictionary may be compressed by coding techniques. Because the words of this dictionary are in alphabetical order, almost all words will have at least one initial character that is in common with the initial character or characters in the preceding word in the dictionary. In the case where at least two initial characters in a second word are the same as the corresponding initial characters in the first immediately preceding word, it becomes advantageous to represent that second word by (1) a number that identifies the number of initial characters in the first word that are the same and (2) a string of characters which are the balance of characters in the second word that are different from those in the first word. Thus individual words in the dictionary are stored using a number to specify the number of initial characters that are the same as those in the preceding entry and the ASCII codes for the remaining characters that are different. To expedite processing, the number is stored as a binary number that can be used immediately in retrieving the initial characters of the word. For example, the words "storage," "store" and "stored" may appear successively in the dictionary. In this case, the word "store" is represented by the binary number for "4" and the ASCII character for "e" because the first four letters of the word are found in the immediately preceding word while the "e" is not; and the word "stored" is represented by the binary number for "5" and the ASCII-encoded character for "d" because the first five letters of the word are found in the immediately preceding word while the "d" is not.

The tokenized text, the dictionaries, the look-up table and a computer program to read the tokenized text are then stored on any appropriate media such as tape, disk or ROM. Alternatively, this same information may be transmitted from one location to another by a data communication system. Because of the significant data compression achieved in practicing our invention, it is possible to store the entire text of full sized books on one or two 5-1/4" (13 mm) floppy disks. In general, the length of the text can be reduced by about 60% or 70% by the substitution of tokens for words. A further reduction of 25% and in some cases as much as 50% can be achieved by the use of one byte tokens for the more frequently used words in the text. Thus an overall reduction of about 75% in text length is readily achievable in practicing the invention. The dictionary obviously adds to the length of the text. The length of the second dictionary, however, can be minimized by using numeric codes as set forth above to represent identical initial characters in successive words. This reduces the length of the dictionary by a factor of about three. Illustrations of the amount of compression that can be achieved with the invention are set forth in Example 1 below. Similar reductions in the channel transmission capacity required to transmit such text can also be achieved with the practice of our invention.

A flow chart illustrating the reconstruction by a computer of the original alphanumeric text from the tokenized text is set forth in Fig. 4. As shown therein, each token is fetched in turn by the computer which then searches one of the dictionaries to find the word associated with the token. In the case of a one byte token, the computer simply loads the binary value of the token into a counter and, commencing with the most frequently used word, successively reads the words in the first dictionary, decrementing the count by one for every byte that has a "1" bit in the eighth bit position until the value in the counter is zero. At this point, the next word to be read is the word represented by the token initially loaded into the counter. In searching the second dictionary, the computer advantageously uses the look up table that associates tokens with the first word beginning with each letter of the alphabet. Thus, the computer simply scans the look-up table in reverse order subtracting the values of the tokens in the table from the value of the token that is to be converted to text. When the difference between the values shifts from a negative value to a positive value, the computer has reached the first word that begins with the same letter as that of the word represented by the token. Accordingly, the computer subtracts this token value from the value of the token to be converted to text and begins the same process of reading the bytes of the different words that begin with that letter. With each byte that has a "1" bit in the eighth bit position, the computer decrements the count by one until the count reaches zero, at which point the next word to be tested is the word identified by the token. whether retrieved from the first dictionary or the second, the word is then provided to the computer output which may be a display, a printer or the like; and the computer moves on to the next token.

Our invention may be practiced in all manner of machine-implemented systems. Specific apparatus for tokenizing the text and for reconstructing the original alphanumeric text from the tokenized text may be any number of suitably programmed computers. In general, as shown in Fig. 5, each such computer comprises a processor 10, first and second memories 20, 30, a keyboard 40 and a cathode ray tube (CRT) 50. Optionally the apparatus may also include a printer 60 and communication interface 70. The devices are interconnected as shown by a data bus 80 and controlled by signal lines 90 from microprocessor 10. In addition, the memories may be addressed by address lines 100. The configuration shown in Fig. 5 will be recognized as a conventional microcomputer organization. The program to create the dictionary and tokenize the alphanumeric text may advantageously be stored in the first memory which may be a read only memory (ROM). If the same device is also used to reconstruct the alphanumeric text from the tokenized text, that program may also be stored in memory 20. The tokenized text that is created, along with the dictionaries and look-up tables, is typically stored in memory 30; and the reconstruction program may also be stored in memory 30 if it is not available in memory 20. The tokenized text, dictionaries, look-up tables and reconstruction program may also be transmitted by means of communication interface 70 to another microcomputer at a remote location.

Advantageously, memory 30 is a programmable read only memory (PROM) , a magnetic tape or a floppy disk drive because the capacity of such devices is generally large enough to accommodate the entire text of a book in a PROM of reasonable size or a small number of floppy disks. Obviously, where a PROM is used, an appropriate device (not shown) must be used to record the tokenized text, dictionaries, look-up tables and reconstruction program in the PROM. Such devices are well known. Where it is desirable to store a large number of books in one record, the significantly larger capacity of fixed disk drives or large ROM boards can advantageously be used in the practice of the invention. When the apparatus of Fig. 5 is used to reconstruct the original alphanumeric text from data stored on disks, it is advantageous to transfer the entire contents of the disks to a semiconductor memory because the significantly higher speeds of the semiconductor memory will greatly facilitate the look up of words in the dictionary. For this purpose it is also advantageous to compress the dictionary to a size such that it fits within the storage capacity of conventional microcomputer memories. We have found it practical to do this where 64K bytes of semiconductor memory are available.

There are numerous applications for our invention.. As indicated above, the invention is useful in compressing alphanumeric text for data storage or transmission. Because reconstruction of the original text can be performed expeditiously, the compressed data can then be used in any application for which the original text might have been used. In addition, because the compressed data is meaningless without the dictionary, one can provide for secure storage and/or transmission of alphanumeric text by generating the tokenized text and dictionary and then separating them for purposes of storage and transmission.

Because the dictionary contains each word of the alphanumeric text but is considerably shorter, the dictionary is also a useful tool in information retrieval. In particular, one can readily determine if a particular word is used in the alphanumeric text simply by scanning the dictionary. Additional advantages can be obtained by adding an identifier to each word in the dictionary which specifies each segment of the text in which the word appears. For example, the identifier might be one byte long and each of the eight bit positions in the byte could be associated with one of eight segments of the text. For this example, the presence of a 1-bit in any of the eight bit positions of that byte would indicate that the associated word was located in the corresponding segment of the text. Use of such an identifier will greatly speed retrieval of the alphanumeric text surrounding the word in question because there is no need to search segments in which the word does not appear. Moreover, by comparing the individual bits in the identifiers associated with different words, one can determine if the words are used in the same segment of the text. Obviously, the size of the identifier can be varied as needed to locate word usage more precisely.

Numerous variations may also be made in the practice of our invention. While we have described the invention in terms of alphanumeric texts, binary tokens and ASCII codes, the invention may be practiced with all manner of symbols and the symbols may be tokenized and coded in various ways. For example, foreign languages, mathematical symbols, graphical symbols and punctuation can all be accommodated in practicing the invention and these symbols can be represented by ASCII, expanded ASCII or any suitable code of one's own choice. While the use of binary tokens is preferred in the practice of our inventioni it may be convenient to represent such tokens in other radices such as hexadecimal and the invention can be practiced using tokens having digits of any radix.

We have illustrated two examples for reducing the size of the tokenized text by using codes of less than two bytes to store more frequently used words. Numerous other techniques, however, are available. For example, because the vocabulary used in most books is typically significantly less than the 65,536 words that can be represented by sixteen bits, it is frequently possible to represent each of the words of the alphabetised text by less than sixteen bits. For example, as many as 32,768 words can be represented by fifteen bits and 16,384 words may be represented by fourteen bits. Accordingly, another method of assigning tokens to bits is to calculate the minimum number of bits required to represent each different word by a different token having that minimum number of bits and then to assign to each different word a different token having that minimum number of bits. If the vocabulary used has more than 65,536 words, this same principle can be used to assign tokens of 17, 18 or even more bits to each different word in the text.

An alternative approach is to use tokens having two fields, the first of which is a field of fixed length that specifies the length of the second field. In this technique the tokens are assigned to the words strictly in accordance with the frequency count for each word so that the shortest token is assigned to the word that appears most frequently in the text, the next shortest token is assigned to the word that appears next most frequently, and so forth. In this arrangement the dictionary is stored in frequency count order with the most frequent words being stored at the beginning of the dictionary. With this technique, a token can be as long as twenty bits. However, if the frequency distribution of the words is a very steep curve, as it often is, the average number of bits required to represent each word in the text is significantly reduced, as in the case of Example 1 below. When tokenized text is stored using a token having two fields, it is advantageous to store the tokens in two parallel lists, one of which is merely the list of the first fields and the other is the list of the second fields. Data is stored on the two lists in the same order. Accordingly, to convert the tokenized text to the original alphanumeric text, the computer reads four bits from the first field list, determines from these four bits the number of bits to read from the second field list, reads these bits, and then locates the alphanumeric word associated with such bits by counting words from the beginning of the dictionary in which words are stored in frequency count order. Thus, the most frequently used word would be represented by 0000 in the first list and zero bits in the second list; the next two most frequently used words by 0001 in the first list and one bit in the second list; the next four words by 0010 in the first list and two bits in the second list; and so on. When the computer reads 0000 in the first list, these bits indicate there is no entry in the second list and accordingly the computer retrieves the most frequently used word which is the first word in the dictionary. When the computer reads 0001 in the first list, it reads the next bit in the second list and retrieves either the second or third word in the dictionary depending on whether the bit in the second bit is a zero or a one.

The techniques described above for storing individual words in the form of tokens can also be extended to the storage of groups of words (i.e., phrases). Common phrases will be recognized by all. Phrases such as "of the", "and the", and "to the" can be expected to occur with considerable frequency in almost all English language alphanumeric text. Such phrases can be automatically assigned a place in the dictionary and one token can be provided for each appearance of one such phrase.

Alternatively, phrases can be identified simply by scanning the alphanumeric text and comparing the words with a subset of the most frequently used words. For example, the 100 most frequently used words might constitute this subset. In this procedure, phrases of the most frequently used words can be assembled simply by testing each word of the text in succession to determine if it is one of the most frequently used words. If it is not, the next word is fetched. If the word is, it is stored along with any immediately preceding words that are on the list of most frequently used words. When a word is finally reached that is not on the list of most frequently used words, the stored words are added to a list of phrases. After the entire text has been scanned, the stored list of phrases is sorted in alphabetical order, duplicates are eliminated and a frequency count of the phrases is made. Depending on the number of tokens available to represent phrases, tokens are assigned to these phrases beginning with the most frequently used phrases, and these tokens are then substituted for the phrases in the alphanumeric text before any other tokens are assigned. From the standpoint of the dictionary and the tokenized text, it makes no difference whether the token represents one word or a group of words. Accordingly, the original alphanumeric text can be reconstructed simply by following the process of Fig. 4.

Example 1

In practicing our invention, we have stored the entire New Testament by generating a dictionary that associates each word with a token and replacing each word of the New Testament with that token. In order to reduce the space required to store the dictionary, almost all of the dictionary is stored in alphabetic order and is compressed by using numeric codes to represent the number of initial characters that are the same as the initial characters of the preceding word in the dictionary.

In our initial efforts to store the text in tokenized form, we used one-byte tokens to represent the most frequently used words. Because there are approximately 14,000 different words in the New Testament, approximately 200 of the most frequently used words can be represented by one-byte tokens and the remaining 13,800 words are represented by two-byte words. For this arrangement, approximately 65% of the 170,000 words in the New Testament are represented by a one-byte token. By using such one-byte tokens, we stored the entire 1,036,000 characters of the New Testament in approximately 220,000 bytes of storage. In order to further reduce storage requirements, we found it advantageous to use two-field tokens of the type described above. In particular, the curve of the frequency of use of words is very steep, as is apparent from Table I which sets forth the five most frequently used words in the New Testament, the number of times they are used and the token used to represent each such word.

TABLE I

Token Word # Times Used

0000 the 10,145

00010 and 7,309

00011 of 5,705

001000 that 3,364

001001 to 3,098

By using two-field tokens, we have been able to reduce the number of bytes required to store the entire text of the New Testament to approximately 183,000 bytes.

EXAMPLE 2

The operation of the general technique of Fig. 1 can be illustrated with respect to a few verses from Matthew, Chapter II:

"1. Now when Jesus was born in Beth-lehem of Judaea in the days of Herod the king, behold, there came wise men from the east to Jerusalem, 2. Saying, Where is he that is born King of the Jews? for we have seen his star in the east, and are come to worship him. 3. When Herod the king had heard these things, he was troubled, and all Jerusalem with him." In accordance with the invention, a dictionary is created in which each word is associated with a token. Illustratively this is accomplished by first creating a linear list of words such as set forth in Table II:

TABLE II

Now when

Jesus was with him.

The list is then sorted alphabetically so as to arrange all the words of the text in alphabetic order as shown in Table III.

TABLE III

all and and are

Jerusalem Jerusalem, when

When

Where wise with worship The alphabetized list ia then processed to eliminate duplicate entries and to generate a frequencxy count for each entry as shown in Table IV.

TABLE IV

all 1 and 2 are 1

Jerusalem 1 Jerusalem, 1 when 1 When 1 where 1 wise 1 with 1 worship 1

In the preferred embodiment of the invention, the list of words and frequency counts is then sorted by frequency count to obtain a new list in which the words are arranged in decreasing order of frequency of use; and each of these words is assigned an individual token. Because the text of Example 2 is so short, there is little need to sort the list in accordance with frequency of use and to use tokens of smaller size to represent the more frequently used words. However, as emphasized above, such a sort is useful where the size of the text is considerably longer. The individual words are then assigned tokens with increasingly greater numerical value being assigned to successive entries in the alphabetized list of words. Thus, the assignment of tokens to words in Example 2 is as set forth in Table V. TABLE V

00 0000 all 00 0001 and 00 0002 are

01 iio rusalem

010111 Jerusalem, 011001 Jesus

01 1110 Now

10 1101 when

10 1110 When

111111 where 110000 wise 110001 with 110010 worship

For this example, it is apparent that only six bits are needed to identify each different word uniquely. Obviously the number of bits can be varied depending on the number of different words to be tokenized.

Finally, the computer replaces each word in the linear list of Table II with the corresponding token as set forth in Table V to generate a tokenized text as shown in Table VI.

Table VI

011111 10 1101 01 1000 10 1011

11 0001 01 0011 For example 2, there is very little advantage in compressing the dictionary of words. For a larger text, however, in which the initial characters of many words would be the same, the dictionary would then be compressed by replacing with a number all those initial characters in a word that are the same as the initial characters of the preceding word.

Reconstruction of the original text proceeds as shown in Fig. 4 with the individual tokens being read one at a time and used to count through the dictionary until the corresponding word is located, retrieved, and provided to a suitable output.

As indicated above, the dictionary can also be used in information retrieval to indicate that a word has been used in the alphanumeric text. In this application, the use of an identifier to indicate the segment of the text in which the word is used will speed up the retrieval of that word in its context, In the case of the New Testament, a one byte identifier allows separate identification of each of the four Gospels, the Acts of the Apostles, the Apocalypse, the Pauline Epistles and the non-Pauline Epistles.

As will be apparent to those skilled in the art, numerous modifications may be made in the invention described above.

Claims

What is claimed is:

1. In a machine-implemented system for storage or transmission of text, a method for compressing text comprising the steps of: creating a dictionary that associates each different word or group of words of said text with a different token, the average number of digits required to represent said token being less than the average number of digits required to represent said word in said system, and replacing each word or group of words with the token associated by said dictionary with said word or group of words, whereby the number of digits required to represent said text is reduced.

2. The method of claim 1 wherein the text comprises words of alphanumeric symbols and punctuation.

3. The method of claim 1 wherein each word is a string of symbols, such as alphanumeric characters and punctuation, located between successive spaces in the text.

4. The method of claim 1 wherein the step of creating the dictionary comprises the steps of: ordering the words of the text in alphabetical order to form an alphabetized list, eliminating all duplicate words in the alphabetized list to form a condensed alphabetized list, and assigning different tokens to different words in the condensed alphabetized list.

5. The method of claim 4 wherein each different token has a different numeric value and the step of assigning different tokens to different words in the condensed alphabetized list comprises the step of assigning the different tokens in successive numeric order to different words in alphabetic order.

6. The method of claim 4 wherein the step of creating the dictionary further comprises the steps of: determining which words appear most frequently in the text, and assigning to the words that appear most frequently tokens that are shorter tha-n the tokens assigned to words that appear less frequently.

7. The method of claim 6 wherein the step of assigning tokens comprises the steps of assigning to the first 128 most frequently used words a token that is one byte in length and assigning to the remaining words a token that is longer than one byte.

8. The method of claim 7 wherein the first byte of the token assigned to each word has one bit position that contains a bit Indicating whether the token is one byte long or more than one byte long.

9. The method of claim 6 wherein the step of assigning tokens comprises the steps oft calculating the maximum number of one byte tokens that can be used to represent the most frequently used words if the remaining words are represented by two byte tokens, assigning one byte tokens to no more than that maximum number of most frequently used words, and assigning two byte tokens to the remaining words.

10. The method of claim 4 wherein the step of creating the dictionary further comprises the steps of: counting the duplicate entries of words in the alphabetized list to form a frequency count, sorting the condensed alphabetised list in accordance with the frequency count. for each word, and assigning to the words that appear most frequently tokens that are shorter than the tokens assigned to words that appear less frequently.

11. The method of claim 10 wherein the step of assigning tokens comprises the steps of: assigning to each word a token having two fields, the first of which is a field of fixed length that specifies the length of the second field, said tokens being assigned to said words in accordance with the frequency count for each word so that the shortest token is assigned to the word that appears most frequently in the text, the next shortest token is assigned to the word that appears next most frequently, and so forth.

12. The method of claim 11 wherein the first field has a length of four binary digits or their equivalent.

13. The method of claim 4 wherein the step of creating the dictionary further comprises the steps of: calculating the minimum number of bits required to represent each different word by a different token having that minimum number of bits, and assigning to each different word a different token having that minimum number of bite.

14. The method of claim 1 further comprising the step of compressing the dictionary by replacing the initial characters of a word that are the same as the initial characters of an immediately preceding word with a number indicating how many of said initial characters in both words are the same.

15. The method of claim 1 wherein the text is divided into a plurality of segments and the means for creating a dictionary further comprises means for providing for each different word an indicator specifying in which segments of the text that word appears.

16. A dictionary formed by the method of claim 16.

17. A dictionary formed by the method of claim 1.

18. In a machine-implemented system in which a dictionary associates each different word or group of words of a text with a different token comprised of one or more signals, a method of reconstructing the text from said signals comprising the steps of: fetching the next token from said signals, locating in the dictionary the word associated with said token, and providing said word to an output of said machine-implemented system.

19. In a machine-implemented system for storage or transmission of text, a method for compressing and reconstructing text comprising the steps of: creating a dictionary that associates each different word or group of words of said text with a different token, the average number of digits required to represent said token being less than the average number of digits required to represent said word in said system, replacing each word or group of words with the token associated by said dictionary with said word or group of words to form a compressed text in which the number of digits required to represent said text is reduced, fetching the next token from said compressed text, locating in the dictionary the word associated with said token, and providing said word to an output of said machine-implemented system.

20. The method of claim 19 wherein the text comprises words of alphanumeric symbols and punctuation.

21. The method of claim 19 wherein the step of creating the dictionary comprises the steps of: ordering the words of the text in alphabetical order to form an alphabetized list, eliminating all duplicate words in the alphabetised list to form a condensed alphabetized list, and assigning different tokens to different words in the condensed alphabetized list.

22. Apparatus for compressing text comprising: means for creating a dictionary that associates each different word or group of words of said text with a different token, the average number of digits required to represent said token being less than the average number of digits required to represent said word in said system, and means for replacing each word or group of words with the token associated by said dictionary with said word or group of words, whereby the number of digits required to represent said text is reduced.

23. The apparatus of claim 22 wherein the text comprises words of alphanumeric symbols and punctuation.

24. The apparatus of claim 22 wherein each word is a string of symbols, such as alphanumeric characters and punctuation, located between successive spaces in the text.

25. The apparatus of claim 22 wherein the means for creating the dictionary comprises: means for ordering the words of the text in alphabetical order to form an alphabetized list, means for eliminating all duplicate words in the alphabetized list to form a condensed alphabetized list, and means for assigning different tokens to different words in the condensed alphabetized list.

26. The apparatus of claim 22 wherein the means for creating the dictionary further comprises: means for determining which words appear most frequently in the text, and means for assigning to the words that appear most frequently tokens that are shorter than the tokens assigned to words that appear less frequently.

27. The apparatus of claim 22 wherein the text is divided into a plurality of segments and the means for creating a dictionary further comprises means for providing for each different word an indicator specifying in which segments of the text that word appears.