EP0160672A4 - Method and apparatus for data compression - Google Patents

Method and apparatus for data compression.

Info

Publication number
EP0160672A4
Authority
EP
European Patent Office
Prior art keywords
words
word
text
dictionary
token
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP19840903871
Other languages
German (de)
English (en)
Other versions
EP0160672A1 (fr)
Inventor
Louie Don Tague
Allen T Cobb
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TEXT SCIENCES Corp
Original Assignee
TEXT SCIENCES CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TEXT SCIENCES CORP
Publication of EP0160672A1
Publication of EP0160672A4
Legal status: Withdrawn

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H03M7/42 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code using table look-up for the coding or decoding process, e.g. using read-only memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities

Definitions

  • This relates to a method and apparatus for reducing the number of signals required to encode alphanumeric data for storage or transmission. It is particularly useful in storing large volumes of text such as a book in a computer system or transmitting such volumes on a data communication system.
  • the digraph "th" might be represented by a single eight-bit code rather than by two eight-bit codes, one of which represents a "t" and the other of which represents an "h".
  • This technique is relatively limited in the data compression it can achieve.
  • a reduction of about 40% can be achieved in the length of the binary codes required to represent the alphanumeric text.
  • Somewhat greater reductions can be achieved if careful attention is paid to the frequency with which letter pairs appear in the particular text; but the best that can be achieved is on the order of a 60% reduction.
  • a dictionary is created which assigns each different word of the alphanumeric text and the punctuation that follows it to a unique token.
  • Each word in the alphanumeric text is then replaced by the token that refers to that word in the dictionary.
  • each token is a sequence of binary digits and contains up to 16 bits (two bytes) which identify or address one such word.
  • the dictionary contains more than 65,536 words, the number of bits needed in at least some tokens will have to be greater than 16. Conversely, if the number of words in the dictionary is some power of two less than the sixteenth power, the number of bits in each token can be less than 16.
  • the dictionary can be created very rapidly using a conventional microcomputer system and the stored text can be recreated in human readable form by such a microcomputer system.
  • the number of bytes required to store the dictionary can be substantially reduced by storing the words in alphabetical order and taking advantage of the redundancy in characters that results.
  • the second of two entries contains five letters that are the same as that of the preceding entry, this can be signified by storing one character representing the number 5 and the remaining characters not common to both entries.
  • the size of the dictionary can be reduced by such techniques by a factor of about three.
  • the dictionary contains each word in the alphanumeric text, it can be used to determine if a particular word, or several words, is used in the text. Since the size of the dictionary is considerably smaller than the entire alphanumeric text, one can determine if a word is used in the text much faster by searching the dictionary than by searching the entire alphanumeric text.
  • the location of the word in the text can be specified by adding to each word in the dictionary an identifier that indicates each segment of the text in which the word appears. With this feature, it is also possible to compare the identifiers associated with different words to locate those words that appear in the same segment of the text.
  • Fig. 1 is a flow chart illustrating the general concept of the preferred embodiment of our invention.
  • Fig. 2 is a flow chart illustrating the preferred embodiment of our invention in greater detail.
  • Fig. 3 is a flow chart illustrating a detail of Fig. 2.
  • Fig. 4 is a flow chart illustrating a second feature of a preferred embodiment of our invention.
  • Fig. 5 is a block diagram depicting illustrative apparatus used with a preferred embodiment of our invention.
  • an alphanumeric text is compressed in our invention by first creating a dictionary which associates each word of the alphanumeric text with a unique token of up to sixteen bits (two bytes).
  • the pattern of ones and zeroes in sixteen bits can be used to represent any number from 0 to 65,535, that is, 65,536 different values.
  • each word is replaced by the token that refers to that word in the dictionary.
  • the size of the dictionary can be reduced by storing the words of the dictionary in alphabetical order and taking advantage of the redundancy in characters that results.
  • the length of the compressed text can be further reduced by representing the most frequently used words with tokens having a length that is less than two bytes.
  • these steps are performed by a computer such as a conventional microcomputer.
  • Specific steps for implementing the techniques of Fig. 1 in a microcomputer are set forth in Fig. 2.
  • the text of the book or other material to be compressed is converted to a linear list of words.
  • each word is considered to be all the alphanumeric symbols including punctuation between successive spaces in the text.
  • the carriage return/line feed is simply inserted every time a space or spaces is encountered in the text; and the one space immediately in front of the alphanumeric text is considered to be part of that word.
  • all the spaces except the one space immediately in front of the alphanumeric text are treated as a single word of space characters.
  • the linear list is created, it is sorted alphabetically using a conventional sort so that all the words of the text are arranged in alphabetical order.
  • the alphabetized list is then processed by the microcomputer to eliminate duplicate entries and to generate a frequency count for each entry.
  • the entire alphabetized list of words is replaced by a new condensed list which identifies each word from the original alphabetized list and specifies the number of times that word appears in the original alphabetized list.
  • this procedure is implemented as shown in Fig. 3.
  • Each word of the alphabetized list is fetched in turn by the microcomputer. A determination is made if this is a new word by comparing this word with the previously fetched word. If the two words are the same, the word in question is an old word; the frequency counter is incremented by one and the next word is fetched from the list. If the two words are different, the word in question is a new word; the old word and the contents of the frequency counter are written on the new list, the frequency counter is reset to one and the new word is stored for subsequent comparison.
  • each of the words of the condensed alphabetized list is assigned an individual token.
  • tokens having a length less than two bytes can be assigned to the more frequently used words.
  • a copy is first made of the condensed alphabetized list and the list is stored.
  • the list of words and frequency counts is then sorted by frequency count to obtain a new list in which the words are arranged in decreasing order of frequency of use.
  • one of the eight bits of a byte can be used to identify the byte as a one byte token instead of a two byte token.
  • the other seven bits of the byte can be used to provide 128 different tokens. If the byte is not identified as a one byte token, then the remaining fifteen bits of the two byte token can be used to identify up to 32,768 different words in the text. Accordingly, in this technique each of the 128 most frequently used words is assigned one of the 128 different one byte tokens; and the remaining words are assigned different two byte tokens.
  • the number of one byte tokens can be varied depending on the number of different words used in the text.
  • the maximum number of different words that can be represented by a combination of one and two byte tokens is given by x + 256(256 - x), where x is the number of one byte tokens used.
  • x must be a positive whole number less than or equal to 256. From this it follows that, where y is the number of different words in the text, the largest number of one byte tokens that can be used is the largest whole number x such that: x + 256(256 - x) ≥ y (1)
  • equation (1) is used to calculate the maximum number of one byte tokens that can be used. This number of the most frequently used words is then assigned one byte tokens, each word being assigned a different token. The remaining words in the text are then assigned two byte tokens.
  • a dictionary is created by the microcomputer by assigning the tokens to the words in successive numeric order beginning with the first word and continuing to the last.
  • the numeric order of the tokens can be ascending or descending but must be monotonic in the preferred embodiments described herein. In subsequent descriptions, it is assumed that the numeric order is ascending.
  • the words that are represented by one byte tokens are assigned to a first dictionary and the remaining words are assigned to a second.
  • the second dictionary that associates words and two byte tokens is stored so that the words are in alphabetic order. Because the first dictionary has at most 256 entries, there is usually no need to alphabetize this dictionary.
  • the words stored in this dictionary are used so often in the text, it is desirable to minimize retrieval time from this dictionary.
  • the words are stored in the order of their frequency of use in the text with the most frequently used word first.
  • the dictionaries that are stored preferably contain only the words of the dictionary and none of the tokens.
  • the words are stored in the form of ASCII-encoded symbols with one byte being used to represent each symbol. Since there are only 96 ASCII symbols, one bit of each byte is available for other purposes. This bit is used to identify the beginning of each word.
  • each word is identified by setting the eighth bit of the first ASCII character of each word to a "1" while the eighth bit of every other ASCII character in the word is set to "0".
  • the token associated with a particular word in the dictionary can be determined simply by counting the number of words from the beginning of the dictionary to the particular word in question and adding that count and the numeric value of the token associated with the first word in the list. This counting can be done simply by masking all but the eighth bit of each byte and counting the appearance of each "1" bit in that position as the computer scans each byte from the first word in the list to the word in question.
  • the computer simply counts the appearance of each "1" bit in the eighth bit position of each byte in the dictionary beginning with the first byte and ending with the byte immediately before the particular word whose token is being calculated. Since the numeric value of the token assigned to the first word in this dictionary is zero, the count is the value of the token.
  • the tokens assigned to the words of the second dictionary will commence with the binary value 1101 0010 0000 0000. Accordingly, the value of the token is determined by counting words in the same fashion as in the first dictionary and adding to the count the binary value, 1101 0010 0000 0000, associated with the first word of the second dictionary.
  • the look-up table could store the token associated with the first word beginning with each of the twenty-six letters of the alphabet, and the counting process could begin with the first word that had the same first letter as the word whose token is to be calculated.
  • the microcomputer then compresses the alphanumeric text by reading each word from the linear list that was initially generated, looking up the word in the first or second dictionary and replacing the word in the linear list with the token obtained from the dictionary.
  • a search is first made through the words of the first dictionary, testing the individual ASCII codes of the words of the first dictionary to determine if they are the same as the word that is to be replaced by a token and counting each test that fails. If a match is found, the count of failed tests is the value of the token provided the value of the token associated with the first word is zero. If no match is found in the first dictionary, the computer moves on to the second dictionary.
  • the look-up table is used to provide a starting point for the search through the dictionary.
  • the first letter of the word whose token is to be determined can be used to locate in the look-up table the first word that begins with that letter.
  • the table will supply the value of the token for that word.
  • a search can then be made in alphabetic order through the different words that begin with that letter, testing the individual ASCII codes of each word to determine if they are the same as the word in question.
  • a counter is incremented by one.
  • its token is calculated by adding the contents of the counter to the value obtained from the look-up table of the token associated with the first word that begins with the same first letter. In this way, the entire linear list of words is replaced by a list of tokens to form a tokenized text.
  • the second dictionary may be compressed by coding techniques. Because the words of this dictionary are in alphabetical order, almost all words will have at least one initial character that is in common with the initial character or characters in the preceding word in the dictionary. In the case where at least two initial characters in a second word are the same as the corresponding initial characters in the first immediately preceding word, it becomes advantageous to represent that second word by (1) a number that identifies the number of initial characters in the first word that are the same and (2) a string of characters which are the balance of characters in the second word that are different from those in the first word. Thus individual words in the dictionary are stored using a number to specify the number of initial characters that are the same as those in the preceding entry and the ASCII codes for the remaining characters that are different.
  • the number is stored as a binary number that can be used immediately in retrieving the initial characters of the word.
  • the words "storage," "store" and "stored" may appear successively in the dictionary.
  • the word "store" is represented by the binary number for "4" and the ASCII-encoded character for "e" because the first four letters of the word are found in the immediately preceding word while the "e" is not; and the word "stored" is represented by the binary number for "5" and the ASCII-encoded character for "d" because the first five letters of the word are found in the immediately preceding word while the "d" is not.
  • the tokenized text, the dictionaries, the look-up table and a computer program to read the tokenized text are then stored on any appropriate media such as tape, disk or ROM. Alternatively, this same information may be transmitted from one location to another by a data communication system. Because of the significant data compression achieved in practicing our invention, it is possible to store the entire text of full sized books on one or two 5-1/4" (13 cm) floppy disks. In general, the length of the text can be reduced by about 60% or 70% by the substitution of tokens for words. A further reduction of 25% and in some cases as much as 50% can be achieved by the use of one byte tokens for the more frequently used words in the text. Thus an overall reduction of about 75% in text length is readily achievable in practicing the invention.
  • the dictionary obviously adds to the length of the text.
  • the length of the second dictionary can be minimized by using numeric codes as set forth above to represent identical initial characters in successive words. This reduces the length of the dictionary by a factor of about three. Illustrations of the amount of compression that can be achieved with the invention are set forth in Example 1 below. Similar reductions in the channel transmission capacity required to transmit such text can also be achieved with the practice of our invention.
  • A flow chart illustrating the reconstruction by a computer of the original alphanumeric text from the tokenized text is set forth in Fig. 4.
  • each token is fetched in turn by the computer which then searches one of the dictionaries to find the word associated with the token.
  • the computer simply loads the binary value of the token into a counter and, commencing with the most frequently used word, successively reads the words in the first dictionary, decrementing the count by one for every byte that has a "1" bit in the eighth bit position until the value in the counter is zero.
  • the next word to be read is the word represented by the token initially loaded into the counter.
  • the computer advantageously uses the look up table that associates tokens with the first word beginning with each letter of the alphabet.
  • the computer simply scans the look-up table in reverse order subtracting the values of the tokens in the table from the value of the token that is to be converted to text.
  • the computer has reached the first word that begins with the same letter as that of the word represented by the token. Accordingly, the computer subtracts this token value from the value of the token to be converted to text and begins the same process of reading the bytes of the different words that begin with that letter.
  • the computer decrements the count by one until the count reaches zero, at which point the next word to be tested is the word identified by the token. Whether retrieved from the first dictionary or the second, the word is then provided to the computer output which may be a display, a printer or the like; and the computer moves on to the next token.
  • each such computer comprises a processor 10, first and second memories 20, 30, a keyboard 40 and a cathode ray tube (CRT) 50.
  • the apparatus may also include a printer 60 and communication interface 70.
  • the devices are interconnected as shown by a data bus 80 and controlled by signal lines 90 from microprocessor 10.
  • the memories may be addressed by address lines 100.
  • the configuration shown in Fig. 5 will be recognized as a conventional microcomputer organization.
  • the program to create the dictionary and tokenize the alphanumeric text may advantageously be stored in the first memory which may be a read only memory (ROM). If the same device is also used to reconstruct the alphanumeric text from the tokenized text, that program may also be stored in memory 20.
  • the tokenized text that is created, along with the dictionaries and look-up tables, is typically stored in memory 30; and the reconstruction program may also be stored in memory 30 if it is not available in memory 20.
  • the tokenized text, dictionaries, look-up tables and reconstruction program may also be transmitted by means of communication interface 70 to another microcomputer at a remote location.
  • memory 30 is a programmable read only memory (PROM), a magnetic tape or a floppy disk drive because the capacity of such devices is generally large enough to accommodate the entire text of a book in a PROM of reasonable size or a small number of floppy disks.
  • an appropriate device (not shown) must be used to record the tokenized text, dictionaries, look-up tables and reconstruction program in the PROM.
  • Such devices are well known.
  • the significantly larger capacity of fixed disk drives or large ROM boards can advantageously be used in the practice of the invention.
  • the invention is useful in compressing alphanumeric text for data storage or transmission. Because reconstruction of the original text can be performed expeditiously, the compressed data can then be used in any application for which the original text might have been used. In addition, because the compressed data is meaningless without the dictionary, one can provide for secure storage and/or transmission of alphanumeric text by generating the tokenized text and dictionary and then separating them for purposes of storage and transmission.
  • the dictionary contains each word of the alphanumeric text but is considerably shorter, the dictionary is also a useful tool in information retrieval. In particular, one can readily determine if a particular word is used in the alphanumeric text simply by scanning the dictionary. Additional advantages can be obtained by adding an identifier to each word in the dictionary which specifies each segment of the text in which the word appears. For example, the identifier might be one byte long and each of the eight bit positions in the byte could be associated with one of eight segments of the text. For this example, the presence of a 1-bit in any of the eight bit positions of that byte would indicate that the associated word was located in the corresponding segment of the text.
  • Use of such an identifier will greatly speed retrieval of the alphanumeric text surrounding the word in question because there is no need to search segments in which the word does not appear. Moreover, by comparing the individual bits in the identifiers associated with different words, one can determine if the words are used in the same segment of the text. Obviously, the size of the identifier can be varied as needed to locate word usage more precisely.
  • An alternative approach is to use tokens having two fields, the first of which is a field of fixed length that specifies the length of the second field.
  • the tokens are assigned to the words strictly in accordance with the frequency count for each word so that the shortest token is assigned to the word that appears most frequently in the text, the next shortest token is assigned to the word that appears next most frequently, and so forth.
  • the dictionary is stored in frequency count order with the most frequent words being stored at the beginning of the dictionary.
  • a token can be as long as twenty bits.
  • the frequency distribution of the words is a very steep curve, as it often is, the average number of bits required to represent each word in the text is significantly reduced, as in the case of Example 1 below.
  • tokenized text is stored using a token having two fields
  • the computer reads four bits from the first field list, determines from these four bits the number of bits to read from the second field list, reads these bits, and then locates the alphanumeric word associated with such bits by counting words from the beginning of the dictionary in which words are stored in frequency count order.
  • the most frequently used word would be represented by 0000 in the first list and zero bits in the second list; the next two most frequently used words by 0001 in the first list and one bit in the second list; the next four words by 0010 in the first list and two bits in the second list; and so on.
  • if the computer reads 0000 in the first list, these bits indicate there is no entry in the second list and accordingly the computer retrieves the most frequently used word, which is the first word in the dictionary.
  • if it reads 0001 in the first list, it reads the next bit in the second list and retrieves either the second or third word in the dictionary depending on whether that bit is a zero or a one.
  • the invention can also be extended to the storage of groups of words (i.e., phrases). Common phrases will be recognized by all. Phrases such as "of the", "and the", and "to the" can be expected to occur with considerable frequency in almost all English language alphanumeric text. Such phrases can be automatically assigned a place in the dictionary and one token can be provided for each appearance of one such phrase.
  • phrases can be identified simply by scanning the alphanumeric text and comparing the words with a subset of the most frequently used words. For example, the 100 most frequently used words might constitute this subset.
  • phrases of the most frequently used words can be assembled simply by testing each word of the text in succession to determine if it is one of the most frequently used words. If it is not, the next word is fetched. If the word is, it is stored along with any immediately preceding words that are on the list of most frequently used words. When a word is finally reached that is not on the list of most frequently used words, the stored words are added to a list of phrases.
  • the stored list of phrases is sorted in alphabetical order, duplicates are eliminated and a frequency count of the phrases is made.
  • tokens are assigned to these phrases beginning with the most frequently used phrases, and these tokens are then substituted for the phrases in the alphanumeric text before any other tokens are assigned. From the standpoint of the dictionary and the tokenized text, it makes no difference whether the token represents one word or a group of words. Accordingly, the original alphanumeric text can be reconstructed simply by following the process of Fig. 4.
  • a dictionary is created in which each word is associated with a token. Illustratively this is accomplished by first creating a linear list of words such as set forth in Table II:
  • the list is then sorted alphabetically so as to arrange all the words of the text in alphabetic order as shown in Table III.
  • the list of words and frequency counts is then sorted by frequency count to obtain a new list in which the words are arranged in decreasing order of frequency of use; and each of these words is assigned an individual token. Because the text of Example 2 is so short, there is little need to sort the list in accordance with frequency of use and to use tokens of smaller size to represent the more frequently used words. However, as emphasized above, such a sort is useful where the size of the text is considerably longer.
  • the individual words are then assigned tokens with increasingly greater numerical value being assigned to successive entries in the alphabetized list of words.
  • TABLE V
  • Reconstruction of the original text proceeds as shown in Fig. 4 with the individual tokens being read one at a time and used to count through the dictionary until the corresponding word is located, retrieved, and provided to a suitable output.
  • the dictionary can also be used in information retrieval to indicate that a word has been used in the alphanumeric text.
  • the use of an identifier to indicate the segment of the text in which the word is used will speed up the retrieval of that word in its context. In the case of the New Testament, a one byte identifier allows separate identification of each of the four Gospels, the Acts of the Apostles, the Apocalypse, the Pauline Epistles and the non-Pauline Epistles.
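
The trade-off between one byte and two byte tokens described above can be checked numerically. The following is a minimal sketch, assuming that the condition on x is that the capacity x + 256(256 - x) must be at least y, the number of different words. The function names `capacity` and `max_one_byte_tokens` are hypothetical and do not appear in the patent.

```python
def capacity(x: int) -> int:
    """Words representable with x one byte tokens.

    The remaining 256 - x byte values each begin a two byte token,
    and each of those can be followed by any of 256 second bytes.
    """
    return x + 256 * (256 - x)


def max_one_byte_tokens(y: int) -> int:
    """Largest x in 0..256 such that capacity(x) >= y (assumed condition)."""
    if y > capacity(0):
        raise ValueError("more than 65,536 different words need longer tokens")
    # capacity(x) = 65536 - 255 * x, so x <= (65536 - y) / 255.
    return min(256, (65536 - y) // 255)


if __name__ == "__main__":
    for y in (100, 5000, 11986, 40000):
        x = max_one_byte_tokens(y)
        print(y, "different words ->", x, "one byte tokens")
```

Under this assumption, a text with 11,986 different words allows 210 one byte tokens, so the first two byte token would begin with the byte value 210 (binary 1101 0010), which agrees with the starting value quoted above for the second dictionary.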
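
The word-start marker and the counting procedure described for the dictionaries can be sketched as follows. This is an illustration only: it assumes the "eighth bit" is the high bit (0x80) of each byte, that every word is a non-empty ASCII string, and that tokens are consecutive integers starting at `first_token`. The names `pack_dictionary`, `iter_words`, `token_for_word` and `word_for_token` are hypothetical.

```python
def pack_dictionary(words):
    """Store words as ASCII bytes with the high bit set on each first byte."""
    out = bytearray()
    for word in words:
        data = word.encode("ascii")
        out.append(data[0] | 0x80)   # mark the beginning of the word
        out.extend(data[1:])         # remaining bytes keep the high bit clear
    return bytes(out)


def iter_words(packed):
    """Yield the stored words in order, using the high bit as the delimiter."""
    start = 0
    for i in range(1, len(packed) + 1):
        if i == len(packed) or packed[i] & 0x80:
            yield bytes(b & 0x7F for b in packed[start:i]).decode("ascii")
            start = i


def token_for_word(packed, word, first_token=0):
    """Token = value of the first token plus the number of words skipped."""
    for count, candidate in enumerate(iter_words(packed)):
        if candidate == word:
            return first_token + count
    raise KeyError(word)


def word_for_token(packed, token, first_token=0):
    """Inverse lookup: count words until the requested token is reached."""
    for count, candidate in enumerate(iter_words(packed)):
        if first_token + count == token:
            return candidate
    raise KeyError(token)


if __name__ == "__main__":
    packed = pack_dictionary([" the", " of", " and"])
    print(token_for_word(packed, " of"))
    print(word_for_token(packed, 0xD202, first_token=0xD200))
```

Because no tokens are stored, the dictionary costs only one byte per character. The price is a linear scan per lookup, which is the cost the look-up table keyed on the first letter is meant to reduce.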
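
The prefix coding of the alphabetized second dictionary can be illustrated with the "storage", "store", "stored" example from the text. The sketch below is a simplification: it codes every entry as a (shared prefix length, remaining characters) pair, whereas the text applies the scheme only when at least two initial characters are shared, and it uses Python tuples rather than a packed binary layout. The names `front_encode` and `front_decode` are hypothetical.

```python
def front_encode(sorted_words):
    """Return (shared_prefix_length, suffix) pairs for an alphabetized list."""
    encoded = []
    previous = ""
    for word in sorted_words:
        n = 0
        limit = min(len(previous), len(word))
        while n < limit and previous[n] == word[n]:
            n += 1                      # count characters shared with the previous entry
        encoded.append((n, word[n:]))   # keep only the characters that differ
        previous = word
    return encoded


def front_decode(encoded):
    """Rebuild the word list by reusing the prefix of the previous entry."""
    words = []
    previous = ""
    for n, suffix in encoded:
        word = previous[:n] + suffix
        words.append(word)
        previous = word
    return words


if __name__ == "__main__":
    entries = front_encode(["storage", "store", "stored"])
    print(entries)                # [(0, 'storage'), (4, 'e'), (5, 'd')]
    print(front_decode(entries))  # ['storage', 'store', 'stored']
```

Decoding must proceed from the start of the list (or from a stored full entry), since each word depends on the one before it.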
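
The alternative two-field token can be sketched in the same way. The code below assumes a four-bit first field holding the length n of the second field, with the 2^n words of each group numbered by the second field, so that the most frequent word (rank 0) is 0000 with no second field, ranks 1 and 2 use 0001 plus one bit, and ranks 3 to 6 use 0010 plus two bits. Bit strings stand in for the two packed lists purely for readability, and the names `encode_rank` and `decode_ranks` are hypothetical.

```python
def encode_rank(rank):
    """Map a frequency rank (0 = most frequent) to (first field, second field)."""
    n = (rank + 1).bit_length() - 1       # number of bits in the second field
    if n > 15:
        raise ValueError("rank does not fit a four-bit length field")
    value = rank + 1 - (1 << n)           # position of the word inside its group
    second = format(value, "b").zfill(n) if n else ""
    return format(n, "04b"), second


def decode_ranks(first_list, second_list):
    """Read four bits from the first list, then that many bits from the second."""
    ranks = []
    pos = 0
    for i in range(0, len(first_list), 4):
        n = int(first_list[i:i + 4], 2)
        value = int(second_list[pos:pos + n], 2) if n else 0
        pos += n
        ranks.append((1 << n) - 1 + value)  # words before this group, plus offset
    return ranks


if __name__ == "__main__":
    ranks = [0, 1, 2, 3, 6, 0]
    pairs = [encode_rank(r) for r in ranks]
    first = "".join(p[0] for p in pairs)
    second = "".join(p[1] for p in pairs)
    print(pairs)
    print(decode_ranks(first, second) == ranks)
```

Each decoded rank is then resolved by counting words from the beginning of the dictionary stored in frequency count order.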

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Health & Medical Sciences
  • Artificial Intelligence
  • Audiology, Speech & Language Pathology
  • Computational Linguistics
  • General Health & Medical Sciences
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Compression, Expansion, Code Conversion, And Decoders
EP19840903871 1983-10-19 1984-10-17 Method and apparatus for data compression. Withdrawn EP0160672A4 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US54328683A 1983-10-19 1983-10-19
US543286 1990-06-25

Publications (2)

Publication Number Publication Date
EP0160672A1 (fr) 1985-11-13
EP0160672A4 (fr) 1986-05-12

Family

ID=24167358

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19840903871 Withdrawn EP0160672A4 (fr) 1983-10-19 1984-10-17 Method and apparatus for data compression.

Country Status (5)

Country Link
EP (1) EP0160672A4 (fr)
JP (1) JPS61500345A (fr)
CA (1) CA1226369A (fr)
IT (1) IT1180100B (fr)
WO (1) WO1985001814A1 (fr)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1986003039A1 (fr) * 1984-11-08 1986-05-22 Datran Corporation System for symbolic identification of words and phrases
US4758955A (en) * 1985-07-19 1988-07-19 Carson Chen Hand-held spelling checker and method for reducing redundant information in the storage of textural material
US4949302A (en) * 1986-11-17 1990-08-14 International Business Machines Corporation Message file formation for computer programs
US4843389A (en) * 1986-12-04 1989-06-27 International Business Machines Corp. Text compression and expansion method and apparatus
WO1988009586A1 (fr) * 1987-05-25 1988-12-01 Megaword International Pty. Ltd. Method of processing a text in order to store the text in memory
US5754847A (en) * 1987-05-26 1998-05-19 Xerox Corporation Word/number and number/word mapping
US5560037A (en) * 1987-12-28 1996-09-24 Xerox Corporation Compact hyphenation point data
US5099426A (en) * 1989-01-19 1992-03-24 International Business Machines Corporation Method for use of morphological information to cross reference keywords used for information retrieval
DE3914589A1 (de) * 1989-05-03 1990-11-08 Bosch Gmbh Robert Method for data reduction of street names
US5325091A (en) * 1992-08-13 1994-06-28 Xerox Corporation Text-compression technique using frequency-ordered array of word-number mappers
CA2125337A1 (fr) * 1993-06-30 1994-12-31 Marlin Jay Eller Method and system for searching compressed data
US6023679A (en) * 1994-10-04 2000-02-08 Amadeus Global Travel Distribution Llc Pre- and post-ticketed travel reservation information management system
GB2305746B (en) * 1995-09-27 2000-03-29 Canon Res Ct Europe Ltd Data compression apparatus
AU1082397A (en) * 1995-12-14 1997-07-03 Motorola, Inc. Apparatus and method for storing and presenting text
US6012062A (en) * 1996-03-04 2000-01-04 Lucent Technologies Inc. System for compression and buffering of a data stream with data extraction requirements
US5883906A (en) * 1997-08-15 1999-03-16 Advantest Corp. Pattern data compression and decompression for semiconductor test system
DE19854179A1 (en) * 1998-11-24 2000-05-25 Siemens Ag Method and arrangement for compressing or expanding character strings by a data processing device
CN1732426A (en) * 2002-12-27 2006-02-08 诺基亚公司 Predictive text entry and data compression method for a mobile communication terminal
DE102008022184A1 (en) * 2008-03-11 2009-09-24 Navigon Ag Method for creating an electronic address database, method for searching an electronic address database, and navigation device with an electronic address database
KR101750646B1 (en) 2013-03-22 2017-06-23 후지쯔 가부시끼가이샤 Compression device, compression method, decompression device, decompression method, and information processing system
JP2020061641A (en) * 2018-10-09 2020-04-16 富士通株式会社 Encoding program, encoding method, encoding device, decoding program, decoding method, and decoding device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3717851A (en) * 1971-03-03 1973-02-20 Ibm Processing of compacted data
EP0083420A2 (en) * 1981-12-31 1983-07-13 International Business Machines Corporation Whole-word encoding for information processing

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3344405A (en) * 1964-09-30 1967-09-26 Ibm Data storage and retrieval system
GB1516310A (en) * 1974-10-29 1978-07-05 Data Recording Instr Co Information indexing and retrieval processes
US4270182A (en) * 1974-12-30 1981-05-26 Asija Satya P Automated information input, storage, and retrieval system
US4189781A (en) * 1977-01-25 1980-02-19 International Business Machines Corporation Segmented storage logging and controlling
JPS55108075A (en) * 1979-02-09 1980-08-19 Sharp Corp Data retrieval system
US4356549A (en) * 1980-04-02 1982-10-26 Control Data Corporation System page table apparatus
US4358826A (en) * 1980-06-30 1982-11-09 International Business Machines Corporation Apparatus for enabling byte or word addressing of storage organized on a word basis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of WO8501814A1 *

Also Published As

Publication number Publication date
IT8468039A1 (it) 1986-04-19
JPS61500345A (ja) 1986-02-27
WO1985001814A1 (fr) 1985-04-25
CA1226369A (fr) 1987-09-01
IT1180100B (it) 1987-09-23
EP0160672A1 (fr) 1985-11-13
IT8468039A0 (it) 1984-10-19

Similar Documents

Publication Publication Date Title
WO1985001814A1 (en) Data compression method and apparatus
EP0083393B1 (en) Method for compressing information and an apparatus for compressing English text
US4384329A (en) Retrieval of related linked linguistic expressions including synonyms and antonyms
JP3234104B2 (en) Method and system for searching compressed data
EP0584992B1 (en) Text compression technique using a frequency-ordered array of numbers representing words
US4843389A (en) Text compression and expansion method and apparatus
US5551049A (en) Thesaurus with compactly stored word groups
EP0277356B1 (en) Spelling error correction system
US3694813A (en) Method of achieving data compaction utilizing variable-length dependent coding techniques
US5787386A (en) Compact encoding of multi-lingual translation dictionaries
EP0293161B1 (en) Character processing system with spell-check functions
US5845238A (en) System and method for using a correspondence table to compress a pronunciation guide
US4542477A (en) Information retrieval device
CN86105459A (en) Input processing system
GB2057821A (en) Communication method and system
US5560037A (en) Compact hyphenation point data
US5815096A (en) Method for compressing sequential data into compression symbols using double-indirect indexing into a dictionary data structure
US5225833A (en) Character encoding
Cooper et al. Text compression using variable‐to fixed‐length encodings
US5915041A (en) Method and apparatus for efficiently decoding variable length encoded data
JP3071570B2 (en) Apparatus and method for determining dictionary data for compound target words
JPH056398A (en) Document registration device and document retrieval device
US6731229B2 (en) Method to reduce storage requirements when storing semi-redundant information in a database
JPH0546357A (en) Text data compression method and decompression method
JPH0546358A (en) Text data compression method

Legal Events

Date Code Title Description
PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19850620

AK Designated contracting states

Designated state(s): BE CH DE FR GB LI NL SE

A4 Supplementary search report drawn up and despatched

Effective date: 19860512

STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 19890503

RIN1 Information on inventor provided before grant (corrected)

Inventor name: COBB, ALLEN, T.

Inventor name: TAGUE, LOUIE, DON