CA1226369A

CA1226369A - Method and apparatus for data compression

Info

Publication number: CA1226369A
Application number: CA000465602A
Authority: CA
Inventors: Louie D. Tague; Allen T. Cobb
Original assignee: TEXT SCIENCES Corp
Current assignee: TEXT SCIENCES Corp
Priority date: 1983-10-19
Filing date: 1984-10-17
Publication date: 1987-09-01
Also published as: EP0160672A4; JPS61500345A; IT8468039A1; WO1985001814A1; IT8468039A0; IT1180100B; EP0160672A1

Abstract

A method and apparatus for compressing alphanumeric data that is stored or transmitted in the form of digital codes. A dictionary is created which assigns each word of the alphanumeric text and the punctuation that follows it to a unique address or token of, illustratively, up to 16 bits (two bytes). Each word in the alphanumeric text is then replaced by the address that refers to that word in the dictionary. Because the dictionary can contain up to 2^16 = 65,536 entries, it is more than adequate for the storage of the words associated with almost any book. Because only two bytes of information are needed to address any one of these 65,000 words, replacement of each word of text with two bytes of address information reduces the average number of digits required to store the text by a factor of about three. Further reductions of 25% or more in the length of the compressed text can be achieved in most cases by representing the most frequently used words with tokens that are shorter than two bytes in length. The number of bytes required to store the dictionary can be substantially reduced by storing the words in alphabetical order and taking advantage of the redundancy in characters that results. Thus, if the second of two entries contains five letters that are the same as those of the preceding entry, this can be signified by storing one character representing the number 5 and the remaining characters not common to both entries.

Description

METHOD AND APPARATUS FOR DATA COMPRESSION

BACKGROUND OF THE INVENTION

This relates to a method and apparatus for reducing the number of signals required to encode alphanumeric data for storage or transmission. It is particularly useful in storing large volumes of text such as a book in a computer system or transmitting such volumes on a data communication system.
Prior art techniques for encoding alphanumeric text usually rely on the substitution of an eight bit binary code, commonly called a byte, for each character of the alphanumeric text. One such code comprises seven bits which define the character in accordance with the American Standard Code for Information Interchange (ASCII) and an eighth bit that is either used as a parity bit or is set to 0. Tables of these codes are set forth, for example, at pages 125 and 126 of Ralston, et al., Encyclopedia of _, 2nd Ed. (Van Nostrand Reinhold, 1983).
However, the use of eight bits to represent each character in large volumes of alphanumeric text severely taxes the limits of present-day microcomputers and communications systems. For example, there are over 170,000 words in the New Testament and about 1,036,000 separate characters. Accordingly, over one megabyte of data storage is required to store the New Testament. Even with present-day storage technology, requirements of this sort make it relatively expensive to store full text books and the like in modern computer systems. Likewise, it is relatively expensive and time-consuming to transmit coded quantities of text as large as a book.


In an effort to reduce data storage and transmission requirements, standard codes have been modified so as to use certain eight-bit codes to represent the more frequent combinations of two letters. Thus, the digraph "th" might be represented by a single eight-bit code rather than by two eight-bit codes, one of which represents a "t" and the other of which represents an "h". This technique, however, is relatively limited in the data compression it can achieve. Typically, a reduction of about 40% can be achieved in the length of the binary codes required to represent the alphanumeric text. Somewhat greater reductions can be achieved if careful attention is paid to the frequency with which letter pairs appear in the particular text; but the best that can be achieved is on the order of a 60% reduction.

SUMMARY OF THE INVENTION

We have devised a technique for significantly improving the amount of data compression that can be achieved when storing alphanumeric data in the form of digital codes. In accordance with our invention, a dictionary is created which assigns each different word of the alphanumeric text and the punctuation that follows it to a unique token. Each word in the alphanumeric text is then replaced by the token that refers to that word in the dictionary. Illustratively, each token is a sequence of binary digits and contains up to 16 bits (two bytes) which identify or address one such word. Accordingly, the dictionary can contain up to 2^16 = 65,536 entries, which is more than adequate for the storage of the words associated with almost any book. Because only two bytes of information are needed to identify any one of these 65,536 words, replacement of each word of text with two bytes of information reduces the average number of digits required to store the text by a factor of about three. If the dictionary contains more than 65,536 words, the number of bits needed in at least some tokens will have to be greater than 16. Conversely, if the number of words in the dictionary is some power of two less than the sixteenth power, the number of bits in each token can be less than 16. Advantageously, the dictionary can be created very rapidly using a conventional microcomputer system and the stored text can be recreated in human readable form by such a microcomputer system.
The number of bytes required to store the dictionary can be substantially reduced by storing the words in alphabetical order and taking advantage of the redundancy in characters that results. Thus, if the second of two entries contains five letters that are the same as those of the preceding entry, this can be signified by storing one character representing the number 5 and the remaining characters not common to both entries. Because of the large amount of redundancy that is present in such a dictionary because of the use of plurals, possessives, cognates and entries that are identical except for punctuation, the size of the dictionary can be reduced by such techniques by a factor of about three.
Further reductions in the length of the compressed text can be achieved in most cases by representing the most frequently used words with tokens that are shorter than two bytes in length. Because only a small number of the most frequently used words ordinarily account for more than half of all the words in the text, the use of a one byte token, for example, instead of a two byte token, for the most frequently used words can reduce the storage requirements for the text by at least an additional 25% and in many cases by considerably more than 50%.

The foregoing techniques achieve significant data compression while maintaining the boundaries between words. In tests performed on the King James Version of the New Testament, they made it possible to store the 1,036,000 characters of the New Testament in approximately 220,000 bytes using one compression method and 183,000 bytes using another. In a test performed on approximately 900,000 characters of training material for lawyers, they permitted the text to be compressed to less than 150,000 bytes.
Because the dictionary contains each word in the alphanumeric text, it can be used to determine if a particular word, or several words, is used in the text. Since the size of the dictionary is considerably smaller than the entire alphanumeric text, one can determine if a word is used in the text much faster by searching the dictionary than by searching the entire alphanumeric text. In addition, the location of the word in the text can be specified by adding to each word in the dictionary an identifier that indicates each segment of the text in which the word appears. With this feature, it is also possible to compare the identifiers associated with different words to locate those words that appear in the same segment of the text.
BRIEF DESCRIPTION OF DRAWINGS

These and other objects, features and advantages of our invention will be more readily apparent from the following detailed description of preferred embodiments of the invention in which:
Fig. 1 is a flow chart illustrating the general concept of the preferred embodiment of our invention;
Fig. 2 is a flow chart illustrating the preferred embodiment of our invention in greater detail;
Fig. 3 is a flow chart illustrating a detail of Fig. 2;
Fig. 4 is a flow chart illustrating a second feature of a preferred embodiment of our invention; and
Fig. 5 is a block diagram depicting illustrative apparatus used with a preferred embodiment of our invention.

DESCRIPTION OF PREFERRED EMBODIMENT OF THE INVENTION
As shown in Fig. 1, an alphanumeric text is compressed in our invention by first creating a dictionary which associates each word of the alphanumeric text with a unique token of up to sixteen bits (two bytes). As is well known, the pattern of ones and zeros in sixteen bits can be used to represent any number from 0 to 65,535. To form a compressed text, each word is replaced by the token that refers to that word in the dictionary. Optionally, the size of the dictionary can be reduced by storing the words of the dictionary in alphabetical order and taking advantage of the redundancy in characters that results.
Advantageously, the length of the compressed text can be further reduced by representing the most frequently used words with tokens having a length that is less than two bytes. Preferably, these steps are performed by a computer such as a conventional microcomputer.
Specific steps for implementing the technique of Fig. 1 in a microcomputer are set forth in Fig. 2. First, the text of the book or other material to be compressed is converted to a linear list of words. In effect, this requires that a carriage return/line feed be inserted after each word of the text. Conveniently, for this purpose each word is considered to be all the alphanumeric symbols, including punctuation, between successive spaces in the text. Thus, the carriage return/line feed is simply inserted every time a space or spaces is encountered in the text; and the one space immediately in front of the alphanumeric text is considered to be part of that word. Where multiple spaces are found between words, all the spaces except the one space immediately in front of the alphanumeric text are treated as a single word of space characters. After the linear list is created, it is sorted alphabetically using a conventional sort so that all the words of the text are arranged in alphabetical order.
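The splitting rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `linear_word_list` and the use of a regular expression are not part of the disclosure, and only the space character is treated as a separator.

```python
import re

def linear_word_list(text):
    """Split text into words that each carry their single leading space."""
    # Alternatives are tried left to right at each position:
    #   " +(?= [^ ])"  surplus spaces, stopping one space short of the next word
    #   " ?[^ ]+"      a word (letters plus punctuation) with at most one leading space
    #   " +"           spaces left over at the very end of the text
    return re.findall(r" +(?= [^ ])| ?[^ ]+| +", text)

sample = "In  the beginning,   God"
words = linear_word_list(sample)
# Surplus spaces become entries of their own, so joining the list
# reproduces the original text exactly.
restored = "".join(words)
alphabetized = sorted(words)  # the conventional sort that follows
```

Because every space ends up either attached to a word or in a run of its own, no separate record of word boundaries is needed to rebuild the text.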
The alphabetized list is then processed by the microcomputer to eliminate duplicate entries and to generate a frequency count for each entry. Thus, the entire alphabetized list of words is replaced by a new condensed list which identifies each word from the original alphabetized list and specifies the number of times that word appears in the original alphabetized list. Illustratively, this procedure is implemented as shown in Fig. 3. Each word of the alphabetized list is fetched in turn by the microcomputer. A determination is made if this is a new word by comparing this word with the previously fetched word. If the two words are the same, the word in question is an old word; and the frequency counter is incremented by one and the next word is fetched from the list. If the two words are different, the word in question is a new word and the old word and the contents of the frequency counter are written on the new list, the frequency counter is reset to one and the new word is stored for subsequent comparison.
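The comparison loop of Fig. 3 might look as follows in Python. The function name `condense` and the representation of the condensed list as (word, count) pairs are assumptions made for this sketch.

```python
def condense(sorted_words):
    """Collapse an alphabetized word list into (word, frequency) pairs."""
    condensed = []
    previous = None   # the previously fetched word
    count = 0         # the frequency counter
    for word in sorted_words:
        if word == previous:
            count += 1                      # old word: bump the counter
        else:
            if previous is not None:
                condensed.append((previous, count))  # write old word and count
            previous = word                 # store new word for later comparison
            count = 1                       # reset the counter to one
    if previous is not None:
        condensed.append((previous, count))  # flush the final word
    return condensed

pairs = condense(sorted([" the", " of", " the", " and", " the"]))
```

The loop relies on the list being sorted, since only adjacent entries are compared.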
To create a dictionary, each of the words of the condensed alphabetized list is assigned an individual token. However, in order to reduce storage requirements, it is desirable to assign tokens having a length less than two bytes to the more frequently used words using any one of several techniques. For example, one byte tokens can be assigned to the most frequently used words. To do this, a copy is first made of the condensed alphabetized list and the list is stored. The list of words and frequency counts is then sorted by frequency count to obtain a new list in which the words are arranged in decreasing order of frequency of use. In one technique, one of the eight bits of a byte can be used to identify the byte as a one byte token instead of a two byte token. In such case, the other seven bits of the byte can be used to provide 128 different tokens. If the byte is not identified as a one byte token, then the remaining fifteen bits of the two byte token can be used to identify up to 32,768 different words in the text.
Accordingly, in this technique each of the 128 most frequently used words is assigned one of the 128 different one byte tokens, and the remaining words are assigned different two byte tokens.
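A minimal sketch of this flag-bit scheme is given below. The text does not say which bit serves as the flag or which polarity it has, so the choice here (high bit clear for a one byte token, high bit set for the first byte of a two byte token) is an assumption, as are the function names and the use of a frequency rank as input.

```python
def encode_rank(rank):
    """Encode a word's frequency rank (0 = most frequent) as 1 or 2 bytes."""
    if rank < 128:
        return bytes([rank])              # flag bit 0, seven bits of token
    k = rank - 128
    if k >= 1 << 15:
        raise ValueError("more than 128 + 32,768 different words")
    return bytes([0x80 | (k >> 8), k & 0xFF])  # flag bit 1, fifteen bits of token

def decode_ranks(data):
    """Recover the sequence of ranks from a stream of mixed-length tokens."""
    ranks, i = [], 0
    while i < len(data):
        if data[i] & 0x80:                # two byte token
            ranks.append(128 + (((data[i] & 0x7F) << 8) | data[i + 1]))
            i += 2
        else:                             # one byte token
            ranks.append(data[i])
            i += 1
    return ranks

stream = b"".join(encode_rank(r) for r in [0, 127, 128, 500])
```

The flag bit lets a reader of the stream decide, from the first byte alone, how many bytes belong to the current token.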
Alternatively, the number of one byte tokens can be varied depending on the number of different words used in the text. In particular, it can be shown that the maximum number of different words that can be represented by a combination of one and two byte tokens is given by x + 256 (256 - x) where x is the number of one byte tokens used. Obviously x must be a positive whole number less than or equal to 256. From this it follows that where y is the number of different words in the text, the largest number of one byte tokens that can be used is the largest whole number x such that:

$$x \le \frac{256^2 - y}{255} \qquad (1)$$

For example, if there are 12,000 different words in the text, x = 209. Thus, the 209 most frequently used words can be represented by 209 one byte tokens and the remaining 11,791 words are represented by two byte tokens.
Accordingly, when using this technique, equation (1) is used to calculate the maximum number of one byte tokens that can be used. This number of the most frequently used words is then assigned one byte tokens, each word being assigned a different token. The remaining words in the text are then assigned two byte tokens.
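The arithmetic of equation (1) can be checked with a short Python sketch. The function names are hypothetical; the capacity formula x + 256 (256 - x) is the one stated in the text.

```python
def capacity(x):
    """Words representable with x one byte tokens plus two byte tokens."""
    return x + 256 * (256 - x)

def max_one_byte_tokens(distinct_words):
    """Largest whole number x satisfying x <= (256**2 - y) / 255."""
    if distinct_words > 256 * 256:
        raise ValueError("too many words for one and two byte tokens")
    # Integer division gives the largest whole number not exceeding the bound;
    # x can never exceed the 256 values a single byte can take.
    return min((256 * 256 - distinct_words) // 255, 256)

x = max_one_byte_tokens(12000)   # the example from the text
two_byte_words = 12000 - x
```

For 12,000 different words this gives 209 one byte tokens and 11,791 two byte tokens, and one more one byte token would leave too few two byte tokens for the remaining words.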
Whichever method is used to determine the number of one byte tokens, a dictionary is created by the microcomputer by assigning the tokens to the words in successive numeric order beginning with the first word and continuing to the last. The numeric order of the tokens can be ascending or descending but must be monotonic in the preferred embodiments described herein. In subsequent descriptions, it is assumed that the numeric order is ascending. Advantageously, the words that are represented by one byte tokens are assigned to a first dictionary and the remaining words are assigned to a second. To minimize storage requirements, as detailed below, the second dictionary that associates words and two byte tokens is stored so that the words are in alphabetic order. Because the first dictionary has at most 256 entries, there is usually no need to alphabetize this dictionary; however, because the words stored in this dictionary are used so often in the text, it is desirable to minimize retrieval time from this dictionary. To this end, the words are stored in the order of their frequency of use in the text with the most frequently used word first.
The dictionaries that are stored preferably contain only the words of the dictionary and none of the tokens. Illustratively, the words are stored in the form of ASCII-encoded symbols with one byte being used to represent each symbol. Since there are only 96 ASCII symbols, one bit of each byte is available for other purposes. This bit is used to identify the beginning of each word. In particular, the beginning of each word is identified by setting the eighth bit of the first ASCII character of each word to a "1" while the eighth bit of every other ASCII character of the word is set to "0". As a result, the token associated with a particular word in the dictionary may be determined simply by counting the number of words from the beginning of the dictionary to the particular word in question and adding that count and the numeric value of the token associated with the first word in the list. This counting can be done simply by masking all but the eighth bit of each byte and counting the appearance of each "1" bit in that position as the computer scans each byte from the first word in the list to the word in question.
For example, if the first dictionary contains 209 words, tokens having binary values from 0000 0000 to 1101 0001 will be assigned to these words. To determine the token assigned to a particular word, the computer simply counts the appearance of each 1 bit in the eighth bit position of each byte in the dictionary beginning with the first byte and ending with the byte immediately before the particular word whose token is being calculated.
Since the numeric value of the token assigned to the first word in this dictionary is zero, the count is the value of the token. For this example, the tokens assigned to the words of the second dictionary will commence with the binary value 1101 0010 0000 0000. Accordingly, the value of the token is determined by counting words in the same fashion as in the first dictionary and adding to the count the binary value, 1101 0010 0000 0000, associated with the first word of the second dictionary.
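The word-start marking and the counting rule can be illustrated as follows. This is a sketch only: the function names, the in-memory `bytes` representation and the `first_token` parameter (the token value of the first word in the dictionary) are assumptions, and every word is assumed to consist of 7-bit ASCII characters.

```python
def pack_dictionary(words):
    """Store words back to back, setting the eighth bit of each first character."""
    packed = bytearray()
    for word in words:
        codes = word.encode("ascii")
        packed.append(codes[0] | 0x80)   # eighth bit 1 marks the start of a word
        packed.extend(codes[1:])         # eighth bit 0 on every other character
    return bytes(packed)

def token_of(packed, word, first_token=0):
    """Find a word by scanning; its token is first_token + words passed over."""
    target = word.encode("ascii")
    passed = 0
    i, n = 0, len(packed)
    while i < n:
        j = i + 1
        while j < n and not packed[j] & 0x80:   # advance to the next word start
            j += 1
        entry = bytes([packed[i] & 0x7F]) + packed[i + 1:j]
        if entry == target:
            return first_token + passed
        passed += 1                              # one more start bit counted
        i = j
    raise KeyError(word)

packed = pack_dictionary([" the", " of", " and"])
```

No tokens are stored anywhere in `packed`; each token is implied by the position of its word, which is what allows the dictionary to hold only the words themselves.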
To speed up the counting process, it is helpful to use a look-up table that identifies the token associated with certain words. For example, the look-up table could store the token associated with the first word beginning with each of the twenty six letters of the alphabet, and the counting process could begin with the first word that had the same first letter as the word whose token is to be calculated.
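A look-up table of this kind could be sketched as below. For readability the second dictionary is modelled as a plain sorted Python list rather than packed bytes, words are shown without a leading space, and `base` (the token of the first word in the second dictionary) is a hypothetical parameter; none of these choices comes from the text.

```python
def build_lookup(second_dict, base):
    """Map each initial letter to the token of the first word that starts with it."""
    table = {}
    for index, word in enumerate(second_dict):
        table.setdefault(word[0], base + index)   # keep only the first occurrence
    return table

def token_for(word, second_dict, table, base):
    """Start counting at the first word with the same initial letter."""
    start = table[word[0]]            # KeyError if no word has this initial
    failed = 0                        # words tested that did not match
    for entry in second_dict[start - base:]:
        if entry == word:
            return start + failed
        if entry[0] != word[0]:
            break                     # ran past the words with this initial
        failed += 1
    raise KeyError(word)

BASE = 0x0300                         # illustrative value only
second = sorted(["apple", "ark", "book", "bread", "cat"])
table = build_lookup(second, BASE)
```

With the table in hand, finding "bread" means testing two entries instead of four, and the saving grows with the size of the dictionary.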
After the dictionaries have been created, the microcomputer then compresses the alphanumeric text by reading each word from the linear list that was initially generated, looking up the word in the first or second dictionary and replacing the word in the linear list with the token obtained from the dictionary. In this process, a search is first made through the words of the first dictionary, testing the individual ASCII codes of the words of the first dictionary to determine if they are the same as the word that is to be replaced by a token and counting each test that fails. If a match is found, the count of failed tests is the value of the token provided the value of the token associated with the first word is zero. If no match is found in the first dictionary, the computer moves on to the second dictionary. Here the look-up table is used to provide a starting point for the search through the dictionary. For example, the first letter of the word whose token is to be determined can be used to locate in the look-up table the first word that begins with that letter. The table supplies the value of the token for that word. A search can then be made in alphabetic order through the different words that begin with that letter, testing the individual ASCII codes of each word to determine if they are the same as the word in question. For each word that fails the test a counter is incremented by one. When the word is finally located, its token is calculated by adding the contents of the counter to the value obtained from the look-up table of the token associated with the first word that begins with the same first letter. In this way, the entire linear list of words is replaced by a list of tokens to form a tokenized text.
Finally, the second dictionary may be compressed by coding techniques.
Because the words of this dictionary are in alphabetical order, almost all words will have at least one initial character that is in common with the initial character or characters of the preceding word in the dictionary. In the case where at least two initial characters of a second word are the same as the corresponding initial characters in the first immediately preceding word, it becomes advantageous to represent the second word by (1) a number that identifies the number of initial characters in the first word that are the same and (2) a string of characters which are the balance of characters in the second word that are different from those in the first word. Thus individual words in the dictionary are stored using a number to specify the number of initial characters that are the same as those in the preceding entry and the ASCII codes for the remaining characters that are different. To expedite processing, the number is stored as a binary number that can be used directly in retrieving the initial characters of the word. For example, the words "storage," "store" and "stored" may appear successively in the dictionary. In this case the word "store" is represented by the binary number for "4" and the ASCII character for "e" because the first four letters of the word are found in the immediately preceding word while the "e" is not; and the word "stored" is represented by the binary number for "5" and the ASCII-encoded character for "d" because the first five letters of the word are found in the immediately preceding word while the "d" is not.
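The shared-prefix coding described above can be sketched as follows. As a simplification (an assumption of this sketch, not something the text requires), the prefix count is stored for every entry, including entries that share fewer than two initial characters with their predecessor. The function names are hypothetical.

```python
def front_encode(sorted_words):
    """Replace each word by (shared prefix length, remaining characters)."""
    entries, previous = [], ""
    for word in sorted_words:
        shared = 0
        limit = min(len(previous), len(word))
        while shared < limit and previous[shared] == word[shared]:
            shared += 1
        entries.append((shared, word[shared:]))
        previous = word
    return entries

def front_decode(entries):
    """Rebuild the words by reusing the first characters of the prior word."""
    words, previous = [], ""
    for shared, rest in entries:
        previous = previous[:shared] + rest
        words.append(previous)
    return words

encoded = front_encode(["storage", "store", "stored"])
# "store" becomes (4, "e") and "stored" becomes (5, "d"), as in the text.
```

Decoding must proceed from the start of the dictionary (or from some entry stored in full), since each entry depends on the one before it.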
The tokenized text, the dictionaries, the look-up table and a computer program to read the tokenized text are then stored on any appropriate media such as tape, disk or ROM. Alternatively, this same information may be transmitted from one location to another by a data communication system. Because of the significant data compression achieved in practicing our invention, it is possible to store the entire text of a full sized book on one or two 5-1/4" (133 mm) floppy disks. In general, the length of the text can be reduced by about 60% or 70% by the substitution of tokens for words. A further reduction of 25% and in some cases as much as 50% can be achieved by the use of one byte tokens for the more frequently used words in the text. Thus an overall reduction of about 75% in text length is readily achievable in practicing the invention. The dictionary obviously adds to the length of the text. The length of the second dictionary, however, can be minimized by using numeric codes as set forth above to represent identical initial characters in successive words. This reduces the length of the dictionary by a factor of about three. Illustrations of the amount of compression that can be achieved with the invention are set forth in Example 1 below. Similar reductions in the channel transmission capacity required to transmit such text can also be achieved with the practice of our invention.
A flow chart illustrating the reconstruction by a computer of the original alphanumeric text from the tokenized text is set forth in Fig. 4. As shown therein, each token is fetched in turn by the computer which then searches one of the dictionaries to find the word associated with the token. In the case of a one byte token, the computer simply loads the binary value of the token into a counter and, commencing with the most frequently used word, successively reads the words in the first dictionary, decrementing the count by one for every byte that has a 1 bit in the eighth bit position until the value in the counter is zero. At this point, the next word to be read is the word represented by the token initially loaded into the counter. In searching the second dictionary, the computer advantageously uses the look-up table that associates tokens with the first word beginning with each letter of the alphabet. Thus, the computer simply scans the look-up table in reverse order subtracting the values of the tokens in the table from the value of the token that is to be converted to text. When the difference between the values shifts from a negative value to a positive value, the computer has reached the first word that begins with the same letter as that of the word represented by the token. Accordingly, the computer subtracts this token value from the value of the token to be converted to text and begins the same process of reading the bytes of the different words that begin with this letter. With each byte that has a 1 bit in the eighth bit position, the computer decrements the count by one until the count reaches zero, at which point the next word to be read is the word identified by the token.
Whether retrieved from the first dictionary or the second, the word is then provided to the computer output which may be a display, printer or the like and the computer moves on to the next token.
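The reconstruction of Fig. 4 might be sketched as follows. Several details are assumptions made for illustration: tokens are handled as Python integers, a token smaller than the first table entry is taken to belong to the first dictionary, the look-up table also records the byte offset of each listed word, and words are shown without a leading space. All names are hypothetical.

```python
def pack(words, base=0):
    """Pack words with start bits; also list (token, offset) per initial letter."""
    packed, table, seen = bytearray(), [], set()
    for index, word in enumerate(words):
        if word[0] not in seen:
            seen.add(word[0])
            table.append((base + index, len(packed)))
        codes = word.encode("ascii")
        packed.append(codes[0] | 0x80)       # eighth bit marks a word start
        packed += codes[1:]
    return bytes(packed), table

def read_word(packed, start, count):
    """Pass over `count` words from byte offset `start`, then read the next one."""
    i, n = start, len(packed)
    while count > 0:
        i += 1
        while i < n and not packed[i] & 0x80:
            i += 1
        count -= 1                           # decrement at each word boundary
    if i >= n:
        raise IndexError("token beyond end of dictionary")
    j = i + 1
    while j < n and not packed[j] & 0x80:
        j += 1
    return (bytes([packed[i] & 0x7F]) + packed[i + 1:j]).decode("ascii")

def decode_token(token, first_packed, second_packed, table):
    if token < table[0][0]:                  # assumed: a first-dictionary token
        return read_word(first_packed, 0, token)
    for start_token, offset in reversed(table):   # scan the table in reverse
        difference = token - start_token
        if difference >= 0:                  # no longer negative: right letter
            return read_word(second_packed, offset, difference)
    raise KeyError(token)

first_packed, _ = pack(["the", "of", "and"])
second_packed, table = pack(["apple", "ark", "book", "bread", "cat"], base=0x300)
text = [decode_token(t, first_packed, second_packed, table)
        for t in (0, 0x300, 0x303, 2)]
```

The reverse scan stops at the last table entry whose token does not exceed the token being decoded, so only the words sharing one initial letter need to be counted through.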
Our invention may be practiced in all manner of machine-implemented systems. Specific apparatus for tokenizing the text and for reconstructing the original alphanumeric text from the tokenized text may be any number of suitably programmed computers. In general, as shown in Fig. 5, such a computer comprises a processor 10, first and second memories 20, 30, a keyboard 40 and a cathode ray tube (CRT) 50. Optionally the apparatus may also include a printer 60 and communication interface 70.
The devices are interconnected as shown by a data bus 80 and controlled by signal lines 90 from microprocessor 10.
In addition, the memories may be addressed by address lines 100. The configuration shown in Fig. 5 will be recognized as a conventional microcomputer organization.
The program to create the dictionary and tokenize the alphanumeric text may advantageously be stored in the first memory, which may be a read only memory (ROM). If the same device is also used to reconstruct the alphanumeric text from the tokenized text, that program may also be stored in memory 20. The tokenized text that is created, along with the dictionaries and look-up tables, is typically stored in memory 30; and the reconstruction program may also be stored in memory 30 if it is not available in memory 20. The tokenized text, dictionaries, look-up table and reconstruction program may also be transmitted by means of communication interface 70 to another microcomputer at a remote location.
Advantageously, memory 30 is a programmable read only memory (PROM), magnetic tape or a floppy disk drive because the capacity of such devices is generally large enough to accommodate the entire text of a book in a PROM of reasonable size or a small number of floppy disks. Obviously, where a PROM is used, an appropriate device (not shown) must be used to record the tokenized text, dictionaries, look-up tables and reconstruction program in the PROM. Such devices are well known. Where it is desirable to store large numbers of books in one record, the significantly larger capacity of fixed disk drives or large ROM boards can advantageously be used in the practice of the invention. When the apparatus of Fig. 5 is used to reconstruct the original alphanumeric text from data stored on disks, it is advantageous to transfer the entire contents of the disks to a semiconductor memory because the significantly higher speed of the semiconductor memory will greatly facilitate the look up of words in the dictionary. For this purpose it is also advantageous to compress the dictionary to a size such that it fits within the storage capacity of conventional microcomputer memories. We have found it practical to do this where 64K bytes of semiconductor memory are available.
There are numerous applications for our invention. As indicated above, the invention is useful in compressing alphanumeric text for data storage or transmission. Because reconstruction of the original text can be performed expeditiously, the compressed data can then be used in any application for which the original text might have been used. In addition, because the compressed data is meaningless without the dictionary, one can provide for secure storage and/or transmission of alphanumeric text by generating the tokenized text and dictionary and then separating them for purposes of storage and transmission.
Because the dictionary contains each word of the alphanumeric text but is considerably shorter, the dictionary is also a useful tool in information retrieval. In particular, one can readily determine if a particular word is used in the alphanumeric text simply by scanning the dictionary. Additional advantages can be obtained by adding an identifier to each word in the dictionary which specifies each segment of the text in which the word appears. For example, the identifier might be one byte long and each of the eight bit positions in the byte could be associated with one of eight segments of the text. For this example, the presence of a 1-bit in any of the eight bit positions of that byte would indicate that the associated word was located in the corresponding segment of the text. Use of such an identifier will greatly speed retrieval of the alphanumeric text surrounding the word in question because there is no need to search segments in which the word does not appear.
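A one byte segment identifier of this kind can be sketched in Python as below. The mapping of bit position k to segment k, the dictionary-of-integers representation and the function names are assumptions of the sketch. It also shows a bitwise AND of two identifiers, which is one way to compare the identifiers of different words.

```python
def segment_identifiers(segments):
    """Build one identifier per word; bit k is set if the word is in segment k."""
    if len(segments) > 8:
        raise ValueError("a one byte identifier covers at most eight segments")
    identifiers = {}
    for k, words in enumerate(segments):
        for word in words:
            identifiers[word] = identifiers.get(word, 0) | (1 << k)
    return identifiers

def segments_of(identifier):
    """List the segment numbers whose bit is set in the identifier."""
    return [k for k in range(8) if (identifier >> k) & 1]

def shared_segments(identifiers, first, second):
    """Segments in which both words appear (bitwise AND of the identifiers)."""
    return segments_of(identifiers[first] & identifiers[second])

ids = segment_identifiers([
    ["in", "the", "beginning"],   # segment 0
    ["the", "word"],              # segment 1
    ["word", "was"],              # segment 2
])
```

Only the segments returned by `segments_of` need to be searched when the surrounding text of a word is wanted.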

Moreover, by comparing the individual bits in the identifiers associated with different words, one can determine if the words are used in the same segment of the text. Obviously, the size of the identifier can be varied as needed to locate word usage more precisely.
Numerous variations may also be made in the practice of our invention. While we have described the invention in terms of alphanumeric text, binary tokens and ASCII codes, the invention may be practiced with all manner of symbols and the symbols may be tokenized and coded in various ways. For example, foreign languages, mathematical symbols, graphical symbols and punctuation can all be accommodated in practicing the invention and the symbols can be represented by ASCII, expanded ASCII or any suitable code of one's own choice. While the use of binary tokens is preferred in the practice of our invention, it may be convenient to represent such tokens in other radices such as hexadecimal and the invention can be practiced using tokens having digits of any radix.
We have illustrated two examples for reducing the size of the tokenized text by using codes of less than two bytes to store more frequently used words. Numerous other techniques, however, are available. For example, because the vocabulary used in most books is typically significantly less than the 65,536 words that can be represented by sixteen bits, it is frequently possible to represent each of the words of the alphanumeric text by less than sixteen bits. For example, as many as 32,768 words can be represented by fifteen bits and 16,384 words may be represented by fourteen bits. Accordingly, another method of assigning tokens to words is to calculate the minimum number of bits required to represent each different word by a different token having that minimum number of bits and then to assign to each different word a different token having that minimum number of bits. If the vocabulary used has more than 65,536 words, the same principle can be used to assign tokens of 17, 18 or even more bits to each different word in the text.
An alternative approach is to use tokens having two fields, the first of which is a field of fixed length that specifies the length of the second field. In this technique tokens are assigned to the words strictly in accordance with the frequency count for each word so that the shortest token is assigned to the word that appears most frequently in the text, the next shortest token is assigned to the word that appears next most frequently, and so forth. In this arrangement the dictionary is stored in frequency count order with the most frequent word being stored at the beginning of the dictionary.
With this technique, a token may be as long as twenty bits. However, if the frequency distribution of the words is a very steep curve, as it often is, the average number of bits required to represent each word in the text is significantly reduced, as in the case of Example 1 below. When tokenized text is stored using a token having two fields, it is advantageous to store the tokens in two parallel lists, one of which is merely the list of the first fields and the other is the list of the second fields. Data is stored on the two lists in the same order. Accordingly, to convert the tokenized text to the original alphanumeric text, the computer reads four bits from the first field list, determines from these four bits the number of bits to read from the second field list, reads the bits, and then locates the alphanumeric word associated with such bits by counting words from the beginning of the dictionary in which words are stored in frequency count order. Thus, the most frequently used word would be represented by 0000 on the first list and zero bits on the second list, the next two most frequently used words by 0001 in the first list and one bit in the second list, the next four words by 0010 in the first list and two bits in the second list, and so on. When the computer reads 0000 in the first list, these bits indicate there is no entry in the second list and accordingly the computer retrieves the most frequently used word, which is the first word in the dictionary. When the computer reads 0001 in the first list, it reads the next bit in the second list and retrieves either the second or third word in the dictionary depending on whether the bit in the second list is a zero or a one.
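The two-field tokens and their two parallel lists can be sketched as follows. For readability the bit lists are modelled as Python strings of "0" and "1" characters rather than packed bits, and words are identified by their zero-based position in the frequency-ordered dictionary; both choices, and the function names, are assumptions of this sketch.

```python
def encode_ranks(ranks):
    """Encode frequency ranks as a list of 4-bit length fields and a list of values."""
    first_list, second_list = [], []
    for rank in ranks:
        n = (rank + 1).bit_length() - 1      # bits needed in the second field
        if n > 15:
            raise ValueError("a four bit length field allows at most 15 bits")
        value = rank + 1 - (1 << n)          # position within the group of 2**n words
        first_list.append(format(n, "04b"))
        second_list.append(format(value, "b").zfill(n) if n else "")
    return "".join(first_list), "".join(second_list)

def decode_ranks(first_list, second_list):
    """Read four bits for the length, then that many bits for the value."""
    ranks, position = [], 0
    for i in range(0, len(first_list), 4):
        n = int(first_list[i:i + 4], 2)
        value = int(second_list[position:position + n], 2) if n else 0
        position += n
        ranks.append((1 << n) - 1 + value)   # words before this group, plus offset
    return ranks

first_list, second_list = encode_ranks([0, 1, 2, 3, 6])
```

Rank 0 uses no bits at all in the second list, ranks 1 and 2 use one bit, ranks 3 to 6 use two bits, and so on, which matches the assignment described in the text.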
The techniques described above for storing individual words in the form of tokens can also be extended to the storage of groups of words (i.e., phrases). Common phrases will be recognized by all. Phrases such as "of the", "and the", and "to the" can be expected to occur with considerable frequency in almost all English language alphanumeric text. Such phrases can be automatically assigned a place in the dictionary and one token can be provided for each appearance of one such phrase.

Alternatively, phrases can be identified simply by scanning the alphanumeric text and comparing the words with a subset of the most frequently used words. For example, the 100 most frequently used words might constitute this subset. In this procedure, phrases of the most frequently used words can be assembled simply by testing each word of the text in succession to determine if it is one of the most frequently used words. If it is not, the next word is fetched. If the word is, it is stored along with any immediately preceding words that are on the list of most frequently used words. When a word is finally reached that is not on the list of most frequently used words, the stored words are added to a list of phrases. After the entire text has been scanned, the stored list of phrases is sorted in alphabetical order, duplicates are eliminated and a frequency count of the phrases is made. Depending on the number of tokens available to represent phrases, tokens are assigned to the phrases beginning with the most frequently used phrases and these tokens are then substituted for the phrases in the alphanumeric text before any other tokens are assigned. From the standpoint of the dictionary and the tokenized text, it makes no difference whether the token represents one word or a group of words. Accordingly, the original alphanumeric text can be reconstructed simply by following the process of Fig. 4.
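The phrase-scanning procedure can be sketched as below. This is an illustration only: the function name `collect_phrases`, the sample sentence and the four-word subset are made up, runs of a single frequent word are discarded on the assumption that they are not phrases, and `collections.Counter` stands in for the sort, duplicate elimination and frequency count steps.

```python
from collections import Counter


def collect_phrases(words, frequent):
    """Count runs of two or more consecutive words drawn from the frequent subset."""
    phrases = []
    run = []  # frequent words stored since the last non-frequent word
    for word in words:
        if word in frequent:
            run.append(word)
        else:
            if len(run) >= 2:  # assumption: a lone word is not a phrase
                phrases.append(" ".join(run))
            run = []
    if len(run) >= 2:  # flush a run that reaches the end of the text
        phrases.append(" ".join(run))
    return Counter(phrases)


sample = "in the days of the king and of the people of the land".split()
subset = {"of", "the", "and", "in"}
counts = collect_phrases(sample, subset)
for phrase, count in counts.most_common():
    print(count, phrase)
```

Tokens would then be handed out starting from the top of `counts.most_common()` until the tokens reserved for phrases run out.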

EXAMPLE 1

In practicing our invention, we have stored the entire New Testament by generating a dictionary that associates each word with a token and replacing each word of the New Testament with that token. In order to reduce the space required to store the dictionary, almost all of the dictionary is stored in alphabetic order and is compressed by using numeric codes to represent the number of initial characters that are the same as the initial characters of the preceding word in the dictionary.

In our initial effort to store the text in tokenized form, we used one-byte tokens to represent the most frequently used words. Because there are approximately 14,000 different words in the New Testament, approximately 200 of the most frequently used words can be represented by one-byte tokens and the remaining 13,800 words are represented by two-byte tokens. For this arrangement, approximately 65% of the 170,000 words in the New Testament are represented by a one-byte token. By using such one-byte tokens, we stored the entire 1,036,000 characters of the New Testament in approximately 220,000 bytes of storage.
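The figure of approximately 200 one-byte tokens can be checked with a short calculation. The sketch below assumes, as an illustration rather than as the patent's stated encoding, that the value of the first byte determines whether a token is one or two bytes long, so that each byte value not used as a one-byte token can begin 256 two-byte tokens. The function names are hypothetical.

```python
def max_one_byte_tokens(vocabulary_size: int) -> int:
    """Largest count of one-byte tokens that still leaves room for every other word."""
    for one_byte in range(256, -1, -1):
        lead_bytes = 256 - one_byte  # byte values left to begin two-byte tokens
        if one_byte + lead_bytes * 256 >= vocabulary_size:
            return one_byte
    raise ValueError("vocabulary too large for one- and two-byte tokens")


def estimated_bytes(total_words: int, one_byte_percent: int) -> int:
    """Size of the tokenized text if the given share of words takes one byte."""
    one_byte_words = total_words * one_byte_percent // 100
    return one_byte_words + 2 * (total_words - one_byte_words)


print(max_one_byte_tokens(14_000))   # 202
print(estimated_bytes(170_000, 65))  # 229500
```

Under these assumptions the limit is 202 one-byte tokens, in line with the "approximately 200" stated above. The size estimate is of the same order as the approximately 220,000 bytes reported; its inputs (65% and 170,000 words) are themselves rounded figures.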


In an effort to further reduce storage requirements, we found it advantageous to use two-field tokens of the type described above. In particular, the curve of the frequency of use of words is very steep, as is apparent from Table I, which sets forth the five most frequently used words in the New Testament, the number of times they are used and the token used to represent each such word.

TABLE I

Token     Word    Number of uses
0000      the     10,145
00010     and     7,309
00011     of      [?],705
001000    that    3,36[?]
001001    to      3,098

By using two-field tokens, we have been able to reduce the number of bytes required to store the entire text of the New Testament to approximately 183,000 bytes.
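The tokens in Table I follow directly from each word's position in the frequency-ordered dictionary. A minimal sketch, assuming a zero-based rank and a four-bit first field; the function name `two_field_token` is hypothetical.

```python
def two_field_token(rank: int) -> str:
    """Two-field token for the word at a zero-based frequency rank."""
    k = (rank + 1).bit_length() - 1  # width of the second field in bits
    if k > 15:
        raise ValueError("rank too large for a four-bit length field")
    offset = rank + 1 - (1 << k)     # position within the group of 2**k words
    second = format(offset, "0{}b".format(k)) if k else ""
    return format(k, "04b") + second


for rank in range(5):
    print(rank, two_field_token(rank))
```

Ranks 0 through 4 yield 0000, 00010, 00011, 001000 and 001001, the five tokens listed in Table I.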

EXAMPLE 2

The operation of the general technique of Fig. 1 can be illustrated with respect to a few verses from Matthew, Chapter II:

"1. Now when Jesus was born in Beth-lehem of Judaea in the days of Herod the king, behold, there came wise men from the east to Jerusalem,
2. Saying, Where is he that is born King of the Jews? for we have seen his star in the east, and are come to worship him.
3. When Herod the king had heard these things, he was troubled, and all Jerusalem with him."

In accordance with the invention, a dictionary is created in which each word is associated with a token. Illustratively, this is accomplished by forming a linear list of words such as set forth in Table II.

TABLE II

Now
when
Jesus
was
. . .
with
him.

The list is then sorted alphabetically so as to arrange all the words of the text in alphabetic order as shown in Table III.

TABLE III

all
and
and
are
. . .
Jerusalem
Jerusalem,
. . .
when
When
where
wise
with
worship

The alphabetized list is then processed to eliminate duplicate entries and to generate a frequency count for each entry as shown in Table IV.

TABLE IV

all           1
and           2
are           1
. . .
Jerusalem     1
Jerusalem,    1
. . .
when          1
When          1
where         1
wise          1
with          1
worship       1

In the preferred embodiment of the invention, the list of words and frequency counts is then sorted by frequency count to obtain a new list in which the words are arranged in decreasing order of frequency of use, and each of the words is assigned an individual token. Because the text of Example 2 is so short, there is little need to sort the list in accordance with frequency of use and to use tokens of smaller size to represent the more frequently used words. However, as emphasized above, such a sort is useful where the size of the text is considerably longer.

The individual words are then assigned tokens, with increasingly greater numerical values being assigned to successive entries in the alphabetized list of words. Thus, the assignment of tokens to words in Example 2 is as set forth in Table V.


TABLE V

00 0000    all
00 0001    and
00 0010    are
. . .
01 0110    Jerusalem
01 0111    Jerusalem,
01 1000    Jesus
. . .
01 1111    Now
10 1101    when
10 1110    When
10 1111    where
11 0000    wise
11 0001    with
11 0010    worship

For this example, it is apparent that only six bits are needed to identify each different word uniquely. Obviously the number of bits can be varied depending on the number of different words to be tokenized.

Finally, the computer replaces each word in the linear list of Table II with the corresponding token as set forth in Table V to generate a tokenized text as shown in Table VI.

TABLE VI

01 1111
10 1101
. . .
11 0001
01 0010

For Example 2, there is very little advantage in compressing the dictionary of words. For a larger text, however, in which the initial characters of many words would be the same, the dictionary would then be compressed by replacing with a number all those initial characters in a word that are the same as the initial characters of the preceding word.
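This dictionary compression can be sketched as follows. The representation of each entry as a (count, remaining characters) pair and the function names are assumptions made for illustration; the patent does not specify how the number is stored.

```python
def front_encode(sorted_words):
    """Replace each word's shared leading characters with a count."""
    encoded = []
    previous = ""
    for word in sorted_words:
        shared = 0
        limit = min(len(previous), len(word))
        while shared < limit and previous[shared] == word[shared]:
            shared += 1
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded


def front_decode(encoded):
    """Rebuild the word list from (count, remaining characters) pairs."""
    words = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        words.append(previous)
    return words


entries = ["Jerusalem", "Jerusalem,", "Jesus", "Jews?"]
packed = front_encode(entries)
print(packed)  # [(0, 'Jerusalem'), (9, ','), (2, 'sus'), (2, 'ws?')]
```

The four entries occupy 29 characters as written and 16 characters plus four counts once encoded, which illustrates why the saving grows with an alphabetized dictionary of many similar words.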
Reconstruction of the original text proceeds as shown in Fig. 4, with the individual tokens being read one at a time and used to count through the dictionary until the corresponding word is located, retrieved, and provided to a suitable output.
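The whole of Example 2, from word list to tokenized text and back, can be sketched in a few lines. Several choices here are assumptions for illustration: verse numbers are left out of the text, a word is whatever lies between spaces, the sort ignores case and places the lowercase form first on ties (so that "when" precedes "When"), and the dictionary is indexed directly instead of being counted through.

```python
TEXT = (
    "Now when Jesus was born in Beth-lehem of Judaea in the days of Herod "
    "the king, behold, there came wise men from the east to Jerusalem, "
    "Saying, Where is he that is born King of the Jews? for we have seen "
    "his star in the east, and are come to worship him. "
    "When Herod the king had heard these things, he was troubled, "
    "and all Jerusalem with him."
)


def alphabetical_key(word):
    # Case-insensitive order; swapcase() puts the lowercase form first on ties.
    return (word.lower(), word.swapcase())


words = TEXT.split()                                     # linear list of words
dictionary = sorted(set(words), key=alphabetical_key)    # sorted, no duplicates
bits = max(1, (len(dictionary) - 1).bit_length())        # fixed token width
token_of = {w: format(i, "0{}b".format(bits)) for i, w in enumerate(dictionary)}

tokenized = [token_of[w] for w in words]                 # compress
restored = " ".join(dictionary[int(t, 2)] for t in tokenized)  # reconstruct

print(len(dictionary), bits)
print(token_of["all"], token_of["Jerusalem"], token_of["when"], token_of["worship"])
print(restored == TEXT)
```

With these assumptions the passage has 51 different words, so six bits per token suffice, and the reconstructed text matches the input exactly.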
As indicated above, the dictionary can also be used in information retrieval to indicate that a word has been used in the alphanumeric text. In this application, the use of an identifier to indicate the segment of the text in which the word is used will speed up the retrieval of that word in its context. In the case of the New Testament, a one-byte identifier allows separate identification of each of the four Gospels, the Acts of the Apostles, the Apocalypse, the Pauline epistles and the non-Pauline epistles.
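One way to realize such an identifier is to give each of the eight segments one bit of the byte. This layout is an assumption for illustration; the text states only that one byte suffices. The function names and the sample word sets are made up and do not reflect actual word occurrences.

```python
SEGMENTS = [
    "Matthew", "Mark", "Luke", "John",
    "Acts", "Apocalypse", "Pauline epistles", "non-Pauline epistles",
]


def segment_byte(word, words_by_segment):
    """One-byte identifier with one bit set per segment that contains the word."""
    flags = 0
    for bit, name in enumerate(SEGMENTS):
        if word in words_by_segment.get(name, ()):
            flags |= 1 << bit
    return flags


def segments_containing(flags):
    """Names of the segments whose bits are set in the identifier."""
    return [name for bit, name in enumerate(SEGMENTS) if flags & (1 << bit)]


# Illustrative data only.
sample = {"Matthew": {"star", "king"}, "Acts": {"king"}}
identifier = segment_byte("king", sample)
print(identifier, segments_containing(identifier))  # 17 ['Matthew', 'Acts']
```

A retrieval program could then skip every segment whose bit is clear and search only the remainder.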
As will be apparent to those skilled in the art, numerous modifications may be made in the invention described above.

Claims

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:-

1. In a machine-implemented system for storage or transmission of information represented by groups of digits of different values, a method for compressing text comprising the steps of:
providing the text in the form of words, each of which comprises a group of symbols, such as alphanumeric characters and punctuation, creating from said text a dictionary that associates each different word or group of words of said text with a different token, the average number of digits required to represent said token being less than the average number of digits required to represent the individual symbols of said word on a symbol-by-symbol basis, and replacing each word or group of words in said text with the token associated by said dictionary with said word or group of words, whereby the number of digits required to represent said text is reduced.

2. The method of claim 1 wherein the text comprises words of alphanumeric symbols and punctuation.

3. The method of claim 1 wherein each word is a string of symbols, such as alphanumeric characters and punctuation, located between successive spaces in the text.

4. The method of claim 1 wherein the step of creating the dictionary comprises the steps of:
ordering the words of the text in alphabetical order to form an alphabetized list,
eliminating all duplicate words in the alphabetized list to form a condensed alphabetized list, and assigning different tokens to different words in the condensed alphabetized list.

5. The method of claim 4 wherein each different token has a different numeric value and the step of assigning different tokens to different words in the condensed alphabetized list comprises the step of assigning the different tokens in successive numeric order to different words in alphabetic order.

6. The method of claim 4 wherein the step of creating the dictionary further comprises the steps of:
determining which words appear most frequently in the text, and assigning to the words that appear most frequently tokens that are shorter than the tokens assigned to words that appear less frequently.

7. The method of claim 6 wherein the step of assigning tokens comprises the steps of assigning to the first 128 most frequently used words a token that is one byte in length and assigning to the remaining words a token that is longer than one byte.

8. The method of claim 7 wherein the first byte of the token assigned to each word has one bit position that contains a bit indicating whether the token is one byte long or more than one byte long.

9. The method of claim 6 wherein the step of assigning tokens comprises the steps of:
calculating the maximum number of one byte tokens that can be used to represent the most frequently used words if the remaining words are represented by two byte tokens, assigning one byte tokens to no more than that maximum number of most frequently used words, and assigning two byte tokens to the remaining words.

10. The method of claim 4 wherein the step of creating the dictionary further comprises the steps of:
counting the duplicate entries of words in the alphabetized list to form a frequency count, sorting the condensed alphabetized list in accordance with the frequency count for each word, and assigning to the words that appear most frequently tokens that are shorter than the tokens assigned to words that appear less frequently.

11. The method of claim 10 wherein the step of assigning tokens comprises the steps of:
assigning to each word a token having two fields, the first of which is a field of fixed length that specifies the length of the second field, said tokens being assigned to said words in accordance with the frequency count for each word so that the shortest token is assigned to the word that appears most frequently in the text, the next shortest token is assigned to the word that appears next most frequently, and so forth.

12. The method of claim 11 wherein the first field has a length of four binary digits or their equivalent.

13. The method of claim 4 wherein the step of creating the dictionary further comprises the steps of:
calculating the minimum number of bits required to represent each different word by a different token having that minimum number of bits, and assigning to each different word a different token having that minimum number of bits.

14. The method of claim 1 further comprising the step of compressing the dictionary by replacing the initial characters of a word that are the same as the initial characters of an immediately preceding word with a number indicating how many of said initial characters in both words are the same.

15. The method of claim 1 wherein the text is divided into a plurality of segments and the means for creating a dictionary further comprises means for providing for each different word an indicator specifying in which segments of the text that word appears.

16. A dictionary formed by the method of claim 15.

17. A dictionary formed by the method of claim 1.

18. In a machine-implemented system in which a dictionary associates each different word or group of words of a text with a different token comprised of one or more signals, a method of reconstructing the text from said signals comprising the steps of:
fetching the next token from said signals, locating in the dictionary the word associated with said token, and providing said word to an output of said machine-implemented system.

19. In a machine-implemented system for storage or transmission of text, a method for compressing and reconstructing text comprising the steps of:
creating a dictionary that associates each different word or group of words of said text with a different token, the average number of digits required to represent said token being less than the average number of digits required to represent said word in said system, replacing each word or group of words with the token associated by said dictionary with said word or group of words to form a compressed text in which the number of digits required to represent said text is reduced, fetching the next token from said compressed text, locating in the dictionary the word associated with said token, and providing said word to an output of said machine-implemented system.

20. The method of claim 19 wherein the text comprises words of alphanumeric symbols and punctuation.

21. The method of claim 19 wherein the step of creating the dictionary comprises the steps of:
ordering the words of the text in alphabetical order to form an alphabetized list, eliminating all duplicate words in the alphabetized list to form a condensed alphabetized list, and assigning different tokens to different words in the condensed alphabetized list.