WO1986003039A1 - Symbolic tokenizer for words and phrases - Google Patents

Symbolic tokenizer for words and phrases

Publication number
WO1986003039A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
word
grammar
dictionary
byte
Application number
PCT/US1985/002223
Other languages
English (en)
Inventor
Dale Eugene Winther
Original Assignee
Datran Corporation
Application filed by Datran Corporation
Publication of WO1986003039A1

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Definitions

  • This invention relates to microprocessors and more particularly to a custom microprocessor which permits a conventional microprocessor to manipulate quickly and efficiently entire language words and phrases as tokens as opposed to manipulating single characters.
  • a data compression method called tokenization is the basis of this invention.
  • Language words and phrases are converted into tokens which utilize fewer bits than the ASCII or EBCDIC binary equivalents of the same words.
  • the following United States patent documents were considered in the investigation and evaluation of this invention:
  • Patent 2,709,199 entitled “CODE SIGNAL CONVERTER” is an apparatus for converting signals of a code having a given number of units into signals of a code employing a different number of units. An example would be converting signals from a five- to an eight-unit code.
  • This invention does not employ data compression methods. It actually increases the amount of code being transmitted and is merely used to change from one code format to another, longer code format for use in telegraphy.
  • Patent 3,309,694 entitled “CODE CONVERSION APPARATUS” is an electronic translator for converting a binary code representation of a letter, number, or phrase to an audible, and/or visible indication of that letter, number, or phrase. This is not a system of text compression.
  • Patent 3,388,380 entitled “SYSTEM FOR DISPLAY OF A WORD DESCRIPTION OF PARAMETERS AND VALUES THEREOF IN RESPONSE TO AN INPUT OF A WORD DESCRIPTION OF THE PARAMETER” is a data display system. It does not involve the compression of text.
  • Patent 3,599,205 entitled “BINARY TO TERNARY PROTECTED CODE CONVERTER” is an apparatus for converting nonprotected code signals having a specified number of bits to protected code having approximately the same number of bits. This system does not attempt to compress the code.
  • Patent 3,685,033 entitled “BLOCK ENCODING FOR MAGNETIC RECORDING SYSTEMS” is a high density recording and reproducing system. It is a method for increasing the number of stored bits per transition on a magnetic recording medium. This is not a means of text compression.
  • Patent 3,810,154 entitled “DIGITAL CODE TRANSLATOR” is an apparatus for substantially increasing the amount of information that can be transferred by a teletype in a given amount of time. This method does not require an increase in the bandwidth of the carrier.
  • The transmitting station converts parallel, fixed-place, digital-coded message characters into serial, variable-place, digital-coded message characters.
  • The receiving station reverses the conversion. This is not a method of data compression using tokenization.
  • Patent 4,122,299 entitled “DATA OUTPUT MODIFYING SYSTEM” is a system for placing data in a format for acceptance by a general purpose communications printer. The data is originally in a format for providing a television display to news wire service subscribers. This invention involves the conversion of data from one format to another. It does not attempt to compress text.
  • Patent 4,229,817 entitled “PORTABLE ELECTRONIC CRYPTOGRAPHIC DEVICE” is a hand-held apparatus for enciphering and deciphering text. This device merely changes the presentation of data and does not compress it.
  • a further object of this invention is to provide a custom microprocessor that permits fast searching through parallel processing of a human language dictionary.
  • The custom microprocessor contains an internal electronic dictionary which contains language words and phrases. This dictionary may be interrogated by an external computing device such as a conventional microprocessor for purposes such as spelling verification. Search times are very fast because the symbolic tokenizer is designed to simultaneously process whole or partial words and phrases in parallel instead of individual characters.
  • The symbolic tokenizer produces a binary code that represents a human language word or phrase.
  • The binary symbol is used as a token for the representation of a language word or phrase.
  • The token can be used by a conventional microprocessor to manipulate text in such applications as mass storage and telecommunications.
  • the tokens are reconverted to human language.
  • the tokenization of phrases is a two-step process. First, the words comprising a phrase are tokenized. When the control microprocessor recognizes a key word which may be the beginning of a tokenizable phrase, a search is made of the phrase portion of the dictionary using the already tokenized language words. Since the search is conducted using tokenized words, the same architecture allows for the simultaneous searching of the several words which comprise the phrase.
  • the symbolic tokenizer can be used with a number of different systems and serves many useful purposes.
  • Tokenized text (a series of tokens representing words or phrases of the text under consideration) can be transmitted or stored far more efficiently than its ASCII or EBCDIC equivalent.
  • The ability to rapidly search a language dictionary through parallel processing makes a very fast spelling checker feasible.
  • The control microprocessor can be any central processing unit from a simple microprocessor to a conventional mainframe computer.
  • FIG. 1 is a block diagram of the symbolic tokenizer showing a dictionary ROM and a control microprocessor embodying the present invention;
  • FIG. 1A is a schematic and block diagram depicting a storage array in the control microprocessor containing data for operating the symbolic tokenizer;
  • FIG. 2 is a schematic and block diagram of the dictionary portion of the symbolic tokenizer and the control logic for the dictionary;
  • FIG. 3 is a schematic and block diagram of the comparator portion of the symbolic tokenizer, showing a mailbus and output multiplexer;
  • FIG. 4 is a schematic and block diagram of a control logic portion of the symbolic tokenizer;
  • FIG. 5 is a schematic of an interface to a control microprocessor, such as a Commodore 64K computer;
  • FIG. 6 is a schematic and block diagram of a dictionary ROM text storage arrangement allowing parallel character processing;
  • FIG. 7 is a schematic and block diagram depicting a flowchart of the process of controlling the symbolic tokenizer for tokenization;
  • FIGS. 8A-8 are schematic and block diagrams depicting a flowchart of the process of tokenization;
  • FIG. 9 is a schematic and block diagram depicting a flowchart of the process of detokenization;
  • FIG. 10 is a schematic depiction of a keytoken list stored in the dictionary ROM;
  • FIG. 11 is a schematic depiction of a phrase table stored in the dictionary ROM;
  • FIG. 12 is a schematic depiction of a memory map stored in the control microprocessor and used to access the phrase table;
  • FIG. 13 is a schematic depiction of a memory map stored in the control microprocessor and used to access the keytoken list;
  • FIGS. 14 and 14A are a schematic and block diagram depicting a flowchart of the process of phrase tokenization.
  • a conventional microprocessor 100 is used to control the symbolic tokenizer 102A.
  • the control microprocessor may be a Commodore 64K computer, an IBM Personal Computer, or any other type of processor.
  • the control microprocessor includes a storage area, for example, ROM 100A. However, it is to be understood that the storage area to be used in the manner described below may also be a removable storage medium, such as a floppy diskette.
  • the storage area contains a map of the start and end points of areas to be searched in the dictionary.
  • The map is a memory map similar to that depicted in FIG. 1A and is accessed according to the starting character of the English-language trial word to be tokenized and the number of characters in that word. For any given starting character and character length, there is both a unique starting address, alpha, corresponding to an address of a memory location in the dictionary ROM 104 and a representation of the total number of grammar bytes, beta, having the given starting character and the given character length.
  • the term grammar byte will be defined below with respect to FIG. 6 after the dictionary architecture is defined.
  • the starting address points to a location in the dictionary ROM for starting a search for words to be compared to the particular word under consideration.
  • the number of grammar bytes indicates the total number of dictionary look-ups required to compare the trial word with all the words in the dictionary ROM containing the given starting character and the given number of characters in the trial word. Therefore, the number of grammar bytes is used to determine the end point for searching.
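The memory-map lookup described above can be sketched as a table keyed by starting character and word length. This is only an illustration: the dictionary name `MEMORY_MAP`, the function `lookup_sector`, and every address and count in the table are made-up assumptions, not values from the specification.

```python
# Hypothetical memory map: (starting letter, word length) -> (alpha, beta).
# alpha is the starting address in the dictionary ROM; beta is the total
# number of grammar bytes in that sector. All values are illustrative only.
MEMORY_MAP = {
    ("w", 5): (0x0000, 10),  # e.g. five 5-letter "w" words, two grammar bytes each
    ("w", 6): (0x000A, 4),   # e.g. two 6-letter "w" words, two grammar bytes each
}


def lookup_sector(trial_word):
    """Return (alpha, beta) for the sector that could hold trial_word, or None."""
    if not trial_word:
        return None
    key = (trial_word[0].lower(), len(trial_word))
    return MEMORY_MAP.get(key)
```

A word whose starting letter and length have no entry in the map cannot be in the dictionary, so the lookup itself can rule out a search.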
  • a dictionary of preselected, commonly-used words is stored in ROM 104.
  • the symbolic tokenizer's dictionary ROM 104 stores words according to starting letter and character string length.
  • the dictionary effectively contains segments which depend upon the number of characters in the words they contain. In the preferred embodiment, the first segment contains words having between one and four characters, the second between five and eight, the third between nine and twelve, and the fourth between 13 and 16. The reason for the divisions into segments is that the language words in any given segment contain the same number of grammar bytes.
  • Segment codes are used to indicate how many dictionary reads must occur before the results of the comparison should be sampled. Only four codes, S1, S2, S3, and S4, are required. For example, for the segment code S1, one grammar byte is needed to read a single-character word, a two-character word, a three-character word, or a four-character word, so only one clock cycle or grammar byte is required to read the entire word. For the segment code S2, two grammar bytes are needed to read a 5-, 6-, 7-, or 8-character word. Each read during a given clock cycle transfers a maximum of four characters from the dictionary ROM to the compare logic 103.
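Because each grammar byte carries at most four characters, the segment code is the word length divided by four, rounded up. A minimal sketch, in which the function name `segment_code` and the constants are illustrative names rather than terms from the specification:

```python
CHARS_PER_GRAMMAR_BYTE = 4   # each 20-bit grammar byte carries up to four characters
MAX_WORD_LENGTH = 16         # the described embodiment stores words of up to 16 characters


def segment_code(word_length):
    """Return 1..4 (S1..S4): the number of grammar bytes needed for a word."""
    if not 1 <= word_length <= MAX_WORD_LENGTH:
        raise ValueError("word length must be between 1 and 16 characters")
    # Ceiling division: 1-4 -> 1, 5-8 -> 2, 9-12 -> 3, 13-16 -> 4.
    return (word_length + CHARS_PER_GRAMMAR_BYTE - 1) // CHARS_PER_GRAMMAR_BYTE
```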
  • Each dictionary segment contains up to four subsegments and up to 104 sectors. There is one subsegment for each of the four character string lengths in the segment. There is one sector for each different starting letter and each unique character string length of the language words within that dictionary segment. For example, if a particular dictionary subsegment contained language words of two characters starting with the letters A through Y, but no language words starting with Z, then that dictionary subsegment would contain 25 sectors of words. That is, one sector of words for each unique starting character and character string length. Division of the memory map into both segments and sectors permits two-dimensional accessing of the memory map data for obtaining the dictionary starting address for the search.
  • The symbolic tokenizer compares a trial word to a language word within the stored dictionary and returns the address in the dictionary ROM where that word may be found. The address is then provided to the control microprocessor where the token is derived.
  • The token is the address of that word within the dictionary, divided by the number of the dictionary segment within which the word was found, and added to an offset. The offset is determined by the dictionary segment and depends upon the location of the segment within the dictionary.
  • The control microprocessor knows from which dictionary segment the token was generated, because tokens generated from given segments fall within given numeric ranges. Tokens numerically greater than one given number, but less than another given number, must have their corresponding language word within a corresponding dictionary segment. That is, each dictionary segment has tokens that fall between predetermined numbers.
  • Since the dictionary segment can be determined from the token itself, the offset can be subtracted from the token and the result then multiplied by the appropriate dictionary segment number to find the address of the language word within the dictionary ROM.
  • The ROM address is supplied to the symbolic tokenizer by the control microprocessor, and the symbolic tokenizer returns the language word represented by the token to the control microprocessor.
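The address-to-token derivation and its inverse can be sketched as follows. The table `TOY_SEGMENTS`, its token ranges and offsets, and both function names are hypothetical; the toy layout also assumes that every word starts at an address that is a multiple of its segment number, so that the division is exact.

```python
# Toy layout, NOT the patent's values: (segment number, first token, last token, offset).
# Assumed ROM addresses: S1 words at 0..3, S2 words at 4, 6, 8,
# S3 words at 12, 15, and S4 words at 20, 24.
TOY_SEGMENTS = [
    (1, 0, 3, 0),
    (2, 4, 6, 2),
    (3, 7, 8, 3),
    (4, 9, 10, 4),
]


def address_to_token(address, segment):
    """token = address / segment number + segment offset."""
    offset = next(off for seg, _, _, off in TOY_SEGMENTS if seg == segment)
    return address // segment + offset


def token_to_address(token):
    """Find the segment from the token's numeric range, then invert the formula."""
    for segment, first, last, offset in TOY_SEGMENTS:
        if first <= token <= last:
            return (token - offset) * segment, segment
    raise ValueError("token outside every segment range")
```

Because the segment is recovered from the token's numeric range alone, the token does not need to carry a separate segment field.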
  • Language words are represented in the dictionary by one or more 20-bit bytes called grammar bytes.
  • Each 20-bit byte represents four characters from the language word.
  • Language words may be represented by up to four 20-bit bytes in the dictionary because the present embodiment contemplates storing words of up to 16 characters.
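Four characters fit in 20 bits when each character takes five bits. The sketch below packs a lowercase word into grammar bytes under two assumptions that are not taken from the specification: the character code ('a' = 1 through 'z' = 26, with 0 as padding) and the placement of the first character of each group in the least significant five bits. The function names are illustrative.

```python
BITS_PER_CHAR = 5
CHARS_PER_GRAMMAR_BYTE = 4


def encode_char(ch):
    # Assumed code: 'a' -> 1 ... 'z' -> 26; 0 is reserved for padding.
    if not ("a" <= ch <= "z"):
        raise ValueError("only lowercase a-z is supported in this sketch")
    return ord(ch) - ord("a") + 1


def pack_word(word):
    """Pack a word of 1-16 letters into a list of 20-bit grammar bytes."""
    if not 1 <= len(word) <= 16:
        raise ValueError("word must have between 1 and 16 characters")
    grammar_bytes = []
    for start in range(0, len(word), CHARS_PER_GRAMMAR_BYTE):
        value = 0
        for position, ch in enumerate(word[start:start + CHARS_PER_GRAMMAR_BYTE]):
            value |= encode_char(ch) << (position * BITS_PER_CHAR)
        grammar_bytes.append(value)
    return grammar_bytes


def unpack_word(grammar_bytes):
    """Inverse of pack_word; padding codes (0) are skipped."""
    chars = []
    for value in grammar_bytes:
        for position in range(CHARS_PER_GRAMMAR_BYTE):
            code = (value >> (position * BITS_PER_CHAR)) & 0x1F
            if code:
                chars.append(chr(code + ord("a") - 1))
    return "".join(chars)
```

With this packing, a five-letter word such as "water" occupies two grammar bytes and a sixteen-letter word occupies four.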
  • The use of up to four 20-bit grammar bytes provides a variable word length, in units of 20 bits, to maximize dictionary storage efficiency. For example, the word "water" is stored in two 20-bit grammar bytes in ROMs 1-5, four characters in the first byte and one character in the second. Additionally, as indicated in FIG.
  • ROMs 1-10 are separated into two banks of five ROMs each, ROMs 1-5 and ROMs 6-10.
  • ROMs 1-5 will be considered, but all the grammar bytes for any given word are stored in the same bank of ROMs, and it will be understood that ROMs 6-10 can be treated in an equivalent manner.
  • the last four characters are stored in one 20-bit grammar byte of ROMs 1-5, as indicated in FIG. 6.
  • the least significant four bits of the letter "r” are stored in the least significant four bits of ROM 1.
  • the most significant bit of the letter "r" is stored in the least significant bit of first ROM 2.
  • the least significant three bits of the letter "e" are stored in the remaining least significant three bits of the first four bits of ROM 2, and the most significant two bits of the letter
  • the next succeeding memory location in each of ROMs 1-5 is used to store the character "w".
  • the least significant four bits of the character "w” are stored in the least significant four bits of ROM 1
  • the most significant bit of the character "w” is stored in the least significant bit of the first four bits of
  • bytes would be required. For words containing between 9 and 12 characters, a series of three 20-bit grammar bytes would be required, all of the bits in the first two grammar bytes being used while an appropriate number of bits in the last 20-bit grammar byte are taken up by the required
  • phrases up to 80 bits long are also stored in the dictionary.
  • The phrases are stored as previously tokenized words; therefore, several words may comprise a phrase which is stored in four 20-bit bytes. That is, since the phrases are stored as tokens of, at most, 20 bits each, and not as language words, their storage requires much less memory than if the phrases were stored as their ASCII or EBCDIC equivalents.
  • the "token" is typically between 14 and 16 bits long.
  • the character length distribution of language words varies depending upon a particular given dictionary's vocabulary.
  • a dictionary may be designed around a standard business vocabulary.
  • The business vocabulary has a character population density that places about 90% of its words in the 5- to 12-character length category.
  • segment map that represents a business dictionary. Other vocabularies may be divided differently.
  • segment division points are software
  • the prototype symbolic tokenizer's business dictionary is composed of 13,312 language words.
  • the business dictionary is divided into four segments designated as S1,
  • Segment S1 contains approximately one thousand words and is addressed by the hexadecimal numbers 0000 through 03FF.
  • Segment S2 contains approximately six thousand words
  • Segment S3 contains approximately five thousand words and is addressed by the hexadecimal numbers 1C00 through
  • Segment S4 contains approximately one thousand words and is addressed by the hexadecimal numbers 3000 through 33FF.
  • S1 expresses words requiring one 20-bit grammar byte (up to four characters).
  • S2 expresses words requiring two 20-bit grammar bytes or 40 bits (between five and eight characters).
  • S3 expresses words requiring three 20-bit grammar bytes or 60 bits (between nine and twelve characters).
  • S4 expresses words requiring four 20-bit grammar bytes or 80 bits (between thirteen and sixteen characters).
  • the compare logic 103 compares the trial word from the trial word register 111 to selected individual words from the proper segment of the dictionary ROM 104.
  • Address counter 105 determines each address of the dictionary ROM 104 from which a word, or a portion of a word, is to be taken to be compared by the compare logic 103. Words are transferred from the dictionary ROM 104 to the compare logic 103 over the grammar bus 112.
  • The address counter 105 is a register/counter and holds the memory address for ROM 104 indicating the grammar byte to be accessed.
  • The segment register 106 determines how many dictionary reads must take place to determine when the output of the compare logic 103 should be sampled. This is necessary because words stored in the dictionary ROM 104 are variable in length and require from one to four reads to be output to the compare logic 103.
  • a segment counter 107 counts down while the address counter 105 counts up.
  • the segment counter 107 counts the number of grammar bytes required to read a given language word from the selected segment of the dictionary ROM 104 being searched.
  • The first grammar byte will be read from the memory locations of ROMs 1-5.
  • the segment counter will be decremented, and the address counter will be incremented.
  • the segment counter 107 is reloaded from the segment register 106 so the next word can be read.
  • The control logic 109 stops the address counter 105 from incrementing when either a comparison by the compare logic 103 is true, or the entire sector of the dictionary ROM 104 which was to be searched has been searched and no comparison by the compare logic 103 was found to be true.
  • the control logic 109 also accepts tokenizer control codes, shown as TCC in FIG. 1, from the control microprocessor
  • A control microprocessor 100 (a Commodore 64K in the prototype) will have a memory map for the dictionary ROM 104 already created as discussed above, and the dictionary ROM will contain the desired library of dictionary words.
  • the control microprocessor 100 issues a tokenizer reset command through interface
  • the interface 101 is described below with respect to FIG. 5.
  • the reset command is used to reset the segment register 106, and reinitialize the segment counter 107.
  • The tokenizer reset also reinitializes the address counter 105 and the control logic 109.
  • the scan limit down counter 108 and the status register 110 are also reinitialized.
  • The control microprocessor determines the number of characters in the trial word under consideration and determines a segment code according to the number of characters in the trial word. As discussed above, if the trial word contains between one and four characters, a segment code representing "S1" will be issued to the segment register 106. If the trial word contains between five and eight characters, a segment code representing "S2" will be loaded in the segment register. If the word contains between nine and twelve characters, a representation of the segment code "S3" is loaded, and if the trial word contains between thirteen and sixteen characters, a segment code representing "S4" is loaded. For example, the word "water" uses two grammar bytes, so the segment counter would contain a representation of "S2" or "2", indicating that two grammar bytes must be read for "water".
  • The control microprocessor then supplies the trial word "water" through the interface 101 and the parallel mailbus 102 to the trial word register 111.
  • the trial data is the word on which tokenization is to be attempted.
  • The control microprocessor utilizes the character string length of the trial word and the beginning character of the trial word to access the dictionary ROM information contained in the memory map in storage device 100A.
  • The microprocessor obtains the starting address and the total number of grammar bytes for the words starting with the given character and having the given character count.
  • beta indicates the number of total grammar bytes of the words containing the given letter and having the given character length. Since a single grammar byte may not contain representations of all characters in the stored word, several grammar bytes may be required to form a complete representation of the stored word. However, beta is a representation of the total number of grammar bytes needed to step through all the entries in the dictionary ROM containing words starting with the given character and having the given character string length.
  • The word "water" and all other five-character words beginning with the character "w" and stored in the dictionary ROM would require two grammar bytes to store the individual word. If there were five words stored in the dictionary ROM, each starting with the character "w", and each consisting of a total of five characters, then the total number of grammar bytes, beta, required to search through the dictionary ROM for the five words would be ten grammar bytes.
  • The tokenizer 102A must cycle or search through the dictionary ROMs, starting at the starting address, represented by alpha, twice for each word. Carrying out the twice-repeated cycle for each stored word requires a total of ten cycles or look-ups. Therefore, beta, corresponding to words starting with the character "w" and having five characters, would contain a representation of "10".
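The arithmetic behind beta in this example is the number of words in the sector multiplied by the grammar bytes per word. A minimal sketch, in which the function names and the sample word list are illustrative assumptions:

```python
def grammar_bytes_per_word(length):
    """Four characters per grammar byte, rounded up."""
    return (length + 3) // 4


def sector_beta(sector_words):
    """Total grammar bytes in one sector (same starting letter, same length)."""
    if not sector_words:
        return 0
    first = sector_words[0]
    for word in sector_words:
        if word[0] != first[0] or len(word) != len(first):
            raise ValueError("all words in a sector share a first letter and a length")
    return len(sector_words) * grammar_bytes_per_word(len(first))
```

For five 5-letter words beginning with "w", this gives 5 × 2 = 10, matching the count in the text.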
  • the control microprocessor loads the starting address into the address counter 105. For example, if the words starting with "w" and having five characters were found beginning at the first address location of ROM 104, the address counter would be loaded with a representation of the hexadecimal number 0000.
  • The control microprocessor also loads the scan limit down counter 108 with a representation of beta, the total number of 20-bit grammar bytes to be read from the dictionary ROM 104. In the present example, beta is 10.
  • Each of the above steps is preferably accomplished during a respective single clock cycle of the microprocessor.
  • The particular order in which the steps are carried out is not significant, as long as the tokenizer registers are loaded with representations of the appropriate information. However, as would be clear to one skilled in the art, reset of the tokenizer must occur prior to loading of any registers in the tokenizer.
  • The control microprocessor issues a start-scan command to the tokenizer 102A. Once the start-scan command is issued, the tokenizer steps through a defined sequence of steps, and the control microprocessor is free to carry out other operations.
  • The control microprocessor intermittently tests the mailbus 102 for one of three status codes: a representation of DEQ, meaning data equals; a representation of DNF, meaning data not found; or a representation of SIP, indicating scan in progress.
  • A mailbus interrupt is also possible, allowing the interface 101 to interrupt the operation of the control microprocessor 100 when a mailbus interrupt signal is received from the tokenizer.
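The polling alternative can be sketched as a loop that keeps reading the status until it is no longer "scan in progress". The `Status` enum, the `wait_for_scan` function, the `read_status` callback, and the `max_polls` limit are all hypothetical; the specification does not give the numeric encodings of the status codes, so none are assumed here.

```python
from enum import Enum


class Status(Enum):
    SIP = "scan in progress"
    DEQ = "data equals"
    DNF = "data not found"


def wait_for_scan(read_status, max_polls=1000):
    """Poll read_status() until the tokenizer reports DEQ or DNF.

    read_status is a caller-supplied function standing in for a mailbus read.
    """
    for _ in range(max_polls):
        status = read_status()
        if status is not Status.SIP:
            return status
    raise TimeoutError("tokenizer still scanning after max_polls reads")
```

In a real system the control microprocessor would do other work between polls, or rely on the interrupt instead.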
  • The tokenizer compares the trial word with each word in the dictionary ROM identified according to the starting address and the number of grammar bytes.
  • the compare logic 103 will compare the first grammar byte of the trial word contained in register 111 with the first grammar byte read from the first memory location in dictionary ROM 104.
  • The segment counter will be decremented so that it contains a representation of "1", indicating that one more grammar byte must be read in order to read the entire word "water".
  • The address counter will be incremented so that the counter contains a representation of the hexadecimal number 0001.
  • The scan limit down counter 108 is also decremented so that it contains a representation of "09".
  • The next grammar byte is read from dictionary ROM 104 and compared with the next grammar byte in trial word register 111. Additionally, the segment down counter is decremented to a representation of "0", indicating that all grammar bytes for the given word have been read and compared. At this point, the compare logic is sampled to determine the status of the compare. Since a match would be found in the present example, the tokenizer issues the DEQ (data equals) signal, and the control microprocessor issues a read command to the tokenizer so that the address of the dictionary word for which a match was found can be read by the control microprocessor.
  • the address is then used by the control microprocessor to determine an appropriate token. Specifically, the address is preferably divided by the segment number within which the word was found and added to an offset, as described above.
  • the token is either then stored as a substitute text storage scheme, or may be communicated to another processor as part of a text transfer. Detokenization may then be conducted at any time.
  • the starting address is incremented, the segment down counter is reloaded with the representation of the segment code, the scan limit down counter is decremented, and the next
  • the continued cycling allows the segment down counter to be reloaded with the representation of the segment code so that all of the grammar bytes for the next succeeding word will be read and compared. If no match is found after the appropriate number of grammar bytes have been accessed, the DNF (data not found) signal is provided to the status register 110, which then produces a DNF to mailbus 102. This signal is provided to the control microprocessor which then either terminates the text processing or repeats the above-described process with a subsequent trial word.
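The scan sequence walked through above can be simulated in software. In this sketch the function `scan`, the list-based ROM, and the use of short strings as stand-ins for 20-bit grammar bytes are all illustrative assumptions; returning the address of the matching word's first grammar byte is also an assumption, since the text only says that the address of the matching word is read back.

```python
def scan(rom, trial, alpha, beta, segment):
    """Simulate one dictionary scan.

    rom     -- list of grammar bytes (any comparable values)
    trial   -- grammar bytes of the trial word, len(trial) == segment
    alpha   -- starting address of the sector
    beta    -- total grammar bytes in the sector
    segment -- grammar bytes per word (1..4)
    """
    if len(trial) != segment or beta % segment or alpha + beta > len(rom):
        raise ValueError("inconsistent scan parameters")
    address = alpha        # address counter 105 counts up
    scan_limit = beta      # scan limit down counter 108 counts down
    while scan_limit > 0:
        segment_counter = segment   # reloaded from segment register 106
        word_start = address
        match = True
        while segment_counter > 0:
            if rom[address] != trial[segment - segment_counter]:
                match = False
            address += 1
            segment_counter -= 1
            scan_limit -= 1
        # The compare result is sampled only after the whole word is read.
        if match:
            return "DEQ", word_start
    return "DNF", None
```

For a ROM holding `["wago", "n", "wate", "r", "whee", "l"]`, scanning for `["wate", "r"]` with alpha 0, beta 6, and segment 2 reads two words and reports a match at address 2.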
  • the control microprocessor identifies the trial word as having greater than 17 characters, a data not found code is issued, and the control microprocessor returns to process another word or ends the text processing.
  • the particular trial word is stored as the ASCII or EBCDIC equivalents. However, this would occur in a very small percentage of the words.
  • the tokenizer and the method and means by which the control microprocessor directs the tokenizer will now be considered in detail.
  • The tokenizer control codes are 9-bit commands from the control microprocessor 100 to the symbolic tokenizer.
  • The commands are divided between write commands, which cause outputs to be sent from the control microprocessor 100 to the symbolic tokenizer, and read commands, which cause inputs to be sent from the symbolic tokenizer to the control microprocessor 100.
  • the commands are as follows:
  • ADDRESS     VALUE   FUNCTION
    SSSS00000   00      TOKENIZER RESET
    SSSS00001   01      TRIAL WORD BYTE 1A
    SSSS00010   02      TRIAL WORD BYTE 1B
    SSSS00011   03      TRIAL WORD BYTE 1C
    SSSS00100   04      LOAD SCAN ADDRESS REGISTER LOW BYTE
  • The leading four bits of each mailbus address (SSSS) are the bits which the control microprocessor 100 uses to select which symbolic tokenizer it will communicate with when more than one symbolic tokenizer sits on the mailbus 102. This is the address set into the dip switch 57 and sensed by the address select 53, both of FIG. 4. The X's in the read commands indicate digits not used.
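A 9-bit command of the form SSSS followed by a five-bit function value can be assembled with a shift and an OR. The sketch below uses only the five function values listed in the table above; the dictionary `FUNCTION_CODES` and the function `mailbus_address` are illustrative names, not part of the specification.

```python
# Function values taken from the command table; only the listed rows are included.
FUNCTION_CODES = {
    "TOKENIZER RESET": 0x00,
    "TRIAL WORD BYTE 1A": 0x01,
    "TRIAL WORD BYTE 1B": 0x02,
    "TRIAL WORD BYTE 1C": 0x03,
    "LOAD SCAN ADDRESS REGISTER LOW BYTE": 0x04,
}


def mailbus_address(select, function):
    """Build the 9-bit address: 4 tokenizer-select bits, then 5 function bits."""
    if not 0 <= select <= 0xF:
        raise ValueError("tokenizer select must fit in four bits")
    return (select << 5) | FUNCTION_CODES[function]
```

For instance, `format(mailbus_address(0b1010, "TRIAL WORD BYTE 1C"), "09b")` yields the string `101000011`.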
  • the read and write commands are utilized by a mailbus address line 116, to be described below with respect to FIG. 4, to enable the correct logic component to accept the data being provided by the control microprocessor on the mailbus data lines 115.
  • the TOKENIZER RESET write command is sent from the control microprocessor 100 to the symbolic tokenizer to reset the symbolic tokenizer's logic.
  • the TRIAL WORD BYTE commands are used to load the trial word into the trial word register 111 in the form of register files 34-38.
  • The trial word transmitted to the symbolic tokenizer may be up to 80 bits long. This allows for trial words of up to 16 characters, as twenty bits are required for each four characters. Transmission of the trial word to the symbolic tokenizer is accomplished in up to twelve separate transmissions, i.e., a maximum of eight 8-bit loads and four 4-bit loads.
  • TRIAL WORD BYTE 1A through 1C write commands represent commands to load the first 20 bits of the trial word.
  • TRIAL WORD BYTE 2A through 2C represent commands to load the second 20 bits of the trial word.
  • TRIAL WORD BYTE 3A through 3C represent commands to load the third 20 bits of the trial word.
  • TRIAL WORD BYTE 4A through 4C represent commands to load the fourth 20 bits of the trial word.
  • The suffix of a TRIAL WORD BYTE command, such as the 'A' in TRIAL WORD BYTE 4A, identifies which bits of the 20-bit grammar byte are loaded. A TRIAL WORD BYTE command having an 'A' suffix represents a command to load bits 0 through 7, a command having a 'B' suffix represents a command to load bits 8 through 15, and a command having a 'C' suffix represents a command to load bits 16 through 19.
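Splitting a 20-bit grammar byte into the A, B, and C loads is a matter of masking and shifting. A minimal sketch, with `split_grammar_byte` and `join_grammar_byte` as illustrative names:

```python
def split_grammar_byte(value):
    """Split a 20-bit value into the A (bits 0-7), B (8-15), and C (16-19) loads."""
    if not 0 <= value < (1 << 20):
        raise ValueError("grammar byte must fit in 20 bits")
    part_a = value & 0xFF           # bits 0 through 7
    part_b = (value >> 8) & 0xFF    # bits 8 through 15
    part_c = (value >> 16) & 0xF    # bits 16 through 19
    return part_a, part_b, part_c


def join_grammar_byte(part_a, part_b, part_c):
    """Reassemble the 20-bit value from its three loads."""
    return part_a | (part_b << 8) | (part_c << 16)
```

A 16-character trial word has four grammar bytes, so it needs four such triples: eight 8-bit loads and four 4-bit loads, twelve transmissions in all.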
  • LOAD SCAN ADDRESS REGISTER LOW BYTE and LOAD SCAN ADDRESS REGISTER HIGH BYTE write commands enable loading in the address counter of the 16-bit dictionary address, where the search is to begin. This is the address of the first word in the dictionary sector where the word on which tokenization is being attempted will be contained, if it is in the dictionary.
  • This address is derived from alpha in the memory map and is loaded in two 8-bit bytes to the address counter 105 in the form of token address counter 16-19.
  • LOAD SCAN ADDRESS WORD COUNTER LOW BYTE and LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE write commands enable loading of the scan limit down counter 108 in the form of dictionary scan down counter 54-56 of FIG. 4.
  • The dictionary scan down counter 54-56 is loaded with the number of grammar bytes in the sector of the dictionary being searched so that the dictionary scan down counter 54-56 can decrement as grammar bytes are compared until all words of that dictionary sector have been compared.
  • SCAN ADDRESS WORD COUNTER LOW BYTE allows loading of eight bits of the count
  • LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE allows loading of the remaining four bits of the count.
  • the counter prevents dictionary reads beyond the proper dictionary sector. For example, if there are five 5-letter words beginning with "w", then the dictionary scan down counter will be loaded with a representation of 10 (five words times two grammar bytes per word).
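The counter arithmetic in the example above can be sketched as follows. The sketch assumes only what the text states: the count is words times grammar bytes per word, the low byte carries eight bits, and the high byte carries the remaining four bits of a 12-bit count. The function name `scan_counter_loads` is hypothetical.

```python
def scan_counter_loads(num_words, grammar_bytes_per_word):
    """Return (count, low_byte, high_nibble) for the scan down counter."""
    count = num_words * grammar_bytes_per_word
    if not 0 <= count < (1 << 12):
        raise ValueError("count must fit in the 12-bit down counter")
    low_byte = count & 0xFF           # LOAD SCAN ADDRESS WORD COUNTER LOW BYTE
    high_nibble = (count >> 8) & 0xF  # LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE
    return count, low_byte, high_nibble


if __name__ == "__main__":
    # Five 5-letter words, two grammar bytes each, as in the text.
    print(scan_counter_loads(5, 2))
```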
  • the LOAD SEGMENT REGISTER write command enables loading of the Segment Register 106 with the segment code.
  • the START SCAN write command is the command from the control microprocessor 100 to the symbolic tokenizer to begin a search of the dictionary. It is issued after the above output command functions have been issued and the appropriate logic device set.
  • the READ commands are address codes utilized in conjunction with an MBRW (mailbus read/write) code from the control microprocessor to enable access to the contents of various logic components in the tokenizer for reading.
  • the TOKENIZER ADDRESS BITS 0-7 and the TOKENIZER ADDRESS BITS 8-14 read commands enable access by the control microprocessor to the address counter to read the address of the dictionary word with which a comparison was found true.
  • the GRAMMAR BUS BITS 16-19 read commands load the contents of the grammar bus 112 into the control microprocessor 100.
  • the grammar bus contains the language word which was represented by the address derived from the token.
  • the token address, shown in FIG. 1 as TKA, is a form of the token.
  • the token may be the actual memory address or may be a derivation of the memory address, as described above.
  • the token address is the address within the dictionary ROM 104 where the word was found.
  • the token address is transmitted through the token address bus 113 and the mailbus 102 to the control microprocessor 100 when the compare logic 103 is true.
  • Detokenization, or the converting of tokens back into language words, is achieved by presenting the token
  • the address counter 105 functions as a simple address register, instead of as a counter.
  • the data found at the addressed ROM 104 location is
  • This data is the language word which was represented by the token on which detokenization is taking place.
  • the TOKENIZER STATUS BITS 4-7 read command requests the symbolic tokenizer to send its status to the control microprocessor 100.
  • the symbolic tokenizer status may be
  • Data Equals (DEQ), Data Not Found (DNF), or Scan In Progress (SIP).
  • Data Equals is represented by the binary code 001.
  • Data Not Found is represented by the binary code 010.
  • Scan In Progress is represented by the binary code 100.
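The three status codes listed above can be decoded on the control microprocessor side with a small lookup. This is a sketch under one assumption: the caller has already isolated the 3-bit code, since the position of those bits within the byte returned by the status read is not given here. The names `STATUS_CODES` and `decode_status` are hypothetical.

```python
# Status codes as stated in the text.
STATUS_CODES = {
    0b001: "DEQ",  # Data Equals
    0b010: "DNF",  # Data Not Found
    0b100: "SIP",  # Scan In Progress
}


def decode_status(code):
    """Map a 3-bit tokenizer status code to its mnemonic."""
    return STATUS_CODES.get(code & 0b111, "UNKNOWN")


if __name__ == "__main__":
    for code in (0b001, 0b010, 0b100):
        print(format(code, "03b"), decode_status(code))
```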
  • dictionary ROMs 1 through 10 store the dictionary of words which may be tokenized.
  • the dictionary of the prototype unit utilizes ten 2764 EPROMs.
  • EPROMs are used in the prototype unit to allow for ease of changing their content as the tokenization of different vocabularies is studied. ROMs will be used in production units because of their low cost.
  • the ROMs are 8-bit devices, each having a capacity of eight kilobytes. Toge-
  • the tokenizer manipulates 20-bit grammar bytes.
  • the grammar bytes are generated by taking five of the ten ROMs at a time and utilizing four of the eight bits from each ROM to obtain a 20-bit grammar byte.
  • the ROMs are accessed using the dictionary address through token address counter 16-19 described below.
  • the least significant bit of the dictionary address causes multiplexing between the least significant and the most significant four bits of each Dictionary ROM 1 through 10.
  • In other words, if the address is even, for example, the first four bits of each ROM in the bank of ROMs will be accessed, whereas if the address is odd, only the last four bits of the eight bits of each ROM will be accessed.
  • the dictionary ROMs 1 through 10 are addressed by the token address bus 113.
  • the data multiplexers 11 through 15 are 74LS257 data multiplexer chips. They switch between the low and high four output lines of dictionary ROM 1-10 and are switched by the least significant bit of the absolute address (TKAD00).
  • a zero on line 00 of the dictionary address bus indicates that the low four bits of each ROM on the dictionary address bus are to be used for data and a 1 on line 00 of the token address bus 113 indicates that the high four bits of each
  • TKAD00 Token address line 00
  • TKAD14 token address line 14
  • the grammar bus 112 connects the data multiplexers 11-15 to the comparator 29-33 of FIG. 3.
  • the absolute address required to access words in the dictionary ROM is 15 bits long, as can be seen from the output of the token address counter 16-19.
  • the 15-bit address with multiplexing allows 32,768 absolute dictionary addresses.
  • the token address counter 16-19 is comprised of four 74LS191 up/down counters configured as an incremental counter.
  • the token address counter 16-19 accepts the starting address of the first word in the dictionary sector to be searched and then increments from that address to the end of the list in that sector at the mailbus clock rate. The end of the list is determined by the starting address and beta.
  • the starting address for the dictionary segment and sector derived from the memory map is parallel loaded from control microprocessor 100 through the external mailbus 115 to the internal mailbus 114 and then into the address counter 16-19.
  • the mailbus 102 of FIG. 1 is divided, first, into the external mailbus (Databus) 115, the mailbus address bus 116, a mailbus input/output (MBIO), mailbus interrupt (MBINT inverse), mailbus clock (MBCLK), and mailbus read/write (MBRW) and, second, into the internal mailbus 114.
  • the external mailbus 115 is the eight data lines which enter the bidirectional data bus transceiver 46 from the interface 101 of FIG. 1.
  • the internal mailbus 114 is the eight data lines between the bidirectional data bus transceiver 46 and the tokenizer output multiplexer 41-45, the register file 34-38, as well as the token address counter 16-19 of FIG. 2, and the dictionary scan down counter 54-56 of FIG. 4.
  • Each of the first three chips of the token address counter 16-18 supplies four of the bits required to form the 15-bit dictionary address, with the last chip of the token address counter 19 supplying the most significant three bits.
  • inverter 20 is part of a 7404 digital inverter chip. Inverter 20 produces the inverse of the mailbus clock (MBCLK) from the mailbus clock (MBCLK) signal.
  • the inverse of the mailbus clock is used by compare latch 39 of FIG. 3, and it is also the beginning of a time-delay chain comprising inverters 21 and 22.
  • Digital inverters 20, 21, and 22 provide a timing delay of the clock signal through sequential propagation.
  • Digital inverters 21 and 22 are part of a 74LS38 NAND gate configured as inverters.
  • This delayed clock signal is used for the clocking of the token address counter 16-19. This delay guarantees that the address counter will not increment prior to the acquisition of dictionary data from the grammar bus 112.
  • the token address counter 16-19 increments on the rising edge of the delayed clock signal.
  • Noninverting buffer 23 is a 7407 noninverting buffer chip. It provides a sequential propagation delay as part of the timing delay circuit which further delays the clock signal to the token address counter 16-19.
  • Dictionary enable AND Gates 25 and 26 are 7408 AND gates. They provide the dictionary enable, which supplies the enable signal to the dictionary ROMs 1-10 so that the ROM address locations may be accessed. Two AND gates are utilized to provide fanout since they must provide a signal to each of the ten dictionary ROMs 1 through 10.
  • AND gate 25 provides an enable signal to six of the dictionary ROMs 1-5 and 6-10, and AND gate 26 provides an enable signal to four.
  • the dictionary ROMs 1-10 are tri-state output devices. When the outputs of 25 and 26 go low, the selected dictionary ROMs 1 through 5, or 6 through 10, appear on the data bus.
  • the dictionary ROMs 1 through 5, or 6 through 10, are selected by the token address (TKAD14, described below), and their output appears on the grammar bus 112.
  • the dictionary enable AND gates 25 and 26 go low when two signals are present at their inputs. Those signals are mailbus clock inverse (MBCLK) and grammar bus lock inverse (GBLOCK).
  • GBLOCK Grammar bus lock
  • Inverter 27 is a part of a 7404 Hex inverter. Inverter 27 selects the group of five dictionary ROMs 1-5, or 6-10, which is to be enabled. The input of the inverter 27 is coupled to the bank of ROMs 1-5, while the output is coupled to the bank of ROMs 6-10. Either dictionary ROMs 1-5, or 6-10, will always be selected. The input to inverter 27 comes from the fifteenth bit (most significant bit) of the token address counter 19 (TKAD14). That is, the fifteenth bit from the token address counter 19 determines which group of five dictionary ROMs 1-5, or 6-10, is to be enabled.
  • TKAD14 fifteenth bit from the token address counter 19
  • Dictionary ROMs 1-5 contain the lower 16 kilobytes of dictionary memory, and dictionary ROMs 6-10 contain the upper 16 kilobytes of dictionary memory.
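The bank selection and nibble multiplexing described above can be modeled in software. The sketch below is illustrative only and rests on assumptions not stated in the text: that address bits 1 through 13 form the address within each 8-kilobyte ROM, and that the first ROM of a bank supplies the lowest four bits of the grammar byte. The names `ROM_SIZE` and `read_grammar_byte` are hypothetical.

```python
ROM_SIZE = 8 * 1024  # each ROM holds eight kilobytes


def read_grammar_byte(roms, address):
    """Assemble a 20-bit grammar byte from ten 8-bit ROM images.

    roms    -- sequence of ten bytes-like objects, each ROM_SIZE long
    address -- 15-bit absolute dictionary address
    """
    if not 0 <= address < (1 << 15):
        raise ValueError("address must fit in 15 bits")
    # TKAD14 selects the bank: ROMs 1-5 (low) or ROMs 6-10 (high).
    bank = roms[5:10] if (address >> 14) & 1 else roms[0:5]
    # Assumption: the middle 13 bits address a byte within each ROM.
    rom_address = (address >> 1) & 0x1FFF
    # TKAD00 selects the low (even address) or high (odd address) four bits.
    use_high = address & 1
    grammar_byte = 0
    for index, rom in enumerate(bank):
        byte = rom[rom_address]
        nibble = (byte >> 4) & 0xF if use_high else byte & 0xF
        # Assumption: ROM order maps to ascending 4-bit fields.
        grammar_byte |= nibble << (4 * index)
    return grammar_byte


if __name__ == "__main__":
    roms = [bytearray(ROM_SIZE) for _ in range(10)]
    for i in range(5):
        roms[i][0] = ((i + 6) << 4) | (i + 1)
    print(hex(read_grammar_byte(roms, 0)), hex(read_grammar_byte(roms, 1)))
```

Under these assumptions, 15 address bits give the 32,768 grammar-byte addresses mentioned in the text, split evenly between the two banks.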
  • NAND Gate 28 of FIG. 3 is for producing GBLOCK inverse and is a part of a 74LS00 Quad NAND gate. It provides the grammar bus lock inverse (GBLOCK) which is used by the dictionary enable AND gates 25 and 26 to force the enable of the dictionary ROMs 1-10 so that the dictionary ROMs 1-10 may be interrogated during the detokenization operation.
  • FIGS. 3 and 4 can be placed side by side, with FIG. 4 on the right, to form one continuous schematic.
  • comparators 29-33 compare the
  • Each grammar byte represents up to four characters of a language word.
  • Register file 34-38 is loaded with the trial word from the microprocessor through the bidirectional data bus transceiver 46 and the external mailbus 115, as de-
  • Each chip of the register file is a group of 4-bit register slices, i.e., four 4-bit registers.
  • the address buffer 58 and the register file select 59 may address any 4-bit register in each chip in order to load data in the form
  • Because the mailbus provides only eight bits of data per cycle, the first four bits of the first eight bits of a trial word are placed in the first 4-bit register of chip 34, and the last four bits are placed in the same 4-bit register of chip 35.
  • the same 4-bit register of chip 36 gets the
  • the same 4-bit register of chip 38 gets the next four bits of the trial word from the control microprocessor over internal mailbus 114.
  • the first 20-bit grammar byte of the trial word has been loaded from the control microprocessor into the first 4-bit registers of register file 34-38.
  • the remaining 20-bit grammar bytes are loaded in a similar manner. Therefore, the trial word will be stored in the form of 20-bit grammar bytes, just as the language words are stored as 20-bit grammar bytes in the dictionary ROM.
  • the remaining 20-bit grammar bytes, up to a total of four grammar bytes or 16 characters totaling 80 bits, are stored in respective 4-bit registers of the register file 34-38.
  • all 20 bits representing four characters in a given grammar byte may be accessed and processed in parallel in a manner similar to that for a given bank of dictionary ROMs.
  • a 20-bit or 4 character comparison can be made in one clock cycle.
  • grammar byte-by-grammar byte from the appropriate bank of the dictionary ROM 1-10, they are compared to respective grammar bytes of the trial word until either a match is found between words, or all of the words in the chosen sector of the dictionary ROM 1-10 are exhausted.
  • the five 74170 chips combined have a maximum capacity of 80 bits, or four grammar bytes, comprising a total of 16 characters.
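The sector scan described above, in which dictionary grammar bytes are compared against the trial word until a match is found or the sector is exhausted, can be sketched in software. This is a behavioral model only, not the circuit. It assumes every word in the sector occupies the same number of grammar bytes as the trial word, and it returns the address of the first grammar byte of the matching word, although the exact address at which the hardware counter stops is not specified here. The function name `scan_sector` is hypothetical.

```python
def scan_sector(dictionary, start_address, grammar_byte_count, trial_word):
    """Scan one dictionary sector for the trial word.

    dictionary         -- sequence of 20-bit grammar bytes indexed by address
    start_address      -- address of the first word in the sector
    grammar_byte_count -- number of grammar bytes in the sector
    trial_word         -- list of one to four 20-bit grammar bytes
    """
    segments = len(trial_word)
    address = start_address
    remaining = grammar_byte_count
    while remaining >= segments:
        # The latch is preset to "true" for each word and cleared by any
        # failed grammar byte; every grammar byte of the word is still read.
        latch = True
        for segment in range(segments):
            if dictionary[address + segment] != trial_word[segment]:
                latch = False
        if latch:
            return "DEQ", address      # Data Equals
        address += segments
        remaining -= segments
    return "DNF", None                 # Data Not Found


if __name__ == "__main__":
    dictionary = [0] * 32
    dictionary[10:16] = [1, 2, 3, 4, 5, 6]   # three two-grammar-byte words
    print(scan_sector(dictionary, 10, 6, [3, 4]))
```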
  • Inter grammar byte compare latch 39 is a 7474 Dual D Flip-Flop.
  • a successful comparison e.g., bit-for-bit equivalence
  • DEQ Data Equals
  • scan termination logic 40 the stop scan signal
  • a start scan flip-flop in scan control logic 51 is reset at pin 1. This causes scan enable inverse (SCANEN) from scan control logic 51 to go high.
  • the inter grammar byte compare latch 39 uses the mailbus clock inverse signal (MBCLK) from inverter 20 for timing.
  • Pins 1, 10, and 14 of inter grammar byte compare latch 39 go to positive 5 volts.
  • Pin 7 is ground.
  • Pin 4 is the output of NAND gate 65.
  • Pin 11 is an input to NAND gate 65.
  • Pin 8 is an input to NAND gate 70.
  • Pin 13 goes to TKRST and to pin 10 on scan termination logic 40.
  • Pin 2 goes to pin 6 of comparator 33.
  • Pins 5 and 12 go to positive 5 volts.
  • Scan termination logic 40 allows the loading of successive 20-bit bytes, even though an earlier byte failed comparison. The appropriate number of bytes must be com-
  • the scan termination logic 40 also serves to reset scan control logic 51 from pin 8 to stop all counters when a match is found. Logic 40 may also provide an interrupt to processor 100 through AND gate 66.
  • Pin 5 tokenizer reset inverse (TKRST) signal.
  • Pin 5 is an input to NAND gate 70.
  • Pin 3 goes to pin 13 of dictionary scan down counter 56.
  • Pin 11 is an output from inverter 62.
  • Pin 8 is an input to AND gate 66 and to pin 1 of scan control logic.
  • Pins 2 and 7 go to ground.
  • Pins 1, 13, and 14 go to positive 5 volts.
  • Pin 6 is the Data Not
  • Pin 12 is the output of NAND gate 70.
  • Tokenizer output multiplexer 41-45 multiplexes token address bus 113, the grammar bus 112, and the tokenizer status data to the microprocessor.
  • the tokenizer status is presented to the control microprocessor 100 from three free pins of the tokenizer output multiplexer 45.
  • the status will be either Data Equals (DEQ), Data Not Found (DNF), or Scan In Progress (SIP), which are found on pins 11, 13, and 15, respectively.
  • DEQ Data Equals
  • DNF Data Not Found
  • SIP Scan In Progress
  • control microprocessor 100 interrogates the symbolic tokenizer and finds the status to be Data Equals
  • control microprocessor 100 will read the address of the appropriate bank of dictionary ROM 1 through 10 where the word was found. This is done by providing the appropriate read command address to the mailbus address line 116. The address is decoded by the address decoder
  • the address can be read from the output of, or
  • the token address counter 16-19. This can be done because the token address counter was stopped when a match was found through compare latch 39 and scan termination logic 40. That address, subtracted by the number of the segment code (the number of address increments), divided by the dictionary segment number and added to an offset by the control microprocessor, then becomes the token for the word.
  • Bidirectional data bus transceiver 46 is an 8-bit tri-state data transceiver to interface the symbolic tokenizer with the external mailbus 115. It is used to direct data flow between the symbolic tokenizer and the control microprocessor 100.
  • Segment Up Counter 47 is a 2-bit binary up counter, and it is reset prior to reading a new language word from the dictionary ROM 1-10 to reflect the proper number of grammar bytes required to compare the entire trial word in the register file 34-38. It progressively selects grammar bytes from the register file 34-38 for comparison until the required number of grammar bytes have been compared with a language word.
  • Segment holding register 48 stores the selected 2-bit segment code, designated above as S1, S2, S3, or S4, generated by the control microprocessor.
  • the segment code is taken from the internal mailbus 114 on lines 120 to pins 2 and 12 of register 48.
  • the segment code determines the number of reads of the given bank of dictionary ROM 1-10 required to compare a given language word to each dictionary word during a search. Each read of the dictionary ROM 1-10 loads a maximum of four characters into the comparators 29-33, so that to compare a 16-character word requires four successive loads.
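The relationship between word length and the number of dictionary reads stated above can be written as a ceiling division by four. This is a small sketch; the function name `reads_per_word` is hypothetical, and the binary values assigned to the 2-bit segment code are not modeled because they are not given in this passage.

```python
def reads_per_word(num_characters):
    """Number of 20-bit dictionary reads needed to compare one word."""
    if not 1 <= num_characters <= 16:
        raise ValueError("words are limited to 1 through 16 characters")
    # Each read supplies at most four characters, so round up.
    return (num_characters + 3) // 4


if __name__ == "__main__":
    for length in (1, 4, 5, 16):
        print(length, reads_per_word(length))
```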
  • Pins 1, 4, 10, 13, and 14 of segment holding register 48 go to positive 5 volts.
  • Pin 2 goes to pin 18 of the bidirectional data bus transceiver 46.
  • Pin 12 goes to pin 17 of the bidirectional data bus transceiver.
  • Pin 7 is ground.
  • Pins 3 and 11 go to pin 10 of address decoder 50.
  • Pin 9 goes to pin 1 of the segment down counter 49, and pin 5 goes to pin 15 of the Segment Down Counter 49.
  • Segment down counter 49 determines when the proper number of grammar bytes have been accessed by the segment up counter 47 for comparing with a given language word.
  • Pin 11 goes to the output of AND gate 67.
  • Pin 13 goes to pin 12 of scan control logic 51 and pin 14 goes to SCANCLK.
  • Address decoder 50 is a 74LS138 decoder for decoding the address and write commands from the control micropro-
  • Address decoder 50 also supplies the scan address
  • Decoder 50 also supplies the scan counter load low
  • Pins 5 and 8 of the address decoder 50 are ground. Pins 6 and 16 are positive 5 volts.
  • Pin 1 goes to pin 3 of address decoder 52 and to pin 14 of address buffer 58.
  • Pin 2 goes to pin 12 of address buffer 58.
  • Pin 3 goes to pin 9 of address buffer 58.
  • Pin 9 goes to pin 4 of scan control logic 51.
  • Pin 15 goes to pin 13 of scan control logic 51 and provides the tokenizer reset code.
  • Pin 10 goes to pins 3 and 11 of segment holding register 48.
  • Pin 4 goes to pin 11 of register file select 59.
  • Pin 11 is the scan counter high byte load strobe (SCANCTH)
  • pin 12 is the scan counter low byte load strobe (SCANCTL) .
  • strobes are used to enable loading of the dictionary scan down counter 54, 55, and 56 when the control microprocessor 100 issues the LOAD SCAN ADDRESS WORD COUNTER LOW BYTE and LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE write commands.
  • Pin 13 is the scan address load strobe high (SCANLDH)
  • pin 14 is the scan address load strobe low (SCANLDL) .
  • SCANLDH scan address load strobe high
  • SCANLDL scan address load strobe low
  • the scan control logic 51 is a 7474 Dual D Flip-Flop.
  • Scan control logic 51 provides the scan enable latch, controls the operation of the segment up counter 47, and provides the Scan In Progress code. Scan control logic also stops all tokenizer operation when logic 51 is reset.
  • When the scan enable inverse of the scan control logic 51 goes high after the reset signal from scan termination logic 40, all address counting in the token address counter 16 through 19 terminates (see SCANEN input to device 16 of FIG. 2).
  • a mailbus interrupt signal (MBINTO) is produced to get the attention of the control microprocessor 100.
  • the tokenizer status can at this time be read by the microprocessor, and the address of the language word in the dictionary ROMs can be read after issuing an appropriate command.
  • Pin 14 of scan control logic 51 is positive 5 volts.
  • Pin 13 goes to pin 15 of address decoder 50.
  • Pin 4 comes from pin 9 of address decoder 50.
  • Pin 8 is an input to NAND gate 65 at pin 13 and to pins 2 and 3 of segment up counter 47.
  • Pin 1 goes to pin 8 of scan termination logic 40.
  • Pins 3 and 7 are ground.
  • Pin 10 is the output of
  • Pin 6 is the scan enable inverse (SCANEN) signal.
  • Pin 9 is an input to pin 9 of AND gate 67.
  • Pin 5 is an input to pin 10 of AND gate 67 and provides the Scan In Progress (SIP) signal.
  • Pin 12 goes to pin 13 of the segment down counter 49.
  • Pin 11 goes to pin 14 of the segment down counter 49 and is coupled to the SCANCLK signal.
  • Address decoder 52 is a 74LS138 Decoder for decoding the address and write commands of the control microprocessor and for providing a strobe to the correct logic component
  • Pin 16 of address decoder 52 goes to positive 5 volts.
  • Pin 6 goes to pin 6 of register file select 59 and pin 6 of address select 53 and to inverter 60.
  • Pin 1 goes to
  • Pin 15 goes to pin 1 of register file select 59 and to pin 18 of address buffer 58.
  • Pin 2 goes to pin 2 of register file select 59 and to pin 16 of buffer 58.
  • Pin 3 goes to pin 1 of address decoder 50, to pin 14 of address buffer 58, and to pins 14 of each register file chip 34-38.
  • Pin 15 goes
  • pin 10 goes to pin 19 of tokenizer output multiplexer 45.
  • Pin 4 goes to pin 3 of register file select 59 and to inverter 69.
  • Pins 5 and 8 are ground.
  • Address select 53 is a 7485 comparator for sensing the condition of the dip switch 57 and for taking the most significant four bits of the nine bits of the mailbus address write or read commands and enabling the correct address decoder.
  • Each symbolic tokenizer may have an address from 1 to 16.
  • the output of address select 53 is used by the symbolic tokenizer to ascertain if that symbolic tokenizer has been addressed by the microprocessor and, if so, to enable transceiver 46, address decoder 52 and register file select 59.
  • the mailbus address lines 116 which allow the control microprocessor 100 to address the symbolic tokenizer through the address select 53 include lines 05, 06, 07, and 08 of the complete mailbus address lines 116.
  • Dictionary scan down-counter 54, 55, and 56 are 74191 counters and are loaded with a representation of beta.
  • Scan down counter decrements with each mailbus clock cycle as words are read from the given banks of ROMs in the dictionary ROM 1-10 for comparison to the trial word in the register file 34-38.
  • the dictionary scan down counter is loaded with a representation of beta from the appropriate entry of the memory map, e.g., the number of grammar bytes needed to read the sector beginning with the starting address read from the appropriate entry in the memory map into the token address counter 16-19.
  • DNF Data Not Found
  • MBINTO mailbus interrupt
  • the mailbus interrupt causes the control microprocessor 100 to interrogate the status of the symbolic tokenizer.
  • the DNF Data Not Found
  • the control microprocessor 100. That is, the word to be tokenized was not found in the appropriate bank of the dictionary ROM 1 through 10.
  • Dip switch 57 is used in conjunction with address select 53 to select the control microprocessor 100 address of the symbolic tokenizer. More than one symbolic tokenizer may reside on the mailbus.
  • Address buffer 58 generates the binary address to supply a 2-bit code to pins 13 and 14 of the register file 34-38. This selects the address of the proper 20-bit byte of the four possible 20-bit bytes for loading into the register file 34-38. Address buffer 58 also selects whether address decoder 50 or register file select 59 is enabled and provides a code for the enabled component for producing an appropriate output. Address buffer is a 74244 chip.
  • Register file select 59 uses the input from address buffer 58 to either select the appropriate register file 34-38 for loading the trial word from the control microprocessor 100 or to enable address decoder 50.
  • the primary function is to select the appropriate registers in the register file 34-38.
  • Pin 16 of register file select 59 goes to positive 5 volts.
  • Pin 10 goes to pin 12 of register files 34 and 35.
  • Pin 9 goes to pin 12 of register files 36 and 37.
  • Pin 7 goes to pin 12 of register file 38.
  • Pins 5 and 8 are ground.
  • Pin 3 is the output of inverter 69 and is coupled to pin 4 of address decoder 52.
  • Pin 4 is the mailbus clock inverse signal (MBCLK) .
  • Pin 1 goes to pin 1 of address decoder 52 and to pin 18 of address buffer 58.
  • Pin 2 goes to pin 2 of address decoder 52 and to pin 16 of address buffer 58.
  • Pin 6 goes to pin 6 of address decoder 52 and pin 6 of address select 53.
  • Pin 11 goes to pin 4 of address decoder 50.
  • Inverter 60 is part of a 74LS04 digital inverter chip. It inverts the enable signal from address select 53 to pin 19 of the tri-state enable of the bidirectional data bus transceiver 46 to provide the correct polarity or direction.
  • Inverter 62 is part of a 74LS04 digital inverter chip. It inverts the SCANCLK signal to SCANCLK inverse for NAND gate 65.
  • Inverter 63 and inverter 64 are part of a 74LS04 digital inverter chip. They form a sequential propagation delay to increase the pulse width of the preset pulse from the NAND gate 65 to the scan control logic 51 and the inter grammar byte compare latch 39.
  • NAND gate 65 is part of a 74LS38 inverting AND gate. It generates the preset pulse to the inverters 63 and 64. The preset pulse presets the compare logic posting flip-flop of the inter grammar byte compare latch 39 which causes pins 5 and 12 in the inter grammar byte compare latch 39 to go high. This causes pin 3 of the comparator 29 to go high, which posts a 1 into the comparators 29-33, indicating that the last compare was true. A failed comparison thereafter results in a 0 on pin 3 of the comparator 29 and on the remaining comparators until the preset pulse again presets the compare latch 39.
  • AND gate 66 is part of a 7408 AND gate chip. When multiple symbolic tokenizers sit across the mailbus 102, AND gate 66 causes an interrupt to the control microprocessor 100 when a scan termination by the compare logic 103 is found true, i.e., when the scan is terminated by scan termination logic 40. This is caused by either a true signal on the line from termination logic 40 or a true signal for mailbus interrupt input inverse (MBINTI) from one of the other tokenizers on the mailbus 102. The appropriate output of AND gate 66 alerts the control microprocessor 100 that a symbolic tokenizer has completed a scan cycle. The scans through the other tokenizers are terminated. Interrogation of the tokenizer status will yield the cause of the termination. MBINTI low indicates that another tokenizer on the bus has already caused an interrupt, and every other symbolic tokenizer must wait for interrogation by the control microprocessor 100 if more than one interrupt request was generated.
  • AND gate 67 is part of a 7408 AND gate chip. It generates the load strobe for the segment counter 49 to cause it to be reloaded from the segment holding register 48.
  • Inverter 69 provides an appropriate signal to the transceiver 46 for input or output and to address decoder 52 to provide a correct output.
  • FIG. 5 shows the interface 101 of FIG. 1 which interfaces the symbolic tokenizer to the control microprocessor 100 of FIG. 1.
  • the prototype symbolic tokenizer utilizes a Commodore 64K computer for its control microprocessor. Any central processing unit can serve as a control microprocessor, if appropriately interfaced to the symbolic tokenizer.
  • a connector 201 attaches the interface 101 to the control microprocessor 100. This connection is made through the I/O expansion jack of the Commodore 64K computer. The pin numbers of the Commodore 64K I/O expansion jack are shown on the connector 201.
  • NAND gate 202 is part of a 7437 inverting AND gate chip. It provides an enable signal to the bidirectional bus data transceiver 204.
  • NAND gate 203 is part of a 7437 inverting AND gate chip. It provides the mailbus I/O signal (MBIO) to the symbolic tokenizer.
  • MBIO mailbus I/O signal
  • the mailbus I/O signal (MBIO) is high when the mailbus 102 is being addressed by the control microprocessor 100.
  • Bidirectional data bus transceiver 204 is a 74LS245. It provides data transfer between the control microprocessor 100 and the symbolic tokenizer.
  • Buffers 205 and 206 are both 7407 buffer chips. They provide address, mailbus clock, and mailbus read/write signal isolation between the control microprocessor 100
  • the six lines exiting buffer 205 are mailbus address lines 116.
  • the uppermost three lines exiting buffer 206 are also mailbus address lines.
  • the lower two lines exiting buffer 206 are the mailbus clock signal (MBCLK) and the mailbus read/write signal
  • MBRW mailbus read/write signal
  • the mailbus interrupt inverse signal (MBINT) supplies an interrupt signal to the control microprocessor 100 when a dictionary scan is complete. This may be omitted if the control microprocessor also does other tasks.
  • the external mailbus 115 is the data path between the control microprocessor 100 and the symbolic tokenizer.
  • the mailbus address lines 116 also provide the addressing signals by which each symbolic tokenizer recognizes that it is the symbolic tokenizer being addressed
  • the tokenizer 102A. The tokenizer condition is static and the tokenizer is available for accepting various read or write commands from the control microprocessor through interface 101. It will be assumed henceforth that all commands to be accepted by the tokenizer 102A are made by the control microprocessor through interface 101 to mailbus 102, i.e., busses 115 and 116.
  • the control microprocessor will provide an address code over the mailbus address line 116, the address bus, and will provide data over the external mailbus 115, the data bus. Additionally, the mailbus input/output, the inverse mailbus interrupt, ground, mailbus clock, and the mailbus read/write codes are provided over pins on the interface 101, as indicated in FIG. 5.
  • the tokenizer accepts a read or write command on mailbus address line 116, as indicated in block 301 of FIG. 8A.
  • the tokenizer accepts a 4-bit code from the four most significant bits of the read or write command and decodes the 4-bit code in address select 53 by comparing the code with the settings of the DIP switch 57. If the 4-bit code corresponds to SSSS, transceiver 46, address decoder 52 and register file select 59 are enabled. If the address code does not correspond to SSSS, the tokenizer returns to 300 at the beginning of block 301 to await another address code.
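The address-select comparison described above can be sketched as a mask and compare on a 9-bit mailbus address. The sketch assumes that the four select bits are address lines 05 through 08 and that the remaining five low-order bits carry the command, which is consistent with the 5-bit command addresses quoted in the text. The names `tokenizer_selected` and `command_bits` are hypothetical.

```python
def tokenizer_selected(mailbus_address, dip_switch):
    """Return True if the upper four address bits match the DIP switch."""
    select_bits = (mailbus_address >> 5) & 0xF   # address lines 05-08
    return select_bits == (dip_switch & 0xF)


def command_bits(mailbus_address):
    """Return the five low-order bits that carry the read/write command."""
    return mailbus_address & 0x1F


if __name__ == "__main__":
    address = (0b0011 << 5) | 0b01001   # tokenizer 0011, command 01001
    print(tokenizer_selected(address, 0b0011), format(command_bits(address), "05b"))
```

A tokenizer whose switch setting does not match simply ignores the command, which is what allows several tokenizers to share one mailbus.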
  • tokenizer is processing the commands loaded from the control microprocessor to provide either a memory address of a word to be tokenized, or a language word based on an address derived from a token in the control microprocessor (detokenization).
  • Number 370 is a flow indicator.
  • a tokenizer reset command is issued by the control microprocessor to reset the scan control logic 51 and the scan termination logic 40.
  • the segment down counter 49 and segment up counter 47 are reset, and a 1 is posted on compare latch 39.
  • the tokenizer is reset by enabling address decoder 50 to output TKRST from pin 15.
  • the scan control logic 51 is reset through pin 13, and the scan termination logic 40 is reset through pins 4 and 10.
  • Compare latch 39 is reset through pin 13 so that a 1 is posted at pins
  • the segment up counter 47 and segment down counter 49 are also reset.
  • the tokenizer status then returns to 300 to await an additional command. Otherwise, the tokenizer operates as described below for a different input command.
  • the register file 34-38 is loaded with the trial word.
  • the trial word is separated by the control microprocessor 100 into 20-bit grammar bytes, each grammar byte being loaded into registers in the register file 34-38 in a sequence of eight bits, eight bits, and four bits.
  • Subsequent grammar bytes are loaded in the same manner. Specifically, if the address is 00001, the first four bits of register files 34 and 35 are enabled through pin 10 of register file select 59 and pins 12 and 14 of buffer 58 so that the first four bits of each of register files 34 and 35 will accept the respective four bits from the internal mailbus 114 through transceiver 46.
  • the first grammar byte is loaded first by loading the first eight bits of the grammar byte through the external mailbus 115. The next eight bits are loaded according to block 308, and the last four bits of the first grammar byte are loaded according to block 310. In any case, the tokenizer is ready for another command by returning to 300.
  • the first four bits of register files 36 and 37 are enabled through pin 9 of register file select 59 and pins 12 and 14 of buffer 58.
  • the first four bits of each of registers 36 and 37 will accept the respective four bits from the eight bits of data coming from the internal mailbus 114. In this manner, eight bits of data can be loaded in parallel to the respective registers in a single clock cycle.
  • the tokenizer is then ready for loading of additional bits of the first grammar byte. In block 308, this is represented by a return to 300.
  • the tokenizer is then ready to accept an additional address code in the form of a read or write command. This is represented in block 310 by a return to 300.
  • the second four bits of register files 36 and 37 are enabled through pin 9 of register file select 59 and pins 12 and 14 of buffer 58.
  • the respective second four bits can then accept respective four bits of the next eight bits in the second grammar byte from internal mailbus 114.
  • the tokenizer is
  • 38 will accept the remaining four bits from the second grammar byte off of internal mailbus 114.
  • three clock cycles are used to load the second grammar byte from internal mailbus 114 into the registers 34-38 of the register file. If the trial word did not require
  • null characters are inserted in the bits not required. If the word requires only two 20-bit grammar bytes, the remaining bits in the register file are not used.
  • control microprocessor will issue a write command having an address of 01001.
  • this address causes the third four bits of register files 34 and 35 to be enabled through pin 10 of register file select 59 and pins 12 and 14 of buffer 58.
  • the third four bits of each register will then accept respective four bits from the first eight bits of the third grammar byte off of the internal mailbus 114.
  • the next eight bits of the third grammar byte are loaded into the register file. Therefore, if the address is 01010, the third four bits of register files 36 and 37 are enabled through pin 9 of register file select 59 and pins 12 and 14 of buffer 58. The third four bits of the two register files will accept respective four bits from the second eight bits of the third grammar byte off of internal mailbus 114.
  • the third four bits of register file 38 are enabled through pin 7 of register file select 59 and pins 12 and 14 of buffer 58.
  • the third four bits will accept the last four bits of the third grammar byte from internal mailbus 114. In this manner, three grammar bytes, or up to twelve characters, are loaded into the register file in nine clock cycles. If an additional grammar byte is required to store the entire trial word, an additional write command is issued by the control microprocessor, as discussed below.
  • the first eight bits of the fourth grammar byte are loaded into the register file. If the address is 01101, the fourth four bits of register files 34 and 35 are enabled through pin 10 of register file select 59 and pins 12 and 14 of buffer 58. The fourth four bits will then accept respective four bits of the first eight bits of the fourth grammar byte off of internal
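Twelve characters in three 20-bit grammar bytes works out to four characters per grammar byte, or five bits per character. A minimal Python sketch of such packing follows. The character code (a=1 through z=26), the use of 0 as the null character, and the function names are all assumptions for illustration; this excerpt does not give the actual encoding.

```python
NULL_CODE = 0  # assumed code for the null padding character


def char_code(ch: str) -> int:
    """Hypothetical 5-bit code: 'a' -> 1 ... 'z' -> 26."""
    code = ord(ch.lower()) - ord("a") + 1
    if not 1 <= code <= 26:
        raise ValueError(f"unsupported character: {ch!r}")
    return code


def pack_word(word: str) -> list[int]:
    """Pack a word into 20-bit grammar bytes, four 5-bit codes each,
    padding the final grammar byte with null codes."""
    codes = [char_code(ch) for ch in word]
    while len(codes) % 4:
        codes.append(NULL_CODE)
    grammar_bytes = []
    for i in range(0, len(codes), 4):
        value = 0
        for code in codes[i:i + 4]:
            value = (value << 5) | code
        grammar_bytes.append(value)
    return grammar_bytes
```

Under these assumptions the five-character word "water" occupies two grammar bytes, the second holding "r" followed by three nulls, which agrees with the two-grammar-byte example used for "water" in the scan description.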
  • the fourth four bits will then accept the last four bits of the fourth grammar byte from internal mailbus 114. After this step, the register file will contain all four grammar bytes required to store the trial word. The tokenizer will then be ready to accept a further write
  • register file select 59 was discussed in blocks 306-328, it is to be understood that all four bits of the address were used to access and load the register file 34-38.
  • pins 16 and 18 of buffer 58 were used to produce the appropriate output signal from pins
  • 7, 9, or 10 of register file select 59.
  • control microprocessor issues a write command for loading into the token address counter 16-19 part of the starting address of the sector to be searched. Specifically, if the address is 00100, the
  • SCANLDL is sent to counters 16 and 17 so that the counters will accept the low eight bits of the starting address from the internal mailbus 114 through transceiver 46.
  • the tokenizer is then ready to accept the high eight bits
  • the remaining eight bits of the starting address are loaded into the counters 18 and 19 over the internal mailbus 114. Specifically, if the command address
  • the address decoder 50 is enabled as in block 330 to strobe token address counter-register 18 and 19.
  • a strobe SCANLDH is provided to counters 18 and 19 from pin 13 of decoder 50 to accept the high bits of the starting address from the internal mailbus 114.
  • the address decoder 50 is enabled as in block 330 to strobe the dictionary scan down counter 54 and 55 to accept the low eight bits of beta from the internal mailbus 114. The upper four bits can then be
  • the dictionary scan down counter 56 then accepts the high four bits of beta from the internal mailbus 114.
  • the segment code is loaded into segment holding register 48 according to a 2-bit code read from
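Because the mailbus carries eight bits at a time, the starting address and beta are each written in two steps. The split can be sketched in Python as follows; the function names are hypothetical, and the widths (a 16-bit starting address, a 12-bit beta) are taken from the low-eight/high-eight and low-eight/high-four loads described above.

```python
def split_start_address(address: int) -> tuple[int, int]:
    """Return (low byte for counters 16-17, high byte for counters 18-19)."""
    if not 0 <= address < (1 << 16):
        raise ValueError("starting address must fit in 16 bits")
    return address & 0xFF, (address >> 8) & 0xFF


def split_beta(beta: int) -> tuple[int, int]:
    """Return (low eight bits for counters 54-55, high four bits for counter 56)."""
    if not 0 <= beta < (1 << 12):
        raise ValueError("beta must fit in 12 bits")
    return beta & 0xFF, (beta >> 8) & 0xF
```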
  • code is sent from pins 9, 12, and 14 of buffer 58 to the decoder 50 to provide a load strobe from pin 15 of decoder 50 to cause the segment holding register 48 to accept the 2-bit segment code.
  • the segment code is taken over lines 120 from the internal mailbus 114 through transceiver 46.
  • the tokenizer is then available during the next clock cycle to accept another command. Otherwise, a different command and address code was provided by the control microprocessor and is processed as provided herein.
  • the start scan command may be issued.
  • the tokenizer is issued a command to start scanning for the word to be tokenized. Since the trial word is already stored in the register file 34-38,
  • tokenizer can step through the language words stored in the appropriate bank of dictionary ROMs 1-10 to compare the words with the trial word. This is done in a
  • a data equals (DEQ) signal is issued, and the scan is terminated. If the end of the sector is reached without a match being indicated, the data not found signal (DNF) is issued, and the tokenizer awaits a further command.
  • DEQ data equals
  • the address decoder 50 is enabled as in block 330 to strobe the scan control logic 51 to set the SCANEN latch for token address counter 16-19.
  • the latch is also provided to the segment down counter
  • the scan control logic 51 also provides a scan in progress (SIP) code to the internal mailbus 114 through multiplexer 45 so that a check by the microprocessor 100 will show that the tokenizer is scan-
  • SIP scan in progress
  • the tokenizer operates as described below without further input from the control microprocessor until the scan is terminated.
  • the scan control logic 51 provides a high signal to pin 10 of AND gate 67, along with the SIP code, and provides a second high signal from pin 9 of scan control logic 51 to pin 9 of AND gate 67 for providing a load strobe to pin 11 of segment down counter 49. This allows loading of the 2-bit segment code from the segment holding register 48.
  • the first grammar byte is compared with the first grammar byte in register file 34-38, as selected by the 2-bit code from segment up counter 47. For example, if the trial word is "water", and the
  • the segment up counter may provide a code 00 to indicate the first four bits of the 4-bit register slice in each register 34-38 to be output to the comparator bank 29-33.
  • a determination is made as to whether or not the dictionary language word grammar byte is identical to the corresponding trial word grammar byte under comparison.
  • the tokenizer reset caused a 1 to be posted at pins 5 and 12 of the compare latch 39 to be input to the comparators 29-33.
  • the comparators are linked by a daisy chain connection so that a 1 posted on the input of comparator 29 is also posted on the input of comparators 30, 31, 32, and 33.
  • the output from pin 6 of comparator 33 also indicates a 1 for input to pin 2 of latch 39.
  • the output of that particular comparator goes to zero, and the output from the subsequent comparators, including the output from pin 6 of comparator 33, goes to zero. Thereafter, a zero is provided to pin 2 of latch 39 and is output from pins 5 and 12 of the compare latch until the compare latch 39 is reset.
  • a sector comprising 5-character words, beginning with "w" and containing five language words, would be stored in five groups of two grammar bytes each so that a total of ten grammar bytes would be required to be read in order to scan the contents of the sector.
  • a data not found code (DNF) is produced at pin 6 of scan termination logic 40 to be output through pin 13 of multiplexer 45;
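The daisy-chained comparison can be modelled in Python as five 4-bit slice comparisons whose results are ANDed along the chain, with the compare latch value as the chain input. This is a behavioural sketch under assumptions, not the circuit itself: the function name is hypothetical, and treating each of comparators 29-33 as handling one 4-bit slice is inferred from the five 4-bit register slices described for register file 34-38.

```python
def compare_grammar_bytes(dictionary_gb: int, trial_gb: int, latch_in: int = 1) -> int:
    """Return the value presented to the compare latch after comparing
    two 20-bit grammar bytes slice by slice.

    latch_in is the current compare latch output (1 after a reset).
    Once any stage sees a mismatch, every later stage outputs 0.
    """
    chain = latch_in
    for stage in range(5):                 # comparators 29 through 33
        shift = 16 - 4 * stage
        dictionary_slice = (dictionary_gb >> shift) & 0xF
        trial_slice = (trial_gb >> shift) & 0xF
        chain &= int(dictionary_slice == trial_slice)
    return chain
```

Feeding the previous latch value back in as `latch_in` reproduces the behaviour in the text: after a mismatch on one grammar byte the result stays at zero for the remaining grammar bytes of that word until the latch is reset.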
  • the countdown of the dictionary scan down counter to zero indicates that all dictionary words in the sector have been scanned and compared.
  • the tokenizer then tests to see whether or not the segment down counter 49 has reached zero, as indicated in block 354. In the example of "water", the tokenizer
  • segment code originally indicated that two grammar bytes must be read from the appropriate bank of the dictionary ROMs 1-10 in order to compare the entire word. Since only one grammar byte had been read, the segment down counter shows a representation of "1", having been decre-
  • the scan control logic 51 is given a signal from the segment down counter 49 to test the compare latch 39 to determine the results of the compare of the two words. Since block 346 produced a mismatch, scan outputs a load strobe from AND gate 67 to reload the segment down counter 49 from the segment holding register 48. The signal to scan control logic 51 also resets the segment up counter 47 through the output from pin 8 of scan control logic 51. The same signal resets the compare latch 39 to post a 1 on pins 5 and 12 of latch 39 and increments the token address counter and decrements the dictionary scan down counter. This indicates that all grammar bytes for a given word have been read, but that subsequent words remain to be compared in the comparators 29-33, since dictionary scan down counter had not reached zero. The compare process then repeats, as discussed above and as described below.
  • segment up counter 47 is incremented as indicated in block 360. This provides a signal to the register files 34-38 to allow parallel access to the appropriate 4-bit registers in each register 34-38. Additionally, the segment down counter is decremented, and the dictionary scan down counter is also decremented. Finally, the token address counter is incremented to allow access to the next succeeding grammar byte in the appropriate bank of dictionary ROMs 1-10. The compare process is then repeated through block 344 for each succeeding grammar byte and subsequent language words on a grammar byte-by-grammar byte basis (since a mismatch was found for the word under comparison).
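The interplay of the four counters during a scan can be summarised in a short Python model. It is a sketch under stated assumptions: the ROM is represented as a list of grammar-byte values, the function and variable names are hypothetical, and returning the address of the first grammar byte of the matching word is an illustrative choice, since the excerpt does not say which counter value is read back after a match.

```python
def scan_sector(rom, start_address, beta, segment_count, trial):
    """Scan `beta` grammar bytes of `rom` starting at `start_address`.

    segment_count: grammar bytes per word (the segment code).
    trial: grammar bytes of the trial word held in the register file.
    Returns ("DEQ", address) on a match or ("DNF", None) otherwise.
    """
    if beta % segment_count != 0:
        raise ValueError("sector length must be a whole number of words")
    token_address = start_address      # token address counter 16-19
    scan_down = beta                   # dictionary scan down counter 54-56
    while scan_down > 0:
        compare_latch = 1              # reset at the start of each word
        word_address = token_address
        for segment_up in range(segment_count):   # segment up counter 47
            if rom[token_address] != trial[segment_up]:
                compare_latch = 0      # stays 0 until the next reset
            # Every grammar byte is still read after a mismatch so that
            # the address stays aligned with word boundaries.
            token_address += 1
            scan_down -= 1
        if compare_latch == 1:
            return "DEQ", word_address
    return "DNF", None
```

For a sector of three two-grammar-byte words, a trial word equal to the second word yields DEQ at that word's address, while a trial that straddles two stored words is never matched because comparison always restarts on a word boundary.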
  • the dictionary ROMs is "water"
  • a comparison of the first grammar byte for the word "water" from the dictionary ROM with the first grammar byte from register files 34-38 would produce a match.
  • the scan limit down counter would be tested as in block 362 to determine whether or
  • block 362 continues to block 364.
  • the tokenizer determines whether or not the segment down counter has reached zero. This step determines whether or not all the grammar bytes for the trial word "water" have been read and compared. In the case where the second grammar byte for "water" has not been compared, the tokenizer increments the segment up counter 47 to designate a new register file location to be accessed
  • the tokenizer also decrements the dictionary scan down counter and increments the token address counter so that the next grammar byte in the dictionary ROMs can be accessed. These steps are carried out in block 360.
  • scan control logic 51 which provides an output signal at pin 8 to pin 11 of compare latch 39. This tests the status of compare latch 39 and finds a high signal at pin 9, indicating DEQ. At the same time, a low signal is output from pin 8 of latch 39 to NAND gate 70, which provides a
  • the tokenizer is then available to accept additional input such as a read command.
  • output signals are provided indicating the status of the tokenizer with respect to the processing conducted by the tokenizer. During that process, an address may be derived which can be used by the control microprocessor to produce a token, as is well
  • control microprocessor provides a read command to access the address for the purpose of
  • the remaining bits of the token address for the matched trial word are read from the token address counter 16-19. Specifically, if the read address is xx001 and MBRW is high, the decoder 52 is used to enable reading of the most significant seven bits of the token address, and the tokenizer returns to 300 to await a further command.
  • a token can be derived as discussed above.
  • the language word can be output to the control microprocessor on a grammar byte-by-grammar byte basis.
  • an address xx010 and a high signal for MBRW causes GBLOCK is low.
  • the decoder 52 is used to enable reading of the first eight bits of the grammar byte of the language word contained in the appropriate bank of ROMs 1-10.
  • the decoder 52 is used to enable reading of the second eight bits of the grammar byte of the language word contained in the given bank of ROMs 1-10.
  • an address xx100 and a high for MBRW allows decoder 52 to enable reading of the last four bits of the grammar byte of the language word. If the particular word corresponding to the token requires more than one grammar byte for storage of the word, the remaining grammar byte or grammar bytes can be accessed in the same manner.
  • the final read command the control microprocessor can produce is the address xx101. If MBRW is high, the decoder 52 enables reading of the tokenizer status bits from the pins of multiplexer 45. The tokenizer then returns to 300 to await a subsequent command and also returns to 300 if the tokenizer fails to recognize any of the commands output by the control microprocessor.
  • detokenization is initiated by the control microprocessor first issuing a tokenizer reset command for resetting the tokenizer as indicated in block 402.
  • the control microprocessor then utilizes an appropriate routine to derive a segment code and address from the particular token for which a natural language word is desired from ROMs 1-10.
  • the segment code and address are used as data and placed on the data bus 115 when the appropriate address code is produced at the address bus.
  • block 406 indicates that the control microprocessor issues a write command ssss00100 for loading the low byte of the scan address register.
  • Block 408 indicates the control microprocessor issues the address ssss01000 to load the high byte of the scan address register.
  • This write command is also decoded in the decoders 52 and 50.
  • the most significant bits of the address are then loaded in the token address counter 16-19. Reading of the language word from the appropriate bank of ROMs 1-10 is accomplished on a grammar byte-by-grammar byte basis wherein representations of each grammar byte of the language word are transmitted in a sequence of eight bits, eight bits, and four bits over three clock cycles.
  • control microprocessor issues a MBRW code and an address ssssxx010 in order to read the first eight bits of the appropriate grammar byte from grammar bus 112. The next eight bits and the last four bits of the particular grammar byte are read sequentially after the control microprocessor issues the read commands ssssxx011 and ssssxx100.
  • additional grammar bytes may be required to be read if the segment code derived by the control microprocessor from the particular token indicates that more than one grammar byte must be read to obtain the entire language word. If the segment code is a representation of "S2" the above steps in blocks 406, 408 and 410 are repeated once by incrementing the address by one and reloading the incremented address into the token address counter. Similar comments apply if the segment code is a representation of "S3" or "S4". The control microprocessor then waits for the next read or write command.
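The detokenization read sequence can be sketched in Python as a loop over the grammar bytes of one word, each returned as the three reads of eight, eight, and four bits. The ROM is modelled as a list of 20-bit values; the function name and the most-significant-first ordering of the three reads are assumptions for illustration.

```python
def read_language_word(rom, address, segment_count):
    """Return, for each grammar byte of the word at `address`, the three
    values the control microprocessor would read (8, 8 and 4 bits).

    segment_count is 1 to 4, corresponding to segment codes S1 to S4.
    """
    if not 1 <= segment_count <= 4:
        raise ValueError("segment count must be between 1 and 4")
    reads = []
    for offset in range(segment_count):
        # The address is incremented by one and reloaded for each
        # additional grammar byte.
        grammar_byte = rom[address + offset]
        reads.append((
            (grammar_byte >> 12) & 0xFF,   # read with address xx010
            (grammar_byte >> 4) & 0xFF,    # read with address xx011
            grammar_byte & 0xF,            # read with address xx100
        ))
    return reads
```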
  • natural language words can be transformed into tokens according to addresses of common words contained in a dictionary.
  • the tokens derived in the tokenization process can be used to store files in a compressed form or to provide textual transmission over a substantially decreased amount
  • the token is manipulated to derive an address and a segment code for the language word corresponding to the address.
  • the token may be processed to provide the language word so that the tokenized text can be displayed
  • a further application of the method and means described herein is for tokenizing strings of words such as phrases. Therefore, not only can common words be represented by tokens, but common phrases including common words which are already stored in the dictionary ROM can also be represented by a single token. This application provides for even greater reduction in required storage space or transmission time over regular tokenization.
  • the same apparatus as described above can be used for
  • the dictionary ROM 104 contains additional lists comprising a keytoken list and a phrase table.
  • the keytoken list is preferably comprised of 16-bit tokens stored as a 20-bit grammar byte in a given bank of
  • a keytoken is taken to be the starting word of any common phrase which has been represented by a token in the dictionary ROMs 1-10.
  • the phrase is then represented by a single 16-bit token in ROMs 1-10. If the common phrase has four words in it, those four words in the defined sequence are represented by a single 16-bit token. The same is true of three word phrases and two word phrases.
  • the keytoken list is arranged according to the alphabetic starting letter of the word which is a potential keytoken and the relative address at which the keytoken is found for the given starting letter. For example, all of the possible keytokens having a starting letter of "a" are listed horizontally on the first row of Fig. 10. Each keytoken entry is in the form of a 16-bit keytoken byte stored in a portion of the dictionary ROMs 1-10. All the K a entries in Fig. 10 represent predefined keytokens representing words having an alphabetic beginning letter of "a".
  • a portion of the dictionary ROMs 1-10 also includes a phrase table such as the phrase table schematically depicted in Fig. 11.
  • This portion of the dictionary ROM includes a two dimensional array of token triplets, S1-3.
  • Each triplet has a phrase table address and a corresponding keytoken address.
  • the keytoken address 0 from Fig. 10 corresponds to the keytoken address 0 in the phrase table in Fig. 11.
  • Each entry or triplet comprises three tokens, the first token being the token for the first word after the keytoken word in the phrase, the second token representing the second word in the phrase after the keytoken word, and the third token representing the third word in the phrase after the keytoken word.
  • Each triplet in a given column under the keytoken address are tokens representing potential second, third, and fourth
  • Each entry in the phrase table comprises a triplet of tokens representing at most three words in a phrase.
  • Each triplet is entered in the phrase table according to a starting
  • phrase table address corresponding to the matching triplet is used by the control microprocessor to produce a token for the given
  • control microprocessor includes a memory map in the form of a phrase table memory map such as that
  • the memory map includes two entries for each of a plurality of keytoken addresses.
  • the keytoken address is that derived from the address of Fig. 10.
  • the phrase table memory map is used to derive a starting address for the phrase table of
  • a keytoken word which is the starting word for a four word phrase and which has a beginning letter of "a" may match the keytoken K a n.
  • the control microprocessor then reads the address, namely 1, and accesses the phrase table memory map to determine the starting address and the total number of grammar bytes so that the trial phrase, in the form of tokens, can be compared to the stored tokens for the given starting address. For example, the address derived from the keytoken list of Fig. 10 corresponds to a starting address for the phrase triplets of 101 and includes a total of 15 grammar bytes. This information is used to load the token address counter 16-19 with the starting address and to load the dictionary scan down counter 54-56 with a representation of the total number of grammar bytes, 15.
  • the segment holding register 48 is loaded with a representation of 3 since the triplets in the phrase table in the dictionary ROM comprise three tokens.
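The lookup that prepares a phrase-table scan can be sketched in Python. Only the entry for keytoken address 1 (starting address 101, 15 grammar bytes) comes from the text; the dictionary representation, the function name, and the field names are hypothetical.

```python
# Phrase table memory map: keytoken address -> (start address, grammar bytes).
# Only the entry for keytoken address 1 is given in the text.
PHRASE_TABLE_MEMORY_MAP = {
    1: (101, 15),
}

TOKENS_PER_TRIPLET = 3  # each phrase table entry is a triplet of tokens


def phrase_scan_setup(keytoken_address: int) -> dict:
    """Return the values to load into the tokenizer before a phrase scan."""
    start_address, total_grammar_bytes = PHRASE_TABLE_MEMORY_MAP[keytoken_address]
    return {
        "token_address_counter": start_address,
        "dictionary_scan_down_counter": total_grammar_bytes,
        "segment_holding_register": TOKENS_PER_TRIPLET,
    }
```

With one grammar byte per token, the 15 grammar bytes of this example correspond to five triplets to be compared.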
  • phrase tokenization includes the additional step of determining whether or not a token can be substituted for an entire phrase of up to four language words. As indicated in Fig. 13, a language word is retrieved from the text contained in the control processor and tokenized using the tokenizer 102a.
  • the token is then used as a potential keytoken, namely the first word in a potential phrase to be substituted by a token.
  • the control microprocessor accesses a memory map for keytokens, such as that depicted schematically in Fig. 13, using the starting letter of the original text word to determine a starting address and a keytoken segment length for searching in a keytoken list.
  • the start address is the address in dictionary ROMs 1-10 at which the search for the keytoken is to begin.
  • the keytoken list contains the first
  • the keytoken segment length is the number of grammar bytes required to be searched and compared for all keytokens of a given starting character. This step is similar to tokenization as discussed above.
  • a search is conducted of the keytoken list in ROMs 1-10 for a keytoken match using the start address obtained from the memory map and the keytoken segment length. Specifically, the start address is loaded into token address counter 16-19 with an appropriate write
  • the dictionary scan down counter 54-56 is loaded with a representation of the keytoken segment length to serve the same function as the term beta used in conjunction with tokenization. The dictionary scan down counter is loaded using an appropriate address code to the mailbus address line 116.
  • the segment holding register 48 is loaded with a representation of 1 since all keytokens are one grammar byte in length.
  • the segment holding register is loaded using an appropriate
  • If a data not found command is issued, the control microprocessor returns to retrieve the next subsequent language word for repeating the above-described process. If a match is found, the word from the text is a keytoken and is the first word in a potential phrase in the text. Therefore, it must be determined whether or not the word in the text is followed by at most three words which are identical to a phrase in the phrase table. To this end, the control microprocessor reads the address from the keytoken list at which a match was found between the trial
  • the keytoken address is used to search the phrase table memory map for the start address of the potential phrase and the segment length in the phrase table for those potential phrases having
  • phrase table memory map is depicted schematically in Fig. 12.
  • keytoken address from the keytoken list, for example address 1, the starting address of the three potential words or triplets in the phrase table and
  • the total number of grammar bytes needed to search through all the triplets having the keytoken as the first word can be found. For example, where the keytoken address is 1, the start address of the phrase triplets in the phrase table is 101. The total number of grammar bytes needed
  • control microprocessor tokenizes the three words coming immediately after the trial keytoken obtained from the text in block 502. The three tokens
  • the start address derived from the phrase table memory map, 101, and the segment length, 15, are loaded in the tokenizer. Specifically, a representation of 101 is loaded into the token address counter
  • a representation of 15 is loaded into the dictionary scan down counter 54-56 with an appropriate write address command.
  • the segment holding register 48 is loaded with a 2-bit code representing a segment code S3 using an appropriate
  • the segment code is S3 since there are three tokens or three grammar bytes to be compared in comparators 29-33. It should be noted that the three trial tokens representing the last three words in the potential phrase are loaded into the register file 34-38
  • a start scan command is then issued by the control microprocessor to begin comparing the three tokens in the register file 34-38 with the token triplets in the phrase table starting at starting address 101. Specifically, the first token in the register file is compared with the token represented by S1, the second token is compared with the token represented by S2 and the third token is compared with the token
  • the control microprocessor computes a memory variable to determine whether or not the tokenizer compared three trial tokens with three tokens in the phrase table. If a comparison was made of three tokens, and a match was not found, block 520 is entered and the control microprocessor issues a write command to the tokenizer 102a to null the bits representing the last token in the trial tokens being compared. This is done by placing null characters in the last grammar byte loaded into register file 34-38.
  • the search is conducted again through the phrase table in the dictionary ROM 1-10 using the representation of 101 as the starting address.
  • the tokenizer compares only the first two tokens in the register file 34-38 with the respective grammar bytes represented by S1 and S2.
  • the tokenizer determines whether or not there is a match according to block 516 as described above. If a match is found, the phrase table address is read by the control microprocessor to be used in creating a token. It should be noted in this case that the address and therefore the resulting token represents only three language words; namely the keytoken and the language words represented by S1 and S2.
  • the control microprocessor senses the data not found code (DNF) .
  • the control microprocessor then tests its memory variable to determine whether or not the memory variable is greater than 1. If the memory variable is greater than 1, this indicates that in addition to the keytoken, there is at least one token in the phrase table that can still combine with the keytoken to constitute a phrase. The process is then repeated until a match is not found, and the bits storing the token corresponding to S2 and S3 have already been loaded with null characters. In such a case, the text word for which a keytoken was found will merely be represented by the 16-bit token.
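The shrinking search over three, then two, then one following word can be sketched in Python. Several points are assumptions made for the example rather than statements from the text: shorter phrases are taken to be stored in the table with null tokens in the unused positions, the null token is taken to be 0, the phrase table address of a triplet is taken to be the starting address plus three grammar bytes per preceding triplet, and all names are hypothetical.

```python
NULL_TOKEN = 0  # assumed value of a nulled token


def match_phrase(triplets, start_address, following_tokens):
    """Try to match the tokens that follow a keytoken against a list of
    phrase table triplets, longest phrase first.

    Returns (phrase_table_address, words_matched_after_keytoken), or None
    when only the keytoken's own word token can be used.
    """
    trial = (list(following_tokens[:3]) + [NULL_TOKEN] * 3)[:3]
    for remaining in (3, 2, 1):            # the "memory variable"
        # Null the trailing tokens that are no longer being considered.
        trial[remaining:] = [NULL_TOKEN] * (3 - remaining)
        for index, triplet in enumerate(triplets):
            if list(triplet) == trial:     # DEQ
                return start_address + 3 * index, remaining
        # DNF for this length: drop the last token and search again.
    return None
```

In this model a text whose three following words match no stored four-word phrase can still be reduced to a three-word or two-word phrase token before falling back to the plain 16-bit word token.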
  • an entire textual block can be represented by tokens requiring significantly less storage space or processing time than the ASCII or EBCDIC equivalent.
  • Many of the phrases in the textual passage will be capable of being represented by a single token, the remaining words being represented by their own tokens.
  • the tokenizer can conduct searches or scans very efficiently because a complete operation, in terms of a comparison with a match or rejection, can be carried out in at least one system clock cycle and in, at most, four clock cycles. This is in contrast to other processors which require many clock cycles to complete a given task. Additionally, the disclosed embodiments provide for sufficient flexibility to allow fewer than four grammar bytes to be processed in a given search, if the trial word contains fewer than thirteen characters.
  • the disclosed method and apparatus also enhances searching efficiency by limiting the search or scan range only to that small portion of the dictionary ROM where the word will appear, if at all.
  • the scan limit down counter stops the search and compare process after there is no chance of finding a match. Also, all grammar bytes of a word are compared, even though a match is no longer possible, in order to properly increment the address counter for the next compare cycle.
  • Commodore 6510 Assembly Language listing of the code required to cause the control microprocessor 100 (a Commodore 64K) to access the symbolic tokenizer when the above Basic program is run.
  • the Commodore 64K computer uses a 6510 central processing unit which has a similar assembly language instruction set to the more popular 6502 central processing unit.
  • FFFF is the starting address of a sector. It is the sector containing words starting with "A" and being two characters in length.
  • .WORD 13 which follows .WORD $FFFF, 13 is the number of grammar bytes in that sector.
  • 11 is the starting address of another sector. This sector contains words starting with "B" and which are also two characters in length.


Abstract

Microprocessor capable of generating tokens at high speed. The tokens are binary symbols that represent words and phrases. The tokens can be manipulated, stored, or transmitted in place of the words or phrases themselves, which achieves considerably increased efficiency owing to the statistically reduced number of bits required to represent the individual words of the language.
PCT/US1985/002223 1984-11-08 1985-11-08 Systeme d'identification symbolique de mots et de phrases WO1986003039A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US66956284A 1984-11-08 1984-11-08
US669,562 1984-11-08

Publications (1)

Publication Number Publication Date
WO1986003039A1 true WO1986003039A1 (fr) 1986-05-22

Family

ID=24686822

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1985/002223 WO1986003039A1 (fr) 1984-11-08 1985-11-08 Systeme d'identification symbolique de mots et de phrases

Country Status (3)

Country Link
EP (1) EP0201564A1 (fr)
AU (1) AU5091485A (fr)
WO (1) WO1986003039A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3309677A (en) * 1964-01-02 1967-03-14 Bunker Ramo Automatic information indexing
EP0012777A1 (fr) * 1971-08-31 1980-07-09 SYSTRAN INSTITUT Ges.für Forschung und Entwicklung maschineller Sprachübersetzungssysteme mbH Method using a programmed digital computer for the translation of natural languages
EP0079465A2 (fr) * 1981-11-13 1983-05-25 International Business Machines Corporation Method for storing and accessing a relational database
WO1985001814A1 (fr) * 1983-10-19 1985-04-25 Text Sciences Corporation Data compression method and apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5754847A (en) * 1987-05-26 1998-05-19 Xerox Corporation Word/number and number/word mapping
US6233580B1 (en) * 1987-05-26 2001-05-15 Xerox Corporation Word/number and number/word mapping

Also Published As

Publication number Publication date
EP0201564A1 (fr) 1986-11-20
AU5091485A (en) 1986-06-03

Similar Documents

Publication Publication Date Title
US4314356A (en) High-speed term searcher
EP0083393B1 (fr) Method for compressing information and an apparatus for compressing English text
US4342085A (en) Stem processing for data reduction in a dictionary storage file
US6069573A (en) Match and match address signal prioritization in a content addressable memory encoder
US3995254A (en) Digital reference matrix for word verification
US4503514A (en) Compact high speed hashed array for dictionary storage and lookup
GB1247061A (en) Improvements relating to electrical circuit testing systems
WO1985001814A1 (fr) Data compression method and apparatus
US4028677A (en) Digital reference hyphenation matrix apparatus for automatically forming hyphenated words
JPH02115973A (ja) Symbol string collation device and its control method
US3008127A (en) Information handling apparatus
US4188669A (en) Decoder for variable-length codes
WO1986003039A1 (fr) System for the symbolic identification of words and phrases
JPS60105039A (ja) Character string collation method
GB1070423A (en) Improvements in or relating to variable word length data processing apparatus
US3577142A (en) Code translation system
US3271743A (en) Analytic bounds detector
US3204221A (en) Character comparators
EP0224267A2 (fr) Data processing device
US4435781A (en) Memory-based parallel data output controller
US3993980A (en) System for hard wiring information into integrated circuit elements
CA1255809A (fr) Symbolic abbreviator for words and phrases
KR860003555A (ko) Bitstream construction device for a disk controller
WO1991004527A1 (fr) Search method and circuit
JPS5814710B2 (ja) Pattern classification device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BR DK FI HU JP KP KR NO RO SU US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE

WWE WIPO information: entry into national phase

Ref document number: 1985905725

Country of ref document: EP

WWP WIPO information: published in national office

Ref document number: 1985905725

Country of ref document: EP

WWW WIPO information: withdrawn in national office

Ref document number: 1985905725

Country of ref document: EP