WO1985001600A1 - String comparator device system, circuit and method - Google Patents

String comparator device system, circuit and method

Info

Publication number
WO1985001600A1
WO1985001600A1 PCT/US1983/001540 US8301540W WO8501600A1 WO 1985001600 A1 WO1985001600 A1 WO 1985001600A1 US 8301540 W US8301540 W US 8301540W WO 8501600 A1 WO8501600 A1 WO 8501600A1
Authority
WO
WIPO (PCT)
Prior art keywords
output
indicia
string
input
word
Prior art date
Application number
PCT/US1983/001540
Other languages
French (fr)
Inventor
Peter Yianilos
Samuel R. Buss
Original Assignee
Proximity Devices Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Proximity Devices Corporation
Priority to EP19830903352 (EP0157768A4)
Priority to PCT/US1983/001540 (WO1985001600A1)
Publication of WO1985001600A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • An associative memory is a special kind of storage device. Whereas most memories are numerically addressed, associative memories are addressed via their contents. For example, one might ask an associative memory to return all records containing the letters "ZXU" in columns one, two, and three respectively.
  • the address applied to an associative memory is the query. If some record exactly satisfies the query, then the query is said to be exact; otherwise it is said to be inexact.
  • Conventional associative memories provide no information in response to an inexact query. Inexact queries are merely rejected. If a record exists within the associative memory that is only slightly different from the supplied query, then in many cases it would be desirable to know of its existence. Such records are minor corruptions of the query. Alternatively, they are very similar to the query.
  • a storage loop is formed when a storage device repeatedly and sequentially transmits its entire contents to external devices over a high speed data bus.
  • An associative memory may be formed by attaching to this bus a device whose function it is to scrutinize in a passive manner the data stream originating from the storage device as it passes by on the bus. This attached device senses data appearing on the bus that is related in some predetermined fashion to a supplied query.
  • the system circuit, a computer peripheral device, includes an indicia string comparator device that compares strings of indicia or characters at high speeds. This system circuit performs an approximate string comparison operation.
  • the system circuit computes a measure of string similarity.
  • the system circuit is accessed by a computer in the same manner as a Random Access Memory.
  • Query strings and record strings are written into the system's memory, the approximate similarity measures are automatically computed, and the best matches are recorded in a memory inside the system circuit. These best matches may then be accessed by the host computer.
  • the system circuit would be used to search a large lexicon or database of record strings for the entries which are most similar to a query string.
  • the query string is entered into the system circuit and then each lexicon or database record string is entered. At the end, the pointers to the record strings most similar to the query string are automatically available in a list in memory in the system circuit.
  • the measure of string similarity is an extremely general function which can be tailored or adapted under software control for a specific application by the setting of parameters.
  • the computation time is proportional to the average length of a string of indicia.
  • the string comparison device is a means described in Appendix 1 which is made a part of this application.
  • First, the string comparator means is described in "the simplest string comparison function" in pages 1 through 6 of Appendix 1; second, it is described in "string comparison function with variable character weights" in pages 7 through 10 of Appendix 1; third, it is described in "string comparison function with unmatched character compensation and variable character weight" in pages 11 through 14 of Appendix 1; fourth, it is described in "string comparison function with directional biasing and variable character weights" in pages 15 through 18 of Appendix 1; and fifth, it is described in "the full string comparison function with variable character weights, unmatched character compensation and directional biasing" in pages 19 through 22 of Appendix 1.
  • the system circuit is an electronic device which includes a string comparator means.
  • the electronic circuit performs the calculation of the string similarity function.
  • the string comparator which is an electronic circuit, and an I/O controller and ranker means together comprise an electronic device which is referred to as the String Comparator means.
  • the system circuit is connected to a computer system by means of an interface circuit to utilize the string comparator means.
  • the string comparator means includes Bus Control means, Master Control means, Ranker means, Divider means, and a String Similarity Computer.
  • the String Comparator means is used like a random access memory.
  • Bus Control means uses well-known techniques to control internal bus lines and the external signals. Master Control implements DMA operation and DMA editing in the manner described in detail hereinbelow.
  • the Ranker means maintains a ranked list of best comparison results.
  • the Divider means computes the ratio of two binary numbers from the String Similarity Computer.
  • the String Similarity Computer computes the full string similarity function defined in Appendix 1.
  • the Master Control means, the Bus Control means, the Ranker means, the Divider means and the String Similarity Computer may be logic circuits and may be embodied with well-known electronic components.
  • the String Similarity Computer is comprised of the String Control means, Parameter Generation means, Core section, CA section, CB section, R section, M section, TOTR section, and TOTM section.
  • the purpose of the String Control means is to control and coordinate the activities of the rest of the string similarity computer.
  • the String Control means is connected as shown in the drawings and has internal storage registers and memory.
  • the Parameter Generation means is used to obtain the indicia or character weight and compensation values.
  • the indicium or character itself is received from the String Control means or circuit.
  • the character weight and compensation values are used by the rest of the string similarity computer.
  • the Parameter Generation means is connected as shown in the drawings.
  • the Core section is the decision-making part of the String Similarity Computer.
  • the Core section identifies common portions of the query string and the record string, and is connected as shown in the drawings.
  • the CA section is used to compute the total compensation value for the indicia or characters in the string A.
  • the CA section is connected as shown in the drawings.
  • the CB section is used to compute the total compensation value for the indicia or characters in the string B.
  • the CB section is shown in the drawings.
  • the R section computes an intermediate subtotal value. The values computed by the R section correspond precisely to the R function described in the description of the string similarity function.
  • the R section is connected as shown in the drawings.
  • the M section computes the numerator of the ratio defining the string similarity function.
  • the M section is connected as shown in the drawings.
  • the TOTR section computes an intermediate subtotal value.
  • the values computed by the TOTR section correspond precisely to the values of the variable TOTR used in the C programs.
  • the TOTR section is connected as shown in the drawings.
  • the TOTM section computes the denominator of the ratio defining the string similarity function.
  • the TOTM section is connected as shown in the drawings.
  • the original disclosure of the invention related to a new and improved word comparator device, referred to hereinafter as a word associator circuit, that provides a numeric measure of the degree of word similarity between the compared words as defined by a mathematical formula.
  • the original disclosure also related to an associative retrieval system and method for retrieval of inexact queries in a quick and expeditious manner.
  • the circuit may be an electrical digital circuit or other
  • An associative memory is normally thought of as a device which responds only to exact queries. Some may respond in a limited manner to inexact queries, for example see "Backend Processors is REM the Answer" in Datamation, March 1978, pg. 206-207.
  • the system and method of the original disclosure uses the improved word comparator device to form an associative memory which responds to inexact queries in a new and useful way. This word comparator device is used to rapidly locate records most similar to a query. Similarity is defined by a certain mathematical formula and is measured using a high speed digital circuit referred to hereafter as the Word Associator Circuit in the preferred embodiment.
  • a word associator circuit will also be referred to as a string similarity computer.
  • Appendix 1 also describes four improvements to the original string comparison function. These improvements provide an extremely flexible string comparison function which can be adapted for a given application by the setting of parameters. Any of these improvements to the word associator circuit may be used in an associative memory in the same manner as the original word associator circuit.
  • Words are strings of symbols from some alphabet.
  • a symbol can be any character or indicium from a finite or infinite set or alphabet.
  • Words are also referred to as character strings, symbol strings, indicia strings, or merely strings.
  • Word similarity, or indicia string similarity, is a measurement of the similarity between two words or indicia strings.
  • variable parameters are involved in the definition of word similarity, making it very flexible. Of significance is the ease with which the degree of word similarity may be determined using digital circuits. Using existing technology, word associator circuits may be built to process serial data at rates in excess of 20,000,000 characters per second. Such a circuit is described hereinbelow.
  • the word similarity of two indicia strings may be expressed in three different forms: absolute, ratio, and fractional.
  • Absolute word similarity provides a single non-negative number M which is a measure of the total weight of the common portions of the indicia strings.
  • the common portions are symbols or indicia occurring in both of the indicia strings, the total weight of which depends on the individual indicia weights and the relative positions of the common portions.
  • Absolute word similarity is most useful when the lengths or total weights of the indicia strings are always equal or nearly equal.
  • Ratio word similarity provides the absolute word similarity value M and the total word weight TOTM. The total word weight is a measure of the total weights of the indicia comprising the indicia strings and of the length of the indicia strings.
  • Fractional word similarity provides a number between 0 and 1 which is equal to the ratio of the absolute word similarity value M and the total word weight TOTM.
  • Absolute word similarity and ratio word similarity are most useful in applications which search for a word similarity value exceeding a certain threshold.
  • a possible application would involve monitoring a heartbeat by continuously encoding recorded heartbeats as a string of indicia. This indicia string would then be compared to one or more indicia strings which denote abnormal heartbeats. If the similarity of the two indicia strings were above a certain threshold, an alarm could be sounded.
  • Fractional word similarity is useful in applications such as the associative memory circuit, where a large number of indicia strings are compared with a word associator circuit and a ranked list of the most similar indicia strings is maintained. The fractional representation of word similarity provides a convenient way to compare word similarity values.
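The three forms of word similarity described above can be sketched in C as follows. This is only an illustration under stated assumptions: the type `similarity_t`, the function names, the 32-bit field widths, and the choice to return 0 when TOTM is zero are all hypothetical and do not come from the disclosure.

```c
#include <stdint.h>

/* Hypothetical container for the two raw results of one comparison. */
typedef struct {
    uint32_t m;     /* absolute word similarity M (non-negative) */
    uint32_t totm;  /* total word weight TOTM */
} similarity_t;

/* Fractional form: the ratio M / TOTM, a number between 0 and 1.
 * Returning 0 for TOTM == 0 is an assumption made for this sketch. */
double fractional_similarity(similarity_t s) {
    if (s.totm == 0) return 0.0;
    return (double)s.m / (double)s.totm;
}

/* Absolute form used as a threshold test, as in the heartbeat example. */
int exceeds_threshold(similarity_t s, uint32_t threshold) {
    return s.m > threshold;
}

/* Ratio form: decide whether a is more similar than b without dividing,
 * by cross-multiplying the two (M, TOTM) pairs. */
int more_similar(similarity_t a, similarity_t b) {
    return (uint64_t)a.m * b.totm > (uint64_t)b.m * a.totm;
}
```

Cross-multiplication is one way to rank ratio results without a divider; the circuit described later uses a Divider means instead.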
  • the system includes the use of one or more associator circuits in storage loops to locate and extract records most similar to the supplied query.
  • the system's architecture is preferably totally parallel so that it may be configured to process any number of simultaneous requests. Its memory is partitioned into storage loops so that through suitable configuration, arbitrarily large datasets may be processed. Attached to each storage loop via a dynamically variable network are multiple associator circuits to locate and extract the records most similar to the supplied query.
  • configurations may achieve a wide range of cost/response-time possibilities.
  • the associative retrieval system with associator circuits could offload from a host processor the task of searching for database records. In doing this, it could provide improved services as well as entirely new services.
  • a further object of this invention is to provide a method of processing data for indicia string comparison that conforms to a particular algorithm.
  • An additional object of this invention is to provide a system and method of retrieving similar indicia strings, words, numbers, and/or masks from inexact queries.
  • Figure 1 is the improved block diagram of the system circuit interfaced with a computer system.
  • Figure 1A is a block diagram of the system circuit interfaced to and monitoring a data stream.
  • Figure 2 is the improved block diagram of the system circuit including the string comparator means.
  • Figure 3 is an illustration of the improved timing.
  • Figure 4 is a block diagram of the String Control in Figure 2.
  • Figure 5 is an illustration of components of the String Control of Figure 4.
  • Figure 6 is a block diagram of the Parameter Generation shown in Figure 2.
  • Figure 7 is a block diagram of the Core Section shown in Figure 2.
  • Figure 8 is a block diagram of the CA section shown in Figure 2.
  • Figure 9 is a block diagram of the CB section shown in Figure 2.
  • Figure 10 is a block diagram of the R section shown in Figure 2.
  • Figure 11 is a block diagram of the M section shown in Figure 2.
  • Figure 12 is a block diagram of the TOTR section shown in Figure 2.
  • Figure 13 is a block diagram of the TOTM section shown in Figure 2.
  • Figure 14 is a block diagram of an associative retrieval system.
  • Figure 15 is a block diagram of the associator circuit illustrated in Figure 14.
  • Figure 16 is an illustration of the timing of ⁇ a ⁇ .
  • Figure 17 is a block diagram of the basic word associator or comparator circuit.
  • Figure 18 is a block diagram of the basic word associator circuit in another system.
  • Figure 19 is a schematic diagram of the selector circuit shown in Figure 17.
  • Figure 20 is a schematic diagram of the tally memory circuit shown in Figure 17.
  • Figure 21 is a schematic diagram of the add circuit shown in Figure 17.
  • Figure 22 is a schematic diagram of the latch circuit shown in Figure 17.
  • Figure 23 is a schematic diagram of the test circuit shown in Figure 17.
  • Figure 24 is a schematic diagram of the add-latch circuit or R section shown in Figure 17.
  • Figure 25 is a schematic diagram of another add-latch circuit or M section shown in Figure 17.
  • Figure 26 is an illustration of the operation of the word comparator.
  • This invention is to a new and improved string comparator device which provides a numeric measurement of th
  • the preferred embodiment of the system circuit is an electronic device which is a string comparator means including a comparator device.
  • Figure 1 shows a system in which this electronic device would be used.
  • the string comparator 105 performs the calculation of the string similarity function.
  • the string similarity computer 105 and an I/O controller and ranker means 104 together comprise the system circuit 103 referred to as the String Comparator means.
  • the system circuit 103 is connected to a computer system 101 by means of an interface circuit 102.
  • the computer system 101 is shown with a keyboard 101' and a display 101" for human interaction.
  • a variety of well-known devices for storage, communications and computation may also be attached to the computer system 101.
  • Another use of the string comparator circuit 105' is shown in Figure 1A.
  • an interface means 106 monitors a stream of data 107, formats words denoting the contents of the data stream, and uses a string comparator circuit 105' to compare against one or more predefined strings. When a predetermined similarity threshold value is exceeded, an output signal 108 denotes a match condition.
  • The logic design of the String Comparator means is explained below. Appendix 2, made a part of this specification, gives a complete, detailed logic specification for the preferred embodiment of the String Comparator means as a Large Scale Integrated (LSI) electronic circuit. String Comparator means refers to the preferred embodiment as well as alternate circuits for performing particular functions.
  • the preferred embodiment of the String Comparator means 103 of Figure 1 is shown in Figure 2. It consists of Bus Control means 21, Master Control means 20, Ranker means 22, Divider means 23, and a String Similarity Computer 10.
  • the string comparator means is used like a random access memory as described in Appendix 3.
  • the Bus Control 21 uses well-known techniques to control internal bus lines 24, 24' and 24" and the external signals 26.
  • Bus Control 21 controls all external accesses to the system circuit, and monitors the activities of the other internal components of the system circuit.
  • Master Control 20 implements DMA (Direct Memory Access) operation and DMA editing in the manner documented in Appendix 3. Master Control controls automatic loading of data from an external memory.
  • the Ranker means 22 maintains a ranked list of the best comparison results.
  • the ranked lists contain up to 16 entries; each entry consists of the string similarity value and a record pointer. Appendix 3 describes the effect of the Ranker means.
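The behaviour of the Ranker means (a ranked list of up to 16 entries, each holding a similarity value and a record pointer) can be modelled in software roughly as below. This is a sketch, not the circuit described in Appendix 3: the field widths, the names `ranker`, `rank_entry` and `ranker_offer`, and the rule that ties keep the earlier record ahead are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define RANK_CAPACITY 16  /* the text states up to 16 entries */

typedef struct {
    uint16_t similarity;  /* string similarity value (width is an assumption) */
    uint32_t record;      /* record pointer (width is an assumption) */
} rank_entry;

typedef struct {
    rank_entry entries[RANK_CAPACITY];  /* kept sorted, best match first */
    size_t count;
} ranker;

void ranker_init(ranker *r) { r->count = 0; }

/* Offer one comparison result. Returns 1 if it was stored, 0 if the list
 * is full and the result is no better than the current worst entry. */
int ranker_offer(ranker *r, uint16_t similarity, uint32_t record) {
    size_t pos = r->count;
    /* Walk up past every entry that is strictly worse (ties stay ahead). */
    while (pos > 0 && r->entries[pos - 1].similarity < similarity) pos--;
    if (pos == RANK_CAPACITY) return 0;
    /* Shift worse entries down one slot; the 16th entry falls off if full. */
    size_t last = (r->count < RANK_CAPACITY) ? r->count : RANK_CAPACITY - 1;
    for (size_t i = last; i > pos; i--) r->entries[i] = r->entries[i - 1];
    r->entries[pos].similarity = similarity;
    r->entries[pos].record = record;
    if (r->count < RANK_CAPACITY) r->count++;
    return 1;
}
```

A host would call `ranker_offer` once per record string and read `entries[0..count-1]` at the end of the search.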
  • the Divider means 23 computes the ratio of two binary numbers from M 17 shown in Figure 11 and TOTM 19 shown in Figure 13; this ratio is expressed as a fractional binary number.
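  • the String Similarity Computer 10 computes the full string similarity function defined in detail in Appendix 1.
  • the Master Control means 20, the Bus Control means 21, the Ranker means 22, the Divider means 23 and the String Similarity Computer 10 may be logic circuits and may be embodied with well-known electronic components.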
  • Figure 1A may include any one of the following circuit configurations.
  • the design of the PF474 is based upon a structured design methodology which uses two-clock logic.
  • Figure 3 shows the two clocks φ1 and φ2.
  • the clocks are designed so that only one is high (active) at a time.
  • while φ1 is high, inputs are accepted into a logic block; when φ1 falls, the inputs are latched into the input registers.
  • while φ2 is high, the output of a logic block may change; when φ2 falls, the output signals are latched.
  • the string similarity computer 10 is also described in further detail in Appendix 2.
  • the string similarity computer 10 is now described on a functional level.
  • the purpose of the String Control means 11 is to control and coordinate the activities of the rest of the string similarity computer 10.
  • the String Control means 11 is shown in Figures 4 and 5.
  • Figure 4 shows the input and output signals which are important for the logical functionality.
  • Figure 5 shows the internal storage registers and memory.
  • the GO input 30 is a 1-bit signal.
  • String Control will initiate a string comparison operation.
  • the RBUSY signal 36 and the CTS signal 37 are status indicators.
  • a string comparison operation can be started only when the RBUSY signal 36 is low (inactive) or CTS signal 37 is high (active).
  • the SBUSY output signal 35 is high (active) while a string comparison operation is in progress. This signal is used by Master Control 20.
  • the String Control means contains Random Access Memory 80 which contains areas for strings of 8-bit characters. The two strings are designated String-1 and String-2.
  • String-1 and String-2 are loaded into the Random Access Memory 80 by the Bus Control means 21.
  • String-1 and String-2 are terminated with the character NULL.
  • Additionally, two 8-bit registers LEN1 81 and LEN2 82 contain the lengths of String-1 and String-2, respectively.
  • the DISP register 83 and the POS register 84 are internal 8-bit registers used to perform the Forward and Reverse scans.
  • the lengths LEN1 and LEN2 are compared.
  • the shorter of the two strings is designated String-A, the other is designated String-B. If the lengths are equal, the designation is made arbitrarily.
  • the CMD output signal 31 is a 3-bit signal which is used to sequence the activities of other parts of the string similarity computer 10.
  • Table 1 shows the valid values for the CMD signal 31. The entries in Table 1 are listed in the order in which the values are output.
  • the CHAR output 32 is an 8-bit signal containing a character from either String-A or String-B. Each character from String-A or String-B is transmitted once during the Forward Scan and Reverse Scan.
  • the SCID output signal 34 is low (inactive) during the Forward Scan and high (active) during the Reverse Scan.
  • the STRID output signal 33 is high (active) when a character from String-B is being output on CHAR lines 32.
  • the STRID output signal is low (inactive) when a character from String-A is being output.
  • the STRID output signal 33, the SCID output signal 34, and the CHAR output 32 are meaningful only when the CMD output 31 is equal to '001', indicating that a character should be processed.
  • String Control implements the following Forward Scan Program: char strA[128], strB[128]; /* Two strings */ int lenA, lenB; /* Strings' lengths */
  • the variables POS and DISP in the above program correspond exactly in function and purpose to the 8-bit registers POS 42 and DISP 41 shown in Figure 5.
  • the Reverse Scan phase of the String Control means operation causes the computation of the function M, as described in the previous section defining the string similarity function.
  • the Reverse Scan is divided into two stages to facilitate the computation of the compensation functions. In effect, String Control left-justifies String-1 and String-2 and transmits every character beginning with the rightmost and ending with the leftmost. More precisely, String Control implements the following Reverse Scan Program:
  • POS and DISP correspond exactly in function and purpose to the 8-bit registers POS 42 and DISP 41 shown in Figure 5.
  • POS and DISP are initialized in the program RSCAN to the values which they held at the end of the program FSCAN.
  • RSCAN can conveniently be run directly after FSCAN with no extra initialization.
  • the purpose of the Parameter Generation means 12 is to obtain the character weight and compensation values.
  • the character itself is received from the String Control circuit 11.
  • the character weight and compensation values are used by the rest of the string similarity computer 10.
  • the Parameter Generation means 12 is shown in Figure 6.
  • Figure 6 shows the input signals 31-34, the output signals 39-43 and the internal Random Access Memory 38.
  • the input signals are the CHAR input 32, the SCID input 34, the STRID input 33, and the CMD input 31. All of these signals are outputs of the String Control means 11.
  • the Random Access Memory 38 contains, for each of 255 characters, a compensation value between 0 and 7, a weight value between 0 and 7, and a bias value between -2 and +1. These values correspond to the compensation function C, the forward weight function W_f, and the bias function B which were described in the appendix defining the string similarity function.
  • the Random Access Memory 38 is loaded via the Bus Control means 21.
  • the CHAR output signal 39 is always the same as the previous CHAR input signal 32.
  • the CHAR signals 32, 39 are 8-bit data items.
  • the STRID output signal 40 is always the same as the previous STRID input signal 33.
  • the STRID signals 33, 40 are 1-bit signals.
  • the CMD output signal 43 is always the same as the previous CMD input signal 31.
  • the CMD signals 31, 43 are 3-bit signals.
  • the C output signal 42 is a 3-bit positive binary signal equal to the compensation value for the character denoted by the CHAR signal 39. This compensation value is read from the Random Access Memory 38.
  • the WGHT output signal 41 is a 3-bit signal equal to either the forward or the reverse weight of the character denoted by the CHAR signal 39. When the SCID input signal 34 is low (inactive), denoting the Forward Scan phase, the WGHT output signal 41 is equal to the value W_f read from the Random Access Memory 38. When the SCID signal is high (active), denoting the Reverse Scan phase, the WGHT output signal 41 is equal to the sum of the values of W_f and B read from the Random Access Memory 38.
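The parameter lookup just described can be sketched in C. The struct layout, the 256-entry table indexed directly by the 8-bit character, and the names `char_params` and `wght_output` are assumptions made for illustration; the sketch also does not clamp the reverse weight to 3 bits, since the text does not say how out-of-range sums are handled.

```c
#include <stdint.h>

/* Hypothetical per-character parameter record, modelled on the text:
 * compensation 0..7, forward weight 0..7, bias -2..+1. */
typedef struct {
    uint8_t comp;    /* compensation value C */
    uint8_t weight;  /* forward weight W_f */
    int8_t  bias;    /* bias B */
} char_params;

/* WGHT output for one character.
 * scid == 0: Forward Scan, WGHT = W_f.
 * scid != 0: Reverse Scan, WGHT = W_f + B.
 * table is assumed to hold one entry per 8-bit character code. */
int wght_output(const char_params *table, uint8_t ch, int scid) {
    const char_params *p = &table[ch];
    return scid ? p->weight + p->bias : p->weight;
}

/* C output for one character: the stored compensation value. */
int c_output(const char_params *table, uint8_t ch) {
    return table[ch].comp;
}
```

With a weight of 5 and a bias of -1 for some character, the forward weight is 5 and the reverse weight is 4, which is how directional biasing enters the computation.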
  • the Core Section 13 is the decision-making part of the string similarity computer 10. It receives data from the Parameter Generation means 12, maintains character counts, and determines what the rest of the string similarity computer must do.
  • the Core Section 13 is pictured in Figure 7 with its input signals 39-43, internal TALLY memory 44, and output signals 45-52.
  • the input signals are CHAR 39, STRID 40, WGHT 41, C 42, and CMD 43. These input signals are all outputs of the Parameter Generation means 12.
  • the TALLY memory 44 is a fast clear memory means of size 256X4.
  • the TALLY memory contains a 4-bit signed (two's complement) number in the range -7 to 7, inclusive, for each character specified by the CHAR input signal 39.
  • the clear control of the TALLY memory zeros the entire memory. Furthermore, individual entries in the TALLY memory may be incremented or decremented. Attempting to increment the value 7 or to decrement the value -7 results in an unchanged state. This is referred to as latching at ±7.
  • the TALLY memory 44 corresponds directly to the array T used in the C programs in the earlier sections defining the string similarity function.
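The latching behaviour of the TALLY memory can be illustrated with a small C model. The type and function names (`tally_mem`, `tally_clear`, `tally_inc`, `tally_dec`) are hypothetical, and an `int8_t` per entry stands in for the 4-bit hardware cell.

```c
#include <stdint.h>
#include <string.h>

#define TALLY_SIZE 256   /* one entry per 8-bit character */
#define TALLY_MAX  7     /* entries latch at +7 */
#define TALLY_MIN  (-7)  /* and at -7 */

typedef struct {
    int8_t t[TALLY_SIZE];  /* models the 256 x 4 signed memory */
} tally_mem;

/* Clear control: zero the entire memory. */
void tally_clear(tally_mem *m) { memset(m->t, 0, sizeof m->t); }

/* Increment one entry; incrementing 7 leaves it unchanged. */
int tally_inc(tally_mem *m, uint8_t ch) {
    if (m->t[ch] < TALLY_MAX) m->t[ch]++;
    return m->t[ch];
}

/* Decrement one entry; decrementing -7 leaves it unchanged. */
int tally_dec(tally_mem *m, uint8_t ch) {
    if (m->t[ch] > TALLY_MIN) m->t[ch]--;
    return m->t[ch];
}
```

Latching keeps every entry inside the symmetric range -7 to 7, so a 4-bit two's complement cell never wraps.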
  • the WX output signal 46 is equal to the arithmetic (two's complement) inverse of the WGHT input signal 41.
  • WX 46 is a 4-bit non-positive integer.
  • the WGHT output signal 49 is equal to the previous WGHT input signal 41.
  • the WGHT signals 41, 49 are 3-bit unsigned integers.
  • the CMD output signal 50 is a 3-bit signal that is always equal to the previous CMD input signal 43.
  • the CMDX output signal 48 is a 1-bit signal equal to the low-order bit of the CMD input signal 43.
  • the CMDB output signal 47 is a 2-bit signal.
  • the low-order bit of CMDB 47 is high (active) only if the CMD input 43 is equal to '001', denoting a Process Character command.
  • the high-order bit of CMDB 47 is high only if the CMD input is equal to '010' or '011', denoting a Clear command.
  • the CMD input 43 is equal to '010' or '011',
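The decoding of the 3-bit CMD input into the CMDX and CMDB outputs can be written out as a small C function. The struct `core_decode` and the function name `decode_cmd` are illustrative only; the bit assignments follow the text.

```c
/* Hypothetical result of decoding one 3-bit CMD value in the Core Section. */
typedef struct {
    unsigned cmdx;  /* 1 bit: low-order bit of CMD */
    unsigned cmdb;  /* 2 bits: bit 1 = Clear, bit 0 = Process Character */
} core_decode;

core_decode decode_cmd(unsigned cmd) {
    core_decode d;
    cmd &= 0x7u;                                       /* CMD is a 3-bit signal */
    unsigned process = (cmd == 0x1u);                  /* '001' */
    unsigned clear = (cmd == 0x2u || cmd == 0x3u);     /* '010' or '011' */
    d.cmdx = cmd & 0x1u;
    d.cmdb = (clear << 1) | process;
    return d;
}
```

Because CMDX is the low-order bit of CMD, it distinguishes the two Clear codes '010' and '011' for the downstream TOTR section.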
  • a Process Character command is specified.
  • the CHAR input 39 specifies a character to be processed.
  • Each character designates a 4-bit entry in the TALLY memory 44.
  • Character processing consists of incrementing or decrementing the entry in the TALLY memory 44 which corresponds to the character denoted by the CHAR input 39.
  • the CMD input 43 is equal to '001' and the STRID input 40 is high (active).
  • the appropriate TALLY memory entry is incremented and latched at 7. If the result is not positive then the T output signal 52 is high (active); otherwise the T output 52 is low (inactive).
  • the appropriate TALLY memory entry is decremented and latched at -7. If the result is not negative then the T output signal 52 is high (active); otherwise the T output 52 is low (inactive). If the T output signal 52 as computed above is high
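A Process Character step of the Core Section, as described above, might be modelled like this. The reading that the decrement case applies when STRID is low (a String-A character) is inferred from the surrounding text, and the names `tally_mem` and `process_char` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    int8_t t[256];  /* TALLY entries, each latched to the range -7..7 */
} tally_mem;

/* Process one character and return the T output (1 = high, 0 = low).
 * strid != 0: character from String-B, entry is incremented (latched at 7),
 *             T is high if the result is not positive.
 * strid == 0: character from String-A (assumed), entry is decremented
 *             (latched at -7), T is high if the result is not negative. */
int process_char(tally_mem *m, uint8_t ch, int strid) {
    int8_t *e = &m->t[ch];
    if (strid) {
        if (*e < 7) (*e)++;
        return *e <= 0;
    }
    if (*e > -7) (*e)--;
    return *e >= 0;
}
```

In this model T goes high exactly when the character cancels an outstanding occurrence of the same character from the other string, which is one way to read "identifies common portions".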
  • the CA section 14 is used to compute the total compensation value for the characters in String-A. The CA section 14 is shown in Figure 8 with its inputs 45, 49-52 and its outputs 53-58.
  • the signal COMPA 53 is an output signal which is also fed-back as an input to CA.
  • the input signals are STRID 45, WGHT 49, CMD 50, C and T 52. These signals are outputs of the CORE section 13.
  • the output signals STRID 54, WGHT 55, CMD 56, C 57 and T 58 are always the same value as the corresponding inputs; i.e. these signals are passed through unchanged.
  • the COMPA output signal 53 is a 9-bit unsigned non-negative integer.
  • the COMPA output value 53 is zero.
  • if the CMD input signal 50 is equal to '001', denoting a Process Character command, and the value of the STRID input signal 45 is equal to the value of the T input signal 52, then the 4-bit signed input C 51 is added to the previous value of COMPA 53; the resulting sum is output as the COMPA signal 53. If an overflow occurs on this addition, the carry bit is lost; an overflow will not occur if the programming restraints documented in Appendix 3 are obeyed. When neither of the above conditions is met, the COMPA output signal 53 is the same as the previous COMPA input signal.
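The COMPA update rule can be sketched as a pure function in C. Two points are assumptions rather than statements from the text: that the elided condition under which COMPA is zero is the Clear command ('010' or '011'), by analogy with the R section, and the names `compa_next`, `CMD_PROCESS_CHAR`, `CMD_CLEAR_A` and `CMD_CLEAR_B`.

```c
#include <stdint.h>

#define CMD_PROCESS_CHAR 0x1u  /* '001' */
#define CMD_CLEAR_A      0x2u  /* '010' (name is hypothetical) */
#define CMD_CLEAR_B      0x3u  /* '011' (name is hypothetical) */

/* Next value of the 9-bit COMPA accumulator.
 * compa: previous COMPA value, c: compensation input C,
 * strid and t: the STRID and T inputs (0 = low, nonzero = high). */
uint16_t compa_next(uint16_t compa, unsigned cmd, int strid, int t, int c) {
    if (cmd == CMD_CLEAR_A || cmd == CMD_CLEAR_B)
        return 0;  /* assumption: Clear zeros the accumulator */
    if (cmd == CMD_PROCESS_CHAR && (strid != 0) == (t != 0))
        return (uint16_t)((compa + c) & 0x1FF);  /* carry out of bit 8 is lost */
    return compa;  /* otherwise the previous value is held */
}
```

The mask `0x1FF` models the lost carry bit of the 9-bit register.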
  • the CB section 15 is used to compute the total compensation value for the characters in the STRING-B.
  • the CB section 15 is shown in Figure 9 with its inputs 53-58 and outputs 59-63.
  • the signal COMPB 59 is an output signal which is also fed back as an input to CB.
  • the input signals are COMPA 53, STRID 54, WGHT 55, CMD 56, C 57, and T 58. These signals are outputs of the CA section 14.
  • the output signals STRID 60, WGHT 61, CMD 62, and T 63 are always the same value as the corresponding inputs; i.e. these signals are passed through unchanged.
  • the COMPB output signal 59 is a 9-bit unsigned non-negative integer.
  • the R section 16 computes an intermediate subtotal value.
  • the R section 16 is pictured in Figure 10 with its inputs 59-63 and its outputs 64-66.
  • the RSUM signal 64 is fed back as an input to the R section 16.
  • the values of the output signals STRID 65 and CMD 66 are always the same as the values of the inputs STRID 60 and CMD 62.
  • the RSUM output signal 64 is a 10-bit non-negative (unsigned) integer.
  • if the CMD input 62 is equal to '010' or '011', denoting a Clear command, then the RSUM output 64 is zero.
  • if the CMD input 62 is equal to '101', denoting the Load CA into R command, then the value of the COMPB signal 59 is added to the previous value of the RSUM input signal 64; this sum is the next RSUM output 64.
  • if the CMD input 62 is equal to '001', denoting a Process Character command, and the T input signal 63 is high (active), then twice the value of the WGHT input signal 61 is added to the value of the previous RSUM input 64; the result is the next RSUM output 64.
  • otherwise, the RSUM output signal 64 is equal to the previous RSUM input signal 64.
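The four RSUM cases above amount to a small state-update function. The following C sketch uses the command codes exactly as the text gives them; the function name `rsum_next` and the decision to wrap results to 10 bits are assumptions, since the text does not describe overflow for RSUM.

```c
#include <stdint.h>

/* Next value of the 10-bit RSUM register in the R section.
 * rsum: previous RSUM, cmd: 3-bit CMD input, t: T input,
 * wght: WGHT input, compb: COMPB input. */
uint16_t rsum_next(uint16_t rsum, unsigned cmd, int t,
                   unsigned wght, uint16_t compb) {
    switch (cmd & 0x7u) {
    case 0x2u:
    case 0x3u:  /* '010' or '011': Clear */
        return 0;
    case 0x5u:  /* '101': add COMPB to the running sum */
        return (uint16_t)((rsum + compb) & 0x3FF);
    case 0x1u:  /* '001': Process Character, counts only when T is high */
        return t ? (uint16_t)((rsum + 2u * wght) & 0x3FF) : rsum;
    default:    /* any other command holds the previous value */
        return rsum;
    }
}
```

For example, with RSUM at 10, a Process Character step with T high and WGHT equal to 3 yields 16, while the same step with T low leaves RSUM at 10.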
  • the M section 17 is shown in Figure 11.
  • the inputs to the M section 17 are the signals RSUM 64 (a 10-bit unsigned integer), STRID 65 and CMD 66.
  • the outputs are the READY output signals 68 and the MVAL output signal 67.
  • MVAL 67 is a 16-bit nonnegative integer which is fedback as an" input to the M section 17.
  • the output READY 68 is high (active) only when the CMD input 66 is equal to ' 110 ' , denoting a Transmit Result command. While the READY signal 68 is active, the output of M section 17 and the TOTM section 19 are valid and ready for the DIVIDER section 23 to use them.
  • the MVAL output 67 is zero. Whe either of the following two conditions hold (1) the CMD input 66 is equal to '100', denoting a Load CA into R command, or ( the STRID input 65 is high (active) and the CMD input 66 is e to '001', denoting a Process Character command; then, the MVA input 67 is added to the RSUM input 64 and the result of the addition is the next MVAL output 67.
  • the TOTR section 18 computes an intermediate subtotal value.
  • the values computed by the TOTR section correspond precisely to the values of the variable TOTR used in the C programs in Appendix 1 defining the θ string similarity function.
  • the TOTR section 18 is shown in Figure 12 with its input signals 45-48 and its output signals 71-74.
  • the TOTRSUM output 71 is an 11-bit signed (two's complement) integer which is fed back as an input to the TOTR section 18.
  • the WX input 46 is a 4-bit signed (two's complement) integer. Both WX 46 and TOTRSUM 71 have only non-positive values.
  • the CMDX output 74 and the STRID output 72 are always equal to the previous CMDX input 48 and STRID input 45, respectively.
  • the high-order bit of the CMDB input 47 is hi
  • the TOTM section 19 computes the denominator of the ratio defining the θ string similarity function.
  • the TOTM section 19 is shown in Figure 19 with its inputs 71-74 and its output TOTMVAL 75.
  • the TOTMVAL signal 75 is a 16-bit non-positive integer which is fed back to the TOTM section 19 as an input.
  • a non-positive integer is either zero or a negative number in two's complement format with an implicit negative sign bit.
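  • the encoding described above can be sketched in a few lines of Python. This is one reading of "implicit negative sign bit", offered as an assumption: the sign bit is not stored, and any non-zero stored pattern is interpreted as if a 1 sign bit sat just above the stored bits. The function names and the default width of 16 are illustrative.

```python
def decode_non_positive(bits, width=16):
    """Interpret `width` stored bits as a non-positive integer.

    Assumption: an all-zero pattern means zero; any other pattern is a
    two's complement negative number whose sign bit is implied, not stored.
    """
    if bits == 0:
        return 0
    return bits - (1 << width)


def encode_non_positive(value, width=16):
    """Inverse of decode_non_positive for values in -(2**width - 1) .. 0."""
    assert -(1 << width) < value <= 0
    return value & ((1 << width) - 1)
```

  • under this reading a 16-bit field covers the range -65535 to 0, one more bit of magnitude than an ordinary 16-bit signed field would offer for negative values.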
  • the reason TOTMVAL 75 is a non-positive integer is to simplify the circuit design of the DIVIDER section 23.
  • the TOTRSUM input 71 is an 11-bit signed (two's complement) number.
  • the TOTMVAL output 75 is zero.
  • the low-order bit of the CMDB input 73 is high (active), denoting a Process Character command, and the STRID input 72 is high (active); then the TOTRSUM input 71 is added to the TOTMVAL input 75 and the result is the next TOTMVAL output 75.
  • the TOTMVAL output 75 is unchanged from the previous TOTMVAL input 75.
  • FIG. 1 shows a block diagram of the PF474.
  • the Master Control means 20 and the Bus Control means 21 are implemented with nodes 2700-3299.
  • Ranker means 22 is implemented with nodes 3300-3900.
  • the Divider means 23 is implemented with nodes 1-299.
  • the String Similarity Computer 10 is implemented with nodes 700-2000.
  • the String Similarity Computer 10 consists of several different logic blocks.
  • the String Control is implemented as nodes 300-600.
  • the Core section 13 is implemented as nodes 700-1023.
  • the CA section 14 is implemented as nodes 1024-1122, 1182-1207.
  • the CB section 15 is implemented as nodes 1123-1170, 1214-1303.
  • the R section 16 is implemented as nodes 1326-1462.
  • the M section 17 is implemented as nodes 1463-1622.
  • the TOTR section 18 is implemented as nodes 1630-1738, 1768-1788.
  • the TOTM section 19 is implemented as nodes 1740-
  • the chip manual entitled Advanced Product Description is included as Appendix 3 and made a part hereof to provide disclosure of use of the chip.
  • Appendix 4 is included herein and made a part hereof as an example of an electrical interface of the chip with the S-100 Bus, a widely used system for computer interconnection.
  • the circuit described in Appendix 4 supports appropriate communication between the chip, a 4K-word by 8-bit RAM, and any of the widely used computer systems which are compatible with the S-100.
  • Appendix 4 includes drawings A4-1 through A4-8.
  • the improved string comparator is based on the associative memory circuit originally disclosed and filed on March 14, 1979 as Serial Number 20,618, which includes a word associator circuit shown in detail as an electrical digital circuit in Figures 14 through 26.
  • the system or associative memory circuit is shown in Figure 14.
  • This associative memory circuit is an improved associative retrieval device that includes the use of the word associator or comparator circuit connected in a storage loop to locate and extract records that are most similar to the supplied query. Inexact queries will rapidly locate records similar with respect to word, numeric and mask related measurements of similarity.
  • the new and improved method that is set forth below in detail provides a method of word comparison and a method of processing in the improved associative memory circuit or associative retrieval device.
  • the processing is preferably in a parallel configuration that provides rapid response to queries, while processing a large number of simultaneous requests.
  • shared memory 312 such as a time multiplexed multi-port random access read-write memory of any well known design such as TI's 74200.
  • 312 is allotted a brief time slice on the order of one millisecond. A port may disconnect before this time elapses.
  • the associative memory circuit 10 communicates with the outside world through its communications modules 314 and 314' of any well known design.
  • the communications modules 314 and 314' are microcomputer based flexible interfaces responsib
  • the communications modules or circuits may communicate with the other associative memory circuits 10 using shared memory through buses 316 and 316' in any well known manner and by use of any well known design. These communication modules might also perform considerable preprocessing before passing a query on to the other system components.
  • the main storage units (MSU) 320 and 320' of any well known design are devices that contain the actual records to be searched in memory units of any well known design.
  • the main storage units contain any of a variety of well known control circuits to transmit these records in a fixed format over a bus.
  • a plurality of main storage units may be used as illustrated by number 322 and the dots.
  • the transmission format requires the simultaneous transmission of record characters taken sequentially from the record moving from right to left and from left to right; see Figure 25 and the in-use description set forth hereinbelow. Numeric portions of a record are transmitted separately.
  • the bus or lines 324 and 324' also contain control and timing signals, error correction codes and a data path of well known design for use in the communications between associator circuits 342, 344, 342' and 344' and extractor circuits 356 and 356'.
  • MSU main storage units
  • a short blanking period is required to permit the associator circuits to initialize themselves for another record.
  • prior to the transmission of each record's data, the MSU 320 and 320' must transmit an internal record number for the record that follows.
  • these numbers should be assigned sequentially by the MSU 320 and 320'.
  • the control circuit 330 is connected to the MSU devices 320 and 320' by bus 326.
  • the control circuit 330 is responsible for all update and control of the MSU's.
  • the control circuit may consist of one or more simple microcomputers of well known design.
  • Control circuit 330 communicates through shared memory 312 over bus 340.
  • a direct interface 332 of well known design might be attached by bus 334. This would permit a direct data path from an MSU 320 and 320' to an external high speed device. This would facilitate the rapid loading of an entire MSU 320 and 320' as might occur at bootstrap time.
  • the network switching circuit 336 is responsible for routing through bus 338 or 338' data from an MSU 320 and 320' to a vacant associator circuit 342 or 344 as well as 342' or 344' to satisfy a query. Additional associator circuits may be connected between 342 and 344 and 342' and 344' as illustrated by numerals 346 and 346' and the dots. Additional parallel circuits may also be interconnected as illustrated by numeral 348 and the dots.
  • Network switching circuit 336 is connected to a control device 350 of well known design by bus 352 which processes requests communicated through shared memory 312 by connection bus or line 354. Control device 350 decodes these requests and decides which requests are to be processed and in what order. Then control device 350 communicates to network switching 336 over line 352, a specific order to reconfigure the network.
  • the associator circuits (ac) 342, 344, 342' and 344' are an important part of this invention.
  • the associator circuits 342, 344, 342' and 344' are connected in strings by continuations of bus 338 or 338' respectively, terminated at one end by one or more extractor circuits 356 and 356' and at the other end by the network switching module 336.
  • Data from a selected MSU passes through network switching 336 and then through an associator circuit 342, 344, 342' and 344'. This circuit scrutinizes the data as it passes, looking for records that are very similar to the query provided.
  • the associator circuits 342, 344, 342', and 344' flag the most similar records and they are then extracted from the data stream by the extractor circuits 356 and 356' and eventually passed back through shared memory 312 over bus 360 to the communications circuits 314 and 314'.
  • the diagnostic computer 364, also of any well known design, is connected in a well known manner to the associative memory 310 to provide system performance statistics and maintenance information in a well known manner.
  • Figure 15 is a more detailed block diagram of associator circuits 342 and 342' of Figure 14.
  • each pair of associator circuits in Figure 14 is similar to the Figure 15 illustration.
  • Figure 15 shows the associator circuit 343 along with the basic interconnections.
  • the query is stored in query storage 370.
  • the records are received by the associator circuit on interface 372 over bus 374, where the records are merged with query characters transmitted over bus 376 in an appropriate manner and then forwarded through buses 378, 380, and 382 to the three types of associator circuitry.
  • the three types of associator circuitry are: 1) a word associator circuit 384, 2) a number associator circuit 386, and 3) a mask associator circuit 388.
  • the word associator circuit 384 combines the output of circuits 400 and 400' at the end of each record to arrive at the degree of word similarity. If the basic circuit of Figure 17 is used then at the end of each record, the M outputs from each copy of the circuit are added together by any well known means to arrive at the numerator of the fraction that equals the degree of word similarity.
  • the denominator is computed by any well known means including table lookup by circuit 384 and is equal to L(L+1) where L is the length of the compared words.
  • the numerator is computed by any well known means and is equal to twice the sum of the M quantities output from the two copies of the circuit.
  • the denominator is computed by any well known means and is equal to the sum of the TOTM quantities output from each copy of the circuit.
  • the word associator circuit 384 may or may not actually perform a division to arrive at the degree of word similarity. Instead, the ranker 396 and the other associator circuits 386 and 388 might work entirely with fractional representations of similarity.
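  • the two numerator and denominator rules above can be sketched with exact fractions, which also shows how a ranker could compare similarities without performing a division. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, and pairing the "twice the sum of M" numerator with the "sum of TOTM" denominator follows our reading of the text.

```python
from fractions import Fraction


def similarity_basic(m_one, m_two, length):
    """Basic circuit: sum of the two M outputs over L(L+1)."""
    return Fraction(m_one + m_two, length * (length + 1))


def similarity_weighted(m_one, m_two, totm_one, totm_two):
    """Weighted circuit: twice the sum of M over the sum of the TOTM outputs."""
    return Fraction(2 * (m_one + m_two), totm_one + totm_two)


def more_similar(a, b):
    """Compare two similarities held as fractions, using no division."""
    return a.numerator * b.denominator > b.numerator * a.denominator
```

  • for two three-letter words whose circuit copies output M values of 5 and 3, `similarity_basic(5, 3, 3)` is 8/12, which reduces to 2/3.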
  • the word associator circuit 384 is mainly made up of circuits 400 and 400' and interconnecting circuitry of well known design.
  • N is an integer design parameter in any well known manner. Loading of the many parameters involved in the association process is controlled by an onboard microprocessor or controller 398 of well known design. The microprocessor is connected to the basic module in a well known manner. When all records have been observed, the ranker waits for the highest ranking records to appear again in the storage loop. As they appear, the ranker 396 marks them for extraction in any well known manner.
  • by bidirectional serial we mean that successive positions from each of the two words under comparison and from their flips are simultaneously transmitted.
  • two observers of the data stream performing the procedure or method as illustrated in Figure 26.
  • of significance is the fact that each observer needs knowledge only of the data instantaneously before him in the data stream.
  • in Figure 17, the one word associator circuit in block diagram form is illustrated as numeral 400. Two of these circuits 400 are included in the word associator circuit 384 in Figure 15.
  • the data selector 402 illustrated in Figure 17 is shown in detail in Figure 19 and may utilize two quadruple bus gates such as Texas Instruments, Incorporated circuits 74125 and 74126, Exhibit A in the original application and made a part of this application. It has two input buses 404 and 406 entering from above. In any one time frame, one of these two inputs is routed to a single output 408 exiting below.
  • Figure 19 discloses the schematic for a single bus line.
  • Clock input potential θ 110 is connected to the selector 402. Two clocks are utilized in circuit 400, Figure
  • The purpose of θ is to define whether a query or a bus character is currently being processed. A complete cycle of φ, corresponding to a read/write cycle in 114, occurs during each half cycle of θ as shown in Figure 16. Another purpose of θ is to provide a trigger to latch 182 of Figure 17. The event consisting of a high to low transition of θ may serve as a trigger to latch 182. The clock θ is used as a control input in blocks 144, 126 and 402 of Figure 17.
  • the main portion of the Figure 17 circuit is designated by numeral 112 and includes a random access read/write memory 114 referred to as tally or a tally memory.
  • the memory address enters from above through bus 408.
  • the read/write mode of operation is selected by the read/write potential φ entering from the right through line 116 from a clock means of well known design, not shown. When read/write potential is low, the read mode is selected. When read/write potential is high, the write mode is selected. Data exits from the lower left on bus 120 and enters from the upper right.
  • tally is organized as 256 8-bit words.
  • addresses and data are 8-bit quantities. Figure 20 shows the schematic for each bit of a tally word.
  • tally may include one or more Texas Instrument Incorporated
  • the add block in Figure 17 is an incrementer/decrementer 126. Data enters from the right. Depending on the state of the input potential, the value of the entering data is either incremented or decremented before exiting the two four-bit binary full adders circuit 130 on bus 132, as shown in Figure 21.
  • the adders 130 may include a Texas Instruments Incorporated 4-bit binary full adder No. 7483, Exhibit C in the original application and made a part hereof. Hex inverters 134, No. 7404, Exhibit AA in the original application, are connected between the input potential over bus 128.
  • the incrementer/decrementer 126 in Figure 17 is actually an adder in which one of the summands is restricted to either plus or minus one. If the input potential of 128 is zero, then plus one is added.
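  • the idea of an adder with one summand restricted to plus or minus one can be sketched as follows. This is an illustrative Python model, not the gate-level design: the function name is hypothetical, and we assume that a non-zero control input selects minus one, represented as 0xFF in 8-bit two's complement.

```python
def inc_dec(value, control):
    """8-bit add of +1 (control low) or -1 (control high), modulo 256."""
    summand = 0x01 if control == 0 else 0xFF   # 0xFF is -1 in 8-bit two's complement
    return (value + summand) & 0xFF            # the carry out of bit 7 is discarded
```

  • because the same adder serves both directions, a tally word can move above or below zero using only the control input to choose the direction.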
  • a positive edge triggered data latch 136 in Figure 17 is connected to bus 132 by bus 138. Data enters through bus extension 138 from the left and exits from the right on bus 140 to tally 114. On the positive going edge of the read/write potential actuated by input 2, the contents entering latch 136 are latched and become the output from the latch over bus 140.
  • the latch is a two-bit D-type register with 3-state output 142 shown in Figure 22, Texas Instruments Incorporated Number 74173, attached to the original application as Exhibit D and made a part hereof.
  • a convertible sign tester 144 in Figure 17 has an input through bus 132 for data entering from the right and the tester 144 determines if this data is non-positive or non-negative, depending on the state of the input potential on bus 146 entering from below. The test is performed relative to two's complement arithmetic. If the input potential is low, then test will output high on output bus 148 to the left, provided that its input is greater than or equal to zero.
  • tester 144 is shown in Figure 23 as an 8-bit device.
  • the input 132 is connected to two dual 5-input positive nor gates 150 and 152 Texas Instrument No.
  • Texas Instrument No 7408 attached in the original application as Exhibit E and made a part hereof connected to one gate of quadruple 2-input positive and gate 154, Texas Instrument No 7408 attached in the original application as Exhibit EE and made a part hereof.
  • Gate 154 is connected to one gate of a quadruple 2-input positive or gate 156, Texas Instrument Incorporated 7432, attached in the original application as Exhibit EEE and made a part hereof.
  • Gate 156 Texas Instrument No. 7432 is also connected to and gate 158.
  • gate 158 Texas Instrument No. 7408 is connected to one of the input lines 132 and line
  • a clearable edge triggered latch 170 is shown in Figure 17 and shown in detail in Figure 24 with an adder attached as described below.
  • Input 172 inserts a one (1) in latch 170.
  • the output is transmitted on bus 174 to add-latch 182.
  • the adder 170 has two inputs on busses 148 and 172 and a single output on bus 174 which is the sum of the inputs.
  • output 178 of the adders 176, such as a 4-bit binary full adder of Texas Instruments Incorporated No. 7483 attached as Exhibit C on the original application, become the input to the latches 180, such as 4-bit D-type registers with 3-state outputs, Exhibit D.
  • One of the inputs to the adder is the output of the latch.
  • latch 170 is a latch coupled to an adder with an 8-bit output and two 7-bit inputs as shown in Figure 24.
  • the clearing connections are shown in Figure 24, but may be accomplished by any well known manner.
  • Add-latch 182, shown in Figures 17 and 25, is like item 170. Its input enters from bus 174 from above. Add-latch 182 is triggered on the negative going edge of the read/write potential. In the preferred embodiment, add-latch 182 includes 4-bit binary full adders 184 such as Texas Instruments Corporation No. 7483, is a 13-bit latch
  • the memory 190 contains the words ABC and ABB which are to be compared. They are transmitted via two transmitters 192 and 194 over a data stream four characters wide, as illustrated.
  • the top half of the stream contains the transmission of the unaltered words and the bottom half contains the transmission of the flips of the words.
  • On each side of the stream sits an observer 196 and 200. Each observer is watching a single column at a time as columns flow from left to right.
  • the memory 190 and transmitters 192 and 194 correspond to the MSU 320 of Figure 14 and the data stream roughly, for illustration purposes, corresponds to the data bus 338 of Figure 15 (although query characters do not occur on the data bus or in the MSU).
  • the two observers correspond for illustration purposes to the two copies of the circuit 100 shown in Figure 17 contained within the word associator 384 of Figure 15.
  • each observer must "remember" a numeric quantity associated with each alphabet member. In this example there are but three: A, B, and C. This collection of quantities corresponds for illustration purposes to tally 114 of Figure 17.
  • Each observer must also "remember" a numeric quantity R that is 170 and another M that is 182. These correspond for illustration purposes directly with Figure 17.
  • each observer notices two characters before him. He processes one at a time in some fixed order, say top to bottom.
  • OBSERVER ONE shown as 196 in Figure 26 corresponds to one copy of the circuit 400 of Figure 17.
  • the circuit of Figure 17 is controlled by two clock cycles shown in Figure 16.
  • the bottom signal (THETA) is the major timing signal.
  • PHI is called the minor clock.
  • the clocks are an essential part of the invention only insofar as they specify the order of processing which occurs within the circuit.
  • Three major clock cycles are required to process the examples with three characters. Within each of these, two characters are processed: first a character from the query word is processed and then a character from the bus word.
  • the circuit first processes the character "A" from the query word and "A" from the bus word. Then it processes the character "B" from the query word and "B" from the bus word. Then it processes the character "C" from the query word and "B" from the bus word.
  • this quantity may be "normalized" by a procedure involving a division as discussed elsewhere if one desires a similarity measure that ranges from zero to one.
  • R, M, and TALLY are reset to zero.
  • TALLY is writing its updated information.
  • the character "A” is present at the selection inpu
  • the character "A” is present at the selector input
  • the R register shown as 170 in Figure 17 contains zero.
  • the M register shown as 182 in Figure 17 contains zero.
  • Tally location "B” 0
  • Tally location "C” 0
  • the character "A" is present at the selector input
  • the R register shown as 170 in Figure 17 contains 0 at the start of this period and 1 at the completion.
  • the M register shown as 182 in Figure 17 contains 0 at the start of this period and 1 at the completion.
  • Tally location "A” 1 0
  • Tally location "B” 0
  • Tally location "C” 0
  • the character "B" is present at the selector input 404.
  • the R register shown as 170 in Figure 17 contains 1 throughout.
  • the M register shown as 182 in Figure 17 contains
  • Tally location "A” 0
  • Tally location "B” 0
  • Tally location "C” 0
  • the R register shown as 170 in Figure 17 contains 1 at the start of this period and 2 at the completion.
  • the M register shown as 182 in Figure 17 contains 1 at the start of this period and 3 at the completion.
  • Tally location "C” 0
  • M is updated and becomes 3.
  • Minor clock cycle 1: During the start of this period while PHI is low, character "C" is routed from 404 to become the address of the memory 114 and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is low this block adds one to the value, producing 1. This is then routed both into LATCH 136 and TEST 144. Since THETA is low, TEST outputs a 0.
  • the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "C". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not. During the second half of this period while PHI is high, TALLY is writing its updated information.
  • the character "C” is present at the selector input
  • the character "B” is present at the selector input
  • the R register shown as 170 in Figure 17 contains
  • the M register shown as 182 in Figure 17 contains
  • the character "C” is present at the selector input
  • the character "B” is present at the selector input
  • the M register shown as 182 in Figure 17 contains 3 at the start of this period and 5 at the completion.
  • Tally location "A” 0
  • Tally location "B” -1
  • Tally location "C” 1
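  • the walk-through above for the query word ABC and the bus word ABB can be replayed in software. The following Python sketch models one observer (one copy of the Figure 17 circuit); the function name `observe` and the dictionary used for the tally are our own, and the two sign tests (non-positive after a query character, non-negative after a bus character) are inferred from the register values quoted in the walk-through.

```python
def observe(query, bus):
    """Replay one observer; return (R, M, tally snapshot) per major clock cycle."""
    tally = {}
    r = m = 0
    trace = []
    for q, b in zip(query, bus):
        tally[q] = tally.get(q, 0) + 1   # query character: increment its tally
        if tally[q] <= 0:                # non-positive test
            r += 1
        tally[b] = tally.get(b, 0) - 1   # bus character: decrement its tally
        if tally[b] >= 0:                # non-negative test
            r += 1
        m += r                           # M accumulates R once per major cycle
        trace.append((r, m, dict(tally)))
    return trace


for cycle, (r, m, tally) in enumerate(observe("ABC", "ABB"), start=1):
    print(cycle, r, m, tally)
```

  • the final state has R = 2, M = 5 and tally values A = 0, B = -1, C = 1, in agreement with the last entries listed above.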
  • n(a,w,i) denotes the number of occurrences of alphabet member "a" in word w found in position i or beyond, where position is measured canonically from left to right. If w and v are words in A, both of length L, then the degree of word similarity between them is denoted S(w,v) and is given by the formula below:
  • circuit invention evolved from the formula to the algorithm to the circuit.
  • following criteria are met: 1.
  • the next evolutionary step is the invention of an algorithm which quickly computes the formula.
  • This algorithm is presented herein in the form of a Fortran function subprogram. It is called with three input parameters: IQ, IR and N. IQ and IR are each a dimension N integer vector. Upon return, the variable Theta is the degree of similarity between the input words IQ and IR. Alphabet members are integers between 1 and 256 inclusive.
  • our circuit requires only two internal clock cycles. Each represents approximately a read/write cycle pertaining to a high speed random access memory. Actually, the clock must be slightly slower than
  • a complete similarity memory system may contain many such circuits. Therefore, the system would then be performing a search function beyond the capabilities of existing general-purpose CPUs.
  • the basic algorithm as programmed in Fortran is: FUNCTION THETA (IQ, IR, N)
  • ITALLY(IR(I)) = ITALLY(IR(I)) - 1
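  • only fragments of the Fortran subprogram survive in this text, so the following is a hedged Python rendering of what we take it to compute, not a transcription. The argument names IQ and IR follow the description; the pass over the reversed words, the two sign tests, and the normalization by N(N+1) are assumptions drawn from the circuit description elsewhere in this document.

```python
def theta(iq, ir):
    """Degree of similarity of two equal-length words of integers in 1..256."""
    n = len(iq)
    assert n == len(ir) and n > 0
    total = 0
    # one pass over the words as given, one over their flips
    for q_word, r_word in ((iq, ir), (iq[::-1], ir[::-1])):
        tally = [0] * 257            # indexed by alphabet member, 1..256
        r = m = 0
        for q, b in zip(q_word, r_word):
            tally[q] += 1            # query character
            if tally[q] <= 0:
                r += 1
            tally[b] -= 1            # record character
            if tally[b] >= 0:
                r += 1
            m += r
        total += m
    return total / (n * (n + 1))
```

  • with A, B, C coded as 1, 2, 3, identical words give 1.0 and words with no characters in common give 0.0.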
  • the SEL is a random access read/write memory 208. Its address enters on bus 224 from above and data output is to the right on bus 214. In Figure 18, it is assumed that the read mode is selected as the device is only written to during master initialization. In the preferred embodiment the read/write memory 208 is organized as 127 2-bit words.
  • SYN consists of three random access read/write memories, 202, 204, and 206. In each, the address enters on bus 406' from above and the data output is transmitted on bus 220, 220', and 220'' from below.
  • the read mode is selected as the device is only written to during master initialization. In the preferred embodiment, it is organized as three memories, each consisting of 256 8-bit words.
  • Select is a data selector 210. It has three input busses 220, 220', and 220'' entering from above. One of these three is routed to a single output 224 exiting below, depending upon the numeric value of the 2-bit value entering from the left. If this value is 0, then Select ignores its input and outputs zero. If this value is 1, 2 or 3, then the first, second or third input data bus respectively is routed to the output.
  • it has 8-bit inputs and outputs.
  • MV is a one bit latch 222. It is set/reset during master initialization. The current state of MV exits above.
  • Select 402 is a data selector. It has two input busses 224 and 404' entering from above. One of these is routed to a single output exiting below, depending upon the state of θ entering from the right. If θ is low, then the right input bus is selected. Otherwise, the left bus is selected. In the preferred embodiment, Select has 8-bit inputs and outputs.
  • PW is a random access read/write memory. Its address enters from the left and data output is to the right. In the figure, we assume that the read mode is selected, as the device is only written to during master initialization. In the preferred embodiment, it is organized as 127 2-bit words.
  • CW is a random access read/write memory. Its address enters from above and data output is to the right. In the figure, we assume that the read mode is selected, as the device is only written to during master initialization. In the preferred embodiment, it is organized as 256 2-bit words.
  • Distributor is a data distributor. It has three outputs above and a 2-bit control input to the left. When this input is zero, all three outputs are zero. When it is 1, 2, or 3, the first, second or third output goes high respectively, leaving the others zero.
  • Each gate is a pair of logical and gates used to control the propagation of the data output from CW. Both bits leaving CW enter each gate to the left. Inside each gate there are two 2-input and gates. One input from each becomes a common control input shown entering from below. The remaining two inputs connect to the two entering data lines. The outputs of the gates are shown to the right. When the control input is low, the gate outputs zero. When it is high, the gate simply propagates its two bit input. The combination of the three gate circuits and the distributor circuit forms a variable shift register which shifts the output of CW, depending upon the output of PW. This affects the computation of the final weight.
  • TOTM is an add-latch device. Its input enters above. It is triggered on a negative θ transition.
  • TOTM is a 17-bit latch together with an adder having one 17-bit input and one 11-bit input.
  • R is an add-latch device. Its input enters above and its output exits below. It is triggered on a positive transition provided that T entering from the right is equal to one.
  • R is an 11-bit latch coupled to an adder having one 11-bit input and one four-bit input.
  • M is an add-latch device. Its input enters above.
  • It is triggered on a negative θ transition.
  • it is a 17-bit latch together with an adder having one 17-bit input and one 11-bit input.
  • this diagram illustrates how the basic circuit of Figure 17 may be considerably enhanced without sacrificing speed of processing.
  • before data reaches the core or data selector, circuit 402 and the basic word associator circuit 112 that is identical to that shown in Figure 17, several tasks are performed.
  • a memory word is fetched corresponding to the current column position being processed. If this word is zero, then the current column is ignored. This is accomplished by using the double skip on zero circuit 220 of well known design.
  • This circuit merely blocks propagation of all timing signals during the current character pair. Therefore, the circuit ignores the current column. Whereas the circuit of Figure 17 processes every column unconditionally, this facility permits column selection in the circuit of Figure 18.
  • if the fetched word is nonzero, then it is used to select one of three tables to be used in translating the data character from the record before it reaches the select circuit 402. This is called synonym processing and permits additional flexibility.
  • the facilities above are implemented via the random access memory SEL 208 and the three random access memories labeled SYN 202, 204, and 206, and by the Select 210 component which simply selects one of the three outputs from the SYN memories 202, 204, and 206.
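  • column selection and synonym processing as described above can be sketched as a table lookup. This is an illustrative Python model under stated assumptions: the function name `translate`, the use of `None` to stand for a skipped column, and the sample table contents (an identity table and a case-folding table) are ours and do not come from the patent.

```python
def translate(position, record_char, sel, syn):
    """Return the translated record character, or None if the column is skipped.

    sel: per-column 2-bit values (0 = ignore column, 1..3 = SYN table to use)
    syn: three 256-entry translation tables
    """
    table = sel[position]
    if table == 0:
        return None                    # double skip on zero: column ignored
    return syn[table - 1][record_char]


# Hypothetical table contents for illustration only.
identity = list(range(256))
fold_case = list(range(256))
for code in range(ord("a"), ord("z") + 1):
    fold_case[code] = code - 32        # map lower case onto upper case
blank = [0] * 256
syn = [identity, fold_case, blank]
sel = [1, 2, 0]                        # column 0: identity, 1: fold case, 2: skipped
```

  • with these tables, a lower-case "a" in column 1 is translated to "A" before comparison, while any character in column 2 is ignored.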
  • the SEL 208 is of any well known design.
  • the SYN 202, 204 and 206 is a set of three random access memories of a well known design. If the translated value of a record character is zero
  • MV 222 is a one bit latch of a well known design. This permits the definition of missing value fields not to detract from the measure of the similarity between the record and the query.
  • Position bus 224 is connected to SEL 208 and PW 226, which is a random access memory of well known design.
  • Figure 18 implements a weighting scheme wherein the column number currently under consideration and the current character are used to determine a weight which is to be added to R instead of "1". In this way, one can weight some characters more heavily than others.
  • Figure 18 contains one such circuit. In this circuit, each alphabet character is assigned a two-bit weight. Thus weights 0, 1, 2, 3 are possible. Each column position is also assigned a weight, also two bits. But this weight is used to control a shift register so that here the possible weights are 0, 1, 2, 4. The character weight is effectively multiplied by the column weight to arrive at the final weight.
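  • the weight computation described above amounts to a small shift. A minimal Python sketch follows, assuming that a column control value of 0 gates the weight to zero and that values 1, 2 and 3 shift the character weight left by 0, 1 and 2 places; the function name is hypothetical.

```python
def final_weight(cw, pw):
    """Combine a 2-bit character weight cw with a 2-bit column control pw."""
    assert 0 <= cw <= 3 and 0 <= pw <= 3
    if pw == 0:
        return 0               # distributor drives no gate: weight is zero
    return cw << (pw - 1)      # column multipliers 1, 2, 4
```

  • the largest result is 3 shifted left by 2, that is 12, which fits in four bits.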
  • a complete word associator circuit must of course contain two copies of the circuit of Figure 18.
  • the character is not processed.
  • This is accomplished by the "single skip on zero" circuit 230 which blocks propagation of timing signals for the duration of a single character.
  • the weight scheme is implemented by the random access memories PW 226 and CW 242 of well known design and the selectable shift register 244 formed by the distributor and gate components 248, 250, and 252 and by the single skip on zero circuit 230, all of which are of well known designs.
  • the final result M needed to be divided by N(N+1), where N is the length of the words under comparison, to arrive at the measure of similarity.
  • this denominator must be computed since it will depend upon the weights encountered during processing.
  • TOTR 254 and TOTM 256 that are add-latches, compute a denominator term.
  • a corresponding term is computed by the other copy of the circuit.
  • the sum of these two terms is the final denominator.
  • the final numerator is twice the sum of the two M values read out of the two circuit copies.
  • the similarity between the query and bus words is the quotient of the numerator and the denominator quantities.
  • the circuit 112' in the lower right of Figure 18 is recognizable as very similar to the circuit of Figure 17. The only difference is that the selection component is now external and R may now be updated by quantities other than "1".
  • the circuit of Figure 18 is divided by dashed lines into three stages designated by I, II and III. Note that busses passing from stage to stage are broken. These indicate that buffers might be inserted to achieve a pipeline with three stages. In this way the circuit can process data as fast as the circuit of Figure 17. Without pipeline buffers, the circuit is two to three times slower.
  • the timing signals are labeled identically in each stage but may vary from stage to stage both because of the optional pipeline and because of the skip on zero circuits.
  • the circuit of Figure 18 must be initialized before use.
  • a master initialization must be performed once per search to establish weights, etc. This initialization must load the SEL 208, SYN 202, 204 and 206, PW 226, and CW 242 memories.
  • Also, the MV 222 flag must be set or reset.
  • the tally memory in 112' (not illustrated) must be set to zero, as must the R and M (not shown) and the TOTR 254 and TOTM 256 add-latches. After each record is processed, the contents of M (not shown) in 112' and TOTM 256 are read out of the circuit.
  • the bus data character must be stable on the bus even during the processing of the query character. This permits the bus character to be translated while the query character, which does not pass through synonym translation, is processed. The query translation is better left to software since the query is fixed during a search.
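
The weighting and normalization described in the definitions above can be illustrated with a minimal C sketch. It is not the circuit of Figure 18. The function names are hypothetical, the mapping of the two-bit column code onto the weights 0, 1, 2, 4 (code 0 suppresses the weight, codes 1 to 3 shift left by 0 to 2 places) is an assumption, and returning 0 for an empty denominator is also an assumption.

```c
#include <assert.h>

/* Assumed encoding: a 2-bit column code selects weight 0, 1, 2 or 4.
   Code 0 suppresses the weight; codes 1..3 shift left by 0..2 places. */
static unsigned column_weight(unsigned col_code)
{
    assert(col_code <= 3);
    return col_code == 0 ? 0u : 1u << (col_code - 1);
}

/* Weight added to R in place of "1": the 2-bit character weight (0..3)
   multiplied by the column weight (0, 1, 2 or 4). */
static unsigned final_weight(unsigned char_weight, unsigned col_code)
{
    assert(char_weight <= 3);
    return char_weight * column_weight(col_code);
}

/* Similarity = 2 * (M1 + M2) / (T1 + T2), where M1 and M2 are the M values
   and T1 and T2 the denominator terms read out of the two circuit copies.
   Returning 0.0 for an empty denominator is an assumption of this sketch. */
static double similarity(unsigned m1, unsigned m2, unsigned t1, unsigned t2)
{
    unsigned denom = t1 + t2;
    return denom == 0 ? 0.0 : 2.0 * (m1 + m2) / denom;
}
```

Under these assumptions the largest single weight is 3 x 4 = 12, and a record whose two M values sum to half of the combined denominator terms scores 1.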

Abstract

A string comparator device (105) compares strings of indicia at high speeds for use in a system circuit (103) in a computer system (101). The string comparison device (105) provides a numeric measurement of the degree of similarity between the compared indicia strings as defined by a mathematical algorithm. The algorithm is solved through a new string comparator device (105) or a new program in a computer system (101). The system circuit (103) in chip form can be connected in a storage loop of a computer system (101) to locate and quickly extract records that are very similar to the supplied query. Inexact queries will rapidly locate records similar with respect to indicia string related measurements of similarity. The method of indicia string comparison in the improved string comparator device can provide rapid response to queries in a computation time proportional to the average length of the indicia string.

Description

STRING COMPARATOR DEVICE SYSTEM, CIRCUIT AND METHOD
BACKGROUND OF THE INVENTION
An associative memory is a special kind of storage device. Whereas most memories are numerically addressed, associative memories are addressed via their contents. For example, one might ask an associative memory to return all records containing the letters "ZXU" in columns one, two, and three respectively. The address applied to an associative memory is the query. If some record exactly satisfies the query, then the query is said to be exact; otherwise it is said to be inexact. Conventional associative memories provide no information in response to an inexact query. Inexact queries are merely rejected. If a record exists within the associative memory that is only slightly different from the supplied query, then in many cases it would be desirable to know of its existence. Such records are minor corruptions of the query. Alternatively, they are very similar to the query. Some related concepts were discussed in the following articles:
(1) "The application of a pattern matching algorithm to searching medical record text," in the proceedings of the second annual symposium on computer applications in medical care, p. 308-313, IEEE 78CH 1413-3, by the inventor et al, a copy of which is attached hereto and made a part hereof;
(2) The Ramon D. Faulk article in Communications of the ACM, Vol. 7/Number 11/November 1964, pages 647-653;
(3) The A.J. Szanser article, Mathematical Linguistics Error-Correcting Methods in Natural Language Processing, Information Processing 68, North-Holland Publishing Company, Amsterdam (1969), pages 1412-1416; and
(4) The A.J. Szanser article, The Computer Journal, Vol. 1 Number 2, pages 132-134.
(5) United States Patents 3,333,243; 3,651,459 and 4,084,260 show the state of the prior art.
(6) "Approximate String Matching", Patrick A.V. Hall and G. R. Dowling, ACM Computing Surveys, Vol. 12, No. 4, pp 381-4 December, 1980.
A storage loop is formed when a storage device repeatedly and sequentially transmits its entire contents to external devices over a high speed data bus. An associative memory may be formed by attaching to this bus a device whose function it is to scrutinize in a passive manner the data stream originating from the storage device as it passes by on the bus. This attached device senses data appearing on the bus that is related in some predetermined fashion to a supplied query.
Central to associative memories of this type is some sort of word comparator device. In the prior art, a simple digital comparator distinguishes two cases: equal and unequal. Other devices can detect certain special corruptions such as the transposition of two characters or the deletion of a single character. Still other devices compare two words to arrive at an indication of how similar they are. One such device recodes the words as binary strings and then measures the Hamming distance between them. Devices in the prior art do not, however, seem to compare words in a general way that approaches the sort of similarity recognizing ability found in humans. United States Patents 3,333,243; 3,651,459; and 4,084,260 show the state of the prior art. Approximate string matching means which provide a fairly general and sophisticated measure of string similarity exist in prior art. However, in prior art, such systems are relatively slow in completing the matching process. Most such prior art string matching systems utilize dynamic programming methods as discussed in ACM Computing Surveys "Approximate String Matching". The computation time for such prior art systems is generally proportional to the square of the average length of the strings.
SUMMARY OF THE INVENTION
The system circuit, a computer peripheral device, includes an indicia string comparator device that compares strings of indicia or characters at high speeds. This system circuit performs an approximate string comparison operation. The system circuit computes a measure of string similarity. The system circuit is accessed by a computer in the same manner as a Random Access Memory. Query strings and record strings are written into the system's memory, the approximate similarity measures are automatically computed and the best matches are recorded in a memory inside the system circuit. These best matches may then be accessed by the host computer.
Typically the system circuit would be used to search a large lexicon or database of record strings for the entries which are most similar to a query string. The query string is entered into the system circuit and then each lexicon or database record string is entered. At the end, the pointers to the record strings most similar to the query string are automatically available in a list in memory in the system circuit.
The measure of string similarity is an extremely general function which can be tailored or adapted under software control for a specific application by the setting of parameters. The computation time is proportional to the average length of a string of indicia.
The system circuit in chip form also incorporates means to perform Direct Memory Access (DMA) operations.
The string comparison device is a means described in Appendix 1, which is made a part of this application. First, the string comparator means is described in "the simplest string comparison function" in pages 1 through 6 of Appendix 1; second, it is described in "string comparison function with variable character weights" in pages 7 through 10 of Appendix 1; third, it is described in "string comparison function with unmatched character compensation and variable character weight" in pages 11 through 14 of Appendix 1; fourth, it is described in "string comparison function with directional biasing and variable character weights" in pages 15 through 18 of Appendix 1; and fifth, it is described in "the full string comparison function with variable character weights, unmatched character compensation and directional biasing" in pages 19 through 22 of Appendix 1. Each of the five string comparison means is described in a mathematical description and in an algorithm description.
One of the preferred means for computation of the string similarity function described in Appendix 1 is the system circuit. The system circuit is an electronic device which includes a string comparator means. The electronic circuit performs the calculation of the string similarity function θ. The string comparator, which is an electronic circuit, and an I/O controller and ranker means together comprise an electronic device which is referred to as the String Comparator means. The system circuit is connected to a computer system by means of an interface circuit to utilize the string comparator means.
The logic design of the String Comparator means is explained in detail herebelow.
In the preferred embodiment of the String Comparator means as shown in the drawings, the string comparator means includes Bus Control means, Master Control means, Ranker means, Divider means, and a String Similarity Computer. The String Comparator means is used like a random access memory. The Bus Control means uses well-known techniques to control internal bus lines and the external signals. Master Control implements DMA operation and DMA editing in the manner described in detail herein below. The Ranker means maintains a ranked list of the best comparison results. The Divider means computes the ratio of two binary numbers from the String Similarity Computer. The String Similarity Computer computes the full string similarity function θ defined in Appendix 1. The Master Control means, the Bus Control means, the Ranker means, the Divider means and the String Similarity Computer may be logic circuits and may be embodied with well-known electronic components.
The String Similarity Computer is comprised of the String Control means, Parameter Generation means, Core section, CA section, CB section, R section, M section, TOTR section and TOTM section.
The purpose of the String Control means is to control and coordinate the activities of the rest of the string similarity computer. The String Control means is connected as shown in the drawings and has internal storage registers and memory.
The Parameter Generation means is used to obtain the indicia or character weight and compensation values. The indicia or character itself is received from the string control means or circuit. The character weight and compensation value are used by the rest of the string similarity comparator. The Parameter Generation means is connected as shown in the drawings.
The Core section is the decision-making part of the String Similarity Computer. The Core section identifies common portions of the query string and the record string, and is connected as shown in the drawings.
The CA section is used to compute the total compensation value for the indicia or characters in the string A. The CA section is connected as shown in the drawings. The CB section is used to compute the total compensation value for the indicia or characters in the string B. The CB section is shown in the drawings.
The R section computes an intermediate subtotal value. The values computed by the R section correspond precisely to the R function described in the description of the string similarity function. The R section is connected as shown in the drawings. The M section computes the numerator of the ratio defining the string similarity function. The M section is connected as shown in the drawings.
The TOTR section computes an intermediate subtotal value. The values computed by the TOTR section correspond precisely to the values of the variable TOTR used in the C programs. The TOTR section is connected as shown in the drawings. The TOTM section computes the denominator of the ratio defining the string similarity function. The TOTM section is connected as shown in the drawings.
The original disclosure of the invention related to a new and improved word comparator device, referred to hereinafter as a word associator circuit, that provides a numeric measurement of the degree of word similarity between the compared words as defined by a mathematical formula. The original disclosure also related to an associative retrieval system and method for retrieval of inexact queries in a quick and expeditious manner. The circuit may be an electrical digital circuit or other type of circuitry that will provide an output conforming to a mathematical formula to provide an improved word comparison function. An associative memory is normally thought of as a device which responds only to exact queries. Some may respond in a limited manner to inexact queries, for example see "Backend Processors is REM the Answer" in Datamation, March 1978, pg. 206-207. The system and method of the original disclosure use the improved word comparator device to form an associative memory which responds to inexact queries in a new and useful way. This word comparator device is used to rapidly locate records most similar to a query. Similarity is defined by a certain mathematical formula and is measured using a high speed digital circuit referred to hereafter as the Word Associator Circuit in the preferred embodiment. A word associator circuit will also be referred to as a string similarity computer.
This continuation in part relates to improvements to the word associator circuit described in the original patent. The original word associator circuit computed a numeric measurement of the similarity of two strings of indicia. Said numeric measurement was defined in the original patent; is defined in "The definition, computation, and application of symbol string similarity functions", Emory University, M.S. Thesis, Department of Mathematics, 1978, by the inventor; and is also defined in "The simple string comparison function" in pages 1-6 of Appendix 1.
Appendix 1 also describes four improvements to the original string comparison function. These improvements provide an extremely flexible string comparison function which can be adapted for a given application by the setting of parameters. Any of these improvements to the word associator circuit may be used in an associative memory in the same manner as the original word associator circuit.
This continuation in part also relates to a new preferred embodiment of the improved word associator circuit, a system circuit in chip form.
Similarity has three components: word, numeric, and mask. Most central to this invention is the notion of word similarity. Words are strings of symbols from some alphabet. A symbol can be any character or indicium from a finite or infinite set or alphabet. Words are also referred to as character strings, symbol strings, indicia strings, or merely strings. Word similarity, or indicia string similarity, is a measurement of the similarity between two words or indicia strings. Several variable parameters are involved in the definition of word similarity, making it very flexible. Of significance is the ease with which the degree of word similarity may be determined using digital circuits. Using existing technology, word associator circuits may be built to process serial data at rates in excess of 20,000,000 characters per second. Such a circuit is described hereinbelow.
The word similarity of two indicia strings may be expressed in three different forms: absolute, ratio, and fractional. Absolute word similarity provides a single non-negative number M which is a measure of the total weight of common portions of the indicia strings. The common portions are symbols or indicia occurring in both of the indicia strings, the total weight of which depends on the individual indicia weights and the relative positions of the common portions. Absolute word similarity is most useful when the lengths or total weights of the indicia strings are always equal or nearly equal. Ratio word similarity provides the absolute word similarity value M and the total word weight TOTM. The total word weight is a measure of the total weights of the indicia comprising the indicia strings and of the length of the indicia strings. Fractional word similarity provides a number between 0 and 1 which is equal to the ratio of the absolute word similarity value M and the total word weight TOTM.
Absolute word similarity and ratio word similarity are most useful in applications which search for a word similarity value exceeding a certain threshold. A possible application would involve monitoring a heartbeat by continuously encoding recorded heartbeats as a string of indicia. This indicia string would then be compared to one or more indicia strings which denote abnormal heartbeats. If the similarity of the two indicia strings were above a certain threshold, an alarm could be sounded.
Fractional word similarity is useful in applications such as the associative memory circuit, where a large number of indicia strings are compared with a word associator circuit and a ranked list of the most similar indicia strings is maintained. The fractional representation of word similarity provides a convenient way to compare word similarity values.
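
The relationship between the ratio and fractional forms, and the threshold test mentioned for the heartbeat example, can be sketched in C as follows. This is only an illustration: the type and function names are hypothetical, the treatment of TOTM equal to zero is an assumption, and the cross-multiplied threshold test is one possible way to compare a ratio against a threshold without dividing, not a method stated in the specification.

```c
#include <assert.h>

/* Ratio form: absolute similarity M together with total word weight TOTM. */
struct ratio_similarity {
    unsigned long m;     /* absolute word similarity M */
    unsigned long totm;  /* total word weight TOTM */
};

/* Fractional form: M / TOTM, a number between 0 and 1.
   Returning 0 when TOTM is 0 is an assumption of this sketch. */
static double fractional_similarity(struct ratio_similarity s)
{
    assert(s.m <= s.totm);
    return s.totm == 0 ? 0.0 : (double)s.m / (double)s.totm;
}

/* Threshold test on the ratio form without a division:
   M / TOTM > num / den  is equivalent to  M * den > num * TOTM
   when TOTM > 0 and den > 0 (operands assumed small enough not to overflow). */
static int exceeds_threshold(struct ratio_similarity s,
                             unsigned long num, unsigned long den)
{
    assert(den > 0);
    return s.m * den > num * s.totm;
}
```

For example, with M = 3 and TOTM = 4 the fractional similarity is 0.75, which exceeds a threshold of 1/2 but not a threshold of 3/4.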
The system includes the use of one or more associator circuits in storage loops to locate and extract records most similar to the supplied query. The system's architecture is preferably totally parallel so that it may be configured to process any number of simultaneous requests. Its memory is partitioned into storage loops so that through suitable configuration, arbitrarily large datasets may be processed. Attached to each storage loop via a dynamically variable network are multiple associator circuits to locate and extract the records most similar to the supplied query. By varying the number of storage loops versus the number of associator circuits, configurations may achieve a wide range of cost/response-time possibilities. In use, the associative retrieval system with associator circuits could off-load from a host processor the task of searching for database records. In doing this, it could provide improved services as well as entirely new services.
The method for retrieval as described in detail herein that conforms to a mathematical formula also provides a new and improved invention over and above the prior art.
When queries of words are made seeking records, exact records as well as slightly different records or similar records are produced as an output. Such outputs are said to be a minor corruption of the query or are said to be in close similarity with the query. A measurement of the degree of similarity between two strings of symbols from some alphabet is defined in "The definition, computation and application of symbol string similarity functions", Emory University, M.S. Thesis, Department of Mathematics, 1978, by the inventor, a copy of which is attached hereto and made a part hereof. This measurement agrees well with intuition while remaining mathematically simple and easy to compute, see "The application of a pattern matching algorithm to searching medical record text", in the proceedings of the second annual symposium on computer applications in medical care, p. 308-313, IEEE, 78 CH 1413-4, by the inventor et al, a copy of which is attached hereto and made a part hereof. Using this measurement, minor corruptions may be located, thus extending the conventional function of an associative memory into the area of a similarity memory. Of significance is the fact that this measurement may be trivially computed using the new and improved digital circuitry disclosed herein. Circuits to compute it are described herein and may be built to achieve very high processing rates. Therefore, in some applications, minor corruptions may be located without additional overhead. The basic mathematics of word similarity was developed by the inventor and constituted his Master's thesis at Emory University, Atlanta, Georgia, June, 1978, referred to hereinabove. The inventor is also primary author of a paper presented to and published by the IEEE, referenced hereinabove. This paper describes the application of word similarity to searching raw narrative medical record text.
It is an object of this invention to provide a string comparison device that is a new, improved, fast indicia string associator circuit.
Another object of this invention is to provide a string comparison device that provides a numeric measurement of the similarity of two indicia strings in a time proportional to the average length of the indicia strings.
It is another object of this invention to provide an indicia string comparison device that provides a numeric measurement of the degree of indicia string similarity between the compared indicia strings as defined by an algorithm.
It is another object of this invention to provide an indicia string comparison device in the form of an electronic circuit for providing indicia string or word comparison conforming to the algorithm.
It is another object of this invention to provide a system by connecting at least one indicia string comparison device into a computer system for access in the same manner as a random access memory.
A further object of this invention is to provide a method of processing data for indicia string comparison that conforms to a particular algorithm.
An additional object of this invention is to provide a system method of retrieving similar indicia strings, words, numbers, and/or masks from inexact queries.
In accordance with these and other objects which will be apparent hereinafter, the instant invention will now be described with particular reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings:
Figure 1 is the improved block diagram of the system circuit interfaced with a computer system.
Figure 1A is a block diagram of the system circuit interfaced to and monitoring a data stream.
Figure 2 is the improved block diagram of the system circuit including the string comparator means.
Figure 3 is an illustration of the improved timing.
Figure 4 is a block diagram of the String Control in Figure 2.
Figure 5 is an illustration of components of the String Control of Figure 4.
Figure 6 is a block diagram of the Parameter Generation shown in Figure 2.
Figure 7 is a block diagram of the Core Section shown in Figure 2.
Figure 8 is a block diagram of the CA section shown in Figure 2.
Figure 9 is a block diagram of the CB section shown in Figure 2.
Figure 10 is a block diagram of the R section shown in Figure 2.
Figure 11 is a block diagram of the M section shown in Figure 2.
Figure 12 is a block diagram of the TOTR section shown in Figure 2.
Figure 13 is a block diagram of the TOTM section shown in Figure 2.
Figure 14 is a block diagram of an associative retrieval system.
Figure 15 is a block diagram of the associator circuit illustrated in Figure 14.
Figure 16 is an illustration of the timing of β a θ.
Figure 17 is a block diagram of the basic word associator or comparator circuit.
Figure 18 is a block diagram of the basic word associator circuit in another system.
Figure 19 is a schematic diagram of the selector circuit shown in Figure 17.
Figure 20 is a schematic diagram of the tally memory circuit shown in Figure 17.
Figure 21 is a schematic diagram of the add circuit shown in Figure 17.
Figure 22 is a schematic diagram of the latch circuit shown in Figure 17.
Figure 23 is a schematic diagram of the test circuit shown in Figure 17.
Figure 24 is a schematic diagram of the add-latch circuit or R section shown in Figure 17.
Figure 25 is a schematic diagram of another add-latch circuit or M section shown in Figure 17.
Figure 26 is an illustration of the operation of the word comparator.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
This invention relates to a new and improved string comparator device which provides a numeric measurement of the degree of indicia string similarity between the compared strings as defined by an algorithm.
The preferred embodiment of the system circuit is an electronic device which is a string comparator means including a comparator device. Figure 1 shows a system in which this electronic device would be used. In Figure 1 the string comparator 105 performs the calculation of the string similarity function θ. The string similarity computer 105 and an I/O controller and ranker means 104 together comprise the system circuit 103 referred to as the String Comparator means. The system circuit 103 is connected to a computer system 101 by means of an interface circuit 102. The computer system 101 is shown with a keyboard 101' and a display 101" for human interaction. A variety of well-known devices for storage, communications and computation may also be attached to the computer system 101.
A sample interface circuit 102 is described in Appendix 4, including drawings. Figures A4-1 through A4-8 are made a part of this specification. Appendix 3, made a part of this specification, is a document describing how to interface and utilize the String Comparator means.
Another use of the string comparator circuit 105' is shown in Figure 1A. Here an interface means 106 monitors a stream of data 107, formats words denoting the contents of the data stream, and uses a string comparator circuit 105' to compare against one or more predefined strings. When a predetermined similarity threshold value is exceeded, an output signal 108 denotes a match condition.
The logic design of the String Comparator means is explained below. Appendix 2, made a part of this specification, gives a complete, detailed logic specification for the preferred embodiment of the String Comparator means as a Large Scale Integrated (LSI) electronic circuit. String Comparator means refers to the preferred embodiment as well as alternate circuits for performing particular functions.
The preferred embodiment of the String Comparator means 103 of Figure 1 is shown in Figure 2. It consists of Bus Control means 21, Master Control means 20, Ranker means 22, Divider means 23, and a String Similarity Computer 10.
The string comparator means is used like a random access memory as described in Appendix 3. The Bus Control 21 uses well-known techniques to control internal bus lines 24, 24' and 24" and the external signals 26.
Bus Control 21 controls all external accesses to the system circuit, and monitors the activities of the other internal components of the system circuit.
Master Control 20 implements DMA (Direct Memory Access) operation and DMA editing in the manner documented in Appendix 3. Master Control controls automatic loading of data from an external memory.
The Ranker means 22 maintains a ranked list of the best comparison results. The ranked list contains up to 16 entries; each entry consists of the string similarity value and a record pointer. Appendix 3 describes the effect of the Ranker means.
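
The behavior of such a ranked list can be modeled in software with a minimal C sketch. It is not the logic of the Ranker means 22 as specified in Appendix 2. The names are hypothetical, the similarity value is held as a `double` and the record pointer as an integer handle for illustration, and the tie rule (an earlier record stays ahead of a later one with equal similarity) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define RANK_CAPACITY 16   /* the ranked list holds up to 16 entries */

struct rank_entry {
    double   similarity;   /* string similarity value */
    unsigned record;       /* record pointer (hypothetical integer handle) */
};

struct ranker {
    struct rank_entry entry[RANK_CAPACITY];  /* sorted, best first */
    size_t count;
};

static void ranker_init(struct ranker *r) { r->count = 0; }

/* Offer one comparison result, keeping the list sorted by descending
   similarity. When the list is full, the worst entry is dropped. */
static void ranker_offer(struct ranker *r, double similarity, unsigned record)
{
    size_t i;
    assert(r->count <= RANK_CAPACITY);
    if (r->count == RANK_CAPACITY &&
        similarity <= r->entry[RANK_CAPACITY - 1].similarity)
        return;                               /* not good enough to rank */
    if (r->count < RANK_CAPACITY)
        r->count++;
    i = r->count - 1;                         /* slot opened at the tail */
    while (i > 0 && r->entry[i - 1].similarity < similarity) {
        r->entry[i] = r->entry[i - 1];        /* shift worse entries down */
        i--;
    }
    r->entry[i].similarity = similarity;
    r->entry[i].record = record;
}
```

A host program would call `ranker_offer` once per record and read `entry[0]` through `entry[count - 1]` after the last record to obtain the best matches in order.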
The Divider means 23 computes the ratio of two binary numbers from M 17 shown in Figure 11 and TOTM 19 shown in Figure 13; this ratio is expressed as a fractional binary number.
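
One way to express the ratio of two binary numbers as a fractional binary number is bit-serial restoring division, sketched below in C. This is an illustration only and is not taken from Appendix 2: the function name, the number of fraction bits, and the saturation to all ones when the two numbers are equal are assumptions.

```c
#include <assert.h>

/* Compute num/den (0 <= num <= den, den > 0) as a binary fraction with
   `bits` fractional bits, producing one quotient bit per step.
   The result q represents the value q / 2^bits.
   den is assumed small enough that shifting the remainder cannot overflow. */
static unsigned divide_fraction(unsigned long num, unsigned long den,
                                unsigned bits)
{
    unsigned long rem = num;
    unsigned q = 0;
    unsigned i;
    assert(den > 0 && num <= den && bits > 0 && bits <= 16);
    if (num == den)
        return (1u << bits) - 1;      /* saturate: 1.0 is not representable */
    for (i = 0; i < bits; i++) {
        rem <<= 1;                    /* shift remainder left one place */
        q <<= 1;
        if (rem >= den) {             /* trial subtraction succeeds */
            rem -= den;
            q |= 1u;
        }
    }
    return q;
}
```

With 8 fraction bits, M = 3 and TOTM = 4 give 192, that is 192/256 = 0.75.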
The String Similarity Computer 10 computes the full string similarity function θ defined in detail in Appendix 1. The Master Control means 20, the Bus Control means 21, the Ranker means 22, and the Divider means 23 are shown in complete detail in Appendix 2 as logic circuits. The string similarity computer means 10 is also shown in Appendix 2. Furthermore, the string similarity computer is described on a functional level in the next section.
It should be noted that Figure 1A may include any one of the following circuit configurations. First, string control 11, Parameter Generator 12, Core section 13, R section 16 and M section 17; second, string control 11, Parameter Generator 12, Core section 13, CA section 14, CB section 15, R section 16 and M section 17; third, string control 11, Parameter Generator 12, Core section 13, CA section 14, CB section 15, R section 16, M section 17, TOTR section 18 and TOTM section 19; fourth, string control 11, Parameter Generator 12, Core section 13, CA section 14, CB section 15, R section 16, M section 17, TOTR section 18, TOTM section 19 and Divider 23.
The design of the PF474 is based upon a structured design methodology which uses two-clock logic. Figure 3 shows the two clocks θ1 and θ2. The clocks are designed so that only one is high (active) at a time. When θ1 is high, inputs are accepted into a logic block; when θ1 falls, the inputs are latched into the input registers. When θ2 is high, the output of a logic block may change; when θ2 falls, the output signals are latched. These two clocks provide an orderly method for passing signals from one logic block to another. It is not possible or appropriate to discuss the features of this structured design methodology in this patent. For a thorough discussion of this design philosophy, refer to the excellent text "Introduction to VLSI Systems" by Carver Mead and Lynn Conway (Addison-Wesley, 1980).
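
The two-clock discipline described above can be modeled in a few lines of C. This is a behavioral sketch under stated assumptions, not the PF474 logic: the structure and function names are hypothetical, and the "add one" operation merely stands in for whatever combinational logic a block contains.

```c
#include <assert.h>

/* One logic block under a two-phase, non-overlapping clock. */
struct logic_block {
    int in_latch;    /* holds the input after the first clock falls */
    int out_latch;   /* holds the output after the second clock falls */
};

/* First clock high: inputs are accepted; they stay latched when it falls. */
static void phase1(struct logic_block *b, int input)
{
    b->in_latch = input;
}

/* Second clock high: the output may change; it stays latched when it falls. */
static void phase2(struct logic_block *b)
{
    b->out_latch = b->in_latch + 1;   /* stand-in combinational logic */
}

/* One full clock cycle for two chained blocks: both sample on the first
   clock, then both update on the second, so a value advances exactly one
   block per cycle. */
static void clock_cycle(struct logic_block *first, struct logic_block *second,
                        int input)
{
    assert(first != second);
    phase1(second, first->out_latch);  /* samples first's previous output */
    phase1(first, input);
    phase2(first);
    phase2(second);
}
```

Because only one clock is active at a time, the second block never sees an output of the first block that is still changing; in the sketch this shows up as the second block always working on the value produced one cycle earlier.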
The string similarity computer 10 is also described in further detail in Appendix 2. The string similarity computer 10 is now described on a functional level.
The purpose of the String Control means 11 is to control and coordinate the activities of the rest of the string similarity computer 10. The String Control means 11 is shown in Figure 4 and Figure 5. Figure 4 shows the input and output signals which are important for the logical functionality. Figure 5 shows the internal storage registers and memory.
The GO input 30 is a 1-bit signal. When GO is high (active), String Control will initiate a string comparison operation. The RBUSY signal 36 and the CTS signal 37 are status indicators. A string comparison operation can be started only when the RBUSY signal 36 is low (inactive) or the CTS signal 37 is high (active). The SBUSY output signal 35 is high (active) while a string comparison operation is in progress. This signal is used by Master Control 20.
As shown in Figure 5, the String Control means contains Random Access Memory 80, which contains areas for strings of 8-bit characters. The two strings are designated String-1 and String-2. The strings are loaded into the Random Access Memory 80 by the Bus Control means 21. String-1 and String-2 are terminated with the character NULL. Additionally, two 8-bit registers LEN1 81 and LEN2 82 contain the lengths of String-1 and String-2, respectively.
The DISP register 83 and the POS register 84 are internal 8-bit registers used to perform the Forward and Reverse scans.
When a string comparison operation is initiated, the lengths LEN1 and LEN2 are compared. The shorter of the two strings is designated String-A; the other is designated String-B. If the lengths are equal, the designation is made arbitrarily.
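
The start condition and the String-A/String-B designation described above can be sketched in C as follows. The names are hypothetical and the sketch is a software model, not the String Control logic. Where the text says the choice is arbitrary for equal lengths, this sketch assumes String-1 becomes String-A.

```c
#include <assert.h>
#include <stddef.h>

/* A comparison may start only when RBUSY is low or CTS is high. */
static int may_start(int rbusy, int cts)
{
    return !rbusy || cts;
}

struct designation {
    const char *str_a; size_t len_a;   /* String-A: the shorter string */
    const char *str_b; size_t len_b;   /* String-B: the other string   */
};

/* Designate the shorter of String-1 and String-2 as String-A. On equal
   lengths this sketch picks String-1 as String-A (an assumption). */
static struct designation designate(const char *s1, size_t len1,
                                    const char *s2, size_t len2)
{
    struct designation d;
    assert(s1 != NULL && s2 != NULL);
    if (len1 <= len2) {
        d.str_a = s1; d.len_a = len1; d.str_b = s2; d.len_b = len2;
    } else {
        d.str_a = s2; d.len_a = len2; d.str_b = s1; d.len_b = len1;
    }
    return d;
}
```

After designation, `len_b >= len_a` always holds, which is the precondition the Forward Scan relies on when it right-justifies the two strings.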
The CMD output signal 31 is a 3-bit signal which is used to sequence the activities of other parts of the string similarity computer 10. Table 1 shows the valid values for the CMD signal 31. The entries in Table 1 are listed in the order in which the values are output.
TABLE 1
CMD SIGNALS

VALUE   COMMAND
000     NOP (Idle State)
010     Pre-Forward Clear
001     Process Character (repeated during Forward Scan)
011     Pre-Reverse Clear
101     Load CB into R
001     Process Character (repeated during Reverse Scan stage 1)
100     Load CA into R
001     Process Character (repeated during Reverse Scan stage 2)
110     Transmit Results
000     NOP (Idle State)
During the Forward Scan and the Reverse Scan, the CHAR output 32 is an 8-bit signal containing a character from either String-A or String-B. Each character from String-A or String-B is transmitted once during the Forward Scan and once during the Reverse Scan. The SCID output signal 34 is low (inactive) during the Forward Scan and high (active) during the Reverse Scan. The STRID output signal 33 is high (active) when a character from String-B is being output on CHAR lines 32. The STRID output signal 33 is low (inactive) when a character from String-A is being output. The STRID output signal 33, the SCID output signal 34, and the CHAR output 32 are meaningful only when the CMD output 31 is equal to '001', indicating that a character should be processed.
The Forward Scan phase of the String Control means operation causes the computation of the function Mf as described in the previous section defining the string similarity function θ. In effect, String Control right-justifies String-1 and String-2 and then transmits every character, beginning with the leftmost and ending with the rightmost. More precisely, String Control implements the following Forward Scan Program:

    #include <stdio.h>

    char strA[128], strB[128];    /* Two strings */
    int lenA, lenB;               /* Strings' lengths; lenB >= lenA */

    void fscan(void)
    {
        int disp;                 /* DISP register */
        int pos;                  /* POS register */

        pos = 0;                  /* Initialization */
        disp = lenB - lenA;
        for (; pos < lenB; ++pos) {
            int i;                /* Temporary variable */
            i = pos - disp;
            if (i >= 0)           /* String-A starts disp positions late */
                printf("CMD'001', OUTPUT CHAR'%c' from STRING A.\n", strA[i]);
            printf("CMD'001', OUTPUT CHAR'%c' from STRING B.\n", strB[pos]);
        }
    }
The variables POS and DISP in the above program correspond exactly in function and purpose to the 8-bit registers POS 84 and DISP 83 shown in Figure 5.
The Reverse Scan phase of the String Control means operation causes the computation of the function Mr as described in the previous section defining the string similarity function θ. The Reverse Scan is divided into two stages to facilitate the computation of the compensation functions. In effect, String Control left-justifies String-1 and String-2 and then transmits every character, beginning with the rightmost and ending with the leftmost. More precisely, String Control implements the following Reverse Scan Program:

    #include <stdio.h>

    char strA[128], strB[128];    /* Two strings */
    int lenA, lenB;               /* Strings' lengths; lenB >= lenA */

    void rscan(void)
    {
        int disp;                 /* DISP register */
        int pos;                  /* POS register */

        pos = lenB;               /* Initialization */
        disp = lenB - lenA;

        /* Reverse Scan - Stage 1: the tail of String-B that extends past String-A */
        for (; disp > 0; disp--, pos--)
            printf("CMD'001', OUTPUT CHAR'%c' from STRING B.\n", strB[pos - 1]);

        /* LOAD CA INTO R */
        printf("CMD'100', LOAD CA INTO R.\n");

        /* Reverse Scan - Stage 2: positions shared by both strings.
           pos counts the characters still to be sent, so the next one is at pos - 1. */
        for (; pos > 0; pos--) {
            printf("CMD'001', OUTPUT CHAR'%c' from STRING A.\n", strA[pos - 1]);
            printf("CMD'001', OUTPUT CHAR'%c' from STRING B.\n", strB[pos - 1]);
        }
    }
The variables POS and DISP in the above program correspond exactly in function and purpose to the 8-bit registers POS 84 and DISP 83 shown in Figure 5. Also, POS and DISP are initialized in the program RSCAN to the values which they held at the end of the program FSCAN. Thus RSCAN can conveniently be run directly after FSCAN with no extra initialization.
The purpose of the Parameter Generation means 12 is to obtain the character weight and compensation values. The character itself is received from the String Control circuit 11. The character weight and compensation values are used by the rest of the string similarity computer 10. The Parameter Generation means 12 is shown in Figure 6. Figure 6 shows the input signals 31-34, the output signals 39-43 and the internal Random Access Memory 38.
The input signals are the CHAR input 32, the SCID input 34, the STRID input 33, and the CMD input 31. All of these signals are outputs of the String Control means 11.
The Random Access Memory 38 contains, for each of 255 characters, a compensation value between 0 and 7, a weight value between 0 and 7, and a bias value between -2 and +1. These values correspond to the compensation function C, the forward weight function Wf, and the bias function B which were described in the appendix defining the string similarity function θ. The Random Access Memory 38 is loaded via the Bus Control means 21.
The CHAR output signal 39 is always the same as the previous CHAR input signal 32. The CHAR signals 32, 39 are 8-bit data items. The STRID output signal 40 is always the same as the previous STRID input signal 33. The STRID signals 33, 40 are 1-bit signals.
The CMD output signal 43 is always the same as the previous CMD input signal 31. The CMD signals 31, 43 are 3-bit signals.
The C output signal 42 is a 3-bit unsigned binary signal equal to the compensation value for the character denoted by the CHAR signal 39. This compensation value is read from the Random Access Memory 38. The WGHT output signal 41 is a 3-bit signal equal to either the forward or the reverse weight of the character denoted by the CHAR signal 39. When the SCID input signal 34 is low (inactive), denoting the Forward Scan phase, the WGHT output signal 41 is equal to the value Wf read from the Random Access Memory 38. When the SCID signal is high (active), denoting the Reverse Scan phase, the WGHT output signal 41 is equal to the sum of the values of Wf and B read from the Random Access Memory 38.
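As a rough illustration of the lookup just described, the following C sketch models one entry of the parameter memory per character code and derives the WGHT and C outputs from it. The struct layout and function names are assumptions made for this example, and the range check on the reverse weight is also an assumption: the text does not say what happens if Wf + B falls outside 0..7.

```c
#include <assert.h>
#include <stdint.h>

/* One assumed parameter-memory entry per character code. */
typedef struct {
    uint8_t comp;   /* compensation value C, 0..7 */
    uint8_t wf;     /* forward weight Wf, 0..7 */
    int8_t  bias;   /* bias B, -2..+1 */
} param_entry;

/* WGHT output: Wf on the Forward Scan (scid == 0), Wf + B on the Reverse Scan. */
uint8_t wght_output(const param_entry *ram, uint8_t ch, int scid)
{
    int w = ram[ch].wf;
    if (scid)
        w += ram[ch].bias;
    assert(w >= 0 && w <= 7);   /* assumed: the table is loaded so the sum fits 3 bits */
    return (uint8_t)w;
}

/* C output: the compensation value, passed on unchanged. */
uint8_t c_output(const param_entry *ram, uint8_t ch)
{
    return ram[ch].comp;
}
```

Storing a bias rather than a second weight keeps each entry small while still letting the reverse weight differ slightly from the forward weight.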
The Core Section 13 is the decision-making part of the string similarity computer 10. It receives data from the Parameter Generation means 12, maintains character counts and determines what the rest of the string similarity computer must do. The Core Section 13 is pictured in Figure 7 with its input signals 39-43, internal TALLY memory 44, and output signals 45-52.
The input signals are CHAR 39, STRID 40, WGHT 41, C 42, and CMD 43. These input signals are all outputs of the Parameter Generation means 12.
The TALLY memory 44 is a fast clear memory means of size 256x4. The TALLY memory contains a 4-bit signed (two's complement) number in the range -7 to 7, inclusive, for each character specified by the CHAR input signal 39. The clear control of the TALLY memory zeros the entire memory. Furthermore, individual entries in the TALLY memory may be incremented or decremented. Attempting to increment the value 7 or to decrement the value -7 results in an unchanged state. This is referred to as latching at ±7. The TALLY memory 44 corresponds directly to the array T used in the C programs in the earlier sections defining the string similarity function θ.
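The latching behaviour of the TALLY memory can be sketched in C as a small array of saturating counters. The type and function names are illustrative only; an int8_t stands in for each 4-bit hardware entry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TALLY_MAX 7
#define TALLY_MIN (-7)

/* 256 entries, one per 8-bit character code. */
typedef struct { int8_t t[256]; } tally_mem;

/* Clear control: zero the entire memory. */
void tally_clear(tally_mem *m)
{
    memset(m->t, 0, sizeof m->t);
}

/* Increment one entry, latching at +7; returns the new value. */
int tally_inc(tally_mem *m, uint8_t ch)
{
    if (m->t[ch] < TALLY_MAX)
        m->t[ch]++;
    assert(m->t[ch] >= TALLY_MIN && m->t[ch] <= TALLY_MAX);
    return m->t[ch];
}

/* Decrement one entry, latching at -7; returns the new value. */
int tally_dec(tally_mem *m, uint8_t ch)
{
    if (m->t[ch] > TALLY_MIN)
        m->t[ch]--;
    assert(m->t[ch] >= TALLY_MIN && m->t[ch] <= TALLY_MAX);
    return m->t[ch];
}
```

Latching rather than wrapping means a character that occurs more than seven times in one string cannot flip the sign of its tally.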
The WX output signal 46 is equal to the arithmetic (two's complement) inverse of the WGHT input signal 41. WX 46 is a 4-bit non-positive integer.
The WGHT output signal 49 is equal to the previous WGHT input signal 41. The WGHT signals 41, 49 are 3-bit unsigned integers.
The CMD output signal 50 is a 3-bit signal that is always equal to the previous CMD input signal 43.
The CMDX output signal 48 is a 1-bit signal equal to the low-order bit of the CMD input signal 43.
The CMDB output signal 47 is a 2-bit signal. The low-order bit of CMDB 47 is high (active) only if the CMD input 43 is equal to '001', denoting a Process Character command. The high-order bit of CMDB 47 is high only if the CMD input 43 is equal to '010' or '011', denoting a Clear command.
When the CMD input 43 is equal to '010' or '011', a Clear command is specified. The entire contents of the TALLY memory 44 are cleared in this case.
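The decoding of CMD into CMDX and CMDB can be written down as a short C sketch. The enum names are invented for this example; the numeric values are the 3-bit codes of Table 1.

```c
#include <assert.h>

/* 3-bit CMD codes from Table 1 (the identifiers are illustrative). */
enum {
    CMD_NOP           = 0,  /* 000 */
    CMD_PROCESS       = 1,  /* 001 */
    CMD_PRE_FWD_CLEAR = 2,  /* 010 */
    CMD_PRE_REV_CLEAR = 3,  /* 011 */
    CMD_LOAD_CA       = 4,  /* 100 */
    CMD_LOAD_CB       = 5,  /* 101 */
    CMD_TRANSMIT      = 6   /* 110 */
};

/* CMDX: the low-order bit of CMD. */
unsigned cmdx(unsigned cmd)
{
    assert(cmd <= 7u);
    return cmd & 1u;
}

/* CMDB: bit 0 = Process Character, bit 1 = either Clear command. */
unsigned cmdb(unsigned cmd)
{
    assert(cmd <= 7u);
    unsigned low  = (cmd == CMD_PROCESS);
    unsigned high = (cmd == CMD_PRE_FWD_CLEAR || cmd == CMD_PRE_REV_CLEAR);
    return (high << 1) | low;
}
```

Since the two Clear codes differ only in their low-order bit, a downstream section that sees the high bit of CMDB can use CMDX to tell Pre-Forward Clear (CMDX low) from Pre-Reverse Clear (CMDX high).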
When the CMD input 43 is equal to '001', a Process Character command is specified. The CHAR input 39 specifies a character to be processed. Each character designates a 4-bit entry in the TALLY memory 44. Character processing consists of incrementing or decrementing the entry in the TALLY memory 44 which corresponds to the character denoted by the CHAR input 39.
When the CMD input 43 is equal to '001' and the STRID input 40 is high (active), the appropriate TALLY memory entry is incremented and latched at 7. If the result is not positive, then the T output signal 52 is high (active); otherwise the T output 52 is low (inactive).
When the CMD input 43 is equal to '001' and the STRID input 40 is low (inactive), the appropriate TALLY memory entry is decremented and latched at -7. If the result is not negative, then the T output signal 52 is high (active); otherwise the T output 52 is low (inactive).
If the T output signal 52 as computed above is high (active), then the C output signal 51 is the arithmetic (two's complement) inverse of the C input 42. Otherwise the C output 51 is equal to the C input 42. The C input 42 is an unsigned 3-bit integer. The C output 51 is a signed 4-bit integer.
The CA section 14 is used to compute the total compensation value for the characters in String-A. The CA section 14 is shown in Figure 8 with its inputs 45, 49-52 and outputs 53-58. The signal COMPA 53 is an output signal which is also fed back as an input to CA.
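The Process Character step of the Core Section 13 described above can be sketched in C as follows. The function and struct names are hypothetical. One plausible reading, offered here only as an interpretation, is that T goes high when the current character cancels an earlier unmatched occurrence of the same character in the other string.

```c
#include <assert.h>
#include <stdint.h>

/* Outputs of one Process Character step (names are illustrative). */
typedef struct {
    int t;       /* T output 52: 1 = high (active) */
    int c_out;   /* C output 51: signed, -7..+7 */
} core_out;

/* strid: 1 for a String-B character, 0 for a String-A character.
   c_in:  unsigned 3-bit compensation value (0..7). */
core_out core_process(int8_t *tally, uint8_t ch, int strid, int c_in)
{
    core_out o;
    assert(c_in >= 0 && c_in <= 7);
    if (strid) {                        /* String-B: increment, latch at +7 */
        if (tally[ch] < 7)
            tally[ch]++;
        o.t = (tally[ch] <= 0);         /* high if the result is not positive */
    } else {                            /* String-A: decrement, latch at -7 */
        if (tally[ch] > -7)
            tally[ch]--;
        o.t = (tally[ch] >= 0);         /* high if the result is not negative */
    }
    o.c_out = o.t ? -c_in : c_in;       /* negate C when T is high */
    return o;
}
```

For example, starting from a cleared tally, an 'a' from String-A drives the entry to -1 with T low; a following 'a' from String-B brings it back to 0 with T high, and the compensation value is emitted negated.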
The input signals are STRID 45, WGHT 49, CMD 50, C 51, and T 52. These signals are outputs of the CORE section 13. The output signals STRID 54, WGHT 55, CMD 56, C 57 and T 58 are always the same value as the corresponding inputs; i.e. these signals are passed through unchanged.
The COMPA output signal 53 is a 9-bit unsigned (non-negative) integer. When the CMD input signal 50 is equal to '010', denoting a Pre-Forward Clear command, the COMPA output value 53 is zero. When the CMD input signal 50 is equal to '001', denoting a Process Character command, and the value of the STRID input signal 45 is equal to the value of the T input signal 52, then the 4-bit signed input C 51 is added to the previous value of COMPA 53; the resulting sum is output as the COMPA signal 53. If an overflow occurs on this addition, the carry bit is lost; an overflow will not occur if the programming restraints documented in Appendix 3 are obeyed. When neither of the above conditions is met, the COMPA output signal 53 is the same as the previous COMPA input signal 53.
The CB section 15 is used to compute the total compensation value for the characters in String-B. The CB section 15 is shown in Figure 9 with its inputs 53-58 and outputs 59-63. The signal COMPB 59 is an output signal which is also fed back as an input to CB.
The input signals are COMPA 53, STRID 54, WGHT 55, CMD 56, C 57, and T 58. These signals are outputs of the CA section 14. The output signals STRID 60, WGHT 61, CMD 62, and T 63 are always the same value as the corresponding inputs; i.e. these signals are passed through unchanged.
The COMPB output signal 59 is a 9-bit unsigned (non-negative) integer. When the CMD input signal 56 is equal to '010', denoting a Pre-Forward Clear command, the COMPB output value 59 is zero. When the CMD input signal 56 is equal to '100', denoting a Load CA into R command, the COMPB output value 59 is equal to the COMPA input value 53. When the CMD input signal 56 is equal to '001', denoting a Process Character command, and the value of the STRID input signal 54 is not equal to the value of the T input signal 58, then the 4-bit signed input C 57 is added to the previous value of COMPB 59; the resulting sum is output as the COMPB signal 59. If an overflow occurs on this addition, the carry bit is lost; an overflow will not occur if the programming restraints documented in Appendix 3 are obeyed. When none of the above conditions are met, the COMPB output signal 59 is equal to the previous COMPB input signal 59.
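The update rules for the two compensation accumulators can be summarised in a C sketch. The function names are assumptions for this example, and the loss of the carry bit is modelled by masking to 9 bits.

```c
#include <assert.h>

#define MASK9 0x1FFu   /* COMPA and COMPB are 9-bit unsigned values */

/* Next COMPA value; cmd is the 3-bit CMD code, c the signed 4-bit C input. */
unsigned compa_next(unsigned compa, unsigned cmd, int strid, int t, int c)
{
    assert(c >= -7 && c <= 7);
    if (cmd == 2u)                       /* '010' Pre-Forward Clear */
        return 0u;
    if (cmd == 1u && strid == t)         /* '001' Process Character, STRID == T */
        return (unsigned)((int)compa + c) & MASK9;   /* carry bit is lost */
    return compa;                        /* otherwise unchanged */
}

/* Next COMPB value; compa is the COMPA input 53. */
unsigned compb_next(unsigned compb, unsigned compa, unsigned cmd,
                    int strid, int t, int c)
{
    assert(c >= -7 && c <= 7);
    if (cmd == 2u)                       /* '010' Pre-Forward Clear */
        return 0u;
    if (cmd == 4u)                       /* '100' Load CA into R */
        return compa;
    if (cmd == 1u && strid != t)         /* '001' Process Character, STRID != T */
        return (unsigned)((int)compb + c) & MASK9;
    return compb;
}
```

The two sections split the Process Character cases between them: a given character updates COMPA when STRID equals T and COMPB when it does not, never both.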
The R section 16 computes an intermediate subtotal value. The values computed by the R section correspond precisely to the R function described in the appendix 1 defining the string similarity function θ. The R section 16 is pictured in Figure 10 with its inputs 59-63 and its outputs 64-66. The RSUM signal 64 is fed back as an input to the R section 16.
The values of the output signals STRID 65 and CMD 66 are always the same as the values of the inputs STRID 60 and CMD 62.
The RSUM output signal 64 is a 10-bit non-negative (unsigned) integer. When the CMD input 62 is equal to '010' or '011', denoting a Clear command, then the RSUM output 64 is zero. When the CMD input 62 is equal to '101', denoting the Load CB into R command, then the value of the COMPB signal 59 is added to the previous value of the RSUM input signal 64; this sum is the next RSUM output 64. When the CMD input 62 is equal to '001', denoting a Process Character command, and the T input signal 63 is high (active), then twice the value of the WGHT input signal 61 is added to the value of the previous RSUM input 64; the result is the next RSUM output 64. When none of the above conditions is met, the RSUM output signal 64 is equal to the previous RSUM input signal 64.
If the result of the above additions causes an overflow, the carry bit is lost. An overflow will not occur if the programming restrictions documented in Appendix 3 are obeyed.
The M section 17 computes the numerator of the ratio defining the θ string similarity function. The M function is described in appendix 1.
The M section 17 is shown in Figure 11. The inputs to the M section 17 are the signals RSUM 64 (a 10-bit unsigned integer), STRID 65 and CMD 66. The outputs are the READY output signal 68 and the MVAL output signal 67. MVAL 67 is a 16-bit non-negative integer which is fed back as an input to the M section 17.
The output READY 68 is high (active) only when the CMD input 66 is equal to '110', denoting a Transmit Results command. While the READY signal 68 is active, the outputs of the M section 17 and the TOTM section 19 are valid and ready for the DIVIDER section 23 to use them.
When the CMD input 66 is equal to '010', denoting a Pre-Forward Clear command, the MVAL output 67 is zero. When either of the following two conditions holds: (1) the CMD input 66 is equal to '100', denoting a Load CA into R command, or (2) the STRID input 65 is high (active) and the CMD input 66 is equal to '001', denoting a Process Character command; then the MVAL input 67 is added to the RSUM input 64 and the result of the addition is the next MVAL output 67. When none of the above conditions are met, then the MVAL output 67 is unchanged from the previous MVAL input 67.
If the result of the above additions causes an overflow, the carry bit is lost. An overflow will not occur if the programming restrictions documented in Appendix 3 are obeyed.
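The RSUM and MVAL update rules described for the R section 16 and the M section 17 can be sketched together in C. Function names are illustrative, the CMD codes are those of Table 1, and the lost carry bit is modelled by masking to 10 and 16 bits respectively. The text does not state whether MVAL sees the RSUM value before or after the same step, so the sketch simply takes RSUM as a parameter.

```c
#include <assert.h>

/* CMD codes from Table 1 (identifiers are assumed for this sketch). */
enum { CMD_PROCESS = 1, CMD_PRE_FWD_CLEAR = 2, CMD_PRE_REV_CLEAR = 3,
       CMD_LOAD_CA = 4, CMD_LOAD_CB = 5 };

/* Next RSUM (10-bit unsigned). */
unsigned rsum_next(unsigned rsum, unsigned cmd, int t,
                   unsigned wght, unsigned compb)
{
    assert(wght <= 7u);
    if (cmd == CMD_PRE_FWD_CLEAR || cmd == CMD_PRE_REV_CLEAR)
        return 0u;                                /* either Clear command */
    if (cmd == CMD_LOAD_CB)
        return (rsum + compb) & 0x3FFu;           /* '101': add COMPB */
    if (cmd == CMD_PROCESS && t)
        return (rsum + 2u * wght) & 0x3FFu;       /* add twice the weight */
    return rsum;
}

/* Next MVAL (16-bit unsigned). */
unsigned mval_next(unsigned mval, unsigned cmd, int strid, unsigned rsum)
{
    assert(rsum <= 0x3FFu);
    if (cmd == CMD_PRE_FWD_CLEAR)
        return 0u;                                /* only '010' clears MVAL */
    if (cmd == CMD_LOAD_CA || (cmd == CMD_PROCESS && strid))
        return (mval + rsum) & 0xFFFFu;           /* accumulate RSUM */
    return mval;
}
```

Note the asymmetry: RSUM is zeroed by both Clear commands, while MVAL is zeroed only by Pre-Forward Clear, so MVAL carries the forward-scan total into the reverse scan.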
The TOTR section 18 computes an intermediate subtotal value. The values computed by the TOTR section correspond precisely to the values of the variable TOTR used in the C programs in the appendix 1 defining the θ string similarity function.
The TOTR section 18 is shown in Figure 12 with its input signals 45-48 and its output signals 71-74. The TOTRSUM output 71 is an 11-bit signed (two's complement) integer which is fed back as an input to the TOTR section 18. The WX input 46 is a 4-bit signed (two's complement) integer. Both WX 46 and TOTRSUM 71 have only non-positive values.
The CMDX output 74 and the STRID output 72 are always equal to the previous CMDX input 48 and STRID input 45, respectively.
When the high-order bit of the CMDB input 47 is high (active), denoting a Clear command, then the TOTRSUM output 71 is zero. When the low-order bit of the CMDB input 47 is high (active), denoting a Process Character command, then the WX input 46 is added to the TOTRSUM input 71 and the result of the addition is the next TOTRSUM output value 71. When neither of the above conditions is met, then the TOTRSUM output 71 is unchanged from the previous TOTRSUM input 71.
If the result of the above additions causes an underflow, the carry bit is lost. An underflow will not occur if the programming restrictions documented in Appendix 3 are obeyed.
The TOTM section 19 computes the denominator of the ratio defining the θ string similarity function. The TOTM function was described in appendix 1.
The TOTM section 19 is shown in Figure 13 with its inputs 71-74 and its output TOTMVAL 75. The TOTMVAL signal 75 is a 16-bit non-positive integer which is fed back to the TOTM section 19 as an input. A non-positive integer is either zero or a negative number in two's complement format with an implicit negative sign bit. The reason TOTMVAL 75 is a non-positive integer is to simplify the circuit design of the DIVIDER section 23. The TOTRSUM input 71 is an 11-bit signed (two's complement) number.
When the high-order bit of the CMDB input signal 73 is high (active) and the CMDX input signal 74 is low (inactive), denoting a Pre-Forward Clear command, then the TOTMVAL output 75 is zero. When the low-order bit of the CMDB input 73 is high (active), denoting a Process Character command, and the STRID input 72 is high (active), then the TOTRSUM input 71 is added to the TOTMVAL input 75 and the result is the next TOTMVAL output 75. When neither of the above conditions is met, then the TOTMVAL output 75 is unchanged from the previous TOTMVAL input 75.
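The TOTRSUM and TOTMVAL update rules can be sketched in C along the same lines. The function names are hypothetical. TOTRSUM is wrapped to 11-bit two's complement to mimic the lost carry bit; for TOTMVAL the sketch keeps the plain arithmetic value and does not model the 16-bit implicit-sign format or its wrap-around, which is a simplification on the part of this example.

```c
#include <assert.h>

/* WX: two's complement negation of the 3-bit WGHT, so WX lies in -7..0. */
int wx_from_wght(unsigned wght)
{
    assert(wght <= 7u);
    return -(int)wght;
}

/* Wrap a value to 11-bit two's complement (range -1024..1023). */
static int wrap11(int v)
{
    unsigned u = (unsigned)v & 0x7FFu;
    return (u & 0x400u) ? (int)u - 0x800 : (int)u;
}

/* Next TOTRSUM; cmdb bit 1 = Clear, bit 0 = Process Character. */
int totrsum_next(int totrsum, unsigned cmdb, int wx)
{
    if (cmdb & 2u)
        return 0;
    if (cmdb & 1u)
        return wrap11(totrsum + wx);
    return totrsum;
}

/* Next TOTMVAL (arithmetic value only; wrap-around is not modelled). */
long totmval_next(long totmval, unsigned cmdb, unsigned cmdx,
                  int strid, int totrsum)
{
    if ((cmdb & 2u) && !cmdx)        /* Pre-Forward Clear only */
        return 0;
    if ((cmdb & 1u) && strid)        /* Process Character with STRID high */
        return totmval + totrsum;
    return totmval;
}
```

Because WX is never positive, TOTRSUM and TOTMVAL only ever move downward from zero, which matches the statement that both hold non-positive values.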
If the result of the above additions causes an underflow, the carry bit is lost. An underflow will not occur if the programming restrictions documented in Appendix 3 are obeyed.
Complete logic drawings for the chip including the string comparator means shown in Figures 1 and 2 are incorporated herein as part of this specification as Appendix 2, including drawing numbers 1953-000-22, five sheets. These drawings are being used by and made by American Microsystems Inc. under contract to Proximity Devices Corporation to manufacture the chip, to be known by the trademark "PF474", as a Large Scale Integrated (LSI) circuit using a 5-micron NMOS process.
Below, we give the correspondence of node numbers on the complete logic drawings in Appendix 2 to the numbers shown in Figures 1 through 13. This will show how the logic functions described in the body of the patent have been implemented in the logic drawings.
Figure 2 shows a block diagram of the PF474. The Master Control means 20 and the Bus Control means 21 are implemented with nodes 2700-3299. The Ranker means 22 is implemented with nodes 3300-3900. The Divider means 23 is implemented with nodes 1-299. The String Similarity Computer 10 is implemented with nodes 700-2000.
The String Similarity Computer 10 consists of several different logic blocks. The String Control is implemented as nodes 300-600. The Core section 13 is implemented as nodes 700-1023. The CA section 14 is implemented as nodes 1024-1122, 1182-1207. The CB section 15 is implemented as nodes 1123-1170, 1214-1303. The R section 16 is implemented as nodes 1326-1462. The M section 17 is implemented as nodes 1463-1622. The TOTR section 18 is implemented as nodes 1630-1738, 1768-1788. The TOTM section 19 is implemented as nodes 1740-1767, 1789-1978.
We present in tabular form the node numbers corresponding to the signals, registers and memories shown in Figures 4-13.
NODE NUMBER CORRESPONDENCE

Signal Name                Reference   Figure           Node Numbers in Appendix 2
GO                         30          Figure 4         3044
RBUSY                      36          Figure 4         111
CTS                        37          Figure 4         108
SBUSY                      35          Figures 4, 6     2509
CMD                        31          Figures 4, 6     2511-2513
CHAR                       32          Figures 4, 6     2500-2507
STRID                      33          Figures 4, 6     2510
SCID                       34          Figures 4, 6     2508
String Control RAM         80          Figure 5         2530-2699
LEN1 register              81          Figure 5         2486-2492
LEN2 register              82          Figure 5         2493-2499
DISP register              83          Figure 5         2333-2399
POS register               84          Figure 5         2153-2160, 2215-2266
Parameter Generation RAM   38          Figure 6         380-563
CHAR                       39          Figures 6, 7     357-364
STRID                      40          Figures 6, 7     371
WGHT                       41          Figures 6, 7     368-370
C                          42          Figures 6, 7     365-367
CMD                        43          Figures 6, 7     372-374
Tally Memory               44          Figure 7         823-926
STRID                      45          Figures 7, 8     811
WX                         46          Figures 7, 8     807-810
CMDB                       47          Figures 7, 8     1700-1701
CMDX                       48          Figures 7, 8     1702
WGHT                       49          Figures 7, 8     701-703
CMD                        50          Figures 7, 8     815-817
C                          51          Figures 7, 8     818-821
T                          52          Figures 7, 8     822
COMPA                      53          Figures 8, 9     1114-1122
STRID                      54          Figures 8, 9     1106
WGHT                       55          Figures 8, 9     1107-1109
CMD                        56          Figures 8, 9     1102-1104
C                          57          Figures 8, 9     1110-1113
T                          58          Figures 8, 9     1105
COMPB                      59          Figures 9, 10    1227-1235
STRID                      60          Figures 9, 10    1240
WGHT                       61          Figures 9, 10    1241-1243
CMD                        62          Figures 9, 10    1236-1238
T                          63          Figures 9, 10    1239
RSUM                       64          Figures 10, 11   1453-1462
STRID                      65          Figures 10, 11   1450
CMD                        66          Figures 10, 11   1447-1449
MVAL                       67          Figure 11        1607-1622
READY                      68          Figure 11        1605
TOTRSUM                    71          Figures 12, 13   1774-1784
STRID                      72          Figures 12, 13   1788
CMDB                       73          Figures 12, 13   1785-1786
CMDX                       74          Figures 12, 13   1787
TOTMVAL                    75          Figure 13        1931-1946
The chip manual entitled Advanced Product Description is included as Appendix 3 and made a part hereof to provide disclosure of use of the chip.
Appendix 4 is included herein and made a part hereof as an example of an electrical interface of the chip with the S-100 Bus, a widely used system for computer interconnection. The circuit described in Appendix 4 supports appropriate communication between the chip, a 4K-word by 8-bit RAM, and any of the widely used computer systems which are compatible with the S-100. Appendix 4 includes drawings A4-1 through A4-8.
The improved string comparator is based on the associative memory circuit originally disclosed and filed on March 14, 1979 as Serial Number 20,618, which includes a word associator circuit shown in detail as an electrical digital circuit in Figures 14 through 26. The system or associative memory circuit is shown in Figure 14. This associative memory circuit is an improved associative retrieval device that includes the use of the word associator or comparator circuit connected in a storage loop to locate and extract records that are most similar to the supplied query. Inexact queries will rapidly locate records similar with respect to word, numeric and mask related measurements of similarity. The new and improved method that is set forth below in detail provides a method of word comparison and a method of processing in the improved associative memory circuit or associative retrieval device. The processing is preferably in a parallel configuration that provides rapid response to queries, while processing a large number of simultaneous requests.
Referring now to Figure 14, most internal data traffic within the associative memory circuit 310 passes through shared memory 312, such as a time multiplexed multi-port random access read-write memory of any well known design such as TI's 74200. Each of the many ports of the shared memory 312 is allotted a brief time slice on the order of one millisecond. A port may disconnect prior to this time elapsing.
The associative memory circuit 310 communicates with the outside world through its communications modules 314 and 314' of any well known design. A plurality of communications modules may be connected as illustrated by numeral 318 and the small circles or dots. The communications modules 314 and 314' are microcomputer based flexible interfaces responsible for decoding requests and then supervising the operations of the associative memory circuits 310 to satisfy the request. The communications modules or circuits may communicate with the other associative memory circuits 310 using shared memory through buses 316 and 316' in any well known manner and by use of any well known design. These communications modules might also perform considerable preprocessing before passing a query on to the other system components.
The main storage units (MSU) 320 and 320' of any well known design are devices that contain the actual records to be searched in memory units of any well known design. The main storage units contain any of a variety of well known control circuits to transmit these records in a fixed format over a bus. A plurality of main storage units may be used as illustrated by numeral 322 and the dots. The transmission format requires the simultaneous transmission of record characters taken sequentially from the record moving from right to left and from left to right; see Figure 25 and the in-use description set forth hereinbelow. Numeric portions of a record are transmitted separately. The buses or lines 324 and 324' also contain control and timing signals, error correction codes and a data path of well known design for use in the communications between associator circuits 342, 344, 342' and 344' and extractor circuits 356 and 356'. Within an MSU 320 and 320', data might be compressed to conserve resources by any well known means. The main storage units (MSU) 320 and 320' might be formed using virtually any of today's data storage devices. Following the transmission of each record along lines 324 and 324', a short blanking period is required to permit the associator circuits to initialize themselves for another record. Prior to the transmission of each record's data, the MSU 320 and 320' must transmit an internal record number for the record that follows. These numbers should be assigned sequentially by the MSU 320 and 320'.
The control circuit 330 is connected to the MSU devices 320 and 320' by bus 326. The control circuit 330 is responsible for all update and control of the MSUs. The control circuit may consist of one or more simple microcomputers of well known design. Control circuit 330 communicates through shared memory 312 over bus 340. Optionally, a direct interface 332 of well known design might be attached by bus 334. This would permit a direct data path from an MSU 320 and 320' to an external high speed device. This would facilitate the rapid loading of an entire MSU 320 and 320' as might occur at bootstrap time.
All data storage loops generated by the MSU devices 320 and 320' feed into network switching 336 by bus 324 and 324'. The network switching circuit 336 is responsible for routing, through bus 338 or 338', data from an MSU 320 and 320' to a vacant associator circuit 342 or 344 as well as 342' or 344' to satisfy a query. Additional associator circuits may be connected between 342 and 344 and between 342' and 344' as illustrated by numerals 346 and 346' and the dots. Additional parallel circuits may also be interconnected as illustrated by numeral 348 and the dots. Network switching circuit 336 is connected by bus 352 to a control device 350 of well known design, which processes requests communicated through shared memory 312 by connection bus or line 354. Control device 350 decodes these requests and decides which requests are to be processed and in what order. Then control device 350 communicates to network switching 336, over line 352, a specific order to reconfigure the network.
The associator circuits (AC) 342, 344, 342' and 344' are an important part of this invention. The associator circuits 342, 344, 342' and 344' are connected in strings terminated at one end by single or multiple extractor circuits 356 and 356', respectively, by continuations of bus 338 or 338', respectively, and at the other end by the network switching module 336. Data from a selected MSU passes through network switching 336 and then through an associator circuit 342, 344, 342' or 344'. This circuit scrutinizes the data as it passes, looking for records that are very similar to the query provided. Of significance here is the fact that the word associator circuits, a part of 342, 344, 342' and 344' (within each associator circuit), can look for similar records at very high data rates. It is expected that data rates on the bus in excess of 20,000,000 characters per second are quite possible using today's standard technology. The associator circuits 342, 344, 342' and 344' flag the most similar records, which are then extracted from the data stream by the extractor circuits 356 and 356' and eventually passed back through shared memory 312 over bus 360 to the communications circuits 314 and 314'.
The diagnostic computer 364, also of any well known design, is connected in a well known manner to the associative memory 310 to provide system performance statistics and maintenance information.
Referring now to Figure 15, the basic module is referred to by numeral 343, which is a more detailed block diagram of associator circuits 342 and 342' of Figure 14. Each pair of associator circuits in Figure 14 is similar to the Figure 15 illustration. Figure 15 shows the associator circuit 343 along with the basic interconnections. The query is stored in query storage 370. As records pass by on the data bus 338, the records are received by the associator circuit on interface 372 over bus 374, where the records are merged with query characters transmitted over bus 376 in an appropriate manner and then forwarded through buses 378, 380, and 382 to the three types of associator circuitry. The three types of associator circuitry are: 1) a word associator circuit 384, 2) a number associator circuit 386, and 3) a mask associator circuit 388.
Within the word associator circuit 384 exist two circuits designated by numerals 400 and 400', one of which is shown in greater detail in block diagram form in Figure 17 and in schematic form in Figures 19 through 25. The word associator circuit 384 combines the output of circuits 400 and 400' at the end of each record to arrive at the degree of word similarity. If the basic circuit of Figure 17 is used, then at the end of each record the M outputs from each copy of the circuit are added together by any well known means to arrive at the numerator of the fraction that equals the degree of word similarity. The denominator is computed by any well known means, including table lookup by circuit 384, and is equal to L(L+1), where L is the length of the compared words. If the more complex circuit of Figure 18 is used, then the numerator is computed by any well known means and is equal to twice the sum of the M quantities output from the two copies of the circuit. The denominator is computed by any well known means and is equal to the sum of the TOTM quantities output from each copy of the circuit. The word associator circuit 384 may or may not actually perform a division to arrive at the degree of word similarity. Instead, the ranker 396 and the other associator circuits 386 and 388 might work entirely with fractional representations of similarity.
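The two ways of forming the similarity fraction described above can be illustrated with a short C sketch. The type and function names are hypothetical, and the cross-multiplied comparison is offered only as one way a ranker could work with fractions without dividing; the source does not say how the comparison is actually done.

```c
#include <assert.h>

/* A similarity kept as a fraction rather than a quotient (illustrative type). */
typedef struct {
    unsigned long num;
    unsigned long den;
} ratio;

/* Basic circuit (Figure 17): numerator = sum of the two M outputs,
   denominator = L(L+1), with L the length of the compared words. */
ratio basic_similarity(unsigned long m1, unsigned long m2, unsigned long len)
{
    assert(len > 0);
    ratio r = { m1 + m2, len * (len + 1) };
    return r;
}

/* Enhanced circuit (Figure 18): numerator = twice the sum of the M outputs,
   denominator = sum of the two TOTM outputs (taken here as magnitudes). */
ratio enhanced_similarity(unsigned long m1, unsigned long m2,
                          unsigned long totm1, unsigned long totm2)
{
    assert(totm1 + totm2 > 0);
    ratio r = { 2 * (m1 + m2), totm1 + totm2 };
    return r;
}

/* Returns 1 if x < y, comparing by cross-multiplication (no division). */
int ratio_less(ratio x, ratio y)
{
    assert(x.den > 0 && y.den > 0);
    return x.num * y.den < y.num * x.den;
}
```

Keeping numerator and denominator separate lets two records be ranked with two multiplications and a comparison, which is usually cheaper in hardware than a divider.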
Using the basic circuit of Figure 17 as circuits 400 and 400' in Figure 15 computes the basic form of word similarity given by the mathematical formula disclosed herein. It should also be noted that Figure 18 is an enhanced version of the circuit in Figure 17.
The word associator circuit 384 is mainly made up of circuits 400 and 400' and interconnecting circuitry of well known design.
Again, referring to Figure 15, at the end of each record, the three associators forward their "opinion" of how similar the record and the query were over buses 390, 392, and 394, respectively, to a ranker 396 of well known design. If the record was a perfect match, then it is marked by the ranker 396 for immediate extraction. Otherwise it is ranked relative to the prior records processed. Only the N highest ranking records are maintained in the ranker 396 by their internal record numbers. Here, N is an integer design parameter chosen in any well known manner. Loading of the many parameters involved in the association process is controlled by an onboard microprocessor or controller 398 of well known design. The microprocessor is connected to the basic module in a well known manner. When all records have been observed, the ranker waits for the highest ranking records to appear again in the storage loop. As they appear, the ranker 396 marks them for extraction in any well known manner.
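The bookkeeping performed by the ranker can be sketched in C as a small fixed-size table of the best records seen so far. All names are hypothetical, the value of N is arbitrary, and the use of a floating-point score with 1.0 meaning a perfect match is an assumption made for this example; the source leaves the representation open.

```c
#include <assert.h>

#define RANK_N 3   /* design parameter N; the value 3 is arbitrary */

typedef struct {
    unsigned long recno;   /* internal record number */
    double score;          /* assumed similarity in 0.0..1.0 */
} ranked;

typedef struct {
    ranked best[RANK_N];
    int count;
} ranker;

/* Offer one record. Returns 1 for a perfect match (mark for immediate
   extraction); otherwise keeps it only if it is among the N best so far. */
int ranker_offer(ranker *r, unsigned long recno, double score)
{
    assert(r->count >= 0 && r->count <= RANK_N);
    if (score >= 1.0)
        return 1;
    if (r->count < RANK_N) {
        r->best[r->count++] = (ranked){ recno, score };
        return 0;
    }
    int worst = 0;                       /* find the lowest-ranked entry */
    for (int i = 1; i < RANK_N; i++)
        if (r->best[i].score < r->best[worst].score)
            worst = i;
    if (score > r->best[worst].score)    /* replace it if the new one is better */
        r->best[worst] = (ranked){ recno, score };
    return 0;
}

/* On the second pass: is this record number one of the retained N? */
int ranker_holds(const ranker *r, unsigned long recno)
{
    for (int i = 0; i < r->count; i++)
        if (r->best[i].recno == recno)
            return 1;
    return 0;
}
```

Because only record numbers are retained, the ranker needs a second pass of the storage loop to pick up the record contents, which is the waiting step described in the text.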
The basic method for computing the degree of word similarity in a non-complex and expeditious manner involves processing symbols as they occur in a bidirectional serial data stream. By bidirectional serial, we mean that successive positions from each of the two words under comparison and from their flips are simultaneously transmitted. For example, imagine two observers of the data stream performing the procedure or method as illustrated in Figure 26. Of significance is the fact that each observer needs knowledge only of the data instantaneously before him in the data stream. We describe the procedure from the standpoint of these observers. We describe the information they must remember, the computations they must perform and the decisions they must make. Each observer performs an identical method; the first observes the transmission of the words, the second observes the transmission of the flips of the words. Before starting, each observer must perform certain initial tasks. After the data has passed, they combine their knowledge to arrive at the degree of word similarity between the transmitted words. First, the block diagram shown in Figure 17 will be described; second, the method performed by each observer will be described; and then the method in which their knowledge is eventually combined will be described.
Referring now to Figure 17, the one word associator circuit in block diagram form is illustrated as numeral 400. Two of these circuits 400 are included in the word associator circuit 384 in Figure 15. The data selector 402 illustrated in Figure 17 is shown in detail in Figure 19 and may utilize two quadruple bus gates such as Texas Instruments, Incorporated circuits 74125 and 74126, Exhibit A in the original application and made a part of this application. It has two input busses 404 and 406 entering from above. One of these two inputs in one time frame is routed to a single output 408 exiting below. Figure 19 discloses the schematic for a single bus line. Clock input potential θ 110 is connected to the selector 402. Two clocks are utilized in circuit 400, Figure 17. They are referred to as φ and θ and are graphically illustrated in Figure 16 to disclose their interrelationship. Their interrelationship and purpose are described and disclosed in the diagram illustrated in Figure 16. The purpose of φ is to select either the read or write mode of operation in the tally memory 114 in Figure 17 and to provide certain triggers in latches 170 and 136 of Figure 17. When φ is low, the read mode of operation in 114 is selected. When φ is high, the write mode of operation in 114 is selected. The event consisting of a low to high transition of φ may serve as a trigger to latch components 170 and 136 of Figure 17.
The purpose of θ is to define whether a query or a bus character is currently being processed. A complete cycle of φ, corresponding to a read/write cycle in 114, occurs during each half cycle of θ as shown in Figure 16. Another purpose of θ is to provide a trigger to latch 182 of Figure 17. The event consisting of a high to low transition of θ may serve as a trigger to latch 182. The clock θ is used as a control input in blocks 144, 126 and 402 of Figure 17.
The main portion of the Figure 17 circuit is designated by numeral 112 and includes a random access read/write memory 114 referred to as tally or a tally memory. The memory address enters from above through bus 408. The read/write mode of operation is selected by the read/write potential φ entering from the right through line 116 from a clock means of well known design not shown. When the read/write potential is low, the read mode is selected. When the read/write potential is high, the write mode is selected. Data exits from the lower left on bus 120 and enters from the upper right.
In the preferred embodiment, tally is organized as 256 8-bit words. Thus, addresses and data are 8-bit quantities. Figure 20 shows the schematic for each bit of a tally word. Tally may include one or more Texas Instruments Incorporated Hex inverters 122, No. 7404, Exhibit AA in the original application, and 256-bit read/write memories 124, No. 74200, Exhibit B in the original application and made a part hereof disclosing the circuits. It should be designed so that its contents may rapidly be set to zero by clear means 118. Simply setting all cells simultaneously to zero may, however, not be practically feasible due to power surge and overheating considerations. Therefore, several cycles may be necessary to clear the memory.
Each cycle would then clear a fixed portion of the memory. It is not necessary to actually set all of the cells to zero. An extra bit associated with each word could be maintained. To clear the memory, only those extra bits would be cleared. Then, when a memory word is read, this extra bit is tested. If it is zero, then zero is read out instead of the actual memory contents. This extra bit is set only when its associated location is written into. Now, if one is read out, then the actual memory contents are presented as usual to the outside world. In this way, the extra bits effect a logical clearing of the memory while avoiding a physical clearing of all of the cells. Techniques such as these serve to significantly expedite the clearing operation, but such procedures are not necessary because there are well known standard procedures available.
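The logical clearing technique just described may be illustrated by a short software model. This is a minimal sketch only; the class name `LogicallyClearedMemory` and its methods are hypothetical and do not correspond to any numbered component.

```python
class LogicallyClearedMemory:
    """Model of a memory with one extra 'written since clear' bit per word."""

    def __init__(self, size=256):
        self.cells = [0] * size        # the actual memory contents
        self.valid = [False] * size    # the extra bit associated with each word

    def clear(self):
        # Only the extra bits are cleared; stale data remains in the cells.
        self.valid = [False] * len(self.valid)

    def read(self, address):
        # If the extra bit is zero, zero is read out instead of the contents.
        return self.cells[address] if self.valid[address] else 0

    def write(self, address, value):
        # The extra bit is set only when its location is written into.
        self.cells[address] = value
        self.valid[address] = True
```

After `clear()`, every location reads as zero even though the underlying cells are untouched, which is the effect the extra bits are intended to produce.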
The add block of Figure 17 is an incrementer/decrementer 126. Data enters from the right. Depending on the state of the input potential on bus 128, the value of the entering data is either incremented or decremented before exiting the two four-bit binary full adders circuit 130 on bus 132, as shown in Figure 21. The adders 130 may include a Texas Instruments Incorporated 4-bit binary full adder No. 7483, Exhibit C in the original application and made a part hereof. Hex inverters 134, No. 7404, Exhibit AA in the original application, are connected to the input potential over bus 128. The incrementer/decrementer 126 in Figure 17 is actually an adder in which one of the summands is restricted to either plus or minus one. If the input potential of 128 is zero, then plus one is added. If the input potential of 128 is one, then minus one is added.
A positive edge triggered data latch 136 in Figure 17 is connected to bus 132 by bus 138. Data enters through bus extension 138 from the left and exits from the right on bus 140 to tally 114. On the positive going edge of the read/write potential φ, the contents entering latch 136 are latched and become the output from the latch over bus 140. In the preferred embodiment shown in Figure 22, the latch is two 4-bit D-type registers with 3-state outputs 142, Texas Instruments Incorporated Number 74173, attached to the original application as Exhibit D and made a part hereof.
A convertible sign tester 144 in Figure 17 has an input through bus 132 for data entering from the right, and the tester 144 determines if this data is non-positive or non-negative, depending on the state of the input potential on bus 146 entering from below. The test is performed relative to two's complement arithmetic. If the input potential is low, then test will output high on output bus 148 to the left, provided that its input is less than or equal to zero. If the input potential is high, then test will output high provided that its input is greater than or equal to zero. In the preferred embodiment, tester 144 is shown in Figure 23 as an 8-bit device. The input 132 is connected to two dual 5-input positive NOR gates 150 and 152, Texas Instruments No. 74260, attached in the original application as Exhibit E and made a part hereof, connected to one gate of quadruple 2-input positive AND gate 154, Texas Instruments No. 7408, attached in the original application as Exhibit EE and made a part hereof. Gate 154 is connected to one gate of a quadruple 2-input positive OR gate 156, Texas Instruments Incorporated No. 7432, attached in the original application as Exhibit EEE and made a part hereof. Gate 156, Texas Instruments No. 7432, is also connected to AND gate 158. AND gate 158, Texas Instruments No. 7408, is connected to one of the input lines 132 and line 146 through inverter 160, a Texas Instruments No. 7404. The output of 156 is connected to the input of 162, that is the same as 156, having another input from AND gate 164 with input from bus 146 and the output of inverter 168, a Texas Instruments Incorporated 7404. The exhibits are incorporated herein and made a part hereof.
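The behaviour of the sign tester may be modelled as follows. This sketch is behavioural only and does not reproduce the gate arrangement of Figure 23. The function names are hypothetical, and the assignment of the non-positive test to a low control input and the non-negative test to a high control input is an assumption taken from the step by step example given later, in which the test outputs 0 for the value 1 when θ is low and 1 for the value 0 when θ is high.

```python
def to_signed8(bits):
    """Interpret the low 8 bits of `bits` as a two's complement number."""
    bits &= 0xFF
    return bits - 256 if bits & 0x80 else bits


def sign_test(bits, control):
    """Behavioural model of an 8-bit convertible sign tester.

    control == 0: output 1 when the value is non-positive (<= 0).
    control == 1: output 1 when the value is non-negative (>= 0).
    """
    value = to_signed8(bits)
    if control:
        return int(value >= 0)
    return int(value <= 0)
```

Zero satisfies both tests, which is why a tally that has just returned to zero always causes R to be incremented.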
A clearable edge triggered latch 170 is shown in Figure 17 and shown in detail in Figure 24 with an adder attached as described below. Input 172 inserts a one (1) in latch 170. The output is transmitted on bus 174 to add-latch 182. The adder of 170 has two inputs on busses 148 and 172 and a single output on bus 174 which is the sum of the inputs. The output 178 of the adders 176, such as a 4-bit binary full adder of Texas Instruments, Incorporated No. 7483, attached as Exhibit C in the original application, becomes the input to the latches 180, such as 4-bit D-type registers with 3-state outputs, Exhibit D. One of the inputs to the adder is the output of the latch. The other input enters from above in Figure 24 and is wired permanently to be equal to 1. The latches 180 are similar to item 142 and are triggered from bus 148 when it is high at the positive going edge of the read/write potential. The current latch contents exit from below over bus 174 to add-latch 182. In the preferred embodiment, latch 170 is a latch coupled to an adder with an 8-bit output and two 7-bit inputs as shown in Figure 24. The clearing connections are shown in Figure 24, but clearing may be accomplished in any well known manner.
Add-latch 182, shown in Figures 17 and 25, is like item 170. Its input enters from bus 174 from above. Add-latch 182 is triggered on the negative going edge of the clock potential θ. In the preferred embodiment, add-latch 182 is a 13-bit latch that includes 4-bit binary full adders 184, such as Texas Instruments Corporation No. 7483, coupled, that is connected, to 4-bit D-type registers with 3-state outputs 186, such as Texas Instruments Corporation No. 74173. Indicator/readout devices of well known design may be connected thereto. The output of item 182 is an electrical output signal that may be translated to readable or other indications by any well known device. Clearing connections are not shown in Figure 25, but any well known devices or procedures may be used.
Referring now to illustration Figure 26, the memory 190 contains the words ABC and ABB which are to be compared. They are transmitted via two transmitters 192 and 194 over a data stream four characters wide, as illustrated. The top half of the stream contains the transmission of the unaltered words and the bottom half contains the transmission of the flips of the words. On each side of the stream sits an observer, 196 and 200. Each observer is watching a single column at a time as columns flow from left to right. The memory 190 and transmitters 192 and 194 correspond to the MSU 320 of Figure 14, and the data stream roughly, for illustration purposes, corresponds to the data bus 338 of Figure 15 (although query characters do not occur on the data bus or in the MSU). The two observers correspond for illustration purposes to the two copies of the circuit 400 shown in Figure 17 contained within the word associator 384 of Figure 15. To perform his appointed task, each observer must "remember" a numeric quantity associated with each alphabet member. In this example there are but three: A, B, and C. This collection of quantities corresponds for illustration purposes to tally 114 of Figure 17. Each observer must also "remember" a numeric quantity R that is 170 and another, M, that is 182. These correspond for illustration purposes directly with Figure 17. At each instant in time, each observer notices two characters before him. He processes one at a time in some fixed order, say top to bottom. First, he increments the quantity corresponding to the top character. Then if it is less than or equal to zero, he increments R. Next, he decrements the quantity corresponding to the bottom character. Then if it is greater than or equal to zero, he increments R. Finally, now that both characters are processed, he updates M by adding R to it. This continues for each column as they flow past.
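The procedure followed by each observer can be written out as a short sketch. The function names `observe` and `similarity` are hypothetical, and strings stand in for the transmitted words. The combination step, in which the two values of M are added and divided by L(L+1), is the one described elsewhere in this specification for words of equal length L.

```python
from collections import defaultdict
from fractions import Fraction


def observe(top_word, bottom_word):
    """One observer's pass over the columns; returns his final M."""
    tally = defaultdict(int)   # one remembered quantity per alphabet member
    r = 0
    m = 0
    for top, bottom in zip(top_word, bottom_word):
        tally[top] += 1                # increment for the top character
        if tally[top] <= 0:
            r += 1
        tally[bottom] -= 1             # decrement for the bottom character
        if tally[bottom] >= 0:
            r += 1
        m += r                         # both characters processed: update M
    return m


def similarity(word_a, word_b):
    """Combine both observers' M values for two words of equal length."""
    if len(word_a) != len(word_b) or not word_a:
        raise ValueError("words must be non-empty and of equal length")
    length = len(word_a)
    total = observe(word_a, word_b) + observe(word_a[::-1], word_b[::-1])
    return Fraction(total, length * (length + 1))
```

For the words ABC and ABB, the first observer arrives at M=5 and the second, watching the flips, at M=3, so `similarity("ABC", "ABB")` evaluates to 2/3.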
The relationship between Figures 17 and 26 is now particularly described. To illustrate this relationship, a description of what occurs in the circuit of Figure 17, step by step, as the example of Figure 26 is processed by the circuit is presented.
First, OBSERVER ONE shown as 196 in Figure 26 corresponds to one copy of the circuit 400 of Figure 17. We will also call the word "ABC" the query word and "ABB" the bus word.
The circuit of Figure 17 is controlled by two clock cycles shown in Figure 16. The bottom signal (THETA) is the major timing signal. Within each THETA cycle, another clock PHI makes a complete cycle. PHI is called the minor clock.
These clocks are an essential part of the invention only insofar as they specify the order of processing which occurs within the circuit. Three major clock cycles are required to process the examples with three characters. Within each of these, two characters are processed: first a character from the query word is processed and then a character from the bus word.
Since OBSERVER ONE is considered at this point, the circuit first processes the character "A" from the query word and "A" from the bus word. Then it processes the character "B" from the query word and "B" from the bus word. Then it processes the character "C" from the query word and "B" from the bus word.
It is useful to imagine that these characters flow downward into selector inputs 404 and 406 as follows:

                          Minor clock cycle
                             1        2
(Major clock cycle 3)        C        B
(Major clock cycle 2)        B        B
(Major clock cycle 1)        A        A
In other words, external means are used to access the two words being compared and "feed" these words into the core circuit. Then external means may be used to interpret the result of the circuit, which at the completion of the computation is available at output 182. This interpretation normally involves a division, but the essential point is that the degree of similarity is directly indicated by the magnitude of the number present at output 182 at the end of the computation, i.e. larger numbers mean more similarity.
In other words, this quantity may be "normalized" by a process involving a division, as discussed elsewhere, if one desires a similarity measure that ranges from zero to one.
Before operation begins, R, M, and TALLY are reset to zero.
During Major clock cycle 1, Minor clock cycle 1:
During the start of this period, while PHI is low, the character "A" is routed from 404 to become the address of the memory 114, and since PHI is low that memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is low, this block adds one to the value, producing 1. This is then routed both into LATCH 136 and TEST 144. Since THETA is low, TEST outputs a 0. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "A". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not. During the second half of this period, while PHI is high, TALLY is writing its updated information.
The character "A" is present at the selector input 404.
The character "A" is present at the selector input 406.
The R register shown as 170 in Figure 17 contains zero throughout.
The M register shown as 182 in Figure 17 contains zero throughout.

                        Original Value    Updated Value
Tally location "A" =          0                 1
Tally location "B" =          0
Tally location "C" =          0
During Major clock cycle 1, Minor clock cycle 2:
During the start of this period, while PHI is low, the character "A" is routed from 406 to become the address of the memory 114, and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is high, this block subtracts one from the value, producing 0. This is then routed both into LATCH 136 and TEST 144. Since THETA is high, TEST outputs a 1. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "A". Also at this transition the R register is incremented if signal 148 is 1. In this case it is. During the second half of this period, while PHI is high, TALLY is writing its updated information.
The character "A" is present at the selector input 404.
The character "A" is present at the selector input 406.
The R register shown as 170 in Figure 17 contains 0 at the start of this period and 1 at the completion.
The M register shown as 182 in Figure 17 contains 0 at the start of this period and 1 at the completion.

                        Original Value    Updated Value
Tally location "A" =          1                 0
Tally location "B" =          0
Tally location "C" =          0

The end of this cycle is marked by the negative going edge of THETA. At this time M is updated and becomes 1.
During Major clock cycle 2, Minor clock cycle 1:
During the start of this period, while PHI is low, the character "B" is routed from 404 to become the address of the memory 114, and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is low, this block adds one to the value, producing 1. This is then routed both into LATCH 136 and TEST 144. Since THETA is low, TEST outputs a 0. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "B". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not.
During the second half of this period, while PHI is high, TALLY is writing its updated information.
The character "B" is present at the selector input 404.
The character "B" is present at the selector input 406.
The R register shown as 170 in Figure 17 contains 1 throughout.
The M register shown as 182 in Figure 17 contains 1 throughout.

                        Original Value    Updated Value
Tally location "A" =          0
Tally location "B" =          0                 1
Tally location "C" =          0
During Major clock cycle 2, Minor clock cycle 2:
During the start of this period, while PHI is low, the character "B" is routed from 406 to become the address of the memory 114, and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is high, this block subtracts one from the value, producing 0. This is then routed both into LATCH 136 and TEST 144. Since THETA is high, TEST outputs a 1. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "B". Also at this transition the R register is incremented if signal 148 is 1. In this case it is. During the second half of this period, while PHI is high, TALLY is writing its updated information.
The character "B" is present at the selector input 404.
The character "B" is present at the selector input 406.
The R register shown as 170 in Figure 17 contains 1 at the start of this period and 2 at the completion.
The M register shown as 182 in Figure 17 contains 1 at the start of this period and 3 at the completion.

                        Original Value    Updated Value
Tally location "A" =          0
Tally location "B" =          1                 0
Tally location "C" =          0

The end of this cycle is marked by the negative going edge of THETA. At this time M is updated and becomes 3.
During Major clock cycle 3, Minor clock cycle 1:
During the start of this period, while PHI is low, the character "C" is routed from 404 to become the address of the memory 114, and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is low, this block adds one to the value, producing 1. This is then routed both into LATCH 136 and TEST 144. Since THETA is low, TEST outputs a 0. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "C". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not. During the second half of this period, while PHI is high, TALLY is writing its updated information.
The character "C" is present at the selector input 404.
The character "B" is present at the selector input 406.
The R register shown as 170 in Figure 17 contains 2 throughout.
The M register shown as 182 in Figure 17 contains 3 throughout.

                        Original Value    Updated Value
Tally location "A" =          0
Tally location "B" =          0
Tally location "C" =          0                 1
During Major clock cycle 3, Minor clock cycle 2:
During the start of this period, while PHI is low, the character "B" is routed from 406 to become the address of the memory 114, and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is high, this block subtracts one from the value, producing -1. This is then routed both into LATCH 136 and TEST 144. Since THETA is high, TEST outputs a 0. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "B". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not. During the second half of this period, while PHI is high, TALLY is writing its updated information.
The character "C" is present at the selector input 404.
The character "B" is present at the selector input 406.
The R register shown as 170 in Figure 17 contains 2 at the start of this period and 2 at the completion.
The M register shown as 182 in Figure 17 contains 3 at the start of this period and 5 at the completion.

                        Original Value    Updated Value
Tally location "A" =          0
Tally location "B" =          0                -1
Tally location "C" =          1

The end of this cycle is marked by the negative going edge of THETA. At this time M is updated and becomes 5.
FINAL TALLY CONTENTS:
Tally location "A" = 0
Tally location "B" = -1
Tally location "C" = 1
FINAL R CONTENTS: 2
FINAL M CONTENTS: 5
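The step by step description above may be checked with a small clock-level model. This is a sketch under stated assumptions rather than a simulation of the actual hardware: the function name `run_circuit` and the log layout are hypothetical, each minor clock cycle is collapsed into a single read, add, test and write-back step, and M is updated once per major cycle to represent the negative going edge of THETA.

```python
def run_circuit(query_word, bus_word):
    """Model one circuit copy; returns (tally, R, M, per-minor-cycle log)."""
    tally = {}    # tally memory 114, logically cleared (missing key reads 0)
    r = 0         # R register, latch 170
    m = 0         # M register, add-latch 182
    log = []
    columns = zip(query_word, bus_word)
    for major, (query_char, bus_char) in enumerate(columns, start=1):
        # THETA low selects the query character, THETA high the bus character.
        halves = ((0, query_char), (1, bus_char))
        for minor, (theta, char) in enumerate(halves, start=1):
            # PHI low: read tally; ADD adds +1 (THETA low) or -1 (THETA high).
            updated = tally.get(char, 0) + (-1 if theta else 1)
            # TEST: non-positive test when THETA is low, non-negative when high.
            test = int(updated >= 0) if theta else int(updated <= 0)
            # Positive edge of PHI: write back and conditionally increment R.
            tally[char] = updated
            r += test
            log.append((major, minor, char, updated, test, r, m))
        # Negative going edge of THETA: M is updated by adding R.
        m += r
    return tally, r, m, log
```

Running `run_circuit("ABC", "ABB")` reproduces the final contents listed above: tally locations A, B and C hold 0, -1 and 1, R holds 2 and M holds 5. Each log entry records the value of M held during that minor cycle, before the update at the end of the major cycle.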
The circuit copy representing OBSERVER TWO would work in the same manner except that the order of characters in both words is reversed. In the above, we have assumed that the observer started with all quantities equal to zero. Once the record has passed, the two observers add together the values for M that they have arrived at. In this example OBSERVER ONE arrives at M=5 and OBSERVER TWO at M=3, for a sum of 8. This result is then divided by L(L+1), which in this case is 3(3+1)=12. L is defined as the length of the compared words. Thus, we have 8/12=2/3 as the final similarity between the two words. Figure 26 displays the final state of all numeric quantities involved.

A basic mathematical formula has been created. If A is an alphabet, then words in A are finite concatenations of members from A. If w is a word in A, then $\bar{w}$ denotes the flip of w. For example, the flip of the word "abcd" is "dcba". If x and y are numbers, then $x \mathbin{\dot{-}} y$ denotes the greater of the two quantities x-y and 0. If w is a word in A, then n(a,w,i) denotes the number of occurrences of alphabet member "a" in word w found in position i or beyond, where position is measured canonically from left to right. If w and v are words in A, both of length L, then the degree of word similarity between them is denoted S(w,v) and is given by the formula below:

$$S(w,v) = \frac{L(L+1) - \sum_{a \in A} \sum_{i=1}^{L} \left[ \left( n(a,w,i) \mathbin{\dot{-}} n(a,v,i) \right) + \left( n(a,\bar{w},i) \mathbin{\dot{-}} n(a,\bar{v},i) \right) \right]}{L(L+1)}$$

This formula, as set forth above and well understood by those skilled in the art, produces a number between 0 and 1 inclusive. It produces 1 if and only if w and v are identical. It produces 0 if and only if w and v share no common alphabet members. Intermediate values are interpreted as degrees of similarity between these two extremes. Formulae exist, and are discussed in "The Definition, Computation and Application of Symbol String Similarity Functions" referred to hereinabove, which do not presume equality of the lengths of w and v. The above formula is, however, the most fundamental. It equally weighs all alphabet members and word positions. It corresponds to the circuit of Figure 17.

In the equation above, the fundamental computation is that involving the double summation. Various forms of this equation might still perhaps produce a useful measurement of word similarity. For example, the whole equation might be raised to some positive integral power. Faulk, in the article referred to hereinabove, discussed a function which also associated with each pair of words a number between zero and one. But his function is considerably more complex and in practice is much more difficult to compute. This is discussed in the thesis referred to hereinabove. The disclosed formula rests on a simpler and more rigorous mathematical foundation, as pointed out in the thesis. Faulk also makes little attempt towards justifying his formula.
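For comparison with the circuit, the formula may also be evaluated directly by carrying out the double summation. The sketch below assumes that the summation ranges over every alphabet member a and every position i from 1 to L, and that the second term of each summand compares the flips of the two words; this reading is consistent with the worked example, which yields 2/3 for ABC and ABB. The function names are hypothetical.

```python
from fractions import Fraction


def n(a, w, i):
    """Occurrences of alphabet member a in word w at position i or beyond."""
    return w[i - 1:].count(a)      # positions are counted from 1


def monus(x, y):
    """The greater of the two quantities x - y and 0."""
    return max(x - y, 0)


def similarity_by_formula(w, v):
    """Evaluate S(w, v) by brute force for two words of equal length L."""
    if len(w) != len(v) or not w:
        raise ValueError("words must be non-empty and of equal length")
    length = len(w)
    w_flip, v_flip = w[::-1], v[::-1]
    # Alphabet members absent from both words contribute nothing to the sum.
    penalty = sum(
        monus(n(a, w, i), n(a, v, i)) + monus(n(a, w_flip, i), n(a, v_flip, i))
        for a in set(w) | set(v)
        for i in range(1, length + 1)
    )
    return Fraction(length * (length + 1) - penalty, length * (length + 1))
```

This direct evaluation counts occurrences afresh for every alphabet member and position, which illustrates why a straightforward computation of the double summation requires considerably more work than the single pass performed by each observer.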
It should be noted that the circuit invention evolved from the formula to the algorithm to the circuit. The following criteria are met:
1. Mathematical simplicity
2. Ease of computation
3. Agreement with human intuition
4. Flexibility to permit varied applications.
The formula derived in the thesis and disclosed herein and made a part hereof is mathematically simple and produces results that appear to agree well with human intuition, as referred to in the IEEE articles referred to hereinabove. Computation of the formula in a straightforward fashion requires quite a bit of work, primarily due to the double summation.
The next evolutionary step is the invention of an algorithm which quickly computes the formula. This algorithm is presented herein in the form of a Fortran function subprogram. It is called with three input parameters: IQ, IR and N. IQ and IR are each an integer vector of dimension N. Upon return, the variable THETA is the degree of similarity between the input words IQ and IR. Alphabet members are integers between 1 and 256 inclusive.
To process a character from each of the two words under comparison, the algorithm implemented in machine language on a modern general purpose CPU such as the IBM 370 requires the execution of dozens of instructions, each comprised of many micro instructions.
To perform this same task, our circuit requires only two internal clock cycles. Each represents approximately a read/write cycle pertaining to a high speed random access memory. Actually, the clock must be slightly slower than the memory's maximum speed to permit other circuit components to operate. In use, however, the multiple should be less than 2. Therefore, we see that our circuit is capable of computing our definition of word similarity much faster than any existing general purpose processor.
A complete similarity memory system may contain many such circuits. Therefore, the system would then be performing a search function beyond the capabilities of existing general purpose CPU's. The basic algorithm as programmed in Fortran is:

      FUNCTION THETA (IQ, IR, N)
C
C     IQ AND IR ARE EACH WORDS OF LENGTH N IN THE ALPHABET CONSISTING
C     OF THE NUMBERS 1 THROUGH 256. THEY ARE PASSED AS INTEGER VECTORS
C     EACH OF DIMENSION N. UPON RETURN, THETA ASSUMES THE VALUE OF
C     THE BASIC WORD SIMILARITY BETWEEN IQ AND IR.
C
      INTEGER R
      DIMENSION ITALLY(256), IQ(N), IR(N)
      DO 1 I=1,256
    1 ITALLY(I)=0
      M=0
      R=0
C
C     FIRST PASS, OVER THE WORDS AS GIVEN
      DO 2 I=1,N
      ITALLY(IQ(I))=ITALLY(IQ(I))+1
      IF (ITALLY(IQ(I)).LE.0) R=R+1
      ITALLY(IR(I))=ITALLY(IR(I))-1
      IF (ITALLY(IR(I)).GE.0) R=R+1
    2 M=M+R
C     RESET R AND THE TALLY, BUT NOT M, BEFORE THE SECOND PASS
      R=0
      DO 4 I=1,256
    4 ITALLY(I)=0
C     SECOND PASS, OVER THE FLIPS OF THE WORDS
      DO 3 J=1,N
      I=N+1-J
      ITALLY(IQ(I))=ITALLY(IQ(I))+1
      IF (ITALLY(IQ(I)).LE.0) R=R+1
      ITALLY(IR(I))=ITALLY(IR(I))-1
      IF (ITALLY(IR(I)).GE.0) R=R+1
    3 M=M+R
C
      THETA=FLOAT(M)/(N*(N+1))
C
      RETURN
      END
Referring now to Figure 18, the SEL is a random access read/write memory 208. Its address enters on bus 224 from above and data output is to the right on bus 214. In Figure 18, it is assumed that the read mode is selected, as the device is only written to during master initialization. In the preferred embodiment the read/write memory 208 is organized as 127 2-bit words.
SYN consists of three random access read/write memories, 202, 204, and 206. In each, the address enters on bus 406' from above and the data output is transmitted on bus 220, 220', and 220'' from below. In the Figure, we assume that the read mode is selected, as the device is only written to during master initialization. In the preferred embodiment, it is organized as three memories, each consisting of 256 8-bit words. Select is a data selector 210. It has three input busses 220, 220', and 220'' entering from above. One of these three is routed to a single output 224 exiting below, depending upon the numeric value of the 2-bit value entering from the left. If this value is 0, then Select ignores its input and outputs zero. If this value is 1, 2 or 3, then the first, second or third input data bus respectively is routed to the output. In the preferred embodiment, it has 8-bit inputs and outputs.
MV is a one bit latch 222. It is set/reset during master initialization. The current state of MV exits above. Select 402 is a data selector. It has two input busses 224 and 404' entering from above. One of these is routed to a single output exiting below, depending upon the state of θ entering from the right. If θ is low, then the right input bus is selected. Otherwise, the left bus is selected. In the preferred embodiment, Select has 8-bit inputs and outputs.
Double skip on zero is a circuit to block the propagation of φ and θ, for the duration of two φ cycles, provided that θ=0, MV=1 and the output from Select is zero, at the time of a positive transition of φ. This effectively causes the later circuit stages to ignore completely the current column. These altered versions of φ and θ are then used by the central stage of the circuit.
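The column selection, synonym translation and skip condition described above may be sketched as a single function. This is an illustration only: the function name `preprocess_column` and the representation of the SEL and SYN memories as Python lists are assumptions, and the skip condition is modelled literally as stated for the double skip on zero circuit (MV set and the Select output equal to zero).

```python
def preprocess_column(position, record_char, sel, syn, mv):
    """Return the translated record character, or None if the column is skipped.

    sel: per-position 2-bit selection words (0, 1, 2 or 3), standing in for SEL.
    syn: three translation tables indexed by character, standing in for SYN.
    mv:  state of the one bit MV latch.
    """
    choice = sel[position]
    if choice == 0:
        translated = 0                        # Select ignores its inputs
    else:
        translated = syn[choice - 1][record_char]
    if mv and translated == 0:
        return None                           # double skip: ignore this column
    return translated
```

As a usage example, with `sel = [1, 0, 2]` and the MV flag set, position 1 is always skipped, while positions 0 and 2 are translated through the first and second tables respectively and skipped only when the translated value is zero.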
PW is a random access read/write memory. Its address enters from the left and data output is to the right. In the figure, we assume that the read mode is selected, as the device is only written to during master initialization. In the preferred embodiment, it is organized as 127 2-bit words.
CW is a random access read/write memory. Its address enters from above and data output is to the right. In the figure, we assume that the read mode is selected, as the device is only written to during master initialization. In the preferred embodiment, it is organized as 256 2-bit words.
Distributor is a data distributor. It has three outputs above and a 2-bit control input to the left. When this input is zero, all three outputs are zero. When it is 1, 2, or 3, the first, second or third output goes high respectively, leaving the others zero.
Single skip on zero is a circuit to block the propagation of φ, for the duration of one φ cycle, provided that the cumulative output from the gate circuits is zero at the time of a positive transition of φ. This altered version of φ is then used by the lowest circuit stage.
Each gate is a pair of logical AND gates used to control the propagation of the data output from CW. Both bits leaving CW enter each gate from the left. Inside each gate there are two 2-input AND gates. One input from each becomes a common control input shown entering from below. The remaining two inputs connect to the two entering data lines. The outputs of the gates are shown to the right. When the control input is low, the gate outputs zero. When it is high, the gate simply propagates its two bit input. The combination of the three gate circuits and the distributor circuit forms a variable shift register which shifts the output of CW, depending upon the output of PW. This affects the computation of the final weight.
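The variable shift formed by the distributor and the three gates may be modelled as follows. The function names are hypothetical, and the assumption that the first, second and third gates feed bit positions shifted by zero, one and two places respectively is inferred from the statement elsewhere in this specification that the possible column weights are 0, 1, 2 and 4.

```python
def distributor(control):
    """2-bit control input -> three outputs, at most one of them high."""
    return tuple(int(control == k) for k in (1, 2, 3))


def gate(enable, value):
    """Propagate the two bit value when enabled, otherwise output zero."""
    return value if enable else 0


def final_weight(char_weight, position_code):
    """Combine a 2-bit character weight (CW) with a 2-bit position code (PW)."""
    enables = distributor(position_code)
    # Gate k is assumed to feed the adder shifted left by k places.
    return sum(gate(e, char_weight) << shift for shift, e in enumerate(enables))
```

Under these assumptions the final weight is the character weight multiplied by 0, 1, 2 or 4, so its largest value is 12, which fits in four bits.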
TOTM is an add-latch device. Its input enters above. It is triggered on a negative θ transition. In the preferred embodiment, TOTM is a 17-bit latch together with an adder having one 17-bit input and one 11-bit input.
R is an add-latch device. Its input enters above and its output exits below. It is triggered on a positive transition provided that T entering from the right is equal to one. In the preferred embodiment, R is an 11-bit latch coupled to an adder having one 11-bit input and one four-bit input.
TOTR is an add-latch device. Its input enters above and its output exits below. It is triggered on a positive transition of φ provided that θ=1. In the preferred embodiment, it is an 11-bit latch together with an adder having one 11-bit input and one four-bit input.
M is an add-latch device. Its input enters above. It is triggered on a negative θ transition. In the preferred embodiment, it is a 17-bit latch together with an adder having one 17-bit input and one 11-bit input.
Sign test is defined as test of Figure 17. L is defined as L of Figure 17. Tally is defined as tally of Figure 17. Adder is defined as add of Figure 17.
Referring to Figure 18, this diagram illustrates how the basic circuit of Figure 17 may be considerably enhanced without sacrificing speed of processing. In Figure 18, before data reaches the core or data selector circuit 402 and the basic word associator circuit 112, which is identical to that shown in Figure 17, several tasks are performed. First, a memory word is fetched corresponding to the current column position being processed. If this word is zero, then the current column is ignored. This is accomplished by using the double skip on zero circuit 220 of well known design. This circuit merely blocks propagation of all timing signals during the current character pair. Therefore, the circuit ignores the current column. Whereas the circuit of Figure 17 processes every column unconditionally, this facility permits column selection in the circuit of Figure 18. If the fetched word is nonzero, then it is used to select one of three tables to be used in translating the data character from the record before it reaches the select circuit 402. This is called synonym processing and permits additional flexibility. The facilities above are implemented via the random access memory SEL 208 and the three random access memories labeled SYN 202, 204, and 206, and by the SEL 208 component, which simply selects one of the three outputs from the SYN memories 202, 204, and 206. The SEL 208 is of any well known design. The SYN 202, 204 and 206 are a set of three random access memories of a well known design. If the translated value of a record character is zero and the MV flag is set, then the entire current column is ignored as above. MV 222 is a one bit latch of a well known design. This permits the definition of missing value fields that do not detract from the measure of the similarity between the record and the query. Position bus 224 is connected to SEL 208 and PW 226, which is a random access memory of well known design.
The enhancements we have discussed so far constitute simple preprocessing and are not crucial to the basic invention. We now discuss some more crucial enhancements.
In the circuit of Figure 17 you will note that the quantity "1" is always added to R 107. This has the effect of weighting all alphabet members and column positions equally. Figure 18 implements a weighting scheme wherein the column number currently under consideration and the current character are used to determine a weight which is to be added to R instead of "1". In this way, one can weight some characters more heavily than others. Generally speaking, many circuits might compute a weight to be used to update R. Figure 18 contains one such circuit. In this circuit, each alphabet character is assigned a two bit weight. Thus weights 0, 1, 2, 3 are possible. Each column position is also assigned a weight, also two bits. But this weight is used to control a shift register, so that here the possible weights are 0, 1, 2, 4. The character weight is effectively multiplied by the column weight to arrive at the final weight. A complete word associator circuit must of course contain two copies of the circuit of Figure 18. We observe that the positional weight tables defined for each copy might differ. This might allow certain columns to be processed with more emphasis on initial or on final characters. When the two tables agree, there is no such directional bias. It should be noted that when the final weight of a column/character pair is zero, the character is not processed. This is accomplished by the "single skip on zero" circuit 230, which blocks propagation of timing signals for the duration of a single character.
The weight scheme is implemented by the random access memories PW 226 and CW 242 of well known design, by the selectable shift register 244 formed by the distributor and gate components 248, 250, and 252, and by the single skip on zero circuit 230, all of which are of well known designs.
In the circuit of Figure 17, the final result M needed to be divided by N(N+1), where N is the length of the words under comparison, to arrive at the measure of similarity. In the circuit of Figure 18, this denominator must be computed since it will depend upon the weights encountered during processing. In Figure 18, TOTR 254 and TOTM 256, which are add-latches, compute a denominator term. A corresponding term is computed by the other copy of the circuit. The sum of these two terms is the final denominator. The final numerator is twice the sum of the two M values read out of the two circuit copies. The similarity between the query and bus words is the quotient of the numerator and the denominator quantities. The above is just one way in which the information read out of the circuit may be interpreted to arrive at a measurement of similarity. Other schemes might weight various terms unequally. The circuit 112' in the lower right of Figure 18 is recognizable as very similar to the circuit of Figure 17. The only difference is that the selection component is now external and R may now be updated by quantities other than "1".
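By way of illustration only, the weight combination and the final quotient described above may be sketched in software as follows. The sketch is written in Python and is not part of the disclosed circuit. The names final_weight, is_skipped and similarity_from_readout are hypothetical, the mapping of the two-bit column code onto the multipliers 0, 1, 2, 4 is an assumption, and returning zero when the denominator is zero is likewise an assumption.

```python
# Multipliers selected by the two-bit column code. The text gives the
# possible values 0, 1, 2, 4; which code selects which value is assumed here.
COLUMN_MULTIPLIER = (0, 1, 2, 4)


def final_weight(char_weight: int, column_code: int) -> int:
    """Multiply a two-bit character weight (0..3) by the column multiplier."""
    if not 0 <= char_weight <= 3 or not 0 <= column_code <= 3:
        raise ValueError("character weight and column code are two-bit values")
    return char_weight * COLUMN_MULTIPLIER[column_code]


def is_skipped(char_weight: int, column_code: int) -> bool:
    """Model of the single skip on zero: a zero final weight is not processed."""
    return final_weight(char_weight, column_code) == 0


def similarity_from_readout(m_a: int, m_b: int, totm_a: int, totm_b: int) -> float:
    """Quotient of twice the summed M values over the summed TOTM terms.

    m_a, totm_a are read out of one circuit copy; m_b, totm_b out of the other.
    """
    denominator = totm_a + totm_b
    if denominator == 0:
        return 0.0  # assumption: nothing was weighted, so report no similarity
    return 2 * (m_a + m_b) / denominator
```

For example, a character weight of 3 in a column whose code selects the multiplier 4 gives a final weight of 12, the largest value this scheme can produce, while any character in a column whose multiplier is 0 is skipped.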
The circuit of Figure 18 is divided by dashed lines into three stages designated by I, II and III. Note that busses passing from stage to stage are broken. These breaks indicate that buffers might be inserted to achieve a pipeline with three stages. In this way the circuit can process data as fast as the circuit of Figure 17. Without pipeline buffers, the circuit is two to three times slower. The timing signals are labeled identically in each stage but may vary from stage to stage both because of the optional pipeline and because of the skip on zero circuits.
The circuit of Figure 18 must be initialized before use. A master initialization must be performed once per search to establish weights, etc. This initialization must load the SEL 208, SYN 202, 204 and 206, PW 226, and CW 242 memories. Also, the MV 222 flag must be set or reset. Before each record is processed, additional initialization is required. The tally memory in 112', not illustrated, must be set to zero, as must R and M, not shown, and the TOTR 254 and TOTM 256 add-latches. After each record is processed, the contents of M, not shown in 112', and TOTM 256 are read out of the circuit.
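The two levels of initialization and the readout may be pictured, purely as a software analogy, by the following Python sketch. The class CircuitState and its method names are hypothetical, the table contents are left to the caller, and the alphabet size of 256 is an assumption; the sketch only shows which quantities are loaded once per search and which are cleared before every record.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ALPHABET_SIZE = 256  # assumed alphabet size


@dataclass
class CircuitState:
    # Loaded once per search by the master initialization.
    sel: List[int] = field(default_factory=list)
    syn: List[List[int]] = field(default_factory=list)
    pw: List[int] = field(default_factory=list)
    cw: List[int] = field(default_factory=list)
    mv: bool = False
    # Cleared before every record.
    tally: List[int] = field(default_factory=lambda: [0] * ALPHABET_SIZE)
    r: int = 0
    m: int = 0
    totr: int = 0
    totm: int = 0

    def master_init(self, sel, syn, pw, cw, mv) -> None:
        """Load the SEL, SYN, PW and CW tables and set or reset the MV flag."""
        self.sel = list(sel)
        self.syn = [list(table) for table in syn]
        self.pw = list(pw)
        self.cw = list(cw)
        self.mv = bool(mv)

    def reset_for_record(self) -> None:
        """Zero the tally memory and the R, M, TOTR and TOTM accumulators."""
        self.tally = [0] * ALPHABET_SIZE
        self.r = self.m = self.totr = self.totm = 0

    def read_out(self) -> Tuple[int, int]:
        """Return the two values read after a record: (M, TOTM)."""
        return self.m, self.totm
```

In this analogy reset_for_record leaves the tables and the MV flag untouched, so a search over many records pays the table-loading cost only once.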
Connections describing initialization and readout are trivial and of well known design and are therefore not shown in the drawings.
Finally, note that in the circuit of Figure 18, the bus data character must be stable on the bus even during the processing of the query character. This permits the bus character to be translated while the query character, which does not pass through synonym translation, is processed. The query translation is better left to software since the query is fixed during a search.
The instant invention has been shown and described herein in what is considered to be the most practical and preferred embodiment. It is recognized, however, that departures may be made therefrom within the scope of the invention and that obvious modifications will occur to a person skilled in the art.
What I Claim Is:

Claims

1. An indicia string comparator circuit for providing a numeric measurement of the degree of indicia string similarity between a record indicia string and a query indicia string, comprising; first means including an output for presenting record string indicia and query string indicia alternately and for presenting control information at said output, said first means connected to a digital source of record string indicia and query string indicia, read/write memory operably connected to a second means, said second means having an input operably connected to said output of said first means, said second means having an output, said second means for reading from and writing to said read/write memory and for indicating at said output of said second means common portions of the indicia strings by generating control signals based on indicia counts, and third means includes an input and output, said input of said third means connected to said output of said second means said third means for computing a numeric measurement of the weight of the common portions of the indicia string, and for presenting a digital signal representing said numeric measurement at said output of said third means.
2. An indicia string comparator circuit as set forth in Claim 1, wherein; said indicia string comparator circuit including said first means, said read/write memory, said second means, and said third means for computation in a time span generally proportional to the average length of said indicia strings.
3. An indicia string comparator circuit as set forth in Claims 1 or 2, wherein; said indicia string comparator circuit provides the degree of indicia string similarity which approximates the formula;
S(w,v) = [L(L+1) - (n(a,w,i), n(a,v,i)) + (n(a,w,i), n(a,v,i))] / [L(L+1)]
4. An indicia string comparator circuit as set forth in Cl wherein; said first means including parameter means having an input connected to said first means and an output, said parameter means for presenting at said output of said parameter means the indicia's weights, said second means for presenting at said output of said second means the indicia weights obtained from said first means.
5. An indicia string comparator circuit as set forth in Cl wherein; said parameter means for presenting the indicia's weights and the indicia's compensation values, said second means for presenting the indicia weights and compensation values obtained from said first means at said output of said second means.
6. An indicia string comparator circuit as set forth in Claim 4, wherein; said first means includes string control means and said parameter means is a parameter generation means, said string control means including an output, said string control means for presenting record string indicia and query string indicia alternately and presenting control information, said parameter generation means connected to said output of said string control means, and said parameter generation means for presenting an indicia weight value for the indicia input from said string control means and for presenting all the inputs from said string control means, said output of said parameter generation means is said output of said first means.
7. An indicia string comparator circuit as set forth in Claim 6, wherein: said parameter generation means for presenting an indicia weight and an indicia compensation value; said second means for presenting the indicia weights and compensation values obtained from said first means at said output of said second means.
8. An indicia string comparator as set forth in Claim 1, wherein: said third means includes R means having an input and output and M means having an input and output, said R means for providing a tally of common portions of the indicia strings by summing input from said second means, said R means input is said third means' input, said M means for computing a numeric measurement of common portions of the indicia string by summing input from said R means, said output of said R means connected to said input of said M means, said output of said M means is said third means' output.
9. An indicia string comparator as set forth in Claims 4 or 6, wherein; said third means including R means including an input and output, and M means including an input and output; said R means for providing a tallying sum of the weight of the weights of the common portions by summing input from said second means, said input of said R means is said input of said third means; said M means for providing a numeric measurement of the weight of the common portions of the indicia strings by summing input from said R means, said input of said M means connected to said output of said R means, said output of said M means is said third means' output.
10. An indicia string comparator as set forth in Claims 5 or 7; said third means for computing a numeric measurement including CA means including an input and output, CB means including an input and output, R means including an input and output, and M means including an input and output, said CA means for providing a numeric measurement of the sum of the compensation values of the indicia of the shorter string unmatched by indicia in the longer string, said input of said CA means is said third means' input, said CB means for providing a numeric measurement of the sum of the compensation values of the indicia of the longer string unmatched by indicia in the shorter string, said input of said CB means connected to said output of said CA means, said R means for providing a tallying sum of the weights of the common portions and of the compensations by summing output from said CA means, CB means and said second means, said input of said R means connected to said output of said CB means, said M means for providing a numeric measurement of the weight of the common portions of the indicia strings by summing input from said R means, said input of said M means connected to said output of said R means, said output of said M means is said third means' output.
11. An indicia string comparator circuit as set forth in Claims 1, 2, 4, 5, 6, 7, or 8, wherein fourth means having an input and output, said fourth means for computing a numeric measurement of the total weight of the indicia strings, said fourth means connected to said output of said second means.
12. An indicia string comparator circuit for providing a numeric measurement of the degree of indicia string similarity between a record indicia string and a query indicia string, comprising; a first means connected to a digital source of record string indicia and query string indicia; said first means includes string control means and a parameter generation means, said string control means including an output, said string control means for presenting record string indicia and query string indicia alternately and presenting control information at said output of said string control means, said parameter generation means connected to said output of said string control means, and said parameter generation means for presenting an indicia weight and an indicia compensation value for the indicia input from said string control means, for presenting all the inputs from said string control means and for presenting an indicia weight and an indicia compensation value, read/write memory operably connected to a core means, said core means having an input operably connected to output of said parameter generation means, said core means having an output, said core means for reading from and writing to said read/write memory and for indicating at said output of said core means common portions of the indicia strings by generating control signals based on indicia counts, CA means having an input and output for providing a numeric measurement of the sum of the compensation values of the indicia of the shorter string unmatched by indicia in the longer string, said input of said CA means connected to said output of said second means, CB means having an input and output for providing a numeric measurement of the sum of the compensation values of the indicia of the longer string unmatched by indicia in the shorter string, said input of said CB means connected to said output of said CA means, R means for providing a tallying sum of the weights of the common portions and of the compensations by summing output from said CA means, CB means and said second means, said input of said R means connected to said output of said CB means, M means for providing and presenting a digital signal representing a numeric measurement of the weight of the common portions of the indicia strings by summing input from said R means, said input of said M means connected to said output of said R means, fourth means having an input and output, said input of said fourth means connected to said output of core means, said fourth means for computing and presenting a digital signal representing a numeric measurement of the total weight of the indicia strings.
13. An indicia string comparator circuit as set forth in Claim 1 or 8 including, a fourth means including TOTR means and TOTM means, said TOTR means having an input and output for providing a tally of indicia by summing input from said second means, input of said TOTR means connected to said output of second means, said TOTM means having an input and output for providing a numeric measurement of the total weight of the indicia strings by summing input from said TOTR means, said input of said TOTM means connected to said output of said TOTR means, said output of said TOTM means is said fourth means' output.
14. An indicia string comparator circuit as set forth in Claim 12, wherein: said fourth means including TOTR means and TOTM means, said TOTR means having an input and output for providing a tally of indicia by summing input from said second means, input of said TOTR means connected to said output of core means, said TOTM means having an input and output for providing a numeric measurement of the total weight of the indicia strings by summing input from said TOTR means, said input of said TOTM means connected to said output of said TOTR means, said output of said TOTM means is said fourth means' output.
15. An indicia string comparator circuit as set forth in Claim 14, wherein divider means including an input and output, said divider means for computing and presenting a digital signal representing the ratio of the output of said M means over said output of said TOTM means.
16. An indicia string comparator circuit as set forth in Claim 8 including, a fourth means including TOTR means and TOTM means, said TOTR means having an input and output for providing a tally of indicia by summing input from said second means, input of said TOTR means connected to said output of second means, said TOTM means having an input and output for providing a numeric measurement of the total weight of the indicia strings by summing input from said TOTR means, said input of said TOTM means connected to said output of said TOTR means, divider means including an input and output, said divider means for computing and presenting a digital signal representing the ratio of the output of said M means over said output of said TOTM means.
17. An indicia string comparator circuit as set forth in Claim 12, wherein said indicia string comparator circuit including said first means, said read/write memory, said core means, said CA means, said CB means, said R means, said M means and said fourth means for computation in a time span generally proportional to the average length of said indicia strings.
18. An indicia string comparator circuit as set forth in Claim 14 including: divider means including an input and output, said divider means for computing and presenting a digital signal representing the ratio of the output of said M means over said output of said TOTM means, a bus control means connected to said string control means and said parameter generator means for controlling all external accesses to said indicia string comparator circuit and a ranker means and for monitoring the activities of said ranker means and the indicia string comparator means; said ranker means for maintaining a ranked list of the best string comparison results, for saving said ranked list in a memory accessible to external devices, for having said ranked list contain entries consisting of a record pointer and a string similarity coefficient, and for having said string similarity coefficient input from said divider means.
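By way of non-limiting illustration, the behavior recited for the ranker means, namely keeping a bounded ranked list whose entries pair a record pointer with a string similarity coefficient, may be modeled in software as follows. The Python class Ranker, its capacity parameter and its method names are hypothetical, and the use of a heap is merely one convenient choice assumed for this sketch; it does not describe the claimed circuit.

```python
import heapq
from typing import List, Tuple


class Ranker:
    """Keep the `capacity` best (record pointer, similarity) entries seen so far."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Min-heap ordered by similarity, so the weakest kept entry is at index 0.
        self._heap: List[Tuple[float, int]] = []

    def offer(self, record_pointer: int, similarity: float) -> None:
        """Consider one comparison result for inclusion in the ranked list."""
        entry = (similarity, record_pointer)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # The new result beats the weakest kept entry, so replace it.
            heapq.heapreplace(self._heap, entry)

    def ranked(self) -> List[Tuple[int, float]]:
        """Return the kept entries, best first, as (record pointer, similarity)."""
        return [(ptr, sim) for sim, ptr in sorted(self._heap, reverse=True)]
```

Ties in similarity are broken here by the larger record pointer, which is an arbitrary choice of the sketch.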
19. A system circuit as set forth in Claim 18, including; a Master Control means connected to said bus control means and said string control means for automatic loading of words from an external memory into the memory of said string control means.
20. A word comparator device for use with a digital data processing device and an input/output display device, said word comparator device for providing a numeric measurement of the degree of indicia similarity between query words and record words whereby said record words can be ranked by said digital data processing device and displayed according to rank by said input/output display device in a quick and expeditious manner, comprising: a selecting means having at least one selecting means input and a selecting means output, said selecting means for alternately addressing and routing a query word character and a record word character to said selecting means output; said record word storage area electrically coupled to a selecting means input; said query word input/output device electrically coupled to a selecting means input; a read/write memory means having a memory means input, memory means output, and updating input, said memory means input is connected to said selecting means output, said read/write memory means for storage and retrieval of numeric information addressed by said selecting means; control timing circuit means for selecting either the read or write mode of said read/write memory means and defining whether a query word or record word is being processed, said control timing circuit connected to a selecting means input for controlling the input of query words and record words; said control timing circuit means connected to said read/write memory means, an adder means including an adder input connected to said read/write memory means output and an adder output, said adder means for incrementing or decrementing the numeric information and updating of said read/write memory means; said control timing circuit means for controlling the of said adder means, connected to said adder means, latching means connected to said adder output, said latching means for updating said read/write memory means with said adder output; said control timing circuit means for controlling the of said latching means, connected to said latching means; a test means including comparator means, said test means for generating a test output depending on the state of said control timing circuit means, said test output being representative of the numeric character similarity between a query word and a corresponding record indicia string; said control timing circuit means for controlling said test means, connected to said test means; and output latch means connected to said test means; said output latch means for providing an output which represents a cumulative numeric measurement of the numeric character similarities representative of the degree of word similarity between the record word and the query word.
21. The new use of the general purpose electronic data processor having a central processing unit with a working storage area, at least one random access storage device and at least one input/output means, the new use comprising a method of ordering stored data words comprising the steps of: generating a query word of sequentially ordered data characters; generating strings of sequentially ordered stored data word characters, each said string representing a stored data word; comparing each ordered data character of each said stored data word with a respective ordered data character of said query word; calculating a forward similarity function between said query word and each said stored data word; reversing the ordered sequence of said query word data characters and each said stored data word;
comparing each ordered data character of each said string with a respective ordered data character of said query word; calculating a reverse similarity function between said query word and each said stored data word; calculating a total similarity function between each said stored data word and said query word; ordering said stored data word relative to said total similarity function.
22. A new use of a general purpose electronic data processor as set forth in Claim 21, wherein: said step of calculating a forward similarity function, a reverse similarity function and a total similarity function between each stored data word and said query word is accomplished according to the following Fortran algorithm:
      FUNCTION THETA (IQ, IR, N)
C
C     IQ AND IR ARE EACH WORDS OF LENGTH N IN THE ALPHABET
C     CONSISTING OF THE NUMBERS 1 THROUGH 256. THEY ARE PASSED
C     AS INTEGER VECTORS EACH OF DIMENSION N. UPON RETURN, THETA
C     ASSUMES THE VALUE OF THE BASIC WORD SIMILARITY BETWEEN
C     IQ AND IR.
C
      INTEGER R
      DIMENSION ITALLY(256), IQ(N), IR(N)
      DO 1 I=1,256
    1 ITALLY(I)=0
      M=0
      R=0
C
C     FORWARD PASS
      DO 2 I=1,N
      ITALLY(IQ(I))=ITALLY(IQ(I))+1
      IF (ITALLY(IQ(I)).LE.0) R=R+1
      ITALLY(IR(I))=ITALLY(IR(I))-1
      IF (ITALLY(IR(I)).GE.0) R=R+1
    2 M=M+R
C
      R=0
      DO 4 I=1,256
    4 ITALLY(I)=0
C
C     REVERSE PASS
      DO 3 J=1,N
      I=N+1-J
      ITALLY(IQ(I))=ITALLY(IQ(I))+1
      IF (ITALLY(IQ(I)).LE.0) R=R+1
      ITALLY(IR(I))=ITALLY(IR(I))-1
      IF (ITALLY(IR(I)).GE.0) R=R+1
    3 M=M+R
      THETA=FLOAT(M)/(N*(N+1))
      RETURN
      END
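For readers without a Fortran compiler, the same two-pass computation, together with the ordering step of Claim 21, may be sketched in Python as follows. This is an illustrative transcription under stated assumptions and not part of the claims: the function names theta and order_by_similarity are hypothetical, words are taken to be equal-length byte strings with character codes 0 through 255 rather than 1 through 256, and empty or unequal-length inputs are rejected by assumption.

```python
from typing import List, Sequence


def theta(query: bytes, record: bytes, alphabet_size: int = 256) -> float:
    """Basic word similarity: forward pass plus reverse pass over N(N+1)."""
    if len(query) != len(record) or len(query) == 0:
        raise ValueError("words must be non-empty and of equal length")
    n = len(query)
    m = 0
    # First the words as given (forward), then both words reversed.
    for q, r in ((query, record), (query[::-1], record[::-1])):
        tally = [0] * alphabet_size  # tally memory is cleared for each pass
        run = 0                      # plays the role of R, reset for each pass
        for cq, cr in zip(q, r):
            tally[cq] += 1
            if tally[cq] <= 0:       # query character matches an earlier record one
                run += 1
            tally[cr] -= 1
            if tally[cr] >= 0:       # record character matches an earlier query one
                run += 1
            m += run                 # M accumulates R once per column
    return m / (n * (n + 1))


def order_by_similarity(query: bytes, stored: Sequence[bytes]) -> List[bytes]:
    """Order stored words from most to least similar to the query."""
    return sorted(stored, key=lambda word: theta(query, word), reverse=True)
```

With these definitions, theta(b"smith", b"smith") is 1.0 and theta(b"smith", b"smyth") is 0.8, so order_by_similarity(b"smith", [b"jones", b"smyth", b"smith"]) places b"smith" first and b"jones" last.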
PCT/US1983/001540 1983-10-04 1983-10-04 String comparator device system, circuit and method WO1985001600A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19830903352 EP0157768A4 (en) 1983-10-04 1983-10-04 String comparator device system, circuit and method.
PCT/US1983/001540 WO1985001600A1 (en) 1983-10-04 1983-10-04 String comparator device system, circuit and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US1983/001540 WO1985001600A1 (en) 1983-10-04 1983-10-04 String comparator device system, circuit and method

Publications (1)

Publication Number Publication Date
WO1985001600A1 true WO1985001600A1 (en) 1985-04-11

Family

ID=22175472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1983/001540 WO1985001600A1 (en) 1983-10-04 1983-10-04 String comparator device system, circuit and method

Country Status (2)

Country Link
EP (1) EP0157768A4 (en)
WO (1) WO1985001600A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4290115A (en) * 1978-05-31 1981-09-15 System Development Corporation Data processing method and means for determining degree of match between two data arrays

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4328561A (en) * 1979-12-28 1982-05-04 International Business Machines Corp. Alpha content match prescan method for automatic spelling error correction
US4385371A (en) * 1981-02-09 1983-05-24 Burroughs Corporation Approximate content addressable file system
DE3382270D1 (en) * 1983-10-03 1991-06-06 Proximity Technology Inc METHOD AND DEVICE FOR COMPARISON OF WORDS.

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP0157768A4 *

Also Published As

Publication number Publication date
EP0157768A4 (en) 1987-12-09
EP0157768A1 (en) 1985-10-16

Legal Events

Date Code Title Description
AK Designated states

Designated state(s): JP

AL Designated countries for regional patents

Designated state(s): AT BE CH DE FR GB LU NL SE

WWE Wipo information: entry into national phase

Ref document number: 1983903352

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1983903352

Country of ref document: EP

WWR Wipo information: refused in national office

Ref document number: 1983903352

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1983903352

Country of ref document: EP