EP0519062A4 - Very fast approximate string matching algorithms for multiple errors spelling correction - Google Patents

Very fast approximate string matching algorithms for multiple errors spelling correction

Info

Publication number
EP0519062A4
EP0519062A4 (application EP19920904493)
Authority
EP
European Patent Office
Prior art keywords
words
string
dictionary
neighborhood
error distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP19920904493
Other languages
French (fr)
Other versions
EP0519062A1 (en)
Inventor
Min-Wen Du
Shih-Chio Chang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Verizon Laboratories Inc
Original Assignee
GTE Laboratories Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GTE Laboratories Inc filed Critical GTE Laboratories Inc
Publication of EP0519062A1 publication Critical patent/EP0519062A1/en
Publication of EP0519062A4 publication Critical patent/EP0519062A4/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms

Definitions

  • This invention pertains generally to the field of data processing, and in particular to the approximate string matching problem in which a search is made for those words which most closely resemble a given character string from a set of possible words which may or may not include the string.
  • the invention is utilized in program error correction, text editing in word processing and information retrieval from a data base.
  • the approximate string matching problem and algorithms proposed or used for its solution in various contexts are well known in the prior art, having been discussed in the literature at least as early as 1970.
  • the approximate string matching (ASM) problem may be stated as: search those words that most closely "resemble" a given character string from a set of possible words (dictionary).
  • the given string may or may not be in the dictionary.
  • Word resemblance is generally measured by a distance function defined between two strings. For example, the minimal number of editing operations, including insert, delete, change a character, and transpose two adjacent characters, to change one string to another string is a natural and commonly used distance measure between two strings. Therefore, the problem may also be stated as: find the nearest neighbors of a given character string among a set of possible words.
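This distance measure can be sketched as follows (a restricted-transposition dynamic program; the function name and formulation are illustrative, not the patent's own algorithm):

```python
def error_distance(x, y):
    """Minimal number of insert, delete, change, and transpose
    (adjacent characters) operations turning x into y."""
    m, n = len(x), len(y)
    # H[i][j] = distance between x[1:i] and y[1:j]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i                  # delete all of x[1:i]
    for j in range(n + 1):
        H[0][j] = j                  # insert all of y[1:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i-1] == y[j-1] else 1
            H[i][j] = min(H[i-1][j] + 1,        # delete
                          H[i][j-1] + 1,        # insert
                          H[i-1][j-1] + cost)   # change (or match)
            if i > 1 and j > 1 and x[i-1] == y[j-2] and x[i-2] == y[j-1]:
                H[i][j] = min(H[i][j], H[i-2][j-2] + 1)   # transpose
    return H[m][n]
```

For instance, one transposition separates "deuce" and "duece", so the function returns 1 for that pair.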
  • In program error correction, the dictionary usually consists of the set of reserved keywords and the set of variable and function names defined by the user. In text editing, the dictionary is the set of accepted words of the language. In information retrieval, the dictionary is the set of searching keys in the database.
  • Errors may be introduced at various stages in information processing. For instance, in an airline reservation system, a traveller's name is very easily misspelled. Since information is often conveyed by telephone conversations and, furthermore, since international names often lack a standard spelling, errors are unavoidable. They may exist in both the searching keys (names) and the database. Approximate string matching techniques make it possible to retrieve partially incorrect records with partially incorrect searching keys.
  • Approximate string matching techniques can greatly improve the man-machine interface design in today's interactive computer environment. If a character string entered by a user is incorrect, it would be desirable for the system to guess the word and let the user verify it. Alternatively, it would be desirable for the system to present several possibilities and let the user select the correct one. This option may make the system much more user friendly. Four spelling errors are the most common: insert, delete, change a character, and transpose two adjacent characters, as reported in F.J. Damerau, "A Technique for Computer Detection and Correction of Spelling Errors," Comm. ACM 7, 3, pp. 171-176, March 1964, and H.L. Morgan, "Spelling Correction in Systems Programs," Comm. ACM 13, 2, pp. 90-94, Feb. 1970. In almost all earlier approaches, the fault models assume only single errors. However, such an assumption is generally inadequate. For example, current programming practice encourages longer variable and function names to enhance program readability and maintainability. Longer names invite multiple errors.
  • the time required for the calculation is equal to k × log2(p) × K(m, r).
  • the previously discussed three approaches are either memory efficient but require too long a time to find nearest neighbors, or time efficient but require an excessively large memory to implement an indexing mechanism.
  • a practical approach should fit between these two extremes; i.e., use enough, but not too much, memory to build up an indexing mechanism so that nearest neighbors can be found within seconds.
  • the present invention provides a system for multiple errors spelling correction in a data processing system having a sequential digital storage media for storing a large data base, comprising: a dictionary comprising a set of acceptable words in a universe stored in said data processing system; each word in said dictionary comprising a string of characters; said dictionary being partitioned according to the length of said strings of characters; means to receive a string Z for determining whether string Z is in said dictionary or is a misspelled word in said dictionary; means to match said string Z with strings in said dictionary to find the nearest neighbors of Z, comprising: means to calculate the error distance between Z and all the words in said dictionary, wherein said error distance is the shortest sequential editing sequence operating from left to right to transform Z into said words; means to record words with minimal distance; means to limit the calculations of error distances by determining an upper bound on the length of words for which said calculations are made; means to use a string length partition to limit said calculations of error distances; means to use a cut-off criterion to limit said calculations of error distances; and means to limit further the search region by eliminating words at an error distance greater than the error distance in a neighborhood.
  • FIG. 1 is a block diagram showing the hardware and operating software systems on which embodiments of the present invention have been implemented;
  • FIG. 2 is an information processing flow chart for indexing the full text data base input used in the embodiments of the present invention;
  • FIG. 3 is an information processing flow chart of the query process for information retrieval from the full text data bases of FIG. 2;
  • FIG. 4 is a flow chart of one embodiment of the adaptive ranking system of the present invention showing the record weight determination at a given level;
  • FIG. 5 is a diagram illustrating the length of an editing sequence and the cost of an S-trace;
  • FIG. 6 is a graph of the distribution of word lengths in three dictionaries;
  • FIGS. 7a, 7b and 7c present the distribution of distance between words in three dictionaries;
  • FIG. 8 is a diagram illustrating the order of calculation of an error distance matrix;
  • FIG. 9 is a mapping diagram for constructing a limited set of nearest neighbors;
  • FIG. 10 is a derivation tree for all strings within a radius one of
  • FIG. 11 illustrates the covering problem for finding hash functions
  • FIG. 12 illustrates a hash function selection for a finite number of dummy characters
  • FIG. 13 illustrates a covering table for constructing covers of deviation vectors
  • FIG. 14 is a flow chart of an embodiment of the elastic string-matching algorithm of the present invention
  • FIGS. 15a-15e are plots of experimental results measuring the performance of five algorithms for error distances ranging from
  • FIGS. 16a-16e are plots of experimental results measuring the time for execution of the five algorithms of FIGS. 15a-15e.
  • This invention pertains to very fast algorithms for approximate string matching in a dictionary. Multiple spelling errors of insert, delete, change and transpose operations on character strings are considered in the disclosed fault model.
  • FIG. 1 is a block diagram of the hardware and operating systems environment for an experimental information retrieval system designated by the acronym FAIRS and partially disclosed in "And-less Retrieval: Toward Perfect Ranking," by S.-C. Chang and W. C. Chen, Proc. ASIS Annual Meeting, 1987, Oct. 1987, pp. 30-35, and also partially disclosed in "Towards a Friendly Adaptable Information Retrieval System," Proc. RIAO 88, Mar. 1988, pp. 172-182.
  • FAIRS operates on a variety of computer systems each using its own operating system. The principal feature of all the systems is the massive data storage devices indicated by reference number 12.
  • FIG. 2 is a flow chart showing the information processing flow for inputting a full text data base and indexing the data base in a
  • Original text files 21 are read into storage 12 as is, with the user optionally specifying record markers, each file being named and having .TXT as an extension to its file name.
  • the user also describes his files to the system 22, providing a list of his files with .SRS as the extension, the configuration of his files with .CFG as extension, and additional new files with .NEW as extension.
  • the user also provides a negative dictionary 23 (.NEG) of words not to be indexed.
  • the inputs 21, 22, 23 are processed by an adaptive information reader/parser 24 under the FAIRS program. As part of the process an INDEX builder 25, produces the index files 26 necessary for retrieval.
  • FIG. 3 is an information processing flow chart for retrieving information from the files inputted into the system through queries.
  • a user query 31 is enhanced 32 by checking it for spelling variation 33 and synonym definitions 34.
  • the index files 26 are used to search 35 for records containing the query terms.
  • the records found in the search are ranked 36 according to ranking rules 37.
  • the original files 21 are then displayed 38 for user feedback. At this point the user can feed back relevance information 39a to refine the search or accept the retrieved text records 39b and transfer them to other media for further use.
  • FIG. 4 has been described in the cross-referenced application, and is not directly pertinent to this invention.
  • the present invention pertains directly to the spelling check of the queries and the enhancement of those queries in this information flow. It also has wide application in other areas. In particular, it pertains to very fast algorithms for approximate string matching in a dictionary. Multiple spelling errors of insert, delete, change and transpose operations on character strings are considered in the disclosed fault model.
  • a dictionary is a set of character strings constructed from a character set Σ.
  • a character string in the dictionary will be called a word.
  • X = X[1], X[2], ..., X[m]: a string of characters from Σ.
  • Y = Y[1], Y[2], ..., Y[n]: a second string of characters from Σ.
  • Z[i:j] = Z[i], Z[i+1], ..., Z[j]: an array with indices from i to j.
  • H[i, j] is to be used to represent the distance between X[1:i] and Y[1:j]. H will be called the distance matrix between X and Y.
  • a-b-c-...-r: a sequence of elements. When there is only one element in the sequence, we write -a-. |Z|: the length of string Z.
  • n_S: the size of set S.
  • S*: the Kleene closure of a character set S. String universe U = Σ*.
  • N(Z, r): the neighborhood of string Z within distance (radius) r. The following editing operations on character strings will be considered.
  • I(i, s): Insert a character s between the (i-1)th and the ith characters of a string; D(i): Delete the character at the ith position; C(i, s): Change the ith character to s;
  • T(i): Transpose the characters at positions i and i+1.
  • the change editing operation defined here may change a character to itself. This deviates from the definition of the traditional change operation, in which a character has to be changed to a different one. Defining the change operation in this new way ensures that strings at error distance less than r are also captured when a neighborhood N(Z, r) is enumerated.
  • Definition 1 An editing operation of insert, delete, change and transpose is proper if it can be carried out.
  • An editing sequence E[l:k] on a character string is a sequence consisting of proper editing operations.
  • Each editing operation E[j] is associated with an index, the position on the string where E[j] is acting.
  • the index sequence of E[1:k] is the sequence of position indices associated with the editing sequence E[1:k].
  • the editing sequence D(3)D(4)I(5,o)C(8,s) transforms the word “jeopardize” into the incorrect spelling "jeprodise”
  • T(2) transforms "deuce” into "duece”.
  • the index sequence of the former editing sequence is 3-4-5-8, while that of the latter is -2-.
  • T(5) is not an editing sequence on "deuce” because it cannot be carried out.
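These four operations can be sketched directly with 1-indexed helpers (the function names are mine):

```python
def insert_op(s, i, c):    # I(i, c): insert c between the (i-1)th and ith characters
    return s[:i-1] + c + s[i-1:]

def delete_op(s, i):       # D(i): delete the ith character
    return s[:i-1] + s[i:]

def change_op(s, i, c):    # C(i, c): change the ith character to c
    return s[:i-1] + c + s[i:]

def transpose_op(s, i):    # T(i): transpose the ith and (i+1)th characters
    return s[:i-1] + s[i] + s[i-1] + s[i+1:]

# D(3) D(4) I(5, o) C(8, s) applied to "jeopardize", left to right:
w = change_op(insert_op(delete_op(delete_op("jeopardize", 3), 4), 5, "o"), 8, "s")
```

Here w comes out as "jeprodise", and transpose_op("deuce", 2) gives "duece"; an improper operation such as T(5) on "deuce" simply raises an IndexError in this sketch, matching Definition 1's notion of an operation that cannot be carried out.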
  • Definition 2 The editing distance between two strings X and Y is the minimal length of the editing sequences that transform X into Y.
  • a trace T from a string X to a string Y is the union of two sets, identity set I and changing set C, of number pairs (i, j), where 1 ≤ i ≤ |X| and 1 ≤ j ≤ |Y|, such that
  • two lines (i1, j1) and (i2, j2) cross each other if i1 < i2 but j1 > j2, or i2 < i1 but j2 > j1. If (i, j) is in T, X[i] and Y[j] are said to be incident to that line.
  • every editing sequence results in a trace, and every trace corresponds to at least one editing sequence.
  • a trace T = I ∪ C from X to Y is a restricted trace (R-trace) if
  • Condition a) in Definition 5 states that only lines in I can cross in a restricted trace.
  • Condition b) states further that no line crosses more than one line.
  • a cross in a constrained (restricted) trace can be considered as an aggregate of a series of transpose, insert and delete operations.
  • E[1:n] is a linear editing sequence on a character string if it is an editing sequence and its index sequence is non-decreasing
  • Definition 6 requires a linear editing sequence to operate on the string from left to right, with each insert and change operation fixing one character, and each transpose operation fixing two consecutive characters.
  • the sequence D(2)T(1)C(3,r), which transforms "testing" into "string," is an editing sequence but not a linear editing sequence, because the index sequence 2-1-3 is not non-decreasing.
  • This transformation can be performed by a linear editing sequence D(1),D(1),I(3,r), with a non-decreasing index sequence 1-1-3.
  • Definition 7 The error distance, or the number of spelling errors, from a character string X to a character string Y is the minimal length of the linear editing sequences that transform X into Y.
  • a trace T = I ∪ C from X to Y is a linear trace
  • ⁇ (3,1),(4,2), (5,4), (6,5),(7,6) ⁇ is an L-trace from “testing” to "string”
  • both {(1,1),(4,4),(5,5)} and {(1,1),(4,4),(5,5)} ∪ {(2,3),(3,2)} are L-traces from "deuce" to "duece".
  • Theorem 1 The minimal cost of L-traces between two strings X and Y is equal to the error distance between X and Y, which is the number of spelling errors from X to Y.
  • a linear editing sequence of the L-trace is:
  • Let H denote the error distance matrix between two character strings X and Y; i.e., H[i, j] is the error distance between X[1:i] and Y[1:j].
  • the following theorem calculates the error distance matrix H.
  • H[i, -1] = bound for -1 ≤ i ≤ m
  • H[-1, j] = bound for -1 ≤ j ≤ n
  • H[i, 0] = i for 0 ≤ i ≤ m
  • H[0, j] = j for 0 ≤ j ≤ n.
  • H[i, j] of the distance matrix H[1:m, 1:n] between X and Y can be calculated recursively as
  • Theorem 3 The matrix H[0:m, 0:n] defined by Formula 1 satisfies the following properties: a) H[i, j]-1 ≤ H[i+1, j] ≤ H[i, j]+1 for all 0 ≤ i < m, 0 ≤ j ≤ n; b) H[i, j]-1 ≤ H[i, j+1] ≤ H[i, j]+1 for all 0 ≤ i ≤ m, 0 ≤ j < n; c) H[i, j] ≤ H[i+1, j+1] ≤ H[i, j]+1 for all 0 ≤ i < m, 0 ≤ j < n.
  • H[i+1, j+1] = min{H[i-1, j-1], H[i+1, j], H[i, j+1]} + 1 if both X[i] = Y[j+1] and X[i+1] = Y[j];
  • H[i+1, j+1] = min{H[i, j], H[i+1, j], H[i, j+1]} + 1 in all other cases.
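Filling H row by row under these cases can be sketched as follows (the zero-cost case for matching characters and the restricted-transposition reading of the formula are my assumptions); the sketch also checks Theorem 3's property c) on an example matrix:

```python
def distance_matrix(x, y):
    """H[i][j] = error distance between x[1:i] and y[1:j], with the
    boundary conditions H[i, 0] = i and H[0, j] = j."""
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i
    for j in range(n + 1):
        H[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i-1] == y[j-1]:
                H[i][j] = H[i-1][j-1]          # matching characters
            else:                              # "all other cases"
                H[i][j] = min(H[i-1][j-1], H[i-1][j], H[i][j-1]) + 1
            if i > 1 and j > 1 and x[i-1] == y[j-2] and x[i-2] == y[j-1]:
                H[i][j] = min(H[i][j], H[i-2][j-2] + 1)   # transposition
    return H

H = distance_matrix("testing", "string")
# Theorem 3 c): H[i, j] <= H[i+1, j+1] <= H[i, j] + 1
ok = all(H[i][j] <= H[i+1][j+1] <= H[i][j] + 1
         for i in range(len("testing")) for j in range(len("string")))
```

Consistent with the linear editing sequence D(1)D(1)I(3,r) from the "testing" example above, H[7][6] comes out as 3 here.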
  • Corollary 3 Let d be the error distance between two character strings X and Y. Then |X| - d ≤ |Y| ≤ |X| + d.
  • Corollary 2 gives a simple upper bound on the error distance between two strings.
  • Corollary 3 is the string length partition criterion ordinarily used to save computation in nearest neighbor searching of character strings in the prior art.
  • Word lengths in dictionaries are generally small, as shown in FIG. 6. Therefore, simple instead of complex algorithms should be used in the distance calculation. Sophisticated distance calculation algorithms generally have large time constants and are good only for long strings.
  • 2) The shape of the distributions of word length in the three dictionaries which we studied is bell-like; i.e., there are many fewer words with either small or large word length than words with medium word length. This implies that words with small or large word length can be treated separately without much affecting either the average performance or the worst case performance of ASM (approximate string matching) algorithms. Treating long words separately is especially beneficial because such words have huge neighborhoods, as discussed before. 3) Although words in dictionaries are not random, FIGS. 7a, 7b and 7c show that they do not cluster together either. This phenomenon may be partially attributed to the fact that the alphabet size in use is generally much larger than the length of an average word in the dictionary.
  • this upper bound defined by Corollary 2 can be used immediately to decrease the number of words to be compared. That number can be further cut down by the string length partition criterion discussed above in Corollary 3 because, by dynamically recording a number d, the smallest distance currently found, there is no need to compare those words with length less than |Z| - d or greater than |Z| + d.
  • the best strategy to use this property is to search through word groups in which the difference between their word lengths and |Z| is equal to 0, 1, etc., until a neighbor or neighbors are found.
  • Another simple rule according to Theorem 4 above to make the nearest neighbor searching more efficient is the cut-off criterion for the distance calculation, because it can tell, during the calculation, whether the distance is larger than a pre-specified quantity. This property is useful because, when the error distance between the given string and its neighbors in the dictionary is small, which is usually the case, we can avoid the calculation of most of the entries in the error distance matrices between the given string and the words in the dictionary.
  • the entries on an error distance matrix must be calculated in a particular order, as shown in FIG. 8; here we assume that |X| ≤ |Y|.
  • the H value on the cut-off path is obtained and compared to the error distance r of the current neighborhood. If that H value is smaller than r, we calculate another layer. If the layer is the last one, indicating that a nearest neighbor has been found, we record the word and continue to find all the words with distance equal to the current distance r.
  • Algorithm 1 (Cut by upper bound of distance: Corollary 2) 0. Given string Z.
  • Algorithm 2 (Cut by current upper bound of distance: Corollary 3)
  • Algorithm 3 modifies Algorithm 2. It calls a subroutine error_dist (X, Y, r), which finds the error distance between two character strings X and Y if the distance is no greater than r. If the distance is found to be greater than r during the calculation,
  • error_dist will suspend and return -t, where t is the number of layers calculated, applying the cut-off criterion of Theorem 4.
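A sketch of such an error_dist subroutine (rows stand in for the L-shaped layers of FIG. 8, and the transposition handling is my reading of the recurrence):

```python
def error_dist(x, y, r):
    """Error distance between x and y if it is no greater than r;
    otherwise suspend and return -t, where t is the number of layers
    calculated (rows serve as layers here; FIG. 8 uses L-shaped ones)."""
    m, n = len(x), len(y)
    prev2, prev = None, list(range(n + 1))   # row 0: H[0, j] = j
    for i in range(1, m + 1):
        cur = [i] + [0] * n                  # H[i, 0] = i
        for j in range(1, n + 1):
            cost = 0 if x[i-1] == y[j-1] else 1
            cur[j] = min(prev[j] + 1, cur[j-1] + 1, prev[j-1] + cost)
            if i > 1 and j > 1 and x[i-1] == y[j-2] and x[i-2] == y[j-1]:
                cur[j] = min(cur[j], prev2[j-2] + 1)
        if min(cur) > r:     # cut-off: every entry of this layer exceeds r,
            return -i        # so the final distance must exceed r as well
        prev2, prev = prev, cur
    return prev[n] if prev[n] <= r else -m
```

The cut-off is sound because values along a diagonal never decrease, so once a whole layer exceeds r the final entry must as well.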
  • Algorithm 3 (Cut by cut-off criterion: Theorem 4) 0. Given string Z.
  • V_alg(Z): the set of error distance matrix (H[i, j]) entries visited by algorithm alg in searching for the nearest neighbors of Z.
  • E_alg(Z): the total number of times that error distance matrix (H[i, j]) entries are computed by algorithm alg in searching for the nearest neighbors of Z.
  • v(w_i) is the set of H[i, j] entries visited by algorithm alg when comparing the given string Z and a word w_i in the dictionary.
  • Algorithms 1, 2, and 3 described above each successively cut down the number of H[i, j] entries visited by its predecessor. This number of H[i, j] entries can be reduced further so that the algorithm can be speeded up.
  • E_algorithm_0 = V_algorithm_0, E_algorithm_1 = V_algorithm_1, and E_algorithm_2 = V_algorithm_2.
  • algorithm_3 does not store the intermediate calculation result each time it suspends the calculation of a word in the dictionary, so entry values on distance matrices may be recomputed several times. Therefore, E_algorithm_3 may be greater than V_algorithm_3.
  • V_algorithm_0(Z) ≥ V_algorithm_1(Z) ≥ V_algorithm_2(Z) ≥ V_algorithm_3(Z).
  • Algorithm 4 (Cut by cut-off criterion in Theorem 4 and limiting the searching region)
  • R(Z, r) = DICT ∩ (∪_i h_i^-1(h_i(N_i))), or equivalently in
  • Algorithm 4 implements statements (2) and (3) above.
  • N(Z, r) is itself a simple representation of the neighborhood of Z with error distance r, but it is too abstract to be useful here.
  • Definition 11 Assume that the symbol X itself is not in Σ.
  • a string with (dummy) symbol X is any string in (Σ ∪ {X})*.
  • N("test", 1) = Xtest ∪ est ∪ Xest ∪ etst ∪ tXest ∪ tst ∪ tXst ∪ tset ∪ teXst ∪ tet ∪ teXt ∪ tets ∪ tesXt ∪ testX.
  • a derivation tree for N(Z, r) for r greater than one is constructed similarly by letting the number of errors from the root of the tree to any terminal node be exactly equal to r. Note that the change editing operation causes any string with error distance less than r to be included in the enumeration. Thus, any N(Z, r) can be represented by the terminal nodes on the derivation tree. The number of those nodes on the derivation tree is much smaller than the size of the neighborhood N(Z, r).
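The terminal nodes for r = 1 can be enumerated directly, as in the sketch below (the helper name is mine; each emitted string may contain the dummy symbol X of Definition 11 above):

```python
def strings_with_dummy(z, X="X"):
    """Terminal nodes of the derivation tree for N(z, 1): strings that may
    contain the dummy symbol X, each standing for a set of neighbors."""
    out = set()
    n = len(z)
    for i in range(n + 1):                       # insert X anywhere
        out.add(z[:i] + X + z[i:])
    for i in range(n):                           # delete one character
        out.add(z[:i] + z[i+1:])
    for i in range(n):                           # change one character to X
        out.add(z[:i] + X + z[i+1:])             # (may change it to itself)
    for i in range(n - 1):                       # transpose adjacent pair
        out.add(z[:i] + z[i+1] + z[i] + z[i+2:])
    return out
```

For "test" this yields a handful of compact patterns such as Xtest, tXst and etst, far fewer strings than the full neighborhood they stand for.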
  • a set of deviation vectors {V_i} with dummy X is a covering scheme of a set of strings N if every string in N is a member of at least one of the strings with dummy X derived from the V_i.
  • {V_i} covers N if it is a covering scheme of N.
  • Each string with dummy X on a terminal node of a neighborhood derivation tree represents a set of strings in the neighborhood.
  • Several such strings with dummy X can be covered in turn by a (larger) string with dummy X derived from a deviation vector.
  • a deviation vector of a string Z can be considered as a super-cover of strings in the neighborhood of Z. Any neighborhood can be covered by a set of deviation vectors or super-covers.
  • N("test", 1) is covered by the following set of deviation vectors:
  • the deviation vectors have the nice feature that they specify only the positions from which to extract characters from a string Z, and (implicitly) the positions in the vector to put those characters.
  • a set of deviation vectors is a covering scheme of a neighborhood N(Z, r)
  • it also covers any other N(Z', r) as long as |Z'| is equal to |Z|.
  • R(Z, r) based on deviation vectors.
  • Let N(Z, r) be covered by the set of deviation vectors {V_i}, and let each V_i derive S_i, a string with dummy X.
  • the R(Z, r) scheme consists of two structures: sets of deviation vectors for covering neighborhoods, and a set of h_i functions to calculate the mappings and inverse mappings.
  • This h_i function partitions the set of words of length |V| in the dictionary into
  • h_i maps all the strings of a string with dummy X to a single value.
  • the following example illustrates how to calculate R(Z, r) for a given string Z and a small distance r. The calculation procedure solves Problem 2, posed previously.
  • the error distances between "test", "best", "mess" and the given string "rest" are 1, 1, and 2, respectively. So the nearest neighbors of "rest" are "test" and "best".
  • FIG. 11 interprets this problem as a covering problem, with the length of the deviation vectors equal to 5.
  • the figure depicts a covering table, with each row representing a candidate h_i function that selects two positions in a string mapping, and each column representing a possible string with dummy X derived from a deviation vector with exactly two X symbols.
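A minimal sketch of such a row's h function (choosing two fixed positions; the word list and the position pair [3, 4] are illustrative assumptions):

```python
from collections import defaultdict

def make_h(positions):
    """h extracts the characters at fixed (1-based) positions; all strings
    matching a string-with-dummy-X that is non-X at those positions share
    one h value, so inverse images become simple bucket lookups."""
    def h(s):
        return tuple(s[p - 1] for p in positions)
    return h

words = ["test", "best", "mess", "rest", "tell"]   # a length-4 partition
h34 = make_h([3, 4])                               # keep positions 3 and 4
buckets = defaultdict(list)
for w in words:
    buckets[h34(w)].append(w)
# querying with Z = "rest": candidate words sharing positions 3-4 with Z
candidates = buckets[h34("rest")]
```

For the query "rest" this single bucket already contains its nearest neighbors "test" and "best" from the example above, while "mess" and "tell" are never compared.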
  • the number of columns in the covering table for covering deviation vectors of length m with exactly r X symbols is equal to (m choose r).
  • the covering table becomes very large and a minimal cover will be difficult to find. In practical applications, a good cover of the table that may
  • Another consideration in designing h_i functions is how many characters should be chosen from a string for function value calculation. We have not yet obtained either theoretical or experimental results to answer this question. It is not difficult to conceive the following dilemma: the more characters chosen for calculating h_i functions, the smaller the inverse subregions will be, and thus the fewer false hits there will be for each h_i function; but in that case, larger sets of deviation vectors need to be used for covering neighborhoods. Also, the more characters chosen for calculating h_i functions, the more h_i functions need to be provided in the whole mapping mechanism. This is a typical time-space trade-off problem. In practical design, some tuning may be required.
  • a covering table can be used, as shown in FIG. 13.
  • row [i, j] covers a column of a string with dummy X if the ith and the jth positions in the string are non-X.
  • a deviation vector can be obtained easily from a row and the column which it covers. For example, [1, 3] covers 1X234, therefore, the deviation vector [1,X,2,X,X] can be used to cover 1X234.
  • the objective here is to find a number of rows to cover all the columns in the table, so that a minimal number of deviation vectors will be created. Since the objective function is not a direct count of the number of rows in the cover, this problem appears to be even harder than the general covering problem. We are satisfied by finding a minimal number of rows that cover all the columns because, in our experience, a minimal cover often leads to a small covering set of deviation vectors.
  • a minimal cover for the table in FIG. 13 is {[1,2], [4,5]}. It produces the set of deviation vectors {[1,2,X,X,X], [X,X,X,3,4]}.
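A greedy sketch of this row-selection step (the columns below form a made-up table of single-X strings over five positions, not the actual FIG. 13 table, and greedy selection only approximates a minimal cover):

```python
from itertools import combinations

def greedy_cover(rows, columns, covers):
    """Repeatedly pick the row covering the most still-uncovered columns."""
    uncovered, chosen = set(columns), []
    while uncovered:
        best = max(rows, key=lambda r: len(uncovered & covers[r]))
        if not uncovered & covers[best]:
            break                    # remaining columns cannot be covered
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

# illustrative covering table: columns are strings with one dummy X in a
# length-5 string; row [i, j] covers a column iff positions i, j are non-X
columns = ["X1234", "1X234", "12X34", "123X4", "1234X"]
rows = list(combinations(range(1, 6), 2))
covers = {r: {c for c in columns if c.index("X") + 1 not in r} for r in rows}
cover = greedy_cover(rows, columns, covers)
```

Each chosen row [i, j] then yields a deviation vector that is non-X exactly at positions i and j, in the manner of the [1,2], [4,5] cover above.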
  • DICT_pgm The set of variable and function names of a large prolog program.
  • DICT_Unix The set of English words in a dictionary provided by the Unix System.
  • DICT_IR The set of index words used in an information retrieval system of the library of GTE Laboratories, Waltham, Massachusetts, which is a mixture of author names, titles, and abstracts of books, journals, and technical reports. It contains 25167 distinct words, with average word length equal to 8.320. Normalized distributions of the word length of the three dictionaries are shown in FIG. 6.
  • Algorithm 4 has been modified in the following way: during program execution, whenever no R(Z, r) mechanism has been provided for a certain portion of neighborhood N(Z, r), the program switches to Algorithm 3 to handle that portion.
  • the first one is the number of H[i, j] entries visited by an algorithm (Definition 10). This measurement is system and implementation independent. Since computation overheads are not
  • Algorithm 0, 1, 2 and 3 are all easy to implement. They all use little extra memory. Algorithm 1 is faster than Algorithm 0 only when the length of the given character string is small.
  • Algorithm 2 and Algorithm 3 are much faster than Algorithm 0 and
  • Algorithm 4 is the fastest algorithm among the five.
  • FIGS. 16a-16e show the relative speeds of the five algorithms with different |Z| and r.
  • T_i(28, 4) is equal to 253, 253, 0.15, 0.15, 0.07, and 0.08 seconds, for i equal to 0, 1, 2, 3, and 4, respectively.
  • T_i(9, 1) is equal to 121, 121, 34, 23, and 0.33 seconds, for i equal to 0, 1, 2, 3, and 4, respectively.
  • Algorithm 4 finds nearest neighbors within seconds.
  • Algorithm 0, 1, 2, and 3 are simple and space efficient.
  • Algorithm 2 and Algorithm 3 are relatively fast.
  • Algorithm 4 is very fast but requires substantial memory.
  • Algorithm 2 and Algorithm 3 are good choices.
  • Algorithm 4 is the only choice to provide real-time performance.

Abstract

A data string processing system uses fast algorithms for approximate string matching in a dictionary (23). Multiple spelling errors of insert, delete, change and transpose operations on character strings are considered in the disclosed fault model. The fault model, S-trace, is used in formulating the algorithms, and a four-step reduction procedure improves the efficiency of an approximate string matching algorithm. The first three steps of the procedure (i.e., using the upper bound, the string length partition criterion and the cut-off criterion) represent three improvements on the basic exhaustive comparison approach, and each can be naturally incorporated into the next step. In the fourth step, a hashing scheme avoids comparing the given string with words at large distances when searching in the neighborhood of a small distance. An algorithm that is sub-linear in the number of words in the dictionary (23) results. An application of the algorithms to a library information system uses original text files (21), information description files (22) and a negative dictionary (23) stored on disks (12).

Description

VERY FAST APPROXIMATE STRING MATCHING ALGORITHMS FOR MULTIPLE ERRORS SPELLING CORRECTION
This invention pertains generally to the field of data processing, and in particular to the approximate string matching problem in which a search is made for those words which most closely resemble a given character string from a set of possible words which may or may not include the string. The invention is utilized in program error correction, text editing in word processing and information retrieval from a data base.
The approximate string matching problem and algorithms proposed or used for its solution in various contexts are well known in the prior art, having been discussed in the literature at least as early as 1970. The approximate string matching (ASM) problem may be stated as: search those words that most closely "resemble" a given character string from a set of possible words (dictionary). The given string may or may not be in the dictionary. Word resemblance is generally measured by a distance function defined between two strings. For example, the minimal number of editing operations, including insert, delete, change a character, and transpose two adjacent characters, to change one string to another string is a natural and commonly used distance measure between two strings. Therefore, the problem may also be stated as: find the nearest neighbors of a given character string among a set of possible words.
In program error correction, the dictionary usually consists of the set of reserved keywords and the set of variable and function names defined by the user. In text editing, the dictionary is the set of accepted words of the language. In information retrieval, the dictionary is the set of searching keys in the database. An excellent introduction to the problem has been given in P.A.V. Hall and G.R. Dowling, "Approximate String Matching," ACM Computing Surveys, 12, 4, pp. 381-402, Dec. 1980.
Approximate string matches are extremely desirable in most information handling systems, because errors in databases are common. Observations show that, in some cases, more than 22% of database index terms are misspelled. Consequently, approximate string matching becomes the only means to retrieve such partially corrupted data.
Errors may be introduced at various stages in information processing. For instance, in an airline reservation system, a traveller's name is very easily misspelled. Since information is often conveyed by telephone conversations and, furthermore, since international names often lack a standard spelling, errors are unavoidable. They may exist in both the searching keys (names) and the database. Approximate string matching techniques make it possible to retrieve partially incorrect records with partially incorrect searching keys.
Approximate string matching techniques can greatly improve the man-machine interface design in today's interactive computer environment. If a character string entered by a user is incorrect, it would be desirable for the system to guess the word and let the user verify it. Alternatively, it would be desirable for the system to present several possibilities and let the user select the correct one. This option may make the system much more user friendly.
Four spelling errors are the most common: insert, delete, change a character, and transpose two adjacent characters, as reported in F.J. Damerau, "A Technique for Computer Detection and Correction of Spelling Errors," Comm. ACM 7, 3, pp. 171-176, March 1964, and H.L. Morgan, "Spelling Correction in Systems Programs," Comm. ACM 13, 2, pp. 90-94, Feb. 1970. In almost all earlier approaches, the fault models assume only single errors. However, such an assumption is generally inadequate. For example, current programming practice encourages longer variable and function names to enhance program readability and maintainability. Longer names invite multiple errors.
In some applications, only the consideration of multiple errors can lead us from an erroneous word to the correct word. For instance, at least four insert, delete and change operations are needed to obtain the correct spelling "Jeopardize" from the misspelled "Jeprodise." The number of spelling errors provides a simple and natural definition of error distance between two strings.
Consider the following application. Assume that thousands of files have been created in a large software project. It often happens that a user wants to search a file but cannot remember the exact file name. Using ASM techniques, the system can help the user gradually expand the partially correct name in its immediate neighborhood, until the file name is found. This provides an alternative to the popular wildcard matching method, which matches a given regular expression against a set of known strings. The wildcard approach is less useful in this situation because the concept of error distance has not been naturally implemented in its formulation.
On the other hand, multiple error fault models have long been used in comparing two long strings. Multiple errors are seldom considered in approximate string matching because they are difficult to handle, as discussed in Hall & Dowling, op. cit.
The following three approaches show the difficulty. We shall limit our discussion to cases where the error distance between the given string and its nearest neighbors in the dictionary is small. These cases occur most frequently in practical applications. We assume the 26-letter alphabet, and further assume that the given character string for approximate matching is of length m, the average word length in the dictionary is n, and there are p words in the dictionary. We also assume that the words in the dictionary are stored in random access memory.
1) We may calculate the distance between the given string and every word in the dictionary, and then find those words within the minimal distance. The time for distance calculation between two words is proportional to the product of the lengths of the two strings in various fault models. Therefore, it will take k x p x m x n time to find the nearest neighbors, where k is a time constant. Let k = 100 us, p = 10^5, m = n = 10; then the time to find the nearest neighbors is 1000 seconds. This approach takes too long for a real-time application.
2) We may implement an indexing mechanism for the dictionary for exact matching. Let us sort the p words in the dictionary in alphabetical order and adopt binary search. To find the nearest neighbors of a given string, we generate all the strings within a small error distance (radius) r of the given string and check if each of them is in the dictionary. The radius r is increased by one each time, starting from zero, until nearest neighbors have been found in the dictionary. Let K(m, r) denote the number of strings within distance r of the given string. Then K(m, 1) is approximately equal to 26 x (2 x m + 1) + 2 x m - 1, and K(10, 1) = 565. Also, K(m, r) is approximately equal to K(m, 1)^r for small r. If a nearest neighbor is of distance r from the given string, the time required for the calculation is equal to k x log2(p) x K(m, r). Let k = 20 us (the operation here is simpler than that in Case 1), p = 10^5, and m = n = 10. Then the time required for the calculation is equal to 106 seconds when r = 2 and equal to 998 minutes when r = 3.
3) We may pregenerate and store all strings in the neighborhood of a small distance r from all the words in the dictionary. Then a logarithmic search is possible. However, assuming that we use one byte to store a character, the memory required is equal to K(m, r) x p x n bytes. Again let p = 10^5, m = n = 10. When r = 2, the memory required is 32 x 10^10 bytes, which is huge and cannot fit in directly accessible computer memory in the foreseeable future.
The three approaches discussed above are either memory efficient but require too long a time to find nearest neighbors, or time efficient but require an excessively large memory to implement an indexing mechanism. A practical approach should fit between these two extremes; i.e., use enough, but not too much, memory to build up an indexing mechanism so that nearest neighbors can be found within seconds.
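The back-of-the-envelope figures above can be checked with a short calculation. The sketch below uses the values assumed in the text (k = 100 us for Case 1, k = 20 us for Case 2, p = 10^5 words, m = n = 10); the single-error neighborhood size K(m, 1) counts 26(m+1) inserts, 26m changes, m deletes and m-1 transposes.

```python
import math

def K1(m):
    """Approximate size of the radius-1 neighborhood of a length-m string:
    26*(m+1) inserts + 26*m changes + m deletes + (m-1) transposes."""
    return 26 * (2 * m + 1) + 2 * m - 1

p, m, n = 10**5, 10, 10

# Case 1: exhaustive distance calculation, time = k * p * m * n.
t1 = 100e-6 * p * m * n
print(t1)                 # 1000.0 seconds

# Case 2: generate-and-look-up, time = k * log2(p) * K(m, r), K(m, r) ~ K(m, 1)**r.
t2 = 20e-6 * math.log2(p) * K1(m) ** 2
print(round(t2))          # 106 seconds for r = 2

# Case 3: pregenerated neighborhoods, memory = K(m, r) * p * n bytes.
mem = K1(m) ** 2 * p * n
print(mem)                # about 3.2e11 bytes for r = 2
```

The arithmetic reproduces the 1000-second, 106-second and 32 x 10^10-byte figures cited in the text.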
Accordingly, the present invention provides a system for multiple errors spelling correction in a data processing system having a sequential digital storage medium for storing a large data base, comprising: a dictionary comprising a set of acceptable words in a universe stored in said data processing system; each word in said dictionary comprising a string of characters; said dictionary being partitioned according to the length of said strings of characters; means to receive a string Z for determining whether string Z is in said dictionary or is a misspelled word in said
dictionary; means to match said string Z with strings in said dictionary to find the nearest neighbors of Z, comprising: means to calculate the error distance between Z and all the words in said dictionary, wherein said error distance is the length of the shortest sequential editing sequence operating from left to right to transform Z into said words; means to record words with minimal distance; means to limit the calculations of error distances by determining an upper bound on the length of words for which said calculations are made; means to use a string length partition to limit said calculations of error distances; means to use a cut-off criterion to limit said calculations of error distances; and means to limit further the search region by eliminating words at an error distance greater than the error distance in a neighborhood.
In the drawings:
FIG. 1 is a block diagram showing the hardware and operating software systems on which embodiments of the present invention have been implemented;
FIG. 2 is an information processing flow chart for indexing the full text data base input used in the embodiments of the present invention;
FIG. 3 is an information processing flow chart of the query process for information retrieval from the full text data bases of FIG. 2;
FIG. 4 is a flow chart of one embodiment of the adaptive ranking system of the present invention showing the record weight determination at a given level;
FIG. 5 is a diagram illustrating the length of an editing sequence and the cost of an S-trace;
FIG. 6 is a graph of the distribution of word lengths in three dictionaries;
FIGS. 7a, 7b and 7c present the distribution of distances between words in three dictionaries;
FIG. 8 is a diagram illustrating the order of calculation of an error distance matrix;
FIG. 9 is a mapping diagram for constructing a limited set of nearest neighbors;
FIG. 10 is a derivation tree for all strings within a radius of one of "test";
FIG. 11 illustrates the covering problem for finding hash functions;
FIG. 12 illustrates a hash function selection for a finite number of dummy characters;
FIG. 13 illustrates a covering table for constructing covers of deviation vectors;
FIG. 14 is a flow chart of an embodiment of the elastic string-matching algorithm of the present invention;
FIGS. 15a-15e are plots of experimental results measuring the performance of five algorithms for error distances ranging from 0 to 5; and
FIGS. 16a-16e are plots of experimental results measuring the time for execution of the five algorithms of FIGS. 15a-15e.
This invention pertains to very fast algorithms for approximate string matching in a dictionary. Multiple spelling errors of insert, delete, change and transpose operations on character strings are considered in the disclosed fault model. Before describing the algorithms, we shall present an overview of the information retrieval system in which these algorithms have been implemented.
FIG. 1 is a block diagram of the hardware and operating systems environment for an experimental information retrieval system designated by the acronym FAIRS and partially disclosed in "And-less Retrieval: Toward Perfect Ranking," by S.-C. Chang and W. C. Chen, Proc. ASIS Annual Meeting, 1987, Oct. 1987, pp. 30-35, and also partially disclosed in "Towards a Friendly Adaptable Information Retrieval System," Proc. RIAO 88, Mar. 1988, pp. 172-182. In the cited references, a scheme using a text editor, within an experimental information retrieval system FAIRS, was described in general terms. FAIRS operates on a variety of computer systems, each using its own operating system. The principal feature of all the systems is the massive data storage devices indicated by reference number 12. FIG. 2 is a flow chart showing the information processing flow for inputting a full text data base and indexing the data base in a
large system using FAIRS. Original text files 21 are read into storage 12 as is, with the user optionally specifying record markers, each file being named and having .TXT as an extension to its file name. The user also describes his files to the system 22, providing a list of his files with .SRS as the extension, the configuration of his files with .CFG as extension, and additional new files with .NEW as extension. The user also provides a negative dictionary 23 (.NEG) of words not to be indexed. The inputs 21, 22, 23 are processed by an adaptive information reader/parser 24 under the FAIRS program. As part of the process an INDEX builder 25 produces the index files 26 necessary for retrieval. A major component of the index files is an inverted file .INV 27, which is an index to the locations of all occurrences of each word in the text files 21. The remaining index files (28a, 28b, 28c, 28d) contain the location of the records having each word (.REC), the location of occurrences of that word (.LOC), the address of each record (.ADR) and a utility file (.CNT).
FIG. 3 is an information processing flow chart for retrieving information from the files inputted into the system through queries. A user query 31 is enhanced 32 by checking it for spelling variations 33 and synonym definitions 34. After the user verifies the query, the index files 26 are used to search 35 for records containing the query terms. The records found in the search are ranked 36 according to ranking rules 37. The original files 21 are then displayed 38 for user feedback. At this point the user can feed back relevance information 39a to refine the search, or accept the retrieved text records 39b and transfer them to other media for further use.
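The indexing flow of FIG. 2 can be sketched in miniature. This is our own greatly simplified illustration, not the actual FAIRS implementation: the parser is a single regular expression, the negative dictionary is a hypothetical stop-word set standing in for the .NEG file, and the inverted file is an in-memory map from each word to its (record, position) occurrences.

```python
# Minimal sketch of the FIG. 2 indexing flow: parse text records, skip words
# in the negative dictionary, and build an inverted file mapping each word to
# the (record number, position) pairs where it occurs.
import re
from collections import defaultdict

negative_dictionary = {"the", "a", "of", "and"}   # hypothetical .NEG contents

def build_inverted_file(records):
    inverted = defaultdict(list)
    for rec_no, text in enumerate(records):
        for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            if word not in negative_dictionary:
                inverted[word].append((rec_no, pos))
    return inverted

records = ["The jeopardize example", "a string matching example"]
inv = build_inverted_file(records)
print(sorted(inv))          # ['example', 'jeopardize', 'matching', 'string']
print(inv["example"])       # [(0, 2), (1, 3)]
```

The query side of FIG. 3 would then look up each (possibly spelling-corrected) query term in this map to find candidate records.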
FIG. 4 has been described in the cross-referenced application, and is not directly pertinent to this invention. The present invention pertains directly to the spelling check of the queries and the enhancement of those queries in this information flow. It also has wide application in other areas. In particular, it pertains to very fast algorithms for approximate string matching in a dictionary. Multiple spelling errors of insert, delete, change and transpose operations on character strings are considered in the disclosed fault model.
In the following description, we will present L-trace, the fault model used in formulating the algorithms of this invention. A four-step reduction procedure to improve the efficiency of an approximate string matching algorithm is then described. The design to achieve the fourth step in the procedure is the principal contribution of this invention. In that step, we developed a hashing scheme to avoid comparing the given string with words at large distances. Thus, an algorithm that is sub-linear in the number of words in the dictionary results. The details of the operation and the design of the hashing mechanism are disclosed. We then discuss properties of dictionaries in use which will ordinarily affect the effectiveness of searching algorithms and make some observations about the properties of dictionaries that occur in typical applications. We conclude by describing the application of the algorithms developed to a library information retrieval database using the FAIRS system described above, and discuss our experimental results. The experimental results show that performing approximate string matching for a large dictionary in real-time on an ordinary sequential computer is feasible.
THE L-TRACE FAULT MODEL
Most earlier approaches in spelling error detection and correction assume only single errors. Wagner and Fisher proposed the first formal string editing model for handling multiple insert, delete, and change operations on character strings: R.A. Wagner and M.J. Fisher, "The String-to-String Correction Problem," J.ACM 21, 1, pp. 168-173, Jan. 1974. They developed a dynamic programming formulation of the problem for distance calculations. That model was later extended by Lowrance and Wagner to include transpose operations: R. Lowrance and R.A. Wagner, "An Extension of the String-to-String Correction Problem," J.ACM 22, pp. 177-183, Apr. 1975. Based on the Lowrance-Wagner extended model, we have developed a fault model named linear trace (L-Trace). The L-Trace model handles multiple insert, delete, change, and transpose errors. It places natural constraints on possible editing sequences to reflect the common errors. This invention utilizes the L-Trace
model, although the techniques disclosed here can be used for other fault models as well. In the following paragraphs, we define the L-Trace model.
In this description, a dictionary is a set of character strings constructed from a character set Σ. A character string in the dictionary will be called a word.
The following notation and conventions will be used in all 1ater discussions.
X=X[1], X[2], ..., X[m]: a string of characters from Σ. Y=Y[1], Y[2], ..., Y[n]: a second string of characters from Σ
Z[i:j]=Z[i], Z[i+1] ,..., Z[j]: an array with indices from i to j. H[i-:i , j,:j„]: an array with indices from i to i and j to j„. H[i, j] is to be used to represent the distance between X[l:i] and Y[l:j]. H will be called the distance matrix between X and Y. a-b-c-...-r: a sequence of elements. When there is only one element in the sequence, we write -a-. /_Z: the length of string Z. n_S: the size of set S. h (R): the inverse image of a range R under a mapping function h, i.e., h (R) = {x | h(x) ε R} . S': the Kleene closure of a character set S. String universe U: is equal to ∑' .
N(Z, r): the neighborhood of string Z within distance (radius) r.
The following editing operations on character strings will be considered.
I(i, s): Insert a character s between the (i-1)th and the ith characters of a string;
D(i): Delete the character at the ith position;
C(i, s): Change the ith character to s;
T(i): Transpose the characters at positions i and i+1.
The change editing operation defined here may change a character to itself. This deviates from the definition of the traditional change operation in which a character has to be changed to a different one. Defining the change operation in this new way
greatly simplifies our later discussions. It can be shown, however, that all the results obtained in this invention still apply if we adopt the traditional change operation in our fault model.
Definition 1: An editing operation of insert, delete, change or transpose is proper if it can be carried out. An editing sequence E[1:k] on a character string is a sequence consisting of proper editing operations. Each editing operation E[j] is associated with an index index_E[j], the position on the string where E[j] is acting. The index sequence index_E[1:k] is the sequence of position indices associated with the editing sequence E[1:k].
For example, the editing sequence D(3)D(4)I(5,o)C(8,s) transforms the word "jeopardize" into the incorrect spelling "jeprodise", and T(2) transforms "deuce" into "duece". The index sequence of the former editing sequence is 3-4-5-8, while the index sequence of the latter is -2-. T(5) is not an editing sequence on "deuce" because it cannot be carried out.
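The editing operations above can be exercised directly. The sketch below is a small illustration of our own (the tuple encoding of operations is an assumption, not part of the disclosure), replaying the two example sequences with 1-indexed positions as in the text.

```python
# Apply the editing operations I(i, s), D(i), C(i, s), T(i) to a string,
# using 1-indexed positions as in the text.
def apply_op(s, op):
    kind, i = op[0], op[1]
    if kind == "I":                      # insert op[2] before position i
        return s[:i-1] + op[2] + s[i-1:]
    if kind == "D":                      # delete the character at position i
        return s[:i-1] + s[i:]
    if kind == "C":                      # change the character at i to op[2]
        return s[:i-1] + op[2] + s[i:]
    if kind == "T":                      # transpose characters at i and i+1
        return s[:i-1] + s[i] + s[i-1] + s[i+1:]
    raise ValueError("unknown operation: " + kind)

word = "jeopardize"
for op in [("D", 3), ("D", 4), ("I", 5, "o"), ("C", 8, "s")]:
    word = apply_op(word, op)
print(word)                          # jeprodise

print(apply_op("deuce", ("T", 2)))   # duece
```

Running the four operations in order reproduces the misspelling "jeprodise", and the single transpose reproduces "duece".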
Definition 2: The editing distance between two strings X and Y is the length of the shortest editing sequence that transforms X into Y.
An elegant notion, called trace, has been developed by Wagner and Fisher, op. cit. , and Lowrance and Wagner, op. cit. , to facilitate the discussion of string editing problems.
Definition 3: A trace T from a string X to a string Y is the union of two sets, identity set I and changing set C, of number pairs (i, j), where 1 ≤ i ≤ /_X, 1 ≤ j ≤ /_Y, such that
a) if (i, j) is in I, X[i] = Y[j];
b) if (i, j) is in C, X[i] ≠ Y[j];
c) if (i1, j1) and (i2, j2) are in T, then i1 = i2 if and only if j1 = j2; i.e., each character of X and each character of Y is touched by at most one pair.
Each pair in T will be called a line connecting a character of X and a character of Y. Two lines (i1, j1), (i2, j2) in T cross each other if i1 < i2 but j2 < j1, or i2 < i1 but j1 < j2. If (i, j) is in T, X[i] and Y[j] are said to be incident to that line.
It is easy to see that every editing sequence results in a trace, and every trace corresponds to at least one editing sequence. For example, the editing sequence D(2)T(1)C(3,r), which transforms "testing" into "string," corresponds to the trace {(1,2),(3,1),(4,3),(5,4),(6,5),(7,6)} with identity set I = {(1,2),(3,1),(5,4),(6,5),(7,6)} and changing set C = {(4,3)}.
The discussion in the references cited above used weighted costs for editing operations. Here, we consider the problem of lengths of editing sequences, which amounts to assigning a weight of one to each editing operation. The cost of a trace can be defined as follows:
Definition 4: The cost of a trace T = I ∪ C between two character strings X and Y is equal to (/_X + /_Y) - (2 x n_I) - n_C + the count of line crossings in T.
Definition 5: A trace T = I ∪ C from X to Y is a restricted trace (R-trace) if
a) whenever (i1, j1) and (i2, j2) are in T and cross each other, both (i1, j1) and (i2, j2) are in I;
b) if (i1, j1), (i2, j2) and (i3, j3) are three lines in T, and (i1, j1) crosses both (i2, j2) and (i3, j3), then (i2, j2) = (i3, j3);
c) if (i1, j1) and (i2, j2) are lines of T that cross, with i1 < i2, then there is no integer i (or j) such that
(1) i1 < i < i2 and X[i2] = X[i], or
(2) j2 < j < j1 and Y[j2] = Y[j].
Condition a) in Definition 5 states that only lines in I can cross in a restricted trace. Condition b) states further that no line crosses more than one line. Condition c) states that i2 (j1) is the rightmost position in X[1:i2] (Y[1:j1]) carrying the character X[i2] = Y[j1] (Y[j1] = X[i2]). A cross in a constrained (restricted) trace can be considered as an aggregate of a series of transpose, insert and delete operations.
Definition 6: E[1:n] is a linear editing sequence on a character string if it is an editing sequence, its index sequence index_E[1:n] is non-decreasing, and
a) if index_E[i] = index_E[i+1], then E[i] is a delete operation;
b) if E[i] is a transpose operation, then index_E[i+1] > index_E[i] + 1.
Definition 6 requires a linear editing sequence to operate on the string from left to right, with each insert and change operation fixing one character, and each transpose operation fixing two consecutive characters. For example, the sequence D(2)T(1)C(3,r), which transforms "testing" into "string," is an editing sequence but not a linear editing sequence, because the index sequence 2-1-3 is not non-decreasing. This transformation can be performed by a linear editing sequence D(1)D(1)I(3,r), with a non-decreasing index sequence 1-1-3.
In a linear editing sequence, a later editing operation will not cancel the effect of an earlier operation. For example, an inserted character will not be erased by a later delete operation. We can thus consider a linear editing sequence from X to Y as a sequence that creates errors in spelling a word, and consider the length of the editing sequence as the number of errors that occurred in the spelling process. We can naturally define the error distance from a string X to another string Y as follows.
Definition 7: The error distance, or the number of spelling errors, from a character string X to a character string Y is the minimal length of the linear editing sequences that transform X into Y.
In parallel with the R-trace (Definition 5), a linear trace (L-trace) has been defined corresponding to each linear editing sequence.
Definition 8: A trace T = I ∪ C from X to Y is a linear trace (L-trace) if the following is true:
if (i1, j1) and (i2, j2) are in T and cross each other, with i1 < i2, then both (i1, j1) and (i2, j2) are in I, and i2 = i1 + 1, j1 = j2 + 1.
According to Definition 8, {(3,1),(4,2),(5,4),(6,5),(7,6)} is an L-trace from "testing" to "string", and both {(1,1),(4,4),(5,5)} and {(1,1),(2,3),(3,2),(4,4),(5,5)} are L-traces from "deuce" to "duece".
PROPERTIES OF L-TRACE FAULT MODEL This section introduces some basic properties of the L-trace fault model. From the preceding definitions, it follows that:
Theorem 1: The minimal cost of L-traces between two strings X and Y is equal to the error distance between X and Y, which is the number of spelling errors from X to Y.
An example will illustrate the relation between the length of a linear editing sequence and the cost of the corresponding L-trace. In FIG. 5, a linear editing sequence of the L-trace is:
C(1,*)D(4)D(4)I(4,*)I(5,*)T(6)D(8)D(8)I(8,*)
where * represents some character. The length of the sequence is 9 = /_X + /_Y - 2 x n_I - n_C + count of line crossings in the trace = 9 + 8 - 2x4 - 1 + 1 = cost of the L-trace.
By Theorem 1, to find the error distance between X and Y, we need only find a minimal cost L-trace between X and Y.
Let H denote the error distance matrix between two character strings X and Y, i.e., H[i, j] is the error distance between X[1:i] and Y[1:j]. The following theorem calculates the error distance matrix H.
Theorem 2: Given two strings X[1:m] and Y[1:n], let bound = max{m, n}. Define the boundary values of H[-1:m, -1:n] as: H[i, -1] = bound for -1 ≤ i ≤ m; H[-1, j] = bound for -1 ≤ j ≤ n; H[i, 0] = i for 0 ≤ i ≤ m; and H[0, j] = j for 0 ≤ j ≤ n. The entry H[i, j] of the distance matrix H[1:m, 1:n] between X and Y can be calculated recursively as
Formula 1:
H[i+1, j+1] = H[i, j] if X[i+1] = Y[j+1];
H[i+1, j+1] = min{H[i, j], H[i+1, j], H[i, j+1], H[i-1, j-1]} + 1 if both X[i] = Y[j+1] and X[i+1] = Y[j]; and
H[i+1, j+1] = min{H[i, j], H[i+1, j], H[i, j+1]} + 1 in all other cases.
There are five alternatives in Formula 1 for getting the value H[i+1, j+1]. Each corresponds to one of the five editing operations, with the contributing term shown in parentheses:
a) no change, when X[i+1] = Y[j+1]; (H[i, j])
b) change X[i+1] to Y[j+1]; (H[i, j] + 1)
c) insert Y[j+1]; (H[i+1, j] + 1)
d) delete X[i+1]; and (H[i, j+1] + 1)
e) transpose. (H[i-1, j-1] + 1)
Theorem 3: The matrix H[0:m, 0:n] defined by Formula 1 satisfies the following properties:
a) H[i, j] - 1 ≤ H[i+1, j] ≤ H[i, j] + 1 for all 0 ≤ i < m, 0 ≤ j ≤ n;
b) H[i, j] - 1 ≤ H[i, j+1] ≤ H[i, j] + 1 for all 0 ≤ i ≤ m, 0 ≤ j < n;
c) H[i, j] ≤ H[i+1, j+1] ≤ H[i, j] + 1 for all 0 ≤ i < m, 0 ≤ j < n.
Corollary 1: Formula 1 in Theorem 2 can thus be simplified to
Formula 2:
H[i+1, j+1] = H[i, j] if X[i+1] = Y[j+1];
H[i+1, j+1] = min{H[i-1, j-1], H[i+1, j], H[i, j+1]} + 1 if both X[i] = Y[j+1] and X[i+1] = Y[j]; and
H[i+1, j+1] = min{H[i, j], H[i+1, j], H[i, j+1]} + 1 in all other cases.
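The recurrence translates directly into a dynamic program. The sketch below is our own straightforward implementation of the Formula 1/Formula 2 recurrence (edit distance with insert, delete, change and adjacent transpose), omitting the extra boundary rows of Theorem 2, which matter only for the cut-off machinery discussed later.

```python
# Error distance per the recurrence above: edit distance with insert, delete,
# change, and transpose of adjacent characters.
def error_distance(X, Y):
    m, n = len(X), len(Y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i                     # delete all of X[1:i]
    for j in range(n + 1):
        H[0][j] = j                     # insert all of Y[1:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i-1] == Y[j-1]:        # X[i] = Y[j]: no operation needed
                H[i][j] = H[i-1][j-1]
            else:
                best = min(H[i-1][j-1], H[i-1][j], H[i][j-1])
                # transpose case: X[i-1] = Y[j] and X[i] = Y[j-1]
                if i > 1 and j > 1 and X[i-2] == Y[j-1] and X[i-1] == Y[j-2]:
                    best = min(best, H[i-2][j-2])
                H[i][j] = best + 1
    return H[m][n]

print(error_distance("jeopardize", "jeprodise"))   # 4
print(error_distance("deuce", "duece"))            # 1
print(error_distance("testing", "string"))         # 3
```

The outputs agree with the examples in the text: four errors from "jeopardize" to "jeprodise", one transpose from "deuce" to "duece", and three operations from "testing" to "string".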
Definition 9: Let x1 ≤ x2 and y1 ≤ y2. A sequence (i1, j1)-(i2, j2)-...-(ir, jr) on a matrix of dimension [0:m, 0:n] is a descendent path from (x1, y1) to (x2, y2) if (i1, j1) = (x1, y1), (ir, jr) = (x2, y2), and 0 ≤ i(s+1) - i(s) ≤ 1 and 0 ≤ j(s+1) - j(s) ≤ 1, but either i(s+1) - i(s) or j(s+1) - j(s) is greater than 0, for 1 ≤ s < r.
Theorem 4: Let H[0:m, 0:n] be the error distance matrix between X[1:m] and Y[1:n], defined by Theorem 2. Assume that m ≥ n and let d = m - n. Then the descendent path (1, 1)-(2, 1)-...-(d+1, 1)-(d+2, 2)-...-(m, n) is non-decreasing on H. It can be shown that this is a unique such descendent path, and it provides a cut-off criterion.
Corollary 2: Let d be the error distance between two character strings X and Y. Then d ≤ max{/_X, /_Y}. This is a simple upper bound of the error distance between two strings.
Corollary 3: Let d be the error distance between two character strings X and Y. Then /_X - d ≤ /_Y ≤ /_X + d.
Corollary 2 gives a simple upper bound on the error distance between two strings. Corollary 3 is the string length partition criterion ordinarily used to save computation in nearest neighbor searching of character strings in the prior art.
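Corollary 3 is what licenses the length filter used throughout the algorithms below. A one-line sketch (the function and parameter names are our own, for illustration):

```python
# By Corollary 3, a word within error distance d of Z must have length in
# [len(Z) - d, len(Z) + d], so only those length groups need be searched.
def candidate_lengths(z_len, d, min_len, max_len):
    lo = max(min_len, z_len - d)
    hi = min(max_len, z_len + d)
    return list(range(lo, hi + 1))

print(candidate_lengths(10, 2, 1, 30))   # [8, 9, 10, 11, 12]
```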
Before proceeding further with the algorithms, we first consider some properties of dictionaries.
1) Word lengths in dictionaries are generally small, as shown in FIG. 6. Therefore, simple instead of complex algorithms should be used in the distance calculation. Sophisticated distance calculation algorithms generally have large time constants and are good only for long strings.
2) The shapes of the distributions of word length in the three dictionaries which we studied are bell-like, i.e., there are many fewer words with either small or large word length than words with medium word length. This implies that words with small or large word length can be treated separately without much affecting either the average performance or the worst case performance of ASM (approximate string matching) algorithms. Treating long words separately is especially beneficial because such words have huge neighborhoods, as discussed before.
3) Although words in dictionaries are not random, FIGS. 7a, 7b and 7c show that they do not cluster together either. This phenomenon may be partially attributed to the fact that the alphabet size in use is generally much larger than the length of an average word in the dictionary. From FIGS. 7a, 7b and 7c, it can be seen that words are actually declustered, in the sense that very few words lie in the near neighborhood, and almost all words lie at a far distance, of every word. This makes it possible to design an efficient indexing mechanism for nearest neighborhood searching.
Given a character string Z and a dictionary, we can find the nearest neighbors of Z by calculating the distance between Z and every word in the dictionary, and we can record those words with the minimal distance. In the following discussion, we assume that the words in the dictionary are partitioned according to their length. The simple upper bound discussed above in Corollary 2 tells us that we need not consider words with length greater than 2 x maximum{/_Z, minimal length of words in the dictionary}, which is ordinarily equal to 2 x /_Z. Thus, the upper bound defined by Corollary 2 can be used immediately to decrease the number of words to be compared.
That number can be further cut down by the string length partition criterion discussed above in Corollary 3, because by dynamically recording a number d, the smallest distance currently found, there is no need to compare those words with length less than /_Z - d or greater than /_Z + d. The best strategy to use this property is to search through word groups in which the difference between their word lengths and /_Z is equal to 0, 1, etc., until a neighbor or neighbors are found.
Another simple rule, given by Theorem 4 above, to make the nearest neighbor searching more efficient is the cut-off criterion for the distance calculation, because it can tell, during the calculation, whether the distance is larger than a pre-specified quantity. This property is useful because when the error distance between the given string and its neighbors in the dictionary is small, which is usually the case, we can avoid the calculation of most of the entries in the error distance matrices between the given string and the words in the dictionary.
A FOUR-STEP REDUCTION PROCEDURE TO CONSTRUCT EFFICIENT ASM ALGORITHMS
In order to use the cut-off criterion, the entries of an error distance matrix must be calculated in a particular order, as shown in FIG. 8; here we assume that /_X ≥ /_Y. We visit the entries layer by layer, along the descendent path of Theorem 4 (the cut-off path). At the end of the calculation of a layer, the H value on the cut-off path is obtained and compared to the error distance r of the current neighborhood. If that H value is smaller than r, we calculate another layer. If the layer is the last one, indicating that a nearest neighbor has been found, we record the word and continue to find all the words with distance equal to the current distance r. When the H value is greater than r on the cut-off path, we suspend the calculation for the current word and go on to the next word. If no word has been found within distance r, we relax r to r + 1 and continue the searching. Note that we can always find a nearest neighbor within the distance of the maximal length of the words in the dictionary, and usually a nearest neighbor will be found long before we reach such a large distance. The three approaches, i.e., using the upper bound of Corollary 2, the string length partition criterion of Corollary 3 and the cut-off criterion of Theorem 4, represent three improvements over the basic exhaustive comparison approach. Each step can be naturally incorporated into the next step, as shown by the following Algorithms 0, 1, 2, and 3. They implement the exhaustive comparison method, the upper bound, the string length criterion, and the
cut-off criterion, respectively. In all the algorithms, we group the words in the dictionary according to their length. We let max_DICT be the maximal word length, min_DICT be the minimal word length, and n_word_DICT[i] be the number of words of length i in the dictionary. error_distance(X, Z) is a subroutine that calculates the error distance between X and Z by using Formula 2.
Algorithm 0: (Exhaustive comparison method)
0. Given string Z.
1. Let minimum_found = 9999. /* Set the minimal distance to a large number. */ Set S = φ.
2. For (X in Dictionary)
   {
     dist = error_distance(X, Z);
     if (dist ≤ minimum_found)
     {
       if (dist < minimum_found)
       {
         minimum_found = dist;
         reset S to {X};
       }
       else
         S = S ∪ {X};
     }
   }
3. End. /* S is the set of the nearest neighbors found. */
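As a concrete illustration, Algorithm 0 can be sketched in Python (the patent's implementation is in C, and its pseudocode is given above). The `error_distance` routine here is a restricted Damerau-Levenshtein dynamic program over the insert/delete/change/transpose operations of the fault model; the patent's Formula 2 may differ in detail, so the function names and the exact distance are assumptions of this sketch.

```python
def error_distance(x, z):
    """Restricted Damerau-Levenshtein distance: insert, delete,
    change, and transposition of adjacent characters each cost 1."""
    m, n = len(x), len(z)
    # h[i][j] = distance between x[:i] and z[:j]
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i
    for j in range(n + 1):
        h[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == z[j - 1] else 1
            h[i][j] = min(h[i - 1][j] + 1,         # delete
                          h[i][j - 1] + 1,         # insert
                          h[i - 1][j - 1] + cost)  # change / identity
            if (i > 1 and j > 1 and x[i - 1] == z[j - 2]
                    and x[i - 2] == z[j - 1]):
                h[i][j] = min(h[i][j], h[i - 2][j - 2] + 1)  # transpose
    return h[m][n]

def nearest_neighbors_exhaustive(z, dictionary):
    """Algorithm 0: compare z against every word in the dictionary,
    keeping the set S of words at the smallest distance seen so far."""
    best, s = float("inf"), set()
    for x in dictionary:
        d = error_distance(x, z)
        if d < best:
            best, s = d, {x}
        elif d == best:
            s.add(x)
    return best, s
```

For example, searching "rest" against the small dictionary used later in this document returns "test" and "best" at distance one.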
Algorithm 1: (Cut by upper bound of distance: Corollary 2)
0. Given string Z.
1. Let minimum_found = 9999. /* Set the minimal distance to a large number. */ Set S = φ. upper_bound = 2 × maximum{min_DICT, /_Z}.
2. For (X in Dictionary and /_X < upper_bound)
   {
     dist = error_distance(X, Z);
     if (dist ≤ minimum_found)
     {
       if (dist < minimum_found)
       {
         minimum_found = dist;
         reset S to {X};
       }
       else
         S = S ∪ {X};
     }
   }
3. End. /* S is the set of the nearest neighbors found. */
Algorithm 2: (Cut by current upper bound of distance: Corollary 3)
0. Given string Z.
1. Set S = φ. radius = -1;
2. While (S = φ) do step 3 and step 4.
   {
3.   radius = radius + 1;
4.   For (|l - /_Z| ≤ radius)  /* l is the loop control variable. */
     {
       For (X in DICTIONARY and /_X = l)
       {
         dist = error_distance(X, Z);
         if (dist = radius) S = S ∪ {X};
       }
     }
   }
5. End. /* S is the set of the nearest neighbors found. */
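Algorithm 2 can be sketched in Python as follows. The dictionary is grouped by word length, and at each radius r only the lengths l with |l - len(z)| ≤ r are examined. A plain Levenshtein distance stands in for the patent's Formula 2 (which also counts transpositions), and the function names are assumptions of this sketch.

```python
from collections import defaultdict

def dist(x, z):
    """Plain Levenshtein distance (insert/delete/change), computed
    row by row; a stand-in for the patent's Formula 2."""
    prev = list(range(len(z) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cz in enumerate(z, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cx != cz)))
        prev = cur
    return prev[-1]

def nearest_neighbors_by_length(z, dictionary):
    """Algorithm 2: expand the radius one step at a time; at radius r,
    compare z only with words whose length l has |l - len(z)| <= r."""
    if not dictionary:
        return -1, set()
    by_len = defaultdict(list)
    for w in dictionary:
        by_len[len(w)].append(w)
    radius, s = -1, set()
    while not s:
        radius += 1
        for l in range(len(z) - radius, len(z) + radius + 1):
            for x in by_len.get(l, []):
                if dist(x, z) == radius:
                    s.add(x)
        if radius > max(by_len) + len(z):  # safety stop for the sketch
            break
    return radius, s
```

A word at length difference d from z has distance at least d, so words skipped at radius r could not have belonged to any earlier radius either.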
Algorithm 3 modifies Algorithm 2. It calls a subroutine error_dist(X, Y, r), which finds the error distance between two character strings X and Y if the distance is no greater than r. If the distance is found to be greater than r during the calculation, error_dist will suspend and return -t, where t is the number of layers calculated, applying the cut-off criterion of Theorem 4.
Algorithm 3: (Cut by cut-off criterion: Theorem 4)
0. Given string Z.
1. Let S = φ, radius = -1;
2. While (S = φ) do step 3 and step 4.
   {
3.   radius = radius + 1;
4.   For (|l - /_Z| ≤ radius)  /* l is the loop control variable. */
     {
       For (X in DICTIONARY and /_X = l)
       {
         dist = error_dist(X, Z, radius);
         if (dist = radius) S = S ∪ {X};
       }
     }
   }
5. End. /* S is the set of the nearest neighbors found. */
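The cut-off subroutine error_dist(X, Y, r) can be sketched as follows. This version abandons the calculation as soon as the minimum entry of a completed dynamic-programming row exceeds r; since the optimal editing path must pass through every row, each completed row's minimum is a lower bound on the final distance. This captures the spirit of Theorem 4's layer-by-layer cut-off without reproducing its exact traversal path, and it uses a plain Levenshtein inner step, so it is a sketch rather than the patented procedure.

```python
def error_dist_bounded(x, z, r):
    """Distance between x and z, suspended with a negative return
    value -t (t = rows completed) as soon as the minimum of the
    current row exceeds r, in the spirit of the cut-off criterion."""
    prev = list(range(len(z) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cz in enumerate(z, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cx != cz)))
        if min(cur) > r:
            return -i  # suspend: distance is provably greater than r
        prev = cur
    return prev[-1] if prev[-1] <= r else -len(x)
```

As in the text, a negative return tells the caller to skip the word at the current radius and retry it only if the radius is later relaxed.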
Definition 10: Define V_alg(Z) to be the set of error distance matrix (H[i, j]) entries visited by algorithm alg in searching for the nearest neighbors of Z. Also define E_alg(Z) to be the total number of times the error distance matrix (H[i, j]) entries are computed by algorithm alg in searching for the nearest neighbors of Z. We have

V_alg(Z) = ∪ v(w_i),

where the union is taken over all words w_i in the dictionary, and v(w_i) is the set of H[i, j] entries visited by algorithm alg when comparing the given string Z and the word w_i. We also have

E_alg(Z) = Σ e(w_i),

where the sum is taken over all words w_i in the dictionary, and e(w_i) is the number of times the H[i, j] entry values are computed by algorithm alg for word w_i. It is clear that the computation time of algorithm alg is roughly proportional to E_alg. In the testing experiments discussed hereinbelow, E_alg will be used as a measure of the efficiency of the algorithms described herein. It is also easy to show that Algorithms 1, 2, and 3 described above each successively cut down the number of H[i, j] entries visited by its predecessor. This number of H[i, j] entries can be reduced further so that the algorithms can be speeded up.
We may observe that E_algorithm_0 = V_algorithm_0, E_algorithm_1 = V_algorithm_1, and E_algorithm_2 = V_algorithm_2. But because algorithm_3 does not store the intermediate calculation results each time it suspends the calculation of a word in the dictionary, entry values on distance matrices may be recomputed several times. Therefore, E_algorithm_3 may be greater than V_algorithm_3. When a given word Z is at a near distance of a word in the dictionary, however, E_algorithm_3 will still be close to V_algorithm_3.
It can be shown that

V_algorithm_0(Z) ⊇ V_algorithm_1(Z) ⊇ V_algorithm_2(Z) ⊇ V_algorithm_3(Z).
Therefore, the efficiencies of the algorithms have been improved successively. We may now ask: Can the algorithms be speeded up further? The answer is yes.
We have found that it is possible to provide a mechanism that avoids comparing words at large distances when searching in the neighborhood of a small distance. Specifically, given a word Z and a small distance r, which define a neighborhood N(Z, r), we have found a mechanism to calculate R(Z, r), a small subregion of the dictionary, with

{X | X ∈ DICTIONARY, |/_X - /_Z| ≤ r} ⊇ R(Z, r) ⊇ N(Z, r) ∩ DICTIONARY.

It is evident that we need to compare Z with only those words in R(Z, r) to find words in the dictionary that are within distance r from Z. In the following text we modify Algorithm_3 to arrive at Algorithm_4. The design of the mechanism for calculating R(Z, r) is described later.
Algorithm 4: (Cut by cut-off criterion in Theorem 4 and limiting the searching region)
0. Given string Z.
1. Let S = φ, radius = -1;
2. While (S = φ) do step 3 and step 4.
   {
3.   radius = radius + 1;
4.   For (X in R(Z, radius))
     {
       dist = error_dist(X, Z, radius);
       if (dist = radius) S = S ∪ {X};
     }
   }
5. End. /* S is the set of the nearest neighbors found. */

We conclude this section with the following statement:
Theorem 5: For any string Z, the following relation exists:

V_algorithm_0(Z) ⊇ V_algorithm_1(Z) ⊇ V_algorithm_2(Z) ⊇ V_algorithm_3(Z) ⊇ V_algorithm_4(Z).
STRINGS WITH DUMMY X, DEVIATION VECTORS, AND NEIGHBORHOOD COVERING
The mechanism that we have developed for constructing R(Z, r) can be described by the mapping diagram in FIG. 9. Given any Z in the string universe U and a small integer r, let the neighborhood N(Z, r) = ∪ N_i, where the N_i's are not necessarily disjoint sets of strings. Assume that, corresponding to each N_i, we have a mapping function h_i which maps from U into a finite (integer) range H_i. A function designed for this mapping purpose will be called an h_i (hash) function. The following statement is clearly true: if a string X is not in the inverse image of any h_i(N_i), then X is not in N(Z, r). Therefore, to find whether there are nearest neighbors at distance r, we need only compare Z with those words in the dictionary which lie in the inverse image of some h_i(N_i).
The above observation is formally described by the following theorem:
Theorem 6: Let DICT be a set of words extracted from the string universe U, and let h_i map from U to integer ranges H_i, 1 ≤ i ≤ s. For a given string Z and an integer r, if N(Z, r) = ∪ N_i, then N(Z, r) ⊆ ∪ h_i^-1(h_i(N_i)), and a word X in DICT is in N(Z, r) only if X is in

R(Z, r) = DICT ∩ (∪ h_i^-1(h_i(N_i))), or equivalently in

R(Z, r) = ∪ (DICT ∩ h_i^-1(h_i(N_i))).   (2)
The following corollary then results:
Corollary 4: Let R(Z, r) = ∪ (DICT ∩ h_i^-1(h_i(N_i))), as defined above. If r is the least integer such that there exists an X in R(Z, r) with its error distance to Z equal to r, then X is a nearest neighbor of Z.   (3)

From this, it is evident that Algorithm 4 implements statements (2) and (3) above.
The following two aspects of the problem remain to be discussed:

1) Assuming that the h_i functions have been constructed, how do we generate R(Z, r) from a given string Z and a given error distance r?

2) Given a dictionary, how do we construct a set of h_i functions that can be used to generate R(Z, r)? Those functions must be general enough that only a small set of h_i functions is needed for all possible Z strings. Both h_i and its inverse must also be implementable efficiently, in terms of both time and memory space.
Intuitively, although the neighborhood N(Z, r) is huge for even a modest length of Z and a small r, because all strings in the near neighborhood of any specific string must be "similar" to each other, good solutions to the above two problems may still exist. In the following text, we describe our R(Z, r) mechanism and show how it works. We start from the representation of strings in N(Z, r).
The notation N(Z, r) is itself a simple representation of the neighborhood of Z with error distance r, but it is too abstract to be useful here. By observing that the enormity of N(Z, r) is actually caused by two operations, Insert and Change, which create a large number of possibilities for choosing characters, we have found that the neighborhood representation can be simplified by the introduction of strings with dummy symbol X, as defined by the following definition.

Definition 11: Assume that the symbol X itself is not in Σ. A string with (dummy) symbol X is any string in (Σ ∪ {X})*.
We can consider a string with dummy X to be a set of strings generated by replacing each X in the string by any character in Σ. To see how strings with dummy X work in representing a neighborhood, we construct a derivation tree for all strings within radius one of the string "test", as shown in FIG. 10. The derivation tree represents an enumeration of all possible editing sequences on "test", with the editing operations performed from the left of the string to the right. Note that in the figure each node may have five outgoing branches, I, D, C, T, and i, corresponding to the five editing operations Insert, Delete, Change, Transpose, and identity, respectively. The i outgoing branch exists only when the number of errors from the root to the node exceeds one. Dashed lines are used to indicate that the editing operation cannot be applied. We can write:
N("test", 1) = Xtest ∪ est ∪ Xest ∪ etst ∪ tXest ∪ tst ∪ tXst ∪ tset ∪ teXst ∪ tet ∪ teXt ∪ tets ∪ tesXt ∪ tes ∪ tesX ∪ testX.
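The sixteen terminal strings above can be generated mechanically. The sketch below (a hypothetical helper, not from the patent) emits one pattern per single editing operation, using 'X' as the dummy symbol for the character freedom created by Insert and Change.

```python
def radius_one_patterns(z):
    """Enumerate N(z, 1) as strings with dummy symbol 'X': one
    pattern for each single Insert, Delete, Change, or Transpose."""
    pats = set()
    for i in range(len(z) + 1):            # Insert an arbitrary character
        pats.add(z[:i] + "X" + z[i:])
    for i in range(len(z)):                # Delete the character at i
        pats.add(z[:i] + z[i + 1:])
    for i in range(len(z)):                # Change the character at i
        pats.add(z[:i] + "X" + z[i + 1:])
    for i in range(len(z) - 1):            # Transpose positions i, i+1
        pats.add(z[:i] + z[i + 1] + z[i] + z[i + 2:])
    pats.discard(z)  # transposing equal adjacent characters reproduces z
    return pats
```

Applied to "test", it reproduces exactly the sixteen strings of the enumeration above.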
A derivation tree for N(Z, r) for r greater than one is constructed similarly by letting the number of errors from the root of the tree to any terminal node be exactly equal to r. Note that the Change editing operation causes any string with error distance less than r to be included in the enumeration. Thus, any N(Z, r) can be represented by the terminal nodes on the derivation tree. The number of those nodes on the derivation tree is much smaller than the size of the neighborhood N(Z, r).
In the case when string Z is long and r is greater than one, the number of terminal nodes on a derivation tree may still be very large. We can improve performance by noticing that the strings with dummy X on the terminal nodes of a derivation tree are still very "similar", and that the constraint on the N_i needed in Theorem 6 is only that N(Z, r) ⊆ ∪ N_i. We need the following definitions to proceed.
Definition 12: Let Z = z_1 z_2 ... z_/_Z, with each z_i from Σ. A vector V = [v_1, v_2, ..., v_/_V] is said to be a deviation vector with dummy X (of Z) if each v_j is either some i, with 1 ≤ i ≤ /_Z, or X. The corresponding string S = s_1 s_2 ... s_/_V, with s_j = z_{v_j} if v_j ≠ X, and s_j = X if v_j = X, is said to be a string with dummy X (of Z) derived from V.

For example, S = Xest can be derived from Z = test, with deviation vector [X, 2, 3, 4]. Note that S can also be derived from another deviation vector, [X, 2, 3, 1].
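Definition 12 translates directly into a short derivation routine; the sketch below (the function name is ours) uses 1-based indices into Z, as in the definition.

```python
def derive(z, vector):
    """Derive the string with dummy X of z from a deviation vector:
    an integer entry i selects character z[i-1] (1-based), and an
    'X' entry produces the dummy symbol."""
    return "".join("X" if v == "X" else z[v - 1] for v in vector)
```

Both deviation vectors of the example derive the same string, since the first and fourth characters of "test" coincide.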
Definition 13: A set of deviation vectors {V_i} with dummy X is a covering scheme of a set of strings, N, if every string in N is a member of at least one of the strings with dummy X derived from the V_i. We also say that {V_i} covers N if it is a covering scheme of N.

Each string with dummy X on a terminal node of a neighborhood derivation tree represents a set of strings in the neighborhood. Several such strings with dummy X can be covered in turn by a (larger) string with dummy X derived from a deviation vector. For example, XXst, derived from [X, X, 3, 4], covers Xest, etst, and tXst of N("test", 1). Therefore, a deviation vector of a string Z can be considered as a super-cover of strings in the neighborhood of Z. Any neighborhood can be covered by a set of deviation vectors, or super-covers. For example, N("test", 1) is covered by the following set of deviation vectors:

{[1,2,X], [1,3,X], [2,3,X], [1,2,X,X], [1,3,X,X], [2,1,X,X], [1,X,3,X], [X,2,3,X], [1,X,X,X,4], [X,1,2,X,X], [X,2,3,X,X]},

which derives the following set of strings with dummy X:

{esX, tsX, teX, teXX, tsXX, etXX, tXsX, XesX, XteXX, tXXXt, XesXX}.
The deviation vectors have the nice feature that they specify only the positions from which to extract characters from a string Z, and (implicitly) the positions in the vector at which to put those characters. In other words, if a set of deviation vectors is a covering scheme of a neighborhood N(Z, r), then it also covers any other N(Z', r), as long as /_Z' is equal to /_Z. This leads us to construct R(Z, r) based on deviation vectors. Let N(Z, r) be covered by the set of deviation vectors {V_i}, and let each V_i derive S_i, a string with dummy X. Then N(Z, r) ⊆ ∪ {S_i}. Therefore, {S_i} serves well as the cover {N_i} in Theorem 6, with the condition in the theorem being relaxed to N(Z, r) ⊆ ∪ N_i.
THE R(Z, r) SCHEME
We can now describe the R(Z, r) scheme that we propose. It consists of two structures: sets of deviation vectors for covering neighborhoods, and a set of h_i functions to calculate the mappings and inverse mappings.
We chose the following h_i function for a deviation vector: let a deviation vector be V = [v_1, v_2, ..., v_/_V], let a string be S = s_1 s_2 ... s_/_V, and, from left to right, let c_1, ..., c_j be the character codes of the characters of S corresponding to the non-X v_k's in V. Then

h_i(S) = c_1(mod |Σ|) + c_2(mod |Σ|) × |Σ| + ... + c_j(mod |Σ|) × |Σ|^(j-1).

This h_i function partitions the set of words of length /_V in the dictionary into |Σ|^j blocks, with some of those blocks possibly empty. With this selection of the h_i function, it is easy to calculate both h_i(S) and DICT ∩ h_i^-1(h_i(S)), provided inverted files have been constructed for h_i on the dictionary. Note that h_i maps all the strings represented by a string with dummy X to a single value. The following example illustrates how to calculate R(Z, r) for a given string Z and a small distance r; the calculation procedure solves Problem 2, posed previously.
Let the dictionary of interest be DICT = {test, the, best, mess, example}, and assume that we want to find the nearest neighbors of the string "rest", which is not in the dictionary. First, we try the neighborhood with zero distance: N("rest", 0) = {"rest"}, and "rest" is not in the dictionary. So we try the next smallest neighborhood, N("rest", 1). The same set of deviation vectors that was used to cover N("test", 1) can be used to cover N("rest", 1). It is

{[1,2,X], [1,3,X], [2,3,X], [1,2,X,X], [1,3,X,X], [2,1,X,X], [1,X,3,X], [X,2,3,X], [1,X,X,X,4], [X,1,2,X,X], [X,2,3,X,X]},

which generates the following set of strings with dummy X:

S = {esX, rsX, reX, reXX, rsXX, erXX, rXsX, XesX, XreXX, rXXXt, XesXX}.
Assume that Σ is the set of lower case characters, and that ASCII codes are used to represent these characters. It is easy to see that only {XesX} can be mapped by h_i and then h_i^-1 to words in the dictionary. The calculations can be carried out as follows:

h_i(XesX) = 91(mod 26) + 105(mod 26) × 26 = 39.

R("rest", 1) = DICT ∩ h_i^-1(39) = {"test", "best", "mess"}.

The error distances between "test", "best", "mess" and the given string "rest" are 1, 1, and 2, respectively. So the nearest neighbors of "rest" are "test" and "best".
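The h_i arithmetic in the example can be sketched as below, with |Σ| = 26 and Python's standard ord() supplying the character codes; the absolute hash values therefore need not match the codes printed in the example above, but the key property holds: any two words agreeing at the selected positions fall in the same block. Names are ours.

```python
SIGMA = 26  # |Sigma|: the lower-case alphabet

def h_value(s, positions):
    """Hash a word by the character codes c1..cj at the given
    1-based positions, combined as
    c1 mod |Sigma| + (c2 mod |Sigma|)*|Sigma| + ... """
    v = 0
    for k, p in enumerate(positions):
        v += (ord(s[p - 1]) % SIGMA) * SIGMA ** k
    return v
```

Thus "test", "best", and "mess" all hash to the same value under the positions (2, 3), which is exactly why all three land in the inverted-file block probed for the pattern XesX.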
SELECTION OF h_i FUNCTIONS
In our R(Z, r) mechanism, the inverted files of every h_i function on the dictionary must be pre-constructed. We certainly do not want to construct inverted files for all possible combinations of positions in strings, thus creating an enormous number of h_i functions. In this section, we show how we select only a sufficient number of h_i functions for calculating R(Z, r).
To simplify the discussion, we restrict ourselves to the case in which the length of the string Z is equal to 5 and the error distance r is equal to 2. We also limit the choice of h_i functions to those that use only two distinct positions in Z to calculate function values. The results can be generalized easily.
Our objective is to find a small number of hash functions that can calculate values for any string with dummy X derived from a set of deviation vectors that covers N(Z, 2), for any string Z. Because the error distance of any string in N(Z, 2) is at most two, we have made the following simple but useful observation: N(Z, 2) can be covered by deviation vectors each containing at most two X symbols. We say that a set of hash functions, H, covers a set of strings with dummy X, S, if for every string s in S there is at least one function in H that can calculate a value for s.
Another observation that makes the discussion simpler is that if a set of hash functions covers the set of strings with exactly two dummy X symbols, it also covers the set of strings with fewer than two dummy X symbols.
The problem is, therefore, reduced to the following: find the smallest number of h_i functions that can calculate values for the set of strings with dummy X derived from deviation vectors with exactly two dummy X symbols. FIG. 11 interprets this problem as a covering problem, with the length of the deviation vectors equal to 5. The figure depicts a covering table, with each row representing a candidate h_i function that selects two positions in a string for the mapping, and each column representing a possible string with dummy X derived from a deviation vector with exactly two X symbols. A row covers a column if the positions selected by that row are among the non-X positions of the column. The covering relation is indicated by x symbols at the intersections of the rows and columns. We encounter the following traditional covering problem: find a minimal number of rows to cover all the columns in a table.
The general covering problem is NP-complete, which means the problem is difficult to solve when the size of the table is large. But when the size of the table is small, as in FIG. 11, minimal solutions can be obtained by traditional methods. One such minimal cover for the table in FIG. 11 is {[1,2], [1,3], [2,3], [4,5]}.
In general, the number of columns in the covering table for covering deviation vectors of length m with r X symbols is equal to the binomial coefficient C(m, r). When m is large, and with r greater than 1, the covering table becomes very large and a minimal cover will be difficult to find. In practical applications, a good cover of the table that may not be minimal is satisfactory for our purpose. In the following, we use an example to show a useful heuristic method that generally obtains good covers for tables of large size.
Assume that we want to find a number of h_i functions, each of which uses two string characters to calculate values, for all deviation vectors of length seven with at most three dummy X symbols. We draw seven vertices, and draw closed curves with each curve enclosing two vertices, such that for any choice of three vertices, there is at least one closed curve that does not enclose any one of the three vertices. A solution can be found easily by trial and error. The heuristic rule is to enclose unenclosed vertices first, i.e., to make the enclosed sets as disjoint as possible. Let the seven vertices be {1,2,3,4,5,6,7}. We can find easily that the selections {[1,2], [3,4], [5,6], [6,7], [5,7]} constitute a valid solution. FIG. 12 shows more examples of good cover selections. Although the general covering problem is NP-complete, the covering problem described above may not be NP-complete, because, in our case, the covering tables are not arbitrary. We suspect that simple and efficient procedures can be invented to find minimal solutions for our special covering problem.
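Whether a set of two-position selections is valid can be checked by brute force over all choices of vertices; the sketch below (the function name is ours) verifies both the seven-vertex solution above and the minimal cover given earlier for the table of FIG. 11.

```python
from itertools import combinations

def is_valid_selection(pairs, n, r):
    """Check the covering condition: for every choice of r vertices
    from {1..n}, at least one pair avoids all the chosen vertices."""
    return all(
        any(not (set(p) & set(chosen)) for p in pairs)
        for chosen in combinations(range(1, n + 1), r)
    )
```

For instance, {[1,2], [3,4]} alone fails for seven vertices and three X symbols, since choosing vertices 1 and 3 already touches both pairs.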
Another consideration in designing h_i functions is how many characters should be chosen from a string for the function value calculation. We have not yet obtained either theoretical or experimental results to answer this question. It is not difficult to conceive the following dilemma: the more characters chosen for calculating h_i functions, the smaller the inverse subregions will be, and thus the number of false hits will be reduced for each h_i function; but in that case, larger sets of deviation vectors need to be used for covering neighborhoods. Also, the more characters chosen for calculating h_i functions, the more h_i functions need to be provided in the whole mapping mechanism. This is a typical time-space trade-off problem. In practical design, some tuning may be required.
CONSTRUCTING COVERS OF DEVIATION VECTORS
Assume that a sufficient set of h_i functions has been selected. We now proceed to find sets of deviation vectors for covering neighborhoods. Again, we shall use a simple example to show the essence of the problem. Let Z = "1234" and r = 1. Then

N(Z, r) = X1234 ∪ 234 ∪ X234 ∪ 2134 ∪ 1X234 ∪ 134 ∪ 1X34 ∪ 1324 ∪ 12X34 ∪ 124 ∪ 12X4 ∪ 1243 ∪ 123X4 ∪ 123 ∪ 123X ∪ 1234X.
Since the h_i functions in our scheme are defined according to the string length, we need to group the strings with dummy X in N(Z, r) by length. Assume that the h_i functions {[1,2], [1,3], [2,3], [4,5]} have been selected for the mapping calculation for strings of length five. We show how to construct a set of deviation vectors to cover {X1234, 1X234, 12X34, 123X4, 1234X} of N("1234", 1). Again, a covering table can be used, as shown in FIG. 13. In the figure, row [i, j] covers a column of a string with dummy X if the i and j positions in the string are non-X. A deviation vector can be obtained easily from a row and the column which it covers. For example, [1, 3] covers 1X234; therefore, the deviation vector [1,X,2,X,X] can be used to cover 1X234.
The objective here is to find a number of rows to cover all the columns in the table, so that a minimal number of deviation vectors will be created. Since the objective function is not a direct count of the number of rows in the cover, this problem appears to be even harder than the general covering problem. We are satisfied with finding a minimal number of rows that cover all the columns because, in our experience, a minimal cover often leads to a small covering set of deviation vectors. A minimal cover for the table in FIG. 13 is {[1,2], [4,5]}. It produces the set of deviation vectors {[1,2,X,X,X], [X,X,X,3,4]}.
For longer strings and with r greater than one, the covering table for constructing covers of deviation vectors becomes huge. We have found that the following simple algorithm can effectively find good covers for our purpose. It is called a greedy algorithm because, if a column is not yet covered by the currently selected rows, an arbitrary row that covers that column will be added to the cover during the construction process.
A GREEDY ALGORITHM TO CONSTRUCT COVERS OF DEVIATION VECTORS
/* Given h functions h_1, ..., h_s, string Z = 1 2 ... /_Z, and error distance r. */
/* Let S be the set of strings with dummy X to cover N(Z, r). */
0. Set V = φ.
1. Do 2 to 5 until (S is exhausted).
   {
2.   Generate a new element s of S;
3.   if (s is not covered by V)
     {
4.     Find an h_j to cover s;
5.     V = V ∪ {deviation vector of (h_j, s)};
     }
   }
6. End. /* V is a cover of deviation vectors for N(Z, r). */
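A runnable sketch of the greedy construction follows, with strings with dummy X standing in for deviation vectors (for a fixed Z the two representations are interchangeable, as noted earlier); the names are ours. Applied to the length-five example of FIG. 13, it produces three patterns, one more than the minimal cover, which illustrates the "good but not necessarily minimal" behavior described above.

```python
def covers(pattern, s):
    """Pattern p covers string-with-dummy s iff every non-X position
    of p agrees with a non-X character of s at the same place."""
    return len(pattern) == len(s) and all(
        p == "X" or (c != "X" and p == c) for p, c in zip(pattern, s))

def greedy_cover(strings, h_positions):
    """Greedy cover construction: for each string with dummy X not yet
    covered, pick the first h (a tuple of 1-based positions) whose
    positions are all non-X in the string, and record the pattern
    that keeps exactly those characters."""
    v = []
    for s in strings:
        if any(covers(p, s) for p in v):
            continue  # already covered; generate the next element
        for h in h_positions:
            if all(s[q - 1] != "X" for q in h):
                v.append("".join(c if (i + 1) in h else "X"
                                 for i, c in enumerate(s)))
                break
    return v
```

Each appended pattern is the deviation-vector super-cover induced by the chosen h on the current string.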
The selection of h_i functions and the construction of covers of deviation vectors together provide an answer to Problem 1 above.
PROPERTIES OF DICTIONARIES IN COMMON APPLICATIONS
The distribution of the words in a dictionary will, in general, greatly affect the efficiency of word-searching algorithms. Before designing an algorithm, it is better to study the properties of the databases on which the algorithm will be working. For the purpose of this study, we wrote a program that exhaustively calculates the distance between each pair of words in a dictionary, using Formula 1. The following three kinds of dictionaries, from different application areas, have been examined by the program:
1) DICT_pgm: The set of variable and function names of a large Prolog program.
2) DICT_Unix: The set of English words in a dictionary provided by the Unix System.
3) DICT_IR: The set of index words used in an information retrieval system of the library of GTE Laboratories, Waltham, Massachusetts, which is a mixture of author names, titles, and abstracts of books, journals, and technical reports. It contains 25167 distinct words, with average word length equal to 8.320. Normalized distributions of the word length of the three dictionaries are shown in FIG. 6.
Assuming that each pair of words in a dictionary has the same probability of occurrence, we obtain the distribution of conditional probabilities of distance between words, with respect to the maximal length of a pair of words, in the three dictionaries DICT_pgm, DICT_Unix, and DICT_IR. They are drawn in FIGS. 7a, 7b, and 7c.
We have implemented the algorithms described in this paper and applied them to the dictionary DICT_IR. For Algorithm 4, we constructed inverted files for words with word length from 2 to 15. All the h_i functions select exactly two positions in strings to calculate values, except in the case of words of length 2, where only one position is used in constructing the inverted files. The number of h_i functions created and the maximal number of errors (for different word lengths) within which we provide a speedup mechanism are given in Table 1. The number of deviation vectors used to cover the neighborhood N(Z, r) is given in Table 2. Let us take one entry in Table 2 as an example. There are 48 deviation vectors in the cover for N(Z, 2) with /_Z equal to nine. Among those, there are 6, 9, 12, 11, and 10 deviation vectors of length 7, 8, 9, 10, and 11, respectively. From these numbers, we can roughly calculate the size of R(Z, r) to show the effectiveness of providing the inverted files in the R(Z, r) mapping mechanism. Assume that there are four thousand words in each group of word length 7 to 11 in the dictionary. Assume also that each h_i function partitions the four thousand words into four hundred blocks, with each block containing ten words. Then the size of R(Z, r) is less than or equal to 10 × (6+9+12+11+10) = 480. Therefore we need to compare the given Z with only 480 words, instead of the 20000 words given by the string length partition (Corollary 3) ordinarily used to save computation in nearest neighbor searching of character strings.
The actual program implementation of Algorithm 4 has been modified in the following way: during program execution, whenever no R(Z, r) mechanism has been provided for a certain portion of neighborhood N(Z, r), the program switches to Algorithm 3 to handle that portion.
We have conducted experiments to compare the efficiency of the five ASM programs. For each word length / (from 2 to 28) and each error distance r (from 1 to 4), we randomly generated 100 character strings, with the property that each one is a nearest neighbor at distance r from a word of length /. The average performance of the algorithms on the 100 strings for each / and r is recorded.
Two kinds of performance measurements are used in our experiments. The first one is the number of H[i, j] entries visited by an algorithm (Definition 10). This measurement is system and implementation independent. Since computation overheads are not negligible in Algorithm 3 and Algorithm 4, we also measure the real time used by all the implemented algorithms to compare their overall efficiencies.
All five algorithms are implemented in the C programming language. The experiments were conducted on a COMPAQ DESKPRO 386 personal computer. Large tables, including the dictionary data and the inverted files of the h_i functions used in Algorithm 4, were stored in the extended memory of the computer. Because of the system overhead, the extended memory in the computer is effectively several times slower than the directly accessible memory.
The experimental results are given in FIGS. 15a-15e and FIGS. 16a-16e.
The following observations have been made from the experimental results:

1) Algorithms 0, 1, 2, and 3 are all easy to implement, and they all use little extra memory. Algorithm 1 is faster than Algorithm 0 only when the length of the given character string is small. Algorithm 2 and Algorithm 3 are much faster than Algorithm 0 and Algorithm 1.

2) For small r, E_algorithm_3(Z) is much less than E_algorithm_2(Z). But because Algorithm 2 has a simpler loop structure, the program implemented for Algorithm 3 is not much faster than the program for Algorithm 2. More sophisticated lower-level programming techniques must be used to implement Algorithm 3 to harness its advantage of visiting a much smaller number of H[i, j] entries.
3) Algorithm 4 is the fastest algorithm among the five. FIGS. 16a-16e show the relative speeds of the five algorithms for different / and r. To examine the results more closely, let T_i(/, r) be the time spent by Algorithm i, corresponding to string length / and error distance r in our experiment. T_i(28, 4) is equal to 253, 253, 0.15, 0.15, 0.07, and 0.08 seconds, for i equal to 0, 1, 2, 3, and 4, respectively. T_i(9, 1) is equal to 121, 121, 34, 23, and 0.33 seconds, for i equal to 0, 1, 2, 3, and 4, respectively. In all our experimental cases, Algorithm 4 finds nearest neighbors within seconds. These results show that, under our fault model, performing approximate string matching for large dictionaries in real time on an ordinary sequential computer is feasible.
4) The memory space required for storing the inverted files in Algorithm 4 is large, but it is affordable with current hardware technology. In our implementation, the dictionary data occupy 172K bytes, and the inverted files occupy 389K bytes.
In summary, we have presented a method for designing algorithms for approximate string matching. Among the five algorithms described, Algorithms 0, 1, 2, and 3 are simple and space efficient. Algorithm 2 and Algorithm 3 are relatively fast. Algorithm 4 is very fast but requires substantial memory. When the size of an application dictionary is small, Algorithm 2 and Algorithm 3 are good choices. When the dictionary has a large number of words, Algorithm 4 is the only choice to provide real-time performance.

Claims

CLAIMS:
1. A system for multiple errors spelling correction in a data processing system having a sequential digital storage medium for storing a large data base, comprising: a dictionary comprising a set of acceptable words in a universe stored in said data processing system; each word in said dictionary comprising a string of characters; said dictionary being partitioned according to the length of said strings of characters; means to receive a string Z for determining whether string Z is in said dictionary or is a misspelled word in said dictionary; means to match said string Z with strings in said dictionary to find the nearest neighbors of Z comprising: means to calculate the error distance between Z and all the words in said dictionary, wherein said error distance is the shortest sequential editing sequence operating from left to right to transform Z into said words; means to record words with minimal distance; means to limit the calculations of error distances by determining an upper bound on the length of words for which said calculations are made; means to use a string length partition to limit said calculations of error distances; means to use a cut-off criterion to limit said calculations of error distances; and means to limit further the search region by eliminating words at an error distance greater than the error distance in a neighborhood.
2. The system of claim 1 wherein said means to calculate the error distance between Z and all words in said dictionary for each word X comprises: means to transform string Z into a string X by a sequence of editing operations consisting of insert, delete, change and transpose operations; means to apply said editing operations sequentially on each character position of the string Z; means to select the shortest sequence of said editing operations to effect said transformation; and wherein the error distance is the number of said editing operations in said shortest sequence.
3. The system of claim 1 wherein said means to limit error distance calculations by determining an upper bound comprises: means to eliminate calculation of error distances between Z and words longer than two times the maximum of the length of Z and the minimal length of words in said dictionary.
4. The system of claim 1 wherein said string length partition means comprises: means to dynamically record a number d representing the smallest error distance currently found in a search; means to eliminate calculation of error distances for words having a length less than the length of Z minus d or greater than the length of Z plus d.
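The dynamic cut-off of claim 4 can be sketched as follows. `best_matches` and the stand-in `levenshtein` function are illustrative names, not the patent's implementation; any error-distance function could be supplied for `dist`:

```python
def levenshtein(a, b):
    # plain insert/delete/change distance, used here only as a
    # stand-in error-distance function for the sketch
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # change / match
        prev = cur
    return prev[-1]

def best_matches(z, words, dist=levenshtein):
    # d is updated dynamically to the smallest error distance found so
    # far; a word whose length differs from len(z) by more than d can
    # never beat d, so its full distance is not computed at all
    d, best = float("inf"), []
    for x in words:
        if abs(len(x) - len(z)) > d:
            continue  # cut off by the string length partition
        e = dist(z, x)
        if e < d:
            d, best = e, [x]
        elif e == d:
            best.append(x)
    return d, best
```

As d shrinks during the scan, ever fewer length classes of the partitioned dictionary need to be examined.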
5. The system of claim 1 wherein said means to limit further the search region comprises: means to construct an error distance matrix of all words in said dictionary; means to construct a neighborhood of Z in said dictionary comprising all words within a given error distance r of Z; means to construct a region of said neighborhood wherein the absolute value of the difference in string length between Z and any word in said neighborhood is less than or equal to said error distance r.
6. The system of claim 5 wherein said means to construct a region of a neighborhood comprises: means to construct an inverse image of said error distance matrix using a hash function.
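One way to read claims 5 and 6 is that the dictionary is pre-bucketed by a hash function, and the "inverse image" of the neighborhood of Z within radius r is the union of the buckets a word within distance r of Z could fall into. A sketch assuming the simplest possible hash, word length (so the region is exactly the length band of claim 5); `build_index` and `region` are illustrative names:

```python
from collections import defaultdict

def build_index(dictionary, h=len):
    # pre-bucket the dictionary by a hash function (here: word length)
    index = defaultdict(set)
    for word in dictionary:
        index[h(word)].add(word)
    return index

def region(index, z, r):
    # inverse image of the neighborhood of z within error distance r:
    # with h = len, any word within distance r of z has a length in
    # [len(z) - r, len(z) + r], so only those buckets are unioned
    candidates = set()
    for value in range(len(z) - r, len(z) + r + 1):
        candidates |= index.get(value, set())
    return candidates
```

The point of the construction is that the region is obtained by a handful of bucket lookups, without touching the rest of the dictionary.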
7. A method of finding the nearest neighbors of a string of letters among a set of possible words forming a dictionary from a universe of words, wherein each word is a linear string of characters, comprising the steps of: (i) creating a storage area for storing a set of words and initializing said area to a null value; (ii) defining an initial value of an error distance measure variable; (iii) computing a neighborhood of said string comprising all words within a given error distance from said string;
(iv) calculating a subregion of said dictionary including words in both said neighborhood of said string and said dictionary; (v) for each word in said subregion, computing an error distance measure and storing only those words in said storage area with an error distance measure equal to said variable; and (vi) incrementing said variable and repeating steps (iii), (iv), and (v) until said storage area contains at least one word, wherein said stored words represent the nearest neighbors.
8. The method of claim 7 wherein said step of calculating a subregion includes the steps of: defining said neighborhood as a union of a group of individual sub-neighborhoods; assigning to each sub-neighborhood a mapping function which maps the words in each sub-neighborhood to a respective integer range; inverse mapping each integer range back into said universe to create a respective inverse image neighborhood; and creating a union set comprising all of the words in said inverse image neighborhoods; wherein the common words between said union set and said dictionary form said subregion.
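Steps (i) through (vi) of claim 7 can be sketched end to end. This is an illustrative reading, not the patent's implementation: the error-distance radius grows from zero, the string-length test serves as the cheap neighborhood/subregion computation, and the search stops at the first radius that yields any words, which are then the nearest neighbors. `_dist` and `nearest_neighbors` are assumed names:

```python
def _dist(z, x):
    # restricted Damerau-Levenshtein: insert, delete, change, transpose
    m, n = len(z), len(x)
    d = [[max(i, j) if i * j == 0 else 0 for j in range(n + 1)]
         for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = z[i - 1] != x[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and z[i - 1] == x[j - 2]
                    and z[i - 2] == x[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

def nearest_neighbors(z, dictionary):
    # (ii) start with radius r = 0; (iii)/(iv) the length test picks out
    # the subregion that can contain words at distance r; (v) keep words
    # at exactly distance r; (vi) grow r until something is found
    r = 0
    while True:
        found = [x for x in dictionary
                 if abs(len(x) - len(z)) <= r and _dist(z, x) == r]
        if found:
            return r, found
        r += 1
```

Because the radius is incremented one unit at a time, the first non-empty result is guaranteed to hold all words at the minimal error distance.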
EP19920904493 1990-12-31 1991-12-30 Very fast approximate string matching algorithms for multiple errors spelling correction Withdrawn EP0519062A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US636640 1984-08-01
US63664090A 1990-12-31 1990-12-31

Publications (2)

Publication Number Publication Date
EP0519062A1 EP0519062A1 (en) 1992-12-23
EP0519062A4 true EP0519062A4 (en) 1993-12-29

Family

ID=24552735

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19920904493 Withdrawn EP0519062A4 (en) 1990-12-31 1991-12-30 Very fast approximate string matching algorithms for multiple errors spelling correction

Country Status (4)

Country Link
EP (1) EP0519062A4 (en)
JP (1) JPH05505270A (en)
CA (1) CA2076526A1 (en)
WO (1) WO1992012493A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6898185B1 (en) * 1999-10-20 2005-05-24 Broadcom Corporation Diagnostics of cable and link performance for a high-speed communication system
WO2010114478A1 (en) * 2009-03-31 2010-10-07 Azimuth Intellectual Products Pte Ltd Apparatus and methods for analysing goods cartons
EP2284653A1 (en) * 2009-08-14 2011-02-16 Research In Motion Limited Electronic device with touch-sensitive display and method of facilitating input at the electronic device
CN116522164B (en) * 2023-06-26 2023-09-05 北京百特迈科技有限公司 User matching method, device and storage medium based on user acquisition information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4783758A (en) * 1985-02-05 1988-11-08 Houghton Mifflin Company Automated word substitution using numerical rankings of structural disparity between misspelled words & candidate substitution words
EP0389271A2 (en) * 1989-03-24 1990-09-26 International Business Machines Corporation Matching sequences of labels representing input data and stored data utilising dynamic programming

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4771385A (en) * 1984-11-21 1988-09-13 Nec Corporation Word recognition processing time reduction system using word length and hash technique involving head letters
JPS63198154A (en) * 1987-02-05 1988-08-16 インタ−ナショナル・ビジネス・マシ−ンズ・コ−ポレ−ション Spelling error corrector

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4783758A (en) * 1985-02-05 1988-11-08 Houghton Mifflin Company Automated word substitution using numerical rankings of structural disparity between misspelled words & candidate substitution words
EP0389271A2 (en) * 1989-03-24 1990-09-26 International Business Machines Corporation Matching sequences of labels representing input data and stored data utilising dynamic programming

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
H. TAKAHASHI ET AL.: "A spelling correction method and its application to an OCR system", PATTERN RECOGNITION, vol. 23, no. 3/4, 1990, HEAD. HILL HALL, OXFORD, GB, pages 363 - 377 *
O. OWOLABI & D.R. MACGREGOR: "Fast Approximate String Matching", SOFTWARE - PRACTICE & EXPERIENCE., vol. 18, no. 4, April 1988 (1988-04-01), CHICHESTER, SUSSEX, GB, pages 387 - 393 *
P.A.V. HALL & G.R. DOWLING: "Approximate String Matching", COMPUTING SURVEYS, vol. 12, no. 4, December 1980 (1980-12-01), pages 381 - 402 *
R. LOWRANCE & R.A. WAGNER: "An Extension of the String-to-String Correction Problem", JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY, vol. 22, no. 2, April 1975 (1975-04-01), pages 177 - 183 *
See also references of WO9212493A1 *

Also Published As

Publication number Publication date
WO1992012493A1 (en) 1992-07-23
EP0519062A1 (en) 1992-12-23
CA2076526A1 (en) 1992-07-01
JPH05505270A (en) 1993-08-05

Similar Documents

Publication Publication Date Title
Blumer et al. Complete inverted files for efficient text retrieval and analysis
JP3077765B2 (en) System and method for reducing search range of lexical dictionary
Boytsov Indexing methods for approximate dictionary searching: Comparative analysis
Czech et al. Perfect hashing
JP3581652B2 (en) Data retrieval system and method and its use in search engines
Jokinen et al. A comparison of approximate string matching algorithms
US5895446A (en) Pattern-based translation method and system
US5768423A (en) Trie structure based method and apparatus for indexing and searching handwritten databases with dynamic search sequencing
Hodge et al. A comparison of standard spell checking algorithms and a novel binary neural approach
JPH05290082A (en) Translater based on pattern
Du et al. An approach to designing very fast approximate string matching algorithms
US5553284A (en) Method for indexing and searching handwritten documents in a database
Gog et al. Fast and lightweight LCP-array construction algorithms
Amir et al. Managing unbounded-length keys in comparison-driven data structures with applications to online indexing
Gawrychowski et al. Improved bounds for shortest paths in dense distance graphs
Loukides et al. Bidirectional string anchors: A new string sampling mechanism
Andersson et al. Suffix trees on words
Adebiyi et al. An efficient algorithm for finding short approximate non-tandem repeats
EP0519062A4 (en) Very fast approximate string matching algorithms for multiple errors spelling correction
SE513248C2 (en) Method for managing data structures
KR20230170891A (en) In-memory efficient multistep search
Dewar The SETL programming language
Grana et al. Compilation methods of minimal acyclic finite-state automata for large dictionaries
US20120054196A1 (en) System and method for subsequence matching
He et al. Path and ancestor queries over trees with multidimensional weight vectors

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE FR SE

17P Request for examination filed

Effective date: 19930113

A4 Supplementary search report drawn up and despatched

Effective date: 19931111

AK Designated contracting states

Kind code of ref document: A4

Designated state(s): DE FR SE

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 19960702