WO1992012493A1 - Very fast approximate string matching algorithms for multiple errors spelling correction

Very fast approximate string matching algorithms for multiple errors spelling correction

Publication number
WO1992012493A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
string
dictionary
neighborhood
error distance
Prior art date
Application number
PCT/US1991/009756
Other languages
English (en)
Inventor
Min-Wen Du
Shih-Chio Chang
Original Assignee
Gte Laboratories Incorporated
Priority date
Filing date
Publication date
Application filed by GTE Laboratories Incorporated
Priority to CA002076526A (patent CA2076526A1)
Priority to JP92504399A (patent JPH05505270A)
Publication of WO1992012493A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Definitions

  • This invention pertains generally to the field of data processing, and in particular to the approximate string matching problem, in which a search is made for those words which most closely resemble a given character string from a set of possible words which may or may not include the string.
  • The invention is utilized in program error correction, text editing in word processing and information retrieval from a data base.
  • The approximate string matching problem and algorithms proposed or used for its solution in various contexts are well known in the prior art, having been discussed in the literature at least as early as 1970.
  • The approximate string matching (ASM) problem may be stated as: search for those words that most closely "resemble" a given character string from a set of possible words (dictionary).
  • The given string may or may not be in the dictionary.
  • Word resemblance is generally measured by a distance function defined between two strings. For example, the minimal number of editing operations, including insert, delete, change a character, and transpose two adjacent characters, to change one string to another string is a natural and commonly used distance measure between two strings. Therefore, the problem may also be stated as: find the nearest neighbors of a given character string among a set of possible words.
  • In program error correction, the dictionary usually consists of the set of reserved keywords and the set of variable and function names defined by the user. In text editing, the dictionary is the set of accepted words of the language. In information retrieval, the dictionary is the set of searching keys in the database.
  • Errors may be introduced in various stages in information processing. For instance, in an airline reservation system, a traveller's name is very easily misspelled. Since information is often conveyed by telephone conversations and, furthermore, since international names often lack a standard spelling, errors are unavoidable. They may exist in both the searching keys (names) and the database. Approximate string matching techniques make it possible to retrieve partially incorrect records with partially incorrect searching keys.
  • Approximate string matching techniques can greatly improve the man-machine interface design in today's interactive computer environment. If a character string entered by a user is incorrect, it would be desirable for the system to guess the word and let the user verify it. Alternatively, it would be desirable for the system to present several possibilities and let the user select the correct one. This option may make the system much more user friendly.
  • Four spelling errors are the most common: insert, delete, change a character, and transpose two adjacent characters, as reported in F.J. Damerau, "A Technique for Computer Detection and Correction of Spelling Errors," Comm. ACM 7, 3, pp. 171-176, March 1964, and H.L. Morgan, "Spelling Correction in Systems Programs," Comm. ACM 13, 2, pp. 90-94, Feb. 1970. In almost all earlier approaches, the fault models assume only single errors. However, such an assumption is generally inadequate. For example, current programming practice encourages longer variable and function names to enhance program readability and maintainability. Longer names invite multiple errors.
  • The time required for the calculation is equal to k × log2(p) × K(m, r).
  • The previously discussed three approaches are either memory efficient but require too long a time to find nearest neighbors, or time efficient but require an excessively large memory to implement an indexing mechanism.
  • A practical approach should fit between these two extremes; i.e., use enough, but not too much, memory to build up an indexing mechanism so that nearest neighbors can be found within seconds.
  • The present invention provides a system for multiple errors spelling correction in a data processing system having a sequential digital storage media for storing a large data base, comprising: a dictionary comprising a set of acceptable words in a universe stored in said data processing system; each word in said dictionary comprising a string of characters; said dictionary being partitioned according to the length of said strings of characters; means to receive a string Z for determining whether string Z is in said dictionary or is a misspelled word in said dictionary; means to match said string Z with strings in said dictionary to find the nearest neighbors of Z, comprising: means to calculate the error distance between Z and all the words in said dictionary, wherein said error distance is the shortest sequential editing sequence operating from left to right to transform Z into said words; means to record words with minimal distance; means to limit the calculations of error distances by determining an upper bound on the length of words for which said calculations are made; means to use a string length partition to limit said calculations of error distances; means to use a cut-off criterion to limit said calculations of error distances; and means to limit further the search region by eliminating words at an error distance greater than the error distance in a neighborhood.
  • FIG. 1 is a block diagram showing the hardware and operating software systems on which embodiments of the present invention have been implemented;
  • FIG. 2 is an information processing flow chart for indexing the full text data base input used in the embodiments of the present invention;
  • FIG. 3 is an information processing flow chart of the query process for information retrieval from the full text data bases of FIG.
  • FIG. 4 is a flow chart of one embodiment of the adaptive ranking system of the present invention showing the record weight determination at a given level;
  • FIG. 5 is a diagram illustrating the length of an editing sequence and the cost of an S-trace;
  • FIG. 6 is a graph of the distribution of word lengths in three dictionaries;
  • FIGS. 7a, 7b and 7c present the distribution of distance between words in three dictionaries;
  • FIG. 8 is a diagram illustrating the order of calculation of an error distance matrix;
  • FIG. 9 is a mapping diagram for constructing a limited set of nearest neighbors;
  • FIG. 10 is a derivation tree for all strings within a radius one of
  • FIG. 11 illustrates the covering problem for finding hash functions
  • FIG. 12 illustrates a hash function selection for a finite number of dummy characters
  • FIG. 13 illustrates a covering table for constructing covers of deviation vectors
  • FIG. 14 is a flow chart of an embodiment of the elastic string-matching algorithm of the present invention
  • FIGS. 15a-15e are plots of experimental results measuring the performance of five algorithms for error distances ranging from
  • FIGS. 16a-16e are plots of experimental results measuring the time for execution of the five algorithms of FIGS. 15a-15e.
  • This invention pertains to very fast algorithms for approximate string matching in a dictionary. Multiple spelling errors of insert, delete, change and transpose operations on character strings are considered in the disclosed fault model.
  • FIG. 1 is a block diagram of the hardware and operating systems environment for an experimental information retrieval system designated by the acronym FAIRS and partially disclosed in "And-less Retrieval: Toward Perfect Ranking," by S.-C. Chang and W. C. Chen, Proc. ASIS Annual Meeting, 1987, Oct. 1987, pp. 30-35, and also partially disclosed in "Towards a Friendly Adaptable Information Retrieval System," Proc. RIAO 88, Mar. 1988, pp. 172-182.
  • FAIRS operates on a variety of computer systems each using its own operating system. The principal feature of all the systems is the massive data storage devices indicated by reference number 12.
  • FIG. 2 is a flow chart showing the information processing flow for inputting a full text data base and indexing the data base in a
  • Original text files 21 are read into storage 12 as is, with the user optionally specifying record markers, each file being named and having .TXT as an extension to its file name.
  • The user also describes his files to the system 22, providing a list of his files with .SRS as the extension, the configuration of his files with .CFG as extension, and additional new files with .NEW as extension.
  • The user also provides a negative dictionary 23 (.NEG) of words not to be indexed.
  • The inputs 21, 22, 23 are processed by an adaptive information reader/parser 24 under the FAIRS program. As part of the process, an INDEX builder 25 produces the index files 26 necessary for retrieval.
  • FIG. 3 is an information processing flow chart for retrieving information from the files inputted into the system through queries.
  • A user query 31 is enhanced 32 by checking it for spelling variation 33 and synonym definitions 34.
  • The index files 26 are used to search 35 for records containing the query terms.
  • The records found in the search are ranked 36 according to ranking rules 37.
  • The original files 21 are then displayed 38 for user feedback. At this point the user can feed back relevance information 39a to refine the search, or accept the retrieved text records 39b and transfer them to other media for further use.
  • FIG. 4 has been described in the cross-referenced application, and is not directly pertinent to this invention.
  • The present invention pertains directly to the spelling check of the queries and the enhancement of those queries in this information flow. It also has wide application in other areas. In particular, it pertains to very fast algorithms for approximate string matching in a dictionary. Multiple spelling errors of insert, delete, change and transpose operations on character strings are considered in the disclosed fault model.
  • A dictionary is a set of character strings constructed from a character set Σ.
  • A character string in the dictionary will be called a word.
  • X = X[1], X[2], ..., X[m]: a string of characters from Σ.
  • Y = Y[1], Y[2], ..., Y[n]: a second string of characters from Σ.
  • Z[i:j] = Z[i], Z[i+1], ..., Z[j]: an array with indices from i to j.
  • H[i, j] is used to represent the distance between X[1:i] and Y[1:j]. H will be called the distance matrix between X and Y.
  • a-b-c-...-r: a sequence of elements. When there is only one element in the sequence, we write -a-.
  • /_Z: the length of string Z.
  • n_S: the size of set S.
  • S*: the Kleene closure of a character set S. The string universe U is equal to Σ*.
  • N(Z, r): the neighborhood of string Z within distance (radius) r.
  • The following editing operations on character strings will be considered.
  • I(i, s): Insert a character s between the (i-1)th and the ith characters of a string.
  • D(i): Delete the character at the ith position.
  • C(i, s): Change the ith character to s.
  • T(i): Transpose the characters at positions i and i+1.
  • The change editing operation defined here may change a character to itself. This deviates from the definition of the traditional change operation in which a character has to be changed to a different one. Defining the change operation in this new way
  • Definition 1 An editing operation of insert, delete, change and transpose is proper if it can be carried out.
  • An editing sequence E[1:k] on a character string is a sequence consisting of proper editing operations.
  • Each editing operation E[j] is associated with an index, the position on the string where E[j] is acting.
  • The index sequence of E[1:k] is the sequence of position indices associated with the editing sequence E[1:k].
  • The editing sequence D(3)D(4)I(5,o)C(8,s) transforms the word "jeopardize" into the incorrect spelling "jeprodise".
  • T(2) transforms "deuce" into "duece".
  • The index sequence of the former editing sequence is 3-4-5-8, while that of the latter is -2-.
  • T(5) is not an editing sequence on "deuce" because it cannot be carried out.
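  • The four editing operations and the properness rule of Definition 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tuple encoding of operations and the names apply_op and apply_sequence are assumptions made here, and an insert at position /_Z+1 (appending) is assumed to be proper.

```python
def apply_op(s, op):
    """Apply one editing operation to s, using 1-based positions.

    op is ("I", i, ch), ("D", i), ("C", i, ch) or ("T", i).
    Raises ValueError when the operation is not proper.
    """
    kind, i = op[0], op[1]
    n = len(s)
    if kind == "I":  # insert ch between the (i-1)th and ith characters
        if not 1 <= i <= n + 1:
            raise ValueError(f"improper insert at {i}")
        return s[:i - 1] + op[2] + s[i - 1:]
    if kind == "D":  # delete the ith character
        if not 1 <= i <= n:
            raise ValueError(f"improper delete at {i}")
        return s[:i - 1] + s[i:]
    if kind == "C":  # change the ith character (possibly to itself)
        if not 1 <= i <= n:
            raise ValueError(f"improper change at {i}")
        return s[:i - 1] + op[2] + s[i:]
    if kind == "T":  # transpose the ith and (i+1)th characters
        if not 1 <= i <= n - 1:
            raise ValueError(f"improper transpose at {i}")
        return s[:i - 1] + s[i] + s[i - 1] + s[i + 1:]
    raise ValueError(f"unknown operation {kind!r}")


def apply_sequence(s, ops):
    """Apply an editing sequence from first to last operation."""
    for op in ops:
        s = apply_op(s, op)
    return s
```

  • With this encoding, the two examples above are written [("D", 3), ("D", 4), ("I", 5, "o"), ("C", 8, "s")] and [("T", 2)], while ("T", 5) on "deuce" is rejected as improper.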
  • Definition 2 The editing distance between two strings X and Y is the length of the shortest editing sequence that transforms X into Y.
  • A trace T from a string X to a string Y is the union of two sets, identity set I and changing set C, of number pairs (i, j), where 1 ≤ i ≤ /_X, 1 ≤ j ≤ /_Y, such that
  • Two lines (i1, j1) and (i2, j2) cross each other if i1 < i2 but j1 > j2, or i2 < i1 but j2 > j1. If (i, j) is in T, X[i] and Y[j] are said to be incident to that line.
  • Every editing sequence results in a trace, and every trace corresponds to at least one editing sequence.
  • A trace T = I ∪ C from X to Y is a restricted trace (R-trace) if
  • Condition a) in Definition 5 states that only lines in I can cross in a restricted trace.
  • Condition b) states further that no line crosses more than one line.
  • A cross in a constrained (restricted) trace can be considered as an aggregate of a series of transpose, insert and delete operations.
  • E[1:n] is a linear editing sequence on a character string if it is an editing sequence and its index sequence is non-decreasing.
  • Definition 6 requires a linear editing sequence to operate on the string from left to right, with each insert and change operation fixing one character, and each transpose operation fixing two consecutive characters.
  • The sequence D(2)T(1)C(3,r), which transforms "testing" into "string", is an editing sequence but not a linear editing sequence, because the index sequence 2-1-3 is not non-decreasing.
  • This transformation can be performed by a linear editing sequence D(1),D(1),I(3,r), with a non-decreasing index sequence 1-1-3.
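  • The index-sequence test in this example can be sketched in a few lines. The sketch checks only the non-decreasing condition stated above; the tuple encoding of operations (kind, position, optional character) and the function names are assumptions made for illustration.

```python
def index_sequence(ops):
    """Positions on which the operations of an editing sequence act."""
    return [op[1] for op in ops]


def has_nondecreasing_index(ops):
    """True if the index sequence never steps back to the left."""
    idx = index_sequence(ops)
    return all(a <= b for a, b in zip(idx, idx[1:]))


# The two sequences from the text, both taking "testing" to "string".
non_linear = [("D", 2), ("T", 1), ("C", 3, "r")]   # index sequence 2-1-3
linear = [("D", 1), ("D", 1), ("I", 3, "r")]       # index sequence 1-1-3
```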
  • Definition 7 The error distance, or the number of spelling errors, from a character string X to a character string Y is the minimal length of the linear editing sequences that transform X into Y.
  • A trace T = I ∪ C from X to Y is a linear trace
  • {(3,1), (4,2), (5,4), (6,5), (7,6)} is an L-trace from "testing" to "string".
  • Both {(1,1), (4,4), (5,5)} and {(1,1), (4,4), (5,5)} ∪ {(2,3), (3,2)} are L-traces from "deuce" to "duece".
  • Theorem 1 The minimal cost of L-traces between two strings X and Y is equal to the error distance between X and Y, which is the number of spelling errors from X to Y.
  • A linear editing sequence of the L-trace is:
  • Let H denote the error distance matrix between two character strings X and Y, i.e., H[i, j] is the error distance between X[1:i] and Y[1:j].
  • H[i, j]: the error distance between X[1:i] and Y[1:j].
  • The following theorem calculates the error distance matrix H.
  • H[i, -1] = bound for -1 ≤ i ≤ m
  • H[-1, j] = bound for -1 ≤ j ≤ n
  • H[i, 0] = i for 0 ≤ i ≤ m
  • H[0, j] = j for 0 ≤ j ≤ n.
  • H[i, j] of the distance matrix H[1:m, 1:n] between X and Y can be calculated recursively as
  • Theorem 3 The matrix H[0:m, 0:n] defined by Formula 1 satisfies the following properties: a) H[i, j] - 1 ≤ H[i+1, j] ≤ H[i, j] + 1 for all 0 ≤ i < m, 0 ≤ j ≤ n; b) H[i, j] - 1 ≤ H[i, j+1] ≤ H[i, j] + 1 for all 0 ≤ i ≤ m, 0 ≤ j < n; c) H[i, j] ≤ H[i+1, j+1] ≤ H[i, j] + 1 for all 0 ≤ i < m, 0 ≤ j < n.
  • H[i+1, j+1] = min{H[i-1, j-1], H[i+1, j], H[i, j+1]} + 1 if both X[i]
  • H[i+1, j+1] = min{H[i, j], H[i+1, j], H[i, j+1]} + 1 in all other cases.
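  • A distance matrix of this kind can be filled in with a short dynamic program. The sketch below is an assumption-laden illustration rather than Formula 1 itself: it uses the familiar restricted edit distance (insert, delete, change, transpose of adjacent characters), takes H[i, j] unchanged along the diagonal when the current characters match, and replaces the "bound" border at index -1 with an explicit guard. The function name error_distance is chosen here.

```python
def error_distance(x, y):
    """Restricted edit distance between x and y (a sketch of H[m, n])."""
    m, n = len(x), len(y)
    # H[i][j] is the distance between the prefixes x[:i] and y[:j].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i
    for j in range(n + 1):
        H[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                H[i][j] = H[i - 1][j - 1]        # characters agree, no new error
                continue
            best = min(H[i - 1][j - 1],          # change
                       H[i][j - 1],              # insert
                       H[i - 1][j])              # delete
            if (i > 1 and j > 1
                    and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]):
                best = min(best, H[i - 2][j - 2])  # transpose two neighbors
            H[i][j] = best + 1
    return H[m][n]
```

  • For the strings used earlier, this gives 3 for "testing" and "string" (two deletes and one insert) and 1 for "deuce" and "duece" (one transpose).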
  • Corollary 3 Let d be the error distance between two character strings X and Y. Then /_X-d ⁇ /_Y ⁇ /_X+d.
  • Corollary 2 gives a simple upper bound on the error distance between two strings.
  • Corollary 3 is the string length partition criterion ordinarily used to save computation in nearest neighbor searching of character strings in the prior art.
  • Word lengths in dictionaries are generally small, as shown in FIG. 6. Therefore, simple instead of complex algorithms should be used in the distance calculation. Sophisticated distance calculation algorithms generally have large time constants and are good only for long strings.
  • 2) The shape of the distributions of word length in the three dictionaries which we studied is bell-like, i.e., there are many fewer words with either small or large word length than words with medium word length. This implies that words with small or large word length can be treated separately without greatly affecting either the average performance or the worst case performance of ASM (approximate string matching) algorithms. Treating long words separately is especially beneficial because such words have huge neighborhoods, as discussed before.
  • 3) Although words in dictionaries are not random, FIGS. 7a, 7b and 7c show that they do not cluster together either. This phenomenon may be partially attributed to the fact that the alphabet size in use is generally much larger than the length of an average word in the dictionary. From FIGS.
  • This upper bound defined by Corollary 2 can be used immediately to decrease the number of words to be compared. That number can be further cut down by the string length partition criterion discussed above in Corollary 3 because, by dynamically recording a number d, the smallest distance currently found, there is no need to compare those words with length less than /_Z-d or greater than /_Z+d.
  • The best strategy to use this property is to search through word groups in which the difference between their word lengths and /_Z is equal to 0, 1, etc., until a neighbor or neighbors are found.
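  • The search order just described can be sketched as follows: the dictionary is grouped by word length, groups are visited in order of increasing length difference from Z, and the scan stops once the difference exceeds the smallest distance d found so far. The function names, the infinite starting value for d, and the restricted edit distance used as a stand-in for the error distance are all assumptions of this sketch.

```python
from collections import defaultdict


def error_distance(x, y):
    """Restricted edit distance (insert, delete, change, adjacent transpose)."""
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i
    for j in range(n + 1):
        H[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                H[i][j] = H[i - 1][j - 1]
                continue
            best = min(H[i - 1][j - 1], H[i][j - 1], H[i - 1][j])
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, H[i - 2][j - 2])
            H[i][j] = best + 1
    return H[m][n]


def partition_by_length(words):
    """Group dictionary words by their length."""
    groups = defaultdict(list)
    for w in words:
        groups[len(w)].append(w)
    return groups


def nearest_neighbors(z, groups):
    """Return (d, words) where words are all dictionary words at minimal distance d."""
    best, found = float("inf"), []
    if not groups:
        return best, found
    max_diff = max(abs(length - len(z)) for length in groups)
    for diff in range(max_diff + 1):
        if diff > best:      # lengths outside [len(z) - d, len(z) + d] cannot win
            break
        for length in sorted({len(z) - diff, len(z) + diff}):
            for w in groups.get(length, []):
                d = error_distance(z, w)
                if d < best:
                    best, found = d, [w]
                elif d == best:
                    found.append(w)
    return best, found
```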
  • Another simple rule, according to Theorem 4 above, to make the nearest neighbor searching more efficient is the cut-off criterion for the distance calculation, because it can tell, during the calculation, whether the distance is larger than a pre-specified quantity. This property is useful because when the error distance between the given string and its neighbors in the dictionary is small, which is usually the case, we can avoid the calculation of most of the entries in the error distance matrices between the given string and the words in the dictionary.
  • the entries on an error distance matrix must be calculated in a particular order, as shown in FIG. 8; here we assume that /_X ⁇ /_Y.
  • the H value on the cut-off path is obtained and compared to the error distance r of the current neighborhood. If that H value is smaller than r, we calculate another layer. If the layer is the last one, indicating that a nearest neighbor has been found, we record the word and continue to find all the words with distance equal to the current distance r.
  • Algorithm 1 (Cut by upper bound of distance: Corollary 2) 0. Given string Z.
  • Algorithm 2 (Cut by current upper bound of distance: Corollary 3)
  • Algorithm 3 modifies Algorithm 2. It calls a subroutine error_dist(X, Y, r), which finds the error distance between two character strings X and Y if the distance is no greater than r. If the distance is found to be greater than r during the calculation, error_dist will suspend and return -t, where t is the number of layers calculated, applying the cut-off criterion of Theorem 4.
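  • A cut-off of this kind can be sketched with a row-by-row calculation that gives up as soon as every entry of the current row exceeds r. This is a simplified stand-in, not the layered order of FIG. 8: it relies on the diagonal monotonicity stated in Theorem 3 c), uses the restricted edit distance as the error distance, and returns None instead of -t when the bound is exceeded. All of these choices are assumptions of the sketch.

```python
def error_dist(x, y, r):
    """Return the distance between x and y if it is <= r, otherwise None."""
    m, n = len(x), len(y)
    if abs(m - n) > r:                 # length difference already exceeds r
        return None
    prev2 = None                       # row i-2, needed for transposes
    prev = list(range(n + 1))          # row 0: H[0][j] = j
    for i in range(1, m + 1):
        cur = [i] + [0] * n            # H[i][0] = i
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1]
                continue
            best = min(prev[j - 1], cur[j - 1], prev[j])
            if (prev2 is not None and j > 1
                    and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]):
                best = min(best, prev2[j - 2])
            cur[j] = best + 1
        if min(cur) > r:               # no later entry can drop back to r
            return None
        prev2, prev = prev, cur
    return prev[n] if prev[n] <= r else None
```

  • Because later rows never fall below the minimum of an earlier row, stopping at the first row whose minimum exceeds r skips the remaining rows without changing the answer for distances up to r.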
  • Algorithm 3 (Cut by cut-off criterion: Theorem 4) 0. Given string Z.
  • V_alg(Z): the set of error distance matrix (H[i, j]) entries visited by algorithm alg in searching for the nearest neighbors of Z.
  • E_alg(Z): the total number of times the error distance matrix (H[i, j]) entries are computed by algorithm alg in searching for the nearest neighbors of Z.
  • v(w_i) is the set of H[i, j] entries visited by algorithm alg when comparing the given string Z and a word w_i in the dictionary.
  • Algorithms 1, 2, and 3 described above each successively cut down the number of H[i, j] entries visited by its predecessor. This number of H[i, j] entries can be reduced further so that the algorithm can be speeded up.
  • For algorithm_0, algorithm_1 and algorithm_2, E_alg(Z) is equal to the size of V_alg(Z).
  • Because algorithm_3 does not store the intermediate calculation result each time it suspends the calculation of a word in the dictionary, entry values on distance matrices may be recomputed several times. Therefore, E_algorithm_3(Z) may be greater than
  • V_algorithm_0(Z) ⊇ V_algorithm_1(Z) ⊇ V_algorithm_2(Z) ⊇ V_algorithm_3(Z).
  • Algorithm 4 (Cut by cut-off criterion in Theorem 4 and limiting the searching region)
  • R(Z, r) = DICT ∩ (∪_i h_i^-1(h_i(N_i))), or equivalently in
  • Algorithm 4 implements statements (2) and (3) above.
  • N(Z, r) is itself a simple representation of the neighborhood of Z with error distance r, but it is too abstract to be useful here.
  • Definition 11 Assume that the symbol X itself is not in Σ.
  • A string with (dummy) symbol X is any string in (Σ ∪ {X})*.
  • N("test", 1) = Xtest ∪ est ∪ Xest ∪ etst ∪ tXest ∪ tst ∪ tXst ∪ tset ∪ teXst ∪ tet ∪ teXt ∪ tets ∪ tesXt ∪ testX.
  • A derivation tree for N(Z, r) for r greater than one is constructed similarly by letting the number of errors from the root of the tree to any terminal node be exactly equal to r. Note that the Change editing operation causes any string with error distance less than r to be included in the enumeration. Thus, any N(Z, r) can be represented by the terminal nodes on the derivation tree. The number of those nodes on the derivation tree is much smaller than the size of the neighborhood N(Z, r).
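  • The radius-one case can be sketched by applying each editing operation once at every position and writing the dummy symbol X wherever an arbitrary character may appear. The function names are assumptions; because the sketch enumerates every position mechanically, it may produce patterns beyond those printed in the list above.

```python
def radius_one_patterns(z, dummy="X"):
    """Strings with dummy symbol that describe one edit applied to z."""
    n = len(z)
    patterns = set()
    for i in range(n + 1):
        patterns.add(z[:i] + dummy + z[i:])                 # insert at gap i
    for i in range(n):
        patterns.add(z[:i] + z[i + 1:])                     # delete character i
        patterns.add(z[:i] + dummy + z[i + 1:])             # change character i
    for i in range(n - 1):
        patterns.add(z[:i] + z[i + 1] + z[i] + z[i + 2:])   # transpose i, i+1
    return patterns


def matches(pattern, word, dummy="X"):
    """True if word is one of the strings described by pattern."""
    return len(pattern) == len(word) and all(
        p == dummy or p == c for p, c in zip(pattern, word))
```

  • For example, the pattern Xest describes both "rest" and "best", which is why a single terminal node of the tree can stand for many strings of the neighborhood.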
  • A set of deviation vectors {V_i} with dummy X is a covering scheme of a set of strings, N, if every string in N is a member of at least one of the strings with dummy X derived from V_i.
  • {V_i} covers N if it is a covering scheme of N.
  • Each string with dummy X on a terminal node of a neighborhood derivation tree represents a set of strings in the neighborhood.
  • Several such strings with dummy X can be covered in turn by a (larger) string with dummy X derived from a deviation vector.
  • A deviation vector of a string Z can be considered as a super-cover of strings in the neighborhood of Z. Any neighborhood can be covered by a set of deviation vectors or super-covers.
  • N("test", 1) is covered by the following set of deviation vectors:
  • The deviation vectors have the nice feature that they specify only the positions from which to extract characters from a string Z, and (implicitly) the positions in the vector to put those characters.
  • If a set of deviation vectors is a covering scheme of a neighborhood N(Z, r), it also covers any other N(Z', r) as long as /_Z' is equal to /_Z.
  • R(Z, r) based on deviation vectors.
  • Let N(Z, r) be covered by the set of deviation vectors {V_i}, and each V_i derives S_i, a string with dummy X.
  • The R(Z, r) scheme consists of two structures: sets of deviation vectors for covering neighborhoods, and a set of h_i functions to calculate the mappings and inverse mappings.
  • This h_i function partitions the set of words of length /_V in the dictionary into
  • h_i maps all the strings of a string with dummy X to a single value.
  • The following example illustrates how to calculate R(Z, r) for a given string Z and a small distance r. The calculation procedure solves Problem 2, posed previously.
  • The error distances between the given string "rest" and the words "test", "best" and "mess" are 1, 1, and 2, respectively. So the nearest neighbors of "rest" are "test" and "best".
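  • The idea of restricting the search to a region R(Z, r) can be sketched for this small example. Everything specific below is an assumption made for illustration: the three deviation vectors were chosen by hand so that they cover the change and transpose errors of a four-character string at radius 1 (insert and delete are left out), each h function is taken to be simply the characters at the non-dummy positions of its vector, and the names derive, h_key, build_index and search_region are hypothetical.

```python
from collections import defaultdict

DUMMY = "X"

# Hand-picked vectors for length-4 words, radius 1, same-length neighbors only.
VECTORS = [(1, 2, DUMMY, DUMMY), (DUMMY, DUMMY, 3, 4), (1, DUMMY, DUMMY, 4)]


def derive(vector, z):
    """String with dummy X derived from a deviation vector and a string z."""
    return "".join(DUMMY if v == DUMMY else z[v - 1] for v in vector)


def h_key(vector, word):
    """Characters of word at the non-dummy positions of the vector."""
    return "".join(word[p] for p, v in enumerate(vector) if v != DUMMY)


def build_index(words, vectors):
    """Map (vector number, key) to the dictionary words hashing there."""
    index = defaultdict(set)
    for w in words:
        for k, vec in enumerate(vectors):
            if len(w) == len(vec):
                index[(k, h_key(vec, w))].add(w)
    return index


def search_region(z, index, vectors):
    """Union of the buckets selected by z: the candidate words to verify."""
    region = set()
    for k, vec in enumerate(vectors):
        pattern = derive(vec, z)            # e.g. "XXst" for "rest"
        region |= index.get((k, h_key(vec, pattern)), set())
    return region
```

  • With the dictionary {"test", "best", "mess"} and Z = "rest", only the bucket keyed by "st" is non-empty, so the region contains "test" and "best" and the word "mess" is never compared. A region may still contain false hits, so each candidate would be verified with an error distance calculation.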
  • FIG. 11 interprets this problem as a covering problem, with the length of deviation vectors equal to 5.
  • The figure depicts a covering table, with each row representing a candidate h_i function that selects two positions in a string mapping, and each column representing a possible string with dummy X derived from a deviation vector with exactly two X symbols.
  • The number of columns in the covering table for covering deviation vectors of length m with r X symbols is equal to the binomial coefficient C(m, r).
  • The covering table becomes very large and a minimal cover will be difficult to find. In practical applications, a good cover of the table that may
  • Another consideration in designing h_i functions is how many characters should be chosen from a string for function value calculation. We have not obtained either theoretical or experimental results to answer this question yet. It is not difficult to conceive the following dilemma: the more characters chosen for calculating h_i functions, the smaller the inverse subregions will be; thus the number of false hits will be reduced for each h_i function; but in that case, larger sets of deviation vectors need be used for covering neighborhoods. Also, the more characters chosen for calculating h_i functions, the more h_i functions need be provided in the whole mapping mechanism. This is a typical time-space trade-off problem. In practical design, some tuning may be required.
  • A covering table can be used, as shown in FIG. 13.
  • Row [i, j] covers a column of a string with dummy X if the ith and the jth positions in the string are non-X.
  • A deviation vector can be obtained easily from a row and the column which it covers. For example, [1, 3] covers 1X234; therefore, the deviation vector [1,X,2,X,X] can be used to cover 1X234.
  • The objective here is to find a number of rows to cover all the columns in the table, so that a minimal number of deviation vectors will be created. Since the objective function is not a direct count of the number of rows in the cover, this problem appears to be even harder than the general covering problem. We are satisfied by finding a minimal number of rows that cover all the columns because, in our experience, a minimal cover often leads to a small covering set of deviation vectors.
  • A minimal cover for the table in FIG. 13 is {[1,2], [4,5]}. It produces the set of deviation vectors {[1,2,X,X,X], [X,X,X,3,4]}.
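  • The covering step can be sketched by brute force on a small table. The sketch assumes that the columns are the five strings obtained by one insertion into a four-character string (X1234, 1X234, 12X34, 123X4, 1234X) and that each row selects two positions; the actual table of FIG. 13 is not reproduced here, and the function names are hypothetical.

```python
from itertools import combinations

DUMMY = "X"


def columns_one_insert(length):
    """Strings with dummy X for one insertion into a string of given length."""
    digits = [str(k) for k in range(1, length + 1)]
    return [tuple(digits[:p] + [DUMMY] + digits[p:]) for p in range(length + 1)]


def covers(row, column):
    """Row (i, j) covers a column if positions i and j are both non-X."""
    return all(column[p - 1] != DUMMY for p in row)


def minimal_covers(columns, picks=2):
    """All smallest sets of rows that together cover every column."""
    width = len(columns[0])
    rows = list(combinations(range(1, width + 1), picks))
    for size in range(1, len(rows) + 1):
        found = [c for c in combinations(rows, size)
                 if all(any(covers(r, col) for r in c) for col in columns)]
        if found:
            return found
    return []


def deviation_vectors(cover, columns):
    """Deviation vectors produced by the rows of a cover."""
    vectors = set()
    for row in cover:
        for col in columns:
            if covers(row, col):
                vectors.add(tuple(col[p - 1] if p in row else DUMMY
                                  for p in range(1, len(col) + 1)))
    return vectors
```

  • Under these assumptions several two-row covers exist, but they do not all yield the same number of deviation vectors: {[1,2], [4,5]} gives two vectors, whereas {[1,2], [3,4]} gives three, which illustrates why the row count is only a proxy for the real objective.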
  • DICT_pgm: The set of variable and function names of a large Prolog program.
  • DICT_Unix: The set of English words in a dictionary provided by the Unix System.
  • DICT_IR: The set of index words used in an information retrieval system of the library of GTE Laboratories, Waltham, Massachusetts, which is a mixture of author names, titles, and abstracts of books, journals, and technical reports. It contains 25167 distinct words, with average word length equal to 8.320. Normalized distributions of the word length of the three dictionaries are shown in FIG. 6.
  • Algorithm 4 has been modified in the following way: during program execution, whenever no R(Z, r) mechanism has been provided for a certain portion of neighborhood N(Z, r), the program switches to Algorithm 3 to handle that portion.
  • The first one is the number of H[i, j] entries visited by an algorithm (Definition 10). This measurement is system and implementation independent. Since computation overheads are not
  • Algorithms 0, 1, 2 and 3 are all easy to implement. They all use little extra memory. Algorithm 1 is faster than Algorithm 0 only when the length of the given character string is small.
  • Algorithm 2 and Algorithm 3 are much faster than Algorithm 0 and
  • Algorithm 4 is the fastest algorithm among the five.
  • FIGS. 16a-16e show the relative speeds of the five algorithms with different / and r.
  • T_i(28, 4) is equal to 253, 253, 0.15, 0.15, 0.07, and 0.08 seconds, for i equal to 0, 1, 2, 3, and 4, respectively.
  • T_i(9, 1) is equal to 121, 121, 34, 23, and 0.33 seconds, for i equal to 0, 1, 2, 3, and 4, respectively.
  • Algorithm 4 finds nearest neighbors within seconds.
  • Algorithms 0, 1, 2, and 3 are simple and space efficient.
  • Algorithm 2 and Algorithm 3 are relatively fast.
  • Algorithm 4 is very fast but requires substantial memory.
  • Algorithm 2 and Algorithm 3 are good choices.
  • Algorithm 4 is the only choice to provide real-time performance.

Abstract

A string data processing system using fast algorithms to determine an approximate match with character strings in a dictionary (23). The disclosed fault model covers operations on character strings containing multiple spelling errors. The fault model, "S-trace", is used to develop the algorithms, and a four-stage reduction process improves the efficiency of an approximate string matching algorithm. This approach to spelling correction (using the upper bound, the string length partition criterion and the cut-off criterion) represents three improvements over exhaustive comparison. Each is easily incorporated into the next stage. In the fourth stage, a hashing scheme avoids comparing a given string with words that are far away when the search is conducted in a neighborhood of small distance. The result is an algorithm that is sub-linear in the number of words in the dictionary (23). The application of the algorithms to a library information system uses original text files (21), information description files (22) and a negative dictionary (23) stored on disks (12).
PCT/US1991/009756 1990-12-31 1991-12-30 Very fast approximate string matching algorithms for multiple errors spelling correction WO1992012493A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CA002076526A CA2076526A1 (fr) 1990-12-31 1991-12-30 Fast approximate string matching algorithms for correcting multiple spelling errors
JP92504399A JPH05505270A (ja) 1990-12-31 1991-12-30 Fast approximate string matching method for multiple error spelling correction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US63664090A 1990-12-31 1990-12-31
US636,640 1990-12-31

Publications (1)

Publication Number Publication Date
WO1992012493A1 (fr) 1992-07-23

Family

ID=24552735

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/009756 WO1992012493A1 (fr) 1990-12-31 1991-12-30 Very fast approximate string matching algorithms for multiple errors spelling correction

Country Status (4)

Country Link
EP (1) EP0519062A4 (fr)
JP (1) JPH05505270A (fr)
CA (1) CA2076526A1 (fr)
WO (1) WO1992012493A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7337375B2 (en) * 1999-10-20 2008-02-26 Broadcom Corporation Diagnostics of cable and link performance for a high-speed communication system
WO2010114486A1 (fr) * 2009-03-31 2010-10-07 Azimuth Intellectual Products Pte Ltd Apparatus and method for analysing goods packages
EP2284653A1 (fr) * 2009-08-14 2011-02-16 Research In Motion Limited Electronic device with a touch-sensitive display and method of facilitating input at the electronic device
CN116522164A (zh) * 2023-06-26 2023-08-01 北京百特迈科技有限公司 User matching method, device, and storage medium based on user-collected information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4771385A (en) * 1984-11-21 1988-09-13 Nec Corporation Word recognition processing time reduction system using word length and hash technique involving head letters
US4903206A (en) * 1987-02-05 1990-02-20 International Business Machines Corporation Spelling error correcting system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4783758A (en) * 1985-02-05 1988-11-08 Houghton Mifflin Company Automated word substitution using numerical rankings of structural disparity between misspelled words & candidate substitution words
JPH0782544B2 (ja) * 1989-03-24 1995-09-06 インターナショナル・ビジネス・マシーンズ・コーポレーション DP matching method and apparatus using multiple templates

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP0519062A4 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7337375B2 (en) * 1999-10-20 2008-02-26 Broadcom Corporation Diagnostics of cable and link performance for a high-speed communication system
US7711999B2 (en) 1999-10-20 2010-05-04 Broadcom Corporation Diagnostics of cable and link performance for a high-speed communication system
US7913127B2 (en) 1999-10-20 2011-03-22 Broadcom Corporation Diagnostics of cable and link performance for a high-speed communication system
WO2010114486A1 (fr) * 2009-03-31 2010-10-07 Azimuth Intellectual Products Pte Ltd Apparatus and method for analysing goods packages
EP2284653A1 (fr) * 2009-08-14 2011-02-16 Research In Motion Limited Electronic device with a touch-sensitive display and method of facilitating input at the electronic device
CN116522164A (zh) * 2023-06-26 2023-08-01 北京百特迈科技有限公司 User matching method, device, and storage medium based on user-collected information
CN116522164B (zh) * 2023-06-26 2023-09-05 北京百特迈科技有限公司 User matching method, device, and storage medium based on user-collected information

Also Published As

Publication number Publication date
EP0519062A4 (en) 1993-12-29
JPH05505270A (ja) 1993-08-05
CA2076526A1 (fr) 1992-07-01
EP0519062A1 (fr) 1992-12-23

Similar Documents

Publication Publication Date Title
Blumer et al. Complete inverted files for efficient text retrieval and analysis
JP3077765B2 (ja) System and method for reducing the search range of a lexical dictionary
Boytsov Indexing methods for approximate dictionary searching: Comparative analysis
Czech et al. Perfect hashing
JP3581652B2 (ja) Data retrieval system and method and its use in a search engine
Jokinen et al. A comparison of approximate string matching algorithms
US5895446A (en) Pattern-based translation method and system
US5768423A (en) Trie structure based method and apparatus for indexing and searching handwritten databases with dynamic search sequencing
Hodge et al. A comparison of standard spell checking algorithms and a novel binary neural approach
JPH05290082A (ja) Pattern-based translation method and translation apparatus
Du et al. An approach to designing very fast approximate string matching algorithms
US5553284A (en) Method for indexing and searching handwritten documents in a database
Gog et al. Fast and lightweight LCP-array construction algorithms
JP2018506115A (ja) Method for proposing candidate words as replacements for an input string received at an electronic device
Amir et al. Managing unbounded-length keys in comparison-driven data structures with applications to online indexing
Gawrychowski et al. Improved bounds for shortest paths in dense distance graphs
Loukides et al. Bidirectional string anchors: A new string sampling mechanism
Andersson et al. Suffix trees on words
Adebiyi et al. An efficient algorithm for finding short approximate non-tandem repeats
WO1992012493A1 (fr) Very fast approximate string matching algorithms for multiple errors spelling correction
SE513248C2 (sv) Method for handling data structures
KR20230170891A (ko) Efficient multi-level in-memory search
US8204887B2 (en) System and method for subsequence matching
Dewar The SETL programming language
Grana et al. Compilation methods of minimal acyclic finite-state automata for large dictionaries

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU MC NL SE

WWE Wipo information: entry into national phase

Ref document number: 2076526

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 1992904493

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1992904493

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1992904493

Country of ref document: EP