WO2001065416A2 - Probabilistic matching engine - Google Patents

Probabilistic matching engine

Publication number
WO2001065416A2
Authority
WO
WIPO (PCT)
Prior art keywords
token
tokens
record
query
database
Prior art date
Application number
PCT/US2001/006447
Other languages
English (en)
Other versions
WO2001065416A3 (fr
Inventor
Matthew A. Jaro
Original Assignee
Vality Technology Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vality Technology Incorporated filed Critical Vality Technology Incorporated
Priority to AU2001243337A priority Critical patent/AU2001243337A1/en
Priority to CA002401170A priority patent/CA2401170A1/fr
Priority to JP2001564037A priority patent/JP2004506960A/ja
Publication of WO2001065416A2 publication Critical patent/WO2001065416A2/fr
Publication of WO2001065416A3 publication Critical patent/WO2001065416A3/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model

Definitions

  • the present invention generally relates to database information retrieval techniques.
  • the present invention relates to database information retrieval based on record linkage theory with query expansion.
  • Browsing is typically more passive than querying. Browsing involves a user accessing a portion of a database through a simple mechanism, such as a menu topic, and then exploring the accessed information by navigating through it, often with some degree of information retrieval system guidance. Hypertext systems generally support a browsing approach to information retrieval. Although perceived as demanding less of a user, browsing is not necessarily the most efficient way to retrieve information from a large database.
  • In contrast to browsing, querying requires a user to specify the information that is of interest to him. Querying will only be successful when the information of interest is specified in a way that matches the database language. The match often requires a compromise in the selection of query terms. Querying can be perceived as taxing on a user, particularly if the user is untrained. Querying can also produce poor retrieval results. Querying itself has traditionally been classified as belonging to one of two genres: querying done in connection with Boolean retrieval and querying done in connection with probabilistic retrieval. Querying in connection with Boolean retrieval is the most established form of information retrieval. It requires a user to create an appropriate combination of terms which match both the information of interest and the database language.
  • Boolean searching requires a user to specify only a limited number of terms to achieve an acceptable number of retrieval results.
  • Optimal Boolean searching requires the user to be familiar with the Boolean operators and with the effective ways to combine terms. Nonetheless, users rarely make explicit use of Boolean operators.
  • Querying in connection with probabilistic retrieval offers users a greater scope of retrieval.
  • Retrieval results are typically compared to the query terms using an algorithm based on probability theory and rated on how closely they match the query terms. Terms that occur less frequently in a database are considered more discriminating and are typically given more weight in predicting a match.
  • a user is not constrained in the number of query terms he may use because the rating of the retrieval results mitigates the problem of excessive retrieval results.
  • the invention includes a method for indexing a database. Records of a database are input. Each record is parsed into record tokens using a pattern action language. An index to the record is created from the record tokens for each record.
  • the parsing includes converting each record into original tokens, characterizing each original token, and converting the characterized original tokens into record tokens based on the pattern action language.
  • the pattern action language is responsive to the domain with which the record token is associated.
  • the index creation includes creating a list of unique index tokens from the record tokens for each record, calculating a frequency of occurrence in the database for each unique index token, and creating a table of index tokens.
  • the table of index tokens contains the frequency of occurrence in the database for each unique index token.
  • an index token comprises a phonetic equivalent for the respective record token.
  • a list of unique record tokens is also created.
  • the invention includes a method for indexing a database. Records of a database are input and each record is parsed into record tokens. An index token is generated from a respective record token. The index token is a phonetic equivalent for the record token. A frequency of occurrence in the database is calculated for a unique index token. A table of index tokens is created. The table of index tokens includes the frequency of occurrence for the unique index tokens.
  • a list of unique record tokens is also created.
  • each record is parsed into record tokens using a pattern action language.
  • the parsing includes converting each record into original tokens, characterizing each original token, and converting the characterized original tokens into record tokens based on the pattern action language.
  • an index to the database is created from the record tokens for each record.
  • Each record token is associated with a domain in the database, the pattern action language is responsive to the domain, the frequency of occurrence is calculated with respect to a domain in the database, and the index of unique record tokens list the frequency of occurrence by domain.
  • the invention relates to an apparatus for indexing a database.
  • An input device accepts records of a database.
  • a parser parses the records into record tokens, and an indexer generates an index of the record tokens in the database.
  • the parser includes a tokenizer, a token characterizer, and a token converter.
  • the tokenizer converts records into original tokens.
  • the token characterizer characterizes each original token, and the token converter converts the characterized original tokens into record tokens based on a pattern action language.
  • the pattern action language is responsive to the domain with which a record token is associated.
  • the indexer includes a token comparator, a frequency calculator, and a table generator.
  • the token comparator creates a list of unique index tokens from the record tokens.
  • the frequency calculator calculates a frequency of occurrence in the database for the unique index tokens.
  • the table generator generates a table containing a frequency of occurrence for the unique index tokens.
  • an index token is a phonetic equivalent for the respective record token and the token comparator communicates with the parser via the token generator.
  • a record token comparator also creates a list of unique record tokens.
  • In another aspect, the invention relates to an apparatus for indexing a database.
  • An input device accepts records of a database, and a parser parses the records into record tokens.
  • a token generator generates an index token from a respective record token.
  • the index token is a phonetic equivalent of the respective record token.
  • a table generator generates a table containing for each index token a frequency of occurrence of the index token in the database, calculated by a frequency calculator, and a pointer to all records containing the index token.
  • a record token comparator creates a list of unique record tokens from the record tokens for each record.
  • the table generator generates a table that contains a pointer to each record in the database that contains an index token corresponding to said unique index token.
  • a record token comparator in communication with the parser also creates a list of unique record tokens.
  • the parser parses each record using a pattern action language.
  • the parser further includes a tokenizer, a token characterizer, and a token converter.
  • the tokenizer converts records into original tokens.
  • the token characterizer characterizes each original token, and the token converter converts the characterized original tokens into record tokens based on the pattern action language.
  • the original token, the respective record token, and all respective index tokens are all associated with the same domain in the database.
  • the pattern recognition is responsive to the domain associated with a token.
  • the frequency of occurrence for an index token is calculated by domain.
  • a table generator generates a table containing for unique index tokens the frequency of occurrence and a pointer to each record in the database containing the corresponding record token.
  • the invention relates to a method of querying a database.
  • a query is input and parsed into query tokens using a pattern action language.
  • a search token is generated from a query token. The search token is looked up in an index table to access a record within the database.
  • the parsing includes converting the query into original tokens, characterizing each original token, and converting the characterized original tokens into the query tokens based on the pattern action language.
  • an original token and the resulting query token are associated with the same domain in the database.
  • the pattern action language is responsive to the domain with which the tokens are associated.
  • a search token is generated from a query token, and the search token is associated with the domain in the database with which the respective query token is associated.
  • In another aspect, the invention relates to a method of querying a database.
  • a query is input and parsed into query tokens.
  • a search token is generated from a query token. Search token generation includes checking a list of unique record tokens for a token that is similar to the query token based on an information theoretic algorithm. It also includes translating query tokens and similar tokens into search tokens. The search tokens are phonetic equivalents for the query tokens or the similar tokens.
  • a search token is looked up on an index table to access a record within the database.
  • a search token is associated with the same domain in the database as the respective query token.
  • the parsing is done using a pattern action language.
  • the parsing includes converting the query into original tokens, characterizing each original token, and converting the characterized original tokens into query tokens based on the pattern action language.
  • the invention relates to an apparatus for querying a database.
  • a query input device accepts a query as input.
  • a parser parses the input into query tokens using a pattern action language.
  • a generator generates search tokens from the query tokens.
  • a database accessor accesses records in the database in response to a search token.
  • the parser includes a tokenizer, a characterizer, and a converter.
  • the tokenizer creates original tokens from the input, and the characterizer characterizes each of them.
  • the converter converts the characterized original tokens into query tokens based on the pattern action language.
  • the original token is associated with the same domain in the database as the respective query tokens and search token.
  • the tokens are associated with the same domain in the database, and the pattern action language is responsive to the domain with which they are associated.
  • In another aspect, the invention relates to an apparatus for querying a database.
  • a query input device accepts input, and a parser parses it into query tokens.
  • a generator generates search tokens from the query tokens.
  • the generator includes a query expander that adds tokens that qualify as similar to a query token based on an information theoretic algorithm. These are similar tokens.
  • the generator also includes a translator that translates each query token and similar token into a phonetically-equivalent search token.
  • a database accessor finds pointers to records in the database with a search token.
  • each query token, respective similar token, and respective search token are all associated with the same domain in the database.
  • the parser parses using a pattern action language.
  • the parser includes a tokenizer, a characterizer, and a converter.
  • the invention relates to a method for accessing data within a database.
  • a token is selected from a set as the first token with which to search.
  • a set of records is retrieved from the database in response to the selected token.
  • a likelihood of relevance to the query is determined for each record in the set.
  • the set of records is ordered by likelihood of relevance to the query.
  • the highest likelihood of relevance to the query for the set is compared to a continuation threshold. If the threshold is exceeded, the search is terminated and the set of records is output. If not, a different token is selected for a new search.
  • likelihood of relevance to the query is determined based on Record Linkage Theory.
  • the set of records consists of more than one record and the output records are ordered by likelihood of relevance to the query.
  • a frequency of occurrence in a database is identified for each token, the tokens are ordered by frequency of occurrence, and the token having the lowest frequency of occurrence is selected as the first search token. If the continuation threshold is not exceeded, the token with the next lowest frequency of occurrence is selected as the next search token.
  • the frequencies of occurrence relate to domains in the database and tokens are each associated with a domain. In such an embodiment, tokens are ordered and the token having the lowest frequency of occurrence in the associated domain is the first selected token.
  • a likelihood of relevance to the query is determined for each record based on Record Linkage Theory. In a further related embodiment, if a buffer of retrieved records overflows, the buffer is cleared and a new search is begun for records containing all of the tokens.
  • In another aspect, the invention relates to an apparatus for accessing data within a database.
  • a database accessor retrieves a set of records from the database in response to the token selected as the first token on which to search by the token selector.
  • a relevance determiner determines the likelihood of relevance to the query for each record in the set of records.
  • a relevance comparator orders each record in the set by likelihood of relevance and a threshold comparator compares a continuation threshold to the highest likelihood of relevance. If the continuation threshold is exceeded, the relevance comparator terminates the search. If not, the relevance comparator removes the selected token and allows the token selector to select another token.
  • An output device returns the set of records when the threshold comparator terminates the search.
  • the likelihood of relevance to the query is determined based on Record Linkage Theory.
  • the database accessor retrieves more than one record and the output device returns the records ordered by likelihood of relevance to the query.
  • a frequency comparator identifies a frequency of occurrence in the database for each token and orders the tokens by the frequency of occurrence. The token selector selects the token having the lowest frequency of occurrence as the first token on which to search.
  • the frequency comparator identifies a frequency of occurrence in the domain in the database with which the token is associated and selects the token having the lowest frequency of occurrence in the associated domain as the first token.
  • the relevance determiner determines a likelihood of relevance to a query based on Record Linkage Theory.
  • a buffer overflow arrestor clears a buffer when it overflows and sends an overflow signal to the token selector. The database accessor then retrieves the set of records from the database that contain all of the tokens.
  • FIG. 1 is a functional block diagram of the information retrieval process as known to the prior art.
  • FIG. 1A describes an embodiment of the evolution of records throughout the indexing process in accordance with the invention.
  • FIG. 1B describes an embodiment of the evolution of a query throughout query processing in accordance with the invention.
  • FIG. 1C describes an embodiment of the interaction of the search token and record in the information accessing process in accordance with the invention.
  • FIG. 2 is a functional block diagram of an embodiment of the information indexing portion of the information retrieval process performed in accordance with the invention.
  • FIG. 3 is a functional block diagram of an embodiment of the query processing portion of the information retrieval process performed in accordance with the invention.
  • FIG. 4 is a functional block diagram of an embodiment of the information accessing portion of the information retrieval process performed in accordance with the invention.

Description

  • the present invention relates in general to the information retrieval process for an electronic database as illustrated in FIG. 1.
  • the information retrieval process is a process by which a query is used to access existing reference data in a database.
  • probability theory is used to select records in a database according to a user query and retrieve them.
  • the information retrieval process can generally be separated into three steps as illustrated in FIG. 1: indexing the reference data, processing the query, and accessing the reference data. The last two steps of the information retrieval process may be considered the search phase.
  • a database generally includes many records, each of which may be referred to by record number. Each record generally includes several domains. Similarly, each domain generally includes several fields. Each field may further contain free form text.
  • an Internal Revenue Service database may contain a separate record for each taxpayer. The taxpayer record may be numbered and may include separate domains for the home and work address of the taxpayer. Each address domain may contain a street field, a town field, a zip code field, and other fields. The street field, for example, may accept free form text such as "10910 Way Thru The Woods" or "71 Camino De Gratia.” Databases typically do not require that every field or domain include information.
  • the reference data in a database is presumed to include a number of records, each record including a number of domains, each domain including one or more fields, each field containing free form text.
  • the present invention operates on free form text residing in fields.
  • the first step in the information retrieval process, a block diagram of which is illustrated in FIG. 1, is to index (STEP 10) the reference data.
  • FIG. 1A illustrates the evolution of a database record during the indexing process according to one embodiment of the present invention.
  • the elements 42 of each record 44 are parsed into a set of record tokens (T_Rn) 46.
  • the parsing process in some embodiments includes elimination of some portions of the record and standardization of other portions of the record.
  • index tokens (T_In) 62 are then generated from record tokens (T_Rn) 46.
  • the index tokens (T_In) 62 and record tokens (T_Rn) 46 are analyzed to facilitate later searching.
  • a list of unique record tokens (T_Rn) 46 contained in the reference data is created.
  • a table 96 of unique index tokens (T_In) 62 is created.
  • the table 96 includes the frequency of occurrence (v_n) 92 in the database for each unique index token (T_In) 62.
  • the table 96 includes pointers 94 to the records in the reference data that contain the tokens.
  • the second step in the information retrieval process illustrated in FIG. 1 is to process (STEP 20) the query.
  • Processing the query may be considered preparation of the query for use in the information accessing phase of the information retrieval process.
  • FIG. 1B illustrates the evolution of a query 54 during query processing according to one embodiment of the present invention. Elements 52 of the query 54 are parsed into a set of query tokens (T_Qn) 56. In the embodiment shown in FIG. 1B, the parsing process includes elimination of some portions of the query 54 and standardization of other portions of the query.
  • any token from a list of record tokens (T_Rn) 46 that qualifies as similar to a query token (T_Qn) 56 based on an information theoretic algorithm is added to the set of query tokens.
  • search tokens (T_Sn) 72, which can be used to access records in the reference data, are generated from query tokens (T_Qn) 56 and similar tokens.
  • the processing of a query corresponds to the processing of the records in the reference data.
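
The query expansion step described above can be sketched as follows. The text does not specify the information theoretic algorithm here, so this sketch swaps in the similarity ratio from Python's standard `difflib` module purely as a stand-in. The function name `expand_query` and the `cutoff` value are assumptions made for illustration.

```python
import difflib


def expand_query(query_tokens, unique_record_tokens, cutoff=0.8):
    """Add record tokens that look similar to each query token.

    Assumption: difflib's ratio replaces the unspecified information
    theoretic comparator; cutoff=0.8 is an arbitrary illustrative value.
    """
    expanded = []
    for token in query_tokens:
        if token not in expanded:
            expanded.append(token)
        # Candidates come from the dictionary of unique record tokens.
        for similar in difflib.get_close_matches(
            token, unique_record_tokens, n=5, cutoff=cutoff
        ):
            if similar not in expanded:
                expanded.append(similar)
    return expanded
```

Calling `expand_query(["WODS"], ["WOODS", "WAY", "CAMINO"])` keeps the misspelled query token and adds `"WOODS"` as a similar token. Both would then be translated into search tokens.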
  • the third step in the information retrieval process illustrated in FIG. 1 is to access (STEP 30) the reference data. Accessing the reference data is the culmination of the preparation of the reference data and the query.
  • FIG. 1C illustrates the accessing process according to one embodiment of the present invention.
  • a search token (T_Sn) 72 is selected from the set of search tokens based on the selectivity of the search token. Records 44 from the reference data containing the search token (T_Sn) 72 are retrieved using a token table 96.
  • a weight is calculated for each record representing the likelihood that it is relevant to the user query 54. In a related embodiment, the weight calculation is based on Record Linkage Theory.
  • the maximum weight for a set of retrieved records is compared to a threshold to determine whether the search should continue or be terminated.
  • the retrieved records are ordered and returned to the user.
  • the weight of each record is returned to the user alone or in association with the record.
  • the final result of the information retrieval process is the user having a list of records and, in some embodiments, weights to evaluate each record's relevance to the query.
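
A minimal sketch of the accessing step follows: pick the most selective search token, retrieve the records it points to, weight them, and stop once the best weight exceeds a continuation threshold. The text only says the weight is based on Record Linkage Theory, so the scoring below uses the standard agreement and disagreement weights log2(m/u) and log2((1-m)/(1-u)) as an assumption. The value m = 0.9, the estimate of u from token frequency, the accumulation of records across tokens, and all function names are also assumptions.

```python
import math


def record_weight(record_tokens, search_tokens, table, n_records, m=0.9):
    """Sum per-token agreement/disagreement weights (assumed scoring)."""
    weight = 0.0
    for token in search_tokens:
        frequency = table[token]["frequency"]
        # u: chance that an unrelated record contains the token, clamped
        # away from 0 and 1 so the logarithms stay finite.
        u = min(max(frequency / n_records, 1e-6), 1 - 1e-6)
        if token in record_tokens:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


def search(search_tokens, table, records, threshold):
    """Return (record number, weight) pairs ordered by descending weight."""
    known = [t for t in search_tokens if t in table]
    # Lowest frequency of occurrence first: the most selective token.
    ordered = sorted(known, key=lambda t: table[t]["frequency"])
    retrieved = {}
    for token in ordered:
        for record_number in table[token]["pointers"]:
            if record_number not in retrieved:
                retrieved[record_number] = record_weight(
                    set(records[record_number]), known, table, len(records)
                )
        # Terminate once the best candidate clears the threshold.
        if retrieved and max(retrieved.values()) > threshold:
            break
    return sorted(retrieved.items(), key=lambda item: item[1], reverse=True)
```

Here `table` maps each token to a dictionary holding its `frequency` and `pointers`, and `records` maps record numbers to token lists. Both are hypothetical shapes chosen for the sketch.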
  • the first step is to parse (STEP 40) a record 44 of the reference data. Parsing the record into tokens includes separating the data in the record into a set of tokens.
  • the developer of the reference data defines a set of individual characters to be used as the basis for separation of the contents of a record into tokens. In some such embodiments, these developer-defined characters are used alone. In other such embodiments, these developer-defined characters are used in addition to default characters as the basis for separation. In other embodiments, the developer allows default characters to be used as the sole basis for the separation. A group of characters may be used as a basis for separation. In some embodiments, the separation characters themselves become tokens.
  • parsing the record includes eliminating some tokens.
  • the developer defines a set of tokens to be eliminated after the separation of the contents of a record into tokens.
  • the developer defined tokens are the sole tokens that are eliminated.
  • the developer defined tokens are eliminated in addition to the default tokens. In other embodiments, the developer simply allows default tokens to be eliminated.
  • a token to be eliminated need not consist of a single character. For example, "<big><;><bad><.><wolf and redriding hood>" becomes "<big><bad><wolf and redriding hood>" where the semicolon and period are defined as tokens to be eliminated.
  • the developer defines different tokens to be eliminated in different fields or domains.
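
The separation and elimination steps above might be sketched as below. The function name `tokenize`, its parameters, and the default character sets are assumptions for illustration and are not the developer-defined or default sets referred to in the text.

```python
def tokenize(text, separators=" ", separator_tokens=";.", eliminate=(";", ".")):
    """Split text into tokens, then drop the tokens marked for elimination.

    separators:        characters that split tokens and are discarded
    separator_tokens:  characters that split tokens and become tokens themselves
    eliminate:         tokens removed after separation
    """
    tokens, current = [], []
    for ch in text:
        if ch in separators or ch in separator_tokens:
            if current:
                tokens.append("".join(current))
                current = []
            if ch in separator_tokens:
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if t not in eliminate]
```

Passing a different `eliminate` set per field or domain would model the field-specific elimination mentioned above.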
  • parsing the record includes examining the set of tokens that results from the separation process for patterns and acting upon one or more tokens in a recognized pattern.
  • the attributes of each token are determined once a record has been separated into tokens.
  • the attributes include class, length, value, abbreviation, and substring.
  • additional attributes are determined.
  • different attributes are determined.
  • fewer attributes are determined.
  • the determination of some attributes of a token may negate the requirement to determine other attributes of the token.
  • classes include Numeric, Alphabetic, Leading Numeric followed by one or more letters, Leading Alphabetic followed by one or more numbers, Complex Mix containing a mixture of numbers and letters that do not fit into either of the two previous classes, and Special containing special characters that are not generally encountered.
  • other classes are defined.
  • the Alphabetic classification is case sensitive.
  • additional developer defined classes are used in conjunction with default classes.
  • developer defined classes are used to the exclusion of the default classes.
  • the token <aBcdef> has the attributes of an Alphabetic class token with a length of 6 characters and a value of "aBcdef" where the Alphabetic tokens are case sensitive.
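
As a rough illustration of token characterization, the sketch below assigns the class, length, and value attributes listed above. The regular expressions, the class labels, and the `Token` type are assumptions. In particular, treating any token with a character outside letters and digits as Special is one possible reading.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    value: str        # case is preserved, so Alphabetic tokens stay case sensitive
    token_class: str
    length: int


def classify(value):
    """Return a class label for a token value (assumed rules)."""
    if re.fullmatch(r"[0-9]+", value):
        return "Numeric"
    if re.fullmatch(r"[A-Za-z]+", value):
        return "Alphabetic"
    if re.fullmatch(r"[0-9]+[A-Za-z]+", value):
        return "Leading Numeric"
    if re.fullmatch(r"[A-Za-z]+[0-9]+", value):
        return "Leading Alphabetic"
    if re.fullmatch(r"[A-Za-z0-9]+", value):
        return "Complex Mix"
    return "Special"


def characterize(value):
    """Attach class, length, and value attributes to a token."""
    return Token(value=value, token_class=classify(value), length=len(value))
```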
  • a pattern must be defined for recognition based on the possible attributes of the tokens.
  • a pattern is defined for action only if it occurs in a specific domain.
  • a pattern is defined for action if it occurs anywhere in the set of record tokens. Pattern matching begins with the first token and proceeds one token at a time. There may be multiple pattern matches to a record.
  • a pattern is defined by any of the attributes of a token, a portion of a token, or a set of tokens.
  • a pattern is defined by the attributes of a single token.
  • a pattern is defined by the attributes of a set of tokens.
  • the pattern is defined as a token with length of less than 10, a substring of "ANTI" in the first through fourth characters of a token, and a substring of "CS" without constraint on where it occurs in the token.
  • the tokens <ANTICS> and <ANTI-CSAR> will be recognized for action.
  • the token <ANTIPATHY> will not be recognized due to failure to meet the second substring constraint and the token <ANTIPOLITICS> will not be recognized due to failure to meet the length constraint.
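
The single-token pattern in this example can be expressed as a small predicate. The `Pattern` class and its field names are hypothetical. Only the three constraints themselves come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    max_length: int       # token length must be strictly below this value
    anchored: tuple = ()  # (1-based start position, substring) pairs
    anywhere: tuple = ()  # substrings that may occur at any position

    def matches(self, value):
        if len(value) >= self.max_length:
            return False
        for start, text in self.anchored:
            # Convert the 1-based position to a Python slice.
            if value[start - 1:start - 1 + len(text)] != text:
                return False
        return all(text in value for text in self.anywhere)


# Length under 10, "ANTI" in characters 1 through 4, "CS" anywhere.
ANTI_CS = Pattern(max_length=10, anchored=((1, "ANTI"),), anywhere=("CS",))
```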
  • a number of actions may be taken to modify the pattern.
  • the action taken in response to a recognized pattern is to change one of the attributes of the pattern.
  • the action taken in response to a recognized pattern is to concatenate a portion of the pattern.
  • the action taken in response to a recognized pattern is to print debugging information.
  • other actions are taken. Some embodiments take an action with respect to a substring of a token. Some embodiments take a number of actions in response to a recognized pattern.
  • the command "SET the value of <token> to (1:2) <token>" is defined for execution upon recognition of the pattern of an alphabetic class token of length 7 with a substring "EX" in the first two characters.
  • the token <EXAMPLE> is recognized as fitting the pattern and the command is executed resulting in the value of the token changing to the first two characters of the original token or "EX".
  • the values of noise words such as "at", "by", and "on", which are not typically helpful to a search, are set to zero so that they are excluded from the list of unique index tokens.
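
The SET action and the noise-word handling above could look like the following in ordinary code. This is not the pattern action language itself, whose syntax is only hinted at in the text. The function name, the noise-word set, and the choice to model a zeroed value as a dropped token are assumptions.

```python
NOISE_WORDS = frozenset({"at", "by", "on"})  # illustrative set only


def apply_actions(token_values):
    """Apply two example actions to a list of token values."""
    result = []
    for value in token_values:
        # Noise words: modeled here as removal from the output.
        if value.lower() in NOISE_WORDS:
            continue
        # Pattern: alphabetic class, length 7, "EX" in the first two characters.
        if value.isalpha() and len(value) == 7 and value.startswith("EX"):
            # Action: SET the value to characters (1:2) of the token.
            value = value[0:2]
        result.append(value)
    return result
```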
  • parsing converts a database record 44 into record tokens (T_Rn) 46.
  • the second step in the process of indexing reference data illustrated in FIG. 2 is to identify (STEP 50) the unique record tokens. Identifying the unique record tokens allows a list of unique record tokens to be created. Such a list may be described as a dictionary of database terms. In one embodiment, certain fields are excluded from contributing to the list. In another embodiment, certain domains are excluded from contributing to the list. In one embodiment, tokens are excluded from contributing to the list of unique tokens based on their class. In another embodiment, tokens are excluded from contributing to the list of unique tokens based on their class and another attribute. In some embodiments, the excluded classes or other attributes are designated with respect to a domain. In some embodiments, the excluded classes or other attributes are designated with respect to records as a whole.
  • a developer excludes all numeric tokens with a length of more than 5 characters from the list of unique tokens.
  • STEP 50 is skipped.
  • STEP 50 is done later in the process of indexing reference data illustrated in FIG. 2.
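
A possible sketch of STEP 50, assuming each record has already been parsed into a list of token strings. The function name and the `exclude` callback are hypothetical. The default exclusion mirrors the example of dropping numeric tokens longer than 5 characters.

```python
def unique_record_tokens(parsed_records, exclude=None):
    """Build the dictionary of database terms from parsed records."""
    if exclude is None:
        # Example rule from the text: skip numeric tokens longer than 5 characters.
        exclude = lambda token: token.isdigit() and len(token) > 5
    unique = set()
    for tokens in parsed_records:
        for token in tokens:
            if not exclude(token):
                unique.add(token)
    return sorted(unique)
```

Exclusions by field or domain would need the tokens to carry that information, which this simplified sketch leaves out.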
  • the third step in the process of indexing reference data illustrated in FIG. 2 is to generate (STEP 60) index tokens (T_In) 62 from record tokens (T_Rn) 46. STEP 60 is also shown in FIG. 1A.
  • the index tokens are the record tokens themselves.
  • STEP 70 is duplicative of STEP 50. In other embodiments, as shown in FIG.
  • the index tokens (T_In) 62 are phonetic equivalents of the record tokens (T_Rn) 46.
  • the index tokens are generated by translating a record token into a phonetic language.
  • the phonetic language is NYSIIS.
  • the phonetic language is SOUNDEX.
  • the phonetic equivalence is based on another phonetic language or variation thereof.
  • only record tokens in the alphabetic class are translated and other classes of tokens are not used to generate index tokens.
  • record tokens in the alphabetic class and other classes generate index tokens, but only the alphabetic portion of the record tokens are translated into index tokens.
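
To illustrate phonetic index tokens, the sketch below implements the common American Soundex coding, one of the two phonetic languages named above. NYSIIS is not shown. The wrapper `index_token` follows the embodiment in which only alphabetic class tokens are translated. Both function names are assumptions.

```python
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(token):
    """Return the 4-character American Soundex code for a token."""
    letters = [c for c in token.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the previous code
    return (first + "".join(digits) + "000")[:4]


def index_token(record_token):
    """Translate alphabetic tokens only; other classes yield no index token."""
    return soundex(record_token) if record_token.isalpha() else None
```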
  • the fourth step in the process of indexing reference data illustrated in FIG. 2 is to identify (STEP 70) the unique index tokens.
  • STEP 70 is very similar to STEP 50. Identifying the unique index tokens allows a list of unique index tokens to be created. Such a list may be described as a dictionary of index terms.
  • certain fields are excluded from contributing to the list.
  • certain domains are excluded from contributing to the list.
  • a token is excluded from contributing to the list of unique tokens based on its class.
  • a token is excluded from contributing to the list of unique tokens based on its class and another attribute.
  • the excluded classes and attributes are designated with respect to a domain. In some embodiments, the excluded classes and attributes are designated with respect to records as a whole.
  • a developer excludes all alphabetic tokens with a length of less than 5 characters from contributing to the list of unique tokens.
  • STEP 70 is skipped.
  • STEP 70 is done after STEP 80.
  • STEP 70 is done as part of STEP 80.
  • the fifth step in the process of indexing reference data illustrated in FIG. 2 is to check (STEP 80) for additional records.
  • This step is simply a check step which determines when it is appropriate to calculate the frequency of occurrence of index tokens. If there are additional records, the next record will be processed before this step will be repeated. If there are no additional records, the indexing process continues on to STEP 90.
  • the check for additional records comprises simply looking for an end of file flag.
  • the sixth and final step in the process of indexing reference data illustrated in FIG. 2 is to calculate (STEP 90) the frequency of occurrence of the tokens in the database.
  • Frequency of occurrence is also known as collection frequency or document frequency. Assuming independence of tokens, a lower frequency of occurrence indicates a more selective token. Tokens are not necessarily independent. For example, phrases containing specific groups of tokens may be included repeatedly in a database. Nonetheless, independence of tokens is an acceptable approximation of reality.
  • Frequency of occurrence may be calculated for any type of token that can be associated with a record. For example, in one embodiment, frequency of occurrence is calculated for index tokens. Frequency of occurrence may be calculated for multiple different types of tokens that can be associated with a record. For example, in another embodiment, frequency of occurrence is calculated for index tokens and record tokens.
  • a frequency of occurrence is calculated for each unique index token with respect to the database as a whole. In another embodiment, a frequency of occurrence is calculated for each unique index token with respect to each domain in the database. In another embodiment, a frequency of occurrence is calculated for each unique index token with respect to each field in each domain in the database. Other levels of specificity for the calculation are also possible. In some embodiments, no frequency of occurrence is calculated for some unique index tokens. In one embodiment, such index tokens include noise words such as <the> and <and>. Creating a list of index tokens while calculating their respective frequency of occurrence makes the frequency calculation more efficient.
  • when the frequency of occurrence is calculated, it is efficient to create and save a token table 96 that includes pointers 94 to records containing the token in the respective location in the database.
  • the table 96 prevents duplicative searches for records containing the token from being required.
  • the pointers 94 are included in a comprehensive table 96.
  • the pointers are included in a separate table and associated with the respective token.
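  • A token table that stores both the frequency of occurrence and the record pointers might look like the sketch below. It assumes that frequency means the number of records containing the token (document frequency) and that a pointer is simply a record identifier; the names `build_token_table` and `records` are hypothetical.

```python
from collections import defaultdict


def build_token_table(records):
    """Map each index token to (frequency, sorted record ids containing it)."""
    pointers = defaultdict(set)
    for record_id, tokens in records.items():
        for token in tokens:
            pointers[token].add(record_id)  # a set ignores repeats within one record
    return {token: (len(ids), sorted(ids)) for token, ids in pointers.items()}


# Hypothetical index tokens for three records.
records = {1: ["S530", "J500"], 2: ["S530", "M600"], 3: ["H516"]}
table = build_token_table(records)
```

  • Because the pointers are gathered in the same pass that counts occurrences, a later lookup of a token needs no further scan of the database.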
  • FIG. 3 illustrates a block diagram of query processing according to one embodiment.
  • the first step in processing a query shown in FIG. 3 is to parse (STEP 40) the query.
  • Query parsing can be done using the same process, and variations thereof, as used for parsing (STEP 40) a record from a database. The only difference is that, whereas parsing a record 44 results in record tokens (T_Rn) 46, parsing a query 54 results in query tokens (T_Qn) 56 as shown in FIG. 1B.
  • the second step in processing a query as illustrated in FIG. 3 is to expand (STEP 90) the query.
  • the query is expanded by adding similar tokens to the query tokens.
  • similar tokens are selected from the list of unique record tokens.
  • various comparisons of a query token and a candidate record token may be considered.
  • the list of unique record tokens may be considered a dictionary of database terms.
  • the comparisons of a query token and candidate record tokens may be considered a spelling check for the query. In one embodiment, the following comparisons are considered: the number of mismatched characters; the number of transpositions; and the lengths of the character strings.
  • a subset of the above comparisons is considered.
  • other comparisons are considered instead of or in addition to the named comparisons.
  • the entire set of tokens from the list of unique record tokens is used for comparison to a query token.
  • a smaller subset of tokens from the list of unique record tokens is used for comparison.
  • the subset of record tokens that have the same first two characters as the query token is used for comparison with an individual query token.
  • if the list of unique record tokens includes no record tokens with the same first two characters as the query token <XENITH>, no further comparison is done and no record token is added to the set of query tokens for the query token <XENITH>.
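  • Restricting the comparison to record tokens that share the first two characters of the query token can be sketched as a simple prefix filter. The function name `candidates_by_prefix` and the sample dictionary are hypothetical.

```python
def candidates_by_prefix(query_token, unique_record_tokens, prefix_len=2):
    """Return the record tokens that start with the query token's first characters."""
    prefix = query_token[:prefix_len]
    return [token for token in unique_record_tokens if token.startswith(prefix)]


# Hypothetical dictionary of database terms.
dictionary = ["ZENITH", "ZEBRA", "XAVIER"]
no_candidates = candidates_by_prefix("XENITH", dictionary)
some_candidates = candidates_by_prefix("ZENITH", dictionary)
```

  • The filter trades recall for speed: a misspelling in the first two characters, as in <XENITH> for <ZENITH>, yields no candidates at all.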
  • a threshold is set to determine which candidate record tokens are added to the set of query tokens and which are not.
  • the threshold is based on the similarity of the candidate record tokens in comparison to a query token.
  • the threshold is a minimum similarity required for inclusion of the candidate record token.
  • the threshold is based on the dissimilarity of the candidate record tokens in comparison to a query token.
  • the threshold is a maximum dissimilarity required for exclusion of the candidate record token.
  • the threshold is a combination of the similarity and the dissimilarity.
  • Similarity may be calculated as follows, where each S is a weighting factor, c is the number of characters common to both the query token and the candidate record token, d is the length of the query token, r is the length of the candidate record token, and t_r is the number of transpositions of characters found by comparing the query token to the candidate record token.
  • S_cd is the weight factor for the percentage of characters in the query token consisting of characters in common with the candidate record token.
  • S_rd is the weight factor for the percentage of characters in the candidate record token consisting of characters in common with the query token.
  • S_tr is the weight factor for the percentage of characters in common with the query token and the candidate record token that are not transposed.
  • all of the similarity weighting factors are set to a value of 300 and the candidate record tokens are added to the set of query tokens if their calculated similarity exceeds a minimum similarity.
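  • The similarity formula itself does not survive in this text. One reading consistent with the three weight definitions is S_cd·(c/d) + S_rd·(c/r) + S_tr·((c − t_r)/c), and the sketch below implements that reading. How common characters and transpositions are counted is not stated, so the Jaro-style matching window in `common_and_transpositions` is an assumption, as are both function names.

```python
def common_and_transpositions(query, candidate):
    """Count common characters and transpositions with a Jaro-style window (assumed)."""
    window = max(max(len(query), len(candidate)) // 2 - 1, 0)
    used = [False] * len(candidate)
    query_matched = []
    for i, ch in enumerate(query):
        lo = max(0, i - window)
        hi = min(len(candidate), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and candidate[j] == ch:
                used[j] = True
                query_matched.append(ch)
                break
    candidate_matched = [candidate[j] for j in range(len(candidate)) if used[j]]
    out_of_order = sum(a != b for a, b in zip(query_matched, candidate_matched))
    return len(query_matched), out_of_order // 2  # each swap displaces two characters


def similarity(query, candidate, s_cd=300, s_rd=300, s_tr=300):
    """Weighted similarity under the reconstructed formula; weights default to 300."""
    c, t_r = common_and_transpositions(query, candidate)
    if c == 0:
        return 0.0  # nothing in common, and avoids division by zero
    d, r = len(query), len(candidate)
    return s_cd * c / d + s_rd * c / r + s_tr * (c - t_r) / c
```

  • With all three weights at 300, identical tokens score 900, so a minimum similarity would be chosen somewhere below that value.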
  • Dissimilarity may be calculated as follows, where each D is a weighting factor, u_cd is the number of characters in the query token that are not in the candidate record token, d is the length of the query token, u_rd is the number of characters in the candidate record token that are not in the query token, r is the length of the candidate record token, t_r is the number of transpositions of characters found by comparing the query token to the candidate record token, and c is the number of letters common to both the query token and the candidate record token.
  • D_cd is the penalty factor for the percentage of characters in the query token that are not in the candidate record token.
  • D_rd is the penalty factor for the percentage of characters in the candidate record token that are not in the query token.
  • D_tr is the penalty factor for the percentage of characters in common with the query token and the candidate record token that are transposed.
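  • The dissimilarity formula is likewise missing from this text. A reconstruction that follows directly from the three penalty definitions, offered as an assumption rather than as the original equation, is:

```latex
\[
\text{Dissimilarity}
  = D_{cd}\,\frac{u_{cd}}{d}
  + D_{rd}\,\frac{u_{rd}}{r}
  + D_{tr}\,\frac{t_r}{c}
\]
```

  • Each term is a penalty factor multiplied by the percentage it is said to penalize; the last term is undefined when the two tokens have no characters in common (c = 0).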
  • the query is further expanded by generating search tokens (T_Sn) 72 from the query tokens (T_Qn) 56 and the similar tokens.
  • Search token generation can be done using the same process, and variations thereof, as used for generating (STEP 60) index tokens from record tokens. The only difference is that, whereas index tokens (T_In) 62 are generated from record tokens (T_Rn) 46, search tokens (T_Sn) 72 are generated from query tokens (T_Qn) 56.
  • the query is expanded by generating search tokens (T_Sn) 72 from the query tokens (T_Qn) 56 alone.
  • search token generation can be done using the same process, and variations thereof, as used for generating (STEP 60) index tokens from record tokens. Again, the only difference is that, whereas index tokens (T_In) 62 are generated from record tokens (T_Rn) 46, search tokens (T_Sn) 72 are generated from query tokens (T_Qn) 56.
  • the first step in the process of accessing the reference data shown in FIG. 4 is to select (STEP 100) the first search token.
  • the first search token is selected at random from the search tokens.
  • the first search token is selected by the given order within the search tokens.
  • the first search token is the most selective search token.
  • search tokens are ordered by selectivity. In one such embodiment, selectivity is determined by frequency of occurrence in an indexed database record set. In another such embodiment, selectivity is determined by frequency of occurrence in a specific domain within an indexed database record set.
  • selectivity is determined by frequency of occurrence in a specific field in a domain within an indexed database record set.
  • the first search token is the most selective search token in the domains corresponding to the domains specified in the query.
  • the most selective search token is identified by comparing frequencies of occurrence reported in a table of unique index tokens.
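  • Choosing the most selective search token by comparing frequencies of occurrence can be sketched as follows. The function name `most_selective` and the decision to skip tokens absent from the frequency table are assumptions.

```python
def most_selective(search_tokens, frequency):
    """Return the search token with the lowest frequency of occurrence, or None."""
    known = [token for token in search_tokens if token in frequency]
    if not known:
        return None  # no search token appears in the table of unique index tokens
    return min(known, key=lambda token: frequency[token])


# Hypothetical frequencies: the rarer token is the more selective one.
frequency = {"S530": 1200, "H516": 3}
first_token = most_selective(["S530", "H516", "Z000"], frequency)
```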
  • the second step in the process of accessing the reference data illustrated in FIG. 4 is to access (STEP 110) reference data.
  • a new search of the database record set for the selected token is initiated.
  • the selected token is looked up on a token table.
  • the token table 96 will directly return a set of pointers 94 to records within the database containing the selected token (T_S3) 72.
  • the token table will indirectly return a set of pointers to records within the database containing the selected token. The pointers may be used to access the records within the database.
  • the third step in the process of accessing the reference data illustrated in FIG. 4 is to calculate (STEP 120) relevance.
  • each accessed record is evaluated by calculating a weight representing its likelihood of relevance to the query.
  • the weight is calculated by comparing the query tokens to the record tokens.
  • the weight is calculated by comparing the query tokens to the record tokens in the domains specified by the query.
  • Record linkage is the process of examining records and locating pairs of records that match on some combination of fields.
  • Record Linkage Theory is the probabilistic basis for considering a pair of records to match or be relevant to each other.
  • the present invention applies the Theory in some embodiments to matching a query to individual records within a database record set.
  • a query is defined as a record from the set A of records.
  • a record from the reference data that is a candidate for matching the query is defined as a record from the set B of records.
  • Each pair of records includes one record from set A, in effect the query, and one record from set B.
  • Each pair of records is either a member of the set of matching pairs M or a member of the set of non-matching pairs U.
  • the power of a field to identify a match depends on the selectivity of the contents of the field and the accuracy of the contents of the field.
  • Selectivity is a measure of the power of the contents of the field to discriminate amongst records. For example, where the field is surnames, the token <Humperdinck> is likely to be much more selective than the token <Smith> because there are likely to be many more records containing <Smith> in the surname field than <Humperdinck>.
  • Accuracy is a measure of the reliability of the data in the field. For example, field information which is entered more carefully or checked after entry is more likely to agree in a matched pair than field information which is less carefully entered or not checked after entry.
  • Accuracy m_i is defined as the probability that two records have the same contents in a field when the pair of records is a member of the set of matching pairs M. This is expressed mathematically as follows, where P(α | β) is the probability of α being true given the condition β: m_i = P(fields_agree | M).
  • Agreement Weight W_A is added to the likelihood of relevance of a candidate record when the candidate record contains a token equivalent to the query token in the respective domain i. In other embodiments, Agreement Weight W_A is added to the likelihood of relevance of a candidate record when the candidate record contains a token equivalent to the query token in the respective field i. In other embodiments, i represents another level of specificity of location of data.
  • Disagreement Weight W_D is the log of the ratio of one minus the accuracy m_i to one minus the selectivity u_i.
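  • The equations defining selectivity and the Agreement Weight do not survive in this text. A reconstruction under the standard Fellegi-Sunter formulation of record linkage, which is an assumption here rather than a quotation, would read as follows. The last line restates the Disagreement Weight sentence above, and the base of the logarithm is left unspecified, as in the text.

```latex
\begin{align*}
m_i &= P(\text{fields agree in } i \mid M) \\
u_i &= P(\text{fields agree in } i \mid U) \\
W_A &= \log \frac{m_i}{u_i} \\
W_D &= \log \frac{1 - m_i}{1 - u_i}
\end{align*}
```

  • When m_i exceeds u_i, this W_D is negative, so a weight that is "subtracted" is probably best read as a penalty equal to its magnitude; that reading is an interpretation, not something the text states.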
  • Disagreement Weight W_D is subtracted from the likelihood of relevance of a candidate record when the candidate record does not contain a token equivalent to the query token in the respective domain i. In other embodiments, Disagreement Weight W_D is subtracted from the likelihood of relevance of a candidate record when the candidate record does not contain a token equivalent to the query token in the respective field i. In other embodiments, i represents another level of specificity of location of data. In some embodiments, Adjacency Weight is added to the likelihood of relevance weight of a candidate record if the candidate record contains more than one token equivalent to more than one query token and the relevant record tokens are immediately adjacent to each other.
  • Semi-Adjacency Weight is added to the likelihood of relevance weight of a candidate record if the candidate record contains more than one token equivalent to more than one query token and the relevant record tokens are located near each other. In one embodiment, Semi-Adjacency Weight is added if search tokens are separated by one intervening token. In other embodiments, Semi-Adjacency Weight is added if search tokens are separated by more than one intervening token. In one embodiment, the Adjacency and Semi-Adjacency Weights are a factor of the weights of the relevant search tokens. Various weighting schemes for nearness are available.
  • the likelihood of relevance of a candidate record is calculated by summing the Agreement Weight W_A, the Adjacency Weight, and the Semi-Adjacency Weight of all the record tokens in the candidate record with respect to the query tokens.
  • Semi-Adjacency Weight is added only when there is one intervening token between the record tokens in the candidate record that are equivalent to query tokens.
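  • The summation of Agreement, Adjacency, and Semi-Adjacency Weights can be sketched as below. Several details are assumptions: the Agreement Weight is taken as log2(m/u), the nearness bonuses are fixed fractions (`adj_factor`, `semi_factor`) of the two tokens' combined weights, only the first occurrence of a token in a record is considered, and nearness is checked only for consecutive query tokens appearing in query order.

```python
import math


def agreement_weight(m, u):
    """Agreement weight as log2(m / u); the base and the form are assumptions."""
    return math.log2(m / u)


def relevance(query_tokens, record_tokens, weights, adj_factor=0.5, semi_factor=0.25):
    """Sum agreement weights plus adjacency and semi-adjacency bonuses."""
    position = {}
    for index, token in enumerate(record_tokens):
        position.setdefault(token, index)  # keep the first occurrence only
    score = sum(weights[t] for t in set(query_tokens) if t in position)
    for a, b in zip(query_tokens, query_tokens[1:]):
        if a in position and b in position:
            gap = position[b] - position[a]
            pair_weight = weights[a] + weights[b]
            if gap == 1:    # immediately adjacent, in query order
                score += adj_factor * pair_weight
            elif gap == 2:  # exactly one intervening token
                score += semi_factor * pair_weight
    return score


# Hypothetical agreement weights for two query tokens.
weights = {"JOHN": 2.0, "SMITH": 4.0}
```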
  • the fourth step in the process of accessing the reference data illustrated in FIG. 4 is to compare (STEP 130) the calculated relevance to a threshold.
  • the weight of each accessed record is compared to one or more thresholds.
  • the candidate records are ordered by their likelihood of relevance weight so that weights for the set of accessed records are more efficiently compared to one or more thresholds.
  • the weight is compared to a continuation threshold.
  • the search is terminated if the continuation threshold is exceeded. At that point, all accessed records are output.
  • failure to exceed the continuation threshold will trigger (STEP 140) a different search.
  • the token that was used as the basis for the previous search is eliminated from the set of available search tokens.
  • the first step in the new search is to select a different token with which to access reference data. In such an embodiment, if the most selective token has already been used to access data, the second most selective token is used in the subsequent search. The process is repeated until the continuation threshold is exceeded or all search tokens have been used to access data.
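  • The repeated search driven by the continuation threshold can be sketched as a loop over search tokens in order of selectivity. The function name `search`, the `score` callback, and the choice to compare the highest weight among all accessed records to the threshold are assumptions.

```python
def search(search_tokens, frequency, token_table, score, continuation_threshold):
    """Access records token by token until the continuation threshold is exceeded."""
    # Most selective (lowest frequency) token first; unknown tokens are skipped.
    ordered = sorted((t for t in search_tokens if t in frequency),
                     key=lambda t: frequency[t])
    accessed = {}
    for token in ordered:  # each token is used once, then dropped from the set
        for record_id in token_table.get(token, []):
            if record_id not in accessed:
                accessed[record_id] = score(record_id)
        if accessed and max(accessed.values()) > continuation_threshold:
            break  # good enough: stop and output everything accessed so far
    return accessed


# Hypothetical index data and precomputed relevance weights.
frequency = {"A": 1, "B": 2}
token_table = {"A": [1], "B": [2, 3]}
scores = {1: 2.0, 2: 9.0, 3: 1.0}
```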
  • the weight of accessed records is compared to a presentation threshold. In such an embodiment, a portion of the accessed records are output. In embodiments using a presentation threshold, the output records are limited to those records whose likelihood of relevance weight exceeds the presentation threshold.
  • a highest possible likelihood of relevance weight is calculated for each query. The highest possible likelihood of relevance weight depends on the weighting scheme that is selected. In some embodiments, the developer chooses to have additional tokens reduce the weight of a candidate record. For example, in embodiments that use only Agreement Weight W_A, the highest possible likelihood of relevance weight is the weight a candidate record would have if it included every query token in the respective domain. For another example, in embodiments that use Agreement Weight W_A and Adjacency Weight, the highest possible likelihood of relevance weight is the weight a candidate record would have if it included every query token in the respective domain and in the query arrangement.
  • the continuation threshold weight used as a basis for terminating a search is a percentage of the highest possible weight. In other embodiments, the continuation threshold weight is an absolute weight. In some embodiments, the presentation threshold weight used as a criterion for presenting a record accessed in a search is a percentage of the highest possible weight. In other embodiments, the presentation threshold weight is an absolute weight.
  • the accessed records are ordered for output by likelihood of relevance weight. In other embodiments, the accessed records are output in the order in which they are retrieved. In still other embodiments, the accessed records are output in another order.
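  • A presentation threshold expressed as a percentage of the highest possible weight, with output ordered by weight, can be sketched as follows. The sketch assumes the embodiment that uses only the Agreement Weight, so the highest possible weight is the sum of the query tokens' weights; the function names and sample values are hypothetical.

```python
def highest_possible_weight(query_tokens, weights):
    """Weight of a record containing every query token (Agreement Weight only)."""
    return sum(weights[token] for token in set(query_tokens))


def present(accessed, highest, presentation_pct):
    """Keep records above the threshold, ordered by descending relevance weight."""
    cutoff = highest * presentation_pct
    kept = [(record_id, w) for record_id, w in accessed.items() if w > cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights and accessed records; the threshold is 80% of the maximum.
weights = {"JOHN": 2.0, "SMITH": 4.0}
highest = highest_possible_weight(["JOHN", "SMITH"], weights)
output = present({1: 5.5, 2: 3.0, 3: 6.0}, highest, 0.8)
```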
  • Some embodiments include a step in the database accessing process not shown in the embodiment of FIG. 4. In this step, the amount of information accessed is compared to an overflow threshold. If the overflow threshold is exceeded in such embodiments, the current search is terminated. The memory or buffer is cleared. In one such embodiment, a new search is triggered. The new search is based on all search tokens connected together with a Boolean AND. If the overflow threshold triggers a new search, the continuation threshold is then disabled.
  • the records accessed in the new search are handled the same as in a regular search.
  • the overflow threshold used as a basis for terminating a search and triggering a different search is a software error or warning regarding available memory space or buffer space.
  • the developer elects to have a search based on all search tokens connected together with a Boolean AND run for each query.
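  • The overflow fallback can be sketched as follows. For brevity the regular search here simply accumulates the records for each token and omits relevance scoring and the continuation threshold; the overflow limit is modeled as a record count, and the function name `search_with_overflow` is hypothetical.

```python
def search_with_overflow(search_tokens, token_table, overflow_limit):
    """Accumulate records per token; on overflow, rerun as a Boolean AND search."""
    accessed = set()
    for token in search_tokens:
        accessed |= set(token_table.get(token, ()))
        if len(accessed) > overflow_limit:
            # Overflow: discard the buffer and keep only records that contain
            # every search token (Boolean AND as an intersection of pointer sets).
            pointer_sets = [set(token_table.get(t, ())) for t in search_tokens]
            return set.intersection(*pointer_sets)
    return accessed


# Hypothetical token table.
token_table = {"A": [1, 2, 3], "B": [2, 3, 4]}
```

  • The AND search can only shrink the result set, which is why it is a reasonable response to running out of buffer space.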

Abstract

The invention relates to a method and apparatus for retrieving information from an electronic database based on a probabilistic approach and query processing. In one aspect, the records of a database are parsed into tokens using a pattern action language before an index of the records is created. In another aspect, a table of index tokens is created, which table includes a frequency of occurrence in the database for each index token, and each index token comprises a phonetic equivalent of a respective record token. In one aspect, a query is parsed into query tokens using a pattern action language, a search token is generated from a query token, and the search token is used to access records in the database. In another aspect, a search token comprises a phonetic equivalent of a query token or of a token qualified as similar to a query token, and the search token is used to access records in the database. Qualification of a token as similar to a query token is based on a comparison of the query token with a database dictionary using an information-theoretic algorithm. In a further aspect, a selected token is used to access records in the database, a likelihood of relevance to the query is calculated for each record, and the highest of these likelihoods is compared with a continuation threshold. If the threshold is exceeded, no further records are accessed and the records already accessed are output. If the threshold is not exceeded, the selected search token is eliminated from the set of available search tokens and a new token is selected to access records in the database.
PCT/US2001/006447 2000-02-28 2001-02-28 Probabilistic matching engine WO2001065416A2 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU2001243337A AU2001243337A1 (en) 2000-02-28 2001-02-28 Probabilistic matching engine
CA002401170A CA2401170A1 (fr) 2000-02-28 2001-02-28 Probabilistic matching engine
JP2001564037A JP2004506960A (ja) 2000-02-28 2001-02-28 Probabilistic matching engine

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US51474300A 2000-02-28 2000-02-28
US09/514,743 2000-02-28

Publications (2)

Publication Number Publication Date
WO2001065416A2 2001-09-07
WO2001065416A3 2003-12-31

Family

ID=24048505

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/006447 WO2001065416A2 (fr) 2000-02-28 2001-02-28 Probabilistic matching engine

Country Status (4)

Country Link
JP (1) JP2004506960A (fr)
AU (1) AU2001243337A1 (fr)
CA (1) CA2401170A1 (fr)
WO (1) WO2001065416A2 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254596B (zh) * 2021-06-22 2021-10-08 湖南大学 User quality-inspection requirement classification method and system based on rule matching and deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0304191A2 (fr) * 1987-08-14 1989-02-22 International Business Machines Corporation Text search system
US5687384A (en) * 1993-12-28 1997-11-11 Fujitsu Limited Parsing system
US5774888A (en) * 1996-12-30 1998-06-30 Intel Corporation Method for characterizing a document set using evaluation surrogates
US5937422A (en) * 1997-04-15 1999-08-10 The United States Of America As Represented By The National Security Agency Automatically generating a topic description for text and searching and sorting text by topic using the same

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1546940A1 (fr) * 2002-09-04 2005-06-29 Neural Technologies Ltd. Method for detecting proximity data
EP1546940A4 (fr) * 2002-09-04 2006-03-08 Neural Technologies Ltd Method for detecting proximity data
US7805438B2 (en) 2006-07-31 2010-09-28 Microsoft Corporation Learning a document ranking function using fidelity-based error measurements
US8583415B2 (en) 2007-06-29 2013-11-12 Microsoft Corporation Phonetic search using normalized string

Also Published As

Publication number Publication date
AU2001243337A1 (en) 2001-09-12
JP2004506960A (ja) 2004-03-04
WO2001065416A3 (fr) 2003-12-31
CA2401170A1 (fr) 2001-09-07

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase in:

Ref country code: JP

Ref document number: 2001 564037

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 2401170

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2001916296

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW Wipo information: withdrawn in national office

Ref document number: 2001916296

Country of ref document: EP

122 Ep: pct application non-entry in european phase