EP1138007A1 - System and method for finding near matches among records in databases - Google Patents

System and method for finding near matches among records in databases

Info

Publication number
EP1138007A1
EP1138007A1 EP99966015A EP99966015A EP1138007A1 EP 1138007 A1 EP1138007 A1 EP 1138007A1 EP 99966015 A EP99966015 A EP 99966015A EP 99966015 A EP99966015 A EP 99966015A EP 1138007 A1 EP1138007 A1 EP 1138007A1
Authority
EP
European Patent Office
Prior art keywords
record
records
data store
identifiers
creating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99966015A
Other languages
German (de)
English (en)
Inventor
David Whipple
Joseph Carsanaro
Ken Young
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bloodhound Software Inc
Original Assignee
Bloodhound Software Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bloodhound Software Inc
Publication of EP1138007A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24554Unary operations; Data partitioning operations
    • G06F16/24556Aggregation; Duplicate elimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • This invention generally relates to computer systems and the locating of records in databases and datastores. More specifically, this invention relates to a system and method for identifying near matches among records in a datastore based upon discriminant analysis.
  • the invention addresses the problem of having duplicate and near duplicate records in database files, data marts, data warehouses or any data file.
  • the duplication of information is difficult to find and can lead to wasted time and money.
  • Processing duplicate claims, expense payments or other duplicate records can lead to cost overruns, customer service problems, inefficient processing time, manual intervention into automated systems, and wasted disk storage on computer systems.
  • Unsynchronized data over multiple environments can lead to data duplicates, data replication and other data management problems.
  • the inability to locate a near match in Internet searches can lead to lost sales opportunities, poor customer service and lost revenue.
  • the present invention is a system and method for finding near-matches among records in one or more databases.
  • the system is for identifying near matches between records in a data store and a selected record having an associated coordinate set, and includes a data store for storing the records and a processor.
  • the processor of the system performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
  • the present invention provides a computer-readable storage device containing instructions that upon execution cause a processor to identify near matches between records in a data store and a selected record having an associated coordinate set.
  • the device preferably performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
  • the present inventive system accordingly provides a method for identifying near matches between records in a data store and a selected record that has an associated coordinate set.
  • the method includes steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
  • the identifier associated with each record in the data store preferably comprises one or more characters.
  • the method preferably further includes the step of determining a set of records from the retrieved records that match the selected record.
  • the step of determining a set of records from the retrieved records that match the selected record is preferably screening the coordinate sets associated with the retrieved records for sets within a second predetermined distance in the discriminant space from the coordinate set associated with the selected record.
  • the method further includes the step of extracting, deleting, or otherwise modifying the determined set of records.
  • the method also preferably includes the step of acquiring a mapping template, and the step of acquiring is preferably creating or receiving a mapping template.
  • the method then preferably includes the step of refining the acquired mapping template.
  • the method can include the step of selecting an identifier format for use in the step of creating one or more identifiers, and then further include the step of acquiring one or more mapping templates.
  • the step of selecting an identifier format is then preferably evaluating the acquired one or more mapping templates.
  • the method further preferably includes the steps of receiving the selected record, creating an identifier associated with the selected record, and mapping the identifier associated with the selected record into a coordinate set in the discriminant space associated with the selected record. Additionally, the method can further include the step of retrieving the coordinate set associated with the selected record from the data store.
  • near matches to a selected record that has an associated coordinate set are identified among records in a data store.
  • An identifier, preferably a keystring, is created for each record in the database. Each such identifier is then mapped into a set of coordinates in discriminant space.
  • Records in the data store that are near matches to the selected record are retrieved by collecting all records with an associated coordinate set within a predetermined distance in discriminant space from the coordinate set associated with the selected record.
  • identical records may be identified by further selecting records from retrieved records or by setting the predetermined distance at a threshold guaranteeing only identical records are retrieved.
  • the records retrieved as near matches and/or identical matches may be automatically deleted from the data store as duplicative or may be outputted to an appropriate output device for further automated or manual processing.
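The retrieval and screening steps summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes Euclidean distance in the discriminant space (the source does not name a metric), and the function names `retrieve_near_matches` and `screen_matches`, the record ids, and the sample coordinates are all hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two coordinate sets (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_near_matches(store, selected, first_distance):
    """Return ids of records whose coordinates lie within first_distance."""
    return [rid for rid, coords in store.items()
            if distance(coords, selected) <= first_distance]

def screen_matches(store, candidates, selected, second_distance):
    """Screen the retrieved records with a second, tighter distance."""
    return [rid for rid in candidates
            if distance(store[rid], selected) <= second_distance]

# Hypothetical data store: record id -> coordinate set in discriminant space.
store = {
    "r1": (11, 25, 3, 9, 22, 5),
    "r2": (11, 25, 3, 9, 22, 5),   # same coordinates as r1
    "r3": (11, 25, 3, 10, 22, 5),  # one coordinate differs by 1
    "r4": (1, 2, 20, 30, 4, 17),   # far from the others
}
selected = (11, 25, 3, 9, 22, 5)

near = retrieve_near_matches(store, selected, first_distance=2.0)
identical = screen_matches(store, near, selected, second_distance=0.0)
```

Setting the second distance to zero keeps only records with identical coordinate sets, which corresponds to the threshold option mentioned above for isolating exact duplicates.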
  • the system and method for finding near matches among records in databases accordingly has industrial applicability in that the invention can be installed and practiced on existing computer systems to increase searching efficiency. Moreover, the inventive system can be created during the manufacture of computer systems having database record searching as a significant component of the system functionality.
  • the present invention therefore has a commercial advantage in that it defines a multi-dimensional indexing scheme that performs efficient database searching.
  • In a multi-dimensional indexing scheme, the likelihood of finding mismatched characters is not dependent on the position of the character in the keystring. Tests suggest this method is more efficient at finding sims, identifying from 50% to 100% more sims than conventional methods known in the art. Consequently, sims are more likely to be identified through use of the present invention than through use of existing linear indexing systems.
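To illustrate why the position of a mismatched character matters for a linear index but not for a multi-dimensional one, the sketch below compares a sorted key list with a distance search. The template used here (each letter's index in QWERTY reading order) is a hypothetical stand-in and is not the template of Fig. 3; Euclidean distance is likewise an assumption.

```python
import math

# Hypothetical template: each letter's index in QWERTY reading order.
# This is NOT the Fig. 3 template; it only illustrates the idea.
QWERTY_ORDER = "QWERTYUIOPASDFGHJKLZXCVBNM"
TEMPLATE = {ch: i for i, ch in enumerate(QWERTY_ORDER)}

def to_coords(keystring):
    """Map each character of the keystring to its template coordinate."""
    return [TEMPLATE[ch] for ch in keystring]

def distance(a, b):
    """Euclidean distance between two keystrings in the assumed sim-space."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y in zip(to_coords(a), to_coords(b))))

keys = sorted(["CLARKE", "CLOVER", "MILLER", "WALKER", "XLARKE"])
target = "CLARKE"

# Linear index: the neighbor is whatever sorts next to the target.
linear_neighbor = keys[keys.index(target) + 1]

# Multi-dimensional index: the neighbor is the closest point in sim-space.
sim_neighbor = min((k for k in keys if k != target),
                   key=lambda k: distance(target, k))

# A typo in the first position and one in the last position are equally close.
first_pos_typo = distance("CLARKE", "XLARKE")
last_pos_typo = distance("CLARKE", "CLARKW")
```

In the sorted list, a first-character typo sends "XLARKE" to the far end of the index, while in the coordinate space it remains the nearest neighbor of "CLARKE".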
  • FIG. 1 is a flowchart illustrating the mapping process of the present invention, specifically illustrating the record, key, and coordinate templates of the process.
  • FIG. 2 is a flowchart illustrating the process of detecting duplicate entries, and specifically illustrating the template definition, key optimization, and record location processes.
  • FIG. 3 is an illustration of a defined template created from the QWERTY keyboard wherein the likelihood of errors is assumed to occur from typing errors due to key proximity.
  • FIG. 4a is a flowchart illustrating base template generation.
  • FIG. 4b is a flowchart illustrating the optimization of the base template generated from the process of Fig. 4a.
  • FIG. 5 is a flowchart illustrating the process of key creation.
  • FIG. 6 is a representative diagram of a host computer environment in connection with other computers and databases through a local area network (LAN) and through the Internet.
  • Fig. 1 is a flowchart illustrating the present inventive method for finding near-matches among records in one or more databases.
  • the present invention is a result of extending and enhancing the concepts of multiple discriminant analysis to locate each record in discriminant space (hereafter referred to as sim-space).
  • the system selects a record 12 in one or more databases to dimension, shown at step 10, and applies a key to the record 12, shown at step 14, to create a keystring 16.
  • Each character position in the keystring, such as keystring 16, preferably represents a dimension of the record, although other aspects of this invention can use each character position to represent two or more dimensions.
  • the system determines the coordinates for each character in the keystring through a selected template, shown at step 18, such as template 20.
  • Pre-determined "templates," such as template 20, define the actual coordinates of a record in sim-space. These templates provide a conformal mapping for each character in the keystring to a coordinate (or coordinates) in sim-space.
  • Although shown here as ASCII characters, other characters that can be used in the present invention include the full English alphabet, numbers, words, and special characters such as ñ (Spanish) or the language-specific letters of German and Norwegian, or can consist of entire non-Roman alphabets such as Greek, Russian, Arabic, Hebrew, Chinese, or Japanese characters.
  • Multidimensional indexing finds application in two main areas: first, in searching an indexed list for a near match of a given record, and second, in the detection of groups of similar records in lists. Finding a possible match for a given record in a previously indexed database can therefore be achieved by the following steps: generating the keystring for the record to be matched, determining the location in sim-space by applying a template to the keystring, and searching the location's neighborhood, i.e. one or more databases or datastores, for sims by applying a nearest neighbor algorithm.
  • Identifying sims within a previously indexed database can be achieved by the following steps performed on the system: randomly selecting a keystring in sim-space, pulling a pre-determined number of nearest neighbors, and checking all possible pairs for sims within the set of neighbors.
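The group-detection steps just listed (pick a random point, pull its nearest neighbors, check all pairs) might look like the sketch below. It is illustrative only: a brute-force sort stands in for a real nearest-neighbor index, Euclidean distance is assumed, and the name `find_sims`, the threshold, and the sample points are hypothetical.

```python
import itertools
import math
import random

def distance(a, b):
    """Euclidean distance between two coordinate sets (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_sims(points, k, threshold, rng):
    """Pick a random point, pull its k nearest neighbors, test all pairs."""
    seed_id = rng.choice(sorted(points))
    # Brute-force neighbor search; the seed itself is included (distance 0).
    by_distance = sorted(points,
                         key=lambda pid: distance(points[pid], points[seed_id]))
    neighborhood = by_distance[:k + 1]
    # Check every pair inside the neighborhood for similarity.
    return [(a, b)
            for a, b in itertools.combinations(sorted(neighborhood), 2)
            if distance(points[a], points[b]) <= threshold]

# Hypothetical indexed records: id -> coordinates in sim-space.
points = {"a": (1, 1), "b": (1, 2), "c": (9, 9), "d": (9, 9)}
sims = find_sims(points, k=1, threshold=1.0, rng=random.Random(0))
```

Repeating the call with different random seeds would, over time, visit different neighborhoods of the indexed data.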
  • in the template defining process, templates are defined so as to assign characters that are commonly substituted for one another in error to nearby coordinates.
  • the system thus creates mapping templates, as shown at step 24.
  • the key optimization process creates a keystring 16 for each record 12 in the database or datastore.
  • the system creates a key to apply to the particular dataset under review in the program, shown at step 26, wherein the dataset can be one or more databases or data stores, and then the system creates a keystring 16 for each record in the dataset, shown at step 28.
  • Each character position in the keystring is given an evaluation based on its ability to discriminate between records. The evaluation nominally lies between 0 and 1 (preferably expressed as a percentage) with 1 being the best discrimination (desired).
  • $PE = (n - \sum a) / n$.
  • Another method is to calculate the standard deviation of the coordinates for a given character position.
  • Such a method has a disadvantage in that there is no easily defined "best" position evaluation. It should be noted that in the creation of a key, the order of the characters in the keystring is irrelevant for the present invention, unlike keys created for conventional indexing methods.
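The standard-deviation approach to evaluating a character position can be sketched as below. This is one reading of the text, under the assumption that each keystring has already been mapped to a coordinate set of equal length; the function name `position_spread` and the sample coordinates are hypothetical.

```python
import statistics

def position_spread(coordinate_sets):
    """Population standard deviation of the coordinates at each position."""
    # zip(*...) regroups the data so each column is one character position.
    return [statistics.pstdev(column) for column in zip(*coordinate_sets)]

# Hypothetical coordinate sets for three records (three positions each).
coords = [
    (11, 25, 3),
    (11, 2, 9),
    (11, 14, 22),
]
spread = position_spread(coords)
```

A position whose coordinates never vary (the first one here) has zero spread and so cannot separate records, while a larger spread suggests more discriminating power. As the text notes, this measure has no natural upper bound marking a "best" value.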
  • in the record location process (RLP), the system takes a template generated with a keystring 16 and locates matching records for the template, which represent near matches to the exact keystring generated from the record.
  • the system maps the keystring into sim-space using the specified template, shown at step 30, and then the system examines the neighborhood, or database/datastore, for matches to the templated keystring, shown at step 32. Locating a record requires creating the keystring for that record and then determining the coordinates for that record using a template. Any number of processes may be utilized to make a key as are known in the art.
  • One preferred method of creating a mapping template that assigns characters which are commonly substituted erroneously is simply to "stretch" the QWERTY keyboard under the assumption that errors are commonly produced by typing a letter adjacent to the desired letter on the keyboard. This extrapolation of the QWERTY keyboard creates the template shown in Fig. 3.
  • Fig. 3 illustrates a template having a section 36 for the ASCII letter characters 38 with the template coordinates 40, and a section 42 for ASCII numbers 44 with corresponding template coordinates 46.
  • determining the coordinates of the record is a simple matter of substitution. For example, using the QWERTY template in Fig. 3, the keystring "CLARKE" would have coordinates of {11, 25, 3, 9, 22, 5}.
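The substitution step can be sketched as a dictionary lookup. Only the six letter coordinates implied by the "CLARKE" example are taken from the text; the full Fig. 3 template is not reproduced here, and the names `PARTIAL_TEMPLATE` and `map_keystring` are hypothetical.

```python
# Only the six letters of the "CLARKE" example are known from the text;
# the remaining entries of the Fig. 3 template are not reproduced here.
PARTIAL_TEMPLATE = {"C": 11, "L": 25, "A": 3, "R": 9, "K": 22, "E": 5}

def map_keystring(keystring, template):
    """Substitute each character with its template coordinate."""
    missing = [ch for ch in keystring if ch not in template]
    if missing:
        raise KeyError(f"no template coordinate for: {missing}")
    return [template[ch] for ch in keystring]

coords = map_keystring("CLARKE", PARTIAL_TEMPLATE)
```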
  • a template evaluation function is defined as the sum, over each pair of characters, of the error frequency (f) divided by the coordinate distance between the pair.
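A minimal sketch of such an evaluation function follows, assuming a template with one coordinate per character and absolute difference as the coordinate distance. The error frequencies are invented counts for illustration, and the name `evaluate_template` is hypothetical.

```python
def evaluate_template(template, error_freq):
    """Sum of error frequency divided by coordinate distance, over pairs."""
    score = 0.0
    for (a, b), f in error_freq.items():
        d = abs(template[a] - template[b])
        if d == 0:
            raise ValueError(f"{a!r} and {b!r} share a coordinate")
        score += f / d
    return score

# Hypothetical counts of how often one character is typed for another.
error_freq = {("C", "X"): 8, ("C", "P"): 2}

# Frequently confused characters placed close together score higher.
close_template = {"C": 1, "X": 2, "P": 5}
far_template = {"C": 1, "P": 2, "X": 5}

close_score = evaluate_template(close_template, error_freq)  # 8/1 + 2/4
far_score = evaluate_template(far_template, error_freq)      # 8/4 + 2/1
```

Because the distance is in the denominator, a template earns a higher score when commonly confused characters sit at nearby coordinates.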
  • This step, shown at step 52, is repeated a pre-determined number of times, and the set with the best (highest) score is saved and becomes the basis for the next step.
  • This set of coordinates is then "shuffled" by switching the coordinates for a randomly selected pair of characters that lie within a variable coordinate distance 'm.' A comparison is then made to determine if the switch produces a better evaluation, shown at decision 54. If the switch has made a better evaluation, the new set is saved and becomes the basis for continuing optimization, as shown at step 56, and a decision is made, decision 58, as to whether the process has been repeated a sufficient number of times. If there is no improvement after a pre-determined number of switches, i.e. the score is not greater than the previous highest score, m is decreased by 1 and a decision is again made as to whether the process has been repeated the requisite number of times, decision 58, and step 52 is repeated.
  • Another variable is defined that represents a distance metric within the coordinate system, shown here as having an initial value 'm', as shown at step 60.
  • the value 'm' is initially chosen such that it completely includes all the set members in the current template configuration.
  • a pair of characters in the co-ordinate space lying within 'm' units of each other are randomly selected, shown at step 62, and their coordinates are switched.
  • the template evaluation function is then applied, shown at step 64, and the resulting score is compared to that of the current optimal template, shown at decision 66. Should this template configuration yield a higher score, it is flagged as the new optimal template and set as the current template, shown at step 68.
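The optimization loop described above can be sketched as a simple hill climber. This is an illustrative reading, not the exact flow of Fig. 4b: it assumes one coordinate per character, absolute difference as distance, and a stopping rule where 'm' shrinks to zero. The names `hill_climb` and `tries_per_m` and the sample data are hypothetical.

```python
import random

def evaluate(template, error_freq):
    """Sum of error frequency / coordinate distance over character pairs."""
    return sum(f / abs(template[a] - template[b])
               for (a, b), f in error_freq.items())

def hill_climb(template, error_freq, m, tries_per_m, rng):
    """Swap coordinates of random pairs within m units; keep improvements."""
    best = dict(template)
    best_score = evaluate(best, error_freq)
    chars = sorted(best)
    while m >= 1:
        improved = False
        for _ in range(tries_per_m):
            a, b = rng.sample(chars, 2)
            if abs(best[a] - best[b]) > m:
                continue  # only switch pairs lying within m units
            trial = dict(best)
            trial[a], trial[b] = trial[b], trial[a]
            score = evaluate(trial, error_freq)
            if score > best_score:
                best, best_score, improved = trial, score, True
        if not improved:
            m -= 1  # no better template found: narrow the switch distance
    return best, best_score

# Hypothetical starting template and error counts.
start = {"C": 1, "P": 2, "X": 5}
error_freq = {("C", "X"): 8, ("C", "P"): 2}
final_template, final_score = hill_climb(
    start, error_freq, m=4, tries_per_m=50, rng=random.Random(0))
```

Swapping coordinates, rather than assigning new ones, keeps the set of coordinates fixed and distinct, so the evaluation never divides by zero.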
  • Each template should be designed to be independent of previously created templates. This can be accomplished by setting the error frequency of pairs of characters that have adjacent coordinates to zero and running the hill-climber algorithm again. Given a set of templates, these templates can be used to evaluate the method used to create a keystring.
  • the preferred process of key creation is illustrated.
  • the system selects fields of the dataset, such as the local database, that provide a suitable level of discrimination, shown at step 78, and then any textual attributes to be neglected, such as vowels, numbers, punctuation and spaces, are specified and preferably applied, shown at step 80.
  • any logical field groupings are identified, shown at step 82, and any source field substitutions are specified, shown at step 84, should the source field be blank.
  • all field weightings are specified, shown at step 86, and all composite fields that can be analyzed by subfield partitioning, such as addresses, are identified, shown at step 88.
  • a decision is then made, decision 90, as to whether the key test results show a high level of discrimination. If the key test does show a high level of discrimination, then the process ends. If the key test does not show a high level of discrimination, then the process is begun anew, with new dataset fields again selected at step 78.
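Part of the key creation process can be sketched as below, covering field selection, neglected attributes, and source field substitution for blank fields. Field weighting and subfield partitioning are left out, and the field names, the `fallbacks` parameter, and the sample record are hypothetical.

```python
VOWELS = set("AEIOU")

def make_keystring(record, fields, fallbacks=None,
                   drop_vowels=True, drop_digits=False):
    """Build a keystring from selected fields, neglecting chosen attributes."""
    fallbacks = fallbacks or {}

    def keep(ch):
        if ch.isdigit():
            return not drop_digits
        # Punctuation and spaces are never alphabetic, so they are dropped.
        return ch.isalpha() and not (drop_vowels and ch in VOWELS)

    parts = []
    for field in fields:
        value = record.get(field, "")
        if not value and field in fallbacks:
            # Source field substitution when the selected field is blank.
            value = record.get(fallbacks[field], "")
        parts.append("".join(ch for ch in value.upper() if keep(ch)))
    return "".join(parts)

# Hypothetical record with a blank first name.
record = {"last_name": "Clarke", "first_name": "",
          "nickname": "J.R.", "zip": "30301"}
key = make_keystring(record,
                     fields=["last_name", "first_name", "zip"],
                     fallbacks={"first_name": "nickname"})
```

In the process described above, a key built this way would then be tested for discrimination, and the field selection revisited if the result is poor.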
  • Fig. 6 illustrates a host computer environment 92 comprised of a host computer 94 having a local memory 98 and a central processing unit (CPU) 96.
  • the host computer environment 92 is thus a system for identifying near matches between records in a data store, such as local memory 98, or a directly connected database 100, an example of which is a hard disk for the host computer 94.
  • the CPU 96 of the host computer 94 preferably performs the steps of: creating one or more identifiers, such as a keystring 16, wherein each identifier is associated with a record in the data store; mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, such as creating the template 20; and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record, as set forth above.
  • the present invention provides a computer-readable storage device, such as memory 98 or local database 100, containing instructions that upon execution cause a processor (CPU 96) to identify near matches between records in a data store (e.g. local database 100) and a selected record having an associated coordinate set.
  • the device preferably performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, such as a keystring 16, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, such as with template 20, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
  • the host computer environment 92 and host computer 94 can be connected to any manner of computer or database and perform the method of finding near matches therein.
  • host computer 94 is in direct connection with another computer 102 having a database 104, and the CPU 96 can access the data either resident on the directly connected computer 102, or the other database 104.
  • the host computer environment 92 can be connected to a local area network (LAN) 106 as are common in the art, and through the LAN 106, the host computer 94 can be in communication with one or more networked computers 108, each of which can have an attached database 110 that is accessible by the host computer 94.
  • the host computer 94 can also be in communication through the LAN 106 with one or more networked databases 112, and can perform the record searching upon the data therein.
  • the host computer environment 92 can either directly, or through the LAN 106 as shown in Fig. 6, be connected to the Internet 114, or other wide area network
  • the host computer 94 can thereby access one or more databases 116 in communication with the Internet 114, and can also access other computers 118 in communication with the Internet 114 and any databases 120 accessible to the other computers 118 on the Internet 114.
  • the present inventive system can therefore be used in any environment having a processor and a datastore as are known in the art, and is not to be limited to the host computer environment 92 and connective environments disclosed in Fig. 6.
  • the present inventive system accordingly provides a method for identifying near matches between records in a data store and a selected record that has an associated coordinate set.
  • the method includes steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, such as keystrings 16, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
  • the identifier associated with each record in the data store preferably comprises one or more characters, such as keystring 16.
  • the method preferably further includes the step of determining a set of records from the retrieved records that match the selected record.
  • the step of determining a set of records from the retrieved records that match the selected record is preferably screening the coordinate sets associated with the retrieved records for sets within a second predetermined distance (such as 'm' in Fig. 4b) in the discriminant space from the coordinate set associated with the selected record. And then the method further includes the step of extracting, deleting, or otherwise modifying the determined set of records.
  • the method also preferably includes the step of acquiring a mapping template, such as the template in Fig. 3, and the step of acquiring is preferably creating or receiving a mapping template.
  • the method then preferably includes the step of refining the acquired mapping template, as shown in the processes of Figs. 4a and 4b.
  • the method can include the step of selecting an identifier format for use in the step of creating one or more identifiers, an example of which is the process of Fig. 5, and then further includes the step of acquiring one or more mapping templates.
  • the step of selecting an identifier format is then preferably evaluating the acquired one or more mapping templates.
  • the method further preferably includes the steps of receiving the selected record, creating an identifier associated with the selected record, and mapping the identifier associated with the selected record into a coordinate set in the discriminant space associated with the selected record, as discussed above. Additionally, the method can further include the step of retrieving the coordinate set associated with the selected record from the datastore(s).
  • the present inventive system and method can be implemented on any category of computer devices including the four main categories of digital computers: supercomputers, mainframe computers, minicomputers and microcomputers.
  • the structures, processes, methods and system as disclosed herein can also be implemented on handheld computers, and Personal Digital Assistant (PDA) and Personal Information Management (PIM) devices including, but not limited to, cellular/mobile phones, Personal Organizers, Windows CE devices or hybrid devices such as a smart phone that may be deployed over a fixed or wireless network.
  • the invention can be implemented on a variety of computing platforms and operating systems.
  • the present invention may be implemented on a standard personal computer (PC) operating under an operating system such as Windows, Windows NT, Unix, Linux, or other operating system.
  • Standard development tools, languages and compilers all can be used to implement the processes described herein, under programming languages and development tools such as Java, C, C++, XML (Extensible Markup Language), Visual Basic, PowerBuilder, and other languages as known in the art.
  • Database files are the preferred file type to implement the invention.
  • the databases can exist alone, in a data warehouse or in a data mart. The databases operated upon by the processes may be created, managed, transformed and/or consolidated using a variety of database systems as are known in the art. These systems include but are not limited to Oracle, Sybase, Informix, Access, SQL, ODBC, Foxpro, XML schema or any other traditional or relational databases and/or database access tools.
  • Typical uses of this invention include locating duplicate records, locating near duplicate records, locating records with similar characteristics, and enhancing search capabilities in a database, data mart, or data warehouse.
  • the invention can also be used in locating duplicate URLs over the Internet and/or locating correct URLs when URLs are misspelled or typed incorrectly.
  • the invention could further enhance Internet search capabilities in locating similar URLs or products on an e-business site. Locating similar products for an e-business site is another use of this invention.
  • the methods and processes of the invention would be able to solve failed searches by providing a list of 'projectors' based on the Internet search for 'projecters.'
  • the present invention further can be applied in locating and extracting duplicate or near duplicate records in a customer or supplier database such as duplicate customer's name and address, customer order, and/or customer payment information.
  • the methods and processes of the invention are not limited to text searches. Such capability can also locate duplicate or near duplicate customers/prospects in a Direct Marketing campaign or Sales Force Automation where there is data consolidation.
  • the methods and processes used in this invention would allow one to compare all similar record sets to determine if duplicate data exists. This will then allow one to extract current customers from the prospected database.
  • a further application of the present invention is locating similar or near duplicate records that are possibly fraudulent in e-commerce applications which are conducted over the Internet 114.
  • E-business fraud can include any electronic credit card or other transactions where similar records are fraudulently used as a unique record. For example, in an e-business that gives benefits for signing up, the present invention can detect new members that sign up multiple times by changing their name slightly.
  • This invention can further synchronize database files.
  • Wireless devices are small and prone to input/data entry errors.
  • Data existing on LAN, WAN, PIM, Internet and Mainframe systems can be out of synchronization and this invention can be used to clean the synchronized data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a system and method for finding near matches among records contained in databases (104, 100, 116, 120, 112, 110) and data stores of computer systems. The system is intended in particular to identify near matches between the records in a data store and a selected record having an associated coordinate set. A processor (96) creates identifiers associated with each record in the data store, maps each identifier into a set of coordinates in a discriminant space associated with each record, and then retrieves all records from the data store whose associated coordinate sets lie within a predetermined distance, in the discriminant space, from the coordinate set associated with the selected record.
EP99966015A 1998-12-07 1999-12-06 System and method for finding near matches among records in databases Withdrawn EP1138007A1 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US11121298P 1998-12-07 1998-12-07
US111212P 1998-12-07
PCT/US1999/028870 WO2000034897A1 (fr) 1998-12-07 1999-12-06 System and method for finding near matches among records in databases

Publications (1)

Publication Number Publication Date
EP1138007A1 (fr) 2001-10-04

Family

ID=22337203

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99966015A Withdrawn EP1138007A1 (fr) 1998-12-07 1999-12-06 System and method for finding near matches among records in databases

Country Status (4)

Country Link
EP (1) EP1138007A1 (fr)
AU (2) AU2166700A (fr)
CA (1) CA2353095A1 (fr)
WO (1) WO2000034897A1 (fr)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8423374B2 (en) * 2002-06-27 2013-04-16 Siebel Systems, Inc. Method and system for processing intelligence information
GB0220576D0 (en) * 2002-09-04 2002-10-09 Neural Technologies Ltd Data proximity detector
US8126738B2 (en) 2006-04-28 2012-02-28 Mdi Technologies, Inc. Method and system for scheduling tracking, adjudicating appointments and claims in a health services environment
US8126739B2 (en) 2006-04-28 2012-02-28 MDI Technologies, Inc Method and system for tracking treatment of patients in a health services environment
US9262475B2 (en) 2012-06-12 2016-02-16 Melissa Data Corp. Systems and methods for matching records using geographic proximity
US9563677B2 (en) * 2012-12-11 2017-02-07 Melissa Data Corp. Systems and methods for clustered matching of records using geographic proximity
CN113595805B (zh) * 2021-08-23 2024-01-30 海南房小云科技有限公司 Personal computer data sharing method for use within a local area network
WO2023063971A1 (fr) * 2021-10-13 2023-04-20 Equifax Inc. Fragmented record detection based on record matching techniques

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5649183A (en) * 1992-12-08 1997-07-15 Microsoft Corporation Method for compressing full text indexes with document identifiers and location offsets
US5465353A (en) * 1994-04-01 1995-11-07 Ricoh Company, Ltd. Image matching and retrieval by multi-access redundant hashing
US6029167A (en) * 1997-07-25 2000-02-22 Claritech Corporation Method and apparatus for retrieving text using document signatures
US6026398A (en) * 1997-10-16 2000-02-15 Imarket, Incorporated System and methods for searching and matching databases

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0034897A1 *

Also Published As

Publication number Publication date
AU6436599A (en) 2000-06-08
WO2000034897A9 (fr) 2001-06-07
AU2166700A (en) 2000-06-26
WO2000034897A1 (fr) 2000-06-15
CA2353095A1 (fr) 2000-06-15

Similar Documents

Publication Publication Date Title
US11704494B2 (en) Discovering a semantic meaning of data fields from profile data of the data fields
Dasu et al. Mining database structure; or, how to build a data quality browser
US6820079B1 (en) Method and apparatus for retrieving text using document signatures
US6678681B1 (en) Information extraction from a database
Burrows et al. Efficient plagiarism detection for large code repositories
Doermann et al. The detection of duplicates in document image databases
US7296011B2 (en) Efficient fuzzy match for evaluating data records
US6934634B1 (en) Address geocoding
US7711719B1 (en) Massive multi-pattern searching
US7565348B1 (en) Determining a document similarity metric
US20110047171A1 (en) Answering web queries using structured data sources
US20070299855A1 (en) Detection of attributes in unstructured data
KR100627195B1 (ko) Method and system for searching electronic documents generated by optical character recognition
EP1934829A2 (fr) Local search
Chen et al. Template detection for large scale search engines
US20130031083A1 (en) Determining keyword for a form page
US7240045B1 (en) Automatic system for configuring to dynamic database search forms
US6691103B1 (en) Method for searching a database, search engine system for searching a database, and method of providing a key table for use by a search engine for a database
WO2000034897A1 (fr) System and method for finding near matches among records in databases
US7370037B2 (en) Methods for processing a text search query in a collection of documents
JP4426041B2 (ja) Information retrieval method using category factors
CN111475464B (zh) Method for automatically discovering and mining Web component fingerprints
CN107291951B (zh) Data processing method, apparatus, storage medium, and processor
WO1998049632A1 (fr) System and associated method for entity-based data retrieval
WO2024064705A1 (fr) Techniques for discovering and updating semantic meaning of data fields

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20010621

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

RIN1 Information on inventor provided before grant (corrected)

Inventor name: YOUNG, KEN

Inventor name: CARSANARO, JOSEPH

Inventor name: WHIPPLE, DAVID

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20030701