WO2000034897A9

WO2000034897A9 - System and method for finding near matches among records in databases

Info

Publication number: WO2000034897A9
Application number: PCT/US1999/028870
Authority: WO
Inventors: David Whipple; Joseph Carsanaro; Ken Young
Original assignee: Bloodhound Software Inc
Priority date: 1998-12-07
Filing date: 1999-12-06
Publication date: 2001-06-07
Also published as: EP1138007A1; AU2166700A; AU6436599A; CA2353095A1; WO2000034897A1

Abstract

The present invention is a system and method for finding near matches among records in databases (104, 100, 116, 120, 112, 110) and data stores in computer systems. The system identifies near matches between records in the data store and a selected record having an associated coordinate set. The processor (96) creates identifiers, each of which is associated with a record in the data store, maps each identifier into a set of coordinates in a discriminant space associated with each record, and retrieves all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

Description

SYSTEM AND METHOD FOR FINDING NEAR MATCHES AMONG RECORDS IN DATABASES

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 60/111,212, filed on December 7, 1998.

BACKGROUND OF INVENTION

1. Field of the Invention

This invention generally relates to computer systems and the locating of records in databases and datastores. More specifically, this invention relates to a system and method for identifying near matches among records in a datastore based upon discriminant analysis.

2. Description of the Related Art

The invention addresses the problem of having duplicate and near duplicate records in database files, data marts, data warehouses or any data file. The duplication of information is difficult to find and can lead to wasted time and money. Processing duplicate claims, expense payments or other duplicate records can lead to cost overruns, customer service problems, inefficient processing time, manual intervention into automated systems, and wasted disk storage on computer systems. Unsynchronized data over multiple environments can lead to data duplicates, data replication and other data management problems. Furthermore, the inability to locate a near match in Internet searches can lead to lost sales opportunities, poor customer service and lost revenue.

Existing systems use standard procedures for indexing records and locating similar ("sims") or duplicate records. These records may then be removed, purged, flagged for future reference, extracted from the data set for viewing, or extracted for use in additional statistical analysis. These procedures incorporate three basic steps: (1) creating a "keystring" for each record, where a keystring is a character string comprised of all or portions of some or all of the fields in a record; (2) sorting the keystrings, which is termed "indexing"; and (3) scanning the sorted list of keystrings for sims.

Conventionally, scanning the sorted list (step 3) consists of a single pass through the list, comparing each successive pair of sorted keystrings to determine some measure of their similarity. Pairs of keystrings that are found to be similar within some pre-defined measure of similarity are flagged, or one of the records is removed. Under this method, only mismatches in the least significant (right-most) character positions will be found.

For example, consider the following (sorted) keystrings:

Keystring 1) Clarke
Keystring 2) Clarys
Keystring 3) Clerke
Keystring 4) Dlarke

"Clarke" and "Clarys" are mismatched in positions 5 and 6. "Clarys" and "Clerke" are likewise mismatched in positions 5 and 6 as well as in position 3.

However, "Clarke" and "Clerke" are mismatched in only position 3. A sequential pass through the list of keystrings looking only at adjacent pairs of keystrings would miss this sim.

"Dlarke" and "Clarke" are also mismatched in only one position, namely position 1, and yet they are even further apart in the list. This ability to locate sims only in the right-most character positions is characteristic of a "linear indexing" scheme as is known in the art.
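
The limitation described above can be illustrated with a minimal Python sketch. It assumes a simple position-by-position mismatch count as the similarity measure (the text does not prescribe one), and the function names `mismatches`, `adjacent_sims`, and `all_pair_sims` are hypothetical. A single pass over adjacent pairs of the sorted keystrings finds no pair differing in at most one position, whereas comparing every pair finds two.

```python
from itertools import combinations


def mismatches(a: str, b: str) -> int:
    """Count the character positions at which two equal-length keystrings differ."""
    return sum(x != y for x, y in zip(a, b))


def adjacent_sims(sorted_keys, max_mismatch=1):
    """Conventional scan: compare only neighbouring entries of the sorted list."""
    return [(a, b) for a, b in zip(sorted_keys, sorted_keys[1:])
            if mismatches(a, b) <= max_mismatch]


def all_pair_sims(keys, max_mismatch=1):
    """Exhaustive comparison of every pair, shown only as a reference result."""
    return [(a, b) for a, b in combinations(keys, 2)
            if mismatches(a, b) <= max_mismatch]


keys = sorted(["Clarke", "Clarys", "Clerke", "Dlarke"])
print(adjacent_sims(keys))   # sims found by the linear scan
print(all_pair_sims(keys))   # sims that actually exist in the list
```

The exhaustive comparison is quadratic in the number of records and is used here only to show which sims the linear scan misses.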

Accordingly, a system and method for sim identification that can perform a more accurate matching of the records in a data store would be advantageous. Thus, it is to the provision of such an improved system and method that the present invention is primarily directed.

SUMMARY OF THE INVENTION

The present invention is a system and method for finding near-matches among records in one or more databases. In one embodiment, the system is for identifying near matches between records in a data store and a selected record having an associated coordinate set, and includes a data store for storing the records and a processor. The processor of the system performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

In another aspect, the present invention provides a computer-readable storage device containing instructions that upon execution cause a processor to identify near matches between records in a data store and a selected record having an associated coordinate set. The device preferably performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

The present inventive system accordingly provides a method for identifying near matches between records in a data store and a selected record that has an associated coordinate set. The method includes steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record. The identifier associated with each record in the data store preferably comprises one or more characters.

The method preferably further includes the step of determining a set of records from the retrieved records that match the selected record. The step of determining a set of records from the retrieved records that match the selected record is preferably screening the coordinate sets associated with the retrieved records for sets within a second predetermined distance in the discriminant space from the coordinate set associated with the selected record. And then the method further includes the step of extracting, deleting, or otherwise modifying the determined set of records.

The method also preferably includes the step of acquiring a mapping template, and the step of acquiring is preferably creating or receiving a mapping template. The method then preferably includes the step of refining the acquired mapping template.

The method can include the step of selecting an identifier format for use in the step of creating one or more identifiers, and then further include the step of acquiring one or more mapping templates. The step of selecting an identifier format is then preferably evaluating the acquired one or more mapping templates.

The method further preferably includes the steps of receiving the selected record, creating an identifier associated with the selected record, and mapping the identifier associated with the selected record into a coordinate set in the discriminant space associated with the selected record. Additionally, the method can further include the step of retrieving the coordinate set associated with the selected record from the data store.

According to the present invention, near matches to a selected record that has an associated coordinate set are identified among records in a data store. An identifier, preferably a keystring, is created for each record in the database. Each such identifier is then mapped into a set of coordinates in discriminant space. Records in the data store that are near matches to the selected record are retrieved by collecting all records with an associated coordinate set within a predetermined distance in discriminant space from the coordinate set associated with the selected record. In a further embodiment, identical records may be identified by further selecting records from retrieved records or by setting the predetermined distance at a threshold guaranteeing only identical records are retrieved. In yet another embodiment, the records retrieved as near matches and/or identical matches may be automatically deleted from the data store as duplicative or may be outputted to an appropriate output device for further automated or manual processing.

The system and method for finding near matches among records in databases accordingly has industrial applicability in that the invention can be installed and practiced on existing computer systems to increase searching efficiency. Moreover, the inventive system can be created during the manufacture of computer systems having database record searching as a significant component of the system functionality.

Furthermore, the present invention has a commercial advantage in that it defines a multi-dimensional indexing scheme that performs efficient database searching. In a multi-dimensional indexing scheme, the likelihood of finding mismatched characters is not dependent on the position of the character in the keystring. Tests suggest this method is more efficient at finding sims, identifying from 50% to 100% more sims than conventional methods known in the art. Consequently, sims are more likely to be identified through use of the present invention than through use of existing linear indexing systems.

The above and other objects and advantages of the present invention will become more readily apparent after review of the hereinafter set forth Brief Description of the Drawings, Detailed Description of the Invention, and Claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart illustrating the mapping process of the present invention, specifically illustrating the record, key, and coordinate templates of the process.

FIG. 2 is a flowchart illustrating the process of detecting duplicate entries, and specifically illustrating the template definition, key optimization, and record location processes.

FIG. 3 is an illustration of a defined template created from the QWERTY keyboard wherein the likelihood of errors is assumed to occur from typing errors due to key proximity.

FIG. 4a is a flowchart illustrating base template generation.

FIG. 4b is a flowchart illustrating the optimization of the base template generated from the process of Fig. 4a.

FIG. 5 is a flowchart illustrating the process of key creation.

FIG. 6 is a representative diagram of a host computer environment in connection with other computers and databases through a local area network (LAN) and through the Internet.

DETAILED DESCRIPTION OF THE INVENTION

Referring to the drawings, in which like numbers indicate like elements throughout the views, Fig. 1 is a flowchart illustrating the present inventive method for finding near-matches among records in one or more databases. The present invention is a result of extending and enhancing the concepts of multiple discriminant analysis to locate each record in discriminant space (hereafter referred to as sim-space). The system selects a record 12 in one or more databases to dimension, shown at step 10, and a key is applied to the record 12, shown at step 14, to create a keystring 16. Each character position in the keystring, such as keystring 16, preferably represents a dimension of the record, although other aspects of this invention can use each character position to represent two or more dimensions. The system then determines the coordinates for each character in the keystring through a selected template, shown at step 18, such as template 20.

Pre-determined "templates," such as template 20, define the actual coordinates of a record in sim-space. These templates provide a conformal mapping for each character in the keystring to a coordinate (or coordinates) in sim-space. Although shown here as ASCII characters, other characters that can be used in the present invention include the full English alphabet, numbers, words, special characters such as ñ (Spanish), ö (German), or ø (Norwegian), or entire non-Roman alphabets such as Greek, Russian, Arabic, Hebrew, Chinese, or Japanese characters.

Multidimensional indexing finds application in two main areas: firstly, in searching an indexed list for a near match of a given record, and secondly, in the detection of groups of similar records in lists. Finding a possible match for a given record in a previously indexed database can therefore be achieved by the following steps: generating the keystring for the record to be matched, determining the location in sim-space by applying a template to the keystring, and searching the location's neighborhood, i.e. one or more databases or datastores, for sims by applying a nearest neighbor algorithm. Identifying sims within a previously indexed database can be achieved by the following steps performed on the system: randomly selecting a keystring in sim-space, pulling a pre-determined number of nearest neighbors, and checking all possible pairs for sims within the set of neighbors.
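
The first of these searches can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the template is a hypothetical one that numbers the letters in QWERTY reading order (it is not the template of Fig. 3), Euclidean distance is assumed as the distance in sim-space, and the neighborhood is searched by a plain linear scan rather than a dedicated nearest neighbor index. The names `TEMPLATE`, `to_coordinates`, and `near_matches` are illustrative.

```python
import math

# Hypothetical template (an assumption, not Fig. 3): each letter's coordinate
# is its 1-based position in QWERTY reading order.
QWERTY_ORDER = "QWERTYUIOPASDFGHJKLZXCVBNM"
TEMPLATE = {ch: i + 1 for i, ch in enumerate(QWERTY_ORDER)}


def to_coordinates(keystring):
    """Map every character of a keystring to its template coordinate."""
    return tuple(TEMPLATE[ch] for ch in keystring.upper())


def near_matches(selected, keystrings, radius):
    """Return the keystrings whose coordinate sets lie within `radius`
    (Euclidean distance, an assumption) of the selected keystring."""
    target = to_coordinates(selected)
    return [k for k in keystrings
            if math.dist(target, to_coordinates(k)) <= radius]


candidates = ["CLSRKE", "XLARKE", "CLERKE", "MILLER"]
print(near_matches("CLARKE", candidates, radius=2.0))
```

Because every character position contributes one coordinate, a substitution in the first position ("XLARKE") is found as readily as one in the third ("CLSRKE"). All keystrings are assumed to have the same length so that their coordinate sets have the same dimension.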

With reference to Fig. 2, there is illustrated a flowchart for the process of detecting duplicate entries, in which the template definition, key optimization, and record location processes are specifically illustrated. The template defining process (TDP) defines templates in such a manner as to assign characters that are commonly substituted erroneously for one another to nearby coordinates. The system thus creates mapping templates, as shown at step 24. The key optimization process (KOP) creates a keystring 16 for each record 12 in the database or datastore. The system creates a key to apply to the particular dataset under review in the program, shown at step 26, wherein the dataset can be one or more databases or data stores, and then the system creates a keystring 16 for each record in the dataset, shown at step 28. Each character position in the keystring is given an evaluation based on its ability to discriminate between records. The evaluation nominally lies between 0 and 1 (preferably expressed as a percentage), with 1 being the best (desired) discrimination.

For example, if a particular position in a keystring always contained the same character, it would have no value in finding duplicate records. This could occur, for example, with a mailing list database for California. The first digit of the zip code would always be "9" and would have no value in discriminating between records. One simple method for assessing the discrimination for a given character position is to assign a "-1" to any coordinate lying below the median coordinate and a "+1" to any coordinate lying on or above the median coordinate. An evaluation of zero would result when the absolute value of the sum of the assigned -1's and +1's (a's) is equal to the number of records (n). An evaluation of 100% would result when the sum of the assigned a's is equal to zero. Thus, the position evaluation (PE) is given by:

PE = (n - |∑a|) / n.
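
As a worked illustration, the position evaluation can be computed as in the following Python sketch. The function name `position_evaluation` is hypothetical; the input is assumed to be the list of template coordinates found at one character position across all n records, and the result is returned as a fraction between 0 and 1 rather than as a percentage.

```python
from statistics import median


def position_evaluation(coords):
    """PE = (n - |sum of a|) / n, where a = -1 for a coordinate below the
    median and a = +1 for a coordinate on or above the median."""
    n = len(coords)
    med = median(coords)
    total = sum(1 if c >= med else -1 for c in coords)
    return (n - abs(total)) / n


# Every record has the same character at this position (e.g. the leading "9"
# of a California zip code): the position does not discriminate at all.
print(position_evaluation([9, 9, 9, 9]))

# Coordinates split evenly about the median: best possible discrimination.
print(position_evaluation([1, 2, 3, 4]))
```

In the first case every coordinate lies on the median, so all n records receive +1 and the evaluation is zero. In the second case the -1's and +1's cancel and the evaluation is 1.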

Another method is to calculate the standard deviation of the coordinates for a given character position. However, such method has a disadvantage in that there is no easily defined "best" position evaluation. It should be noted that in the creation of a key, the order of the characters in the keystring is irrelevant for the present invention, unlike keys created for conventional indexing methods.

In the record location process (RLP) the system takes a template generated with a keystring 16 and locates matching records for the template, which represents near matches to the exact keystring generated from the record. The system maps the keystring into sim-space using the specified template, shown at step 30, and then the system examines the neighborhood, or database/datastore, for matches to the templated keystring, shown at step 32. Locating a record requires creating the keystring for that record and then determining the coordinates for that record using a template. Any number of processes may be utilized to make a key as are known in the art.

One preferred method of creating a mapping template that assigns characters which are commonly substituted erroneously is simply to "stretch" the QWERTY keyboard under the assumption that errors are commonly produced by typing a letter adjacent to the desired letter on the keyboard. This extrapolation of the QWERTY keyboard creates the template shown in Fig. 3.

Fig. 3 illustrates a template having a section 36 for the ASCII letter characters 38 with the template coordinates 40, and a section 42 for ASCII numbers 44 with corresponding template coordinates 46. Once the template has been used to identify sims in a sample database, the substitution error frequency can be directly determined for that type of data and data entry method. Any number of methods can then be used to construct more optimal templates.

Once the key is created, determining the coordinates of the record is a simple matter of substitution. For example, using the QWERTY template in Fig. 3, the keystring "CLARKE" would have coordinates of {11, 25, 3, 9, 22, 5}.

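The substitution can be sketched in Python as a dictionary lookup. The template below is deliberately partial: it contains only the six coordinates stated above for "CLARKE", since the remaining entries of the Fig. 3 template are not reproduced in the text. The names `PARTIAL_TEMPLATE` and `keystring_to_coordinates` are illustrative.

```python
# Partial template: only the six coordinates that the text gives for "CLARKE".
# The other entries of the Fig. 3 template are not known here and are omitted.
PARTIAL_TEMPLATE = {"C": 11, "L": 25, "A": 3, "R": 9, "K": 22, "E": 5}


def keystring_to_coordinates(keystring, template):
    """Replace each character of the keystring with its template coordinate."""
    try:
        return [template[ch] for ch in keystring.upper()]
    except KeyError as missing:
        raise ValueError(f"no template coordinate for character {missing}") from None


print(keystring_to_coordinates("CLARKE", PARTIAL_TEMPLATE))
```

With the same partial template, "CLERKE" differs from "CLARKE" only in its third coordinate (5 instead of 3), so the two keystrings lie close together in sim-space.
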
With reference to Figs. 4a and 4b, a "hill-climber" algorithm is employed to construct a more optimal template. A template evaluation function (TE) is defined as the sum, over each pair of characters, of the error frequency (f) divided by the coordinate distance between the pair of characters (x_i, x_j), that is, TE = ∑ f / |x_i - x_j|, as shown in step 50. Characters are then randomly assigned coordinates and each set of assignments is evaluated, as shown in step 52. This step is repeated a pre-determined number of times and the set with the best (highest) score is saved and becomes the basis for the next step.

This set of coordinates is then "shuffled" by switching the coordinates for a randomly selected pair of characters that lie within a variable coordinate distance 'm.'

A comparison is then made to determine if the switch produces a better evaluation, shown at decision 54. If the switch has produced a better evaluation, the new set is saved and becomes the basis for continuing optimization, as shown at step 56, and a decision is made, decision 58, as to whether the process has been repeated a sufficient number of times. If there is no improvement after a pre-determined number of switches, i.e. the score is not greater than the previous highest score, m is decreased by 1 and a decision is again made as to whether the process has been repeated the requisite number of times, decision 58, and step 52 is repeated.

After an optimal template is produced, another variable is defined that represents a distance metric within the coordinate system, shown here as having an initial value 'm', as shown at step 60. The value 'm' is initially chosen such that it completely includes all the set members in the current template configuration. Then a pair of characters in the coordinate space lying within 'm' units of each other is randomly selected, shown at step 62, and their coordinates are switched. The template evaluation function is then applied, shown at step 64, and the resulting evaluation is compared to that of the current optimal template, shown at decision 66.

Should this template configuration yield a higher score, it is flagged as the new optimal template and set as the current template, shown at step 68. If the template does not yield a higher score, then a decision is made as to whether the template evaluation process has been repeated a predetermined number of times, shown at decision 70. If the evaluation process has not been repeated the requisite number of times, the pair of characters in the co-ordinate space lying within 'm' units of each other are again randomly selected, shown at step 62, and the process is repeated.

Once the optimal template has been selected, step 68, or the template evaluation process has been repeated the predetermined number of times, decision 70, then the distance m has 1 subtracted, shown at step 72, and a decision is made if m then equals 0, shown at decision 74. As 'm' now encloses a smaller region in the coordinate system, there are fewer pairs within this new region for comparison. If the resultant score does not improve on the current optimal score, another pair of points is chosen and their coordinates are again swapped, step 62, for a maximum of P retries. If m=0, the algorithm is complete and the optimal template has been determined.
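
The hill-climber of Figs. 4a and 4b can be sketched in Python as follows. This is a simplified reading of the flowcharts under stated assumptions, not the disclosed implementation: coordinates are assumed to be the distinct integers 1..N in a single dimension, the coordinate distance is taken as the absolute difference, and the error frequencies in `error_freq` are invented for illustration. The names `template_score`, `random_template`, and `hill_climb` and the parameters `starts`, `retries`, and `seed` are hypothetical.

```python
import random


def template_score(template, error_freq):
    """TE: sum over character pairs of error frequency / coordinate distance.
    A higher score means frequently confused characters lie closer together."""
    return sum(f / abs(template[a] - template[b])
               for (a, b), f in error_freq.items())


def random_template(chars, rng):
    """Assign the distinct coordinates 1..N to the characters at random."""
    coords = list(range(1, len(chars) + 1))
    rng.shuffle(coords)
    return dict(zip(chars, coords))


def hill_climb(chars, error_freq, starts=20, retries=50, seed=0):
    rng = random.Random(seed)
    # Fig. 4a: evaluate several random assignments and keep the best one.
    best = max((random_template(chars, rng) for _ in range(starts)),
               key=lambda t: template_score(t, error_freq))
    best_score = template_score(best, error_freq)
    # Fig. 4b: swap pairs lying within m units of each other, shrinking m to 0.
    m = len(chars) - 1          # initially encloses every character
    while m > 0:
        for _ in range(retries):            # at most `retries` (P) attempts
            a, b = rng.sample(chars, 2)
            if abs(best[a] - best[b]) > m:
                continue                    # pair is not within m units
            trial = dict(best)
            trial[a], trial[b] = trial[b], trial[a]
            score = template_score(trial, error_freq)
            if score > best_score:          # keep only improving swaps
                best, best_score = trial, score
                break                       # new optimum found: shrink m
        m -= 1
    return best, best_score


# Invented error frequencies for a handful of character pairs.
chars = list("AESROP")
error_freq = {("A", "S"): 5, ("E", "R"): 4, ("O", "P"): 3, ("A", "E"): 1}
template, score = hill_climb(chars, error_freq)
print(template, round(score, 3))
```

Because the coordinates are distinct, the distance between two different characters is never zero. A fixed seed is used so that repeated runs give the same template; a different seed may end in a different local optimum, which is a known property of hill-climbing.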

Several templates may be used in a single purge operation. Each template should be designed to be independent of previously created templates. This can be accomplished by setting the error frequency of pairs of characters that have adjacent coordinates to zero and running the hill-climber algorithm again. Given a set of templates, these templates can be used to evaluate the method used to create a keystring.

With reference to Fig. 5, the preferred process of key creation is illustrated. The system selects fields of the dataset, such as the local database, that provide a suitable level of discrimination, shown at step 78, and then any textual attributes to be neglected, such as vowels, numbers, punctuation and spaces, are specified and preferably applied, shown at step 80. Then any logical field groupings are identified, shown at step 82, and any source field substitutions are specified, shown at step 84, should the source field be blank. Then all field weightings are specified, shown at step 86, and all composite fields that can be analyzed by subfield partitioning, such as addresses, are identified, shown at step 88. A decision is then made, decision 90, as to whether the key test results show a high level of discrimination. If the key test does show a high level of discrimination, then the process ends. If the key test does not show a high level of discrimination, then the process is begun anew, with new dataset fields again selected at step 78.
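
Some of these key creation steps can be sketched in Python. The sketch covers only field selection, neglect of vowels, punctuation and spaces, and substitution for a blank source field; it is one possible reading, not the disclosed key. Treating a field weighting as the number of characters the field contributes, padding short fields with "_", and the field names in the example record are all assumptions made for illustration.

```python
import re

VOWELS = set("AEIOU")


def make_keystring(record, fields, fallbacks=None, drop_vowels=True):
    """Build a fixed-width keystring from selected fields of a record (a dict).

    `fields` is a list of (field_name, width) pairs; the width stands in for a
    field weighting (an assumption). `fallbacks` names a substitute field to
    use when the source field is blank.
    """
    fallbacks = fallbacks or {}
    parts = []
    for name, width in fields:
        value = record.get(name) or record.get(fallbacks.get(name, ""), "") or ""
        text = re.sub(r"[^A-Z0-9]", "", str(value).upper())  # drop punctuation, spaces
        if drop_vowels:
            text = "".join(ch for ch in text if ch not in VOWELS)
        parts.append(text[:width].ljust(width, "_"))  # pad so positions line up
    return "".join(parts)


record = {"last_name": "O'Clarke", "first_name": "", "nickname": "Jo", "zip": "90210"}
key = make_keystring(
    record,
    fields=[("last_name", 4), ("first_name", 2), ("zip", 3)],
    fallbacks={"first_name": "nickname"},
)
print(key)
```

A fixed width per field keeps each character position tied to the same field across records, which is what allows each position to be evaluated for its ability to discriminate.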

Fig. 6 illustrates a host computer environment 92 comprised of a host computer 94 having a local memory 98 and a central processing unit (CPU) 96. The host computer environment 92 is thus a system for identifying near matches between records in a data store, such as local memory 98, or a directly connected database 100, an example of which is a hard disk for the host computer 94. Accordingly, the CPU 96 of the host computer 94 preferably performs the steps of: creating one or more identifiers, such as a keystring 16, wherein each identifier is associated with a record in the data store; mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, such as with the template 20; and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record, as set forth above.

In another aspect, the present invention provides a computer-readable storage device, such as memory 98 or local database 100, containing instructions that upon execution cause a processor (CPU 96) to identify near matches between records in a data store (e.g. local database 100) and a selected record having an associated coordinate set. The device preferably performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, such as a keystring 16, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, such as with template 20, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

With reference again to Fig. 6, the host computer environment 92 and host computer 94 can be connected to any manner of computer or database and perform the method of finding near matches therein. As an example, host computer 94 is in direct connection with another computer 102 having a database 104, and the CPU 96 can access the data either resident on the directly connected computer 102, or the other database 104. Further, the host computer environment 92 can be connected to a local area network (LAN) 106 as is common in the art, and through the LAN 106, the host computer 94 can be in communication with one or more networked computers 108, each of which can have an attached database 110 that is accessible by the host computer 94. The host computer 94 can also be in communication through the LAN 106 with one or more networked databases 112, and can perform the record searching upon the data therein. The host computer environment 92 can either directly, or through the LAN 106 as shown in Fig. 6, be connected to the Internet 114, or other wide area network (WAN). Thus, the host computer 94 can thereby access one or more databases 116 in communication with the Internet 114, and can also access other computers 118 in communication with the Internet 114 and any databases 120 accessible to the other computers 118 on the Internet 114. It should be appreciated that the present inventive system can therefore be used in any environment having a processor and a datastore as are known in the art, and is not to be limited to the host computer environment 92 and connective environments disclosed in Fig. 6.

The present inventive system accordingly provides a method for identifying near matches between records in a data store and a selected record that has an associated coordinate set. The method includes steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, such as keystrings 16, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record. The identifier associated with each record in the data store preferably comprises one or more characters, such as keystring 16.

The method preferably further includes the step of determining a set of records from the retrieved records that match the selected record. The step of determining a set of records from the retrieved records that match the selected record is preferably screening the coordinate sets associated with the retrieved records for sets within a second predetermined distance (such as 'm' in Fig. 4b) in the discriminant space from the coordinate set associated with the selected record. And then the method further includes the step of extracting, deleting, or otherwise modifying the determined set of records.

The method also preferably includes the step of acquiring a mapping template, such as the template in Fig. 3, and the step of acquiring is preferably creating or receiving a mapping template. The method then preferably includes the step of refining the acquired mapping template, as shown in the processes of Figs. 4a and 4b.

The method can include the step of selecting an identifier format for use in the step of creating one or more identifiers, an example of which is the process of Fig. 5, and then further include the step of acquiring one or more mapping templates. The step of selecting an identifier format is then preferably evaluating the acquired one or more mapping templates.

The method further preferably includes the steps of receiving the selected record, creating an identifier associated with the selected record, and mapping the identifier associated with the selected record into a coordinate set in the discriminant space associated with the selected record, as discussed above. Additionally, the method can further include the step of retrieving the coordinate set associated with the selected record from the datastore(s).

It should also be noted that the present inventive system and method can be implemented on any category of computer devices including the four main categories of digital computers: supercomputers, mainframe computers, minicomputers and microcomputers. The structures, processes, methods and system as disclosed herein can also be implemented on handheld computers, and Personal Digital Assistant (PDA) and Personal Information Management (PIM) devices including, but not limited to, cellular/mobile phones, Personal Organizers, Windows CE devices or hybrid devices such as a smart phone that may be deployed over a fixed or wireless network. Moreover, the invention can be implemented on a variety of computing platforms and operating systems. For example, the present invention may be implemented on a standard personal computer (PC) operating under an operating system such as Windows, Windows NT, Unix, Linux, or other operating system. Standard development tools, languages and compilers all can be used to implement the processes described herein, under programming languages and development tools such as Java, C, C++, XML (Extensible Markup Language), Visual Basic, PowerBuilder, and other languages as known in the art.

Database files, either standard, relational or multidimensional, are the preferred file type to implement the invention. The databases can exist alone, in a data warehouse or in a data mart. The databases operated upon by the processes may be created, managed, transformed and/or consolidated using a variety of database systems as are known in the art. These systems include but are not limited to Oracle, Sybase, Informix, Access, SQL, ODBC, Foxpro, XML schema or any other traditional or relational databases and/or database access tools.

Typical uses of this invention include locating duplicate records, locating near duplicate records, locating records with similar characteristics, and enhancing search capabilities in a database, data mart, or data warehouse. The invention can also be used in locating duplicate URLs over the Internet and/or locating correct URLs when URLs are misspelled or typed incorrectly. The invention could further enhance Internet search capabilities in locating similar URLs or similar products on an e-business site. The methods and processes of the invention would be able to resolve failed searches, for example by providing a list of 'projectors' in response to an Internet search for 'projecters.' The present invention can further be applied in locating and extracting duplicate or near duplicate records in a customer or supplier database, such as duplicate customer name and address, customer order, and/or customer payment information.

The methods and processes of the invention are not limited to text searches. Such capability can also locate duplicate or near duplicate customers/prospects in a Direct Marketing campaign or Sales Force Automation where there is data consolidation. The methods and processes used in this invention would allow one to compare all similar record sets to determine if duplicate data exists. This will then allow one to extract current customers from the prospect database. A further application of the present invention is locating similar or near duplicate records that are possibly fraudulent in e-commerce applications which are conducted over the Internet 114. E-business fraud can include any electronic credit card or other transaction where similar records are fraudulently used as a unique record. For example, in an e-business that gives benefits for signing up, the present invention can detect new members that sign up multiple times by changing their names slightly.
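By way of illustration only, the repeated sign-up scenario can be sketched as a two-stage check: a coarse retrieval of stored records within a first distance, followed by a tighter screening of the retrieved candidates. The `MemberStore` class, the letter-count mapping, the member names and both distance thresholds are hypothetical choices made for this sketch and are not taken from the disclosure.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase

def to_coordinates(text):
    """Hypothetical mapping: a 26-tuple holding the count of each letter a-z."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    return tuple(counts[ch] for ch in ALPHABET)

class MemberStore:
    """Toy data store that keeps each record's coordinate set beside the record."""

    def __init__(self):
        self._rows = []  # list of (name, coordinates) pairs

    def add(self, name):
        # Coordinates are computed once, when the record is stored.
        self._rows.append((name, to_coordinates(name)))

    def within(self, name, radius):
        """Return (stored name, distance) pairs lying within radius of name."""
        target = to_coordinates(name)
        scored = ((n, math.dist(target, c)) for n, c in self._rows)
        return [(n, d) for n, d in scored if d <= radius]

store = MemberStore()
for member in ["Jonathan Smith", "Jon Smith", "Maria Lopez"]:
    store.add(member)

signup = "Jonathon Smith"

# Stage 1: coarse retrieval within an assumed first distance.
candidates = store.within(signup, radius=2.5)

# Stage 2: screen the retrieved candidates against an assumed tighter distance.
suspects = [name for name, dist in candidates if dist <= 1.5]
print(suspects)
```

Storing the coordinates with each record means that only the new sign-up has to be mapped at query time. The linear scan in `within` is kept for clarity; a spatial index could replace it for a large store.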

This invention can further synchronize database files. For example, wireless devices are small and prone to input/data entry errors. As Personal Information Management (PIM) devices increase in popularity, more data will exist in a variety of data sources that need to be synchronized. Data existing on LAN, WAN, PIM, Internet and mainframe systems can be out of synchronization, and this invention can be used to clean the synchronized data.

While there have been shown preferred and alternate embodiments of the present invention, it is to be understood that certain changes may be made in the forms and arrangement of the elements and performance of the steps as set forth herein without departing from the spirit of the invention as particularly set forth in the claims appended herewith. In addition, all means-plus-function language is intended to cover all equivalent structures, materials, and acts as known to one of skill in the art providing the elements or performing the steps as set forth in the elements of the claims.

Claims

What is claimed is:

1. A method for identifying near matches between records in a data store and a selected record that has an associated coordinate set, the method comprising the steps of:

(a) creating one or more identifiers, wherein each identifier is associated with a record in the data store;

(b) mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store; and

(c) retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

2. The method of claim 1, further comprising the step of determining a set of records from the retrieved records that match the selected record.

3. The method of claim 2, further comprising the step of extracting the determined set of records.

4. The method of claim 2, further comprising the step of deleting the determined set of records from the data store.

5. The method of claim 2, further comprising the step of modifying the determined set of records from the data store.

6. The method of claim 2, wherein the step of determining a set of records from the retrieved records that match the selected record comprises screening the coordinate sets associated with the retrieved records for sets within a second predetermined distance in the discriminant space from the coordinate set associated with the selected record.

7. The method of claim 1, further comprising the step of extracting the retrieved records.

8. The method of claim 1, further comprising the step of deleting the retrieved records from the data store.

9. The method of claim 1, further comprising the step of modifying the determined set of records from the data store.

10. The method of claim 1, further comprising the step of acquiring a mapping template.

11. The method of claim 10, wherein the step of acquiring a mapping template comprises creating a mapping template.

12. The method of claim 10, wherein the step of acquiring a mapping template comprises receiving a mapping template.

13. The method of claim 10, further comprising the step of refining the acquired mapping template.

14. The method of claim 1, further comprising the step of selecting an identifier format for use in the step of creating one or more identifiers.

15. The method of claim 14, further comprising the step of acquiring one or more mapping templates.

16. The method of claim 15, wherein the step of selecting an identifier format comprises evaluating the acquired one or more mapping templates.

17. The method of claim 1, further comprising the steps of:

(d) receiving the selected record;

(e) creating an identifier associated with the selected record; and

(f) mapping the identifier associated with the selected record into a coordinate set in the discriminant space associated with the selected record.

18. The method of claim 1, further comprising the steps of retrieving the coordinate set associated with the selected record from the data store.

19. The method of claim 1, wherein the identifier associated with each record in the data store comprises one or more characters.

20. A system for identifying near matches between records in a data store and a selected record having an associated coordinate set, the system comprising:

(a) a data store for storing the records; and

(b) a processor for performing the steps of:

(c) creating one or more identifiers, wherein each identifier is associated with a record in the data store;

(d) mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store; and

(e) retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

21. A computer-readable storage device containing instructions that upon execution cause a processor to identify near matches between records in a data store and a selected record having an associated coordinate set by performing the steps comprising of:

(d) creating one or more identifiers, wherein each identifier is associated with a record in the data store;

(e) mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store; and

(f) retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

22. A system for identifying near matches between records in a data store and a selected record having an associated coordinate set, the system comprising:

(d) storing means for storing one or more records;

(e) creating means for creating one or more identifiers, wherein each identifier is associated with a record in the storing means;

(f) mapping means for mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the storing means; and

(g) retrieving means for retrieving all records from the storing means having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

AMENDED CLAIMS

[received by the International Bureau on 16 May 2000 (16.05.00); original claims 20-22 amended; new claims 23-38 added; remaining claims unchanged (3 pages)]

(a) a data store for storing the records; and

(b) a processor for performing the steps of:

(i) creating one or more identifiers, wherein each identifier is associated with a record in the data store;

(ii) mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store; and

(iii) retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

21. A computer-readable storage device containing instructions that upon execution cause a processor to identify near matches between records in a data store and a selected record having an associated coordinate set by performing the steps comprising of:

(b) mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store; and

(c) retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

22. A system for identifying near matches between records in a data store and a selected record having an associated coordinate set, the system comprising:

(a) storing means for storing one or more records;

(b) creating means for creating one or more identifiers, wherein each identifier is associated with a record in the storing means;

(c) mapping means for mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the storing means; and

(d) retrieving means for retrieving all records from the storing means having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.

23. The method of claim 1, wherein each record comprises a URL.

24. The method of claim 23, wherein the step of creating identifiers comprises creating identifiers based upon the URL in each record.

25. The method of claim 1, wherein each record comprises contact information associated with a person or entity.

26. The method of claim 25, wherein the contact information contains at least one type of information selected from the group consisting of name, address, identification number and telephone number.

27. The system of claim 20, wherein each record comprises a URL.

28. The system of claim 27, wherein the step of creating identifiers comprises creating identifiers based upon the URL in each record.

29. The system of claim 20, wherein each record comprises contact information associated with a person or entity.

30. The system of claim 29, wherein the contact information contains at least one type of information selected from the group consisting of name, address, identification number and telephone number.

31. The storage device of claim 21, wherein each record comprises a URL.

32. The storage device of claim 31, wherein the step of creating identifiers comprises creating identifiers based upon the URL in each record.

33. The storage device of claim 21, wherein each record comprises contact information associated with a person or entity.

34. The storage device of claim 33, wherein the contact information contains at least one type of information selected from the group consisting of name, address, identification number and telephone number.

35. The system of claim 22, wherein each record comprises a URL.

36. The system of claim 35, wherein the creating means comprises means for creating identifiers based upon the URL in each record.

37. The system of claim 22, wherein each record comprises contact information associated with a person or entity.

38. The system of claim 37, wherein the contact information contains at least one type of information selected from the group consisting of name, address, identification number and telephone number.