WO2000034897A9 - System and method for finding near matches among records in databases - Google Patents
System and method for finding near matches among records in databases
- Publication number
- WO2000034897A9 (application PCT/US1999/028870)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- record
- records
- data store
- identifiers
- creating
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24554—Unary operations; Data partitioning operations
- G06F16/24556—Aggregation; Duplicate elimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Definitions
- This invention generally relates to computer systems and the locating of records in databases and datastores. More specifically, this invention relates to a system and method for identifying near matches among records in a datastore based upon discriminant analysis.
- the invention addresses the problem of having duplicate and near duplicate records in database files, data marts, data warehouses or any data file.
- the duplication of information is difficult to find and can lead to wasted time and money.
- Processing duplicate claims, expense payments or other duplicate records can lead to cost overruns, customer service problems, inefficient processing time, manual intervention into automated systems, and wasted disk storage on computer systems.
- Unsynchronized data over multiple environments can lead to data duplicates, data replication and other data management problems.
- The inability to locate a near match in Internet searches can lead to lost sales opportunities, poor customer service and lost revenue.
- Scanning the sorted list consists of a single pass through the list, comparing each successive pair of sorted keystrings to determine some measure of their similarity. Pairs of keystrings that are found to be similar within some pre-defined measure of similarity are flagged, or one of the records is removed. Under this method, only mismatches in the least significant (right-most) character positions will be found.
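- The limitation of the sorted-list scan can be illustrated with a small sketch. The function names and the similarity measure (a count of differing character positions) are illustrative assumptions and are not taken from the patent text.

```python
def hamming(a: str, b: str) -> int:
    """Count differing character positions (shorter string is space-padded)."""
    width = max(len(a), len(b))
    return sum(x != y for x, y in zip(a.ljust(width), b.ljust(width)))


def adjacent_near_matches(keystrings, max_diff=1):
    """Single pass over the sorted list, comparing each successive pair."""
    ordered = sorted(keystrings)
    return [
        (left, right)
        for left, right in zip(ordered, ordered[1:])
        if hamming(left, right) <= max_diff
    ]


keys = ["CLARKE", "CLARKF", "XLARKE", "DAVIES", "MILLER"]
pairs = adjacent_near_matches(keys)
```

- In this sketch "CLARKF" sorts next to "CLARKE" and is flagged, whereas "XLARKE" differs in the left-most position, sorts far away, and is never compared with "CLARKE" even though only one character differs.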
- the present invention is a system and method for finding near-matches among records in one or more databases.
- the system is for identifying near matches between records in a data store and a selected record having an associated coordinate set, and includes a data store for storing the records and a processor.
- the processor of the system performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
- the present invention provides a computer-readable storage device containing instructions that upon execution cause a processor to identify near matches between records in a data store and a selected record having an associated coordinate set.
- the device preferably performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
- the present inventive system accordingly provides a method for identifying near matches between records in a data store and a selected record that has an associated coordinate set.
- the method includes steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
- the identifier associated with each record in the data store preferably comprises one or more characters.
- the method preferably further includes the step of determining a set of records from the retrieved records that match the selected record.
- the step of determining a set of records from the retrieved records that match the selected record is preferably screening the coordinate sets associated with the retrieved records for sets within a second predetermined distance in the discriminant space from the coordinate set associated with the selected record.
- the method further includes the step of extracting, deleting, or otherwise modifying the determined set of records.
- the method also preferably includes the step of acquiring a mapping template, and the step of acquiring is preferably creating or receiving a mapping template.
- the method then preferably includes the step of refining the acquired mapping template.
- The method can include the step of selecting an identifier format for use in the step of creating one or more identifiers, and then further include the step of acquiring one or more mapping templates.
- the step of selecting an identifier format is then preferably evaluating the acquired one or more mapping templates.
- the method further preferably includes the steps of receiving the selected record, creating an identifier associated with the selected record, and mapping the identifier associated with the selected record into a coordinate set in the discriminant space associated with the selected record. Additionally, the method can further include the step of retrieving the coordinate set associated with the selected record from the data store.
- near matches to a selected record that has an associated coordinate set are identified among records in a data store.
- An identifier, preferably a keystring, is created for each record in the database. Each such identifier is then mapped into a set of coordinates in discriminant space.
- Records in the data store that are near matches to the selected record are retrieved by collecting all records with an associated coordinate set within a predetermined distance in discriminant space from the coordinate set associated with the selected record.
- identical records may be identified by further selecting records from retrieved records or by setting the predetermined distance at a threshold guaranteeing only identical records are retrieved.
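- The retrieval step can be sketched as a distance filter over stored coordinate sets. This is a minimal sketch: the Euclidean metric, the function name `within_distance`, the record ids, and the sample coordinates are assumptions made for illustration, since the text does not fix a particular distance measure.

```python
import math


def within_distance(selected, coordinate_sets, max_distance):
    """Return ids of records whose coordinate set lies within max_distance."""
    return [
        record_id
        for record_id, coords in coordinate_sets.items()
        if math.dist(selected, coords) <= max_distance
    ]


# Hypothetical coordinate sets, one per record in the data store.
coordinate_sets = {
    "rec-1": (11, 25, 3, 9, 22, 5),
    "rec-2": (11, 25, 3, 9, 22, 6),
    "rec-3": (2, 14, 30, 7, 1, 19),
}
selected = (11, 25, 3, 9, 22, 5)

near = within_distance(selected, coordinate_sets, max_distance=2.0)
identical = within_distance(selected, coordinate_sets, max_distance=0.0)
```

- Setting the threshold to zero keeps only records whose coordinate sets coincide with that of the selected record, which corresponds to the identical-match case described above.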
- the records retrieved as near matches and/or identical matches may be automatically deleted from the data store as duplicative or may be outputted to an appropriate output device for further automated or manual processing.
- the system and method for finding near matches among records in databases accordingly has industrial applicability in that the invention can be installed and practiced on existing computer systems to increase searching efficiency. Moreover, the inventive system can be created during the manufacture of computer systems having database record searching as a significant component of the system functionality.
- the present invention therefore has a commercial advantage in that it defines a multi-dimensional indexing scheme that performs efficient database searching.
- In a multi-dimensional indexing scheme, the likelihood of finding mismatched characters is not dependent on the position of the character in the keystring. Tests suggest this method is more efficient at finding sims, identifying from 50% to 100% more sims than conventional methods known in the art. Consequently, sims are more likely to be identified through use of the present invention than through use of existing linear indexing systems.
- FIG. 1 is a flowchart illustrating the mapping process of the present invention, specifically illustrating the record, key, and coordinate templates of the process.
- FIG. 2 is a flowchart illustrating the process of detecting duplicate entries, and specifically illustrating the template definition, key optimization, and record location processes.
- FIG. 3 is an illustration of a defined template created from the QWERTY keyboard wherein the likelihood of errors is assumed to occur from typing errors due to key proximity.
- FIG. 4a is a flowchart illustrating base template generation.
- FIG. 4b is a flowchart illustrating the optimization of the base template generated from the process of Fig. 4a.
- FIG. 5 is a flowchart illustrating the process of key creation.
- FIG. 6 is a representative diagram of a host computer environment in connection with other computers and databases through a local area network (LAN) and through the Internet.
- Fig. 1 is a flowchart illustrating the present inventive method for finding near-matches among records in one or more databases.
- the present invention is a result of extending and enhancing the concepts of multiple discriminant analysis to locate each record in discriminant space (hereafter referred to as sim-space).
- The system selects a record 12 in one or more databases to dimension, shown at step 10, and applies a key to the record 12, shown at step 14, to create a keystring 16.
- Each character position in the keystring, such as keystring 16, preferably represents a dimension of the record, although other aspects of this invention can use each character position to represent two or more dimensions.
- The system determines the coordinates for each character in the keystring through a selected template, shown at step 18, such as template 20.
- Pre-determined "templates," such as template 20, define the actual coordinates of a record in sim-space. These templates provide a conformal mapping for each character in the keystring to a coordinate (or coordinates) in sim-space.
- Although shown here as ASCII characters, other characters that can be used in the present invention include the full English alphabet, numbers, words, special characters such as ñ (Spanish) and letters specific to German or Norwegian, or can consist of entire non-Roman alphabets such as Greek, Russian, Arabic, Hebrew, Chinese, or Japanese characters.
- Multidimensional indexing finds application in two main areas: first, in searching an indexed list for a near match of a given record, and second, in the detection of groups of similar records in lists. Finding a possible match for a given record in a previously indexed database can therefore be achieved by the following steps: generating the keystring for the record to be matched, determining the location in sim-space by applying a template to the keystring, and searching the location's neighborhood, i.e. one or more databases or datastores, for sims by applying a nearest neighbor algorithm.
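- The three lookup steps can be sketched as follows. Everything named here is hypothetical: the placeholder template (A=1 through Z=26) stands in for a real mapping template, `make_keystring` is a toy key built from a surname field, and a linear scan with `heapq.nsmallest` stands in for whatever nearest neighbor algorithm an implementation would use.

```python
import heapq
import math
import string

# Placeholder template (assumption): A=1 ... Z=26; unknown characters map to 0.
TEMPLATE = {ch: i for i, ch in enumerate(string.ascii_uppercase, start=1)}


def make_keystring(record, width=6):
    """Toy key: upper-cased surname letters, truncated or space-padded to width."""
    letters = "".join(c for c in record["surname"].upper() if c in TEMPLATE)
    return letters[:width].ljust(width)


def locate(keystring):
    """Apply the template to obtain the record's location in sim-space."""
    return tuple(TEMPLATE.get(ch, 0) for ch in keystring)


def nearest_neighbors(record, index, k=2):
    """Return the k indexed entries closest to the record's location."""
    target = locate(make_keystring(record))
    return heapq.nsmallest(k, index, key=lambda item: math.dist(target, item[0]))


records = [{"surname": s} for s in ("Clarke", "Clark", "Miller", "Davies")]
index = [(locate(make_keystring(r)), r) for r in records]

matches = nearest_neighbors({"surname": "Clarkf"}, index, k=2)
surnames = [record["surname"] for _, record in matches]
```

- A production system would presumably replace the linear scan with a spatial index; the scan is used here only to keep the sketch short.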
- Identifying sims within a previously indexed database can be achieved by the following steps performed on the system: randomly selecting a keystring in sim-space, pulling a pre-determined number of nearest neighbors, and checking all possible pairs for sims within the set of neighbors.
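- One pass of this group-detection procedure might look like the sketch below. The function name `find_sims`, the Euclidean distance, the threshold value, and the sample coordinates are assumptions for illustration only.

```python
import itertools
import math
import random


def find_sims(points, neighbors=3, max_distance=1.5, rng=None):
    """Pick a random keystring, pull its nearest neighbors, check all pairs."""
    rng = rng or random.Random()
    seed_key = rng.choice(sorted(points))
    ranked = sorted(points, key=lambda k: math.dist(points[seed_key], points[k]))
    group = ranked[: neighbors + 1]  # the seed itself plus its nearest neighbors
    return {
        tuple(sorted(pair))
        for pair in itertools.combinations(group, 2)
        if math.dist(points[pair[0]], points[pair[1]]) <= max_distance
    }


# Hypothetical sim-space locations keyed by keystring.
points = {
    "CLARKE": (3, 12, 1, 18, 11, 5),
    "CLARKF": (3, 12, 1, 18, 11, 6),
    "MILLER": (13, 9, 12, 12, 5, 18),
}
sims = find_sims(points, neighbors=2, rng=random.Random(0))
```

- Repeating the call with different random seeds would cover different neighborhoods of the index; a single call only inspects the neighborhood of one randomly chosen keystring.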
- The template defining process defines templates in such a manner as to assign characters that are commonly substituted for one another in error to nearby coordinates.
- the system thus creates mapping templates, as shown at step 24.
- the key optimization process creates a keystring 16 for each record 12 in the database or datastore.
- the system creates a key to apply to the particular dataset under review in the program, shown at step 26, wherein the dataset can be one or more databases or data stores, and then the system creates a keystring 16 for each record in the dataset, shown at step 28.
- Each character position in the keystring is given an evaluation based on its ability to discriminate between records. The evaluation nominally lies between 0 and 1 (preferably expressed as a percentage) with 1 being the best discrimination (desired).
- PE = (n - ⁇ a) / n.
- Another method is to calculate the standard deviation of the coordinates for a given character position.
- Such a method has a disadvantage in that there is no easily defined "best" position evaluation. It should be noted that in the creation of a key, the order of the characters in the keystring is irrelevant for the present invention, unlike keys created for conventional indexing methods.
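- The standard-deviation evaluation can be sketched per character position as follows. The use of the population standard deviation and the sample coordinate sets are assumptions; the text does not specify which form of the statistic is intended.

```python
import statistics


def position_spread(coordinate_sets):
    """Standard deviation of the coordinates at each character position."""
    return [statistics.pstdev(column) for column in zip(*coordinate_sets)]


# Hypothetical coordinate sets for three records with three-character keystrings.
coords = [(3, 12, 1), (3, 9, 1), (3, 1, 22)]
spread = position_spread(coords)
```

- A spread of zero, as at the first position here, means that position cannot discriminate between these records at all. Larger values indicate more discrimination, but, as noted above, there is no natural upper bound that marks a "best" score.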
- RLP record location process
- the system takes a template generated with a keystring 16 and locates matching records for the template, which represents near matches to the exact keystring generated from the record.
- the system maps the keystring into sim-space using the specified template, shown at step 30, and then the system examines the neighborhood, or database/datastore, for matches to the templated keystring, shown at step 32. Locating a record requires creating the keystring for that record and then determining the coordinates for that record using a template. Any number of processes may be utilized to make a key as are known in the art.
- One preferred method of creating a mapping template that assigns characters which are commonly substituted erroneously is simply to "stretch" the QWERTY keyboard under the assumption that errors are commonly produced by typing a letter adjacent to the desired letter on the keyboard. This extrapolation of the QWERTY keyboard creates the template shown in Fig. 3.
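- One way to "stretch" a keyboard layout into a one-dimensional template is to walk the keys column by column and number them consecutively, so that keys that are physically close receive nearby coordinates. The sketch below is an assumption about how such a template could be generated; it is not claimed to reproduce the numbering of the template in Fig. 3.

```python
ROWS = ["QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM"]


def stretched_qwerty_template(rows=ROWS):
    """Number keys column by column so neighbouring keys get nearby coordinates."""
    template = {}
    next_coordinate = 1
    for column in range(max(len(row) for row in rows)):
        for row in rows:
            if column < len(row):
                template[row[column]] = next_coordinate
                next_coordinate += 1
    return template


template = stretched_qwerty_template()
```

- With this numbering, horizontally adjacent keys such as Q and W end up three units apart, while keys at opposite ends of the keyboard such as Q and P are far apart.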
- Fig. 3 illustrates a template having a section 36 for the ASCII letter characters 38 with the template coordinates 40, and a section 42 for ASCII numbers 44 with corresponding template coordinates 46.
- Determining the coordinates of the record is a simple matter of substitution. For example, using the QWERTY template in Fig. 3, the keystring CLARKE would have coordinates of {11, 25, 3, 9, 22, 5}.
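- The substitution can be written as a dictionary lookup. Only the six template entries implied by the CLARKE example are included below; the remaining entries of the Fig. 3 template are not reproduced here, and the names `PARTIAL_TEMPLATE` and `to_coordinates` are illustrative.

```python
# Six entries taken from the CLARKE example; the full template is not shown.
PARTIAL_TEMPLATE = {"C": 11, "L": 25, "A": 3, "R": 9, "K": 22, "E": 5}


def to_coordinates(keystring, template):
    """Replace each character of the keystring by its template coordinate."""
    return [template[ch] for ch in keystring]


coords = to_coordinates("CLARKE", PARTIAL_TEMPLATE)
```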
- This set of coordinates is then "shuffled" by switching the coordinates for a randomly selected pair of characters that lie within a variable coordinate distance 'm.'
- A comparison is then made to determine if the switch produces a better evaluation, shown at decision 54. If the switch has produced a better evaluation, the new set is saved and becomes the basis for continuing optimization, as shown at step 56, and a decision is made, decision 58, as to whether the process has been repeated a sufficient number of times. If there is no improvement after a pre-determined number of switches, i.e. the score is not greater than the previous highest score, m is decreased by 1 and a decision is again made as to whether the process has been repeated the requisite number of times, decision 58, and step 52 is repeated.
- Another variable is defined that represents a distance metric within the coordinate system, shown here as having an initial value 'm', as shown at step 60.
- the value 'm' is initially chosen such that it completely includes all the set members in the current template configuration.
- a pair of characters in the co-ordinate space lying within 'm' units of each other are randomly selected, shown at step 62, and their coordinates are switched.
- The template evaluation function is then applied, shown at step 64, and the resulting error is compared to the current optimal template, shown at decision 66.
- If this template configuration yields a higher score, it is flagged as the new optimal template and set as the current template, shown at step 68. If the template does not yield a higher score, then a decision is made as to whether the template evaluation process has been repeated a predetermined number of times, shown at decision 70. If the evaluation process has not been repeated the requisite number of times, the pair of characters in the co-ordinate space lying within 'm' units of each other are again randomly selected, shown at step 62, and the process is repeated.
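- The swap-and-evaluate loop can be sketched as a simple hill climber. The evaluation function used here (total error frequency of character pairs that sit at adjacent coordinates), the `patience` parameter, and the sample error frequencies are assumptions made for illustration; the text does not give the exact template evaluation function.

```python
import itertools
import random


def score(template, error_freq):
    """Assumed evaluation: total error frequency of pairs at adjacent coordinates."""
    return sum(
        freq
        for (a, b), freq in error_freq.items()
        if abs(template[a] - template[b]) == 1
    )


def hill_climb(template, error_freq, patience=50, rng=None):
    """Swap random pairs lying within m units; keep swaps that raise the score."""
    rng = rng or random.Random()
    best = dict(template)
    best_score = score(best, error_freq)
    chars = sorted(best)
    m = max(best.values()) - min(best.values())  # initially spans every character
    stale = 0
    while m >= 1:
        candidates = [
            (a, b)
            for a, b in itertools.combinations(chars, 2)
            if abs(best[a] - best[b]) <= m
        ]
        if not candidates:
            m -= 1
            continue
        a, b = rng.choice(candidates)
        trial = dict(best)
        trial[a], trial[b] = trial[b], trial[a]
        trial_score = score(trial, error_freq)
        if trial_score > best_score:
            best, best_score, stale = trial, trial_score, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement: shrink the swap radius
                m -= 1
                stale = 0
    return best, best_score


start = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
error_freq = {("A", "E"): 10, ("B", "D"): 5}  # hypothetical substitution counts
best, best_score = hill_climb(start, error_freq, patience=20, rng=random.Random(1))
```

- Because a swap is accepted only when the score strictly increases, the returned template never scores worse than the starting template, and the set of coordinates in use is unchanged.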
- Each template should be designed to be independent of previously created templates. This can be accomplished by setting the error frequency of pairs of characters that have adjacent coordinates to zero and running the hill-climber algorithm again. Given a set of templates, these templates can be used to evaluate the method used to create a keystring.
- The preferred process of key creation is illustrated in Fig. 5.
- The system selects fields of the dataset, such as the local database, that provide a suitable level of discrimination, shown at step 78, and then any textual attributes to be neglected, such as vowels, numbers, punctuation and spaces, are specified and preferably applied, shown at step 80.
- any logical field groupings are identified, shown at step 82, and any source field substitutions are specified, shown at step 84, should the source field be blank.
- all field weightings are specified, shown at step 86, and all composite fields that can be analyzed by subfield partitioning, such as addresses, are identified, shown at step 88.
- a decision is then made, decision 90, as to whether the key test results show a high level of discrimination. If the key test does show a high level of discrimination, then the process ends. If the key test does not show a high level of discrimination, then the process is begun anew, with new dataset fields again selected at step 78.
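- A key built along these lines could be sketched as below. The field names, the fallback field, and the reading of "field weighting" as the number of characters each field contributes are all assumptions for illustration; this sketch neglects vowels, punctuation and spaces but keeps digits.

```python
import re

# (field, fallback field used when the source field is blank, characters kept)
KEY_SPEC = [
    ("surname", "company", 4),
    ("postcode", None, 3),
]

# Characters neglected when building the key: vowels, punctuation and spaces.
NEGLECT = re.compile(r"[AEIOU\W_]")


def make_keystring(record, spec=KEY_SPEC):
    """Concatenate the cleaned, fixed-width contribution of each selected field."""
    parts = []
    for field, fallback, weight in spec:
        value = record.get(field) or (record.get(fallback, "") if fallback else "")
        cleaned = NEGLECT.sub("", value.upper())
        parts.append(cleaned[:weight].ljust(weight))
    return "".join(parts)


key_a = make_keystring({"surname": "Clarke", "postcode": "AB1 2CD"})
key_b = make_keystring({"surname": "", "company": "Acme Ltd.", "postcode": "X9"})
```

- The second record has a blank surname, so the hypothetical `company` field is substituted, mirroring the source field substitution of step 84.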
- Fig. 6 illustrates a host computer environment 92 comprised of a host computer 94 having a local memory 98 and a central processing unit (CPU) 96.
- The host computer environment 92 is thus a system for identifying near matches between records in a data store, such as local memory 98, or a directly connected database 100, an example of which is a hard disk for the host computer 94.
- the CPU 96 of the host computer 94 preferably performs the steps of: creating one or more identifiers, such as a keystring 16, wherein each identifier is associated with a record in the data store; mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, such as creating the template 20; and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record, as set forth above.
- the present invention provides a computer-readable storage device, such as memory 98 or local database 100, containing instructions that upon execution cause a processor (CPU 96) to identify near matches between records in a data store (e.g. local database 100) and a selected record having an associated coordinate set.
- the device preferably performs the steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, such as a keystring 16, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, such as with template
- the host computer environment 92 and host computer 94 can be connected to any manner of computer or database and perform the method of finding near matches therein.
- host computer 94 is in direct connection with another computer 102 having a database 104, and the CPU 96 can access the data either resident on the directly connected computer 102, or the other database 104.
- the host computer environment 92 can be connected to a local area network (LAN) 106 as are common in the art, and through the LAN 106, the host computer 94 can be in communication with one or more networked computers 108, each of which can have an attached database 110 that is accessible by the host computer 94.
- the host computer 94 can also be in communication through the LAN 106 with one or more networked databases 112, and can perform the record searching upon the data therein.
- the host computer environment 92 can either directly, or through the LAN 106 as shown in Fig. 6, be connected to the Internet 114, or other wide area network
- the host computer 94 can thereby access one or more databases 116 in communication with the Internet 114, and can also access other computers 118 in communication with the Internet 114 and any databases 120 accessible to the other computers 118 on the Internet 114.
- the present inventive system can therefore be used in any environment having a processor and a datastore as are known in the art, and is not to be limited to the host computer environment 92 and connective environments disclosed in Fig. 6.
- the present inventive system accordingly provides a method for identifying near matches between records in a data store and a selected record that has an associated coordinate set.
- the method includes steps of creating one or more identifiers wherein each identifier is associated with a record in the data store, such as keystrings 16, mapping each of the one or more identifiers into a set of coordinates in a discriminant space associated with each record in the data store, and retrieving all records from the data store having associated coordinate sets within a predetermined distance in the discriminant space from the coordinate set associated with the selected record.
- the identifier associated with each record in the data store preferably comprises one or more characters, such as keystring 16.
- the method preferably further includes the step of determining a set of records from the retrieved records that match the selected record.
- the step of determining a set of records from the retrieved records that match the selected record is preferably screening the coordinate sets associated with the retrieved records for sets within a second predetermined distance (such as 'm' in Fig. 4b) in the discriminant space from the coordinate set associated with the selected record. And then the method further includes the step of extracting, deleting, or otherwise modifying the determined set of records.
- the method also preferably includes the step of acquiring a mapping template, such as the template in Fig. 3, and the step of acquiring is preferably creating or receiving a mapping template.
- The method then preferably includes the step of refining the acquired mapping template, as shown in the processes of Figs. 4a and 4b.
- The method can include the step of selecting an identifier format for use in the step of creating one or more identifiers, an example of which is the process of Fig. 5, and then further include the step of acquiring one or more mapping templates.
- the step of selecting an identifier format is then preferably evaluating the acquired one or more mapping templates.
- the method further preferably includes the steps of receiving the selected record, creating an identifier associated with the selected record, and mapping the identifier associated with the selected record into a coordinate set in the discriminant space associated with the selected record, as discussed above. Additionally, the method can further include the step of retrieving the coordinate set associated with the selected record from the datastore(s).
- the present inventive system and method can be implemented on any category of computer devices including the four main categories of digital computers: supercomputers, mainframe computers, minicomputers and microcomputers.
- the structures, processes, methods and system as disclosed herein can also be implemented on handheld computers, and Personal Digital Assistant (PDA) and Personal Information Management (PIM) devices including, but not limited to, cellular/mobile phones, Personal Organizers, Windows CE devices or hybrid devices such as a smart phone that may be deployed over a fixed or wireless network.
- the invention can be implemented on a variety of computing platforms and operating systems.
- the present invention may be implemented on a standard personal computer (PC) operating under an operating system such as Windows, Windows NT, Unix, Linux, or other operating system.
- Standard development tools, languages and compilers all can be used to implement the processes described herein, under programming languages and development tools such as Java, C, C++, XML (Extensible Markup Language), Visual Basic, PowerBuilder, and other languages as known
- Database files are the preferred file type to implement the invention.
- The databases can exist alone, in a data warehouse or in a data mart. These databases, which are operated upon by the processes, may be created, managed, transformed and/or consolidated using a variety of database systems as are known in the art. These systems include but are not limited to Oracle, Sybase, Informix, Access, SQL, ODBC, Foxpro, XML schema or any other traditional or relational databases and/or database access tools.
- Typical uses of this invention include locating duplicate records, locating near duplicate records, locating records with similar characteristics, and enhancing search capabilities in a database, data mart, or data warehouse.
- The invention can also be used in locating duplicate URLs over the Internet and/or locating correct URLs when URLs are misspelled or typed incorrectly.
- the invention could further enhance Internet search capabilities in locating similar URLs or products on an e-business site. Locating similar products for an e-business site is another use of this invention.
- The methods and processes of the invention would be able to solve failed searches by providing a list of 'projectors' based on the Internet search for 'projecters.'
- the present invention further can be applied in locating and extracting duplicate or near duplicate records in a customer or supplier database such as duplicate customer's name and address, customer order, and/or customer payment information.
- The methods and processes of the invention are not limited to test searches. Such capability can also locate duplicate or near duplicate customers/prospects in a Direct
- E-business fraud can include any electronic credit card or other transactions where similar records are fraudulently used as a unique record. For example, in an e-business that gives benefits for signing up, the present invention can detect new members that sign up multiple times by changing their name slightly.
- This invention can further synchronize database files.
- Wireless devices are small and prone to input/data entry errors.
- Data existing on LAN, WAN, PIM, Internet and Mainframe systems can be out of synchronization and this invention can be used to clean the synchronized data.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA002353095A CA2353095A1 (en) | 1998-12-07 | 1999-12-06 | System and method for finding near matches among records in databases |
EP99966015A EP1138007A1 (en) | 1998-12-07 | 1999-12-06 | System and method for finding near matches among records in databases |
AU21667/00A AU2166700A (en) | 1998-12-07 | 1999-12-06 | System and method for finding near matches among records in databases |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11121298P | 1998-12-07 | 1998-12-07 | |
US60/111,212 | 1998-12-07 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2000034897A1 (en) | 2000-06-15 |
WO2000034897A9 (en) | 2001-06-07 |
Family
ID=22337203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1999/028870 WO2000034897A1 (en) | 1998-12-07 | 1999-12-06 | System and method for finding near matches among records in databases |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1138007A1 (en) |
AU (2) | AU2166700A (en) |
CA (1) | CA2353095A1 (en) |
WO (1) | WO2000034897A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8423374B2 (en) * | 2002-06-27 | 2013-04-16 | Siebel Systems, Inc. | Method and system for processing intelligence information |
GB0220576D0 (en) * | 2002-09-04 | 2002-10-09 | Neural Technologies Ltd | Data proximity detector |
US8126739B2 (en) | 2006-04-28 | 2012-02-28 | MDI Technologies, Inc | Method and system for tracking treatment of patients in a health services environment |
US8126738B2 (en) | 2006-04-28 | 2012-02-28 | Mdi Technologies, Inc. | Method and system for scheduling tracking, adjudicating appointments and claims in a health services environment |
US9262475B2 (en) | 2012-06-12 | 2016-02-16 | Melissa Data Corp. | Systems and methods for matching records using geographic proximity |
US9563677B2 (en) * | 2012-12-11 | 2017-02-07 | Melissa Data Corp. | Systems and methods for clustered matching of records using geographic proximity |
CN113595805B (en) * | 2021-08-23 | 2024-01-30 | 海南房小云科技有限公司 | Personal computer data sharing method for local area network |
AU2021469297A1 (en) * | 2021-10-13 | 2024-03-21 | Equifax Inc. | Fragmented record detection based on records matching techniques |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5649183A (en) * | 1992-12-08 | 1997-07-15 | Microsoft Corporation | Method for compressing full text indexes with document identifiers and location offsets |
US5465353A (en) * | 1994-04-01 | 1995-11-07 | Ricoh Company, Ltd. | Image matching and retrieval by multi-access redundant hashing |
US6029167A (en) * | 1997-07-25 | 2000-02-22 | Claritech Corporation | Method and apparatus for retrieving text using document signatures |
US6026398A (en) * | 1997-10-16 | 2000-02-15 | Imarket, Incorporated | System and methods for searching and matching databases |
- 1999-12-06: AU application AU21667/00A, published as AU2166700A (status: Abandoned)
- 1999-12-06: EP application EP99966015A, published as EP1138007A1 (status: Withdrawn)
- 1999-12-06: CA application CA002353095A, published as CA2353095A1 (status: Abandoned)
- 1999-12-06: WO application PCT/US1999/028870, published as WO2000034897A1 (status: Application Discontinuation)
- 1999-12-07: AU application AU64365/99A, published as AU6436599A (status: Abandoned)
Also Published As
Publication number | Publication date |
---|---|
EP1138007A1 (en) | 2001-10-04 |
AU2166700A (en) | 2000-06-26 |
AU6436599A (en) | 2000-06-08 |
CA2353095A1 (en) | 2000-06-15 |
WO2000034897A1 (en) | 2000-06-15 |
Similar Documents
Publication | Title |
---|---|
EP3745276A1 (en) | Discovering a semantic meaning of data fields from profile data of the data fields |
US7296011B2 (en) | Efficient fuzzy match for evaluating data records |
US6678681B1 (en) | Information extraction from a database |
Dasu et al. | Mining database structure; or, how to build a data quality browser |
Doermann et al. | The detection of duplicates in document image databases |
Chaudhuri et al. | Robust and efficient fuzzy match for online data cleaning |
Burrows et al. | Efficient plagiarism detection for large code repositories |
US6820079B1 (en) | Method and apparatus for retrieving text using document signatures |
US7043492B1 (en) | Automated classification of items using classification mappings |
US6615209B1 (en) | Detecting query-specific duplicate documents |
US8832133B2 (en) | Answering web queries using structured data sources |
Borges et al. | Discovering geographic locations in web pages using urban addresses |
US7711719B1 (en) | Massive multi-pattern searching |
US7565348B1 (en) | Determining a document similarity metric |
US20070299855A1 (en) | Detection of attributes in unstructured data |
EP1934829A2 (en) | Local search |
US20080140639A1 (en) | Processing a Text Search Query in a Collection of Documents |
US20130031083A1 (en) | Determining keyword for a form page |
US7240045B1 (en) | Automatic system for configuring to dynamic database search forms |
CA2326901A1 (en) | System and method for searching electronic documents created with optical character recognition |
US6691103B1 (en) | Method for searching a database, search engine system for searching a database, and method of providing a key table for use by a search engine for a database |
WO2000034897A9 (en) | System and method for finding near matches among records in databases |
JP4426041B2 (en) | Information retrieval method by category factor |
CN107291951B (en) | Data processing method, device, storage medium and processor |
CN111475464B (en) | Method for automatically finding and mining fingerprints of Web component |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
ENP | Entry into the national phase in: | Ref country code: CA; Ref document number: 2353095; Kind code of ref document: A; Format of ref document f/p: F |
AK | Designated states | Kind code of ref document: C2; Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: C2; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
COP | Corrected version of pamphlet | Free format text: PAGES 21-23, CLAIMS, REPLACED BY NEW PAGES 21-23; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
WWE | WIPO information: entry into national phase | Ref document number: 1999966015; Country of ref document: EP |
WWP | WIPO information: published in national office | Ref document number: 1999966015; Country of ref document: EP |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
ENP | Entry into the national phase in: | Ref document number: 2353095; Country of ref document: CA |
WWW | WIPO information: withdrawn in national office | Ref document number: 1999966015; Country of ref document: EP |