ATE420406T1 - Robust detection of fuzzy duplicate records in a database

Robust detection of fuzzy duplicate records in a database

Info

Publication number
ATE420406T1
Authority
AT
Austria
Prior art keywords
database
robust detection
duplicate records
fuzzy
fuzzy duplicate
Prior art date
Application number
AT05107743T
Other languages
English (en)
Inventor
Rajeev Motwani
Surajit Chaudhuri
Venkatesh Ganti
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2004-08-30
Filing date
2005-08-24
Publication date
2009-01-15
Application filed by Microsoft Corp
Application granted
Publication of ATE420406T1

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F11/00 Error detection; Error correction; Monitoring
                • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
                        • G06F16/21 Design, administration or maintenance of databases
                            • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
                • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
            • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
                • Y10S707/00 Data processing: database and file management or data structures
                    • Y10S707/99931 Database or file accessing
                        • Y10S707/99932 Access augmentation or optimizing
                        • Y10S707/99933 Query processing, i.e. searching
                        • Y10S707/99937 Sorting
                    • Y10S707/99941 Database schema or data structure
                        • Y10S707/99942 Manipulating data structure, e.g. compression, compaction, compilation
                        • Y10S707/99943 Generating database or data structure, e.g. via user interface
                        • Y10S707/99944 Object-oriented database structure
                            • Y10S707/99945 Object-oriented database structure processing
AT05107743T 2004-08-30 2005-08-24 Robust detection of fuzzy duplicate records in a database ATE420406T1 (de)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/929,514 US7516149B2 (en) 2004-08-30 2004-08-30 Robust detector of fuzzy duplicates

Publications (1)

Publication Number Publication Date
ATE420406T1 (de) 2009-01-15

Family

ID=35219700

Family Applications (1)

Application Number Title Priority Date Filing Date
AT05107743T ATE420406T1 (de) 2004-08-30 2005-08-24 Robust detection of fuzzy duplicate records in a database

Country Status (7)

Country Link
US (1) US7516149B2 (de)
EP (1) EP1630698B1 (de)
JP (1) JP4814570B2 (de)
KR (1) KR101153113B1 (de)
CN (1) CN100520776C (de)
AT (1) ATE420406T1 (de)
DE (1) DE602005012192D1 (de)

Families Citing this family (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8732004B1 (en) 2004-09-22 2014-05-20 Experian Information Solutions, Inc. Automated analysis of data to generate prospect notifications based on trigger events
US20070244732A1 (en) 2004-10-29 2007-10-18 American Express Travel Related Services Co., Inc., A New York Corporation Using commercial share of wallet to manage vendors
US20070016501A1 (en) 2004-10-29 2007-01-18 American Express Travel Related Services Co., Inc., A New York Corporation Using commercial share of wallet to rate business prospects
US8543499B2 (en) 2004-10-29 2013-09-24 American Express Travel Related Services Company, Inc. Reducing risks related to check verification
US7912770B2 (en) * 2004-10-29 2011-03-22 American Express Travel Related Services Company, Inc. Method and apparatus for consumer interaction based on spend capacity
US7840484B2 (en) 2004-10-29 2010-11-23 American Express Travel Related Services Company, Inc. Credit score and scorecard development
US8630929B2 (en) 2004-10-29 2014-01-14 American Express Travel Related Services Company, Inc. Using commercial share of wallet to make lending decisions
US8204774B2 (en) * 2004-10-29 2012-06-19 American Express Travel Related Services Company, Inc. Estimating the spend capacity of consumer households
US8131614B2 (en) 2004-10-29 2012-03-06 American Express Travel Related Services Company, Inc. Using commercial share of wallet to compile marketing company lists
US7822665B2 (en) 2004-10-29 2010-10-26 American Express Travel Related Services Company, Inc. Using commercial share of wallet in private equity investments
US8086509B2 (en) 2004-10-29 2011-12-27 American Express Travel Related Services Company, Inc. Determining commercial share of wallet
US8326671B2 (en) 2004-10-29 2012-12-04 American Express Travel Related Services Company, Inc. Using commercial share of wallet to analyze vendors in online marketplaces
US8326672B2 (en) 2004-10-29 2012-12-04 American Express Travel Related Services Company, Inc. Using commercial share of wallet in financial databases
US7792732B2 (en) 2004-10-29 2010-09-07 American Express Travel Related Services Company, Inc. Using commercial share of wallet to rate investments
US7788147B2 (en) 2004-10-29 2010-08-31 American Express Travel Related Services Company, Inc. Method and apparatus for estimating the spend capacity of consumers
US20080033852A1 (en) * 2005-10-24 2008-02-07 Megdal Myles G Computer-based modeling of spending behaviors of entities
US20080243680A1 (en) * 2005-10-24 2008-10-02 Megdal Myles G Method and apparatus for rating asset-backed securities
US8036979B1 (en) 2006-10-05 2011-10-11 Experian Information Solutions, Inc. System and method for generating a finance attribute from tradeline data
US8239250B2 (en) 2006-12-01 2012-08-07 American Express Travel Related Services Company, Inc. Industry size of wallet
US8606626B1 (en) 2007-01-31 2013-12-10 Experian Information Solutions, Inc. Systems and methods for providing a direct marketing campaign planning environment
US8606666B1 (en) 2007-01-31 2013-12-10 Experian Information Solutions, Inc. System and method for providing an aggregation tool
US7827153B2 (en) * 2007-12-19 2010-11-02 Sap Ag System and method to perform bulk operation database cleanup
JPWO2009104324A1 (ja) * 2008-02-22 2011-06-16 日本電気株式会社 Active metric learning device, active metric learning method, and program
US9910875B2 (en) 2008-12-22 2018-03-06 International Business Machines Corporation Best-value determination rules for an entity resolution system
US20100161542A1 (en) * 2008-12-22 2010-06-24 International Business Machines Corporation Detecting entity relevance due to a multiplicity of distinct values for an attribute type
US8200640B2 (en) 2009-06-15 2012-06-12 Microsoft Corporation Declarative framework for deduplication
US8176407B2 (en) * 2010-03-02 2012-05-08 Microsoft Corporation Comparing values of a bounded domain
US9652802B1 (en) 2010-03-24 2017-05-16 Consumerinfo.Com, Inc. Indirect monitoring and reporting of a user's credit data
US9361008B2 (en) * 2010-05-12 2016-06-07 Moog Inc. Result-oriented configuration of performance parameters
US8473410B1 (en) 2012-02-23 2013-06-25 American Express Travel Related Services Company, Inc. Systems and methods for identifying financial relationships
US8781954B2 (en) 2012-02-23 2014-07-15 American Express Travel Related Services Company, Inc. Systems and methods for identifying financial relationships
US8538869B1 (en) 2012-02-23 2013-09-17 American Express Travel Related Services Company, Inc. Systems and methods for identifying financial relationships
US9477988B2 (en) 2012-02-23 2016-10-25 American Express Travel Related Services Company, Inc. Systems and methods for identifying financial relationships
US9336302B1 (en) 2012-07-20 2016-05-10 Zuci Realty Llc Insight and algorithmic clustering for automated synthesis
CN104516900A (zh) * 2013-09-29 2015-04-15 国际商业机器公司 Clustering method for multiple sequence data and apparatus therefor
US9892158B2 (en) * 2014-01-31 2018-02-13 International Business Machines Corporation Dynamically adjust duplicate skipping method for increased performance
US10262362B1 (en) 2014-02-14 2019-04-16 Experian Information Solutions, Inc. Automatic generation of code for attributes
US10387389B2 (en) * 2014-09-30 2019-08-20 International Business Machines Corporation Data de-duplication
US10242019B1 (en) 2014-12-19 2019-03-26 Experian Information Solutions, Inc. User behavior segmentation using latent topic detection
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US11055327B2 (en) 2018-07-01 2021-07-06 Quadient Technologies France Unstructured data parsing for structured information
US11301440B2 (en) 2020-06-18 2022-04-12 Lexisnexis Risk Solutions, Inc. Fuzzy search using field-level deletion neighborhoods

Family Cites Families (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5924090A (en) 1997-05-01 1999-07-13 Northern Light Technology Llc Method and apparatus for searching a database of records
US5940821A (en) 1997-05-21 1999-08-17 Oracle Corporation Information presentation in a knowledge base search and retrieval system
US5913206A (en) 1997-08-15 1999-06-15 Microsoft Corporation Database system multi-column index selection for a workload
US5950186A (en) 1997-08-15 1999-09-07 Microsoft Corporation Database system index selection using cost evaluation of a workload for multiple candidate index configurations
US5926813A (en) 1997-08-15 1999-07-20 Microsoft Corporation Database system index selection using cost evaluation of a workload for multiple candidate index configurations
US5913207A (en) 1997-08-15 1999-06-15 Microsoft Corporation Database system index selection using index configuration enumeration for a workload
US5960423A (en) 1997-08-15 1999-09-28 Microsoft Corporation Database system index selection using candidate index selection for a workload
US5966702A (en) 1997-10-31 1999-10-12 Sun Microsystems, Inc. Method and apparatus for pre-processing and packaging class files
US6182066B1 (en) 1997-11-26 2001-01-30 International Business Machines Corp. Category processing of query topics and electronic document content topics
US6169983B1 (en) 1998-05-30 2001-01-02 Microsoft Corporation Index merging for database systems
US6223171B1 (en) 1998-08-25 2001-04-24 Microsoft Corporation What-if index analysis utility for database systems
US6460045B1 (en) 1999-03-15 2002-10-01 Microsoft Corporation Self-tuning histogram and database modeling
US6374241B1 (en) 1999-03-31 2002-04-16 Verizon Laboratories Inc. Data merging techniques
US6363371B1 (en) 1999-06-29 2002-03-26 Microsoft Corporation Identifying essential statistics for query optimization for databases
US6529901B1 (en) 1999-06-29 2003-03-04 Microsoft Corporation Automating statistics management for query optimizers
US6691108B2 (en) 1999-12-14 2004-02-10 Nec Corporation Focused search engine and method
US6266658B1 (en) 2000-04-20 2001-07-24 Microsoft Corporation Index tuner for given workload
US6366903B1 (en) 2000-04-20 2002-04-02 Microsoft Corporation Index and materialized view selection for a given workload
US6356890B1 (en) 2000-04-20 2002-03-12 Microsoft Corporation Merging materialized view pairs for database workload materialized view selection
US6513029B1 (en) 2000-04-20 2003-01-28 Microsoft Corporation Interesting table-subset selection for database workload materialized view selection
US6356891B1 (en) 2000-04-20 2002-03-12 Microsoft Corporation Identifying indexes on materialized views for database workload
US7007008B2 (en) 2000-08-08 2006-02-28 America Online, Inc. Category searching
GB0029159D0 (en) 2000-11-29 2001-01-17 Calaba Ltd Data storage and retrieval system
US20020124214A1 (en) 2001-03-01 2002-09-05 International Business Machines Corporation Method and system for eliminating duplicate reported errors in a logically partitioned multiprocessing system
US20040128282A1 (en) 2001-03-07 2004-07-01 Paul Kleinberger System and method for computer searching
US20030022200A1 (en) 2001-03-25 2003-01-30 Henrik Vissing Systems for analysis of biological materials
US6912549B2 (en) * 2001-09-05 2005-06-28 Siemens Medical Solutions Health Services Corporation System for processing and consolidating records
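JP3803961B2 (ja) * 2001-12-05 2006-08-02 日本電信電話株式会社 Database generation device, database generation processing method, and database generation program
JP3812818B2 (ja) * 2001-12-05 2006-08-23 日本電信電話株式会社 Database generation device, database generation method, and database generation processing program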
US7523127B2 (en) * 2002-01-14 2009-04-21 Testout Corporation System and method for a hierarchical database management system for educational training and competency testing simulations
US7139749B2 (en) 2002-03-19 2006-11-21 International Business Machines Corporation Method, system, and program for performance tuning a database query
US7152060B2 (en) * 2002-04-11 2006-12-19 Choicemaker Technologies, Inc. Automated database blocking and record matching
US6961721B2 (en) * 2002-06-28 2005-11-01 Microsoft Corporation Detecting duplicate records in database
US7953694B2 (en) 2003-01-13 2011-05-31 International Business Machines Corporation Method, system, and program for specifying multidimensional calculations for a relational OLAP engine
US20050027717A1 (en) * 2003-04-21 2005-02-03 Nikolaos Koudas Text joins for data cleansing and integration in a relational database management system
US7774312B2 (en) 2003-09-04 2010-08-10 Oracle International Corporation Self-managing performance statistics repository for databases
US20050125401A1 (en) * 2003-12-05 2005-06-09 Hewlett-Packard Development Company, L. P. Wizard for usage in real-time aggregation and scoring in an information handling system
US7779386B2 (en) 2003-12-08 2010-08-17 Ebay Inc. Method and system to automatically regenerate software code
US7281004B2 (en) 2004-02-27 2007-10-09 International Business Machines Corporation Method, system and program for optimizing compression of a workload processed by a database management system

Also Published As

Publication number Publication date
JP2006072985A (ja) 2006-03-16
KR20060050069A (ko) 2006-05-19
CN1744083A (zh) 2006-03-08
DE602005012192D1 (de) 2009-02-26
KR101153113B1 (ko) 2012-06-04
EP1630698A1 (de) 2006-03-01
CN100520776C (zh) 2009-07-29
US20060053129A1 (en) 2006-03-09
JP4814570B2 (ja) 2011-11-16
EP1630698B1 (de) 2009-01-07
US7516149B2 (en) 2009-04-07

Similar Documents

Publication Publication Date Title
ATE420406T1 (de) Robust detection of fuzzy duplicate records in a database
ATE443886T1 (de) Cryptographic processing of data based on the Cassels-Tate pairing
NO20060501L (no) Methods and system for understanding the meaning of a knowledge item using information associated with the knowledge item
WO2005089238A3 (en) Knowledge management system with integrated product document management for computer-aided design modeling
WO2007106403A3 (en) Methods and systems to generate rules to identify data items
BRPI0412184A (pt) Rendering advertisements with documents having one or more topics using user topic interest information
WO2005119551A3 (en) Method and system to evaluate anti-money laundering risk
CA2343370A1 (en) Root cause analysis in a distributed network management architecture
WO2006105250A3 (en) Apparatus, system, and method for internet trade
TW200707279A (en) Task scheduling to devices with same connection address
DE602006020306D1 (de) Distributed and repeated image restoration
WO2007001896A3 (en) Identification and risk evaluation
WO2009054839A3 (en) Template based matching
WO2006036578A3 (en) Method for finding paths in video
WO2007030255A3 (en) System, method, and software for implementing business rules in an entity
WO2005033870A3 (en) Method for creating and using text objects as control devices
ATE305825T1 (de) Method and device for processing mail items
EP1557752A3 (de) Distributed semantic schema
ATE484974T1 (de) Luminous container
DE60104544D1 (de) Integration of a self-positioning feature in individual parts
WO2020044268A3 (en) Determining a diagnostic associated with an electronic smoking article
ATE452465T1 (de) Amplifier device, method and system
WO2006130312A3 (en) Methods and apparatus for locating devices
DE10290891T1 (de) Photosensitive flexographic device with associated heat-addressable mask
ATE384179T1 (de) Device for distributing adhesive

Legal Events

Date Code Title Description
RER Ceased as to paragraph 5 lit. 3 law introducing patent treaties