CN100520776C - Robust detector of fuzzy duplicates - Google Patents

Robust detector of fuzzy duplicates

Publication number
CN100520776C
Authority
CN
China
Prior art keywords
tuples
computing
neighborhood
polynary group
fuzzy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB2005100885171A
Other languages
English (en)
Chinese (zh)
Other versions
CN1744083A (zh)
Inventor
R·莫特瓦尼
S·乔德里
V·甘提
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2004-08-30
Filing date
2005-07-29
Publication date
2009-07-29
Application filed by Microsoft Corp
Publication of CN1744083A
Application granted
Publication of CN100520776C
Anticipated expiration
Expired - Lifetime (current legal status)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99932 Access augmentation or optimizing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99937 Sorting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99942 Manipulating data structure, e.g. compression, compaction, compilation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99943 Generating database or data structure, e.g. via user interface
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99944 Object-oriented database structure
    • Y10S707/99945 Object-oriented database structure processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stereo-Broadcasting Methods (AREA)
  • Circuits Of Receivers In General (AREA)
CNB2005100885171A 2004-08-30 2005-07-29 Robust detector of fuzzy duplicates Expired - Lifetime CN100520776C (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/929,514 US7516149B2 (en) 2004-08-30 2004-08-30 Robust detector of fuzzy duplicates
US10/929,514 2004-08-30

Publications (2)

Publication Number Publication Date
CN1744083A (zh) 2006-03-08
CN100520776C (zh) 2009-07-29

Family

ID=35219700

Family Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100885171A Expired - Lifetime CN100520776C (zh) 2004-08-30 2005-07-29 Robust detector of fuzzy duplicates

Country Status (7)

Country Link
US (1) US7516149B2
EP (1) EP1630698B1
JP (1) JP4814570B2
KR (1) KR101153113B1
CN (1) CN100520776C
AT (1) ATE420406T1
DE (1) DE602005012192D1

Families Citing this family (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8732004B1 (en) 2004-09-22 2014-05-20 Experian Information Solutions, Inc. Automated analysis of data to generate prospect notifications based on trigger events
US20070016501A1 (en) 2004-10-29 2007-01-18 American Express Travel Related Services Co., Inc., A New York Corporation Using commercial share of wallet to rate business prospects
US8326672B2 (en) 2004-10-29 2012-12-04 American Express Travel Related Services Company, Inc. Using commercial share of wallet in financial databases
US7822665B2 (en) 2004-10-29 2010-10-26 American Express Travel Related Services Company, Inc. Using commercial share of wallet in private equity investments
US8630929B2 (en) 2004-10-29 2014-01-14 American Express Travel Related Services Company, Inc. Using commercial share of wallet to make lending decisions
US7792732B2 (en) 2004-10-29 2010-09-07 American Express Travel Related Services Company, Inc. Using commercial share of wallet to rate investments
US8204774B2 (en) * 2004-10-29 2012-06-19 American Express Travel Related Services Company, Inc. Estimating the spend capacity of consumer households
US8086509B2 (en) 2004-10-29 2011-12-27 American Express Travel Related Services Company, Inc. Determining commercial share of wallet
US7912770B2 (en) * 2004-10-29 2011-03-22 American Express Travel Related Services Company, Inc. Method and apparatus for consumer interaction based on spend capacity
US7788147B2 (en) 2004-10-29 2010-08-31 American Express Travel Related Services Company, Inc. Method and apparatus for estimating the spend capacity of consumers
US8543499B2 (en) 2004-10-29 2013-09-24 American Express Travel Related Services Company, Inc. Reducing risks related to check verification
US8131614B2 (en) 2004-10-29 2012-03-06 American Express Travel Related Services Company, Inc. Using commercial share of wallet to compile marketing company lists
US20070244732A1 (en) 2004-10-29 2007-10-18 American Express Travel Related Services Co., Inc., A New York Corporation Using commercial share of wallet to manage vendors
US7840484B2 (en) 2004-10-29 2010-11-23 American Express Travel Related Services Company, Inc. Credit score and scorecard development
US8326671B2 (en) 2004-10-29 2012-12-04 American Express Travel Related Services Company, Inc. Using commercial share of wallet to analyze vendors in online marketplaces
US20080243680A1 (en) * 2005-10-24 2008-10-02 Megdal Myles G Method and apparatus for rating asset-backed securities
US20080033852A1 (en) * 2005-10-24 2008-02-07 Megdal Myles G Computer-based modeling of spending behaviors of entities
US8036979B1 (en) 2006-10-05 2011-10-11 Experian Information Solutions, Inc. System and method for generating a finance attribute from tradeline data
US8239250B2 (en) 2006-12-01 2012-08-07 American Express Travel Related Services Company, Inc. Industry size of wallet
US8606666B1 (en) 2007-01-31 2013-12-10 Experian Information Solutions, Inc. System and method for providing an aggregation tool
US8606626B1 (en) 2007-01-31 2013-12-10 Experian Information Solutions, Inc. Systems and methods for providing a direct marketing campaign planning environment
US7827153B2 (en) * 2007-12-19 2010-11-02 Sap Ag System and method to perform bulk operation database cleanup
US20110004578A1 (en) * 2008-02-22 2011-01-06 Michinari Momma Active metric learning device, active metric learning method, and program
US20100161542A1 (en) * 2008-12-22 2010-06-24 International Business Machines Corporation Detecting entity relevance due to a multiplicity of distinct values for an attribute type
US9910875B2 (en) 2008-12-22 2018-03-06 International Business Machines Corporation Best-value determination rules for an entity resolution system
US8200640B2 (en) 2009-06-15 2012-06-12 Microsoft Corporation Declarative framework for deduplication
US8176407B2 (en) * 2010-03-02 2012-05-08 Microsoft Corporation Comparing values of a bounded domain
US9652802B1 (en) 2010-03-24 2017-05-16 Consumerinfo.Com, Inc. Indirect monitoring and reporting of a user's credit data
US9361008B2 (en) * 2010-05-12 2016-06-07 Moog Inc. Result-oriented configuration of performance parameters
US8473410B1 (en) 2012-02-23 2013-06-25 American Express Travel Related Services Company, Inc. Systems and methods for identifying financial relationships
US8781954B2 (en) 2012-02-23 2014-07-15 American Express Travel Related Services Company, Inc. Systems and methods for identifying financial relationships
US9477988B2 (en) 2012-02-23 2016-10-25 American Express Travel Related Services Company, Inc. Systems and methods for identifying financial relationships
US8538869B1 (en) 2012-02-23 2013-09-17 American Express Travel Related Services Company, Inc. Systems and methods for identifying financial relationships
US9336302B1 (en) 2012-07-20 2016-05-10 Zuci Realty Llc Insight and algorithmic clustering for automated synthesis
CN104516900A (zh) * 2013-09-29 2015-04-15 International Business Machines Corporation Clustering method and apparatus for multiple sequence data
US9892158B2 (en) * 2014-01-31 2018-02-13 International Business Machines Corporation Dynamically adjust duplicate skipping method for increased performance
US10262362B1 (en) 2014-02-14 2019-04-16 Experian Information Solutions, Inc. Automatic generation of code for attributes
US10387389B2 (en) * 2014-09-30 2019-08-20 International Business Machines Corporation Data de-duplication
US10445152B1 (en) 2014-12-19 2019-10-15 Experian Information Solutions, Inc. Systems and methods for dynamic report generation based on automatic modeling of complex data structures
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US11055327B2 (en) 2018-07-01 2021-07-06 Quadient Technologies France Unstructured data parsing for structured information
US11301440B2 (en) 2020-06-18 2022-04-12 Lexisnexis Risk Solutions, Inc. Fuzzy search using field-level deletion neighborhoods

Family Cites Families (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5924090A (en) 1997-05-01 1999-07-13 Northern Light Technology Llc Method and apparatus for searching a database of records
US5940821A (en) * 1997-05-21 1999-08-17 Oracle Corporation Information presentation in a knowledge base search and retrieval system
US5950186A (en) 1997-08-15 1999-09-07 Microsoft Corporation Database system index selection using cost evaluation of a workload for multiple candidate index configurations
US5913207A (en) 1997-08-15 1999-06-15 Microsoft Corporation Database system index selection using index configuration enumeration for a workload
US5960423A (en) 1997-08-15 1999-09-28 Microsoft Corporation Database system index selection using candidate index selection for a workload
US5926813A (en) 1997-08-15 1999-07-20 Microsoft Corporation Database system index selection using cost evaluation of a workload for multiple candidate index configurations
US5913206A (en) 1997-08-15 1999-06-15 Microsoft Corporation Database system multi-column index selection for a workload
US5966702A (en) 1997-10-31 1999-10-12 Sun Microsystems, Inc. Method and apparatus for pre-processing and packaging class files
US6182066B1 (en) * 1997-11-26 2001-01-30 International Business Machines Corp. Category processing of query topics and electronic document content topics
US6169983B1 (en) 1998-05-30 2001-01-02 Microsoft Corporation Index merging for database systems
US6223171B1 (en) 1998-08-25 2001-04-24 Microsoft Corporation What-if index analysis utility for database systems
US6460045B1 (en) 1999-03-15 2002-10-01 Microsoft Corporation Self-tuning histogram and database modeling
US6374241B1 (en) 1999-03-31 2002-04-16 Verizon Laboratories Inc. Data merging techniques
US6363371B1 (en) 1999-06-29 2002-03-26 Microsoft Corporation Identifying essential statistics for query optimization for databases
US6529901B1 (en) 1999-06-29 2003-03-04 Microsoft Corporation Automating statistics management for query optimizers
US6691108B2 (en) 1999-12-14 2004-02-10 Nec Corporation Focused search engine and method
US6266658B1 (en) 2000-04-20 2001-07-24 Microsoft Corporation Index tuner for given workload
US6356890B1 (en) 2000-04-20 2002-03-12 Microsoft Corporation Merging materialized view pairs for database workload materialized view selection
US6356891B1 (en) 2000-04-20 2002-03-12 Microsoft Corporation Identifying indexes on materialized views for database workload
US6513029B1 (en) 2000-04-20 2003-01-28 Microsoft Corporation Interesting table-subset selection for database workload materialized view selection
US6366903B1 (en) 2000-04-20 2002-04-02 Microsoft Corporation Index and materialized view selection for a given workload
US7007008B2 (en) 2000-08-08 2006-02-28 America Online, Inc. Category searching
GB0029159D0 (en) * 2000-11-29 2001-01-17 Calaba Ltd Data storage and retrieval system
US20020124214A1 (en) 2001-03-01 2002-09-05 International Business Machines Corporation Method and system for eliminating duplicate reported errors in a logically partitioned multiprocessing system
US20040128282A1 (en) 2001-03-07 2004-07-01 Paul Kleinberger System and method for computer searching
AU2002309152A1 (en) 2001-03-25 2002-10-08 Exiqon A/S Systems for analysis of biological materials
US6912549B2 (en) * 2001-09-05 2005-06-28 Siemens Medical Solutions Health Services Corporation System for processing and consolidating records
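JP3812818B2 (ja) * 2001-12-05 2006-08-23 Nippon Telegraph and Telephone Corporation Database generation apparatus, database generation method, and database generation processing program
JP3803961B2 (ja) * 2001-12-05 2006-08-02 Nippon Telegraph and Telephone Corporation Database generation apparatus, database generation processing method, and database generation program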
US7523127B2 (en) * 2002-01-14 2009-04-21 Testout Corporation System and method for a hierarchical database management system for educational training and competency testing simulations
US7139749B2 (en) 2002-03-19 2006-11-21 International Business Machines Corporation Method, system, and program for performance tuning a database query
US7152060B2 (en) * 2002-04-11 2006-12-19 Choicemaker Technologies, Inc. Automated database blocking and record matching
US6961721B2 (en) * 2002-06-28 2005-11-01 Microsoft Corporation Detecting duplicate records in database
US7953694B2 (en) * 2003-01-13 2011-05-31 International Business Machines Corporation Method, system, and program for specifying multidimensional calculations for a relational OLAP engine
US20050027717A1 (en) * 2003-04-21 2005-02-03 Nikolaos Koudas Text joins for data cleansing and integration in a relational database management system
US7774312B2 (en) 2003-09-04 2010-08-10 Oracle International Corporation Self-managing performance statistics repository for databases
US20050125401A1 (en) * 2003-12-05 2005-06-09 Hewlett-Packard Development Company, L. P. Wizard for usage in real-time aggregation and scoring in an information handling system
WO2005057364A2 (en) 2003-12-08 2005-06-23 Ebay Inc. Custom caching
US7281004B2 (en) 2004-02-27 2007-10-09 International Business Machines Corporation Method, system and program for optimizing compression of a workload processed by a database management system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. William W. Cohen et al. ACM. 2002 *

Also Published As

Publication number Publication date
KR20060050069A (ko) 2006-05-19
US20060053129A1 (en) 2006-03-09
EP1630698B1 (en) 2009-01-07
JP4814570B2 (ja) 2011-11-16
EP1630698A1 (en) 2006-03-01
KR101153113B1 (ko) 2012-06-04
CN1744083A (zh) 2006-03-08
US7516149B2 (en) 2009-04-07
ATE420406T1 (de) 2009-01-15
JP2006072985A (ja) 2006-03-16
DE602005012192D1 (de) 2009-02-26

Similar Documents

Publication Publication Date Title
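CN100520776C (zh) Robust detector of fuzzy duplicates
CN106528693B (zh) Educational resource recommendation method and system for personalized learning
CN112771564B (zh) Artificial intelligence engine that generates semantic directions for websites for automated entity targeting to mapped identities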
US6360224B1 (en) Fast extraction of one-way and two-way counts from sparse data
US20160210301A1 (en) Context-Aware Query Suggestion by Mining Log Data
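CN111563192B (zh) Entity alignment method and apparatus, electronic device, and storage medium
CN102609465B (zh) Information recommendation method based on latent communities
Feldman et al. iDiary: From GPS signals to a text-searchable diary
CN110297990A (zh) Joint detection method and system for crowdsourced marketing microblogs and paid posters
CN111581479B (zh) One-stop data processing method and apparatus, storage medium, and electronic device
CN115131058B (zh) Account identification method, apparatus, device, and storage medium
CN116860981A (zh) Potential customer mining method and apparatus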
Feldman et al. The single pixel GPS: learning big data signals from tiny coresets
US9020962B2 (en) Interest expansion using a taxonomy
US11650987B2 (en) Query response using semantically similar database records
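CN117056550B (zh) Long-tail image retrieval method, system, device, and storage medium
CN112069227B (zh) Causal modeling method and apparatus for event sequences
CN115587192A (zh) Relation information extraction method, device, and computer-readable storage medium
CN115757826B (zh) Event graph construction method, apparatus, device, and medium
CN113177854B (zh) Community division method and system, electronic device, and storage medium
CN107944045A (zh) Image retrieval method and system based on t-distribution hashing
CN114969761A (zh) Log anomaly detection method based on LDA topic features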
Salamat Heterogeneous Graph-Based Neural Network for Social Recommendations with Balanced Random Walk Initialization
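CN118779659B (zh) Sample labeling method, rule correlation measurement method, apparatus, device, and medium
CN113568929B (zh) Data storage and query method, apparatus, and electronic device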

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: MICROSOFT TECHNOLOGY LICENSING LLC

Free format text: FORMER OWNER: MICROSOFT CORP.

Effective date: 20150424

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150424

Address after: Washington State

Patentee after: MICROSOFT TECHNOLOGY LICENSING, LLC

Address before: Washington State

Patentee before: Microsoft Corp.

CX01 Expiry of patent term

Granted publication date: 20090729