System for similar document detection

Info

Publication number
AU2001277283A1
Authority
AU
Australia
Prior art keywords
document detection
similar document
similar
detection
document
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2001277283A
Inventor
Abdur R. Chowdhury
Ophir Frieder
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IIT Research Institute
Original Assignee
IIT Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2000-07-31
Filing date
2001-07-31
Publication date
2002-02-13
Application filed by IIT Research Institute
Publication of AU2001277283A1
Legal status: Abandoned (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99948 Application of database or data structure, e.g. distributed, multimedia, or image
AU2001277283A 2000-07-31 2001-07-31 System for similar document detection Abandoned AU2001277283A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/629,175 2000-07-31
US09/629,175 US7660819B1 (en) 2000-07-31 2000-07-31 System for similar document detection
PCT/US2001/041464 WO2002010967A2 (en) 2000-07-31 2001-07-31 System for similar document detection

Publications (1)

Publication Number Publication Date
AU2001277283A1 (en) 2002-02-13

Family

Family ID: 24521911

Family Applications (1)

Application Number Priority Date Filing Date Title
AU2001277283A Abandoned AU2001277283A1 (en) 2000-07-31 2001-07-31 System for similar document detection

Country Status (3)

Country Link
US (3) US7660819B1 (en)
AU (1) AU2001277283A1 (en)
WO (1) WO2002010967A2 (en)

Families Citing this family (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1316397C (en) * 2001-02-12 2007-05-16 Emc公司 System and method of indexing unique electronic mail messages and uses for same
US20040003028A1 (en) * 2002-05-08 2004-01-01 David Emmett Automatic display of web content to smaller display devices: improved summarization and navigation
US8868543B1 (en) * 2002-11-20 2014-10-21 Google Inc. Finding web pages relevant to multimedia streams
US8732245B2 (en) * 2002-12-03 2014-05-20 Blackberry Limited Method, system and computer software product for pre-selecting a folder for a message
BR0317764A (en) * 2002-12-27 2006-02-21 Intellectual Property Bank technology assessment device, technology assessment program, and technology assessment method
US7246309B2 (en) 2003-04-23 2007-07-17 Electronic Data Systems Corporation Validating one or more data blocks in a computer-implemented document derived from another computer-implemented document
ATE492859T1 (en) * 2004-04-26 2011-01-15 Adalbert Gubo DEVICE FOR ENCODING AND MARKING DOCUMENTS FOR RECOGNITION AND RECOVERY
WO2006008733A2 (en) * 2004-07-21 2006-01-26 Equivio Ltd. A method for determining near duplicate data objects
US8725705B2 (en) * 2004-09-15 2014-05-13 International Business Machines Corporation Systems and methods for searching of storage data with reduced bandwidth requirements
US7523098B2 (en) * 2004-09-15 2009-04-21 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
US7623534B1 (en) 2005-09-09 2009-11-24 At&T Intellectual Property I, Lp Method and systems for content access and distribution
WO2007086059A2 (en) * 2006-01-25 2007-08-02 Equivio Ltd. Determining near duplicate 'noisy' data objects
EP1999565A4 (en) * 2006-03-03 2012-01-11 Perfect Search Corp Hyperspace index
US7676465B2 (en) * 2006-07-05 2010-03-09 Yahoo! Inc. Techniques for clustering structurally similar web pages based on page features
US7941420B2 (en) * 2007-08-14 2011-05-10 Yahoo! Inc. Method for organizing structurally similar web pages from a web site
GB2440174A (en) * 2006-07-19 2008-01-23 Chronicle Solutions Determining similarity of electronic documents by comparing hashed alphabetically ordered phrases
US20090012984A1 (en) * 2007-07-02 2009-01-08 Equivio Ltd. Method for Organizing Large Numbers of Documents
US8122032B2 (en) * 2007-07-20 2012-02-21 Google Inc. Identifying and linking similar passages in a digital text corpus
US9323827B2 (en) * 2007-07-20 2016-04-26 Google Inc. Identifying key terms related to similar passages
US7912840B2 (en) 2007-08-30 2011-03-22 Perfect Search Corporation Indexing and filtering using composite data stores
US7870133B2 (en) 2008-01-14 2011-01-11 Infosys Technologies Ltd. Method for semantic based storage and retrieval of information
US7930306B2 (en) * 2008-04-30 2011-04-19 Msc Intellectual Properties B.V. System and method for near and exact de-duplication of documents
US8234655B2 (en) 2008-07-29 2012-07-31 International Business Machines Corporation Detection of duplicate memory pages across guest operating systems on a shared host
US8549327B2 (en) 2008-10-27 2013-10-01 Bank Of America Corporation Background service process for local collection of data in an electronic discovery system
US20100169311A1 (en) * 2008-12-30 2010-07-01 Ashwin Tengli Approaches for the unsupervised creation of structural templates for electronic documents
US8489612B2 (en) * 2009-03-24 2013-07-16 Hewlett-Packard Development Company, L.P. Identifying similar files in an environment having multiple client computers
US9721227B2 (en) 2009-03-27 2017-08-01 Bank Of America Corporation Custodian management system
US8572227B2 (en) 2009-03-27 2013-10-29 Bank Of America Corporation Methods and apparatuses for communicating preservation notices and surveys
US8572376B2 (en) 2009-03-27 2013-10-29 Bank Of America Corporation Decryption of electronic communication in an electronic discovery enterprise system
US8417716B2 (en) 2009-03-27 2013-04-09 Bank Of America Corporation Profile scanner
US8224924B2 (en) 2009-03-27 2012-07-17 Bank Of America Corporation Active email collector
US8806358B2 (en) 2009-03-27 2014-08-12 Bank Of America Corporation Positive identification and bulk addition of custodians to a case within an electronic discovery system
US8504489B2 (en) * 2009-03-27 2013-08-06 Bank Of America Corporation Predictive coding of documents in an electronic discovery system
US8200635B2 (en) 2009-03-27 2012-06-12 Bank Of America Corporation Labeling electronic data in an electronic discovery enterprise system
US9330374B2 (en) 2009-03-27 2016-05-03 Bank Of America Corporation Source-to-processing file conversion in an electronic discovery enterprise system
US8250037B2 (en) 2009-03-27 2012-08-21 Bank Of America Corporation Shared drive data collection tool for an electronic discovery system
US8364681B2 (en) 2009-03-27 2013-01-29 Bank Of America Corporation Electronic discovery system
US8311330B2 (en) * 2009-04-06 2012-11-13 Accenture Global Services Limited Method for the logical segmentation of contents
US9053454B2 (en) 2009-11-30 2015-06-09 Bank Of America Corporation Automated straight-through processing in an electronic discovery system
US9262390B2 (en) 2010-09-02 2016-02-16 Lexis Nexis, A Division Of Reed Elsevier Inc. Methods and systems for annotating electronic documents
US20120158742A1 (en) * 2010-12-17 2012-06-21 International Business Machines Corporation Managing documents using weighted prevalence data for statements
KR20120124581A (en) 2011-05-04 2012-11-14 엔에이치엔(주) Method, device and computer readable recording medium for improvded detection of similar documents
CN102831127B (en) * 2011-06-17 2015-04-22 阿里巴巴集团控股有限公司 Method, device and system for processing repeating data
US9407463B2 (en) * 2011-07-11 2016-08-02 Aol Inc. Systems and methods for providing a spam database and identifying spam communications
US8996350B1 (en) 2011-11-02 2015-03-31 Dub Software Group, Inc. System and method for automatic document management
TWI484357B (en) * 2011-12-02 2015-05-11 Inst Information Industry Quantitative-type data analysis method and quantitative-type data analysis device
KR101453867B1 (en) * 2012-08-02 2014-10-23 주식회사 와이즈넛 Method of copy detection visualizing copy sections with a unified document tpye
DE102012025349A1 (en) * 2012-12-21 2014-06-26 Docuware Gmbh Determination of a similarity measure and processing of documents
US10108590B2 (en) * 2013-05-03 2018-10-23 International Business Machines Corporation Comparing markup language files
US9734195B1 (en) * 2013-05-16 2017-08-15 Veritas Technologies Llc Automated data flow tracking
WO2015078231A1 (en) * 2013-11-26 2015-06-04 优视科技有限公司 Method for generating webpage template and server
US10318523B2 (en) * 2014-02-06 2019-06-11 The Johns Hopkins University Apparatus and method for aligning token sequences with block permutations
US20150317314A1 (en) * 2014-04-30 2015-11-05 Linkedln Corporation Content search vertical
US9171173B1 (en) * 2014-10-02 2015-10-27 Terbium Labs LLC Protected indexing and querying of large sets of textual data
US9984166B2 (en) 2014-10-10 2018-05-29 Salesforce.Com, Inc. Systems and methods of de-duplicating similar news feed items
US10592841B2 (en) 2014-10-10 2020-03-17 Salesforce.Com, Inc. Automatic clustering by topic and prioritizing online feed items
WO2016116171A1 (en) * 2015-01-23 2016-07-28 Telefonaktiebolaget Lm Ericsson (Publ) Methods and apparatus for obtaining a scoped token
US10114900B2 (en) 2015-03-23 2018-10-30 Virtru Corporation Methods and systems for generating probabilistically searchable messages
US10089382B2 (en) * 2015-10-19 2018-10-02 Conduent Business Services, Llc Transforming a knowledge base into a machine readable format for an automated system
US11200217B2 (en) 2016-05-26 2021-12-14 Perfect Search Corporation Structured document indexing and searching
US10191942B2 (en) * 2016-10-14 2019-01-29 Sap Se Reducing comparisons for token-based entity resolution
US20200210504A1 (en) * 2018-12-28 2020-07-02 Go Daddy Operating Company, LLC Recommending domains from free text
KR102289408B1 (en) * 2019-09-03 2021-08-12 국민대학교산학협력단 Search device and search method based on hash code
US11593439B1 (en) * 2022-05-23 2023-02-28 Onetrust Llc Identifying similar documents in a file repository using unique document signatures

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6119124A (en) * 1998-03-26 2000-09-12 Digital Equipment Corporation Method for clustering closely resembling data objects
US6493709B1 (en) * 1998-07-31 2002-12-10 The Regents Of The University Of California Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment
US6240409B1 (en) * 1998-07-31 2001-05-29 The Regents Of The University Of California Method and apparatus for detecting and summarizing document similarity within large document sets
US6547829B1 (en) * 1999-06-30 2003-04-15 Microsoft Corporation Method and system for detecting duplicate documents in web crawls
US6594665B1 (en) * 2000-02-18 2003-07-15 Intel Corporation Storing hashed values of data in media to allow faster searches and comparison of data

Also Published As

Publication number Publication date
WO2002010967A3 (en) 2003-12-04
WO2002010967A2 (en) 2002-02-07
US20120197913A1 (en) 2012-08-02
US8560546B2 (en) 2013-10-15
US20100169329A1 (en) 2010-07-01
US7660819B1 (en) 2010-02-09
US8131724B2 (en) 2012-03-06

Similar Documents

Publication Publication Date Title
AU2001277283A1 (en) System for similar document detection
AU2001293998A1 (en) Detection system
AU2001288268A1 (en) Near object detection system
AU2001249343A1 (en) Location detection system
AU2001239746A1 (en) Group-browsing system
AU2002228807A1 (en) Inspection system
AU2002225807A1 (en) Object detection
AU2001269318A1 (en) Document retrieval system
AU2001271023A1 (en) Service processing system
AU2001271498A1 (en) Glint-resistant position determination system
AU2001242020A1 (en) Imaging system
AU2001281633A1 (en) Detection method
AU3668300A (en) Watermark system
AUPR050700A0 (en) Detection method
AUPQ667800A0 (en) Detection method
AU2001252431A1 (en) Document indexing system
AU2001284775A1 (en) Anti-balling system
AU4086001A (en) Sensor systems
AU1716001A (en) Detection system
AU2815200A (en) Image detection system
AU2001277785A1 (en) Estrous detection system
AU2001260461A1 (en) Location system
AU2001289885A1 (en) Air-preparation system
AU2002231008A1 (en) Free analyte detection system
AU3016501A (en) Detection system