WO2006122086A3 - Matching engine with signature generation and relevance detection - Google Patents

Matching engine with signature generation and relevance detection

Info

Publication number
WO2006122086A3
WO2006122086A3 PCT/US2006/017846 US2006017846W WO2006122086A3 WO 2006122086 A3 WO2006122086 A3 WO 2006122086A3 US 2006017846 W US2006017846 W US 2006017846W WO 2006122086 A3 WO2006122086 A3 WO 2006122086A3
Authority
WO
WIPO (PCT)
Prior art keywords
token
document
text
signature generation
matching engine
Prior art date
Application number
PCT/US2006/017846
Other languages
French (fr)
Other versions
WO2006122086A2 (en)
Inventor
Liwei Ren
Dehua Tan
Fei Huang
Shu Huang
Aiguo Dong
Original Assignee
Dgate Technologies Inc
Liwei Ren
Dehua Tan
Fei Huang
Shu Huang
Aiguo Dong
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/361,447 (patent US7747642B2)
Priority claimed from US11/361,340 (patent US7516130B2)
Application filed by Dgate Technologies Inc, Liwei Ren, Dehua Tan, Fei Huang, Shu Huang, Aiguo Dong
Priority to CN2006800227288A (patent CN101248433B)
Priority to JP2008511259A (patent JP5072832B2)
Publication of WO2006122086A2
Publication of WO2006122086A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Abstract

A system and a method generate at least one signature associated with a document. In one embodiment, a document comprised of text is received and parsed to generate a token set. The token set includes a plurality of tokens. Each token corresponds to the text in the document that is separated by a predefined character characteristic. A score is calculated for each token in the token set based on a frequency and distribution of the text in the document. Each token is then ranked based on the calculated score. A subset of the ranked tokens is selected and a signature is generated for each occurrence of the selected tokens. The selected list of signatures is then output.
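
The steps named in the abstract (parse into tokens, score, rank, select, generate one signature per occurrence) can be sketched roughly as follows. This is an illustrative sketch only, not the claimed method: the abstract does not give the scoring formula, the delimiter rule, or the signature function, so the choices here are assumptions. Tokens are taken to be runs of alphanumeric characters, the score is taken to be frequency multiplied by the distance between the first and last occurrence, and each signature is taken to be a SHA-1 hash of a fixed-size text window starting at the occurrence. The names `tokenize`, `score_tokens`, `generate_signatures`, `top_n`, and `window` are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def tokenize(text):
    """Return (token, start_offset) pairs.

    Assumption: the 'predefined character characteristic' is treated as
    'any non-alphanumeric character separates tokens'.
    """
    return [(m.group().lower(), m.start())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def score_tokens(occurrences):
    """Score each distinct token from its frequency and its distribution.

    Assumption: score = frequency * (1 + distance between first and last
    occurrence). The abstract only says the score depends on both.
    """
    positions = defaultdict(list)
    for token, pos in occurrences:
        positions[token].append(pos)
    scores = {}
    for token, pos_list in positions.items():
        frequency = len(pos_list)
        spread = pos_list[-1] - pos_list[0]
        scores[token] = frequency * (1 + spread)
    return scores, positions


def generate_signatures(text, top_n=2, window=20):
    """Rank tokens, keep the top_n, and emit one signature per occurrence."""
    scores, positions = score_tokens(tokenize(text))
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    signatures = []
    for token in ranked[:top_n]:
        for pos in positions[token]:
            # Assumption: a signature is a hash of the text window that
            # begins at the occurrence of the selected token.
            snippet = text[pos:pos + window].lower()
            digest = hashlib.sha1(snippet.encode("utf-8")).hexdigest()
            signatures.append((token, pos, digest))
    return signatures
```

Calling `generate_signatures(document_text)` returns a list of `(token, offset, digest)` tuples. Two documents that share a selected token followed by the same text would produce some identical digests, which is the kind of overlap a matching step could look for.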
PCT/US2006/017846 2005-05-09 2006-05-08 Matching engine with signature generation and relevance detection WO2006122086A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2006800227288A CN101248433B (en) 2005-05-09 2006-05-08 Matching engine with signature generation and relevance detection
JP2008511259A JP5072832B2 (en) 2005-05-09 2006-05-08 Signature generation and matching engine with relevance

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US67931405P 2005-05-09 2005-05-09
US60/679,314 2005-05-09
US11/361,447 2006-02-24
US11/361,447 US7747642B2 (en) 2005-05-09 2006-02-24 Matching engine for querying relevant documents
US11/361,340 2006-02-24
US11/361,340 US7516130B2 (en) 2005-05-09 2006-02-24 Matching engine with signature generation

Publications (2)

Publication Number Publication Date
WO2006122086A2 WO2006122086A2 (en) 2006-11-16
WO2006122086A3 WO2006122086A3 (en) 2007-03-29

Family

ID=37397221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/017846 WO2006122086A2 (en) 2005-05-09 2006-05-08 Matching engine with signature generation and relevance detection

Country Status (3)

Country Link
JP (1) JP5072832B2 (en)
CN (1) CN101248433B (en)
WO (1) WO2006122086A2 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7516130B2 (en) * 2005-05-09 2009-04-07 Trend Micro, Inc. Matching engine with signature generation
US7860853B2 (en) * 2007-02-14 2010-12-28 Provilla, Inc. Document matching engine using asymmetric signature generation
JP5372853B2 (en) 2010-07-08 2013-12-18 株式会社日立製作所 Digital sequence feature amount calculation method and digital sequence feature amount calculation apparatus
JP5617674B2 (en) * 2011-02-14 2014-11-05 日本電気株式会社 Inter-document similarity calculation apparatus, inter-document similarity calculation method, and inter-document similarity calculation program
CN107798637A (en) * 2016-08-30 2018-03-13 北京国双科技有限公司 The different acquisition methods and device for sentencing document of accomplice
CN112580108A (en) * 2020-12-10 2021-03-30 深圳证券信息有限公司 Signature and seal integrity verification method and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6493709B1 (en) * 1998-07-31 2002-12-10 The Regents Of The University Of California Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment
US6584470B2 (en) * 2001-03-01 2003-06-24 Intelliseek, Inc. Multi-layered semiotic mechanism for answering natural language questions using document retrieval combined with information extraction
US20030172066A1 (en) * 2002-01-22 2003-09-11 International Business Machines Corporation System and method for detecting duplicate and similar documents

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5325091A (en) * 1992-08-13 1994-06-28 Xerox Corporation Text-compression technique using frequency-ordered array of word-number mappers
JP2758826B2 (en) * 1994-03-02 1998-05-28 株式会社リコー Document search device
JPH09293079A (en) * 1996-04-18 1997-11-11 Internatl Business Mach Corp <Ibm> Information retrieving method, information retrieving device and storage medium for storing information retrieving program
EP0961210A1 (en) * 1998-05-29 1999-12-01 Xerox Corporation Signature file based semantic caching of queries
CN1369839A (en) * 2001-02-16 2002-09-18 意蓝科技股份有限公司 File association judging system and method
JP2002269116A (en) * 2001-03-13 2002-09-20 Ricoh Co Ltd System and program for retrieving document
JP3719666B2 (en) * 2001-07-12 2005-11-24 松下電器産業株式会社 Document verification device

Also Published As

Publication number Publication date
CN101248433A (en) 2008-08-20
JP2008541272A (en) 2008-11-20
WO2006122086A2 (en) 2006-11-16
JP5072832B2 (en) 2012-11-14
CN101248433B (en) 2010-09-01

Similar Documents

Publication Publication Date Title
WO2006122086A3 (en) Matching engine with signature generation and relevance detection
Seidman Authorship verification using the impostors method
AU2018200396B2 (en) A method and system for extraction
WO2007084836A3 (en) Match-based employment system and method
WO2007100916A3 (en) Systems, methods, and media for outputting a dataset based upon anomaly detection
Peters The Cambridge dictionary of English grammar
WO2007089289A3 (en) Method for ranking and sorting electronic documents in a search result list based on relevance
WO2005070111A3 (en) Content presentation and management system associating base content and relevant additional content
WO2008033780A3 (en) Recommending advertising key phrases
WO2009038981A3 (en) System and method to generate a software framework based on semantic modeling and business rules
WO2010019567A8 (en) Signed digital documents
WO2010039519A3 (en) Methods and apparatus related to document processing based on a document type
WO2004086192A3 (en) Systems and methods for interactive search query refinement
WO2010008800A3 (en) Query identification and association
WO2005076101A3 (en) System and method for securing computers against computer virus
WO2007033468A3 (en) System and method configuring contextual based content with publisher content for display on a user interface
WO2007084852A3 (en) Systems and methods for providing sorted search results
EP1752906A3 (en) Information processing apparatus and method
GB2490070A (en) Systems and methods for ranking documents
Caramazza et al. X-ray flares in Orion low-mass stars
WO2009029675A3 (en) Method and system for data context service
CN103207904A (en) Method for delivering search results and search engine
NZ601639A (en) Method and system for conducting legal research using clustering analytics
Na et al. Improving opinion retrieval based on query-specific sentiment lexicon
AU2003272014A1 (en) Method, device and computer program for detecting point correspondences in sets of points

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200680022728.8

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase

Ref document number: 2008511259

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06759366

Country of ref document: EP

Kind code of ref document: A2