WO2006122086A3 - Matching engine with signature generation and relevance detection - Google Patents
- Publication number
- WO2006122086A3 (application PCT/US2006/017846)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- token
- document
- text
- signature generation
- matching engine
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
Abstract
A system and a method generate at least one signature associated with a document. In one embodiment, a document comprising text is received and parsed to generate a token set. The token set includes a plurality of tokens. Each token corresponds to text in the document that is separated by a predefined character characteristic. A score is calculated for each token in the token set based on the frequency and distribution of the text in the document. Each token is then ranked based on the calculated score. A subset of the ranked tokens is selected, and a signature is generated for each occurrence of the selected tokens. The selected list of signatures is then output.
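The steps named in the abstract (parse, score, rank, select, sign each occurrence) can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract gives neither the delimiter rule, the scoring formula, nor the signature function, so the choices here are assumptions made only for illustration. Tokens are split on non-alphanumeric characters, the score is frequency multiplied by positional spread, and each signature is a SHA-1 digest of a small window of neighboring tokens. The names `tokenize`, `score_tokens`, `generate_signatures`, `top_n`, and `window` are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def tokenize(text):
    """Split text on non-alphanumeric characters (an assumed delimiter rule)."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def score_tokens(tokens):
    """Score each distinct token from its frequency and positional spread.

    The formula (frequency * spread) is an illustrative assumption.
    """
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    total = len(tokens)
    scores = {}
    for token, where in positions.items():
        frequency = len(where)
        # Fraction of the document spanned from first to last occurrence.
        spread = (where[-1] - where[0] + 1) / total
        scores[token] = frequency * spread
    return scores, positions


def generate_signatures(text, top_n=3, window=2):
    """Return (token, position, digest) for every occurrence of the top tokens."""
    tokens = tokenize(text)
    if not tokens:
        return []
    scores, positions = score_tokens(tokens)
    # Rank by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    signatures = []
    for token in ranked[:top_n]:
        for pos in positions[token]:
            # Assumed signature: hash of the token plus its neighbors.
            context = tokens[max(0, pos - window): pos + window + 1]
            digest = hashlib.sha1(" ".join(context).encode("utf-8")).hexdigest()
            signatures.append((token, pos, digest))
    return signatures


if __name__ == "__main__":
    for token, pos, digest in generate_signatures(
        "alpha beta alpha gamma alpha beta", top_n=2, window=1
    ):
        print(token, pos, digest[:12])
```

Hashing a window around each occurrence means two documents that share a passage produce some identical signatures even when the rest of the text differs, which is the kind of property a matching engine could exploit. The window size and hash function would need tuning for any real use.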
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2006800227288A CN101248433B (en) | 2005-05-09 | 2006-05-08 | Matching engine with signature generation and relevance detection |
JP2008511259A JP5072832B2 (en) | 2005-05-09 | 2006-05-08 | Signature generation and matching engine with relevance |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US60/679,314 (US67931405P) | 2005-05-09 | 2005-05-09 | |
US11/361,447 (US7747642B2) | 2005-05-09 | 2006-02-24 | Matching engine for querying relevant documents |
US11/361,340 (US7516130B2) | 2005-05-09 | 2006-02-24 | Matching engine with signature generation |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2006122086A2 (en) | 2006-11-16 |
WO2006122086A3 (en) | 2007-03-29 |
Family
ID=37397221
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2006/017846 WO2006122086A2 (en) | 2005-05-09 | 2006-05-08 | Matching engine with signature generation and relevance detection |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP5072832B2 (en) |
CN (1) | CN101248433B (en) |
WO (1) | WO2006122086A2 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7516130B2 (en) * | 2005-05-09 | 2009-04-07 | Trend Micro, Inc. | Matching engine with signature generation |
US7860853B2 (en) * | 2007-02-14 | 2010-12-28 | Provilla, Inc. | Document matching engine using asymmetric signature generation |
JP5372853B2 (en) | 2010-07-08 | 2013-12-18 | 株式会社日立製作所 | Digital sequence feature amount calculation method and digital sequence feature amount calculation apparatus |
JP5617674B2 (en) * | 2011-02-14 | 2014-11-05 | 日本電気株式会社 | Inter-document similarity calculation apparatus, inter-document similarity calculation method, and inter-document similarity calculation program |
CN107798637A (en) * | 2016-08-30 | 2018-03-13 | 北京国双科技有限公司 | The different acquisition methods and device for sentencing document of accomplice |
CN112580108A (en) * | 2020-12-10 | 2021-03-30 | 深圳证券信息有限公司 | Signature and seal integrity verification method and computer equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6493709B1 (en) * | 1998-07-31 | 2002-12-10 | The Regents Of The University Of California | Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment |
US6584470B2 (en) * | 2001-03-01 | 2003-06-24 | Intelliseek, Inc. | Multi-layered semiotic mechanism for answering natural language questions using document retrieval combined with information extraction |
US20030172066A1 (en) * | 2002-01-22 | 2003-09-11 | International Business Machines Corporation | System and method for detecting duplicate and similar documents |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5325091A (en) * | 1992-08-13 | 1994-06-28 | Xerox Corporation | Text-compression technique using frequency-ordered array of word-number mappers |
JP2758826B2 (en) * | 1994-03-02 | 1998-05-28 | 株式会社リコー | Document search device |
JPH09293079A (en) * | 1996-04-18 | 1997-11-11 | Internatl Business Mach Corp <Ibm> | Information retrieving method, information retrieving device and storage medium for storing information retrieving program |
EP0961210A1 (en) * | 1998-05-29 | 1999-12-01 | Xerox Corporation | Signature file based semantic caching of queries |
CN1369839A (en) * | 2001-02-16 | 2002-09-18 | 意蓝科技股份有限公司 | File association judging system and method |
JP2002269116A (en) * | 2001-03-13 | 2002-09-20 | Ricoh Co Ltd | System and program for retrieving document |
JP3719666B2 (en) * | 2001-07-12 | 2005-11-24 | 松下電器産業株式会社 | Document verification device |
- 2006-05-08: CN application CN2006800227288A (CN101248433B), status: Active
- 2006-05-08: WO application PCT/US2006/017846 (WO2006122086A2), status: Application Filing
- 2006-05-08: JP application JP2008511259A (JP5072832B2), status: Active
Also Published As
Publication number | Publication date |
---|---|
CN101248433A (en) | 2008-08-20 |
JP2008541272A (en) | 2008-11-20 |
WO2006122086A2 (en) | 2006-11-16 |
JP5072832B2 (en) | 2012-11-14 |
CN101248433B (en) | 2010-09-01 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2006122086A3 (en) | Matching engine with signature generation and relevance detection | |
Seidman | Authorship verification using the impostors method | |
AU2018200396B2 (en) | A method and system for extraction | |
WO2007084836A3 (en) | Match-based employment system and method | |
WO2007100916A3 (en) | Systems, methods, and media for outputting a dataset based upon anomaly detection | |
Peters | The Cambridge dictionary of English grammar | |
WO2007089289A3 (en) | Method for ranking and sorting electronic documents in a search result list based on relevance | |
WO2005070111A3 (en) | Content presentation and management system associating base content and relevant additional content | |
WO2008033780A3 (en) | Recommending advertising key phrases | |
WO2009038981A3 (en) | System and method to generate a software framework based on semantic modeling and business rules | |
WO2010019567A8 (en) | Signed digital documents | |
WO2010039519A3 (en) | Methods and apparatus related to document processing based on a document type | |
WO2004086192A3 (en) | Systems and methods for interactive search query refinement | |
WO2010008800A3 (en) | Query identification and association | |
WO2005076101A3 (en) | System and method for securing computers against computer virus | |
WO2007033468A3 (en) | System and method configuring contextual based content with publisher content for display on a user interface | |
WO2007084852A3 (en) | Systems and methods for providing sorted search results | |
EP1752906A3 (en) | Information processing apparatus and method | |
GB2490070A (en) | Systems and methods for ranking documents | |
Caramazza et al. | X-ray flares in Orion low-mass stars | |
WO2009029675A3 (en) | Method and system for data context service | |
CN103207904A (en) | Method for delivering search results and search engine | |
NZ601639A (en) | Method and system for conducting legal research using clustering analytics | |
Na et al. | Improving opinion retrieval based on query-specific sentiment lexicon | |
AU2003272014A1 (en) | Method, device and computer program for detecting point correspondences in sets of points |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | WWE | WIPO information: entry into national phase | Ref document number: 200680022728.8; Country of ref document: CN |
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
 | ENP | Entry into the national phase | Ref document number: 2008511259; Country of ref document: JP; Kind code of ref document: A |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | NENP | Non-entry into the national phase | Ref country code: RU |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 06759366; Country of ref document: EP; Kind code of ref document: A2 |