WO2003105028A2 - Virtual document searching - Google Patents
Virtual document searching
- Publication number
- WO2003105028A2 (application PCT/US2003/018379)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sub
- query
- documents
- queries
- document
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Definitions
- West Headnotes are shared in Case documents, Statute documents, and Analytical documents. Due to the current limitations of West's search engine, these headnotes must be physically replicated in every document in which they occur. This results in increased complexities and costs in maintenance as well as storage.
- Virtual Document Searching will allow documents and their individual component parts to be stored separately and assembled as virtual documents for search purposes. This will allow shared component parts to be stored only once, reducing the maintenance complexities and storage costs associated with storing these parts multiple times.
- Document searching utilizes a variety of operations that allow a researcher to define what he or she is looking for. These operations can be generally classified into two distinct categories: document level operations and word level operations.
- Document level operations are those that qualify documents strictly by the presence or absence of query components within a document. AND and OR are examples of document level operations. The query dog and cat qualifies documents as long as both dog and cat are present in the document.
- Word level operations are those that qualify documents based on the actual word positions of query components within a document. Phrases and word proximities are examples of word level operations.
- The query "dog house" qualifies documents as long as dog and house are not only in the same document, but also occur within one word position of each other in the order specified.
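The difference between the two categories can be illustrated with a small positional index. This is only a sketch: the function names (`build_index`, `docs_and`, `docs_phrase`) and the two sample documents are assumptions made for illustration and do not come from the patent text.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [word positions]} (a toy positional index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def docs_and(index, a, b):
    """Document level AND: both words occur anywhere in the document."""
    return set(index.get(a, {})) & set(index.get(b, {}))


def docs_phrase(index, a, b):
    """Word level phrase: b occurs exactly one position after a."""
    hits = set()
    for doc_id in docs_and(index, a, b):
        positions_b = set(index[b][doc_id])
        if any(p + 1 in positions_b for p in index[a][doc_id]):
            hits.add(doc_id)
    return hits


# Hypothetical sample documents.
docs = {
    "d1": "the dog sleeps in the house",
    "d2": "a dog house in the yard",
}
index = build_index(docs)

print(sorted(docs_and(index, "dog", "house")))     # ['d1', 'd2']
print(sorted(docs_phrase(index, "dog", "house")))  # ['d2']
```

The AND needs only the document identifiers, while the phrase needs the word positions, which is why only the latter depends on a continuous word sequence.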
- Assembling virtual documents for display purposes is not a difficult technical problem. Searching them is harder: because the component parts are stored separately, word positions cannot be assigned that provide a single continuous sequencing of words across the entire virtual document. Therefore, word level search operations would not work across components.
- A fundamental premise of this invention is that in many, if not all, circumstances there is no real need for word level search operations to work across components of a virtual document. Word level operations are clearly necessary within components of a virtual document, but when it comes to spanning components of a virtual document, all that is needed are document level operations.
- Virtual documents are created by encoding a reference to a sub-document into a document as opposed to physically copying the sub-document as is done today.
- The virtual document search engine then operates as follows.
- 1) A query is parsed into component sub-queries based on the document level operators.
- 2) The component sub-queries are processed using the standard search engine to produce lists of (component) documents that satisfy each sub-query.
- 3) A virtual document query is constructed using the results from the component sub-queries together with the operators by which the original query was parsed in step 1.
- 4) The virtual document query is processed using the standard search engine.
- The final result identifies virtual documents that matched the original query even though parts of the query were satisfied by physically separate sub-documents of the virtual document. Please see Figure 1.
- The first step in the virtual document search process is to parse the input query into sub-queries.
- Search queries are generally represented as tree structures where the intermediate nodes in the tree represent operations, the branches in the tree represent sub-trees, and the leaves of the tree represent actual searchable entities (e.g. words).
- The goal of parsing is to create sub-queries out of the lowest possible sub-trees of the query. Since document level operations are supported at the virtual document level when searching across components, parsing can continue down through these operators. However, parsing must stop when the first word level operators are reached. This is because word level operators are only supported within a single component document.
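The parsing rule described above can be sketched as a short recursive walk over a query tree. The tuple-based tree encoding, the operator names (`AND`, `OR`, `PHRASE`, `PROXIMITY`, `WORD`) and the `SUBQUERY` placeholder are assumptions chosen for this illustration; the patent does not prescribe a concrete representation.

```python
# Operators that may span components of a virtual document (assumed names).
DOCUMENT_LEVEL = {"AND", "OR"}


def split_subqueries(node, found=None):
    """Descend through document level operators and cut off every other sub-tree.

    Returns (skeleton, subqueries): the skeleton keeps the AND/OR structure and
    holds ("SUBQUERY", i) placeholders; subqueries[i] is the sub-tree cut at
    that point (a word level operation or a single word).
    """
    if found is None:
        found = []
    op = node[0]
    if op in DOCUMENT_LEVEL:
        children = tuple(split_subqueries(child, found)[0] for child in node[1:])
        return (op,) + children, found
    # First non-document-level node reached: parsing stops here.
    found.append(node)
    return ("SUBQUERY", len(found) - 1), found


# "dog house" AND (cat OR bird within 3 words of cage)
query = (
    "AND",
    ("PHRASE", "dog", "house"),
    ("OR", ("WORD", "cat"), ("PROXIMITY", 3, "bird", "cage")),
)
skeleton, subqueries = split_subqueries(query)
print(skeleton)
print(subqueries)
```

The skeleton preserves exactly the operators that must later be applied at the virtual document level, which is what a subsequent query construction step would need.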
- The standard search engine is used to process each sub-query against the set of candidate documents.
- The results from each sub-query search are a set of (component) documents that satisfy the criteria for that sub-query. Since the sub-queries were created by parsing only down to the first word level operator, at the completion of this step, all word level operations specified in the initial query have been satisfied. All that remains is to combine these results based on the document level operations that are above the sub-query trees in the original query tree.
- The coding of a virtual document involves placing a special type of reference in the virtual document that points to the component document that is to be considered part of this virtual document. Assuming documents are coded in XML, a simple example of such a reference is the following:
- GUID component document identifier
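The markup of the original example is not reproduced on this page, so the following is a sketch under assumptions: the element name `component.ref`, the attribute name `guid`, and the sample case document are all hypothetical. It shows how references of this kind could be read out of a virtual document with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical virtual document: two headnotes are referenced, not copied.
VIRTUAL_DOC = """
<case id="case-1">
  <title>Smith v. Jones</title>
  <component.ref guid="hn-001"/>
  <component.ref guid="hn-002"/>
  <opinion>The dog escaped from the yard.</opinion>
</case>
"""


def component_guids(xml_text, ref_tag="component.ref"):
    """Return the GUIDs of all component documents referenced, in document order."""
    root = ET.fromstring(xml_text)
    return [el.get("guid") for el in root.iter(ref_tag)]


print(component_guids(VIRTUAL_DOC))  # ['hn-001', 'hn-002']
```

Because the GUID is stored as ordinary document content, a standard search engine can index it like any other term, which is what allows the later query over references.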
- Results from the individual sub-query searches are used to replace the corresponding sub-trees in the original query to form the virtual document query.
- An element restriction search is also used to properly focus the search on occurrences of the sub-documents when used as component documents within a virtual document.
- FIGS. 3, 4 and 5 show examples of this construction process.
- The queries used in these examples are the same ones used earlier to show the initial query parsing process.
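Since the figures are not available here, the construction step can only be sketched under assumptions. In the sketch below, a parsed query is a tuple tree whose sub-query positions are marked by hypothetical `("SUBQUERY", i)` placeholders, `results[i]` is the set of component document GUIDs returned for sub-query `i`, and the reference element is assumed to be called `component.ref`. Each placeholder is replaced by an element-restricted OR over the matching GUIDs.

```python
def build_virtual_query(skeleton, results, ref_element="component.ref"):
    """Replace each sub-query placeholder with a search for references to its hits.

    skeleton: tuple tree of document level operators and ("SUBQUERY", i) leaves.
    results:  results[i] is the set of component GUIDs that satisfied sub-query i.
    An empty result set yields an OR with no operands, which matches nothing.
    """
    op = skeleton[0]
    if op == "SUBQUERY":
        guids = sorted(results[skeleton[1]])
        any_guid = ("OR",) + tuple(("WORD", guid) for guid in guids)
        # Element restriction: a GUID only counts inside a reference element.
        return ("ELEMENT", ref_element, any_guid)
    return (op,) + tuple(
        build_virtual_query(child, results, ref_element) for child in skeleton[1:]
    )


# Hypothetical input: the query was "dog house" AND cat.
skeleton = ("AND", ("SUBQUERY", 0), ("SUBQUERY", 1))
results = [{"hn-002", "hn-001"}, {"hn-003"}]
virtual_query = build_virtual_query(skeleton, results)
print(virtual_query)
```

The element restriction keeps a GUID that merely appears in running text from being mistaken for a component reference.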
- The standard search engine is again used to process this query against the set of candidate documents.
- The results from this search are the virtual documents that, either by themselves or in conjunction with their component parts, satisfy the original query.
- Component documents reside in a separate document collection from the virtual documents that contain these component documents.
- As long as appropriate control information is maintained to indicate which document collections are to be used at each step, virtual documents and component documents can be maintained in the same or separate document collections.
- Another searching feature is to be able to restrict a search to within a particular field or logical part of a document such as a title or summary section. In XML this can translate into restricting a search to within a particular element.
- An example query is summary(dog), which requests occurrences of dog within the summary field or XML element.
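An element restriction of this kind can be sketched with Python's standard XML parser. The helper name `element_contains` and the two sample documents are assumptions for illustration only; a production search engine would answer this from its index rather than by parsing each document.

```python
import re
import xml.etree.ElementTree as ET


def element_contains(xml_text, element, word):
    """True if `word` occurs inside any <element> of the document."""
    root = ET.fromstring(xml_text)
    for el in root.iter(element):
        # Collect all text below the element and split it into lowercase words.
        words = re.findall(r"\w+", "".join(el.itertext()).lower())
        if word.lower() in words:
            return True
    return False


# Hypothetical documents: "dog" is in the summary of A but only in the body of B.
DOC_A = "<doc><summary>The dog barked.</summary><body>A cat slept.</body></doc>"
DOC_B = "<doc><summary>A cat slept.</summary><body>The dog barked.</body></doc>"

print(element_contains(DOC_A, "summary", "dog"))  # True
print(element_contains(DOC_B, "summary", "dog"))  # False
```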
- Date searching (as used in Westlaw) is strictly a document level operation and would therefore only be applied to the virtual document query. It would have no impact on the sub-queries created during the initial parsing phase.
- West Headnotes are currently incorporated into cases, statutes, and analytical materials.
- Headnotes are physically instantiated in each document within which they reside. This results in duplication of headnotes, which has a direct impact on storage costs as well as update costs.
- The exemplary software would allow headnotes (or other sub-documents) to be stored once while still supporting the search functionality that makes them appear as part of another document.
- Virtual documents are created by encoding a reference to a sub-document into a document as opposed to physically copying the sub-document as is done today.
- The exemplary virtual document search engine then operates as follows. 1) Queries are parsed into sub-queries at the lowest level of "document level" operators (e.g. &). 2) Sub-queries are processed using the standard search engine to produce lists of documents that satisfy each sub-query. 3) The sub-query results are merged together through the operators by which the original query was parsed in step 1. 4) Steps 2 and 3 may be repeated recursively if virtual document structures are nested many levels deep. The final result would identify virtual documents that matched the original query even though parts of the query were satisfied by physically separate sub-documents of the virtual document.
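The whole process, including nested virtual documents, can be sketched in memory as follows. This is a simplified model under stated assumptions, not the patented implementation: `run_subquery` stands in for the standard search engine, `lift` repeats the merge once per nesting level by adding every document that references an already matching document, and the document store, identifiers and operator names are hypothetical.

```python
# Hypothetical store: case-1 references hn-1, which in turn references note-1.
DOCS = {
    "case-1": {"text": "the court considered the cat", "refs": ["hn-1"]},
    "hn-1": {"text": "liability for a dog house fire", "refs": ["note-1"]},
    "note-1": {"text": "see also bird cage rulings", "refs": []},
    "case-2": {"text": "a dog entered the house with a cat", "refs": []},
}


def matches_locally(doc, sub):
    """Word level check against one physically stored document only."""
    words = doc["text"].split()
    if sub[0] == "WORD":
        return sub[1] in words
    if sub[0] == "PHRASE":
        return any(words[i:i + 2] == [sub[1], sub[2]] for i in range(len(words) - 1))
    raise ValueError(f"unsupported operator: {sub[0]}")


def run_subquery(sub, docs):
    """Stand-in for the standard search engine (step 2)."""
    return {doc_id for doc_id, doc in docs.items() if matches_locally(doc, sub)}


def lift(hits, docs):
    """Add documents that reference a matching document, one nesting level per pass."""
    hits = set(hits)
    while True:
        parents = {d for d, doc in docs.items() if hits & set(doc["refs"])} - hits
        if not parents:
            return hits
        hits |= parents


def evaluate(query, docs):
    """Merge sub-query results through the document level operators (step 3)."""
    op = query[0]
    if op == "AND":
        return set.intersection(*(evaluate(q, docs) for q in query[1:]))
    if op == "OR":
        return set.union(*(evaluate(q, docs) for q in query[1:]))
    return lift(run_subquery(query, docs), docs)


query = ("AND", ("PHRASE", "dog", "house"), ("OR", ("WORD", "cat"), ("WORD", "cage")))
print(sorted(evaluate(query, DOCS)))  # ['case-1', 'hn-1']
```

In this sample, case-2 contains dog, house and cat but is not returned, because the phrase must be satisfied within a single stored document; case-1 is returned because the phrase is satisfied by its component hn-1 and the word cat by its own text.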
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2003239970A AU2003239970A1 (en) | 2002-06-07 | 2003-06-09 | Virtual document searching |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US38704002P | 2002-06-07 | 2002-06-07 | |
US60/387,040 | 2002-06-07 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2003105028A2 (fr) | 2003-12-18 |
WO2003105028A3 (fr) | 2004-06-24 |
Family
ID=29736253
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2003/018379 WO2003105028A2 (fr) | Virtual document searching | 2002-06-07 | 2003-06-09 |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2003239970A1 (fr) |
WO (1) | WO2003105028A2 (fr) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5873080A (en) * | 1996-09-20 | 1999-02-16 | International Business Machines Corporation | Using multiple search engines to search multimedia data |
US5907837A (en) * | 1995-07-17 | 1999-05-25 | Microsoft Corporation | Information retrieval system in an on-line network including separate content and layout of published titles |
US6009422A (en) * | 1997-11-26 | 1999-12-28 | International Business Machines Corporation | System and method for query translation/semantic translation using generalized query language |
- 2003-06-09: WO application PCT/US2003/018379, published as WO2003105028A2 (fr); status: not active, application discontinued
- 2003-06-09: AU application AU2003239970A, published as AU2003239970A1 (en); status: not active, abandoned
Non-Patent Citations (2)
Title |
---|
ARNOLD-MOORE T ET AL: "SYSTEM ARCHITECTURES FOR STRUCTURED DOCUMENT DATA" , MARKUP LANGUAGES, MIT PRESS, CAMBRIDGE, MA, US, VOL. 2, NR. 1, PAGE(S) 11-39 XP001009528 ISSN: 1099-6621 page 18, paragraph 4 -page 19, paragraph 3 * |
BARTA D ET AL: "A System for Document Reuse" , COMPUTER SYSTEMS AND SOFTWARE ENGINEERING, 1996., PROCEEDINGS OF THE SEVENTH ISRAELI CONFERENCE ON HERZLIYA, ISRAEL 12-13 JUNE 1996, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, PAGE(S) 83-94 XP010200464 ISBN: 0-8186-7536-5 page 86, right-hand column, line 44 -page 91, left-hand column, line 23 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050283465A1 (en) * | 2004-06-17 | 2005-12-22 | International Business Machines Corporation | Method to provide management of query output |
US7370030B2 (en) * | 2004-06-17 | 2008-05-06 | International Business Machines Corporation | Method to provide management of query output |
US7844623B2 (en) | 2004-06-17 | 2010-11-30 | International Business Machines Corporation | Method to provide management of query output |
US20140164388A1 (en) * | 2012-12-10 | 2014-06-12 | Microsoft Corporation | Query and index over documents |
US9208254B2 (en) * | 2012-12-10 | 2015-12-08 | Microsoft Technology Licensing, Llc | Query and index over documents |
Also Published As
Publication number | Publication date |
---|---|
WO2003105028A3 (fr) | 2004-06-24 |
AU2003239970A1 (en) | 2003-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Theobald et al. | The index-based XXL search engine for querying XML data with relevance ranking | |
US20180101528A1 (en) | Multiple index based information retrieval system | |
US6795820B2 (en) | Metasearch technique that ranks documents obtained from multiple collections | |
US7062507B2 (en) | Indexing profile for efficient and scalable XML based publish and subscribe system | |
USRE36727E (en) | Method of indexing and retrieval of electronically-stored documents | |
US8612427B2 (en) | Information retrieval system for archiving multiple document versions | |
CA2337079C (fr) | System and method for retrieving data and its use in a search engine | |
US7266553B1 (en) | Content data indexing | |
US20170177713A1 (en) | Systems and Method for Searching an Index | |
US20060206466A1 (en) | Evaluating relevance of results in a semi-structured data-base system | |
WO2012082859A1 (fr) | High efficiency prefix search algorithm supporting interactive fuzzy search on geographical structured data | |
JP2011175670A (ja) | Phrase-based search method in an information retrieval system | |
JP2006048686A (ja) | Method for generating phrase-based document descriptions | |
CN105843960B (zh) | Indexing method and system based on a semantic tree | |
WO2003105028A2 (fr) | Virtual document searching | |
US20110022591A1 (en) | Pre-computed ranking using proximity terms | |
KR100434718B1 (ko) | Document indexing system and method thereof | |
Aggarwal | Information Retrieval and Search Engines | |
Zeng et al. | Supporting range queries in XML keyword search | |
KR100440906B1 (ko) | Document indexing system and method thereof | |
Grün | A generic framework for querying and updating secondary XML index structures | |
Pandey et al. | A Novel Approach for Extraction of Relevant Web Pages from WWW Using Data Mining | |
Fegaras | XQuery processing with relevance ranking | |
Barouni‐Ebrahimi et al. | An interactive search assistant architecture based on intrinsic query stream characteristics | |
LAXMI et al. | Searching Text-Rich XML Documents with Relevance Ranking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A2; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A2; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
122 | Ep: pct application non-entry in european phase | |
NENP | Non-entry into the national phase in: | Ref country code: JP |
WWW | Wipo information: withdrawn in national office | Country of ref document: JP |