WO2003105028A2 - Virtual document searching - Google Patents


Publication number
WO2003105028A2
Authority
WO
WIPO (PCT)
Prior art keywords
sub
query
documents
queries
document
Prior art date
Application number
PCT/US2003/018379
Other languages
English (en)
Other versions
WO2003105028A3 (fr)
Inventor
Gerald J. Morton
Elizabeth S. Lund
Original Assignee
West Publishing Company, Dba West Group
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West Publishing Company, Dba West Group
Priority to AU2003239970A
Publication of WO2003105028A2
Publication of WO2003105028A3


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users

Definitions

  • West Headnotes are shared in Case documents, Statute documents, and Analytical documents. Due to the current limitations of West's search engine, these headnotes must be physically replicated in every document in which they occur. This results in increased complexities and costs in maintenance as well as storage.
  • Virtual Document Searching will allow documents and their individual component parts to be stored separately and assembled as virtual documents for search purposes. This will allow shared component parts to be stored only once, reducing the maintenance complexities and storage costs associated with storing these parts multiple times.
  • Document searching utilizes a variety of operations that allow a researcher to define what he or she is looking for. These operations can be generally classified into two distinct categories: document level operations and word level operations.
  • Document level operations are those that qualify documents strictly by the presence or absence of query components within a document. AND and OR are examples of document level operations. The query dog and cat qualifies documents as long as both dog and cat are present in the document.
  • Word level operations are those that qualify documents based on the actual word positions of query components within a document. Phrases and word proximities are examples of word level operations.
  • The query "dog house" qualifies documents as long as dog and house are not only in the same document, but also occur within one word position of each other in the order specified.
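The distinction between the two categories can be illustrated with a minimal sketch. The toy documents, the identifiers `d1` to `d3`, and the helper names `doc_and` and `phrase` are assumptions made for illustration; they are not taken from the patent or from any real search engine.

```python
from typing import Dict, List

# Toy collection: document id -> words in order (invented data).
DOCS: Dict[str, List[str]] = {
    "d1": "the dog sleeps in the dog house".split(),
    "d2": "the house has a dog and a cat".split(),
    "d3": "a cat sat on the mat".split(),
}

def doc_and(a: str, b: str) -> List[str]:
    """Document level AND: both words occur anywhere in the document."""
    return sorted(d for d, words in DOCS.items() if a in words and b in words)

def phrase(a: str, b: str) -> List[str]:
    """Word level phrase: b occurs exactly one position after a."""
    return sorted(
        d for d, words in DOCS.items()
        if any(x == a and y == b for x, y in zip(words, words[1:]))
    )

print(doc_and("dog", "cat"))    # only needs presence of both words
print(phrase("dog", "house"))   # needs adjacent word positions
```

Note that `doc_and("dog", "house")` accepts `d2` even though the two words are far apart there, whereas `phrase("dog", "house")` does not, because the phrase test depends on word positions.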
  • Assembling virtual documents for display purposes is not a difficult technical problem. For search purposes, however, word positions cannot be assigned that provide a single continuous sequencing of words across the entire virtual document. Therefore, word level search operations would not work.
  • A fundamental premise of this invention is that in many if not all circumstances, there isn't truly a need to have word level search operations work across components of a virtual document. Word level operations are clearly necessary within components of a virtual document, but when it comes to spanning components of a virtual document, all that is needed are document level operations.
  • Virtual documents are created by encoding a reference to a sub-document into a document as opposed to physically copying the sub-document as is done today.
  • The virtual document search engine then operates as follows.
  • 1) A query is parsed into component sub-queries based on the document level operators.
  • 2) The component sub-queries are processed using the standard search engine to produce lists of (component) documents that satisfy each sub-query.
  • 3) A virtual document query is constructed using the results from the component sub-queries together with the operators by which the original query was parsed in step 1.
  • 4) The virtual document query is processed using the standard search engine.
  • The final result identifies virtual documents that matched the original query even though parts of the query were satisfied by physically separate sub-documents of the virtual document. Please see Figure 1.
  • The first step in the virtual document search process is to parse the input query into sub-queries.
  • Search queries are generally represented as tree structures where the intermediate nodes in the tree represent operations, the branches in the tree represent sub-trees, and the leaves of the tree represent actual searchable entities (e.g. words).
  • The goal of parsing is to create sub-queries out of the lowest possible sub-trees of the query. Since document level operations are supported at the virtual document level when searching across components, parsing can continue down through these operators. However, parsing must stop when the first word level operators are reached. This is because word level operators are only supported within a single component document.
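The parsing rule just described can be sketched as a short recursive walk over a query tree. The `Node` class, the operator names in `DOC_LEVEL` and `WORD_LEVEL`, and the function `split_subqueries` are hypothetical names chosen for this illustration; the patent does not specify a data structure.

```python
from dataclasses import dataclass
from typing import List, Union

DOC_LEVEL = {"AND", "OR"}        # operators that may span components
WORD_LEVEL = {"PHRASE", "NEAR"}  # operators confined to one component

@dataclass
class Node:
    op: str
    children: List["Tree"]

Tree = Union[Node, str]  # a leaf is a plain searchable word

def split_subqueries(tree: Tree) -> List[Tree]:
    """Descend through document level operators and stop at the first
    word level operator (or leaf), emitting that sub-tree whole."""
    if isinstance(tree, Node) and tree.op in DOC_LEVEL:
        subs: List[Tree] = []
        for child in tree.children:
            subs.extend(split_subqueries(child))
        return subs
    return [tree]

# "dog house" AND (cat OR (bird NEAR cage))
query = Node("AND", [
    Node("PHRASE", ["dog", "house"]),
    Node("OR", ["cat", Node("NEAR", ["bird", "cage"])]),
])
for sub in split_subqueries(query):
    print(sub)
```

In this sketch the AND and OR nodes are traversed, while the PHRASE and NEAR sub-trees are kept intact, so each emitted sub-query can be evaluated inside a single component document.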
  • The standard search engine is used to process each sub-query against the set of candidate documents.
  • The results from each sub-query search are a set of (component) documents that satisfy the criteria for that sub-query. Since the sub-queries were created by parsing only down to the first word level operator, at the completion of this step, all word level operations specified in this initial query have been satisfied. All that remains is to combine these results based on the document level operations that are above the sub-query trees in the original query tree.
  • The coding of a virtual document involves placing a special type of reference in the virtual document that points to the component document that is to be considered part of this virtual document. Assuming documents are coded in XML, a simple example of such a reference is the following:
  • GUID component document identifier
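As a rough illustration of how such references might be read back out of a virtual document, the sketch below assumes a hypothetical element name, `component.ref`, whose text is the GUID of the component document. The element name, the sample document, and the GUID values are all invented; the patent's own markup is not reproduced here.

```python
import xml.etree.ElementTree as ET
from typing import List

REF_TAG = "component.ref"  # hypothetical element name, not from the patent

# Invented virtual document holding two component references.
VIRTUAL_DOC = """
<case>
  <title>Example case</title>
  <component.ref>guid-0001</component.ref>
  <body>Opinion text.</body>
  <component.ref>guid-0002</component.ref>
</case>
"""

def component_ids(xml_text: str) -> List[str]:
    """Return the GUIDs of all component documents referenced by the
    virtual document, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(REF_TAG) if el.text]

print(component_ids(VIRTUAL_DOC))
```

Storing only the identifier is what lets a shared component, such as a headnote, exist once while appearing in many virtual documents.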
  • Results from the individual sub-query searches are used to replace the corresponding sub-trees in the original query to form the virtual document query.
  • An element restriction search is also used to properly focus the search on occurrences of the sub-documents when used as component documents within a virtual document.
  • FIGS. 3, 4 and 5 show examples of this construction process.
  • The queries used in these examples are the same ones used earlier to show the initial query parsing process.
  • The standard search engine is again used to process this query against the set of candidate documents.
  • The results from this search are the virtual documents that, either by themselves or in conjunction with their component parts, satisfy the original query.
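The construction and processing of the virtual document query can be approximated in a few lines. This is a simplified sketch under stated assumptions: the collections `COMPONENTS` and `VIRTUAL`, the `refs` field, and the helpers `has_phrase` and `search_and` are invented, only a top-level AND is handled, and the lookup of component hits through `refs` stands in for the element restriction search described above.

```python
from typing import Callable, Dict, List, Set

# Hypothetical collections, invented for illustration only.
COMPONENTS: Dict[str, List[str]] = {
    "h1": "beware of the dog house".split(),
    "h2": "the cat sat".split(),
}
# Each virtual document has its own words plus references to components.
VIRTUAL: Dict[str, Dict[str, List[str]]] = {
    "v1": {"words": "a case about a cat".split(), "refs": ["h1"]},
    "v2": {"words": "a statute".split(), "refs": ["h2"]},
    "v3": {"words": "dog house rules".split(), "refs": ["h2"]},
}

SubQuery = Callable[[List[str]], bool]

def has_phrase(*phrase: str) -> SubQuery:
    """Word level sub-query: the words occur consecutively, in order."""
    n = len(phrase)
    def test(words: List[str]) -> bool:
        return any(tuple(words[i:i + n]) == phrase
                   for i in range(len(words) - n + 1))
    return test

def search_and(subqueries: List[SubQuery]) -> Set[str]:
    """Evaluate sub-queries joined by a document level AND."""
    # Run each sub-query against the component collection.
    component_hits = [
        {cid for cid, words in COMPONENTS.items() if q(words)}
        for q in subqueries
    ]
    # A virtual document satisfies a sub-query directly, or through
    # a referenced component that satisfied it.
    matched = set()
    for vid, vdoc in VIRTUAL.items():
        if all(q(vdoc["words"]) or bool(hits & set(vdoc["refs"]))
               for q, hits in zip(subqueries, component_hits)):
            matched.add(vid)
    return matched

# "dog house" AND cat
print(sorted(search_and([has_phrase("dog", "house"), has_phrase("cat")])))
```

Here `v1` qualifies because the phrase is found in its component `h1` and cat in its own text, while `v3` qualifies the other way round; no word positions are ever compared across component boundaries.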
  • Component documents reside in a separate document collection from the virtual documents that contain them.
  • However, provided that appropriate control information is maintained to indicate which document collections are to be used at each step, virtual documents and component documents can be maintained in the same or separate document collections.
  • Another searching feature is to be able to restrict a search to within a particular field or logical part of a document such as a title or summary section. In XML this can translate into restricting a search to within a particular element.
  • An example query is summary(dog), which requests occurrences of dog within the summary field or XML element.
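A field restriction of this kind can be sketched with Python's standard `xml.etree.ElementTree` module. The sample documents and the function name `element_search` are assumptions for illustration and do not reflect how Westlaw implements field searching.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented documents with title and summary elements.
DOCS_XML: Dict[str, str] = {
    "d1": "<doc><title>Cat care</title><summary>How to walk a dog</summary></doc>",
    "d2": "<doc><title>Dog training</title><summary>Basic commands</summary></doc>",
}

def element_search(element: str, word: str) -> List[str]:
    """Return ids of documents where `word` occurs inside `element`."""
    hits = []
    for doc_id, xml_text in DOCS_XML.items():
        root = ET.fromstring(xml_text)
        for el in root.iter(element):
            # Collect all text under the element, lower-cased and tokenized.
            tokens = "".join(el.itertext()).lower().split()
            if word.lower() in tokens:
                hits.append(doc_id)
                break
    return sorted(hits)

print(element_search("summary", "dog"))  # the query summary(dog)
```

Document `d2` mentions dog only in its title, so the restricted query skips it even though an unrestricted search for dog would not.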
  • Date searching (as used in Westlaw) is strictly a document level operation and would therefore only be applied to the virtual document query. It would have no impact on the sub-queries created during the initial parsing phase.
  • West Headnotes are currently incorporated into cases, statutes, and analytical materials.
  • Headnotes are physically instantiated in each document within which they reside. This results in duplication of headnotes which has a direct impact on storage costs as well as update costs.
  • The exemplary software would allow headnotes (or other sub-documents) to be stored once while still supporting the search functionality that makes them appear as part of another document.
  • Virtual documents are created by encoding a reference to a sub-document into a document as opposed to physically copying the sub-document as is done today.
  • The exemplary virtual document search engine then operates as follows. 1) Queries are parsed into sub-queries at the lowest level of "document level" operators (e.g. &). 2) Sub-queries are processed using the standard search engine to produce lists of documents that satisfy each sub-query. 3) The sub-query results are merged together through the original operators by which the original query was parsed in step 1. 4) Steps 2 and 3 may be repeated recursively if virtual document structures may be nested many levels deep. The final result would identify virtual documents that matched the original query even though parts of the query were satisfied by physically separate sub-documents of the virtual document.
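One possible reading of the recursive case in step 4 is sketched below: a document satisfies a single-word sub-query if its own text matches or if any sub-document it references, at any depth, matches. The data, the document names, and the functions `matches` and `search_all` are invented, and the cycle guard is an added precaution rather than something the patent describes.

```python
from typing import Dict, List, Optional, Set, Tuple

# doc id -> (own words, referenced sub-document ids). Invented data.
DOCS: Dict[str, Tuple[List[str], List[str]]] = {
    "case": ("opinion text".split(), ["headnote"]),
    "headnote": ("summary of holding".split(), ["keynote"]),
    "keynote": ("negligence standard".split(), []),
    "statute": ("statutory text".split(), []),
}

def matches(doc_id: str, word: str, seen: Optional[Set[str]] = None) -> bool:
    """True if the document or any nested sub-document contains `word`."""
    seen = set() if seen is None else seen
    if doc_id in seen:  # guard against reference cycles
        return False
    seen.add(doc_id)
    words, refs = DOCS[doc_id]
    return word in words or any(matches(r, word, seen) for r in refs)

def search_all(words: List[str]) -> List[str]:
    """Document level AND over single-word sub-queries."""
    return sorted(d for d in DOCS if all(matches(d, w) for w in words))

print(search_all(["opinion", "negligence"]))
```

In this example the `case` document matches both words even though negligence appears two reference levels below it.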

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An exemplary search system parses a received query into one or more sub-queries, each parsed sub-query comprising one or more search operators. The system then processes the sub-queries against one or more databases to produce corresponding sub-query result sets, each set comprising a list of documents. The sub-query result sets are then combined according to one or more search operators to produce a query result set for the received query, the query result set identifying at least one virtual document that itself comprises one or more sub-documents. A virtual document is presented to the party that generated the query as a unified document.
PCT/US2003/018379 2002-06-07 2003-06-09 Virtual document searching WO2003105028A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003239970A AU2003239970A1 (en) 2002-06-07 2003-06-09 Virtual document searching

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US38704002P 2002-06-07 2002-06-07
US60/387,040 2002-06-07

Publications (2)

Publication Number Publication Date
WO2003105028A2 2003-12-18
WO2003105028A3 2004-06-24

Family

ID=29736253

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/018379 WO2003105028A2 (en) 2002-06-07 2003-06-09 Virtual document searching

Country Status (2)

Country Link
AU (1) AU2003239970A1 (fr)
WO (1) WO2003105028A2 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050283465A1 (en) * 2004-06-17 2005-12-22 International Business Machines Corporation Method to provide management of query output
US20140164388A1 (en) * 2012-12-10 2014-06-12 Microsoft Corporation Query and index over documents

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5873080A (en) * 1996-09-20 1999-02-16 International Business Machines Corporation Using multiple search engines to search multimedia data
US5907837A (en) * 1995-07-17 1999-05-25 Microsoft Corporation Information retrieval system in an on-line network including separate content and layout of published titles
US6009422A (en) * 1997-11-26 1999-12-28 International Business Machines Corporation System and method for query translation/semantic translation using generalized query language

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ARNOLD-MOORE T ET AL: "SYSTEM ARCHITECTURES FOR STRUCTURED DOCUMENT DATA" , MARKUP LANGUAGES, MIT PRESS, CAMBRIDGE, MA, US, VOL. 2, NR. 1, PAGE(S) 11-39 XP001009528 ISSN: 1099-6621 page 18, paragraph 4 -page 19, paragraph 3 *
BARTA D ET AL: "A System for Document Reuse" , COMPUTER SYSTEMS AND SOFTWARE ENGINEERING, 1996., PROCEEDINGS OF THE SEVENTH ISRAELI CONFERENCE ON HERZLIYA, ISRAEL 12-13 JUNE 1996, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, PAGE(S) 83-94 XP010200464 ISBN: 0-8186-7536-5 page 86, right-hand column, line 44 -page 91, left-hand column, line 23 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050283465A1 (en) * 2004-06-17 2005-12-22 International Business Machines Corporation Method to provide management of query output
US7370030B2 (en) * 2004-06-17 2008-05-06 International Business Machines Corporation Method to provide management of query output
US7844623B2 (en) 2004-06-17 2010-11-30 International Business Machines Corporation Method to provide management of query output
US20140164388A1 (en) * 2012-12-10 2014-06-12 Microsoft Corporation Query and index over documents
US9208254B2 (en) * 2012-12-10 2015-12-08 Microsoft Technology Licensing, Llc Query and index over documents

Also Published As

Publication number Publication date
WO2003105028A3 (fr) 2004-06-24
AU2003239970A1 (en) 2003-12-22

Similar Documents

Publication Publication Date Title
Theobald et al. The index-based XXL search engine for querying XML data with relevance ranking
US20180101528A1 (en) Multiple index based information retrieval system
US6795820B2 (en) Metasearch technique that ranks documents obtained from multiple collections
US7062507B2 (en) Indexing profile for efficient and scalable XML based publish and subscribe system
USRE36727E (en) Method of indexing and retrieval of electronically-stored documents
US8612427B2 (en) Information retrieval system for archiving multiple document versions
CA2337079C (en) System and method for retrieving data and its use in a search engine
US7266553B1 (en) Content data indexing
US20170177713A1 (en) Systems and Method for Searching an Index
US20060206466A1 (en) Evaluating relevance of results in a semi-structured data-base system
WO2012082859A1 (en) High-efficiency prefix search algorithm supporting interactive fuzzy search on geographical structured data
JP2011175670A (en) Phrase-based searching method in an information retrieval system
JP2006048686A (en) Method for generating phrase-based document descriptions
CN105843960B (en) Indexing method and system based on a semantic tree
WO2003105028A2 (en) Virtual document searching
US20110022591A1 (en) Pre-computed ranking using proximity terms
KR100434718B1 (en) Document indexing system and method thereof
Aggarwal Information Retrieval and Search Engines
Zeng et al. Supporting range queries in XML keyword search
KR100440906B1 (en) Document indexing system and method thereof
Grün A generic framework for querying and updating secondary XML index structures
Pandey et al. A Novel Approach for Extraction of Relevant Web Pages from WWW Using Data Mining
Fegaras XQuery processing with relevance ranking
Barouni‐Ebrahimi et al. An interactive search assistant architecture based on intrinsic query stream characteristics
LAXMI et al. Searching Text-Rich XML Documents with Relevance Ranking

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP