WO2001024045A2 - Method, system, signals and media for indexing, searching and retrieving data based on context - Google Patents


Info

Publication number
WO2001024045A2
Authority
WO
WIPO (PCT)
Prior art keywords
search
data
structural components
context
database
Prior art date
Application number
PCT/CA2000/001042
Other languages
English (en)
Other versions
WO2001024045A3 (fr
Inventor
Chad Matthew Mackenzie
Finlay Cannon
Duane Allan Nickull
Jamie Michael Thomas Hoglund
Original Assignee
Xml-Global Technologies, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xml-Global Technologies, Inc.
Priority to AU69766/00A
Publication of WO2001024045A2
Publication of WO2001024045A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G06F40/143 Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]

Definitions

  • the present invention relates generally to a method, system, signals and media for indexing, searching and retrieving data based on context.
  • the Internet has rapidly become one of the leading communications mediums of our age.
  • One of the most popular applications used in the Internet is the
  • HotBot™ http://www.hotbot.com
  • Northern Light™ http://www.nlsearch.com
  • Several conventional search engine services organize the information contained within their databases into broad categories.
  • Yahoo™ provides users with a number of categories within which they may narrow their search, including such categories as "government", "entertainment", and "health". Users are able to browse these categories in order to narrow their search to structured documents indexed within a particular category. While these broad categories provide users with some mechanism for organizing the nature of their search, they do not enable a user to perform a search for particular data which has been given specific context within structured documents.
  • XML Extensible Markup Language
  • the above and related desires are addressed in the present invention by providing a method, system, signals and media for indexing, searching and retrieving data based on context.
  • the present invention can be applied to index, search and retrieve context sensitive data associated with structured documents or structured data created with the Extensible Markup Language (XML), an XML-derived markup language, or another context sensitive markup language.
  • the present invention can also be applied to indexing, searching and retrieving other types of context sensitive data.
  • a computer-implemented method for retrieving data based on context.
  • a search criterion is received from a requesting party and used to find a set of data sources containing a data element that matches the search criterion.
  • a set of structural components is retrieved that provide context to the data element found in the retrieved structured documents.
  • the set of structural components and references to the set of data sources are transmitted to the requesting party for further processing.
  • a computer-implemented method of retrieving data is provided.
  • a set of structural components is identified based on one or more search criteria received from a requesting party.
  • the set of structural components is transmitted to the requesting party for selection.
  • a selected structural component is received and references are retrieved to structured documents that contain data associated with the selected structural component. These references are then transmitted to the requesting party.
  • the set of structural components may be identified as a set of contexts associated with a set of structured documents.
  • a set of references may be identified to structured documents that contain data marked by at least one of the set of structural components. These latter references may be transmitted to the requesting party with the transmission of the set of structural components for selection.
  • the set of structural components identified based on the one or more search criteria may be identified as being associated with a set of document structures.
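The two retrieval aspects described above can be illustrated with a minimal Python sketch. It assumes a toy in-memory index; the `INDEX` contents, the example URLs and the function names `find_contexts` and `references_for` are hypothetical and are not taken from the specification.

```python
# Hypothetical in-memory index: each entry pairs a data element with the
# structural component (markup tag) that gives it context, plus its source.
INDEX = [
    {"term": "mercury", "context": "planet", "source": "http://example.com/astronomy.xml"},
    {"term": "mercury", "context": "element", "source": "http://example.com/chemistry.xml"},
    {"term": "venus", "context": "planet", "source": "http://example.com/astronomy.xml"},
]


def find_contexts(criterion, index=INDEX):
    """Step 1: return the contexts and source references that match a criterion."""
    hits = [e for e in index if e["term"] == criterion.lower()]
    contexts = sorted({e["context"] for e in hits})
    sources = sorted({e["source"] for e in hits})
    return contexts, sources


def references_for(criterion, selected_context, index=INDEX):
    """Step 2: return references only to sources where the term has the selected context."""
    return sorted({
        e["source"]
        for e in index
        if e["term"] == criterion.lower() and e["context"] == selected_context
    })
```

In this sketch a search for "mercury" first yields the contexts "element" and "planet"; once the requesting party selects "planet", only the astronomy document is returned.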
  • a computer-implemented method of indexing data into a database. In this aspect, a data source is indexed within the database. The data source is scanned in search of structural components that provide context to any data elements within the data source. Such data elements and their associated structural components are retrieved from the data source. Organizational information representing an organization of the data elements and the structural components within the data source is also retrieved and stored within the database.
  • FIG. 1 is a block diagram of a system for indexing, searching, and retrieving data according to a first embodiment of the invention;
  • FIG. 2 is a block diagram illustrating further aspects of the system shown in FIG. 1;
  • FIG. 3 is a block diagram illustrating the indexing system shown in FIG. 1;
  • FIG. 4 is a flow diagram illustrating the submission of a data source for indexing by the indexing system in FIG. 3;
  • FIG. 5 is a flow diagram illustrating the indexing of data sources by the indexing system in FIG. 3;
  • FIG. 6 is a diagram illustrating a sample XML document that can be indexed in the system in FIG. 1;
  • FIG. 7 is a block diagram illustrating a sample content submission request for the submission of an XML document to the indexing system in FIG. 3;
  • FIG. 8 is a block diagram illustrating a sample submission response issued by the indexing system in FIG. 3;
  • FIG. 9 is a diagram illustrating a textual components layer for the sample XML document in FIG. 6;
  • FIG. 10 is a diagram illustrating a structural components layer for the sample XML document in FIG. 6;
  • FIG. 11 is a block diagram illustrating a data structure for textual components retrieved by the system in FIG. 3 from the sample XML document in FIG. 6;
  • FIG. 12 is a block diagram illustrating a data structure for structural components retrieved by the system in FIG. 3 from the sample XML document in FIG. 6;
  • FIG. 13 is a diagram illustrating a sample entry in a computer file to store data regarding unique document terms encountered by the system in FIG. 3 during the indexing process;
  • FIG. 14 is a block diagram illustrating a structure for a node within the index used in the first embodiment
  • FIG. 15 is a block diagram illustrating a structure of a postings blob for a textual component managed by a node in accordance with the first embodiment
  • FIG. 16 is a block diagram illustrating a structure of a postings blob for a structural component managed by a node in accordance with the first embodiment
  • FIG. 17 is a block diagram illustrating an arrangement of key/blob pairs in accordance with the present invention
  • FIG. 18 is a block diagram illustrating an alternative arrangement of the index of the first embodiment, including a computer-readable file for channel mappings;
  • FIGS. 19 and 20 are flow diagrams illustrating a method of searching for data based on context according to the first embodiment of the invention.
  • FIG. 20A is a flow diagram illustrating a portion of FIG. 19 according to the first embodiment of the invention.
  • FIG. 21 is a block diagram illustrating the sequence of information exchanged as data signals between a user machine and a search engine via an intermediary search application in accordance with the first embodiment of the invention
  • FIG. 22 is a diagram illustrating the structure of an initial search form presented to the user in accordance with the first embodiment in FIG. 1;
  • FIG. 23 is a diagram illustrating the structure of a first representation of a context-sensitive search form presented to the user in accordance with the first embodiment in FIG. 1;
  • FIG. 24 is a diagram illustrating the structure of a second representation of the context-sensitive search form presented to the user in accordance with the first embodiment in FIG. 1;
  • FIG. 25 is a diagram illustrating a sample raw XML search request in accordance with the first embodiment in FIG. 1;
  • FIG. 26 is a diagram illustrating a sample raw XML search response in accordance with the first embodiment in FIG. 1; and
  • FIG. 27 is a block diagram of a system for searching and retrieving data according to another embodiment of the invention;
  • a context-sensitive search query interface for searching and retrieving data based on context.
  • an indexing system for indexing data elements and their data sources based on the context of such data elements in their data sources.
  • FIG. 1 is a block diagram illustrating a system 20 for providing search engine services according to a first embodiment of the invention.
  • the system 20 has at least one computer server 22, a search engine 24 running on the computer server 22, and a database 31 which contains an index 30 to indexed data including, but not necessarily limited to, context sensitive data.
  • the computer server 22 is provided with conventional communications equipment for communicating over a network or internetwork such as the Internet 42.
  • the search engine 24 is programmed to conduct user-initiated searches of the index 30 to retrieve references to structured documents 17, and other data sources including, but not limited to, other structured data sources, which are cataloged within the index 30.
  • a user communicates with the search engine 24 via a web browser (or micro-browser or other navigational tool) running on one of the user machines 40. Examples of commercially available web browsers include Netscape Navigator,
  • An intermediary search application 26 runs on a web computer server 27 supporting user-based searching of the index 30.
  • the user interacts with the search engine 24 via search forms 21 displayed by the user's web browser.
  • the search forms 21 are stored on the intermediary search application 26 as computer-readable instructions.
  • the search forms 21 provide an end-user with a sequence of search forms to assist the user in defining and refining the user's search for relevant information within the index 30.
  • users can define search criteria using well-known search definition techniques such as keyword searches and phrase searches.
  • the search engine 24 searches the index 30 for relevant data based on the search criteria that the user has defined via a search form.
  • the user initiates a search of the index 30 by retrieving an initial search form 23 from the intermediary search application 26. This may be performed by simply connecting one of the user machines 40 to a Web site running on the web computer server 27 and serving as the human interface to the search engine system 20.
  • the user machines 40 can be any type of computing device capable of communicating with the intermediary search application 26 or the search engine 24.
  • the user machines 40 are personal computers connected to the Internet. Other types of user machines may also be used, for example wireless hand-held computing devices and other microprocessor-based electronic devices having a web browser and a network connection.
  • the user machines 40 can access the services of the search engine 24 through a dial-up connection with an ISP and a network connection established over the Internet 42.
  • connections by the user machines 40 may be established.
  • the user machines 40 may access the search engine services through an intranetwork, a cable or xDSL modem connection, a wireless connection or dedicated network connection (e.g. LAN or WAN) or the like.
  • When the initial search form 23 is displayed to a user on one of the user machines 40, the user is prompted to provide one or more search criteria in order to proceed with a search.
  • the search criteria generated by a user using the initial search form 23 serve as a search request that is transmitted as a set of variables via the intermediary search application 26 to the computer server and the search engine 24 for further processing by the search engine 24.
  • the search request is used by the search engine 24 to define and initiate a search of the index 30.
  • the search results generated from this search are sent by the search engine 24 via the intermediary search application 26 to the user's web browser where the results are displayed using a context-sensitive search form.
  • the first embodiment provides at least two types of search forms which can be communicated to the user machines 40: an initial search form 23 and a context-sensitive search form 25. Examples of search forms 21 are shown in FIGS. 22 to 24, which are discussed later in this specification. Referring to FIG.
  • the search forms 21 are implemented in the first embodiment using HTML and are displayed as Web pages on the user machines 40. It will be appreciated by persons skilled in the art, however, that other programming techniques can be used (independently, in combination or in addition) to encode the search forms in place of or in addition to HTML. For instance,
  • XML or another SGML-based markup language
  • applets may be used such as Java Servlets, Server Side Includes, Java™, Javascript™, ActiveX™, or Active Server Pages™ or other equivalent computer-readable instructions that can be invoked either locally on a user machine or via command over a network.
  • the intermediary search application 26 acts as an intermediary between the user's web browser and the search engine 24.
  • the intermediary search application 26 receives and converts user-based search requests into raw XML-based search requests which it transmits to the search engine 24.
  • the intermediary search application 26 also receives and converts raw XML search results from the search engine 24 into a user-presentable format which the intermediary search application 26 transmits to the user's machine.
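The two conversions performed by the intermediary search application can be sketched as follows. The specification's actual raw XML request and response formats (FIGS. 25 and 26) are not reproduced in this text, so the element and attribute names used here (`search-request`, `criterion`, `search-response`, `hit`, `url`, `title`) and both function names are assumptions made purely for illustration.

```python
from urllib.parse import parse_qs
import xml.etree.ElementTree as ET


def query_to_xml(query_string):
    """Convert form-style query parameters into a raw XML search request.

    The XML vocabulary here is hypothetical, not the format in the patent.
    """
    params = parse_qs(query_string)
    root = ET.Element("search-request")
    for name in sorted(params):
        for value in params[name]:
            criterion = ET.SubElement(root, "criterion", name=name)
            criterion.text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_rows(xml_response):
    """Convert a raw XML search response into (url, title) rows for display."""
    root = ET.fromstring(xml_response)
    return [
        (hit.get("url"), hit.findtext("title", default=""))
        for hit in root.iter("hit")
    ]
```

A presentation layer could then render the returned rows as HTML; keeping the XML exchange separate from the display format is what lets the same search engine serve different front ends.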
  • Web server software 28 runs on the web computer server 27 to support the intermediary search application 26.
  • the web server software 28 can be any of several well-known server packages including, for example, Apache's web server software or Microsoft's Internet Information Server.
  • the intermediary search application 26 comprises a plurality of CGI scripts in the first embodiment.
  • an HTTP message (Hypertext Transfer Protocol message) is transmitted from the user's machine 40 to the web computer server 27.
  • the HTTP message sent from the user serves as a user-based search request containing an Internet Protocol (IP) address associated with the web computer server 27, the name of a common gateway interface (CGI) script 26.1 residing on the web computer server, and query parameters for configuring the CGI script 26.1 with the formulated search criteria.
  • the web server software 28 launches the CGI script 26.1 with the transmitted query parameters.
  • the CGI script 26.1 converts the format of the user's formulated search criteria into a raw XML search request which the web computer server 27 transmits to the search engine 24 where the search of the index 30 is performed, and waits for a response. Once the search results are retrieved by the search engine 24, they are transmitted as a raw XML response to the web computer server 27 where the CGI script 26.1 converts the raw XML response into a format that is presentable to the user's browser. In this embodiment, the CGI script 26.1 provides the actual display information to the user's browser for the display of search forms and search results.

Indexing

  • the index 30 is a storage structure that is used to catalog data and the location of the source(s) of such data, including, but not limited to, context sensitive data and the location of the source(s) of such context sensitive data.
  • the sources of data include, amongst other types of data, structured documents 17 located locally or on an accessible network resource 18.
  • the structured documents 17 contain data elements which are marked with structural components that provide context to such data elements. Data elements that are marked by (or surrounded by) such structural components represent context sensitive data.
  • the structural components are represented by contextual markup tags based on a markup language. Data elements which may be marked with contextual markup tags include character data and other markup tags (for example, graphical or multimedia objects).
  • character data refers to textual components of the structured documents 17 which are not part of a markup tag contained within the structured documents 17.
  • the textual components of a document are made up of one or more characters based on a binary-coded character set containing letters, numbers and other typographic symbols.
  • the textual components are Unicode compliant.
  • the Unicode standard is a universal character encoding standard used for the representation of text for computer processing.
  • the Unicode standard conforms with ISO/IEC 10646 and is supported by the Unicode Consortium (http://www.unicode.org).
  • Other examples of binary-coded character sets which may be used include ASCII (American Standard Code for Information Interchange), EBCDIC (Extended Binary Coded Decimal Interchange Code), and BCD (Binary Coded Decimal).
  • the structured documents 17 are XML documents.
  • structured documents and structured data may also be processed using the present invention such as, for example, data formatted with an XML-derived markup language or another context-sensitive markup language, as well as other forms of data containing context sensitive data.
  • Quasi-structured documents may also be processed, such as HTML documents, WML documents and standardized text formats such as those employed by academics (e.g. essay or thesis formats).
  • Data sources such as the structured documents 17 are cataloged (or indexed) within the index 30 as part of an index building and maintaining process supported by indexing system 32.
  • the indexing system 32 is implemented as a set of computer-readable instructions running on the computer server 22 or another computer having access to index 30.
  • the computer server 22 has at least one processor 46 (a central processing unit in the first embodiment) connected via a bus 43 to a computer-readable medium 45.
  • the computer-readable medium 45 provides a memory store for software and data residing on the computer server 22.
  • the computer-readable medium 45 can include one or more types of computer-readable media including volatile memory such as Random Access
  • RAM Random Access Memory
  • ROM Read Only Memory
  • RAM 48 and a hard disk drive 49 each serve as computer-readable media.
  • the search engine 24 resides on a hard disk drive 49 and is loaded via bus 43 into RAM 48 as computer-readable instructions which execute on the processor 46 to provide search engine services to user machines 40 directly or indirectly connected to the system 20.
  • the search engine 24 runs as an application on an operating system 47.
  • the indexing system 32 also runs on the operating system 47.
  • the operating system can be any of several well-known operating systems such as, for example, Windows NT™, Windows NT™, Windows
  • the computer server 22 includes a network interface 44 which is in communication with the processor 46 to connect the computer server 22 to a network so that the computer server 22 can communicate with user machines 40 or other networked devices.
  • the indexing system 32 comprises a content submission interface 50, a content acceptor 52, a queue 54 and an indexer 56.
  • the data sources, such as the structured documents 17 in FIG. 1, are submitted by a submitting party (for example, a user or application) to the content submission interface 50, which runs as a computer program on the computer server 22.
  • a content submission request can be received from either a user or an application.
  • a content submission request contains a data source intended for insertion into the index 30.
  • Context sensitive data is submitted in a content submission request in the form of a structured document or in the form of a structured data stream which preserves the context of such context sensitive data.
  • a content submission request also contains resource location information identifying the location of a data source on a resource. In the first embodiment, the resource location information is represented by a Uniform Resource Locator (commonly known as a URL) which specifies the location of the data source (for example, one of structured documents 17).
  • the term "resource" refers to any computer-implemented object or data that can be accessed via a computer network (e.g. LAN, WAN, etc.) or internetwork such as the Internet, and which contains or refers to a data source. Examples of resources include Web sites, Web pages, file directories, URIs, URNs, URLs, IP addresses, POP, Email (MIME or S/MIME), electronic data files and other electronic documents accessible over a network.
  • the resource location information is a form of metadata that identifies an attribute of a data source.
  • Other metadata may also be included within the content submission request to further specify attributes of a data source to be indexed within index 30.
  • the content submission interface 50 is implemented using sockets.
  • the content submission interface 50 accepts the request, gives the remote host a connection and processes the content submission request sent to the content submission interface 50 over the socket.
  • a content submission request received via the content submission interface 50 is scanned by the content acceptor 52 which is programmed to determine if the content submission request contains content in an acceptable format.
  • the content of an accepted content submission request is placed on queue 54 for subsequent indexing into index 30.
  • the content acceptor 52 informs the submitting party whether or not the content of the content submission request has been accepted for indexing.
  • content submission requests contain XML documents which contain data elements that are given context using markup tags.
  • An example of an XML document is shown in FIG. 6 generally at 60 containing data elements 62 marked by contextual markup tags 64 (structural components) which contain contextual terms that provide context to marked data elements 62. Attributes may also be used within the XML document to provide context to data. Attribute names are equivalent to markup tags 64 and attribute data is equivalent to data elements 62.
  • FIG. 7 shows a sample of a raw content submission request 66 received from a submitting party via the content submission interface 50 for processing by the content acceptor 52 in FIG. 3.
  • the content submission request 66 contains a request to submit the XML document shown in FIG. 6 for indexing within index 30, a copy of the XML document 60 and metadata 68 regarding the submitted XML document.
  • the metadata 68 includes resource location information 65 which is a URL identifying the location of the original XML document 60.
  • the metadata 68 also includes other information regarding a submitted document, including, for example, channel information 67, the type of document (MIME type 69), the date the document was last modified, and a title and abstract for the document.
  • Channel information 67 is used to index a document under a channel within the index 30.
  • a document may be indexed into several channels within index 30.
  • Using channels, a user or application can search the entire index 30 or a subset of the index 30 under one or more channels. For instance, using channels a search may be narrowed by document type.
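Channel-restricted searching can be pictured with a short Python sketch. The index layout (a list of dictionaries with `url`, `terms` and `channels` keys) and the `search` function are hypothetical simplifications, not the index structure of the first embodiment.

```python
def search(index, term, channels=None):
    """Return URLs of documents containing `term`.

    If `channels` is given, only documents indexed under at least one of
    those channels are considered; otherwise the entire index is searched.
    """
    wanted = set(channels) if channels is not None else None
    hits = []
    for doc in index:
        if term not in doc["terms"]:
            continue
        if wanted is not None and not (doc["channels"] & wanted):
            continue
        hits.append(doc["url"])
    return hits


# Hypothetical sample index: a document may belong to several channels.
SAMPLE_INDEX = [
    {"url": "a.xml", "terms": {"mercury"}, "channels": {"science"}},
    {"url": "b.xml", "terms": {"mercury"}, "channels": {"news", "science"}},
    {"url": "c.xml", "terms": {"venus"}, "channels": {"news"}},
]
```

Calling `search(SAMPLE_INDEX, "mercury")` searches the whole index, while `search(SAMPLE_INDEX, "mercury", ["news"])` narrows the search to the "news" channel.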
  • the content acceptor 52 scans the content submission request 66 to determine if the submitted content is in an acceptable format.
  • the content acceptor 52 checks the metadata 68 for the type of document that has been submitted. If the document is in XML, then the content acceptor checks the remainder of the content submission request 66 to, for instance, check that the document is in well-formed XML and to check whether or not the document is already in the index 30. If the document is identified as something other than an XML document, the content acceptor 52 invokes a document handler 58 to transform the document into XML. Context sensitive data within the submitted document is preserved during the conversion process. If the submitted document cannot be transformed into XML, the content acceptor 52 will not accept the document.
  • all submitted documents are processed by a version of the document handler, with documents that are not in XML being transformed into XML, and with XML documents not undergoing such transformation.
  • a document that is not in XML is transformed into XML by a version of the document handler programmed to recognize the submitted document type and transform it into XML.
  • if the content is accepted, a submission response is transmitted to the submitting party confirming acceptance.
  • An example of a content submission response is shown in FIG. 8 at 70, acknowledging that the content submission request in FIG. 7 has been accepted and that the submitted document and metadata concerning the document have been queued for indexing.
  • if the content is rejected, a submission response is transmitted to the submitting party notifying the submitting party of the rejection.
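The acceptance flow described above (check the document type, transform non-XML content, verify well-formedness, queue, respond) can be sketched in Python. Everything named here is an assumption for illustration: the request dictionary keys, the `HANDLERS` table, the toy `csv_handler`, the response fields, and the choice to reject a URL that is already indexed (the text says only that the acceptor checks for this, not what it then does).

```python
import xml.etree.ElementTree as ET
from collections import deque


def csv_handler(text):
    """Hypothetical document handler: turn 'name,value' lines into XML."""
    root = ET.Element("document")
    for line in text.strip().splitlines():
        name, value = line.split(",", 1)
        ET.SubElement(root, name.strip()).text = value.strip()
    return ET.tostring(root, encoding="unicode")


HANDLERS = {"text/csv": csv_handler}  # assumed registry of document handlers


def accept(request, queue, indexed_urls):
    """Decide whether a content submission request is queued for indexing."""
    content = request["content"]
    if request["url"] in indexed_urls:
        # Assumption: an already-indexed document is simply rejected.
        return {"accepted": False, "reason": "already indexed"}
    if request["mime_type"] != "text/xml":
        handler = HANDLERS.get(request["mime_type"])
        if handler is None:
            return {"accepted": False, "reason": "cannot transform to XML"}
        content = handler(content)
    try:
        ET.fromstring(content)  # raises ParseError if not well-formed
    except ET.ParseError:
        return {"accepted": False, "reason": "not well-formed XML"}
    queue.append({**request, "content": content})
    return {"accepted": True, "reason": "queued for indexing"}
```

A `collections.deque` stands in for queue 54 here; a separate indexer process would pop entries from the other end.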
  • content (for instance, structured content such as a structured document and its metadata) that has been queued for indexing is processed by the indexer 56.
  • the indexer 56 runs as a computer program on the computer server 22 or another computer server having access to the index 30.
  • the structured content in the first embodiment is represented by
  • the indexer 56 retrieves an XML document and its metadata from the queue 54, parses the document and its metadata, and then catalogs the parsed information within the index 30.
  • the indexer 56 parses the submitted XML document in order to retrieve and temporarily store data elements from the document and structural components within the document that provide context to any of the retrieved data elements.
  • the retrieved data elements and structural components are stored in data structures to maintain the association between such structural components and corresponding data elements, and to record information representing the organization of the data elements and the structural components within the document that is being parsed.
  • a submitted XML document is preferably divided into two logical layers: a textual components layer and a structural components layer.
  • the textual components layer contains textual components representing any words and any other non-markup character sequences retrieved from the submitted XML document, and any other data elements that are marked in the submitted XML document with contextual markup tags.
  • the structural components layer contains structural components retrieved from the submitted XML document, including contextual terms from contextual markup tags that provide context to data elements in the submitted XML document. If a submitted document contains no structural components, then the structural components layer is empty.
  • FIGS. 9 and 10 illustrate a textual components layer 72 and a structural components layer 74 respectively for the sample document in FIG. 6.
  • the textual components layer 72 is shown containing the words found in the sample document 60.
  • the structural components layer 74 is shown containing contextual terms 76 from contextual markup tags used in the sample document 60 to provide textual components with context.
  • Logically dividing the textual components and the structural components into different layers allows the indexer 56 (FIG. 3) to separately manage the indexing of textual components and structural components parsed from the submitted XML document.
  • the indexer 56 keeps track of position information and nesting information for both textual components and structural components.
  • the indexer 56 keeps track of boundary information identifying which textual components are first and last surrounded by the structural components.
  • FIG. 11 shows, by way of example, a data structure 80 for textual components retrieved from the sample XML document 60 in FIG. 6.
  • the data structure 80 is used by the indexer 56 in FIG. 3 to store the position information 82 and nesting information 84 for the words and other non-markup character sequences retrieved from the sample XML document 60.
  • a particular piece of position information in data structure 80 represents the position of a particular textual component relative to the other textual components retrieved from the
  • FIG. 12 shows, by way of example, a data structure 86 for structural components retrieved from the sample XML document 60.
  • the data structure 86 is used by the indexer 56 in FIG. 3 to store position information 87, nesting information 88 and boundary information 89 for the structural components retrieved from the sample XML document 60.
  • the position information 87 is used to identify the hierarchical relationship between structural components.
  • the position information 87 for data structure 86 identifies the position of a structural component within the sample XML document 60 relative to the other structural components.
  • the nesting information 88 for data structure 86 is used to identify whether any structural components are nested within other structural components, and if so, the depth of the nesting. The depth of the nesting is relative to the root structural component.
  • the boundary information 89 stored as "begin" and "end" fields in the data structure 86 is used to identify, for any structural component retrieved, the position information in the data structure 80 of the first and last textual components surrounded by such structural component in the sample XML document 60.
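The two-layer parse and its position, nesting and boundary bookkeeping can be illustrated with a minimal Python sketch. The numbering conventions are assumptions, since FIGS. 11 and 12 are not reproduced here: positions start at 0, the root structural component has nesting depth 0, and a textual component takes the nesting depth of its innermost enclosing structural component. The function name `build_layers` and the dictionary fields are hypothetical, and attributes are not handled.

```python
import xml.etree.ElementTree as ET


def build_layers(xml_text):
    """Split an XML document into a textual layer and a structural layer."""
    textual, structural = [], []

    def add_words(text, depth):
        # Each whitespace-separated word becomes one textual component.
        for word in (text or "").split():
            textual.append({"term": word, "position": len(textual), "nesting": depth})

    def walk(elem, depth):
        entry = {"term": elem.tag, "position": len(structural),
                 "nesting": depth, "begin": None, "end": None}
        structural.append(entry)
        first = len(textual)  # position of the first word this tag may enclose
        add_words(elem.text, depth)
        for child in elem:
            walk(child, depth + 1)
            add_words(child.tail, depth)  # text after a child belongs to elem
        if len(textual) > first:
            # Boundary information: first and last enclosed textual positions.
            entry["begin"], entry["end"] = first, len(textual) - 1

    walk(ET.fromstring(xml_text), 0)
    return textual, structural
```

For `<book><title>Moby Dick</title><author>Herman Melville</author></book>`, this sketch records `title` with begin 0 and end 1, `author` with begin 2 and end 3, and `book` spanning positions 0 to 3, which is enough to answer whether a given word occurs inside a given tag.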
  • a computer-readable file, referred to in the first embodiment as TermList file 30.1, is used to store data regarding each of the unique document terms that the indexer 56 encounters when parsing submitted documents. Both textual components and structural components retrieved from a parsed document are treated by the indexer 56 as document terms.
  • the TermList file 30.1 serves as a translation table between actual document terms and numeric identifiers that index structures within the index
  • each retrieved document term that is new to the TermList file 30.1 is assigned a unique identifier number which, in combination with the corresponding document term, is stored as part of a new entry in the TermList file 30.1. If the retrieved document term already has an existing entry in the TermList file 30.1, then the existing entry is updated to modify frequency information about the retrieved document term.
  • A sample entry in the TermList file 30.1 is illustrated in FIG. 13 generally at
  • the entry 90 contains a variable ("term_string") to store the actual string representation of a document term retrieved from a parsed document. Another variable ("term_id") is used to store a unique numeric identifier assigned to the retrieved document term.
  • the entry 90 also preferably stores statistical information about the total number of instances that a document term appears in the index and the total number of documents that contain such document term. This statistical information is used by the search engine 24 (FIG. 1) to prioritize queries on the index 30 and to determine weights for document terms that are retrieved as part of search results.
  • the entry contains a plurality of variables to store the statistical information.
  • Such statistical information preferably includes information on: the number of times a document term appears as a textual component, the number of times the document term appears as a structural component, the number of documents that contain the document term as a textual component, and the number of documents that contain the document term as a structural component.
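  • A minimal sketch of such an entry and of its update is given below, by way of example only. Only "term_string" and "term_id" are named above; the four counter names, the TermList class and the record_document method are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TermEntry:
    term_string: str          # actual string of the document term
    term_id: int              # unique numeric identifier
    text_count: int = 0       # hypothetical: occurrences as a textual component
    struct_count: int = 0     # hypothetical: occurrences as a structural component
    text_docs: int = 0        # hypothetical: documents containing it as text
    struct_docs: int = 0      # hypothetical: documents containing it as structure

class TermList:
    """Translation table between document terms and numeric identifiers."""

    def __init__(self):
        self.entries = {}

    def record_document(self, text_terms, struct_terms):
        """Update the entries with the terms retrieved from one parsed document."""
        for terms, count_field, docs_field in (
            (text_terms, "text_count", "text_docs"),
            (struct_terms, "struct_count", "struct_docs"),
        ):
            seen = set()
            for term in terms:
                entry = self.entries.get(term)
                if entry is None:                       # new term: assign an ID
                    entry = TermEntry(term, len(self.entries) + 1)
                    self.entries[term] = entry
                setattr(entry, count_field, getattr(entry, count_field) + 1)
                if term not in seen:                    # count each document once
                    seen.add(term)
                    setattr(entry, docs_field, getattr(entry, docs_field) + 1)
```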
  • the index file 30.2 is a computer-readable file organized in a B-plus tree structure containing a plurality of nodes having key/value pairs. Data is stored in a blob (or block) in the value chunk of a node. Each blob is structured to store data on either a textual component or a structural component, and a flag in the key of the node identifies whether the data stored in the blob is for a textual component or a structural component.
  • Each node 92 has a key portion 94 identifying a particular document term for which the node 92 is being used to store data, and a blob portion 96 for storing organizational data about the particular document term with respect to those data sources from which the particular document term has been retrieved and indexed within index 30.
  • the blob portion 96 forms the value chunk of the node 92.
  • the key portion 94 of the node 92 contains an identifier 93 for uniquely identifying the document term for which the node 92 is being used to store data, and a flag 95 which indicates whether the node 92 is storing information for a textual component or a structural component.
  • Each posting records data identifying position information and nesting information for each instance of the document term in a particular data source.
  • Each posting is associated with a particular data source which has been indexed by indexer 56.
  • FIG. 15 illustrates the structure of a postings blob 100 for a textual component that is being tracked by a particular node within the index file 30.2 (FIG. 3).
  • the postings blob 100 contains a set of postings 102 each of which contains data identifying the position and nesting information (82 and 84 in FIG. 10) for each instance of the textual component retrieved from a particular data source. For example, if node 92 is used to store data about the textual component "Introduction" and the sample document in FIG. 6 is indexed, then a posting would appear in the set of postings 102 identifying the document (for example, doc_id1) and the position and nesting information for the textual component "Introduction" in the document.
  • FIG. 16 illustrates the structure of a postings blob 104 for a structural component that is being tracked by a particular node within the index file 30.2 (FIG. 3).
  • the postings blob 100 contains a set of postings 102 each of which is used to store the position information, nesting information and boundary information (87, 88 and 89 in FIG. 11 ) for each instance of the structural component retrieved from a
  • the key/blob pairs are structured so that a pointer from a key points to its corresponding blob portion so that blobs are not necessarily stored in contiguous chunks in the index file 30.2 (FIG. 3).
  • the key portion 94 of a node 92 preferably includes a low document ID indicator 97.
  • the low document ID indicator 97 indicates the lowest document ID contained within a particular blob.
  • the low document ID indicator 97 can be used during a search for multiple terms within the index 30 to speed up the retrieval of the results. If, during the search for a first term, the search engine 24 finds that the first term only occurs in documents numbered, for example, 1001, 1007 and 1024, the search for the second term could ignore any node having a low document ID of less than 1001 or a low document ID of greater than 1024, since such nodes have already been eliminated as possible matches. It will be noted that the low document ID, while providing enhanced retrieval capabilities, is not required to retrieve results.
  • the low document ID indicator 97 is included in nodes to improve the performance of searches on the index 30.
  • the lowest document ID within a blob can also be determined during a search by looking inside a node at the contents of its blob portion.
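  • The use of the low document ID to skip nodes can be sketched as follows, by way of example only. The passage above eliminates nodes on the basis of the low document ID alone; the sketch takes a more conservative reading and assumes that the nodes for one document term are ordered by low document ID and that every document ID in a node is lower than the low document ID of the next node, so that a node is skipped only when its whole range lies outside the candidate range. The names Node and candidate_nodes are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    term_id: int                  # identifier 93
    is_structural: bool           # flag 95
    low_doc_id: int               # low document ID indicator 97
    postings: dict = field(default_factory=dict)   # doc_id -> posting data

def candidate_nodes(nodes, lo, hi):
    """Yield the nodes of one term that may hold a document ID in [lo, hi].

    `nodes` must be sorted by low_doc_id (an assumption of this sketch).
    """
    for i, node in enumerate(nodes):
        if node.low_doc_id > hi:
            break                 # this node and all later ones start too high
        nxt = nodes[i + 1].low_doc_id if i + 1 < len(nodes) else None
        if nxt is not None and nxt <= lo:
            continue              # every document ID in this node is below lo
        yield node
```

  • With nodes whose low document IDs are 1, 500, 1010 and 2000, and a first term found only in documents 1001 to 1024, only the nodes starting at 500 and 1010 would be opened in this sketch.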
  • When an existing node grows to a maximum manageable size, the existing node is split into two nodes. A very common document term, such as the textual components "the" or "and", will therefore have its index information spread amongst several nodes. Alternatively, a new node may be inserted with a low document ID of the first new posting entered in the new node, and all subsequent new postings for the document term managed by the existing node will be stored in the new node.
  • a computer-readable file, referred to here as DocumentList file 30.3, is used to store all meta-data about each data source indexed within index 30.
  • the DocumentList file 30.3 contains an entry recording meta-data about the data source, including, for example, resource location information (65), MIME type (69), date last modified, title, summary/abstract, size, owner etc. This information can be retrieved as part of the search results obtained by the search engine 24 during a search of the index 30.
  • Processing the textual components and structural components of a document in separate layers and indexing such textual components and structural components using the node structure of index 30 allows the index 30 to be more manageable and scalable. As the number of documents indexed within index 30 grows, the index 30 will grow more linearly than would be the case with a relational database. If textual components, structural components and their associations with textual components and other structural components were stored in a relational database, the database would grow much larger and more complex as the number of documents submitted for indexing increased. In addition, with such a relational database structure, it can be more difficult, or require more processing, to find nestings of terms that are multiple levels apart. With index 30, ancestor-child relationships between stored structural components can be easily and rapidly retrieved. Moreover, the index 30 stores enough information about textual components and structural components retrieved from documents to enable the search engine
  • index 30 is not tied to a specific schema or structure for a document.
  • such retrieval can be achieved by indexing within the node structure of index 30 the position information and nesting information for textual components in nodes specific to such textual components, and by indexing position, nesting and boundary information for structural components such as contextual terms in nodes specific to such structural components.
  • indexing data based on context from index 30 provides for an efficient configuration which readily supports rapidly retrieving data based on context.
  • FIGS. 19 and 20 show flow diagrams illustrating a method of retrieving data based on context by searching for data sources using context sensitive search queries according to the first embodiment of the invention.
  • the flow diagrams in FIGS. 19 and 20 will be described with reference to the system 20 shown in FIG. 1.
  • the intermediary search application 26 receives a request to send the initial search form 23 to a requesting party which, in the first embodiment, is one of the user machines 40.
  • the intermediary search application 26 sends computer-readable instructions representing the initial search form 23 to the requesting user machine 40 where the initial search form 23 is displayed on the local web browser.
  • the computer-readable instructions for the initial search form 23 are transmitted to the requesting user machine 40 as a computer data signal embodied in a carrier wave.
  • FIG. 22 illustrates the initial search form 23 that is displayed to a user via a web browser 41 running on one of the user machines 40 (FIG. 1).
  • the initial search form 23 provides an easy and simple mechanism for prompting the user to specify an initial set of search criteria without necessarily having to define the context for the search criteria.
  • a user inputs into one of the user machines 40 one or more search terms via the initial search form 23.
  • the user may also include operators, such as Boolean operators (e.g. AND, OR, NOT etc.), to further define the nature of the search desired.
  • the initial set of search criteria is transmitted as part of a user-based search request to the intermediary search application 26 which converts the user-based search request into a raw XML search request which is then transmitted to the search engine 24 to initiate an initial search of the records of the index 30.
  • the user-based search request is an HTTP message identifying CGI script 26.1 that is part of the intermediary search application 26, and query parameters for configuring the CGI script 26.1 with the initial set of search criteria.
  • the CGI script 26.1 is launched with the query parameters and converts the format of the initial set of search criteria into the raw XML search request which the web computer server transmits as a computer data signal to the search engine 24.
  • FIG. 21 illustrates at 222 the transmission of the initial set of search criteria to the intermediary search application and, at 224, the transmission of the initial set of search criteria as a raw XML search request to the search engine 24, each as a computer data signal embodied in a carrier wave.
  • FIG. 25 shows a sample raw XML search request 240 containing sample search criteria 242.
  • a basic search request is included for the phrase "off to see the wizard".
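  • The conversion performed by a script such as the CGI script 26.1 can be sketched as follows, by way of example only. The actual format of the raw XML search request is the one shown in FIG. 25 and is not reproduced here; the element names "search", "criteria", "phrase" and "context", the query parameter names "q" and "context", and the function build_search_request are all assumptions made for illustration.

```python
from urllib.parse import parse_qs
import xml.etree.ElementTree as ET

def build_search_request(query_string):
    """Convert HTTP query parameters into a (hypothetical) raw XML request."""
    params = parse_qs(query_string)
    root = ET.Element("search")                  # hypothetical element names
    criteria = ET.SubElement(root, "criteria")
    for phrase in params.get("q", []):           # "q" is an assumed parameter
        ET.SubElement(criteria, "phrase").text = phrase
    for ctx in params.get("context", []):        # assumed contextual parameter
        ET.SubElement(criteria, "context").text = ctx
    return ET.tostring(root, encoding="unicode")
```

  • For example, build_search_request("q=off+to+see+the+wizard") produces a single "phrase" element carrying the phrase "off to see the wizard" in this sketch.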
  • the search engine 24 receives the initial set of search criteria in the raw XML search request at block 200. Once the search engine 24 receives the initial set of search criteria, the search engine 24 proceeds at block 202 to initiate an initial search of the index 30 in search of matches (or "hits") using the parameters within the initial set of search criteria.
  • a "match" or "hit" represents an entry in the index 30 identifying a data source having data fitting within the parameters of a given set of search criteria.
  • the search engine 24 searches the index file 30.2 for nodes which contain document terms that fit within the constraints of the initial set of search criteria.
  • the search engine 24 organizes an initial set of search results based on the matches found in the index 30.
  • the search engine 24 retrieves from the index 30 a set of resource location identifiers that identify data sources containing terms, phrases or other data that satisfy the initial set of search criteria.
  • Uniform Resource Locators (URLs) are examples of resource location identifiers that identify data sources containing terms, phrases or other data that satisfy the initial set of search criteria.
  • search results will also include a list of such contextual terms.
  • search results may also include reference to all unique contextual terms within retrieved data sources, whether or not any of the search terms of the search criteria are marked with such contextual terms in those data sources.
  • the contextual terms returned in a search can be used to refine the search for relevant data within the index 30 by presenting the requesting party with contextual information (i.e. the contextual terms) which the requesting party can use to filter the search results.
  • the contextual terms may also be used to refine the search by predicting possible document types and presenting them to the requesting party. For instance, if the document type "Invoice" is known in the index 30 to have the structural components "Name", "Address", "Invoice Number", and "ItemNumber", the search engine 24 predicts that the presence of these structural components indicates that certain document types, including the "Invoice" document type, are available to be searched upon within the index 30. This information can then be communicated to the requesting party so that a particular document type may be used to refine a search for relevant data within the index 30.
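  • One possible reading of this prediction is sketched below, by way of example only. The sketch assumes that a document type is predicted when all of its known structural components are present among the contextual terms returned; the rule actually applied by the search engine 24 is not stated above, and the function name predict_document_types is hypothetical.

```python
def predict_document_types(found_components, known_types):
    """Return the document types whose known components were all found.

    known_types maps a document type name to its structural components.
    """
    found = set(found_components)
    return sorted(name for name, required in known_types.items()
                  if set(required) <= found)
```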
  • FIG. 20A illustrates the search and retrieval process performed by the search engine 24 in FIG. 1 and in blocks 202 and 204 of FIG. 19.
  • the search engine 24 uses the TermList file 30.1 to resolve search terms into numeric identifiers (IDs) at block 300. Search terms which cannot be mapped to numeric IDs with the TermList file 30.1 are preferably ignored in the remainder of the search.
  • the search engine 24 searches for the search terms from amongst the nodes in the index file 30.2 and retrieves at block 304 a list of document identifiers that satisfy the search criteria.
  • the search engine 24 is able to perform a wide range of searching for relevant documents based on the position and nesting information for textual components and the position, nesting and boundary information for structural components stored in the index structure of index file 30.2. For instance, certain search criteria may require that the search engine 24 retrieve documents having textual components side-by-side or in a certain proximity to one another, such as phrase or proximity searching. With the position information stored within the index file 30.2, the search engine 24 is able to rapidly determine which documents have terms that fit within such a phrase or proximity search.
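  • A phrase search over stored position information can be sketched as follows, by way of example only. The sketch assumes that the postings have been loaded into a mapping from each term to the positions at which it occurs in each document; the function name phrase_matches and the shape of the mapping are hypothetical.

```python
def phrase_matches(postings, phrase_terms):
    """Return the IDs of documents in which the terms occur side by side.

    postings: {term: {doc_id: [positions of the term in that document]}}
    """
    first, rest = phrase_terms[0], phrase_terms[1:]
    hits = []
    for doc_id, starts in postings.get(first, {}).items():
        for start in starts:
            # every later term must sit exactly k positions after the first
            if all(start + k in postings.get(term, {}).get(doc_id, ())
                   for k, term in enumerate(rest, 1)):
                hits.append(doc_id)
                break
    return sorted(hits)
```

  • A proximity search could be obtained from the same data by accepting any position within a chosen distance rather than the exact offset.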
  • the search engine 24 is also able to rapidly retrieve nesting information and boundary information for a contextual term retrieved during a search without having to traverse during runtime each ancestor of the contextual term.
  • Other search criteria such as the contextual criteria discussed further below, may require that the search engine 24 retrieve structural components having a particular nesting.
  • Other search criteria may require that the search engine 24 retrieve document identifiers for documents having textual components associated with a certain nesting of structural components.
  • the search criteria may specify that the search engine 24 retrieve a list of documents having the textual component "Francis" within the nested context
  • the search engine 24 is able to search the tree structure of the index file 30.2 and rapidly identify relevant search results without having to calculate at runtime the location of textual components within structural components or the nesting levels of structural components.
  • the search engine 24 can restrict its searching to nodes in the index file 30.2 that are used to store information on structural components.
  • the search engine 24 can restrict its searching to nodes in the index file 30.2 that are used to store information on textual components.
  • the search engine 24 retrieves the position information for the textual component in question and determines if it falls within the boundary information recorded in the index file 30.2 for the particular context term of a document.
  • the search engine 24 compares the nesting information and boundary information of the two contextual terms. The first contextual term is a parent or ancestor of the second contextual term if the second contextual term has a depth (see FIG.
  • the nesting information can also be used to readily determine the number of levels of nesting between a structural component and one or more of its ancestors.
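  • The comparisons described above can be sketched as follows, by way of example only. Because the description of the ancestor test is incomplete here, the sketch assumes that a first contextual term is an ancestor of a second when the second has a greater depth and its boundaries fall within those of the first. The dictionary keys "depth", "begin" and "end" and the function names are hypothetical.

```python
def text_in_context(text_position, struct):
    """True if a textual component lies within the boundaries of a context."""
    return struct["begin"] <= text_position <= struct["end"]

def is_ancestor(first, second):
    """True if `first` encloses `second` at a shallower nesting depth."""
    return (second["depth"] > first["depth"]
            and first["begin"] <= second["begin"]
            and second["end"] <= first["end"])

def levels_between(ancestor, descendant):
    """Number of nesting levels separating a component from an ancestor."""
    return descendant["depth"] - ancestor["depth"]
```

  • No traversal of intermediate elements is needed in this sketch: each test is a comparison of stored integers.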
  • the search engine 24 retrieves at block 306 a list of document identifiers from the channel mappings table 30.4 (FIG. 18) for a list of documents that are associated with the one or more channels specified in the search.
  • the search engine 24 intersects the list of document identifiers retrieved from channel mappings table 30.4 with the list of document identifiers retrieved from the index file 30.2 to generate search results that contain a list of document identifiers that satisfy the search criteria and that appear in the channel(s) specified.
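  • The intersection with the channel mappings can be sketched as follows, by way of example only. The sketch assumes the channel mappings table 30.4 has been loaded into a mapping from channel name to document identifiers; the function name restrict_to_channels is hypothetical.

```python
def restrict_to_channels(index_hits, channel_mappings, channels):
    """Keep only the index hits that appear in at least one given channel.

    The order of index_hits is preserved.
    """
    allowed = set()
    for channel in channels:
        allowed.update(channel_mappings.get(channel, ()))
    return [doc_id for doc_id in index_hits if doc_id in allowed]
```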
  • the search engine 24 sorts and ranks the resulting list of document identifiers.
  • the search engine 24 looks up the metadata in the DocumentList file 30.3 for those documents identified in the resulting list which are to be presented to the user. The set of retrieved metadata and the sorted list of document identifiers to be presented to the user are then transmitted.
  • the search engine 24 transmits the initial set of search results as a raw XML response.
  • the raw XML response is transmitted to the intermediary search application running on the web server computer, where the CGI script 26.1 converts the raw XML response into a format presentable to the user's web browser.
  • FIG. 21 illustrates at 226 the transmission of the raw XML response to the intermediary search application as a computer data signal embodied in a carrier wave.
  • a sample raw XML response 246 containing a sample set of search results is shown generated from a search based on the sample XML search request in FIG. 25.
  • the response 246 contains a set of hits which identify matching documents, their URLs and summary information regarding each matching document including title, abstract, date last modified, size, rank and score of the document.
  • the summary information returned for each matching document allows the intermediary search application 26 to use a variety of techniques to display or further filter the search results for the user.
  • the response 246 also contains the list of contextual terms 248 found to provide context to search terms within the retrieved documents.
  • the response 246 also contains metadata 247 summarizing the search including the search criteria that formed the basis of the search.
  • Providing metadata 247 about the search criteria in the response 246 allows the intermediary search application to be stateless.
  • the search engine 24 can inform the search application 26 of ignored text and structural components such as with fields 250, and unused text and structural components such as with fields 252.
  • the one or more fields 250 for ignored text and structural components contain search terms that the requesting party specified to exclude from the scope of a search, as well as search terms which the search engine 24 ignored to improve the speed of a search, for instance very common terms such as "and", "the", "or" and "a".
  • the one or more fields 252 for the unused text and structural components contain search terms not used in the search because they did not appear within the index 30.
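  • The separation of search terms into used, ignored and unused terms can be sketched as follows, by way of example only. The stop-word set repeats the four common terms named above; the function name classify_terms and the shape of its arguments are hypothetical.

```python
STOP_WORDS = {"and", "the", "or", "a"}     # very common terms named above

def classify_terms(search_terms, term_ids, excluded=()):
    """Split search terms into used, ignored and unused terms.

    term_ids maps each indexed term to its numeric identifier.
    """
    used, ignored, unused = {}, [], []
    for term in search_terms:
        if term in excluded or term.lower() in STOP_WORDS:
            ignored.append(term)           # reported in fields such as 250
        elif term in term_ids:
            used[term] = term_ids[term]    # resolved to a numeric identifier
        else:
            unused.append(term)            # reported in fields such as 252
    return used, ignored, unused
```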
  • the intermediary search application 26 processes the raw
  • FIG. 21 illustrates at 228 the transmission to a user machine 40 of a computer data signal embodied in a carrier wave, the computer data signal containing computer-readable instructions for the display of the representation of the context-sensitive search form 25 with the list of contextual terms and the list of URLs.
  • FIG. 23 illustrates, by way of example, the context-sensitive search form 25 presented to the user for the first embodiment.
  • the context-sensitive search form 25 is displayed on a user machine 40 within web browser 41 as a Web page that includes a display area 126 for displaying all or a portion of the list of URLs generated in the initial search.
  • each URL is displayed as a hyperlink and includes some information from the corresponding data source (e.g. title, a subset of text retrieved from a document). Data sources referenced by the URLs are accessed by selecting a given hyperlink.
  • the context-sensitive search form 25 also includes a contextual display area, which is shown in FIG.
  • the context box 128 and other aspects of the context-sensitive search form 25 are displayed to the user via the web browser in the first embodiment using a meta-language such as HTML or another SGML-based language. It will be appreciated by one skilled in the art that selecting an item, such as a contextual term, from the context box 128 or from another part of the first context-sensitive search form 25 can be achieved using one of many known techniques used for selecting a feature on a web page (or an electronic form) and sending information identifying the selection to another application.
  • a preferred technique for providing this option is to catalog the search results and to dynamically identify which portions of the search results are to be presented to the user. This may be done by grouping the search results into "pages" or "segments", presenting the user with one of the pages of results, and allowing the user to navigate through the search results on a page-by-page basis. An example of this technique is illustrated with the page links 127 in FIG. 23. The same technique may be used to allow a user to view the refined search results via the context-sensitive search form 25 in FIG. 24.
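  • Grouping the search results into pages can be sketched as follows, by way of example only. The page size of ten and the function name page_of_results are assumptions; out-of-range page numbers are clamped to the nearest valid page in this sketch.

```python
def page_of_results(results, page, page_size=10):
    """Return (results on the requested page, total number of pages).

    Pages are numbered from 1.
    """
    total_pages = max(1, -(-len(results) // page_size))   # ceiling division
    page = min(max(page, 1), total_pages)
    start = (page - 1) * page_size
    return results[start:start + page_size], total_pages
```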
  • the context box 128 preferably appears on each "page" of search results viewed by the user via a search form so that the user can quickly refine search results using contextual terms retrieved from the index 30.
  • the context-sensitive search form 25 provides a refinement mechanism for enabling a user to refine, via the web browser 41 , the list of URLs by selecting one (or more) of the contextual terms from the context box 128.
  • the user can initiate this refinement mechanism by selecting one (or more) of the contextual terms in the context box 128.
  • the selected contextual terms are encoded in a data signal (230 in FIG. 21) that is transmitted to the intermediary search application 26 which converts the transmission into a raw XML search request that is then transmitted as another data signal (232 in FIG. 21) to the search engine 24 to initiate further searching of the index 30.
  • when the search engine 24 receives at block 208 instructions in the form of a raw XML search request to refine the list of URLs using one (or more) of the contextual terms from the list of contextual terms presented to the user, the search engine 24 generates a refined list of URLs from the original list of URLs.
  • the refined list of URLs is generated by determining which of the documents referred to in the original list of URLs includes the contextual term(s) selected by the user.
  • the refined list of URLs does not include URLs from the original list of URLs referring to data sources that do not include one or more of the original search terms marked by the refining contextual term(s). This latter arrangement helps to narrow the search results by identifying only those data sources within which search terms are used in the context(s) selected to refine the initial search results.
  • when the search engine 24 receives one or more selected contextual terms at block 208, the initial search results are filtered at block 210 using the one or more contextual terms selected by the user, and a refined list of contextual terms and a refined list of URLs are generated.
  • the selected contextual terms are used by the search engine 24 to filter from the original list of URLs the references to data sources that are not identified in the index 30 as containing one or more of the selected contextual terms.
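  • The refinement can be sketched as follows, by way of example only. The sketch follows the stricter arrangement described above, keeping a data source only if at least one of the original search terms is marked by at least one selected contextual term. The function name refine and the shape of the marked_by mapping are hypothetical.

```python
def refine(hits, marked_by, search_terms, selected_contexts):
    """Filter the initial hits using the selected contextual terms.

    marked_by: {doc_id: {search term: set of contextual terms marking it}}
    """
    selected = set(selected_contexts)
    refined = []
    for doc_id in hits:
        contexts = marked_by.get(doc_id, {})
        if any(contexts.get(term, set()) & selected for term in search_terms):
            refined.append(doc_id)
    return refined
```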
  • the search engine 24 transmits the refined list of URLs as part of a raw XML response (234 in FIG. 21) to the intermediary search application 26 which processes the raw XML response to generate a search form with the refined search results in a form presentable to the user.
  • the intermediary search application 26 generates a second representation of the context-sensitive search form 25 containing the refined list of URLs.
  • This second representation of the context-sensitive search form 25 is then transmitted by the intermediary search application 26 to the user's machine.
  • FIG. 21 illustrates at 236 the transmission to a user machine 40 of a computer data signal embodied in a carrier wave, the computer data signal containing computer-readable instructions for the display of the second representation of the context-sensitive search form 25 with the refined list of URLs.
  • FIG. 24 illustrates, by way of example, the second representation of the context-sensitive search form 25 presented to the user for the first embodiment.
  • the context-sensitive search form 25 is displayed in web browser 41 as a Web page that includes display area 126 now used for displaying all or a portion of the refined list of URLs.
  • Each URL in the refined list of URLs is preferably displayed as a hyperlink and includes some information from the corresponding document (e.g. title, a subset of text retrieved from the document).
  • the context-sensitive search form 25 may also include context box 128 for displaying all or a portion of the refined list of contextual terms used to generate the refined list of URLs. If more than one contextual term remains in the context box 128, the user may select from the remaining contextual terms to initiate a further search of the index 30 (FIG. 1) to further refine the already refined list of URLs.
  • the combination of presenting the user with a list of data sources and a list of contextual terms associated with one or more data sources, and providing the user with a mechanism for selecting from the list of contextual terms so as to contextually refine the list of data sources provides the user with a number of advantages.
  • the search engine 24 automatically determines for the user the context within which the user's initial set of search criteria are used in the documents identified in the search results from the initial search. The user does not have to guess or try to predict the contexts within which the user's search terms are used in structured data sources, such as XML documents, identified in the search results.
  • This automated technique for presenting the user with contextual information associated with the list of data sources retrieved from the initial search gives the user a mechanism for contextually filtering out data sources identified in the initial search.
  • the user can quickly narrow the search results according to the context given to terms, phrases and other sets of data within structured data sources indexed within the index 30.
  • the system 20 may have other aspects to further enhance functionality and usability.
  • Each of the following aspects individually provides a beneficial enhancement and is an embodiment of the present invention.
  • the index 30 maps contextual terms that it stores to an index of general (or normalized) contextual categories or generic contextual terms.
  • their corresponding general contextual categories in the normalized index may be retrieved and displayed as a generic list of contextual terms or categories (or both) in a location of the context-sensitive search form 25, or in place of the contextual terms displayed in the context box 128.
  • the search engine 24 refines the initial search results by filtering out URLs referring to documents which do not contain either the generic contextual term selected or any of those contextual terms in the index 30 associated with the selected generic contextual term.
  • search engine 24 is programmed to refine the list of URLs to only those URLs referring to documents identified in the index 30 as having one or more of the initial search terms marked by one or more contextual terms associated with the generic contextual term.
  • the selection of a generic contextual term may be used not only to reduce the number of URLs referred to, but to potentially increase the number of URLs listed.
  • the search engine 24 may be programmed, in response to the selection of a generic contextual term, to retrieve all URLs identified in the index 30 as referring to a structured document having at least one of the contextual terms aliased by the generic contextual term, provided that such contextual terms provide context to data that match the user's search criteria.
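  • The handling of a generic contextual term can be sketched as follows, by way of example only. The sketch assumes the normalized index is available as a mapping from each generic contextual term to the contextual terms it aliases, and that a second mapping gives, for each contextual term, the documents in which that term provides context to data matching the search criteria. Both mappings and both function names are hypothetical.

```python
def expand_generic(selected, aliases):
    """Return the selected terms together with every term they alias."""
    expanded = set()
    for term in selected:
        expanded.add(term)
        expanded.update(aliases.get(term, ()))
    return expanded

def docs_for_generic(generic_term, aliases, context_index):
    """Return the documents reachable through one generic contextual term.

    context_index: {contextual term: set of doc IDs whose matching data it marks}
    """
    docs = set()
    for term in expand_generic([generic_term], aliases):
        docs |= context_index.get(term, set())
    return sorted(docs)
```

  • Because every aliased contextual term contributes its documents, selecting a generic contextual term can return more documents than any single contextual term would.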
  • the user's machine does not communicate directly with the search engine 24, but instead goes through the web computer server 27 and intermediary search application 26.
  • the intermediary search application 26 may reside on the computer system 20 that supports the search engine 24.
  • client-side processing on the user's machine with, for example, style sheets or an applet such as a Java app may be used to communicate directly with the search engine 24 without web server software or a web computer server, as illustrated in FIG. 27.
  • the user's browser serves as a search application interface to the search engine 24, converting user-based search requests into raw XML requests which can be processed by the search engine 24, and converting raw XML search results received from the search engine 24 into a format presentable to the user on the web browser.
  • the initial search form 23 includes a list of at least one contextual term stored within the index 30 in database 31. This list provides the user with an indication of some or all of the contextual terms available within the index 30 to assist the user in formulating a context-based search request.
  • the context-based search request can contain one or more contextual terms defining the context within which the search is to be performed on some or all of the other search terms that form part of the search request.
  • the search engine 24 can then search the database 31 for references to documents and other data sources (for example, Web pages) having data elements such as character data associated with such contextual term(s) and return search results to the user machine that submitted the context-based search request.
  • the search engine 24 is programmed to compare the search term(s) submitted with the search request with contextual terms used by the database 31.
  • if the search engine 24 determines that any contextual terms within the database 31 provide context to any of the search terms, the search engine 24 then provides the user machine that submitted the search request with a list of contextual terms in the database 31 that are found to be associated with one or more of the search terms, along with instructions providing the user with the option to submit a context-based search request using one or more of the contextual terms provided.
  • the user may then refine the search request and specify the context sensitive nature of the search request by selecting one or more of the contextual terms from the list of contextual terms retrieved by the search engine 24.
  • the refined search request (now context-based), once received by the search engine 24, may be used to conduct a context-sensitive search for structured documents referenced by the database 31.
  • searching and retrieval of data described in the first embodiment may be performed on a relational database which associates data elements scanned from data sources with contextual terms from such data sources in the manner described in the first embodiment for index 30.
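
The relational variant mentioned in the last item above can be illustrated with a minimal sketch. It uses Python with the standard-library sqlite3 module and is not the schema of the application: the table name `element_index`, its columns, the helper names, and the sample rows are all hypothetical, chosen only to show data elements stored alongside the contextual terms they were scanned with.

```python
import sqlite3

# Hypothetical schema: one row per data element scanned from a data source,
# stored with the contextual term (for example, the enclosing XML tag name).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element_index ("
    " doc_ref TEXT NOT NULL,"       # reference to the document or Web page
    " context_term TEXT NOT NULL,"  # contextual term, e.g. an element name
    " data_element TEXT NOT NULL)"  # character data found in that context
)
conn.executemany(
    "INSERT INTO element_index VALUES (?, ?, ?)",
    [
        ("doc1.xml", "author", "Smith"),
        ("doc2.xml", "street", "Smith"),
        ("doc3.xml", "author", "Jones"),
    ],
)


def contexts_for(term):
    """Return the contextual terms under which a search term occurs."""
    rows = conn.execute(
        "SELECT DISTINCT context_term FROM element_index"
        " WHERE data_element = ? ORDER BY context_term",
        (term,),
    )
    return [row[0] for row in rows]


def context_search(term, contexts):
    """Return references to documents holding the term in a given context."""
    marks = ",".join("?" for _ in contexts)
    rows = conn.execute(
        "SELECT DISTINCT doc_ref FROM element_index"
        f" WHERE data_element = ? AND context_term IN ({marks})"
        " ORDER BY doc_ref",
        (term, *contexts),
    )
    return [row[0] for row in rows]
```

In this sketch, `contexts_for("Smith")` plays the role of the list of contextual terms offered to the user, and `context_search("Smith", ["author"])` plays the role of the refined, context-based request.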

Abstract

According to the invention, a search engine receives an initial set of search criteria from a user and initiates a search of a database containing an index of documents and their contents. The search engine searches the database for the location of documents containing data that match the initial search criteria and, where applicable, for contextual terms stored in the database that provide context for one or more data items in any retrieved document. A list of the identified documents and a list of contextual terms are generated and transmitted to the user for display in a context-based search form. This form provides a fine-grained search mechanism that lets the user narrow the list of identified documents by selecting one or more contextual terms from the list presented. When the search engine receives one or more contextual terms from the user, a refined list of documents is generated and transmitted to the user for further selection.
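
The two-step flow summarized in the abstract can be sketched as follows, assuming a small in-memory index. The `INDEX` layout, the function names, and the sample data are hypothetical and serve only to illustrate an initial search that returns both documents and contextual terms, followed by a refinement using the terms the user selects.

```python
# Hypothetical in-memory index: (document, contextual term, data element).
INDEX = [
    ("catalog.xml", "title", "java"),
    ("catalog.xml", "author", "gosling"),
    ("travel.xml", "island", "java"),
    ("menu.xml", "beverage", "java"),
]


def initial_search(term):
    """Step 1: find matching documents and the contexts the term appears in."""
    hits = [(doc, ctx) for doc, ctx, data in INDEX if data == term]
    documents = sorted({doc for doc, _ in hits})
    contexts = sorted({ctx for _, ctx in hits})
    return documents, contexts


def refined_search(term, selected_contexts):
    """Step 2: keep only documents where the term occurs in a chosen context."""
    chosen = set(selected_contexts)
    return sorted(
        {doc for doc, ctx, data in INDEX if data == term and ctx in chosen}
    )


# The initial request yields every document containing "java" plus the
# contexts it was found in; the user then narrows the list to one context.
documents, contexts = initial_search("java")
refined = refined_search("java", ["island"])
```

Exact string matching is used here for brevity; the abstract does not say how search criteria are matched against indexed data.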
PCT/CA2000/001042 1999-09-29 2000-09-08 Method, system, signals and media for indexing, searching and retrieving data based on context WO2001024045A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU69766/00A AU6976600A (en) 1999-09-29 2000-09-08 Method, system, signals and media for indexing, searching and retrieving data based on context

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US40733699A 1999-09-29 1999-09-29
US09/407,336 1999-09-29

Publications (2)

Publication Number Publication Date
WO2001024045A2 true WO2001024045A2 (fr) 2001-04-05
WO2001024045A3 WO2001024045A3 (fr) 2002-05-10

Family

ID=23611600

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CA2000/000861 WO2001024046A2 (fr) 1999-09-29 2000-07-21 Creating, modifying, indexing, storing and retrieving electronic documents marked with contextual tags
PCT/CA2000/001042 WO2001024045A2 (fr) 1999-09-29 2000-09-08 Method, system, signals and media for indexing, searching and retrieving data based on context

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CA2000/000861 WO2001024046A2 (fr) 1999-09-29 2000-07-21 Creating, modifying, indexing, storing and retrieving electronic documents marked with contextual tags

Country Status (2)

Country Link
AU (2) AU6973900A (fr)
WO (2) WO2001024046A2 (fr)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1390875A1 (fr) * 2000-09-20 2004-02-25 Body1, Inc. Methods, systems, and software for automated growth of intelligent on-line communities
WO2005052810A1 (fr) * 2003-11-28 2005-06-09 Canon Kabushiki Kaisha Method of constructing preferred views of hierarchical data
WO2006014824A1 (fr) * 2004-07-26 2006-02-09 Wireless 5Th Dimensional Networking, Inc. Context-based search engine residing on a network
US7296052B2 (en) 2001-03-20 2007-11-13 Sap Ag Automatically selecting application services for communicating data
WO2009003281A1 (fr) * 2007-07-03 2009-01-08 Tlg Partnership System, method, and data structure for providing access to interrelated sources of information
US7689910B2 (en) 2005-01-31 2010-03-30 International Business Machines Corporation Processing semantic subjects that occur as terms within document content
EP2202977A1 (fr) 2002-04-12 2010-06-30 Mitsubishi Denki Kabushiki Kaisha Method for describing hint data for manipulating metadata
US8635691B2 (en) * 2007-03-02 2014-01-21 403 Labs, Llc Sensitive data scanner
US9087129B2 (en) 1999-09-20 2015-07-21 Energico Acquisitions L.L.C. Methods, systems, and software for automated growth of intelligent on-line communities

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6927027B2 (en) * 1999-12-21 2005-08-09 Ingeneus Corporation Nucleic acid multiplex formation
AU2002300674B2 (en) * 2001-08-31 2007-09-20 Trusted Board Ltd Electronic approval of documents
US7020667B2 (en) 2002-07-18 2006-03-28 International Business Machines Corporation System and method for data retrieval and collection in a structured format
US8442982B2 (en) 2010-11-05 2013-05-14 Apple Inc. Extended database search

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DAO T: "AN INDEXING MODEL FOR STRUCTURED DOCUMENTS TO SUPPORT QUERIES ON CONTENT, STRUCTURE AND ATTRIBUTES" PROCEEDINGS OF THE FORUM ON RESEARCH AND TECHNOLOGY ADVANCES IN DIGITAL LIBRARIES, April 1998 (1998-04), pages 88-97, XP002925486 *
DEUTSCH A ET AL: "A query language for XML" COMPUTER NETWORKS, ELSEVIER SCIENCE PUBLISHERS B.V., vol. 31, no. 11-16, 17 May 1999 (1999-05-17), pages 1155-1169, XP004304546 AMSTERDAM, NL ISSN: 1389-1286 *
KANEMOTO H ET AL: "An efficiently updatable index scheme for structured documents" NINTH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS, VIENNA, AT, 26 - 28 August 1998, pages 991-996, XP010296697 IEEE COMPUT. SOC., LOS ALAMITOS, CA, US ISBN: 0-8186-8353-8 *
SAHUGUET A ET AL: "Looking at the Web through XML glasses" PROCEEDINGS, 1999 IFCIS INTERNATIONAL CONFERENCE ON COOPERATIVE INFORMATION SYSTEMS, EDINBURGH, UK, 2 - 4 September 1999, pages 148-159, XP010351848 IEEE COMPUT. SOC., LOS ALAMITOS, CA, US ISBN: 0-7695-0384-5 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9087129B2 (en) 1999-09-20 2015-07-21 Energico Acquisitions L.L.C. Methods, systems, and software for automated growth of intelligent on-line communities
EP1390875A4 (fr) * 2000-09-20 2005-11-09 Body1 Inc Methods, systems, and software for automated growth of intelligent on-line communities
EP1390875A1 (fr) * 2000-09-20 2004-02-25 Body1, Inc. Methods, systems, and software for automated growth of intelligent on-line communities
US7296052B2 (en) 2001-03-20 2007-11-13 Sap Ag Automatically selecting application services for communicating data
EP2202977A1 (fr) 2002-04-12 2010-06-30 Mitsubishi Denki Kabushiki Kaisha Method for describing hint data for manipulating metadata
US8811800B2 (en) 2002-04-12 2014-08-19 Mitsubishi Electric Corporation Metadata editing apparatus, metadata reproduction apparatus, metadata delivery apparatus, metadata search apparatus, metadata re-generation condition setting apparatus, metadata delivery method and hint information description method
US7826709B2 (en) 2002-04-12 2010-11-02 Mitsubishi Denki Kabushiki Kaisha Metadata editing apparatus, metadata reproduction apparatus, metadata delivery apparatus, metadata search apparatus, metadata re-generation condition setting apparatus, metadata delivery method and hint information description method
JP2007519086A (ja) * 2003-11-28 2007-07-12 キヤノン株式会社 Method of constructing preferred views of hierarchical data
AU2004292680B2 (en) * 2003-11-28 2010-04-22 Canon Kabushiki Kaisha Method of constructing preferred views of hierarchical data
US7664727B2 (en) 2003-11-28 2010-02-16 Canon Kabushiki Kaisha Method of constructing preferred views of hierarchical data
WO2005052810A1 (fr) * 2003-11-28 2005-06-09 Canon Kabushiki Kaisha Method of constructing preferred views of hierarchical data
WO2006014824A1 (fr) * 2004-07-26 2006-02-09 Wireless 5Th Dimensional Networking, Inc. Context-based search engine residing on a network
US7689910B2 (en) 2005-01-31 2010-03-30 International Business Machines Corporation Processing semantic subjects that occur as terms within document content
US8635691B2 (en) * 2007-03-02 2014-01-21 403 Labs, Llc Sensitive data scanner
GB2464059A (en) * 2007-07-03 2010-04-07 Tlg Partnership System, method, and data structure for providing access to interrelated sources of information
WO2009003281A1 (fr) * 2007-07-03 2009-01-08 Tlg Partnership System, method, and data structure for providing access to interrelated sources of information
US8306984B2 (en) 2007-07-03 2012-11-06 Tlg Partnership System, method, and data structure for providing access to interrelated sources of information

Also Published As

Publication number Publication date
WO2001024045A3 (fr) 2002-05-10
AU6973900A (en) 2001-04-30
WO2001024046A3 (fr) 2002-05-02
WO2001024046A2 (fr) 2001-04-05
AU6976600A (en) 2001-04-30

Similar Documents

Publication Publication Date Title
CA2511098C (fr) Delivering search engine results by page category information
US6490579B1 (en) Search engine system and method utilizing context of heterogeneous information resources
US6094649A (en) Keyword searches of structured databases
US7415459B2 (en) Scoping queries in a search engine
US6519586B2 (en) Method and apparatus for automatic construction of faceted terminological feedback for document retrieval
US7124358B2 (en) Method for dynamically generating reference identifiers in structured information
US7290061B2 (en) System and method for internet content collaboration
US6516312B1 (en) System and method for dynamically associating keywords with domain-specific search engine queries
US6944612B2 (en) Structured contextual clustering method and system in a federated search engine
US10210222B2 (en) Method and system for indexing information and providing results for a search including objects having predetermined attributes
US20010025304A1 (en) Method and apparatus for applying a parametric search methodology to a directory tree database format
US20070022096A1 (en) Method and system for searching a plurality of web sites
US20020129062A1 (en) Apparatus and method for cataloging data
US6938034B1 (en) System and method for comparing and representing similarity between documents using a drag and drop GUI within a dynamically generated list of document identifiers
WO2005052811A1 (fr) Searching in a computer network
EP1247213B1 (fr) Method and apparatus for creating an index for a structured document based on a stylesheet
WO2001024045A2 (fr) Method, system, signals and media for indexing, searching and retrieving data based on context
US7792855B2 (en) Efficient storage of XML in a directory
Lam The Overview of Web Search Engines
Schmidt et al. Distributed search for structured documents
KR20030013814A (ko) System and method for searching content that includes non-text data
WEITZMAN et al. Virtual URLs for Browsing & Searching Large Information Spaces
Paepcke STARTS: Stanford Proposal for Internet Meta-Searching
JP2001306592A (ja) Catalog creation method and search method for a Web page search engine operated on the Internet
WO2001050327A2 (fr) Method and apparatus for flexibly applying tokenization procedures

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 10089290

Country of ref document: US

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: COMMUNICATION PURSUANT TO RULE 69 (1), EPC FORM 1205 A

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP