WO2001024046A2 - Authoring, altering, indexing, storing and retrieving electronic documents embedded with contextual markup - Google Patents

Authoring, altering, indexing, storing and retrieving electronic documents embedded with contextual markup

Info

Publication number
WO2001024046A2
WO2001024046A2 PCT/CA2000/000861 CA0000861W WO0124046A2 WO 2001024046 A2 WO2001024046 A2 WO 2001024046A2 CA 0000861 W CA0000861 W CA 0000861W WO 0124046 A2 WO0124046 A2 WO 0124046A2
Authority
WO
WIPO (PCT)
Prior art keywords
computer
contextual
readable
items
context sensitive
Prior art date
Application number
PCT/CA2000/000861
Other languages
French (fr)
Other versions
WO2001024046A3 (en)
Inventor
Duane Allan Nickull
Chad Matthew Mackenzie
Jamie Michael Thomas Hoglund
Original Assignee
Xml-Global Technologies, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xml-Global Technologies, Inc.
Priority to AU69739/00A
Publication of WO2001024046A2
Publication of WO2001024046A3

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G06F40/143 Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]

Definitions

  • the present invention relates generally to electronic documents, and more particularly to a method, apparatus and system for authoring, altering, indexing, storing, and retrieving electronic documents embedded with contextual markup tags.
  • the Internet has rapidly become one of the leading communications mediums of our age.
  • One of the most popular applications used on the Internet is the World Wide Web, also referred to as the "Web" or "WWW".
  • search engines have become an important tool for enabling users to search for and retrieve information over the Internet that is relevant to their needs.
  • Popular search engines for searching the Internet are available from Yahoo™ (http://www.yahoo.com), Infoseek (http://www.infoseek.com), Lycos™ (http://lycos.cs.cmu.edu), AltaVista™
  • HTML HyperText Markup Language
  • HTML is derived from the well known Standard Generalized Markup Language (SGML) and has been widely adopted to generate Web pages.
  • HTML provides a set of predefined markup codes, commonly referred to as "tags", that can be inserted into or included in a text-based file (or document) that is viewable through a Web browser.
  • the tags help define the display semantics of the text and instruct the Web browser on how to display the text.
  • HTML does not provide a standard for including context sensitive data in electronic documents.
  • it can be difficult, if not impossible, to build and maintain a context sensitive database based on the content of conventional HTML documents.
  • crawlers and “spiders”.
  • a crawler is a program that can be used to autonomously explore the Internet or other networks in search of new or updated, publicly accessible resources such as Web sites, files available in FTP archives and Gopher documents. When a resource is found, it can be accessed by a spider which adds the location, identity and data from the resource to a search engine database.
  • When a spider accesses an identified Web page, it will typically index the Web page and add searchable content from the Web page to the search engine database.
  • the absence of context sensitive data in conventional HTML documents limits the ability of a spider to meaningfully capture and index the context within which textual elements are written in the HTML documents.
  • a common approach to indexing is to merely build and manage a non-contextual database.
  • indexing will typically involve processing the HTML encoded document that embodies the Web page and separating text elements into words that are added to the database.
  • the Web page may be indexed by adding to the database only words found in specific HTML tags, known as <META> tags, embedded in the Web page.
  • search results can be cluttered with redundant or irrelevant information, making the retrieval of relevant information difficult.
  • a specialized spider is programmed to index predefined Web sites relating to specific topic areas, such as on-line medical journals. While categorizing Web sites according to general topic areas can help search engines perform more targeted searches, this approach does not produce a context sensitive database for users to search. As a result, the user is still faced with the difficulty that performing a search on the topic-related index can produce many results that contain the search terms but not in the context desired by the user.
  • the Extensible Markup Language provides another solution to the need to include context sensitive data within electronic documents accessible over the Internet and other networks.
  • XML was introduced in part to serve as the basis for applications that permit Web authors and publishers to create XML Web pages containing structured context sensitive data. While XML is becoming more commonly used in the Internet community, HTML continues to be the markup language of choice for many Web authors and publishers, and the majority of Web sites continue to be populated with HTML documents.
  • the above and related desires are addressed in the present invention by providing a novel and nonobvious method, system and computer-readable instructions for authoring, altering, indexing, storing and retrieving context sensitive HTML documents (also referred to herein as hybrid HTML documents).
  • the present invention can also be equally applied to electronic documents generated with XHTML or a context insensitive markup language that is a subset of SGML.
  • a computer-readable medium contains an electronic document generated with a context insensitive markup language, the electronic document having contextual markup tags and items of computer-readable data marked by the contextual markup tags.
  • the contextual markup tags can each include a predefined prefix for identifying contextual information, at least one contextual term identifying a context within which at least one of the items of computer-readable data is used within the electronic document, and HTML delimiters.
  • the contextual markup tags are preferably arranged into pairs of contextual markup tags encapsulating the at least one of the items of computer-readable data, with each pair including an opening contextual markup tag and a closing contextual markup tag. Marking items of computer-readable data within the electronic document with opening and closing contextual markup tags provides an easy mechanism for generating a rich range of context sensitive data within an otherwise context insensitive document.
  • a context sensitive HTML document is generated by inserting an opening contextual markup tag before an item of computer-readable data within an HTML document and by inserting a closing contextual markup tag after the item of computer-readable data.
  • At least one contextual term identifying a context within which the item of computer-readable data is used is included in both the opening and closing contextual markup tags.
  • Each contextual markup tag is marked with HTML delimiters.
  • a predefined prefix for identifying contextual information is also inserted in both the opening and closing contextual markup tags.
  • the predefined prefix of the opening contextual markup tag is preferably inserted before at least one contextual term thereof for ease of viewing and processing.
  • a terminator may be inserted between the HTML delimiters of the closing contextual markup tag to distinguish the closing contextual markup tag from the opening contextual markup tag.
  • the marked item of computer-readable data may also be advantageously marked with additional pairs of opening and closing contextual markup tags.
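The tag-generation steps above can be illustrated with a minimal Python sketch. It assumes the prefix "XHML", the colon separator and the "/" terminator that appear later in the description; the function names and the sample term "CITY" are hypothetical and are not taken from the patent.

```python
PREFIX = "XHML"    # predefined prefix (value taken from the description)
SEPARATOR = ":"    # predefined separator between prefix and contextual term
TERMINATOR = "/"   # marks a closing contextual markup tag


def opening_tag(term):
    """Build an opening contextual markup tag between HTML delimiters."""
    return "<" + PREFIX + SEPARATOR + term + ">"


def closing_tag(term):
    """Build the matching closing tag; the terminator follows the first delimiter."""
    return "<" + TERMINATOR + PREFIX + SEPARATOR + term + ">"


def mark(item, *terms):
    """Wrap an item in one pair of contextual tags per term (first term outermost)."""
    for term in reversed(terms):
        item = opening_tag(term) + item + closing_tag(term)
    return item


# Hypothetical sample data: mark a city name inside an ordinary HTML paragraph.
hybrid = "<p>Visit " + mark("Vancouver", "CITY") + " today.</p>"
```

Passing more than one term to `mark` yields nested pairs, which corresponds to marking the same item with additional pairs of opening and closing contextual markup tags.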
  • a computer-readable memory is used to store a spider and an index processor.
  • the spider is programmed to scan electronic documents for context sensitive data and to retrieve the context sensitive data from the electronic documents, including electronic documents which conform to a context insensitive markup language.
  • the index processor is programmed to add the context sensitive data retrieved by the spider to a context sensitive database.
  • the spider preferably has computer-readable instructions to identify items of computer-readable data marked by at least one pair of contextual markup tags within the electronic documents. These latter instructions may be performed in conjunction with a separate set of code making up a parser.
  • the spider can retrieve contextual terms from the contextual markup tags and the items of computer-readable data marked by the contextual markup tags.
  • the index processor preferably includes instructions to index within the context sensitive database the items of computer-readable data at least according to the contextual terms retrieved from the contextual markup tags associated with the items of computer-readable data.
  • a crawler is included which is programmed to perform a preliminary scan of the electronic documents to determine which of the electronic documents include at least one item of context sensitive data.
  • Retrieving and indexing context sensitive data from an electronic document, generated with a context insensitive markup language but marked with contextual markup tags, provides a mechanism for supporting a richly organized context sensitive database which may then be readily and accurately searched in a contextual basis.
  • Such context-based searches performed on the context sensitive database produce search results that are much less cluttered with irrelevant or out of context information, allowing users to more quickly retrieve relevant search results.
  • a method for managing a context sensitive database in a computer system.
  • an electronic document is scanned in search of items of computer-readable data marked with contextual markup tags that have contextual terms associated with the items of character data.
  • the items of computer-readable data and associated contextual terms are retrieved from the electronic document and added to the context sensitive database.
  • the associated contextual terms and the items of computer-readable data are linked, and the items of computer-readable data are linked to an address identifying a location of the electronic document from which originated the items of computer-readable data and associated contextual terms.
  • contextual markup tags are examined for a predefined prefix distinguishing the contextual markup tags from other markup tags within the electronic document.
  • the search can be limited to contextual markup tags beginning and ending with at least one HTML delimiter.
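As a rough illustration of the scanning method above, the following Python sketch uses a regular expression to find items marked by tag pairs that carry the predefined prefix and begin and end with HTML delimiters. The prefix "XHML" and the colon separator come from the description; the regular expression, the function name `scan` and the sample URL are assumptions made for this example only. The sketch does not descend into nested pairs.

```python
import re

# Matches <XHML:TERM>item</XHML:TERM>. The backreference \1 requires the
# closing tag to carry the same contextual term as the opening tag.
CONTEXT_PAIR = re.compile(
    r"<XHML:([^<>/\s]+)>(.*?)</XHML:\1>",
    re.IGNORECASE | re.DOTALL,
)


def scan(document, url):
    """Return (contextual term, item of data, document address) triples."""
    return [(m.group(1), m.group(2), url) for m in CONTEXT_PAIR.finditer(document)]


# Hypothetical page: ordinary HTML tags such as <b> lack the prefix and are skipped.
page = "<p>Visit <XHML:CITY>Vancouver</XHML:CITY> in <b>May</b>.</p>"
found = scan(page, "http://example.com/a")
```

Each triple links a contextual term to its item of data and to the address of the document it came from, which is the linkage the method describes.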
  • a computer system having a context sensitive database and a spider sub-system.
  • the spider sub-system is programmed to scan an electronic document for items of computer-readable data marked by contextual markup tags that each include a contextual term associated with at least one of the items of computer-readable data.
  • the spider sub-system is also programmed to add the items of computer-readable data and associated contextual terms to the context sensitive database.
  • the spider sub-system may also be programmed to scan the contextual markup tags for a predefined prefix distinguishing the contextual markup tags from other markup tags.
  • the spider sub-system includes instructions to link the associated contextual terms and the items of computer-readable data and to link the items of computer-readable data to an item of location information identifying a location of the electronic document.
  • a computer system having a context sensitive database and a computer server coupled to the context sensitive database, wherein the computer system includes a search engine.
  • the search engine is programmed to receive a search request from a requesting device. Based on the search request, the search engine is programmed to search the context sensitive database for references to electronic documents containing items of computer- readable data and contextual terms associated with the items of computer-readable data. The references and corresponding items of computer-readable data and associated contextual terms are retrieved from the context sensitive database by the search engine and transmitted to the requesting device.
  • a computer-readable medium containing a data structure having a first segment, a second segment and a third segment.
  • the first segment identifies a resource that contains an electronic document having items of computer-readable data marked with contextual markup tags including contextual terms associated with the items of computer-readable data and a predefined prefix distinguishing the contextual markup tags from other markup tags.
  • the second segment identifies at least one of the items of the character data located within the electronic document.
  • the third segment identifies at least one of the contextual terms associated with the at least one of the items of computer-readable data.
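One way to picture the three-segment data structure is as a small record type. The Python sketch below is only an assumed representation; the class and field names are hypothetical, and the description does not prescribe any particular encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextRecord:
    resource: str         # first segment: resource containing the electronic document
    item: str             # second segment: an item of data located within the document
    contextual_term: str  # third segment: contextual term associated with the item


# Hypothetical example values.
record = ContextRecord("http://example.com/a", "Vancouver", "CITY")
```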
  • a computer-readable medium containing a context sensitive database having items of computer-readable data and associated contextual terms scanned from electronic documents generated with a context insensitive markup language but embedded with contextual markup tags containing the contextual terms.
  • This latter computer-readable medium also contains location information identifying addressable locations for the electronic documents.
  • a computer system for managing a context sensitive database
  • the computer system includes means for scanning an electronic document, generated with a context insensitive markup language, in search of items of computer-readable data marked with contextual markup tags, means for retrieving contextual terms associated with the items of computer-readable data from the contextual markup tags, means for retrieving the items of computer- readable data from the electronic document, and means for adding the items of computer-readable data and associated contextual terms to the context sensitive database.
  • a method is provided of performing a context-based computer search on a context sensitive database.
  • a search request containing a search term is received from a requesting party.
  • the search term is compared with contextual terms stored in the context sensitive database.
  • At least one contextual term, found to be associated with or to match the search term, is retrieved from the context sensitive database.
  • the retrieved contextual term(s) is transmitted to the requesting entity along with instructions for a context-based search request to be submitted using the contextual term(s) transmitted to the requesting party.
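The two-step search described above can be sketched in Python as follows. The sketch assumes that "matching" means a case-insensitive substring comparison and that the database is a simple list of rows; both choices, along with all function names and sample rows, are illustrative assumptions rather than details from the description.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (contextual term, item of data, document URL)


def suggest_terms(search_term: str, rows: Iterable[Row]) -> List[str]:
    """Step 1: return the stored contextual terms that match the search term."""
    needle = search_term.casefold()
    return sorted({term for term, _, _ in rows if needle in term.casefold()})


def context_search(term: str, text: str, rows: Iterable[Row]) -> List[str]:
    """Step 2: return URLs where `text` occurs in data marked with `term`."""
    return [
        url
        for stored_term, item, url in rows
        if stored_term.casefold() == term.casefold()
        and text.casefold() in item.casefold()
    ]


# Hypothetical database contents.
rows = [
    ("AUTHOR", "Jane Smith", "http://example.com/1"),
    ("CITY", "Smithers", "http://example.com/2"),
    ("AUTHOR_EMAIL", "jane@example.com", "http://example.com/1"),
]
suggestions = suggest_terms("author", rows)
hits = context_search("AUTHOR", "smith", rows)
```

In this example the second step returns only the page where "smith" appears in an AUTHOR context; the page where the same letters appear inside a CITY value is excluded.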
  • FIG. 1 is a schematic diagram of a back end system for scanning HTML Web pages and retrieving context sensitive data from the scanned Web pages, according to a first embodiment of the invention
  • FIG. 2 is a block diagram illustrating the conventional structure of an HTML document
  • FIG. 3 is a block diagram of the general structure by which context sensitive data is embedded in HTML documents to form hybrid HTML documents in accordance with the first embodiment
  • FIG. 4 is another diagram illustrating the general structure for context sensitive data in the first embodiment
  • FIG. 5 is a block diagram illustrating a hybrid HTML document having character data encapsulated with contextual markup tags in accordance with the first embodiment
  • FIG. 6 is another schematic diagram of the back end system of the first embodiment
  • FIG. 7 is a schematic diagram of a search engine system according to the first embodiment having a front end system and back end system
  • FIG. 8 is a block diagram illustrating the structure of a context sensitive database for storing information extracted from hybrid HTML documents
  • FIG. 9 is a flow diagram illustrating the generation of hybrid HTML documents.
  • FIG. 10 is a flow diagram illustrating a method of processing hybrid HTML documents in accordance with the first embodiment.
  • FIG. 1 is a schematic diagram of a back end system 20 for scanning hybrid HTML Web pages 40 and retrieving context sensitive data from the hybrid Web pages 40 according to a first embodiment of the invention.
  • the back end system 20 includes a search processor 22 that resides in memory 25 and executes as software on a back end computer server 24.
  • the search processor 22 receives and stores location information submitted by Web authors, Webmasters, publishers, organizations, crawlers and the like via user machines 44, Web site servers 42 and other networked sources.
  • the location information identifies the location of resources directly or indirectly accessible by the back end computer server 24 and which have hybrid HTML documents (e.g. hybrid Web pages 40) containing context sensitive data to be processed by the search processor 22, as further described below.
  • the term "resource" refers to any computer-implemented object or data that can be accessed via the Internet or another computer network (intranet, LAN, wireless etc.) and which contains (or refers to electronic data files which contain), in whole or in part, text-based information. Examples of resources include Web sites, Web pages, file directories, URIs, URNs, URLs, IP addresses, POP, S/MIME, electronic data files and other electronic documents accessible over a network.
  • the location information is represented by Uniform Resource Locators (commonly known as URLs) which specify the locations of hybrid Web pages 40 made up of HTML documents embedded with context sensitive data
  • URLs received by the search processor 22 are preferably stored by the back end computer server 24 as a list (or queue) of URLs 28 on a local storage device 30.
  • the search processor 22 includes software components which use the list of
  • Contextual markup tags 56 and 58 (see FIG. 3) encapsulate items of computer-readable data, including character data 52 and other markup tags (for example, graphical or multimedia objects), to form the context sensitive data within the hybrid HTML documents.
  • character data refers to textual elements of an electronic document which are not part of any HTML markup tags.
  • the context sensitive data may be used by a search engine 82 running on a front end computer server 80 (see FIG. 8) to generate search results identifying the location(s) of hybrid Web pages 40 (or other hybrid HTML documents which are not part of the Web) containing the context sensitive data.
  • conventional HTML documents 46 contain character data 48
  • HTML documents results in search results that are not particularly relevant to the human or machine entity requesting the search.
  • FIG. 3 is a block diagram illustrating the general structure used in the first embodiment to identify context sensitive data within hybrid HTML documents.
  • an item of context sensitive data within a hybrid HTML document is made up of character data 52 (or markup tags or both) encapsulated with opening and closing contextual markup tags 56 and 58 each comprising delimiters 60, contextual term 64, and preferably, predefined prefix 62.
  • the closing contextual markup tag 58 is distinguished from the opening contextual markup tag 56 by a terminator 66 located in the closing contextual markup tag 58. As illustrated in
  • the terminator 66 is preferably a forward slash ("/") located just after an initial HTML delimiter 60 of the closing contextual markup tag 58.
  • using a forward slash for the terminator 66 and locating it just after the initial HTML delimiter 60 provides a mechanism for marking the closing contextual markup tag 58 as an end tag in a manner that is widely recognized by parsers and browsers.
  • the opening and closing contextual markup tags 56 and 58 may be differentiated with a distinguishable indicator located within the opening contextual markup tag 56.
  • FIG. 5 illustrates a sample hybrid HTML document 50 having character data 52 encapsulated with opening and closing contextual markup tags 56 and 58 in accordance with the first embodiment.
  • the contextual markup tags 56 and 58 used in document 50 are presented in upper case to visually distinguish them from the conventional HTML markup in the hybrid HTML document 50.
  • the contextual markup tags 56 and 58 are implemented using proper tag nesting, with an opening contextual markup tag 56 preceding a corresponding closing contextual markup tag 58 in the hybrid document 50.
  • the contextual markup tags 56 and 58 also begin and end with HTML style delimiters 60.
  • the HTML style delimiters 60 implemented in the contextual markup tags are preferably the commonly used less-than ("<") and greater-than (">") characters.
  • the contextual markup tags 56 and 58 are concealed from view when the hybrid HTML document 50 is rendered by a Web browser such as Microsoft's Internet Explorer™, while the specific character data 52 encapsulated by the contextual markup tags 56 and 58 is presented by the Web browser as such character data 52 ordinarily would be presented in the absence of the contextual markup tags 56 and 58.
  • the contextual markup tags 56 and 58 can be used to add meaning to character data in an HTML document without adversely affecting the visual presentation of the HTML document when it is processed through a Web browser.
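The claim that contextual markup tags do not change what a reader sees can be checked informally with Python's standard `html.parser` module, used here as a stand-in for a Web browser. This is a simplified assumption: a real browser performs layout and styling, whereas the sketch only compares the character data left once every tag is discarded. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and ignores every tag, known or unknown."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(source):
    """Return the character data of an HTML source with all tags dropped."""
    parser = TextOnly()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)


plain = "<p>Visit Vancouver today.</p>"
hybrid = "<p>Visit <XHML:CITY>Vancouver</XHML:CITY> today.</p>"
```

Under this simplification, `visible_text(plain)` and `visible_text(hybrid)` return the same string.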
  • each contextual markup tag (56, 58) includes at least two contextual components located between the delimiters 60: a predefined prefix 62 and a contextual term 64.
  • each contextual markup tag (56, 58) preferably includes predefined prefix 62.
  • the predefined prefix 62 is used to identify a tag as a contextual markup tag.
  • the predefined prefix 62 is a predefined set of one or more characters which do not conflict with any predefined HTML tags or any other standards recognized for use within HTML tags.
  • while the contextual markup tags can be used to add context to content within HTML documents without using the predefined prefix 62, the use of the predefined prefix 62 is preferred as it avoids the risk of a contextual markup tag conflicting with known or future HTML tags.
  • the predefined prefix 62 is a sequence of characters, "XHML", which serves as a flag identifying the information between the HTML delimiters 60 as contextual markup information.
  • the predefined prefix 62 advantageously provides a mechanism for adding meaning to character data while avoiding the risk that the contextual markup tags may otherwise interfere with or corrupt the ordinary processing of the hybrid HTML document 50 as a conventional HTML document by a Web browser.
  • the contextual term 64 within a contextual markup tag provides the actual context specific meaning to the character data 52 encapsulated by an opening and closing pair of contextual markup tags 56 and 58.
  • the contextual term 64 may be any set of one or more characters (e.g. alphanumeric characters, special characters, etc.).
  • the contextual term 64 may include one or more words, or a sequence of characters which have no meaning in any written language but which provide a representation that the author of the contextual markup tags wishes to use to index the encapsulated character data.
  • the contextual term 64 may include a sequence of characters such as "xqtr" which, while this sequence has no meaning in written human language, may be used to associate character data with a computer-readable classification recognized by a computer program.
  • a predefined separator 63 (the colon in FIGS. 4 and 5) may be included between the predefined prefix 62 and the contextual term 64.
  • the predefined separator 63 may be used to improve the human readability of the hybrid terminology when the source code for the hybrid HTML document 50 is viewed through a Web browser, other viewer or in hardcopy format, particularly when the predefined prefix is a sequence of several characters.
  • HTML delimiters 60, the predefined prefix 62 and the contextual term 64 form the basis of the opening and closing contextual markup tags (56, 58) which are used to mark character data in HTML documents so as to form the hybrid HTML documents.
  • a particular set of character data may be encapsulated with more than one pair of contextual markup tags 56 and 58, providing the capacity to encapsulate character data in an HTML document with layers of context so as to provide for different levels of meaning being assigned to the character data by the different layers of contextual markup tags.
  • additional computer codes such as HTML tags may also be encapsulated by the contextual markup tags (56, 58), either alone or in combination with the character data 52.
  • graphical objects or multimedia objects may also be marked contextually with the contextual markup tags.
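Layered context can be recovered with a simple stack, as in the Python sketch below. It relies on the standard `html.parser` module, which lowercases tag names, so the sketch converts terms back to upper case to match the convention of FIG. 5. The class name, the naive end-tag handling (it pops without checking that the term matches) and the sample terms are assumptions for illustration.

```python
from html.parser import HTMLParser

PREFIX = "xhml:"  # HTMLParser reports tag names in lower case


class LayeredContext(HTMLParser):
    """Records character data together with every contextual term enclosing it."""

    def __init__(self):
        super().__init__()
        self.open_terms = []  # stack of currently open contextual terms
        self.items = []       # (tuple of enclosing terms, character data)

    def handle_starttag(self, tag, attrs):
        if tag.startswith(PREFIX):
            self.open_terms.append(tag[len(PREFIX):].upper())

    def handle_endtag(self, tag):
        # Naive: assumes properly nested tags and does not verify the term.
        if tag.startswith(PREFIX) and self.open_terms:
            self.open_terms.pop()

    def handle_data(self, data):
        if self.open_terms and data.strip():
            self.items.append((tuple(self.open_terms), data.strip()))


parser = LayeredContext()
parser.feed("<XHML:PERSON><XHML:AUTHOR><b>Jane Smith</b></XHML:AUTHOR></XHML:PERSON>")
parser.close()
```

The ordinary `<b>` tag inside the contextual pair is passed over, while the character data inherits both layers of context, outermost first.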
  • FIG. 6 shows a schematic diagram further illustrating the search processor software 22 of the first embodiment.
  • the software components of the search processor 22 include a scheduler 68 and a spider sub-system 70.
  • the scheduler 68 is programmed to launch the spider sub-system 70 which retrieves, scans and indexes hybrid Web pages 40 identified by the URLs in the list of URLs 28.
  • the scheduler 68 may use one of many known scheduling techniques to schedule the launch of the spider sub-system 70, including, by way of example, date and time scheduling, constant scheduling or event scheduling.
  • the scheduler 68 schedules the spider sub-system 70 to run at predetermined intervals (e.g. at midnight each night).
  • the list of URLs 28 accessible to the spider sub-system 70 may be generated by one or more techniques.
  • URLs may be submitted by Web authors and the like to the search processor 22 via interface 71 for addition to the list 28.
  • URLs may be batch added to the list (or queue) 28.
  • URLs may also be added to the list 28 by a crawler 67 residing on the back end computer server 24 or another computer.
  • the crawler 67 is programmed to retrieve the Web pages 40 referred to in the list of URLs 28 and to scan such Web pages 40 for hyperlinks to other Web pages.
  • the crawler 67 adds the URLs for such other Web pages to the list 28 and traverses the hyperlinks in search of further URLs to add to the list 28.
  • the crawler 67 can validate URLs in the list 28 to ensure that such URLs still exist and may also verify that such URLs are accessible.
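The hyperlink traversal and URL validation performed by the crawler can be sketched as a breadth-first walk. To keep the example runnable without network access, pages are supplied by a caller-provided `fetch` function that returns the page source or `None` when a URL cannot be retrieved. That function, the class and function names, and the sample pages are all assumptions made for this illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed_urls, fetch):
    """Return the URLs reachable from the seeds that `fetch` could retrieve."""
    queue = deque(seed_urls)
    seen = set()
    url_list = []  # plays the role of the list of URLs
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        source = fetch(url)
        if source is None:  # validation: skip URLs that cannot be retrieved
            continue
        url_list.append(url)
        collector = LinkCollector(url)
        collector.feed(source)
        queue.extend(collector.links)
    return url_list


# Hypothetical in-memory "Web" used in place of real HTTP requests.
pages = {
    "http://example.com/": '<a href="/a">A</a> <a href="/missing">gone</a>',
    "http://example.com/a": '<a href="/">home</a>',
}
urls = crawl(["http://example.com/"], pages.get)
```

A production crawler would also need politeness delays, robots.txt handling and error handling, none of which are shown here.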
  • the crawler 67 is programmed to perform a preliminary scan of the Web pages 40 identified by the URLs stored in the list 28 to determine if these Web pages 40 in fact include context sensitive data. Where a preliminary scan is performed, the crawler 67 produces a refined list of URLs based on list 28. The refined list represents those Web pages 40 within which the crawler 67 has found context sensitive data that is marked using the aforementioned contextual markup tags (56, 58) coded in an acceptable configuration (see FIG. 3 to 5). In this case, the spider sub-system 70 accesses the refined list referring to Web pages 40 verified as having hybrid HTML documents as opposed to the original list 28 of unverified
  • the spider sub-system 70 includes a spider 72, a parser 74, and an index processor 76.
  • the spider 72 is programmed to retrieve the Web pages 40 referred to by the URLs in the list of URLs 28, to provide such retrieved Web pages 40 to the parser 74 and to scan such Web pages 40 for contextual markup tags of the type shown in FIG. 3 to 5.
  • contextual markup tags (56 and 58) are identified by the spider sub-system 70 by detecting the predefined prefix 62 within such tags.
  • the parser 74 is programmed to parse the source code of Web pages retrieved by the spider 72. In order to parse a Web page (or other electronic document), the parser 74 accesses parsing rules specifying the acceptable structure and use of contextual markup tags 56 and 58 (see FIG. 3) in hybrid HTML documents. The parsing rules may also specify markup codes 77 that are to be excluded from processing if filtering is desired.
  • the parsing rules can be stored using one of several approaches, including, as illustrated in FIG. 6, with the use of a parsing rules lookup table 75.
  • the contextual markup tags (56 and 58 in FIG. 3) and their associated character data (or other computer-readable data marked by tags 56 and 58) are retrieved.
  • the contextual terms (64 in FIG. 3) are extracted from the contextual markup tags by the parser 74 (or the spider 72) and are used by the index processor 76 to index within the context sensitive database 32 the extracted character data associated with such contextual terms.
  • Contextual terms which are not currently included in the context sensitive database 32 to classify character data and associated location information are added by the index processor 76 to the context sensitive database 32.
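The indexing step can be approximated with a dictionary keyed by contextual term, where a term is added the first time it is encountered. The Python sketch below is an assumed in-memory simplification of the index processor; the class name, the case-folding of terms and the duplicate check are choices made for this example, not details from the description.

```python
from collections import defaultdict


class ContextIndex:
    """Maps each contextual term to the (character data, URL) pairs it marks."""

    def __init__(self):
        self.by_term = defaultdict(list)

    def add(self, term, item, url):
        key = term.upper()        # assumption: terms are treated case-insensitively
        entry = (item, url)
        if entry not in self.by_term[key]:  # new terms are created on first use
            self.by_term[key].append(entry)

    def lookup(self, term):
        return list(self.by_term.get(term.upper(), []))


# Hypothetical extracted data.
index = ContextIndex()
index.add("CITY", "Vancouver", "http://example.com/a")
index.add("city", "Victoria", "http://example.com/b")
index.add("CITY", "Vancouver", "http://example.com/a")  # duplicate, ignored
```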
  • predefined context symbols can be used by the spider sub-system 70 (and in particular the parser 74 in the first embodiment) to expand, collapse or modify at least one of the contextual terms retrieved from a hybrid HTML document.
  • the predefined context symbols can be stored in a lookup table 79 which associates each of the predefined context symbols with one or more predetermined contextual terms. If a contextual term retrieved from a hybrid HTML document matches any of the predetermined contextual terms in the lookup table 79, the predefined context symbol corresponding with the matching predetermined contextual term is retrieved and used by the spider sub-system 70 to replace the retrieved contextual term.
  • This technique may be used to replace certain contextual terms retrieved from a document with abbreviated contextual terms so as to reduce the amount of storage requirements for contextual terms within the context sensitive database 32.
  • the predefined context symbols may also be used to expand or simply replace cryptic contextual terms retrieved from a document with more meaningful contextual terms using the lookup table.
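The context symbol lookup can be sketched as a table that maps each predefined context symbol to its predetermined contextual terms, inverted once for fast matching. The symbols and terms below are invented for illustration, as are the names `CONTEXT_SYMBOLS` and `normalize_term`; matching on upper-cased terms is also an assumption.

```python
# Lookup table: predefined context symbol -> predetermined contextual terms.
CONTEXT_SYMBOLS = {
    "AUTH": {"AUTHOR", "WRITER", "WRITTEN_BY"},
    "LOC": {"CITY", "TOWN"},
}

# Inverted once so that a retrieved term can be matched directly.
TERM_TO_SYMBOL = {
    term: symbol
    for symbol, terms in CONTEXT_SYMBOLS.items()
    for term in terms
}


def normalize_term(term):
    """Replace a retrieved contextual term with its context symbol, if one matches."""
    return TERM_TO_SYMBOL.get(term.upper(), term)
```

Here several longer terms collapse to one short symbol, which reduces storage; the same table shape could map cryptic terms to more descriptive ones.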
  • the context sensitive database 32 may also be used to store and search for context sensitive information retrieved from XML-derived (or based) documents.
  • the search processor 22 can support the processing and indexing of context sensitive data from XML-derived documents as well as from electronic documents embedded with pairs of the contextual markup tags 56 and 58.
  • FIG. 7 is a block diagram illustrating a structure 90 for the context sensitive database 32 (see FIG. 8) used to store contextual terms (e.g. 64 in FIG. 3), associated character data (e.g. 52 in FIG. 3) and associated location information for the source hybrid HTML documents (e.g. 40 in FIG. 8).
  • the context sensitive database 32 is preferably structured to include a link table 92 which maps out the relative locations of a set of tables within the context sensitive database 32.
  • the link table 92 includes several fields including a contextual table ID field 93, a character data table ID field 95 and a resource location table ID field 97, that are used to reference a contextual terms table 94, a character data table 96 and a resource location table 98, respectively.
  • the contextual terms table 94 provides a mechanism to manage and link contextual terms (64) stored within the context sensitive database 32 along with character data (52) extracted from hybrid HTML documents (e.g. 40) and resource location information for the associated hybrid HTML documents.
  • the contextual terms table 94 may also be used to associate contextual terms (64) with other information derived from hybrid HTML documents in which such contextual terms (64) are used to add meaning to character data (52).
  • the contextual terms table 94 includes a plurality of records 100 for storing and associating valid contextual terms (64) with character data (52) stored in the character data table 96 and with resource location information stored in the resource location table 98.
  • Each record 100 includes: a term field 102 for storing a contextual term (64) which has been added to the context sensitive database 32 by the spider sub-system 70; and one or more fields 104 referencing locations in the character data table 96 where there is character data (52) associated with the contextual term (64) of the respective record 100.
  • each record 100 also includes one or more fields 106 referencing locations in the resource location table 98 where addressing information is located for resources associated with the contextual term of the respective record 100.
  • the character data table 96 provides a mechanism to store or reference character data (e.g. 52) and other items of computer-readable data which the spider has determined were encapsulated by contextual markup tags (56 and 58 in FIG. 3) within hybrid HTML documents 40.
  • the character data table 96 includes a plurality of records 110 for storing and associating character data retrieved from hybrid HTML documents with location information identifying the hybrid HTML documents containing such character data.
  • each record 110 includes a field 112 for storing character data (or references thereto).
  • Each record 110 also includes one or more fields 114 linking the corresponding character data stored in field 112 with location information within the resource location table 98 identifying hybrid HTML documents that contain such character data.
  • the character data table 96 preferably also includes one or more fields 116 linking the character data with associated contextual terms stored in the contextual terms table 94. These latter fields 116 provide a mechanism for the search engine 82 (FIG. 8) to easily retrieve, using a preliminary search term identifying character data within the context sensitive database 32, a list of contextual terms associated with the search term within the context sensitive database 32. The list of contextual terms may then be presented by the search engine 82 to the user to assist the user in refining his or her search based on the context assigned to one or more items of character data within the search term.
  • The resource location table 98 provides a mechanism to identify the specific location information for hybrid HTML documents 40 processed by the spider sub-system 70 (FIG. 6).
  • the resource location table 98 includes records 120 each having a field 122 that identifies the location information of a corresponding hybrid HTML document.
  • The resource location table records 120 may also include one or more additional fields 124 to reference character data within the character data table 96 that are found in corresponding hybrid HTML documents.
  • The resource location table records 120 may include one or more fields 126 to reference the contextual terms stored within the contextual terms table 94 that are used in the hybrid HTML document. These latter fields 126 provide an easy mechanism for the search engine 82 to retrieve and transmit to a user all or some of the contextual terms used to give meaning within a hybrid HTML document. This can assist the user in his or her searching by providing a summary of the contextual meaning embedded within the hybrid HTML document.
  • FIG. 8 is a schematic diagram of a search engine system 85 according to the first embodiment having a front end system and back end system.
  • On the front end, the search engine 82 resides as software on a front end computer server 80 and provides search engine services to user machines 84 directly or indirectly connected to the search engine 82.
  • some of the user machines 84 access the services of the search engine through a dial-up connection with an ISP and a network connection established over the Internet. Other connections between the user machines 84 and the computer server 80 may also be established.
  • the user machines may access the search engine services of the search engine 82 through an intranetwork, a direct dial-up connection, a cable or xDSL modem connection, a wireless connection, or a dedicated network connection or the like.
  • the search engine 82 is programmed to provide the user machines 84 with at least one type of search form 81 for completion by an end-user.
  • Completed search forms 81 or the search criteria entered into such search forms 81 serve as search requests 83 which are transmitted to the front end computer server 80.
  • Search requests 83 received by the front end computer server 80 are validated by the search engine 82 to ensure that they conform with at least one predefined search request structure recognized by the search engine 82.
  • The search forms 81 include a list of at least one contextual term stored within the context sensitive database 32. This list provides the user with an indication of some or all of the contextual terms available within the context sensitive database 32 to assist the user in formulating a context-based search request.
  • a search request 83 can be transmitted by the user machines 84 to the front end computer server 80 as a data signal 83a embodied in a carrier wave.
  • The search request data signal 83a can represent one or more segments 83b of the search request 83.
  • Each segment 83b may be used to identify search terms provided by the user and may include a segment 83c identifying one or more contextual terms defining the context within which the search is to be performed on some or all of the other search terms.
  • the search engine 82 can then search the context sensitive database 32 for references to hybrid HTML documents (for example, Web pages 40) having character data associated with such contextual term(s) and return search results 87 to the user machine that submitted the context-based search request 83.
  • the search results 87 include information identifying one or more matches, if any, within the context sensitive database 32.
  • a "match" represents an entry in the context sensitive database 32 identifying a hybrid HTML document having character data fitting within the parameters of the search criteria including being associated with one or more contextual terms which formed part of the requestor's context-based search request 83.
  • a match may also include reference to a specific set of contextual terms and character data that form part of the match.
  • a match can be transmitted from the computer server 80 to the requesting party's user machine 84 as a search result data signal 89 embodied in a carrier wave.
  • the search result data signal 89 can have a segment 89a identifying the hybrid HTML document associated with the match.
  • the search engine 82 may be preferably programmed to compare the search term(s) submitted with the search request(s) 83 with valid contextual terms used by the context sensitive database 32. If the search engine 82 determines that any valid contextual terms within the context sensitive database 32 are associated with (or match) any of the search terms, the search engine 82 then provides the user machine that submitted the search request 83 with at least one list of valid contextual terms in the context sensitive database 32 that are found to be associated with one or more of the search terms, along with instructions providing the user with the option to submit a context-based search request using one or more of the valid contextual terms provided.
  • the user may then refine the search request 83 and specify the context sensitive nature of the search request 83 by adding one or more of the contextual terms from the list of valid contextual terms provided by the search engine 82.
  • the refined search request (now context-based), once transmitted to and received by the front end computer server 80, may be used by the search engine 82 to conduct a context sensitive search for hybrid HTML documents referenced by the context sensitive database 32.
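The two-step refinement described above (suggest contextual terms for a preliminary search term, then run a context-based search) can be sketched as below. The in-memory `ENTRIES` list, the example URLs and the function names `suggest_terms` and `context_search` are hypothetical stand-ins for the context sensitive database 32 and the search engine 82; the "Washington" ambiguity is the example used in the Background.

```python
from collections import defaultdict

# Toy stand-in for the context sensitive database: each entry links an item
# of character data to a contextual term and a document location.
ENTRIES = [
    ("Washington", "president", "http://example.com/history.html"),
    ("Washington", "state", "http://example.com/travel.html"),
    ("Olympia", "capital", "http://example.com/travel.html"),
]

# Index from character data to its contextual terms (the role of fields 116).
terms_by_data = defaultdict(set)
for data, term, _url in ENTRIES:
    terms_by_data[data.lower()].add(term)


def suggest_terms(search_term):
    """List contextual terms associated with a preliminary search term."""
    return sorted(terms_by_data.get(search_term.lower(), set()))


def context_search(search_term, contextual_term):
    """Return locations where the data appears under the given context."""
    return [
        url
        for data, term, url in ENTRIES
        if data.lower() == search_term.lower() and term == contextual_term
    ]
```

A user searching for "Washington" would first be shown the terms `president` and `state`, and could then resubmit the request restricted to one of them.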
  • FIGS. 9 and 10 show flow diagrams illustrating the operation of a system for authoring, altering, indexing, storing and retrieving electronic documents embedded with contextual markup tags according to the first embodiment of the invention.
  • Reference is made to FIGS. 3 to 10 collectively. Reference numerals and labels have been repeated among the drawings to indicate corresponding elements and blocks.
  • an HTML document is generated or edited by a Web author or publisher using a commercially available HTML compatible editor.
  • any text editor or word processor capable of storing a document as a text file may be used to generate or edit HTML formatted files for publication as Web pages.
  • When the HTML document is generated or edited, pairs of contextual markup tags (56 and 58) are added at block 200 by the Web author to encapsulate or "tag" character data within the HTML document, resulting in hybrid HTML document 40 (FIG. 9).
  • The contextual markup tags (56 and 58) are used to assign meaning to the encapsulated character data so that compatible software applications such as the spider sub-system 70 can understand the context within which the encapsulated character data is used in the hybrid HTML document.
  • the author preferably submits the location information (e.g. the URL) for the hybrid HTML document 40 to the interface 71 which the author communicates with via the user machine 84 at block 202.
  • the interface 71 adds the URL to the list of URLs 28 identifying the location of hybrid HTML documents to be indexed by the search processor software 22.
  • a preliminary scan is performed of the Web pages corresponding to the URLs in the list of URLs 28 managed by the search processor 22.
  • This preliminary scan is performed by the crawler 67, which scans through the Web pages to determine whether or not they contain at least one pair of contextual markup tags which conform to the specified syntax (see FIGS. 3 and 4).
  • the URLs of Web pages containing contextual markup tags are added to a refined list of URLs by the crawler 67.
  • the spider sub-system 70 may be launched by the scheduler 68 to retrieve and index the hybrid Web pages 40 referenced in the refined list.
  • the spider sub-system 70 selects at block 212 a URL from the list of refined URLs and invokes the spider 72 to retrieve the corresponding hybrid Web page 40 at block 214.
  • the retrieved hybrid Web page 40 is scanned by the spider 72 at block 216 for context sensitive data identified by pairs of contextual markup tags (56 and 58).
  • the context sensitive data is identified with the assistance of the parser 74 and the parsing rules lookup table 75.
  • Context sensitive data that is identified in block 216 is retrieved by the spider 72 for further processing by the index processor 76.
  • the spider 72 retrieves and temporarily stores (caches) the contextual terms 64, the character data 52 corresponding to the contextual terms 64 (and other items of computer-readable data encapsulated by the contextual markup tags) and the location information for the retrieved Web page 40.
  • the spider 72 maintains the association between the contextual terms 64 and corresponding character data 52.
  • the context sensitive data may be added to the context sensitive database 32 at block 218.
  • the spider 72 invokes the index processor 76 to organize the insertion of the temporarily stored context sensitive data into the context sensitive database 32.
  • the index processor 76 adds the location information for the retrieved hybrid Web page 40 to a new instance of record 120 in the resource location table 98 if the Web page has not already been indexed into the database 32.
  • The index processor 76 also updates the context sensitive database 32 in block 218 to identify the relationships between the character data and the contextual terms retrieved from the hybrid Web page 40. Updating the context sensitive database 32 in this way depends upon the state of the database 32 in relation to the retrieved character data and contextual terms. If the character data is new to the database 32 and the contextual term that encapsulated the character data is also new to the database 32, then the index processor 76 adds the new character data to a new instance of record 110 in the character data table 96. In this latter case, the index processor 76 also adds the new contextual term to the contextual terms table 94. The new contextual term is added to a new instance of record 100 which is linked to the corresponding character data in the character data table 96.
  • If the character data is new to the database 32 but the contextual term already exists in the database 32, the index processor 76 adds the new character data to a new instance of record 110 in the character data table 96.
  • The index processor 76 also retrieves the reference to the existing record 100 in the contextual terms table 94 that already stores the contextual term.
  • The fields 104 and 106 for such existing record 100 are expanded or modified to add links to the new instance of record 110 and to the record in the resource location table 98 for the associated Web page 40.
  • If the contextual term is new to the database 32 but the character data already exists, the index processor 76 proceeds to create a new instance of record 100 for the contextual term and retrieve and modify the existing record 110 for the character data to link the record 110 to the new record 100 and the record 120 for the location of the associated Web page 40.
  • If both the contextual term and the character data already exist in the database 32, the index processor 76 retrieves the existing record 100 for the contextual term and the existing record 110 for the character data and links those records to the record 120 for the associated Web page 40.
  • a new instance of record 120 is used by the index processor 76 to store the location information for a Web page whose location does not already exist within the context sensitive database 32. If the location information to be stored in the context sensitive database 32 already exists in a record 120, then such record 120 is used by the index processor 76 to link with records (100 and 110) for new contextual terms and character data retrieved from the Web page 40 associated with the location information.
  • After processing of the retrieved hybrid Web page 40 is completed in block 218, the spider sub-system 70 checks at block 220 to see if any further hybrid Web pages remain to be processed from the refined list of URLs (generated in block 210). If any further hybrid Web pages remain, processing returns to block 212 where the spider sub-system 70 selects another URL from the refined list of URLs for the spider 72 to retrieve and process. Otherwise, the spider sub-system 70 concludes the current round of processing and awaits further instructions from the scheduler 68 to begin processing once more from block 210.
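The indexing loop of blocks 212 to 220 and the record updates of block 218 can be sketched as follows. This is a simplified illustration under stated assumptions: the class `ContextDatabase`, the dictionary layout and the `fetch_pairs` callback are hypothetical, with plain dictionaries and sets standing in for tables 94, 96 and 98 and their linking fields.

```python
from dataclasses import dataclass, field


@dataclass
class ContextDatabase:
    # Simplified stand-ins for tables 94, 96 and 98; dictionary keys play the
    # role of records and the sets play the role of the linking fields.
    terms: dict = field(default_factory=dict)      # term -> {"data", "urls"}
    data: dict = field(default_factory=dict)       # data -> {"terms", "urls"}
    locations: dict = field(default_factory=dict)  # url  -> {"terms", "data"}

    def index(self, url, pairs):
        """Add (contextual term, character data) pairs retrieved from url."""
        # Reuse the location record if the page was indexed before.
        loc = self.locations.setdefault(url, {"terms": set(), "data": set()})
        for term, item in pairs:
            # setdefault creates a record only when it is new, which covers
            # all four new/existing combinations of term and character data.
            term_rec = self.terms.setdefault(term, {"data": set(), "urls": set()})
            data_rec = self.data.setdefault(item, {"terms": set(), "urls": set()})
            term_rec["data"].add(item)
            term_rec["urls"].add(url)
            data_rec["terms"].add(term)
            data_rec["urls"].add(url)
            loc["terms"].add(term)
            loc["data"].add(item)


def process_refined_list(db, refined_urls, fetch_pairs):
    """Select each URL in turn, retrieve its tagged data, and index it."""
    for url in refined_urls:
        db.index(url, fetch_pairs(url))
```

Using get-or-create on each table keeps the four cases described above in one code path: a record is created when missing and otherwise only its links are extended.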

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, computer-readable instructions and a system for generating and indexing context sensitive HTML documents. A context sensitive HTML document is generated by inserting an opening contextual markup tag before an item of character data (or other computer-readable data) within an HTML document and by inserting a closing contextual markup tag after the item of character data. A predefined prefix for identifying contextual information is included in both the opening and closing contextual markup tags. At least one contextual term identifying a context within which the item of character data is used is also included in both the opening and closing contextual markup tags. Each contextual markup tag is marked with HTML delimiters. Context sensitive HTML documents generated in this way form hybrid HTML documents which remain compatible with HTML and which can be processed and incorporated into a context sensitive database that may be searched by users. A hybrid HTML document is processed by scanning it in search of character data marked with the contextual markup tags. The character data and associated contextual terms are retrieved from the hybrid HTML documents and added to a context sensitive database. When the character data is added to the database, the associated contextual terms and the character data are linked, and the character data is linked to an address identifying a location of the hybrid HTML document from which originated the character data and associated contextual terms.

Description

AUTHORING, ALTERING, INDEXING,
STORING AND RETRIEVING ELECTRONIC DOCUMENTS
EMBEDDED WITH CONTEXTUAL MARKUP
FIELD
The present invention relates generally to electronic documents, and more particularly to a method, apparatus and system for authoring, altering, indexing, storing, and retrieving electronic documents embedded with contextual markup tags.
BACKGROUND
The Internet has rapidly become one of the leading communications mediums of our age. One of the most popular applications used on the Internet is the World Wide Web (also referred to as the "Web" or "WWW"). Tens of thousands of Web sites around the world house millions of Web pages and related electronic documents.
Yet as the volume and diversity of information available over the Internet continue to grow, the ability to locate relevant information is becoming a greater challenge. With the growth of information accessible over the Internet, search engines have become an important tool for enabling users to search for and retrieve information over the Internet that is relevant to their needs. Popular search engines for searching the Internet are available from Yahoo™ (http://www.yahoo.com), Infoseek (http://www.infoseek.com), Lycos™ (http://lycos.cs.cmu.edu), AltaVista™ (http://www.altavista.com), Excite™ (http://www.excite.com), Microsoft™ Network (http://www.msn.com) and others. These search engines, however, do not provide context sensitive search services for context sensitive data.
The lack of context sensitive search engines is due, in part, to the widespread use of the HyperText Markup Language (HTML) to author and publish electronic documents. HTML is derived from the well known Standard Generalized Markup Language (SGML) and has been widely adopted to generate Web pages. HTML provides a set of predefined markup codes, commonly referred to as "tags", that can be inserted into or included in a text-based file (or document) that is viewable through a Web browser. The tags help define the display semantics of the text and instruct the Web browser on how to display the text.
The ease with which an author or publisher may generate, modify and publish electronic documents using HTML has contributed to the tremendous success of the Internet. However, conventional HTML does not provide a standard for including context sensitive data in electronic documents. As a result, it can be difficult, if not impossible, to build and maintain a context sensitive database based on the content of conventional HTML documents.
The contents of the databases which are used by search engines to provide users with search results are typically maintained with computer programs commonly known as "crawlers" and "spiders". A crawler is a program that can be used to autonomously explore the Internet or other networks in search of new or updated, publicly accessible resources such as Web sites, files available in FTP archives and Gopher documents. When a resource is found, it can be accessed by a spider which adds the location, identity and data from the resource to a search engine database.
When a spider accesses an identified Web page, it will typically index the Web page and add searchable content from the Web page to the search engine database. The absence of context sensitive data in conventional HTML documents limits the ability of a spider to meaningfully capture and index the context within which textual elements are written in the HTML documents. As a result, a common approach to indexing is to merely build and manage a non-contextual database. When a spider indexes a Web page, indexing will typically involve processing the HTML encoded document that embodies the Web page and separating text elements into words that are added to the database. In addition, the Web page may be indexed by adding to the database only words found in specific HTML tags, known as <META> tags, embedded in the Web page. While these <META> tags can be used to assign keywords to a document as a whole, there is currently no mechanism available for assigning contextual meaning to particular data within an HTML document or for organizing such data contextually. The limitations of conventional HTML make it very difficult for a spider to produce an extensive context sensitive database using HTML documents. For instance, while the term "Washington" may be used in a document to refer to a president, a state, a city, a capital, an actor, or another individual's name, such fine distinctions cannot be accomplished using conventional HTML.
With the absence of rich context sensitive data in HTML documents, users are discovering that searching for relevant information is becoming a more difficult task, even with the advanced search processes supported by conventional search engine technology. A search may return thousands or even hundreds of thousands of search results. Further aggravating the problem, search results can be cluttered with redundant or irrelevant information, making the retrieval of relevant information difficult.
In order to assist users in performing more directed searches, some spiders have been specialized so as to build indexes which are categorized by general topic areas. In such known solutions, a specialized spider is programmed to index predefined Web sites relating to specific topic areas, such as on-line medical journals. While categorizing Web sites according to general topic areas can help search engines perform more targeted searches, this approach does not produce a context sensitive database for users to search. As a result, the user is still faced with the difficulty that performing a search on the topic-related index can produce many results that contain the search terms but not in the context desired by the user.
The Extensible Markup Language (XML) provides another solution to the need to include context sensitive data within electronic documents accessible over the Internet and other networks. XML was introduced in part to serve as the basis for applications that permit Web authors and publishers to create XML Web pages containing structured context sensitive data. While XML is becoming more commonly used in the Internet community, HTML continues to be the markup language of choice for many Web authors and publishers, and the majority of Web sites continue to be populated with HTML documents.
In order to leverage the use of the large amounts of information captured in HTML documents, it would be desirable to have a mechanism for adapting existing and new HTML documents so that authors may place a context on their data. It would be further desirable to be able to implement such a mechanism in a way that would not interfere with the processing of HTML codes embedded within the documents.
SUMMARY OF THE INVENTION
The above and related desires are addressed in the present invention by providing a novel and nonobvious method, system and computer-readable instructions for authoring, altering, indexing, storing and retrieving context sensitive HTML documents (also referred to herein as hybrid HTML documents). The present invention can also be equally applied to electronic documents generated with XHTML or a context insensitive markup language that is a subset of SGML.
In accordance with one aspect of the invention, a computer-readable medium contains an electronic document generated with a context insensitive markup language, the electronic document having contextual markup tags and items of computer-readable data marked by the contextual markup tags. The contextual markup tags can each include a predefined prefix for identifying contextual information, at least one contextual term identifying a context within which at least one of the items of computer-readable data is used within the electronic document, and HTML delimiters. The contextual markup tags are preferably arranged into pairs of contextual markup tags encapsulating the at least one of the items of computer-readable data, with each pair including an opening contextual markup tag and a closing contextual markup tag. Marking items of computer-readable data within the electronic document with opening and closing contextual markup tags provides an easy mechanism for generating a rich range of context sensitive data within an otherwise context insensitive document.
In accordance with another aspect of the invention, a context sensitive HTML document is generated by inserting an opening contextual markup tag before an item of computer-readable data within an HTML document and by inserting a closing contextual markup tag after the item of computer-readable data. At least one contextual term identifying a context within which the item of computer-readable data is used is included in both the opening and closing contextual markup tags. Each contextual markup tag is marked with HTML delimiters. Preferably, a predefined prefix for identifying contextual information is also inserted in both the opening and closing contextual markup tags. Generating context sensitive HTML documents in this way with the use of the predefined prefix within the contextual markup tags provides hybrid HTML documents that remain compatible with HTML and avoids the risk of a conflict between the contextual markup tags and HTML markup tags. In this latter method, the predefined prefix of the opening contextual markup tag is preferably inserted before at least one contextual term thereof for ease of viewing and processing. A terminator may be inserted between the HTML delimiters of the closing contextual markup tag to distinguish the closing contextual markup tag from the opening contextual markup tag. The marked item of computer-readable data may also be advantageously marked with additional pairs of opening and closing contextual markup tags.
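The tagging method of this aspect can be sketched as below. The concrete syntax is an assumption made for illustration: the prefix `cx:`, the use of `<` and `>` as the HTML delimiters, and `/` as the terminator in the closing tag are not fixed by this passage, and `tag_item` is a hypothetical helper name.

```python
PREFIX = "cx:"  # hypothetical predefined prefix, assumed for illustration


def tag_item(item: str, *terms: str) -> str:
    """Wrap item in one pair of contextual markup tags per contextual term."""
    tagged = item
    # Apply terms innermost-first so the first term ends up outermost.
    for term in reversed(terms):
        # Opening tag: delimiters, prefix, term. The closing tag adds the
        # "/" terminator (assumed) to distinguish it from the opening tag.
        tagged = f"<{PREFIX}{term}>{tagged}</{PREFIX}{term}>"
    return tagged
```

For example, `tag_item("Washington", "state")` yields `<cx:state>Washington</cx:state>`, and passing several terms nests additional pairs around the same item.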
In another aspect of the invention, a computer-readable memory is used to store a spider and an index processor. The spider is programmed to scan electronic documents for context sensitive data and to retrieve the context sensitive data from the electronic documents, including electronic documents which conform to a context insensitive markup language. The index processor is programmed to add the context sensitive data retrieved by the spider to a context sensitive database. The spider preferably has computer-readable instructions to identify items of computer-readable data marked by at least one pair of contextual markup tags within the electronic documents. These latter instructions may be performed in conjunction with a separate set of code making up a parser. The spider can retrieve contextual terms from the contextual markup tags and the items of computer-readable data marked by the contextual markup tags. The index processor preferably includes instructions to index within the context sensitive database the items of computer-readable data at least according to the contextual terms retrieved from the contextual markup tags associated with the items of computer-readable data. In one embodiment, a crawler is included which is programmed to perform a preliminary scan of the electronic documents to determine which of the electronic documents include at least one item of context sensitive data.
Retrieving and indexing context sensitive data from an electronic document, generated with a context insensitive markup language but marked with contextual markup tags, provides a mechanism for supporting a richly organized context sensitive database which may then be readily and accurately searched on a contextual basis. Such context-based searches performed on the context sensitive database produce search results that are much less cluttered with irrelevant or out-of-context information, allowing users to more quickly retrieve relevant search results.
In yet another aspect of the invention, a method is provided for managing a context sensitive database in a computer system. In this aspect, an electronic document is scanned in search of items of computer-readable data marked with contextual markup tags that have contextual terms associated with the items of character data. The items of computer-readable data and associated contextual terms are retrieved from the electronic document and added to the context sensitive database. Preferably, in this latter stage the associated contextual terms and the items of computer-readable data are linked, and the items of computer-readable data are linked to an address identifying a location of the electronic document from which originated the items of computer-readable data and associated contextual terms. Preferably, during the scanning of the electronic document, contextual markup tags are examined for a predefined prefix distinguishing the contextual markup tags from other markup tags within the electronic document. When the electronic document is scanned, the search can be limited to contextual markup tags beginning and ending with at least one HTML delimiter.
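The scanning step of this method can be sketched with a regular expression. As with any illustration here, the tag syntax is assumed rather than taken from the specification: `cx:` is a hypothetical predefined prefix, `<` and `>` are taken as the HTML delimiters, and `scan` is a hypothetical function name.

```python
import re

PREFIX = "cx:"  # hypothetical predefined prefix, assumed for illustration

# Match an opening tag carrying the prefix, the enclosed data, and a closing
# tag carrying the same contextual term (backreference \1).
TAG_PAIR = re.compile(
    r"<" + re.escape(PREFIX) + r"([\w-]+)>(.*?)</" + re.escape(PREFIX) + r"\1>",
    re.DOTALL,
)


def scan(document: str, url: str):
    """Return (contextual term, data, url) for each tagged item found."""
    return [(m.group(1), m.group(2), url) for m in TAG_PAIR.finditer(document)]
```

Tags without the prefix, such as ordinary HTML markup, are skipped because the pattern requires the prefix immediately after the opening delimiter. This sketch reports only the outermost pair when contextual tags are nested; a fuller parser would descend into the enclosed data as well.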
According to another aspect of the invention, a computer system is provided having a context sensitive database and a spider sub-system. The spider sub-system is programmed to scan an electronic document for items of computer-readable data marked by contextual markup tags that each include a contextual term associated with at least one of the items of computer-readable data. The spider sub-system is also programmed to add the items of computer-readable data and associated contextual terms to the context sensitive database. When scanning the electronic document, the spider sub-system may also be programmed to scan the contextual markup tags for a predefined prefix distinguishing the contextual markup tags from other markup tags. Preferably, the spider sub-system includes instructions to link the associated contextual terms and the items of computer-readable data and to link the items of computer-readable data to an item of location information identifying a location of the electronic document.
In accordance with another aspect of the invention, a computer system is provided having a context sensitive database and a computer server coupled to the context sensitive database, wherein the computer system includes a search engine. The search engine is programmed to receive a search request from a requesting device. Based on the search request, the search engine is programmed to search the context sensitive database for references to electronic documents containing items of computer-readable data and contextual terms associated with the items of computer-readable data. The references and corresponding items of computer-readable data and associated contextual terms are retrieved from the context sensitive database by the search engine and transmitted to the requesting device.
In accordance with another aspect of the invention, a computer-readable medium is provided containing a data structure having a first segment, a second segment and a third segment. The first segment identifies a resource that contains an electronic document having items of computer-readable data marked with contextual markup tags including contextual terms associated with the items of computer-readable data and a predefined prefix distinguishing the contextual markup tags from other markup tags. The second segment identifies at least one of the items of the character data located within the electronic document. The third segment identifies at least one of the contextual terms associated with the at least one of the items of computer-readable data.
In accordance with yet another aspect of the invention, a computer-readable medium is provided containing a context sensitive database having items of computer-readable data and associated contextual terms scanned from electronic documents generated with a context insensitive markup language but embedded with contextual markup tags containing the contextual terms. This latter computer-readable medium also contains location information identifying addressable locations for the electronic documents.
According to yet another aspect, a computer system is provided for managing a context sensitive database wherein the computer system includes means for scanning an electronic document, generated with a context insensitive markup language, in search of items of computer-readable data marked with contextual markup tags, means for retrieving contextual terms associated with the items of computer-readable data from the contextual markup tags, means for retrieving the items of computer-readable data from the electronic document, and means for adding the items of computer-readable data and associated contextual terms to the context sensitive database.
In accordance with another aspect of the invention, a method is provided of performing a context-based computer search on a context sensitive database. In this latter method, a search request containing a search term is received from a requesting party. The search term is compared with contextual terms stored in the context sensitive database. At least one contextual term, found to be associated with or to match the search term, is retrieved from the context sensitive database. The retrieved contextual term(s) is transmitted to the requesting party along with instructions for a context-based search request to be submitted using the contextual term(s) transmitted to the requesting party.
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
In the accompanying drawings which illustrate embodiments of the invention,
FIG. 1 is a schematic diagram of a back end system for scanning HTML Web pages and retrieving context sensitive data from the scanned Web pages, according to a first embodiment of the invention;
FIG. 2 is a block diagram illustrating the conventional structure of an HTML document;
FIG. 3 is a block diagram of the general structure by which context sensitive data is embedded in HTML documents to form hybrid HTML documents in accordance with the first embodiment;
FIG. 4 is another diagram illustrating the general structure for context sensitive data in the first embodiment;
FIG. 5 is a block diagram illustrating a hybrid HTML document having character data encapsulated with contextual markup tags in accordance with the first embodiment;
FIG. 6 is another schematic diagram of the back end system of the first embodiment;
FIG. 7 is a schematic diagram of a search engine system according to the first embodiment having a front end system and back end system;
FIG. 8 is a block diagram illustrating the structure of a context sensitive database for storing information extracted from hybrid HTML documents;
FIG. 9 is a flow diagram illustrating the generation of hybrid HTML documents; and
FIG. 10 is a flow diagram illustrating a method of processing hybrid HTML documents in accordance with the first embodiment.
It will be appreciated that for simplicity and clarity of illustration, elements illustrated in the accompanying drawings have not necessarily been drawn to scale. For example, the dimensions of some of the elements are exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals and labels have been repeated among the drawings to indicate corresponding or analogous elements.
DETAILED DESCRIPTION
Reference will now be made in detail to implementations and embodiments of the invention, examples of which are illustrated in the accompanying drawings. In the implementations and embodiments which follow, the present invention is applied to conventional HTML documents to generate hybrid HTML documents embedded with context sensitive data. However, upon reading this specification, it will be appreciated by persons skilled in the art that the methodology of the present invention can be equally applied to electronic documents generated with XHTML or a context insensitive markup language that is a subset of SGML.
FIG. 1 is a schematic diagram of a back end system 20 for scanning hybrid HTML Web pages 40 and retrieving context sensitive data from the hybrid Web pages 40 according to a first embodiment of the invention. The back end system 20 includes a search processor 22 that resides in memory 25 and executes as software on a back end computer server 24. The search processor 22 receives and stores location information submitted by Web authors, Webmasters, publishers, organizations, crawlers and the like via user machines 44, Web site servers 42 and other networked sources. The location information identifies the location of resources directly or indirectly accessible by the back end computer server 24 and which have hybrid HTML documents (e.g. hybrid Web pages 40) containing context sensitive data to be processed by the search processor 22, as further described below. In this specification, the term "resource" refers to any computer-implemented object or data that can be accessed via the Internet or another computer network (intranet, LAN, wireless etc.) and which contains (or refers to electronic data files which contain), in whole or in part, text-based information. Examples of resources include Web sites, Web pages, file directories, URIs, URNs, URLs, IP addresses, POP, S/MIME, electronic data files and other electronic documents accessible over a network.
In the first embodiment, the location information is represented by Uniform Resource Locators (commonly known as URLs) which specify the locations of hybrid Web pages 40 made up of HTML documents embedded with context sensitive data (referred to in this specification as "hybrid HTML documents"). URLs received by the search processor 22 are preferably stored by the back end computer server 24 as a list (or queue) of URLs 28 on a local storage device 30. As discussed in further detail below, the search processor 22 includes software components which use the list of URLs 28 to access, retrieve and process the hybrid HTML documents so that the context sensitive data contained within the hybrid HTML documents can be extracted and indexed within a context sensitive database 32 accessible by the back end computer server 24. Contextual markup tags 56 and 58 (see FIG. 3) encapsulate items of computer-readable data, including character data 52 and other markup tags (for example, graphical or multimedia objects), to form the context sensitive data within the hybrid HTML documents. The term "character data" refers to textual elements of an electronic document which are not part of any HTML markup tags.
Once the context sensitive data is processed and added to the context sensitive database 32, the context sensitive data may be used by a search engine 82 running on a front end computer server 80 (see FIG. 8) to generate search results identifying the location(s) of hybrid Web pages 40 (or other hybrid HTML documents which are not part of the Web) containing the context sensitive data.
As illustrated in FIG. 2, conventional HTML documents 46 contain character data 48 (and other items of computer-readable data 47) that lacks contextual meaning. As a result, while such HTML documents may be searched for relevant information, the lack of contextual meaning for the character data 48 of HTML documents makes indexing such character data and performing automated searches over large volumes of such standard HTML documents difficult. Often, searching such conventional HTML documents results in search results that are not particularly relevant to the human or machine entity requesting the search.
FIG. 3 is a block diagram illustrating the general structure used in the first embodiment to identify context sensitive data within hybrid HTML documents. In its most basic form, an item of context sensitive data within a hybrid HTML document is made up of character data 52 (or markup tags or both) encapsulated with opening and closing contextual markup tags 56 and 58 each comprising delimiters 60, contextual term 64, and preferably, predefined prefix 62. In the first embodiment, the closing contextual markup tag 58 is distinguished from the opening contextual markup tag 56 by a terminator 66 located in the closing contextual markup tag 58. As illustrated in FIG. 4 and 5, the terminator 66 is preferably a forward slash ("/") located just after an initial HTML delimiter 60 of the closing contextual markup tag 58. Using a forward slash for the terminator 66 and locating it just after the initial HTML delimiter 60 provides a mechanism for marking the closing contextual markup tag 58 as an end tag in a manner that is widely recognized by parsers and browsers. In an alternative arrangement, the opening and closing contextual markup tags 56 and 58 may be differentiated with a distinguishable indicator located within the opening contextual markup tag 56.
FIG. 5 illustrates a sample hybrid HTML document 50 having character data 52 encapsulated with opening and closing contextual markup tags 56 and 58 in accordance with the first embodiment. For ease of reference, the contextual markup tags 56 and 58 used in document 50 are presented in upper case to visually distinguish them from the conventional HTML markup in the hybrid HTML document 50. As illustrated, the contextual markup tags 56 and 58 are implemented using proper tag nesting, with an opening contextual markup tag 56 preceding a corresponding closing contextual markup tag 58 in the hybrid document 50. The contextual markup tags 56 and 58 also begin and end with HTML style delimiters 60. The HTML style delimiters 60 implemented in the contextual markup tags are preferably the commonly used less-than ("<") and greater-than (">") characters. By defining the boundaries of the contextual markup tags 56 and 58 with HTML delimiters 60, the contextual markup tags 56 and 58 are concealed from view when the hybrid HTML document 50 is rendered by a Web browser such as Microsoft's Internet Explorer™, while the specific character data 52 encapsulated by the contextual markup tags 56 and 58 is presented by the Web browser as such character data 52 ordinarily would be presented in the absence of the contextual markup tags 56 and 58. Thus, by integrating a characteristic of HTML delimiters into the predefined structure of the contextual markup tags 56 and 58, the contextual markup tags 56 and 58 can be used to add meaning to character data in an HTML document without adversely affecting the visual presentation of the HTML document when it is processed through a Web browser.
In addition to the HTML delimiters 60, each contextual markup tag (56, 58) includes at least two contextual components located between the delimiters 60: a predefined prefix 62 and a contextual term 64. As indicated above, each contextual markup tag (56, 58) preferably includes predefined prefix 62. The predefined prefix 62 is used to identify a tag as a contextual markup tag. The predefined prefix 62 is a predefined set of one or more characters which do not conflict with any predefined HTML tags or any other standards recognized for use within HTML tags. Although the contextual markup tags can be used to add context to content within HTML documents without using the predefined prefix 62, the use of the predefined prefix 62 is preferred as it avoids the risk of a contextual markup tag conflicting with known or future HTML tags.
In the first embodiment, for ease of illustration, the predefined prefix 62 is a sequence of characters, "XHML", which serves as a flag identifying the information between the HTML delimiters 60 as contextual markup information. Implementing the contextual markup tags (56, 58) with HTML delimiters 60 and the predefined prefix 62 advantageously provides a mechanism for adding meaning to character data while avoiding the risk that the contextual markup tags may otherwise interfere with or corrupt the ordinary processing of the hybrid HTML document 50 as a conventional HTML document by a Web browser.
The contextual term 64 within a contextual markup tag provides the actual context specific meaning to the character data 52 encapsulated by an opening and closing pair of contextual markup tags 56 and 58. The contextual term 64 may be any set of one or more characters (e.g. alphanumeric characters, special characters, etc.). Thus, the contextual term 64 may include one or more words, or a sequence of characters which have no meaning in any written language but which provide a representation that the author of the contextual markup tags wishes to use to index the encapsulated character data. For example, the contextual term 64 may include a sequence of characters such as "xqtr" which, while this sequence has no meaning in written human language, may be used to associate character data with a computer-readable classification recognized by a computer program.
In an extension of the base structure of the opening and closing contextual markup tags (56, 58), a predefined separator 63 (the colon in FIG. 4 and 5) may be included between the predefined prefix 62 and the contextual term 64. The predefined separator 63 may be used to improve the human readability of the hybrid terminology when the source code for the hybrid HTML document 50 is viewed through a Web browser, other viewer or in hardcopy format, particularly when the predefined prefix is a sequence of several characters.
The combination of HTML delimiters 60, the predefined prefix 62 and the contextual term 64 form the basis of the opening and closing contextual markup tags (56, 58) which are used to mark character data in HTML documents so as to form the hybrid HTML documents. Once an item of character data is encapsulated with a pair of such opening and closing contextual markup tags (56, 58), the resulting combination forms context sensitive data within the hybrid HTML documents that may be processed by software components of the search processor 22 on the back end computer server 24.
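The tag structure described above can be illustrated with a minimal Python sketch. It assumes the "XHML" prefix of the first embodiment, an optional colon separator, and a contextual term made up of everything that follows up to the closing delimiter. The function name `parse_contextual_tag` and the regular expression are hypothetical and are not taken from the specification.

```python
import re

# Assumed tag shape: <XHML:term> opens and </XHML:term> closes.
# Group 1 captures the "/" terminator, group 2 the contextual term.
CONTEXT_TAG = re.compile(r"<(/?)XHML:?([^<>]+)>")


def parse_contextual_tag(tag):
    """Return (is_closing, contextual_term), or None for an ordinary markup tag."""
    match = CONTEXT_TAG.fullmatch(tag)
    if match is None:
        return None
    return (match.group(1) == "/", match.group(2))
```

Because recognition depends only on the delimiters and the prefix, a conventional tag such as `<B>` is rejected and would be left for ordinary HTML processing.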
As illustrated in FIG. 5, a particular set of character data (or other items of computer-readable data) may be encapsulated with more than one pair of contextual markup tags 56 and 58, providing the capacity to encapsulate character data in an HTML document with layers of context so as to provide for different levels of meaning being assigned to the character data by the different layers of contextual markup tags. As indicated above, additional computer codes such as HTML tags may also be encapsulated by the contextual markup tags (56, 58), either alone or in combination with the character data 52. Thus, for instance, graphical objects or multimedia objects may also be marked contextually with the contextual markup tags.
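Layered context of this kind can be recovered with a simple stack, as in the following sketch. It assumes properly nested tags with the "XHML" prefix and a colon separator. The function name `extract_layers` is hypothetical, and the sketch reduces each encapsulated span to its character data only, which is narrower than the specification, where encapsulated markup may itself be indexed.

```python
import re

TAG = re.compile(r"<(/?)XHML:([^<>]+)>")  # assumed prefix "XHML" and colon separator
ANY_TAG = re.compile(r"<[^<>]*>")         # any markup tag, contextual or conventional


def extract_layers(source):
    """Return (contextual_term, character_data) pairs, one per pair of contextual tags."""
    open_tags = []  # stack of (term, offset just past the opening tag)
    found = []
    for match in TAG.finditer(source):
        closing, term = match.group(1) == "/", match.group(2)
        if not closing:
            open_tags.append((term, match.end()))
        elif open_tags and open_tags[-1][0] == term:
            _, start = open_tags.pop()
            inner = source[start:match.start()]
            # Drop every tag inside the span so only character data remains.
            found.append((term, ANY_TAG.sub("", inner).strip()))
    return found
```

Applied to `<XHML:VEHICLE><XHML:TRUCK>F-150</XHML:TRUCK></XHML:VEHICLE>`, the sketch associates the same character data with both the inner and the outer contextual term.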
FIG. 6 shows a schematic diagram further illustrating the search processor software 22 of the first embodiment. The software components of the search processor 22 include a scheduler 68 and a spider sub-system 70. The scheduler 68 is programmed to launch the spider sub-system 70 which retrieves, scans and indexes hybrid Web pages 40 identified by the URLs in the list of URLs 28. The scheduler 68 may use one of many known scheduling techniques to schedule the launch of the spider sub-system 70, including, by way of example, date and time scheduling, constant scheduling or event scheduling. In the first embodiment, the scheduler 68 schedules the spider sub-system 70 to run at predetermined intervals (e.g. at midnight each night).
The list of URLs 28 accessible to the spider sub-system 70 may be generated by one or more techniques. URLs may be submitted to the list 28 by web authors and the like to the search processor 22 via interface 71. URLs may be batch added to the list (or queue) 28. URLs may also be added to the list 28 by a crawler 67 residing on the back end computer server 24 or another computer. The crawler 67 is programmed to retrieve the Web pages 40 referred to in the list of URLs 28 and to scan such Web pages 40 for hyperlinks to other Web pages. The crawler 67 adds the URLs for such other Web pages to the list 28 and traverses the hyperlinks in search of further URLs to add to the list 28. The crawler 67 can validate URLs in the list 28 to ensure that such URLs still exist and may also verify that such URLs are accessible.
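The crawler behaviour described above may be sketched as a breadth-first traversal. This is an illustration under stated assumptions rather than the specification's implementation: `crawl` and `fetch` are hypothetical names, `fetch` stands in for whatever retrieval mechanism is used and returns `None` for a URL that no longer exists or is inaccessible, and hyperlinks are assumed to be absolute so that no URL resolution is needed.

```python
import re
from collections import deque

HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def crawl(seed_urls, fetch):
    """Traverse hyperlinks breadth-first and return the validated list of URLs."""
    url_list = list(seed_urls)
    seen = set(url_list)
    queue = deque(url_list)
    while queue:
        url = queue.popleft()
        source = fetch(url)
        if source is None:
            url_list.remove(url)  # validation: drop URLs that cannot be retrieved
            continue
        for link in HREF.findall(source):
            if link not in seen:  # avoid revisiting pages that link to each other
                seen.add(link)
                url_list.append(link)
                queue.append(link)
    return url_list
```

Passing the `get` method of a dictionary of page sources as `fetch` is enough to exercise the sketch without network access.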
In a preferred arrangement, the crawler 67 is programmed to perform a preliminary scan of the Web pages 40 identified by the URLs stored in the list 28 to determine if these Web pages 40 in fact include context sensitive data. Where a preliminary scan is performed, the crawler 67 produces a refined list of URLs based on list 28. The refined list represents those Web pages 40 within which the crawler 67 has found context sensitive data that is marked using the aforementioned contextual markup tags (56, 58) coded in an acceptable configuration (see FIG. 3 to 5). In this case, the spider sub-system 70 accesses the refined list referring to Web pages 40 verified as having hybrid HTML documents as opposed to the original list 28 of unverified URLs.
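A preliminary scan of this kind might look like the following sketch, which keeps a URL only if its page contains at least one opening contextual tag followed by a matching closing tag. The names `has_context_pair`, `refine` and `fetch` are hypothetical, and the "XHML" prefix with a colon separator is assumed.

```python
import re

OPEN = re.compile(r"<XHML:([^<>]+)>")  # assumed prefix and separator


def has_context_pair(source):
    """True if some opening contextual tag has a matching closing tag after it."""
    for match in OPEN.finditer(source):
        if f"</XHML:{match.group(1)}>" in source[match.end():]:
            return True
    return False


def refine(url_list, fetch):
    """Return the refined list: URLs whose pages contain context sensitive data."""
    return [url for url in url_list if has_context_pair(fetch(url) or "")]
```

A page with an unclosed contextual tag, or one that cannot be retrieved, is left out of the refined list.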
As illustrated in FIG. 6, the spider sub-system 70 includes a spider 72, a parser 74, and an index processor 76. The spider 72 is programmed to retrieve the Web pages 40 referred to by the URLs in the list of URLs 28, to provide such retrieved Web pages 40 to the parser 74 and to scan such Web pages 40 for contextual markup tags of the type shown in FIG. 3 to 5. In the first embodiment, contextual markup tags (56 and 58) are identified by the spider sub-system 70 by detecting the predefined prefix 62 within such tags.
The parser 74 is programmed to parse the source code of Web pages retrieved by the spider 72. In order to parse a Web page (or other electronic document), the parser 74 accesses parsing rules specifying the acceptable structure and use of contextual markup tags 56 and 58 (see FIG. 3) in hybrid HTML documents. The parsing rules may also specify markup codes 77 that are to be excluded from processing if filtering is desired. The parsing rules can be stored using one of several approaches, including, as illustrated in FIG. 6, with the use of a parsing rules lookup table 75.
As the Web pages 40 are parsed, the contextual markup tags (56 and 58 in FIG. 3) and their associated character data (or other computer-readable data marked by tags 56 and 58) are retrieved. The contextual terms (64 in FIG. 3) are extracted from the contextual markup tags by the parser 74 (or the spider 72) and are used by the index processor 76 to index within the context sensitive database 32 the extracted character data associated with such contextual terms. Contextual terms which are not currently included in the context sensitive database 32 to classify character data and associated location information are added by the index processor 76 to the context sensitive database 32.
In one variation, predefined context symbols can be used by the spider sub-system 70 (and in particular the parser 74 in the first embodiment) to expand, collapse or modify at least one of the contextual terms retrieved from a hybrid HTML document. The predefined context symbols can be stored in a lookup table 79 which associates each of the predefined context symbols with one or more predetermined contextual terms. If a contextual term retrieved from a hybrid HTML document matches any of the predetermined contextual terms in the lookup table 79, the predefined context symbol corresponding with the matching predetermined contextual term is retrieved and used by the spider sub-system 70 to replace the retrieved contextual term. This technique may be used to replace certain contextual terms retrieved from a document with abbreviated contextual terms so as to reduce the amount of storage requirements for contextual terms within the context sensitive database 32. The predefined context symbols may also be used to expand or simply replace cryptic contextual terms retrieved from a document with more meaningful contextual terms using the lookup table.
Once the context sensitive data is extracted from the hybrid HTML documents (e.g. the hybrid Web pages 40) and the contextual terms and their associated character data and location information are added to the context sensitive database 32, the contextual terms and associated character data within the context sensitive database 32 may be used by a search engine on a front end computer server (see FIG. 8) to locate hybrid Web pages containing the context sensitive data and to generate search results identifying such hybrid Web pages. Advantageously, the context sensitive database 32 may also be used to store and search for context sensitive information retrieved from XML-derived (or based) documents. In this latter case the search processor 22 can support the processing and indexing of context sensitive data from XML-derived documents as well as from electronic documents embedded with pairs of the contextual markup tags 56 and 58.
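The context symbol lookup described in the variation above can be sketched with a dictionary. The table contents below are invented purely for illustration, and `CONTEXT_SYMBOLS`, `TERM_TO_SYMBOL` and `normalize_term` are hypothetical names.

```python
# Hypothetical contents for lookup table 79: each predefined context symbol
# is associated with one or more predetermined contextual terms.
CONTEXT_SYMBOLS = {
    "AUTO": ["AUTOMOBILE", "MOTORCAR"],  # collapse long terms to a short symbol
    "QUARTERLY_REPORT": ["xqtr"],        # expand a cryptic term
}

# Invert the table once so each retrieved term maps directly to its symbol.
TERM_TO_SYMBOL = {
    term: symbol
    for symbol, terms in CONTEXT_SYMBOLS.items()
    for term in terms
}


def normalize_term(term):
    """Replace a retrieved contextual term with its symbol, if one is defined."""
    return TERM_TO_SYMBOL.get(term, term)
```

Terms with no entry in the table pass through unchanged, so the lookup can be applied to every retrieved contextual term before indexing.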
FIG. 7 is a block diagram illustrating a structure 90 for the context sensitive database 32 (see FIG. 8) used to store contextual terms (e.g. 64 in FIG. 3), associated character data (e.g. 52 in FIG. 3) and associated location information for the source hybrid HTML documents (e.g. 40 in FIG. 8). The context sensitive database 32 is preferably structured to include a link table 92 which maps out the relative locations of a set of tables within the context sensitive database 32. As illustrated in FIG. 7, the link table 92 includes several fields including a contextual table ID field 93, a character data table ID field 95 and a resource location table ID field 97, that are used to reference a contextual terms table 94, a character data table 96 and a resource location table 98, respectively.
The contextual terms table 94 provides a mechanism to manage and link contextual terms (64) stored within the context sensitive database 32 along with character data (52) extracted from hybrid HTML documents (e.g. 40) and resource location information for the associated hybrid HTML documents. The contextual terms table 94 may also be used to associate contextual terms (64) with other information derived from hybrid HTML documents in which such contextual terms (64) are used to add meaning to character data (52). In the first embodiment, the contextual terms table 94 includes a plurality of records 100 for storing and associating valid contextual terms (64) with character data (52) stored in the character data table 96 and with resource location information stored in the resource location table 98. Each record 100 includes: a term field 102 for storing a contextual term (64) which has been added to the context sensitive database 32 by the spider 70; and one or more fields 104 referencing locations in the character data table 96 where there is character data (52) associated with the contextual term (64) of the respective record 100. Preferably, each record 100 also includes one or more fields 106 referencing locations in the resource location table 98 where addressing information is located for resources associated with the contextual term of the respective record 100.
The character data table 96 provides a mechanism to store or reference character data (e.g. 52) and other items of computer-readable data which the spider has determined were encapsulated by contextual markup tags (56 and 58 in FIG. 3) within hybrid HTML documents 40. The character data table 96 includes a plurality of records 110 for storing and associating character data retrieved from hybrid HTML documents with location information identifying the hybrid HTML documents containing such character data. In the first embodiment, each record 110 includes a field 112 for storing character data (or references thereto). Each record 110 also includes one or more fields 114 linking the corresponding character data stored in field 112 with location information within the resource location table 98 identifying hybrid HTML documents that contain such character data. The character data table 96 preferably also includes one or more fields 116 linking the character data with associated contextual terms stored in the contextual terms table 94. These latter fields 116 provide a mechanism for the search engine 82 (FIG. 8) to easily retrieve, using a preliminary search term identifying character data within the context sensitive database 32, a list of contextual terms associated with the search term within the context sensitive database 32. The list of contextual terms may then be presented by the search engine 82 to the user to assist the user in refining his or her search based on the context assigned to one or more items of character data within the search term.
The resource location table 98 provides a mechanism to identify the specific location information for hybrid HTML documents 40 processed by the spider 70 (FIG. 6). The resource location table 98 includes records 120 each having a field 122 that identifies the location information of a corresponding hybrid HTML document. The resource location table records 120 may also include one or more additional fields 124 to reference character data within the character data table 96 that are found in corresponding hybrid HTML documents. In a further variation, the resource location table records 120 may include one or more fields 126 to reference the contextual terms stored within the contextual terms table 94 used in the hybrid document. These latter fields 126 provide an easy mechanism for the search engine 82 to retrieve and transmit to a user all or some of the contextual terms used to give meaning within a hybrid HTML document. This can assist the user in his or her searching by providing a summary of the contextual meaning embedded within the hybrid HTML document.
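The cross-referenced tables described above can be approximated in memory as three dictionaries whose entries point at one another. This is only a sketch of the linkage between the contextual terms table 94, the character data table 96 and the resource location table 98. The class name `ContextDatabase`, its attribute names and the use of sets in place of reference fields are assumptions, and the link table 92 is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ContextDatabase:
    """Three cross-referenced tables, loosely following tables 94, 96 and 98."""

    terms: dict = field(default_factory=dict)      # term -> {"data": set, "urls": set}
    data: dict = field(default_factory=dict)       # character data -> {"terms": set, "urls": set}
    resources: dict = field(default_factory=dict)  # URL -> {"data": set, "terms": set}

    def add(self, term, text, url):
        """Index one item of character data under a contextual term and a URL."""
        term_rec = self.terms.setdefault(term, {"data": set(), "urls": set()})
        data_rec = self.data.setdefault(text, {"terms": set(), "urls": set()})
        res_rec = self.resources.setdefault(url, {"data": set(), "terms": set()})
        term_rec["data"].add(text)
        term_rec["urls"].add(url)
        data_rec["terms"].add(term)
        data_rec["urls"].add(url)
        res_rec["data"].add(text)
        res_rec["terms"].add(term)
```

Keeping references in all three directions mirrors the purpose of fields 104, 106, 114, 116, 124 and 126: a lookup can start from a contextual term, from character data or from a document location.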
FIG. 8 is a schematic diagram of a search engine system 85 according to the first embodiment having a front end system and back end system. On the front end, the search engine 82 resides as software on a front end computer server 80 and provides search engine services to user machines 84 directly or indirectly connected to the search engine 82. In the embodiment illustrated, some of the user machines 84 access the services of the search engine through a dial-up connection with an ISP and a network connection established over the Internet. Other connections between the user machines 84 and the computer server 80 may also be established. For example, the user machines may access the search engine services of the search engine 82 through an intranetwork, a direct dial-up connection, a cable or xDSL modem connection, a wireless connection, or a dedicated network connection or the like.
The search engine 82 is programmed to provide the user machines 84 with at least one type of search form 81 for completion by an end-user. Completed search forms 81 or the search criteria entered into such search forms 81 serve as search requests 83 which are transmitted to the front end computer server 80. Search requests 83 received by the front end computer server 80 are validated by the search engine 82 to ensure that they conform with at least one predefined search request structure recognized by the search engine 82. Preferably, to assist the search engine 82 in performing a context sensitive search of the context sensitive database 32, the search forms 81 include a list of at least one contextual term stored within the context sensitive database 32. This list provides the user with an indication of some or all of the contextual terms available within the context sensitive database 32 to assist the user in formulating a context-based search request 83.
A search request 83 can be transmitted by the user machines 84 to the front end computer server 80 as a data signal 83a embodied in a carrier wave. The search request data signal 83a can represent one or more segments 83b of the search request 83. Each segment 83b may be used to identify search terms provided by the user and may include a segment 83c identifying one or more contextual terms defining the context within which the search is to be performed on some or all of the other search terms. On receiving the at least one contextual term with the context-based search request 83, the search engine 82 can then search the context sensitive database 32 for references to hybrid HTML documents (for example, Web pages 40) having character data associated with such contextual term(s) and return search results 87 to the user machine that submitted the context-based search request 83.
The search results 87 include information identifying one or more matches, if any, within the context sensitive database 32. A "match" represents an entry in the context sensitive database 32 identifying a hybrid HTML document having character data fitting within the parameters of the search criteria including being associated with one or more contextual terms which formed part of the requestor's context-based search request 83. A match may also include reference to a specific set of contextual terms and character data that form part of the match. A match can be transmitted from the computer server 80 to the requesting party's user machine 84 as a search result data signal 89 embodied in a carrier wave. The search result data signal 89 can have a segment 89a identifying the hybrid HTML document associated with the match.
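A context-based search of this kind can be sketched over a flat list of index rows. The rows, the URLs and the function name `context_search` are invented for illustration, and substring matching on character data is an assumption, since the specification does not prescribe a matching rule.

```python
# Hypothetical index rows: (contextual term, character data, document URL).
INDEX = [
    ("CITY", "Vancouver", "http://example.com/travel.html"),
    ("TEAM", "Vancouver", "http://example.com/hockey.html"),
    ("CITY", "Victoria", "http://example.com/islands.html"),
]


def context_search(index, search_term, contextual_terms):
    """Return matches whose character data contains the search term and whose
    contextual term is one of those supplied with the request."""
    wanted = {term.upper() for term in contextual_terms}
    needle = search_term.lower()
    return [
        {"url": url, "term": term, "data": text}
        for term, text, url in index
        if term.upper() in wanted and needle in text.lower()
    ]
```

Searching for "vancouver" in the context "city" returns only the travel page, while the same search term without a matching context returns nothing.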
In a variation, if the search request(s) 83 received by the search engine 82 lacks any contextual terms used within the context sensitive database 32, the search engine 82 may be preferably programmed to compare the search term(s) submitted with the search request(s) 83 with valid contextual terms used by the context sensitive database 32. If the search engine 82 determines that any valid contextual terms within the context sensitive database 32 are associated with (or match) any of the search terms, the search engine 82 then provides the user machine that submitted the search request 83 with at least one list of valid contextual terms in the context sensitive database 32 that are found to be associated with one or more of the search terms, along with instructions providing the user with the option to submit a context-based search request using one or more of the valid contextual terms provided. The user may then refine the search request 83 and specify the context sensitive nature of the search request 83 by adding one or more of the contextual terms from the list of valid contextual terms provided by the search engine 82. The refined search request (now context-based), once transmitted to and received by the front end computer server 80, may be used by the search engine 82 to conduct a context sensitive search for hybrid HTML documents referenced by the context sensitive database 32.
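The refinement step in this variation can be sketched as a lookup that maps each submitted search term to the valid contextual terms associated with it. The function name `suggest_contextual_terms` and the row layout (contextual term, character data, URL) are hypothetical. A contextual term is treated as associated with a search term when it matches the term by name or is linked to character data containing it, which is one possible reading of "associated with (or match)".

```python
def suggest_contextual_terms(index, search_terms):
    """Map each search term to the contextual terms that it matches by name
    or that are linked to character data containing it."""
    suggestions = {}
    for word in search_terms:
        needle = word.lower()
        hits = sorted({
            term
            for term, text, _url in index
            if needle == term.lower() or needle in text.lower()
        })
        if hits:
            suggestions[word] = hits
    return suggestions
```

The returned lists correspond to the lists of valid contextual terms that the search engine 82 would present so that the user can resubmit a context-based search request.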
FIGS. 9 and 10 show flow diagrams illustrating the operation of a system for authoring, altering, indexing, storing and retrieving electronic documents embedded with contextual markup tags according to the first embodiment of the invention. For ease of reference in the following discussion, reference is made to FIGS. 3 to 10 collectively. Reference numerals and labels have been repeated among the drawings to indicate corresponding elements and blocks.
In the first embodiment, an HTML document is generated or edited by a Web author or publisher using a commercially available HTML compatible editor. However, any text editor or word processor capable of storing a document as a text file may be used to generate or edit HTML formatted files for publication as Web pages. As the HTML document is generated or edited, pairs of contextual markup tags (56 and 58) are added at block 200 by the Web author to encapsulate or "tag" character data within the HTML document, resulting in hybrid HTML document 40 (FIG. 9). As discussed above, the contextual markup tags (56 and 58) are used to assign meaning to the encapsulated character data so that compatible software applications such as the spider 70 can understand the context within which the encapsulated character data is used in the hybrid HTML document. Once the HTML document is modified with the contextual markup tags to form the hybrid HTML document 40, the author (or publisher) preferably submits the location information (e.g. the URL) for the hybrid HTML document 40 to the interface 71 which the author communicates with via the user machine 84 at block 202. The interface 71 adds the URL to the list of URLs 28 identifying the location of hybrid HTML documents to be indexed by the search processor software 22.
At block 210 a preliminary scan is performed of the Web pages corresponding to the URLs in the list of URLs 28 managed by the search processor 22. Preferably, this preliminary scan is performed by the crawler 67 which scans through the Web pages to determine whether or not they contain at least one pair of contextual markup tags which conform to the specified syntax (see FIGS. 3 and 4). The URLs of Web pages containing contextual markup tags are added to a refined list of URLs by the crawler 67.
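The preliminary scan of block 210 can be sketched in Python as a filter over already-downloaded pages. The prefix `ctx:`, the use of "/" as the terminator in the closing tag, and the function names are hypothetical stand-ins; the actual tag syntax is the one defined with reference to FIGS. 3 and 4, which is not reproduced here. Page retrieval is left out, so the sketch takes a mapping of URL to page text.

```python
import re

PREFIX = "ctx:"  # hypothetical predefined prefix, not taken from the description

# An opening tag <ctx:term>, any content, then the matching closing tag </ctx:term>.
PAIR_RE = re.compile(
    r"<" + re.escape(PREFIX) + r"([\w-]+)>(.*?)</" + re.escape(PREFIX) + r"\1>",
    re.DOTALL,
)


def has_contextual_pair(html):
    """True if the page contains at least one well-formed pair of contextual tags."""
    return PAIR_RE.search(html) is not None


def refine_url_list(pages):
    """Keep only the URLs whose page text contains a contextual tag pair.

    `pages` maps URL -> page text; insertion order is preserved.
    """
    return [url for url, html in pages.items() if has_contextual_pair(html)]
```

A page with an opening contextual tag but no matching closing tag is excluded in this sketch, on the assumption that only complete pairs conform to the specified syntax.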
After a refined list of URLs is generated at block 210, the spider sub-system 70 may be launched by the scheduler 68 to retrieve and index the hybrid Web pages 40 referenced in the refined list. By way of illustration, in the first embodiment the spider sub-system 70 selects at block 212 a URL from the list of refined URLs and invokes the spider 72 to retrieve the corresponding hybrid Web page 40 at block 214. The retrieved hybrid Web page 40 is scanned by the spider 72 at block 216 for context sensitive data identified by pairs of contextual markup tags (56 and 58). The context sensitive data is identified with the assistance of the parser 74 and the parsing rules lookup table 75. Context sensitive data that is identified in block 216 is retrieved by the spider 72 for further processing by the index processor 76. Preferably, for each retrieved Web page 40, the spider 72 retrieves and temporarily stores (caches) the contextual terms 64, the character data 52 corresponding to the contextual terms 64 (and other items of computer-readable data encapsulated by the contextual markup tags) and the location information for the retrieved Web page 40. When storing this retrieved collection of information, the spider 72 maintains the association between the contextual terms 64 and corresponding character data 52. Once a retrieved hybrid Web page 40 is scanned and the context sensitive data retrieved and temporarily stored (cached), the context sensitive data may be added to the context sensitive database 32 at block 218. In the first embodiment, the spider 72 invokes the index processor 76 to organize the insertion of the temporarily stored context sensitive data into the context sensitive database 32. In block 218, the index processor 76 adds the location information for the retrieved hybrid Web page 40 to a new instance of record 120 in the resource location table 98 if the Web page has not already been indexed into the database 32.
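The scanning and caching performed by the spider 72 at blocks 214 and 216 might look like the following Python sketch. It uses a stack so that an item marked by more than one pair of contextual tags yields one record per pair. The `ctx:` prefix, the "/" terminator, and the record layout are assumptions for illustration, and the parser 74 and parsing rules lookup table 75 are not modelled.

```python
import re

PREFIX = "ctx:"  # hypothetical predefined prefix

# Matches opening (<ctx:term>) and closing (</ctx:term>) contextual tags.
TAG_RE = re.compile(r"<(/?)" + re.escape(PREFIX) + r"([\w-]+)>")
ANY_TAG_RE = re.compile(r"<[^>]*>")  # used to strip markup from the captured data


def extract_context_data(url, html):
    """Return cached records associating contextual terms, character data and URL."""
    records = []
    open_tags = []  # stack of (term, index where the encapsulated content starts)
    for m in TAG_RE.finditer(html):
        is_closing, term = m.group(1) == "/", m.group(2)
        if not is_closing:
            open_tags.append((term, m.end()))
        elif open_tags and open_tags[-1][0] == term:
            _, start = open_tags.pop()
            raw = html[start:m.start()]
            text = ANY_TAG_RE.sub("", raw).strip()
            records.append({"term": term, "data": text, "url": url})
        # A closing tag that does not match the innermost open tag is ignored.
    return records
```

Each record keeps the contextual term, the character data and the page location together, mirroring the requirement that the spider maintain the association between contextual terms 64 and corresponding character data 52 while the data is cached.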
The index processor 76 also updates the context sensitive database 32 in block 218 to identify the relationships between the character data and the contextual terms retrieved from the hybrid Web page 40. Updating the context sensitive database 32 in this way depends upon the state of the database 32 in relation to the retrieved character data and contextual terms. If the character data is new to the database 32 and the contextual term that encapsulated the character data is also new to the database 32, then the index processor 76 adds the new character data to a new instance of record 110 in character data table 96. In this case, the index processor 76 also adds the new contextual term to the contextual terms table 94. The new contextual term is added to a new instance of record 100 which is linked to the corresponding character data in character data table 96.
If, in block 218, the character data is new to the database 32 but the contextual term that encapsulated the character data already exists within the database 32, then, as before, the index processor 76 adds the new character data to a new instance of record 110 in character data table 96. However, the index processor 76 also retrieves the reference to the existing record 100 in the contextual terms table 94 that already stores the contextual term. The fields 104 and 106 for such existing record 100 are expanded or modified to add links to the new instance of record 110 and to the record in the resource location table 98 for the associated Web page 40. Similarly, if the contextual term is new but the particular character data marked by the contextual term already exists in the context sensitive database 32, the index processor 76 proceeds to create a new instance of record 100 for the contextual term and retrieve and modify the existing record 110 for the character data to link the record 110 to the new record 100 and the record 120 for the location of the associated Web page 40.
If both the character data and the associated contextual term already exist in the context sensitive database 32, then the index processor 76 retrieves the existing record 100 for the contextual term and the existing record 110 for the character data and links those records to the record 120 for the associated Web page 40.
In block 218, a new instance of record 120 is used by the index processor 76 to store the location information for a Web page whose location does not already exist within the context sensitive database 32. If the location information to be stored in the context sensitive database 32 already exists in a record 120, then such record 120 is used by the index processor 76 to link with records (100 and 110) for new contextual terms and character data retrieved from the Web page 40 associated with the location information.
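The four cases handled in block 218 (new or existing character data combined with a new or existing contextual term) reduce to a "get or create, then link" pattern, sketched below in Python. The class, its attribute names and the use of in-memory dictionaries and sets are assumptions made for illustration; the description specifies records 100, 110 and 120 in tables 94, 96 and 98 but not their storage format.

```python
class ContextSensitiveDB:
    """Toy in-memory stand-in for the three linked tables."""

    def __init__(self):
        self.locations = {}  # URL -> location id (cf. record 120, table 98)
        self.char_data = {}  # character data -> links (cf. record 110, table 96)
        self.terms = {}      # contextual term -> links (cf. record 100, table 94)

    def add(self, term, data, url):
        """Insert or reuse the three records and link them to one another."""
        # Reuse the location record if the page was indexed before, else create it.
        loc_id = self.locations.setdefault(url, len(self.locations) + 1)
        # Get-or-create covers all four new/existing combinations of term and data.
        data_rec = self.char_data.setdefault(data, {"terms": set(), "locations": set()})
        term_rec = self.terms.setdefault(term, {"data": set(), "locations": set()})
        # Link term <-> data, and both to the page location.
        term_rec["data"].add(data)
        term_rec["locations"].add(loc_id)
        data_rec["terms"].add(term)
        data_rec["locations"].add(loc_id)
        return loc_id
```

Because sets ignore repeated additions, indexing the same page twice leaves the links unchanged in this sketch, which is one simple way to honour the condition that a location is only added if the Web page has not already been indexed.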
After processing of the retrieved hybrid Web page 40 is completed in block 218, the spider sub-system 70 checks at block 220 to see if any further hybrid Web pages remain to be processed from the refined list of URLs (generated in block 210). If any further hybrid Web pages remain, processing returns to block 212 where the spider sub-system 70 selects another URL from the refined list of URLs for the spider 72 to retrieve and process. Otherwise, the spider sub-system 70 concludes the current round of processing and awaits further instructions from the scheduler 68 to begin processing once more from block 210.
Although this invention has been described with reference to illustrative and preferred embodiments of carrying out the invention, this description is not to be construed in a limiting sense. Various modifications of form, arrangement of parts, steps, details and order of operations of the embodiments illustrated, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to this description. It is therefore contemplated that the appended claims will cover such modifications and embodiments as fall within the true scope of the invention.

Claims

WHAT IS CLAIMED IS:
1. A computer-readable medium containing an electronic document generated with a context insensitive markup language, the electronic document having contextual markup tags and items of computer-readable data marked by the contextual markup tags.
2. The computer-readable medium of Claim 1, wherein each of the contextual markup tags includes a predefined prefix for identifying contextual information, and at least one contextual term identifying a context within which at least one of the items of computer-readable data is used within the electronic document.
3. The computer-readable medium of Claim 1, wherein the items of computer-readable data include at least one reference to a computer-based multimedia entity or a computer-based graphical entity.
4. The computer-readable medium of Claim 1, wherein the items of computer-readable code include character data.
5. The computer-readable medium of Claim 1, wherein the electronic document is an HTML document.
6. The computer-readable medium of Claim 5, wherein the contextual markup tags include HTML delimiters, a predefined prefix for identifying contextual information, and at least one contextual term identifying a context within which at least one of the items of computer-readable data is used within the HTML document.
7. The computer-readable medium of Claim 6, wherein the contextual markup tags comprise pairs of contextual markup tags encapsulating the at least one of the items of computer-readable data, including an opening contextual markup tag and a closing contextual markup tag.
8. The computer-readable medium of Claim 7, wherein the HTML delimiters include delimiters selected from the group consisting of: "<" and ">".
9. The computer-readable medium of Claim 8, wherein the closing contextual markup tag includes a terminator between the HTML delimiters of the closing contextual markup tag to distinguish the closing contextual markup tag from the opening contextual markup tag.
10. The computer-readable medium of Claim 9, wherein the terminator is a backslash located immediately after the "<" delimiter.
11. A method of generating a context sensitive HTML document, the method comprising:
inserting an opening contextual markup tag before an item of computer-readable data within an HTML document, including inserting as part of the opening contextual markup tag HTML delimiters, and at least one contextual term identifying a context within which the item of computer-readable data is used; and
inserting a closing contextual markup tag after the item of computer-readable data, including inserting as part of the closing contextual markup tag the HTML delimiters and the at least one contextual term.
12. The method as claimed in Claim 11, including inserting into each of the opening and closing contextual markup tags a predefined prefix for distinguishing the opening and closing contextual markup tags from HTML tags.
13. The method as claimed in Claim 11, including inserting a terminator between the HTML delimiters of the closing contextual markup tag to distinguish the closing contextual markup tag from the opening contextual markup tag.
14. The method as claimed in Claim 11, including inserting the predefined prefix of the opening contextual markup tag before the at least one contextual term thereof.
15. The method as claimed in Claim 11, including marking the item of computer-readable data with additional pairs of opening and closing contextual markup tags.
16. A computer-readable memory comprising:
(a) a spider for scanning electronic documents for context sensitive data and for retrieving the context sensitive data from the electronic documents, wherein at least one of the electronic documents conforms to a context insensitive markup language; and
(b) an index processor, responsive to the spider, for adding the context sensitive data to a context sensitive database.
17. The computer-readable memory of Claim 16, wherein the spider further comprises computer-readable instructions to identify items of computer-readable data marked by at least one pair of contextual markup tags within the electronic documents.
18. The computer-readable memory of Claim 17, wherein the spider further comprises computer-readable instructions to retrieve contextual terms from the contextual markup tags and to retrieve the items of computer-readable data marked by the contextual markup tags.
19. The computer-readable memory of Claim 18, wherein the index processor further comprises computer-readable instructions to index the items of computer-readable data within the context sensitive database at least according to the contextual terms retrieved from the contextual markup tags associated with the items of computer-readable data.
20. The computer-readable memory of Claim 19, further comprising a crawler having computer-readable instructions to perform a preliminary scan of the electronic documents to determine which of the electronic documents include at least one item of context sensitive data.
21. The computer-readable memory of Claim 19, wherein the at least one of the electronic documents complies with HTML.
22. The computer-readable memory of Claim 19, wherein a plurality of the electronic documents comply with XHTML.
23. The computer-readable memory of Claim 19, wherein the spider further comprises instructions to replace one or more predetermined contextual terms retrieved from the electronic documents with abbreviated contextual terms.
24. The computer-readable memory of Claim 19, wherein the spider further comprises instructions to expand or replace one or more predetermined contextual terms retrieved from the electronic documents with more meaningful contextual terms using a lookup table.
25. The computer-readable memory of Claim 16, wherein the spider further comprises instructions to replace one or more predetermined contextual terms retrieved from the electronic documents with abbreviated contextual terms.
26. The computer-readable memory of Claim 16, wherein the spider further comprises instructions to expand or replace one or more predetermined contextual terms retrieved from the electronic documents with more meaningful contextual terms using a lookup table.
27. A method of managing a context sensitive database in a computer system, the method comprising:
scanning an electronic document in search of items of computer-readable data marked with contextual markup tags that have contextual terms associated with the items of computer-readable data;
retrieving the items of computer-readable data and associated contextual terms from the electronic document; and
adding the items of computer-readable data and associated contextual terms to the context sensitive database.
28. The method of Claim 26, wherein scanning comprises scanning the electronic document for a predefined prefix distinguishing the contextual markup tags from other markup tags within the electronic document.
29. The method of Claim 28, wherein adding further comprises linking the associated contextual terms and the items of computer-readable data, and linking the items of computer-readable data to an address identifying a location of the electronic document.
30. The method of Claim 29, wherein scanning further comprises searching for contextual markup tags beginning and ending with at least one HTML delimiter.
31. The method of Claim 30, wherein scanning further comprises identifying pairs of opening and closing contextual markup tags including identifying a terminator in the closing contextual markup tag.
32. The method of Claim 31, further comprising determining if the electronic document was generated with a context insensitive markup language.
33. The method of Claim 27, further comprising replacing one or more predetermined contextual terms retrieved from the electronic document with abbreviated contextual terms.
34. The method of Claim 27, further comprising replacing one or more predetermined contextual terms retrieved from the electronic document with other contextual terms using a lookup table.
35. A computer-readable memory having stored instructions for use in the execution of the method of Claim 27.
36. A computer-readable memory having stored instructions for use in the execution of the method of Claim 28.
37. A computer-readable memory having stored instructions for use in the execution of the method of Claim 29.
38. A computer-readable memory having stored instructions for use in the execution of the method of Claim 30.
39. A computer system comprising a computer and a memory having computer-readable codes for instructing the computer to perform the method of Claim 28.
40. A computer system comprising a computer and a memory having computer-readable codes for instructing the computer to perform the method of Claim 29.
41. A computer system comprising a computer and a memory having computer-readable codes for instructing the computer to perform the method of Claim 30.
42. A computer-readable memory having stored instructions for use in the execution of the method of Claim 31.
43. A computer system comprising:
(a) a context sensitive database; and
(b) a spider sub-system having computer-readable instructions to: (i) scan an electronic document for items of computer-readable data marked by contextual markup tags that each include a contextual term associated with at least one of the items of computer-readable data, and (ii) add the items of computer-readable data and the associated contextual terms to the context sensitive database.
44. The computer system of Claim 43, wherein the spider sub-system further comprises computer-readable instructions to scan the electronic document for a predefined prefix within each contextual markup tag, the predefined prefix distinguishing the contextual markup tags from other markup tags.
45. The computer system of Claim 44, wherein the spider sub-system further comprises computer-readable instructions to link the associated contextual terms and the items of computer-readable data and to link the items of computer-readable data to an item of location information identifying a location of the electronic document.
46. A computer system comprising:
(a) a context sensitive database; and
(b) a computer server coupled to the context sensitive database, the computer server comprising a search engine operable to instruct the computer server to:
(i) receive a search request from a requesting device;
(ii) based on the search request, search the context sensitive database for references to electronic documents generated with a context insensitive markup language and containing items of computer-readable data embedded with contextual markup tags; and
(iii) retrieve from the context sensitive database at least one of the references; and
(iv) transmit the at least one reference to the requesting device.
47. The computer system of Claim 46, wherein the search engine is further operable to instruct the computer server to retrieve from the context sensitive database, and transmit to the requesting device, a contextual term associated with at least one of the items of computer-readable data embedded with contextual markup tags.
48. A computer-readable medium containing a data structure comprising:
(a) a first segment identifying a resource that contains an electronic document having items of computer-readable data marked with contextual markup tags, wherein the contextual markup tags each include (i) a contextual term associated with at least one of the items of computer-readable data and (ii) a predefined prefix distinguishing the contextual markup tags from other markup tags;
(b) a second segment identifying at least one of the items of computer-readable data located within the electronic document; and
(c) a third segment identifying the contextual term associated with the at least one of the items of computer-readable data marked with the contextual tags.
49. A computer-readable medium containing a context sensitive database comprising:
(a) items of computer-readable data and contextual terms associated with the items of computer-readable data, the items of computer-readable data representing information scanned from electronic documents generated with a context insensitive markup language but embedded with contextual markup tags containing the contextual terms; and
(b) location information identifying addressable locations for the electronic documents.
50. The computer-readable medium of Claim 49, wherein the electronic documents include at least one HTML document.
51. The computer-readable medium of Claim 49, the context sensitive database further comprising:
(a) a first table containing a plurality of records identifying the location information for each of the electronic documents;
(b) a second table containing a plurality of records for storing the items of computer-readable data and for associating the items of computer-readable data with the location information of one or more of the electronic documents containing such items of computer-readable data; and
(c) a third table containing a plurality of records for storing and relating the contextual terms with (i) the items of computer-readable data stored in the second table and (ii) the location information stored in the first table.
52. A computer system for managing a context sensitive database, the computer system comprising:
(a) means for scanning an electronic document, generated with a context insensitive markup language, in search of items of computer-readable data marked with contextual markup tags;
(b) means for retrieving contextual terms associated with the items of computer-readable data from the contextual markup tags;
(c) means for retrieving the items of computer-readable data from the electronic document; and
(d) means for adding the items of computer-readable data and associated contextual terms to the context sensitive database.
53. The computer system of Claim 52, comprising means for linking the associated contextual terms and the items of computer-readable data within the context sensitive database and for linking the items of computer-readable data stored in the context sensitive database to an address identifying a location of the electronic document containing the items of computer-readable data and associated contextual terms.
54. A method of performing a context-based computer search on a context sensitive database, the method comprising:
receiving from a requesting entity a search request containing a search term;
comparing the search term with contextual terms stored in the context sensitive database;
retrieving from the context sensitive database at least one contextual term found to be associated with or match the search term; and
transmitting the at least one contextual term to the requesting entity along with instructions for a context-based search request to be submitted using the at least one contextual term transmitted.

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU69739/00A AU6973900A (en) 1999-09-29 2000-07-21 Authoring, altering, indexing, storing and retrieving electronic documents embedded with contextual markup

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US40733699A 1999-09-29 1999-09-29
US09/407,336 1999-09-29

Publications (2)

Publication Number Publication Date
WO2001024046A2 true WO2001024046A2 (en) 2001-04-05
WO2001024046A3 WO2001024046A3 (en) 2002-05-02

Family

ID=23611600

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CA2000/000861 WO2001024046A2 (en) 1999-09-29 2000-07-21 Authoring, altering, indexing, storing and retrieving electronic documents embedded with contextual markup
PCT/CA2000/001042 WO2001024045A2 (en) 1999-09-29 2000-09-08 Method, system, signals and media for indexing, searching and retrieving data based on context

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CA2000/001042 WO2001024045A2 (en) 1999-09-29 2000-09-08 Method, system, signals and media for indexing, searching and retrieving data based on context

Country Status (2)

Country Link
AU (2) AU6973900A (en)
WO (2) WO2001024046A2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7020667B2 (en) 2002-07-18 2006-03-28 International Business Machines Corporation System and method for data retrieval and collection in a structured format
SG120883A1 (en) * 2001-08-31 2006-04-26 Trusted Board Ltd Electronic approval of documents
AU2002304318B2 (en) * 2001-06-20 2007-07-26 Ingeneus Corporation Nucleic acid triplex and quadruplex formation
US7689910B2 (en) 2005-01-31 2010-03-30 International Business Machines Corporation Processing semantic subjects that occur as terms within document content
US8442982B2 (en) 2010-11-05 2013-05-14 Apple Inc. Extended database search
US8635691B2 (en) * 2007-03-02 2014-01-21 403 Labs, Llc Sensitive data scanner

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002097678A1 (en) * 2000-09-20 2002-12-05 Body1, Inc. Methods, systems, and software for automated growth of intelligent on-line communities
US20040172415A1 (en) 1999-09-20 2004-09-02 Messina Christopher P. Methods, systems, and software for automated growth of intelligent on-line communities
EP1244008A1 (en) 2001-03-20 2002-09-25 Sap Ag Method, computer program, and computer for automatically selecting application services for communicating data from a server to a client depending on the type of the client device
WO2003088665A1 (en) 2002-04-12 2003-10-23 Mitsubishi Denki Kabushiki Kaisha Meta data edition device, meta data reproduction device, meta data distribution device, meta data search device, meta data reproduction condition setting device, and meta data distribution method
JP4637113B2 (en) * 2003-11-28 2011-02-23 キヤノン株式会社 Method for building a preferred view of hierarchical data
EP1779269A1 (en) * 2004-07-26 2007-05-02 Panthaen Informatics, Inc. Context-based search engine residing on a network
WO2009003281A1 (en) * 2007-07-03 2009-01-08 Tlg Partnership System, method, and data structure for providing access to interrelated sources of information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DAO T: "AN INDEXING MODEL FOR STRUCTURED DOCUMENTS TO SUPPORT QUERIES ON CONTENT, STRUCTURE AND ATTRIBUTES" PROCEEDINGS OF THE FORUM ON RESEARCH AND TECHNOLOGY ADVANCES IN DIGITAL LIBRARIES, April 1998 (1998-04), pages 88-97, XP002925486 *
DELLA MEA V ET AL: "HTML generation and semantic markup for telepathology" COMPUTER NETWORKS AND ISDN SYSTEMS, NORTH HOLLAND PUBLISHING, vol. 28, no. 11, 1 May 1996 (1996-05-01), pages 1085-1094, XP004018210 AMSTERDAM, NL ISSN: 0169-7552 *
DOBSON S A ET AL: "Lightweight databases" COMPUTER NETWORKS AND ISDN SYSTEMS, NORTH HOLLAND PUBLISHING, vol. 27, no. 6, 1 April 1995 (1995-04-01), pages 1009-1015, XP004013202 AMSTERDAM, NL ISSN: 0169-7552 *
LUKE S ET AL: "ONTOLOGY-BASED WEB AGENTS" PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS, MARINA DEL REY, CA, 5 - 8 February 1997, pages 59-66, XP000775144 ACM, NEW YORK, NY, US ISBN: 0-89791-877-0 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2002304318B2 (en) * 2001-06-20 2007-07-26 Ingeneus Corporation Nucleic acid triplex and quadruplex formation
SG120883A1 (en) * 2001-08-31 2006-04-26 Trusted Board Ltd Electronic approval of documents
US7020667B2 (en) 2002-07-18 2006-03-28 International Business Machines Corporation System and method for data retrieval and collection in a structured format
US7689910B2 (en) 2005-01-31 2010-03-30 International Business Machines Corporation Processing semantic subjects that occur as terms within document content
US8635691B2 (en) * 2007-03-02 2014-01-21 403 Labs, Llc Sensitive data scanner
US8442982B2 (en) 2010-11-05 2013-05-14 Apple Inc. Extended database search
US9009201B2 (en) 2010-11-05 2015-04-14 Apple Inc. Extended database search

Also Published As

Publication number Publication date
WO2001024046A3 (en) 2002-05-02
AU6973900A (en) 2001-04-30
AU6976600A (en) 2001-04-30
WO2001024045A3 (en) 2002-05-10
WO2001024045A2 (en) 2001-04-05

Similar Documents

Publication Publication Date Title
US6721736B1 (en) Methods, computer system, and computer program product for configuring a meta search engine
US6516312B1 (en) System and method for dynamically associating keywords with domain-specific search engine queries
JP4857075B2 (en) Method and computer program for efficiently retrieving dates in a collection of web documents
US6321228B1 (en) Internet search system for retrieving selected results from a previous search
US6094649A (en) Keyword searches of structured databases
US7340459B2 (en) Information access
US6931397B1 (en) System and method for automatic generation of dynamic search abstracts contain metadata by crawler
US6519586B2 (en) Method and apparatus for automatic construction of faceted terminological feedback for document retrieval
US8312059B2 (en) Information organization and navigation by user-generated associative overlays
US6615209B1 (en) Detecting query-specific duplicate documents
US6665658B1 (en) System and method for automatically gathering dynamic content and resources on the world wide web by stimulating user interaction and managing session information
US8271486B2 (en) System and method for searching a bookmark and tag database for relevant bookmarks
US20070143317A1 (en) Mechanism for managing facts in a fact repository
US20020129062A1 (en) Apparatus and method for cataloging data
US6938034B1 (en) System and method for comparing and representing similarity between documents using a drag and drop GUI within a dynamically generated list of document identifiers
WO2001069428A1 (en) System and method for creating a semantic web and its applications in browsing, searching, profiling, personalization and advertising
EP1428139A2 (en) System and method for extracting content for submission to a search engine
US20040205047A1 (en) Method for dynamically generating reference indentifiers in structured information
WO2001016807A1 (en) An internet search system for tracking and ranking selected records from a previous search
US20070022096A1 (en) Method and system for searching a plurality of web sites
KR100359233B1 (en) Method for extracing web information and the apparatus therefor
WO2001024046A2 (en) Authoring, altering, indexing, storing and retrieving electronic documents embedded with contextual markup
US20050131859A1 (en) Method and system for standard bookmark classification of web sites
US20030046276A1 (en) System and method for modular data search with database text extenders
AU2007100279A4 (en) Systems and methods of directionally guided, discriminate crawling of internet real estate listings

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP