AU2004292680B2 - Method of constructing preferred views of hierarchical data - Google Patents

Publication number
AU2004292680B2
AU2004292680B2 AU2004292680A AU2004292680A AU2004292680B2 AU 2004292680 B2 AU2004292680 B2 AU 2004292680B2 AU 2004292680 A AU2004292680 A AU 2004292680A AU 2004292680 A AU2004292680 A AU 2004292680A AU 2004292680 B2 AU2004292680 B2 AU 2004292680B2
Authority
AU
Australia
Prior art keywords
node
nodes
occurrence probability
set
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2004292680A
Other versions
AU2004292680A1 (en
Inventor
Khanh Phi Van Doan
Alison Joan Lennon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to AU2003906611
Priority to AU2003906611A (AU2003906611A0)
Application filed by Canon Inc
Priority to AU2004292680A (AU2004292680B2)
Priority to PCT/AU2004/001676 (WO2005052810A1)
Publication of AU2004292680A1
Application granted
Publication of AU2004292680B2
Application status is Ceased
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99932 Access augmentation or optimizing

Description

WO 2005/052810 PCT/AU2004/001676

METHOD FOR CONSTRUCTING PREFERRED VIEWS OF HIERARCHICAL DATA

Copyright Notice

This patent specification contains material that is subject to copyright protection. The copyright owner has no objection to the reproduction of this patent specification or related materials from associated patent office files for the purposes of review, but otherwise reserves all copyright whatsoever.

Technical Field of the Invention

The present invention relates to the general field of information retrieval and, in particular, to the automatic identification and retrieval of relevant data from large hierarchical data sources.

Background

Extensible Markup Language (XML) is increasingly becoming a popular hierarchical format for storing and exchanging information. Whilst the hierarchical nature of XML makes it an excellent means for capturing relationships between data objects, it also makes keyword searching more difficult.

Keyword searching is of particular importance when dealing with a structured data format such as XML because it allows the user to locate particular keywords quickly without the need to know the internal structure of the data. It is a challenge when working with XML because there is no optimal or clearly preferred method for presenting the result of a keyword search. In the traditional unstructured text environment, the data system typically presents the user with the located keywords together with other text in their vicinity. If there is more than one 'hit', then the neighbouring text provides a useful context for distinguishing between hits, thereby allowing the user to quickly select the most relevant hit according to the user's needs. In the structured XML environment, there is no clear concept of 'neighbouring data' since data that are related to one another may reside at several disjoint locations within an XML document.
Thus it is difficult to identify or construct a suitable context for a hit in a keyword search. Consequently, most existing XML based systems simply return an entire XML document (out of a collection of XML documents) if a keyword hit occurs within the document, with the entire document effectively serving as the context for the hit. This is undesirable when documents are large and the user is not interested in seeing all of their contents.

Practical data sources, especially databases, often contain much more data than a user typically wants to see at any one time. For example, a database in a mail order store may contain details about all of its product lines, customers, suppliers, couriers, and lists of past and pending orders. A store clerk may at one time wish to see the current stock level for a particular product, and at another time may want to check the status of an order for a customer. A store manager on the other hand may wish to see the variation of the total sales for a particular product line over a number of months. In each of these cases it would be too distracting to the user if an avalanche of additional irrelevant data were to be also presented. Further, unless the user is familiar with the structure of the database, the user would typically be unable to identify information about which the user has an interest.

The traditional method for providing only relevant data is through the use of pre-created "views", prepared by someone who is familiar with the structure of the data source, such as a system administrator. Each view draws together some subset of the data source and is tailored for a distinct purpose. In the previously given examples, the store clerk would consult a "stock level" view or an "order status" view, whilst the manager would bring up a "sales" view.
Whilst this approach of using pre-created views may be satisfactory when all likely usage scenarios can be anticipated, it is inadequate for keyword searching. In a keyword search operation, a user enters one or more keywords and the system responds with a data set or view that includes occurrences of all keywords (assuming an AND Boolean keyword search operation). In a hierarchical environment such as XML, keyword hits may occur in several data items residing at different locations in the hierarchy. Since it is not feasible to anticipate all possible keyword combinations that a user may provide, it is not possible to pre-determine where in the hierarchy hits will occur. Consequently it is not possible to provide pre-created views that will cater for all search scenarios.

An analogous keyword searching problem also exists in the relational database environment. A relational database comprises tables joined through their primary and foreign keys, where each table comprises a plurality of rows each denoting an n-tuple of attribute values for some entity. A traditional solution to keyword searching in a relational database, described by Hristidis, V. and Papakonstantinou, Y., "DISCOVER: Keyword Search in Relational Databases", Proceedings of the 28th VLDB Conference, 2002, is to return a minimal joining network, which is the smallest network of joined rows across joined tables that contain all keyword hits. A problem with this approach is that it effectively treats rows as the smallest data "chunks" in that if a keyword hit occurs anywhere in a row of a database table then the entire row is returned as context for the hit. This may lead to excessive amounts of data being presented to the user since a typical relational database table often contains many columns that are not usually of interest to the user.
Further, adapting the above technique to hierarchical data structures such as XML may result in insufficient context information. In a hierarchical environment, related data may be stored at different levels in the hierarchy, and thus often data stored in a parent or ancestor node or their children may provide very useful context for a keyword hit, even though these may not be included in the minimal data set.

Some attempts have been made to address the keyword searching problem in hierarchical data. Florescu, D. et al, "Integrating Keyword Search into XML Query Processing", Ninth International World Wide Web Conference, May 2000, discloses a method of augmenting a structural query language with a keyword searching operator, "contains". This operator evaluates to TRUE if a specified sub-tree contains some specified keywords. The user can use this operator when constructing queries to filter out unwanted data. Whilst this useful feature does not require the user to specify the exact location of hit keywords within a given sub-tree, it does not go far enough since the user is still required to specify the exact format of the returned data in the search query and hence the user would still need to be familiar with the structure of the data source. In other words, free text keyword searching is still not possible, unless the user is willing to accept an entire data source as a result of the search.

Another existing approach to keyword searching in an XML data source requires the user to select from a given list of schema elements, the element representing the root node of the returned data. If a keyword hit occurs in a descendant node of a data element represented by the selected schema element, then the entire sub-tree below the data element, containing the hit keyword, is returned to the user. This approach is cumbersome because it requires user interventions.
Furthermore, the user is forced to accept an entire sub-tree even though it may contain data not of interest to the user.

Accordingly, there is a need for a method for determining a set of relevant data in a hierarchical data environment in response to a keyword search operation involving arbitrary combinations of keywords that does not require user interventions or prior user knowledge of the structure of the hierarchical data.

Summary

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing methods.

In accordance with one aspect of the present disclosure, there is provided a method of presenting data from at least one data source, said method comprising the steps of:
(i) holding a representation of said at least one data source and at least one previous view of said at least one data source;
(ii) identifying at least one compulsory entity in said representation; and
(iii) presenting data structure said at least one compulsory entity and one or more context entities, where said context entities are obtained from said representation and context data obtained from said at least one previous view.
More specifically disclosed is a method of presenting data from a hierarchical data source, said method comprising the steps of:
(i) constructing a first view of the hierarchical data source;
(ii) obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source;
(iii) identifying a compulsory entity in the first view;
(iv) selecting a context entity from the first view and the context data based on the occurrence probability; and
(v) presenting a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising a plurality of context data, wherein each of the plurality of context data corresponds to the identified compulsory entity and the selected context entity,
wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and the context entity is selected from the group consisting of:
(a) the ancestor node;
(b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity;
(c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node;
(d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and
(e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
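By way of illustration only, two of the structural candidate sets listed above, namely the nodes along the directed path from the ancestor node to the compulsory entity (item (b)) and descendants selected by distance from the ancestor node (item (d)), might be sketched as follows. The node names, the dictionary-based tree representation, the function names and the `max_distance` threshold are all assumptions made for this sketch and do not appear in the specification.

```python
from collections import deque


def path_from_ancestor(parent, ancestor, node):
    """Return the nodes on the directed path ancestor -> ... -> node."""
    path = [node]
    while path[-1] != ancestor:
        if path[-1] not in parent:
            raise ValueError(f"{ancestor!r} is not an ancestor of {node!r}")
        path.append(parent[path[-1]])
    return path[::-1]


def descendants_within(children, ancestor, max_distance):
    """Breadth-first list of descendants at most max_distance edges below ancestor."""
    found = []
    queue = deque([(ancestor, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_distance:
            continue  # do not expand beyond the distance threshold
        for child in children.get(node, []):
            found.append(child)
            queue.append((child, distance + 1))
    return found


# Hypothetical first view of a mail order store (names are illustrative only).
children = {
    "store": ["orders", "products"],
    "orders": ["order"],
    "order": ["customer", "status"],
    "products": ["product"],
}
parent = {c: p for p, cs in children.items() for c in cs}

path = path_from_ancestor(parent, "store", "status")   # candidate set (b)
near = descendants_within(children, "store", 1)        # candidate set (d)
print(path)
print(near)
```

The probability-based selection of item (c) is not covered by this sketch; it would need occurrence statistics gathered from previous views.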
Other aspects of the present invention, including apparatus and computer media, are also disclosed.

Brief Description of the Drawings

At least one embodiment of the present invention will now be described with reference to the drawings in which:
Fig. 1 is an example schema graph;
Fig. 2 is a flowchart of a keyword searching method;
Figs. 3A and 3B show two example parent nodes in a schema graph;
Fig. 4 is a diagram of a network of server and client computers;
Fig. 5 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node not lying along a directed path from the root node to a hit node;
Fig. 6 is a flowchart of another method for identifying context nodes among a set of child nodes of a parent node not lying along a directed path from the root node to a hit node;
Fig. 7 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node;
Fig. 8 is a flowchart of another method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node;
Fig. 9 is an example schema graph with two identical sub-trees;
Fig. 10 is a flowchart of the first, bottom-up traversal phase of the context node identification method with probability averaging;
Fig. 11 is an example of a schema graph with multiple hit nodes;
Fig. 12 is a flowchart of the first, bottom-up traversal phase of the context node identification method with probability averaging and involving multiple hit nodes;
Fig. 13 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node not lying along a directed path from the root node to a hit node, with probability averaging;
Fig. 14 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node, with probability averaging;
Fig. 15 is an example of a parent node whose descendant hit nodes are all located under a single child node;
Fig. 16 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node, with probability averaging, for the case where all multiple hit nodes are located under a single child node;
Fig. 17 is a flowchart of a method for identifying context nodes among a set of child nodes of a parent node lying along a directed path from the root node to a hit node, with probability averaging, for the case where hit nodes are located under multiple child nodes;
Fig. 18 is a flowchart of a method for identifying context nodes in which one or multiple hit nodes may be present;
Fig. 19 is a flowchart of a method for constructing context trees for cases involving a single hit node;
Fig. 20 is a flowchart of a method for constructing context trees for cases involving multiple hit nodes;
Fig. 21 is a flowchart of a method for constructing an alternative set of hit nodes that have higher observation frequencies than those in an original set of hit nodes;
Fig. 22 is a flowchart of a method for selecting an ancestor of a set of hit nodes that has a higher observation frequency than the set of hit nodes;
Fig. 23 is an example schema graph;
Fig. 24 is a schema graph of an example data view;
Fig. 25 is a schema graph of another example data view;
Fig. 26 is a schema graph of yet another example data view;
Fig. 27 is an occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 28 is a co-occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 29 is a leaf co-occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 30 is a sole child co-occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 31 is a portion of a joint-occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 32 is another portion of a joint-occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 33 is yet another portion of a joint-occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 34 is yet another portion of a joint-occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 35 is yet another portion of a joint-occurrence frequency table arising from the data views in Figs. 24, 25 and 26;
Fig. 36 is the schema graph of a context tree returned as a result of a keyword search operation involving two keywords;
Fig. 37 is a schematic block diagram of a general purpose computer upon which the arrangements described may be practiced;
Fig. 38 is a flowchart of a sub-process within the method for constructing context trees for cases involving a single hit node depicted in Fig. 19; and
Fig. 39 is a flowchart of a sub-process within the method for constructing context trees for cases involving multiple hit nodes depicted in Fig. 20.

Detailed Description including Best Mode

The present disclosure provides a method for determining a set of relevant data in a hierarchical data environment in response to a keyword search operation involving one or more keywords. A preferred implementation includes a Bayesian probabilistic based method that constructs preferred views of data in a hierarchical data structure based on how data is accessed in past episodes.
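As a rough illustration of how past access episodes could feed such a method, the sketch below tallies occurrence and pairwise co-occurrence frequencies over a set of previously logged views (compare the tables of Figs. 27 and 28) and uses their ratio as a simple estimate of how likely one node is to appear given that another is present. This relative-frequency estimate is a simplification chosen for the example and is not the Bayesian computation of the preferred implementation; the view contents, node names and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequency_tables(views):
    """Count how often each node, and each unordered pair of nodes, appears in past views."""
    occurrence = Counter()
    co_occurrence = Counter()
    for view in views:
        nodes = set(view)
        occurrence.update(nodes)
        for a, b in combinations(sorted(nodes), 2):
            co_occurrence[(a, b)] += 1
    return occurrence, co_occurrence


def relevance(candidate, compulsory, occurrence, co_occurrence):
    """Estimate P(candidate present | compulsory present) as a relative frequency."""
    if occurrence[compulsory] == 0:
        return 0.0
    key = tuple(sorted((candidate, compulsory)))
    return co_occurrence[key] / occurrence[compulsory]


# Three hypothetical past views, each given as the set of schema nodes it used.
past_views = [
    {"order", "customer", "status"},
    {"order", "customer"},
    {"product", "stock"},
]
occurrence, co_occurrence = frequency_tables(past_views)

for candidate in ("customer", "status", "stock"):
    print(candidate, relevance(candidate, "order", occurrence, co_occurrence))
```

Under these assumed views, a node that always accompanied "order" in the past receives the highest estimate and would be the strongest candidate to serve as context for a hit in "order".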
More specifically, the method makes use of the frequencies of past joint-occurrences between pairs and vectors of data items to compute the probability that a data item is relevant to some other compulsory data items. Typically, the compulsory data items are those containing keyword hits, and thus must be returned to the user in the keyword search results. If a non-compulsory data item has a high probability of being relevant to a compulsory data item, then it is likely to be returned in the search results to serve as context for the keyword hits.

A distinguishing feature of the presently disclosed arrangements with respect to traditional pre-created view based approaches is that the former is able to synthesise new views, rather than merely returning an existing stored view. Such arrangements are thus able to handle keyword search operations involving arbitrary keyword combinations, and since views are dynamically generated, they can be better tailored to individual operations than those obtained from a fixed pool of pre-created views.

The presently disclosed methods typically construct a number of alternative views, and assign a score for each view, signifying how much the view may be of interest to the user. In one implementation, a single view that has the highest score among those constructed is returned to the user. In an alternative implementation, a list of views is returned, sorted according to their scores, from highest to lowest.

Although keyword searching is its primary motivation, the presently disclosed methods can also be used to enhance a method of presentation of hierarchical data, such as that described in Australian Patent Application No. 2003204824 filed 19 June 2003 and corresponding United States Patent Application No. 10/465,222 filed 20 June 2003, both entitled "Methods for Interactively Defining Transforms and for Generating Queries by Manipulating Existing Query Data".
In that publication, a method for selecting the most appropriate presentation type (such as tables, graphs, plots, tree, etc.) based on the structure and contents of a hierarchical data source is disclosed. That method can be enhanced by incorporating a preferred implementation of the present disclosure as a means for automatically selecting a most preferred subset of the data source for display, prior to the selection of presentation type. It is often useful to display only a preferred subset of data in this way since hierarchical data sources often contain more information than what would normally be of interest to the user, and hence a method for filtering out 'uninteresting' data such as the preferred embodiment of the present invention can help to make the user's experience more satisfying and productive.

Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and symbolic representations of operations on data within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that the above and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "scanning", "calculating", "determining", "replacing", "generating", "initializing", "outputting", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the registers and memories of the computer system into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.

In addition, the present specification also discloses a computer readable medium comprising a computer program for performing the operations of the methods. The computer readable medium is taken herein to include any transmission medium for communicating the computer program between a source and a destination.
The transmission medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The transmission medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.

Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

The methods of keyword searching in general, and hierarchical data structure construction in particular, are preferably practiced using a general-purpose computer system 3700, such as that shown in Fig. 37 wherein the processes of Figs. 1 to 36 may be implemented as software, such as an application program executing within the computer system 3700. In particular, the steps of keyword searching are effected by instructions in the software that are carried out by the computer. The instructions may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part performs the searching methods and a second part manages a user interface between the first part and the user. The software may then be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer from the computer readable medium, and then executed by the computer. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer preferably effects an advantageous apparatus for keyword searching and hierarchical data structure construction.

The computer system 3700 is formed by a computer module 3701, input devices such as a keyboard 3702 and mouse 3703, output devices including a printer 3715, a display device 3714 and loudspeakers 3717.
A Modulator-Demodulator (Modem) transceiver device 3716 is used by the computer module 3701 for communicating to and from a communications network 3720, for example connectable via a telephone line 3721 or other functional medium. The modem 3716 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN), and may be incorporated into the computer module 3701 in some implementations.

The computer module 3701 typically includes at least one processor unit 3705, and a memory unit 3706, for example formed from semiconductor random access memory (RAM) and read only memory (ROM). The module 3701 also includes a number of input/output (I/O) interfaces including an audio-video interface 3707 that couples to the video display 3714 and loudspeakers 3717, an I/O interface 3713 for the keyboard 3702 and mouse 3703 and optionally a joystick (not illustrated), and an interface 3708 for the modem 3716 and printer 3715. In some implementations, the modem 3716 may be incorporated within the computer module 3701, for example within the interface 3708. A storage device 3709 is provided and typically includes a hard disk drive 3710 and a floppy disk drive 3711. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive 3712 is typically provided as a non-volatile source of data. The components 3705 to 3713 of the computer module 3701 typically communicate via an interconnected bus 3704 and in a manner which results in a conventional mode of operation of the computer system 3700 known to those in the relevant art. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations or like computer systems evolved therefrom.

Typically, the application program is resident on the hard disk drive 3710 and read and controlled in its execution by the processor 3705.
Intermediate storage of the program and any data fetched from the network 3720 may be accomplished using the semiconductor memory 3706, possibly in concert with the hard disk drive 3710. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 3712 or 3711, or alternatively may be read by the user from the network 3720 via the modem device 3716. Still further, the software can also be loaded into the computer system 3700 from other computer readable media. The term "computer readable medium" as used herein refers to any storage or transmission medium that participates in providing instructions and/or data to the computer system 3700 for execution and/or processing. Examples of storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 3701. Examples of transmission media include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.

Keyword searching in a hierarchical environment comprises identifying the nodes or elements in the hierarchical data structure where the keyword or keywords occur and then determining what other data elements are relevant to the keywords. In a typical keyword searching scenario, the resulting data presented to the user is a second hierarchical data structure extracted from the first data structure and containing all or some of the search keywords and other data considered to be relevant to these keywords.
Such a hierarchical data structure presented to the user as a result of the keyword search operation is referred to as a context tree. When the hierarchical data being searched has a governing schema, as is often is WO 2005/052810 PCT/AU2004/001676 17 the case with XML, it is generally advantageous to employ a method for identifying relevant data that operates at the schema level. That is, elements within the schema representation are analysed to determine whether they are relevant to the search keywords. All instances of data items in the data source collectively represented by the relevant 5 schema elements are then returned to the user as the result of the keyword search operation. In XML the governing schema can be in the form of an XML Schema, which itself is another hierarchical data structure. An XML Schema specifies the structure of the associated XML data, the list of elements and attributes in the XML data and their parent child relationships. Since each element or attribute in an XML Schema typically 10 represents many instances of elements and attributes in the XML data, an XML Schema is potentially a much smaller data structure and hence can be analysed more efficiently. It is often desirable to search for keywords in more than one hierarchical data source. Although each hierarchical data source on its own is tree-structured, when multiple data sources are considered together the resulting data structure may take on a 15 more general form. One such form that invariably arises in a database environment is illustrated in Fig. 1. This structure essentially comprises a number of trees with shared nodes, where each tree represents the schema of a distinct hierarchical data source and the shared nodes are the result of data views whose contents span multiple data sources. Specifically the dotted boxes 1005 and 1010 in Fig. 
1 denote the schemas of a first and second data source respectively, and node 1015 is the root node of a data view that brings together nodes 1020 and 1025 from the first data source and node 1030 from the second data source. The multiple shared-tree structure in Fig. 1, referred to herein as a schema graph, is a special form of a directed acyclic graph with the important characteristic that there is at most a single directed path between any two nodes. For example, there is only one directed path from node 1015 to node 1035 and this path passes through node 1020.

The schema graph is preferably constructed prior to a keyword searching operation and is made up of the initially disjoint individual tree-structured schemas of the hierarchical data sources. These schema trees are then joined when a data view is created that spans more than one data source. A data view typically comprises a query (such as an XQuery in an XML environment) and may be created by a database administrator or user. In either case, the database system preferably logs or records these queries in its storage device. During the construction of the schema graph, a schema representation of each logged query is created and inserted into the schema graph. This results in a joining of two or more separate schema trees if the schema being inserted contains nodes from these trees, as illustrated in Fig. 1. It is also possible that the schema being inserted into the schema graph contains nodes from only one data source, in which case a joining of separate schema trees does not arise. Instead the insertion operation simply results in new nodes being added and linked to existing nodes from a single schema tree in the schema graph.

The schema graph may be updated continually as new queries are logged, or it may be updated on one or more occasions after new queries and data views have been collected over some period of time.
Regardless of how often the schema graph is updated, when a keyword search operation is initiated, the schema graph current at the time of the operation is the one that is used to determine data views that are returned to the user. For the remainder of this document, the term "schema graph" refers to the schema graph that is current at the time a keyword search operation is performed.

As in the case of a single data source, keyword searching within multiple hierarchical data sources involves first identifying nodes within the schema graph where the search keywords are found, referred to as "hit" nodes, and then identifying nodes that are relevant to the hit nodes, referred to as "context" nodes. A data structure comprising the hit and context nodes is then constructed and presented to the user. Since hit nodes can be located in more than one data source, the resulting data structure presented to the user may span multiple data sources. The resulting data structure is preferably also tree-structured since its intended applications are in hierarchical data environments.

Fig. 4 shows a preferred configuration 4000 and generalised mode of operation of the keyword searching methods. The configuration 4000 comprises a PC client 4005, a data server 4010, a database 4015, a keyword search client 4025, and an index server 4030 connected together in a network. Each of the devices 4005, 4010, 4025 and 4030 is typically formed by a corresponding general purpose computer system, such as the system 3700, each linked by the network 3720, which is only illustrated conceptually in Fig. 4. This conceptual illustration is used to provide an uncluttered representation of the data flows between the various devices 4005, 4010, 4025 and 4030, which occur across the network 3720.
When necessary, appropriate or convenient, the various devices 4005, 4010, 4025 and 4030 may be combined into a smaller number of distinct computer systems 3700. For example, in some implementations, it may be convenient to combine the servers 4010 and 4030 into one computer system 3700, and combine the clients 4005 and 4025 into another computer system 3700, those systems 3700 being linked by the network 3720.

Data stored in the database 4015 is typically accessed by a user browsing at the PC client 4005. A browsing application, operating in the client 4005, issues commands preferably in the form of XQueries 4006 which are then transmitted to the data server 4010. Each XQuery 4006 is recorded in a log 4020 and analysed by the data server 4010, after which the requested data 4007 is fetched from the database 4015 and delivered to the PC client user 4005. At some point in time, preferably after a sufficient number of XQueries 4006 have been logged, the index server 4030 is activated and the logged XQueries 4020 are analysed to build an index table 4035. This process involves constructing a schema graph representation of the data stored in the database 4015 and its existing views represented by the logged XQueries 4020, building various frequency tables associated with these views, identifying searchable keywords in the database, determining one or more context trees and constructing a corresponding XQuery for each context tree, and finally recording these keywords and XQueries in an index table 4035 for later quick retrieval.

Once the building process of the index table 4035 completes, the system 4000 is ready to perform keyword search operations invoked at the keyword search client 4025.
Search keywords 4026 entered by the user are transmitted to the index server 4030 where they are looked up against the index table 4035 and one or more XQueries 4031 are retrieved and presented to the user, appropriately ranked according to their relevance to the search keywords. When the user selects an XQuery 4027 from the list, the XQuery 4027 is transmitted by the keyword search client 4025 to the data server 4010 which responds with the appropriate data 4011.

The method 2000 of keyword searching involving one or more hierarchical data sources is summarised by the flowchart in Fig. 2. The method 2000 is preferably executed on the computer of the index server 4030. The method 2000 begins at step 2005, where hit nodes are identified in the schema graph. In an XML environment, there are potentially two ways in which a hit node can arise in the schema graph: (i) its element name may contain one of the search keywords or (ii) one or more XML nodes it represents may contain one of the search keywords. Subsequent to step 2005, step 2010 identifies context trees in the schema graph, each comprising nodes in the data sources represented by the hit and context nodes in the schema graph. Finally at step 2015, the identified context trees are converted to XQueries and presented to the user as a ranked list.

Methods for identifying context trees denoted by step 2010 in Fig. 2 are now described in detail. A method is first presented for the special case where there is a single hit node in the schema graph, followed later by a more general method that can handle cases involving more than one hit node. Both methods operate in two phases. The first is a bottom-up traversal of the schema graph from the hit nodes to determine which of their parents and ancestors are context nodes, from which the second phase proceeds in a top-down fashion to determine which of their descendants are also context nodes.
The topmost ancestor of the hit nodes determined to be a context node then represents the root node of the context tree presented to the user as a result of the keyword search operation.

For the purpose of determining whether a node in the schema graph is a context node, preferably at least an occurrence frequency table and a co-occurrence frequency table are maintained. The former records the frequencies at which each node in the schema graph occurred in a logged query or data view whilst the latter records the frequencies at which pairs of nodes in the schema graph co-occur in the same logged query or data view. When the schema graph is updated with a new query or data view containing new nodes, new entries are added to the occurrence frequency table to represent the new nodes, and are each given an initial frequency value of 1 indicating that the nodes are new and have not previously been observed. Likewise, for each node-pair from the new query comprising two new nodes or a new node and an existing node, a new entry is added to the co-occurrence frequency table and given an initial frequency value of 1, whilst for each node-pair comprising a new node and an existing node not present in the new query, a new entry is added to the co-occurrence frequency table but is given an initial frequency value of 0.

As the schema graph is traversed, an occurrence probability is computed for each node, given the occurrences of the hit nodes. These conditional probability values are computed or approximated from values stored in the frequency tables due to previously logged queries and data views, and are used to determine whether a node is a context node.

The following is a description of the first method for the special case where there is a single hit node. Let this hit node be denoted by $X$.

In the first phase, a bottom-up traversal through the schema graph is made beginning at node $X$.
Each of X's ancestors $Y_i$ is considered in turn and its occurrence probability, given the occurrence of $X$, is computed:

$$\Pr[Y_i \mid X] = \frac{\Pr[Y_i \wedge X]}{\Pr[X]} \approx \frac{\mathrm{freq}(Y_i, X)}{\mathrm{freq}(X)} \qquad \text{(Eq. 1)}$$

where the probability value has been approximated by an occurrence frequency $\mathrm{freq}(X)$ and a co-occurrence frequency $\mathrm{freq}(Y_i, X)$. The latter denotes the frequency with which $X$ and $Y_i$ co-occur, where $Y_i$ is an ancestor of $X$. Both are obtained directly from the occurrence and co-occurrence frequency tables described earlier. From these probability values computed for the ancestor nodes of $X$ it is possible to determine the probability that a particular ancestor $Y_i$ is a root node, given $X$. Let $Z_1, \ldots, Z_n$ denote the parent nodes of $Y_i$, then

$$\Pr[Y_i \text{ root} \mid X] = \Pr[\neg Z_1 \wedge \cdots \wedge \neg Z_n \wedge Y_i \mid X] \qquad \text{(Eq. 2)}$$

That is, the probability that $Y_i$ is root, given $X$, is the probability that $Y_i$ is present and none of its parents are present, given $X$. Expanding the right hand side of Eq. 2 gives:

$$\Pr[Y_i \text{ root} \mid X] = \Pr[\neg Z_1 \wedge \cdots \wedge \neg Z_n \mid Y_i \wedge X]\,\Pr[Y_i \mid X] = \bigl(1 - \Pr[Z_1 \vee \cdots \vee Z_n \mid Y_i \wedge X]\bigr)\Pr[Y_i \mid X] \qquad \text{(Eq. 3)}$$

Since $Z_1, \ldots, Z_n$ are mutually exclusive given $Y_i$ ($Y_i$ can have at most one parent in any actual hierarchical data structure),

$$\Pr[Y_i \text{ root} \mid X] = \Bigl(1 - \sum_{j=1}^{n} \Pr[Z_j \mid Y_i \wedge X]\Bigr)\Pr[Y_i \mid X] = \Pr[Y_i \mid X] - \sum_{j=1}^{n} \Pr[Z_j \mid Y_i \wedge X]\,\Pr[Y_i \mid X] = \Pr[Y_i \mid X] - \sum_{j=1}^{n} \Pr[Z_j \wedge Y_i \mid X] \qquad \text{(Eq. 4)}$$

But since there is at most one directed path between $Z_j$ and $X$ (a characteristic of the schema graph), it follows that this path must include $Y_i$, and hence:

$$\Pr[Z_j \wedge Y_i \wedge X] = \Pr[Z_j \wedge X] \qquad \text{(Eq. 5)}$$

$$\Leftrightarrow \quad \Pr[Z_j \wedge Y_i \mid X] = \Pr[Z_j \mid X] \qquad \text{(Eq. 6)}$$

$$\Rightarrow \quad \Pr[Y_i \text{ root} \mid X] = \Pr[Y_i \mid X] - \sum_{j=1}^{n} \Pr[Z_j \mid X] \qquad \text{(Eq. 7)}$$

In a preferred implementation, a number of alternative context trees are returned to the user as results of a keyword search operation, one for each ancestor node $Y_i$ of $X$ whose associated probability $\Pr[Y_i \text{ root} \mid X]$ is greater than zero. These alternative context trees are each assigned a score equal to the associated probability $\Pr[Y_i \text{ root} \mid X]$ and are sorted according to these scores, from highest to lowest.
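The first-phase scoring can be illustrated with a minimal Python sketch. It assumes the frequency tables are plain dictionaries, with the co-occurrence table keyed by (ancestor, hit) pairs; the function name, the table layout and the toy node names are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Tuple


def rank_root_candidates(
    hit: str,
    ancestors: List[str],
    parents: Dict[str, List[str]],
    freq: Dict[str, int],
    cofreq: Dict[Tuple[str, str], int],
) -> List[Tuple[str, float]]:
    """Return (ancestor, Pr[ancestor root | hit]) pairs, highest score first."""

    def pr_given_hit(node: str) -> float:
        # Eq. 1: Pr[node | X] is approximated by freq(node, X) / freq(X).
        # Assumes the hit node has been observed at least once (freq > 0).
        return cofreq.get((node, hit), 0) / freq[hit]

    scored = []
    for y in ancestors:
        # Eq. 7: subtract the conditional probabilities of all parents of y.
        score = pr_given_hit(y) - sum(pr_given_hit(z) for z in parents.get(y, []))
        if score > 0:  # only ancestors with a positive score yield a context tree
            scored.append((y, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): "name" was seen 10 times, all 10 together with
# "person" and 4 of them together with "dept", the only parent of "person".
ranking = rank_root_candidates(
    hit="name",
    ancestors=["person", "dept"],
    parents={"person": ["dept"], "dept": []},
    freq={"name": 10},
    cofreq={("person", "name"): 10, ("dept", "name"): 4},
)
print(ranking)
```

In this toy case "person" outranks "dept" because most logged views containing "name" stop at "person" rather than extending up to "dept".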
Context trees with higher scores are considered to be of more interest to the user than those with lower scores. In an alternative implementation, only the context tree with the highest score is presented to the user as the result of the keyword search operation.

For each ancestor node $Y_i$ that can serve as the root node of a context tree (i.e. for which $\Pr[Y_i \text{ root} \mid X] > 0$), a second-phase, top-down traversal from $Y_i$ is performed to determine which of its descendants (except the hit node $X$) are context nodes. For each parent node $P_j$ visited during this phase, an analysis is performed to determine which of its children are context nodes. For each child node determined to be a context node, its children are in turn analysed in a top-down fashion to identify context nodes among them.

There are two distinct scenarios in the analysis of a parent node $P_j$, as illustrated in Figs. 3A and 3B. The first is a special case shown in Fig. 3A where $P_j$ lies along the path from the root node $Y_i$ 3005 to the hit node $X$ 3020. This includes the case $P_j = Y_i$ but excludes the case $P_j = X$. In this scenario, at least one child node of $P_j$ 3010, in this case the child node $C_1$ 3015 that lies along the path from $P_j$ to $X$, must be identified as a context node. In the more general second scenario, encompassing all remaining cases as shown in Fig. 3B, the parent node $P_j$ 3030 does not lie along a directed path from the root node $Y_i$ to the hit node $X$ 3035 and thus it is not compulsory to identify any child nodes of $P_j$ as context nodes. An algorithm for handling the second scenario will be presented first.

For a given hit node $X$ and a specific node $Y_i$ to serve as the root node of a context tree, the choice of whether some child node $C_k$ of a parent node $P_j$ is to be identified as a context node is in general a function of the probability that $C_k$ occurs given the presence of all nodes along the directed path from $Y_i$ to $X$:

$$\Pr[C_k \mid X \wedge \cdots \wedge Y_i \text{ root} \wedge \cdots \wedge P_j]$$

Since the evaluation or estimation of this probability is not possible with just the occurrence and co-occurrence frequency tables mentioned earlier, some form of simplification or approximation is needed. One such simplification preferably adopted is to ignore the effects of all nodes other than those from $Y_i$ to $C_k$ in the above probability expression, resulting in the expression:

$$\Pr[C_k \mid Y_i \text{ root} \wedge \cdots \wedge P_j]$$

where

$$\Pr[C_k \mid Y_i \text{ root} \wedge \cdots \wedge P_j] = \Pr[C_k \mid Y_i \text{ root} \wedge P_j] = \frac{\Pr[C_k \wedge Y_i \text{ root} \wedge P_j]}{\Pr[Y_i \text{ root} \wedge P_j]} = \frac{\Pr[C_k \wedge Y_i \text{ root}]}{\Pr[Y_i \text{ root} \wedge P_j]} \qquad \text{(Eq. 8)}$$

Let $Z_l$, $l = 1, \ldots, p$ denote the parent nodes of $Y_i$, then the right hand side of Eq. 8 can be expanded to

$$\frac{\Pr[C_k \wedge Y_i] - \sum_{l=1}^{p} \Pr[C_k \wedge Z_l]}{\Pr[P_j \wedge Y_i] - \sum_{l=1}^{p} \Pr[P_j \wedge Z_l]} \approx \frac{\mathrm{freq}(Y_i, C_k) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, C_k)}{\mathrm{freq}(Y_i, P_j) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, P_j)} \qquad \text{(Eq. 9)}$$

The above expression, however, only deals with each individual child node $C_k$ in isolation. Unless the $C_k$ are independent of one another (given $Y_i$ root), it is necessary to consider their joint probabilities. This however would require maintaining frequency tables storing the joint-occurrence frequencies of a large number of combinations of nodes, many of which would rarely be observed and hence it would not be possible to reliably estimate their joint probabilities from their joint-occurrence frequencies. On the other hand, assuming independence among the $C_k$ (given $Y_i$ root) may lead to undesirable results, such as none of the child nodes $C_k$ being selected if their individual occurrence probabilities (given $Y_i$ root) are low.
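As a rough illustration of the frequency-based estimate in Eq. 9, the following Python sketch computes the probability of one child node from a co-occurrence table. The dictionary layout (keys are (ancestor, descendant) pairs), the function name and the zero-denominator convention are assumptions made for this example only.

```python
from typing import Dict, List, Tuple


def child_probability(
    child: str,
    parent: str,
    root: str,
    root_parents: List[str],
    cofreq: Dict[Tuple[str, str], int],
) -> float:
    """Estimate Pr[child | root is root AND parent] following Eq. 9."""

    def f(ancestor: str, descendant: str) -> int:
        return cofreq.get((ancestor, descendant), 0)

    # Numerator: views where the child occurs under the root, minus views
    # where one of the root's own parents is present as well.
    numerator = f(root, child) - sum(f(z, child) for z in root_parents)
    # Denominator: the same count taken for the parent node.
    denominator = f(root, parent) - sum(f(z, parent) for z in root_parents)
    if denominator <= 0:
        # Assumption of this sketch: with no supporting observations the
        # estimate is reported as 0 rather than left undefined.
        return 0.0
    return numerator / denominator


# Toy table (hypothetical): root "Y" has a single parent "Z".
table = {("Y", "C"): 8, ("Z", "C"): 2, ("Y", "P"): 10, ("Z", "P"): 2}
q = child_probability("C", "P", "Y", ["Z"], table)
print(q)  # (8 - 2) / (10 - 2)
```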

In order to avoid the undesirable effects of independence assumptions among the $C_k$, whilst at the same time avoiding the need to maintain a large number of joint-occurrence frequency values, a heuristic method 5000 depicted by the flowchart in Fig. 5 may be used for selecting child nodes as context nodes. The method 5000 preferably operates as a sub-program within the method 2000 upon the server 4030.

The method 5000 begins at step 5005 where the occurrence probability of each child node $C_k$ given the root node $Y_i$, denoted by $Q_k = \Pr[C_k \mid Y_i \text{ root} \wedge P_j]$, is computed using Eq. 8 and Eq. 9. At the next step 5010, the probabilities $Q_k$ are summed over all child nodes $C_k$, the sum being denoted by $T$. The method 5000 continues at step 5015 where those nodes $C_k$ with the highest probability value are selected as context nodes. If more than one child node exists with the same highest probability value then all such nodes are selected as context nodes. The sum $S$ of the probabilities of all child nodes so far selected as context nodes is then computed at step 5020. Execution then proceeds to the decision step 5025, at which point if all child nodes $C_k$ have been selected as context nodes then the method terminates at step 5040. If however there are one or more child nodes $C_k$ not yet selected as context nodes then the method 5000 continues to another decision step 5030. At step 5030 a check is made to ascertain whether $S > T/2$ and if so the method again terminates at step 5040. If $S < T/2$ then execution proceeds to step 5035. Here the child nodes $C_k$ not yet selected as context nodes are examined to identify those with the highest probability value among themselves. These are selected as context nodes and the method 5000 returns to step 5020 for further processing in the manner discussed above.
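The selection loop of method 5000 can be sketched in Python as follows. The sketch takes the probabilities $Q_k$ as already computed (step 5005) and assumes that the boundary case $S = T/2$, which the description above leaves open, continues the loop; the function name and toy values are hypothetical.

```python
from typing import Dict, Set


def select_context_children(q: Dict[str, float]) -> Set[str]:
    """Pick child nodes in descending probability until S exceeds T / 2."""
    total = sum(q.values())          # step 5010: T
    remaining = dict(q)
    selected: Set[str] = set()
    s = 0.0
    while remaining:
        # Steps 5015 / 5035: take every remaining child that shares the
        # highest probability value, so that ties are selected together.
        top = max(remaining.values())
        for child in [c for c, p in remaining.items() if p == top]:
            selected.add(child)
            s += remaining.pop(child)  # step 5020: running sum S
        if not remaining:              # step 5025: everything selected
            break
        if s > total / 2:              # step 5030 (S == T/2 continues: assumption)
            break
    return selected


# Two children that usually appear together dominate the rest.
print(select_context_children(
    {"FirstName": 0.9, "LastName": 0.9, "Phone": 0.3, "Fax": 0.1}))
# Mutually exclusive children with equal frequencies are all selected.
print(select_context_children({"a": 0.5, "b": 0.5}))
```

Selecting tied children as a group is what biases the procedure towards returning more rather than fewer context nodes.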
The method 5000 has a number of desirable properties:

• If logged queries or data views intersect with sufficiently high frequencies (i.e. with a relatively large number of child nodes in common) then the method 5000 tends to return their intersections as context nodes. This is likely to lead to an acceptable result since an intersection that is sufficiently large tends to carry sufficient context information (for the hit node).

• If logged queries or data views have relatively few child nodes in common, then the resulting set of context nodes tends to comprise not only their intersections but also additional nodes. Experiments conducted by the present inventor show the resulting set of context child nodes tends to reflect that of the most frequent logged query or data view. This is significant since the intersections alone would not be likely to contain sufficient context information.

• Due to the inclusion of child nodes with identically highest probability value as a whole, the method 5000 is biased towards identifying more rather than fewer nodes as context nodes. In the case where the sets of child nodes present in logged queries are mutually exclusive and occur with equal frequencies, the method identifies all child nodes as context nodes.

In the method 5000, when a parent node is identified as a context node, one or more of its children are always identified as context nodes as well. This may be undesirable if there are many logged queries or data views in which the parent node occurs without any of its children (i.e. occurs as a leaf node). Intuitively, if this occurs sufficiently often then the parent node alone should be identified as a context node without any of its children, to reflect the frequently observed behaviour.

To remedy this issue, a preferred implementation makes use of an additional leaf co-occurrence frequency table, generated and stored by the index server 4030.
This table stores the frequency at which a node $P_j$ co-occurs as a leaf node in past logged queries and data views with its ancestor $Y_i$, for every possible pair of such nodes $P_j$ and $Y_i$, excluding those nodes $P_j$ that have no children in the schema graph. This new frequency table is then used to estimate the probability that a node $P_j$ occurs as a leaf node, given $P_j$ and some root node $Y_i$:

$$\Pr[P_j \text{ leaf} \mid Y_i \text{ root} \wedge P_j] = \frac{\Pr[P_j \text{ leaf} \wedge Y_i \text{ root}]}{\Pr[Y_i \text{ root} \wedge P_j]} \approx \frac{\mathrm{freq}(Y_i, P_j \text{ leaf}) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, P_j \text{ leaf})}{\mathrm{freq}(Y_i, P_j) - \sum_{l=1}^{p} \mathrm{freq}(Z_l, P_j)} \qquad \text{(Eq. 10)}$$

where $Z_l$, $l = 1, \ldots, p$ denote the parent nodes of $Y_i$ as defined earlier, and $\mathrm{freq}(Y_i, P_j \text{ leaf})$ and $\mathrm{freq}(Z_l, P_j \text{ leaf})$ are co-occurrence frequency values obtained from the new leaf co-occurrence frequency table.

The probability $\Pr[P_j \text{ leaf} \mid Y_i \text{ root} \wedge P_j]$ is preferably determined in an additional decision step prior to the method 5000 given in Fig. 5 for identifying which child nodes of $P_j$ are context nodes. If $\Pr[P_j \text{ leaf} \mid Y_i \text{ root} \wedge P_j]$ is greater than 0.5, that is, if $P_j$ has mostly been observed as a leaf, then no child nodes of $P_j$ are selected as context nodes; otherwise the method 5000 is performed to identify which child nodes are context nodes.

An alternative implementation is also possible, and employs an alternative method 6000 whose flowchart is given in Fig. 6 for selecting context nodes among a set of child nodes $C_k$, $k = 1, \ldots, m$. The method 6000, which is also performed by the index server 4030, begins at step 6001 where a fictitious child node $C_0$ is conceptually created and added to the list of actual child nodes $C_1, \ldots, C_m$ and is assigned a probability value $Q_0 = \Pr[P_j \text{ leaf} \mid Y_i \text{ root} \wedge P_j]$ using Eq. 10. At the next step 6005, the actual child nodes $C_k$ are assigned their usual probability values $Q_k = \Pr[C_k \mid Y_i \text{ root} \wedge P_j]$ using Eq. 8 and Eq. 9. The method 6000 then continues at step 6006 by invoking method 5000 at step 5010 (skipping step 5005) to select among the child nodes $C_0$, ...
, $C_m$ a set of context nodes. When the method 5000 exits, the method 6000 resumes at decision step 6010 where a check is made to determine if the fictitious child node $C_0$ has been selected as a context node. If so then execution continues at step 6020 where $C_0$ is excluded as a context node. The method 6000 subsequently terminates at step 6015. If the test at 6010 fails, then the method 6000 proceeds directly to the termination step 6015.

The idea behind the alternative method 6000 for incorporating the possibility that none of $P_j$'s child nodes are context nodes is essentially identical to that of the first approach. That is, no child nodes are selected when $\Pr[P_j \text{ leaf} \mid Y_i \text{ root} \wedge P_j]$ is sufficiently large. However, the effects of $\Pr[P_j \text{ leaf} \mid Y_i \text{ root} \wedge P_j]$ on the resulting set of context nodes are more gradual in this alternative approach, which is generally more favourable than the abrupt on/off behaviour of the first approach.

For the special scenario where the parent node $P_j$ lies along the directed path from the root node $Y_i$ to the hit node $X$, special considerations must be made to ensure that the child node of $P_j$ that lies along the path from $P_j$ to $X$ is identified as a context node. Without loss of generality, let this child node be $C_1$, as illustrated in Fig. 3A as item 3015. Whilst the method 5000 presented earlier for the general scenario can be modified (for example by inflating the occurrence probability of $C_1$ above those of all other child nodes prior to step 5015), such an approach may not yield correct results. This is because the method 5000 as described has been devised to select a set of the most frequently occurring child nodes as context nodes given the root $Y_i$ and parent $P_j$. If this set does not naturally contain $C_1$, then it basically means that $C_1$ is not related to nodes in the set. Forcefully including $C_1$ would simply result in a set of child nodes that have little in common and provide little context for $C_1$ (and subsequently for $X$).
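Returning briefly to the fictitious-child device of method 6000 described above, its gradual behaviour can be seen in a short Python sketch. The selection helper restates the loop of method 5000 so that the block is self-contained; the sentinel object standing in for $C_0$, the function names and the handling of the $S = T/2$ boundary are assumptions of this example.

```python
from typing import Dict, Hashable, Set

LEAF = object()  # hypothetical sentinel standing in for the fictitious child C_0


def select_top_until_half(q: Dict[Hashable, float]) -> Set[Hashable]:
    """Method-5000-style loop: add top-probability groups until S > T / 2."""
    total, s = sum(q.values()), 0.0
    remaining, selected = dict(q), set()
    while remaining:
        top = max(remaining.values())
        for node in [n for n, p in remaining.items() if p == top]:
            selected.add(node)
            s += remaining.pop(node)
        if not remaining or s > total / 2:
            break
    return selected


def select_with_leaf_option(q_children: Dict[str, float], pr_leaf: float) -> Set[Hashable]:
    """Method-6000-style selection: C_0 competes with the real children."""
    q: Dict[Hashable, float] = dict(q_children)
    q[LEAF] = pr_leaf                      # step 6001: add fictitious child
    selected = select_top_until_half(q)    # step 6006: reuse the selection loop
    selected.discard(LEAF)                 # steps 6010 / 6020: drop C_0 if chosen
    return selected


# Parent mostly seen as a leaf: C_0 alone passes the threshold, no children remain.
print(select_with_leaf_option({"a": 0.3, "b": 0.2}, pr_leaf=0.7))
# Parent rarely seen as a leaf: the strongest real child is selected.
print(select_with_leaf_option({"a": 0.3, "b": 0.2}, pr_leaf=0.05))
```

Intermediate values of `pr_leaf` shift how many real children are needed to pass the threshold, rather than switching selection on or off at a fixed cut-off.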
Instead of modifying the method 5000, a different but somewhat procedurally similar method 7000 illustrated in Fig. 7 is preferably adopted in another implementation. The difference between the new method 7000 and the previous method 5000 lies in the independence assumption used. Recall that the first simplification made in the general case where $P_j$ does not lie along the directed path from $Y_i$ to $X$ was the assumption that

$$\Pr[C_k \mid X \wedge \cdots \wedge Y_i \text{ root} \wedge \cdots \wedge P_j]$$

is independent of nodes other than those from $Y_i$ to $P_j$. In the current scenario where one child node $C_1$ of $P_j$ lies along the path from $X$ to $P_j$, it would not be sensible to assume that $C_k$ is independent of nodes from $P_j$ to $X$ (including $C_1$) as they are necessary ancestors of $X$ that link the hit keyword $X$ to $C_k$. Since some simplifications are necessary to keep the problem tractable, it follows that a better choice is to assume independence between $C_k$ and its ancestors above $P_j$ towards the root node $Y_i$. With this assumption, the probabilities of interest are

$$\Pr[C_k \mid X \wedge \cdots \wedge C_1 \wedge P_j], \qquad k \neq 1$$

Again, since there is at most one directed path linking $X$ and $P_j$, the above expression is equivalent to

$$\Pr[C_k \mid X \wedge P_j] = \frac{\Pr[C_k \wedge X \wedge P_j]}{\Pr[X \wedge P_j]} \qquad \text{(Eq. 11)}$$

The numerator on the right hand side of Eq. 11 cannot be obtained from the occurrence and co-occurrence frequency tables so far mentioned, since it involves three rather than two nodes. An extra joint-occurrence frequency table between 3-tuples of nodes is therefore required. Fortunately, as each of these 3-tuples comprises a pair of parent-child nodes $C_k$ and $P_j$ (rather than any arbitrary pair of nodes), and since each node $C_k$ in practice has only a small number of parents, the new joint-occurrence frequency table would only be slightly larger than a co-occurrence frequency table involving pairs of nodes.
With the new joint-occurrence frequency table, $\Pr[C_k \mid X \wedge P_j]$ can be estimated as

$$\Pr[C_k \mid X \wedge P_j] \approx \frac{\mathrm{freq}(C_k, P_j, X)}{\mathrm{freq}(P_j, X)} \qquad \text{(Eq. 12)}$$

where $\mathrm{freq}(C_k, P_j, X)$ denotes the joint-occurrence frequency between nodes $C_k$, $P_j$ and $X$, $P_j$ is a parent of $C_k$ and an ancestor of $X$, and $C_k$ is neither $X$ nor an ancestor of $X$.

The method 7000 for determining the set of siblings of $C_1$ to be included with $C_1$ as context nodes is very similar to method 5000 already described. The method 7000 begins at step 7001 where the occurrence probability of each child node $C_k \neq C_1$ given the parent node $P_j$ and the hit node $X$, denoted by $Q_k = \Pr[C_k \mid X \wedge P_j]$, is computed using Eq. 12. At the next step 7005, the probabilities $Q_k$ are summed over all child nodes $C_k \neq C_1$, the sum being denoted by $T$. The method 7000 continues at step 7010 where node $C_1$ is selected as a context node, and then subsequently at step 7015 where those nodes $C_k \neq C_1$ with the highest probability value are also selected as context nodes. If more than one child node exists with the same highest probability value then all such nodes are selected as context nodes. The sum of the probabilities of all child nodes so far selected as context nodes excluding $C_1$ is then computed at step 7020, the sum being denoted by $S$. Execution then proceeds to the decision step 7025, at which point if all child nodes $C_k$ have been selected as context nodes then the method 7000 terminates at step 7040. If however there are one or more child nodes $C_k$ not yet selected as context nodes then the method continues to another decision step 7030. At step 7030 a check is made to ascertain whether $S > T/2$ and, if so, the method 7000 again terminates at step 7040. If $S < T/2$ then execution proceeds to step 7035. Here the child nodes $C_k \neq C_1$ not yet selected as context nodes are examined to identify those with the highest probability value among themselves.
These are selected as context nodes and the method returns to step 7020 for further processing.

Some modifications are needed to method 7000 to allow for cases where no siblings of $C_1$ are included in the solution. This is achieved by introducing a sole-child co-occurrence frequency table that stores the frequency with which a node $P_j$ co-occurs with one of its descendants $X$ such that only one child node of $P_j$ ($C_1$, along the path from $P_j$ to $X$) is present in past logged queries and data views. This frequency table is then used to estimate the probability that $C_1$ has no sibling given its parent $P_j$ and the hit node $X$:

$$\Pr[C_1 \text{ no sibling} \mid P_j \wedge X] = \Pr[C_1 \wedge \neg C_k\ \forall k \neq 1 \mid P_j \wedge X] = \Pr[P_j \text{ has 1 child} \mid P_j \wedge X] = \frac{\Pr[P_j \text{ has 1 child} \wedge P_j \wedge X]}{\Pr[P_j \wedge X]} = \frac{\Pr[P_j \text{ has 1 child} \wedge X]}{\Pr[P_j \wedge X]} \approx \frac{\mathrm{freq}(P_j \text{ has 1 child}, X)}{\mathrm{freq}(P_j, X)} \qquad \text{(Eq. 13)}$$

where $\mathrm{freq}(P_j \text{ has 1 child}, X)$ denotes the frequency at which node $P_j$ co-occurs with its descendant $X$ and $P_j$ has a single child node ($C_1$), and is obtained from the new frequency table.

In one implementation, the probability $\Pr[C_1 \text{ no sibling} \mid P_j \wedge X]$ is used in an additional decision step prior to the method 7000 given in Fig. 7 for identifying which child nodes of $P_j$ are context nodes. If $\Pr[C_1 \text{ no sibling} \mid P_j \wedge X]$ is greater than 0.5, that is, if $C_1$ has mostly been observed as the sole child of $P_j$, then no child nodes of $P_j$ other than $C_1$ are selected as context nodes; otherwise method 7000 is performed to identify which child nodes are context nodes.

An alternative implementation is also possible. This employs an alternative method 8000 whose flowchart is given in Fig. 8 for selecting context nodes among a set of child nodes $C_k$, $k = 1, \ldots, m$. The method 8000 begins at step 8001 where a fictitious child node $C_0$ is conceptually created and added to the list of actual child nodes $C_1, \ldots, C_m$ and is assigned a probability value $Q_0 = \Pr[C_1 \text{ no sibling} \mid P_j \wedge X]$ using Eq. 13.
At the next step 8005, the actual child nodes $C_k$ except $C_1$ are assigned their usual probability values $Q_k = \Pr[C_k \mid X \wedge P_j]$ using Eq. 11. The method 8000 then continues at step 8006 by invoking method 7000 at step 7005 (skipping step 7001) to select among the child nodes $C_0, \ldots, C_m$ a set of context nodes. When method 7000 exits, method 8000 resumes at decision step 8010 where a check is made to determine if the fictitious child node $C_0$ has been selected as a context node. If so then execution continues at step 8020 where $C_0$ is excluded as a context node. The method 8000 subsequently terminates at step 8015. If the test at 8010 fails, then the method proceeds directly to the termination step 8015.

The preceding discussion describes two distinct methods 6000 and 8000 for determining from a set of child nodes which are context nodes. Preferably the latter is applied in the scenario where the parent node $P_j$ lies along the directed path from the root node $Y_i$ to the hit element $X$, whilst the former is used for all other parent nodes. In an alternative implementation, the first method 6000 is employed even for the case where $P_j$ lies along the path from $Y_i$ to $X$. If this results in a set of context child nodes that includes $C_1$, then the set is adopted; otherwise the set is discarded and the second method 8000 is applied to determine a new set of context child nodes. The rationale behind this favouring of the first method is that the probability values computed there are conditional on the root element $Y_i$, rather than on the hit node $X$. Tests conducted by the present inventor seem to suggest that the root element of a data view tends to be a better indicator of what nodes are present in the view.

The keyword searching system 4000 disclosed herein is a form of a learning system. From a set of logged queries and existing data views, which are akin to training examples, the system is able to synthesise new views of data. If patterns exist in the logged queries or data views, then they will be reflected in the frequency tables which in turn will affect the behaviour of the system 4000. A desirable feature for any learning system is an ability to make some form of generalisation that allows it to use patterns learned from one set of problems to improve its performance when handling related but as yet unseen problems. One aspect of generalisation that is important in a hierarchical environment is the ability to observe occurrence patterns of certain sub-structures of data and generalise them to other similar or identical sub-structures.

Consider the data structure 9000 shown in Fig. 9, in which there are two identical "Employee" sub-structures 9010 and 9030 (enclosed within the dotted curves), one under "Manager" 9005 and the other under "Project Members" 9025. Suppose that in all logged queries and data views, the sub-elements "FirstName" 9015 and "LastName" 9020 in the first "Employee" sub-tree have always been observed to appear together, whilst no queries or data views containing the second "Employee" sub-tree 9030 have yet been observed.

Suppose further that a keyword search operation for an employee's name is invoked in which a "hit" is found in the "FirstName" sub-element 9035 of the second "Employee" sub-tree 9030, making 9035 the hit node. Even though no example queries or views have been encountered with this sub-element present, it is intuitively apparent that from the occurrence patterns observed for the first "Employee" sub-tree 9010, the sub-element "LastName" 9040 in the second "Employee" sub-tree 9030 should be identified as a context node.

Such a generalisation ability is particularly important when working with XML data since identical data sub-structures often exist at several locations in a data hierarchy (for example, as a result of the use of referenced schema elements). Such generalisation may be realised through probability averaging. Probability averaging works by appropriately averaging the occurrence probabilities of nodes in the schema graph that have identical names or IDs or labels. The application of probability averaging is now described firstly for the first, bottom-up phase of the construction of the context tree, and then subsequently for the second, top-down phase.

Recall that the operation of the first phase relies on the probability values $\Pr[Y_i \mid X]$, where $Y_i$ are ancestors of the hit node $X$. To facilitate probability averaging, $\Pr[Y_i \mid X]$ is preferably first reformulated into an incremental form, as follows. Let $W$ be the child of $Y_i$ that lies along the one and only directed path from $Y_i$ to $X$. $\Pr[Y_i \mid X]$ can then be rewritten as

$$\Pr[Y_i \mid X] = \frac{\Pr[Y_i \wedge X]}{\Pr[X]} = \frac{\Pr[Y_i \wedge W \wedge X]}{\Pr[X]} \quad \text{(the path from } Y_i \text{ to } X \text{ must include } W\text{)}$$

$$= \frac{\Pr[Y_i \mid W \wedge X]\,\Pr[W \wedge X]}{\Pr[X]} = \Pr[Y_i \mid W \wedge X]\,\Pr[W \mid X] \qquad \text{(Eq. 14)}$$

That is, $\Pr[Y_i \mid X]$ can be incrementally obtained from the probability value of its child node $W$, namely $\Pr[W \mid X]$.
The idea is to begin the procedure at the hit node X and make use of the above expression to obtain probability values for successively higher ancestor nodes. At each step, the method of probability averaging is then applied to the first term on the right hand side of Eq. 14. Thus, let $\Pr'[B \mid X]$ denote the modified probability value of some node B as a result of probability averaging; then $\Pr'[Y_i \mid X]$ can be defined by the following recursive formulae:

$$
\Pr'[X \mid X] = 1 \qquad \text{(Eq. 15)}
$$

$$
\Pr'[Y_i \mid X] =
\begin{cases}
0 & \text{if } \Pr'[W \mid X] = 0 \\
\Pr_{mean}[Y_i \mid W \wedge X]\,\Pr'[W \mid X] & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 16)}
$$

where

$$
\begin{aligned}
\Pr_{mean}[Y_i \mid W \wedge X] &= \frac{\sum_k \Pr[Y_{ik} \mid W_k \wedge X_k]\,\Pr[W_k \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \\
&= \frac{\sum_k \Pr[Y_{ik} \wedge W_k \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \\
&= \frac{\sum_k \Pr[Y_{ik} \wedge X_k]}{\sum_k \Pr[W_k \wedge X_k]} \quad \text{(the path from } Y_{ik} \text{ to } X_k \text{ must include } W_k\text{)} \\
&\approx \frac{\sum_k \mathrm{freq}(Y_{ik}, X_k)}{\sum_k \mathrm{freq}(W_k, X_k)} \qquad \text{(Eq. 17)}
\end{aligned}
$$

and denotes the weighted average or mean probability of Yi given W and X computed over all pairs of nodes (Yik, Xk) (for some values of k) that are equivalent to (Yi, X), with X0 and Yi0 (i.e. k = 0) being aliases for X and Yi respectively. For each of these equivalent pairs (Yik, Xk), the term Wk in the summations denotes the immediate child of Yik lying along the directed path from Yik to Xk. A node pair (Yik, Xk) is said to be equivalent to a node pair (Yi, X) if

(i) Yik has the same name or label or ID as Yi and Xk has the same name, label or ID as X,

(ii) there is a direct ancestor-descendant relationship between Yik and Xk and similarly between Yi and X,

(iii) for each node Wk along the directed path from Yik to Xk, there exists a corresponding node W along the directed path from Yi to X such that (Wk, Xk) is equivalent to (W, X) and (Yik, Wk) is equivalent to (Yi, W),

(iv) Yi and Yik have exactly the same number of parents and for each parent Zj of Yi, there exists a parent Zkj of Yik such that (Zkj, Yik) and (Zj, Yi) satisfy conditions (i) to (iii) above.
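The frequency-based estimate of Eq. 17 can be illustrated with a short sketch. The frequency table layout (a dictionary keyed by node pairs), the node identifiers and the function name `mean_probability` are assumptions made for illustration only; the equivalent triples (Yik, Wk, Xk) are assumed to have been found already using conditions (i) to (iv).

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical frequency table: freq[(a, b)] is the number of logged queries
# or data views in which nodes a and b were observed together.
FreqTable = Dict[Tuple[str, str], int]


def mean_probability(
    triples: Iterable[Tuple[str, str, str]], freq: FreqTable
) -> Optional[float]:
    """Estimate Pr_mean[Y | W and X] as in Eq. 17.

    `triples` lists (Y_k, W_k, X_k) for every equivalent pair, including the
    k = 0 triple (Y, W, X) itself. Returns None when the denominator is zero,
    which is the case handled separately by the distance-based fallback.
    """
    triples = list(triples)
    numerator = sum(freq.get((y, x), 0) for y, _w, x in triples)
    denominator = sum(freq.get((w, x), 0) for _y, w, x in triples)
    if denominator == 0:
        return None
    return numerator / denominator


# Two equivalent sub-structures; only illustrative counts are used here.
example_freq = {("Y0", "X0"): 3, ("W0", "X0"): 4, ("Y1", "X1"): 1, ("W1", "X1"): 4}
example_triples = [("Y0", "W0", "X0"), ("Y1", "W1", "X1")]
estimate = mean_probability(example_triples, example_freq)  # (3 + 1) / (4 + 4)
```

Pooling the counts before dividing, rather than averaging per-pair ratios, means that a sub-structure that has never been observed simply contributes zero to both sums and inherits the pattern learned from its equivalents.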
The modified probability that Yi is root given X due to probability averaging is then given by

$$
\Pr'[Y_i\ \text{root} \mid X] = \Pr'[Y_i \mid X] - \sum_j \Pr'[Z_j \mid X] \qquad \text{(Eq. 18)}
$$

where

$$
\Pr'[Z_j \mid X] = \Pr_{mean}[Z_j \mid Y_i \wedge X]\,\Pr'[Y_i \mid X] \qquad \text{(Eq. 19)}
$$

as obtained from Eq. 16 by replacing Yi with Zj and W with Yi.

In the event that the denominator on the right hand side of Eq. 17 is zero, indicating that none of the node pairs (Wk, Xk) has been observed in logged queries and data views, Eq. 17 and hence Eq. 19 and Eq. 18 are undefined and consequently some alternative method for identifying context nodes is needed. A preferred approach is to alternatively define $\Pr_{mean}[Z_j \mid Y_i \wedge X]$ in terms of the distance of Zj from the hit node X as follows:

$$
\Pr_{mean}[Z_j \mid Y_i \wedge X] =
\begin{cases}
\dfrac{\sum_k \mathrm{freq}(Z_{jk}, X_k)}{\sum_k \mathrm{freq}(Y_{ik}, X_k)} & \text{if } \sum_k \mathrm{freq}(Y_{ik}, X_k) \ne 0 \\
1 & \text{if } \sum_k \mathrm{freq}(Y_{ik}, X_k) = 0,\ \mathrm{dist}(Z_j, X) \le d_{max} \\
0 & \text{if } \sum_k \mathrm{freq}(Y_{ik}, X_k) = 0,\ \mathrm{dist}(Z_j, X) > d_{max}
\end{cases}
\qquad \text{(Eq. 20)}
$$

where $d_{max}$ is some threshold constant, and dist(A, B) is the distance between two nodes A and B in the schema graph, defined as the number of links along the path between A and B. In the absence of relevant past logged queries and data views, the distance between two nodes should give a good indication of how they are related to one another since in practice related data are usually stored in proximity to each other.

A flowchart of a method 10000 for computing the probability that an ancestor node Yi of a hit node X is the root node of a context tree with probability averaging, for all ancestor nodes Yi, is shown in Fig. 10. The method 10000 begins at step 10001 with Yi = X and hence $\Pr'[Y_i \mid X] = 1$. At the next step 10005, Eq. 19 and Eq. 20 are used to compute $\Pr'[Z_j \mid X]$ for each parent node Zj of Yi. Subsequent to step 10005, step 10010 computes $\Pr'[Y_i\ \text{root} \mid X]$ according to Eq. 18. Step 10025 then tests whether all parent nodes of Yi have been processed.
If not, the method 10000 proceeds to step 10015 where a parent node Zj of Yi is selected. Upon reaching step 10020, the method 10000 is recursively invoked at step 10005 (skipping step 10001) but with the selected parent node Zj playing the role of Yi. When this invocation returns, the current execution of method 10000 resumes and returns to step 10025 to check for more parent nodes. When all parent nodes have been processed the method 10000 ends at step 10030.

Probability averaging is also applied to the second, top-down traversal phase. In this phase, for the general case in which a parent node Pj does not lie along a directed path from the root node Yi to the hit node X, probability averaging can be applied in the same way as that used in the first phase. The selection of child nodes Ck of Yi for inclusion in the keyword search result as a context node is based on the probabilities

$$
\Pr[C_k \mid Y_i\ \text{root} \wedge P_j]
$$

With probability averaging, the above expression is replaced by a mean probability

$$
\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j] = \frac{\sum_h \Pr[C_{kh} \wedge Y_{ih}\ \text{root}]}{\sum_h \Pr[Y_{ih}\ \text{root} \wedge P_{jh}]} \qquad \text{(Eq. 21)}
$$

where (Yih, Ckh) is equivalent to (Yi, Ck) and (Pjh, Ckh) is equivalent to (Pj, Ck), with Yi0, Ck0 and Pj0 (i.e. h = 0) being aliases for Yi, Ck and Pj respectively. Let Zj denote the parents of Yi, and similarly Zjh the corresponding parents of Yih. The above expression can be expanded to

$$
\begin{aligned}
\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j] &= \frac{\sum_h \left\{ \Pr[C_{kh} \wedge Y_{ih}] - \sum_{Z_{jh}} \Pr[C_{kh} \wedge Z_{jh}] \right\}}{\sum_h \left\{ \Pr[P_{jh} \wedge Y_{ih}] - \sum_{Z_{jh}} \Pr[P_{jh} \wedge Z_{jh}] \right\}} \\
&\approx \frac{\sum_h \left\{ \mathrm{freq}(Y_{ih}, C_{kh}) - \sum_{Z_{jh}} \mathrm{freq}(Z_{jh}, C_{kh}) \right\}}{\sum_h \left\{ \mathrm{freq}(Y_{ih}, P_{jh}) - \sum_{Z_{jh}} \mathrm{freq}(Z_{jh}, P_{jh}) \right\}} \qquad \text{(Eq. 22)}
\end{aligned}
$$

where the inner sums run over the parents Zjh of Yih.

For the above expression to be an accurate approximation of the mean probability $\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j]$, the denominator on the right hand side must be sufficiently large (e.g. greater than some positive constant $f_{min}$). When this is not the case, a remedial method adopted in a preferred implementation is to first approximate $\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j]$ by $\Pr_{mean}[C_k \mid Y_i \wedge P_j]$, where the probability is conditional on $Y_i \wedge P_j$ rather than $Y_i\ \text{root} \wedge P_j$. Thus

$$
\begin{aligned}
\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j] &\approx \Pr_{mean}[C_k \mid Y_i \wedge P_j] \\
&= \frac{\sum_h \Pr[C_{kh} \wedge Y_{ih}]}{\sum_h \Pr[Y_{ih} \wedge P_{jh}]} \\
&\approx \frac{\sum_h \mathrm{freq}(Y_{ih}, C_{kh})}{\sum_h \mathrm{freq}(Y_{ih}, P_{jh})} \qquad \text{(Eq. 23)}
\end{aligned}
$$

If the denominator on the right hand side of Eq. 23 is still not sufficiently large, then $\Pr_{mean}[C_k \mid Y_i \wedge P_j]$ is further approximated by a probability conditioned on W rather than Yi, where W is the immediate child of Yi and an ancestor of Ck. That is

$$
\begin{aligned}
\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j] &\approx \Pr_{mean}[C_k \mid W \wedge P_j] \\
&= \frac{\sum_h \Pr[C_{kh} \wedge W_h]}{\sum_h \Pr[W_h \wedge P_{jh}]} \\
&\approx \frac{\sum_h \mathrm{freq}(W_h, C_{kh})}{\sum_h \mathrm{freq}(W_h, P_{jh})} \qquad \text{(Eq. 24)}
\end{aligned}
$$

The method is repeated further until a sufficiently large value is obtained for the denominator on the right hand side, or if not, until W denotes a parent of Ck. In the latter case $\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j]$ is assigned a value based on the distance between Ck and Yi

$$
\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j] =
\begin{cases}
1 & \text{if } \mathrm{dist}(C_k, Y_i) \le d_{max} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 25)}
$$

or the distance between Ck and the hit node X:

$$
\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j] =
\begin{cases}
1 & \text{if } \mathrm{dist}(C_k, X) \le d_{max} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 26)}
$$

Depending on whether $\Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j]$ is eventually approximated by Eq. 22, Eq. 23, Eq. 24, Eq. 25 or Eq. 26, the mean probability that a parent node Pj has no context child nodes given the root node Yi is computed using Eq. 27, Eq. 28, Eq. 29, Eq. 30 or Eq. 31 respectively:

$$
\begin{aligned}
\Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] &= \frac{\sum_h \Pr[P_{jh}\ \text{leaf} \wedge Y_{ih}\ \text{root}]}{\sum_h \Pr[Y_{ih}\ \text{root} \wedge P_{jh}]} \\
&\approx \frac{\sum_h \left\{ \mathrm{freq}(Y_{ih}, P_{jh}\ \text{leaf}) - \sum_{Z_{jh}} \mathrm{freq}(Z_{jh}, P_{jh}\ \text{leaf}) \right\}}{\sum_h \left\{ \mathrm{freq}(Y_{ih}, P_{jh}) - \sum_{Z_{jh}} \mathrm{freq}(Z_{jh}, P_{jh}) \right\}} \qquad \text{(Eq. 27)}
\end{aligned}
$$

$$
\begin{aligned}
\Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] &\approx \Pr_{mean}[P_j\ \text{leaf} \mid Y_i \wedge P_j] \\
&= \frac{\sum_h \Pr[P_{jh}\ \text{leaf} \wedge Y_{ih}]}{\sum_h \Pr[Y_{ih} \wedge P_{jh}]} \\
&\approx \frac{\sum_h \mathrm{freq}(Y_{ih}, P_{jh}\ \text{leaf})}{\sum_h \mathrm{freq}(Y_{ih}, P_{jh})} \qquad \text{(Eq. 28)}
\end{aligned}
$$

$$
\begin{aligned}
\Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] &\approx \Pr_{mean}[P_j\ \text{leaf} \mid W \wedge P_j] \\
&= \frac{\sum_h \Pr[P_{jh}\ \text{leaf} \wedge W_h]}{\sum_h \Pr[W_h \wedge P_{jh}]} \\
&\approx \frac{\sum_h \mathrm{freq}(W_h, P_{jh}\ \text{leaf})}{\sum_h \mathrm{freq}(W_h, P_{jh})} \qquad \text{(Eq. 29)}
\end{aligned}
$$

$$
\Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] =
\begin{cases}
0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \le d_{max} \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 30)}
$$

$$
\Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j] =
\begin{cases}
0 & \text{if } \mathrm{dist}(P_j, X) + 1 \le d_{max} \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 31)}
$$

A preferred procedure for determining context child nodes for a parent node Pj given a root element Yi, for the general case where Pj does not lie along the directed path from Yi to the hit node X, with probability averaging is similar to that shown in Fig. 6, and is shown in Fig. 13. The method 13000 begins at step 13001 where a fictitious child node C0 is conceptually created and added to the list of actual child nodes C1, ..., Cm and is assigned a probability value $Q_0 = \Pr_{mean}[P_j\ \text{leaf} \mid Y_i\ \text{root} \wedge P_j]$ computed using Eq. 27, Eq. 28, Eq. 29, Eq. 30 or Eq. 31. At the next step 13005, the actual child nodes Ck are correspondingly assigned probability values $Q_k = \Pr_{mean}[C_k \mid Y_i\ \text{root} \wedge P_j]$ computed using Eq. 22, Eq. 23, Eq. 24, Eq. 25 or Eq. 26 respectively. In any case, step 13006 follows step 13005 and invokes the method 5000 at step 5010 (skipping step 5005) to select among the child nodes C0, ..., Cm a set of context nodes. When method 5000 exits, the method 13000 resumes at decision step 13010 where a check is made to determine if the fictitious child node C0 has been selected as a context node. If so then execution continues at step 13020 where C0 is excluded as a context node. The method 13000 subsequently terminates at step 13015. If the test at 13010 fails, then the method proceeds directly to the termination step 13015.

Apart from their use in keyword searching, the methods 13000 and 6000 can also be used as means of selective presentation of hierarchical data. As already discussed, a practical hierarchical data source typically contains much more data than a user may wish to see at any given time. When a user views a hierarchical data source by selecting a node within its data structure, a presentation application typically displays all data items in the sub-tree below the selected node, some of which may often not be of interest to the user.
It would be highly desirable if the presentation application were able to filter out uninteresting data based on some previously observed viewing patterns of the user. The methods 13000 and 6000 as described are well suited for this task. By setting Yi to the root node selected for viewing by the user, the set of context nodes identified by the methods constitutes nodes that are likely to be of interest and are preferably displayed to the user, whilst the remaining nodes not identified as context nodes are preferably filtered out.

For the special case where a parent node Pj lies on the directed path from the root node Yi to the hit node X, recall that the selection of child nodes Ck of Yi for inclusion as context nodes is based on the probabilities

$$
\Pr[C_k \mid X \wedge P_j] = \frac{\Pr[C_k \wedge X \wedge P_j]}{\Pr[X \wedge P_j]} \qquad \text{(Eq. 32)}
$$

With probability averaging these are replaced by a mean probability:

$$
\begin{aligned}
\Pr_{mean}[C_k \mid X \wedge P_j] &= \frac{\sum_h \Pr[C_{kh} \wedge X_h \wedge P_{jh}]}{\sum_h \Pr[X_h \wedge P_{jh}]} \\
&\approx \frac{\sum_h \mathrm{freq}(C_{kh}, X_h)}{\sum_h \mathrm{freq}(P_{jh}, X_h)} \qquad \text{(Eq. 33)}
\end{aligned}
$$

where (Pjh, Ckh) is equivalent to (Pj, Ck) and (Pjh, Xh) is equivalent to (Pj, X), with Pj0, Ck0 and X0 (i.e. h = 0) being aliases for Pj, Ck and X respectively.

For the above expression to be an accurate approximation of the mean probability $\Pr_{mean}[C_k \mid X \wedge P_j]$, the denominator on the right hand side of Eq. 33 must be sufficiently large (e.g. greater than $f_{min}$). When this is not the case, another remedial method that may be used is to approximate $\Pr_{mean}[C_k \mid X \wedge P_j]$ by $\Pr_{mean}[C_k \mid X' \wedge P_j]$, a probability conditioned on X' rather than X, where X' is the immediate parent of X lying on the directed path from Yi to X.

A flowchart of a method 22000 for identifying a node X' used for determining an approximation for $\Pr_{mean}[C_k \mid X \wedge P_j]$ is shown in Fig. 22. The method 22000 begins at step 22005 where X' is first initialised to X. At the next step 22010 the sum $\sum_h \mathrm{freq}(P_{jh}, X'_h)$ is computed and assigned to D, where the node pairs (Pjh, X'h) are equivalent to (Pj, X'). Decision step 22015 then follows and tests if D is greater than or equal to some positive threshold constant $f_{min}$. If so, the method 22000 exits with success at step 22025. If the decision step 22015 fails then execution proceeds to another decision step 22030, where a test is made to determine whether X' is an immediate child of Pj. If so then the method exits with failure at step 22035, otherwise it continues at step 22040 where X' is replaced by its parent lying along the directed path from Pj to X. The method 22000 then loops back to step 22010.

If method 22000 succeeds with a node X' and a corresponding value D, then $\Pr_{mean}[C_k \mid X \wedge P_j]$ is assigned the value

$$
\Pr_{mean}[C_k \mid X \wedge P_j] \approx \frac{\sum_h \mathrm{freq}(C_{kh}, X'_h)}{D} \qquad \text{(Eq. 34)}
$$

In the event that method 22000 exits with failure, $\Pr_{mean}[C_k \mid X \wedge P_j]$ is assigned a value based on the distance between Ck and Yi

$$
\Pr_{mean}[C_k \mid X \wedge P_j] =
\begin{cases}
1 & \text{if } \mathrm{dist}(C_k, Y_i) \le d_{max} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 35)}
$$

or the distance between Ck and the hit node X

$$
\Pr_{mean}[C_k \mid X \wedge P_j] =
\begin{cases}
1 & \text{if } \mathrm{dist}(C_k, X) \le d_{max} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 36)}
$$

Depending on whether $\Pr_{mean}[C_k \mid X \wedge P_j]$ is eventually approximated by Eq. 34, Eq. 35 or Eq. 36, the mean probability that a parent node Pj has no context child nodes other than the child node C1 lying on the directed path from Pj to X, given Pj and the hit node X, is computed using Eq. 37, Eq. 38, or Eq. 39 respectively:

$$
\Pr_{mean}[C_1\ \text{no sibling} \mid P_j \wedge X] \approx \frac{\sum_h \mathrm{freq}(P_{jh}\ \text{has 1 child}, X'_h)}{D} \qquad \text{(Eq. 37)}
$$

$$
\Pr_{mean}[C_1\ \text{no sibling} \mid P_j \wedge X] =
\begin{cases}
0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \le d_{max} \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 38)}
$$

$$
\Pr_{mean}[C_1\ \text{no sibling} \mid P_j \wedge X] =
\begin{cases}
0 & \text{if } \mathrm{dist}(P_j, X) + 1 \le d_{max} \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 39)}
$$

where (Pjh, X'h) are equivalent to (Pj, X'), and X' and D are obtained by method 22000.

A preferred procedure for determining context child nodes for a parent node Pj, for the special case where Pj lies along the directed path from Yi to the hit node X, with probability averaging is very similar to that shown in Fig. 8, and is shown in Fig. 14.

A method 14000 shown in Fig. 14 begins at step 14001 where a fictitious child node C0 is conceptually created and added to the list of actual child nodes C1, ..., Cm and is assigned a probability value $Q_0 = \Pr_{mean}[C_1\ \text{no sibling} \mid P_j \wedge X]$ computed using Eq. 37, Eq. 38 or Eq. 39. At the next step 14005, the actual child nodes Ck except C1 are correspondingly assigned probability values $Q_k = \Pr_{mean}[C_k \mid X \wedge P_j]$ computed using Eq. 34, Eq. 35 or Eq. 36 respectively. In any case, step 14006 follows step 14005 and invokes method 7000 at step 7005 (skipping step 7001) to select among the child nodes C0, ..., Cm a set of context nodes. When the method 7000 exits, the method 14000 resumes at decision step 14010 where a check is made to determine if the fictitious child node C0 has been selected as a context node. If so then execution continues at step 14020 where C0 is excluded as a context node. The method 14000 subsequently terminates at step 14015. If the test at 14010 fails, then the method proceeds directly to the termination step 14015.
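The search for the conditioning node X' performed by method 22000 can be sketched as a simple walk up the path from X towards Pj. This is only an illustrative sketch: the function name, the representation of the path as a list, and the `equivalent_pairs` callback (assumed to return every pair (Pjh, X'h) equivalent to (Pj, X'), including the pair itself) are hypothetical and are not part of the described method.

```python
from typing import Callable, Dict, List, Optional, Tuple

FreqTable = Dict[Tuple[str, str], int]


def find_conditioning_node(
    path_below_parent: List[str],
    equivalent_pairs: Callable[[str], List[Tuple[str, str]]],
    freq: FreqTable,
    f_min: int,
) -> Optional[Tuple[str, int]]:
    """Return (X', D) on success, or None on failure.

    `path_below_parent` holds the nodes on the directed path from Pj to X,
    excluding Pj itself, ordered from the immediate child of Pj down to X.
    """
    # Step 22005: start with X' = X, then move one parent up per iteration
    # (step 22040) until the immediate child of Pj has been tried (step 22030).
    for candidate in reversed(path_below_parent):
        # Step 22010: D is the pooled frequency over all equivalent pairs.
        d = sum(freq.get(pair, 0) for pair in equivalent_pairs(candidate))
        # Step 22015: accept the candidate once D reaches the threshold.
        if d >= f_min:
            return candidate, d
    return None  # Step 22035: no node on the path has enough observations.


# Illustrative data: a single (non-duplicated) sub-structure under parent "p".
demo_freq = {("p", "x"): 1, ("p", "mid"): 5, ("p", "c1"): 9}
demo_path = ["c1", "mid", "x"]
demo_pairs = lambda node: [("p", node)]
result = find_conditioning_node(demo_path, demo_pairs, demo_freq, f_min=3)
```

On success the returned D is the denominator of Eq. 34 and Eq. 37; on failure the caller would fall back to the distance-based values of Eq. 35 to Eq. 39.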
The preceding discussion describes methods for identifying context nodes in the special case where there is at most a single hit node in the schema graph. This is a usual scenario when the user enters only a single search keyword. In the event that the keyword appears in multiple locations in the schema graph, signifying that there is more than one hit, each hit is preferably treated separately. That is, the methods as described are applied for a first hit node in the schema graph and a plurality of context trees are determined for the hit node. The same methods are then subsequently applied for each of the remaining hit nodes to obtain a new plurality of context trees, and so on. When all hit nodes have been processed, the generated context trees may be re-scored if they are found to encompass multiple hit nodes, and in addition duplicated context trees are removed. The list of the remaining context trees is then reordered according to their new scores (if any) and returned to the user as the result of the keyword search operation.

If however the user initiates a 'find all' keyword search operation involving multiple search keywords combined with a Boolean AND operation, then keyword hits can potentially appear in two or more hit nodes in the schema graph. A more general method for determining context trees is now described for handling such a scenario.

Fig. 11 shows an example of a schema graph 11000 within which there are multiple hit nodes 11010, 11020 and 11025. Let these hit nodes be denoted by X1, ..., Xn. Naturally, for a context tree to include all hit nodes, the root node of the smallest sub-tree containing all hit nodes, denoted by A (node 11005), must be returned as a context node, as well as all nodes lying along the directed path from A to each hit node. Thus node 11015 must be a context node since it lies along the directed path from A to X2 (and from A to X3).

The first, bottom-up phase of the context tree determination method begins at node A and traverses upwards. Let Yi be A or an ancestor of A, whose probability given the hit nodes, $\Pr[Y_i \mid X_1 \wedge \cdots \wedge X_n]$, needs to be evaluated in order to determine the possible root node of a context tree. Expressed mathematically

$$
\Pr[Y_i \mid X_1 \wedge \cdots \wedge X_n] = \frac{\Pr[Y_i \wedge X_1 \wedge \cdots \wedge X_n]}{\Pr[X_1 \wedge \cdots \wedge X_n]} \qquad \text{(Eq. 40)}
$$

At this point some independence assumptions are necessary since neither the numerator nor the denominator on the right hand side of Eq. 40 can be obtained or estimated directly from the existing frequency tables for a general value of n (except for the denominator when $n \le 2$). A plausible assumption is that the Xl are independent of one another given a common ancestor Yi. In other words:

$$
\Pr[X_1 \wedge \cdots \wedge X_n \mid Y_i] = \Pr[X_1 \mid Y_i] \cdots \Pr[X_n \mid Y_i] \qquad \text{(Eq. 41)}
$$

Thus

$$
\begin{aligned}
\Pr[Y_i \wedge X_1 \wedge \cdots \wedge X_n] &= \Pr[X_1 \wedge \cdots \wedge X_n \mid Y_i]\,\Pr[Y_i] \\
&= \Pr[X_1 \mid Y_i] \cdots \Pr[X_n \mid Y_i]\,\Pr[Y_i] \\
&= \frac{\Pr[X_1 \wedge Y_i] \cdots \Pr[X_n \wedge Y_i]}{\Pr[Y_i]^{\,n-1}} \qquad \text{(Eq. 42)}
\end{aligned}
$$

In order to remove the singularity when $\Pr[Y_i] = 0$, $\Pr[Y_i \wedge X_1 \wedge \cdots \wedge X_n]$ is redefined as

$$
\Pr[Y_i \wedge X_1 \wedge \cdots \wedge X_n] =
\begin{cases}
0 & \text{if } \Pr[Y_i] = 0 \\
\dfrac{\Pr[X_1 \wedge Y_i] \cdots \Pr[X_n \wedge Y_i]}{\Pr[Y_i]^{\,n-1}} & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 43)}
$$

Similarly

$$
\begin{aligned}
\Pr[X_1 \wedge \cdots \wedge X_n] &= \Pr[A \wedge X_1 \wedge \cdots \wedge X_n] \\
&= \Pr[X_1 \wedge \cdots \wedge X_n \mid A]\,\Pr[A] \\
&= \Pr[X_1 \mid A] \cdots \Pr[X_n \mid A]\,\Pr[A] \\
&= \frac{\Pr[X_1 \wedge A] \cdots \Pr[X_n \wedge A]}{\Pr[A]^{\,n-1}} \qquad \text{(Eq. 44)}
\end{aligned}
$$

As in the case where there is only a single hit node, the occurrence probability of Yi given all hit nodes is preferably expressed incrementally in terms of the probability of its immediate child node to facilitate probability averaging, as follows. Let W denote the immediate child node of Yi along the directed path from Yi to A; then

$$
\Pr'[Y_i \mid X_1 \wedge \cdots \wedge X_n] =
\begin{cases}
1 & \text{if } Y_i = A \\
0 & \text{if } Y_i \ne A,\ \Pr'[W \mid X_1 \wedge \cdots \wedge X_n] = 0 \\
\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n]\,\Pr'[W \mid X_1 \wedge \cdots \wedge X_n] & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 45)}
$$

where

$$
\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n] = \frac{\sum_h \Pr[Y_{ih} \wedge X_{1h} \wedge \cdots \wedge X_{nh}]}{\sum_h \Pr[W_h \wedge X_{1h} \wedge \cdots \wedge X_{nh}]} \qquad \text{(Eq. 46)}
$$

where the pairs (Yih, Wh) are equivalent to (Yi, W), and (Wh, Xlh) are equivalent to (W, Xl) for l = 1, ..., n, and Yi0, W0 and Xl0 (h = 0) are aliases for Yi, W and Xl respectively. The term inside the summation in the numerator can be substituted by Eq. 43. The term inside the summation in the denominator can also be substituted by Eq. 43 by letting W play the role of Yi, thus resulting in

$$
\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n] \approx \sum_h N_h \Big/ \sum_h D_h \qquad \text{(Eq. 47)}
$$

where

$$
\begin{aligned}
N_h &=
\begin{cases}
0 & \text{if } \Pr[Y_{ih}] = 0 \\
\dfrac{\Pr[Y_{ih} \wedge X_{1h}] \cdots \Pr[Y_{ih} \wedge X_{nh}]}{\Pr[Y_{ih}]^{\,n-1}} & \text{otherwise}
\end{cases} \\
&\approx
\begin{cases}
0 & \text{if } \mathrm{freq}(Y_{ih}) = 0 \\
\dfrac{\mathrm{freq}(Y_{ih}, X_{1h}) \cdots \mathrm{freq}(Y_{ih}, X_{nh})}{\mathrm{freq}(Y_{ih})^{\,n-1}} & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 48)}
\end{aligned}
$$

$$
\begin{aligned}
D_h &=
\begin{cases}
0 & \text{if } \Pr[W_h] = 0 \\
\dfrac{\Pr[W_h \wedge X_{1h}] \cdots \Pr[W_h \wedge X_{nh}]}{\Pr[W_h]^{\,n-1}} & \text{otherwise}
\end{cases} \\
&\approx
\begin{cases}
0 & \text{if } \mathrm{freq}(W_h) = 0 \\
\dfrac{\mathrm{freq}(W_h, X_{1h}) \cdots \mathrm{freq}(W_h, X_{nh})}{\mathrm{freq}(W_h)^{\,n-1}} & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 49)}
\end{aligned}
$$

$\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n]$ is undefined if $\sum_h D_h = 0$. When this occurs $\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n]$ is preferably assigned a value based on the distances from Yi to the hit nodes X1, ..., Xn as follows

$$
\Pr_{mean}[Y_i \mid W \wedge X_1 \wedge \cdots \wedge X_n] =
\begin{cases}
1 & \text{if } \min_{l=1,\ldots,n} \mathrm{dist}(Y_i, X_l) \le d_{max} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 50)}
$$

A flowchart of a method 12000 for computing the probability that a node Yi is the root node of a context tree containing all hit nodes, for all choices of Yi, is shown in Fig. 12. The method 12000 begins at step 12001 where the root node of the smallest sub-tree in the schema graph that contains all hit nodes is identified and denoted as A. Execution then continues at step 12002 where Yi is initialised to A and consequently $\Pr'[Y_i \mid X_1 \wedge \cdots \wedge X_n] = 1$. At the next step 12005, Eq. 45 and Eq. 47 together with Eq. 48 and Eq. 49, or alternatively Eq. 50, are used to compute $\Pr'[Z_j \mid X_1 \wedge \cdots \wedge X_n]$ for each parent node Zj of Yi. Following step 12005, step 12010 computes $\Pr'[Y_i\ \text{root} \mid X_1 \wedge \cdots \wedge X_n]$ according to the equation

$$
\Pr'[Y_i\ \text{root} \mid X_1 \wedge \cdots \wedge X_n] = \Pr'[Y_i \mid X_1 \wedge \cdots \wedge X_n] - \sum_j \Pr'[Z_j \mid X_1 \wedge \cdots \wedge X_n] \qquad \text{(Eq. 51)}
$$

The method 12000 then proceeds to step 12015 where a parent node Zj of Yi is selected.
Upon reaching step 12020, the method 12000 is recursively invoked at step 12005 (skipping steps 12001 and 12002) but with the selected parent node Zj playing the role of Yi. When the recursive invocation returns, execution resumes at decision step 12025 where a test is made to determine whether all parent nodes of Yi have been processed. If so, the method ends at step 12030, otherwise it continues at step 12015 where another parent node Zj of Yi is selected for processing.

In the second, top-down traversal phase, for the general case where a parent node Pj does not lie along the directed path from the root node Yi to any hit node, the method for determining whether a child node of Pj is a context node remains unchanged from the method 13000 used previously for the case where there is only a single hit node.

For the special case where the parent node Pj lies along the directed path from Yi to one or more hit nodes, it is necessary to modify the method used for the single hit node case to allow for the possibility that more than one child node of Pj must be included as context nodes. Recall that for the case involving only one hit node X, the determination of whether a child node Ck is a context node is based on the probability value

$$
\Pr[C_k \mid X \wedge P_j]
$$

where X is a descendant of Pj but is not Ck or a descendant of Ck. An extension when there is more than one hit node is to base the selection process of child nodes Ck, k = 1, ..., m on the probabilities of Ck given the parent node Pj and all hit nodes Xl that are descendants of Pj, whilst ignoring the effects of hit nodes that are not descendants of Pj. Naturally, all child nodes Ck that are themselves hit nodes or are ancestors of one or more hit nodes must be context nodes. Without loss of generality, let these child nodes be C1, ..., Cr where $1 \le r \le n$. Similarly let the set of hit nodes that are descendants of C1, ..., Cr be X1, ..., Xs where $r \le s \le n$.
If s = 1 (and hence r = 1) then this scenario is equivalent to the case where there is only a single hit node, and hence the method 14000 described for this case can be used. A method adopted in a preferred implementation for generalising to the case s > 1 is to replace the term $\Pr[C_k \mid X \wedge P_j]$ with the expression

$$
\sum_{l=1}^{s} \Pr[C_k \mid X_l \wedge P_j]
$$

which becomes

$$
Q_k = \sum_{l=1}^{s} \Pr_{mean}[C_k \mid X_l \wedge P_j] \qquad \text{(Eq. 52)}
$$

after probability averaging, where $\Pr_{mean}[C_k \mid X_l \wedge P_j]$ is as defined in Eq. 33. Qk is undefined if $\mathrm{freq}_{mean}(P_j, X_l) = 0$ for any Xl, l = 1, ..., s, where

$$
\mathrm{freq}_{mean}(P_j, X_l) = \sum_h \mathrm{freq}(P_{jh}, X_{lh})
$$

and (Pjh, Xlh) are equivalent to (Pj, Xl). Even when $\mathrm{freq}_{mean}(P_j, X_l)$ is non-zero but a small number (e.g. less than $f_{min}$), Qk cannot be estimated from the frequency tables with sufficient accuracy. As in the case involving a single hit node, this problem can be overcome by replacing the hit nodes X1, ..., Xs by a new set of nodes S in which some or all of the hit nodes are replaced by their ancestors, after which Qk is redefined in terms of elements of S.

A method 21000 depicted by the flowchart of Fig. 21 is preferably used to determine this new set of nodes S that replaces the hit nodes X1, ..., Xs. The method 21000 begins at step 21005 where the initial set of hit nodes X1, ..., Xs is denoted by S. At the next step 21010 an unprocessed element Xl of S is selected. Decision step 21015 then follows in which a check is made to determine if $\mathrm{freq}_{mean}(P_j, X_l)$ is greater than or equal to some threshold constant $f_{min}$. If so then Xl is retained in the set S and the method 21000 continues to decision step 21020 where a check is made to determine if all elements in S have been processed. If one or more unprocessed elements remain then execution returns to step 21010 to select another element of S for processing. If on the other hand all elements have been processed, then the method ends at step 21025 with success.

Returning now to decision step 21015, if the test condition fails then another decision step 21030 follows, which tests if the selected node Xl is a child node of Pj. If it is then the method 21000 ends at step 21040 with failure, otherwise step 21035 follows.

At step 21035, the element Xl in S is replaced by its parent X'l that lies along the directed path from Pj to Xl, and all descendants of X'l are removed from S. Execution then proceeds to step 21020.

If method 21000 as described above returns with success, then the elements in the resulting set S are used to compute a value Qk for each remaining child node Ck, $r < k \le m$:

$$
Q_k = \sum_{X \in S} \Pr_{mean}[C_k \mid X \wedge P_j] \qquad \text{(Eq. 53)}
$$

If however method 21000 returns with failure, then the value Qk is preferably determined from the distance between Ck and the root node Yi:

$$
Q_k =
\begin{cases}
1 & \text{if } \mathrm{dist}(C_k, Y_i) \le d_{max} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 54)}
$$

or alternatively the distances between Ck and the hit nodes X1, ..., Xn:

$$
Q_k =
\begin{cases}
1 & \text{if } \min_{l=1,\ldots,n} \mathrm{dist}(C_k, X_l) \le d_{max} \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 55)}
$$

Recall also that in the case where there is a single hit node X, it is necessary to evaluate the probability that given a parent node Pj and a hit node X, only one child node C1 of Pj occurs, where C1 is X or is an ancestor of X:

$$
\Pr[C_1\ \text{no sibling} \mid X \wedge P_j]
$$

In generalising this quantity to the present scenario, two possibilities can arise, namely the special case where r = 1, and the more general case r > 1. An example of the former is shown in Fig. 15 where there are three hit nodes 15030, 15035 and 15040 located within the sub-tree rooted at node 15005. However, all three hit nodes reside under a single child node 15010 of node 15005. An approach for handling this special case is to replace the term

$$
\Pr[C_1\ \text{no sibling} \mid X \wedge P_j]
$$

with the expression

$$
Q_0 = \sum_{l=1}^{s} \Pr_{mean}[C_1\ \text{no sibling} \mid X_l \wedge P_j] \qquad \text{(Eq. 56)}
$$

in an analogous fashion to the use of the quantity Qk in Eq. 52. As in the case of Qk above, there is a possibility that Eq. 56 is undefined when $\mathrm{freq}_{mean}(P_j, X_l) = 0$ for any Xl, l = 1, ..., s. Consequently, the actual value assigned to Q0 is based on the set S obtained from the method 21000, if the method 21000 returns with success:

$$
Q_0 = \sum_{X \in S} \Pr_{mean}[C_1\ \text{no sibling} \mid X \wedge P_j] \qquad \text{(Eq. 57)}
$$

Otherwise if the method 21000 fails, then Q0 is assigned a value based on the distance of Pj from the root node Yi:

$$
Q_0 =
\begin{cases}
0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \le d_{max} \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 58)}
$$

or alternatively the distances between Pj and the hit nodes X1, ..., Xn:

$$
Q_0 =
\begin{cases}
0 & \text{if } \min_{l=1,\ldots,n} \mathrm{dist}(P_j, X_l) + 1 \le d_{max} \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 59)}
$$

A method 16000 depicted in the flowchart of Fig. 16 for identifying context nodes among the set of child nodes Ck of a parent node Pj for the case r = 1, s > 1 is very similar to the method 14000 for the single hit node case. The method 16000 begins at step 16001 where a fictitious child node C0 is conceptually created and added to the list of actual child nodes C1, ..., Cm and is assigned a value Q0 defined in Eq. 57, Eq. 58, or Eq. 59. At the next step 16005, the actual child nodes Ck except C1 are assigned values Qk correspondingly defined in Eq. 53, Eq. 54, or Eq. 55 respectively. Step 16006 follows step 16005 and invokes the method 7000 at step 7005 (skipping step 7001) to select among the child nodes C0, ..., Cm a set of context nodes. When the method 7000 exits, the method 16000 resumes at decision step 16010 where a check is made to determine if the fictitious child node C0 has been selected as a context node. If so then execution continues at step 16020 where C0 is excluded as a context node. The method 16000 subsequently terminates at step 16015. If the test at 16010 fails, then the method 16000 proceeds directly to the termination step 16015.

For the general case where r > 1 (and hence s > 1), an analogous quantity to $\Pr[C_1\ \text{no sibling} \mid X_1 \wedge \cdots \wedge X_s \wedge P_j]$ used in the case r = 1 is

$$
\sum_{X \in S} \Pr[C_1 \wedge \cdots \wedge C_r \wedge \neg C_{r+1} \wedge \cdots \wedge \neg C_m \mid X \wedge P_j]
$$

where S is the set returned by the method 21000 if it exits with success. Unfortunately the probability in the summation cannot be easily estimated from the existing frequency tables. Consequently, a slightly different expression is used in its place. Let the elements of set S that are not located in the sub-tree rooted at each child node Ck, $1 \le k \le r$, be denoted by $H_{k,l}$ for $1 \le l \le s_k$, where $s_k \le |S|$. For each child node Ck, $1 \le k \le r$, the following is computed:

$$
Q_k = \sum_{l=1}^{s_k} \Pr_{mean}[C_k \mid H_{k,l} \wedge P_j] \qquad \text{(Eq. 60)}
$$

The rationale behind the above expression is that when summed together over all Ck, $1 \le k \le r$, a quantity approximating the probability of child nodes C1, ..., Cr occurring together is obtained (although not a true probability since it can take on a value greater than 1). As in the case r = 1, if method 21000 returns with failure then Qk is obtained from the distance of Pj from the root node Yi, for $1 \le k \le r$:

$$
Q_k =
\begin{cases}
0 & \text{if } \mathrm{dist}(P_j, Y_i) + 1 \le d_{max} \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 61)}
$$

or alternatively the distances between Pj and the hit nodes X1, ..., Xn:

$$
Q_k =
\begin{cases}
0 & \text{if } \min_{l=1,\ldots,n} \mathrm{dist}(P_j, X_l) + 1 \le d_{max} \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(Eq. 62)}
$$

A method 17000 for selecting context nodes among the set of child nodes C1, ..., Cm is now described for the general case r > 1, s > 1, with reference to the flowchart of Fig. 17. Method 17000 begins at step 17001 where each child node Ck, $1 \le k \le r$, is assigned a value Qk computed using Eq. 60, Eq. 61, or Eq. 62. Step 17005 follows in which the remaining child nodes are assigned values Qk correspondingly computed using Eq. 53, Eq. 54, or Eq. 55 respectively. At the next step 17010, the values Qk are summed over all child nodes and the sum is denoted by T. The method 17000 continues at step 17015 where all child nodes containing hit nodes in their sub-trees, namely Ck, $1 \le k \le r$, are selected as context nodes.
At the next step 17020, the nodes Ck with the highest assigned value among the remaining child nodes are also selected as context nodes. If more than one child node exists with the same highest value then all such nodes are selected as context nodes. The sum of the assigned values of all child nodes so far selected as context nodes is then computed at step 17025 and denoted by S (a running total, distinct from the set S of method 21000). Execution then proceeds to the decision step 17030, at which point if all child nodes Ck have been selected as context nodes then the method 17000 terminates at step 17040. If however there are one or more child nodes Ck not yet selected as context nodes then the method 17000 continues to another decision step 17035. At step 17035 a check is made to ascertain whether S > T/2 and if so the method 17000 again terminates at step 17040. Otherwise execution returns to step 17020 where more nodes are selected as context nodes.
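The selection loop of method 17000 can be sketched as follows. This is an illustrative reading of steps 17010 to 17040 only: the function name, the dictionary `q` holding the already computed values Qk, and the set `hit_children` are assumptions, and the computation of the Qk values themselves (Eq. 53 to Eq. 55 and Eq. 60 to Eq. 62) is taken as given.

```python
from typing import Dict, Set


def select_context_children(q: Dict[str, float], hit_children: Set[str]) -> Set[str]:
    """Select context nodes among the children of Pj (steps 17010 to 17040).

    `q` maps each child node to its assigned value Q_k; `hit_children` holds
    the children whose sub-trees contain hit nodes (C_1 .. C_r) and is
    assumed to be a subset of the keys of `q`.
    """
    total = sum(q.values())          # step 17010: T
    selected = set(hit_children)     # step 17015: hit-bearing children always kept
    while True:
        remaining = {c: v for c, v in q.items() if c not in selected}
        if remaining:
            # Step 17020: take every remaining child tied for the highest value.
            best = max(remaining.values())
            selected |= {c for c, v in remaining.items() if v == best}
        running = sum(q[c] for c in selected)   # step 17025: S
        # Step 17030: stop when every child is selected;
        # step 17035: stop once the selection carries more than half of T.
        if len(selected) == len(q) or running > total / 2:
            return selected


demo_q = {"c1": 0.9, "c2": 0.8, "c3": 0.5, "c4": 0.1, "c5": 0.1}
demo_selected = select_context_children(demo_q, {"c1", "c2"})
```

Note that, following the flowchart order described above, step 17020 runs at least once before the T/2 test, so the best-scoring child without a hit is included even when the hit-bearing children alone already exceed half of T.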

The preceding descriptions present various methods for handling different stages and operating scenarios encountered when performing keyword searching in hierarchical data structures. These methods are incorporated into a single overall procedure 18000 which elaborates on step 2010 of Fig. 2, illustrated by the flowchart in Fig. 18, and which comprises sub-procedures 19000 and 20000 shown in Fig. 19 and Fig. 20, respectively. The method 18000 begins at decision step 18005 where a check is made to determine whether there are multiple hit nodes in the schema graph. If so then execution proceeds to step 18015 where the method 20000 is invoked, otherwise it proceeds to step 18010 where the method 19000 is invoked. In either case, the method 20000 or 19000 returns with a list of context trees, each having an associated score. The following is a detailed description of the method 19000, followed by that of the method 20000.

The method 19000 begins at step 19001 where the method 10000 is invoked to determine a list of possible root nodes Yi that are ancestor nodes of the hit node X. Each Yi is the root node of a possible context tree. The method 10000 also computes a value $S_i = \Pr'[Y_i \mid X]$ for each node Yi. The method 19000 then continues at step 19005 where a node Yi determined in the previous step is selected for processing. At the next step 19010, method 38000, which is a sub-process within method 19000, is invoked to identify context nodes in the sub-tree rooted at node Yi. Method 19000 then continues at step 19030, where a context tree is constructed comprising all identified context nodes and with Yi as the root node. The tree is assigned the score Si computed at step 19001. The method 19000 then proceeds to decision step 19035. If all nodes Yi obtained at step 19001 have been processed, then the method ends at step 19040, otherwise it returns to step 19005 to process another node Yi.
The method 38000 invoked within method 19000 begins at step 38010, where node Yi is first assigned to Pj. Execution proceeds to the decision step 38015 and then to step 38020 if Pj does not lie on the directed path from Yi to the hit node X. At step 38020, the method 13000 is invoked to select among the child nodes of Pj a set of context nodes. At the subsequent step 38025, the method 38000 is recursively invoked at step 38020 (skipping steps 38010 and 38015) for each non-leaf child node Ck selected as context node, with Ck playing the role of Pj, in order to identify additional context nodes among its descendants. When the invocations for all such child nodes return, method 38000 terminates at step 38040. Method 38000 also proceeds directly to the termination step 38040 if Pj has no child nodes, or if none of its non-leaf child nodes have been selected as context nodes at step 38020.

The decision step 38015 succeeds if Pj lies on the directed path from Yi to X, in which case execution proceeds to step 38045. Here the method 14000 is invoked to select among the child nodes of Pj a set of context nodes, with C1 denoting the child node lying on the directed path from Pj to X. At the subsequent step 38050, method 38000 is recursively invoked at step 38015 (skipping step 38010) for each non-leaf child node Ck selected as context node, with Ck playing the role of Pj, in order to identify additional context nodes among its descendants. When the invocations for all such child nodes return, method 38000 terminates at step 38040.

The method 20000 begins at step 20001, where the method 12000 is invoked to determine a list of possible root nodes Yi that are ancestor nodes of the hit nodes X1, ..., Xn. Each Yi is the root node of a possible context tree. The method 12000 also computes a value $S_i = \Pr'[Y_i \mid X_1 \wedge \cdots \wedge X_n]$ for each node Yi.
The method 20000 then continues at step 20005, where a node Yi determined in the previous step is selected for processing. At the next step 20010, method 39000, which is a sub-process within method 20000, is invoked to identify context nodes in the subtree rooted at node Yi. Method 20000 then continues at step 20060, where a context tree is constructed comprising all identified context nodes and with Yi as the root node. The tree is assigned the score Si computed at step 20001. The method 20000 then proceeds to decision step 20065. If all nodes Yi obtained at step 20001 have been processed, then the method ends at step 20070; otherwise it returns to step 20005 to process another node Yi.

The method 39000 invoked within method 20000 begins at step 39010, where node Yi is first assigned to Pj. Execution proceeds to the decision step 39015 and then to step 39020 if there are no hit nodes in the sub-tree rooted at Pj. At step 39020, the method 13000 is invoked to select among the child nodes of Pj a set of context nodes. At the subsequent step 39025, the method 39000 is recursively invoked at step 39020 (skipping steps 39010 and 39015) for each non-leaf child node Ck selected as context node, with Ck playing the role of Pj, in order to identify additional context nodes among its descendants. When the invocations for all such child nodes return, method 39000 terminates at step 39060. Method 39000 also proceeds directly to the termination step 39060 if Pj has no child nodes, or if none of its non-leaf child nodes have been selected as context nodes at step 39020.

The decision step 39015 succeeds if there are one or more hit nodes within the sub-tree rooted at Pj, in which case execution proceeds to another decision step 39030. If there is only a single hit node in the sub-tree under Pj, then this decision step fails and execution proceeds to step 39035; otherwise it continues to yet another decision step 39040.
At decision step 39040, a test is made to determine whether all hit nodes under Pj are located under only one of its child nodes. If so, then execution proceeds to step 39045; otherwise it proceeds to step 39050. At step 39050, with C1, C2, ... denoting the child nodes of Pj under which one or more hit nodes reside, the method 17000 is invoked to select among the child nodes of Pj a set of context nodes. If, however, decision step 39040 leads to step 39045, then the method 16000 is invoked to select among the child nodes of Pj a set of context nodes, with C1 being the sole child node of Pj that contains hit nodes in its sub-tree.

Returning now to step 39035, let the path from Pj to its one and only descendant hit node pass through its child node C1. The method 14000 is invoked to select among the child nodes of Pj a set of context nodes.

At the completion of each of steps 39035, 39045 and 39050, the method 39000 recursively invokes itself at step 39015 (skipping step 39010) for each non-leaf child node Ck selected as context node, with Ck playing the role of Pj, in order to identify additional context nodes among its descendants. When the invocations for all such child nodes return, method 39000 terminates at step 39060.

Illustrative Example

The operation of a preferred implementation is now demonstrated with an example hierarchical XML data source below. The XML source comprises data relating to a company named "XYZ", such as its web addresses, branch names and locations, and its range of sales products at each branch. A schema graph representation of the XML data is shown in Fig. 23.
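As an aside before the example data is listed, the decision logic of method 39000 described above can be condensed into a short Python sketch. The sketch is illustrative only and rests on several assumptions: the schema graph is given as a hypothetical `children` mapping from a node to its list of child nodes, hit nodes are leaves, and the selection methods 13000, 14000, 16000 and 17000 are supplied as callables in a `selectors` dictionary whose keys and signatures are invented here.

```python
def subtree_hits(node, children, hits):
    """Hit nodes located in the sub-tree rooted at `node` (inclusive)."""
    found = {node} if node in hits else set()
    for child in children.get(node, []):
        found |= subtree_hits(child, children, hits)
    return found


def identify_context_nodes(p, children, hits, selectors):
    """Context nodes found below node p, following steps 39015 to 39060."""
    kids = children.get(p, [])
    if not kids:
        return set()                       # no child nodes: terminate
    # Group the hit nodes by the child of p whose sub-tree contains them.
    hits_by_child = {c: subtree_hits(c, children, hits) for c in kids}
    hits_by_child = {c: h for c, h in hits_by_child.items() if h}
    n_hits = sum(len(h) for h in hits_by_child.values())

    if n_hits == 0:                        # step 39020
        chosen = selectors["13000"](p, kids)
    elif n_hits == 1:                      # step 39035: single hit, via child C1
        (c1,) = hits_by_child
        chosen = selectors["14000"](p, kids, c1)
    elif len(hits_by_child) == 1:          # step 39045: all hits under one child
        (c1,) = hits_by_child
        chosen = selectors["16000"](p, kids, c1)
    else:                                  # step 39050: hits under several children
        chosen = selectors["17000"](p, kids, list(hits_by_child))

    found = set(chosen)
    for c in chosen:
        if children.get(c):                # recurse into non-leaf context nodes only
            found |= identify_context_nodes(c, children, hits, selectors)
    return found
```

The recursion re-evaluates the hit count at every level, which mirrors the re-entry at step 39015 for each selected non-leaf child node.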
XML SOURCE

<company>
  <name>XYZ</name>
  <web>http://www.xyz.com</web>
  <description>
    Company founded in 1999 specialising in hi-tech consumers electronics
  </description>
  <branch>
    <name>North Ryde</name>
    <phone>0291230000</phone>
    <address>
      <number>1</number>
      <street>Lane Cove</street>
      <city>Sydney</city>
      <country>Australia</country>
    </address>
    <manager>
      <firstName>Jim</firstName>
      <lastName>Smith</lastName>
      <email>jsmith@xyz.com</email>
    </manager>
    <product>
      <id>1</id>
      <name>Plasma TV</name>
      <price>$10000</price>
      <supplier>JEC</supplier>
      <stock>10</stock>
    </product>
    <product>
      <id>2</id>
      <name>Mp3 player</name>
      <price>$500</price>
      <supplier>HG</supplier>
      <stock>20</stock>
    </product>
  </branch>
  <branch>
    <name>Morley</name>
    <phone>0891230000</phone>
    <address>
      <number>1</number>
      <street>Russel</street>
      <city>Perth</city>
      <country>Australia</country>
    </address>
    <manager>
      <firstName>Ted</firstName>
      <lastName>White</lastName>
      <email>twhite@xyz.com</email>
    </manager>
    <product>
      <id>3</id>
      <name>Video phone</name>
      <price>$2000</price>
      <supplier>NVC</supplier>
      <stock>15</stock>
    </product>
    <product>
      <id>4</id>
      <name>PDA</name>
      <price>$1000</price>
      <supplier>LP</supplier>
      <stock>50</stock>
    </product>
  </branch>
</company>

In Fig. 23, the integer shown next to each node is a unique ID number assigned to the node. Suppose that there are three existing views of this data source. The first is a view displaying the company's name, description and web address. The second is a listing of the company's branches and their locations, and the third view lists the line of products at each branch. Schema graph representations of these views are shown in Fig. 24, Fig. 25 and Fig. 26 respectively.
As a result of these views, the occurrence 27000, co-occurrence 28000, leaf co-occurrence 29000, and sole child co-occurrence 30000 frequency tables are as shown in Fig. 27, Fig. 28, Fig. 29 and Fig. 30 respectively. The joint-occurrence frequency table, being three-dimensional, is depicted by five separate two-dimensional tables 31000, 32000, 33000, 34000 and 35000. Fig. 31 comprises entries freq(Ck, Pj, X) in the table with Pj = node 1. Similarly Fig. 32, Fig. 33, Fig. 34 and Fig. 35 each comprise entries with Pj = node 3, node 8, node 9, and node 10 respectively. In all frequency tables shown, an empty cell such as Item 28005, as seen in Fig. 28, denotes an invalid node combination whose associated frequency is not required to be stored.

Suppose that a user wishes to locate a particular product in the city where the user resides. The user enters the product's name, "Mp3 player", and the name of the city, "Sydney", and performs a keyword search for both names. As seen from Fig. 23, this results in two hit nodes X1 = node 19 and X2 = node 13. To determine possible context trees for the keyword search operation, the system 4000 invokes method 18000 of Fig. 18. Since there is more than one hit node, the method 18000 subsequently invokes the method 20000 at step 18015. The method 20000 in turn invokes the method 12000 at step 20001 to obtain a list of nodes Yi to serve as root nodes of the resulting context trees.

The method 12000 first identifies at step 12001 node 3 as the root node of the smallest sub-tree containing both hit nodes X1 and X2. Thus A = node 3. The method 12000 then begins a recursive procedure to compute an occurrence probability value for each of A and its ancestors, given the hit nodes. At node A,

$$\Pr'[A \mid X_1 \wedge X_2] = 1$$

At Y1 = node 1, the parent of node A, using Eq. 47, Eq. 48 and Eq. 49,

$$\Pr_{\text{mean}}[\text{node 1} \mid A \wedge X_1 \wedge X_2] = \frac{\text{freq}(\text{node 1}, \text{node 19})\,\text{freq}(\text{node 1}, \text{node 13})\,\text{freq}(\text{node 3})}{\text{freq}(\text{node 3}, \text{node 19})\,\text{freq}(A, \text{node 13})\,\text{freq}(\text{node 1})} = 0$$

Thus

$$\Pr'[\text{node 1} \mid X_1 \wedge X_2] = 0$$

and

$$\Pr'[A\ \text{root} \mid X_1 \wedge X_2] = 1$$

Consequently the method 12000 exits with node A as a single candidate root node for a context tree. This context tree is assigned a score of 1. After the completion of the method 12000, the method 20000 continues with the second, top-down traversal phase, where descendants of the root node Yi = A are processed to identify context nodes among them. This phase begins at step 20010, where Pj is first set to be node 3. Since this node is an ancestor of the hit nodes X1 and X2, which are located under two distinct child nodes, execution proceeds eventually to step 39050 of method 39000, where the method 17000 is invoked to determine context nodes among its children. The values Q1, ..., Q5 assigned to the child nodes 6 - 10 respectively of node 3 due to method 17000 are as follows:

$$Q_1 = \Pr_{\text{mean}}[\text{node 6} \mid X_1 \wedge P_j] + \Pr_{\text{mean}}[\text{node 6} \mid X_2 \wedge P_j] = \frac{\text{freq}(\text{node 6}, \text{node 3}, \text{node 19})}{\text{freq}(\text{node 3}, \text{node 19})} + \frac{\text{freq}(\text{node 6}, \text{node 3}, \text{node 13})}{\text{freq}(\text{node 3}, \text{node 13})} = \frac{1}{1} + \frac{1}{1} = 2$$

$$Q_2 = \Pr_{\text{mean}}[\text{node 7} \mid X_1 \wedge P_j] + \Pr_{\text{mean}}[\text{node 7} \mid X_2 \wedge P_j] = \frac{\text{freq}(\text{node 7}, \text{node 3}, \text{node 19})}{\text{freq}(\text{node 3}, \text{node 19})} + \frac{\text{freq}(\text{node 7}, \text{node 3}, \text{node 13})}{\text{freq}(\text{node 3}, \text{node 13})} = 2$$

$$Q_3 = \Pr_{\text{mean}}[\text{node 8} \mid X_1 \wedge P_j] = \frac{\text{freq}(\text{node 8}, \text{node 3}, \text{node 19})}{\text{freq}(\text{node 3}, \text{node 19})} = 0$$

$$Q_4 = \Pr_{\text{mean}}[\text{node 9} \mid X_1 \wedge P_j] + \Pr_{\text{mean}}[\text{node 9} \mid X_2 \wedge P_j] = \frac{\text{freq}(\text{node 9}, \text{node 3}, \text{node 19})}{\text{freq}(\text{node 3}, \text{node 19})} + \frac{\text{freq}(\text{node 9}, \text{node 3}, \text{node 13})}{\text{freq}(\text{node 3}, \text{node 13})} = 1$$

$$Q_5 = \Pr_{\text{mean}}[\text{node 10} \mid X_2 \wedge P_j] = \frac{\text{freq}(\text{node 10}, \text{node 3}, \text{node 13})}{\text{freq}(\text{node 3}, \text{node 13})} = 0$$

Thus the set Q1, ..., Q5 sorted in descending order is {Q1, Q2, Q4, Q3, Q5} and sums to T = 5.
The set of context nodes selected by the method 17000 thus comprises node 6 and node 7 (since Q1 + Q2 > T/2), and node 8 and node 10 (since they are ancestors of hit nodes). Resuming at step 39055, the method 39000 then recursively invokes itself to identify context child nodes for each of the selected nodes that have children.

For Pj = node 8, execution proceeds to step 39035 since node 8 has a single descendant hit node (node 13), at which point method 14000 is invoked to identify context nodes among the set of child nodes 11 - 14. The probability values Q1, Q2, and Q4 assigned to the child nodes 11, 12, and 14 respectively of Pj due to the method 14000 are as follows:

$$Q_1 = \Pr_{\text{mean}}[\text{node 11} \mid X_2 \wedge P_j] = \frac{\text{freq}(\text{node 11}, \text{node 8}, \text{node 13})}{\text{freq}(\text{node 8}, \text{node 13})} = 1$$

$$Q_2 = \Pr_{\text{mean}}[\text{node 12} \mid X_2 \wedge P_j] = \frac{\text{freq}(\text{node 12}, \text{node 8}, \text{node 13})}{\text{freq}(\text{node 8}, \text{node 13})} = 1$$

$$Q_4 = \Pr_{\text{mean}}[\text{node 14} \mid X_2 \wedge P_j] = \frac{\text{freq}(\text{node 14}, \text{node 8}, \text{node 13})}{\text{freq}(\text{node 8}, \text{node 13})} = 1$$

In addition, the method 14000 also computes a value Q0 for a fictitious child node C0:

$$Q_0 = \Pr_{\text{mean}}[\text{node 13 no sibling} \mid X_2 \wedge P_j] = \frac{\text{freq}(\text{node 8 has 1 child}, \text{node 13})}{\text{freq}(\text{node 8}, \text{node 13})} = 0$$

Thus the set of probability values sorted in descending order is {Q1, Q2, Q4, Q0} and sums to T = 3. The set of context nodes selected by the method 14000 thus comprises node 11, node 12, node 14 (since Q1 + Q2 + Q4 > T/2 and Q1 = Q2 = Q4), and node 13 (since it is an ancestor of a hit node).

A similar execution path is followed for the case Pj = node 10, with similar results being obtained. The set of context child nodes of node 10 are nodes 18 - 22. The schema graph 3600 of the context tree is thus as shown in Fig. 36, comprising the hit nodes 19 and 13, and context nodes 3, 6 - 8, 10 - 14, 18 - 22.
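The role of the fictitious child node C0 in the selection at node 8 can be made concrete with a small Python sketch. This is a hedged illustration, not the specified method 14000: the function and parameter names are invented, the child on the path to the hit node is assumed to be always retained, tied values are taken together as in the worked example, and the stopping test uses the strict inequality S > T/2 from the example (the claims speak of a sum that "equals or exceeds" half of the total).

```python
FICTITIOUS = "C0"   # placeholder label for the fictitious child node


def select_with_fictitious(values, q0, on_path):
    """Select context children of a node that has one descendant hit node.

    values  -- dict mapping each off-path child to its probability value
    q0      -- value assigned to the fictitious node C0
    on_path -- the child lying on the path to the hit node (always kept)
    """
    candidates = dict(values)
    candidates[FICTITIOUS] = q0
    total = sum(candidates.values())            # T
    chosen = set()
    s = 0.0
    while len(chosen) < len(candidates):
        remaining = {c: v for c, v in candidates.items() if c not in chosen}
        best = max(remaining.values())
        tied = [c for c, v in remaining.items() if v == best]
        chosen.update(tied)                     # take all tied nodes together
        s += best * len(tied)
        if s > total / 2:                       # enough probability mass covered
            break
    chosen.discard(FICTITIOUS)                  # C0 is never a real context node
    return chosen | {on_path}


# Values from the example at Pj = node 8: Q1 = Q2 = Q4 = 1 and Q0 = 0.
print(sorted(select_with_fictitious({11: 1, 12: 1, 14: 1}, 0, on_path=13)))
```

When Q0 is the largest value, the fictitious node is picked first and can satisfy the stopping test on its own, so no off-path child is selected; this is the blocking effect that claim 9 attributes to the fictitious node.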
The actual context tree returned to the user, comprising data items represented by these nodes, is as follows:

<branch>
  <name>North Ryde</name>
  <phone>0291230000</phone>
  <address>
    <number>1</number>
    <street>Lane Cove</street>
    <city>Sydney</city>
    <country>Australia</country>
  </address>
  <product>
    <id>1</id>
    <name>Plasma TV</name>
    <price>$10000</price>
    <supplier>JEC</supplier>
    <stock>10</stock>
  </product>
  <product>
    <id>2</id>
    <name>Mp3 player</name>
    <price>$500</price>
    <supplier>HG</supplier>
    <stock>20</stock>
  </product>
</branch>
<branch>
  <name>Morley</name>
  <phone>0891230000</phone>
  <address>
    <number>1</number>
    <street>Russel</street>
    <city>Perth</city>
    <country>Australia</country>
  </address>
  <product>
    <id>3</id>
    <name>Video phone</name>
    <price>$2000</price>
    <supplier>NVC</supplier>
    <stock>15</stock>
  </product>
  <product>
    <id>4</id>
    <name>PDA</name>
    <price>$1000</price>
    <supplier>LP</supplier>
    <stock>50</stock>
  </product>
</branch>

Industrial Applicability

It is apparent from the above that the arrangements described are applicable to the computer and data processing industries, and particularly in respect of presenting information from multiple searches.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

(Australia Only) In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.

Claims (22)

1. A method of presenting data from a hierarchical data source, said method comprising the steps of:
(i) constructing a first view of the hierarchical data source;
(ii) obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source;
(iii) identifying a compulsory entity in the first view;
(iv) selecting a context entity from the first view and the context data based on the occurrence probability; and
(v) presenting a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising a plurality of context data, wherein each of the plurality of context data corresponds to the identified compulsory entity and the selected context entity,
wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and the context entity is selected from the group consisting of:
(a) the ancestor node;
(b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity;
(c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node;
(d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and
(e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
2. A method according to claim 1 wherein the hierarchical data source comprises a schema representation of said at least one data source and at least one previous view of the hierarchical data source.
3. A method according to claim 1 wherein said context data comprises data ranked according to relevance of said context entities to said compulsory entity.
4. A method according to claim 3 wherein said context data comprises at least one associated data.
5. A method according to claim 4 wherein said associated data comprises occurrence probability and a plurality of joint-occurrence frequencies of entities in said hierarchical data source observed in a previous view of the hierarchical data source.
6. A method according to claim 1 wherein said second set of nodes comprises one or more child nodes of at least one parent node in said first view of the hierarchical data source lying along said directed path from said ancestor node to said compulsory entity.
7. A method according to claim 1 wherein said corresponding distance comprises a number of links separating the nodes in said first view of the hierarchical data source.
8. A method according to claim 6 wherein, step (iv) comprises selecting said child nodes as context entities from all child nodes of said parent node, said selecting comprising the steps of:
(iv-a) computing a first occurrence probability of said parent node appearing with none of its child nodes other than a fifth set of nodes, given an occurrence probability of said parent node, the ancestor node and said compulsory entity, said fifth set comprising at least one child node of said parent node lying along a directed path from said parent node to said compulsory entity;
(iv-b) computing a second occurrence probability of each of said child nodes in a sixth set of nodes, given the occurrence probability of said parent node, the ancestor and said compulsory entity, said sixth set comprising at least one child node of said parent node that do not lie along a directed path from said parent node to said compulsory entity;
(iv-c) computing a total sum of said first occurrence probability and said second occurrence probability;
(iv-d) creating a fictitious node and assigning said fictitious node said first occurrence probability;
(iv-e) selecting the fifth set of nodes or a seventh set of nodes as a set of context entities wherein the seventh set of child nodes is formed from said sixth set of child nodes and said fictitious node arranged in an order of descending values of said first occurrence probability or said second occurrence probability, and wherein a sum of said first occurrence probability or said second occurrence probabilities of said seventh set of nodes equals or exceeds half of said total sum; and
(iv-f) deselecting as a context entity said fictitious node if said fictitious node is selected in said seventh set of child nodes,
wherein said first occurrence probability and said second occurrence probability are approximated using an occurrence probability of a node in said hierarchical data source, a co-occurrence probability between a pair of nodes in said hierarchical data source, and joint-occurrence probability between an n-tuple of nodes in said hierarchical data source observed in said previous view.
9. A method according to claim 8 wherein said fictitious node prevents other nodes, whose associated probabilities are less than the probability associated with the fictitious node, from being selected, since a set of nodes are selected as a set of context entities when the total sum exceeds half of the total sum.
10. A method according to claim 6 wherein, step (iv) comprises selecting said child nodes as context entities from all child nodes of said parent node, said selecting comprising the steps of:
(iv-a) computing a first occurrence probability of said parent node appearing with none of its child nodes other than a fifth set of nodes, given an occurrence probability of said parent node, the ancestor node and said compulsory entity, said fifth set comprising at least one child node of said parent node lying along a directed path from said parent node to said compulsory entity;
(iv-b) selecting said fifth set of child nodes as a set of context entities; and if said first occurrence probability is less than or equal to 0.5;
(iv-c) computing a second occurrence probability of each of said child nodes in a sixth set of nodes, given the occurrence probability of said parent node, the ancestor node and said compulsory entity, said sixth set comprising at least one child node of said parent node that do not lie along a directed path from said parent node to said compulsory entity;
(iv-d) computing a total sum of said second occurrence probabilities of said second set of child nodes;
(iv-e) selecting as the set of context entities a seventh set of nodes formed from said sixth set of nodes in an order of descending values of said second occurrence probability until the sum of said second occurrence probability of said seventh set of child nodes equals or exceeds half of said total sum,
wherein said first occurrence probability and said second occurrence probability are approximated using an occurrence probability of a node in said hierarchical data structure, co-occurrence probability between a pair of nodes in said hierarchical data structure, and joint-occurrence probability between an n-tuple of nodes in said hierarchical data structure observed in said previous view.
11. A method according to claim 1 wherein said second set of nodes comprises one or more child nodes of at least one parent node in said first view of the hierarchical data source not lying along said directed path from said ancestor node to said compulsory entity.
12. A method according to claim 11 wherein, step (iv) comprises selecting said child nodes as a set of context entities from all child nodes of said parent node, said selecting comprising the steps of:
(iv-a) computing a first occurrence probability of said parent node appearing without any of its child nodes given the occurrence probability of said parent node, the ancestor node and said compulsory entity;
(iv-b) computing a second occurrence probability of each of said child nodes of said parent node given the occurrence probability of said parent node the ancestor node and said compulsory entity;
(iv-c) computing a total sum of said first occurrence probability and said second occurrence probability of all child nodes of said parent node;
(iv-d) creating a fictitious node and assigning said fictitious node said first occurrence probability;
(iv-e) selecting the set of context entities from a set of said fictitious node and all child nodes of said parent node arranged in order of descending values of said first occurrence probability or said second occurrence probabilities until the sum of said first occurrence probability or said second occurrence probability of selected nodes equals or exceeds half of said total sum; and
(iv-f) deselecting said fictitious node as a context entity if said fictitious node is among said selected nodes,
wherein said first occurrence probability and said second occurrence probability are approximated using an occurrence probability of a node in said hierarchical data source, a co-occurrence probability between a pair of nodes in said hierarchical data source representation, and a joint-occurrence probability between an n-tuple of nodes in said hierarchical data source observed in said previous view.
13. A method according to claim 11 wherein, step (iv) comprises selecting said child nodes as a set of context entities from all child nodes of said parent node, said selecting comprising the steps of:
(iv-a) computing a first occurrence probability of said parent node appearing without any of its child nodes given the occurrence probability of said parent node, the ancestor node and said compulsory entity; and if said first occurrence probability is less than or equal to 0.5;
(iv-b) computing a second occurrence probability of each of the child nodes of said parent node given the occurrence probability of said parent node the ancestor node and said compulsory entity;
(iv-c) computing a total sum of said second occurrence probabilities of all child nodes of said parent node, and
(iv-d) selecting the set of context entities from the set of all child nodes of said parent node in order of descending values of said second occurrence probability until the sum of said second occurrence probability of selected nodes equals or exceeds half of said total sum,
wherein said first occurrence probability and said second occurrence probability are approximated using an occurrence probability of a node in said hierarchical data source, a co-occurrence probability between a pair of nodes in said hierarchical data source, and a joint-occurrence probability between an n-tuple of nodes in said hierarchical data source observed in at least one said previous view.
14. A method according to claim 1 wherein said compulsory entity represents one of:
(i) a location of one or more search keywords; and
(ii) a user-selected entity.
15. A method according to claim 1 wherein said first view of the hierarchical data source comprises a tree representation and step (i) or (iii) includes detecting a user's selection of a sub-tree of said first view, and wherein, step (iv) comprises selecting a child node of a parent node in said user-selected sub-tree, said selecting comprising the steps of:
(iv-a) computing a first occurrence probability of said parent node appearing without any of its child nodes given the occurrence probability of said parent node, and the ancestor node of said user-selected sub-tree;
(iv-b) computing a second occurrence probability of each of said child nodes of said parent node given the occurrence probability of said parent node, and the ancestor node of said user-selected sub-tree;
(iv-c) computing a total sum of said first occurrence probability and said second occurrence probability of all child nodes of said parent node;
(iv-d) creating a fictitious node and assigning said fictitious node said first occurrence probability;
(iv-e) selecting the context entity from the set of said fictitious node and all child nodes of said parent node in order of descending values of said first occurrence probability or said second occurrence probability until the sum of said first occurrence probability or said second occurrence probability of selected nodes equals or exceeds half of said total sum; and
(iv-f) deselecting said fictitious node if said fictitious node is among said selected nodes.
16. A method according to claim 1 wherein said first view of the hierarchical data source comprises a tree representation and step (i) or (iii) includes detecting a user's selection of a sub-tree of said first view, and wherein step (iv) comprises selecting a child node of a parent node in said user-selected sub-tree, said selecting comprising the steps of:
(iv-a) computing a first occurrence probability of said parent node appearing without any of its child nodes given the occurrence probability of said parent node, and the ancestor node of said user-selected sub-tree; if said first occurrence probability is less than or equal to 0.5;
(iv-b) computing a second occurrence probability of each of said child node of said parent node given the occurrence probability of said parent node, and the ancestor of said user-selected sub-tree;
(iv-c) computing a total sum of said second occurrence probability of all child nodes of said parent node; and
(iv-d) selecting the context entity from the set of all child nodes of said parent node in order of descending values of said second occurrence probability until the sum of said second occurrence probability of selected nodes equals or exceeds half of said total sum.
17. A method of construction and presentation of data for a keyword searching operation in a hierarchical data source involving search keyword, said method comprising the steps of:
(i) constructing a graphical first view of the hierarchical data source;
(ii) identifying a compulsory entity in said graphical first view, wherein said compulsory entity is a node in said graphical first view representing a location of said search keyword;
(iii) obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source;
(iv) constructing a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source comprising said compulsory entity and one or more context entities corresponding to the search keyword, wherein said context entities are obtained from said graphical first view using the context data and the occurrence probability; and
(v) presenting said hierarchical data structure as a result of said keyword searching operation;
wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and the context entity is selected from the group consisting of:
(a) the ancestor node;
(b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity;
(c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node;
(d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and
(e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
18. A computer readable storage medium, having a computer-executable program recorded thereon, wherein the program is configured to make a computer execute a procedure to present data from a hierarchical data source, said program comprising:
(i) code for constructing a first view of the hierarchical data source;
(ii) code for obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source;
(iii) code for identifying a compulsory entity in the first view;
(iv) code for selecting one context entity from the first view and the context data based on the occurrence probability; and
(v) code for presenting a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising a plurality of context data, wherein each of the plurality of context data corresponds to the identified compulsory entity and the selected context entity;
wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and the context entity is selected from the group consisting of:
(a) the ancestor node;
(b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity;
(c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node;
(d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and
(e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
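The five groups (a) to (e) from which a context entity may be drawn can be illustrated with a small sketch. This is not the claimed implementation: the `Node` class, the `select_context` function, the three threshold parameters and the example tree are all hypothetical, node names are assumed to be unique, and the per-node `prob` values merely stand in for occurrence probabilities obtained elsewhere.

```python
from typing import Iterator, List, Optional, Tuple


class Node:
    """One node of the first view; `prob` is an assumed occurrence probability."""

    def __init__(self, name: str, prob: float = 0.0) -> None:
        self.name = name
        self.prob = prob
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = []

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def path_from(ancestor: Node, target: Node) -> List[Node]:
    """Groups (a) and (b): the ancestor and the nodes on the path down to target."""
    path: List[Node] = []
    node: Optional[Node] = target
    while node is not None:
        path.append(node)
        if node is ancestor:
            return list(reversed(path))
        node = node.parent
    raise ValueError("target is not a descendant of ancestor")


def descendants_with_depth(ancestor: Node) -> Iterator[Tuple[Node, int]]:
    """Yield every proper descendant with its distance (in edges) from ancestor."""
    stack = [(child, 1) for child in ancestor.children]
    while stack:
        node, depth = stack.pop()
        yield node, depth
        stack.extend((child, depth + 1) for child in node.children)


def tree_distance(a: Node, b: Node) -> int:
    """Number of edges between two nodes of the same tree."""
    depth_of = {}
    node, d = a, 0
    while node is not None:
        depth_of[id(node)] = d
        node, d = node.parent, d + 1
    node, d = b, 0
    while node is not None:
        if id(node) in depth_of:
            return d + depth_of[id(node)]
        node, d = node.parent, d + 1
    raise ValueError("nodes are in different trees")


def select_context(ancestor: Node, compulsory: Node, min_prob: float,
                   max_dist_ancestor: int, max_dist_compulsory: int) -> List[str]:
    """Return the sorted names of all nodes chosen by groups (a) to (e)."""
    selected = {n.name for n in path_from(ancestor, compulsory)}  # (a), (b)
    for node, depth in descendants_with_depth(ancestor):
        if (node.prob >= min_prob                                          # (c)
                or depth <= max_dist_ancestor                              # (d)
                or tree_distance(node, compulsory) <= max_dist_compulsory):  # (e)
            selected.add(node.name)
    return sorted(selected)


# Hypothetical example tree; "title" plays the role of the compulsory entity.
library = Node("library", 1.0)
book = library.add(Node("book", 0.9))
title = book.add(Node("title", 0.8))
author = book.add(Node("author", 0.7))
isbn = book.add(Node("isbn", 0.1))
staff = library.add(Node("staff", 0.2))
payroll = staff.add(Node("payroll", 0.05))

print(select_context(library, title, 0.5, 0, 0))
```

The sketch takes the union of all five groups, whereas the claim only requires that the context entity come from one of them; an implementation could equally apply a single criterion.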
19. A computer readable storage medium, having a computer-executable program recorded thereon, wherein the program is configured to make a computer execute a procedure to construct and present data for a keyword searching operation in a hierarchical data source involving a search keyword, said program comprising:
(i) code for constructing a first view of the hierarchical data source;
(ii) code for identifying a compulsory entity in said first view, wherein said compulsory entity is a node in said first view representing a location of said search keyword;
(iii) obtaining an occurrence probability of at least one context data from at least the first view of the hierarchical data source;
(iv) code for constructing a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising said compulsory entity and one or more context entities, wherein said context entities are obtained from said first view using the context data and the occurrence probability; and
(v) code for presenting said hierarchical data structure as a result of said keyword searching operation,
wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and
the context entity is selected from the group consisting of:
(a) the ancestor node;
(b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity;
(c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node;
(d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and
(e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
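The keyword-search flow of claim 19 (locate the node holding the keyword, pick an ancestor, build a sub-structure, score it) might look roughly like the following sketch. Everything named here is an assumption for illustration: the nested-dict tree layout, the `keyword_views` function, and in particular the `prob` field, which simply stands in for the occurrence probability of a node given the context data. The claim does not state how that value is computed, and choosing the highest-probability ancestor is one possible policy, not the claimed one.

```python
from typing import List, Tuple


def keyword_views(root: dict, keyword: str) -> List[Tuple[float, List[str]]]:
    """Return (score, node-name path) pairs, one per keyword hit, best first.

    Each hit is treated as a compulsory entity. The view is the path from
    the highest-probability ancestor down to the hit, and the score is that
    ancestor's assumed probability.
    """
    results: List[Tuple[float, List[str]]] = []
    needle = keyword.lower()

    def visit(node: dict, path: List[dict]) -> None:
        path = path + [node]
        if needle in node.get("text", "").lower():
            ancestors = path[:-1]
            if ancestors:
                best = max(range(len(ancestors)),
                           key=lambda i: ancestors[i]["prob"])
                score = ancestors[best]["prob"]
                names = [n["name"] for n in path[best:]]
            else:
                # The hit is the root itself, so there is no ancestor to use.
                score, names = node["prob"], [node["name"]]
            results.append((score, names))
        for child in node.get("children", []):
            visit(child, path)

    visit(root, [])
    results.sort(key=lambda r: r[0], reverse=True)  # stable: ties keep tree order
    return results


# Hypothetical hierarchical data source.
catalog = {
    "name": "catalog", "prob": 0.2, "children": [
        {"name": "camera", "prob": 0.9, "children": [
            {"name": "model", "prob": 0.6, "text": "EOS 300D"},
            {"name": "review", "prob": 0.4, "children": [
                {"name": "body", "prob": 0.3, "text": "The EOS handles well"},
            ]},
        ]},
        {"name": "printer", "prob": 0.5, "children": [
            {"name": "model", "prob": 0.6, "text": "PIXMA"},
        ]},
    ],
}

for score, names in keyword_views(catalog, "eos"):
    print(score, " > ".join(names))
```

Only the path nodes of group (b) are included in each view here; nodes from groups (c) to (e) could be added to the returned structure in the same way.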
20. Computer apparatus for constructing at least one data structure from a hierarchical data source, said apparatus comprising:
a constructing module configured to construct a first view of said hierarchical data source;
an obtaining module configured to obtain an occurrence probability of at least one data element from at least the first view of the hierarchical data source;
an identifying module configured to identify compulsory entity in the first view;
a selecting module configured to select a context entity from the first view and the context data based on the occurrence probability; and
a presenting module configured to present a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source, comprising a plurality of context data, wherein each of the plurality of context data corresponds to the identified compulsory entity and the selected context entity;
wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity; and
the context entity is selected from the group consisting of:
(a) the ancestor node;
(b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity;
(c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node;
(d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and
(e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
21. Computer apparatus for construction and presentation of data for a keyword searching operation in a hierarchical data source involving a search keyword, said apparatus comprising:
a constructing module configured to construct a first view of the hierarchical data source;
an identifying module configured to identify a compulsory entity in said first view, wherein said compulsory entity is a node in said first view representing a location of search keyword;
an obtaining module configured to obtain an occurrence probability of at least one context data from at least the first view of the hierarchical data source;
a determining module configured to select a context entity from said first view and the occurrence probability context data obtained;
a constructing module configured to construct a hierarchical data structure, wherein the hierarchical data structure is a subset of the hierarchical data source comprising said compulsory entity and said context entity; and
a presenting module configured to present said hierarchical data structure comprising said compulsory entity and said context entity as a result of said keyword searching operation,
wherein the hierarchical data structure is assigned a score equal to an occurrence probability of an ancestor node of the compulsory entity given the occurrence probability of the context data associated with the compulsory entity, and
the context entity is selected from the group consisting of:
(a) the ancestor node;
(b) a first set of nodes along a directed path in the hierarchical data source from the ancestor node to the compulsory entity;
(c) a second set of nodes selected from a descendent node of the ancestor node in the first view, each of the second set of nodes being selected based on a corresponding occurrence probability, said occurrence probability being derived from the occurrence probability of the ancestor node;
(d) a third set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the third set of nodes to the ancestor node in the first view; and
(e) a fourth set of nodes selected from a descendent node of the ancestor node in the first view based on a corresponding distance from each of the fourth set of nodes to the compulsory entity in the first view.
22. A method according to claim 2 wherein said schema representation is updated as at least one new query is logged.
DATED this sixteenth Day of March, 2010
CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
Spruson & Ferguson
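As a rough illustration of claim 22, where the schema representation is updated whenever a new query is logged, the sketch below keeps running counts per schema node and per pair of nodes and derives probabilities from them on demand. The `QueryLog` class and its methods are hypothetical, and the relative-frequency estimate is an assumption: the claim does not say how occurrence probabilities follow from the log.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable


class QueryLog:
    """Running counts of which schema nodes logged queries have referenced."""

    def __init__(self) -> None:
        self.total = 0
        self.single: Counter = Counter()  # node name -> number of queries
        self.pair: Counter = Counter()    # (name_a, name_b) -> number of queries

    def log(self, nodes: Iterable[str]) -> None:
        """Record one query as the set of schema node names it referenced."""
        names = sorted(set(nodes))
        self.total += 1
        self.single.update(names)
        # Pairs are stored in sorted order so lookups are order-independent.
        self.pair.update(combinations(names, 2))

    def prob(self, node: str) -> float:
        """Assumed estimate of P(node): fraction of queries referencing it."""
        return self.single[node] / self.total if self.total else 0.0

    def cond_prob(self, node: str, given: str) -> float:
        """Assumed estimate of P(node | given) from co-occurrence counts."""
        if self.single[given] == 0:
            return 0.0
        if node == given:
            return 1.0
        a, b = sorted((node, given))
        return self.pair[(a, b)] / self.single[given]


# Hypothetical usage: every logged query immediately changes the estimates.
log = QueryLog()
log.log(["book", "title"])
log.log(["book", "author"])
log.log(["staff"])
log.log(["book", "title", "author"])
print(log.prob("book"), log.cond_prob("book", "title"))
```

Storing counts rather than probabilities keeps each update cheap, since logging a query touches only the nodes that query mentions.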
AU2004292680A 2003-11-28 2004-11-26 Method of constructing preferred views of hierarchical data Ceased AU2004292680B2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
AU2003906611 2003-11-28
AU2003906611A AU2003906611A0 (en) 2003-11-28 Method for Constructing Preferred views of Hierarchical Data
AU2004292680A AU2004292680B2 (en) 2003-11-28 2004-11-26 Method of constructing preferred views of hierarchical data
PCT/AU2004/001676 WO2005052810A1 (en) 2003-11-28 2004-11-26 Method of constructing preferred views of hierarchical data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2004292680A AU2004292680B2 (en) 2003-11-28 2004-11-26 Method of constructing preferred views of hierarchical data

Publications (2)

Publication Number Publication Date
AU2004292680A1 AU2004292680A1 (en) 2005-06-09
AU2004292680B2 AU2004292680B2 (en) 2010-04-22

Family

ID=34624266

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2004292680A Ceased AU2004292680B2 (en) 2003-11-28 2004-11-26 Method of constructing preferred views of hierarchical data

Country Status (4)

Country Link
US (1) US7664727B2 (en)
JP (1) JP4637113B2 (en)
AU (1) AU2004292680B2 (en)
WO (1) WO2005052810A1 (en)

Families Citing this family (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050278308A1 (en) * 2004-06-01 2005-12-15 Barstow James F Methods and systems for data integration
US7548926B2 (en) * 2005-10-05 2009-06-16 Microsoft Corporation High performance navigator for parsing inputs of a message
US9047379B2 (en) 2006-06-12 2015-06-02 Zalag Corporation Methods and apparatuses for searching content
US8489574B2 (en) * 2006-06-12 2013-07-16 Zalag Corporation Methods and apparatuses for searching content
US8140511B2 (en) * 2006-06-12 2012-03-20 Zalag Corporation Methods and apparatuses for searching content
JP4801555B2 (en) * 2006-09-29 2011-10-26 株式会社ジャストシステム Document processing apparatus, document processing method, and document processing program
TWI330022B (en) * 2006-11-06 2010-09-01 Inst Information Industry Method and computer program product for a new node joining a peer to peer network and computer readable medium and the network thereof
US7702620B2 (en) * 2007-03-29 2010-04-20 International Business Machines Corporation System and method for ranked keyword search on graphs
US8825700B2 (en) * 2008-05-26 2014-09-02 Microsoft Corporation Paging hierarchical data
US8478712B2 (en) * 2008-11-20 2013-07-02 Motorola Solutions, Inc. Method and apparatus to facilitate using a hierarchical task model with respect to corresponding end users
EP2293642B1 (en) * 2009-01-26 2012-08-22 Panasonic Corporation Relay apparatus, control method, and program
US8943045B2 (en) * 2009-01-28 2015-01-27 Oracle International Corporation Mechanisms for efficient autocompletion in XML search applications
US8271472B2 (en) * 2009-02-17 2012-09-18 International Business Machines Corporation System and method for exposing both portal and web content within a single search collection
US20100299367A1 (en) * 2009-05-20 2010-11-25 Microsoft Corporation Keyword Searching On Database Views
US8676859B2 (en) * 2010-01-21 2014-03-18 Hewlett-Packard Development Company, L.P. Method and system for analyzing data stored in a database
KR101130734B1 (en) * 2010-08-12 2012-03-28 연세대학교 산학협력단 Method for generating context hierachyand, system for generating context hierachyand
US9870392B2 (en) * 2010-12-31 2018-01-16 Yan Xiao Retrieval method and system
US9002139B2 (en) 2011-02-16 2015-04-07 Adobe Systems Incorporated Methods and systems for automated image slicing
US9418178B2 (en) 2011-10-24 2016-08-16 International Business Machines Corporation Controlling a size of hierarchical visualizations through contextual search and partial rendering
JP5827874B2 (en) * 2011-11-11 2015-12-02 株式会社ドワンゴ Keyword acquiring apparatus, content providing system, keyword acquiring method, program, and content providing method
US8799269B2 (en) 2012-01-03 2014-08-05 International Business Machines Corporation Optimizing map/reduce searches by using synthetic events
US8732178B2 (en) 2012-01-25 2014-05-20 International Business Machines Corporation Using views of subsets of nodes of a schema to generate data transformation jobs to transform input files in first data formats to output files in second data formats
US8762424B2 (en) 2012-01-25 2014-06-24 International Business Machines Corporation Generating views of subsets of nodes of a schema
US9460200B2 (en) 2012-07-02 2016-10-04 International Business Machines Corporation Activity recommendation based on a context-based electronic files search
US8898165B2 (en) 2012-07-02 2014-11-25 International Business Machines Corporation Identification of null sets in a context-based electronic document search
US8903813B2 (en) 2012-07-02 2014-12-02 International Business Machines Corporation Context-based electronic document search using a synthetic event
US9262499B2 (en) 2012-08-08 2016-02-16 International Business Machines Corporation Context-based graphical database
US8959119B2 (en) * 2012-08-27 2015-02-17 International Business Machines Corporation Context-based graph-relational intersect derived database
US10169446B1 (en) * 2012-09-10 2019-01-01 Amazon Technologies, Inc. Relational modeler and renderer for non-relational data
US9619580B2 (en) 2012-09-11 2017-04-11 International Business Machines Corporation Generation of synthetic context objects
US8620958B1 (en) 2012-09-11 2013-12-31 International Business Machines Corporation Dimensionally constrained synthetic context objects database
US9251237B2 (en) 2012-09-11 2016-02-02 International Business Machines Corporation User-specific synthetic context object matching
US9223846B2 (en) 2012-09-18 2015-12-29 International Business Machines Corporation Context-based navigation through a database
US8782777B2 (en) 2012-09-27 2014-07-15 International Business Machines Corporation Use of synthetic context-based objects to secure data stores
US9741138B2 (en) 2012-10-10 2017-08-22 International Business Machines Corporation Node cluster relationships in a graph database
US10325239B2 (en) 2012-10-31 2019-06-18 United Parcel Service Of America, Inc. Systems, methods, and computer program products for a shipping application having an automated trigger term tool
US8931109B2 (en) 2012-11-19 2015-01-06 International Business Machines Corporation Context-based security screening for accessing data
US9256593B2 (en) * 2012-11-28 2016-02-09 Wal-Mart Stores, Inc. Identifying product references in user-generated content
US8983981B2 (en) 2013-01-02 2015-03-17 International Business Machines Corporation Conformed dimensional and context-based data gravity wells
US9229932B2 (en) 2013-01-02 2016-01-05 International Business Machines Corporation Conformed dimensional data gravity wells
US8914413B2 (en) 2013-01-02 2014-12-16 International Business Machines Corporation Context-based data gravity wells
US9229988B2 (en) * 2013-01-18 2016-01-05 Microsoft Technology Licensing, Llc Ranking relevant attributes of entity in structured knowledge base
US9069752B2 (en) 2013-01-31 2015-06-30 International Business Machines Corporation Measuring and displaying facets in context-based conformed dimensional data gravity wells
US9053102B2 (en) 2013-01-31 2015-06-09 International Business Machines Corporation Generation of synthetic context frameworks for dimensionally constrained hierarchical synthetic context-based objects
US8856946B2 (en) 2013-01-31 2014-10-07 International Business Machines Corporation Security filter for context-based data gravity wells
US9292506B2 (en) 2013-02-28 2016-03-22 International Business Machines Corporation Dynamic generation of demonstrative aids for a meeting
US10152526B2 (en) 2013-04-11 2018-12-11 International Business Machines Corporation Generation of synthetic context objects using bounded context objects
US9348794B2 (en) 2013-05-17 2016-05-24 International Business Machines Corporation Population of context-based data gravity wells
US9195608B2 (en) 2013-05-17 2015-11-24 International Business Machines Corporation Stored data analysis
US9547671B2 (en) 2014-01-06 2017-01-17 International Business Machines Corporation Limiting the rendering of instances of recursive elements in view output
US9594779B2 (en) 2014-01-06 2017-03-14 International Business Machines Corporation Generating a view for a schema including information on indication to transform recursive types to non-recursive structure in the schema
US20160034513A1 (en) * 2014-07-31 2016-02-04 Potix Corporation Method to filter and group tree structures while retaining their relationships
US20160275448A1 (en) * 2015-03-19 2016-09-22 United Parcel Service Of America, Inc. Enforcement of shipping rules
US10353980B2 (en) * 2016-11-30 2019-07-16 Sap Se Client-side paging for hierarchy data structures in restful web services
WO2018214097A1 (en) * 2017-05-25 2018-11-29 深圳大学 Ksp algorithm-based resource description framework query method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001024045A2 (en) * 1999-09-29 2001-04-05 Xml-Global Technologies, Inc. Method, system, signals and media for indexing, searching and retrieving data based on context
WO2001067309A2 (en) * 2000-03-03 2001-09-13 Radiant Logic, Inc. System and method for providing access to databases via directories and other hierarchical structures and interfaces
WO2002027544A1 (en) * 2000-09-29 2002-04-04 British Telecommunications Public Limited Company Information access
US20040059736A1 (en) * 2002-09-23 2004-03-25 Willse Alan R. Text analysis techniques
US20070015667A1 (en) * 2003-02-28 2007-01-18 Construction Research & Technology Gmbh Method and composition for injection at a tunnel boring machine

Family Cites Families (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4325120A (en) * 1978-12-21 1982-04-13 Intel Corporation Data processing system
US5742738A (en) * 1988-05-20 1998-04-21 John R. Koza Simultaneous evolution of the architecture of a multi-part program to solve a problem using architecture altering operations
US7251637B1 (en) * 1993-09-20 2007-07-31 Fair Isaac Corporation Context vector generation and retrieval
US6012052A (en) * 1998-01-15 2000-01-04 Microsoft Corporation Methods and apparatus for building resource transition probability models for use in pre-fetching resources, editing resource link topology, building resource link topology templates, and collaborative filtering
AUPP603798A0 (en) 1998-09-18 1998-10-15 Canon Kabushiki Kaisha Automated image interpretation and retrieval system
US7181438B1 (en) * 1999-07-21 2007-02-20 Alberti Anemometer, Llc Database access system
WO2001067351A1 (en) * 2000-03-09 2001-09-13 The Web Access, Inc. Method and apparatus for performing a research task by interchangeably utilizing a multitude of search methodologies
AUPQ717700A0 (en) 2000-04-28 2000-05-18 Canon Kabushiki Kaisha A method of annotating an image
US8396859B2 (en) * 2000-06-26 2013-03-12 Oracle International Corporation Subject matter context search engine
AU7194001A (en) * 2000-07-28 2002-02-13 Easyask Inc Distributed search system and method
US6795819B2 (en) * 2000-08-04 2004-09-21 Infoglide Corporation System and method for building and maintaining a database
US20020065857A1 (en) * 2000-10-04 2002-05-30 Zbigniew Michalewicz System and method for analysis and clustering of documents for search engine
JP3754912B2 (en) 2000-11-13 2006-03-15 キヤノン株式会社 Multimedia content distribution method
US7669051B2 (en) * 2000-11-13 2010-02-23 DigitalDoors, Inc. Data security system and method with multiple independent levels of security
US7546334B2 (en) * 2000-11-13 2009-06-09 Digital Doors, Inc. Data security system and method with adaptive filter
US7013289B2 (en) * 2001-02-21 2006-03-14 Michel Horn Global electronic commerce system
US7043716B2 (en) * 2001-06-13 2006-05-09 Arius Software Corporation System and method for multiple level architecture by use of abstract application notation
US6799184B2 (en) * 2001-06-21 2004-09-28 Sybase, Inc. Relational database system providing XML query support
US7017162B2 (en) * 2001-07-10 2006-03-21 Microsoft Corporation Application program interface for network software platform
US7644102B2 (en) * 2001-10-19 2010-01-05 Xerox Corporation Methods, systems, and articles of manufacture for soft hierarchical clustering of co-occurring objects
AU2002350131A1 (en) * 2001-11-09 2003-05-26 Gene Logic Inc. System and method for storage and analysis of gene expression data
JP4251804B2 (en) * 2001-12-04 2009-04-08 富士通株式会社 Information display method, information display program, and information display apparatus
WO2003065177A2 (en) * 2002-02-01 2003-08-07 John Fairweather System and method for navigating data
US7457810B2 (en) * 2002-05-10 2008-11-25 International Business Machines Corporation Querying markup language data sources using a relational query processor
US7574652B2 (en) 2002-06-20 2009-08-11 Canon Kabushiki Kaisha Methods for interactively defining transforms and for generating queries by manipulating existing query data
US6986121B1 (en) * 2002-06-28 2006-01-10 Microsoft Corporation Managing code when communicating using heirarchically-structured data
US7668885B2 (en) * 2002-09-25 2010-02-23 MindAgent, LLC System for timely delivery of personalized aggregations of, including currently-generated, knowledge

Also Published As

Publication number Publication date
US20070073734A1 (en) 2007-03-29
JP4637113B2 (en) 2011-02-23
WO2005052810A1 (en) 2005-06-09
JP2007519086A (en) 2007-07-12
US7664727B2 (en) 2010-02-16
AU2004292680A1 (en) 2005-06-09

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)
MK14 Patent ceased section 143(a) (annual fees not paid) or expired