WO2011091471A1

WO2011091471A1 - Query processing of tree-structured data

Info

Publication number: WO2011091471A1
Application number: PCT/AU2011/000083
Authority: WO
Inventors: Sebastian Maneth; Kim Nguyen
Original assignee: National Ict Australia Limited
Priority date: 2010-01-27
Filing date: 2011-01-27
Publication date: 2011-08-04
Also published as: WO2011091471A8

Abstract

A computer-implemented method for processing a query of tree-structured data, comprising: (a) based on the query, calculating a first cost associated with a first traversal order, or a second cost associated with a second traversal order, or both, for traversing the data, or a subset of the data; and (b) based on the calculated first cost or second cost, or both, selecting either the first or second traversal order for processing the query, wherein steps (a) and (b) are performed repeatedly during query processing on multiple subsets of the data to allow switching between the first traversal order and the second traversal order.

Description

Query Processing of Tree-Structured Data

Cross Reference to Related Applications

The present application claims priority from Australian Provisional Application No 2010900320 filed on 27 January 2010, the content of which is incorporated herein by reference. The present application is related to corresponding international applications that claim priority from Australian Provisional Application No 2010900321 and Australian Provisional Application No 2010900322 respectively. The contents of the corresponding international applications are incorporated herein by reference.

Technical Field

This description relates generally to query processing, and more particularly to a computer-implemented method for processing a query of tree-structured data such as Extensible Markup Language (XML) data. Other aspects include a computer program to implement the method and a computer system for processing a query of tree-structured data.

Background

XML, a tree-structured data model defined by the World Wide Web Consortium (W3C), is slowly replacing the conventional relational data model in applications for electronic commerce, business reporting and bioinformatics. Unlike the relational data model, an XML document contains not only data, but also the relationships between the data, expressed using tags or markup constructs such as <section> and </section>.

As more documents are stored and queried in XML format, query languages such as XPath (XML Path Language) and XQuery have also become increasingly popular. XPath, which is simpler than, and forms the basis of, XQuery, provides a path-like syntax for navigating nodes in a tree and selecting nodes based on search criteria. XPath query engines can be divided into two categories: sequential and indexed. In the sequential or streaming approach, each query must sequentially read a whole collection of data such that, ideally, only one pass over the data is required. In the indexed approach, the tree-structured data is pre-processed to build an index to guide query processing, such that traversal of the whole collection is avoided. For many time-critical applications, query run time is important, and as such, there is a need for a more efficient query processing method.

Summary

In a first aspect, there is provided a computer-implemented method for processing a query of tree-structured data, comprising:

(a) based on the query, calculating a first cost associated with a first traversal order, or a second cost associated with a second traversal order, or both, for traversing the data, or a subset of the data; and

(b) based on the calculated first cost or second cost, or both, selecting either the first or second traversal order for processing the query,

wherein steps (a) and (b) are performed repeatedly during query processing on multiple subsets of the data to allow switching between the first traversal order and the second traversal order.

Using the method, the order in which nodes in the tree-structured data are traversed is determined based on properties derived from the data itself. Advantageously, the method allows selection between the first and the second traversal order to benefit from their combined advantages, thereby improving the efficiency of query processing. This also provides better scalability and predictability because query run time depends on the "cost" of traversing the tree-structured data, and similar queries on the same tree should take comparable query run times.

By repeating steps (a) and (b) during query processing, the traversal order for various subsets of the data can be selected and switched dynamically based on the cost associated with traversing the subset, thereby further improving the efficiency of the query processing.

The first cost or second cost, or both, may be calculated based on an estimated number of potential results in the data, or the subset of the data.

The method may further comprise determining the estimated number of potential results based on a hierarchical structure of the data, or the subset of the data. In this case, before query processing, the method may further comprise:

determining the hierarchical structure of the tree-structured data, and

storing the determined structure in a tree index.

The method may further comprise determining the estimated number of potential results based on textual content of the data, or the subset of the data. In this case, before query processing, the method may further comprise:

determining the textual content of the tree-structured data, and

storing the determined textual content in a text index.

The method may also comprise jumping from a current node in the tree to a new node if the selected traversal order is different to a current traversal order.

Step (b) may comprise selecting the first traversal order if the first cost is lower than the second cost, but otherwise, selecting the second traversal order.

Alternatively, step (b) may comprise selecting the first traversal order if the first cost is lower than a threshold, but otherwise, selecting the second traversal order. Step (b) may also comprise selecting the second traversal order if the second cost is lower than a threshold, but otherwise, selecting the first traversal order.

The first traversal order may be top-down and the second traversal order may be bottom-up. The tree-structured data may be Extensible Markup Language (XML) data. The query may be an XPath query.

In a second aspect, there is provided a computer program to implement the method according to the first aspect. The computer program may be embodied in a computer-readable medium such that, when executed, the code of the computer program causes a computer system to implement the method according to the first aspect.

In a third aspect, there is provided a computer system for processing a query of tree-structured data, comprising a processing unit to:

(a) based on the query, calculate a first cost associated with a first traversal order, or a second cost associated with a second traversal order, or both, for traversing the data, or a subset of the data; and

(b) based on the calculated first cost or second cost, or both, select either the first or second traversal order for processing the query,

wherein steps (a) and (b) are performed repeatedly during query processing on multiple subsets of the data to allow switching between the first traversal order and the second traversal order.

Brief Description of Drawings

Non-limiting example(s) will now be described with reference to the accompanying drawings, in which:

Fig. 1 is an exemplary system for query processing.

Fig. 2 is a schematic diagram of steps performed by a query engine.

Fig. 3(a) is an exemplary XML document.

Fig. 3(b) is a text collection created based on the XML document in Fig. 3(a).

Fig. 3(c) is a tree structure created based on the XML document in Fig. 3(a).

Fig. 3(d) is an XML model created based on the XML document in Fig. 3(a).

Fig. 4(a) is a diagram of an exemplary tree traversed using a top-down traversal order.

Fig. 4(b) is a diagram of the tree in Fig. 4(a), but traversed using a bottom-up traversal order.

Fig. 5 is a flowchart of steps performed by a processing unit of the query engine during query processing according to a first example.

Fig. 6 is a diagram illustrating dynamic switching between two traversal orders during query processing.

Fig. 7 is a flowchart of steps performed by a processing unit of the query engine during query processing according to a second example.

Detailed Description

Referring first to Fig. 1, the computer system 100 comprises a query engine 110 and a data store 120 in communication with a plurality of communications devices 152 over a communications network 140, 142. The devices 152 (only two shown for simplicity) are each operated by a user 150. The communications network 140 may be a local area network (LAN) or wide area network (WAN), wireless or otherwise.

Referring also to Fig. 2, the query engine 110 comprises an indexing unit 112, a query parsing unit 114 and a processing unit 116. The query engine 110 processes queries of a collection of XML documents 122 (tree-structured data) in the data store 120. A query may be submitted by a user 150, or by a server 154. The query engine 110 uses an indexed approach to pre-process the XML collection 122, so that later queries can be solved without traversing the entire collection. Firstly, the indexing unit 112 performs pre-processing data analysis on documents in the XML collection to determine the structure and content of the documents 122; see step 210. Results of the data analysis are used during index generation to build a text index 124 and a tree index 126 for use in query processing; see step 220.

The query parsing unit 114 then analyses or parses a path expression for the query; see step 230. Optionally, an automaton can be constructed from the parsed query before further processing is performed; see step 235.

Next, the processing unit 116 processes the query using the text 124 and tree 126 indices created by the indexing unit 112; see step 240. The text index 124 is used to facilitate counting of the number of text nodes in the XML documents that match a simple predicate in a query. The tree index 126 provides an approximation of the number of nodes or potential results in a subset of nodes in the tree starting from a particular node. The results of the query are then presented to the user 150; see step 250. The indexing 220 and query processing 240 steps performed by the query engine 110 will now be explained further below.

Indexing 220

XML documents can be regarded as a "text collection" or a set of strings organised into a labelled "tree structure". The strings correspond to textual content of the data while the tree structure defines the hierarchical structure of the tree.

Referring now to Fig. 3, the tree in Fig. 3(d) corresponds to the XML data in Fig. 3(a). The tree is formed by solid edges, whereas dotted edges display the connection with the set of texts. There are two types of identifier in the tree: text identifiers (numbers in italics) assigned to text content, and global identifiers (numbers in non-italics) assigned to internal and leaf nodes.

There are a number of internal nodes represented by the following symbols:

& is a dummy root (labelled 1) that is added to create a tree instead of a forest;

# is a node (6, 8, 10, 16) associated with a string or textual content ("soon discontinued", "blue", "40" and "30" respectively),

@ is a node (3,12) associated with an attribute ("name"), and

% is a leaf node (7,5) of an attribute node (3,12) and is associated with a value ("pen", "rubber") to an attribute ("name").

Using the above representation, there is exactly one string content associated with each tree leaf, and those strings are referred to as texts. In the example in Fig. 3(d), there are six texts, which are associated with the tree leaves and labelled using text identifiers from left to right: 1 - "pen", 2 - "Soon discontinued", 3 - "blue", 4 - "40", 5 - "rubber" and 6 - "30".

The indexing unit 112 analyses the XML data in Fig. 3(a) to create the text index in Fig. 3(b) and the tree index in Fig. 3(c).

Text Index 124

The text index 124 allows pattern matching during query processing. Textual content is represented as a succinct full-text self index [1] that is generally known as the FM-index [2]. The text collection T stores the content of the XML data as $-terminated strings so that each text corresponds to one string. In the example in Fig. 3(b), T is a concatenated sequence of d texts:

T = pen$Soon discontinued$blue$40$rubber$30$, where $ is a delimiter; see 310.

Given a string T of total length u, from an alphabet of size σ, the FM-index is based on the Burrows-Wheeler transform (BWT) [3] of string T. Assume T ends with the special end-marker '$' and let M be a matrix whose rows are all the cyclic rotations of T in lexicographic order. The last column L of M forms a permutation of T, which is the BWT string L = T^bwt. The matrix is only conceptual; the FM-index uses only the T^bwt string. Note that L[i] is the symbol preceding the i-th lexicographically smallest row of M.

The resulting permutation is reversible. The first column of M, denoted F, contains all symbols of T in lexicographic order; see 320 in Fig. 3(b). There exists a simple last-to-first mapping from symbols in L to F [4]. Let C[c] be the total number of symbols in T that are lexicographically less than c. Now the LF-mapping can be defined as: LF(i) = C[L[i]] + rank_{L[i]}(L, i).

The symbols of T can be read in reverse order by starting from the end-marker location i and applying LF(i) recursively: we get T^bwt[i], T^bwt[LF(i)], T^bwt[LF(LF(i))] and so on. Finally, after u steps, we get the first symbol of T. The values C[c] can be stored in a small array of σ log u bits. Function rank_c(L, i) can be computed in O(log σ) time with a wavelet tree data structure requiring only u H_0(T) + o(u log σ) bits [5], [6].
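The transform and the LF-mapping described above can be illustrated with a small sketch. It is not the succinct structure of the description: the matrix M is materialised explicitly, rank is a naive scan rather than a wavelet tree, and all function names are chosen here for illustration only.

```python
from collections import Counter


def bwt(t):
    """Return (L, row_of_t): the last column of the sorted rotation matrix M,
    and the 1-based row of M that equals t itself."""
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rows), rows.index(t) + 1


def build_c(l):
    """C[c] = number of symbols in the text that are lexicographically less than c."""
    c_table, total = {}, 0
    for sym, count in sorted(Counter(l).items()):
        c_table[sym] = total
        total += count
    return c_table


def rank(l, sym, i):
    """Occurrences of sym in L[1..i] (naive scan standing in for a wavelet tree)."""
    return l[:i].count(sym)


def lf(l, c_table, i):
    """LF(i) = C[L[i]] + rank_{L[i]}(L, i), with 1-based rows."""
    sym = l[i - 1]
    return c_table[sym] + rank(l, sym, i)


def read_backwards(l, start_row):
    """Recover T by emitting L[i], L[LF(i)], ... for u steps, then reversing."""
    c_table = build_c(l)
    out, i = [], start_row
    for _ in range(len(l)):
        out.append(l[i - 1])
        i = lf(l, c_table, i)
    return "".join(reversed(out))


T = "pen$Soon discontinued$blue$40$rubber$30$"
L, ROW_OF_T = bwt(T)
RECOVERED = read_backwards(L, ROW_OF_T)
```

Starting from the row of M that equals T, u applications of LF visit every symbol of T from last to first, which is why reversing the emitted symbols yields T again.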

During query processing, pattern matching is supported via backward searching on the BWT [4]. Given a pattern P[1, m], backward searching is performed as follows:

1. Start with the range [sp, ep] = [1, u] of rows in M.

2. At each step i ∈ {m, m-1, ..., 1}, update the range [sp, ep] to [sp', ep'] to match all rows of M that have P[i, m] as a prefix:

sp' = C[P[i]] + rank_{P[i]}(L, sp - 1) + 1 and

ep' = C[P[i]] + rank_{P[i]}(L, ep).

To find out the location of each occurrence, the text is traversed backwards from each sp ≤ i ≤ ep (virtually, using LF on T^bwt) until a sampled position is found. This is a sampling carried out at regular text positions, so that the corresponding positions in T^bwt are marked in a bitmap B_s[1, u], and the text position corresponding to T^bwt[i], if B_s[i] = 1, is stored in a samples array at index rank_1(B_s, i).

T^bwt contains all end-markers in some permuted order; see 320 in Fig. 3(b). This permutation is represented with a data structure Doc, that maps from positions of $s in T^bwt to text numbers, and also allows two-dimensional range searching [7]; see 330 in Fig. 3(c). Thus, the text corresponding to a terminator T^bwt[i] = $ is Doc[rank_$(T^bwt, i)]. Furthermore, given a range [sp, ep] of T^bwt and a range of text identifiers [x, y], Doc can be used to find identifiers of all $-terminators within the [sp, ep] × [x, y] range in O(log d) time per answer. In practice, Doc can be implemented as a plain array using d log d bits.

The basic pattern matching feature of the FM-index is extended to support XPath functions. Given a pattern and a range of text identifiers to be searched, these XPath functions return all text identifiers that match the query within the range. In addition, existential (i.e. is there a match in the range?) and counting (i.e. how many matches in the range?) queries are supported. Exemplary XPath functions are as follows:

(a) starts-with(P, [x, y]): The goal is to find texts in the [x, y] range prefixed by the given pattern P. After backward search, the range [sp, ep] in T^bwt contains the end-markers of all the texts prefixed by P. Now [sp, ep] × [x, y] can be mapped to Doc, and existential and counting queries can be answered in O(log d) time. Matching text identifiers can be reported in O(log d) time per identifier.

(b) ends-with(P, [x, y]): Backward searching is localized to texts [x, y] by choosing [sp, ep] = [x, y] as the starting interval. After the backward search, the resulting range [sp, ep] contains all possible matches, thus, existential and counting queries can be answered in constant time. To find out text identifiers for each occurrence, text must be traversed backwards to find a sampled position.

(c) operator = (P, [x, y]): Texts that are equal to P, and in range, can be found as follows. Do the backward search as in ends-with, then map to the $-terminators as in starts-with. Time complexities are the same as in starts-with.

(d) contains(P, [x, y]): To find texts that contain P, we start with the normal backward search and finish as in ends-with. In this case there might be several occurrences inside one text, which have to be filtered. Thus, the time complexity is proportional to the total number of occurrences, O(1 log σ) for each. Existential and counting queries are as slow as reporting queries, but the O(|P| log σ)-time counting of all the occurrences of P can still be useful for query optimization.

(e) operators ≤, <, ≥, >: The operator ≤ matches texts that are lexicographically smaller than or equal to the given pattern. It can be solved like the starts-with query, but updating only the ep of each backward search step, while sp = 1 stays constant. If at some point there are no occurrences of P[i] = c within the prefix L[1, ep], we find those of smaller symbols, ep = C[c], and continue for P[1, i - 1]. Other operators can be supported analogously, and costs are as for starts-with.
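As a reference for what functions (a) to (e) are expected to return, the following sketch implements their semantics by a plain scan over the six texts of the example. It deliberately does not use the FM-index or Doc, so it says nothing about the stated time bounds; the function names and the use of 1-based text identifiers are assumptions made for the illustration.

```python
# Texts of the example, addressed by 1-based text identifiers.
TEXTS = ["pen", "Soon discontinued", "blue", "40", "rubber", "30"]


def _scan(x, y, predicate):
    """Return identifiers d in [x, y] whose text satisfies the predicate."""
    return [d for d in range(x, y + 1) if predicate(TEXTS[d - 1])]


def starts_with(p, x, y):
    return _scan(x, y, lambda t: t.startswith(p))


def ends_with(p, x, y):
    return _scan(x, y, lambda t: t.endswith(p))


def equals(p, x, y):
    return _scan(x, y, lambda t: t == p)


def contains(p, x, y):
    return _scan(x, y, lambda t: p in t)


def less_or_equal(p, x, y):
    # Lexicographic comparison on code points, as Python strings compare.
    return _scan(x, y, lambda t: t <= p)


def exists(matches):
    """Existential query: is there a match in the range?"""
    return len(matches) > 0
```

A counting query is simply the length of the returned list, for example len(contains("e", 1, 6)).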

Tree Index 126

As shown in Fig. 3(c), the tree index 126 is represented by the following compact data structures, which provide navigation and indexed access to it.

(a) Par 350: The balanced parentheses representation [8] of the tree structure. This is obtained by traversing the tree in depth-first-search (DFS) order, writing a "(" whenever the indexing unit 112 reaches a node, and a ")" when the indexing unit 112 leaves the node (thus it is easily produced during the XML parsing). This way, every node is represented by a pair of matching opening and closing parentheses. A tree node will be identified by the position of its opening parenthesis in Par (that is, a node will be just an integer index within Par).

(b) Tag 360: A sequence of the tag identifiers of each tree node, including an opening and a closing version of each tag, to mark the beginning and ending point of each node. These tags are numbers in [1, 2t] and are aligned with Par so that the tag of node i is simply Tag[i]. For example, Tag[1] returns the root node & (labelled 1) in the tree in Fig. 3(d) and Tag[4] is "@name" as represented by "n" (4th position). The sequence also comprises corresponding closing tags "/&" (last position) and "/n" (7th position).
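A minimal sketch of how Par and Tag could be produced in one depth-first pass is given below. The small tree is hypothetical (it is not the tree of Fig. 3(d)), and tags are kept as strings rather than numbers in [1, 2t] purely for readability.

```python
def encode(tree):
    """Return (par, tag) for a tree given as nested (label, children) tuples."""
    par, tag = [], []

    def visit(node):
        label, children = node
        par.append("(")          # written when the node is reached
        tag.append(label)        # opening version of the tag
        for child in children:
            visit(child)
        par.append(")")          # written when the node is left
        tag.append("/" + label)  # closing version of the tag

    visit(tree)
    return "".join(par), tag


# Hypothetical example: a dummy root, one element with an attribute and a text child.
TREE = ("&", [
    ("part", [
        ("@", [("name", [("%", [])])]),
        ("color", [("#", [])]),
    ]),
])

PAR, TAG = encode(TREE)
```

With 1-based positions as in the description, the tag of the node at position i is TAG[i - 1] in this Python list.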

Rank and select queries are also required on Tag. Several sequence representations supporting these are known [9], and a practical representation that favours speed over space is selected. First, the indexing unit 112 stores the tags in an array using ⌈log 2t⌉ bits per field, which gives constant time access to Tag[i]. The rank and select queries over the sequence of tags are answered by a second structure. Consider the binary matrix:

R[1..2t][1..2n]

such that R[i, j] = 1 if Tag[j] = i; see 370 in Fig. 3(c). Each row of the matrix R is represented using Okanohara and Sadakane's structure sarray [10]. The structure supports access and select in O(1) time, and rank in O(log n) time.

The tree structure comprising data structures Par and Tag can then be used during query processing. The following operations over the tree structure are useful to support XPath queries over the tree. Let tag be a tag identifier.

(a) Basic Tree Operations [11]

Let x be a node (a position in Par), the tree operations are:

Close(x): The closing parenthesis matching Par[x]. If x is a small subtree this takes a few local accesses to Par, otherwise a few non-local table accesses.

Preorder(x) = rank_'('(Par, x): Preorder number of x.

SubtreeSize(x) = (Close(x) - x + 1)/2: Number of nodes in the subtree rooted at x.

IsAncestor(x, y) = x ≤ y ≤ Close(x): Whether x is an ancestor of y.

FirstChild(x) = x + 1: First child of x, if any.

NextSibling(x) = Close(x) + 1: Next sibling of x, if any.

Parent(x): Parent of x. Somewhat costlier than Close(x) in practice, because the answer is less likely to be near x in Par.
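The basic tree operations can be sketched over a parentheses string as follows. The scans are naive and linear, whereas the description relies on succinct structures for fast answers; positions are 1-based to match the text, and the sample string PAR is a hypothetical seven-node tree.

```python
PAR = "((((()))(())))"  # hypothetical tree with 7 nodes


def close(par, x):
    """Position of the closing parenthesis matching the opening one at x."""
    depth = 0
    for pos in range(x, len(par) + 1):
        depth += 1 if par[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced parentheses")


def preorder(par, x):
    """Number of opening parentheses in Par[1..x]."""
    return par[:x].count("(")


def subtree_size(par, x):
    return (close(par, x) - x + 1) // 2


def is_ancestor(par, x, y):
    return x <= y <= close(par, x)


def first_child(par, x):
    """x + 1 if that position opens a node, otherwise None (x is a leaf)."""
    return x + 1 if par[x] == "(" else None


def next_sibling(par, x):
    nxt = close(par, x) + 1
    return nxt if nxt <= len(par) and par[nxt - 1] == "(" else None


def parent(par, x):
    """Nearest enclosing opening parenthesis to the left of x, or None for the root."""
    depth = 0
    for pos in range(x - 1, 0, -1):
        if par[pos - 1] == ")":
            depth += 1
        elif depth == 0:
            return pos
        else:
            depth -= 1
    return None
```

In this sketch "if any" is made explicit by returning None when the position computed by the formula does not open a node.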

(b) Connecting to Tags

The following operations are used for fast XPath evaluation.

SubtreeTags(x, tag): Returns the number of occurrences of tag within the subtree rooted at node x. This is rank_tag(Tag, Close(x)) - rank_tag(Tag, x - 1).

Tag(x): Gives the tag identifier of node x.

TaggedDesc(x, tag): The first node labelled tag strictly within the subtree rooted at x. This is select_tag(Tag, rank_tag(Tag, x) + 1) if it is < Close(x), and undefined otherwise.

TaggedPrec(x, tag): The last node labelled tag with preorder smaller than that of node x, and not an ancestor of x. Let r = rank_tag(Tag, x - 1). If select_tag(Tag, r) is not an ancestor of node x, we stop. Otherwise, we set r = r - 1 and iterate.

TaggedFoll(x, tag): The first node labelled tag with preorder larger than that of x, and not in the subtree of x. This is select_tag(Tag, rank_tag(Tag, Close(x)) + 1).
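The tag operations reduce to rank and select on Tag combined with Close on Par. The sketch below shows this with naive list scans; the tree (a root r with children a and b, where a contains b and c) is hypothetical, tags are strings instead of numeric identifiers, and None stands for "undefined".

```python
# Hypothetical tree r(a(b(c), c), b(c)), with 1-based positions as in the text.
PAR = "(((())())(()))"
TAG = ["r", "a", "b", "c", "/c", "/b", "c", "/c", "/a", "b", "c", "/c", "/b", "/r"]


def close(x):
    depth = 0
    for pos in range(x, len(PAR) + 1):
        depth += 1 if PAR[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced parentheses")


def rank(tag, i):
    """Occurrences of tag in Tag[1..i]."""
    return TAG[:i].count(tag)


def select(tag, k):
    """Position of the k-th occurrence of tag, or None if there is none."""
    seen = 0
    for pos, t in enumerate(TAG, start=1):
        seen += t == tag
        if t == tag and seen == k:
            return pos
    return None


def subtree_tags(x, tag):
    return rank(tag, close(x)) - rank(tag, x - 1)


def tagged_desc(x, tag):
    pos = select(tag, rank(tag, x) + 1)
    return pos if pos is not None and pos < close(x) else None


def tagged_foll(x, tag):
    return select(tag, rank(tag, close(x)) + 1)


def tagged_prec(x, tag):
    r = rank(tag, x - 1)
    while r > 0:
        pos = select(tag, r)
        if not (pos <= x <= close(pos)):  # pos is not an ancestor of x
            return pos
        r -= 1
    return None
```

For instance, subtree_tags(2, "c") counts the c-labelled nodes below the a node at position 2 without visiting them one by one in a real index, which is the kind of estimate used later for cost calculation.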

(c) Connecting the Text and the Tree

Conversion between text numbers, tree nodes, and global identifiers is easily carried out by using Par and a bitmap B of 2n bits that marks the opening parentheses of tree leaves containing text, plus o(n) extra bits to support rank or select queries. Bitmap B enables the computation of the following operations:

LeafNumber(x): Gives the number of leaves up to x in Par. This is rank_1(B, x).

TextIds(x): Gives the range of text identifiers that descend from node x. This is simply [LeafNumber(x - 1) + 1, LeafNumber(Close(x))].

XMLIdText(d): Gives the global identifier for the text with identifier d. This is Preorder(select_1(B, d)).

XMLIdNode(x): Gives the global identifier for a tree node x. This is just Preorder(x).

Query Processing 240

During query processing, the processing unit 116 uses the properties derived from an XML collection 122 to determine the traversal order in which the tree associated with the collection 122 is traversed; see step 240 in Fig. 2. The properties are in the form of the textual content of the tree, as stored in the text index 124, and its hierarchical structure, as stored in the tree index 126. The purpose is to improve the efficiency of query processing so that the processing unit 116 needs to visit as few nodes as possible. Referring to Fig. 4, there are two traversal orders in which the tree associated with an XML collection 122, or a subset of the tree, can be traversed:

(a) Top-down traversal order

The top-down traversal begins at the root node of the tree, or a subset of the tree. Consider a query for all title elements, i.e. //title for the tree shown in Fig. 4(a). Starting from root node "Bibliography", the processing unit 116 traverses the tree from top to bottom, left to right. In this case, there are five title elements.

(b) Bottom-up traversal order

The bottom-up traversal begins at the leaf nodes at the bottom of the tree. For example, consider the same tree in Fig. 4(a), but this time for a query for all books with "Road" in the title, i.e. //book[contains(title, "Road")]. As shown in Fig. 4(b), there are two potential results with "Road": "The Road" and "Roadsters". The processing unit 116 traverses upward from each candidate to determine whether the text is part of the title of a book. In this case, "The Road" is the only result because "Roadsters" is the title of a magazine.
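The two traversal orders can be contrasted with a short sketch using Python's standard xml.etree.ElementTree module. The document below is a hypothetical stand-in for the tree of Fig. 4 (only "The Road" and "Roadsters" come from the description; the other titles are invented), and the candidate lookup that a text index would provide is simulated by a plain scan.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography with five title elements.
DOC = (
    "<Bibliography>"
    "<book><title>The Road</title></book>"
    "<book><title>Dune</title></book>"
    "<magazine><title>Roadsters</title></magazine>"
    "<book><title>Emma</title></book>"
    "<book><title>Ulysses</title></book>"
    "</Bibliography>"
)
ROOT = ET.fromstring(DOC)


def top_down_titles(root):
    """//title: visit every node from the root, top to bottom, left to right."""
    results, visited, stack = [], 0, [root]
    while stack:
        node = stack.pop()
        visited += 1
        if node.tag == "title":
            results.append(node.text)
        stack.extend(reversed(list(node)))  # keep left-to-right order
    return results, visited


def bottom_up_books(root, needle):
    """//book[contains(title, needle)]: start at matching texts and climb."""
    parent = {child: node for node in root.iter() for child in node}
    # Stand-in for the text index: find nodes whose text contains the needle.
    candidates = [n for n in root.iter() if n.text and needle in n.text]
    results = []
    for cand in candidates:
        up = parent.get(cand)
        if cand.tag == "title" and up is not None and up.tag == "book":
            results.append(up)
    return results, len(candidates)


TITLES, VISITED = top_down_titles(ROOT)
BOOKS, CANDIDATES = bottom_up_books(ROOT, "Road")
```

Here the top-down pass touches all eleven nodes, whereas the bottom-up pass only climbs from the two candidates, which mirrors the cost trade-off discussed next.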

Comparing the traversal orders, the bottom-up traversal order is more efficient if there are fewer potential results in leaf nodes to traverse in the tree, or a subset of the tree. The opposite is true for the top-down traversal. If there are many potential results in the tree or a subset of the tree, the use of bottom-up traversal is costly in terms of processing time because the processing unit 116 has to explore the branch associated with each potential result. In this case, it is more efficient for the processing unit 116 to use the top-down approach.

Referring now to the flowchart in Fig. 5, the processing unit 116 determines whether to use the top-down or bottom-up approach at every node that is evaluated during query processing; see step 510. To make this decision, the processing unit calculates a first cost associated with using the top-down approach (step 520), and a second cost associated with using the bottom-up approach (step 530).

The first and second costs calculated by the processing unit 116 are based on the number of potential results starting from a node in the tree. The cost is calculated based on an estimated number of potential results in the data, or the subset of the data. The number of potential results is estimated using textual content in the text index 124 and hierarchical structure information in the tree index 126.

Given a node in the tree that corresponds to a position in the corresponding document, the tree index 126 provides an approximation of the number of potential results starting from that node. The text index 124 provides an approximation of the number of potential results in the form of text nodes matching a simple predicate. The estimation of potential results to select a traversal order allows "jumping" between nodes in the tree. Additionally, evaluation procedures of automata allow nodes to be selected at most once and therefore not duplicated in the results.

For example, XPath query //a//b//c returns all the 'c'-labelled nodes occurring below a 'b'-labelled node, itself occurring below an 'a'-labelled node. Given an 'a' node, it is possible to query the tree index 126, for instance with the SubtreeTags operation described above, and return the number of 'c'-labelled nodes in its subtree. This provides an approximation of the number of potential results since it also counts 'c' nodes which are not below 'b' nodes. Other operations such as SubtreeSize(a) can be used to determine the number of nodes in a subtree rooted at node a. Similarly, using the text index 124, operation contains(P, [x, y]) can be used to determine the number of nodes matching pattern P in range [x, y].

If the first cost is estimated to be more than the second cost, the processing unit selects the bottom-up approach; see step 550. Otherwise, the top-down approach will be selected; see step 555. The subset of nodes in the tree starting from the current node is then evaluated using the selected approach; see step 560. The processing unit 116 then determines whether there are more nodes to evaluate; see step 570. If yes, the next node becomes the current node and the process of selection and the steps 520 to 570 are repeated; see step 590. Otherwise, the results of the query are presented to the user; see step 580 in Fig. 5 and corresponding step 250 in Fig. 2.
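The per-node selection loop can be sketched for a query of the form //anc//target. The cost formulas are assumptions made for this illustration, since the description only requires that costs be based on estimated numbers of potential results: the first (top-down) cost is taken as the subtree size, and the second (bottom-up) cost as the number of target-labelled nodes times a hypothetical CLIMB_FACTOR. The Node class, with statistics precomputed at construction, stands in for the tree index.

```python
from collections import Counter

CLIMB_FACTOR = 3  # hypothetical weight for climbing from one candidate


class Node:
    def __init__(self, label, children=()):
        self.label, self.children, self.parent = label, list(children), None
        for child in self.children:
            child.parent = self
        # Precomputed statistics play the role of SubtreeSize and SubtreeTags.
        self.size = 1 + sum(c.size for c in self.children)
        self.counts = Counter({label: 1})
        for child in self.children:
            self.counts.update(child.counts)


def labelled_descendants(node, label):
    """Descendants with the given label, in document order (simulated jump)."""
    for child in node.children:
        if child.label == label:
            yield child
        yield from labelled_descendants(child, label)


def has_ancestor(node, label):
    up = node.parent
    while up is not None:
        if up.label == label:
            return True
        up = up.parent
    return False


def evaluate(node, anc, target, below_anc, trace):
    """Results of //anc//target within the subtree of node, in document order."""
    results = [node] if below_anc and node.label == target else []
    first_cost = node.size                                 # top-down
    second_cost = CLIMB_FACTOR * node.counts[target]       # bottom-up
    if first_cost > second_cost:
        trace.append((node.label, "bottom-up"))
        results += [c for c in labelled_descendants(node, target)
                    if has_ancestor(c, anc)]
        return results
    trace.append((node.label, "top-down"))
    for child in node.children:
        results += evaluate(child, anc, target,
                            below_anc or node.label == anc, trace)
    return results


# Dense part: many results close together. Sparse part: one result deep down.
DENSE = Node("b", [Node("c"), Node("c"), Node("c")])
SPARSE = Node("x", [Node("y", [Node("y", [Node("b", [Node("c")])])]),
                    Node("c"), Node("y")])
ROOT = Node("r", [DENSE, SPARSE])
TRACE = []
RESULTS = evaluate(ROOT, "b", "c", False, TRACE)
```

In this run the root and the dense subtree are handled top-down, while the sparse subtree is handled bottom-up; each node is reported at most once and in document order.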

Example

An example of how the processing unit 116 switches between top-down and bottom-up is illustrated in Fig. 6. The largest triangle represents a tree 600 that is associated with an XML collection 122. The smaller triangles each represent a subset of nodes in the tree starting from a "current node", labelled at various instances at 610, 620, 630, 640, 650 and 660. The dots each represent a potential result, as estimated by the processing unit 116 using the text 124 and tree 126 indices created by the indexing unit 112.

In this example, the processing unit 116 starts with the top-down approach (dotted line) at the root of the tree; see node 610. This is because the cost associated with using the bottom-up order is higher than that of the top-down approach, as indicated by the estimated number of potential results in the tree starting from the root node.

At child node 620, the processing unit 116 continues using the top-down approach because the cost associated with the bottom-up approach is greater than that of the top-down approach, as indicated by the number of potential results starting from the node 620.

Then at child node 630, the processing unit 116 once again calculates and compares the first and second costs; see steps 520 to 540 in Fig. 5. In this case, it is more efficient to use the bottom-up approach because there are only two potential results at the bottom of this subset, compared to the number of potential results along the height of the subset starting from node 630. As such, the processing unit 116 dynamically switches from the top-down approach (dotted line) to the bottom-up approach (solid line).

After the subset is evaluated using the bottom-up approach, the processing unit 116 continues at nodes 640, 650 and 660. At child node 660, the top-down approach is selected because, once again, there are many potential results in the subset of nodes starting from node 660. However, at nodes 670 and 680, the processing unit 116 dynamically switches to the bottom-up approach. As such, a combination of top-down and bottom-up is used by the processing unit 116 to traverse the subset of nodes starting from node 660.

It will be appreciated that the processing unit 116 uses "a node at a time" semantics, where it determines the next node of interest from one point of the tree. As such, this allows "jumping" from one node to the next node of interest, such as when the processing unit 116 jumps from node 630 in Fig. 6 at the top of the subset to the bottom of the same subset.

This is to be contrasted with the "a set at a time" semantics associated with relational database indices. In this case, given a set of nodes as input, the "a set at a time" semantics returns the set of nodes that are their descendants. This may be useful for non-XML databases (i.e. relational model) where queries are answered "in bulk" instead of one element at a time.

However, this approach is not suitable for querying tree-structured XML data for two reasons. Firstly, the results need to be in document order and secondly, they have to be unique in the sense that a given node cannot appear twice in the same result. In other words, results are sets of nodes rather than sequences of nodes. As such, the "a set at a time" indices can guarantee neither the order nor the uniqueness of the nodes in the results, and therefore a relational-based XML query engine needs to sort and order intermediate results, which is an expensive process.

If the number of potential results given by the text or tree index is deterministic, then alternating between the bottom-up and top-down approaches will be an optimization. Even if the number of potential results is probabilistic, switching between the top-down and bottom-up approaches is guided by heuristics and will generally be faster than a simple all-top-down or all-bottom-up approach.

It should also be appreciated that XPath is distinguishable from step-based languages such as Lorel. Single step languages (and their execution engines) generally have more rigid types of queries where a static, "once and for all" analysis is sufficient. By contrast, XPath requires a different and more complex analysis when determining whether top-down or bottom-up traversal order is more efficient, because simpler "once and for all" analyses do not scale and are not precise for recursive primitives such as "select all descendants". For example, consider again the XPath query //a//b//c which selects all 'c'-labelled nodes that are below a 'b'-labelled node, which itself is below an 'a'-labelled node. There are a great number of configurations of nodes (that is, a great number of trees) which satisfy this query. As explained earlier, if there is a unique 'c'-labelled node in the document, then starting bottom-up for the whole document would be the best choice. However, in some cases, the distribution of nodes is irregular. For example, there may be 'b'-labelled and 'c'-labelled nodes at various depths where one part of the document has a lot of 'c'-labelled nodes whereas their distribution is sparse in other parts. In these cases, it is better to start evaluating the query (choosing top-down or bottom-up as an initial guess only) and further refine the choice dynamically using information local to the current subtree, that is, relevant only to the part of the document the engine is exploring at the moment. This dynamic switching of traversal order is important to account for the various valid tree configurations that queries with recursive primitives can denote.

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

For example, in order to determine whether to select the top-down or bottom-up approach, the processing unit 116 may calculate only one cost and compare it with a threshold. Referring to the flowchart in Fig. 7, the processing unit 116 calculates a second cost associated with using the bottom-up approach in step 730. The calculated second cost is then compared with a threshold in step 740. If the second cost is less than the threshold, the bottom-up approach will be selected in step 755. Otherwise, the top-down approach will be selected in step 750.
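The single-cost comparison of steps 740, 750 and 755 can be sketched as below. The function name, the string constants and the sample values are illustrative assumptions; how the second cost is obtained in step 730 is outside this sketch.

```python
TOP_DOWN = "top-down"
BOTTOM_UP = "bottom-up"


def select_traversal_order(second_cost: float, threshold: float) -> str:
    """Choose a traversal order from the bottom-up (second) cost alone."""
    # Step 740: compare the second cost with the threshold.
    if second_cost < threshold:
        return BOTTOM_UP  # step 755
    return TOP_DOWN       # step 750


# Hypothetical values: a cost of 3 against a threshold of 10.
print(select_traversal_order(3, 10))
```

A cost equal to the threshold falls into the "otherwise" branch, so the top-down approach is selected in that case.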

Similarly, the second cost is calculated based on an estimated number of potential results in the data, or the subset of the data. The cost may be the estimated number of potential results, run time or processing power. The number of potential results is estimated using textual content in the text index 124 and hierarchical structure information in the tree index 126.
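One way to picture such an estimate is to count the nodes carrying a given label inside a subset of the data. The sketch below assumes the document is given as a list of node labels in preorder, so that every subtree occupies a contiguous range of positions. The helper names and the dictionary-based index are hypothetical stand-ins for the tree index 126, not its actual structure, and the textual content held in the text index 124 is not modelled.

```python
from bisect import bisect_left


def build_label_index(labels):
    """Map each label to the sorted preorder positions where it occurs."""
    index = {}
    for position, label in enumerate(labels):
        index.setdefault(label, []).append(position)
    return index


def estimate_results(index, label, start, end):
    """Count nodes with `label` whose preorder position lies in [start, end)."""
    positions = index.get(label, [])
    return bisect_left(positions, end) - bisect_left(positions, start)


# Preorder labels of a small document: root(a(b(c), c), a(b))
labels = ["root", "a", "b", "c", "c", "a", "b"]
index = build_label_index(labels)

whole_document = estimate_results(index, "c", 0, len(labels))
first_a_subtree = estimate_results(index, "c", 1, 5)   # positions 1 to 4
second_a_subtree = estimate_results(index, "c", 5, 7)  # positions 5 to 6
print(whole_document, first_a_subtree, second_a_subtree)
```

Here both 'c'-labelled nodes lie in the first 'a' subtree and none in the second, which is the kind of local information that a per-subset choice of traversal order can use.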

The threshold can be a fixed parameter, or one that is improved over time. An initial threshold can be selected based on the processing capability of the query engine 110 and the size of the data set. The actual implementation of the bottom-up and top-down approaches, and therefore the runtime using these approaches, may also be taken into consideration when selecting the threshold.

In another example, instead of calculating the second cost, the processing unit 116 may calculate a first cost associated with using the top-down approach in step 720. Similarly, the first cost is also compared with the threshold to determine whether to select the top-down or bottom-up approach.

More complex tree and text indices with sophisticated counting capabilities may also be used. In this case, the indices are not limited to counting the number of nodes with some label, which is a very rough approximation of the size of the result. Such indices can also be used to estimate the number of potential results and therefore to determine whether to select the top-down or bottom-up approach.

It should also be understood that, unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "receiving", "processing", "retrieving", "selecting", "calculating", "determining", "displaying" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that processes and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. Unless the context clearly requires otherwise, words using singular or plural number also include the plural or singular number respectively.

It should be understood that the techniques described might be implemented using a variety of technologies. For example, the methods described herein may be implemented by a series of computer executable instructions residing on a suitable computer readable medium. Suitable computer readable media may include volatile (e.g. RAM) and/or non-volatile (e.g. ROM, disk) memory, carrier waves and transmission media (e.g. copper wire, coaxial cable, fibre optic media). Exemplary carrier waves may take the form of electrical, electromagnetic or optical signals conveying digital data streams along a local network or a publicly accessible network such as the Internet.

References

[1] G. Navarro and V. Mäkinen, "Compressed full-text indexes," ACM Comp. Surv., vol. 39, no. 1, 2007.

[2] P. Ferragina and G. Manzini, "Indexing compressed text," J. ACM, vol. 54, no. 4, pp. 552-581, 2005.

[3] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," Digital Equipment Corporation, Tech. Rep. 124, 1994.

[4] P. Ferragina and G. Manzini, "Indexing compressed text," J. ACM, vol. 54, no. 4, pp. 552-581, 2005.

[5] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro, "Compressed representations of sequences and full-text indexes," ACM TALG, vol. 3, no. 2, 2007.

[6] R. Grossi, A. Gupta, and J. S. Vitter, "High-order entropy-compressed text indexes," in SODA, 2003, pp. 841-850.

[7] V. Mäkinen and G. Navarro, "Rank and select revisited and extended," Theor. Comput. Sci., vol. 387, no. 3, pp. 332-347, 2007.

[8] I. Munro and V. Raman, "Succinct representation of balanced parentheses, static trees and planar graphs," in FOCS, 1997, pp. 118-126.

[9] F. Claude and G. Navarro, "Practical rank/select queries over arbitrary sequences," in SPIRE, 2008, pp. 176-187.

[10] D. Okanohara and K. Sadakane, "Practical entropy-compressed rank/select dictionary," in ALENEX, 2007.

[11] K. Sadakane and G. Navarro, "Fully-functional static and dynamic succinct trees," in SODA, 2010.

Claims

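1. A computer-implemented method for processing a query of tree-structured data, comprising:

(a) based on the query, calculating a first cost associated with a first traversal order, or a second cost associated with a second traversal order, or both, for traversing the data, or a subset of the data; and

(b) based on the calculated first cost or second cost, or both, selecting either the first or second traversal order for processing the query,

wherein steps (a) and (b) are performed repeatedly during query processing on multiple subsets of the data to allow switching between the first traversal order and the second traversal order.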

2. The computer-implemented method of claim 1, wherein the first cost or second cost, or both, are calculated based on an estimated number of potential results in the data, or the subset of the data.

3. The computer-implemented method of claim 2, further comprising determining the estimated number of potential results based on a hierarchical structure of the data, or the subset of the data.

4. The computer-implemented method of claim 3, further comprising, before query processing:

determining the hierarchical structure of the tree-structured data, and

storing the determined structure in a tree index.

5. The computer-implemented method of any one of the preceding claims, further comprising determining the estimated number of potential results based on textual content of the data, or the subset of the data.

6. The computer-implemented method of claim 5, further comprising, before query processing:

determining the textual content of the tree-structured data, and

storing the determined textual content in a text index.

7. The computer-implemented method of any one of the preceding claims, further comprising jumping from a current node in the tree to a new node if the selected traversal order is different to a current traversal order.

8. The computer-implemented method of any one of claims 1 to 7, wherein step (b) comprises selecting the first traversal order if the first cost is lower than the second cost, but otherwise, selecting the second traversal order.

9. The computer-implemented method of any one of claims 1 to 7, wherein step (b) comprises selecting the first traversal order if the first cost is lower than a threshold, but otherwise, selecting the second traversal order.

10. The computer-implemented method of any one of claims 1 to 7, wherein step (b) comprises selecting the second traversal order if the second cost is lower than a threshold, but otherwise, selecting the first traversal order.

11. The computer-implemented method of any one of the preceding claims, wherein the first traversal order is top-down and the second traversal order is bottom-up.

12. The computer-implemented method of any one of the preceding claims, wherein the tree-structured data is Extensible Markup Language (XML) data.

13. The computer-implemented method of any one of the preceding claims, wherein the query is an XPath query.

14. A computer program to implement the method of any one of the preceding claims.

15. A computer system for processing a query of tree-structured data, comprising a processing unit to:

(a) based on the query, calculate a first cost associated with a first traversal order, or a second cost associated with a second traversal order, or both, for traversing the data, or a subset of the data; and

(b) based on the calculated first cost or second cost, or both, select either the first or second traversal order for processing the query,

wherein steps (a) and (b) are performed repeatedly during query processing on multiple subsets of the data to allow switching between the first traversal order and the second traversal order.