US20080010256A1 - Element query method and system

Info

Publication number: US20080010256A1
Application number: US11/758,306
Authority: US
Inventors: Christopher Lindblad; Hui Li
Original assignee: Mark Logic Corp
Current assignee: Mark Logic Corp
Priority date: 2006-06-05
Filing date: 2007-06-05
Publication date: 2008-01-10
Also published as: WO2007143666A2; WO2007143666A3

Abstract

Methods, systems, and computer-readable media for representing and querying positional information for a hierarchical document (such as an XML document) are disclosed. In one set of embodiments, at least one word in the hierarchical document is associated with one or more word positions, and at least one element in the hierarchical document is associated with one or more word position ranges. The word positions and word position ranges are analyzed to determine whether a particular word or phrase is a direct or indirect descendant of a particular element in the hierarchical document. In various embodiments, the word positions are indexed in a first index and the word position ranges are indexed in a second index. Thus, the analysis may be efficiently performed by intersecting the first and second indexes. In further embodiments, the word position ranges may be encoded in a space efficient format for storage or transmittal.

Description

CROSS-REFERENCES TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 60/811,626, filed Jun. 5, 2006 by Lindblad et al. and entitled “ELEMENT QUERY METHOD AND SYSTEM,” the disclosure of which is incorporated herein by reference for all purposes.
The present disclosure is related to the following commonly assigned, co-pending U.S. patent applications:
Ser. No. 10/462,100 (Attorney Docket No. 021512-00011US), entitled “SUBTREE-STRUCTURED XML DATABASE” (hereinafter “Lindblad I-A”);
Ser. No. 10/462,019 (Attorney Docket No. 021512-000210US), entitled “PARENT-CHILD QUERY INDEXING FOR XML DATABASES” (hereinafter “Lindblad II-A”);
Ser. No. 10/462,023 (Attorney Docket No. 021512-000310US), entitled “XML DB TRANSACTIONAL UPDATE SYSTEM” (hereinafter “Lindblad III-A”); and
Ser. No. 10/461,935 (Attorney Docket No. 021512-000410US), entitled “XML DATABASE MIXED STRUCTURAL-TEXTUAL CLASSIFICATION SYSTEM” (hereinafter “Lindblad IV-A”).
The respective disclosures of these applications are incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

The present invention relates in general to databases, and in particular to various embodiments of query operations associated with database systems.
Extensible Markup Language (XML) is a restricted form of SGML, the Standard Generalized Markup Language defined in ISO 8879, and represents one form of structuring data. XML is more fully described in “Extensible Markup Language (XML) 1.0 (Second Edition),” W3C Recommendation (6 Oct. 2000), which is incorporated herein by reference for all purposes [and available at http://www.w3.org/TR/2000/REC-xml-20001006] (hereinafter “XML Recommendation”). XML is a useful form of structuring data because it is an open format that is human-readable and machine-interpretable. Other structured languages without these features or with similar features might be used instead of XML, but XML is currently a popular structured language used to encapsulate (obtain, store, process, etc.) data in a structured manner.
An XML document has two parts: 1) a markup document and 2) a document schema. The markup document and the schema are made up of storage units called “elements,” which can be nested to form a hierarchical structure. An example of an XML markup document 10 is shown in FIG. 1. Document 10 (at least the portions shown) contains data for one “citation” element. The “citation” element has within it a “title” element, an “author” element, and an “abstract” element. In turn, the “author” element has within it a “last” element (last name of the author) and a “first” element (first name of the author). Thus, an XML document comprises text organized in freely-structured outline form with tags indicating the beginning and end of each outline element. A tag is delimited with angle brackets surrounding the tag's name; a closing tag is distinguished from an opening tag by a forward slash immediately after the initial angle bracket.
Elements can contain either parsed or unparsed data. Only parsed data is shown for document 10. Unparsed data is made up of arbitrary character sequences. Parsed data is made up of characters, some of which form character data and some of which form markup. The markup encodes a description of the document's storage layout and logical structure. XML elements can have associated attributes, in the form of name-value pairs, such as the publication date attribute of the “citation” element. The name-value pairs appear within the angle brackets of an XML tag, following the tag name.
XML schemas specify constraints on the structures and types of elements and attribute values in an XML document. The basic schema for XML is the XML Schema, described in “XML Schema Part 1: Structures,” W3C Working Draft (24 Sep. 1999), which is incorporated herein by reference for all purposes [and available at http://www.w3.org/TR/1999/WD-xmlschema-1-19990924]. A previous and very widely used schema format is the DTD (Document Type Definition), which is described in the XML Recommendation.
Since XML documents are typically in text format, they can be searched using conventional text search tools. However, such tools may ignore the information content provided by the structure of the document, which is one of the key benefits of XML. Several query languages have been proposed for searching and reformatting XML documents that do consider the XML documents as structured documents. One such language is XQuery, described in “XQuery 1.0: An XML Query Language,” W3C Working Draft (23 Jan. 2007), which is incorporated herein by reference for all purposes [and available at http://www.w3.org/TR/XQuery]. An example of a general form for an XQuery query is shown in FIG. 2. Note that the ellipses at line [03] indicate the possible presence of any number of additional namespace prefix to URI mappings, the ellipses at line [16] indicate the possible presence of any number of additional function definitions, and the ellipses at line [22] indicate the possible presence of any number of additional FOR or LET clauses.
XQuery is derived from an XML query language called Quilt [described at http://www.almaden.ibm.com/cs/people/chamberlin/quilt.html], which in turn borrowed features from several other languages, including XPath 1.0 [described at http://www.w3.org/TR/XPath.html], XQL [described at http://www.w3.org/TandS/QL/QL98/pp/xql.html], XML-QL [described at http://www.research.att.com/~mff/files/final.html], and OQL.
Query languages predated the development of XML and many relational databases use a standardized query language called SQL, as described in ISO/IEC 9075-1:1999. The SQL language has established itself as the lingua franca for relational database management and provides the basis for systems interoperability, application portability, client/server operation, and distributed databases. XQuery is proposed to fulfill a similar role with respect to XML database systems. As XML becomes the standard for information exchange between peer data stores, and between client visualization tools and data servers, XQuery may become the standard method for storing and retrieving data from XML databases.
With SQL query systems, much work has been done on the issue of efficiency, such as how to process a query, retrieve a matching result set, and present the result set to the human or computer query issuer quickly and with efficient use of resources. As XQuery and other tools are relied on more and more for querying XML documents, query efficiency will be more essential.
One type of query that is not efficiently handled by current XML database systems is determining the position of a word, phrase, or element relative to another element in an XML document. This type of query is referred to herein as an “element query.” For instance, an exemplary element query might be whether a particular word (e.g., “cat”) is contained within (i.e., is a descendant of) a particular element (e.g., “”) in a document. Another exemplary element query might be whether the word “cat” is contained within a nested element of (i.e., is an indirect descendant of) the element “.”
One approach to processing element queries such as those described above involves searching a database of XML documents based on a keyword index to find a result set of documents containing the word “cat,” and then linearly scanning each document in the result set for instances of “cat” within the element “.” However this approach is both time consuming and resource-intensive, particularly if the database contains many documents, and/or if each document is large.
Another prior art approach is known as “fielded search.” With fielded search, a limited number of elements or “fields” of an XML document are identified prior to being ingested into the database, and special purpose indexes are built at ingestion time to facilitate searching of words within those specific elements. While this approach addresses the performance problems of linear searching, it has at least three significant limitations. First, the fields of an XML document to be indexed must be determined prior to ingestion. Thus, the database administrator (or other party responsible for ingestion) must anticipate what fields will likely be searched by users. Second, for pragmatic reasons, the number of such predetermined fields will likely be limited. Finally, the content of each field is indexed as completely “flat” content. Thus, information about elements nested within other elements will be lost. For example, consider the XML construct “<title>MyBigDiscovery</title>.” If the element “<title>” is chosen to be indexed under fielded search then (depending on the implementation) either (1) only the words “My” and “Discovery” will be indexed against “<title>,” meaning that the word “Big” is completely lost, or (2) “My,” “Big,” and “Discovery” will all be indexed against “<title>,” meaning that the context of the word “Big” within the nested element “” is lost. As can be seen, with fielded search some aspect of content or context is lost.

BRIEF SUMMARY OF THE INVENTION

Embodiments of the present invention address the foregoing and other such problems by providing methods, systems, and computer-readable media for representing and querying positional information about hierarchical documents. Specifically, embodiments of the present invention provide representational schemes and techniques for processing element queries efficiently and without prior knowledge about which fields of a document to index. Various embodiments also enable the processing of element queries that require knowledge about the hierarchical structure of elements within a document.
According to one embodiment of the present invention, a computer-implemented method for representing word positions in a hierarchical document comprises receiving a hierarchical document comprising a plurality of words and a plurality of elements, at least one word in the plurality of words being a descendant of at least one element in the plurality of elements. The method further comprises associating at least one word in the plurality of words with one or more word positions, the word positions indicating positions of the word relative to other words within the hierarchical document, and associating at least one element in the plurality of elements with one or more word position ranges, each word position within the one or more word position ranges corresponding to a word in the plurality of words that is a descendant of the element.
In various embodiments, at least one element in the plurality of elements has zero descendants, and is associated with a word position range indicating a range of zero word positions. In further embodiments, a first element in the plurality of elements is a descendant of a second element in the plurality of elements, and one or more word position ranges associated with the second element indicate a range of word positions subsumed by the first element.
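By way of illustration only, the association of words with word positions and of elements with word position ranges might be sketched in Python as follows. The function name `index_positions`, the use of zero-based positions, the half-open `(start, end)` range convention, and the simple `\w+` tokenizer are all assumptions made for this sketch and are not taken from the disclosure. An element with no descendant words receives a range spanning zero word positions.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_positions(xml_text):
    """Return (word_positions, element_ranges) for one document.

    word_positions: word -> list of zero-based word positions
    element_ranges: element tag -> list of half-open (start, end) ranges
    Both conventions are assumptions of this sketch.
    """
    root = ET.fromstring(xml_text)
    word_positions = defaultdict(list)
    element_ranges = defaultdict(list)
    counter = 0  # next unassigned word position

    def add_words(text):
        nonlocal counter
        for word in re.findall(r"\w+", text or ""):
            word_positions[word.lower()].append(counter)
            counter += 1

    def visit(elem):
        start = counter
        add_words(elem.text)
        for child in elem:
            visit(child)
            # Text following a child still belongs to the enclosing element.
            add_words(child.tail)
        # The range covers every word that is a descendant of elem.
        element_ranges[elem.tag].append((start, counter))

    visit(root)
    return dict(word_positions), dict(element_ranges)


doc = "<citation><title>My <i>Big</i> Discovery</title><note></note></citation>"
words, ranges = index_positions(doc)
print(words)
print(ranges)
```

In this sketch the range recorded for an outer element subsumes the ranges of its nested elements, so the nesting of elements can be recovered by comparing ranges.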
According to another embodiment of the present invention, a computer-implemented method for determining whether one or more words are descendants of an element in a hierarchical document comprises retrieving word positions for the one or more words, retrieving one or more word position ranges for the element, and processing the word positions for the one or more words and the one or more word position ranges for the element to determine whether the one or more words are descendants of the element. In various embodiments, the processing may comprise determining whether the one or more words are direct descendants of the element, and/or determining whether the one or more words are indirect descendants of the element.
In one set of embodiments, the word positions for the one or more words may be indexed in a first database index, and the word position ranges for the element may be indexed in a second database index. In these embodiments, the first and second indexes may be intersected to determine whether the one or more words are descendants of the element.
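As a minimal sketch of such an intersection, the following Python fragment treats the first index as a mapping from words to sorted word positions and the second index as a mapping from element names to half-open `(start, end)` word position ranges. The index layouts, the function names `positions_in_ranges` and `word_in_element`, and the sample data are assumptions for illustration, not the disclosed implementation. The sketch answers only whether a word is a descendant at any depth; it does not distinguish direct from indirect descendants.

```python
from bisect import bisect_left


def positions_in_ranges(positions, ranges):
    """Return the positions that fall inside any half-open (start, end) range.

    `positions` must be sorted ascending; `ranges` may be in any order
    and may nest.
    """
    hits = set()
    for start, end in ranges:
        lo = bisect_left(positions, start)  # first position >= start
        hi = bisect_left(positions, end)    # first position >= end
        hits.update(positions[lo:hi])
    return sorted(hits)


def word_in_element(word, tag, word_index, element_index):
    """True if any occurrence of `word` lies within an occurrence of `tag`."""
    positions = word_index.get(word, [])
    ranges = element_index.get(tag, [])
    return bool(positions_in_ranges(positions, ranges))


# Hypothetical index contents for a single document.
word_index = {"my": [0], "big": [1], "discovery": [2], "cat": [5, 9]}
element_index = {"title": [(0, 3)], "i": [(1, 2)], "abstract": [(3, 8)]}

print(word_in_element("cat", "abstract", word_index, element_index))
print(word_in_element("my", "i", word_index, element_index))
```

Because each range is resolved with two binary searches, the cost in this sketch grows with the number of ranges and the logarithm of the number of positions, rather than with the size of the document.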
According to another embodiment of the present invention, the word position ranges for an element in a hierarchical document are encoded using a space-efficient format, such as delta encoding.
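One possible form of delta encoding is sketched below in Python. It assumes the ranges for an element are sorted, non-overlapping, and half-open, and it stores each range as the gap from the end of the previous range followed by the range length. This particular layout and the names `delta_encode` and `delta_decode` are assumptions for illustration; the disclosure does not fix a layout at this point.

```python
def delta_encode(ranges):
    """Encode sorted, non-overlapping (start, end) ranges as small integers."""
    deltas = []
    prev_end = 0
    for start, end in ranges:
        deltas.append(start - prev_end)  # gap since the previous range ended
        deltas.append(end - start)       # number of word positions in this range
        prev_end = end
    return deltas


def delta_decode(deltas):
    """Invert delta_encode."""
    ranges = []
    prev_end = 0
    for i in range(0, len(deltas), 2):
        start = prev_end + deltas[i]
        end = start + deltas[i + 1]
        ranges.append((start, end))
        prev_end = end
    return ranges


ranges = [(3, 8), (120, 125), (1000, 1004)]
encoded = delta_encode(ranges)
print(encoded)  # [3, 5, 112, 5, 875, 4]
assert delta_decode(encoded) == ranges
```

The encoded values tend to be smaller than the absolute positions, which makes them cheaper to store with a variable-width integer code.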
According to yet another embodiment of the present invention, a database system is disclosed. The database system comprises a database configured to store a plurality of hierarchical documents, where a first hierarchical document in the plurality of hierarchical documents comprises a plurality of words and a plurality of elements, at least one word in the plurality of words being a descendant of at least one element in the plurality of elements. At least one word in the plurality of words is associated with one or more word positions, the word positions indicating positions of the word relative to other words within the first hierarchical document, and at least one element in the plurality of elements is associated with one or more word position ranges, each word position within the one or more word position ranges corresponding to a word in the plurality of words that is a descendant of the element.
In various embodiments, the database system further comprises a query engine configured to receive a query, and to return a result responsive to the query by analyzing one or more of the word positions and one or more of the word position ranges. In one embodiment, the query is whether one or more words in the first hierarchical document are direct descendants of a particular element in the first hierarchical document. In an alternative embodiment, the query is whether one or more words in the first hierarchical document are indirect descendants of a particular element in the first hierarchical document.
According to yet another embodiment of the present invention, a computer program product embedded in a computer readable medium comprises program code for receiving a hierarchical document comprising a plurality of words and a plurality of elements, at least one word in the plurality of words being a descendant of at least one element in the plurality of elements. The computer program product further comprises program code for associating at least one word in the plurality of words with one or more word positions, the word positions indicating positions of the word relative to other words within the hierarchical document, and program code for associating at least one element in the plurality of elements with one or more word position ranges, each word position within the one or more word position ranges corresponding to a word in the plurality of words that is a descendant of the element.
A further understanding of the nature and the advantages of the embodiments disclosed herein may be realized by reference to the remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments in accordance with the present invention will be described with reference to the drawings, in which:
FIG. 1 is an illustration of a conventional XML document.
FIG. 2 is an illustration of a conventional XQuery query.
FIG. 3 is an illustration of a simple XML document including text and markup that may be used in accordance with one embodiment of the present invention.
FIGS. 4A and 4B are schematic representations of the XML document shown in FIG. 3. FIG. 4A illustrates a complete representation of the XML document and FIG. 4B illustrates a subtree of the XML document.
FIG. 5 is a more concise schematic representation of an XML document.
FIGS. 6A and 6B illustrate a portion of an XML document that includes tags with attributes. FIG. 6A shows the portion in XML format; FIG. 6B is a schematic representation of that portion in graphical form.
FIG. 7 shows a more complex example of an XML document, having attributes and varying levels.
FIG. 8 is a schematic representation of the XML document shown in FIG. 7, omitting data nodes.
FIG. 9 illustrates a decomposition of the XML document illustrated in FIGS. 7-8 in accordance with an embodiment of the present invention.
FIG. 10 illustrates the decomposition of FIG. 9 with the addition of link nodes in accordance with an embodiment of the present invention.
FIG. 11 is a detail of a link node structure from the decomposition illustrated in FIG. 10 in accordance with an embodiment of the present invention.
FIG. 12A is a block diagram representing elements of a subtree data structure in accordance with an embodiment of the present invention.
FIG. 12B is a simplified block diagram of elements of a data structure for storing atom data in accordance with an embodiment of the present invention.
FIG. 13 is a simplified block diagram of a database system in accordance with an embodiment of the present invention.
FIG. 14 is a simplified block diagram of a parser for a database system in accordance with an embodiment of the present invention.
FIG. 15 is a block diagram showing elements of a database in accordance with an embodiment of the present invention.
FIG. 16 is a flowchart illustrating a method of representing positional information in a hierarchical document in accordance with an embodiment of the present invention.
FIG. 17 is a flowchart illustrating a method of processing a first type of element query in accordance with an embodiment of the present invention.
FIG. 18 is a flowchart illustrating a method of processing a second type of element query in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.
Subtree Decomposition
In an embodiment of the present invention, an XML document (or other structured document) is parsed into “subtrees” for efficient handling. An example of an XML document and its decomposition is described in this section, with following sections describing apparatus, methods, structures and the like that might create and store subtrees. Subtree decomposition is explained with reference to a simple example, but it should be understood that such techniques are equally applicable to more complex examples.
FIG. 3 illustrates an XML document 30, including text and markup. FIG. 4A illustrates a schematic representation 32 of XML document 30, wherein schematic representation 32 is shown as a tree (a connected acyclic simple directed graph), with each node of the tree representing an element of the XML document or an element's content, attribute, value, etc.
In a convention used for the figures of the present application, directed edges are oriented from an initial node that is higher on the page than the edge's terminal node, unless otherwise indicated. Nodes are represented by their labels, often with their delimiters. Thus, the root node in FIG. 4A is a “citation” node represented by the label delimited with “<>”. Data nodes are represented by rectangles. In many cases, the data node will be a text string, but other data node types are possible. In many XML files, it is possible to have a tag with no data (e.g., where a sequence such as “<tag></tag>” exists in the XML file). In such cases, the XML file can be represented as shown in FIG. 4A but with some nodes representing tags being leaf nodes in the tree. The present invention is not limited by such variations, so to focus explanations, the examples here assume that each “tag” node is a parent node to a data node (illustrated by a rectangle) and a tag that does not surround any data is illustrated as a tag node with an out edge leading to an empty rectangle. Alternatively, the trees could just have leaf nodes that are tag nodes, for tags that do not have any data.
As used herein, “subtree” refers to a set of nodes with a property that one of the nodes is a root node and all of the other nodes of the set can be reached by following edges in the orientation direction from the root node through zero or more non-root nodes to reach that other node. A subtree might contain one or more overlapping nodes that are also members of other “inner” or “lower” subtrees; nodes beyond a subtree's overlapping nodes are not generally considered to be part of that subtree. The tree of FIG. 4A could be a subtree, but the subtree of FIG. 4B is more illustrative in that it is a proper subset of the tree illustrated in FIG. 4A.
To simplify the following description and figures, single letter labels will be used, as in FIG. 5. Note that even with the shortened tags, tree 35 in FIG. 5 represents a document that has essentially the same structure as the document represented by the tree of FIG. 4A.
Some nodes may contain one or more attributes, which can be expressed as (name, value) pairs associated with nodes. In graph theory terms, the directed edges come in two flavors, one for a parent-child relationship between two tags or between a tag and its data node, and one for linking a tag with an attribute node representing an attribute of that tag. The latter is referred to herein as an “attribute edge”. Thus, adding an attribute (key, value) pair to an XML file would map to adding an attribute edge and an attribute node, followed by an attribute value node to a tree representing that XML file. A tag node can have more than one attribute edge (or zero attribute edges). Attribute nodes have exactly one descendant node, a value node, which is a leaf node and a data node, the value of which is the value from the attribute pair.
In the tree diagrams used herein, attribute edges sometimes are distinguished from other edges in that the attribute name is indicated with a preceding “@”. FIG. 6A illustrates a portion of XML markup wherein a tag T has an attribute name of “K” and a value of “V”. FIG. 6B illustrates a portion of a tree that is used to represent the XML markup shown in FIG. 6A, including an attribute edge 36, an attribute node 37 and a value node 38. In some instances, tag nodes and attribute nodes are treated the same, but at other times they are treated differently. To easily distinguish tag nodes and attribute nodes in the illustrated trees, tag nodes are delimited with surrounding angle brackets (“<>”), while attribute nodes are delimited with an initial “@”.
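The node conventions described above (tag nodes, attribute nodes with a single value node, and data nodes) might be modeled as in the following Python sketch. The `Node` class, its `kind` strings, and the `build_tree` function are hypothetical names introduced for this example; the sample markup merely resembles the tag T with attribute K and value V described for FIG. 6A.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str   # "tag", "attribute", or "data" (labels assumed for this sketch)
    label: str
    children: list = field(default_factory=list)


def build_tree(elem):
    """Convert a parsed XML element into tag, attribute, and data nodes."""
    node = Node("tag", f"<{elem.tag}>")
    # Each attribute becomes an attribute node with exactly one value node.
    for name, value in elem.attrib.items():
        node.children.append(Node("attribute", f"@{name}", [Node("data", value)]))
    text = (elem.text or "").strip()
    # A tag with no content still gets an (empty) data node, per the convention above.
    if text or len(elem) == 0:
        node.children.append(Node("data", text))
    for child in elem:
        node.children.append(build_tree(child))
        tail = (child.tail or "").strip()
        if tail:
            node.children.append(Node("data", tail))
    return node


tree = build_tree(ET.fromstring('<T K="V">text</T>'))
for child in tree.children:
    print(child.kind, child.label)
```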
FIG. 7 et seq. illustrate a more complex example, with multiple levels of tags, some having attributes. FIG. 7 shows a multi-level XML document 40. As is explained later below, FIG. 7 also includes indications 42 of where multi-level XML document 40 might be decomposed into smaller portions. FIG. 8 illustrates a tree 50 that schematically represents multi-level XML document 40 (with data nodes omitted).
FIG. 9 shows one decomposition of tree 50 with subtree borders 52 that correspond to indications 42. Each subtree border 52 defines a subtree; each subtree has a subtree root node and zero or more descendant nodes, and some of the descendant nodes might in turn be subtree root nodes for lower subtrees. In this example, the decomposition points are entirely determined by tag labels (e.g., each tag with a label “c” becomes a root node for a separate subtree, with the original tree root node being the root node of a subtree extending down to the first instances of tags having tag labels “c”). In other examples, decomposition might be done using a different set of rules. For example, the decomposition rules might be to break at either a “c” tag or an “f” tag, break at a “d” tag when preceded by an “r” tag, etc. Decomposition rules need not be specific to tag names, but can specify breaks upon occurrence of other conditions, such as reaching a certain size of subtree or subtree content. Some decomposition rules might be parameterized where parameters are supplied by users and/or administrators (e.g., “break whenever a tag is encountered that matches a label the user specifies”, or more generally, when a user-specified regular expression or other condition occurs).
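A label-based decomposition rule of this kind can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function `decompose`, the `break_tags` parameter, and the sample document are hypothetical, each subtree is reduced to a list of tag names, and a node that starts a new subtree is recorded only in the lower subtree (the overlap and link nodes discussed with reference to FIGS. 9 and 10 are not modeled).

```python
import xml.etree.ElementTree as ET


def decompose(root, break_tags):
    """Split a tree into subtrees, starting a new subtree at each break tag.

    Returns a list of subtrees in document order; each subtree is the list
    of tag names it contains, beginning with its subtree root.
    """
    subtrees = []

    def collect(elem, nodes):
        nodes.append(elem.tag)
        for child in elem:
            if child.tag in break_tags:
                start_subtree(child)   # child becomes the root of a lower subtree
            else:
                collect(child, nodes)  # child stays in the current subtree

    def start_subtree(elem):
        nodes = []
        subtrees.append(nodes)
        collect(elem, nodes)

    start_subtree(root)
    return subtrees


doc = ET.fromstring("<a><b><c><d/><c><e/></c></c><c><f/></c></b></a>")
print(decompose(doc, {"c"}))  # [['a', 'b'], ['c', 'd'], ['c', 'e'], ['c', 'f']]
```

Passing a different set, such as `{"c", "f"}`, corresponds to the alternative rule of breaking at either a “c” tag or an “f” tag.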
Note from FIG. 9 that subtrees overlap. In a subtree decomposition process, such as one prior to storing subtrees in a database or processing subtrees, it is often useful to have nonoverlapping subtree borders. Assume that two subtrees overlap as they both include a common node (specifically, the subtree root node). The subtree that contains the common node and parent(s) of the common node is referred to herein as the upper overlapping subtree, while the subtree that contains the common node and child(ren) of the common node is referred to herein as the lower overlapping subtree.
FIG. 10 illustrates one approach to providing nonoverlapping subtrees, namely by introducing the construct of link nodes 60. For each common node, an upper link node is added to the upper subtree and a lower link node is added to the lower subtree. These link nodes are shown in the figures by squares. The upper link node contains a pointer to the lower link node, which in turn contains a pointer to the root node of the lower overlapping subtree (which was the common node), while the lower link node contains a pointer to the upper link node, which in turn contains a pointer to the parent node of what was the common node. Each link node might also hold a copy of the other link node's label possibly along with other information. Thus, the upper link node may hold a copy of the lower subtree's root node label and the lower link node may hold a copy of the upper subtree's node label for the parent of what was the common node.
The pointer in a link node advantageously does not reference the other link node specifically; instead the pointer advantageously references the subtree in which the other link node can be found. FIG. 11 illustrates contents of the link nodes for two of the subtrees (labeled 100 and 102) of FIG. 10. Upper link node 104 of subtree 100 contains a target node label (‘c’) and a pointer to a target location that stores an identifier of subtree 102, which does not precisely identify lower link node 106. Similarly, lower link node 106 contains a target node label (‘b’) and a pointer to a target location that stores an identifier of subtree 100, which does not precisely identify upper link node 104.
Navigation from lower link node 106 to upper link node 104 (and vice versa) is nevertheless possible. For instance, the target location of lower link node 106 can be used to obtain a data structure for subtree 100 (an example of such a data structure is described below). The data structure for subtree 100 includes all seven of the nodes shown for subtree 100 in FIG. 10. Two of these are link nodes (labeled 60 in FIG. 10) that contain the target node label ‘c.’ These nodes, however, are distinguishable because their target location pointers point to different subtrees. Thus, the correct target node 104 for lower link node 106 can be identified by searching for a link node in subtree 100 whose target location is subtree 102. Similarly, the correct target node 106 for upper link node 104 can also be found by a search in subtree 102, enabling navigation in the other direction. Searching can be made highly efficient, e.g., by providing a hash table in subtree 100 that accepts a subtree identifier (e.g., for subtree 102) and returns the location of the link node that references that subtree.
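The navigation just described might be sketched in Python as below, with a per-subtree dictionary standing in for the hash table. The classes `LinkNode` and `Subtree`, the function `partner`, and the subtree identifier 103 used for the second ‘c’ link are assumptions introduced for this sketch; only subtrees 100 and 102 and the labels ‘b’ and ‘c’ come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class LinkNode:
    target_label: str    # copy of the label found on the other side of the link
    target_subtree: int  # identifier of the subtree holding the partner link node


@dataclass
class Subtree:
    subtree_id: int
    # Hash table: target subtree identifier -> link node referencing that subtree.
    links: dict = field(default_factory=dict)

    def add_link(self, link):
        self.links[link.target_subtree] = link


def partner(link, source_subtree_id, store):
    """Find the link node on the other side of `link`.

    `store` maps subtree identifiers to Subtree objects. The partner is the
    link node in the target subtree whose target is the source subtree.
    """
    target = store[link.target_subtree]
    return target.links[source_subtree_id]


upper_104 = LinkNode(target_label="c", target_subtree=102)
other_upper = LinkNode(target_label="c", target_subtree=103)  # hypothetical second link
lower_106 = LinkNode(target_label="b", target_subtree=100)

subtree_100 = Subtree(100)
subtree_100.add_link(upper_104)
subtree_100.add_link(other_upper)
subtree_102 = Subtree(102)
subtree_102.add_link(lower_106)
store = {100: subtree_100, 102: subtree_102}

assert partner(lower_106, 102, store) is upper_104
assert partner(upper_104, 100, store) is lower_106
```

Because each link stores only a subtree identifier, nodes inside either subtree can be moved or added without rewriting the link node on the other side.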
Using a reference scheme that connects a link node to a target subtree (rather than to a particular node within the target subtree) makes lower link node 106 insensitive to changes in subtree 100. For instance, a new node may be added to subtree 100, causing the storage location of upper link node 104 to change. Lower link node 106 need not be modified; it can still reference subtree 100 and be able to locate upper link node 104. Likewise, upper link node 104 is insensitive to changes in subtree 102 that might affect the location of lower link node 106. This increases the modularity of the subtree structure. Subtree 100 can be modified without affecting link node 106 as long as link node 104 is not deleted. (If link node 104 is deleted, then subtree 102 is likely to be deleted as well.) Similarly, subtree 102 can be modified without affecting link node 104; if subtree 102 is deleted, then link node 104 will likely be deleted as well. Handling subtree updates that affect other subtrees is described in detail in Lindblad III-A.
It should be noted that this indirect indexing approach is reliable as long as cyclic connections between subtrees are not allowed, i.e., as long as subtree 100 has only one node that connects to subtree 102 and vice versa. Those of ordinary skill in the art will appreciate that non-circularity is an inherent feature of XML and numerous other structured document formats.
Subtree Data Structure
Each subtree can be stored as a data structure in a storage area (e.g., in memory or on disk), preferably in a contiguous region of the storage area. FIG. 12A illustrates an example of a data structure 1200 for storing subtree 102 of FIG. 10. In general, any subtree can be stored using a data structure similar to that of FIG. 12A.
In FIG. 12A, the following notational conventions are used: field(0:n-1): v describes a fixed-width n-bit field named ‘field’ and storing a value corresponding to ‘v’ (which might be an encoded version of v; examples are described below), and [field] describes a variable bit width field encoded using a unary-log-log encoding. The unary-log-log encoding represents an integer value N as follows: (a) compute the number of bits, log₂(N), needed to represent the integer N; (b) compute the number of bits, log₂(log₂(N)), needed to represent log₂(N); (c) encode the integer as log₂(log₂(N)) in unary, i.e., a sequence of log₂(log₂(N)) bits all equal to 1 terminated by 0 (or similar coding), followed by the bits needed to actually represent log₂(N), followed by the bits actually needed to represent N. Text data values are generally stored in a format referred to herein as “CodedText,” in which the text string is parsed into one or more tokens and encoded as “[length], [atomID1], [atomID2], [atomID3], . . . ,” where the length is the unary-encoded length of the list of atomIDs, and each atomID is a code that corresponds to one of the tokens. Associations of atomIDs with specific tokens are provided by an atom data block 1214, which is shown in detail in FIG. 12B and described further below.
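One reading of the unary-log-log encoding is sketched below in Python, operating on strings of ‘0’ and ‘1’ characters for clarity. The sketch assumes that “log₂(N)” denotes the bit length of N (with zero treated as needing one bit), and that the bit length is itself written in exactly as many bits as the unary prefix announces. The function names `ull_encode` and `ull_decode` are hypothetical, and the exact bit layout used in the disclosed system may differ.

```python
def ull_encode(n):
    """Encode a non-negative integer as a unary-log-log bit string."""
    n_bits = max(n.bit_length(), 1)  # bits needed to represent n
    len_bits = n_bits.bit_length()   # bits needed to represent n_bits
    return (
        "1" * len_bits + "0"             # len_bits in unary, terminated by 0
        + format(n_bits, f"0{len_bits}b")  # n_bits in binary
        + format(n, f"0{n_bits}b")         # n in binary
    )


def ull_decode(bits, pos=0):
    """Decode one value starting at `pos`; return (value, next position)."""
    len_bits = 0
    while bits[pos] == "1":
        len_bits += 1
        pos += 1
    pos += 1  # skip the terminating 0 of the unary prefix
    n_bits = int(bits[pos:pos + len_bits], 2)
    pos += len_bits
    value = int(bits[pos:pos + n_bits], 2)
    return value, pos + n_bits


print(ull_encode(5))  # 11011101
assert ull_decode(ull_encode(5)) == (5, 8)
```

Since the decoder returns the position following the value, several such fields can be concatenated and read back in sequence without separators.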

As shown in FIG. 12A, the subtree data is organized into various blocks. Header block 1202 contains identifying information for the subtree. Ancestry block 1204 provides information about the ancestor nodes of the subtree, tracing back to the ultimate parent node of the XML document. As FIG. 10 shows, subtree 102 has four ancestor nodes (not counting the link nodes): the parent of the subtree root node <c> is node in subtree 102, whose parent is node <c>, whose parent is node in subtree 104, whose parent is the ultimate root node <a>. Node name block 1206 provides the tags (encoded as atomIDs) for the element nodes in subtree 102. Subtree size block 1208 indicates the number of various kinds of nodes in subtree 102. URI information block 1210 provides (using atomIDs) the URI of the XML document to which subtree 102 belongs. The remaining node blocks 1212(1)-1212(9) provide information about each node of the subtree: the type of node, a reference to the node's parent, and other parameters appropriate for the node type. It is to be understood that the number of node blocks may vary, depending on the number of given nodes in the subtree. More specific information about the various elements of subtree data structure 1200 is listed in Table 1 and data types for representative types of nodes are listed in Table 2.

TABLE 1


Subtree Elements

Block	Item	Description
Header	ordinal	Sequentially allocated node count for first node in subtree
	uri-key	Hash value of URI of the document containing the subtree
	unique-key	Random 64-bit key
	link-key	Random 64-bit key that is constant across saves.
	root-key	Hash subtree checksum
Ancestry	[ancestor-node-count]	Coded count of number of ancestors (can be an estimate)
	ancestor-key	Hash key of each ancestor subtree (repeated for each ancestor)
Node name	[node-name-count]	Coded number of QNames (a QName might be a namespace URI and a local name) element tags in the subtree
	[atomID]	Coded Atom ID of element QName (repeated for each element tag)
	[nsURI-atomID]	Coded Atom ID of element QName associated namespace (repeated for each element tag)
Subtree size	[subtree-node-count]	Coded total number of nodes of all types in the subtree
	[element-node-count]	Coded total number of element nodes in the subtree
	[attribute-node-count]	Coded total number of attribute nodes in the subtree
	[link-node-count]	Coded total number of link nodes in the subtree
	[doc-node-count]	Coded total number of doc nodes in the subtree
	[pi-node-count]	Coded total number of processing instruction nodes in the subtree
	[namespace-node-count]	Coded total number of namespace nodes in the subtree
	[text-node-count]	Coded total number of text nodes in the subtree
URI info	[uri-atom-count]	Coded count of tokens in the document URI
	[uri-atom-id]	Coded Atom ID(s) of each token of the document URI
Node	node-kind	See Table 2; one of: elem, attr, text, link, doc, PI, ns, comment, etc.
	[parent-offset]	Coded implicitly negative offset (base 1) to parent
	data element(s)	The content of the data element(s) depends on the kind of node (specified by the node-kind field). Table 2 lists some data element types that might be used. This can comprise textual representation of the data as a compressed list of Atom IDs of the content of the element.

TABLE 2


Data Element Types for Subtree Nodes

Node Type	Data Field	Description
elem	[qnameID]	Coded element QName Atom ID
attr	[qnameID]	Coded attribute QName Atom ID
	CodedText	Coded text representing the attribute's value
text	CodedText	Coded text representing the text node value
PI	[PI-target-atomID]	Processing Instruction (typically opaque to the XQE XML database)
	CodedText	Coded Atom ID of PI target
	CodedText	Coded text of PI
link	link-key	Link to parent/child subtree; bi-directional
	[qnameID]	Coded QName Atom ID of link-key target
	[node-count]	Coded initial ordinal for subtree nodes [?????]
comment	CodedText	Coded text of comment
docnode	CodedText	Coded text of docnode uri
ns	[delta-ordinal]	Coded ordinal of element containing the ns decl, delta from last ns-decl
	[offset]	Coded offset in namespace list of preceding namespace node
	[prefix-atomID]	Coded Atom ID of namespace prefix
	[nsURI-atomID]	Coded Atom ID of namespace URI

It should be noted that each link node (such as described above with reference to FIG. 11) has a corresponding node block in the subtree data structure 1200; e.g., node block 1212(1) describes a link node, as indicated by the node-kind (‘link’). For the link node, the stored data includes a link-key element, a qnameID element, and a node-count element. The link-key element provides the reference to the subtree that contains the target node; for instance, value (v2) stored in the link key of node block 1212(1) may correspond to the link-key element that is stored in a header block 1202 of a different subtree data structure that contains the target node. As noted in Table 1, the link-key element is defined so as to be constant across saves, making it a reliable identifier of the target subtree. Other identifiers could also be used. The qnameID element of node block 1212(1) stores (as an atomID) the QName of the target of the link identified by the link-key element. The QName might be just the tag label or a qualified version thereof (e.g., with a namespace URI prepended).
In the case where link node block 1212(1) corresponds to link node 106 of FIG. 11, the link-key value v2 identifies a data structure for subtree 100, and the qnameID corresponds to ‘b’. The node-count encodes an initial ordinal for the subtree nodes. Similar node blocks can be provided for nodes that link to child subtrees. In this manner, the connections between subtrees are reflected in the data structure.
As shown in FIG. 12A and Table 1, every node, regardless of its node-kind, includes a parent-offset element. This element represents the relationship between nodes in a unidirectional manner by providing, for each node, a way of identifying which node is its parent. For example, the value of a parent-offset element might be a byte offset reflecting the location of the parent node block within the data structure relative to the current node block. For link nodes whose parents are not in the subtree, a value of 0 can be used, as in block 1212(1). In the case of XML input documents, the byte offset can be implicitly negative as long as nodes appear in the data structure in the order they occur in the document, because the parent node will always precede the child. In other document formats or subtree data structures, parents might occur after the child and positive offsets would be allowed. In general, the node blocks may be placed in any order within data structure 1200, as long as the parent-offset values correctly reflect the hierarchical relationship of the nodes.
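By way of illustration only, navigation via the parent-offset element might be sketched in Python as follows. For simplicity the sketch assumes the offset counts node blocks rather than bytes, and the NodeBlock class, its fields, and the sample nodes are hypothetical; only the conventions that the offset is implicitly negative and that 0 marks a parent outside the subtree come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeBlock:
    kind: str            # e.g. 'link', 'elem', 'text'
    name: str            # tag label or text, for readability only
    parent_offset: int   # 0: parent not in this subtree; k > 0: parent is k blocks earlier


def parent_index(nodes: List[NodeBlock], i: int) -> Optional[int]:
    """Return the index of node i's parent, or None if it lies outside the subtree."""
    offset = nodes[i].parent_offset
    return None if offset == 0 else i - offset   # offset is implicitly negative


def ancestors(nodes: List[NodeBlock], i: int) -> List[int]:
    """Walk parent offsets from node i up to the top of the subtree."""
    chain = []
    p = parent_index(nodes, i)
    while p is not None:
        chain.append(p)
        p = parent_index(nodes, p)
    return chain


# Hypothetical subtree stored in document order, so parents precede children.
nodes = [
    NodeBlock("link", "b", 0),
    NodeBlock("elem", "c", 1),
    NodeBlock("elem", "d", 1),
    NodeBlock("text", "hello", 1),
    NodeBlock("elem", "e", 3),
]
```

Here ancestors(nodes, 3) walks from the text node through 'd' and 'c' to the link node, using only backward offsets.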
Atom data block 1214 is shown in detail in FIG. 12B. In this embodiment, atom data block 1214 implements a token heap, i.e., a system for compactly storing large numbers of tokens. A given token is hashed to produce a hash key 1221 that is used as an index into a “table” array 1220, which is a fixed-width array. The atom value 1222 stored in the table array at the hash key index position represents a cursor (or offset) into four other arrays: indexVector 1224, hashesVector 1226, lcHashesVector 1228, and counts 1230. The offset stored at the atom index position in the (fixed-width) indexVector array 1224 represents an offset into the (variable-width) dataVector array 1232 where the actual token 1234 is stored along with one 8-bit byte of type information 1236; additional bits may also be provided for other uses. In this embodiment, the type of a token can be one of ‘s’ (space character), ‘p’ (punctuation character), or ‘w’ (word character); other types may also be supported. The atom value 1222 also indexes into the (fixed-width) hashesVector array 1226 and the (fixed-width) lcHashesVector array 1228. These two vector arrays are used as caches for token hash keys and lower-cased token hash keys, and are provided to facilitate indexing and/or search operations. The atom value 1222 also indexes into the counts array 1230, where token multiplicities are stored. That is, each token is stored uniquely (i.e., once per subtree) in the dataVector array 1232, but the count describing the number of times the token appeared in the subtree is stored in the counts array 1230. This avoids the necessity of having to access multiple subtrees to count occurrences every time such information is needed.
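By way of illustration only, a much-simplified token heap might be sketched in Python as follows. The sketch is an assumption-laden stand-in: a dictionary replaces the fixed-width table array, CRC-32 is an arbitrary choice of hash, collisions are handled by keeping a list of atoms per hash key, and the two hash-key cache arrays are omitted. The class and method names are hypothetical.

```python
import zlib
from typing import Dict, List


def token_type(token: str) -> str:
    """Classify a token as space ('s'), word ('w'), or punctuation ('p')."""
    if token.isspace():
        return "s"
    if token.isalnum():
        return "w"
    return "p"


class TokenHeap:
    def __init__(self) -> None:
        self.table: Dict[int, List[int]] = {}   # hash key -> atom IDs (stand-in for the table array)
        self.index_vector: List[int] = []       # atom ID -> offset into data_vector
        self.data_vector = bytearray()          # one type byte followed by the token bytes
        self.counts: List[int] = []             # atom ID -> occurrences in this subtree

    def intern(self, token: str) -> int:
        """Record one occurrence of token and return its atom ID."""
        key = zlib.crc32(token.encode("utf-8"))
        for atom in self.table.get(key, []):
            if self.token(atom) == token:        # already stored once; just count it
                self.counts[atom] += 1
                return atom
        atom = len(self.index_vector)
        self.index_vector.append(len(self.data_vector))
        self.data_vector += token_type(token).encode("ascii") + token.encode("utf-8")
        self.counts.append(1)
        self.table.setdefault(key, []).append(atom)
        return atom

    def _span(self, atom: int):
        start = self.index_vector[atom]
        nxt = atom + 1
        end = self.index_vector[nxt] if nxt < len(self.index_vector) else len(self.data_vector)
        return start, end

    def token(self, atom: int) -> str:
        start, end = self._span(atom)
        return self.data_vector[start + 1:end].decode("utf-8")

    def type_of(self, atom: int) -> str:
        return chr(self.data_vector[self._span(atom)[0]])
```

Each distinct token occupies the data vector once, while the counts array answers "how many times did this token occur in the subtree" without rescanning the text.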
It will be appreciated that the data structure described herein for storing subtree data is illustrative and that variations and modifications are possible. Different fields and/or field names may be used, and not all of the data shown herein is required. The particular coding schemes (e.g., unary coding, atom coding) described herein need not be used; different coding schemes or unencoded data may be stored. The arrangement of data into blocks may also be modified without restriction, provided that it is possible to determine which nodes are associated with a particular subtree and to navigate hierarchically between subtrees. Further, as described below, subtree data can be found in scratch space, in memory and on disk, and implementation details of the subtree data structure, including the atom data substructure, may vary within the same embodiment, depending on whether an in-scratch, in-memory, or on-disk subtree is being provided.
Database Management System
System Overview
According to one embodiment of the invention, a computer database management system is provided that parses XML documents into subtree data structures (e.g., similar to the data structure described above), and updates the subtree data structures as document data is updated. The subtree data structures may also be used to respond to queries.
A typical XML handling system according to one embodiment of the present invention is illustrated in FIG. 13. As shown there, system 1300 processes XML (or other structured) documents 1302, which are typically input into the system as files, streams, references or other input or file transport mechanisms, using a data loader 1304. Data loader 1304 processes the XML documents to generate elements (referred to herein as “stands”) 1306 for an XML database 1308 according to aspects of the present invention. System 1300 also includes a query processor 1310 that accepts queries 1340 against structured documents, such as XQuery queries, and applies them against XML database 1308 to derive query results 1342.
System 1300 also includes parameter storage 1312 that maintains parameters usable to control operation of elements of system 1300 as described below. Parameter storage 1312 can include permanent memory and/or changeable memory; it can also be configured to gather parameters via calls to remote data structures. A user interface 1314 might also be provided so that a human or machine user can access and/or modify parameters stored in parameter storage 1312.
Data loader 1304 includes an XML parser 1316, a stand builder 1318, a scratch storage unit 1320, and interfaces as shown. Scratch storage 1320 is used to hold a “scratch” stand 1321 (also referred to as an “in-scratch stand”) while it is in the process of being built by stand builder 1318. Building of a stand is described below. After scratch stand 1321 is completed (e.g., when scratch storage 1320 is full), it is transferred to database 1308, where it becomes stand 1321′.
System 1300 might comprise dedicated hardware such as a personal computer, a workstation, a server, a mainframe, or similar hardware, or might be implemented in software running on a general purpose computer, either alone or in conjunction with other related or unrelated processes, or some combination thereof. In one example described herein, database 1308 is stored as part of a storage subsystem designed to handle a high level of traffic in documents, queries and retrievals. System 1300 might also include a database manager 1332 to manage database 1308 according to parameters available in parameter storage 1312.
System 1300 reads and stores XML schema data type definitions and maintains a mapping from document elements to their declared types at various points in the processing. System 1300 can also read, parse and print the results of XML XQuery expressions evaluated across the XML database and XML schema store.
Forests, Stands, and Subtrees
In the architecture described herein, XML database 1308 includes one or more “forests” 1322, where a forest is a data structure against which a query is made. In one embodiment, a forest 1322 encompasses the data of one or more XML input documents. Forest 1322 is a collection of one or more “stands” 1306, wherein each stand is a collection of one or more subtrees (as described above) that is treated as a unit of the database. The contents of a stand in one embodiment are described below. In some embodiments, physical delimitations (e.g., delimiter data) are present to delimit subtrees, stands and forests, while in other embodiments, the delimitations are only logical, such as by having a table of memory addresses and forest/stand/subtree identifiers, and in yet other embodiments, a combination of those approaches might be used.
In one implementation, a forest 1322 contains some number of stands 1306, and all but one of these stands resides in a persistent on-disk data store (shown as database 1308) as compressed read-only data structures. The last stand is an “in-memory” stand (not shown) that is used to re-present subtrees from on-disk stands to system 1300 when appropriate (e.g., during query processing or subtree updates). System 1300 continues to add subtrees to the in-memory stand as long as it remains less than a certain (tunable) size. Once the size limit is reached, system 1300 automatically flushes the in-memory stand out to disk as a new persistent (“on-disk”) stand.
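By way of illustration only, the flush behavior described above might be sketched in Python as follows. The Forest class and its attribute names are hypothetical, subtrees are modeled as opaque byte strings, and a persistent on-disk stand is simulated by an immutable tuple rather than by actual files.

```python
from typing import List, Tuple


class Forest:
    def __init__(self, max_in_memory_bytes: int) -> None:
        self.max_in_memory_bytes = max_in_memory_bytes   # the tunable size limit
        self.on_disk_stands: List[Tuple[bytes, ...]] = []  # read-only once written
        self.in_memory: List[bytes] = []                 # the single in-memory stand
        self.in_memory_size = 0

    def add_subtree(self, subtree: bytes) -> None:
        """Add a subtree to the in-memory stand, flushing once the limit is reached."""
        self.in_memory.append(subtree)
        self.in_memory_size += len(subtree)
        if self.in_memory_size >= self.max_in_memory_bytes:
            self.flush()

    def flush(self) -> None:
        """Persist the in-memory stand as a new stand and start an empty one."""
        if self.in_memory:
            self.on_disk_stands.append(tuple(self.in_memory))
            self.in_memory = []
            self.in_memory_size = 0
```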
Data Flow
Two main data flows into database 1308 are shown. The flow on the right shows XML documents 1302 streaming into the system through a pipeline comprising an XML parser 1316 and a stand builder 1318. These components identify and act upon each subtree as it appears in the input document stream, as described below. The pipeline generates scratch data structures (e.g., a stand 1320) until a size threshold is exceeded, at which point the system automatically flushes the in-memory data structures to disk as a new persistent on-disk stand 1306.
The flow on the left shows processing of queries. A query processor 1310 receives a query (e.g., XQuery query 1340), parses the query, optimizes it to minimize the amount of computation required to evaluate the query, and evaluates it by accessing database 1308. For instance, query processor 1310 advantageously applies a query to a forest 1322 by retrieving a stand 1306 from disk into memory, applying the query to the stand in memory, and aggregating results across the constituent stands of forest 1322; some implementations allow multiple stands to be processed in parallel. Results 1342 are returned to the user. One such query system could be the system described in Lindblad II-A.
Queries to query processor 1310 can come from human users, such as through an interactive query system, or from computer users, such as through a remote call instruction from a running computer program that uses the query results. In one embodiment, queries can be received and responded to using a hypertext transfer protocol (HTTP). It is to be understood that a wide variety of query processors can be used with the subtree-based database described herein, and a detailed description of a particular query processor is omitted as not being crucial to understanding the present invention.
Processing of input documents will now be described. FIG. 14 shows parser 1316 and stand builder 1318 in more detail. As shown, parser 1316 includes a tokenizer 1402 that parses documents into tokens according to token rules stored in parameter storage 1312. As the input documents are normally text, or can normally be treated as text, they can be tokenized by tokenizer 1402 into tokens, or more generally into “atoms.” The text tokenizer identifies the beginning and ending of tokens according to tokenizing rules. Often, but not always, words (e.g., characters delimited by white space or punctuation) are identified as tokens. Thus, tokenizer 1402 might scan input documents and look for word breaks as defined by a set of configurable parameters included in token rules 1404. Preferably, tokenizer 1402 is configurable, handles Unicode inputs and is extensible to allow for language-specific tokenizers.
Parser 1316 also includes a subtree finder 1406 that allocates nodes identified in the tokenized document to subtrees according to subtree rules 1408 stored in parameter storage 1312. In one embodiment, subtree finder 1406 allocates nodes to subtrees based on a subtree root element indicated by the subtree rules 1408. Thus, an XML document is divided into subtrees from matching subtree nodes down. For example, if an XML document including citations was processed and the subtree root element was set to “citation”, the XML document would be divided into subtrees each having a root node of “citation”. In other cases, the division of subtrees is not strictly by elements, but can be by subtree size or tree depth constraints, or a combination thereof or other criteria.
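By way of illustration only, division by a subtree root element might be sketched in Python using the standard xml.etree.ElementTree module. The function name and the sample document are hypothetical, and the sketch only locates the matching elements; it does not create link nodes for the remaining upper portion of the document or handle nested matches specially.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_subtree_roots(xml_text: str, subtree_root_tag: str) -> List[ET.Element]:
    """Return every element whose tag matches the configured subtree root."""
    root = ET.fromstring(xml_text)
    return list(root.iter(subtree_root_tag))


# Hypothetical input document containing two citations.
doc = (
    "<doc>"
    "<citation><title>A</title></citation>"
    "<p>x</p>"
    "<citation><title>B</title></citation>"
    "</doc>"
)
subtrees = find_subtree_roots(doc, "citation")
```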
Each subtree identified by subtree finder 1406 is provided to stand builder 1318, which includes a subtree analyzer 1410, a posting list generator 1412, and a key generator 1414. Subtree analyzer 1410 generates a subtree data structure (e.g., data structure 1200 of FIG. 12A), which is added to the stand. Posting list generator 1412 generates data related to the occurrence of tokens in a subtree (e.g., parent-child index data as described in Lindblad II-A), which is also added to the stand. Stand builder 1318 may also include other data generation modules, such as a classification quality generator (not shown), that generate additional information on a per-subtree or per-stand basis and are stored as the stand is constructed. Classification quality information that might be included in system 1300 is described in Lindblad IV-A.
As stand builder 1318 generates the various data structures associated with subtrees, it places them into scratch stand 1320, which acts as a scratch storage unit for building a stand. The scratch storage unit is flushed to disk when it exceeds a certain size threshold, which can be set by a database administrator (e.g., by setting a parameter in parameter storage 1312). In some implementations of data loader 1304, multiple parsers 1316 and/or stand builders 1318 are operated in parallel (e.g., as parallel processes or threads), but preferably each scratch storage unit is only accessible by one thread at a time.
Stand Structure
One example of a structure of an XML database used with the present invention is shown in FIG. 15. As illustrated there, database 1502 contains, among other components, one or more forest structures 1504.
Forest structure 1504 includes one or more stand structures 1506, each of which contains data related to a number of subtrees, as shown in detail for stand 1506. For example, stand 1506 may be a directory in a disk-based file system, and each of the blocks may be a file. Other implementations are also possible, and the description of “files” herein should be understood as illustrative and not limiting of the invention.
TreeData file 1510 includes the data structure (e.g., data structure 1200 of FIG. 12A) for each subtree in the stand. The subtree data structure may have variable length; to facilitate finding data for a particular subtree, a TreeIndex file 1512 is also provided. TreeIndex file 1512 provides a fixed-width array that, when provided with a subtree identifier, returns an offset within TreeData file 1510 corresponding to the beginning of the data structure for that subtree.
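By way of illustration only, the relationship between the TreeData and TreeIndex files might be sketched in Python as follows. The sketch assumes subtree identifiers are consecutive integers starting at zero, that offsets are stored as 64-bit unsigned values, and that a subtree's record ends where the next one begins; these details and the function names are hypothetical.

```python
from array import array
from typing import Iterable, Tuple


def build_tree_files(subtrees: Iterable[bytes]) -> Tuple[bytes, array]:
    """Concatenate variable-length subtree records and record each start offset."""
    tree_data = bytearray()
    tree_index = array("Q")                 # fixed-width array of 64-bit offsets
    for record in subtrees:
        tree_index.append(len(tree_data))   # offset of this subtree's first byte
        tree_data += record
    return bytes(tree_data), tree_index


def read_subtree(tree_data: bytes, tree_index: array, subtree_id: int) -> bytes:
    """Look up a subtree record by identifier using the fixed-width index."""
    start = tree_index[subtree_id]
    nxt = subtree_id + 1
    end = tree_index[nxt] if nxt < len(tree_index) else len(tree_data)
    return tree_data[start:end]
```

Because the index entries have a fixed width, the offset for any subtree identifier can be found with a single array access, regardless of how long the individual records are.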
ListData file 1514 contains information about the text or other data contained in the subtrees that is useful in processing queries. For example, in one embodiment, ListData file 1514 stores “posting lists” of subtree identifiers for subtrees containing a particular term (e.g., an atom), and ListIndex file 1516 is used to provide more efficient access to particular terms in ListData file 1514. Examples of posting lists and their creation are described in detail in Lindblad II-A, and a detailed description is omitted herein as not being critical to understanding the present invention.
Qualities file 1518 provides a fixed-width array indexed by subtree identifier that encodes one or more numeric quality values for each subtree; these quality values can be used for classifying subtrees or XML documents. Numeric quality values are optional features that may be defined by a particular application. For example, if the subtree store contained Internet web pages as XHTML, with the subtree units specified as the <HTML> elements, then the qualities block could encode some combination of the semantic coherence and inbound hyperlink density of each page. Further examples of quality values that could be implemented are described in Lindblad IV-A, and a detailed description is omitted herein as not being critical to understanding the present invention.
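By way of illustration only, posting lists of subtree identifiers might be sketched in Python as follows. This is a generic inverted-index sketch rather than the structure of the referenced application; the function names and the in-memory dictionary representation are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_posting_lists(subtree_terms: Dict[int, Iterable[str]]) -> Dict[str, List[int]]:
    """Map each term to the sorted list of subtree identifiers that contain it."""
    postings = defaultdict(set)
    for subtree_id, terms in subtree_terms.items():
        for term in terms:
            postings[term].add(subtree_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def subtrees_containing_all(postings: Dict[str, List[int]], terms: Iterable[str]) -> List[int]:
    """Intersect the posting lists of every requested term."""
    result = None
    for term in terms:
        ids = set(postings.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])
```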
Timestamps file 1520 provides a fixed-width array indexed by subtree identifier that stores two 64-bit timestamps indicating a creation and deletion time for the subtree. For subtrees that are current, the deletion timestamp may be set to a value (e.g., zero) indicating that the subtree is current. As described below, Timestamps file 1520 can be used to support modification of individual subtrees, as well as storing of archival information.
The next three files provide selected information from the data structure 1200 for each subtree in a readily-accessible format. More specifically, Ordinals file 1522 provides a fixed-width array indexed by subtree identifier that stores the initial ordinal for each subtree, i.e., the ordinal value stored in block 1202 of the data structure 1200 for that subtree; because the ordinal increments as every node is processed, the ordinals for different subtrees reflect the ordering of the nodes within the original XML document. URI-Keys file 1524 provides a fixed-width array indexed by subtree identifier that stores the URI key for each subtree, i.e., the uri-key value stored in block 1202 of the data structure 1200. Unique-Keys file 1526 provides a fixed-width array indexed by subtree identifier that stores the unique key for each subtree, i.e., the unique-key value stored in block 1202 of the data structure 1200. It should be noted that any of the information in the Ordinals, URI-Keys, and Unique-Keys files could also be obtained, albeit less efficiently, by locating the subtree in the TreeData file 1510 and reading its subtree data structure 1200. Thus, these files are to be understood as auxiliary files for facilitating access to selected, frequently used information about the subtrees. Different files and different combinations of data could also be stored in this manner.
Frequencies file 1528 stores a number of entries related to the frequency of occurrence of selected tokens, which might include all of the tokens in any subtrees in the stand or a subset thereof. In one embodiment, for each selected token, Frequencies file 1528 holds a count of the number of subtrees in which the token occurs.
It will be appreciated that the stand structure described herein is illustrative and that variations and modifications are possible. Implementation as files in a directory is not required; a single structured file or other arrangement might also be used. The particular data described herein is not required, and any other data that can be maintained on a per-subtree basis may also be included. Use of subtree data structure 1200 is not required; as described above, different subtree data structures may also be implemented.
Creation, Updating, and Deletion of Subtrees
As the stands of a forest are generated, processed and stored, they can be “log-structured”, i.e., each stand can be saved to a file system as a unit that is never edited (other than the timestamps file). To update a subtree, the old subtree is marked as deleted (e.g., by setting its deletion timestamp in Timestamps file 1520) and a new subtree is created. The new subtree with the updated information is constructed in a memory cache as part of an in-memory stand and eventually flushed to disk, so that in general, the new subtree may be in a different stand from the old subtree it replaces. Thus, any insertions, deletions and updates to the forest are processed by writing new or revised subtrees to a new stand. This feature localizes updates, rather than requiring entire documents to be replaced.
It should be noted that in some instances, updates to a subtree will also affect other subtrees; for instance, if a lower subtree is deleted, the link node in the upper subtree is preferably removed, which would require modifying the upper subtree. Transactional updating procedures that might be implemented to handle such changes while maintaining consistency are described in detail in Lindblad III-A.
It is to be understood that marking a subtree as deleted does not require that the subtree immediately be removed from the data store. Rather than removing any data, the current time can be entered as a deletion timestamp for the subtree in Timestamps file 1520 of FIG. 15. The subtree is treated as if it were no longer present for effective times after the deletion time. In some embodiments, subtrees marked as deleted may periodically be purged from the on-disk stands, e.g., during merging (described below).
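By way of illustration only, timestamp-based deletion and updating might be sketched in Python as follows. The sketch assumes integer timestamps, uses zero as the "current" marker suggested above, treats a subtree as absent from its deletion time onward, and keeps old and new subtrees in one list even though in practice they may reside in different stands. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

NOT_DELETED = 0  # deletion timestamp value indicating the subtree is current


@dataclass
class Timestamps:
    created: int
    deleted: int = NOT_DELETED


def is_visible(ts: Timestamps, at_time: int) -> bool:
    """Return True if the subtree exists as of the given effective time."""
    if at_time < ts.created:
        return False
    return ts.deleted == NOT_DELETED or at_time < ts.deleted


def update_subtree(table: List[Timestamps], old_id: int, now: int) -> int:
    """Mark the old subtree deleted and register its replacement; return the new id."""
    table[old_id].deleted = now          # nothing is physically removed
    table.append(Timestamps(created=now))
    return len(table) - 1
```

Since the old entry is retained, a query evaluated at an earlier effective time still sees the previous version of the subtree, which is what allows archival information to be stored.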
Merging of Stands
Stand size is advantageously controlled to provide efficient I/O, e.g., by keeping the TreeData file size of a stand close to the maximum amount of data that can be retrieved in a single I/O operation. As stands are updated, stand size may fluctuate. In some embodiments of the invention, merging of stands is provided to keep stand size optimized. For example, in system 1300 of FIG. 13, database manager 1332, or other process, might run a background thread that periodically selects some subset of the persistent stands and merges them together to create a single unified persistent stand.
In one embodiment, the background merge process can be tuned by two parameters: Merge-min-ratio and Merge-min-size, which can be provided by parameter storage 1312. Merge-min-ratio specifies the minimum allowed ratio between any two on-disk stands; once the ratio is exceeded, system 1300 automatically schedules stands for merging to reduce the maximum size ratio between any two on-disk stands. Merge-min-size limits the minimum size of any single on-disk stand. Stands below this size limit will be automatically scheduled for merging into some larger on-disk stand.
In the embodiment of a stand shown in FIG. 15, the merge process merges corresponding files between the two stands. For some files, merging may simply involve concatenating the contents of the files; for other files, contents may be modified as needed. As an example, two TreeData files can be merged by appending the contents of one file to the end of the other file. This generally will affect the offset values in the TreeIndex files, which are modified accordingly. Appropriate merging procedures for other files shown in FIG. 15 can be readily determined.
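By way of illustration only, merging two TreeData files and adjusting the corresponding TreeIndex offsets might be sketched in Python as follows. The sketch assumes each TreeIndex is a list of start offsets, that subtrees from the second stand are renumbered to follow those of the first, and that no deleted subtrees are purged during the merge; the function name is hypothetical.

```python
from typing import List, Tuple


def merge_tree_files(
    data_a: bytes, index_a: List[int], data_b: bytes, index_b: List[int]
) -> Tuple[bytes, List[int]]:
    """Append stand B's subtree data to stand A's and shift B's offsets accordingly."""
    merged_data = data_a + data_b
    shift = len(data_a)                       # B's records now start after A's
    merged_index = list(index_a) + [offset + shift for offset in index_b]
    return merged_data, merged_index
```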
System Parameters
As described above, parameters can be provided using parameter storage 1312 to control various aspects of system operation. Parameters that can be provided include rules for identifying tokens and subtrees, rules establishing minimum and/or maximum sizes for on-disk and in-memory stands, parameters for determining whether to merge on-disk stands, and so on.
In one embodiment, some or all of these parameters can be provided using a forest configuration file, which can be defined in accordance with a preestablished XML schema. For example, the forest configuration file can allow a user to designate one or more ‘subtree root’ element labels, with the effect that the data loader, when it encounters an element with a matching label, loads the portion of the document appearing at or below the matching element subdivision as a subtree. The configuration file might also allow for the definition of ‘subtree parent’ element names, with the effect that any elements which are found as immediate children of a subtree parent will be treated as the roots of contiguous subtrees.
More complex rules for identifying subtree root nodes may also be provided via parameter storage 1312, for example, conditional rules that identify subtree root nodes based on a sequence of element labels or tag names. Subtree identification rules need not be specific to tag names, but can specify breaks upon occurrence of other conditions, such as reaching a certain size of subtree or subtree content. Some decomposition rules might be parameterized where parameters are supplied by users and/or administrators (e.g., “break whenever a tag is encountered that matches a label the user specifies,” or more generally, when a user-specified regular expression or other condition occurs). In general, subtree decomposition rules are defined so as to optimize tradeoffs between storage space and processing time, but the particular set of optimum rules for a given implementation will generally depend on the structure, size, and content of the input document(s), as well as on parameters of the system on which the database is to be installed, such as memory limits, file-system configurations, and the like.
Element Queries
As described previously, element queries are used to determine the position of words, phrases, or elements relative to a particular element in a hierarchical (e.g., XML) document. For example, a typical element query is whether a given word is contained within (i.e., is a descendant of) a given element. Current solutions to processing element queries are inefficient, cannot be applied generically to all elements in a document, and fail to take into account the hierarchical structure of the elements. Embodiments of the present invention provide techniques for representing and querying positional information in hierarchical documents that overcome these problems.
FIG. 16 illustrates the steps performed in representing positional information for words and elements of a hierarchical document. In one embodiment, method 1600 is performed by document parser 1316 of the database system shown in FIG. 14. In other embodiments, method 1600 may be performed by any other hardware/software based component that is a part of, or separate from, the database system. Method 1600 may be performed in real-time as a document is ingested into the database system, or performed as a batch process either before or after ingestion.
At step 1602, a hierarchical document such as an XML document is received. As is well known in the art, an XML document comprises a plurality of words which define the content of the document, and a plurality of elements which define the structure of the document. At step 1604, at least one word in the plurality of words is associated with one or more word positions. In an exemplary embodiment, every word in the plurality of words is associated with a word position. In alternative embodiments, a subset of words is associated with a word position. A word position indicates the position of the word relative to other words in the document. For example, consider the following sample XML document (referred to herein as “sample document 1”):

the₁ striped₂
brown₃
fox₄ jumps₅ over₆
the₇ brown₈
striped₉
dog₁₀

Note that the positional subscripts above are included to illustrate the order of the words in the document; they are not a part of the XML document. The words in sample document 1 may be associated with word positions as follows:
brown: 3, 8
dog: 10
fox: 4
jumps: 5
over: 6
striped: 2, 9
the: 1, 7
In this embodiment, the word positions for each word correspond to the ascending, sequential order of the word in sample document 1. Thus, the word “the” is the first word in the document and is therefore associated with the word position “1.” Similarly, the word “fox” is the fourth word in the document and is therefore associated with the word position “4.” In alternative embodiments, other possible word positions that maintain the relative positions of the words may be used. Words that appear multiple times in the document may be associated with multiple word positions. Thus, the word “striped” is associated with the positions 2 and 9, indicating that the word appears at both the second and ninth word positions in the document.
In various embodiments, the word positions for a document may be stored in a data structure that is separate from the document. In these cases, the word positions may be associated with an identifier that identifies the document. This identifier may be, for example, a subtree identifier of a subtree contained within the document, or a document identifier. Alternatively, the word positions may be stored with the document. In further embodiments, the word positions may be indexed in a word position index to allow fast retrieval of a word based upon a word position, or vice versa. As will be described in greater detail below, such an index may be used to facilitate the processing of element queries.
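As a minimal sketch of the word position association described above, the following Python builds a mapping from each word to its 1-based positions for the word sequence of sample document 1. The function name `build_word_positions` and the use of an in-memory dictionary are illustrative assumptions; the disclosure does not prescribe a particular data structure.

```python
from collections import defaultdict
from typing import Dict, List


def build_word_positions(words: List[str]) -> Dict[str, List[int]]:
    """Map each word to the ascending list of 1-based positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    return dict(index)


# Words of sample document 1 in document order (markup omitted).
SAMPLE_1_WORDS = ["the", "striped", "brown", "fox", "jumps", "over",
                  "the", "brown", "striped", "dog"]

word_positions = build_word_positions(SAMPLE_1_WORDS)
```

Because positions are appended in document order, each list is already sorted, which is the property the later comparison steps rely on.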
At step 1606, at least one element in the plurality of elements is associated with one or more word position ranges. In an exemplary embodiment, every element in the plurality of elements is associated with a word position range. In alternative embodiments, a subset of elements is associated with a word position range. A word position range describes a range of zero or more word positions, and is used to identify the words that are contained within an element. Referring back to sample document 1, the word position ranges for elements “” and “” may be represented as:
element : (3, 4), (7, 10)
element : (9, 10)
As shown, each word position range is represented using a “starting” word position and an “ending” word position. For example, element “” has a range (3, 4) with a starting word position 3 and an ending word position 4. The word positions that fall within the starting and ending positions of a range represent the words that are descendants of the corresponding element.
According to one embodiment, the word position ranges for element “” above may be interpreted as:

- Words on or after word position 3, but before word position 4, are descendants of an element “”
- Words on or after word position 7, but before word position 10, are descendants of an element “”
 In this embodiment, the starting word position for each range is “closed,” meaning that the range includes the word located at the starting word position. Conversely, the ending word position for each range is “open,” meaning that the range does not include the word located at the ending word position. However, it should be appreciated that the starting and ending positions may be denoted using either open or closed values.

Using the two types of positional information described above (word positions and word position ranges), element queries may be processed in an efficient manner. For example, by intersecting the word positions and word position ranges for sample document 1, we can determine that:

- There are two occurrences of the word “brown” contained within element “” because “brown” exists at positions 3 and 8, position 3 falls within the range (3, 4) (i.e., 3 <= 3 < 4), and position 8 falls within the range (7, 10) (i.e., 7 <= 8 < 10).
- The word “striped” is contained within the element “,” because “striped” exists at position 9, and position 9 falls within the range (9, 10).

Like word positions, word position ranges may be indexed in a range index to allow fast retrieval of a range based upon an element, or vice versa. In various embodiments, the range index may be intersected with the word position index described earlier to quickly identify whether a given word is a descendant of a given element.
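The intersection described above can be sketched as follows, using the positional data of sample document 1. This is a minimal illustration under stated assumptions: the labels `"outer"` and `"inner"` are placeholder names for the two elements, and the dictionaries and the function `positions_within` are hypothetical stand-ins for the word position index and range index.

```python
from typing import Dict, List, Tuple

# Word positions and word position ranges for sample document 1.
# "outer" and "inner" are placeholder element labels assumed for this sketch.
WORD_POSITIONS: Dict[str, List[int]] = {
    "brown": [3, 8], "dog": [10], "fox": [4], "jumps": [5],
    "over": [6], "striped": [2, 9], "the": [1, 7],
}
ELEMENT_RANGES: Dict[str, List[Tuple[int, int]]] = {
    "outer": [(3, 4), (7, 10)],
    "inner": [(9, 10)],
}


def positions_within(word: str, element: str) -> List[int]:
    """Return positions of `word` that fall in a half-open range of `element`."""
    ranges = ELEMENT_RANGES.get(element, [])
    return [p for p in WORD_POSITIONS.get(word, [])
            if any(start <= p < end for start, end in ranges)]
```

A non-empty result means the word is a descendant of the element. Note that position 10 ("dog") is excluded from the range (7, 10) because the ending position is open.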
Word position ranges for a document may be stored in a data structure that is separate from the document. In this case, the word position ranges may be associated with an identifier that identifies the document. This identifier may be, for example, a subtree identifier of a subtree contained within the document, or a document identifier. In one embodiment, each word position and word position range may be associated with a separate identifier. In other embodiments, all of the positional information for a document may be associated with a single identifier.
In various embodiments, the word position ranges of a document may be augmented to support elements that have zero descendants (i.e., are empty). For example, consider the following XML document (identical to sample document 1 except for the addition of an empty “ ” element):

the₁striped₂



brown₃



fox₄jumps₅over₆



the₇brown₈

 



striped₉





dog₁₀

In this case, the word position ranges would be augmented as follows:
element : (3, 4), (7, 10)
element : (9, 9)
element : (9, 10)
Note that the word position range for element “ ” is represented as (9, 9). This denotes that element “ ” is an empty element, as no position satisfies the constraint 9 <= position < 9.
In further embodiments, the word position ranges for a document may be augmented to support elements that contain nested elements. Specifically, word position range information may be modified to indicate what part of a range is subsumed by a descendant (i.e., nested) element. This representation allows us to determine, for example, whether a word or phrase is an indirect descendant (i.e., is contained within a nested element) of a particular element. For example, consider the following second sample XML document:

the₁striped₂



brown₃



fox₄jumps₅over₆



the₇brown₈



striped₉



grumpy₁₀mean₁₁





old₁₂dog₁₃who₁₄really₁₅



just₁₆wanted₁₇to₁₈be₁₉left₂₀alone₂₁
To identify whether an element contains another element, the beginning or end of a descendant element is marked within a word position range with an identifier, such as an asterisk. This produces the following representation:
element : (3, 4), (7, 9*), (12*, 16)
element : (9, 10*), (12*, 12)
element : (10, 12)
In the above example, when an asterisk appears at the ending position of a word position range, the asterisk indicates that a descendant element begins at that position. When the asterisk appears at a starting position of a word position range, the asterisk indicates that a descendant element is closed at that position. This representation is sufficient to identify whether a word is a direct descendant of (i.e., contained directly under) an element, or an indirect descendant of (i.e., contained within a nested element of) the element. For example, the word position ranges for element “” as shown above may be interpreted as follows:

- Words on or after word position 3, but before word position 4 are direct descendants of an element 
- Words on or after word position 7, but before word position 9 are direct descendants of an element 
- Words on or after word position 9, but before word position 12 are indirect descendants of an element 
- Words on or after word position 12, but before word position 16 are direct descendants of an element 
 In the example above, note that element “” includes the empty word position range (12*, 12) to indicate the closing boundary of nested element “” within element “.”
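One way to apply the asterisk convention is sketched below for the ranges of the second sample document. The tuple layout, the placeholder names `OUTER` and `MIDDLE`, and the function `classify` are assumptions made for illustration. The sketch treats the gap between a range whose end is starred and the next range whose start is starred as positions subsumed by a nested element.

```python
from typing import List, Optional, Tuple

# One range: (start, start_starred, end, end_starred).
StarredRange = Tuple[int, bool, int, bool]

# Ranges from the second sample document; OUTER and MIDDLE are placeholder names.
OUTER: List[StarredRange] = [(3, False, 4, False), (7, False, 9, True),
                             (12, True, 16, False)]
MIDDLE: List[StarredRange] = [(9, False, 10, True), (12, True, 12, False)]


def classify(position: int, ranges: List[StarredRange]) -> Optional[str]:
    """Return "direct", "indirect", or None for a word position."""
    depth = 0  # number of nested elements currently open
    previous_end: Optional[int] = None
    for start, start_starred, end, end_starred in ranges:
        # A gap between two ranges is indirect only while a nested element is open.
        if previous_end is not None and depth > 0 and previous_end <= position < start:
            return "indirect"
        if start_starred:
            depth -= 1  # a nested element closes at `start`
        if start <= position < end:
            return "direct"
        if end_starred:
            depth += 1  # a nested element begins at `end`
        previous_end = end
    return None
```

For example, position 9 is classified as indirect for `OUTER` and as direct for `MIDDLE`, matching the interpretation listed above.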

In various embodiments, the above described representation for word position ranges may be used to indicate any level of nesting within an element, and may be used to indicate descendant elements that have the same element name as their parent elements. For example, consider the following third sample document and corresponding word position ranges:

the₁striped₂



brown₃



fox₄jumps₅over₆



the₇brown₈



striped₉



grumpy₁₀mean₁₁





old₁₂dog₁₃who₁₄really₁₅



just₁₆wanted₁₇to₁₈be₁₉left₂₀alone₂₁:
element : (3, 4), (7, 9*), (10, 12), (12*, 16)
element : (9,10*), (12*,12)
Note that element “” is nested within itself. Based on the above, the word position ranges for element may be interpreted as follows:

- Words on or after word position 3, but before word position 4 are direct descendants of an element 
- Words on or after word position 7, but before word position 9 are direct descendants of an element 
- Words on or after word position 9, but before word position 10 are indirect descendants of an element 
- Words on or after word position 10, but before word position 12 are direct descendants of an element 
- Words on or after word position 12, but before word position 16 are direct descendants of an element 
- Words on or after word position 10, but before word position 12 are both direct descendants of an element and indirect descendants of some other element

As another example, consider the following fourth sample document and corresponding word position ranges (hereinafter “sample document 4”):

the₁striped₂



brown₃



fox₄jumps₅over₆



the₇brown₈



striped₉



grumpy₁₀mean₁₁





old₁₂dog₁₃who₁₄really₁₅



just₁₆wanted₁₇to₁₈be₁₉left₂₀alone₂₁
element : (3, 4), (7, 9*), (9, 10*), (10, 12), (12*, 12), (12*, 16)
In this example, the word position ranges for element “” may be interpreted as follows:

- Words on or after word position 3, but before word position 4 are direct descendants of an element 
- Words on or after word position 7, but before word position 9 are direct descendants of an element 
- Words on or after word position 9, but before word position 10 are both direct and indirect descendants of elements 
- Words on or after word position 10, but before word position 12 are both direct and indirect descendants of elements 
- Words on or after word position 12, but before word position 16 are direct descendants of an element 
 Note that empty word position range (12*, 12) is used to provide a closing boundary for the descendant element “” starting at word position 9.

At step 1608 of FIG. 16, the word position ranges for a document may be encoded so that they can be stored or transmitted efficiently. This may be desirable, for example, if the document is very large, since the size of the positional information will increase with document size (i.e., storing the number 1,000,003 may require more space than storing the number 3). According to one embodiment, the starting and ending word positions of each word position range are encoded as “deltas” relative to a previous position. This encoding takes advantage of the fact that word position ranges may be organized in a non-descending order. Consider, for example, the word position ranges for element “” from sample document 4:
element : (3, 4), (7, 9*), (9, 10*), (10, 12), (12*, 12), (12*, 16)
The word position ranges above are represented using absolute values for the word positions. Using delta values, the word position ranges may be encoded as:
element : (3, 1), (3, 2*), (0, 1*), (0, 2), (0*, 0), (0*, 4)
As shown, the starting and ending word positions for each word position range are encoded as the difference (i.e., delta) between the position and a previous position. Such an encoding reduces the absolute magnitude of the numbers that need to be stored, thereby reducing storage space requirements. Note that the absolute word positions may be recovered from the delta encoded positions by adding together all of the numbers in the range information up to the desired position.
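The delta encoding and its inverse can be sketched as follows, using the ranges from sample document 4. The flattened `(position, starred)` representation and the function names are assumptions made for illustration; only the arithmetic follows the description above.

```python
from typing import List, Tuple

# A boundary is (position, starred). Ranges are flattened in
# start, end, start, end, ... order.
Boundary = Tuple[int, bool]

ABSOLUTE: List[Boundary] = [
    (3, False), (4, False), (7, False), (9, True), (9, False), (10, True),
    (10, False), (12, False), (12, True), (12, False), (12, True), (16, False),
]


def delta_encode(boundaries: List[Boundary]) -> List[Boundary]:
    """Replace each position with its difference from the previous position."""
    encoded: List[Boundary] = []
    previous = 0
    for position, starred in boundaries:
        encoded.append((position - previous, starred))
        previous = position
    return encoded


def delta_decode(deltas: List[Boundary]) -> List[Boundary]:
    """Recover absolute positions by keeping a running sum of the deltas."""
    decoded: List[Boundary] = []
    running = 0
    for delta, starred in deltas:
        running += delta
        decoded.append((running, starred))
    return decoded
```

Because the positions are in non-descending order, every delta is non-negative, which is what keeps the encoded numbers small.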
In one set of embodiments, the delta encoding scheme for word position ranges may be modified to encode a stream of pure numbers without any non-numeric identifiers such as asterisks. This may be desirable because number streams are easily compressible using known compression techniques. According to one embodiment, each asterisk is changed to a zero value that precedes the starting or ending position the asterisk is associated with. Thus, the representation for certain word position ranges may be changed from a pair of numbers to a tuple of numbers. Consider the following delta-encoded word position ranges for element “”:
element : (3, 1), (3, 2*), (0, 1*), (0, 2), (0*, 0), (0*, 4)
By replacing each asterisk with a preceding zero, the word position ranges are modified to:
element : (3, 1), (3, 0, 2), (0, 0, 1), (0, 2), (0, 0, 0), (0, 0, 4)
One problem with the above representation is that the delta-encoded starting and ending positions for a range may have a value of zero, making it difficult to differentiate position data from an asterisk. To avoid this, each starting and ending position can be incremented by 1 so that no position data item can be equal to 0. This allows the word position range data for an element to be encoded as a pure stream of numbers, without any notion of “parentheses.” For example, the stream encoding for element “” is:
element : 4, 2, 4, 0, 3, 1, 0, 2, 1, 3, 0, 1, 1, 0, 1, 5
Note that the original delta encoding can be recovered by decrementing each position datum by 1 prior to use, and by recognizing that any position datum preceded by a zero should be regarded as modified by an asterisk. This stream encoding lends itself to extremely efficient storage through well-understood techniques for delta compression.
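The conversion between the delta encoding and the pure number stream can be sketched as follows. The `(delta, starred)` representation and the function names `to_stream` and `from_stream` are illustrative assumptions; the rule itself (a zero marks an asterisk, and every position datum is incremented by 1) follows the description above.

```python
from typing import List, Tuple

# Delta-encoded boundaries for sample document 4 as (delta, starred) pairs,
# flattened in start, end, start, end, ... order.
DELTAS: List[Tuple[int, bool]] = [
    (3, False), (1, False), (3, False), (2, True), (0, False), (1, True),
    (0, False), (2, False), (0, True), (0, False), (0, True), (4, False),
]


def to_stream(deltas: List[Tuple[int, bool]]) -> List[int]:
    """Emit a 0 before each starred datum and add 1 to every datum."""
    stream: List[int] = []
    for delta, starred in deltas:
        if starred:
            stream.append(0)
        stream.append(delta + 1)
    return stream


def from_stream(stream: List[int]) -> List[Tuple[int, bool]]:
    """Invert to_stream: a 0 marks the next datum as starred."""
    deltas: List[Tuple[int, bool]] = []
    starred = False
    for number in stream:
        if number == 0:
            starred = True
            continue
        deltas.append((number - 1, starred))
        starred = False
    return deltas
```

Since every incremented datum is at least 1, a zero in the stream can only be an asterisk marker, so the stream decodes unambiguously.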
It should be appreciated that the specific steps illustrated in FIG. 16 provide a particular method for representing positional information in a hierarchical document according to an embodiment of the present invention. Other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 16 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
Using the above-described representations for word positions and word position ranges, element queries can be processed efficiently against any hierarchical document. FIG. 17 illustrates the steps performed in processing an element query in accordance with an embodiment of the present invention. One such element query may be whether a particular word or phrase is a descendant of a particular element. Another such element query may be whether a particular word or phrase is specifically a direct (or indirect) descendant of a particular element. One of ordinary skill in the art would recognize other variations and alternatives. In an exemplary embodiment, method 1700 is performed by a query engine or processor of a database system. In alternative embodiments, method 1700 may be performed by any hardware or software based component that is a part of, or separate from, a database system.
At step 1702, an element query is received. In various embodiments, the element query contains a word or phrase and an element. At steps 1704 and 1706, the word positions for the word or phrase and the word position ranges for the element are retrieved. According to one set of embodiments, the word positions for the word or phrase may be retrieved via a word position index. Similarly, the word position ranges for the element may be retrieved via a range index. In various embodiments, an efficient on-disk stream encoding for the word positions and word position ranges may be retrieved and converted into an array of 32-bit numbers in memory, where the high-bit of each number indicates whether or not an asterisk is present. The high-bit is masked out for numeric position comparison operations.
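The in-memory representation mentioned above, in which the high bit of a 32-bit number flags an asterisk, might look like the following sketch. The constant and function names are hypothetical, and Python integers are used here in place of a fixed-width array for brevity.

```python
ASTERISK_FLAG = 0x80000000  # high bit of a 32-bit number
POSITION_MASK = 0x7FFFFFFF  # remaining 31 bits hold the position


def pack(position: int, starred: bool) -> int:
    """Combine a position and its asterisk flag into one 32-bit value."""
    if not 0 <= position <= POSITION_MASK:
        raise ValueError("position does not fit in 31 bits")
    return position | ASTERISK_FLAG if starred else position


def position_of(packed: int) -> int:
    """Mask out the high bit so positions can be compared numerically."""
    return packed & POSITION_MASK


def is_starred(packed: int) -> bool:
    """Report whether the asterisk flag is set."""
    return bool(packed & ASTERISK_FLAG)
```

With this layout, a comparison such as `position_of(a) < position_of(b)` ignores the asterisk flag, as the text requires.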
At step 1708, the word positions and word position ranges are processed to determine, for example, whether the word or phrase is a descendant of the element provided in the query. Other types of queries are also contemplated, such as queries that check for indirect descendant relationships, both direct and indirect descendant relationships, a direct descendant relationship at the exclusion of an indirect descendant relationship, and the like. In one set of embodiments, the processing comprises comparing the word position data to the word position range data. Since both sets of position data may be arranged in non-descending order, algorithms that perform non-linear comparisons (e.g. a binary search) may be used, in addition to other algorithms (linear merges, etc.). One of ordinary skill in the art would recognize many variations, modifications, and alternatives. Depending on the relative sizes and densities of the word position data and word position range data, different algorithms may be executed.
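Two of the comparison strategies mentioned above are sketched below: a binary search per word position and a single linear merge. Both assume plain half-open `(start, end)` ranges that are sorted and non-overlapping, with asterisk flags already masked out; the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Range = Tuple[int, int]  # half-open: start <= position < end


def find_by_binary_search(positions: List[int], ranges: List[Range]) -> List[int]:
    """Look up each position with a binary search over the range starts."""
    starts = [start for start, _ in ranges]  # built once per query
    matches: List[int] = []
    for p in positions:
        i = bisect_right(starts, p) - 1  # last range starting at or before p
        if i >= 0 and p < ranges[i][1]:
            matches.append(p)
    return matches


def find_by_linear_merge(positions: List[int], ranges: List[Range]) -> List[int]:
    """Walk both sorted sequences once, advancing past ranges that end too early."""
    matches: List[int] = []
    i = 0
    for p in positions:
        while i < len(ranges) and ranges[i][1] <= p:
            i += 1
        if i < len(ranges) and ranges[i][0] <= p:
            matches.append(p)
    return matches
```

The binary search tends to suit a few word positions checked against many ranges, while the merge tends to suit inputs of similar size, which is consistent with choosing the algorithm by relative size and density.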
At step 1710, the result set of the query is returned to the query initiator.
FIG. 18 illustrates an embodiment of the present invention that uses the word position range data for a document to determine whether an element is a descendant of another element. At step 1802, the query comprising a first element and a second element is received. At steps 1804 and 1806, the word position range data for the first and second elements is retrieved. At step 1808, the range data for the first and second elements is processed to determine whether the first element is a descendant of the second element (or vice versa). Finally, at step 1810 the result is returned to the query initiator.
In an alternative embodiment, an element position may be associated with each element of a document, and a similar approach as described herein may be used as a general solution to determining what elements are contained within other elements.
It should be appreciated that the specific steps illustrated in FIGS. 17 and 18 provide particular methods of processing element queries according to an embodiment of the present invention. Other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present invention may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIGS. 17 and 18 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
This detailed description illustrates some embodiments of the present invention and variations thereof, but should not be taken as a limitation on the scope of the invention. In this description, structured documents are described, along with their processing, storage and use, with XML being the primary example. However, it should be understood that the invention might find applicability in systems other than XML systems, whether they are later-developed evolutions of XML or entirely different approaches to structuring data. It should also be understood that “XML” is not limited to the current version or versions of XML. An XML file (or XML document) as used herein can be serialized XML or more generally an “infoset.” Generally, XML files are text, but they might be in a highly compressed binary form.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. Many variations of the invention will become apparent to those skilled in the art upon review of the disclosure. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the pending claims along with their full scope or equivalents.

Claims

1. A computer-implemented method for representing positional information in a hierarchical document, the method comprising:

receiving a hierarchical document comprising a plurality of words and a plurality of elements, at least one word in the plurality of words being a descendant of at least one element in the plurality of elements;

associating at least one word in the plurality of words with one or more word positions, the one or more word positions indicating positions of the word relative to other words within the hierarchical document; and

associating at least one element in the plurality of elements with one or more word position ranges, each word position within the one or more word position ranges corresponding to a word in the plurality of words that is a descendant of the element.

2. The method of claim 1 wherein at least one element in the plurality of elements has zero descendants, and wherein the at least one element is associated with a word position range indicating a range of zero word positions.

3. The method of claim 1 wherein a first element in the plurality of elements is a descendant of a second element in the plurality of elements, and wherein one or more word position ranges associated with the second element indicate a range of word positions subsumed by the first element.

4. The method of claim 3 wherein a first word position range associated with the second element indicates a starting position of the first element within the second element, and a second word position range associated with the second element indicates an ending position of the first element within the second element.

5. The method of claim 3 wherein the first element and second element have the same element name.

6. The method of claim 3 wherein the first element is a direct descendant of the second element.

7. The method of claim 3 wherein the first element is an indirect descendant of the second element.

8. The method of claim 3 wherein the one or more word position ranges associated with the second element are encoded using a stream of numbers.

9. The method of claim 1 wherein each word position range indicates a sequential range of word positions.

10. The method of claim 1 wherein the hierarchical document is an XML document.

11. The method of claim 1 further comprising associating the word positions for the plurality of words and the word position ranges for the plurality of elements with an identifier that identifies the hierarchical document.

12. The method of claim 1 wherein each word position range for an element comprises a starting word position and an ending word position, and wherein the starting and ending word positions are encoded using absolute values.

13. The method of claim 1 wherein each word position range for an element comprises a starting word position and an ending word position, and wherein the ending word position is encoded using a delta between the ending word position and the starting word position.

14. The method of claim 13 wherein the starting word position is encoded using a delta between the starting word position and an ending word position of a previous word position range for the element.

15. A computer-implemented method for determining whether one or more words are descendants of an element in a hierarchical document, the method comprising:

retrieving word positions for the one or more words, the word positions indicating positions of the one or more words relative to other words within the hierarchical document;

retrieving one or more word position ranges for the element, each word position within the one or more word position ranges corresponding to a word in the hierarchical document that is a descendant of the element; and

processing the word positions for the one or more words and the one or more word position ranges for the element to determine whether the one or more words are descendants of the element.

16. The method of claim 15 wherein the processing comprises determining whether the one or more words are direct descendants of the element.

17. The method of claim 15 wherein the processing comprises determining whether the one or more words are indirect descendants of the element.

18. The method of claim 15 wherein the word positions for the one or more words are referenced in a first index, wherein the one or more word position ranges for the element are referenced in a second index, and wherein the processing comprises intersecting the first index with the second index.

19. The method of claim 15 wherein the one or more words constitute a phrase.

20. A computer-implemented method for determining whether a first element is a descendant of a second element in a hierarchical document, the method comprising:

retrieving one or more word position ranges for the first element, each word position within the one or more word position ranges for the first element corresponding to a word in the hierarchical document that is a descendant of the first element;

retrieving one or more word position ranges for the second element, each word position within the one or more word position ranges for the second element corresponding to a word in the hierarchical document that is a descendant of the second element; and

processing the one or more word position ranges for the first element and the one or more word position ranges for the second element to determine whether the first element is a descendant of the second element.

21. A database system comprising a database configured to store a plurality of hierarchical documents, wherein a first hierarchical document in the plurality of hierarchical documents comprises a plurality of words and a plurality of elements, at least one word in the plurality of words being a descendant of at least one element in the plurality of elements,

wherein at least one word in the plurality of words is associated with one or more word positions, the one or more word positions indicating positions of the word relative to other words within the first hierarchical document; and

wherein at least one element in the plurality of elements is associated with one or more word position ranges, each word position within the one or more word position ranges corresponding to a word in the plurality of words that is a descendant of the element.

22. The database system of claim 21 further comprising a query engine configured to receive a query, and to return a result responsive to the query by analyzing one or more of the word positions and one or more of the word position ranges.

23. The database system of claim 22 wherein the query is whether one or more words in the first hierarchical document are direct descendants of a particular element in the first hierarchical document.

24. The database system of claim 22 wherein the query is whether one or more words in the first hierarchical document are indirect descendants of a particular element in the first hierarchical document.

25. A computer program product embedded in a computer readable medium comprising:

program code for receiving a hierarchical document comprising a plurality of words and a plurality of elements, at least one word in the plurality of words being a descendant of at least one element in the plurality of elements;

program code for associating at least one word in the plurality of words with one or more word positions, the one or more word positions indicating positions of the word relative to other words within the hierarchical document; and

program code for associating at least one element in the plurality of elements with one or more word position ranges, each word position within the one or more word position ranges corresponding to a word in the plurality of words that is a descendant of the element.

26. The computer program product of claim 25 further comprising:

program code for receiving a query including one or more words in the hierarchical document and an element in the hierarchical document;

program code for retrieving word positions for the one or more words;

program code for retrieving the one or more word position ranges for the element; and

program code for processing the query by analyzing the word positions for the one or more words and the one or more word position ranges for the element.

27. The computer program product of claim 26 wherein the query is whether the one or more words are direct descendants of the element.

28. The computer program product of claim 26 wherein the query is whether the one or more words are indirect descendants of the element.