CN101887458A - Path coding-based XML document index method

Info

Publication number: CN101887458A
Application number: CN 201010219493
Authority: CN
Inventors: 宋余庆; 陈健美; 邹为伟
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2010-07-06
Filing date: 2010-07-06
Publication date: 2010-11-17

Abstract

The invention discloses an XML document indexing method based on path coding and belongs to the technical field of data processing. The method comprises the following indexing steps: building tree models, dividing the query path, building an element table, building a structure table, forming XML path prefix codes, forming an XML value table, determining the path prefix codes, and coupling the paths to obtain the result. By introducing XML path codes, the invention provides an effective XML indexing method that quickly matches query paths against XML documents and obtains the query results, and it is generally applicable.

Description

An XML document indexing method based on path coding

Technical field

The present invention relates to a method for indexing electronic documents, in particular an XML document indexing method based on path coding, and belongs to the field of computer data processing.

Background technology

The Extensible Markup Language (XML) is platform-independent, self-describing, extensible, and simple to process, and it has gradually become the standard for representing and exchanging data on the Internet. As XML use spreads, querying XML data faster and more accurately has become an increasingly important problem. To improve the efficiency of XML path queries, many researchers have worked on building indexes over XML documents.

Because XML data is semi-structured, the key to querying an XML document is determining the structural relationships between elements, for example the parent-child relationship. For structural queries, one approach is to build a path index over the XML document tree and use it to speed up query evaluation. Another approach is to encode the nodes of the XML document tree and determine the structural relationship between nodes directly from their codes.

The main XML path indexes at present are DataGuide, 1-index, and A(k). A DataGuide is a concise structural summary of the paths that start at the root node. Each label path, formed by concatenating edge labels, is described only once in the DataGuide. A DataGuide reduces the number of nodes that must be traversed during a path query, but it has the following shortcomings:

1) A DataGuide is an exact summary of the XML data graph. If the XML data is a graph structure, the time needed to build the DataGuide and the space needed to store the index may be exponential in the size of the XML data graph.

2) The node sets associated with different nodes in a DataGuide may intersect, which can cause confusion.

The A(k) index introduces the notion of "k-similarity" between nodes. Its basic idea is to store the nodes of the XML data graph whose similarity is k in the same node set of the index graph, which means that all paths of length k are stored in the index graph. However, it considers only the upward similarity of nodes and ignores their downward similarity, so it is inefficient for queries with branching paths.

The basic idea of the Index Fabric is to express the relationships in semi-structured data as paths, encode the paths as strings, and build an index over the strings. It supports fast path queries, but it is inefficient for path queries that contain "//", for example publications//title, which looks up every title that has a publications ancestor.

In addition, node-coding index methods such as XISS and ViST have been proposed. The basic idea of XISS is to evaluate the query path expression in segments and then join the partial results successively, using the relationship constraints between nodes, to produce the final result. XISS encodes the XML document with preorder and postorder traversal values, so path queries can be processed without traversing the document. However, if the query path consists of N elements or attributes, N sets of nodes must be retrieved from the index, and at least (N-1) structural join operations are needed for each XML document. In addition, when a simple path is processed, many nodes from unrelated structures inevitably take part in the parent-child or ancestor-descendant tests. ViST encodes both the XML documents and the user query as sequences of string pairs, which turns querying XML data into sequence matching. However, query processing often produces false positives, and the preprocessing time is too long.

A search of the prior art found the following. Chinese patent application No. 03108526.1 discloses an XML indexing method for processing regular path expression queries. When an XML file is loaded into the database, that application extracts all possible path names and stores them with path IDs in a path lookup table. Using the path lookup table as the index, a regular path expression entered by the user is converted into paths that actually exist in the XML, and path matching is carried out in the index. Its drawbacks are that when the XML file is very large, the number of paths in the path lookup table grows significantly and the storage cost is high, and that because it does not use the schema of the XML document, the index of every XML document must be scanned for any query path expression, which is costly and reduces query efficiency.

Chinese patent application No. 200410099272.8 discloses an efficient path indexing method for XML data. The method builds a UD(k, l) index over the data graph of the source XML document and uses an automaton in the index graph to join the condition paths with the main path, which supports path expressions with branches. A target node in a branching path whose upward similarity is less than k and whose downward similarity is less than l can be queried in the index graph alone. However, the choice of the parameters k and l is critical and directly affects query accuracy, and that invention gives no solution for it. Moreover, when the similarity exceeds the range of k or l, verification against the source data graph is still required, which greatly reduces query efficiency.

Chinese patent application No. 200910158713.X discloses a method and system for generating an index in an XML database management system. When an XML document is stored in the database, that application builds an XML index function from the XML schema and stores an index table, built from the key and value of each node, in the library module. When an XML document is queried, only the values computed by the index function need to be scanned rather than the whole database, which makes queries efficient. Its shortcoming is that it handles only simple path expressions and cannot evaluate regular path expressions, which is a significant limitation.

Summary of the invention

The object of the invention is to propose an XML document indexing method based on path coding that solves problems of the prior art such as the large amount of XML document scanning, the complexity of branch path join operations, and the low efficiency of regular path expression queries.

To achieve this object, the XML document indexing method based on path coding of the present invention is carried out by an intelligent device with a central processing unit, and the indexing steps are:

Step 1, build the tree models: by conventional means and according to the node structure of the documents, the XML document and its corresponding Schema (schema document) are mapped to a document tree model and a Schema tree model respectively. The document tree model usually consists of element nodes connected in ancestor-descendant order, text values connected to leaf nodes, and attribute nodes connected to their element nodes. The Schema tree model consists of element nodes connected in ancestor-descendant order and attribute nodes connected to their element nodes.

Step 2, divide the query path: the input query path is divided into a group of conditional branch paths, each ending with a predicate, and a target branch path.

Step 3, build the element table: by conventional means, the Schema tree model is scanned, and the name (name), preorder traversal value (pre), postorder traversal value (post), and document identifier (id) of each node in the Schema tree model (element nodes and attribute nodes) are stored in the corresponding table fields to form the Schema element table.

Step 4, build the structure table: the preorder traversal value (pre) of each node is listed in order together with the preorder traversal values (leaf-pre) of all leaf nodes (target nodes) on the paths that contain it and the corresponding document identifier (id), forming the Schema structure table.

Step 5, form the XML path prefix codes: the outgoing edges of each node in the document tree model are classified as attribute edges or element edges, and each class is assigned path codes in natural-number order, with a distinguishing mark on attribute edges. Each node of the XML document carries the same preorder traversal value (pre) as its corresponding node in the Schema. The path codes of the edges passed from the root node to a given node are then concatenated in order to form the path prefix code of that node.

Step 6, form the XML value table: for every leaf node, its preorder traversal value in the Schema, the path prefix code (Path-lable) from the root node to the leaf node, and the text value attached to it are listed in order to form the XML value table. The path prefix code is the in-order concatenation of the path codes from the root node to the leaf node.

Step 7, determine the path prefix codes: using the name of the leaf node of each branch path obtained by the division, the corresponding preorder traversal value and document identifier are found in the Schema element table. With that preorder traversal value and document identifier, the corresponding leaf-node preorder traversal value is found in the Schema structure table. With that leaf-node preorder traversal value, the corresponding path prefix codes are found in the XML value table.

Step 8, couple the paths to obtain the result: the path prefix codes of two of the branch paths are first compared position by position from left to right. If the codes at the same position are identical, that code is stored at that position of the coupled path. If a code carrying the attribute-node distinguishing mark appears, it is not compared and is stored, with its mark, directly at the corresponding position of the coupled path. If one path prefix code is exhausted, the remaining codes of the other are stored at the following positions of the coupled path. This yields an intermediate coupled path code, and the text value that the longer of the two path prefix codes has in the XML value table is taken as the intermediate coupling result. The intermediate coupled path code is then compared in the same way with the path prefix code of each remaining branch path, giving an updated intermediate coupled path code and intermediate coupling result in each round, until the path codes of all branch paths have been coupled. The final coupling result is the result of the index query.

If the codes at the same position differ during this process, the two path codes cannot be joined, that is, the two corresponding paths cannot be connected, and processing skips to the next round of coupling or exits.

As a further improvement of the invention, when a new node is inserted, a predetermined marker (for example ".") and a sequence code are appended to the path code of the corresponding path of the new node to form the path code of the new node.

In summary, the present invention has the following advantages:

1) Using the path information carried by the Schema nodes, the matching paths and target values are looked up in the XML value table, which overcomes the large XML scanning volume of traditional methods. When branch paths are joined, a simple coupling step is enough and no large number of tedious node join operations is needed, so query efficiency is significantly higher than with existing indexing methods.

2) The structural information of the Schema is used to match the structure of the query path. The ancestor-descendant relationship between nodes can be determined from the containment relationship between code intervals. If a matching path exists in the Schema, the query proceeds in the corresponding XML document; if not, the corresponding XML document is not scanned at all, which avoids pointless operations.

3) The structural information of the Schema is used to match the structure of the query path. From all the leaf nodes that correspond to a node in the Schema structure, every possible path containing that node is matched, which solves the regular path query problem.

4) The corresponding XML document is scanned, each leaf node is given the code leaf-pre of the Schema node at the same position, and its path is encoded. The Schema leaf node code leaf-pre, the path code, and the value of the leaf node are stored together in the value table. The path information and text value of a node can therefore be obtained from the value table without scanning the XML document.

5) The query path is divided at the nodes that carry text values (attribute nodes and leaf nodes), so the number of branch paths is determined by the number of text values to be queried, and query time is independent of the length of the query path.

6) When a new node is to be inserted, the codes of the other paths in the tree model do not need to change. It is enough to append the predetermined marker to the path code of the left sibling of the new node, which is convenient and does not affect further index queries made with the method of the invention.

In short, by introducing XML path codes the invention provides an effective XML indexing method that quickly matches query paths against XML documents and obtains the query results, and it is generally applicable.

Description of drawings

The present invention is further illustrated below in conjunction with accompanying drawing.

Figure 1A shows the XML document of a case history.

Figure 1B shows the tree-model of XML document.

Fig. 2 A shows the Schema of XML document correspondence.

Fig. 2 B shows the tree-model of Schema.

Fig. 3 shows the Schema list of elements.

Fig. 4 shows the structural table of Schema.

Fig. 5 shows the path code of XML.

Fig. 6 shows XML value table.

Fig. 7 shows the tree model after a dynamic update of the XML document.

Specific embodiment

The XML document indexing method based on path coding of the present invention is described in detail below with a simplified embodiment.

[1] Build the tree models from the XML document and its Schema

Figure 1A is an XML case history document. Figure 1B is the tree model to which the XML document is mapped, by conventional means, according to its node structure. This document tree model consists of element nodes connected in ancestor-descendant order, text values connected to leaf nodes, and attribute nodes connected to their element nodes. Fig. 2A is the Schema of the document in Figure 1A. It is the pattern that the XML document must follow, and it is used to validate the syntactic structure of the XML, that is, the validity of the document object units and their structure, such as element types, the nesting of elements, attribute types, the data types of attribute values, and whether attribute values are optional. Fig. 2B is the tree model to which the Schema is mapped, by conventional means, according to its node structure. This Schema tree model consists of element nodes connected in ancestor-descendant order and attribute nodes connected to their element nodes.

Both an XML document and its Schema can be mapped to an ordered, edge-labeled tree, called the tree model and written T = (V, r, E, tag, label), where:

(1) V is the set of XML nodes.

(2) r ∈ V is the root node of the tree.

(3) E is the set of edges, and: 1. V = {r} ∪ V_E ∪ V_A ∪ V_T, where V_E, V_A and V_T are the sets of element nodes, attribute nodes and text nodes respectively; 2. E = E_E ∪ E_A ∪ E_T, where E_E is the set of element edges, E_A is the set of attribute edges, and E_T is the set of text edges.

(4) tag = tag_E ∪ tag_A ∪ tag_T, where: 1. the function tag_E: V_E → <name, nodetype> assigns each element node a pair of strings giving the element name and the node type of that element node. The node type is "EE", "ET" or "EN", meaning an element node whose content is child elements, text, or empty respectively; 2. the function tag_A: V_A → <name, value, valuetype> assigns each attribute node a triple of strings giving the attribute name, the attribute value, and the type of the attribute value; 3. the function tag_T: V_T → <text, texttype> assigns each text node a pair of strings giving the content of the text node and the type of that content.

(5) The function label: V → string assigns each node an identifier id that is unique within the document.

This logical data model defines only the main data that make up an XML document (elements, attributes and text) and ignores secondary data such as processing instructions and comments.

According to the tree model definition, the query data model of the XML document above can be written as T = (V, r, E, tag, label), where:

V_E = {patient, name, age, gop information, ...}, V_A = {case history id}; r = {case history};

E_E = {patient-name, patient-age, patient-gop information, ...}, E_A = {patient-case history id};

tag_E = {<patient, EE>, ...}, tag_A = {<case history id, ET, unsignedByte>};

label = {<1, 10>, <2, 9>, <3, 1>, <4, 2>, ...}.

[2] Divide the query path

The input query path is divided into a group of simple conditional branch paths, each containing only one predicate constraint, and a target branch path. For example, the path "case history/patient[case history id="1"]/gop information/chemical examination" can be divided into the simple conditional branch path "case history/patient[case history id="1"]", which contains only one predicate constraint, and the target branch path "case history/patient/gop information/chemical examination". Here case history id is an attribute node, the content of the brackets [] is the predicate constraint, and a path with a predicate constraint is a condition path.
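The division described above can be sketched in Python as follows. This is only an illustrative sketch and not part of the original disclosure: the function name `split_query` and the regular expression are assumptions, and the sketch assumes that predicates are written in square brackets and contain no "/" character.

```python
import re
from typing import List, Tuple

# A predicate is anything enclosed in square brackets (assumed syntax).
PREDICATE = re.compile(r"\[[^\]]*\]")

def split_query(path: str) -> Tuple[List[str], str]:
    """Split a query path into conditional branch paths and the target branch path.

    Each conditional branch path ends with exactly one predicate; the target
    branch path is the query path with every predicate removed.
    """
    bare_steps: List[str] = []   # step names seen so far, without predicates
    conditions: List[str] = []
    for step in path.split("/"):
        bare_steps.append(PREDICATE.sub("", step).strip())
        for predicate in PREDICATE.findall(step):
            # The branch runs from the root to this step and ends with the predicate.
            conditions.append("/".join(bare_steps) + predicate)
    return conditions, "/".join(bare_steps)

query = 'case history/patient[case history id="1"]/gop information/chemical examination'
conditions, target = split_query(query)
```

For the query path of this embodiment, `conditions` holds the single condition path and `target` holds the predicate-free target branch path.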

[3] Build the Schema element table

The Schema tree model in Fig. 2B carries the Dietz interval code, which supports many kinds of XML query efficiently. This code assigns each node u the tuple (pre(u), post(u), dep(u), id). pre(u) is the preorder traversal value of node u, post(u) is its postorder traversal value, dep(u) is its depth, which serves as an auxiliary identifier or check, and id identifies the Schema document. The interval of a descendant node is contained in the interval of its ancestor node: if node u is an ancestor of node v, then pre(u) < pre(v) ∧ post(v) < post(u).

Corollary 1: for any two nodes u and v in the tree, if pre(u) < pre(v) ∧ post(v) < post(u) ∧ dep(v) - dep(u) = 1, then u is the parent of v.

By conventional means, the Schema tree model is scanned, and the name (name), preorder traversal value (pre), postorder traversal value (post), node depth (dep) and document identifier (id) of each node in the Schema tree model (element nodes and attribute nodes) are stored in the corresponding table fields, forming the Schema element table shown in Fig. 3. This table contains the syntactic structure information of every element in the Schema. When the user enters a query path, it is enough to find in the table the elements that also occur in the query path and to test their structural relationships from their codes in order to complete the path matching.
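The interval code and the two structural tests can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the disclosed implementation: the names `interval_encode`, `is_ancestor` and `is_parent` are hypothetical, the root depth is assumed to be 1, node names are assumed to be unique, and the Schema tree is a reduced seven-node version of the embodiment, so its post values differ from those of Fig. 3.

```python
from typing import Dict, Tuple

Tree = Tuple[str, list]        # (name, [child trees])
Code = Tuple[int, int, int]    # (pre, post, dep)

def interval_encode(root: Tree) -> Dict[str, Code]:
    """Assign preorder, postorder and depth values to every node of the tree."""
    table: Dict[str, Code] = {}
    counters = {"pre": 0, "post": 0}

    def visit(node: Tree, dep: int) -> None:
        name, children = node
        counters["pre"] += 1           # numbered on entry (preorder)
        pre = counters["pre"]
        for child in children:
            visit(child, dep + 1)
        counters["post"] += 1          # numbered on exit (postorder)
        table[name] = (pre, counters["post"], dep)

    visit(root, 1)                     # assumption: the root has depth 1
    return table

def is_ancestor(u: Code, v: Code) -> bool:
    """u is an ancestor of v when v's interval lies inside u's interval."""
    return u[0] < v[0] and v[1] < u[1]

def is_parent(u: Code, v: Code) -> bool:
    """Corollary 1: an ancestor exactly one level above is the parent."""
    return is_ancestor(u, v) and v[2] - u[2] == 1

# Reduced, hypothetical Schema tree (attribute node listed before element children).
schema: Tree = ("case history", [
    ("patient", [
        ("case history id", []),
        ("name", []),
        ("age", []),
        ("gop information", [("chemical examination", [])]),
    ]),
])
codes = interval_encode(schema)
```

In this reduced tree the pre values of "case history id", "name" and "chemical examination" are 3, 4 and 7, the same pre values the embodiment uses for these nodes.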

[4] Build the Schema structure table

Fig. 4 is the Schema structure table, formed by listing, in order, the preorder traversal value (pre) of each node together with the preorder traversal values (leaf-pre) of all leaf nodes (target nodes) on the paths that contain it and the corresponding document identifier (id). Given the pre code of a node, the structure table yields the pre values of the leaf nodes of all paths that contain that node. Users are ultimately interested in text values, so the target node of a query path is always a leaf node. When the Schema is matched against the query path, the leaf nodes of the paths containing the query path are found, and when the XML is queried it is then enough to locate the corresponding leaf nodes by their pre values, without querying the whole XML document again.
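One way to derive such a structure table from interval codes is sketched below in Python. The table contents are a reduced, hypothetical version of the embodiment (seven nodes, document identifier omitted), and the name `build_structure_table` is an assumption. A leaf is taken to lie on a path through a node when its interval is contained in that node's interval, so a leaf maps to itself.

```python
from typing import Dict, List, Tuple

# name -> (pre, post, is_leaf); reduced hypothetical element table.
ELEMENTS: Dict[str, Tuple[int, int, bool]] = {
    "case history": (1, 7, False),
    "patient": (2, 6, False),
    "case history id": (3, 1, True),
    "name": (4, 2, True),
    "age": (5, 3, True),
    "gop information": (6, 5, False),
    "chemical examination": (7, 4, True),
}

def build_structure_table(
    elements: Dict[str, Tuple[int, int, bool]]
) -> Dict[int, List[int]]:
    """Map the pre value of every node to the leaf-pre values in its subtree."""
    leaves = [(pre, post) for pre, post, is_leaf in elements.values() if is_leaf]
    table: Dict[int, List[int]] = {}
    for pre, post, _ in elements.values():
        # A leaf belongs to the node when its interval is contained in the node's.
        table[pre] = sorted(
            leaf_pre for leaf_pre, leaf_post in leaves
            if pre <= leaf_pre and leaf_post <= post
        )
    return table

structure = build_structure_table(ELEMENTS)
```

Looking up `structure[6]` (the pre value of "gop information" in this reduced table) returns only the leaf-pre of "chemical examination", so a query never has to visit the other leaves.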

[5] Form the XML path prefix codes

Existing prefix codes all encode nodes. A prefix code preserves the path information of a node: the code of any element node other than the root is a prefix of the codes of its descendant element nodes, so deciding the ancestor-descendant relationship between nodes amounts to testing whether one code is a prefix of the other. With node-coding methods, the node join operations needed to connect branch paths are too complex, whereas XML query techniques based on path indexes solve the single-path query problem effectively.

The outgoing edges of each node in the document tree model are classified as attribute edges or element edges, and each class is assigned path codes in natural-number order, with a distinguishing mark on attribute edges. Each node of the XML document carries the same preorder traversal value (pre) as its corresponding node in the Schema. As shown in Fig. 5, the outgoing edges of a node are assigned path codes from left to right, separately for attribute edges and element edges. These codes are called sequence codes, and the path code of an attribute edge is enclosed in [] as its distinguishing mark. The path codes of the edges passed from the root node to a node are concatenated in order to form the path prefix code of that node, so the path prefix code of a long path is made up of the path codes of all the edges it passes.
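The edge numbering can be made concrete with a small Python sketch. It is illustrative only: the `Node` class and the function `prefix_rows` are hypothetical names, the document tree is a reduced version of the case history example, and the text value "?" for "age" is a placeholder because the source does not give it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    name: str
    attrs: List[Tuple[str, str]] = field(default_factory=list)  # (name, value)
    children: List["Node"] = field(default_factory=list)
    text: Optional[str] = None

def prefix_rows(node: Node, prefix: str = "") -> List[Tuple[str, str, str]]:
    """Return (leaf name, path prefix code, text value) for every value-carrying node."""
    rows: List[Tuple[str, str, str]] = []
    # Attribute edges are numbered separately and marked with [].
    for k, (attr_name, value) in enumerate(node.attrs, start=1):
        rows.append((attr_name, f"{prefix}[{k}]", value))
    # Element edges are numbered from left to right, starting at 1.
    for k, child in enumerate(node.children, start=1):
        code = f"{prefix}{k}"
        if child.text is not None:
            rows.append((child.name, code, child.text))
        rows.extend(prefix_rows(child, code))
    return rows

doc = Node("case history", children=[
    Node("patient", attrs=[("case history id", "1")], children=[
        Node("name", text="Xiao Wang"),
        Node("age", text="?"),  # placeholder value, not from the source
        Node("gop information", children=[
            Node("chemical examination", text="routine urinalysis"),
        ]),
    ]),
    Node("patient", attrs=[("case history id", "2")], children=[
        Node("name", text="Xiao Zhang"),
    ]),
])
rows = prefix_rows(doc)
```

The sketch assumes single-digit edge codes, as in the embodiment; a node with more than nine outgoing element edges would need a delimiter between codes.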

[6] Form the value table of the XML document

For every leaf node, its preorder traversal value in the Schema, the path prefix code (Path-lable) of the path containing it, and the text value attached to it are listed in order to form the XML document value table of Fig. 6. The path prefix code is the in-order concatenation of the path codes from the root node to the leaf node. For example, the path code of the path "case history/patient[case history id="1"]/name" is 1[1]1.

Once Schema path matching has produced the pre value of a leaf node, that is, its leaf-pre, the value table gives the prefix codes of the paths that contain the leaf node and the text values of the leaf node. For example, if the leaf node of the query path is "name", whose pre in the Schema is 4, the value table gives the path prefix codes 11 and 21 and the text values "Xiao Wang" and "Xiao Zhang". With the value table, the XML document does not need to be queried again, which greatly improves query efficiency.

[7] Determine the path prefix codes

The path codes are found in the value table from the leaf node code leaf-pre of each branch path. For example, the path "case history/patient[case history id="1"]/gop information/chemical examination" is divided into "case history/patient[case history id="1"]" and "case history/patient/gop information/chemical examination". The Schema element table of Fig. 3 gives the preorder traversal values 3 and 7 for the nodes "case history id" and "chemical examination" respectively, with document identifier 1. With these preorder traversal values and the document identifier, the Schema structure table of Fig. 4 gives the corresponding leaf-node preorder traversal values leaf-pre, 3 and 7. The value table of Fig. 6 then gives two path codes, "1[1]" and "2[1]", for the node whose leaf-pre is 3, and the path code "131" for leaf-pre 7.
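The three lookups of this step can be sketched in Python with plain dictionaries. The table contents below are the values quoted in this embodiment; the dictionary layout and the function name `resolve` are assumptions made for illustration and do not reproduce the layout of Figs. 3, 4 and 6.

```python
from typing import Dict, List, Tuple

# Element table: leaf name -> (pre, document id).
ELEMENT_TABLE: Dict[str, Tuple[int, int]] = {
    "case history id": (3, 1),
    "chemical examination": (7, 1),
}

# Structure table: (pre, document id) -> leaf-pre values.
STRUCTURE_TABLE: Dict[Tuple[int, int], List[int]] = {
    (3, 1): [3],
    (7, 1): [7],
}

# Value table rows: (leaf-pre, path prefix code, text value).
VALUE_TABLE: List[Tuple[int, str, str]] = [
    (3, "1[1]", "1"),
    (3, "2[1]", "2"),
    (7, "131", "routine urinalysis"),
]

def resolve(leaf_name: str) -> List[Tuple[str, str]]:
    """Return the (path prefix code, text value) pairs for a branch path's leaf node."""
    pre, doc_id = ELEMENT_TABLE[leaf_name]            # lookup 1: element table
    results: List[Tuple[str, str]] = []
    for leaf_pre in STRUCTURE_TABLE[(pre, doc_id)]:   # lookup 2: structure table
        results.extend(                               # lookup 3: value table
            (code, text) for lp, code, text in VALUE_TABLE if lp == leaf_pre
        )
    return results

condition_codes = resolve("case history id")
target_codes = resolve("chemical examination")
```

Neither lookup touches the XML document itself; only the three tables are read.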

[8] Form the coupled path

After the query path has been divided and the path codes determined, the branch paths must be joined by path coupling. The two branch path codes to be joined are matched position by position, starting from the first position on the left. If the codes at the same position are identical, the code is stored at the corresponding position of the coupled path and the comparison moves to the next position. If a predicate constraint [a] appears in one path, [a] does not take part in matching and is stored directly at the corresponding position of the coupled path, after which the comparison continues in order with the next position. If one path has been fully matched and the other still has codes left, the remaining codes are stored at the following positions of the coupled path. The text value of the longer of the two paths is the result of coupling the two paths.

For example, in this embodiment the path "case history/patient[case history id="1"]/gop information/chemical examination" is divided into "case history/patient[case history id="1"]" and "case history/patient/gop information/chemical examination". Following [3] to [7] above, the value table gives their path prefix codes "1[1]", "2[1]" and "131", with text values "1", "2" and "routine urinalysis" respectively. The path codes are then coupled, that is, 1[1] and 2[1] are each coupled with 131. When 1[1] is coupled with 131, the first position from the left is "1" in both, so "1" is stored in the intermediate coupling result. At the second position [] appears, so [] and the value it encloses are stored in the intermediate coupling result. At the next position 1[1] is exhausted and 131 still has "31" left, so "31" is stored directly at the following positions of the coupled path. The resulting coupled path code is 1[1]31, which shows that these two branch paths can be joined. The text value "routine urinalysis" that the longer path 131 has in the XML value table is the result of joining the two paths and is output as the required query result. If the query path was divided into more than two branch paths, the coupled path code above is used as an intermediate value and compared in the same way with the path prefix code of the next branch path, giving an updated coupled path code and the text value of the longer path. This continues until the path codes of all branch paths have been coupled, and the final coupling result is the result of the index query.

If the codes at the same position are found to differ during coupling, the coupling fails, which shows that the two paths cannot be joined, and processing should skip to the next round of coupling or exit. In this embodiment, when 2[1] is coupled with 131, the first position of 2[1] from the left is "2" and the first position of 131 is "1". The two codes do not agree, so the coupling fails, the two paths cannot be joined, and no query result is obtained.
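The coupling rule can be sketched in Python as follows. This is a minimal sketch, not the disclosed implementation: the names `tokenize` and `couple` are hypothetical, each edge code is assumed to be a single digit, and a bracketed attribute code is treated as one position that is copied without being compared, as the walkthrough above describes.

```python
import re
from typing import List, Optional

# One position is either a bracketed attribute code or a single digit (assumption).
TOKEN = re.compile(r"\[\d+\]|\d")

def tokenize(code: str) -> List[str]:
    """Split a path prefix code into its positions, e.g. "1[1]31" -> 1, [1], 3, 1."""
    return TOKEN.findall(code)

def couple(a: str, b: str) -> Optional[str]:
    """Couple two path prefix codes; return None when they cannot be joined."""
    x, y = tokenize(a), tokenize(b)
    i = j = 0
    out: List[str] = []
    while i < len(x) and j < len(y):
        if x[i].startswith("["):      # attribute code: stored, not compared
            out.append(x[i])
            i += 1
        elif y[j].startswith("["):
            out.append(y[j])
            j += 1
        elif x[i] == y[j]:            # same code at the same position
            out.append(x[i])
            i += 1
            j += 1
        else:                         # codes differ: the paths cannot be joined
            return None
    # One code is exhausted: append whatever remains of the other.
    out.extend(x[i:])
    out.extend(y[j:])
    return "".join(out)

joined = couple("1[1]", "131")
failed = couple("2[1]", "131")
```

With the codes of this embodiment, `joined` is "1[1]31" and `failed` is None, matching the two cases described above. With more than two branch paths, the returned code would be passed back to `couple` together with the next branch path code.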

[9] Dynamic update of the XML document

Dynamic updating requires that the validity of an update be verified before it is applied, so that the updated XML document still conforms to the structural information of its XML Schema and no illegal element or attribute node appears. This keeps the XML document consistent with the XML Schema.

When a new node is inserted, the new path must be given a path code. In this embodiment the new path code is formed by appending "." and a sequence code to the code of the path to the left of the new path. The sequence code of the first inserted child node is "1" and that of the second is "2". When the last digit of a sequence code would be "9", a "0" is appended after it, so the sequence code of the 9th child is "90", that of the 10th child is "91", and so on.

As shown in Fig. 7, when the path "gop information/chemical examination/routine blood test" is inserted, a path code must be assigned for "gop information/chemical examination": the left path code "1" is extended with "." and the sequence code "1", so the new path code is "(1.1)". The text value is stored in the corresponding value table. In this way the codes of the other paths in the tree model do not need to change.

Thus, in this embodiment the Schema is interval-coded and its structural information is used to process the query path, which addresses the branch path query problem of index methods based on XML document coding. The XML document is given path prefix codes, which solves the single-path query problem effectively and overcomes problems such as the complexity of branch path join operations.
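The sequence-code rule can be read as follows, sketched in Python. This is one interpretation of "and so on" and is an assumption: after "90" to "98" the sketch continues with "990", so that the codes keep sorting in insertion order when compared as strings. The function names and the parenthesized output format, which copies the "(1.1)" example, are likewise assumptions.

```python
def sequence_code(n: int) -> str:
    """Sequence code of the n-th inserted child (n starts at 1).

    Children 1 to 8 get "1" to "8"; the 9th gets "90", the 10th "91", up to
    "98"; the next group is assumed to continue with "990", "991", and so on.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if n <= 8:
        return str(n)
    group, offset = divmod(n - 9, 9)
    return "9" * (group + 1) + str(offset)

def inserted_path_code(left_code: str, n: int, marker: str = ".") -> str:
    """Path code of a newly inserted path: left path code, marker, sequence code."""
    return f"({left_code}{marker}{sequence_code(n)})"

new_code = inserted_path_code("1", 1)
first_thirty = [sequence_code(n) for n in range(1, 31)]
```

Under this reading the codes never need renumbering, because a later code always compares greater than every earlier one as a string.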

Claims

1. An XML document indexing method based on path coding, carried out by an intelligent device with a central processing unit, wherein the indexing steps are:

Step 1, build the tree models: according to the node structure of the documents, the XML document and its corresponding Schema are mapped to a document tree model and a Schema tree model respectively;

Step 2, divide the query path: the input query path is divided into a group of conditional branch paths, each ending with a predicate, and a target branch path;

Step 3, build the element table: the Schema tree model is scanned, and the name, preorder traversal value, postorder traversal value, and document identifier of each node in the Schema tree model are stored in the corresponding table fields to form the Schema element table;

Step 4, build the structure table: the preorder traversal value of each node is listed in order together with the preorder traversal values of all leaf nodes on the paths that contain it and the corresponding document identifier, forming the Schema structure table;

Step 5, form the XML path prefix codes: the outgoing edges of each node in the document tree model are classified as attribute edges or element edges, and each class is assigned path codes in natural-number order, with a distinguishing mark on attribute edges; each node of the XML document carries the same preorder traversal value (pre) as its corresponding node in the Schema; the path codes of the edges passed from the root node to a given node are then concatenated in order to form the path prefix code of that node;

Step 7, determine the path prefix codes: using the name of the leaf node of each branch path obtained by the division, the corresponding preorder traversal value and document identifier are found in the Schema element table; with that preorder traversal value and document identifier, the corresponding leaf-node preorder traversal value is found in the Schema structure table; with that leaf-node preorder traversal value, the corresponding path prefix codes are found in the XML value table;

2. The XML document indexing method based on path coding according to claim 1, characterized in that: if the codes at the same position differ in step 8, the two path codes cannot be joined, that is, the two corresponding paths cannot be connected, and processing skips to the next round of coupling or exits.

3. The XML document indexing method based on path coding according to claim 2, characterized in that: the document tree model in step 1 consists of element nodes connected in ancestor-descendant order, text values connected to leaf nodes, and attribute nodes connected to their element nodes, and the Schema tree model consists of element nodes connected in ancestor-descendant order and attribute nodes connected to their element nodes.

4. The XML document indexing method based on path coding according to claim 3, characterized in that: when a new node is inserted, a predetermined marker and a sequence code are appended to the path code of the corresponding path of the new node to form the path code of the new node.

5. The XML document indexing method based on path coding according to claim 4, characterized in that: the Schema element table of step 3 also contains the node depth of each node.

6. The XML document indexing method based on path coding according to claim 5, characterized in that: the distinguishing mark of the attribute edges in step 5 is "[]".

7. The XML document indexing method based on path coding according to claim 6, characterized in that: the predetermined marker appended to the path code of the corresponding path of the new node is ".".