CN103488639B - A query method for XML data

Info

Publication number: CN103488639B
Application number: CN201210192018.7A
Authority: CN
Inventors: 郭少松; 包小源; 陈薇; 王腾蛟; 杨冬青
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2012-06-11
Filing date: 2012-06-11
Publication date: 2016-12-07
Anticipated expiration: 2032-06-11
Also published as: CN103488639A

Abstract

The present invention provides a query method for XML data, comprising the following steps: 1) storing the XML data in native XML form, where the storage structure comprises an internal node layer, which stores the nodes of the XML tree, with XML elements encoded using DDE encoding; a leaf node layer, which stores the text data of the leaf nodes of the XML tree; and an inverted layer, which stores an inverted index over the internal node layer; 2) according to the input XPath query, retrieving from the inverted layer the element sequences corresponding to the nodes of the XPath, and merge-sorting them with a loser tree; 3) pushing the merge-sorted XML elements onto stacks and popping them in order, and obtaining the query result from a buffer. The invention can process XPath expressions containing the keyword "OR" and the wildcard "*", and does so with high efficiency.

Description

A query method for XML data

Technical field

The invention belongs to the field of database technology and relates to the storage and querying of semi-structured XML data, and in particular to an XML data query method that effectively supports the XML query language XPath.

Background art

As more and more application systems use XML as the standard format for publishing and exchanging data, the volume of XML data has grown rapidly. A recent report issued by IDC (Internet Data Center) shows that 29% of the IT departments of the 500 enterprises surveyed already make heavy use of XML documents and XML databases. Managing XML data effectively has become an urgent problem.

Quickly and accurately finding all elements in an XML database that match an XPath is the core operation of XML query processing. For example, consider the XPath expression book[title='XML']//author[fn='Jane' AND ln='Doe']. An author node matched by this expression must satisfy: 1) it has a child node fn whose content is 'Jane'; 2) it has a child node ln whose content is 'Doe'; 3) it is a descendant of a book node, and that book node has a title node whose content is 'XML'.
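The three matching conditions above can be illustrated with a small sketch. It uses Python's standard `xml.etree.ElementTree`, whose limited XPath subset has no AND keyword, so the conjunction is written as two chained predicates. The sample document and the `library` wrapper element are assumptions made for illustration and are not part of the invention.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: only the first author satisfies all three conditions.
DOC = """
<library>
  <book>
    <title>XML</title>
    <chapter>
      <author><fn>Jane</fn><ln>Doe</ln></author>
      <author><fn>John</fn><ln>Doe</ln></author>
    </chapter>
  </book>
  <book>
    <title>SQL</title>
    <author><fn>Jane</fn><ln>Doe</ln></author>
  </book>
</library>
""".strip()

root = ET.fromstring(DOC)

# [fn='Jane'][ln='Doe'] stands in for [fn='Jane' AND ln='Doe'].
matches = root.findall(".//book[title='XML']//author[fn='Jane'][ln='Doe']")
names = [(a.findtext("fn"), a.findtext("ln")) for a in matches]
print(names)
```

The second author fails condition 1, and the author in the second book fails condition 3 because its book has a different title.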

Among XML pattern matching methods, typical examples are the TurboXPath algorithm for XML data streams, developed for DB2, and the TwigStack algorithm proposed by academia in 2002.

In the TwigStack algorithm, each node q of the XPath is associated with T_q and S_q. T_q denotes an element sequence: q is a tag name in the XPath, and T_q contains all elements of the XML document whose name matches q, arranged in document order. S_q denotes an element stack that stores elements whose name matches q; once the algorithm has passed the end tag of an element in the stack, that element is popped. The algorithm operates only on the elements in T_q and skips irrelevant XML elements, so its I/O efficiency is high. However, TwigStack cannot handle two cases. The first is an XPath with the wildcard "*", such as //a/*[b]/c: because TwigStack uses interval encoding, even when element a is known to differ from elements b and c by 2 levels, it cannot be determined whether b and c have the same parent. The second is that TwigStack can only process an XPath whose twig branches are in an 'AND' relation, such as //a[b AND c]/d, and cannot process the keyword 'OR', as in //a[b OR c]/d.

TurboXPath is the query matching algorithm that DB2 uses for XML streams. It uses neither indexes nor encoding; the XML elements in the stream arrive in document order, and it can easily process an XPath with the keyword 'OR'. TurboXPath is functionally more complete, but for XML data stored in a database it scans the XML document from beginning to end, so its I/O cost is very high, especially for large XML documents.

Summary of the invention

The object of the invention is to address the problems of the prior art by providing a new query method for XML data that can process XPath expressions containing the keyword "OR" and the wildcard "*" with high efficiency.

To achieve the above object, the invention adopts the following technical solution:

A query method for XML data, comprising the steps of:

1) storing the XML data in native XML form, where the storage structure comprises: an internal node layer, which stores the nodes of the XML tree arranged in document order, with XML elements encoded using DDE encoding; a leaf node layer, which stores the text data of the leaf nodes of the XML tree; and an inverted layer, which stores an inverted index over the internal node layer, each index entry being the sequence of elements with the same tag name arranged in document order;

2) according to the input XPath query, retrieving from the inverted layer the element sequences corresponding to the nodes of the XPath, and merge-sorting them with a loser tree;

3) pushing the merge-sorted XML elements onto stacks and popping them in order, and obtaining the query result from a buffer.

Further, in the internal node layer, the information of each record includes: the integer identifier to which the node name is mapped, the DDE code, and the node type.

Further, in the inverted layer, the information of each element includes: the element type, the address of the element in the internal node layer, and the DDE code.

Further, the internal node layer points to the leaf node layer through pointers.

Further, merge-sorting with the loser tree means comparing the DDE codes of two elements to obtain their relative order, and taking the earlier element as the winner and the later element as the loser.

Further, each node q of the XPath has two data structures: an element sequence T_q and a stack S_q. T_q contains all elements of the XML document whose name matches q, arranged in document order; S_q stores elements whose name matches q and is subject to push and pop operations.

Further, on a push operation, only the ancestors of the new element are retained in the stacks, so that all elements in the stacks are in ancestor-descendant relations.

Further, if element e is to be pushed onto S_E, and the parent node of node E in the XPath is A, then the conditions for pushing e are:

a) S_A contains an element that has not been delinked, where delinking means deleting records that are not ancestors of e from the linked list that connects all elements in the stacks;

b) e is a child of the element in S_A that has not been delinked and is nearest the stack top;

c) the type of e is the same as the type of E in the XPath.

Further, when the wildcard "*" appears in the XPath, three new axes are derived: the grandparent-child axis, the absolute ancestor-descendant axis, and the special ancestor-descendant axis, and these three axes are used to equivalently rewrite the XPath containing the wildcard "*".

The XML data query method of the invention solves the problem that the TwigStack method cannot support an XPath with the keyword "OR" or the wildcard "*". For query processing of XML data in a database, it has the same I/O efficiency as the TwigStack method and is more efficient than the TurboXPath method. At present, more and more application systems use XML as the standard format for publishing and exchanging data, and the volume of XML data is growing rapidly. Fields such as finance, healthcare, e-government, and news already use their own XML standards to exchange data between departments and between enterprises. The method of the invention can be widely applied in these fields to query and manage XML data efficiently.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the XML data query method of an embodiment of the invention.

Fig. 2 is a schematic diagram of the native XML storage scheme of the embodiment.

Fig. 3 is a schematic diagram of the internal node layer in Fig. 2.

Fig. 4 is a flow chart of the query //a[//c]/b in the embodiment.

Fig. 5 is a schematic diagram of the push and pop operations for the query //a[//c]/b in the embodiment.

Fig. 6 is a schematic diagram of the push and pop operations for the query //a/*[c]/b in the embodiment.

Detailed description of the invention

The invention is described in detail below through specific embodiments and with reference to the accompanying drawings.

Fig. 1 is a flow chart of the XML data query method of the invention. The specific steps are as follows:

1) Store the XML data in the database in native XML form.

The XML data query method of the invention is a holistic twig join method. Compared with earlier structural joins, holistic twig join techniques avoid a large number of useless intermediate results. The method is based on native XML storage and applies DDE encoding to XML elements. The native storage mechanism preserves the document order of XML elements, so the subdocument rooted at an element can be retrieved from the physical address of that element's start tag. DDE encoding is used to determine the common structural relations between XML elements (ancestor-descendant, parent-child, sibling, and so on).

The storage design of the invention is divided into three layers: the internal node layer, the leaf node layer, and the inverted layer, as shown in Fig. 2.

a) Internal node layer

The nodes of the XML tree are arranged in document order and stored in the internal node layer. Each record of this layer is one XML tree node. The information of each record includes the integer identifier tagID to which the node name is mapped (convenient to store and to compare), the DDE code, the node type (element, attribute, text), and so on. Fig. 3 is an example of a simple internal node layer, in which (a) is the XML tree and (b) is the sequential storage corresponding to (a). Records beginning with "/" are end-tag records. The two leaf nodes "Database" and "25.00" appear here as pointers; their actual content is stored in the leaf node layer.

XML encoding is used to determine the structural relations between XML elements. The TwigStack algorithm cannot process an XPath with the wildcard "*" because the interval encoding it uses cannot determine the sibling axis. The invention uses DDE encoding, which has the following advantages over interval encoding:

Every axis that interval encoding can determine can also be determined by DDE, and DDE can additionally determine the sibling axis, which interval encoding cannot;

DDE encoding supports updates to the XML document well: when the document changes, existing codes need not be changed, which interval encoding cannot achieve.
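The structural tests that the description attributes to DDE codes can be sketched as follows. This is only an illustration under a simplifying assumption: labels are modeled as Dewey-style tuples of integers (one component per level), which is how a DDE label is assumed to look for a document that has not been updated. The update handling that distinguishes DDE from plain Dewey labels is omitted, and all function names are hypothetical.

```python
from typing import Tuple

# Assumed label form: (1, 2, 1) is the first child of the second child of the root (1,).
Label = Tuple[int, ...]

def level(x: Label) -> int:
    """Depth of the node; the root is at level 1."""
    return len(x)

def is_ancestor(a: Label, d: Label) -> bool:
    """a is a proper ancestor of d if a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

def is_parent(p: Label, c: Label) -> bool:
    """p is the parent of c if c extends p by exactly one component."""
    return len(c) == len(p) + 1 and c[:-1] == p

def is_sibling(x: Label, y: Label) -> bool:
    """Distinct nodes with the same parent prefix are siblings."""
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]

def precedes(x: Label, y: Label) -> bool:
    """Document order: tuple comparison places an ancestor before its descendants."""
    return x < y

a, b, c = (1,), (1, 1, 2), (1, 1, 1)
print(is_ancestor(a, b), is_sibling(b, c), precedes(c, b), level(b) - level(a))
```

The sibling test is the one that matters for the wildcard case: an interval code alone gives the level difference but not the shared parent, whereas a prefix-based label exposes both.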

b) Leaf node layer

Each record stores the text data of one leaf node of the XML tree. The internal node layer holds pointers into this layer, and the physical page containing the text data is located through these pointers.

c) Inverted layer

The inverted layer is similar to the inverted index in an IR system. In the inverted layer, all elements with the same tag name form one sequence, arranged in document order. The information of each element in a sequence includes its address in the internal node layer, the element type, the DDE code, and so on. The information stored in the inverted layer is sufficient to complete the matching of an XPath query; the address of each matched element is then used to fetch, from the internal node layer, the subdocument between the element's start and end tags. In the inverted layer shown in Fig. 2, E1, E2, and E3 denote XML element information.
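The three layers can be pictured with a small in-memory sketch. It is not the storage format of the invention: records are plain dictionaries, addresses are list indexes rather than physical page addresses, Dewey-style tuples stand in for DDE codes, and the text pointer is kept on the element record instead of on a separate text record. The function name `build_layers` and the field names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_layers(xml_text):
    """Build toy versions of the internal node, leaf node and inverted layers."""
    root = ET.fromstring(xml_text)
    tag_ids = {}                   # tag name -> integer tagID
    inner = []                     # internal node layer, in document order
    leaves = []                    # leaf node layer: text data only
    inverted = defaultdict(list)   # inverted layer: tagID -> entries in document order

    def visit(elem, label):
        tag_id = tag_ids.setdefault(elem.tag, len(tag_ids))
        address = len(inner)       # position of the start-tag record
        record = {"tagID": tag_id, "label": label, "type": "element", "text_ptr": None}
        text = (elem.text or "").strip()
        if text:                   # store the text in the leaf layer, keep a pointer here
            record["text_ptr"] = len(leaves)
            leaves.append(text)
        inner.append(record)
        inverted[tag_id].append({"address": address, "type": "element", "label": label})
        for i, child in enumerate(elem, start=1):
            visit(child, label + (i,))
        inner.append({"tagID": tag_id, "label": label, "type": "end"})  # end-tag record

    visit(root, (1,))
    return tag_ids, inner, leaves, dict(inverted)

tag_ids, inner, leaves, inverted = build_layers(
    "<book><title>Database</title><price>25.00</price></book>"
)
title_entry = inverted[tag_ids["title"]][0]
print(leaves[inner[title_entry["address"]]["text_ptr"]])
```

Query matching would read only the `inverted` lists; `inner` and `leaves` are touched afterwards, to materialize the matched subdocuments.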

2) According to the input XPath query, retrieve from the inverted layer the element sequences corresponding to the nodes of the XPath, and merge-sort them with a loser tree.

For each node q of the XPath there are two data structures: an element sequence T_q and a stack S_q. T_q contains all elements of the XML document whose name matches q, arranged in document order. S_q stores, while the algorithm runs, elements whose name matches q; when a new element is pushed, the elements that are not its ancestors are popped.

The method of the invention may be called the "TurboStack" method. Suppose the XPath has n nodes q₁, q₂, ..., q_n. The sequence T_qi (1 ≤ i ≤ n) corresponding to each node is obtained from the inverted layer, and the XML elements in each T_qi are ordered. The input to the TurboStack method is a stream of XML elements in document order, so the n element sequences T_q1, T_q2, ..., T_qn must be merge-sorted into a single sequence in document order, which serves as the input to the algorithm.

The merge sort can be performed with a loser tree, relying on the DDE codes: for the codes dde1 and dde2 of two elements, comparing the two codes determines the relative order of the two elements; the earlier element is taken as the winner and the later element as the loser.
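A minimal sketch of a k-way merge with a loser tree is given below. It assumes each element is a tuple whose first component is a Dewey-style label, so that ordinary tuple comparison yields document order; a real implementation would compare DDE codes instead. The function name `loser_tree_merge` and the sample sequences are illustrative assumptions.

```python
def loser_tree_merge(sequences):
    """Merge k sorted sequences into one sorted list using a loser tree."""
    k = len(sequences)
    if k == 0:
        return []
    END = object()                       # sentinel for an exhausted sequence
    iters = [iter(s) for s in sequences]
    heads = [next(it, END) for it in iters]

    def beats(i, j):
        """True if the head of sequence i comes no later than the head of j."""
        if heads[i] is END:
            return False
        if heads[j] is END:
            return True
        return heads[i] <= heads[j]

    # Leaves sit at positions k..2k-1; internal nodes 1..k-1 store the losers.
    winner = [0] * (2 * k)
    loser = [0] * k
    for i in range(k):
        winner[k + i] = i
    for node in range(k - 1, 0, -1):
        left, right = winner[2 * node], winner[2 * node + 1]
        if beats(left, right):
            winner[node], loser[node] = left, right
        else:
            winner[node], loser[node] = right, left
    champion = winner[1]

    merged = []
    while heads[champion] is not END:
        merged.append(heads[champion])
        heads[champion] = next(iters[champion], END)
        # Replay the matches on the path from the champion's leaf to the root.
        node = (champion + k) // 2
        while node >= 1:
            if beats(loser[node], champion):
                loser[node], champion = champion, loser[node]
            node //= 2
    return merged

# Hypothetical T_a, T_c and T_b for a small document (label, tag name).
T_a = [((1,), "a"), ((1, 1), "a")]
T_c = [((1, 1, 1), "c")]
T_b = [((1, 1, 2), "b"), ((1, 2), "b")]
stream = loser_tree_merge([T_a, T_c, T_b])
print([tag for _, tag in stream])
```

After each output, only the matches on one leaf-to-root path are replayed, so each step costs about log k comparisons regardless of how long the sequences are.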

3) Push and pop the merge-sorted elements in order, and obtain the query result from the buffer.

The execution flow of the overall method is denoted Algorithm 1. Its functions are as follows: ConstructStack(q) creates a stack for node q; GetStream(q) obtains the element sequence of node q from the underlying storage; MultiMergeSort(XPath) merge-sorts the T_q of all nodes of the XPath; getPopElement(e) selects from the stacks the elements that are not ancestors of e; match(e) determines whether e can be pushed.

Pushing and popping are described in detail below.

3.1) Push

After all T_q have been merge-sorted so that the elements are in document order, the elements are pushed one by one. Pushing element e corresponds to encountering the start tag of e in the XML document. The ancestors of e remain in the stacks, because according to the XML tree structure the end tags of e's ancestor nodes have not yet been scanned. Elements that were pushed before e and are not ancestors of e have had their end tags passed and should be popped.

Thus after each push the stacks retain only the ancestors of the new element, and all elements in the stacks are in ancestor-descendant relations. All elements in the stacks are connected by a linked list called the Last Push List, abbreviated LPL. After a new element e is pushed, it is placed at the head of the LPL. Before a new element e is pushed, the records E_i are examined one by one starting from the head of the LPL, comparing E_i with e by DDE code: if E_i is not an ancestor of e, E_i is deleted from the LPL, which is called delinking; otherwise the comparison stops. Delinking marks the element as popped, although it is not actually removed from its stack.
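The delinking step on the LPL can be sketched as follows. The sketch models the LPL as a `collections.deque` whose left end is the list head, uses Dewey-style tuples in place of DDE codes, and ignores the match conditions that decide whether an element is pushed at all. `Record` and `push_with_delinking` are hypothetical names.

```python
from collections import deque

def is_ancestor(a, d):
    """Prefix test on Dewey-style labels (a stand-in for the DDE comparison)."""
    return len(a) < len(d) and d[:len(a)] == a

class Record:
    def __init__(self, name, label):
        self.name = name
        self.label = label
        self.on_lpl = True      # False once the record has been delinked

def push_with_delinking(lpl, record):
    """Delink non-ancestors from the LPL head, then place the new record at the head."""
    delinked = []
    while lpl and not is_ancestor(lpl[0].label, record.label):
        old = lpl.popleft()
        old.on_lpl = False      # marked as popped; it may still sit in its stack
        delinked.append(old.name)
    lpl.appendleft(record)
    return delinked

lpl = deque()
stream = [("a1", (1,)), ("a2", (1, 1)), ("c", (1, 1, 1)),
          ("b1", (1, 1, 2)), ("b2", (1, 2))]
trace = [push_with_delinking(lpl, Record(name, label)) for name, label in stream]
print(trace, [r.name for r in lpl])
```

Because the LPL always holds a chain of ancestors, the scan can stop at the first ancestor it meets: everything behind it is an ancestor as well.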

If e is to be pushed onto S_E, and the node above E in the XPath is A, then the conditions checked by match(e) in Algorithm 1 are:

1. S_A still contains an element that has not been delinked;

2. if the axis between A and E is a parent-child axis, and a1 is the element in S_A that has not been delinked and is nearest the stack top, then e must be a child of a1;

3. the type of e is the same as the type of E in the XPath (the type of E may be element node or attribute node).

If E is the root node of the XPath, e can be pushed as long as the third condition is met.
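The three conditions can be written as a small predicate. The sketch below makes several assumptions: the parent stack S_A is passed as a list of (label, on_lpl) pairs ordered from bottom to top, labels are Dewey-style tuples, and for an ancestor-descendant axis condition 1 alone is treated as sufficient, since the description states condition 2 only for the parent-child axis. The name `can_push` and its parameters are hypothetical.

```python
def is_parent(p, c):
    """Parent test on Dewey-style labels (a stand-in for the DDE comparison)."""
    return len(c) == len(p) + 1 and c[:-1] == p

def can_push(e_label, e_type, node_type, is_root, axis, parent_stack):
    """Return True if element e may be pushed onto S_E.

    parent_stack: S_A as a list of (label, on_lpl) pairs, bottom to top.
    axis: "PC" for parent-child or "AD" for ancestor-descendant.
    """
    if e_type != node_type:                 # condition 3
        return False
    if is_root:                             # the root needs only condition 3
        return True
    live = [label for label, on_lpl in parent_stack if on_lpl]
    if not live:                            # condition 1
        return False
    if axis == "PC":                        # condition 2
        return is_parent(live[-1], e_label)
    return True                             # assumed: AD needs no further check

# S_a holds two a elements; the upper one has already been delinked.
S_a = [((1,), True), ((1, 1), False)]
print(can_push((1, 2), "element", "element", False, "PC", S_a))
```

Skipping delinked records when looking for the nearest live element is what lets records stay physically in the stack after they have been marked as popped.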

A newly pushed record contains four pieces of information:

1. element information (tagID, DDE, type);

2. pointer P_LPL, pointing to the next record on the LPL;

3. pointer P_stack, pointing to the record in S_A that is still on the LPL and nearest the stack top;

4. match status bit status, initially false. If the record later satisfies the structural requirements that the XPath places on it, the bit is set to true.

If E is a leaf node of the XPath, the status bit is initialized to true.

If E is the output node of the XPath, e is placed in the result buffer outputBuffer.

The execution flow of pushing is denoted Algorithm 2. Its functions are as follows:

Push(e, S_E) places element e on top of S_E; Lappend_head(e, LPL) places element e at the head of the linked list LPL; Lappend(e, outputBuffer) places element e in the output buffer outputBuffer.

3.2) Pop

When a record in a stack is delinked, its P_LPL is set to NULL, but the record itself is not popped.

In the XPath, every node other than a leaf node has child nodes, and the stack of a child node is called a child stack of the parent node. When an element in the parent node's stack is delinked, elements in the child stacks are popped.

Suppose node A has two child nodes B and C, where A and B are connected by a parent-child axis and A and C by an ancestor-descendant axis. Suppose stack S_A currently holds two records a1 and a2, with a2 on top, and a2 is now to be delinked. The record a2 is not popped; what is popped are records in S_B and S_C.

First it must be determined whether record a2 satisfies the structural requirements that the XPath places on query node A, which is the job of the function matchStructure(e) in Algorithm 3:

a) The P_stack pointers of the records b₁, b₂, ..., b_n in S_B point to a2, and the P_stack pointers of the records c₁, c₂, ..., c_m in S_C also point to a2. These records have already been delinked, and their match status bits were determined when they were delinked.

b) If b and c are in an AND relation, then

a2->status = (b₁->status || ... || b_n->status) && (c₁->status || ... || c_m->status);

If b and c are in an OR relation, then

a2->status = (b₁->status || ... || b_n->status) || (c₁->status || ... || c_m->status).

If n=0 or m=0, i.e. S_B or S_C contains no record pointing to a2, then the status bit contributed by S_B or S_C is treated as false.
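The status computation above reduces to an OR within each child stack followed by an AND or OR across child stacks. A minimal sketch, assuming the status bits of the records whose P_stack points to the delinked parent record have already been collected into one list per child query node; `parent_status` is a hypothetical name.

```python
def parent_status(child_groups, relation):
    """Combine child status bits into the status of a delinked parent record.

    child_groups: one list of booleans per child query node, holding the status
                  bits of the records whose P_stack points to the parent record.
    relation:     "AND" or "OR", the relation between the child query nodes.
    """
    # any([]) is False, which covers the n=0 or m=0 case described above.
    per_child = [any(group) for group in child_groups]
    if relation == "AND":
        return all(per_child)
    if relation == "OR":
        return any(per_child)
    raise ValueError("relation must be 'AND' or 'OR'")

print(parent_status([[False, True], [True]], "AND"))  # both B and C contribute a match
print(parent_status([[True], []], "AND"))             # S_C has no record pointing to a2
print(parent_status([[True], []], "OR"))
```

The only difference between the AND and OR cases is the outer combinator, which is why supporting the OR keyword needs no extra data structure.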

All of b₁, b₂, ..., b_n in S_B are popped, because they cannot be children of a1. In S_C, those of c₁, c₂, ..., c_m whose status bit is false are popped, and the P_stack pointers of the remaining records are redirected to a1, because they are also descendants of a1.

If a2->status is now false, the records in the output buffer that belong to descendants of a2 are deleted from the buffer.

When T_root of the XPath root node root is empty and all records in S_root have been delinked, the algorithm stops, and the elements in outputBuffer are the query result.

The execution flow of popping is denoted Algorithm 3. Its functions are as follows: IsEmpty(LPL, S_E) determines whether S_E still contains elements that have not been delinked, returning true if it does not; Delete_Stacks(S_E, e) deletes the descendant elements of e from the child stacks of S_E; Delete_InterResult(outputBuffer, e) deletes the descendant elements of e from the output buffer; stack_top(S_E) returns the element in the stack that has not been delinked and is nearest the stack top; childStack(S_E) returns all child stacks of S_E; descendants(S_C, e) returns the elements of stack S_C that are descendants of e; PC(E, C) determines whether E and C are in a parent-child relation; AD(E, C) determines whether E and C are in an ancestor-descendant relation.

3.3) XPath with the wildcard "*"

The common axes of XPath are the ancestor-descendant (AD) axis, the parent-child (PC) axis, and so on. If the wildcard "*" appears in the XPath, three new axes are derived:

a) The grandparent-child (GPC) axis. For example, in a/*/c, a and c are in a grandparent-child relation two levels apart; in a/*/*/c, a and c are in a grandparent-child relation three levels apart. The invention uses /ⁿ to denote the GPC axis, where n is an integer giving the number of levels between the two nodes.

b) The absolute ancestor-descendant (AAD) axis. For example, in a/*//c or a//*//c, a and c are in an absolute ancestor-descendant relation at least two levels apart. //ⁿ denotes the AAD axis, where n is an integer giving the minimum number of levels between the two nodes.

c) The special ancestor-descendant (SAD) axis. For example, in a//*/c, a and c are in a special ancestor-descendant relation at least two levels apart. ///ⁿ denotes the SAD axis, where n is an integer giving the minimum number of levels between the two nodes.

How are AAD and SAD distinguished? In a//*[//d]//c, for example, the axes between a and d and between a and c are AAD axes, and d and c are unrelated. In a//*[d]/c, the axes between a and d and between a and c are SAD axes, and d and c are siblings. In a//*[d]//c, the axis between a and d is an SAD axis, the axis between a and c is an AAD axis, and d and c are unrelated.

GPC, AAD, and SAD are special cases of AD. They are easy to determine with DDE encoding, because a DDE code carries level information.
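Since the three axes differ from AD only in the level constraint, they can be sketched as checks on the level gap between two labels. The sketch assumes Dewey-style tuples in place of DDE codes, and it treats the SAD axis as the same level test as AAD, on the assumption that the extra sibling requirement between branches is checked separately. All function names are hypothetical.

```python
def level_gap(a, d):
    """Number of levels from ancestor a down to d, or None if a is not an ancestor of d."""
    if len(a) < len(d) and d[:len(a)] == a:
        return len(d) - len(a)
    return None

def holds_gpc(a, d, n):
    """GPC axis /n: d lies exactly n levels below a (a/*/c gives n = 2)."""
    return level_gap(a, d) == n

def holds_aad(a, d, n):
    """AAD axis //n: d lies at least n levels below a."""
    gap = level_gap(a, d)
    return gap is not None and gap >= n

def holds_sad(a, d, n):
    """SAD axis ///n: same level test as AAD; sibling constraints are checked elsewhere."""
    return holds_aad(a, d, n)

a, c = (1,), (1, 3, 2)
print(holds_gpc(a, c, 2), holds_gpc(a, c, 3), holds_aad(a, (1, 3, 2, 5), 2))
```

With interval codes the exact-gap test is possible only if a level field is stored alongside the interval, whereas here the gap is simply the difference in label length.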

An XPath with the wildcard "*" is equivalently rewritten using the three axes GPC, SAD, and AAD.

When "*" appears in the XPath as a branching node, for example in a/*[d]/c//e, which is rewritten as a[/²d]/²c//e, d and c must additionally be siblings. This case requires special attention when handling the three new axes.

Suppose a new element is to be pushed onto stack S_b, the parent node of b in the XPath is a, and the axis between a and b is one of GPC, SAD, and AAD. The conditions for pushing are:

a) S_a must contain an element that has not been delinked;

b) S_a contains an element that stands in the level relation required by the axis with the new element;

c) the type of the new element meets the requirement of b.

If the axes from a to b and from a to c are both /ⁿ or both ///ⁿ, with equal n, then b and c must be siblings. Suppose element a1 is now delinked, and b₁, b₂, ..., b_n and c₁, c₂, ..., c_m are descendants of a1. To compute whether a1 matches, the siblings are first paired into groups, for example (b₁, c₁, c₂), (b₂, b₃, c₃, c₄), ..., where the elements within each pair of parentheses are siblings. Then:

If b and c are in an AND relation, then a1->status = [b₁->status && (c₁->status || c₂->status)] || [(b₂->status || b₃->status) && (c₃->status || c₄->status)] || ...;

If b and c are in an OR relation, then a1->status = (b₁->status || c₁->status || c₂->status) || (b₂->status || b₃->status || c₃->status || c₄->status) || ....

If no pairing is possible, then a1->status = false.
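The sibling pairing can be sketched by grouping the b and c records by their parent label and evaluating each group separately. The sketch assumes Dewey-style labels, so that two records are siblings when their labels share the same prefix, and it assumes that a group counts as a pairing only if it contains at least one b record and at least one c record; the description does not spell out this last point for the OR case. `status_with_sibling_pairing` is a hypothetical name.

```python
from collections import defaultdict

def status_with_sibling_pairing(b_records, c_records, relation):
    """Compute the status of a delinked ancestor whose branches b and c must be siblings.

    b_records, c_records: lists of (label, status) for the descendants of the ancestor.
    relation: "AND" or "OR" between the two branches.
    """
    groups = defaultdict(lambda: ([], []))   # parent label -> (b bits, c bits)
    for label, status in b_records:
        groups[label[:-1]][0].append(status)
    for label, status in c_records:
        groups[label[:-1]][1].append(status)

    for b_bits, c_bits in groups.values():
        if not b_bits or not c_bits:         # assumed: no pairing without both kinds
            continue
        if relation == "AND":
            matched = any(b_bits) and any(c_bits)
        else:
            matched = any(b_bits) or any(c_bits)
        if matched:
            return True
    return False                             # no sibling group satisfied the relation

# b and c under the same wildcard node (1, 1), as in the query //a/*[c]/b.
print(status_with_sibling_pairing([((1, 1, 2), True)], [((1, 1, 1), True)], "AND"))
```

Grouping by parent label is what prevents a b under one wildcard node from being combined with a c under another.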

As to whether the records in stacks S_b and S_c stay in the stack or are deleted, the GPC axis is handled in the same way as the PC axis, and the SAD and AAD axes are handled in the same way as the AD axis.

Fig. 4 is the query flow chart for the example //a[//c]/b, in which: (a) all element sequences are merge-sorted; (b) each element is processed in order; (c) the matching results are placed in the buffer. Fig. 5 is a schematic diagram of the push and pop operations for the example shown in Fig. 4. Each record in a stack has three parts: the left part is the element information; the upper right part is the match status bit status, where F denotes false and T denotes true; the lower right part is the pointer P_stack. The bottom of the figure shows how the LPL changes. The steps of the query //a[//c]/b are as follows:

Step 1: retrieve the element sequences T_a, T_c, and T_b of nodes a, c, and b from the inverted layer.

Step 2: merge-sort T_a, T_c, and T_b with the loser tree to obtain one element sequence: first a, second a, c, first b, second b.

Step 3: push and pop these 5 elements in order:

1) The first 3 elements are all in ancestor-descendant relations and are pushed in turn. Because c is a leaf node, the status of c is true.

2) The first b element is pushed. Checking the LPL shows that the end tag of c has been passed, so c is delinked. Because b is a leaf node, the status of b is true. Because b is the output node, the first b is placed in the output buffer.

3) The second b is pushed. Checking the LPL shows that the end tags of the first b and the second a have been passed, so they are delinked. When the second a is delinked, its status becomes true, because the status of both of its child nodes b and c is true. Delinking the second a pops the first b element, because it is not a child element of the first a. The c element is not popped, because it is a descendant of the first a, and the P_stack of c is redirected to the first a. Because b is the output node, the second b element is also placed in the output buffer.

4) Finally, all elements have been processed, and the elements remaining on the LPL are delinked. When the first a is delinked, its status becomes true, because the status of both of its child nodes b and c is true. The two stacks S_b and S_c are then emptied, because S_a no longer holds any element.

Step 4: finally, the output buffer is checked and contains two results.

Fig. 6 shows the query example //a/*[c]/b, which requires the wildcard "*" to have two child nodes c and b, so that c and b are siblings. The query steps are as follows:

Step 1: equivalently rewrite the XPath as //a[/²c]/²b, i.e. a has two grandchild nodes b and c, and b and c must be siblings.

Step 2: retrieve the element sequences T_a, T_c, and T_b of nodes a, c, and b from the inverted layer.

Step 3: merge-sort T_a, T_c, and T_b with the loser tree to obtain one element sequence: first a, second a, c, first b, second b.

Step 4: push and pop these 5 elements in order:

1) The first 3 elements are all in ancestor-descendant relations and are pushed in turn. Because c is a grandchild element of the first a, the P_stack of c points to the first a element. Because c is a leaf node, the status of c is true.

2) The first b element is pushed. Checking the LPL shows that the end tag of c has been passed, so c is delinked. The first b is a grandchild element of the first a, so the P_stack of b points to the first a element. Because b is a leaf node, the status of b is true. Because b is the output node, the first b is placed in the output buffer.

3) The second b is to be pushed. Checking the LPL shows that the end tags of the first b and the second a have been passed, so they are delinked. When the second a is delinked, its status remains false, because it has no grandchild elements c and b. Stack S_a now holds only the first a element as an element that has not been delinked, but the second b element is not its grandchild, so the second b element does not satisfy the push condition.

4) Finally, all elements have been processed, and the elements remaining on the LPL are delinked. When the first a element is delinked, its status becomes true, because the status of both of its grandchild elements b and c is true and b and c are siblings. The two stacks S_b and S_c are then emptied, because S_a no longer holds any element.

Step 5: finally, the output buffer is checked and contains one result.

The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention. The scope of protection of the invention shall be defined by the claims.

Claims

1. A query method for XML data, comprising the steps of:

2) according to the input XPath query, retrieving from the inverted layer the element sequences corresponding to the nodes of the XPath, and merge-sorting them with a loser tree; wherein merge-sorting with the loser tree means comparing the DDE codes of two elements to obtain the relative order of the two elements, and taking the earlier element as the winner and the later element as the loser; and wherein, when the wildcard "*" appears in the XPath, three new axes are derived: the grandparent-child axis, the absolute ancestor-descendant axis, and the special ancestor-descendant axis, and the three new axes are used to equivalently rewrite the XPath containing the wildcard "*";

2. The method of claim 1, characterized in that, in the internal node layer, the information of each record includes: the integer identifier to which the node name is mapped, the DDE code, and the node type.

3. The method of claim 1, characterized in that, in the inverted layer, the information of each element includes: the element type, the address of the element in the internal node layer, and the DDE code.

4. The method of claim 1, characterized in that the internal node layer points to the leaf node layer through pointers.

5. The method of claim 1, characterized in that each node q of the XPath has two data structures: an element sequence T_q and a stack S_q; T_q contains all elements of the XML document whose name matches q, arranged in document order; S_q stores elements whose name matches q and is subject to push and pop operations.

6. The method of claim 1, characterized in that, on a push operation, only the ancestors of the new element are retained in the stacks, so that all elements in the stacks are in ancestor-descendant relations.

7. The method of claim 6, characterized in that, if element e is to be pushed onto S_E, and the parent node of node E in the XPath is A, then the conditions for pushing e are:

a) S_A contains an element that has not been delinked, where delinking means deleting records that are not ancestors of e from the linked list that connects all elements in the stacks;

b) e is a child of the element in S_A that has not been delinked and is nearest the stack top;

c) the type of e is the same as the type of E in the XPath.