CN100558078C

CN100558078C - Query matching method for complex twig patterns over XML stream data

Info

Publication number: CN100558078C
Application number: CNB2006101163336A
Authority: CN
Inventors: 杨卫东; 施伯乐
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2006-09-21
Filing date: 2006-09-21
Publication date: 2009-11-04
Anticipated expiration: 2026-09-21
Also published as: CN1941743A

Abstract

The invention belongs to the technical field of XML stream data processing and specifically concerns a query matching method for complex twig patterns over XML stream data. The problem is stated as follows: given a query set Q containing complex twig patterns and an XML document D, find Q′ ⊆ Q such that every q ∈ Q′ matches D. The invention defines a compact twig pattern query tree in which complex AND/OR predicates are handled as separate abstract syntax trees, and combines all twig pattern queries into a single query tree that shares common prefixes. Using the proposed runtime-stack-based algorithm, which combines top-down and bottom-up processing, it handles arbitrary complex twig queries over an XML stream efficiently in a single pass. Compared with existing systems and methods, the invention avoids producing large numbers of intermediate results and significantly improves query processing performance, particularly for large XML documents.

Description

Query matching method for complex twig patterns over XML stream data

Technical field

The invention belongs to the technical field of Extensible Markup Language (XML) stream data processing, and specifically relates to a query matching method for complex twig patterns over XML.

Background technology

With the development of the Internet, many new data-intensive applications have appeared in fields such as sensor networks, location search, network monitoring, financial analysis, online auctions, stock market systems, and traffic control systems. For example, a client of a Shanghai stock market system may ask to be notified when Sinopec stock rises above 6 yuan or falls below 5 yuan; a taxi driver may ask a traffic control system to notify him when the traffic flow at a certain intersection exceeds a certain value; and a traffic control department may ask to be notified whenever a traffic accident occurs. The common feature of such applications is that they must process continuously arriving, unbounded data streams in real time, which differs from a traditional database management system. A traditional database management system stores data persistently and, at a given moment, executes a query through a stable query plan to give an exact answer, whereas a stream data processing system emphasizes that data arrives online while the queries are stored persistently. In a stream data processing system the order in which data arrives cannot be controlled, and it is impractical to store all arriving data locally for management and querying. Stream data processing is a new technology with very broad application prospects.

Because Extensible Markup Language (XML) has become the data exchange standard on the Web, used for exchanging data between various applications and information sources, the theory and techniques for processing XML stream data deserve particular attention. The challenges posed by such applications are:

1. An XML stream data processing system usually runs on the Web, where the number of users can grow rapidly to the order of hundreds of thousands or millions.

2. User queries are usually expressed in a language such as XPath ^[1]. Because one user can submit several queries, the number of queries is especially large.

3. XML documents are recursive and hierarchical, and users can submit complex queries.

Therefore, a key problem in XML stream data processing research is how to efficiently process a large number of simultaneous user queries and return the results to users in time. Some existing methods can process XML stream data, but there is still no effective method for queries with complex XML twig patterns (which usually contain both logical OR and logical AND predicates). For example, in a digital library system a user may want the following real-time information: select the authors of papers that either have the title "XML Stream" or were published at VLDB in 2006. The query can be expressed as follows:

Q = /dblp/paper[title = 'XML Stream' or (year = 2006 and conf = 'VLDB')]//author

An XML twig pattern query is, in essence, a query with selection predicates on both the structure and the content of an XML document. Matching a set of twig patterns against XML documents that may arrive at any time is the core operation of XML stream data processing.

Because of the broad application prospects of stream data processing systems, and because XML is the data exchange standard on the Web, XML stream data processing has attracted extensive research interest. Much research adopts automaton-based methods for processing XML stream data ^{[2,3,4,5,6,7,8]}. XFilter ^[2] was the first to filter XML documents with a method based on finite state machines (FSMs). XFilter uses a separate FSM for each path query and runs all FSMs simultaneously while a document is processed. YFilter ^[3] improves on XFilter by merging all XPath queries into a single nondeterministic finite automaton (NFA) that shares the common prefixes of all queries. YFilter treats a twig query as nested path expressions and handles it by query decomposition: when a query contains nested paths, it is split into a main path and a set of extension paths, and each extension path is processed by a relatively independent NFA. Processing is divided into two steps: path matching, and post-processing of the path matching results (performing join operations). For nested paths, YFilter mainly considers queries with AND predicates and cannot handle queries with OR predicates; moreover, this post-processing approach may produce a large number of intermediate results and thus degrade system performance. XPush ^[4] compiles all XPath expressions into a single deterministic pushdown automaton (the XPush machine). Green et al. ^[4] convert the NFA to a deterministic finite automaton (DFA) and use a lazy DFA to control the runtime cost of state explosion, improving processing performance. Dan et al. ^[10] use state machines to process XPath expressions, converting one XPath expression into a network of pushdown automata; this method has difficulty handling a large number of XPath queries.

Other methods for processing XML streams mainly include index-based methods ^[9], Bloom filter-based methods ^[10], and the FiST method ^[11]. Index-Filter ^[9] processes XML stream data with index-based techniques. It uses the tags of the XML document stream to build an index of the XML document dynamically, thereby avoiding processing part of the document. In a comparison with YFilter, its authors show experimentally that Index-Filter is more effective when the number of queries is relatively small and the XML document is relatively large (without counting the cost of building the index), and that YFilter is more effective when the number of queries is relatively large and the XML document is relatively small. In Index-Filter, building the index costs a certain amount of time; in addition, the XML document cannot be processed in a single pass, so caching the document costs a certain amount of space. The Bloom filter-based XML packet filter ^[10] is an approximate query method: using a Bloom filter, it treats XPath expressions as strings and converts the matching between XPath and XML packets into string matching, thereby improving query performance. It handles only simple XPath expressions (without predicates, containing only "/", "//", and "*", denoted XP^{/, //, *}) and has a certain error rate. FiST ^[11] proposes a method for twig patterns that differs from YFilter: it converts a set of twig patterns into Prüfer sequences and matches the set of twig patterns against the XML stream data holistically. FiST considers twig patterns with AND predicates and does not consider how to handle OR predicates. To date, none of this work can process XML stream data efficiently for complex twig patterns.

Summary of the invention

The object of the invention is to propose a query matching method for complex twig patterns over XML stream data that supports real-time complex queries and large numbers of queries with high query efficiency.

The query matching method for complex twig patterns over XML stream data proposed by the invention is denoted CTPQ. The basic problem to be solved is defined as follows:

Given a query set Q containing complex twig patterns and an XML document D, find Q′ ⊆ Q such that every q ∈ Q′ matches D.

The method proceeds as follows:

(1) Each complex twig pattern query submitted by a user is represented internally as a tree that is easy to process against an XML stream (the compact twig pattern query tree).

(2) All user queries are merged into a single query tree that shares their common parts.

(3) Combining top-down and bottom-up processing, the complex twig pattern queries are matched against the XML stream data in a single pass, avoiding the generation of intermediate results.

The individual parts are described below.

1. Internal representation of complex twig pattern queries

Intuitively, a twig pattern can be represented as a query tree (called an ordinary query tree in the invention) into which AND nodes and OR nodes are inserted to represent the corresponding predicates. For example, the query tree of the XPath expression Q = /dblp/paper[title = 'XML Stream' or (year = 2006 and conf/title = 'VLDB')]//author is shown in Fig. 1. For XML stream data processing, however, this representation has two drawbacks: the tree contains a large number of AND and OR nodes, and these nodes are difficult to handle when several twig patterns are merged, for example when merging the query /a[b and c] with the query /a[b or c]. The invention therefore strips out the AND and OR predicates and represents them separately as abstract syntax trees attached to the relevant nodes; the resulting structure is called a compact twig pattern query tree. For the same example, the compact query tree corresponding to Fig. 1 is shown in Fig. 2.

Definition 1: A compact twig pattern query tree is a query tree that represents a twig pattern. A node in the query tree is called a query node (QNode). Each QNode has a unique identifier (for the query Q of the preceding example, the location step /dblp is identified as n1) and is of one of the following two types:

(1) OQNode: a location step without a predicate is called an ordinary query node (OQNode). In the compact query tree, the information associated with an OQNode comprises the node name (the name of "*" is "*") and the operator expressing the parent-child ("/") or ancestor-descendant ("//") relationship, represented as the pair <name, "/" or "//">.

(2) PQNode: a location step with a predicate is called a predicate query node (PQNode). A predicate query node is a special node in the compact query tree that connects its subtrees through AND/OR logical predicates. Besides the node identifier, the node name, and "/" or "//", it is also associated with a logical expression, which is represented internally as an abstract syntax tree. Each leaf of this abstract syntax tree maintains a reference to its corresponding node (a child of the predicate node).
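Definition 1 can be pictured with a small data-structure sketch. The Java below is only an illustration under assumptions: the class and field names (`CompactQueryTreeSketch`, `QNode`, `Expr`, `target`) are hypothetical, the node identifiers other than n1 are invented for the example, and value tests such as title = 'XML Stream' are omitted so that only the structural part of the predicate is modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a compact twig pattern query tree; all names are illustrative. */
class CompactQueryTreeSketch {
    enum Axis { CHILD, DESCENDANT }            // "/" and "//"

    /** Node of the logical-expression abstract syntax tree kept by a PQNode. */
    static class Expr {
        final String op;                       // "and", "or", or "ref" for a leaf
        final Expr left, right;                // operands of "and" / "or"
        final QNode target;                    // child query node referenced by a leaf
        Expr(String op, Expr left, Expr right, QNode target) {
            this.op = op; this.left = left; this.right = right; this.target = target;
        }
        static Expr ref(QNode n) { return new Expr("ref", null, null, n); }
        static Expr and(Expr l, Expr r) { return new Expr("and", l, r, null); }
        static Expr or(Expr l, Expr r) { return new Expr("or", l, r, null); }
    }

    /** A query node: an OQNode when predicate is null, a PQNode otherwise. */
    static class QNode {
        final String id, name;
        final Axis axis;
        final List<QNode> children = new ArrayList<>();
        Expr predicate;
        QNode(String id, String name, Axis axis) {
            this.id = id; this.name = name; this.axis = axis;
        }
        QNode add(String id, String name, Axis axis) {
            QNode c = new QNode(id, name, axis);
            children.add(c);
            return c;
        }
        boolean isPQNode() { return predicate != null; }
    }

    /** Structure of /dblp/paper[title or (year and conf)]//author, value tests omitted. */
    static QNode example() {
        QNode dblp = new QNode("n1", "dblp", Axis.CHILD);
        QNode paper = dblp.add("n2", "paper", Axis.CHILD);
        QNode title = paper.add("n3", "title", Axis.CHILD);
        QNode year = paper.add("n4", "year", Axis.CHILD);
        QNode conf = paper.add("n5", "conf", Axis.CHILD);
        paper.add("n6", "author", Axis.DESCENDANT);
        paper.predicate = Expr.or(Expr.ref(title), Expr.and(Expr.ref(year), Expr.ref(conf)));
        return dblp;
    }
}
```

In this sketch the tree itself contains no AND or OR nodes; the logic lives only in the expression attached to the paper node, which is the point of the compact representation.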

2. Matching twig patterns against the XML document stream

Definition 1 gives a representation equivalent to an ordinary query tree with AND/OR predicates. When processing the XML document stream, the invention matches nodes with predicates and their child nodes bottom-up, and matches the remaining parts top-down. To support this matching process, the invention extends Definition 1.

Definition 2: predicate subtree. In a compact twig pattern query tree, the subtree rooted at the predicate node nearest to the root is called the predicate subtree. The node structure is extended accordingly: each OQNode in the predicate subtree other than a leaf is also associated with a logical expression containing a single term, namely the identifier of its child node. If the OQNode is a leaf, it carries a logical flag that is initially true (TRUE); after the CLOSE event is encountered, the flag is assigned false (FALSE).

Definition 3: accepting node. In a compact twig pattern query tree, the predicate node nearest to the root is called the accepting node of the query that the tree represents. If, during evaluation, the logical expression associated with the accepting node evaluates to true, the document matches the query. For an XPath expression without predicates (XP^{/, //, *}), the ordinary query tree and the compact query tree are identical, and the accepting node is the leaf of the ordinary query tree; if the accepting node is reached during matching, the document matches the query.

Fig. 3 gives three example queries used to illustrate the matching process; the accepting node of each query is drawn with a bold black border. Query Q1 has no predicate, and n3 is its accepting node. Query Q2 is a twig pattern containing AND/OR predicates (for simplicity, the logical expression associated with a node is shown as text); n2 is its accepting node, with associated logical expression [n3 and (n4 or n5)], and the subtree rooted at n2 is its predicate subtree. Because n5 is an OQNode in the predicate subtree, it is associated with a logical expression containing its child node. The representation of Q3 is similar to that of Q2.

An XML stream data processing system must handle a large number of user queries simultaneously. These queries may have many parts in common, and sharing the common parts saves storage space and execution time, which is very important for system performance. The invention therefore merges all twig patterns into a single structure, called the shared-prefix compact twig pattern query tree. The merging method is as follows:

(1) For any two twig pattern queries Q_a and Q_b, let their shared-prefix compact twig pattern query tree be Q_ab.

(2) Create a root r_0 for Q_ab, and construct Q_ab with Q_a as the basis.

(3) Traverse Q_a and Q_b in preorder. If Q_b contains a prefix whose node names and operators ("/" or "//") are identical to those of Q_a, the prefix nodes are merged; otherwise, the node of Q_b is added to Q_a.

(4) Repeat step (3) until the traversal of Q_b is complete.
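The merging steps can be sketched as a trie-style insertion. The Java below is a simplified illustration, not the patented procedure itself: it assumes each query is a linear path of (operator, name) steps with predicates left out, and the names `SharedPrefixTreeSketch`, `childFor`, and `acceptingFor` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: merging path queries into one tree that shares common prefixes. */
class SharedPrefixTreeSketch {
    static class Node {
        final String axis, name;               // axis is "/" or "//"
        final List<Node> children = new ArrayList<>();
        final Set<String> acceptingFor = new LinkedHashSet<>();  // ids of queries ending here
        Node(String axis, String name) { this.axis = axis; this.name = name; }

        /** Reuses a child with the same operator and name, otherwise adds a new one. */
        Node childFor(String axis, String name) {
            for (Node c : children) {
                if (c.axis.equals(axis) && c.name.equals(name)) return c;
            }
            Node c = new Node(axis, name);
            children.add(c);
            return c;
        }

        /** Number of nodes in the subtree rooted here. */
        int size() {
            int n = 1;
            for (Node c : children) n += c.size();
            return n;
        }
    }

    final Node root = new Node("", "r0");      // artificial root of the merged tree

    /** Each step is {operator, name}, for example {"/", "a"} or {"//", "b"}. */
    void addQuery(String queryId, String[][] steps) {
        Node cur = root;
        for (String[] step : steps) cur = cur.childFor(step[0], step[1]);
        cur.acceptingFor.add(queryId);
    }
}
```

Adding /a/b/c and then /a/b//d under these assumptions yields five nodes including the root, because the prefix /a/b is stored once and only the last steps differ.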

Fig. 4 shows the shared-prefix compact query tree obtained by merging the three queries of Fig. 3. In Fig. 4, node n4 is the accepting node of Q1, and node n3 is the accepting node of Q2 and Q3. Note that node n3 is associated with two query expressions, each tagged with the identifier of its query (query identifiers are generated by the XPath parser).

2.1 Data structure

Each query node has a unique identifier and contains the following basic information:

(1) name: a string giving the node name.

(2) axis: an integer, 0 for "/" and 1 for "//".

(3) documentLevel: an integer giving the level in the XML document.

Each accepting node is associated with a set of query identifiers. If a node is a predicate node or a child node of a predicate node, then for each query (the node may be shared by several queries) the associated information is <QID, status, lExpression>, where:

(1) QID: the query identifier.

(2) lExpression: the logical expression associated with the node. Each term of the logical expression refers to the corresponding child node. If the node is a leaf of the predicate subtree, its logical expression is empty.

(3) status: a Boolean flag giving the matching state of the node (the result of evaluating the logical expression). Its initial value is false (FALSE).

2.2 Node matching and query matching

When a document arrives, the XML parser generates the corresponding events, and the stream processing engine reacts to them through callback functions. Each event is translated into the form (Name, Type, DocumentLevel), which drives the matching process. Name is the node test name; Type is the event type, one of StartDocument, EndDocument, OPEN, and CLOSE; DocumentLevel is computed from the OPEN and CLOSE tags of the XML document elements, and the DocumentLevel of the document root is 0.

For a query, when the node encountered is an OQNode that does not belong to the predicate subtree, the node matches if the event name, node name, and document level all agree. When a node carries the parent-child operator, the document levels must agree; when a node carries the ancestor-descendant operator, the node matches whenever the node name is the same, and the document-level check is skipped. When the query node name is "*", the name test always returns true (TRUE). This is the top-down process, evaluated when an OPEN event is encountered.

When the node encountered is a PQNode or a child node in the predicate subtree and the check of its basic information <Name, IsChild, DocumentLevel> passes, its associated logical expression must be evaluated and its status flag checked. If the status flag is true, the node matches. This is the bottom-up process. The status flag is initially false (FALSE); when the CLOSE event is encountered, the corresponding logical expression is evaluated and the result is assigned to the status flag.

If a node matches, the node is passed and processing continues with the next node. If the matched node is an accepting node, the XML document matches the query. (Note that a node may be an accepting node for one query but not for another, or an OQNode for one query and a PQNode for another.)
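The translation of parser callbacks into (Name, Type, DocumentLevel) triples could look like the following sketch. It assumes the standard Java SAX API as the event-based parser, which the source does not specify; the class name `EventTranslatorSketch` and the string form of the triples are illustrative choices, and StartDocument and EndDocument events are left out for brevity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: SAX callbacks translated into (Name, Type, DocumentLevel) events. */
class EventTranslatorSketch extends DefaultHandler {
    final List<String> events = new ArrayList<>();
    private int level = -1;                    // so that the document root gets level 0

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        level++;                               // an OPEN tag goes one level deeper
        events.add("(" + qName + ", OPEN, " + level + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("(" + qName + ", CLOSE, " + level + ")");
        level--;                               // a CLOSE tag returns to the parent level
    }

    /** Parses an XML string and returns the translated event list. */
    static List<String> translate(String xml) {
        try {
            EventTranslatorSketch handler = new EventTranslatorSketch();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

For the fragment `<a><b/></a>`, this sketch reports a at level 0 and b at level 1, with each CLOSE event carrying the same level as its OPEN event.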
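The top-down node test described above reduces to a short predicate. The sketch below is an assumption-laden illustration: the method name `matches` and the parameter `expectedLevel` (the level at which a "/" step must occur) are hypothetical, while the axis encoding 0 for "/" and 1 for "//" follows section 2.1.

```java
/** Hypothetical sketch of the top-down node test for an OQNode. */
class NodeTestSketch {
    static final int CHILD = 0;                // "/"
    static final int DESCENDANT = 1;           // "//"

    static boolean matches(String nodeName, int axis, int expectedLevel,
                           String eventName, int eventLevel) {
        // The name test always succeeds for the wildcard "*".
        boolean nameOk = nodeName.equals("*") || nodeName.equals(eventName);
        if (!nameOk) return false;
        // For "//" the document-level check is skipped.
        if (axis == DESCENDANT) return true;
        // For "/" the element must appear at exactly the expected level.
        return eventLevel == expectedLevel;
    }
}
```

For example, a step /b expected at level 1 matches an OPEN event for b at level 1 but not at level 2, whereas //b matches b at any level.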

2.3 Algorithm

The algorithm is based on a runtime stack and is divided into two parts, OPEN and CLOSE.

OPEN: an OPEN event invokes this handler through a callback function, passing in the event name, the element name, and the document level of the element.

(1) For each arriving XML element, perform the node test and the document-level check.

(2) If the node check returns true (TRUE), push the node onto the runtime stack. If the node is a PQNode or a child node of the predicate subtree, set its status flag to false (FALSE). If the node is the accepting node of a query and is not a PQNode, a query match occurs. If the accepting node is a PQNode, whether the document matches the query can be decided only when the corresponding CLOSE event is encountered.

CLOSE: if the node encountered is an OQNode, simply pop it from the stack. If the node is a PQNode or a child node of the predicate subtree, perform the following steps:

(1) If it is a leaf node, set its status flag to true (TRUE), meaning that the node matched. Pop the node from the runtime stack and set the corresponding term in the logical expression of the current stack-top node (which is its parent) to true (TRUE).

(2) If it is an intermediate node of the predicate subtree, evaluate its logical expression (by this time all of its children have been processed). If the expression evaluates to TRUE, the node matches and its status flag is set to TRUE; otherwise the status flag is set to FALSE. Pop the node from the stack and assign the value of its status flag, true (TRUE) or false (FALSE), to the corresponding term in the logical expression of the current stack-top node.

(3) If it is the accepting node of a query, evaluate its logical expression and assign the result to the status flag. If the result is true (TRUE), the document matches the query; if it is false (FALSE), the document does not match the query.

(4) Continue this process until all PQNodes have been processed.
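The interplay of OPEN and CLOSE on the runtime stack can be illustrated with a deliberately reduced sketch. It assumes a single query, only the parent-child operator, and a logical expression supplied as a Java `Predicate` over the names of the children that matched; `RuntimeStackSketch`, `Frame`, and `documentMatches` are hypothetical names, and the shared-prefix tree, "//" steps, and wildcards of the full method are not modeled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch: bottom-up predicate evaluation driven by a runtime stack. */
class RuntimeStackSketch {
    static class QNode {
        final String name;
        final List<QNode> children = new ArrayList<>();
        Predicate<Set<String>> expr;           // null: no expression, TRUE once closed
        boolean accepting;
        QNode(String name) { this.name = name; }
        QNode child(String n) { QNode c = new QNode(n); children.add(c); return c; }
    }

    private static class Frame {
        final QNode node;                      // null if the element matches no query node
        final Set<String> matched = new HashSet<>();  // children whose status became TRUE
        Frame(QNode node) { this.node = node; }
    }

    private final QNode root;
    private final Deque<Frame> stack = new ArrayDeque<>();
    boolean documentMatches = false;

    RuntimeStackSketch(QNode root) { this.root = root; }

    /** OPEN: node test against the children of the stack-top node, then push. */
    void open(String name) {
        QNode hit = null;
        if (stack.isEmpty()) {
            if (root.name.equals(name)) hit = root;
        } else if (stack.peek().node != null) {
            for (QNode c : stack.peek().node.children) {
                if (c.name.equals(name)) hit = c;
            }
        }
        stack.push(new Frame(hit));
    }

    /** CLOSE: evaluate the node's expression, pop, and report the status to the parent. */
    void close() {
        Frame f = stack.pop();
        if (f.node == null) return;
        boolean status = f.node.expr == null || f.node.expr.test(f.matched);
        if (status && !stack.isEmpty()) stack.peek().matched.add(f.node.name);
        if (status && f.node.accepting) documentMatches = true;
    }
}
```

With the query /a/b[c and (d or e)] and b as accepting node, the match is reported only when the CLOSE event of b arrives, after the statuses of c and e have been propagated upward; no intermediate path results are stored.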

3 Optimization

3.1 Short-circuit evaluation of logical expressions

Consider the logical expressions E1 = e1 and e2 and E2 = e3 or e4. If e1 is false (FALSE), E1 is known to be FALSE without evaluating the whole expression; if e3 is TRUE, E2 is known to be true. Logical expressions are thus short-circuit evaluated as early as possible. This saves evaluation time and also allows the parsing of part of the document to be skipped, saving document parsing time as well.

For example, the accepting node n3 of query Q3 in Fig. 4 is associated with the logical expression n6 or n8. When the CLOSE event of element c is encountered (Fig. 5), the expression of node n3 for Q3 becomes TRUE or n8. The result of the logical expression can then be computed early (short-circuit evaluation), which shows that the document matches query Q3. All subsequent document parsing and query matching operations related only to Q3 can therefore be skipped.
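Deciding an expression before all of its terms are known can be modeled with a three-valued evaluation. The sketch below is one possible reading, not the source's implementation: `Tri`, `term`, `and`, and `or` are hypothetical names, and a term that is absent from the map is treated as not yet known.

```java
import java.util.Map;

/** Hypothetical sketch: early (short-circuit) evaluation with partially known terms. */
class ShortCircuitSketch {
    enum Tri { TRUE, FALSE, UNKNOWN }

    interface Expr { Tri eval(Map<String, Boolean> known); }

    /** A term is UNKNOWN until the status of its node has been reported. */
    static Expr term(String id) {
        return known -> {
            Boolean v = known.get(id);
            if (v == null) return Tri.UNKNOWN;
            return v ? Tri.TRUE : Tri.FALSE;
        };
    }

    static Expr and(Expr l, Expr r) {
        return known -> {
            Tri a = l.eval(known);
            if (a == Tri.FALSE) return Tri.FALSE;      // decided without the right operand
            Tri b = r.eval(known);
            if (b == Tri.FALSE) return Tri.FALSE;
            return (a == Tri.TRUE && b == Tri.TRUE) ? Tri.TRUE : Tri.UNKNOWN;
        };
    }

    static Expr or(Expr l, Expr r) {
        return known -> {
            Tri a = l.eval(known);
            if (a == Tri.TRUE) return Tri.TRUE;        // decided without the right operand
            Tri b = r.eval(known);
            if (b == Tri.TRUE) return Tri.TRUE;
            return (a == Tri.FALSE && b == Tri.FALSE) ? Tri.FALSE : Tri.UNKNOWN;
        };
    }
}
```

For the expression n6 or n8, knowing only that n6 is TRUE already decides the result, which corresponds to the early match of Q3 in the example above.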

3.2 Shared evaluation of logical expressions

With a large number of queries, a predicate query node in the shared-prefix query tree may be associated with many logical expressions, and these may contain many identical subexpressions. If identical subexpressions are shared, that is, each common subexpression is stored once and evaluated once, the space required for logical expressions is reduced and their evaluation is accelerated. The invention therefore represents logical expressions as abstract syntax trees and shares common subexpressions by merging identical nodes or branches across the syntax trees.

For example, consider the three queries Q1: n1/n2[(n3 and n4) or n5], Q2: n1/n2[n3]/n4, and Q3: n1/n2[n3 and n4]. A simplified representation of their shared-prefix compact query tree (nodes are shown only by element name) is given in the left part of Fig. 5, where n2 is an accepting node associated with the logical expression of each of the three queries. It is easy to see that for Q2 and Q3 the logical expression is n3 and n4 in both cases, and that this is also a subexpression of Q1's logical expression (n3 and n4) or n5. Therefore only one abstract syntax tree with a shared logical branch needs to be built for node n2, as shown in the right part of Fig. 5.
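One way to store and evaluate a common subexpression only once is to intern syntax-tree nodes by a canonical key and memoize their values. The following is a sketch of that general technique under assumptions, not the structure shown in Fig. 5: `SharedExprSketch`, `intern`, and the `evaluations` counter are hypothetical, and a term missing from the map is treated as FALSE.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: identical subexpressions are stored once and evaluated once. */
class SharedExprSketch {
    static final class Node {
        final String key, op;                  // op is "and", "or", or a term id for a leaf
        final Node left, right;
        Node(String key, String op, Node left, Node right) {
            this.key = key; this.op = op; this.left = left; this.right = right;
        }
    }

    private final Map<String, Node> pool = new HashMap<>();
    int evaluations = 0;                       // counts nodes actually evaluated

    Node term(String id) { return intern(id, id, null, null); }
    Node and(Node l, Node r) { return intern("(" + l.key + " and " + r.key + ")", "and", l, r); }
    Node or(Node l, Node r) { return intern("(" + l.key + " or " + r.key + ")", "or", l, r); }

    /** Returns the existing node for a key, so equal branches share one object. */
    private Node intern(String key, String op, Node l, Node r) {
        Node n = pool.get(key);
        if (n == null) {
            n = new Node(key, op, l, r);
            pool.put(key, n);
        }
        return n;
    }

    /** Evaluates a node, reusing results already computed for shared branches. */
    boolean eval(Node n, Map<String, Boolean> terms, Map<Node, Boolean> memo) {
        Boolean done = memo.get(n);
        if (done != null) return done;
        evaluations++;
        boolean v;
        if (n.left == null) v = Boolean.TRUE.equals(terms.get(n.op));
        else if (n.op.equals("and")) v = eval(n.left, terms, memo) && eval(n.right, terms, memo);
        else v = eval(n.left, terms, memo) || eval(n.right, terms, memo);
        memo.put(n, v);
        return v;
    }
}
```

Building n3 and n4 for one query and (n3 and n4) or n5 for another through the same pool makes the left branch of the second expression the very object built for the first, so its value is computed a single time per document.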

The invention proposes a new method (query matching of complex twig patterns over XML stream data) with the following distinctive features:

1. It can process complex twig queries with AND/OR predicates.

2. AND/OR predicates are handled as separate abstract syntax trees. Using the proposed runtime-stack-based algorithm, which combines top-down and bottom-up processing, the method can process arbitrary complex twig queries over an XML stream in a single pass and avoids generating intermediate results.

3. All twig queries are combined into a single query tree that shares common prefixes, so a large number of queries can be processed simultaneously.

4. Compared with existing systems and algorithms, query processing performance is significantly improved, particularly for large XML documents (1 MB and above).

The CTPQ algorithm for matching complex twig pattern queries over XML stream data can be applied in a wide range of fields, such as online auctions, stock market systems, traffic control systems, personalized recommendation services, sensor networks, location search, network monitoring, and financial analysis.

Description of drawings

Fig. 1 is a twig pattern query tree.

Fig. 2 is a compact twig pattern query tree.

Fig. 3 shows example queries.

Fig. 4 is the shared-prefix compact query tree.

Fig. 5 illustrates the sharing of common logical expressions.

Fig. 6 is an example of the matching process.

Fig. 7 shows the experiments on small XML documents, where (a) is YFilter and (b) is the invention.

Fig. 8 shows the experiments on large XML documents, where (a) is YFilter and (b) is the invention.

Embodiment

The invention has been implemented as a system in Java. The system runs under Eclipse 3.1 on a machine with a 2.7 GHz processor and 512 MB of memory. The experimental setup comprises a document generator, a DTD parser, an XPath generator, and an event-based XML parser. Documents are generated with IBM's document generation tool ^[12]; the DTD parser is from WUTKA ^[13]; the XPath parser is JXPath ^[14]; the DTD used is from Xmark ^[15]; and the event-based document parser used is ^[16].

The implementation of the CTPQ algorithm for matching complex twig pattern queries over XML stream data comprises two parts: the first handles the OPEN tags of the XML, and the second handles the CLOSE tags. The algorithm is described in pseudocode below:

Matching algorithm based on the runtime stack (OPEN part)

Input: the element start-tag event Event;

Output: if a match with some query occurs, the match is output.

NodeSet<QNode> NewActiveNodeSet = new NodeSet<QNode>();
Foreach active query node N In current active node set of RmStack
    If (N is $-Node)
        If (N has a child C with its name equals Event.Name)
            NewActiveNodeSet.add(C);
            NewActiveNodeSet.add(N);
            CheckOpenAccepting(C, Event);
        Endif
    Else
        If (N has $-Node C as a child)
            NewActiveNodeSet.add(C);
            If (C has a child C' with its name equals Event.Name)
                NewActiveNodeSet.add(C');
                CheckOpenAccepting(C', Event);
            Endif
            If (N's name equals Event.Name)
                NewActiveNodeSet.add(N);
            Endif
        Endif
        If (N has *-Node C as a child)
            NewActiveNodeSet.add(C);
            CheckOpenAccepting(C, Event);
        Endif
        If (N has a child C with its name equals Event.Name)
            NewActiveNodeSet.add(C);
            CheckOpenAccepting(C, Event);
            If (C has $-Node C' as a child)
                NewActiveNodeSet.add(C');
            Endif
        Endif
    Endif
Endfor
If (NewActiveNodeSet is NOT EMPTY)
    RmStack.push(NewActiveNodeSet);
Endif

Algorithm 3.2: Matching algorithm based on the runtime stack (CLOSE part)

Input: the element end-tag event Event;

Foreach active query node N In current active node set of RmStack
    If (N is NOT $-Node)
        N.setExisted(TRUE);
        Foreach query information item QII associated with N
            If (N is a predicate node of the query QII.Qid)
                QII.LE.evaluate();
                If (QII.IsAcceptingNode = TRUE AND QII.Status = TRUE)
                    A matching between the doc and query QII.Qid has been found!
                Endif
            Endif
        Endfor
    Endif
Endfor
If (there exists an Event.Name node OR $-Node OR *-Node at the top of RmStack)
    RmStack.pop();
Endif

Example

The method of the invention combines top-down and bottom-up processing. The execution of query matching is illustrated by the following example.

Fig. 6 uses the three queries of Fig. 4 and an XML fragment as an example to show the query matching process. The nodes in the runtime stack correspond to the node identifiers of the shared-prefix compact query tree of Fig. 4. Bold italic type denotes an accepting node, and an underlined node denotes an accepting node that is a PQNode. When the OPEN event of element b is encountered, nodes n2 and n3 are pushed onto the stack; at this point the status flag of n3 is FALSE. When the OPEN event of element c is encountered, n4, n5, and n6 are pushed onto the stack. Because node n4 is an OQNode and is the accepting node of query Q1, the document can be judged to match Q1 at this point. The status flags of nodes n5 and n6 are FALSE. When the CLOSE event of element c is encountered, the status flags of n5 and n6 are set to TRUE, nodes n4, n5, and n6 are popped (the current stack-top nodes are now n2 and n3), and the status values of n5 and n6 are assigned to the corresponding terms of the logical expressions in n3, giving <Q2, FALSE, (TRUE and (n7 or n8))> and <Q3, FALSE, (TRUE or n8)>. Similarly, when the CLOSE event of element b is encountered, the logical expressions associated with element b are evaluated and the results are assigned to the status flags of the corresponding queries, giving <Q2, TRUE, (TRUE and (TRUE or TRUE))> and <Q3, TRUE, (TRUE or TRUE)>. Because the b element is the accepting node of Q2 and Q3, Q2 and Q3 can be judged to match the document.

Effect

The invention was compared with YFilter experimentally, and the results were analyzed with respect to the following measures: (1) the total number of twig patterns, (2) the number of branches per twig pattern (which to some extent also reflects the depth of the predicate subtree), and (3) the size of the input document. The experiments fall into two classes: experiments on small XML documents (10k, 30k, 50k) and experiments on large XML documents (500K, 1M, 2M). Because YFilter does not support OR logical predicates, twig patterns with nested AND predicates were chosen for the experiments. A series of experiments was carried out over different numbers of branch nodes, different data sets (for example Xmark and DBLP ^[17]; the results reported here use Xmark, and the results with DBLP are similar), and different document sizes. The results show that performance on twig patterns improves over YFilter, especially for large documents.

Partial results for small documents are given first. Taking queries with 3 branch nodes (PQNodes) as an example, twig patterns were generated with YFilter's XPath generator using the parameters 6 0.2 0.2 --num_nestedpaths=3 --distinct=TRUE, where 6 is the query depth, the two values of 0.2 are the probabilities of "//" and "*" occurring, 3 is the number of branch nodes, and TRUE means that all twig patterns are distinct. The results are shown in Fig. 7, where the vertical axis is time (ms*10) and the horizontal axis is the number of queries (5000, 10000, 50000, and 100000); the document sizes are 10k, 20k, and 50k.

The query processing time of YFilter is divided into basic processing time and post-processing time, where the post-processing time is the time spent on the predicate part of the twig patterns. When processing small documents and few queries, the post-processing time of YFilter accounts for a small proportion of the total. For example, with 3 branches, 5000 queries and a 10k document, the processing time is 265 ms, of which the post-processing time is 51 ms, about 20%. As the number of queries and the document size increase, the share of post-processing time grows; for example, under the same conditions with 100000 queries and a 50k document, the post-processing time exceeds 55%. As the experimental results show, the performance improvement of the present invention becomes more and more evident as the number of queries and the size of the XML document increase.

Partial results for large documents are given below. The twig patterns were generated with the same parameters as above. The results are shown in Fig. 8, where the vertical axis is time (ms*100) and the horizontal axis is the number of queries (5000, 10000, 50000 and 100000); the document sizes are 500K, 1MB and 2MB. When processing large documents, the system of the present invention has a very clear advantage for both large and small numbers of queries; with a 1MB document, performance improves by more than a factor of 2. For example, with a 1MB document and 5000 queries, the processing time of the present invention is 4406 ms, while that of YFilter is (3594+8422)=12016 ms, where the first number in the parentheses, 3594, is the basic processing time and the second, 8422, is the post-processing time. The post-processing time of YFilter alone is thus nearly twice the total time of the system of the present invention. With the document size unchanged (1MB) and the number of queries increased to 100000, the processing time of the present invention is 86062 ms and that of YFilter is (14733+168844)=183577 ms, whose post-processing time alone is again about twice the total time of the present invention. With a 2MB document and 5000 and 10000 queries respectively, the times of the present system versus YFilter are 8969 ms versus (7047+33906) ms and 17218 ms versus (9703+71188) ms; here the post-processing time alone is about 4 times the total time of the present system. Therefore, the YFilter approach is not suited to applications with large documents and complex queries, and is better suited to message filtering with simple queries. Because the present invention targets complex twig pattern queries and processes the XML stream in a single pass, it avoids generating intermediate results and the post-processing step, and is therefore also well suited to complex queries over large documents. Besides Xmark, the present invention was also tested on other commonly used data sets, such as DBLP, with similar results.
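The ratios quoted above can be re-derived from the reported timings. The following is a minimal sketch that only restates the arithmetic; the table name `TIMINGS_MS` and the helper `ratios` are illustrative names, not part of the described system.

```python
# Timings in ms quoted in the text:
# (document size, number of queries) ->
#     (present method, YFilter basic time, YFilter post-processing time)
TIMINGS_MS = {
    ("1MB", 5000): (4406, 3594, 8422),
    ("1MB", 100000): (86062, 14733, 168844),
    ("2MB", 5000): (8969, 7047, 33906),
    ("2MB", 10000): (17218, 9703, 71188),
}


def ratios(ours, basic, post):
    """Return (YFilter total / ours, YFilter post-processing / ours)."""
    return (basic + post) / ours, post / ours


for (size, n_queries), row in TIMINGS_MS.items():
    total_ratio, post_ratio = ratios(*row)
    print(f"{size} {n_queries:>6} queries: "
          f"total x{total_ratio:.2f}, post-processing alone x{post_ratio:.2f}")
```

For the 1MB document the total ratio stays above 2 in both rows, and for the 2MB document the post-processing time alone is roughly 4 times the total time of the present method, which is consistent with the statements in the text.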

List of references

[1] Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernandez, Michael Kay, Jonathan Robie, and Jérôme Siméon. XML Path Language (XPath) 2.0, W3C Working Draft 16. Technical Report WD-xpath20-20020816. USA: World Wide Web Consortium, 2002. http://www.w3.org/TR/2002/WD-xpath20-20020816/.

[2] Altinel, M., and Franklin, M.J. Efficient Filtering of XML Documents for Selective Dissemination of Information. Abbadi, A.E., Brodie, M.L., Chakravarthy, S., Dayal, U., Kamel, N., Schlageter, G., Whang, K.Y., eds. Proceedings of the 26th International Conference on Very Large Data Bases (VLDB00). Cairo, Egypt: Morgan Kaufmann, 2000. 53-64.

[3] Yanlei Diao, Mehmet Altinel, Michael J. Franklin, Hao Zhang and Peter Fischer. Path Sharing and Predicate Evaluation for High-performance XML Filtering. ACM Transactions on Database Systems (TODS03). 2003, 28(4): 467-516.

[4] Gupta, A. and Suciu, D. Stream Processing of XPath Queries with Predicates. Halevy, A.Y., Ives, Z.G., Doan, A., eds. Proceedings of 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD03). San Diego, California: ACM Press, 2003. 419-430.

[5] Green, T.J., Miklau, G., Onizuka, M., and Suciu, D. Processing XML Streams with Deterministic Automata and Stream Indexes. ACM Transactions on Database Systems (TODS04). 2004, 29(4): 752-788.

[6] Dan Olteanu, Tobias Kiesling, François Bry. An Evaluation of Regular Path Expressions with Qualifiers against XML Streams. Dayal, U., Ramamritham, K., Vijayaraman, T.M., eds. Proceedings of 19th International Conference on Data Engineering (ICDE'03). Bangalore, India: IEEE Computer Society, 2003. 702-704.

[7] Bertram Ludäscher, Pratik Mukhopadhyay and Yannis Papakonstantinou. A Transducer-Based XML Query Processor. Bressan, S., Chaudhri, A.B., Lee, M.L., Yu, J.X., Lacroix, Z., eds. Proceedings of the 28th International Conference on Very Large Data Bases (VLDB02). Hong Kong, China: ACM Press, 2002. 227-238.

[8] GAO Jun, YANG Dong-Qing, TANG Shi-Wei, WANG Teng-Jiao. Tree-Automata Based Efficient XPath Evaluation over XML Data Stream. Journal of Software. 2005, 16(2): 223-232.

[9] Bruno, N., Gravano, L., Koudas, N., and Srivastava, D. Navigation- vs. Index-Based XML Multi-Query Processing. Dayal, U., Ramamritham, K., Vijayaraman, T.M., eds. Proceedings of the 19th International Conference on Data Engineering (ICDE'03). Bangalore, India: IEEE Computer Society, 2003. 139-150.

[10] Xueqing Gong, Weining Qian, Ying Yan, and Aoying Zhou. Bloom Filter-based XML Packets Filtering for Millions of Path Queries. Proceedings of the 21st International Conference on Data Engineering (ICDE'05). Tokyo, Japan: IEEE Computer Society, 2005. 890-901.

[11] Joonho Kwon, Praveen Rao, Bongki Moon, Sukho Lee. FiST: Scalable XML Document Filtering by Sequencing Twig Patterns. Böhm, K., Jensen, C.S., Haas, L.M., Kersten, M.L., Larson, P., Ooi, B.C., eds. Proceedings of 31st International Conference on Very Large Data Bases (VLDB05). Trondheim, Norway: VLDB Endowment, 2005. 217-228.

[12] Angel Luis Diaz and Douglas Lovell. XML Generator. http://www.alphaworks.ibm.com/tech/xmlgenerator, September 1999.

[13] Wutka. 2000. DTD parser. http://www.wutka.com/dtdparser.html.

[14] JXPath. http://jakarta.apache.org/commons/jxpath/

[15] Busse, R., Carey, M., Florescu, D., Kersten, M., Manolescu, I., Schmidt, A., and Waas, F. 2001. Xmark: An XML benchmark project. http://monetdb.cwi.nl/xml/index.html.

[16] David Megginson. Simple API for XML. http://sax.sourceforge.net

[17] Ley, M. 2001. DBLP DTD. http://www.acm.org/sigmod/dblp/db/about/dblp.dtd

Claims

1. A method for query matching of complex twig patterns over XML stream data, wherein the basic problem to be solved is defined as follows: given a query set Q containing complex twig patterns and an XML document D, find

$Q' \subseteq Q$

such that every q ∈ Q' matches document D; characterized in that the specific steps are as follows:

(1) a complex twig pattern query submitted by the user is represented internally as a compact twig pattern query tree, a form that is easy to process against an XML stream;

(2) all user queries are merged into a single query tree that shares the common parts of these queries;

(3) combining top-down and bottom-up processes, the matching of complex twig pattern queries against the XML stream data is completed in a single pass, avoiding the generation of intermediate results; wherein nodes with predicates and their child nodes are matched bottom-up, and the other parts are matched top-down.

2. The method according to claim 1, characterized in that the compact twig pattern query tree is built on the ordinary query tree: the nodes of a complex twig pattern with nested AND/OR predicates are divided into predicate query nodes and ordinary query nodes, and the AND/OR logic is separated out and represented as an abstract syntax tree, specifically as follows:

Definition 1: a compact twig pattern query tree is a query tree representing a twig pattern. The nodes in the query tree are called query nodes (QNode); each query node has a unique identifier and is of one of the following two types:

(1) OQNode: a location step without a predicate, called an ordinary query node. In the compact query tree, the information associated with an OQNode comprises the node name (which may be "*") and the operator expressing the parent-child "/" or ancestor-descendant "//" relationship, represented as the pair <name, "/" or "//">;

(2) PQNode: a location step with a predicate, called a predicate query node. A predicate query node is a special node in the compact query tree that connects its subtrees through AND/OR logical predicates. Besides the node identifier, the node name and "/" or "//", it is also associated with a logical expression, which is represented internally as an abstract syntax tree; each leaf node of this abstract syntax tree maintains a reference to its corresponding node.

3. The method according to claim 2, characterized in that the twig patterns are matched against the XML stream data as follows:

(1) Definition 2 (predicate subtree): in the representation of a compact twig pattern query tree, the subtree rooted at the predicate node nearest to the root node is called the predicate subtree. The node structure is extended accordingly: an OQNode in the predicate subtree is also associated with a logical expression containing only one item, namely the identifier of its child node; if the OQNode is a leaf node, it instead carries a logical flag, initially false, which is set to true when the CLOSE event is encountered;

(2) Definition 3 (recipient node): in the representation of a compact twig pattern query tree, the predicate node nearest to the root node is called the recipient node of the query represented by the tree. If, during computation, the logical expression associated with the recipient node evaluates to true, the document matches the query. For an XPath expression without predicates (XP{/, //, *}), the ordinary query tree is identical to the compact query tree, and its recipient node is the leaf node of its ordinary query tree; if the recipient node is reached during matching, the document matches the query;

Meanwhile, all twig pattern queries from users are merged into a single structure, called the shared-prefix compact twig pattern query tree. While processing the XML document stream, nodes with predicates and their child nodes are matched bottom-up, and the other parts are matched top-down.
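The merging of queries into a shared-prefix structure can be illustrated with a small trie. This is a minimal sketch under simplifying assumptions: queries are plain sequences of (axis, name) location steps without predicates, and the names `TrieNode`, `merge_queries` and `count_nodes` as well as the three example queries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    step: tuple                                   # (axis, name), e.g. ("/", "a")
    children: dict = field(default_factory=dict)  # step -> TrieNode
    accepts: list = field(default_factory=list)   # IDs of queries ending here


def merge_queries(queries):
    """Merge {query_id: [(axis, name), ...]} into one shared-prefix tree."""
    root = TrieNode(step=("", "ROOT"))            # artificial root
    for qid, steps in queries.items():
        node = root
        for step in steps:
            # Reuse the existing child for a common prefix, else create one.
            node = node.children.setdefault(step, TrieNode(step))
        node.accepts.append(qid)
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())


queries = {
    "Q1": [("/", "a"), ("/", "b"), ("/", "c")],
    "Q2": [("/", "a"), ("/", "b"), ("//", "d")],
    "Q3": [("/", "a"), ("//", "e")],
}
tree = merge_queries(queries)
print(count_nodes(tree) - 1)  # 5 shared query nodes for 8 location steps
```

Because the common prefix /a/b is stored once, an incoming element is tested against it once for all queries that share it.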

4. The method according to claim 3, characterized in that the data structure of a node in the compact twig pattern query tree is as follows:

(1) Name: a string representing the node name;

(2) Axis: an integer, where 0 represents "/" and 1 represents "//";

(3) documentLevel: an integer representing the level in the XML document;

Each recipient node is associated with a set of query identifiers. If a node is a predicate node or a child node of a predicate node, then for each query the associated information is <QID, status, lExpression>, wherein:

(1) QID: the query identifier;

(2) lExpression: the logical expression associated with this node, each item of which refers to the corresponding child node; for a leaf node of the predicate subtree, the logical expression is empty;

(3) status: a Boolean status flag representing the matching state of the node; its initial value is false.
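The node structure listed above could be written down as follows. This is a minimal sketch: the field names follow the claim, while the class names `QNode` and `QueryInfo`, the `is_predicate` flag, the tuple encoding of the logical expression and the example node are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

CHILD, DESCENDANT = 0, 1   # Axis encoding from the claim: 0 is "/", 1 is "//"


@dataclass
class QueryInfo:
    """Per-query information <QID, status, lExpression> kept on a node."""
    qid: str
    status: bool = False                   # matching state, initially false
    l_expression: Optional[object] = None  # None for predicate-subtree leaves


@dataclass
class QNode:
    node_id: str
    name: str                   # node name; "*" matches any element
    axis: int                   # CHILD or DESCENDANT
    document_level: int
    is_predicate: bool = False  # True for a PQNode, False for an OQNode
    queries: dict = field(default_factory=dict)   # QID -> QueryInfo


# Hypothetical predicate node for a step such as b[c and (d or e)]; the
# expression is a nested tuple whose leaves are child node identifiers.
n3 = QNode("n3", "b", CHILD, 1, is_predicate=True)
n3.queries["Q2"] = QueryInfo("Q2", l_expression=("and", "n5", ("or", "n7", "n8")))
print(n3.queries["Q2"].status)  # False until the CLOSE event is processed
```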

5. The method according to claim 4, characterized in that the node matching and query matching process is as follows:

When a document arrives at the system, the XML parser generates the corresponding events, and the stream processing engine responds to them through callback functions, translating each event into the form <Name, Type, DocumentLevel> to drive the matching process; wherein Name is the name used for the node test; Type is the event type, one of StartDocument, EndDocument, OPEN and CLOSE; and DocumentLevel is computed from the OPEN and CLOSE tags of the XML document elements, the DocumentLevel of the document root being 0;

For a query, when the node encountered is an OQNode that is not a child node in a predicate subtree, the node matches if the event name, node name and document level match. When a node carries the parent-child operator, the document levels must be equal; when a node carries the ancestor-descendant operator, the node matches if the node names are equal, and the document level check is skipped; when the query node name is "*", the node name test always returns true. This is the top-down process and is evaluated when the OPEN event is encountered;

When the node encountered is a PQNode or a child node in a predicate subtree, and the check of the basic information <Name, IsChild, DocumentLevel> passes, the associated logical expression must be evaluated and the status flag checked; if the status flag is true, the node matches. This is the bottom-up process. The initial value of the status flag is false; when the CLOSE event is encountered, the corresponding logical expression is evaluated and the result is assigned to the status flag;

If a node matches, the node is passed and matching continues with the next node; if the matched node is a recipient node, the XML document matches the query.
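The event translation and the basic node test described above can be sketched with Python's standard SAX parser. The class `EventCollector`, the function `node_test` and the sample document are hypothetical; the sketch assumes the Axis encoding of claim 4 (0 for "/", 1 for "//") and, as in the claim, skips the level check for the descendant axis.

```python
import xml.sax

CHILD, DESCENDANT = 0, 1


class EventCollector(xml.sax.ContentHandler):
    """Translate SAX callbacks into (name, type, document_level) triples."""

    def __init__(self):
        super().__init__()
        self.level = -1          # so that the document root gets level 0
        self.events = []

    def startDocument(self):
        self.events.append((None, "StartDocument", None))

    def endDocument(self):
        self.events.append((None, "EndDocument", None))

    def startElement(self, name, attrs):
        self.level += 1
        self.events.append((name, "OPEN", self.level))

    def endElement(self, name):
        self.events.append((name, "CLOSE", self.level))
        self.level -= 1


def node_test(q_name, q_axis, q_level, ev_name, ev_level):
    """Basic check of one query node against an OPEN event."""
    if q_name != "*" and q_name != ev_name:
        return False             # name test failed ("*" always passes)
    if q_axis == DESCENDANT:
        return True              # descendant axis: level check is skipped
    return q_level == ev_level   # child axis: levels must be equal


handler = EventCollector()
xml.sax.parseString(b"<a><b><c/></b></a>", handler)
for event in handler.events:
    print(event)
```

The status-flag part of the matching (the bottom-up process) is not covered here; it depends on the CLOSE handling of the runtime stack.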

6. The method according to claim 5, characterized in that:

The algorithm is based on a runtime stack and is divided into two parts, OPEN and CLOSE:

OPEN: on an OPEN event, this handler is invoked through the callback function and is passed the event name, the element name and the document level of the element;

(1) for each arriving XML element, perform the node test and the document level check;

(2) if the node check returns true, push the node onto the runtime stack. If the node encountered is a PQNode or a child node in a predicate subtree, set its status flag to false. If the node encountered is the recipient node of a query and is not a PQNode, a query match occurs. If the recipient node is a PQNode, whether the document matches the query can only be decided when the CLOSE event is encountered;

CLOSE: if the node encountered is an OQNode, simply pop it from the stack; if the node encountered is a PQNode or a child node in a predicate subtree, perform the following steps:

(3) if it is a leaf node, set its status flag to true, meaning that the node matches; pop the node from the runtime stack, and set to true the corresponding item in the logical expression of the node now at the top of the stack;

(4) if it is an intermediate node in the predicate subtree, evaluate its logical expression. If the value is TRUE, the node matches and its status flag is set to true; otherwise its status flag is set to false. Pop the node from the stack, and assign the value of its status flag (true or false) to the corresponding item in the logical expression of the node now at the top of the stack;

(5) if it is the recipient node of a query, evaluate its logical expression and assign the result to the status flag; if true, the document matches the query; if false, the document does not match the query;

(6) continue the above process until all PQNodes have been processed.
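The OPEN and CLOSE steps above can be sketched as follows. This is a simplified illustration, not the claimed implementation: it handles the child axis only, attaches at most one query to each recipient node, and takes events from a hand-written list. The names `QNode`, `Entry`, `Matcher`, `evaluate` and `run`, and the example tree for the hypothetical queries Q1 = /a/b/c and Q2 = /a/b[c and (d or e)], are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    nid: str
    name: str
    level: int
    in_predicate: bool = False  # PQNode or node inside a predicate subtree
    expr: object = None         # nested ("and"|"or", l, r) tuples over child ids
    accepts: str = None         # query ID if this is a recipient node
    children: list = field(default_factory=list)


@dataclass
class Entry:                    # one runtime-stack entry
    node: QNode
    parent: object              # Entry of the matched parent node, or None
    status: bool = False        # status flag, false on OPEN
    child_status: dict = field(default_factory=dict)


def evaluate(expr, values):
    if isinstance(expr, str):   # leaf: reference to a child node id
        return values.get(expr, False)
    op, left, right = expr
    if op == "and":
        return evaluate(left, values) and evaluate(right, values)
    return evaluate(left, values) or evaluate(right, values)


class Matcher:
    def __init__(self, roots):
        self.roots, self.stack, self.matched = roots, [], set()

    def open(self, name, level):
        # Candidates: children of the query nodes matched by the parent element.
        if self.stack:
            cands = [(c, e) for e in self.stack[-1] for c in e.node.children]
        else:
            cands = [(r, None) for r in self.roots]
        frame = []
        for node, parent in cands:
            if node.name in ("*", name) and node.level == level:
                frame.append(Entry(node, parent))
                if node.accepts and not node.in_predicate:
                    self.matched.add(node.accepts)  # top-down match on OPEN
        self.stack.append(frame)  # possibly empty, keeps OPEN/CLOSE aligned

    def close(self, name, level):
        for e in self.stack.pop():
            if not e.node.in_predicate:
                continue                            # OQNode: simply popped
            # Leaf: true on CLOSE; otherwise evaluate the logical expression.
            e.status = (True if not e.node.children
                        else evaluate(e.node.expr, e.child_status))
            if e.parent is not None and e.parent.node.in_predicate:
                prev = e.parent.child_status.get(e.node.nid, False)
                e.parent.child_status[e.node.nid] = prev or e.status
            if e.node.accepts and e.status:
                self.matched.add(e.node.accepts)    # bottom-up match on CLOSE


n4 = QNode("n4", "c", 2, accepts="Q1")
n2 = QNode("n2", "b", 1, children=[n4])
n5, n7, n8 = (QNode(i, n, 2, in_predicate=True)
              for i, n in (("n5", "c"), ("n7", "d"), ("n8", "e")))
n3 = QNode("n3", "b", 1, in_predicate=True, accepts="Q2",
           expr=("and", "n5", ("or", "n7", "n8")), children=[n5, n7, n8])
n1 = QNode("n1", "a", 0, children=[n2, n3])


def run(events):
    m = Matcher([n1])
    for kind, name, level in events:
        (m.open if kind == "OPEN" else m.close)(name, level)
    return m.matched


# Events for <a><b><c/><e/></b></a>
doc = [("OPEN", "a", 0), ("OPEN", "b", 1), ("OPEN", "c", 2), ("CLOSE", "c", 2),
       ("OPEN", "e", 2), ("CLOSE", "e", 2), ("CLOSE", "b", 1), ("CLOSE", "a", 0)]
print(sorted(run(doc)))
```

In this run Q1 is reported on the OPEN event of c, because its recipient node is an OQNode, whereas Q2 is reported only on the CLOSE event of b, once the logical expression of the predicate node has been evaluated over the status flags of its children.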

7. The method according to claim 6, characterized in that logical expressions are evaluated with short-circuit evaluation as early as possible: for logical expressions E1 = e1 and e2 and E2 = e3 or e4, if e1 is false, E1 is known to be false without evaluating the whole expression; if e3 is true, E2 is known to be true.
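Short-circuit evaluation over an abstract syntax tree can be sketched as follows. The nested-tuple encoding of expressions and the helpers `evaluate` and `traced` are assumptions made for illustration; the log records which items were actually looked up.

```python
def evaluate(expr, lookup):
    """Evaluate nested ("and"|"or", left, right) tuples; leaves are item names."""
    if isinstance(expr, str):
        return lookup(expr)
    op, left, right = expr
    if op == "and":
        # Python's "and" skips the right operand when the left one is false.
        return evaluate(left, lookup) and evaluate(right, lookup)
    if op == "or":
        # Python's "or" skips the right operand when the left one is true.
        return evaluate(left, lookup) or evaluate(right, lookup)
    raise ValueError(f"unknown operator: {op}")


def traced(values, log):
    """Return a lookup function that records every item it is asked for."""
    def lookup(name):
        log.append(name)
        return values[name]
    return lookup


log1, log2 = [], []
E1 = evaluate(("and", "e1", "e2"), traced({"e1": False, "e2": True}, log1))
E2 = evaluate(("or", "e3", "e4"), traced({"e3": True, "e4": False}, log2))
print(E1, log1)  # e2 is never looked up
print(E2, log2)  # e4 is never looked up
```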

8. The method according to claim 6, characterized in that, in the shared-prefix predicate query tree, each logical expression is represented as an abstract syntax tree and common logical sub-expressions are shared: identical nodes and branches in multiple syntax trees are merged.
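One possible way to merge identical nodes and branches of several syntax trees is to intern every sub-expression in a common pool. The claim does not prescribe this technique; the class `ExprPool` and the two example expressions are hypothetical.

```python
class ExprPool:
    """Intern logical sub-expressions so that identical ones are stored once."""

    def __init__(self):
        self._pool = {}

    def leaf(self, node_id):
        return self._intern(("leaf", node_id))

    def op(self, kind, left, right):
        return self._intern((kind, left, right))

    def _intern(self, key):
        # Return the stored object if an equal expression already exists.
        return self._pool.setdefault(key, key)

    def __len__(self):
        return len(self._pool)


pool = ExprPool()
shared = pool.op("or", pool.leaf("n7"), pool.leaf("n8"))
# Two hypothetical queries whose predicates both contain (n7 or n8).
q2 = pool.op("and", pool.leaf("n5"), shared)
q3 = pool.op("or", pool.leaf("n6"),
             pool.op("or", pool.leaf("n7"), pool.leaf("n8")))
print(q3[2] is shared)  # True: the common branch is one shared object
print(len(pool))        # 7 distinct nodes in total
```

With such sharing, a common sub-expression needs to be evaluated only once per element and its result can be reused by every query that refers to it.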