CN103136378B

CN103136378B - A data recovery method based on a structural summary

Info

Publication number: CN103136378B
Application number: CN201310100931.4A
Authority: CN
Inventors: 陈琳; 陈海涛; 夏冬; 王奎
Original assignee: TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Current assignee: TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Priority date: 2013-03-27
Filing date: 2013-03-27
Publication date: 2016-04-20
Anticipated expiration: 2033-03-27
Also published as: CN103136378A

Abstract

The invention discloses a data recovery method based on a structural summary, comprising: parsing the XML document tree data, computing the data information corresponding to the leaf nodes of the XML document, and storing and indexing that leaf node data information; parsing an XPath query statement into a Twig pattern tree, decomposing the pattern tree into simple paths, and marking the set of pattern nodes of interest; recording the sequence sets of pattern nodes of interest and merging pattern node sequences; and, according to the recorded sequence sets, recovering the set of data nodes that match them. The invention uses the structural summary information of the XML data together with a path-based index, and recovers the required data during index matching and selection. It improves on existing methods in two respects, the scope of data recovery and the timing of data recovery, so that the required data can be recovered correctly, efficiently and accurately, avoiding the inaccuracy and redundancy of existing methods.

Description

A data recovery method based on a structural summary

Technical field

The present invention relates to the field of XML database querying, and in particular to a data recovery method based on a structural summary in an XML database.

Background art

XML (Extensible Markup Language) is a semi-structured data format. With the rapid development of Internet technology, XML has become the de facto standard for data representation and exchange in Web-based applications. Its distinguishing features are that it is self-describing, semi-structured, and hierarchically nested.

XQuery and XPath are the XML query languages standardized by the W3C. XQuery is a Turing-complete programming language that can express rich processing logic over XML data. XPath can be regarded as a subset of XQuery; it extracts matching data from XML data by means of path patterns.

Twig pattern matching is a technique for matching and selecting semi-structured data by means of tree patterns. It is regarded as the core operation of XQuery/XPath and is commonly used in implementing XQuery/XPath engines. Its input is a pattern tree to be matched, together with the data node streams corresponding to the pattern nodes of that tree; its output is the tuples of data nodes that satisfy the pattern. To speed up Twig pattern matching algorithms, the prior art provides support in two areas: storage and indexing.

First, storage. In an XML database, node-based storage provides finer-grained control over the XML data. An XML document can be modeled as a tree whose nodes are either leaf nodes or intermediate nodes. Because the XML data is actually held in the leaf nodes, and intermediate nodes can be regarded as logical structure, a storage scheme that stores only leaf node data improves the space efficiency of storage and reduces data I/O.

Second, indexing. A path-based index is commonly used, with a simple path as the index key. A path is simple if it runs from the root to some node and none of the nodes on it involve complex logic such as predicates. A path-based index accelerates access to the data that matches a given simple path, and is typically used to supply the node data streams that a Twig pattern matching algorithm takes as input.
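The idea of a path-based index can be illustrated with a minimal Python sketch. The function name build_path_index, the root tag books and the sample values are assumptions made for illustration; the source does not specify how the index is laid out.

```python
from collections import defaultdict

def build_path_index(leaves):
    """Group leaf values by their root-to-leaf simple path (the index key)."""
    index = defaultdict(list)
    for path, value in leaves:
        index[path].append(value)
    return dict(index)

# Hypothetical leaf records: (simple path, text value), in document order.
leaves = [
    ("/books/book/name", "XML Basics"),
    ("/books/book/author/name/lastname", "Gao"),
    ("/books/book/name", "Query Processing"),
]

index = build_path_index(leaves)
# All leaves that match one simple path are fetched with a single lookup.
print(index["/books/book/name"])
```

Only leaf values appear in such an index, which is why intermediate nodes have to be recovered separately when a query refers to them.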

These storage and indexing techniques improve the performance of Twig pattern matching algorithms. However, queries over XML data, such as XPath queries, often need to match against or refer to intermediate nodes, whereas only the leaf nodes of the XML data are stored and indexed, and a path index likewise selects only the leaf nodes at the end of a path. Evaluating a query therefore requires some data recovery.

Current data recovery schemes usually assign each leaf node a prefix code when it is stored and, with the help of auxiliary data structures such as a structural summary, use this information during Twig pattern matching to recover all the data nodes between the leaf node and the root node. This approach guarantees correctness in choosing the recovery scope, but it introduces redundant data, particularly when some nodes in the data share the same name.

The timing of recovery is another important issue. The usual approach today is to recover nodes during Twig matching from the leaf nodes obtained from the path index, and then apply the corresponding Twig matching algorithm. Recovering at this point adds considerable complexity. To decide accurately whether a node on the path needs to be recovered, especially when nodes on the path share the same name, each node must be checked against the path structure. This check is needed for every recovered node, so it seriously degrades the efficiency of data recovery. Conversely, if intermediate nodes are not recovered precisely, redundant nodes are introduced and the amount of data the algorithm must process grows.

Summary of the invention

To solve the problems and defects described above, the invention provides a data recovery method based on a structural summary. When an XML database stores only leaf node information and queries are evaluated through a path index, the method reconstructs the intermediate node data that is required. The technical solution is as follows:

A data recovery method based on a structural summary, comprising:

parsing the XML document tree data, computing the data information corresponding to the leaf nodes of the XML document, and storing and indexing that leaf node data information;

parsing an XPath query statement into a Twig pattern tree, decomposing the pattern tree into simple paths, and marking the set of pattern nodes of interest;

recording the sequence sets of pattern nodes of interest, and merging pattern node sequences;

according to the recorded sequence sets of pattern nodes of interest, recovering the set of data nodes that match them.

The beneficial effects of the technical solution provided by the invention are:

The method uses the structural summary information of the XML data together with a path-based index, and recovers the required data during index matching and selection. It improves on existing methods in two respects, the scope of data recovery and the timing of data recovery, so that the required data can be recovered correctly, efficiently and accurately, avoiding the inaccuracy and redundancy of existing methods.

Brief description of the drawings

Fig. 1 is a flowchart of the data recovery method based on a structural summary;

Fig. 2 is an example of XML data;

Fig. 3 is an example of a tree pattern query;

Fig. 4 is an example of a structural summary;

Fig. 5 is an example of the index matching process;

Fig. 6 is an example of data recovery.

Detailed description

To make the objects, technical solutions and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings:

Referring to Fig. 1, the data recovery method based on a structural summary comprises the following steps:

Step 10: Parse the XML document tree data, compute the data information corresponding to the leaf nodes of the XML document, and store and index that leaf node data information.

After the XML document tree data has been parsed, each leaf node is encoded with a prefix code and the structural summary information corresponding to the leaf node is computed. The leaf node's data, its prefix code, and the ID of the corresponding node in the structural summary are then stored and indexed by path.
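Step 10 can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the function index_leaves, the sample document, the root tag books, and the choice to number structural summary nodes in first-visit order are all hypothetical. Prefix codes are written in dotted form, where the n-th child of a node with code c receives the code c.n.

```python
import xml.etree.ElementTree as ET

def index_leaves(xml_text):
    """Return (summary_ids, records) for an XML document.

    summary_ids maps each distinct label path to a structural-summary ID.
    records holds (prefix code, summary ID, text) for every leaf element.
    """
    root = ET.fromstring(xml_text)
    summary_ids = {}
    records = []

    def visit(elem, code, parent_path):
        path = parent_path + "/" + elem.tag
        # One summary node per distinct label path, numbered on first visit
        # (an assumed numbering scheme).
        sid = summary_ids.setdefault(path, len(summary_ids) + 1)
        children = list(elem)
        if not children:
            # Only leaves are stored; intermediate nodes are recovered later.
            records.append((code, sid, (elem.text or "").strip()))
        for pos, child in enumerate(children, start=1):
            visit(child, f"{code}.{pos}", path)

    visit(root, "1", "")
    return summary_ids, records

# Hypothetical sample shaped like the book data described for Fig. 2.
SAMPLE = """<books><book><name>T1</name>
<author><name><firstname>Lin</firstname><lastname>Gao</lastname></name>
<email>g@example.org</email></author></book></books>"""

summary_ids, records = index_leaves(SAMPLE)
for record in records:
    print(record)
```

With this sample and numbering, book, author, name and lastname receive the summary IDs 2, 4, 5 and 7, and the lastname leaf receives the prefix code 1.1.2.1.2, which is consistent with the values quoted in the description of Fig. 6.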

Step 20: Parse the XPath query statement into a Twig pattern tree, decompose the pattern tree into simple paths, and mark the set of pattern nodes of interest.

Each of these simple paths corresponds to one pattern node sequence. In general, a Twig pattern tree has several branches, and a simple path from the root to each leaf can be extracted from it. In a query method based on a path index, only some of the pattern nodes in the pattern tree are needed for pattern matching; these are the pattern nodes of interest for data recovery. Marking the nodes of interest makes processing more flexible and avoids unnecessary data recovery.

Pattern nodes of interest fall into four classes. In a Twig pattern tree, branch nodes and leaf nodes are generally needed for pattern matching; a branch node is a pattern node with more than one child. Two further classes must also be marked. One is the nodes for which position counting is required in the post-order Twig algorithm. The other is the return nodes: the one or more pattern nodes, together with their subtrees, that correspond to the query result returned to the user after pattern matching.
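The four classes can be expressed as a simple marking pass over a pattern tree. The sketch below is illustrative only: the PatternNode class, its field names and the mark_interesting function are assumptions, and the flags is_return and needs_position are supplied by hand rather than derived from a query parser.

```python
from dataclasses import dataclass, field

@dataclass
class PatternNode:
    label: str
    children: list = field(default_factory=list)
    is_return: bool = False       # result node returned to the user
    needs_position: bool = False  # needs position counting during matching

def mark_interesting(root):
    """Return labels of nodes in any of the four classes, in pre-order."""
    marked = []
    stack = [root]
    while stack:
        node = stack.pop()
        is_branch = len(node.children) > 1
        is_leaf = not node.children
        if is_branch or is_leaf or node.is_return or node.needs_position:
            marked.append(node.label)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return marked

# A pattern tree with one branch node (book), two leaves, and a return node.
lastname = PatternNode("book/author/name/lastname")
author_name = PatternNode("book/author/name", [lastname])
author = PatternNode("book/author", [author_name])
title = PatternNode("book/name", is_return=True)
book = PatternNode("book", [title, author])

marked = mark_interesting(book)
print(marked)
```

In this tree the author node and the author's name node have a single child each and carry no flag, so they are not marked and need not be recovered unless full recovery is requested.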

The tree in Fig. 2 shows the structure of an example XML data set: a collection of book information containing three books. Each book has a title and an author; the author information consists of a name and an e-mail address; and the author's name consists of firstname and lastname. For simplicity, irrelevant data content is omitted from the figure.

Suppose that, for the data in Fig. 2, we want the titles of the books written by authors whose surname is Gao. The query can be written as //book[author/name/lastname="Gao"]/name. This is a simple XPath query statement and can be transformed into the tree pattern query shown in Fig. 3. A query based on a path index first splits this pattern into two simple paths, 1) //book/name and 2) //book/author/name/lastname="Gao", and marks the pattern nodes of interest.
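The decomposition into simple paths amounts to enumerating the root-to-leaf paths of the pattern tree. A minimal Python sketch follows; the nested-tuple representation and the function simple_paths are assumptions, the value predicate on lastname is omitted, and every edge below the root is treated as a child axis for simplicity.

```python
def simple_paths(node, prefix=()):
    """Yield one XPath-like string per root-to-leaf path of a pattern tree.

    A node is a (label, children) pair; children is a list of such pairs.
    """
    label, children = node
    path = prefix + (label,)
    if not children:
        yield "//" + "/".join(path)
    for child in children:
        yield from simple_paths(child, path)

# Pattern tree for the example query; the predicate value is left out.
twig = ("book", [
    ("name", []),
    ("author", [("name", [("lastname", [])])]),
])

paths = list(simple_paths(twig))
print(paths)
```

Each yielded path corresponds to one pattern node sequence and can be used as a key into the path-based index.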

Step 30: Record the sequence sets of pattern nodes of interest and merge pattern node sequences.

This step matches against the structural summary by navigational path matching to find the target nodes and, in the process, records the sets of matched node sequences of interest. For example, the structural summary in Fig. 4 takes the form of a DataGuide, and each node is given an ID (the number in parentheses in the figure). For the two paths obtained in the example of step 20, a path-based index can supply the name and lastname data streams. Selecting the index amounts to navigating the simple path through the structural summary shown in Fig. 4. For path 2), as shown in Fig. 5, matching proceeds as follows: //book first yields the set of node sets {{2}:s}; from each node in that set the next step yields {{4}:2}; from that, {{5}:4}; and finally {{7}:5}, where node 7 is the required lastname node. In these sets, the value after the colon indicates the node with reference to which the set was evaluated: node 7, for instance, was evaluated with reference to node 5, while the first node set was evaluated with reference to the start position, marked s. In this example each set of node sets has only one element, and each node set in it also has only one element. In some cases, however, such a set may contain several elements, which gives Fig. 5 a tree-like structure. The data shown in Fig. 5 is also the key data that must be recorded. If only branch nodes are of interest, only {{2}:s} is recorded.

When recording these sets of node sets, note that the same target node may be reached through different matching processes; the pattern node sequences of those matching processes must then be merged (the merged nodes must also lie on the same path). In addition, different pattern nodes may lie on the same path in the structural summary; such node sequences must also be merged.
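The navigational matching of path 2) against the structural summary can be sketched as follows. The summary table is hypothetical: only the IDs 2, 4, 5 and 7 come from the description, while the IDs 1, 3, 6 and 8, the root label books and the function names are assumed for illustration. Merging of sequences is not covered by this sketch.

```python
# Hypothetical DataGuide: summary ID -> (label, parent ID).
SUMMARY = {
    1: ("books", None),
    2: ("book", 1),
    3: ("name", 2),
    4: ("author", 2),
    5: ("name", 4),
    6: ("firstname", 5),
    7: ("lastname", 5),
    8: ("email", 4),
}

def children_of(summary, parent, label):
    """IDs of summary nodes with the given label directly under parent."""
    return sorted(i for i, (lab, par) in summary.items()
                  if par == parent and lab == label)

def match_simple_path(summary, steps):
    """Match //steps[0]/steps[1]/... and record (node set, reference) per step."""
    # The first step is a descendant step evaluated from the start position 's'.
    first = sorted(i for i, (lab, _) in summary.items() if lab == steps[0])
    trace = [[(first, "s")]] if first else [[]]
    for label in steps[1:]:
        level = []
        for nodes, _ref in trace[-1]:
            for ref in nodes:
                hit = children_of(summary, ref, label)
                if hit:
                    # Remember which node this set was evaluated from.
                    level.append((hit, ref))
        trace.append(level)
    return trace

trace = match_simple_path(SUMMARY, ["book", "author", "name", "lastname"])
for level in trace:
    print(level)
```

Each level of the returned trace corresponds to one of the sets {{2}:s}, {{4}:2}, {{5}:4} and {{7}:5} in the text, with the node set first and its reference node second.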

Step 40: According to the recorded sequence sets of pattern nodes of interest, recover the set of data nodes that match them.

The process of determining the index path described above accurately reflects how each node was matched, and can be used to determine which nodes a given node passes through to reach the root node. This determination is unique for each node, so any required data can be recovered accurately.

If the pattern nodes of interest are all the nodes on the pattern path, everything is recovered. If the pattern nodes of interest are the branch nodes, only the data nodes corresponding to them are recovered: for the Twig pattern in Fig. 3 and the summary information in Fig. 4, only the node with ID=2 needs to be recovered (this relationship was recorded by the matching process above). Fig. 6 shows an example of recovering the data of a full path. Suppose the leaf nodes use prefix codes, and the leftmost lastname node in Fig. 2 has the code 1.1.2.1.2; its ID in the structural summary is known to be 7. By the nature of prefix codes, the codes of the data nodes on the path from this node to the root are, in order, 1.1.2.1.2, 1.1.2.1, 1.1.2, 1.1 and 1. The matching process in Fig. 5 shows how this node was matched: if the nodes of interest are all the nodes on the pattern path, the sequence of node IDs (IDs in the structural summary) in that process is 2, 4, 5, 7. From the node IDs in the structural summary and the prefix codes, the intermediate nodes of the whole path can therefore be recovered. Note that this example illustrates full recovery in order to demonstrate the feasibility and correctness of recovery; the recovery strategy of the invention is not limited to this, and it is also possible to recover only the data nodes that match the four classes of pattern nodes of interest.
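The recovery in this example can be sketched in Python as below. The function names are hypothetical, and the alignment rule is a simplifying assumption: consecutive matched summary IDs are taken to be exactly one level apart, which holds for the child steps of the example path but would need the summary's parent links if a descendant step occurred in the middle of a path.

```python
def ancestors_from_prefix(code):
    """Prefix codes from the node itself up to the root, e.g. 1.1.2 -> 1.1 -> 1."""
    parts = code.split(".")
    return [".".join(parts[:n]) for n in range(len(parts), 0, -1)]

def recover(code, matched_ids, interesting=None):
    """Pair recovered prefix codes with summary IDs, deepest node first.

    matched_ids: summary IDs recorded during matching, top-down.
    interesting: optional set of summary IDs to keep; None means full recovery.
    """
    codes = ancestors_from_prefix(code)
    # Align the leaf's own code with the last matched ID and walk upward;
    # ancestors above the first matched node (here the root) are dropped.
    pairs = list(zip(codes, reversed(matched_ids)))
    if interesting is not None:
        pairs = [(c, i) for c, i in pairs if i in interesting]
    return pairs

full = recover("1.1.2.1.2", [2, 4, 5, 7])
branch_only = recover("1.1.2.1.2", [2, 4, 5, 7], interesting={2})
print(full)
print(branch_only)
```

Full recovery pairs the codes 1.1.2.1.2, 1.1.2.1, 1.1.2 and 1.1 with the summary IDs 7, 5, 4 and 2, while restricting the nodes of interest to the branch node returns only the data node for ID 2.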

The above is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A data recovery method based on a structural summary, characterized in that the method comprises:

recording the sequence sets of pattern nodes of interest and merging pattern node sequences; matching against the structural summary by navigational path matching to find the target nodes and, in the process, recording the sets of matched node sequences of interest;

recovering, according to the recorded sequence sets of pattern nodes of interest, the set of data nodes that match them; the process of determining the index path accurately reflects how each node was matched and can be used to determine which nodes a given node passes through to reach the root node; this determination is unique for each node, so any required data can be recovered accurately; if the pattern nodes of interest are all the nodes on the pattern path, everything is recovered; if the pattern nodes of interest are the branch nodes, only the data nodes corresponding to them are recovered;

each simple path corresponds to one pattern node sequence; the pattern nodes of interest comprise the branch nodes and leaf nodes of the Twig pattern tree, the nodes for which position counting is performed in the Twig algorithm, and the return nodes.

2. The data recovery method based on a structural summary according to claim 1, characterized in that the leaf node data information comprises prefix code information and the ID of the corresponding node in the structural summary.

3. The data recovery method based on a structural summary according to claim 1, characterized in that each simple path corresponds to one pattern node sequence.

4. The data recovery method based on a structural summary according to claim 1, characterized in that the pattern nodes of interest comprise the branch nodes and leaf nodes of the Twig pattern tree, the nodes for which position counting is performed in the Twig algorithm, and the return nodes.